Topic: Possible WolfMQTT/Wolf SSL Memory Leak

I have successfully gotten WolfMQTT and WolfSSL to communicate with Azure IoT Hub through a cell modem. For some time I have been working on making the software more robust: the cellular modem frequently loses its connection, so I have to terminate gracefully and restart. (The platform is an NRF52.)

My system transmits once a minute, and I have been leaving it running to see where the potential hang-ups are.

1.) The TCP connection goes down every few hours. I modified the MQTT state machine to go through the WMQ_DISCONNECT and WMQ_NET_DISCONNECT states and then restart at WMQ_BEGIN.

2.) It looks like MqttClient_NetDisconnect() calls functions to free the resources used by WolfSSL.

3.)   After about 24 hours I always get an error in the TLS setup.   Today it was:

MqttSocket_TlsConnect Error -1: Num -112, mp_exptmod error state


4.) Other times I never get through the TLS connection because it always returns a "CONTINUE". It takes about 24 hours (roughly 50 or 60 restarts) for this error to appear.

5.) Since it takes so long to occur, it is hard to capture statistics, but today I was able to attach a debugger. Interestingly, my IRQ routines are still running (diagnostic messages from the modem still reach a serial terminal).

In every case where there was an issue, it was locked up in the standard C library free() function. Unfortunately I did not have any more data from the call stack.

6.) I believe the only place where my code does malloc/free is in WolfSSL. WolfMQTT did have a malloc for its context struct, but I changed it to a static allocation.

Right now I am trying to make this problem happen more quickly, but I think something is going wrong when resources are freed. I am going to try to fix the state machine so that it keeps trying to disconnect and reconnect, to see if the issue still shows up.
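For anyone following along, here is a rough sketch of the reconnect flow described above, written as a pure transition function so it can be tested off-target. The state names WMQ_BEGIN, WMQ_DISCONNECT and WMQ_NET_DISCONNECT are the ones mentioned in this post and are redeclared locally; WMQ_RUNNING, DemoState, demo_next_state and the three boolean inputs are hypothetical names made up for illustration, not part of WolfMQTT.

```c
#include <stdbool.h>

/* Local redeclaration for the sketch. WMQ_RUNNING is hypothetical and
 * stands in for all of the connect/publish states. */
typedef enum {
    WMQ_BEGIN,
    WMQ_RUNNING,
    WMQ_DISCONNECT,
    WMQ_NET_DISCONNECT
} DemoState;

/* Returns the next state given the current state and three inputs that the
 * application would have to supply (all hypothetical):
 *   link_up        - the modem reports the TCP link as alive
 *   mqtt_disc_done - the MQTT-level disconnect has completed
 *   socket_closed  - the modem confirms the socket is really closed */
DemoState demo_next_state(DemoState s, bool link_up,
                          bool mqtt_disc_done, bool socket_closed)
{
    switch (s) {
    case WMQ_BEGIN:
        return WMQ_RUNNING;
    case WMQ_RUNNING:
        return link_up ? WMQ_RUNNING : WMQ_DISCONNECT;
    case WMQ_DISCONNECT:
        /* With the link already down the MQTT disconnect cannot be
         * delivered, so move on instead of waiting for it. */
        return (mqtt_disc_done || !link_up) ? WMQ_NET_DISCONNECT
                                            : WMQ_DISCONNECT;
    case WMQ_NET_DISCONNECT:
        /* Stay here until the socket is confirmed closed, then restart. */
        return socket_closed ? WMQ_BEGIN : WMQ_NET_DISCONNECT;
    }
    return WMQ_BEGIN; /* unknown state: restart from the beginning */
}
```

The design choice here is that WMQ_NET_DISCONNECT only restarts once the close is confirmed, so a restart never begins on a half-open socket.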

I am also going to look at using static allocation for WolfSSL (currently my stack is 65k and the heap is 72k; the WolfSSL test functions pass OK).
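Since statistics are hard to capture on a failure that takes a day to appear, one option is to count allocations so a leak shows up as a growing number after each reconnect cycle. Below is a minimal sketch of counting wrappers around the C library allocator. The names counting_malloc, counting_realloc, counting_free and outstanding_allocations are made up for this example. WolfSSL has wolfSSL_SetAllocators() for registering custom allocators, but the callback signatures depend on build options, so treat hooking these in as an assumption to check against your build.

```c
#include <stddef.h>
#include <stdlib.h>

static size_t g_allocs; /* successful allocations so far */
static size_t g_frees;  /* non-NULL frees so far */

void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        g_allocs++;
    return p;
}

void *counting_realloc(void *p, size_t size)
{
    void *q = realloc(p, size);
    /* realloc(NULL, n) behaves like malloc, so count it as a new block.
     * The size == 0 case is implementation-defined and not handled here. */
    if (p == NULL && q != NULL)
        g_allocs++;
    return q;
}

void counting_free(void *p)
{
    if (p != NULL) {
        g_frees++;
        free(p);
    }
}

/* Number of blocks currently allocated through the wrappers. If this keeps
 * growing from one reconnect cycle to the next, something is leaking. */
size_t outstanding_allocations(void)
{
    return g_allocs - g_frees;
}
```

Logging outstanding_allocations() each time the state machine returns to its start state would give one data point per cycle without needing a debugger attached.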

I am posting this to see if there is any info on "mp_exptmod error state".   I'll post updates as this might be useful to others.

Re: Possible WolfMQTT/Wolf SSL Memory Leak

A few things as an update

1.)   I am using the library with non-blocking IO enabled
2.)   I was actually using V1.1, not the latest V1.2


I set up a test that would just continually connect, publish one message and then disconnect. In this scenario I could never get the hardfault (which was the result of a bad malloc/free).

After some more investigation I found a few more things.

a.)  There were instances where my cell modem wasn't actually closing the TCP connection.   This was causing issues when the state machine would restart.

After fixing a.), the actual bug became easier to reproduce (I could get a hardfault during a malloc on almost every reconnect cycle).

I reviewed my IO calls in mqttnet.c and realized that my net_write routine would sometimes actually block until I got an acknowledgement from my modem. I re-implemented the logic to return MQTT_CODE_CONTINUE until I got my response.
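To illustrate the non-blocking approach, here is a rough sketch of a write routine that queues the data once and then reports "continue" until the modem acknowledges it. This is not my actual mqttnet.c code. DemoNetCtx, demo_net_write, demo_queue_send_stub and the SKETCH_ codes are hypothetical, the code values are placeholders (the real MQTT_CODE_CONTINUE and error codes come from the WolfMQTT headers), and it assumes the caller retries with the same buffer and length.

```c
#include <stdbool.h>

/* Placeholder return codes for the sketch only. */
enum {
    SKETCH_CODE_CONTINUE      = -101, /* stands in for MQTT_CODE_CONTINUE */
    SKETCH_CODE_ERROR_NETWORK = -8    /* stands in for a network error code */
};

typedef struct {
    bool          write_pending; /* data queued, acknowledgement not yet seen */
    int           pending_len;   /* bytes covered by the pending request */
    volatile bool ack_seen;      /* set by the (hypothetical) modem IRQ handler */
    bool (*queue_send)(const unsigned char *buf, int len); /* hypothetical driver hook */
} DemoNetCtx;

/* Stand-in for a real modem driver call: pretends the data was queued. */
bool demo_queue_send_stub(const unsigned char *buf, int len)
{
    (void)buf;
    (void)len;
    return true;
}

/* Never waits: returns the byte count once the modem has acknowledged the
 * data, SKETCH_CODE_CONTINUE while the acknowledgement is outstanding. */
int demo_net_write(void *context, const unsigned char *buf, int buf_len,
                   int timeout_ms)
{
    DemoNetCtx *net = (DemoNetCtx *)context;
    (void)timeout_ms; /* unused because this routine never blocks */

    if (!net->write_pending) {
        net->ack_seen = false; /* clear before queuing to avoid missing the IRQ */
        if (!net->queue_send(buf, buf_len))
            return SKETCH_CODE_ERROR_NETWORK;
        net->write_pending = true;
        net->pending_len = buf_len;
        return SKETCH_CODE_CONTINUE;
    }

    if (!net->ack_seen)
        return SKETCH_CODE_CONTINUE; /* still waiting, caller should retry */

    net->write_pending = false;
    return net->pending_len; /* report the acknowledged bytes as written */
}
```

The point of the write_pending flag is that repeated calls during the wait do not queue the same data with the modem a second time.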

That is when I found a memory leak. MqttSocket_TlsSocketSend would return WOLFSSL_CBIO_ERR_WANT_WRITE on an rc of 0 or MQTT_CODE_ERROR_TIMEOUT, but not on MQTT_CODE_CONTINUE. When I traced the call stack, it looked like the connect/hello packet for the TLS connection would malloc many times as I returned MQTT_CODE_CONTINUE from my net_write function. I could make this happen on demand if I forced my net_write to block, or made it non-blocking and used MQTT_CODE_CONTINUE.

I looked through the latest code on GitHub and saw that it handles MQTT_CODE_CONTINUE in MqttSocket_TlsSocketSend correctly.
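To make the difference concrete, here is a small sketch of the kind of return-code mapping I am describing. This is not the library's source. demo_map_send_rc and the SKETCH_ constants are made up for illustration and the numeric values are placeholders; the real constants come from the WolfMQTT and WolfSSL headers.

```c
/* Placeholder codes for the sketch only. */
enum {
    SKETCH_MQTT_CODE_ERROR_TIMEOUT = -7,   /* stands in for MQTT_CODE_ERROR_TIMEOUT */
    SKETCH_MQTT_CODE_CONTINUE      = -101, /* stands in for MQTT_CODE_CONTINUE */
    SKETCH_CBIO_ERR_WANT_WRITE     = -2,   /* stands in for WOLFSSL_CBIO_ERR_WANT_WRITE */
    SKETCH_CBIO_ERR_GENERAL        = -1    /* stands in for a general I/O error */
};

/* Maps the result of the network write to what the TLS send callback
 * reports back to the TLS layer. */
int demo_map_send_rc(int rc)
{
    /* "Nothing sent yet" in any of its forms means "try again later".
     * Per the description above, the CONTINUE case is the one the older
     * version did not treat this way. */
    if (rc == 0 ||
        rc == SKETCH_MQTT_CODE_ERROR_TIMEOUT ||
        rc == SKETCH_MQTT_CODE_CONTINUE) {
        return SKETCH_CBIO_ERR_WANT_WRITE;
    }
    if (rc < 0)
        return SKETCH_CBIO_ERR_GENERAL; /* any other failure is a hard error */
    return rc; /* positive value: number of bytes written */
}
```

With a mapping like this, a non-blocking net_write that returns "continue" is reported to the TLS layer as a retry rather than as a failure.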

I integrated V1.2 of WolfMQTT and started testing again. Hopefully this version fixes the issue.

I do still plan on looking at static allocation in WolfSSL as well.

Re: Possible WolfMQTT/Wolf SSL Memory Leak

Just an update: I have been running continuously for a couple of weeks now. I believe the upgrade to V1.2, which handles MQTT_CODE_CONTINUE in MqttSocket_TlsSocketSend correctly, fixed the problem.
