We have an MFC-based server and client application that communicate over TCP/IP sockets. The server application is a dynamic calculation engine for a chemical process plant; it calculates continuously and responds to dynamic inputs from the client. The plant values (a structure) are sent once per second from a thread, and the client receives them in a thread running at the same 1-second interval. The total data sent through the sockets is around 2500 bytes per second. We use socket serialization through a class derived from CObject for sending and receiving data. The applications were originally developed for a LAN environment and run fine there.
When I run the applications over the Internet, using a static IP address for the server, the connection is established and the values are sent and received successfully. However, when the applications run for more than 2 hours, one of the following results is observed repeatedly.
1. The client application hangs and stops responding.
2. The server and client get disconnected but keep running independently without any data exchange.
3. The client application closes, possibly after a crash.
Note: The same applications run for more than 15 hours without any problems in a LAN setup.
Please help me to understand the underlying problem and provide me a solution for this problem.
This can't be answered or solved without seeing the related code and knowing about the used protocol and data flow.
However, here are some tips:
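With WAN communication, packets might get lost and be resent, and they might arrive in a different order. TCP itself retransmits and reorders them, so the application still sees an in-order byte stream, but the data of one send may arrive late or split across several reads. If your protocol needs its own ordering (for example to match data across a reconnect), it must provide a sequence field (which might be an existing time stamp with adequate resolution).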
The receiving threads must be event driven (your post might indicate that they are timer driven).
Check all function return codes and check for timeouts (to detect dropped connections). Doing so might solve the problem of the hanging client (possible scenario: the client waits for an event or state that never occurs).
When the code detects a problem, provide methods to drop the connection gracefully so that you can restart. This should be done on both sides, which requires the client to be able to send to the server (keep-alive messages and close requests).
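To illustrate what "check all function return codes" means for a receive call, here is a minimal sketch. It relies on the documented behaviour of recv() and CAsyncSocket::Receive(): a positive value is the number of bytes read, 0 means the peer closed the connection, and a negative value (SOCKET_ERROR) is an error, where WSAEWOULDBLOCK on a non-blocking socket only means "no data yet". The names RecvOutcome, classifyReceive and mustReconnect are hypothetical helpers, not part of MFC or Winsock.

```cpp
#include <cassert>

// Possible outcomes of one receive call on a non-blocking TCP socket.
enum class RecvOutcome { Data, NoDataYet, PeerClosed, Failed };

// bytes:      value returned by recv() / CAsyncSocket::Receive().
// wouldBlock: true when the reported error code was WSAEWOULDBLOCK
//             (only meaningful when bytes is negative).
RecvOutcome classifyReceive(int bytes, bool wouldBlock)
{
    if (bytes > 0)  return RecvOutcome::Data;        // bytes were read
    if (bytes == 0) return RecvOutcome::PeerClosed;  // orderly close by the peer
    return wouldBlock ? RecvOutcome::NoDataYet       // try again on the next event
                      : RecvOutcome::Failed;         // real error
}

// Only PeerClosed and Failed require closing the socket and reconnecting.
bool mustReconnect(RecvOutcome outcome)
{
    return outcome == RecvOutcome::PeerClosed || outcome == RecvOutcome::Failed;
}
```

In real code the wouldBlock flag would come from comparing GetLastError() (or WSAGetLastError()) with WSAEWOULDBLOCK right after the failed call. The same pattern applies to the send side.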
I have copied the server code and client code below. As of now, both the send function (server side) and the receive function (client side) are timer driven with a frequency of 1 second. Can you please explain in more detail the step you described as "Check all function return codes and check for timeouts (detect dropped connections)"? I was unable to understand it clearly.
In the server code below, the call "tFrm->UpdateOprs(&tData);" sends the data to all connected clients (in my case, only one client is connected). This call is made 5 times in each send cycle, and correspondingly the client makes 5 calls to receive data.
//Server Side Code
long int Offset;
Sending data timer driven is fine. But for receiving you should use a dedicated thread that handles the receiving and supports timeouts. Serialization might not be the best choice.
When not detecting timeouts and the connection is dropped, the read function will wait forever. Even after establishing a new connection, the pending wait will never finish because the new connection uses a different socket handle.
The typical flow of receiving in a worker thread (not using serialization):
// Wait for events using WaitForMultipleObjects() here
// Events: new data available and stop thread
// Timeout: According to requirements; e.g. 10 seconds
// Try to reconnect here or break when main thread reconnects
// Receive data so that number of package bytes are contained (when not fixed size)
// Receive complete package here if available
// Otherwise read available data into buffer and use local variables to continue receiving
// Process it here (deserialize)
// Send response here
// Process data here or pass them to main thread
if (CriticalError) // in any function called above
// Report error
// Close connection here or in main thread
// Use the return value to indicate the reason why the thread has terminated
// or pass the state to the main thread
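The "receive complete package, otherwise buffer what is available" step above can be sketched in portable C++. This is only an illustration under an assumed wire format: each package is preceded by a 4-byte big-endian length. That format and the names FrameAssembler, feed and next are hypothetical and not taken from the posted code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed wire format: 4-byte big-endian payload length, then the payload.
constexpr std::size_t kHeaderSize = 4;

class FrameAssembler {
public:
    // Append whatever the last receive call delivered
    // (may be part of a package or several packages at once).
    void feed(const std::uint8_t* data, std::size_t size) {
        buffer_.insert(buffer_.end(), data, data + size);
    }

    // Move the next complete package into 'out'.
    // Returns false when more bytes are needed.
    bool next(std::vector<std::uint8_t>& out) {
        if (buffer_.size() < kHeaderSize) return false;
        const std::uint32_t len = (std::uint32_t(buffer_[0]) << 24) |
                                  (std::uint32_t(buffer_[1]) << 16) |
                                  (std::uint32_t(buffer_[2]) << 8)  |
                                   std::uint32_t(buffer_[3]);
        if (buffer_.size() - kHeaderSize < len) return false;
        const auto first = buffer_.begin() + static_cast<std::ptrdiff_t>(kHeaderSize);
        const auto last  = first + static_cast<std::ptrdiff_t>(len);
        out.assign(first, last);
        buffer_.erase(buffer_.begin(), last);  // keep any bytes of the next package
        return true;
    }

private:
    std::vector<std::uint8_t> buffer_;
};
```

The receiving thread would call feed() after every successful read and then call next() in a loop until it returns false, processing each package it yields. A production version should also reject implausibly large length values.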
Just to mention that I am a novice to socket programming.
As per your suggestion, I changed the receive on the client side by removing the thread and adding the OnReceive() event function to receive data from the server. Whenever OnReceive() is called, I call Receive() to read the data from the server. I also removed the serialization from both the server and the client and now send the data as a plain packet.
I then ran the server and client and logged the receive time to a file from inside the client's OnReceive() function. It ran successfully for nearly 10 hours. After that no further times were logged, so I assume that the connection between the server and the client was lost and hence the OnReceive() event was no longer triggered. Please guide me on how to handle this and keep a continuous connection. Also, how can I detect that the connection is lost?
That is why all event waiting calls should have a timeout. If that occurs, the server should close the corresponding socket and go into listen state again. Similar for the client: close and try to connect again.
There is no direct and fast method to detect an interrupted connection. Interrupted in the sense of one side closing the connection without announcement, or even a network disconnect.
When using CAsyncSocket, you have to implement your own timer that is reset when data are received.
With non-blocking calls you usually wait for events with WaitFor(Single|Multiple)Object(s), which have a timeout parameter. I recommend this when using a worker thread for receiving because the call can also catch a terminate event to stop the thread when the connection is closed intentionally, such as when terminating the application.
Once a timeout is detected you close the related connections / sockets like when terminating the application. Afterwards enter the listen state again on the server and try to (re-)connect on the client like when starting the applications.
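As a rough sketch of the "timer that is reset when data are received" idea, here is a small watchdog in standard C++. The class name ReceiveWatchdog and its methods are hypothetical. It assumes that onDataReceived() and expired() are called from the same thread (for example from OnReceive() and from a WM_TIMER handler of the thread that owns the socket); if they run on different threads, access would need synchronization.

```cpp
#include <cassert>
#include <chrono>

class ReceiveWatchdog {
public:
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes

    explicit ReceiveWatchdog(Clock::duration timeout,
                             Clock::time_point now = Clock::now())
        : timeout_(timeout), lastData_(now) {}

    // Call whenever data has been received successfully.
    void onDataReceived(Clock::time_point now = Clock::now()) { lastData_ = now; }

    // True when no data has arrived for at least the configured timeout.
    bool expired(Clock::time_point now = Clock::now()) const {
        return now - lastData_ >= timeout_;
    }

private:
    Clock::duration   timeout_;
    Clock::time_point lastData_;
};
```

A periodic check (say once per second) would call expired(); when it returns true, close the socket and start the reconnect sequence, then create a fresh watchdog for the new connection. The optional 'now' parameters exist only so the logic can be tested without waiting.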
A complete example - even when knowing the used socket type - would be far too much for this forum. But there are a lot of tutorials in the net and here at CP about the topic.
I am using the CSocket implementation for the server and the client, with serialization through a class derived from CObject. As already mentioned, I use a thread on the server side to send and receive data continuously every second. Similarly, I use a thread on the client side to send and receive data continuously every second.
Can I implement the timer you suggested for CAsyncSocket in my code, which uses serialization and continuous send/receive operations? Please suggest how to do it, especially to handle lost connections (to detect connection failure and reconnect properly) and to avoid crashes.
In the client side:
I used OnReceive() and a timer variable which is reset to zero in that function. In a separate thread running once per second, I increment the timer variable by one. So the variable is incremented every second, but it is reset to zero in OnReceive(), because the server sends data every second. When the timer variable exceeds 30 counts, the timeout has elapsed. When that happens, I close the client socket using shutdown and delete the socket.
But when I tested the server and the client, the server hung after about 2 hours. Previously it was the client that hung or lost the connection.
Any suggestion why the server is hanging this time?
Just to be clear
1. Works in environment A
2. Doesn't work in environment B
Obviously then the environment, not your code, is where the problem originates.
As one possibility, there is a firewall rule that is disconnecting/dropping the connection after 2 hours. If that is the problem, then the solution is to either fix the rule or alter your application so that it recreates the connection more frequently than the rule disconnects. So say every hour, although I might go with less than that. That said, defensive programming would suggest that the connection could be lost, arbitrarily, for any number of reasons, so you should be attempting to restore the connection anyway.
FYI, you might want to verify "crash" versus exit. I worked with one system where it turned out there was some sort of monitor that was externally terminating the application after a certain amount of time. As a Windows client app, all threads should have a generic global catch which catches "system" exceptions and logs them. Normal requests to exit should be logged as well.
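For the "attempt to restore the connection" part, one common approach is to wait a little longer after each failed attempt so that a dead link is not hammered with connect calls. This capped exponential backoff is my suggestion, not something taken from the posted code; the function name reconnectDelay and the 1 second / 60 second defaults are arbitrary assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Delay to wait before the next connect attempt.
// failedAttempts: number of consecutive failed attempts so far.
// The delay doubles per failure, starting at 'base', and never exceeds 'cap'.
std::chrono::seconds reconnectDelay(unsigned failedAttempts,
                                    std::chrono::seconds base = std::chrono::seconds(1),
                                    std::chrono::seconds cap  = std::chrono::seconds(60))
{
    std::chrono::seconds delay = base;
    for (unsigned i = 0; i < failedAttempts && delay < cap; ++i)
        delay *= 2;  // loop stops once the cap is reached, so no overflow
    return std::min(delay, cap);
}
```

The client would reset its failure counter to zero after a successful connect and log every attempt, which also helps to see afterwards when and how often the connection was lost.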