Fork and Exec in DMTCP

This is intended as one of a series of informal documents to describe and partially document some of the more subtle DMTCP data structures and algorithms. These documents are snapshots in time, and they may become somewhat out-of-date over time (and hopefully also refreshed to re-sync them with the code again).

Fork and exec create a new process, and change the program of the process, respectively. DMTCP must create or maintain the state of the original process/program associated with DMTCP. DMTCP creates wrappers around the fork and exec family of system calls for this reason.

A second reason for the wrappers is that the semantics of fork and exec specify that only the current thread survives (fork), or a single new thread is created (exec). In the case of fork, the thread calling fork (which is always a user thread) will return to the DMTCP wrapper, which will then re-initialize the MTCP library and other DMTCP state. In the case of exec, the exec wrapper places the DMTCP hijack library back in the path specified by the LD_PRELOAD environment variable, prior to exec. The new program then runs the dmtcpWorker in the DMTCP hijack library, which loads the MTCP library, initializes it and other state, and then removes the DMTCP hijack library from the LD_PRELOAD path. (The hijack library is removed again to be transparent to the user program.)

Further, the DMTCP coordinator has a concept of a DMTCP computation. A DMTCP computation exists so long as there are sockets connected to the coordinator. When the last socket disconnects (usually because all processes exited or died), the computation ends. If there is no current computation, and a new socket connection to the coordinator is created, that begins a new DMTCP computation. The coordinator maintains a unique DMTCP computation id for each DMTCP computation.

This document is divided into three parts:
  I. DMTCP actions during a fork
  II. DMTCP actions during an exec
  III. Actions of DMTCP Coordinator during a fork or exec

====
I. DMTCP actions during a fork

FILL IN

====
II. DMTCP actions during an exec

FILL IN

====
III. Actions of DMTCP Coordinator during a fork or exec

When the DMTCP coordinator broadcasts a ''kill'' command (initiated by a user ''kill'' request to the coordinator), the DMTCP checkpoint thread of each user process exits. Currently, the DMTCP checkpoint thread must cooperate with the user threads in exiting, since the unique dmtcpWorker object was created by the first thread, a user thread. This will be changed in the future to create the dmtcpWorker object as part of a placement array that will not be destroyed by the destructor when the original user thread exits. A related issue occurs when a user joinable thread exits. See thread-creation.txt for a discussion of this situation.

Another issue arises from a potential race when the coordinator broadcasts a ''kill'' command to each of the currently attached sockets while a user process is creating a new child process and an associated socket to the coordinator.

First, when a user process calls fork, the fork wrapper will proactively create a new socket and connect to the coordinator. It will also send a DMT_UPDATE_PROCESS_INFO_AFTER_FORK message to the coordinator. This is done by the parent, _before_ the child has been created. After the child is created, the parent will close its copy of the socket. If the parent dies before this can happen, the effect is the same, since the O/S then releases the parent's copy of the socket. Naturally, no checkpoint is allowed during this sensitive time while the coordinator is waiting to hear from the new child to be created.

This design was chosen instead of having the child process connect to the coordinator after it is created. That alternative design was rejected because it could lead to a race.
The ''kill'' command could happen after the child process has been created and before the child process has connected to the coordinator. In this case, the parent process would exit, and the coordinator would perceive a socket disconnect. The coordinator would then see a socket connection from the child process, and view this as an entirely new DMTCP computation.

Next, there is a further danger of a race if the coordinator sends a ''kill'' command that is received by the DMTCP checkpoint thread of the parent process while, at the same time, a user thread in the parent has already created a new socket connection to the coordinator and the parent process has gone on to create the child process. In this case, the child process holds a connection to the coordinator, and the child has never received a ''kill'' command through that socket from the coordinator.

To remedy this, the coordinator notes that a ''kill'' is in progress from the time that it receives a ''kill'' request from the user until the last process of the current DMTCP computation has disconnected its socket from the coordinator. If a new process connects during this window (and therefore sends a DMT_UPDATE_PROCESS_INFO_AFTER_FORK message to the coordinator), then the coordinator replies to DMT_UPDATE_PROCESS_INFO_AFTER_FORK with a ''kill'' command.

The above scheme works because the coordinator is single-threaded, and there are no asynchronous events. The coordinator is always executing an event loop. On detecting a connection, disconnection, etc., the top-level loop invokes methods for OnConnect, OnDisconnect, etc. Hence, when the coordinator sees a connection by a user with a ''kill'' request, it issues a ''kill'' command to all currently connected sockets before returning to the top level of the event loop.
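
The ordering argument can be made concrete with a small model. The following is only an illustrative sketch in Python, not DMTCP code: the class CoordinatorModel, its method names, and the string messages are all hypothetical, and a "connect" event here stands for the arrival of a new socket together with its DMT_UPDATE_PROCESS_INFO_AFTER_FORK message. It assumes that events are handled strictly one at a time, as described above.

```python
from collections import deque

KILL = "KILL"      # hypothetical stand-in for the coordinator's kill command
ACCEPT = "ACCEPT"  # hypothetical stand-in for a normal reply


class CoordinatorModel:
    """Toy model of a single-threaded coordinator loop (not DMTCP code)."""

    def __init__(self):
        self.events = deque()          # pending events, handled one at a time
        self.connected = set()         # names of currently connected sockets
        self.kill_in_progress = False  # set from kill request until last disconnect
        self.sent = []                 # (socket, message) pairs, kept for inspection

    def post(self, kind, sock):
        self.events.append((kind, sock))

    def run(self):
        # Top-level loop: each handler runs to completion before the next
        # event is examined, so a handler is never interrupted.
        while self.events:
            kind, sock = self.events.popleft()
            getattr(self, "on_" + kind)(sock)

    def on_kill_request(self, _user_sock):
        # Broadcast to every socket connected right now, then return to the loop.
        self.kill_in_progress = True
        for sock in sorted(self.connected):
            self.sent.append((sock, KILL))

    def on_connect(self, sock):
        # A socket that shows up while a kill is in progress is told to die too.
        self.connected.add(sock)
        self.sent.append((sock, KILL if self.kill_in_progress else ACCEPT))

    def on_disconnect(self, sock):
        self.connected.discard(sock)
        if not self.connected:
            # Last socket gone: the computation, and with it the kill, is over.
            self.kill_in_progress = False


# Race scenario: the child's socket reaches the coordinator after the kill request.
coord = CoordinatorModel()
coord.connected = {"parent"}
coord.post("kill_request", "user")
coord.post("connect", "child")
coord.post("disconnect", "parent")
coord.post("disconnect", "child")
coord.run()
print(coord.sent)
```

In this model the child's connection can only be noticed after the broadcast has finished, so it is answered with a kill rather than being mistaken for the start of a new computation.
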
It is only within the top-level loop that the coordinator can then recognize a new OnConnect event (from a child process in our scenario), along with the OnDisconnect events as each of the original processes of the DMTCP computation responds to the ''kill'' command.

There is one final issue. The DMTCP coordinator might see the child process with its DMT_UPDATE_PROCESS_INFO_AFTER_FORK message only after a new user process has connected to the coordinator with the intention of starting a new computation. In this case, the coordinator will have just seen the number of connected sockets drop to zero. So, the coordinator knows that the new user process represents a new computation. After this, the old child process connects with the DMT_UPDATE_PROCESS_INFO_AFTER_FORK message and its old computation id as a parameter. This allows the coordinator to recognize that this process is part of the old computation, and a ''kill'' message is sent back. Similarly, if the DMT_UPDATE_PROCESS_INFO_AFTER_FORK arrives at the coordinator when there is no current computation, then the coordinator will recognize this as an old process _independently_ of its computation id, and so a ''kill'' message will be returned.

Finally, we discuss the simpler case of an exec by the user process after a ''kill'' message is sent. First, we note that the ''kill'' message is a single 'int', and hence it is atomic. Either it is read by the checkpoint thread or it is not. The atomicity of the ''kill'' message is important. The checkpoint thread might otherwise read part of a ''kill'' message, only to be interrupted by a fork or exec. In the case of a fork, the checkpoint thread in the parent would continue to read the ''kill'' message uninterrupted, and the analysis proceeds as earlier. But in the case of exec, the atomicity is required.

Next, the ''kill'' message is atomic and is the first word of the message. Hence, in the case of exec, the ''kill'' message may or may not be consumed before the exec.
But the ''kill'' cannot be half consumed. If the checkpoint thread of the new program sees the ''kill'' message after the exec, the new program will exit. If the ''kill'' message was already consumed before the exec, then the user thread of the new program will create a new checkpoint thread. This new checkpoint thread will discover the old socket to the coordinator in a special "LIFEBOAT" preserved by DMTCP across exec. The new checkpoint thread will then recognize that it is in the middle of an exec (based on the "LIFEBOAT"), and send a DMT_UPDATE_PROCESS_INFO_AFTER_EXEC to the coordinator with the old computation id (also found in the "LIFEBOAT"). The rest of the analysis proceeds as before. If the computation id does not match the current computation of the coordinator, then the coordinator replies with a new ''kill'' message, and the checkpoint thread will cause the process to exit. This all happens before the ''main'' routine of the new program begins to run.

Note that the coordinator remains stateless in this analysis, except that it remembers the current computation id. However, if the computation ever dies, and DMTCP must restart, then the restarted user processes will inform the coordinator of their old computation id. The coordinator again remembers the computation id in this case, and the old computation can be resumed.
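
The coordinator's replies to DMT_UPDATE_PROCESS_INFO_AFTER_FORK and DMT_UPDATE_PROCESS_INFO_AFTER_EXEC, as described in this part, can be summarized as a single decision rule. The sketch below is a hypothetical Python model, not the coordinator's actual code: the function reply_to_update, the string-valued computation ids, and the KILL/ACCEPT constants are assumptions made only for illustration. The restart case, in which the coordinator adopts an old computation id, is deliberately left out.

```python
from typing import Optional

KILL = "KILL"      # hypothetical stand-in for the coordinator's kill message
ACCEPT = "ACCEPT"  # hypothetical stand-in for a normal reply


def reply_to_update(current_id: Optional[str], sender_id: str,
                    kill_in_progress: bool = False) -> str:
    """Decide the reply to an update message from a forked or exec'ed process.

    current_id is the id of the coordinator's current computation, or None
    if no computation exists. sender_id is the computation id carried by
    the message.
    """
    if current_id is None:
        # No current computation: the sender is an old process,
        # independently of the id it reports.
        return KILL
    if sender_id != current_id:
        # The sender is left over from an earlier computation.
        return KILL
    if kill_in_progress:
        # The sender belongs to a computation that is being killed.
        return KILL
    return ACCEPT


print(reply_to_update("comp-2", "comp-1"))                         # KILL
print(reply_to_update(None, "comp-1"))                             # KILL
print(reply_to_update("comp-1", "comp-1"))                         # ACCEPT
print(reply_to_update("comp-1", "comp-1", kill_in_progress=True))  # KILL
```

The only state this rule consults is the current computation id and whether a kill is in progress, which matches the observation that the coordinator is otherwise stateless.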