PID/TID virtualization in DMTCP

This is one of a series of informal documents that describe and
partially document some of the more subtle DMTCP data structures
and algorithms.  These documents are snapshots in time.  They may
become somewhat out-of-date, and will hopefully be refreshed from
time to time to re-sync them with the code.

This document is about pid and tid virtualization.  A closely related
document is thread-creation.txt.  The fundamental issue here is that
processes and threads receive a pid and tid when they are created.
But after checkpoint and restart, the kernel creates new versions with
new pids and tids, and the kernel knows them only by their current
pid/tid.  In a naive implementation, if the application code asks the
kernel for its pid/tid, it will see the current pid/tid.  But if the
application saved that value in a variable prior to checkpoint, and
re-uses it after restart, then the application will be using the
original pid/tid, which the kernel no longer associates with that
process or thread.

To get around this problem, DMTCP maintains a table that relates each
original pid/tid to its current pid/tid.  This is
'class dmtcp::VirtualPidTable' in virtualpidtable.cpp.  One can print
it via:
dmtcp::VirtualPidTable::instance().printPidMaps()
or in gdb as:
(gdb) p dmtcp::VirtualPidTable.printPidMaps(dmtcp::VirtualPidTable::instance())
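
As an illustration of the idea only, here is a minimal sketch of such a
table.  The class name PidTableSketch and all of its member names are
hypothetical.  They are not the actual dmtcp::VirtualPidTable interface,
and the real class tracks considerably more state.

```cpp
#include <cassert>
#include <map>
#include <sys/types.h>

// Hypothetical stand-in for a virtual pid table; names are illustrative.
class PidTableSketch {
public:
  // A newly created process or thread: original id == current id.
  void insert(pid_t original) {
    assert(original > 0);
    origToCurr_[original] = original;
  }

  // After restart, the kernel has assigned a new current id.
  void updateMapping(pid_t original, pid_t current) {
    assert(origToCurr_.count(original) == 1);
    origToCurr_[original] = current;
  }

  // For ids passed from the application toward the kernel.
  // Unknown ids are passed through unchanged.
  pid_t originalToCurrent(pid_t original) const {
    auto it = origToCurr_.find(original);
    return it == origToCurr_.end() ? original : it->second;
  }

  // For ids returned from the kernel toward the application.
  // Unknown ids are passed through unchanged.
  pid_t currentToOriginal(pid_t current) const {
    for (const auto &entry : origToCurr_) {
      if (entry.second == current) return entry.first;
    }
    return current;
  }

private:
  std::map<pid_t, pid_t> origToCurr_;  // original id -> current id
};
```

In this sketch, a wrapper would call originalToCurrent() on arguments
before entering the kernel, and currentToOriginal() on results before
returning to the application.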

This table allows most programs to work fine.  But we still have to
worry about id collisions.  If a new process or thread is created, and
the pid/tid that the kernel assigns to it equals an original pid/tid
already in the table, then we would have two processes or threads
with the same original id.

To avoid this situation, a DMTCP wrapper is placed around such
calls as fork() (execwrappers.cpp) and __clone() (pidwrappers.cpp).
We will describe the processing of __clone().  From this, it will be
easy to observe the same pattern (in a simpler form) for fork().

Currently, MTCP also places a wrapper around clone(), in addition to DMTCP.
MTCP does this to find out the current threads and tids of a process.
MTCP might also do that by inspecting /proc/self/task, although there
could be dangers of a race in that scheme.
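
For reference, the /proc/self/task alternative could look roughly like
the sketch below.  The function name listTaskTids is hypothetical and
this is not code from MTCP.  Note that the result is only a snapshot:
a thread may be created or may exit between the scan and the use of the
list, which is the race mentioned above.

```cpp
#include <cassert>
#include <cstdlib>
#include <dirent.h>
#include <sys/types.h>
#include <vector>

// Returns the tids found as numeric entries of taskDir.  Returns an
// empty list if the directory cannot be opened (for example, if /proc
// is not mounted).
std::vector<pid_t> listTaskTids(const char *taskDir = "/proc/self/task") {
  assert(taskDir != nullptr);
  std::vector<pid_t> tids;
  DIR *dir = opendir(taskDir);
  if (dir == nullptr) return tids;
  while (dirent *entry = readdir(dir)) {
    char *end = nullptr;
    long value = std::strtol(entry->d_name, &end, 10);
    // Skip "." and ".." and anything else that is not a pure number.
    if (end != entry->d_name && *end == '\0' && value > 0) {
      tids.push_back(static_cast<pid_t>(value));
    }
  }
  closedir(dir);
  return tids;
}
```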

So, when an application calls pthread_create(), libc will then
call clone() or __clone().  The file mtcpinterface.cpp intercepts this
through its clone() wrapper (in libdmtcp.so, which was preloaded).

If this is a new thread, __clone() will call mtcpFuncPtrs.clone(),
which is the MTCP definition of clone().  (If DMTCP is restarting,
then __clone() will instead call __real_clone(), which will call
libc.so:clone().)  Note that DMTCP does not pass the user's thread
start function to clone().  Instead, it passes a DMTCP function,
mtcpinterface.cpp:clone_start().  The function clone_start()
acts as a wrapper around the end user's thread start function.
In particular, clone_start() will process tid conflicts (see below)
and threads exiting, while still calling the user's original thread
start function.

In the case of a new thread, we now find ourselves inside mtcp.c:clone().
The function mtcp.c:__clone() then calls clone_entry, which is a pointer
to libc:__clone().  In that call, MTCP passes its own function,
mtcp.c:threadcloned(), as the thread start function.  That function
records the new thread for MTCP, calls the function that was originally
passed in to clone(), and then records the exiting of that thread for
MTCP.  We then return to MTCP:__clone(), which then returns to
DMTCP:__clone().  (However, since DMTCP has inserted its own
mtcpinterface.cpp:clone_start() as the argument to DMTCP:__clone(),
in fact, mtcp.c:threadcloned() will call mtcpinterface.cpp:clone_start(),
which will finally call the end user's thread start function.)
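
The nesting of start functions described above can be sketched as
follows.  This is only a model of the pattern: every name here
(WrappedStart, dmtcpCloneStart, mtcpThreadCloned, runWrapped) is
hypothetical, and the layers are run in the calling thread instead of
being handed to clone().

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef int (*StartFn)(void *);

// Hypothetical argument block: a start function plus its argument.
struct WrappedStart {
  StartFn fn;
  void *arg;
};

std::vector<std::string> callTrace;  // records the order of calls

// Stand-in for the end user's thread start function.
int userStart(void *arg) {
  callTrace.push_back("user");
  return *static_cast<int *>(arg);
}

// Stand-in for mtcpinterface.cpp:clone_start().
int dmtcpCloneStart(void *p) {
  WrappedStart *w = static_cast<WrappedStart *>(p);
  assert(w != nullptr && w->fn != nullptr);
  callTrace.push_back("dmtcp:enter");  // e.g. check for a tid conflict
  int ret = w->fn(w->arg);
  callTrace.push_back("dmtcp:exit");   // e.g. note that the thread exits
  return ret;
}

// Stand-in for mtcp.c:threadcloned().
int mtcpThreadCloned(void *p) {
  WrappedStart *w = static_cast<WrappedStart *>(p);
  assert(w != nullptr && w->fn != nullptr);
  callTrace.push_back("mtcp:enter");   // record the new thread
  int ret = w->fn(w->arg);
  callTrace.push_back("mtcp:exit");    // record the exit of the thread
  return ret;
}

// Builds both layers and runs them directly.  A real implementation
// would pass the outermost layer to clone() as the start function.
int runWrapped(StartFn user, void *userArg) {
  WrappedStart dmtcpLayer = {user, userArg};
  WrappedStart mtcpLayer = {dmtcpCloneStart, &dmtcpLayer};
  return mtcpThreadCloned(&mtcpLayer);
}
```

Each layer sees only a function pointer and an opaque argument, so a
layer does not need to know whether it wraps the user's function or
another wrapper.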

NOTE:  SOME OF THE LOGIC DESCRIBED IN THIS NEXT PARAGRAPH IS NOW OBSOLETE.
PLEASE SEE thread-creation.txt FOR A MORE UP-TO-DATE EXPLANATION.
In fact, the full story is a little more complicated.  If the target
program called pthread_create, then glibc:pthread_create will call
clone on its own wrapper function around the user's thread start function.
So, the full calling sequence for the newly created child thread
looks like this:
    glibc:clone -> MTCP:threadcloned -> DMTCP:clone_start
       -> glibc:start_thread -> user_thread_start_fnc
[ The child thread was created by the glibc call to the kernel clone syscall. ]
Note also that glibc:start_thread() calls syscall(SYS_exit, val) for the
thread to exit.  So, the last part of the wrappers MTCP:threadcloned
and DMTCP:clone_start (the code after the call to the wrapped function)
is never actually reached.  In the future, we may insert those wrappers
through pthread_create() instead of through clone() in order to
guarantee that the pthread_create wrapper function, glibc:start_thread,
is the outermost thread wrapper, as intended.  This improper wrapper
order has implications for detecting TID conflicts (see below).

If DMTCP:__clone() discovers that the tid of the newly created thread is
the same as the original tid of a different thread, then this is declared
a TID conflict.  (This can happen if the different thread was created
prior to the last checkpoint/restart and now has a new current tid.)
In the case of a TID conflict, DMTCP has called MTCP:clone() and is now
in the last part of DMTCP:__clone().  The newly created thread is
in DMTCP:clone_start().  Both the parent that called clone() and the
new thread in clone_start() are aware of the thread conflict.  This is because
the new thread can call isConflictingTid(gettid()), while the parent
discovers the child tid from the clone call and can also call
isConflictingTid() on it.

In this case of a TID conflict, the child does not call the user thread
function.  Instead it would ideally return, allowing the thread to exit.
At the same time, DMTCP:__clone() calls __clone() again (eventually reaching
libc.so) and receives a new thread with a new tid.  If the new tid is not
in conflict, this thread is accepted, and execution continues.
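
The retry logic can be sketched as below.  This is a simulation under
stated assumptions: FakeKernel is a hypothetical stand-in that hands out
tids from a fixed list, and isConflictingTid() here is a simplified
version that only consults a set of original tids.  Neither is the
DMTCP implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sys/types.h>
#include <vector>

// Hypothetical stand-in for the kernel: returns tids from a fixed list.
struct FakeKernel {
  std::vector<pid_t> nextTids;
  std::size_t used = 0;
  pid_t cloneThread() {
    assert(used < nextTids.size());
    return nextTids[used++];
  }
};

// Simplified check: a fresh tid conflicts if some other thread already
// uses it as its original tid.
bool isConflictingTid(const std::set<pid_t> &originalTids, pid_t tid) {
  return originalTids.count(tid) != 0;
}

// Keeps creating threads until the kernel returns a tid that does not
// collide with any original tid, and then registers it.
pid_t createThreadAvoidingConflicts(FakeKernel &kernel,
                                    std::set<pid_t> &originalTids) {
  for (;;) {
    pid_t tid = kernel.cloneThread();
    if (!isConflictingTid(originalTids, tid)) {
      originalTids.insert(tid);  // new thread: original == current
      return tid;
    }
    // Conflict: the child exits without running user code, and the
    // parent simply tries again.
  }
}
```

In the sketch, both sides can reach the same decision independently
because the check depends only on the tid and on the shared table.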

However, at the time of this writing, the ordering of the wrapper
functions is not correct (see above).  Normally, pthread_create defines
its own wrapper function, glibc:start_thread(), and calls clone()
on its own wrapper function.  The glibc:start_thread() function calls
"syscall(SYS_exit, val)" for the thread to exit before returning to clone().
This is needed to avoid the default behavior of clone().  If the thread
were to return to clone() (glibc:clone.S), clone() would call __GI_exit()
to kill the entire process.

Currently, we have:
  glibc:clone()->MTCP:threadcloned->DMTCP:clone_start->glibc:start_thread
In the case of a TID conflict, DMTCP never calls glibc:start_thread.
Hence, DMTCP must itself call "syscall(SYS_exit, val)" in order to avoid
eventually returning to clone(), which would cause the entire process
to exit.

For some further background, note that some of the "cleanup" operations
in glibc:create_thread() are (in glibc:pthread_create.c):
  __nptl_death_event(), __free_tcb(), and __deallocate_stack().

===
However, on restart, mtcp.c calls a special version that will execute an
initial function mtcp.c:restarthread.  Furthermore, MTCP:threadcloned()
will insert the current TID in the TCB (thread control block, also
referred to in mtcp.c as the TLS thread descriptor) for the restarted
thread.  The TCB is a glibc construct of type "struct user_desc".
Unfortunately, glibc does not export the offset of the TID in the TCB.
So, MTCP uses some heuristics to find it, and then (as a sanity check)
compares the result with the known TID offset for specific versions of
glibc.  Without the current TID in the TCB, glibc will not know the
current TID, and will use an incorrect TID in talking to the kernel.
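
To give a flavor of what such a heuristic might look like, here is a
sketch that scans a block of memory for a known tid.  It assumes, purely
for illustration, that the tid is stored as an aligned pid_t that is
immediately followed by the pid (a layout found in some older versions
of glibc).  The function name findTidOffset is hypothetical, and this is
not the heuristic actually used in mtcp.c.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/types.h>

// Scans len bytes starting at tcb for a pid_t equal to tid that is
// immediately followed by a pid_t equal to pid.  Returns the byte
// offset of the tid field, or -1 if no such pair is found.
long findTidOffset(const void *tcb, std::size_t len, pid_t tid, pid_t pid) {
  assert(tcb != nullptr);
  const unsigned char *base = static_cast<const unsigned char *>(tcb);
  for (std::size_t off = 0; off + 2 * sizeof(pid_t) <= len;
       off += sizeof(pid_t)) {
    pid_t first;
    pid_t second;
    // memcpy avoids aliasing and alignment assumptions about the buffer.
    std::memcpy(&first, base + off, sizeof first);
    std::memcpy(&second, base + off + sizeof(pid_t), sizeof second);
    if (first == tid && second == pid) return static_cast<long>(off);
  }
  return -1;
}
```

A sanity check in the spirit of the text would then compare the
returned offset with a table of offsets known for specific versions of
glibc, and refuse to continue on a mismatch.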

===
On restart, the tid stored in the TLS of the newly recreated thread must
be set to the current tid so that libc.so continues to work with the tid
that the kernel knows.  Our wrappers around libc.so will continue to
make translations between the original tid (potentially stored in the
user's application) and the current tid (stored in libc.so and the
kernel).  DMTCP guarantees that original tids are always unique, as
described above.

The DMTCP wrapper, gettid(), maintains a thread-local variable
dmtcp_thread_pid to save the original tid of each thread.
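
One way such a per-thread cache could work is sketched below.  The
functions fakeKernelGettid() and fakeCurrentToOriginal() are
hypothetical stand-ins for the real gettid system call and for the
table lookup, and the variable cachedOriginalTid only plays the role
of dmtcp_thread_pid.  The actual wrapper may differ.

```cpp
#include <cassert>
#include <sys/types.h>

int kernelCalls = 0;  // counts how often the fake kernel is asked

// Hypothetical stand-in for the gettid system call (current tid).
pid_t fakeKernelGettid() {
  ++kernelCalls;
  return 700;
}

// Hypothetical stand-in for the current-to-original table lookup.
pid_t fakeCurrentToOriginal(pid_t current) {
  return current == 700 ? 300 : current;
}

// One copy per thread; -1 means "not yet looked up".
thread_local pid_t cachedOriginalTid = -1;

// Returns the original tid of the calling thread.  Only the first call
// in each thread asks the kernel and consults the table.
pid_t gettidSketch() {
  if (cachedOriginalTid == -1) {
    pid_t current = fakeKernelGettid();
    cachedOriginalTid = fakeCurrentToOriginal(current);
    assert(cachedOriginalTid > 0);
  }
  return cachedOriginalTid;
}
```

Because the original tid of a thread never changes, a cache like this
stays valid across checkpoint and restart, even though the current tid
does not.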