old-cross-binutils/gdb/testsuite/gdb.threads
Pedro Alves 2c8c5d375e testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others)
The buildbots show that attach-many-short-lived-threads.exp is racy.
But after staring at debug logs and playing with SystemTap scripts for
a (long) while, I figured out that neither GDB, nor the kernel, nor
the test program itself is at fault.

The problem is simply that the testsuite machinery is currently
subject to PID-reuse races.  The attach-many-short-lived-threads.c
test program just happens to be much more susceptible to triggering
this race because threads and processes share the same number space on
Linux, and the test spawns many short-lived threads in quick
succession, thus enlarging the race window a lot.
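
To make the shared number space concrete, here is a minimal Python
sketch (not part of the testsuite, and Linux-specific).  It assumes
Python 3.8 or later for the native thread id accessors; the helper
name main_and_worker_ids is made up for illustration.

```python
import os
import threading


def main_and_worker_ids():
    """Return (pid, main thread TID, worker thread TID) for this process."""
    worker_tid = []

    def record_tid():
        # get_native_id() returns the kernel-assigned thread id (the LWP id).
        worker_tid.append(threading.get_native_id())

    t = threading.Thread(target=record_tid)
    t.start()
    t.join()
    return os.getpid(), threading.main_thread().native_id, worker_tid[0]


if __name__ == "__main__":
    pid, main_tid, tid = main_and_worker_ids()
    # On Linux the main thread's TID is the PID, and every other thread
    # takes a fresh number from the same allocator that hands out PIDs.
    print(pid, main_tid, tid)
```

Every short-lived thread therefore consumes one id from the pool that
PIDs come from, which is why a program that creates threads in a tight
loop cycles through that pool so quickly.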

Part of the problem is that several tests spawn processes with "exec&"
(in order to test the "attach" command), and then at the end of the
test, to make sure things are cleaned up, issue a 'remote_spawn "kill
-9 $testpid"'.  Since with Tcl's "exec&", Tcl itself is responsible
for reaping the process's exit status, by the time we go kill the
process, testpid may have already exited _and_ its status may have
(and often has) been reaped already.  Thus it can happen that another
process meanwhile reuses $testpid, and that "kill" command kills the
wrong process...  Frequently, that happens to be
attach-many-short-lived-threads, but this explains other tests' races
as well.
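
The hazard can be sketched outside of Tcl.  The Python below is only
an analogy, not what "exec&" does internally: it reaps a child eagerly,
the way a runtime that owns the reaping might, and then shows that the
saved PID no longer names one of our children.  The helper name
pid_after_eager_reap is hypothetical.

```python
import os


def pid_after_eager_reap():
    """Fork a child, reap it right away, then check whether we still own it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Eager reap.  From here on the kernel is free to hand `pid` to any
    # new process or thread.
    os.waitpid(pid, 0)

    try:
        os.waitpid(pid, os.WNOHANG)
        still_our_child = True
    except ChildProcessError:
        # ECHILD: the number is just a stale integer now.
        still_our_child = False

    # An os.kill(pid, signal.SIGKILL) at this point would hit whichever
    # process owns that number by then.  Deliberately not executed here.
    return pid, still_our_child


if __name__ == "__main__":
    print(pid_after_eager_reap())
```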

In the attach-many-short-lived-threads test, it sometimes manifests
like this:

 (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads
 Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done.
 (gdb)           Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb
 attach 5940
 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940
 warning: process 5940 is a zombie - the process has already terminated
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ptrace: Operation not permitted.
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach
 info threads
 No threads.
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads
 set breakpoint always-inserted on
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on

Other times the process dies while the test is ongoing (the process is
ptrace-stopped):

 (gdb) print again = 1
 Cannot access memory at address 0x6020cc
 (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior

(Recall that on Linux, SIGKILL is not interceptable)

And other times it dies just while we're detaching:

 $4 = 319
 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left
 detach
 Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process
 (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach

GDB mishandles the latter (it should ignore ESRCH while detaching just
like when continuing), but that's another story.

The fix here is to change spawn_wait_for_attach to use Expect's
'spawn' command instead of Tcl's 'exec&' to spawn programs, because
with spawn we control when to wait for/reap the process.  That allows
killing the process by PID without being subject to pid-reuse races,
because even if the process is already dead, the kernel won't reuse
the process's PID until the zombie is reaped.
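
As an illustration of why holding off the reap makes kill-by-PID safe,
here is a small Python sketch.  It is not the Expect code from this
patch; it assumes a Linux host where os.waitid is available, and the
helper name kill_unreaped_child is made up.

```python
import os
import signal


def kill_unreaped_child():
    """Signal an already-dead child by PID, and only then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: exit immediately with a recognizable status

    # Block until the child has exited, but leave it as a zombie
    # (WNOWAIT), so its PID stays reserved.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # The zombie still owns the PID, so this signal cannot land on an
    # unrelated process.  It has no effect on the zombie itself.
    os.kill(pid, signal.SIGKILL)

    # Only now release the PID for reuse.
    reaped_pid, status = os.waitpid(pid, 0)
    return pid, reaped_pid, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(kill_unreaped_child())
```

The ordering is the whole point: signal first, wait second.  Swapping
the two reintroduces the race described above.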

The other part of the problem lies in DejaGnu itself, unfortunately.
I have occasionally seen tests (attach-many-short-lived-threads
included, but not only that one) die with a random inexplicable
SIGTERM too, and that has the same root cause, except that in that
case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp:

    exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"
    ...
    catch "wait -i $shell_id"

Even if the program exits promptly, that whole cascade of kills
carries on in the background, thus potentially killing the poor
process that manages to reuse $pid...

I sent a fix for that to the DejaGnu list:
 http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html
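
The linked patch is not reproduced here.  Purely as an illustration of
the general idea, the Python sketch below escalates through the same
three signals but stops as soon as the child is known to have exited,
and never signals a PID it has already reaped.  It is not DejaGnu's
code; the helper names and the grace period are assumptions, and it
relies on os.waitid as found on Linux.

```python
import os
import signal
import time


def has_exited(pid):
    """True once the child has exited; WNOWAIT leaves it unreaped."""
    flags = os.WEXITED | os.WNOHANG | os.WNOWAIT
    return os.waitid(os.P_PID, pid, flags) is not None


def terminate_child(pid, grace=0.5):
    """Send SIGINT, SIGTERM, SIGKILL in turn, stopping once the child is gone."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        if has_exited(pid):
            break  # nothing left to signal: skip the rest of the cascade
        os.kill(pid, sig)  # safe: the PID is ours until we reap it below
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline and not has_exited(pid):
            time.sleep(0.01)
    _, status = os.waitpid(pid, 0)  # reap last, releasing the PID
    return status


def spawn_sleeper():
    """Fork a child that just sleeps, standing in for a test program."""
    pid = os.fork()
    if pid == 0:
        try:
            time.sleep(60)
        finally:
            os._exit(0)
    return pid


if __name__ == "__main__":
    status = terminate_child(spawn_sleeper(), grace=0.2)
    print(os.WIFSIGNALED(status) or os.WIFEXITED(status))
```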

With both patches in place, I haven't seen
attach-many-short-lived-threads.exp fail again.

Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver.

gdb/testsuite/ChangeLog:
2015-07-31  Pedro Alves  <palves@redhat.com>

	* gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id.
	Use spawn_id_get_pid.  Wait for spawn id after eof.  Use
	kill_wait_spawned_process instead of explicit "kill -9".
	* gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach
	returning a spawn id instead of a pid.  Use spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/attach-twice.exp: Likewise.
	* gdb.base/attach.exp: Likewise.
	(do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and
	gdb_test_multiple.
	* gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach
	returning a spawn id instead of a pid.  Use spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/valgrind-infcall.exp: Likewise.
	* gdb.multi/multi-attach.exp: Likewise.
	* gdb.python/py-prompt.exp: Likewise.
	* gdb.python/py-sync-interp.exp: Likewise.
	* gdb.server/ext-attach.exp: Likewise.
	* gdb.threads/attach-into-signal.exp (corefunc): Use
	spawn_wait_for_attach, spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.threads/attach-many-short-lived-threads.exp: Adjust to
	spawn_wait_for_attach returning a spawn id instead of a pid.  Use
	spawn_id_get_pid and kill_wait_spawned_process.
	* gdb.threads/attach-stopped.exp (corefunc): Use
	spawn_wait_for_attach, spawn_id_get_pid and
	kill_wait_spawned_process.
	* gdb.base/break-interp.exp: Rename $res to $test_spawn_id.
	Use spawn_id_get_pid.  Wait for spawn id after eof.  Use
	kill_wait_spawned_process instead of explicit "kill -9".
	* lib/gdb.exp (can_spawn_for_attach): Adjust comment.
	(kill_wait_spawned_process, spawn_id_get_pid): New procedures.
	(spawn_wait_for_attach): Use spawn instead of exec to spawn
	processes.  Don't map cygwin/windows pids here.  Now returns a
	spawn id list.
2015-07-31 20:06:24 +01:00
attach-into-signal.c
attach-into-signal.exp testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) 2015-07-31 20:06:24 +01:00
attach-many-short-lived-threads.c Improve gdb.threads/attach-many-short-lived-threads.exp timeout handling 2015-02-06 13:24:32 +01:00
attach-many-short-lived-threads.exp testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) 2015-07-31 20:06:24 +01:00
attach-stopped.c
attach-stopped.exp testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) 2015-07-31 20:06:24 +01:00
bp_in_thread.c
bp_in_thread.exp
break-while-running.c
break-while-running.exp
clone-new-thread-event.c
clone-new-thread-event.exp
clone-thread_db.c gdb.threads/clone-thread_db.c: Add missing includes and fix pthread_join call 2015-03-04 09:13:49 +00:00
clone-thread_db.exp PR18006: internal error if threaded program calls clone(CLONE_VM) 2015-02-20 19:00:21 +00:00
continue-pending-after-query.c Linux: make target_is_async_p return false when async is off 2015-01-23 11:12:39 +00:00
continue-pending-after-query.exp Linux: make target_is_async_p return false when async is off 2015-01-23 11:12:39 +00:00
continue-pending-status.c native/Linux: internal error if resume is short-circuited 2015-03-19 12:26:49 +00:00
continue-pending-status.exp gdbserver/Linux: unbreak thread event randomization 2015-03-19 12:38:05 +00:00
corethreads.c
corethreads.exp
create-fail.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
create-fail.exp
current-lwp-dead.c
current-lwp-dead.exp
dlopen-libpthread-lib.c
dlopen-libpthread.c
dlopen-libpthread.exp
execl.c
execl.exp
execl1.c
fork-child-threads.c
fork-child-threads.exp
fork-plus-threads.c PR threads/18600: Threads left stopped after fork+thread spawn 2015-07-30 18:50:29 +01:00
fork-plus-threads.exp remote follow fork and spurious child stops in non-stop mode 2015-07-30 18:52:53 +01:00
fork-thread-pending.c
fork-thread-pending.exp
gcore-stale-thread.c
gcore-stale-thread.exp
gcore-thread.exp
hand-call-in-threads.c
hand-call-in-threads.exp
hand-call-new-thread.c PR threads/18127 - threads spawned by infcall end up stuck in "running" state 2015-06-29 16:07:57 +01:00
hand-call-new-thread.exp PR threads/18127 - threads spawned by infcall end up stuck in "running" state 2015-06-29 16:07:57 +01:00
ia64-sigill.c gdb.threads/{siginfo-thread.c,watchthreads-reorder.c,ia64-sigill.c} races with GDB 2015-01-09 13:58:29 +00:00
ia64-sigill.exp
info-threads-cur-sal-2.c
info-threads-cur-sal.c
info-threads-cur-sal.exp
interrupted-hand-call.c
interrupted-hand-call.exp
kill.c
kill.exp
killed.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
killed.exp
leader-exit.c
leader-exit.exp
linux-dp.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
linux-dp.exp
local-watch-wrong-thread.c
local-watch-wrong-thread.exp
Makefile.in
manythreads.c
manythreads.exp gdb.threads/manythreads.exp: can't read "test": no such variable 2015-04-01 15:30:13 +01:00
multi-create-ns-info-thr.exp gdb.threads/multi-create-ns-info-thr.exp and native-extended-remote board 2015-02-21 12:03:23 +00:00
multi-create.c
multi-create.exp
multiple-step-overs.c Add test for PR18214 and PR18216 - multiple step-overs with queued signals 2015-04-08 19:59:03 +01:00
multiple-step-overs.exp gdb/18216: displaced step+deliver signal, a thread needs step-over, crash 2015-04-10 10:36:23 +01:00
next-bp-other-thread.c
next-bp-other-thread.exp
no-unwaited-for-left.c
no-unwaited-for-left.exp kfail two tests in no-unwaited-for-left.exp for remote target 2015-04-02 13:51:31 +01:00
non-ldr-exc-1.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
non-ldr-exc-1.exp
non-ldr-exc-2.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
non-ldr-exc-2.exp
non-ldr-exc-3.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
non-ldr-exc-3.exp
non-ldr-exc-4.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
non-ldr-exc-4.exp
non-ldr-exit.c PR gdb/18717: internal error if non-leader thread exits process 2015-07-24 17:49:17 +01:00
non-ldr-exit.exp PR gdb/18717: internal error if non-leader thread exits process 2015-07-24 17:49:17 +01:00
non-stop-fair-events.c Properly set alarm value in gdb.threads/non-stop-fair-events.exp 2015-04-07 11:30:07 +01:00
non-stop-fair-events.exp Properly set alarm value in gdb.threads/non-stop-fair-events.exp 2015-04-07 11:30:07 +01:00
pending-step.c
pending-step.exp
print-threads.c
print-threads.exp
pthread_cond_wait.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
pthread_cond_wait.exp
pthreads.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
pthreads.exp
queue-signal.c
queue-signal.exp
reconnect-signal.c
reconnect-signal.exp
schedlock.c
schedlock.exp Make "set scheduler-locking step" depend on user intention, only 2015-03-24 17:50:31 +00:00
siginfo-threads.c gdb.threads/{siginfo-thread.c,watchthreads-reorder.c,ia64-sigill.c} races with GDB 2015-01-09 13:58:29 +00:00
siginfo-threads.exp
signal-command-handle-nopass.c
signal-command-handle-nopass.exp
signal-command-multiple-signals-pending.c
signal-command-multiple-signals-pending.exp
signal-delivered-right-thread.c
signal-delivered-right-thread.exp
signal-sigtrap.c Add "signal SIGTRAP" test 2015-02-10 19:30:55 +00:00
signal-sigtrap.exp Add "signal SIGTRAP" test 2015-02-10 19:30:55 +00:00
signal-while-stepping-over-bp-other-thread.c
signal-while-stepping-over-bp-other-thread.exp Cleanup signal-while-stepping-over-bp-other-thread.exp 2015-04-10 19:49:00 +01:00
sigstep-threads.c
sigstep-threads.exp
sigthread.c
sigthread.exp
staticthreads.c
staticthreads.exp
step-bg-decr-pc-switch-thread.c Fix adjust_pc_after_break, remove still current thread check 2015-02-11 09:45:41 +00:00
step-bg-decr-pc-switch-thread.exp Fix adjust_pc_after_break, remove still current thread check 2015-02-11 09:45:41 +00:00
step-over-lands-on-breakpoint.c
step-over-lands-on-breakpoint.exp Test step-over-{lands-on-breakpoint|trips-on-watchpoint}.exp with displaced stepping 2015-04-10 13:31:59 +01:00
step-over-trips-on-watchpoint.c Make gdb.threads/step-over-trips-on-watchpoint.exp effective on !x86 2015-04-10 13:11:32 +01:00
step-over-trips-on-watchpoint.exp step-over-trips-on-watchpoint.exp: Don't put addresses in test messages 2015-04-10 19:23:24 +01:00
stepi-random-signal.c
stepi-random-signal.exp
switch-threads.c
switch-threads.exp
thread-execl.c
thread-execl.exp follow-exec: delete all non-execing threads 2015-03-03 01:25:17 +00:00
thread-find.exp
thread-specific-bp.c
thread-specific-bp.exp Fix gdb.threads/thread-specific-bp.exp race 2015-03-04 17:23:55 +00:00
thread-specific.c
thread-specific.exp
thread-unwindonsignal.exp
thread_check.c
thread_check.exp
thread_events.c
thread_events.exp
threadapply.c
threadapply.exp
threxit-hop-specific.c
threxit-hop-specific.exp
tid-reuse.c Crash on thread id wrap around 2015-04-01 13:38:06 +01:00
tid-reuse.exp Crash on thread id wrap around 2015-04-01 13:38:06 +01:00
tls-main.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
tls-nodebug.c
tls-nodebug.exp
tls-shared.c
tls-shared.exp
tls-var-main.c
tls-var.c
tls-var.exp
tls.c
tls.exp
tls2.c
watchpoint-fork-child.c
watchpoint-fork-mt.c Remove testsuite compile errors with GCC5. 2015-01-25 18:50:56 +01:00
watchpoint-fork-parent.c
watchpoint-fork-st.c
watchpoint-fork.exp
watchpoint-fork.h
watchthreads-reorder.c gdb.threads/{siginfo-thread.c,watchthreads-reorder.c,ia64-sigill.c} races with GDB 2015-01-09 13:58:29 +00:00
watchthreads-reorder.exp
watchthreads.c
watchthreads.exp
watchthreads2.c
watchthreads2.exp
wp-replication.c
wp-replication.exp