old-cross-binutils/gdb/testsuite/gdb.multi/multi-attach.exp

60 lines
2.1 KiB
Text
Raw Normal View History

PR server/16255: gdbserver cannot attach to a second inferior that is multi-threaded. On Linux, we need to explicitly ptrace attach to all lwps of a process. Because GDB might not be connected yet when an attach is requested, and thus it may not be possible to activate thread_db, as that requires access to symbols (IOW, gdbserver --attach), a while ago we make linux_attach loop over the lwps as listed by /proc/PID/task to find the lwps to attach to. linux_attach_lwp_1 has: ... if (initial) /* If lwp is the tgid, we handle adding existing threads later. Otherwise we just add lwp without bothering about any other threads. */ ptid = ptid_build (lwpid, lwpid, 0); else { /* Note that extracting the pid from the current inferior is safe, since we're always called in the context of the same process as this new thread. */ int pid = pid_of (current_inferior); ptid = ptid_build (pid, lwpid, 0); } That "safe" comment referred to linux_attach_lwp being called by thread-db.c. But this was clearly missed when a new call to linux_attach_lwp_1 was added to linux_attach. As a result, current_inferior will be set to some random process, and non-initial lwps of the second inferior get assigned the pid of the wrong inferior. E.g., in the case of attaching to two inferiors, for the second inferior (and so on), non-initial lwps of the second inferior get assigned the pid of the first inferior. This doesn't trigger on the first inferior, when current_inferior is NULL, add_thread switches the current inferior to the newly added thread. Rather than making linux_attach switch current_inferior temporarily (thus avoiding further reliance on global state), or making linux_attach_lwp_1 get the tgid from /proc, which add extra syscalls, and will be wrong in case of the user having originally attached directly to a non-tgid lwp, and then that lwp spawning new clones (the ptid.pid field of further new clones should be the same as the original lwp's pid, which is not the tgid), we note that callers of linux_attach_lwp/linux_attach_lwp_1 always have the right pid handy already, so they can pass it down along with the lwpid. The only other reason for the "initial" parameter is to error out instead of warn in case of attach failure, when we're first attaching to a process. There are only three callers of linux_attach_lwp/linux_attach_lwp_1, and each wants to print a different warn/error string, so we can just move the error/warn out of linux_attach_lwp_1 to the callers, thus getting rid of the "initial" parameter. There really nothing gdbserver-specific about attaching to two threaded processes, so this adds a new test under gdb.multi/. The test passes cleanly against the native GNU/Linux target, but fails/triggers the bug against GDBserver (before the patch), with the native-extended-remote board (as plain remote doesn't support multi-process). Tested on x86_64 Fedora 17, with the native-extended-gdbserver board. gdb/gdbserver/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * linux-low.c (linux_attach_fail_reason_string): New function. (linux_attach_lwp): Delete. (linux_attach_lwp_1): Rename to ... (linux_attach_lwp): ... this. Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach): Adjust to call linux_attach_lwp. Call error on failure to attach to the tgid. Call warning when failing to attach to an lwp. * linux-low.h (linux_attach_lwp): Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach_fail_reason_string): New declaration. * thread-db.c (attach_thread): Adjust to linux_attach_lwp's interface change. Use linux_attach_fail_reason_string. gdb/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * common/linux-ptrace.c (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. Remove "warning: " and newline from built string. * common/linux-ptrace.h (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. * linux-nat.c (linux_nat_attach): Adjust to use linux_ptrace_attach_fail_reason. gdb/testsuite/ 2014-04-25 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR server/16255 * gdb.multi/multi-attach.c: New file. * gdb.multi/multi-attach.exp: New file.
2014-04-25 18:07:33 +00:00
# This testcase is part of GDB, the GNU debugger.
# Copyright 2007-2016 Free Software Foundation, Inc.
PR server/16255: gdbserver cannot attach to a second inferior that is multi-threaded. On Linux, we need to explicitly ptrace attach to all lwps of a process. Because GDB might not be connected yet when an attach is requested, and thus it may not be possible to activate thread_db, as that requires access to symbols (IOW, gdbserver --attach), a while ago we make linux_attach loop over the lwps as listed by /proc/PID/task to find the lwps to attach to. linux_attach_lwp_1 has: ... if (initial) /* If lwp is the tgid, we handle adding existing threads later. Otherwise we just add lwp without bothering about any other threads. */ ptid = ptid_build (lwpid, lwpid, 0); else { /* Note that extracting the pid from the current inferior is safe, since we're always called in the context of the same process as this new thread. */ int pid = pid_of (current_inferior); ptid = ptid_build (pid, lwpid, 0); } That "safe" comment referred to linux_attach_lwp being called by thread-db.c. But this was clearly missed when a new call to linux_attach_lwp_1 was added to linux_attach. As a result, current_inferior will be set to some random process, and non-initial lwps of the second inferior get assigned the pid of the wrong inferior. E.g., in the case of attaching to two inferiors, for the second inferior (and so on), non-initial lwps of the second inferior get assigned the pid of the first inferior. This doesn't trigger on the first inferior, when current_inferior is NULL, add_thread switches the current inferior to the newly added thread. Rather than making linux_attach switch current_inferior temporarily (thus avoiding further reliance on global state), or making linux_attach_lwp_1 get the tgid from /proc, which add extra syscalls, and will be wrong in case of the user having originally attached directly to a non-tgid lwp, and then that lwp spawning new clones (the ptid.pid field of further new clones should be the same as the original lwp's pid, which is not the tgid), we note that callers of linux_attach_lwp/linux_attach_lwp_1 always have the right pid handy already, so they can pass it down along with the lwpid. The only other reason for the "initial" parameter is to error out instead of warn in case of attach failure, when we're first attaching to a process. There are only three callers of linux_attach_lwp/linux_attach_lwp_1, and each wants to print a different warn/error string, so we can just move the error/warn out of linux_attach_lwp_1 to the callers, thus getting rid of the "initial" parameter. There really nothing gdbserver-specific about attaching to two threaded processes, so this adds a new test under gdb.multi/. The test passes cleanly against the native GNU/Linux target, but fails/triggers the bug against GDBserver (before the patch), with the native-extended-remote board (as plain remote doesn't support multi-process). Tested on x86_64 Fedora 17, with the native-extended-gdbserver board. gdb/gdbserver/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * linux-low.c (linux_attach_fail_reason_string): New function. (linux_attach_lwp): Delete. (linux_attach_lwp_1): Rename to ... (linux_attach_lwp): ... this. Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach): Adjust to call linux_attach_lwp. Call error on failure to attach to the tgid. Call warning when failing to attach to an lwp. * linux-low.h (linux_attach_lwp): Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach_fail_reason_string): New declaration. * thread-db.c (attach_thread): Adjust to linux_attach_lwp's interface change. Use linux_attach_fail_reason_string. gdb/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * common/linux-ptrace.c (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. Remove "warning: " and newline from built string. * common/linux-ptrace.h (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. * linux-nat.c (linux_nat_attach): Adjust to use linux_ptrace_attach_fail_reason. gdb/testsuite/ 2014-04-25 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR server/16255 * gdb.multi/multi-attach.c: New file. * gdb.multi/multi-attach.exp: New file.
2014-04-25 18:07:33 +00:00
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# Test attaching to multiple threaded programs.
standard_testfile
if {![can_spawn_for_attach]} {
PR server/16255: gdbserver cannot attach to a second inferior that is multi-threaded. On Linux, we need to explicitly ptrace attach to all lwps of a process. Because GDB might not be connected yet when an attach is requested, and thus it may not be possible to activate thread_db, as that requires access to symbols (IOW, gdbserver --attach), a while ago we make linux_attach loop over the lwps as listed by /proc/PID/task to find the lwps to attach to. linux_attach_lwp_1 has: ... if (initial) /* If lwp is the tgid, we handle adding existing threads later. Otherwise we just add lwp without bothering about any other threads. */ ptid = ptid_build (lwpid, lwpid, 0); else { /* Note that extracting the pid from the current inferior is safe, since we're always called in the context of the same process as this new thread. */ int pid = pid_of (current_inferior); ptid = ptid_build (pid, lwpid, 0); } That "safe" comment referred to linux_attach_lwp being called by thread-db.c. But this was clearly missed when a new call to linux_attach_lwp_1 was added to linux_attach. As a result, current_inferior will be set to some random process, and non-initial lwps of the second inferior get assigned the pid of the wrong inferior. E.g., in the case of attaching to two inferiors, for the second inferior (and so on), non-initial lwps of the second inferior get assigned the pid of the first inferior. This doesn't trigger on the first inferior, when current_inferior is NULL, add_thread switches the current inferior to the newly added thread. Rather than making linux_attach switch current_inferior temporarily (thus avoiding further reliance on global state), or making linux_attach_lwp_1 get the tgid from /proc, which add extra syscalls, and will be wrong in case of the user having originally attached directly to a non-tgid lwp, and then that lwp spawning new clones (the ptid.pid field of further new clones should be the same as the original lwp's pid, which is not the tgid), we note that callers of linux_attach_lwp/linux_attach_lwp_1 always have the right pid handy already, so they can pass it down along with the lwpid. The only other reason for the "initial" parameter is to error out instead of warn in case of attach failure, when we're first attaching to a process. There are only three callers of linux_attach_lwp/linux_attach_lwp_1, and each wants to print a different warn/error string, so we can just move the error/warn out of linux_attach_lwp_1 to the callers, thus getting rid of the "initial" parameter. There really nothing gdbserver-specific about attaching to two threaded processes, so this adds a new test under gdb.multi/. The test passes cleanly against the native GNU/Linux target, but fails/triggers the bug against GDBserver (before the patch), with the native-extended-remote board (as plain remote doesn't support multi-process). Tested on x86_64 Fedora 17, with the native-extended-gdbserver board. gdb/gdbserver/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * linux-low.c (linux_attach_fail_reason_string): New function. (linux_attach_lwp): Delete. (linux_attach_lwp_1): Rename to ... (linux_attach_lwp): ... this. Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach): Adjust to call linux_attach_lwp. Call error on failure to attach to the tgid. Call warning when failing to attach to an lwp. * linux-low.h (linux_attach_lwp): Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach_fail_reason_string): New declaration. * thread-db.c (attach_thread): Adjust to linux_attach_lwp's interface change. Use linux_attach_fail_reason_string. gdb/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * common/linux-ptrace.c (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. Remove "warning: " and newline from built string. * common/linux-ptrace.h (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. * linux-nat.c (linux_nat_attach): Adjust to use linux_ptrace_attach_fail_reason. gdb/testsuite/ 2014-04-25 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR server/16255 * gdb.multi/multi-attach.c: New file. * gdb.multi/multi-attach.exp: New file.
2014-04-25 18:07:33 +00:00
return 0
}
if {[prepare_for_testing "failed to prepare" $testfile $srcfile {debug additional_flags=-lpthread}]} {
return -1
}
# Start the programs running and then wait for a bit, to be sure that
# they can be attached to.
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 19:06:24 +00:00
set spawn_id_list [spawn_wait_for_attach [list $binfile $binfile]]
set test_spawn_id1 [lindex $spawn_id_list 0]
set test_spawn_id2 [lindex $spawn_id_list 1]
set testpid1 [spawn_id_get_pid $test_spawn_id1]
set testpid2 [spawn_id_get_pid $test_spawn_id2]
PR server/16255: gdbserver cannot attach to a second inferior that is multi-threaded. On Linux, we need to explicitly ptrace attach to all lwps of a process. Because GDB might not be connected yet when an attach is requested, and thus it may not be possible to activate thread_db, as that requires access to symbols (IOW, gdbserver --attach), a while ago we make linux_attach loop over the lwps as listed by /proc/PID/task to find the lwps to attach to. linux_attach_lwp_1 has: ... if (initial) /* If lwp is the tgid, we handle adding existing threads later. Otherwise we just add lwp without bothering about any other threads. */ ptid = ptid_build (lwpid, lwpid, 0); else { /* Note that extracting the pid from the current inferior is safe, since we're always called in the context of the same process as this new thread. */ int pid = pid_of (current_inferior); ptid = ptid_build (pid, lwpid, 0); } That "safe" comment referred to linux_attach_lwp being called by thread-db.c. But this was clearly missed when a new call to linux_attach_lwp_1 was added to linux_attach. As a result, current_inferior will be set to some random process, and non-initial lwps of the second inferior get assigned the pid of the wrong inferior. E.g., in the case of attaching to two inferiors, for the second inferior (and so on), non-initial lwps of the second inferior get assigned the pid of the first inferior. This doesn't trigger on the first inferior, when current_inferior is NULL, add_thread switches the current inferior to the newly added thread. Rather than making linux_attach switch current_inferior temporarily (thus avoiding further reliance on global state), or making linux_attach_lwp_1 get the tgid from /proc, which add extra syscalls, and will be wrong in case of the user having originally attached directly to a non-tgid lwp, and then that lwp spawning new clones (the ptid.pid field of further new clones should be the same as the original lwp's pid, which is not the tgid), we note that callers of linux_attach_lwp/linux_attach_lwp_1 always have the right pid handy already, so they can pass it down along with the lwpid. The only other reason for the "initial" parameter is to error out instead of warn in case of attach failure, when we're first attaching to a process. There are only three callers of linux_attach_lwp/linux_attach_lwp_1, and each wants to print a different warn/error string, so we can just move the error/warn out of linux_attach_lwp_1 to the callers, thus getting rid of the "initial" parameter. There really nothing gdbserver-specific about attaching to two threaded processes, so this adds a new test under gdb.multi/. The test passes cleanly against the native GNU/Linux target, but fails/triggers the bug against GDBserver (before the patch), with the native-extended-remote board (as plain remote doesn't support multi-process). Tested on x86_64 Fedora 17, with the native-extended-gdbserver board. gdb/gdbserver/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * linux-low.c (linux_attach_fail_reason_string): New function. (linux_attach_lwp): Delete. (linux_attach_lwp_1): Rename to ... (linux_attach_lwp): ... this. Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach): Adjust to call linux_attach_lwp. Call error on failure to attach to the tgid. Call warning when failing to attach to an lwp. * linux-low.h (linux_attach_lwp): Take a ptid instead of a pid as argument. Remove "initial" parameter. Return int instead of void. Don't error or warn here. (linux_attach_fail_reason_string): New declaration. * thread-db.c (attach_thread): Adjust to linux_attach_lwp's interface change. Use linux_attach_fail_reason_string. gdb/ 2014-04-25 Pedro Alves <palves@redhat.com> PR server/16255 * common/linux-ptrace.c (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. Remove "warning: " and newline from built string. * common/linux-ptrace.h (linux_ptrace_attach_warnings): Rename to ... (linux_ptrace_attach_fail_reason): ... this. * linux-nat.c (linux_nat_attach): Adjust to use linux_ptrace_attach_fail_reason. gdb/testsuite/ 2014-04-25 Simon Marchi <simon.marchi@ericsson.com> Pedro Alves <palves@redhat.com> PR server/16255 * gdb.multi/multi-attach.c: New file. * gdb.multi/multi-attach.exp: New file.
2014-04-25 18:07:33 +00:00
gdb_test "attach $testpid1" \
"Attaching to program: .*, process $testpid1.*(in|at).*" \
"attach to program 1"
gdb_test "backtrace" ".*main.*" "backtrace 1"
gdb_test "add-inferior -exec $binfile" \
"Added inferior 2.*" \
"add second inferior"
gdb_test "inferior 2" ".*Switching to inferior 2.*" "switch to second inferior"
gdb_test "attach $testpid2" \
"Attaching to program: .*, process $testpid2.*(in|at).*" \
"attach to program 2"
gdb_test "backtrace" ".*main.*" "backtrace 2"
gdb_test "kill" "" "kill inferior 2" "Kill the program being debugged.*" "y"
gdb_test "inferior 1" ".*Switching to inferior 1.*"
gdb_test "kill" "" "kill inferior 1" "Kill the program being debugged.*" "y"
testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-thread.exp races and others) The buildbots show that attach-many-short-lived-thread.exp is racy. But after staring at debug logs and playing with SystemTap scripts for a (long) while, I figured out that neither GDB, nor the kernel nor the test's program itself are at fault. The problem is simply that the testsuite machinery is currently subject to PID-reuse races. The attach-many-short-lived-threads.c test program just happens to be much more susceptible to trigger this race because threads and processes share the same number space on Linux, and the test spawns many many short lived threads in succession, thus enlarging the race window a lot. Part of the problem is that several tests spawn processes with "exec&" (in order to test the "attach" command) , and then at the end of the test, to make sure things are cleaned up, issue a 'remote_spawn "kill -p $testpid"'. Since with tcl's "exec&", tcl itself is responsible for reaping the process's exit status, when we go kill the process, testpid may have already exited _and_ its status may have (and often has) been reaped already. Thus it can happen that another process meanwhile reuses $testpid, and that "kill" command kills the wrong process... Frequently, that happens to be attach-many-short-lived-thread, but this explains other test's races as well. In the attach-many-short-lived-threads test, it sometimes manifests like this: (gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done. (gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb attach 5940 Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940 warning: process 5940 is a zombie - the process has already terminated ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ptrace: Operation not permitted. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach info threads No threads. (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads set breakpoint always-inserted on (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on Other times the process dies while the test is ongoing (the process is ptrace-stopped): (gdb) print again = 1 Cannot access memory at address 0x6020cc (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior (Recall that on Linux, SIGKILL is not interceptable) And other times it dies just while we're detaching: $4 = 319 (gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left detach Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process (gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach GDB mishandles the latter (it should ignore ESRCH while detaching just like when continuing), but that's another story. The fix here is to change spawn_wait_for_attach to use Expect's 'spawn' command instead of Tcl's 'exec&' to spawn programs, because with spawn we control when to wait for/reap the process. That allows killing the process by PID without being subject to pid-reuse races, because even if the process is already dead, the kernel won't reuse the process's PID until the zombie is reaped. The other part of the problem lies in DejaGnu itself, unfortunately. I have occasionally seen tests (attach-many-short-lived-threads included, but not only that one) die with a random inexplicable SIGTERM too, and that too is caused by the same reason, except that in that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp: exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &" ... catch "wait -i $shell_id" Even if the program exits promptly, that whole cascade of kills carries on in the background, thus potentially killing the poor process that manages to reuse $pid... I sent a fix for that to the DejaGnu list: http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html With both patches in place, I haven't seen attach-many-short-lived-threads.exp fail again. Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver. gdb/testsuite/ChangeLog: 2015-07-31 Pedro Alves <palves@redhat.com> * gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/attach-twice.exp: Likewise. * gdb.base/attach.exp: Likewise. (do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and gdb_test_multiple. * gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/valgrind-infcall.exp: Likewise. * gdb.multi/multi-attach.exp: Likewise. * gdb.python/py-prompt.exp: Likewise. * gdb.python/py-sync-interp.exp: Likewise. * gdb.server/ext-attach.exp: Likewise. * gdb.threads/attach-into-signal.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-many-short-lived-threads.exp: Adjust to spawn_wait_for_attach returning a spawn id instead of a pid. Use spawn_id_get_pid and kill_wait_spawned_process. * gdb.threads/attach-stopped.exp (corefunc): Use spawn_wait_for_attach, spawn_id_get_pid and kill_wait_spawned_process. * gdb.base/break-interp.exp: Rename $res to $test_spawn_id. Use spawn_id_get_pid. Wait for spawn id after eof. Use kill_wait_spawned_process instead of explicit "kill -9". * lib/gdb.exp (can_spawn_for_attach): Adjust comment. (kill_wait_spawned_process, spawn_id_get_pid): New procedures. (spawn_wait_for_attach): Use spawn instead of exec to spawn processes. Don't map cygwin/windows pids here. Now returns a spawn id list.
2015-07-31 19:06:24 +00:00
kill_wait_spawned_process $test_spawn_id1
kill_wait_spawned_process $test_spawn_id2