# Copyright 2008-2016 Free Software Foundation, Inc.

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

# Test attaching to a program that is constantly spawning short-lived
# threads.  This stresses the edge cases of attaching to threads that
# have just been created or are in the process of dying.  In addition,
# the test attaches, debugs, detaches, and reattaches in a loop a few
# times, to stress the behavior of the debug API around detach (some
# systems end up leaving stale state behind that confuses the
# following attach).

# Return true if the running version of DejaGnu is known to not be
# able to run this test.
proc bad_dejagnu {} {
    global frame_version

    verbose -log "DejaGnu version: $frame_version"
    verbose -log "Expect version: [exp_version]"
    verbose -log "Tcl version: [info tclversion]"

    set dj_ver [split $frame_version .]
    set dj_ver_major [lindex $dj_ver 0]
    set dj_ver_minor [lindex $dj_ver 1]

    # DejaGnu versions prior to 1.6 manage to kill the wrong process
    # due to PID-reuse races.  Since this test spawns many threads, it
    # widens the race window a whole lot, enough that the inferior is
    # often killed, and thus the test randomly fails.  See:
    # http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00005.html
    # The fix added a close_wait_program procedure.  If that procedure
    # is defined, and DejaGnu is older than 1.6, assume that means the
    # fix was backported.
    if {$dj_ver_major == 1
        && ($dj_ver_minor < 6 && [info procs close_wait_program] == "")} {
        return 1
    }

    return 0
}

if {[bad_dejagnu]} {
    unsupported "broken DejaGnu"
    return 0
}

if {![can_spawn_for_attach]} {
    return 0
}

standard_testfile

# The test proper.  See description above.

proc test {} {
    global binfile
    global gdb_prompt
    global decimal

    clean_restart ${binfile}

testsuite: tcl exec& -> 'kill -9 $pid' is racy (attach-many-short-lived-threads.exp races and others)
The buildbots show that attach-many-short-lived-threads.exp is racy.
But after staring at debug logs and playing with SystemTap scripts for
a (long) while, I figured out that neither GDB, nor the kernel, nor the
test program itself is at fault.
The problem is simply that the testsuite machinery is currently
subject to PID-reuse races. The attach-many-short-lived-threads.c
test program just happens to be much more likely to trigger this
race because threads and processes share the same number space on
Linux, and the test spawns many short-lived threads in
succession, thus enlarging the race window a lot.
Part of the problem is that several tests spawn processes with "exec&"
(in order to test the "attach" command), and then at the end of the
test, to make sure things are cleaned up, issue a 'remote_spawn "kill
-9 $testpid"'. Since with Tcl's "exec&", Tcl itself is responsible
for reaping the process's exit status, when we go kill the process,
testpid may have already exited _and_ its status may have (and often
has) been reaped already. Thus it can happen that another process
meanwhile reuses $testpid, and that "kill" command kills the wrong
process... Frequently, that happens to be
attach-many-short-lived-threads, but this explains other tests' races
as well.
In the attach-many-short-lived-threads test, it sometimes manifests
like this:
(gdb) file /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads
Reading symbols from /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads...done.
(gdb) Loaded /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads into /home/pedro/gdb/mygit/build/gdb/testsuite/../../gdb/gdb
attach 5940
Attaching to program: /home/pedro/gdb/mygit/build/gdb/testsuite/gdb.threads/attach-many-short-lived-threads, process 5940
warning: process 5940 is a zombie - the process has already terminated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ptrace: Operation not permitted.
(gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: attach
info threads
No threads.
(gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: no new threads
set breakpoint always-inserted on
(gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 1: set breakpoint always-inserted on
Other times the process dies while the test is ongoing (the process is
ptrace-stopped):
(gdb) print again = 1
Cannot access memory at address 0x6020cc
(gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: reset timer in the inferior
(Recall that on Linux, SIGKILL is not interceptable)
And other times it dies just while we're detaching:
$4 = 319
(gdb) PASS: gdb.threads/attach-many-short-lived-threads.exp: iter 2: print seconds_left
detach
Can't detach Thread 0x7fb13b7de700 (LWP 1842): No such process
(gdb) FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 2: detach
GDB mishandles the latter (it should ignore ESRCH while detaching just
like when continuing), but that's another story.
The fix here is to change spawn_wait_for_attach to use Expect's
'spawn' command instead of Tcl's 'exec&' to spawn programs, because
with spawn we control when to wait for/reap the process. That allows
killing the process by PID without being subject to pid-reuse races,
because even if the process is already dead, the kernel won't reuse
the process's PID until the zombie is reaped.
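A minimal sketch of that spawn/kill/reap ordering, assuming Expect is
available as a Tcl package. The helper name spawn_kill_reap is
hypothetical and this is not the lib/gdb.exp implementation; it only
illustrates that the kill is issued before anything has reaped the
child:

```tcl
package require Expect

# Hypothetical helper, for illustration only.
proc spawn_kill_reap {argv} {
    # spawn returns the child's PID and sets spawn_id in this scope.
    set pid [spawn -noecho {*}$argv]
    set id $spawn_id

    # Nothing has waited for the child yet.  Even if it has already
    # exited, it remains a zombie, so the kernel cannot have handed
    # its PID to another process.
    catch {exec kill -9 $pid}

    # Only now close the channel and reap the exit status.
    catch {close -i $id}
    set result [wait -i $id]

    # The first element of wait's result is the PID that was reaped.
    return [list $pid [lindex $result 0]]
}

lassign [spawn_kill_reap {sleep 60}] spawned reaped
puts "spawned=$spawned reaped=$reaped"
```

With Tcl's "exec&" there is no equivalent point at which the caller
decides when the status is reaped, which is what opens the window for
PID reuse.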
The other part of the problem lies in DejaGnu itself, unfortunately.
I have occasionally seen tests (attach-many-short-lived-threads
included, but not only that one) die with a random inexplicable
SIGTERM too, and that has the same cause, except that in
that case, the rogue SIGTERM is sent from this bit in DejaGnu's remote.exp:
exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"
...
catch "wait -i $shell_id"
Even if the program exits promptly, that whole cascade of kills
carries on in the background, thus potentially killing the poor
process that manages to reuse $pid...
I sent a fix for that to the DejaGnu list:
http://lists.gnu.org/archive/html/dejagnu/2015-07/msg00000.html
With both patches in place, I haven't seen
attach-many-short-lived-threads.exp fail again.
Tested on x86_64 Fedora 20, native, gdbserver and extended-gdbserver.
gdb/testsuite/ChangeLog:
2015-07-31 Pedro Alves <palves@redhat.com>
* gdb.base/attach-pie-misread.exp: Rename $res to $test_spawn_id.
Use spawn_id_get_pid. Wait for spawn id after eof. Use
kill_wait_spawned_process instead of explicit "kill -9".
* gdb.base/attach-pie-noexec.exp: Adjust to spawn_wait_for_attach
returning a spawn id instead of a pid. Use spawn_id_get_pid and
kill_wait_spawned_process.
* gdb.base/attach-twice.exp: Likewise.
* gdb.base/attach.exp: Likewise.
(do_command_attach_tests): Use gdb_spawn_with_cmdline_opts and
gdb_test_multiple.
* gdb.base/solib-overlap.exp: Adjust to spawn_wait_for_attach
returning a spawn id instead of a pid. Use spawn_id_get_pid and
kill_wait_spawned_process.
* gdb.base/valgrind-infcall.exp: Likewise.
* gdb.multi/multi-attach.exp: Likewise.
* gdb.python/py-prompt.exp: Likewise.
* gdb.python/py-sync-interp.exp: Likewise.
* gdb.server/ext-attach.exp: Likewise.
* gdb.threads/attach-into-signal.exp (corefunc): Use
spawn_wait_for_attach, spawn_id_get_pid and
kill_wait_spawned_process.
* gdb.threads/attach-many-short-lived-threads.exp: Adjust to
spawn_wait_for_attach returning a spawn id instead of a pid. Use
spawn_id_get_pid and kill_wait_spawned_process.
* gdb.threads/attach-stopped.exp (corefunc): Use
spawn_wait_for_attach, spawn_id_get_pid and
kill_wait_spawned_process.
* gdb.base/break-interp.exp: Rename $res to $test_spawn_id.
Use spawn_id_get_pid. Wait for spawn id after eof. Use
kill_wait_spawned_process instead of explicit "kill -9".
* lib/gdb.exp (can_spawn_for_attach): Adjust comment.
(kill_wait_spawned_process, spawn_id_get_pid): New procedures.
(spawn_wait_for_attach): Use spawn instead of exec to spawn
processes. Don't map cygwin/windows pids here. Now returns a
spawn id list.
    set test_spawn_id [spawn_wait_for_attach $binfile]
    set testpid [spawn_id_get_pid $test_spawn_id]

    set attempts 10
    for {set attempt 1} { $attempt <= $attempts } { incr attempt } {
        with_test_prefix "iter $attempt" {
            set attached 0
            set eperm 0
            set test "attach"
            gdb_test_multiple "attach $testpid" $test {
                -re "new threads in iteration" {
                    # Seen when "set debug libthread_db" is on.
                    exp_continue
                }
                -re "warning: Cannot attach to lwp $decimal: Operation not permitted" {
                    # On Linux, PTRACE_ATTACH sometimes fails with
                    # EPERM, even though /proc/PID/status indicates
                    # the thread is running.
                    set eperm 1
                    exp_continue
                }
                -re "debugger service failed.*$gdb_prompt $" {
                    fail $test
                }
                -re "$gdb_prompt $" {
                    if {$eperm} {
                        xfail "$test (EPERM)"
                    } else {
                        pass $test
                    }
                }
                -re "Attaching to program.*process $testpid.*$gdb_prompt $" {
                    pass $test
                }
            }

            # Sleep a bit and try updating the thread list.  We should
            # know about all threads already at this point.  If we see
            # "New Thread" or similar being output, then "attach" is
            # failing to actually attach to all threads in the process,
            # which would be a bug.
            sleep 1

            set test "no new threads"
            gdb_test_multiple "info threads" $test {
                -re "New .*$gdb_prompt $" {
                    fail $test
                }
                -re "$gdb_prompt $" {
                    pass $test
                }
            }

            # Force breakpoints always inserted, so that threads we might
            # have failed to attach to hit them even when threads we do
            # know about are stopped.
            gdb_test_no_output "set breakpoint always-inserted on"

            # Run to a breakpoint a few times.  A few threads should spawn
            # and die meanwhile.  This checks that thread creation/death
            # events carry on correctly after attaching.  Also, by
            # detaching from the program and reattaching, we check that
            # the program doesn't die due to gdb leaving a pending
            # breakpoint hit on a new thread unprocessed.
            gdb_test "break break_fn" "Breakpoint.*" "break break_fn"

            # Wait a bit, to give time for most threads to hit the
            # breakpoint, including threads we might have failed to
            # attach.
            sleep 2

            set bps 3
            for {set bp 1} { $bp <= $bps } { incr bp } {
                gdb_test "continue" "Breakpoint.*" "break at break_fn: $bp"
            }

            if {$attempt < $attempts} {
                # Kick the time out timer for another round.
                gdb_test "print again = 1" " = 1" "reset timer in the inferior"
                # Show the time we had left in the logs, in case
                # something goes wrong.
                gdb_test "print seconds_left" " = .*"

                gdb_test "detach" "Detaching from.*"
            } else {
                gdb_test "kill" "" "kill process" "Kill the program being debugged.*y or n. $" "y"
            }

            gdb_test_no_output "set breakpoint always-inserted off"
            delete_breakpoints
        }
    }

    kill_wait_spawned_process $test_spawn_id
}

# The test program exits after a while, in case GDB crashes.  Make it
# wait at least as long as we may wait before declaring a time out
# failure.  Build the list with [list ...] rather than braces so that
# $timeout is substituted by Tcl.
set options [list "additional_flags=-DTIMEOUT=$timeout" debug pthreads]

if {[prepare_for_testing "failed to prepare" $testfile $srcfile $options] == -1} {
    return -1
}

test