If your ruby daemon isn’t doing something useful, it should be sleeping. A misbehaving daemon wakes up repeatedly, uses your CPU unnecessarily, and then goes back to sleep without doing anything helpful. It is easy to write a misbehaving daemon in Ruby, since you’re pretty well-insulated from what is actually happening at the operating system level. Let’s fix that.
Use ps
Watch your daemon using ps (I use ‘ps aux’). Look in the time field. If the time is increasing but your daemon is supposed to be sleeping, you have a problem. A greatly misbehaving daemon will also show up in the STAT field as running. Caught in the act! Now to figure out what it is doing.
Use strace
Use strace to watch the outgoing system calls of your daemon. Grab your daemon’s pid from the ps screen and use ’strace -p<your_pid>’. If your daemon is well-behaved, you should see nothing here if your daemon is sleeping. Otherwise, you’ll see a bunch of activity. Grab your copy of Stevens’s Advanced Programming in the UNIX Environment and debug.
Example one: insomniac daemon / incorrect select timeout using ruby inotify
inotify is a unix library for sending events when something changes in the filesystem (i.e. a file is moved, renamed, deleted, touched, etc). Ruby inotify is a useful plugin for using the inotify library in ruby. A simple inotify daemon may simply wait for something to happen to a file (say it is renamed) and then it performs some action (logs the rename operation? renames the file back to its original name?). If nothing is happening to the file, the daemon should be doing an equivalent amount of nothing.
Download ruby inotify version 0.0.2 from http://raa.ruby-lang.org/project/ruby-inotify/
Compile using rake.
Run ruby ./examples/watcher.rb ./tests/
This will start up an example script that watches the ./tests directory for any changes. Now create a new file in the ./tests directory to make sure that the script notices it. The script noticed it? Great. As you can see this script isn’t really doing anything from a functional standpoint except waiting. Or, we should say it shouldn’t be doing anything besides waiting. Let’s see.
Use ps to grab the script’s pid and strace it. You should see a scrolling list of selects, something like this repeated indefinitely:
select(5, [3], [], [], {0, 44}) = 0 (Timeout)
select(5, [3], [], [], {0, 0}) = 0 (Timeout)
select(5, [3], [], [], {0, 3999}) = 0 (Timeout)
See your copy of Stevens’s Advanced Programming in the UNIX Environment for good information on what this means, but in a nutshell select in combination with inotify is watching the directory for events. The last argument to select (the {0,44}, {0,0}, or{0,3999}) is a timeout value where the first digit is seconds and the second digit is microseconds. So {0,3999} is a timeout value of 0 seconds and 3999 microseconds, or about 4ms. Select is returning 0 (Timeout), which means select waited until it timed out without an event happening. This is being repeated over and over and over again. Effectively, we’re polling for events every 4ms instead of simply waiting for the event to happen.
We’re going to skip over the steps where we use gdb or ddd to find the code where this is occurring. Instead, open ./ext/inotify.c. Ruby inotify works by having an event loop (starting on line 129 with the while(1)) that checks for events (see event_check: line 30), processes the events (line 134+) and repeats. Note how if event_check returns 0, there are no events to process. Now look at event_check (lines 36-49). On line 47, you can see the rb_thread_select call; this will go through ruby’s select multiplexer and eventually end up in the select calls we saw earlier using strace. In fact, you can see on lines 41-42 the timeout being set and passed to rb_thread_select. So here is the 4ms poll. Time to fix it.
Take this line:
r = rb_thread_select (fd+1, &rfds, NULL, NULL, &timeout);
and change the &timeout to NULL, i.e.:
r = rb_thread_select (fd+1, &rfds, NULL, NULL, NULL);
A NULL timeout means infinity here so once this function is called, we’ll wait infinitely long for an event. This is fine. Save, go back to your command line, cntl-c out of your strace and your example script, and use rake to recompile ruby-inotify.
Rerun ruby ./examples/watcher.rb ./tests/
Use ps to grab the pid and strace it.
There should be no scrolling selects, there should only be activity on the strace when you actually create files in the ./tests/ directory and the script is actually doing work. Good script.
Example two: excited daemon / unnecessary signal redefinition
Check out this daemon example in ruby: http://blog.sosedoff.com/2009/01/24/writing-very-simple-daemon-in-ruby/. Download the example code, and you’ll also need the daemonize.rb script in the same directory as the example.
Run ‘ruby daemon.rb’ (you may need a sudo for this). Every second, the daemon will output a “Daemon working” message and then go back to sleep for another second. The code that does this looks like this:
def daemon_log(str)
puts "[#{Time.now.strftime("%m/%d/%Y-%H:%M:%S")}] #{str}"
end
...
while $daemon[:work] do
daemon_log("Daemon working")
daemon_handle_signals
sleep($daemon[:delay_sleep])
end
The daemon is running; now grab the pid using ps, and strace it. Every second you’ll see something like
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(1, "[02/23/2009-14:17:47] Daemon work"..., 37) = 37
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGTERM, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGKILL, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, 8) = -1 EINVAL (Invalid argument)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGINT, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, ~[KILL STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGTSTP, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, {0x33a9a96c40, [], SA_RESTORER, 0x38c5e0f0f0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(0, NULL, NULL, NULL, {1, 0}
) = 0 (Timeout)
Here is what is clear to us: the line beginning with ’stat’ is getting the time, the line beginning with ‘write’ is writing the log, and near the end there is this line ’select(0, NULL, NULL, NULL, {1, 0}’ that is doing a one second delay. But what is rt_sigprocmask and rt_sigaction and why is it happening every second? See the sigprocmask and sigaction section of your Steven’s book for real details, but effectively our application is registering to receive signals every second. This would be fine if we were handling a signal and then needed to handle future signals differently depending upon the first signal; however, this daemon is pretty simple, so let’s look back at the code:
while $daemon[:work] do
daemon_log("Daemon working")
daemon_handle_signals
sleep($daemon[:delay_sleep])
end
Let’s look at daemon_handle_signals (lines 83-105):
def daemon_handle_signals
# termination signal
Signal.trap("TERM") do
daemon_log("TERM signal received.")
daemon_terminate
end
# kill signal
Signal.trap("KILL") do
daemon_log("KILL signal received.")
daemon_terminate
end
# keyboard interruption
Signal.trap("INT") do
daemon_log("SIGINT signal received.")
daemon_terminate
end
Signal.trap("TSTP") do
daemon_log("SIGTSTP signal received.")
end
end
Each of these Signal.trap() lines is registering a block with the rt_sigaction call we see in strace. This code really only needs to be run once at program boot, not each time though the event loop. Once Signal.trap() is called and passed a block, the block will be evaluated when the process is signaled, regardless of where our application is in its codebase. Go back to this piece of code (lines 70-74):
while $daemon[:work] do
daemon_log("Daemon working")
daemon_handle_signals
sleep($daemon[:delay_sleep])
end
And move the daemon_handle_signals call above the loop, like this:
daemon_handle_signals
while $daemon[:work] do
daemon_log("Daemon working")
sleep($daemon[:delay_sleep])
end
Terminate the old daemons and straces if you haven’t yet, and start them back up. You should see a much smaller strace happening every second, but the signals still work. Easiest one to try is cntl-c which sends a INT signal to the proccess and causes it to quit.
A good daemon is one that does no unnecessary work.
Always strace your daemons to make sure they’re doing what you expect.
Go ahead and write daemons in ruby.
One Comment
Nice, thanks. I especially like the closing sentence, I needed the reassurance.