Sunday, February 24, 2008

Ruby's Thread#raise, Thread#kill, timeout.rb, and net/protocol.rb libraries are broken

I'm taking a break from some bug fixing to bring you this public service announcement:

Ruby's Thread#raise, Thread#kill, and the timeout.rb standard library based on them are inherently broken and should not be used for any purpose. And by extension, net/protocol.rb and all the net/* libraries that use timeout.rb are also currently broken (but they can be fixed).

I will explain, starting with timeout.rb. You see, timeout.rb allows you to specify that a given block of code should only run for a certain amount of time. If it runs longer, an error is raised. If it completes before the timeout, all is well.

Sounds innocuous enough, right? Well, it's not. Here's the code:

def timeout(sec, exception=Error)
return yield if sec == nil or sec.zero?
raise ThreadError, "timeout within critical session"\
if Thread.critical
begin
x = Thread.current
y = Thread.start {
sleep sec
x.raise exception, "execution expired" if x.alive?
}
yield sec
# return true
ensure
y.kill if y and y.alive?
end
end

So you call timeout with a number of seconds, an optional exception type to raise, and the code to execute. A new thread is spun up (for every invocation, I might add) and set to sleep for the specified number of seconds, while the main/calling thread yields to your block of code.

If the code completes before the timeout thread wakes up, all is well. The timeout thread is killed and the result of the provided block of code is returned.

If the timeout thread wakes up before the block has completed, it tells the main/calling thread to raise a new instance of the specified exception type.

All this is reasonable if we assume that Thread#kill and Thread#raise are safe. Unfortunately, they're provably unsafe.

Here's a reduced example:

main = Thread.current
timer = Thread.new { sleep 5; main.raise }
begin
lock(some_resource)
do_some_work
ensure
timer.kill
unlock_some_resource
end

Here we have a simple timeout case. A new thread is spun up to wait for five seconds. A resource is acquired (in this case a lock) and some work is performed. When the work has completed, the timer is killed and the resource is unlocked.

The problem, however, is that you can't guarantee when the timer might fire.

In general, with multithreaded applications, you have to assume that cross-thread events can happen at any point in the program. So we can start by listing a number of places where the timer's raise call might actually fire in the main part of the code, between "begin" and "ensure".
  1. It could fire before the lock is acquired
  2. It could fire while the lock is being acquired (potentially corrupting whatever resource is being locked, but we'll ignore that for the moment)
  3. It could fire after the lock is acquired but before the work has started
  4. It could fire while the work is happening (presumably the desired effect of the timeout, but it also suffers from potential data corruption issues)
  5. It could fire immediately after the work completes but before entering the ensure block
Other than the data corruption issues (which are very real concerns) none of these is particularly dangerous. We could even assume that the lock is safe and the work being done with the resource is perfectly synchronized and impossible to corrupt. Whatever. The bad news is what happens in the ensure block.

If we assume we've gotten through the main body of code without incident, we now enter the ensure. The main thread is about to kill the timeout thread, when BAM, the raise call fires. Now we're in a bit of a predicament. We're already outside the protected body, so the remaining code in the ensure is going to fail. What's worse, we're about to leave a resource locked that may never get unlocked, so even if we can gracefully handle the timeout error somewhere else, we're in trouble.

What if we move the timer kill inside the protected body, to ensure we kill the timer before proceeding to the lock release?

main = Thread.current
timer = Thread.new { sleep 5; raise }
begin
lock(some_resource)
do_some_work
timer.kill
ensure
unlock_some_resource
end

Now we have to deal with the flip side of the coin: if the work we're performing raises an exception, we won't kill the timer thread, and all hell breaks loose. Specifically, after our lock has been released and we've bubbled the exception somewhere up into the call stack, BAM, the raise call fires. Now it's anybody's guess what we've screwed up in our system. And the same situation applies to any thread you might want to call Thread#raise or Thread#kill against: you can't make any guarantees about what damage you'll do.

There's a good FAQ in the Java SE platform docs entitled Why Are Thread.stop, Thread.suspend, Thread.resume and Runtime.runFinalizersOnExit Deprecated?, which covers this phenomenon in more detail. You see, in the early days of Java, it had all these same operations: killing a thread with Thread.stop(), causing a thread to raise an arbitrary exception with Thread.stop(Throwable), and a few others for suspending and resuming threads. But they were a mistake, and in any current Java SE platform implementation, these methods no longer function.

It is provably unsafe to be able to kill a target thread or cause it to raise an exception arbitrarily.

So what about net/protocol.rb? Here's the relevant code, used to fill the read buffer from the protocol's socket:
def rbuf_fill
timeout(@read_timeout) {
@rbuf << @io.sysread(8196)
}
end

This is from JRuby's copy; MRI performs a read of 1024 bytes at a time (spinning up a thread for each) and Rubinius has both a size modification and a timeout.rb modification to use a single timeout thread. But the problems with timeout remain; specifically, if you have any code that uses net/protocol (like net/http, which I'm guessing a few of you rely on) you have a chance that a timeout error might be raised in the middle of your code. You are all rescuing Timeout::Error when you use net/http to read from a URL, right? What, you aren't? Well you better go add that to every piece of code you have that calls net/http, and while you're at it add it to every other library in the world that uses net/http. And then you can move on to the other net/* protocols and third-party libraries that use timeout.rb. Here's a quick list to get you started.

Ok, so you don't want to do all that. What are your options? Here's a few suggestions to help you on your way:
  1. Although you don't have to take my word for it, eventually you're going to have to accept the truth. Thread#kill, Thread#raise, timeout.rb, net/protocol.rb all suffer from these problems. net/protocol.rb could be fixed to use nonblocking IO (select), as could I suspect most of the other libraries, but there is no safe way to use Thread#kill or Thread#raise. Let me repeat that: there is no safe way to use Thread#kill or Thread#raise.
  2. Start lobbying the various Ruby implementations to eliminate Thread#kill and Thread#raise (and while you're at it, eliminate Thread#critical= as well, since it's probably the single largest thing preventing Ruby from being both concurrently-threaded and high-performance).
  3. Start lobbying the library and application maintainers using Thread#kill, Thread#raise, and timeout.rb to stop.
  4. Stop using them yourself.
Now I want this post to be productive, so I'll give a brief overview of how to avoid using these seductively powerful and inherently unsafe features:
  • If you want to be able to kill a thread, write its code such that it periodically checks whether it should terminate. That allows the thread to safely clean up resources and prepare to "die itself".
  • If you need to time out an operation, you're going to have to find a different way to do it. With IO, it's pretty easy. Just look up IO#select and learn to love it. With arbitrary code and libraries, you may be able successfully lobby the authors to add timeout options, or you may be able to hook into them yourself. If you can't do either of those...we'll, you're SOL. Welcome to threads. I hope others will post suggestions in the comments.
  • If you think you can ignore this, think again. Eventually you're going to get bitten in the ass, be it from a long-running transaction that gets corrupted due to a timeout error or a filesystem operation that wipes out some critical file. You're not going to escape this one, so we should start trying to fix it now.
I'm hoping this will start a discussion eventually leading to these features being deprecated and removed from use. I believe Ruby's viability in an increasingly parallel computing world depends on getting threading and concurrency right, and Thread#raise, Thread#kill, and the timeout.rb library need to go away.