Sunday, February 24, 2008

Ruby's Thread#raise, Thread#kill, timeout.rb, and net/protocol.rb libraries are broken

I'm taking a break from some bug fixing to bring you this public service announcement:

Ruby's Thread#raise, Thread#kill, and the timeout.rb standard library based on them are inherently broken and should not be used for any purpose. And by extension, net/protocol.rb and all the net/* libraries that use timeout.rb are also currently broken (but they can be fixed).

I will explain, starting with timeout.rb. You see, timeout.rb allows you to specify that a given block of code should only run for a certain amount of time. If it runs longer, an error is raised. If it completes before the timeout, all is well.

Sounds innocuous enough, right? Well, it's not. Here's the code:

def timeout(sec, exception=Error)
return yield if sec == nil or sec.zero?
raise ThreadError, "timeout within critical session"\
if Thread.critical
begin
x = Thread.current
y = Thread.start {
sleep sec
x.raise exception, "execution expired" if x.alive?
}
yield sec
# return true
ensure
y.kill if y and y.alive?
end
end

So you call timeout with a number of seconds, an optional exception type to raise, and the code to execute. A new thread is spun up (for every invocation, I might add) and set to sleep for the specified number of seconds, while the main/calling thread yields to your block of code.

If the code completes before the timeout thread wakes up, all is well. The timeout thread is killed and the result of the provided block of code is returned.

If the timeout thread wakes up before the block has completed, it tells the main/calling thread to raise a new instance of the specified exception type.

All this is reasonable if we assume that Thread#kill and Thread#raise are safe. Unfortunately, they're provably unsafe.

Here's a reduced example:

main = Thread.current
timer = Thread.new { sleep 5; main.raise }
begin
lock(some_resource)
do_some_work
ensure
timer.kill
unlock_some_resource
end

Here we have a simple timeout case. A new thread is spun up to wait for five seconds. A resource is acquired (in this case a lock) and some work is performed. When the work has completed, the timer is killed and the resource is unlocked.

The problem, however, is that you can't guarantee when the timer might fire.

In general, with multithreaded applications, you have to assume that cross-thread events can happen at any point in the program. So we can start by listing a number of places where the timer's raise call might actually fire in the main part of the code, between "begin" and "ensure".
  1. It could fire before the lock is acquired
  2. It could fire while the lock is being acquired (potentially corrupting whatever resource is being locked, but we'll ignore that for the moment)
  3. It could fire after the lock is acquired but before the work has started
  4. It could fire while the work is happening (presumably the desired effect of the timeout, but it also suffers from potential data corruption issues)
  5. It could fire immediately after the work completes but before entering the ensure block
Other than the data corruption issues (which are very real concerns) none of these is particularly dangerous. We could even assume that the lock is safe and the work being done with the resource is perfectly synchronized and impossible to corrupt. Whatever. The bad news is what happens in the ensure block.

If we assume we've gotten through the main body of code without incident, we now enter the ensure. The main thread is about to kill the timeout thread, when BAM, the raise call fires. Now we're in a bit of a predicament. We're already outside the protected body, so the remaining code in the ensure is going to fail. What's worse, we're about to leave a resource locked that may never get unlocked, so even if we can gracefully handle the timeout error somewhere else, we're in trouble.

What if we move the timer kill inside the protected body, to ensure we kill the timer before proceeding to the lock release?

main = Thread.current
timer = Thread.new { sleep 5; raise }
begin
lock(some_resource)
do_some_work
timer.kill
ensure
unlock_some_resource
end

Now we have to deal with the flip side of the coin: if the work we're performing raises an exception, we won't kill the timer thread, and all hell breaks loose. Specifically, after our lock has been released and we've bubbled the exception somewhere up into the call stack, BAM, the raise call fires. Now it's anybody's guess what we've screwed up in our system. And the same situation applies to any thread you might want to call Thread#raise or Thread#kill against: you can't make any guarantees about what damage you'll do.

There's a good FAQ in the Java SE platform docs entitled Why Are Thread.stop, Thread.suspend, Thread.resume and Runtime.runFinalizersOnExit Deprecated?, which covers this phenomenon in more detail. You see, in the early days of Java, it had all these same operations: killing a thread with Thread.stop(), causing a thread to raise an arbitrary exception with Thread.stop(Throwable), and a few others for suspending and resuming threads. But they were a mistake, and in any current Java SE platform implementation, these methods no longer function.

It is provably unsafe to be able to kill a target thread or cause it to raise an exception arbitrarily.

So what about net/protocol.rb? Here's the relevant code, used to fill the read buffer from the protocol's socket:
def rbuf_fill
timeout(@read_timeout) {
@rbuf << @io.sysread(8196)
}
end

This is from JRuby's copy; MRI performs a read of 1024 bytes at a time (spinning up a thread for each) and Rubinius has both a size modification and a timeout.rb modification to use a single timeout thread. But the problems with timeout remain; specifically, if you have any code that uses net/protocol (like net/http, which I'm guessing a few of you rely on) you have a chance that a timeout error might be raised in the middle of your code. You are all rescuing Timeout::Error when you use net/http to read from a URL, right? What, you aren't? Well you better go add that to every piece of code you have that calls net/http, and while you're at it add it to every other library in the world that uses net/http. And then you can move on to the other net/* protocols and third-party libraries that use timeout.rb. Here's a quick list to get you started.

Ok, so you don't want to do all that. What are your options? Here's a few suggestions to help you on your way:
  1. Although you don't have to take my word for it, eventually you're going to have to accept the truth. Thread#kill, Thread#raise, timeout.rb, net/protocol.rb all suffer from these problems. net/protocol.rb could be fixed to use nonblocking IO (select), as could I suspect most of the other libraries, but there is no safe way to use Thread#kill or Thread#raise. Let me repeat that: there is no safe way to use Thread#kill or Thread#raise.
  2. Start lobbying the various Ruby implementations to eliminate Thread#kill and Thread#raise (and while you're at it, eliminate Thread#critical= as well, since it's probably the single largest thing preventing Ruby from being both concurrently-threaded and high-performance).
  3. Start lobbying the library and application maintainers using Thread#kill, Thread#raise, and timeout.rb to stop.
  4. Stop using them yourself.
Now I want this post to be productive, so I'll give a brief overview of how to avoid using these seductively powerful and inherently unsafe features:
  • If you want to be able to kill a thread, write its code such that it periodically checks whether it should terminate. That allows the thread to safely clean up resources and prepare to "die itself".
  • If you need to time out an operation, you're going to have to find a different way to do it. With IO, it's pretty easy. Just look up IO#select and learn to love it. With arbitrary code and libraries, you may be able successfully lobby the authors to add timeout options, or you may be able to hook into them yourself. If you can't do either of those...we'll, you're SOL. Welcome to threads. I hope others will post suggestions in the comments.
  • If you think you can ignore this, think again. Eventually you're going to get bitten in the ass, be it from a long-running transaction that gets corrupted due to a timeout error or a filesystem operation that wipes out some critical file. You're not going to escape this one, so we should start trying to fix it now.
I'm hoping this will start a discussion eventually leading to these features being deprecated and removed from use. I believe Ruby's viability in an increasingly parallel computing world depends on getting threading and concurrency right, and Thread#raise, Thread#kill, and the timeout.rb library need to go away.

25 comments:

aquasync said...

Well we could perhaps do...

main = Thread.current
timer = Thread.new { sleep 5; raise }
begin
begin
lock(some_resource)
do_some_work
ensure
timer.kill
end
ensure
unlock_some_resource
end

This is mostly tongue-in-cheek though. I'd agree a better solution is needed

Martin Probst said...

The thing is that this kind of kill/raise functionality is highly desirable. E.g. in every modern application server you'll be running several applications in one process, handled by multiple threads.

Now if one of these applications has a coding error, it will destroy the whole server. If one application has a denial of service problem, where it goes into an unterminated loop, you can easily kill the whole appserver.

And you certainly will have such issues in large code bases. At least in some library you're using.

So I think we really need some kind of a solution for this problem. Any ideas?

zimbatm said...

@headius: Amen. I was bitten by the same problems. When using Thread#kill you easily start thinking "Oh, I could spawn another thread that kills the first one if it take too long or something". Then you get some unpredictable results and can also easily think "ok, if I sleep that process for 2 sec. it seems to work". And then it gets worser. Thanks for bringing visibility to that problem.

@martin probst: Comparing to unix processes, Thread#kill is like `kill -9`. You'd better only use it on highly exceptional cases. In the same way, long running software in C all have a main loop that handles the various signals. Do the same in your ruby "process" and you'll be fine.

@all: DRb is also highly flawed in that sense. Every proxied call can be broken by a Thread#kill.

sambo99 said...

This is a very interesting post! I come from a windows background and have seen this kind of stuff mess up in production. In general the pattern I tended to use when I needed a timeout was WaitForMultipleObjects and make sure I would wait for a terminate event. Thread.Abort is always a big no-no that you only use when you have no choice.

I wonder if something along these lines would partially fix it? I still dislike the thread.raise stuff its ugly

def timeout(sec, exception=Error)
return yield if sec == nil or sec.zero?
raise ThreadError, "timeout within critical session" if Thread.critical
begin
terminate_event = Event.new
x = Thread.current
y = Thread.start {
result = WaitForSingleObject(terminate_event, timeout)

x.raise exception, "execution expired" if x.alive? && result == TIMEOUT
}
yield sec
# return true
ensure
if y and y.alive?
terminate_event.signal
end
end
end

but the trick is that I do not think Ruby has a WaitForSingleObject

Paolo Bonzini said...

Sorry I cannot provide code in Ruby. But there is a way to make it work, and it is to run a single high-priority thread to do the delays, and schedule/unschedule timeouts in the same high-priority thread.

It's how both Squeak and GNU Smalltalk do it.

TuringTest said...

The same problems arise in other languages. Haskell's GHC compiler uses the scheme described in Asynchronous Exceptions in Haskell.

This solves the issue by adding two new (scoping) primitives: block and unblock. These nest, and the innermost controls whether raised exceptions are noticed (unblock) or not (block). When a blocked thread receives an exception it queues the exception until the next moment it becomes unblocked. Unblocked threads may notice exceptions at any time.

This makes it possible to write a timeout library, though it is still subtle. The haskell library is control-timeout.

The latest ghc version has spawned threads inherit the block/unblock status of the parent thread. This is particularly useful for starting threads in a blocked state.

zimbatm said...

The timeout should probably better be handled by the ruby scheduling itself. timeout.rb is only useful for I/O, which is already implemented in a non-blocking way internally anyways.

Basically, changing timeout(&block) to IO#read_or_timeout_on(delay) would solve the problem. Or did I miss something ?

MenTaLguY said...

@turingtest: We're aware of the GHC work, and I've pointed Charlie to the paper. Unfortunately there are a lot of additional problems when implementing it in an environment that isn't as functionally pure as Haskell, and it doesn't mesh well with external blocking calls on the JVM -- but I wouldn't entirely rule it out.

@paolo: unfortunately that approach to timeouts doesn't address all the problems with asynchronous exceptions generally, which is why @turingtest is bringing up the Haskell work (which goes a bit farther).

Juho Snellman said...

> @rbuf << @io.sysread(8196)

Is that a typo for 8192, or are you intentionally reading in a non-power of 2 bytes?

Charles Oliver Nutter said...

@sambo99, @paolo, and others: As you know, there are various ways to wrap individual cases to make it mostly safe. The problem is that doing this in every case is nearly impossible, and you still have the same problems with asynchronous exceptions bubbling out to callers or through libraries that might not be prepared to handle them.

@juho: Doh. You win the prize for spotting it.

Aaron Patterson said...

Here is an rbuf_fill workaround I saw on ruby-talk a while back:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/202223

undees said...

People also use timeout() as kind of a poor man's timed-loop-with-break sometimes, like so:

timeout(5) {sleep 1 until light.is_green}

...so that they don't have type all that Time.now boilerplate. But there are much safer ways.

No need for a separate thread when the loop has its timing needs spelled right out with the sleep() call.

One could easily slap together a quickie little class like this:

http://pastie.caboo.se/157816

...and do something like:

tick(1, 5).til {light.is_green}

Cheesy, but you get the idea.

Ken Bloom said...

Surely you could cpecify circumstances where timeout() is safe, for example when you're not in the presence of locks, and you can rely on the GC to clean up any resources that you allocate in the interrupted thread.

Daniel Berger said...

I wonder if what you've posted here is related to the WWW::Mechanize related GC bug I was hitting on Solaris, where it would segfault within a timeout block:

http://tinyurl.com/ysx2rd

sambo99 said...

@charlie,

Fair point. Keep in mind though that .Net have not deprecated Thread.Abort even though it is evil, It also has a suspend which can be evil as well.

Sometimes, just before you are tearing down a process calling thread.abort may make sense, Eg. signal thread to stop, wait, thread.abort.

Also, if no unmanaged/native resources are involved it may be useful as a quick hack, if you ensure that no exceptions bubble up by catching them.

But, I think it is much too complicated for newcomers to the language, and agree that no base libraries anywhere should use this.

I hope rubinius fix up rbuf_fill before shipping 1.0

Maybe I should submit the patch that aaron mentioned

Assaf said...

"You are all rescuing Timeout::Error when you use net/http to read from a URL, right?"

Besides HTTP status codes, Net:HTTP throws a bunch of exceptions dealing with TCP issues. Either you deal with them or you don't, Timeout is one of those unspecified exceptions.


"A new thread is spun up (for every invocation, I might add)"

You'll run out of connections before you run out of threads. The 1.8.6 MRI threading implementation can do million of threads with no sweat. It also context switches in a particular way that allows this code to work, and I have more examples like this that work because of the granularity of context switching in 1.8.6.

It is a mistake to copy this code over to JRuby or 1.9 without first fixing it to work with native threads. And it's a bad idea to continue writing code like this, now that we have more deployment options.

So if you're planning on shipping code, test on both implementations to make sure it's not limited by the wrong assumptions.

Charles Oliver Nutter said...

@assaf: At least you might reasonably expect to get TCP errors out of net/http, since it's doing network IO. But Timeout::Error is a bit sneakier. At any rate, the mere existence of timeout.rb means you should probably rescue Timeout::Error whenever you call into a third-party library, because people use it. It simply needs to go away.

And it's not the number of threads that's a problem, it's the cost of spinning them up for every 1024 bytes. Granted, it's a lot less with green threads than with native threads, but it's still a real cost for no good reason.

Assaf said...

I ran some sequential and parallel tests for WideFinder, in different combinations, some of which wasted threads like nobody's business. Didn't see any noticeable CPU or memory hit.

The most common trick for getting more performance out of 1.8.6 is increasing the buffer size. The second would be writing a native library.

Timeout is the failure detection mechanism in TCP/IP, and HTTP doesn't try to hide it. If you're using HTTP over TCP/IP you need to anticipate timeout errors.

To it's credit, Net::HTTP documents TimeoutError as an expected error condition on both open an read, since those use two different timeout values.

macournoyer said...

You can also look at http://rev.rubyforge.org/ for non-blocking IO, which is based on libev and offer replacements for some Net/HTTP stuff.

Anonymous said...

I'm very glad to see this being discussed, and I hope JRuby, Rubinious etc take the opportunity to deprecate or fix these (IMHO) design errors in Ruby. Go ahead and break our Thread#raise-using code - its built on dodgy foundations anyway.

Roger pack said...

Here here. Timeout is totally unsafe, and causes timeout errors to be injected "randomly" into your code--including in the middle of net/* calls, etc., which they sometimes don't know how to handle! It's disgusting! Unfortunately I can't think of easy ways around it.

The main problem is that within the ensure block it could be interrupted, I believe. And there is nothing in Ruby that allows for this not to happen.

I wonder if you could wrap the .kill calls within a mutex and maybe Thread.critical=true [which never realy works in MRI, anyway[ to avoid its problems.

Daniel Parker said...

Let me correct one statement: raise and kill are not ALWAYS bad. They come in very useful sometimes. For whatever (usually bad) reasons, you NEED them to get out of locked situations and kill a thread. Just like we people sometimes get stuck in a mindset and need someone else to come along and turn our heads to certain other truths we couldn't see from where we were; a single thread CAN get stuck. Usually the reasons are certain things beyond our control, and sometimes they actually have to do with somebody else using a bad timeout(.rb) library. ;)

But in all reality we need raise and kill. In situations where they are needed, I use Thread.raise, and just make sure I have the error rescued in the receiving thread, to cleanup whatever is needed. And of course you'd NEVER use raise or kill to stop something that should not be stopped.

raise is a normal part of programming. It's just imperative to *know* that you must be *very* careful that you have a safe environment for it before you use it, and only use it if you *have* to. It's the improper use that's dangerous - but not any more dangerous than using bad logic or math can be when working with financial data.

Simon said...

Would the SystemTimer native gem help here?

http://ph7spot.com/articles/system_timer

StartBreakingFree.com said...

Simon, I was thinking the same thing. Just came across System Timer and hoping this solves the issue!

Can anyone else comment on whether this solved it for them?

Charles Oliver Nutter said...

No timeout library can solve this problem unless they are able to guarantee they'll never fire within any ensure blocks. I don't think System Timer can make this guarantee any more than timeout.rb.