Wednesday, June 14, 2006

Unicode in Ruby, Unicode in JRuby?

The great unicode debate has started up again on the ruby-talk mailing list, and for once I'm not actively avoiding it. We have been asked many times by Java folks why JRuby doesn't support unicode, and there's really been no good answer other than "that's the way Ruby does it".

For the record, Ruby can handle UTF-8, but it has no real multibyte character support. Internally it treats strings as byte vectors, though there are libraries and tricks you can use to make things work. The array of libraries and tricks, however, is baffling to me.
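To make the byte-vector point concrete, here's a sketch of the kind of trick involved (assuming a UTF-8 source string; the /u regexp flag is one of the standard workarounds):

```ruby
# -*- coding: utf-8 -*-
# With strings as byte vectors, "héllo" is five characters but six
# bytes in UTF-8, and byte-oriented APIs report the latter. The usual
# workaround is a UTF-8-aware regexp (the /u flag) to walk characters.
s = "héllo"

bytes = s.unpack("C*")   # byte view: one array entry per byte
chars = s.scan(/./u)     # character view via the UTF-8 regexp flag

puts bytes.length        # => 6
puts chars.length        # => 5, what a user usually means by "length"
```

The gap between those two numbers is exactly where most of the libraries and tricks live.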

So here, now, I open up this question to the Java, JRuby, and Ruby communities in general as I did on the ruby-talk and rails-core mailing lists:

Every time these unicode discussions come up my head spins like a top. You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is currently 1.8-compatible, we do not have what most call *native* unicode support. This is primarily because we do not wish to create an incompatible version of Ruby, or to build in unicode support now that would conflict with Ruby 2.0 in the future. It is, however, embarrassing to admit that although we run on top of Java, which has arguably pretty good unicode support, we don't support unicode ourselves. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF-16 strings internally, converted to/from the current platform's default encoding. It can also convert those UTF-16 strings into just about every encoding out there, just by telling it to do so. Java supports the Unicode specification (version 3.0 in older JDKs, and 4.0 as of Java 5). So Unicode is not a problem for Java.
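For anyone wondering what "UTF-16 internally" implies: Java strings are sequences of 16-bit code units, and any character above U+FFFF is stored as a *surrogate pair* of two units. The arithmetic is simple enough to sketch in a few lines of Ruby (plain integer math, nothing Java-specific):

```ruby
# How a supplementary character becomes a UTF-16 surrogate pair.
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
# Plane, so UTF-16 splits it into a high and a low surrogate unit.
cp = 0x1D11E
v  = cp - 0x10000              # 20 bits left to encode
high = 0xD800 + (v >> 10)      # top 10 bits -> high (lead) surrogate
low  = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> low (trail) surrogate

printf("U+%04X => %04X %04X\n", cp, high, low)   # U+1D11E => D834 DD1E
```

So "unicode support" on the JVM still means dealing with code units, not code points, once you leave the BMP.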

We would love to be able to support unicode in JRuby, but there's always that nagging question of what it should look like and what would mesh well with the Ruby community at large. With the underlying platform already rich with unicode support, it would not take much effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby [or JRuby or Java] users, want unicode support to take? Is there a specific library that you feel encompasses a reasonable implementation of unicode support, e.g. icu4r? Should the support be transparent, i.e. no longer treat strings as byte vectors? JRuby, because we use Java's String, already uses UTF-16 strings exclusively; however, there's no way to get at them through core Ruby APIs. What would be the most comfortable way to support unicode now, considering where Ruby may go in the future?


So there it is, blogosphere. Unicode support is all but certain for JRuby; its usefulness as a JVM language depends on it. What should that support look like?

5 comments:

Golly said...

Even though I often write Japanese and multilingual apps, in my opinion adding unicode support to JRuby would only cause future headaches. I think we all just have to make do with using UTF-8 internally and wait for Ruby 1.9 (which I think is 6-12 months away). For people who don't like that option, there are so many unicode-in-Ruby libraries to choose from.

Charles Oliver Nutter said...

My understanding, based on Ruby Kaigi (some Ruby get-together in Japan?) blogs, is that Ruby 1.9 m17n is unlikely to be available until well into 2007, with a 1.9.1 release coming around December. I don't think most people want to wait 12-18 months for native unicode support. There are also a large number of Java folks who won't be willing to step down from full unicode support just to use Ruby, and they're one of the large target audiences for JRuby.

mortench said...

I would personally love to have unicode support in JRuby, even at the cost of being slightly ahead of "the other Ruby implementation". In particular, it makes perfect sense for mixed Java/Ruby code deployments.

tom said...

I too would love to see JRuby seamlessly integrate with the Java environment. Having to explicitly deal with unicode seems so unnatural when working in the JVM.

Anonymous said...

I think that regardless of when it happens, Ruby, and JRuby, should use 32-bit characters internally. That way you don't have to worry about the whole surrogate-pair problem, except when you are reading input or writing output.
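The trade-off this comment describes can be made concrete. The sketch below uses the String#encode API that Ruby eventually gained in 1.9, purely as an illustration: with a 32-bit (UTF-32) representation every code point is exactly one unit, while UTF-16 needs a surrogate pair for anything above U+FFFF.

```ruby
s = "\u{1D11E}"                  # a single code point above U+FFFF

utf16 = s.encode("UTF-16BE")
utf32 = s.encode("UTF-32BE")

puts utf16.bytesize / 2          # => 2 code units: a surrogate pair
puts utf32.bytesize / 4          # => 1 code unit: nothing to pair up
```

The cost, of course, is four bytes per character everywhere, which is why most implementations that went the fixed-width route still confine the conversion to I/O boundaries as the commenter suggests.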