Tuesday, February 13, 2007

California Schemin'

Last week Tim Bray and I were at Menlo Park to meet with the Open Source Software Society Shimane, a delegation of developers, managers, and company heads from the Shimane prefecture of Japan. They were visiting Sun to talk with us about opportunities for cooperation, Sun hardware and software, and, most importantly, Ruby. You see, Shimane is the home of Yukihiro "Matz" Matsumoto, creator of Ruby, and he accompanied the group to California.

The evening before the event, we went to Fuki Sushi in Palo Alto, a short two blocks from my hotel. I don't believe I've ever eaten such quantity or variety of Japanese cuisine, and I think our guests felt the same way. They marveled at the size of most dishes, especially the ice-cream-scoop-sized lump of wasabi and the two-foot-long sushi tray. They also photographed almost everything...I think I posed for a couple dozen snaps.

During the following day, Thursday, they sat through numerous presentations on Sun hardware and software. Tim and I also discussed Sun's position on Ruby and JRuby for an hour before and about forty minutes after lunch. Tim hit the high-level points about where Ruby will likely fit into the Java ecosystem in the future, and I supplied details and demos of JRuby. I also threw in a demo of JRuby's compiler beating Ruby 1.8 in the standard fib algorithm, which elicited a smile and laugh from Matz himself (whew! I was worried how he'd react!).

Most interesting to me, however, was my discussion with Matz that night.

I was invited to join the delegation for a crab dinner in San Francisco. We went to Crustacean, a moderately upscale joint near California and 101. And after the attendant tied my plastic bib on, we were ready to go.

Since Matz and I ended up sitting together, and since very few others at the table spoke English, we managed to get in some time discussing a couple of Ruby 2.0 design issues. Here's a quick summary:

  • Matz seems to have come around to my visibility proposal for "private" in Ruby 2.0, which is largely the same as how Java handles private visibility. I believe this model is a good simplification over the original proposal. See ruby-core:9996 and related for the original discussion. The basic facts of private then would be:
    • You must dispatch to private methods using a functional call, as in foo() versus xyz.foo(). I didn't like this at first, but I've come around to using call syntax to force certain aspects of visibility.
    • Dispatches to private methods will only look in the same class for the method definition.
    • Methods that are public in superclasses can't be made private in subclasses.
    • Methods that are private in superclasses are not visible to subclasses, and so new methods of the same name and any visibility can exist in subclasses.
  • Protected methods in Ruby 2.0 could potentially act the way private methods do now, though Matz is worried it would be too much of a change. I think it's appropriate; current private method behavior is very similar to Java's model for protected methods, where the methods can't be seen from outside the hierarchy, but can be called and overridden within the hierarchy as normal. I voiced my opinion, so we'll see where Matz goes from here.
  • Matz is still comfortable with removing set_trace_func if a better mechanism for profiling and debugging can replace it. I had a few suggestions for alternate mechanisms, but I also promised to look into Java's model, since it seems to work quite well. I also suggested there may be something to learn from DTrace.
  • Matz has come around to the idea that encoded character sequences are a different type than unencoded byte arrays, though he still wants them to have the same outward interface.
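For the visibility items at the top of that list, here is a small sketch of how private behaves in Ruby today, which is the starting point the proposal would change. The class and method names (Base, Derived, secret, greet, peek) are invented for illustration.

```ruby
# Illustrative names only; this shows current private semantics, not the 2.0 proposal.
class Base
  def greet
    secret            # functional call, no explicit receiver: allowed
  end

  def peek(other)
    other.secret      # explicit receiver: raises NoMethodError
  end

  private

  def secret
    "base secret"
  end
end

class Derived < Base
  private

  # Today a subclass can see, override, and chain to a superclass private method.
  def secret
    "derived " + super
  end
end

puts Base.new.greet       # prints "base secret"
puts Derived.new.greet    # prints "derived base secret"

begin
  Base.new.peek(Base.new)
rescue NoMethodError => e
  puts "explicit receiver rejected: #{e.class}"
end
```

As I read the proposal, the second call is where things would differ: a private dispatch from Base#greet would look only in Base, so Derived's method of the same name would be an unrelated method rather than an override.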
The last item on that list, encoded strings, warrants a bit more discussion.

The topic of encoded character strings came up a few times during Matz's visit, usually with him asking how we're doing things in JRuby. I explained that we mostly just follow Ruby 1.8, with our String now being backed by a byte[], but that we're also providing out-of-the-box native support for the new Rails ActiveSupport::Multibyte::Chars class, a wrapper around String that enforces character boundaries and encodings.

At dinner, we continued the discussion. I made my case for a separate type with the following points:
  • A separate type would not require String's interface to change, and it could remain a byte array
  • By having separate types, we can use polymorphic behavior to avoid checking and re-checking encodings for every operation
The first item was mostly a non-issue...Matz is fairly intent on changing the String interface in 2.0, and much of that work is already complete. But he had an interesting response to the second item: he's already planning to have separate types internally for encoded character strings. This was very good news to me, since it means JRuby could easily support M17N in the future by providing different String types for the different encodings, with our UTF-16 String implementation simply backed by java.lang.String, StringBuffer, or StringBuilder.
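To illustrate the polymorphism argument, here is a rough sketch in which each type knows its own encoding, so no operation has to re-check it. ByteString and Utf8String are hypothetical names made up for this example; they are not Ruby or JRuby classes, and this is not how either implementation actually does it. The sketch assumes a Ruby with force_encoding (1.9 or later).

```ruby
# Hypothetical classes for illustration only; neither exists in Ruby or JRuby.

# Raw bytes: length and indexing are byte-based.
class ByteString
  def initialize(bytes)
    @bytes = bytes            # Array of Integers in 0..255
  end

  def length
    @bytes.length
  end

  def [](index)
    @bytes[index]
  end
end

# UTF-8 text: decoded once at construction, so no per-call encoding check.
class Utf8String < ByteString
  def initialize(bytes)
    super
    @chars = bytes.pack("C*").force_encoding("UTF-8").chars
  end

  def length
    @chars.length
  end

  def [](index)
    @chars[index]
  end
end

data = [0x68, 0xC3, 0xA9]     # "h" followed by the two UTF-8 bytes of e-acute
[ByteString.new(data), Utf8String.new(data)].each do |s|
  puts "#{s.class}: length=#{s.length}"
end
```

Both classes answer the same messages, so calling code never asks which encoding it holds; the dispatch does that work.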

So the result of the String discussion can be summarized in a few points:
  • String's interface will change from 1.8 to work with characters rather than bytes, both in the encoded and unencoded forms of String. The plan for String methods' behaviors does not change from current Ruby 1.9.
  • String will have subtypes that represent encoded character data, though in most cases you won't need to know about those types. If you do need to go after a UTF8String (my name), you can, but there will also be some sort of factory model for generating encoded strings and Ruby 2's encoding pragma will handle literals.
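The character-versus-byte split can be tried out on a Ruby 1.9 or later interpreter. This is only a sketch of what released interpreters do, not a statement of the design discussed at dinner, and the sample strings are made up for illustration.

```ruby
s = "h\u00E9llo"                 # "h", e-acute, "llo"; sample string made up here

p s[1] == "\u00E9"               # true: indexing returns a one-character string
p s.length                       # 5 characters
p s.bytesize                     # 6 bytes, since e-acute takes two bytes in UTF-8
p s.each_byte.to_a.length        # 6: each_byte walks the underlying bytes

p "hi".encode("UTF-16BE").bytes  # [0, 104, 0, 105]
p "\u{1F600}".length             # 1, even though the code point is beyond 16 bits
```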
All told, I think it was a very productive trip, and it was great to help Matz work through a few Ruby 2.0 design questions.

10 comments:

Tom Palmer said...

When Strings present characters rather than bytes, does that mean proper 4-byte characters (i.e., UTF-32 encoding)?

Charles Oliver Nutter said...

tom palmer: Basically, in Ruby 1.9 all the methods that used to return bytes (like String[1], etc) will now return single-character strings.

"hello"[1] => "e"

For getting at bytes directly, there are new methods like each_byte.

Tom Palmer said...

Makes sense. Thanks for the info.

Tom Palmer said...

Actually, my question still applies. For instance, if a string consists of a single code point that's past the 16-bit range, is length of the string still 1? (As opposed to Java which would say 2 for length() but 1 for codePointCount().)

Anonymous said...

Will each_byte take an encoding?

"hi".each_byte("UTF-16") ->
0x00
0x68
0x00
0x68

"hi".each_byte("UTF-8") ->
0x68
0x69

Anonymous said...

The above should be:

"hi".each_byte("UTF-16") ->
0x00
0x68
0x00
0x69

Charles Oliver Nutter said...

FYI, the specifics of the Ruby 2.0 String interface are more Matz's thing; I was just arguing for stronger typing of encoded versus unencoded strings, and it seems like that's going to happen.

tom palmer: My understanding is that it returns single-*character* strings, including multi-byte or multi-word surrogated characters. It is a true character sequence. This is also how the Chars class works in Rails, returning single-character (but potentially multi-byte) strings for numeric string indexes.

anonymous: each_byte would go over each byte of the underlying store as-is; so each_byte on a UTF-16 encoded string would have the nulls you show. each_byte on a UTF-8 string would not, and would walk through each byte of a multi-byte character individually.

Charles Oliver Nutter said...

tom palmer: Oh, and for the length issue...I believe there will be two methods, one for char length and one for byte length.

donnacha said...

So, essentially, a Japanese guy agreed with everything you suggested and you thought it wasn't just politeness? Great.

Anonymous said...

I don't know if it's correct to refer to UTF-32 as the only "proper" Unicode encoding.

Or to phrase it differently, at one point in the past people thought UTF-16 was the "proper" Unicode encoding.

It sounds like Ruby will be doing it right though, so you can make a UTF-32-backed string if you want. And in the future maybe someone will make a UTF-64-backed one.