Sunday, April 11, 2010

Nokogiri Java Port: Help Us Finish It!

One of the most commonly used native extensions for Ruby is the Nokogiri XML API. Nokogiri wraps libxml and has a fair amount of C code that links directly against the Ruby C extension API we don't support in JRuby (and won't, without a lot of community help).

A bit over a year ago, the Nokogiri folks did us a big favor by creating an FFI version of Nokogiri that works surprisingly well; it's probably the most widely-used FFI-based Ruby library around. But the endgame for Nokogiri on JRuby has always been to get a pure-Java version. Not everyone is allowed to link native libraries on their Java platform of choice, and those that are often have trouble getting the right libxml versions installed. The Java version needs to happen.

That day is very close.

I spent a bit of time this weekend getting the Nokogiri "java" port running on my system, and the folks working on it have brought it almost to 100% passing. It's time to push it over the edge.

Building and Testing

Here's my process for getting it building. Let me know if this needs to be edited.

Update: Added rake-compiler and hoe to gems you need to install and modified the git command-line for versions that don't automatically create a local tracking branch.

1. Clone the Nokogiri repository and switch to the "java" branch

~/projects ➔ git clone git://
Initialized empty Git repository in /Users/headius/projects/nokogiri/.git/
remote: Counting objects: 14767, done.
remote: Compressing objects: 100% (3882/3882), done.
remote: Total 14767 (delta 10482), reused 13969 (delta 9945)
Receiving objects: 100% (14767/14767), 3.73 MiB | 742 KiB/s, done.
Resolving deltas: 100% (10482/10482), done.

~/projects ➔ cd nokogiri/

~/projects/nokogiri ➔ git checkout -b java origin/java
Branch java set up to track remote branch java from origin.
Switched to a new branch 'java'

2. Install racc, rexical, rake-compiler, and hoe into Ruby (C Ruby, that is, since they also have extensions)
~/projects/nokogiri ➔ sudo gem install racc rexical rake-compiler hoe
Building native extensions. This could take a while...
Successfully installed racc-1.4.6
Successfully installed rexical-1.0.4
Successfully installed rake-compiler-0.7.0
Successfully installed hoe-2.6.0
4 gems installed

3. Build the lexer and parser using C Ruby
~/projects/nokogiri ➔ rake gem:dev:spec
(in /Users/headius/projects/nokogiri)
warning: couldn't activate the debugging plugin, skipping
rake-compiler must be configured first to enable cross-compilation
/usr/bin/racc -l -o lib/nokogiri/css/generated_parser.rb lib/nokogiri/css/parser.y
rex --independent -o lib/nokogiri/css/generated_tokenizer.rb lib/nokogiri/css/tokenizer.rex

4. Build the Java bits (using rake in JRuby)
~/projects/nokogiri ➔ jruby -S rake java:build
(in /Users/headius/projects/nokogiri)
warning: couldn't activate the debugging plugin, skipping
javac -g -cp /Users/headius/projects/jruby/lib/jruby.jar:../../lib/nekohtml.jar:../../lib/nekodtd.jar:../../lib/xercesImpl.jar:../../lib/isorelax.jar:../../lib/jing.jar nokogiri/*.java nokogiri/internals/*.java
jar cf ../../lib/nokogiri/nokogiri.jar nokogiri/*.class nokogiri/internals/*.class

5. Run the tests (again with rake on JRuby)
~/projects/nokogiri ➔ jruby -S rake test
(in /Users/headius/projects/nokogiri)
...full output...

On my system, I get about 8 failures and 19 errors, out of 785 tests and 1657 assertions. We're very close!

A few other useful tasks:
  • jruby -S rake java:clean_all wipes out the build Java stuff
  • jruby -S rake java:gem builds the Java gem, if you want to try installing it
Helping Out

If you'd like to help fix these bugs, there's a few ways to approach it.
  • Join the nokogiri-talk Google Group so you can communicate with others working on the port. The key folks right now are Yoko Harada and Sergio Arbeo (who did the original bulk of the work for GSoC 2009). I'm also poking at it a bit in my spare time.
  • Post to the group to let folks know you want to help. This will help avoid duplicated effort.
  • Pick tests that appear to be missing or incorrect Ruby logic, like "not implemented", nil results ("method blah not found for nil") or arity errors ("3 arguments for 2" kinds of things). These are often the simplest ones to fix.
  • Don't give up! We're almost there!
It would be great if we could have a 100% working Nokogiri Java port for JRuby 1.5 final this month. I hope to see you on the nokogiri-talk list! Feel free to comment here if you have questions about getting bootstrapped.