Sunday, April 11, 2010

Nokogiri Java Port: Help Us Finish It!

One of the most commonly used native extensions for Ruby is the Nokogiri XML API. Nokogiri wraps libxml and has a fair amount of C code that links directly against the Ruby C extension API, which we don't support in JRuby (and won't, without a lot of community help).

A bit over a year ago, the Nokogiri folks did us a big favor by creating an FFI version of Nokogiri that works surprisingly well; it's probably the most widely used FFI-based Ruby library around. But the endgame for Nokogiri on JRuby has always been to get a pure-Java version. Not everyone is allowed to link native libraries on their Java platform of choice, and those who are often have trouble getting the right libxml versions installed. The Java version needs to happen.

That day is very close.

I spent a bit of time this weekend getting the Nokogiri "java" port running on my system, and the folks working on it have brought it almost to 100% passing. It's time to push it over the edge.

Building and Testing

Here's my process for getting it building. Let me know if this needs to be edited.

Update: Added rake-compiler and hoe to gems you need to install and modified the git command-line for versions that don't automatically create a local tracking branch.

1. Clone the Nokogiri repository and switch to the "java" branch

~/projects ➔ git clone git://
Initialized empty Git repository in /Users/headius/projects/nokogiri/.git/
remote: Counting objects: 14767, done.
remote: Compressing objects: 100% (3882/3882), done.
remote: Total 14767 (delta 10482), reused 13969 (delta 9945)
Receiving objects: 100% (14767/14767), 3.73 MiB | 742 KiB/s, done.
Resolving deltas: 100% (10482/10482), done.

~/projects ➔ cd nokogiri/

~/projects/nokogiri ➔ git checkout -b java origin/java
Branch java set up to track remote branch java from origin.
Switched to a new branch 'java'

2. Install racc, rexical, rake-compiler, and hoe into Ruby (C Ruby, that is, since they also have extensions)
~/projects/nokogiri ➔ sudo gem install racc rexical rake-compiler hoe
Building native extensions. This could take a while...
Successfully installed racc-1.4.6
Successfully installed rexical-1.0.4
Successfully installed rake-compiler-0.7.0
Successfully installed hoe-2.6.0
4 gems installed

3. Build the lexer and parser using C Ruby
~/projects/nokogiri ➔ rake gem:dev:spec
(in /Users/headius/projects/nokogiri)
warning: couldn't activate the debugging plugin, skipping
rake-compiler must be configured first to enable cross-compilation
/usr/bin/racc -l -o lib/nokogiri/css/generated_parser.rb lib/nokogiri/css/parser.y
rex --independent -o lib/nokogiri/css/generated_tokenizer.rb lib/nokogiri/css/tokenizer.rex

4. Build the Java bits (using rake in JRuby)
~/projects/nokogiri ➔ jruby -S rake java:build
(in /Users/headius/projects/nokogiri)
warning: couldn't activate the debugging plugin, skipping
javac -g -cp /Users/headius/projects/jruby/lib/jruby.jar:../../lib/nekohtml.jar:../../lib/nekodtd.jar:../../lib/xercesImpl.jar:../../lib/isorelax.jar:../../lib/jing.jar nokogiri/*.java nokogiri/internals/*.java
jar cf ../../lib/nokogiri/nokogiri.jar nokogiri/*.class nokogiri/internals/*.class

5. Run the tests (again with rake on JRuby)
~/projects/nokogiri ➔ jruby -S rake test
(in /Users/headius/projects/nokogiri)
...full output...

On my system, I get about 8 failures and 19 errors, out of 785 tests and 1657 assertions. We're very close!
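To put "very close" in numbers, here is a minimal sketch of the arithmetic. It assumes every failure or error belongs to a distinct test, which is how Test::Unit normally counts them; the method name is just for illustration.

```ruby
# Percentage of tests that pass, given the totals from a test run.
# Assumes failures and errors never refer to the same test.
def pass_rate(tests, failures, errors)
  passing = tests - failures - errors
  (100.0 * passing / tests).round(1)
end

# Totals from the run described above: 785 tests, 8 failures, 19 errors.
puts pass_rate(785, 8, 19)
```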

A few other useful tasks:
  • jruby -S rake java:clean_all wipes out the built Java stuff
  • jruby -S rake java:gem builds the Java gem, if you want to try installing it
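
If you end up repeating steps 2 through 5 a lot, something like the following could wrap them. This is only a sketch: the command lines are the ones shown above, but BUILD_STEPS, run_build_steps, and the NOKOGIRI_RUN environment variable are names I made up for illustration. It assumes you are already inside the nokogiri checkout on the "java" branch, and by default it only prints the plan.

```ruby
# Hypothetical helper; none of these names exist in Nokogiri or JRuby.
BUILD_STEPS = [
  # [interpreter the step needs, command line from the steps above]
  ["C Ruby", "sudo gem install racc rexical rake-compiler hoe"],
  ["C Ruby", "rake gem:dev:spec"],
  ["JRuby",  "jruby -S rake java:build"],
  ["JRuby",  "jruby -S rake test"],
].freeze

# Prints each step in order and executes it only when run is true.
# Stops at the first command that fails; returns true if none failed.
def run_build_steps(run: false, out: $stdout)
  BUILD_STEPS.each_with_index.all? do |(interpreter, command), index|
    out.puts "#{index + 1}. [#{interpreter}] #{command}"
    run ? system(command) : true
  end
end

# Dry run unless NOKOGIRI_RUN=1 is set in the environment.
run_build_steps(run: ENV["NOKOGIRI_RUN"] == "1")
```

Keep in mind that the last step will report failure for as long as the test suite has failing tests, so a false return value there is expected for now.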
Helping Out

If you'd like to help fix these bugs, there are a few ways to approach it.
  • Join the nokogiri-talk Google Group so you can communicate with others working on the port. The key folks right now are Yoko Harada and Sergio Arbeo (who did the bulk of the original work for GSoC 2009). I'm also poking at it a bit in my spare time.
  • Post to the group to let folks know you want to help. This will help avoid duplicated effort.
  • Pick failing tests that appear to be caused by missing or incorrect Ruby logic, like "not implemented", nil results ("method blah not found for nil") or arity errors ("3 arguments for 2" kinds of things). These are often the simplest ones to fix.
  • Don't give up! We're almost there!
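
One way to spot the easy ones is to bucket the failure messages from a test run. The sketch below is a rough illustration, not anything that ships with Nokogiri: the category names and regular expressions are my assumptions about what those messages usually look like, so adjust them to match the real output.

```ruby
# Hypothetical triage helper; categories and patterns are assumptions.
TRIAGE_PATTERNS = {
  not_implemented: /not implemented/i,
  nil_result:      /for nil/,
  arity:           /wrong number of arguments|\d+ (?:arguments )?for \d+/,
}.freeze

# Returns the first category whose pattern matches, or :other.
def triage(message)
  category, _pattern = TRIAGE_PATTERNS.find { |_name, pattern| message =~ pattern }
  category || :other
end

puts triage("undefined method `children' for nil:NilClass")
puts triage("wrong number of arguments (3 for 2)")
```

Anything that lands in :other probably needs a closer look at the Java side.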
It would be great if we could have a 100% working Nokogiri Java port for JRuby 1.5 final this month. I hope to see you on the nokogiri-talk list! Feel free to comment here if you have questions about getting bootstrapped.


Anonymous said...

I didn't already have a 'java' branch, so I had to use:

git checkout -b java origin/java

No idea why 'git checkout java' didn't work - I really don't know enough about git to guess!

Charles Oliver Nutter said...

Thanks Martin, I'll modify the instructions.

yokolet said...

Thanks! I was able to rebuild pure-Java Nokogiri again.

I got:
785 tests, 1671 assertions, 7 failures, 7 errors
on the latest rev. Really close to complete. :)

Unknown said...

Thanks for taking that initiative, Charles. I ran into this exact issue as I tried to run some Cucumber tests via JRuby on an old Red Hat Enterprise Linux server.

(the server ran RHEL4 with ruby 1.8.5, and I thought that jruby would be a better fit for running the cucumber tests)

I reckon that this stack trace is an error that could be prevented when you are finished rewriting the Nokogiri java port:

[workspace] $ /bin/sh -xe /tmp/
+ cd smoketest
+ /opt/jruby-1.5.0.RC1/bin/jruby -S cucumber -p hudson site=preprod
Using the hudson profile...
undefined local variable or method `java' for Nokogiri::LibXML:Module (NameError)
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/nokogiri-1.4.1-java/lib/nokogiri/ffi/libxml.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/polyglot-0.3.1/lib/polyglot.rb:64:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/nokogiri-1.4.1-java/lib/nokogiri.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/polyglot-0.3.1/lib/polyglot.rb:64:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/webrat-0.7.0/lib/webrat.rb:36:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/polyglot-0.3.1/lib/polyglot.rb:64:in `require'
/opt/hudson-work/jobs/espre Cucumber acceptance tests/workspace/smoketest/features/support/env.rb:2
/opt/hudson-work/jobs/espre Cucumber acceptance tests/workspace/smoketest/features/support/env.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/polyglot-0.3.1/lib/polyglot.rb:64:in `require'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/rb_support/rb_language.rb:124:in `load_code_file'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/step_mother.rb:85:in `load_code_file'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/step_mother.rb:77:in `load_code_files'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/step_mother.rb:76:in `each'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/step_mother.rb:76:in `load_code_files'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/cli/main.rb:48:in `execute!'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/../lib/cucumber/cli/main.rb:20:in `execute'
/opt/jruby-1.5.0.RC1/lib/ruby/gems/1.8/gems/cucumber-0.6.4/bin/cucumber:19:in `load'
Finished: FAILURE

Kiko said...

I'm encountering the same issue as Jesper which is preventing me from upgrading. My error doesn't have a stacktrace though, just the error message.

Charles Oliver Nutter said...

Jesper, Kiko: You can fix this by just adding require 'java' to your code ahead of Nokogiri. In JRuby 1.5, a library that used to load 'java' no longer needs to, but Nokogiri still expects it to have been loaded. This will be fixed for the next Nokogiri FFI release, and of course will not affect the Java release after that.

Unknown said...

Hey Charles! Thanks for the tip!

> You can fix this by just adding require 'java'
> to your code ahead of Nokogiri.

How can I do this in a way that does not break my cucumber tests on ordinary ruby?

Thanks for the heads up :) I look forward to that Nokogiri release and hope you succeed in finding help for the final polish :)

Unknown said...

Update: I just added this line to the cucumber support/env.rb file (before the load of webrat, which loads nokogiri):

require 'java' if RUBY_PLATFORM == 'java'
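
Spelled out a bit more, a guard like this might look as follows in an env.rb that has to run on both implementations. This is a minimal sketch: the jruby? helper is a made-up name, and the webrat line is left as a comment because the real file's contents aren't shown here.

```ruby
# RUBY_PLATFORM is "java" on JRuby; on C Ruby it names the host platform.
# jruby? is a hypothetical helper, not something JRuby provides.
def jruby?
  RUBY_PLATFORM == "java"
end

# Load JRuby's Java integration only where it exists, so C Ruby
# never attempts to require a library it does not have.
require "java" if jruby?

# require "webrat" would follow here, which in turn loads Nokogiri.
```

Checking defined?(JRUBY_VERSION) is another common way to detect JRuby.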

Shih-gian Lee said...

Hi Charles,

I ran "jruby -S rake java:build" but received the following error. I am running Java 1.5.0. Any idea what may cause the problem? Thanks!

warning: couldn't activate the debugging plugin, skipping
javac -g -cp /usr/local/jruby-1.5.0/lib/jruby.jar:../../lib/nekohtml.jar:../../lib/nekodtd.jar:../../lib/xercesImpl.jar:../../lib/isorelax.jar:../../lib/jing.jar nokogiri/*.java nokogiri/internals/*.java
nokogiri/internals/ method does not override a method from its superclass
nokogiri/internals/ method does not override a method from its superclass
nokogiri/internals/ method does not override a method from its superclass
3 errors
rake aborted!
Command failed with status (1): [javac -g -cp /usr/local/jruby-1.5.0/lib/jr...]

Charles Oliver Nutter said...

Shih-gian: Hmm, normally that would mean that the method was intended to override something, but whatever it overrode previously has gone away. In this case, it would be whatever class ParserContext extends. Perhaps that's the place to look first?