Playing around with string encoding on Ruby19

ruby19 encoding activeworlds questatlantis multibyte chars

Sun Mar 28 14:56:58 -0700 2010

Lately I've been working with the ActiveWorlds 5.0 SDK, wrapped with ffi and running on Ruby 1.9. We ran into encoding issues with the strings traveling between the SDK and the ffi wrapper, so I had to delve into how String handles encoding in Ruby 1.9, particularly with regard to multi-byte characters. This is not so much a tutorial as a reference for my future self, but it may help anyone searching for information about encoding in ruby19. Take the string "étudiant", for instance. In irb on ruby1.9 with a UTF-8 locale, its encoding is:
ruby-1.9.1-p378 > "étudiant".encoding.name
 => "UTF-8"
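The result above comes from an irb session. As a minimal sketch, assuming the same string lives in a source file instead, the literal's encoding follows the file's magic comment (on Ruby 1.9 a file containing "é" without such a comment is rejected as an invalid multibyte char). The variable names here are only for illustration:

```ruby
# encoding: utf-8
# The magic comment above tells Ruby 1.9 how to read this file.
# (Ruby 2.0 and later assume UTF-8 source by default.)

word = "étudiant"

source_encoding  = __ENCODING__   # encoding of this source file
literal_encoding = word.encoding  # string literals inherit the source encoding

puts source_encoding.name
puts literal_encoding.name

# The external default depends on the locale of the machine running this,
# so its value is not predictable here.
puts Encoding.default_external.name
```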
Provided we are working exclusively with UTF-8, the decimal byte representation of "étudiant" is:
ruby-1.9.1-p378 > "étudiant".bytes.to_a
 => [195, 169, 116, 117, 100, 105, 97, 110, 116] 
As you can see, "étudiant" is 8 characters long, yet its byte representation has 9 entries, because "é" is a multi-byte character represented as [195, 169]. Once you move beyond the ASCII character set, you invariably have to deal with multi-byte characters. What I also love about Ruby is how easy it is to see a string's bytes in various bases. "étudiant" in binary:
ruby-1.9.1-p378 > "étudiant".bytes.to_a.map { |c| c.to_s(2) }
 => ["11000011", "10101001", "1110100", "1110101", "1100100", "1101001", "1100001", "1101110", "1110100"] 
"étudiant" in hex:
ruby-1.9.1-p378 > "étudiant".bytes.to_a.map { |c| c.to_s(16) }
 => ["c3", "a9", "74", "75", "64", "69", "61", "6e", "74"] 
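The character-versus-byte difference described above can be checked directly. Here is a minimal sketch using String#length, String#bytesize and String#each_char; the variable names are mine. It also pads the binary and hex forms to full bytes, since to_s(2) drops leading zeros (which is why the ASCII bytes above show only seven digits):

```ruby
# encoding: utf-8
word = "étudiant"

char_count = word.length    # counts characters
byte_count = word.bytesize  # counts bytes

# Pair each character with the bytes that encode it.
per_char = word.each_char.map { |ch| [ch, ch.bytes.to_a] }

# Zero-pad so every byte shows all 8 bits (or 2 hex digits).
# In UTF-8, a lead byte of a two-byte sequence starts with 110
# and a continuation byte starts with 10.
padded_binary = word.bytes.to_a.map { |b| b.to_s(2).rjust(8, "0") }
padded_hex    = word.bytes.to_a.map { |b| "%02x" % b }

puts "#{char_count} characters, #{byte_count} bytes"
per_char.each { |ch, bytes| puts "#{ch} => #{bytes.inspect}" }
puts padded_binary.join(" ")
puts padded_hex.join(" ")
```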
Now, let's say I want to turn a sequence of bytes back into a UTF-8 encoded string:
[195, 169, 116, 117, 100, 105, 97, 110, 116] 
This will do it:
ruby-1.9.1-p378 > [195, 169, 116, 117, 100, 105, 97, 110, 116].map { |c| c.chr }.join.force_encoding("utf-8")
 => "étudiant"
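The same conversion can also be sketched with Array#pack, which builds the binary string in one call instead of mapping over chr. This is an alternative I'm suggesting, not what the one-liner above does, and the variable names are hypothetical. The sketch also tries encode on the raw bytes to show how it differs from force_encoding:

```ruby
# encoding: utf-8
bytes = [195, 169, 116, 117, 100, 105, 97, 110, 116]

# pack("C*") treats each integer as an unsigned 8-bit value and
# returns a binary (ASCII-8BIT) string.
raw = bytes.pack("C*")
raw_encoding = raw.encoding

# encode tries to transcode from binary to UTF-8; bytes above 127
# have no defined mapping from binary, so it raises.
encode_error = begin
  raw.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# force_encoding leaves the bytes alone and only changes the label.
text = raw.dup.force_encoding("UTF-8")

puts raw_encoding.name
puts encode_error.inspect
puts text
puts text.valid_encoding?
```

Calling valid_encoding? afterwards is a cheap sanity check, because force_encoding will happily label bytes that are not valid UTF-8.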
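You have to use force_encoding because the regular encode method will raise an Encoding::UndefinedConversionError: c.chr for a byte above 127 produces an ASCII-8BIT (binary) string, and encode has no mapping from those raw bytes to UTF-8 characters, whereas force_encoding simply relabels the existing bytes without changing them. The Shades of Gray blog does a bang-up job of walking a n00b through character encoding in Ruby 1.9 if you're interested in further reading. Cheers!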