The ELC Community Blog

A knowledge exchange on Ruby on Rails and Agile Development

July 21, 2008

by Dylan Stamat

Ehcache for JRuby / Rails

Spring, Hibernate, LinkedIn.com... and the list goes on.
These are all projects that have chosen Ehcache as their caching solution... and with good reason!

Unlike Memcache, which most people in the Rails world are familiar with, Ehcache is a distributed, in-process caching solution with asynchronous replication. Reads are extremely quick because they happen locally and in-process, and writes are quick too, since all of the replication work happens asynchronously via RMI.

Much more information...from the README:

== DESCRIPTION:

Ehcache is a simplified JRuby interface to Java's (JSR(107) compliant) Ehcache.
Simplified meaning that it should work out-of-the-box, but a lot of native
methods haven't been interfaced yet, as they weren't needed.  Configuration
occurs in config/ehcache.yml, and should support all the configuration
options available.

Some biased and non-biased Ehcache VS Memcache articles:
http://gregluck.com/blog/archives/2007/05/comparing_memca.html
http://feedblog.org/2007/05/21/unfair-benchmarks-of-ehcache-vs-memcached
http://blog.aristotlesdog.com/2008/05/01/memcached_vs_ehcache/
http://www.hugotroche.com/my_weblog/2008/06/ehcache-vs-memc.html

For more information on Ehcache, see:
http://ehcache.sourceforge.net/

Configuration, Code Samples and everything else, see:
http://ehcache.sourceforge.net/EhcacheUserGuide.html


== INSTALL:

jruby -S gem install ehcache


== BASIC USAGE:

manager = CacheManager.new
cache = manager.cache
cache.put("key", "value", {:ttl => 120})
cache.get("key")
manager.shutdown


== RAILS:

An EhcacheStore is available for use within Rails, so all the native
Rails caching methods are supported.  Make sure your config/environments/*,
are setup to support caching, eg: config.action_controller.perform_caching = true

1) From your RAILS_ROOT, run this command:
     - ehcache rails  ## just copies ehcache_store.rb into lib/ at the moment

2) In your environment.rb, specify:
     - config.cache_store = :ehcache_store

3) Cache stuff


== REQUIREMENTS:

Tested with JRuby 1.1.2 / Rails 2.1

June 06, 2008

by Jeff Emminger

JS Routes plugin

Here's a little plugin to make working with Rails routing in Javascript easier. It generates jsroutes.js each time your app starts, which allows you to use routes like so:

JSRoutes.get('users_path()')
// returns: /users

JSRoutes.get('formatted_user_url(1, "js")')
// returns: http://your.server/users/1.js

JSRoutes.get('this_does_not_exist(1, "2", 3, "xml")')
// throws: JSRoutes::Unknown route: this_does_not_exist(1, "2", 3, "xml")

Get it here: https://wush.net/svn/public/jsroutes

June 04, 2008

by Alex Chee

Nathaniel Bibler wrote:
It looks as though Marcel has modified the S3Object.copy method to replace the ad-hoc with...


Tim Trautmann wrote:
Very nice, are you planning to submit this as a patch to Marcel’s main git...


AWS-S3 gem extensions and Amazon's Copy API

One of our projects needed to copy lots of files between different S3 buckets, and Amazon just came out with a beta version of their API for copying S3 objects. So we decided to use this new feature instead of downloading each file and then uploading it back to S3, which was the only official way to do this before the feature came out.

We also found that the gem did not include an argument to copy/rename objects between different buckets. So we made a patch to the aws-s3 gem that uses the new Copy API and accepts an extra argument for the destination bucket, which we found more useful.

copy_patch.diff:
Index: lib/aws/s3/object.rb
===================================================================
--- lib/aws/s3/object.rb	(revision 1282)
+++ lib/aws/s3/object.rb	(working copy)
@@ -178,19 +178,19 @@
           end
         end
         
-        # Makes a copy of the object with <tt>key</tt> to <tt>copy_name</tt>.
-        def copy(key, copy_key, bucket = nil, options = {})
-          bucket          = bucket_name(bucket)
-          original        = open(url_for(key, bucket))
+        # Makes a copy of the object with <tt>key</tt> in bucket <tt>src_bucket</tt> to <tt>copy_name</tt> in bucket <tt>dest_bucket</tt>.
+        def copy(key, copy_key, src_bucket = nil, dest_bucket = nil, options = {})
+          src_bucket          = bucket_name(src_bucket)
+          dest_bucket          = bucket_name(dest_bucket)
+          original        = open(url_for(key, src_bucket))
           default_options = {:content_type => original.content_type}
-          store(copy_key, original, bucket, default_options.merge(options))
-          acl(copy_key, bucket, acl(key, bucket))
+          copy(key, copy_key, src_bucket, dest_bucket, options)
         end
         
-        # Rename the object with key <tt>from</tt> to have key in <tt>to</tt>.
-        def rename(from, to, bucket = nil, options = {})
-          copy(from, to, bucket, options)
-          delete(from, bucket)
+        # Rename the object with key <tt>from</tt> in bucket <tt>src_bucket</tt> to have a key in <tt>to</tt> in bucket <tt>dest_bucket</tt>.
+        def rename(from, to, src_bucket = nil, dest_bucket = nil, options = {})
+          copy(from, to, src_bucket, dest_bucket, options)
+          delete(from, src_bucket)
         end
         
         # Fetch information about the object with <tt>key</tt> from <tt>bucket</tt>. Information includes content type, content length,
@@ -238,8 +238,35 @@
           
           put(path, options, data) # Don't call .success? on response. We want to get the etag.
         end
+
+        
+        # Copies an object from <tt>source_key</tt> and <tt>source_bucket</tt> to <tt>dest_key</tt> and <tt>dest_bucket</tt> 
+        def copy(source_key, dest_key, source_bucket = nil, dest_bucket = nil, options = {})
+          validate_key!(dest_key)
+          # Must build path before infering content type in case bucket is being used for options
+          path1 = path!(dest_bucket, dest_key, options)
+          path2 = path!(source_bucket, source_key, options)
+          infer_content_type!(dest_key, options)
+          options['x-amz-copy-source'] = path2
+          options['x-amz-metadata-directive'] = 'COPY'
+          put(path1, options) # Don't call .success? on response. We want to get the etag.
+        end
+
         alias_method :create, :store
         alias_method :save,   :store
+
+
+        def copy(source_key, dest_key, source_bucket = nil, dest_bucket = nil, options = {})
+          validate_key!(dest_key)
+          # Must build path before infering content type in case bucket is being used for options
+          path1 = path!(dest_bucket, dest_key, options)
+          path2 = path!(source_bucket, source_key, options)
+          infer_content_type!(dest_key, options)
+          options['x-amz-copy-source'] = path2
+          options['x-amz-metadata-directive'] = 'COPY'
+          put(path1, options) # Don't call .success? on response. We want to get the etag.
+        end
+
         
         # All private objects are accessible via an authenticated GET request to the S3 servers. You can generate an 
         # authenticated url for an object like this:

Just apply this patch in your gem directory and change all calls to copy and rename to include the destination bucket in the arguments. I suggest freezing your gem and applying the patch in the vendor/gems/aws-s3 directory, so that you don't change the gem for all your other projects and break them.

Since you're already modifying your aws-s3 gem, it might be worthwhile to also add an Expires and Cache-Control Header to your static assets (images, javascripts, and css). This will make the browser cache files for 3 years (don't worry, if you change the file, S3 will still update the cache-control header) and make YSlow happy.
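
As a rough sketch of what those headers could look like, the helper below builds a far-future Cache-Control and Expires pair in plain Ruby. The helper name far_future_headers is made up for this example, and it is an assumption that your upload call forwards these keys as request headers (the patch above treats its options hash that way), so check that against your copy of the gem.

```ruby
require 'time'

# Hypothetical helper (not part of aws-s3): builds far-future caching
# headers for a static asset. A "year" is approximated as 365 days.
def far_future_headers(years = 3, now = Time.now)
  seconds = years * 365 * 24 * 60 * 60
  {
    'Cache-Control' => "public, max-age=#{seconds}",
    'Expires'       => (now + seconds).httpdate
  }
end

headers = far_future_headers(3)
puts headers['Cache-Control']
puts headers['Expires']
```

The resulting hash would then be merged into whatever options you already pass when storing the asset.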

June 02, 2008

by Dylan Stamat

Warble with Console

By default, creating a war via warble doesn't include script/. I was told that I "shouldn't" need it in production environments, but, I think the positives outweigh the negatives by far. Console can be an excellent diagnostic tool in production... and of course, can cause chaos if used incorrectly (see 'root').

Our custom warbler config is a bit complex due to the nature of our application's directory hierarchy, but here is the small piece that shows the gist of the script/ addition. Some evil twin style approach or mixing in would definitely be preferred, but... alas.

This is a little hack in warbler/lib/warbler/task.rb:

def define_scripts_task
  scripts = Dir.glob("script/**/*").map do |f|
    define_file_task(f, "#{@config.staging_dir}/#{apply_pathmaps(f, :application)}")
  end
  with_namespace_and_config do
    task "public" => scripts
  end
end

And add define_scripts_task to the private define_tasks method. Nothing mind blowing, but, helpful.

Now you can play with script/console on your deployed application, as the full script directory will be added to WEB-INF. And a shout out to Nick for Warble in the first place. It's an excellent package.

June 02, 2008

by Asa Wilson

yawl wrote:
I think this behavior comes from solr-ruby, which acts_as_solr depends on: http://svn.apache.org/repos/asf/lucene/solr/trunk/client/ruby/solr-ruby/lib/solr/xml.rb


Scott wrote:
Yep…. REXML will bring any Ruby app to it’s knees. Hpricot is also a good...


Speedy Solr: XML Libraries

Acts_as_solr needs to create XML docs to submit to the Solr server, either during initial indexing or on save, delete, etc. It depends on one of two libraries for this XML document creation: its first choice is libxml, and the fallback is REXML. There is no message about this, and I haven't seen any documentation to this effect on the web, but there it is sitting in the source code. Libxml is much, much faster than REXML in this situation. When we installed libxml on our servers, the time to create the Solr docs dropped from ~7 seconds to ~1 second and the overall rows/second doubled! That's an absurd speed difference!

The takeaway: if you are using acts_as_solr, make sure you have the libxml-ruby gem installed!
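
To make the silent fallback visible in your own app, something like the following could be dropped into an initializer. This is only a sketch of the general "try the fast library, fall back to the slow one" pattern, not the code acts_as_solr or solr-ruby actually uses; the helper name first_loadable and the library names passed to it are assumptions for illustration.

```ruby
# Hypothetical helper: returns the first library name that can be
# required, or nil when none of the candidates is installed.
def first_loadable(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Prefer the fast parser, fall back to REXML, and say which one won.
backend = first_loadable(%w[libxml rexml/document])
warn "XML backend in use: #{backend || 'none found'}"
```

Logging the result at boot means a missing libxml-ruby gem shows up in the logs instead of as a mysterious indexing slowdown.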

May 30, 2008

by Dylan Stamat

ImageVoodoo File Extensions

Sparing the details, a project of ours uses a custom built content processor, and not AttachmentFu. It works wonderfully, but I ran into some caveats in regard to how ImageVoodoo handles images. After some digging, I found an interesting post by Nick, which actually touches on my problem indirectly.

In a nutshell, ImageScience allows the loading of extension-less images, which ImageVoodoo does not, i.e.:

>> ImageScience.with_image("/Users/dstamat/5a1e81c76e634dfc5005db0b1fdf5c58_CGI.11279.6") {}
=> nil
>> ImageVoodoo.with_image("/Users/dstamat/5a1e81c76e634dfc5005db0b1fdf5c58_CGI.11279.6") {}
TypeError: unrecognized format for /Users/dstamat/5a1e81c76e634dfc5005db0b1fdf5c58_CGI.11279.6
	from /Users/dstamat/src/work/.../vendor/gems/image_voodoo-0.2/lib/image_voodoo.rb:180:in `with_image'
	from (irb):8:in `signal_status'

The "solution" in terms of getting this to work properly is to override Tempfile#make_tmpname to allow the slipping in of the file extension, as seen in Nick's post.
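
If you control the code that creates the temp file, there is a different route that avoids patching Tempfile at all. This is not the override from Nick's post: it assumes a Ruby version whose Tempfile.new accepts a [basename, extension] pair, and the helper name tempfile_with_extension is made up for this sketch.

```ruby
require 'tempfile'

# Assumption: Tempfile.new accepts [basename, extension] on this Ruby.
# The random part of the name goes between the two, so the extension
# survives and suffix-based format detection has something to work with.
def tempfile_with_extension(basename, extension)
  Tempfile.new([basename, extension])
end

tmp = tempfile_with_extension('upload', '.jpg')
puts tmp.path   # path ends in ".jpg"
tmp.close
tmp.unlink
```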

To make the libraries consistent however, that's TBD. Each library obviously uses different means for introspection. ImageScience uses FreeImage_GetFIFFromFilename and FreeImage_FIFSupportsReading to determine compatibility, while ImageVoodoo uses Java's ImageIO.getImageReadersBySuffix and writes with ImageIO.write, which requires a format (and will consequently write out the file but not stream).

Will hack up a patch if time permits... but, this will hopefully shed some light for those running into this problem as well :)

May 20, 2008

by Ryan Garver

Dataportability: XRDS-Simple

I've been getting very excited about the Dataportability project (DP) for quite a while now. Their mission is to promote the idea that individuals have control over their data by determining how they can use it and who can use it. This includes access to data that is under the control of another entity. It's a very cool idea that is gaining a lot of support in very high places. So far companies like MySpace, Google, Microsoft, and Facebook have openly announced their support of DP and its proposed mission. With that kind of weight (and those are only a small sample of the companies backing DP) a lot of things can get done very quickly... or very slowly, as the case may be. Fortunately the DP group has kept itself relatively independent from the commercial sponsors that have pledged themselves. In fact, most of the literature on the DP website doesn't even mention these sponsors as contributing to the standards. Let's hope they can continue to use this autonomy to the advantage of us all.

Among other technologies and standards currently under development, DP leverages OpenID, OAuth, and a number of Microformats (e.g. hCard, XFN), as well as FOAF. I think it's important to increase awareness of these DP technologies, so I'm going to start putting together some posts to dig into what they are, why they were created or chosen, and how they work. To start, I want to explore a relatively new addition to the DP family that hasn't really received much publicity so far: XRDS-Simple.

XRDS-Simple is the standard that the DP group is developing to solve the problem of service discovery. That is, a standard protocol and format for sharing what services a user uses, for what purpose (to share video or photos, to broadcast updates, to store contacts), and with what priority. This is really important for situations like showing a photo gallery on a user's profile page. Where does the user keep their photos? Flickr? Photobucket? Picasa Web Albums?

An Alternative: The rel="me" microformat

There are some other alternatives, however the DP group was concerned that these standards were either too heavy or underpowered for the full extent of the task. One group that has been involved with the DP group since the beginning is the microformats group (µf). They have a µf that nearly satisfies the need for a discovery/directory system for services. The spec is called rel="me". This µf links resources to individuals by marking them as relevant to their profile. This is a very barebones approach, but it is also non-intrusive, as with all µfs. These links can be casually scattered within a person's profile page without impacting the normal formatting, but with µf-aware browsers the information gains meaning within the context.

For the larger goals of the DP project it seems that while rel="me" was heading down the right road, it didn't take us far enough. Because of the intentional simplicity of the µf, features like the purpose of a linked service, the local usernames and IDs for using the service, or the service priority compared to similar services couldn't be described.

So how does it work?

XRDS-Simple is a reduced version of the XRDS standard which was developed by OASIS in conversation with the OpenID community. If you have done any work with OpenID you may recognize XRDS as the format used by the Yadis protocol. Here is a sample XRDS-Simple file (taken from the XRDS-Simple 1.0 Draft 1)

<XRDS xmlns="xri://$xrds">
    <XRD version="2.0" xmlns:simple="http://xrds-simple.net/core/1.0" xmlns="xri://$XRD*($v*2.0)">
        <Type>xri://$xrds*simple</Type>
        <Service priority="10">
          <Type>http://specs.example.com/wish_list/1.0</Type>
          <URI simple:httpMethod="GET">http://books.example.com/wishlist</URI>
          <LocalID>jane</LocalID>
        </Service>
        <Service priority="20">
          <Type>http://specs.example.com/wish_list/1.0</Type>
          <URI priority="10" simple:httpMethod="GET">https://dvds.example.org/lists/wishes</URI>
          <URI priority="20" simple:httpMethod="GET">http://dvds.example.org/lists/wishes</URI>
          <LocalID>janedoe</LocalID>
        </Service>
    </XRD>
</XRDS>

The XRDS-S lists off a collection of services that are described by the sub-element Type. Each Service element is prioritized and within the service a collection of URIs are prioritized. As you can see, the second service has two URIs and prefers the HTTPS one over the non-SSL URI. The last element in each Service element is a LocalID. The LocalID specifies basically a username or some other identifier that the service will tie to the correct user.

I got a little excited about this and decided to do a quick refresher on my Hpricot skills. I threw together an XRDS-Simple parser that returns a hash of XRDs, indexed by id if you have a fragment to work with (see the spec on how this works) and by position otherwise. Each XRD is a hash of services indexed by the Service > Type. Each Service is an array of URIs ordered by overall priority.

require 'rubygems'
require 'hpricot'

# str is assumed to hold the XRDS-Simple document as a string.
xml = Hpricot::XML(str)
xrds = {}
(xml/'XRD').each do |xrd|
  # Only accept XRD elements that identify themselves as XRDS-Simple.
  if xrd.attributes['xmlns'] == 'xri://$XRD*($v*2.0)' &&
      xrd.attributes['version'] == '2.0' &&
      (xrd%'Type').inner_text == 'xri://$xrds*simple'

    # Index by the XRD's id attribute, or by position when there is none.
    id = xrd.attributes['id'] || xrds.size
    xrds[id] = {}
    (xrd/'Service').each do |service|
      # Collect [service priority, URIs sorted by their own priority] per Type.
      xrds[id][(service/'Type').inner_text] ||= []
      xrds[id][(service/'Type').inner_text] << [service.attributes['priority'].to_i, (service/'URI').map do |uri|
        {
          :method => (uri.attributes.find{|(k,v)| k =~ /httpMethod/}.last),
          :priority => uri.attributes['priority'].to_i,
          :local_id => (service%'LocalID').inner_text,
          :uri => uri.inner_text
        }
      end.sort{|l,r| l[:priority]<=>r[:priority]}]
    end
    # Order by service priority, then flatten to a plain list of URI hashes.
    xrds[id].each_key do |key|
      xrds[id][key] = xrds[id][key].sort{|l,r| l.first<=>r.first}.map{|e| e.last}.flatten.map{|e| e.delete(:priority);e}
    end
  end
end

If we run this on the above xml and take a look at xrds we will see:

{0=>
  {"http://specs.example.com/wish_list/1.0"=>
    [{:local_id=>"jane",
      :method=>"GET",
      :uri=>"http://books.example.com/wishlist"},
     {:local_id=>"janedoe",
      :method=>"GET",
      :uri=>"https://dvds.example.org/lists/wishes"},
     {:local_id=>"janedoe",
      :method=>"GET",
      :uri=>"http://dvds.example.org/lists/wishes"}]}}
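
Since each list is already ordered by priority, consuming the result is mostly a matter of taking the first entry. Here is a small sketch of that, working on a hash with the same shape as the output above; the helper name preferred_endpoint is made up for this example.

```ruby
# Same shape as the parser output shown above.
xrds = {
  0 => {
    'http://specs.example.com/wish_list/1.0' => [
      { :local_id => 'jane',    :method => 'GET', :uri => 'http://books.example.com/wishlist' },
      { :local_id => 'janedoe', :method => 'GET', :uri => 'https://dvds.example.org/lists/wishes' },
      { :local_id => 'janedoe', :method => 'GET', :uri => 'http://dvds.example.org/lists/wishes' }
    ]
  }
}

# Hypothetical helper: the highest-priority endpoint for a service type,
# or nil when the XRD or the type is unknown.
def preferred_endpoint(xrds, type, xrd_id = 0)
  xrds.fetch(xrd_id, {}).fetch(type, []).first
end

endpoint = preferred_endpoint(xrds, 'http://specs.example.com/wish_list/1.0')
puts "#{endpoint[:method]} #{endpoint[:uri]} as #{endpoint[:local_id]}"
```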

May 14, 2008

by Ryan Garver

Defensio Lite

This is a quick post, but I wanted to point out that our new commenting system is now using Defensio spam filtering! This is good because after a day of watching the commenting statistics it's pretty clear that we would have been consumed by a porn site or something by the end of the week. The code that we used for this is super simple (possibly too simple) and duplicates some work already done by the talented Marc-André. Oh well. Sometimes you just need to do it yourself. Below is my Defensio API.

require 'net/http'
require 'uri'
require 'cgi'
require 'yaml'
# Also relies on ActiveSupport (cattr_accessor, dasherize, stringify_keys,
# Hash.from_xml), which is already loaded inside a Rails app.

class Defensio
  cattr_accessor :format
  self.format = :xml

  cattr_accessor :service_type
  self.service_type = :app # Can be :blog

  cattr_accessor :api_version
  self.api_version = '1.2'

  cattr_accessor :api_key
  cattr_accessor :owner_url

  def self.configure(confhash)
    if confhash['test']
      @mock = true
      self.owner_url = 'http://www.example.com'
      return
    else
      confhash.each do |prop, val|
        self.send("#{prop}=", val)
      end
    end
  end

  # Any unknown class method becomes an API action,
  # e.g. announce_article posts to the announce-article action.
  def self.method_missing(name, *args)
    self.post(name.to_s.dasherize, *args)
  end

  # NOTE: `private` does not apply to methods defined with `def self.`,
  # so the helpers below remain publicly callable.
  private
    def self.connection
      uri = URI.parse('http://api.defensio.com/')
      Net::HTTP.start(uri.host, uri.port)
    end

    def self.post(action, params = {})
      resp = connection.post(real_path(action), params_from_hash(params))
      raise "Problem with request: #{action}" unless resp.code == '200'
      parse_response(resp.body)
    end

    def self.real_path(action)
      "/#{service_type}/#{api_version}/#{action}/#{api_key}.#{format}"
    end

    def self.params_from_hash(params = {})
      # Thanks Net::HTTPHeader
      params.stringify_keys.merge('owner-url' => owner_url).map {|k,v| "#{CGI.escape(k.dasherize.to_s)}=#{CGI.escape(v.to_s)}" }.join('&')
    end

    def self.parse_response(body)
      case format
      when :yaml
        YAML.load(body)
      when :xml
        Hash.from_xml(body)
      end
    end
end

I clearly didn't spend much time polishing this, but the usage is a pretty straightforward mapping from the API docs. So to announce an article I call:

Defensio.announce_article(:article_author => 'Ryan Garver', :article_author_email => 'rgarver@domain.com', :article_title => 'Defensio Lite', ... )
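
To see what such a call turns into on the wire, here is a standalone sketch of the path and body construction, written without ActiveSupport so it runs in plain Ruby. The helper names (dasherize, defensio_path, defensio_body) and the sample API key are made up for illustration; the path layout and the owner-url parameter mirror the class above.

```ruby
require 'cgi'

# Stand-in for ActiveSupport's String#dasherize.
def dasherize(name)
  name.to_s.tr('_', '-')
end

# Mirrors real_path: /service_type/api_version/action/api_key.format
def defensio_path(method_name, api_key, service_type = :app, api_version = '1.2', format = :xml)
  "/#{service_type}/#{api_version}/#{dasherize(method_name)}/#{api_key}.#{format}"
end

# Mirrors params_from_hash: dasherized keys plus owner-url, form-encoded.
def defensio_body(params, owner_url)
  pairs = params.map { |key, value| [dasherize(key), value.to_s] }
  pairs << ['owner-url', owner_url]
  pairs.map { |key, value| "#{CGI.escape(key)}=#{CGI.escape(value)}" }.join('&')
end

puts defensio_path(:announce_article, 'SAMPLE_KEY')
puts defensio_body({ :article_title => 'Defensio Lite' }, 'http://www.example.com')
```

In other words, the Ruby method name doubles as the API action name, which is why method_missing is enough to cover the whole API.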

There are also some site wide values that are set in a yml file. I'll close with an example.

development:
  api_key: a09f87a09f87a098f7a098f7a098f7a0
  owner_url: http://www.example.com

staging:
  api_key: 12f3412f341f234f123f4123f412f34f
  owner_url: http://mystaging_blog.com

production:
  api_key: 123f412f412f412f3412f3412f3412f3
  owner_url: http://elctech.com

test:
  test: true
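
For completeness, here is roughly how a file like this could be turned into the hash that Defensio.configure expects. The inline YAML stands in for the real file, and the helper name defensio_settings is an assumption for this sketch; in a Rails app you would read the file from disk (for example something like config/defensio.yml, a hypothetical path) and pass in the current environment name.

```ruby
require 'yaml'

# Trimmed copy of the settings file, inlined so the sketch is self-contained.
DEFENSIO_YAML = <<~SETTINGS
  development:
    api_key: a09f87a09f87a098f7a098f7a098f7a0
    owner_url: http://www.example.com

  test:
    test: true
SETTINGS

# Hypothetical helper: picks the settings block for one environment.
def defensio_settings(yaml_text, environment)
  YAML.load(yaml_text).fetch(environment)
end

settings = defensio_settings(DEFENSIO_YAML, 'development')
puts settings.inspect
# In the app this hash would be handed to Defensio.configure(settings).
```

Note that the keys come back as strings, which is what configure relies on when it checks confhash['test'].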

April 20, 2008

by Ryan Garver

Ismael Celis wrote:
Hey thanks! Liquid’s been around for a while now but I’m always missing a bit...


Liquid Template Tags

I've been playing around with Liquid recently and have had a lot of fun extending it for a CMS that we're building. It wasn't obvious how to get started, but Liquid is a pretty lightweight code base and after some digging I was able to figure out how to create new custom tags for use in our templates.

The reason why I needed a custom tag was to build a Gravatar image URL. This requires an email to be hashed and composed into a URL. One of the nice things about Liquid is the fact that it protects you from templates that could do things that you don't want to allow. Unlike ERB, Liquid does not evaluate Ruby code directly. It will recognize tags and defer the evaluation to the tag definition, which is usually parameterized. We want to be able to insert a line like this in our templates:

<img src="{% gravatar_image_url 'jdoe@example.com' size:40 %}" />

When Liquid evaluates that template we want the tag gravatar_image_url to take an email and a list of attributes and output a URL which will show the avatar for the specified email. For this tag we will start off by creating a class inheriting from Liquid::Tag.

module Liquid
  class GravatarImageUrl < Tag
    Syntax = /([^\s]+)\s+/

    def initialize(tag_name, markup, tokens)
      # ...
    end

    def render(context)
      # ...
    end
  end
end

NOTE: I've seen some plugins and older versions of liquid that don't have the tag_name argument for initialize. In this example the parameter can be dropped and it should work fine.

There are two methods that we need to override: initialize and render. The initialize method is called to parse the arguments and prepare for rendering once the context is established (this would allow for a two pass evaluation and a possibility for some basic caching of state).

Syntax = /([^\s]+)\s+/

def initialize(tag_name, markup, tokens)
  if markup =~ Syntax
    # First token is the email (a quoted literal or a variable name).
    @email = $1
    @attributes = {}
    # Remaining key:value pairs (e.g. size:40) become URL parameters.
    markup.scan(TagAttributes) do |key, value|
      @attributes[key] = value
    end
  else
    raise SyntaxError.new("Syntax Error in 'gravatar_image_url' - Valid syntax: gravatar_image_url [email]")
  end
end

The Syntax trick here isn't normally my style, so I should give credit to the authors of Liquid for demonstrating it to me. It simplifies the process of parsing out important pieces from the input stream. The markup parameter is providing the string that follows the tag name. So if we put gravatar_image_url 'jdoe@example.com' size:40, markup would be set to 'jdoe@example.com' size:40. TagAttributes is provided by Liquid along with a number of other helper regular expressions.
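
To see what the two regular expressions pull out of the markup without loading Liquid, here is a standalone sketch. SYNTAX is the same pattern as above, while TAG_ATTRIBUTES is a simplified stand-in that I am assuming behaves like Liquid's TagAttributes for simple key:value pairs; it is not Liquid's actual regex, and parse_gravatar_markup is a made-up name.

```ruby
# Same pattern as the Syntax constant in the post.
SYNTAX = /([^\s]+)\s+/
# Simplified stand-in for Liquid's TagAttributes (an assumption).
TAG_ATTRIBUTES = /(\w+)\s*:\s*(\S+)/

# Splits tag markup into the email token and a hash of attributes.
def parse_gravatar_markup(markup)
  raise ArgumentError, 'Valid syntax: gravatar_image_url [email]' unless markup =~ SYNTAX
  email = $1
  attributes = {}
  markup.scan(TAG_ATTRIBUTES) { |key, value| attributes[key] = value }
  [email, attributes]
end

email, attributes = parse_gravatar_markup("'jdoe@example.com' size:40 ")
puts email
puts attributes.inspect
```

Note that the email token still carries its quotes at this point, and that the pattern needs whitespace after the first token before it will match.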

def render(context)
  base_url = "http://www.gravatar.com/avatar.php?gravatar_id=#{Digest::MD5.hexdigest(context[@email])}"
  extended_attrs = @attributes.map{|k,v| "#{URI.encode(k)}=#{URI.encode(v)}"}
  ([base_url]+extended_attrs).compact.join('&amp;')
end

Here is where we get a real taste of the execution context. In the first line we run an MD5 hexdigest on the email, as specified by the Gravatar docs. We allow for non-literal values here by asking the context to tell us what the email actually is. This allows us to do things like: gravatar_image_url post.author.email size:40. The context has enough information to evaluate the post.author.email string and return the value. Incidentally, this context lookup also allows for some interesting tricks like doing basic math and such.
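
The URL construction itself can be tried outside of Liquid. The sketch below takes an already-resolved email instead of asking a context for it, and swaps URI.encode for CGI.escape because URI.encode was removed in Ruby 3.0; the method name gravatar_image_url is just reused here for illustration.

```ruby
require 'digest/md5'
require 'cgi'

# Builds the same kind of URL as render, from a plain email string.
def gravatar_image_url(email, attributes = {})
  base_url = "http://www.gravatar.com/avatar.php?gravatar_id=#{Digest::MD5.hexdigest(email)}"
  extras = attributes.map { |key, value| "#{CGI.escape(key.to_s)}=#{CGI.escape(value.to_s)}" }
  # '&amp;' because the result is dropped into an HTML attribute.
  ([base_url] + extras).join('&amp;')
end

puts gravatar_image_url('jdoe@example.com', 'size' => 40)
```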

The last step is to register this Tag definition with a real name with the Liquid::Template handler.

Template.register_tag('gravatar_image_url', GravatarImageUrl)

And now you have your very own custom Liquid tag!

April 14, 2008

by Ryan Garver

Advanced Solr Filters with Phonetics

Solr is a very popular full text search engine these days. Besides performance, one of its big benefits over text searching in MySQL or some other database is the quality of its fuzzy searching. Most databases support basic substring comparison; in MySQL this is achieved with the column LIKE '%query%' construct. There are also some databases that will handle regular expressions, but these tend to be slow and don't necessarily give rise to good search results without a lot of work. Solr, or more specifically Lucene, is built for this kind of thing and is optimized for string based comparisons and indexing. Solr also comes with a number of algorithms for good fuzzy searches out of the box.

If you are using acts_as_solr in your Rails app, the Solr configuration that is run when you type 'rake solr:start' supports some very basic query filtering. The fragment below is found in the default schema.xml:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

What this says is that for all text fields that are indexed, and all queries that look at text fields, the text is split into words based on white space, expanded with any synonyms Solr can find (query side only), stripped of certain "stop words", broken up further based on a few other triggers (hyphens, camel case, apostrophes), lowercased, trimmed of some basic conjugations (Porter filter), and finally stripped of duplicates. The Solr wiki has a full description of these tokenizers and filters.

The real gain from fuzzy searching comes with the tilde (~) operator. This operator tells Solr to base its relevancy score on a Levenshtein distance algorithm, which counts the number of single-character changes (insertions, deletions, substitutions) required to turn one string into another.
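
To get a feel for what that distance measures, here is a small standalone implementation of the textbook dynamic-programming version. It is only an illustration of the metric, not the code Lucene runs, and the method name levenshtein is my own.

```ruby
# Classic Levenshtein distance, keeping only two rows of the table.
def levenshtein(a, b)
  # Distance from the empty prefix of a to every prefix of b.
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

puts levenshtein('kitten', 'sitting')   # 3
puts levenshtein('solr', 'solar')       # 1
```

Every edit costs the same here, no matter how natural or unnatural the substitution is.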

Levenshtein is great, but we can improve the quality of these search results. One reason for inaccurate search queries is misspellings. Levenshtein attempts to find words that are close, but this ignores how people actually work with words. A better solution would be to look at what kinds of substitutions are required and give certain ones higher scores. This would be based on their use in language; their effective pronunciation.

There are a number of algorithms that support comparing two words based on how similar they are in pronunciation. One of the more common ones is called Soundex. This algorithm produces a hash from a given string. Other strings that have a similar pronunciation are supposed to hash to the same value. There are a few limitations with the method. One major one is that it can't handle wrong first letters. So 'psychology' (P242), 'sychology' (S240), and 'cychology' (C420) will not match at all. There are a number of variations on Soundex as well as a few alternatives.
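
For the curious, here is a compact sketch of the classic American Soundex rules in Ruby. Implementations differ in small details (for example in how h, w and y are treated), so the exact codes it produces may not match the ones quoted above or the ones your database returns, but it shows both the idea and the first-letter limitation.

```ruby
SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

# Keeps the first letter, then up to three digits for the consonants
# that follow, padding with zeros.
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, '').chars
  return '' if letters.empty?

  result = letters.first.upcase
  previous = SOUNDEX_CODES[letters.first]
  letters.drop(1).each do |char|
    code = SOUNDEX_CODES[char]
    if code
      # Adjacent letters with the same code are only counted once.
      result << code unless code == previous
      previous = code
    elsif char != 'h' && char != 'w'
      # Vowels (and y) separate repeated codes; h and w do not.
      previous = nil
    end
    break if result.length == 4
  end
  result.ljust(4, '0')
end

puts soundex('Robert')   # R163
puts soundex('Rupert')   # R163
puts soundex('psychology') == soundex('sychology')   # false
```

Because the first letter is copied through unchanged, two spellings that start with different letters can never produce the same code, however similar the rest of the word sounds.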

Solr supports a number of these phonetic filters. We'll be adding support for indexing and querying on one of the Soundex variations: Double Metaphone. Looking at the same schema.xml file as above we can add a single line to both the index and the query analyzer tag content.

   1  <fieldtype name="text" class="solr.TextField" positionincrementgap="100">
   2      <analyzer type="index">
   3        <tokenizer class="solr.WhitespaceTokenizerFactory">
   4        </tokenizer><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
   5        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="1" catenatenumbers="1" generatewordparts="1" catenateall="0">
   6        </filter><filter class="solr.LowerCaseFilterFactory">
   7        </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
   8        <!-- Add new phonetic filter -->
   9        </filter><filter class="solr.PhoneticFilterFactory" inject="true" encoder="DoubleMetaphone">
  10        </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
  11      </filter></analyzer>
  12      <analyzer type="query">
  13        <tokenizer class="solr.WhitespaceTokenizerFactory">
  14        </tokenizer><filter class="solr.SynonymFilterFactory" ignorecase="true" expand="true" synonyms="synonyms.txt">
  15        </filter><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
  16        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="0" catenatenumbers="0" generatewordparts=