1. Skip to navigation
  2. Skip to content

The ELC Community Blog

A knowledge exchange on Ruby on Rails and Agile Development

April 20, 2008

by Ryan Garver

Ismael Celis wrote:
Hey thanks! Liquid’s been around for a while now but I’m always missing a bit...


Liquid Template Tags

I've been playing around with Liquid recently and have had a lot of fun extending it for a CMS that we're building. It wasn't obvious how to get started, but Liquid is a pretty lightweight code base and after some digging I was able to figure out how to create new custom tags for use in our templates.

The reason why I needed a custom tag was to build a Gravatar image url. This requires an email to be hashed and composed in to a URL. One of the nice things about Liquid is the fact that it protect you from templates that could do things that you don't want to allow. Unlike ERB, Liquid does not evaluate Ruby code directly. It will recognize tags and defer the evaluation to the tag definition which is usually parameterized. We want to be able to insert a line like this in our code:

   1  <img src="{% gravatar_image_url 'jdoe@example.com' size:40 %}" />

When Liquid evaluates that template we want the tag gravatar_image_url to take an email and a list of attributes and output a URL which will show the avatar for the specified icon. For this tag we will start off by creating a class inheriting from Liquid::Tag.

   1  module Liquid
   2      class GravatarImageUrl < Tag
   3        Syntax = /([^\s]+)\s+/
   4        def initialize(tag_name, markup, tokens)
   5          # ...
   6        end
   7      
   8        def render(context)
   9          # ...
  10        end
  11      end
  12    end

NOTE: I've seen some plugins and older versions of liquid that don't have the tag_name argument for initialize. In this example the parameter can be dropped and it should work fine.

There are two methods that we need to override: initialize and render. The initialize method is called to parse the arguments and prepare for rendering once the context is established (this would allow for a two pass evaluation and a possibility for some basic caching of state).

   1  Syntax = /([^\s]+)\s+/
   2    def initialize(markup, tokens)
   3      if markup =~ Syntax
   4        @email = $1
   5        @attributes = {}
   6        markup.scan(TagAttributes) do |key, value|
   7          @attributes[key] = value
   8        end
   9      else
  10        raise SyntaxError.new("Syntax Error in 'gravatar_image_url' - Valid syntax: gravatar_image_url [email]")
  11      end
  12    end

The Syntax trick here isn't normally my style, so I should give credit to the authors of Liquid for demonstrating it to me. It simplifies the process of parsing out important pieces from the input stream. The markup parameter is providing the string that follows the tag name. So if we put gravatar_image_url 'jdoe@example.com' size:40, markup would be set to 'jdoe@example.com' size:40. TagAttributes is provided by Liquid along with a number of other helper regular expressions.

   1  def render(context)
   2      base_url = "http://www.gravatar.com/avatar.php?gravatar_id=#{Digest::MD5.hexdigest(context[@email])}"
   3      extended_attrs = @attributes.map{|k,v| "#{URI.encode(k)}=#{URI.encode(v)}"}
   4      ([base_url]+extended_attrs).compact.join('&amp;')
   5    end

Here is where we get a real taste of the execution context. The first line we are running a MD5 hexdigest on the email, as specified by the Gravatar Docs. We allow for non-literal values here by asking the context tell us what the email actually is. This allows us to do things like: gravatar_image_url post.author.email size:40. The context has enough information to evaluate the post.author.email string and return the value. Incidentally this context trick also allows for some interesting tricks like doing basic math and such.

The last step is to register this Tag definition with a real name with the Liquid::Template handler.

   1  Template.register_tag('gravatar_image_url', GravatarImageURL)

And now you have your very own custom Liquid tag!

April 14, 2008

by Ryan Garver

Advanced Solr Filters with Phonetics

Solr is a very popular full text search engine these days. One of the big benefits over text searching in mysql or some other database that is gained with Solr beside performance is the quality of it's fuzzy searching. Most databases support basic substring comparison; in mysql this is achieved with the 'column LIKE %query%' construct. There are also some databses that will handle regular expressions as well, but these tend to be slow and don't necessarily give rise to good search results with out a lot of work. Solr, more specifically Lucene, is built for this kind of thing and is optimized for string based comparisons and indexing. Solr also comes with a number of algorithms for good fuzzy searches out of the box.

If you are using acts_as_solr in your rails app the Solr configuration that is run when you type 'rake solr:start' supports some very basic query filtering. The fragment below is found in the default schema.xml

   1  <fieldtype name="text" class="solr.TextField" positionincrementgap="100">
   2      <analyzer type="index">
   3        <tokenizer class="solr.WhitespaceTokenizerFactory">
   4        </tokenizer><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
   5        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="1" catenatenumbers="1" generatewordparts="1" catenateall="0">
   6        </filter><filter class="solr.LowerCaseFilterFactory">
   7        </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
   8        </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
   9      </filter></analyzer>
  10      <analyzer type="query">
  11        <tokenizer class="solr.WhitespaceTokenizerFactory">
  12        </tokenizer><filter class="solr.SynonymFilterFactory" ignorecase="true" expand="true" synonyms="synonyms.txt">
  13        </filter><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
  14        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="0" catenatenumbers="0" generatewordparts="1" catenateall="0">
  15        </filter><filter class="solr.LowerCaseFilterFactory">
  16        </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
  17        </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
  18      </filter></analyzer>
  19    </fieldtype>

What this says is that for all text fields that are indexed, and all queries that look at text fields, the text will be split in to words based on white space, add in any synonyms it can find, drop certain "stop words", breaks up the words based on a few other triggers (hyphens, camel case, apostrophes), drops the case of all of the words, trims down some basic conjugations (Porter filter), and then removes any duplicates. The Solr wiki has a full description of these tokenizers and filters.

The real gain from fuzzy searching comes with the tilde (~) operator. This operator tells Solr to base it's relevancy score on a Levenshtein distance algorithm which looks at the number of changes to a string required to arrive at another string.

Levenshtein is great, but we can improve the quality of these search results. One reason for inaccurate search queries is misspellings. Levenshtein attempts to find words that are close, but this ignores how people actually work with words. A better solution would be to look at what kinds of substitutions are required and give certain ones higher scores. This would be based on their use in language; their effective pronunciation.

There are a number of algorithms that support comparing two words based on how similar they are in pronunciation. One of the more common ones is called Soundex. This algorithm produces a hash from a given string. Other strings that have a similar pronunciation are supposed to hash to the same value. There are a few limitations with the method. One major one is that it can't handle wrong first letters. So 'psychology' (P242), 'sychology' (S240), and 'cychology' (C420) will not match at all. There are a number of variations on Soundex as well as a few alternatives.

Solr supports a number of these phonetic filters. We'll be adding support for indexing and querying on one of the Soundex variations: Double Metaphone. Looking at the same schema.xml file as above we can add a single line to both the index and the query analyzer tag content.

   1  <fieldtype name="text" class="solr.TextField" positionincrementgap="100">
   2      <analyzer type="index">
   3        <tokenizer class="solr.WhitespaceTokenizerFactory">
   4        </tokenizer><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
   5        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="1" catenatenumbers="1" generatewordparts="1" catenateall="0">
   6        </filter><filter class="solr.LowerCaseFilterFactory">
   7        </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
   8        <!-- Add new phonetic filter -->
   9        </filter><filter class="solr.PhoneticFilterFactory" inject="true" encoder="DoubleMetaphone">
  10        </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
  11      </filter></analyzer>
  12      <analyzer type="query">
  13        <tokenizer class="solr.WhitespaceTokenizerFactory">
  14        </tokenizer><filter class="solr.SynonymFilterFactory" ignorecase="true" expand="true" synonyms="synonyms.txt">
  15        </filter><filter words="stopwords.txt" class="solr.StopFilterFactory" ignorecase="true">
  16        </filter><filter class="solr.WordDelimiterFilterFactory" generatenumberparts="1" catenatewords="0" catenatenumbers="0" generatewordparts="1" catenateall="0">
  17        </filter><filter class="solr.LowerCaseFilterFactory">
  18        </filter><filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt">
  19        <!-- Add new phonetic filter -->
  20        </filter><filter class="solr.PhoneticFilterFactory" inject="true" encoder="DoubleMetaphone">
  21        </filter><filter class="solr.RemoveDuplicatesTokenFilterFactory">
  22      </filter></analyzer>
  23    </fieldtype>

We put this at the bottom of the list to make sure that the filter is applied last. Before this will affect our search queries we'll need to re-index our content. This is required so that the Metaphone hashes can be generated and indexed. Once the index is updated our queries will be processed as normal but also be hashed. When the comparisons are made the hashes of the record and the query will be compared to determine how similar they are from a phonetic standpoint.

The results you will see are usually pretty good, but as with any fuzzy algorithm it will require some tweaking before it will "feel" right. There are also a number of alternative algorithms that my index and query faster, or give better results. See the bottom of the AnalyzersTokenizersTokenFilters Solr wiki page for a full list.


home | services | Ruby on Rails Development | code | blog | company