Life as Clay

Using Nokogiri and Ruby on Rails to grab citations from PubMed


Update: I put a working example up on github (Rails 3.0.4 and Nokogiri 1.4.4):


Following on from the last post… I needed to provide a client with a custom database of PubMed citations, links, datasets, and other information related to a product area I was researching. The hardest part was understanding PubMed’s eutils and its XML formats.

I did not want to enter the citation information manually. I wanted to avoid Zotero and similar services because I wanted to integrate the data across domains and use the database I was developing for the client as the basis for a paper I wrote. For the PubMed citations, I wanted to be able to go to PubMed, perform a search, find an interesting citation, copy its PubMed URL, and paste it into my Rails application. The application would then use the URL to query PubMed for the XML record and populate my app’s database with that entry.
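The step from a pasted link to an XML request can be sketched in plain Ruby. This is a minimal sketch, not the code from the app: the helper names `pubmed_id_from_link` and `efetch_url` are made up for illustration, the example link uses a placeholder ID, and the endpoint is assumed to be NCBI's eutils `efetch` service. The digit pattern is the same one the Article model uses further down.

```ruby
require 'uri'

# Assumed endpoint: NCBI's E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Hypothetical helper: returns the first run of 4 to 10 digits in the pasted
# link as a string, or nil when the link contains no such run.
def pubmed_id_from_link(link)
  match = /[0-9]{4,10}/.match(link.to_s)
  match ? match[0] : nil
end

# Hypothetical helper: builds the efetch URL that returns the record as XML.
def efetch_url(pubmed_id)
  query = URI.encode_www_form([["db", "pubmed"], ["id", pubmed_id], ["retmode", "xml"]])
  "#{EFETCH_BASE}?#{query}"
end

link = "http://www.ncbi.nlm.nih.gov/pubmed/12345678" # placeholder ID
pubmed_id = pubmed_id_from_link(link)                # => "12345678"
puts efetch_url(pubmed_id)
```

Returning nil for a link without an ID lets the caller decide whether to show a validation error or fall back to manual entry.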

I have two models that play into this interaction: Article and Journal. I should note that I’m using Nokogiri to parse the XML and thoughtbot’s Paperclip gem to attach articles when they are freely available. This was done using Rails 2.3.5 and Ruby 1.8.7.

Here’s how the model files look:


class Journal < ActiveRecord::Base
  attr_accessible :name, :description, :url, :short_name

  has_many :articles

  validates_uniqueness_of :name
  validates_presence_of :short_name
  validates_presence_of :name

  before_save :make_short_name

  # Returns the short name when there is one.
  def best_name
    if !self.short_name.nil? && self.short_name.length > 0
      return self.short_name
    else
      # Assumed fallback: the listing is cut off here.
      return self.name
    end
  end

  # Makes sure that a new journal has a short name in case one isn't entered.
  def make_short_name
    if self.short_name.nil? || self.short_name.length == 0
      # Assumed value: the right-hand side is missing from the listing.
      self.short_name = self.name
    end
  end
end

The Article model is a little more complicated. The logic that I’m using to parse the XML isn’t great, but I’ve found that the structure changes from article to article. I’m still refining it and will update it when I’m finished, after I’ve found the 90% of articles that auto-import.


class Article < ActiveRecord::Base
  require 'open-uri'

  attr_accessible :pubmedlink,
                  :article_title,
                  :abstract,
                  :authors,
                  :affiliations,
                  :pubdate,
                  :pubmedid,
                  :journal_id,
                  :journal_volume,
                  :journal_issue,
                  :journal_pages,
                  :fetched,
                  :thearticle_file_name,
                  :thearticle_content_type,
                  :thearticle_file_size,
                  :thearticle_updated_at,
                  :thearticle

  has_attached_file :thearticle,
                    :path => ":rails_root/public/system/uploads/:class/:id/:basename.:extension",
                    :url => "/system/uploads/:class/:id/:basename.:extension"
  belongs_to :journal
  before_save :make_pubmedid
  validates_uniqueness_of :article_title

  # Builds a citation string such as "Short Name. 2010 Jul; 12(3): 45-67".
  def citation
    cit = ""
    if self.journal_id != nil && self.journal_id > 0
      cit += self.journal.best_name + ". " + self.pubdate.year.to_s + " " + self.pubdate.strftime("%b")
    end
    if self.journal_volume != nil && self.journal_volume.length > 0
      cit += "; " + self.journal_volume
    end
    if self.journal_issue != nil && self.journal_issue.length > 0
      cit += "(" + self.journal_issue + ")"
    end
    if self.journal_pages != nil && self.journal_pages.length > 0
      cit += ": " + self.journal_pages
    end
    return cit
  end


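  # Extracts the PubMed ID from the pasted link and fetches the record once.
  def make_pubmedid
    if self.pubmedlink != nil && self.pubmedlink.length > 0 && self.fetched == false
      regex = Regexp.new(/[0-9]{4,10}/)
      matchdata = regex.match(self.pubmedlink)
      self.pubmedid = matchdata[0]
      self.fetched = true
      # Assumed call: the stack trace quoted in the comments shows this
      # callback invoking pull_pubmed_data.
      pull_pubmed_data(self.pubmedid)
    end
  end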

  def pull_pubmed_data(theID)
    @doc = Nokogiri::XML(open("" + theID.to_s + "&retmode=xml").read)

    # The title and abstract (join instead of Array#to_s, which only concatenates on Ruby 1.8)
    self.article_title = @doc.xpath("//ArticleTitle").collect(&:text).join(" ")
    self.abstract = @doc.xpath("//AbstractText").collect(&:text).join(" ")

    # Setting up the authors
    first_names = @doc.xpath("//ForeName").collect(&:text)
    last_names = @doc.xpath("//LastName").collect(&:text)
    full_names = []

    i = 0
    last_names.length.times do
      the_name = first_names[i].to_s + " " + last_names[i].to_s
      full_names.push the_name
      i += 1
    end
    self.authors = full_names.join(", ")

    # Affiliations
    self.affiliations = @doc.xpath("//Affiliation").collect(&:text).join(" ")

    # Publication Date - Check if the complete date is at the top. If not use the pub med date.
    month_done = 0
    theyear = @doc.xpath("//PubDate/Year").collect(&:text)
    theyear = theyear[0]
    themonth = @doc.xpath("//PubDate/Month").collect(&:text)
    themonth = themonth[0]
    theday = @doc.xpath("//PubDate/Day").collect(&:text)

    if theyear.nil? || themonth.nil? || theday.nil? || theyear.length == 0 || themonth.length == 0 || theday.length == 0

      theyear = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Year").collect(&:text)
      theyear = theyear[0]

      # The PubMed date gives the month as a number, so no name lookup is needed.
      themonth = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Month").collect(&:text)
      themonth = themonth[0].to_i
      month_done = 1
      if themonth < 10
        themonth = "0" + themonth.to_s
      end

      theday = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Day").collect(&:text)
    end

    if month_done != 1

      themonth = case themonth || "01"
        when "Jan"  then "01"
        when "Feb"  then "02"
        when "Mar"  then "03"
        when "Apr"  then "04"
        when "May"  then "05"
        when "Jun"  then "06"
        when "Jul"  then "07"
        when "Aug"  then "08"
        when "Sep"  then "09"
        when "Oct"  then "10"
        when "Nov"  then "11"
        when "Dec"  then "12"
        else themonth # Added: a month that is already numeric passes through unchanged.
      end
    end

    if theday.length == 0
      theday = "01"
    elsif theday[0].to_i < 10
      theday = "0" + theday[0].to_s
    else
      theday = theday[0].to_s
    end

    thedate = theyear.to_s + "-" + themonth.to_s + "-" + theday.to_s
    puts "thedate: " + thedate.to_s
    self.pubdate = Date.new(theyear.to_i, themonth.to_i, theday.to_i)

    # Either referencing the proper journal or creating a new one
    thejournal = @doc.xpath("//Journal/Title").collect(&:text)
    thejournal = thejournal[0]

    theshortname = @doc.xpath("//MedlineTA").collect(&:text)
    if theshortname.length == 0
      theshortname = ""
    else
      theshortname = theshortname[0]
    end

    thejournalid = Journal.find(:first, :conditions => ['lower(name) = ?', thejournal.downcase])

    if !thejournalid.nil?
      self.journal_id = thejournalid.id
    else
      # Assumed to be Journal.create: the next line looks up the newest saved journal.
      @journal = Journal.create(:name => thejournal.to_s, :short_name => theshortname.to_s)
      thenewjournal = Journal.find(:first, :order => 'created_at DESC')
      self.journal_id = thenewjournal.id
    end

    # Save the volume, issue, and pages
    thevolume = @doc.xpath("//JournalIssue/Volume").collect(&:text)
    thevolume = thevolume[0].to_s if thevolume.length > 0
    self.journal_volume = thevolume if thevolume.length > 0

    theissue = @doc.xpath("//JournalIssue/Issue").collect(&:text)
    theissue = theissue[0].to_s if theissue.length > 0
    self.journal_issue = theissue if theissue.length > 0

    thepag = @doc.xpath("//Pagination/MedlinePgn").collect(&:text)
    thepag = thepag[0].to_s if thepag.length > 0
    self.journal_pages = thepag if thepag.length > 0
  end
end
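
The date handling is the fiddliest part of pull_pubmed_data, because it has to fall back from the //PubDate fields to the PubMedPubDate entry and cope with months given either as names or as numbers. One way to isolate that logic is sketched below. It is a rough sketch under stated assumptions, not the code from the app: `month_number` and `first_usable_date` are hypothetical helpers, the hashes stand in for text already pulled out of the XML with xpath, and a missing day defaults to the first of the month (the model above instead falls back to the PubMed date when the day is missing).

```ruby
require 'date'

# Hypothetical helper: turns the text of a <Month> element into a month
# number. Accepts abbreviations or full names ("Jun", "June") and digits ("06").
def month_number(raw)
  return nil if raw.nil? || raw.strip.empty?
  text = raw.strip
  return text.to_i if text =~ /\A\d{1,2}\z/
  Date::ABBR_MONTHNAMES.index(text[0, 3].capitalize)
end

# Hypothetical helper: walks the candidate dates in order and returns a Date
# built from the first one that has a usable year and month, or nil if none do.
# Each candidate is a hash like { :year => "2010", :month => "Jun", :day => "9" }.
def first_usable_date(candidates)
  candidates.each do |parts|
    year  = parts[:year].to_s.strip
    month = month_number(parts[:month])
    next if year.empty? || month.nil?
    day = parts[:day].to_s.strip
    day = "1" if day.empty?
    next unless Date.valid_date?(year.to_i, month, day.to_i)
    return Date.new(year.to_i, month, day.to_i)
  end
  nil
end

pub_date    = { :year => "2010", :month => nil, :day => nil }  # a PubDate with only a year
pubmed_date = { :year => "2010", :month => "6", :day => "9" }  # the PubMedPubDate fallback
puts first_usable_date([pub_date, pubmed_date])                # prints 2010-06-09
```

Keeping the XML lookups out of these helpers means the fallback order can be tested without a network call or a parsed document.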



The journals_controller.rb file is pretty standard. So is the articles_controller.rb file. Here it is as an example:


class ArticlesController < ApplicationController
  def index
    @articles = Article.all(:order => "pubdate DESC")
  end

  def show
    @article = Article.find(params[:id])
  end

  def new
    @article = Article.new
    @journals = Journal.all(:order => :name)
  end

  def create
    @article = Article.new(params[:article])
    # Assumed standard scaffold check: the condition is missing from the listing.
    if @article.save
      flash[:notice] = "Successfully created article."
      redirect_to @article
    else
      render :action => 'new'
    end
  end

  def edit
    @article = Article.find(params[:id])
    @journals = Journal.all(:order => :name)
  end

  def update
    @article = Article.find(params[:id])
    if @article.update_attributes(params[:article])
      flash[:notice] = "Successfully updated article."
      redirect_to @article
    else
      render :action => 'edit'
    end
  end

  def destroy
    @article = Article.find(params[:id])
    # Assumed standard scaffold call: the destroy line is missing from the listing.
    @article.destroy
    flash[:notice] = "Successfully destroyed article."
    redirect_to articles_url
  end
end

Here are the view files for articles:


<div class="toolbar">
<th>Article Title</th>
<tr class="<%= cycle('rowA', 'rowB') %>">
<td><strong> </strong>

		  <em> </em></td>
<td class="deletelink"></td>


<div class="toolbar">
	  <strong>PubMed Link:</strong>

	  <strong>View the Article:</strong>

  <strong>Article Title:</strong>







  <strong>Journal Volume:</strong>

  <strong>Journal Issue:</strong>

  <strong>Journal Pages:</strong>


<h3>Enter a link to an abstract on PubMed and click Submit or fill out the information below</h3>

		There currently is no file attached.

<hr />

The Paperclip gem requires :html => { :multipart => true } in the form_for call at the top of the _form file; without it, the attached file is not uploaded with the rest of the form. It’s a very simple gem to use and very powerful. There are Google Groups for both Paperclip and Nokogiri, should you need additional assistance setting them up.

Leave me a comment if this helps you with either Nokogiri or with extracting data from PubMed!


Written by Clay

July 9, 2010 at 18:18

Posted in Code, Ruby, Technology


12 Responses


  1. Hi,

    nice job,
    i was wondering what migration have you done before? for creat the articles db?


    February 14, 2011 at 06:36

  2. Hi Pierre. This is what I have as the schema for the articles:

      create_table "articles", :force => true do |t|
        t.string   "pubmedlink"
        t.string   "article_title"
        t.text     "abstract"
        t.string   "authors"
        t.text     "affiliations"
        t.date     "pubdate"
        t.integer  "pubmedid"
        t.integer  "journal_id"
        t.string   "journal_volume"
        t.string   "journal_issue"
        t.string   "journal_pages"
        t.datetime "created_at"
        t.datetime "updated_at"
        t.boolean  "fetched"
        t.string   "thearticle_file_name"
        t.string   "thearticle_content_type"
        t.integer  "thearticle_file_size"
        t.datetime "thearticle_updated_at"
      end


    February 14, 2011 at 11:22

    • Hi Clay,

      thank’s a lot, it seems there is also a db called journals?
      when i try the app i get a first error :
      undefined method `title’
      then if i remove in views i can see
      “Add a New Article”, when i click on it i try to redirect me to a new.html that doesn’t exist?

      what should be in the new.html.erb? cause i just see the show.html.erb and the _form.html.erb


      February 14, 2011 at 13:50

      • Yes, the “journal” represents the magazine that the article appeared in. You probably receive that error because the code tries to create a new journal if there isn’t one that matches the journal of the article being added. Here is journal.rb, the model file:

        class Journal < ActiveRecord::Base
          attr_accessible :name, :description, :url, :short_name, :country_id
          has_many :articles, :order => 'pubdate DESC'
          belongs_to :country
          validates_uniqueness_of :name
          validates_presence_of :short_name
          validates_presence_of :name
          before_save :make_short_name

          def best_name
            if !self.short_name.nil? && self.short_name.length > 0
              return self.short_name
            else
              # Assumed fallback: the listing is cut off here.
              return self.name
            end
          end

          # Makes sure that a new journal has a short name in case one isn't entered.
          def make_short_name
            if self.short_name.nil? || self.short_name.length == 0
              # Assumed value: the right-hand side is missing from the listing.
              self.short_name = self.name
            end
          end
        end

        The schema for the journals table in the database is:

          create_table "journals", :force => true do |t|
            t.string   "name"
            t.text     "description"
            t.string   "url"
            t.datetime "created_at"
            t.datetime "updated_at"
            t.string   "short_name"
            t.integer  "country_id"
          end

        This obviously presents the ability to link a journal to a particular country. You can remove that in order to simplify the schema.

        Unfortunately, I did this a while ago and I cannot find an earlier version of the code that includes only the parts necessary to query PubMed (the rest of it is client proprietary).

        For articles, new.html.erb contains simply:

        <% title "New Article" %>
        <div class="toolbar">
          <ul>
            <li><%= link_to "Cancel and Return to List of Articles", articles_path, :class => "deletelink" %></li>
          </ul>
        </div>
        <%= render :partial => 'form' %>

        I don’t have time to review all of the code right now, but what probably is happening is that the article, upon creation, is trying to create a new journal since there isn’t one that matches.

        Let me know if you can get it working, and if you cannot, then I’ll dig into it a bit more. :)


        February 14, 2011 at 13:58

  3. Hi, thank’s again for your replies! that help a lot!
    finally… after have created the right DB for articles and journals, and after have created the new.html.erb, i got a new error of course…

    When i paste a pubmed link in ‘PubMed Link’ in the new.html.erb page i get :

    –> uninitialized constant Article::Nokogiri
    –> app/models/article.rb:59:in `pull_pubmed_data’
    app/models/article.rb:53:in `make_pubmedid’
    app/controllers/articles_controller.rb:17:in `create’
    –> {“article”=>{“pubdate(1i)”=>”2011”,
    “commit”=>”Pull Data from PubMed”,

    Any idea what i could have forgotten?


    February 15, 2011 at 10:59

    • I just put working example code (with slight modifications) up on github. You can check it out here:

      Let me know if this helps. It requires Rails 3.0.4 and Nokogiri 1.4.4


      February 15, 2011 at 14:32

  4. It looks like maybe you don’t have Nokogiri installed or that it’s not initialized properly. Do you have it listed in your environment.rb file and/or have you successfully used it in other projects (and know that you have it installed properly)?

    The other possibility is that there is an incompatibility with the character encoding. I noticed that the article is in Chinese; if the PubMed abstract contains characters that aren’t encoded with UTF-8 then it might not import properly.


    February 15, 2011 at 11:41

  5. I’m going to try to get a working version of it up on github in the next day or two. Sorry for all of the trouble! I did this prior to a computer upgrade and cannot find the original basic working code.


    February 15, 2011 at 12:08

  6. Hi!

    thank’s a lot! the version you put on github works well!
    For my error it was because i haven’t put Nokogiri in my environment.rb like you said ;)

    i am trying to add a research fonction to it with eutils/esearch.fcgi, so that it can find ID by this way.
    Have you work on this job?
    Thx again, nice work.


    February 16, 2011 at 02:58

    • I’m glad you were able to get it working. I didn’t need the advanced eutils behavior when I constructed this but I was thinking about playing with it soon. The API seems pretty straightforward but the documentation is bad. Let me know if you get it working! :)


      February 16, 2011 at 07:24

  7. Hi all, absolute noob here. Came here in search for answer to “can I use Nokogiri in a model”, the thing that stumped me was didn’t look like it belonged in the controller!

    Anyway – new question! – to get the data from the web into your database – you had to run a cron, or something right? What I’m basically asking is – there is a trigger of some description that makes the software actually fetch, right?

    Geoff Wright

    April 26, 2011 at 12:19

    • It’s very possible that I have some model logic in the controller — I’m a bit of a noob myself. You’ll find, however, that it’s not always easy to make clear distinctions between controllers, models, and views. Sure, that’s what all of the books and video tutorials say, and it’s a good goal to have, but things don’t always work out that way.

      I’m not sure that I understand your second question… To get the data from the web into my database, I simply enter the article info into the appropriate view when it renders. The code on this post is screwed up (the blog’s editor did something to it), so you probably should browse through the github repository. Try running the code on github and you’ll see how it works:


      April 26, 2011 at 20:26
