Life as Clay

Using Nokogiri and Ruby on Rails to grab citations from PubMed


Update: I put a working example up on github (Rails 3.0.4 and Nokogiri 1.4.4): https://github.com/ccheaton/pubnoko

##############################

Following on from the last post: I needed to give a client a custom database of PubMed citations, links, datasets, and other information related to a product area I was researching. The hardest part was understanding PubMed’s eutils and their XML formats.

I did not want to enter the citation information manually. I also wanted to avoid Zotero and similar services, because I wanted to integrate the data across domains and use the database I was developing for the client as the basis for a paper I wrote. For the PubMed citations, the workflow I wanted was: go to PubMed, run a search, find an interesting citation, copy its PubMed URL, and paste it into my Rails application. The application would then use the URL to query PubMed for the record in XML format and populate its database with that entry.
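The first step of that workflow, turning a pasted PubMed URL into an efetch request, can be sketched roughly as follows. This is a minimal illustration rather than the code the app runs: the helper names `extract_pmid` and `efetch_url` are hypothetical, the `https` scheme is an assumption, and the PubMed ID is assumed to be the first run of 4 to 10 digits in the link (the same pattern the Article model matches).

```ruby
require 'uri'

# Same efetch endpoint the Article model queries; https is assumed here.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Hypothetical helper: pull the PubMed ID out of a pasted link.
# Returns nil when the link contains no run of 4 to 10 digits.
def extract_pmid(link)
  match = link.to_s.match(/[0-9]{4,10}/)
  match && match[0]
end

# Hypothetical helper: build the efetch URL for one PubMed ID.
def efetch_url(pmid)
  query = URI.encode_www_form(:db => "pubmed", :id => pmid, :retmode => "xml")
  "#{EFETCH_BASE}?#{query}"
end

pmid = extract_pmid("http://www.ncbi.nlm.nih.gov/pubmed/21319380")
puts pmid
puts efetch_url(pmid)
```

Returning nil for a link without an ID lets the caller decide whether to skip the fetch or report a validation error.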

I have two models that play into this interaction: Article and Journal. I should note that I’m using Nokogiri to parse the XML and thoughtbot’s Paperclip gem to attach article files when they are freely available. This was done using Rails 2.3.5 and Ruby 1.8.7.

Here’s how the model files look:

journal.rb

class Journal < ActiveRecord::Base
  attr_accessible :name, :description, :url, :short_name

  has_many :articles

  validates_uniqueness_of :name
  validates_presence_of :short_name
  validates_presence_of :name

  # Runs before validation so that the presence check on short_name can pass.
  before_validation :make_short_name

  def best_name
    if !self.short_name.nil? && self.short_name.length > 0
      return self.short_name
    else
      return self.name
    end
  end

  private

  # Makes sure that a new journal has a short name in case one isn't entered.
  def make_short_name
    if self.short_name.nil? || self.short_name.length == 0
      self.short_name = self.name
    end
  end
end

The Article model is a little more complicated. The logic that I’m using to parse the XML isn’t great, but I’ve found that the structure changes from article to article. I’m still refining it and will update it when I’m finished, once it auto-imports about 90% of articles.
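One concrete case of that variation: not every `<Author>` element carries a `<ForeName>`, and group authors use `<CollectiveName>` instead, so collecting first and last names into two separate lists can pair them up wrongly. The sketch below reads the names one author at a time. It is only an illustration under stated assumptions: it uses REXML, which comes with standard Ruby installs, so that it runs without extra gems (the same XPath expressions should carry over to Nokogiri), the sample XML is made up, and `author_names` is a hypothetical helper.

```ruby
require 'rexml/document'

# Made-up sample in the shape of a PubMed author list.
SAMPLE_XML = <<-XML
<PubmedArticle>
  <AuthorList>
    <Author><LastName>Smith</LastName><ForeName>Jane</ForeName></Author>
    <Author><CollectiveName>Example Study Group</CollectiveName></Author>
    <Author><LastName>Lee</LastName></Author>
  </AuthorList>
</PubmedArticle>
XML

# Hypothetical helper: build one display name per <Author> element,
# so a missing ForeName cannot shift the names of later authors.
def author_names(doc)
  names = []
  REXML::XPath.each(doc, "//Author") do |author|
    fore  = author.elements["ForeName"]
    last  = author.elements["LastName"]
    group = author.elements["CollectiveName"]
    if last
      names << [fore && fore.text, last.text].compact.join(" ")
    elsif group
      names << group.text
    end
  end
  names
end

doc = REXML::Document.new(SAMPLE_XML)
puts author_names(doc).join(", ")
```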

article.rb

class Article < ActiveRecord::Base
  require 'open-uri'

  attr_accessible :pubmedlink,
                  :article_title,
                  :abstract,
                  :authors,
                  :affiliations,
                  :pubdate,
                  :pubmedid,
                  :journal_id,
                  :journal_volume,
                  :journal_issue,
                  :journal_pages,
                  :fetched,
                  :thearticle_file_name,
                  :thearticle_content_type,
                  :thearticle_file_size,
                  :thearticle_updated_at,
                  :thearticle

  has_attached_file :thearticle,
                    :path => ":rails_root/public/system/uploads/:class/:id/:basename.:extension",
                    :url => "/system/uploads/:class/:id/:basename.:extension"
  belongs_to :journal
  before_save :make_pubmedid
  validates_uniqueness_of :article_title

  def citation
    cit = ""
    if self.journal_id != nil && self.journal_id > 0
      cit += self.journal.best_name + ". " + self.pubdate.year.to_s + " " + self.pubdate.strftime("%b")
    end
    if self.journal_volume != nil && self.journal_volume.length > 0
      cit += "; " + self.journal_volume
    end
    if self.journal_issue != nil && self.journal_issue.length > 0
      cit += "(" + self.journal_issue + ")"
    end
    if self.journal_pages != nil && self.journal_pages.length > 0
      cit += ": " + self.journal_pages
    end
    return cit
  end

  private

  def make_pubmedid
    if self.pubmedlink != nil && self.pubmedlink.length > 0 && self.fetched == false
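      # The PubMed ID is taken to be the first run of 4 to 10 digits in the link.
      matchdata = /[0-9]{4,10}/.match(self.pubmedlink)
      # Nothing to fetch if the link contains no usable ID.
      return true if matchdata.nil?
      self.pubmedid = matchdata[0]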
      pull_pubmed_data(matchdata[0])
      self.fetched = true
    end
  end

  def pull_pubmed_data(theID)
    @doc = Nokogiri::XML(open("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=" + theID.to_s + "&retmode=xml").read)

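    # The title and abstract. Array#join is used instead of Array#to_s so the
    # result is plain text on Ruby 1.9+ as well as on 1.8.7.
    self.article_title = @doc.xpath("//ArticleTitle").collect(&:text).join
    self.abstract = @doc.xpath("//AbstractText").collect(&:text).join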

    # Setting up the authors
    first_names = @doc.xpath("//ForeName").collect(&:text)
    last_names = @doc.xpath("//LastName").collect(&:text)
    full_names = []

    i = 0
    last_names.length.times do
      the_name = first_names[i].to_s + " " + last_names[i].to_s
      full_names.push the_name
      i += 1
    end
    self.authors = full_names.join(", ")

    # Affiliations, joined into a single string
    self.affiliations = @doc.xpath("//Affiliation").collect(&:text).join

    # Publication Date - Check if the complete date is at the top. If not use the pub med date.
    theyear = @doc.xpath("//PubDate/Year").collect(&:text)
    theyear = theyear[0]
    themonth = @doc.xpath("//PubDate/Month").collect(&:text)
    themonth = themonth[0]
    theday = @doc.xpath("//PubDate/Day").collect(&:text)

    if theyear.nil? || themonth.nil? || theday.nil? || theyear.length == 0 || themonth.length == 0 || theday.length == 0

      theyear = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Year").collect(&:text)
      theyear = theyear[0]

      themonth = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Month").collect(&:text)
      themonth = themonth[0].to_i
      # The PubMed date supplies a numeric month, so the name lookup below is skipped.
      month_done = 1
      if themonth < 10
        themonth = "0" + themonth.to_s
      end

      theday = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Day").collect(&:text)
    end

    if month_done != 1

      themonth = case themonth || "01"
        when "Jan"  then "01"
        when "Feb"  then "02"
        when "Mar"  then "03"
        when "Apr"  then "04"
        when "May"  then "05"
        when "Jun"  then "06"
        when "Jul"  then "07"
        when "Aug"  then "08"
        when "Sep"  then "09"
        when "Oct"  then "10"
        when "Nov"  then "11"
        when "Dec"  then "12"
        else themonth # not a three-letter name (e.g. already numeric); keep as-is
      end

    end

    if theday.length == 0
      theday = "01"
    elsif theday[0].to_i < 10
      theday = "0" + theday[0].to_s
    else
      theday = theday[0].to_s
    end

    thedate = theyear.to_s + "-" + themonth.to_s + "-" + theday.to_s
    puts "thedate: " + thedate.to_s
    self.pubdate = Date.new(theyear.to_i, themonth.to_i, theday.to_i)

    # Either referencing the proper journal or creating a new one
    thejournal = @doc.xpath("//Journal/Title").collect(&:text)
    thejournal = thejournal[0]

    theshortname = @doc.xpath("//MedlineTA").collect(&:text)
    if theshortname.length == 0
      theshortname = ""
    else
      theshortname = theshortname[0]
    end

    thejournalid = Journal.find(:first, :conditions => ['lower(name) = ?', thejournal.downcase])

    if !thejournalid.nil?
      self.journal_id = thejournalid.id
    else
      @journal = Journal.new(:name => thejournal.to_s, :short_name => theshortname.to_s)
      @journal.save
      # Use the record that was just saved rather than looking up the newest journal.
      self.journal_id = @journal.id
    end

    # Save the volume, issue, and pages
    thevolume = @doc.xpath("//JournalIssue/Volume").collect(&:text)
    thevolume = thevolume[0].to_s if thevolume.length > 0
    self.journal_volume = thevolume if thevolume.length > 0

    theissue = @doc.xpath("//JournalIssue/Issue").collect(&:text)
    theissue = theissue[0].to_s if theissue.length > 0
    self.journal_issue = theissue if theissue.length > 0

    thepag = @doc.xpath("//Pagination/MedlinePgn").collect(&:text)
    thepag = thepag[0].to_s if thepag.length > 0
    self.journal_pages = thepag if thepag.length > 0

  end

end
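The month handling in `pull_pubmed_data` could also be written around Ruby's `Date::ABBR_MONTHNAMES`. The sketch below is an alternative under stated assumptions, not the code from the model: `month_number` and `build_pubdate` are hypothetical helpers, and a missing or unrecognized month or a missing day is assumed to default to 1.

```ruby
require 'date'

# Hypothetical helper: accept "Mar", "03", "3", or nil and return 1..12.
def month_number(value)
  text = value.to_s.strip
  return 1 if text.empty?
  return text.to_i if text =~ /\A\d{1,2}\z/ && (1..12).include?(text.to_i)
  # ABBR_MONTHNAMES is [nil, "Jan", "Feb", ...], so the index is the month number.
  Date::ABBR_MONTHNAMES.index(text[0, 3].capitalize) || 1
end

# Hypothetical helper: build a Date from the raw year, month, and day strings.
def build_pubdate(year, month, day)
  day_number = day.to_s.strip.empty? ? 1 : day.to_i
  Date.new(year.to_i, month_number(month), day_number)
end

puts build_pubdate("2010", "Jul", "9")
puts build_pubdate("2011", "02", nil)
```

Working with integers until the final `Date.new` call avoids the zero-padding steps, since `Date.new` takes numeric arguments anyway.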

The journals_controller.rb file is pretty standard. So is the articles_controller.rb file. Here it is as an example:

articles_controller.rb

class ArticlesController < ApplicationController
  def index
    @articles = Article.all(:order => "pubdate DESC")
  end

  def show
    @article = Article.find(params[:id])
  end

  def new
    @article = Article.new
    @journals = Journal.all(:order => :name)
  end

  def create
    @article = Article.new(params[:article])
    if @article.save
      flash[:notice] = "Successfully created article."
      redirect_to @article
    else
      render :action => 'new'
    end
  end

  def edit
    @article = Article.find(params[:id])
    @journals = Journal.all(:order => :name)
  end

  def update
    @article = Article.find(params[:id])
    if @article.update_attributes(params[:article])
      flash[:notice] = "Successfully updated article."
      redirect_to @article
    else
      render :action => 'edit'
    end
  end

  def destroy
    @article = Article.find(params[:id])
    @article.destroy
    flash[:notice] = "Successfully destroyed article."
    redirect_to articles_url
  end
end

Here are the view files for articles:

app/views/articles/index.html.erb

<div class="toolbar">
<ul></ul>
</div>
<table>
<tbody>
<tr>
<th></th>
<th>Article Title</th>
<th></th>
</tr>
<tr class="<%= cycle('rowA', 'rowB') %>">
<td></td>
<td><strong> </strong> <em> </em></td>
<td class="deletelink"></td>
</tr>
</tbody>
</table>

app/views/articles/show.html.erb

<div class="toolbar">
<ul></ul>
</div>
<strong>PubMed Link:</strong>

<strong>View the Article:</strong>

<strong>Article Title:</strong>

<strong>Abstract:</strong>

<strong>Authors:</strong>

<strong>Affiliations:</strong>

<strong>Pubdate:</strong>

<strong>Pubmedid:</strong>

<strong>Journal:</strong>

<strong>Journal Volume:</strong>

<strong>Journal Issue:</strong>

<strong>Journal Pages:</strong>

app/views/articles/_form.html.erb



<h3>Enter a link to an abstract on PubMed and click Submit or fill out the information below</h3>

There currently is no file attached.

<hr />

The Paperclip gem requires :html => { :multipart => true } in the form_for call at the top of the _form file, so that the form can upload files. It’s a very simple gem to use and very powerful. There are Google Groups for both Paperclip and Nokogiri, should you need additional help setting them up.

Leave me a comment if this helps you with either Nokogiri or with extracting data from PubMed!

Written by Clay

July 9, 2010 at 18:18

Posted in Code, Ruby, Technology

12 Responses

  1. Hi,

    Nice job.
    I was wondering what migration you ran beforehand to create the articles db?

    Pierre

    February 14, 2011 at 06:36

  2. Hi Pierre. This is what I have as the schema for the articles:

      create_table "articles", :force => true do |t|
        t.string   "pubmedlink"
        t.string   "article_title"
        t.text     "abstract"
        t.string   "authors"
        t.text     "affiliations"
        t.date     "pubdate"
        t.integer  "pubmedid"
        t.integer  "journal_id"
        t.string   "journal_volume"
        t.string   "journal_issue"
        t.string   "journal_pages"
        t.datetime "created_at"
        t.datetime "updated_at"
        t.boolean  "fetched"
        t.string   "thearticle_file_name"
        t.string   "thearticle_content_type"
        t.integer  "thearticle_file_size"
        t.datetime "thearticle_updated_at"
      end
    

    Clay

    February 14, 2011 at 11:22

    • Hi Clay,

      Thanks a lot. It seems there is also a db called journals?
      When I try the app I get a first error:
      undefined method `title'
      If I then remove it from the views, I can see
      “Add a New Article”; when I click on it, it tries to redirect me to a new.html that doesn’t exist.

      What should be in new.html.erb? I only see show.html.erb and _form.html.erb.

      Pierre

      February 14, 2011 at 13:50

      • Yes, the “journal” represents the magazine that the article appeared in. You probably receive that error because the code tries to create a new journal if there isn’t one that matches the journal of the article being added. Here is journal.rb, the model file:

        class Journal < ActiveRecord::Base
          attr_accessible :name, :description, :url, :short_name, :country_id
          
          has_many :articles, :order => 'pubdate DESC'
          belongs_to :country
          
          validates_uniqueness_of :name
          validates_presence_of :short_name
          validates_presence_of :name
          
          before_save :make_short_name
          
          def best_name
            if !self.short_name.nil? && self.short_name.length > 0
              return self.short_name
            else
              return self.name
            end
          end
          
          private
          
          # Makes sure that a new journal has a short name in case one isn't entered.
          def make_short_name
            if self.short_name.nil? || self.short_name.length == 0
              self.short_name = self.name
            end
          end
        end
        
        

        The schema for the journals table in the database is:

          create_table "journals", :force => true do |t|
            t.string   "name"
            t.text     "description"
            t.string   "url"
            t.datetime "created_at"
            t.datetime "updated_at"
            t.string   "short_name"
            t.integer  "country_id"
          end
        

        This obviously presents the ability to link a journal to a particular country. You can remove that in order to simplify the schema.

        Unfortunately, I did this a while ago and I cannot find an earlier version of the code that includes only the parts necessary to query PubMed (the rest of it is client proprietary).

        For articles, new.html.erb contains simply:

        <% title "New Article" %>
        <div class="toolbar">
        	<ul>
        		<li><%= link_to "Cancel and Return to List of Articles", articles_path, :class => "deletelink" %></li>
        	</ul>
        </div>
        <%= render :partial => 'form' %>
        

        I don’t have time to review all of the code right now, but what probably is happening is that the article, upon creation, is trying to create a new journal since there isn’t one that matches.

        Let me know if you can get it working, and if you cannot, then I’ll dig into it a bit more. :)

        Clay

        February 14, 2011 at 13:58

  3. Hi, thanks again for your replies, that helps a lot!
    Finally, after creating the right DB for articles and journals and creating new.html.erb, I got a new error, of course…

    When I paste a PubMed link into ‘PubMed Link’ on the new.html.erb page I get:

    --> uninitialized constant Article::Nokogiri
    --> app/models/article.rb:59:in `pull_pubmed_data'
    app/models/article.rb:53:in `make_pubmedid'
    app/controllers/articles_controller.rb:17:in `create'
    --> {"article"=>{"pubdate(1i)"=>"2011",
    "pubmedlink"=>"http://www.ncbi.nlm.nih.gov/pubmed/21319380",
    "pubdate(2i)"=>"2",
    "affiliations"=>"",
    "fetched"=>"0",
    "pubdate(3i)"=>"15",
    "abstract"=>"",
    "journal_volume"=>"",
    "journal_id"=>"",
    "journal_issue"=>"",
    "authors"=>"",
    "article_title"=>"",
    "journal_pages"=>"",
    "pubmedid"=>""},
    "commit"=>"Pull Data from PubMed",
    "authenticity_token"=>"ddbv9xyp+FCtFSJYEmpJdMiE7hyqKspv2yMhUQJIkXk=",
    "utf8"=>"✓"}

    Any idea what i could have forgotten?

    cafiah

    February 15, 2011 at 10:59

    • I just put working example code (with slight modifications) up on github. You can check it out here: https://github.com/ccheaton/pubnoko

      Let me know if this helps. It requires Rails 3.0.4 and Nokogiri 1.4.4

      Clay

      February 15, 2011 at 14:32

  4. It looks like maybe you don’t have Nokogiri installed or that it’s not initialized properly. Do you have it listed in your environment.rb file and/or have you successfully used it in other projects (and know that you have it installed properly)?

    The other possibility is that there is an incompatibility with the character encoding. I noticed that the article is in Chinese; if the PubMed abstract contains characters that aren’t encoded with UTF-8 then it might not import properly.

    Clay

    February 15, 2011 at 11:41

  5. I’m going to try to get a working version of it up on github in the next day or two. Sorry for all of the trouble! I did this prior to a computer upgrade and cannot find the original basic working code.

    Clay

    February 15, 2011 at 12:08

  6. Hi!

    Thanks a lot! The version you put on github works well!
    My error was because I hadn’t put Nokogiri in my environment.rb, like you said ;)

    I am trying to add a search function to it with eutils/esearch.fcgi, so that it can find IDs that way.
    Have you worked on this?
    Thx again, nice work.

    Pierre

    February 16, 2011 at 02:58

    • I’m glad you were able to get it working. I didn’t need the advanced eutils behavior when I constructed this but I was thinking about playing with it soon. The API seems pretty straightforward but the documentation is bad. Let me know if you get it working! :)

      Clay

      February 16, 2011 at 07:24

  7. Hi all, absolute noob here. Came here in search for answer to “can I use Nokogiri in a model”, the thing that stumped me was @variable.save didn’t look like it belonged in the controller!

    Anyway – new question! – to get the data from the web into your database – you had to run a cron, or something right? What I’m basically asking is – there is a trigger of some description that makes the software actually fetch, right?

    Geoff Wright

    April 26, 2011 at 12:19

    • It’s very possible that I have some model logic in the controller — I’m a bit of a noob myself. You’ll find, however, that it’s not always easy to make clear distinctions between controllers, models, and views. Sure, that’s what all of the books and video tutorials say, and it’s a good goal to have, but things don’t always work out that way.

      I’m not sure that I understand your second question… To get the data from the web into my database, I simply enter the article info into the appropriate view when it renders. The code on this post is screwed up (WordPress.com’s editor did something to it), so you probably should browse through the github repository. Try running the code on github and you’ll see how it works: https://github.com/ccheaton/pubnoko

      Clay

      April 26, 2011 at 20:26

