Life as Clay

Archive for July 2010

DC as seen through population pyramids

with one comment

All of the recent work on parsing US Census data was part of a larger project — one that includes the dynamic generation of population pyramids for the entire population and for selected racial groups within each county in the United States. Residents of DC (like myself) frequently hear about how it is a terrible place for young women to meet boyfriends and date successfully. All data here are from the 2009 US Census Bureau population estimates.

Here’s what the population pyramids show:

For the final one, keep in mind that “Hispanic” is considered an ethnicity by the US Census Bureau and that most Hispanics also select a race on census forms and most people who select a race also indicate whether they also consider themselves Hispanic.

What really is striking here is the difference between the shapes of the white and black population pyramids. Perhaps a lot of young white people move to DC for congressional jobs and then move away when the job is finished. DC traditionally has a larger permanent black population and that is reflected in the more even distribution of the pyramid. However, DC also is known for having one of the least healthy black populations in the country, a fact reflected in the low numbers of elderly people. For comparison, look at this view of whites in Palm Beach, Florida:

Back to the original question — yes, you can see in the population pyramid that there are more females than males in Washington, DC, except in the population that identify as being ethnically Hispanic. Where do you go to find the opposite problem? One place is Honolulu, Hawaii:

Only Native Hawaiians (a group including other Pacific Islanders) show a normal distribution:

(All of these population pyramids were generated using CSS in a custom script written in Ruby on Rails.)

Written by Clay

July 26, 2010 at 12:40

Parsing US Census data with Ruby

leave a comment »

I am a fan of population pyramids as a visualization of age demographics. They are simple and effective. Sometimes getting the data to generate them isn’t….

I’m building a simple Rails app for a client who needs county-level population data. It doesn’t have to be precise. I went with the 2009 estimates from the Census bureau, located here:
http://www.census.gov/popest/counties/asrh/CC-EST2009-alldata.html

There were a few issues:
1. Each state’s county-level data is in a different file. You can order a CD that has all of the data in a single file, but I didn’t have time to do that.
2. There’s a ton of data that I didn’t need in each file. All I needed were the county totals and the totals for male/female for each age group.
3. The data is in rows — a different row for each age group. I needed the data in columns, with one row for each county.

What I did:
1. Downloaded all 51 of the files (DC has its own file) and put them in a folder on my desktop.
2. Changed the extension on them to .csv (because it changed to .txt when I downloaded, for some reason).
3. Wrote the Ruby script below, which extracts the data that I need and puts it into a single .csv file.
4. Saved the script to the same folder as the Census files as script.rb.
5. Opened terminal and ran ruby script.rb
4. Reconciled the data with my Excel sheet that I was using to collect county-level data so that I could seed my Rails app’s counties table.

I comment my code pretty heavily when I write it so that I can look back and remember what I did. I’ve been burned in the past by not commenting thoroughly.

Here’s the script:

# Script to pull population data out of census files for each county
# Original data files are here: http://www.census.gov/popest/counties/asrh/CC-EST2009-alldata.html

require 'CSV'

# These are the FIPS codes for each state which are used in file names.
state_fips_codes = ["01", "02", "04", "05", "06", "08", "09", "10", "11", "12", "13", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "44", "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"]

add_header = true

# This is the file that all of the data will go into...
File.open("county_age_data.csv", "w") do |output_file|
  
  # counter for iterating through the files
  n = 0
  
  state_fips_codes.length.times do  # This loop goes through all of the files
    
    # Read in the relevant file
    state_data = CSV.parse(File.read("cc-est2009-alldata-" + state_fips_codes[n] + ".csv"))
    puts "Processing data for " + state_data[1][3]

    # Add the header to the file
    if add_header == true
      output_file.puts "state_fips,county_fips,state_name,county_name,:census_total_pop,:census_total_male,:census_grp01_male,:census_grp02_male,:census_grp03_male,:census_grp04_male,:census_grp05_male,:census_grp06_male,:census_grp07_male,:census_grp08_male,:census_grp09_male,:census_grp10_male,:census_grp11_male,:census_grp12_male,:census_grp13_male,:census_grp14_male,:census_grp15_male,:census_grp16_male,:census_grp17_male,:census_grp18_male,:census_total_female,:census_grp01_female,:census_grp02_female,:census_grp03_female,:census_grp04_female,:census_grp05_female,:census_grp06_female,:census_grp07_female,:census_grp08_female,:census_grp09_female,:census_grp10_female,:census_grp11_female,:census_grp12_female,:census_grp13_female,:census_grp14_female,:census_grp15_female,:census_grp16_female,:census_grp17_female,:census_grp18_female"
      add_header = false
    end

    # Iterate through the loaded file.
    row = 1
    while row < state_data.length
      if state_data[row][5].to_i == 12 # Only take data from year coded 12 (2009 estimates)

        ########## MAIN LOGIC #############
        if state_data[row][6].to_i == 0 # It's a new county because this is the group 0 row with totals

          # Reset the data arrays
          row_data    = []
          male_data   = []
          female_data = []

          row_data.push state_data[row][1].to_s     # state_fips
          row_data.push state_data[row][2].to_s     # county_fips
          row_data.push state_data[row][3].to_s     # state_name
          row_data.push state_data[row][4].to_s     # county_name
          row_data.push state_data[row][7].to_s     # census_total_pop

          male_data.push state_data[row][8].to_s    # census_total_male
          female_data.push state_data[row][9].to_s  # census_total_female

        elsif state_data[row][6].to_i == 18 # It's the last row -- push the data into the array

          # push in the final data points
          male_data.push state_data[row][8].to_s    # census_grp18_male
          female_data.push state_data[row][9].to_s  # census_grp18_female

          # append all of the data to the row_data array
          row_data += male_data
          row_data += female_data

          # write the data to the file
          output_file.puts row_data.join(",")

        else # It's not the first or last row

          male_data.push state_data[row][8].to_s    # census_grpXX_male
          female_data.push state_data[row][9].to_s  # census_grpXX_female

        end # End of the if statement for pushing data into arrays
        ################ END OF MAIN LOGIC ####################
        
      end       # End of the if statement to check if it is year 10
      
      row += 1  # Look at the next row
    end

    n += 1      # Increment the counter
  end           # End of loop for going through each file
end

Written by Clay

July 15, 2010 at 21:17

Posted in Code, Ruby

Tagged with , , ,

Resource: Rails country seeds with ISO codes

leave a comment »

In case you need to seed your Rails app with countries. You can use the ISO codes to display flag icons, like these linked ones from famfamfam.com: link.

 countries = Country.create([
                               {:name => 'Åland Islands', :iso_code => 'AX'},
                               {:name => 'Albania', :iso_code => 'AL'},
                               {:name => 'Algeria', :iso_code => 'DZ'},
                               {:name => 'American Samoa', :iso_code => 'AS'},
                               {:name => 'Andorra', :iso_code => 'AD'},
                               {:name => 'Angola', :iso_code => 'AO'},
                               {:name => 'Anguilla', :iso_code => 'AI'},
                               {:name => 'Antarctica', :iso_code => 'AQ'},
                               {:name => 'Antigua and Barbuda', :iso_code => 'AG'},
                               {:name => 'Argentina', :iso_code => 'AR'},
                               {:name => 'Armenia', :iso_code => 'AM'},
                               {:name => 'Aruba', :iso_code => 'AW'},
                               {:name => 'Australia', :iso_code => 'AU'},
                               {:name => 'Austria', :iso_code => 'AT'},
                               {:name => 'Azerbaijan', :iso_code => 'AZ'},
                               {:name => 'Bahamas', :iso_code => 'BS'},
                               {:name => 'Bahrain', :iso_code => 'BH'},
                               {:name => 'Bangladesh', :iso_code => 'BD'},
                               {:name => 'Barbados', :iso_code => 'BB'},
                               {:name => 'Belarus', :iso_code => 'BY'},
                               {:name => 'Belgium', :iso_code => 'BE'},
                               {:name => 'Belize', :iso_code => 'BZ'},
                               {:name => 'Benin', :iso_code => 'BJ'},
                               {:name => 'Bermuda', :iso_code => 'BM'},
                               {:name => 'Bhutan', :iso_code => 'BT'},
                               {:name => 'Bolivia', :iso_code => 'BO'},
                               {:name => 'Bosnia and Herzegovina', :iso_code => 'BA'},
                               {:name => 'Botswana', :iso_code => 'BW'},
                               {:name => 'Bouvet Island', :iso_code => 'BV'},
                               {:name => 'Brazil', :iso_code => 'BR'},
                               {:name => 'British Indian Ocean Territory', :iso_code => 'IO'},
                               {:name => 'Brunei Darussalam', :iso_code => 'BN'},
                               {:name => 'Bulgaria', :iso_code => 'BG'},
                               {:name => 'Burkina Faso', :iso_code => 'BF'},
                               {:name => 'Burundi', :iso_code => 'BI'},
                               {:name => 'Cambodia', :iso_code => 'KH'},
                               {:name => 'Cameroon', :iso_code => 'CM'},
                               {:name => 'Canada', :iso_code => 'CA'},
                               {:name => 'Cape Verde', :iso_code => 'CV'},
                               {:name => 'Cayman Islands', :iso_code => 'KY'},
                               {:name => 'Central African Republic', :iso_code => 'CF'},
                               {:name => 'Chad', :iso_code => 'TD'},
                               {:name => 'Chile', :iso_code => 'CL'},
                               {:name => 'China', :iso_code => 'CN'},
                               {:name => 'Christmas Island', :iso_code => 'CX'},
                               {:name => 'Cocos (Keeling) Islands', :iso_code => 'CC'},
                               {:name => 'Colombia', :iso_code => 'CO'},
                               {:name => 'Comoros', :iso_code => 'KM'},
                               {:name => 'Congo', :iso_code => 'CG'},
                               {:name => 'Congo, The Democratic Republic of The', :iso_code => 'CD'},
                               {:name => 'Cook Islands', :iso_code => 'CK'},
                               {:name => 'Costa Rica', :iso_code => 'CR'},
                               {:name => 'Côte D\'Ivoire', :iso_code => 'CI'},
                               {:name => 'Croatia', :iso_code => 'HR'},
                               {:name => 'Cuba', :iso_code => 'CU'},
                               {:name => 'Cyprus', :iso_code => 'CY'},
                               {:name => 'Czech Republic', :iso_code => 'CZ'},
                               {:name => 'Denmark', :iso_code => 'DK'},
                               {:name => 'Djibouti', :iso_code => 'DJ'},
                               {:name => 'Dominica', :iso_code => 'DM'},
                               {:name => 'Dominican Republic', :iso_code => 'DO'},
                               {:name => 'Ecuador', :iso_code => 'EC'},
                               {:name => 'Egypt', :iso_code => 'EG'},
                               {:name => 'El Salvador', :iso_code => 'SV'},
                               {:name => 'Equatorial Guinea', :iso_code => 'GQ'},
                               {:name => 'Eritrea', :iso_code => 'ER'},
                               {:name => 'Estonia', :iso_code => 'EE'},
                               {:name => 'Ethiopia', :iso_code => 'ET'},
                               {:name => 'Falkland Islands (Malvinas)', :iso_code => 'FK'},
                               {:name => 'Faroe Islands', :iso_code => 'FO'},
                               {:name => 'Fiji', :iso_code => 'FJ'},
                               {:name => 'Finland', :iso_code => 'FI'},
                               {:name => 'France', :iso_code => 'FR'},
                               {:name => 'French Guiana', :iso_code => 'GF'},
                               {:name => 'French Polynesia', :iso_code => 'PF'},
                               {:name => 'French Southern Territories', :iso_code => 'TF'},
                               {:name => 'Gabon', :iso_code => 'GA'},
                               {:name => 'Gambia', :iso_code => 'GM'},
                               {:name => 'Georgia', :iso_code => 'GE'},
                               {:name => 'Germany', :iso_code => 'DE'},
                               {:name => 'Ghana', :iso_code => 'GH'},
                               {:name => 'Gibraltar', :iso_code => 'GI'},
                               {:name => 'Greece', :iso_code => 'GR'},
                               {:name => 'Greenland', :iso_code => 'GL'},
                               {:name => 'Grenada', :iso_code => 'GD'},
                               {:name => 'Guadeloupe', :iso_code => 'GP'},
                               {:name => 'Guam', :iso_code => 'GU'},
                               {:name => 'Guatemala', :iso_code => 'GT'},
                               {:name => 'Guernsey', :iso_code => 'GG'},
                               {:name => 'Guinea', :iso_code => 'GN'},
                               {:name => 'Guinea-Bissau', :iso_code => 'GW'},
                               {:name => 'Guyana', :iso_code => 'GY'},
                               {:name => 'Haiti', :iso_code => 'HT'},
                               {:name => 'Heard Island and Mcdonald Islands', :iso_code => 'HM'},
                               {:name => 'Holy See (Vatican City State)', :iso_code => 'VA'},
                               {:name => 'Honduras', :iso_code => 'HN'},
                               {:name => 'Hong Kong', :iso_code => 'HK'},
                               {:name => 'Hungary', :iso_code => 'HU'},
                               {:name => 'Iceland', :iso_code => 'IS'},
                               {:name => 'India', :iso_code => 'IN'},
                               {:name => 'Indonesia', :iso_code => 'ID'},
                               {:name => 'Iran, Islamic Republic of', :iso_code => 'IR'},
                               {:name => 'Iraq', :iso_code => 'IQ'},
                               {:name => 'Ireland', :iso_code => 'IE'},
                               {:name => 'Isle of Man', :iso_code => 'IM'},
                               {:name => 'Israel', :iso_code => 'IL'},
                               {:name => 'Italy', :iso_code => 'IT'},
                               {:name => 'Jamaica', :iso_code => 'JM'},
                               {:name => 'Japan', :iso_code => 'JP'},
                               {:name => 'Jersey', :iso_code => 'JE'},
                               {:name => 'Jordan', :iso_code => 'JO'},
                               {:name => 'Kazakhstan', :iso_code => 'KZ'},
                               {:name => 'Kenya', :iso_code => 'KE'},
                               {:name => 'Kiribati', :iso_code => 'KI'},
                               {:name => 'Korea, Democratic People\'s Republic of', :iso_code => 'KP'},
                               {:name => 'Korea, Republic of', :iso_code => 'KR'},
                               {:name => 'Kuwait', :iso_code => 'KW'},
                               {:name => 'Kyrgyzstan', :iso_code => 'KG'},
                               {:name => 'Lao People\'s Democratic Republic', :iso_code => 'LA'},
                               {:name => 'Latvia', :iso_code => 'LV'},
                               {:name => 'Lebanon', :iso_code => 'LB'},
                               {:name => 'Lesotho', :iso_code => 'LS'},
                               {:name => 'Liberia', :iso_code => 'LR'},
                               {:name => 'Libyan Arab Jamahiriya', :iso_code => 'LY'},
                               {:name => 'Liechtenstein', :iso_code => 'LI'},
                               {:name => 'Lithuania', :iso_code => 'LT'},
                               {:name => 'Luxembourg', :iso_code => 'LU'},
                               {:name => 'Macao', :iso_code => 'MO'},
                               {:name => 'Macedonia, The Former Yugoslav Republic of', :iso_code => 'MK'},
                               {:name => 'Madagascar', :iso_code => 'MG'},
                               {:name => 'Malawi', :iso_code => 'MW'},
                               {:name => 'Malaysia', :iso_code => 'MY'},
                               {:name => 'Maldives', :iso_code => 'MV'},
                               {:name => 'Mali', :iso_code => 'ML'},
                               {:name => 'Malta', :iso_code => 'MT'},
                               {:name => 'Marshall Islands', :iso_code => 'MH'},
                               {:name => 'Martinique', :iso_code => 'MQ'},
                               {:name => 'Mauritania', :iso_code => 'MR'},
                               {:name => 'Mauritius', :iso_code => 'MU'},
                               {:name => 'Mayotte', :iso_code => 'YT'},
                               {:name => 'Mexico', :iso_code => 'MX'},
                               {:name => 'Micronesia, Federated States of', :iso_code => 'FM'},
                               {:name => 'Moldova, Republic of', :iso_code => 'MD'},
                               {:name => 'Monaco', :iso_code => 'MC'},
                               {:name => 'Mongolia', :iso_code => 'MN'},
                               {:name => 'Montenegro', :iso_code => 'ME'},
                               {:name => 'Montserrat', :iso_code => 'MS'},
                               {:name => 'Morocco', :iso_code => 'MA'},
                               {:name => 'Mozambique', :iso_code => 'MZ'},
                               {:name => 'Myanmar', :iso_code => 'MM'},
                               {:name => 'Namibia', :iso_code => 'NA'},
                               {:name => 'Nauru', :iso_code => 'NR'},
                               {:name => 'Nepal', :iso_code => 'NP'},
                               {:name => 'Netherlands', :iso_code => 'NL'},
                               {:name => 'Netherlands Antilles', :iso_code => 'AN'},
                               {:name => 'New Caledonia', :iso_code => 'NC'},
                               {:name => 'New Zealand', :iso_code => 'NZ'},
                               {:name => 'Nicaragua', :iso_code => 'NI'},
                               {:name => 'Niger', :iso_code => 'NE'},
                               {:name => 'Nigeria', :iso_code => 'NG'},
                               {:name => 'Niue', :iso_code => 'NU'},
                               {:name => 'Norfolk Island', :iso_code => 'NF'},
                               {:name => 'Northern Mariana Islands', :iso_code => 'MP'},
                               {:name => 'Norway', :iso_code => 'NO'},
                               {:name => 'Oman', :iso_code => 'OM'},
                               {:name => 'Pakistan', :iso_code => 'PK'},
                               {:name => 'Palau', :iso_code => 'PW'},
                               {:name => 'Palestinian Territory, Occupied', :iso_code => 'PS'},
                               {:name => 'Panama', :iso_code => 'PA'},
                               {:name => 'Papua New Guinea', :iso_code => 'PG'},
                               {:name => 'Paraguay', :iso_code => 'PY'},
                               {:name => 'Peru', :iso_code => 'PE'},
                               {:name => 'Philippines', :iso_code => 'PH'},
                               {:name => 'Pitcairn', :iso_code => 'PN'},
                               {:name => 'Poland', :iso_code => 'PL'},
                               {:name => 'Portugal', :iso_code => 'PT'},
                               {:name => 'Puerto Rico', :iso_code => 'PR'},
                               {:name => 'Qatar', :iso_code => 'QA'},
                               {:name => 'Réunion', :iso_code => 'RE'},
                               {:name => 'Romania', :iso_code => 'RO'},
                               {:name => 'Russian Federation', :iso_code => 'RU'},
                               {:name => 'Rwanda', :iso_code => 'RW'},
                               {:name => 'Saint Barthélemy', :iso_code => 'BL'},
                               {:name => 'Saint Helena, Ascension and Tristan Da Cunha', :iso_code => 'SH'},
                               {:name => 'Saint Kitts and Nevis', :iso_code => 'KN'},
                               {:name => 'Saint Lucia', :iso_code => 'LC'},
                               {:name => 'Saint Martin', :iso_code => 'MF'},
                               {:name => 'Saint Pierre and Miquelon', :iso_code => 'PM'},
                               {:name => 'Saint Vincent and The Grenadines', :iso_code => 'VC'},
                               {:name => 'Samoa', :iso_code => 'WS'},
                               {:name => 'San Marino', :iso_code => 'SM'},
                               {:name => 'Sao Tome and Principe', :iso_code => 'ST'},
                               {:name => 'Saudi Arabia', :iso_code => 'SA'},
                               {:name => 'Senegal', :iso_code => 'SN'},
                               {:name => 'Serbia', :iso_code => 'RS'},
                               {:name => 'Seychelles', :iso_code => 'SC'},
                               {:name => 'Sierra Leone', :iso_code => 'SL'},
                               {:name => 'Singapore', :iso_code => 'SG'},
                               {:name => 'Slovakia', :iso_code => 'SK'},
                               {:name => 'Slovenia', :iso_code => 'SI'},
                               {:name => 'Solomon Islands', :iso_code => 'SB'},
                               {:name => 'Somalia', :iso_code => 'SO'},
                               {:name => 'South Africa', :iso_code => 'ZA'},
                               {:name => 'South Georgia and The South Sandwich Islands', :iso_code => 'GS'},
                               {:name => 'Spain', :iso_code => 'ES'},
                               {:name => 'Sri Lanka', :iso_code => 'LK'},
                               {:name => 'Sudan', :iso_code => 'SD'},
                               {:name => 'Suriname', :iso_code => 'SR'},
                               {:name => 'Svalbard and Jan Mayen', :iso_code => 'SJ'},
                               {:name => 'Swaziland', :iso_code => 'SZ'},
                               {:name => 'Sweden', :iso_code => 'SE'},
                               {:name => 'Switzerland', :iso_code => 'CH'},
                               {:name => 'Syrian Arab Republic', :iso_code => 'SY'},
                               {:name => 'Taiwan, Province of China', :iso_code => 'TW'},
                               {:name => 'Tajikistan', :iso_code => 'TJ'},
                               {:name => 'Tanzania, United Republic of', :iso_code => 'TZ'},
                               {:name => 'Thailand', :iso_code => 'TH'},
                               {:name => 'Timor-Leste', :iso_code => 'TL'},
                               {:name => 'Togo', :iso_code => 'TG'},
                               {:name => 'Tokelau', :iso_code => 'TK'},
                               {:name => 'Tonga', :iso_code => 'TO'},
                               {:name => 'Trinidad and Tobago', :iso_code => 'TT'},
                               {:name => 'Tunisia', :iso_code => 'TN'},
                               {:name => 'Turkey', :iso_code => 'TR'},
                               {:name => 'Turkmenistan', :iso_code => 'TM'},
                               {:name => 'Turks and Caicos Islands', :iso_code => 'TC'},
                               {:name => 'Tuvalu', :iso_code => 'TV'},
                               {:name => 'Uganda', :iso_code => 'UG'},
                               {:name => 'Ukraine', :iso_code => 'UA'},
                               {:name => 'United Arab Emirates', :iso_code => 'AE'},
                               {:name => 'United Kingdom', :iso_code => 'GB'},
                               {:name => 'United States', :iso_code => 'US'},
                               {:name => 'United States Minor Outlying Islands', :iso_code => 'UM'},
                               {:name => 'Uruguay', :iso_code => 'UY'},
                               {:name => 'Uzbekistan', :iso_code => 'UZ'},
                               {:name => 'Vanuatu', :iso_code => 'VU'},
                               {:name => 'Venezuela, Bolivarian Republic of', :iso_code => 'VE'},
                               {:name => 'Viet Nam', :iso_code => 'VN'},
                               {:name => 'Virgin Islands, British', :iso_code => 'VG'},
                               {:name => 'Virgin Islands, U.S.', :iso_code => 'VI'},
                               {:name => 'Wallis and Futuna', :iso_code => 'WF'},
                               {:name => 'Western Sahara', :iso_code => 'EH'},
                               {:name => 'Yemen', :iso_code => 'YE'},
                               {:name => 'Zambia', :iso_code => 'ZM'},
                               {:name => 'Zimbabwe', :iso_code => 'ZW'}
 ])

Written by Clay

July 13, 2010 at 17:54

Posted in Code, Resource, Ruby

Tagged with , , ,

Resource: Rails database seeds for US States, including ANSI codes

with one comment

In case you need to add US States to your Rails app. The data are from the US Census website. I’m not sure what the purpose of the “Geographic Names Information System Identifier (GNISID)” is (the number after the State name), but it’s included here in case you need to tie back to the census system.

states = State.create([
                        { :name => 'Alabama', :abbr => 'AL', :ansi_code => '1', :statens => '1779775'},
                        { :name => 'Alaska', :abbr => 'AK', :ansi_code => '2', :statens => '1785533'},
                        { :name => 'Arizona', :abbr => 'AZ', :ansi_code => '4', :statens => '1779777'},
                        { :name => 'Arkansas', :abbr => 'AR', :ansi_code => '5', :statens => '68085'},
                        { :name => 'California', :abbr => 'CA', :ansi_code => '6', :statens => '1779778'},
                        { :name => 'Colorado', :abbr => 'CO', :ansi_code => '8', :statens => '1779779'},
                        { :name => 'Connecticut', :abbr => 'CT', :ansi_code => '9', :statens => '1779780'},
                        { :name => 'Delaware', :abbr => 'DE', :ansi_code => '10', :statens => '1779781'},
                        { :name => 'District of Columbia', :abbr => 'DC', :ansi_code => '11', :statens => '1702382'},
                        { :name => 'Florida', :abbr => 'FL', :ansi_code => '12', :statens => '294478'},
                        { :name => 'Georgia', :abbr => 'GA', :ansi_code => '13', :statens => '1705317'},
                        { :name => 'Hawaii', :abbr => 'HI', :ansi_code => '15', :statens => '1779782'},
                        { :name => 'Idaho', :abbr => 'ID', :ansi_code => '16', :statens => '1779783'},
                        { :name => 'Illinois', :abbr => 'IL', :ansi_code => '17', :statens => '1779784'},
                        { :name => 'Indiana', :abbr => 'IN', :ansi_code => '18', :statens => '448508'},
                        { :name => 'Iowa', :abbr => 'IA', :ansi_code => '19', :statens => '1779785'},
                        { :name => 'Kansas', :abbr => 'KS', :ansi_code => '20', :statens => '481813'},
                        { :name => 'Kentucky', :abbr => 'KY', :ansi_code => '21', :statens => '1779786'},
                        { :name => 'Louisiana', :abbr => 'LA', :ansi_code => '22', :statens => '1629543'},
                        { :name => 'Maine', :abbr => 'ME', :ansi_code => '23', :statens => '1779787'},
                        { :name => 'Maryland', :abbr => 'MD', :ansi_code => '24', :statens => '1714934'},
                        { :name => 'Massachusetts', :abbr => 'MA', :ansi_code => '25', :statens => '606926'},
                        { :name => 'Michigan', :abbr => 'MI', :ansi_code => '26', :statens => '1779789'},
                        { :name => 'Minnesota', :abbr => 'MN', :ansi_code => '27', :statens => '662849'},
                        { :name => 'Mississippi', :abbr => 'MS', :ansi_code => '28', :statens => '1779790'},
                        { :name => 'Missouri', :abbr => 'MO', :ansi_code => '29', :statens => '1779791'},
                        { :name => 'Montana', :abbr => 'MT', :ansi_code => '30', :statens => '767982'},
                        { :name => 'Nebraska', :abbr => 'NE', :ansi_code => '31', :statens => '1779792'},
                        { :name => 'Nevada', :abbr => 'NV', :ansi_code => '32', :statens => '1779793'},
                        { :name => 'New Hampshire', :abbr => 'NH', :ansi_code => '33', :statens => '1779794'},
                        { :name => 'New Jersey', :abbr => 'NJ', :ansi_code => '34', :statens => '1779795'},
                        { :name => 'New Mexico', :abbr => 'NM', :ansi_code => '35', :statens => '897535'},
                        { :name => 'New York', :abbr => 'NY', :ansi_code => '36', :statens => '1779796'},
                        { :name => 'North Carolina', :abbr => 'NC', :ansi_code => '37', :statens => '1027616'},
                        { :name => 'North Dakota', :abbr => 'ND', :ansi_code => '38', :statens => '1779797'},
                        { :name => 'Ohio', :abbr => 'OH', :ansi_code => '39', :statens => '1085497'},
                        { :name => 'Oklahoma', :abbr => 'OK', :ansi_code => '40', :statens => '1102857'},
                        { :name => 'Oregon', :abbr => 'OR', :ansi_code => '41', :statens => '1155107'},
                        { :name => 'Pennsylvania', :abbr => 'PA', :ansi_code => '42', :statens => '1779798'},
                        { :name => 'Rhode Island', :abbr => 'RI', :ansi_code => '44', :statens => '1219835'},
                        { :name => 'South Carolina', :abbr => 'SC', :ansi_code => '45', :statens => '1779799'},
                        { :name => 'South Dakota', :abbr => 'SD', :ansi_code => '46', :statens => '1785534'},
                        { :name => 'Tennessee', :abbr => 'TN', :ansi_code => '47', :statens => '1325873'},
                        { :name => 'Texas', :abbr => 'TX', :ansi_code => '48', :statens => '1779801'},
                        { :name => 'Utah', :abbr => 'UT', :ansi_code => '49', :statens => '1455989'},
                        { :name => 'Vermont', :abbr => 'VT', :ansi_code => '50', :statens => '1779802'},
                        { :name => 'Virginia', :abbr => 'VA', :ansi_code => '51', :statens => '1779803'},
                        { :name => 'Washington', :abbr => 'WA', :ansi_code => '53', :statens => '1779804'},
                        { :name => 'West Virginia', :abbr => 'WV', :ansi_code => '54', :statens => '1779805'},
                        { :name => 'Wisconsin', :abbr => 'WI', :ansi_code => '55', :statens => '1779806'},
                        { :name => 'Wyoming', :abbr => 'WY', :ansi_code => '56', :statens => '1779807'}
])

Written by Clay

July 13, 2010 at 17:49

Posted in Code, Resource, Ruby

Tagged with , , ,

Using Nokogiri and Ruby on Rails to grab citations from PubMed

with 12 comments

Update: I put a working example up on github (Rails 3.0.4 and Nokogiri 1.4.4): https://github.com/ccheaton/pubnoko

##############################

Following on the last post… I needed to provide to a client a custom database of PubMed citations, links, datasets, and other information related to a product area I was researching. The hardest part about this was understanding PubMed’s eutils and their .xml formats.

I did not want to manually enter the citation information. I wanted to avoid Zotero and other similar services because I wanted to integrate the data across domains and use the database that I was developing for them as the basis for a paper I wrote. For the PubMed citations, what I wanted to be able to do was to go to PubMed, perform a search, find an interesting citation, copy the PubMed URL, paste it into my Rails application, which would use the URL to query PubMed in XML format and populate my app’s database with that entry.

I have two models that play into this interaction: Article and Journal. I should note that I’m using Nokogiri to parse the XML and thoughtbot’s Paperclip gem to attach articles when they are available freely. This was done using Rails 2.3.5 and Ruby 1.8.7.

Here’s how the model files look:

journal.rb

class Journal < ActiveRecord::Base   attr_accessible :name, :description, :url, :short_name      has_many :articles      validates_uniqueness_of :name   validates_presence_of :short_name   validates_presence_of :name      before_save :make_short_name      def best_name     if !self.short_name.nil? && self.short_name.length > 0
      return self.short_name
    else
      return self.name
    end
  end

  private

  # Makes sure that a new journal has a short name in case one isn't entered.
  def make_short_name
    if self.short_name.nil? || self.short_name.length == 0
      self.short_name = self.name
    end
  end
end

The Article model is a little more complicated. The logic that I’m using the parse the XML isn’t great, but I’ve found that the structure changes from article to article. I’m still refining it and will update it when I’m finished, after I’ve found the 90% of article that auto-import.

article.rb

class Article < ActiveRecord::Base   require 'open-uri'   attr_accessible :pubmedlink,                    :article_title,                    :abstract,                    :authors,                    :affiliations,                    :pubdate,                    :pubmedid,                    :journal_id,                    :journal_volume,                    :journal_issue,                    :journal_pages,                   :fetched,                   :thearticle_file_name,                   :thearticle_content_type,                   :thearticle_file_size,                   :thearticle_updated_at,                   :thearticle                      has_attached_file :thearticle,                     :path => ":rails_root/public/system/uploads/:class/:id/:basename.:extension",
                    :url => "/system/uploads/:class/:id/:basename.:extension"
  belongs_to :journal
  before_save :make_pubmedid
  validates_uniqueness_of :article_title

  def citation
    cit = ""
    if self.journal_id != nil && self.journal_id > 0
      cit += self.journal.best_name + ". " + self.pubdate.year.to_s + " " + self.pubdate.strftime("%b")
    end
    if self.journal_volume != nil && self.journal_volume.length > 0
      cit += "; " + self.journal_volume
    end
    if self.journal_issue != nil && self.journal_issue.length > 0
      cit += "(" + self.journal_issue + ")"
    end
    if self.journal_pages != nil && self.journal_pages.length > 0
      cit += ": " + self.journal_pages
    end
      return cit
  end

  private

  def make_pubmedid
    if self.pubmedlink != nil && self.pubmedlink.length > 0 && self.fetched == false
      regex = Regexp.new(/[0-9]{4,10}/)
      matchdata = regex.match(self.pubmedlink)
      self.pubmedid = matchdata[0]
      pull_pubmed_data(matchdata[0])
      self.fetched = true
    end
  end

  def pull_pubmed_data(theID)
    @doc = Nokogiri::XML(open("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=" + theID.to_s + "&retmode=xml").read)

    # The title and abstract
    self.article_title = @doc.xpath("//ArticleTitle").collect(&:text).to_s
    self.abstract = @doc.xpath("//AbstractText").collect(&:text).to_s

    # Setting up the authors
    first_names = @doc.xpath("//ForeName").collect(&:text)
    last_names = @doc.xpath("//LastName").collect(&:text)
    full_names = []

    i = 0
    last_names.length.times do
      the_name = first_names[i].to_s + " " + last_names[i].to_s
      full_names.push the_name
      i += 1
    end
    self.authors = full_names.join(", ")

    # Affiliations
    self.affiliations = @doc.xpath("//Affiliation").collect(&:text).to_s

    # Publication Date - Check if the complete date is at the top. If not use the pub med date.
    theyear = @doc.xpath("//PubDate/Year").collect(&:text)
    theyear = theyear[0]
    themonth = @doc.xpath("//PubDate/Month").collect(&:text)
    themonth = themonth[0]
    theday = @doc.xpath("//PubDate/Day").collect(&:text)

    if theyear.nil? || themonth.nil? || theday.nil? || theyear.length == 0 || themonth.length == 0 || theday.length == 0

      theyear = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Year").collect(&:text)
      theyear = theyear[0]

      themonth = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Month").collect(&:text)
      themonth = themonth[0].to_i
        month_done = 1
      if themonth < 10
        themonth = "0" + themonth.to_s
        month_done = 1
      end

      theday = @doc.xpath("//PubMedPubDate[@PubStatus='pubmed']/Day").collect(&:text)
    end

    if month_done != 1

      themonth = case themonth || "01"
        when "Jan"  then "01"
        when "Feb"  then "02"
        when "Mar"  then "03"
        when "Apr"  then "04"
        when "May"  then "05"
        when "Jun"  then "06"
        when "Jul"  then "07"
        when "Aug"  then "08"
        when "Sep"  then "09"
        when "Oct"  then "10"
        when "Nov"  then "11"
        when "Dec"  then "12"
      end

    end

    if theday.length == 0
      theday = "01"
    elsif theday[0].to_i < 10       theday = "0" + theday[0].to_s     else       theday = theday[0].to_s     end          thedate = theyear.to_s + "-" + themonth.to_s + "-" + theday.to_s     puts "thedate: " + thedate.to_s     self.pubdate = Date.new(theyear.to_i, themonth.to_i, theday.to_i)          # Either referencing the proper journal or creating a new one     thejournal = @doc.xpath("//Journal/Title").collect(&:text)     thejournal = thejournal[0]          theshortname = @doc.xpath("//MedlineTA").collect(&:text)     if theshortname.length == 0       theshortname = ""     else       theshortname = theshortname[0]     end          thejournalid = Journal.find(:first, :conditions => ['lower(name) = ?', thejournal.downcase])

    if !thejournalid.nil?
      self.journal_id = thejournalid.id
    else
      @journal = Journal.new(:name => thejournal.to_s, :short_name => theshortname.to_s)
      @journal.save
      thenewjournal = Journal.find(:first, :order => 'created_at DESC')
      self.journal_id = thenewjournal.id
    end

    # Save the volume, issue, and pages
    thevolume = @doc.xpath("//JournalIssue/Volume").collect(&:text)
    thevolume = thevolume[0].to_s if thevolume.length > 0
    self.journal_volume = thevolume if thevolume.length > 0

    theissue = @doc.xpath("//JournalIssue/Issue").collect(&:text)
    theissue = theissue[0].to_s if theissue.length > 0
    self.journal_issue = theissue if theissue.length > 0

    thepag = @doc.xpath("//Pagination/MedlinePgn").collect(&:text)
    thepag = thepag[0].to_s if thepag.length > 0
    self.journal_pages = thepag if thepag.length > 0

  end

end

The journals_controller.rb file is pretty standard. So is the articles_controller.rb file. Here it is as an example:

articles_controller.rb

class ArticlesController < ApplicationController   def index     @articles = Article.all(:order => "pubdate DESC")
  end

  def show
    @article = Article.find(params[:id])
  end

  def new
    @article = Article.new
    @journals = Journal.all(:order => :name)
  end

  def create
    @article = Article.new(params[:article])
    if @article.save
      flash[:notice] = "Successfully created article."
      redirect_to @article
    else
      render :action => 'new'
    end
  end

  def edit
    @article = Article.find(params[:id])
    @journals = Journal.all(:order => :name)
  end

  def update
    @article = Article.find(params[:id])
    if @article.update_attributes(params[:article])
      flash[:notice] = "Successfully updated article."
      redirect_to @article
    else
      render :action => 'edit'
    end
  end

  def destroy
    @article = Article.find(params[:id])
    @article.destroy
    flash[:notice] = "Successfully destroyed article."
    redirect_to articles_url
  end
end

Here are the view files for articles:

app/views/articles/index.html.erb

<div class="toolbar">
<ul></ul>
</div>
<table>
<tbody>
<tr>
<th></th>
<th>Article Title</th>
<th></th>
</tr>
<tr class="<%= cycle('rowA', 'rowB') %>">
<td></td>
<td><strong> </strong>

		  <em> </em></td>
<td class="deletelink"></td>
</tr>
</tbody>
</table>

app/views/articles/show.html.erb

<div class="toolbar">
<ul></ul>
</div>
	  <strong>PubMed Link:</strong>







	  <strong>View the Article:</strong>





  <strong>Article Title:</strong>



  <strong>Abstract:</strong>



  <strong>Authors:</strong>



  <strong>Affiliations:</strong>



  <strong>Pubdate:</strong>



  <strong>Pubmedid:</strong>



  <strong>Journal:</strong>



  <strong>Journal Volume:</strong>



  <strong>Journal Issue:</strong>



  <strong>Journal Pages:</strong>



app/views/articles/_form.html.erb



<h3>Enter a link to an abstract on PubMed and click Submit or fill out the information below</h3>











		There currently is no file attached.


<hr />












































The Paperclip gem requires :html => { :multipart => true } at the top of the _form file. It’s a very simple gem to use and very powerful. There are Google Groups for both Paperclip and Nokogiri, should you need additional assistance setting them up.

Leave me a comment if this helps you with either Nokogiri or with extracting data from PubMed!

Written by Clay

July 9, 2010 at 18:18

Posted in Code, Ruby, Technology

Tagged with , , , ,

How to query PubMed and retrieve XML results

This seems like it should be a simple thing to figure out how to do on the PubMed website — return a single result as XML. If you visit PubMed, there is a “Display” link at the top left that allow you to view an entry in XML format, but when you select that option, the URL changes to the vanilla PubMed URL. Anyhow, let’s say you go to PubMed and look at a particular article:

http://www.ncbi.nlm.nih.gov/pubmed/20598978

You can retrieve the same result in XML by simply going to this URL:

http://www.ncbi.nlm.nih.gov/pubmed/20598978?report=XML

Other formats are available and outlined on the PubMed help site. The help files on the site are not particularly easy to search, so it took me forever to find this info.

When you use the URL above, the page is designed to create XML that displays properly in the browser — not XML parsing by Ruby or another language. For that, you have to use the eutils. The proper link for returning XML that you can parse with nokogiri or another gem is:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=20598978&retmode=xml

Written by Clay

July 6, 2010 at 13:07

Posted in Public Health, Technology

Tagged with ,