Life as Clay

Parsing US Census data with Ruby

leave a comment »


I am a fan of population pyramids as a visualization of age demographics. They are simple and effective. Sometimes getting the data to generate them isn’t….

I’m building a simple Rails app for a client who needs county-level population data. It doesn’t have to be precise. I went with the 2009 estimates from the Census bureau, located here:
http://www.census.gov/popest/counties/asrh/CC-EST2009-alldata.html

There were a few issues:
1. Each state’s county-level data is in a different file. You can order a CD that has all of the data in a single file, but I didn’t have time to do that.
2. There’s a ton of data that I didn’t need in each file. All I needed were the county totals and the totals for male/female for each age group.
3. The data is in rows — a different row for each age group. I needed the data in columns, with one row for each county.

What I did:
1. Downloaded all 51 of the files (DC has its own file) and put them in a folder on my desktop.
2. Changed the extension on them to .csv (because it changed to .txt when I downloaded, for some reason).
3. Wrote the Ruby script below, which extracts the data that I need and puts it into a single .csv file.
4. Saved the script to the same folder as the Census files as script.rb.
5. Opened terminal and ran ruby script.rb
4. Reconciled the data with my Excel sheet that I was using to collect county-level data so that I could seed my Rails app’s counties table.

I comment my code pretty heavily when I write it so that I can look back and remember what I did. I’ve been burned in the past by not commenting thoroughly.

Here’s the script:

# Script to pull population data out of census files for each county
# Original data files are here: http://www.census.gov/popest/counties/asrh/CC-EST2009-alldata.html

require 'CSV'

# These are the FIPS codes for each state which are used in file names.
state_fips_codes = ["01", "02", "04", "05", "06", "08", "09", "10", "11", "12", "13", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "44", "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"]

add_header = true

# This is the file that all of the data will go into...
File.open("county_age_data.csv", "w") do |output_file|
  
  # counter for iterating through the files
  n = 0
  
  state_fips_codes.length.times do  # This loop goes through all of the files
    
    # Read in the relevant file
    state_data = CSV.parse(File.read("cc-est2009-alldata-" + state_fips_codes[n] + ".csv"))
    puts "Processing data for " + state_data[1][3]

    # Add the header to the file
    if add_header == true
      output_file.puts "state_fips,county_fips,state_name,county_name,:census_total_pop,:census_total_male,:census_grp01_male,:census_grp02_male,:census_grp03_male,:census_grp04_male,:census_grp05_male,:census_grp06_male,:census_grp07_male,:census_grp08_male,:census_grp09_male,:census_grp10_male,:census_grp11_male,:census_grp12_male,:census_grp13_male,:census_grp14_male,:census_grp15_male,:census_grp16_male,:census_grp17_male,:census_grp18_male,:census_total_female,:census_grp01_female,:census_grp02_female,:census_grp03_female,:census_grp04_female,:census_grp05_female,:census_grp06_female,:census_grp07_female,:census_grp08_female,:census_grp09_female,:census_grp10_female,:census_grp11_female,:census_grp12_female,:census_grp13_female,:census_grp14_female,:census_grp15_female,:census_grp16_female,:census_grp17_female,:census_grp18_female"
      add_header = false
    end

    # Iterate through the loaded file.
    row = 1
    while row < state_data.length
      if state_data[row][5].to_i == 12 # Only take data from year coded 12 (2009 estimates)

        ########## MAIN LOGIC #############
        if state_data[row][6].to_i == 0 # It's a new county because this is the group 0 row with totals

          # Reset the data arrays
          row_data    = []
          male_data   = []
          female_data = []

          row_data.push state_data[row][1].to_s     # state_fips
          row_data.push state_data[row][2].to_s     # county_fips
          row_data.push state_data[row][3].to_s     # state_name
          row_data.push state_data[row][4].to_s     # county_name
          row_data.push state_data[row][7].to_s     # census_total_pop

          male_data.push state_data[row][8].to_s    # census_total_male
          female_data.push state_data[row][9].to_s  # census_total_female

        elsif state_data[row][6].to_i == 18 # It's the last row -- push the data into the array

          # push in the final data points
          male_data.push state_data[row][8].to_s    # census_grp18_male
          female_data.push state_data[row][9].to_s  # census_grp18_female

          # append all of the data to the row_data array
          row_data += male_data
          row_data += female_data

          # write the data to the file
          output_file.puts row_data.join(",")

        else # It's not the first or last row

          male_data.push state_data[row][8].to_s    # census_grpXX_male
          female_data.push state_data[row][9].to_s  # census_grpXX_female

        end # End of the if statement for pushing data into arrays
        ################ END OF MAIN LOGIC ####################
        
      end       # End of the if statement to check if it is year 10
      
      row += 1  # Look at the next row
    end

    n += 1      # Increment the counter
  end           # End of loop for going through each file
end
Advertisements

Written by Clay

July 15, 2010 at 21:17

Posted in Code, Ruby

Tagged with , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: