Final Exam - Data 2100

Part A

  1. In this question, you will use a series of datasets to investigate population density in the United States.
  1. Load in population data for Alabama (“sub-est2016_1.csv”) and Alaska (“sub-est2016_2.csv”), then append the two datasets together so that all of the information is within one dataframe.
Population Data Merge Example
SUMLEV STATE COUNTY PLACE COUSUB CONCIT PRIMGEO_FLAG FUNCSTAT NAME STNAME
1104 157 1 133 53400 0 0 1 A Nauvoo town (pt.) Alabama
1105 157 1 133 99990 0 0 1 F Balance of Winston County Alabama
1106 40 2 0 0 0 0 0 A Alaska Alaska
1107 162 2 0 65 0 0 0 A Adak city Alaska

  1. Read in the csv file that already contains population information for each state. Check to see which unique states are included in this dataset.
Unique States from ‘All Population’ Data Set
State
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
State
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
State
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
State
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming

  1. There’s a lot of interesting data in this population dataset, but for our purposes in this problem set, we are only interested in a few columns. Use the subset() function to subset the “NAME”, “STNAME”, and “POPESTIMATE2012” columns into a new dataset. (Using a different function to complete the same task will result in partial credit.)
All Population Subset Example
City/State Name State Population Est.
Alabama Alabama 4815960
Abbeville city Alabama 2647
Adamsville city Alabama 4472
Addison town Alabama 748
Akron town Alabama 345

  1. This new subsetted dataset definitely makes our lives easier, but it still includes the population stats for each city and town. You’ll notice, however, that the first observation for each new state is the population total for the entire state where the states name appears in both the NAME and STNAME columns. Use the subset() function to choose only these rows. Make sure that your new data set doesn’t have any repeating/redundant observations or columns (The resulting dataframe should be 51 X 2)
Dataframe Size
Dims
51
2
Removal of State and Town Example
State Population Est.
Alabama 4815960
Alaska 731089
Arizona 6549634
Arkansas 2950685
California 38011074

  1. We’re going to try to find the population density of each state. Our first step in doing this is to read in some online data about the square mileage of each state from this link. Once the data is read in, merge that data set with our 2012 state populations dataset from the last question. Which observations can be matched? Make sure to not merge observation(s) that have no match.
State Pop. & Area Merge Example
Name Population Est. Area (Sq. Mile)
Alabama 4815960 52423
Alaska 731089 656425
Arizona 6549634 114006
Arkansas 2950685 53182
California 38011074 163707

I used a left_join() with a join_by() of the “NAME” and “state” observations.


  1. Next, we are going to create a new variable in this merged dataset that tells us each state’s population density in 2012. Do this by dividing the population variable by the state size variable.
Population Density Addition Example
State Population Est. Area (Sq. Mile) Density
Alabama 4815960 52423 91.867310
Alaska 731089 656425 1.113743
Arizona 6549634 114006 57.449906
Arkansas 2950685 53182 55.482776
California 38011074 163707 232.189668

  1. Finally we’ve finished preparing our dataset, now we’re going to get into some more interesting investigative work. Let’s first load in the “ECN_2012_US_52A1.csv” dataset which includes economic data for each sector within each state. Get rid of the first row, as this merely gives us descriptions of each variable.
Econ. Data Clean Example (1:5 & 1:5)
GEO.id GEO.id2 GEO.display.label GEO.annotation.id NAICS.id
2 0400000US01 01 Alabama 52
3 0400000US01 01 Alabama 521
4 0400000US01 01 Alabama 5211
5 0400000US01 01 Alabama 52111
6 0400000US01 01 Alabama 521110

  1. Find the total revenue per sector by state.
Total State Revenue by Sector (Fin. & Ins.) Example
State Total Revenue
Alabama 18509867
Alaska 1010275
Arizona 65996299
Arkansas 7177328
California 483627172

  1. Now merge this dataset with our population density dataset.
Total State Data Merge Example
State Population Est. Area Density Total Revenue
Alabama 4815960 52423 91.867310 18509867
Alaska 731089 656425 1.113743 1010275
Arizona 6549634 114006 57.449906 65996299
Arkansas 2950685 53182 55.482776 7177328
California 38011074 163707 232.189668 483627172

  1. Plot the relationship between state population density and the state’s total revenue to see if there’s a relationship. Comment on your findings.

I do not believe a direct coorelation can be made between population density. Looking at the graph (and the underlying data) you can see that DC has the highest population density, but also the lowest Total Revenue while NY has the highest Total Revenue but is just middle of the pack in population density.


Part B

For this question, you will use the data file ‘nes.rda’ (which will require you to use load(“nes.rda”) to read in the data.) The codebook, called ‘nes2012_codebook.pdf’ is also available to you on Canvas. This data comes from the ANES 2012 Time Series Study, which looks at attitudes toward political ideologies and groups, among many other things.

  1. According to this survey, of those who claimed to have voted in the 2008 election, what percentage of survey respondents voted for Barack Obama in 2008? (Hint: you will need to search the codebook to find the variables ‘interest_voted2008’ and ‘interest_whovote2008’ in order to clean them correctly.)
Percent of respondents that voted for Obama in 2008
Voted in 2008 Voted for Obama Percentage for Obama
1371 923 67.32312

  1. A ‘Feeling Thermometer’ is a type of survey question that asks respondents to rate how warmly or cool they feel toward an individual or group. A feeling thermometer score of 100 indicates a respondent feels the most positive toward that entity. A feeling score of 0 indicates the respondent feels most negative about that entity. A score of 50 indicates indifference. Using the variable that records the feeling thermometer score towards the ‘Federal Government in Washington,’ clean the variable to only include scores between 0 and 100. (Use the codebook to locate the ‘ftgr_fedgov’ variable to clean it properly.)
Fed Gov Feeling Thermometer Clean Data Example
Case ID Feeling Therm.
2 85
3 100
4 60
5 70
6 25

  1. Using the cleaned variable, what is the average feeling thermometer for the Federal Government in Washington, according to this survey?
Fed Gov Avg. Feeling Thermometer
Avg. FT
52.48652

  1. Using the ‘prevote_regpty’ variable, create a new variable that indicates whether a respondent is a Democrat or a Republican. All other political affiliations or unknowns should be set to ‘NA.’ (Use the codebook to clean this variable correctly.)
Registered Party Data Example
caseid prevote_regpty ftgr_fedgov Party
2 -1 85 NA
3 -1 100 NA
4 -1 60 NA
5 -1 70 NA
6 -1 25 NA
7 -1 85 NA
8 -1 100 NA
9 -1 40 NA
10 2 60 Rep
11 2 30 Rep

  1. Find the difference in means between the average feeling thermometer score for Democrats vs. Republicans. What do you conclude?
Fed. Gov. Feeling Thermometer by Party
Democrat Republican
59.57834 40.29787

At almost 20 points higher, the average feeling thermometer score shows that Democrats have a much warmer feeling about the Federal Government than their Republican peers though, at only 59.58, their feeling is only about 10 points above indifference (so positive, but not overly so).

knitr::opts_chunk$set(echo = TRUE)

library(tidyr)
library(dplyr)
library(readr)
library(lubridate)
library(knitr)
library(stringr)
library(readxl)
library(tidyverse)
library(ggplot2)
library(ggrepel)

options(warn=-1)

setwd("c:/Users/trega/Dropbox/DATA2100/Final")

al.pop<-read.csv("sub-est2016_1.csv")
ak.pop<-read.csv("sub-est2016_2.csv") 



pop.data<-rbind(al.pop,ak.pop)

kable(pop.data[1104:1107,1:10], align='c', 
      caption="Population Data Merge Example")


all.pop<-read.csv("sub-est2016_all.csv")

ustates<-as.data.frame(unique(all.pop$STNAME))

kable(list(ustates[1:15,], ustates[16:30,], ustates[31:45,], ustates[46:51,]), align = 'c',
  col.names = "State",
  caption="Unique States from 'All Population' Data Set") 


allpopsub<-subset(all.pop, select=c("NAME","STNAME","POPESTIMATE2012"))

kable(head(allpopsub,5), align='c', 
      col.names = c('City/State Name','State','Population Est.'),
      caption = "All Population Subset Example")


allstatesub<- distinct(subset(allpopsub, NAME==STNAME,
                      select=c("NAME","POPESTIMATE2012")
                      ))

kable(dim(allstatesub), align = 'c',
      col.names = "Dims",
      caption = "Dataframe Size")

kable(head(allstatesub,5), align = 'c',
      col.names = c('State','Population Est.'),
      caption="Removal of State and Town Example")


state.area<-read.csv("https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-areas.csv")

state.merge<- left_join(allstatesub,state.area,join_by(NAME==state))

kable(head(state.merge,5), align = 'c',
      col.names = c('Name','Population Est.', 'Area (Sq. Mile)'),
      caption = "State Pop. & Area Merge Example")


state.merge<-mutate(state.merge,
                    PopDen = POPESTIMATE2012 / area..sq..mi.)

kable(head(state.merge,5),align = 'c',
      col.names = c('State','Population Est.','Area (Sq. Mile)','Density'),
      caption = "Population Density Addition Example")


econ.data<-read.csv("ECN_2012_US_52A1.csv")

econ.data<-econ.data[c(2:3800),]

kable(econ.data[1:5,1:5], align = 'c',
      caption = "Econ. Data Clean Example (1:5 & 1:5)")


bysector<- econ.data %>% 
  group_by(GEO.display.label) %>% 
  summarise(ttlrev = sum(as.numeric(RCPTOT), na.rm = TRUE))

kable(head(bysector,5),align = 'c',
      col.names = c('State','Total Revenue'),
      caption = "Total State Revenue by Sector (Fin. & Ins.) Example")



area.rev.merge<-left_join(state.merge,bysector,join_by(NAME==GEO.display.label))

kable(head(area.rev.merge,5), align = 'c',
      col.names = c('State','Population Est.','Area','Density','Total Revenue'),
      caption = "Total State Data Merge Example")



area.rev.merge<-mutate(area.rev.merge,
                       revbyden = round(ttlrev / PopDen,0), 
                       ttlrev = ttlrev/1000000000)

ggplot(area.rev.merge, mapping = aes(ttlrev, PopDen, color= NAME)) + 
  geom_point()+
  geom_text_repel(
    label=area.rev.merge$NAME,
    nudge_x = 0.25, nudge_y = 0.75,
    check_overlap=T)+
  labs(title="Population Density vs. Total Revenue (by State)", 
       x = "Total Revenue (in Billions)", 
       y="Population Density (People/Sq. Mile)") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "")


load("nes.rda")

nesobama<-subset(nes,select = c("caseid","interest_voted2008", "interest_whovote2008"))

nesobama1<- data.frame(Voted = sum(nesobama$interest_voted2008==1), 
                       Obama = sum(nesobama$interest_whovote2008==1))

nesobama1<-mutate(nesobama1, 
                  Percent = Obama / Voted *100)

kable(nesobama1, align = 'c',
      col.names = c("Voted in 2008", "Voted for Obama", "Percentage for Obama"),
      caption = "Percent of respondents that voted for Obama in 2008")


nesFT<- subset(nes, select = c("caseid", "ftgr_fedgov"))

nesFT<-filter(nesFT, ftgr_fedgov>=0)

kable(head(nesFT,5), align = 'c', 
      col.names = c("Case ID",
      "Feeling Therm."), 
      caption = "Fed Gov Feeling Thermometer Clean Data Example")


nesFT<-data.frame(Mean = mean(nesFT$ftgr_fedgov))

kable(nesFT, align = 'c', 
      col.names = "Avg. FT", 
      caption = "Fed Gov Avg. Feeling Thermometer")


nesparty<-subset(nes, select = c("caseid","prevote_regpty","ftgr_fedgov"))

nesparty<-filter(nesparty,ftgr_fedgov>=0)

nesparty<-mutate(nesparty, 
                 Party = ifelse(prevote_regpty == 1, "Dem", 
                                ifelse(prevote_regpty == 2, "Rep", "NA")))

kable(head(nesparty,10),align = 'c',
      caption = "Registered Party Data Example")


nesparty<-mutate(nesparty, 
                 FTDem = ifelse(Party == "Dem", ftgr_fedgov, NA), 
                 FTRep = ifelse(Party == "Rep", ftgr_fedgov, NA))

nesparty<-data.frame(DEMFT = mean(nesparty$FTDem, na.rm=TRUE), 
                     REPFT = mean(nesparty$FTRep, na.rm=TRUE))

kable(nesparty, align = 'c', 
      col.names = c("Democrat", "Republican"), 
      caption = "Fed. Gov. Feeling Thermometer by Party")