Final Exam - Data 2100
Part A
- In this question, you will use a series of datasets to investigate
population density in the United States.
- Load in population data for Alabama (“sub-est2016_1.csv”) and Alaska
(“sub-est2016_2.csv”), then append the two datasets together so that all
of the information is within one dataframe.
Population Data Merge Example
| 1104 |
157 |
1 |
133 |
53400 |
0 |
0 |
1 |
A |
Nauvoo town (pt.) |
Alabama |
| 1105 |
157 |
1 |
133 |
99990 |
0 |
0 |
1 |
F |
Balance of Winston County |
Alabama |
| 1106 |
40 |
2 |
0 |
0 |
0 |
0 |
0 |
A |
Alaska |
Alaska |
| 1107 |
162 |
2 |
0 |
65 |
0 |
0 |
0 |
A |
Adak city |
Alaska |
- Read in the csv file that already contains population information
for each state. Check to see which unique states are included in this
dataset.
Unique States from ‘All Population’ Data Set
| Alabama |
| Alaska |
| Arizona |
| Arkansas |
| California |
| Colorado |
| Connecticut |
| Delaware |
| District of Columbia |
| Florida |
| Georgia |
| Hawaii |
| Idaho |
| Illinois |
| Indiana |
|
| Iowa |
| Kansas |
| Kentucky |
| Louisiana |
| Maine |
| Maryland |
| Massachusetts |
| Michigan |
| Minnesota |
| Mississippi |
| Missouri |
| Montana |
| Nebraska |
| Nevada |
| New Hampshire |
|
| New Jersey |
| New Mexico |
| New York |
| North Carolina |
| North Dakota |
| Ohio |
| Oklahoma |
| Oregon |
| Pennsylvania |
| Rhode Island |
| South Carolina |
| South Dakota |
| Tennessee |
| Texas |
| Utah |
|
| Vermont |
| Virginia |
| Washington |
| West Virginia |
| Wisconsin |
| Wyoming |
|
- There’s a lot of interesting data in this population dataset, but
for our purposes in this problem set, we are only interested in a few
columns. Use the subset() function to subset the “NAME”, “STNAME”, and
“POPESTIMATE2012” columns into a new dataset. (Using a different
function to complete the same task will result in partial credit.)
All Population Subset Example
| Alabama |
Alabama |
4815960 |
| Abbeville city |
Alabama |
2647 |
| Adamsville city |
Alabama |
4472 |
| Addison town |
Alabama |
748 |
| Akron town |
Alabama |
345 |
- This new subsetted dataset definitely makes our lives easier, but it
still includes the population stats for each city and town. You’ll
notice, however, that the first observation for each new state is the
population total for the entire state where the states name appears in
both the NAME and STNAME columns. Use the subset() function to choose
only these rows. Make sure that your new data set doesn’t have any
repeating/redundant observations or columns (The resulting dataframe
should be 51 X 2)
Removal of State and Town Example
| Alabama |
4815960 |
| Alaska |
731089 |
| Arizona |
6549634 |
| Arkansas |
2950685 |
| California |
38011074 |
- We’re going to try to find the population density of each state. Our
first step in doing this is to read in some online data about the square
mileage of each state from this link. Once the data is read in, merge
that data set with our 2012 state populations dataset from the last
question. Which observations can be matched? Make sure to not merge
observation(s) that have no match.
State Pop. & Area Merge Example
| Alabama |
4815960 |
52423 |
| Alaska |
731089 |
656425 |
| Arizona |
6549634 |
114006 |
| Arkansas |
2950685 |
53182 |
| California |
38011074 |
163707 |
I used a left_join() with a join_by() of the “NAME” and “state”
observations.
- Next, we are going to create a new variable in this merged dataset
that tells us each state’s population density in 2012. Do this by
dividing the population variable by the state size variable.
Population Density Addition Example
| Alabama |
4815960 |
52423 |
91.867310 |
| Alaska |
731089 |
656425 |
1.113743 |
| Arizona |
6549634 |
114006 |
57.449906 |
| Arkansas |
2950685 |
53182 |
55.482776 |
| California |
38011074 |
163707 |
232.189668 |
- Finally we’ve finished preparing our dataset, now we’re going to get
into some more interesting investigative work. Let’s first load in the
“ECN_2012_US_52A1.csv” dataset which includes economic data for each
sector within each state. Get rid of the first row, as this merely gives
us descriptions of each variable.
Econ. Data Clean Example (1:5 & 1:5)
| 2 |
0400000US01 |
01 |
Alabama |
|
52 |
| 3 |
0400000US01 |
01 |
Alabama |
|
521 |
| 4 |
0400000US01 |
01 |
Alabama |
|
5211 |
| 5 |
0400000US01 |
01 |
Alabama |
|
52111 |
| 6 |
0400000US01 |
01 |
Alabama |
|
521110 |
- Find the total revenue per sector by state.
Total State Revenue by Sector (Fin. & Ins.)
Example
| Alabama |
18509867 |
| Alaska |
1010275 |
| Arizona |
65996299 |
| Arkansas |
7177328 |
| California |
483627172 |
- Now merge this dataset with our population density dataset.
Total State Data Merge Example
| Alabama |
4815960 |
52423 |
91.867310 |
18509867 |
| Alaska |
731089 |
656425 |
1.113743 |
1010275 |
| Arizona |
6549634 |
114006 |
57.449906 |
65996299 |
| Arkansas |
2950685 |
53182 |
55.482776 |
7177328 |
| California |
38011074 |
163707 |
232.189668 |
483627172 |
- Plot the relationship between state population density and the
state’s total revenue to see if there’s a relationship. Comment on your
findings.

I do not believe a direct coorelation can be made between population
density. Looking at the graph (and the underlying data) you can see that
DC has the highest population density, but also the lowest Total Revenue
while NY has the highest Total Revenue but is just middle of the pack in
population density.
Part B
For this question, you will use the data file ‘nes.rda’ (which will
require you to use load(“nes.rda”) to read in the data.) The codebook,
called ‘nes2012_codebook.pdf’ is also available to you on Canvas. This
data comes from the ANES 2012 Time Series Study, which looks at
attitudes toward political ideologies and groups, among many other
things.
- According to this survey, of those who claimed to have voted in the
2008 election, what percentage of survey respondents voted for Barack
Obama in 2008? (Hint: you will need to search the codebook to find the
variables ‘interest_voted2008’ and ‘interest_whovote2008’ in order to
clean them correctly.)
Percent of respondents that voted for Obama in 2008
| 1371 |
923 |
67.32312 |
- A ‘Feeling Thermometer’ is a type of survey question that asks
respondents to rate how warmly or cool they feel toward an individual or
group. A feeling thermometer score of 100 indicates a respondent feels
the most positive toward that entity. A feeling score of 0 indicates the
respondent feels most negative about that entity. A score of 50
indicates indifference. Using the variable that records the feeling
thermometer score towards the ‘Federal Government in Washington,’ clean
the variable to only include scores between 0 and 100. (Use the codebook
to locate the ‘ftgr_fedgov’ variable to clean it properly.)
Fed Gov Feeling Thermometer Clean Data Example
| 2 |
85 |
| 3 |
100 |
| 4 |
60 |
| 5 |
70 |
| 6 |
25 |
- Using the cleaned variable, what is the average feeling thermometer
for the Federal Government in Washington, according to this survey?
Fed Gov Avg. Feeling Thermometer
| 52.48652 |
- Using the ‘prevote_regpty’ variable, create a new variable that
indicates whether a respondent is a Democrat or a Republican. All other
political affiliations or unknowns should be set to ‘NA.’ (Use the
codebook to clean this variable correctly.)
Registered Party Data Example
| 2 |
-1 |
85 |
NA |
| 3 |
-1 |
100 |
NA |
| 4 |
-1 |
60 |
NA |
| 5 |
-1 |
70 |
NA |
| 6 |
-1 |
25 |
NA |
| 7 |
-1 |
85 |
NA |
| 8 |
-1 |
100 |
NA |
| 9 |
-1 |
40 |
NA |
| 10 |
2 |
60 |
Rep |
| 11 |
2 |
30 |
Rep |
- Find the difference in means between the average feeling thermometer
score for Democrats vs. Republicans. What do you conclude?
Fed. Gov. Feeling Thermometer by Party
| 59.57834 |
40.29787 |
At almost 20 points higher, the average feeling thermometer score
shows that Democrats have a much warmer feeling about the Federal
Government than their Republican peers though, at only 59.58, their
feeling is only about 10 points above indifference (so positive, but not
overly so).
knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
library(dplyr)
library(readr)
library(lubridate)
library(knitr)
library(stringr)
library(readxl)
library(tidyverse)
library(ggplot2)
library(ggrepel)
options(warn=-1)
setwd("c:/Users/trega/Dropbox/DATA2100/Final")
al.pop<-read.csv("sub-est2016_1.csv")
ak.pop<-read.csv("sub-est2016_2.csv")
pop.data<-rbind(al.pop,ak.pop)
kable(pop.data[1104:1107,1:10], align='c',
caption="Population Data Merge Example")
all.pop<-read.csv("sub-est2016_all.csv")
ustates<-as.data.frame(unique(all.pop$STNAME))
kable(list(ustates[1:15,], ustates[16:30,], ustates[31:45,], ustates[46:51,]), align = 'c',
col.names = "State",
caption="Unique States from 'All Population' Data Set")
allpopsub<-subset(all.pop, select=c("NAME","STNAME","POPESTIMATE2012"))
kable(head(allpopsub,5), align='c',
col.names = c('City/State Name','State','Population Est.'),
caption = "All Population Subset Example")
allstatesub<- distinct(subset(allpopsub, NAME==STNAME,
select=c("NAME","POPESTIMATE2012")
))
kable(dim(allstatesub), align = 'c',
col.names = "Dims",
caption = "Dataframe Size")
kable(head(allstatesub,5), align = 'c',
col.names = c('State','Population Est.'),
caption="Removal of State and Town Example")
state.area<-read.csv("https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-areas.csv")
state.merge<- left_join(allstatesub,state.area,join_by(NAME==state))
kable(head(state.merge,5), align = 'c',
col.names = c('Name','Population Est.', 'Area (Sq. Mile)'),
caption = "State Pop. & Area Merge Example")
state.merge<-mutate(state.merge,
PopDen = POPESTIMATE2012 / area..sq..mi.)
kable(head(state.merge,5),align = 'c',
col.names = c('State','Population Est.','Area (Sq. Mile)','Density'),
caption = "Population Density Addition Example")
econ.data<-read.csv("ECN_2012_US_52A1.csv")
econ.data<-econ.data[c(2:3800),]
kable(econ.data[1:5,1:5], align = 'c',
caption = "Econ. Data Clean Example (1:5 & 1:5)")
bysector<- econ.data %>%
group_by(GEO.display.label) %>%
summarise(ttlrev = sum(as.numeric(RCPTOT), na.rm = TRUE))
kable(head(bysector,5),align = 'c',
col.names = c('State','Total Revenue'),
caption = "Total State Revenue by Sector (Fin. & Ins.) Example")
area.rev.merge<-left_join(state.merge,bysector,join_by(NAME==GEO.display.label))
kable(head(area.rev.merge,5), align = 'c',
col.names = c('State','Population Est.','Area','Density','Total Revenue'),
caption = "Total State Data Merge Example")
area.rev.merge<-mutate(area.rev.merge,
revbyden = round(ttlrev / PopDen,0),
ttlrev = ttlrev/1000000000)
ggplot(area.rev.merge, mapping = aes(ttlrev, PopDen, color= NAME)) +
geom_point()+
geom_text_repel(
label=area.rev.merge$NAME,
nudge_x = 0.25, nudge_y = 0.75,
check_overlap=T)+
labs(title="Population Density vs. Total Revenue (by State)",
x = "Total Revenue (in Billions)",
y="Population Density (People/Sq. Mile)") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "")
load("nes.rda")
nesobama<-subset(nes,select = c("caseid","interest_voted2008", "interest_whovote2008"))
nesobama1<- data.frame(Voted = sum(nesobama$interest_voted2008==1),
Obama = sum(nesobama$interest_whovote2008==1))
nesobama1<-mutate(nesobama1,
Percent = Obama / Voted *100)
kable(nesobama1, align = 'c',
col.names = c("Voted in 2008", "Voted for Obama", "Percentage for Obama"),
caption = "Percent of respondents that voted for Obama in 2008")
nesFT<- subset(nes, select = c("caseid", "ftgr_fedgov"))
nesFT<-filter(nesFT, ftgr_fedgov>=0)
kable(head(nesFT,5), align = 'c',
col.names = c("Case ID",
"Feeling Therm."),
caption = "Fed Gov Feeling Thermometer Clean Data Example")
nesFT<-data.frame(Mean = mean(nesFT$ftgr_fedgov))
kable(nesFT, align = 'c',
col.names = "Avg. FT",
caption = "Fed Gov Avg. Feeling Thermometer")
nesparty<-subset(nes, select = c("caseid","prevote_regpty","ftgr_fedgov"))
nesparty<-filter(nesparty,ftgr_fedgov>=0)
nesparty<-mutate(nesparty,
Party = ifelse(prevote_regpty == 1, "Dem",
ifelse(prevote_regpty == 2, "Rep", "NA")))
kable(head(nesparty,10),align = 'c',
caption = "Registered Party Data Example")
nesparty<-mutate(nesparty,
FTDem = ifelse(Party == "Dem", ftgr_fedgov, NA),
FTRep = ifelse(Party == "Rep", ftgr_fedgov, NA))
nesparty<-data.frame(DEMFT = mean(nesparty$FTDem, na.rm=TRUE),
REPFT = mean(nesparty$FTRep, na.rm=TRUE))
kable(nesparty, align = 'c',
col.names = c("Democrat", "Republican"),
caption = "Fed. Gov. Feeling Thermometer by Party")