Final Exam

Final Exam - Data 2100

Part A

In this question, you will use a series of datasets to investigate population density in the United States.

Load in population data for Alabama (“sub-est2016_1.csv”) and Alaska (“sub-est2016_2.csv”), then append the two datasets together so that all of the information is within one dataframe.

Population Data Merge Example
	SUMLEV	STATE	COUNTY	PLACE	PRIMGEO_FLAG	FUNCSTAT	NAME	STNAME
1104	157	1	133	53400	1	A	Nauvoo town (pt.)	Alabama
1105	157	1	133	99990	1	F	Balance of Winston County	Alabama
1106	40	2	0	0	0	A	Alaska	Alaska
1107	162	2	0	65	0	A	Adak city	Alaska

Read in the csv file that already contains population information for each state. Check to see which unique states are included in this dataset.

Unique States from ‘All Population’ Data Set

State
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana

State
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire

State
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah

State
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming

There’s a lot of interesting data in this population dataset, but for our purposes in this problem set, we are only interested in a few columns. Use the subset() function to subset the “NAME”, “STNAME”, and “POPESTIMATE2012” columns into a new dataset. (Using a different function to complete the same task will result in partial credit.)

All Population Subset Example
City/State Name	State	Population Est.
Alabama	Alabama	4815960
Abbeville city	Alabama	2647
Adamsville city	Alabama	4472
Addison town	Alabama	748
Akron town	Alabama	345

This new subsetted dataset definitely makes our lives easier, but it still includes the population stats for each city and town. You’ll notice, however, that the first observation for each new state is the population total for the entire state where the states name appears in both the NAME and STNAME columns. Use the subset() function to choose only these rows. Make sure that your new data set doesn’t have any repeating/redundant observations or columns (The resulting dataframe should be 51 X 2)

Dataframe Size
Dims
51
2

Removal of State and Town Example
State	Population Est.
Alabama	4815960
Alaska	731089
Arizona	6549634
Arkansas	2950685
California	38011074

We’re going to try to find the population density of each state. Our first step in doing this is to read in some online data about the square mileage of each state from this link. Once the data is read in, merge that data set with our 2012 state populations dataset from the last question. Which observations can be matched? Make sure to not merge observation(s) that have no match.

State Pop. & Area Merge Example
Name	Population Est.	Area (Sq. Mile)
Alabama	4815960	52423
Alaska	731089	656425
Arizona	6549634	114006
Arkansas	2950685	53182
California	38011074	163707

I used a left_join() with a join_by() of the “NAME” and “state” observations.

Next, we are going to create a new variable in this merged dataset that tells us each state’s population density in 2012. Do this by dividing the population variable by the state size variable.

Population Density Addition Example
State	Population Est.	Area (Sq. Mile)	Density
Alabama	4815960	52423	91.867310
Alaska	731089	656425	1.113743
Arizona	6549634	114006	57.449906
Arkansas	2950685	53182	55.482776
California	38011074	163707	232.189668

Finally we’ve finished preparing our dataset, now we’re going to get into some more interesting investigative work. Let’s first load in the “ECN_2012_US_52A1.csv” dataset which includes economic data for each sector within each state. Get rid of the first row, as this merely gives us descriptions of each variable.

Econ. Data Clean Example (1:5 & 1:5)
	GEO.id	GEO.id2	GEO.display.label	NAICS.id
2	0400000US01	01	Alabama	52
3	0400000US01	01	Alabama	521
4	0400000US01	01	Alabama	5211
5	0400000US01	01	Alabama	52111
6	0400000US01	01	Alabama	521110

Find the total revenue per sector by state.

Total State Revenue by Sector (Fin. & Ins.) Example
State	Total Revenue
Alabama	18509867
Alaska	1010275
Arizona	65996299
Arkansas	7177328
California	483627172

Now merge this dataset with our population density dataset.

Total State Data Merge Example
State	Population Est.	Area	Density	Total Revenue
Alabama	4815960	52423	91.867310	18509867
Alaska	731089	656425	1.113743	1010275
Arizona	6549634	114006	57.449906	65996299
Arkansas	2950685	53182	55.482776	7177328
California	38011074	163707	232.189668	483627172

Plot the relationship between state population density and the state’s total revenue to see if there’s a relationship. Comment on your findings.

I do not believe a direct coorelation can be made between population density. Looking at the graph (and the underlying data) you can see that DC has the highest population density, but also the lowest Total Revenue while NY has the highest Total Revenue but is just middle of the pack in population density.

Part B

For this question, you will use the data file ‘nes.rda’ (which will require you to use load(“nes.rda”) to read in the data.) The codebook, called ‘nes2012_codebook.pdf’ is also available to you on Canvas. This data comes from the ANES 2012 Time Series Study, which looks at attitudes toward political ideologies and groups, among many other things.

According to this survey, of those who claimed to have voted in the 2008 election, what percentage of survey respondents voted for Barack Obama in 2008? (Hint: you will need to search the codebook to find the variables ‘interest_voted2008’ and ‘interest_whovote2008’ in order to clean them correctly.)

Percent of respondents that voted for Obama in 2008
Voted in 2008	Voted for Obama	Percentage for Obama
1371	923	67.32312

A ‘Feeling Thermometer’ is a type of survey question that asks respondents to rate how warmly or cool they feel toward an individual or group. A feeling thermometer score of 100 indicates a respondent feels the most positive toward that entity. A feeling score of 0 indicates the respondent feels most negative about that entity. A score of 50 indicates indifference. Using the variable that records the feeling thermometer score towards the ‘Federal Government in Washington,’ clean the variable to only include scores between 0 and 100. (Use the codebook to locate the ‘ftgr_fedgov’ variable to clean it properly.)

Fed Gov Feeling Thermometer Clean Data Example
Case ID	Feeling Therm.
2	85
3	100
4	60
5	70
6	25

Using the cleaned variable, what is the average feeling thermometer for the Federal Government in Washington, according to this survey?

Fed Gov Avg. Feeling Thermometer
Avg. FT
52.48652

Using the ‘prevote_regpty’ variable, create a new variable that indicates whether a respondent is a Democrat or a Republican. All other political affiliations or unknowns should be set to ‘NA.’ (Use the codebook to clean this variable correctly.)

Registered Party Data Example
caseid	prevote_regpty	ftgr_fedgov	Party
2	-1	85	NA
3	-1	100	NA
4	-1	60	NA
5	-1	70	NA
6	-1	25	NA
7	-1	85	NA
8	-1	100	NA
9	-1	40	NA
10	2	60	Rep
11	2	30	Rep

Find the difference in means between the average feeling thermometer score for Democrats vs. Republicans. What do you conclude?

Fed. Gov. Feeling Thermometer by Party
Democrat	Republican
59.57834	40.29787

At almost 20 points higher, the average feeling thermometer score shows that Democrats have a much warmer feeling about the Federal Government than their Republican peers though, at only 59.58, their feeling is only about 10 points above indifference (so positive, but not overly so).

knitr::opts_chunk$set(echo = TRUE)

library(tidyr)
library(dplyr)
library(readr)
library(lubridate)
library(knitr)
library(stringr)
library(readxl)
library(tidyverse)
library(ggplot2)
library(ggrepel)

options(warn=-1)

setwd("c:/Users/trega/Dropbox/DATA2100/Final")

al.pop<-read.csv("sub-est2016_1.csv")
ak.pop<-read.csv("sub-est2016_2.csv") 



pop.data<-rbind(al.pop,ak.pop)

kable(pop.data[1104:1107,1:10], align='c', 
      caption="Population Data Merge Example")


all.pop<-read.csv("sub-est2016_all.csv")

ustates<-as.data.frame(unique(all.pop$STNAME))

kable(list(ustates[1:15,], ustates[16:30,], ustates[31:45,], ustates[46:51,]), align = 'c',
  col.names = "State",
  caption="Unique States from 'All Population' Data Set") 


allpopsub<-subset(all.pop, select=c("NAME","STNAME","POPESTIMATE2012"))

kable(head(allpopsub,5), align='c', 
      col.names = c('City/State Name','State','Population Est.'),
      caption = "All Population Subset Example")


allstatesub<- distinct(subset(allpopsub, NAME==STNAME,
                      select=c("NAME","POPESTIMATE2012")
                      ))

kable(dim(allstatesub), align = 'c',
      col.names = "Dims",
      caption = "Dataframe Size")

kable(head(allstatesub,5), align = 'c',
      col.names = c('State','Population Est.'),
      caption="Removal of State and Town Example")


state.area<-read.csv("https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-areas.csv")

state.merge<- left_join(allstatesub,state.area,join_by(NAME==state))

kable(head(state.merge,5), align = 'c',
      col.names = c('Name','Population Est.', 'Area (Sq. Mile)'),
      caption = "State Pop. & Area Merge Example")


state.merge<-mutate(state.merge,
                    PopDen = POPESTIMATE2012 / area..sq..mi.)

kable(head(state.merge,5),align = 'c',
      col.names = c('State','Population Est.','Area (Sq. Mile)','Density'),
      caption = "Population Density Addition Example")


econ.data<-read.csv("ECN_2012_US_52A1.csv")

econ.data<-econ.data[c(2:3800),]

kable(econ.data[1:5,1:5], align = 'c',
      caption = "Econ. Data Clean Example (1:5 & 1:5)")


bysector<- econ.data %>% 
  group_by(GEO.display.label) %>% 
  summarise(ttlrev = sum(as.numeric(RCPTOT), na.rm = TRUE))

kable(head(bysector,5),align = 'c',
      col.names = c('State','Total Revenue'),
      caption = "Total State Revenue by Sector (Fin. & Ins.) Example")



area.rev.merge<-left_join(state.merge,bysector,join_by(NAME==GEO.display.label))

kable(head(area.rev.merge,5), align = 'c',
      col.names = c('State','Population Est.','Area','Density','Total Revenue'),
      caption = "Total State Data Merge Example")



area.rev.merge<-mutate(area.rev.merge,
                       revbyden = round(ttlrev / PopDen,0), 
                       ttlrev = ttlrev/1000000000)

ggplot(area.rev.merge, mapping = aes(ttlrev, PopDen, color= NAME)) + 
  geom_point()+
  geom_text_repel(
    label=area.rev.merge$NAME,
    nudge_x = 0.25, nudge_y = 0.75,
    check_overlap=T)+
  labs(title="Population Density vs. Total Revenue (by State)", 
       x = "Total Revenue (in Billions)", 
       y="Population Density (People/Sq. Mile)") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "")


load("nes.rda")

nesobama<-subset(nes,select = c("caseid","interest_voted2008", "interest_whovote2008"))

nesobama1<- data.frame(Voted = sum(nesobama$interest_voted2008==1), 
                       Obama = sum(nesobama$interest_whovote2008==1))

nesobama1<-mutate(nesobama1, 
                  Percent = Obama / Voted *100)

kable(nesobama1, align = 'c',
      col.names = c("Voted in 2008", "Voted for Obama", "Percentage for Obama"),
      caption = "Percent of respondents that voted for Obama in 2008")


nesFT<- subset(nes, select = c("caseid", "ftgr_fedgov"))

nesFT<-filter(nesFT, ftgr_fedgov>=0)

kable(head(nesFT,5), align = 'c', 
      col.names = c("Case ID",
      "Feeling Therm."), 
      caption = "Fed Gov Feeling Thermometer Clean Data Example")


nesFT<-data.frame(Mean = mean(nesFT$ftgr_fedgov))

kable(nesFT, align = 'c', 
      col.names = "Avg. FT", 
      caption = "Fed Gov Avg. Feeling Thermometer")


nesparty<-subset(nes, select = c("caseid","prevote_regpty","ftgr_fedgov"))

nesparty<-filter(nesparty,ftgr_fedgov>=0)

nesparty<-mutate(nesparty, 
                 Party = ifelse(prevote_regpty == 1, "Dem", 
                                ifelse(prevote_regpty == 2, "Rep", "NA")))

kable(head(nesparty,10),align = 'c',
      caption = "Registered Party Data Example")


nesparty<-mutate(nesparty, 
                 FTDem = ifelse(Party == "Dem", ftgr_fedgov, NA), 
                 FTRep = ifelse(Party == "Rep", ftgr_fedgov, NA))

nesparty<-data.frame(DEMFT = mean(nesparty$FTDem, na.rm=TRUE), 
                     REPFT = mean(nesparty$FTRep, na.rm=TRUE))

kable(nesparty, align = 'c', 
      col.names = c("Democrat", "Republican"), 
      caption = "Fed. Gov. Feeling Thermometer by Party")

Final Exam

Tom Regan

2023-12-17

Final Exam - Data 2100

caseid	prevote_regpty	ftgr_fedgov	Party
2	-1	85	NA
3	-1	100	NA
4	-1	60	NA
5	-1	70	NA
6	-1	25	NA
7	-1	85	NA
8	-1	100	NA
9	-1	40	NA
10	2	60	Rep
11	2	30	Rep

caseid	prevote_regpty	ftgr_fedgov	Party
2	-1	85	NA
3	-1	100	NA
4	-1	60	NA
5	-1	70	NA
6	-1	25	NA
7	-1	85	NA
8	-1	100	NA
9	-1	40	NA
10	2	60	Rep
11	2	30	Rep

caseid	prevote_regpty	ftgr_fedgov	Party
2	-1	85	NA
3	-1	100	NA
4	-1	60	NA
5	-1	70	NA
6	-1	25	NA
7	-1	85	NA
8	-1	100	NA
9	-1	40	NA
10	2	60	Rep
11	2	30	Rep