The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to:
(1) Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 5 assignment!) For each of the three chosen datasets:
(2) Please include in your homework submission, for each of the three chosen datasets:
Load the required libraries for the Rmd file to run
The link for this data was provided in the CUNY SPS DATA 607 discussion thread for Week 5.
Author: Donghwan Kim
Date: Saturday, September 26, 2020 10:53:46 AM EDT
Subject: Student’s Residency.
The data was then accessed from the Kent State SPSS Tutorial page on Monday, September 28, 2020 12:35:46 PM EDT. It was the last table on the page titled: Class Rank *Do you live on campus?* State of residence Crosstabulation
livingdata <-
getURL("https://raw.githubusercontent.com/sbiguzzi/data607project2/master/class-rank-table.csv")college <-
rename(college, c("State" = "X.U.FEFF.State.of.residence", "Class" = "X.1",
"OffCampus" = "Do.you.live.on.campus", "OnCampus" = "X.2"))#Convert blanks to NAs
college <- na_if(college,"")
#Fill resident types
college <- fill(college,State,.direction = "down")
#Replace class rank na with total
college$Class <- ifelse(is.na(college$Class),"Total",college$Class)#Drop rows and columns
college <- college[!(college$State=="Total" | college$Class=="Total"),-c(2,6)]
#Reset index
rownames(college) <- NULL#Create long data
college <- college %>%
gather(Residence,Students,OffCampus:OnCampus, factor_key = TRUE)
#Convert student column to numeric
college$Students <- as.numeric(college$Students)#Create percent change
college <- college %>%
group_by(State,Residence) %>%
mutate(residence_pctchg = ((lead(Students)-Students)/Students))
#Create percent residence by class
college <- college %>%
group_by(State,Residence) %>%
mutate(percent_residence = Students/(Students+lead(Students)))
college$percent_residence <-
ifelse(is.na(college$percent_residence), 1-lag(college$percent_residence), college$percent_residence)From both Figure 1 and Figure 2, it’s clear that off-campus residents are primarily upperclassmen while on-campus residents are primarily underclassmen. Whether they are in state or out of state students has no bearing on a student lives on-campus or off-campus. It would be interesting to find out if this particular college has a rule in place at this college that doesn’t allow underclassmen to live off-campus.
The link for this data was provided in the CUNY SPS DATA 607 discussion thread for Week 5.
Author: Peter Fernandes
Date: Friday, September 25, 2020 3:48:43 AM EDT
Subject: NYC per-capita fuel consumption and CO2 emissions.
The data was then accessed from the The Contribution of Urban Areas to Climate Change: New York City Case Study page on Tuesday, September 29, 2020 08:56 PM EDT. Under the Figures section, click until you reach Table 3.
* I did not add the the NY Metro Area, United States, Tri-State Area, or New York City averages because the assignment required to find some of these numbers. *
energydata <-
getURL("https://raw.githubusercontent.com/sbiguzzi/data607project2/master/fuel-consumption.csv")energy.data <-
rename(energy.data, c("Area" = "X.U.FEFF.Category....of.counties.",
"BuildingAndIndustry_GJ" = "Building.and.Industry",
"BuildingAndIndustry_MtCO2" = "X",
"Transportation_GJ" = "Transportation",
"Transportation_MtCO2" = "X.1"))#Create land type column
energy.data$LandType <-
gsub(".*\\s+(.*)+\\s\\(.*","\\1",energy.data$Area)
#Create land type count
energy.data$LandTypeCount <-
as.integer(gsub(".*\\(+(.*)+\\)","\\1",energy.data$Area))
#Create area column
energy.data$Area <- gsub(" .*$","",energy.data$Area)#Create long format
energy.data <- energy.data %>%
gather(Industry,Consumption,BuildingAndIndustry_GJ:Transportation_MtCO2, factor_key = TRUE)
#Change consumption to numeric
energy.data$Consumption <- as.numeric(energy.data$Consumption)#Create energy type column
energy.data$EnergyType <-
factor(gsub(".*_(.*)","\\1",energy.data$Industry), levels = c("GJ","MtCO2"))
#Update industry column
energy.data$Industry <- gsub("(.*)_.*","\\1",energy.data$Industry)#Create average consumption by area and fuel type
energy.data <- energy.data %>%
group_by(Area,EnergyType) %>%
mutate(area_consumption = mean(Consumption))
#Create average consumption by area and industry type
energy.data <- energy.data %>%
group_by(Area,Industry,EnergyType) %>%
mutate(area_industry_consumption = mean(Consumption))The NYC metro area used 22.13 gigajoules (GJ) of energy per capita less than the United States for fiscal year 2007. However, for the same fiscal year, the metric tons of CO\({_2}\) that were emitted in NYC were only slightly less than the US per capita, with a difference of 0.95 Metric Tons (Mt). Is there a specific industry that drove the emissions per capita rate for NYC to be closer to that of the US vs the per capita rate of energy consumption?
The outlier for NY has a skewing effect on the mean energy consumption and CO\({_2}\) emission. Mean CO\({_2}\) emissions for NY metro area was 5.6 Mt while the median CO\({_2}\) emissions was just 3.5 Mt. Mean energy consumption for for NY metro area was 86.3 GJ, while the median was only 56.8 GJ. While there is a a difference between the mean US energy consumption(108.4 GJ) and median energy consumption (96.4 GJ), the median and mean for US CO\({_2}\) emissions were 6.4 and 6.6 Mt, respectively.
The outliers are most likely under the transportation label for both energy consumption and CO\(_{2}\) emissions. The average transportation energy consumption in NY was 0.253 more than the US and the CO\(_{2}\) emissions of the transportation industry in NY were 0.246 more than the transportation emissions in the US. Conversely, the building and industry sector in NY had a rate of energy consumption 0.615 lower than the US, and a rate of CO\(_{2}\) emissions 0.565 lower than the US.
The transportation outliers increase the average energy consumption and C0\(_{2}\) emissions for NY, which skews the data and does not show the full picture, especially for C0\(_{2}\) emissions. When splitting up the emissions by sector NY Metro area seems to be doing much better than the United States. Looking at the data even closer we see that the real outliers are being driven by the one mixed urban county, Putnam County, for NY metro area. One solution to this would be to use weighted averages when calculating the average consumption and emission for NY Metro Area.
One note about the data and the accuracy of the analysis. It’s unknown whether the US counts include the NY Metro Area or not. It would be nice to know and also would make for a more accurate comparison when comparing the NY metro area to the rest of the US.
For future analysis it would be good to understand, one, what is driving the transportation fuel consumption and CO\(_{2}\) emission in Putnam County. Additionally, for future analysis it might be more accurate to use a weighted average with the weight being equal to the percent make-up of counties in NY metro area. And finally, it might be more interesting to compare NY Metro Area to other metro areas in the US and the world rather than all of the US.
The link for this data was provided in the CUNY SPS DATA 607 discussion thread for Week 5.
Author: Karim Hammoud
Date: Monday, September 28, 2020 11:50:01 PM EDT
Subject: New Hampshire state budget and finances
The data was then scraped from the BallotPedia page, State government tax collections by source on Wednesday, September 30, 2020 10:00 PM EDT.
The table that was scraped is titled State tax collections by source ($ in thousands), 2016
tax.data <- rename(tax.data,
c("State" = "V1", "Property_Tax" = "V2",
"Sales_and_gross_receipts" = "V3",
"Licenses" = "V4", "Income_taxes" = "V5",
"Other_taxes" = "V6","Total" = "V7",
"Pop2016" = "V8", "PerCapitaRev" = "V9"))for (i in names(tax.data)) {
if (i != "State"){
tax.data[,i] <- as.numeric(gsub("[\\$]|[,]","",tax.data[,i]))
}
else{
tax.data[,i] <- tax.data[,i]
}
}for (i in names(tax.data[,c(2:6)])) {
tax.data[,paste(i,"pct",sep="_")] = round(tax.data[,i]/tax.data$Total,7)
}In the 2016 tax year some states collect a lot of tax revenue, while other states collected much less. To help put the data in perspective a median line was drawn in Figure 1. The median tax revenue for the 50 states in 2016 was 10.5M dollars. California was the top earning state with almost fifteen times the median at 155M dollars, while Alaska was at the lowest earning state with only around one tenth of the median at 1.04M dollars. Is this difference solely due to the population difference?
By normalizing the tax collected from total to per capita, it doesn’t seem like the disparity between California’s tax revenue and Alaska’s tax revenue has to do with population. Alaska’s per capita rate is $1,405 while California’s is $3,955. Is there a difference in the make up of the total tax revenue between California and Alaska. That is, does the percent at which each tax category contributes to the total tax revenue differ between California and Alaska?
Continuing our analysis from the previous section, California relies more on Income and Sales taxes, while Alaska has a pretty even distribution of all the five tax categories. It seems that the two states due differ in how much they rely on each tax category for the total tax revenue for 2016. There is not enough evidence to conclude that this is the only reason for a difference in tax revenue.
The difference between the highest earning state, California, and the lowest earning state, Alaska, is not based on population and cannot be solely attributed to the tax category. There must be something else that has to account for this difference.
One possible continuation of this analysis is looking at is tax rate for each tax category. The assumption here is that the difference is caused by tax rate. However, another possibility is the cost of living in California as a whole is higher than in Alaska. For example, if the income tax rates are the same, is California’s average income higher than Alaska’s, or is the price of goods in California greater than in Alaska. Additional data is needed to to continue this analysis.