In this project, we were tasked with choosing three datasets from a set aggregated by our peers in the preceding week and running an analysis proposed in our weekly discussion. Some of the proposed analyses were not very specific, so we took some liberties when outlining our objectives. The module focused on the tidyr and dplyr packages, as well as the principles of Tidy Data.
library(RCurl)
library(tidyr)
library(dplyr)
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/pbp-2019.csv")
RawTable <- read.csv(text = x)
For my first study, I wanted to dig into a large dataset outlining the outcomes of over 40,000 NFL plays over the course of the 2019 season. This dataset is very wide, with 41 columns, and contains information about the many events that can occur on a play.
I wanted to test a hypothesis I have about the game. At the risk of over-simplifying: at the conclusion of a play in the NFL, the game clock continues to run provided the attempted play was “complete” and the ball remained in bounds. If there is an incomplete pass, a turnover, or the ball carrier steps out of bounds, the game clock stops.
As a result, when the team that is leading has possession late in the game, it is common to run the ball frequently in order to use more time per possession, giving the opponent less opportunity to come back. Similarly, in cases where the outcome of the game is all but decided (the winning team has a substantial lead), it is relatively common for both teams to run the ball for the sake of sportsmanship and player safety.
There is a third option relevant in a study like this: the offense can purposely kneel. This is a low-risk “play” in which the offense does not attempt to advance the ball but is guaranteed to keep the clock running, and it is common at the end of a game. I elected to limit my analysis to run vs. pass plays, as I felt the run/pass split could serve as an indicator of playing style and could be reused in other studies of the sport.
It is my perception that few games remain competitive for their full duration, and thus that the rate of running plays relative to passing plays is higher in the 4th quarter than in the other periods. I wanted to see whether the data reflected this.
# Keep only the fields relevant to this question.
NFLTable <- data.frame(RawTable$OffenseTeam, RawTable$IsRush, RawTable$IsPass, RawTable$Quarter)
colnames(NFLTable) <- c("Team", "Rush", "Pass", "Quarter")

# Total rushing and passing plays for each team in each quarter.
nfl2 <- NFLTable %>%
  group_by(Team, Quarter) %>%
  summarise(RushTotal = sum(Rush), PassTotal = sum(Pass))
After importing the data, we grouped by Team and Quarter and calculated the total number of rushing and passing plays per team.
NFLTableF <- nfl2[!(nfl2$Team == "" | nfl2$Quarter == 5),]
We then removed overtime (Quarter 5) from our study, along with a few rows where the offensive team was a null value. Finally, we added a new column giving the percent of offensive plays that were rushing plays (special teams and kneeling situations are excluded, as those plays count as neither a rush nor a pass).
NFLTableF$RushPercent <- NFLTableF$RushTotal/(NFLTableF$RushTotal+NFLTableF$PassTotal)
NFLTableF
## # A tibble: 128 x 5
## # Groups: Team [32]
## Team Quarter RushTotal PassTotal RushPercent
## <fct> <int> <int> <int> <dbl>
## 1 ARI 1 95 119 0.444
## 2 ARI 2 100 147 0.405
## 3 ARI 3 72 125 0.365
## 4 ARI 4 106 132 0.445
## 5 ATL 1 108 141 0.434
## 6 ATL 2 73 179 0.290
## 7 ATL 3 85 139 0.379
## 8 ATL 4 66 207 0.242
## 9 BAL 1 110 99 0.526
## 10 BAL 2 93 108 0.463
## # ... with 118 more rows
# Equal weight per team: average the team-level rush percentages by quarter.
NFLSummary <- NFLTableF %>%
  group_by(Quarter) %>%
  summarise(AverageRushPercent = mean(RushPercent))
NFLSummary
## # A tibble: 4 x 2
## Quarter AverageRushPercent
## <int> <dbl>
## 1 1 0.442
## 2 2 0.372
## 3 3 0.439
## 4 4 0.397
I debated the best way to calculate our summary. One option was to total rushing and passing plays across the whole league; this would weight our findings by volume, giving more influence to teams that attempt more offensive plays. As each team plays the same number of games, the end result would be that successful teams count for more (they tend to run a higher share of the offensive plays in their games). Alternatively, we could give each team equal weight and take the average of the team averages for each quarter. As this analysis is, at its heart, about playing style and whether teams on average change that style over the course of a game (not about the probability that a given play is a run or a pass), I elected to use the latter method.
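For reference, the play-weighted alternative is a small change to the same pipeline (a sketch, not part of the analysis above): pool every team's plays within each quarter before dividing, so teams with more offensive snaps count for more.
NFLTableF %>%
  group_by(Quarter) %>%
  summarise(LeagueRushPercent = sum(RushTotal)/(sum(RushTotal) + sum(PassTotal)))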
The results were surprising: on average, rushing is actually less common in the 4th quarter than it is for the game as a whole.
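As a hypothetical follow-up (not run here), one could see which teams drive this pattern by comparing each team's 4th-quarter rush rate against its 1st-quarter rate:
NFLTableF %>%
  group_by(Team) %>%
  summarise(Q4MinusQ1 = RushPercent[Quarter == 4] - RushPercent[Quarter == 1]) %>%
  arrange(desc(Q4MinusQ1))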
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/mbta.csv")
RawTable <- read.csv(text = x)
For my next analysis, I took a look at ridership data in Boston across several modes of transportation over several years. I have never been to Boston; however, I live in New York City, relatively close to one of our newer ferry terminals. I enjoy taking the ferry on a nice day, but it never crosses my mind to take a boat during the winter. With that in mind, I wanted to see how seasonal boat travel in Boston is, normalized against the seasonality of mass transit across all modes of travel.
I elected to pivot the table to its longest form, then truncate the strings in the Month column so they were identical regardless of year. From there, I was able to group by month and mode of transportation, and pivot back to a slightly wider format for analysis.
RawTable2 <- RawTable %>%
  pivot_longer(c("X2007.01":"X2011.10"), names_to = "Month", values_to = "Rider Count")

# "X2007.01" -> "01": keep only the month portion so the years collapse together.
RawTable2$Month <- substring(RawTable2$Month,7,8)

# Sum ridership by calendar month and mode, then spread the modes back out into columns.
TranspoTable <- RawTable2 %>%
  group_by(Month,mode) %>%
  summarise(`Rider Count` = sum(`Rider Count`))
TidyTranspo <- TranspoTable %>%
  pivot_wider(names_from = mode, values_from = `Rider Count`)
# Drop the two non-mode summary columns from the source ("All Modes by Qtr" and "Pct Chg / Yr").
TidyTranspo <- TidyTranspo[ -c(2,8)]

# Express each mode as a share of that month's total ridership, then drop the TOTAL column.
TidyTranspo$Boat <- TidyTranspo$Boat/TidyTranspo$TOTAL
TidyTranspo$Bus <- TidyTranspo$Bus/TidyTranspo$TOTAL
TidyTranspo$`Commuter Rail` <- TidyTranspo$`Commuter Rail`/TidyTranspo$TOTAL
TidyTranspo$`Heavy Rail` <- TidyTranspo$`Heavy Rail`/TidyTranspo$TOTAL
TidyTranspo$`Light Rail` <- TidyTranspo$`Light Rail`/TidyTranspo$TOTAL
TidyTranspo$`Private Bus` <- TidyTranspo$`Private Bus`/TidyTranspo$TOTAL
TidyTranspo$RIDE <- TidyTranspo$RIDE/TidyTranspo$TOTAL
TidyTranspo$`Trackless Trolley` <- TidyTranspo$`Trackless Trolley`/TidyTranspo$TOTAL
TidyTranspo <- TidyTranspo[ -9]
TidyTranspo
## # A tibble: 12 x 9
## # Groups: Month [12]
## Month Boat Bus `Commuter Rail` `Heavy Rail` `Light Rail` `Private Bus`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01 0.00280 0.290 0.116 0.389 0.184 0.00294
## 2 02 0.00267 0.287 0.111 0.393 0.188 0.00280
## 3 03 0.00871 0.292 0.110 0.390 0.187 0.00265
## 4 04 0.00331 0.287 0.107 0.392 0.193 0.00264
## 5 05 0.00346 0.291 0.109 0.392 0.186 0.00270
## 6 06 0.00425 0.286 0.110 0.395 0.187 0.00275
## 7 07 0.00511 0.284 0.109 0.395 0.190 0.00261
## 8 08 0.00513 0.287 0.113 0.392 0.187 0.00260
## 9 09 0.00372 0.290 0.105 0.393 0.191 0.00265
## 10 10 0.00346 0.291 0.107 0.397 0.184 0.00262
## 11 11 0.00294 0.285 0.113 0.396 0.186 0.00276
## 12 12 0.00270 0.286 0.117 0.394 0.183 0.00262
## # ... with 2 more variables: RIDE <dbl>, `Trackless Trolley` <dbl>
As expected, we see the most seasonal variation in the Boat mode; however, there is an outlier. In March, Boat travel appears to more than triple over February before dropping again in April. Looking into the source data, we see a value of 40,000 for Boat travel in March 2007, roughly 10x the values for February 2007, April 2007, and March of 2008-2010. For this reason, I think we can assume this value was meant to be 4,000 and that there was an error in data entry.
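One quick way to surface the suspect reading (a sketch, not part of the original pipeline) is to re-pivot the raw table without truncating the year, then sort the Boat rows by ridership:
RawTable %>%
  pivot_longer(c("X2007.01":"X2011.10"), names_to = "Month", values_to = "Rider Count") %>%
  filter(mode == "Boat") %>%
  arrange(desc(`Rider Count`)) %>%
  head()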
I will change said value (and that of the March 2007 total) and re-run our analysis:
# The raw file appears to record ridership in thousands: the March 2007 Boat
# value of 40 (i.e., 40,000 trips) becomes 4, and the March 2007 TOTAL row is
# reduced by the same 36, to 1168.
RawTable[2,5] = 4
RawTable[11,5] = 1168
# Re-run the same pipeline on the corrected data.
RawTable2 <- RawTable %>%
  pivot_longer(c("X2007.01":"X2011.10"), names_to = "Month", values_to = "Rider Count")
RawTable2$Month <- substring(RawTable2$Month,7,8)
TranspoTable <- RawTable2 %>%
  group_by(Month,mode) %>%
  summarise(`Rider Count` = sum(`Rider Count`))
TidyTranspo <- TranspoTable %>%
  pivot_wider(names_from = mode, values_from = `Rider Count`)
TidyTranspo <- TidyTranspo[ -c(2,8)]
TidyTranspo$Boat <- TidyTranspo$Boat/TidyTranspo$TOTAL
TidyTranspo$Bus <- TidyTranspo$Bus/TidyTranspo$TOTAL
TidyTranspo$`Commuter Rail` <- TidyTranspo$`Commuter Rail`/TidyTranspo$TOTAL
TidyTranspo$`Heavy Rail` <- TidyTranspo$`Heavy Rail`/TidyTranspo$TOTAL
TidyTranspo$`Light Rail` <- TidyTranspo$`Light Rail`/TidyTranspo$TOTAL
TidyTranspo$`Private Bus` <- TidyTranspo$`Private Bus`/TidyTranspo$TOTAL
TidyTranspo$RIDE <- TidyTranspo$RIDE/TidyTranspo$TOTAL
TidyTranspo$`Trackless Trolley` <- TidyTranspo$`Trackless Trolley`/TidyTranspo$TOTAL
TidyTranspo <- TidyTranspo[ -9]
TidyTranspo
## # A tibble: 12 x 9
## # Groups: Month [12]
## Month Boat Bus `Commuter Rail` `Heavy Rail` `Light Rail` `Private Bus`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01 0.00280 0.290 0.116 0.389 0.184 0.00294
## 2 02 0.00267 0.287 0.111 0.393 0.188 0.00280
## 3 03 0.00292 0.293 0.111 0.393 0.188 0.00267
## 4 04 0.00331 0.287 0.107 0.392 0.193 0.00264
## 5 05 0.00346 0.291 0.109 0.392 0.186 0.00270
## 6 06 0.00425 0.286 0.110 0.395 0.187 0.00275
## 7 07 0.00511 0.284 0.109 0.395 0.190 0.00261
## 8 08 0.00513 0.287 0.113 0.392 0.187 0.00260
## 9 09 0.00372 0.290 0.105 0.393 0.191 0.00265
## 10 10 0.00346 0.291 0.107 0.397 0.184 0.00262
## 11 11 0.00294 0.285 0.113 0.396 0.186 0.00276
## 12 12 0.00270 0.286 0.117 0.394 0.183 0.00262
## # ... with 2 more variables: RIDE <dbl>, `Trackless Trolley` <dbl>
With this data corrected, the seasonality in Boat travel is much more regular. Additionally, we see that there is virtually no seasonality in any mode of transportation other than boat travel. To quantify this, we can take the coefficient of variation (the standard deviation scaled by the mean) across all months for each mode.
As this function is not offered by any of our loaded libraries, I will write it myself. Also of note: our table is still grouped by Month, so we need to ungroup it first; otherwise summarise would compute one value per month-group rather than one value per mode across all twelve months.
# Ungroup, then drop the Month column so only the numeric mode shares remain.
TidyTranspo <- ungroup(TidyTranspo)
TidyTranspo <- TidyTranspo[,-1]

# Coefficient of variation: standard deviation scaled by the mean.
myFx <- function(x) sd(x)/mean(x)

TidyTranspo %>%
  summarise(across(everything(), myFx))
## # A tibble: 1 x 8
## Boat Bus `Commuter Rail` `Heavy Rail` `Light Rail` `Private Bus` RIDE
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.246 0.00983 0.0316 0.00543 0.0159 0.0376 0.0286
## # ... with 1 more variable: `Trackless Trolley` <dbl>
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/2010__AP__College_Board__School_Level_Results.csv")
RawTable <- read.csv(text = x)
For my third analysis, I wanted to explore a project that involved joining two tables. The analysis of test scores suggested by Kevin Potter in our discussion last week seemed appropriate: “Compare the schools to see which one had the highest performing ratio of exams with a 3, 4, or 5 and cross-reference with the location of the school.” The initial dataset does not include school location; however, I found another table from the same source that does, and both tables share a unique identifier, DBN.
As this dataset is from the New York City Department of Education, I thought it would be interesting to look at test success rates by borough. Specifically: which borough has the highest AP test pass rate?
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/2018_DOE_High_School_Directory%20(1).csv")
SchoolTable <- read.csv(text = x)
Looking through the two tables, the first thing that stands out is that DBN and school name are the only shared fields. The naming conventions in these fields differ slightly, and of the two, school name is the more likely to have small variations from table to table. For these reasons, I decided it would be easiest to rename the identifier in the school-location table (dbn) to match the other table (DBN) and perform a natural left join.
SchoolTable <- SchoolTable %>%
  rename(DBN = dbn)
JoinedTable <- left_join(RawTable, SchoolTable)
## Joining, by = "DBN"
## Warning: Column `DBN` joining factors with different levels, coercing to
## character vector
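As an aside, dplyr's join functions can also map differently named keys directly, so the rename could have been skipped (a sketch assuming SchoolTable still had its original lower-case dbn column):
JoinedTable <- left_join(RawTable, SchoolTable, by = c("DBN" = "dbn"))
I kept the explicit rename so the key is spelled the same way in both tables.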
With all of our information in one table, we can proceed with our analysis:
# Keep the relevant columns, drop schools with missing (NA) test-taker or
# borough values, and compute each school's pass rate.
ApTest <- data.frame(JoinedTable$SchoolName, JoinedTable$AP.Test.Takers, JoinedTable$Total.Exams.Taken, JoinedTable$Number.of.Exams.with.scores.3.4.or.5, JoinedTable$boro)
ApTest <- ApTest[ !is.na(ApTest$JoinedTable.AP.Test.Takers),]
ApTest <- ApTest[ !is.na(ApTest$JoinedTable.boro),]
ApTest$AverageSuccess <- ApTest$JoinedTable.Number.of.Exams.with.scores.3.4.or.5/ApTest$JoinedTable.Total.Exams.Taken
colnames(ApTest) <- c("School Name", "Test Takers", "Attempts", "Success Count", "Borough", "SuccessRate")
head(ApTest, 10)
## School Name Test Takers Attempts
## 1 UNIVERSITY NEIGHBORHOOD H.S. 39 49
## 2 EAST SIDE COMMUNITY HS 19 21
## 4 NEW EXPLORATIONS SCI,TECH,MATH 255 377
## 6 Pace High School 21 21
## 7 Urban Assembly School of Design and Construction, 99 117
## 8 Facing History School, The 42 44
## 9 Urban Assembly Academy of Government and Law, The 25 37
## 11 HS FOR ENVIRONMENTAL STUDIES 213 298
## 12 PROFESSIONAL PERFORMING ARTS 20 20
## 13 BARUCH COLLEGE CAMPUS HS 78 115
## Success Count Borough SuccessRate
## 1 10 M 0.20408163
## 2 NA M NA
## 4 191 M 0.50663130
## 6 NA M NA
## 7 10 M 0.08547009
## 8 NA M NA
## 9 15 M 0.40540541
## 11 152 M 0.51006711
## 12 15 M 0.75000000
## 13 88 M 0.76521739
After removing schools that did not have borough information, as well as those that did not have any test takers, we calculated the average success rate per school. Thinking through our uses of this information, I chose to leave this table as-is and build a second table for the borough-level analysis.
Of note: I will not use the per-school success rate in my borough table, as averaging it would skew results toward schools with fewer test takers. Rethinking our question of “which borough has the highest test success rate?”, it is clear that the answer does not, on its own, tell us which borough is “better” at preparing students for tests. A borough that only allows students who are likely to pass to sit an exam will naturally post more favorable results. To round out our analysis, I will also provide the total attempt and pass counts for each borough.
ApTestbyB <- ApTest %>%
  group_by(Borough) %>%
  summarise(`Success Count` = sum(`Success Count`, na.rm = TRUE), Attempts = sum(Attempts, na.rm = TRUE))
ApTestbyB$SuccessRate <- ApTestbyB$`Success Count`/ApTestbyB$Attempts
ApTestbyB
## # A tibble: 5 x 4
## Borough `Success Count` Attempts SuccessRate
## <fct> <int> <int> <dbl>
## 1 K 5668 11598 0.489
## 2 M 5821 9272 0.628
## 3 Q 4944 10504 0.471
## 4 R 1484 2960 0.501
## 5 X 3547 6570 0.540
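For contrast, the unweighted average of the per-school rates (the approach argued against above) is a one-liner; it would give a school with 20 test takers the same weight as one with 300:
ApTest %>%
  group_by(Borough) %>%
  summarise(MeanSchoolSuccessRate = mean(SuccessRate, na.rm = TRUE))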
As we can see above, Manhattan (M) has the highest passing percentage, and it also has the highest passing count despite testing fewer students than Brooklyn (K, for Kings County) or Queens (Q). The Bronx (X) and especially Staten Island (R, for Richmond County) test far fewer students, though the Bronx still passes more than half of its attempts. Seeing as the students whom Manhattan, and to a lesser extent the Bronx, choose to test are significantly more likely to pass, we can make inferences about selection and student pathing in these schools. A child who is on the cusp of being “AP ready” might have a significantly different experience depending on the borough in which they attend high school.