Working with data sets that are messy or in a format not conducive to downstream analysis is a major task that every data scientist must face. In this project, three messy datasets are tidied, primarily using the ubiquitous dplyr and tidyr packages, and some analysis is provided as requested. All input .csv tables are available at the GitHub repository https://github.com/robertwelk/DATA607_Project2.git
library(dplyr)
library(stringr)
library(magrittr)
library(tidyr)
library(ggplot2)
library(usmap)
library(mapdata)
This data comes from ESPN and has some immediately noticeable issues: columns contain multiple values separated by commas, and one column lacks structure altogether. Luckily, the tools of the tidyverse are well equipped to tackle these issues, pun intended.
Read in data and look at the structure
mvp.data <- read.csv('Project2_tbl1.csv', header=T, stringsAsFactors = F)
#Examine the structure of the raw data
str(mvp.data)
## 'data.frame': 53 obs. of 3 variables:
## $ NO. : chr "I" "II" "III" "IV" ...
## $ PLAYER : chr "Bart Starr, QB, Green Bay" "Bart Starr, QB, Green Bay" "Joe Namath, QB, New York Jets" "Len Dawson, QB, Kansas City" ...
## $ HIGHLIGHTS: chr "Two touchdown passes" "202 yards passing, 1 TD" "206 yards passing" "142 yards passing, 1 TD" ...
#Rename columns
colnames(mvp.data) <- c('superbowlID', 'player', 'stats')
Parsing of ‘player’ variable
The 'player' column (column 2) of the raw data contains three variables: player name, position, and team, separated by commas. Tidy data should not pack three variables into a single column. The 'separate()' function was designed to solve exactly this kind of problem. Note that the pipe operator can be used for assignment (%<>%) thanks to the magrittr package.
# parse column 2
mvp.data %<>% separate(player,c('name', 'position','team'),sep=",")
#trim
mvp.data$team <- str_trim(mvp.data$team, side="both")
mvp.data$position <- str_trim(mvp.data$position, side="both")
Parse & restructure ‘stats’ variable
This task is more complicated than the previous step. Not only does the column need to be parsed, but the ordering of the stats is inconsistent, and not every player has the same set of stats since they play different positions. For the downstream analysis we are primarily concerned with yards from scrimmage and touchdowns scored. Regular expressions are used in conjunction with a 'for' loop to place the stats in their respective columns.
# standardize vocabulary: abbreviate 'touchdown(s)' to 'TD' and spell 'two' as '2'
mvp.data$stats %<>% str_replace_all('touchdown[s]?', 'TD') %>%
str_replace_all('[Tt]wo', '2')
# create variables within the df to store stats of interest
mvp.data$TD <- NA
mvp.data$yards<-NA
# Set up the loop - for each player, the number of touchdowns scored and yards from scrimmage are extracted into the designated columns (as strings; yards is coerced to integer below)
for(i in 1:nrow(mvp.data)){
mvp.data$TD[i] <- str_extract(mvp.data$stats[i], '[[:digit:]] TD') %>%
str_replace('TD','')
mvp.data$yards[i] <- str_extract(mvp.data$stats[i], '[[:digit:]]{2,3} yards') %>%
str_replace('yards','') %>%
str_trim(side = 'both')
}
#remove the stats column
mvp.data <- mvp.data %>% select(-stats)
#Coerce to integer data type
mvp.data$yards <- as.integer(mvp.data$yards)
#assume that NA values of TDs mean no touchdown was scored
mvp.data$TD[is.na(mvp.data$TD)] <- 0
head(mvp.data)
## superbowlID name position team TD yards
## 1 I Bart Starr QB Green Bay 2 NA
## 2 II Bart Starr QB Green Bay 1 202
## 3 III Joe Namath QB New York Jets 0 206
## 4 IV Len Dawson QB Kansas City 1 142
## 5 V Chuck Howley LB Dallas 0 NA
## 6 VI Roger Staubach QB Dallas 2 119
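As an aside, stringr functions are vectorized, so the loop above is not strictly necessary. A minimal sketch of an equivalent whole-column version, which would run before the stats column is dropped (str_remove() from stringr and coalesce() from dplyr are assumed available):
# vectorized alternative to the loop, run before select(-stats)
mvp.data$TD <- mvp.data$stats %>%
  str_extract('[[:digit:]] TD') %>%
  str_remove(' TD') %>%
  as.integer() %>%
  coalesce(0L)    # as above, treat a missing TD count as zero
mvp.data$yards <- mvp.data$stats %>%
  str_extract('[[:digit:]]{2,3} yards') %>%
  str_remove(' yards') %>%
  as.integer()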
Objectives
a. Find the players with the most MVPs.
b. Find the players with the most total touchdowns and passing or rushing yardage.
c. Find the positions that win the award most frequently.
Players with more than one MVP
The results are not what we wanted: Tom Brady is the leader.
mvp.data %>% count(name) %>% arrange(desc(n)) %>% filter(n>1)
## # A tibble: 5 x 2
## name n
## <chr> <int>
## 1 Tom Brady 4
## 2 Joe Montana 3
## 3 Bart Starr 2
## 4 Eli Manning 2
## 5 Terry Bradshaw 2
Players with the most total touchdowns
More bad news!
mvp.data %>% group_by(name) %>%
summarise(TotalTD=sum(as.numeric(TD))) %>%
arrange(desc(TotalTD))
## # A tibble: 45 x 2
## name TotalTD
## <chr> <dbl>
## 1 Tom Brady 10
## 2 Joe Montana 9
## 3 Steve Young 6
## 4 Terry Bradshaw 6
## 5 Doug Williams 4
## 6 Troy Aikman 4
## 7 Aaron Rodgers 3
## 8 Bart Starr 3
## 9 Eli Manning 3
## 10 Jim Plunkett 3
## # ... with 35 more rows
Players with the most passing yardage - no surprises
# single game
mvp.data %>% filter(position=='QB') %>%
arrange(desc(yards)) %>%
top_n(10)
## Selecting by yards
## superbowlID name position team TD yards
## 1 LI Tom Brady QB New England 2 466
## 2 XXXIV Kurt Warner QB St. Louis 2 414
## 3 LII Nick Foles QB Philadelphia 3 373
## 4 XXXVIII Tom Brady QB New England 3 354
## 5 XXII Doug Williams QB Washington 4 340
## 6 XXXIII John Elway QB Denver 1 336
## 7 XIX Joe Montana QB San Francisco 3 331
## 8 XLIX Tom Brady QB New England 4 328
## 9 XXIX Steve Young QB San Francisco 6 325
## 10 XIII Terry Bradshaw QB Pittsburgh 4 318
#combined
mvp.data %>% filter(position=='QB') %>%
group_by(name) %>%
summarise(Total_yards=sum(yards,na.rm=T)) %>%
arrange(desc(Total_yards)) %>%
top_n(10)
## Selecting by Total_yards
## # A tibble: 10 x 2
## name Total_yards
## <chr> <int>
## 1 Tom Brady 1293
## 2 Joe Montana 785
## 3 Terry Bradshaw 627
## 4 Eli Manning 551
## 5 Kurt Warner 414
## 6 Nick Foles 373
## 7 Doug Williams 340
## 8 John Elway 336
## 9 Steve Young 325
## 10 Aaron Rodgers 304
The most yards in a single game by a non-QB was achieved by Desmond Howard, an unlikely source. Although a Heisman Trophy winner, Howard was a kick-return specialist and was used only sparingly as a wide receiver alongside QB Brett Favre.
mvp.data %>% filter(position!='QB') %>%
arrange(desc(yards)) %>%
top_n(10)
## Selecting by yards
## superbowlID name position team TD yards
## 1 XXXI Desmond Howard KR Green Bay 1 244
## 2 XXIII Jerry Rice WR San Francisco 1 215
## 3 XVIII Marcus Allen RB Los Angeles Raiders 2 191
## 4 XVII John Riggins RB Washington 1 166
## 5 X Lynn Swann WR Pittsburgh 1 161
## 6 IX Franco Harris RB Pittsburgh 1 158
## 7 XXXII Terrell Davis RB Denver 3 157
## 8 VIII Larry Csonka FB Miami 2 145
## 9 LIII Julian Edelman WR New England 0 141
## 10 XXXIX Deion Branch WR New England 0 133
Barplot of position frequencies
- as expected, QBs dominate MVP awards
mvp.data %>% group_by(position) %>%
count() %>%
arrange(desc(n)) %>%
ggplot(aes(x=position,y=n)) +
geom_bar(stat='identity') +
labs(title = "Super Bowl MVP Awards by Position", x="Position", y="Number")
This data set comes from Zillow and provides monthly summaries of rental prices per square foot for each US state in 2018. It is untidy in that it appears in 'wide' format: a single variable (price per square foot) is spread across multiple columns, which is not ideal for most analytical techniques.
Read in data as a csv
rental.price <- read.csv('Project2_tbl2.csv', header=T, stringsAsFactors = F) %>% as_tibble()
The data is tidied by providing descriptive column names and using the 'gather()' function to put it into long form. The result is that, for each state and month, price per square foot appears in a single column instead of twelve (one per month). Now the data is tidy: each variable is in its own column.
month.level <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
colnames(rental.price) <- c('region','SizeRank', month.level)
rental.price %<>% gather('MOY', 'price',3:14)
rental.price
## # A tibble: 612 x 4
## region SizeRank MOY price
## <chr> <int> <chr> <dbl>
## 1 California 1 Jan 1.87
## 2 Texas 2 Jan 0.906
## 3 New York 3 Jan 4.27
## 4 Florida 4 Jan 1.32
## 5 Illinois 5 Jan 1.22
## 6 Pennsylvania 6 Jan 1.03
## 7 Ohio 7 Jan 0.831
## 8 Michigan 8 Jan 0.934
## 9 Georgia 9 Jan 0.772
## 10 North Carolina 10 Jan 0.785
## # ... with 602 more rows
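For reference, newer releases of tidyr (1.0.0 and later) supersede gather() with pivot_longer(); a sketch of the equivalent call, in place of the gather() above:
# equivalent reshape with pivot_longer() (tidyr >= 1.0.0)
rental.price %>% pivot_longer(cols = Jan:Dec, names_to = 'MOY', values_to = 'price')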
Objectives
- examine trends by state and seasonal trends
#set up MOY as a factor
rental.price$MOY <- factor(rental.price$MOY, levels=month.level)
Most expensive states to rent
If we look at the most expensive states for rental properties, we see New York at the top of the list. These numbers are mean prices per square foot for 2018.
rental.price %>% group_by(region) %>%
summarise(meanPrice=mean(price)) %>%
arrange(desc(meanPrice))
## # A tibble: 51 x 2
## region meanPrice
## <chr> <dbl>
## 1 New York 3.78
## 2 District of Columbia 3.01
## 3 Massachusetts 2.51
## 4 Hawaii 2.43
## 5 California 1.92
## 6 Rhode Island 1.53
## 7 New Jersey 1.45
## 8 Colorado 1.45
## 9 Vermont 1.43
## 10 Connecticut 1.37
## # ... with 41 more rows
Map
The state boundary data comes from the map_data() function (backed by the maps/mapdata packages loaded above) and can be used in conjunction with ggplot. The resulting choropleth map provides an overview of mean price per square foot of rental units for each state in the contiguous US.
# prepare to join state map data to zillow df - map_data() keys on lowercase state names
rental.price$region <- str_to_lower(rental.price$region)
# average over the months so each state carries a single fill value,
# then join the polygons from map_data with the zillow data
state.means <- rental.price %>%
  group_by(region) %>%
  summarise(price = mean(price))
state.plot <- map_data("state") %>% left_join(state.means, by="region")
ggplot() +
geom_polygon(data=state.plot, aes(x=long, y=lat, group = group, fill=price),colour="white") +
scale_fill_continuous(low = "lightblue", high = "darkblue", guide="colorbar") +
theme_bw() +
labs(fill = "Price/sqft" ,title = "Average State Rental Price per sqft - 2018", x="", y="")
Seasonality
As was hypothesized on the message board, there appear to be seasonal effects in rental prices. Prices rise at the beginning of spring and then fall as the summer progresses. This may be driven by families with school-age children timing their moves ahead of the upcoming school year.
rental.price %>% group_by(MOY) %>%
summarize(price.month=mean(price)) %>%
arrange(as.numeric(MOY)) %>%
ggplot(aes(x=MOY,y=price.month)) +
geom_path(stat='identity', group=1) +
labs(title = "Price per sqft vs Month", x="Month of Year (2018)", y="Price per square foot")
It would be interesting to get more years' data and see if the same trend holds over time.
This data set provides unemployment rates for US counties for the years 2007-2017.
Load data as a csv and look at the structure
unemp.tbl <- read.csv('Project2_tbl3.csv', header=T, stringsAsFactors = F) %>% as_tibble
#str(unemp.tbl)
From the imported raw data, the subset of columns providing yearly unemployment rates is selected, along with the county ID (FIPS) and Area_name (county name and state).
# select statement
unemp.tbl <- unemp.tbl %>%
  select(FIPStxt, Area_name,
         Unemployment_rate_2007, Unemployment_rate_2008, Unemployment_rate_2009,
         Unemployment_rate_2010, Unemployment_rate_2011, Unemployment_rate_2012,
         Unemployment_rate_2013, Unemployment_rate_2014, Unemployment_rate_2015,
         Unemployment_rate_2016, Unemployment_rate_2017)
#rename variables
colnames(unemp.tbl) <-c('fips','Area','y2007','y2008','y2009','y2010','y2011','y2012','y2013','y2014','y2015','y2016','y2017')
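As an aside, that long select() could be shortened with the tidyselect helper num_range(), which builds column names from a prefix and a numeric range; a sketch of the equivalent call:
# equivalent selection of the 2007-2017 rate columns by name pattern
unemp.tbl <- unemp.tbl %>%
  select(FIPStxt, Area_name, num_range('Unemployment_rate_', 2007:2017))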
#parse Area into county name and state name; national and state summary rows have no comma (hence the warning below) and are removed
unemp.tbl %<>% separate(Area, c('county_name', 'state_name'), sep=',') %>%
filter(!is.na(state_name))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1,
## 2, 70, 103, 119, 195, 254, 319, 328, 332, 333, 334, 402, 562, 567, 612,
## 715, 808, 908, 1014, ...].
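Note that splitting on ',' leaves a leading space on the state abbreviations, which is why they print as " AL" in the tables below. A one-line str_trim() would remove it (not applied here, so the outputs keep the space):
# optional cleanup: strip the leading space left by the comma split
unemp.tbl$state_name <- str_trim(unemp.tbl$state_name, side='left')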
The data is still in wide format, but this is useful for calculated values. For example, the 10-year change in unemployment for a county can be computed and stored in the dataframe before converting to long format.
#find the 10-year percent decrease, 2007 to 2017 (a negative value means unemployment went up)
unemp.tbl %<>% mutate(percent_decrease = ((y2007-y2017)/y2007)*100)
unemp.tbl
## # A tibble: 3,221 x 15
## fips county_name state_name y2007 y2008 y2009 y2010 y2011 y2012 y2013
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Autauga Co~ " AL" 3.3 5.1 9.7 8.9 8.4 6.9 6.2
## 2 1003 Baldwin Co~ " AL" 3.1 4.6 9.8 10 9 7.5 6.6
## 3 1005 Barbour Co~ " AL" 6.3 8.8 14.3 12.3 11.5 11.5 10.2
## 4 1007 Bibb County " AL" 4.1 5.8 13.3 11.4 10.5 8.5 7.9
## 5 1009 Blount Cou~ " AL" 3.2 4.7 10 9.8 8.7 6.9 6.3
## 6 1011 Bullock Co~ " AL" 9.4 10.5 15.6 11.8 11.6 10.4 9.4
## 7 1013 Butler Cou~ " AL" 6.2 8.5 16.4 13.6 12.5 11.5 10.3
## 8 1015 Calhoun Co~ " AL" 3.9 5.7 11.1 11.4 10.3 8.9 8.8
## 9 1017 Chambers C~ " AL" 6.5 14.4 19.6 14.9 12 9.9 8
## 10 1019 Cherokee C~ " AL" 4.2 5.7 11.7 10.6 9.7 8 6.6
## # ... with 3,211 more rows, and 5 more variables: y2014 <dbl>,
## # y2015 <dbl>, y2016 <dbl>, y2017 <dbl>, percent_decrease <dbl>
#put unemployment rates in 'long' form
unemp.tbl %<>% gather("year",'unemployment_rate', y2007:y2017)
unemp.tbl
## # A tibble: 35,431 x 6
## fips county_name state_name percent_decrease year unemployment_ra~
## <int> <chr> <chr> <dbl> <chr> <dbl>
## 1 1001 Autauga County " AL" -18.2 y2007 3.3
## 2 1003 Baldwin County " AL" -29.0 y2007 3.1
## 3 1005 Barbour County " AL" 6.35 y2007 6.3
## 4 1007 Bibb County " AL" -7.32 y2007 4.1
## 5 1009 Blount County " AL" -25.0 y2007 3.2
## 6 1011 Bullock County " AL" 47.9 y2007 9.4
## 7 1013 Butler County " AL" 11.3 y2007 6.2
## 8 1015 Calhoun County " AL" -25.6 y2007 3.9
## 9 1017 Chambers County " AL" 36.9 y2007 6.5
## 10 1019 Cherokee County " AL" 2.38 y2007 4.2
## # ... with 35,421 more rows
Group by county and retrieve mean values
unemp.tbl %>% group_by(county_name, state_name) %>%
summarize(ten_year_mean = mean(unemployment_rate)) %>%
arrange(desc(ten_year_mean))
## # A tibble: 3,221 x 3
## # Groups: county_name [1,956]
## county_name state_name ten_year_mean
## <chr> <chr> <dbl>
## 1 Imperial County " CA" 24.4
## 2 Salinas Municipio " PR" 22.6
## 3 Patillas Municipio " PR" 22.0
## 4 Kusilvak Census Area " AK" 21.6
## 5 Maunabo Municipio " PR" 21.5
## 6 Yuma County " AZ" 21.4
## 7 Guanica Municipio " PR" 21.2
## 8 Yabucoa Municipio " PR" 21.0
## 9 Ciales Municipio " PR" 21.0
## 10 Arroyo Municipio " PR" 20.7
## # ... with 3,211 more rows
In which states has unemployment gotten worse in the last 10 years?
unemp.tbl %>% group_by(state_name) %>% summarise(mean_decrease=mean(percent_decrease,na.rm = T )) %>% arrange(mean_decrease)
## # A tibble: 51 x 2
## state_name mean_decrease
## <chr> <dbl>
## 1 " NM" -69.0
## 2 " UT" -46.9
## 3 " WY" -44.9
## 4 " DE" -37.0
## 5 " AZ" -27.1
## 6 " MD" -22.5
## 7 " LA" -20.9
## 8 " VA" -20.2
## 9 " SD" -15.9
## 10 " WV" -15.4
## # ... with 41 more rows
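For a friendlier table, the abbreviations can be mapped to full state names with R's built-in state.abb and state.name vectors; a sketch (the codes must first be trimmed, and DC and PR are absent from the built-ins, so they come back NA):
# map trimmed abbreviations to full names via the built-in state vectors
unemp.tbl %>%
  mutate(state_full = state.name[match(str_trim(state_name), state.abb)]) %>%
  group_by(state_full) %>%
  summarise(mean_decrease = mean(percent_decrease, na.rm=T)) %>%
  arrange(mean_decrease)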
Map of New York State counties
#prepare the join - fips must be a 5-character string, padded with a leading zero where needed
unemp.tbl$fips <- as.character(unemp.tbl$fips) %>% str_pad(5,pad='0')
# join data from map_data package with unemployment df
county.plot <- us_map("counties") %>% left_join(unemp.tbl,by="fips") %>% filter(abbr=='NY')
ggplot() +
geom_polygon(data=county.plot, aes(x=long, y=lat, group = group, fill=percent_decrease),colour="white",linetype=1) +
scale_fill_continuous(low = "lightblue", high = "darkblue", guide="colorbar") +
theme_bw() +
labs(fill = "Change in Unemployemnt Rate" ,title = "Change in Unemployment Rate in US counties from 2007 to 2017", x="", y="")