1 Problem Statement

Working with data sets that are messy or in a format not conducive to downstream analysis is a major task that a data scientist must face. In this project, three datasets that qualify as messy are tidied, primarily using the ubiquitous dplyr and tidyr packages, and some analysis is then provided as requested. All input .csv tables are available at the GitHub repository https://github.com/robertwelk/DATA607_Project2.git

2 Packages Used

library(dplyr)    # data manipulation verbs
library(stringr)  # string handling and regular expressions
library(magrittr) # extra pipe operators, including %<>%
library(tidyr)    # separate(), gather(), and friends
library(ggplot2)  # graphics, including map_data()
library(usmap)    # us_map() polygons keyed by FIPS code
library(mapdata)  # supplemental map polygons for map_data()

3 Data set #1 - Super Bowl MVPs

This data comes from ESPN and has some immediately noticeable issues: columns contain multiple values separated by commas, and one column lacks structure. Luckily the tools of the tidyverse are well-equipped to tackle these issues, pun intended.

3.1 Tidy

Read in data and look at the structure

mvp.data <- read.csv('Project2_tbl1.csv', header=T, stringsAsFactors = F)
#Examine the structure of the raw data
str(mvp.data)
## 'data.frame':    53 obs. of  3 variables:
##  $ NO.       : chr  "I" "II" "III" "IV" ...
##  $ PLAYER    : chr  "Bart Starr, QB, Green Bay" "Bart Starr, QB, Green Bay" "Joe Namath, QB, New York Jets" "Len Dawson, QB, Kansas City" ...
##  $ HIGHLIGHTS: chr  "Two touchdown passes" "202 yards passing, 1 TD" "206 yards passing" "142 yards passing, 1 TD" ...
#Rename columns
colnames(mvp.data) <- c('superbowlID', 'player', 'stats')

Parsing the ‘player’ variable
Column 2 of the raw data contains player name, position, and team. Tidy data should not hold three variables in one column, and comma-separated values within a column are awkward to work with directly. The ‘separate()’ function was designed to solve exactly this kind of problem. Note that the pipe operator can be used for assignment (%<>%) thanks to the magrittr package.

# parse column 2
mvp.data %<>% separate(player,c('name', 'position','team'),sep=",")
#trim
mvp.data$team <- str_trim(mvp.data$team, side="both")
mvp.data$position <- str_trim(mvp.data$position, side="both")
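
As an aside on the compound assignment pipe used above: x %<>% f() is simply shorthand for x <- x %>% f(). A minimal illustration (df here is a hypothetical data frame, not part of this project):

# the two lines below are equivalent
df %<>% mutate(z = x + y)
df <- df %>% mutate(z = x + y)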

Parse & restructure the ‘stats’ variable
This task is more complicated than the previous step. Not only does the column need to be parsed, but the ordering of stats is inconsistent and not every player has the same set of stats, since they play different positions. For the downstream analysis we are primarily concerned with yards gained and touchdowns scored. Regular expressions are used in conjunction with a ‘for loop’ to place the stats in their respective columns.

# standardize vocabulary: unify 'touchdown(s)' to 'TD' and spelled-out 'Two' to '2'
mvp.data$stats %<>% str_replace_all('touchdown[s]?', 'TD') %>% 
  str_replace_all('[Tt]wo', '2')
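
To see the effect, the first record's highlights ("Two touchdown passes") transform as follows:

# "Two touchdown passes" -> "Two TD passes" -> "2 TD passes"
str_replace_all("Two touchdown passes", 'touchdown[s]?', 'TD') %>%
  str_replace_all('[Tt]wo', '2')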

# create variables within the df to store stats of interest
mvp.data$TD <- NA
mvp.data$yards<-NA

# Set up the loop - for each player the number of touchdowns scored and yards from scrimmage will be assigned to the designated column and stored as an integer
for(i in 1:nrow(mvp.data)){

    mvp.data$TD[i] <- str_extract(mvp.data$stats[i], '[[:digit:]] TD') %>%  
                      str_replace('TD','') 
    
    mvp.data$yards[i] <- str_extract(mvp.data$stats[i], '[[:digit:]]{2,3} yards') %>%  
                      str_replace('yards','') %>% 
                      str_trim(side = 'both') 
}
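
As an aside, the stringr functions are vectorized, so the loop above could be collapsed into a single mutate() call. A minimal sketch (assuming stringr >= 1.3 for str_remove(); it would run before the stats column is dropped):

# vectorized equivalent of the loop: extract, strip the label, coerce
mvp.data <- mvp.data %>%
  mutate(TD    = as.integer(str_remove(str_extract(stats, '[[:digit:]] TD'), ' TD')),
         yards = as.integer(str_remove(str_extract(stats, '[[:digit:]]{2,3} yards'), ' yards')))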

#remove the stats column
mvp.data <- mvp.data %>% select(-stats)
#Coerce to integer data type
mvp.data$yards <- as.integer(mvp.data$yards)
#assume that NA values of TDs mean no touchdown was scored
mvp.data$TD[is.na(mvp.data$TD)] <- 0

head(mvp.data)
##   superbowlID           name position          team TD yards
## 1           I     Bart Starr       QB     Green Bay 2     NA
## 2          II     Bart Starr       QB     Green Bay 1    202
## 3         III     Joe Namath       QB New York Jets  0   206
## 4          IV     Len Dawson       QB   Kansas City 1    142
## 5           V   Chuck Howley       LB        Dallas  0    NA
## 6          VI Roger Staubach       QB        Dallas 2    119

3.2 Analysis

Objectives
a. Find the players with the most MVP awards
b. Find the players with the most total touchdowns and the most passing or rushing yardage
c. Determine which positions win the award most frequently

Players with more than one MVP
The results are not what we wanted: Tom Brady is the leader.

mvp.data %>%  count(name) %>% arrange(desc(n)) %>% filter(n>1)
## # A tibble: 5 x 2
##   name               n
##   <chr>          <int>
## 1 Tom Brady          4
## 2 Joe Montana        3
## 3 Bart Starr         2
## 4 Eli Manning        2
## 5 Terry Bradshaw     2

Players with the most total touchdowns
More bad news!

mvp.data %>% group_by(name) %>% 
            summarise(TotalTD=sum(as.numeric(TD))) %>% 
            arrange(desc(TotalTD))
## # A tibble: 45 x 2
##    name           TotalTD
##    <chr>            <dbl>
##  1 Tom Brady           10
##  2 Joe Montana          9
##  3 Steve Young          6
##  4 Terry Bradshaw       6
##  5 Doug Williams        4
##  6 Troy Aikman          4
##  7 Aaron Rodgers        3
##  8 Bart Starr           3
##  9 Eli Manning          3
## 10 Jim Plunkett         3
## # ... with 35 more rows

Players with the most passing yardage - no surprises

# single game
mvp.data %>% filter(position=='QB') %>% 
              arrange(desc(yards)) %>% 
              top_n(10)
## Selecting by yards
##    superbowlID           name position          team TD yards
## 1           LI      Tom Brady       QB   New England 2    466
## 2        XXXIV    Kurt Warner       QB     St. Louis 2    414
## 3          LII     Nick Foles       QB  Philadelphia 3    373
## 4      XXXVIII      Tom Brady       QB   New England 3    354
## 5         XXII  Doug Williams       QB    Washington 4    340
## 6       XXXIII     John Elway       QB        Denver 1    336
## 7          XIX    Joe Montana       QB San Francisco 3    331
## 8         XLIX      Tom Brady       QB   New England 4    328
## 9         XXIX    Steve Young       QB San Francisco 6    325
## 10        XIII Terry Bradshaw       QB    Pittsburgh 4    318
#combined
mvp.data %>% filter(position=='QB') %>% 
          group_by(name) %>% 
          summarise(Total_yards=sum(yards,na.rm=T)) %>% 
          arrange(desc(Total_yards)) %>% 
          top_n(10)
## Selecting by Total_yards
## # A tibble: 10 x 2
##    name           Total_yards
##    <chr>                <int>
##  1 Tom Brady             1293
##  2 Joe Montana            785
##  3 Terry Bradshaw         627
##  4 Eli Manning            551
##  5 Kurt Warner            414
##  6 Nick Foles             373
##  7 Doug Williams          340
##  8 John Elway             336
##  9 Steve Young            325
## 10 Aaron Rodgers          304

The most yards in a single game by a non-QB were gained by Desmond Howard, an unlikely source. Although a Heisman Trophy winner, Howard was a kick return specialist and was used only sparingly as a wide receiver in QB Brett Favre's offense.

mvp.data %>% filter(position!='QB') %>%
              arrange(desc(yards)) %>% 
              top_n(10)
## Selecting by yards
##    superbowlID           name position                team TD yards
## 1         XXXI Desmond Howard       KR           Green Bay 1    244
## 2        XXIII     Jerry Rice       WR       San Francisco 1    215
## 3        XVIII   Marcus Allen       RB Los Angeles Raiders 2    191
## 4         XVII   John Riggins       RB          Washington 1    166
## 5            X     Lynn Swann       WR          Pittsburgh 1    161
## 6           IX  Franco Harris       RB          Pittsburgh 1    158
## 7        XXXII  Terrell Davis       RB              Denver 3    157
## 8         VIII   Larry Csonka       FB               Miami 2    145
## 9         LIII Julian Edelman       WR         New England  0   141
## 10       XXXIX   Deion Branch       WR         New England  0   133

Barplot of position frequencies
- as expected, QBs dominate the MVP awards

mvp.data %>% group_by(position) %>% 
              count() %>%  
              arrange(desc(n)) %>%  
              ggplot(aes(x=position,y=n)) + 
                    geom_bar(stat='identity') + 
                    labs(title = "Super Bowl MVP Awards by Position", x="Position", y="Number")


4 Data set #2 - Zillow 2018 rental prices

This data set comes from Zillow and provides monthly summaries of rental prices per square foot for each US state in 2018. It is untidy in that the data appears in ‘wide’ format: a single variable (price per square foot) is spread across multiple columns, which is not ideal for most analytical techniques.

4.1 Tidy

Read in data as a csv

rental.price <- read.csv('Project2_tbl2.csv', header=T, stringsAsFactors = F) %>% as_tibble()

The data is tidied by providing descriptive column names and using the ‘gather()’ function to put it into long form. The result is that price per square foot appears in a single column rather than twelve (one per month), with state and month identifying each observation. Now the data is tidy - each variable is in its own column.

month.level <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
colnames(rental.price) <- c('region','SizeRank', month.level)
rental.price %<>%  gather('MOY', 'price',3:14)
rental.price
## # A tibble: 612 x 4
##    region         SizeRank MOY   price
##    <chr>             <int> <chr> <dbl>
##  1 California            1 Jan   1.87 
##  2 Texas                 2 Jan   0.906
##  3 New York              3 Jan   4.27 
##  4 Florida               4 Jan   1.32 
##  5 Illinois              5 Jan   1.22 
##  6 Pennsylvania          6 Jan   1.03 
##  7 Ohio                  7 Jan   0.831
##  8 Michigan              8 Jan   0.934
##  9 Georgia               9 Jan   0.772
## 10 North Carolina       10 Jan   0.785
## # ... with 602 more rows
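
Worth noting: in tidyr 1.0 and later, gather() is superseded by pivot_longer(). A sketch of the equivalent call (not run here):

# modern equivalent of the gather() above
rental.price <- rental.price %>%
  pivot_longer(cols = Jan:Dec, names_to = 'MOY', values_to = 'price')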

4.2 Analysis

Objectives
- examine trends by state and seasonal trends

#set up MOY as a factor with calendar-ordered levels so plots sort correctly
rental.price$MOY <- factor(rental.price$MOY, levels=month.level)

Most expensive states to rent
Looking at the most expensive states in which to rent, New York tops the list. These numbers are the mean price per square foot over 2018.

rental.price %>%  group_by(region) %>% 
              summarise(meanPrice=mean(price)) %>%  
              arrange(desc(meanPrice)) 
## # A tibble: 51 x 2
##    region               meanPrice
##    <chr>                    <dbl>
##  1 New York                  3.78
##  2 District of Columbia      3.01
##  3 Massachusetts             2.51
##  4 Hawaii                    2.43
##  5 California                1.92
##  6 Rhode Island              1.53
##  7 New Jersey                1.45
##  8 Colorado                  1.45
##  9 Vermont                   1.43
## 10 Connecticut               1.37
## # ... with 41 more rows

Map
State polygon data is available through ggplot2’s map_data() (backed by the maps/mapdata packages) and can be joined to the Zillow data. The resulting choropleth map provides an overview of the mean price per square foot of rental units for every state in the US.

# prepare to join state map data to zillow df
rental.price$region <- str_to_lower(rental.price$region)
# join data from map_data package with the zillow data
state.plot <- map_data("state") %>% left_join(rental.price,by="region")

ggplot() + 
    geom_polygon(data=state.plot, aes(x=long, y=lat, group = group,  fill=price),colour="white") + 
    scale_fill_continuous(low = "lightblue", high = "darkblue", guide="colorbar")  + 
    theme_bw()  +
    labs(fill = "Price/sqft" ,title = "Average State Rental Price per sqft - 2018", x="", y="")

Seasonality
As was hypothesized on the message board, there appear to be seasonal effects in rental prices. Prices rise at the beginning of spring and then fall as the summer progresses. This may be driven by families with school-age children timing moves ahead of the upcoming school year.

rental.price %>%  group_by(MOY) %>%  
                  summarize(price.month=mean(price)) %>% 
                  arrange(as.numeric(MOY)) %>% 
                  ggplot(aes(x=MOY,y=price.month)) + 
                          geom_path(stat='identity', group=1) +
                          labs(title = "Price per sqft vs Month", x="Month of Year (2018)", y="Price per square foot")

It would be interesting to get more years of data and see whether the same trend holds over time.

5 Data set #3 - Unemployment Rates

Unemployment data for US counties covering the years 2007-2017.

5.1 Tidy

Load data as a csv and look at the structure

unemp.tbl <- read.csv('Project2_tbl3.csv', header=T, stringsAsFactors = F) %>% as_tibble
#str(unemp.tbl)

From the imported raw data, the subset of columns providing yearly unemployment rates is selected, along with the county ID # (FIPS) and Area_name (county name and state).

# select statement
unemp.tbl <- unemp.tbl %>% select(FIPStxt, Area_name,
                                  Unemployment_rate_2007, Unemployment_rate_2008,
                                  Unemployment_rate_2009, Unemployment_rate_2010,
                                  Unemployment_rate_2011, Unemployment_rate_2012,
                                  Unemployment_rate_2013, Unemployment_rate_2014,
                                  Unemployment_rate_2015, Unemployment_rate_2016,
                                  Unemployment_rate_2017)

#rename variables
colnames(unemp.tbl) <-c('fips','Area','y2007','y2008','y2009','y2010','y2011','y2012','y2013','y2014','y2015','y2016','y2017')

#parse area_name into county name and state name - just county data, remove national and state averages
unemp.tbl %<>%  separate(Area, c('county_name', 'state_name'), sep=',') %>% 
                filter(!is.na(state_name)) 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1,
## 2, 70, 103, 119, 195, 254, 319, 328, 332, 333, 334, 402, 562, 567, 612,
## 715, 808, 908, 1014, ...].
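
The warning is expected: the 54 rows that could not be split lack a comma because they are the national and state summary rows, which the filter() then removes. Note also that splitting on ',' alone leaves a leading space in state_name (it prints as " AL" in the listings below); an optional one-line cleanup, not applied here:

# optional: trim the leading space left behind by sep=","
unemp.tbl$state_name <- str_trim(unemp.tbl$state_name)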

The data is still in wide format, but this is convenient for calculated values. For example, the change in unemployment between 2007 and 2017 can be computed for each county and stored in the dataframe before converting to long format.

#find the percent decrease from 2007 to 2017 (a negative value means unemployment went up)
unemp.tbl %<>% mutate(percent_decrease = ((y2007-y2017)/y2007)*100)
unemp.tbl
## # A tibble: 3,221 x 15
##     fips county_name state_name y2007 y2008 y2009 y2010 y2011 y2012 y2013
##    <int> <chr>       <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1001 Autauga Co~ " AL"        3.3   5.1   9.7   8.9   8.4   6.9   6.2
##  2  1003 Baldwin Co~ " AL"        3.1   4.6   9.8  10     9     7.5   6.6
##  3  1005 Barbour Co~ " AL"        6.3   8.8  14.3  12.3  11.5  11.5  10.2
##  4  1007 Bibb County " AL"        4.1   5.8  13.3  11.4  10.5   8.5   7.9
##  5  1009 Blount Cou~ " AL"        3.2   4.7  10     9.8   8.7   6.9   6.3
##  6  1011 Bullock Co~ " AL"        9.4  10.5  15.6  11.8  11.6  10.4   9.4
##  7  1013 Butler Cou~ " AL"        6.2   8.5  16.4  13.6  12.5  11.5  10.3
##  8  1015 Calhoun Co~ " AL"        3.9   5.7  11.1  11.4  10.3   8.9   8.8
##  9  1017 Chambers C~ " AL"        6.5  14.4  19.6  14.9  12     9.9   8  
## 10  1019 Cherokee C~ " AL"        4.2   5.7  11.7  10.6   9.7   8     6.6
## # ... with 3,211 more rows, and 5 more variables: y2014 <dbl>,
## #   y2015 <dbl>, y2016 <dbl>, y2017 <dbl>, percent_decrease <dbl>
#put unemployment rates in 'long' form
unemp.tbl %<>% gather("year",'unemployment_rate', y2007:y2017)
unemp.tbl
## # A tibble: 35,431 x 6
##     fips county_name     state_name percent_decrease year  unemployment_ra~
##    <int> <chr>           <chr>                 <dbl> <chr>            <dbl>
##  1  1001 Autauga County  " AL"                -18.2  y2007              3.3
##  2  1003 Baldwin County  " AL"                -29.0  y2007              3.1
##  3  1005 Barbour County  " AL"                  6.35 y2007              6.3
##  4  1007 Bibb County     " AL"                 -7.32 y2007              4.1
##  5  1009 Blount County   " AL"                -25.0  y2007              3.2
##  6  1011 Bullock County  " AL"                 47.9  y2007              9.4
##  7  1013 Butler County   " AL"                 11.3  y2007              6.2
##  8  1015 Calhoun County  " AL"                -25.6  y2007              3.9
##  9  1017 Chambers County " AL"                 36.9  y2007              6.5
## 10  1019 Cherokee County " AL"                  2.38 y2007              4.2
## # ... with 35,421 more rows

5.2 Analysis

Group by county and compute the mean unemployment rate over all years

unemp.tbl %>% group_by(county_name, state_name) %>% 
              summarize(ten_year_mean = mean(unemployment_rate)) %>% 
              arrange(desc(ten_year_mean)) 
## # A tibble: 3,221 x 3
## # Groups:   county_name [1,956]
##    county_name          state_name ten_year_mean
##    <chr>                <chr>              <dbl>
##  1 Imperial County      " CA"               24.4
##  2 Salinas Municipio    " PR"               22.6
##  3 Patillas Municipio   " PR"               22.0
##  4 Kusilvak Census Area " AK"               21.6
##  5 Maunabo Municipio    " PR"               21.5
##  6 Yuma County          " AZ"               21.4
##  7 Guanica Municipio    " PR"               21.2
##  8 Yabucoa Municipio    " PR"               21.0
##  9 Ciales Municipio     " PR"               21.0
## 10 Arroyo Municipio     " PR"               20.7
## # ... with 3,211 more rows

In which states has unemployment gotten worse between 2007 and 2017?

unemp.tbl %>% group_by(state_name) %>% summarise(mean_decrease=mean(percent_decrease,na.rm = T  )) %>% arrange(mean_decrease)
## # A tibble: 51 x 2
##    state_name mean_decrease
##    <chr>              <dbl>
##  1 " NM"              -69.0
##  2 " UT"              -46.9
##  3 " WY"              -44.9
##  4 " DE"              -37.0
##  5 " AZ"              -27.1
##  6 " MD"              -22.5
##  7 " LA"              -20.9
##  8 " VA"              -20.2
##  9 " SD"              -15.9
## 10 " WV"              -15.4
## # ... with 41 more rows

Map of New York State counties

#prepare the join - need to have fips ID be a 5 digit number with leading zero if needed
unemp.tbl$fips <- as.character(unemp.tbl$fips) %>% str_pad(5,pad='0') 
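
As a quick illustration of the padding, Autauga County's four-digit code 1001 becomes the standard five-digit FIPS:

str_pad('1001', width = 5, pad = '0')   # returns "01001"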


# join data from map_data package with unemployment df
county.plot <- us_map("counties") %>% left_join(unemp.tbl,by="fips") %>% filter(abbr=='NY')
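
A side note on this join: unemp.tbl is in long form (11 rows per county), so every polygon vertex is duplicated 11 times. The map still renders correctly because percent_decrease is constant within a county, but a leaner sketch would collapse to one row per county first:

# one row per county before joining avoids duplicating polygon vertices
change.by.county <- unemp.tbl %>% distinct(fips, percent_decrease)
county.plot <- us_map("counties") %>% left_join(change.by.county, by="fips") %>% filter(abbr=='NY')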

ggplot() + 
    geom_polygon(data=county.plot, aes(x=long, y=lat, group = group, fill=percent_decrease),colour="white",linetype=1) + 
    scale_fill_continuous(low = "lightblue", high = "darkblue", guide="colorbar")  + 
    theme_bw()  +
    labs(fill = "Change in Unemployemnt Rate" ,title = "Change in Unemployment Rate in US counties from 2007 to 2017", x="", y="")