Tidy 3 Datasets

Part 1. Cleaning the Data
Part 2. Analysis

The goal of this project is to tidy and briefly analyze 3 different datasets.

Here are the ones I will be looking at:

NFL Jets and Giants (myself)
MLB Team Wins 2013-2015 (Nick)
US International Trade Data (Jason)

Part 1. Cleaning the Data

Load all the neccesary libraries and csv files.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)

setwd("~/Data")
trade_raw = read.csv('./trade.csv',stringsAsFactors = FALSE, na.strings = "")
mlb_raw = read.csv('./mlb_data.csv')
nfl_raw = read.csv('./NFL_data.csv')

NFL dataset

The NFL dataset comprises team totals of some common football statistics for the Jets and the Giants from 2007 to 2014. Its currently in a wide format with stats for each team given in years across columns.

nfl = tbl_df(nfl_raw)
head(nfl)

## Source: local data frame [6 x 10]
## 
##     team           statistic X2007 X2008 X2009 X2010 X2011 X2012 X2013
## 1   Jets   Total Passing Yds  3014  3303  2380  3242  3297  2891  2932
## 2   Jets Total Rushing Yards  1701  2004  2756  2374  1692  1896  2158
## 3   Jets   Total First Downs   286   308   280   307   301   299   280
## 4   Jets          Touchdowns    26    48    37    39    45    31    27
## 5 Giants   Total Passing Yds  3154  3177  4019  3885  4734  3825  3588
## 6 Giants Total Rushing Yards  2148  2518  1837  2200  1427  1862  1332
## Variables not shown: X2014 (int)

A simple dplyr::gather() on the year columns will create the long dataset we need. The year is now its own column and the statistics appear veritcally under the column “n”. When we perform our analysis we will manipulate the dataset a bit more, but for now it can be considered cleaned.

nfl = nfl %>% gather ("year", "n", 3:10)
head(nfl, n = 10)

## Source: local data frame [10 x 4]
## 
##      team           statistic  year    n
## 1    Jets   Total Passing Yds X2007 3014
## 2    Jets Total Rushing Yards X2007 1701
## 3    Jets   Total First Downs X2007  286
## 4    Jets          Touchdowns X2007   26
## 5  Giants   Total Passing Yds X2007 3154
## 6  Giants Total Rushing Yards X2007 2148
## 7  Giants   Total First Downs X2007  321
## 8  Giants          Touchdowns X2007   44
## 9    Jets   Total Passing Yds X2008 3303
## 10   Jets Total Rushing Yards X2008 2004

MLB dataset

Similarly, the MLB dataset is in wide format with win totals spread out by teams across each of the three years.

mlb = tbl_df(mlb_raw)
head(mlb)

## Source: local data frame [3 x 32]
## 
##   Year   G ARI ATL BAL BOS CHC CHW CIN CLE COL DET HOU KCR ANA LAD FLA MIL
## 1 2015 160  78  65  78  78  95  75  63  79  67  73  85  93  84  90  69  68
## 2 2014 162  64  79  96  71  73  73  76  85  66  90  70  89  98  94  77  82
## 3 2013 163  81  96  97  97  66  63  90  92  74  93  51  86  78  92  62  74
## Variables not shown: MIN (int), NYM (int), NYY (int), OAK (int), PHI
##   (int), PIT (int), SDP (int), SFG (int), SEA (int), STL (int), TBD (int),
##   TEX (int), TOR (int), WSN (int)

Again a simple dplyr::gather() will transform this dataset from wide to long.

mlb = mlb %>% gather("teams", "wins", 3:32)
head(mlb)

## Source: local data frame [6 x 4]
## 
##   Year   G teams wins
## 1 2015 160   ARI   78
## 2 2014 162   ARI   64
## 3 2013 163   ARI   81
## 4 2015 160   ATL   65
## 5 2014 162   ATL   79
## 6 2013 163   ATL   96

Trade dataset

The International Trade dataset is a bit more complex and also messier than the previous two datasets. It will require a bit more effort to get in order.

A look at the raw csv file shows a number of problems. The y columns are split into two variabes, some cells are blank due to the formatting of the csv file and many of the numerics were read in as strings due to commas.

trade = tbl_df(trade_raw) 
head(trade)

## Source: local data frame [6 x 8]
## 
##     Period        X   X2013     X.1   X2014     X.2   X2015     X.3
## 1       NA       NA Exports Imports Exports Imports Exports Imports
## 2       NA       NA      NA      NA      NA      NA      NA      NA
## 3  January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 4       NA Services   56883  37,995  59,141  38,635  59,733  40,371
## 5       NA       NA      NA      NA      NA      NA      NA      NA
## 6 February   Goods   132084 193,731 131,768 193,060 126,329 184,370

The initial steps involve reseting the headers and popping the blank lines.

c('month', 
  'type', 
  'export_2013', 
  'import_2013', 
  'export_2014', 
  'import_2014', 
  'export_2015', 
  'import_2015') -> colnames(trade)

trade = trade[-1,]  # pop extraneous header row
trade = trade[c(FALSE, TRUE, TRUE), ] # remove blank lines
trade

## Source: local data frame [8 x 8]
## 
##      month     type export_2013 import_2013 export_2014 import_2014
## 1  January   Goods      131,228     191,745     133,738     193,706
## 2       NA Services       56883      37,995      59,141      38,635
## 3 February   Goods       132084     193,731     131,768     193,060
## 4       NA Services      56,828      37,935      57,726      39,270
## 5    March   Goods      130,093     186,326     135,923     198,973
## 6       NA Services      56,647      37,938      58,837      38,908
## 7    April   Goods      131,468     189,507     135,556     199,877
## 8       NA Services      56,592      38,318      59,468      39,417
## Variables not shown: export_2015 (chr), import_2015 (chr)

Next, we will remove the NAs in the month column with a for loop. It replaces an NA with its previous item in the column.

for (i in seq_along(trade$month)){
    
    if (is.na(trade$month[i])){
        trade$month[i] = trade$month[i-1]   # set month[i] to its previous name
    }
}

trade

## Source: local data frame [8 x 8]
## 
##      month     type export_2013 import_2013 export_2014 import_2014
## 1  January   Goods      131,228     191,745     133,738     193,706
## 2  January Services       56883      37,995      59,141      38,635
## 3 February   Goods       132084     193,731     131,768     193,060
## 4 February Services      56,828      37,935      57,726      39,270
## 5    March   Goods      130,093     186,326     135,923     198,973
## 6    March Services      56,647      37,938      58,837      38,908
## 7    April   Goods      131,468     189,507     135,556     199,877
## 8    April Services      56,592      38,318      59,468      39,417
## Variables not shown: export_2015 (chr), import_2015 (chr)

Each variable from columns 3 - 8 actually contain two variables. We want to split these into years in quantities. To solve this we split the table into two tables, one for imports and one for exports. We then create ids and join the tables.

Since we need to perform the same operations on two tables, I wrap some of the cleaning elements in a function for brevity. Dplyr is then used to subset the dataset into two seperate tables for import and export

trade_cleaner = function(df, name){
    
    df$id = as.numeric(rownames(df))    # make row ids
    
    year_num = sapply(df$year, function(i){   # extract a vector of year nums...
        as.numeric(stringr::str_extract(i,"[0-9]{4}"))    
    })
    
    df$year = year_num   # apply this vector to the year column
    
    return (df)
}


export_subset = trade %>% 
    select(-c(import_2013,import_2014,import_2015)) %>%
        gather("year", "exports", 3:5)

import_subset = trade %>% 
    select(-c(export_2013, export_2014, export_2015)) %>%
        gather("year", "imports", 3:5)

exports = trade_cleaner(export_subset)
imports = trade_cleaner(import_subset)

head(exports)

## Source: local data frame [6 x 5]
## 
##      month     type year exports id
## 1  January   Goods  2013 131,228  1
## 2  January Services 2013   56883  2
## 3 February   Goods  2013  132084  3
## 4 February Services 2013  56,828  4
## 5    March   Goods  2013 130,093  5
## 6    March Services 2013  56,647  6

head(imports)

## Source: local data frame [6 x 5]
## 
##      month     type year imports id
## 1  January   Goods  2013 191,745  1
## 2  January Services 2013  37,995  2
## 3 February   Goods  2013 193,731  3
## 4 February Services 2013  37,935  4
## 5    March   Goods  2013 186,326  5
## 6    March Services 2013  37,938  6

Finally we join the imports onto the export table by id

trade_df_merge = exports %>% 
    left_join(select(imports, id, imports), by = "id")

trade_df_merge

## Source: local data frame [24 x 6]
## 
##       month     type year exports id imports
## 1   January   Goods  2013 131,228  1 191,745
## 2   January Services 2013   56883  2  37,995
## 3  February   Goods  2013  132084  3 193,731
## 4  February Services 2013  56,828  4  37,935
## 5     March   Goods  2013 130,093  5 186,326
## 6     March Services 2013  56,647  6  37,938
## 7     April   Goods  2013 131,468  7 189,507
## 8     April Services 2013  56,592  8  38,318
## 9   January   Goods  2014 133,738  9 193,706
## 10  January Services 2014  59,141 10  38,635
## ..      ...      ...  ...     ... ..     ...

Now that the tables are merged all that is left to deal with are type conversions and a final gather() call to make one observation in each row. These rows will be spread() later for easier analysis.

trade_df_merge$exports = as.numeric(gsub(",", "", trade_df_merge$exports))
trade_df_merge$imports = as.numeric(gsub(",","",trade_df_merge$imports))
trade_df_merge$type = as.factor(trade_df_merge$type)
trade_df_merge$month = as.factor(trade_df_merge$month)
trade_df_merge$id = NULL  # drop the id column... no longer needed

trade_df_merge = trade_df_merge %>% gather("exp_imp", "n", 4:5)

#final tidy dataset
trade_df_merge

## Source: local data frame [48 x 5]
## 
##       month     type year exp_imp      n
## 1   January   Goods  2013 exports 131228
## 2   January Services 2013 exports  56883
## 3  February   Goods  2013 exports 132084
## 4  February Services 2013 exports  56828
## 5     March   Goods  2013 exports 130093
## 6     March Services 2013 exports  56647
## 7     April   Goods  2013 exports 131468
## 8     April Services 2013 exports  56592
## 9   January   Goods  2014 exports 133738
## 10  January Services 2014 exports  59141
## ..      ...      ...  ...     ...    ...

Part 2. Analysis

NFL

Make and compare proportion of touchdowns to first downs between teams for each year.

Since we want to focus on a team by team comparison, it makes sense to spread out the stat fields. This makes it easier to create metrics with mutate and plot the results.

nfl_stats = nfl %>% 
    spread(statistic, n) 

c("team", "year","first_dwns", "passing", "rushing", "tds") -> colnames(nfl_stats)
 
nfl_stats = nfl_stats %>% mutate(ratio_tds_fds = tds/first_dwns)

nfl_stats

## Source: local data frame [16 x 7]
## 
##      team  year first_dwns passing rushing tds ratio_tds_fds
## 1  Giants X2007        321    3154    2148  44    0.13707165
## 2  Giants X2008        338    3177    2518  45    0.13313609
## 3  Giants X2009        323    4019    1837  46    0.14241486
## 4  Giants X2010        331    3885    2200  48    0.14501511
## 5  Giants X2011        331    4734    1427  47    0.14199396
## 6  Giants X2012        327    3825    1862  47    0.14373089
## 7  Giants X2013        280    3588    1332  32    0.11428571
## 8  Giants X2014        336    4272    1603  44    0.13095238
## 9    Jets X2007        286    3014    1701  26    0.09090909
## 10   Jets X2008        308    3303    2004  48    0.15584416
## 11   Jets X2009        280    2380    2756  37    0.13214286
## 12   Jets X2010        307    3242    2374  39    0.12703583
## 13   Jets X2011        301    3297    1692  45    0.14950166
## 14   Jets X2012        299    2891    1896  31    0.10367893
## 15   Jets X2013        280    2932    2158  27    0.09642857
## 16   Jets X2014        289    2946    2280  27    0.09342561

The ratio of touchdowns to first downs might be an indicator of how well a team will complete offensive drives.

ggplot(nfl_stats, aes(x=year, y=ratio_tds_fds, color=team, group=team)) +
    geom_point(size=3) +
    geom_line(size=3)

One can see that the Jets had a higher percentage than the Giants only two years, 2008 and 2011, two seasons where they had lackluster records. The 2008 season could be attributed to the offensive presence of Brett Favre. Though offensively strong in 2011, losses could be attributed to horrible defensive play.

One can also see that Giants are much more consistent in their scoring metric than the Jets. All Giants seasons are above .13 except for 2013, where they had an exceptionally bad offensive year.

MLB

Comparison of year over year win totals.

To find out which team has the most wins since 2013 we summarize the data and arrange the results in descending order. A barplot is the simplest and most effective way to visualize the result.

mlb_win_totals = mlb %>% 
    group_by(teams) %>% 
        summarize(total_wins_2013_2015 = sum(wins)) %>% 
            arrange(desc(total_wins_2013_2015))

mlb_win_totals

## Source: local data frame [30 x 2]
## 
##    teams total_wins_2013_2015
## 1    STL                  287
## 2    PIT                  279
## 3    LAD                  276
## 4    BAL                  271
## 5    KCR                  268
## 6    WSN                  263
## 7    ANA                  260
## 8    CLE                  256
## 9    DET                  256
## 10   NYY                  256
## ..   ...                  ...

#reorder team factor to make a barplot in descending order
mlb_win_totals$teams = factor(mlb_win_totals$teams, as.character(mlb_win_totals$teams))

ggplot(mlb_win_totals, aes(x = teams, y = total_wins_2013_2015)) +
    geom_bar(stat="identity") +
    coord_cartesian(ylim = c(200, 300))

St. Louis had the most wins and Houston had the least. Since the display of all teams is hard to read, here are two plots of the top 10 and bottom 10 teams.

top10 = mlb_win_totals[1:10, ]
bot10 = mlb_win_totals[20:30, ]

ggplot(top10, aes(top10$teams,top10$total_wins_2013_2015)) + 
    geom_bar(stat="identity") +
    coord_cartesian(ylim = c(250, 300))

ggplot(bot10, aes(bot10$teams,bot10$total_wins_2013_2015)) + 
    geom_bar(stat="identity") +
    coord_cartesian(ylim = c(200, 240))

Trade dataset

Calculate the average goods export for each month over the three-year span.
Calculate the average goods import for each month over the three-year span.
Calculate the average services export for each month over the three-year span.
Calculate the average services import for each month over the three-year span.

First we calculate the neccesary averages of all fields and create a new dataframe with the results.

trade_spread = trade_df_merge %>% spread(exp_imp, n)

avg_trade = trade_spread %>% 
    group_by(month, type) %>% 
        summarize(avg_export = mean(exports), 
                  avg_import = mean(imports)) %>% 
            gather("export_import", "value", 3:4)

# reorder month factor for plotting... 
avg_trade$month = factor(avg_trade$month, c("January", "February", "March", "April"))

avg_trade

## Source: local data frame [16 x 4]
## 
##       month     type export_import     value
## 1     April   Goods     avg_export 132133.33
## 2     April Services    avg_export  58608.67
## 3  February   Goods     avg_export 130060.33
## 4  February Services    avg_export  58038.00
## 5   January   Goods     avg_export 131419.33
## 6   January Services    avg_export  58585.67
## 7     March   Goods     avg_export 131066.33
## 8     March Services    avg_export  58355.67
## 9     April   Goods     avg_import 193450.33
## 10    April Services    avg_import  39389.00
## 11 February   Goods     avg_import 190387.00
## 12 February Services    avg_import  39087.33
## 13  January   Goods     avg_import 192564.33
## 14  January Services    avg_import  39000.33
## 15    March   Goods     avg_import 194548.67
## 16    March Services    avg_import  39142.33

The values of the averages fall into 4 distinct groupings. Visualizing them in the same plot will not show meaningful variations. Instead each grouping is looked at seperately in its own plot.

avg_exports_goods = avg_trade %>% 
    filter(export_import == "avg_export" & as.numeric(type) == 1)
avg_imports_goods = avg_trade %>% 
    filter(export_import == "avg_import" & as.numeric(type) == 1)

avg_exports_services = avg_trade %>% 
    filter(export_import == "avg_export" & as.numeric(type) == 2)
avg_imports_services = avg_trade %>% 
    filter(export_import == "avg_import" & as.numeric(type) == 2)

make_trade_plots = function(name, title){
    return (ggplot(name, aes(x=month, y=value, group=type)) + 
                geom_point(size=3) + geom_line(size=3) +
                    ggtitle(title)) 
}

make_trade_plots(avg_exports_goods, "Avg Exports - Goods")

make_trade_plots(avg_exports_services, "Avg Exports - Services")

make_trade_plots(avg_imports_goods, "Avg Imports - Goods")

make_trade_plots(avg_imports_services, "Avg Imports - Services")

The visualizations show that the average exports and imports for goods as well as average imports for services dropped quite a bit in February, before rebounding in March and April.

The average imports for services was much lower in January and showed an upward trend moving into March.

Tidy 3 Datasets

jdeblase

October 9, 2015

Part 1. Cleaning the Data

NFL dataset

MLB dataset

Trade dataset

Part 2. Analysis

NFL

MLB

Trade dataset