The goal of this project is to tidy and briefly analyze 3 different datasets.
Here are the ones I will be looking at:
Load all the neccesary libraries and csv files.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
setwd("~/Data")
trade_raw = read.csv('./trade.csv',stringsAsFactors = FALSE, na.strings = "")
mlb_raw = read.csv('./mlb_data.csv')
nfl_raw = read.csv('./NFL_data.csv')
The NFL dataset comprises team totals of some common football statistics for the Jets and the Giants from 2007 to 2014. Its currently in a wide format with stats for each team given in years across columns.
nfl = tbl_df(nfl_raw)
head(nfl)
## Source: local data frame [6 x 10]
##
## team statistic X2007 X2008 X2009 X2010 X2011 X2012 X2013
## 1 Jets Total Passing Yds 3014 3303 2380 3242 3297 2891 2932
## 2 Jets Total Rushing Yards 1701 2004 2756 2374 1692 1896 2158
## 3 Jets Total First Downs 286 308 280 307 301 299 280
## 4 Jets Touchdowns 26 48 37 39 45 31 27
## 5 Giants Total Passing Yds 3154 3177 4019 3885 4734 3825 3588
## 6 Giants Total Rushing Yards 2148 2518 1837 2200 1427 1862 1332
## Variables not shown: X2014 (int)
A simple dplyr::gather() on the year columns will create the long dataset we need. The year is now its own column and the statistics appear veritcally under the column “n”. When we perform our analysis we will manipulate the dataset a bit more, but for now it can be considered cleaned.
nfl = nfl %>% gather ("year", "n", 3:10)
head(nfl, n = 10)
## Source: local data frame [10 x 4]
##
## team statistic year n
## 1 Jets Total Passing Yds X2007 3014
## 2 Jets Total Rushing Yards X2007 1701
## 3 Jets Total First Downs X2007 286
## 4 Jets Touchdowns X2007 26
## 5 Giants Total Passing Yds X2007 3154
## 6 Giants Total Rushing Yards X2007 2148
## 7 Giants Total First Downs X2007 321
## 8 Giants Touchdowns X2007 44
## 9 Jets Total Passing Yds X2008 3303
## 10 Jets Total Rushing Yards X2008 2004
Similarly, the MLB dataset is in wide format with win totals spread out by teams across each of the three years.
mlb = tbl_df(mlb_raw)
head(mlb)
## Source: local data frame [3 x 32]
##
## Year G ARI ATL BAL BOS CHC CHW CIN CLE COL DET HOU KCR ANA LAD FLA MIL
## 1 2015 160 78 65 78 78 95 75 63 79 67 73 85 93 84 90 69 68
## 2 2014 162 64 79 96 71 73 73 76 85 66 90 70 89 98 94 77 82
## 3 2013 163 81 96 97 97 66 63 90 92 74 93 51 86 78 92 62 74
## Variables not shown: MIN (int), NYM (int), NYY (int), OAK (int), PHI
## (int), PIT (int), SDP (int), SFG (int), SEA (int), STL (int), TBD (int),
## TEX (int), TOR (int), WSN (int)
Again a simple dplyr::gather() will transform this dataset from wide to long.
mlb = mlb %>% gather("teams", "wins", 3:32)
head(mlb)
## Source: local data frame [6 x 4]
##
## Year G teams wins
## 1 2015 160 ARI 78
## 2 2014 162 ARI 64
## 3 2013 163 ARI 81
## 4 2015 160 ATL 65
## 5 2014 162 ATL 79
## 6 2013 163 ATL 96
The International Trade dataset is a bit more complex and also messier than the previous two datasets. It will require a bit more effort to get in order.
A look at the raw csv file shows a number of problems. The y columns are split into two variabes, some cells are blank due to the formatting of the csv file and many of the numerics were read in as strings due to commas.
trade = tbl_df(trade_raw)
head(trade)
## Source: local data frame [6 x 8]
##
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## 1 NA NA Exports Imports Exports Imports Exports Imports
## 2 NA NA NA NA NA NA NA NA
## 3 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 4 NA Services 56883 37,995 59,141 38,635 59,733 40,371
## 5 NA NA NA NA NA NA NA NA
## 6 February Goods 132084 193,731 131,768 193,060 126,329 184,370
The initial steps involve reseting the headers and popping the blank lines.
c('month',
'type',
'export_2013',
'import_2013',
'export_2014',
'import_2014',
'export_2015',
'import_2015') -> colnames(trade)
trade = trade[-1,] # pop extraneous header row
trade = trade[c(FALSE, TRUE, TRUE), ] # remove blank lines
trade
## Source: local data frame [8 x 8]
##
## month type export_2013 import_2013 export_2014 import_2014
## 1 January Goods 131,228 191,745 133,738 193,706
## 2 NA Services 56883 37,995 59,141 38,635
## 3 February Goods 132084 193,731 131,768 193,060
## 4 NA Services 56,828 37,935 57,726 39,270
## 5 March Goods 130,093 186,326 135,923 198,973
## 6 NA Services 56,647 37,938 58,837 38,908
## 7 April Goods 131,468 189,507 135,556 199,877
## 8 NA Services 56,592 38,318 59,468 39,417
## Variables not shown: export_2015 (chr), import_2015 (chr)
Next, we will remove the NAs in the month column with a for loop. It replaces an NA with its previous item in the column.
for (i in seq_along(trade$month)){
if (is.na(trade$month[i])){
trade$month[i] = trade$month[i-1] # set month[i] to its previous name
}
}
trade
## Source: local data frame [8 x 8]
##
## month type export_2013 import_2013 export_2014 import_2014
## 1 January Goods 131,228 191,745 133,738 193,706
## 2 January Services 56883 37,995 59,141 38,635
## 3 February Goods 132084 193,731 131,768 193,060
## 4 February Services 56,828 37,935 57,726 39,270
## 5 March Goods 130,093 186,326 135,923 198,973
## 6 March Services 56,647 37,938 58,837 38,908
## 7 April Goods 131,468 189,507 135,556 199,877
## 8 April Services 56,592 38,318 59,468 39,417
## Variables not shown: export_2015 (chr), import_2015 (chr)
Each variable from columns 3 - 8 actually contain two variables. We want to split these into years in quantities. To solve this we split the table into two tables, one for imports and one for exports. We then create ids and join the tables.
Since we need to perform the same operations on two tables, I wrap some of the cleaning elements in a function for brevity. Dplyr is then used to subset the dataset into two seperate tables for import and export
trade_cleaner = function(df, name){
df$id = as.numeric(rownames(df)) # make row ids
year_num = sapply(df$year, function(i){ # extract a vector of year nums...
as.numeric(stringr::str_extract(i,"[0-9]{4}"))
})
df$year = year_num # apply this vector to the year column
return (df)
}
export_subset = trade %>%
select(-c(import_2013,import_2014,import_2015)) %>%
gather("year", "exports", 3:5)
import_subset = trade %>%
select(-c(export_2013, export_2014, export_2015)) %>%
gather("year", "imports", 3:5)
exports = trade_cleaner(export_subset)
imports = trade_cleaner(import_subset)
head(exports)
## Source: local data frame [6 x 5]
##
## month type year exports id
## 1 January Goods 2013 131,228 1
## 2 January Services 2013 56883 2
## 3 February Goods 2013 132084 3
## 4 February Services 2013 56,828 4
## 5 March Goods 2013 130,093 5
## 6 March Services 2013 56,647 6
head(imports)
## Source: local data frame [6 x 5]
##
## month type year imports id
## 1 January Goods 2013 191,745 1
## 2 January Services 2013 37,995 2
## 3 February Goods 2013 193,731 3
## 4 February Services 2013 37,935 4
## 5 March Goods 2013 186,326 5
## 6 March Services 2013 37,938 6
Finally we join the imports onto the export table by id
trade_df_merge = exports %>%
left_join(select(imports, id, imports), by = "id")
trade_df_merge
## Source: local data frame [24 x 6]
##
## month type year exports id imports
## 1 January Goods 2013 131,228 1 191,745
## 2 January Services 2013 56883 2 37,995
## 3 February Goods 2013 132084 3 193,731
## 4 February Services 2013 56,828 4 37,935
## 5 March Goods 2013 130,093 5 186,326
## 6 March Services 2013 56,647 6 37,938
## 7 April Goods 2013 131,468 7 189,507
## 8 April Services 2013 56,592 8 38,318
## 9 January Goods 2014 133,738 9 193,706
## 10 January Services 2014 59,141 10 38,635
## .. ... ... ... ... .. ...
Now that the tables are merged all that is left to deal with are type conversions and a final gather() call to make one observation in each row. These rows will be spread() later for easier analysis.
trade_df_merge$exports = as.numeric(gsub(",", "", trade_df_merge$exports))
trade_df_merge$imports = as.numeric(gsub(",","",trade_df_merge$imports))
trade_df_merge$type = as.factor(trade_df_merge$type)
trade_df_merge$month = as.factor(trade_df_merge$month)
trade_df_merge$id = NULL # drop the id column... no longer needed
trade_df_merge = trade_df_merge %>% gather("exp_imp", "n", 4:5)
#final tidy dataset
trade_df_merge
## Source: local data frame [48 x 5]
##
## month type year exp_imp n
## 1 January Goods 2013 exports 131228
## 2 January Services 2013 exports 56883
## 3 February Goods 2013 exports 132084
## 4 February Services 2013 exports 56828
## 5 March Goods 2013 exports 130093
## 6 March Services 2013 exports 56647
## 7 April Goods 2013 exports 131468
## 8 April Services 2013 exports 56592
## 9 January Goods 2014 exports 133738
## 10 January Services 2014 exports 59141
## .. ... ... ... ... ...
Make and compare proportion of touchdowns to first downs between teams for each year.
Since we want to focus on a team by team comparison, it makes sense to spread out the stat fields. This makes it easier to create metrics with mutate and plot the results.
nfl_stats = nfl %>%
spread(statistic, n)
c("team", "year","first_dwns", "passing", "rushing", "tds") -> colnames(nfl_stats)
nfl_stats = nfl_stats %>% mutate(ratio_tds_fds = tds/first_dwns)
nfl_stats
## Source: local data frame [16 x 7]
##
## team year first_dwns passing rushing tds ratio_tds_fds
## 1 Giants X2007 321 3154 2148 44 0.13707165
## 2 Giants X2008 338 3177 2518 45 0.13313609
## 3 Giants X2009 323 4019 1837 46 0.14241486
## 4 Giants X2010 331 3885 2200 48 0.14501511
## 5 Giants X2011 331 4734 1427 47 0.14199396
## 6 Giants X2012 327 3825 1862 47 0.14373089
## 7 Giants X2013 280 3588 1332 32 0.11428571
## 8 Giants X2014 336 4272 1603 44 0.13095238
## 9 Jets X2007 286 3014 1701 26 0.09090909
## 10 Jets X2008 308 3303 2004 48 0.15584416
## 11 Jets X2009 280 2380 2756 37 0.13214286
## 12 Jets X2010 307 3242 2374 39 0.12703583
## 13 Jets X2011 301 3297 1692 45 0.14950166
## 14 Jets X2012 299 2891 1896 31 0.10367893
## 15 Jets X2013 280 2932 2158 27 0.09642857
## 16 Jets X2014 289 2946 2280 27 0.09342561
The ratio of touchdowns to first downs might be an indicator of how well a team will complete offensive drives.
ggplot(nfl_stats, aes(x=year, y=ratio_tds_fds, color=team, group=team)) +
geom_point(size=3) +
geom_line(size=3)
One can see that the Jets had a higher percentage than the Giants only two years, 2008 and 2011, two seasons where they had lackluster records. The 2008 season could be attributed to the offensive presence of Brett Favre. Though offensively strong in 2011, losses could be attributed to horrible defensive play.
One can also see that Giants are much more consistent in their scoring metric than the Jets. All Giants seasons are above .13 except for 2013, where they had an exceptionally bad offensive year.
Comparison of year over year win totals.
To find out which team has the most wins since 2013 we summarize the data and arrange the results in descending order. A barplot is the simplest and most effective way to visualize the result.
mlb_win_totals = mlb %>%
group_by(teams) %>%
summarize(total_wins_2013_2015 = sum(wins)) %>%
arrange(desc(total_wins_2013_2015))
mlb_win_totals
## Source: local data frame [30 x 2]
##
## teams total_wins_2013_2015
## 1 STL 287
## 2 PIT 279
## 3 LAD 276
## 4 BAL 271
## 5 KCR 268
## 6 WSN 263
## 7 ANA 260
## 8 CLE 256
## 9 DET 256
## 10 NYY 256
## .. ... ...
#reorder team factor to make a barplot in descending order
mlb_win_totals$teams = factor(mlb_win_totals$teams, as.character(mlb_win_totals$teams))
ggplot(mlb_win_totals, aes(x = teams, y = total_wins_2013_2015)) +
geom_bar(stat="identity") +
coord_cartesian(ylim = c(200, 300))
St. Louis had the most wins and Houston had the least. Since the display of all teams is hard to read, here are two plots of the top 10 and bottom 10 teams.
top10 = mlb_win_totals[1:10, ]
bot10 = mlb_win_totals[20:30, ]
ggplot(top10, aes(top10$teams,top10$total_wins_2013_2015)) +
geom_bar(stat="identity") +
coord_cartesian(ylim = c(250, 300))
ggplot(bot10, aes(bot10$teams,bot10$total_wins_2013_2015)) +
geom_bar(stat="identity") +
coord_cartesian(ylim = c(200, 240))
Calculate the average goods export for each month over the three-year span.
Calculate the average goods import for each month over the three-year span.
Calculate the average services export for each month over the three-year span.
Calculate the average services import for each month over the three-year span.
First we calculate the neccesary averages of all fields and create a new dataframe with the results.
trade_spread = trade_df_merge %>% spread(exp_imp, n)
avg_trade = trade_spread %>%
group_by(month, type) %>%
summarize(avg_export = mean(exports),
avg_import = mean(imports)) %>%
gather("export_import", "value", 3:4)
# reorder month factor for plotting...
avg_trade$month = factor(avg_trade$month, c("January", "February", "March", "April"))
avg_trade
## Source: local data frame [16 x 4]
##
## month type export_import value
## 1 April Goods avg_export 132133.33
## 2 April Services avg_export 58608.67
## 3 February Goods avg_export 130060.33
## 4 February Services avg_export 58038.00
## 5 January Goods avg_export 131419.33
## 6 January Services avg_export 58585.67
## 7 March Goods avg_export 131066.33
## 8 March Services avg_export 58355.67
## 9 April Goods avg_import 193450.33
## 10 April Services avg_import 39389.00
## 11 February Goods avg_import 190387.00
## 12 February Services avg_import 39087.33
## 13 January Goods avg_import 192564.33
## 14 January Services avg_import 39000.33
## 15 March Goods avg_import 194548.67
## 16 March Services avg_import 39142.33
The values of the averages fall into 4 distinct groupings. Visualizing them in the same plot will not show meaningful variations. Instead each grouping is looked at seperately in its own plot.
avg_exports_goods = avg_trade %>%
filter(export_import == "avg_export" & as.numeric(type) == 1)
avg_imports_goods = avg_trade %>%
filter(export_import == "avg_import" & as.numeric(type) == 1)
avg_exports_services = avg_trade %>%
filter(export_import == "avg_export" & as.numeric(type) == 2)
avg_imports_services = avg_trade %>%
filter(export_import == "avg_import" & as.numeric(type) == 2)
make_trade_plots = function(name, title){
return (ggplot(name, aes(x=month, y=value, group=type)) +
geom_point(size=3) + geom_line(size=3) +
ggtitle(title))
}
make_trade_plots(avg_exports_goods, "Avg Exports - Goods")
make_trade_plots(avg_exports_services, "Avg Exports - Services")
make_trade_plots(avg_imports_goods, "Avg Imports - Goods")
make_trade_plots(avg_imports_services, "Avg Imports - Services")
The visualizations show that the average exports and imports for goods as well as average imports for services dropped quite a bit in February, before rebounding in March and April.
The average imports for services was much lower in January and showed an upward trend moving into March.