For my second project in IS 607, I am tasked with taking 3 ‘wide’ datasets, tidying them according to the standards in this Hadley Wickham paper, and performing a bit of exploratory data analysis. The datasets were all proposed by my peers in the IS 607 Fall 2015 cohort. The 3 datasets I chose were the 2016 GOP presidential nomination polling data, all time MLB wins data, and 2013-2015 Q1 United States import/export data.
As always, I’ll load my libraries before beginning.
library(tidyr)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
This dataset was proposed by Edwige Talla Badjio, and can be found in the second table at this link from Real Clear Politics (and coincentally, it looks like the website has undergone a face lift since I pulled the data).
Since the data comes from a tabular HTML format, it would have been possible to use cURL to interact with the table directly, but in this case it was faster to copy the data into excel and generate a .csv file that way.
I’ll first load the .csv file, and use the na.strings argument to have R “–” values as NA, since it represented no polling data for that candidate.
gop <- read.csv('gop.csv', na.strings = "--")
The first step is to take a look at the data to get a feel for how it is formatted and develop a strategy for tidying it.
tbl_df(gop)
## Source: local data frame [101 x 17]
##
## Poll Date Trump Carson Fiorina Rubio Bush Cruz Kasich
## (fctr) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 RCP Average 9/17 - 10/4 23.2 17.2 10.4 9.9 8.4 6.2 3.2
## 2 PPP (D) 10/1 - 10/4 27.0 17.0 6.0 13.0 10.0 7.0 4.0
## 3 IBD/TIPP 9/26 - 10/1 17.0 24.0 9.0 11.0 8.0 6.0 4.0
## 4 USAT/Suffolk 9/24 - 9/28 23.0 13.0 13.0 9.0 8.0 6.0 2.0
## 5 Pew Research* 9/22 - 9/27 25.0 16.0 8.0 8.0 4.0 6.0 1.0
## 6 NBC/WSJ 9/20 - 9/24 21.0 20.0 11.0 11.0 7.0 5.0 6.0
## 7 FOX News 9/20 - 9/22 26.0 18.0 9.0 9.0 7.0 8.0 4.0
## 8 Bloomberg 9/18 - 9/21 21.0 16.0 11.0 8.0 13.0 5.0 4.0
## 9 Quinnipiac 9/17 - 9/21 25.0 17.0 12.0 9.0 10.0 7.0 2.0
## 10 CNN/ORC 9/17 - 9/19 24.0 14.0 15.0 11.0 9.0 6.0 2.0
## .. ... ... ... ... ... ... ... ... ...
## Variables not shown: Huckabee (dbl), Christie (dbl), Paul (dbl), Santorum
## (dbl), Jindal (dbl), Graham (dbl), Pataki (dbl), Spread (fctr)
The data frame I am working with consists of a Poll variable, which describes the organization that conducted the poll, a date variable, which describes the time frame the polling data was collected, each candidate as a seperate column, and a spread column indicating the top candidates lead over the 2nd place candidate.
Because the date ranges do not include years, and because most of the candidates did not formally announce until this year, I will only look at observations from 2015, and drop everything prior.
gop <- gop[1:55,]
The most obvious violation of a tidy data format is that the candidates names, which are variables, are used as columnn names. To fix this problem, I’ll use tidyr’s gather() function to move them all into one column, and significantly reduce the ‘width’ of the data set.
gop <- gather(gop, candidate, points, Trump:Pataki)
tbl_df(gop)
## Source: local data frame [770 x 5]
##
## Poll Date Spread candidate points
## (fctr) (fctr) (fctr) (fctr) (dbl)
## 1 RCP Average 9/17 - 10/4 Trump +6.0 Trump 23.2
## 2 PPP (D) 10/1 - 10/4 Trump +10 Trump 27.0
## 3 IBD/TIPP 9/26 - 10/1 Carson +7 Trump 17.0
## 4 USAT/Suffolk 9/24 - 9/28 Trump +10 Trump 23.0
## 5 Pew Research* 9/22 - 9/27 Trump +9 Trump 25.0
## 6 NBC/WSJ 9/20 - 9/24 Trump +1 Trump 21.0
## 7 FOX News 9/20 - 9/22 Trump +8 Trump 26.0
## 8 Bloomberg 9/18 - 9/21 Trump +5 Trump 21.0
## 9 Quinnipiac 9/17 - 9/21 Trump +8 Trump 25.0
## 10 CNN/ORC 9/17 - 9/19 Trump +9 Trump 24.0
## .. ... ... ... ... ...
I don’t really care about the spread column, it’s redundant, and I’ll be able to easily calculate it later if I want, so I’ll subset it out.
gop <- cbind(gop[1:2], gop[4:5])
I eventually want to look at this data as a function of time, so I need to get the date data cleaned up. The intervals are all different, so I think the best way to standardize this is to only look at the end date. I’ll use a regular expression to extract out the second date and replace the gop$Date vector with the end date.
gop$Date <- str_extract(gop$Date, "[[:digit:]]{1,2}/[[:digit:]]{1,2}$")
head(gop$Date)
## [1] "10/4" "10/4" "10/1" "9/28" "9/27" "9/24"
The next issue is while I have the dates formatted the way I want, they are still character strings, thus won’t compare to each other well. The solution is to convert each to a POSIX time format using the strptime() function.
gop$Date <- strptime(gop$Date,"%m/%d")
Last but not least is to coerce the polling points to numeric so they too can be compared.
gop$points <- as.numeric(gop$points)
I’ll have a look at the df…
tbl_df(gop)
## Source: local data frame [770 x 4]
##
## Poll Date candidate points
## (fctr) (chr) (fctr) (dbl)
## 1 RCP Average <dbl[10]> Trump 23.2
## 2 PPP (D) <int[10]> Trump 27.0
## 3 IBD/TIPP <int[10]> Trump 17.0
## 4 USAT/Suffolk <int[10]> Trump 23.0
## 5 Pew Research* <int[10]> Trump 25.0
## 6 NBC/WSJ <int[10]> Trump 21.0
## 7 FOX News <int[10]> Trump 26.0
## 8 Bloomberg <int[10]> Trump 21.0
## 9 Quinnipiac <int[10]> Trump 25.0
## 10 CNN/ORC <chr[10]> Trump 24.0
## .. ... ... ... ...
…and the data looks tidy to me, so I’m ready to start exploring.
The webpage that this data pulls from has a line graph that plots all the candidates polling data as a function of time, color coded by candidate. I find it a bit difficult to read, so I’ll see if I can do better.
My first step is to subset the top 5 candidates (sorry George Pataki).
trump <- subset(gop, candidate == "Trump" & !is.na(points))
rubio <- subset(gop, candidate == "Rubio" & !is.na(points))
bush <- subset(gop, candidate == "Bush" & !is.na(points))
carson <- subset(gop, candidate == "Carson" & !is.na(points))
fiorina <- subset(gop, candidate == "Fiorina" & !is.na(points))
cruz <- subset(gop, candidate == "Cruz" & !is.na(points))
Next, I’ll instantiate a ggplot object using Trump’s data, and plot out his numbers. What colo(u)r? We’ll if you’ve ever seen Trump tower, you know there’s only one choice.
gg <- ggplot(data=trump, aes(x=Date, y=points, colours='gold'))
gg + geom_point()
So this provides a pretty good feel for Trumps candidacy so far. His ascendancy started around mid June, and dipped a bit towards the end of September.
Next, I’ll compare how the estranged Florida candidates have performed:
gg + geom_point(data=bush, colour ='red') + geom_point(data=rubio, colour='blue')
While Jeb! has been leading most of the way, it looks like he may have been passed by his former understudy in recent weeks.
Finally, I’ll look at all the top 6 candiates.
horse_race <- gg + geom_point(data=fiorina, colour ='black') +
geom_point(data=rubio, colour='blue') +
geom_point(data=trump, colour='gold') +
geom_point(data=carson, colour='green') +
geom_point(data=bush, colour='red')
horse_race
Despite Trumps downturn, he’s still clearly in the lead. But the top candidates seem to converging, indicating we’re a long way from the convetion.
This data set was contributed by Nicholas Capofari and can be found on the Baseball Reference site. The data includes every MLB team from all time, there wins, and the games played in that year.
If you click the link, the table can actually be converted into a .csv format (but not downloaded, oddly). This makes the data a little bit easier to get into a .csv file, but also means the header rows are repeated throughout the file. To handle this, when I read in the data I will subset out those ‘header’ records.
mlb <- subset(read.csv('mlb.csv', stringsAsFactors = FALSE), Year != 'Year')
If I look at the structure of the data…
str(mlb)
## 'data.frame': 115 obs. of 33 variables:
## $ Year: chr "2015" "2014" "2013" "2012" ...
## $ G : chr "162" "162" "163" "162" ...
## $ ARI : chr "79" "64" "81" "81" ...
## $ ATL : chr "67" "79" "96" "94" ...
## $ BLA : chr "" "" "" "" ...
## $ BAL : chr "81" "96" "85" "93" ...
## $ BOS : chr "78" "71" "97" "69" ...
## $ CHC : chr "97" "73" "66" "61" ...
## $ CHW : chr "76" "73" "63" "85" ...
## $ CIN : chr "64" "76" "90" "97" ...
## $ CLE : chr "81" "85" "92" "68" ...
## $ COL : chr "68" "66" "74" "64" ...
## $ DET : chr "74" "90" "93" "88" ...
## $ HOU : chr "86" "70" "51" "55" ...
## $ KCR : chr "95" "89" "86" "72" ...
## $ ANA : chr "85" "98" "78" "89" ...
## $ LAD : chr "92" "94" "92" "86" ...
## $ FLA : chr "71" "77" "62" "69" ...
## $ MIL : chr "68" "82" "74" "83" ...
## $ MIN : chr "83" "70" "66" "66" ...
## $ NYM : chr "90" "79" "74" "74" ...
## $ NYY : chr "87" "84" "85" "95" ...
## $ OAK : chr "68" "88" "96" "94" ...
## $ PHI : chr "63" "73" "73" "81" ...
## $ PIT : chr "98" "88" "94" "79" ...
## $ SDP : chr "74" "77" "76" "76" ...
## $ SFG : chr "84" "88" "76" "94" ...
## $ SEA : chr "76" "87" "71" "75" ...
## $ STL : chr "100" "90" "97" "88" ...
## $ TBD : chr "80" "77" "92" "90" ...
## $ TEX : chr "88" "67" "91" "93" ...
## $ TOR : chr "93" "83" "74" "73" ...
## $ WSN : chr "83" "96" "86" "98" ...
…I’ll see that for whatever reason my stringsAsFactors option didn’t work, everything is still a factor. I’m guessing this is because of the junk records.
Anyway I need to handle this, and to do so I’ll use lapply across the columns of the data frame to coerce everything to numeric and that put it back in a data frame.
mlb <- as.data.frame.list(lapply(mlb, function (x) as.numeric(x)))
tbl_df(mlb)
## Source: local data frame [115 x 33]
##
## Year G ARI ATL BLA BAL BOS CHC CHW CIN CLE COL
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 2015 162 79 67 NA 81 78 97 76 64 81 68
## 2 2014 162 64 79 NA 96 71 73 73 76 85 66
## 3 2013 163 81 96 NA 85 97 66 63 90 92 74
## 4 2012 162 81 94 NA 93 69 61 85 97 68 64
## 5 2011 162 94 89 NA 69 90 71 79 79 80 73
## 6 2010 162 65 91 NA 66 89 75 88 91 69 83
## 7 2009 163 70 86 NA 64 95 83 79 78 65 92
## 8 2008 163 82 72 NA 68 95 97 89 74 81 74
## 9 2007 163 90 84 NA 69 96 85 72 72 96 90
## 10 2006 162 76 79 NA 70 86 66 90 80 78 76
## .. ... ... ... ... ... ... ... ... ... ... ... ...
## Variables not shown: DET (dbl), HOU (dbl), KCR (dbl), ANA (dbl), LAD
## (dbl), FLA (dbl), MIL (dbl), MIN (dbl), NYM (dbl), NYY (dbl), OAK (dbl),
## PHI (dbl), PIT (dbl), SDP (dbl), SFG (dbl), SEA (dbl), STL (dbl), TBD
## (dbl), TEX (dbl), TOR (dbl), WSN (dbl)
Similar to the last scenario, the major problem witht this dataset is that there are variables as column names. Fortunately, tidyr makes this a very easy problem to deal with.
tidy_mlb <- gather(mlb, team, wins, ARI:WSN)
tbl_df(tidy_mlb)
## Source: local data frame [3,565 x 4]
##
## Year G team wins
## (dbl) (dbl) (fctr) (dbl)
## 1 2015 162 ARI 79
## 2 2014 162 ARI 64
## 3 2013 163 ARI 81
## 4 2012 162 ARI 81
## 5 2011 162 ARI 94
## 6 2010 162 ARI 65
## 7 2009 163 ARI 70
## 8 2008 163 ARI 82
## 9 2007 163 ARI 90
## 10 2006 162 ARI 76
## .. ... ... ... ...
Since the amount of games played each year varies, I’ll add a win pct column to make each observation directly comparable.
tidy_mlb[,'win_pct'] <- tidy_mlb$wins / tidy_mlb$G
And with that I’m ready to look at the data.
Since baseball has been played seemingly forever, I’ll mostly stick to looking at data since 1980. The first thing I’ll look at is the best individual seasons.
head(arrange(subset(tidy_mlb, Year >= 1980), desc(win_pct)))
## Year G team wins win_pct
## 1 2001 162 SEA 116 0.7160494
## 2 1998 163 NYY 114 0.6993865
## 3 1995 145 CLE 100 0.6896552
## 4 1986 162 NYM 108 0.6666667
## 5 1998 163 ATL 106 0.6503067
## 6 2004 162 STL 105 0.6481481
Good show by the New York teams.
My favorite team is the Red Sox, so I’ll single them out, looking at their best and worst seasons.
head(arrange(subset(tidy_mlb, Year >= 1980 & team == 'BOS'), desc(win_pct)))
## Year G team wins win_pct
## 1 2004 162 BOS 98 0.6049383
## 2 2013 163 BOS 97 0.5950920
## 3 1995 145 BOS 86 0.5931034
## 4 2007 163 BOS 96 0.5889571
## 5 2005 162 BOS 95 0.5864198
## 6 2003 162 BOS 95 0.5864198
head(arrange(subset(tidy_mlb, Year >= 1980 & team == 'BOS'), win_pct))
## Year G team wins win_pct
## 1 2012 162 BOS 69 0.4259259
## 2 2014 162 BOS 71 0.4382716
## 3 1992 162 BOS 73 0.4506173
## 4 1994 117 BOS 54 0.4615385
## 5 2015 162 BOS 78 0.4814815
## 6 1997 162 BOS 78 0.4814815
Now this is really interesting. Of the past 35 years, the Red Sox had their worst season, their second best season (where they won the World Series), and their second worst season. Maybe this explains why Boston sports fans are so manic despite seemingly winning a title in a different sport every year.
Finally, I’ll compare the Red Sox to their rival Yankees over that same time.
mlb_80 <- subset(tidy_mlb, Year >= 1980)
boston <- subset(mlb_80, team == 'BOS')
nyy <- subset(mlb_80, team == 'NYY')
table(boston$wins > nyy$wins)
##
## FALSE TRUE
## 26 10
So despite the Sox recent success, the Yankees still get the better more often than not.
The last dataset I chose was US trade data, contributed by Jason Joseph. Jason was kind enough to provide a .csv file, so I’ll load that in directly.
trade_data <- read.csv("International Trade in Goods and Services.csv",
stringsAsFactors = FALSE)
tbl_df(trade_data)
## Source: local data frame [13 x 8]
##
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Exports Imports Exports Imports Exports Imports
## 2
## 3 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 4 Services 56883 37,995 59,141 38,635 59,733 40,371
## 5
## 6 February Goods 132084 193,731 131,768 193,060 126,329 184,370
## 7 Services 56,828 37,935 57,726 39,270 59,560 40,057
## 8
## 9 March Goods 130,093 186,326 135,923 198,973 127,183 198,347
## 10 Services 56,647 37,938 58,837 38,908 59,583 40,581
## 11
## 12 April Goods 131,468 189,507 135,556 199,877 129,376 190,967
## 13 Services 56,592 38,318 59,468 39,417 59,766 40,432
In spite of the fact that this is the smallest data set I am working with on this project, it is also probably the most challenging to tidy. There are blank rows, two header rows, and double row records.
The easiest issue to deal with are the blank rows, so I’ll delete them out first by subsetting out any row not blank in the X2013 column.
trade_data <- subset(trade_data, X2013 != "")
tbl_df(trade_data)
## Source: local data frame [9 x 8]
##
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Exports Imports Exports Imports Exports Imports
## 2 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 3 Services 56883 37,995 59,141 38,635 59,733 40,371
## 4 February Goods 132084 193,731 131,768 193,060 126,329 184,370
## 5 Services 56,828 37,935 57,726 39,270 59,560 40,057
## 6 March Goods 130,093 186,326 135,923 198,973 127,183 198,347
## 7 Services 56,647 37,938 58,837 38,908 59,583 40,581
## 8 April Goods 131,468 189,507 135,556 199,877 129,376 190,967
## 9 Services 56,592 38,318 59,468 39,417 59,766 40,432
Now that the empty space has been eliminated, I’ll turn my attention to fixing the fact that the observations span two columns. I can do this by filling in the same month as the one above it.
trade_data$Period[c(3,5,7,9)] <- c("January", "February","March","April")
tbl_df(trade_data)
## Source: local data frame [9 x 8]
##
## Period X X2013 X.1 X2014 X.2 X2015 X.3
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Exports Imports Exports Imports Exports Imports
## 2 January Goods 131,228 191,745 133,738 193,706 129,292 192,242
## 3 January Services 56883 37,995 59,141 38,635 59,733 40,371
## 4 February Goods 132084 193,731 131,768 193,060 126,329 184,370
## 5 February Services 56,828 37,935 57,726 39,270 59,560 40,057
## 6 March Goods 130,093 186,326 135,923 198,973 127,183 198,347
## 7 March Services 56,647 37,938 58,837 38,908 59,583 40,581
## 8 April Goods 131,468 189,507 135,556 199,877 129,376 190,967
## 9 April Services 56,592 38,318 59,468 39,417 59,766 40,432
These column names aren’t very descriptive, but I can tell what they should be using context clues, so I will fix those now. This also allows me to fix the issue that there are two header rows. It creates another aspect of “untidyness” by combining two variables into one column, but I’m not really chaning the data, I’m just relabelling at this point.
colnames(trade_data)[2] <- "goods_services"
colnames(trade_data)[3:8] <- c("2013 Exports","2013 Imports","2014 Exports",
"2014 Imports","2015 Exports", "2015 Imports")
trade_data <- trade_data[2:9,]
rownames(trade_data) <- 1:8
tbl_df(trade_data)
## Source: local data frame [8 x 8]
##
## Period goods_services 2013 Exports 2013 Imports 2014 Exports
## (chr) (chr) (chr) (chr) (chr)
## 1 January Goods 131,228 191,745 133,738
## 2 January Services 56883 37,995 59,141
## 3 February Goods 132084 193,731 131,768
## 4 February Services 56,828 37,935 57,726
## 5 March Goods 130,093 186,326 135,923
## 6 March Services 56,647 37,938 58,837
## 7 April Goods 131,468 189,507 135,556
## 8 April Services 56,592 38,318 59,468
## Variables not shown: 2014 Imports (chr), 2015 Exports (chr), 2015 Imports
## (chr)
I want to do three things here. First, I want to narrow and lengthen the data by eliminating the Year/Type columns and putting them in a variable field. Since this also constitutes an example of more than one variable in a field, I’ll use separate to split them out using regex. Finally, the goods and services are swung out as column names.
tidy_trade <- trade_data %>%
gather("Year/Type", "Total", -Period, -goods_services)%>%
separate("Year/Type",
into = c("year", "import_export"),
sep = "[[:space:]]")%>%
spread(goods_services, Total)
tbl_df(tidy_trade)
## Source: local data frame [24 x 5]
##
## Period year import_export Goods Services
## (chr) (chr) (chr) (chr) (chr)
## 1 April 2013 Exports 131,468 56,592
## 2 April 2013 Imports 189,507 38,318
## 3 April 2014 Exports 135,556 59,468
## 4 April 2014 Imports 199,877 39,417
## 5 April 2015 Exports 129,376 59,766
## 6 April 2015 Imports 190,967 40,432
## 7 February 2013 Exports 132084 56,828
## 8 February 2013 Imports 193,731 37,935
## 9 February 2014 Exports 131,768 57,726
## 10 February 2014 Imports 193,060 39,270
## .. ... ... ... ... ...
It looks good, but I kind of went back and forth as to whether Goods and Services should be spread out as separate columns the way I did here. After all, it is just a two level categorical value, just like import/export. Should the same be done with that variable?
I decided no because it would require combining import/exports with goods and services. “Goods Imported” would be a column, for instance. While this would be a nice tabular display of the data, it seems like it would violate the variable combination rule, so I abstained.
However, there is a bit more work to be done. Since the csv contained commas, the values were imported as strings, therefore I am unable to compute at this point. The commas also prevent me from coercing directly to numeric (I thought R might be able to figure this out, but I was wrong). In any event, it’s a good opportunity to practice Regex a bit more.
tidy_trade$'Goods ' <- as.numeric(str_replace(tidy_trade$'Goods ', ",", ""))
tidy_trade$Services <- as.numeric(str_replace(tidy_trade$Services , ",", ""))
Before I do the computations Jason inquired about, I’m going to split up imports and exports into two separate data frames to make the subsetting a little easier.
exports <- subset(tidy_trade, import_export == 'Exports')
imports <- subset(tidy_trade, import_export == 'Imports')
With that in hand, I can quickly and easily find the average accross months.
#1. average goods exported for each month
mean(exports$Goods)
## [1] 131169.8
#2. avegage goods imported for each month
mean(imports$Goods)
## [1] 192737.6
#3. average services exported for each month
mean(exports$Services)
## [1] 58397
#4. average services imported for each month
mean(imports$Services)
## [1] 39154.75