For my second project in IS 607, I am tasked with taking 3 ‘wide’ datasets, tidying them according to the standards in this Hadley Wickham paper, and performing a bit of exploratory data analysis. The datasets were all proposed by my peers in the IS 607 Fall 2015 cohort. The 3 datasets I chose were the 2016 GOP presidential nomination polling data, all time MLB wins data, and 2013-2015 Q1 United States import/export data.

As always, I’ll load my libraries before beginning.

library(tidyr)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)

1. 2016 GOP Presidential Nomination Polling Data

This dataset was proposed by Edwige Talla Badjio, and can be found in the second table at this link from Real Clear Politics (and coincentally, it looks like the website has undergone a face lift since I pulled the data).

Since the data comes from a tabular HTML format, it would have been possible to use cURL to interact with the table directly, but in this case it was faster to copy the data into excel and generate a .csv file that way.

I’ll first load the .csv file, and use the na.strings argument to have R “–” values as NA, since it represented no polling data for that candidate.

gop <- read.csv('gop.csv', na.strings = "--")

The first step is to take a look at the data to get a feel for how it is formatted and develop a strategy for tidying it.

tbl_df(gop)
## Source: local data frame [101 x 17]
## 
##             Poll        Date Trump Carson Fiorina Rubio  Bush  Cruz Kasich
##           (fctr)      (fctr) (dbl)  (dbl)   (dbl) (dbl) (dbl) (dbl)  (dbl)
## 1    RCP Average 9/17 - 10/4  23.2   17.2    10.4   9.9   8.4   6.2    3.2
## 2        PPP (D) 10/1 - 10/4  27.0   17.0     6.0  13.0  10.0   7.0    4.0
## 3       IBD/TIPP 9/26 - 10/1  17.0   24.0     9.0  11.0   8.0   6.0    4.0
## 4   USAT/Suffolk 9/24 - 9/28  23.0   13.0    13.0   9.0   8.0   6.0    2.0
## 5  Pew Research* 9/22 - 9/27  25.0   16.0     8.0   8.0   4.0   6.0    1.0
## 6        NBC/WSJ 9/20 - 9/24  21.0   20.0    11.0  11.0   7.0   5.0    6.0
## 7       FOX News 9/20 - 9/22  26.0   18.0     9.0   9.0   7.0   8.0    4.0
## 8      Bloomberg 9/18 - 9/21  21.0   16.0    11.0   8.0  13.0   5.0    4.0
## 9     Quinnipiac 9/17 - 9/21  25.0   17.0    12.0   9.0  10.0   7.0    2.0
## 10       CNN/ORC 9/17 - 9/19  24.0   14.0    15.0  11.0   9.0   6.0    2.0
## ..           ...         ...   ...    ...     ...   ...   ...   ...    ...
## Variables not shown: Huckabee (dbl), Christie (dbl), Paul (dbl), Santorum
##   (dbl), Jindal (dbl), Graham (dbl), Pataki (dbl), Spread (fctr)

The data frame I am working with consists of a Poll variable, which describes the organization that conducted the poll, a date variable, which describes the time frame the polling data was collected, each candidate as a seperate column, and a spread column indicating the top candidates lead over the 2nd place candidate.

Because the date ranges do not include years, and because most of the candidates did not formally announce until this year, I will only look at observations from 2015, and drop everything prior.

gop <- gop[1:55,]

The most obvious violation of a tidy data format is that the candidates names, which are variables, are used as columnn names. To fix this problem, I’ll use tidyr’s gather() function to move them all into one column, and significantly reduce the ‘width’ of the data set.

gop <- gather(gop, candidate, points, Trump:Pataki)
tbl_df(gop)
## Source: local data frame [770 x 5]
## 
##             Poll        Date     Spread candidate points
##           (fctr)      (fctr)     (fctr)    (fctr)  (dbl)
## 1    RCP Average 9/17 - 10/4 Trump +6.0     Trump   23.2
## 2        PPP (D) 10/1 - 10/4  Trump +10     Trump   27.0
## 3       IBD/TIPP 9/26 - 10/1  Carson +7     Trump   17.0
## 4   USAT/Suffolk 9/24 - 9/28  Trump +10     Trump   23.0
## 5  Pew Research* 9/22 - 9/27   Trump +9     Trump   25.0
## 6        NBC/WSJ 9/20 - 9/24   Trump +1     Trump   21.0
## 7       FOX News 9/20 - 9/22   Trump +8     Trump   26.0
## 8      Bloomberg 9/18 - 9/21   Trump +5     Trump   21.0
## 9     Quinnipiac 9/17 - 9/21   Trump +8     Trump   25.0
## 10       CNN/ORC 9/17 - 9/19   Trump +9     Trump   24.0
## ..           ...         ...        ...       ...    ...

I don’t really care about the spread column, it’s redundant, and I’ll be able to easily calculate it later if I want, so I’ll subset it out.

gop <- cbind(gop[1:2], gop[4:5])

I eventually want to look at this data as a function of time, so I need to get the date data cleaned up. The intervals are all different, so I think the best way to standardize this is to only look at the end date. I’ll use a regular expression to extract out the second date and replace the gop$Date vector with the end date.

gop$Date <- str_extract(gop$Date, "[[:digit:]]{1,2}/[[:digit:]]{1,2}$")
head(gop$Date)
## [1] "10/4" "10/4" "10/1" "9/28" "9/27" "9/24"

The next issue is while I have the dates formatted the way I want, they are still character strings, thus won’t compare to each other well. The solution is to convert each to a POSIX time format using the strptime() function.

gop$Date <- strptime(gop$Date,"%m/%d")

Last but not least is to coerce the polling points to numeric so they too can be compared.

gop$points <- as.numeric(gop$points)

I’ll have a look at the df…

tbl_df(gop)
## Source: local data frame [770 x 4]
## 
##             Poll      Date candidate points
##           (fctr)     (chr)    (fctr)  (dbl)
## 1    RCP Average <dbl[10]>     Trump   23.2
## 2        PPP (D) <int[10]>     Trump   27.0
## 3       IBD/TIPP <int[10]>     Trump   17.0
## 4   USAT/Suffolk <int[10]>     Trump   23.0
## 5  Pew Research* <int[10]>     Trump   25.0
## 6        NBC/WSJ <int[10]>     Trump   21.0
## 7       FOX News <int[10]>     Trump   26.0
## 8      Bloomberg <int[10]>     Trump   21.0
## 9     Quinnipiac <int[10]>     Trump   25.0
## 10       CNN/ORC <chr[10]>     Trump   24.0
## ..           ...       ...       ...    ...

…and the data looks tidy to me, so I’m ready to start exploring.

The webpage that this data pulls from has a line graph that plots all the candidates polling data as a function of time, color coded by candidate. I find it a bit difficult to read, so I’ll see if I can do better.

My first step is to subset the top 5 candidates (sorry George Pataki).

trump <- subset(gop, candidate == "Trump" & !is.na(points))
rubio <- subset(gop, candidate == "Rubio" & !is.na(points))
bush <- subset(gop, candidate == "Bush" & !is.na(points))
carson <- subset(gop, candidate == "Carson" & !is.na(points))
fiorina <- subset(gop, candidate == "Fiorina" & !is.na(points))
cruz <- subset(gop, candidate == "Cruz" & !is.na(points))

Next, I’ll instantiate a ggplot object using Trump’s data, and plot out his numbers. What colo(u)r? We’ll if you’ve ever seen Trump tower, you know there’s only one choice.

gg <- ggplot(data=trump, aes(x=Date, y=points, colours='gold'))
gg + geom_point()

So this provides a pretty good feel for Trumps candidacy so far. His ascendancy started around mid June, and dipped a bit towards the end of September.

Next, I’ll compare how the estranged Florida candidates have performed:

gg + geom_point(data=bush, colour ='red') + geom_point(data=rubio, colour='blue')

While Jeb! has been leading most of the way, it looks like he may have been passed by his former understudy in recent weeks.

Finally, I’ll look at all the top 6 candiates.

horse_race <- gg +  geom_point(data=fiorina, colour ='black') + 
                    geom_point(data=rubio, colour='blue') + 
                    geom_point(data=trump, colour='gold') + 
                    geom_point(data=carson, colour='green') + 
                    geom_point(data=bush, colour='red')
horse_race

Despite Trumps downturn, he’s still clearly in the lead. But the top candidates seem to converging, indicating we’re a long way from the convetion.

2. All time MLB wins data

This data set was contributed by Nicholas Capofari and can be found on the Baseball Reference site. The data includes every MLB team from all time, there wins, and the games played in that year.

If you click the link, the table can actually be converted into a .csv format (but not downloaded, oddly). This makes the data a little bit easier to get into a .csv file, but also means the header rows are repeated throughout the file. To handle this, when I read in the data I will subset out those ‘header’ records.

mlb <- subset(read.csv('mlb.csv', stringsAsFactors = FALSE), Year != 'Year')

If I look at the structure of the data…

str(mlb)
## 'data.frame':    115 obs. of  33 variables:
##  $ Year: chr  "2015" "2014" "2013" "2012" ...
##  $ G   : chr  "162" "162" "163" "162" ...
##  $ ARI : chr  "79" "64" "81" "81" ...
##  $ ATL : chr  "67" "79" "96" "94" ...
##  $ BLA : chr  "" "" "" "" ...
##  $ BAL : chr  "81" "96" "85" "93" ...
##  $ BOS : chr  "78" "71" "97" "69" ...
##  $ CHC : chr  "97" "73" "66" "61" ...
##  $ CHW : chr  "76" "73" "63" "85" ...
##  $ CIN : chr  "64" "76" "90" "97" ...
##  $ CLE : chr  "81" "85" "92" "68" ...
##  $ COL : chr  "68" "66" "74" "64" ...
##  $ DET : chr  "74" "90" "93" "88" ...
##  $ HOU : chr  "86" "70" "51" "55" ...
##  $ KCR : chr  "95" "89" "86" "72" ...
##  $ ANA : chr  "85" "98" "78" "89" ...
##  $ LAD : chr  "92" "94" "92" "86" ...
##  $ FLA : chr  "71" "77" "62" "69" ...
##  $ MIL : chr  "68" "82" "74" "83" ...
##  $ MIN : chr  "83" "70" "66" "66" ...
##  $ NYM : chr  "90" "79" "74" "74" ...
##  $ NYY : chr  "87" "84" "85" "95" ...
##  $ OAK : chr  "68" "88" "96" "94" ...
##  $ PHI : chr  "63" "73" "73" "81" ...
##  $ PIT : chr  "98" "88" "94" "79" ...
##  $ SDP : chr  "74" "77" "76" "76" ...
##  $ SFG : chr  "84" "88" "76" "94" ...
##  $ SEA : chr  "76" "87" "71" "75" ...
##  $ STL : chr  "100" "90" "97" "88" ...
##  $ TBD : chr  "80" "77" "92" "90" ...
##  $ TEX : chr  "88" "67" "91" "93" ...
##  $ TOR : chr  "93" "83" "74" "73" ...
##  $ WSN : chr  "83" "96" "86" "98" ...

…I’ll see that for whatever reason my stringsAsFactors option didn’t work, everything is still a factor. I’m guessing this is because of the junk records.

Anyway I need to handle this, and to do so I’ll use lapply across the columns of the data frame to coerce everything to numeric and that put it back in a data frame.

mlb <- as.data.frame.list(lapply(mlb, function (x) as.numeric(x)))
tbl_df(mlb)
## Source: local data frame [115 x 33]
## 
##     Year     G   ARI   ATL   BLA   BAL   BOS   CHC   CHW   CIN   CLE   COL
##    (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1   2015   162    79    67    NA    81    78    97    76    64    81    68
## 2   2014   162    64    79    NA    96    71    73    73    76    85    66
## 3   2013   163    81    96    NA    85    97    66    63    90    92    74
## 4   2012   162    81    94    NA    93    69    61    85    97    68    64
## 5   2011   162    94    89    NA    69    90    71    79    79    80    73
## 6   2010   162    65    91    NA    66    89    75    88    91    69    83
## 7   2009   163    70    86    NA    64    95    83    79    78    65    92
## 8   2008   163    82    72    NA    68    95    97    89    74    81    74
## 9   2007   163    90    84    NA    69    96    85    72    72    96    90
## 10  2006   162    76    79    NA    70    86    66    90    80    78    76
## ..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
## Variables not shown: DET (dbl), HOU (dbl), KCR (dbl), ANA (dbl), LAD
##   (dbl), FLA (dbl), MIL (dbl), MIN (dbl), NYM (dbl), NYY (dbl), OAK (dbl),
##   PHI (dbl), PIT (dbl), SDP (dbl), SFG (dbl), SEA (dbl), STL (dbl), TBD
##   (dbl), TEX (dbl), TOR (dbl), WSN (dbl)

Similar to the last scenario, the major problem witht this dataset is that there are variables as column names. Fortunately, tidyr makes this a very easy problem to deal with.

tidy_mlb <- gather(mlb, team, wins, ARI:WSN)
tbl_df(tidy_mlb)
## Source: local data frame [3,565 x 4]
## 
##     Year     G   team  wins
##    (dbl) (dbl) (fctr) (dbl)
## 1   2015   162    ARI    79
## 2   2014   162    ARI    64
## 3   2013   163    ARI    81
## 4   2012   162    ARI    81
## 5   2011   162    ARI    94
## 6   2010   162    ARI    65
## 7   2009   163    ARI    70
## 8   2008   163    ARI    82
## 9   2007   163    ARI    90
## 10  2006   162    ARI    76
## ..   ...   ...    ...   ...

Since the amount of games played each year varies, I’ll add a win pct column to make each observation directly comparable.

tidy_mlb[,'win_pct'] <- tidy_mlb$wins / tidy_mlb$G

And with that I’m ready to look at the data.

Since baseball has been played seemingly forever, I’ll mostly stick to looking at data since 1980. The first thing I’ll look at is the best individual seasons.

head(arrange(subset(tidy_mlb, Year >= 1980), desc(win_pct)))
##   Year   G team wins   win_pct
## 1 2001 162  SEA  116 0.7160494
## 2 1998 163  NYY  114 0.6993865
## 3 1995 145  CLE  100 0.6896552
## 4 1986 162  NYM  108 0.6666667
## 5 1998 163  ATL  106 0.6503067
## 6 2004 162  STL  105 0.6481481

Good show by the New York teams.

My favorite team is the Red Sox, so I’ll single them out, looking at their best and worst seasons.

head(arrange(subset(tidy_mlb, Year >= 1980 & team == 'BOS'), desc(win_pct)))
##   Year   G team wins   win_pct
## 1 2004 162  BOS   98 0.6049383
## 2 2013 163  BOS   97 0.5950920
## 3 1995 145  BOS   86 0.5931034
## 4 2007 163  BOS   96 0.5889571
## 5 2005 162  BOS   95 0.5864198
## 6 2003 162  BOS   95 0.5864198
head(arrange(subset(tidy_mlb, Year >= 1980 & team == 'BOS'), win_pct))
##   Year   G team wins   win_pct
## 1 2012 162  BOS   69 0.4259259
## 2 2014 162  BOS   71 0.4382716
## 3 1992 162  BOS   73 0.4506173
## 4 1994 117  BOS   54 0.4615385
## 5 2015 162  BOS   78 0.4814815
## 6 1997 162  BOS   78 0.4814815

Now this is really interesting. Of the past 35 years, the Red Sox had their worst season, their second best season (where they won the World Series), and their second worst season. Maybe this explains why Boston sports fans are so manic despite seemingly winning a title in a different sport every year.

Finally, I’ll compare the Red Sox to their rival Yankees over that same time.

mlb_80 <- subset(tidy_mlb, Year >= 1980)
boston <- subset(mlb_80, team == 'BOS')
nyy <- subset(mlb_80, team == 'NYY')
table(boston$wins > nyy$wins)
## 
## FALSE  TRUE 
##    26    10

So despite the Sox recent success, the Yankees still get the better more often than not.

3. 2013-2015 Q1 United States Import/Export Data

The last dataset I chose was US trade data, contributed by Jason Joseph. Jason was kind enough to provide a .csv file, so I’ll load that in directly.

trade_data <- read.csv("International Trade in Goods and Services.csv", 
                          stringsAsFactors = FALSE)
tbl_df(trade_data)
## Source: local data frame [13 x 8]
## 
##      Period        X   X2013     X.1   X2014     X.2   X2015     X.3
##       (chr)    (chr)   (chr)   (chr)   (chr)   (chr)   (chr)   (chr)
## 1                    Exports Imports Exports Imports Exports Imports
## 2                                                                   
## 3   January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 4           Services   56883  37,995  59,141  38,635  59,733  40,371
## 5                                                                   
## 6  February   Goods   132084 193,731 131,768 193,060 126,329 184,370
## 7           Services  56,828  37,935  57,726  39,270  59,560  40,057
## 8                                                                   
## 9     March   Goods  130,093 186,326 135,923 198,973 127,183 198,347
## 10          Services  56,647  37,938  58,837  38,908  59,583  40,581
## 11                                                                  
## 12    April   Goods  131,468 189,507 135,556 199,877 129,376 190,967
## 13          Services  56,592  38,318  59,468  39,417  59,766  40,432

In spite of the fact that this is the smallest data set I am working with on this project, it is also probably the most challenging to tidy. There are blank rows, two header rows, and double row records.

The easiest issue to deal with are the blank rows, so I’ll delete them out first by subsetting out any row not blank in the X2013 column.

trade_data <- subset(trade_data, X2013 != "")
tbl_df(trade_data)
## Source: local data frame [9 x 8]
## 
##     Period        X   X2013     X.1   X2014     X.2   X2015     X.3
##      (chr)    (chr)   (chr)   (chr)   (chr)   (chr)   (chr)   (chr)
## 1                   Exports Imports Exports Imports Exports Imports
## 2  January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 3          Services   56883  37,995  59,141  38,635  59,733  40,371
## 4 February   Goods   132084 193,731 131,768 193,060 126,329 184,370
## 5          Services  56,828  37,935  57,726  39,270  59,560  40,057
## 6    March   Goods  130,093 186,326 135,923 198,973 127,183 198,347
## 7          Services  56,647  37,938  58,837  38,908  59,583  40,581
## 8    April   Goods  131,468 189,507 135,556 199,877 129,376 190,967
## 9          Services  56,592  38,318  59,468  39,417  59,766  40,432

Now that the empty space has been eliminated, I’ll turn my attention to fixing the fact that the observations span two columns. I can do this by filling in the same month as the one above it.

trade_data$Period[c(3,5,7,9)] <- c("January", "February","March","April")
tbl_df(trade_data)
## Source: local data frame [9 x 8]
## 
##     Period        X   X2013     X.1   X2014     X.2   X2015     X.3
##      (chr)    (chr)   (chr)   (chr)   (chr)   (chr)   (chr)   (chr)
## 1                   Exports Imports Exports Imports Exports Imports
## 2  January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 3  January Services   56883  37,995  59,141  38,635  59,733  40,371
## 4 February   Goods   132084 193,731 131,768 193,060 126,329 184,370
## 5 February Services  56,828  37,935  57,726  39,270  59,560  40,057
## 6    March   Goods  130,093 186,326 135,923 198,973 127,183 198,347
## 7    March Services  56,647  37,938  58,837  38,908  59,583  40,581
## 8    April   Goods  131,468 189,507 135,556 199,877 129,376 190,967
## 9    April Services  56,592  38,318  59,468  39,417  59,766  40,432

These column names aren’t very descriptive, but I can tell what they should be using context clues, so I will fix those now. This also allows me to fix the issue that there are two header rows. It creates another aspect of “untidyness” by combining two variables into one column, but I’m not really chaning the data, I’m just relabelling at this point.

colnames(trade_data)[2] <- "goods_services"
colnames(trade_data)[3:8] <- c("2013 Exports","2013 Imports","2014 Exports",
                               "2014 Imports","2015 Exports", "2015 Imports")
trade_data <- trade_data[2:9,]
rownames(trade_data) <- 1:8
tbl_df(trade_data)
## Source: local data frame [8 x 8]
## 
##     Period goods_services 2013 Exports 2013 Imports 2014 Exports
##      (chr)          (chr)        (chr)        (chr)        (chr)
## 1  January         Goods       131,228      191,745      133,738
## 2  January       Services        56883       37,995       59,141
## 3 February         Goods        132084      193,731      131,768
## 4 February       Services       56,828       37,935       57,726
## 5    March         Goods       130,093      186,326      135,923
## 6    March       Services       56,647       37,938       58,837
## 7    April         Goods       131,468      189,507      135,556
## 8    April       Services       56,592       38,318       59,468
## Variables not shown: 2014 Imports (chr), 2015 Exports (chr), 2015 Imports
##   (chr)

I want to do three things here. First, I want to narrow and lengthen the data by eliminating the Year/Type columns and putting them in a variable field. Since this also constitutes an example of more than one variable in a field, I’ll use separate to split them out using regex. Finally, the goods and services are swung out as column names.

tidy_trade <- trade_data %>%
                      gather("Year/Type", "Total", -Period, -goods_services)%>%  
                      separate("Year/Type", 
                                into = c("year", "import_export"), 
                                sep = "[[:space:]]")%>%
                      spread(goods_services, Total)
tbl_df(tidy_trade)
## Source: local data frame [24 x 5]
## 
##      Period  year import_export  Goods  Services
##       (chr) (chr)         (chr)   (chr)    (chr)
## 1     April  2013       Exports 131,468   56,592
## 2     April  2013       Imports 189,507   38,318
## 3     April  2014       Exports 135,556   59,468
## 4     April  2014       Imports 199,877   39,417
## 5     April  2015       Exports 129,376   59,766
## 6     April  2015       Imports 190,967   40,432
## 7  February  2013       Exports  132084   56,828
## 8  February  2013       Imports 193,731   37,935
## 9  February  2014       Exports 131,768   57,726
## 10 February  2014       Imports 193,060   39,270
## ..      ...   ...           ...     ...      ...

It looks good, but I kind of went back and forth as to whether Goods and Services should be spread out as separate columns the way I did here. After all, it is just a two level categorical value, just like import/export. Should the same be done with that variable?

I decided no because it would require combining import/exports with goods and services. “Goods Imported” would be a column, for instance. While this would be a nice tabular display of the data, it seems like it would violate the variable combination rule, so I abstained.

However, there is a bit more work to be done. Since the csv contained commas, the values were imported as strings, therefore I am unable to compute at this point. The commas also prevent me from coercing directly to numeric (I thought R might be able to figure this out, but I was wrong). In any event, it’s a good opportunity to practice Regex a bit more.

tidy_trade$'Goods ' <- as.numeric(str_replace(tidy_trade$'Goods ', ",", ""))
tidy_trade$Services  <- as.numeric(str_replace(tidy_trade$Services , ",", ""))

Before I do the computations Jason inquired about, I’m going to split up imports and exports into two separate data frames to make the subsetting a little easier.

exports <- subset(tidy_trade, import_export == 'Exports')
imports <- subset(tidy_trade, import_export == 'Imports')

With that in hand, I can quickly and easily find the average accross months.

#1. average goods exported for each month
mean(exports$Goods)
## [1] 131169.8
#2. avegage goods imported for each month
mean(imports$Goods)
## [1] 192737.6
#3. average services exported for each month
mean(exports$Services)
## [1] 58397
#4. average services imported for each month
mean(imports$Services)
## [1] 39154.75