For this assignment, I decided to go a little off the beaten path. As we are still in the introduction of data acquistion, and these Projects are more to help build a basis for our future understanding of data management, I felt we can still create solid data and code that highlights a way of extracting data, while still having a little fun. I wanted to focus on rvest, tidyr and dplyr packages in this project. I tried to stick with one of these three packages for the most part to do any of the necessary cleanning (though I did rely on some base R for the basic). I took the data sets from Sanjive Kumar on Brazil’s Football Team, and the Motocross data from Maxwell Wagner. These were the sets of data I could easily extract using the rvest package. I also noticed that the data for the Pope’s Popularity by Andrew Goldberg was an interesting choice, not exactly like the other data, but I thought it might present an opportunity to compare data. It would be fun to play around with this and maybe to answer the question: Is there any correlation, between the Pope’s Popularity and a Team’s winning streak?
#Initializing Packages
library(rvest)
## Loading required package: xml2
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
library(stringr)
library(ggplot2)
The reason I primaily picked the Motocross and FIFA data was because it was from Wikipedia. This data can easily be extracted using the Rvest Package. I did some research into the package, and will rely heavily on one source in particular:
http://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/
This one set of code requires one to inspect the elements of a webpage and locate the specific path in the source code. I like this in particular, as it allows one to extract the dataset directly from the site. No need to create a csv file in another program. So, the extracted data looks as follows:
WC_URL <- "https://en.wikipedia.org/wiki/Brazil_national_football_team#FIFA_World_Cup"
WC_Data <- WC_URL %>%
html() %>%
html_node(xpath ='//*[@id="mw-content-text"]/table[2]') %>%
html_table(header = NA, trim = TRUE, fill= TRUE)
## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")
head(WC_Data)
## FIFA World Cup record NA NA NA NA NA NA NA NA
## 1 Year Round Position Pld W D * L GF GA
## 2 1930 Group Stage 6th 2 1 0 1 5 2
## 3 1934 Round 1 14th 1 0 0 1 1 3
## 4 1938 Third Place 3rd 5 3 1 1 14 11
## 5 1950 Runners-up 2nd 6 4 1 1 22 6
## 6 1954 Quarter-Finals 5th 3 1 1 1 8 5
## FIFA World Cup qualification record NA NA NA
## 1 Pld W D L GF
## 2 â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093>
## 3 â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093>
## 4 â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093> â<U+0080><U+0093>
## 5 Qualified as hosts <NA> <NA> <NA> <NA>
## 6 4 4 0 0 8
## NA NA
## 1 GA NA
## 2 â<U+0080><U+0093> NA
## 3 â<U+0080><U+0093> NA
## 4 â<U+0080><U+0093> NA
## 5 <NA> NA
## 6 1 NA
So the above code works at extacting a very rough dataset, but it did extract it! For our purposes here, I was only interested in the first table (as you can see from the wikipedia Page it attached two tables into one) So first things first, clean up and tidying:
# Remove only the first 9 Columns which we will be working with.
WC_Brazil <-WC_Data[1:9]
#Resets Header to the Second Row
names(WC_Brazil) <- c("Year","Round", "Position", "Played", "Won", "Draw", "Lost","Goals","Goals-Against")
#Removes Duplicate Header Row
WC_Brazil <- WC_Brazil[-1,]
#Removes Total at the bottom (Not Necessary)
WC_Brazil <- WC_Brazil[-21,]
Now we have a cleaner data frame that we can work with, but there is still some tidying that needs to be done. First, the data that we are going to primarily work with in this data set is the final position. However, this is a none numeric value, so tidyr can be used to extact this data:
# I found while running the data, some of these values were not"numeric" However this extract numeric coerced them into a numeric value, so I pushed the values I needed using this function
WC_Brazil$Position <- extract_numeric(WC_Brazil$Position)
WC_Brazil$Goals <- extract_numeric(WC_Brazil$Goals)
WC_Brazil$Played <- extract_numeric(WC_Brazil$Played)
head(WC_Brazil)
## Year Round Position Played Won Draw Lost Goals Goals-Against
## 2 1930 Group Stage 6 2 1 0 1 5 2
## 3 1934 Round 1 14 1 0 0 1 1 3
## 4 1938 Third Place 3 5 3 1 1 14 11
## 5 1950 Runners-up 2 6 4 1 1 22 6
## 6 1954 Quarter-Finals 5 3 1 1 1 8 5
## 7 1958 Champions 1 6 5 1 0 16 4
For the Motocross data a similar Extraction was performed, and we created to following dataset.
MC_URL <- "https://en.wikipedia.org/wiki/Motocross_des_Nations"
MC_Data <- MC_URL %>%
html() %>%
html_node(xpath ='//*[@id="mw-content-text"]/table[3]') %>%
html_table(header = NA, trim = TRUE, fill= FALSE)
## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")
#Trimmed an excess column on the end:
MC_Data <- MC_Data[,-18]
head(MC_Data)
## Team 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 1 Australia 7th 6th 14th 13th 16th 11th 20th 6th 7th 6th 3rd
## 2 Austria 13th 22nd 23rd 21st 23rd 12th 20th
## 3 Belarus 27th
## 4 Belgium 3rd 2nd 2nd 1st 1st 3rd 2nd 3rd 3rd 3rd 2nd 5th
## 5 Brazil 23rd 23rd 16th 14th 14th 18th 27th
## 6 Canada 28th 26th 8th 17th 18th 10th 13th 27th 16th
## 2012 2013 2014 2015
## 1 10th 4th 14th 7th
## 2 24th 12th 10th
## 3
## 4 2nd 1st 2nd 3rd
## 5 32nd 31st 27th 27th
## 6 23rd 17th
For the most part this dataset is much more tidy than the previous one, and didn’t require much deletion or extraction of data. However, it did require us to gather the data into a much more manageble set. So that was the first part that we did:
MC_Total <-MC_Data %>% gather("Year","Place",2:17,na.rm = TRUE)
#Another step needed would be get a numeric value for the postion
MC_Total$Place <- extract_numeric(MC_Total$Place)
#Last we need to omit the NA values, as they are unnecessary for our analysis downstream
MC_Total<-na.omit(MC_Total)
head(MC_Total)
## Team Year Place
## 1 Australia 2000 7
## 2 Austria 2000 13
## 4 Belgium 2000 3
## 5 Brazil 2000 23
## 6 Canada 2000 28
## 10 Croatia 2000 17
For the Pope’s Popularity, unfortunately, the data was in an image file. At the moment, I was unable to extract the info from the website. So, I created a CSV file in the format of the data presented on the Discussion board. I kept the CSV to the format as best I could, but for simplicity I did change things a bit (I realize that this Project is to aid our knowledge of tidying data, but some things especially when creating .csv file are easier to manipulate in the file itself). I tried to use each of these datasets to emphasize a particular extraction and data manipulation, for this one I used some regular expressions to extract a particular item in the file.
#Extracting the URL text
popeURL <-getURL("https://raw.githubusercontent.com/mfarris9505/PopePop/master/PopePopularity.csv")
#Reading the txt file into "CSV"
pope_data <-read.csv(text = popeURL)
#Re-naming the columns
names(pope_data) <- c("Pope", "Date","Favorable", "Unfavorable", "No.Opinion")
head(pope_data)
## Pope Date Favorable Unfavorable No.Opinion
## 1 Pope Francis Feb 6-9,2014 76 9 16
## 2 Pope Francis Apr 11-14,2013 58 10 31
## 3 Pope Benedict XVI Mar 26-28,2010 40 35 25
## 4 Pope Benedict XVI Apr 18-20,2008 63 15 22
## 5 Pope Benedict XVI Jun 1-3,2007 52 16 32
## 6 Pope Benedict XVI Dec 16-18,2005 50 11 39
This data is pretty tidy, however, for my analysis, I want to extract the specific year when the poll was taken. So to extract the data, we used the the str_extract function.
pope_data$Date <-str_extract(pope_data$Date,"[:digit:]{4}")
#There were two 2005 years in this dataset for Pope Benedict and one for Pope John Paul, for ease (and because I needed a 2006 and a 2004 data point) I changed the December poll to 2006 and the Pope John Paul to 2004. We have yet to learn about the problems of changing data, so I figured this would be OK.
pope_data$Date[6] <- 2006
pope_data$Date[8] <- 2004
head(pope_data)
## Pope Date Favorable Unfavorable No.Opinion
## 1 Pope Francis 2014 76 9 16
## 2 Pope Francis 2013 58 10 31
## 3 Pope Benedict XVI 2010 40 35 25
## 4 Pope Benedict XVI 2008 63 15 22
## 5 Pope Benedict XVI 2007 52 16 32
## 6 Pope Benedict XVI 2006 50 11 39
Now that we have all the data in a group we can do some analysis.
For the analysis, I did the bare minimum that was requested for the two threads (Motocross and Pope data), which is shown below. For the World Cup data and Motocross data, I created two datasets and combined it with Favorability of the Pope during the given years. For the Motocross data, I picked 4 countries specifically USA, Japan, Ireland, and Brazil.
As stated, I couldn’t quite extract the data for the opponents, so I did some generic plotting. First, I was interested in see if the number of goals scored was related to the position they placed. This was accomplished as follows:
ggplot(WC_Brazil, aes(x=Goals, y=Position))+
geom_point()+
geom_smooth(method=lm)
I realize, this data is actually quite obvious. They would score more goals, mostly because they played more games, so I tried to optimize this by creating a sepatate column called Goals Per Game:
WC_Brazil <- WC_Brazil %>% mutate(GoalsPerGame = Goals/Played)
ggplot(WC_Brazil, aes(x=GoalsPerGame, y=Position))+
geom_point()+
geom_smooth(method=lm)
The replotted data shows a more acuarte depiction if there is a relation between Goals Scored and Placement.
The motocross data was interesting to look at and using dplyr we came to some interesting finds. First the average placement of each country:
MC_Total %>% group_by(Team) %>%
summarise(Mean = mean(Place))
## Source: local data frame [55 x 2]
##
## Team Mean
## (chr) (dbl)
## 1 Australia 9.60000
## 2 Austria 18.00000
## 3 Belarus 27.00000
## 4 Belgium 2.37500
## 5 Brazil 22.90909
## 6 Canada 18.45455
## 7 Chile 18.00000
## 8 China 32.50000
## 9 Costa Rica 25.00000
## 10 Croatia 27.80000
## .. ... ...
For the Pope dataset, I created two new Columns using the Mutate function, one to obtain what he called Awareness, and the other as Net Favoablility:
pope_data <- pope_data %>% mutate(Awareness = Favorable + Unfavorable) %>% mutate(Net_Favor = Favorable - Unfavorable)
pope_data
## Pope Date Favorable Unfavorable No.Opinion Awareness
## 1 Pope Francis 2014 76 9 16 85
## 2 Pope Francis 2013 58 10 31 68
## 3 Pope Benedict XVI 2010 40 35 25 75
## 4 Pope Benedict XVI 2008 63 15 22 78
## 5 Pope Benedict XVI 2007 52 16 32 68
## 6 Pope Benedict XVI 2006 50 11 39 61
## 7 Pope Benedict XVI 2005 55 12 33 67
## 8 Pope John Paul II 2004 78 11 11 89
## 9 Pope John Paul II 2003 73 17 10 90
## 10 Pope John Paul II 2002 61 26 13 87
## 11 Pope John Paul II 1998 86 8 6 94
## 12 Pope John Paul II 1993 64 15 21 79
## Net_Favor
## 1 67
## 2 48
## 3 5
## 4 48
## 5 36
## 6 39
## 7 43
## 8 67
## 9 56
## 10 35
## 11 78
## 12 49
This is where I take leaps and bounds past normal, and took it one step further: I combined data… Why? Because I can…
First step was the create a simple data table from the Pope data this was done as followed:
combo_data <- subset(pope_data,select = c(Date, Favorable))
names(combo_data) <- c("Year","Favorable")
# I mutated this into A percentage
combo_data <- combo_data %>% mutate(Favorable = Favorable/100)
# I next pulled data fromthe World Cup:
WC_Combo <- subset(WC_Brazil, select= c(Year,Position))
# And finally pulled the Motocross Data:
MC_Japan <- MC_Total %>% filter(Team == "Japan")
MC_USA <- MC_Total %>% filter(Team == "USA")
MC_Ireland <- MC_Total %>% filter(Team == "Ireland")
MC_Brazil <- MC_Total %>% filter(Team == "Brazil")
#Removes excess Column
MC_Japan <- MC_Japan[,-1]
MC_USA <- MC_USA[,-1]
MC_Ireland <- MC_Ireland[,-1]
MC_Brazil <- MC_Brazil[,-1]
#Using the Left_join we combined each of these separate datasets into one:
combo_data <- left_join(combo_data,WC_Combo, by = "Year")
combo_data <- left_join(combo_data,MC_Japan, by = "Year")
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
combo_data <- left_join(combo_data,MC_USA, by = "Year")
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
combo_data <- left_join(combo_data,MC_Ireland, by = "Year")
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
combo_data <- left_join(combo_data,MC_Brazil, by = "Year")
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
names(combo_data) <- c("Year","Favorable","Brazil_WC","Japan_MC","USA_MC","Ireland_MC", "Brazil_MC")
combo_data
## Year Favorable Brazil_WC Japan_MC USA_MC Ireland_MC Brazil_MC
## 1 2014 0.76 4 22 3 19 27
## 2 2013 0.58 NA 24 2 22 31
## 3 2010 0.40 6 19 1 20 18
## 4 2008 0.63 NA 17 1 21 14
## 5 2007 0.52 NA 7 1 13 16
## 6 2006 0.50 5 12 1 31 NA
## 7 2005 0.55 NA 12 1 22 NA
## 8 2004 0.78 NA 17 NA 14 NA
## 9 2003 0.73 NA 5 2 7 NA
## 10 2002 0.61 1 NA NA 9 NA
## 11 1998 0.86 2 NA NA NA NA
## 12 1993 0.64 NA NA NA NA NA
After all that… We successfully found absolutely nothing… First, this data makes several assumptions, like the favoribilty of the Pope in America is representative of the entire world (obviously not). However, for the sake of argument, I did plot the data for Ireland, as it looked promising, and Ireland is known for its connection to the Catholic Church (71% Catholic):
ggplot(combo_data, aes(x=Ireland_MC, y=Favorable))+
geom_point()+
geom_smooth(method=lm)
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
This data was actually surprising, There does appear to be a slight trend, that shows that as the people’s favor of the Pope goes down, so does the relative ranking of Ireland.
A similar plot was done using the World Cup data, as Brazil is another Catholic country (64% Catholic):
ggplot(combo_data, aes(x=Brazil_WC, y=Favorable))+
geom_point()+
geom_smooth(method=lm)
## Warning: Removed 7 rows containing missing values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).
Finally, we plotted Japan, Which we know isn’t catholic (according to my research only 0.5% of the popluation is a practicing catholic).
ggplot(combo_data, aes(x=Japan_MC, y=Favorable))+
geom_point()+
geom_smooth(method=lm)
## Warning: Removed 3 rows containing missing values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
Though this analysis is FAR from accurate, it is interesting to say the least. I am still not familiar with many statistical test, but the fact that I found any sort of trend is just flat out strange. I felt I would find absolutely no change, like we see in the data set from Japan. Again, I can’t say this anything other than a coincidence, but it is a pleasant surprise for me. I do admire Pope Francis, and was generally pleased with his recent appearance here in America, as I think he did speak on many topics, and inspired many by his speeches. Maybe, just maybe, I should try out for the soccer team.