Does the Popularity of the Pope Make People Better Players?

For this assignment, I decided to go a little off the beaten path. As we are still in the introduction of data acquistion, and these Projects are more to help build a basis for our future understanding of data management, I felt we can still create solid data and code that highlights a way of extracting data, while still having a little fun. I wanted to focus on rvest, tidyr and dplyr packages in this project. I tried to stick with one of these three packages for the most part to do any of the necessary cleanning (though I did rely on some base R for the basic). I took the data sets from Sanjive Kumar on Brazil’s Football Team, and the Motocross data from Maxwell Wagner. These were the sets of data I could easily extract using the rvest package. I also noticed that the data for the Pope’s Popularity by Andrew Goldberg was an interesting choice, not exactly like the other data, but I thought it might present an opportunity to compare data. It would be fun to play around with this and maybe to answer the question: Is there any correlation, between the Pope’s Popularity and a Team’s winning streak?

Data Extraction

#Initializing Packages
library(rvest)

## Loading required package: xml2

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(RCurl)

## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete

library(stringr)
library(ggplot2)

Brazil’s FIFA Record

The reason I primaily picked the Motocross and FIFA data was because it was from Wikipedia. This data can easily be extracted using the Rvest Package. I did some research into the package, and will rely heavily on one source in particular:

http://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/

This one set of code requires one to inspect the elements of a webpage and locate the specific path in the source code. I like this in particular, as it allows one to extract the dataset directly from the site. No need to create a csv file in another program. So, the extracted data looks as follows:

WC_URL <- "https://en.wikipedia.org/wiki/Brazil_national_football_team#FIFA_World_Cup"

WC_Data <- WC_URL %>% 
  html() %>% 
  html_node(xpath ='//*[@id="mw-content-text"]/table[2]') %>% 
  html_table(header = NA, trim = TRUE, fill= TRUE)

## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")

head(WC_Data)

##   FIFA World Cup record             NA       NA  NA NA  NA NA NA NA
## 1                  Year          Round Position Pld  W D *  L GF GA
## 2                  1930    Group Stage      6th   2  1   0  1  5  2
## 3                  1934        Round 1     14th   1  0   0  1  1  3
## 4                  1938    Third Place      3rd   5  3   1  1 14 11
## 5                  1950     Runners-up      2nd   6  4   1  1 22  6
## 6                  1954 Quarter-Finals      5th   3  1   1  1  8  5
##                      FIFA World Cup qualification record   NA   NA   NA
## 1                Pld                                   W    D    L   GF
## 2                â<U+0080><U+0093>                                 â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>
## 3                â<U+0080><U+0093>                                 â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>
## 4                â<U+0080><U+0093>                                 â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>  â<U+0080><U+0093>
## 5 Qualified as hosts                                <NA> <NA> <NA> <NA>
## 6                  4                                   4    0    0    8
##     NA NA
## 1   GA NA
## 2  â<U+0080><U+0093> NA
## 3  â<U+0080><U+0093> NA
## 4  â<U+0080><U+0093> NA
## 5 <NA> NA
## 6    1 NA

So the above code works at extacting a very rough dataset, but it did extract it! For our purposes here, I was only interested in the first table (as you can see from the wikipedia Page it attached two tables into one) So first things first, clean up and tidying:

# Remove only the first 9 Columns which we will be working with.
WC_Brazil <-WC_Data[1:9]
#Resets Header to the Second Row
names(WC_Brazil) <- c("Year","Round", "Position", "Played", "Won", "Draw", "Lost","Goals","Goals-Against")
#Removes Duplicate Header Row
WC_Brazil <- WC_Brazil[-1,]
#Removes Total at the bottom (Not Necessary)
WC_Brazil <- WC_Brazil[-21,]

Now we have a cleaner data frame that we can work with, but there is still some tidying that needs to be done. First, the data that we are going to primarily work with in this data set is the final position. However, this is a none numeric value, so tidyr can be used to extact this data:

# I found while running the data, some of these values were not"numeric" However this extract numeric coerced them into a numeric value, so I pushed the values I needed using this function
WC_Brazil$Position <- extract_numeric(WC_Brazil$Position)
WC_Brazil$Goals <- extract_numeric(WC_Brazil$Goals)
WC_Brazil$Played <- extract_numeric(WC_Brazil$Played)
head(WC_Brazil)

##   Year          Round Position Played Won Draw Lost Goals Goals-Against
## 2 1930    Group Stage        6      2   1    0    1     5             2
## 3 1934        Round 1       14      1   0    0    1     1             3
## 4 1938    Third Place        3      5   3    1    1    14            11
## 5 1950     Runners-up        2      6   4    1    1    22             6
## 6 1954 Quarter-Finals        5      3   1    1    1     8             5
## 7 1958      Champions        1      6   5    1    0    16             4

Motocross

For the Motocross data a similar Extraction was performed, and we created to following dataset.

MC_URL <- "https://en.wikipedia.org/wiki/Motocross_des_Nations"

MC_Data <- MC_URL %>% 
  html() %>% 
  html_node(xpath ='//*[@id="mw-content-text"]/table[3]') %>% 
  html_table(header = NA, trim = TRUE, fill= FALSE)

## Warning: 'html' is deprecated.
## Use 'read_html' instead.
## See help("Deprecated")

#Trimmed an excess column on the end: 
MC_Data <- MC_Data[,-18]

head(MC_Data)

##        Team 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 1 Australia  7th  6th      14th 13th 16th 11th 20th  6th  7th  6th  3rd
## 2   Austria 13th                22nd 23rd 21st           23rd 12th 20th
## 3   Belarus                27th                                        
## 4   Belgium  3rd  2nd  2nd  1st  1st  3rd  2nd  3rd  3rd  3rd  2nd  5th
## 5    Brazil 23rd 23rd                          16th 14th 14th 18th 27th
## 6    Canada 28th 26th            8th 17th 18th 10th 13th 27th 16th     
##   2012 2013 2014 2015
## 1 10th  4th 14th  7th
## 2 24th 12th      10th
## 3                    
## 4  2nd  1st  2nd  3rd
## 5 32nd 31st 27th 27th
## 6      23rd 17th

For the most part this dataset is much more tidy than the previous one, and didn’t require much deletion or extraction of data. However, it did require us to gather the data into a much more manageble set. So that was the first part that we did:

MC_Total <-MC_Data %>% gather("Year","Place",2:17,na.rm = TRUE)
#Another step needed would be get a numeric value for the postion
MC_Total$Place <- extract_numeric(MC_Total$Place)
#Last we need to omit the NA values, as they are unnecessary for our analysis downstream
MC_Total<-na.omit(MC_Total)

head(MC_Total)

##         Team Year Place
## 1  Australia 2000     7
## 2    Austria 2000    13
## 4    Belgium 2000     3
## 5     Brazil 2000    23
## 6     Canada 2000    28
## 10   Croatia 2000    17

Pope Popularity

For the Pope’s Popularity, unfortunately, the data was in an image file. At the moment, I was unable to extract the info from the website. So, I created a CSV file in the format of the data presented on the Discussion board. I kept the CSV to the format as best I could, but for simplicity I did change things a bit (I realize that this Project is to aid our knowledge of tidying data, but some things especially when creating .csv file are easier to manipulate in the file itself). I tried to use each of these datasets to emphasize a particular extraction and data manipulation, for this one I used some regular expressions to extract a particular item in the file.

#Extracting the URL text 
popeURL <-getURL("https://raw.githubusercontent.com/mfarris9505/PopePop/master/PopePopularity.csv")

#Reading the txt file into "CSV"
pope_data <-read.csv(text = popeURL)
#Re-naming the columns
names(pope_data) <- c("Pope", "Date","Favorable", "Unfavorable", "No.Opinion")

head(pope_data)

##                Pope           Date Favorable Unfavorable No.Opinion
## 1      Pope Francis   Feb 6-9,2014        76           9         16
## 2      Pope Francis Apr 11-14,2013        58          10         31
## 3 Pope Benedict XVI Mar 26-28,2010        40          35         25
## 4 Pope Benedict XVI Apr 18-20,2008        63          15         22
## 5 Pope Benedict XVI  Jun 1-3,2007         52          16         32
## 6 Pope Benedict XVI Dec 16-18,2005        50          11         39

This data is pretty tidy, however, for my analysis, I want to extract the specific year when the poll was taken. So to extract the data, we used the the str_extract function.

pope_data$Date <-str_extract(pope_data$Date,"[:digit:]{4}")

#There were two 2005 years in this dataset for Pope Benedict and one for Pope John Paul, for ease (and because I needed a 2006 and a 2004 data point) I changed the December poll to 2006 and the Pope John Paul to 2004. We have yet to learn about the problems of changing data, so I figured this would be OK.
pope_data$Date[6] <- 2006
pope_data$Date[8] <- 2004
head(pope_data)

##                Pope Date Favorable Unfavorable No.Opinion
## 1      Pope Francis 2014        76           9         16
## 2      Pope Francis 2013        58          10         31
## 3 Pope Benedict XVI 2010        40          35         25
## 4 Pope Benedict XVI 2008        63          15         22
## 5 Pope Benedict XVI 2007        52          16         32
## 6 Pope Benedict XVI 2006        50          11         39

Now that we have all the data in a group we can do some analysis.

Results and Analysis

For the analysis, I did the bare minimum that was requested for the two threads (Motocross and Pope data), which is shown below. For the World Cup data and Motocross data, I created two datasets and combined it with Favorability of the Pope during the given years. For the Motocross data, I picked 4 countries specifically USA, Japan, Ireland, and Brazil.

Thread Analysis

As stated, I couldn’t quite extract the data for the opponents, so I did some generic plotting. First, I was interested in see if the number of goals scored was related to the position they placed. This was accomplished as follows:

ggplot(WC_Brazil, aes(x=Goals, y=Position))+
  geom_point()+
  geom_smooth(method=lm)

I realize, this data is actually quite obvious. They would score more goals, mostly because they played more games, so I tried to optimize this by creating a sepatate column called Goals Per Game:

WC_Brazil <- WC_Brazil %>% mutate(GoalsPerGame = Goals/Played)

ggplot(WC_Brazil, aes(x=GoalsPerGame, y=Position))+
  geom_point()+
  geom_smooth(method=lm)

The replotted data shows a more acuarte depiction if there is a relation between Goals Scored and Placement.

Next we did some analysis on the Motocross data:

The motocross data was interesting to look at and using dplyr we came to some interesting finds. First the average placement of each country:

MC_Total %>% group_by(Team)  %>% 
  summarise(Mean = mean(Place))

## Source: local data frame [55 x 2]
## 
##          Team     Mean
##         (chr)    (dbl)
## 1   Australia  9.60000
## 2     Austria 18.00000
## 3     Belarus 27.00000
## 4     Belgium  2.37500
## 5      Brazil 22.90909
## 6      Canada 18.45455
## 7       Chile 18.00000
## 8       China 32.50000
## 9  Costa Rica 25.00000
## 10    Croatia 27.80000
## ..        ...      ...

Finally some analysis on the Pope data:

For the Pope dataset, I created two new Columns using the Mutate function, one to obtain what he called Awareness, and the other as Net Favoablility:

pope_data <- pope_data %>% mutate(Awareness = Favorable + Unfavorable) %>%  mutate(Net_Favor = Favorable - Unfavorable)

pope_data

##                 Pope Date Favorable Unfavorable No.Opinion Awareness
## 1       Pope Francis 2014        76           9         16        85
## 2       Pope Francis 2013        58          10         31        68
## 3  Pope Benedict XVI 2010        40          35         25        75
## 4  Pope Benedict XVI 2008        63          15         22        78
## 5  Pope Benedict XVI 2007        52          16         32        68
## 6  Pope Benedict XVI 2006        50          11         39        61
## 7  Pope Benedict XVI 2005        55          12         33        67
## 8  Pope John Paul II 2004        78          11         11        89
## 9  Pope John Paul II 2003        73          17         10        90
## 10 Pope John Paul II 2002        61          26         13        87
## 11 Pope John Paul II 1998        86           8          6        94
## 12 Pope John Paul II 1993        64          15         21        79
##    Net_Favor
## 1         67
## 2         48
## 3          5
## 4         48
## 5         36
## 6         39
## 7         43
## 8         67
## 9         56
## 10        35
## 11        78
## 12        49

Combined data analysis

This is where I take leaps and bounds past normal, and took it one step further: I combined data… Why? Because I can…

First step was the create a simple data table from the Pope data this was done as followed:

combo_data <- subset(pope_data,select = c(Date, Favorable))

names(combo_data) <- c("Year","Favorable")

# I mutated this into A percentage
combo_data <- combo_data %>% mutate(Favorable = Favorable/100) 

# I next pulled data fromthe World Cup:
WC_Combo <- subset(WC_Brazil, select= c(Year,Position))

# And finally pulled the Motocross Data:
MC_Japan <- MC_Total %>% filter(Team == "Japan")
MC_USA <- MC_Total %>% filter(Team == "USA")
MC_Ireland <- MC_Total %>% filter(Team == "Ireland")
MC_Brazil <- MC_Total %>% filter(Team == "Brazil")

#Removes excess Column
MC_Japan <- MC_Japan[,-1]
MC_USA <- MC_USA[,-1]
MC_Ireland <- MC_Ireland[,-1]
MC_Brazil <- MC_Brazil[,-1]

#Using the Left_join we combined each of these separate datasets into one: 
combo_data <- left_join(combo_data,WC_Combo, by = "Year")
combo_data <- left_join(combo_data,MC_Japan, by = "Year")

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

combo_data <- left_join(combo_data,MC_USA, by = "Year")

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

combo_data <- left_join(combo_data,MC_Ireland, by = "Year")

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

combo_data <- left_join(combo_data,MC_Brazil, by = "Year")

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

names(combo_data) <- c("Year","Favorable","Brazil_WC","Japan_MC","USA_MC","Ireland_MC", "Brazil_MC")

combo_data

##    Year Favorable Brazil_WC Japan_MC USA_MC Ireland_MC Brazil_MC
## 1  2014      0.76         4       22      3         19        27
## 2  2013      0.58        NA       24      2         22        31
## 3  2010      0.40         6       19      1         20        18
## 4  2008      0.63        NA       17      1         21        14
## 5  2007      0.52        NA        7      1         13        16
## 6  2006      0.50         5       12      1         31        NA
## 7  2005      0.55        NA       12      1         22        NA
## 8  2004      0.78        NA       17     NA         14        NA
## 9  2003      0.73        NA        5      2          7        NA
## 10 2002      0.61         1       NA     NA          9        NA
## 11 1998      0.86         2       NA     NA         NA        NA
## 12 1993      0.64        NA       NA     NA         NA        NA

After all that… We successfully found absolutely nothing… First, this data makes several assumptions, like the favoribilty of the Pope in America is representative of the entire world (obviously not). However, for the sake of argument, I did plot the data for Ireland, as it looked promising, and Ireland is known for its connection to the Catholic Church (71% Catholic):

ggplot(combo_data, aes(x=Ireland_MC, y=Favorable))+
  geom_point()+
  geom_smooth(method=lm)

## Warning: Removed 2 rows containing missing values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

This data was actually surprising, There does appear to be a slight trend, that shows that as the people’s favor of the Pope goes down, so does the relative ranking of Ireland.

A similar plot was done using the World Cup data, as Brazil is another Catholic country (64% Catholic):

ggplot(combo_data, aes(x=Brazil_WC, y=Favorable))+
  geom_point()+
  geom_smooth(method=lm)

## Warning: Removed 7 rows containing missing values (stat_smooth).

## Warning: Removed 7 rows containing missing values (geom_point).

Finally, we plotted Japan, Which we know isn’t catholic (according to my research only 0.5% of the popluation is a practicing catholic).

ggplot(combo_data, aes(x=Japan_MC, y=Favorable))+
  geom_point()+
  geom_smooth(method=lm)

## Warning: Removed 3 rows containing missing values (stat_smooth).

## Warning: Removed 3 rows containing missing values (geom_point).

Though this analysis is FAR from accurate, it is interesting to say the least. I am still not familiar with many statistical test, but the fact that I found any sort of trend is just flat out strange. I felt I would find absolutely no change, like we see in the data set from Japan. Again, I can’t say this anything other than a coincidence, but it is a pleasant surprise for me. I do admire Pope Francis, and was generally pleased with his recent appearance here in America, as I think he did speak on many topics, and inspired many by his speeches. Maybe, just maybe, I should try out for the soccer team.

Project 2

Matthew Farris

October 11, 2015