Dataset Selected

Soccer, or football as it is called in the land of my favorite league, is my favorite sport. For this project, I looked for a dataset focused on the English Premier League (EPL), which, as the name suggests, is the top football division in England. The league was created so that the highest-paid athletes would play at their utmost level and keep their competitive drive ready for international play (Dziak, 2020). I found the dataset on Kaggle. It profiles players in the league for the 2017-2018 season in terms of their market value as well as their worth to fans, measured by fantasy league figures and online popularity (https://www.kaggle.com/mauryashubham/english-premier-league-players-dataset).

Load the Data

I saved the CSV file to a designated folder and loaded it for a first look.

setwd("~/Desktop/MC Data Science /DATA 110 /DataSets ")
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
eng <- read_csv("epldata_final.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   club = col_character(),
##   age = col_double(),
##   position = col_character(),
##   position_cat = col_double(),
##   market_value = col_double(),
##   page_views = col_double(),
##   fpl_value = col_double(),
##   fpl_sel = col_character(),
##   fpl_points = col_double(),
##   region = col_double(),
##   nationality = col_character(),
##   new_foreign = col_double(),
##   age_cat = col_double(),
##   club_id = col_double(),
##   big_club = col_double(),
##   new_signing = col_double()
## )
head(eng)
## # A tibble: 6 x 17
##   name  club    age position position_cat market_value page_views fpl_value
##   <chr> <chr> <dbl> <chr>           <dbl>        <dbl>      <dbl>     <dbl>
## 1 Alex… Arse…    28 LW                  1           65       4329      12  
## 2 Mesu… Arse…    28 AM                  1           50       4395       9.5
## 3 Petr… Arse…    35 GK                  4            7       1529       5.5
## 4 Theo… Arse…    28 RW                  1           20       2393       7.5
## 5 Laur… Arse…    31 CB                  3           22        912       6  
## 6 Hect… Arse…    22 RB                  3           30       1675       6  
## # … with 9 more variables: fpl_sel <chr>, fpl_points <dbl>, region <dbl>,
## #   nationality <chr>, new_foreign <dbl>, age_cat <dbl>, club_id <dbl>,
## #   big_club <dbl>, new_signing <dbl>
tail(eng)
## # A tibble: 6 x 17
##   name  club    age position position_cat market_value page_views fpl_value
##   <chr> <chr> <dbl> <chr>           <dbl>        <dbl>      <dbl>     <dbl>
## 1 Pabl… West…    32 RB                  3          7          698       5  
## 2 Edim… West…    21 CM                  2          5          288       4.5
## 3 Arth… West…    23 LB                  3          7          199       4.5
## 4 Sam … West…    23 RB                  3          4.5        198       4.5
## 5 Ashl… West…    21 CF                  1          1          412       4.5
## 6 Diaf… West…    27 CF                  1         10          214       5.5
## # … with 9 more variables: fpl_sel <chr>, fpl_points <dbl>, region <dbl>,
## #   nationality <chr>, new_foreign <dbl>, age_cat <dbl>, club_id <dbl>,
## #   big_club <dbl>, new_signing <dbl>
glimpse(eng)
## Rows: 461
## Columns: 17
## $ name         <chr> "Alexis Sanchez", "Mesut Ozil", "Petr Cech", "Theo Walco…
## $ club         <chr> "Arsenal", "Arsenal", "Arsenal", "Arsenal", "Arsenal", "…
## $ age          <dbl> 28, 28, 35, 28, 31, 22, 30, 31, 25, 21, 24, 23, 25, 26, …
## $ position     <chr> "LW", "AM", "GK", "RW", "CB", "RB", "CF", "LB", "CB", "L…
## $ position_cat <dbl> 1, 1, 4, 1, 3, 3, 1, 3, 3, 1, 2, 2, 2, 2, 2, 3, 3, 2, 1,…
## $ market_value <dbl> 65, 50, 7, 20, 22, 30, 22, 13, 30, 10, 35, 22, 18, 35, 1…
## $ page_views   <dbl> 4329, 4395, 1529, 2393, 912, 1675, 2230, 555, 1877, 1812…
## $ fpl_value    <dbl> 12.0, 9.5, 5.5, 7.5, 6.0, 6.0, 8.5, 5.5, 5.5, 5.5, 5.5, …
## $ fpl_sel      <chr> "17.10%", "5.60%", "5.90%", "1.50%", "0.70%", "13.70%", …
## $ fpl_points   <dbl> 264, 167, 134, 122, 121, 119, 116, 115, 90, 89, 85, 83, …
## $ region       <dbl> 3, 2, 2, 1, 2, 2, 2, 2, 2, 4, 2, 1, 1, 1, 2, 3, 1, 2, 1,…
## $ nationality  <chr> "Chile", "Germany", "Czech Republic", "England", "France…
## $ new_foreign  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age_cat      <dbl> 4, 4, 6, 4, 4, 2, 4, 4, 3, 1, 2, 2, 3, 3, 3, 3, 3, 5, 3,…
## $ club_id      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ big_club     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ new_signing  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Data Cleaning

The club column used “+” in place of spaces in club names. I removed that character and then created a new column, club_clean, that inserts a space wherever a lowercase letter is followed by an uppercase letter. The creator of this dataset also used numbers to represent categories, so I converted some of those columns to the labels they stand for. The numbers in the region column represent the part of the world the player comes from: 1 = England, 2 = EU, 3 = Americas, and 4 = Rest of World. The numbers in the position_cat column represent the player’s position: 1 = Attacker, 2 = Midfielder, 3 = Defender, and 4 = Goalkeeper. The big_club column uses 0 or 1 to indicate whether the club is one of the top clubs in the league (Premier League, 2020). A select number of teams have been in the league since its formation, and I took those teams to be the big clubs; I relabeled 0 as No and 1 as Yes. These changes make the columns much easier to use in visualizations.

# Formatting club column 
eng$club <- str_remove_all(eng$club, "[+]")
eng$club_clean <- gsub("([a-z])([A-Z])", "\\1 \\2", eng$club)
# Region column
eng$region_new <- eng$region %>% 
  factor(., levels = c(1,2,3,4), labels = c("England", "EU", "Americas", "Rest of World"))
# Position column
eng$position_new <- eng$position_cat %>% 
  factor(., levels = c(4,3,2,1), labels = c("Goalkeeper", "Defender", "Midfield", "Attacker"))
# Big Club 
eng$top_club <- eng$big_club %>% factor(., levels = c(0,1), labels = c("No", "Yes"))
# Remove any rows with NAs that might be present in the dataset
eng <- na.omit(eng)
# Show new columns
eng_nc <- eng %>% select(club_clean, region_new, top_club, position_new)
head(eng_nc)
## # A tibble: 6 x 4
##   club_clean region_new top_club position_new
##   <chr>      <fct>      <fct>    <fct>       
## 1 Arsenal    Americas   Yes      Attacker    
## 2 Arsenal    EU         Yes      Attacker    
## 3 Arsenal    EU         Yes      Goalkeeper  
## 4 Arsenal    England    Yes      Attacker    
## 5 Arsenal    EU         Yes      Defender    
## 6 Arsenal    EU         Yes      Defender
tail(eng_nc)
## # A tibble: 6 x 4
##   club_clean region_new    top_club position_new
##   <chr>      <fct>         <fct>    <fct>       
## 1 West Ham   Americas      No       Defender    
## 2 West Ham   EU            No       Midfield    
## 3 West Ham   Rest of World No       Defender    
## 4 West Ham   England       No       Defender    
## 5 West Ham   England       No       Attacker    
## 6 West Ham   Rest of World No       Attacker
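
The club cleanup above can also be done in a single step. Because the “+” is removed before spaces are re-inserted at lowercase-to-uppercase boundaries, a lowercase word such as “and” stays attached to the word before it (a later output shows “Brightonand Hove”). The following is a minimal sketch, assuming the raw values look like "West+Ham" and "Brighton+and+Hove"; the toy vector is made up for illustration and is not read from the dataset.

```r
library(stringr)

# Hypothetical raw club values in the same style as the dataset
club_raw <- c("Arsenal", "West+Ham", "Brighton+and+Hove")

# Replace every literal "+" with a space in a single pass
club_clean <- str_replace_all(club_raw, fixed("+"), " ")
club_clean
```

Applied to the real data, the same call would be used on the original club column before anything is removed from it.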

Exploring the Data

I prepared some preliminary visualizations to help figure out which questions I want to answer with the data. The first is a box plot of the distribution of ages by player position. The second is a bar graph showing which teams are a part of the top clubs in the league.

library(viridis)
## Loading required package: viridisLite
#Age by player position   
eng_p1 <- ggplot(eng, aes(x=position_new, y=age, fill=position_new)) + 
  geom_boxplot() + scale_fill_viridis(discrete = TRUE, alpha = 0.6) + 
  geom_jitter(color = "green", size=0.4, alpha=0.9) + 
  theme_dark() + 
  theme(legend.position = "none")
eng_p1 + ggtitle("Player Position Age Range") + 
  xlab("Player Position") + ylab("Age")

# Determination of Top Clubs 
eng_p2 <- ggplot(eng, aes(x=club_clean, fill=top_club)) + 
  geom_bar(width =0.5) + theme(axis.text.x=
                 element_text(size=10,
                              angle = 45,
                              hjust = 1, 
                              vjust= 1))
eng_p2 + ggtitle("Determination of the Top Clubs") + 
  xlab("Clubs") + ylab("Count") + labs(fill= "Top Club?")

Conclusion From Boxplot

There is a wide range of ages at each position, but younger players tend to be attackers while older players tend to be goalkeepers. There also appears to be a smaller pool of goalkeepers than of the other positions, as far fewer dots surround that box plot than the others.
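
The reading of the box plot can be checked numerically by counting players and taking the median age at each position. This is a short sketch that uses a small made-up table in place of eng; the column names match the ones created earlier, but the values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for eng with only the columns needed here
toy <- data.frame(
  position_new = c("Goalkeeper", "Goalkeeper", "Defender", "Defender",
                   "Defender", "Attacker", "Attacker", "Attacker"),
  age = c(35, 31, 28, 24, 30, 22, 25, 21)
)

# Number of players and median age at each position, oldest first
pos_summary <- toy %>%
  group_by(position_new) %>%
  summarise(players = n(), median_age = median(age)) %>%
  arrange(desc(median_age))
pos_summary
```

Swapping toy for eng would give the actual counts behind the dots on the plot.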

Conclusion From Bar Graph

Currently, the EPL is composed of 20 teams that compete for the top spots every season, each fighting to keep its status in the league. Those that finish at the bottom are demoted to a lower league in a process called relegation. The EPL was established in 1992, and since then only a select number of teams have remained in the league without interruption (Premier League, 2020). Those teams are Manchester United, Arsenal, Tottenham, Liverpool, Everton and Chelsea. Interestingly enough, Everton is not listed as a top club in this plot; Manchester City appears in its place. After some research, I found that although Everton is a part of the group of teams that have never been out of the league, it did not finish in the top 6 this season; it finished in 8th place (Premier League 2017/18).

Statistical Analysis

I want to see whether there are any correlations between the variables in the dataset.

Correlations

#Age and market value 
eng_cor1 <- eng %>% select(age,market_value)
cor(eng_cor1$age, eng_cor1$market_value)
## [1] -0.1338284
# Page views and market value 
eng_cor2 <- eng %>% select(page_views, market_value)
cor(eng_cor2$page_views, eng_cor2$market_value)
## [1] 0.7395401
#Fantasy premier league (fpl) points and market_value 
eng_cor3 <- eng %>% select(fpl_points,market_value)
cor(eng_cor3$fpl_points, eng_cor3$market_value)
## [1] 0.6150133

I wanted to explore where a player’s worth comes from, so I compared market value to a few other variables using the cor() function. Of the variables I picked, market value is strongly correlated with page views (about 0.74) and with Fantasy Premier League points (about 0.62), and only weakly and negatively correlated with age (about -0.13). Ultimately, if you’re getting a lot of hits in web searches and performing well in the fantasy league, you are pretty expensive.
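
The three pairwise comparisons above can also be produced in one call by passing several numeric columns to cor(), which returns a correlation matrix. This is a minimal sketch on made-up numbers; the toy data frame only borrows the column names from eng, and its values are hypothetical.

```r
# Hypothetical values standing in for four numeric columns of eng
toy <- data.frame(
  age          = c(28, 35, 22, 31, 25, 21),
  market_value = c(65, 7, 30, 22, 35, 10),
  page_views   = c(4329, 1529, 1675, 912, 2230, 412),
  fpl_points   = c(264, 134, 119, 121, 116, 20)
)

# Pairwise Pearson correlations for all columns at once
cor_mat <- round(cor(toy), 2)
cor_mat

# Correlations with market value only, strongest first
sort(cor_mat["market_value", ], decreasing = TRUE)
```

With the real data, the same idea would be eng %>% select(age, market_value, page_views, fpl_points) %>% cor().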

Heatmap

This heatmap shows clubs according to their players’ market values. As a challenge, I wanted to try converting the market value from £m to USD. To do this, I looked up the average exchange rate for 2017 and used 1.1304 (ExchangeRates.org). That figure comes from the euro-to-dollar table listed in the Bibliography, so the usd column is only a rough conversion; the pound averaged roughly 1.29 dollars that year. I then used the new usd column together with the club and player name columns to create the heatmap.

library(RColorBrewer)
# Convert market value from £m to usd. Average exchange rate in 2017 was 1.1304
eng_usd <- eng %>% group_by(club_clean) %>% mutate(usd = 1.1304 *market_value) %>% arrange(usd)
head(eng_usd)
## # A tibble: 6 x 22
## # Groups:   club_clean [6]
##   name  club    age position position_cat market_value page_views fpl_value
##   <chr> <chr> <dbl> <chr>           <dbl>        <dbl>      <dbl>     <dbl>
## 1 Edua… Chel…    34 LW                  1         0.05        467       5  
## 2 Joel… Manc…    21 GK                  4         0.1         395       4  
## 3 Niki… Brig…    32 GK                  4         0.25        103       4  
## 4 Matt… Burn…    35 LM                  2         0.25        255       4.5
## 5 Juli… Crys…    38 GK                  4         0.25        188       4  
## 6 Jonj… Ever…    20 RB                  3         0.25         53       4.5
## # … with 14 more variables: fpl_sel <chr>, fpl_points <dbl>, region <dbl>,
## #   nationality <chr>, new_foreign <dbl>, age_cat <dbl>, club_id <dbl>,
## #   big_club <dbl>, new_signing <dbl>, club_clean <chr>, region_new <fct>,
## #   position_new <fct>, top_club <fct>, usd <dbl>
# Condense dataset 
eng_1 <- eng_usd %>% group_by(club_clean) %>% select(club_clean, usd) 
head(eng_1)
## # A tibble: 6 x 2
## # Groups:   club_clean [6]
##   club_clean           usd
##   <chr>              <dbl>
## 1 Chelsea           0.0565
## 2 Manchester United 0.113 
## 3 Brightonand Hove  0.283 
## 4 Burnley           0.283 
## 5 Crystal Palace    0.283 
## 6 Everton           0.283
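# Use player names as row labels for the heatmap
row.names(eng_1) <- eng_usd$name
## Warning: Setting row names on a tibble is deprecated.
# Numeric code for the first letter of each row name (not used by the heatmap below)
eng_2 <- as.numeric(as.factor(substr(rownames(eng_1), 1,1)))
# data.matrix() converts the club names to integer codes and keeps usd as numeric
eng_3 <- data.matrix(eng_1)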

#Create Heatmap 
cute <- colorRampPalette(brewer.pal(12,"Paired")) (20)
eng_hm <- heatmap(eng_3,Rowv = NA, Colv = NA,col= cute, scale = "column", margins = c(5,5), main="Player Market Value")

The club column shows a lot of variation, which makes sense: different clubs pay players differently. The usd column shows much less variation in color, most likely because I had arranged the data frame by usd. Broadly, this shows that only a small group of players are paid the big bucks. Figuring out the detailed reasons why would take more analysis.
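
One possible way to make club-level differences easier to read would be to feed heatmap() a club-by-position matrix of mean market values instead of one row per player. This is only a sketch under that assumption, not the approach used above; the toy data frame is hypothetical and reuses the club_clean, position_new and market_value column names.

```r
# Hypothetical stand-in for eng: club, position and market value (in £m)
toy <- data.frame(
  club_clean   = c("Arsenal", "Arsenal", "Arsenal",
                   "West Ham", "West Ham", "West Ham"),
  position_new = c("Attacker", "Defender", "Attacker",
                   "Attacker", "Defender", "Defender"),
  market_value = c(65, 22, 20, 10, 7, 5)
)

# Mean market value for each club/position pair, as a numeric matrix
value_mat <- tapply(toy$market_value,
                    list(toy$club_clean, toy$position_new),
                    mean)
value_mat

# Same heatmap call style as above, with no reordering and no rescaling
heatmap(value_mat, Rowv = NA, Colv = NA, scale = "none",
        margins = c(8, 8), main = "Mean Market Value by Club and Position")
```

Each cell then corresponds to one club and one position, so a color difference reflects a difference in value rather than the row order.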

What I Didn’t Get To Work On

I could have gone in many different directions with this dataset, and I definitely only scratched the surface. It would have been beneficial to look more closely at how the fantasy league variables relate to the other factors. Another approach would have been to compare the clubs to each other, perhaps to see whether this data could have predicted how the season ended.
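
A club-to-club comparison could start from club-level totals. The sketch below sums market value and fantasy points per club and ranks the clubs by squad value. The toy table and the squad_value and squad_points names are hypothetical, and comparing the ranking with the final league table would need standings data that this dataset does not contain.

```r
library(dplyr)

# Hypothetical stand-in for eng with three clubs and two players each
toy <- data.frame(
  club_clean   = c("Arsenal", "Arsenal", "Chelsea", "Chelsea",
                   "West Ham", "West Ham"),
  market_value = c(65, 50, 40, 0.05, 10, 7),
  fpl_points   = c(264, 167, 150, 0, 60, 45)
)

# Total market value and total fantasy points per club, richest squad first
club_totals <- toy %>%
  group_by(club_clean) %>%
  summarise(squad_value = sum(market_value),
            squad_points = sum(fpl_points)) %>%
  arrange(desc(squad_value))
club_totals
```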

Bibliography

Dziak, Mark (2020). “Premier League (English Premier League or EPL)”. Salem Press Encyclopedia, 2020. 2p.

“Euro to US Dollar Spot Exchange Rates for 2017” (2020). ExchangeRates.org. https://www.exchangerates.org.uk/EUR-USD-spot-exchange-rates-history-2017.html

“Premier League 2017/18” (n.d.). FootballDatabase. https://footballdatabase.com/league-scores-tables/england-premier-league-2017-18

“Premier League explained” (2020). Premier League. https://www.premierleague.com/premier-league-explained