La Liga Analysis (2014-2020)

Introduction

The data set I chose to analyze covers soccer games in La Liga from 2014 to 2020. La Liga is a well-known soccer league in Spain and the highest level of soccer in the country. It includes several of the best-known soccer teams in the world, many with decorated histories.

Dataset

The data set consists of 2660 observations described by 41 variables. Each observation is a single game, and the variables record the teams that played, the score, and many other descriptive stats such as the number of shots on target and fouls for each team. The visualizations focus on four variables: match excitement, score, home team rating, and home team goals scored. The purpose is to find trends in how these variables have changed over the years covered by the data.

Findings


setwd("c:/Users/pptallon/Desktop/")

library(data.table)
library(dplyr)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plotly)

# Read the match data (expects the CSV in the working directory set above)
df <- fread("LaLiga_all_years.csv")

Match Excitement

The analysis of the La Liga data starts with a visualization of the most common match excitement ratings given each year. Match excitement is a rating given at the end of a match that describes how exciting the match was. It is on a scale from 0 to 10, where 0 is the least exciting match and 10 is the most exciting. The counts for each individual rating in this visualization are fairly small.

In this visualization the most common ratings fall between 3.7 and 6.0, so they cluster around 5, which makes sense because that is the middle of the rating scale. The most common rating across the seven years of data is 5.7, which occurred 68 times, but the other common ratings are close behind, most of them by only a few counts. The plot also shows that these ratings have appeared more often in the past two years than in many of the earlier years. Overall, it is more common for a La Liga game to have an average excitement rating of around 5 than to be very exciting or very dull.
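The "most common ratings" idea can be sketched in base R before looking at the full pipeline. This is a minimal illustration on made-up values, not the report's data: the vector `ratings` and the cutoff of 3 are assumptions chosen to keep the example small (the report keeps the top 10).

```r
# Toy excitement ratings; these values are invented for illustration only.
ratings <- c(5.7, 5.7, 5.7, 4.8, 4.8, 6.0, 3.7, 5.7, 4.8, 6.0, 9.1)

# Tabulate how often each rating occurs, then sort by frequency (descending).
freq <- sort(table(ratings), decreasing = TRUE)

# Keep the 3 most frequent rating values (the report uses 10).
top_n_ratings <- as.numeric(head(names(freq), 3))
print(top_n_ratings)
```

Using `head()` rather than a fixed index range avoids `NA` entries when there are fewer distinct ratings than the requested cutoff.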

df_reasons <- dplyr::count(df, `Match Excitement`)
df_reasons <- df_reasons[order(df_reasons$n, decreasing = TRUE),]

top_reasons <- df_reasons$`Match Excitement`[1:10]

# Count matches per year for the ten most common excitement ratings
new_df <- df %>%
  filter(`Match Excitement` %in% top_reasons) %>%
  dplyr::select(year, `Match Excitement`) %>%
  group_by(year, `Match Excitement`) %>%
  dplyr::summarise(n= length(year), .groups = 'keep') %>%
  data.frame()

agg_tot <- new_df %>%
  dplyr::select(Match.Excitement, n) %>%
  group_by(Match.Excitement) %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()

new_df$year <- as.factor(new_df$year)

ggplot(new_df, aes(x=`Match.Excitement`, y=n, fill=year))+
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE))+
  coord_flip()+
  labs(title = "Top 10 Match Excitement Ratings by Year", x = "Match Excitement Rating", y = "Count of Ratings")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  scale_fill_brewer(palette="Set3", guide = guide_legend(reverse = TRUE))+
  geom_text(data = agg_tot, aes(x= `Match.Excitement`, y= tot, label = tot, fill = NULL), hjust= -0.2, size=4)+
  scale_x_continuous(breaks = c(3.6,3.8,4,4.2,4.4,4.6,4.8,5,5.2,5.4,5.6,5.8,6,6.2))

Most Common Scores

Following the analysis of match excitement, the next visualization covers the most common final scores in La Liga. Unlike the match excitement ratings, the counts in this visualization are larger. Soccer is generally a low-scoring sport, and that shows here: none of the ten most common scores has more than 4 combined goals.

The most common final score is 1-1. By contrast, the least common score in this visualization is 3-0, with the home team winning. The chart shows that close, low-scoring games are the most common outcome. It also shows that home wins are much more common than away wins: there is a clear difference between the number of times a home team wins “1-0” and the number of times an away team wins “0-1”, and likewise for “2-0” and “0-2”.
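The home-versus-away comparison could also be checked directly by splitting each score into its two halves. The sketch below assumes the score is stored as text in "home-away" form with a plain hyphen, which matches how scores are quoted in this report but has not been verified against the file; the `scores` vector is invented.

```r
# Hypothetical scores in "home-away" form; the real column format is assumed.
scores <- c("1-1", "1-0", "0-1", "2-0", "1-0", "0-2", "2-1", "1-1")

# Split each score on the hyphen and convert both halves to integers.
parts <- strsplit(scores, "-", fixed = TRUE)
home_goals <- as.integer(sapply(parts, `[`, 1))
away_goals <- as.integer(sapply(parts, `[`, 2))

# Classify each match from the home team's point of view.
outcome <- ifelse(home_goals > away_goals, "home win",
                  ifelse(home_goals < away_goals, "away win", "draw"))

# Count matches per outcome.
outcome_counts <- table(outcome)
print(outcome_counts)
```

If the real column uses a different separator, only the `strsplit()` call would need to change.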

df_reasons2 <- count(df, `Score`)
df_reasons2 <- df_reasons2[order(df_reasons2$n, decreasing = TRUE),]
top_reasons2 <- df_reasons2$Score[1:10]

scores_df <- df%>%
  filter(Score %in% top_reasons2) %>%
  dplyr::select(year, Score) %>%
  group_by(year, Score) %>%
  dplyr::summarise(n= length(year), .groups = 'keep') %>%
  data.frame()

agg_tot2 <- scores_df %>%
  dplyr::select(Score, n) %>%
  group_by(Score) %>%
  dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
  data.frame()

scores_df$year <- as.factor(scores_df$year)

ggplot(scores_df, aes(x= `Score`, y=n, group=year))+
  geom_line(aes(color=year))+
  labs(title =  "Top 10 Most Common Final Scores by Year", x= "Score", y= "Count of Score")+
  theme_light()+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_point(shape= 21, size=2, color="black", fill="white")+
  scale_color_brewer(palette = "Set1", name= "Year", guide= guide_legend(reverse = TRUE))

Home Team Rating

Given the success of home teams seen in the last visualization, an interesting topic to look into is the 20 highest ratings that home teams have received. A team rating is given to a team based on its performance in the game. The best possible rating is 10 and the worst is 0. The ratings in this visualization range from 7.7 to 10.

The results show how uncommon it is for a team to receive any of these high ratings, especially a rating over 9.0. Some of the years follow similar patterns. In the first four seasons (2014-2017), the data is quite similar: each year has a few teams that received over 9.0, and the counts for the ratings from 7.7 to 8.9 are similar as well. In the most recent years (2018, 2019, and 2020), fewer teams received a rating in the top 20, and there are very few ratings of 9.0 or higher.

One possible explanation for why these top 20 ratings are less common in the past 3 years is that the league is getting stronger from top to bottom. As shown in the match excitement visualization, games are trending toward the average, which could mean that the teams playing each other are relatively evenly matched, leading to average, low-scoring games.

home_rating <- count(df, `Home Team Rating`)
home_rating <- home_rating[order(home_rating$`Home Team Rating`, decreasing = TRUE), ]

top_home <- home_rating$`Home Team Rating`[1:20]

home_df <- df %>%
  filter(`Home Team Rating` %in% top_home) %>%
  dplyr::select(`year`, `Home Team Rating`) %>%
  group_by(`year`, `Home Team Rating`)%>%
  dplyr::summarise(n=length(year), .groups = "keep") %>%
  ungroup()  # data_frame() is deprecated; ungroup() keeps the column names intact

home_df$year <- factor(home_df$year)

# Order the year factor from most recent to oldest
first_year <- min(as.numeric(levels(home_df$year)))
last_year <- max(as.numeric(levels(home_df$year)))
home_df$year <- factor(home_df$year, levels = seq(last_year, first_year, by = -1))

ggplot(home_df, aes(x= `Home Team Rating`, y= n, fill= year )) +
  geom_bar(stat ="identity", position ="dodge") +
  theme_light()+
  theme(plot.title = element_text(hjust = .5)) +
  scale_x_continuous(breaks = c(7.6,8,8.4,8.8,9.2,9.6,10)) +
  labs(title = "Top 20 Home Team Ratings by Year",
      x= "Home Team Rating",
      y= "Rating Count",
      fill= "Year") +
  scale_fill_brewer(palette = "Spectral") +
  facet_wrap(~year, ncol = 4, nrow = 2)

Number of Occurrences Home Teams Scored Between 3 and 10 Goals Each Year

With the decline in high home team ratings in recent years shown in the previous visualization, this visualization looks at how often home teams scored a large number of goals, to see whether the two are related. It shows how many times a home team scored between 3 and 10 goals each year. Scoring 3 or more goals in a soccer game is usually a very good result for a team, which would likely earn it a high rating.

The data shows exactly what was expected. The past 3 years account for the 3 lowest counts of a home team scoring between 3 and 10 goals in a single game. This result was expected because of the decrease in top 20 home team ratings over the same 3 years.
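One way to put a number on the relationship between high-scoring home games and high home ratings is to count both per year and correlate the yearly counts. The sketch below uses an invented `matches` data frame with hypothetical column names (`home_goals`, `home_rating`); the 7.7 threshold is borrowed from the lower end of the top 20 ratings described earlier and is an assumption, not a definition from the data set.

```r
# Toy match-level data; all values are invented for illustration.
matches <- data.frame(
  year        = c(2014, 2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
  home_goals  = c(4, 3, 1, 3, 0, 2, 1, 1, 2),
  home_rating = c(9.1, 8.2, 6.5, 8.0, 6.0, 6.6, 6.4, 6.5, 7.8)
)

# Flag matches with 3 to 10 home goals and matches rated 7.7 or higher.
matches$high_scoring <- matches$home_goals >= 3 & matches$home_goals <= 10
matches$high_rating  <- matches$home_rating >= 7.7

# Count flagged matches per year (summing logicals counts the TRUE values).
per_year <- aggregate(cbind(high_scoring, high_rating) ~ year,
                      data = matches, FUN = sum)

# Pearson correlation between the two yearly counts.
r <- cor(per_year$high_scoring, per_year$high_rating)
print(per_year)
print(r)
```

With only seven seasons in the real data, a correlation like this would be a rough indication at best, so it is better read alongside the charts than as a standalone result.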

# Keep games where the home team scored between 3 and 10 goals,
# filtering on the goal values directly rather than on their frequency rank
goals_df <- df %>%
  filter(`Home Team Goals Scored` >= 3, `Home Team Goals Scored` <= 10) %>%
  dplyr::select(year, `Home Team Goals Scored`) %>%
  group_by(year) %>%
  dplyr::summarise(n = length(year), .groups = "keep") %>%
  ungroup()

plot_ly(goals_df, labels = ~year, values = ~n) %>%
  add_pie(hole = 0.65) %>%
  layout(title = "Number of Occurrences Home Teams Scored Between 3 and 10 Goals Each Year",
         annotations = list(text = paste0("Occurrence Count: ", sum(goals_df$n)), showarrow = FALSE))

Most Common Home Team Rating by Year

Finally, the fifth visualization continues the home team theme. It looks at the 5 most common home team ratings, broken down by year. The data shows that the most common ratings for home teams are well above the scale midpoint of 5. This is no surprise, since it was shown earlier that home teams win more often than away teams, and winning a game will most likely give a team a higher rating than the losing team.

This data is very consistent: each of the five ratings makes up around 20% of the total in most years. There are outliers, as in any data, but in many of the years the split is very even. The range of the most common ratings is also very small. The 5 ratings range from 6.0 to 6.6, with 6.5 occurring most frequently. With a large percentage of the ratings falling within this narrow range, the nearby ratings that are not in this visualization are likely to occur frequently as well. Given this data, the visualization supports a strong estimate that a home team's rating will fall in or near this range. It does not show much of a trend, but it does give a prediction for the home team rating variable.
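The "around 20% each" reading can be made explicit by converting the yearly counts into within-year percentages. The following is a small sketch with invented counts; the data frame `counts` and its column names are hypothetical stand-ins for the per-year, per-rating totals.

```r
# Toy counts of five common ratings per year (invented numbers).
counts <- data.frame(
  year   = rep(c(2014, 2015), each = 5),
  rating = rep(c(6.0, 6.3, 6.4, 6.5, 6.6), times = 2),
  n      = c(20, 18, 22, 25, 15, 19, 21, 20, 24, 16)
)

# Convert counts to within-year percentages, so each year sums to 100.
counts$share <- ave(counts$n, counts$year,
                    FUN = function(x) 100 * x / sum(x))
print(counts)
```

Shares computed this way are relative to the five plotted ratings only, not to all games in the season, which is also how the pie charts present them.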

home_average <- count(df, `Home Team Rating`)
home_average <- home_average[order(home_average$n, decreasing = TRUE),]
top_average <- home_average$`Home Team Rating`[1:5]

average_df <- df %>%
  filter(`Home Team Rating` %in% top_average) %>%
  dplyr::select(`year`, `Home Team Rating`) %>%
  group_by(`year`, `Home Team Rating`) %>%
  dplyr::summarise(n=length(year), .groups = "keep") %>%
  ungroup()  # data_frame() is deprecated; ungroup() keeps the column names intact

plot_ly(textposition= "inside", labels = ~`Home Team Rating`) %>%
  add_pie(data = average_df[average_df$year ==2014,],  values= ~n, name= "2014",
          title = '2014', domain=list(row=0, column=0)) %>%
  add_pie(data = average_df[average_df$year ==2015,],  values= ~n, name= "2015",
          title = '2015', domain=list(row=0, column=1)) %>%
  add_pie(data = average_df[average_df$year ==2016,],  values= ~n, name= "2016",
          title = '2016', domain=list(row=0, column=2)) %>%
  add_pie(data = average_df[average_df$year ==2017,],  values= ~n, name= "2017",
          title = '2017', domain=list(row=1, column=0)) %>%
  add_pie(data = average_df[average_df$year ==2018,],  values= ~n, name= "2018",
          title = '2018', domain=list(row=1, column=1)) %>%
  add_pie(data = average_df[average_df$year ==2019,],  values= ~n, name= "2019",
          title = '2019', domain=list(row=1, column=2)) %>%
  add_pie(data = average_df[average_df$year ==2020,],  values= ~n, name= "2020",
          title = '2020', domain=list(row=2, column=0)) %>%
  layout(title = "Top 5 Most Common Home Team Ratings by Year", showlegend = TRUE, grid = list(rows = 3, columns = 3))