Welcome to my DATA 110 project analysis. For this assignment, I decided to explore a passion of mine: soccer. I have sourced a dataset of match statistics from the 2019-2020 season of the Spanish ‘La Liga’ Primera Division; the file I import below contains 180 matches.
My main research question for this study is:
“How does offensive efficiency (the relationship between shots on target and goals) vary between matches played at home versus those played away in the Spanish La Liga?”
I chose this dynamic image from the official FC Barcelona website to capture the energy of Spanish soccer and the moment a goal is celebrated—the ultimate measure of offensive success.
In this first stage, I am setting up my R environment by loading the necessary libraries. I will then import the La Liga dataset and perform essential cleaning steps using the tidyverse package.
# --- Phase 1: Loading Libraries and Dataset ---
# Load the required packages for data science and visualization
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
# Import the raw dataset from the CSV file
# We use read_csv to bring the soccer statistics into our environment
soccer_data <- read_csv("spain-la-liga-primera-division-2019-to-2020 (1).csv")
Rows: 180 Columns: 105
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Div, Date, HomeTeam, AwayTeam, FTR, HTR
dbl (98): FTHG, FTAG, HTHG, HTAG, HS, AS, HST, AST, HF, AF, HC, AC, HY, AY,...
time (1): Time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# --- Data Cleaning and Filtering (dplyr) ---
# Following the project guidelines, I am using the dplyr 'select' and 'filter' commands.
# The file has 180 matches, so my observations are already under 800;
# the filter below only drops matches with missing goal or shot values.
clean_data <- soccer_data %>%
  # Keep 3 categorical variables (HomeTeam, AwayTeam, FTR) and 4 quantitative variables
  select(HomeTeam, AwayTeam, FTHG, FTAG, HST, AST, FTR) %>%
  filter(!is.na(FTHG), !is.na(FTAG), !is.na(HST), !is.na(AST))

# Display the first few rows to document the successful cleaning process
head(clean_data)
# A tibble: 6 × 7
HomeTeam AwayTeam FTHG FTAG HST AST FTR
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Ath Bilbao Barcelona 1 0 5 2 H
2 Celta Real Madrid 1 3 4 11 A
3 Valencia Sociedad 1 1 6 3 D
4 Mallorca Eibar 2 1 4 5 H
5 Leganes Osasuna 0 1 2 2 A
6 Villarreal Granada 4 4 7 7 D
Phase 2: Exploratory Data Analysis & Visualization
In this phase, I am exploring the relationship between precision (Shots on Target) and actual scoring (Goals). This helps answer whether teams are more “efficient” when playing in front of their home crowd than when playing away.
# Rebuild the cleaned dataset with the same select and filter steps as in Phase 1
# (this chunk previews the data; it does not compute summary statistics)
clean_data <- soccer_data %>%
  select(HomeTeam, AwayTeam, FTHG, FTAG, HST, AST, FTR) %>%
  filter(!is.na(FTHG), !is.na(FTAG), !is.na(HST), !is.na(AST))

# Display results
head(clean_data)
# A tibble: 6 × 7
HomeTeam AwayTeam FTHG FTAG HST AST FTR
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Ath Bilbao Barcelona 1 0 5 2 H
2 Celta Real Madrid 1 3 4 11 A
3 Valencia Sociedad 1 1 6 3 D
4 Mallorca Eibar 2 1 4 5 H
5 Leganes Osasuna 0 1 2 2 A
6 Villarreal Granada 4 4 7 7 D
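To make "efficiency" concrete, one option is to compute goals per shot on target separately for home and away sides. The sketch below is only an illustration: `sample_matches` and `conversion` are hypothetical names, and the six rows are copied from the `head(clean_data)` output above, so the resulting numbers describe those six matches only and not the season. On the full data, `clean_data` would take the place of `sample_matches`.

```r
library(dplyr)

# Six matches copied from the head(clean_data) output (illustrative sample only)
sample_matches <- tibble::tribble(
  ~HomeTeam,    ~AwayTeam,     ~FTHG, ~FTAG, ~HST, ~AST, ~FTR,
  "Ath Bilbao", "Barcelona",   1,     0,     5,    2,    "H",
  "Celta",      "Real Madrid", 1,     3,     4,    11,   "A",
  "Valencia",   "Sociedad",    1,     1,     6,    3,    "D",
  "Mallorca",   "Eibar",       2,     1,     4,    5,    "H",
  "Leganes",    "Osasuna",     0,     1,     2,    2,    "A",
  "Villarreal", "Granada",     4,     4,     7,    7,    "D"
)

# Five-number summaries and means for the four quantitative variables
summary(select(sample_matches, FTHG, FTAG, HST, AST))

# Conversion rate: total goals divided by total shots on target
conversion <- sample_matches %>%
  summarise(
    home_goals_per_shot_on_target = sum(FTHG) / sum(HST),
    away_goals_per_shot_on_target = sum(FTAG) / sum(AST)
  )

print(conversion)
```

Dividing season totals (rather than averaging per-match ratios) avoids division by zero in matches where a side had no shots on target.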
Phase 3: Multiple Linear Regression Analysis
In this phase, I use a multiple linear regression model to analyze how different factors influence the number of goals scored by the home team. Specifically, I examine the impact of shots on target (HST), opponent shots on target (AST), and goals conceded (FTAG).
This helps determine whether offensive efficiency is significantly affected by both attacking and defensive variables.
# Phase 3: Multiple Linear Regression
# In this step, I build a regression model to predict home goals (FTHG)
# using three quantitative variables:
# - HST: Home shots on target (offensive strength)
# - AST: Away shots on target (defensive pressure)
# - FTAG: Away goals scored (defensive weakness)
model <- lm(FTHG ~ HST + AST + FTAG, data = clean_data)

# Display the results of the regression model
summary(model)
Call:
lm(formula = FTHG ~ HST + AST + FTAG, data = clean_data)
Residuals:
Min 1Q Median 3Q Max
-2.4909 -0.7128 -0.1226 0.6668 2.6359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.15943 0.22054 0.723 0.471
HST 0.29412 0.02947 9.980 <2e-16 ***
AST -0.01076 0.04927 -0.218 0.827
FTAG 0.08902 0.09366 0.951 0.343
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.048 on 176 degrees of freedom
Multiple R-squared: 0.3694, Adjusted R-squared: 0.3587
F-statistic: 34.37 on 3 and 176 DF, p-value: < 2.2e-16
The regression results provide insight into how different variables affect the number of goals scored by the home team.
First, the coefficient for HST (home shots on target) is 0.294 with a very small p-value (p < 0.001), which indicates that it is statistically significant. Holding the other variables constant, each additional shot on target is associated with about 0.29 more home goals. In other words, the more shots on target a team takes, the more goals it is likely to score.
On the other hand, AST (away shots on target) and FTAG (away goals) have much higher p-values (0.827 and 0.343, both greater than 0.05), suggesting that they are not statistically significant predictors in this model. This implies that, once home shots on target are accounted for, the opponent's attacking output has no clear linear association with home team goal scoring in this dataset.
The adjusted R-squared value is approximately 0.36, meaning that about 36% of the variation in home team goals can be explained by the variables included in the model. While this shows a moderate relationship, it also suggests that other factors not included in the model may influence goal scoring.
Overall, the model supports the idea that shots on target play a key role in determining the number of goals scored by the home team. Because it only models home goals, it does not by itself compare home and away efficiency.
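As a worked example of how to read the coefficient table, the sketch below plugs one match into the fitted equation by hand. The coefficients are copied from the `summary(model)` output above; the variable names (`coefs`, `match_stats`, `predicted_home_goals`, `gain_from_two_extra_shots`) are illustrative, and the match values (5 home shots on target, 2 away shots on target, 0 away goals) are those of the first row of the data preview.

```r
# Coefficients copied from the regression summary above
coefs <- c(intercept = 0.15943, HST = 0.29412, AST = -0.01076, FTAG = 0.08902)

# Predictor values for one match (first row of the preview: Ath Bilbao vs Barcelona)
match_stats <- c(HST = 5, AST = 2, FTAG = 0)

# Fitted value = intercept + sum of (coefficient * predictor value)
predicted_home_goals <- coefs[["intercept"]] +
  sum(coefs[names(match_stats)] * match_stats)

# Change in the prediction if the home side had two more shots on target
gain_from_two_extra_shots <- 2 * coefs[["HST"]]

print(round(predicted_home_goals, 2))
print(round(gain_from_two_extra_shots, 2))
```

With the fitted `model` object available, `predict(model, newdata = data.frame(HST = 5, AST = 2, FTAG = 0))` should give the same fitted value up to rounding of the printed coefficients.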
Phase 4: Interactive Visualization
In this phase, I create a simple interactive visualization to better understand the relationship between shots on target and goals. I compare home and away matches to see if teams perform differently depending on where they play.
This helps answer whether offensive efficiency changes when playing at home versus away.
# Phase 4: Interactive Highcharter Visualization
# This chart shows the relationship between shots on target and goals
# for both home and away teams.
# Home: HST vs FTHG
# Away: AST vs FTAG
library(highcharter)

highchart() %>%
  hc_chart(type = "scatter", zoomType = "xy") %>%
  # Title and subtitle
  hc_title(text = "Shots on Target vs Goals (Home vs Away)") %>%
  hc_subtitle(text = "Spanish La Liga 2019-2020") %>%
  # Axis labels
  hc_xAxis(title = list(text = "Shots on Target")) %>%
  hc_yAxis(title = list(text = "Goals Scored")) %>%
  # Home data
  hc_add_series(
    data = clean_data %>% select(x = HST, y = FTHG) %>% list_parse2(),
    name = "Home Matches",
    color = "#1f77b4"
  ) %>%
  # Away data
  hc_add_series(
    data = clean_data %>% select(x = AST, y = FTAG) %>% list_parse2(),
    name = "Away Matches",
    color = "#d62728"
  ) %>%
  # Tooltip for interactivity
  hc_tooltip(pointFormat = "Shots: {point.x}<br>Goals: {point.y}") %>%
  # Caption (data source)
  hc_caption(text = "Source: Football-Data.co.uk")

# Additional Visualization with 3 Colors (ggplot)
library(ggplot2)

ggplot(clean_data, aes(x = HST, y = FTHG, color = FTR)) +
  geom_point(size = 2, alpha = 0.7) +
  # Labels and title
  labs(
    title = "Goals vs Shots on Target by Match Result",
    x = "Shots on Target",
    y = "Goals Scored",
    color = "Match Result",
    caption = "Source: Football-Data.co.uk"
  ) +
  # Non-default theme
  theme_minimal()
The visualizations show a clear relationship between shots on target and goals scored. In general, as the number of shots on target increases, the number of goals also increases.
The Highcharter graph allows us to compare home and away matches directly. Visually, the home points appear somewhat less scattered than the away points, which may indicate that home teams perform slightly more consistently. This suggests that playing at home could have a positive effect on offensive efficiency, although the chart alone does not quantify the difference.
The second graph adds more detail by coloring each match by its full-time result (FTR: H = home win, D = draw, A = away win). It shows that matches in which the home team has more shots on target tend to end in home wins, while fewer shots on target are more often associated with draws or away wins.
Overall, these visualizations confirm that shots on target are an important factor in scoring goals, but they also show that performance can vary depending on the match situation.
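One possible way to put a number on the home-versus-away comparison is to stack the home and away observations into one long table and fit a model with an interaction between shots on target and venue. This is a sketch under assumptions, not part of the analysis above: `sample_matches`, `long_data`, and `venue_model` are hypothetical names, and the six rows are copied from the data preview, which is far too few to draw conclusions. On the full data, `clean_data` would replace `sample_matches`. The home and away rows of one match are also not independent, so the result would be exploratory.

```r
library(dplyr)

# Six matches copied from the data preview (illustrative sample only)
sample_matches <- tibble::tribble(
  ~HomeTeam,    ~AwayTeam,     ~FTHG, ~FTAG, ~HST, ~AST, ~FTR,
  "Ath Bilbao", "Barcelona",   1,     0,     5,    2,    "H",
  "Celta",      "Real Madrid", 1,     3,     4,    11,   "A",
  "Valencia",   "Sociedad",    1,     1,     6,    3,    "D",
  "Mallorca",   "Eibar",       2,     1,     4,    5,    "H",
  "Leganes",    "Osasuna",     0,     1,     2,    2,    "A",
  "Villarreal", "Granada",     4,     4,     7,    7,    "D"
)

# One row per team per match: venue, shots on target, goals
long_data <- bind_rows(
  sample_matches %>% transmute(venue = "Home", shots_on_target = HST, goals = FTHG),
  sample_matches %>% transmute(venue = "Away", shots_on_target = AST, goals = FTAG)
)

# The interaction term lets the goals-per-shot slope differ between venues
venue_model <- lm(goals ~ shots_on_target * venue, data = long_data)

print(coef(venue_model))
```

In this parameterization "Away" is the reference level, so the `shots_on_target:venueHome` coefficient estimates how much steeper (or flatter) the goals-per-shot slope is at home than away.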
Discussion of Results
The visualizations show that there is a positive relationship between shots on target and goals. Teams that take more accurate shots tend to score more goals.
One interesting observation is that home teams appear slightly more consistent than away teams, which may suggest a home advantage. However, there is still variation, meaning that scoring goals depends on more than just shots on target.
If I had more time, I would include additional variables such as possession or player performance to better understand what influences goal scoring.