My dataset covers the National Women’s Soccer League (NWSL) and contains a wide range of team performance statistics. The variables that are harder to understand are goal_conversion_pct, pass_pct, tackle_success_pct, possession_pct, and goal_differential. The variable goal_conversion_pct is the percentage of shots that resulted in goals. The variable pass_pct is the team's pass accuracy for the season. The variable goal_differential is goals scored minus goals conceded for the season. The variable tackle_success_pct is the percentage of tackles that were clean and fair during the season. The variable possession_pct is each team's overall share of possession during the season. I plan to find which variable has the greatest effect on the number of goals a team scores, and to see which teams scored the most and also conceded the most. I plan to use only the 2022 data for my final graph. The dataset’s source is the National Women’s Soccer League (NWSL). R package version 0.0.0.9001.
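To make the derived variables concrete, here is a minimal sketch that computes two of them by hand. The two teams and all of their numbers are made up, and the `shots` column name is an assumption for illustration; only `goals`, `goals_conceded`, `goal_differential`, and `goal_conversion_pct` are names from the dataset description.

```r
# Toy numbers for two made-up teams (NOT taken from the NWSL data)
toy <- data.frame(
  team_name      = c("Team A", "Team B"),
  goals          = c(40, 22),
  goals_conceded = c(25, 30),
  shots          = c(320, 275)   # hypothetical column name
)

# Goal differential: goals scored minus goals conceded
toy$goal_differential <- toy$goals - toy$goals_conceded

# Goal conversion: percentage of shots that became goals
toy$goal_conversion_pct <- 100 * toy$goals / toy$shots

print(toy)
```

A positive goal_differential means the team scored more than it conceded; a negative one means the opposite.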
Load the dataset into R
library(tidyverse) #loading library tidyverse
Warning: package 'tidyverse' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/casti/OneDrive/Documents/DATA 110")
Female_soccer_stats <- read_csv("Female soccer stats.csv") # To load the dataset into my environment
Rows: 59 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): team_name
dbl (12): season, games_played, goal_differential, goals, goals_conceded, cr...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
No data cleaning needed:
Linear Regression analysis:
Making a geom_point graph before the Linear Regression analysis
P1 <- ggplot(Female_soccer_stats, aes(x = possession_pct, y = goals)) +
  labs(title = "Goals Vs Possession In National Women Soccer League",
       caption = "Source: National Women’s Soccer League (NWSL). R package version 0.0.0.9001. ") +
  xlab(" Average Possession percentage for each team in the League") +
  theme_minimal(base_size = 12) +
  geom_point()
P1
Correlation Line
P2 <- P1 +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "blue") +
  theme_classic() # Making a presentable point graph
P2 # Showing the graph
Actual Correlation
cor(Female_soccer_stats$goals,Female_soccer_stats$possession_pct) # To get the correlation
[1] 0.2877691
Calculate the correlation coefficient and model summary
Fit1 <- lm(goals ~ possession_pct, data = Female_soccer_stats)
summary(Fit1) # To make the model summary
Call:
lm(formula = goals ~ possession_pct, data = Female_soccer_stats)
Residuals:
Min 1Q Median 3Q Max
-18.275 -7.275 -2.170 7.042 26.778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.1534 20.9386 -0.771 0.4436
possession_pct 0.9475 0.4177 2.269 0.0271 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.06 on 57 degrees of freedom
Multiple R-squared: 0.08281, Adjusted R-squared: 0.06672
F-statistic: 5.146 on 1 and 57 DF, p-value: 0.0271
The Model has the equation :
Goals = 0.9475(possession_pct) - 16.1534
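This can be interpreted as follows: each additional percentage point of possession is associated with about 0.95 more goals scored over the season, on average. With an R-squared of about 0.08, possession alone explains only around 8% of the variation in goals.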
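As a quick worked example of reading the fitted line, the sketch below plugs two possession values into the equation. The coefficients are copied from the model summary; the helper name `predict_goals` and the possession values 50 and 51 are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from summary(Fit1)
intercept <- -16.1534
slope     <- 0.9475

# Hypothetical helper (not in the original analysis): predicted season goals
predict_goals <- function(possession_pct) {
  intercept + slope * possession_pct
}

goals_50 <- predict_goals(50)  # a team averaging 50% possession
goals_51 <- predict_goals(51)  # one percentage point more

print(goals_50)
print(goals_51 - goals_50)     # the difference equals the slope
```

With the real model object, the same kind of prediction should be available through `predict(Fit1, newdata = data.frame(possession_pct = 50))`. Given the low R-squared, such predictions are rough.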
Multiple regression analysis
Fit2 <- lm(goals ~ possession_pct + cross_accuracy + pass_pct + pass_pct_opposition_half + shot_accuracy + tackle_success_pct, data = Female_soccer_stats) # Making a multiple linear regression model
summary(Fit2) # The summary of the multiple linear regression model
library(ggfortify) # Loading ggfortify so autoplot() works on lm objects
Warning: package 'ggfortify' was built under R version 4.3.3
autoplot(Fit2, 1:4, nrow=2, ncol=2) # Making four diagnostic plots
The line in the residuals vs. fitted plot is not very flat, so a purely linear model may not be fully appropriate.
The Q-Q plot shows that the residuals are roughly normal; however, there are a few outliers, namely observations 13, 48, and 49.
The scale-location plot indicates roughly constant (homogeneous) variance.
The Cook's distance plot shows that these outliers have high influence, meaning they are pulling on my model. I might have to try removing them, because they also stood out in all three of the other plots.
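The Cook's distance check can also be done numerically rather than by eye. The sketch below is a minimal illustration on simulated data, not the NWSL data: the seed, the planted extreme point, and the 4/n cutoff (a common rule of thumb, not something the original analysis uses) are all assumptions.

```r
set.seed(110)                      # arbitrary seed, for reproducibility only

# Simulated stand-in data (NOT the NWSL data): 30 points on a noisy line
x <- seq(40, 60, length.out = 30)
y <- 1 * x - 15 + rnorm(30, sd = 3)

# Plant one extreme point so there is something to detect
x[30] <- 75
y[30] <- 5

toy <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = toy)

# Cook's distance per observation; 4/n is a rule-of-thumb cutoff
d <- cooks.distance(fit)
flagged <- which(d > 4 / nrow(toy))
print(flagged)

# Refit without the flagged rows and compare the slopes
fit_clean <- lm(y ~ x, data = toy[-flagged, ])
print(coef(fit)["x"])
print(coef(fit_clean)["x"])
```

Dropping rows by index here mirrors the `[-c(13, 48, 49), ]` pattern: the flagged row numbers are what get removed before refitting.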
Removing the three outliers
Female_data2 <- Female_soccer_stats[-c(13, 48, 49), ] # New data frame without the three outlier rows
Fit3 <- lm(goals ~ possession_pct + cross_accuracy + pass_pct + pass_pct_opposition_half + shot_accuracy + tackle_success_pct, data = Female_data2)
summary(Fit3) # Summary of the model refitted without the 3 points
autoplot(Fit3, 1:4, nrow=2, ncol=2) # making 4 diagnostic plots
Final Multi-regression Response:
Refitting the model on the new data frame (Fit3) did not change any of the four diagnostic plots from autoplot dramatically; however, it did increase the adjusted R-squared from 0.2855 to 0.3537. This means that about 65 percent of the variation in goals is still not explained by this model. Shot accuracy and pass accuracy have the most influence on the number of goals each team scored, judging by their low p-values. I don’t find this surprising, because the more accurate a team's shots on goal, the more goals it scores. On the other hand, the more accurate a team's passing, the more chances it has to move the ball forward toward the goal.
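Adjusted R-squared penalizes R-squared for the number of predictors. Since the multiple-regression summaries are not printed here, the sketch below checks the formula against the values reported for the simple model Fit1 (59 rows, one predictor) and then computes the unexplained share for Fit3. The helper name `adj_r2` is an illustrative assumption.

```r
# Hypothetical helper: adjusted R-squared from R-squared, n rows, p predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

r  <- 0.2877691   # cor(goals, possession_pct) reported earlier
r2 <- r^2         # with one predictor, R-squared is the squared correlation

print(round(r2, 5))                         # agrees with "Multiple R-squared" for Fit1
print(round(adj_r2(r2, n = 59, p = 1), 5))  # agrees with "Adjusted R-squared" for Fit1

# Share of variation left unexplained, from Fit3's reported adjusted R-squared
unexplained <- 1 - 0.3537
print(unexplained)
```

The last value, about 0.65, is where the "65 percent not explained" figure comes from.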
Filtering the data
female_recent <- Female_soccer_stats |>
  filter(season == 2022) # Filtering the data to only the 2022 season
female_recent <- female_recent[order(-female_recent$goals), ] # Order the data from most goals to least
Making a tree graph:
Loading in treemap
library(treemap) #loading in with library of the two packages
Warning: package 'treemap' was built under R version 4.3.3
library(RColorBrewer)
The final Tree Graph
treemap(female_recent,
        index = "team_name",
        vSize = "goals",
        vColor = "goals_conceded",
        type = "manual",
        title = "The National Women’s Soccer League 2022 (NWSL) Goals vs Goals Conceded",
        fontface.labels = c(2, 1),
        inflate.labels = FALSE,
        fontsize.labels = c(13, 12),
        border.col = "darkred",
        border.lwds = c(5, 2), # changing font size and border color/size
        palette = "RdPu") # palette from red to purple with type = "manual"
Source: National Women’s Soccer League (NWSL). R package version 0.0.0.9001.
Final Essay:
A. I liked this dataset because it relates to a passion of mine, soccer. It had a lot of stats that were tracked throughout each match. I wanted to know which stat has the biggest impact on the number of goals each team scores. Luckily, the data was already clean, so I did not have to do any cleaning myself.
B. I used a treemap because it showed the teams that scored the most while highlighting the teams that were also scored on heavily. With this treemap I found that some teams were more defensive than offensive, or vice versa. It also highlighted teams that were poor on both offense and defense.
C. I tried to use a facet wrap across all the variables, but it was hard to read. I also had problems with the multiple regression models and did my best to interpret them using the adjusted R-squared and the p-values. I am glad that autoplot exists, because it would be very time-consuming to make each diagnostic plot with ggplot. I wish I had found a comparable dataset for MLS (the men's league) so I could compare whether the same variables had the most effect on scoring goals.