National Women’s Soccer League Data

Author

Gabriel Castillo Lopez

Introduction Paragraph:

My data set covers the National Women’s Soccer League (NWSL) and contains many team performance statistics for each season. The harder variables to understand are goal_conversion_pct, pass_pct, tackle_success_pct, possession_pct, and goal_differential. The variable goal_conversion_pct is the percentage of shots that were scored. The variable pass_pct is the team’s pass accuracy over the season. The variable goal_differential is goals scored minus goals conceded over the season. The variable tackle_success_pct is the percentage of clean and fair tackles over the season. The variable possession_pct is the overall share of possession each team had over the season. I plan to see which variable has the greatest effect on goals scored, and which teams scored the most goals while also allowing goals to be scored against them. I plan to use only the 2022 data for my final graph. The dataset’s source is the National Women’s Soccer League (NWSL). R package version 0.0.0.9001.
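Goal differential is conventionally goals scored minus goals conceded, and goal conversion is goals divided by shots. The sketch below shows both calculations on made-up numbers. The data frame `toy` and its `shots` column are assumptions used only for illustration; they are not taken from the NWSL data.

```r
# Toy data with made-up numbers; not taken from the NWSL dataset.
toy <- data.frame(
  team_name      = c("Team A", "Team B"),
  goals          = c(30, 18),
  goals_conceded = c(20, 25),
  shots          = c(250, 200)  # hypothetical column, used only for illustration
)

# Goal differential: goals scored minus goals conceded
toy$goal_differential <- toy$goals - toy$goals_conceded

# Goal conversion: share of shots that became goals, as a percentage
toy$goal_conversion_pct <- 100 * toy$goals / toy$shots

toy
```

A positive goal_differential means the team scored more than it conceded, and a negative value means the opposite.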

Load the dataset into R

library(tidyverse) #loading library tidyverse

Warning: package 'tidyverse' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/casti/OneDrive/Documents/DATA 110")
Female_soccer_stats <- read_csv("Female soccer stats.csv") # Load the dataset into my environment

Rows: 59 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): team_name
dbl (12): season, games_played, goal_differential, goals, goals_conceded, cr...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

No data cleaning needed:

Linear Regression analysis:

Making a geom_point graph before the Linear Regression analysis

P1 <- ggplot(Female_soccer_stats, aes(x = possession_pct, y = goals)) +
  geom_point() + # one point per row of the data
  labs(title = "Goals vs. Possession in the National Women’s Soccer League",
       caption = "Source: National Women’s Soccer League (NWSL). R package version 0.0.0.9001.") +
  xlab("Average possession percentage for each team in the league") +
  theme_minimal(base_size = 12)
P1

Regression line

P2 <- P1 + geom_smooth(method = "lm", formula = y ~ x, color = "blue") + theme_classic() # Adding the fitted regression line to the point graph
P2 # Showing the graph

Actual Correlation

cor(Female_soccer_stats$goals,Female_soccer_stats$possession_pct) # To get the correlation

[1] 0.2877691

Fit the linear model and show the model summary

Fit1 <- lm(goals ~ possession_pct, data = Female_soccer_stats)
summary(Fit1) # To make the model summary


Call:
lm(formula = goals ~ possession_pct, data = Female_soccer_stats)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.275  -7.275  -2.170   7.042  26.778 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -16.1534    20.9386  -0.771   0.4436  
possession_pct   0.9475     0.4177   2.269   0.0271 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.06 on 57 degrees of freedom
Multiple R-squared:  0.08281,   Adjusted R-squared:  0.06672 
F-statistic: 5.146 on 1 and 57 DF,  p-value: 0.0271
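The correlation of 0.2877691 computed earlier and the Multiple R-squared of 0.08281 in this summary describe the same fit: with a single predictor, R-squared is the square of the correlation. A quick check, using the number copied from the cor() output:

```r
r <- 0.2877691    # correlation reported by cor() above
r_squared <- r^2  # for a one-predictor model this equals Multiple R-squared
round(r_squared, 5)
```

The result agrees with the Multiple R-squared in the summary, so possession alone accounts for only about 8% of the variation in goals.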

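The model has the equation:

Goals = 0.9475(possession_pct) - 16.1534

This can be interpreted as follows: each additional percentage point of possession is associated with about 0.95 more goals scored over the season. The R-squared is only 0.083, so possession alone explains about 8% of the variation in goals.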
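To make the equation concrete, here is a minimal sketch that plugs possession values into it. The helper `predict_goals` is a hypothetical name introduced for illustration, and the coefficients are copied by hand from the summary of Fit1.

```r
# Coefficients copied from summary(Fit1)
intercept <- -16.1534
slope     <- 0.9475

# Hypothetical helper: predicted season goals for a given possession percentage
predict_goals <- function(possession_pct) {
  intercept + slope * possession_pct
}

predict_goals(50)                      # a team with 50% possession
predict_goals(51) - predict_goals(50)  # change for one extra percentage point
```

With the fitted model in the session, predict(Fit1, newdata = data.frame(possession_pct = 50)) should give the same prediction without copying the coefficients by hand.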

Multiple regression analysis

Fit2 <- lm(goals ~ possession_pct + cross_accuracy + pass_pct + pass_pct_opposition_half + shot_accuracy + tackle_success_pct, data = Female_soccer_stats) # Making a multiple linear regression model
summary(Fit2) # The summary of the multiple linear regression model


Call:
lm(formula = goals ~ possession_pct + cross_accuracy + pass_pct + 
    pass_pct_opposition_half + shot_accuracy + tackle_success_pct, 
    data = Female_soccer_stats)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.8225  -6.6724   0.2058   4.9169  26.0505 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)   
(Intercept)                0.9469    37.9636   0.025   0.9802   
possession_pct             0.9462     0.5346   1.770   0.0826 . 
cross_accuracy             0.7779     0.4477   1.737   0.0882 . 
pass_pct                  -1.7713     0.8038  -2.204   0.0320 * 
pass_pct_opposition_half   1.0794     0.6103   1.769   0.0828 . 
shot_accuracy              1.0197     0.3165   3.222   0.0022 **
tackle_success_pct        -0.3279     0.1669  -1.965   0.0548 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.8 on 52 degrees of freedom
Multiple R-squared:  0.3594,    Adjusted R-squared:  0.2855 
F-statistic: 4.863 on 6 and 52 DF,  p-value: 0.0005203

library(ggfortify)

Warning: package 'ggfortify' was built under R version 4.3.3

autoplot(Fit2, 1:4, nrow=2, ncol=2) # Making four diagnostic plots

The residuals vs. fitted plot is not very flat, so a linear model may not be fully appropriate.

The Q-Q plot shows that the residuals are roughly normal; however, there are a few outliers, such as observations 13, 48, and 49.

The scale-location plot indicates homogeneous variance.

The Cook’s distance plot shows that the outliers have a high influence on the fit, meaning they may be causing problems for my model. I might have to remove the outliers because they also showed up in all three of the other plots.
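The influential points can also be listed numerically instead of being read off the plot. The sketch below is one way to do that. It uses the built-in mtcars data as a stand-in so that it runs on its own, and the 4 / n cutoff is a common rule of thumb that I am assuming here, not something taken from the analysis above.

```r
# Built-in mtcars data stands in for the soccer data so the sketch runs on its own
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooks <- cooks.distance(fit)  # one Cook's distance per row
lev   <- hatvalues(fit)       # leverage of each row

# Assumed rule of thumb: flag rows whose Cook's distance exceeds 4 / n
cutoff  <- 4 / nrow(mtcars)
flagged <- which(cooks > cutoff)

# Table of the flagged rows with their Cook's distance and leverage
data.frame(row = names(flagged), cooks_d = cooks[flagged], leverage = lev[flagged])
```

Running the same calls on Fit2 would show whether rows 13, 48, and 49 are also the ones with the largest Cook’s distances.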

Removing the three outliers

Female_data2 <- Female_soccer_stats[-c(13, 48, 49), ] # New data frame without rows 13, 48, and 49
Fit3 <- lm(goals ~ possession_pct + cross_accuracy + pass_pct + pass_pct_opposition_half + shot_accuracy + tackle_success_pct, data = Female_data2)
summary(Fit3) # Summary of the same model refit without the 3 points


Call:
lm(formula = goals ~ possession_pct + cross_accuracy + pass_pct + 
    pass_pct_opposition_half + shot_accuracy + tackle_success_pct, 
    data = Female_data2)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4023  -6.0030  -0.8653   5.1170  14.5386 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.1420    32.0104   0.036 0.971685    
possession_pct             0.9776     0.4841   2.020 0.048905 *  
cross_accuracy             0.4512     0.4148   1.088 0.282035    
pass_pct                  -1.6441     0.6794  -2.420 0.019280 *  
pass_pct_opposition_half   0.9302     0.5113   1.819 0.074998 .  
shot_accuracy              1.0932     0.2637   4.146 0.000134 ***
tackle_success_pct        -0.3034     0.1392  -2.179 0.034173 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.309 on 49 degrees of freedom
Multiple R-squared:  0.4242,    Adjusted R-squared:  0.3537 
F-statistic: 6.016 on 6 and 49 DF,  p-value: 9.053e-05

Autoplot the new data without the outliers

autoplot(Fit3, 1:4, nrow=2, ncol=2) # making 4 diagnostic plots

Final Multi-regression Response:

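Refitting the model as Fit3 without the three outliers did not change any of the four diagnostic plots from autoplot dramatically; however, it did increase the adjusted R-squared from 0.2855 to 0.3537. This means that about 65 percent of the variation in goals is still not explained by this model. Shot accuracy and pass accuracy have the lowest p-values among the predictors (0.000134 and 0.0193), so they show the strongest evidence of a relationship with the number of goals each team scored. The shot accuracy result does not surprise me, because the more accurate a team’s shots are, the more goals it scores. The pass_pct coefficient, on the other hand, is negative (-1.6441), so with the other variables held fixed, higher overall pass accuracy is associated with fewer goals. I expected more accurate passing to give a team more chances to move forward toward the goal, so this result may reflect overlap between pass_pct and the other passing and possession variables in the model.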
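The adjusted R-squared values quoted here can be reproduced from the Multiple R-squared, the number of rows, and the number of predictors. The sketch below uses the standard formula; the helper name `adj_r_squared` is hypothetical, and the inputs are copied from the two summaries (59 rows for Fit2, 56 rows for Fit3 after removing three, and 6 predictors in each).

```r
# Hypothetical helper: adjusted R-squared from R-squared, n observations, p predictors
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from summary(Fit2) and summary(Fit3)
adj_fit2 <- adj_r_squared(0.3594, n = 59, p = 6)
adj_fit3 <- adj_r_squared(0.4242, n = 56, p = 6)

round(c(Fit2 = adj_fit2, Fit3 = adj_fit3), 4)
```

Because the adjustment penalizes each extra predictor, the adjusted value is the fairer one to compare between models with different numbers of rows or variables.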

Filtering the data

female_recent <- Female_soccer_stats |> filter(season == 2022) # Filtering the data to only the 2022 season (season is a numeric column)
female_recent <- female_recent |> arrange(desc(goals)) # Order the data from most goals to least

Making a treemap:

Loading in treemap

library(treemap) # Loading the two packages: treemap here, RColorBrewer below

Warning: package 'treemap' was built under R version 4.3.3

library(RColorBrewer)

The final treemap

treemap(female_recent, index = "team_name", vSize = "goals", # one rectangle per team, sized by goals scored
        vColor = "goals_conceded", type = "manual", # rectangles colored by goals conceded
        title = "The National Women’s Soccer League 2022 (NWSL) Goals vs Goals Conceded",
        fontface.labels = c(2, 1), inflate.labels = FALSE, fontsize.labels = c(13, 12),
        border.col = "darkred", border.lwds = c(5, 2), # changing font size and border color/size
        palette = "RdPu") # RdPu (red-purple) Brewer palette, used with type = "manual"

Source: National Women’s Soccer League (NWSL). R package version 0.0.0.9001.

Final Essay:

A. I liked this data set because it follows a passion of mine, which is soccer. It has a lot of stats that were tracked throughout each match. I wanted to know which stat has the biggest impact on the number of goals each team scored. Luckily, the data was already clean, so I did not have to do any cleaning myself.

B. I used a treemap because it shows the teams that scored the most while highlighting the teams that were also scored on heavily. With this treemap I found that some teams were more defensive than offensive, or vice versa. It also highlighted teams that were poor on both offense and defense.

C. I tried to do a facet wrap across all the variables, but it was hard to read. I also had problems with the multiple regression models and tried my best to interpret them with the adjusted R-squared and p-values. I am glad that autoplot exists because it would be very time-consuming to make each diagnostic graph with ggplot. I wish I had found a dataset for MLS (the men’s league) so I could compare whether the same variables had the most effect on scoring goals.