Main question: Do attacking stats, defensive stats, or possession-based stats better explain team success in soccer?
Story: Soccer teams can win in different ways. Some dominate possession, some create tons of shots, and others defend well and win low-scoring games. This project will use the soccer dataset to evaluate which team-level statistics are most closely connected to winning.
Data
The dataset contains match-level performance statistics for a single soccer team. Each row represents one match, and the 699 matches run from August 2013 to April 2025. The variables cover match outcome, attacking production, defensive performance (captured through the opponent's statistics), possession and passing, and recent form.
library(tidyverse)  # provides tibble(), %>% and ggplot2, used throughout

data_dictionary <- tibble(
  Variable = c(
    "Date", "Opponent", "Is_Home", "Result", "Goals", "Opponent_Goals",
    "Possession", "Shots", "Shots_On_Target", "Passes_Completed",
    "Pass_Accuracy", "Corners", "Crosses", "Fouls", "Offsides",
    "Opponent_Possession", "Opponent_Shots", "Opponent_Shots_On_Target",
    "Opponent_Passes_Completed", "Opponent_Pass_Accuracy",
    "Opponent_Corners", "Opponent_Crosses", "Opponent_Fouls",
    "Opponent_Offsides", "Shot_Efficiency", "Season", "Month",
    "Day_of_Week", "Last5_Avg_Goals", "Last5_Win_Rate"
  ),
  Meaning = c(
    "Date of the match", "Team played against",
    "Whether the match was played at home",
    "Match result: -1 = Loss, 0 = Draw, 1 = Win",
    "Goals scored by the team", "Goals scored by the opponent",
    "Team possession percentage", "Total shots taken", "Shots placed on target",
    "Completed passes", "Passing accuracy percentage", "Corner kicks earned",
    "Crosses attempted", "Fouls committed", "Offsides committed",
    "Opponent possession percentage", "Opponent total shots",
    "Opponent shots on target", "Opponent completed passes",
    "Opponent passing accuracy percentage", "Opponent corner kicks",
    "Opponent crosses attempted", "Opponent fouls committed",
    "Opponent offsides committed", "Goals per shot or scoring efficiency",
    "Season of the match", "Month of the match", "Day of week of the match",
    "Average goals over the previous five matches",
    "Win rate over the previous five matches"
  )
)

data_dictionary
# A tibble: 30 × 2
   Variable         Meaning
   <chr>            <chr>
 1 Date             Date of the match
 2 Opponent         Team played against
 3 Is_Home          Whether the match was played at home
 4 Result           Match result: -1 = Loss, 0 = Draw, 1 = Win
 5 Goals            Goals scored by the team
 6 Opponent_Goals   Goals scored by the opponent
 7 Possession       Team possession percentage
 8 Shots            Total shots taken
 9 Shots_On_Target  Shots placed on target
10 Passes_Completed Completed passes
# ℹ 20 more rows
Summary Statistics
Soccer %>% summary()
Date Opponent Is_Home
Min. :2013-08-04 18:20:00 Length:699 Min. :0.000
1st Qu.:2016-05-12 05:30:00 Class :character 1st Qu.:0.000
Median :2019-04-01 22:00:00 Mode :character Median :1.000
Mean :2019-05-13 13:27:37 Mean :0.515
3rd Qu.:2022-04-13 05:00:00 3rd Qu.:1.000
Max. :2025-04-08 22:00:00 Max. :1.000
Result Goals Opponent_Goals Possession
Min. :-1.0000 Min. :0.000 Min. :0.000 Min. :20.00
1st Qu.: 0.0000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.:49.00
Median : 1.0000 Median :2.000 Median :1.000 Median :58.00
Mean : 0.3462 Mean :1.938 Mean :1.073 Mean :56.56
3rd Qu.: 1.0000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:65.00
Max. : 1.0000 Max. :8.000 Max. :6.000 Max. :82.00
Shots Shots_On_Target Passes_Completed Pass_Accuracy
Min. : 1.00 Min. : 0.000 Min. : 94.0 Min. :49.00
1st Qu.:10.00 1st Qu.: 3.000 1st Qu.:388.0 1st Qu.:82.00
Median :14.00 Median : 5.000 Median :459.0 Median :85.00
Mean :14.19 Mean : 5.266 Mean :458.1 Mean :84.03
3rd Qu.:17.00 3rd Qu.: 7.000 3rd Qu.:536.0 3rd Qu.:88.00
Max. :36.00 Max. :17.000 Max. :827.0 Max. :93.00
Corners Crosses Fouls Offsides
Min. : 0.000 Min. : 0.000 Min. : 2.00 Min. : 0.00
1st Qu.: 4.000 1st Qu.: 2.000 1st Qu.: 8.00 1st Qu.: 1.00
Median : 5.000 Median : 4.000 Median :10.00 Median : 2.00
Mean : 5.827 Mean : 4.157 Mean :10.39 Mean : 2.01
3rd Qu.: 7.500 3rd Qu.: 6.000 3rd Qu.:13.00 3rd Qu.: 3.00
Max. :18.000 Max. :20.000 Max. :23.00 Max. :11.00
Opponent_Possession Opponent_Shots Opponent_Shots_On_Target
Min. :18.00 Min. : 1.00 Min. : 0.000
1st Qu.:35.00 1st Qu.: 7.00 1st Qu.: 2.000
Median :42.00 Median :10.00 Median : 3.000
Mean :43.44 Mean :10.88 Mean : 3.727
3rd Qu.:51.00 3rd Qu.:14.00 3rd Qu.: 5.000
Max. :80.00 Max. :33.00 Max. :13.000
Opponent_Passes_Completed Opponent_Pass_Accuracy Opponent_Corners
Min. : 80.0 Min. :50.00 Min. : 0.00
1st Qu.:240.0 1st Qu.:74.00 1st Qu.: 2.00
Median :315.0 Median :79.00 Median : 4.00
Mean :328.8 Mean :78.28 Mean : 4.26
3rd Qu.:402.0 3rd Qu.:83.00 3rd Qu.: 6.00
Max. :816.0 Max. :95.00 Max. :17.00
Opponent_Crosses Opponent_Fouls Opponent_Offsides Shot_Efficiency
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. :0.0000
1st Qu.: 2.000 1st Qu.: 9.00 1st Qu.: 1.00 1st Qu.:0.2727
Median : 3.000 Median :11.00 Median : 2.00 Median :0.3684
Mean : 3.486 Mean :11.14 Mean : 2.03 Mean :0.3779
3rd Qu.: 5.000 3rd Qu.:13.00 3rd Qu.: 3.00 3rd Qu.:0.4667
Max. :15.000 Max. :25.00 Max. :10.00 Max. :1.0000
Season Month Day_of_Week Last5_Avg_Goals
Min. :2013 Min. : 1.000 Min. :1.000 Min. :0.000
1st Qu.:2016 1st Qu.: 3.000 1st Qu.:3.000 1st Qu.:1.400
Median :2019 Median : 7.000 Median :6.000 Median :2.000
Mean :2019 Mean : 6.768 Mean :4.953 Mean :1.936
3rd Qu.:2022 3rd Qu.:10.000 3rd Qu.:7.000 3rd Qu.:2.400
Max. :2025 Max. :12.000 Max. :7.000 Max. :4.200
Last5_Win_Rate
Min. :0.0000
1st Qu.:0.4000
Median :0.6000
Mean :0.5777
3rd Qu.:0.8000
Max. :1.0000
Summary Statistics Interpretation
The summary statistics suggest the team generally controlled matches through possession and offensive production. The team averaged nearly 57% possession and almost 2 goals per match while allowing just over 1 goal per game. The team also averaged over 14 shots and 5 shots on target per match, indicating a consistent attacking presence.
The recent form variables also show the team entered many matches in relatively strong form, with an average win rate of approximately 58% across the previous five matches.
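The averages above can be complemented by the share of wins, draws, and losses. The following is a minimal sketch in base R, run on a small made-up data frame (`toy`) rather than the real `Soccer` data; the helper name `result_shares` is hypothetical.

```r
# Share of losses, draws and wins from a Result column coded -1 / 0 / 1.
# `result_shares` is an illustrative helper, not part of the original analysis.
result_shares <- function(result) {
  labelled <- factor(result, levels = c(-1, 0, 1),
                     labels = c("Loss", "Draw", "Win"))
  prop.table(table(labelled))
}

# Made-up results for ten matches (not taken from the dataset).
toy <- data.frame(Result = c(1, 1, 0, -1, 1, 0, 1, -1, 1, 1))
shares <- result_shares(toy$Result)
print(round(shares, 2))
```

Because Result is coded -1/0/1, its mean (0.3462 in the summary output) equals the win share minus the loss share, so a table like this shows how that average splits into wins, draws, and losses.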
Relationship Between Goals Scored and Match Result
Soccer_clean %>%
  ggplot(aes(x = Goals, y = Result)) +
  geom_jitter(alpha = 0.5, color = "#C4122E") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Goals Scored Compared to Match Result",
    x = "Goals Scored",
    y = "Match Result"
  ) +
  theme_minimal()
The visualization shows a strong positive relationship between goals scored and match result. Matches in which the team scored more goals were much more likely to result in wins, while lower-scoring matches were more commonly associated with losses or draws. Most winning performances occurred when the team scored at least two goals, suggesting offensive production plays a major role in overall success.
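One way to check the "at least two goals" observation numerically is to compute the win rate at each goal count. This is a sketch under stated assumptions: the data frame `toy` is made up, and `win_rate_by` is a hypothetical helper written in base R.

```r
# Win rate (share of matches with Result == 1) for each value of x.
# `win_rate_by` is an illustrative helper, not part of the original analysis.
win_rate_by <- function(x, result) {
  tapply(result == 1, x, mean)
}

# Made-up matches (not taken from the dataset).
toy <- data.frame(
  Goals  = c(0, 0, 1, 1, 1, 2, 2, 3, 3, 4),
  Result = c(-1, 0, -1, 0, 1, 1, 0, 1, 1, 1)
)
rates <- win_rate_by(toy$Goals, toy$Result)
print(round(rates, 2))
```

Applied to the real data, the same call would take `Soccer_clean$Goals` and `Soccer_clean$Result` as arguments. A win rate per goal count is easier to read than a straight line fitted through an outcome that only takes the values -1, 0, and 1.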
Possession Compared to Match Result
Soccer_clean %>%
  ggplot(aes(x = Possession, y = Result)) +
  geom_jitter(alpha = 0.5, color = "#C4122E", width = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1) +
  labs(
    title = "Possession Compared to Match Result",
    x = "Possession Percentage",
    y = "Match Result"
  ) +
  theme_minimal()
The visualization shows only a weak positive relationship between possession and match result. Matches in which the team had more possession were slightly more likely to end in a positive result, but results vary widely at every possession level. This suggests that controlling possession alone is not enough to win consistently, and that offensive efficiency may matter more than simply keeping the ball.
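The weak relationship can also be summarized by grouping matches into possession bands and comparing the average result in each band. The sketch below uses made-up values, and both the helper name `possession_bands` and the band boundaries (45, 55, 65) are assumptions chosen for illustration.

```r
# Average Result and match count per possession band.
# `possession_bands` and the default breaks are illustrative assumptions.
possession_bands <- function(possession, result,
                             breaks = c(0, 45, 55, 65, 100)) {
  band <- cut(possession, breaks = breaks, include.lowest = TRUE)
  data.frame(
    band        = levels(band),
    matches     = as.vector(table(band)),
    mean_result = as.vector(tapply(result, band, mean))
  )
}

# Made-up matches (not taken from the dataset).
toy <- data.frame(
  Possession = c(38, 44, 50, 52, 58, 60, 63, 68, 72, 75),
  Result     = c(1, -1, 0, 1, 1, -1, 1, 0, 1, 1)
)
bands <- possession_bands(toy$Possession, toy$Result)
print(bands)
```

If the average result barely changes from the lowest band to the highest, that matches the nearly flat trend line in the plot.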
Shots on Target Compared to Goals Scored
Soccer_clean %>%
  ggplot(aes(x = Shots_On_Target, y = Goals)) +
  geom_jitter(alpha = 0.5, color = "#C4122E", width = 0.2, height = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1) +
  labs(
    title = "Shots on Target Compared to Goals Scored",
    x = "Shots On Target",
    y = "Goals Scored"
  ) +
  theme_minimal()
The visualization shows a strong positive relationship between shots on target and goals scored. Matches in which the team put more shots on target generally produced more goals. This relationship appears much stronger than the one between possession and match result, although the two plots use different outcome variables (goals versus result), so the comparison is indirect. It suggests that generating shots on target is a better indicator of success than simply controlling possession.
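The trend line drawn by `geom_smooth(method = "lm")` is an ordinary linear regression of goals on shots on target, so its slope can be read directly from `lm()`. A minimal sketch on made-up data follows; the numbers are illustrative and not results from the dataset.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws.
# Made-up matches (not taken from the dataset).
toy <- data.frame(
  Shots_On_Target = c(1, 2, 3, 4, 5, 6, 7, 8),
  Goals           = c(0, 1, 1, 1, 2, 2, 3, 3)
)

fit <- lm(Goals ~ Shots_On_Target, data = toy)
slope <- unname(coef(fit)["Shots_On_Target"])  # extra goals per extra shot on target
r_squared <- summary(fit)$r.squared            # share of variance in goals explained

cat("Slope:", round(slope, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
```

On the real data, replacing `toy` with `Soccer_clean` would give the slope and R-squared behind the plotted line, which puts a number on how strong the relationship is.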
Opponent Shots on Target Compared to Match Result
Soccer_clean %>%
  ggplot(aes(x = Opponent_Shots_On_Target, y = Result)) +
  geom_jitter(alpha = 0.5, color = "#C4122E", width = 0.2, height = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1) +
  labs(
    title = "Opponent Shots on Target Compared to Match Result",
    x = "Opponent Shots On Target",
    y = "Match Result"
  ) +
  theme_minimal()
The visualization shows a strong negative relationship between opponent shots on target and match result. As opponents put more shots on target, the team was less likely to achieve a positive result. This suggests that defensive performance, and in particular limiting the opponent's shots on target, is a critical component of winning soccer matches.
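Because Result is an ordered outcome with only three values, a rank-based (Spearman) correlation is one reasonable way to quantify this negative relationship. The sketch below is an assumption about how that could be done, not the method used in the original analysis; the data are made up and `rank_correlation` is a hypothetical helper.

```r
# Spearman rank correlation between a match statistic and the result.
# `rank_correlation` is an illustrative helper, not part of the original analysis.
rank_correlation <- function(x, result) {
  cor(x, result, method = "spearman", use = "complete.obs")
}

# Made-up matches (not taken from the dataset).
toy <- data.frame(
  Opponent_Shots_On_Target = c(1, 2, 2, 3, 4, 5, 6, 7),
  Result                   = c(1, 1, 1, 0, 1, -1, 0, -1)
)
rho <- rank_correlation(toy$Opponent_Shots_On_Target, toy$Result)
cat("Spearman correlation:", round(rho, 2), "\n")
```

A value close to -1 would indicate that more opponent shots on target go together with worse results almost without exception, while a value near 0 would indicate no consistent pattern.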
Correlation Matrix
The correlation matrix provides a broader summary of the relationships between the performance statistics and match outcomes. Goals scored showed the strongest positive correlation with match result (0.65), while opponent goals showed the strongest negative correlation (-0.64). Because the result is determined by goals scored and conceded, these two correlations are expected to be large. Shots on target and shot efficiency also showed meaningful positive correlations with match result.
Possession had only a weak positive correlation with match result (0.09), suggesting that controlling possession alone does not strongly predict winning. Overall, the data suggest that offensive efficiency and defensive effectiveness are far more strongly associated with success than possession percentage by itself.
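The correlations with match result can be computed with base R's `cor()`. The following is a minimal sketch on a made-up data frame; the helper name `correlations_with_result` is hypothetical, and the printed values are not the figures quoted in the text.

```r
# Pearson correlation of every numeric column with the outcome column,
# sorted from most positive to most negative.
# `correlations_with_result` is an illustrative helper.
correlations_with_result <- function(df, outcome = "Result") {
  numeric_cols <- df[sapply(df, is.numeric)]  # drop text columns such as Opponent
  predictors <- setdiff(names(numeric_cols), outcome)
  r <- sapply(predictors, function(v) {
    cor(numeric_cols[[v]], numeric_cols[[outcome]], use = "complete.obs")
  })
  sort(r, decreasing = TRUE)
}

# Made-up matches (not taken from the dataset).
toy <- data.frame(
  Opponent       = c("A", "B", "C", "D", "E", "F", "G", "H"),
  Goals          = c(0, 1, 1, 2, 2, 3, 3, 4),
  Opponent_Goals = c(2, 1, 2, 2, 0, 1, 0, 1),
  Possession     = c(60, 55, 48, 62, 50, 58, 66, 52)
)
toy$Result <- sign(toy$Goals - toy$Opponent_Goals)  # -1 loss, 0 draw, 1 win

cors <- correlations_with_result(toy)
print(round(cors, 2))
```

In the toy data, Result is built directly from the two goal columns with `sign()`, which illustrates why goals scored and conceded are bound to correlate strongly with the result.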
Secondary Data: Bundesliga Matches
To supplement the primary soccer dataset, Bundesliga match data was collected from the OpenLigaDB API. This secondary dataset contains professional league match information, including teams, match dates, and final scores. Using an external API made additional analysis possible and also satisfied the project requirement to integrate a secondary data source.
The secondary dataset was used to examine the distribution of goals scored in professional soccer matches. This comparison helps evaluate whether the scoring patterns in the primary dataset are also reflected in a larger professional league environment.
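The collection step is not shown in the report, so the following is only a sketch of how OpenLigaDB-style match records could be turned into a data frame with a `total_goals` column. The field names (`matchDateTime`, `team1$teamName`, `matchResults`, `resultTypeID`, `pointsTeam1`, `pointsTeam2`), the convention that `resultTypeID` 2 marks the final score, and the URL in the comment are assumptions about the API response and should be checked against the OpenLigaDB documentation. The helper names are hypothetical and the records below are made up.

```r
# Assumed live fetch (not run here; endpoint and season are assumptions):
# matches <- jsonlite::fromJSON(
#   "https://api.openligadb.de/getmatchdata/bl1/2023", simplifyVector = FALSE)

# Pick the final score from a match record (assumes resultTypeID 2 = final).
final_score <- function(match) {
  for (res in match$matchResults) {
    if (identical(as.integer(res$resultTypeID), 2L)) {
      return(c(res$pointsTeam1, res$pointsTeam2))
    }
  }
  c(NA_real_, NA_real_)  # no final result recorded for this match
}

# Convert a list of match records into one row per match.
matches_to_df <- function(matches) {
  rows <- lapply(matches, function(m) {
    score <- final_score(m)
    data.frame(
      match_date  = m$matchDateTime,
      team1       = m$team1$teamName,
      team2       = m$team2$teamName,
      goals_team1 = score[1],
      goals_team2 = score[2],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out$total_goals <- out$goals_team1 + out$goals_team2
  out
}

# Made-up records that mimic the assumed structure.
toy_matches <- list(
  list(matchDateTime = "2023-08-18T20:30:00",
       team1 = list(teamName = "Team A"), team2 = list(teamName = "Team B"),
       matchResults = list(
         list(resultTypeID = 1, pointsTeam1 = 1, pointsTeam2 = 0),
         list(resultTypeID = 2, pointsTeam1 = 3, pointsTeam2 = 1))),
  list(matchDateTime = "2023-08-19T15:30:00",
       team1 = list(teamName = "Team C"), team2 = list(teamName = "Team D"),
       matchResults = list())
)
toy_secondary <- matches_to_df(toy_matches)
print(toy_secondary)
```

Matches without a final result keep `NA` in `total_goals`, so they can be filtered out before plotting.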
secondary_soccer %>%
  ggplot(aes(x = total_goals)) +
  geom_histogram(fill = "#0C2340", color = "white", bins = 10) +
  labs(
    title = "Distribution of Total Goals in Bundesliga Matches",
    x = "Total Goals",
    y = "Matches"
  ) +
  theme_minimal()
The Bundesliga goal distribution shows that most matches contain between one and three total goals, while very high-scoring matches are relatively uncommon. Because goals are scarce, each chance converted or prevented has a large effect on the result. This is consistent with the primary dataset, where offensive efficiency and shots on target were closely tied to match outcomes.
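The "between one and three goals" reading of the histogram can be checked by counting matches per goal total. A minimal sketch with made-up totals follows; `goal_distribution` is a hypothetical helper, and the resulting share is not a figure from the Bundesliga data.

```r
# Count matches per total-goals value and the share with 1 to 3 goals.
# `goal_distribution` is an illustrative helper.
goal_distribution <- function(total_goals) {
  total_goals <- total_goals[!is.na(total_goals)]  # ignore unfinished matches
  list(
    counts       = table(total_goals),
    share_1_to_3 = mean(total_goals >= 1 & total_goals <= 3)
  )
}

# Made-up totals for ten matches (not taken from the dataset).
toy_totals <- c(0, 1, 2, 2, 3, 3, 3, 4, 5, 7)
dist <- goal_distribution(toy_totals)
print(dist$counts)
cat("Share of matches with 1 to 3 goals:", dist$share_1_to_3, "\n")
```

Since goal totals are whole numbers, `geom_histogram(binwidth = 1)` or `geom_bar()` would draw one bar per goal total, which makes the plot line up with a table like this.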
Taken together, the two datasets support the conclusion that successful soccer teams are not simply those that dominate possession, but those that create efficient scoring opportunities while limiting quality chances for opponents. Since only the goal distribution was examined in the Bundesliga data, this conclusion rests mainly on the primary dataset. Overall, the analysis suggests that offensive production and defensive effectiveness are stronger predictors of success than possession percentage alone.
Conclusion
This project explored which soccer performance metrics are most closely associated with winning match outcomes. Across multiple visualizations and statistical comparisons, offensive production and defensive effectiveness consistently showed the strongest relationships with success.
Goals scored and shots on target demonstrated strong positive relationships with match results, while opponent goals and opponent shots on target showed strong negative relationships. In contrast, possession percentage displayed only a weak relationship with winning, suggesting that controlling possession alone is not enough to consistently produce positive outcomes.
The secondary Bundesliga API data is consistent with these findings: most professional matches are relatively low scoring, which raises the importance of offensive efficiency and defensive organization. Overall, the analysis suggests that a team is most successful when it efficiently converts scoring opportunities while limiting high-quality chances for opponents.