What Drives Winning in Soccer?

A Data Analysis of Team Performance

Main question: Do attacking stats, defensive stats, or possession-based stats better explain team success in soccer?

Story: Soccer teams can win in different ways. Some dominate possession, some create tons of shots, and others defend well and win low-scoring games. This project will use the soccer dataset to evaluate which team-level statistics are most closely connected to winning.

Data

The dataset contains match-level performance statistics for a single soccer team. Each row represents one match played by that team (699 matches between August 2013 and April 2025). The variables cover match outcome, attacking production, defensive performance, possession and passing, and recent form.

names(Soccer)
 [1] "Date"                      "Opponent"                 
 [3] "Is_Home"                   "Result"                   
 [5] "Goals"                     "Opponent_Goals"           
 [7] "Possession"                "Shots"                    
 [9] "Shots_On_Target"           "Passes_Completed"         
[11] "Pass_Accuracy"             "Corners"                  
[13] "Crosses"                   "Fouls"                    
[15] "Offsides"                  "Opponent_Possession"      
[17] "Opponent_Shots"            "Opponent_Shots_On_Target" 
[19] "Opponent_Passes_Completed" "Opponent_Pass_Accuracy"   
[21] "Opponent_Corners"          "Opponent_Crosses"         
[23] "Opponent_Fouls"            "Opponent_Offsides"        
[25] "Shot_Efficiency"           "Season"                   
[27] "Month"                     "Day_of_Week"              
[29] "Last5_Avg_Goals"           "Last5_Win_Rate"           

Data Dictionary

data_dictionary <- tibble(
  Variable = c("Date", "Opponent", "Is_Home", "Result", "Goals", "Opponent_Goals",
               "Possession", "Shots", "Shots_On_Target", "Passes_Completed",
               "Pass_Accuracy", "Corners", "Crosses", "Fouls", "Offsides",
               "Opponent_Possession", "Opponent_Shots", "Opponent_Shots_On_Target",
               "Opponent_Passes_Completed", "Opponent_Pass_Accuracy",
               "Opponent_Corners", "Opponent_Crosses", "Opponent_Fouls",
               "Opponent_Offsides", "Shot_Efficiency", "Season", "Month",
               "Day_of_Week", "Last5_Avg_Goals", "Last5_Win_Rate"),
  Meaning = c("Date of the match", "Team played against", "Whether the match was played at home",
              "Match result: -1 = Loss, 0 = Draw, 1 = Win", "Goals scored by the team", "Goals scored by the opponent",
              "Team possession percentage", "Total shots taken", "Shots placed on target",
              "Completed passes", "Passing accuracy percentage", "Corner kicks earned",
              "Crosses attempted", "Fouls committed", "Offsides committed",
              "Opponent possession percentage", "Opponent total shots",
              "Opponent shots on target", "Opponent completed passes",
              "Opponent passing accuracy percentage", "Opponent corner kicks",
              "Opponent crosses attempted", "Opponent fouls committed",
              "Opponent offsides committed", "Goals per shot or scoring efficiency",
              "Season of the match", "Month of the match", "Day of week of the match",
              "Average goals over the previous five matches",
              "Win rate over the previous five matches")
)

data_dictionary
# A tibble: 30 × 2
   Variable         Meaning                                  
   <chr>            <chr>                                    
 1 Date             Date of the match                        
 2 Opponent         Team played against                      
 3 Is_Home          Whether the match was played at home     
 4 Result           Match result: -1 = Loss, 0 = Draw, 1 = Win
 5 Goals            Goals scored by the team                 
 6 Opponent_Goals   Goals scored by the opponent             
 7 Possession       Team possession percentage               
 8 Shots            Total shots taken                        
 9 Shots_On_Target  Shots placed on target                   
10 Passes_Completed Completed passes                         
# ℹ 20 more rows

Summary Statistics

Soccer %>% summary()
      Date                       Opponent            Is_Home     
 Min.   :2013-08-04 18:20:00   Length:699         Min.   :0.000  
 1st Qu.:2016-05-12 05:30:00   Class :character   1st Qu.:0.000  
 Median :2019-04-01 22:00:00   Mode  :character   Median :1.000  
 Mean   :2019-05-13 13:27:37                      Mean   :0.515  
 3rd Qu.:2022-04-13 05:00:00                      3rd Qu.:1.000  
 Max.   :2025-04-08 22:00:00                      Max.   :1.000  
     Result            Goals       Opponent_Goals    Possession   
 Min.   :-1.0000   Min.   :0.000   Min.   :0.000   Min.   :20.00  
 1st Qu.: 0.0000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:49.00  
 Median : 1.0000   Median :2.000   Median :1.000   Median :58.00  
 Mean   : 0.3462   Mean   :1.938   Mean   :1.073   Mean   :56.56  
 3rd Qu.: 1.0000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:65.00  
 Max.   : 1.0000   Max.   :8.000   Max.   :6.000   Max.   :82.00  
     Shots       Shots_On_Target  Passes_Completed Pass_Accuracy  
 Min.   : 1.00   Min.   : 0.000   Min.   : 94.0    Min.   :49.00  
 1st Qu.:10.00   1st Qu.: 3.000   1st Qu.:388.0    1st Qu.:82.00  
 Median :14.00   Median : 5.000   Median :459.0    Median :85.00  
 Mean   :14.19   Mean   : 5.266   Mean   :458.1    Mean   :84.03  
 3rd Qu.:17.00   3rd Qu.: 7.000   3rd Qu.:536.0    3rd Qu.:88.00  
 Max.   :36.00   Max.   :17.000   Max.   :827.0    Max.   :93.00  
    Corners          Crosses           Fouls          Offsides    
 Min.   : 0.000   Min.   : 0.000   Min.   : 2.00   Min.   : 0.00  
 1st Qu.: 4.000   1st Qu.: 2.000   1st Qu.: 8.00   1st Qu.: 1.00  
 Median : 5.000   Median : 4.000   Median :10.00   Median : 2.00  
 Mean   : 5.827   Mean   : 4.157   Mean   :10.39   Mean   : 2.01  
 3rd Qu.: 7.500   3rd Qu.: 6.000   3rd Qu.:13.00   3rd Qu.: 3.00  
 Max.   :18.000   Max.   :20.000   Max.   :23.00   Max.   :11.00  
 Opponent_Possession Opponent_Shots  Opponent_Shots_On_Target
 Min.   :18.00       Min.   : 1.00   Min.   : 0.000          
 1st Qu.:35.00       1st Qu.: 7.00   1st Qu.: 2.000          
 Median :42.00       Median :10.00   Median : 3.000          
 Mean   :43.44       Mean   :10.88   Mean   : 3.727          
 3rd Qu.:51.00       3rd Qu.:14.00   3rd Qu.: 5.000          
 Max.   :80.00       Max.   :33.00   Max.   :13.000          
 Opponent_Passes_Completed Opponent_Pass_Accuracy Opponent_Corners
 Min.   : 80.0             Min.   :50.00          Min.   : 0.00   
 1st Qu.:240.0             1st Qu.:74.00          1st Qu.: 2.00   
 Median :315.0             Median :79.00          Median : 4.00   
 Mean   :328.8             Mean   :78.28          Mean   : 4.26   
 3rd Qu.:402.0             3rd Qu.:83.00          3rd Qu.: 6.00   
 Max.   :816.0             Max.   :95.00          Max.   :17.00   
 Opponent_Crosses Opponent_Fouls  Opponent_Offsides Shot_Efficiency 
 Min.   : 0.000   Min.   : 0.00   Min.   : 0.00     Min.   :0.0000  
 1st Qu.: 2.000   1st Qu.: 9.00   1st Qu.: 1.00     1st Qu.:0.2727  
 Median : 3.000   Median :11.00   Median : 2.00     Median :0.3684  
 Mean   : 3.486   Mean   :11.14   Mean   : 2.03     Mean   :0.3779  
 3rd Qu.: 5.000   3rd Qu.:13.00   3rd Qu.: 3.00     3rd Qu.:0.4667  
 Max.   :15.000   Max.   :25.00   Max.   :10.00     Max.   :1.0000  
     Season         Month         Day_of_Week    Last5_Avg_Goals
 Min.   :2013   Min.   : 1.000   Min.   :1.000   Min.   :0.000  
 1st Qu.:2016   1st Qu.: 3.000   1st Qu.:3.000   1st Qu.:1.400  
 Median :2019   Median : 7.000   Median :6.000   Median :2.000  
 Mean   :2019   Mean   : 6.768   Mean   :4.953   Mean   :1.936  
 3rd Qu.:2022   3rd Qu.:10.000   3rd Qu.:7.000   3rd Qu.:2.400  
 Max.   :2025   Max.   :12.000   Max.   :7.000   Max.   :4.200  
 Last5_Win_Rate  
 Min.   :0.0000  
 1st Qu.:0.4000  
 Median :0.6000  
 Mean   :0.5777  
 3rd Qu.:0.8000  
 Max.   :1.0000  

Summary Statistics Interpretation

The summary statistics suggest the team generally controlled matches through possession and offensive production. The team averaged nearly 57% possession and almost 2 goals per match while allowing just over 1 goal per game. The team also averaged over 14 shots and 5 shots on target per match, indicating a consistent attacking presence.

The recent form variables also show the team entered many matches in relatively strong form, with an average win rate of approximately 58% across the previous five matches.
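
Because Result is coded -1, 0, and 1, its mean of about 0.35 equals the share of wins minus the share of losses, so on its own it does not say how often the team won, drew, or lost. A minimal sketch of how the outcome shares could be tabulated is shown here. The `toy_matches` data frame is a made-up stand-in for `Soccer`, not the project data.

```r
library(dplyr)

# Hypothetical stand-in for the Soccer data: ten made-up match results
toy_matches <- tibble(Result = c(1, 1, 0, -1, 1, 0, 1, -1, 1, 1))

# Count each outcome and convert the counts to shares of all matches
result_shares <- toy_matches %>%
  count(Result) %>%
  mutate(Share = n / sum(n))

result_shares

# The mean of Result equals the win share minus the loss share
win_share  <- mean(toy_matches$Result == 1)
loss_share <- mean(toy_matches$Result == -1)
all.equal(mean(toy_matches$Result), win_share - loss_share)
```

Swapping `toy_matches` for `Soccer` would give the actual win, draw, and loss shares behind the 0.35 average.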

Cleaning

Soccer_clean <- Soccer %>%
  mutate(
    # Keep only the calendar date from the date-time stamp
    Date = as.Date(Date),
    # Convert every match statistic to numeric in one step;
    # Goals:Opponent_Offsides spans columns 5 through 24 of the dataset
    across(
      c(Goals:Opponent_Offsides, Shot_Efficiency, Last5_Avg_Goals, Last5_Win_Rate),
      as.numeric
    )
  )

Summary Table

Soccer_clean %>%
  summarise(
    Average_Goals = mean(Goals, na.rm = TRUE),
    Average_Shots = mean(Shots, na.rm = TRUE),
    Average_Possession = mean(Possession, na.rm = TRUE),
    Average_Shot_Efficiency = mean(Shot_Efficiency, na.rm = TRUE)
  )
# A tibble: 1 × 4
  Average_Goals Average_Shots Average_Possession Average_Shot_Efficiency
          <dbl>         <dbl>              <dbl>                   <dbl>
1          1.94          14.2               56.6                   0.378
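
The data dictionary describes Shot_Efficiency as goals per shot, but the averages above leave room for doubt: 1.94 goals on 14.2 shots is roughly 0.14 goals per shot, well below the reported mean of 0.378. Shots on target per shot (about 5.27 / 14.2) and goals per shot on target (about 1.94 / 5.27) both come out near 0.37. A ratio of averages is not the same as an average of ratios, so this is only suggestive. One way to check, sketched here on made-up rows rather than the project data, is to compare the column against each candidate definition match by match. The helper name `compare_definitions` and the toy values are assumptions for illustration.

```r
library(dplyr)

# Made-up rows standing in for Soccer_clean; here Shot_Efficiency is
# deliberately built as Shots_On_Target / Shots so the check has a known answer
toy_matches <- tibble(
  Goals           = c(2, 0, 3, 1),
  Shots           = c(10, 8, 15, 12),
  Shots_On_Target = c(4, 2, 6, 3)
) %>%
  mutate(Shot_Efficiency = Shots_On_Target / Shots)

# Mean absolute gap between the stored column and each candidate definition;
# a gap near zero points to the definition that was actually used.
# Note: matches with zero shots on target would need separate handling
# in the last candidate, since dividing by zero gives Inf or NaN.
compare_definitions <- function(data) {
  data %>%
    summarise(
      gap_goals_per_shot      = mean(abs(Shot_Efficiency - Goals / Shots)),
      gap_on_target_per_shot  = mean(abs(Shot_Efficiency - Shots_On_Target / Shots)),
      gap_goals_per_on_target = mean(abs(Shot_Efficiency - Goals / Shots_On_Target))
    )
}

definition_gaps <- compare_definitions(toy_matches)
definition_gaps
```

Running `compare_definitions(Soccer_clean)` would show which definition, if any, the real column follows, and the dictionary entry could then be confirmed or corrected.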

Data Visualization and Interpretation

Relationship Between Goals Scored and Match Result

Soccer_clean %>%
  ggplot(aes(x = Goals, y = Result)) +
  geom_jitter(alpha = 0.5, color = "#C4122E") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Goals Scored Compared to Match Result",
    x = "Goals Scored",
    y = "Match Result"
  ) +
  theme_minimal()

The visualization shows a strong positive relationship between goals scored and match result. Matches in which the team scored more goals were much more likely to end in wins, while lower-scoring matches were more commonly associated with losses or draws. Part of this relationship is built in, since the result is decided by goals scored relative to goals conceded. Most winning performances occurred when the team scored at least two goals, suggesting offensive production plays a major role in overall success.

Possession Compared to Match Result

Soccer_clean %>%
  ggplot(aes(x = Possession, y = Result)) +
  geom_jitter(
    alpha = 0.5,
    color = "#C4122E",
    width = 0.5
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    linewidth = 1
  ) +
  labs(
    title = "Possession Compared to Match Result",
    x = "Possession Percentage",
    y = "Match Result"
  ) +
  theme_minimal()

The visualization shows only a weak positive relationship between possession and match result. Matches in which the team had more possession were slightly more likely to end in positive results, but there is still substantial variation across all possession levels. This suggests that controlling possession alone is not enough to consistently win matches, and that offensive efficiency may be more important than simply maintaining control of the ball.
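
The fitted line shows the direction of the relationship but not its strength. Since Result takes only three ordered values, a rank-based (Spearman) correlation is one reasonable way to put a number on "weak". The sketch below uses simulated matches, not the project data; `toy_matches` and its values are assumptions for illustration.

```r
set.seed(42)

# Simulated stand-in for Soccer_clean: 200 matches in which possession
# is generated independently of the result
toy_matches <- data.frame(
  Possession = round(runif(200, min = 30, max = 75)),
  Result     = sample(c(-1, 0, 1), size = 200, replace = TRUE)
)

# Spearman rank correlation; exact = FALSE because tied ranks
# rule out an exact p-value
possession_test <- cor.test(
  toy_matches$Possession,
  toy_matches$Result,
  method = "spearman",
  exact  = FALSE
)

possession_test$estimate  # rank correlation (rho), between -1 and 1
possession_test$p.value   # p-value for the null of no monotonic association
```

Replacing the two `toy_matches` columns with `Soccer_clean$Possession` and `Soccer_clean$Result` would apply the same test to the real matches.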

Shots on Target Compared to Goals Scored

Soccer_clean %>%
  ggplot(aes(x = Shots_On_Target, y = Goals)) +
  geom_jitter(
    alpha = 0.5,
    color = "#C4122E",
    width = 0.2,
    height = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    linewidth = 1
  ) +
  labs(
    title = "Shots on Target Compared to Goals Scored",
    x = "Shots On Target",
    y = "Goals Scored"
  ) +
  theme_minimal()

The visualization shows a strong positive relationship between shots on target and goals scored. Matches in which the team put more shots on target generally produced more goals. Compared to the possession analysis, this relationship appears much stronger, suggesting that offensive efficiency and shot quality are more important indicators of success than simply controlling possession.

Opponent Shots on Target Compared to Match Result

Soccer_clean %>%
  ggplot(aes(x = Opponent_Shots_On_Target, y = Result)) +
  geom_jitter(
    alpha = 0.5,
    color = "#C4122E",
    width = 0.2,
    height = 0.2
  ) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    linewidth = 1
  ) +
  labs(
    title = "Opponent Shots on Target Compared to Match Result",
    x = "Opponent Shots On Target",
    y = "Match Result"
  ) +
  theme_minimal()

The visualization shows a strong negative relationship between opponent shots on target and match result. As opponents put more shots on target, the team was noticeably less likely to achieve a positive result. This suggests defensive performance and limiting high-quality chances are critical components of winning soccer matches.
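
The plots above can be complemented with a table of averages by outcome, which shows how much each statistic actually shifts between losses, draws, and wins. The following is a minimal sketch on made-up rows: the values in `toy_matches` are invented, and only the column names follow the dataset.

```r
library(dplyr)

# Made-up matches standing in for Soccer_clean
toy_matches <- tibble(
  Result                   = c(-1, -1, 0, 0, 1, 1, 1, 1),
  Possession               = c(55, 61, 58, 50, 57, 62, 49, 60),
  Shots_On_Target          = c(3, 4, 4, 5, 6, 7, 5, 8),
  Opponent_Shots_On_Target = c(6, 5, 4, 4, 2, 3, 3, 2)
)

# Label the outcomes, then average each statistic within each outcome
outcome_means <- toy_matches %>%
  mutate(Outcome = factor(Result, levels = c(-1, 0, 1),
                          labels = c("Loss", "Draw", "Win"))) %>%
  group_by(Outcome) %>%
  summarise(
    Matches = n(),
    across(c(Possession, Shots_On_Target, Opponent_Shots_On_Target), mean),
    .groups = "drop"
  )

outcome_means
```

A statistic whose average barely moves from losses to wins is a weak candidate for explaining results, whatever its plot looks like.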

Correlation Matrix

cor_data <- Soccer_clean %>%
  select(
    Result,
    Goals,
    Opponent_Goals,
    Possession,
    Shots,
    Shots_On_Target,
    Pass_Accuracy,
    Opponent_Shots_On_Target,
    Shot_Efficiency,
    Last5_Win_Rate
  )

cor_matrix <- cor(cor_data, use = "complete.obs")

cor_df <- as.data.frame(as.table(cor_matrix))

ggplot(cor_df, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2)), size = 3) +
  scale_fill_gradient2(
    low = "#C4122E",
    mid = "white",
    high = "#0C2340",
    midpoint = 0
  ) +
  labs(
    title = "Correlation Matrix of Soccer Performance Metrics",
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

The correlation matrix provides a broader summary of the relationships between team performance statistics and match outcomes. Goals scored showed the strongest positive relationship with match results (0.65), while opponent goals showed the strongest negative relationship (-0.64). Shots on target and shot efficiency also demonstrated meaningful positive relationships with success.

Interestingly, possession had only a weak positive correlation with match results (0.09), suggesting that controlling possession alone does not strongly predict winning. Overall, the data suggests offensive efficiency and defensive effectiveness are far more important indicators of success than possession percentage by itself.
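
Pairwise correlations look at one variable at a time, while the main question compares groups of statistics. One rough way to compare groups is to fit a separate linear model for each and compare adjusted R-squared. The sketch below uses simulated data, so its numbers say nothing about the real matches. The grouping of variables is an assumption, Goals and Opponent_Goals are left out because they determine Result directly, and a linear model on a -1/0/1 outcome is a simplification (an ordinal model would be a more careful choice).

```r
set.seed(1)
n <- 300

# Simulated stand-in for Soccer_clean; the relationships are invented
toy_matches <- data.frame(
  Shots_On_Target          = rpois(n, 5),
  Shot_Efficiency          = runif(n, 0.1, 0.7),
  Opponent_Shots_On_Target = rpois(n, 4),
  Opponent_Shots           = rpois(n, 11),
  Possession               = round(runif(n, 30, 75)),
  Pass_Accuracy            = round(runif(n, 70, 92))
)

# Build a -1/0/1 result from a noisy score driven by shots on target
score <- with(toy_matches,
              0.3 * Shots_On_Target - 0.3 * Opponent_Shots_On_Target + rnorm(n))
toy_matches$Result <- as.numeric(cut(score, c(-Inf, -0.5, 0.5, Inf))) - 2

# One linear model per group of predictors
models <- list(
  Attacking  = lm(Result ~ Shots_On_Target + Shot_Efficiency, data = toy_matches),
  Defensive  = lm(Result ~ Opponent_Shots_On_Target + Opponent_Shots, data = toy_matches),
  Possession = lm(Result ~ Possession + Pass_Accuracy, data = toy_matches)
)

# Adjusted R-squared: share of variation in Result each group accounts for
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 3)
```

Fitting the same three models with `data = Soccer_clean` would give one number per group, which speaks to the main question more directly than scanning individual cells of the matrix.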

Secondary Data Source

library(httr)      # GET() and content()
library(jsonlite)  # fromJSON()

url <- "https://api.openligadb.de/getmatchdata/bl1/2024"

api_data <- url %>%
  GET() %>%
  content(as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE)

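# Each match lists several results (for example half-time and final), and the
# first entry is not necessarily the final score. This helper assumes that
# resultTypeID == 2 marks the final result, which is worth confirming against
# the API response. It returns NA when a match has no final result recorded.
final_score <- function(x, column) {
  value <- x[[column]][x$resultTypeID == 2]
  if (length(value) == 0) NA_integer_ else value[1]
}

secondary_soccer <- api_data %>%
  mutate(
    home_goals = sapply(matchResults, final_score, column = "pointsTeam1"),
    away_goals = sapply(matchResults, final_score, column = "pointsTeam2")
  ) %>%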
  transmute(
    match_date = matchDateTime,
    home_team = team1.teamName,
    away_team = team2.teamName,
    home_goals,
    away_goals,
    total_goals = home_goals + away_goals
  )

secondary_soccer

To supplement the primary soccer dataset, Bundesliga match data was collected using the OpenLigaDB API. This secondary dataset contains professional league match information including teams, match dates, and final scores. Using an external API allowed additional analysis while also satisfying the requirement of integrating a secondary data source into the project.
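
In the API response, each match carries a nested list of results, so the final score has to be pulled out of that nested structure. The sketch below mimics it with a hand-built list. The field names `resultTypeID`, `pointsTeam1`, and `pointsTeam2`, and the convention that type 2 marks the final score, are assumptions about the API's format and should be checked against a live response.

```r
# Hand-built stand-in for the matchResults list column (values are made up)
toy_results <- list(
  data.frame(resultTypeID = c(2, 1), pointsTeam1 = c(3, 1), pointsTeam2 = c(1, 0)),
  data.frame(resultTypeID = c(1, 2), pointsTeam1 = c(0, 2), pointsTeam2 = c(0, 2)),
  list()  # a match with no recorded result yet
)

# Return the final-score entry for one match, or NA when none is recorded
final_score <- function(x, column) {
  value <- x[[column]][x$resultTypeID == 2]
  if (length(value) == 0) NA_real_ else value[1]
}

home_goals <- sapply(toy_results, final_score, column = "pointsTeam1")
away_goals <- sapply(toy_results, final_score, column = "pointsTeam2")

toy_scores <- data.frame(home_goals, away_goals,
                         total_goals = home_goals + away_goals)
toy_scores
```

Selecting by result type rather than by position keeps half-time scores out of the totals even when the order of the entries differs from match to match.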

The secondary dataset was used to examine the distribution of goals scored in professional soccer matches. Because only teams, dates, and scores were extracted, the comparison is limited to scoring levels: it shows how many goals a typical professional match contains, not whether possession or shot statistics behave the same way in another league.

secondary_soccer %>%
  ggplot(aes(x = total_goals)) +
  geom_histogram(
    fill = "#0C2340",
    color = "white",
    bins = 10
  ) +
  labs(
    title = "Distribution of Total Goals in Bundesliga Matches",
    x = "Total Goals",
    y = "Matches"
  ) +
  theme_minimal()

The Bundesliga goal distribution shows that most professional soccer matches contain between one and three total goals, while extremely high-scoring matches are relatively uncommon. When goals are this scarce, a single converted or prevented chance can decide a match, which fits the primary dataset's finding that offensive efficiency and shot quality play major roles in determining match outcomes.
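
The share of matches falling in the one-to-three-goal range can be computed directly rather than read off the histogram. A minimal sketch follows, using a made-up vector of totals in place of `secondary_soccer`; the numbers it produces are illustrative only.

```r
library(dplyr)

# Made-up total goals for twelve matches, standing in for secondary_soccer
toy_secondary <- tibble(total_goals = c(0, 1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 2))

# Average scoring level and the share of matches with one to three goals
goal_summary <- toy_secondary %>%
  summarise(
    Matches            = n(),
    Mean_Total_Goals   = mean(total_goals, na.rm = TRUE),
    Median_Total_Goals = median(total_goals, na.rm = TRUE),
    Share_1_to_3_Goals = mean(between(total_goals, 1, 3), na.rm = TRUE)
  )

goal_summary
```

Running the same `summarise()` on `secondary_soccer` would put an exact figure behind the word "most".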

The Bundesliga data cannot test the possession finding directly, since it contains no possession or shot statistics, but it is consistent with the broader picture: successful soccer teams are not simply those that dominate possession, but rather those that create efficient scoring opportunities while limiting quality chances for opponents. Overall, the analysis suggests that offensive production and defensive effectiveness are stronger predictors of success than possession percentage alone.

Conclusion

This project explored which soccer performance metrics are most closely associated with winning match outcomes. Across multiple visualizations and statistical comparisons, offensive production and defensive effectiveness consistently showed the strongest relationships with success.

Goals scored and shots on target demonstrated strong positive relationships with match results, while opponent goals and opponent shots on target showed strong negative relationships. In contrast, possession percentage displayed only a weak relationship with winning, suggesting that controlling possession alone is not enough to consistently produce positive outcomes.

The secondary Bundesliga API data supported these findings by showing that most professional matches are relatively low scoring, increasing the importance of offensive efficiency and defensive organization. Overall, the analysis suggests that teams are most successful when they efficiently convert scoring opportunities while limiting high-quality chances for opponents.