Summary

This dataset contains detailed information on approximately 18,000 football players (17,954 rows and 51 columns) from the SoFIFA.com platform. It covers a wide range of player attributes, including demographic details, performance metrics, market values, and ratings for various skills and abilities.

Link to dataset

https://www.kaggle.com/datasets/maso0dahmed/football-players-data
https://sofifa.com/

Main project goal

The primary goal of this project is to build a regression model to predict a football player’s market value based on various attributes, helping to better understand the factors that drive player performance, market value, and career potential.

Specifically, the project aims to:

Identify key factors that influence a player’s market value and wage in the football industry (e.g., potential, rating, position).

Evaluate the impact of physical traits such as height, stamina, and strength on a player’s overall rating and performance.

Analyze the effect of age on a player’s performance over time, identifying peak performance years and typical periods of decline.

This model will help provide insights into the most significant predictors of player value and help stakeholders make informed decisions.
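As a rough illustration of the regression goal described above, the sketch below fits a linear model of log market value on a few attributes. It runs on a small simulated data frame (`toy_players`), not on the real data set. The log transform, the choice of predictors, and every simulated number are assumptions made for illustration only.

```r
# Toy stand-in for player_data; all values are simulated for illustration only.
set.seed(1)
n <- 200
toy_players <- data.frame(
  overall_rating = round(runif(n, 50, 90)),
  age            = round(runif(n, 17, 38))
)
toy_players$potential <- pmin(95, toy_players$overall_rating + round(runif(n, 0, 10)))

# Simulated market value grows exponentially with rating (an assumption).
toy_players$value_euro <- exp(4 + 0.15 * toy_players$overall_rating +
                                rnorm(n, sd = 0.3))

# Model the log of value, since the simulated values are right-skewed.
fit <- lm(log(value_euro) ~ overall_rating + potential + age, data = toy_players)

# Coefficient table: estimate, standard error, t value, p value.
summary(fit)$coefficients
```

On the real data, the same call could be pointed at `player_data` after rows with missing `value_euro` have been handled.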

Plan moving forward

Data cleaning and data preparation.
Identify any anomalies in the data set.
Exploratory data analysis: correlation analysis, outlier detection, etc.
Visualization of key relationships and aggregation of data.
Interpretation of results.
Develop a dashboard to display these results.
Prepare a final report.
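A first pass at the cleaning and anomaly steps could start by counting missing values per column. The sketch below uses a small invented data frame (`toy_players`) in place of the real data. The column names match the data set, but the values are made up. The plot warnings later in this report mention 255 removed rows, which presumably correspond to missing `value_euro` entries.

```r
# Toy stand-in for player_data; values are invented for illustration only.
toy_players <- data.frame(
  name           = c("A", "B", "C", "D"),
  overall_rating = c(80, 72, 65, 90),
  value_euro     = c(2.0e7, NA, 5.0e5, 9.0e7),
  wage_euro      = c(1.0e5, NA, NA, 3.0e5)
)

# Count missing values per column, largest first.
na_counts <- sort(colSums(is.na(toy_players)), decreasing = TRUE)
na_counts

# Rows that an analysis requiring value_euro would have to drop.
rows_missing_value <- toy_players[is.na(toy_players$value_euro), ]
nrow(rows_missing_value)
```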

Initial findings

Does a higher overall rating lead to a higher market value, or is another factor driving it?
Are players with high dribbling, sprint speed, and stamina more likely to play as attackers?
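The second question could be explored by flagging attackers and comparing attribute means between groups. The sketch below works on invented data. It assumes that `positions` holds comma-separated position codes, that the first code is the primary position, and that the hypothetical set `attacking_codes` defines an attacker. None of these assumptions has been checked against the real file.

```r
# Toy stand-in for player_data; values are invented for illustration only.
toy_players <- data.frame(
  positions    = c("ST,CF", "CB", "RW,ST", "CDM,CM", "LW", "GK"),
  dribbling    = c(85, 40, 88, 70, 86, 15),
  sprint_speed = c(88, 60, 90, 68, 91, 45),
  stamina      = c(75, 70, 80, 85, 78, 30)
)

# Hypothetical set of codes treated as attacking positions.
attacking_codes <- c("ST", "CF", "LW", "RW")

# Take the first listed position as the primary one (an assumption).
primary_position <- sub(",.*$", "", toy_players$positions)
toy_players$is_attacker <- primary_position %in% attacking_codes

# Mean of each attribute for attackers versus everyone else.
group_means <- aggregate(
  cbind(dribbling, sprint_speed, stamina) ~ is_attacker,
  data = toy_players, FUN = mean
)
group_means
```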

Preparing Visualisations

# Set working directory as the path to the data set
setwd("C:/Users/raghu/OneDrive/Documents/Statistics_with_R/Week 2 Data Dive")
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(dplyr)
library(lubridate)
library(corrplot)

## corrplot 0.95 loaded

library(glue)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

# Read the data set
player_data <- read_csv("fifa_players.csv")

## Rows: 17954 Columns: 51
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): name, full_name, birth_date, positions, nationality, preferred_foo...
## dbl (42): age, height_cm, weight_kgs, overall_rating, potential, value_euro,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

spec(player_data)

## cols(
##   name = col_character(),
##   full_name = col_character(),
##   birth_date = col_character(),
##   age = col_double(),
##   height_cm = col_double(),
##   weight_kgs = col_double(),
##   positions = col_character(),
##   nationality = col_character(),
##   overall_rating = col_double(),
##   potential = col_double(),
##   value_euro = col_double(),
##   wage_euro = col_double(),
##   preferred_foot = col_character(),
##   `international_reputation(1-5)` = col_double(),
##   `weak_foot(1-5)` = col_double(),
##   `skill_moves(1-5)` = col_double(),
##   body_type = col_character(),
##   release_clause_euro = col_double(),
##   national_team = col_character(),
##   national_rating = col_double(),
##   national_team_position = col_character(),
##   national_jersey_number = col_double(),
##   crossing = col_double(),
##   finishing = col_double(),
##   heading_accuracy = col_double(),
##   short_passing = col_double(),
##   volleys = col_double(),
##   dribbling = col_double(),
##   curve = col_double(),
##   freekick_accuracy = col_double(),
##   long_passing = col_double(),
##   ball_control = col_double(),
##   acceleration = col_double(),
##   sprint_speed = col_double(),
##   agility = col_double(),
##   reactions = col_double(),
##   balance = col_double(),
##   shot_power = col_double(),
##   jumping = col_double(),
##   stamina = col_double(),
##   strength = col_double(),
##   long_shots = col_double(),
##   aggression = col_double(),
##   interceptions = col_double(),
##   positioning = col_double(),
##   vision = col_double(),
##   penalties = col_double(),
##   composure = col_double(),
##   marking = col_double(),
##   standing_tackle = col_double(),
##   sliding_tackle = col_double()
## )

Hypothesis 1:

Higher overall rating and potential lead to a higher market value.

Visualisation:

# Scatter plot for Overall Rating vs Market Value
ggplot(player_data, aes(x = overall_rating, y = value_euro)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Adds linear trend line
  labs(title = "Overall Rating vs Market Value", x = "Overall Rating", y = "Market Value (Euro)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 255 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 255 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

The scatter plot shows an upward trend, so players with higher overall ratings tend to have higher market values.
The positive slope is not very strong, which suggests that factors other than overall rating also influence market value.
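Hypothesis 1 names both overall rating and potential, while the plot covers only overall rating. One way to put numbers on both is to compute correlation coefficients with market value. The sketch below uses simulated data in which value grows exponentially with rating. That shape is an assumption, chosen to show how a rank-based (Spearman) coefficient can differ from a linear (Pearson) one.

```r
# Toy stand-in for player_data; all values are simulated for illustration only.
set.seed(42)
n <- 300
toy_players <- data.frame(overall_rating = round(runif(n, 50, 90)))
toy_players$potential  <- pmin(95, toy_players$overall_rating + round(runif(n, 0, 10)))
toy_players$value_euro <- exp(4 + 0.15 * toy_players$overall_rating +
                                rnorm(n, sd = 0.3))
toy_players$value_euro[1:5] <- NA  # mimic rows with a missing market value

predictors <- c("overall_rating", "potential")

# Pearson measures linear association; Spearman measures monotonic association.
# Rows with a missing value are dropped via use = "complete.obs".
cor_by_method <- sapply(c("pearson", "spearman"), function(m) {
  sapply(predictors, function(p) {
    cor(toy_players[[p]], toy_players$value_euro,
        use = "complete.obs", method = m)
  })
})
round(cor_by_method, 3)
```

If the Spearman coefficient on the real data is clearly larger than the Pearson one, a straight trend line probably understates the relationship, and a log scale for market value may be worth trying.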

Hypothesis 2:

Physical traits such as height, stamina, and strength positively influence market value.

Visualisation:

# Select numeric columns related to physical traits and market value
numeric_cols <- player_data |> select(value_euro, height_cm, stamina, strength)

# Compute correlation matrix
cor_matrix <- cor(numeric_cols, use = "complete.obs")

# Plot heatmap using corrplot with correlation values in the boxes
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black")  # Adds correlation values in boxes

Interpretation

The heat map shows the correlation coefficients between market value and the physical attributes height, stamina, and strength.
Only stamina shows a weak correlation with market value (value_euro); height and strength show little to no correlation with it.
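To check whether a weak correlation such as the one for stamina is distinguishable from zero, each trait can be tested with `cor.test()` from base R. The sketch below runs on simulated data in which value depends weakly on stamina only. That dependence is built into the toy data as an assumption and says nothing about the real data set.

```r
# Toy stand-in for player_data; all values are simulated for illustration only.
set.seed(7)
n <- 500
toy_players <- data.frame(
  height_cm = rnorm(n, mean = 181, sd = 7),
  stamina   = round(runif(n, 30, 95)),
  strength  = round(runif(n, 30, 95))
)

# Simulated value depends weakly on stamina only (an assumption).
toy_players$value_euro <- exp(13 + 0.01 * toy_players$stamina + rnorm(n, sd = 1))

traits <- c("height_cm", "stamina", "strength")

# One Pearson correlation test per trait: estimate, 95% interval, p value.
trait_tests <- t(sapply(traits, function(trait) {
  res <- cor.test(toy_players[[trait]], toy_players$value_euro)
  c(estimate  = unname(res$estimate),
    conf_low  = res$conf.int[1],
    conf_high = res$conf.int[2],
    p_value   = res$p.value)
}))
round(trait_tests, 3)
```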

Project Meeting 1: Data Discovery

Raghuveer Venkatesh

2024-10-21