Northeastern University
ALY6015 71821 Intermediate Analytics SEC 09 Fall 2023 CPS [BOS-B-HY]
Zeeshan Ahmad Ansari
Prof. Vladimir Shapiro
Date of submission: 09 April 2024
Module 5 Final Project - Final Report
Introduction
In this project, we’re looking at important questions about
basketball using statistics. We want to understand what makes a player
efficient, how playing at home or away affects a team’s chances of
winning, whether three-point shooting is more accurate at home or away,
and whether making steals helps teams win. The four questions are:
1. Player Efficiency: The primary question here
is: What combination of performance metrics such as points scored,
assists made, and rebounds collected, most significantly predicts a
player’s overall efficiency on the court? This inquiry is not just of
academic interest but has profound implications for player assessment,
training focus and game strategy formulation.
2. Home vs. Away Games: Another pivotal
question addresses the concept of home court advantage. We
seek to predict the likelihood of a team’s victory based on the
dichotomy of playing at home versus playing away. This analysis is
enriched by incorporating various performance metrics like points
scored, fouls committed etc., to understand how these factors interplay
with the venue of the game.
3. Three-Point Shooting: We also explore
whether there is a significant difference in three-point shooting
accuracy between home games and away games. This analysis could provide
insights into how environmental or psychological factors impact player
performance.
4. Steals and Winning: Lastly, we investigate
whether teams that execute more steals have a higher probability of
winning games. This question probes into the strategic value of
defensive maneuvers like steals in the overall game outcome.
The answers to these questions could help coaches, players and fans
understand the game better.
In this basketball analysis project, we’ve employed several
predictive analytics and statistical techniques to address our key
questions. Here’s a breakdown of the methods used and why they were
appropriate:
1. Linear Regression for Player Efficiency:
Model Reason: We chose linear regression to understand how different
performance metrics like points, assists, and rebounds predict a
player’s efficiency. This method shows how these factors add up to
affect overall performance.
Data Involved: We used player statistics such as total points, number
of assists, and rebound counts.
2. Logistic Regression for Home vs. Away Game Analysis:
Model Reason: Logistic regression is well suited to questions with
yes/no answers, like whether a team is more likely to win at home or
away. This method helps us predict the probability of winning based on
the game location and other factors.
Data Involved: We analyzed game records, including the location of the
game (home or away) and performance metrics like points scored and
fouls committed.
3. Two-sample t-test for Three-Point Shooting:
Model Reason: The two-sample t-test compares the means of two groups.
We used it to see whether three-point shooting accuracy differs
significantly between home and away games.
Data Involved: For three-point shooting, we compared shooting
percentages in home and away games.
4. Two-sample t-test for Steals Analysis:
Model Reason: The two-sample t-test is suitable for exploring the
impact of defense on game outcomes, specifically whether a higher
number of steals is associated with an increased chance of winning.
This statistical test compares the average number of steals between two
distinct groups.
Data Involved: We investigated historical WNBA data, dividing it into
games won and games lost. We then compared the average steals in each
group to understand their relationship to winning games.
These methods were chosen because they align well with the nature of
our data and the specific questions we’re trying to answer. By applying
these techniques, we can gain meaningful insights into player and team
performance in basketball.
library(ggplot2)
library(dplyr)
library(tidyr)
library(caret)
library(readr)
library(corrplot)
wnba_data <- read.csv("D:/Quater_2/Second Part/ALY6015/FInal Group/WNBA-2014-player-stats-by-game.csv")
# variable names changed to uppercase
colnames(wnba_data) <- toupper(colnames(wnba_data))
# Replacing periods with underscores (existing underscores are left unchanged)
colnames(wnba_data) <- gsub("[._]", "_", colnames(wnba_data))
# Checking the updated column names
names(wnba_data)
## [1] "PLAYER" "PLAYER_ID" "TEAM" "DATE" "HOME"
## [6] "OPPONENT" "WIN" "TEAM_PTS" "OPP_PTS" "MINUTES"
## [11] "FGMADE" "FGATT" "MADE3" "ATT3" "MADE1"
## [16] "ATT1" "OFFRB" "DEFRB" "TOTRB" "ASSIST"
## [21] "STEAL" "BLOCK" "TURNOVER" "FOULS" "POINTS"
## [26] "EFFICIENCY"
# Making an empty dataframe to store missing value counts
missing_counts_wnba_data <- data.frame(Column = character(0), Blank_Count = integer(0))
# Looping through each column of the dataframe
for (col in names(wnba_data)) {
  # A value counts as missing if it is NA or an empty string
  missing_values <- sum(is.na(wnba_data[[col]]) | wnba_data[[col]] == "")
  missing_counts_wnba_data <- rbind(missing_counts_wnba_data, data.frame(Column = col, Blank_Count = missing_values))
}
# Calculating the total number of rows in the dataframe
total_rows <- nrow(wnba_data)
# Calculating the percentage of missing values for each column
missing_counts_wnba_data$Percent_Missing <- (missing_counts_wnba_data$Blank_Count / total_rows) * 100
missing_counts_wnba_data
## Column Blank_Count Percent_Missing
## 1 PLAYER 0 0
## 2 PLAYER_ID 0 0
## 3 TEAM 0 0
## 4 DATE 0 0
## 5 HOME 0 0
## 6 OPPONENT 0 0
## 7 WIN 0 0
## 8 TEAM_PTS 0 0
## 9 OPP_PTS 0 0
## 10 MINUTES 0 0
## 11 FGMADE 0 0
## 12 FGATT 0 0
## 13 MADE3 0 0
## 14 ATT3 0 0
## 15 MADE1 0 0
## 16 ATT1 0 0
## 17 OFFRB 0 0
## 18 DEFRB 0 0
## 19 TOTRB 0 0
## 20 ASSIST 0 0
## 21 STEAL 0 0
## 22 BLOCK 0 0
## 23 TURNOVER 0 0
## 24 FOULS 0 0
## 25 POINTS 0 0
## 26 EFFICIENCY 0 0
During the EDA phase, we create several histograms and box plots and
check correlations to understand the data better.
1. Distribution of Minutes and Points: The Minutes histogram (Fig. 1)
is slightly right-skewed and the Points histogram (Fig. 2) is
right-skewed, suggesting that a majority of the observations are
clustered towards the lower end of the scale. This indicates that
high-scoring individual performances are relatively rare.
# Histogram for 'minutes'
ggplot(wnba_data, aes(x = MINUTES)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
labs(x = "Minutes", y = "Frequency") +
theme_minimal()
Fig.1 Distribution of minutes
# Histogram for 'points'
ggplot(wnba_data, aes(x = POINTS)) +
geom_histogram(binwidth = 5, fill = "#69b3a2", color = "#404040", alpha = 0.7) +
labs(x = "Points", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 14, face = "bold")
)
Fig.2 Distribution of points
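The right skew seen in Fig. 1 and Fig. 2 can also be expressed as a number. The following is a minimal sketch rather than part of the original analysis: `sample_skewness` is a hypothetical helper implementing the moment-based skewness coefficient, and the toy vector only stands in for a column such as `wnba_data$POINTS`.

```r
# Hypothetical helper (not from the original report): moment-based skewness.
# Positive values indicate a longer right tail, negative values a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy values standing in for a right-skewed column such as POINTS (assumed data)
toy_points <- c(0, 0, 2, 2, 3, 4, 5, 6, 8, 12, 25)
skew_points <- sample_skewness(toy_points)
print(skew_points)

# With the real data one would call, for example:
# sample_skewness(wnba_data$POINTS)
```

A value near zero would indicate a roughly symmetric distribution; the larger the positive value, the heavier the right tail.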
# Histogram for 'efficiency'
ggplot(wnba_data, aes(x = EFFICIENCY)) +
geom_histogram(binwidth = 0.02, fill = "#FFA07A", color = "#4B0082", alpha = 0.7) +
labs(x = "Efficiency", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "black"),
axis.text = element_text(size = 10, color = "black"),
axis.title = element_text(size = 12, face = "bold", color = "black")
)
Fig.3 Distribution of efficiency
# Box plot for 'minutes'
boxplot(wnba_data$MINUTES, ylab = "Minutes")
Fig.4 Box plot of Minutes
# Box plot for 'points'
ggplot(wnba_data, aes(x = 1, y = POINTS)) +
geom_boxplot(fill = "#66c2a5", color = "#fc8d62", alpha = 0.7, outlier.shape = NA) +
labs(x = NULL, y = "Points") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#fc8d62"),
axis.text = element_text(size = 12, color = "#2F4F4F"),
axis.title = element_text(size = 14, face = "bold", color = "#fc8d62")
)
Fig.5 Box plot of points
# Box plot for 'efficiency'
ggplot(wnba_data, aes(x = 1, y = EFFICIENCY)) +
geom_boxplot(color = "black", alpha = 0.7, outlier.shape = NA) +
labs(x = NULL, y = "Efficiency") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "black"),
axis.text = element_text(size = 12, color = "black"),
axis.title = element_text(size = 14, face = "bold", color = "black")
)
Fig.6 Box plot of efficiency
# Correlation matrix
cor_matrix <- cor(wnba_data[, c("MINUTES", "POINTS", "EFFICIENCY")])
cor_matrix
## MINUTES POINTS EFFICIENCY
## MINUTES 1.0000000 0.7502987 0.7047455
## POINTS 0.7502987 1.0000000 0.8733016
## EFFICIENCY 0.7047455 0.8733016 1.0000000
# Create a correlation plot with pie charts
corrplot(cor_matrix, method = "pie" )
# Add a caption to the plot
title( sub = "Fig. 7: Correlation Plot with Pie Charts ")
# Variance inflation factor (VIF) for multicollinearity check
vif_values <- car::vif(lm(EFFICIENCY ~ MINUTES + POINTS, data = wnba_data))
print(vif_values)
## MINUTES POINTS
## 2.288058 2.288058
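With only two predictors, the VIF can be checked by hand: VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the other, and with a single regressor that R² is simply the squared correlation. The sketch below uses the correlation between MINUTES and POINTS from the matrix above; it is a consistency check, not a replacement for `car::vif`.

```r
# Correlation between MINUTES and POINTS, copied from the correlation matrix above
r_minutes_points <- 0.7502987

# For two predictors, the R-squared of one regressed on the other is r^2
vif_by_hand <- 1 / (1 - r_minutes_points^2)
print(round(vif_by_hand, 4))
```

The result is about 2.288, which agrees with the value reported by `car::vif` and is well below the commonly used thresholds of 5 or 10.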
# Average efficiency by team
avg_efficiency_by_team <- aggregate(wnba_data$EFFICIENCY, by = list(wnba_data$TEAM), FUN = mean)
colnames(avg_efficiency_by_team) <- c("TEAM", "AVERAGE_EFFICIENCY")
# Bar chart of average efficiency by team
avg_efficiency_by_team_barplot <- ggplot(avg_efficiency_by_team, aes(x = TEAM, y = AVERAGE_EFFICIENCY)) +
geom_bar(stat = "identity", fill = "#87CEEB", color = "black", alpha = 0.7) +
labs(title = "Average Efficiency by team", x = "Team", y = "Average Efficiency") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#2F4F4F"),
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 12, color = "#2F4F4F"),
axis.text.y = element_text(size = 12, color = "#2F4F4F"),
axis.title = element_text(size = 14, face = "bold", color = "#2F4F4F")
)
# Display the plot
print(avg_efficiency_by_team_barplot)
Fig.8 Average Efficiency by Team
Business Question 1
Predict player’s overall efficiency based on performance metrics like points, assists, and rebounds etc. using Linear Regression.
The attributes used in this testing are as follows:
Efficiency: A numerical measure to quantify a player’s overall contribution to a game.
Points: The total points scored by the player in the game.
Assists: The number of assists by the player, that is, the number of times the player passes the ball to a teammate in a way that leads directly to a basket.
Totrb: Total rebounds (offensive + defensive) by the player.
Goal: Predicting Player Efficiency in Basketball
Motivation: Understanding player
efficiency in basketball allows teams to make data-driven decisions
regarding player selection, training, and in-game strategy. We aim to
use statistical methods to pinpoint which performance metrics are the
most predictive of a player’s overall efficiency.
Research and Development: The research
involves analyzing game-by-game player statistics from the WNBA to
determine the relationship between on-court performance, in terms of
points, assists, and rebounds, and the efficiency metric.
Methodology: Linear Regression is
used because it suits the prediction of a continuous outcome, efficiency,
from these performance metrics. The method assumes a linear
relationship, making it a logical choice for this analysis.
# Fitting a linear regression model
model <- lm(wnba_data$EFFICIENCY ~ wnba_data$POINTS + wnba_data$ASSIST + wnba_data$TOTRB, data = wnba_data)
# To view the summary of the model
summary(model)
##
## Call:
## lm(formula = wnba_data$EFFICIENCY ~ wnba_data$POINTS + wnba_data$ASSIST +
## wnba_data$TOTRB, data = wnba_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.4536 -1.6213 0.2577 1.7434 14.0374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.646283 0.078411 -21.00 <2e-16 ***
## wnba_data$POINTS 0.781614 0.008159 95.80 <2e-16 ***
## wnba_data$ASSIST 0.594199 0.025362 23.43 <2e-16 ***
## wnba_data$TOTRB 0.902929 0.016329 55.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.916 on 4028 degrees of freedom
## Multiple R-squared: 0.8729, Adjusted R-squared: 0.8728
## F-statistic: 9221 on 3 and 4028 DF, p-value: < 2.2e-16
# Plotting residuals vs. fitted values
plot(model$fitted.values, model$residuals, xlab = "Fitted values", ylab = "Residuals")
# Draw the zero reference line after the plot exists
abline(h = 0, col = "red")
# Add a caption to the plot
title( sub = "Fig. 8: Residual Plot for linear model ")
# QQ Plot for normality of residuals
qqnorm(model$residuals)
qqline(model$residuals)
# Add a caption to the plot
title( sub = "Fig. 9: QQ Plot ")
Insights:
1. The summary(model) output indicates that all three predictor
variables (points, assists, rebounds) are statistically significant
predictors of player efficiency, as shown by the very low p-value of
each coefficient (< 2e-16). The overall F-test (p < 2.2e-16) shows
that the model as a whole is significant.
2. The coefficients reflect how much efficiency is
expected to increase with a one-unit increase in each performance
metric, holding other variables constant.
3. The positive
values of the coefficients for points (0.781614), assists (0.594199),
and rebounds (0.902929) suggest that all these metrics are positively
associated with a player’s efficiency.
4. A high Multiple R-squared
value of 0.8729 indicates that our model explains about 87% of the
variability in efficiency.
5. The residual plot and QQ plot
were generated to check the regression assumptions.
6. The Residuals vs. Fitted plot does not show any obvious patterns,
suggesting a good fit for the linear model.
7. The Normal Q-Q plot
shows the residuals are fairly well aligned with the theoretical
quantiles, suggesting approximate normality.
The analysis using linear regression has shown that points, assists,
and rebounds are significant predictors of a player’s efficiency in
basketball. This conclusion helps answer our main business question and
provides a basis for teams and coaches to evaluate player performance
more effectively.
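To make the coefficient interpretation concrete, the fitted equation can be applied to a single stat line. The sketch below copies the coefficients from the model summary; the stat line of 10 points, 3 assists and 5 rebounds is a hypothetical example, not a row from the data.

```r
# Coefficients copied from the summary(model) output
coefs <- c(intercept = -1.646283, points = 0.781614,
           assist = 0.594199, totrb = 0.902929)

# Hypothetical stat line (assumed for illustration only)
stat_line <- c(points = 10, assist = 3, totrb = 5)

# Predicted efficiency = intercept + sum of coefficient * value
predicted_efficiency <- unname(
  coefs["intercept"] + sum(coefs[names(stat_line)] * stat_line)
)
print(predicted_efficiency)
```

The manual calculation is used here because the model was fitted with `wnba_data$`-prefixed terms, in which case `predict()` with `newdata` would most likely not pick up the new values.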
Business Question 2
Predict likelihood of a team winning a game based on where the match is held (home or away) along with performance metrics using Logistic Regression.
Goal: Predicting the Influence of Court Location on Game Outcomes
Motivation: The concept of ‘home court
advantage’ is frequently cited in basketball, and our goal is to
quantify its impact. We want to determine whether a team’s performance
metrics at home games are a significant predictor of their likelihood to
win, as opposed to their performance in away games.
Research and Development: By analyzing
WNBA player data, we investigate the influence of playing at home versus
away. We hypothesize that players perform differently depending on the
game location, which in turn affects the team’s chances of winning.
Methodology: Logistic Regression is
employed due to the binary nature of the target variable: whether a
player’s average efficiency is higher at home (1) or not (0). This
method models probabilities and therefore suits our binary
classification problem.
# Data Preprocessing
# Create the target variable
wnba_data <- wnba_data %>%
  group_by(PLAYER_ID) %>%
  mutate(
    avg_home_efficiency = mean(EFFICIENCY[HOME == 1], na.rm = TRUE),
    avg_away_efficiency = mean(EFFICIENCY[HOME == 0], na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(higher_home_efficiency = ifelse(avg_home_efficiency > avg_away_efficiency, 1, 0))
# Remove rows with NA in target variable
wnba_data <- wnba_data %>% filter(!is.na(higher_home_efficiency))
# wnba_data$higher_home_efficiency[is.na(wnba_data$higher_home_efficiency)] <- some_imputation_method
# Select relevant features
features <- wnba_data %>%
select(PLAYER_ID, higher_home_efficiency, MINUTES, TOTRB, ASSIST, STEAL, BLOCK, TURNOVER, FOULS, POINTS)
# Split the data into training and test sets
set.seed(123)
training_indices <- createDataPartition(features$higher_home_efficiency, p = 0.8, list = FALSE)
training_data <- features[training_indices, ]
test_data <- features[-training_indices, ]
# Logistic Regression Model
model <- glm(higher_home_efficiency ~ ., data = training_data, family = binomial())
print(model)
##
## Call: glm(formula = higher_home_efficiency ~ ., family = binomial(),
## data = training_data)
##
## Coefficients:
## (Intercept) PLAYER_ID MINUTES TOTRB ASSIST STEAL
## -1.074231 0.005370 0.024711 0.053854 -0.045638 0.052315
## BLOCK TURNOVER FOULS POINTS
## 0.131767 0.022404 0.090224 -0.007681
##
## Degrees of Freedom: 3215 Total (i.e. Null); 3206 Residual
## Null Deviance: 4450
## Residual Deviance: 4335 AIC: 4355
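Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to read. The sketch below does this for the coefficients printed above. Since `print(model)` does not report standard errors or p-values, nothing here speaks to statistical significance.

```r
# Coefficients copied from the printed model (log-odds scale)
log_odds <- c(PLAYER_ID = 0.005370, MINUTES = 0.024711, TOTRB = 0.053854,
              ASSIST = -0.045638, STEAL = 0.052315, BLOCK = 0.131767,
              TURNOVER = 0.022404, FOULS = 0.090224, POINTS = -0.007681)

# Odds ratio per one-unit increase in each predictor, holding the others fixed
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

For example, the odds ratio of about 1.14 for BLOCK means that each additional block multiplies the odds of `higher_home_efficiency = 1` by roughly 1.14, with the other predictors held constant. With the fitted object available, `exp(coef(model))` gives the same values.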
# Model Evaluation
predictions <- predict(model, test_data, type = "response")
predicted_class <- ifelse(predictions > 0.5, 1, 0)
conf_matrix <- confusionMatrix(as.factor(predicted_class), as.factor(test_data$higher_home_efficiency))
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
sensitivity <- recall
specificity <- conf_matrix$byClass["Specificity"]
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.5559701
cat("Precision:", precision, "\n")
## Precision: 0.5168539
cat("Sensitivity (Recall):", sensitivity, "\n")
## Sensitivity (Recall): 0.498645
cat("Specificity:", specificity, "\n")
## Specificity: 0.6045977
# Plotting the predicted probabilities
pred_df <- data.frame(actual = test_data$higher_home_efficiency, predicted_prob = predictions)
ggplot(pred_df, aes(x = predicted_prob, fill = as.factor(actual))) +
geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
labs(x = "Predicted Probability", y = "Count", fill = "Actual Value") +
theme_minimal()
Fig. 10: Predicted probabilities from the logistic regression model, by actual class
Insights:
1. The logistic
regression model was built with the target variable
higher_home_efficiency and predictors including minutes played, total
rebounds, assists, steals, blocks, turnovers, fouls, and points
scored. As the printed coefficients show, PLAYER_ID was also left in
the feature set and therefore entered the model as a numeric predictor.
2. The confusion matrix shows the model’s accuracy at
55.6%, which is slightly above the No Information Rate, indicating the
model does better than random guessing but not by a substantial
margin.
3. Sensitivity (49.9%) and specificity (60.5%) are both modest.
Because caret treats the first factor level (here 0) as the positive
class by default, the reported sensitivity and precision refer to class
0 and the specificity to class 1.
4. The analysis
shows that the model has only modest power to predict whether a player
is more efficient at home. Because WIN is not part of the model, it
does not directly measure how game location affects the chance of
winning. It does suggest a small, measurable association between home
court and player performance.
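Since the choice of positive class changes which number is called sensitivity, it helps to see the metrics computed both ways. The base-R sketch below uses made-up labels, not the report's test set, and `class_metrics` is a hypothetical helper rather than a caret function.

```r
# Hypothetical helper: confusion-matrix metrics for a chosen positive class
class_metrics <- function(predicted, actual, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)
  c(accuracy    = (tp + tn) / length(actual),
    precision   = tp / (tp + fp),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Made-up labels for illustration only
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

metrics_pos1 <- class_metrics(predicted, actual, positive = 1)
metrics_pos0 <- class_metrics(predicted, actual, positive = 0)
print(rbind(positive_1 = metrics_pos1, positive_0 = metrics_pos0))
```

Accuracy is the same in both rows, while sensitivity and specificity swap. In caret, passing `positive = "1"` to `confusionMatrix()` reports the metrics for class 1.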
Business Question 3
Test whether there is a significant difference in three-point shooting accuracy between home games and away games.
Attributes used in this testing are as follows:
made3: The number of three-pointers made.
att3: The number of three-point attempts.
Goal: Examining Three-Point Shooting Accuracy Based on Game Location
Motivation: The environment in which a
basketball game is played, home or away, may have an impact on a player’s
performance. Specifically, we are interested in whether it affects the
accuracy of three-point shots. This analysis matters because it can inform
coaching strategies and team preparations.
Research and Development: We utilized historical WNBA game data
to compare the three-point shooting accuracy of players in home versus
away games, testing the common belief that players shoot better at
home.
Methodology: A two-sample
t-test is appropriate for comparing the means of two independent groups,
here the shooting accuracy in home and away games. The null hypothesis
is that there is no difference in mean shooting accuracy based on game
location; the alternative is that there is a difference.
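# Calculation of three-point accuracy
wnba_data$three_point_accuracy <- wnba_data$MADE3 / wnba_data$ATT3
# Handle infinite values; rows with no three-point attempts give 0/0 = NaN, which na.omit() removes
wnba_data$three_point_accuracy[is.infinite(wnba_data$three_point_accuracy)] <- NA
# Note: this drops those rows from wnba_data itself, so later analyses use the reduced data
wnba_data <- na.omit(wnba_data)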
# Subset the data into home and away games
home_games_data <- subset(wnba_data, HOME == 1)
away_games_data <- subset(wnba_data, HOME == 0)
# Perform a two-sample t-test on three-point shooting accuracy
t_test_result <- t.test(home_games_data$three_point_accuracy, away_games_data$three_point_accuracy)
# Print the result
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: home_games_data$three_point_accuracy and away_games_data$three_point_accuracy
## t = 0.69208, df = 1996.8, p-value = 0.489
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01824071 0.03813569
## sample estimates:
## mean of x mean of y
## 0.2969697 0.2870222
# a. T-test already performed above
# b. State the Hypotheses
# Null Hypothesis (H0): No significant difference in three-point shooting accuracy between home and away games.
# Alternative Hypothesis (H1): Significant difference in three-point shooting accuracy between home and away games.
# c. Find the Critical Value (for a two-tailed test at alpha = 0.05)
alpha <- 0.05
df <- t_test_result$parameter
critical_value <- qt(alpha/2, df, lower.tail = FALSE)
# d. Make the Decision
p_value <- t_test_result$p.value
decision <- ifelse(p_value <= alpha, "Reject the Null Hypothesis", "Fail to Reject the Null Hypothesis")
# e. Summarize the Results
summary <- paste("T-Test Result:", decision,
"\nCritical Value at alpha =", alpha, "is", critical_value,
"\nP-Value of the test is", p_value)
print(summary)
## [1] "T-Test Result: Fail to Reject the Null Hypothesis \nCritical Value at alpha = 0.05 is 1.9611527254421 \nP-Value of the test is 0.48896629326814"
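The decision above is based on the p-value. The critical value from step c supports the equivalent rule of rejecting the null hypothesis when |t| exceeds it. A minimal sketch of that comparison, using the t statistic and degrees of freedom reported in the test output:

```r
# Values copied from the Welch t-test output for three-point accuracy
t_statistic <- 0.69208
deg_freedom <- 1996.8
alpha <- 0.05

# Two-tailed critical value and p-value from the t distribution
critical_value <- qt(alpha / 2, deg_freedom, lower.tail = FALSE)
p_value <- 2 * pt(abs(t_statistic), deg_freedom, lower.tail = FALSE)

# Reject H0 only if the statistic falls beyond the critical value
reject_null <- abs(t_statistic) > critical_value
print(c(critical_value = critical_value, p_value = p_value))
print(reject_null)
```

Both rules lead to the same decision here: the statistic is well inside the critical value, so the null hypothesis is not rejected.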
Insights:
1. The t-test result
shows a p-value of 0.489, which is above the typical alpha level of
0.05, suggesting that the difference in three-point shooting accuracy
between home and away games is not statistically significant.
2. The 95% confidence interval for the difference in means
(-0.018 to 0.038) includes zero, further supporting the lack of a
statistically significant difference.
3. The means of the two samples are
relatively close, with home game accuracy at 29.7% and away game
accuracy at 28.7%.
4. The analysis provides no evidence of a
significant difference in three-point shooting accuracy between home and
away games in the WNBA. This finding does not support the notion of a
‘home court’ advantage in this aspect and is consistent with players
maintaining similar three-point shooting performance regardless of the
game location.
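One side effect of the accuracy calculation in this section is that `na.omit(wnba_data)` removes every row without a three-point attempt from `wnba_data` itself, so the analyses that follow run on the reduced data. A minimal sketch of an alternative keeps the filtered rows in a separate object. It assumes a toy data frame with the same column names, and `three_pt_data` is a hypothetical name.

```r
# Toy data with the same column names as the report's data set (assumed values)
toy_data <- data.frame(
  HOME  = c(1, 1, 0, 0, 1, 0),
  MADE3 = c(2, 0, 1, 0, 3, 0),
  ATT3  = c(5, 0, 4, 2, 6, 0)
)

# Keep only rows with at least one attempt, in a separate data frame
three_pt_data <- subset(toy_data, ATT3 > 0)
three_pt_data$three_point_accuracy <- three_pt_data$MADE3 / three_pt_data$ATT3

# Split accuracy by game location; toy_data itself is left untouched
home_accuracy <- three_pt_data$three_point_accuracy[three_pt_data$HOME == 1]
away_accuracy <- three_pt_data$three_point_accuracy[three_pt_data$HOME == 0]

print(nrow(toy_data))
print(nrow(three_pt_data))
```

The results reported in this document correspond to the original code. Applying this alternative to the real data would change the row counts, and possibly the results, of the later steals analysis.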
Business Question 4
Test whether teams that accumulate more steals have an increased likelihood of winning the game.
Goal: Understanding the
Relationship Between Steals and Game Outcomes
Motivation: In basketball, defensive plays
like steals can be crucial in turning the tide of a game. Our objective
is to determine whether there is a statistical link between the number
of steals a team makes and their chances of winning a game.
Research and Development: We focus on the
hypothesis that more steals could lead to a higher probability of
winning. Our analysis involves comparing the average number of steals in
games won to those lost, using historical WNBA game data.
Methodology: The two-sample t-test is a
suitable statistical test for comparing the means of two groups, here
steals in games won versus games lost. This method will help us test the
null hypothesis that there is no difference in the average number of
steals between the two types of games.
#Reference Stack Exchange. (2012, August 31). How to perform two-sample t-tests in R by inputting sample statistics rather than raw data? Cross Validated. https://stats.stackexchange.com/questions/30394/how-to-perform-two-sample-t-tests-in-r-by-inputting-sample-statistics-rather-tha
# Subset the data into games won and games lost
games_won_data <- subset(wnba_data, WIN == 1)
games_lost_data <- subset(wnba_data, WIN == 0)
# Perform a two-sample t-test on the number of steals
t_test_result <- t.test(games_won_data$STEAL, games_lost_data$STEAL)
# Print the result
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: games_won_data$STEAL and games_lost_data$STEAL
## t = 2.4684, df = 1970.8, p-value = 0.01366
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02451659 0.21409664
## sample estimates:
## mean of x mean of y
## 0.9479060 0.8285994
# a. T-test already performed above
# b. State the Hypotheses
# Null Hypothesis (H0): No significant difference in the average number of steals between games won and lost.
# Alternative Hypothesis (H1): Significant difference in the average number of steals between games won and lost.
# c. Find the Critical Value (for a two-tailed test at alpha = 0.05)
alpha <- 0.05
df <- t_test_result$parameter
critical_value <- qt(alpha/2, df, lower.tail = FALSE)
# d. Make the Decision
p_value <- t_test_result$p.value
decision <- ifelse(p_value <= alpha, "Reject the Null Hypothesis", "Fail to Reject the Null Hypothesis")
# e. Summarize the Results
summary <- paste("T-Test Result:", decision,
"\nCritical Value at alpha =", alpha, "is", critical_value,
"\nP-Value of the test is", p_value)
print(summary)
## [1] "T-Test Result: Reject the Null Hypothesis \nCritical Value at alpha = 0.05 is 1.96116845168867 \nP-Value of the test is 0.013655901329085"
Insights:
1. The results from the
t-test provide a p-value of 0.01366, indicating that the difference in
the number of steals between games won and games lost is statistically
significant (p < 0.05).
2. The 95% confidence interval for the difference in means
(0.0245 to 0.2141) lies entirely above zero, suggesting that, on
average, more steals are recorded in games won than in games lost.
3. The
mean values show that the average number of steals per player per game
is higher in games won (0.948) than in games lost (0.829).
The data supports the alternative hypothesis that the average number of
steals differs between games won and games lost, with more steals in
wins. This association can inform team strategies, emphasizing the
importance of defensive play and steals in winning games.
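The reported means, confidence interval and t statistic for the steals test are tied together by simple arithmetic, which gives a quick way to check that the numbers quoted match the test output. The sketch below uses only values from that output.

```r
# Values copied from the Welch t-test output for steals
mean_won    <- 0.9479060
mean_lost   <- 0.8285994
ci_lower    <- 0.02451659
ci_upper    <- 0.21409664
deg_freedom <- 1970.8

# The difference in means is the midpoint of the confidence interval
mean_difference <- mean_won - mean_lost
ci_midpoint <- (ci_lower + ci_upper) / 2

# The CI half-width equals critical value * standard error
critical_value <- qt(0.975, deg_freedom)
standard_error <- (ci_upper - ci_lower) / 2 / critical_value

# The t statistic is the difference divided by its standard error
t_statistic <- mean_difference / standard_error
print(c(difference = mean_difference, midpoint = ci_midpoint,
        std_error = standard_error, t = t_statistic))
```

The recovered t statistic is about 2.468, in line with the reported t = 2.4684. The difference of roughly 0.12 steals per player per game is statistically significant but small in absolute terms.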
Our study looked at basketball statistics to understand what makes
teams and players do well. We used linear regression, logistic
regression, and two-sample t-tests to look at players’ points, assists,
and rebounds, as well as how game location and steals relate to
performance and winning.
We learned that scoring points, helping teammates score, and grabbing
rebounds are all important for a player to be considered efficient. This
information can help coaches figure out which skills to focus on during
training.
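When we looked at home versus away games, the logistic regression
model was only slightly better than the no-information rate (55.6%
accuracy), and three-point shooting accuracy did not differ
significantly between home and away games (p = 0.489). This means that
being at home might not be as big of a deal as we thought.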
We also saw that making more steals can mean a better chance of
winning. This shows that good defense can be just as important as
scoring.
In short, our research tells us that looking at the numbers can
really help understand basketball better. It can help teams decide what
to work on and how to play the game. Our findings give useful tips for
teams and show that using data is a smart way to make sports
decisions.
1. Bluman, A. G. (2022). Elementary Statistics, A Step by Step
Approach: A Brief Version, 10th Edition. McGraw Hill Publishing.
2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013).
An Introduction to Statistical Learning. Springer. URL: https://link.springer.com/chapter/10.1007/978-1-4614-7138-7_2
WNBA 2014 Player Stats- Data Key
The dataset contains 26 variables. Below is the data key for each variable which will help explain the project and dataset better.
| Name | Description |
|---|---|
| Player | The name of the basketball player. |
| player_id | A numerical identifier for each player. |
| team | The team for which the player is playing. |
| date | The date of the game. |
| home | A binary indicator (0 or 1) representing whether the game was played at home (1) or away (0). |
| opponent | The name of the opposing team. |
| win | A binary indicator (1 or 0) representing whether the player’s team won (1) or lost (0) the game. |
| team_pts | The total points scored by the player’s team in the game. |
| opp_pts | The total points scored by the opposing team. |
| minutes | The number of minutes the player participated in the game. |
| fgmade | The number of field goals made by the player. |
| fgatt | The number of field goal attempts by the player. |
| made3 | The number of three-pointers made. |
| att3 | The number of three-point attempts. |
| made1 | The number of free throws made. |
| att1 | The number of free throw attempts. |
| offrb | Offensive rebounds by the player. |
| defrb | Defensive rebounds by the player. |
| totrb | Total rebounds (offensive + defensive) by the player. |
| Assist | The number of assists by the player. |
| steal | The number of steals by the player. |
| block | The number of blocks by the player. |
| turnover | The number of turnovers committed by the player. |
| fouls | The number of fouls committed by the player. |
| points | The total points scored by the player in the game. |
| efficiency | A numerical measure of a player’s overall efficiency in the game. |