ALY6015_Module_6_Group_1_Final

Northeastern University
ALY6015 71821 Intermediate Analytics SEC 09 Fall 2023 CPS [BOS-B-HY]
Zeeshan Ahmad Ansari
Prof. Vladimir Shapiro
Date of submission: 09 April, 2024

Module 5 Final Project- Final Report

Introduction

In this project, we’re looking at important questions about basketball using statistics. We want to understand what makes a player efficient, how playing at home or away affects a team’s chances of winning, if shooting three-pointers is better at home or away, and whether making steals helps teams win. Here’s how we’re doing it:

1. Player Efficiency: The primary question here is: what combination of performance metrics, such as points scored, assists made, and rebounds collected, most significantly predicts a player’s overall efficiency on the court? This inquiry is not just of academic interest; it has direct implications for player assessment, training focus, and game strategy.

2. Home vs. Away Games: Another pivotal question addresses the concept of home court advantage. We seek to predict the likelihood of a team’s victory based on whether it plays at home or away. This analysis incorporates performance metrics such as points scored and fouls committed to understand how these factors interact with the venue of the game.

3. Three-Point Shooting: We also explore whether there is a significant difference in three-point shooting accuracy between home games and away games. This analysis could provide insights into how environmental or psychological factors impact player performance.

4. Steals and Winning: Lastly, we investigate whether teams that execute more steals have a higher probability of winning games. This question probes the strategic value of defensive plays like steals in the overall game outcome.

We’re using these methods to get answers that could help coaches, players and fans understand the game better.

Methods

In this basketball analysis project, we’ve employed several predictive analytics and statistical techniques to address our key questions. Here’s a breakdown of the methods used and why they were appropriate:

1. Linear Regression for Player Efficiency:
Model Reason: We chose linear regression to understand how different performance metrics like points, assists, and rebounds predict a player’s efficiency. This method shows how much each factor contributes to overall performance when the others are held constant.
Data Involved: We used player statistics such as total points, number of assists, and rebound counts.

2. Logistic Regression for Home vs. Away Game Analysis:
Model Reason: Logistic regression suits questions with yes/no answers, like whether a team is more likely to win at home or away. This method helps us predict the probability of an outcome based on the game location and other factors.
Data Involved: We analyzed game records, including the location of the game (home or away) and performance metrics like points scored and fouls committed.

3. Two-sample t-test for Three-Point Shooting:
Model Reason: The two-sample t-test compares the means of two groups. We used it to see whether three-point shooting accuracy differs significantly between home and away games.
Data Involved: For three-point shooting, we compared shooting percentages in home and away games.

4. Two-sample t-test for Steals Analysis:
Model Reason: The two-sample t-test is suitable for exploring the impact of defense on game outcomes, specifically whether a higher number of steals is associated with winning. This statistical test compares the average number of steals between two distinct groups.
Data Involved: We investigated historical WNBA data, dividing it into games won and games lost. We then compared the average steals in each group to understand their relationship to winning games.

These methods were chosen because they align well with the nature of our data and the specific questions we’re trying to answer. By applying these techniques, we can gain meaningful insights into player and team performance in basketball.

Analysis

EDA

library(ggplot2)
library(dplyr)
library(tidyr)
library(caret)
library(readr)
library(corrplot)

wnba_data <- read.csv("D:/Quater_2/Second Part/ALY6015/FInal Group/WNBA-2014-player-stats-by-game.csv")


# variable names changed to uppercase
colnames(wnba_data) <- toupper(colnames(wnba_data))

# Replacing periods with underscores (existing underscores stay as underscores)
colnames(wnba_data) <- gsub("[._]", "_", colnames(wnba_data))

# Checking the updated column names
names(wnba_data)

##  [1] "PLAYER"     "PLAYER_ID"  "TEAM"       "DATE"       "HOME"      
##  [6] "OPPONENT"   "WIN"        "TEAM_PTS"   "OPP_PTS"    "MINUTES"   
## [11] "FGMADE"     "FGATT"      "MADE3"      "ATT3"       "MADE1"     
## [16] "ATT1"       "OFFRB"      "DEFRB"      "TOTRB"      "ASSIST"    
## [21] "STEAL"      "BLOCK"      "TURNOVER"   "FOULS"      "POINTS"    
## [26] "EFFICIENCY"

# Making an empty dataframe to store missing value counts

missing_counts_wnba_data <- data.frame(Column = character(0), Blank_Count = integer(0))

# Looping through each column of the dataframe

for (col in names(wnba_data)) {
  missing_values <- sum(is.na(wnba_data[[col]]) | wnba_data[[col]] == "")
  missing_counts_wnba_data <- rbind(missing_counts_wnba_data, data.frame(Column = col, Blank_Count = missing_values))
}

# Calculating the total number of rows in the dataframe
total_rows <- nrow(wnba_data)

# Calculating the percentage of missing values for each column

missing_counts_wnba_data$Percent_Missing <- (missing_counts_wnba_data$Blank_Count / total_rows) * 100

missing_counts_wnba_data

##        Column Blank_Count Percent_Missing
## 1      PLAYER           0               0
## 2   PLAYER_ID           0               0
## 3        TEAM           0               0
## 4        DATE           0               0
## 5        HOME           0               0
## 6    OPPONENT           0               0
## 7         WIN           0               0
## 8    TEAM_PTS           0               0
## 9     OPP_PTS           0               0
## 10    MINUTES           0               0
## 11     FGMADE           0               0
## 12      FGATT           0               0
## 13      MADE3           0               0
## 14       ATT3           0               0
## 15      MADE1           0               0
## 16       ATT1           0               0
## 17      OFFRB           0               0
## 18      DEFRB           0               0
## 19      TOTRB           0               0
## 20     ASSIST           0               0
## 21      STEAL           0               0
## 22      BLOCK           0               0
## 23   TURNOVER           0               0
## 24      FOULS           0               0
## 25     POINTS           0               0
## 26 EFFICIENCY           0               0
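The loop above grows a data frame one row at a time. A more compact version of the same check is sketched below. The toy data frame and its values are hypothetical stand-ins for wnba_data, used only so the example runs on its own.

```r
# Toy data frame standing in for wnba_data (hypothetical values)
toy <- data.frame(
  PLAYER = c("A", "", "C", NA),
  POINTS = c(10, NA, 7, 3),
  stringsAsFactors = FALSE
)

# Count NA or empty-string entries per column in a single pass
blank_count <- vapply(
  toy,
  function(x) sum(is.na(x) | as.character(x) == ""),
  integer(1)
)

# Assemble the same three-column summary as the loop-based version
missing_summary <- data.frame(
  Column = names(blank_count),
  Blank_Count = unname(blank_count),
  Percent_Missing = unname(blank_count) / nrow(toy) * 100
)
print(missing_summary)
```

Using vapply() with integer(1) makes the expected return type explicit, so an unexpected column type fails loudly instead of silently changing the summary.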

During the EDA phase, we create several histograms and box plots and check correlations to understand the data better.

Distribution of Minutes and Points: The Minutes histogram (Fig. 1) is slightly right-skewed and the Points histogram (Fig. 2) is clearly right-skewed, so most observations cluster towards the lower end of the scale. This indicates that high-scoring individual performances are relatively rare.

# Histogram for 'minutes'
ggplot(wnba_data, aes(x = MINUTES)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(x = "Minutes", y = "Frequency") +
  theme_minimal()

Fig.1 Distribution of minutes

# Histogram for 'points'
ggplot(wnba_data, aes(x = POINTS)) +
  geom_histogram(binwidth = 5, fill = "#69b3a2", color = "#404040", alpha = 0.7) +
  labs(x = "Points", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14, face = "bold")
  )

Fig.2 Distribution of points
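Skewness read off a histogram can also be checked numerically. The sketch below computes a moment-based sample skewness in base R. The helper sample_skewness() and the toy_points vector are illustrative assumptions, not part of the original analysis.

```r
# Moment-based sample skewness: mean of cubed deviations over the cubed
# (population) standard deviation. Positive values indicate a right tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

# Hypothetical per-game points: many low values and a few high ones
toy_points <- c(0, 2, 2, 3, 4, 4, 5, 6, 8, 10, 15, 24)
skew_points <- sample_skewness(toy_points)
print(skew_points)  # a positive value means right-skewed
```

On the real data, the same helper could be applied to wnba_data$MINUTES and wnba_data$POINTS to compare the two distributions on one scale.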

Distribution of Efficiency: The histogram for Efficiency (Fig. 3) is roughly bell-shaped with a longer right tail, indicating some observations with exceptionally high efficiency.

# Histogram for 'efficiency'
ggplot(wnba_data, aes(x = EFFICIENCY)) +
  geom_histogram(binwidth = 0.02, fill = "#FFA07A", color = "#4B0082", alpha = 0.7) +
  labs(x = "Efficiency", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "black"),
    axis.text = element_text(size = 10, color = "black"),
    axis.title = element_text(size = 12, face = "bold", color = "black")
  )

Fig.3 Distribution of efficiency

Box Plots for Consistency: The box plots show the spread and central tendency of Minutes (Fig. 4), Points (Fig. 5), and Efficiency (Fig. 6). Minutes and Points have a relatively small interquartile range (IQR), indicating that the middle half of the data falls within a narrow range. The Efficiency box plot shows a larger IQR and several outliers, indicating more variability in this measure.

# Box plot for 'minutes'
boxplot(wnba_data$MINUTES, ylab = "Minutes")

Fig.4 Box plot of Minutes

# Box plot for 'points'
ggplot(wnba_data, aes(x = 1, y = POINTS)) +
  geom_boxplot(fill = "#66c2a5", color = "#fc8d62", alpha = 0.7, outlier.shape = NA) +
  labs(x = NULL, y = "Points") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#fc8d62"),
    axis.text = element_text(size = 12, color = "#2F4F4F"),
    axis.title = element_text(size = 14, face = "bold", color = "#fc8d62")
  )

Fig.5 Box plot of points

# Box plot for 'efficiency'
ggplot(wnba_data, aes(x = 1, y = EFFICIENCY)) +
  geom_boxplot(color = "black", alpha = 0.7) +  # outliers are drawn so they can be inspected
  labs(x = NULL, y = "Efficiency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "black"),
    axis.text = element_text(size = 12, color = "black"),
    axis.title = element_text(size = 14, face = "bold", color = "black")
  )

Fig.6 Box plot of efficiency
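The IQR and the outlier rule behind a box plot can be reproduced by hand. The sketch below uses the common 1.5 × IQR fences on a hypothetical toy_eff vector. Note that boxplot() itself uses hinges from fivenum(), which can differ slightly from quantile() on small samples.

```r
# Hypothetical efficiency values with one unusually high game
toy_eff <- c(-3, 1, 2, 4, 5, 6, 7, 8, 9, 11, 13, 30)

# Quartiles and interquartile range
q <- quantile(toy_eff, c(0.25, 0.75))
iqr <- IQR(toy_eff)

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers <- toy_eff[toy_eff < lower_fence | toy_eff > upper_fence]

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(outliers)
```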

# Correlation matrix
cor_matrix <- cor(wnba_data[, c("MINUTES", "POINTS", "EFFICIENCY")])
cor_matrix

##              MINUTES    POINTS EFFICIENCY
## MINUTES    1.0000000 0.7502987  0.7047455
## POINTS     0.7502987 1.0000000  0.8733016
## EFFICIENCY 0.7047455 0.8733016  1.0000000

# Create a correlation plot with pie charts
corrplot(cor_matrix, method = "pie" )

# Add a caption to the plot
title( sub = "Fig. 7: Correlation Plot with Pie Charts ")

Correlation Plot with Pie Charts: The correlation plot (Fig. 7) shows positive correlations among all three variables. Minutes and Points are positively correlated (r ≈ 0.75), meaning that as minutes increase, points tend to increase as well. Minutes and Efficiency are also positively correlated (r ≈ 0.70), though somewhat less strongly, and Points and Efficiency show the strongest relationship (r ≈ 0.87).

# Variance inflation factor (VIF) for multicollinearity check
vif_values <- car::vif(lm(EFFICIENCY ~ MINUTES + POINTS, data = wnba_data))
print(vif_values)

##  MINUTES   POINTS 
## 2.288058 2.288058

Average Efficiency by Team: The bar graph comparing average efficiency by team (Fig. 8) shows variation between teams, with some teams having noticeably higher average efficiency than others. This could indicate differences in strategies, skills, or other team-specific factors that affect efficiency.

# Average efficiency by team
avg_efficiency_by_team <- aggregate(wnba_data$EFFICIENCY, by = list(wnba_data$TEAM), FUN = mean)
colnames(avg_efficiency_by_team) <- c("TEAM", "AVERAGE_EFFICIENCY")
# Bar chart of average efficiency by team
avg_efficiency_by_team_barplot <- ggplot(avg_efficiency_by_team, aes(x = TEAM, y = AVERAGE_EFFICIENCY)) +
  geom_bar(stat = "identity", fill = "#87CEEB", color = "black", alpha = 0.7) +
  labs(title = "Average Efficiency by team", x = "Team", y = "Average Efficiency") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#2F4F4F"),
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 12, color = "#2F4F4F"),
    axis.text.y = element_text(size = 12, color = "#2F4F4F"),
    axis.title = element_text(size = 14, face = "bold", color = "#2F4F4F")
  )
 
# Display the plot
print(avg_efficiency_by_team_barplot)

Fig.8 Average Efficiency by Team
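The group means behind this chart can also be computed with the formula interface of aggregate(), which keeps the column names without a separate renaming step. The toy data frame below is hypothetical and only stands in for wnba_data.

```r
# Hypothetical player-game rows for two teams
toy <- data.frame(
  TEAM = c("ATL", "ATL", "CHI", "CHI", "CHI"),
  EFFICIENCY = c(10, 14, 6, 9, 12)
)

# Mean efficiency per team; the output columns are named TEAM and EFFICIENCY
avg_by_team <- aggregate(EFFICIENCY ~ TEAM, data = toy, FUN = mean)

# Sort from highest to lowest average for easier reading
avg_by_team <- avg_by_team[order(-avg_by_team$EFFICIENCY), ]
print(avg_by_team)
```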

Outliers and Variability: The presence of outliers, particularly in the Efficiency data, suggests that there are instances that deviate significantly from the average performance.

Potential Applications: These insights could be used for performance evaluation, team strategy development, or identifying areas for improvement. For example, teams with lower efficiency might investigate the causes and seek ways to optimize performance.

Business Questions

Business Question 1

Predict player’s overall efficiency based on performance metrics like points, assists, and rebounds etc. using Linear Regression.

The attributes used in this testing are as follows:

Efficiency: A numerical measure to quantify a player’s overall contribution to a game.
Points: The total points scored by the player in the game.
Assists: The number of assists by the player, that is, the number of times the player passes the ball to a teammate in a way that leads directly to a basket.
Totrb: Total rebounds (offensive + defensive) by the player.

Goal: Predicting Player Efficiency in Basketball
Motivation: Understanding player efficiency in basketball allows teams to make data-driven decisions regarding player selection, training, and in-game strategy. We aim to use statistical methods to pinpoint which performance metrics are the most predictive of a player’s overall efficiency.
Research and Development: The research involves analyzing game-by-game player statistics from the WNBA to determine the relationship between on-court performance (points, assists, and rebounds) and the efficiency metric.
Methodology: Linear regression is used because the outcome, efficiency, is continuous and is modeled from these performance metrics. The method assumes a linear relationship between predictors and outcome, which makes it a logical first choice for this analysis.

# Fitting a linear regression model

model <- lm(wnba_data$EFFICIENCY ~ wnba_data$POINTS + wnba_data$ASSIST + wnba_data$TOTRB, data = wnba_data)
 
# To view the summary of the model

summary(model)

## 
## Call:
## lm(formula = wnba_data$EFFICIENCY ~ wnba_data$POINTS + wnba_data$ASSIST + 
##     wnba_data$TOTRB, data = wnba_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4536  -1.6213   0.2577   1.7434  14.0374 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.646283   0.078411  -21.00   <2e-16 ***
## wnba_data$POINTS  0.781614   0.008159   95.80   <2e-16 ***
## wnba_data$ASSIST  0.594199   0.025362   23.43   <2e-16 ***
## wnba_data$TOTRB   0.902929   0.016329   55.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.916 on 4028 degrees of freedom
## Multiple R-squared:  0.8729, Adjusted R-squared:  0.8728 
## F-statistic:  9221 on 3 and 4028 DF,  p-value: < 2.2e-16

# Plotting residuals vs. fitted values
plot(model$fitted.values, model$residuals,
     xlab = "Fitted values", ylab = "Residuals")
# Add a horizontal reference line at zero (must be called after plot())
abline(h = 0, col = "red")
# Add a caption to the plot
title(sub = "Fig. 8: Residual Plot for linear model")

# QQ plot for normality of residuals
qqnorm(model$residuals)
qqline(model$residuals)
# Add a caption to the plot
title(sub = "Fig. 9: QQ Plot")

Insights:
1. The summary(model) output indicates that all three predictor variables (points, assists, rebounds) are statistically significant predictors of player efficiency, as shown by the extremely low p-value (< 2e-16) on each coefficient.
2. The coefficients reflect how much efficiency is expected to increase with a one-unit increase in each performance metric, holding the other variables constant.
3. The positive values of the coefficients for points (0.781614), assists (0.594199), and rebounds (0.902929) suggest that all these metrics are positively associated with a player’s efficiency.
4. A high Multiple R-squared value of 0.8729 indicates that the model explains about 87% of the variability in efficiency.
5. The residual plot and QQ plot were generated to check the regression assumptions.
6. The Residuals vs. Fitted plot does not show any obvious patterns, suggesting that the linear form is a reasonable fit.
7. The Normal Q-Q plot shows the residuals fairly well aligned with the theoretical quantiles, which is consistent with approximately normal residuals.

The analysis using linear regression has shown that points, assists, and rebounds are significant predictors of a player’s efficiency in basketball. This conclusion helps answer our main business question and provides a basis for teams and coaches to evaluate player performance more effectively.
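To make the coefficients concrete, the sketch below applies the fitted equation by hand to one hypothetical stat line (10 points, 3 assists, 5 rebounds). The coefficient values are copied from the model summary above. The helper predict_efficiency() is an illustrative name, not part of the original code.

```r
# Coefficients copied from the summary(model) output
coefs <- c(intercept = -1.646283, points = 0.781614,
           assist = 0.594199, totrb = 0.902929)

# Linear predictor: intercept plus each metric times its coefficient
predict_efficiency <- function(points, assist, totrb) {
  unname(coefs["intercept"] +
           coefs["points"] * points +
           coefs["assist"] * assist +
           coefs["totrb"] * totrb)
}

# Hypothetical game: 10 points, 3 assists, 5 rebounds
pred <- predict_efficiency(points = 10, assist = 3, totrb = 5)
print(round(pred, 3))  # 12.467
```

With a residual standard error of 2.916, individual games can still fall a few efficiency points above or below such a prediction.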

Business Question 2

Predict likelihood of a team winning a game based on where the match is held (home or away) along with performance metrics using Logistic Regression.

Goal: Predicting the Influence of Court Location on Game Outcomes
Motivation: The concept of ‘home court advantage’ is frequently cited in basketball, and our goal is to quantify its impact. We want to determine whether a team’s performance metrics at home games are a significant predictor of their likelihood to win, as opposed to their performance in away games.
Research and Development: By analyzing WNBA player data, we investigate the influence of playing at home versus away. We hypothesize that players perform differently depending on the game location, which in turn affects the team’s chances of winning.
Methodology: Logistic regression is employed because the target variable is binary: whether a player’s average efficiency is higher at home (1) or not (0). This method models probabilities directly and is thus well suited to our binary classification problem.

# Data Preprocessing
# Create the target variable
wnba_data <- wnba_data %>%
  group_by(PLAYER_ID) %>%
  mutate(
    avg_home_efficiency = mean(EFFICIENCY[HOME == 1], na.rm = TRUE),
    avg_away_efficiency = mean(EFFICIENCY[HOME == 0], na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(higher_home_efficiency = ifelse(avg_home_efficiency > avg_away_efficiency, 1, 0))

# Remove rows with NA in target variable
wnba_data <- wnba_data %>% filter(!is.na(higher_home_efficiency))

# wnba_data$higher_home_efficiency[is.na(wnba_data$higher_home_efficiency)] <- some_imputation_method

# Select relevant features
features <- wnba_data %>%
  select(PLAYER_ID, higher_home_efficiency, MINUTES, TOTRB, ASSIST, STEAL, BLOCK, TURNOVER, FOULS, POINTS)

# Split the data into training and test sets
set.seed(123)
training_indices <- createDataPartition(features$higher_home_efficiency, p = 0.8, list = FALSE)
training_data <- features[training_indices, ]
test_data <- features[-training_indices, ]

# Logistic Regression Model
model <- glm(higher_home_efficiency ~ ., data = training_data, family = binomial())
print(model)

## 
## Call:  glm(formula = higher_home_efficiency ~ ., family = binomial(), 
##     data = training_data)
## 
## Coefficients:
## (Intercept)    PLAYER_ID      MINUTES        TOTRB       ASSIST        STEAL  
##   -1.074231     0.005370     0.024711     0.053854    -0.045638     0.052315  
##       BLOCK     TURNOVER        FOULS       POINTS  
##    0.131767     0.022404     0.090224    -0.007681  
## 
## Degrees of Freedom: 3215 Total (i.e. Null);  3206 Residual
## Null Deviance:       4450 
## Residual Deviance: 4335  AIC: 4355
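Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for a few of the coefficients printed above. It is a reading aid under the assumption that the printed values are used as-is, and it says nothing about their statistical significance, which print(model) does not report.

```r
# A few coefficients copied from the printed model (log-odds scale)
log_odds <- c(BLOCK = 0.131767, FOULS = 0.090224,
              ASSIST = -0.045638, POINTS = -0.007681)

# exp() converts a log-odds change into a multiplicative change in odds
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))

# plogis() maps a log-odds value to a probability, shown here for the intercept
baseline_prob <- plogis(-1.074231)
print(round(baseline_prob, 3))
```

An odds ratio above 1 (for example BLOCK) means higher odds of the target being 1 per additional unit, and a value below 1 (for example ASSIST) means lower odds.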

# Model Evaluation
predictions <- predict(model, test_data, type = "response")
predicted_class <- ifelse(predictions > 0.5, 1, 0)
conf_matrix <- confusionMatrix(as.factor(predicted_class), as.factor(test_data$higher_home_efficiency))

accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
sensitivity <- recall
specificity <- conf_matrix$byClass["Specificity"]

cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.5559701

cat("Precision:", precision, "\n")

## Precision: 0.5168539

cat("Sensitivity (Recall):", sensitivity, "\n")

## Sensitivity (Recall): 0.498645

cat("Specificity:", specificity, "\n")

## Specificity: 0.6045977
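The four metrics above all derive from one 2 × 2 confusion matrix. The report does not print the cell counts, so the counts below are an assumption: they were back-calculated so that they reproduce the reported metrics, and they should be checked against conf_matrix$table. By default, caret treats the first factor level ("0") as the positive class.

```r
# Cell counts inferred from the reported metrics (assumption, not printed output)
tp <- 184  # predicted 0, actual 0 (positive class is "0")
fn <- 185  # predicted 1, actual 0
fp <- 172  # predicted 0, actual 1
tn <- 263  # predicted 1, actual 1
n <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
precision   <- tp / (tp + fp)   # Pos Pred Value
sensitivity <- tp / (tp + fn)   # recall for the positive class
specificity <- tn / (tn + fp)

# No Information Rate: accuracy of always predicting the larger class
no_information_rate <- max(tp + fn, fp + tn) / n

print(round(c(accuracy = accuracy, precision = precision,
              sensitivity = sensitivity, specificity = specificity,
              NIR = no_information_rate), 4))
```

Under these inferred counts the No Information Rate is about 0.541, so an accuracy of 0.556 is only marginally better than always predicting the majority class.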

# Plotting the predicted probabilities
pred_df <- data.frame(actual = test_data$higher_home_efficiency, predicted_prob = predictions)
ggplot(pred_df, aes(x = predicted_prob, fill = as.factor(actual))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(x = "Predicted Probability", y = "Count", fill = "Actual Value") +
  theme_minimal()

Fig. 10: Distribution of predicted probabilities from the logistic regression, by actual class

Insights:
1. The logistic regression model was built with the target variable higher_home_efficiency and predictors including minutes played, total rebounds, assists, steals, blocks, turnovers, fouls, and points scored. Because the formula is higher_home_efficiency ~ ., PLAYER_ID also entered the model as a numeric predictor, although an identifier carries no performance information.
2. The confusion matrix shows the model’s accuracy at 55.6%, which is slightly above the No Information Rate, indicating the model does better than always predicting the majority class but not by a substantial margin.
3. Sensitivity (0.499) and specificity (0.605) show that the model identifies only about half of the cases in one class and about 60% in the other, so neither class is predicted well.
4. The analysis shows that the model has limited predictive power: per-game performance metrics only weakly predict whether a player’s average efficiency is higher at home. This is at most weak evidence of a home-court effect on player performance, and it does not by itself show an effect on winning.

Business Question 3

Test whether there is a significant difference in three-point shooting accuracy between home games and away games.

Attributes used in this testing are as follows:

made3: The number of three-pointers made.

att3: The number of three-point attempts.

Goal: Examining Three-Point Shooting Accuracy Based on Game Location
Motivation: The environment in which a basketball game is played, home or away, may have an impact on a player’s performance. Specifically, we are interested in whether it affects the accuracy of three-point shots. This analysis matters because it can inform coaching strategies and team preparations.
Research and Development: We utilized historical WNBA game data to compare the three-point shooting accuracy of players in home versus away games, testing the common belief that players shoot better at home.
Methodology: A two-sample t-test is appropriate for comparing the means of two independent groups, here the shooting accuracy at home and away games. The null hypothesis is that there is no difference in mean shooting accuracy based on game location; the alternative is that there is a difference.
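R’s t.test() runs Welch’s version of the two-sample test by default, which does not assume equal variances. The sketch below computes the Welch statistic and degrees of freedom by hand on simulated data and compares them with t.test(). The home and away vectors are simulated with made-up parameters and are not the WNBA data.

```r
set.seed(42)
# Simulated shooting accuracies (hypothetical means and spreads)
home <- rnorm(40, mean = 0.31, sd = 0.20)
away <- rnorm(35, mean = 0.29, sd = 0.25)

# Per-group variance of the mean
v_home <- var(home) / length(home)
v_away <- var(away) / length(away)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(home) - mean(away)) / sqrt(v_home + v_away)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v_home + v_away)^2 /
  (v_home^2 / (length(home) - 1) + v_away^2 / (length(away) - 1))

res <- t.test(home, away)  # Welch test is the default (var.equal = FALSE)
print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(res$parameter)))
```

The fractional degrees of freedom in the output of t.test() come from this approximation.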

wnba_data$three_point_accuracy <- wnba_data$MADE3 / wnba_data$ATT3

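# Handle division by zero: 0/0 gives NaN and x/0 gives Inf, so both are treated as missing
wnba_data$three_point_accuracy[is.infinite(wnba_data$three_point_accuracy)] <- NA
# Note: na.omit() drops every player-game without a three-point attempt from wnba_data itself,
# so all later code that uses wnba_data runs on this reduced data set
wnba_data <- na.omit(wnba_data)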
 
# Subset the data into home and away games
home_games_data <- subset(wnba_data, HOME == 1)
away_games_data <- subset(wnba_data, HOME == 0)
 
# Perform a two-sample t-test on three-point shooting accuracy
t_test_result <- t.test(home_games_data$three_point_accuracy, away_games_data$three_point_accuracy)
 
# Print the result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  home_games_data$three_point_accuracy and away_games_data$three_point_accuracy
## t = 0.69208, df = 1996.8, p-value = 0.489
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01824071  0.03813569
## sample estimates:
## mean of x mean of y 
## 0.2969697 0.2870222

# a. T-test already performed above
 
# b. State the Hypotheses
# Null Hypothesis (H0): No significant difference in three-point shooting accuracy between home and away games.
# Alternative Hypothesis (H1): Significant difference in three-point shooting accuracy between home and away games.
 
# c. Find the Critical Value (for a two-tailed test at alpha = 0.05)
alpha <- 0.05
df <- t_test_result$parameter
critical_value <- qt(alpha/2, df, lower.tail = FALSE)
 
# d. Make the Decision
p_value <- t_test_result$p.value
decision <- ifelse(p_value <= alpha, "Reject the Null Hypothesis", "Fail to Reject the Null Hypothesis")
 
# e. Summarize the Results
summary <- paste("T-Test Result:", decision,
                 "\nCritical Value at alpha =", alpha, "is", critical_value,
                 "\nP-Value of the test is", p_value)
print(summary)

## [1] "T-Test Result: Fail to Reject the Null Hypothesis \nCritical Value at alpha = 0.05 is 1.9611527254421 \nP-Value of the test is 0.48896629326814"

Insights:
1. The t-test result shows a p-value of 0.489, which is above the typical alpha level of 0.05, suggesting that the difference in three-point shooting accuracy between home and away games is not statistically significant.
2. The 95% confidence interval for the difference in means (-0.018 to 0.038) includes zero, further supporting the lack of a statistically significant difference.
3. The means of the two samples are relatively close, with home game accuracy at 29.7% and away game accuracy at 28.7%.
4. The analysis indicates that there is no significant difference in three-point shooting accuracy between home and away games in the WNBA. This finding does not support a ‘home court’ advantage in this aspect and suggests that players maintain a consistent performance in three-point shooting regardless of the game location.
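The decision can be reproduced from the reported test statistic and degrees of freedom alone. The sketch below recomputes the critical value and the two-sided p-value from t = 0.69208 and df = 1996.8, both copied from the output above.

```r
# Values copied from the Welch t-test output
t_stat <- 0.69208
df <- 1996.8
alpha <- 0.05

# Two-tailed critical value and p-value from the t distribution
critical_value <- qt(1 - alpha / 2, df)
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Reject H0 only if |t| exceeds the critical value
reject_null <- abs(t_stat) > critical_value
print(c(critical_value = critical_value, p_value = p_value))
print(reject_null)
```

Both routes give the same answer: |t| is well below the critical value of about 1.96, and the p-value of about 0.489 is far above 0.05.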

Business Question 4

Test whether teams that accumulate more steals have an increased likelihood of winning the game.

Goal: Understanding the Relationship Between Steals and Game Outcomes
Motivation: In basketball, defensive plays like steals can be crucial in turning the tide of a game. Our objective is to determine whether there is a statistical link between the number of steals a team makes and their chances of winning a game.
Research and Development: We focus on the hypothesis that more steals could lead to a higher probability of winning. Our analysis involves comparing the average number of steals in games won to those lost, using historical WNBA game data.
Methodology: The two-sample t-test is an appropriate statistical test for comparing the means of two groups: steals in games won versus games lost. This method will help us test the null hypothesis that there is no difference in the average number of steals between the two types of games.

# Reference: Stack Exchange. (2012, August 31). How to perform two-sample t-tests in R by inputting sample statistics rather than raw data? Cross Validated. https://stats.stackexchange.com/questions/30394/how-to-perform-two-sample-t-tests-in-r-by-inputting-sample-statistics-rather-tha

# Subset the data into games won and games lost
games_won_data <- subset(wnba_data, WIN == 1)
games_lost_data <- subset(wnba_data, WIN == 0)

# Perform a two-sample t-test on the number of steals
t_test_result <- t.test(games_won_data$STEAL, games_lost_data$STEAL)

# Print the result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  games_won_data$STEAL and games_lost_data$STEAL
## t = 2.4684, df = 1970.8, p-value = 0.01366
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02451659 0.21409664
## sample estimates:
## mean of x mean of y 
## 0.9479060 0.8285994

# a. T-test already performed above

# b. State the Hypotheses
# Null Hypothesis (H0): No significant difference in the average number of steals between games won and lost.
# Alternative Hypothesis (H1): Significant difference in the average number of steals between games won and lost.

# c. Find the Critical Value (for a two-tailed test at alpha = 0.05)
alpha <- 0.05
df <- t_test_result$parameter
critical_value <- qt(alpha/2, df, lower.tail = FALSE)

# d. Make the Decision
p_value <- t_test_result$p.value
decision <- ifelse(p_value <= alpha, "Reject the Null Hypothesis", "Fail to Reject the Null Hypothesis")

# e. Summarize the Results
summary <- paste("T-Test Result:", decision,
                 "\nCritical Value at alpha =", alpha, "is", critical_value,
                 "\nP-Value of the test is", p_value)
print(summary)

## [1] "T-Test Result: Reject the Null Hypothesis \nCritical Value at alpha = 0.05 is 1.96116845168867 \nP-Value of the test is 0.013655901329085"

Insights:
1. The results from the t-test provide a p-value of 0.01366, indicating that the difference in the number of steals between games won and games lost is statistically significant (p < 0.05).
2. The confidence interval for the difference (0.02451659 to 0.21409664) lies entirely above zero, which suggests that, on average, more steals are recorded in games won than in games lost.
3. The mean values show that the average number of steals per player-game record is higher in games won (0.9479060) than in games lost (0.8285994).

The data supports the alternative hypothesis that the average number of steals differs between games won and games lost, with more steals in wins. This is consistent with, though it does not by itself prove, the idea that steals improve a team’s chances of winning, and it can inform team strategies that emphasize defensive play.
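The size of the effect is easier to judge from the confidence interval than from the p-value. The sketch below rebuilds the 95% interval from the reported group means, t statistic, and degrees of freedom, all copied from the Welch t-test output above, to show how the pieces fit together.

```r
# Values copied from the Welch t-test output for steals
mean_won <- 0.9479060
mean_lost <- 0.8285994
t_stat <- 2.4684
df <- 1970.8

# Difference in means and the standard error implied by the t statistic
diff_means <- mean_won - mean_lost
std_error <- diff_means / t_stat

# 95% confidence interval: difference plus or minus critical value times SE
margin <- qt(0.975, df) * std_error
ci <- c(lower = diff_means - margin, upper = diff_means + margin)
print(round(ci, 4))
```

The interval matches the reported one up to rounding of the t statistic. It corresponds to roughly 0.02 to 0.21 extra steals per player-game record, so the difference is statistically detectable but small in absolute terms.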

Conclusion

Our study looked at basketball statistics to understand what makes teams and players do well. We used different math methods to look at players’ scores, assists, rebounds, and other things like how often teams win at home and how steals affect winning.

We learned that scoring points, helping teammates score, and grabbing rebounds are all important for a player to be considered efficient. This information can help coaches figure out which skills to focus on during training.

When we looked at whether playing at home matters, our model found only a weak signal, so being at home might not be as big of a deal as we thought. Three-point shooting accuracy also did not differ significantly between home and away games.

We also saw that making more steals can mean a better chance of winning. This shows that good defense can be just as important as scoring.

In short, our research tells us that looking at the numbers can really help understand basketball better. It can help teams decide what to work on and how to play the game. Our findings give useful tips for teams and show that using data is a smart way to make sports decisions.

References

1. Bluman, A. G. (2022). Elementary Statistics, A Step by Step Approach: A Brief Version, 10th Edition. McGraw Hill Publishing.
2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. URL: https://link.springer.com/chapter/10.1007/978-1-4614-7138-7_2

Appendix

WNBA 2014 Player Stats- Data Key

The dataset contains 26 variables. Below is the data key for each variable which will help explain the project and dataset better.

Name	Description
player	The name of the basketball player.
player_id	A numerical identifier for each player.
team	The team for which the player is playing.
date	The date of the game.
home	A binary indicator (0 or 1) representing whether the game was played at home (1) or away (0).
opponent	The name of the opposing team.
win	A binary indicator (1 or 0) representing whether the player’s team won (1) or lost (0) the game.
team_pts	The total points scored by the player’s team in the game.
opp_pts	The total points scored by the opposing team.
minutes	The number of minutes the player participated in the game.
fgmade	The number of field goals made by the player.
fgatt	The number of field goal attempts by the player.
made3	The number of three-pointers made.
att3	The number of three-point attempts.
made1	The number of free throws made.
att1	The number of free throw attempts.
offrb	Offensive rebounds by the player.
defrb	Defensive rebounds by the player.
totrb	Total rebounds (offensive + defensive) by the player.
assist	The number of assists by the player.
steal	The number of steals by the player.
block	The number of blocks by the player.
turnover	The number of turnovers committed by the player.
fouls	The number of fouls committed by the player.
points	The total points scored by the player in the game.
efficiency	A numerical measure of a player’s overall efficiency in the game.