DATA 621 : Data Mining[32164] - 2: HW1

Author: Rupendra Shrestha, Bikash Bhowmik, Roman Anthony, Chen Haoming, Melukkaran Jerald | March 1, 2026

Instructions

OPTION 1

Instructions for HW #1 are included in the attached PDF file. This assignment is due by 03/02/2025, 11.59 PM EST. Please submit your report as a PDF file.

OPTION 2

Define a concept that you have learned from this week’s readings and videos. Compare (i.e., identify similarities) and contrast (i.e., identify differences) this concept with another that you have learned throughout the course. Use R to provide a real-world example of how you would approach solving a problem using the concept that you have learned. Please submit your report as a PDF file.

Introduction

Data mining focuses on discovering meaningful patterns and relationships from large datasets using statistical and computational techniques. In this assignment, we explore a real-world baseball dataset to understand how team performance metrics influence the number of games won. The analysis begins with data exploration to identify trends, variability, and potential data quality issues. We then apply feature engineering and data cleaning methods to prepare the data for modeling. A linear regression model is developed to quantify the relationship between predictors and team wins. Finally, model results are evaluated to assess performance and interpret key drivers of success.

The analysis follows a structured data mining workflow. We begin with exploratory data analysis to understand variable distributions, detect potential anomalies and examine relationships among predictors. Next, data preprocessing and feature engineering techniques are applied to address missing values and improve model stability. A regression model is then trained using the provided training data, and its predictive performance is evaluated using a holdout sample. Finally, results are interpreted to identify which team performance metrics most strongly influence wins and to assess the model’s practical and statistical validity.

This study demonstrates how statistical modeling techniques can be applied to real-world sports data to quantify performance drivers and generate predictions.

Study Objectives

The primary objective of this study is to analyze the relationship between baseball performance statistics and team wins, and to identify the variables that significantly influence that outcome. To that end, the assignment applies data preprocessing and feature engineering techniques to improve model reliability, then develops and evaluates a linear regression model using a structured data mining workflow.

A second objective is to conduct a thorough exploratory data analysis to understand the structure, distribution, and relationships within the dataset. This includes identifying missing values, detecting potential outliers, and examining correlations among predictors to ensure appropriate modeling decisions.

Finally, the study aims to evaluate model performance using appropriate statistical metrics and diagnostic tools. By interpreting coefficient estimates and assessing prediction accuracy, the analysis seeks to provide meaningful insights into the statistical drivers of team wins.

Required Packages

This analysis uses R packages for data manipulation, visualization, and modeling. Key libraries include tidyverse for data handling, tidymodels for machine learning workflows, and supporting packages for exploratory analysis and model interpretation.


library(tidyverse)
library(reshape2)
library(scales)
library(GGally)
library(tidymodels)
library(vip)



Load Data

The training and evaluation datasets are loaded from CSV files using efficient data import functions in R. These datasets contain historical baseball performance statistics used for model development and prediction.

test <- read_csv("moneyball-evaluation-data.csv")
training <- read_csv("moneyball-training-data.csv")

Data Exploration

The Moneyball training dataset includes 2,276 observations and 17 variables: an INDEX identifier, the response TARGET_WINS, and 15 predictors covering batting (TEAM_BATTING_*), baserunning (TEAM_BASERUN_*), pitching (TEAM_PITCHING_*), and fielding (TEAM_FIELDING_*). The goal of the analysis is to understand how these factors relate to the total number of games a team wins during a season.

To explore the data, summary statistics were generated to review the mean, median, quartiles, and range for each variable. Correlation and pairs plots were used to examine relationships among variables, and histograms were created to assess the distribution and normality of key features. Both the training and evaluation datasets were reviewed to confirm consistency in structure and variable types.

The correlation analysis shows that wins are positively associated with most batting statistics, except for triples, which show a slight negative relationship. One possible explanation is that while triples contribute to offense, they may reflect missed opportunities for home runs, which have a stronger impact on scoring. However, correlation alone does not confirm this and would require further analysis.

Stolen bases show little relationship with wins, possibly because they occur less frequently and may not significantly influence overall outcomes. Interestingly, some pitching variables such as TEAM_PITCHING_H, TEAM_PITCHING_BB, and TEAM_PITCHING_HR display a positive correlation with wins, even though each records production allowed to opponents. This may suggest that offensive production plays a larger role in determining success than limiting opponent performance, although it could also reflect collinearity with the corresponding batting statistics. In contrast, TEAM_PITCHING_SO and TEAM_FIELDING_DP show a negative relationship with wins, which is consistent with the idea that generating runs may matter more than purely defensive measures.
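
The correlation screen described above can be sketched as follows. This is a minimal illustration, not the code used to produce the report's correlation plots: the helper `cor_with_target()` and the `toy` data frame are hypothetical, and the toy values are made up so the block runs without the CSV files.

```r
# Toy stand-in for the training data: values are made up for illustration only.
toy <- data.frame(
  TARGET_WINS     = c(70, 82, 91, 65, 88, 77),
  TEAM_BATTING_H  = c(1380, 1450, 1540, 1350, 1500, 1420),
  TEAM_BATTING_HR = c(40, 100, 150, 35, 130, 90),
  TEAM_FIELDING_E = c(260, 160, 120, 300, 130, 180)
)

# Correlation of every predictor with the response, strongest first.
# use = "pairwise.complete.obs" keeps rows where both columns are observed,
# which matters for the real data because several columns contain NAs.
cor_with_target <- function(df, target = "TARGET_WINS") {
  predictors <- setdiff(names(df), target)
  r <- vapply(
    predictors,
    function(v) cor(df[[v]], df[[target]], use = "pairwise.complete.obs"),
    numeric(1)
  )
  r[order(abs(r), decreasing = TRUE)]
}

wins_cor <- cor_with_target(toy)
print(round(wins_cor, 2))
```

On the real data the same helper could be called as `cor_with_target(training[, names(training) != "INDEX"])` before imputation, so that the reported signs are not affected by filled-in values.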

Overall, the exploratory analysis confirms that the dataset contains meaningful variation and logical relationships between team performance metrics and season wins, supporting the development of multiple linear regression models.

# skim() gives a compact per-variable summary (missingness, quantiles, inline histograms)
skimr::skim(training)
Data summary
Name training
Number of rows 2276
Number of columns 17
_______________________
Column type frequency:
numeric 17
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 1268.46 736.35 1 630.75 1270.5 1915.50 2535 ▇▇▇▇▇
TARGET_WINS 0 1.00 80.79 15.75 0 71.00 82.0 92.00 146 ▁▁▇▅▁
TEAM_BATTING_H 0 1.00 1469.27 144.59 891 1383.00 1454.0 1537.25 2554 ▁▇▂▁▁
TEAM_BATTING_2B 0 1.00 241.25 46.80 69 208.00 238.0 273.00 458 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.25 27.94 0 34.00 47.0 72.00 223 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 99.61 60.55 0 42.00 102.0 147.00 264 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 501.56 122.67 0 451.00 512.0 580.00 878 ▁▁▇▇▁
TEAM_BATTING_SO 102 0.96 735.61 248.53 0 548.00 750.0 930.00 1399 ▁▆▇▇▁
TEAM_BASERUN_SB 131 0.94 124.76 87.79 0 66.00 101.0 156.00 697 ▇▃▁▁▁
TEAM_BASERUN_CS 772 0.66 52.80 22.96 0 38.00 49.0 62.00 201 ▃▇▁▁▁
TEAM_BATTING_HBP 2085 0.08 59.36 12.97 29 50.50 58.0 67.00 95 ▂▇▇▅▁
TEAM_PITCHING_H 0 1.00 1779.21 1406.84 1137 1419.00 1518.0 1682.50 30132 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 105.70 61.30 0 50.00 107.0 150.00 343 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 553.01 166.36 0 476.00 536.5 611.00 3645 ▇▁▁▁▁
TEAM_PITCHING_SO 102 0.96 817.73 553.09 0 615.00 813.5 968.00 19278 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 246.48 227.77 65 127.00 159.0 249.25 1898 ▇▁▁▁▁
TEAM_FIELDING_DP 286 0.87 146.39 26.23 52 131.00 149.0 164.00 228 ▁▂▇▆▁
skimr::skim(test)
Data summary
Name test
Number of rows 259
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 1263.93 693.29 9 708.0 1249.0 1832.50 2525 ▆▆▆▇▅
TEAM_BATTING_H 0 1.00 1469.39 150.66 819 1387.0 1455.0 1548.00 2170 ▁▂▇▁▁
TEAM_BATTING_2B 0 1.00 241.32 49.52 44 210.0 239.0 278.50 376 ▁▂▇▇▂
TEAM_BATTING_3B 0 1.00 55.91 27.14 14 35.0 52.0 72.00 155 ▇▇▃▁▁
TEAM_BATTING_HR 0 1.00 95.63 56.33 0 44.5 101.0 135.50 242 ▆▅▇▃▁
TEAM_BATTING_BB 0 1.00 498.96 120.59 15 436.5 509.0 565.50 792 ▁▁▅▇▁
TEAM_BATTING_SO 18 0.93 709.34 243.11 0 545.0 686.0 912.00 1268 ▁▃▇▇▂
TEAM_BASERUN_SB 13 0.95 123.70 93.39 0 59.0 92.0 151.75 580 ▇▃▁▁▁
TEAM_BASERUN_CS 87 0.66 52.32 23.10 0 38.0 49.5 63.00 154 ▂▇▃▁▁
TEAM_BATTING_HBP 240 0.07 62.37 12.71 42 53.5 62.0 67.50 96 ▃▇▅▁▁
TEAM_PITCHING_H 0 1.00 1813.46 1662.91 1155 1426.5 1515.0 1681.00 22768 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 102.15 57.65 0 52.0 104.0 142.50 336 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 552.42 172.95 136 471.0 526.0 606.50 2008 ▆▇▁▁▁
TEAM_PITCHING_SO 18 0.93 799.67 634.31 0 613.0 745.0 938.00 9963 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 249.75 230.90 73 131.0 163.0 252.00 1568 ▇▁▁▁▁
TEAM_FIELDING_DP 31 0.88 146.06 25.88 69 131.0 148.0 164.00 204 ▁▂▇▇▂
training_long <- training %>%
  select(-INDEX) %>%
  melt()
training_long %>%
  filter(complete.cases(.)) %>%
  ggplot(aes(x= variable, y=value)) +
  geom_boxplot(fill="#FF5733") +
  scale_y_log10(labels = label_comma()) +
  coord_flip() +
  theme_minimal() +
  labs(y="Statistic's Value", x="Statistic",
       title="Distribution of Important Baseball Statistics")

The boxplot shows the distribution of key baseball statistics across all teams in the training dataset. Most variables are right-skewed, with several extreme values, particularly in offensive metrics like home runs and walks, as well as pitching strikeouts. Using a log scale helps visualize both typical and extreme values clearly. This indicates that while most teams cluster around typical performance levels, a few teams achieved unusually high or low values, which may influence regression modeling and suggests potential benefits of transformations or normalization during data preparation.

The next plot shows separate boxplots for each baseball statistic using facet_wrap, allowing each variable to have its own scale. This makes it easier to compare the spread, median, and outliers for individual metrics without being affected by differences in magnitude across variables.

# Another way to see the boxplots: one panel per statistic, each with free axis scales
p <- ggplot(training_long, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")

The faceted boxplots reveal that several variables, such as home runs and strikeouts, have wider ranges and more extreme values compared to others like caught stealing or double plays. This highlights variability in team performance across different metrics and confirms the need for data transformations or normalization to ensure stable regression estimates.

This plot shows density distributions for each baseball statistic using facet_wrap, with independent scales for each variable. It visualizes how the values of each metric are distributed across teams, highlighting skewness, concentration, and spread.

# Plot densities, one panel per statistic with free axis scales
p <- ggplot(training_long, aes(value))
p + geom_density() + facet_wrap(~variable, scales = "free")

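The density plots, together with the summary statistics above, indicate that several variables are strongly right-skewed, particularly TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO, and TEAM_BASERUN_SB, whose maxima lie far above their 75th percentiles. Others, such as TARGET_WINS and TEAM_FIELDING_DP, are closer to symmetric. These patterns suggest that some variables may benefit from transformations, such as Box-Cox, followed by normalization, to improve model stability and better satisfy regression assumptions.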
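
Skewness can also be checked numerically rather than read off the plots. The sketch below assumes a simple moment-based definition (third standardized moment, no bias correction). The `skewness()` helper is hypothetical and written in base R, and the two example vectors are made up rather than taken from the Moneyball data.

```r
# Sample skewness: third central moment divided by the 1.5 power of the
# second central moment (population form, no small-sample correction).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up illustrative vectors, not values from the Moneyball data.
symmetric_like <- c(130, 140, 145, 150, 155, 160, 170)
right_skewed   <- c(120, 130, 140, 150, 160, 400, 1800)

skew_sym   <- skewness(symmetric_like)  # symmetric around 150, so close to 0
skew_right <- skewness(right_skewed)    # long right tail, so clearly positive

print(c(symmetric = skew_sym, right_skewed = skew_right))
```

Applied to the report's data, `sapply(training, skewness)` would give one value per column, which makes it easier to decide which predictors actually need a transformation.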

Data Preparation and Feature Engineering

Data preparation focuses on handling missing values to ensure consistency across all variables. Median imputation is applied to numeric features to reduce the influence of extreme values. Irrelevant identifiers are removed to prevent noise in the modeling process. Feature transformations and normalization are performed to improve model stability and performance. These steps help create a clean and structured dataset suitable for regression analysis.

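# Replace each NA with the median of its column.
# across(everything(), ...) is the current dplyr replacement for mutate_all().
training <- training %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

test <- test %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))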


# Let's check if any NA left
training %>% 
  summarise_all(~sum(is.na(.))) %>%
  t()
                 [,1]
INDEX               0
TARGET_WINS         0
TEAM_BATTING_H      0
TEAM_BATTING_2B     0
TEAM_BATTING_3B     0
TEAM_BATTING_HR     0
TEAM_BATTING_BB     0
TEAM_BATTING_SO     0
TEAM_BASERUN_SB     0
TEAM_BASERUN_CS     0
TEAM_BATTING_HBP    0
TEAM_PITCHING_H     0
TEAM_PITCHING_HR    0
TEAM_PITCHING_BB    0
TEAM_PITCHING_SO    0
TEAM_FIELDING_E     0
TEAM_FIELDING_DP    0
# Another way to check for NAs
colSums(is.na(training))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
               0                0                0                0 
 TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
               0                0                0                0 
 TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
               0                0                0                0 
TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
               0                0                0                0 
TEAM_FIELDING_DP 
               0 
# Run the same check on the evaluation (test) data
test %>% 
  summarise_all(~sum(is.na(.))) %>%
  t()
                 [,1]
INDEX               0
TEAM_BATTING_H      0
TEAM_BATTING_2B     0
TEAM_BATTING_3B     0
TEAM_BATTING_HR     0
TEAM_BATTING_BB     0
TEAM_BATTING_SO     0
TEAM_BASERUN_SB     0
TEAM_BASERUN_CS     0
TEAM_BATTING_HBP    0
TEAM_PITCHING_H     0
TEAM_PITCHING_HR    0
TEAM_PITCHING_BB    0
TEAM_PITCHING_SO    0
TEAM_FIELDING_E     0
TEAM_FIELDING_DP    0
# Remove INDEX: it is a row identifier, not a predictor
training <- training %>%
  dplyr::select(-INDEX)

test <- test %>%
  dplyr::select(-INDEX)

After applying median imputation, all missing values in both the training and test datasets were replaced, leaving complete observations for every variable. Multiple checks confirmed that no NA values remain. Note that TEAM_BATTING_HBP was missing in 2,085 of 2,276 training rows (complete rate 0.08), so after imputation it is nearly constant and carries little information. The INDEX column was removed because it is a row identifier with no predictive content. The datasets are now complete and ready for feature transformations and model development.
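
One design choice worth noting: the code above imputes the evaluation data with its own medians. A common alternative, sketched here as an assumption rather than as the method used in this report, is to learn the medians from the training data only and reuse them on any new data. The `impute_with()` helper and both toy data frames are hypothetical, with made-up values.

```r
# Toy stand-ins for the training and evaluation sets; values are made up.
train_toy <- data.frame(
  TEAM_BASERUN_SB  = c(60, 100, NA, 160, 700),
  TEAM_FIELDING_DP = c(130, NA, 150, 165, 140)
)
eval_toy <- data.frame(
  TEAM_BASERUN_SB  = c(NA, 90),
  TEAM_FIELDING_DP = c(148, NA)
)

# Learn one median per column from the training data only.
train_medians <- vapply(train_toy, median, numeric(1), na.rm = TRUE)

# Fill NAs in any data frame using the stored training medians.
impute_with <- function(df, medians) {
  for (col in names(medians)) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- medians[[col]]
  }
  df
}

train_filled <- impute_with(train_toy, train_medians)
eval_filled  <- impute_with(eval_toy, train_medians)
print(eval_filled)
```

Within tidymodels, the recipes step `step_impute_median()` follows the same idea by estimating medians when the recipe is prepped on the training data.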

Model Development

A linear regression model is developed to predict the number of team wins based on performance statistics. The dataset is split into training and testing subsets to evaluate model generalization. A pre-processing recipe is applied to normalize predictors and address distributional issues. The model is trained using a structured workflow to ensure reproducibility and consistency.

A pre-processing workflow using tidymodels was applied to the training data:

  • Box-Cox transformation

  • Normalization

The regression model was trained within a tidymodels workflow to ensure reproducibility and consistent application of pre-processing steps. Performance metrics were calculated on the 10% holdout split of the training data, and predictions were then generated for the separate evaluation dataset, which has no TARGET_WINS column.
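
For reference, the two pre-processing steps listed above can be written out as follows. These are the standard textbook definitions; the symbols x, λ, and z are notation introduced here, not taken from the report.

```latex
% Box-Cox transformation of a strictly positive predictor x with parameter \lambda
\[
x^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \log x, & \lambda = 0,
\end{cases}
\qquad x > 0
\]

% Normalization: center and scale the transformed predictor using the
% mean and standard deviation estimated on the training data
\[
z = \frac{x^{(\lambda)} - \bar{x}^{(\lambda)}}{s_{x^{(\lambda)}}}
\]
```

Because the Box-Cox transformation is defined only for x > 0, and the summary statistics show a minimum of 0 for several predictors (for example TEAM_BATTING_HR and TEAM_BATTING_BB), it is worth checking whether the step was actually applied to those columns.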

# Create a split object TEST and TRAINING
set.seed(42)
my_split <- initial_split(training, prop = 0.9)

# Build training data set
my_training <- my_split %>% training()

# Build testing data set
my_test <- my_split %>% testing()
# Tidymodels recipe to preprocess data.
# Note: Box-Cox is defined only for strictly positive values.
recipe1 <- recipe(TARGET_WINS ~ ., data = my_training) %>%
  step_BoxCox(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Define the linear regression model
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Combine model and recipe in a workflow
workflow1 <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(recipe1)

# Fit on the training portion of the split and evaluate on the held-out portion
my_fit <- workflow1 %>%
  last_fit(my_split)

# Obtain performance metrics on test data
my_fit %>% collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config        
  <chr>   <chr>          <dbl> <chr>          
1 rmse    standard      13.2   pre0_mod0_post0
2 rsq     standard       0.245 pre0_mod0_post0
# Predict on the evaluation data with the fitted workflow
my_fit %>%
  extract_workflow() %>%
  predict(test)
# A tibble: 259 × 1
   .pred
   <dbl>
 1  62.4
 2  65.6
 3  75.3
 4  82.8
 5  66.3
 6  67.1
 7  78.2
 8  74.0
 9  70.8
10  74.1
# ℹ 249 more rows

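On the 10% holdout set the model reaches an RMSE of 13.2 wins and an R² of 0.245, so it explains roughly a quarter of the variance in season wins. For comparison, the standard deviation of TARGET_WINS in the training data is 15.75, which means the model improves only modestly on predicting the mean. Pre-processing addressed skewness and scaling issues, and the resulting predictions give a baseline for evaluating team performance and examining the factors associated with wins.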

Results and Interpretation

Model predictions are compared against actual team wins to evaluate overall performance. Visualizations are used to assess the agreement between predicted and observed values. Performance metrics provide insight into the accuracy and reliability of the model. Variable importance analysis highlights key factors influencing team success. These results support meaningful interpretation of the model’s findings and limitations.

# Obtain test set predictions data frame
results <- my_fit %>% 
                 collect_predictions()
# View results
results
# A tibble: 228 × 5
   .pred id               TARGET_WINS  .row .config        
   <dbl> <chr>                  <dbl> <int> <chr>          
 1  67.4 train/test split          82     5 pre0_mod0_post0
 2  65.9 train/test split          80     7 pre0_mod0_post0
 3  82.6 train/test split          82    38 pre0_mod0_post0
 4  88.5 train/test split          85    46 pre0_mod0_post0
 5  71.0 train/test split         107    60 pre0_mod0_post0
 6  58.9 train/test split          53    80 pre0_mod0_post0
 7  70.4 train/test split          63    81 pre0_mod0_post0
 8  76.7 train/test split          57    96 pre0_mod0_post0
 9  76.6 train/test split          67   104 pre0_mod0_post0
10  81.3 train/test split          86   117 pre0_mod0_post0
# ℹ 218 more rows
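
To make the two reported metrics concrete, the sketch below recomputes them by hand from the ten holdout rows printed above. It uses only those ten rows, so the values will not match the full-holdout figures of 13.2 and 0.245. The R² here is the squared correlation between predictions and actual values, which is how yardstick's `rsq` metric is defined.

```r
# First ten holdout rows as printed above (.pred and TARGET_WINS).
pred   <- c(67.4, 65.9, 82.6, 88.5, 71.0, 58.9, 70.4, 76.7, 76.6, 81.3)
actual <- c(82, 80, 82, 85, 107, 53, 63, 57, 67, 86)

# Root mean squared error: typical size of a prediction error, in wins.
rmse_val <- sqrt(mean((actual - pred)^2))

# R-squared as the squared Pearson correlation of predicted and actual wins.
rsq_val <- cor(actual, pred)^2

print(c(rmse = rmse_val, rsq = rsq_val))
```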

A scatter plot of predicted vs. actual wins shows that predictions follow the general trend in observed wins, although the spread around the diagonal is wide, as expected with an RMSE of 13.2 wins. The diagonal line represents perfect predictions; the closer a point lies to this line, the more accurate the prediction.

# Plot results
ggplot(data = results,
       mapping = aes(x = .pred, y = TARGET_WINS)) +
  geom_point(color = '#006EA1', alpha = 0.25) +
  geom_abline(intercept = 0, slope = 1, color = 'orange') +
  labs(title = 'Linear Regression Results - Holdout Test Set',
       x = 'Predicted Wins',
       y = 'Actual Wins')

training_baked <- recipe1 %>% 
                        prep() %>% 
                        bake(new_data = my_training)

The model was refit on the pre-processed (baked) training data to examine variable importance with vip. For a linear model, vip ranks predictors by the absolute value of the t-statistic by default, so the plot shows the strength of each effect but not its direction; signs must be read from the fitted coefficients. According to this analysis, offensive metrics such as home runs and walks had the largest positive impact, while fielding errors negatively influenced predictions. Pitching strikeouts also contributed positively.

# View results
training_baked
# A tibble: 2,048 × 16
   TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
            <dbl>           <dbl>           <dbl>           <dbl>
 1         -0.707         -1.08            -0.934           0.137
 2          1.51          -0.0587           1.49           -1.13 
 3         -2.56          -1.84            -0.392          -1.48 
 4          0.157          1.01            -0.934           1.93 
 5          0.128         -0.296           -0.536           0.268
 6         -1.04          -1.03            -0.825          -0.357
 7          0.193          0.445           -1.01            0.351
 8          0.997          0.872            0.366           0.433
 9         -0.604          0.0900          -0.536           1.08 
10         -0.884         -0.692           -0.645          -1.10 
# ℹ 2,038 more rows
# ℹ 12 more variables: TEAM_BATTING_BB <dbl>, TEAM_BATTING_SO <dbl>,
#   TEAM_BASERUN_SB <dbl>, TEAM_BASERUN_CS <dbl>, TEAM_BATTING_HBP <dbl>,
#   TEAM_PITCHING_H <dbl>, TEAM_PITCHING_HR <dbl>, TEAM_PITCHING_BB <dbl>,
#   TEAM_PITCHING_SO <dbl>, TEAM_FIELDING_E <dbl>, TEAM_FIELDING_DP <dbl>,
#   TARGET_WINS <dbl>
# Refit the model on the baked (pre-processed) training data
lm_fit <- lm_model %>%
  fit(TARGET_WINS ~ ., data = training_baked)

vip(lm_fit)
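
Since an importance ranking for a linear model is typically based on the size of the t-statistic, a small sketch of how the direction of each effect can be recovered may help. The data below are simulated, with coefficients chosen arbitrarily for illustration. The column names mirror the report only for readability, and the `importance` table is a hypothetical construction rather than vip's own output.

```r
set.seed(42)
n <- 200

# Simulated standardized predictors; not the Moneyball data.
sim <- data.frame(
  TEAM_BATTING_HR = rnorm(n),
  TEAM_FIELDING_E = rnorm(n)
)
# Assumed true effects: +2 wins per SD of home runs, -8 wins per SD of errors.
sim$TARGET_WINS <- 80 + 2 * sim$TEAM_BATTING_HR -
  8 * sim$TEAM_FIELDING_E + rnorm(n, sd = 5)

fit <- lm(TARGET_WINS ~ ., data = sim)

# Coefficient table without the intercept row.
coefs <- summary(fit)$coefficients
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

# Rank by |t| and keep the sign of each estimate alongside it.
importance <- data.frame(
  term  = rownames(coefs),
  abs_t = abs(coefs[, "t value"]),
  sign  = ifelse(coefs[, "Estimate"] > 0, "+", "-"),
  row.names = NULL
)
importance <- importance[order(importance$abs_t, decreasing = TRUE), ]
print(importance)
```

For the fitted parsnip model in this report, the same coefficient table should be available through `summary(extract_fit_engine(lm_fit))`, which returns the underlying `lm` summary.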

Conclusion

This study demonstrated how data mining techniques can be applied to analyze real-world baseball performance data. Through exploratory analysis and data preparation, key issues such as missing values and skewed variable distributions were addressed. A linear regression model was developed to predict team wins using multiple performance metrics. The model achieved modest predictive accuracy on the holdout set (RMSE 13.2 wins, R² 0.245) and highlighted the variables most strongly associated with wins. Visual and quantitative evaluations helped assess model strengths and limitations. While the approach offers useful insights, more advanced models or more careful feature selection could further improve predictive performance. Overall, this analysis highlights the value of structured data mining workflows in transforming raw data into actionable knowledge.