Note to Professor: Assignment 2 is included below for reference because Assignment 3 uses the same cleaned dataset and preprocessing steps. Please feel free to begin reading at Assignment 3: Experimentation & Model Training.

1 Assignment 2: EDA

The first step in exploratory data analysis (EDA) is to load the libraries needed for the different stages of data handling.

1.1 Loading the necessary libraries

The libraries used in this analysis support different stages of the EDA process. The readxl package is used to import the dataset from Excel, while tidyverse provides essential tools for data manipulation and visualization, including ggplot2. The janitor package helps clean and standardize column names, making the data easier to work with. Additionally, skimr is used for quick summary statistics, naniar for identifying and visualizing missing data, GGally for exploring relationships between variables, and scales for improving the readability of plots. Together, these libraries enable efficient data cleaning, exploration, and visualization.
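
In code, this is a single block of library() calls; a minimal sketch matching the packages described above:

library(readxl)     # import the Excel dataset
library(tidyverse)  # dplyr, tidyr, ggplot2, and related tools
library(janitor)    # clean_names() for standardized column names
library(skimr)      # quick summary statistics
library(naniar)     # identify and visualize missing data
library(GGally)     # explore relationships between variables
library(scales)     # improve readability of plot axes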

1.2 Loading the dataset

Of course, EDA requires a dataset, so this step loads the data I obtained from the National Assessment of Educational Progress (NAEP) Data Explorer website. The site provides national and state assessment results in all core subjects assessed. Since I am a middle school teacher and mathematics is one of the subjects I teach, I chose data at both the national and New York State level for 8th grade mathematics, specifically Algebra. The data reports the average scale score for groups of students categorized by race and ethnicity, which makes the groups easy to compare and trends easy to spot.

Before moving on to EDA, let’s check if the dataset loaded correctly:
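
A minimal loading step might look like the following; the file name is a placeholder, not the actual export from the Data Explorer:

# Hypothetical file name; substitute the real NAEP export
NAEP_8Math <- read_excel("NAEP_grade8_math.xlsx")

head(NAEP_8Math, 10)
glimpse(NAEP_8Math)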

## # A tibble: 10 × 4
##    Year  Jurisdiction Race/ethnicity used to report tren…¹ `Average scale score`
##    <chr> <chr>        <chr>                                <chr>                
##  1 2024  National     White                                293.27971245966      
##  2 2024  National     Black                                260.59177267955101   
##  3 2024  National     Hispanic                             265.84186852976899   
##  4 2024  National     Asian/Pacific Islander               315.183504238252     
##  5 2024  National     American Indian/Alaska Native        257.90389371060098   
##  6 2024  National     Two or more races                    286.27011213518199   
##  7 2024  New York     White                                292.82165256179502   
##  8 2024  New York     Black                                266.66589978477498   
##  9 2024  New York     Hispanic                             264.56017274169801   
## 10 2024  New York     Asian/Pacific Islander               314.30916955953899   
## # ℹ abbreviated name: ¹​`Race/ethnicity used to report trends, school-reported`
## Rows: 204
## Columns: 4
## $ Year                                                    <chr> "2024", "2024"…
## $ Jurisdiction                                            <chr> "National", "N…
## $ `Race/ethnicity used to report trends, school-reported` <chr> "White", "Blac…
## $ `Average scale score`                                   <chr> "293.279712459…

The dataset has 204 observations and 4 variables: the year the assessment was administered; the jurisdiction, indicating whether the results are national or New York State; the race and ethnicity of the participating students; and the average scale score.

1.3 Cleaning the Data

This section focuses on preparing the dataset for analysis by cleaning column names, correcting data types, and handling missing values. Proper data cleaning ensures that the dataset is accurate, consistent, and suitable for meaningful analysis.

1.3.1 Clean column names

The column names were simplified to remove spaces and special characters, making them easier to reference and manipulate during analysis. This step improves code readability and reduces potential errors.
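
In code, this is a clean_names() call plus one rename; the long intermediate name below is inferred from the raw column shown earlier:

NAEP_8Math <- NAEP_8Math %>%
  clean_names() %>%  # snake_case all column names
  # shorten the cleaned demographic column to something workable
  rename(race_ethnicity = race_ethnicity_used_to_report_trends_school_reported)

head(NAEP_8Math)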

## # A tibble: 6 × 4
##   year  jurisdiction race_ethnicity                average_scale_score
##   <chr> <chr>        <chr>                         <chr>              
## 1 2024  National     White                         293.27971245966    
## 2 2024  National     Black                         260.59177267955101 
## 3 2024  National     Hispanic                      265.84186852976899 
## 4 2024  National     Asian/Pacific Islander        315.183504238252   
## 5 2024  National     American Indian/Alaska Native 257.90389371060098 
## 6 2024  National     Two or more races             286.27011213518199

1.3.2 Convert data types

glimpse(NAEP_8Math) shows that the variables year and average_scale_score were originally stored as character values, so they need to be converted to numeric formats. This allows for proper statistical analysis and visualization of trends over time.
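
A sketch of the conversion, which also keeps the original character year as year_raw (visible in the glimpse output below); entries that are not valid numbers, such as footnote symbols, are coerced to NA:

NAEP_8Math <- NAEP_8Math %>%
  mutate(
    year_raw = year,                                       # preserve the original text
    year = as.integer(year),                               # "2024" -> 2024; symbols -> NA
    average_scale_score = as.numeric(average_scale_score)  # character -> double
  )

glimpse(NAEP_8Math)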

## Rows: 204
## Columns: 5
## $ year                <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ jurisdiction        <chr> "National", "National", "National", "National", "N…
## $ race_ethnicity      <chr> "White", "Black", "Hispanic", "Asian/Pacific Islan…
## $ average_scale_score <dbl> 293.2797, 260.5918, 265.8419, 315.1835, 257.9039, …
## $ year_raw            <chr> "2024", "2024", "2024", "2024", "2024", "2024", "2…

1.3.3 Remove missing values

Missing values, represented by special symbols in the original dataset, were removed to ensure accurate analysis. A cleaned subset of the dataset was created to preserve the original data while enabling reliable computations and visualizations.
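
A sketch of this step, assuming the special symbols were already coerced to NA during the numeric conversion above:

# Work with a cleaned copy so the original data are preserved
NAEP_clean <- NAEP_8Math %>%
  filter(!is.na(year), !is.na(average_scale_score))

summary(NAEP_clean)
colSums(is.na(NAEP_clean))  # confirm no missing values remain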

##       year      jurisdiction       race_ethnicity     average_scale_score
##  Min.   :1990   Length:158         Length:158         Min.   :234.8      
##  1st Qu.:2000   Class :character   Class :character   1st Qu.:265.6      
##  Median :2009   Mode  :character   Mode  :character   Median :274.8      
##  Mean   :2008                                         Mean   :278.3      
##  3rd Qu.:2017                                         3rd Qu.:293.2      
##  Max.   :2024                                         Max.   :320.1      
##    year_raw        
##  Length:158        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
##                year        jurisdiction      race_ethnicity average_scale_score 
##                   0                   0                   0                   0 
##            year_raw 
##                   0

Now that the dataset is clean and ready for analysis, I can start the EDA.

1.4 Exploratory Data Analysis (EDA)

1.4.1 Overview of the clean data

The cleaned dataset contains observations from 1990 to 2024 across two jurisdictions: National and New York. It includes multiple racial and ethnic groups, allowing for meaningful comparisons and trend analysis across both time and demographic categories.
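
The quick checks behind the output below (dimensions, jurisdictions, demographic groups, and year range):

dim(NAEP_clean)                    # observations and variables
unique(NAEP_clean$jurisdiction)    # jurisdictions covered
unique(NAEP_clean$race_ethnicity)  # demographic groups
range(NAEP_clean$year)             # first and last assessment years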

## [1] 158   5
## [1] "National" "New York"
## [1] "White"                         "Black"                        
## [3] "Hispanic"                      "Asian/Pacific Islander"       
## [5] "American Indian/Alaska Native" "Two or more races"
## [1] 1990 2024

1.4.2 Distribution of the clean data

The distribution of average scale scores (histogram below) shows that most values fall within a moderate range, with a slight spread across groups. The boxplots reveal differences in score distributions among racial and ethnic groups, with some groups consistently scoring higher than others.
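
A sketch of the ggplot2 calls behind these two views; the binwidth of 5 is an assumption, not the original setting:

# Histogram of average scale scores
ggplot(NAEP_clean, aes(x = average_scale_score)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Average scale score", y = "Count")

# Boxplots of scores by racial/ethnic group
ggplot(NAEP_clean, aes(x = race_ethnicity, y = average_scale_score)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Average scale score")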

1.4.3 Compare groups

The comparison between National and New York results highlights similarities and differences across groups. While overall trends are comparable, some variations suggest that New York performs slightly differently for certain groups, indicating regional differences in educational outcomes.
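
One possible encoding of this comparison, a sketch with jurisdiction mapped to color and one panel per group:

ggplot(NAEP_clean,
       aes(x = year, y = average_scale_score, color = jurisdiction)) +
  geom_line() +
  facet_wrap(~ race_ethnicity) +
  labs(x = "Year", y = "Average scale score", color = "Jurisdiction")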

1.4.4 Central Tendency

The summary statistics indicate that the average score is approximately 278, with a moderate spread around the mean. The relatively close values of the mean and median suggest that the distribution is fairly balanced without extreme skewness.
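
These statistics come from a single summarise() call along the following lines:

NAEP_clean %>%
  summarise(
    mean   = mean(average_scale_score),
    median = median(average_scale_score),
    sd     = sd(average_scale_score),
    min    = min(average_scale_score),
    max    = max(average_scale_score)
  )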

## # A tibble: 1 × 5
##    mean median    sd   min   max
##   <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1  278.   275.  19.7  235.  320.

1.4.5 Outliers

The boxplot analysis does not reveal extreme outliers, but it highlights variations between groups. These differences likely reflect genuine disparities in performance rather than data errors. Furthermore, these variations are consistent with the patterns observed in the previous line graphs, which show persistent differences between groups over time.

1.4.6 Achievement gaps

The achievement gap analysis reveals persistent differences in performance between groups over time. The gap between White and Black students, for example, remains consistent across years, indicating that disparities in educational outcomes have not been fully addressed despite overall improvements.
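
One way to compute this gap per year, a sketch using the national-level rows:

gap_wb <- NAEP_clean %>%
  filter(jurisdiction == "National",
         race_ethnicity %in% c("White", "Black")) %>%
  pivot_wider(id_cols = year,
              names_from = race_ethnicity,
              values_from = average_scale_score) %>%
  mutate(gap = White - Black)  # positive values mean White scores higher

gap_wb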

Although all groups show some improvement over time, the persistence of these gaps suggests systemic challenges that require targeted educational interventions.

Correlation analysis is limited in this dataset because most variables are categorical, so the most meaningful relationship examined is the trend between year and average scale score within demographic groups. There is no clear data leakage concern in this EDA because the variables describe observed assessment outcomes rather than future-known predictors. For data preparation, missing-value symbols should be retained as NA and removed only when necessary for analysis, while superscript-year duplicate rows should be excluded carefully. Because some demographic groups have fewer reported observations, subgroup imbalance should be acknowledged when interpreting results. Since the dataset contains only a few variables, dimension reduction is not necessary, and group-based comparisons are more informative than reducing features.

2 Assignment 3: Experimentation & Model Training

Based on the NAEP dataset and the exploratory data analysis above, I used the cleaned data to build predictive models. The variables in the dataset are year, jurisdiction, race_ethnicity, and average_scale_score.

The target variable is average_scale_score, making this a regression problem, which I address with a Decision Tree. I will use year, jurisdiction, and race_ethnicity as predictors; because the EDA showed strong trends across demographic groups and over time, these variables should predict the average scale score with reasonable accuracy.

2.1 Experiment 1: A Baseline Decision Tree Model

2.1.1 Planning - the Hypothesis

The purpose of Experiment 1 is to create a baseline model, so no parameters were changed. This establishes the reference point that all later experiments are compared against.

Hypothesis: A baseline Decision Tree regression model using year, jurisdiction, and race_ethnicity will be able to predict average scale score with reasonable accuracy because these variables showed visible patterns during the EDA. This first experiment will create a baseline RMSE and \(R^2\) that future experiments can be compared against.

2.1.2 Loading the libraries

library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(123)  # fix the RNG so the train/test split is reproducible

2.1.3 Prepare the modeling data

# Keep the predictors and target; drop any rows with missing values
model_data <- NAEP_clean %>%
  select(year, jurisdiction, race_ethnicity, average_scale_score) %>%
  drop_na()

# Encode the categorical predictors as factors for rpart
model_data$jurisdiction <- as.factor(model_data$jurisdiction)
model_data$race_ethnicity <- as.factor(model_data$race_ethnicity)

2.1.4 Split the data into training and testing sets

# 80/20 split; createDataPartition stratifies on the outcome
train_index <- createDataPartition(
  model_data$average_scale_score,
  p = 0.8,
  list = FALSE
)

train_data <- model_data[train_index, ]   # 80% for training
test_data <- model_data[-train_index, ]   # 20% held out for testing

2.1.5 Train baseline Decision Tree model

tree_exp1 <- rpart(
  average_scale_score ~ year + jurisdiction + race_ethnicity,
  data = train_data,
  method = "anova"  # regression tree for a continuous outcome
)

2.1.6 View the tree

rpart.plot(tree_exp1)

2.1.7 Make predictions

pred_exp1 <- predict(tree_exp1, test_data)

2.1.8 Evaluate model performance

rmse_exp1 <- RMSE(pred_exp1, test_data$average_scale_score)
r2_exp1 <- R2(pred_exp1, test_data$average_scale_score)

rmse_exp1
## [1] 4.484048
r2_exp1
## [1] 0.9335152

2.1.9 Save results in a table

experiment_results <- tibble(
  Experiment = "Experiment 1",
  Change = "Baseline model - no parameter changed",
  RMSE = rmse_exp1,
  R2 = r2_exp1
)

experiment_results
## # A tibble: 1 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934

2.1.10 Evaluation + Recommendation

Experiment 1 trained a baseline Decision Tree regression model using year, jurisdiction, and race_ethnicity to predict average_scale_score. Because the EDA showed strong trends across demographic groups and over time, no parameters were changed; the experiment simply established a baseline for comparison with future experiments.

The baseline model performed strongly with an RMSE of 4.48 and an R² of 0.9335. This means predictions were off by about 4.5 points on average, while the model explained over 93% of score variation. The tree primarily split by race_ethnicity and year, suggesting these are the strongest predictors. Jurisdiction was less influential because it did not appear in the final tree.

Since the baseline model already performs strongly, the recommendation is that the next experiment test whether pruning the tree can improve generalization and reduce possible overfitting.

2.2 Experiment 2: Pruning the Decision Tree

2.2.1 Planning - the Hypothesis

The baseline Decision Tree may contain unnecessary splits that overfit the training data. Increasing the complexity parameter to \(cp = 0.01\) should prune weaker branches and may improve generalization performance on the testing set. So, I am going to change only the complexity parameter from the baseline setting.

2.2.1.1 Why change the cp?

In Decision Trees, the complexity parameter (cp) controls pruning. It determines how much improvement a split must provide before the tree is allowed to grow further. A higher cp value creates a simpler tree by removing weak splits, while a lower cp allows a larger and more complex tree. Since my baseline model performed strongly, the next logical step is to test whether a simpler pruned tree can maintain or improve test performance (GeeksforGeeks, n.d.).

2.2.1.2 Why start with cp=0.01?

I selected \(cp = 0.01\) as an initial test because it is a moderate value commonly used as a starting point for pruning: large enough to remove weak branches, but not so large that the tree becomes oversimplified (GeeksforGeeks, n.d.). This makes it a reasonable first experiment after the baseline model.

tree_exp2 <- rpart(
 average_scale_score ~ year + jurisdiction + race_ethnicity,
 data = train_data,
 method = "anova",
 control = rpart.control(cp = 0.01)
)
# View Experiment 2 tree
rpart.plot(tree_exp2)

# Make predictions for Experiment 2
pred_exp2 <- predict(tree_exp2, test_data)

# Evaluate Experiment 2
rmse_exp2 <- RMSE(pred_exp2, test_data$average_scale_score)
r2_exp2 <- R2(pred_exp2, test_data$average_scale_score)

rmse_exp2
## [1] 4.484048
r2_exp2
## [1] 0.9335152

2.2.2 Compare Experiment 1 to Experiment 2

experiment_results <- experiment_results %>%
  add_row(
    Experiment = "Experiment 2",
    Change = "Changed cp to 0.01",
    RMSE = rmse_exp2,
    R2 = r2_exp2
  )

experiment_results
## # A tibble: 2 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
# Compare visually
experiment_results %>%
  ggplot(aes(x = Experiment, y = RMSE)) +
  geom_col()

The results of Experiment 2 were identical to Experiment 1, with the same tree structure, RMSE, and \(R^2\) values. This indicates that no actual change was introduced because the default rpart() complexity parameter is already \(cp = 0.01\). Therefore, explicitly setting \(cp = 0.01\) reproduced the baseline model rather than creating a new one. Based on this finding, the next logical step is to decrease the complexity parameter to \(0.005\) in Experiment 3 to test whether allowing additional splits improves predictive accuracy.
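
One way to confirm this is to inspect the values returned by rpart.control() with no arguments, which are the package defaults:

rpart.control()$cp        # 0.01 - the default complexity parameter
rpart.control()$minsplit  # 20   - the default minimum node size for a split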

2.3 Experiment 3: Lowering the Complexity Parameter

2.3.1 Planning - the Hypothesis

Since the baseline model used the default cp = 0.01, lowering the complexity parameter to 0.005 may allow useful additional splits and improve predictive accuracy.

The only thing I will be changing is the cp from \(0.01\) to \(0.005\).

tree_exp3 <- rpart(
 average_scale_score ~ year + jurisdiction + race_ethnicity,
 data = train_data,
 method = "anova",
 control = rpart.control(cp = 0.005)
)

rpart.plot(tree_exp3)

pred_exp3 <- predict(tree_exp3, test_data)

rmse_exp3 <- RMSE(pred_exp3, test_data$average_scale_score)
r2_exp3 <- R2(pred_exp3, test_data$average_scale_score)

rmse_exp3
## [1] 3.42645
r2_exp3
## [1] 0.9596804

A new Decision Tree regression model was trained using the same training and testing data, with \(cp = 0.005\).

2.3.2 Comparing Experiments 1, 2, & 3

# Add Experiment 3 to the results table
experiment_results <- experiment_results %>%
  add_row(
    Experiment = "Experiment 3",
    Change = "Changed cp to 0.005",
    RMSE = rmse_exp3,
    R2 = r2_exp3
  )

# View comparison table
experiment_results
## # A tibble: 3 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
## 3 Experiment 3 Changed cp to 0.005                    3.43 0.960
experiment_results %>%
  mutate(
    RMSE = round(RMSE, 3),
    R2 = round(R2, 4)
  )
## # A tibble: 3 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
## 3 Experiment 3 Changed cp to 0.005                    3.43 0.960

Since Experiment 1 and Experiment 2 are identical, let’s look at the focused comparison between Experiment 1 and Experiment 3:

experiment_results %>%
  filter(Experiment %in% c("Experiment 1", "Experiment 3")) %>%
  mutate(
    RMSE = round(RMSE, 3),
    R2 = round(R2, 4)
  )
## # A tibble: 2 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 3 Changed cp to 0.005                    3.43 0.960
experiment_results %>%
  filter(Experiment %in% c("Experiment 1", "Experiment 3")) %>%
  ggplot(aes(x = Experiment, y = RMSE)) +
  geom_col() +
  labs(
    title = "Baseline vs Lower CP Model",
    x = "Experiment",
    y = "RMSE"
  )

The comparison shows that Experiment 3 improved performance substantially. RMSE decreased from \(4.48\) to \(3.43\), while \(R²\) increased from \(0.9335\) to \(0.9597\). This indicates that allowing a slightly more complex tree captured important patterns that were missed in the baseline model.

Since reducing cp improved results, in the next experiment, I should test an even smaller value such as \(cp = 0.001\) to determine whether additional splits continue improving performance or begin overfitting.

2.4 Experiment 4: Lower cp further

2.4.1 Planning - The Hypothesis

Since reducing cp to \(0.005\) improved performance, lowering it further to \(0.001\) may allow additional meaningful splits and slightly improve prediction accuracy. So, I am going to change only the cp again from \(0.005\) to \(0.001\).

tree_exp4 <- rpart(
 average_scale_score ~ year + jurisdiction + race_ethnicity,
 data = train_data,
 method = "anova",
 control = rpart.control(cp = 0.001)
)

rpart.plot(tree_exp4)

pred_exp4 <- predict(tree_exp4, test_data)

rmse_exp4 <- RMSE(pred_exp4, test_data$average_scale_score)
r2_exp4 <- R2(pred_exp4, test_data$average_scale_score)

rmse_exp4
## [1] 3.365846
r2_exp4
## [1] 0.9616695

2.4.2 Comparing the Experiments

# Add Experiment 4 to the results table
experiment_results <- experiment_results %>%
  add_row(
    Experiment = "Experiment 4",
    Change = "Changed cp to 0.001",
    RMSE = rmse_exp4,
    R2 = r2_exp4
  )

# View comparison table
experiment_results
## # A tibble: 4 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
## 3 Experiment 3 Changed cp to 0.005                    3.43 0.960
## 4 Experiment 4 Changed cp to 0.001                    3.37 0.962

2.4.3 Evaluation + Recommendation

Experiment 4 slightly improved the model again. RMSE decreased from \(3.426\) to \(3.366\), while \(R²\) increased from \(0.9597\) to \(0.9617\). This suggests that some additional splits captured useful structure in the data.

Experiment 4 produced a deeper tree with improved accuracy. The next question is whether some of that complexity is unnecessary.

2.5 Experiment 5: The Depth of the tree

Since Experiment 4 used a deeper tree, I tested whether limiting the tree depth could simplify the model while maintaining strong accuracy. Capping the depth at 3 is a reasonable simplification test, so I added maxdepth = 3 while keeping the previous best settings (cp = 0.001) unchanged.

2.5.1 Planning - The Hypothesis

Since Experiment 4 used a deeper tree, limiting its depth to 3 (maxdepth = 3) may reduce unnecessary complexity while maintaining strong predictive performance.

tree_exp5 <- rpart(
  average_scale_score ~ year + jurisdiction + race_ethnicity,
  data = train_data,
  method = "anova",
  control = rpart.control(cp = 0.001, maxdepth = 3)
)
# Visualize Experiment 5 tree
rpart.plot(tree_exp5)

# Make predictions on test data
pred_exp5 <- predict(tree_exp5, test_data)

# Calculate metrics
rmse_exp5 <- RMSE(pred_exp5, test_data$average_scale_score)
r2_exp5   <- R2(pred_exp5, test_data$average_scale_score)

# Show results
rmse_exp5
## [1] 4.069631
r2_exp5
## [1] 0.9461673
experiment_results <- experiment_results %>%
  add_row(
    Experiment = "Experiment 5",
    Change = "Added maxdepth = 3",
    RMSE = rmse_exp5,
    R2 = r2_exp5
  )

experiment_results
## # A tibble: 5 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
## 3 Experiment 3 Changed cp to 0.005                    3.43 0.960
## 4 Experiment 4 Changed cp to 0.001                    3.37 0.962
## 5 Experiment 5 Added maxdepth = 3                     4.07 0.946

Based on the results of Experiment 5, performance decreased when the tree depth was restricted. RMSE increased from \(3.366\) to \(4.070\), while \(R²\) decreased from \(0.9617\) to \(0.9462\). This indicates that the deeper branches in Experiment 4 were capturing meaningful patterns in the data.

Since restricting depth reduced performance, the next experiment should keep the deeper tree and instead test another control parameter such as minsplit.

2.6 Experiment 6: minsplit parameter

minsplit is a decision tree parameter that defines the minimum number of observations a node must contain before a further split is attempted (Sachdeva, n.d.). The same source indicates that “minsplit helps us avoid overfitting by pre-pruning the tree before we test our model.”

2.6.1 Planning - The Hypothesis

The best tree may still contain some splits based on small sample groups. Increasing minsplit may improve generalization without removing important depth.

tree_exp6 <- rpart(
 average_scale_score ~ year + jurisdiction + race_ethnicity,
 data = train_data,
 method = "anova",
 control = rpart.control(cp = 0.001, minsplit = 20)
)
# Visualize Experiment 6 tree
rpart.plot(tree_exp6)

# Make predictions on test data
pred_exp6 <- predict(tree_exp6, test_data)

# Calculate metrics
rmse_exp6 <- RMSE(pred_exp6, test_data$average_scale_score)
r2_exp6   <- R2(pred_exp6, test_data$average_scale_score)

# Show results
rmse_exp6
## [1] 3.365846
r2_exp6
## [1] 0.9616695
experiment_results <- experiment_results %>%
  add_row(
    Experiment = "Experiment 6",
    Change = "Added minsplit = 20",
    RMSE = rmse_exp6,
    R2 = r2_exp6
  )

experiment_results
## # A tibble: 6 × 4
##   Experiment   Change                                 RMSE    R2
##   <chr>        <chr>                                 <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed  4.48 0.934
## 2 Experiment 2 Changed cp to 0.01                     4.48 0.934
## 3 Experiment 3 Changed cp to 0.005                    3.43 0.960
## 4 Experiment 4 Changed cp to 0.001                    3.37 0.962
## 5 Experiment 5 Added maxdepth = 3                     4.07 0.946
## 6 Experiment 6 Added minsplit = 20                    3.37 0.962

Experiment 6 produced RMSE and \(R^2\) values identical to Experiment 4, indicating that the best-performing tree already used splits supported by sufficient observations. In fact, minsplit = 20 is the rpart default, so setting it explicitly did not change the model specification; every important split already satisfied this threshold, and the tree remained unchanged.

Taken together, the six experiments show the following progression:

  1. Baseline model

  2. Confirmed the default cp

  3. Lower cp improved performance

  4. Lower cp improved performance again

  5. Too much simplification hurt performance

  6. Confirmed split stability

The most effective parameter for this dataset was the complexity parameter (cp). Lowering cp improved model accuracy by allowing meaningful additional splits. However, excessive simplification reduced performance, and increasing minsplit had no further benefit once the model reached a stable structure.
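
A closing note on tuning: rpart records a cross-validated error for every candidate cp while growing the tree, so instead of trying cp values one at a time, the cp table of the deepest tree can guide pruning directly. A sketch using the Experiment 4 tree:

# Cross-validated error (xerror) for each candidate cp
printcp(tree_exp4)

# Prune at the cp with the lowest cross-validated error
best_cp <- tree_exp4$cptable[which.min(tree_exp4$cptable[, "xerror"]), "CP"]
tree_best <- prune(tree_exp4, cp = best_cp)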

2.7 Experimentations & Model Training - Essay

experiment_results %>%
  mutate(
    RMSE = round(RMSE, 3),
    R2 = round(R2, 4)
  ) %>%
  knitr::kable(
    caption = "Summary of Decision Tree Experiments",
    col.names = c("Experiment", "Parameter Change", "RMSE", "R²"),
    align = c("l", "l", "c", "c"),
    booktabs = TRUE
  )
Summary of Decision Tree Experiments

Experiment     Parameter Change                        RMSE     R²
Experiment 1   Baseline model - no parameter changed   4.484    0.9335
Experiment 2   Changed cp to 0.01                      4.484    0.9335
Experiment 3   Changed cp to 0.005                     3.426    0.9597
Experiment 4   Changed cp to 0.001                     3.366    0.9617
Experiment 5   Added maxdepth = 3                      4.070    0.9462
Experiment 6   Added minsplit = 20                     3.366    0.9617

This assignment builds on the dataset I used in Assignment 2, which examined National Assessment of Educational Progress (NAEP) Grade 8 Mathematics scores by year, jurisdiction, and race/ethnicity. The objective of this assignment was to predict average_scale_score using a Decision Tree regression model and to apply systematic experimentation by changing one parameter at a time. Model performance was evaluated using RMSE (Root Mean Squared Error) and \(R^2\).

Experiment 1 established a baseline model using default rpart() settings. The hypothesis was that the variables year, jurisdiction, and race/ethnicity would predict average scores with reasonable accuracy because strong trends were identified during the exploratory data analysis. The model performed strongly, with an RMSE of \(4.484\) and an \(R^2\) of \(0.9335\), showing that the predictors explained over \(93\%\) of score variation.

Experiment 2 explicitly set the complexity parameter to \(cp = 0.01\) to test whether pruning would improve results. However, the model produced identical results to Experiment 1. This occurred because \(cp = 0.01\) is already the default setting in rpart(). Although performance did not change, this experiment confirmed the baseline model configuration and provided an important understanding of the software defaults.

Experiment 3 reduced the complexity parameter to \(cp = 0.005\). The hypothesis was that allowing additional splits would capture more meaningful patterns in the data. This significantly improved performance, reducing RMSE to \(3.426\) and increasing \(R^2\) to \(0.9597\). This suggested that the baseline model had been slightly over-pruned.

Experiment 4 further reduced the complexity parameter to \(cp = 0.001\). The model improved again, though by a smaller margin, achieving an RMSE of \(3.366\) and an \(R^2\) of \(0.9617\). This indicated that additional tree flexibility still captured useful structure, but the gains were beginning to level off.

Experiment 5 tested whether simplifying the tree would maintain strong accuracy by limiting tree depth to maxdepth = 3. Performance declined, with RMSE increasing to \(4.070\) and \(R^2\) decreasing to \(0.9462\). This showed that some deeper branches from Experiment 4 were important and that over-simplifying the model removed meaningful predictive patterns.

Experiment 6 increased the minimum split requirement to minsplit = 20 while keeping the strongest previous settings. The results were identical to Experiment 4, with an RMSE of \(3.366\) and an \(R^2\) of \(0.9617\). This indicated that the best-performing tree was already using stable splits supported by enough observations.

Overall, the most influential parameter in this project was the complexity parameter (cp). Lowering cp improved performance by allowing useful additional splits, while excessive simplification reduced accuracy. The best-performing model was Experiment 4, which was matched by Experiment 6. This demonstrates that careful experimentation and evidence-based tuning can significantly improve Decision Tree performance.

2.8 Sources

GeeksforGeeks. (n.d.). How to choose alpha in cost-complexity pruning? https://www.geeksforgeeks.org/machine-learning/how-to-choose-a-in-cost-complexity-pruning/

Sachdeva, J. (n.d.). Minsplit and minbucket. Medium. https://medium.com/talking-with-data/minsplit-and-minbucket-a49ff56026c8

2.9 Appendix

I used ChatGPT to review my assignment against the requirements and rubric. I pasted my answers, the assignment, and the rubric, and prompted ChatGPT as follows:

Here is my rmd document and attached is the assignment I am answering plus the rubric. Check if my answers are correct and the flow of my writing.