Note to Professor: Assignment 2 is included below for reference because Assignment 3 uses the same cleaned dataset and preprocessing steps. Please feel free to begin reading at Assignment 3: Experimentation & Model Training.
The first step in exploratory data analysis (EDA) is to load the libraries needed for the different stages of data handling.
The libraries used in this analysis support different stages of the EDA process. The readxl package is used to import the dataset from Excel, while tidyverse provides essential tools for data manipulation and visualization, including ggplot2. The janitor package helps clean and standardize column names, making the data easier to work with. Additionally, skimr is used for quick summary statistics, naniar for identifying and visualizing missing data, GGally for exploring relationships between variables, and scales for improving the readability of plots. Together, these libraries enable efficient data cleaning, exploration, and visualization.
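A sketch of the corresponding setup chunk, with the package names taken from the description above:

# Load the packages used throughout the analysis
library(readxl)     # import the dataset from Excel
library(tidyverse)  # data manipulation and ggplot2 visualization
library(janitor)    # clean and standardize column names
library(skimr)      # quick summary statistics
library(naniar)     # identify and visualize missing data
library(GGally)     # explore relationships between variables
library(scales)     # improve the readability of plots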
Of course, EDA requires a dataset, so this step loads the data I obtained from the National Assessment of Educational Progress (NAEP) Data Explorer website. The site provides national and state assessment results in all core subjects that are assessed. Since I am a middle school teacher and one of the subjects I teach is mathematics, I chose data at both the national and New York State levels for 8th grade mathematics, specifically Algebra. The data provides the average scale score for groups of students categorized by race and ethnicity, which makes the groups easy to compare and the trends easy to find.
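The import step presumably looked like the sketch below; the file name "NAEP_8Math.xlsx" is a placeholder for the actual Data Explorer export, not the real file name.

# Import the NAEP export (file name is a placeholder)
NAEP_8Math <- read_excel("NAEP_8Math.xlsx")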
Before moving on to EDA, let’s check if the dataset loaded correctly:
## # A tibble: 10 × 4
## Year Jurisdiction Race/ethnicity used to report tren…¹ `Average scale score`
## <chr> <chr> <chr> <chr>
## 1 2024 National White 293.27971245966
## 2 2024 National Black 260.59177267955101
## 3 2024 National Hispanic 265.84186852976899
## 4 2024 National Asian/Pacific Islander 315.183504238252
## 5 2024 National American Indian/Alaska Native 257.90389371060098
## 6 2024 National Two or more races 286.27011213518199
## 7 2024 New York White 292.82165256179502
## 8 2024 New York Black 266.66589978477498
## 9 2024 New York Hispanic 264.56017274169801
## 10 2024 New York Asian/Pacific Islander 314.30916955953899
## # ℹ abbreviated name: ¹`Race/ethnicity used to report trends, school-reported`
## Rows: 204
## Columns: 4
## $ Year <chr> "2024", "2024"…
## $ Jurisdiction <chr> "National", "N…
## $ `Race/ethnicity used to report trends, school-reported` <chr> "White", "Blac…
## $ `Average scale score` <chr> "293.279712459…
The dataset has 204 observations and 4 variables: the year the assessment was administered; the jurisdiction, indicating whether the results are national or state-level; the race and ethnicity of the participating students; and the average scale score.
This section focuses on preparing the dataset for analysis by cleaning column names, correcting data types, and handling missing values. Proper data cleaning ensures that the dataset is accurate, consistent, and suitable for meaningful analysis.
The column names were simplified to remove spaces and special characters, making them easier to reference and manipulate during analysis. This step improves code readability and reduces potential errors.
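A sketch of this step using janitor; the long column name that clean_names() produces from the original race/ethnicity header is an assumption:

# Standardize the column names, then shorten the long race/ethnicity name
NAEP_8Math <- NAEP_8Math %>%
  clean_names() %>%
  rename(race_ethnicity = race_ethnicity_used_to_report_trends_school_reported)
head(NAEP_8Math)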
## # A tibble: 6 × 4
## year jurisdiction race_ethnicity average_scale_score
## <chr> <chr> <chr> <chr>
## 1 2024 National White 293.27971245966
## 2 2024 National Black 260.59177267955101
## 3 2024 National Hispanic 265.84186852976899
## 4 2024 National Asian/Pacific Islander 315.183504238252
## 5 2024 National American Indian/Alaska Native 257.90389371060098
## 6 2024 National Two or more races 286.27011213518199
glimpse(NAEP_8Math) shows that the variables year and average_scale_score were originally stored as character values, so they need to be converted to numeric formats (the original year strings are preserved in year_raw). This allows proper statistical analysis and visualization of trends over time.
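A hedged sketch of the conversion (parse_number() from readr strips any non-numeric symbols; the exact implementation may differ):

# Keep the raw year strings, then coerce year and score to numeric types;
# entries holding special symbols become NA here
NAEP_8Math <- NAEP_8Math %>%
  mutate(
    year_raw = year,
    year = as.integer(parse_number(year)),
    average_scale_score = as.numeric(average_scale_score)
  )
glimpse(NAEP_8Math)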
## Rows: 204
## Columns: 5
## $ year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 20…
## $ jurisdiction <chr> "National", "National", "National", "National", "N…
## $ race_ethnicity <chr> "White", "Black", "Hispanic", "Asian/Pacific Islan…
## $ average_scale_score <dbl> 293.2797, 260.5918, 265.8419, 315.1835, 257.9039, …
## $ year_raw <chr> "2024", "2024", "2024", "2024", "2024", "2024", "2…
Missing values, represented by special symbols in the original dataset, were removed to ensure accurate analysis. A cleaned subset of the dataset was created to preserve the original data while enabling reliable computations and visualizations.
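A sketch of the subsetting step (the actual code may also drop the superscript-year duplicate rows mentioned later):

# Build the analysis subset; the original NAEP_8Math stays untouched
NAEP_clean <- NAEP_8Math %>%
  filter(!is.na(year), !is.na(average_scale_score))
summary(NAEP_clean)
colSums(is.na(NAEP_clean))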
## year jurisdiction race_ethnicity average_scale_score
## Min. :1990 Length:158 Length:158 Min. :234.8
## 1st Qu.:2000 Class :character Class :character 1st Qu.:265.6
## Median :2009 Mode :character Mode :character Median :274.8
## Mean :2008 Mean :278.3
## 3rd Qu.:2017 3rd Qu.:293.2
## Max. :2024 Max. :320.1
## year_raw
## Length:158
## Class :character
## Mode :character
##
##
##
## year jurisdiction race_ethnicity average_scale_score
## 0 0 0 0
## year_raw
## 0
Now that the dataset is clean and ready for analysis, I can start the EDA.
The cleaned dataset contains observations from 1990 to 2024 across two jurisdictions: National and New York. It includes multiple racial and ethnic groups, allowing for meaningful comparisons and trend analysis across both time and demographic categories.
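These facts come from structure checks such as:

dim(NAEP_clean)
unique(NAEP_clean$jurisdiction)
unique(NAEP_clean$race_ethnicity)
range(NAEP_clean$year)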
## [1] 158 5
## [1] "National" "New York"
## [1] "White" "Black"
## [3] "Hispanic" "Asian/Pacific Islander"
## [5] "American Indian/Alaska Native" "Two or more races"
## [1] 1990 2024
The distribution of average scale scores (histogram below) shows that most values fall between roughly 265 and 293 (the interquartile range), with the full range spanning about 235 to 320. The boxplots reveal differences in score distributions among racial and ethnic groups, with some groups consistently scoring higher than others.
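The plotting code is not shown above; a minimal sketch of the two plots (the binwidth and the flipped boxplot axes are my assumptions):

# Histogram of all average scale scores
ggplot(NAEP_clean, aes(x = average_scale_score)) +
  geom_histogram(binwidth = 5)
# Boxplots of scores by racial/ethnic group
ggplot(NAEP_clean, aes(x = race_ethnicity, y = average_scale_score)) +
  geom_boxplot() +
  coord_flip()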
Now, let’s look at the trends of the scores at national level and at the state of New York level.
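A sketch of one way to draw these trend lines, with one panel per jurisdiction:

# Average scale score over time, one line per group, faceted by jurisdiction
ggplot(NAEP_clean, aes(x = year, y = average_scale_score, color = race_ethnicity)) +
  geom_line() +
  facet_wrap(~ jurisdiction)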
The trend analysis shows how average scale scores have changed over time for different racial and ethnic groups. Overall, scores have generally improved from 1990 to the early 2010s, followed by slight declines in recent years. Additionally, consistent gaps between groups can be observed, with some groups maintaining higher performance levels over time.
Notably, Asian/Pacific Islander students consistently achieve the highest scores, while other groups show gradual improvement but remain below this level.
These trends suggest that improvements over time have not been equally distributed across all groups, highlighting persistent structural disparities in educational outcomes.
The comparison between National and New York results highlights similarities and differences across groups. While overall trends are comparable, some variations suggest that New York performs slightly differently for certain groups, indicating regional differences in educational outcomes.
The summary statistics indicate that the average score is approximately 278, with a moderate spread around the mean. The relatively close values of the mean and median suggest that the distribution is fairly balanced without extreme skewness.
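The table below was most likely produced with a summarise() call along these lines:

NAEP_clean %>%
  summarise(
    mean = mean(average_scale_score),
    median = median(average_scale_score),
    sd = sd(average_scale_score),
    min = min(average_scale_score),
    max = max(average_scale_score)
  )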
## # A tibble: 1 × 5
## mean median sd min max
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 278. 275. 19.7 235. 320.
The boxplot analysis does not reveal extreme outliers, but it highlights variations between groups. These differences likely reflect genuine disparities in performance rather than data errors. Furthermore, these variations are consistent with the patterns observed in the previous line graphs, which show persistent differences between groups over time.
The achievement gap analysis reveals persistent differences in performance between groups over time. The gap between White and Black students, for example, remains consistent across years, indicating that disparities in educational outcomes have not been fully addressed despite overall improvements.
Although all groups show some improvement over time, the persistence of these gaps suggests systemic challenges that require targeted educational interventions.
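One way to quantify the White-Black gap described here (an illustrative sketch, not the assignment's actual gap code; gap_wb is a hypothetical name):

# White-Black score gap per year and jurisdiction
gap_wb <- NAEP_clean %>%
  filter(race_ethnicity %in% c("White", "Black")) %>%
  pivot_wider(
    id_cols = c(year, jurisdiction),
    names_from = race_ethnicity,
    values_from = average_scale_score
  ) %>%
  mutate(gap = White - Black)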
Correlation analysis is limited in this dataset because most variables are categorical, so the most meaningful relationship examined is the trend between year and average scale score within demographic groups. There is no clear data leakage concern in this EDA because the variables describe observed assessment outcomes rather than future-known predictors. For data preparation, missing-value symbols should be retained as NA and removed only when necessary for analysis, while superscript-year duplicate rows should be excluded carefully. Because some demographic groups have fewer reported observations, subgroup imbalance should be acknowledged when interpreting results. Since the dataset contains only a few variables, dimension reduction is not necessary, and group-based comparisons are more informative than reducing features.
Based on the NAEP dataset and the exploratory data
analysis above, I used the cleaned data to build predictive models. The
variables in the dataset are:
year
jurisdiction
race_ethnicity
average_scale_score
The target variable is average_scale_score, a continuous value, so this is a regression problem, which I will model with a Decision Tree. I am going to use year, jurisdiction, and race_ethnicity as predictors; the EDA showed strong trends across demographic groups and over time, so these variables should predict average scale score with reasonable accuracy.
The purpose of Experiment 1 is to create a baseline model, so no parameters were changed. This baseline provides the point of comparison for all later experiments.
Hypothesis: A baseline Decision Tree regression
model using year, jurisdiction, and
race_ethnicity will be able to predict average scale score
with reasonable accuracy because these variables showed visible patterns
during the EDA. This first experiment will create a baseline
RMSE and R² score that future experiments can
be compared against.
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123)
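# Prepare the modeling data: keep predictors and target, drop rows with missing values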
model_data <- NAEP_clean %>%
select(year, jurisdiction, race_ethnicity, average_scale_score) %>%
drop_na()
model_data$jurisdiction <- as.factor(model_data$jurisdiction)
model_data$race_ethnicity <- as.factor(model_data$race_ethnicity)
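# 80/20 train/test split, stratified on the target by createDataPartition()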
train_index <- createDataPartition(
model_data$average_scale_score,
p = 0.8,
list = FALSE
)
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
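# Fit the baseline regression tree with default rpart settings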
tree_exp1 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova"
)
rpart.plot(tree_exp1)
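# Predict on the held-out test set and compute RMSE and R-squared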
pred_exp1 <- predict(tree_exp1, test_data)
rmse_exp1 <- RMSE(pred_exp1, test_data$average_scale_score)
r2_exp1 <- R2(pred_exp1, test_data$average_scale_score)
rmse_exp1
## [1] 4.484048
r2_exp1
## [1] 0.9335152
experiment_results <- tibble(
Experiment = "Experiment 1",
Change = "Baseline model - no parameter changed",
RMSE = rmse_exp1,
R2 = r2_exp1
)
experiment_results
## # A tibble: 1 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
Experiment 1 trained a baseline Decision Tree regression model using year, jurisdiction, and race_ethnicity to predict average_scale_score. Because the EDA showed strong trends across demographic groups and over time, I expected reasonable accuracy without changing any parameters; this established the baseline for comparison with future experiments. The baseline model performed strongly, with an RMSE of 4.48 and an R² of 0.9335. This means predictions were off by about 4.5 points on average, while the model explained over 93% of score variation. The tree split primarily on race_ethnicity and year, suggesting these are the strongest predictors; jurisdiction did not appear in the final tree, indicating it was less influential.
Since the baseline model already performs strongly, one recommendation will be that the next experiment should test whether pruning the tree can improve generalization and reduce possible overfitting.
The baseline Decision Tree may contain unnecessary splits that overfit the training data. Setting the complexity parameter to \(cp = 0.01\) should prune weaker branches and may improve generalization performance on the test set. So, I am going to change only the complexity parameter from the baseline setting.
Why cp? In Decision Trees, the complexity parameter (cp) controls pruning. It determines how much improvement a split must provide before the tree is allowed to grow further. A higher cp value creates a simpler tree by removing weak splits, while a lower cp allows a larger and more complex tree. Since my baseline model performed strongly, the next logical step is to test whether a simpler pruned tree can maintain or improve test performance (GeeksforGeeks, n.d.).
Why cp = 0.01? I selected \(cp = 0.01\) as an initial test because it is a moderate value commonly used as a starting point for pruning: large enough to remove weak branches but not so large that the tree becomes oversimplified (GeeksforGeeks, n.d.). This makes it a reasonable first experiment after the baseline model.
tree_exp2 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.01)
)
# View Experiment 2 tree
rpart.plot(tree_exp2)
# Make predictions for Experiment 2
pred_exp2 <- predict(tree_exp2, test_data)
# Evaluate Experiment 2
rmse_exp2 <- RMSE(pred_exp2, test_data$average_scale_score)
r2_exp2 <- R2(pred_exp2, test_data$average_scale_score)
rmse_exp2
## [1] 4.484048
r2_exp2
## [1] 0.9335152
experiment_results <- experiment_results %>%
add_row(
Experiment = "Experiment 2",
Change = "Changed cp to 0.01",
RMSE = rmse_exp2,
R2 = r2_exp2
)
experiment_results
## # A tibble: 2 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
# Compare visually
experiment_results %>%
ggplot(aes(x = Experiment, y = RMSE)) +
geom_col()
The results of Experiment 2 were identical to Experiment 1, with the
same tree structure, RMSE, and R² values. This
indicates that no actual change was introduced because the default
rpart() complexity parameter is already \(cp = 0.01\). Therefore, explicitly setting
\(cp = 0.01\) reproduced the baseline
model rather than creating a new one. Based on this finding, the next
logical step is to decrease the complexity parameter to \(0.005\) in Experiment 3 to test whether
allowing additional splits improves predictive accuracy.
Since the baseline model used the default cp = 0.01, lowering the complexity parameter to 0.005 may allow useful additional splits and improve predictive accuracy.
The only thing I will be changing is the cp from \(0.01\) to \(0.005\).
tree_exp3 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.005)
)
rpart.plot(tree_exp3)
pred_exp3 <- predict(tree_exp3, test_data)
rmse_exp3 <- RMSE(pred_exp3, test_data$average_scale_score)
r2_exp3 <- R2(pred_exp3, test_data$average_scale_score)
rmse_exp3
## [1] 3.42645
r2_exp3
## [1] 0.9596804
A new Decision Tree regression model was trained using the same training and testing data, with \(cp = 0.005\).
# Add Experiment 3 to the results table
experiment_results <- experiment_results %>%
add_row(
Experiment = "Experiment 3",
Change = "Changed cp to 0.005",
RMSE = rmse_exp3,
R2 = r2_exp3
)
# View comparison table
experiment_results
## # A tibble: 3 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
## 3 Experiment 3 Changed cp to 0.005 3.43 0.960
experiment_results %>%
mutate(
RMSE = round(RMSE, 3),
R2 = round(R2, 4)
)
## # A tibble: 3 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
## 3 Experiment 3 Changed cp to 0.005 3.43 0.960
Since Experiment 1 and Experiment 2 are identical, let’s look at the focused comparison between Experiment 1 and Experiment 3:
experiment_results %>%
filter(Experiment %in% c("Experiment 1", "Experiment 3")) %>%
mutate(
RMSE = round(RMSE, 3),
R2 = round(R2, 4)
)
## # A tibble: 2 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 3 Changed cp to 0.005 3.43 0.960
experiment_results %>%
filter(Experiment %in% c("Experiment 1", "Experiment 3")) %>%
ggplot(aes(x = Experiment, y = RMSE)) +
geom_col() +
labs(
title = "Baseline vs Lower CP Model",
x = "Experiment",
y = "RMSE"
)
The comparison shows that Experiment 3 improved performance substantially. RMSE decreased from \(4.48\) to \(3.43\), while \(R²\) increased from \(0.9335\) to \(0.9597\). This indicates that allowing a slightly more complex tree captured important patterns that were missed in the baseline model.
Since reducing cp improved results, in the next
experiment, I should test an even smaller value such as \(cp = 0.001\) to determine whether
additional splits continue improving performance or begin
overfitting.
Since reducing cp to \(0.005\) improved performance, lowering it
further to \(0.001\) may allow
additional meaningful splits and slightly improve prediction accuracy.
So, I am going to change only the cp again from \(0.005\) to \(0.001\).
tree_exp4 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.001)
)
rpart.plot(tree_exp4)
pred_exp4 <- predict(tree_exp4, test_data)
rmse_exp4 <- RMSE(pred_exp4, test_data$average_scale_score)
r2_exp4 <- R2(pred_exp4, test_data$average_scale_score)
rmse_exp4
## [1] 3.365846
r2_exp4
## [1] 0.9616695
# Add Experiment 4 to the results table
experiment_results <- experiment_results %>%
add_row(
Experiment = "Experiment 4",
Change = "Changed cp to 0.001",
RMSE = rmse_exp4,
R2 = r2_exp4
)
# View comparison table
experiment_results
## # A tibble: 4 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
## 3 Experiment 3 Changed cp to 0.005 3.43 0.960
## 4 Experiment 4 Changed cp to 0.001 3.37 0.962
Experiment 4 slightly improved the model again. RMSE
decreased from \(3.426\) to \(3.366\), while \(R²\) increased from \(0.9597\) to \(0.9617\). This suggests that some
additional splits captured useful structure in the data.
Experiment 4 produced a deeper tree with improved accuracy. The next question is whether some of that complexity is unnecessary.
Since Experiment 4 used a deeper tree, Experiment 5 tests whether limiting the tree depth can simplify the model while maintaining strong accuracy. Limiting the depth to 3 (maxdepth = 3) is a reasonable simplification test, so I added maxdepth = 3 while keeping the previous best setting (\(cp = 0.001\)) unchanged.
Hypothesis: restricting the depth to 3 may reduce unnecessary complexity while maintaining strong predictive performance.
tree_exp5 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.001, maxdepth = 3)
)
# Visualize Experiment 5 tree
rpart.plot(tree_exp5)
# Make predictions on test data
pred_exp5 <- predict(tree_exp5, test_data)
# Calculate metrics
rmse_exp5 <- RMSE(pred_exp5, test_data$average_scale_score)
r2_exp5 <- R2(pred_exp5, test_data$average_scale_score)
# Show results
rmse_exp5
## [1] 4.069631
r2_exp5
## [1] 0.9461673
experiment_results <- experiment_results %>%
add_row(
Experiment = "Experiment 5",
Change = "Added maxdepth = 3",
RMSE = rmse_exp5,
R2 = r2_exp5
)
experiment_results
## # A tibble: 5 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
## 3 Experiment 3 Changed cp to 0.005 3.43 0.960
## 4 Experiment 4 Changed cp to 0.001 3.37 0.962
## 5 Experiment 5 Added maxdepth = 3 4.07 0.946
Based on the results of Experiment 5, performance decreased when the
tree depth was restricted. RMSE increased from \(3.366\) to \(4.070\), while \(R²\) decreased from \(0.9617\) to \(0.9462\). This indicates that the deeper
branches in Experiment 4 were capturing meaningful patterns in the
data.
Since restricting depth reduced performance, the next experiment
should keep the deeper tree and instead test another control parameter
such as minsplit.
minsplit is a parameter in decision trees that defines the minimum number of observations a node must contain before it can be split further (Sachdeva, n.d.). The same source indicates that “minsplit helps us avoid overfitting by pre-pruning the tree before we test our model.”
The best tree may still contain some splits based on small sample groups. Increasing minsplit may improve generalization without removing important depth.
tree_exp6 <- rpart(
average_scale_score ~ year + jurisdiction + race_ethnicity,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.001, minsplit = 20)
)
# Visualize Experiment 6 tree
rpart.plot(tree_exp6)
# Make predictions on test data
pred_exp6 <- predict(tree_exp6, test_data)
# Calculate metrics
rmse_exp6 <- RMSE(pred_exp6, test_data$average_scale_score)
r2_exp6 <- R2(pred_exp6, test_data$average_scale_score)
# Show results
rmse_exp6
## [1] 3.365846
r2_exp6
## [1] 0.9616695
experiment_results <- experiment_results %>%
add_row(
Experiment = "Experiment 6",
Change = "Added minsplit = 20",
RMSE = rmse_exp6,
R2 = r2_exp6
)
experiment_results
## # A tibble: 6 × 4
## Experiment Change RMSE R2
## <chr> <chr> <dbl> <dbl>
## 1 Experiment 1 Baseline model - no parameter changed 4.48 0.934
## 2 Experiment 2 Changed cp to 0.01 4.48 0.934
## 3 Experiment 3 Changed cp to 0.005 3.43 0.960
## 4 Experiment 4 Changed cp to 0.001 3.37 0.962
## 5 Experiment 5 Added maxdepth = 3 4.07 0.946
## 6 Experiment 6 Added minsplit = 20 3.37 0.962
Experiment 6 gave me the same results as Experiment 4: the model produced identical RMSE and R² values, indicating that the best-performing tree already used splits supported by sufficient observations. In fact, minsplit = 20 is the default value in rpart.control(), so this setting reproduced the Experiment 4 configuration exactly, paralleling the Experiment 2 finding about the default cp. The important splits already satisfied this threshold, so the model remained unchanged.
The results of all six experiments can be summarized as follows:

- Experiment 1: baseline
- Experiment 2: confirmed the default cp
- Experiment 3: lower cp improved performance
- Experiment 4: lower cp improved performance again
- Experiment 5: too much simplification hurt
- Experiment 6: tested split stability (no change)

Together, these results indicate that the most effective parameter for this dataset was the complexity parameter (cp). Lowering cp improved model accuracy by allowing meaningful additional splits. However, excessive simplification reduced performance, and increasing minsplit had no further benefit once the model reached a stable structure.
experiment_results %>%
mutate(
RMSE = round(RMSE, 3),
R2 = round(R2, 4)
) %>%
knitr::kable(
caption = "Summary of Decision Tree Experiments",
col.names = c("Experiment", "Parameter Change", "RMSE", "R²"),
align = c("l", "l", "c", "c"),
booktabs = TRUE
)
| Experiment | Parameter Change | RMSE | R² |
|---|---|---|---|
| Experiment 1 | Baseline model - no parameter changed | 4.484 | 0.9335 |
| Experiment 2 | Changed cp to 0.01 | 4.484 | 0.9335 |
| Experiment 3 | Changed cp to 0.005 | 3.426 | 0.9597 |
| Experiment 4 | Changed cp to 0.001 | 3.366 | 0.9617 |
| Experiment 5 | Added maxdepth = 3 | 4.070 | 0.9462 |
| Experiment 6 | Added minsplit = 20 | 3.366 | 0.9617 |
This assignment builds on the dataset I used in Assignment 2, which
examined National Assessment of Educational Progress (NAEP) Grade 8
Mathematics scores by year, jurisdiction, and race/ethnicity. The
objective of this assignment was to predict
average_scale_score using a Decision Tree regression model
and to apply systematic experimentation by changing one parameter at a
time. Model performance was evaluated using RMSE (Root Mean
Squared Error) and R².
Experiment 1 established a baseline model using default
rpart() settings. The hypothesis was that the variables
year, jurisdiction, and
race/ethnicity would predict average scores with reasonable
accuracy because strong trends were identified during the exploratory
data analysis. The model performed strongly, with an RMSE
of \(4.484\) and an R² of
\(0.9335\), showing that the predictors
explained over \(93\%\) of score
variation.
Experiment 2 explicitly set the complexity parameter to
\(cp = 0.01\) to test whether pruning
would improve results. However, the model produced identical results to
Experiment 1. This occurred because \(cp =
0.01\) is already the default setting in rpart().
Although performance did not change, this experiment confirmed the
baseline model configuration and provided an important understanding of
the software defaults.
Experiment 3 reduced the complexity parameter to \(cp = 0.005\). The hypothesis was that
allowing additional splits would capture more meaningful patterns in the
data. This significantly improved performance, reducing
RMSE to \(3.426\) and
increasing R² to \(0.9597\). This suggested that the baseline
model had been slightly over-pruned.
Experiment 4 further reduced the complexity parameter to
\(cp = 0.001\). The model improved
again, though by a smaller margin, achieving an RMSE of
\(3.366\) and an R² of
\(0.9617\). This indicated that
additional tree flexibility still captured useful structure, but the
gains were beginning to level off.
Experiment 5 tested whether simplifying the tree would
maintain strong accuracy by limiting tree depth to
maxdepth = 3. Performance declined, with RMSE
increasing to \(4.070\) and
R² decreasing to \(0.9462\). This showed that some deeper
branches from Experiment 4 were important and that over-simplifying the
model removed meaningful predictive patterns.
Experiment 6 set the minimum split requirement to
minsplit = 20 while keeping the strongest previous
settings. The results were identical to Experiment 4, with an
RMSE of \(3.366\) and an
R² of \(0.9617\). This
indicated that the best-performing tree was already using stable splits
supported by enough observations.
Overall, the most influential parameter in this project was the
complexity parameter (cp). Lowering cp improved performance
by allowing useful additional splits, while excessive simplification
reduced accuracy. The best-performing model was Experiment
4, which was matched by Experiment 6. This demonstrates that
careful experimentation and evidence-based tuning can significantly
improve Decision Tree performance.
GeeksforGeeks. (n.d.). How to choose alpha in cost-complexity pruning? https://www.geeksforgeeks.org/machine-learning/how-to-choose-a-in-cost-complexity-pruning/
Sachdeva, J. (n.d.). Minsplit and minbucket. Medium. https://medium.com/talking-with-data/minsplit-and-minbucket-a49ff56026c8
I used ChatGPT to review my assignment against the requirements and rubric. I pasted my answers, the assignment, and the rubric, and I prompted ChatGPT as follows:
Here is my rmd document and attached is the assignment I am answering plus the rubric. Check if my answers are correct and the flow of my writing.