Maths 248: Final Project

1. Introduction

The Census Income data set, also known as the Adult data set, contains data on various demographic and employment characteristics of individuals from the 1994 U.S. Census. The goal of this project is to predict whether an individual’s income exceeds $\$50k$ annually based on education, age and weekly hours of work.

Potential Research Questions

How does income distribution vary among individuals from different countries of origin?
What is the relationship between race, education, and income level across different countries of origin?
Can we predict an individual’s income level based on education, hours worked per week, and country of origin?
How does race influence the likelihood of earning $> \$50K$ annually, and are these effects consistent across educational levels?
How does income distribution vary between foreign-born and U.S.-born individuals when considering different occupations and levels of education?
What role does the level of education play in achieving higher income levels across various occupations?
How does gender impact income distribution across different levels of education and occupation types?
Is there a significant relationship between age and hours worked per week, and does this relationship vary by income level?
How do capital gain and capital loss influence income, and do these effects vary by educational level?

Description of Data

This data set is obtained from the UCI Machine Learning Repository. There are 48,842 observations, each representing an individual. The data set consists of 15 variables: Age, Work class, Final weight (fnlwgt), Education, Educational number, Marital status, Occupation, Relationship, Race, Gender, Capital gain, Capital loss, Hours per week, Native country and Income.

#load necessary libraries
library(readxl)
library(readr)      
library(knitr)
library(ggplot2)  
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lattice)     
library(rpart)
library(rpart.plot)
library(caret)
set.seed(123)

# Load the data (readxl is already attached above)
adult <- read_xlsx("~/Desktop/Maths 248 Final Project/Adult data.xlsx")
head(adult)

Data Preparation

#Handling missing values
adult <- na.omit(adult)

#Converting variables to factors
adult$income <- factor(ifelse(adult$income == "<=50K", "<=50K", ">50K"))
adult$workclass <- as.factor(adult$workclass)
adult$education <- as.factor(adult$education)
adult$`marital-status` <- as.factor(adult$`marital-status`)
adult$occupation <- as.factor(adult$occupation)
adult$relationship <- as.factor(adult$relationship)
adult$race <- as.factor(adult$race)
adult$gender <- as.factor(adult$gender)
adult$`native-country` <- as.factor(adult$`native-country`)

head(adult, n=3)
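The summary further down still reports 48,842 rows after na.omit(), which suggests that missing entries may be stored as text rather than as NA. A minimal sketch of one way to handle that, assuming missing values are coded as the string "?" (an assumption about this copy of the data; the toy data frame and its values are hypothetical):

```r
# Toy data frame standing in for a few columns of the real data.
# The "?" placeholder is an assumption, not confirmed by the report.
toy <- data.frame(
  workclass  = c("Private", "?", "State-gov"),
  occupation = c("Sales", "?", "?"),
  age        = c(39, 50, 38),
  stringsAsFactors = FALSE
)

# Replace "?" with NA in character columns only; numeric columns are untouched
toy[] <- lapply(toy, function(col) {
  if (is.character(col)) col[col == "?"] <- NA
  col
})

# Count missing values per column, then drop incomplete rows
missing_per_column <- colSums(is.na(toy))
toy_complete <- na.omit(toy)
```

If the real file does use "?", running this kind of recoding before na.omit() would change the number of rows kept, so the counts reported below would need to be regenerated.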

variable_descriptions <- data.frame(
  Variable = c("age", "workclass", "fnlwgt", "education", "educational-num", 
               "marital-status", "occupation", "relationship", "race", 
               "gender", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"),
  Description = c(
    "Age of the individual",
    "Type of employment (e.g., Private, Self-emp, Government)",
    "Population weight, represents the number of people this entry represents",
    "Highest level of education attained",
    "Numeric encoding of education level",
    "Marital status of the individual",
    "Type of occupation",
    "Relationship status within the household",
    "Race of the individual",
    "Gender (Male or Female)",
    "Capital gains from investment",
    "Capital losses from investment",
    "Average number of hours worked per week",
    "Country of origin or residence",
    "Income category of the individual (<=50K or >50K)"

  )
)

# Display the variable descriptions
kable(variable_descriptions)

Variable	Description
age	Age of the individual
workclass	Type of employment (e.g., Private, Self-emp, Government)
fnlwgt	Population weight, represents the number of people this entry represents
education	Highest level of education attained
educational-num	Numeric encoding of education level
marital-status	Marital status of the individual
occupation	Type of occupation
relationship	Relationship status within the household
race	Race of the individual
gender	Gender (Male or Female)
capital-gain	Capital gains from investment
capital-loss	Capital losses from investment
hours-per-week	Average number of hours worked per week
native-country	Country of origin or residence
income	Income category of the individual (<=50K or >50K)

Exploratory Data Analysis

Quantitative Variables Summaries

summary(adult[, c("age", "fnlwgt", "educational-num", "capital-gain", "capital-loss", "hours-per-week", "income")])

##       age            fnlwgt        educational-num  capital-gain  
##  Min.   :17.00   Min.   :  12285   Min.   : 1.00   Min.   :    0  
##  1st Qu.:28.00   1st Qu.: 117550   1st Qu.: 9.00   1st Qu.:    0  
##  Median :37.00   Median : 178144   Median :10.00   Median :    0  
##  Mean   :38.64   Mean   : 189664   Mean   :10.08   Mean   : 1079  
##  3rd Qu.:48.00   3rd Qu.: 237642   3rd Qu.:12.00   3rd Qu.:    0  
##  Max.   :90.00   Max.   :1490400   Max.   :16.00   Max.   :99999  
##   capital-loss    hours-per-week    income     
##  Min.   :   0.0   Min.   : 1.00   <=50K:37155  
##  1st Qu.:   0.0   1st Qu.:40.00   >50K :11687  
##  Median :   0.0   Median :40.00                
##  Mean   :  87.5   Mean   :40.42                
##  3rd Qu.:   0.0   3rd Qu.:45.00                
##  Max.   :4356.0   Max.   :99.00

Average weekly hours distribution

histogram(~`hours-per-week`, data = adult, main = "Average Weekly Hours Distribution", 
          xlab = "Avg Weekly Hours", col = "darkorange", type = "density")

This histogram looks right skewed with a large peak at 40 hours, which makes sense because most individuals work 40 hours per week.

Age Distribution

histogram(~age, data = adult, main = "Age Distribution", xlab = "Age", col = "lightblue", type = "density")

The histogram looks right skewed, where the peak of the distribution is around 20-40 years, with a gradual decline in frequency as age increases. In this data set, most individuals are in the younger to middle age ranges, with fewer individuals in older age groups.

Boxplot of Capital Gain

boxplot(adult$`capital-gain`, main = "Capital Gain", col = "green", horizontal = TRUE)

The boxplot shows that most individuals have little to no capital gain, with a few high outliers stretching far to the right. This distribution is heavily skewed, with the majority concentrated near zero.

Scatter plot between average hours per week and educational completion

xyplot(`hours-per-week`~`educational-num`, data = adult)

There seems to be no strong or clear trend indicating that higher education levels correlate with more or fewer working hours. The points cluster around 40 hours per week, which is the standard full-time working week.

Bar plot of Education Levels

ggplot(adult, aes(y = education)) +
  geom_bar(fill = "blue", color = "black") +
  labs(title = "Education Levels", x = "Count", y = "Education Level") +
  theme_minimal()

The horizontal bar plot displays the distribution of education levels among individuals in the data set. The most common levels are ‘HS-grad’ (High School Graduate) and ‘Some-college,’ indicating that a majority of individuals have completed high school or attended some college.

Income Proportion by Race

ggplot(adult, aes(x = race, fill = as.factor(income))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("lightblue", "orange"), name = "Income Level") +
  labs(title = "Income Proportion by Race", x = "Race", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The stacked bar plot shows the proportion of individuals earning above and below $\$50K$ annually across different racial groups. It suggests that income disparities exist across racial groups, with White and Asian-Pac-Islander individuals being more likely to earn above $\$50K$ than Amer-Indian-Eskimo, Black, and Other racial groups.

2. Methodology

Statistical Model: Decision Tree

A decision tree is a non-parametric supervised learning algorithm used for classification and regression. It is appropriate for exploring relationships between predictors and outcomes. In this project we use a classification tree to predict whether an individual earns $> \$50K$ annually based on age, education level and weekly hours.

Model Assumption

The response variable must be categorical. Annual income is categorized as $\leq \$50K$ or $> \$50K$.
Observations in the dataset are independent of each other.
The tree assumes that the chosen splitting criterion, i.e. the Gini index, effectively separates the classes.

First, we need to simplify the education variable into three labels to improve the tree’s interpretability. Education levels are categorized as Low for basic education or incomplete schooling, Mid for high school, some college or an associate degree, and High for advanced degrees such as Bachelors, Masters, Doctorate and Prof-school. This grouping reduces the complexity of the decision tree, especially since education is at the root node, making it easier to interpret. Having too many education levels can lead to overfitting and add noise, so the grouping may help the model generalize better.

Step 1: Simplify education level

# Simplify education levels
adult$education <- factor(adult$education, levels = c(
  "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "Preschool", # Low
  "HS-grad", "Some-college", "Assoc-acdm", "Assoc-voc",                       # Mid
  "Bachelors", "Masters", "Doctorate", "Prof-school"                         # High
), labels = c(
  rep("Low", 8),    # Map 8 levels to 'Low'
  rep("Mid", 4),    # Map 4 levels to 'Mid'
  rep("High", 4)    # Map 4 levels to 'High'
))

# Verify the simplified levels
table(adult$education)

## 
##   Low   Mid  High 
##  6408 30324 12110

These are the numbers of observations in each of the three education levels.

Step 2: Splitting the data set

We will assign 70% of the data set to the training set, which is used to construct the tree and determine the root node, internal nodes and leaf nodes. 15% of the data set forms the validation set, used to check the pruned tree on data that was not used for training. The complexity parameter itself is chosen by 10-fold cross-validation within the training set. The remaining 15% is the testing set, used to evaluate the final tree’s performance and confirm that it generalizes well to new data.

# Generate indices for splitting
train_index <- sample(1:nrow(adult), size = 0.7 * nrow(adult))
remaining <- setdiff(1:nrow(adult), train_index)

#Split data set for validation and testing
validation_index <- sample(remaining, size = 0.5 * length(remaining))
test_index <- setdiff(remaining, validation_index)

#Create data sets
train_data <- adult[train_index, , drop = FALSE]    
validation_data <- adult[validation_index, , drop = FALSE]
test_data <- adult[test_index, , drop = FALSE]

cat("Training Set:", nrow(train_data), "\n")

## Training Set: 34189

cat("Validation Set:", nrow(validation_data), "\n")

## Validation Set: 7326

cat("Testing Set:", nrow(test_data), "\n")

## Testing Set: 7327
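The set sizes above follow from truncating 0.7 × 48,842 and 0.5 × 14,653 to whole numbers. A minimal sketch that reproduces the sizes and checks that the three index sets form a partition; it uses only row indices (no data), and the explicit floor() calls are an addition for clarity rather than part of the original code:

```r
set.seed(123)   # seed chosen for this sketch only
n <- 48842      # number of observations reported in the data description

# floor() makes the truncation of non-integer sizes explicit
train_index <- sample(seq_len(n), size = floor(0.7 * n))
remaining <- setdiff(seq_len(n), train_index)
validation_index <- sample(remaining, size = floor(0.5 * length(remaining)))
test_index <- setdiff(remaining, validation_index)

split_sizes <- c(train = length(train_index),
                 validation = length(validation_index),
                 test = length(test_index))

# The three index sets should be disjoint and cover every row exactly once
all_index <- c(train_index, validation_index, test_index)
is_partition <- !anyDuplicated(all_index) && length(all_index) == n
```

A check like this is cheap insurance against leakage: if any row appeared in two sets, the validation and test results would be optimistic.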

Step 3: Train the decision tree on training data

tree_model <- rpart(income ~ age + education + `hours-per-week`,
                    data = train_data,
                    method = "class",
                    parms = list(split = "gini"))

print(tree_model)

## n= 34189 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 34189 8209 <=50K (0.7598935 0.2401065)  
##    2) education=Low,Mid 25738 4144 <=50K (0.8389929 0.1610071) *
##    3) education=High 8451 4065 <=50K (0.5189918 0.4810082)  
##      6) age< 29.5 1665  233 <=50K (0.8600601 0.1399399) *
##      7) age>=29.5 6786 2954 >50K (0.4353080 0.5646920)  
##       14) hours-per-week< 42.5 3751 1748 <=50K (0.5339909 0.4660091)  
##         28) hours-per-week< 30.5 599  191 <=50K (0.6811352 0.3188648) *
##         29) hours-per-week>=30.5 3152 1557 <=50K (0.5060279 0.4939721)  
##           58) age< 42.5 1508  623 <=50K (0.5868700 0.4131300) *
##           59) age>=42.5 1644  710 >50K (0.4318735 0.5681265) *
##       15) hours-per-week>=42.5 3035  951 >50K (0.3133443 0.6866557) *

The training set (n = 34189) is first split on education, the split that gives the largest reduction in Gini impurity. Education is therefore the most important predictor in this tree, as the root node splits on it. At the internal nodes, age and hours worked per week further refine the prediction. The leaf nodes give the final predictions: “>50K” (shown as “Yes” in the plot legend) for individuals with income $> \$50K$ and “<=50K” (“No”) for those with $\leq \$50K$.
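To make the Gini criterion concrete, the impurity of the root node and of its two children can be computed by hand from the counts in the printed tree. This is a sketch of the textbook formula (one minus the sum of squared class proportions), not a reproduction of rpart’s internal bookkeeping; the helper name gini is hypothetical:

```r
# Gini impurity for a vector of class counts: 1 - sum(p_k^2)
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts (<=50K, >50K) read from the printed tree (n and loss columns)
root  <- c(34189 - 8209, 8209)   # node 1: all training rows
left  <- c(25738 - 4144, 4144)   # node 2: education = Low, Mid
right <- c(8451 - 4065, 4065)    # node 3: education = High

# Impurity after the split is the size-weighted average of the children
n <- sum(root)
weighted_children <- (sum(left) / n) * gini(left) +
  (sum(right) / n) * gini(right)
gini_reduction <- gini(root) - weighted_children
```

A positive gini_reduction means the education split leaves the two child nodes purer, on average, than the root.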

Step 4: Visualize the Decision Tree

rpart.plot(
  tree_model,                        # Decision tree model
  type = 3, # Display splits and node outcomes
  faclen = 0,                    # Show full names for categorical variables
  varlen = 0,
  extra = 108,                       # Show probabilities and percentages at nodes
  tweak= 1,
  main = "Decision Tree for Income Prediction",  # Add title
  box.palette = c("Blues", "Greens"),# Use distinct colors for outcomes
  shadow.col = "gray",               # Add shadows for depth
  branch.lty = 1,                    # Use solid branch lines
  split.cex = 1.5,                   # Larger text size for split labels
  split.box.col = "lightblue",       # Highlight split boxes
  split.border.col = "black",     # Border color for split boxes
  fallen.leaves = FALSE,              # Ensure better spacing for leaf nodes
  branch.lwd = 2,                     # Thicker branches for clarity
  clip.right.labs = FALSE
)

legend(
  "topright",                            # Position the legend at the top right
  legend = c("No (≤ $50K)", "Yes (> $50K)"), # Legend labels
  fill = c("lightblue", "lightgreen"),   # Colors for the legend
  title = "Annual Income",             # Add title for clarity
  bty = "o",                             # Ensure a full box is drawn
  box.lwd = 2,                           # Set border thickness
  box.col = "black",                     # Set border color to black
  text.col = "black",                    # Set text color
  inset = 0.02,                          # Adjust padding to prevent crowding
  cex = 0.9                                # Adjust font size for readability
)

At the root node, individuals with lower or middle levels of education are predicted to earn $\leq \$50K$ (No) with a probability of 84%, as shown by the corresponding leaf node. Individuals with higher education levels proceed further down the tree, where age and hours worked per week determine whether the prediction is “Yes” ($> \$50K$) or “No”.

At the internal nodes, among highly educated individuals, those younger than 30 are predicted to earn $\leq \$50K$ (No) with an 86% probability. This suggests that, despite their advanced education, younger individuals may lack the necessary work experience or career opportunities to achieve a higher income. For individuals aged 30 or older, the tree examines weekly work hours. Those working 43 or more hours per week are predicted to earn $> \$50K$ (69% probability), and those working fewer than 31 hours are predicted to earn $\leq \$50K$ (68% probability). Among those working between 31 and 42 hours per week, the tree splits again on age: individuals younger than 43 are predicted to earn $\leq \$50K$, while those aged 43 or older are predicted to earn $> \$50K$.

The tree highlights the role of education, age, and work hours in predicting income. Higher education correlates with a greater likelihood of earning $> \$50K$ annually. Moreover, among highly educated individuals, older age and longer working hours per week increase the probability of earning $> \$50K$. These patterns are consistent with age reflecting accumulated experience and with longer work hours being associated with greater earnings.
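The splits in the printed tree can also be read as a chain of if/else rules. The function below is a hand transcription of those splits for a single individual; its name and arguments are hypothetical, and it is meant as a reading aid rather than a replacement for predict():

```r
# Hand-written version of the printed tree, for one individual at a time.
# education is "Low", "Mid" or "High"; age in years; hours = hours per week.
predict_income_rule <- function(education, age, hours) {
  if (education %in% c("Low", "Mid")) return("<=50K")  # node 2
  if (age < 29.5) return("<=50K")                       # node 6
  if (hours >= 42.5) return(">50K")                     # node 15
  if (hours < 30.5) return("<=50K")                     # node 28
  if (age < 42.5) "<=50K" else ">50K"                   # nodes 58 and 59
}

# Example: a 45-year-old with a High education level working 40 hours
example_prediction <- predict_income_rule("High", 45, 40)
```

Written this way, it is clear that only three of the six leaves predict the lower class for highly educated individuals, and that every Low or Mid individual is predicted to earn $\leq \$50K$ regardless of age or hours.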

3. Results and Conclusion

Step 5: Evaluation of the unpruned tree on the testing data

#We are evaluating the unpruned tree on the testing data to establish a baseline for how the model performs before pruning.

# Predict on the test set
predictions <- predict(tree_model, newdata = test_data, type = "class")

# Create a confusion matrix
confusion_matrix <- table(Predicted = predictions, Actual = test_data$income)
print("Confusion Matrix:")

## [1] "Confusion Matrix:"

print(confusion_matrix)

##          Actual
## Predicted <=50K >50K
##     <=50K  5184 1140
##     >50K    343  660

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy: ", round(accuracy * 100, 2), "%\n")

## Accuracy:  79.76 %

# Additional evaluation metrics
confusionMatrix(confusion_matrix)

## Confusion Matrix and Statistics
## 
##          Actual
## Predicted <=50K >50K
##     <=50K  5184 1140
##     >50K    343  660
##                                           
##                Accuracy : 0.7976          
##                  95% CI : (0.7882, 0.8067)
##     No Information Rate : 0.7543          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3581          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9379          
##             Specificity : 0.3667          
##          Pos Pred Value : 0.8197          
##          Neg Pred Value : 0.6580          
##              Prevalence : 0.7543          
##          Detection Rate : 0.7075          
##    Detection Prevalence : 0.8631          
##       Balanced Accuracy : 0.6523          
##                                           
##        'Positive' Class : <=50K           
##

The unpruned decision tree achieved an accuracy of 79.76% on the test dataset, correctly classifying the income levels for the majority of individuals. The sensitivity of 93.79% indicates that the model is highly effective in identifying individuals with income $\leq \$50K$, correctly predicting 5,184 out of 5,527 cases in this class. However, the specificity of 36.67% highlights a significant challenge in classifying individuals with income $> \$50K$: only 660 of the 1,800 individuals in this class were identified, while 1,140 were misclassified as $\leq \$50K$. In the other direction, 343 individuals earning $\leq \$50K$ were misclassified as $> \$50K$. This imbalance suggests that the unpruned tree prioritizes the majority class ($\leq \$50K$) at the expense of the minority class, which is a common issue in imbalanced datasets.
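The headline statistics can be recomputed directly from the four cells of the confusion matrix, which helps keep track of which class each number refers to. A minimal base-R sketch, with “<=50K” treated as the positive class as in the caret output above; the variable names are illustrative:

```r
# Test-set confusion matrix from the output above
# (rows = predicted, columns = actual; matrix() fills column by column)
cm <- matrix(c(5184, 343, 1140, 660), nrow = 2,
             dimnames = list(Predicted = c("<=50K", ">50K"),
                             Actual    = c("<=50K", ">50K")))

# Share of actual <=50K cases predicted as <=50K
sensitivity <- cm["<=50K", "<=50K"] / sum(cm[, "<=50K"])
# Share of actual >50K cases predicted as >50K
specificity <- cm[">50K", ">50K"] / sum(cm[, ">50K"])

accuracy <- sum(diag(cm)) / sum(cm)
balanced_accuracy <- (sensitivity + specificity) / 2
```

Each column sums to the number of actual cases in that class (5,527 and 1,800), so the 1,140 in the top-right cell counts higher earners who were predicted as $\leq \$50K$.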

Step 6: Perform cross-validation

#Cross validation will help determine the optimal complexity parameter in order to prune the tree effectively.

# Perform cross-validation on training data to find optimal cp
cv_results <- rpart(
  income ~ education + age + `hours-per-week`,
  data = train_data,
  method = "class",
  xval = 10
)

# Identify the optimal cp from cross-validation
optimal_cp <- cv_results$cptable[which.min(cv_results$cptable[, "xerror"]), "CP"]
cat("Optimal cp", optimal_cp, "\n")

## Optimal cp 0.01

printcp(cv_results)                           # Display cp table

## 
## Classification tree:
## rpart(formula = income ~ education + age + `hours-per-week`, 
##     data = train_data, method = "class", xval = 10)
## 
## Variables actually used in tree construction:
## [1] age            education      hours-per-week
## 
## Root node error: 8209/34189 = 0.24011
## 
## n= 34189 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.053478      0   1.00000 1.00000 0.0096212
## 2 0.031063      2   0.89304 0.89719 0.0092601
## 3 0.013644      3   0.86198 0.86271 0.0091282
## 4 0.010000      5   0.83469 0.85406 0.0090942

plotcp(cv_results, main = "Cross-Validation for Complexity Parameter") # Plot cp vs xerror

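This graph shows the relationship between the complexity parameter (cp) and the cross-validation error. The cp value controls the number of splits in the tree: a lower cp allows more splits. As complexity increases (lower cp), the cross-validation error typically decreases and may eventually rise again due to overfitting. In the table above, xerror is still decreasing at the smallest cp shown, so the optimal cp is 0.01. This is also the default cp in rpart, which is the value tree_model was grown with.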
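Picking the cp with the smallest xerror is one convention. A common alternative is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. The sketch below applies both to the cp table printed above, re-typed by hand as a data frame; it illustrates the alternative rule and is not the selection used in this report:

```r
# cp table copied from the printcp() output above
cp_table <- data.frame(
  CP     = c(0.053478, 0.031063, 0.013644, 0.010000),
  nsplit = c(0, 2, 3, 5),
  xerror = c(1.00000, 0.89719, 0.86271, 0.85406),
  xstd   = c(0.0096212, 0.0092601, 0.0091282, 0.0090942)
)

# Rule used in the report: smallest cross-validated error
best <- which.min(cp_table$xerror)
min_cp <- cp_table$CP[best]

# One-standard-error rule: rows are ordered from simplest to most complex,
# so the first row under the threshold is the simplest acceptable tree
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se_row <- which(cp_table$xerror <= threshold)[1]
one_se_cp <- cp_table$CP[one_se_row]
```

On this table the two rules disagree: the minimum-error rule keeps all 5 splits, while the one-standard-error rule would settle for the 3-split tree, since its xerror is within one standard error of the minimum.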

Step 7: Prune the Tree

pruned_tree <- prune(tree_model, cp = optimal_cp)

# Visualize the pruned tree
rpart.plot(
  pruned_tree,                        # Use pruned decision tree model
  type = 3,                           # Display splits and node outcomes
  faclen = 0,                         # Show full names for categorical variables
  varlen = 0,                         # Show full variable names
  extra = 108,                        # Show probabilities and percentages at nodes
  tweak = 1,                          # Slight size tweak for better visualization
  main = "Pruned Decision Tree for Income Prediction",  # Add title
  box.palette = c("Blues", "Greens"), # Use distinct colors for outcomes
  shadow.col = "gray",                # Add shadows for depth
  branch.lty = 1,                     # Use solid branch lines
  split.cex = 1.5,                    # Larger text size for split labels
  split.box.col = "lightblue",        # Highlight split boxes
  split.border.col = "black",         # Border color for split boxes
  fallen.leaves = FALSE,              # Ensure better spacing for leaf nodes
  branch.lwd = 2,                     # Thicker branches for clarity
  clip.right.labs = FALSE             # Do not clip right labels
)

legend(
  "topright",                            # Position the legend at the top right
  legend = c("No (≤ $50K)", "Yes (> $50K)"), # Legend labels
  fill = c("lightblue", "lightgreen"),   # Colors for the legend
  title = "Annual Income",             # Add title for clarity
  bty = "o",                             # Ensure a full box is drawn
  box.lwd = 2,                           # Set border thickness
  box.col = "black",                     # Set border color to black
  text.col = "black",                    # Set text color
  inset = 0.02,                          # Adjust padding to prevent crowding
  cex = 0.9                              # Adjust font size for readability
)

The pruned tree looks the same as the unpruned tree: the optimal cp (0.01) equals the value at which tree_model was grown, so pruning removes no splits. We still evaluate it with confusion matrices on both the validation and testing data.

Step 8: Evaluation of the pruned tree on validation data

# Evaluate the pruned tree on the validation data
validation_predictions <- predict(pruned_tree, newdata = validation_data, type = "class")
confusion_matrix <- table(Predicted = validation_predictions, Actual = validation_data$income)
print(confusion_matrix)

##          Actual
## Predicted <=50K >50K
##     <=50K  5275 1051
##     >50K    373  627

# Calculate and display accuracy
validation_accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Validation Accuracy: ", validation_accuracy, "\n")

## Validation Accuracy:  0.8056238

confusionMatrix(confusion_matrix)

## Confusion Matrix and Statistics
## 
##          Actual
## Predicted <=50K >50K
##     <=50K  5275 1051
##     >50K    373  627
##                                           
##                Accuracy : 0.8056          
##                  95% CI : (0.7964, 0.8146)
##     No Information Rate : 0.771           
##     P-Value [Acc > NIR] : 3.473e-13       
##                                           
##                   Kappa : 0.3585          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9340          
##             Specificity : 0.3737          
##          Pos Pred Value : 0.8339          
##          Neg Pred Value : 0.6270          
##              Prevalence : 0.7710          
##          Detection Rate : 0.7200          
##    Detection Prevalence : 0.8635          
##       Balanced Accuracy : 0.6538          
##                                           
##        'Positive' Class : <=50K           
##

On the validation dataset, the pruned decision tree achieved an accuracy of 80.56%. The sensitivity remains high at 93.40%, reflecting the model’s ability to correctly identify individuals earning $\leq \$50K$: 5,275 out of 5,648 cases in this class were predicted correctly. The specificity was 37.37%, so 1,051 individuals earning $> \$50K$ were still misclassified as $\leq \$50K$, while 373 individuals earning $\leq \$50K$ were misclassified as $> \$50K$. These figures are close to the test-set baseline from Step 5 but come from a different sample, so the small differences should not be read as an effect of pruning. Performance remains unbalanced across the two income categories.

Step 9: Evaluate the pruned Tree on Test Data

# Predict on the test set using the pruned tree
pruned_predictions <- predict(pruned_tree, newdata = test_data, type = "class")

# Create a confusion matrix
pruned_confusion_matrix <- table(Predicted = pruned_predictions, Actual = test_data$income)
print(pruned_confusion_matrix)

##          Actual
## Predicted <=50K >50K
##     <=50K  5184 1140
##     >50K    343  660

# Calculate accuracy
pruned_accuracy <- sum(diag(pruned_confusion_matrix)) / sum(pruned_confusion_matrix)
cat("Accuracy (Pruned Tree):", pruned_accuracy, "\n")

## Accuracy (Pruned Tree): 0.7975979

confusionMatrix(pruned_confusion_matrix)

## Confusion Matrix and Statistics
## 
##          Actual
## Predicted <=50K >50K
##     <=50K  5184 1140
##     >50K    343  660
##                                           
##                Accuracy : 0.7976          
##                  95% CI : (0.7882, 0.8067)
##     No Information Rate : 0.7543          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3581          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9379          
##             Specificity : 0.3667          
##          Pos Pred Value : 0.8197          
##          Neg Pred Value : 0.6580          
##              Prevalence : 0.7543          
##          Detection Rate : 0.7075          
##    Detection Prevalence : 0.8631          
##       Balanced Accuracy : 0.6523          
##                                           
##        'Positive' Class : <=50K           
##

The pruned tree applied to the test dataset maintained an accuracy of 79.76%, identical to the unpruned tree. Sensitivity was 93.79%, indicating that the pruned model effectively identifies individuals earning $\leq \$50K$, with 5,184 out of 5,527 cases correctly classified. Specificity was 36.67%, which, while still low, is close to the 37.37% on the validation dataset. The model misclassified 1,140 individuals earning $> \$50K$ as $\leq \$50K$ and 343 individuals earning $\leq \$50K$ as $> \$50K$. Pruning at the cross-validated cp left the test predictions unchanged, and the persistent difficulty in classifying the minority class ($> \$50K$) suggests that additional measures, such as addressing class imbalance, are needed to further enhance performance.

Step 10: Conclusion

Using a classification decision tree model, the analysis identified education level as the most significant predictor, followed by age and weekly hours worked. Higher education levels (e.g., Bachelor’s, Master’s, or Doctorate degrees) were strongly associated with a higher likelihood of earning $> \$50K$. Additionally, older individuals and those working longer weekly hours were more likely to achieve higher income levels.

The unpruned decision tree achieved an accuracy of 79.76% on the test dataset, demonstrating strong sensitivity (93.79%) but low specificity (36.67%), meaning it was better at identifying individuals earning $\leq \$50K$ than those earning $> \$50K$. The pruned tree reached 80.56% accuracy on the validation set and the same 79.76% on the test set, so pruning did not change test performance. Despite its strengths, the model struggled with class imbalance, misclassifying many individuals earning $> \$50K$ as $\leq \$50K$. This limitation suggests future improvements could include rebalancing the dataset.

Overall, we can conclude that education, age, and work hours are associated with the likelihood of earning $> \$50K$ annually. However, addressing class imbalance could improve our model and lead to better results.

4. Discussion and critique

This project reinforced how key factors like education, age, and work hours shape income levels, with education being the strongest predictor. Age and work hours added further context, highlighting how experience and effort contribute to increases in income. However, the simplification of the education variable may have contributed to the imbalance in the predicted income classes.

The reason for simplifying is that the original dataset included many unique education levels, which could create a very complex and cluttered tree if used directly. Grouping these levels into broader categories allowed me to create a simpler, more interpretable model. Another reason was to avoid overfitting. Since this was a highly detailed variable, it would have increased the risk of overfitting, where the model learns patterns that are too specific to the training data and fails to generalize to new data. Simplifying reduced the noise.

When building the decision tree, the model combined Low and Mid education levels into one group using the Gini Index. This decision was based on the Gini Index’s ability to measure impurity, which quantifies how mixed the data is. The Gini Index aims to reduce impurity by creating splits that form more homogenous groups. The model determined that combining Low and Mid education levels produced a lower Gini value (less impurity) than splitting them separately.

However, this combination likely contributed to the class imbalance in predictions. By grouping Low and Mid education levels, the model created a larger, dominant class with a broad range of individuals, potentially overshadowing the smaller High education level class. This imbalance may have made it harder for the model to accurately distinguish between individuals earning $> \$50K$ and $\leq \$50K$ within the combined group. Consequently, the model tended to favor the majority class ($\leq \$50K$), leading to misclassification of individuals in the $> \$50K$ category, especially for those with Mid education who might have had earnings closer to the threshold.

Strengths

Easy to Understand: The decision tree model gave straightforward, interpretable results that made it easy to see how the predictors influenced income.
Generalization: Choosing the complexity parameter by cross-validation guards against overfitting, helping the model make reliable predictions on new data.
Focused Insights: The model pinpointed education as the most impactful factor, confirming its importance in income prediction.

Weaknesses

Class Imbalance: The model performed poorly for the minority class ($> \$50K$ earners), often predicting them as earning $\leq \$50K$. This imbalance skewed the results and lowered specificity.
Oversimplification: Decision trees split data into categories, which can miss more complex relationships between predictors.
Overfitting Risk in Initial Model: An unpruned tree can capture noise in the training data, although here pruning at the cross-validated cp left the test accuracy unchanged at 79.76%.

Improvement for the Future

To address class imbalance, techniques like oversampling the minority class ($> \$50K$), undersampling the majority class ($\leq \$50K$), or assigning different weights to income categories could improve the model’s performance. Exploring ensemble methods such as Random Forests or Gradient Boosting could also help, as they combine multiple trees to improve accuracy and can be paired with these imbalance techniques.
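As an illustration of the oversampling idea, the sketch below duplicates randomly chosen minority-class rows until both classes have the same count. It uses a small made-up data frame (the toy values, seed and variable names are all hypothetical) and base R only:

```r
set.seed(123)  # seed chosen for this sketch only

# Toy data with roughly the same imbalance as the income variable
toy <- data.frame(
  income = factor(c(rep("<=50K", 76), rep(">50K", 24))),
  age    = sample(17:90, 100, replace = TRUE)
)

counts <- table(toy$income)
majority <- names(counts)[which.max(counts)]
minority <- names(counts)[which.min(counts)]

# Draw extra minority rows (with replacement) to match the majority count
minority_rows <- which(toy$income == minority)
extra_rows <- sample(minority_rows,
                     size = counts[[majority]] - counts[[minority]],
                     replace = TRUE)

balanced <- toy[c(seq_len(nrow(toy)), extra_rows), ]
balanced_counts <- table(balanced$income)
```

In practice this kind of resampling would be applied to the training set only, so that the validation and test sets keep the original class proportions and still reflect performance on new data.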

Marlyne Nitunga

2024-12-07