Introduction

Data science is a multidisciplinary practice involving a combination of statistical, computational, and domain knowledge to derive significant insights from data. It helps organisations to draw conclusions using historical information and predict future events that are likely to happen. Classification is one of the essential aspects of data science in which observations are put in pre-existing categories.

This research addresses income classification: the objective is to predict whether a person earns more than 50K or at most 50K, based on demographic and employment factors. Because the outcome has two classes, this is a binary classification problem and is well suited to supervised machine learning. Three models, Logistic Regression, Decision Tree and Random Forest, are built and tested to determine which best predicts the income level.

Load Required Packages

# install.packages("ggplot2")
# install.packages("caret")
# install.packages("rpart")
# install.packages("rpart.plot")
# install.packages("randomForest")
# install.packages("pROC")
# install.packages("corrplot")

Load Libraries

library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.5.3
## Loading required package: lattice
library(rpart)
## Warning: package 'rpart' was built under R version 4.5.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.5.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded

Import Dataset

Income <- read.csv("C:/Users/Admin/Downloads/Income.csv")

View Dataset

View(Income)
head(Income)
str(Income)
## 'data.frame':    10000 obs. of  11 variables:
##  $ age           : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass     : chr  " Private" " Private" " Local-gov" " Private" ...
##  $ education     : chr  " 11th" " HS-grad" " Assoc-acdm" " Some-college" ...
##  $ education_num : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status: chr  " Never-married" " Married-civ-spouse" " Married-civ-spouse" " Married-civ-spouse" ...
##  $ occupation    : chr  " Machine-op-inspct" " Farming-fishing" " Protective-serv" " Machine-op-inspct" ...
##  $ relationship  : chr  " Own-child" " Husband" " Husband" " Husband" ...
##  $ race          : chr  " Black" " White" " White" " Black" ...
##  $ sex           : chr  " Male" " Male" " Male" " Male" ...
##  $ hours.per.week: int  40 50 40 40 30 30 40 32 40 10 ...
##  $ Income        : chr  " <=50K." " <=50K." " >50K." " >50K." ...
summary(Income)
##       age         workclass          education         education_num  
##  Min.   :17.00   Length:10000       Length:10000       Min.   : 1.00  
##  1st Qu.:28.00   Class :character   Class :character   1st Qu.: 9.00  
##  Median :37.00   Mode  :character   Mode  :character   Median :10.00  
##  Mean   :38.75                                         Mean   :10.07  
##  3rd Qu.:48.00                                         3rd Qu.:12.00  
##  Max.   :90.00                                         Max.   :16.00  
##  marital.status      occupation        relationship           race          
##  Length:10000       Length:10000       Length:10000       Length:10000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      sex            hours.per.week     Income         
##  Length:10000       Min.   : 1.00   Length:10000      
##  Class :character   1st Qu.:40.00   Class :character  
##  Mode  :character   Median :40.00   Mode  :character  
##                     Mean   :40.46                     
##                     3rd Qu.:45.00                     
##                     Max.   :99.00
dim(Income)
## [1] 10000    11

The data used in this study is structured data describing the demographic and occupational features of individuals. Structured data is generally stored in a tabular format, so it can be readily analysed statistically and used in machine learning. The dataset contains 10,000 observations of 11 variables: age, workclass, education, education number, marital status, occupation, relationship, race, sex, hours worked per week and income level. It is secondary data, i.e. it was gathered in the past and is provided here for analysis.

In line with data science principles, the problem must be defined before analysis begins. Here the problem is to build a good predictor of the income category from the available features. The dataset mixes categorical and numeric variables and is suitable for classification. Like most real-world datasets, however, it may contain gaps and inconsistencies that need to be handled before modelling.

Data Understanding & Types

There are two broad categories of data: categorical and quantitative. Categorical data is qualitative, such as sex, occupation and marital status, whereas quantitative data is numerical, such as age and the number of hours worked. In this dataset, the categorical variables are workclass, education, marital status, occupation, relationship, race, sex and income; the numerical variables are age, education number and hours per week.

Understanding data types is vital because statistical techniques and machine learning algorithms require data in the right form. Categorical variables usually need to be converted into factors before modelling, whereas numerical variables can be used directly in predictive models. Knowing how variables are distributed and related also helps in identifying patterns and choosing appropriate modelling methods.

Replace Missing Values

Income[Income == "?"] <- NA
colSums(is.na(Income))
##            age      workclass      education  education_num marital.status 
##              0              0              0              0              0 
##     occupation   relationship           race            sex hours.per.week 
##              0              0              0              0              0 
##         Income 
##              0

Data cleaning is an important phase of the data science process because low-quality data produces unreliable results. The principle of "garbage in, garbage out" is a reminder that flawed input data leads to flawed output.

Remove Missing Values

Income <- na.omit(Income)
dim(Income)
## [1] 10000    11
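
The outputs above report zero NA values and an unchanged row count, which suggests that the comparison with "?" never matched. The values printed by str() carry a leading space (for example " Private"), so the marker is presumably stored as " ?". The following is a minimal sketch of two possible remedies, assuming that diagnosis is correct. It uses small hypothetical objects (toy, csv_text, toy2) rather than the real file; trimws() and the strip.white and na.strings arguments of read.csv() are standard base R.

```r
# Hypothetical toy data mimicking the leading-space formatting seen in str(Income)
toy <- data.frame(
  age = c(25L, 38L, 18L),
  workclass = c(" Private", " ?", " Local-gov"),
  occupation = c(" Sales", " ?", " ?"),
  stringsAsFactors = FALSE
)

# The exact-match comparison finds nothing because the stored value is " ?"
matches_before <- sum(toy == "?")
matches_before

# Option 1: trim whitespace in the character columns, then recode "?" as NA
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], trimws)
toy[toy == "?"] <- NA
na_counts <- colSums(is.na(toy))
na_counts

# Option 2: strip whitespace and recode the marker while importing
csv_text <- "age,workclass\n25, Private\n38, ?\n"
toy2 <- read.csv(text = csv_text, strip.white = TRUE, na.strings = "?")
toy2
```

Applying either option to the real data would change the row count after na.omit() and therefore every downstream result, so the figures reported in this document would need to be regenerated.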

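Missing values in this dataset are marked with a question mark. The code above recodes "?" as NA and then drops incomplete rows with na.omit(). In this run, however, colSums(is.na(Income)) reports zero missing values in every column and the dataset still has 10,000 rows, so no observations were actually removed. The likely reason is that the character values carry a leading space (for example " Private"), so the marker is stored as " ?" and is not matched by the comparison with "?". Removing incomplete observations is intended to improve the reliability of the models, but here the marker stays in the data as an ordinary category.

Convert Variables into Factors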

Income$workclass <- as.factor(Income$workclass)
Income$education <- as.factor(Income$education)
Income$marital.status <- as.factor(Income$marital.status)
Income$occupation <- as.factor(Income$occupation)
Income$relationship <- as.factor(Income$relationship)
Income$race <- as.factor(Income$race)
Income$sex <- as.factor(Income$sex)
Income$Income <- as.factor(Income$Income)

Check Structure Again

str(Income)
## 'data.frame':    10000 obs. of  11 variables:
##  $ age           : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
##  $ education     : Factor w/ 16 levels " 10th"," 11th",..: 2 12 8 16 16 1 12 15 16 6 ...
##  $ education_num : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
##  $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
##  $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
##  $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
##  $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 2 2 2 1 2 ...
##  $ hours.per.week: int  40 50 40 40 30 30 40 32 40 10 ...
##  $ Income        : Factor w/ 2 levels " <=50K."," >50K.": 1 1 2 2 1 1 1 2 1 1 ...

Categorical variables were converted to factors so that models such as logistic regression and decision trees can handle them. Note that " ?" appears among the levels of workclass and occupation, so unknown values are treated as a category of their own.

The dataset is later split 70:30 into training and testing sets. The models are fitted on the training set and their performance is measured on the testing set. Evaluating on unseen data shows how well the models generalise and helps to detect overfitting.

Income Distribution Plot

Exploratory Data Analysis (EDA) is a basic technique for understanding the structure, trends and associations within a dataset. It is useful for identifying patterns, detecting anomalies, checking assumptions and informing the choice of models.

ggplot(Income, aes(x = Income)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Income Classes",
       x = "Income Category",
       y = "Count")

The distribution of income classes shows a marked imbalance between the two groups. Most observations fall in the lower-income group (at most 50K) and far fewer in the higher-income group (more than 50K); judging by the class totals in the confusion matrices reported later, roughly three quarters of the observations are in the lower class. This imbalance may affect model performance because it can skew predictions towards the majority class. Evaluation metrics beyond accuracy, such as precision, recall and F1-score, are therefore needed to judge the models. Knowing the class distribution also helps in choosing appropriate algorithms.

Age Distribution Plot

ggplot(Income, aes(x = age)) +
  geom_histogram(fill = "orange", bins = 20) +
  labs(title = "Distribution of Age",
       x = "Age",
       y = "Count")

The age distribution is concentrated in the working-age range, especially among younger and middle-aged individuals: the median age is 37 and half of the observations lie between 28 and 48. The frequency falls away at older ages, with few observations among the elderly (the maximum is 90). This shape is typical of labour-force demographics in economic datasets. Age is a potentially important predictor because it is related to employment experience and earning capacity, and the distribution gives a first indication of how age and income may be related.

Hours Worked Per Week Plot

ggplot(Income, aes(x = hours.per.week)) +
  geom_histogram(fill = "maroon", bins = 20) +
  labs(title = "Distribution of Hours Worked Per Week",
       x = "Hours Per Week",
       y = "Count")

The distribution of working hours is highly concentrated around standard full-time employment: both the first quartile and the median are 40 hours per week, and the third quartile is 45. Observations well below this level indicate part-time employment, and those well above it indicate extended working hours. Because working hours vary, they may affect income. The extreme values (a minimum of 1 and a maximum of 99 hours) are possible outliers that should be kept in mind during analysis. Understanding these patterns helps in relating labour input to the income classification.

Income by Sex Plot

ggplot(Income, aes(x = sex, fill = Income)) +
  geom_bar(position = "dodge") +
  labs(title = "Income by Sex",
       x = "Sex",
       y = "Count")

The distribution of income by sex shows clear differences between the two groups. Males appear more often in the higher-income class than females, which points to possible inequities in the dataset. Such differences may reflect structural inequalities in the labour market. Income differences by sex can affect both the predictions of a model and any conclusions about fairness, so the variable must be used and interpreted carefully to avoid reinforcing bias. Being aware of these differences supports an ethical analysis and the principle of fairness in machine learning models.
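
A dodged bar chart shows counts, which can hide differences in group size. Within-group shares are often easier to compare. The sketch below uses hypothetical counts (the toy data frame is not taken from the Income data) and base R's table() and prop.table() to compute the share of each income class within each sex.

```r
# Hypothetical counts; not taken from the Income data
toy <- data.frame(
  sex = rep(c("Female", "Male"), times = c(40, 60)),
  income = c(rep(c("<=50K", ">50K"), times = c(35, 5)),
             rep(c("<=50K", ">50K"), times = c(42, 18)))
)

# Cross-tabulate sex against income class
counts <- table(toy$sex, toy$income)

# Row-wise proportions: share of each income class within each sex
row_share <- prop.table(counts, margin = 1)
round(row_share, 3)
```

In ggplot2, geom_bar(position = "fill") gives a comparable proportional view directly in the chart.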

Education vs Income Plot

ggplot(Income, aes(x = education, fill = Income)) +
  geom_bar(position = "dodge") +
  labs(title = "Income by Education",
       x = "Education",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90))

The bar chart of education level against income category suggests a close relationship between higher education and higher earnings. People with higher qualifications are more likely to fall in the higher-income group, whereas those with lower levels of education are found mainly in the lower-income group. Education is an essential factor in career development and employment opportunities, and these observations suggest that it is a useful predictor. The association also helps when interpreting feature importance and the classification models.

Occupation vs Income Plot

ggplot(Income, aes(x = occupation, fill = Income)) +
  geom_bar(position = "dodge") +
  labs(title = "Income by Occupation",
       x = "Occupation",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90))

The comparison of income across occupations shows large differences in earning capacity between job groups. Some occupations have a greater share of people earning more than 50K, which reflects more favourable pay structures. Others are linked mainly to the lower-income class, which suggests limited prospects for high wages. Occupation reflects differences in skill requirements, experience and market demand, all of which influence the income patterns observed in the dataset. Understanding these occupational differences gives insight into the underlying economic structure and supports predictive modelling.

Hours Worked by Income Plot

ggplot(Income, aes(x = Income, y = hours.per.week, fill = Income)) +
  geom_boxplot() +
  labs(title = "Hours Worked Per Week by Income",
       x = "Income Category",
       y = "Hours Per Week")

The boxplot compares weekly working hours across the income categories. People in the higher-income group tend to work more hours, which suggests a relationship between hours worked and pay. There is nevertheless considerable variation within both groups, so factors other than working hours also influence earnings. Some people earn a higher income while working moderate hours, which points to the role of skills and occupation. Understanding this relationship helps in interpreting labour dynamics in the predictive models.

Correlation Matrix

numeric_data <- Income[, c("age", "education_num", "hours.per.week")]
cor_matrix <- cor(numeric_data)
print(cor_matrix)
##                        age education_num hours.per.week
## age            1.000000000   0.005501721     0.08278719
## education_num  0.005501721   1.000000000     0.13850808
## hours.per.week 0.082787192   0.138508080     1.00000000
corrplot(cor_matrix, method = "color", addCoef.col = "black")

The correlation matrix shows the strength and direction of the linear relationships among the three numerical variables. All correlations are weak and positive: 0.139 between education number and hours per week, 0.083 between age and hours per week, and 0.006 between age and education number. The numerical predictors therefore carry little redundant information, and multicollinearity among them is low, which supports stable and interpretable coefficient estimates. The matrix supports including all three numerical variables in the classification models. It covers only the numerical variables, however, and says nothing about relationships among the categorical ones; for example, education and education number appear to encode the same information.
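
To make the coefficient concrete, the sketch below computes the Pearson correlation by hand and compares it with cor(). The ten values of age and hours.per.week are the first rows printed by str(Income) above, so the result describes only those ten rows, not the full dataset. The helper name pearson is introduced here for illustration.

```r
# First ten rows of age and hours.per.week, as printed by str(Income)
age <- c(25, 38, 28, 44, 18, 34, 29, 63, 24, 55)
hours <- c(40, 50, 40, 40, 30, 30, 40, 32, 40, 10)

# Pearson correlation: co-variation scaled by the spread of both variables
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson(age, hours)
r_builtin <- cor(age, hours)
c(manual = r_manual, builtin = r_builtin)
```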

Split Data into Training and Testing Sets

set.seed(123)

train_index <- createDataPartition(Income$Income, p = 0.7, list = FALSE)

train_data <- Income[train_index, ]
test_data <- Income[-train_index, ]
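
According to the caret documentation, createDataPartition() samples within the levels of a factor outcome, so both partitions keep roughly the same class balance. The base R sketch below illustrates that idea on a hypothetical outcome vector (y); it is an illustration of stratified sampling, not caret's actual implementation.

```r
set.seed(123)

# Hypothetical outcome vector with a 3:1 class ratio
y <- factor(rep(c("<=50K", ">50K"), times = c(750, 250)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- y[train_idx]
test_y <- y[-train_idx]

# Class proportions are the same in both partitions
prop.table(table(train_y))
prop.table(table(test_y))
```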

# Drop unused levels in the training set and align the test set to the
# training levels, one categorical column at a time
factor_cols <- c("workclass", "education", "marital.status", "occupation",
                 "relationship", "race", "sex", "Income")

for (col in factor_cols) {
  train_data[[col]] <- factor(train_data[[col]])
  test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
}
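
Aligning the test levels to the training levels has a side effect worth knowing: any value that occurs only in the test set becomes NA. The sketch below shows this with hypothetical vectors (train_wc, test_raw); the category names are only examples.

```r
# Hypothetical values; " Never-worked" occurs only in the test vector
train_wc <- factor(c(" Private", " Local-gov", " Private"))
test_raw <- c(" Private", " Never-worked")

# Re-encode the test values against the training levels
test_wc <- factor(test_raw, levels = levels(train_wc))
test_wc

# The unseen category has been turned into NA
n_unseen <- sum(is.na(test_wc))
n_unseen
```

This may explain why the random forest confusion matrix reported later sums to 2,998 while the other two sum to 2,999: a single test row with an unseen level would yield a missing prediction. That reading is an inference from the printed totals, not something the outputs state directly.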

Logistic Regression Model

# Build model
log_model <- glm(Income ~ age + workclass + education + education_num +
                   marital.status + occupation + relationship + race +
                   sex + hours.per.week,
                 data = train_data,
                 family = "binomial")

# Predict
log_pred_prob <- predict(log_model, test_data, type = "response")

# Convert probabilities to class labels using a 0.5 threshold
log_pred <- rep(levels(test_data$Income)[1], length(log_pred_prob))
log_pred[log_pred_prob > 0.5] <- levels(test_data$Income)[2]

log_pred <- factor(log_pred, levels = levels(test_data$Income))

# Confusion matrix: predictions in rows, actual classes in columns
cm_log <- table(log_pred, test_data$Income)
cm_log
##          
## log_pred   <=50K.  >50K.
##    <=50K.    2113    304
##    >50K.      179    403
TN <- cm_log[1,1]
FP <- cm_log[2,1]
FN <- cm_log[1,2]
TP <- cm_log[2,2]

log_acc <- (TP + TN) / sum(cm_log)
precision_log <- TP / (TP + FP)
recall_log <- TP / (TP + FN)
f1_log <- 2 * (precision_log * recall_log) / (precision_log + recall_log)

log_acc
## [1] 0.8389463
precision_log
## [1] 0.6924399
recall_log
## [1] 0.5700141
f1_log
## [1] 0.6252909

Logistic regression is a supervised learning algorithm used for binary classification. It estimates the probability of the positive class with the logistic function, handles both quantitative and categorical predictors, and gives interpretable coefficients.

The confusion matrix compares the predicted and actual income categories on the test set. The model classifies 2,516 of 2,999 test observations correctly (accuracy 0.839). Misclassifications occur in both directions, with 304 high earners predicted as low income and 179 low earners predicted as high income, which indicates a limit in capturing more complex patterns. The model identifies the majority class better than the minority class: precision for the >50K class is 0.692 but recall is only 0.570, so about 43% of high earners are missed. This is consistent with the class imbalance. Looking at several metrics together therefore gives a fuller picture of model effectiveness than accuracy alone.
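
The same four metrics are computed separately for each model in this report. One way to avoid the repetition is a small helper function. The sketch below is illustrative: classification_metrics is a name introduced here, and it assumes a 2x2 table with predictions in rows, actual classes in columns and the positive class in second position. The counts are copied from the logistic regression confusion matrix above.

```r
# Hypothetical helper: metrics from a 2x2 table with predictions in rows,
# actual classes in columns, and the positive class listed second
classification_metrics <- function(cm) {
  stopifnot(all(dim(cm) == c(2, 2)))
  tn <- cm[1, 1]
  fp <- cm[2, 1]
  fn <- cm[1, 2]
  tp <- cm[2, 2]
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(accuracy = (tp + tn) / sum(cm),
    precision = precision,
    recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

# Counts copied from the logistic regression confusion matrix (filled by column)
cm_example <- matrix(c(2113, 179, 304, 403), nrow = 2,
                     dimnames = list(Predicted = c("<=50K", ">50K"),
                                     Actual = c("<=50K", ">50K")))

metrics_log <- classification_metrics(cm_example)
round(metrics_log, 4)
```

The rounded values reproduce the accuracy, precision, recall and F1-score reported for logistic regression, which confirms the orientation of the table.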

Decision Tree Model

tree_model <- rpart(Income ~ age + workclass + education + education_num +
                      marital.status + occupation + relationship + race +
                      sex + hours.per.week,
                    data = train_data,
                    method = "class")

rpart.plot(tree_model)

The decision tree diagram depicts the hierarchical decision procedure used for classification. Each internal node represents a split on a feature that helps to predict the income category, so the tree shows directly how the variables affect the outcome. Features used near the top of the tree are the most important for the decision. The model is transparent and its classification rules are easy to follow. Complex trees may, however, fit noise in the training data.
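
For classification trees, rpart() uses the Gini index as its default splitting criterion. The sketch below shows how Gini impurity and the impurity decrease of one candidate split can be computed. The functions gini and gini_gain and the class counts are hypothetical illustrations; they are not taken from the fitted tree and do not reproduce rpart's internal code.

```r
# Gini impurity of a node, given the class counts in that node
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Impurity decrease when a parent node is split into two children
gini_gain <- function(parent, left, right) {
  n <- sum(parent)
  gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
}

# Hypothetical counts in the order (low income, high income)
parent <- c(75, 25)
left <- c(60, 5)
right <- c(15, 20)

gini(parent)
gini_gain(parent, left, right)
```

A split is attractive when the children are purer than the parent, that is, when the gain is large.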

Decision Tree Predictions

tree_pred <- predict(tree_model, test_data, type = "class")

table(Predicted = tree_pred, Actual = test_data$Income)
##          Actual
## Predicted  <=50K.  >50K.
##    <=50K.    2111    319
##    >50K.      181    388
tree_acc <- mean(tree_pred == test_data$Income)
tree_acc
## [1] 0.8332778
# Precision
precision_tree <- sum(tree_pred == levels(test_data$Income)[2] & test_data$Income == levels(test_data$Income)[2]) / sum(tree_pred == levels(test_data$Income)[2])

# Recall
recall_tree <- sum(tree_pred == levels(test_data$Income)[2] & test_data$Income == levels(test_data$Income)[2]) / sum(test_data$Income == levels(test_data$Income)[2])

# F1 Score
f1_tree <- 2 * (precision_tree * recall_tree) / (precision_tree + recall_tree)

precision_tree
## [1] 0.6818981
recall_tree
## [1] 0.5487977
f1_tree
## [1] 0.6081505

A decision tree is a classification model that represents decisions in a tree structure. It partitions the data according to feature values, forming branches that end in a predicted class. Decision trees are easy to interpret and handle both categorical and numeric data.

The confusion matrix summarises the performance of the decision tree on the test set. The tree classifies 2,499 of 2,999 observations correctly (accuracy 0.833), with precision 0.682, recall 0.549 and F1-score 0.608 for the >50K class. Its performance is slightly below that of logistic regression on every metric. As with logistic regression, most errors are high earners predicted as low income (319 cases). The gap may be due to overfitting or to the tree's sensitivity to the structure of the data. These results provide a basis for comparison with the other classification methods.

Random Forest Model

# Random Forest Model
rf_model <- randomForest(Income ~ age + workclass + education + education_num +
                          marital.status + occupation + relationship + race +
                          sex + hours.per.week,
                        data = train_data)

# Predictions
rf_pred <- predict(rf_model, test_data)


print(rf_model)
## 
## Call:
##  randomForest(formula = Income ~ age + workclass + education +      education_num + marital.status + occupation + relationship +      race + sex + hours.per.week, data = train_data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.27%
## Confusion matrix:
##          <=50K.  >50K. class.error
##  <=50K.    4869    482  0.09007662
##  >50K.      727    923  0.44060606

Random Forest Predictions

# Confusion matrix on the test set
cm <- table(Predicted = rf_pred, Actual = test_data$Income)
cm
##          Actual
## Predicted  <=50K.  >50K.
##    <=50K.    2075    305
##    >50K.      216    402
# Extract values by position (NOT names)
TN <- cm[1,1]
FP <- cm[2,1]
FN <- cm[1,2]
TP <- cm[2,2]

# Accuracy
rf_acc <- (TP + TN) / sum(cm)

# Precision
precision_rf <- TP / (TP + FP)

# Recall
recall_rf <- TP / (TP + FN)

# F1 Score
f1_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)

rf_acc
## [1] 0.8262175
precision_rf
## [1] 0.6504854
recall_rf
## [1] 0.5685997
f1_rf
## [1] 0.6067925

Random forest is an ensemble learning method that combines many decision trees to improve predictive accuracy and reduce overfitting. For classification, each tree votes and the forest predicts the majority class, which gives a more stable model than a single tree. The fitted forest uses 500 trees with 3 variables tried at each split, and its out-of-bag error estimate of 17.27% is close to the test error of about 17.4%.

The confusion matrix shows the test performance of the random forest by income category: accuracy 0.826, precision 0.650, recall 0.569 and F1-score 0.607. In this run the forest does not outperform the simpler models: its accuracy is slightly below that of logistic regression (0.839) and the decision tree (0.833), although its recall is close to that of logistic regression. Misclassification is concentrated in the minority class, where the out-of-bag class error is 44% compared with 9% for the majority class. The random forest matrix sums to 2,998 observations rather than 2,999, presumably because one test prediction is missing.
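
The voting step can be sketched in a few lines of base R. The votes matrix below is hypothetical (three observations, three trees) and the function majority_vote is introduced only for illustration; a real forest also involves bootstrap sampling and random feature selection, which are omitted here.

```r
# Hypothetical class votes: one row per observation, one column per tree
votes <- matrix(c("<=50K", "<=50K", ">50K",
                  ">50K",  ">50K",  "<=50K",
                  "<=50K", "<=50K", "<=50K"),
                nrow = 3, byrow = TRUE)

# Return the most frequent class among the votes for one observation
majority_vote <- function(row_votes) {
  tab <- table(row_votes)
  names(tab)[which.max(tab)]
}

forest_pred <- apply(votes, 1, majority_vote)
forest_pred
```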

Variable Importance

importance(rf_model)
##                MeanDecreaseGini
## age                   394.43873
## workclass             135.33663
## education             177.20517
## education_num         178.18530
## marital.status        227.63203
## occupation            286.66964
## relationship          299.90138
## race                   49.69275
## sex                    35.79664
## hours.per.week        248.67142
varImpPlot(rf_model)

All models were implemented in R. The glm() function was used for logistic regression, rpart() for the decision tree and randomForest() for the random forest. All models were fitted on the same training set and evaluated on the same testing set to keep the comparison fair. Model effectiveness was assessed with accuracy, precision, recall and F1-score.

The variable importance plot shows the relative contribution of each predictor to income prediction, measured as the mean decrease in Gini impurity. Age has the highest importance (394.4), followed by relationship (299.9), occupation (286.7), hours per week (248.7) and marital status (227.6). Education and education number score about 177 to 178 each, and race (49.7) and sex (35.8) contribute least. This ranking identifies the most influential predictors and improves the interpretability of the model.
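
The ranking is easier to read when the importance values are sorted and expressed as shares of the total. The sketch below does this with the MeanDecreaseGini values copied from the importance(rf_model) output above; the names gini_importance and ranked are introduced here for illustration.

```r
# MeanDecreaseGini values copied from the importance(rf_model) output above
gini_importance <- c(age = 394.43873, workclass = 135.33663,
                     education = 177.20517, education_num = 178.18530,
                     marital.status = 227.63203, occupation = 286.66964,
                     relationship = 299.90138, race = 49.69275,
                     sex = 35.79664, hours.per.week = 248.67142)

# Order the predictors from most to least important
ranked <- sort(gini_importance, decreasing = TRUE)

# Share of the total importance carried by each predictor, in percent
round(100 * ranked / sum(ranked), 1)
```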

ROC Curve and AUC

The ROC curve plots the true positive rate against the false positive rate across all classification thresholds. A curve lying well above the diagonal indicates good discriminative ability, and a larger area under the curve (AUC) indicates better separation of the classes. For logistic regression the AUC is 0.8917, so the model separates the two income classes well. Because the curve covers every threshold, it can also guide the choice of a decision threshold other than 0.5, for example to raise recall for the >50K class.

log_roc <- roc(test_data$Income, log_pred_prob)
## Setting levels: control =  <=50K., case =  >50K.
## Setting direction: controls < cases
plot(log_roc, main = "ROC Curve for Logistic Regression")

auc(log_roc)
## Area under the curve: 0.8917
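
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks (the Mann-Whitney formulation) and tabulates the true and false positive rates at a few thresholds. The scores, labels and the function auc_rank are hypothetical; this is not how pROC is implemented internally.

```r
# Rank-based AUC: probability that a random positive case outranks
# a random negative case
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)  # tied scores receive average ranks
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels
scores <- c(0.10, 0.35, 0.40, 0.80, 0.65, 0.90)
is_positive <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

auc_value <- auc_rank(scores, is_positive)
auc_value

# True and false positive rates at a few candidate thresholds
thresholds <- c(0.3, 0.5, 0.7)
tpr <- sapply(thresholds, function(th) mean(scores[is_positive] > th))
fpr <- sapply(thresholds, function(th) mean(scores[!is_positive] > th))
data.frame(threshold = thresholds, tpr = tpr, fpr = fpr)
```

Lowering the threshold raises the true positive rate at the cost of more false positives, which is the trade-off the ROC curve displays.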

# Create model comparison data from the metrics computed above
model_comparison <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
  Accuracy = round(c(log_acc, tree_acc, rf_acc), 4),
  Precision = round(c(precision_log, precision_tree, precision_rf), 4),
  Recall = round(c(recall_log, recall_tree, recall_rf), 4),
  F1_Score = round(c(f1_log, f1_tree, f1_rf), 4)
)

# Display table
knitr::kable(model_comparison, caption = "Table 1: Performance Comparison of Classification Models")
Table 1: Performance Comparison of Classification Models

Model                Accuracy  Precision  Recall  F1_Score
Logistic Regression    0.8389     0.6924  0.5700    0.6253
Decision Tree          0.8333     0.6819  0.5488    0.6082
Random Forest          0.8262     0.6505  0.5686    0.6068

Feature Importance

The random forest identifies the characteristics that matter most for income prediction. Measured by mean decrease in Gini impurity, the most important variables are age, relationship, occupation, hours worked per week and marital status, followed by the two education variables. These results are broadly consistent with the expectation that experience, type of work and education are associated with higher income.

Critical Evaluation

The models are reasonably effective, but there are several limitations. Unknown values marked with a question mark were not actually removed: no rows were dropped, and the marker remains as a category in workclass and occupation. The classes are imbalanced, which affects model performance, especially recall for the >50K class (0.55 to 0.57 across the three models). Overfitting is another possible problem, particularly for decision trees, which can become too complex; random forest mitigates this through ensemble learning. Further improvements could include cross-validation, hyperparameter optimisation and methods for handling class imbalance such as SMOTE. Adding further features may also improve predictive accuracy.
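
One way to put the reported accuracies in context is to compare them with a trivial classifier that always predicts the majority class. The sketch below derives that baseline from the test-set class totals, taken from the column sums of the logistic regression confusion matrix (2113 + 179 and 304 + 403). The object names are introduced here for illustration.

```r
# Actual class totals in the test set, from the column sums of the
# logistic regression confusion matrix
actual_counts <- c("<=50K" = 2113 + 179, ">50K" = 304 + 403)

# Accuracy of a classifier that always predicts the majority class
baseline_acc <- max(actual_counts) / sum(actual_counts)
round(baseline_acc, 4)

# Reported test accuracies and their margin over the baseline
model_acc <- c(logistic = 0.8389, tree = 0.8333, forest = 0.8262)
round(model_acc - baseline_acc, 4)
```

All three models beat the baseline, but by a smaller margin than the raw accuracy figures might suggest, which is why precision and recall matter for this dataset.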

Ethical Considerations

Ethical considerations are important in data science when working with sensitive information such as income, sex and race. Bias in the data may lead to biased predictions that reinforce existing inequalities. Data privacy must also be taken into account, and regulations such as GDPR should be followed. Personal data must be stored securely and used only for legitimate purposes. To be ethically sound, the use of machine learning models should be transparent, fair and accountable.

Conclusion

Machine learning methods were used to classify individuals by income level. Logistic regression, decision tree and random forest models were built and compared. Logistic regression had the best overall performance, with the highest accuracy (0.839), precision (0.692), recall (0.570) and F1-score (0.625). The decision tree offered interpretability, and the random forest offered the stability of ensemble learning, although it did not improve accuracy in this run. The exploratory data analysis highlighted education, occupation and working hours as factors related to income, which underlines the importance of feature selection in predictive modelling. Despite the limitations discussed above, the models predicted income reasonably well. Future work should improve performance with more sophisticated methods and address the ethical issues of bias and fairness. The study shows that data science techniques can be applied successfully to a real-world classification problem and emphasises the value of structured data analysis.
