WQD7004 PROGRAMMING FOR DATA SCIENCE.
GROUP PROJECT


Group Member Matric No
MUHAMAD ASHRAF BIN MOHAMAD 23050906
NUR NAZIFA BINTI ZHAMRI 22088840
NUR AMIERA BINTI ABDUL RASHID S2043650
NUR AISYAH BINTI ABDUL MANAP S2125518
ABDUL JAMAK BIN ABDUL HADI S2197406


INTRODUCTION

Dataset year: 1994


The “Adult Census Income” dataset comprises 14 input variables of diverse types, including categorical, ordinal, and numerical data. The variables include age, work class, education, marital status, occupation, race, gender, and more.

With a total of 32,561 rows and 15 columns, this dataset captures key details influencing an individual’s annual income, including education level, age, gender, occupation, and other socio-economic attributes. Its primary aim is to analyze and model income based on demographic and socio-economic attributes, supporting predictive analysis and classification tasks. It is a valuable resource for exploring correlations between personal characteristics and income levels.


PROBLEM STATEMENT

To predict whether an individual earns more than $50,000 annually based on socio-demographic attributes.


OBJECTIVES

  1. To identify and analyze socio-demographic features such as education, occupation and age that strongly correlate with income levels.
  2. To develop a predictive model to classify individuals into two income groups: earning more or less than $50,000 annually.
  3. To evaluate the performance of the predictive models.


ADULT HOUSEHOLD INCOME DATASET DETAILS

IMPORTING AND READING DATA

The ‘dplyr’ library is loaded for data manipulation, and the ‘adult.csv’ file is read into a dataframe named ‘adult_income’ using the base R function read.csv().

#Import and read data
library(dplyr)
adult_income <- read.csv("adult.csv")


STRUCTURE OF DATASET

Displays the structure of the ‘adult_income’ dataset, showcasing the variable types and their arrangement.

#Structure of adult income dataset 
str(adult_income)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  90 82 66 54 41 34 38 74 68 41 ...
##  $ workclass     : chr  "?" "Private" "?" "Private" ...
##  $ fnlwgt        : int  77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
##  $ education     : chr  "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
##  $ education.num : int  9 9 10 4 10 9 6 16 9 10 ...
##  $ marital.status: chr  "Widowed" "Widowed" "Widowed" "Divorced" ...
##  $ occupation    : chr  "?" "Exec-managerial" "?" "Machine-op-inspct" ...
##  $ relationship  : chr  "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
##  $ race          : chr  "White" "White" "Black" "White" ...
##  $ sex           : chr  "Female" "Female" "Female" "Female" ...
##  $ capital.gain  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ capital.loss  : int  4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
##  $ hours.per.week: int  40 18 40 40 40 45 40 20 40 60 ...
##  $ native.country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...


DIMENSION OF DATASET

Displays the dimensions (number of rows and columns) of the ‘adult_income’ dataset.

#The dimensions of the dataset
dimensions <- dim(adult_income)
num_rows <- dimensions[1]
num_columns <- dimensions[2]

cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 32561 
## Number of columns: 15


DATASET CONTENT

Shows the first 10 rows of the ‘adult_income’ dataset.

#Content of the dataset
head(adult_income,10)
##    age   workclass fnlwgt    education education.num marital.status
## 1   90           ?  77053      HS-grad             9        Widowed
## 2   82     Private 132870      HS-grad             9        Widowed
## 3   66           ? 186061 Some-college            10        Widowed
## 4   54     Private 140359      7th-8th             4       Divorced
## 5   41     Private 264663 Some-college            10      Separated
## 6   34     Private 216864      HS-grad             9       Divorced
## 7   38     Private 150601         10th             6      Separated
## 8   74   State-gov  88638    Doctorate            16  Never-married
## 9   68 Federal-gov 422013      HS-grad             9       Divorced
## 10  41     Private  70037 Some-college            10  Never-married
##           occupation   relationship  race    sex capital.gain capital.loss
## 1                  ?  Not-in-family White Female            0         4356
## 2    Exec-managerial  Not-in-family White Female            0         4356
## 3                  ?      Unmarried Black Female            0         4356
## 4  Machine-op-inspct      Unmarried White Female            0         3900
## 5     Prof-specialty      Own-child White Female            0         3900
## 6      Other-service      Unmarried White Female            0         3770
## 7       Adm-clerical      Unmarried White   Male            0         3770
## 8     Prof-specialty Other-relative White Female            0         3683
## 9     Prof-specialty  Not-in-family White Female            0         3683
## 10      Craft-repair      Unmarried White   Male            0         3004
##    hours.per.week native.country income
## 1              40  United-States  <=50K
## 2              18  United-States  <=50K
## 3              40  United-States  <=50K
## 4              40  United-States  <=50K
## 5              40  United-States  <=50K
## 6              45  United-States  <=50K
## 7              40  United-States  <=50K
## 8              20  United-States   >50K
## 9              40  United-States  <=50K
## 10             60              ?   >50K


COUNT OF INTEGER AND CHARACTER ATTRIBUTES

Count of columns classified as integers and characters in the ‘adult_income’ dataset.

# Count of integer and character in attributes
data_types <- sapply(adult_income, class)

# Count of integer columns
num_int_columns <- sum(data_types == "integer")

# Count of character columns
num_char_columns <- sum(data_types == "character")

cat("Number of integer columns:", num_int_columns, "\nNumber of character columns:", num_char_columns, "\n")
## Number of integer columns: 6 
## Number of character columns: 9


LIST OF INTEGER AND CHARACTER COLUMNS

List of the names of columns identified as integers and characters in the dataset.

#List of integer and character columns
integer_columns <- names(adult_income)[sapply(adult_income, is.integer)]
char_columns <- names(adult_income)[sapply(adult_income, is.character)]
cat("Integer columns:", paste(integer_columns, collapse = ", "), "\nCharacter columns:", paste(char_columns, collapse = ", "), "\n")
## Integer columns: age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week 
## Character columns: workclass, education, marital.status, occupation, relationship, race, sex, native.country, income


ATTRIBUTES OF THE DATASET

Attributes <- c("workclass", "fnlwgt", "education", "education.num", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country", "income", "age")
Description <- c("Industry of occupation", "Final Weight, which is the number of units in the target population that the responding unit represents", "Level of education", "The level of education in numerical form", "Marital status of an individual", "Type of occupation of an individual", "What this individual is relative to others", "Individual’s race", "Biological sex of the individual", "Capital gains for an individual", "Capital loss for an individual", "The hours an individual has reported to work per week", "Country of origin for an individual", "Whether or not an individual makes more than $50,000 annually", "The age of an individual")

knitr::kable(data.frame(Attributes, Description), col.names = c('Attributes', 'Description'))
Attributes Description
workclass Industry of occupation
fnlwgt Final Weight, which is the number of units in the target population that the responding unit represents
education Level of education
education.num The level of education in numerical form
marital.status Marital status of an individual
occupation Type of occupation of an individual
relationship What this individual is relative to others
race Individual’s race
sex Biological sex of the individual
capital.gain Capital gains for an individual
capital.loss Capital loss for an individual
hours.per.week The hours an individual has reported to work per week
native.country Country of origin for an individual
income Whether or not an individual makes more than $50,000 annually
age The age of an individual


SUMMARY OF DATASET

Generates a statistical summary of the ‘adult_income’ dataset, providing key statistics for each variable/column.

#Summary of the adult income dataset
summary(adult_income)
##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education.num   marital.status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital.gain    capital.loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours.per.week  native.country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00


DATA PROCESSING

DATA CLEANING


CHECKING DATA COMPLETENESS ACROSS ALL ROWS

nrow(adult_income[complete.cases(adult_income),])
## [1] 32561

We can see that all rows in the dataset appear complete. However, we are assuming that the “?” symbols within certain columns represent missing values, as indicated by the dataset’s structure above.
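
To make this assumption concrete, a minimal sketch (assuming “?” is the only placeholder in use) converts “?” to NA on a temporary copy and counts the apparent missing values per column; adult_na is a throwaway name introduced here for illustration.

# Sketch: treat "?" as NA on a temporary copy and count apparent missing values per column
adult_na <- adult_income
adult_na[adult_na == "?"] <- NA
colSums(is.na(adult_na))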


DUPLICATES

In the dataset details, the code indicates that there are 32,561 rows and 15 columns. The code below checks whether there are any duplicates in the dataset.

# Count the number of duplicates
num_duplicates <- sum(duplicated(adult_income))

if (num_duplicates > 0) {
  cat("Found", num_duplicates, "duplicate rows in the dataframe.\n")
} else {
  cat("No duplicated rows found in the dataframe.\n")
}
## Found 24 duplicate rows in the dataframe.


REMOVE THE DUPLICATES

#keeping only the first occurrence
dfunique <- unique(adult_income)
dimensions <- dim(dfunique)
num_rows <- dimensions[1]
num_columns <- dimensions[2]

cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 32537 
## Number of columns: 15

After retaining only the first occurrence of each row with dfunique <- unique(adult_income), now we have 32,537 rows and 15 columns.


DOUBLE-CHECKING FOR DATA INTEGRITY: ENSURING COMPLETE DUPLICATE REMOVAL

duplicated_rows <- dfunique[duplicated(dfunique), ]

# Display duplicated rows
if (nrow(duplicated_rows) > 0) {
  cat("Duplicated rows in the dataframe:\n")
  print(duplicated_rows)
} else {
  cat("No duplicated rows found in the dataframe.\n")
}
## No duplicated rows found in the dataframe.

Duplicates were removed using the unique() function, and a follow-up check on the resulting dataframe (dfunique) found no duplicated rows. This confirms that keeping only the first occurrence of each row effectively eliminated all duplicates from the dataset.
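
As a cross-check, the same de-duplication can be expressed with dplyr, which is already loaded; this is only a sketch, and dfunique_alt is a throwaway name used here for illustration.

# Sketch: dplyr equivalent of unique(), keeping the first occurrence of each row
dfunique_alt <- distinct(adult_income)
nrow(dfunique_alt)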


MISSING VALUES

The function check_missing_values() was used to detect “?” placeholders, treated here as missing values, in each column of the dataframe (dfunique). The columns to inspect are listed in columns_to_check.

check_missing_values <- function(data, column_name) {
  any(data[[column_name]] == "?")
}

columns_to_check <- c("age", "workclass", "fnlwgt", "education", "education.num",
                       "marital.status", "occupation", "relationship", "race",
                       "sex", "capital.gain", "capital.loss", "hours.per.week",
                       "native.country", "income")

for (column in columns_to_check) {
  if (check_missing_values(dfunique, column)) {
    cat(paste("There are missing values in column", column, "\n"))
  } else {
    cat(paste("No missing values in column", column, "\n"))
  }
}
## No missing values in column age 
## There are missing values in column workclass 
## No missing values in column fnlwgt 
## No missing values in column education 
## No missing values in column education.num 
## No missing values in column marital.status 
## There are missing values in column occupation 
## No missing values in column relationship 
## No missing values in column race 
## No missing values in column sex 
## No missing values in column capital.gain 
## No missing values in column capital.loss 
## No missing values in column hours.per.week 
## There are missing values in column native.country 
## No missing values in column income


THE RESULTS OF THIS ANALYSIS SHOWED

Columns with No Missing Values: ‘age’, ‘fnlwgt’, ‘education’, ‘education.num’, ‘marital.status’, ‘relationship’, ‘race’, ‘sex’, ‘capital.gain’, ‘capital.loss’, ‘hours.per.week’, ‘income’ were found to have no missing values.

Columns with Missing Values: ‘workclass’, ‘occupation’, and ‘native.country’ were identified to contain missing values.


IDENTIFYING MISSING VALUES

Checked for the presence of “?” in the ‘workclass’, ‘occupation’, and ‘native.country’ columns.

# Count "?" placeholders per column
missing_counts <- colSums(dfunique[, c("workclass", "occupation", "native.country")] == "?")

The counts of “?” values were 1,836 for ‘workclass’, 1,843 for ‘occupation’, and 582 for ‘native.country’.


NUMBER OF ROWS THAT HAVE MISSING VALUES

Determined the total number of rows containing at least one missing value among ‘workclass’, ‘occupation’, and ‘native.country’

total_empty_rows <- sum(rowSums(dfunique[c("workclass", "occupation", "native.country")] == "?") > 0)
total_empty_rows
## [1] 2398

A total of 2,398 rows contained at least one missing value.


REMOVE ROWS WITH MISSING VALUES

Created a new dataframe df_no_missing by filtering out rows where ‘workclass’, ‘occupation’, and ‘native.country’ columns contained “?”.

df_no_missing <- dfunique %>%
  filter(workclass != "?" & occupation != "?" & native.country != "?")
new_dimensions <- dim(df_no_missing)
num_rows <- new_dimensions[1]
num_columns <- new_dimensions[2]

cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 30139 
## Number of columns: 15

This resulted in a dataframe with 30,139 rows and 15 columns.


DATA TRANSFORMATION


CREATING AGE CATEGORIES

Created a new column ‘age_category’ by categorizing ‘age’ into groups: Under 20, 20-29, 30-39, 40-49, 50-59, and 60 and above.

df_no_missing$age_category <- cut(df_no_missing$age, breaks = c(-Inf, 20, 30, 40, 50, 60, Inf),
                       labels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60 and above"),
                       include.lowest = TRUE)

# Display the updated dataframe
head(df_no_missing)
##   age workclass fnlwgt    education education.num marital.status
## 1  82   Private 132870      HS-grad             9        Widowed
## 2  54   Private 140359      7th-8th             4       Divorced
## 3  41   Private 264663 Some-college            10      Separated
## 4  34   Private 216864      HS-grad             9       Divorced
## 5  38   Private 150601         10th             6      Separated
## 6  74 State-gov  88638    Doctorate            16  Never-married
##          occupation   relationship  race    sex capital.gain capital.loss
## 1   Exec-managerial  Not-in-family White Female            0         4356
## 2 Machine-op-inspct      Unmarried White Female            0         3900
## 3    Prof-specialty      Own-child White Female            0         3900
## 4     Other-service      Unmarried White Female            0         3770
## 5      Adm-clerical      Unmarried White   Male            0         3770
## 6    Prof-specialty Other-relative White Female            0         3683
##   hours.per.week native.country income age_category
## 1             18  United-States  <=50K 60 and above
## 2             40  United-States  <=50K        50-59
## 3             40  United-States  <=50K        40-49
## 4             45  United-States  <=50K        30-39
## 5             40  United-States  <=50K        30-39
## 6             20  United-States   >50K 60 and above


CREATING EDUCATION CATEGORIES

Created a new column ‘education_group’ by categorizing ‘education’ into groups: Primary, Secondary, Tertiary, Doctorate (which also covers Masters and Prof-school), and Other.

dfedu <- df_no_missing %>%
  mutate(education_group = case_when(
    education %in% c('1st-4th', '5th-6th') ~ 'Primary',
    education %in% c('7th-8th', 'HS-grad', '10th', '11th', '12th', '9th') ~ 'Secondary',
    education %in% c('Some-college', 'Bachelors', 'Assoc-voc', 'Assoc-acdm') ~ 'Tertiary',
    education %in% c('Masters', 'Doctorate', 'Prof-school') ~ 'Doctorate',
    TRUE ~ 'Other'  # For any education levels not listed
  ))


Removed the ‘age’ and ‘education’ columns from the dataframe dfedu.

dfedu$age <- NULL
dfedu$education <- NULL
glimpse(dfedu)
## Rows: 30,139
## Columns: 15
## $ workclass       <chr> "Private", "Private", "Private", "Private", "Private",…
## $ fnlwgt          <int> 132870, 140359, 264663, 216864, 150601, 88638, 422013,…
## $ education.num   <int> 9, 4, 10, 9, 6, 16, 9, 16, 15, 13, 14, 15, 7, 14, 13, …
## $ marital.status  <chr> "Widowed", "Divorced", "Separated", "Divorced", "Separ…
## $ occupation      <chr> "Exec-managerial", "Machine-op-inspct", "Prof-specialt…
## $ relationship    <chr> "Not-in-family", "Unmarried", "Own-child", "Unmarried"…
## $ race            <chr> "White", "White", "White", "White", "White", "White", …
## $ sex             <chr> "Female", "Female", "Female", "Female", "Male", "Femal…
## $ capital.gain    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss    <int> 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3004, 2824, …
## $ hours.per.week  <int> 18, 40, 40, 45, 40, 20, 40, 35, 45, 20, 55, 40, 76, 50…
## $ native.country  <chr> "United-States", "United-States", "United-States", "Un…
## $ income          <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "…
## $ age_category    <fct> 60 and above, 50-59, 40-49, 30-39, 30-39, 60 and above…
## $ education_group <chr> "Secondary", "Secondary", "Tertiary", "Secondary", "Se…


Rename the dataframe to dfclean

dfclean <- dfedu


The cleaned and transformed dataframe (dfclean, copied from dfedu above) is then exported to a CSV file.

write.csv(dfclean, "dfclean.csv")
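
Note that write.csv() includes row names by default, so an extra index column (read back as ‘X’) appears when the file is re-imported; this column later surfaces in the random forest importance table. If that side effect is unwanted, a sketch of the export with row names suppressed:

# Sketch: export without row names to avoid an extra "X" index column on re-import
write.csv(dfclean, "dfclean.csv", row.names = FALSE)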


EXPLORATORY DATA ANALYSIS


DISTRIBUTION OF INCOME

Based on the dataset, the ‘income’ column represents the annual earnings of adults, categorized into two classes: ‘<=50K’ for adults earning up to and including 50K US Dollars (USD) and ‘>50K’ for those earning more than 50K USD.

library(dplyr)
library(ggplot2)
income_data <- dfclean %>%
    group_by(income) %>%
    summarise(count = n()) %>%
    mutate(percentage = count / sum(count) * 100)

ggplot(income_data, aes(x = income, y = count)) + 
    geom_bar(stat = "identity", fill = "purple", color = "black") +
    geom_text(aes(label = count), vjust = -0.5) +
    geom_text(aes(y = count/2, label = paste0(round(percentage, 1), "%")), vjust = 0.5, color = "black") +
    theme_light() +
    labs(title = "Distribution of Income", x = "Income Category", y = "Count")

According to the bar chart, over 75% of adults earn 50K USD or less, a substantially larger share than the roughly 25% who earn more than 50K USD.


DISTRIBUTION OF WORK CLASS OF ADULTS

## Convert table into dataframe to manipulate data summary
workclass_summary <- as.data.frame(table(dfclean$workclass))

## Add percentage values 
workclass_summary <- transform(workclass_summary, Percentage = Freq / sum(Freq) * 100)

## Plot
ggplot(workclass_summary, aes(x = reorder(Var1, -Freq), y = Freq)) +
    geom_bar(stat = "identity", fill = "purple", color = "black") +
    geom_text(aes(label = paste0(round(Percentage, 1), "%")), 
              vjust = -0.5, size = 3) +
    labs(title = "Distribution of Work Class",
         x = "Work Class",
         y = "Count") +
    theme_light() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

The private sector is the predominant source of employment in this dataset, accounting for a significant majority of 73.9%. Self-employed individuals fall into two categories: those who are not incorporated (self-emp-not-inc) at 8.3% and those who are incorporated (self-emp-inc) at 3.6%, indicating that more individuals in the self-employed group operate businesses without forming corporations. The plot also shows essentially no individuals in the ‘Without-pay’ group, indicating that the workers in the dataset are all compensated.


DISTRIBUTION OF WORKCLASS BY INCOME CATEGORY

## Summarize income and workclass data and arrange in descending order
summary_data <- dfclean %>%
    group_by(workclass, income) %>%
    summarise(count = n()) %>%
    arrange(desc(count))
## `summarise()` has grouped output by 'workclass'. You can override using the
## `.groups` argument.
# Plot
ggplot(summary_data, aes(x = reorder(workclass, -count), y = count, fill = income)) +
    geom_bar(stat = "identity", position = "dodge", color = "black") +
    labs(x = "Workclass", y = "Count", fill = "Income") +
    theme_light() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_fill_manual(values = c("<=50K" = "purple", ">50K" = "green"))

The private sector has the highest count of individuals in both income categories, with a significantly larger number earning 50K or less compared to those earning more than 50K. Among the self-employed, the self-emp-not-inc group has a higher count of individuals earning 50K or less, whereas incorporated self-employed individuals (self-emp-inc) show a more balanced distribution, with a slightly higher count earning more than 50K. Across the local, state, and federal government categories, more individuals earn less than or equal to 50K, although the gap within federal government jobs is less pronounced, implying a more uniform salary structure.
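
The within-group balance described above can be quantified with a short summary. This is a sketch using the already-loaded dplyr; prop_above_50k is an illustrative column name.

# Sketch: share of ">50K" earners within each work class
dfclean %>%
    group_by(workclass) %>%
    summarise(prop_above_50k = mean(income == ">50K")) %>%
    arrange(desc(prop_above_50k))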


DISTRIBUTION OF SEX, RACE AND AGE CATEGORIES

options(repos = "https://cloud.r-project.org/")
install.packages("tidyverse")
## 
## The downloaded binary packages are in
##  /var/folders/7x/7k2hn6795vn9v8c1v0chmch40000gn/T//RtmpQOm53O/downloaded_packages
library(tidyverse)
df <- dfclean
# Function to create a pie chart with labeled sections, percentage
create_pie_chart <- function(data, category, title) {
  # First, create a summarized version of the data with counts
  data_sum <- data %>%
    group_by(!!sym(category)) %>%
    summarise(count = n()) %>%
    mutate(perc = count / sum(count) * 100)

  # Now, create the pie chart
  ggplot(data_sum, aes(x = "", y = count, fill = !!sym(category))) +
    geom_bar(width = 1, stat = "identity") +
    coord_polar("y") +
    theme_void() +  # Remove background
    geom_text(aes(label = paste0(round(perc, 1), "%")), 
              position = position_stack(vjust = 0.5), 
              size = 2.2) +
    labs(title = title, fill = category) +
    scale_fill_brewer(palette = "Accent")
}

# Pie chart for sex
pie_sex <- create_pie_chart(df, "sex", "Distribution of Sex")

# Pie chart for race
pie_race <- create_pie_chart(df, "race", "Distribution of Race")
# Print or save the plots
print(pie_sex)

Distribution of sex - 67.6% of the respondents are male, while the remaining 32.4% are female.


print(pie_race)

Distribution of race - The majority of respondents are White adults at 86%, while the smallest share, less than 1%, comes from the Amer-Indian-Eskimo group.


ggplot(df, aes(x = age_category)) +
  geom_bar(stat = "count", fill = "#69b3a2") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +  # Use after_stat(count) for labels
  labs(title = "Distribution of Age Categories", x = "Age Category", y = "Count") +
  theme_minimal()

The age categories are ordered to showcase the distribution of respondents across age groups. Based on the data transformation conducted earlier, the respondents follow an approximately normal (bell-shaped) distribution across the age groups.


AGE CATEGORIES AND INCOME

# Group data by age_category and income
grouped_data <- df %>%
  group_by(age_category, income) %>%  # Group by age_category and income
  summarise(n = n(), .groups = "drop")  # Explicitly set .groups to "drop" to remove all grouping

# Create the side-by-side bar chart
ggplot(grouped_data, aes(x = age_category, y = n, fill = income)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.5) +  # Add text labels
  labs(x = "Age Category", y = "Count", title = "Distribution of Age by Income") +
  theme_minimal() +
  theme(legend.position = "bottom")

Based on the chart above, the “<=50K” income group is largest among adults aged 20-29, while the “>50K” group peaks in the 40-49 age category.


OCCUPATIONS AND INCOME

# Exploratory Data Analysis (EDA) for Occupation
occupation_counts <- df %>%
  group_by(occupation) %>%
  summarise(count = n()) %>%
  mutate(percentage = (count / sum(count)) * 100) %>%
  arrange(desc(count))

# Plotting occupation distribution
ggplot(occupation_counts, aes(x = reorder(occupation, -count), y = percentage, fill = occupation)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Percentage of Population by Occupation",
       x = "Occupation",
       y = "% of Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set3") +
  geom_text(aes(label = sprintf("%.2f%%", percentage)), vjust = -0.5, size = 3) +
  coord_flip()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors

In managerial and professional-specialty occupations, well over a quarter of individuals earn above $50K; these higher earners alone constitute a substantial 13% of our dataset. In contrast, private household services and the armed forces show the lowest income brackets. Exploring the demographics within these occupations, together with key factors such as education and work hours, could provide crucial insight into these income disparities.
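
To support this comparison, the share of ‘>50K’ earners within each occupation can be tabulated. This is a sketch; count and prop_above_50k are illustrative column names.

# Sketch: count and share of ">50K" earners by occupation
df %>%
    group_by(occupation) %>%
    summarise(count = n(), prop_above_50k = mean(income == ">50K")) %>%
    arrange(desc(prop_above_50k))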


INCOME DISTRIBUTION BY SEX

data <-dfclean

# Create separate dataframes for income above and below $50,000 by sex
above_50k_sex <- data %>%
  filter(income == ">50K") %>%
  count(sex)

below_50k_sex <- data %>%
  filter(income == "<=50K") %>%
  count(sex)

# Merge the data into a single dataframe
income_by_sex <- merge(above_50k_sex, below_50k_sex, by = "sex", suffixes = c("_above_50k", "_below_50k"))

# Reshape the data for plotting
income_by_sex_long <- income_by_sex %>%
  pivot_longer(cols = -sex, names_to = "income_category", values_to = "count")

# Plotting income distribution by sex
ggplot(income_by_sex_long, aes(x = sex, y = count, fill = income_category)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  labs(title = "Income Distribution by Sex",
       x = "Sex",
       y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

The EDA reveals a significant disparity in income distribution between the sexes. Notably, 13,972 males and 8,661 females earn at or below $50K annually, i.e. about 61% males versus 39% females. Conversely, among individuals earning above $50K, about 85% are male and only 15% are female. This disparity highlights a considerable association between sex and income level within the dataset.
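
The percentages quoted above can be reproduced directly from the counts. A sketch of the calculation follows; share is an illustrative column name.

# Sketch: share of each sex within each income category
data %>%
    group_by(income, sex) %>%
    summarise(count = n(), .groups = "drop_last") %>%
    mutate(share = round(count / sum(count) * 100, 1))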


DISTRIBUTION OF EDUCATION AND INCOME CATEGORIES

education_order <- c("Primary", "Secondary", "Tertiary", "Doctorate", "Other")
df$education_group <- factor(df$education_group, levels = education_order)

# Group data by education and income
grouped_data <- df %>% 
  group_by(education_group, income) %>%  # Group by education and income
  summarise(n = n(), .groups = "drop")  # Explicitly set .groups to "drop" to remove all grouping

# Create the side-by-side bar chart
ggplot(grouped_data, aes(x = education_group, y = n, fill = income)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.5) +  # Add text labels
  labs(x = "Education", y = "Count", title = "Distribution of Education by Income") +
  theme_minimal() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))  # Adjust angle of x-axis labels

Furthermore, the “>50K” income group is largest among adults with tertiary education.


SUMMARY OF EXPLORATORY DATA ANALYSIS

In summary, these are a few insights that we would like to highlight based on our findings from the EDA:

Work Sector Dominance: The private sector employs a sizable fraction of the adult labor force. This suggests that the private sector dominates the labor market.

Distribution of Gender and Race in the Workforce: Males and White people make up a larger portion of the workforce, which may indicate differences in employment possibilities or preferences based on gender and race.

Age-Related Income Trends: The income levels show a discernible age-related pattern. While younger persons, especially those in their 20s, are more likely to be in lower income groups (<=50K), adults in their 40s, especially those in executive managerial roles, are more likely to have higher salaries (>50K).

Impact of Education on Income: A post-secondary education level is linked to higher income groups, underscoring the significance of education for earning potential.

Occupation-Income Correlation: A strong association has been observed between the sorts of occupations and income levels; executive managerial responsibilities are typically associated with higher salaries, whilst administrative or clerical positions tend to pay less.

In conclusion, the study shows patterns of employment and income distribution across age groups, professions, sectors, and educational attainment, along with signs that sociodemographic factors have an impact on these patterns.


CORRELATION MATRIX


# Convert 'income' to a binary numeric column
data$income_numeric <- ifelse(data$income == ">50K", 1, 0)

# Selecting relevant columns for correlation analysis
correlation_data <- data[c("income_numeric", "education.num", "hours.per.week")]

# Calculating the correlation matrix
correlation_matrix <- cor(correlation_data)

# Displaying the correlation matrix
print(correlation_matrix)
##                income_numeric education.num hours.per.week
## income_numeric      1.0000000     0.3354134      0.2294349
## education.num       0.3354134     1.0000000      0.1528419
## hours.per.week      0.2294349     0.1528419      1.0000000

The correlation matrix shows moderate positive associations between ‘income_numeric’ and ‘education.num’ (0.335), and a weaker link with ‘hours.per.week’ (0.229), suggesting a positive connection between income and both education and weekly hours. ‘education.num’ appears slightly more influential than ‘hours.per.week’ on ‘income_numeric’. Meanwhile, ‘education.num’ and ‘hours.per.week’ have a mild positive correlation (0.153), indicating they may capture different aspects of income prediction, reducing multicollinearity concerns.
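
As a visual counterpart to the printed matrix, the small correlation matrix can also be drawn as a heatmap with ggplot2, which is already loaded. This is a sketch; corr_long is an illustrative name.

# Sketch: heatmap of the correlation matrix (base reshaping via as.table)
corr_long <- as.data.frame(as.table(correlation_matrix))
ggplot(corr_long, aes(x = Var1, y = Var2, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = round(Freq, 2))) +
    scale_fill_gradient(low = "white", high = "purple") +
    labs(title = "Correlation Matrix Heatmap", x = NULL, y = NULL)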


MODELLING

REGRESSION

For the regression task, linear regression was chosen to predict income (coded as a binary numeric outcome).


LINEAR REGRESSION

library(caret)
data <- read.csv("dfclean.csv")
# Assuming 'data' is your dataset
# Convert 'income' to a binary numeric column
data$income_numeric <- ifelse(data$income == ">50K", 1, 0)

# Selecting relevant columns for regression
regression_data <- data[c("education.num", "hours.per.week","income_numeric")]

# Check for missing values and remove rows with NAs if necessary
regression_data <- na.omit(regression_data)

# Splitting the data
set.seed(42)
train_indices_reg <- sample(nrow(regression_data), 0.8 * nrow(regression_data))
train_data_reg <- regression_data[train_indices_reg, ]
test_data_reg <- regression_data[-train_indices_reg, ]

# Fitting the linear regression model
regression_model <- lm(income_numeric ~ education.num + hours.per.week, data = train_data_reg)

# Predicting on the test set
predicted_values_reg <- predict(regression_model, newdata = test_data_reg)


CLASSIFICATION

For the classification task, two models were selected to predict income: Naive Bayes and Random Forest.


NAIVE BAYES

# Load required libraries
library(readr)
library(dplyr)
library(e1071)  # For Naive Bayes


# Load the cleaned dataset
df <- read.csv("dfclean.csv")


# Treat 'sex' and 'occupation' as categorical variables (factors)
df$sex <- as.factor(df$sex)
df$occupation <- as.factor(df$occupation)

# Split the data into training and testing sets (e.g., 80-20 split)
set.seed(123)  # For reproducibility
train_indices <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]


# Naive Bayes model
nb_model <- naiveBayes(income ~ ., data = train_data)

# Make predictions using Naive Bayes
nb_predictions <- predict(nb_model, newdata = test_data)

# Assess the model performance 
actual_values <- test_data$income


RANDOM FOREST

# Load required libraries
library(readr)
library(dplyr)
library(randomForest)  # For Random Forest


# Load the cleaned dataset (the train/test split from the Naive Bayes section is reused)
df <- read.csv("dfclean.csv")

# Convert 'income' in the training data to a factor
train_data$income <- as.factor(train_data$income)

# Fit Random Forest using the training data
rf_model <- randomForest(income ~ ., data = train_data, ntree = 100)

# Make predictions using Random Forest on the test data
rf_predictions <- predict(rf_model, newdata = test_data)


MODEL EVALUATION

Below is the model evaluation for Linear Regression, Naive Bayes, and Random Forest.


LINEAR REGRESSION MODEL EVALUATION

# Calculate the regression model's performance (e.g., using mean squared error)
mse <- mean((test_data_reg$income_numeric - predicted_values_reg)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 0.160219392390919"
#Calculate Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE):
mae <- mean(abs(test_data_reg$income_numeric - predicted_values_reg))
print(paste("Mean Absolute Error:", mae))
## [1] "Mean Absolute Error: 0.33268564699103"
rmse <- sqrt(mean((test_data_reg$income_numeric - predicted_values_reg)^2))
print(paste("Root Mean Squared Error:", rmse))
## [1] "Root Mean Squared Error: 0.400274146543239"
#plot MSE, MAE AND RMSE
# Creating a vector of evaluation metrics
evaluation_metrics <- c(MSE = mse, MAE = mae, RMSE = rmse)

# Creating a bar plot
barplot(evaluation_metrics, 
        main = "Model Evaluation Metrics",
        ylab = "Value",
        col = "skyblue",
        ylim = c(0, max(evaluation_metrics) * 1.2),
        cex.names = 0.8)

# Adding text labels to the bars
text(seq_along(evaluation_metrics), evaluation_metrics, labels = round(evaluation_metrics, 4), pos = 3, cex = 0.8)

The linear regression model yielded a Mean Squared Error (MSE) of 0.1602, a Mean Absolute Error (MAE) of 0.3327, and a Root Mean Squared Error (RMSE) of 0.4003, showcasing its performance in predicting the target variable.

These metrics assess the performance of the regression model; lower values indicate better predictive performance. In this case, the RMSE is slightly larger than the MAE. Overall, the values are relatively small, suggesting that the model’s predictions are reasonably close to the actual values.
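
Because the linear model outputs a numeric score rather than a class label, one way to compare it with the classifiers below is to threshold its predictions. This is a sketch assuming a 0.5 cutoff; predicted_class_reg and lr_accuracy are illustrative names.

# Sketch: convert numeric predictions to class labels at a 0.5 cutoff and compute accuracy
predicted_class_reg <- ifelse(predicted_values_reg > 0.5, 1, 0)
lr_accuracy <- mean(predicted_class_reg == test_data_reg$income_numeric)
print(paste("Thresholded accuracy:", round(lr_accuracy, 4)))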


NAIVE BAYES MODEL EVALUATION

# Evaluation for Naive Bayes
nb_accuracy <- sum(nb_predictions == actual_values) / length(actual_values)
print(paste("Naive Bayes Accuracy:", nb_accuracy))
## [1] "Naive Bayes Accuracy: 0.811877903118779"
# Load required library
library(caret)

# Compute confusion matrix for Naive Bayes predictions
nb_conf_matrix <- table(nb_predictions, actual_values)

# Convert the confusion matrix to a matrix format
conf_matrix <- as.matrix(nb_conf_matrix)

# Confusion matrix with summary statistics
confusionMatrix(conf_matrix)
## Confusion Matrix and Statistics
## 
##               actual_values
## nb_predictions <=50K >50K
##          <=50K  4223  812
##          >50K    322  671
##                                           
##                Accuracy : 0.8119          
##                  95% CI : (0.8018, 0.8217)
##     No Information Rate : 0.754           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4294          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9292          
##             Specificity : 0.4525          
##          Pos Pred Value : 0.8387          
##          Neg Pred Value : 0.6757          
##              Prevalence : 0.7540          
##          Detection Rate : 0.7006          
##    Detection Prevalence : 0.8353          
##       Balanced Accuracy : 0.6908          
##                                           
##        'Positive' Class : <=50K           
## 
library(ggplot2)

# Convert the confusion matrix to a data frame for plotting
conf_matrix_df <- as.data.frame.matrix(conf_matrix)
conf_matrix_df$Predicted <- rownames(conf_matrix_df)

# Reshape the data for ggplot
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
conf_matrix_melted <- melt(conf_matrix_df, id.vars = "Predicted")

# Plot a labeled heatmap using ggplot2
ggplot(conf_matrix_melted, aes(x = variable, y = Predicted, fill = value, label = value)) +
  geom_tile() +
  geom_text() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Confusion Matrix Heatmap", x = "Actual Values", y = "Predicted Values")

# Naive Bayes Evaluation Metrics
nb_conf_matrix <- table(nb_predictions, actual_values)

# Calculate Precision, Recall, F1-score for Naive Bayes
# (rows of the table are predictions, columns are the actual values)
nb_precision <- diag(nb_conf_matrix) / rowSums(nb_conf_matrix)
nb_recall <- diag(nb_conf_matrix) / colSums(nb_conf_matrix)
nb_f1_score <- 2 * (nb_precision * nb_recall) / (nb_precision + nb_recall)

nb_precision
##     <=50K      >50K 
## 0.8387289 0.6757301
nb_recall
##     <=50K      >50K 
## 0.9291529 0.4524612
nb_f1_score
##     <=50K      >50K 
## 0.8816284 0.5420032

The Naive Bayes model achieved an accuracy of approximately 81.19%, with a sensitivity (true positive rate) of 92.92% and a specificity (true negative rate) of 45.25%, taking “<=50K” as the positive class. Its precision (positive predictive value) was 83.87% for the “<=50K” class and 67.57% for the “>50K” class, showing good precision overall. The F1 scores, which balance precision and recall, were about 88.16% for “<=50K” and 54.20% for “>50K”.

The model appears to have higher sensitivity than specificity, indicating it’s better at correctly identifying those earning less than or equal to 50K. However, its ability to identify those earning more than 50K is relatively lower.
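
Since ‘>50K’ is arguably the class of interest for the problem statement, the same confusion matrix can also be summarised with ‘>50K’ treated as the positive class. A minimal sketch using caret:

# Sketch: report caret's metrics with ">50K" as the positive class
confusionMatrix(nb_conf_matrix, positive = ">50K")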


RANDOM FOREST MODEL EVALUATION

# Make predictions using Random Forest on the test data
rf_predictions <- predict(rf_model, newdata = test_data)

# Calculate accuracy for Random Forest
rf_accuracy <- mean(rf_predictions == test_data$income)
print(paste("Random Forest Accuracy:", rf_accuracy))
## [1] "Random Forest Accuracy: 0.863636363636364"
# Load required library
library(caret)

# Compute confusion matrix for Random Forest predictions
rf_conf_matrix <- table(rf_predictions, actual_values)

# Convert the confusion matrix to a matrix format
conf_matrix <- as.matrix(rf_conf_matrix)

# Confusion matrix with summary statistics
confusionMatrix(conf_matrix)
## Confusion Matrix and Statistics
## 
##               actual_values
## rf_predictions <=50K >50K
##          <=50K  4271  548
##          >50K    274  935
##                                           
##                Accuracy : 0.8636          
##                  95% CI : (0.8547, 0.8722)
##     No Information Rate : 0.754           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.608           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9397          
##             Specificity : 0.6305          
##          Pos Pred Value : 0.8863          
##          Neg Pred Value : 0.7734          
##              Prevalence : 0.7540          
##          Detection Rate : 0.7085          
##    Detection Prevalence : 0.7994          
##       Balanced Accuracy : 0.7851          
##                                           
##        'Positive' Class : <=50K           
## 
library(ggplot2)

# Convert the confusion matrix to a data frame for plotting
conf_matrix_df <- as.data.frame.matrix(conf_matrix)
conf_matrix_df$Predicted <- rownames(conf_matrix_df)

# Reshape the data for ggplot
library(reshape2)
conf_matrix_melted <- melt(conf_matrix_df, id.vars = "Predicted")

# Plot a labeled heatmap using ggplot2
ggplot(conf_matrix_melted, aes(x = variable, y = Predicted, fill = value, label = value)) +
  geom_tile() +
  geom_text() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Confusion Matrix Heatmap", x = "Actual Values", y = "Predicted Values")

# Random Forest Evaluation Metrics
rf_conf_matrix <- table(rf_predictions, test_data$income)

# Calculate Precision, Recall, F1-score for Random Forest
# (rows of the table are predictions, columns are the actual values)
rf_precision <- diag(rf_conf_matrix) / rowSums(rf_conf_matrix)
rf_recall <- diag(rf_conf_matrix) / colSums(rf_conf_matrix)
rf_f1_score <- 2 * (rf_precision * rf_recall) / (rf_precision + rf_recall)

rf_precision
##     <=50K      >50K 
## 0.8862835 0.7733664
rf_recall
##     <=50K      >50K 
## 0.9397140 0.6304788
rf_f1_score
##     <=50K      >50K 
## 0.9122170 0.6946508
rf_model$importance
##                 MeanDecreaseGini
## X                     1333.90198
## workclass              218.97136
## fnlwgt                 674.31898
## education.num          645.42859
## marital.status         585.88090
## occupation             697.49158
## relationship           946.18156
## race                    84.34450
## sex                     89.82242
## capital.gain           600.15168
## capital.loss           184.38127
## hours.per.week         475.34020
## native.country          99.38437
## age_category           359.19437
## education_group        235.60573

The random forest model achieves an accuracy of 86.36%, showing strong sensitivity (93.97%) in predicting lower-income individuals but comparatively lower specificity (63.05%) for higher-income predictions. It demonstrates solid precision for “<=50K” (88.63%) and somewhat weaker precision for “>50K” (77.34%). The F1 scores show a better balance for “<=50K” (91.22%) than for “>50K” predictions (69.47%).

High importance scores for features such as relationship, occupation, education.num, and marital.status suggest a strong association between these socio-demographic attributes and income levels within the model’s predictions. Note that the top-ranked ‘X’ column is simply the row index that write.csv() added when dfclean.csv was exported; it is an artifact of the export step rather than a meaningful predictor.
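
A sketch of refitting without that index column, so the importance ranking reflects only substantive attributes (rf_model_no_index is an illustrative name):

# Sketch: exclude the exported row-index column "X" before fitting
rf_model_no_index <- randomForest(income ~ . - X, data = train_data, ntree = 100)
rf_model_no_index$importance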

Overall, the model excels in identifying lower-income individuals but might need improvement in recognizing higher-income ones for a more balanced prediction.


CONCLUSION

Considering the project objectives of accurately classifying individuals into income groups and analyzing socio-demographic features related to income, the Random Forest model can be considered the best option among the three. It achieves the highest accuracy, performs well in identifying lower-income individuals, and predicts higher-income individuals better than the other models, which aligns with the objective of predicting income levels accurately. Additionally, its feature importance analysis helps identify the socio-demographic factors most strongly correlated with income levels.