WQD7004 PROGRAMMING FOR DATA SCIENCE.
GROUP PROJECT
| Group Member | Matrix No |
|---|---|
| MUHAMAD ASHRAF BIN MOHAMAD | 23050906 |
| NUR NAZIFA BINTI ZHAMRI | 22088840 |
| NUR AMIERA BINTI ABDUL RASHID | S2043650 |
| NUR AISYAH BINTI ABDUL MANAP | S2125518 |
| ABDUL JAMAK BIN ABDUL HADI | S2197406 |
Dataset year: 1994
The “Adult Census Income” dataset comprises 14 input variables spanning categorical, ordinal, and numerical data types. The variables include age, work class, education, marital status, occupation, race, gender, and more.
With a total of 32561 rows and 15 columns, this dataset captures key details influencing an individual’s annual income. Factors considered include education level, age, gender, occupation, and additional socio-economic aspects. The dataset’s primary aim is likely to analyze and model income trends based on demographic and socio-economic attributes, facilitating predictive analysis or classification tasks. It’s a valuable resource for exploring correlations between personal details and income levels.
The objective of this project is to predict whether an individual earns more than $50,000 annually based on socio-demographic attributes.
IMPORTING AND READING DATA
The ‘dplyr’ library is loaded for data manipulation, and the ‘adult.csv’ file is read into a dataframe named ‘adult_income’ using the base read.csv() function.
#Import and read data
library(dplyr)
adult_income <- read.csv("adult.csv")
STRUCTURE OF DATASET
Displays the structure of the ‘adult_income’ dataset, showcasing the variable types and their arrangement.
#Structure of adult income dataset
str(adult_income)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 90 82 66 54 41 34 38 74 68 41 ...
## $ workclass : chr "?" "Private" "?" "Private" ...
## $ fnlwgt : int 77053 132870 186061 140359 264663 216864 150601 88638 422013 70037 ...
## $ education : chr "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
## $ education.num : int 9 9 10 4 10 9 6 16 9 10 ...
## $ marital.status: chr "Widowed" "Widowed" "Widowed" "Divorced" ...
## $ occupation : chr "?" "Exec-managerial" "?" "Machine-op-inspct" ...
## $ relationship : chr "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
## $ race : chr "White" "White" "Black" "White" ...
## $ sex : chr "Female" "Female" "Female" "Female" ...
## $ capital.gain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss : int 4356 4356 4356 3900 3900 3770 3770 3683 3683 3004 ...
## $ hours.per.week: int 40 18 40 40 40 45 40 20 40 60 ...
## $ native.country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
DIMENSION OF DATASET
Displays the dimensions (number of rows and columns) of the ‘adult_income’ dataset.
#The dimensions of the dataset
dimensions <- dim(adult_income)
num_rows <- dimensions[1]
num_columns <- dimensions[2]
cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 32561
## Number of columns: 15
DATASET CONTENT
Shows the first 10 rows of the ‘adult_income’ dataset.
#Content of the dataset
head(adult_income,10)
## age workclass fnlwgt education education.num marital.status
## 1 90 ? 77053 HS-grad 9 Widowed
## 2 82 Private 132870 HS-grad 9 Widowed
## 3 66 ? 186061 Some-college 10 Widowed
## 4 54 Private 140359 7th-8th 4 Divorced
## 5 41 Private 264663 Some-college 10 Separated
## 6 34 Private 216864 HS-grad 9 Divorced
## 7 38 Private 150601 10th 6 Separated
## 8 74 State-gov 88638 Doctorate 16 Never-married
## 9 68 Federal-gov 422013 HS-grad 9 Divorced
## 10 41 Private 70037 Some-college 10 Never-married
## occupation relationship race sex capital.gain capital.loss
## 1 ? Not-in-family White Female 0 4356
## 2 Exec-managerial Not-in-family White Female 0 4356
## 3 ? Unmarried Black Female 0 4356
## 4 Machine-op-inspct Unmarried White Female 0 3900
## 5 Prof-specialty Own-child White Female 0 3900
## 6 Other-service Unmarried White Female 0 3770
## 7 Adm-clerical Unmarried White Male 0 3770
## 8 Prof-specialty Other-relative White Female 0 3683
## 9 Prof-specialty Not-in-family White Female 0 3683
## 10 Craft-repair Unmarried White Male 0 3004
## hours.per.week native.country income
## 1 40 United-States <=50K
## 2 18 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 United-States <=50K
## 6 45 United-States <=50K
## 7 40 United-States <=50K
## 8 20 United-States >50K
## 9 40 United-States <=50K
## 10 60 ? >50K
COUNT OF INTEGER AND CHARACTER ATTRIBUTES
Count of columns classified as integers and characters in the ‘adult_income’ dataset.
# Count of integer and character in attributes
data_types <- sapply(adult_income, class)
# Count of integer columns
num_int_columns <- sum(data_types == "integer")
# Count of character columns
num_char_columns <- sum(data_types == "character")
cat("Number of integer columns:", num_int_columns, "\nNumber of character columns:", num_char_columns, "\n")
## Number of integer columns: 6
## Number of character columns: 9
LIST OF INTEGER AND CHARACTER COLUMNS
List of the names of columns identified as integers and characters in the dataset.
#List of integer and character columns
integer_columns <- names(adult_income)[sapply(adult_income, is.integer)]
char_columns <- names(adult_income)[sapply(adult_income, is.character)]
cat("Integer columns:", paste(integer_columns, collapse = ", "), "\nCharacter columns:", paste(char_columns, collapse = ", "), "\n")
## Integer columns: age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week
## Character columns: workclass, education, marital.status, occupation, relationship, race, sex, native.country, income
ATTRIBUTES OF THE DATASET
Attributes <- c("workclass", "fnlwgt", "education", "education.num", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country", "income", "age")
Description <- c("Industry of occupation", "Final Weight, which is the number of units in the target population that the responding unit represents", "Level of education", "The level of education in numerical form", "Marital status of an individual", "Type of occupation of an individual", "What this individual is relative to others", "Individual’s race", "Biological sex of the individual", "Capital gains for an individual", "Capital loss for an individual", "The hours an individual has reported to work per week", "Country of origin for an individual", "Whether or not an individual makes more than $50,000 annually", "The age of an individual")
knitr::kable(data.frame(Attributes, Description), col.names = c('Attributes', 'Description'))
| Attributes | Description |
|---|---|
| workclass | Industry of occupation |
| fnlwgt | Final Weight, which is the number of units in the target population that the responding unit represents |
| education | Level of education |
| education.num | The level of education in numerical form |
| marital.status | Marital status of an individual |
| occupation | Type of occupation of an individual |
| relationship | What this individual is relative to others |
| race | Individual’s race |
| sex | Biological sex of the individual |
| capital.gain | Capital gains for an individual |
| capital.loss | Capital loss for an individual |
| hours.per.week | The hours an individual has reported to work per week |
| native.country | Country of origin for an individual |
| income | Whether or not an individual makes more than $50,000 annually |
| age | The age of an individual |
SUMMARY OF DATASET
Generates a statistical summary of the ‘adult_income’ dataset, providing key statistics for each variable/column.
#Summary of the adult income dataset
summary(adult_income)
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education.num marital.status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital.gain capital.loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours.per.week native.country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
CHECKING DATA COMPLETENESS ACROSS ALL ROWS
nrow(adult_income[complete.cases(adult_income),])
## [1] 32561
The complete.cases() check reports all 32,561 rows as complete. However, the structure output above shows “?” entries in several character columns, and we treat these symbols as missing values.
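As a quick check (a minimal sketch of our own, assuming the “?” marker is used consistently), the number of “?” entries per column can be counted directly:
# Sketch: count "?" entries in every column of the raw dataset
sapply(adult_income, function(x) sum(x == "?"))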
DUPLICATES
As shown above, the dataset has 32,561 rows and 15 columns. The code below checks whether any rows are duplicated.
# Count the number of duplicates
num_duplicates <- sum(duplicated(adult_income))
if (num_duplicates > 0) {
cat("Found", num_duplicates, "duplicate rows in the dataframe.\n")
} else {
cat("No duplicated rows found in the dataframe.\n")
}
## Found 24 duplicate rows in the dataframe.
REMOVE THE DUPLICATES
#keeping only the first occurrence
dfunique <- unique(adult_income)
dimensions <- dim(dfunique)
num_rows <- dimensions[1]
num_columns <- dimensions[2]
cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 32537
## Number of columns: 15
After retaining only the first occurrence of each row with dfunique <- unique(adult_income), we have 32,537 rows and 15 columns.
DOUBLE-CHECKING FOR DATA INTEGRITY: ENSURING COMPLETE DUPLICATE REMOVAL
duplicated_rows <- dfunique[duplicated(dfunique), ]
# Display duplicated rows
if (nrow(duplicated_rows) > 0) {
cat("Duplicated rows in the dataframe:\n")
print(duplicated_rows)
} else {
cat("No duplicated rows found in the dataframe.\n")
}
## No duplicated rows found in the dataframe.
Duplicates were removed using the unique() function, and a check for duplicated rows in the resulting dataframe (dfunique) indicated that none remained. This confirms that keeping only the first occurrence of each row effectively eliminated all duplicate rows from the dataset.
MISSING VALUES
The function check_missing_values() was used to identify missing values in different columns of the dataframe (dfunique). The script checked for missing values in various columns using a list of column names (columns_to_check).
check_missing_values <- function(data, column_name) {
any(data[[column_name]] == "?")
}
columns_to_check <- c("age", "workclass", "fnlwgt", "education", "education.num",
"marital.status", "occupation", "relationship", "race",
"sex", "capital.gain", "capital.loss", "hours.per.week",
"native.country", "income")
for (column in columns_to_check) {
if (check_missing_values(dfunique, column)) {
cat(paste("There are missing values in column", column, "\n"))
} else {
cat(paste("No missing values in column", column, "\n"))
}
}
## No missing values in column age
## There are missing values in column workclass
## No missing values in column fnlwgt
## No missing values in column education
## No missing values in column education.num
## No missing values in column marital.status
## There are missing values in column occupation
## No missing values in column relationship
## No missing values in column race
## No missing values in column sex
## No missing values in column capital.gain
## No missing values in column capital.loss
## No missing values in column hours.per.week
## There are missing values in column native.country
## No missing values in column income
THE RESULTS OF THIS ANALYSIS SHOWED
Columns with No Missing Values: ‘age’, ‘fnlwgt’, ‘education’, ‘education.num’, ‘marital.status’, ‘relationship’, ‘race’, ‘sex’, ‘capital.gain’, ‘capital.loss’, ‘hours.per.week’, ‘income’ were found to have no missing values.
Columns with Missing Values: ‘workclass’, ‘occupation’, and ‘native.country’ were identified to contain missing values.
IDENTIFYING MISSING VALUES
Checked for the presence of “?” in the ‘workclass’, ‘occupation’, and ‘native.country’ columns.
missing_counts <- colSums(dfunique[, c("workclass", "occupation", "native.country")] == "?")
The counts of “?” entries were 1836, 1843, and 582 for ‘workclass’, ‘occupation’, and ‘native.country’, respectively.
NUMBER OF ROWS THAT HAVE MISSING VALUES
Determined the total number of rows containing at least one missing value among ‘workclass’, ‘occupation’, and ‘native.country’.
total_empty_rows <- sum(rowSums(dfunique[c("workclass", "occupation", "native.country")] == "?") > 0)
total_empty_rows
## [1] 2398
A total of 2,398 rows contained at least one missing value.
REMOVE ROWS WITH MISSING VALUES
Created a new dataframe df_no_missing by filtering out rows where any of the ‘workclass’, ‘occupation’, or ‘native.country’ columns contained “?”.
df_no_missing <- dfunique %>%
filter(workclass != "?" & occupation != "?" & native.country != "?")
new_dimensions <- dim(df_no_missing)
num_rows <- new_dimensions[1]
num_columns <- new_dimensions[2]
cat("Number of rows:", num_rows, "\nNumber of columns:", num_columns, "\n")
## Number of rows: 30139
## Number of columns: 15
This resulted in a dataframe with 30,139 rows and 15 columns.
CREATING AGE CATEGORIES
Created a new column ‘age_category’ by categorizing ‘age’ into groups: Under 20, 20-29, 30-39, 40-49, 50-59, and 60 and above.
df_no_missing$age_category <- cut(df_no_missing$age, breaks = c(-Inf, 20, 30, 40, 50, 60, Inf),
labels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60 and above"),
include.lowest = TRUE)
# Display the updated dataframe
head(df_no_missing)
## age workclass fnlwgt education education.num marital.status
## 1 82 Private 132870 HS-grad 9 Widowed
## 2 54 Private 140359 7th-8th 4 Divorced
## 3 41 Private 264663 Some-college 10 Separated
## 4 34 Private 216864 HS-grad 9 Divorced
## 5 38 Private 150601 10th 6 Separated
## 6 74 State-gov 88638 Doctorate 16 Never-married
## occupation relationship race sex capital.gain capital.loss
## 1 Exec-managerial Not-in-family White Female 0 4356
## 2 Machine-op-inspct Unmarried White Female 0 3900
## 3 Prof-specialty Own-child White Female 0 3900
## 4 Other-service Unmarried White Female 0 3770
## 5 Adm-clerical Unmarried White Male 0 3770
## 6 Prof-specialty Other-relative White Female 0 3683
## hours.per.week native.country income age_category
## 1 18 United-States <=50K 60 and above
## 2 40 United-States <=50K 50-59
## 3 40 United-States <=50K 40-49
## 4 45 United-States <=50K 30-39
## 5 40 United-States <=50K 30-39
## 6 20 United-States >50K 60 and above
CREATING EDUCATION CATEGORIES
Created a new column ‘education_group’ by categorizing ‘education’ into groups: Primary, Secondary, Tertiary, Doctorate and Other.
dfedu <- df_no_missing %>%
mutate(education_group = case_when(
education %in% c('1st-4th', '5th-6th') ~ 'Primary',
education %in% c('7th-8th', 'HS-grad', '10th', '11th', '12th', '9th') ~ 'Secondary',
education %in% c('Some-college', 'Bachelors', 'Assoc-voc', 'Assoc-acdm') ~ 'Tertiary',
education %in% c('Masters', 'Doctorate', 'Prof-school') ~ 'Doctorate',
TRUE ~ 'Other' # For any education levels not listed
))
Removed the ‘age’ and ‘education’ columns from the dataframe dfedu.
dfedu$age <- NULL
dfedu$education <- NULL
glimpse(dfedu)
## Rows: 30,139
## Columns: 15
## $ workclass <chr> "Private", "Private", "Private", "Private", "Private",…
## $ fnlwgt <int> 132870, 140359, 264663, 216864, 150601, 88638, 422013,…
## $ education.num <int> 9, 4, 10, 9, 6, 16, 9, 16, 15, 13, 14, 15, 7, 14, 13, …
## $ marital.status <chr> "Widowed", "Divorced", "Separated", "Divorced", "Separ…
## $ occupation <chr> "Exec-managerial", "Machine-op-inspct", "Prof-specialt…
## $ relationship <chr> "Not-in-family", "Unmarried", "Own-child", "Unmarried"…
## $ race <chr> "White", "White", "White", "White", "White", "White", …
## $ sex <chr> "Female", "Female", "Female", "Female", "Male", "Femal…
## $ capital.gain <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ capital.loss <int> 4356, 3900, 3900, 3770, 3770, 3683, 3683, 3004, 2824, …
## $ hours.per.week <int> 18, 40, 40, 45, 40, 20, 40, 35, 45, 20, 55, 40, 76, 50…
## $ native.country <chr> "United-States", "United-States", "United-States", "Un…
## $ income <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", "…
## $ age_category <fct> 60 and above, 50-59, 40-49, 30-39, 30-39, 60 and above…
## $ education_group <chr> "Secondary", "Secondary", "Tertiary", "Secondary", "Se…
Rename the dataframe to dfclean
dfclean <- dfedu
Copies the cleaned dataframe (dfedu, now dfclean) and exports it to a CSV file. Note that row names are written by default, so re-reading this file later introduces an extra index column named ‘X’.
write.csv(dfclean, "dfclean.csv")
DISTRIBUTION OF INCOME
In this dataset, the ‘income’ column records annual earnings in two categories: ‘<=50K’ for adults earning up to and including 50,000 USD, and ‘>50K’ for those earning more.
library(dplyr)
library(ggplot2)
income_data <- dfclean %>%
group_by(income) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100)
ggplot(income_data, aes(x = income, y = count)) +
geom_bar(stat = "identity", fill = "purple", color = "black") +
geom_text(aes(label = count), vjust = -0.5) +
geom_text(aes(y = count/2, label = paste0(round(percentage, 1), "%")), vjust = 0.5, color = "black") +
theme_light() +
labs(title = "Distribution of Income", x = "Income Category", y = "Count")
The bar chart shows that about 75% of adults earn at most 50K USD, compared with roughly 25% who earn more than 50K USD.
DISTRIBUTION OF WORK CLASS OF ADULTS
## Convert table into dataframe to manipulate data summary
workclass_summary <- as.data.frame(table(dfclean$workclass))
## Add percentage values
workclass_summary <- transform(workclass_summary, Percentage = Freq / sum(Freq) * 100)
## Plot
ggplot(workclass_summary, aes(x = reorder(Var1, -Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "purple", color = "black") +
geom_text(aes(label = paste0(round(Percentage, 1), "%")),
vjust = -0.5, size = 3) +
labs(title = "Distribution of Work Class",
x = "Work Class",
y = "Count") +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The private sector is the predominant source of employment in this dataset, accounting for 73.9% of individuals. Self-employed individuals fall into two categories: those who are not incorporated (self-emp-not-inc) at 8.3% and those who are incorporated (self-emp-inc) at 3.6%, indicating that more self-employed individuals operate businesses without forming corporations. The plot also shows that the ‘Without-pay’ group is negligible, so essentially all workers in the dataset are compensated.
DISTRIBUTION OF WORKCLASS BY INCOME CATEGORY
## Summarize income and workclass data and arrange in descending order
summary_data <- dfclean %>%
group_by(workclass, income) %>%
summarise(count = n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'workclass'. You can override using the
## `.groups` argument.
# Plot
ggplot(summary_data, aes(x = reorder(workclass, -count), y = count, fill = income)) +
geom_bar(stat = "identity", position = "dodge", color = "black") +
labs(x = "Workclass", y = "Count", fill = "Income") +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("<=50K" = "purple", ">50K" = "green"))
The private sector has the highest count of individuals in both income categories, with far more individuals earning 50K and less than earning more than 50K. Among the self-employed, the self-emp-not-inc group has a higher count of individuals earning 50K and less, whereas incorporated self-employed individuals (self-emp-inc) show a more balanced distribution, with slightly more earning above 50K. Across the local, state and federal government categories, more individuals earn less than or equal to 50K, although the gap is smallest for federal government jobs, implying a more uniform salary structure.
DISTRIBUTION OF SEX, RACE AND AGE CATEGORIES
options(repos = "https://cloud.r-project.org/")
install.packages("tidyverse")
library(tidyverse)
df <- dfclean
# Function to create a pie chart with labeled sections, percentage
create_pie_chart <- function(data, category, title) {
# First, create a summarized version of the data with counts
data_sum <- data %>%
group_by(!!sym(category)) %>%
summarise(count = n()) %>%
mutate(perc = count / sum(count) * 100)
# Now, create the pie chart
ggplot(data_sum, aes(x = "", y = count, fill = !!sym(category))) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y") +
theme_void() + # Remove background
geom_text(aes(label = paste0(round(perc, 1), "%")),
position = position_stack(vjust = 0.5),
size = 2.2) +
labs(title = title, fill = category) +
scale_fill_brewer(palette = "Accent")
}
# Pie chart for sex
pie_sex <- create_pie_chart(df, "sex", "Distribution of Sex")
# Pie chart for race
pie_race <- create_pie_chart(df, "race", "Distribution of Race")
# Print or save the plots
print(pie_sex)
Distribution of sex - 67.6% of the respondents are male, while the remaining 32.4% are female.
print(pie_race)
Distribution of race - the majority of respondents (86%) are White, while the smallest group, Amer-Indian-Eskimo, accounts for less than 1%.
ggplot(df, aes(x = age_category)) +
geom_bar(stat = "count", fill = "#69b3a2") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) + # Use after_stat(count) for labels
labs(title = "Distribution of Age Categories", x = "Age Category", y = "Count") +
theme_minimal()
The age categories are ordered to show the distribution of respondents across age groups. Based on the age grouping created earlier, the counts form a roughly bell-shaped distribution, peaking in the middle age groups.
AGE CATEGORIES AND INCOME
# Group data by age_category and income
grouped_data <- df %>%
group_by(age_category, income) %>% # Group by age_category and income
summarise(n = n(), .groups = "drop") # Explicitly set .groups to "drop" to remove all grouping
# Create the side-by-side bar chart
ggplot(grouped_data, aes(x = age_category, y = n, fill = income)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.5) + # Add text labels
labs(x = "Age Category", y = "Count", title = "Distribution of Age by Income") +
theme_minimal() +
theme(legend.position = "bottom")
Based on the chart above, the “<=50K” income group is most common among adults aged 20-29, while the “>50K” group is most common among adults aged 40-49.
OCCUPATIONS AND INCOME
# Exploratory Data Analysis (EDA) for Occupation
occupation_counts <- df %>%
group_by(occupation) %>%
summarise(count = n()) %>%
mutate(percentage = (count / sum(count)) * 100) %>%
arrange(desc(count))
# Plotting occupation distribution
ggplot(occupation_counts, aes(x = reorder(occupation, -count), y = percentage, fill = occupation)) +
geom_bar(stat = "identity", color = "black") +
labs(title = "Percentage of Population by Occupation",
x = "Occupation",
y = "% of Population") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set3") +
geom_text(aes(label = sprintf("%.2f%%", percentage)), vjust = -0.5, size = 3) +
coord_flip()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
## Returning the palette you asked for with that many colors
Executive-managerial and professional-specialty occupations each account for roughly 13% of the dataset and contain the largest shares of individuals earning above $50K. In contrast, private household services and the armed forces show the lowest representation in the higher income bracket. Exploring the demographics within these occupations, together with key factors such as education and work hours, could provide crucial insight into these income disparities, as sketched below.
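One way to push this further is to rank occupations by the share of individuals earning above 50K. The sketch below is our own addition and assumes df (the copy of dfclean used above) is still in memory.
# Sketch: share of >50K earners within each occupation
df %>%
  group_by(occupation) %>%
  summarise(total = n(),
            pct_above_50k = mean(income == ">50K") * 100) %>%
  arrange(desc(pct_above_50k))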
INCOME DISTRIBUTION BY SEX
data <-dfclean
# Create separate dataframes for income above and below $50,000 by sex
above_50k_sex <- data %>%
filter(income == ">50K") %>%
count(sex)
below_50k_sex <- data %>%
filter(income == "<=50K") %>%
count(sex)
# Merge the data into a single dataframe
income_by_sex <- merge(above_50k_sex, below_50k_sex, by = "sex", suffixes = c("_above_50k", "_below_50k"))
# Reshape the data for plotting
income_by_sex_long <- income_by_sex %>%
pivot_longer(cols = -sex, names_to = "income_category", values_to = "count")
# Plotting income distribution by sex
ggplot(income_by_sex_long, aes(x = sex, y = count, fill = income_category)) +
geom_bar(stat = "identity", position = "stack", color = "black") +
labs(title = "Income Distribution by Sex",
x = "Sex",
y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
The EDA reveals a significant disparity in income distribution between the sexes. Among individuals earning at most $50K annually, 13,972 are male and 8,661 are female, roughly 61% versus 39%. Conversely, among individuals earning above $50K, about 85% are male and only 15% are female. This disparity highlights a considerable association between sex and income level within the dataset.
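The quoted percentages can be cross-checked with a compact sketch (our addition, using base R’s prop.table on the same cleaned data held in ‘data’):
# Proportion of each sex within each income category (column-wise percentages)
round(prop.table(table(data$sex, data$income), margin = 2) * 100, 1)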
DISTRIBUTION OF EDUCATION AND INCOME CATEGORIES
education_order <- c("Primary", "Secondary", "Tertiary", "Doctorate", "Other")
df$education_group <- factor(df$education_group, levels = education_order)
# Group data by education and income
grouped_data <- df %>%
group_by(education_group, income) %>% # Group by education and income
summarise(n = n(), .groups = "drop") # Explicitly set .groups to "drop" to remove all grouping
# Create the side-by-side bar chart
ggplot(grouped_data, aes(x = education_group, y = n, fill = income)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.5) + # Add text labels
labs(x = "Education", y = "Count", title = "Distribution of Education by Income") +
theme_minimal() +
theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) # Adjust angle of x-axis labels
Furthermore, the “>50K” income group is largest among adults with tertiary education.
SUMMARY OF EXPLORATORY DATA ANALYSIS
In summary, these are a few insights that we would like to highlight based on our EDA findings:
Work Sector Dominance: The private sector employs a sizable fraction of the adult labor force. This suggests that the private sector dominates the labor market.
Distribution of Gender and Race in the Workforce: Males and White people make up a larger portion of the workforce, which may indicate differences in employment possibilities or preferences based on gender and race.
Age-Related Income Trends: The income levels show a discernible age-related pattern. While younger persons, especially those in their 20s, are more likely to be in lower income groups (<=50K), adults in their 40s, especially those in executive managerial roles, are more likely to have higher salaries (>50K).
Impact of Education on Income: Post-secondary education is linked to higher income groups, underscoring the significance of education for earning potential.
Occupation-Income Correlation: A strong association has been observed between the sorts of occupations and income levels; executive managerial responsibilities are typically associated with higher salaries, whilst administrative or clerical positions tend to pay less.
In conclusion, the study shows patterns of employment and income distribution across age groups, professions, sectors, and educational attainment, along with signs that sociodemographic factors have an impact on these patterns.
# Convert 'income' to a binary numeric column
data$income_numeric <- ifelse(data$income == ">50K", 1, 0)
# Selecting relevant columns for correlation analysis
correlation_data <- data[c("income_numeric", "education.num", "hours.per.week")]
# Calculating the correlation matrix
correlation_matrix <- cor(correlation_data)
# Displaying the correlation matrix
print(correlation_matrix)
## income_numeric education.num hours.per.week
## income_numeric 1.0000000 0.3354134 0.2294349
## education.num 0.3354134 1.0000000 0.1528419
## hours.per.week 0.2294349 0.1528419 1.0000000
The correlation matrix shows moderate positive associations between ‘income_numeric’ and ‘education.num’ (0.335), and a weaker link with ‘hours.per.week’ (0.229), suggesting a positive connection between income and both education and weekly hours. ‘education.num’ appears slightly more influential than ‘hours.per.week’ on ‘income_numeric’. Meanwhile, ‘education.num’ and ‘hours.per.week’ have a mild positive correlation (0.153), indicating they may capture different aspects of income prediction, reducing multicollinearity concerns.
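For a quick visual summary of these relationships, the matrix can also be drawn as a heatmap. The sketch below is optional and assumes the corrplot package is installed.
# Optional sketch: visualise the correlation matrix as a heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", addCoef.col = "black", tl.col = "black")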
REGRESSION
For the regression task, linear regression was chosen to predict income.
LINEAR REGRESSION
library(caret)
data <- read.csv("dfclean.csv")
# Assuming 'data' is your dataset
# Convert 'income' to a binary numeric column
data$income_numeric <- ifelse(data$income == ">50K", 1, 0)
# Selecting relevant columns for regression
regression_data <- data[c("education.num", "hours.per.week","income_numeric")]
# Check for missing values and remove rows with NAs if necessary
regression_data <- na.omit(regression_data)
# Splitting the data
set.seed(42)
train_indices_reg <- sample(nrow(regression_data), 0.8 * nrow(regression_data))
train_data_reg <- regression_data[train_indices_reg, ]
test_data_reg <- regression_data[-train_indices_reg, ]
# Fitting the linear regression model
regression_model <- lm(income_numeric ~ education.num + hours.per.week, data = train_data_reg)
# Predicting on the test set
predicted_values_reg <- predict(regression_model, newdata = test_data_reg)
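Before evaluating the predictions, the fitted coefficients and overall fit can be inspected; this is an optional step added here for context.
# Inspect coefficients and overall fit of the linear regression model
summary(regression_model)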
CLASSIFICATION
For the classification task, two models were selected to predict income: Naive Bayes and Random Forest.
NAIVE BAYES
# Load required libraries
library(readr)
library(dplyr)
library(e1071) # For Naive Bayes
# Load the cleaned dataset
df <- read.csv("dfclean.csv")
# Treat 'sex' and 'occupation' as categorical variables (factors)
df$sex <- as.factor(df$sex)
df$occupation <- as.factor(df$occupation)
# Split the data into training and testing sets (e.g., 80-20 split)
set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
# Naive Bayes model
nb_model <- naiveBayes(income ~ ., data = train_data)
# Make predictions using Naive Bayes
nb_predictions <- predict(nb_model, newdata = test_data)
# Assess the model performance
actual_values <- test_data$income
RANDOM FOREST
# Load required libraries
library(readr)
library(dplyr)
library(randomForest) # For Random Forest
# Load the cleaned dataset
df <- read.csv("dfclean.csv")
# Reuse the train/test split created in the Naive Bayes section and
# convert 'income' in the training data to a factor for classification
train_data$income <- as.factor(train_data$income)
# Fit Random Forest using the training data
rf_model <- randomForest(income ~ ., data = train_data, ntree = 100)
# Make predictions using Random Forest on the test data
rf_predictions <- predict(rf_model, newdata = test_data)
Below is the model evaluation for Linear Regression, Naive Bayes and Random Forest.
LINEAR REGRESSION MODEL EVALUATION
# Calculate the regression model's performance (e.g., using mean squared error)
mse <- mean((test_data_reg$income_numeric - predicted_values_reg)^2)
print(paste("Mean Squared Error:", mse))
## [1] "Mean Squared Error: 0.160219392390919"
#Calculate Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE):
mae <- mean(abs(test_data_reg$income_numeric - predicted_values_reg))
print(paste("Mean Absolute Error:", mae))
## [1] "Mean Absolute Error: 0.33268564699103"
rmse <- sqrt(mean((test_data_reg$income_numeric - predicted_values_reg)^2))
print(paste("Root Mean Squared Error:", rmse))
## [1] "Root Mean Squared Error: 0.400274146543239"
#plot MSE, MAE AND RMSE
# Creating a vector of evaluation metrics
evaluation_metrics <- c(MSE = mse, MAE = mae, RMSE = rmse)
# Creating a bar plot
barplot(evaluation_metrics,
main = "Model Evaluation Metrics",
ylab = "Value",
col = "skyblue",
ylim = c(0, max(evaluation_metrics) * 1.2),
cex.names = 0.8)
# Adding text labels to the bars
text(seq_along(evaluation_metrics), evaluation_metrics, labels = round(evaluation_metrics, 4), pos = 3, cex = 0.8)
The linear regression model yielded a Mean Squared Error (MSE) of 0.1602, a Mean Absolute Error (MAE) of 0.3327, and a Root Mean Squared Error (RMSE) of 0.4003, showcasing its performance in predicting the target variable.
These metrics assess the performance of the regression model, with lower values indicating better predictive accuracy. Here the RMSE is slightly larger than the MAE, as expected since RMSE penalises larger errors more heavily. Overall, the values are relatively close, suggesting that the model’s predictions are reasonably close to the actual values.
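Since the target is binary, one way to make the linear model loosely comparable with the classifiers below is to threshold its predictions at 0.5. This is our own sketch, not part of the original pipeline.
# Sketch: convert numeric predictions to classes at a 0.5 cutoff
predicted_class_reg <- ifelse(predicted_values_reg > 0.5, 1, 0)
reg_accuracy <- mean(predicted_class_reg == test_data_reg$income_numeric)
cat("Thresholded linear regression accuracy:", round(reg_accuracy, 4), "\n")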
NAIVE BAYES MODEL EVALUATION
# Evaluation for Naive Bayes
nb_accuracy <- sum(nb_predictions == actual_values) / length(actual_values)
print(paste("Naive Bayes Accuracy:", nb_accuracy))
## [1] "Naive Bayes Accuracy: 0.811877903118779"
# Load required library
library(caret)
# Compute confusion matrix for Naive Bayes predictions
nb_conf_matrix <- table(nb_predictions, actual_values)
# Convert the confusion matrix to a matrix format
conf_matrix <- as.matrix(nb_conf_matrix)
# Detailed confusion matrix statistics
confusionMatrix(conf_matrix)
## Confusion Matrix and Statistics
##
## actual_values
## nb_predictions <=50K >50K
## <=50K 4223 812
## >50K 322 671
##
## Accuracy : 0.8119
## 95% CI : (0.8018, 0.8217)
## No Information Rate : 0.754
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4294
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9292
## Specificity : 0.4525
## Pos Pred Value : 0.8387
## Neg Pred Value : 0.6757
## Prevalence : 0.7540
## Detection Rate : 0.7006
## Detection Prevalence : 0.8353
## Balanced Accuracy : 0.6908
##
## 'Positive' Class : <=50K
##
library(ggplot2)
# Convert the confusion matrix to a data frame for plotting
conf_matrix_df <- as.data.frame.matrix(conf_matrix)
conf_matrix_df$Predicted <- rownames(conf_matrix_df)
# Reshape the data for ggplot
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
conf_matrix_melted <- melt(conf_matrix_df, id.vars = "Predicted")
# Plot a labeled heatmap using ggplot2
ggplot(conf_matrix_melted, aes(x = variable, y = Predicted, fill = value, label = value)) +
geom_tile() +
geom_text() +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Confusion Matrix Heatmap", x = "Actual Values", y = "Predicted Values")
# Naive Bayes Evaluation Metrics
nb_conf_matrix <- table(nb_predictions, actual_values)
# Calculate Precision, Recall, F1-score for Naive Bayes
# (rows of the table are predictions, columns are actual values,
#  so precision uses row sums and recall uses column sums)
nb_precision <- diag(nb_conf_matrix) / rowSums(nb_conf_matrix)
nb_recall <- diag(nb_conf_matrix) / colSums(nb_conf_matrix)
nb_f1_score <- 2 * (nb_precision * nb_recall) / (nb_precision + nb_recall)
nb_precision
## <=50K >50K
## 0.8387289 0.6757301
nb_recall
## <=50K >50K
## 0.9291529 0.4524612
nb_f1_score
## <=50K >50K
## 0.8816284 0.5420032
The Naive Bayes model achieved an accuracy of approximately 81.2%, with a sensitivity (recall for the <=50K class) of 92.9% and a specificity (recall for the >50K class) of 45.2%. Its precision was 83.9% for the <=50K class and 67.6% for the >50K class, while the F1-scores were about 88.2% and 54.2%, respectively.
The model therefore has much higher sensitivity than specificity: it is good at correctly identifying those earning at most 50K, but its ability to identify those earning more than 50K is relatively weak.
RANDOM FOREST MODEL EVALUATION
# Make predictions using Random Forest on the test data
rf_predictions <- predict(rf_model, newdata = test_data)
# Calculate accuracy for Random Forest
rf_accuracy <- mean(rf_predictions == test_data$income)
print(paste("Random Forest Accuracy:", rf_accuracy))
## [1] "Random Forest Accuracy: 0.863636363636364"
# Load required library
library(caret)
# Compute confusion matrix for Random Forest predictions
rf_conf_matrix <- table(rf_predictions, actual_values)
# Convert the confusion matrix to a matrix format
conf_matrix <- as.matrix(rf_conf_matrix)
# Detailed confusion matrix statistics
confusionMatrix(conf_matrix)
## Confusion Matrix and Statistics
##
## actual_values
## rf_predictions <=50K >50K
## <=50K 4271 548
## >50K 274 935
##
## Accuracy : 0.8636
## 95% CI : (0.8547, 0.8722)
## No Information Rate : 0.754
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.608
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9397
## Specificity : 0.6305
## Pos Pred Value : 0.8863
## Neg Pred Value : 0.7734
## Prevalence : 0.7540
## Detection Rate : 0.7085
## Detection Prevalence : 0.7994
## Balanced Accuracy : 0.7851
##
## 'Positive' Class : <=50K
##
library(ggplot2)
# Convert the confusion matrix to a data frame for plotting
conf_matrix_df <- as.data.frame.matrix(conf_matrix)
conf_matrix_df$Predicted <- rownames(conf_matrix_df)
# Reshape the data for ggplot
library(reshape2)
conf_matrix_melted <- melt(conf_matrix_df, id.vars = "Predicted")
# Plot a labeled heatmap using ggplot2
ggplot(conf_matrix_melted, aes(x = variable, y = Predicted, fill = value, label = value)) +
geom_tile() +
geom_text() +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Confusion Matrix Heatmap", x = "Actual Values", y = "Predicted Values")
# Random Forest Evaluation Metrics
rf_conf_matrix <- table(rf_predictions, test_data$income)
# Calculate Precision, Recall, F1-score for Random Forest
# (rows are predictions, columns are actual values)
rf_precision <- diag(rf_conf_matrix) / rowSums(rf_conf_matrix)
rf_recall <- diag(rf_conf_matrix) / colSums(rf_conf_matrix)
rf_f1_score <- 2 * (rf_precision * rf_recall) / (rf_precision + rf_recall)
rf_precision
## <=50K >50K
## 0.8862835 0.7733664
rf_recall
## <=50K >50K
## 0.9397140 0.6304788
rf_f1_score
## <=50K >50K
## 0.9122170 0.6946508
rf_model$importance
## MeanDecreaseGini
## X 1333.90198
## workclass 218.97136
## fnlwgt 674.31898
## education.num 645.42859
## marital.status 585.88090
## occupation 697.49158
## relationship 946.18156
## race 84.34450
## sex 89.82242
## capital.gain 600.15168
## capital.loss 184.38127
## hours.per.week 475.34020
## native.country 99.38437
## age_category 359.19437
## education_group 235.60573
The Random Forest model achieves an accuracy of 86.4%, showing strong sensitivity (94.0%) in identifying lower-income individuals but comparatively lower specificity (63.0%) for higher-income predictions. Its precision is 88.6% for the <=50K class and 77.3% for the >50K class, and the F1-scores of 91.2% and 69.5% indicate a more balanced performance for the lower-income class.
Aside from the column ‘X’, which ranks highest but is merely the row index written by write.csv when dfclean.csv was exported (it should be dropped, for example by exporting with row.names = FALSE, rather than interpreted as a predictor), the largest importance scores belong to relationship, occupation, fnlwgt, education.num, capital.gain and marital.status. This suggests a strong association between these socio-demographic attributes and income levels within the model’s predictions.
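The same importance scores can also be visualised directly from the fitted model; the short sketch below uses randomForest’s built-in importance plot.
# Sketch: plot variable importance (mean decrease in Gini)
randomForest::varImpPlot(rf_model, main = "Random Forest Variable Importance")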
Overall, the model excels in identifying lower-income individuals but might need improvement in recognizing higher-income ones for a more balanced prediction.
Considering the project objectives of accurately classifying individuals into income groups and analysing socio-demographic features related to income, the Random Forest model is the best option among the three. It achieves the highest accuracy and, compared with the other models, is better at identifying both lower-income and higher-income individuals, which aligns with the objective of predicting income levels accurately. Additionally, its feature importance analysis helps identify the socio-demographic factors most strongly associated with income levels.