Data Discovery

Analyzing Income Trends

Overview of Data

The census income data set contains a total of 32,561 rows of data and 15 features with rows containing demographic information and income information about a person. The features include age, work class, education, marital status, occupation, native country, gender, capital gain, capital loss, and income.
The income column seems to be our target column — the other features influence this column which is in binary form i.e. $>50k$ or $\le 50k$.
Data Source: The data is from UCI Machine Learning Repository.
Data Set Documentation: The data set was extracted from a 1994 Census database and prediction task is to determine whether a person makes over 50K a year. The documentation of the data set is contained herein Income Data Set Documentation

Purpose of the project

The main purpose of the project is to investigate the key factors that determine an individual’s income level i.e. whether they fall into the above 50K income bracket per year or otherwise.
This is in a way is important because it can server as a guide for individuals on which areas they need to pay attention to in case they want to be in the above 50K income level. With confidence level and a large enough sample size we can create a sampling distribution that help us infer on a parameter of a true population and with a certain confidence.

It is worth noting however, that our data might already have some historical bias since it was extracted almost 40 years ago, and definitely a lot has changed since then. However, that is beyond the scope of this project.

Exploratory Data Analysis

# loading necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(tidyverse)
library(lindia)
library(broom)

#Loading census income data set
df_income <- read.csv("./censusincome.csv")

The following are some of the various interesting visualizations that I would like to make as part of my Exploratory Data Analysis. For each of the visualization, we have made a comment on some of the insights gathered. Investigating the effect of factors such as hours per week, and education level on income is interesting to me because they help me understand whether hard work pays (or rather used to at the time the data set was extracted), because the two are indicators of indicators of effort.

Proportions of People in the two income levels

library(ggplot2)

# Load the dataset
income_data <- read.csv("censusincome.csv")

# Calculate percentages
income_distribution <- as.data.frame(table(income_data$income))
colnames(income_distribution) <- c("Income", "Count")
income_distribution$Percentage <- (income_distribution$Count / sum(income_distribution$Count)) * 100

# Create the bar graph with percentages
ggplot(income_distribution, aes(x = Income, y = Percentage, fill = Income)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Income Distribution (Percentages)", x = "Income Levels", y = "Percentage (%)") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "steelblue")) +
  geom_text(aes(label = sprintf("%.1f%%", Percentage)), vjust = -0.5)

From the visualization above, we see that income distribution shows skewness with a majority earning below $50K in a year, with the people earning above 50K accounting for only $24\%$.

Income Distribution by Education Level

# Calculate proportions within each education level
data_prop <- df_income |>
  group_by(education, education_num) |>
  mutate(total = n()) |># Calculate total count for each education level
  group_by(education, income, education_num) |>
  summarise(count = n(), total = mean(total)) |>
  mutate(proportion = count / total) |>
  arrange(education_num)  # Sort by education_num

## `summarise()` has grouped output by 'education', 'income'. You can override
## using the `.groups` argument.

# Plot the proportion of income levels by education level
ggplot(data_prop, aes(x = reorder(education, education_num), y = proportion, fill = income)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Proportion of Income by Education Level",
       x = "Education Level",
       y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c(" <=50K" = "blue", " >50K" = "orange")) +
  scale_y_continuous(labels = scales::percent_format())  # Display y-axis as percentages

From the plot made earlier of proportion of income by education level, we see that education does actually influence an individual’s income bracket. Individuals with higher education seem to have a higher likelihood of earning above 50K compared to those who are not as much educated.

Income Distribution by Hours Worked per Week

# Calculate the mean hours worked per week for each income group
avg_hours <- df_income |>
  group_by(income) |>
  summarise(mean_hours_per_week = mean(hours_per_week))

# Create a bar plot for Average Hours Worked per Week by Income Level
ggplot(avg_hours, aes(x = income, y = mean_hours_per_week, fill = income)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Average Hours Worked per Week by Income Level",
       x = "Income Level",
       y = "Average Hours per Week") +
  scale_fill_manual(values = c(" <=50K" = "blue", " >50K" = "orange")) +
  theme_minimal()

Similarly from the plot of average hours per income level, we see that individuals with in the above 50K income bracket typically work for more hours per week as compare to those below or equal to 50K bracket.

Income Distribution by Gender

# Calculate counts for each combination of gender and income
gender_income_distribution <- as.data.frame(table(df_income$income, income_data$sex))
colnames(gender_income_distribution) <- c("Income", "Gender", "Count")

# Calculate percentages within each income category
income_totals <- aggregate(Count ~ Income, data = gender_income_distribution, sum)
gender_income_distribution <- merge(gender_income_distribution, income_totals, by = "Income")
gender_income_distribution$Percentage <- (gender_income_distribution$Count.x / gender_income_distribution$Count.y) * 100

# Create the stacked bar plot
ggplot(gender_income_distribution, aes(x = Income, y = Percentage, fill = Gender)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  labs(title = "Income Distribution by Gender (Percentage of Income Levels)", 
       x = "Income Levels", 
       y = "Percentage of Gender in Income Category (%)") +
  theme_minimal() +
  scale_fill_manual(values = c("skyblue", "pink")) +
  geom_text(aes(label = sprintf("%.1f%%", Percentage)), position = position_stack(vjust = 0.5), color = "white")

From the above visualization plot, we notice that men have higher chances of earning greater than 50K per year as compared to women, indicating a possible gender gap when it comes to compensation in the industry.

Hypothesis Testing

Devising the Null Hypothesis

We start by devising our null hypotheses based on columns of education, gender, and income. We make a hypothesis on the impact of gender and education on income level.

Hypothesis 1: Income and Gender - Neyman Pearson Framework

Step 1 - Null and Alternate Hypothesis

Null Hypothesis ($H_0$):

The proportion of individuals earning more than 50K is equal for both men and women.

Alternative Hypothesis ($H_1$):

The proportion of individuals earning more than 50K is NOT equal for both men and women.

Step 2: We choose an alpha level

As per the Neyman-Pearson framework we start by choosing $\alpha =0.05$. The 0.05 was chosen as the significance level for this test as it is neither too lenient nor too strict, and this does suit our purpose. This is the predefined threshold for determining whether to reject the Null Hypothesis, $H_0$. That means we are willing to accept to a 5% risk of a Type I error (rejecting a true null hypothesis).

Step 3: Evaluating whether we have enough data

Using power analysis, we evaluate whether we have enough amount of data to detect the desired true difference between out proportions.
We choose a power of 80%, meaning we want an 80% chance of detecting a true effect in our data set if it exists. This implies $\beta = 0.20$ which is essentially our Type II error i.e. failing to rejecting a false null hypothesis.

#load the pwr package
library(pwr)

#We pick a cohen's value of 0.2, implying we want to detect small effect size
pwr.2p.test(h=0.2, sig.level = 0.05, power = 0.80)

## 
##      Difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.2
##               n = 392.443
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: same sample sizes

From the results above, we see that we need at least 393 observations for each proportion to detect a difference with a significance level of 0.05 and an 80% power of detecting a true difference. We have met this requirement, since the number of observations by far exceed this the required 393 observations. We have 1179 women above 50K, and 6662 men above the 50K bracket. Therefore, we can progress with the test for the two proportions.

df_income_power_size <- df_income |>
  group_by(sex, income) |>
  summarise(count = n())

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

df_income_power_size

Step 4: Selecting our test

We use a z-test to compare the proportions of our two independent groups i.e proportion of men earning 50K and proportion of women earning 50K.

# Group by income and gender, and count the number of individuals in each group
df_income_grouped <- df_income |>
  group_by(income, sex) |>
  summarise(Count = n())

## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.

# Extract the counts for each income group and gender
male_greater_50K <- df_income_grouped$Count[df_income_grouped$sex == " Male" & df_income_grouped$income == " >50K"]
female_greater_50K <- df_income_grouped$Count[df_income_grouped$sex == " Female" & df_income_grouped$income == " >50K"]



# Calculate the total number of males and females
total_male <- sum(df_income_grouped$Count[df_income_grouped$sex == " Male"])
total_female <- sum(df_income_grouped$Count[df_income_grouped$sex == " Female"])

# Perform the two-proportion z-test
prop_test <- prop.test(x = c(male_greater_50K, female_greater_50K), n = c(total_male, total_female))

# Print the result of the two-proportion z-test
print(prop_test)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(male_greater_50K, female_greater_50K) out of c(total_male, total_female)
## X-squared = 1517.8, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1877104 0.2048416
## sample estimates:
##    prop 1    prop 2 
## 0.3057366 0.1094606

Step 4: Interpretation of the results

From the results above we see that $\mathbf{X^2 = 1517.8}$ which is indicating a significantly high difference between the observed and the expected proportions in our sample data.
We have 1 degree of freedom which indicate the number of independent comparisons we’re making. In our case we’re comparing the proportion of men who earn above 50K and the proportion of women who earn above 50K, and hence the degree of freedom is 1.
Our p-value is less than $2.2e-16$ which is extremely small, much smaller compared to $\alpha = 0.05$ which implies that there is a strong evidence against the null hypothesis. This consequently translates that a huge difference in proportions of men earning above 50k and proportions of women earning 50k.
Lastly we have our confidence level being [0.1877, 0.2048] indicating that we are 95% confidence that the difference in proportions of men and women above the 50K income bracket is between $18.8\%$ and $20.48\%$. We note that $0$ is not part of that interval. A z-statistic of zero would imply there is zero difference in women proportion earning above 50K and men proportion earning above 50k bracket, hence we don’t have a null case represented in our interval.
Our sample estimates men earning above 50K to 30.6% and women earning above 50K to 10.9%.
Our conclusion - Based on the above results and the strong case against the null hypothesis, we reject the null hypothesis, and accept the alternate hypothesis that poses that there is a difference in proportion of men and women in the above 50K income bracket per year.
We can Visualize our hypothesis as shown below

# Define the Z-distribution curve
z_values <- seq(-4, 4, by = 0.01)
density_values <- dnorm(z_values)

# Define the observed sample estimates and confidence intervals from your results
sample_estimate <- 0.3057366 - 0.1094606  # Difference in proportions
lower_ci <- 0.1877104
upper_ci <- 0.2048416
alpha <- 0.05  # Significance level

# Create the plot
ggplot(data.frame(z = z_values, density = density_values), aes(x = z, y = density)) +
  geom_line(size = 1.0, color = "blue") +  # Z-distribution curve
  geom_area(data = subset(data.frame(z = z_values, density = density_values), z > qnorm(1 - alpha/2)), aes(x = z), fill = "red", alpha = 0.3) +  # Upper alpha region
  geom_area(data = subset(data.frame(z = z_values, density = density_values), z < qnorm(alpha/2)), aes(x = z), fill = "red", alpha = 0.3) +  # Lower alpha region
  geom_vline(xintercept = sample_estimate, color = "green", linetype = "dashed", size = 1, show.legend = TRUE) +  # Sample estimate
  geom_vline(xintercept = lower_ci, color = "black", linetype = "dotted", size = 1, show.legend = TRUE) +  # Lower CI
  geom_vline(xintercept = upper_ci, color = "black", linetype = "dotted", size = 1, show.legend = TRUE) +  # Upper CI
  labs(title = "Z-Distribution with Confidence Interval, Alpha, and Sample Estimate", x = "Z-Score", y = "Density") +
  annotate("text", x = sample_estimate + 0.1, y = 0.3, label = paste("Sample Estimate:", round(sample_estimate, 3)), color = "green", size = 4) +
  annotate("text", x = lower_ci - 1.2, y = 0.05, label = paste("Lower CI:", round(lower_ci, 3)), color = "black", size = 4) +
  annotate("text", x = upper_ci + 1.2, y = 0.05, label = paste("Upper CI:", round(upper_ci, 3)), color = "black", size = 4) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We can calculate z from the chi-square using the formula below:

\[ z = \sqrt{X^2} \]

which gives us a z statistic of $\approx 39$. This value is way deep into the rejection region of the plot show above. Again, emphasizing the need to reject the null hypothesis that states there is no difference in income levels between the gender.

Hypothesis 2: Income and Education - Fisher’s Significance Testing Framework

For this hypothesis we carry out the Fisher’s Significance Testing as follows:

Step 1: Define our hypothesis

Null Hypothesis ($H_0)$: The proportion of individuals earning above 50K is the same across different education levels.
Alternate Hypothesis ($H_1$): The proportion of individuals earning above 50K is NOT the same across different education levels.

Step 2: Select the test

Since we have two categorical variables, income and education, we’ll use the Chi-Square Test of Independence, to evaluate whether there is a significant association between them.

# Group the data by education and income and count the occurrences
df_income_grouped <- df_income |>
  group_by(education, income) |>
  summarise(Count = n())

## `summarise()` has grouped output by 'education'. You can override using the
## `.groups` argument.

# Create a contingency table for education and income
c_table_edu_income <- table(df_income$education, df_income$income)

# Perform the Chi-Square Test of Independence
chi_test <- chisq.test(c_table_edu_income)

# Print the results of the Chi-Square test
print(chi_test)

## 
##  Pearson's Chi-squared test
## 
## data:  c_table_edu_income
## X-squared = 4429.7, df = 15, p-value < 2.2e-16

Step 3: Interpreting the results

We have the X-Squared statistic as 4429.7 which essentially measures how much our expected proportions as defined in the null hypothesis differ from the observed proportions in our data set. Again, just like in the first relationship we investigated, we have a strong case here against the null hypothesis.
The other piece of information we have is the degrees of freedom which is 15 in our case, which is a product of the income categories and the education categories. i.e.

\[ d_f = (\text{no. of education categories} - 1) \times (\text{no of income categories} - 1) \]
We have our p-value of $< 2.2e-16$ which is really small, and thus the probability of finding an observation in the null space is extremely low.
The conclusion here is to reject the null hypothesis due to the small value of p which suggests a significant relationship between education level and income level. Therefore, we are essentially concluding that an individual’s income level (whether below or above 50K) will vary based on their education levels.
The visualization of our chi-square test is a follows:

# Define the degrees of freedom and alpha level
df <- 15  # Degrees of freedom based on your test result
alpha <- 0.05  # Significance level

# Create the sequence for Chi-square values
x_values <- seq(0, 40, by = 0.01)  # Adjust the upper limit based on your data
chi_sq_values <- dchisq(x_values, df)

# Calculate the critical value for 95% confidence level
critical_value <- qchisq(1 - alpha, df)

# Create the plot
ggplot(data.frame(x = x_values, y = chi_sq_values), aes(x = x, y = y)) +
  geom_line(size = 1.2, color = "black") +  # Chi-square distribution curve
  geom_area(data = subset(data.frame(x = x_values, y = chi_sq_values), x > critical_value), aes(x = x, y = y), fill = "red", alpha = 0.5) +  # Rejection region
  geom_vline(xintercept = critical_value, linetype = "dashed", color = "blue", size = 1) +  # Critical value line
  annotate("text", x = critical_value + 2, y = 0.02, label = paste("Critical Value =", round(critical_value, 2)), color = "blue", size = 3) +
  annotate("text", x = 10, y = 0.04, label = "1 - c (Confidence Level)", color = "black", size = 3) +
  annotate("text", x = critical_value + 8, y = 0.005, label = "Rejection Region", color = "red", size = 3) +
  labs(title = "Chi-Square Distribution", x = expression(chi^2), y = "Density") +
  theme_minimal()

From the plot above, we see that our x-squared statistic is way beyond the critical value indicating that it is highly unlikely that the difference between our observed proportions would have just been by chance. Therefore, we can confidently, reject the null hypothesis, in favor of the alternate hypothesis, $H_1$.

Logistic Regression Model

For our logistic model, we select the income level feature as our target variable, and then we can use gender, age, education level, and hours per week as our four explanatory variables.

First we convert gender and income level features to numeric as follows:

#Convert the income level and gender to numeric
df_income$income_numeric <- ifelse(df_income$income == " >50K", 1, 0)
df_income$gender_numeric <- ifelse(df_income$sex == ' Male', 1, 0)
head(df_income)

Finally, we build our logistic model as shown below using the glm function in R and by providing family = binomial to specify that we are building a logistic regression model.

#building the logistic model
logistic_model <- glm(income_numeric ~ age + hours_per_week + education_num + gender_numeric, data=df_income, family = binomial)

#displaying a summary of the model
logistic_model$coefficients

##    (Intercept)            age hours_per_week  education_num gender_numeric 
##    -9.13339878     0.04560424     0.03563749     0.35511411     1.16115817

The coefficients above show the effect of each predictor on the log-odds of having income above 50k. We have an intercept of -9.1334 which represents the base line log-odds of having income level above 50K for the case when all predictor variables are equal to zero.
Age predictor: For each addition year of age, the log-odds of having income above 50K income level increases by 0.0456, holding other variables constant. In terms of odds ratio, the odds of having income above 50K increases by 4.7% i.e $e^{0.0456}$.
hours_per_week: For each additional hour per week, the log odds of having income greater above 50K income level increases by 0.0356, while holding the other predictors constant. This translates to having an odds ratio of 3.6% meaning that the odds of being in the above 50K income level increases by 3.6% for every additional increase in one hour worked.
Education: This predictor has a log-odds value of 0.3551 meaning that for each additional educational level, the log odds of earning 50K and above increases by 0.3551 translating to a 42.6% increase in the odds, while keeping the other factors constant.
Gender: Being male is coded as “1” and “0” for female gender. The log-odds for this predictor is 1.1611 meaning that the log odds of earning above 50K increases by 1.1611 for male, translating to an odds increase of 3.194. This indicates that males have a 3.19 odds of having an income above 50K while compared to females, if you kept the other factors constant.
In summary, we see that of all the predictors, an increase in education level, results in the greatest increase in odds of earning 50K and above. This makes logical sense since education is typically a big influence on how much someone earns.
For the confidence interval, we can pick the education coefficient since it’s the strongest predictor of our model, and evaluate it’s CI. We use the formula shown below:

\[ CI = \hat{\beta} \pm Z \cdot \text{SE}(\hat{\beta}) \]

where $\hat{\beta}$ is the estimated coefficient for education level in this case, and $Z$ is the critical value from the standard normal distribution. For a 95% confidence level, which is the CI we desire to build, the value of the corresponding $Z$ is approximately 1.96. Lastly, $\text{SE}(\hat{\beta})$ is standard error of the coefficient.

model_summary <- summary(logistic_model)

#evaluating both the estimate (Beta) and the standard error
education_se <-model_summary$coefficients['education_num', ]
education_se

##     Estimate   Std. Error      z value     Pr(>|z|) 
##  0.355114111  0.006617087 53.666229210  0.000000000

From our logistic model, we see we have $\hat{\beta} = 0.35511$ and $SE(\hat{\beta}) = 0.006617$. Substituting these values we get $\text{CI} = (0.3421, 0.3681)$ for the log-odds of the education predictor. This translates to to $\text{CI} = (40.7\%, 44.5\%)$ for the odds ratio. This means that we are $95\%$ confident that for every addition education unit, there is an approximately $40.7\%$ - $44.5\%$ increase in the odds of earning 50K and above, while of course holding the other factors constant.