Week 6 Data Dive

Confidence Intervals

Introduction

In this R Notebook, we are going to build 3 pairs of numeric variables, and for each pair we’ll have at least one column calculated based on other columns.
The following are the relationships that we are going to investigate:
1. Age to hours per week ratio - we would want to know whether younger people work more as it would be presumed they are more energetic (hypothetically). Here the explanatory variable is age and response variable is age to hours per week ratio.
2. How education translates to income level - here we are going to first calculate a median education level, and then calculate whether each entry is below or above the median. The explanatory variable here is education (below or above median); and income is the response variable.

Loading of Libraries and Data Set

# loading necesary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

#loading our census income dataset
df_income <- read.csv("./censusincome.csv")

#for this data dive we don't need to filter out entries that have unkwown workclass

Building the numeric variable pairs

Age to hours per week ratio

we create a new column that has corresponding age to hours per work ratio.

df_income$age_to_hours_ratio <- df_income$hours_per_week / df_income$age
head(df_income[, c("age", "hours_per_week", "age_to_hours_ratio")])

We should make a note that a lower age to hours per week ratio suggests an individual works more hours in relative to their age.

Education level vs. income level

Here we’ll use the education_num feature in our data set which is the numeric representation of the education level - we’ll use it to create a new column that shows whether an individual is below or above the median education level

median_education <- median(df_income$education_num, na.rm = TRUE)
df_income$below_or_above_median_education <- ifelse(df_income$education_num > median_education, 1, 0)

For the income level we can consider it ordered i.e above and below 50k. However, we’ll still need to convert it to numeric as shown below. Binary \(0\) represents \(\le 50k\) and binary \(1\) represents \(> 50k\).

df_income$income_numeric <-ifelse(df_income$income == " >50K", 1, 0)
head(df_income[, c("education", "education_num", "below_or_above_median_education", "income","income_numeric")])

Visualization, Outliers, and Insights

Age to hours per week ratio relationship

In this section we visualize the above relationship:

# scatter plot to visualize the age
ggplot(df_income, aes(x=age, y=age_to_hours_ratio)) +
  geom_point() +
  labs(title = "age vs. hours per week to age ratio") +
  theme_minimal()

From the plot we see that the age to hours ratio tend to reduce as age increases. This in a way makes our initial hypothesis as true where we would expect young individuals to work more hours relative to their age as they are assumed more energetic.

To identify some of the outliers in this relationship we can plot a box plot as shown below:

ggplot(df_income, aes(y=age_to_hours_ratio)) +
  geom_boxplot() +
  labs(title = "identifying outliers: age to hours ratio")  +
  theme_minimal()

From the box-plot above, we see that there are a few outliers on the upper side of the box plot, i.e. a few individuals with high age to hours per week ratio. This could indicate young people working for extremely long hours or perhaps an error in the census income data set. However, the median age to hours per week ratio is approximately 1, and the spread from the median is not much indicating that for most individuals, their number of hours per week is in accordance with their age.
It is worth noting that there is not universal age to hour per week ratio, but it’s a good thing we can have our data set/sample represent the population such that in a light way we can pose \(\mathbf{1.0}\) as the approximate typical age to hours-per-week ratio.
For correlation, in this relationship we can use the Pearson Correlation since we are dealing with continuous variables here.

cor_age_hours <- cor(df_income$age_to_hours_ratio, df_income$age, method = "pearson", use="complete.obs")

cor_age_hours

## [1] -0.675767

The results above is what we were expecting i.e a negative correlation which implies that as age increases the age to hours per week ration reduces. The value \(\approx -0.7\) also indicates a strong inverse relationship between the two, a phenomenon we had observed from the scatter plot made earlier.
Lastly, we evaluate the confidence interval for the age to hours-per-week ratio using the formula below

\[ CI = \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}} \]

# first we calculate the mean and standard deviation for the age-to-hours-per-week ratio
mean_ratio <- mean(df_income$age_to_hours_ratio, na.rm = TRUE)
std_dev_ratio <- sd(df_income$age_to_hours_ratio, na.rm = TRUE)

# the we calculate standard error of the mean
std_error_ratio <- std_dev_ratio / sqrt(sum(!is.na(df_income$age_to_hours_ratio)))

#for 95% confidence interval for the mean the z = 1.96 
CI <- mean_ratio + c(-1.96, 1.96) * std_error_ratio

# function for plotting the normal distribution based on the calculated mean and standard deviation
f_age_to_hours <- function(x) dnorm(x, mean = mean_ratio, sd = std_error_ratio)

# Plot the normal distribution and add confidence interval lines
ggplot() +
  geom_function(fun = f_age_to_hours, xlim = c(mean_ratio - 3*std_error_ratio, mean_ratio + 3*std_error_ratio)) +
  geom_segment(aes(x = CI[1], y = 0, xend = CI[2], yend = 0, linetype = "95% Confidence Interval"),
               color = "gray", size = 1) +
  geom_point(aes(x = mean_ratio, y = 0, color = "Sample Mean"), size = 3) +
  labs(title = "Confidence Interval for Age to Hours-Per-Week Ratio",
       x = "Age to Hours-Per-Week Ratio", y = "Density",
       color = "", linetype = "") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the above plot we see the sample mean for age to hours-per-week ratio is 1.65 and from the confidence interval we can assert with 95% confidence that the true population mean will lie somewhere between 1.16 and 1.17 i.e. if we were to collect many different samples, the true population mean for the age to hours-per-week ratio would lie between the [1.16, 1.17] interval.

Education Level vs. Income Level Relationship

In this section we visualize the above relationship:

# Bar plot: Education Above/Below Median vs Income
# Group by education level (above/below median) and income, and count the occurrences
df_education_income_group <- df_income |>
  group_by(below_or_above_median_education, income)  |>
  summarize(count = n())

## `summarise()` has grouped output by 'below_or_above_median_education'. You can
## override using the `.groups` argument.

# Convert education level to a factor with custom labels
df_education_income_group <- df_education_income_group |>
  mutate(education_label = factor(below_or_above_median_education, 
                                  levels = c(0, 1), 
                                  labels = c("Below Median", "Above Median")))

# Bar plot: Grouped by Education (Above/Below Median) and Income
ggplot(df_education_income_group, 
       aes(x = education_label, 
           y = count, 
           fill = as.factor(income))
) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Grouped by Education (Above/Below Median) and Income",
       x = "Education Level", y = "Proportion", fill = "Income") +
  theme_minimal()

From the above plot we can see that most people who have education below the median level, earn less than 50K. This is what we would expect where individuals who are more educated get paid more as education is a measure of proficiency and knowledge in a certain domain. For the category of education above the median level the distribution between individuals who earn above 50k or below 50k is quite even.
The next question we ask ourselves is what kind of outliers are in this kind of relationship? For this it is hard to find if there are any outliers, simply because of our data in the grouping is not continuous. Additionally, it is hard to think what they would be in this relationship, logically in this scenario.
For the correlation we use the Point-Biserial Correlation which is a special type of Pearson correlation that specifically applies to a situation where one variable is binary and another numeric like in our case.

cor_education_income <- cor(df_income$income_numeric, df_income$below_or_above_median_education, method = "pearson", use="complete.obs")

cor_education_income

## [1] 0.307618

From the above results a correlation of \(\approx 0.3\) indicates that while there is indication that individuals who are educated are most likely to have an income of \(>50k\) and vice versa; the relationship is not as strong. This would indicate that there are other features in our data set that probably influence the income level more, for example the working class variable.

Finally, we evaluate the confidence interval for the education level versus income proportions using the formula below:

\[ CI = \hat{p} \pm z \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

# Calculate the proportion of people with income >50K in each education group
proportions_by_education <- df_income |>
  group_by(below_or_above_median_education) |>
  summarise(proportion_above_50k = mean(income_numeric, na.rm = TRUE),
            count = n())

# Manually calculate the confidence interval for each education level
proportions_by_education <- proportions_by_education |>
  mutate(
    std_error = sqrt((proportion_above_50k * (1 - proportion_above_50k)) / count),
    conf_lower = proportion_above_50k - 1.96 * std_error,
    conf_upper = proportion_above_50k + 1.96 * std_error
  )

# Plot the confidence intervals for the proportions with labels
ggplot(proportions_by_education, aes(x = as.factor(below_or_above_median_education), y = proportion_above_50k)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = conf_lower, ymax = conf_upper), width = 0.2) +
  # Add labels for the sample proportion
  geom_text(aes(label = round(proportion_above_50k, 3), hjust=2.5), size = 3.5) +
  # Add labels for the lower bound of the confidence interval
  geom_text(aes(label = round(conf_lower, 3), y = conf_lower, vjust = 1.5), size = 3.5) +
  # Add labels for the upper bound of the confidence interval
  geom_text(aes(label = round(conf_upper, 3), y = conf_upper, vjust = -0.7), size = 3.5) +
  labs(title = "95% Confidence Interval for Proportion of Income >50K by Education Level",
       x = "Education Level (Below/Above Median)", 
       y = "Proportion of Income >50K") +
  theme_minimal()

From the plot above we can make the following interpretation:
- The sample proportion of individuals earning above \(50k\) is approximately 15% for individuals who are below median education level. The confidence interval implies that we are 95% confident that the true proportion of individuals who are below the median education level and are earning above \(50k\) will be anywhere between 14.5% and 15.5%.
- For the individuals above median education level, the sample proportion earning \(50k\) is approximately 43.1% and using the confidence interval we go ahead to assert that we are 95% confident that the true proportion for that category earning above \(50k\) will be somewhere between 42.2% and 44.1%.
- The insight we get from this is the likelihood of individuals with above median education level is significantly higher than those who have their education below median. And since the confidence intervals don’t overlap, it implies that education is an important factor in influencing the income category.

Conclusion

For this data dive, we were able to build relationships between our features, and even calculate new meaningful columns based on existing columns. We were able to evaluate the relationships of these pair variables, calculate their correlations, evaluate their confidence intervals, and draw meaningful insights.