In this R Notebook, we are going to build 3 pairs of numeric variables, and for each pair we’ll have at least one column calculated based on other columns.
The following are the relationships that we are going to investigate:
Age to hours per week ratio - we would want to know whether younger people work more as it would be presumed they are more energetic (hypothetically). Here the explanatory variable is age and response variable is age to hours per week ratio.
How education translates to income level - here we are going to first calculate a median education level, and then calculate whether each entry is below or above the median. The explanatory variable here is education (below or above median); and income is the response variable.
# loading necesary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
#loading our census income dataset
df_income <- read.csv("./censusincome.csv")
#for this data dive we don't need to filter out entries that have unkwown workclass
Age to hours per week ratio
df_income$age_to_hours_ratio <- df_income$hours_per_week / df_income$age
head(df_income[, c("age", "hours_per_week", "age_to_hours_ratio")])
Education level vs. income level
median_education <- median(df_income$education_num, na.rm = TRUE)
df_income$below_or_above_median_education <- ifelse(df_income$education_num > median_education, 1, 0)
df_income$income_numeric <-ifelse(df_income$income == " >50K", 1, 0)
head(df_income[, c("education", "education_num", "below_or_above_median_education", "income","income_numeric")])
Age to hours per week ratio relationship
# scatter plot to visualize the age
ggplot(df_income, aes(x=age, y=age_to_hours_ratio)) +
geom_point() +
labs(title = "age vs. hours per week to age ratio") +
theme_minimal()
From the plot we see that the age to hours ratio tend to reduce as age increases. This in a way makes our initial hypothesis as true where we would expect young individuals to work more hours relative to their age as they are assumed more energetic.
To identify some of the outliers in this relationship we can plot a box plot as shown below:
ggplot(df_income, aes(y=age_to_hours_ratio)) +
geom_boxplot() +
labs(title = "identifying outliers: age to hours ratio") +
theme_minimal()
From the box-plot above, we see that there are a few outliers on the upper side of the box plot, i.e. a few individuals with high age to hours per week ratio. This could indicate young people working for extremely long hours or perhaps an error in the census income data set. However, the median age to hours per week ratio is approximately 1, and the spread from the median is not much indicating that for most individuals, their number of hours per week is in accordance with their age.
It is worth noting that there is not universal age to hour per week ratio, but it’s a good thing we can have our data set/sample represent the population such that in a light way we can pose \(\mathbf{1.0}\) as the approximate typical age to hours-per-week ratio.
For correlation, in this relationship we can use the Pearson Correlation since we are dealing with continuous variables here.
cor_age_hours <- cor(df_income$age_to_hours_ratio, df_income$age, method = "pearson", use="complete.obs")
cor_age_hours
## [1] -0.675767
The results above is what we were expecting i.e a negative correlation which implies that as age increases the age to hours per week ration reduces. The value \(\approx -0.7\) also indicates a strong inverse relationship between the two, a phenomenon we had observed from the scatter plot made earlier.
Lastly, we evaluate the confidence interval for the age to hours-per-week ratio using the formula below
\[ CI = \bar{x} \pm z \times \frac{\sigma}{\sqrt{n}} \]
# first we calculate the mean and standard deviation for the age-to-hours-per-week ratio
mean_ratio <- mean(df_income$age_to_hours_ratio, na.rm = TRUE)
std_dev_ratio <- sd(df_income$age_to_hours_ratio, na.rm = TRUE)
# the we calculate standard error of the mean
std_error_ratio <- std_dev_ratio / sqrt(sum(!is.na(df_income$age_to_hours_ratio)))
#for 95% confidence interval for the mean the z = 1.96
CI <- mean_ratio + c(-1.96, 1.96) * std_error_ratio
# function for plotting the normal distribution based on the calculated mean and standard deviation
f_age_to_hours <- function(x) dnorm(x, mean = mean_ratio, sd = std_error_ratio)
# Plot the normal distribution and add confidence interval lines
ggplot() +
geom_function(fun = f_age_to_hours, xlim = c(mean_ratio - 3*std_error_ratio, mean_ratio + 3*std_error_ratio)) +
geom_segment(aes(x = CI[1], y = 0, xend = CI[2], yend = 0, linetype = "95% Confidence Interval"),
color = "gray", size = 1) +
geom_point(aes(x = mean_ratio, y = 0, color = "Sample Mean"), size = 3) +
labs(title = "Confidence Interval for Age to Hours-Per-Week Ratio",
x = "Age to Hours-Per-Week Ratio", y = "Density",
color = "", linetype = "") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Education Level vs. Income Level Relationship
# Bar plot: Education Above/Below Median vs Income
# Group by education level (above/below median) and income, and count the occurrences
df_education_income_group <- df_income |>
group_by(below_or_above_median_education, income) |>
summarize(count = n())
## `summarise()` has grouped output by 'below_or_above_median_education'. You can
## override using the `.groups` argument.
# Convert education level to a factor with custom labels
df_education_income_group <- df_education_income_group |>
mutate(education_label = factor(below_or_above_median_education,
levels = c(0, 1),
labels = c("Below Median", "Above Median")))
# Bar plot: Grouped by Education (Above/Below Median) and Income
ggplot(df_education_income_group,
aes(x = education_label,
y = count,
fill = as.factor(income))
) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Grouped by Education (Above/Below Median) and Income",
x = "Education Level", y = "Proportion", fill = "Income") +
theme_minimal()
From the above plot we can see that most people who have education below the median level, earn less than 50K. This is what we would expect where individuals who are more educated get paid more as education is a measure of proficiency and knowledge in a certain domain. For the category of education above the median level the distribution between individuals who earn above 50k or below 50k is quite even.
The next question we ask ourselves is what kind of outliers are in this kind of relationship? For this it is hard to find if there are any outliers, simply because of our data in the grouping is not continuous. Additionally, it is hard to think what they would be in this relationship, logically in this scenario.
For the correlation we use the Point-Biserial Correlation which is a special type of Pearson correlation that specifically applies to a situation where one variable is binary and another numeric like in our case.
cor_education_income <- cor(df_income$income_numeric, df_income$below_or_above_median_education, method = "pearson", use="complete.obs")
cor_education_income
## [1] 0.307618
From the above results a correlation of \(\approx 0.3\) indicates that while there is indication that individuals who are educated are most likely to have an income of \(>50k\) and vice versa; the relationship is not as strong. This would indicate that there are other features in our data set that probably influence the income level more, for example the working class variable.
Finally, we evaluate the confidence interval for the education level versus income proportions using the formula below:
\[ CI = \hat{p} \pm z \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]
# Calculate the proportion of people with income >50K in each education group
proportions_by_education <- df_income |>
group_by(below_or_above_median_education) |>
summarise(proportion_above_50k = mean(income_numeric, na.rm = TRUE),
count = n())
# Manually calculate the confidence interval for each education level
proportions_by_education <- proportions_by_education |>
mutate(
std_error = sqrt((proportion_above_50k * (1 - proportion_above_50k)) / count),
conf_lower = proportion_above_50k - 1.96 * std_error,
conf_upper = proportion_above_50k + 1.96 * std_error
)
# Plot the confidence intervals for the proportions with labels
ggplot(proportions_by_education, aes(x = as.factor(below_or_above_median_education), y = proportion_above_50k)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = conf_lower, ymax = conf_upper), width = 0.2) +
# Add labels for the sample proportion
geom_text(aes(label = round(proportion_above_50k, 3), hjust=2.5), size = 3.5) +
# Add labels for the lower bound of the confidence interval
geom_text(aes(label = round(conf_lower, 3), y = conf_lower, vjust = 1.5), size = 3.5) +
# Add labels for the upper bound of the confidence interval
geom_text(aes(label = round(conf_upper, 3), y = conf_upper, vjust = -0.7), size = 3.5) +
labs(title = "95% Confidence Interval for Proportion of Income >50K by Education Level",
x = "Education Level (Below/Above Median)",
y = "Proportion of Income >50K") +
theme_minimal()
From the plot above we can make the following interpretation: