## How do number of hours worked per week, commute time to work, and education level explain an individual’s annual income?
The data come from the US Census Bureau's American Community Survey, 2012. The data set has 2,000 observations of 13 variables. I will use the following variables:

- Annual income (income, dependent)
- Number of hours worked per week (hrs_work, independent)
- Education level (edu, independent)
- Commute time to work in minutes (time_to_work, independent)
I chose this topic because I am interested in seeing what actually affects someone's income. There are many narratives about why certain people have more money than others, ranging from laziness and lack of effort to systemic causes. By analyzing this data set, I will try to make a more objective assessment of why these differences in income exist.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
Final_data <- read.csv("Final project dataset acs12.csv")
Here I check for missing values, inspect the structure of the data, and convert the education column (edu) to a factor using the mutate function. I then create a bar graph showing how often each education level occurs: high school or lower, college, and graduate.
colSums(is.na(Final_data))
## income employment hrs_work race age gender
## 377 395 1041 0 0 0
## citizen time_to_work lang married edu disability
## 0 1217 105 0 58 0
## birth_qrtr
## 0
str(Final_data)
## 'data.frame': 2000 obs. of 13 variables:
## $ income : int 60000 0 NA 0 0 1700 NA NA NA 45000 ...
## $ employment : chr "not in labor force" "not in labor force" NA "not in labor force" ...
## $ hrs_work : int 40 NA NA NA NA 40 NA NA NA 84 ...
## $ race : chr "white" "white" "white" "white" ...
## $ age : int 68 88 12 17 77 35 11 7 6 27 ...
## $ gender : chr "female" "male" "female" "male" ...
## $ citizen : chr "yes" "yes" "yes" "yes" ...
## $ time_to_work: int NA NA NA NA NA 15 NA NA NA 40 ...
## $ lang : chr "english" "english" "english" "other" ...
## $ married : chr "no" "no" "no" "no" ...
## $ edu : chr "college" "hs or lower" "hs or lower" "hs or lower" ...
## $ disability : chr "no" "yes" "no" "no" ...
## $ birth_qrtr : chr "jul thru sep" "jan thru mar" "oct thru dec" "oct thru dec" ...
Final_data <- Final_data |>
  mutate(edu = factor(edu))
barplot(table(Final_data$edu), main = "Education levels", xlab = "Highest education obtained", col = "Blue")
To analyze the relationship between the chosen variables I will use multiple linear regression. This method is appropriate because there are several independent variables whose relationship to the dependent variable we want to estimate at the same time. The model uses the natural log of income as the response, so observations with missing or zero income are removed before fitting.
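Written out, the model specified in this section's lm() call has roughly the following form. This is a sketch: the indicator names grad and hs_or_lower are my own notation for the dummy variables R creates from edu, and college is assumed to be the reference level because R orders factor levels alphabetically by default.

$$
\log(\text{income}_i) = \beta_0 + \beta_1\,\text{hrs\_work}_i + \beta_2\,\text{grad}_i + \beta_3\,\text{hs\_or\_lower}_i + \beta_4\,\text{time\_to\_work}_i + \varepsilon_i
$$

Here grad and hs_or_lower equal 1 when a person falls in that education category and 0 otherwise, so $\beta_2$ and $\beta_3$ measure differences in log income relative to college.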
clean_final_data <- subset(Final_data, !is.na(income) & income > 0)
income_mlg <- lm(log(income) ~ hrs_work + edu + time_to_work, data = clean_final_data)
summary(income_mlg)
##
## Call:
## lm(formula = log(income) ~ hrs_work + edu + time_to_work, data = clean_final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5161 -0.4274 0.1249 0.5525 3.2284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.085581 0.130686 61.870 < 2e-16 ***
## hrs_work 0.055215 0.002736 20.182 < 2e-16 ***
## edugrad 0.654377 0.115919 5.645 2.35e-08 ***
## eduhs or lower -0.349742 0.076623 -4.564 5.86e-06 ***
## time_to_work 0.003324 0.001495 2.223 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9043 on 740 degrees of freedom
## (149 observations deleted due to missingness)
## Multiple R-squared: 0.443, Adjusted R-squared: 0.44
## F-statistic: 147.1 on 4 and 740 DF, p-value: < 2.2e-16
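The "149 observations deleted due to missingness" line reflects lm()'s default behavior of dropping any row with a missing value in a model variable. In this output, 745 rows were used (740 residual degrees of freedom plus 5 estimated coefficients) out of 894 rows with positive income. The sketch below illustrates the same mechanism on a small toy data frame; the values are made up for illustration and are not taken from the ACS sample.

```r
# Toy data (made-up values, NOT the ACS sample) using the report's column names.
toy <- data.frame(
  income       = c(60000, 1700, 45000, 8600, 33500, 52000, NA, 0,
                   27000, 71000, 39000, 18000),
  hrs_work     = c(40, 40, 84, 23, 55, NA, 40, 20, 35, 50, 45, 30),
  edu          = factor(c("college", "hs or lower", "grad", "hs or lower",
                          "college", "grad", "college", "hs or lower",
                          "hs or lower", "grad", "college", "hs or lower")),
  time_to_work = c(15, 40, 20, NA, 30, 10, 25, 5, 45, 35, 12, 60)
)

# Same filter as the report: keep rows with positive, non-missing income.
toy_clean <- subset(toy, !is.na(income) & income > 0)

# Count rows that are complete on every variable in the model.
model_vars <- c("income", "hrs_work", "edu", "time_to_work")
n_complete <- sum(complete.cases(toy_clean[, model_vars]))

# lm() silently drops the incomplete rows (default na.action = na.omit).
fit <- lm(log(income) ~ hrs_work + edu + time_to_work, data = toy_clean)

nrow(toy_clean)  # rows left after the income filter
n_complete       # rows with no missing model variable
nobs(fit)        # rows lm() actually used
```

Comparing nrow() with nobs() on the real data is a quick way to see how much of the sample the missing hrs_work and time_to_work values remove.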
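Coefficients: because the response is log(income), each estimate is a change in log income, holding the other predictors constant. Education effects are relative to the reference level, college.

- Hours worked (0.055): each additional hour worked per week is associated with about 5.7% higher income.
- Graduate-level education (0.654): associated with about 92% higher income than college-level education.
- High school or lower education (-0.350): associated with about 30% lower income than college-level education.
- Commute time to work (0.003): each additional minute of commute time is associated with about 0.3% higher income.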
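Since the model is fit to log(income), a coefficient b corresponds to a multiplicative change of exp(b) in income, or (exp(b) - 1) * 100 percent. The sketch below applies that conversion to the estimates copied by hand from the summary output above; with the fitted model available, (exp(coef(income_mlg)) - 1) * 100 should give the same figures.

```r
# Estimates copied from the summary(income_mlg) output (log-income scale).
log_coefs <- c(
  "hrs_work"       = 0.055215,
  "edugrad"        = 0.654377,
  "eduhs or lower" = -0.349742,
  "time_to_work"   = 0.003324
)

# Convert each log-scale coefficient to an approximate percent change in income.
pct_change <- (exp(log_coefs) - 1) * 100

round(pct_change, 1)
```

The education rows are differences from the reference category (college), not from an average person.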
P-values: All findings are statistically significant at the 0.05 level.
R-squared: 0.443. The predictors in this model account for about 44% of the variation in log income.
# Assumptions
Linearity is not fulfilled: in the component-plus-residual plots produced by crPlots(), the points form clusters and do not follow the fitted line evenly.
crPlots(income_mlg)
Independence is satisfied because the residuals are evenly distributed around the zero line in the residuals-versus-order plot, with no clusters or patterns forming.
plot(resid(income_mlg), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
Homoscedasticity is not met because the residuals are not scattered evenly around the lines in either the residuals vs fitted plot or the scale-location plot. The smoothed line in the scale-location plot is also curved, whereas it should be roughly flat if the variance were constant.
Normality is fulfilled, although there are small departures at both ends of the Q-Q residuals plot. There are no high-leverage points in the residuals vs leverage plot.
par(mfrow=c(2,2)); plot(income_mlg); par(mfrow=c(1,1))
There is no problematic multicollinearity. The last column of the variance inflation factor output, GVIF^(1/(2*Df)), has no values of 2.24 or greater, and all values are close to 1, which means the predictors are only weakly related to one another.
vif(income_mlg)
## GVIF Df GVIF^(1/(2*Df))
## hrs_work 1.032053 1 1.015900
## edu 1.032878 2 1.008120
## time_to_work 1.029934 1 1.014857
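The 2.24 cutoff is most likely the square root of 5, since GVIF^(1/(2*Df)) is on the scale of the square root of an ordinary VIF and VIF < 5 is a common rule of thumb; that reading of the cutoff is my assumption. The sketch below squares the adjusted values copied from the output so they can be compared with the usual VIF scale.

```r
# Adjusted values copied from the last column of the vif(income_mlg) output.
gvif_adj <- c(hrs_work = 1.015900, edu = 1.008120, time_to_work = 1.014857)

# Assumed origin of the 2.24 cutoff: sqrt(5), matching the rule of thumb VIF < 5.
threshold <- sqrt(5)

# Squaring puts the adjusted values on the ordinary VIF scale.
vif_scale <- gvif_adj^2

round(vif_scale, 3)
all(gvif_adj < threshold)  # TRUE when no predictor exceeds the cutoff
```

All three squared values sit just above 1, far below 5, which is consistent with the conclusion that multicollinearity is not a concern here.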
Conclusions
All three factors are statistically significant predictors of income, but education and number of hours worked per week are the most important. Education was the one factor that could be associated with either higher or lower income, depending on the category an individual fell into. Relative to college, having a high school or lower level of education had a statistically significant negative association with income, while having graduate-level education was associated with higher income. Overall, this model explained about 44% of the variation in log income, meaning that these three factors have substantial predictive power. However, the model is incomplete because about 56% of the variation remains unexplained. For me, this leads to the general question of what factors account for that remaining 56%.