Research Question: Is there a significant difference in average yearly income across education levels, and how does this difference depend on gender?
This dataset comes from OpenIntro and uses data from the 2012 US Census American Community Survey. Each row represents one individual's response, and the columns of interest record gender, yearly income, and education level. I chose this dataset to understand how these three variables relate to one another and how they can inform future research. Before analysis, the dataset is reduced to these three columns and rows with missing values are removed.
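library(dplyr)  # provides select() and filter()
# Keep only the three variables used in the analysis
data_clean <- data |>
  select(edu, gender, income)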
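# Count the missing values in each column
colSums(is.na(data_clean))
##    edu gender income 
##     58      0    377
# Proportion of missing values in each column
missing_prop <- colSums(is.na(data_clean)) / nrow(data_clean)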
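These checks show how many values are missing in each column before testing whether annual income differs significantly across education levels. Income has the most missing values (377), followed by education (58); gender has none.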
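The missing-value check can be illustrated on a small data frame. This is a minimal sketch: `toy` and its values are made up for illustration and are not part of the ACS data. It uses only base R and shows that `colMeans(is.na(...))` gives the same proportions as dividing the counts by the number of rows.

```r
# Toy data frame (hypothetical values) with the same three columns.
toy <- data.frame(
  edu    = c("college", NA, "grad", "hs or lower"),
  gender = c("female", "male", "male", "female"),
  income = c(60000, NA, NA, 0)
)

# Number of missing values in each column.
missing_count <- colSums(is.na(toy))

# Proportion of rows that are missing in each column (between 0 and 1).
# Equivalent to colSums(is.na(toy)) / nrow(toy).
missing_prop <- colMeans(is.na(toy))

print(missing_count)
print(missing_prop)
```

Reporting proportions alongside counts makes it easier to judge how much of the sample is lost when incomplete rows are dropped.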
# Drop every row with a missing value in any of the three variables
data_no_na <- data_clean |>
  filter(!is.na(edu), !is.na(gender), !is.na(income))
str(data_no_na)
## 'data.frame': 1623 obs. of 3 variables:
## $ edu : chr "college" "hs or lower" "hs or lower" "hs or lower" ...
## $ gender: chr "female" "male" "male" "female" ...
## $ income: int 60000 0 0 0 1700 45000 8600 0 0 33500 ...
summary(data_no_na)
##      edu               gender              income      
##  Length:1623        Length:1623        Min.   :     0  
##  Class :character   Class :character   1st Qu.:     0  
##  Mode  :character   Mode  :character   Median :  3000  
##                                        Mean   : 23600  
##                                        3rd Qu.: 33700  
##                                        Max.   :450000  
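NA values were removed so that every observation used in the analysis has a recorded education level, gender, and income. This leaves 1,623 complete observations.
A one-way ANOVA is used to test whether mean yearly income differs significantly across education levels, followed by a Tukey HSD test to find which pairs of levels differ. Gender is then examined with boxplots to add context to these results.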
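As a cross-check on the filtering step, the same rows can be selected in base R with `complete.cases()`. The sketch below assumes a made-up data frame `toy` (not the ACS data) and reports how many rows are dropped.

```r
# Toy data frame (hypothetical values) with the same three columns.
toy <- data.frame(
  edu    = c("college", NA, "grad", "hs or lower"),
  gender = c("female", "male", "male", "female"),
  income = c(60000, 1700, NA, 0)
)

# Keep only rows with no missing value in any column.
toy_no_na <- toy[complete.cases(toy), ]

# Report how many rows were kept and dropped.
rows_dropped <- nrow(toy) - nrow(toy_no_na)
cat("Kept", nrow(toy_no_na), "rows; dropped", rows_dropped, "rows\n")
```

Because a row is dropped if any one of its values is missing, the number of rows removed can be smaller than the sum of the per-column missing counts when the same row is missing more than one value.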
anova_result <- aov(income ~ edu, data = data_no_na)
summary(anova_result)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## edu            2 4.383e+11 2.192e+11   115.3 <2e-16 ***
## Residuals   1620 3.079e+12 1.901e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
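The p-value (less than 2e-16) is far below 0.05, so the null hypothesis that mean income is the same at every education level is rejected (F = 115.3 on 2 and 1620 degrees of freedom). At least one education level has a different mean income from the others.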
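The numbers in the ANOVA table above can be reproduced by hand, which helps show where the F value comes from. The sketch below copies the rounded sums of squares and degrees of freedom from the table; it also computes eta squared (the share of total variation attributed to education), which is an added illustration and was not part of the original analysis.

```r
# Values copied from the ANOVA table (rounded as printed by R).
ss_edu   <- 4.383e11  # sum of squares for edu
ss_resid <- 3.079e12  # residual sum of squares
df_edu   <- 2
df_resid <- 1620

# Mean squares are sums of squares divided by their degrees of freedom.
ms_edu   <- ss_edu / df_edu
ms_resid <- ss_resid / df_resid

# The F value is the ratio of the two mean squares.
f_value <- ms_edu / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_edu, df_resid, lower.tail = FALSE)

# Eta squared: proportion of total variation in income explained by edu.
eta_sq <- ss_edu / (ss_edu + ss_resid)

print(round(f_value, 1))
print(round(eta_sq, 3))
```

With these inputs, education level accounts for roughly 12% of the variation in income, so the difference is statistically significant but most of the variation in income is left unexplained by education alone.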
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = income ~ edu, data = data_no_na)
##
## $edu
##                          diff       lwr       upr p adj
## grad-college         32967.98  22880.03  43055.94     0
## hs or lower-college -21238.01 -27440.63 -15035.38     0
## hs or lower-grad    -54205.99 -63259.79 -45152.19     0
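The Tukey HSD test compares each pair of education levels. All three differences are significant (the adjusted p-values round to 0, and none of the 95% confidence intervals contains 0). On average, respondents with a graduate degree earn about 32,968 more per year than those with a college degree, and those with a high school education or lower earn about 21,238 less than college graduates and about 54,206 less than those with a graduate degree. The boxplots show the same ordering.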
library(ggplot2)
# Boxplot of yearly income for each education level
ggplot(data_no_na, aes(x = edu, y = income)) +
  geom_boxplot(fill = "#FFBBFF", color = "skyblue") +
  labs(title = "Yearly Income by Education",
       x = "Education", y = "Yearly Income") +
  theme_minimal()
This boxplot shows the distribution of yearly income for each education
level. A box that sits higher on the y-axis indicates higher incomes for
that level, and a taller box indicates a wider spread of incomes.
# Boxplot of income for each education level, split by gender
ggplot(data_no_na, aes(x = edu, y = income, fill = gender)) +
  geom_boxplot() +
  labs(title = "Income by Gender and Education",
       x = "Education", y = "Income") +
  theme_minimal()
This plot splits each education level by gender, so the income
distributions of men and women can be compared within the same level.
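The gender comparison here is visual only. One way the research question about gender could be tested formally is a two-way ANOVA with an interaction term. The sketch below is not the analysis used in this report: it runs on simulated data (`sim`, with made-up group means) so that it is self-contained, and the same formula would need to be applied to `data_no_na` to say anything about the ACS data.

```r
set.seed(1)

# Simulated data (NOT the ACS data): 3 education levels x 2 genders,
# 50 rows per combination.
sim <- expand.grid(
  rep    = seq_len(50),
  edu    = c("hs or lower", "college", "grad"),
  gender = c("female", "male"),
  stringsAsFactors = TRUE
)

# Made-up mean income per education level, plus a made-up gender shift.
base_income <- c("hs or lower" = 20000, "college" = 40000, "grad" = 70000)
sim$income <- unname(base_income[as.character(sim$edu)]) +
  ifelse(sim$gender == "male", 5000, 0) +
  rnorm(nrow(sim), sd = 15000)

# edu * gender expands to edu + gender + edu:gender.
# The edu:gender row tests whether the income gap between education
# levels differs by gender.
two_way <- aov(income ~ edu * gender, data = sim)
print(summary(two_way))
```

Note that `aov()` uses sequential sums of squares, so with unbalanced groups (as in real survey data) the results for `edu` and `gender` can depend on the order of the terms in the formula.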
Overall, the analysis shows that yearly income differs significantly across education levels, with higher education levels associated with higher average income. Because the data are observational, this is an association and does not prove that more education causes higher income. These results are important because they draw attention to the relationship between education and income. Gender may also have an influence, but it was only examined visually here and can be looked into further in the future. Future research could also include more variables, such as age, to get a better understanding.
References
Dataset: OpenIntro, https://www.openintro.org/data/index.php?data=acs12
Data cleaning and handling NAs: “Analyzing Tropical Storm Data.Rmd”
ANOVA format: “ANOVA.Rmd”
Boxplots: “Descriptive Statistics.Rmd”