Final Project

Research Question: Is there a significant difference in average yearly income across education levels, and how does this depend on gender?

This dataset is from OpenIntro and uses data from the 2012 US Census American Community Survey. Each row represents one individual's response, and the columns used here are gender, yearly income, and education level. I chose it to understand how these three variables relate and how they can inform future research. Missing values have to be identified and the affected rows filtered out before the analysis.

Cleaning dataset

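library(dplyr)  # provides select() and filter()

# Keep only the three variables used in the analysis
data_clean <- data |>
  select(edu, gender, income)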

Identifying missing data

colSums(is.na(data_clean))
##    edu gender income 
##     58      0    377

Proportion of missing values

missing_prop <- colSums(is.na(data_clean)) / nrow(data_clean)

These counts and proportions show how much data is missing in each column. This needs to be known before testing for significant differences in annual income across education levels.
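The missing-value check can be tried on a small made-up data frame. This is only a sketch: `toy`, `missing_count`, and `missing_prop_toy` are hypothetical names, and the values are invented for illustration, not taken from the ACS data.

```r
# Toy data frame (made-up values, not the ACS data) with some NAs
toy <- data.frame(
  edu    = c("college", NA, "grad", "hs or lower", "college"),
  gender = c("female", "male", "male", "female", "male"),
  income = c(60000, NA, NA, 0, 45000)
)

# Number of missing values in each column
missing_count <- colSums(is.na(toy))

# Proportion of missing values in each column; colMeans() on the
# logical matrix is equivalent to colSums(...) / nrow(...)
missing_prop_toy <- colMeans(is.na(toy))

print(missing_count)
print(round(missing_prop_toy, 2))
```

Here `colMeans(is.na(...))` is used as a shorter way to get the same proportions as dividing the counts by the number of rows.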

Handling missing data

data_no_na <- data_clean |>
  filter(!is.na(edu), !is.na(gender), !is.na(income))

str(data_no_na)
## 'data.frame':    1623 obs. of  3 variables:
##  $ edu   : chr  "college" "hs or lower" "hs or lower" "hs or lower" ...
##  $ gender: chr  "female" "male" "male" "female" ...
##  $ income: int  60000 0 0 0 1700 45000 8600 0 0 33500 ...
summary(data_no_na)
##      edu               gender              income      
##  Length:1623        Length:1623        Min.   :     0  
##  Class :character   Class :character   1st Qu.:     0  
##  Mode  :character   Mode  :character   Median :  3000  
##                                        Mean   : 23600  
##                                        3rd Qu.: 33700  
##                                        Max.   :450000

Rows with NAs were removed so that every observation used in the analysis has a value for all three variables. This leaves 1623 observations.
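For comparison, the same kind of row filtering can be done in base R with `complete.cases()`. The sketch below uses a made-up data frame, and `toy`, `keep`, and `toy_no_na` are hypothetical names that are not part of the project code.

```r
# Toy data frame (made-up values, not the ACS data) with some NAs
toy <- data.frame(
  edu    = c("college", NA, "grad", "hs or lower", "college"),
  gender = c("female", "male", "male", "female", "male"),
  income = c(60000, NA, NA, 0, 45000)
)

# TRUE for rows that have no NA in any of the three columns
keep <- complete.cases(toy[, c("edu", "gender", "income")])

# Keep only the complete rows
toy_no_na <- toy[keep, ]

# Compare the row counts before and after filtering
print(nrow(toy))
print(nrow(toy_no_na))
```

Comparing the two row counts is a quick way to see how many observations the filtering step dropped.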

A one-way ANOVA will be used to test whether mean income differs significantly across education levels. Gender can then provide more information to help interpret these differences.

Performing ANOVA

anova_result <- aov(income ~ edu, data = data_no_na)
summary(anova_result)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## edu            2 4.383e+11 2.192e+11   115.3 <2e-16 ***
## Residuals   1620 3.079e+12 1.901e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA gives F = 115.3 on 2 and 1620 degrees of freedom, with a p-value below 2e-16. This is far less than 0.05, which means there is a significant difference in mean income across education levels.
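As a check on how the table fits together, the F value and p-value can be recomputed from the rounded numbers printed in the ANOVA output. This is only a sketch: the variable names are hypothetical, and small discrepancies are expected because the inputs are rounded.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_edu   <- 4.383e11  # Sum Sq for edu
ss_resid <- 3.079e12  # Sum Sq for Residuals
df_edu   <- 2
df_resid <- 1620

# Mean Sq = Sum Sq / Df
ms_edu   <- ss_edu / df_edu
ms_resid <- ss_resid / df_resid

# F value = Mean Sq (edu) / Mean Sq (Residuals)
f_value <- ms_edu / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_edu, df_resid, lower.tail = FALSE)

print(round(f_value, 1))
print(p_value < 2e-16)
```

The recomputed F value agrees with the table to the printed precision, and the p-value is below the 2e-16 threshold that R reports.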

Tukey’s Honestly Significant Difference (HSD)

TukeyHSD(anova_result)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = income ~ edu, data = data_no_na)
## 
## $edu
##                          diff       lwr       upr p adj
## grad-college         32967.98  22880.03  43055.94     0
## hs or lower-college -21238.01 -27440.63 -15035.38     0
## hs or lower-grad    -54205.99 -63259.79 -45152.19     0

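Tukey's HSD compares each pair of education levels, which also helps when reading the boxplots. All three pairwise differences are significant (the adjusted p-values round to 0). On average, the grad group earns about 33,000 more per year than the college group, and the hs or lower group earns about 21,000 less than the college group.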
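To show how the `diff` column is built, here is a minimal sketch on made-up data. The names `toy`, `fit`, `tukey`, and `group_means` are hypothetical and the incomes are invented. `edu` is stored as a factor here, on the assumption that this is the safest input for `TukeyHSD()`.

```r
# Toy data (made-up incomes, not the ACS data): three groups of four
toy <- data.frame(
  edu    = factor(rep(c("college", "grad", "hs or lower"), each = 4)),
  income = c(40000, 52000, 47000, 61000,
             70000, 85000, 92000, 78000,
             15000, 22000, 30000, 18000)
)

fit   <- aov(income ~ edu, data = toy)
tukey <- TukeyHSD(fit)

# Each "diff" is the mean of the first group named in the row label
# minus the mean of the second group
group_means <- tapply(toy$income, toy$edu, mean)
print(tukey$edu[, "diff"])
print(group_means[["grad"]] - group_means[["college"]])

# The same arithmetic links the three rows of the output above:
# (hs or lower - college) - (grad - college) = (hs or lower - grad)
print(-21238.01 - 32967.98)
```

The last line reproduces the `hs or lower-grad` difference of -54205.99 from the other two rows, which is a quick check that the rows are being read in the right direction.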

Boxplot visual

library(ggplot2)

ggplot(data_no_na, aes(x = edu, y = income)) +
  geom_boxplot(fill = "#FFBBFF",
               color = "skyblue") +
  labs(title = "Yearly Income by Education",
       x = "Education", y = "Yearly Income") +
  theme_minimal()

This boxplot shows the distribution of yearly income for each education level. A box that sits higher on the y-axis means higher incomes for that group, and a taller box means incomes in that group are more spread out.
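The numbers behind a boxplot can also be printed directly. The sketch below uses made-up incomes and hypothetical names (`toy`, `medians`, `iqrs`) to compute the median (the line inside each box) and the interquartile range (the height of each box) per group.

```r
# Toy data (made-up incomes, not the ACS data): three groups of four
toy <- data.frame(
  edu    = factor(rep(c("college", "grad", "hs or lower"), each = 4)),
  income = c(40000, 52000, 47000, 61000,
             70000, 85000, 92000, 78000,
             15000, 22000, 30000, 18000)
)

# Median income per education level (the line inside each box)
medians <- tapply(toy$income, toy$edu, median)

# Interquartile range per education level (the height of each box)
iqrs <- tapply(toy$income, toy$edu, IQR)

print(medians)
print(iqrs)
```

A table like this can make it easier to describe the plot with exact values instead of reading them off the axis.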

Boxplot 2 - Including gender

ggplot(data_no_na, aes(x = edu, y = income, fill = gender)) +
  geom_boxplot() +
  labs(title = "Income by Gender and Education",
       x = "Education", y = "Income") +
  theme_minimal()

This boxplot compares the income distributions of the two genders within each education level.

Conclusion and Future Directions

Overall, this confirms that annual income differs significantly across education levels, and the Tukey comparisons show that higher education levels are associated with higher average income. This is important because it brings attention to the impact of education levels. Gender could also have an influence, but it was only compared visually here, so this can be looked into further in the future. Future research could also include more variables, such as age, to get a better understanding.
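One possible way to look into gender further would be a two-way ANOVA with an interaction term. The sketch below is an assumption about how that could be set up, not part of this analysis. It runs on made-up data with hypothetical names (`toy`, `fit2`), so its output says nothing about the ACS data.

```r
# Toy data (made-up incomes, not the ACS data):
# 3 education levels x 2 genders x 2 observations each
toy <- data.frame(
  edu    = factor(rep(c("college", "grad", "hs or lower"), each = 4)),
  gender = factor(rep(c("female", "male"), times = 6)),
  income = c(40000, 52000, 47000, 61000,
             70000, 85000, 92000, 78000,
             15000, 22000, 30000, 18000)
)

# edu * gender expands to edu + gender + edu:gender, so the table
# has a row for each main effect and one for the interaction
fit2 <- aov(income ~ edu * gender, data = toy)
print(summary(fit2))
```

In a table like this, the `edu:gender` row would indicate whether the income differences between education levels change with gender.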

References

Dataset: OpenIntro, https://www.openintro.org/data/index.php?data=acs12

Dataset cleaning and handling NAs: “Analyzing Tropical Storm Data.Rmd”

ANOVA format: “ANOVA.Rmd”

Boxplot: “Descriptive Statistics.Rmd”