library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr and ggplot2 are already attached by library(tidyverse) above
setwd("~/Documents/EC/Spring 2026/DATA 101/Project Final")
county <- read_csv("county.csv")
## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(county)
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : num [1:3142] 55504 212628 25270 22668 58013 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : chr [1:3142] "yes" "yes" "no" "yes" ...
## $ median_edu : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
## $ smoking_ban : chr [1:3142] "none" "none" "partial" "none" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. state = col_character(),
## .. pop2000 = col_double(),
## .. pop2010 = col_double(),
## .. pop2017 = col_double(),
## .. pop_change = col_double(),
## .. poverty = col_double(),
## .. homeownership = col_double(),
## .. multi_unit = col_double(),
## .. unemployment_rate = col_double(),
## .. metro = col_character(),
## .. median_edu = col_character(),
## .. per_capita_income = col_double(),
## .. median_hh_income = col_double(),
## .. smoking_ban = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(county)
## # A tibble: 6 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Autauga County Alaba… 43671 54571 55504 1.48 13.7 77.5
## 2 Baldwin County Alaba… 140415 182265 212628 9.19 11.8 76.7
## 3 Barbour County Alaba… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb County Alaba… 20826 22915 22668 0.73 15.2 82.9
## 5 Blount County Alaba… 51024 57322 58013 0.68 15.6 82
## 6 Bullock County Alaba… 11714 10914 10309 -2.28 28.5 76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## # median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## # smoking_ban <chr>
Is there a significant difference in median household income among the different types of median education of counties in Maryland? The dataset I selected contains data on 3,142 counties in the United States. It has 3,142 observations and 15 variables, which makes it a good fit for this project. The research question in the first sentence is what I will investigate throughout this project. I will primarily use two variables, median_hh_income and median_edu, along with state and name to keep the project organized and legible. I found the dataset on the OpenIntro website, which was linked in the datasets section on Blackboard. I chose this topic because I was interested in how counties across America differ, but I filtered the data to Maryland to make it more manageable for this project. It is interesting to me to see how Maryland counties differ both financially and educationally. OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=county.
To find out whether there is a significant difference in median household income among the different types of median education, I will perform an ANOVA (Analysis of Variance) test, which compares the mean of median household income across the levels of the median education variable. The test produces a p-value: the probability of seeing group differences at least this large if all the group means were actually equal. First, I will clean the dataset, filter it to Maryland counties, and select the variables needed to answer the question. I will also filter out NA values in case any values are missing. Lastly, I will draw a boxplot to visualize the difference in median household income among the different types of median education.
names(county) <- gsub("[(). \\-]", "_", names(county))
names(county) <- gsub("_$", "", names(county))
names(county) <- tolower(names(county))
head(county)
## # A tibble: 6 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Autauga County Alaba… 43671 54571 55504 1.48 13.7 77.5
## 2 Baldwin County Alaba… 140415 182265 212628 9.19 11.8 76.7
## 3 Barbour County Alaba… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb County Alaba… 20826 22915 22668 0.73 15.2 82.9
## 5 Blount County Alaba… 51024 57322 58013 0.68 15.6 82
## 6 Bullock County Alaba… 11714 10914 10309 -2.28 28.5 76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## # median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## # smoking_ban <chr>
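The column names in county.csv are already lowercase with underscores, so the three cleaning lines above leave them unchanged, which is why the two head() outputs match. Below is a minimal sketch of what the same steps would do to messier names. The two example names are hypothetical and are not columns of this dataset.

```r
# Hypothetical messy column names (NOT from county.csv), for illustration only
messy <- c("Pop. Change", "Median-HH Income (USD)")

# Same three steps as in the report
cleaned <- gsub("[(). \\-]", "_", messy)  # parentheses, dots, spaces, hyphens become "_"
cleaned <- gsub("_$", "", cleaned)        # drop a single trailing underscore
cleaned <- tolower(cleaned)               # lowercase everything

print(cleaned)
```

Each matched character becomes its own underscore, so adjacent punctuation and spaces can leave double underscores in the cleaned names.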
maryland <- county |>
  select(name, state, median_edu, median_hh_income) |>
  filter(state == "Maryland") |>
  # drop rows with a missing income or education value
  filter(!is.na(median_hh_income), !is.na(median_edu)) |>
  rename(county_name = name)
head(maryland)
## # A tibble: 6 × 4
## county_name state median_edu median_hh_income
## <chr> <chr> <chr> <dbl>
## 1 Allegany County Maryland hs_diploma 42771
## 2 Anne Arundel County Maryland some_college 94502
## 3 Baltimore County Maryland some_college 71810
## 4 Calvert County Maryland some_college 100350
## 5 Caroline County Maryland hs_diploma 52469
## 6 Carroll County Maryland some_college 90510
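Before running the ANOVA, it can help to see how many counties fall in each education group and what each group's mean income is. The sketch below uses only the six rows printed by head(maryland) above, so its numbers describe those six counties and not the full Maryland subset. It uses base R so that it runs without extra packages, and the object names are my own.

```r
# The six Maryland rows printed by head(maryland) above (not the full subset)
md_head <- data.frame(
  county_name = c("Allegany County", "Anne Arundel County", "Baltimore County",
                  "Calvert County", "Caroline County", "Carroll County"),
  median_edu = c("hs_diploma", "some_college", "some_college",
                 "some_college", "hs_diploma", "some_college"),
  median_hh_income = c(42771, 94502, 71810, 100350, 52469, 90510)
)

# Number of counties per education group
group_n <- table(md_head$median_edu)

# Mean of the county median household incomes per education group
group_mean <- tapply(md_head$median_hh_income, md_head$median_edu, mean)

print(group_n)
print(group_mean)
```

Running the same two calls on the full maryland tibble would show how unevenly the counties are spread across the three groups, which is what the "Estimated effects may be unbalanced" note in the aov() output refers to.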
H₀: μ₁ = μ₂ = μ₃ = … = μₖ The mean median household income is the same for every median education group. Hₐ: At least one group's mean median household income differs.
Null Hypothesis: The mean median household income is the same across all median education levels in Maryland counties.
Alternative Hypothesis: The mean median household income differs for at least one median education level in Maryland counties.
anova_result <- aov(median_hh_income ~ median_edu, data = maryland)
anova_result
## Call:
## aov(formula = median_hh_income ~ median_edu, data = maryland)
##
## Terms:
## median_edu Residuals
## Sum of Squares 6252364689 4614275596
## Deg. of Freedom 2 21
##
## Residual standard error: 14823.21
## Estimated effects may be unbalanced
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## median_edu 2 6.252e+09 3.126e+09 14.23 0.000124 ***
## Residuals 21 4.614e+09 2.197e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
P-value: 0.000124
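To show where the F value and p-value come from, here is a short sketch that rebuilds them from the sums of squares and degrees of freedom printed by aov() above. The variable names are my own. The eta squared line is an extra effect-size measure that is not part of the original analysis.

```r
# Values copied from the aov() output above
ss_between <- 6252364689  # Sum of Squares for median_edu
ss_within  <- 4614275596  # Sum of Squares for Residuals
df_between <- 2
df_within  <- 21

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail probability under the null hypothesis
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation associated with the grouping
# (added for illustration; the report itself does not compute it)
eta_sq <- ss_between / (ss_between + ss_within)

print(round(f_stat, 2))
print(signif(p_value, 3))
print(round(eta_sq, 3))
```

The first two printed values should agree with the F value and Pr(>F) columns of summary(anova_result).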
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = median_hh_income ~ median_edu, data = maryland)
##
## $median_edu
## diff lwr upr p adj
## hs_diploma-bachelors -62740.00 -94000.07 -31479.930 0.0001486
## some_college-bachelors -33550.59 -61481.06 -5620.117 0.0168831
## some_college-hs_diploma 29189.41 10181.13 48197.691 0.0024434
hs_diploma-bachelors: Counties where the median education is a high school diploma have a mean median household income about $62,740 lower than counties where it is a bachelor's degree. The difference is statistically significant (adjusted p = 0.00015).
some_college-bachelors: Counties where the median education is some college have a mean median household income about $33,551 lower than counties where it is a bachelor's degree. The difference is statistically significant (adjusted p = 0.017).
some_college-hs_diploma: Counties where the median education is some college have a mean median household income about $29,189 higher than counties where it is a high school diploma. The difference is statistically significant (adjusted p = 0.0024).
Overall, the TukeyHSD test shows that every pair of education groups differs significantly in median household income. This supports the claim that higher median education levels are associated with higher median household income for counties in Maryland.
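One way to read the Tukey table is that a comparison is significant at the 5% level exactly when its 95% confidence interval does not contain zero. The sketch below checks both conditions using the numbers copied from the TukeyHSD() output above. The data frame and column names are my own.

```r
# Values copied from the TukeyHSD() output above
tukey <- data.frame(
  comparison = c("hs_diploma-bachelors", "some_college-bachelors",
                 "some_college-hs_diploma"),
  diff  = c(-62740.00, -33550.59, 29189.41),
  lwr   = c(-94000.07, -61481.06, 10181.13),
  upr   = c(-31479.930, -5620.117, 48197.691),
  p_adj = c(0.0001486, 0.0168831, 0.0024434)
)

alpha <- 0.05

# Interval lies entirely above or entirely below zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# Adjusted p-value below the significance level
tukey$significant <- tukey$p_adj < alpha

print(tukey[, c("comparison", "excludes_zero", "significant")])
```

The two columns should agree for every row, since both express the same 5% family-wise decision.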
ggplot(maryland, aes(x = median_edu, y = median_hh_income, fill = median_edu)) +
geom_boxplot() +
labs(x = "Education Level", y = "Median Household Income",
title = "Median Household Income by Median Education Level Across Maryland Counties",
caption = "Source: county.csv (OpenIntro)") +
theme_minimal()
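Because median_edu is a character column, ggplot2 orders the boxes alphabetically (bachelors, hs_diploma, some_college). The sketch below shows one way to put the groups in increasing order of education by converting to a factor with explicit levels. The chosen order is my assumption about what reads best, and the short vector only stands in for maryland$median_edu.

```r
# Stand-in for maryland$median_edu (three values only, for illustration)
median_edu <- c("some_college", "bachelors", "hs_diploma")

# Explicit level order, lowest to highest education
edu_levels <- c("hs_diploma", "some_college", "bachelors")
median_edu_ordered <- factor(median_edu, levels = edu_levels)

print(levels(factor(median_edu)))   # default order: alphabetical
print(levels(median_edu_ordered))   # explicit order
```

In the report's pipeline this could be applied with mutate(median_edu = factor(median_edu, levels = edu_levels)) before the ggplot() call. Any value not listed in edu_levels would become NA, so the list should cover every value present in the data.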
In this test, our alpha value is 0.05. Because the p-value of 0.000124 is less than 0.05, we reject the null hypothesis. We conclude that there is significant evidence that the mean median household income differs for at least one median education group among Maryland counties.
Looking at my findings, the boxplot shows that median household income differs by education group. The “bachelors” category has the highest overall median by quite a large margin, followed by “some college” and then “hs diploma”. The boxplot suggests a difference, and the ANOVA test confirms it: the test between median household income and median education gave a p-value of 0.000124, which is under the alpha value of 0.05. Using both the boxplot and the ANOVA test, we can confidently say that there is a significant difference in median household income among the different median education levels of Maryland counties. Additionally, the TukeyHSD post-hoc test showed that each pair of education levels differs significantly in median household income, further reinforcing our analysis.
To improve this dataset in the future, I believe it should include data from multiple years. Originally, my research question was whether there was a significant difference in median household income among the individual counties in Maryland, but I could not test that because each county has only one value for the median household income variable, which makes the ANOVA test not applicable. Adding more years of data would make the dataset more useful and expand the variety of research that can be done with it.
OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=county.