Final Project (DATA-101)

Loading Libraries, Dataset, & Setting Working Directory

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

setwd("~/Documents/EC/Spring 2026/DATA 101/Project Final")

county <- read_csv("county.csv")

## Rows: 3142 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): name, state, metro, median_edu, smoking_ban
## dbl (10): pop2000, pop2010, pop2017, pop_change, poverty, homeownership, mul...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(county)

## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : num [1:3142] 55504 212628 25270 22668 58013 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : chr [1:3142] "yes" "yes" "no" "yes" ...
##  $ median_edu       : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
##  $ smoking_ban      : chr [1:3142] "none" "none" "partial" "none" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   pop2000 = col_double(),
##   ..   pop2010 = col_double(),
##   ..   pop2017 = col_double(),
##   ..   pop_change = col_double(),
##   ..   poverty = col_double(),
##   ..   homeownership = col_double(),
##   ..   multi_unit = col_double(),
##   ..   unemployment_rate = col_double(),
##   ..   metro = col_character(),
##   ..   median_edu = col_character(),
##   ..   per_capita_income = col_double(),
##   ..   median_hh_income = col_double(),
##   ..   smoking_ban = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(county)

## # A tibble: 6 × 15
##   name           state  pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>          <chr>    <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
## 1 Autauga County Alaba…   43671   54571   55504       1.48    13.7          77.5
## 2 Baldwin County Alaba…  140415  182265  212628       9.19    11.8          76.7
## 3 Barbour County Alaba…   29038   27457   25270      -6.22    27.2          68  
## 4 Bibb County    Alaba…   20826   22915   22668       0.73    15.2          82.9
## 5 Blount County  Alaba…   51024   57322   58013       0.68    15.6          82  
## 6 Bullock County Alaba…   11714   10914   10309      -2.28    28.5          76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>

Introduction

Is there a significant difference in median household income among the different median education levels of counties in Maryland? The dataset I selected contains data on 3,142 counties in the United States, with 3,142 observations and 15 variables, which gives plenty of data for this project. The research question stated in the first sentence is what I will investigate throughout this project using various coding techniques. I will primarily use two variables, median_hh_income and median_edu, along with state and name to help with the organization and legibility of my project. I found the dataset on the OpenIntro website, which was linked in the datasets section on Blackboard. I chose this topic because I was interested in the differences among counties in America, but I decided to filter the data to just Maryland to make it more manageable for this project. I find it interesting to observe how counties in Maryland differ financially and educationally. OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=county

Data Analysis

To find out whether there is a significant difference in median household income among the different median education levels, I will perform a one-way ANOVA (Analysis of Variance) test, which compares the mean of median household income across the levels of the median education variable. The test produces a p-value that indicates whether the differences among the group means are larger than would be expected by chance. First, I will clean the dataset, filter it to just Maryland counties, and select the variables needed to answer this question. I will also filter out NA values in case any values are missing. Lastly, I will draw a boxplot to visualize the difference in median household income among the median education levels.

Cleaning

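# Replace parentheses, periods, spaces, and hyphens in column names with underscores
names(county) <- gsub("[(). \\-]", "_", names(county))
# Drop a trailing underscore left over from the replacement
names(county) <- gsub("_$", "", names(county))
# Use lowercase names throughout
names(county) <- tolower(names(county))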

head(county)

## # A tibble: 6 × 15
##   name           state  pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>          <chr>    <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
## 1 Autauga County Alaba…   43671   54571   55504       1.48    13.7          77.5
## 2 Baldwin County Alaba…  140415  182265  212628       9.19    11.8          76.7
## 3 Barbour County Alaba…   29038   27457   25270      -6.22    27.2          68  
## 4 Bibb County    Alaba…   20826   22915   22668       0.73    15.2          82.9
## 5 Blount County  Alaba…   51024   57322   58013       0.68    15.6          82  
## 6 Bullock County Alaba…   11714   10914   10309      -2.28    28.5          76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>
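
The column names in county.csv are already lowercase with underscores, so the cleaning lines leave them unchanged, as the output above shows. A minimal sketch of what the same three steps would do to messier names follows; the names in `messy` are hypothetical and do not come from the dataset.

```r
# Hypothetical column names, invented only to illustrate the cleaning steps
messy <- c("Median-HH Income (USD)", "Per.Capita Income", "pop2017")

clean <- gsub("[(). \\-]", "_", messy)  # parentheses, periods, spaces, hyphens -> "_"
clean <- gsub("_$", "", clean)          # drop a single trailing underscore
clean <- tolower(clean)                 # lowercase everything

print(clean)
```

Each matched character is replaced on its own, so a space followed by a parenthesis becomes a double underscore. If that matters, an extra step to collapse repeated underscores could be added.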

Selecting, Filtering, & Renaming Variables

maryland <- county |>
  select(name, state, median_edu, median_hh_income) |>
  # Keep only Maryland counties
  filter(state == "Maryland") |>
  # Drop rows with a missing income or education value
  filter(!is.na(median_hh_income), !is.na(median_edu)) |>
  rename(county_name = name)
head(maryland)

## # A tibble: 6 × 4
##   county_name         state    median_edu   median_hh_income
##   <chr>               <chr>    <chr>                   <dbl>
## 1 Allegany County     Maryland hs_diploma              42771
## 2 Anne Arundel County Maryland some_college            94502
## 3 Baltimore County    Maryland some_college            71810
## 4 Calvert County      Maryland some_college           100350
## 5 Caroline County     Maryland hs_diploma              52469
## 6 Carroll County      Maryland some_college            90510
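
Before running the test, it can help to see how many counties fall in each education group, since ANOVA compares group means and small groups give less precise estimates. The sketch below uses only the six rows printed by head(maryland), so the counts and means are illustrative and are not the full Maryland totals. The name `maryland_head` is introduced here for the example.

```r
# The six Maryland rows printed by head(maryland); the full table has more counties
maryland_head <- data.frame(
  county_name = c("Allegany County", "Anne Arundel County", "Baltimore County",
                  "Calvert County", "Caroline County", "Carroll County"),
  median_edu = c("hs_diploma", "some_college", "some_college",
                 "some_college", "hs_diploma", "some_college"),
  median_hh_income = c(42771, 94502, 71810, 100350, 52469, 90510)
)

# Number of counties in each education group
group_sizes <- table(maryland_head$median_edu)
print(group_sizes)

# Mean of the county medians within each group, which is what ANOVA compares
group_means <- tapply(maryland_head$median_hh_income, maryland_head$median_edu, mean)
print(group_means)
```

Running the same two calls on the full `maryland` table would give the actual group sizes and group means behind the ANOVA.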

Statistical Analysis

Hypotheses

H₀: μ₁ = μ₂ = μ₃ (the mean median household income is the same for all three median education groups).

Hₐ: At least one education group's mean median household income differs.

Null Hypothesis: The mean median household income is the same across all median education levels in Maryland counties.

Alternative Hypothesis: The mean median household income differs for at least one median education level in Maryland counties.

ANOVA (Analysis of Variance) Test

anova_result <- aov(median_hh_income ~ median_edu, data = maryland)
anova_result

## Call:
##    aov(formula = median_hh_income ~ median_edu, data = maryland)
## 
## Terms:
##                 median_edu  Residuals
## Sum of Squares  6252364689 4614275596
## Deg. of Freedom          2         21
## 
## Residual standard error: 14823.21
## Estimated effects may be unbalanced

summary(anova_result)

##             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## median_edu   2 6.252e+09 3.126e+09   14.23 0.000124 ***
## Residuals   21 4.614e+09 2.197e+08                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value: 0.000124
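
As a check on where the F value and p-value come from, the arithmetic can be redone by hand from the sums of squares and degrees of freedom printed by aov(). This is a sketch using only those printed numbers; the variable names are introduced here.

```r
# Values copied from the aov() output above
ss_between <- 6252364689   # Sum of Squares for median_edu
ss_within  <- 4614275596   # Sum of Squares for Residuals
df_between <- 2
df_within  <- 21

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F statistic is the ratio of the two mean squares
f_stat <- ms_between / ms_within

# The upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(signif(p_value, 3))
```

The two printed values should agree with the F value of 14.23 and the Pr(>F) of 0.000124 reported by summary(anova_result).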

Post-Hoc Test (TukeyHSD)

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = median_hh_income ~ median_edu, data = maryland)
## 
## $median_edu
##                              diff       lwr        upr     p adj
## hs_diploma-bachelors    -62740.00 -94000.07 -31479.930 0.0001486
## some_college-bachelors  -33550.59 -61481.06  -5620.117 0.0168831
## some_college-hs_diploma  29189.41  10181.13  48197.691 0.0024434

Post-Hoc Interpretation

hs_diploma-bachelors: Counties whose median education level is a high school diploma have, on average, a median household income $62,740 lower than counties whose median education level is a bachelor's degree. The adjusted p-value (0.0001486) is statistically significant.
some_college-bachelors: Counties whose median education level is some college have, on average, a median household income $33,551 lower than counties whose median education level is a bachelor's degree. The adjusted p-value (0.0168831) is statistically significant.
some_college-hs_diploma: Counties whose median education level is some college have, on average, a median household income $29,189 higher than counties whose median education level is a high school diploma. The adjusted p-value (0.0024434) is statistically significant.
Overall, the TukeyHSD test shows that all three pairwise differences in median household income between education groups are statistically significant at the 0.05 level. This supports the claim that higher median education levels are associated with higher median household income for counties in Maryland.
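
The link between the confidence intervals and the adjusted p-values can be checked directly: a 95% family-wise interval that excludes zero goes with an adjusted p-value below 0.05. The sketch below copies the numbers from the TukeyHSD() output; the data frame `tukey` and its column names are introduced here for the example.

```r
# Values copied from the TukeyHSD() output above
tukey <- data.frame(
  comparison = c("hs_diploma-bachelors", "some_college-bachelors",
                 "some_college-hs_diploma"),
  diff  = c(-62740.00, -33550.59, 29189.41),
  lwr   = c(-94000.07, -61481.06, 10181.13),
  upr   = c(-31479.930, -5620.117, 48197.691),
  p_adj = c(0.0001486, 0.0168831, 0.0024434)
)

# An interval that excludes zero corresponds to an adjusted p-value below 0.05
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
tukey$significant   <- tukey$p_adj < 0.05

# The third difference follows from the first two:
# (some_college - bachelors) - (hs_diploma - bachelors) = some_college - hs_diploma
implied_diff <- tukey$diff[2] - tukey$diff[1]

print(tukey)
print(implied_diff)
```

All three intervals exclude zero, which matches the three significant adjusted p-values, and the implied difference matches the some_college-hs_diploma row.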

Boxplot of the Difference in Median Household Income Among the Different Types of Median Education

ggplot(maryland, aes(x = median_edu, y = median_hh_income, fill = median_edu)) +
  geom_boxplot() +
  labs(x = "Education Level", y = "Median Household Income", 
       title = "Difference in Median Household Income Among the Different Types of Median Education Levels Across Counties",
       caption = "Source: county.csv (OpenIntro)") +
  theme_minimal()
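
Because median_edu is stored as text, the groups are ordered alphabetically by default (bachelors, hs_diploma, some_college), which is also the order in the TukeyHSD output. A possible refinement, not part of the original analysis, is to convert the column to a factor with explicit levels so the boxes run from least to most education. The vector `edu` below is a small illustrative sample of the labels, not the real column.

```r
# Illustrative sample of the education labels (character values)
edu <- c("hs_diploma", "some_college", "some_college", "bachelors")

# Default ordering when converted to a factor: alphabetical
default_levels <- levels(factor(edu))

# Explicit ordering from least to most education
edu_ordered <- factor(edu, levels = c("hs_diploma", "some_college", "bachelors"))

print(default_levels)
print(levels(edu_ordered))
```

Applying the same factor() call to maryland$median_edu before plotting would make the boxplot show the groups in that order.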

Interpreting ANOVA Results

In this test, our alpha value is 0.05. Because the p-value of 0.000124 is less than 0.05, we reject the null hypothesis. We can conclude that there is significant evidence that the mean median household income differs for at least one median education group among Maryland counties.

Conclusion and Future Directions

Looking at my findings, the boxplot I created shows that median household income differs across the education groups. In the boxplot, the “bachelors” education category has the highest overall median by quite a large margin, with “some college” and “hs diploma” following. The boxplot suggests a difference, and the ANOVA test confirms that it is significant: the test between median household income and median education gave a p-value of 0.000124, which is under the alpha value of 0.05. Using the information from both the boxplot and the ANOVA test, we can conclude that there is a significant difference in median household income among the median education levels of Maryland counties. Additionally, the TukeyHSD post-hoc test showed that each pair of education levels differs significantly in median household income, further reinforcing our analysis.

To make this dataset more useful in the future, I believe it should include data from multiple years. My original research question was whether there is a significant difference in median household income among the counties in Maryland, but I could not test that because each county has only one value for the median household income variable, which made the ANOVA test not applicable. Adding more years would give each county multiple observations and expand the variety of research that can be done in the future.

References

OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=county