Do metropolitan counties have a significantly higher mean median household income than non-metropolitan counties?
The data set I am investigating is called “county.csv”. Its source is the United States Counties data set on OpenIntro (https://www.openintro.org/data/index.php?data=county). The data set provides social, demographic, and economic variables for the 3,142 U.S. counties, including population figures for 2000, 2010, and 2017 as well as household, poverty, and education data. I chose this data set because I live in Montgomery County, which is part of the Washington metropolitan area. My question therefore investigates whether metropolitan counties have a higher mean median household income than non-metropolitan counties. The data set has 3,142 observations and 15 variables.
metro: A metropolitan county is a county that is part of a metropolitan area. This is my categorical independent variable and indicates whether a county is metropolitan or non-metropolitan. In this data set, “yes” indicates a metropolitan county and “no” indicates a non-metropolitan county.
median_hh_income: This is a quantitative dependent variable that represents the median household income of a county. Median household income is the income level at which half of the households in a county earn more and half earn less.
library(tidyverse)
library(corrplot)
setwd("~/Desktop/Data 101")
county <- read_csv("county.csv")
Disclaimer: I am aware that my investigation takes the mean of already calculated median values. Because the data set does not contain raw household incomes, my calculations of the mean, median, and standard deviation should not be regarded as exact measurements of the distribution of household incomes in the US.
I started my analysis by checking the first rows and the structure of the data using the “head” and “str” functions. I then checked for NAs using “colSums” and found two NAs in “median_hh_income” and three NAs in “metro”. I also noticed that “smoking_ban” had many NAs (580), since it was said to be collected from various sources, but I did not need to filter those out because I was not using that variable.
I removed the NAs in my two variables using the dplyr function filter with the condition is not NA (!is.na). I also used the dplyr functions select and mutate. First, I used select to keep only the variables I was focused on, median household income and metro. Second, I used mutate to add a column that labels each county as metropolitan or non_metropolitan instead of yes or no.
Finally, I plotted a box plot comparing median household income by metro status. I chose a box plot because it makes it easy to compare the center and spread of the two groups.
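To illustrate the disclaimer, here is a minimal sketch with made-up numbers. The three counties and their household incomes below are hypothetical and are not taken from county.csv. It shows that the mean of county medians can be quite different from the median of all households pooled together.

```r
# Hypothetical toy data: three made-up "counties" with household incomes.
households <- data.frame(
  county = c("A", "A", "A", "B", "B", "B", "B", "B", "C"),
  income = c(30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 200000)
)

# Median household income within each county (the kind of value county.csv records).
county_medians <- tapply(households$income, households$county, median)

# Mean of the county medians (the kind of quantity this report analyzes).
mean_of_medians <- mean(county_medians)

# Median over all households pooled together (not recoverable from county.csv).
pooled_median <- median(households$income)

print(county_medians)
print(mean_of_medians)
print(pooled_median)
```

In this toy example the single-household county C pulls the mean of medians well above the pooled median, which is why the results in this report describe counties rather than individual households.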
head(county)
## # A tibble: 6 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Autauga County Alaba… 43671 54571 55504 1.48 13.7 77.5
## 2 Baldwin County Alaba… 140415 182265 212628 9.19 11.8 76.7
## 3 Barbour County Alaba… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb County Alaba… 20826 22915 22668 0.73 15.2 82.9
## 5 Blount County Alaba… 51024 57322 58013 0.68 15.6 82
## 6 Bullock County Alaba… 11714 10914 10309 -2.28 28.5 76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## # median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## # smoking_ban <chr>
str(county)
## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : num [1:3142] 55504 212628 25270 22668 58013 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : chr [1:3142] "yes" "yes" "no" "yes" ...
## $ median_edu : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
## $ smoking_ban : chr [1:3142] "none" "none" "partial" "none" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. state = col_character(),
## .. pop2000 = col_double(),
## .. pop2010 = col_double(),
## .. pop2017 = col_double(),
## .. pop_change = col_double(),
## .. poverty = col_double(),
## .. homeownership = col_double(),
## .. multi_unit = col_double(),
## .. unemployment_rate = col_double(),
## .. metro = col_character(),
## .. median_edu = col_character(),
## .. per_capita_income = col_double(),
## .. median_hh_income = col_double(),
## .. smoking_ban = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
colSums(is.na(county))
## name state pop2000 pop2010
## 0 0 3 0
## pop2017 pop_change poverty homeownership
## 3 3 2 0
## multi_unit unemployment_rate metro median_edu
## 0 3 3 2
## per_capita_income median_hh_income smoking_ban
## 2 2 580
# Keep only counties where both variables of interest are present
county_filtered <- county |>
  filter(!is.na(median_hh_income) & !is.na(metro))
county_filtered
## # A tibble: 3,139 × 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Autauga County Alab… 43671 54571 55504 1.48 13.7 77.5
## 2 Baldwin County Alab… 140415 182265 212628 9.19 11.8 76.7
## 3 Barbour County Alab… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb County Alab… 20826 22915 22668 0.73 15.2 82.9
## 5 Blount County Alab… 51024 57322 58013 0.68 15.6 82
## 6 Bullock County Alab… 11714 10914 10309 -2.28 28.5 76.9
## 7 Butler County Alab… 21399 20947 19825 -2.69 24.4 69
## 8 Calhoun County Alab… 112249 118572 114728 -1.51 18.6 70.7
## 9 Chambers Coun… Alab… 36583 34215 33713 -1.2 18.8 71.4
## 10 Cherokee Coun… Alab… 23988 25989 25857 -0.6 16.1 77.5
## # ℹ 3,129 more rows
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## # median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## # smoking_ban <chr>
# Keep only the two variables of interest
county_filtered <- county_filtered |>
  select(median_hh_income, metro)
# Add a readable label for metro status
metro_fixed <- county_filtered |>
  mutate(metropolitan2 = ifelse(metro == "yes", "metropolitan", "non_metropolitan"))
metro_fixed
## # A tibble: 3,139 × 3
## median_hh_income metro metropolitan2
## <dbl> <chr> <chr>
## 1 55317 yes metropolitan
## 2 52562 yes metropolitan
## 3 33368 no non_metropolitan
## 4 43404 yes metropolitan
## 5 47412 yes metropolitan
## 6 29655 no non_metropolitan
## 7 36326 no non_metropolitan
## 8 43686 yes metropolitan
## 9 37342 no non_metropolitan
## 10 40041 no non_metropolitan
## # ℹ 3,129 more rows
# Box plot of county median household income (in hundreds of dollars) by metro status
ggplot(metro_fixed, aes(x = factor(metropolitan2), y = median_hh_income / 100)) +
  geom_boxplot(fill = c("#FF4040", "#2ca02c")) +
  labs(title = "Household Income by Metro Status",
       x = "Type of county", y = "Household Income (in hundreds of dollars)") +
  theme_minimal()
I changed my scale to hundreds of dollars because the axis labels were shown in scientific notation before that. I learned this technique of dividing the scale in Data 110.
The box plot shows that household income (in hundreds of dollars) is visibly higher in metropolitan counties than in non-metropolitan counties. For metropolitan counties, the median is above 500 ($50,000), while for non-metropolitan counties it is below 500 ($50,000). The spread of household income in metropolitan counties is also larger, with more high outliers above 1,000 ($100,000). This could reflect the mix of urban and suburban areas within metropolitan counties and shows a wide economic distribution. Non-metropolitan counties also had high outliers, including one above 1,000 ($100,000), which could represent economic diversity, but there was also a dot representing a low outlier below 250 ($25,000). That indicates economic disparity and lower household incomes compared to metropolitan counties, which had only high outliers and no low ones. Metropolitan household income appears more right-skewed because of the long upper whisker and the many values far above the upper quartile.
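The dots on a box plot are points that fall more than 1.5 times the interquartile range beyond the box. As a rough sketch of that rule, the made-up values below (in hundreds of dollars, not taken from county.csv) are passed to base R's boxplot.stats. Note that boxplot.stats computes the box edges as hinges, which can differ slightly from the quartiles used by geom_boxplot, so this is an illustration of the idea rather than an exact reproduction of the plot.

```r
# Hypothetical incomes in hundreds of dollars (e.g. 500 means $50,000).
income_hundreds <- c(200, 420, 450, 470, 500, 520, 550, 580, 1100)

# boxplot.stats() returns the whisker ends, hinges, and median in $stats,
# and the points beyond 1.5 * IQR from the hinges in $out.
stats <- boxplot.stats(income_hundreds)

lower_hinge <- stats$stats[2]
upper_hinge <- stats$stats[4]
iqr <- upper_hinge - lower_hinge

# Fences: anything outside these limits is drawn as an outlier dot.
lower_fence <- lower_hinge - 1.5 * iqr
upper_fence <- upper_hinge + 1.5 * iqr

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(stats$out)  # values flagged as outliers
```

Here the value 200 falls below the lower fence and 1100 falls above the upper fence, which mirrors the low and high outlier dots described for the real data.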
Hypotheses
\(H_0\): \(\mu_1 = \mu_2\)
\(H_a\): \(\mu_1 > \mu_2\)
where:
\(\mu_1\) = mean of the median household incomes of metropolitan counties.
\(\mu_2\) = mean of the median household incomes of non-metropolitan counties.
t.test(metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "metropolitan"],
       metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "non_metropolitan"],
       alternative = "greater")
##
## Welch Two Sample t-test
##
## data: metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "metropolitan"] and metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "non_metropolitan"]
## t = 23.51, df = 1832.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 10568.17 Inf
## sample estimates:
## mean of x mean of y
## 56907.71 45544.11
The p-value is less than 2.2e-16, which is far smaller than α = 0.05, so the result is statistically significant at α = 0.05. There is strong evidence that the mean median household income of metropolitan counties is higher than that of non-metropolitan counties.
The 95% CI for the difference in means is (10568.17, ∞). Because this is a one-sided test, the interval has only a lower bound. The interval is entirely above 0, showing that the mean median household income of metropolitan counties is higher than that of non-metropolitan counties.
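As a sanity check on where the lower bound comes from, the sketch below rebuilds it from the numbers printed in the t-test output. It assumes the usual relationships t = difference / standard error and lower bound = difference − t quantile × standard error. The variable names are my own, and the standard error is only implied by the rounded output, so the result should land close to 10568.17 but not match it exactly.

```r
# Values copied from the t.test() output in this report.
mean_metro    <- 56907.71
mean_nonmetro <- 45544.11
t_stat        <- 23.51
df_welch      <- 1832.6

# Observed difference in sample means.
diff_means <- mean_metro - mean_nonmetro

# Implied standard error, since t = diff / SE (approximate: t is rounded).
std_error <- diff_means / t_stat

# One-sided 95% lower bound: diff - t_{0.95, df} * SE.
lower_bound <- diff_means - qt(0.95, df = df_welch) * std_error

print(diff_means)
print(lower_bound)
```

The difference in means is about $11,364, and the lower bound says that with 95% confidence the true difference is at least roughly $10,568.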
Therefore, the decision is to reject the null hypothesis.
In conclusion, the key finding of my analysis is a clear pattern: the mean median household income of metropolitan counties ($56,907.71) is higher than that of non-metropolitan counties ($45,544.11). The p-value was far smaller than α = 0.05, so the difference is statistically significant. The box plot also showed that the median for metropolitan counties sits clearly above the median for non-metropolitan counties.
A potential avenue for future work is to investigate the same question with raw household income data. Another is to run statistical analyses on the data set's other variables, for example poverty and unemployment rate.
References
https://www.countyhealthrankings.org/health-data/community-conditions/social-and-economic-factors/income-employment-and-wealth/median-household-income (used to define median household income)