Project 2

Introduction

Research Question:

Do metropolitan counties have a significantly higher mean median household income than non-metropolitan counties?

The data set I am investigating is called “county.csv”. Its source is the United States Counties data set on OpenIntro: https://www.openintro.org/data/index.php?data=county. The data set provides social, demographic, and economic variables for the 3,142 U.S. counties, including population in 2000, 2010, and 2017 as well as household, poverty, and education data. I chose this data set because I live in Montgomery County, which is part of the Washington metropolitan area. Thus, my question investigates whether being a metropolitan county is associated with a higher average median household income. The data set has 3,142 observations and 15 variables.

The variables I chose were:

metro: A metropolitan county is a county that is part of a metropolitan area. This is my categorical independent variable and indicates whether a county is metropolitan or non-metropolitan. In this data set, “yes” indicates a metropolitan county and “no” indicates a non-metropolitan county.

median_hh_income: This is a quantitative dependent variable that represents the median household income of a county. Median household income is the income level at which half of the households in a county earn more and half earn less.

Loading libraries

library(tidyverse)
library(corrplot)

Setting working directory and loading the data

setwd("~/Desktop/Data 101")
county <- read_csv("county.csv")

Data Analysis:

Disclaimer: I am aware that my investigation takes the mean of an already calculated median value. Because my data does not include raw household incomes, my calculations of mean, median, and standard deviation should not be regarded as exact measurements of the distribution of household incomes in the US.
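
As a minimal sketch of why a mean of county medians is not the same as a household-level summary, the example below uses made-up incomes for two hypothetical counties of different sizes. The numbers and the names county_a and county_b are invented for illustration and are not taken from county.csv.

```r
# Hypothetical household incomes for two made-up counties (not real data)
county_a <- c(30000, 40000, 50000)                   # small county, 3 households
county_b <- c(60000, 70000, 80000, 90000, 100000)    # larger county, 5 households

# Median household income within each county
county_medians <- c(a = median(county_a), b = median(county_b))

# Mean of the county medians: each county counts equally, whatever its size
mean_of_medians <- mean(county_medians)

# Median over all households pooled together: each household counts equally
pooled_median <- median(c(county_a, county_b))

mean_of_medians  # 60000
pooled_median    # 65000
```

The two numbers differ because the first treats every county as one observation, which is what the analysis in this report does.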

I started by checking the head and structure of the data using the “head” and “str” functions. I then checked for NAs using “colSums” and noticed two NAs in “median_hh_income” and three NAs in “metro”. I also noticed that “smoking_ban” had many NAs (580), since it was said to be collected from various sources, but I did not need to filter those out because I was not using that variable.

I filtered out NAs in my two variables using the dplyr function filter with the condition “is not NA” (!is.na). Additionally, I used the dplyr functions select and mutate. First, I used select to keep the two variables I was focused on, median household income and metro status. Second, I used mutate to add a column that says metropolitan or non_metropolitan instead of yes or no.

Finally, I plotted a box plot comparing median household income by metro status. I chose a box plot because it makes the spread of the data easy to see.

Analyzing the head and structure of the dataset

head(county)

## # A tibble: 6 × 15
##   name           state  pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>          <chr>    <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
## 1 Autauga County Alaba…   43671   54571   55504       1.48    13.7          77.5
## 2 Baldwin County Alaba…  140415  182265  212628       9.19    11.8          76.7
## 3 Barbour County Alaba…   29038   27457   25270      -6.22    27.2          68  
## 4 Bibb County    Alaba…   20826   22915   22668       0.73    15.2          82.9
## 5 Blount County  Alaba…   51024   57322   58013       0.68    15.6          82  
## 6 Bullock County Alaba…   11714   10914   10309      -2.28    28.5          76.9
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>

str(county)

## spc_tbl_ [3,142 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : chr [1:3142] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : num [1:3142] 55504 212628 25270 22668 58013 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : chr [1:3142] "yes" "yes" "no" "yes" ...
##  $ median_edu       : chr [1:3142] "some_college" "some_college" "hs_diploma" "hs_diploma" ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : num [1:3142] 55317 52562 33368 43404 47412 ...
##  $ smoking_ban      : chr [1:3142] "none" "none" "partial" "none" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   state = col_character(),
##   ..   pop2000 = col_double(),
##   ..   pop2010 = col_double(),
##   ..   pop2017 = col_double(),
##   ..   pop_change = col_double(),
##   ..   poverty = col_double(),
##   ..   homeownership = col_double(),
##   ..   multi_unit = col_double(),
##   ..   unemployment_rate = col_double(),
##   ..   metro = col_character(),
##   ..   median_edu = col_character(),
##   ..   per_capita_income = col_double(),
##   ..   median_hh_income = col_double(),
##   ..   smoking_ban = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Checking for NAs

colSums(is.na(county))

##              name             state           pop2000           pop2010 
##                 0                 0                 3                 0 
##           pop2017        pop_change           poverty     homeownership 
##                 3                 3                 2                 0 
##        multi_unit unemployment_rate             metro        median_edu 
##                 0                 3                 3                 2 
## per_capita_income  median_hh_income       smoking_ban 
##                 2                 2               580

Filtering out NAs in my chosen variables (cleaning)

county_filtered <- county |>
  filter(!is.na(median_hh_income) & !is.na(metro))
county_filtered

## # A tibble: 3,139 × 15
##    name           state pop2000 pop2010 pop2017 pop_change poverty homeownership
##    <chr>          <chr>   <dbl>   <dbl>   <dbl>      <dbl>   <dbl>         <dbl>
##  1 Autauga County Alab…   43671   54571   55504       1.48    13.7          77.5
##  2 Baldwin County Alab…  140415  182265  212628       9.19    11.8          76.7
##  3 Barbour County Alab…   29038   27457   25270      -6.22    27.2          68  
##  4 Bibb County    Alab…   20826   22915   22668       0.73    15.2          82.9
##  5 Blount County  Alab…   51024   57322   58013       0.68    15.6          82  
##  6 Bullock County Alab…   11714   10914   10309      -2.28    28.5          76.9
##  7 Butler County  Alab…   21399   20947   19825      -2.69    24.4          69  
##  8 Calhoun County Alab…  112249  118572  114728      -1.51    18.6          70.7
##  9 Chambers Coun… Alab…   36583   34215   33713      -1.2     18.8          71.4
## 10 Cherokee Coun… Alab…   23988   25989   25857      -0.6     16.1          77.5
## # ℹ 3,129 more rows
## # ℹ 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>, metro <chr>,
## #   median_edu <chr>, per_capita_income <dbl>, median_hh_income <dbl>,
## #   smoking_ban <chr>

Selecting my columns (median household income and metropolitan area)

county_filtered <- county_filtered |>
  select(median_hh_income, metro)

Making a new column labeled metropolitan or non_metropolitan instead of yes or no

metro_fixed <- county_filtered |>
  mutate(metropolitan2 = ifelse(metro == "yes", "metropolitan", "non_metropolitan"))
metro_fixed

## # A tibble: 3,139 × 3
##    median_hh_income metro metropolitan2   
##               <dbl> <chr> <chr>           
##  1            55317 yes   metropolitan    
##  2            52562 yes   metropolitan    
##  3            33368 no    non_metropolitan
##  4            43404 yes   metropolitan    
##  5            47412 yes   metropolitan    
##  6            29655 no    non_metropolitan
##  7            36326 no    non_metropolitan
##  8            43686 yes   metropolitan    
##  9            37342 no    non_metropolitan
## 10            40041 no    non_metropolitan
## # ℹ 3,129 more rows
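
Before plotting, a quick per-group summary can make the comparison concrete. The sketch below uses base R's aggregate on only the ten rows printed above, so its numbers describe those ten counties and not the full data set. The name metro_sample is hypothetical. On the real data, metro_fixed could be passed in its place.

```r
# First ten rows of metro_fixed as printed above (the full data has 3,139 rows)
metro_sample <- data.frame(
  median_hh_income = c(55317, 52562, 33368, 43404, 47412,
                       29655, 36326, 43686, 37342, 40041),
  metropolitan2 = c("metropolitan", "metropolitan", "non_metropolitan",
                    "metropolitan", "metropolitan", "non_metropolitan",
                    "non_metropolitan", "metropolitan", "non_metropolitan",
                    "non_metropolitan")
)

# Count, mean, median and standard deviation of income within each group
group_summary <- aggregate(
  median_hh_income ~ metropolitan2,
  data = metro_sample,
  FUN = function(x) c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
)
group_summary
```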

Visualization

ggplot(metro_fixed, aes(x = factor(metropolitan2), y = median_hh_income/100)) +
  geom_boxplot(fill = c("#FF4040", "#2ca02c")) +
  labs(title = "Household Income by Metro Status",
       x = "Type of county", y = "Household Income (in hundreds of dollars)") +
  theme_minimal()

I rescaled the axis to hundreds of dollars because the original dollar labels were hard to read. I learned this technique of dividing the scale in Data 110.
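
An alternative to dividing the values is to keep incomes in dollars and format only the axis labels. The sketch below assumes the scales package is installed (it is a dependency of ggplot2) and uses a small illustrative data frame, income_demo, built from four of the rows printed earlier. It is one possible approach, not the method used for the plot above.

```r
library(ggplot2)

# Small illustrative data frame; values are four rows of metro_fixed shown earlier
income_demo <- data.frame(
  metropolitan2 = c("metropolitan", "metropolitan",
                    "non_metropolitan", "non_metropolitan"),
  median_hh_income = c(55317, 52562, 33368, 29655)
)

# Keep incomes in dollars and format the axis labels instead of rescaling the data
p <- ggplot(income_demo, aes(x = metropolitan2, y = median_hh_income)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(x = "Type of county", y = "Median household income (dollars)") +
  theme_minimal()
p
```

With this approach the axis reads in plain dollars with thousands separators, so values in the text and on the plot share the same units.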

The box plot shows that household income (in hundreds of dollars) is visibly higher in metropolitan counties than in non-metropolitan counties. For metropolitan counties, the median is above 500 on the plot (that is, above $50,000), while for non-metropolitan counties it is below 500 (below $50,000). The spread of household income in metropolitan counties is also larger, with more high outliers above 1000 ($100,000). These could represent the more urban and suburban areas, and they point to a wide economic distribution. Non-metropolitan counties also have high outliers, including one above 1000, which could represent economic diversity, but there is also a dot representing a low outlier below 250 ($25,000). That indicates economic disparity and low household income compared to metropolitan counties, which have only high outliers and no low ones. Metropolitan household income shows more right-skewness because of the long whiskers and the many values above the upper quartile, suggesting higher household incomes.
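
The dots described above are points that fall outside the usual 1.5 × IQR rule for box plots. The sketch below shows that rule on made-up incomes in hundreds of dollars. The helper name outlier_fences and the values are hypothetical and only illustrate how a point ends up drawn as an outlier.

```r
# Points more than 1.5 * IQR beyond the quartiles are drawn as individual dots.
# `outlier_fences` is a hypothetical helper name used only in this sketch.
outlier_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Made-up incomes in hundreds of dollars, with one very high value
incomes <- c(300, 400, 450, 500, 550, 600, 1200)

fences <- outlier_fences(incomes)
flagged <- incomes[incomes < fences["lower"] | incomes > fences["upper"]]
flagged  # 1200
```

Here the quartiles are 425 and 575, so the fences sit at 200 and 800 and only the value 1200 is flagged.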

Statistical analysis

Hypothesis

\(H_0\): \(\mu_1 = \mu_2\)

\(H_a\): \(\mu_1 > \mu_2\)

where:

\(\mu_1\) = mean of the median household incomes of metropolitan counties.

\(\mu_2\) = mean of the median household incomes of non-metropolitan counties.

t.test(metro_fixed$median_hh_income[metro_fixed$metropolitan2 =="metropolitan"],
        metro_fixed$median_hh_income[metro_fixed$metropolitan2 =="non_metropolitan"], alternative ="greater")

## 
##  Welch Two Sample t-test
## 
## data:  metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "metropolitan"] and metro_fixed$median_hh_income[metro_fixed$metropolitan2 == "non_metropolitan"]
## t = 23.51, df = 1832.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  10568.17      Inf
## sample estimates:
## mean of x mean of y 
##  56907.71  45544.11

The p-value is less than 2.2e-16, which is far smaller than α = 0.05, so the result is statistically significant at α = 0.05. There is strong evidence that the mean median household income of metropolitan counties is higher than that of non-metropolitan counties.
The one-sided 95% CI for the difference in means, \(\mu_1 - \mu_2\), is (10568.17, ∞). The interval is entirely above 0, which is consistent with the mean median household income being higher in metropolitan counties than in non-metropolitan counties.
Therefore, the decision is to reject the null hypothesis.
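
To show where the t statistic, degrees of freedom, and lower confidence bound in a Welch test come from, the sketch below computes them by hand and checks them against t.test. It uses only the ten counties printed earlier in the report, so its numbers will not match the full-data output above. The variable names are chosen for this illustration.

```r
# Ten counties printed earlier in the report, split by metro status
metro_income     <- c(55317, 52562, 43404, 47412, 43686)
non_metro_income <- c(33368, 29655, 36326, 37342, 40041)

# Group sizes, means and variances
n1 <- length(metro_income);     m1 <- mean(metro_income);     v1 <- var(metro_income)
n2 <- length(non_metro_income); m2 <- mean(non_metro_income); v2 <- var(non_metro_income)

# Welch t statistic: difference in means over its estimated standard error
se     <- sqrt(v1 / n1 + v2 / n2)
t_stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# One-sided p-value and lower 95% confidence bound for mu_1 - mu_2
p_value     <- pt(t_stat, df, lower.tail = FALSE)
lower_bound <- (m1 - m2) - qt(0.95, df) * se

# Cross-check against R's built-in test; both comparisons should be TRUE
check <- t.test(metro_income, non_metro_income, alternative = "greater")
all.equal(unname(check$statistic), t_stat)
all.equal(check$conf.int[1], lower_bound)
```

Because the alternative is "greater", the interval has only a lower bound, which is why the upper end is reported as infinity.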

Conclusion and Future Directions:

Key findings and analysis/ discussion of results:

In conclusion, the key finding of my analysis is a clear pattern: the mean median household income of metropolitan counties is higher than that of non-metropolitan counties. The p-value was far smaller than α = 0.05, so the difference is statistically significant. The box plot also showed a large difference in the medians of household income, with metropolitan counties much higher than non-metropolitan counties.

Potential avenues

A potential avenue is to investigate the same question using raw household income data. Another is to explore the data set's other variables, for example poverty and unemployment rate, in a statistical analysis.

References

https://www.countyhealthrankings.org/health-data/community-conditions/social-and-economic-factors/income-employment-and-wealth/median-household-income (used to define median household income)