Introduction

Research Questions

1. Do states with higher education levels have higher median household income?
2. Do states with a bigger population size have a higher median household income?
3. Do states with higher median age have a higher median household income?

Data Summary

EDA Summary

The histogram shows that median household income is not normally distributed. Because the normality assumption in linear regression applies to the residuals rather than to the response itself, the residuals of the fitted model should be examined with a histogram or Q-Q plot. The correlation results give an initial indication of the strength of the linear relationships between the quantitative variables and income. Bachelor’s Degree % (r = 0.82) and Median Home Value (r = 0.79) both show strong positive correlations with median household income, indicating that states with higher education levels and higher home values tend to have higher incomes. In contrast, Population for Poverty Status shows a weak positive correlation (r = 0.17), suggesting only a minimal linear relationship with income. The two remaining variables, Median Age and Unemployment Rate, exhibit near-zero correlations (the values reported in Appendix C are -0.003 and 0.010), indicating little to no linear relationship.
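As an illustration of how the correlation coefficients above are defined, the following is a minimal sketch of Pearson's r in Python using only the standard library. The function name `pearson_r` and the five data points are hypothetical and are not taken from the state dataset; the values in this report came from the analysis summarized in Appendix C.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each observation from its variable's mean
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values for illustration only (NOT the state data)
bachelors_pct = [22.0, 27.5, 31.0, 35.5, 41.0]
income = [58000, 66000, 71000, 80000, 93000]

r = pearson_r(bachelors_pct, income)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship; it does not rule out a nonlinear one, which is why the scatterplots are checked as well.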

These findings are consistent with the scatterplots, where Population for Poverty Status, Bachelor’s Degree %, and Median Home Value display positive linear trends with income. There is no evidence of curvilinear relationships in these plots, so higher-order (quadratic) terms are not needed. For Median Age and Unemployment Rate, the scatterplots show no clear trend, which is consistent with their near-zero correlations with income.

For the categorical variables, the summary statistics and boxplots reveal meaningful differences in income across several groups. Region shows clear variation, with the Northeast having the highest mean income (approximately $87,109), followed by the West (approximately $82,514), while the South has the lowest mean income (approximately $69,530). Coastal status also shows a difference, with coastal states having a higher mean income (approximately $82,092) than inland states (approximately $72,618). Political affiliation shows one of the largest gaps, with Democrat-leaning states having a higher mean income (approximately $85,751) than Republican-leaning states (approximately $71,127).

In contrast, Population Category shows relatively small differences in mean income across levels (Small: approximately $75,524; Medium: approximately $75,193; Large: approximately $80,107), suggesting a weaker relationship with the response variable. Additionally, the presence of outliers, particularly within the Republican category, indicates that further diagnostic analysis using Cook’s Distance and residual plots may be necessary to assess their influence on the model.
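To make the comparison across categorical variables concrete, the short Python sketch below computes the gap between the highest and lowest group mean for each variable, using the group means reported in Appendix C. The names `group_means` and `mean_gap` are our own, and the gap between extreme means is only a descriptive summary, not a formal test of group differences.

```python
# Group means of median household income as reported in Appendix C
group_means = {
    "Region": {"Midwest": 73307, "Northeast": 87109, "South": 69530, "West": 82514},
    "Coastal status": {"Coastal": 82092, "Inland": 72618},
    "Political affiliation": {"Democrat": 85751, "Republican": 71127},
    "Population category": {"Large": 80107, "Medium": 75193, "Small": 75524},
}

def mean_gap(means):
    """Difference between the highest and the lowest group mean."""
    return max(means.values()) - min(means.values())

gaps = {variable: mean_gap(means) for variable, means in group_means.items()}

# List the variables from the largest gap to the smallest
for variable, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{variable}: {gap:,}")
```

The gaps are 17,579 for Region, 14,624 for Political Affiliation, 9,474 for Coastal status, and 4,914 for Population Category, which matches the ordering described above.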

Finally, interaction plots suggest potential interactions between Region and Population, as well as Political Affiliation and Region. This indicates that the effect of one explanatory variable on income may depend on the level of another variable, and therefore these interactions should be formally tested in the regression modeling stage.
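One way to read an interaction plot numerically is as a difference of differences: if the gap between two groups changes from one level of the other variable to the next, the lines in the plot are not parallel. The Python sketch below illustrates this with hypothetical cell means; the numbers and the names `cell_means` and `simple_effect` are invented for illustration and are not taken from the state data.

```python
# Hypothetical cell means for two regions and two affiliations.
# These values are made up for illustration; they are NOT from the state data.
cell_means = {
    ("Northeast", "Democrat"): 90000,
    ("Northeast", "Republican"): 80000,
    ("South", "Democrat"): 75000,
    ("South", "Republican"): 68000,
}

def simple_effect(region):
    """Democrat-minus-Republican gap in mean income within one region."""
    return cell_means[(region, "Democrat")] - cell_means[(region, "Republican")]

# If the affiliation gap were the same in both regions, this contrast would be 0
# and the lines in the interaction plot would be parallel.
interaction_contrast = simple_effect("Northeast") - simple_effect("South")
print(simple_effect("Northeast"), simple_effect("South"), interaction_contrast)
```

A nonzero contrast in sample means is only suggestive. Whether the interaction is statistically meaningful still has to be tested with an interaction term in the regression model.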

Analysis and Results

The additional technique we selected is k-fold cross-validation, which splits the data into k subsets (folds). We use 5-fold cross-validation, so the dataset is split into 5 groups. This number of folds was chosen with regard to the size of the dataset, which is relatively small at 50 observations: with 5 folds, each test set contains 10 states, whereas a larger number of folds would leave fewer states in each test set and make the per-fold error estimates less stable. In each iteration, 4 groups are used to train the model and the remaining group is used to test it. The process is repeated 5 times so that each group serves as the test set exactly once. The purpose of this technique is to obtain a more reliable estimate of how the model will perform on new or unseen data.
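The procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the model used in this report: it fits a one-predictor least-squares line rather than the full regression model, it uses simulated data in place of the 50 states, and the function names (`k_fold_indices`, `fit_simple_ols`, `cross_validated_rmse`) are our own.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cross_validated_rmse(x, y, k=5, seed=0):
    """Return the mean test-fold RMSE and the list of per-fold RMSEs."""
    fold_rmse = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on the k-1 training folds only
        intercept, slope = fit_simple_ols(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        # Evaluate on the fold that was left out
        sq_err = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
        fold_rmse.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_rmse) / k, fold_rmse

# Simulated stand-in for 50 observations (NOT the state data)
rng = random.Random(42)
x = [rng.uniform(20, 45) for _ in range(50)]
y = [30000 + 1500 * xi + rng.gauss(0, 5000) for xi in x]

folds = k_fold_indices(50, 5)
mean_rmse, fold_rmse = cross_validated_rmse(x, y, k=5)
print([len(fold) for fold in folds])
print(round(mean_rmse, 1))
```

With 50 observations and k = 5, each fold holds 10 observations, so every model is trained on 40 states and tested on the remaining 10. Averaging the five test-fold errors gives the cross-validated estimate of out-of-sample error.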

Conclusions

Appendix A: Data Dictionary

Appendix B: Data Rows

Appendix C: Exploratory Data Analysis

$Midwest
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  67769   69404   71622   73307   75104   85086 

$Northeast
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  73733   81211   84972   87109   96838   99858 

$South
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54203   60514   67718   69530   74919   98678 

$West
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  62268   74942   80160   82514   93421   95521 

$Coastal
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54203   73522   82095   82092   94964   99858 

$Inland
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  55948   68157   71810   72618   76444   93421 

$Democrat
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  62268   80270   85029   85751   94784   99858 

$Republican
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54203   67666   71118   71127   74632   96838 

$Large
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  67631   70804   75780   80107   89931   99858 

$Medium
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  58229   62194   73032   75193   86731   98678 

$Small
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  54203   70804   74590   75524   81361   96838 

[1] 0.1656063
[1] 0.8161651
[1] 0.7937665
[1] -0.002710338
[1] 0.01043475

Appendix D: Final Model Output and Plots

Appendix E: References

The Council of State Governments. (2023). State tables: 2023-3-3. Book of the States. https://bookofthestates.org/tables/2023-3-3/

Federal Election Commission. (n.d.). Election results and voting information. https://www.fec.gov/introduction-campaign-finance/election-results-and-voting-information/

U.S. Bureau of Labor Statistics. (2024). Local area unemployment statistics: 2023 annual averages. https://www.bls.gov/lau/lastrk23.htm

U.S. Census Bureau. (2023). Educational attainment (Table S1501). American Community Survey 1-year estimates. https://data.census.gov/table/ACSST1Y2023.S1501

U.S. Census Bureau. (2023). Age and sex (Table S0101). American Community Survey 1-year estimates. https://data.census.gov/table/ACSST1Y2023.S0101

U.S. Census Bureau. (n.d.). Census regions and divisions of the United States [Map]. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf

U.S. Census Bureau. (2023). Median household income in the past 12 months (Table B19013). American Community Survey 1-year estimates. https://data.census.gov/table/ACSDT1Y2023.B19013

U.S. Census Bureau. (2023). ACS demographic and housing estimates (Table DP05). American Community Survey 1-year estimates. https://data.census.gov/table/ACSDP1Y2023.DP05

U.S. Census Bureau. (2023). Selected housing characteristics (Table DP04). American Community Survey 1-year estimates. https://data.census.gov/table/ACSDP1Y2023.DP04

U.S. Census Bureau. (2023). Poverty status in the past 12 months (Table S1701). American Community Survey 1-year estimates. https://data.census.gov/table/ACSST1Y2023.S1701

Background

Data Sources

Additional Help