Project 1 - Data 110 (1)

Author

Muhammad Ahtisham

Introduction

The dataset Home Income Limits contains data on the income limits calculated by their number of members in a household, location, and income. This dataset gives data based on state and county, as well. This dataset is provided by the U.S. Department of Housing and Urban Development (HUD). With adjustments based on family size, the income limits are calculated based on the HUD’s estimates of median family income.

Important Background Information: Income limits are defined by the HUD and these income limits categorize those who qualify for affordable housing programs based on their location, income, and family size. It’s also important to know about the Area Median Income (AMI). AMI is the median income in an area. The quantitative variables in this dataset use AMI to categorize the income limits into 5 categories: 30%, 50%, 60%, 80%, and 120%. For example, the AMI of a specific area is 100,000 USD, 30% of that is 30,000 USD, 50% of that is 50,000 USD, and so on. This dataset uses this to create upper income limits of those who can qualify for affordable housing programs based on their location, household size, and income.

There are 48 total variables and 4764 observations. The quantitative variables are the last 40 variables; they all show the income limit categorized by the number of people in the household and the Area Median Income (e.g. Lim80_25p8 - 80% of AMI in 2025 of an 8 person household). The first 8 variables are qualitative variables that show the state, county, address, etc.

In this project, I will focus on income limits of the minimum number of household members and the maximum number of household members of the 120% of AMI in the dataset. I’ll also be looking at the states and see how the income limits correlate with what state they reside in. Since there’s 50 states, I will focus on the top 5 most expensive states in the United States according to a 2026 report from the World Population Review.

Variables I’ll be using:

  • State
  • Lim120_25p1
  • Lim120_25p8
library(tidymodels)
library(readxl)
library(ggthemes)

Loading the Dataset

setwd("C:/Users/sarah/Downloads")
data <- read_excel("HOME_IncomeLmts_Natl_2025.xlsx")
head(data)
# A tibble: 6 × 48
  fips2010   cbsasub       hud_area_name county_town_name state statename county
  <chr>      <chr>         <chr>         <chr>            <dbl> <chr>      <dbl>
1 0100199999 METRO33860M3… Montgomery, … Autauga County       1 ALABAMA        1
2 0100399999 METRO19300M1… Daphne-Fairh… Baldwin County       1 ALABAMA        3
3 0100599999 NCNTY01005N0… Barbour Coun… Barbour County       1 ALABAMA        5
4 0100799999 METRO13820M1… Birmingham-H… Bibb County          1 ALABAMA        7
5 0100999999 METRO13820M1… Birmingham-H… Blount County        1 ALABAMA        9
6 0101199999 NCNTY01011N0… Bullock Coun… Bullock County       1 ALABAMA       11
# ℹ 41 more variables: county_name <chr>, lim30_25p1 <dbl>, lim30_25p2 <dbl>,
#   lim30_25p3 <dbl>, lim30_25p4 <dbl>, lim30_25p5 <dbl>, lim30_25p6 <dbl>,
#   lim30_25p7 <dbl>, lim30_25p8 <dbl>, lim50_25p1 <dbl>, lim50_25p2 <dbl>,
#   lim50_25p3 <dbl>, lim50_25p4 <dbl>, lim50_25p5 <dbl>, lim50_25p6 <dbl>,
#   lim50_25p7 <dbl>, lim50_25p8 <dbl>, Lim60_25p1 <dbl>, Lim60_25p2 <dbl>,
#   Lim60_25p3 <dbl>, Lim60_25p4 <dbl>, Lim60_25p5 <dbl>, Lim60_25p6 <dbl>,
#   Lim60_25p7 <dbl>, Lim60_25p8 <dbl>, Lim80_25p1 <dbl>, Lim80_25p2 <dbl>, …

Cleaning

colSums(is.na(data)) # Check for N/A values
        fips2010          cbsasub    hud_area_name county_town_name 
               0                0                0                0 
           state        statename           county      county_name 
               0                0                0                0 
      lim30_25p1       lim30_25p2       lim30_25p3       lim30_25p4 
               0                0                0                0 
      lim30_25p5       lim30_25p6       lim30_25p7       lim30_25p8 
               0                0                0                0 
      lim50_25p1       lim50_25p2       lim50_25p3       lim50_25p4 
               0                0                0                0 
      lim50_25p5       lim50_25p6       lim50_25p7       lim50_25p8 
               0                0                0                0 
      Lim60_25p1       Lim60_25p2       Lim60_25p3       Lim60_25p4 
               0                0                0                0 
      Lim60_25p5       Lim60_25p6       Lim60_25p7       Lim60_25p8 
               0                0                0                0 
      Lim80_25p1       Lim80_25p2       Lim80_25p3       Lim80_25p4 
               0                0                0                0 
      Lim80_25p5       Lim80_25p6       Lim80_25p7       Lim80_25p8 
               0                0                0                0 
     lim120_25p1      lim120_25p2      lim120_25p3      lim120_25p4 
               0                0                0                0 
     lim120_25p5      lim120_25p6      lim120_25p7      lim120_25p8 
               0                0                0                0 

No N/A values.

data2 <- data |>
  select(statename, lim120_25p1, lim120_25p8) # Select the three variables I'll be using
head(data2)
# A tibble: 6 × 3
  statename lim120_25p1 lim120_25p8
  <chr>           <dbl>       <dbl>
1 ALABAMA         70200      132400
2 ALABAMA         78100      147300
3 ALABAMA         60650      114350
4 ALABAMA         80550      151900
5 ALABAMA         80550      151900
6 ALABAMA         60650      114350

Linear Regression

fit1 <- lm(data = data2, lim120_25p8 ~ lim120_25p1) # Create a linear model of lim120_25p8 and lim120_25p1
summary(fit1)

Call:
lm(formula = lim120_25p8 ~ lim120_25p1, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-66.849 -25.063   3.211  22.075  68.310 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 8.0534266  1.9309713     4.171 3.09e-05 ***
lim120_25p1 1.8855901  0.0000227 83076.668  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30.99 on 4762 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.902e+09 on 1 and 4762 DF,  p-value: < 2.2e-16

The p-value for both variables is significant (less than 0.05), specifically < 2e-16 for lim120_25p1 and 3.09e-5 for lim120_25p8.

The adjusted R-squared is 1. This means that we have a 100% correlation between lim120_25p1 and lim120_25p8.

Equation: y = 1.8855901(lim_120_25p1) + 8.0534266

plot(fit1) # Diagnostic Plots

These diagnostic plots show a lot of insight on whether or not the linear model is appropriate or good. If the Residuals vs. Fitted plot is random, meaning the points are all scattered randomly about y=0, then the model is appropriate. In this case, the points are clustered around the x value 150,000, but there is still some randomness. The Q-Q plot shows if our model is good. If the points follow a straight diagonal line, then the model is good. In our case, they most of the time follow the line with some deviation towards the start and end. This means our model is good, or at least somewhat good.

But why might the model be considered somewhat good, but not entirely appropriate? Let’s take a look at a scatter-plot first.

options(scipen = 999) # Remove scientific notation
ggplot(data2, aes(x=lim120_25p1, y=lim120_25p8)) + # Map x-axis as lim120_25p1 and y-axis as lim120_25p8 
  geom_point() + # Scatter plot
  theme_bw() # Arbitrarily chosen theme

As expected, this scatterplot shows a 100% correlation between the two variables as the cases are perfectly in a straight line, which is why the adjusted R-squared value was 1. You may ask, why is the correlation a 100%?

The correlation between the income limits is a 100% because the HUD determines the income limit off a mathematical formula. The HUD scales the income limit for a household of one person to derive the income limit for a household of 8 people in an area. Therefore, there is no randomness; every value can be obtained by using math. This is why there is a 100% correlation between the house income limits for a particular area, because it can all be calculated. This is also why the model is somewhat good, but not completely appropriate, because it may not be appropriate to create a model that predicts these income limits when you can just use math.

Visualization

The top 5 most expensive states to live in as of 2026, according to the World Population review, are Hawaii, Massachusetts, California, District of Columbia (considered as a state in the dataset), and New York, respectively.

statesdata <- data |> # New dataset called "statesdata" from "data"
  filter(statename %in% c("HAWAII", "MASSACHUSETTS", "NEW YORK", "CALIFORNIA", "DISTRICT OF COLUMBIA")) # Filter for these 5 states
unique(statesdata$statename) # Display each unique value of the variable
[1] "CALIFORNIA"           "DISTRICT OF COLUMBIA" "HAWAII"              
[4] "MASSACHUSETTS"        "NEW YORK"            
statesdata <- statesdata |> 
  mutate(overall_mean = rowMeans(statesdata[,c(9:48)])) # Create a new variable called "overall_mean" that calculates the mean from columns 9:48, which is every income limit variable. Basically, this chunk calculates the mean of every income limit, regardless of the AMI and number of people, for every row. *See references for documentation.*
head(statesdata)
# A tibble: 6 × 49
  fips2010   cbsasub       hud_area_name county_town_name state statename county
  <chr>      <chr>         <chr>         <chr>            <dbl> <chr>      <dbl>
1 0600199999 METRO41860MM… Oakland-Frem… Alameda County       6 CALIFORN…      1
2 0600399999 NCNTY06003N0… Alpine Count… Alpine County        6 CALIFORN…      3
3 0600599999 NCNTY06005N0… Amador Count… Amador County        6 CALIFORN…      5
4 0600799999 METRO17020M1… Chico, CA MSA Butte County         6 CALIFORN…      7
5 0600999999 NCNTY06009N0… Calaveras Co… Calaveras County     6 CALIFORN…      9
6 0601199999 NCNTY06011N0… Colusa Count… Colusa County        6 CALIFORN…     11
# ℹ 42 more variables: county_name <chr>, lim30_25p1 <dbl>, lim30_25p2 <dbl>,
#   lim30_25p3 <dbl>, lim30_25p4 <dbl>, lim30_25p5 <dbl>, lim30_25p6 <dbl>,
#   lim30_25p7 <dbl>, lim30_25p8 <dbl>, lim50_25p1 <dbl>, lim50_25p2 <dbl>,
#   lim50_25p3 <dbl>, lim50_25p4 <dbl>, lim50_25p5 <dbl>, lim50_25p6 <dbl>,
#   lim50_25p7 <dbl>, lim50_25p8 <dbl>, Lim60_25p1 <dbl>, Lim60_25p2 <dbl>,
#   Lim60_25p3 <dbl>, Lim60_25p4 <dbl>, Lim60_25p5 <dbl>, Lim60_25p6 <dbl>,
#   Lim60_25p7 <dbl>, Lim60_25p8 <dbl>, Lim80_25p1 <dbl>, Lim80_25p2 <dbl>, …
statesdata <- statesdata |>
  group_by(statename) |> # Group by the state 
  summarise(grand_mean = mean(overall_mean)) # Calculate the mean of the overall_mean discussed above as per the state. This chunk calculates the grand mean of the income limit by the state.
statesdata
# A tibble: 5 × 2
  statename            grand_mean
  <chr>                     <dbl>
1 CALIFORNIA               84284.
2 DISTRICT OF COLUMBIA    109271.
3 HAWAII                   90719.
4 MASSACHUSETTS            96946.
5 NEW YORK                 76112.
ggplot(statesdata, aes(x = statename, y = grand_mean, fill = statename)) + # Map x-axis as the state name and y-axis as the grand mean. Add color using fill = statename. 
  geom_col(color = "black") + # Bar plot "color" adds a border to each bar.
  coord_flip() + # Flip x and y axes so that the x-axis's labels don't overlap with each other. 
  labs(title = "Average Home Income Limits for Most Expensive States.", x = "State", y = "Average Income Limit", caption = "Source: U.S. Department of Housing and Urban Development", fill = "State Name") + # Add title, caption, rename x and y axes. 
  scale_fill_manual(values=c("NEW YORK"="#EE2C2C","CALIFORNIA"="#CDAD00","DISTRICT OF COLUMBIA"="dodgerblue","HAWAII"="#00CD00","MASSACHUSETTS"="purple")) + # Manually add colors to each bar in the graph.
  theme_bw() # Arbitrarily chosen theme.

Conclusion

In conclusion, the state with the highest mean average income limit is the District of Columbia (D.C.), among the top 4 other most expensive states to live in, as shown by the bar graph and the table above it. The process of this project was informative and insightful. There were ups and downs when trying to tackle the dataset to complete the project requirements. Considering the dataset had no N/A values and the variable names were already cleaned (no upper-case, no spaces, no special characters), the cleaning part of the project was pretty easy. I checked for N/A values using “colSums”, a function that lets you see the number of unique values in the columns. I then performed EDA by selecting the variables I needed to complete the project, using the “select” function.

The bar graph visualization represents the mean income limit for the top 5 most expensive states to live in, as of 2026. It shows that D.C. is the highest, followed by Massachusetts and Hawaii. This means that the District of Columbia has the highest average income limit for it’s residents to qualify for affordable housing programs across all Area Median Incomes (AMI) and the number of people in a household. The state with the lowest average income limit, among the top five most expensive states, is New York. This surprised me; I figured that Hawaii, which is the state with the highest living expense would have the lowest average income limit, among these 5 states, because Hawaii’s housing costs are very high. However, it’s third in rank for having the highest mean income limit across its areas.

If I had more time, I would have seen the relationship between the county, state, and the average income limit. Adding that third variable makes it more complicated, however, can give more insight. Specifically, I wish I could have seen how the mean fluctuates between counties in a state, and possibly see which state has the highest fluctuations of the average income limit. Fluctuations may exist in the first place because there are many counties in a state, and the HUD assigns an income limit based on the area or county, not just the entire state.

References

Home Income Limits | HUD USER. (n.d.). Www.huduser.gov. https://www.huduser.gov/portal/datasets/home-income-limits.html

Income Limits | HUD USER. (2026). Huduser.gov. https://www.huduser.gov/portal/datasets/il.html#documents_2025‌

World Population Review. (2026). Cost of living index by state 2026. Worldpopulationreview.com. https://worldpopulationreview.com/state-rankings/cost-of-living-index-by-state‌

rowMeans function - RDocumentation. (2016). Rdocumentation.org. https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans‌