Do rent prices differ between Low, Medium, and High cost categories?

Introduction

My dataset covers apartment rent prices in Seattle, Washington, and has 4,500 observations and 18 variables. I focus mainly on two variables: the cost category, which ranks apartments as Low, Medium, or High, and the tract median apartment contract rent per unit, which is the median rent per unit in a census tract. I obtained this dataset from Data.gov, and it can be found at this link: https://catalog.data.gov/dataset/apartment-market-rent-prices-by-census-tract?from_hint=eyJxIjoicHJpY2VzIiwic29ydCI6InBvcHVsYXJpdHkifQ%3D%3D. I chose this dataset to better understand how apartments are classified and how prices differ between the categories.

library(tidyverse)

setwd("C:/Users/rjzavaleta/Downloads/Data 101")
df <- read_csv("RentTypology_-4394779434333664244.csv")
head(df)

## # A tibble: 6 × 18
##   OBJECTID       GEOID `Tract Label` `Tract Name`       Community Reporting Ar…¹
##      <dbl>       <dbl>         <dbl> <chr>              <chr>                   
## 1        1 53033001201         12.0  Census Tract 12.01 Northgate/Maple Leaf    
## 2        2 53033000404          4.04 Census Tract 4.04  Broadview/Bitter Lake   
## 3        3 53033004901         49.0  Census Tract 49.01 Fremont                 
## 4        4 53033003601         36.0  Census Tract 36.01 Green Lake              
## 5        5 53033002800         28    Census Tract 28    Greenwood/Phinney Ridge 
## 6        6 53033003100         31    Census Tract 31    Sunset Hill/Loyal Heigh…
## # ℹ abbreviated name: ¹`Community Reporting Area Name`
## # ℹ 13 more variables: `Community Reporting Area ID` <dbl>, Year <dbl>,
## #   `Tract Median Apartment Contract Rent per Square Foot` <dbl>,
## #   `Tract Median Apartment Contract Rent per Unit` <dbl>,
## #   `Year over Year Change in Rent per Square Foot` <dbl>,
## #   `Year over Year Change in Rent per Unit` <dbl>, `Cost Category` <chr>,
## #   `Year over Year Change in Rent Category` <chr>, …

names(df) <- tolower(names(df))
names(df) <- gsub(" ","_",names(df))
names(df) <- gsub("[(). //-]", "_", names(df))
head(df)

## # A tibble: 6 × 18
##   objectid       geoid tract_label tract_name         community_reporting_area…¹
##      <dbl>       <dbl>       <dbl> <chr>              <chr>                     
## 1        1 53033001201       12.0  Census Tract 12.01 Northgate/Maple Leaf      
## 2        2 53033000404        4.04 Census Tract 4.04  Broadview/Bitter Lake     
## 3        3 53033004901       49.0  Census Tract 49.01 Fremont                   
## 4        4 53033003601       36.0  Census Tract 36.01 Green Lake                
## 5        5 53033002800       28    Census Tract 28    Greenwood/Phinney Ridge   
## 6        6 53033003100       31    Census Tract 31    Sunset Hill/Loyal Heights 
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 13 more variables: community_reporting_area_id <dbl>, year <dbl>,
## #   tract_median_apartment_contract_rent_per_square_foot <dbl>,
## #   tract_median_apartment_contract_rent_per_unit <dbl>,
## #   year_over_year_change_in_rent_per_square_foot <dbl>,
## #   year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## #   year_over_year_change_in_rent_category <chr>, …
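
To make the three renaming steps concrete, here is a minimal sketch that applies them to a few column names. The first two names come from the dataset; the third, `Rent (per Unit)/2023`, is a hypothetical name added only to exercise the punctuation pattern.

```r
# Column names: the first two are from the dataset, the third is hypothetical
raw_names <- c("Cost Category",
               "Tract Median Apartment Contract Rent per Unit",
               "Rent (per Unit)/2023")

clean_names <- tolower(raw_names)                    # lower-case everything
clean_names <- gsub(" ", "_", clean_names)           # spaces -> underscores
clean_names <- gsub("[(). //-]", "_", clean_names)   # listed punctuation -> underscores

print(clean_names)
```

The character class in the last call also contains a space, so the second `gsub` is redundant but harmless. Each matched character becomes its own underscore, which is why names with adjacent punctuation end up with doubled underscores.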

Data Analysis

First, some values of the median rent per unit are 0, so I use sum to count them and judge whether they will affect the dataset substantially. Next, I use filter to remove all rows with a rent of 0, and then select to show only the cost category and the median rent per unit. I then use group_by and summarise to build a summary table with the mean, standard deviation, and count of the rent values in each cost category. This will help us interpret the ANOVA results. I also created another dataset called df3, using mutate to add a column holding the mean rent of each row's cost category; this is the dataset passed to aov in the statistical analysis.

sum(df$tract_median_apartment_contract_rent_per_unit == 0)

## [1] 569
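
To judge how much the zero-rent rows matter, the count can be turned into a share of the dataset. This is a small sketch using only the two numbers reported in this document; the variable names are my own.

```r
n_total <- 4500   # observations reported in the introduction
n_zero  <- 569    # rows with a median rent of 0, from sum() above

share_zero <- n_zero / n_total   # fraction of rows that will be dropped
n_kept     <- n_total - n_zero   # rows left after filtering

cat(sprintf("Zero-rent rows: %.1f%% of the data; %d rows remain\n",
            100 * share_zero, n_kept))
```

Roughly one row in eight is removed, which is a noticeable share, so it is worth keeping in mind that the results describe only tracts with a reported rent.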

df2 <- df |>
  filter(tract_median_apartment_contract_rent_per_unit > 0)
head(df2)

## # A tibble: 6 × 18
##   objectid       geoid tract_label tract_name         community_reporting_area…¹
##      <dbl>       <dbl>       <dbl> <chr>              <chr>                     
## 1        1 53033001201       12.0  Census Tract 12.01 Northgate/Maple Leaf      
## 2        2 53033000404        4.04 Census Tract 4.04  Broadview/Bitter Lake     
## 3        3 53033004901       49.0  Census Tract 49.01 Fremont                   
## 4        4 53033003601       36.0  Census Tract 36.01 Green Lake                
## 5        5 53033002800       28    Census Tract 28    Greenwood/Phinney Ridge   
## 6        6 53033003100       31    Census Tract 31    Sunset Hill/Loyal Heights 
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 13 more variables: community_reporting_area_id <dbl>, year <dbl>,
## #   tract_median_apartment_contract_rent_per_square_foot <dbl>,
## #   tract_median_apartment_contract_rent_per_unit <dbl>,
## #   year_over_year_change_in_rent_per_square_foot <dbl>,
## #   year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## #   year_over_year_change_in_rent_category <chr>, …

df2 |>
  select(tract_median_apartment_contract_rent_per_unit, cost_category)

## # A tibble: 3,931 × 2
##    tract_median_apartment_contract_rent_per_unit cost_category
##                                            <dbl> <chr>        
##  1                                          1073 Low          
##  2                                          1043 Low          
##  3                                          1059 Medium       
##  4                                          1202 Medium       
##  5                                           998 Low          
##  6                                           917 Low          
##  7                                          1501 High         
##  8                                          1701 High         
##  9                                          1663 High         
## 10                                           884 Low          
## # ℹ 3,921 more rows

summary_table <- df2 |>
  group_by(cost_category) |>
  summarise(mean_rent = mean(tract_median_apartment_contract_rent_per_unit),
            sd_rent = sd(tract_median_apartment_contract_rent_per_unit),
            count = n())
summary_table

## # A tibble: 3 × 4
##   cost_category mean_rent sd_rent count
##   <chr>             <dbl>   <dbl> <int>
## 1 High              1846.    443.  1301
## 2 Low               1180.    267.  1301
## 3 Medium            1434.    268.  1329
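
The summary table contains enough information to approximate the F statistic of a one-way ANOVA on the individual tract rents. The sketch below does that arithmetic by hand. It uses the means, standard deviations, and counts as printed above, so the result is only approximate because those values are rounded. All variable names are my own.

```r
# Values copied from the summary table above (rounded as printed)
grp   <- c("High", "Low", "Medium")
means <- c(1846, 1180, 1434)
sds   <- c(443, 267, 268)
n     <- c(1301, 1301, 1329)

k <- length(grp)   # number of groups
N <- sum(n)        # total number of observations

grand_mean <- sum(n * means) / N

ss_between <- sum(n * (means - grand_mean)^2)   # variation of group means around the grand mean
ss_within  <- sum((n - 1) * sds^2)              # pooled variation inside the groups

ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (N - k)

f_approx <- ms_between / ms_within
print(round(f_approx))
```

With these inputs the ratio comes out on the order of 1,300, meaning the spread between the category means is far larger than the spread of rents within a category.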

df3 <- df |>
  group_by(cost_category) |>
  mutate(mean_rent = mean(tract_median_apartment_contract_rent_per_unit))
head(df3)

## # A tibble: 6 × 19
## # Groups:   cost_category [2]
##   objectid       geoid tract_label tract_name         community_reporting_area…¹
##      <dbl>       <dbl>       <dbl> <chr>              <chr>                     
## 1        1 53033001201       12.0  Census Tract 12.01 Northgate/Maple Leaf      
## 2        2 53033000404        4.04 Census Tract 4.04  Broadview/Bitter Lake     
## 3        3 53033004901       49.0  Census Tract 49.01 Fremont                   
## 4        4 53033003601       36.0  Census Tract 36.01 Green Lake                
## 5        5 53033002800       28    Census Tract 28    Greenwood/Phinney Ridge   
## 6        6 53033003100       31    Census Tract 31    Sunset Hill/Loyal Heights 
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 14 more variables: community_reporting_area_id <dbl>, year <dbl>,
## #   tract_median_apartment_contract_rent_per_square_foot <dbl>,
## #   tract_median_apartment_contract_rent_per_unit <dbl>,
## #   year_over_year_change_in_rent_per_square_foot <dbl>,
## #   year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## #   year_over_year_change_in_rent_category <chr>, …

Statistical Analysis

For the statistical analysis, I use a one-way ANOVA to compare the mean rent of the Low, Medium, and High cost categories. This is the most appropriate approach because it tests whether the mean of a numeric variable, here the tract median rent per unit, differs between three or more groups. The null hypothesis is that the mean rent of the Low, Medium, and High categories is the same, while the alternative hypothesis is that the means are not all equal, so at least one category differs from the others.

Hypotheses

\(H_0\): \(\mu_{\text{High}} = \mu_{\text{Medium}} = \mu_{\text{Low}}\)

\(H_a\): not all \(\mu_i\) are equal

anova_result <- aov(mean_rent ~ cost_category, data = df3)

anova_result

## Call:
##    aov(formula = mean_rent ~ cost_category, data = df3)
## 
## Terms:
##                 cost_category Residuals
## Sum of Squares      293858646         0
## Deg. of Freedom             2      3928
## 
## Residual standard error: 1.369978e-11
## Estimated effects may be unbalanced
## 569 observations deleted due to missingness

summary(anova_result)

##                 Df    Sum Sq   Mean Sq   F value Pr(>F)    
## cost_category    2 293858646 146929323 7.829e+29 <2e-16 ***
## Residuals     3928         0         0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 569 observations deleted due to missingness

TukeyHSD(anova_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mean_rent ~ cost_category, data = df3)
## 
## $cost_category
##                  diff       lwr       upr p adj
## Low-High    -665.7048 -665.7048 -665.7048     0
## Medium-High -412.5046 -412.5046 -412.5046     0
## Medium-Low   253.2003  253.2003  253.2003     0
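
The zero residual sum of squares and the zero-width Tukey intervals above come from the choice of response variable. The sketch below shows the effect on simulated toy data, not the Seattle rents: the group sizes, means, and standard deviations are made up for illustration, and `toy`, `fit_mean`, and `fit_raw` are my own names. It compares a model whose response is a per-row copy of the group mean, like `mean_rent` in `df3`, with a model whose response is the individual rent.

```r
set.seed(101)  # toy data only; these numbers are simulated, not the Seattle rents

toy <- data.frame(
  cost_category = factor(rep(c("Low", "Medium", "High"), each = 50),
                         levels = c("Low", "Medium", "High")),
  rent = c(rnorm(50, mean = 1200, sd = 250),
           rnorm(50, mean = 1450, sd = 250),
           rnorm(50, mean = 1850, sd = 400))
)

# Per-row copy of the group mean, analogous to the mean_rent column in df3
toy$mean_rent <- ave(toy$rent, toy$cost_category)

fit_mean <- aov(mean_rent ~ cost_category, data = toy)  # response is constant within groups
fit_raw  <- aov(rent ~ cost_category, data = toy)       # response varies within groups

rss_mean <- sum(residuals(fit_mean)^2)
rss_raw  <- sum(residuals(fit_raw)^2)

cat("Residual SS, group-mean response:", signif(rss_mean, 3), "\n")
cat("Residual SS, raw-rent response:  ", signif(rss_raw, 3), "\n")

print(summary(fit_raw))
print(TukeyHSD(fit_raw))
```

In this report, the analogous call on the real data would be `aov(tract_median_apartment_contract_rent_per_unit ~ cost_category, data = df2)`. That call is a suggestion; it was not run here, so no output is shown for it.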

ggplot(df2, aes(x = cost_category,
                y = tract_median_apartment_contract_rent_per_unit,
                fill = cost_category)) +
  geom_boxplot() +
  labs(title = "Apartment Rent by Cost Category",
       x = "Cost Category",
       y = "Median Apartment Contract Rent per Unit") +
  theme_minimal()
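
Because `cost_category` is a character column, the plot orders the boxes alphabetically (High, Low, Medium). One possible way to show them from Low to High is to convert the column to a factor with explicit levels before plotting. The sketch below shows the idea on a short hypothetical vector; in the report the same `factor()` call would be applied to `df2$cost_category`.

```r
# Hypothetical example values; in the report this would be df2$cost_category
cost_category <- c("Low", "Medium", "High", "Low", "High")

# Default factor levels are sorted alphabetically
default_order <- levels(factor(cost_category))

# Explicit levels give the order Low, Medium, High
chosen_order <- levels(factor(cost_category,
                              levels = c("Low", "Medium", "High")))

print(default_order)
print(chosen_order)
```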

Conclusion and Future Directions

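As we can see, the ANOVA p-value is below 0.05 (reported as < 2e-16), so we reject the null hypothesis and conclude that mean rent differs by cost category. Note that the model was fit to mean_rent, which is constant within each category, so the residual variance is zero; the very large F value and the zero-width Tukey intervals reflect that rather than the spread of the individual tract rents. The summary table and the box plot still show a clear ordering: the High cost category had the greatest average rent (about 1,846), followed by Medium (about 1,434), while the Low category had the smallest (about 1,180). Some outliers were present, indicating that certain census tracts had unusually high rents compared to the rest of their category. Tukey's test reports all three pairwise differences as significant, with mean differences of about 666 between High and Low, 413 between High and Medium, and 253 between Medium and Low.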

Future Directions

Further research could examine the outliers in the High cost category or investigate why the mean prices differ so much between categories. A different type of model could also be used to predict what determines the cost category or the rent.

References

Dataset: https://catalog.data.gov/dataset/apartment-market-rent-prices-by-census-tract?from_hint=eyJxIjoicHJpY2VzIiwic29ydCI6InBvcHVsYXJpdHkifQ%3D%3D

Boxplot: From notes

Final Project - Data 101

Ricardo Zavaleta

2026-05-07