My dataset covers apartment rent prices in Seattle, Washington. It has 4,500 observations and 18 variables. I will focus mainly on two of them: cost category, which ranks each observation as Low, Medium, or High, and tract median apartment contract rent per unit, which is the median rent per apartment unit in a census tract. I got this dataset from Data.gov and it can be found with this link https://catalog.data.gov/dataset/apartment-market-rent-prices-by-census-tract?from_hint=eyJxIjoicHJpY2VzIiwic29ydCI6InBvcHVsYXJpdHkifQ%3D%3D. I chose this dataset to better understand how apartments are classified and how their prices differ.
library(tidyverse)
setwd("C:/Users/rjzavaleta/Downloads/Data 101")
df <- read_csv("RentTypology_-4394779434333664244.csv")
head(df)
## # A tibble: 6 × 18
## OBJECTID GEOID `Tract Label` `Tract Name` Community Reporting Ar…¹
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 53033001201 12.0 Census Tract 12.01 Northgate/Maple Leaf
## 2 2 53033000404 4.04 Census Tract 4.04 Broadview/Bitter Lake
## 3 3 53033004901 49.0 Census Tract 49.01 Fremont
## 4 4 53033003601 36.0 Census Tract 36.01 Green Lake
## 5 5 53033002800 28 Census Tract 28 Greenwood/Phinney Ridge
## 6 6 53033003100 31 Census Tract 31 Sunset Hill/Loyal Heigh…
## # ℹ abbreviated name: ¹`Community Reporting Area Name`
## # ℹ 13 more variables: `Community Reporting Area ID` <dbl>, Year <dbl>,
## # `Tract Median Apartment Contract Rent per Square Foot` <dbl>,
## # `Tract Median Apartment Contract Rent per Unit` <dbl>,
## # `Year over Year Change in Rent per Square Foot` <dbl>,
## # `Year over Year Change in Rent per Unit` <dbl>, `Cost Category` <chr>,
## # `Year over Year Change in Rent Category` <chr>, …
names(df) <- tolower(names(df))
names(df) <- gsub(" ","_",names(df))
names(df) <- gsub("[(). //-]", "_", names(df))
head(df)
## # A tibble: 6 × 18
## objectid geoid tract_label tract_name community_reporting_area…¹
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 53033001201 12.0 Census Tract 12.01 Northgate/Maple Leaf
## 2 2 53033000404 4.04 Census Tract 4.04 Broadview/Bitter Lake
## 3 3 53033004901 49.0 Census Tract 49.01 Fremont
## 4 4 53033003601 36.0 Census Tract 36.01 Green Lake
## 5 5 53033002800 28 Census Tract 28 Greenwood/Phinney Ridge
## 6 6 53033003100 31 Census Tract 31 Sunset Hill/Loyal Heights
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 13 more variables: community_reporting_area_id <dbl>, year <dbl>,
## # tract_median_apartment_contract_rent_per_square_foot <dbl>,
## # tract_median_apartment_contract_rent_per_unit <dbl>,
## # year_over_year_change_in_rent_per_square_foot <dbl>,
## # year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## # year_over_year_change_in_rent_category <chr>, …
First, some rows have a median rent per unit of 0. I will use sum to count them and see whether they would affect the dataset much. Next I will use filter to remove the 0s and select to show only the cost category and the median rent per unit. Then I will use group_by and summarise to build a summary table with the mean, standard deviation, and count of the median rent in each cost category. This will help us interpret the ANOVA results. I also made another dataset called df3 from the original df, using mutate to add a column holding the mean rent of each cost category. df3 is the dataset passed to the ANOVA later in the project, while the box plot uses df2.
sum(df$tract_median_apartment_contract_rent_per_unit == 0)
## [1] 569
df2 <- df |>
  filter(tract_median_apartment_contract_rent_per_unit > 0)
head(df2)
## # A tibble: 6 × 18
## objectid geoid tract_label tract_name community_reporting_area…¹
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 53033001201 12.0 Census Tract 12.01 Northgate/Maple Leaf
## 2 2 53033000404 4.04 Census Tract 4.04 Broadview/Bitter Lake
## 3 3 53033004901 49.0 Census Tract 49.01 Fremont
## 4 4 53033003601 36.0 Census Tract 36.01 Green Lake
## 5 5 53033002800 28 Census Tract 28 Greenwood/Phinney Ridge
## 6 6 53033003100 31 Census Tract 31 Sunset Hill/Loyal Heights
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 13 more variables: community_reporting_area_id <dbl>, year <dbl>,
## # tract_median_apartment_contract_rent_per_square_foot <dbl>,
## # tract_median_apartment_contract_rent_per_unit <dbl>,
## # year_over_year_change_in_rent_per_square_foot <dbl>,
## # year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## # year_over_year_change_in_rent_category <chr>, …
df2 |>
select(tract_median_apartment_contract_rent_per_unit, cost_category)
## # A tibble: 3,931 × 2
## tract_median_apartment_contract_rent_per_unit cost_category
## <dbl> <chr>
## 1 1073 Low
## 2 1043 Low
## 3 1059 Medium
## 4 1202 Medium
## 5 998 Low
## 6 917 Low
## 7 1501 High
## 8 1701 High
## 9 1663 High
## 10 884 Low
## # ℹ 3,921 more rows
summary_table <- df2 |>
group_by(cost_category) |>
summarise(mean_rent = mean(tract_median_apartment_contract_rent_per_unit),
sd_rent = sd(tract_median_apartment_contract_rent_per_unit),
count = n())
summary_table
## # A tibble: 3 × 4
## cost_category mean_rent sd_rent count
## <chr> <dbl> <dbl> <int>
## 1 High 1846. 443. 1301
## 2 Low 1180. 267. 1301
## 3 Medium 1434. 268. 1329
df3 <- df |>
group_by(cost_category) |>
mutate(mean_rent = mean(tract_median_apartment_contract_rent_per_unit))
head(df3)
## # A tibble: 6 × 19
## # Groups: cost_category [2]
## objectid geoid tract_label tract_name community_reporting_area…¹
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 53033001201 12.0 Census Tract 12.01 Northgate/Maple Leaf
## 2 2 53033000404 4.04 Census Tract 4.04 Broadview/Bitter Lake
## 3 3 53033004901 49.0 Census Tract 49.01 Fremont
## 4 4 53033003601 36.0 Census Tract 36.01 Green Lake
## 5 5 53033002800 28 Census Tract 28 Greenwood/Phinney Ridge
## 6 6 53033003100 31 Census Tract 31 Sunset Hill/Loyal Heights
## # ℹ abbreviated name: ¹community_reporting_area_name
## # ℹ 14 more variables: community_reporting_area_id <dbl>, year <dbl>,
## # tract_median_apartment_contract_rent_per_square_foot <dbl>,
## # tract_median_apartment_contract_rent_per_unit <dbl>,
## # year_over_year_change_in_rent_per_square_foot <dbl>,
## # year_over_year_change_in_rent_per_unit <dbl>, cost_category <chr>,
## # year_over_year_change_in_rent_category <chr>, …
For the statistical analysis I will use a one-way ANOVA to compare the mean rent across the Low, Medium, and High cost categories. This is an appropriate approach because it tests whether the mean tract median rent differs between more than two groups. The null hypothesis is that the mean rent of the Low, Medium, and High categories is the same. The alternative hypothesis is that the means are not all equal, so at least one category differs from the others.
Hypotheses
\(H_0\): \(\mu_{High} = \mu_{Medium} = \mu_{Low}\)
\(H_a\): not all \(\mu_i\) are equal
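As a reminder of what the test computes, here is the usual one-way ANOVA F statistic in standard textbook notation (the symbols \(k\), \(n_j\), \(N\), and \(y_{ij}\) are my own labels, not names from the dataset). With \(k\) groups, \(n_j\) observations in group \(j\), and \(N\) observations in total, it compares the variation between the group means to the variation within the groups:

\[
F = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}, \qquad
SS_{between} = \sum_{j=1}^{k} n_j\,(\bar{y}_j - \bar{y})^2, \qquad
SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2
\]

Here \(k = 3\) and, after the rows with a rent of 0 are removed, \(N = 1301 + 1301 + 1329 = 3931\), so the degrees of freedom are \(k - 1 = 2\) and \(N - k = 3928\). A large \(F\) means the group means are far apart relative to the spread inside each group.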
anova_result <- aov(mean_rent ~ cost_category, data = df3)
anova_result
## Call:
## aov(formula = mean_rent ~ cost_category, data = df3)
##
## Terms:
## cost_category Residuals
## Sum of Squares 293858646 0
## Deg. of Freedom 2 3928
##
## Residual standard error: 1.369978e-11
## Estimated effects may be unbalanced
## 569 observations deleted due to missingness
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## cost_category 2 293858646 146929323 7.829e+29 <2e-16 ***
## Residuals 3928 0 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 569 observations deleted due to missingness
TukeyHSD(anova_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean_rent ~ cost_category, data = df3)
##
## $cost_category
## diff lwr upr p adj
## Low-High -665.7048 -665.7048 -665.7048 0
## Medium-High -412.5046 -412.5046 -412.5046 0
## Medium-Low 253.2003 253.2003 253.2003 0
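The residual sum of squares of 0 and the F value of 7.829e+29 above appear to come from using mean_rent as the response: it is constant within each cost category, so there is no within-group variation left for the residuals. The sketch below illustrates this on a small made-up data frame. The names toy and rent and all of the numbers in it are invented for illustration and are not the Seattle data.

```r
# Toy data, invented purely for illustration (not the Seattle dataset).
set.seed(1)
toy <- data.frame(
  cost_category = rep(c("Low", "Medium", "High"), each = 20),
  rent = c(rnorm(20, mean = 1200, sd = 250),
           rnorm(20, mean = 1450, sd = 250),
           rnorm(20, mean = 1850, sd = 400))
)

# Group mean repeated on every row, analogous to mean_rent in df3.
# ave() applies mean() within each level of cost_category by default.
toy$mean_rent <- ave(toy$rent, toy$cost_category)

# Response = group mean: constant within each category, so the
# residuals are zero up to floating-point error.
fit_mean <- aov(mean_rent ~ cost_category, data = toy)

# Response = raw per-row rent: the within-category spread now
# enters the residuals, which gives a meaningful F test.
fit_raw <- aov(rent ~ cost_category, data = toy)

rss_mean <- sum(residuals(fit_mean)^2)  # essentially 0
rss_raw  <- sum(residuals(fit_raw)^2)   # clearly positive

summary(fit_raw)
TukeyHSD(fit_raw)
```

Both fits estimate the same group means, but only fit_raw has a usable residual term. On the real data the analogous call would presumably be aov(tract_median_apartment_contract_rent_per_unit ~ cost_category, data = df2). Its output is not shown here because it was not run as part of this report.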
ggplot(df2, aes(x = cost_category, y = tract_median_apartment_contract_rent_per_unit, fill = cost_category)) + geom_boxplot() +
labs(title = "Apartment Rent by Cost Category",
x = "Cost Category",
y = "Median Apartment Contract Rent per Unit") + theme_minimal()
The p-value is less than 0.05 (reported as <2e-16), so we reject the null hypothesis: the mean rent differs between the cost categories. One caveat is that the model was fit on mean_rent, which is constant within each category, so the residual sum of squares is 0. This is why the F value is so large and why each Tukey interval collapses to a single number. The summary table and the box plot, which use the tract-level rents in df2, show that the High cost category had the highest rents (mean of about 1846) and the Low category the lowest (mean of about 1180). Some outliers were present, indicating that certain census tracts had unusually high rent values compared to the rest of their category. Tukey's test reports all three pairwise differences as significant: Low is about 666 below High, Medium is about 413 below High, and Medium is about 253 above Low.
For further research, we could look into the outliers in the High cost category, or investigate why the mean rents differ so much between categories. A different type of model, for example a regression model, could be used to predict what changes the cost category or what changes the price.
Boxplot: From notes