In this project, I analyze rental data from Montgomery County to explore how rent prices vary by bedroom type and change over time. I chose this dataset because housing costs affect many people, and I wanted to use the data tools from class to understand real-world patterns. The dataset includes categorical variables, such as bedroom type, and quantitative variables, such as average rent for each year from 2016 to 2022. I use this data to create a visualization and a regression model that describe trends and relationships in rent prices.
Importing the Libraries and Dataset
# In this step, I am loading the tidyverse package.
# I am using tidyverse because it includes tools like dplyr for data manipulation
# and ggplot2 for visualization, which I will use throughout this project.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Here, I am loading my dataset into R using read_csv().
# This allows me to store the data in a dataframe called rent_data.
# I use head() to preview the first few rows of the dataset
# so I can understand the structure and variables.
rent_data <- read_csv("2022-Rental-Facility-Occupancy-Survey-Results_20260320.csv")
## Rows: 1369 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Community Name, Community Address, Bedroom Types, Average Rent 201...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(rent_data)
## # A tibble: 6 × 11
## `Community Name` `Community Address` `Bedroom Types` `Average Rent 2016`
## <chr> <chr> <chr> <chr>
## 1 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … Studio $ 1,050
## 2 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … 1 Bedroom $ 1,441
## 3 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … 2 Bedroom $ 2,205
## 4 Bradford Road, 8806 8806 BRADFORD RD S… 1 Bedroom $ 1,195
## 5 Bradford Road, 8806 8806 BRADFORD RD S… 2 Bedroom $ 1,165
## 6 Charter House 1316 FENWICK LN SI… Studio $ 910
## # ℹ 7 more variables: `Average Rent 2017` <chr>, `Average Rent 2018` <chr>,
## # `Average Rent 2019` <chr>, `Average Rent 2020` <chr>,
## # `Average Rent 2021` <chr>, `Average Rent 2022` <chr>,
## # `Percent Change From Previous Year 2021-2022` <chr>
Dataset Cleaning
# In this step, I am cleaning the rent columns by removing dollar signs and commas.
# The original values are stored as text, so I convert them into numeric values.
# This is necessary for calculations like averages and regression.
rent_data$`Average Rent 2016` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2016`))
rent_data$`Average Rent 2017` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2017`))
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2018` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2018`))
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2019` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2019`))
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2020` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2020`))
rent_data$`Average Rent 2021` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2021`))
rent_data$`Average Rent 2022` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2022`))
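The seven conversions above repeat the same pattern, and three of them warn that NAs were introduced. As a minimal sketch (not the method used in this report), the same cleaning could be applied to every rent column at once with across(), followed by an explicit count of the NA values per column. The toy data and the helper name clean_rent are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for rent_data; the values are made up for illustration.
toy_rent <- tibble(
  `Bedroom Types`     = c("Studio", "1 Bedroom", "2 Bedroom"),
  `Average Rent 2021` = c("$ 1,050", "N/A", "$ 2,205"),
  `Average Rent 2022` = c("$ 1,100", "$ 1,441", "$ 2,300")
)

# Hypothetical helper: remove "$" and ",", then convert to numeric.
# Warnings are suppressed here because the NAs are counted explicitly below.
clean_rent <- function(x) {
  suppressWarnings(as.numeric(gsub("[$,]", "", x)))
}

# Apply the helper to every column whose name starts with "Average Rent".
toy_clean <- toy_rent |>
  mutate(across(starts_with("Average Rent"), clean_rent))

# Count how many values in each rent column ended up as NA.
na_counts <- toy_clean |>
  summarize(across(starts_with("Average Rent"), ~ sum(is.na(.x))))

na_counts
```

Counting the NAs per column makes it clear how many entries failed to convert, which the coercion warning alone does not report.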
Summary Statistics
# Here, I am calculating summary statistics for rent in 2022.
# These values help me understand the overall distribution of rent prices.
mean(rent_data$`Average Rent 2022`, na.rm = TRUE)
## [1] 1554.337
median(rent_data$`Average Rent 2022`, na.rm = TRUE)
## [1] 1425
sd(rent_data$`Average Rent 2022`, na.rm = TRUE)
## [1] 730.1706
var(rent_data$`Average Rent 2022`, na.rm = TRUE)
## [1] 533149.1
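Two things can be read from these values: the variance is the square of the standard deviation (730.1706² ≈ 533149.1), and the mean (about 1554) is higher than the median (1425), which suggests that a smaller number of expensive units pull the average up. As a sketch on made-up numbers, the same statistics can also be collected in one summarize() call together with a count of missing values. The column name rent_2022 is an assumption for illustration.

```r
library(dplyr)

# Made-up rents for illustration; NA stands in for a value that failed to convert.
toy <- tibble(rent_2022 = c(1050, 1441, 2205, 1195, NA, 910))

# Compute all summary statistics in a single call, ignoring the NA.
rent_summary <- toy |>
  summarize(
    mean_rent   = mean(rent_2022, na.rm = TRUE),
    median_rent = median(rent_2022, na.rm = TRUE),
    sd_rent     = sd(rent_2022, na.rm = TRUE),
    var_rent    = var(rent_2022, na.rm = TRUE),
    n_missing   = sum(is.na(rent_2022))
  )

rent_summary
```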
Data Analysis
# In this step, I group the data by bedroom type and calculate the average rent for each category.
# This allows me to compare how rent changes based on apartment size.
avg_rent <- rent_data |>
  group_by(`Bedroom Types`) |>
  summarize(avg_rent = mean(`Average Rent 2022`, na.rm = TRUE))
avg_rent
## # A tibble: 5 × 2
## `Bedroom Types` avg_rent
## <chr> <dbl>
## 1 1 Bedroom 1327.
## 2 2 Bedroom 1644.
## 3 3 Bedroom 2105.
## 4 4 Bedrooms or more 2416.
## 5 Studio 1247.
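Because the bedroom types are stored as text, the table sorts them alphabetically, which places Studio last even though it is the smallest unit. One possible way to order the categories by size is to convert the column to a factor with explicit levels. The sketch below uses the rounded averages from the table above; the names avg_rent_toy and size_order are assumptions for illustration.

```r
library(dplyr)

# Values copied (rounded) from the avg_rent table above.
avg_rent_toy <- tibble(
  `Bedroom Types` = c("1 Bedroom", "2 Bedroom", "3 Bedroom",
                      "4 Bedrooms or more", "Studio"),
  avg_rent = c(1327, 1644, 2105, 2416, 1247)
)

# Assumed ordering: smallest unit first.
size_order <- c("Studio", "1 Bedroom", "2 Bedroom", "3 Bedroom",
                "4 Bedrooms or more")

# Convert to a factor so sorting follows size_order instead of the alphabet.
avg_rent_ordered <- avg_rent_toy |>
  mutate(`Bedroom Types` = factor(`Bedroom Types`, levels = size_order)) |>
  arrange(`Bedroom Types`)

avg_rent_ordered
```

With this ordering, the averages rise steadily from Studio to 4 Bedrooms or more. A factor column would also change the order of the bars and the colors in a ggplot2 chart, since both follow the factor levels.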
Data Visualization
# I create a bar chart to compare the average rent across different bedroom types.
# I use bedroom type on the x-axis and average rent on the y-axis.
avg_rent |>
  ggplot(aes(x = `Bedroom Types`, y = avg_rent, fill = `Bedroom Types`)) +
  # I use geom_col() because I already have calculated average values,
  # so I don’t need ggplot to count anything for me.
  geom_col() +
  # I added clear labels so the graph is easy to understand for anyone reading it.
  labs(
    title = "Average Rent by Bedroom Type in Montgomery County (2022)",
    x = "Bedroom Type",
    y = "Average Rent ($)",
    caption = "Data Source: Montgomery County Rental Facility Occupancy Survey"
  ) +
  # I manually choose colors so each bedroom type is visually distinct
  # instead of using the default ggplot colors.
  scale_fill_manual(values = c("blue", "orange", "green", "purple", "red")) +
  # I apply a clean theme to make the graph look more professional and easier to read.
  theme_minimal()
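The chart above covers 2022 only. As a rough sketch of how the yearly rent columns could be reshaped to look at change over time, pivot_longer() from tidyr can stack them into one year column and one rent column. The toy data and the names rent_long and rent_by_year are assumptions for illustration, not part of the analysis in this report.

```r
library(dplyr)
library(tidyr)

# Made-up rents for illustration, already numeric as after the cleaning step.
toy_rent <- tibble(
  `Bedroom Types`     = c("Studio", "1 Bedroom", "Studio", "1 Bedroom"),
  `Average Rent 2021` = c(1000, 1400, 1100, 1500),
  `Average Rent 2022` = c(1050, 1450, 1150, 1550)
)

# Stack the yearly columns into long format: one row per unit type, record, and year.
rent_long <- toy_rent |>
  pivot_longer(
    cols = starts_with("Average Rent"),
    names_to = "year",
    names_prefix = "Average Rent ",
    values_to = "rent"
  ) |>
  mutate(year = as.integer(year))

# Average rent for each bedroom type in each year.
rent_by_year <- rent_long |>
  group_by(`Bedroom Types`, year) |>
  summarize(avg_rent = mean(rent, na.rm = TRUE), .groups = "drop")

rent_by_year
```

A table in this shape could then be passed to ggplot() with year on the x-axis and one line per bedroom type.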
Regression
# In this step, I create a linear regression model.
# I am testing whether rent in 2021 can predict rent in 2022.
model <- lm(`Average Rent 2022` ~ `Average Rent 2021`, data = rent_data)
# I display the summary to analyze the model results,
# including p-values and the adjusted R-squared.
summary(model)
##
## Call:
## lm(formula = `Average Rent 2022` ~ `Average Rent 2021`, data = rent_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1540.32 -24.61 -10.43 14.81 1065.78
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.104379 6.940790 1.312 0.19
## `Average Rent 2021` 1.012091 0.004119 245.720 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108.7 on 1367 degrees of freedom
## Multiple R-squared: 0.9779, Adjusted R-squared: 0.9778
## F-statistic: 6.038e+04 on 1 and 1367 DF, p-value: < 2.2e-16
The fitted equation is: predicted 2022 rent = 9.10 + 1.012 × (2021 rent). The slope shows that for every $1 increase in rent in 2021, predicted rent in 2022 increases by about $1.01. The p-value for the predictor variable is less than 2 × 10⁻¹⁶, which means the relationship is statistically significant, while the intercept (p = 0.19) is not significantly different from zero. This indicates that rent in 2021 is a strong predictor of rent in 2022. The adjusted R² value is 0.9778, meaning that about 97.78% of the variation in 2022 rent is explained by rent in 2021. This shows that the model fits the data very well.
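As a quick worked example of the slope interpretation, the coefficients reported in the summary output can be plugged in by hand. The 2021 rent of $1,500 is a hypothetical value chosen for illustration.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 9.104379
slope     <- 1.012091

# Hypothetical 2021 rent, chosen only for illustration.
rent_2021 <- 1500

# Predicted 2022 rent from the fitted line: intercept + slope * rent_2021.
predicted_2022 <- intercept + slope * rent_2021
round(predicted_2022, 2)
```

Under these coefficients, a unit renting for $1,500 in 2021 has a predicted 2022 rent of about $1,527. With the fitted model object available, predict() with a newdata data frame containing an `Average Rent 2021` column should give the same value.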
In this project, I cleaned the dataset by removing dollar signs and commas from the rent columns and converting them into numeric values. This step was necessary because the original data was stored as text, which would not work for calculations or modeling. The conversion introduced NA values in the 2017, 2018, and 2019 columns, where some entries could not be read as numbers. I handled missing values using na.rm = TRUE when calculating summary statistics and averages.
The visualization shows the average rent in 2022 for each bedroom type. Average rent increases with the number of bedrooms, from about $1,247 for studios to about $2,416 for units with four or more bedrooms. This pattern makes sense because larger apartments offer more space and typically cost more.
One limitation of this project is that the visualization covers only 2022, even though the dataset includes average rents from 2016 to 2022. I also could have compared specific communities to see how rent differs from one location to another.