My Research Question:

How do rental prices in Montgomery County vary by bedroom type and change over time from 2016 to 2022?

Introduction:

This dataset comes from the Montgomery County Rental Facility Occupancy Survey, which reports rental housing prices in Montgomery County, Maryland, from 2016 to 2022. Each row gives the average rent for one bedroom type at one rental community. The key variables are Bedroom Types (categorical) and Average Rent for each year from 2016 to 2022 (quantitative). The dataset covers only Montgomery County, so the results do not describe other regions.

In this project, I analyze this rental data to explore how rent prices change by bedroom type and over time. I chose this dataset because housing costs affect many people, and I wanted to use the data tools from class to understand real-world patterns. Because the dataset contains both a categorical variable and quantitative variables, I can use visualizations and a regression model to analyze trends and relationships in rent prices.

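The variables I used in my project are Bedroom Types (categorical) and the Average Rent columns for 2016 through 2022 (quantitative). The analysis itself relies mainly on Average Rent 2021 and Average Rent 2022.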

Importing the Libraries and Dataset

# In this step, I am loading the tidyverse package.
# I am using tidyverse because it includes tools like dplyr for data manipulation
# and ggplot2 for visualization, which I will use throughout this project.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Here, I am loading my dataset into R using read_csv().
# This allows me to store the data in a dataframe called rent_data.
# I use head() to preview the first few rows of the dataset
# so I can understand the structure and variables.

rent_data <- read_csv("2022-Rental-Facility-Occupancy-Survey-Results_20260320.csv")
## Rows: 1369 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Community Name, Community Address, Bedroom Types, Average Rent 201...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(rent_data)
## # A tibble: 6 × 11
##   `Community Name`       `Community Address` `Bedroom Types` `Average Rent 2016`
##   <chr>                  <chr>               <chr>           <chr>              
## 1 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … Studio          $   1,050          
## 2 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … 1 Bedroom       $   1,441          
## 3 Boulevard Of Chevy Ch… 4733 BRADLEY BLVD … 2 Bedroom       $   2,205          
## 4 Bradford Road, 8806    8806 BRADFORD RD S… 1 Bedroom       $   1,195          
## 5 Bradford Road, 8806    8806 BRADFORD RD S… 2 Bedroom       $   1,165          
## 6 Charter House          1316 FENWICK LN SI… Studio          $   910            
## # ℹ 7 more variables: `Average Rent 2017` <chr>, `Average Rent 2018` <chr>,
## #   `Average Rent 2019` <chr>, `Average Rent 2020` <chr>,
## #   `Average Rent 2021` <chr>, `Average Rent 2022` <chr>,
## #   `Percent Change From Previous Year 2021-2022` <chr>

Dataset Cleaning

# In this step, I am cleaning the rent columns by removing dollar signs and commas.
# The original values are stored as text, so I convert them into numeric values.
# This is necessary for calculations like averages and regression.

rent_data$`Average Rent 2016` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2016`)) # I am taking the 2016 rent column from the rent_data, removing symbols, and converting it into numeric values

rent_data$`Average Rent 2017` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2017`)) # I am taking the 2017 rent column from the rent_data, removing symbols, and converting it into numeric values
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2018` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2018`)) # I am taking the 2018 rent column from the rent_data, removing symbols, and converting it into numeric values
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2019` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2019`)) # I am taking the 2019 rent column from the rent_data, removing symbols, and converting it into numeric values
## Warning: NAs introduced by coercion
rent_data$`Average Rent 2020` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2020`)) # I am taking the 2020 rent column from the rent_data, removing symbols, and converting it into numeric values

rent_data$`Average Rent 2021` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2021`)) # I am taking the 2021 rent column from the rent_data, removing symbols, and converting it into numeric values

rent_data$`Average Rent 2022` <- as.numeric(gsub("[$,]", "", rent_data$`Average Rent 2022`)) # I am taking the 2022 rent column from the rent_data, removing symbols, and converting it into numeric values
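
The seven conversions above repeat the same pattern, so they could also be written as one step. The following is only a sketch of that alternative, not the code I ran: it assumes readr's parse_number() and dplyr's across(), and it uses a small made-up table (toy_rent, including a made-up "N/A" entry) instead of the real file. It also counts the missing values in each column, which is one way to see where the "NAs introduced by coercion" warnings come from.

```r
library(dplyr)
library(readr)

# Toy data standing in for rent_data; these values are made up for illustration.
toy_rent <- tibble::tibble(
  `Bedroom Types`     = c("Studio", "1 Bedroom", "2 Bedroom"),
  `Average Rent 2021` = c("$   1,050", "$   1,441", "N/A"),
  `Average Rent 2022` = c("$   1,100", "$   1,500", "$   2,205")
)

# Convert every column whose name starts with "Average Rent" in one step.
# parse_number() ignores the "$" and the "," and returns a numeric value.
# The strings listed in `na` are treated as missing (assumed placeholder "N/A").
toy_clean <- toy_rent |>
  mutate(across(
    starts_with("Average Rent"),
    ~ parse_number(.x, na = c("", "NA", "N/A"))
  ))

# Count how many values are missing in each rent column after cleaning.
na_counts <- toy_clean |>
  summarize(across(starts_with("Average Rent"), ~ sum(is.na(.x))))

toy_clean
na_counts
```

With the real data, toy_rent would be replaced by rent_data. The placeholder strings in `na` would need to match whatever non-numeric text actually appears in the 2017 to 2019 columns, which I did not inspect here.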

Summary Statistics

# Here, I am calculating summary statistics for rent in 2022.
# These values help me understand the overall distribution of rent prices.

mean(rent_data$`Average Rent 2022`, na.rm = TRUE) # I am calculating the mean of the 2022 rent variable to find the average price
## [1] 1554.337
median(rent_data$`Average Rent 2022`, na.rm = TRUE) # I am calculating the median of the 2022 rent variable to see the middle value
## [1] 1425
sd(rent_data$`Average Rent 2022`, na.rm = TRUE) # I am calculating the standard deviation to see the spread of the 2022 rent values
## [1] 730.1706
var(rent_data$`Average Rent 2022`, na.rm = TRUE) # I am calculating the variance of the 2022 rent variable to measure variability
## [1] 533149.1
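
In the output above, the mean (1554.34) is higher than the median (1425), which is consistent with a right-skewed distribution where a few expensive units pull the average up. The variance is also just the standard deviation squared (730.17 squared is about 533149). As a sketch, the same four statistics could be collected in a single summarize() call. The column name rent_2022 and the values below are made up for illustration.

```r
library(dplyr)

# Toy rent values (made up); the NA mimics a missing survey response.
toy_rent <- tibble::tibble(rent_2022 = c(1000, 1200, 1400, 1800, NA))

# Compute all summary statistics in one table, ignoring missing values.
rent_summary <- toy_rent |>
  summarize(
    n_missing   = sum(is.na(rent_2022)),
    mean_rent   = mean(rent_2022, na.rm = TRUE),
    median_rent = median(rent_2022, na.rm = TRUE),
    sd_rent     = sd(rent_2022, na.rm = TRUE),
    var_rent    = var(rent_2022, na.rm = TRUE)
  )

rent_summary
```

Keeping the statistics in one table makes it easier to compare them side by side and to reuse them later.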

Data Analysis

# In this step, I group the data by bedroom type and calculate the average rent for each category.
# This allows me to compare how rent changes based on apartment size.

avg_rent <- rent_data |> # I create a new data frame called avg_rent by grouping the data by bedroom type and calculating the average 2022 rent for each category
  group_by(`Bedroom Types`) |>
  summarize(avg_rent = mean(`Average Rent 2022`, na.rm = TRUE))

avg_rent # I print avg_rent to show the average rent for each bedroom type
## # A tibble: 5 × 2
##   `Bedroom Types`    avg_rent
##   <chr>                 <dbl>
## 1 1 Bedroom             1327.
## 2 2 Bedroom             1644.
## 3 3 Bedroom             2105.
## 4 4 Bedrooms or more    2416.
## 5 Studio                1247.
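
Because Bedroom Types is stored as text, R sorts it alphabetically, which is why Studio appears last in the table even though it is the smallest unit. One possible fix, shown here only as a sketch, is to turn the column into a factor with Studio first using forcats::fct_relevel(). The table avg_rent_toy below re-types the rounded averages from the output above so the example runs on its own.

```r
library(dplyr)
library(forcats)

# Rounded averages copied from the avg_rent output, re-typed for a standalone example.
avg_rent_toy <- tibble::tibble(
  `Bedroom Types` = c("1 Bedroom", "2 Bedroom", "3 Bedroom",
                      "4 Bedrooms or more", "Studio"),
  avg_rent = c(1327, 1644, 2105, 2416, 1247)
)

# Move "Studio" to the front of the factor levels, then sort rows by that order.
avg_rent_ordered <- avg_rent_toy |>
  mutate(`Bedroom Types` = fct_relevel(`Bedroom Types`, "Studio")) |>
  arrange(`Bedroom Types`)

avg_rent_ordered
```

ggplot2 follows the factor level order on the x-axis, so a chart built from a table like this would run from smallest to largest unit. Manual fill colors are matched to the levels in order, so they would shift along with the categories.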

Data Visualization

# I created a bar chart to compare the average rent across different bedroom types.
# I use bedroom type on the x-axis and average rent on the y-axis.

# I use avg_rent to create a graph with Bedroom Types (categorical variable) on the x-axis and avg_rent (quantitative variable) on the y-axis
avg_rent |>
  ggplot(aes(x = `Bedroom Types`, y = avg_rent, fill = `Bedroom Types`)) +

  # I used geom_col() because I already have calculated average values,
  # so I don’t need ggplot to count anything for me.
  geom_col() +

  # I used labs() to add a title and axis labels to my graph. The title explains what the graph shows overall.
  # The x-label represents Bedroom Types (categorical variable), and the y-label represents Average Rent (quantitative variable).
  # This clearly connects the graph to my research question about how rent varies by bedroom type.
  labs(
    title = "Average Rent by Bedroom Type in Montgomery County (2022)",
    x = "Bedroom Type",
    y = "Average Rent ($)",
    caption = "Data Source: Montgomery County Rental Facility Occupancy Survey"
  ) +

  # I assign colors based on Bedroom Types so each category is easy to distinguish.
  scale_fill_manual(values = c("blue", "orange", "green", "purple", "red")) +

  # I use theme_minimal() to keep the graph clean and focused on the relationship between variables.
  theme_minimal()

# I used ggplot() to start a graph from my dataset (rent_data). I set Bedroom Types on the x-axis (categorical variable) and Average Rent 2022 on the y-axis (quantitative variable).
# I used geom_boxplot() to show the distribution of rent values for each bedroom type.
# The boxplot shows the median, spread, and any outliers in the rent data.
# I also used labs() to add a title and axis labels so the graph is easy to read.
# I used theme_minimal() to keep the graph clean and focused on the relationship between the variables.
ggplot(rent_data, aes(x = `Bedroom Types`, y = `Average Rent 2022`)) +
  geom_boxplot() +
  labs(
    title = "Spread of Rent by Bedroom Type (2022)",
    x = "Bedroom Type",
    y = "Rent ($)"
  ) +
  theme_minimal()

Regression

# In this step, I create a linear regression model.
# I am testing whether rent in 2021 can predict rent in 2022.
# I store the regression model in the variable model, where 2022 rent (the response variable) is predicted from 2021 rent (the explanatory variable)
model <- lm(`Average Rent 2022` ~ `Average Rent 2021`, data = rent_data)

# I display the model summary to analyze the relationship between the two variables
summary(model)
## 
## Call:
## lm(formula = `Average Rent 2022` ~ `Average Rent 2021`, data = rent_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1540.32   -24.61   -10.43    14.81  1065.78 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.104379   6.940790   1.312     0.19    
## `Average Rent 2021` 1.012091   0.004119 245.720   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108.7 on 1367 degrees of freedom
## Multiple R-squared:  0.9779, Adjusted R-squared:  0.9778 
## F-statistic: 6.038e+04 on 1 and 1367 DF,  p-value: < 2.2e-16
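
The numbers in a summary like the one above can also be pulled out of a model object directly. As a small sketch, coef() returns the intercept and slope, summary()$r.squared returns R-squared, and predict() applies the fitted line to new values. The data frame toy and its column names are made up for illustration and are not values from the survey.

```r
# Toy data (made up) with a nearly linear relationship between the two years.
toy <- data.frame(
  rent_2021 = c(1000, 1200, 1500, 1800, 2200),
  rent_2022 = c(1020, 1230, 1525, 1830, 2240)
)

# Fit the same kind of model: 2022 rent predicted from 2021 rent.
toy_model <- lm(rent_2022 ~ rent_2021, data = toy)

# Extract the intercept and slope, and the share of variance explained.
coefs <- coef(toy_model)
r_squared <- summary(toy_model)$r.squared

# Predict 2022 rent for a unit that cost 1500 in 2021 ...
predicted <- predict(toy_model, newdata = data.frame(rent_2021 = 1500))

# ... and confirm that it equals intercept + slope * 1500.
manual <- coefs[["(Intercept)"]] + coefs[["rent_2021"]] * 1500

coefs
r_squared
predicted
manual
```

The same calls should work on the model object fitted above. The only difference is that the new data frame passed to predict() must use the original column name, Average Rent 2021, created with check.names = FALSE so the space is kept.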

Regression Analysis

The fitted model is: predicted 2022 rent = 9.10 + 1.012 × (2021 rent). The slope of 1.012 is statistically significant (p < 2e-16) and means that each additional dollar of 2021 rent is associated with about $1.01 of 2022 rent, so rents were roughly 1 percent higher in 2022 on average. The intercept of 9.10 is not statistically distinguishable from zero (p = 0.19). For example, a unit that rented for $1,500 in 2021 has a predicted 2022 rent of about 9.10 + 1.012 × 1500 ≈ $1,527. The R-squared of 0.9779 shows that 2021 rent explains about 97.8% of the variation in 2022 rent. The residual standard error of 108.7 and the largest residuals (-1540.32 and 1065.78) show that a few units changed price much more than the model predicts.

Regression Plot

# In this chunk, I am creating a scatterplot to analyze the relationship between rent in 2021 (explanatory variable)
# and rent in 2022 (response variable). This helps me determine whether there is a linear relationship between the two years.
# I used geom_point() to plot each data point so I can observe the overall pattern and potential outliers in the data.
# I used geom_smooth(method = "lm") to add a line of best fit from a linear regression model, which helps me visualize the relationship between the variables.
ggplot(rent_data, aes(x = `Average Rent 2021`, y = `Average Rent 2022`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Conclusion

In this project, I cleaned the dataset by removing dollar signs and commas from the rent columns and converting them into numeric values. This step was necessary because the original data was stored as text, which does not work for calculations or modeling. I handled missing values by using na.rm = TRUE when calculating averages, and I checked that the numerical variables were in the correct format before running the analysis.

The bar chart shows the average rent in 2022 for each bedroom type. Rent generally increases as the number of bedrooms increases, from about $1,247 for a studio to about $2,416 for units with four or more bedrooms. This pattern makes sense because larger apartments typically cost more due to increased space and demand. From a data analyst's perspective, the visualization helps explain variability in the response variable (rent) across categorical groups (bedroom types), which directly addresses the research question. The box plot also shows differences in dispersion: some bedroom categories have greater variability and more outliers, which indicates less consistent pricing within those groups.

One limitation of this project is that my visualizations focused mainly on one year, 2022, even though my research question asks about change from 2016 to 2022. A stronger extension would examine rent trends across the full time period. I also could have explored differences between specific communities to gain deeper insights, or included more data to compare whether location affects rent more than bedroom type. These additions would have strengthened my analysis by making it more detailed and useful.
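
As a rough sketch of how the over-time extension could start, the yearly rent columns could be reshaped into a long format with tidyr::pivot_longer(), so that each row holds one bedroom type, one year, and one rent value. This is not something I ran on the survey data. The table toy_rent and its values are made up, and only two years are included to keep the example short.

```r
library(dplyr)
library(tidyr)

# Toy data (made up) in the same wide layout as the cleaned rent_data.
toy_rent <- tibble::tibble(
  `Bedroom Types`     = c("Studio", "Studio", "1 Bedroom", "1 Bedroom"),
  `Average Rent 2021` = c(1000, 1200, 1400, NA),
  `Average Rent 2022` = c(1100, 1300, 1500, 1600)
)

# Reshape: one row per unit and year, with the year taken from the column name.
rent_long <- toy_rent |>
  pivot_longer(
    cols = starts_with("Average Rent"),
    names_to = "year",
    names_prefix = "Average Rent ",
    names_transform = list(year = as.integer),
    values_to = "rent"
  )

# Average rent for each bedroom type in each year, ignoring missing values.
rent_by_year <- rent_long |>
  group_by(`Bedroom Types`, year) |>
  summarize(avg_rent = mean(rent, na.rm = TRUE), .groups = "drop")

rent_by_year
```

A table shaped like rent_by_year could then be drawn as a line chart with year on the x-axis and one line per bedroom type, which would address the "change over time" part of the research question more directly.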

References

Montgomery County Government. (2022). Rental Facility Occupancy Survey Results. Montgomery County Open Data Portal / Data.gov.
https://catalog.data.gov/dataset/2022-rental-facility-occupancy-survey-results