What variables affect the price of housing in Maryland?

Introduction

For this project I fit a linear regression model on housing in the USA, specifically in Maryland. My dataset is from Kaggle; it has 12 variables and about 2.2 million observations. I use only the state of Maryland to make the dataset smaller. There are 6 quantitative and 6 categorical variables, including the price of the house, the number of bedrooms and bathrooms, the city, street, and state, the size of the house, and the lot size in acres. I chose this topic and dataset to see how price changes with the different predictors. This is good to know for the future when I want to buy a house, especially in Maryland, where I live.

# load the libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.5.2

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.5.2

library(highcharter)

## Warning: package 'highcharter' was built under R version 4.5.2

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(RColorBrewer)

# read in the dataset (the file must be in the working directory)
realtor <- read_csv("realtor-data.zip.csv")

## Rows: 2226382 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): status, city, state, zip_code
## dbl  (7): brokered_by, price, bed, bath, acre_lot, street, house_size
## date (1): prev_sold_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(realtor)

## # A tibble: 6 × 12
##   brokered_by status    price   bed  bath acre_lot  street city   state zip_code
##         <dbl> <chr>     <dbl> <dbl> <dbl>    <dbl>   <dbl> <chr>  <chr> <chr>   
## 1      103378 for_sale 105000     3     2     0.12 1962661 Adjun… Puer… 00601   
## 2       52707 for_sale  80000     4     2     0.08 1902874 Adjun… Puer… 00601   
## 3      103379 for_sale  67000     2     1     0.15 1404990 Juana… Puer… 00795   
## 4       31239 for_sale 145000     4     2     0.1  1947675 Ponce  Puer… 00731   
## 5       34632 for_sale  65000     6     2     0.05  331151 Mayag… Puer… 00680   
## 6      103378 for_sale 179000     4     3     0.46 1850806 San S… Puer… 00612   
## # ℹ 2 more variables: house_size <dbl>, prev_sold_date <date>

# filter to Maryland
realtor1 <- realtor |>
  filter(state == "Maryland")

# check for n/a's
colSums(is.na(realtor1))

##    brokered_by         status          price            bed           bath 
##              6              0              0           3819           4846 
##       acre_lot         street           city          state       zip_code 
##           8269             89              0              0              0 
##     house_size prev_sold_date 
##           5031           6784

# filter out NAs for the variables I will be using
# (!is.na() drops only missing values and keeps legitimate zeros)
realtor1_nona <- realtor1 |>
  filter(!is.na(bed)) |>
  filter(!is.na(bath)) |>
  filter(!is.na(acre_lot)) |>
  filter(!is.na(house_size))
head(realtor1_nona)

## # A tibble: 6 × 12
##   brokered_by status    price   bed  bath acre_lot  street city   state zip_code
##         <dbl> <chr>     <dbl> <dbl> <dbl>    <dbl>   <dbl> <chr>  <chr> <chr>   
## 1       97201 for_sale 169000     6     3     1.25 1361079 McHen… Mary… 21541   
## 2       80549 for_sale 349900     3     3     1.97  644200 Frien… Mary… 21531   
## 3       97201 for_sale 339000     3     2     0.61 1287681 Frien… Mary… 21531   
## 4       56198 for_sale 225000     2     1    17    1256627 Frien… Mary… 21531   
## 5       80549 for_sale 249000     3     2     0.25   58393 Accid… Mary… 21520   
## 6       80549 for_sale 329000     4     2     1.01  188644 McHen… Mary… 21541   
## # ℹ 2 more variables: house_size <dbl>, prev_sold_date <date>
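As a minimal sketch of the missing-value step, the same rows can be dropped in one call with `tidyr::drop_na()`. The `toy` tibble below is made-up data for illustration, not part of the realtor dataset; it includes a zero to show that only true `NA` values are removed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one missing bed, one missing acre_lot, one row of zeros
toy <- tibble(
  bed      = c(3, NA, 0, 2),
  acre_lot = c(0.5, 1.2, 0, NA)
)

# Drop rows that are missing in either of the listed columns
toy_complete <- toy |>
  drop_na(bed, acre_lot)

toy_complete
```

The row with `bed = 0` and `acre_lot = 0` is kept, because zero is a real value rather than a missing one.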

# average price by beds in home
bed1 <- realtor1_nona |>
  group_by(bed) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))

# avg price by baths in home
bath <- realtor1_nona |>
  group_by(bath) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))

# highcharter column chart: average price by number of beds
hchart(object = bed1, type = "column", hcaes(x = bed, y = avg_price,  color = bed, size = 10)) |>
  hc_xAxis(title = list(text="Beds")) |>
  hc_yAxis(title = list(text="Average Price"))

# highcharter column chart: average price by number of baths
hchart(object = bath, type = "column", hcaes(x = bath, y = avg_price,  color = bath, size = 10)) |>
  hc_xAxis(title = list(text="Baths")) |>
  hc_yAxis(title = list(text="Average Price"))

We can see that the average price differs depending on how many bedrooms and bathrooms a house has.

# Make a Correlation Plot
library(DataExplorer)
plot_correlation(realtor1_nona)

## 3 features with more than 20 categories ignored!
## city: 429 categories
## zip_code: 433 categories
## prev_sold_date: 5766 categories

## Warning in cor(x = structure(list(brokered_by = c(97201, 80549, 97201, 56198, :
## the standard deviation is zero

## Warning: Removed 48 rows containing missing values or values outside the scale range
## (`geom_text()`).
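The "standard deviation is zero" warning probably comes from columns that are constant after filtering, such as `state`, which is always "Maryland". A minimal sketch of one way around it is to compute correlations on the numeric model variables only. The `toy` tibble is a stand-in for `realtor1_nona`; its `house_size` values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for realtor1_nona (house_size values are invented)
toy <- tibble(
  price      = c(169000, 349900, 339000, 225000, 249000, 329000),
  bed        = c(6, 3, 3, 2, 3, 4),
  bath       = c(3, 3, 2, 1, 2, 2),
  acre_lot   = c(1.25, 1.97, 0.61, 17, 0.25, 1.01),
  house_size = c(2400, 1800, 1600, 900, 1400, 2000),
  state      = "Maryland"
)

# Keep only the numeric variables used in the model, then correlate them
num_cor <- toy |>
  select(price, bed, bath, acre_lot, house_size) |>
  cor()

round(num_cor, 2)
```

On the real data, the same `select()` call could be placed before `plot_correlation()` so that identifier and constant columns are left out of the plot.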

# Create linear model
multiple_model <- lm(price ~ bed + bath + house_size + acre_lot , data = realtor1_nona)

# View the model summary
summary(multiple_model)

## 
## Call:
## lm(formula = price ~ bed + bath + house_size + acre_lot, data = realtor1_nona)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4052663  -123228   -20773    81311 11807063 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.308e+05  7.118e+03 -18.370  < 2e-16 ***
## bed         -7.226e+03  2.335e+03  -3.095  0.00197 ** 
## bath         5.779e+04  2.315e+03  24.957  < 2e-16 ***
## house_size   2.150e+02  2.138e+00 100.540  < 2e-16 ***
## acre_lot     3.714e+01  1.268e+01   2.929  0.00340 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 329800 on 32642 degrees of freedom
## Multiple R-squared:  0.515,  Adjusted R-squared:  0.515 
## F-statistic:  8667 on 4 and 32642 DF,  p-value: < 2.2e-16

Based on the linear model, about 51% of the variance in price can be explained by the number of bedrooms, the number of baths, the house size, and the acres of land. We can see that all predictors are statistically significant, since all p-values are less than 0.05.
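To show how the coefficients combine, here is a small worked prediction using the estimates from the summary output. The example house (3 beds, 2 baths, 2,000 for `house_size`, 0.25 acres) is hypothetical, and `house_size` is assumed to be in square feet.

```r
# Coefficients copied from the summary(multiple_model) output
coefs <- c(
  intercept  = -1.308e+05,
  bed        = -7.226e+03,
  bath       =  5.779e+04,
  house_size =  2.150e+02,
  acre_lot   =  3.714e+01
)

# Hypothetical house; the leading 1 multiplies the intercept
house <- c(intercept = 1, bed = 3, bath = 2, house_size = 2000, acre_lot = 0.25)

# Predicted price = intercept + sum of (coefficient * value)
predicted_price <- sum(coefs * house)
predicted_price  # about 393111
```

With the fitted model in memory, `predict(multiple_model, newdata = ...)` should give a similar figure, with small differences because the printed coefficients are rounded.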

Essay

We can see through the visualizations that the average price of a house varies with how many baths and bedrooms it has. A couple of things did not work the way I wanted: the highcharter colors, probably because I had too many variables, and the correlation plot, which came out very hard to read. The model shows that about 51% of the variance in price can be explained by the predictors I chose. I also wish I could have visualized acres and house size, but that would have been very repetitive, making the same visualization over and over again.
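One possible way to look at acres and house size without repeating the same chart is to reshape the data to long form and facet by predictor. This is only a sketch using ggplot2 rather than highcharter; the `toy` tibble stands in for `realtor1_nona`, and its `house_size` values are made up.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for realtor1_nona (house_size values are invented)
toy <- tibble(
  price      = c(169000, 349900, 339000, 225000, 249000, 329000),
  acre_lot   = c(1.25, 1.97, 0.61, 17, 0.25, 1.01),
  house_size = c(2400, 1800, 1600, 900, 1400, 2000)
)

# One row per (house, predictor) pair so a single plot can cover both
long <- toy |>
  pivot_longer(c(acre_lot, house_size),
               names_to = "predictor", values_to = "value")

# One scatter panel per predictor, each with its own x-axis scale
p <- ggplot(long, aes(x = value, y = price)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(x = "Predictor value", y = "Price")

p
```

Adding more predictors to the `pivot_longer()` call would add more panels without any extra plotting code.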

Sources:

Image: https://timberlakehomes.com/

Dataset: https://kaggle.com

What Defines Prices of Homes in Maryland?

Ricardo Zavaleta

2026-04-20
