For this project I fit a linear regression model to housing data from the USA, specifically Maryland. The dataset is from Kaggle and has 12 variables and about 2.2 million observations (2,226,382 rows). I use only the rows for Maryland to make the dataset smaller. There are 6 quantitative and 6 categorical variables, including the price of the house, the number of bedrooms and bathrooms, the city, street, and state, as well as the size of the house and the lot size in acres. I chose this topic and dataset to see how price changes with the different predictors. This is useful to know for when I want to buy a house, especially in Maryland, where I live.
# load the libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.5.2
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.5.2
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.5.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
# read in the data
realtor <- read_csv("realtor-data.zip.csv")
## Rows: 2226382 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): status, city, state, zip_code
## dbl (7): brokered_by, price, bed, bath, acre_lot, street, house_size
## date (1): prev_sold_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(realtor)
## # A tibble: 6 × 12
## brokered_by status price bed bath acre_lot street city state zip_code
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 103378 for_sale 105000 3 2 0.12 1962661 Adjun… Puer… 00601
## 2 52707 for_sale 80000 4 2 0.08 1902874 Adjun… Puer… 00601
## 3 103379 for_sale 67000 2 1 0.15 1404990 Juana… Puer… 00795
## 4 31239 for_sale 145000 4 2 0.1 1947675 Ponce Puer… 00731
## 5 34632 for_sale 65000 6 2 0.05 331151 Mayag… Puer… 00680
## 6 103378 for_sale 179000 4 3 0.46 1850806 San S… Puer… 00612
## # ℹ 2 more variables: house_size <dbl>, prev_sold_date <date>
# filter to Maryland
realtor1 <- realtor |>
  filter(state == "Maryland")
# check for n/a's
colSums(is.na(realtor1))
## brokered_by status price bed bath
## 6 0 0 3819 4846
## acre_lot street city state zip_code
## 8269 89 0 0 0
## house_size prev_sold_date
## 5031 6784
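# filter out NAs for the variables I will be using
# (each condition drops rows where the value is missing; rows where it is 0 are dropped too)
realtor1_nona <- realtor1 |>
  filter(!is.na(bed), bed != 0) |>
  filter(!is.na(bath), bath != 0) |>
  filter(!is.na(acre_lot), acre_lot != 0) |>
  filter(!is.na(house_size), house_size != 0)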
head(realtor1_nona)
## # A tibble: 6 × 12
## brokered_by status price bed bath acre_lot street city state zip_code
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 97201 for_sale 169000 6 3 1.25 1361079 McHen… Mary… 21541
## 2 80549 for_sale 349900 3 3 1.97 644200 Frien… Mary… 21531
## 3 97201 for_sale 339000 3 2 0.61 1287681 Frien… Mary… 21531
## 4 56198 for_sale 225000 2 1 17 1256627 Frien… Mary… 21531
## 5 80549 for_sale 249000 3 2 0.25 58393 Accid… Mary… 21520
## 6 80549 for_sale 329000 4 2 1.01 188644 McHen… Mary… 21541
## # ℹ 2 more variables: house_size <dbl>, prev_sold_date <date>
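When filtering out missing values, it can help to check on a small made-up vector exactly which rows a condition keeps, because conditions that look similar behave differently for `NA` and for `0`. A minimal sketch (the vector `bed_demo` is hypothetical, not taken from the dataset):

```r
# hypothetical values: one missing entry and one zero
bed_demo <- c(3, NA, 0, 2)

keep_not_na  <- !is.na(bed_demo)              # TRUE wherever a value is present
keep_nonzero <- bed_demo != 0                 # NA where the value is missing
keep_mixed   <- bed_demo != is.na(bed_demo)   # compares each value with TRUE/FALSE, i.e. 1/0

# filter() keeps a row only when the condition is TRUE, so an NA result drops the row;
# which() mimics that behavior here
bed_demo[which(keep_not_na)]    # 3 0 2  (only the NA is removed)
bed_demo[which(keep_nonzero)]   # 3 2    (the NA and the 0 are removed)
bed_demo[which(keep_mixed)]     # 3 2    (same rows as keep_nonzero for this vector)
```

In a dplyr pipeline the first form is `filter(!is.na(bed))`, and `tidyr::drop_na(bed, bath, acre_lot, house_size)` removes rows with missing values in several columns at once without touching zeros.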
# average price by beds in home
bed1 <- realtor1_nona |>
  group_by(bed) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))
# average price by baths in home
bath <- realtor1_nona |>
  group_by(bath) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))
# highcharter column chart: average price by number of beds
hchart(object = bed1, type = "column", hcaes(x = bed, y = avg_price, color = bed, size = 10)) |>
  hc_xAxis(title = list(text = "Beds")) |>
  hc_yAxis(title = list(text = "Average Price"))
# highcharter column chart: average price by number of baths
hchart(object = bath, type = "column", hcaes(x = bath, y = avg_price, color = bath, size = 10)) |>
  hc_xAxis(title = list(text = "Baths")) |>
  hc_yAxis(title = list(text = "Average Price"))
We can see that the average price differs depending on how many bedrooms and bathrooms a house has.
# Make a Correlation Plot
library(DataExplorer)
plot_correlation(realtor1_nona)
## 3 features with more than 20 categories ignored!
## city: 429 categories
## zip_code: 433 categories
## prev_sold_date: 5766 categories
## Warning in cor(x = structure(list(brokered_by = c(97201, 80549, 97201, 56198, :
## the standard deviation is zero
## Warning: Removed 48 rows containing missing values or values outside the scale range
## (`geom_text()`).
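The "standard deviation is zero" warning is what `cor()` reports when a column has no variation. After filtering to a single state, `state` is a likely candidate, although that is an assumption and not something the output confirms. One way to get a cleaner picture is to restrict the correlation to the numeric model variables. A minimal sketch on a small illustrative data frame (`homes_demo` and all of its values are hypothetical stand-ins for `realtor1_nona`):

```r
# hypothetical stand-in for realtor1_nona; values are for illustration only
homes_demo <- data.frame(
  price      = c(169000, 349900, 339000, 225000, 249000, 329000),
  bed        = c(6, 3, 3, 2, 3, 4),
  bath       = c(3, 3, 2, 1, 2, 2),
  acre_lot   = c(1.25, 1.97, 0.61, 17, 0.25, 1.01),
  house_size = c(2400, 1800, 1500, 900, 1300, 2000),
  state      = "Maryland"   # constant column, excluded below
)

# keep only numeric columns, then compute pairwise Pearson correlations
num_cols <- sapply(homes_demo, is.numeric)
cor_mat  <- cor(homes_demo[, num_cols], use = "complete.obs")
round(cor_mat, 2)
```

On the real data, the same idea would be to pass only the selected columns to the plotting function, for example `plot_correlation(select(realtor1_nona, price, bed, bath, acre_lot, house_size))`.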
# Create linear model
multiple_model <- lm(price ~ bed + bath + house_size + acre_lot , data = realtor1_nona)
# View the model summary
summary(multiple_model)
##
## Call:
## lm(formula = price ~ bed + bath + house_size + acre_lot, data = realtor1_nona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4052663 -123228 -20773 81311 11807063
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.308e+05 7.118e+03 -18.370 < 2e-16 ***
## bed -7.226e+03 2.335e+03 -3.095 0.00197 **
## bath 5.779e+04 2.315e+03 24.957 < 2e-16 ***
## house_size 2.150e+02 2.138e+00 100.540 < 2e-16 ***
## acre_lot 3.714e+01 1.268e+01 2.929 0.00340 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 329800 on 32642 degrees of freedom
## Multiple R-squared: 0.515, Adjusted R-squared: 0.515
## F-statistic: 8667 on 4 and 32642 DF, p-value: < 2.2e-16
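To make the coefficient table concrete, the estimates can be plugged into the regression equation by hand. The sketch below uses the rounded estimates printed in the summary above and a hypothetical house (3 beds, 2 baths, a `house_size` of 2000, and 0.25 acres); the house is an assumption chosen only for illustration, and the result is approximate because the coefficients are rounded.

```r
# rounded estimates copied from the model summary
coefs <- c(
  intercept  = -1.308e+05,
  bed        = -7.226e+03,
  bath       =  5.779e+04,
  house_size =  2.150e+02,
  acre_lot   =  3.714e+01
)

# hypothetical house used only for illustration
house <- c(bed = 3, bath = 2, house_size = 2000, acre_lot = 0.25)

# predicted price = intercept + sum of (coefficient * value) over the predictors
pred_price <- unname(coefs["intercept"] + sum(coefs[names(house)] * house))
pred_price   # about 393,111
```

With the fitted model object, `predict(multiple_model, newdata = data.frame(bed = 3, bath = 2, house_size = 2000, acre_lot = 0.25))` performs the same calculation with the unrounded coefficients.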
Based on the linear model, about 51.5% of the variance in price is explained by the number of bedrooms, the number of bathrooms, house size, and lot size in acres (R-squared = 0.515). All four predictors are statistically significant, since every p-value is less than 0.05.
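The R-squared and p-values can also be pulled out of the summary object directly instead of being read off the printed table. A minimal sketch on simulated data (the data frame `toy`, its coefficients, and the noise level are all made up for illustration and are unrelated to the Maryland results):

```r
set.seed(1)

# simulated data: price depends on house_size and bath plus random noise
n   <- 200
toy <- data.frame(
  house_size = runif(n, 800, 4000),
  bath       = sample(1:4, n, replace = TRUE)
)
toy$price <- 50000 + 200 * toy$house_size + 30000 * toy$bath + rnorm(n, sd = 80000)

fit <- lm(price ~ house_size + bath, data = toy)
s   <- summary(fit)

s$r.squared       # share of variance in price explained by the predictors
s$adj.r.squared   # the same, adjusted for the number of predictors

# p-values are the "Pr(>|t|)" column of the coefficient table
p_vals <- coef(s)[, "Pr(>|t|)"]
p_vals < 0.05     # TRUE where a term is significant at the 5% level
```

Applied to the model above, the same calls would be `summary(multiple_model)$r.squared` and `coef(summary(multiple_model))[, "Pr(>|t|)"]`.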
The visualizations show that average price differs depending on how many bathrooms and bedrooms a house has. A couple of things I couldn't get to work were the highcharter colors, probably because I had too many variables, and the correlation plot, which came out looking very odd, so I wish that had worked out better. The model shows that about 51% of the variance in price is explained by the predictors I chose. I also would have liked to visualize lot size (acres) and house size against price, but that would have meant making the same kind of visualization over and over again.