Introduction

My research question is: Which home characteristics most strongly predict a house’s sale price? The data set used in this project is titled “Sale prices of houses in Duke Forest, Durham, NC”, and the data was collected from Zillow.com. It was retrieved from OpenIntro and describes houses for sale in the Duke Forest neighborhood of Durham, North Carolina in November 2020. There are 13 variables in total, including the cooling system, heating system, parking type, number of bathrooms, number of bedrooms, lot size, and more. My analysis will use all of the numerical variables: number of bedrooms, number of bathrooms, lot size, area, year built, and price. I will also include one categorical variable, parking type, in my visualizations. The results can help homeowners who plan to renovate before selling, as well as real estate agents who can take them into account when advertising homes to clients.

Loading Libraries and Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(ggfortify)

setwd("C:/Users/tonge/Desktop/Data 110")
houses <- read_csv("duke_forest.csv")

## Rows: 98 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): address, type, heating, cooling, parking, hoa, url
## dbl (6): price, bed, bath, area, year_built, lot
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning

str(houses)

## spc_tbl_ [98 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ address   : chr [1:98] "1 Learned Pl, Durham, NC 27705" "1616 Pinecrest Rd, Durham, NC 27705" "2418 Wrightwood Ave, Durham, NC 27705" "2527 Sevier St, Durham, NC 27705" ...
##  $ price     : num [1:98] 1520000 1030000 420000 680000 428500 ...
##  $ bed       : num [1:98] 3 5 2 4 4 3 5 4 4 3 ...
##  $ bath      : num [1:98] 4 4 3 3 3 3 5 3 5 2 ...
##  $ area      : num [1:98] 6040 4475 1745 2091 1772 ...
##  $ type      : chr [1:98] "Single Family" "Single Family" "Single Family" "Single Family" ...
##  $ year_built: num [1:98] 1972 1969 1959 1961 2020 ...
##  $ heating   : chr [1:98] "Other, Gas" "Forced air, Gas" "Forced air, Gas" "Heat pump, Other, Electric, Gas" ...
##  $ cooling   : chr [1:98] "central" "central" "central" "central" ...
##  $ parking   : chr [1:98] "0 spaces" "Carport, Covered" "Garage - Attached, Covered" "Carport, Covered" ...
##  $ lot       : num [1:98] 0.97 1.38 0.51 0.84 0.16 0.45 0.94 0.79 0.53 0.73 ...
##  $ hoa       : chr [1:98] NA NA NA NA ...
##  $ url       : chr [1:98] "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-27705/49981897_zpid/" "https://www.zillow.com/homedetails/1616-Pinecrest-Rd-Durham-NC-27705/49969247_zpid/" "https://www.zillow.com/homedetails/2418-Wrightwood-Ave-Durham-NC-27705/49972133_zpid/" "https://www.zillow.com/homedetails/2527-Sevier-St-Durham-NC-27705/49967280_zpid/" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   address = col_character(),
##   ..   price = col_double(),
##   ..   bed = col_double(),
##   ..   bath = col_double(),
##   ..   area = col_double(),
##   ..   type = col_character(),
##   ..   year_built = col_double(),
##   ..   heating = col_character(),
##   ..   cooling = col_character(),
##   ..   parking = col_character(),
##   ..   lot = col_double(),
##   ..   hoa = col_character(),
##   ..   url = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Checking for NAs

colSums(is.na(houses))

##    address      price        bed       bath       area       type year_built 
##          0          0          0          0          0          0          0 
##    heating    cooling    parking        lot        hoa        url 
##          0          0          0          1         97          0

Removing NAs in lot - since I will be using it for my visualizations

houses2 <- filter(houses, !is.na(lot))

Visualizations showing different variables’ relationship with a house’s price.

Looking at all the parking types

unique(houses2$parking)

##  [1] "0 spaces"                                                 
##  [2] "Carport, Covered"                                         
##  [3] "Garage - Attached, Covered"                               
##  [4] "Off-street, Covered"                                      
##  [5] "Carport, Garage - Attached, Covered"                      
##  [6] "Covered"                                                  
##  [7] "Garage, Garage - Detached, Covered"                       
##  [8] "Garage - Attached, On-street, Covered"                    
##  [9] "Garage, Garage - Attached, Covered"                       
## [10] "Off-street"                                               
## [11] "Garage, Garage - Detached, Off-street"                    
## [12] "Garage - Attached"                                        
## [13] "Garage, Carport, Covered"                                 
## [14] "Garage"                                                   
## [15] "Garage - Detached, Off-street, Covered"                   
## [16] "Garage - Attached, Garage - Detached, Covered"            
## [17] "Garage, Garage - Detached, Off-street, On-street, Covered"
## [18] "Garage, Garage - Detached, Off-street, Covered"           
## [19] "Garage - Attached, Off-street, Covered"

Filtering to only include “Garage - Attached”, “Carport, Covered”, and “0 spaces” to see if parking type has an impact on house price

houses_filtered <- houses2 |> 
  filter(parking %in% c("Garage - Attached", "Carport, Covered", "0 spaces"))
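
Note that %in% matches whole strings, so this filter keeps only rows whose parking value is exactly one of the three labels. A value such as "Garage - Attached, Covered" is not kept. Below is a minimal sketch of the difference, using a few of the parking labels listed above. The parking and keep vectors are illustrative and not part of the original analysis.

```r
# A few of the parking labels from the unique() output above
parking <- c("0 spaces", "Carport, Covered",
             "Garage - Attached, Covered", "Garage - Attached")
keep <- c("Garage - Attached", "Carport, Covered", "0 spaces")

# Exact matching: "Garage - Attached, Covered" is dropped
exact_match <- parking %in% keep

# Substring matching: any label that mentions an attached garage
has_attached_garage <- grepl("Garage - Attached", parking, fixed = TRUE)

data.frame(parking, exact_match, has_attached_garage)
```

Exact matching keeps the three groups cleanly separated, at the cost of leaving out houses whose listing combines several parking types.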

Scatter plot 1

Scatter plot of 3 different parking types - Covered Carport, Attached Garage, and those with no spaces

There is a lot of variation among houses with 0 parking spaces, since they show both high and low prices. We can also see it is the most frequent parking type. Homes with a covered carport tend to stay around the middle of the price range. Lastly, one of the highest prices is seen in a home with an attached garage. Overall, parking type may have an impact on house price, but it may not be the strongest predictor. There may be other factors affecting it.

# The x-axis is lot size, so the title refers to lot size rather than area
scatter1 <- ggplot(data = houses_filtered,
                   aes(x = lot, y = price, color = parking)) +
  geom_point(aes(shape = parking), alpha = 0.8) +
  scale_color_manual(values = c("lightskyblue", "plum1", "palegreen")) +
  labs(title = "Lot Size vs. Price of House by Parking Type",
       x = "Lot (in acres)",
       y = "Price of House (USD)",
       caption = "OpenIntro, Data retrieved by Zillow.com")
scatter1

Scatterplot 2

The scatterplot shows a positive relationship between a house’s price and its lot size in acres: as lot size goes up, price tends to go up as well.

ggplot(houses2, aes(price, lot)) +
  geom_point(color = "green", alpha = .3) +
  geom_smooth(se = FALSE) + # remove the error band
  theme_bw() +
  labs(x = "Price (USD)",
       y = "Lot size (acres)",
       caption = "OpenIntro, Data retrieved by Zillow.com",
       title = "Price and Lot Size of Homes for Sale")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
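
One way to put a number on this relationship is the Pearson correlation between lot size and price. The sketch below is only illustrative: it uses the first five rows shown in the str() output above rather than the full data, so the value it returns is not the result for all 97 houses. With the real data, the equivalent call would be cor(houses2$lot, houses2$price).

```r
# First five rows of price and lot, copied from the str() output above
# (toy subset for illustration only)
toy <- data.frame(
  price = c(1520000, 1030000, 420000, 680000, 428500),
  lot   = c(0.97, 1.38, 0.51, 0.84, 0.16)
)

# Pearson correlation: values near 1 indicate a strong positive
# linear relationship, values near 0 indicate little linear relationship
r <- cor(toy$lot, toy$price)
r
```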

Boxplot

Creating a new categorical column that groups houses by number of bedrooms

The median is 4, so the groups will be Under 4 Bedrooms, 4 Bedrooms, and Over 4 Bedrooms.

summary(houses2$bed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.742   4.000   6.000

house3 <- houses2 |>
  mutate(bedrooms = ifelse(bed < 4, "Under 4 Bedrooms",
      ifelse(bed == 4, "4 Bedrooms", "Over 4 Bedrooms")))
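
An alternative sketch for the same grouping uses base R's cut(), which returns a factor whose levels stay in size order (Under, 4, Over) instead of alphabetical order. This is a possible refinement, not the approach used above. The bed vector holds the first ten values from the str() output.

```r
# First ten bedroom counts from the str() output above (toy subset)
bed <- c(3, 5, 2, 4, 4, 3, 5, 4, 4, 3)

# Intervals are (-Inf, 3], (3, 4], (4, Inf], so 4 lands in the middle group
bedrooms <- cut(bed,
                breaks = c(-Inf, 3, 4, Inf),
                labels = c("Under 4 Bedrooms", "4 Bedrooms", "Over 4 Bedrooms"))

# Count of houses in each group, in level order
table(bedrooms)
```

Because the result is an ordered set of factor levels, a plot built on it would show the groups from fewest to most bedrooms.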

Boxplot of the Distribution of House Price based on Number of Bedrooms

This shows that houses with more bedrooms tend to be more expensive. The highest median price is seen in houses with over 4 bedrooms, while the lowest median price is seen in houses with under 4 bedrooms.

# Mapping bedrooms to x labels each box on the axis
boxpl <- house3 |>
  ggplot(aes(x = bedrooms, y = price, fill = bedrooms)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#C6E2FF", "#87CEFF", "#E0FFFF")) +
  labs(title = "Distribution of House Price by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Price of Houses (USD)",
       caption = "OpenIntro, Data retrieved by Zillow.com")
boxpl

Multiple Linear Regression

Creating the Fit

fit1 <- lm(price ~ bed + bath + area + lot + year_built, data = houses2)
summary(fit1)

## 
## Call:
## lm(formula = price ~ bed + bath + area + lot + year_built, data = houses2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -760853  -71089   -4766   68171  446896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.307e+06  1.846e+06  -1.791  0.07662 .  
## bed         -1.276e+04  2.686e+04  -0.475  0.63599    
## bath         5.178e+04  2.524e+04   2.052  0.04305 *  
## area         9.407e+01  2.377e+01   3.957  0.00015 ***
## lot          3.507e+05  7.865e+04   4.459 2.35e-05 ***
## year_built   1.675e+03  9.435e+02   1.775  0.07919 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 150400 on 91 degrees of freedom
## Multiple R-squared:  0.5825, Adjusted R-squared:  0.5595 
## F-statistic: 25.39 on 5 and 91 DF,  p-value: 6.034e-16

The lowest p-value (2.35e-05) is seen for the lot variable, which suggests that a house’s lot size in acres is the strongest predictor of its price in this model. Other statistically significant p-values are observed for area (0.00015) and bath (0.04305), while bed (0.63599) and year_built (0.07919) are not significant at the 0.05 level. The adjusted R-squared is around 56%, which is moderate: the model’s predictors (bed, bath, area, lot, and year built) explain about 56% of the variability in house price.

Equation: price = -3.307e+06 - 1.276e+04(bed) + 5.178e+04(bath) + 9.407e+01(area) + 3.507e+05(lot) + 1.675e+03(year_built)
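
As a worked example of the equation above, the sketch below plugs in the first house from the str() output (3 bedrooms, 4 bathrooms, an area of 6040, a 0.97 acre lot, built in 1972). The coefficients are the rounded values printed by summary(fit1), so the result is approximate. With the fitted model available, predict() on fit1 would give the unrounded value. The names coefs and house are illustrative.

```r
# Rounded coefficients copied from the summary(fit1) output above
coefs <- c(intercept = -3.307e+06, bed = -1.276e+04, bath = 5.178e+04,
           area = 9.407e+01, lot = 3.507e+05, year_built = 1.675e+03)

# First house in the data, values taken from the str() output;
# the intercept term is multiplied by 1
house <- c(intercept = 1, bed = 3, bath = 4,
           area = 6040, lot = 0.97, year_built = 1972)

# Predicted price = sum of (coefficient * value) over all terms
predicted <- sum(coefs * house[names(coefs)])
predicted  # roughly 1.07 million USD, versus an actual price of 1,520,000
```

The gap between the predicted and actual price for this house is consistent with it being one of the observations flagged in the diagnostic plots.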

autoplot(fit1, 1:4, nrow=2, ncol=2)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Residual plot - The blue line is relatively horizontal, suggesting a linear model is appropriate.
QQ plot - The points fall roughly along the diagonal line, suggesting the residuals are approximately normally distributed. There may be outliers at observations 4, 1, and 7.
Scale-Location - The line trends upward, meaning that as fitted values for house price go up, so does the residual spread. This suggests the variance of the residuals is not constant.
Cook’s Distance - Observations 1, 36, and 64 have the largest Cook’s distances, so they are the most influential points and can impact the model.
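
Influential observations can also be identified numerically with cooks.distance(). The sketch below is an illustration on simulated data (not the Duke Forest data), and the 4/n cutoff is a common rule of thumb rather than something used in this analysis. flag_influential is a hypothetical helper name. On the real model, the call would be flag_influential(fit1).

```r
# Hypothetical helper: return the row numbers whose Cook's distance
# exceeds a cutoff (default: the 4/n rule of thumb)
flag_influential <- function(fit, cutoff = 4 / nobs(fit)) {
  d <- cooks.distance(fit)
  unname(which(d > cutoff))
}

# Simulated data with one deliberately shifted point (row 30)
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30)
y[30] <- y[30] + 50
toy_fit <- lm(y ~ x)

# Row 30 sits at the edge of x and far from the line, so it is flagged
flagged <- flag_influential(toy_fit)
flagged
```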

Conclusion

First, I cleaned the data set by checking for NAs and removing the row with a missing value in lot, a variable I needed for my visualizations. I did this using filter(!is.na(lot)). Overall, my analysis suggests that lot size in acres is the strongest predictor of a home’s price, followed by area and number of bathrooms. This was also corroborated by my scatterplot of house price and lot size. Therefore, the larger the lot, the area, and the number of bathrooms, the more expensive a home may be. The boxplot showed that houses with more bedrooms tend to have higher prices, but number of bedrooms was not statistically significant in the regression once the other variables were included. Based on my visualization, parking type may be a contributor as well. However, out of all the categorical variables, I am unsure which one is the best predictor. Therefore, if I were to do any further analysis, I would add the categorical variables (heating, cooling, parking) as predictors in the linear regression to see which one best predicts house price. Logistic regression would not fit here, because price is a continuous outcome.

Something that I struggled with was choosing which categorical variable to base my visualizations on. Variables like heating and parking had over 10 different types, while variables like cooling had only 2 types, which I found to be too few. To resolve this, I created a categorical column that groups the number of bedrooms around the median so I could create a boxplot. For the parking variable, I filtered it to only 3 parking types to create a color-coded scatter plot.

References(APA)

Sale prices of houses in Duke Forest, Durham, NC. (2020, November). Openintro.org; OpenIntro. https://www.openintro.org/data/index.php?data=duke_forest

Project 1 - Data 110

S Tonge

2026-03-30