The data set used in this project is titled “Sale prices of houses in Duke Forest, Durham, NC”. It was retrieved from OpenIntro and contains information on houses sold in the Duke Forest neighborhood of Durham, North Carolina, in November 2020. Its variables include the cooling system, heating system, parking type, number of bathrooms, number of bedrooms, and more. My analysis uses two of them: the price of the home in USD and its area in square feet. The results can help homeowners who plan to renovate before selling, as well as real estate agents, who can take the relationship between area and price into consideration when advertising homes to clients.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
setwd("C:/Users/tonge/Desktop/Data 101")
houses <- read_csv("duke_forest.csv")
## Rows: 98 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): address, type, heating, cooling, parking, hoa, url
## dbl (6): price, bed, bath, area, year_built, lot
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Checking structure and head of the dataset
str(houses)
## spc_tbl_ [98 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ address : chr [1:98] "1 Learned Pl, Durham, NC 27705" "1616 Pinecrest Rd, Durham, NC 27705" "2418 Wrightwood Ave, Durham, NC 27705" "2527 Sevier St, Durham, NC 27705" ...
## $ price : num [1:98] 1520000 1030000 420000 680000 428500 ...
## $ bed : num [1:98] 3 5 2 4 4 3 5 4 4 3 ...
## $ bath : num [1:98] 4 4 3 3 3 3 5 3 5 2 ...
## $ area : num [1:98] 6040 4475 1745 2091 1772 ...
## $ type : chr [1:98] "Single Family" "Single Family" "Single Family" "Single Family" ...
## $ year_built: num [1:98] 1972 1969 1959 1961 2020 ...
## $ heating : chr [1:98] "Other, Gas" "Forced air, Gas" "Forced air, Gas" "Heat pump, Other, Electric, Gas" ...
## $ cooling : chr [1:98] "central" "central" "central" "central" ...
## $ parking : chr [1:98] "0 spaces" "Carport, Covered" "Garage - Attached, Covered" "Carport, Covered" ...
## $ lot : num [1:98] 0.97 1.38 0.51 0.84 0.16 0.45 0.94 0.79 0.53 0.73 ...
## $ hoa : chr [1:98] NA NA NA NA ...
## $ url : chr [1:98] "https://www.zillow.com/homedetails/1-Learned-Pl-Durham-NC-27705/49981897_zpid/" "https://www.zillow.com/homedetails/1616-Pinecrest-Rd-Durham-NC-27705/49969247_zpid/" "https://www.zillow.com/homedetails/2418-Wrightwood-Ave-Durham-NC-27705/49972133_zpid/" "https://www.zillow.com/homedetails/2527-Sevier-St-Durham-NC-27705/49967280_zpid/" ...
## - attr(*, "spec")=
## .. cols(
## .. address = col_character(),
## .. price = col_double(),
## .. bed = col_double(),
## .. bath = col_double(),
## .. area = col_double(),
## .. type = col_character(),
## .. year_built = col_double(),
## .. heating = col_character(),
## .. cooling = col_character(),
## .. parking = col_character(),
## .. lot = col_double(),
## .. hoa = col_character(),
## .. url = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(houses)
## # A tibble: 6 × 13
## address price bed bath area type year_built heating cooling parking
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 1 Learned P… 1.52e6 3 4 6040 Sing… 1972 Other,… central 0 spac…
## 2 1616 Pinecr… 1.03e6 5 4 4475 Sing… 1969 Forced… central Carpor…
## 3 2418 Wright… 4.20e5 2 3 1745 Sing… 1959 Forced… central Garage…
## 4 2527 Sevier… 6.80e5 4 3 2091 Sing… 1961 Heat p… central Carpor…
## 5 2218 Myers … 4.29e5 4 3 1772 Sing… 2020 Forced… central 0 spac…
## 6 2619 Vesson… 4.56e5 3 3 1950 Sing… 2014 Forced… central Off-st…
## # ℹ 3 more variables: lot <dbl>, hoa <chr>, url <chr>
Checking for NAs: there are none in the variables I will be using (price and area). Only lot (1) and hoa (97) have missing values.
colSums(is.na(houses))
## address price bed bath area type year_built
## 0 0 0 0 0 0 0
## heating cooling parking lot hoa url
## 0 0 0 1 97 0
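Because only two columns enter the analysis, the same check can be narrowed to them. The following is a minimal sketch, not part of the original workflow; `count_missing` is a hypothetical helper name, and the call at the end assumes `houses` has been loaded as above.

```r
# Hypothetical helper: count missing values in the selected columns only.
count_missing <- function(data, cols) {
  colSums(is.na(data[, cols, drop = FALSE]))
}

# Applied to the two analysis variables (runs only if `houses` is loaded).
if (exists("houses")) {
  count_missing(houses, c("price", "area"))
}
```

A count of zero for both columns confirms that no rows need to be dropped before computing the correlation.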
The scatterplot of house price against area is noisy, suggesting that factors besides area may be affecting the price. However, the overall trend is positive: larger houses tend to be more expensive. The code used to create the scatterplot is from my Data 110 notes (Maliha, 2026).
# Each layer must be joined with `+`; otherwise labs() and theme_minimal()
# are evaluated on their own and never applied to the plot.
house_chart <- ggplot(houses, aes(x = area, y = price)) +
  geom_point() +  # points, not geom_line(), for a scatterplot
  labs(title = "Scatterplot of House Price and House Area in sq ft",
       x = "Area of Home in sq ft",
       y = "Price of House") +
  theme_minimal(base_size = 14)
house_chart
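The positive trend described above can be quantified with a least-squares line. This is a rough sketch rather than part of the original analysis; `fit_area_line` is a hypothetical helper name, and the final call assumes `houses` is loaded with `price` and `area` columns.

```r
# Hypothetical helper: intercept and slope of the least-squares line
# price ~ area. The slope is the estimated change in price (USD) per
# additional square foot.
fit_area_line <- function(data) {
  coef(lm(price ~ area, data = data))
}

# Runs only if `houses` is loaded.
if (exists("houses")) {
  fit_area_line(houses)
}
```

A positive `area` coefficient would agree with the upward trend seen in the plot.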
Creating the correlation matrix
cor_matrix <- cor(
houses |>
select(price, area), use = "complete.obs")
cor_matrix
## price area
## price 1.000000 0.667229
## area 0.667229 1.000000
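Squaring the coefficient gives r² ≈ 0.445, so in a simple linear model roughly 45% of the variation in price is associated with area. The sketch below, which is an addition and not part of the original workflow, uses base R's `cor.test()` to attach a confidence interval and p-value to r; `summarise_correlation` is a hypothetical helper name, and the final call assumes `houses` is loaded.

```r
# Hypothetical helper: Pearson correlation with r-squared, 95% CI, and p-value.
summarise_correlation <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  list(
    r = unname(test$estimate),
    r_squared = unname(test$estimate)^2,
    conf_int = as.numeric(test$conf.int),
    p_value = test$p.value
  )
}

# Runs only if `houses` is loaded.
if (exists("houses")) {
  summarise_correlation(houses$area, houses$price)
}
```

An interval that excludes zero would support describing the relationship as positive rather than a chance pattern in this sample.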
Visualization of correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
title = "Correlation Matrix of House Price and Area")
The correlation matrix gave r = 0.67, indicating a moderately strong positive correlation between a house’s area and its price. The scatterplot likewise showed a positive trend: as area increases, price tends to increase. Larger homes in this neighborhood therefore tend to be more expensive. This information can be useful for real estate agents, since extra bathrooms, bedrooms, and even basements add area and can increase a house’s price. Agents can filter out houses for clients who have a specific budget, or they can justify a home’s price when asked. It can also help homeowners who are renovating with the intention to sell in the future. They could renovate so that their home offers more usable space, whether that means finishing a basement or converting an attic. If I were to do further research, I would use multiple linear regression, which can help show which numerical variable is the best predictor of a house’s price. We could test the impact of variables such as number of bedrooms, number of bathrooms, year built, area, and lot size.
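The multiple linear regression proposed above could start from something like the following. This is only a sketch of the future work, not an analysis that was run for this project; `fit_price_model` is a hypothetical helper name, and the final call assumes `houses` is loaded. By default `lm()` drops rows with missing values, so the one house with a missing `lot` would be excluded.

```r
# Hypothetical helper: regress price on the numeric predictors named in the text.
fit_price_model <- function(data) {
  lm(price ~ area + bed + bath + year_built + lot, data = data)
}

# Runs only if `houses` is loaded.
if (exists("houses")) {
  price_model <- fit_price_model(houses)
  summary(price_model)
}
```

Because the predictors are on different scales (square feet, counts, years, acres), the raw coefficients are not directly comparable; the t-statistics in the summary, or a refit on standardized predictors, would be a fairer basis for judging which variable predicts price best.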
Maliha, M. (2026). Data 110 unit 6: Correlation, scatterplots, and Plotly [Class notes]. Montgomery College. DATA 110.
Sale prices of houses in Duke Forest, Durham, NC. (2020, November). Openintro.org; OpenIntro. https://www.openintro.org/data/index.php?data=duke_forest