# Load tidyverse
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read in data
Homes <- read.csv('D:/DataSet/Homes.csv')
# Calculate price per bedroom
Homes <- Homes %>%
  mutate(price_per_bed = price / beds)
# Set theme for plots
theme_set(theme_minimal())
# Response: price, Explanatory: beds, bath, price_per_bed (calculated)
ggplot(data = Homes, aes(x = beds, y = price)) +
  geom_point(color = 'purple')
ggplot(data = Homes, aes(x = bath, y = price)) +
  geom_point(color = 'red')
ggplot(data = Homes, aes(x = price_per_bed, y = price)) +
  geom_point(color = 'green')
## Price appears to be positively related to both the number of beds and the number of baths. Its relationship with price per bed also looks positive but is less clear.
cor(Homes$beds, Homes$price)
## [1] 0.4445164
## [1] 0.5903029
cor(Homes$bath, Homes$price)
## [1] 0.6170191
## [1] 0.713484
cor(Homes$price_per_bed, Homes$price)
## [1] NaN
## [1] 0.5547313
## The correlation coefficients for beds and bath confirm the positive relationships seen in the plots. Of the explanatory variables, bath has the strongest correlation with price.
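One of the printed results for `cor(Homes$price_per_bed, Homes$price)` is `NaN`. A possible cause, assumed here rather than checked against `Homes.csv`, is a listing with zero bedrooms: dividing by zero gives `Inf`, and a single infinite value makes the Pearson correlation undefined. The sketch below uses made-up values in a hypothetical `toy` data frame to show the effect and one way to drop the non-finite rows before computing the correlation.

```r
# Toy data (hypothetical values, not taken from Homes.csv).
# The fourth listing has 0 bedrooms, e.g. a studio.
toy <- data.frame(
  price = c(300000, 450000, 520000, 250000, 610000),
  beds  = c(2, 3, 4, 0, 3)
)

# Same calculation as in the notebook; 250000 / 0 evaluates to Inf
toy$price_per_bed <- toy$price / toy$beds

# With an infinite value present, the correlation is not a finite number
r_raw <- cor(toy$price_per_bed, toy$price)

# Keep only the rows where the ratio is finite, then recompute
ok <- is.finite(toy$price_per_bed)
r_clean <- cor(toy$price_per_bed[ok], toy$price[ok])

print(r_raw)
print(r_clean)
```

Dropping rows changes the sample the correlation describes, so whether this filter is appropriate depends on how zero-bedroom listings should be treated in the analysis.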
# 95% CI for price
t.test(Homes$price)$conf.int
## [1] 1770540 2270852
## attr(,"conf.level")
## [1] 0.95
## [1] 911792.5 1279788.4
## attr(,"conf.level")
## [1] 0.95
## The 95% confidence interval for mean price is $911,792 to $1,279,788. We can say with 95% confidence that the true population mean home price lies within this range.
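For reference, the interval returned by `t.test(x)$conf.int` is the sample mean plus or minus a t critical value times the standard error of the mean. The sketch below reproduces it by hand on a small made-up vector (the values and the names `manual_ci` and `builtin_ci` are illustrative, not from the Homes data).

```r
# Hypothetical sample of sale prices (not taken from Homes.csv)
x <- c(300000, 450000, 520000, 250000, 610000, 380000)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

# Mean plus/minus the margin of error
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# Interval from the one-sample t-test, with attributes stripped
builtin_ci <- as.numeric(t.test(x)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Both vectors hold the same lower and upper bounds, which makes explicit that the interval width depends on the sample standard deviation and the sample size.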
# Response: sqft, Explanatory: beds, bath, price
ggplot(data = Homes, aes(x = beds, y = sqft)) +
  geom_point(color = 'yellow')
ggplot(data = Homes, aes(x = bath, y = sqft)) +
  geom_point(color = 'pink')
ggplot(data = Homes, aes(x = price, y = sqft)) +
  geom_point(color = 'brown')
## The plots show positive relationships between sqft and the number of beds and baths. The relationship between sqft and price also appears positive.
cor(Homes$beds, Homes$sqft)
## [1] 0.7885039
## [1] 0.7467673
cor(Homes$bath, Homes$sqft)
## [1] 0.8499357
## [1] 0.7639569
cor(Homes$price, Homes$sqft)
## [1] 0.7152977
## [1] 0.8534102
## The correlation coefficients confirm strong positive correlations, especially between sqft and price.
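The three separate `cor()` calls for a response variable can also be collapsed into one correlation matrix, which makes the values easier to compare side by side. This is a minimal sketch on a hypothetical `toy` data frame with made-up values; the column names mirror those used above, but nothing here is read from `Homes.csv`.

```r
# Toy data (hypothetical values, not taken from Homes.csv)
toy <- data.frame(
  beds  = c(2, 3, 4, 1, 4, 3),
  bath  = c(1, 2, 2, 1, 3, 2),
  price = c(300000, 450000, 520000, 250000, 610000, 380000),
  sqft  = c(900, 1400, 1800, 650, 2200, 1300)
)

# Pairwise Pearson correlations for all four columns at once
cor_mat <- cor(toy[, c("beds", "bath", "price", "sqft")])

# Pull out the row for the response variable (sqft) against its explanatory variables
sqft_cors <- cor_mat["sqft", c("beds", "bath", "price")]

print(round(cor_mat, 3))
print(round(sqft_cors, 3))
```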
# 95% CI for sqft
t.test(Homes$sqft)$conf.int
## [1] 1433.137 1612.843
## attr(,"conf.level")
## [1] 0.95
## [1] 1065.412 1621.839
## attr(,"conf.level")
## [1] 0.95
## The 95% confidence interval for mean sqft is 1,065 to 1,622 sqft. We can conclude with 95% confidence that the true population mean sqft lies within this range.
# Response: price_per_sqft, Explanatory: beds, bath, elevation
ggplot(data = Homes, aes(x = beds, y = price_per_sqft)) +
  geom_point(color = 'orange')
ggplot(data = Homes, aes(x = bath, y = price_per_sqft)) +
  geom_point(color = 'darkred')
ggplot(data = Homes, aes(x = elevation, y = price_per_sqft)) +
  geom_point(color = 'darkgreen')
## There does not appear to be a strong relationship between price per sqft and the number of beds or baths. The elevation plot shows price per sqft tending to be higher at lower elevations.
cor(Homes$beds, Homes$price_per_sqft)
## [1] 0.04404323
## [1] 0.2333438
cor(Homes$bath, Homes$price_per_sqft)
## [1] 0.2678535
## [1] 0.2744376
cor(Homes$elevation, Homes$price_per_sqft)
## [1] -0.3709934
## [1] -0.3377702
## The correlation coefficients confirm weak relationships with beds and baths and a moderate negative correlation with elevation.
# 95% CI for price_per_sqft
t.test(Homes$price_per_sqft)$conf.int
## [1] 1130.635 1260.629
## attr(,"conf.level")
## [1] 0.95
## [1] 1092.569 1504.122
## attr(,"conf.level")
## [1] 0.95
## The 95% confidence interval for mean price per sqft is $1,092 to $1,504. We can conclude with 95% confidence that the true population mean price per sqft lies within this range.
In this notebook, I analyzed home sales data by:

- Calculating a new variable, price per bedroom
- Plotting relationships between response and explanatory variables in 3 sets
- Calculating correlation coefficients
- Interpreting correlation values based on visualizations
- Building 95% confidence intervals for the response variables
- Drawing conclusions about the population means based on the confidence intervals