The homes dataset is meant to support detailed analysis of housing and real estate trends. Its eight numeric fields cover pricing (price, price_per_sqft), size and layout (sqft, beds, bath), location (in_sf, elevation), and era (year_built), which is enough to look for relationships and patterns within the local market.
With fields ranging from the number of bedrooms and bathrooms to elevation above sea level and year built, the data makes it possible to investigate how housing characteristics correlate with sale price. For instance, combining square footage, bedroom and bathroom counts, and location lets one model and predict prices. Collected repeatedly over time, the same fields could also reveal market shifts, for example whether premiums for elevated views decline as taller buildings go up, or whether aging homes lose value as renovated replacements come onto the market. In short, the dataset supports analyses of the market forces behind property valuations and development strategies in the region.
# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
# Read in the data
homes <- read_csv("D:/DataSet/Homes.csv")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Summary statistics
summary(homes)
## in_sf beds bath price
## Min. :0.0000 Min. : 0.000 Min. : 1.000 Min. : 187518
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 749000
## Median :1.0000 Median : 2.000 Median : 2.000 Median : 1145000
## Mean :0.5447 Mean : 2.155 Mean : 1.906 Mean : 2020696
## 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.: 1908750
## Max. :1.0000 Max. :10.000 Max. :10.000 Max. :27500000
## year_built sqft price_per_sqft elevation
## Min. :1880 Min. : 310.0 Min. : 270.0 Min. : 0.00
## 1st Qu.:1924 1st Qu.: 832.8 1st Qu.: 730.5 1st Qu.: 10.00
## Median :1960 Median :1312.0 Median : 960.0 Median : 18.50
## Mean :1959 Mean :1523.0 Mean :1195.6 Mean : 39.85
## 3rd Qu.:2001 3rd Qu.:1809.0 3rd Qu.:1419.0 3rd Qu.: 61.00
## Max. :2016 Max. :7800.0 Max. :4601.0 Max. :238.00
# Aggregate by number of bedrooms
bedroom_counts <- homes %>%
group_by(beds) %>%
summarize(n = n())
# Visualize with a bar plot
ggplot(bedroom_counts, aes(x = beds, y = n)) +
geom_col()
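# Analyze price per sqft by year built
# The first bin is open-ended so homes built in or before 1900 (the minimum is 1880) are not turned into NA
homes %>%
mutate(year_group = cut(year_built, breaks = c(-Inf, 1900, 1940, 1980, 2000, Inf))) %>%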
group_by(year_group) %>%
summarize(mean_price_sqft = mean(price_per_sqft, na.rm = TRUE)) %>%
ggplot(aes(x = year_group, y = mean_price_sqft)) +
geom_col()
# Median price by location (in_sf: 1 = San Francisco, 0 = not San Francisco)
neighborhood_prices <- homes %>%
group_by(in_sf) %>%
summarize(median_price = median(price))
The code above does some initial data exploration, aggregation, and visualization. Further analysis could examine relationships between variables, changes over time, and geographic patterns, or apply modeling techniques.
The next block of code analyzes home prices and characteristics using data visualization and modeling techniques. It first creates a scatterplot with a regression line showing the relationship between price and square footage. Next, it plots the mean price by year built to look for trends across construction eras.
A Welch t-test then compares price per square foot between San Francisco and non-San Francisco homes to check for geographic differences. Finally, a linear regression model predicts home price from the number of beds, baths, square footage, and year built. The model summary reports the coefficients, their significance, and R-squared, which indicate which factors are most strongly related to price, and the fitted model is used to predict the price of one hypothetical home. Together, these steps provide an exploratory view of the drivers of home prices using graphs, statistical tests, and modeling, and a basis for estimating how changes in home characteristics may affect pricing.
# Relationship between price and square footage
ggplot(homes, aes(x = sqft, y = price)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# Mean price by year built (construction era, not a sale-date time series)
homes %>%
mutate(year = year_built) %>%
group_by(year) %>%
summarize(mean_price = mean(price, na.rm = TRUE)) %>%
ggplot(aes(x = year, y = mean_price)) +
geom_line()
# Geographic patterns in price per sq footage
sf_homes <- filter(homes, in_sf == 1)
non_sf_homes <- filter(homes, in_sf == 0)
t.test(sf_homes$price_per_sqft, non_sf_homes$price_per_sqft)
##
## Welch Two Sample t-test
##
## data: sf_homes$price_per_sqft and non_sf_homes$price_per_sqft
## t = -13.199, df = 278.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -909.0135 -673.0610
## sample estimates:
## mean of x mean of y
## 835.4851 1626.5223
# Model home price based on features
homes_model <- lm(price ~ beds + bath + sqft + year_built, data = homes)
summary(homes_model)
##
## Call:
## lm(formula = price ~ beds + bath + sqft + year_built, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8339425 -678652 -35465 510926 12299552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.597e+07 4.077e+06 -6.369 4.40e-10 ***
## beds -8.143e+05 1.194e+05 -6.821 2.69e-11 ***
## bath 4.868e+05 1.735e+05 2.806 0.00521 **
## sqft 2.403e+03 1.592e+02 15.091 < 2e-16 ***
## year_built 1.284e+04 2.082e+03 6.166 1.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1797000 on 487 degrees of freedom
## Multiple R-squared: 0.5986, Adjusted R-squared: 0.5953
## F-statistic: 181.6 on 4 and 487 DF, p-value: < 2.2e-16
homes_new <- data.frame(
beds = 2,
bath = 2,
sqft = 2000,
year_built = 2015
)
predict(homes_model, homes_new)
## 1
## 4057178
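The key metrics mentioned above can also be pulled out of the fitted model as plain R objects rather than read off the printed summary. The following is a minimal sketch, not part of the original analysis: `demo_homes` is simulated stand-in data with made-up coefficients, and the names `demo_model`, `coef_table`, `fit_metrics`, and `new_home` are hypothetical. With the real data, `homes_model` would take the place of `demo_model`.

```r
# Simulated stand-in for the homes data (hypothetical values, illustration only)
set.seed(42)
n <- 200
demo_homes <- data.frame(
  beds = sample(0:5, n, replace = TRUE),
  bath = sample(1:4, n, replace = TRUE),
  sqft = round(runif(n, 300, 4000)),
  year_built = sample(1880:2016, n, replace = TRUE)
)
# Made-up price relationship plus noise, only so the model has something to fit
demo_homes$price <- 50000 + 900 * demo_homes$sqft + 100000 * demo_homes$bath +
  rnorm(n, sd = 300000)

demo_model <- lm(price ~ beds + bath + sqft + year_built, data = demo_homes)
demo_summary <- summary(demo_model)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- as.data.frame(coef(demo_summary))
names(coef_table) <- c("estimate", "std_error", "t_value", "p_value")

# 95% confidence intervals for each coefficient
coef_ci <- confint(demo_model, level = 0.95)

# Overall fit: R-squared, adjusted R-squared, residual standard error
fit_metrics <- c(
  r_squared = demo_summary$r.squared,
  adj_r_squared = demo_summary$adj.r.squared,
  sigma = demo_summary$sigma
)

# Point prediction with a 95% prediction interval for one hypothetical home
new_home <- data.frame(beds = 2, bath = 2, sqft = 2000, year_built = 2015)
pred <- predict(demo_model, new_home, interval = "prediction", level = 0.95)

print(coef_table)
print(coef_ci)
print(fit_metrics)
print(pred)
```

Asking `predict()` for a prediction interval is a deliberate choice here: with a residual standard error as large as the one reported for the real model, a single point estimate can suggest more precision than the model has.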
# Pie chart of homes by number of bedrooms
homes %>%
count(beds) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = "", y = prop, fill = factor(beds))) +
geom_col(color = "white") +
coord_polar(theta = "y") +
labs(title = "Homes by Number of Bedrooms")
# Pie chart of homes by neighborhood
homes %>%
mutate(neighborhood = ifelse(in_sf == 1, "SF", "Not SF")) %>%
count(neighborhood) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = "", y = prop, fill = neighborhood)) +
geom_col(color = "white") +
coord_polar(theta = "y") +
labs(title = "Homes by Neighborhood")
# Compare price per sq ft distributions by location
sf_homes <- filter(homes, in_sf == 1)
non_sf_homes <- filter(homes, in_sf == 0)
# Visualize the distributions (semi-transparent fills keep both curves visible where they overlap)
ggplot(homes, aes(x = price_per_sqft)) +
geom_density(aes(fill = factor(in_sf)), alpha = 0.5)
# Conduct t-test
t.test(sf_homes$price_per_sqft, non_sf_homes$price_per_sqft)
##
## Welch Two Sample t-test
##
## data: sf_homes$price_per_sqft and non_sf_homes$price_per_sqft
## t = -13.199, df = 278.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -909.0135 -673.0610
## sample estimates:
## mean of x mean of y
## 835.4851 1626.5223
# State null and alternative hypotheses
# H0: There is no difference in mean price per sq ft between SF and non-SF homes
# H1: There is a difference in mean price per sq ft between SF and non-SF homes
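To make the test behind these hypotheses concrete, the Welch statistic and its degrees of freedom can be computed by hand and compared with `t.test()`. This is a sketch under stated assumptions: `group_a` and `group_b` are simulated samples with arbitrary means and spreads, not the real SF and non-SF columns, and `welch_by_hand` is a hypothetical helper written for this illustration.

```r
# Simulated price-per-sqft samples (hypothetical values, illustration only)
set.seed(1)
group_a <- rnorm(120, mean = 850, sd = 300)
group_b <- rnorm(100, mean = 1600, sd = 600)

# Welch two-sample t-test computed step by step
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  # Two-sided p-value, matching H1: the means differ
  p_value <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p_value = p_value)
}

manual <- welch_by_hand(group_a, group_b)
builtin <- t.test(group_a, group_b)

print(manual)
print(c(t = unname(builtin$statistic),
        df = unname(builtin$parameter),
        p_value = builtin$p.value))
```

The sign of the statistic follows the order of the arguments: a negative t means the first group has the smaller mean, which is worth checking before describing which group is more expensive.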
# Linear model price ~ beds + bath + sqft + year_built
homes_model <- lm(price ~ beds + bath + sqft + year_built, data = homes)
# Hypothesis test on model coefficients
# H0: A model coefficient equals 0
# H1: A model coefficient does not equal 0
summary(homes_model)
##
## Call:
## lm(formula = price ~ beds + bath + sqft + year_built, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8339425 -678652 -35465 510926 12299552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.597e+07 4.077e+06 -6.369 4.40e-10 ***
## beds -8.143e+05 1.194e+05 -6.821 2.69e-11 ***
## bath 4.868e+05 1.735e+05 2.806 0.00521 **
## sqft 2.403e+03 1.592e+02 15.091 < 2e-16 ***
## year_built 1.284e+04 2.082e+03 6.166 1.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1797000 on 487 degrees of freedom
## Multiple R-squared: 0.5986, Adjusted R-squared: 0.5953
## F-statistic: 181.6 on 4 and 487 DF, p-value: < 2.2e-16
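# Chi-squared test for association between in_sf and elevation
# Note: elevation is tabulated value by value, so every distinct elevation becomes its own column and most cells are sparse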
xtabs(~ in_sf + elevation, data = homes) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 284.04, df = 120, p-value = 2.501e-15
# H0: In SF location and elevation are independent
# H1: There is an association between In SF location and elevation
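Because the test above tabulates every distinct elevation separately, many cells have very small expected counts, which is what triggers the warning. One common alternative is to bin elevation first. The sketch below assumes quartile bins and uses simulated data: `demo`, `elev_bin`, and the relationship between `in_sf` and `elevation` built into the simulation are all hypothetical and say nothing about the real dataset.

```r
# Simulated stand-in data (hypothetical values, illustration only)
set.seed(7)
n <- 400
demo <- data.frame(in_sf = rbinom(n, 1, 0.55))
# Arbitrary assumption for the simulation: the two groups have different typical elevations
demo$elevation <- round(rexp(n, rate = ifelse(demo$in_sf == 1, 1 / 60, 1 / 15)))

# Bin elevation into quartiles; unique() guards against tied break points
breaks <- unique(quantile(demo$elevation, probs = seq(0, 1, 0.25)))
demo$elev_bin <- cut(demo$elevation, breaks = breaks, include.lowest = TRUE)

# Contingency table of location by elevation bin, then the chi-squared test
elev_table <- xtabs(~ in_sf + elev_bin, data = demo)
elev_test <- chisq.test(elev_table)

print(elev_table)
print(elev_test$expected)  # expected counts should be comfortably above 5
print(elev_test)
```

With a handful of bins, the degrees of freedom fall from 120 to (rows - 1) x (columns - 1), and the expected counts are large enough for the chi-squared approximation to be reasonable. Since elevation is numeric, comparing it between the two groups directly, for instance with a t-test, is another option.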
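The exploratory analysis of the homes dataset reveals several interesting findings. Price per square foot differs significantly between San Francisco and non-San Francisco homes: the Welch t-test reports a mean of about 835 for SF homes versus about 1,627 for non-SF homes (t = -13.2, p < 2.2e-16), so in this sample SF homes have the lower average price per square foot. The linear regression model with beds, baths, square footage, and year built explains about 60% of the variance in home price (R-squared = 0.599), and all four predictors are statistically significant. The coefficient on beds is negative, meaning that for a fixed square footage, bathroom count, and year built, an additional bedroom is associated with a lower price. This gives a data-driven starting point for predicting prices, although the residual standard error of roughly 1.8 million shows that individual predictions remain imprecise. Finally, a chi-squared test suggests an association between San Francisco location and elevation, but because elevation was tabulated value by value, R warned that the chi-squared approximation may be incorrect, and the result should be treated with caution. Overall, location, physical characteristics, and year built seem most informative in explaining variation in home prices, which could guide recommendations for real estate valuation and home improvements. More spatial modeling and investigation of other amenities could provide additional insight into the market dynamics at play in the region.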