Objective:

The homes dataset is provided to support detailed analysis of housing and real estate trends. Its attributes cover pricing, size, layout, location, and age, which makes it possible to identify relationships and patterns within the local market.

With fields ranging from the number of bedrooms and bathrooms to elevation above sea level and year built, the data makes it possible to investigate how housing characteristics correlate with sale prices. For instance, combining square footage, location, and year built lets one model and predict prices. Tracking such data over time could also reveal market shifts, for example whether premiums for elevated views decline as taller buildings go up, or whether aging homes lose value as renovated replacements come onto the market. In short, the dataset supports analyses of the market forces that drive property valuations and development strategies in the region.

# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

# Read in the data
homes <- read_csv("D:/DataSet/Homes.csv")
## Rows: 492 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): in_sf, beds, bath, price, year_built, sqft, price_per_sqft, elevation
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Summary statistics
summary(homes)
##      in_sf             beds             bath            price         
##  Min.   :0.0000   Min.   : 0.000   Min.   : 1.000   Min.   :  187518  
##  1st Qu.:0.0000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:  749000  
##  Median :1.0000   Median : 2.000   Median : 2.000   Median : 1145000  
##  Mean   :0.5447   Mean   : 2.155   Mean   : 1.906   Mean   : 2020696  
##  3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.: 1908750  
##  Max.   :1.0000   Max.   :10.000   Max.   :10.000   Max.   :27500000  
##    year_built        sqft        price_per_sqft     elevation     
##  Min.   :1880   Min.   : 310.0   Min.   : 270.0   Min.   :  0.00  
##  1st Qu.:1924   1st Qu.: 832.8   1st Qu.: 730.5   1st Qu.: 10.00  
##  Median :1960   Median :1312.0   Median : 960.0   Median : 18.50  
##  Mean   :1959   Mean   :1523.0   Mean   :1195.6   Mean   : 39.85  
##  3rd Qu.:2001   3rd Qu.:1809.0   3rd Qu.:1419.0   3rd Qu.: 61.00  
##  Max.   :2016   Max.   :7800.0   Max.   :4601.0   Max.   :238.00
# Aggregate by number of bedrooms
bedroom_counts <- homes %>% 
  group_by(beds) %>%
  summarize(n = n())

# Visualize with a bar plot
ggplot(bedroom_counts, aes(x = beds, y = n)) +
  geom_col()

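# Analyze mean price per sqft by year-built group
# (breaks start at -Inf so homes built in or before 1900 are not dropped as NA;
#  dig.lab = 4 keeps the interval labels as plain four-digit years)
homes %>%
  mutate(year_group = cut(year_built, breaks = c(-Inf, 1900, 1940, 1980, 2000, Inf), dig.lab = 4)) %>% 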
  group_by(year_group) %>%
  summarize(mean_price_sqft = mean(price_per_sqft, na.rm = TRUE)) %>%
  ggplot(aes(x = year_group, y = mean_price_sqft)) +
  geom_col()
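
By default `cut()` builds left-open, right-closed intervals and returns `NA` for any value outside the outermost breaks, so the choice of breaks matters here: the summary above shows `year_built` starting at 1880. The following sketch illustrates that behaviour on a handful of made-up years (the vector `years` and the group labels are illustrative, not taken from the data).

```r
# Hypothetical build years, chosen to straddle the break points
years <- c(1880, 1900, 1901, 1950, 2000, 2016)

# Breaks that start at 1900: anything at or below 1900 falls outside every interval
narrow <- cut(years, breaks = c(1900, 1940, 1980, 2000, Inf))

# Breaks that start at -Inf: every year lands in a group
labels <- c("1900 or earlier", "1901-1940", "1941-1980", "1981-2000", "2001 or later")
wide <- cut(years, breaks = c(-Inf, 1900, 1940, 1980, 2000, Inf), labels = labels)

print(data.frame(years, narrow, wide))
```

In the `narrow` column the years 1880 and 1900 come back as `NA`, while `wide` assigns both to the earliest group.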

# Median price by location (in_sf: 1 = San Francisco, 0 = elsewhere)
neighborhood_prices <- homes %>%
  group_by(in_sf) %>% 
  summarize(median_price = median(price))

This covers initial data exploration, aggregation, and visualization. Further analysis could examine relationships between variables, differences by year built, and geographic patterns, or apply modeling techniques.

The code below analyzes the home prices and characteristics using visualization and modeling. It first creates a scatterplot with a fitted regression line showing the relationship between price and square footage. Next, it plots mean price by year built, which shows how prices differ with the age of a home rather than how the market moved over time.

A Welch t-test then compares price per square foot between San Francisco and non-San Francisco homes to check for geographic differences. Finally, a linear regression model predicts home price from the number of beds, baths, square footage, and year built, and the model summary is used to assess which factors are most strongly related to price. Together, these steps give an exploratory view of what drives home prices and a basis for predicting how changes in home characteristics may affect pricing.
# Relationship between price and square footage  
ggplot(homes, aes(x = sqft, y = price)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

# Mean price by year built (year_built is already a numeric year,
# so no date conversion is needed)
homes %>% 
  group_by(year_built) %>%
  summarize(mean_price = mean(price, na.rm = TRUE)) %>% 
  ggplot(aes(x = year_built, y = mean_price)) +
  geom_line()

# Geographic patterns in price per sq footage
sf_homes <- filter(homes, in_sf == 1) 
non_sf_homes <- filter(homes, in_sf == 0)

t.test(sf_homes$price_per_sqft, non_sf_homes$price_per_sqft)
## 
##  Welch Two Sample t-test
## 
## data:  sf_homes$price_per_sqft and non_sf_homes$price_per_sqft
## t = -13.199, df = 278.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -909.0135 -673.0610
## sample estimates:
## mean of x mean of y 
##  835.4851 1626.5223
# Model home price based on features  
homes_model <- lm(price ~ beds + bath + sqft + year_built, data = homes)
summary(homes_model)
## 
## Call:
## lm(formula = price ~ beds + bath + sqft + year_built, data = homes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8339425  -678652   -35465   510926 12299552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.597e+07  4.077e+06  -6.369 4.40e-10 ***
## beds        -8.143e+05  1.194e+05  -6.821 2.69e-11 ***
## bath         4.868e+05  1.735e+05   2.806  0.00521 ** 
## sqft         2.403e+03  1.592e+02  15.091  < 2e-16 ***
## year_built   1.284e+04  2.082e+03   6.166 1.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1797000 on 487 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5953 
## F-statistic: 181.6 on 4 and 487 DF,  p-value: < 2.2e-16
homes_new <- data.frame(
  beds = 2,
  bath = 2, 
  sqft = 2000,
  year_built = 2015  
)

predict(homes_model, homes_new)
##       1 
## 4057178
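
As a sanity check, the prediction can be reproduced by hand from the coefficient table: a linear model's prediction is the intercept plus each coefficient times the corresponding feature value. The sketch below uses the rounded estimates printed by `summary(homes_model)`, so it matches `predict()` only approximately; the variable names are illustrative.

```r
# Rounded estimates copied from the summary(homes_model) output
coefs <- c(intercept = -2.597e7, beds = -8.143e5, bath = 4.868e5,
           sqft = 2.403e3, year_built = 1.284e4)

# Feature values of the new home, with a leading 1 for the intercept
x_new <- c(intercept = 1, beds = 2, bath = 2, sqft = 2000, year_built = 2015)

manual_prediction <- sum(coefs * x_new)
manual_prediction                      # about 4.05 million

# Relative gap to the value reported by predict(); caused by coefficient rounding
reported <- 4057178
abs(manual_prediction - reported) / reported
```

For a range rather than a point estimate, `predict(homes_model, homes_new, interval = "prediction")` also returns lower and upper bounds.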

Pie chart of homes by number of bedrooms

# Pie chart of homes by number of bedrooms
homes %>%
  count(beds) %>%
  mutate(prop = n/sum(n)) %>% 
  ggplot(aes(x = "", y = prop, fill = factor(beds))) + 
  geom_col(color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Homes by Number of Bedrooms")

Pie chart of homes by neighborhood

# Pie chart of homes by neighborhood 
homes %>% 
  mutate(neighborhood = ifelse(in_sf == 1, "SF", "Not SF")) %>%
  count(neighborhood) %>%
  mutate(prop = n/sum(n)) %>% 
  ggplot(aes(x = "", y = prop, fill = neighborhood)) +
  geom_col(color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Homes by Neighborhood")

Perform Probability and Hypothesis testing on the Homes data:

# Compare price per sq ft distributions by location 
sf_homes <- filter(homes, in_sf == 1)
non_sf_homes <- filter(homes, in_sf == 0)

# Visualize the distributions (alpha keeps both overlapping densities visible)
ggplot(homes, aes(x = price_per_sqft)) + 
  geom_density(aes(fill = factor(in_sf)), alpha = 0.5)

# Conduct t-test
t.test(sf_homes$price_per_sqft, non_sf_homes$price_per_sqft)
## 
##  Welch Two Sample t-test
## 
## data:  sf_homes$price_per_sqft and non_sf_homes$price_per_sqft
## t = -13.199, df = 278.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -909.0135 -673.0610
## sample estimates:
## mean of x mean of y 
##  835.4851 1626.5223
# State null and alternative hypotheses 
# H0: There is no difference in mean price per sq ft between SF and non-SF homes  
# H1: There is a difference in mean price per sq ft between SF and non-SF homes
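
The Welch statistic reported above can be computed directly from the group means, variances, and sizes: t = (mean_x - mean_y) / sqrt(s_x^2 / n_x + s_y^2 / n_y), with degrees of freedom from the Welch-Satterthwaite approximation. A minimal sketch on simulated values follows (the vectors `x` and `y` are made up and merely stand in for the two price-per-sqft columns), checked against `t.test()`.

```r
set.seed(1)
# Simulated stand-ins for the two groups; not the homes data
x <- rnorm(60, mean = 800, sd = 300)
y <- rnorm(50, mean = 1600, sd = 900)

# Squared standard errors of the two group means
vx <- var(x) / length(x)
vy <- var(y) / length(y)

t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

welch <- t.test(x, y)   # var.equal = FALSE is the default, i.e. Welch's test
c(manual = t_manual, t.test = unname(welch$statistic))
c(manual = df_manual, t.test = unname(welch$parameter))
```

The sign of t follows the order of the arguments: a negative value means the first group's mean is lower than the second's.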

# Linear model price ~ beds + bath + sqft + year_built
homes_model <- lm(price ~ beds + bath + sqft + year_built, data = homes)

# Hypothesis test on model coefficients
# H0: A model coefficient equals 0 
# H1: A model coefficient does not equal 0
summary(homes_model) 
## 
## Call:
## lm(formula = price ~ beds + bath + sqft + year_built, data = homes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8339425  -678652   -35465   510926 12299552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.597e+07  4.077e+06  -6.369 4.40e-10 ***
## beds        -8.143e+05  1.194e+05  -6.821 2.69e-11 ***
## bath         4.868e+05  1.735e+05   2.806  0.00521 ** 
## sqft         2.403e+03  1.592e+02  15.091  < 2e-16 ***
## year_built   1.284e+04  2.082e+03   6.166 1.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1797000 on 487 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5953 
## F-statistic: 181.6 on 4 and 487 DF,  p-value: < 2.2e-16
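
Each row of the coefficient table is a t-test of H0: coefficient = 0. The t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom (487 here). The sketch below recomputes these from the rounded figures in the printed table, so small differences from R's output are expected.

```r
# Rounded estimates and standard errors copied from the coefficient table
estimate  <- c(beds = -8.143e5, bath = 4.868e5, sqft = 2.403e3, year_built = 1.284e4)
std_error <- c(beds = 1.194e5, bath = 1.735e5, sqft = 1.592e2, year_built = 2.082e3)
df_resid  <- 487

t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence intervals: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)

round(cbind(t_value, p_value), 4)
ci
```

An interval that excludes zero corresponds to rejecting H0 at the 5% level. `confint(homes_model)` gives the same kind of intervals from the unrounded fit.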
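# Chi-squared test for association between in_sf and elevation
# Caveat: elevation is numeric, so xtabs() treats each distinct value as its own
# category (df = 120 implies 121 distinct elevations). Many cells then have very
# small expected counts, which is why R warns that the approximation may be incorrect.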
xtabs(~ in_sf + elevation, data = homes) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 284.04, df = 120, p-value = 2.501e-15
# H0: In SF location and elevation are independent  
# H1: There is an association between In SF location and elevation
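
One way to avoid the sparse-table warning is to group elevation into a few bands before cross-tabulating, so that every cell has a reasonable expected count. The sketch below does this with quartile bands on simulated data (the `toy` data frame, its distributions, and the choice of four bands are all assumptions for illustration; results on the real data would differ).

```r
set.seed(42)
# Simulated stand-in for the homes data: 200 homes per location,
# with elevation drawn from different distributions
toy <- data.frame(
  in_sf     = rep(c(0, 1), each = 200),
  elevation = c(rexp(200, rate = 1 / 15), rexp(200, rate = 1 / 60))
)

# Quartile bands of elevation; include.lowest keeps the minimum value in band Q1
toy$elev_band <- cut(toy$elevation,
                     breaks = quantile(toy$elevation, probs = 0:4 / 4),
                     include.lowest = TRUE,
                     labels = c("Q1", "Q2", "Q3", "Q4"))

band_table <- table(in_sf = toy$in_sf, elev_band = toy$elev_band)
band_test  <- chisq.test(band_table)

band_table
band_test$expected   # expected counts per cell under independence
band_test
```

Since elevation is numeric, comparing it between the two locations with a two-sample test such as `t.test(elevation ~ in_sf, data = homes)` is another option.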

Conclusion:

The exploratory analysis of the homes dataset reveals several findings. A Welch t-test shows a significant difference in mean price per square foot between San Francisco and non-San Francisco homes (p < 2.2e-16), with San Francisco homes having the lower mean (about 835 versus 1,627). The linear regression model with beds, baths, square footage, and year built explains about 60% of the variance in home price (R-squared 0.5986), and all four predictors are statistically significant. The coefficient on beds is negative: holding square footage, baths, and year built fixed, an additional bedroom is associated with a lower price. The model gives a data-driven starting point for predicting prices, although the residual standard error of about 1.8 million means individual predictions are imprecise. Finally, the chi-squared test suggests an association between San Francisco location and elevation, but R warns that the chi-squared approximation may be incorrect, so this result should be treated with caution. Overall, location, physical characteristics, and year built all appear informative about home prices, which could guide recommendations for real estate valuation and home improvements. More spatial modeling and investigation of other amenities could provide additional insight into the market dynamics of the region.