Project Background
This project focuses on housing price trends in Santa Clara County in 2023, with a detailed analysis of the average prices of three house types: single-family homes (SFR), townhouses (TH), and condos (Condo). Santa Clara County's real estate market is known for its high demand, which makes it critical to understand how price dynamics evolve for different house types.
Problem Statement
Price trends in Santa Clara County differ by house type. The main concern is whether there are significant price differences and correlations between single-family homes, townhouses, and condos, and how these price changes may affect the overall real estate market. Understanding these differences is critical for buyers and investors developing future housing strategies.
Objective
To determine whether there are significant differences in the average prices of single-family homes, townhouses, and condos in Santa Clara County for 2023.
Methods used: one-way ANOVA, box plots, pairwise t-tests, the Shapiro-Wilk test, Levene's test, and correlation analysis.
Develop a Model & Assess Predictor Significance
Step 1: Install and Load R packages
# if (!require(car)) install.packages("car") # provides leveneTest() for Levene's Test
#install.packages("readxl")
#install.packages("tidyr")
library(car)
## Loading required package: carData
library(carData)
library(readxl)
library(tidyr)
library(ggplot2)
Step 2: Import & summarize the data
combined_data <- read_excel('task8.xlsx')
str(combined_data)
## tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
## $ Month : chr [1:12] "Jan" "Feb" "March" "April" ...
## $ AVG $ SFR : num [1:12] 1914174 1862336 2031186 2089503 2118960 ...
## $ AVG $ TH : num [1:12] 1198247 1155453 1216500 1280010 1310374 ...
## $ AVG $ Condo: num [1:12] 776460 775791 883392 893087 836370 ...
head(combined_data)
## # A tibble: 6 × 4
## Month `AVG $ SFR` `AVG $ TH` `AVG $ Condo`
## <chr> <dbl> <dbl> <dbl>
## 1 Jan 1914174 1198247 776460
## 2 Feb 1862336 1155453 775791
## 3 March 2031186 1216500 883392
## 4 April 2089503 1280010 893087
## 5 May 2118960 1310374 836370
## 6 June 2162057 1282466 927731
colnames(combined_data) <- c("Month", "Avg_SFR", "Avg_TH", "Avg_Condo") # same order as the source columns: SFR, TH, Condo
summary(combined_data)
## Month Avg_SFR Avg_TH Avg_Condo
## Length:12 Min. :1862336 Min. :1155453 Min. :775791
## Class :character 1st Qu.:2002821 1st Qu.:1211937 1st Qu.:827934
## Mode :character Median :2087018 Median :1250368 Median :861735
## Mean :2065312 Mean :1246097 Mean :852054
## 3rd Qu.:2154952 3rd Qu.:1285062 3rd Qu.:884279
## Max. :2217444 Max. :1316528 Max. :927731
Data Description: The price variables are described in the table below.
Variable |Definition
---------------|-------------
1. Avg_SFR |Single-family homes average price per month in 2023
2. Avg_TH |Townhomes average price per month in 2023
3. Avg_Condo |Condos average price per month in 2023
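The plotting and testing code that follows relies on a long-format data frame named combined_long with columns Type and Avg_Price, but the step that creates it is not shown. A minimal sketch of that reshaping step, assuming it was done with tidyr's pivot_longer() (tidyr is loaded in Step 1, but the exact call is an assumption). To stay self-contained, the sketch uses only the six months printed by head() above, not the full 12-month file.

```r
library(tidyr)

# First six months only, copied from the head() output above;
# the remaining six months are not shown in this document.
combined_data <- data.frame(
  Month     = c("Jan", "Feb", "March", "April", "May", "June"),
  Avg_SFR   = c(1914174, 1862336, 2031186, 2089503, 2118960, 2162057),
  Avg_TH    = c(1198247, 1155453, 1216500, 1280010, 1310374, 1282466),
  Avg_Condo = c(776460, 775791, 883392, 893087, 836370, 927731)
)

# Stack the three price columns into a single Avg_Price column and
# keep the source column name as the grouping variable Type.
combined_long <- pivot_longer(
  combined_data,
  cols      = c(Avg_SFR, Avg_TH, Avg_Condo),
  names_to  = "Type",
  values_to = "Avg_Price"
)
combined_long$Type <- factor(combined_long$Type)

str(combined_long)
```

With the full data set, the same call would produce 36 rows (12 months for each of the 3 types), which matches the 33 residual degrees of freedom reported by Levene's test later in the analysis.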
Step 4: Data visualization
# Boxplot to compare price distribution
ggplot(combined_long, aes(x = Type, y = Avg_Price, fill = Type)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Housing Type in 2023",
       x = "Housing Type",
       y = "Average Price") +
  theme_minimal()

Interpretation: The box plot visualizes the distribution of average monthly prices across the three housing types in 2023.
(1) Avg_SFR: The highest median price among the three housing types. A wider IQR indicates more variability in prices compared to the other categories.
(2) Avg_TH: The second-highest median price. A smaller IQR compared to Avg_SFR, suggesting less price variability.
(3) Avg_Condo: The lowest median price. A relatively small IQR indicates consistent pricing within this category.
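The assumption checks and the conclusion refer to an object named anova_result and to pairwise t-tests, but the code that produces them is not shown. A minimal sketch of what that step might look like, assuming a standard aov() fit on the long-format data. The Bonferroni adjustment and the unpooled standard deviations in the pairwise comparison are assumptions chosen for illustration, not settings confirmed by the report. Only the six months visible in the head() output are used here, so the numbers will differ from the full 12-month results.

```r
# First six months only (from the head() output in Step 2), in long format.
combined_long <- data.frame(
  Type = factor(rep(c("Avg_SFR", "Avg_TH", "Avg_Condo"), each = 6)),
  Avg_Price = c(
    1914174, 1862336, 2031186, 2089503, 2118960, 2162057,  # SFR
    1198247, 1155453, 1216500, 1280010, 1310374, 1282466,  # TH
    776460, 775791, 883392, 893087, 836370, 927731         # Condo
  )
)

# One-way ANOVA: does the mean price differ across housing types?
anova_result <- aov(Avg_Price ~ Type, data = combined_long)
print(summary(anova_result))

# Pairwise comparisons between types.
# Assumptions: Bonferroni correction, and pool.sd = FALSE so that each
# pair is compared with its own variances (Welch-style t-tests).
pairwise_result <- pairwise.t.test(
  combined_long$Avg_Price,
  combined_long$Type,
  p.adjust.method = "bonferroni",
  pool.sd = FALSE
)
print(pairwise_result)
```

Using pool.sd = FALSE is a cautious choice when group variances may differ, since it avoids relying on a single pooled standard deviation across all three types.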
Step 6: Check Assumptions of ANOVA
6.1 Residual Normality: Shapiro-Wilk Test
shapiro_test <- shapiro.test(residuals(anova_result))
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: residuals(anova_result)
## W = 0.97851, p-value = 0.695
if (shapiro_test$p.value < 0.05) {
  print("Residuals are NOT normally distributed (p < 0.05). Consider transformations or non-parametric methods.")
} else {
  print("Residuals are normally distributed (p >= 0.05).")
}
## [1] "Residuals are normally distributed (p >= 0.05)."
Interpretation: The Shapiro-Wilk test assesses whether the residuals from the ANOVA model are normally distributed.
Since the p-value = 0.695 > 0.05, the residuals do not deviate significantly from normality, so the normality assumption gives no reason to transform the data or switch methods.
6.2 Homogeneity of Variance: Levene’s Test
levene_test <- leveneTest(Avg_Price ~ Type, data = combined_long)
print("Levene's Test for Homogeneity of Variances:")
## [1] "Levene's Test for Homogeneity of Variances:"
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 4.4973 0.01874 *
## 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
if (levene_test$`Pr(>F)`[1] < 0.05) {
  print("Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations.")
} else {
  print("Variance is equal across groups (p >= 0.05). ANOVA assumption of equal variances is met.")
}
## [1] "Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations."
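Interpretation: Levene's test checks whether the variances are equal across the housing types.
Since the p-value = 0.01874 < 0.05, the variances differ significantly across the groups, so the equal-variance assumption of the standard one-way ANOVA is not met.
Consider alternatives such as Welch's ANOVA or a variance-stabilizing transformation.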
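Since Levene's test points to unequal variances, Welch's ANOVA is one way to follow up. A minimal sketch using base R's oneway.test() with var.equal = FALSE, which does not assume a common variance. This step is not part of the original analysis, and the sketch uses only the six months shown in the head() output, so it illustrates the call rather than reproducing the full-year result.

```r
# First six months only (from the head() output in Step 2), in long format.
combined_long <- data.frame(
  Type = factor(rep(c("Avg_SFR", "Avg_TH", "Avg_Condo"), each = 6)),
  Avg_Price = c(
    1914174, 1862336, 2031186, 2089503, 2118960, 2162057,  # SFR
    1198247, 1155453, 1216500, 1280010, 1310374, 1282466,  # TH
    776460, 775791, 883392, 893087, 836370, 927731         # Condo
  )
)

# Welch's one-way ANOVA: compares group means without assuming
# that the groups share the same variance.
welch_result <- oneway.test(Avg_Price ~ Type,
                            data = combined_long,
                            var.equal = FALSE)
print(welch_result)
```

If the Welch result agrees with the standard ANOVA, the conclusion about price differences does not hinge on the equal-variance assumption.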
Step 7: Correlation Analysis
correlation_matrix <- cor(combined_data[, 2:4], use = "complete.obs") # Exclude missing values
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
## Avg_SFR Avg_TH Avg_Condo
## Avg_SFR 1.0000000 0.7563963 0.7844563
## Avg_TH 0.7563963 1.0000000 0.5705040
## Avg_Condo 0.7844563 0.5705040 1.0000000
Interpretation: The correlation matrix displays the pairwise Pearson correlation coefficients between the average monthly prices of single-family homes, townhomes, and condos.
(1) Avg_SFR and Avg_TH: Correlation coefficient = 0.756. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of townhomes also tends to increase, and vice versa.
(2) Avg_SFR and Avg_Condo: Correlation coefficient = 0.784. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of condos also tends to increase, and vice versa.
(3) Avg_TH and Avg_Condo: Correlation coefficient = 0.571. This indicates a moderate positive correlation, showing a weaker but still meaningful relationship between townhome and condo prices.
The high correlations suggest that the average prices of these three housing types are interrelated.
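With only 12 monthly observations, it can be worth checking whether each coefficient is distinguishable from zero. A minimal sketch of that check, taking the three coefficients as reported in the interpretation above and assuming n = 12 (one observation per month). It applies the standard t-transform of a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The pair labels are illustrative names, not identifiers from the original code.

```r
# Pearson coefficients as reported in the interpretation above.
r <- c(SFR_TH = 0.7563963, SFR_Condo = 0.7844563, TH_Condo = 0.5705040)
n <- 12  # assumption: one observation per month in 2023

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(data.frame(r, t_stat, p_value), 4))
```

The same test can be run directly on the data with cor.test(), for example cor.test(combined_data$Avg_SFR, combined_data$Avg_TH), which also reports a confidence interval for the coefficient.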
Conclusion
The ANOVA test showed that there are significant differences in average prices across the three types of homes: single-family homes, townhomes, and condos. The average price for single-family homes (SFR) was the highest at $2,065,312, followed by townhomes ($1,246,097), and condos ($852,054). These differences were statistically significant, suggesting that property type plays a key role in determining housing affordability.
Pairwise t-tests further confirmed that the prices of single-family homes, townhomes, and condos are significantly different from each other.
The correlation matrix shows moderate to strong positive correlations between house types. This suggests that prices for single-family homes, townhouses, and condos tend to move together, although each housing type has its own price dynamics. Correlation alone does not show that one type's prices drive another's.
Recommendation
Individuals
(1) Buyers with limited funds can purchase a condo.
(2) Buyers with a more stable income who need more space can choose a townhouse.
(3) Buyers with a high income and the ability to spend more can choose a single-family house.
Developers
(1) Condos can be built in higher-density areas such as Downtown.
(2) Townhouses can be built closer to tech job centers such as Santa Clara, Mountain View, and Sunnyvale.
(3) Single-family houses can be built in suburban areas where there is more space, such as South San Jose, Gilroy, and Morgan Hill.