**Understanding 2023 Housing Market Trends**

Project Background

This project focuses on housing price trends in Santa Clara County in 2023, with a detailed analysis of the average prices of three house types: single-family homes (SFR), townhouses (TH), and condos (Condo). Santa Clara County's real estate market is known for its high demand, which makes it critical to understand how price dynamics evolve for different house types.

Problem Statement

Price trends in Santa Clara County differ by house type. The main question is whether there are significant price differences and correlations among single-family homes, townhouses, and condos, and how these price changes may affect the overall real estate market. Understanding these differences is critical for buyers and investors developing future housing strategies.

Objective

To determine whether there are significant differences in the average prices of single-family homes, townhouses, and condos in Santa Clara County for 2023.
The analysis uses a one-way ANOVA test, box plots, pairwise t-tests, the Shapiro-Wilk test, Levene's test, and correlation analysis.

Develop a Model & Assess Predictor Significance

Step 1: Install and Load R packages

# if (!require(car)) install.packages("car")  # provides leveneTest() for Levene's Test
# install.packages("readxl")
# install.packages("tidyr")
library(car)
## Loading required package: carData
library(carData)
library(readxl)
library(tidyr)
library(ggplot2)

Step 2: Import & summarize the data

combined_data <- read_excel('task8.xlsx')
str(combined_data)
## tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Month      : chr [1:12] "Jan" "Feb" "March" "April" ...
##  $ AVG $ SFR  : num [1:12] 1914174 1862336 2031186 2089503 2118960 ...
##  $ AVG $ TH   : num [1:12] 1198247 1155453 1216500 1280010 1310374 ...
##  $ AVG $ Condo: num [1:12] 776460 775791 883392 893087 836370 ...
head(combined_data)
## # A tibble: 6 × 4
##   Month `AVG $ SFR` `AVG $ TH` `AVG $ Condo`
##   <chr>       <dbl>      <dbl>         <dbl>
## 1 Jan       1914174    1198247        776460
## 2 Feb       1862336    1155453        775791
## 3 March     2031186    1216500        883392
## 4 April     2089503    1280010        893087
## 5 May       2118960    1310374        836370
## 6 June      2162057    1282466        927731
colnames(combined_data) <- c("Month", "Avg_SFR", "Avg_TH", "Avg_Condo")  # same order as the source columns: SFR, TH, Condo
summary(combined_data)
##     Month              Avg_SFR            Avg_TH          Avg_Condo     
##  Length:12          Min.   :1862336   Min.   :1155453   Min.   :775791  
##  Class :character   1st Qu.:2002821   1st Qu.:1211937   1st Qu.:827934  
##  Mode  :character   Median :2087018   Median :1250368   Median :861735  
##                     Mean   :2065312   Mean   :1246097   Mean   :852054  
##                     3rd Qu.:2154952   3rd Qu.:1285062   3rd Qu.:884279  
##                     Max.   :2217444   Max.   :1316528   Max.   :927731
Data Description: A description of the price variables is presented in the table below.
Variable       |Definition
---------------|-------------
1. Avg_SFR     |Single-family homes average price per month in 2023
2. Avg_TH      |Townhomes average price per month in 2023
3. Avg_Condo   |Condos average price per month in 2023

Step 3: Transform Data to Long Format

# Transform data to long format
combined_long <- combined_data %>%
  pivot_longer(cols = c("Avg_SFR", "Avg_TH", "Avg_Condo"), 
               names_to = "Type", 
               values_to = "Avg_Price")

# Check the structure of the transformed data
str(combined_long)
## tibble [36 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Month    : chr [1:36] "Jan" "Jan" "Jan" "Feb" ...
##  $ Type     : chr [1:36] "Avg_SFR" "Avg_TH" "Avg_Condo" "Avg_SFR" ...
##  $ Avg_Price: num [1:36] 1914174 1198247 776460 1862336 1155453 ...
# Ensure "Type" is treated as a factor
combined_long$Type <- as.factor(combined_long$Type)

Step 4: Data visualization

# Boxplot to compare price distribution
ggplot(combined_long, aes(x = Type, y = Avg_Price, fill = Type)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Housing Type in 2023", 
       x = "Housing Type", 
       y = "Average Price") +
  theme_minimal()

Interpretation: The box plot visualizes the distribution of average prices across the three housing types in 2023.
(1) Avg_SFR: The highest median price among the three housing types. A wider IQR indicates more variability in prices compared to the other categories.
(2) Avg_TH: The second-highest median price. A smaller IQR compared to Avg_SFR, suggesting less price variability.
(3) Avg_Condo: The lowest median price and a relatively small IQR, indicating consistent pricing within this category.
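
The IQR comparison can be checked numerically. The sketch below is only an illustration: it takes the first and third quartiles printed by summary() in Step 2 and assigns each column to a housing type by its price level in the raw data (townhouse averages near $1.2M, condo averages near $0.8M). The data frame name `quartiles` is hypothetical.

```r
# Quartiles copied from the summary() output in Step 2.
# Columns are matched to housing types by price level in the raw data
# (townhouses near $1.2M, condos near $0.8M).
quartiles <- data.frame(
  type = c("SFR", "TH", "Condo"),
  q1   = c(2002821, 1211937, 827934),
  q3   = c(2154952, 1285062, 884279)
)

# Interquartile range: spread of the middle 50% of the monthly averages
quartiles$iqr <- quartiles$q3 - quartiles$q1

print(quartiles)
```

The single-family IQR comes out at roughly twice the townhouse IQR, which matches the wider box in the plot.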

Step 5: Perform Statistical Analysis

5.1 One-Way ANOVA: Testing Differences Between Housing Types
anova_result <- aov(Avg_Price ~ Type, data = combined_long)
summary(anova_result)
##             Df    Sum Sq   Mean Sq F value Pr(>F)    
## Type         2 9.194e+12 4.597e+12   778.4 <2e-16 ***
## Residuals   33 1.949e+11 5.906e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: With F = 778.4 and p < 2e-16, there are significant differences among the average prices of the three housing types.
5.2 Pairwise Comparisons (If ANOVA is Significant)
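
To show where the F value comes from, the following sketch rebuilds it from the sums of squares and degrees of freedom printed in the ANOVA table. The variable names are illustrative. Because the inputs are the rounded figures from the table, the result matches the reported value only to rounding precision.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_type  <- 9.194e12  # Sum Sq for Type
ss_resid <- 1.949e11  # Sum Sq for Residuals
df_type  <- 2         # 3 housing types - 1
df_resid <- 33        # 36 observations - 3 housing types

# Mean squares are sums of squares divided by their degrees of freedom
ms_type  <- ss_type / df_type
ms_resid <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_value <- ms_type / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_type, df_resid, lower.tail = FALSE)

print(round(f_value, 1))
print(p_value)
```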
# If ANOVA is significant, perform pairwise comparisons
if (summary(anova_result)[[1]]$`Pr(>F)`[1] < 0.05) {
  print("ANOVA significant; performing pairwise comparisons...")
  
  pairwise_result <- pairwise.t.test(combined_long$Avg_Price, combined_long$Type, p.adjust.method = "bonferroni")
  
  print(pairwise_result)
}
## [1] "ANOVA significant; performing pairwise comparisons..."
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  combined_long$Avg_Price and combined_long$Type 
## 
##         Avg_Condo Avg_SFR
## Avg_SFR < 2e-16   -      
## Avg_TH  1.2e-13   < 2e-16
## 
## P value adjustment method: bonferroni
Interpretation: A pairwise t-test with Bonferroni correction was performed to control the Type I error rate, which is inflated by multiple comparisons.
In all three pairwise comparisons the adjusted p-value is below 0.05, so every pair of housing types differs significantly in average price.
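
The Bonferroni correction multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below illustrates this on made-up p-values (the numbers in `raw_p` are hypothetical and are not results from this analysis), and checks the manual calculation against `p.adjust()`.

```r
# Hypothetical unadjusted p-values for three pairwise comparisons
# (illustrative only, not taken from the housing data)
raw_p <- c(0.010, 0.020, 0.500)
m <- length(raw_p)  # number of comparisons

# Manual Bonferroni: multiply by the number of comparisons, cap at 1
manual <- pmin(1, raw_p * m)

# Built-in equivalent from the stats package
builtin <- p.adjust(raw_p, method = "bonferroni")

print(builtin)
stopifnot(isTRUE(all.equal(manual, builtin)))
```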

Step 6: Check Assumptions of ANOVA

6.1 Residual Normality: Shapiro-Wilk Test
shapiro_test <- shapiro.test(residuals(anova_result))
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova_result)
## W = 0.97851, p-value = 0.695
if (shapiro_test$p.value < 0.05) {
  print("Residuals are NOT normally distributed (p < 0.05). Consider transformations or non-parametric methods.")
} else {
  print("Residuals are normally distributed (p >= 0.05).")
}
## [1] "Residuals are normally distributed (p >= 0.05)."
Interpretation: The Shapiro-Wilk test assesses whether the residuals from the ANOVA model are normally distributed.
Since the p-value = 0.695 > 0.05, the residuals do not significantly deviate from normality, so no data transformation or alternative method is needed on this account.
6.2 Homogeneity of Variance: Levene’s Test
levene_test <- leveneTest(Avg_Price ~ Type, data = combined_long)

print("Levene's Test for Homogeneity of Variances:")
## [1] "Levene's Test for Homogeneity of Variances:"
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  2  4.4973 0.01874 *
##       33                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
if (levene_test$`Pr(>F)`[1] < 0.05) {
  print("Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations.")
} else {
  print("Variance is equal across groups (p >= 0.05). ANOVA assumption of equal variances is met.")
}
## [1] "Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations."
Interpretation: Levene's test checks whether the variances are equal across the housing types.
Since the p-value = 0.01874 < 0.05, the variances differ significantly across the groups, so the equal-variance assumption of the standard ANOVA is not met.
Consider alternatives such as Welch's ANOVA or transformations.
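
A minimal sketch of the Welch alternative follows. It assumes only the six months visible in the head() output of Step 2, because the remaining months are not printed in this report, so its numbers will not match a full-year run. The object names `prices`, `prices_long`, `welch_result`, and `pairwise_welch` are illustrative. On the full data set the same two calls would be applied to `combined_long`.

```r
# First six months of 2023, copied from the head() output in Step 2.
# The other six months are not shown in the text, so this subset is
# illustrative only and will not reproduce full-year results.
prices <- data.frame(
  Month     = c("Jan", "Feb", "March", "April", "May", "June"),
  Avg_SFR   = c(1914174, 1862336, 2031186, 2089503, 2118960, 2162057),
  Avg_TH    = c(1198247, 1155453, 1216500, 1280010, 1310374, 1282466),
  Avg_Condo = c(776460, 775791, 883392, 893087, 836370, 927731)
)

# Reshape to long format with base R so the sketch needs no extra packages
prices_long <- data.frame(
  Type      = factor(rep(c("Avg_SFR", "Avg_TH", "Avg_Condo"), each = nrow(prices))),
  Avg_Price = c(prices$Avg_SFR, prices$Avg_TH, prices$Avg_Condo)
)

# Welch's one-way ANOVA: does not assume equal variances across groups
welch_result <- oneway.test(Avg_Price ~ Type, data = prices_long, var.equal = FALSE)
print(welch_result)

# Pairwise t-tests with separate (unpooled) variances and Bonferroni adjustment
pairwise_welch <- pairwise.t.test(prices_long$Avg_Price, prices_long$Type,
                                  p.adjust.method = "bonferroni", pool.sd = FALSE)
print(pairwise_welch)
```

Setting `var.equal = FALSE` and `pool.sd = FALSE` is what removes the equal-variance assumption that Levene's test rejected.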

Step 7: Correlation Analysis

correlation_matrix <- cor(combined_data[, 2:4], use = "complete.obs")  # Exclude missing values
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
print(correlation_matrix)
##             Avg_SFR    Avg_TH Avg_Condo
## Avg_SFR   1.0000000 0.7563963 0.7844563
## Avg_TH    0.7563963 1.0000000 0.5705040
## Avg_Condo 0.7844563 0.5705040 1.0000000
Interpretation: The correlation matrix displays the pairwise Pearson correlation coefficients between the average prices of single-family homes, townhomes, and condos.
(1) Avg_SFR and Avg_TH: Correlation coefficient = 0.756. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of townhomes also tends to increase, and vice versa.
(2) Avg_SFR and Avg_Condo: Correlation coefficient = 0.784. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of condos also tends to increase, and vice versa.
(3) Avg_TH and Avg_Condo: Correlation coefficient = 0.571. This indicates a moderate positive correlation, showing a weaker but still meaningful relationship between townhome and condo prices.
These positive correlations suggest that the average prices of the three housing types tend to move together.
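
Whether each coefficient is distinguishable from zero depends on the sample size. The sketch below converts the three reported coefficients into t statistics and two-sided p-values, assuming n = 12 monthly observations as shown by str() in Step 2. The vector names are illustrative. On the raw data, `cor.test()` performs the same test directly.

```r
# Pearson coefficients copied from the interpretation above
r <- c("SFR-TH" = 0.7563963, "SFR-Condo" = 0.7844563, "TH-Condo" = 0.5705040)
n <- 12  # one observation per month in 2023

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# Two-sided p-value from the t distribution
p_values <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(cbind(r, t_stat, p_values), 4))
```

With only 12 observations, a coefficient near 0.57 sits close to the conventional 5% threshold, so the townhome-condo relationship deserves a more cautious reading than the two correlations involving single-family homes.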

Conclusion

The ANOVA test showed that there are significant differences in average prices across the three types of homes: single-family homes, townhomes, and condos. The average price for single-family homes (SFR) was the highest at $2,065,312, followed by townhomes ($1,246,097), and condos ($852,054). These differences were statistically significant, suggesting that property type plays a key role in determining housing affordability.
Pairwise t-tests further confirmed that the prices of single-family homes, townhomes, and condos are significantly different from each other.
The correlation matrix shows moderate to strong positive correlations between house types. This suggests that prices for single-family homes, townhouses, and condos tend to move together, although each housing type has its own price dynamics. Correlation alone does not show that one type's prices drive another's.

Recommendation

Individuals
(1) Buyers with limited funds can purchase a condo.
(2) Buyers with a more stable income who need more space can choose a townhouse.
(3) Buyers with a high income and greater spending capacity can choose a single-family home.

Developers
(1) Condos can be built in higher-density areas such as Downtown.
(2) Townhouses can be built closer to tech jobs, in cities such as Santa Clara, Mountain View, and Sunnyvale.
(3) Single-family houses can be built in suburban areas where there is more space, such as South San Jose, Gilroy, and Morgan Hill.