Understanding 2023 Housing Market Trends

Project Background

Problem Statement

Objective

Develop a Model & Assess Predictor Significance

Step 1: Install and Load R packages

# if (!require(car)) install.packages("car")  # car provides leveneTest() for Levene's test
# install.packages("readxl")
# install.packages("tidyr")
# install.packages("ggplot2")
library(car)

## Warning: package 'car' was built under R version 4.4.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.2

library(readxl)

## Warning: package 'readxl' was built under R version 4.4.2

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.4.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

Step 2: Import & summarize the data

combined_data <- read_excel(file.choose())
str(combined_data)

## tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Month      : chr [1:12] "Jan" "Feb" "March" "April" ...
##  $ AVG $ SFR  : num [1:12] 1914174 1862336 2031186 2089503 2118960 ...
##  $ AVG $ TH   : num [1:12] 1198247 1155453 1216500 1280010 1310374 ...
##  $ AVG $ Condo: num [1:12] 776460 775791 883392 893087 836370 ...

head(combined_data)

## # A tibble: 6 × 4
##   Month `AVG $ SFR` `AVG $ TH` `AVG $ Condo`
##   <chr>       <dbl>      <dbl>         <dbl>
## 1 Jan       1914174    1198247        776460
## 2 Feb       1862336    1155453        775791
## 3 March     2031186    1216500        883392
## 4 April     2089503    1280010        893087
## 5 May       2118960    1310374        836370
## 6 June      2162057    1282466        927731

# Rename in the same order as the source columns: SFR, TH, Condo
colnames(combined_data) <- c("Month", "Avg_SFR", "Avg_TH", "Avg_Condo")

summary(combined_data)

##     Month              Avg_SFR            Avg_TH          Avg_Condo     
##  Length:12          Min.   :1862336   Min.   :1155453   Min.   :775791  
##  Class :character   1st Qu.:2002821   1st Qu.:1211937   1st Qu.:827934  
##  Mode  :character   Median :2087018   Median :1250368   Median :861735  
##                     Mean   :2065312   Mean   :1246097   Mean   :852054  
##                     3rd Qu.:2154952   3rd Qu.:1285062   3rd Qu.:884279  
##                     Max.   :2217444   Max.   :1316528   Max.   :927731

Data Description: The variables are defined in the table below.
Variable       |Definition
---------------|-------------
1. Avg_SFR     |Single-family homes average price per month in 2023
2. Avg_TH      |Townhomes average price per month in 2023
3. Avg_Condo   |Condos average price per month in 2023

Step 3: Transform Data to Long Format

# Transform data to long format
combined_long <- combined_data %>%
  pivot_longer(cols = c("Avg_SFR", "Avg_TH", "Avg_Condo"), 
               names_to = "Type", 
               values_to = "Avg_Price")

# Check the structure of the transformed data
str(combined_long)

## tibble [36 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Month    : chr [1:36] "Jan" "Jan" "Jan" "Feb" ...
##  $ Type     : chr [1:36] "Avg_SFR" "Avg_TH" "Avg_Condo" "Avg_SFR" ...
##  $ Avg_Price: num [1:36] 1914174 1198247 776460 1862336 1155453 ...

# Ensure "Type" is treated as a factor
combined_long$Type <- as.factor(combined_long$Type)

Step 4: Data visualization

# Boxplot to compare price distribution
ggplot(combined_long, aes(x = Type, y = Avg_Price, fill = Type)) +
  geom_boxplot() +
  labs(title = "Price Distribution by Housing Type in 2023", 
       x = "Housing Type", 
       y = "Average Price") +
  theme_minimal()

Interpretation: The boxplot shows the distribution of monthly average prices for the three housing types in 2023.
(1) Avg_SFR: The highest median price among the three housing types. Its wider IQR indicates more price variability than in the other categories.
(2) Avg_TH: The second-highest median price. Its IQR is smaller than that of Avg_SFR, suggesting less price variability.
(3) Avg_Condo: The lowest median price, with a relatively small IQR that indicates consistent pricing within this category.
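
The medians and IQRs read off the boxplot can also be computed directly. The sketch below is a minimal illustration that assumes only the six months printed by head() in Step 2 (January to June), so its numbers will differ from those of the full 12-month data; the names sample_wide and spread_summary are introduced here for illustration.

```r
# First six months of 2023, copied from the head() output in Step 2.
# This is NOT the full 12-month data set.
sample_wide <- data.frame(
  Avg_SFR   = c(1914174, 1862336, 2031186, 2089503, 2118960, 2162057),
  Avg_TH    = c(1198247, 1155453, 1216500, 1280010, 1310374, 1282466),
  Avg_Condo = c(776460, 775791, 883392, 893087, 836370, 927731)
)

# Median and interquartile range for each housing type
spread_summary <- data.frame(
  Median = sapply(sample_wide, median),
  IQR    = sapply(sample_wide, IQR)
)

print(spread_summary)
```

Running the same two sapply() calls on combined_data[, 2:4] would give the values for the full year.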

Step 5: Perform Statistical Analysis

5.1 One-Way ANOVA: Testing Differences Between Housing Types

anova_result <- aov(Avg_Price ~ Type, data = combined_long)
summary(anova_result)

##             Df    Sum Sq   Mean Sq F value Pr(>F)    
## Type         2 9.194e+12 4.597e+12   778.4 <2e-16 ***
## Residuals   33 1.949e+11 5.906e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: With F = 778.4 and p < 2e-16, the average price differs significantly across the three housing types; at least one type has a different mean from the others.
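
The size of the effect can be gauged from the same ANOVA table. The sketch below computes eta-squared (the share of total variation explained by housing type) and recomputes F from the sums of squares as a consistency check. Eta-squared is not part of the original analysis; it is added here only as an illustration, using the rounded values printed in the table.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_type  <- 9.194e12  # Sum Sq for Type
ss_resid <- 1.949e11  # Sum Sq for Residuals
df_type  <- 2
df_resid <- 33

# Eta-squared: proportion of total variation explained by housing type
eta_sq <- ss_type / (ss_type + ss_resid)

# Recompute F from the mean squares as a consistency check
f_check <- (ss_type / df_type) / (ss_resid / df_resid)

print(round(c(eta_squared = eta_sq, F = f_check), 3))
```

Under these rounded inputs, housing type accounts for roughly 98% of the variation in monthly average price, and the recomputed F agrees with the reported 778.4 up to rounding.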

5.2 Pairwise Comparisons (If ANOVA is Significant)

# If ANOVA is significant, perform pairwise comparisons
if (summary(anova_result)[[1]]$`Pr(>F)`[1] < 0.05) {
  print("ANOVA significant; performing pairwise comparisons...")
  
  pairwise_result <- pairwise.t.test(combined_long$Avg_Price, combined_long$Type, p.adjust.method = "bonferroni")
  
  print(pairwise_result)
}

## [1] "ANOVA significant; performing pairwise comparisons..."
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  combined_long$Avg_Price and combined_long$Type 
## 
##         Avg_Condo Avg_SFR
## Avg_SFR < 2e-16   -      
## Avg_TH  1.2e-13   < 2e-16
## 
## P value adjustment method: bonferroni

Interpretation: A pairwise t-test with Bonferroni correction was performed to control the Type I error rate, which is inflated by multiple comparisons.
All three pairwise comparisons have adjusted p-values below 0.05, so every pair of housing types differs significantly in average price.
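
To make the Bonferroni adjustment concrete: each raw p-value is multiplied by the number of comparisons and capped at 1. The sketch below uses hypothetical raw p-values (they are not taken from this data set) and checks p.adjust() against the manual calculation.

```r
# Hypothetical raw p-values for three pairwise comparisons (illustration only)
raw_p <- c(0.01, 0.02, 0.5)

# Bonferroni adjustment as applied by pairwise.t.test()
adj_p <- p.adjust(raw_p, method = "bonferroni")

# The same adjustment by hand: multiply by the number of tests, cap at 1
manual <- pmin(1, raw_p * length(raw_p))

print(adj_p)
print(manual)
```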

Step 6: Check Assumptions of ANOVA

6.1 Residual Normality: Shapiro-Wilk Test

shapiro_test <- shapiro.test(residuals(anova_result))
print(shapiro_test)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(anova_result)
## W = 0.97851, p-value = 0.695

if (shapiro_test$p.value < 0.05) {
  print("Residuals are NOT normally distributed (p < 0.05). Consider transformations or non-parametric methods.")
} else {
  print("Residuals are normally distributed (p >= 0.05).")
}

## [1] "Residuals are normally distributed (p >= 0.05)."

Interpretation: The Shapiro-Wilk test assesses whether the residuals from the ANOVA model are normally distributed.
Because the p-value (0.695) is greater than 0.05, the residuals do not deviate significantly from normality, so the normality assumption gives no reason to transform the data or switch to another method.
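
A normal Q-Q plot is a common visual complement to the Shapiro-Wilk test. The sketch below is a minimal example that assumes only the six months shown by head() in Step 2, so the plot will not match one drawn from the full data; sample_long and fit are names introduced here for illustration.

```r
# First six months of 2023 from the head() output (not the full data set)
sample_long <- stack(data.frame(
  Avg_SFR   = c(1914174, 1862336, 2031186, 2089503, 2118960, 2162057),
  Avg_TH    = c(1198247, 1155453, 1216500, 1280010, 1310374, 1282466),
  Avg_Condo = c(776460, 775791, 883392, 893087, 836370, 927731)
))
names(sample_long) <- c("Avg_Price", "Type")

# Fit the same one-way model and extract its residuals
fit <- aov(Avg_Price ~ Type, data = sample_long)
res <- residuals(fit)

# Points close to the reference line are consistent with normal residuals
qqnorm(res, main = "Normal Q-Q Plot of ANOVA Residuals")
qqline(res)
```

For the full analysis, the same two plotting calls can be applied to residuals(anova_result).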

6.2 Homogeneity of Variance: Levene’s Test

levene_test <- leveneTest(Avg_Price ~ Type, data = combined_long)

print("Levene's Test for Homogeneity of Variances:")

## [1] "Levene's Test for Homogeneity of Variances:"

print(levene_test)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  2  4.4973 0.01874 *
##       33                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

if (levene_test$`Pr(>F)`[1] < 0.05) {
  print("Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations.")
} else {
  print("Variance is equal across groups (p >= 0.05). ANOVA assumption of equal variances is met.")
}

## [1] "Variance is NOT equal across groups (p < 0.05). Consider alternatives like Welch's ANOVA or transformations."

Interpretation: Levene's test checks whether the variances of the housing types are equal.
Because the p-value (0.01874) is less than 0.05, the variances differ significantly across the groups, so the equal-variance assumption of the standard ANOVA is not met.
Consider alternatives such as Welch's ANOVA or a transformation of the prices.
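
One way to follow up on this result is Welch's ANOVA, which base R provides through oneway.test() with var.equal = FALSE, together with pairwise t-tests that do not pool the standard deviation. The sketch below assumes only the six months printed by head() in Step 2, so its statistics will differ from those of the full data; it is an illustration of the approach, not a replacement for the analysis above.

```r
# First six months of 2023 from the head() output (not the full data set)
sample_long <- stack(data.frame(
  Avg_SFR   = c(1914174, 1862336, 2031186, 2089503, 2118960, 2162057),
  Avg_TH    = c(1198247, 1155453, 1216500, 1280010, 1310374, 1282466),
  Avg_Condo = c(776460, 775791, 883392, 893087, 836370, 927731)
))
names(sample_long) <- c("Avg_Price", "Type")

# Welch's one-way ANOVA: does not assume equal variances across groups
welch_result <- oneway.test(Avg_Price ~ Type, data = sample_long,
                            var.equal = FALSE)
print(welch_result)

# Pairwise Welch-type t-tests (separate variances), Bonferroni-adjusted
welch_pairs <- pairwise.t.test(sample_long$Avg_Price, sample_long$Type,
                               p.adjust.method = "bonferroni",
                               pool.sd = FALSE)
print(welch_pairs)
```

On the full data, the same calls would take combined_long in place of sample_long.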

Step 7: Correlation Analysis

correlation_matrix <- cor(combined_data[, 2:4], use = "complete.obs")  # Exclude missing values
print("Correlation Matrix:")

## [1] "Correlation Matrix:"

print(correlation_matrix)

##             Avg_SFR    Avg_TH Avg_Condo
## Avg_SFR   1.0000000 0.7563963 0.7844563
## Avg_TH    0.7563963 1.0000000 0.5705040
## Avg_Condo 0.7844563 0.5705040 1.0000000

Interpretation: The correlation matrix displays the pairwise Pearson correlation coefficients between the average prices of single-family homes, townhomes, and condos.
(1) Avg_SFR and Avg_TH: Correlation coefficient = 0.756. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of townhomes also tends to increase, and vice versa.
(2) Avg_SFR and Avg_Condo: Correlation coefficient = 0.784. This indicates a strong positive correlation, suggesting that as the average price of single-family homes increases, the average price of condos also tends to increase, and vice versa.
(3) Avg_TH and Avg_Condo: Correlation coefficient = 0.571. This indicates a moderate positive correlation, showing a weaker but still meaningful relationship between townhome and condo prices.
These positive correlations suggest that the average prices of the three housing types tend to move together over the year.
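
With only 12 monthly observations, it can help to check whether each coefficient is statistically distinguishable from zero. The sketch below applies the standard t-test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, to the three coefficients reported above. This test is not part of the original analysis, and the vector names are introduced here for illustration.

```r
# Coefficients as reported in the interpretation above; n = 12 months
r <- c(SFR_TH = 0.7563963, SFR_Condo = 0.7844563, TH_Condo = 0.5705040)
n <- 12

# t statistic and two-sided p-value for H0: true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(data.frame(r = r, t = t_stat, p = p_val), 4))
```

Under this test, the two correlations involving single-family homes are significant at the 0.05 level, while the townhome-condo correlation falls just short of it, which is consistent with describing it as only moderate. With the full data, cor.test() on each pair of columns would give the same p-values along with confidence intervals.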

Conclusion & Recommendation
