Final Project

Introduction

Data Cleaning and EDA

In this report I analyze several factors in the Redfin Health dataset from March 2024. The dataset comprises 48 variables and 3015 observations. It combines Maine and New Hampshire house prices and various home characteristics with a numerically converted BRFSS dataset that we have looked at previously.

I began with a preliminary exploratory data analysis, which gave insight into the data in its raw form. I focused mainly on housing factors that influence a home's price. I took the numerical columns of interest and generated correlation tables, matrices, and other plots to get a sense of which variables were most strongly related to price. The columns that stood out with higher correlation coefficients were x_square_feet and the number of baths in a home. Other variables I wanted to include in my analysis were number of beds, lot size, and days on the market. I also wanted to explore two categorical variables: property type and state or province. After creating a new data frame with the variables of interest, I cleaned the numerical columns by removing NAs and outliers (values more than 1.5 times the interquartile range below the first quartile or above the third quartile).

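# Packages used by the code in this report
library(dplyr)         # select(), filter(), %>%
library(janitor)       # clean_names()
library(DataExplorer)  # plot_missing(), plot_histogram(), plot_boxplot(), ...

rf <- read.csv("redfin_health_March2024.csv")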
rf <- clean_names(rf)

# EDA with skimr and DataExplorer

rf_subset <- rf %>%
  select_if(is.numeric)

rf_subset_house <- rf_subset[2:10]
rf_subset_house_NA <- na.omit(rf_subset_house)

plot_missing(rf_subset_house)

plot_histogram(rf_subset_house)

plot_boxplot(rf_subset_house, by="price")

## Warning: Removed 7618 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

plot_correlation(rf_subset_house_NA)

plot_density(rf_subset_house)

rf_house <- rf |> dplyr::select(price, beds, baths, lot_size, x_square_feet, days_on_market, property_type, state_or_province)

# Removing outliers for Price

variable <- rf_house$price
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(price >= lower_bound & price <= upper_bound)

# removing outliers for x_square_feet

variable <- rf_house$x_square_feet
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(x_square_feet >= lower_bound & x_square_feet <= upper_bound)

# Removing outliers for number of baths

variable <- rf_house$baths
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(baths >= lower_bound & baths <= upper_bound)


# Removing outliers for number of beds

variable <- rf_house$beds
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(beds >= lower_bound & beds <= upper_bound)

# Removing outliers for lot size

variable <- rf_house$lot_size
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(lot_size >= lower_bound & lot_size <= upper_bound)

# Removing outliers for days on the market

variable <- rf_house$days_on_market
variable <- na.omit(variable)
# Calculate the IQR
Q1 <- quantile(variable, 0.25)
Q3 <- quantile(variable, 0.75)
IQR <- Q3 - Q1

# Define the upper and lower bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound <- as.integer(lower_bound)
upper_bound <- as.integer(upper_bound)
rf_house <- rf_house |> filter(days_on_market >= lower_bound & days_on_market <= upper_bound)
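The six blocks above repeat the same 1.5 * IQR rule for each column. As a minimal sketch, the rule could be wrapped in one function and applied column by column. The function name `remove_iqr_outliers` and the toy data frame are hypothetical and are not part of the original analysis. Unlike the code above, this version does not truncate the bounds with `as.integer()`, so results could differ slightly for values close to a bound.

```r
# Hypothetical helper: keep rows whose value in `col` lies within
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]; rows with NA in `col` are dropped,
# which mirrors how filter() treats NA comparisons.
remove_iqr_outliers <- function(df, col) {
  x <- df[[col]]
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  keep <- !is.na(x) & x >= q[[1]] - fence & x <= q[[2]] + fence
  df[keep, , drop = FALSE]
}

# Toy data (made up for illustration): one extreme price and one missing price
toy <- data.frame(
  price = c(100, 110, 120, 130, 140, 5000, NA),
  beds  = c(1, 2, 3, 4, 5, 3, 2)
)

# Apply the rule sequentially, recomputing the bounds on the rows that remain,
# in the same way the blocks above filter rf_house one column at a time
cleaned <- toy
for (col in c("price", "beds")) {
  cleaned <- remove_iqr_outliers(cleaned, col)
}
cleaned
```

Because each pass recomputes the quartiles on the already-filtered rows, the order of the columns can affect which rows survive, both here and in the original code.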

Analysis

In the analysis below I look for insights into the data and the variables of interest. I started by posing two questions to get a better sense of what may be significant in predicting home price in this dataset. The first question was: is there a difference in average price between Multi-Family (2-4 Unit) homes and Mobile/Manufactured homes? The second was: are home prices on average higher in Maine than in New Hampshire?

Problem 1

I used a comparison of means with hypothesis testing to answer the two questions stated above. For the first question, whether average price differs between Multi-Family (2-4 Unit) homes and Mobile/Manufactured homes, the null hypothesis was mu1 = mu2 and the alternative hypothesis (the claim) was mu1 NEQ mu2. A two-tailed Welch t-test at an alpha of 0.05 gave a p-value of 1.43e-13. Because this is far below alpha, we reject the null hypothesis and conclude there is enough evidence to support the claim that the average price of Multi-Family (2-4 Unit) homes differs from that of Mobile/Manufactured homes. The sample means were about $401,542 and $179,485, respectively.

For question two I wanted to see whether home prices were on average higher in Maine than in New Hampshire. Here the null hypothesis was mu1 = mu2 and the alternative hypothesis, also the claim, was mu1 > mu2, with Maine as group 1. Following a similar procedure but with a right-tailed test, we obtain a p-value of approximately 1 at an alpha of 0.05. Since the p-value is greater than alpha, we do not reject the null hypothesis and conclude there is not enough evidence to say home prices are higher in Maine than in New Hampshire. In fact, the sample mean was lower in Maine (about $397,975) than in New Hampshire (about $553,621).

# Question 1
# Is there a difference in the average price between Multi-Family (2-4 Unit) Homes and Mobile/Manufactured Homes for property types?

# Hypothesis test
# H(o): mu1=mu2
# H(a): mu1!=mu2 (claim)

family <- rf_house |> filter(property_type=="Multi-Family (2-4 Unit)")
mobile <- rf_house |> filter(property_type=="Mobile/Manufactured Home")
family <- na.omit(family$price)
mobile <- na.omit(mobile$price)

mu <- 0
alpha <- .05

# Two tailed test because the claim is about there being a difference
CVusingt.test <- t.test(family, mobile,
                        alternative="two.sided",
                        mu=mu,
                        conf.level=1-alpha)
CVusingt.test

## 
##  Welch Two Sample t-test
## 
## data:  family and mobile
## t = 8.2122, df = 137.23, p-value = 1.43e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  168588.6 275525.4
## sample estimates:
## mean of x mean of y 
##  401542.2  179485.2

attributes(CVusingt.test)

## $names
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
## 
## $class
## [1] "htest"

CVusingt.test$statistic   # the t test statistic

##        t 
## 8.212249

CVusingt.test$parameter   # the degrees of freedom

##       df 
## 137.2318

CVusingt.test$p.value     # the p-value

## [1] 1.429752e-13

CVusingt.test$conf.int    # the confidence interval (2 numbers)

## [1] 168588.6 275525.4
## attr(,"conf.level")
## [1] 0.95

CVusingt.test$estimate    # the estimated mean

## mean of x mean of y 
##  401542.2  179485.2

CVusingt.test$null.value  # the specified hypothesized mean

## difference in means 
##                   0

CVusingt.test$stderr      # standard error of the difference in means

## [1] 27039.73

CVusingt.test$alternative # which kind of test (<, > or =)

## [1] "two.sided"

# P-value is near zero at 1.43e-13, less than our alpha of 0.05, so we reject the null hypothesis and conclude there is enough evidence to support the claim that there is a difference between the average price of Multi-Family (2-4 Unit) homes and Mobile/Manufactured homes.

# Question 2
# Are home prices on average higher in Maine than in New Hampshire?


# Hypothesis test
# H(o): mu1=mu2
# H(a): mu1>mu2 (claim)

ME <- rf_house |> filter(state_or_province=="ME")
NH <- rf_house |> filter(state_or_province=="NH")
ME <- na.omit(ME$price)
NH <- na.omit(NH$price)

mu <- 0
alpha <- .05

# Right-tailed test because the claim is that the Maine mean is greater
CVusingt.test <- t.test(ME, NH,
                        alternative="greater",
                        mu=mu,
                        conf.level=1-alpha)
CVusingt.test

## 
##  Welch Two Sample t-test
## 
## data:  ME and NH
## t = -8.6063, df = 566.66, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -185441.9       Inf
## sample estimates:
## mean of x mean of y 
##  397975.0  553620.8

attributes(CVusingt.test)

## $names
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  
## 
## $class
## [1] "htest"

CVusingt.test$statistic   # the t test statistic

##         t 
## -8.606302

CVusingt.test$parameter   # the degrees of freedom

##       df 
## 566.6634

CVusingt.test$p.value     # the p-value

## [1] 1

CVusingt.test$conf.int    # the confidence interval (2 numbers)

## [1] -185441.9       Inf
## attr(,"conf.level")
## [1] 0.95

CVusingt.test$estimate    # the estimated mean

## mean of x mean of y 
##  397975.0  553620.8

CVusingt.test$null.value  # the specified hypothesized mean

## difference in means 
##                   0

CVusingt.test$stderr      # standard error of the difference in means

## [1] 18085.1

CVusingt.test$alternative # which kind of test (<, > or =)

## [1] "greater"

# In this test we obtain a P-value of approximately 1 at an alpha of 0.05. Since the P-value is greater than alpha we do not reject the null hypothesis and conclude there is not enough evidence to say home prices are higher in Maine than in New Hampshire.
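As a cross-check on the two tests, the p-values can be recomputed from the t statistic and degrees of freedom printed in the outputs above. This is a minimal sketch rather than part of the original analysis. The inputs are the rounded values as printed, so the results should agree with the reported p-values only up to rounding. The variable names are introduced here for illustration.

```r
# Values copied from the Welch t-test outputs above (rounded as printed)
t_q1 <- 8.2122;  df_q1 <- 137.23   # Question 1: two-tailed test
t_q2 <- -8.6063; df_q2 <- 566.66   # Question 2: right-tailed test
alpha <- 0.05

# Two-tailed p-value: probability in both tails beyond |t|
p_q1 <- 2 * pt(abs(t_q1), df = df_q1, lower.tail = FALSE)

# Right-tailed p-value: probability above t. A large negative t puts
# almost all of the distribution to its right, so the p-value is near 1.
p_q2 <- pt(t_q2, df = df_q2, lower.tail = FALSE)

p_q1 < alpha  # TRUE means reject the null hypothesis for Question 1
p_q2 < alpha  # FALSE means do not reject the null hypothesis for Question 2
```

The second calculation shows why the right-tailed test cannot support the claim here: the Maine sample mean is below the New Hampshire sample mean, so the test statistic falls in the tail opposite to the one the claim describes.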

Problem 2

In this problem I fit a linear regression for predicting home prices. In my exploratory analysis, x_square_feet had one of the highest correlation coefficients with price, so I selected it as the independent variable. Question 1 of Problem 1 showed a significant difference in price between Multi-Family (2-4 Unit) homes and Mobile/Manufactured homes, so I explored this further by predicting price from x_square_feet and property type together. I subset the data to just these two property types and converted property type to a factor, which R encodes as a dummy variable in the model. The fitted model has a common slope for x_square_feet and a separate intercept for each property type: the Mobile/Manufactured coefficient (about -216,784) shifts the intercept down relative to the Multi-Family baseline. I then plotted the data with the two groups differentiated by color: black for multi-family homes and red for mobile homes. The graph shows that most multi-family prices sit above mobile home prices, so the multi-family regression line is the higher of the two parallel lines.

rf_house_lm <- rf_house |> dplyr::select(property_type, price, x_square_feet)

# Simple linear regression

rf_house_family_mobile <- rf_house_lm |> filter(property_type=="Multi-Family (2-4 Unit)" | property_type=="Mobile/Manufactured Home")

# Convert property_type to factor with levels assigned as numbers
rf_house_family_mobile$property_type_factor <- factor(rf_house_family_mobile$property_type, levels = unique(rf_house_family_mobile$property_type))

# Now, you can perform a simple linear regression between house price and x_square_feet
lm_model <- lm(price ~ x_square_feet + property_type_factor, data = rf_house_family_mobile)
lm_model

## 
## Call:
## lm(formula = price ~ x_square_feet + property_type_factor, data = rf_house_family_mobile)
## 
## Coefficients:
##                                  (Intercept)  
##                                       153951  
##                                x_square_feet  
##                                         1477  
## property_type_factorMobile/Manufactured Home  
##                                      -216784

summary(lm_model)

## 
## Call:
## lm(formula = price ~ x_square_feet + property_type_factor, data = rf_house_family_mobile)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -566145  -85131  -27690   52260  598208 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   153951.2    26900.8   5.723
## x_square_feet                                   1477.3      133.7  11.053
## property_type_factorMobile/Manufactured Home -216783.6    31008.5  -6.991
##                                              Pr(>|t|)    
## (Intercept)                                  6.11e-08 ***
## x_square_feet                                 < 2e-16 ***
## property_type_factorMobile/Manufactured Home 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156200 on 140 degrees of freedom
## Multiple R-squared:  0.5534, Adjusted R-squared:  0.547 
## F-statistic: 86.73 on 2 and 140 DF,  p-value: < 2.2e-16

# plot again and differentiate the property types by color
plot(rf_house_family_mobile$price ~ rf_house_family_mobile$x_square_feet, 
     main="House Price vs (x) Square Feet",
     xlab="Square Feet",
     ylab="Price",
     col=rf_house_family_mobile$property_type_factor
)
legend("topright", 
       title = "Homes",
       legend = levels(rf_house_family_mobile$property_type_factor),
       text.col = c(1:2),
       cex=0.5
)

#plot the lines
abline(a=lm_model$coefficients[1],b=lm_model$coefficients[2],col=1)
abline(a=lm_model$coefficients[1]+lm_model$coefficients[3],b=lm_model$coefficients[2],col=2)
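The two `abline()` calls draw lines with the same slope and different intercepts. A small worked sketch of that arithmetic is below, using the coefficients copied from the `summary(lm_model)` output. The helper `predict_price` and the example value of 200 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(lm_model) output above
b0       <- 153951.2    # intercept for the Multi-Family baseline
b_sqft   <- 1477.3      # common slope for x_square_feet
b_mobile <- -216783.6   # intercept shift for Mobile/Manufactured Home

# Hypothetical helper: predicted price for a given x_square_feet value
predict_price <- function(x, mobile = FALSE) {
  b0 + b_sqft * x + b_mobile * as.numeric(mobile)
}

# Intercept of the mobile-home line (the `a` passed to the second abline call)
intercept_mobile <- b0 + b_mobile

# At any x the two predictions differ by the same amount, because the
# model has no interaction term and the lines are therefore parallel
x_example <- 200  # arbitrary illustrative value
gap <- predict_price(x_example) - predict_price(x_example, mobile = TRUE)
gap
```

The constant gap equals the magnitude of the dummy-variable coefficient, which is how that coefficient is interpreted: the estimated price difference between the two property types at the same x_square_feet.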

Problem 3

In the last problem I created a multiple linear regression model for predicting house price in the cleaned data set. The independent variables were all numeric: beds, baths, lot_size, x_square_feet, and days_on_market. I first created a correlation table for the independent variables (days_on_market is not included in the table). Beds and baths had the highest correlation at 0.509, while lot_size and x_square_feet were both negatively correlated with beds.

I then fit the multiple linear regression model and wrote out the associated regression equation. Using the Pearson method on the observed and predicted prices, I obtained a multiple correlation coefficient R of 0.842, which corresponds to an R^2 of about 0.709. R^2 represents the proportion of the variance in the dependent variable that is explained by the independent variables in the regression model.

I then tested the significance of R with a hypothesis test using the correlation test. The null hypothesis was rho = 0 and the alternative hypothesis (claim) was rho NEQ 0. At an alpha of 0.05, Pearson's product-moment correlation test confirmed the R value of 0.842 and gave a 95% confidence interval of 0.824 to 0.859 for the population correlation. The p-value was less than 2.2e-16, well below alpha, so we reject the null hypothesis and conclude there is enough evidence to support the claim that R for the multiple regression model is significant.

Overall, the multiple linear regression model indicates that the chosen independent variables, taken together, are significantly related to house price. The coefficients describe the relationship between each feature and home price with the other features held fixed. The calculated R value indicates the strength of the linear relationship between the predicted and actual home prices, providing a measure of how well the model fits the data it was trained on.

rf_multi <- rf_house |> dplyr::select(price, beds, baths, lot_size, x_square_feet, days_on_market)

cor(rf_multi[,2:5])

##                      beds      baths    lot_size x_square_feet
## beds           1.00000000 0.50918479 -0.07539607   -0.24617173
## baths          0.50918479 1.00000000  0.12046388    0.06314063
## lot_size      -0.07539607 0.12046388  1.00000000    0.15114766
## x_square_feet -0.24617173 0.06314063  0.15114766    1.00000000

# create a multiple linear model predicting housing prices on the basis of the independent variables
lm(price~beds+baths+lot_size+x_square_feet+days_on_market, data=rf_multi)

## 
## Call:
## lm(formula = price ~ beds + baths + lot_size + x_square_feet + 
##     days_on_market, data = rf_multi)
## 
## Coefficients:
##    (Intercept)            beds           baths        lot_size   x_square_feet  
##     -3.257e+05       3.595e+04       1.500e+05       7.234e-01       1.395e+03  
## days_on_market  
##     -1.168e+02

# print the linear equation (fit the model once and reuse the rounded coefficients)
b <- round(coefficients(lm(price ~ beds + baths + lot_size + x_square_feet + days_on_market, data = rf_multi)), digits = 2)
cat("y' =", 
    b[1], "+",
    b[2], "x(beds) +", 
    b[3], "x(baths) +",
    b[4], "x(lot_size) +",
    b[5], "x(square_feet) +",
    b[6], "x(days_on_market)",
    "is the Linear Regression Equation.")

## y' = -325682.9 + 35947.25 x(beds) + 149952.5 x(baths) + 0.72 x(lot_size) + 1395.06 x(square_feet) + -116.76 x(days_on_market) is the Linear Regression Equation.

# Using the cor function
cat("R calculated automatically via cor() = ",
    round(cor(rf_multi$price, predict(lm(price ~ beds + baths + lot_size + x_square_feet + days_on_market, data = rf_multi)), method = "pearson"), digits = 3)
)

## R calculated automatically via cor() =  0.842

R <- 0.842
R2 <- R^2
print(R2)

## [1] 0.708964
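Squaring the correlation between observed and fitted values gives the model's R^2 for any least-squares linear model that includes an intercept. A minimal sketch of that identity is below. It uses R's built-in mtcars data rather than the Redfin data, and the choice of predictors is arbitrary. It also uses the unrounded correlation, whereas the code above squares the rounded value 0.842.

```r
# Any least-squares model with an intercept works; mtcars ships with R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Multiple correlation coefficient R: correlation of observed and fitted values
r <- cor(mtcars$mpg, fitted(fit), method = "pearson")

# R^2 computed two ways
r_squared_from_cor     <- r^2
r_squared_from_summary <- summary(fit)$r.squared

# TRUE when the two agree up to floating-point tolerance
all.equal(r_squared_from_cor, r_squared_from_summary)
```

For the model in this report, the same comparison could be made by storing the fitted `lm` object and reading `summary(...)$r.squared` directly, which avoids the small loss of precision from squaring a rounded R.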

# State the hypotheses
# NULL: rho = 0
# ALT: rho NEQ 0
# CLAIM: rho NEQ 0
rho <- 0 # hypothesized
claim <- "R of the multiple linear model is significant"
desired_action <- "support"
alpha <- 0.05 # given
alpha_statement <- paste("at alpha =", alpha, ".")

# Test the significance of R (cor-test option)
cor_test <- cor.test(x = rf_multi$price,
                     y = predict(lm(price ~ beds + baths + lot_size + x_square_feet + days_on_market, data = rf_multi)),
                     alternative = "two.sided",
                     method = "pearson",
                     conf.level = 0.95
)

cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  rf_multi$price and predict(lm(price ~ beds + baths + lot_size + x_square_feet + days_on_market, data = rf_multi))
## t = 51.212, df = 1076, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8237831 0.8586105
## sample estimates:
##       cor 
## 0.8420723
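The t statistic in this output follows from the correlation and its degrees of freedom through the standard formula t = r * sqrt(df / (1 - r^2)). As a minimal sketch, the calculation below re-derives it from the printed values. The variable names are introduced here for illustration, and the inputs are rounded as printed.

```r
# Values copied from the cor.test output above
r  <- 0.8420723   # sample correlation between observed and predicted price
df <- 1076        # degrees of freedom (n - 2)

# Test statistic for H0: rho = 0
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

t_stat
p_value < 0.05  # TRUE means reject the null hypothesis at alpha = 0.05
```

The recomputed statistic should match the reported t = 51.212 up to rounding. With a sample this large, even a much smaller correlation would be statistically significant, so the size of R matters more than the p-value when judging the model.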

Limitations and Suggestions

I encountered several limitations in my analysis that could be addressed further. First, in the comparison of means and in the regression in Problem 2, I considered only two property types when comparing average prices and predicting home values. This could be expanded to fit a regression line for each property type to see where the large differences and similarities lie. Second, the multiple regression model with price as the dependent variable had an R^2 of about 70.9%, meaning the independent variables account for that share of the variability in price. This indicates a reasonably good model, but roughly 30% of the variability is left unexplained. To bring R^2 closer to one, we could have included more independent variables that might fit the data better, such as location, the year the property was built, and the town's population. Third, removing outliers may itself be a limitation. R^2 is very sensitive to outliers, so they were removed for the purposes of the analysis, but they are also part of the population and could give insight into which independent variables really drive higher home prices in the data set. Overall, I feel I selected good independent variables for predicting home price, but there was room to include more in order to produce a higher R^2 and more accurate price predictions.
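One way to follow up on the first suggestion is to add an interaction term, which lets each property type have its own slope as well as its own intercept. The sketch below is illustrative only. It uses simulated data with made-up group labels, coefficients, and noise, not the Redfin data, and the variable names are hypothetical.

```r
set.seed(1)

# Simulated data: two groups with different intercepts and slopes (made up)
n    <- 100
type <- factor(rep(c("A", "B"), each = n / 2))
x    <- runif(n, min = 50, max = 300)
price <- ifelse(type == "A", 100000 + 1500 * x, 20000 + 600 * x) +
  rnorm(n, sd = 20000)
sim <- data.frame(price = price, x = x, type = type)

# Parallel-lines model (same form as the model in Problem 2)
fit_parallel <- lm(price ~ x + type, data = sim)

# Separate-lines model: x * type expands to x + type + x:type
fit_separate <- lm(price ~ x * type, data = sim)

names(coef(fit_separate))

# F-test of whether the extra slope term improves the fit
anova(fit_parallel, fit_separate)
```

In the separate-lines model the `x:typeB` coefficient is the difference in slope between the two groups, so the same formula extends to more than two property types without fitting a separate model for each.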

Work Cited

Kabacoff, R. I. (2015). R in Action (2nd ed.). Manning Publications.

Bluman, A. G. (2018). Elementary statistics: A step by step approach (10th ed.). McGraw Hill.

Michael

2024-03-30
