This project addresses a series of questions about a housing data set from https://www.lock5stat.com/datasets3e/HomesForSale.csv. The questions focus mostly on California homes and on how certain variables relate to home prices. By answering these questions with linear regression, we should gain insight into how various factors affect a home's price.
Home = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(Home)
## State Price Size Beds Baths
## 1 CA 533 1589 3 2.5
## 2 CA 610 2008 3 2.0
## 3 CA 899 2380 5 3.0
## 4 CA 929 1868 3 3.0
## 5 CA 210 1360 2 2.0
## 6 CA 268 2131 3 2.0
Q1. Use the data only for California. How much does the size of a home influence its price?
Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?
Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?
Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.
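sp = data.frame(Home$Size, Home$Price)
# Perform simple linear regression.
# Note: as written, the formula uses Home$Size as the response and Home$Price as
# the predictor, and it refers to the columns of Home directly, so all 120 rows
# (all four states) are used and the data argument has no effect.
lm1_model <- lm(Home$Size ~ Home$Price, data = sp)
# Print the summary of the regression
summary(lm1_model)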
##
## Call:
## lm(formula = Home$Size ~ Home$Price, data = sp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1167.9 -360.5 -117.8 228.1 2310.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1444.2737 109.1487 13.232 < 2e-16 ***
## Home$Price 1.1195 0.2428 4.611 1.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 664.8 on 118 degrees of freedom
## Multiple R-squared: 0.1527, Adjusted R-squared: 0.1455
## F-statistic: 21.26 on 1 and 118 DF, p-value: 1.022e-05
The residual standard error (664.8) estimates the typical size of the difference between observed and fitted values.
The multiple R-squared (0.1527; adjusted 0.1455) indicates that about 15% of the total variation in home size is explained (or accounted for) by price.
The p-value (1.02e-05) associated with the slope coefficient indicates that the relationship is statistically significant at commonly used significance levels (for example, 0.01).
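Q1 asks about California only and treats price as the outcome, whereas the call above models Size as a function of Price on all rows. A minimal sketch of the regression Q1 describes, assuming the column names shown by head(Home). The helper name fit_ca_price is hypothetical, and no output is shown because this fit was not part of the original run:

```r
# Hypothetical helper: regress Price on one predictor within a single state.
# Assumes a data frame with columns State, Price, and the named predictor.
fit_ca_price <- function(homes, predictor = "Size", state = "CA") {
  # Keep only the rows for the requested state
  state_homes <- homes[homes$State == state, , drop = FALSE]
  # Build the formula Price ~ <predictor>; bare column names are looked up in `data`
  lm(reformulate(predictor, response = "Price"), data = state_homes)
}
```

With the data loaded as above, summary(fit_ca_price(Home, "Size")) would give the slope of price on size for California homes only.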
bp = data.frame(Home$Beds, Home$Price)
# Perform simple linear regression
lm2_model <- lm(Home$Beds ~ Home$Price, data = bp)
# Print the summary of the regression
summary(lm2_model)
##
## Call:
## lm(formula = Home$Beds ~ Home$Price, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1555 -0.3620 -0.2432 0.6022 2.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1241726 0.1270481 24.590 <2e-16 ***
## Home$Price 0.0006266 0.0002826 2.217 0.0285 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7738 on 118 degrees of freedom
## Multiple R-squared: 0.04, Adjusted R-squared: 0.03186
## F-statistic: 4.917 on 1 and 118 DF, p-value: 0.02851
bap = data.frame(Home$Baths, Home$Price)
# Perform simple linear regression
lm3_model <- lm(Home$Baths ~ Home$Price, data = bap)
# Print the summary of the regression
summary(lm3_model)
##
## Call:
## lm(formula = Home$Baths ~ Home$Price, data = bap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8583 -0.3796 -0.1406 0.6073 2.3516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8735308 0.1290746 14.515 < 2e-16 ***
## Home$Price 0.0014088 0.0002871 4.907 2.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7861 on 118 degrees of freedom
## Multiple R-squared: 0.1695, Adjusted R-squared: 0.1624
## F-statistic: 24.08 on 1 and 118 DF, p-value: 2.992e-06
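To compare single-predictor fits side by side with Price as the response, the slope, p-value, and R-squared of each fit can be collected into one table. A sketch under the same column-name assumptions; compare_predictors is a hypothetical name, and its results are not reproduced here:

```r
# Hypothetical helper: fit Price ~ predictor for each predictor and
# return one row per fit with the slope, its p-value, and R-squared.
compare_predictors <- function(homes, predictors = c("Size", "Beds", "Baths")) {
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = "Price"), data = homes)
    s <- summary(fit)
    data.frame(
      predictor = p,
      slope = s$coefficients[p, "Estimate"],
      p_value = s$coefficients[p, "Pr(>|t|)"],
      r_squared = s$r.squared,
      stringsAsFactors = FALSE
    )
  })
  # Stack the per-predictor rows into a single data frame
  do.call(rbind, rows)
}
```

For the California questions, this could be called as compare_predictors(Home[Home$State == "CA", ]).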
#create data frame
data <- data.frame(Home$Size, Home$Beds, Home$Baths, Home$Price)
# Perform multiple linear regression
model <- lm(Home$Price ~ Home$Size + Home$Beds + Home$Baths, data = data)
# Print the summary of the regression model
summary(model)
##
## Call:
## lm(formula = Home$Price ~ Home$Size + Home$Beds + Home$Baths,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -352.31 -157.69 -68.89 86.14 745.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.75177 92.91802 1.117 0.2665
## Home$Size 0.08199 0.04264 1.923 0.0570 .
## Home$Beds -25.80554 32.82340 -0.786 0.4334
## Home$Baths 84.95750 34.48394 2.464 0.0152 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 228.1 on 116 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1745
## F-statistic: 9.385 on 3 and 116 DF, p-value: 1.329e-05
How can we interpret each of the coefficients and each p-value?
The coefficient 0.08199 means that for each one-unit increase in Size, the predicted price increases by about 0.08 price units, holding Beds and Baths constant (the Price column appears to be recorded in thousands of dollars). The coefficient -25.80554 means that for each additional bedroom, the predicted price decreases by about 25.81, holding Size and Baths constant. The coefficient 84.95750 means that for each additional bathroom, the predicted price increases by about 84.96, holding Size and Beds constant.
The p-value for Baths (0.0152) indicates a significant effect on home price at the 0.05 level. The p-value for Size (0.0570) is only marginally significant: it falls just above 0.05 but below 0.10. The larger p-value for Beds (0.4334) indicates that Beds has no significant effect on home price once Size and Baths are in the model.
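The significance statements above can be cross-checked with confidence intervals: a coefficient whose 95% interval excludes zero is significant at the 0.05 level. A sketch, assuming the same column names; coef_table is a hypothetical helper and no output is reproduced here:

```r
# Hypothetical helper: fit the three-predictor model and tabulate each
# coefficient with its confidence interval.
coef_table <- function(homes, level = 0.95) {
  fit <- lm(Price ~ Size + Beds + Baths, data = homes)
  ci <- confint(fit, level = level)  # matrix: one row per term, lower and upper bounds
  data.frame(
    term = rownames(ci),
    estimate = unname(coef(fit)),
    lower = ci[, 1],
    upper = ci[, 2],
    # TRUE when the interval lies entirely above or entirely below zero
    excludes_zero = ci[, 1] > 0 | ci[, 2] < 0,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}
```

Calling coef_table(Home) would use all rows, matching the model fitted above; passing Home[Home$State == "CA", ] would restrict it to California as Q4 asks.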
# Form a data frame
myData <- data.frame(Home$Price, Home$State)
# Perform one-way ANOVA
model2 <- aov(Home$Price ~ Home$State, data = myData)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Home$State 3 1198169 399390 7.355 0.000148 ***
## Residuals 116 6299266 54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Conduct post-hoc Tukey's HSD test
TukeyHSD(model2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Home$Price ~ Home$State, data = myData)
##
## $`Home$State`
## diff lwr upr p adj
## NJ-CA -206.83333 -363.6729 -49.99379 0.0044754
## NY-CA -170.03333 -326.8729 -13.19379 0.0280402
## PA-CA -269.80000 -426.6395 -112.96045 0.0001011
## NY-NJ 36.80000 -120.0395 193.63955 0.9282064
## PA-NJ -62.96667 -219.8062 93.87288 0.7224830
## PA-NY -99.76667 -256.6062 57.07288 0.3505951
plot(model2, 1)
plot(model2, 2)
Since the p-value (0.000148) is very small, the data indicate that there are significant differences in mean home price among the states.
To see which states differ in price, we can conduct a post-hoc Tukey’s HSD test.
The small adjusted p-values (0.0044754, 0.0280402, 0.0001011) mean that there is a significant difference in mean home prices for the pairs NJ-CA, NY-CA, and PA-CA; in each case the negative difference shows that California prices are higher on average. The larger adjusted p-values (0.9282064, 0.7224830, 0.3505951) mean that there is no significant difference in mean home prices for NY-NJ, PA-NJ, and PA-NY.
The diagnostic plots from plot(model2, 1) and plot(model2, 2) (residuals vs. fitted and normal Q-Q) show that about half of the points follow the model while the others depart from it.
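Reading the significant pairs off the Tukey table can also be done programmatically. A sketch, assuming columns named State and Price; significant_pairs is a hypothetical helper, and the 0.05 cutoff is an illustrative default:

```r
# Hypothetical helper: one-way ANOVA of Price by State, followed by Tukey's HSD,
# returning the names of the pairs whose adjusted p-value is below alpha.
significant_pairs <- function(homes, alpha = 0.05) {
  # Treat State as a factor so aov() and TukeyHSD() see a grouping variable
  homes$State <- factor(homes$State)
  fit <- aov(Price ~ State, data = homes)
  # TukeyHSD() returns a list with one matrix per factor; rows are named like "NJ-CA"
  tukey <- TukeyHSD(fit)$State
  rownames(tukey)[tukey[, "p adj"] < alpha]
}
```

Given the adjusted p-values reported above, significant_pairs(Home) should correspond to the pairs NJ-CA, NY-CA, and PA-CA.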
The estimate (1.1195) for the slope coefficient indicates that, in the model as fitted (Size regressed on Price), home size increases by about 1.12 units per one-unit increase in price.
The p-value (0.0285) associated with the slope coefficient in the Beds model indicates that the relationship is statistically significant at the 0.05 level, but not at the 0.01 level.
The p-value (2.99e-06) associated with the slope coefficient in the Baths model indicates that the relationship is statistically significant at commonly used significance levels (for example, 0.01).
Using linear regression we were able to find which variables had significant impacts on housing prices. We were also able to reason out why these findings make sense.
# Q1
sp = data.frame(Home$Size, Home$Price)
lm1_model <- lm(Home$Size ~ Home$Price, data = sp)
summary(lm1_model)
# Q2
bp = data.frame(Home$Beds, Home$Price)
lm2_model <- lm(Home$Beds ~ Home$Price, data = bp)
summary(lm2_model)
# Q3
bap = data.frame(Home$Baths, Home$Price)
lm3_model <- lm(Home$Baths ~ Home$Price, data = bap)
summary(lm3_model)
# Q4
data <- data.frame(Home$Size, Home$Beds, Home$Baths, Home$Price)
model <- lm(Home$Price ~ Home$Size + Home$Beds + Home$Baths, data = data)
summary(model)
# Q5
myData <- data.frame(Home$Price, Home$State)
model2 <- aov(Home$Price ~ Home$State, data = myData)
summary(model2)
TukeyHSD(model2)
plot(model2, 1)
plot(model2, 2)