Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
https://www.kaggle.com/datasets/prakharrathi25/home-prices-dataset?resource=download
library(RCurl)  # provides getURL()
my_git_url <- getURL("https://raw.githubusercontent.com/AhmedBuckets/SPS605/main/home_data.csv")
price_data <- read.csv(text = my_git_url)
summary(price_data)
## id date price bedrooms
## Min. :1.000e+06 Length:21613 Min. : 75000 Min. : 0.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 321950 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540088 Mean : 3.371
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651360 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 1.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.000 Median : 7.000
## Mean :0.007542 Mean :0.2343 Mean :3.409 Mean : 7.657
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 290 Min. : 0.0 Min. :1900 Min. : 0.0
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0
## Median :1560 Median : 0.0 Median :1975 Median : 0.0
## Mean :1788 Mean : 291.5 Mean :1971 Mean : 84.4
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.0
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12768
## 3rd Qu.: 10083
## Max. :871200
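For my dichotomous variable I will use waterfront (1 = waterfront property, 0 = not). The relationship between price and grade is the closest thing to a quadratic relationship I could find. Note that the model fitted below enters grade as a linear term only, so it does not contain a quadratic term.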
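One way to check whether a squared grade term earns its place is to compare a linear-only fit with a fit that adds `I(grade^2)`. The sketch below is only an illustration: it uses a synthetic data frame named `toy` (an assumption, not the real `price_data`) in which price is built to curve upward in grade, and then runs a nested-model F test.

```r
# Synthetic stand-in for price_data (NOT the real data): price curves upward in grade.
set.seed(1)
n <- 500
toy <- data.frame(grade = sample(3:13, n, replace = TRUE))
toy$price <- 100000 + 5000 * toy$grade^2 + rnorm(n, sd = 20000)

# Linear-only fit versus a fit with an added squared term.
# I() is needed so that ^ means arithmetic squaring inside a formula.
fit_linear    <- lm(price ~ grade, data = toy)
fit_quadratic <- lm(price ~ grade + I(grade^2), data = toy)

# Nested-model F test: a small p-value favours keeping the squared term.
quad_test <- anova(fit_linear, fit_quadratic)
quad_coef <- coef(fit_quadratic)["I(grade^2)"]
print(quad_test)
print(quad_coef)
```

On the real data the same idea would be written as `lm(price ~ waterfront + grade + I(grade^2) + waterfront*sqft_living, data = price_data)`. That model was not fitted in this report, so no results are claimed for it.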
price_sqft_model <- lm(price ~ waterfront + grade + waterfront*sqft_living, data = price_data)
summary(price_sqft_model)
##
## Call:
## lm(formula = price ~ waterfront + grade + waterfront * sqft_living,
## data = price_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1485444 -132896 -23312 98391 4969552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.761e+05 1.248e+04 -46.146 < 2e-16 ***
## waterfront -3.141e+05 4.108e+04 -7.645 2.17e-14 ***
## grade 9.960e+04 2.104e+03 47.348 < 2e-16 ***
## sqft_living 1.669e+02 2.716e+00 61.476 < 2e-16 ***
## waterfront:sqft_living 3.618e+02 1.164e+01 31.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235000 on 21608 degrees of freedom
## Multiple R-squared: 0.5902, Adjusted R-squared: 0.5902
## F-statistic: 7781 on 4 and 21608 DF, p-value: < 2.2e-16
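The intercept of -$576,100 is the predicted price of a non-waterfront home with a grade of 0 and 0 square feet of living area. No such home exists in the data, so it has no practical interpretation.
The coefficient for waterfront is -$314,100. Because the model includes a waterfront:sqft_living interaction, this is the predicted price difference between a waterfront and a non-waterfront home of the same grade only at sqft_living = 0, which is outside the observed data (the smallest home is 290 sq ft). At any given size the difference is -$314,100 + $361.8 × sqft_living, which turns positive above roughly 868 sq ft, so waterfront homes are predicted to cost more across most of the observed range.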
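Because the model contains a waterfront:sqft_living interaction, the predicted waterfront difference depends on living area. A minimal sketch of that arithmetic, using only the coefficients printed in the summary output (the helper `waterfront_gap` is a hypothetical name introduced here):

```r
# Coefficients copied from the summary output above.
b_waterfront  <- -3.141e5   # waterfront main effect
b_interaction <-  3.618e2   # waterfront:sqft_living

# Predicted waterfront minus non-waterfront price at a given living area,
# holding grade fixed (grade cancels out of the difference).
waterfront_gap <- function(sqft) b_waterfront + b_interaction * sqft

# Living area at which the predicted gap is zero.
break_even_sqft <- -b_waterfront / b_interaction
print(round(break_even_sqft))   # about 868 sq ft

# Gap at the minimum, median and maximum sqft_living from the data summary.
print(waterfront_gap(c(290, 1910, 13540)))
```

The gap is negative only for homes smaller than the break-even size, which is below the first quartile of sqft_living (1,427 sq ft).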
The coefficient for grade is $99,600: holding waterfront status and living area constant, each one-point increase in grade is associated with a $99,600 higher predicted price.
The coefficient for sqft_living is $166.9: for non-waterfront homes of the same grade, each additional square foot of living area is associated with a $166.90 higher predicted price.
The interaction term has a coefficient of $361.8: for waterfront homes, each additional square foot is associated with $361.80 more than it is for non-waterfront homes, for a total of about $528.70 per square foot ($166.9 + $361.8). Waterfront properties therefore gain more value per added square foot than non-waterfront properties.
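The per-square-foot slopes for each group, and the predicted price for a specific home, follow directly from the fitted equation. The sketch below plugs in the coefficients from the summary output; `predict_price` is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary output above.
b <- c(intercept = -5.761e5, waterfront = -3.141e5, grade = 9.960e4,
       sqft_living = 1.669e2, interaction = 3.618e2)

# Price change per extra square foot in each group.
slope_non_waterfront <- b[["sqft_living"]]
slope_waterfront     <- b[["sqft_living"]] + b[["interaction"]]

# Hypothetical helper: predicted price from the fitted equation.
predict_price <- function(waterfront, grade, sqft_living) {
  b[["intercept"]] + b[["waterfront"]] * waterfront + b[["grade"]] * grade +
    b[["sqft_living"]] * sqft_living +
    b[["interaction"]] * waterfront * sqft_living
}

print(c(non_waterfront = slope_non_waterfront, waterfront = slope_waterfront))

# Example: a grade-7 home with 1,910 sq ft (the sample medians), off and on the water.
print(predict_price(waterfront = 0:1, grade = 7, sqft_living = 1910))
```

With the real model object, `predict(price_sqft_model, newdata = ...)` would give the same kind of prediction without copying coefficients by hand.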
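The residual standard error is 235,000 on 21,608 degrees of freedom, meaning the model's predictions are typically off by about $235,000. That is large relative to the median price of $450,000.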
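The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom (here 21,613 observations minus 5 estimated coefficients = 21,608). A small sketch that reproduces the calculation on a synthetic data frame `toy` (an assumption, not the real `price_data`) and checks it against R's own value:

```r
# Synthetic stand-in for price_data (NOT the real data), same column names.
set.seed(42)
n <- 200
toy <- data.frame(
  waterfront  = rbinom(n, 1, 0.1),
  grade       = sample(5:11, n, replace = TRUE),
  sqft_living = round(runif(n, 500, 5000))
)
toy$price <- -5e5 + 1e5 * toy$grade + 170 * toy$sqft_living +
  rnorm(n, sd = 2e5)

fit <- lm(price ~ waterfront + grade + waterfront * sqft_living, data = toy)

# Residual standard error = sqrt(residual sum of squares / residual df).
rse_by_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# sigma() returns the same quantity that summary() prints.
print(c(by_hand = rse_by_hand, from_lm = sigma(fit)))
```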
The residuals are the differences between the observed prices and the prices predicted by the model (observed minus predicted). The minimum residual, -1,485,444, is the model's largest overestimate; the maximum, 4,969,552, is its largest underestimate. The median residual is -23,312, so the model overpredicts slightly more often than it underpredicts. The residuals tell us more when we compare them with the model's predictions:
The spread of the residuals tends to increase as we move to the right (toward larger fitted values), meaning the error variance is not constant and the model will struggle with prediction at larger values.
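A growing residual spread can also be checked numerically by comparing the standard deviation of the residuals across bins of fitted values. The sketch below does this on a synthetic data frame `toy` (an assumption, not the real `price_data`) whose noise is deliberately built to grow with living area.

```r
# Synthetic stand-in for price_data (NOT the real data): the noise grows with size.
set.seed(7)
n <- 1000
toy <- data.frame(sqft_living = runif(n, 500, 5000))
toy$price <- 50000 + 200 * toy$sqft_living +
  rnorm(n, sd = 40 * toy$sqft_living)

fit <- lm(price ~ sqft_living, data = toy)

# Split the fitted values into four equal-sized groups, lowest (Q1) to highest (Q4).
quarter <- cut(fitted(fit),
               breaks = quantile(fitted(fit), probs = seq(0, 1, 0.25)),
               include.lowest = TRUE, labels = paste0("Q", 1:4))

# Residual standard deviation within each group: a steady rise from Q1 to Q4
# is the numeric counterpart of a fan shape in a residuals-vs-fitted plot.
spread_by_quarter <- tapply(residuals(fit), quarter, sd)
print(spread_by_quarter)
```

For the model in this report, `plot(fitted(price_sqft_model), resid(price_sqft_model))` draws the corresponding residuals-versus-fitted picture.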
We can use a Q-Q plot to visualize whether or not the residuals are normally distributed:
The residuals stay close to the line until the rightmost extreme, where they pull away from it. The residuals are approximately normally distributed only up to a point, which is consistent with the very large maximum residual.
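The Q-Q comparison can be reproduced without drawing anything by asking `qqnorm()` for its coordinates. This sketch uses a synthetic data frame `toy` (an assumption, not the real `price_data`) with right-skewed errors, so the upper tail departs from the normal reference while the lower tail does not.

```r
# Synthetic stand-in for price_data (NOT the real data): right-skewed errors.
set.seed(3)
n <- 1000
toy <- data.frame(sqft_living = runif(n, 500, 5000))
toy$price <- 50000 + 200 * toy$sqft_living + 1e5 * (rexp(n) - 1)

fit <- lm(price ~ sqft_living, data = toy)
std_res <- rstandard(fit)   # standardized residuals

# Q-Q coordinates without plotting:
# x = theoretical normal quantiles, y = sample quantiles.
qq <- qqnorm(std_res, plot.it = FALSE)

# How far each extreme sits beyond its normal counterpart.
# Positive = that tail is heavier than a normal distribution would give.
upper_tail_excess <- max(qq$y) - max(qq$x)
lower_tail_excess <- min(qq$x) - min(qq$y)
print(c(upper = upper_tail_excess, lower = lower_tail_excess))
```

Dropping `plot.it = FALSE` and adding `qqline(std_res)` draws the usual plot with its reference line.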
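I would say that a linear model is only partly appropriate. In its favor, the Q-Q plot adheres to a straight line over the central range, the F-statistic is highly significant, and the R-squared of 0.59 means the model explains about 59% of the variation in price, which is moderate rather than very high. Against it, the residual spread grows with the fitted values and the right tail departs from normality, so the model's predictions are least reliable for the most expensive homes.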