Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

https://www.kaggle.com/datasets/prakharrathi25/home-prices-dataset?resource=download

Download the Data

library(RCurl)  # provides getURL()

my_git_url <- getURL("https://raw.githubusercontent.com/AhmedBuckets/SPS605/main/home_data.csv")
price_data <- read.csv(text = my_git_url)

Show Columns

summary(price_data)

##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651360   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200

Plotting Price against Grade

For my dichotomous variable I will use waterfront. The relationship between price and grade is the closest thing to a quadratic relationship I could find.
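The plot of price against grade does not appear in this write-up. A minimal sketch of how such a plot could be produced follows. The data frame `toy` is simulated stand-in data (hypothetical values, not the Kaggle file) so that the snippet runs on its own, and `mean_by_grade` is an illustrative name.

```r
# Simulated stand-in for price_data (hypothetical values, not the real data set)
set.seed(1)
toy <- data.frame(grade = sample(3:12, 500, replace = TRUE))
toy$price <- 20000 * toy$grade^2 + rnorm(500, sd = 100000)

# Mean price at each grade level; curvature shows up as growing gaps between levels
mean_by_grade <- aggregate(price ~ grade, data = toy, FUN = mean)
print(mean_by_grade)

# Scatter of price against grade with the per-grade means overlaid
plot(toy$grade, toy$price, xlab = "grade", ylab = "price")
lines(mean_by_grade$grade, mean_by_grade$price, col = "red", lwd = 2)
```

Substituting `price_data` for `toy` (and skipping the simulation lines) would draw the same plot for the real data.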

Make Model

price_sqft_model <- lm(price ~ waterfront + grade + waterfront*sqft_living, data = price_data)
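The formula above enters grade linearly. If the curvature between price and grade is to be modeled, one option is to add a squared term with `I(grade^2)`. The sketch below is run on simulated stand-in data (`toy`, hypothetical values) rather than the real file, the names `linear_model` and `quad_model` are assumptions, and no result for the actual data set is implied.

```r
# Simulated stand-in for price_data (hypothetical values, not the real data set)
set.seed(2)
n <- 1000
toy <- data.frame(
  waterfront  = rbinom(n, 1, 0.1),
  grade       = sample(3:12, n, replace = TRUE),
  sqft_living = round(runif(n, 500, 5000))
)
toy$price <- 5000 * toy$grade^2 + 150 * toy$sqft_living +
  300 * toy$waterfront * toy$sqft_living + rnorm(n, sd = 50000)

# Same structure as the model in this section: grade enters linearly
linear_model <- lm(price ~ grade + waterfront * sqft_living, data = toy)

# Adds a quadratic grade term; I() keeps ^ from being read as formula syntax
quad_model <- lm(price ~ grade + I(grade^2) + waterfront * sqft_living, data = toy)

# Nested-model F test: does the squared term improve the fit?
print(anova(linear_model, quad_model))
print(coef(quad_model))
```

On the real data, the same formula with `data = price_data` would give the quadratic version of the model; its coefficients would need to be interpreted afresh.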

Look at the Summary

summary(price_sqft_model)

## 
## Call:
## lm(formula = price ~ waterfront + grade + waterfront * sqft_living, 
##     data = price_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1485444  -132896   -23312    98391  4969552 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.761e+05  1.248e+04 -46.146  < 2e-16 ***
## waterfront             -3.141e+05  4.108e+04  -7.645 2.17e-14 ***
## grade                   9.960e+04  2.104e+03  47.348  < 2e-16 ***
## sqft_living             1.669e+02  2.716e+00  61.476  < 2e-16 ***
## waterfront:sqft_living  3.618e+02  1.164e+01  31.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235000 on 21608 degrees of freedom
## Multiple R-squared:  0.5902, Adjusted R-squared:  0.5902 
## F-statistic:  7781 on 4 and 21608 DF,  p-value: < 2.2e-16

Coefficients

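The intercept of -$576,100 is the predicted price of a non-waterfront home with grade 0 and 0 square feet of living space. No such home exists in the data, so it has no practical interpretation.

The coefficient for waterfront is -$314,100. Because the model includes a waterfront by sqft_living interaction, this is the predicted difference between a waterfront and a non-waterfront home of the same grade at sqft_living = 0, not an overall waterfront discount. The net waterfront effect is -314,100 + 361.8 × sqft_living, which turns positive above roughly 868 square feet.

The coefficient for grade is $99,600: holding waterfront status and living area constant, each one-point increase in grade is associated with a $99,600 higher predicted price.

The coefficient for sqft_living is $166.9: for non-waterfront homes, each additional square foot of living space is associated with a $166.90 higher predicted price, holding grade constant.

The interaction term has a coefficient of $361.8, meaning that each additional square foot adds $361.80 more for waterfront homes than for non-waterfront homes. The slope for waterfront homes is therefore 166.9 + 361.8 = $528.70 per square foot, so waterfront homes gain more value per square foot added than non-waterfront homes.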
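The arithmetic behind the waterfront and interaction coefficients can be worked through directly from the summary output. The sketch below uses only the reported estimates; the variable and function names are illustrative, and 1,910 is the median sqft_living from the column summary.

```r
# Coefficients copied from the summary output above
b_waterfront  <- -3.141e+05  # shift for waterfront homes at sqft_living = 0
b_sqft        <-  1.669e+02  # price per square foot, non-waterfront homes
b_interaction <-  3.618e+02  # extra price per square foot, waterfront homes

# Slope of price on sqft_living for waterfront homes
slope_waterfront <- b_sqft + b_interaction

# Predicted waterfront premium at a given living area, holding grade fixed
waterfront_premium <- function(sqft) b_waterfront + b_interaction * sqft

# Living area at which the premium changes sign
break_even_sqft <- -b_waterfront / b_interaction

print(slope_waterfront)          # 528.7
print(round(break_even_sqft))    # 868
print(waterfront_premium(1910))  # 376938, at the median living area
```

So the negative waterfront coefficient does not mean waterfront homes are cheaper: at the median living area the model predicts a premium of about $377,000.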

Residual Analysis

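The residual standard error is $235,000 on 21,608 degrees of freedom, so a typical prediction misses by an amount that is large relative to the median price of $450,000.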

The residuals are the differences between the observed prices and the prices predicted by the model (observed minus predicted). The minimum residual, -1,485,444, is the largest overestimate by the model, and the maximum, 4,969,552, is the largest underestimate. The median residual is -23,312, so the model overpredicts slightly more often than it underpredicts. The residuals tell us more when we compare them with the model’s predictions:

The spread of the residuals increases as we move to the right, toward larger fitted values. This violates the constant-variance assumption, and it means the model will struggle with prediction at higher prices.
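The residuals-versus-fitted plot is not reproduced in this write-up. A minimal sketch of how it could be drawn, with a rough numeric check of the widening spread, is given below. It runs on simulated stand-in data (`toy`, hypothetical values with noise that grows with living area), and `toy_model`, `upper_half`, and `spread` are illustrative names.

```r
# Simulated stand-in for price_data (hypothetical values, not the real data set)
set.seed(3)
n <- 1000
toy <- data.frame(sqft_living = round(runif(n, 500, 5000)))
# Noise grows with living area, mimicking residuals that widen to the right
toy$price <- 150 * toy$sqft_living + rnorm(n, sd = 40 * toy$sqft_living)

toy_model <- lm(price ~ sqft_living, data = toy)

# Residuals against fitted values; a widening band signals non-constant variance
plot(fitted(toy_model), resid(toy_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Numeric check: residual spread in the lower vs. upper half of fitted values
upper_half <- fitted(toy_model) > median(fitted(toy_model))
spread <- tapply(resid(toy_model), upper_half, sd)
print(spread)
```

For the model in this section, `plot(fitted(price_sqft_model), resid(price_sqft_model))` or `plot(price_sqft_model, which = 1)` would give the corresponding plot.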

We can use a Q-Q plot to visualize whether or not the residuals are normally distributed:

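The residuals stay close to the line until the rightmost extreme, where they pull away from it. This is consistent with the maximum residual (4,969,552) being far larger in magnitude than the minimum (-1,485,444). The residuals are approximately normal only up to a point, with a heavy right tail.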
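The Q-Q plot is not reproduced in this write-up. A minimal sketch of how it could be drawn, along with a simple tail comparison, follows. The data frame `toy` is simulated stand-in data (hypothetical values with right-skewed noise), and the names `toy_model`, `res`, `upper_sample`, and `upper_normal` are illustrative.

```r
# Simulated stand-in for price_data (hypothetical values, not the real data set)
set.seed(4)
n <- 1000
toy <- data.frame(sqft_living = round(runif(n, 500, 5000)))
# Right-skewed (exponential) noise, centered at zero, to mimic a heavy upper tail
toy$price <- 150 * toy$sqft_living + (rexp(n, rate = 1 / 100000) - 100000)

toy_model <- lm(price ~ sqft_living, data = toy)
res <- resid(toy_model)

# Normal Q-Q plot of the residuals with a reference line
qqnorm(res)
qqline(res)

# Compare the 99th percentile of the residuals with the value a normal
# distribution of the same standard deviation would give
upper_sample <- unname(quantile(res, 0.99))
upper_normal <- qnorm(0.99, mean = 0, sd = sd(res))
print(c(sample = upper_sample, normal = upper_normal))
```

For the model in this section, `qqnorm(resid(price_sqft_model))` followed by `qqline(resid(price_sqft_model))` would produce the corresponding plot.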

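I would say that a linear model is only partly appropriate. In its favor, the Q-Q plot adheres to a straight line over most of its range, the F-statistic is highly significant, and the R-squared of 0.59 means the model explains about 59% of the variation in price. Against it, the residual spread grows with the fitted values and the right tail departs from normality, so the model is least reliable for expensive homes. The model as fitted also enters grade linearly, so any curvature between price and grade is not captured.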