Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
There are a number of interesting data-sets posted by Dr. John Burkardt of Florida State University. Those for testing regressions may be found here.
This discussion will look at the relationship between the selling price of houses and other variables. Specifically:
- \(I\), the index
- \(A_0\), 1
- \(A_1\), the local selling prices, in hundreds of dollars
- \(A_2\), the number of bathrooms
- \(A_3\), the area of the site in thousands of square feet
- \(A_4\), the size of the living space in thousands of square feet
- \(A_5\), the number of garages
- \(A_6\), the number of rooms
- \(A_7\), the number of bedrooms
- \(A_8\), the age in years
- \(A_9\), 1 = brick, 2 = brick/wood, 3 = aluminum/wood, 4 = wood
- \(A_10\), 1 = two story, 2 = split level, 3 = ranch
- \(A_11\), number of fire places
- \(B\), the selling price
As in the Week 11 example, the Intercept and Index variables are irrelevant, so they will be removed. To create a dichotomous variable, a new variable called Brick will be created. This variable will be TRUE for construction types containing brick and FALSE otherwise.
House <- read.table("https://people.sc.fsu.edu/~jburkardt/datasets/regression/x27.txt",
skip = 56L)
names(House) <- c("Index", "Intercept", "LocalSell", "Baths", "Area", "LiveArea",
"Garage", "Rooms", "BedRooms", "Age", "Construction",
"Levels", "Fireplaces", "SalePrice")
House$Brick <- factor(ifelse(House$Construction %in% c(1, 2), TRUE, FALSE),
ordered = FALSE)
House$Levels <- factor(House$Levels, ordered = FALSE)
House <- House[, -c(1, 2, 11)]
Below are the first few rows and summary statistics for the house data:
head(House, n = 10L)
summary(House)
## LocalSell Baths Area LiveArea
## Min. : 3.891 Min. :1.000 Min. : 2.275 Min. :0.975
## 1st Qu.: 5.240 1st Qu.:1.000 1st Qu.: 4.855 1st Qu.:1.194
## Median : 6.041 Median :1.000 Median : 6.143 Median :1.494
## Mean : 7.222 Mean :1.268 Mean : 6.461 Mean :1.512
## 3rd Qu.: 8.275 3rd Qu.:1.500 3rd Qu.: 7.850 3rd Qu.:1.655
## Max. :16.420 Max. :2.500 Max. :12.800 Max. :3.420
## Garage Rooms BedRooms Age Levels
## Min. :0.000 Min. : 5.000 Min. :2.000 Min. : 3.00 1:24
## 1st Qu.:1.000 1st Qu.: 6.000 1st Qu.:3.000 1st Qu.:30.00 2: 2
## Median :1.250 Median : 6.000 Median :3.000 Median :36.00 3: 2
## Mean :1.339 Mean : 6.679 Mean :3.286 Mean :36.32
## 3rd Qu.:2.000 3rd Qu.: 7.000 3rd Qu.:4.000 3rd Qu.:46.50
## Max. :2.000 Max. :10.000 Max. :5.000 Max. :62.00
## Fireplaces SalePrice Brick
## Min. :0.0000 Min. :25.90 FALSE:11
## 1st Qu.:0.0000 1st Qu.:29.90 TRUE :17
## Median :0.0000 Median :36.40
## Mean :0.3214 Mean :38.16
## 3rd Qu.:1.0000 3rd Qu.:40.62
## Max. :1.0000 Max. :84.90
Already the summary shows us that Levels will be unlikely to provide predictive power as there are only two examples each of the two variables other than “two-story”, so we will remove that too.
House <- House[, -9]
Below is the matrix of paired scatter plots of the remaining variables.
pairs(House, gap = 0)
There will almost certainly be some redundancies between Area, Living Area, Rooms, BedRooms, and Baths. Interaction terms may be useful, but out of the scope of this example. Age may have a non-linear relationship with SalePrice so a quadratic term will be investigated as well. Lastly, Brick will be interacted against LocalSell.
We will start with the near-saturated model and work backwards manually.
Fit <- lm(SalePrice ~ . + I(Age^2) + Brick:LocalSell, data = House)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ . + I(Age^2) + Brick:LocalSell, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7784 -2.4877 0.1523 2.1518 7.5022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1694763 10.3808726 0.498 0.6257
## LocalSell 1.0224041 0.8855988 1.154 0.2664
## Baths 7.9932749 5.9104603 1.352 0.1963
## Area 0.0834084 0.5697778 0.146 0.8856
## LiveArea 14.4458871 5.1897575 2.784 0.0139 *
## Garage 2.1327481 1.9147820 1.114 0.2829
## Rooms -0.0245827 2.6471444 -0.009 0.9927
## BedRooms -2.1930601 4.0217130 -0.545 0.5936
## Age -0.0284473 0.2764459 -0.103 0.9194
## Fireplaces 2.2010847 2.5109540 0.877 0.3945
## BrickTRUE -0.3387326 5.1808901 -0.065 0.9487
## I(Age^2) -0.0006523 0.0039201 -0.166 0.8701
## LocalSell:BrickTRUE -0.2096753 0.6688878 -0.313 0.7582
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.33 on 15 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9065
## F-statistic: 22.8 on 12 and 15 DF, p-value: 1.875e-07
Note that as Brick is a dichotomous variable, it’s default level is considered FALSE and the model is testing for Brick being TRUE. Below will be backwards elimination in reverse order of significance.
Fit <- update(Fit, .~. - Rooms)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea +
## Garage + BedRooms + Age + Fireplaces + Brick + I(Age^2) +
## LocalSell:Brick, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7802 -2.4844 0.1374 2.1570 7.4990
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1047696 7.4505956 0.685 0.5031
## LocalSell 1.0213250 0.8500659 1.201 0.2471
## Baths 8.0060515 5.5655895 1.438 0.1696
## Area 0.0825935 0.5451041 0.152 0.8815
## LiveArea 14.4380554 4.9581849 2.912 0.0102 *
## Garage 2.1307699 1.8424766 1.156 0.2645
## BedRooms -2.2213014 2.5481173 -0.872 0.3962
## Age -0.0279733 0.2630648 -0.106 0.9166
## Fireplaces 2.1953804 2.3573578 0.931 0.3655
## BrickTRUE -0.3268780 4.8617310 -0.067 0.9472
## I(Age^2) -0.0006592 0.0037283 -0.177 0.8619
## LocalSell:BrickTRUE -0.2114958 0.6192081 -0.342 0.7371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.192 on 16 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9123
## F-statistic: 26.53 on 11 and 16 DF, p-value: 3.486e-08
Fit <- update(Fit, .~. - Brick)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea +
## Garage + BedRooms + Age + Fireplaces + I(Age^2) + LocalSell:Brick,
## data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7683 -2.5389 0.1117 2.2119 7.4292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8696286 6.3832468 0.763 0.4560
## LocalSell 1.0471386 0.7358843 1.423 0.1728
## Baths 7.9671122 5.3708604 1.483 0.1563
## Area 0.0748024 0.5168145 0.145 0.8866
## LiveArea 14.5023689 4.7204483 3.072 0.0069 **
## Garage 2.1213333 1.7825230 1.190 0.2504
## BedRooms -2.2363413 2.4628410 -0.908 0.3766
## Age -0.0247928 0.2510857 -0.099 0.9225
## Fireplaces 2.2036843 2.2841547 0.965 0.3482
## I(Age^2) -0.0006864 0.0035960 -0.191 0.8509
## LocalSell:BrickTRUE -0.2501548 0.2229809 -1.122 0.2775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.068 on 17 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9174
## F-statistic: 31 on 10 and 17 DF, p-value: 6.004e-09
Fit <- update(Fit, .~. - Age)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea +
## Garage + BedRooms + Fireplaces + I(Age^2) + LocalSell:Brick,
## data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8080 -2.5866 0.1448 2.2556 7.5049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.427995 4.427265 1.000 0.33049
## LocalSell 1.043964 0.714673 1.461 0.16132
## Baths 7.936613 5.212393 1.523 0.14522
## Area 0.080839 0.498870 0.162 0.87308
## LiveArea 14.541663 4.572431 3.180 0.00518 **
## Garage 2.172972 1.656546 1.312 0.20609
## BedRooms -2.252638 2.388756 -0.943 0.35816
## Fireplaces 2.168172 2.192739 0.989 0.33587
## I(Age^2) -0.001022 0.001144 -0.893 0.38354
## LocalSell:BrickTRUE -0.249079 0.216501 -1.150 0.26501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.954 on 18 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.922
## F-statistic: 36.45 on 9 and 18 DF, p-value: 9.547e-10
Fit <- update(Fit, .~. - Area)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage +
## BedRooms + Fireplaces + I(Age^2) + LocalSell:Brick, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6596 -2.5520 0.1068 2.1637 7.6014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.720096 3.938692 1.198 0.24550
## LocalSell 1.049407 0.695349 1.509 0.14770
## Baths 7.724790 4.914821 1.572 0.13252
## LiveArea 14.800456 4.173222 3.547 0.00216 **
## Garage 2.127830 1.590559 1.338 0.19676
## BedRooms -2.193994 2.299882 -0.954 0.35208
## Fireplaces 2.289656 2.007101 1.141 0.26814
## I(Age^2) -0.001096 0.001022 -1.072 0.29702
## LocalSell:BrickTRUE -0.247151 0.210562 -1.174 0.25499
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.852 on 19 degrees of freedom
## Multiple R-squared: 0.9479, Adjusted R-squared: 0.926
## F-statistic: 43.22 on 8 and 19 DF, p-value: 1.402e-10
Fit <- update(Fit, .~. - BedRooms)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage +
## Fireplaces + I(Age^2) + LocalSell:Brick, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.260 -1.969 -0.284 2.383 7.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4416659 3.1248602 0.781 0.44374
## LocalSell 1.0958832 0.6920787 1.583 0.12900
## Baths 7.0941822 4.8591971 1.460 0.15983
## LiveArea 12.6755046 3.5210759 3.600 0.00179 **
## Garage 1.4048380 1.3952391 1.007 0.32602
## Fireplaces 2.7413892 1.9460488 1.409 0.17428
## I(Age^2) -0.0015159 0.0009202 -1.647 0.11509
## LocalSell:BrickTRUE -0.1994902 0.2040885 -0.977 0.34001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.843 on 20 degrees of freedom
## Multiple R-squared: 0.9454, Adjusted R-squared: 0.9263
## F-statistic: 49.49 on 7 and 20 DF, p-value: 2.932e-11
Fit <- update(Fit, .~. - LocalSell:Brick)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage +
## Fireplaces + I(Age^2), data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8691 -2.3104 -0.2209 2.4710 7.2440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1160698 3.1037579 0.682 0.5028
## LocalSell 1.1546977 0.6887263 1.677 0.1085
## Baths 7.8978362 4.7840524 1.651 0.1136
## LiveArea 11.4167823 3.2736715 3.487 0.0022 **
## Garage 1.3069035 1.3901600 0.940 0.3578
## Fireplaces 2.5109406 1.9296648 1.301 0.2073
## I(Age^2) -0.0014150 0.0009134 -1.549 0.1363
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.839 on 21 degrees of freedom
## Multiple R-squared: 0.9428, Adjusted R-squared: 0.9265
## F-statistic: 57.7 on 6 and 21 DF, p-value: 5.78e-12
Fit <- update(Fit, .~. - Garage)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Fireplaces +
## I(Age^2), data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2875 -1.7244 -0.0118 1.9242 7.0367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5169917 3.0661906 0.821 0.42052
## LocalSell 1.4405077 0.6163604 2.337 0.02894 *
## Baths 7.5667439 4.7584556 1.590 0.12607
## LiveArea 11.0002568 3.2349742 3.400 0.00257 **
## Fireplaces 2.4430326 1.9232134 1.270 0.21725
## I(Age^2) -0.0011778 0.0008755 -1.345 0.19224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.829 on 22 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9269
## F-statistic: 69.43 on 5 and 22 DF, p-value: 9.984e-13
FitRmax <- Fit
This model actually has the highest Rsq and adjusted Rsq, but has non-significant variables.
Fit <- update(Fit, .~. - Fireplaces)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + I(Age^2),
## data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7024 -1.7184 -0.4171 1.8771 6.3903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0831593 2.8886488 0.375 0.71112
## LocalSell 1.6854357 0.5931837 2.841 0.00925 **
## Baths 8.7359037 4.7304632 1.847 0.07769 .
## LiveArea 9.9730003 3.1737751 3.142 0.00457 **
## I(Age^2) -0.0008314 0.0008430 -0.986 0.33428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.879 on 23 degrees of freedom
## Multiple R-squared: 0.936, Adjusted R-squared: 0.9249
## F-statistic: 84.14 on 4 and 23 DF, p-value: 2.184e-13
Fit <- update(Fit, .~. - I(Age ^ 2))
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5838 -2.1785 -0.3148 2.2052 6.5844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4900 2.4069 -0.204 0.84040
## LocalSell 1.8999 0.5516 3.445 0.00211 **
## Baths 8.3192 4.7089 1.767 0.08999 .
## LiveArea 9.5117 3.1373 3.032 0.00575 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.877 on 24 degrees of freedom
## Multiple R-squared: 0.9333, Adjusted R-squared: 0.925
## F-statistic: 112 on 3 and 24 DF, p-value: 3.013e-14
FitS1 <- Fit
At this point, all the variables are significant to the 0.1 level.
Fit <- update(Fit, .~. - Baths)
summary(Fit)
##
## Call:
## lm(formula = SalePrice ~ LocalSell + LiveArea, data = House)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6935 -3.2118 -0.0549 2.2044 8.1124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2660 2.2832 0.554 0.584
## LocalSell 2.4224 0.4849 4.996 3.77e-05 ***
## LiveArea 12.8312 2.6169 4.903 4.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.038 on 25 degrees of freedom
## Multiple R-squared: 0.9247, Adjusted R-squared: 0.9186
## F-statistic: 153.4 on 2 and 25 DF, p-value: 9.188e-15
Using a significance level of at least 0.5, this would be the selected model: \[ SalePrice = 1.266 + 2.4224\times LocalSell + 12.8312\times LivingArea \]
par(mfrow = c(2, 2))
plot(FitRmax$fitted.values, FitRmax$residuals)
plot(FitS1$fitted.values, FitS1$residuals)
plot(Fit$fitted.values, Fit$residuals)
par(mfrow = c(1, 1))
None of the three models—Rsq max, all significant at 0.1, or the final selected model—show obvious heteroskedasticity.
par(mfrow = c(2, 2))
qqnorm(FitRmax$residuals, main = "Max Rsquared")
qqline(FitRmax$residuals)
qqnorm(FitS1$residuals, main = "Significant 0.1")
qqline(FitS1$residuals)
qqnorm(Fit$residuals, main = "Significant 0.05")
qqline(Fit$residuals)
par(mfrow = c(1, 1))
The last model has the best-fitting QQ-plot to the eye.
With the data provided, the best fitting multiple-regression model is the one with two variables reflecting the local selling price of homes in the neighborhood and the living area. However, the linear model, at least as constructed, is not the best model for this data. While coded numeric, the case can be made for variables like Garage, Baths and Fireplaces to be considered factors, as the price is unlikely to be a function of their value but of their presence. Furthermore, with limited possibilities for these factors, but with the presence of real quantitative values as well, this data set would be better served by a tree-based model such as a Random Forest.