Assignment

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Solution

Data Overview

There are a number of interesting data-sets posted by Dr. John Burkardt of Florida State University. Those for testing regressions may be found here.

This discussion will look at the relationship between the selling price of houses and other variables. Specifically:

  • \(I\), the index
  • \(A_0\), 1
  • \(A_1\), the local selling prices, in hundreds of dollars
  • \(A_2\), the number of bathrooms
  • \(A_3\), the area of the site in thousands of square feet
  • \(A_4\), the size of the living space in thousands of square feet
  • \(A_5\), the number of garages
  • \(A_6\), the number of rooms
  • \(A_7\), the number of bedrooms
  • \(A_8\), the age in years
  • \(A_9\), 1 = brick, 2 = brick/wood, 3 = aluminum/wood, 4 = wood
  • \(A_10\), 1 = two story, 2 = split level, 3 = ranch
  • \(A_11\), number of fire places
  • \(B\), the selling price

As in the Week 11 example, the Intercept and Index variables are irrelevant, so they will be removed. To create a dichotomous variable, a new variable called Brick will be created. This variable will be TRUE for construction types containing brick and FALSE otherwise.

ETL

House <- read.table("https://people.sc.fsu.edu/~jburkardt/datasets/regression/x27.txt",
                skip = 56L)
names(House) <- c("Index", "Intercept", "LocalSell", "Baths", "Area", "LiveArea",
                   "Garage", "Rooms", "BedRooms", "Age", "Construction",
                   "Levels", "Fireplaces", "SalePrice")
House$Brick <- factor(ifelse(House$Construction %in% c(1, 2), TRUE, FALSE),
                      ordered = FALSE)
House$Levels <- factor(House$Levels, ordered = FALSE)
House <- House[, -c(1, 2, 11)]

Investigation

Below are the first few rows and summary statistics for the house data:

head(House, n = 10L)
summary(House)
##    LocalSell          Baths            Area           LiveArea    
##  Min.   : 3.891   Min.   :1.000   Min.   : 2.275   Min.   :0.975  
##  1st Qu.: 5.240   1st Qu.:1.000   1st Qu.: 4.855   1st Qu.:1.194  
##  Median : 6.041   Median :1.000   Median : 6.143   Median :1.494  
##  Mean   : 7.222   Mean   :1.268   Mean   : 6.461   Mean   :1.512  
##  3rd Qu.: 8.275   3rd Qu.:1.500   3rd Qu.: 7.850   3rd Qu.:1.655  
##  Max.   :16.420   Max.   :2.500   Max.   :12.800   Max.   :3.420  
##      Garage          Rooms           BedRooms          Age        Levels
##  Min.   :0.000   Min.   : 5.000   Min.   :2.000   Min.   : 3.00   1:24  
##  1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:3.000   1st Qu.:30.00   2: 2  
##  Median :1.250   Median : 6.000   Median :3.000   Median :36.00   3: 2  
##  Mean   :1.339   Mean   : 6.679   Mean   :3.286   Mean   :36.32         
##  3rd Qu.:2.000   3rd Qu.: 7.000   3rd Qu.:4.000   3rd Qu.:46.50         
##  Max.   :2.000   Max.   :10.000   Max.   :5.000   Max.   :62.00         
##    Fireplaces       SalePrice       Brick   
##  Min.   :0.0000   Min.   :25.90   FALSE:11  
##  1st Qu.:0.0000   1st Qu.:29.90   TRUE :17  
##  Median :0.0000   Median :36.40             
##  Mean   :0.3214   Mean   :38.16             
##  3rd Qu.:1.0000   3rd Qu.:40.62             
##  Max.   :1.0000   Max.   :84.90

Already the summary shows us that Levels will be unlikely to provide predictive power as there are only two examples each of the two variables other than “two-story”, so we will remove that too.

House <- House[, -9]

Below is the matrix of paired scatter plots of the remaining variables.

pairs(House, gap = 0)

There will almost certainly be some redundancies between Area, Living Area, Rooms, BedRooms, and Baths. Interaction terms may be useful, but out of the scope of this example. Age may have a non-linear relationship with SalePrice so a quadratic term will be investigated as well. Lastly, Brick will be interacted against LocalSell.

Linear Model

We will start with the near-saturated model and work backwards manually.

Fit <- lm(SalePrice ~ . + I(Age^2) + Brick:LocalSell, data = House)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ . + I(Age^2) + Brick:LocalSell, data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7784 -2.4877  0.1523  2.1518  7.5022 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          5.1694763 10.3808726   0.498   0.6257  
## LocalSell            1.0224041  0.8855988   1.154   0.2664  
## Baths                7.9932749  5.9104603   1.352   0.1963  
## Area                 0.0834084  0.5697778   0.146   0.8856  
## LiveArea            14.4458871  5.1897575   2.784   0.0139 *
## Garage               2.1327481  1.9147820   1.114   0.2829  
## Rooms               -0.0245827  2.6471444  -0.009   0.9927  
## BedRooms            -2.1930601  4.0217130  -0.545   0.5936  
## Age                 -0.0284473  0.2764459  -0.103   0.9194  
## Fireplaces           2.2010847  2.5109540   0.877   0.3945  
## BrickTRUE           -0.3387326  5.1808901  -0.065   0.9487  
## I(Age^2)            -0.0006523  0.0039201  -0.166   0.8701  
## LocalSell:BrickTRUE -0.2096753  0.6688878  -0.313   0.7582  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.33 on 15 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9065 
## F-statistic:  22.8 on 12 and 15 DF,  p-value: 1.875e-07

Note that as Brick is a dichotomous variable, it’s default level is considered FALSE and the model is testing for Brick being TRUE. Below will be backwards elimination in reverse order of significance.

Fit <- update(Fit, .~. - Rooms)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea + 
##     Garage + BedRooms + Age + Fireplaces + Brick + I(Age^2) + 
##     LocalSell:Brick, data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7802 -2.4844  0.1374  2.1570  7.4990 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          5.1047696  7.4505956   0.685   0.5031  
## LocalSell            1.0213250  0.8500659   1.201   0.2471  
## Baths                8.0060515  5.5655895   1.438   0.1696  
## Area                 0.0825935  0.5451041   0.152   0.8815  
## LiveArea            14.4380554  4.9581849   2.912   0.0102 *
## Garage               2.1307699  1.8424766   1.156   0.2645  
## BedRooms            -2.2213014  2.5481173  -0.872   0.3962  
## Age                 -0.0279733  0.2630648  -0.106   0.9166  
## Fireplaces           2.1953804  2.3573578   0.931   0.3655  
## BrickTRUE           -0.3268780  4.8617310  -0.067   0.9472  
## I(Age^2)            -0.0006592  0.0037283  -0.177   0.8619  
## LocalSell:BrickTRUE -0.2114958  0.6192081  -0.342   0.7371  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.192 on 16 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9123 
## F-statistic: 26.53 on 11 and 16 DF,  p-value: 3.486e-08
Fit <- update(Fit, .~. - Brick)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea + 
##     Garage + BedRooms + Age + Fireplaces + I(Age^2) + LocalSell:Brick, 
##     data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7683 -2.5389  0.1117  2.2119  7.4292 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          4.8696286  6.3832468   0.763   0.4560   
## LocalSell            1.0471386  0.7358843   1.423   0.1728   
## Baths                7.9671122  5.3708604   1.483   0.1563   
## Area                 0.0748024  0.5168145   0.145   0.8866   
## LiveArea            14.5023689  4.7204483   3.072   0.0069 **
## Garage               2.1213333  1.7825230   1.190   0.2504   
## BedRooms            -2.2363413  2.4628410  -0.908   0.3766   
## Age                 -0.0247928  0.2510857  -0.099   0.9225   
## Fireplaces           2.2036843  2.2841547   0.965   0.3482   
## I(Age^2)            -0.0006864  0.0035960  -0.191   0.8509   
## LocalSell:BrickTRUE -0.2501548  0.2229809  -1.122   0.2775   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.068 on 17 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9174 
## F-statistic:    31 on 10 and 17 DF,  p-value: 6.004e-09
Fit <- update(Fit, .~. - Age)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + Area + LiveArea + 
##     Garage + BedRooms + Fireplaces + I(Age^2) + LocalSell:Brick, 
##     data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8080 -2.5866  0.1448  2.2556  7.5049 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          4.427995   4.427265   1.000  0.33049   
## LocalSell            1.043964   0.714673   1.461  0.16132   
## Baths                7.936613   5.212393   1.523  0.14522   
## Area                 0.080839   0.498870   0.162  0.87308   
## LiveArea            14.541663   4.572431   3.180  0.00518 **
## Garage               2.172972   1.656546   1.312  0.20609   
## BedRooms            -2.252638   2.388756  -0.943  0.35816   
## Fireplaces           2.168172   2.192739   0.989  0.33587   
## I(Age^2)            -0.001022   0.001144  -0.893  0.38354   
## LocalSell:BrickTRUE -0.249079   0.216501  -1.150  0.26501   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.954 on 18 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.922 
## F-statistic: 36.45 on 9 and 18 DF,  p-value: 9.547e-10
Fit <- update(Fit, .~. - Area)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage + 
##     BedRooms + Fireplaces + I(Age^2) + LocalSell:Brick, data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6596 -2.5520  0.1068  2.1637  7.6014 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          4.720096   3.938692   1.198  0.24550   
## LocalSell            1.049407   0.695349   1.509  0.14770   
## Baths                7.724790   4.914821   1.572  0.13252   
## LiveArea            14.800456   4.173222   3.547  0.00216 **
## Garage               2.127830   1.590559   1.338  0.19676   
## BedRooms            -2.193994   2.299882  -0.954  0.35208   
## Fireplaces           2.289656   2.007101   1.141  0.26814   
## I(Age^2)            -0.001096   0.001022  -1.072  0.29702   
## LocalSell:BrickTRUE -0.247151   0.210562  -1.174  0.25499   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.852 on 19 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.926 
## F-statistic: 43.22 on 8 and 19 DF,  p-value: 1.402e-10
Fit <- update(Fit, .~. - BedRooms)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage + 
##     Fireplaces + I(Age^2) + LocalSell:Brick, data = House)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.260 -1.969 -0.284  2.383  7.702 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          2.4416659  3.1248602   0.781  0.44374   
## LocalSell            1.0958832  0.6920787   1.583  0.12900   
## Baths                7.0941822  4.8591971   1.460  0.15983   
## LiveArea            12.6755046  3.5210759   3.600  0.00179 **
## Garage               1.4048380  1.3952391   1.007  0.32602   
## Fireplaces           2.7413892  1.9460488   1.409  0.17428   
## I(Age^2)            -0.0015159  0.0009202  -1.647  0.11509   
## LocalSell:BrickTRUE -0.1994902  0.2040885  -0.977  0.34001   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.843 on 20 degrees of freedom
## Multiple R-squared:  0.9454, Adjusted R-squared:  0.9263 
## F-statistic: 49.49 on 7 and 20 DF,  p-value: 2.932e-11
Fit <- update(Fit, .~. - LocalSell:Brick)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Garage + 
##     Fireplaces + I(Age^2), data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8691 -2.3104 -0.2209  2.4710  7.2440 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.1160698  3.1037579   0.682   0.5028   
## LocalSell    1.1546977  0.6887263   1.677   0.1085   
## Baths        7.8978362  4.7840524   1.651   0.1136   
## LiveArea    11.4167823  3.2736715   3.487   0.0022 **
## Garage       1.3069035  1.3901600   0.940   0.3578   
## Fireplaces   2.5109406  1.9296648   1.301   0.2073   
## I(Age^2)    -0.0014150  0.0009134  -1.549   0.1363   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.839 on 21 degrees of freedom
## Multiple R-squared:  0.9428, Adjusted R-squared:  0.9265 
## F-statistic:  57.7 on 6 and 21 DF,  p-value: 5.78e-12
Fit <- update(Fit, .~. - Garage)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + Fireplaces + 
##     I(Age^2), data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2875 -1.7244 -0.0118  1.9242  7.0367 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.5169917  3.0661906   0.821  0.42052   
## LocalSell    1.4405077  0.6163604   2.337  0.02894 * 
## Baths        7.5667439  4.7584556   1.590  0.12607   
## LiveArea    11.0002568  3.2349742   3.400  0.00257 **
## Fireplaces   2.4430326  1.9232134   1.270  0.21725   
## I(Age^2)    -0.0011778  0.0008755  -1.345  0.19224   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.829 on 22 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9269 
## F-statistic: 69.43 on 5 and 22 DF,  p-value: 9.984e-13
FitRmax <- Fit

This model actually has the highest Rsq and adjusted Rsq, but has non-significant variables.

Fit <- update(Fit, .~. - Fireplaces)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea + I(Age^2), 
##     data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7024 -1.7184 -0.4171  1.8771  6.3903 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1.0831593  2.8886488   0.375  0.71112   
## LocalSell    1.6854357  0.5931837   2.841  0.00925 **
## Baths        8.7359037  4.7304632   1.847  0.07769 . 
## LiveArea     9.9730003  3.1737751   3.142  0.00457 **
## I(Age^2)    -0.0008314  0.0008430  -0.986  0.33428   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.879 on 23 degrees of freedom
## Multiple R-squared:  0.936,  Adjusted R-squared:  0.9249 
## F-statistic: 84.14 on 4 and 23 DF,  p-value: 2.184e-13
Fit <- update(Fit, .~. - I(Age ^ 2))
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + Baths + LiveArea, data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5838 -2.1785 -0.3148  2.2052  6.5844 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.4900     2.4069  -0.204  0.84040   
## LocalSell     1.8999     0.5516   3.445  0.00211 **
## Baths         8.3192     4.7089   1.767  0.08999 . 
## LiveArea      9.5117     3.1373   3.032  0.00575 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.877 on 24 degrees of freedom
## Multiple R-squared:  0.9333, Adjusted R-squared:  0.925 
## F-statistic:   112 on 3 and 24 DF,  p-value: 3.013e-14
FitS1 <- Fit

At this point, all the variables are significant to the 0.1 level.

Fit <- update(Fit, .~. - Baths)
summary(Fit)
## 
## Call:
## lm(formula = SalePrice ~ LocalSell + LiveArea, data = House)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6935 -3.2118 -0.0549  2.2044  8.1124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2660     2.2832   0.554    0.584    
## LocalSell     2.4224     0.4849   4.996 3.77e-05 ***
## LiveArea     12.8312     2.6169   4.903 4.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.038 on 25 degrees of freedom
## Multiple R-squared:  0.9247, Adjusted R-squared:  0.9186 
## F-statistic: 153.4 on 2 and 25 DF,  p-value: 9.188e-15

Using a significance level of at least 0.5, this would be the selected model: \[ SalePrice = 1.266 + 2.4224\times LocalSell + 12.8312\times LivingArea \]

Check for Heteroskedacticity

par(mfrow = c(2, 2))
plot(FitRmax$fitted.values, FitRmax$residuals)
plot(FitS1$fitted.values, FitS1$residuals)
plot(Fit$fitted.values, Fit$residuals)
par(mfrow = c(1, 1))

None of the three models—Rsq max, all significant at 0.1, or the final selected model—show obvious heteroskedasticity.

QQ plots

par(mfrow = c(2, 2))
qqnorm(FitRmax$residuals, main = "Max Rsquared")
qqline(FitRmax$residuals)
qqnorm(FitS1$residuals, main = "Significant 0.1")
qqline(FitS1$residuals)
qqnorm(Fit$residuals, main = "Significant 0.05")
qqline(Fit$residuals)
par(mfrow = c(1, 1))

The last model has the best-fitting QQ-plot to the eye.

Conclusion

With the data provided, the best fitting multiple-regression model is the one with two variables reflecting the local selling price of homes in the neighborhood and the living area. However, the linear model, at least as constructed, is not the best model for this data. While coded numeric, the case can be made for variables like Garage, Baths and Fireplaces to be considered factors, as the price is unlikely to be a function of their value but of their presence. Furthermore, with limited possibilities for these factors, but with the presence of real quantitative values as well, this data set would be better served by a tree-based model such as a Random Forest.