Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

install.packages("kableExtra")
devtools::install_github("haozhu233/kableExtra")
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(kableExtra)))

Data

Description: The data are a random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base.

Variable Names:

PRICE = Selling price ($hundreds)

SQFT = Square feet of living space

AGE = Age of home (years)

FEATS = Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access

NE = Located in northeast sector of city (1) or not (0)

COR = Corner location (1) or not (0)

TAX = Annual taxes ($)

house <- read.table("https://raw.githubusercontent.com/YunMai-SPS/DATA605_homework/master/data605_week12/605week12_houseprice.txt?token=AX_Wu9dZJKJukjMwojU-49A1hGjCM_fkks5aG1eUwA%3D%3D")
house <- data.frame(lapply(house, as.character), stringsAsFactors=FALSE)
colnames(house) <- house[1,]
house <- house[-1,]
house <- data.frame(lapply(house, as.numeric), stringsAsFactors=FALSE)
## Warning in lapply(house, as.numeric): NAs introduced by coercion

## Warning in lapply(house, as.numeric): NAs introduced by coercion
kable(house,  "html") %>% 
  kable_styling() %>% 
  scroll_box(width = "500px", height = "200px")
PRICE SQFT AGE FEATS NE CUST COR TAX
2050 2650 13 7 1 1 0 1639
2080 2600 NA 4 1 1 0 1088
2150 2664 6 5 1 1 0 1193
2150 2921 3 6 1 1 0 1635
1999 2580 4 4 1 1 0 1732
1900 2580 4 4 1 0 0 1534
1800 2774 2 4 1 0 0 1765
1560 1920 1 5 1 1 0 1161
1450 2150 NA 4 1 0 0 NA
1449 1710 1 3 1 1 0 1010
1375 1837 4 5 1 0 0 1191
1270 1880 8 6 1 0 0 930
1250 2150 15 3 1 0 0 984
1235 1894 14 5 1 1 0 1112
1170 1928 18 8 1 1 0 600
1180 1830 NA 3 1 0 0 733
1155 1767 16 4 1 0 0 794
1110 1630 15 3 1 0 1 867
1139 1680 17 4 1 0 1 750
995 1725 NA 3 1 0 0 923
995 1500 15 4 1 0 0 743
975 1430 NA 3 1 0 0 752
975 1360 NA 4 1 0 0 696
900 1400 16 2 1 0 1 731
960 1573 17 6 1 0 0 768
860 1385 NA 2 1 0 0 653
1695 2931 28 3 1 0 1 1142
1553 2200 28 4 1 0 0 1035
1250 2277 NA 4 1 1 0 NA
1300 2000 NA 3 1 1 0 1076
1020 1478 53 3 1 0 1 626
1020 1713 30 4 1 0 1 600
922 1326 NA 4 1 0 0 668
925 1050 NA 2 1 0 1 553
899 1464 NA 2 1 1 0 566
850 1190 41 1 1 0 0 600
876 1156 NA 1 1 0 0 NA
890 1746 NA 2 1 0 0 591
870 1280 NA 1 1 0 0 599
700 1215 NA 3 1 0 0 477
720 1121 46 4 1 0 0 398
720 1050 NA 1 1 0 0 NA
749 1733 43 6 1 0 0 656
731 1299 NA 6 1 0 0 585
725 1140 NA 3 1 0 1 490
670 1181 NA 4 1 0 0 440
2150 2848 4 6 1 1 0 1487
1599 2440 NA 5 1 1 0 1265
1350 2253 23 4 1 1 0 939
1299 2743 25 5 1 1 1 1232
1250 2180 17 4 1 0 1 1141
1239 1706 14 4 1 0 0 810
1200 1948 NA 4 1 0 0 899
1125 1710 16 4 1 1 0 800
1100 1657 NA 4 1 0 0 865
1080 2200 26 4 1 0 0 1076
1050 1680 13 4 1 0 0 875
1049 1900 34 3 1 0 0 690
955 1565 NA 3 1 1 0 648
934 1543 20 3 1 0 0 820
875 1173 6 4 1 0 0 456
889 1549 NA 4 1 0 0 723
855 1900 NA 3 1 0 0 780
835 1560 NA 5 1 0 1 638
810 1365 NA 2 1 0 0 673
805 1258 7 4 1 0 1 821
799 1314 NA 2 1 0 0 671
750 1338 NA 3 1 0 1 649
759 997 4 4 1 0 0 461
755 1275 NA 5 1 0 0 NA
750 1030 NA 1 1 0 0 486
730 1027 NA 3 1 0 0 427
729 1007 19 6 1 0 0 513
710 1083 22 4 1 0 0 504
773 1320 NA 5 1 0 0 NA
690 1348 15 2 1 1 0 NA
670 1350 NA 2 1 0 0 622
619 837 NA 2 1 0 0 342
1295 3750 NA 4 0 1 1 1200
975 1500 7 3 0 1 1 700
939 1428 40 2 0 0 0 701
820 1375 NA 1 0 0 0 585
780 1080 NA 3 0 1 0 600
770 900 NA 3 0 0 0 391
700 1505 NA 2 0 0 1 591
620 1480 NA 4 0 0 0 NA
540 1142 NA 0 0 0 0 223
1070 1464 NA 2 0 0 0 376
2100 2116 25 3 0 1 0 1209
725 1280 NA 3 0 0 0 447
660 1159 NA 0 0 0 0 225
600 1198 NA 4 0 0 0 NA
580 1051 15 2 0 0 0 426
1844 2250 40 6 0 1 0 915
1580 2563 NA 2 0 1 0 1189
699 1400 45 1 0 1 1 481
1330 1850 5 5 0 1 1 NA
1160 1720 5 4 0 0 0 867
1109 1740 4 3 0 0 0 816
1129 1700 6 4 0 0 0 725
1050 1620 6 4 0 0 0 800
1045 1630 6 4 0 0 0 750
1050 1920 8 4 0 0 0 944
1020 1606 5 4 0 0 0 811
1000 1535 7 5 0 0 1 668
1030 1540 6 2 0 0 1 826
975 1739 13 3 0 0 0 880
950 1715 NA 3 0 0 0 900
940 1305 5 3 0 0 0 647
920 1415 7 4 0 0 0 866
945 1580 9 3 0 0 0 810
874 1236 3 4 0 0 0 707
872 1229 6 3 0 0 0 721
870 1273 4 4 0 0 0 638
869 1165 7 4 0 0 0 694
766 1200 7 4 0 0 1 634
739 970 4 4 0 0 1 541

Multiple Regression Model

Then I will build the multiple regression model to predict the Annual taxes. As requiement, the model will include one quadratic term ( FEATS^2 Number out of 11 features), one dichotomous term(NE, Located in northeast sector of city), and one dichotomous vs.quantitative interaction term: NE * FEATS^2

The null hypothesis is: house features and whether house is Located in northeast sector of city do not have effects on the annual tax.

house$FEATS_2 <- (house$FEATS)^2
house$FEATS_2_NE <- (house$FEATS_2)*(house$NE)

tax.lm <- lm(TAX ~ FEATS_2+NE+FEATS_2_NE, data=house)
summary(tax.lm)
## 
## Call:
## lm(formula = TAX ~ FEATS_2 + NE + FEATS_2_NE, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -700.30 -159.13  -37.28  164.31  930.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  572.422     86.061   6.651 1.43e-09 ***
## FEATS_2       12.428      6.538   1.901   0.0601 .  
## NE           107.433    104.003   1.033   0.3040    
## FEATS_2_NE    -2.734      7.166  -0.382   0.7036    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.9 on 103 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.1578, Adjusted R-squared:  0.1333 
## F-statistic: 6.432 on 3 and 103 DF,  p-value: 0.0004901

The model dose not looks good as the p-value is larger than 0.05, suggesting that the house features and whether house is Located in northeast sector of city do not have effects on the annual tax.

Then I use the square of SQFT(Square feet of living space) as the quadratic term and keep NE(Located in northeast sector of city) as the dichotomous term.

The null hypothesis is: Square feet of living space and whether house is Located in northeast sector of city do not have effects on the annual tax.

house$SQFT_2 <- (house$SQFT)^2
house$SQFT_2_NE <- (house$SQFT_2)*(house$NE)

tax.lm <- lm(TAX ~ SQFT_2+NE+SQFT_2_NE, data=house)
summary(tax.lm)
## 
## Call:
## lm(formula = TAX ~ SQFT_2 + NE + SQFT_2_NE, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462.92  -90.39   10.67   92.47  405.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.194e+02  4.135e+01  12.562  < 2e-16 ***
## SQFT_2       7.104e-05  1.187e-05   5.986 3.17e-08 ***
## NE          -1.530e+02  5.475e+01  -2.794  0.00621 ** 
## SQFT_2_NE    7.313e-05  1.510e-05   4.843 4.52e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160.1 on 103 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.7377, Adjusted R-squared:   0.73 
## F-statistic: 96.55 on 3 and 103 DF,  p-value: < 2.2e-16

Square feet of living space and whether house is Located in northeast sector of city do not have effects on the annual tax.

the intercept is the fitted annual tax value when SQFT^2, NE, or SQFT^2*NE are equal to 0 for the Control units. In this context it is relatively meaningless since annual tax of 0 is unlikely to occur, we cannot therefore draw further interpretation from this coefficient.

Then slope 7.104e-05 means that if we hold NE and SQFT^2*NE constant an increase in 1 unit of SQFT^2 lead to an increase of 7.104e-05 dollar of annual tax.

Similarly slope -1.530e+02 means that if we hold SQFT^2 and SQFT^2*NE constant the house located at the northeast of the city will reduce the annual tax of 1.530e+02 dollar.

Finally slope 7.313e-05 means that if we hold SQFT^2 and NE constant the increase in 1 unit of the product of SQFT^2 and NE will incrase the annual tax of 7.313e-05 dollar.

Residual Analysis

Use the residual analysis techniques to check the validity of the assumptions.

plot(fitted(tax.lm),resid(tax.lm))
abline(0,0)

qqnorm(resid(tax.lm))
qqline(resid(tax.lm))

The variances of residuals areUniformly scattered about zero.

The Q-Q plot shows that the residuals follow the indicated line.

Taken together, using the square of SQFT(Square feet of living space) and NE(Located in northeast sector of city) can build a valid model. But the cofficients are too small(7e-05). This means that the square of the house area do not affect or predict the annual tax. It is interesting to see that the location at northeast sector of city negatively affect the annual tax of the house.

Other variants like age, corner location could be used to see whether we can build a better model to predict the annual tax of the house.