Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
install.packages("kableExtra")
devtools::install_github("haozhu233/kableExtra")
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(kableExtra)))
Description: The data are a random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base.
Variable Names:
PRICE = Selling price ($hundreds)
SQFT = Square feet of living space
AGE = Age of home (years)
FEATS = Number out of 11 features (dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, cable TV access
NE = Located in northeast sector of city (1) or not (0)
COR = Corner location (1) or not (0)
TAX = Annual taxes ($)
house <- read.table("https://raw.githubusercontent.com/YunMai-SPS/DATA605_homework/master/data605_week12/605week12_houseprice.txt?token=AX_Wu9dZJKJukjMwojU-49A1hGjCM_fkks5aG1eUwA%3D%3D")
house <- data.frame(lapply(house, as.character), stringsAsFactors=FALSE)
colnames(house) <- house[1,]
house <- house[-1,]
house <- data.frame(lapply(house, as.numeric), stringsAsFactors=FALSE)
## Warning in lapply(house, as.numeric): NAs introduced by coercion
## Warning in lapply(house, as.numeric): NAs introduced by coercion
kable(house, "html") %>%
kable_styling() %>%
scroll_box(width = "500px", height = "200px")
PRICE | SQFT | AGE | FEATS | NE | CUST | COR | TAX |
---|---|---|---|---|---|---|---|
2050 | 2650 | 13 | 7 | 1 | 1 | 0 | 1639 |
2080 | 2600 | NA | 4 | 1 | 1 | 0 | 1088 |
2150 | 2664 | 6 | 5 | 1 | 1 | 0 | 1193 |
2150 | 2921 | 3 | 6 | 1 | 1 | 0 | 1635 |
1999 | 2580 | 4 | 4 | 1 | 1 | 0 | 1732 |
1900 | 2580 | 4 | 4 | 1 | 0 | 0 | 1534 |
1800 | 2774 | 2 | 4 | 1 | 0 | 0 | 1765 |
1560 | 1920 | 1 | 5 | 1 | 1 | 0 | 1161 |
1450 | 2150 | NA | 4 | 1 | 0 | 0 | NA |
1449 | 1710 | 1 | 3 | 1 | 1 | 0 | 1010 |
1375 | 1837 | 4 | 5 | 1 | 0 | 0 | 1191 |
1270 | 1880 | 8 | 6 | 1 | 0 | 0 | 930 |
1250 | 2150 | 15 | 3 | 1 | 0 | 0 | 984 |
1235 | 1894 | 14 | 5 | 1 | 1 | 0 | 1112 |
1170 | 1928 | 18 | 8 | 1 | 1 | 0 | 600 |
1180 | 1830 | NA | 3 | 1 | 0 | 0 | 733 |
1155 | 1767 | 16 | 4 | 1 | 0 | 0 | 794 |
1110 | 1630 | 15 | 3 | 1 | 0 | 1 | 867 |
1139 | 1680 | 17 | 4 | 1 | 0 | 1 | 750 |
995 | 1725 | NA | 3 | 1 | 0 | 0 | 923 |
995 | 1500 | 15 | 4 | 1 | 0 | 0 | 743 |
975 | 1430 | NA | 3 | 1 | 0 | 0 | 752 |
975 | 1360 | NA | 4 | 1 | 0 | 0 | 696 |
900 | 1400 | 16 | 2 | 1 | 0 | 1 | 731 |
960 | 1573 | 17 | 6 | 1 | 0 | 0 | 768 |
860 | 1385 | NA | 2 | 1 | 0 | 0 | 653 |
1695 | 2931 | 28 | 3 | 1 | 0 | 1 | 1142 |
1553 | 2200 | 28 | 4 | 1 | 0 | 0 | 1035 |
1250 | 2277 | NA | 4 | 1 | 1 | 0 | NA |
1300 | 2000 | NA | 3 | 1 | 1 | 0 | 1076 |
1020 | 1478 | 53 | 3 | 1 | 0 | 1 | 626 |
1020 | 1713 | 30 | 4 | 1 | 0 | 1 | 600 |
922 | 1326 | NA | 4 | 1 | 0 | 0 | 668 |
925 | 1050 | NA | 2 | 1 | 0 | 1 | 553 |
899 | 1464 | NA | 2 | 1 | 1 | 0 | 566 |
850 | 1190 | 41 | 1 | 1 | 0 | 0 | 600 |
876 | 1156 | NA | 1 | 1 | 0 | 0 | NA |
890 | 1746 | NA | 2 | 1 | 0 | 0 | 591 |
870 | 1280 | NA | 1 | 1 | 0 | 0 | 599 |
700 | 1215 | NA | 3 | 1 | 0 | 0 | 477 |
720 | 1121 | 46 | 4 | 1 | 0 | 0 | 398 |
720 | 1050 | NA | 1 | 1 | 0 | 0 | NA |
749 | 1733 | 43 | 6 | 1 | 0 | 0 | 656 |
731 | 1299 | NA | 6 | 1 | 0 | 0 | 585 |
725 | 1140 | NA | 3 | 1 | 0 | 1 | 490 |
670 | 1181 | NA | 4 | 1 | 0 | 0 | 440 |
2150 | 2848 | 4 | 6 | 1 | 1 | 0 | 1487 |
1599 | 2440 | NA | 5 | 1 | 1 | 0 | 1265 |
1350 | 2253 | 23 | 4 | 1 | 1 | 0 | 939 |
1299 | 2743 | 25 | 5 | 1 | 1 | 1 | 1232 |
1250 | 2180 | 17 | 4 | 1 | 0 | 1 | 1141 |
1239 | 1706 | 14 | 4 | 1 | 0 | 0 | 810 |
1200 | 1948 | NA | 4 | 1 | 0 | 0 | 899 |
1125 | 1710 | 16 | 4 | 1 | 1 | 0 | 800 |
1100 | 1657 | NA | 4 | 1 | 0 | 0 | 865 |
1080 | 2200 | 26 | 4 | 1 | 0 | 0 | 1076 |
1050 | 1680 | 13 | 4 | 1 | 0 | 0 | 875 |
1049 | 1900 | 34 | 3 | 1 | 0 | 0 | 690 |
955 | 1565 | NA | 3 | 1 | 1 | 0 | 648 |
934 | 1543 | 20 | 3 | 1 | 0 | 0 | 820 |
875 | 1173 | 6 | 4 | 1 | 0 | 0 | 456 |
889 | 1549 | NA | 4 | 1 | 0 | 0 | 723 |
855 | 1900 | NA | 3 | 1 | 0 | 0 | 780 |
835 | 1560 | NA | 5 | 1 | 0 | 1 | 638 |
810 | 1365 | NA | 2 | 1 | 0 | 0 | 673 |
805 | 1258 | 7 | 4 | 1 | 0 | 1 | 821 |
799 | 1314 | NA | 2 | 1 | 0 | 0 | 671 |
750 | 1338 | NA | 3 | 1 | 0 | 1 | 649 |
759 | 997 | 4 | 4 | 1 | 0 | 0 | 461 |
755 | 1275 | NA | 5 | 1 | 0 | 0 | NA |
750 | 1030 | NA | 1 | 1 | 0 | 0 | 486 |
730 | 1027 | NA | 3 | 1 | 0 | 0 | 427 |
729 | 1007 | 19 | 6 | 1 | 0 | 0 | 513 |
710 | 1083 | 22 | 4 | 1 | 0 | 0 | 504 |
773 | 1320 | NA | 5 | 1 | 0 | 0 | NA |
690 | 1348 | 15 | 2 | 1 | 1 | 0 | NA |
670 | 1350 | NA | 2 | 1 | 0 | 0 | 622 |
619 | 837 | NA | 2 | 1 | 0 | 0 | 342 |
1295 | 3750 | NA | 4 | 0 | 1 | 1 | 1200 |
975 | 1500 | 7 | 3 | 0 | 1 | 1 | 700 |
939 | 1428 | 40 | 2 | 0 | 0 | 0 | 701 |
820 | 1375 | NA | 1 | 0 | 0 | 0 | 585 |
780 | 1080 | NA | 3 | 0 | 1 | 0 | 600 |
770 | 900 | NA | 3 | 0 | 0 | 0 | 391 |
700 | 1505 | NA | 2 | 0 | 0 | 1 | 591 |
620 | 1480 | NA | 4 | 0 | 0 | 0 | NA |
540 | 1142 | NA | 0 | 0 | 0 | 0 | 223 |
1070 | 1464 | NA | 2 | 0 | 0 | 0 | 376 |
2100 | 2116 | 25 | 3 | 0 | 1 | 0 | 1209 |
725 | 1280 | NA | 3 | 0 | 0 | 0 | 447 |
660 | 1159 | NA | 0 | 0 | 0 | 0 | 225 |
600 | 1198 | NA | 4 | 0 | 0 | 0 | NA |
580 | 1051 | 15 | 2 | 0 | 0 | 0 | 426 |
1844 | 2250 | 40 | 6 | 0 | 1 | 0 | 915 |
1580 | 2563 | NA | 2 | 0 | 1 | 0 | 1189 |
699 | 1400 | 45 | 1 | 0 | 1 | 1 | 481 |
1330 | 1850 | 5 | 5 | 0 | 1 | 1 | NA |
1160 | 1720 | 5 | 4 | 0 | 0 | 0 | 867 |
1109 | 1740 | 4 | 3 | 0 | 0 | 0 | 816 |
1129 | 1700 | 6 | 4 | 0 | 0 | 0 | 725 |
1050 | 1620 | 6 | 4 | 0 | 0 | 0 | 800 |
1045 | 1630 | 6 | 4 | 0 | 0 | 0 | 750 |
1050 | 1920 | 8 | 4 | 0 | 0 | 0 | 944 |
1020 | 1606 | 5 | 4 | 0 | 0 | 0 | 811 |
1000 | 1535 | 7 | 5 | 0 | 0 | 1 | 668 |
1030 | 1540 | 6 | 2 | 0 | 0 | 1 | 826 |
975 | 1739 | 13 | 3 | 0 | 0 | 0 | 880 |
950 | 1715 | NA | 3 | 0 | 0 | 0 | 900 |
940 | 1305 | 5 | 3 | 0 | 0 | 0 | 647 |
920 | 1415 | 7 | 4 | 0 | 0 | 0 | 866 |
945 | 1580 | 9 | 3 | 0 | 0 | 0 | 810 |
874 | 1236 | 3 | 4 | 0 | 0 | 0 | 707 |
872 | 1229 | 6 | 3 | 0 | 0 | 0 | 721 |
870 | 1273 | 4 | 4 | 0 | 0 | 0 | 638 |
869 | 1165 | 7 | 4 | 0 | 0 | 0 | 694 |
766 | 1200 | 7 | 4 | 0 | 0 | 1 | 634 |
739 | 970 | 4 | 4 | 0 | 0 | 1 | 541 |
Then I will build the multiple regression model to predict the Annual taxes. As requiement, the model will include one quadratic term ( FEATS^2 Number out of 11 features), one dichotomous term(NE, Located in northeast sector of city), and one dichotomous vs.quantitative interaction term: NE * FEATS^2
The null hypothesis is: house features and whether house is Located in northeast sector of city do not have effects on the annual tax.
house$FEATS_2 <- (house$FEATS)^2
house$FEATS_2_NE <- (house$FEATS_2)*(house$NE)
tax.lm <- lm(TAX ~ FEATS_2+NE+FEATS_2_NE, data=house)
summary(tax.lm)
##
## Call:
## lm(formula = TAX ~ FEATS_2 + NE + FEATS_2_NE, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -700.30 -159.13 -37.28 164.31 930.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 572.422 86.061 6.651 1.43e-09 ***
## FEATS_2 12.428 6.538 1.901 0.0601 .
## NE 107.433 104.003 1.033 0.3040
## FEATS_2_NE -2.734 7.166 -0.382 0.7036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.9 on 103 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.1578, Adjusted R-squared: 0.1333
## F-statistic: 6.432 on 3 and 103 DF, p-value: 0.0004901
The model dose not looks good as the p-value is larger than 0.05, suggesting that the house features and whether house is Located in northeast sector of city do not have effects on the annual tax.
Then I use the square of SQFT(Square feet of living space) as the quadratic term and keep NE(Located in northeast sector of city) as the dichotomous term.
The null hypothesis is: Square feet of living space and whether house is Located in northeast sector of city do not have effects on the annual tax.
house$SQFT_2 <- (house$SQFT)^2
house$SQFT_2_NE <- (house$SQFT_2)*(house$NE)
tax.lm <- lm(TAX ~ SQFT_2+NE+SQFT_2_NE, data=house)
summary(tax.lm)
##
## Call:
## lm(formula = TAX ~ SQFT_2 + NE + SQFT_2_NE, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.92 -90.39 10.67 92.47 405.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.194e+02 4.135e+01 12.562 < 2e-16 ***
## SQFT_2 7.104e-05 1.187e-05 5.986 3.17e-08 ***
## NE -1.530e+02 5.475e+01 -2.794 0.00621 **
## SQFT_2_NE 7.313e-05 1.510e-05 4.843 4.52e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 160.1 on 103 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.7377, Adjusted R-squared: 0.73
## F-statistic: 96.55 on 3 and 103 DF, p-value: < 2.2e-16
Square feet of living space and whether house is Located in northeast sector of city do not have effects on the annual tax.
the intercept is the fitted annual tax value when SQFT^2, NE, or SQFT^2*NE are equal to 0 for the Control units. In this context it is relatively meaningless since annual tax of 0 is unlikely to occur, we cannot therefore draw further interpretation from this coefficient.
Then slope 7.104e-05 means that if we hold NE and SQFT^2*NE constant an increase in 1 unit of SQFT^2 lead to an increase of 7.104e-05 dollar of annual tax.
Similarly slope -1.530e+02 means that if we hold SQFT^2 and SQFT^2*NE constant the house located at the northeast of the city will reduce the annual tax of 1.530e+02 dollar.
Finally slope 7.313e-05 means that if we hold SQFT^2 and NE constant the increase in 1 unit of the product of SQFT^2 and NE will incrase the annual tax of 7.313e-05 dollar.
Residual Analysis
Use the residual analysis techniques to check the validity of the assumptions.
plot(fitted(tax.lm),resid(tax.lm))
abline(0,0)
qqnorm(resid(tax.lm))
qqline(resid(tax.lm))
The variances of residuals areUniformly scattered about zero.
The Q-Q plot shows that the residuals follow the indicated line.
Taken together, using the square of SQFT(Square feet of living space) and NE(Located in northeast sector of city) can build a valid model. But the cofficients are too small(7e-05). This means that the square of the house area do not affect or predict the annual tax. It is interesting to see that the location at northeast sector of city negatively affect the annual tax of the house.
Other variants like age, corner location could be used to see whether we can build a better model to predict the annual tax of the house.