The Sunday April 15, 2007 issue of the Houston Chronicle included a section devoted to real estate prices in Houston. In particular, data are presented on the 2006 median price per square foot for 1922 subdivisions. Interest centers on developing a regression model to predict:
\[Y_i = \text{2006 median price per square foot}\]
from
\[x_{1i} = \text{% New Homes (i.e., of the houses that sold in 2006, the percentage that were built in 2005 or 2006)}\]
\[x_{2i} = \text{% Foreclosures (i.e., of the houses that sold in 2006, the percentage that were identified as foreclosures)}\]
for the i = 1, … 1922 subdivisions. The first model considered was
\[Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e\]
Model 4.6 was fit used weighted least squares with weights,
\[w_i = n_i\]
where \(n_i\) = the number of homes sold in subdivision i in 2006. Output from model 4.6 belows below.
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%2010/HoustonRealEstate.txt'
houston <- read.table(url, header = TRUE)
head(houston)
## Yi ni x1i x2i
## 1 169.20 7 0.857 0.000
## 2 56.82 6 0.167 0.667
## 3 25.52 6 0.000 1.000
## 4 90.67 5 0.800 0.200
## 5 92.65 8 0.500 0.000
## 6 87.76 9 0.667 0.000
# Assume OLS model
houston_lm <- lm(Yi ~ x1i + x2i, data=houston)
summary(houston_lm)
##
## Call:
## lm(formula = Yi ~ x1i + x2i, data = houston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.54 -16.64 -9.02 1.94 385.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.469 1.138 78.603 <2e-16 ***
## x1i 5.867 3.023 1.941 0.0524 .
## x2i -90.896 5.915 -15.368 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.27 on 1919 degrees of freedom
## Multiple R-squared: 0.1209, Adjusted R-squared: 0.12
## F-statistic: 132 on 2 and 1919 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(houston_lm)
There is noted heterosckedascity.
plot(houston_lm$residuals ~ houston_lm$fitted.values)
Why is ni the appropriate weights?
Remember:
\[w_i = \frac{1}{\text{(Standard Deviation of }Y_i)^2}\]
# Sorting the weights
head(houston[with(houston, order(ni)),])
## Yi ni x1i x2i
## 4 90.67 5 0.8 0.2
## 7 70.78 5 0.0 0.2
## 9 151.88 5 0.0 0.0
## 11 74.42 5 0.0 0.0
## 15 66.12 5 0.6 0.0
## 28 80.99 5 0.2 0.4
It violates the assumption of homoscedascity.
You either can perform a transformation or provide weights.
Researchers at National Institutes of Standards and Technology (NIST) collected pipeline
data on ultrasonic measurements of the depth of defects in the Alaska pipeline in the field. The depth of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and so can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive. We want to develop a regression equation for correcting in-field measurements.
data(pipeline, package="faraway")
head(pipeline)
## Field Lab Batch
## 1 18 20.2 1
## 2 38 56.0 1
## 3 15 12.5 1
## 4 20 21.2 1
## 5 18 15.5 1
## 6 36 39.0 1
Lab ~ Field
. Check for non-constant variance.pipeline_lm <- lm(Lab ~ Field, data = pipeline)
pipeline_lm
##
## Call:
## lm(formula = Lab ~ Field, data = pipeline)
##
## Coefficients:
## (Intercept) Field
## -1.967 1.223
summary(pipeline_lm)
##
## Call:
## lm(formula = Lab ~ Field, data = pipeline)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.985 -4.072 -1.431 2.504 24.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.96750 1.57479 -1.249 0.214
## Field 1.22297 0.04107 29.778 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.865 on 105 degrees of freedom
## Multiple R-squared: 0.8941, Adjusted R-squared: 0.8931
## F-statistic: 886.7 on 1 and 105 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(pipeline_lm)
Heterskedascity noted in the residual plot.
Field
into 12 groups of size nine (except for the last group which has only eight values). Within each group, we compute the variance of Lab
as varlab
and the mean of Field
as meanfield
. Supposing pipeline
is the name of your data frame, the following R code will make the needed computations:data(pipeline, package="faraway")
i <- order(pipeline$Field)
npipe <- pipeline[i,]
ff <- gl(12,9)[-108]
meanfield <- unlist(lapply(split(npipe$Field,ff),mean))
varlab <- unlist(lapply(split(npipe$Lab, ff), var))
Suppose we guess that the variance in the response is linked to the predictor in the following way
\[var(Lab) = a_0 Field^{a1}\]
**Regress log(varlab)
on log(meanfield)
to estimate \(a_0\) and \(a_1\). (You might choose to remove the last point.) Use this to determine appropriate weights in a WLS fit of Lab
on Field
. Show the regression summary.
new_pipeline <- data.frame(l_varlab = log(varlab), l_meanfield = log(meanfield))
new_pipeline_lm <- lm(l_varlab ~ l_meanfield, new_pipeline, weights=1/meanfield)
new_pipeline_lm
##
## Call:
## lm(formula = l_varlab ~ l_meanfield, data = new_pipeline, weights = 1/meanfield)
##
## Coefficients:
## (Intercept) l_meanfield
## -0.07856 1.03344
summary(new_pipeline_lm)
##
## Call:
## lm(formula = l_varlab ~ l_meanfield, data = new_pipeline, weights = 1/meanfield)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -0.25034 -0.12237 0.02634 0.13204 0.17924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07856 1.02902 -0.076 0.9406
## l_meanfield 1.03344 0.34426 3.002 0.0133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.158 on 10 degrees of freedom
## Multiple R-squared: 0.474, Adjusted R-squared: 0.4214
## F-statistic: 9.011 on 1 and 10 DF, p-value: 0.0133
par(mfrow=c(2,2))
plot(new_pipeline_lm)
(Will need to double check this.)
Lab
and/or Field
so that in the transformed scale the relationship is approximately linear with constant variance. You may restrict your choice of transformation to square root, log and inverse.library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/New_York'
BoxCoxTrans(pipeline$Field)
## Box-Cox Transformation
##
## 107 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 18.00 35.00 33.58 46.50 85.00
##
## Largest/Smallest: 17
## Sample Skewness: 0.409
##
## Estimated Lambda: 0.4
BoxCoxTrans(pipeline$Lab)
## Box-Cox Transformation
##
## 107 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.30 18.35 38.00 39.10 55.55 81.90
##
## Largest/Smallest: 19
## Sample Skewness: 0.479
##
## Estimated Lambda: 0.3
The Box-Cox Transformation suggests that we transformed both the predictor and response variable to square root (given that we are restricted).
new_pipeline1 <- data.frame(sqrt_lab = sqrt(pipeline$Lab), sqrt_field = sqrt(pipeline$Field))
new_pipeline_lm1 <- lm(sqrt_lab ~ sqrt_field, new_pipeline1)
new_pipeline_lm1
##
## Call:
## lm(formula = sqrt_lab ~ sqrt_field, data = new_pipeline1)
##
## Coefficients:
## (Intercept) sqrt_field
## -0.3677 1.1355
summary(new_pipeline_lm1)
##
## Call:
## lm(formula = sqrt_lab ~ sqrt_field, data = new_pipeline1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1570 -0.4125 -0.1209 0.3098 1.5481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.36773 0.18815 -1.954 0.0533 .
## sqrt_field 1.13553 0.03247 34.973 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5561 on 105 degrees of freedom
## Multiple R-squared: 0.9209, Adjusted R-squared: 0.9202
## F-statistic: 1223 on 1 and 105 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(new_pipeline_lm1)
This is a much better fit.
Use the fat
data, fitting the model described in Section 4.2.
data(fat, package="faraway")
head(fat)
## brozek siri density age weight height adipos free neck chest abdom
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4
## hip thigh knee ankle biceps forearm wrist
## 1 94.5 59.0 37.3 21.9 32.0 27.4 17.1
## 2 98.7 58.7 37.3 23.4 30.5 28.9 18.2
## 3 99.2 59.6 38.9 24.0 28.8 25.2 16.6
## 4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
## 5 101.9 63.2 42.2 24.0 32.2 27.7 17.7
## 6 107.8 66.0 42.0 25.6 35.7 30.6 18.8
rlmod <- rlm(brozek ~ age + weight + height + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data=fat)
fat_lm <- lm(brozek ~ age + weight + height + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data=fat)
summary(rlmod)
##
## Call: rlm(formula = brozek ~ age + weight + height + neck + chest +
## abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,
## data = fat)
## Residuals:
## Min 1Q Median 3Q Max
## -10.3964 -2.7352 -0.1171 2.8008 9.4446
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -11.3460 17.1216 -0.6627
## age 0.0650 0.0319 2.0368
## weight -0.0643 0.0528 -1.2163
## height -0.0625 0.0948 -0.6595
## neck -0.4553 0.2294 -1.9846
## chest -0.0256 0.0978 -0.2614
## abdom 0.8778 0.0853 10.2891
## hip -0.2142 0.1440 -1.4872
## thigh 0.2632 0.1425 1.8473
## knee -0.1076 0.2388 -0.4505
## ankle 0.1815 0.2186 0.8306
## biceps 0.1367 0.1689 0.8091
## forearm 0.4152 0.1965 2.1126
## wrist -1.5739 0.5279 -2.9812
##
## Residual standard error: 4.073 on 238 degrees of freedom
summary(fat_lm)
##
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest +
## abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,
## data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.264 -2.572 -0.097 2.898 9.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.29255 16.06992 -0.952 0.34225
## age 0.05679 0.02996 1.895 0.05929 .
## weight -0.08031 0.04958 -1.620 0.10660
## height -0.06460 0.08893 -0.726 0.46830
## neck -0.43754 0.21533 -2.032 0.04327 *
## chest -0.02360 0.09184 -0.257 0.79740
## abdom 0.88543 0.08008 11.057 < 2e-16 ***
## hip -0.19842 0.13516 -1.468 0.14341
## thigh 0.23190 0.13372 1.734 0.08418 .
## knee -0.01168 0.22414 -0.052 0.95850
## ankle 0.16354 0.20514 0.797 0.42614
## biceps 0.15280 0.15851 0.964 0.33605
## forearm 0.43049 0.18445 2.334 0.02044 *
## wrist -1.47654 0.49552 -2.980 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
## F-statistic: 54.63 on 13 and 238 DF, p-value: < 2.2e-16
wts <- rlmod$w
names(wts) <- row.names(fat)
head(sort(wts),10)
## 224 207 39 231 225 81 82
## 0.5269652 0.5800712 0.5987751 0.6260611 0.6304030 0.6367342 0.6461965
## 204 128 135
## 0.6487121 0.6829132 0.7005233
fat[224,]
## brozek siri density age weight height adipos free neck chest abdom
## 224 6.1 5.2 1.0874 55 142.25 67.25 22.2 133.6 35.2 92.7 82.8
## hip thigh knee ankle biceps forearm wrist
## 224 91.9 54.4 35.2 22.5 29.4 26.8 17
fat[207,]
## brozek siri density age weight height adipos free neck chest abdom
## 207 31.7 32.9 1.025 44 166 65.5 27.2 113.5 39.1 100.6 93.9
## hip thigh knee ankle biceps forearm wrist
## 207 100.1 58.9 37.6 21.4 33.1 29.5 17.3
These two points are outliers. 224 is very small (brozek) and 207 is quite high (brozek).
plot(fat$height ~ fat$weight)
There appears to be one person with a very low height and another man who is quite heavyset. Let’s identify these two persons.
which(fat$height == min(fat$height))
## [1] 42
Low height outlier: 42
which(fat$weight == max(fat$weight))
## [1] 39
Heavy outlier: 39
These are different from part B because these are outliers for height and weight, and not for brozek.