MARR 4.3

The Sunday April 15, 2007 issue of the Houston Chronicle included a section devoted to real estate prices in Houston. In particular, data are presented on the 2006 median price per square foot for 1922 subdivisions. Interest centers on developing a regression model to predict:

\[Y_i = \text{2006 median price per square foot}\]

from

\[x_{1i} = \text{% New Homes (i.e., of the houses that sold in 2006, the percentage that were built in 2005 or 2006)}\]

\[x_{2i} = \text{% Foreclosures (i.e., of the houses that sold in 2006, the percentage that were identified as foreclosures)}\]

for the i = 1, … 1922 subdivisions. The first model considered was

\[Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e\]

Model 4.6 was fit used weighted least squares with weights,

\[w_i = n_i\]

where \(n_i\) = the number of homes sold in subdivision i in 2006. Output from model 4.6 belows below.

url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%2010/HoustonRealEstate.txt'
houston <- read.table(url, header = TRUE)
head(houston)

##       Yi ni   x1i   x2i
## 1 169.20  7 0.857 0.000
## 2  56.82  6 0.167 0.667
## 3  25.52  6 0.000 1.000
## 4  90.67  5 0.800 0.200
## 5  92.65  8 0.500 0.000
## 6  87.76  9 0.667 0.000

Explain it is necessary to use weighted least squares to fit model (4.6) and why \(w_i = n_i\) is the appropriate choice for the weights.

# Assume OLS model
houston_lm <- lm(Yi ~ x1i + x2i, data=houston)
summary(houston_lm)

## 
## Call:
## lm(formula = Yi ~ x1i + x2i, data = houston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.54 -16.64  -9.02   1.94 385.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   89.469      1.138  78.603   <2e-16 ***
## x1i            5.867      3.023   1.941   0.0524 .  
## x2i          -90.896      5.915 -15.368   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.27 on 1919 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:   0.12 
## F-statistic:   132 on 2 and 1919 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(houston_lm)

There is noted heterosckedascity.

plot(houston_lm$residuals ~ houston_lm$fitted.values)

Why is ni the appropriate weights?

Remember:

\[w_i = \frac{1}{\text{(Standard Deviation of }Y_i)^2}\]

# Sorting the weights
head(houston[with(houston, order(ni)),])

##        Yi ni x1i x2i
## 4   90.67  5 0.8 0.2
## 7   70.78  5 0.0 0.2
## 9  151.88  5 0.0 0.0
## 11  74.42  5 0.0 0.0
## 15  66.12  5 0.6 0.0
## 28  80.99  5 0.2 0.4

Explain why (4.6) is not a valid regression model.

It violates the assumption of homoscedascity.

Describe what steps you would take to obtain a valid regression model (Fig. 4.1)

You either can perform a transformation or provide weights.

LMR 8.1

Researchers at National Institutes of Standards and Technology (NIST) collected pipeline data on ultrasonic measurements of the depth of defects in the Alaska pipeline in the field. The depth of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and so can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive. We want to develop a regression equation for correcting in-field measurements.

data(pipeline, package="faraway")
head(pipeline)

##   Field  Lab Batch
## 1    18 20.2     1
## 2    38 56.0     1
## 3    15 12.5     1
## 4    20 21.2     1
## 5    18 15.5     1
## 6    36 39.0     1

Fit a regression model Lab ~ Field. Check for non-constant variance.

pipeline_lm <- lm(Lab ~ Field, data = pipeline)
pipeline_lm

## 
## Call:
## lm(formula = Lab ~ Field, data = pipeline)
## 
## Coefficients:
## (Intercept)        Field  
##      -1.967        1.223

summary(pipeline_lm)

## 
## Call:
## lm(formula = Lab ~ Field, data = pipeline)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.985  -4.072  -1.431   2.504  24.334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.96750    1.57479  -1.249    0.214    
## Field        1.22297    0.04107  29.778   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.865 on 105 degrees of freedom
## Multiple R-squared:  0.8941, Adjusted R-squared:  0.8931 
## F-statistic: 886.7 on 1 and 105 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(pipeline_lm)

Heterskedascity noted in the residual plot.

We wish to use weights to account for the non-constant variance. Here we split the range of Field into 12 groups of size nine (except for the last group which has only eight values). Within each group, we compute the variance of Lab as varlab and the mean of Field as meanfield. Supposing pipeline is the name of your data frame, the following R code will make the needed computations:

data(pipeline, package="faraway")
i <- order(pipeline$Field)
npipe <- pipeline[i,]
ff <- gl(12,9)[-108]
meanfield <- unlist(lapply(split(npipe$Field,ff),mean))
varlab <- unlist(lapply(split(npipe$Lab, ff), var))

Suppose we guess that the variance in the response is linked to the predictor in the following way

\[var(Lab) = a_0 Field^{a1}\]

**Regress log(varlab) on log(meanfield) to estimate \(a_0\) and \(a_1\). (You might choose to remove the last point.) Use this to determine appropriate weights in a WLS fit of Lab on Field. Show the regression summary.

new_pipeline <- data.frame(l_varlab = log(varlab), l_meanfield = log(meanfield))
new_pipeline_lm <- lm(l_varlab ~ l_meanfield, new_pipeline, weights=1/meanfield)
new_pipeline_lm

## 
## Call:
## lm(formula = l_varlab ~ l_meanfield, data = new_pipeline, weights = 1/meanfield)
## 
## Coefficients:
## (Intercept)  l_meanfield  
##    -0.07856      1.03344

summary(new_pipeline_lm)

## 
## Call:
## lm(formula = l_varlab ~ l_meanfield, data = new_pipeline, weights = 1/meanfield)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25034 -0.12237  0.02634  0.13204  0.17924 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.07856    1.02902  -0.076   0.9406  
## l_meanfield  1.03344    0.34426   3.002   0.0133 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.158 on 10 degrees of freedom
## Multiple R-squared:  0.474,  Adjusted R-squared:  0.4214 
## F-statistic: 9.011 on 1 and 10 DF,  p-value: 0.0133

par(mfrow=c(2,2))
plot(new_pipeline_lm)

(Will need to double check this.)

An alternative to weighting is transformation. Find transformations on Lab and/or Field so that in the transformed scale the relationship is approximately linear with constant variance. You may restrict your choice of transformation to square root, log and inverse.

library(MASS)

## Warning: package 'MASS' was built under R version 3.4.3

library(caret)

## Warning: package 'caret' was built under R version 3.4.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2018c.
## 1.0/zoneinfo/America/New_York'

BoxCoxTrans(pipeline$Field)

## Box-Cox Transformation
## 
## 107 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   18.00   35.00   33.58   46.50   85.00 
## 
## Largest/Smallest: 17 
## Sample Skewness: 0.409 
## 
## Estimated Lambda: 0.4

BoxCoxTrans(pipeline$Lab)

## Box-Cox Transformation
## 
## 107 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.30   18.35   38.00   39.10   55.55   81.90 
## 
## Largest/Smallest: 19 
## Sample Skewness: 0.479 
## 
## Estimated Lambda: 0.3

The Box-Cox Transformation suggests that we transformed both the predictor and response variable to square root (given that we are restricted).

new_pipeline1 <- data.frame(sqrt_lab = sqrt(pipeline$Lab), sqrt_field = sqrt(pipeline$Field))
new_pipeline_lm1 <- lm(sqrt_lab ~ sqrt_field, new_pipeline1)
new_pipeline_lm1

## 
## Call:
## lm(formula = sqrt_lab ~ sqrt_field, data = new_pipeline1)
## 
## Coefficients:
## (Intercept)   sqrt_field  
##     -0.3677       1.1355

summary(new_pipeline_lm1)

## 
## Call:
## lm(formula = sqrt_lab ~ sqrt_field, data = new_pipeline1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1570 -0.4125 -0.1209  0.3098  1.5481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.36773    0.18815  -1.954   0.0533 .  
## sqrt_field   1.13553    0.03247  34.973   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5561 on 105 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.9202 
## F-statistic:  1223 on 1 and 105 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(new_pipeline_lm1)

This is a much better fit.

LMR 8.9

Use the fat data, fitting the model described in Section 4.2.

data(fat, package="faraway")
head(fat)

##   brozek siri density age weight height adipos  free neck chest abdom
## 1   12.6 12.3  1.0708  23 154.25  67.75   23.7 134.9 36.2  93.1  85.2
## 2    6.9  6.1  1.0853  22 173.25  72.25   23.4 161.3 38.5  93.6  83.0
## 3   24.6 25.3  1.0414  22 154.00  66.25   24.7 116.0 34.0  95.8  87.9
## 4   10.9 10.4  1.0751  26 184.75  72.25   24.9 164.7 37.4 101.8  86.4
## 5   27.8 28.7  1.0340  24 184.25  71.25   25.6 133.1 34.4  97.3 100.0
## 6   20.6 20.9  1.0502  24 210.25  74.75   26.5 167.0 39.0 104.5  94.4
##     hip thigh knee ankle biceps forearm wrist
## 1  94.5  59.0 37.3  21.9   32.0    27.4  17.1
## 2  98.7  58.7 37.3  23.4   30.5    28.9  18.2
## 3  99.2  59.6 38.9  24.0   28.8    25.2  16.6
## 4 101.2  60.1 37.3  22.8   32.4    29.4  18.2
## 5 101.9  63.2 42.2  24.0   32.2    27.7  17.7
## 6 107.8  66.0 42.0  25.6   35.7    30.6  18.8

Fit the same model but now using Huber’s robust method. Comment on any substantial differences between this model and the least squares fit.

rlmod <- rlm(brozek ~ age + weight + height + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data=fat)
fat_lm <- lm(brozek ~ age + weight + height + neck + chest + abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, data=fat)

summary(rlmod)

## 
## Call: rlm(formula = brozek ~ age + weight + height + neck + chest + 
##     abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, 
##     data = fat)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3964  -2.7352  -0.1171   2.8008   9.4446 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) -11.3460  17.1216    -0.6627
## age           0.0650   0.0319     2.0368
## weight       -0.0643   0.0528    -1.2163
## height       -0.0625   0.0948    -0.6595
## neck         -0.4553   0.2294    -1.9846
## chest        -0.0256   0.0978    -0.2614
## abdom         0.8778   0.0853    10.2891
## hip          -0.2142   0.1440    -1.4872
## thigh         0.2632   0.1425     1.8473
## knee         -0.1076   0.2388    -0.4505
## ankle         0.1815   0.2186     0.8306
## biceps        0.1367   0.1689     0.8091
## forearm       0.4152   0.1965     2.1126
## wrist        -1.5739   0.5279    -2.9812
## 
## Residual standard error: 4.073 on 238 degrees of freedom

summary(fat_lm)

## 
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest + 
##     abdom + hip + thigh + knee + ankle + biceps + forearm + wrist, 
##     data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.264  -2.572  -0.097   2.898   9.327 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.29255   16.06992  -0.952  0.34225    
## age           0.05679    0.02996   1.895  0.05929 .  
## weight       -0.08031    0.04958  -1.620  0.10660    
## height       -0.06460    0.08893  -0.726  0.46830    
## neck         -0.43754    0.21533  -2.032  0.04327 *  
## chest        -0.02360    0.09184  -0.257  0.79740    
## abdom         0.88543    0.08008  11.057  < 2e-16 ***
## hip          -0.19842    0.13516  -1.468  0.14341    
## thigh         0.23190    0.13372   1.734  0.08418 .  
## knee         -0.01168    0.22414  -0.052  0.95850    
## ankle         0.16354    0.20514   0.797  0.42614    
## biceps        0.15280    0.15851   0.964  0.33605    
## forearm       0.43049    0.18445   2.334  0.02044 *  
## wrist        -1.47654    0.49552  -2.980  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7353 
## F-statistic: 54.63 on 13 and 238 DF,  p-value: < 2.2e-16

Identify which two cases have the lowest weights in the Huber fit. What is unusual about these two points?

wts <- rlmod$w
names(wts) <- row.names(fat)
head(sort(wts),10)

##       224       207        39       231       225        81        82 
## 0.5269652 0.5800712 0.5987751 0.6260611 0.6304030 0.6367342 0.6461965 
##       204       128       135 
## 0.6487121 0.6829132 0.7005233

fat[224,]

##     brozek siri density age weight height adipos  free neck chest abdom
## 224    6.1  5.2  1.0874  55 142.25  67.25   22.2 133.6 35.2  92.7  82.8
##      hip thigh knee ankle biceps forearm wrist
## 224 91.9  54.4 35.2  22.5   29.4    26.8    17

fat[207,]

##     brozek siri density age weight height adipos  free neck chest abdom
## 207   31.7 32.9   1.025  44    166   65.5   27.2 113.5 39.1 100.6  93.9
##       hip thigh knee ankle biceps forearm wrist
## 207 100.1  58.9 37.6  21.4   33.1    29.5  17.3

These two points are outliers. 224 is very small (brozek) and 207 is quite high (brozek).

Plot weight (of the man) against height. Identify two outlying cases. Are these the same as those identified in the previous question? Discuss.

plot(fat$height ~ fat$weight)

There appears to be one person with a very low height and another man who is quite heavyset. Let’s identify these two persons.

which(fat$height == min(fat$height))

## [1] 42

Low height outlier: 42

which(fat$weight == max(fat$weight))

## [1] 39

Heavy outlier: 39

These are different from part B because these are outliers for height and weight, and not for brozek.

CUNY 621 Suggested Exercises Week 10

Joel Park

4/12/2018

MARR 4.3

LMR 8.1

LMR 8.9