11.83
A realtor in a suburban area attempted to predict house prices solely on the basis of size. From a listing service, the realtor obtained size in thousands of square feet and asking price in thousands of dollars.
Price 210 145 168 352 234 148 217 216 213 143 178 131 181 148 127 158 226 194 166
Size 2.5 1.5 1.8 4.7 2.4 1.5 2.5 3.3 2.6 1.6 1.6 1.4 2.9 1.6 1.9 1.7 2.6 1.9 1.8
Price 207 139 143 141 142 214 262 191 167 153 153 184 123 182 143 144 161 157 155
Size 2.8 1.5 1.5 1.9 1.6 2.2 2.7 2.0 2.2 1.6 1.6 2.3 1.4 1.9 1.6 1.5 1.6 1.7 1.7
Price 203 147 173 160 219 156 169 133 154 220 151 188 153 215 144 125 152 132 164
Size 2.2 1.8 1.8 1.7 2.4 1.9 1.9 1.5 2.9 2.9 1.9 2.3 1.7 2.1 1.9 1.7 1.7 1.4 2.0
## [1] 0.8570019
Price tends to increase with size. The correlation coefficient is 0.8570019, which indicates a strong positive linear relationship.
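The correlation above could be reproduced along the following lines. This is a minimal sketch: the data frame name `info` is taken from the regression output, but how it was actually built is not shown in the source, so the construction below (and the name `r`) is an assumption.

```r
# House data transcribed from the exercise
# (price in thousands of dollars, size in thousands of square feet).
info <- data.frame(
  price = c(210, 145, 168, 352, 234, 148, 217, 216, 213, 143, 178, 131, 181, 148, 127, 158, 226, 194, 166,
            207, 139, 143, 141, 142, 214, 262, 191, 167, 153, 153, 184, 123, 182, 143, 144, 161, 157, 155,
            203, 147, 173, 160, 219, 156, 169, 133, 154, 220, 151, 188, 153, 215, 144, 125, 152, 132, 164),
  size = c(2.5, 1.5, 1.8, 4.7, 2.4, 1.5, 2.5, 3.3, 2.6, 1.6, 1.6, 1.4, 2.9, 1.6, 1.9, 1.7, 2.6, 1.9, 1.8,
           2.8, 1.5, 1.5, 1.9, 1.6, 2.2, 2.7, 2.0, 2.2, 1.6, 1.6, 2.3, 1.4, 1.9, 1.6, 1.5, 1.6, 1.7, 1.7,
           2.2, 1.8, 1.8, 1.7, 2.4, 1.9, 1.9, 1.5, 2.9, 2.9, 1.9, 2.3, 1.7, 2.1, 1.9, 1.7, 1.7, 1.4, 2.0)
)

# Pearson correlation between size and price.
r <- cor(info$size, info$price)
r

# In simple linear regression, r squared equals the multiple R-squared.
r^2
```

Squaring the correlation gives a useful cross-check against the Multiple R-squared reported by `summary()` for the same data.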
The house with a size of 4,700 square feet and a price of $352,000 is an outlier with a Cook's distance of 0.5 or higher. It is a high leverage point because its size lies far from the other sizes, and it is influential: removing it lowers the adjusted R-squared from 0.7296 to 0.59.
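A point like this might be screened for with the standard influence diagnostics in base R. The sketch below assumes the data sit in a data frame named `info` (the name used in the output); the object names `fit`, `lev`, `cd`, and `worst` are illustrative only and not from the source.

```r
# House data transcribed from the exercise
# (price in thousands of dollars, size in thousands of square feet).
info <- data.frame(
  price = c(210, 145, 168, 352, 234, 148, 217, 216, 213, 143, 178, 131, 181, 148, 127, 158, 226, 194, 166,
            207, 139, 143, 141, 142, 214, 262, 191, 167, 153, 153, 184, 123, 182, 143, 144, 161, 157, 155,
            203, 147, 173, 160, 219, 156, 169, 133, 154, 220, 151, 188, 153, 215, 144, 125, 152, 132, 164),
  size = c(2.5, 1.5, 1.8, 4.7, 2.4, 1.5, 2.5, 3.3, 2.6, 1.6, 1.6, 1.4, 2.9, 1.6, 1.9, 1.7, 2.6, 1.9, 1.8,
           2.8, 1.5, 1.5, 1.9, 1.6, 2.2, 2.7, 2.0, 2.2, 1.6, 1.6, 2.3, 1.4, 1.9, 1.6, 1.5, 1.6, 1.7, 1.7,
           2.2, 1.8, 1.8, 1.7, 2.4, 1.9, 1.9, 1.5, 2.9, 2.9, 1.9, 2.3, 1.7, 2.1, 1.9, 1.7, 1.7, 1.4, 2.0)
)

fit <- lm(price ~ size, data = info)

lev <- hatvalues(fit)        # leverage: how extreme each size is
cd  <- cooks.distance(fit)   # influence: how much the fit moves if the point is dropped

# Observation with the largest Cook's distance, and its diagnostics.
worst <- unname(which.max(cd))
info[worst, ]
c(leverage = lev[[worst]], cooks_distance = cd[[worst]])
```

Leverage depends only on the size values, while Cook's distance combines leverage with the size of the residual, which is why both are worth inspecting.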
##
## Call:
## lm(formula = info$price ~ info$size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.523 -8.400 0.308 12.211 48.282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.347 10.029 5.419 1.37e-06 ***
## info$size 59.026 4.786 12.334 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.92 on 55 degrees of freedom
## Multiple R-squared: 0.7345, Adjusted R-squared: 0.7296
## F-statistic: 152.1 on 1 and 55 DF, p-value: < 2.2e-16
The regression equation is Price = 54.347 + 59.026*size.
##
## Call:
## lm(formula = info1$price ~ info1$size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.772 -8.298 -0.649 10.878 52.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.228 12.250 5.161 3.61e-06 ***
## info1$size 54.325 6.068 8.953 2.95e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.81 on 54 degrees of freedom
## Multiple R-squared: 0.5975, Adjusted R-squared: 0.59
## F-statistic: 80.15 on 1 and 54 DF, p-value: 2.954e-12
The new regression equation is Price = 63.228 + 54.325*size. Without the outlier, the slope decreases by 59.026 - 54.325 = 4.701. The slope decreases because the outlier lay above the regression line at the far right of the size range, which pulled the slope upward.
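As a quick check on the direction of this change, the removed house (size 4.7, price 352) can be compared with the value the outlier-excluded line predicts for it. This is only a worked illustration using the rounded coefficients from the output:

```latex
\begin{align*}
\widehat{\text{Price}}(4.7) &= 63.228 + 54.325 \times 4.7 \approx 318.56 \\
352 - 318.56 &\approx 33.44 > 0
\end{align*}
```

The observed price sits about 33 thousand dollars above the line fitted to the other houses, which is consistent with the point pulling the slope upward when it is included.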
Residual standard deviation with the outlier: 20.92. Without the outlier: 20.81. They differ by 20.92 - 20.81 = 0.11. Removing the outlier has barely changed the residual standard deviation, so it has not made the data more homogeneous. The typical size of the prediction errors is essentially unchanged.
11.85
## fit lwr upr
## 1 334.8552 278.865 390.8454
We are 95% confident that the price of a 5,000 sq ft home is between $278,865 and $390,845 (the output is in thousands of dollars). The issue is that 5,000 sq ft lies outside the range of sizes in the data, so this prediction is an extrapolation and is not a wise one to rely on.
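The fitted value of 334.8552 matches the outlier-excluded equation (63.228 + 54.325*5), so the interval appears to come from that model. A sketch of how such an interval might be obtained follows; the names `info1`, `fit1`, and `pred` are assumptions for illustration, and the model is written with a `data =` argument because `predict()` needs the predictor to be a named column to accept `newdata`.

```r
# House data transcribed from the exercise
# (price in thousands of dollars, size in thousands of square feet).
info <- data.frame(
  price = c(210, 145, 168, 352, 234, 148, 217, 216, 213, 143, 178, 131, 181, 148, 127, 158, 226, 194, 166,
            207, 139, 143, 141, 142, 214, 262, 191, 167, 153, 153, 184, 123, 182, 143, 144, 161, 157, 155,
            203, 147, 173, 160, 219, 156, 169, 133, 154, 220, 151, 188, 153, 215, 144, 125, 152, 132, 164),
  size = c(2.5, 1.5, 1.8, 4.7, 2.4, 1.5, 2.5, 3.3, 2.6, 1.6, 1.6, 1.4, 2.9, 1.6, 1.9, 1.7, 2.6, 1.9, 1.8,
           2.8, 1.5, 1.5, 1.9, 1.6, 2.2, 2.7, 2.0, 2.2, 1.6, 1.6, 2.3, 1.4, 1.9, 1.6, 1.5, 1.6, 1.7, 1.7,
           2.2, 1.8, 1.8, 1.7, 2.4, 1.9, 1.9, 1.5, 2.9, 2.9, 1.9, 2.3, 1.7, 2.1, 1.9, 1.7, 1.7, 1.4, 2.0)
)

# Drop the 4,700 sq ft house (assumed to be how the reduced data set was formed).
info1 <- info[info$size < 4.7, ]
fit1 <- lm(price ~ size, data = info1)

# 95% prediction interval for a single 5,000 sq ft home (size = 5).
pred <- predict(fit1, newdata = data.frame(size = 5),
                interval = "prediction", level = 0.95)
pred

# Range of sizes actually used in the fit, to show that size = 5 is an extrapolation.
range(info1$size)
```

Comparing `range(info1$size)` with the requested size makes the extrapolation explicit before the interval is interpreted.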
The constant variance assumption does not seem reasonable because the variability increases with size.
The interval for predicting the price of a 5,000 sq ft home using this model would be incorrect because the variance increases with size, so the constant variance assumption is not reasonable. The interval should be wider than the one that was computed.
11.86
A lawn care company tried to predict the demand for its service by zip code, using the housing density in the zip code area as a predictor. The owners obtained the number of houses and the geographic size of each zip code and calculated their sales per thousand homes and number of homes per acre.
Sales 54 72 54 62 72 83 115 90 66 60 100 78 152 87 54 82
Density 6.5 4.6 5.5 4.6 4.2 4.3 2.3 3.5 3.2 8.4 3.4 4.0 2.0 3.2 6.7 3.0
Sales 59 183 171 96 134 79 94 82 66 62 45 69 65 81 94 117
Density 5.7 1.3 1.3 3.0 2.2 4.3 2.6 3.0 4.3 7.8 9.4 4.2 5.9 6.2 2.8 2.4
## [1] -0.7707266
The correlation coefficient for sales and density is -0.7707266. This means that there is a negative linear correlation between sales and density. In other words, as housing density increases, sales per thousand homes tend to decrease.
##
## Call:
## lm(formula = lawn$sales ~ lawn$density)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.269 -14.348 -6.625 9.665 58.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.525 9.109 15.538 6.85e-16 ***
## lawn$density -12.893 1.946 -6.625 2.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.74 on 30 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.5805
## F-statistic: 43.9 on 1 and 30 DF, p-value: 2.464e-07
The regression equation is Sales = 141.525 - 12.893*Density. The estimated intercept of 141.525 is the predicted sales per thousand homes when the density is 0; no zip code in the data has a density near 0, so this value is an extrapolation. The estimated slope of -12.893 means that for every additional home per acre, predicted sales per thousand homes decrease by 12.893.
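To illustrate how the equation is used, consider a zip code with 4.0 homes per acre. This density is chosen only as an example; it lies inside the observed range of 1.3 to 9.4.

```latex
\begin{align*}
\widehat{\text{Sales}} &= 141.525 - 12.893 \times 4.0 \\
                       &= 141.525 - 51.572 \\
                       &= 89.953
\end{align*}
```

The model predicts roughly 90 sales per thousand homes at that density.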
The residual standard error, 21.74, measures the typical size of the difference between observed and predicted values. By the empirical rule, about 95% of the prediction errors fall within 2 standard deviations of the mean. In least squares regression the residuals have mean 0, so about 95% of the errors fall within 0 plus or minus 2(21.74) = 43.48.
11.87
Obtain a value of the t statistic for the regression model of Exercise 11.86. Is there conclusive evidence that density is a predictor of sales?
lawn$density -12.893 1.946 -6.625 2.46e-07 ***
The t value for density is -6.625.
H0: β1 = 0
Ha: β1 ≠ 0
We reject H0 if the p-value is less than .05. The p-value for density, 2.46e-07, is less than .05, so we reject the null hypothesis and conclude that there is significant evidence that the slope is not equal to 0, meaning density is a useful linear predictor of sales.
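The t value in the output can be recovered by hand from the estimate and its standard error, which serves as a check on how the statistic is defined:

```latex
\[
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}
  = \frac{-12.893}{1.946}
  \approx -6.625,
\qquad \text{df} = n - 2 = 32 - 2 = 30
\]
```

The degrees of freedom agree with the 30 reported alongside the residual standard error.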
Calculate a 95% confidence interval for the true value of the slope. The package should have calculated the standard error for you.
## 2.5 % 97.5 %
## (Intercept) 122.92314 160.127752
## lawn$density -16.86676 -8.918432
The 95% confidence interval for the true value of the slope is (-16.86676, -8.918432).
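The same interval can be approximated by hand from the estimate and standard error in the regression output. The critical value 2.042 is the tabled t value for 30 degrees of freedom at the 0.025 level; the small discrepancy in the last digits comes from rounding in the printed coefficients.

```latex
\begin{align*}
\hat{\beta}_1 \pm t_{0.025,\,30}\,\mathrm{SE}(\hat{\beta}_1)
  &= -12.893 \pm 2.042 \times 1.946 \\
  &\approx -12.893 \pm 3.974 \\
  &\approx (-16.867,\ -8.919)
\end{align*}
```

Because the interval lies entirely below 0, it agrees with the conclusion of the t test that the slope is negative.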