STA 5206 Assignment#7 Danilo Martinez

11.83 A realtor in a suburban area attempted to predict house prices solely on the basis of size. From a listing service, the realtor obtained size in thousands of square feet and asking price in thousands of dollars.

Price 210 145 168 352 234 148 217 216 213 143 178 131 181 148 127 158 226 194 166
Size 2.5 1.5 1.8 4.7 2.4 1.5 2.5 3.3 2.6 1.6 1.6 1.4 2.9 1.6 1.9 1.7 2.6 1.9 1.8

Price 207 139 143 141 142 214 262 191 167 153 153 184 123 182 143 144 161 157 155
Size 2.8 1.5 1.5 1.9 1.6 2.2 2.7 2.0 2.2 1.6 1.6 2.3 1.4 1.9 1.6 1.5 1.6 1.7 1.7

Price 203 147 173 160 219 156 169 133 154 220 151 188 153 215 144 125 152 132 164
Size 2.2 1.8 1.8 1.7 2.4 1.9 1.9 1.5 2.9 2.9 1.9 2.3 1.7 2.1 1.9 1.7 1.7 1.4 2.0

Obtain a plot of price against size. Does it appear there is an increasing relation?

## [1] 0.8570019

Yes, there appears to be an increasing relation: price tends to increase with size. The correlation coefficient between price and size is 0.8570019, which indicates a strong positive linear relation.
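The plot and correlation can be reproduced roughly as follows. This is a minimal sketch that assumes the listing in the problem statement was transcribed correctly; the data frame name `info` and its columns `price` and `size` follow the names visible in the `lm` output, while the axis labels are illustrative choices.

```r
# Listing data from Exercise 11.83: price in $1000s, size in 1000s of sq ft.
info <- data.frame(
  price = c(210, 145, 168, 352, 234, 148, 217, 216, 213, 143, 178, 131, 181,
            148, 127, 158, 226, 194, 166,
            207, 139, 143, 141, 142, 214, 262, 191, 167, 153, 153, 184, 123,
            182, 143, 144, 161, 157, 155,
            203, 147, 173, 160, 219, 156, 169, 133, 154, 220, 151, 188, 153,
            215, 144, 125, 152, 132, 164),
  size  = c(2.5, 1.5, 1.8, 4.7, 2.4, 1.5, 2.5, 3.3, 2.6, 1.6, 1.6, 1.4, 2.9,
            1.6, 1.9, 1.7, 2.6, 1.9, 1.8,
            2.8, 1.5, 1.5, 1.9, 1.6, 2.2, 2.7, 2.0, 2.2, 1.6, 1.6, 2.3, 1.4,
            1.9, 1.6, 1.5, 1.6, 1.7, 1.7,
            2.2, 1.8, 1.8, 1.7, 2.4, 1.9, 1.9, 1.5, 2.9, 2.9, 1.9, 2.3, 1.7,
            2.1, 1.9, 1.7, 1.7, 1.4, 2.0)
)

# Scatterplot of price against size.
plot(info$size, info$price,
     xlab = "Size (1000 sq ft)", ylab = "Asking price ($1000s)")

# Sample correlation between price and size.
r <- cor(info$price, info$size)
print(r)
```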

Locate an apparent outlier in the data. Is it a high leverage point?

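The house with a size of 4,700 square feet and a price of $352,000 is an apparent outlier, with a Cook’s distance of .5 or higher. It is a high leverage point because its size lies far from the sizes of the other homes (the next largest is 3,300 square feet), and leverage depends on how extreme the x-value is. It is also influential on the fit: removing it lowers the adjusted coefficient of determination from 0.7296 to 0.59.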
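Leverage depends only on how far a point's size is from the mean size. As a rough check, the leverage of the 4,700 sq ft home can be backed out of numbers in the outlier-included `lm` summary, using the simple-regression identities SE(slope) = s / sqrt(Sxx) and h = 1/n + (x - mean)^2 / Sxx. The variable names below are illustrative, and the 2p/n cutoff is a common rule of thumb rather than something stated in the exercise.

```r
# Quantities read from the outlier-included lm summary.
n     <- 57          # 55 residual df + 2 estimated coefficients
s     <- 20.92       # residual standard error
se_b1 <- 4.786       # standard error of the slope

x_bar <- 114.8 / n   # mean size; 114.8 is the sum of the 57 listed sizes
x_out <- 4.7         # size of the suspected outlier (1000s of sq ft)

# SE(slope) = s / sqrt(Sxx), so Sxx can be recovered from the output.
sxx <- (s / se_b1)^2

# Leverage of the outlier in simple linear regression.
h <- 1 / n + (x_out - x_bar)^2 / sxx

# Rule-of-thumb cutoff 2p/n with p = 2 coefficients (an assumption).
cutoff <- 2 * 2 / n

print(c(leverage = h, cutoff = cutoff))
```

The leverage comes out near 0.4, several times the cutoff of about 0.07. With the full data frame available, `hatvalues()` and `cooks.distance()` applied to the fitted model give these diagnostics directly.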

Obtain a regression equation, and include the outlier in the data.

## 
## Call:
## lm(formula = info$price ~ info$size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.523  -8.400   0.308  12.211  48.282 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   54.347     10.029   5.419 1.37e-06 ***
## info$size     59.026      4.786  12.334  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.92 on 55 degrees of freedom
## Multiple R-squared:  0.7345, Adjusted R-squared:  0.7296 
## F-statistic: 152.1 on 1 and 55 DF,  p-value: < 2.2e-16

The regression equation is Price = 54.347 + 59.026*size.

Delete the outlier, and obtain a new regression equation. How much does the slope change without the outlier? Why?

## 
## Call:
## lm(formula = info1$price ~ info1$size)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.772  -8.298  -0.649  10.878  52.093 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   63.228     12.250   5.161 3.61e-06 ***
## info1$size    54.325      6.068   8.953 2.95e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.81 on 54 degrees of freedom
## Multiple R-squared:  0.5975, Adjusted R-squared:   0.59 
## F-statistic: 80.15 on 1 and 54 DF,  p-value: 2.954e-12

The new regression equation is Price = 63.228 + 54.325*size. Without the outlier, the slope decreases by 59.026 - 54.325 = 4.701. The slope decreases because the outlier is a high leverage point whose price lies above the line fitted to the remaining data, so including it pulls the right-hand end of the line upward and increases the slope.
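The two fits can be reproduced along these lines. This is a sketch assuming the listing in the problem statement is transcribed correctly; `info` and `info1` follow the names in the output above, while `fit_all`, `fit_trim` and the use of `data =` (instead of `info$price ~ info$size`) are choices made for this example.

```r
# Listing data from Exercise 11.83: price in $1000s, size in 1000s of sq ft.
info <- data.frame(
  price = c(210, 145, 168, 352, 234, 148, 217, 216, 213, 143, 178, 131, 181,
            148, 127, 158, 226, 194, 166,
            207, 139, 143, 141, 142, 214, 262, 191, 167, 153, 153, 184, 123,
            182, 143, 144, 161, 157, 155,
            203, 147, 173, 160, 219, 156, 169, 133, 154, 220, 151, 188, 153,
            215, 144, 125, 152, 132, 164),
  size  = c(2.5, 1.5, 1.8, 4.7, 2.4, 1.5, 2.5, 3.3, 2.6, 1.6, 1.6, 1.4, 2.9,
            1.6, 1.9, 1.7, 2.6, 1.9, 1.8,
            2.8, 1.5, 1.5, 1.9, 1.6, 2.2, 2.7, 2.0, 2.2, 1.6, 1.6, 2.3, 1.4,
            1.9, 1.6, 1.5, 1.6, 1.7, 1.7,
            2.2, 1.8, 1.8, 1.7, 2.4, 1.9, 1.9, 1.5, 2.9, 2.9, 1.9, 2.3, 1.7,
            2.1, 1.9, 1.7, 1.7, 1.4, 2.0)
)

# Fit with every listing included.
fit_all <- lm(price ~ size, data = info)

# Drop the largest home (4,700 sq ft) and refit.
info1    <- info[-which.max(info$size), ]
fit_trim <- lm(price ~ size, data = info1)

# Compare the two slopes.
slope_all  <- unname(coef(fit_all)["size"])
slope_trim <- unname(coef(fit_trim)["size"])
print(c(with_outlier = slope_all, without_outlier = slope_trim,
        change = slope_all - slope_trim))
```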

Locate the residual standard deviations for the outlier-included and outlier-excluded models. Do they differ much? Why?

Outlier-included: 20.92. Outlier-excluded: 20.81. They differ by only 20.92 - 20.81 = 0.11. The outlier lies close to the fitted line (its residual is about 20.2, roughly one residual standard deviation), so it is not unusual in the vertical direction, and removing it barely changes the scatter of the points around the line.
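One way to see why the residual standard deviation hardly moves is to compute the outlier's own residual from the outlier-included equation. The sketch below uses only the coefficients and residual standard error reported above; the variable names are illustrative.

```r
# Coefficients and residual standard error from the outlier-included fit.
b0 <- 54.347
b1 <- 59.026
s  <- 20.92

# Fitted price and residual for the 4,700 sq ft home listed at 352.
fitted_out <- b0 + b1 * 4.7
resid_out  <- 352 - fitted_out

# Residual expressed in residual standard deviations.
print(c(fitted = fitted_out, residual = resid_out, ratio = resid_out / s))
```

A residual of roughly one residual standard deviation is typical of the other homes, so the point is extreme in size but not in its distance from the line.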

11.85

Obtain a 95% prediction interval for the asking price of a home of 5,000 square feet, based on the outlier-excluded data of Exercise 11.83. Would this be a wise prediction to make, based on the data?

##        fit     lwr      upr
## 1 334.8552 278.865 390.8454

We are 95% confident that the asking price of a 5,000 sq ft home is between 278.865 and 390.8454 thousand dollars, that is, between about $278,865 and $390,845. This would not be a wise prediction: 5,000 sq ft is well outside the range of sizes in the outlier-excluded data (the largest remaining home is 3,300 sq ft), so the prediction is an extrapolation.
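The interval above can be approximately rebuilt from the outlier-excluded summary using the usual simple-regression prediction formula, fit ± t * s * sqrt(1 + 1/n + (x0 - mean)^2 / Sxx). This is a sketch with illustrative variable names; because the inputs are rounded values from the printed output, the result matches the `predict` output only to about two decimals.

```r
# Quantities read from the outlier-excluded lm summary.
n     <- 56          # 54 residual df + 2 estimated coefficients
b0    <- 63.228
b1    <- 54.325
s     <- 20.81       # residual standard error
se_b1 <- 6.068       # standard error of the slope

x_bar <- 110.1 / n   # mean size; 110.1 is the sum of the 56 remaining sizes
x0    <- 5           # 5,000 sq ft, in thousands of sq ft

# Recover Sxx from SE(slope) = s / sqrt(Sxx).
sxx <- (s / se_b1)^2

# Point prediction and standard error for a single new observation.
fit     <- b0 + b1 * x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / sxx)

# 95% prediction interval with n - 2 degrees of freedom.
t_crit <- qt(0.975, df = n - 2)
lwr <- fit - t_crit * se_pred
upr <- fit + t_crit * se_pred
print(c(fit = fit, lwr = lwr, upr = upr))
```

The term (x0 - mean)^2 / Sxx is about 0.78 here, far larger than it would be for a home near the mean size, which is the formula's way of reflecting how far the prediction extrapolates.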

Obtain a plot of the price against the size. Does the constant-variance assumption seem reasonable, or does variability increase as size increases?

The constant variance assumption does not seem reasonable because the variability increases with size.

What does your answer to part (b) say about the prediction interval obtained in part (a)?

The prediction interval for a 5,000 sq ft home is not trustworthy, because it assumes constant variance while the variability in price increases with size. For a home this large, the appropriate interval would be wider than the one obtained.

11.86

A lawn care company tried to predict the demand for its service by zip code, using the housing density in the zip code area as a predictor. The owners obtained the number of houses and the geographic size of each zip code and calculated their sales per thousand homes and number of homes per acre.

Sales 54 72 54 62 72 83 115 90 66 60 100 78 152 87 54 82
Density 6.5 4.6 5.5 4.6 4.2 4.3 2.3 3.5 3.2 8.4 3.4 4.0 2.0 3.2 6.7 3.0

Sales 59 183 171 96 134 79 94 82 66 62 45 69 65 81 94 117
Density 5.7 1.3 1.3 3.0 2.2 4.3 2.6 3.0 4.3 7.8 9.4 4.2 5.9 6.2 2.8 2.4

Obtain the correlation between the two variables. What does its sign mean?

## [1] -0.7707266

The correlation coefficient for sales and density is -0.7707266. The negative sign means that there is a negative linear relation between sales and density. In other words, as housing density increases, sales per thousand homes tend to decrease.
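The correlation can be reproduced roughly as follows. This is a sketch assuming the data in the problem statement are transcribed correctly, with the spaced entry "7 .8" read as 7.8; the data frame name `lawn` and its columns follow the names visible in the `lm` output.

```r
# Lawn care data from Exercise 11.86:
# sales per thousand homes and housing density in homes per acre.
lawn <- data.frame(
  sales   = c(54, 72, 54, 62, 72, 83, 115, 90, 66, 60, 100, 78, 152, 87, 54,
              82,
              59, 183, 171, 96, 134, 79, 94, 82, 66, 62, 45, 69, 65, 81, 94,
              117),
  density = c(6.5, 4.6, 5.5, 4.6, 4.2, 4.3, 2.3, 3.5, 3.2, 8.4, 3.4, 4.0, 2.0,
              3.2, 6.7, 3.0,
              5.7, 1.3, 1.3, 3.0, 2.2, 4.3, 2.6, 3.0, 4.3, 7.8, 9.4, 4.2, 5.9,
              6.2, 2.8, 2.4)
)

# Sample correlation between sales and density.
r <- cor(lawn$sales, lawn$density)
print(r)
```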

Obtain a prediction equation with sales as the dependent variable and density as the independent variable. Interpret the intercept (yes, we know the interpretation will be a bit strange) and the slope numbers.

## 
## Call:
## lm(formula = lawn$sales ~ lawn$density)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.269 -14.348  -6.625   9.665  58.235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   141.525      9.109  15.538 6.85e-16 ***
## lawn$density  -12.893      1.946  -6.625 2.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.74 on 30 degrees of freedom
## Multiple R-squared:  0.594,  Adjusted R-squared:  0.5805 
## F-statistic:  43.9 on 1 and 30 DF,  p-value: 2.464e-07

The regression equation is Sales = 141.525 - 12.893*Density. The intercept of 141.525 is the predicted sales per thousand homes when the density is 0 homes per acre, a value outside the range of the data that has no practical meaning. The slope of -12.893 means that for each additional home per acre, the predicted sales decrease by 12.893 per thousand homes.

Obtain a value for the residual standard deviation. What does this number indicate about the accuracy of prediction?

The residual standard error is 21.74. It estimates the standard deviation of the prediction errors, that is, of the differences between observed and predicted sales. By the empirical rule, about 95% of the prediction errors fall within 2 standard deviations of their mean. In least squares regression the mean of the residuals is 0, so about 95% of the errors fall within 0 plus or minus 2(21.74) = 43.48 sales per thousand homes.

11.87

Obtain a value of the t statistic for the regression model of Exercise 11.86. Is there conclusive evidence that density is a predictor of sales?

##              Estimate Std. Error t value Pr(>|t|)    
## lawn$density  -12.893      1.946  -6.625 2.46e-07 ***

The t value for density is -6.625. The hypotheses are Ho: B1 = 0 versus Ha: B1 != 0, and we reject Ho if the p-value is less than .05. The p-value for density is 2.46e-07, which is less than .05. We reject the null hypothesis and conclude that there is significant evidence that the slope is not equal to 0, meaning density is a predictor of sales.

Calculate a 95% confidence interval for the true value of the slope. The package should have calculated the standard error for you.

##                  2.5 %     97.5 %
## (Intercept)  122.92314 160.127752
## lawn$density -16.86676  -8.918432

The 95% confidence interval for the true value of the slope is (-16.86676, -8.918432).
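As a cross-check, the t statistic, p-value and confidence interval can be rebuilt by hand from the slope estimate and its standard error in the `lm` summary of Exercise 11.86. The sketch below uses illustrative variable names; because it starts from rounded printed values, it agrees with the package output only to about three decimals.

```r
# Slope estimate, standard error and residual df from the lm summary.
b1    <- -12.893
se_b1 <- 1.946
df    <- 30

# t statistic and two-sided p-value for Ho: B1 = 0.
t_stat  <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df)
ci     <- b1 + c(-1, 1) * t_crit * se_b1

print(c(t = t_stat, p = p_value))
print(ci)
```

With the fitted model object available, `confint()` on that model returns the same interval directly, which appears to be how the output above was produced.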