Exam 3

library(mosaic)

House prices

In this exercise you will study the data described in Agresti EXAMPLE 9.10.

You are studying house sales in Gainesville, Florida, where among other things the data contain the selling price (Price), property taxes (Taxes) and house size (Size).

Read in the data:

HousePrices <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=HousePrice.dat", header=TRUE)
head(HousePrices)

##   Taxes  Price Size
## 1  3104 279900 2048
## 2  1173 146500  912
## 3  3076 237700 1654
## 4  1608 200000 2068
## 5  1454 159900 1477
## 6  2997 499900 3153

1. Make a relevant plot of the variables and discuss how they are related.

plot(HousePrices)

1. Explain the concept of correlation and determine whether there is significant positive correlation between Taxes and Size.

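Correlation measures the strength and direction of the linear association between two variables. It lies between -1 and 1, and a value of 0 means no linear association. The plot shows a roughly linear, increasing relationship between Taxes and Size. The correlation test gives an estimate of 0.82 with a 95% confidence interval from 0.74 to 0.87. The interval does not contain 0 and the p-value is below 2.2e-16, so we reject the null hypothesis of zero correlation. Since the estimate is positive, there is a significant positive correlation between Taxes and Size.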

cor.test(~ Size + Taxes, data = HousePrices)

## 
##  Pearson's product-moment correlation
## 
## data:  Size and Taxes
## t = 14.119, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7416554 0.8745614
## sample estimates:
##       cor 
## 0.8187958
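As a cross-check, the t statistic and the confidence interval in this output can be reproduced by hand from the estimate r = 0.8187958 and n = 100 (since df = 98 = n - 2). The sketch below assumes the usual formula t = r * sqrt(n - 2) / sqrt(1 - r^2) and a confidence interval based on Fisher's z transformation, which is the method cor.test documents for Pearson's correlation. The variable names are only illustrative.

```r
r <- 0.8187958   # sample correlation copied from the cor.test output
n <- 100         # number of houses (df = n - 2 = 98)

# t statistic for H0: the true correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# 95% confidence interval via Fisher's z transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

t_stat
p_value
ci
```

The results should agree with the output above up to rounding.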

Fit a multiple regression model with Price as the response variable and Taxes and Size as predictors.

The model below uses Price as the response and Taxes and Size as predictors, without interaction.

model <- lm(Price ~ Taxes + Size, data = HousePrices)
summary(model)

## 
## Call:
## lm(formula = Price ~ Taxes + Size, data = HousePrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188027  -26138     347   22944  200114 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -28608.744  13519.096  -2.116   0.0369 *  
## Taxes           39.601      6.917   5.725 1.16e-07 ***
## Size            66.512     12.817   5.189 1.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48830 on 97 degrees of freedom
## Multiple R-squared:  0.7722, Adjusted R-squared:  0.7675 
## F-statistic: 164.4 on 2 and 97 DF,  p-value: < 2.2e-16

1. What are the parameters of the model and what is the interpretation of these parameters?

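The parameters are the intercept a, the slope b1 for Taxes, the slope b2 for Size, and the standard deviation of the errors. The intercept (-28608.7) is the predicted Price when Taxes and Size are both 0, which has no practical meaning here. The slope for Taxes (39.6) means that, for a fixed Size, the predicted Price increases by 39.6 when Taxes increases by one unit. The slope for Size (66.5) means that, for a fixed Taxes, the predicted Price increases by 66.5 when Size increases by one unit. Both slopes are positive, so higher Taxes and larger Size both go with higher Price. The residual standard error (48830) estimates the standard deviation of the errors.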

1. What is the prediction equation?

\[ \widehat y = a + b_1 x_1 + b_2 x_2 \]

\[ \widehat y = -28608.7 + 39.6 x_1 + 66.5 x_2 \]

where x1 is Taxes and x2 is Size.

Explain the output where model is the fitted multiple regression model. This explanation should as a minimum include

1. Calculation of t value and determination and interpretation of p-value.

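The t value for each parameter is the estimate divided by its standard error: t = estimate / SE. The inputs below are rounded, so the results differ slightly from the t values in the summary (5.725 for Taxes and 5.189 for Size).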

tval1 = -28608.7 / 13519.1 
tval2 = 39.6 / 6.9
tval3 = 66.5 / 12.8
tval1

## [1] -2.116169

tval2

## [1] 5.73913

tval3

## [1] 5.195312

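The p-value is the probability, under the null hypothesis that the parameter is 0, of a t value at least as far from 0 as the one observed, using a t distribution with 97 degrees of freedom. For Taxes (1.16e-07) and Size (1.16e-06) the p-values are much less than 5%. This means that we can reject the null hypothesis for both the x1 and x2 variables: each predictor has a significant effect on Price when the other is in the model. The p-value for the intercept (0.0369) is also below 5%.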
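The t values and p-values can also be computed together from the unrounded numbers in the coefficient table. This is a minimal sketch: the estimates and standard errors are typed in from the summary output, and the object names (est, se, t_val, p_val) are only illustrative.

```r
# Estimates and standard errors copied from the coefficient table of summary(model)
est <- c(Intercept = -28608.744, Taxes = 39.601, Size = 66.512)
se  <- c(Intercept = 13519.096, Taxes = 6.917, Size = 12.817)
df  <- 97  # residual degrees of freedom: 100 observations minus 3 parameters

# t value: estimate divided by standard error
t_val <- est / se

# Two-sided p-value: probability of a t value at least this far from 0
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)

round(t_val, 3)
signif(p_val, 3)
```

These should match the "t value" and "Pr(>|t|)" columns of the summary up to rounding.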

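(7) Interpretation of Multiple R-squared.

R^2 = (TSS - SSE) / TSS, where TSS is the total sum of squares of Price around its mean and SSE is the sum of squared residuals of the model. R^2 is the proportion of the variation in Price that is explained by the model. Here R^2 = 0.7722, so Taxes and Size together explain about 77% of the variation in Price.
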
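The last two lines of summary(model) are linked to R-squared, which gives a quick consistency check. The sketch below assumes the standard formulas for the adjusted R-squared and the F statistic in terms of R-squared, with n = 100 observations and k = 2 predictors. The object names are only illustrative.

```r
r2 <- 0.7722  # Multiple R-squared from summary(model)
n  <- 100     # number of observations
k  <- 2       # number of predictors (Taxes and Size)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# F statistic for H0: both slopes are 0
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

adj_r2
f_stat
```

The values should be close to the reported 0.7675 and 164.4. Small differences come from using the rounded R-squared.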
1. How the table of output can be used to construct confidence intervals for parameters. This should be supplemented by actual calculation for the current data using confint.

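95% confidence interval: est ± t * se, where est and se are taken from the Estimate and Std. Error columns and t is the 97.5% quantile of the t distribution with 97 degrees of freedom.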

t = qt(0.025, df = 97, lower.tail = FALSE)

-28608.7 + (13519.1)*(t)

## [1] -1777.029

-28608.7 - (13519.1)*(t)

## [1] -55440.37

39.601 + (6.9)*(t)

## [1] 53.29559

39.601 - (6.9)*(t)

## [1] 25.90641

66.5 + (12.8)*(t)

## [1] 91.90446

66.5 - (12.8)*(t)

## [1] 41.09554

confint(model)

##                    2.5 %      97.5 %
## (Intercept) -55440.40818 -1777.08054
## Taxes           25.87192    53.32920
## Size            41.07304    91.95066

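With 95% confidence, the intercept a is between -55440.4 and -1777.1, the slope b1 for Taxes is between 25.87 and 53.33, and the slope b2 for Size is between 41.07 and 91.95. None of the intervals contains 0, which agrees with the p-values below 5% in the summary. The hand calculations differ slightly from confint because they use rounded estimates and standard errors.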
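The same intervals can be computed in one step from the coefficient table. This is a sketch using the unrounded estimates and standard errors typed in from summary(model). The object names (est, se, t_crit, ci) are only illustrative.

```r
# Estimates and standard errors copied from the coefficient table of summary(model)
est <- c(Intercept = -28608.744, Taxes = 39.601, Size = 66.512)
se  <- c(Intercept = 13519.096, Taxes = 6.917, Size = 12.817)

# 97.5% quantile of the t distribution with 97 degrees of freedom
t_crit <- qt(0.975, df = 97)

# est - t * se and est + t * se for all three parameters at once
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

With the unrounded inputs the result should agree with confint(model) to about two decimals.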

Finally, you have to investigate whether or not there is an interaction between the effect of Taxes and the effect of Size as predictors of Price.

model2 <- lm(Price ~ Taxes * Size, data = HousePrices)
summary(model2)

## 
## Call:
## lm(formula = Price ~ Taxes * Size, data = HousePrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202902  -23642    -224   20081  213409 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 2.396e+04  2.450e+04   0.978   0.3305  
## Taxes       1.991e+01  1.026e+01   1.941   0.0551 .
## Size        3.329e+01  1.806e+01   1.844   0.0683 .
## Taxes:Size  1.036e-02  4.072e-03   2.544   0.0126 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47510 on 96 degrees of freedom
## Multiple R-squared:  0.7866, Adjusted R-squared:  0.7799 
## F-statistic: 117.9 on 3 and 96 DF,  p-value: < 2.2e-16

To check for interaction we look at the p-value of the interaction term Taxes:Size. If that p-value is above 5%, the interaction is not significant, so we drop the term and go back to the model without interaction (summary(model)). If it is below 5%, the interaction is significant and we keep it. In that case the main effects Taxes and Size also stay in the model, even when their own p-values are above 5% as they are here (0.0551 and 0.0683), because the interaction term cannot be interpreted without them.

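Here the p-value for Taxes:Size is 0.0126, which is below 5%, so there is a significant interaction: the effect of Taxes on Price depends on Size, and the effect of Size depends on Taxes. The R-squared is 0.7866 (compared with 0.7722 without interaction), so the model explains about 79% of the variation in Price and fits rather well.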
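To see what the interaction means, the prediction equation of model2 is approximately Price = 23960 + 19.91 * Taxes + 33.29 * Size + 0.01036 * Taxes * Size, so the slope for Taxes is 19.91 + 0.01036 * Size. The sketch below evaluates that slope at three house sizes. The sizes 1000, 2000 and 3000 are illustrative values chosen for this example, and the coefficients are the rounded ones from summary(model2).

```r
b_taxes <- 19.91    # estimate for Taxes in summary(model2)
b_inter <- 0.01036  # estimate for Taxes:Size in summary(model2)

# Illustrative house sizes (not taken from the exam text)
size <- c(1000, 2000, 3000)

# Slope of Taxes at each size: change in predicted Price per unit of Taxes
slope_taxes <- b_taxes + b_inter * size

data.frame(Size = size, slope_of_Taxes = slope_taxes)
```

The slope for Taxes grows with Size, so an extra unit of Taxes goes with a larger increase in predicted Price for larger houses.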