title: "Homework #1"
author: "Dakota McKenzie"
date: "Friday, April 03, 2015"
output: html_document

Question #1
a)
i) Are houses larger when the city is Beverly Hills?
ii) Are there more bathrooms when the house type is sfh or when it is condo?
iii) Using the number of bedrooms, predict the number of bathrooms in a given house.
iv) Using the number of bedrooms, predict the price of a house.
b) I will answer question iv. We can expect that as the number of bedrooms increases, the price of the house also increases. Thus, more bedrooms should predict a higher price, and fewer bedrooms a lower one.
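
As an illustration of how question iv could be approached, here is a minimal sketch that fits a simple linear regression of price on number of bedrooms. The data frame `houses`, its column names, and every value in it are hypothetical and made up for this example; they are not from the homework data.

```r
# Hypothetical listings: all numbers are invented for illustration only
houses <- data.frame(
  bedrooms = c(1, 2, 2, 3, 3, 4, 4, 5),
  price    = c(150, 210, 195, 260, 280, 330, 310, 400)  # thousands of dollars
)

# Simple linear regression: price as a function of number of bedrooms
fit <- lm(price ~ bedrooms, data = houses)

# The slope estimates the change in price per additional bedroom
coef(fit)

# Predicted price for a house with 3 bedrooms
predict(fit, newdata = data.frame(bedrooms = 3))
```

A positive slope on `bedrooms` would support the expectation stated in part b.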

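# Load the homework data; the models below use its x and y columns
hw1=read.csv("C:/Users/Dakota McKenzie/Downloads/hw1.csv")
# attach() makes the columns available by name for the lm() calls that follow
attach(hw1)
x=hw1$x
# Model 1: simple linear regression of y on x
model1=lm(y~x)
anova(model1)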

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 479453  479453  38.488 0.0004436 ***
## Residuals  7  87201   12457                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model2=lm(y~x+I(x^2))
anova(model2)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 479453  479453 42.0736 0.0006383 ***
## I(x^2)     1  18827   18827  1.6521 0.2460502    
## Residuals  6  68374   11396                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model3=lm(y~x+I(x^2)+I(x^3))
anova(model3)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x          1 479453  479453 39.0022 0.001542 **
## I(x^2)     1  18827   18827  1.5315 0.270827   
## I(x^3)     1   6909    6909  0.5620 0.487209   
## Residuals  5  61465   12293                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model4=lm(y~x+I(x^2)+I(x^3)+I(x^4))
anova(model4)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## x          1 479453  479453 104.7432 0.0005137 ***
## I(x^2)     1  18827   18827   4.1130 0.1124611    
## I(x^3)     1   6909    6909   1.5093 0.2865864    
## I(x^4)     1  43155   43155   9.4278 0.0372756 *  
## Residuals  4  18310    4577                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model5=lm(y~x+I(x^2)+I(x^3)+I(x^4)+I(x^5))
anova(model5)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x          1 479453  479453 78.6485 0.003023 **
## I(x^2)     1  18827   18827  3.0883 0.177105   
## I(x^3)     1   6909    6909  1.1333 0.365161   
## I(x^4)     1  43155   43155  7.0791 0.076296 . 
## I(x^5)     1     21      21  0.0035 0.956670   
## Residuals  3  18288    6096                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MSE model 1: 12457
MSE model 2: 11396
MSE model 3: 12293
MSE model 4: 4577
MSE model 5: 6096

Based on the training MSE (the residual mean square in each ANOVA table), I would choose model 4 because it has the lowest mean squared error.
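
The training MSE comparison can be rechecked by hand from the ANOVA tables. The following is a minimal sketch that uses only the residual sums of squares and degrees of freedom printed above; the variable names are illustrative, and n = 9 is inferred from the residual degrees of freedom rather than stated in the source. It also computes SSE / n, a common alternative definition of training MSE.

```r
# Residual sums of squares and residual degrees of freedom, copied from the
# ANOVA tables for models 1 to 5
sse      <- c(87201, 68374, 61465, 18310, 18288)
df_resid <- c(7, 6, 5, 4, 3)
n <- 9  # inferred: residual df plus the number of fitted coefficients

# Residual mean square (SSE / residual df), the "Mean Sq" reported by anova()
mse_df <- sse / df_resid

# Alternative training MSE (SSE / n); never increases as nested terms are added
mse_n <- sse / n

data.frame(model = 1:5, mse_df, mse_n)
which.min(mse_df)  # model with the smallest residual mean square
which.min(mse_n)   # model with the smallest SSE / n
```

With these numbers the residual mean square is smallest for model 4, while SSE / n is smallest for model 5, which illustrates why training error alone tends to favor the most flexible model.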

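set.seed(456)
# Simulate responses from the true linear model: y = 500 + 200x + noise (sd = 100)
x=seq(0,4,by=.5)
y=500+200*x + rnorm(length(x),0,100)
# Fitted values from each model. Each mse_* below is the mean squared deviation
# of the fitted values from their own mean; the simulated y is not used here.
x1=predict(model1)
mse_x=sum((x1-mean(x1))^2)/length(x1)
x2=predict(model2)
mse_x2=sum((x2-mean(x2))^2)/length(x2)
x3=predict(model3)
mse_x3=sum((x3-mean(x3))^2)/length(x3)
x4=predict(model4)
mse_x4=sum((x4-mean(x4))^2)/length(x4)
x5=predict(model5)
mse_x5=sum((x5-mean(x5))^2)/length(x5)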

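MSE model 1: 53272
MSE model 2: 55364
MSE model 3: 56132
MSE model 4: 60927
MSE model 5: 60929

Each value is the mean squared deviation of that model's fitted values from their mean, which equals the regression sum of squares divided by n = 9 (for model 1, 479453 / 9 ≈ 53273, matching the listed value up to rounding).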

The training data is the portion of the data (roughly 67%) used to fit the model, and the training MSE is computed on it. The rest of the data is the test data, which is used to check whether the model is a good fit. If the model also fits the test data well, that is evidence the model is accurate. Looking at the predictions in part c, the result makes sense: the linear model had the lowest MSE, which is expected because the true model itself is linear.
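
One way to compute a test MSE is to compare newly simulated responses with each model's predictions at the same x values. The sketch below is self-contained and rests on assumptions: because hw1.csv is not available here, the training set is simulated from the same true model used above (y = 500 + 200x + noise), with an arbitrary seed, and the names `train`, `test`, and `fits` are illustrative. Its numbers will therefore not match the homework output.

```r
x_grid <- seq(0, 4, by = 0.5)

# Stand-in training data (assumption: replaces hw1.csv, arbitrary seed)
set.seed(123)
train <- data.frame(x = x_grid,
                    y = 500 + 200 * x_grid + rnorm(length(x_grid), 0, 100))

# Test data: fresh noise around the same true linear model
set.seed(456)
test <- data.frame(x = x_grid,
                   y = 500 + 200 * x_grid + rnorm(length(x_grid), 0, 100))

# Fit polynomial models of degree 1 to 5 on the training data only
degrees <- 1:5
fits <- lapply(degrees, function(d) {
  lm(y ~ poly(x, d, raw = TRUE), data = train)
})

# Training MSE: mean squared residual on the data used for fitting
train_mse <- sapply(fits, function(f) mean(residuals(f)^2))

# Test MSE: mean squared difference between test responses and predictions
test_mse <- sapply(fits, function(f) {
  mean((test$y - predict(f, newdata = test))^2)
})

data.frame(degree = degrees, train_mse, test_mse)
```

The training MSE can only go down as the degree increases, while the test MSE has no such guarantee, so the two columns together give a clearer picture than either one alone.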

Question #3 (2.4.2)
a) Regression; the goal is inference. n = 500 firms in the US; p = 3 predictors: profit, number of employees, industry.
b) Classification; the goal is prediction. n = 20 similar products previously launched; p = 13 predictors: price charged, marketing budget, competition price, and ten other variables.
c) Regression; the goal is prediction. n = 52 weeks of 2012 weekly data; p = 3 predictors: % change in the US market, % change in the British market, % change in the German market.