Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2.
N <- 10      # any N >= 6
n <- 10000   # sample size
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)        # x: the median of X
y <- quantile(Y)[2]   # y: the 1st quartile of Y
par(mfrow=c(1, 2))
hist(X)
hist(Y)
summary_info <- rbind(summary(X), summary(Y))
rownames(summary_info) <- c('X', 'Y')
summary_info %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "hover"))
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| X | 1.000178 | 3.272130 | 5.491291 | 5.485992 | 7.685153 | 9.999623 |
| Y | -15.184545 | 1.809095 | 5.511779 | 5.489539 | 9.146032 | 26.051536 |
The histograms show that X is roughly uniform between 1 and 10, while Y is approximately normal with a mean of 5.5 and standard deviation of 5.5.
For the probability questions below, let x = 5.4912906 (the median of X), y = 1.809095 (the first quartile of Y), and N = 10.
P(X>x | X>y)
\[P(X>x \mid X>y) = \frac{P(X>x \cap X>y)}{P(X>y)}\]
This is the probability that X is greater than 5.4912906, given that X is greater than 1.809095.
pxandy <- length(X[X > x & X > y]) / length(X)  # empirical P(X > x and X > y); since x > y this is just P(X > x)
py <- length(X[X > y]) / length(X)              # empirical P(X > y)
pxandy / py
## [1] 0.5495109
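As a sanity check, since x > y, this conditional probability has the closed form P(X>x)/P(X>y) = (N−x)/(N−y) for X ~ Uniform(1, N); a minimal sketch using the objects defined above:
# theoretical value for X ~ Uniform(1, N): since x > y,
# P(X > x | X > y) = (N - x) / (N - y)
(N - x) / (N - y)   # about 0.55, close to the empirical 0.5495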
P(X>x, Y>y)
Since X and Y are independent, we can simply multiply the probabilities.
\[P(X>x, Y>y) = P(X>x)\,P(Y>y)\]
Probability that X is greater than 5.4912906 and Y is greater than 1.809095:
a <- length(Y[Y > y]) / length(Y)   # empirical P(Y > y) = 0.75 by construction (y is the 1st quartile)
b <- length(X[X > x]) / length(X)   # empirical P(X > x) = 0.50 by construction (x is the median)
a * b
## [1] 0.375
P(X < x|X > y)
\[P(X<x \mid X>y) = \frac{P(X<x \cap X>y)}{P(X>y)}\]
This is the probability that X is less than 5.4912906, given that X is greater than 1.809095.
pxy <- length(X[X < x & X > y]) / length(X)  # empirical P(X < x and X > y)
py <- length(X[X > y]) / length(X)           # empirical P(X > y)
pxy / py                                     # note: equals 1 minus the answer from part (a)
## [1] 0.4504891
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities
x1 <- ifelse(X > x, "X>x", "X<x")
x2 <- ifelse(Y > y, "Y>y", "Y<y")
t <- table(x1, x2)
tt <- addmargins(t)   # add row/column totals
kable(tt) %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))
| | Y<y | Y>y | Sum |
|---|---|---|---|
| X<x | 1248 | 3752 | 5000 |
| X>x | 1252 | 3748 | 5000 |
| Sum | 2500 | 7500 | 10000 |
t3 <- addmargins(prop.table(table(x1, x2)))   # joint and marginal proportions
kable(t3) %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))
| | Y<y | Y>y | Sum |
|---|---|---|---|
| X<x | 0.1248 | 0.3752 | 0.5 |
| X>x | 0.1252 | 0.3748 | 0.5 |
| Sum | 0.2500 | 0.7500 | 1.0 |
y1 <- ifelse(X > x & Y > y, "X>x & Y>y", "not")
kable(prop.table(table(y1))) %>%
  kable_styling(bootstrap_options = "bordered")
| y1 | Freq |
|---|---|
| not | 0.6252 |
| X>x & Y>y | 0.3748 |
The tables show that the joint probability P(X>x, Y>y) = 0.3748 is approximately equal to the product of the marginal probabilities, 0.5 × 0.75 = 0.375, which is what independence predicts.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
chisq.test(t, correct = TRUE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t
## X-squared = 0.0048, df = 1, p-value = 0.9448
fisher.test(t, simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data
##
## data: t
## p-value = 0.9448
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9086062 1.0912442
## sample estimates:
## odds ratio
## 0.9957426
In both cases the p-value is large, so we fail to reject the null hypothesis of independence. The difference: the chi-square test relies on a large-sample approximation, while Fisher's exact test computes an exact p-value and is typically reserved for tables with small expected cell counts. With 10,000 observations here, the chi-square test is the more appropriate choice, though the two agree.
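To see where the two tests diverge, consider a hypothetical 2×2 table with small counts (the numbers below are made up purely for illustration):
small_t <- matrix(c(3, 1, 1, 5), nrow = 2)  # hypothetical low-count table
chisq.test(small_t)   # warns that the chi-square approximation may be incorrect
fisher.test(small_t)  # exact p-value, no large-sample assumption needed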
# read.csv (rather than read.csv2, which assumes dec = ",") is the natural
# reader for these comma-separated files
data <- read.csv('train.csv', stringsAsFactors = TRUE, header = TRUE, na.strings = 'NA')
test <- read.csv('test.csv', stringsAsFactors = TRUE, header = TRUE, na.strings = 'NA')
train <- data[, sapply(data, is.numeric)][-1]   # numeric columns only, dropping Id
This dataset contains 1460 observations and 81 variables. 38 variables are numeric and 43 variables are categorical.
Descriptive and Inferential Statistics
Based on the histograms below, several variables are right skewed and would likely need to be transformed before any modeling or analysis, using Box-Cox or another normalizing method. The year columns, such as YearBuilt and YearRemodAdd, are also strongly skewed, but it is not obvious what "normalizing" a year would mean or how to interpret it; converting years to ages (e.g., YrSold − YearBuilt) may be more sensible.
par(mfrow = c(3, 4))
hist.data.frame(train)   # Hmisc::hist.data.frame
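As a sketch of the kind of transformation this would involve (a simple log1p here rather than a fitted Box-Cox, with GrLivArea chosen arbitrarily as the example column):
# log1p pulls in the right tail of a skewed variable; MASS::boxcox could
# be used to choose a power transform more formally
par(mfrow = c(1, 2))
hist(train$GrLivArea, main = "GrLivArea")
hist(log1p(train$GrLivArea), main = "log1p(GrLivArea)")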
Scatterplot Matrix
Based on the plots below, all three variables have a positive linear relationship: as TotalBsmtSF and GrLivArea increase, SalePrice also increases.
plot_data <- train %>% select('TotalBsmtSF', 'GrLivArea', 'SalePrice')
pairs(plot_data)
var_names <- c('TotalBsmtSF', 'GrLivArea', 'LotArea', 'SalePrice')
var_data <- data[, var_names]
cor_matrix <- cor(var_data)
cor_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.0000000 0.4548682 0.2608331 0.6135806
## GrLivArea 0.4548682 1.0000000 0.2631162 0.7086245
## LotArea 0.2608331 0.2631162 1.0000000 0.2638434
## SalePrice 0.6135806 0.7086245 0.2638434 1.0000000
corrplot(cor_matrix)
The correlation matrix shows a strong correlation between SalePrice and TotalBsmtSF and between SalePrice and GrLivArea; the correlation between LotArea and SalePrice is much weaker.
SalePrice vs TotalBsmtSF
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.5922142 and 0.6340846.
cor.test(train$SalePrice,train$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
SalePrice vs GrLivArea
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.6915087 and 0.7249450.
cor.test(train$SalePrice,train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
SalePrice vs LotArea
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.2323391 and 0.2947946.
cor.test(train$SalePrice,train$LotArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
Based on the correlation tests above, we reject the null hypothesis of zero correlation in favor of the alternative in all three cases, because of the small p-values.
https://www.statisticshowto.datasciencecentral.com/familywise-error-rate/
Would you be worried about familywise error? Why or why not?
Familywise error is the probability of making at least one Type I error (a false positive, i.e., rejecting a true null hypothesis) across a family of tests. With three tests at the α = 0.2 level used above, the familywise rate could in principle be as high as 1 − (1 − 0.2)³ ≈ 0.49. In our case, however, every p-value is extremely small, far below any reasonably adjusted threshold, so I would not be worried about committing a Type I error here.
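A standard guard is a Bonferroni correction; a sketch, using R's reported upper bounds for the three p-values above:
p_vals <- c(2.2e-16, 2.2e-16, 2.2e-16)   # reported p-value ceilings from the three tests
p.adjust(p_vals, method = "bonferroni")  # all remain far below 0.05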
Linear Algebra and Correlation
Invert the correlation matrix
precision_matrix <- solve(cor_matrix)   # inverse of the correlation matrix
precision_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.6321069 -0.0397442 -0.17031695 -0.92832834
## GrLivArea -0.0397442 2.0350650 -0.16233936 -1.37487844
## LotArea -0.1703170 -0.1623394 1.10622180 -0.07232846
## SalePrice -0.9283283 -1.3748784 -0.07232846 2.56296011
Multiply the correlation matrix by the precision matrix
cor_matrix %*% precision_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1 0.000000e+00 -6.938894e-18 0.000000e+00
## GrLivArea 0 1.000000e+00 5.551115e-17 2.220446e-16
## LotArea 0 0.000000e+00 1.000000e+00 0.000000e+00
## SalePrice 0 -2.220446e-16 0.000000e+00 1.000000e+00
Multiply the precision matrix by the correlation matrix
precision_matrix %*% cor_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.000000e+00 2.220446e-16 8.326673e-17 1.110223e-16
## GrLivArea 1.110223e-16 1.000000e+00 0.000000e+00 0.000000e+00
## LotArea -4.857226e-17 -2.081668e-17 1.000000e+00 -4.163336e-17
## SalePrice 0.000000e+00 2.220446e-16 -1.110223e-16 1.000000e+00
LU decomposition on the matrix
l_u <- lu(cor_matrix)           # lu() here is assumed to be pracma::lu, which returns the L and U factors
l_u$L %*% l_u$U == cor_matrix   # element-wise check that L %*% U reproduces the matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF TRUE TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE TRUE
## LotArea TRUE TRUE TRUE TRUE
## SalePrice TRUE TRUE TRUE TRUE
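Exact equality happened to hold here, but comparing floating-point matrices with == is fragile; a tolerance-based check is safer:
# all.equal compares within a numeric tolerance; attributes (dimnames) are ignored
all.equal(l_u$L %*% l_u$U, cor_matrix, check.attributes = FALSE)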
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary
MasVnrArea is right skewed, as seen in the histogram below. Its minimum value is 0, so we shift it to the right by adding 10 so that the minimum is strictly above zero.
MasVnrArea <- na.omit(data$MasVnrArea) + 10   # drop NAs, then shift right by 10
min(MasVnrArea)
## [1] 10
summary(MasVnrArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 10.0 10.0 113.7 176.0 1610.0
hist(MasVnrArea)
Use fitdistr to fit an exponential probability density function.
Find the optimal value of λ (the rate parameter) for this distribution, and then take 1000 samples from this exponential distribution using that value.
fit <- fitdistr(MasVnrArea, densfun = 'exponential')   # MASS::fitdistr
fit
## rate
## 0.0087962150
## (0.0002308408)
exp_data <- rexp(1000, as.numeric(fit$estimate))   # 1000 draws at the fitted rate
Plot a histogram and compare it with a histogram of your original variable
Both histograms look very similar, except that MasVnrArea has a wider range and a higher frequency at the low end than the data generated by rexp.
par(mfrow=c(1, 2))
hist(MasVnrArea, breaks = 100 )
hist(exp_data, breaks = 100 )
summary(exp_data )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.025 32.032 76.930 111.165 153.379 917.725
summary(MasVnrArea )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 10.0 10.0 113.7 176.0 1610.0
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)
https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column
\[F(x) = 1 - e^{-\lambda x}, \quad \text{where } \lambda = 0.0087962\]
Setting \(F(x) = p\) and solving for \(x\) gives \(x = -\ln(1-p)/\lambda\).
lambda <- as.numeric(fit$estimate)
cdf5 <- round(-log(1 - 0.05) / lambda, 2)    # 5th percentile via the inverse CDF
cdf95 <- round(-log(1 - 0.95) / lambda, 2)   # 95th percentile
The estimate for the 5th percentile is 5.83.
The estimate for the 95th percentile is 340.57.
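R's built-in exponential quantile function inverts the CDF directly and gives the same values, a quick check on the algebra:
qexp(c(0.05, 0.95), rate = lambda)   # should return roughly 5.83 and 340.57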
Generate a 95% confidence interval from the empirical data, assuming normality
https://www.cyclismo.org/tutorial/R/confidence.html
mean_MasVnrArea <- mean(MasVnrArea)
sd_MasVnrArea <- sd(MasVnrArea)
nitems <- length(MasVnrArea)
error <- qnorm(0.975) * sd_MasVnrArea / sqrt(nitems)   # z * standard error of the mean
left <- mean_MasVnrArea - error
right <- mean_MasVnrArea + error
left
## [1] 104.372
right
## [1] 122.9985
The 95% confidence interval for the mean is between 104.372 and 122.999.
Provide the empirical 5th percentile and 95th percentile of the data.
quantile(MasVnrArea, probs=c(0.05, 0.95))
## 5% 95%
## 10 466
Proportion of the data outside the 95% confidence interval for the mean, based on the assumed normality:
# 100 and 125 approximate the CI endpoints (104.4, 123.0) computed above
(length(MasVnrArea[MasVnrArea < 100]) + length(MasVnrArea[MasVnrArea > 125])) / length(MasVnrArea)
## [1] 0.9641873
The empirical 5th-to-95th percentile interval (10 to 466) is far wider than the normal-theory confidence interval. The two answer different questions: the confidence interval describes where the mean lies, and it is narrow because of the large sample size, not because the data are tightly clustered. Roughly 96% of the individual observations fall outside the mean's confidence interval, which reflects that the data are strongly right skewed (approximately exponential) rather than normal.
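The contrast between an interval for the mean and an interval meant to cover the data can be made explicit; a sketch, using the quantities computed above:
c(left, right)   # CI for the mean: narrow, and it shrinks as n grows
# normal-theory interval intended to cover ~95% of individual observations;
# much wider, and still a poor description here because the data are not normal
mean_MasVnrArea + c(-1, 1) * qnorm(0.975) * sd_MasVnrArea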
Modeling
For the modeling I am using only the numeric variables. The simplest approach is to use all of them; this saturated model has an adjusted R-squared of 0.8034, explaining about 80% of the variability in SalePrice. Some coefficients are NA (dropped because of singularities) and not all variables are statistically significant. The diagnostic plots show non-constant residual variance and a few outliers, which the plots below confirm. The correlation plot also shows that some variables are negatively correlated.
train <- data[,sapply(data, is.numeric)]
train <- na.omit(train)
corrplot(cor(train))
full.model <- lm(SalePrice~., train)
summary(full.model)
##
## Call:
## lm(formula = SalePrice ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442182 -16955 -2824 15125 318183
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.351e+05 1.701e+06 -0.197 0.843909
## Id -1.205e+00 2.658e+00 -0.453 0.650332
## MSSubClass -2.001e+02 3.451e+01 -5.797 8.84e-09 ***
## LotFrontage -1.160e+02 6.126e+01 -1.894 0.058503 .
## LotArea 5.422e-01 1.575e-01 3.442 0.000599 ***
## OverallQual 1.866e+04 1.482e+03 12.592 < 2e-16 ***
## OverallCond 5.239e+03 1.368e+03 3.830 0.000135 ***
## YearBuilt 3.164e+02 8.766e+01 3.610 0.000321 ***
## YearRemodAdd 1.194e+02 8.668e+01 1.378 0.168607
## MasVnrArea 3.141e+01 7.022e+00 4.473 8.54e-06 ***
## BsmtFinSF1 1.736e+01 5.838e+00 2.973 0.003014 **
## BsmtFinSF2 8.342e+00 8.766e+00 0.952 0.341532
## BsmtUnfSF 5.005e+00 5.277e+00 0.948 0.343173
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.597e+01 7.360e+00 6.246 6.02e-10 ***
## X2ndFlrSF 4.663e+01 6.102e+00 7.641 4.72e-14 ***
## LowQualFinSF 3.341e+01 2.794e+01 1.196 0.232009
## GrLivArea NA NA NA NA
## BsmtFullBath 9.043e+03 3.198e+03 2.828 0.004776 **
## BsmtHalfBath 2.465e+03 5.073e+03 0.486 0.627135
## FullBath 5.433e+03 3.531e+03 1.539 0.124182
## HalfBath -1.098e+03 3.321e+03 -0.331 0.740945
## BedroomAbvGr -1.022e+04 2.155e+03 -4.742 2.40e-06 ***
## KitchenAbvGr -2.202e+04 6.710e+03 -3.282 0.001063 **
## TotRmsAbvGrd 5.464e+03 1.487e+03 3.674 0.000251 ***
## Fireplaces 4.372e+03 2.189e+03 1.998 0.046020 *
## GarageYrBlt -4.728e+01 9.106e+01 -0.519 0.603742
## GarageCars 1.685e+04 3.491e+03 4.827 1.58e-06 ***
## GarageArea 6.274e+00 1.213e+01 0.517 0.605002
## WoodDeckSF 2.144e+01 1.002e+01 2.139 0.032662 *
## OpenPorchSF -2.252e+00 1.949e+01 -0.116 0.907998
## EnclosedPorch 7.295e+00 2.062e+01 0.354 0.723590
## X3SsnPorch 3.349e+01 3.758e+01 0.891 0.373163
## ScreenPorch 5.805e+01 2.041e+01 2.844 0.004532 **
## PoolArea -6.052e+01 2.990e+01 -2.024 0.043204 *
## MiscVal -3.761e+00 6.960e+00 -0.540 0.589016
## MoSold -2.217e+02 4.229e+02 -0.524 0.600188
## YrSold -2.474e+02 8.458e+02 -0.293 0.769917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36800 on 1085 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8034
## F-statistic: 131.8 on 35 and 1085 DF, p-value: < 2.2e-16
plot(full.model)
A better model would be one with fewer variables and a similar adjusted R-squared. We could use the 'regsubsets' or 'step' methods for model selection. Based on the regsubsets summary, not all of the variables are statistically significant. The model below contains only statistically significant variables that are positively correlated with SalePrice (based on the correlation plot). It has an adjusted R-squared of 0.7749, so with far fewer variables it explains nearly as much of the variability in the data as the saturated model. Its diagnostic plots are similar to those of the full model, with a few outliers visible in the residual plots. For better prediction and modeling we should also include categorical variables, transform the skewed variables, and deal with the NAs, for example by filling numeric columns with their means or medians (a sketch follows the model diagnostics below).
selection <- regsubsets(SalePrice~., train)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 2 linear dependencies found
## Reordering variables and trying again:
cordata <- train %>% select('OverallQual', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'GrLivArea', 'GarageCars', 'SalePrice')
corrplot(cor(cordata))
model <- lm(SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + GrLivArea + GarageCars, data = train)
summary(selection)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ ., train)
## 37 Variables (and intercept)
## Forced in Forced out
## Id FALSE FALSE
## MSSubClass FALSE FALSE
## LotFrontage FALSE FALSE
## LotArea FALSE FALSE
## OverallQual FALSE FALSE
## OverallCond FALSE FALSE
## YearBuilt FALSE FALSE
## YearRemodAdd FALSE FALSE
## MasVnrArea FALSE FALSE
## BsmtFinSF1 FALSE FALSE
## BsmtFinSF2 FALSE FALSE
## BsmtUnfSF FALSE FALSE
## X1stFlrSF FALSE FALSE
## X2ndFlrSF FALSE FALSE
## LowQualFinSF FALSE FALSE
## BsmtFullBath FALSE FALSE
## BsmtHalfBath FALSE FALSE
## FullBath FALSE FALSE
## HalfBath FALSE FALSE
## BedroomAbvGr FALSE FALSE
## KitchenAbvGr FALSE FALSE
## TotRmsAbvGrd FALSE FALSE
## Fireplaces FALSE FALSE
## GarageYrBlt FALSE FALSE
## GarageCars FALSE FALSE
## GarageArea FALSE FALSE
## WoodDeckSF FALSE FALSE
## OpenPorchSF FALSE FALSE
## EnclosedPorch FALSE FALSE
## X3SsnPorch FALSE FALSE
## ScreenPorch FALSE FALSE
## PoolArea FALSE FALSE
## MiscVal FALSE FALSE
## MoSold FALSE FALSE
## YrSold FALSE FALSE
## TotalBsmtSF FALSE FALSE
## GrLivArea FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond
## 1 ( 1 ) " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " "*" " "
## 5 ( 1 ) " " "*" " " " " "*" " "
## 6 ( 1 ) " " "*" " " " " "*" " "
## 7 ( 1 ) " " "*" " " " " "*" " "
## 8 ( 1 ) " " "*" " " " " "*" "*"
## 9 ( 1 ) " " "*" " " " " "*" "*"
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" " " " " "*" " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " "
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## ScreenPorch PoolArea MiscVal MoSold YrSold
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + GrLivArea + GarageCars, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482083 -19022 -1789 15768 264386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.861e+05 1.442e+05 -6.841 1.30e-11 ***
## OverallQual 2.310e+04 1.408e+03 16.409 < 2e-16 ***
## YearBuilt 1.165e+02 5.746e+01 2.028 0.0428 *
## YearRemodAdd 3.398e+02 7.676e+01 4.427 1.05e-05 ***
## MasVnrArea 3.119e+01 7.315e+00 4.264 2.18e-05 ***
## BsmtFinSF1 2.700e+01 2.684e+00 10.059 < 2e-16 ***
## GrLivArea 4.696e+01 3.148e+00 14.916 < 2e-16 ***
## GarageCars 1.936e+04 2.455e+03 7.888 7.32e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39380 on 1113 degrees of freedom
## Multiple R-squared: 0.7763, Adjusted R-squared: 0.7749
## F-statistic: 551.7 on 7 and 1113 DF, p-value: < 2.2e-16
plot(model)
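As mentioned above, two follow-ups can be sketched: stepwise selection as an alternative to the manual choice of variables, and median imputation instead of dropping rows with NAs. Both snippets are assumptions about how one might proceed, not part of the fitted results above:
reduced.model <- step(full.model, direction = "backward", trace = 0)   # backward AIC selection
summary(reduced.model)
num_cols <- sapply(data, is.numeric)   # median-impute the numeric columns of the raw data
data[num_cols] <- lapply(data[num_cols], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})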
Kaggle Score
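For scoring on Kaggle, a minimal sketch of predicting on the test set and writing the submission file; the column names match the Ames data, but the median fill for test-set NAs is an assumption:
vars <- c("OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea",
          "BsmtFinSF1", "GrLivArea", "GarageCars")
test_X <- test[, vars]
test_X[] <- lapply(test_X, function(col) {   # assumption: median-fill NAs in the predictors
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
submission <- data.frame(Id = test$Id, SalePrice = predict(model, newdata = test_X))
write.csv(submission, "submission.csv", row.names = FALSE)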