Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2.
N <- 10      # any N >= 6
n <- 10000   # sample size
X <- runif(n, min = 1, max = N)
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)        # x: the median of X
y <- quantile(Y)[2]   # y: the 1st quartile of Y
par(mfrow=c(1, 2))
hist(X)
hist(Y)
summary_info <- rbind(summary(X), summary(Y))
rownames(summary_info) <- c('X', 'Y')
summary_info %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "hover"))
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| X | 1.000178 | 3.272130 | 5.491291 | 5.485992 | 7.685153 | 9.999623 |
| Y | -15.184545 | 1.809095 | 5.511779 | 5.489539 | 9.146032 | 26.051536 |
The histograms show that X is roughly uniform between 1 and 10, while Y is approximately normal with a mean of 5.5 and standard deviation of 5.5.
For the probability questions below, let x = 5.4912906 (the median of X), y = 1.809095 (the first quartile of Y), and N = 10.
P(X>x | X>y)
\[P(X>x \mid X>y) = \frac{P(X>x \cap X>y)}{P(X>y)}\]
This is the probability that X is greater than 5.4912906, given that X is greater than 1.809095.
pxandy <- length(X[X > x & X > y]) / length(X)  # empirical P(X > x and X > y); since x > y this is just P(X > x)
py <- length(X[X > y]) / length(X)              # empirical P(X > y)
pxandy / py
## [1] 0.5495109
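As a sanity check, since x > y, this conditional probability has the closed form P(X>x)/P(X>y) = (N−x)/(N−y) for X ~ Uniform(1, N); a minimal sketch using the objects defined above:
# theoretical value for X ~ Uniform(1, N): since x > y,
# P(X > x | X > y) = (N - x) / (N - y)
(N - x) / (N - y)   # about 0.55, close to the empirical 0.5495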
P(X>x, Y>y)
Since X and Y are independent, we can simply multiply the probabilities.
\[P(X>x, Y>y) = P(X>x)\,P(Y>y)\]
Probability that X is greater than 5.4912906 and Y is greater than 1.809095:
a <- length(Y[Y > y]) / length(Y)   # empirical P(Y > y) = 0.75 by construction (y is the 1st quartile)
b <- length(X[X > x]) / length(X)   # empirical P(X > x) = 0.50 by construction (x is the median)
a * b
## [1] 0.375
P(X < x|X > y)
\[P(X<x \mid X>y) = \frac{P(X<x \cap X>y)}{P(X>y)}\]
This is the probability that X is less than 5.4912906, given that X is greater than 1.809095.
pxy <- length(X[X < x & X > y]) / length(X)  # empirical P(X < x and X > y)
py <- length(X[X > y]) / length(X)           # empirical P(X > y)
pxy / py                                     # note: equals 1 minus the answer from part (a)
## [1] 0.4504891
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities
x1 <- ifelse(X > x, "X>x", "X<x")
x2 <- ifelse(Y > y, "Y>y", "Y<y")
t <- table(x1, x2)
tt <- addmargins(t)   # add row/column totals
kable(tt) %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))
| | Y<y | Y>y | Sum |
|---|---|---|---|
| X<x | 1248 | 3752 | 5000 |
| X>x | 1252 | 3748 | 5000 |
| Sum | 2500 | 7500 | 10000 |
t3 <- addmargins(prop.table(table(x1, x2)))   # joint and marginal proportions
kable(t3) %>%
  kable_styling(bootstrap_options = c("bordered", "striped"))
| | Y<y | Y>y | Sum |
|---|---|---|---|
| X<x | 0.1248 | 0.3752 | 0.5 |
| X>x | 0.1252 | 0.3748 | 0.5 |
| Sum | 0.2500 | 0.7500 | 1.0 |
y1 <- ifelse(X > x & Y > y, "X>x & Y>y", "not")
kable(prop.table(table(y1))) %>%
  kable_styling(bootstrap_options = "bordered")
| y1 | Freq |
|---|---|
| not | 0.6252 |
| X>x & Y>y | 0.3748 |
The tables show that the joint probability P(X>x, Y>y) = 0.3748 is approximately equal to the product of the marginal probabilities, 0.5 × 0.75 = 0.375, which is what independence predicts.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
chisq.test(t, correct = TRUE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t
## X-squared = 0.0048, df = 1, p-value = 0.9448
fisher.test(t, simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data
##
## data: t
## p-value = 0.9448
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9086062 1.0912442
## sample estimates:
## odds ratio
## 0.9957426
In both cases the p-value is large, so we fail to reject the null hypothesis of independence. The difference: the chi-square test relies on a large-sample approximation, while Fisher's exact test computes an exact p-value and is typically reserved for tables with small expected cell counts. With 10,000 observations here, the chi-square test is the more appropriate choice, though the two agree.
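To see where the two tests diverge, consider a hypothetical 2×2 table with small counts (the numbers below are made up purely for illustration):
small_t <- matrix(c(3, 1, 1, 5), nrow = 2)  # hypothetical low-count table
chisq.test(small_t)   # warns that the chi-square approximation may be incorrect
fisher.test(small_t)  # exact p-value, no large-sample assumption needed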
# read.csv (rather than read.csv2, which assumes dec = ",") is the natural
# reader for these comma-separated files
data <- read.csv('train.csv', stringsAsFactors = TRUE, header = TRUE, na.strings = 'NA')
test <- read.csv('test.csv', stringsAsFactors = TRUE, header = TRUE, na.strings = 'NA')
train <- data[, sapply(data, is.numeric)][-1]   # numeric columns only, dropping Id
This dataset contains 1460 observations and 81 variables. 38 variables are numeric and 43 variables are categorical.
Descriptive and Inferential Statistics
Based on the histograms below, several variables are right skewed and would likely need to be transformed before any modeling or analysis, using Box-Cox or another normalizing method. The year columns, such as YearBuilt and YearRemodAdd, are also strongly skewed, but it is not obvious what "normalizing" a year would mean or how to interpret it; converting years to ages (e.g., YrSold − YearBuilt) may be more sensible.
par(mfrow = c(3, 4))
hist.data.frame(train)   # Hmisc::hist.data.frame
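As a sketch of the kind of transformation this would involve (a simple log1p here rather than a fitted Box-Cox, with GrLivArea chosen arbitrarily as the example column):
# log1p pulls in the right tail of a skewed variable; MASS::boxcox could
# be used to choose a power transform more formally
par(mfrow = c(1, 2))
hist(train$GrLivArea, main = "GrLivArea")
hist(log1p(train$GrLivArea), main = "log1p(GrLivArea)")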
Scatterplot Matrix
Based on the plots below, all three variables have a positive linear relationship: as TotalBsmtSF and GrLivArea increase, SalePrice also increases.
plot_data <- train %>% select('TotalBsmtSF', 'GrLivArea', 'SalePrice')
pairs(plot_data)
var_names <- c('TotalBsmtSF', 'GrLivArea', 'LotArea', 'SalePrice')
var_data <- data[, var_names]
cor_matrix <- cor(var_data)
cor_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.0000000 0.4548682 0.2608331 0.6135806
## GrLivArea 0.4548682 1.0000000 0.2631162 0.7086245
## LotArea 0.2608331 0.2631162 1.0000000 0.2638434
## SalePrice 0.6135806 0.7086245 0.2638434 1.0000000
corrplot(cor_matrix)
The correlation matrix shows a strong correlation between SalePrice and TotalBsmtSF and between SalePrice and GrLivArea; the correlation between LotArea and SalePrice is much weaker.
SalePrice vs TotalBsmtSF
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.5922142 and 0.6340846.
cor.test(train$SalePrice,train$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
SalePrice vs GrLivArea
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.6915087 and 0.7249450.
cor.test(train$SalePrice,train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
SalePrice vs LotArea
The correlation between these two variables is significantly different from zero, given the tiny p-value (< 2.2e-16). We are 80% confident the true correlation is between 0.2323391 and 0.2947946.
cor.test(train$SalePrice,train$LotArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
Based on the correlation tests above, we reject the null hypothesis of zero correlation in favor of the alternative in all three cases, because of the small p-values.
https://www.statisticshowto.datasciencecentral.com/familywise-error-rate/
Would you be worried about familywise error? Why or why not?
Familywise error is the probability of making at least one Type I error (a false positive, i.e., rejecting a true null hypothesis) across a family of tests. With three tests at the α = 0.2 level used above, the familywise rate could in principle be as high as 1 − (1 − 0.2)³ ≈ 0.49. In our case, however, every p-value is extremely small, far below any reasonably adjusted threshold, so I would not be worried about committing a Type I error here.
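A standard guard is a Bonferroni correction; a sketch, using R's reported upper bounds for the three p-values above:
p_vals <- c(2.2e-16, 2.2e-16, 2.2e-16)   # reported p-value ceilings from the three tests
p.adjust(p_vals, method = "bonferroni")  # all remain far below 0.05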
Linear Algebra and Correlation
Invert the correlation matrix
precision_matrix <- solve(cor_matrix)   # inverse of the correlation matrix
precision_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.6321069 -0.0397442 -0.17031695 -0.92832834
## GrLivArea -0.0397442 2.0350650 -0.16233936 -1.37487844
## LotArea -0.1703170 -0.1623394 1.10622180 -0.07232846
## SalePrice -0.9283283 -1.3748784 -0.07232846 2.56296011
Multiply the correlation matrix by the precision matrix
cor_matrix %*% precision_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1 0.000000e+00 -6.938894e-18 0.000000e+00
## GrLivArea 0 1.000000e+00 5.551115e-17 2.220446e-16
## LotArea 0 0.000000e+00 1.000000e+00 0.000000e+00
## SalePrice 0 -2.220446e-16 0.000000e+00 1.000000e+00
Multiply the precision matrix by the correlation matrix
precision_matrix %*% cor_matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF 1.000000e+00 2.220446e-16 8.326673e-17 1.110223e-16
## GrLivArea 1.110223e-16 1.000000e+00 0.000000e+00 0.000000e+00
## LotArea -4.857226e-17 -2.081668e-17 1.000000e+00 -4.163336e-17
## SalePrice 0.000000e+00 2.220446e-16 -1.110223e-16 1.000000e+00
LU decomposition on the matrix
l_u <- lu(cor_matrix)           # lu() here is assumed to be pracma::lu, which returns the L and U factors
l_u$L %*% l_u$U == cor_matrix   # element-wise check that L %*% U reproduces the matrix
## TotalBsmtSF GrLivArea LotArea SalePrice
## TotalBsmtSF TRUE TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE TRUE
## LotArea TRUE TRUE TRUE TRUE
## SalePrice TRUE TRUE TRUE TRUE
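Exact equality happened to hold here, but comparing floating-point matrices with == is fragile; a tolerance-based check is safer:
# all.equal compares within a numeric tolerance; attributes (dimnames) are ignored
all.equal(l_u$L %*% l_u$U, cor_matrix, check.attributes = FALSE)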
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary
MasVnrArea is right skewed, as seen in the histogram below. Its minimum value is 0, so we shift it to the right by adding 10 so that the minimum is strictly above zero.
MasVnrArea <- na.omit(data$MasVnrArea) + 10   # drop NAs, then shift right by 10
min(MasVnrArea)
## [1] 10
summary(MasVnrArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 10.0 10.0 113.7 176.0 1610.0
hist(MasVnrArea)
Use fitdistr to fit an exponential probability density function.
Find the optimal value of λ (the rate parameter) for this distribution, and then take 1000 samples from this exponential distribution using that value.
fit <- fitdistr(MasVnrArea, densfun = 'exponential')   # MASS::fitdistr
fit
## rate
## 0.0087962150
## (0.0002308408)
exp_data <- rexp(1000, as.numeric(fit$estimate))   # 1000 draws at the fitted rate
Plot a histogram and compare it with a histogram of your original variable
Both histograms look very similar, except that MasVnrArea has a wider range and a higher frequency at the low end than the data generated by rexp.
par(mfrow=c(1, 2))
hist(MasVnrArea, breaks = 100 )
hist(exp_data, breaks = 100 )
summary(exp_data )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.025 32.032 76.930 111.165 153.379 917.725
summary(MasVnrArea )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 10.0 10.0 113.7 176.0 1610.0
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)
https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column
\[F(x) = 1 - e^{-\lambda x}, \quad \text{where } \lambda = 0.0087962\]
Setting \(F(x) = p\) and solving for \(x\) gives \(x = -\ln(1-p)/\lambda\).
lambda <- as.numeric(fit$estimate)
cdf5 <- round(-log(1 - 0.05) / lambda, 2)    # 5th percentile via the inverse CDF
cdf95 <- round(-log(1 - 0.95) / lambda, 2)   # 95th percentile
The estimate for the 5th percentile is 5.83.
The estimate for the 95th percentile is 340.57.
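R's built-in exponential quantile function inverts the CDF directly and gives the same values, a quick check on the algebra:
qexp(c(0.05, 0.95), rate = lambda)   # should return roughly 5.83 and 340.57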
Generate a 95% confidence interval from the empirical data, assuming normality
https://www.cyclismo.org/tutorial/R/confidence.html
mean_MasVnrArea <- mean(MasVnrArea)
sd_MasVnrArea <- sd(MasVnrArea)
nitems <- length(MasVnrArea)
error <- qnorm(0.975) * sd_MasVnrArea / sqrt(nitems)   # z * standard error of the mean
left <- mean_MasVnrArea - error
right <- mean_MasVnrArea + error
left
## [1] 104.372
right
## [1] 122.9985
The 95% confidence interval for the mean is between 104.372 and 122.999.
Provide the empirical 5th percentile and 95th percentile of the data.
quantile(MasVnrArea, probs=c(0.05, 0.95))
## 5% 95%
## 10 466
Proportion of the data outside the 95% confidence interval for the mean, based on the assumed normality:
# 100 and 125 approximate the CI endpoints (104.4, 123.0) computed above
(length(MasVnrArea[MasVnrArea < 100]) + length(MasVnrArea[MasVnrArea > 125])) / length(MasVnrArea)
## [1] 0.9641873
The empirical 5th-to-95th percentile interval (10 to 466) is far wider than the normal-theory confidence interval. The two answer different questions: the confidence interval describes where the mean lies, and it is narrow because of the large sample size, not because the data are tightly clustered. Roughly 96% of the individual observations fall outside the mean's confidence interval, which reflects that the data are strongly right skewed (approximately exponential) rather than normal.
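The contrast between an interval for the mean and an interval meant to cover the data can be made explicit; a sketch, using the quantities computed above:
c(left, right)   # CI for the mean: narrow, and it shrinks as n grows
# normal-theory interval intended to cover ~95% of individual observations;
# much wider, and still a poor description here because the data are not normal
mean_MasVnrArea + c(-1, 1) * qnorm(0.975) * sd_MasVnrArea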
Modeling
For the modeling I am using only the numeric variables. The simplest approach is to use all of them; this saturated model has an adjusted R-squared of 0.8034, explaining about 80% of the variability in SalePrice. Some coefficients are NA (dropped because of singularities) and not all variables are statistically significant. The diagnostic plots show non-constant residual variance and a few outliers, which the plots below confirm. The correlation plot also shows that some variables are negatively correlated.
train <- data[,sapply(data, is.numeric)]
train <- na.omit(train)
corrplot(cor(train))
full.model <- lm(SalePrice~., train)
summary(full.model)
##
## Call:
## lm(formula = SalePrice ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442182 -16955 -2824 15125 318183
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.351e+05 1.701e+06 -0.197 0.843909
## Id -1.205e+00 2.658e+00 -0.453 0.650332
## MSSubClass -2.001e+02 3.451e+01 -5.797 8.84e-09 ***
## LotFrontage -1.160e+02 6.126e+01 -1.894 0.058503 .
## LotArea 5.422e-01 1.575e-01 3.442 0.000599 ***
## OverallQual 1.866e+04 1.482e+03 12.592 < 2e-16 ***
## OverallCond 5.239e+03 1.368e+03 3.830 0.000135 ***
## YearBuilt 3.164e+02 8.766e+01 3.610 0.000321 ***
## YearRemodAdd 1.194e+02 8.668e+01 1.378 0.168607
## MasVnrArea 3.141e+01 7.022e+00 4.473 8.54e-06 ***
## BsmtFinSF1 1.736e+01 5.838e+00 2.973 0.003014 **
## BsmtFinSF2 8.342e+00 8.766e+00 0.952 0.341532
## BsmtUnfSF 5.005e+00 5.277e+00 0.948 0.343173
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.597e+01 7.360e+00 6.246 6.02e-10 ***
## X2ndFlrSF 4.663e+01 6.102e+00 7.641 4.72e-14 ***
## LowQualFinSF 3.341e+01 2.794e+01 1.196 0.232009
## GrLivArea NA NA NA NA
## BsmtFullBath 9.043e+03 3.198e+03 2.828 0.004776 **
## BsmtHalfBath 2.465e+03 5.073e+03 0.486 0.627135
## FullBath 5.433e+03 3.531e+03 1.539 0.124182
## HalfBath -1.098e+03 3.321e+03 -0.331 0.740945
## BedroomAbvGr -1.022e+04 2.155e+03 -4.742 2.40e-06 ***
## KitchenAbvGr -2.202e+04 6.710e+03 -3.282 0.001063 **
## TotRmsAbvGrd 5.464e+03 1.487e+03 3.674 0.000251 ***
## Fireplaces 4.372e+03 2.189e+03 1.998 0.046020 *
## GarageYrBlt -4.728e+01 9.106e+01 -0.519 0.603742
## GarageCars 1.685e+04 3.491e+03 4.827 1.58e-06 ***
## GarageArea 6.274e+00 1.213e+01 0.517 0.605002
## WoodDeckSF 2.144e+01 1.002e+01 2.139 0.032662 *
## OpenPorchSF -2.252e+00 1.949e+01 -0.116 0.907998
## EnclosedPorch 7.295e+00 2.062e+01 0.354 0.723590
## X3SsnPorch 3.349e+01 3.758e+01 0.891 0.373163
## ScreenPorch 5.805e+01 2.041e+01 2.844 0.004532 **
## PoolArea -6.052e+01 2.990e+01 -2.024 0.043204 *
## MiscVal -3.761e+00 6.960e+00 -0.540 0.589016
## MoSold -2.217e+02 4.229e+02 -0.524 0.600188
## YrSold -2.474e+02 8.458e+02 -0.293 0.769917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36800 on 1085 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8034
## F-statistic: 131.8 on 35 and 1085 DF, p-value: < 2.2e-16
plot(full.model)
A better model would be one with fewer variables and a similar adjusted R-squared. We could use the 'regsubsets' or 'step' methods for model selection. Based on the regsubsets summary, not all of the variables are statistically significant. The model below contains only statistically significant variables that are positively correlated with SalePrice (based on the correlation plot). It has an adjusted R-squared of 0.7749, so with far fewer variables it explains nearly as much of the variability in the data as the saturated model. Its diagnostic plots are similar to those of the full model, with a few outliers visible in the residual plots. For better prediction and modeling we should also include categorical variables, transform the skewed variables, and deal with the NAs, for example by filling numeric columns with their means or medians (a sketch follows the model diagnostics below).
selection <- regsubsets(SalePrice~., train)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 2 linear dependencies found
## Reordering variables and trying again:
cordata <- train %>% select('OverallQual', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'GrLivArea', 'GarageCars', 'SalePrice')
corrplot(cor(cordata))
model <- lm(SalePrice ~ OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + GrLivArea + GarageCars, data = train)
summary(selection)
## Subset selection object
## Call: regsubsets.formula(SalePrice ~ ., train)
## 37 Variables (and intercept)
## Forced in Forced out
## Id FALSE FALSE
## MSSubClass FALSE FALSE
## LotFrontage FALSE FALSE
## LotArea FALSE FALSE
## OverallQual FALSE FALSE
## OverallCond FALSE FALSE
## YearBuilt FALSE FALSE
## YearRemodAdd FALSE FALSE
## MasVnrArea FALSE FALSE
## BsmtFinSF1 FALSE FALSE
## BsmtFinSF2 FALSE FALSE
## BsmtUnfSF FALSE FALSE
## X1stFlrSF FALSE FALSE
## X2ndFlrSF FALSE FALSE
## LowQualFinSF FALSE FALSE
## BsmtFullBath FALSE FALSE
## BsmtHalfBath FALSE FALSE
## FullBath FALSE FALSE
## HalfBath FALSE FALSE
## BedroomAbvGr FALSE FALSE
## KitchenAbvGr FALSE FALSE
## TotRmsAbvGrd FALSE FALSE
## Fireplaces FALSE FALSE
## GarageYrBlt FALSE FALSE
## GarageCars FALSE FALSE
## GarageArea FALSE FALSE
## WoodDeckSF FALSE FALSE
## OpenPorchSF FALSE FALSE
## EnclosedPorch FALSE FALSE
## X3SsnPorch FALSE FALSE
## ScreenPorch FALSE FALSE
## PoolArea FALSE FALSE
## MiscVal FALSE FALSE
## MoSold FALSE FALSE
## YrSold FALSE FALSE
## TotalBsmtSF FALSE FALSE
## GrLivArea FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond
## 1 ( 1 ) " " " " " " " " "*" " "
## 2 ( 1 ) " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " "*" " "
## 5 ( 1 ) " " "*" " " " " "*" " "
## 6 ( 1 ) " " "*" " " " " "*" " "
## 7 ( 1 ) " " "*" " " " " "*" " "
## 8 ( 1 ) " " "*" " " " " "*" "*"
## 9 ( 1 ) " " "*" " " " " "*" "*"
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" " " " " "*" " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " "
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " "*"
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## ScreenPorch PoolArea MiscVal MoSold YrSold
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + YearBuilt + YearRemodAdd +
## MasVnrArea + BsmtFinSF1 + GrLivArea + GarageCars, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482083 -19022 -1789 15768 264386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.861e+05 1.442e+05 -6.841 1.30e-11 ***
## OverallQual 2.310e+04 1.408e+03 16.409 < 2e-16 ***
## YearBuilt 1.165e+02 5.746e+01 2.028 0.0428 *
## YearRemodAdd 3.398e+02 7.676e+01 4.427 1.05e-05 ***
## MasVnrArea 3.119e+01 7.315e+00 4.264 2.18e-05 ***
## BsmtFinSF1 2.700e+01 2.684e+00 10.059 < 2e-16 ***
## GrLivArea 4.696e+01 3.148e+00 14.916 < 2e-16 ***
## GarageCars 1.936e+04 2.455e+03 7.888 7.32e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39380 on 1113 degrees of freedom
## Multiple R-squared: 0.7763, Adjusted R-squared: 0.7749
## F-statistic: 551.7 on 7 and 1113 DF, p-value: < 2.2e-16
plot(model)
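As mentioned above, two follow-ups can be sketched: stepwise selection as an alternative to the manual choice of variables, and median imputation instead of dropping rows with NAs. Both snippets are assumptions about how one might proceed, not part of the fitted results above:
reduced.model <- step(full.model, direction = "backward", trace = 0)   # backward AIC selection
summary(reduced.model)
num_cols <- sapply(data, is.numeric)   # median-impute the numeric columns of the raw data
data[num_cols] <- lapply(data[num_cols], function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})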
Kaggle Score
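For scoring on Kaggle, a minimal sketch of predicting on the test set and writing the submission file; the column names match the Ames data, but the median fill for test-set NAs is an assumption:
vars <- c("OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea",
          "BsmtFinSF1", "GrLivArea", "GarageCars")
test_X <- test[, vars]
test_X[] <- lapply(test_X, function(col) {   # assumption: median-fill NAs in the predictors
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
submission <- data.frame(Id = test$Id, SalePrice = predict(model, newdata = test_X))
write.csv(submission, "submission.csv", row.names = FALSE)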