Video Link: https://youtu.be/KD21iznAR9g
library(MASS)
library(Matrix)
library(matlib)
library(dplyr)
library(ggplot2)
library(tidyr)
library(kableExtra)
library(purrr)
library(Hmisc)
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of μ = σ = (N+1)/2.
Solution:
Generate a random variable X
# set seed value
set.seed(1)
N <- 6
X <- runif(10000, min = 1, max = N)
Generate a random variable Y
# mean; the prompt sets mu = sigma = (N+1)/2
mu <- (N+1)/2
# note: sd is left at rnorm's default of 1 here rather than sigma = mu;
# the outputs that follow reflect that choice
Y <- rnorm(10000, mean = mu)
Probability. Assume the small letter x is estimated as the median of the X variable and the small letter y is estimated as the 1st quartile of the Y variable.
a. P(X > x | X > y)
# x is the median of X; y is the 1st quartile of Y
x <- median(X)
y <- summary(Y)[2][[1]]
# P(A|B) = P(AB)/P(B); here A = {X > x} and B = {X > y}
p1 <- (sum(X>x & X>y)/length(X))/(sum(X>y)/length(X))
p1
## [1] 0.7875256
The probability that X is greater than the median of X, given that X is greater than the first quartile of Y, is 0.7875.
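As a quick sanity check (added here, not part of the original solution), the same number follows analytically from the uniform distribution, using the empirical cutoffs x and y: since x > y, the event {X > x} is contained in {X > y}, so the conditional probability is just the ratio of the two survival probabilities.
# analytic check (a sketch): for X ~ Uniform(1, N), P(X > t) = (N - t)/(N - 1)
p_unif <- function(t) (N - t)/(N - 1)
p_unif(x) / p_unif(y)  # should be close to p1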
b. P(X > x, Y > y)
Solution:
# P(AB): the joint probability that both events occur
p2 <- sum(X>x & Y>y)/length(X)
p2
## [1] 0.3754
The probability that X is greater than the median of X and Y is greater than the first quartile of Y is 0.3754.
c. P(X < x | X > y)
Solution:
# P(A|B) = P(AB)/P(B); the common length(X) denominator cancels
p3 <- sum(X<x & X>y)/sum(X>y)
p3
## [1] 0.2124744
The probability that X is less than the median of X, given that X is greater than the first quartile of Y, is 0.2124744.
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Solution:
Making the joint and marginal probability table:
# 2x2 table of counts, with row and column totals appended
# (renamed from `matrix` to avoid shadowing base::matrix)
counts <- matrix(c(sum(X>x & Y<y), sum(X>x & Y>y),
                   sum(X<x & Y<y), sum(X<x & Y>y)),
                 nrow = 2, ncol = 2)
counts <- cbind(counts, counts[, 1] + counts[, 2])  # row totals
counts <- rbind(counts, counts[1, ] + counts[2, ])  # column totals
contingency <- as.data.frame(counts)
names(contingency) <- c("X>x","X<x", "Total")
row.names(contingency) <- c("Y<y","Y>y", "Total")
kable(contingency) %>%
kable_styling(bootstrap_options = "bordered")
| | X>x | X<x | Total |
|---|---|---|---|
| Y<y | 1246 | 1254 | 2500 |
| Y>y | 3754 | 3746 | 7500 |
| Total | 5000 | 5000 | 10000 |
prob_matrix <- counts / counts[3, 3]  # divide by the grand total to get probabilities
contingency_p<-as.data.frame(prob_matrix)
names(contingency_p) <- c("X>x","X<x", "Total")
row.names(contingency_p) <- c("Y<y","Y>y", "Total")
kable(round(contingency_p,3)) %>%
kable_styling(bootstrap_options = "bordered")
| | X>x | X<x | Total |
|---|---|---|---|
| Y<y | 0.125 | 0.125 | 0.25 |
| Y>y | 0.375 | 0.375 | 0.75 |
| Total | 0.500 | 0.500 | 1.00 |
Compute P(X>x)P(Y>y)
prob_matrix[3,1]*prob_matrix[2,3]  # marginal P(X>x) times marginal P(Y>y)
## [1] 0.375
Compute P(X>x and Y>y)
round(prob_matrix[2,1], digits = 3)  # the joint cell for X>x and Y>y
## [1] 0.375
Verify P(X>x and Y>y)=P(X>x)P(Y>y)
prob_matrix[3,1]*prob_matrix[2,3]==round(prob_matrix[2,1],digits = 3)
## [1] TRUE
Since both results agree (to the rounding applied), the condition P(X>x and Y>y)=P(X>x)P(Y>y) holds, which is consistent with X and Y being independent.
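An unrounded comparison (a sketch added here, not part of the original output) makes the near-equality explicit:
# joint probability vs. product of marginals, without rounding;
# in a finite sample the two agree only approximately even under independence
p_joint   <- mean(X > x & Y > y)
p_product <- mean(X > x) * mean(Y > y)
c(joint = p_joint, product = p_product, difference = p_joint - p_product)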
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Solution:
Fisher’s Exact Test
fisher.test(table(X>x,Y>y))
##
## Fisher's Exact Test for Count Data
##
## data: table(X > x, Y > y)
## p-value = 0.8716
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9202847 1.1052820
## sample estimates:
## odds ratio
## 1.00857
Chi Square Test
chisq.test(table(X>x,Y>y))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(X > x, Y > y)
## X-squared = 0.026133, df = 1, p-value = 0.8716
H0: X and Y are independent.
Ha: X and Y are not independent.
The chi-squared test of independence compares two categorical variables to see whether they are related; a small p-value indicates a significant association between them.
In both cases the p-value (0.8716) is much greater than the 0.05 threshold, so we fail to reject the null hypothesis of independence: the data are consistent with X and Y being independent.
Fisher's exact test addresses the same null hypothesis of independence, but it computes the exact probability of the observed table, which makes it the standard choice when the sample size is small (e.g., expected cell counts below 5). The chi-squared test relies on a large-sample approximation.
We have a large enough sample size here, so the chi-squared test is more appropriate in this case.
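One quick way to confirm that the chi-squared approximation is safe (a check added here, not in the original) is to inspect the expected cell counts, which should all be at least 5:
# expected counts under independence; all four cells are in the thousands,
# far above the usual rule-of-thumb minimum of 5
chisq.test(table(X > x, Y > y))$expected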
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .
Descriptive and Inferential Statistics
library(readr)
library(tidyverse)
train <- read_csv('train.csv')
test <- read_csv("test.csv")
train <- train[ ,1:81]
dim(train)
## [1] 1460 81
#summary(train)
MSSubClass is right skewed (its mean, 56.9, exceeds its median, 50):
summary(train$MSSubClass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 20.0 50.0 56.9 70.0 190.0
hist(train$MSSubClass, main="Distribution of MSSubClass",xlab="MSSubClass")
RL has the highest frequency and C (all) the lowest:
summary(train$MSZoning)
## Length Class Mode
## 1460 character character
barplot(table(train$MSZoning), main="MS Zoning")
LotFrontage is right skewed:
summary(train$LotFrontage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 69.00 70.05 80.00 313.00 259
hist(train$LotFrontage,main="Histogram of Lot Frontage",xlab="LotFrontage")
LotArea is strongly right skewed, with a long tail of very large values:
hist(train$LotArea,main="Distribution of LotArea",xlab="Lot Area")
Ground living area is unimodal but right skewed:
hist(train$GrLivArea,main="Distribution of Ground Living Area",xlab="Ground Living Area")
Sale price is also unimodal but right skewed:
hist(train$SalePrice,main="Distribution of Sale Price",xlab="Sale Price")
pairs(train[,c("SalePrice","GrLivArea","LotFrontage")])
According to the scatter plots above, LotFrontage and GrLivArea are positively correlated with SalePrice.
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:matlib':
##
## tr
#a scatterplot matrix of independent variables and the dependent variable
sp_train <- train
sp_train %>%
dplyr::select(c("SalePrice", "GrLivArea", "LotFrontage")) %>%
pairs.panels(method = "pearson", hist.col = "#c95656")
We choose the following variables: SalePrice, GrLivArea, and TotalBsmtSF.
correlation_matrix <- cor(train[,c("SalePrice","GrLivArea","TotalBsmtSF")])
correlation_matrix
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 1.0000000 0.7086245 0.6135806
## GrLivArea 0.7086245 1.0000000 0.4548682
## TotalBsmtSF 0.6135806 0.4548682 1.0000000
From the matrix above, SalePrice has a strong positive correlation with GrLivArea (0.71) and a moderate positive correlation with TotalBsmtSF (0.61), while GrLivArea and TotalBsmtSF have a weaker positive correlation with each other (0.45).
SalePrice vs GrLivArea
H0: The correlation between GrLivArea and SalePrice is 0
Ha: The correlation between GrLivArea and SalePrice is nonzero
cor.test(train$SalePrice, train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between GrLivArea and SalePrice is nonzero. The 80 percent confidence interval for the correlation is [0.6915087, 0.7249450].
SalePrice vs TotalBsmtSF
H0: The correlation between TotalBsmtSF and SalePrice is 0
Ha: The correlation between TotalBsmtSF and SalePrice is nonzero
cor.test(train$SalePrice, train$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between TotalBsmtSF and SalePrice is nonzero. The 80 percent confidence interval for the correlation is [0.5922142, 0.6340846].
TotalBsmtSF vs GrLivArea
H0: The correlation between TotalBsmtSF and GrLivArea is 0
Ha: The correlation between TotalBsmtSF and GrLivArea is nonzero
cor.test(train$GrLivArea, train$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$TotalBsmtSF
## t = 19.503, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4278380 0.4810855
## sample estimates:
## cor
## 0.4548682
Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between GrLivArea and TotalBsmtSF is nonzero. The 80 percent confidence interval for the correlation is [0.4278380, 0.4810855].
family_wise_error <- 1 - (1 - .05)^3  # three pairwise correlation tests were performed above
family_wise_error
## [1] 0.142625
The familywise error rate is the probability of making at least one type I error (a false positive) across a series of hypothesis tests.
Across our three tests there is a 14.26% chance of at least one type I error. Because the observed p-values are all below 2.2e-16, however, the conclusions would survive any reasonable correction, so I am not worried about familywise error here.
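As a sketch of what a formal correction would look like, we can Bonferroni-adjust the three p-values; all three conclusions remain intact:
# Bonferroni adjustment of the three correlation-test p-values
p_values <- c(cor.test(train$SalePrice, train$GrLivArea)$p.value,
              cor.test(train$SalePrice, train$TotalBsmtSF)$p.value,
              cor.test(train$GrLivArea, train$TotalBsmtSF)$p.value)
p.adjust(p_values, method = "bonferroni")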
Linear Algebra and Correlation
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Solution:
# find inverse
precision_matrix <- solve(correlation_matrix)
precision_matrix
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 2.5582310 -1.38549273 -0.93946422
## GrLivArea -1.3854927 2.01124151 -0.06473842
## TotalBsmtSF -0.9394642 -0.06473842 1.60588442
# Multiply the correlation matrix by the precision matrix
correlation_precision <- correlation_matrix %*% precision_matrix
correlation_precision
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 1.000000e+00 -2.081668e-17 0.000000e+00
## GrLivArea 5.551115e-17 1.000000e+00 1.110223e-16
## TotalBsmtSF 0.000000e+00 5.551115e-17 1.000000e+00
# multiply the precision matrix by the correlation matrix
prec_cor <- precision_matrix %*% correlation_matrix
prec_cor
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 1.000000e+00 -1.665335e-16 -1.110223e-16
## GrLivArea 2.012279e-16 1.000000e+00 1.665335e-16
## TotalBsmtSF 0.000000e+00 1.110223e-16 1.000000e+00
# LU decomposition
library(pracma)
##
## Attaching package: 'pracma'
## The following objects are masked from 'package:psych':
##
## logit, polar
## The following object is masked from 'package:Hmisc':
##
## ceil
## The following object is masked from 'package:purrr':
##
## cross
## The following objects are masked from 'package:matlib':
##
## angle, inv
## The following objects are masked from 'package:Matrix':
##
## expm, lu, tril, triu
lu(precision_matrix)
## $L
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 1.0000000 0.0000000 0
## GrLivArea -0.5415823 1.0000000 0
## TotalBsmtSF -0.3672320 -0.4548682 1
##
## $U
## SalePrice GrLivArea TotalBsmtSF
## SalePrice 2.558231 -1.385493 -0.9394642
## GrLivArea 0.000000 1.260883 -0.5735356
## TotalBsmtSF 0.000000 0.000000 1.0000000
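As a verification step (added here, not in the original output), multiplying the two factors back together should recover the precision matrix:
# L %*% U should reproduce the precision matrix up to floating-point error
decomp <- lu(precision_matrix)
all.equal(decomp$L %*% decomp$U, precision_matrix, check.attributes = FALSE)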
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
We have selected the variable LotArea, because it is skewed to the right.
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
library(MASS)
# fit an exponential distribution to LotArea by maximum likelihood
(fit_dist <- fitdistr(train$LotArea, "exponential"))
## rate
## 9.508570e-05
## (2.488507e-06)
# optimal value of lambda (the fitted rate)
fit_dist$estimate
## rate
## 9.50857e-05
# draw 1000 samples from the fitted exponential distribution
values <- rexp(1000, rate = fit_dist$estimate)
par(mfrow=c(1,2))
# Actual vs simulated distribution
hist(train$LotArea, breaks=40, prob=TRUE, xlab="Lot Area",
main="Original - LotArea")
hist(values, breaks=40, prob=TRUE, xlab="Generated Data",
main="Exponential - LotArea")
From the two plots we can see that LotArea approximately follows an exponential distribution.
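For a more direct visual comparison (a sketch added here, not part of the original), the fitted density can be drawn over the original histogram:
# overlay the fitted exponential density on the LotArea histogram
hist(train$LotArea, breaks = 40, prob = TRUE, xlab = "Lot Area",
     main = "LotArea with fitted exponential density")
curve(dexp(x, rate = fit_dist$estimate), col = "red", lwd = 2, add = TRUE)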
# empirical CDF of the simulated values
five_95 <- ecdf(values)
values[five_95(values)==0.05]
## [1] 402.4144
values[five_95(values)==0.95]
## [1] 30176.06
So the empirical 5th percentile is 402.4144 and the 95th percentile is 30176.06.
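The corresponding theoretical percentiles are available in closed form (a check added here): the exponential quantile function is F⁻¹(p) = -log(1 - p)/λ.
# theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = fit_dist$estimate)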
# 95% confidence interval for the mean of the simulated data
t.test(values)$conf.int
## [1] 9667.527 10915.180
## attr(,"conf.level")
## [1] 0.95
# 95% confidence interval for the mean of the observed LotArea, assuming normality
t.test(train$LotArea)$conf.int
## [1] 10004.42 11029.24
## attr(,"conf.level")
## [1] 0.95
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# count the missing values in each column
sapply(train, function(x){sum(is.na(x))})
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Let’s do some cleaning: we will remove the variables with large numbers of missing values.
train <-train[, !colnames(train) %in% c("Id","Alley","PoolQC","Fence","MiscFeature","FireplaceQu","LotFrontage","YearRemodAdd")]
test <- test[, !colnames(test) %in% c("Alley","PoolQC","Fence","MiscFeature","FireplaceQu","LotFrontage","YearRemodAdd")]
# convert categorical variables to integer codes (this imposes an arbitrary ordering on the levels)
train <- train%>%
mutate_if(is.character, as.factor)%>%
mutate_if(is.factor, as.integer)
test <- test %>%
mutate_if(is.character, as.factor)%>%
mutate_if(is.factor, as.integer)
Restricting the data to numeric variables
train <- select_if(train, is.numeric)
#test <- select_if(test, is.numeric)
#train <- na.omit(train)
train[is.na(train)] <- 0
test[is.na(test)] <- 0
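Zero-filling is a blunt instrument; a median-based alternative (a sketch of what could be done instead, not what was used here; impute_median is a hypothetical helper) might look like:
# hypothetical alternative: impute numeric NAs with each column's median
impute_median <- function(df) {
  df %>% mutate(across(where(is.numeric),
                       ~ replace(.x, is.na(.x), median(.x, na.rm = TRUE))))
}
# train <- impute_median(train); test <- impute_median(test)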
Now we are in business, and we can start modeling!!!
Let’s fit a multiple regression model
Stepwise Regression
Stepwise regression is useful for high-dimensional data with many predictor variables: it searches for a subset of variables that yields an optimal, simpler model.
So let's do it!
library(MASS)
res.lm <- lm(SalePrice ~., data = train)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + Street + LotShape +
## LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle +
## OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl +
## Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual +
## BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 +
## BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath +
## BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive +
## WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition,
## data = train)
##
## Coefficients:
## (Intercept) MSSubClass LotArea Street LotShape
## 1.795e+06 -1.485e+02 3.386e-01 3.361e+04 -9.974e+02
## LandContour LandSlope Neighborhood Condition2 HouseStyle
## 3.537e+03 7.013e+03 2.982e+02 -9.294e+03 -1.206e+03
## OverallQual OverallCond YearBuilt RoofStyle RoofMatl
## 1.172e+04 5.256e+03 1.916e+02 2.105e+03 4.312e+03
## Exterior1st MasVnrType MasVnrArea ExterQual BsmtQual
## -5.467e+02 3.637e+03 2.626e+01 -9.784e+03 -7.088e+03
## BsmtCond BsmtExposure BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 3.397e+03 -3.183e+03 1.736e+01 2.350e+03 2.504e+01
## BsmtUnfSF `1stFlrSF` `2ndFlrSF` BsmtFullBath FullBath
## 8.234e+00 3.942e+01 4.666e+01 7.999e+03 3.810e+03
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## -4.182e+03 -1.647e+04 -8.664e+03 3.660e+03 3.743e+03
## Fireplaces GarageYrBlt GarageCars PavedDrive WoodDeckSF
## 4.799e+03 -9.056e+00 1.358e+04 2.917e+03 1.922e+01
## ScreenPorch PoolArea YrSold SaleCondition
## 4.176e+01 -3.031e+01 -1.112e+03 2.512e+03
Let's fit the model that the stepwise selection chose.
mf <- lm(SalePrice ~ MSSubClass + LotArea + Street + LotShape +
LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle +
OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl +
Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual +
BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 +
BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath +
BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive +
WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition,
data = train)
summary(mf)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + Street + LotShape +
## LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle +
## OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl +
## Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual +
## BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 +
## BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath +
## BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive +
## WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -456136 -14119 -1041 12562 293002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.795e+06 1.284e+06 1.398 0.162217
## MSSubClass -1.485e+02 2.519e+01 -5.895 4.67e-09 ***
## LotArea 3.386e-01 1.026e-01 3.300 0.000992 ***
## Street 3.361e+04 1.366e+04 2.460 0.014023 *
## LotShape -9.974e+02 6.345e+02 -1.572 0.116175
## LandContour 3.537e+03 1.313e+03 2.695 0.007125 **
## LandSlope 7.013e+03 3.765e+03 1.863 0.062676 .
## Neighborhood 2.982e+02 1.492e+02 1.998 0.045894 *
## Condition2 -9.294e+03 3.234e+03 -2.874 0.004109 **
## HouseStyle -1.206e+03 5.939e+02 -2.030 0.042506 *
## OverallQual 1.172e+04 1.153e+03 10.163 < 2e-16 ***
## OverallCond 5.256e+03 8.817e+02 5.962 3.15e-09 ***
## YearBuilt 1.916e+02 5.115e+01 3.746 0.000187 ***
## RoofStyle 2.105e+03 1.094e+03 1.925 0.054390 .
## RoofMatl 4.312e+03 1.470e+03 2.933 0.003411 **
## Exterior1st -5.467e+02 2.740e+02 -1.995 0.046196 *
## MasVnrType 3.637e+03 1.431e+03 2.542 0.011125 *
## MasVnrArea 2.626e+01 5.832e+00 4.503 7.26e-06 ***
## ExterQual -9.784e+03 1.903e+03 -5.142 3.09e-07 ***
## BsmtQual -7.088e+03 1.314e+03 -5.394 8.07e-08 ***
## BsmtCond 3.397e+03 1.251e+03 2.715 0.006699 **
## BsmtExposure -3.183e+03 8.511e+02 -3.739 0.000192 ***
## BsmtFinSF1 1.736e+01 4.924e+00 3.525 0.000437 ***
## BsmtFinType2 2.350e+03 1.073e+03 2.190 0.028716 *
## BsmtFinSF2 2.504e+01 7.638e+00 3.278 0.001072 **
## BsmtUnfSF 8.234e+00 4.787e+00 1.720 0.085614 .
## `1stFlrSF` 3.942e+01 5.888e+00 6.694 3.12e-11 ***
## `2ndFlrSF` 4.666e+01 4.023e+00 11.597 < 2e-16 ***
## BsmtFullBath 7.999e+03 2.272e+03 3.521 0.000444 ***
## FullBath 3.810e+03 2.390e+03 1.594 0.111123
## BedroomAbvGr -4.182e+03 1.593e+03 -2.625 0.008750 **
## KitchenAbvGr -1.647e+04 4.817e+03 -3.420 0.000644 ***
## KitchenQual -8.664e+03 1.395e+03 -6.209 6.98e-10 ***
## TotRmsAbvGrd 3.660e+03 1.128e+03 3.245 0.001202 **
## Functional 3.743e+03 9.228e+02 4.056 5.26e-05 ***
## Fireplaces 4.799e+03 1.600e+03 2.999 0.002757 **
## GarageYrBlt -9.056e+00 2.512e+00 -3.605 0.000323 ***
## GarageCars 1.358e+04 1.926e+03 7.054 2.72e-12 ***
## PavedDrive 2.917e+03 2.002e+03 1.457 0.145207
## WoodDeckSF 1.922e+01 7.274e+00 2.642 0.008328 **
## ScreenPorch 4.176e+01 1.558e+01 2.680 0.007458 **
## PoolArea -3.031e+01 2.156e+01 -1.406 0.159938
## YrSold -1.112e+03 6.369e+02 -1.745 0.081192 .
## SaleCondition 2.512e+03 7.929e+02 3.168 0.001566 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31570 on 1416 degrees of freedom
## Multiple R-squared: 0.8467, Adjusted R-squared: 0.8421
## F-statistic: 181.9 on 43 and 1416 DF, p-value: < 2.2e-16
The model explains 84.67% of the variance in SalePrice (multiple R-squared = 0.8467). Since the p-value of the F-statistic is far below 0.05, the model as a whole is statistically significant.
par(mfrow=c(2,2))
# plotting the step model
plot(mf)
The residuals are approximately normally distributed, and there is no obvious heteroscedasticity or pattern in the residuals, so the assumptions of the multiple regression model are reasonably satisfied.
How about some predictions?
Predicted <- predict(mf, test)
kaggle_results <- data.frame(Id = test$Id, SalePrice=Predicted)
write.csv(kaggle_results,"kaggle_results.csv",row.names = FALSE)
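Kaggle scores this competition on the RMSE between the log of the predicted price and the log of the observed price, so a rough in-sample version of that metric (a sketch; the pmax guard is there because a linear model can in principle produce non-positive fitted values) is:
# in-sample RMSE on the log scale, approximating the competition metric
sqrt(mean((log(train$SalePrice) - log(pmax(fitted(mf), 1)))^2))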
info <- c("theoracley", 0.18542)
names(info) <- c("Username", "Score")
kable(info, col.names = "Kaggle") %>%
kable_styling(full_width = F)
| | Kaggle |
|---|---|
| Username | theoracley |
| Score | 0.18542 |
Video link: https://youtu.be/KD21iznAR9g