Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu =\sigma =\frac {N+1}{2}\).
Probability.
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
#initial setup
set.seed(123)
N = 17 #My number
#Variable X
X <- runif(10000, min = 1, max = N)
#summary X
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.001 5.046 8.913 8.961 12.894 16.999
#Variable Y
mean = (N + 1)/2
sd = (N + 1)/2
Y <- rnorm(10000, sd= sd , mean= mean)
#summary Y
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.608 3.012 8.960 9.018 15.267 43.630
## [1] 8.913082
## [1] 3.011775
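The two values printed above are the small x and small y used in the rest of the section, but the code that produced them is not shown. A minimal sketch of how they could be computed, assuming x is the sample median of X and y is the first quartile of Y as the problem statement says (the `unname()` call is only there to drop the `25%` label):

```r
set.seed(123)
N <- 17
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)

# Small x: median of X
x <- median(X)
# Small y: first quartile (25th percentile) of Y
y <- unname(quantile(Y, probs = 0.25))

x
y
```

About a quarter of the Y values fall below y and half of the X values fall below x, which is what makes the marginal totals in the later table come out to 0.25/0.75 and 0.50/0.50.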
#P(X>x | X>y) = P(X>x and X>y) / P(X>y)
a1 <- length(which(X>x & X>y))
a2 <- length(which(X>y))
a1/a2
## [1] 0.5701254
#P(X<x | X>y) = P(X<x and X>y) / P(X>y)
c1 <- length(which(X < x & X > y))
c2 <- length(which(X > y))
c1/c2
## [1] 0.4298746
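Only probabilities a and c appear above. The sketch below recomputes both with `mean()` on logical vectors, which is equivalent to the `length(which(...))` ratios, and adds a candidate for b. Treating b as the joint probability P(X>x, Y>y) is an assumption, since the section does not state it; the names `p_a`, `p_b`, and `p_c` are illustrative.

```r
set.seed(123)
N <- 17
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, probs = 0.25))

# a. P(X > x | X > y) = P(X > x and X > y) / P(X > y)
p_a <- mean(X > x & X > y) / mean(X > y)

# b. (assumed) P(X > x, Y > y): the joint probability of both events
p_b <- mean(X > x & Y > y)

# c. P(X < x | X > y) = P(X < x and X > y) / P(X > y)
p_c <- mean(X < x & X > y) / mean(X > y)

c(a = p_a, b = p_b, c = p_c)
```

Probabilities a and c condition on the same event, X > y, and the events X > x and X < x are complementary, so the two values sum to 1.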
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
#create matrix
matrix <-matrix( c(length(which(X>x & Y<y)),
length(which(X>x & Y>y)),
length(which(X<x & Y<y)),
length(which(X<x & Y>y))),
nrow = 2,ncol = 2)
#Matrix
matrix<-as.data.frame(matrix)
#Row & Column Names
names(matrix) <- c("X > x","X < x")
row.names(matrix) <- c("Y < y","Y > y")
#Matrix Table
matrix_table <-matrix
matrix_table
## X > x X < x
## Y < y 1244 1256
## Y > y 3756 3744
#Row & Column Sums
matrix<-cbind(matrix,"Total"=rowSums(matrix))
matrix <- rbind(matrix,"Total"=colSums(matrix))
#Convert to Probabilities
joint_prob_matrix<-round(matrix/matrix[3,3],2)
#joint probability table
kable(joint_prob_matrix)

| | X > x | X < x | Total |
|---|---|---|---|
| Y < y | 0.12 | 0.13 | 0.25 |
| Y > y | 0.38 | 0.37 | 0.75 |
| Total | 0.50 | 0.50 | 1.00 |
# Check whether P(X>x and Y>y) = P(X>x)P(Y>y)
# Product of the marginals P(X>x)P(Y>y), to be compared with the joint entry joint_prob_matrix[2,1]
LHS= round(joint_prob_matrix[3,1]*joint_prob_matrix[2,3],2)
LHS
## [1] 0.38
## [1] 0.38
## [1] TRUE
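Both sides equal 0.38 after rounding to two decimals, so P(X>x and Y>y)=P(X>x)P(Y>y) holds to that precision. This is consistent with X and Y being independent, as expected for two variables that were generated separately.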
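Rounding to two decimals can hide small differences, so it may help to repeat the comparison on the unrounded proportions. The sketch below rebuilds the 2x2 count table from the values printed earlier (1244, 1256, 3756, 3744); the object names are illustrative.

```r
# Counts copied from the table above (rows: Y < y, Y > y; columns: X > x, X < x)
counts <- matrix(c(1244, 3756, 1256, 3744), nrow = 2,
                 dimnames = list(c("Y < y", "Y > y"), c("X > x", "X < x")))
n <- sum(counts)

joint   <- counts["Y > y", "X > x"] / n      # P(X > x and Y > y)
p_x_gt  <- sum(counts[, "X > x"]) / n        # marginal P(X > x)
p_y_gt  <- sum(counts["Y > y", ]) / n        # marginal P(Y > y)
product <- p_x_gt * p_y_gt                   # P(X > x) P(Y > y)

c(joint = joint, product = product, difference = joint - product)
```

The joint proportion is 0.3756 and the product of the marginals is 0.375, a difference of 0.0006, which is the kind of gap sampling noise alone produces with 10,000 draws.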
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
##
## Fisher's Exact Test for Count Data
##
## data: matrix_table
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9008857 1.0819850
## sample estimates:
## odds ratio
## 0.987281
Fisher's exact test gives a p-value of 0.7995, so the null hypothesis of independence is not rejected.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix_table
## X-squared = 0.064533, df = 1, p-value = 0.7995
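The chi-squared p-value is 0.7995, identical to four decimals to the Fisher result, so neither test rejects independence. Fisher's exact test is usually applied to small samples, but it is valid for all sample sizes. The chi-squared test relies on a large-sample approximation, whereas Fisher's test computes the p-value exactly. Hence Fisher's exact test is the most appropriate, although with 10,000 observations and every cell count above 1,000 the chi-squared approximation is also accurate, as the matching p-values show.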
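The calls behind the two outputs above are not shown. A minimal sketch of how they could be reproduced from the printed counts, using the base R functions `fisher.test()` and `chisq.test()`; the object names are illustrative:

```r
# Counts copied from matrix_table (rows: Y < y, Y > y; columns: X > x, X < x)
counts <- matrix(c(1244, 3756, 1256, 3744), nrow = 2,
                 dimnames = list(c("Y < y", "Y > y"), c("X > x", "X < x")))

# Exact test of independence for a 2x2 table
fisher_res <- fisher.test(counts)

# Large-sample test; Yates' continuity correction is the default for 2x2 tables
chisq_res <- chisq.test(counts)

fisher_res$p.value
chisq_res$statistic
chisq_res$p.value

# Expected counts under independence; the approximation is usually
# considered safe when all of them are at least 5
chisq_res$expected
```

Here every expected count is 1,250 or 3,750, far above the usual threshold, which is why the approximate and exact p-values agree.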
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Independent Variables
LotArea : Lot size in square feet
GarageArea : Size of garage in square feet
YearBuilt : Original construction date
OverallQual : Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
Dependent Variable:
SalePrice : Sale Price of the House
train%>%
dplyr::select(LotArea, SalePrice, GarageArea,YearBuilt,OverallQual)%>%
mutate(LotArea = log(LotArea),
SalePrice = log(SalePrice),
GarageArea = log(GarageArea),
YearBuilt = log(YearBuilt),
OverallQual = log(OverallQual)) %>%
pairs(main = 'Scatterplot matrix LotArea, SalePrice,
\nGarageArea, YearBuilt, OverallQual', cex.main=0.7,col=rgb(0,0.5,0.5,0.7),
panel = panel.smooth, cex = 0.7, cex.labels = 1.5, font.labels = 2)
train %>%
dplyr::select(LotArea, SalePrice, GarageArea,YearBuilt,OverallQual)%>%
cor() %>%
corrplot(method ="color",order = "hclust", addrect = 3, number.cex = 1, sig.level = 0.20,
addCoef.col = "black", # Add coefficient of correlation
tl.srt = 90, # Text label color and rotation
# Combine with significance
diag = TRUE)
Hypotheses
\(H_0\): the true correlation between each pair of variables is 0
\(H_A\): the true correlation between each pair of variables is not 0
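Each hypothesis can be tested with `cor.test()`, whose `conf.level` argument sets the width of the reported interval. The sketch below uses simulated data as a stand-in for the Kaggle training set; `lot_area` and `sale_price` are hypothetical vectors, not the real columns.

```r
set.seed(1)

# Hypothetical stand-in data (NOT the Kaggle training set)
n <- 200
lot_area   <- rlnorm(n, meanlog = 9, sdlog = 0.5)
sale_price <- 50000 + 5 * lot_area + rnorm(n, sd = 40000)

# Pearson correlation test with an 80% confidence interval
res <- cor.test(lot_area, sale_price, method = "pearson", conf.level = 0.80)

res$estimate   # sample correlation
res$conf.int   # 80% confidence interval for the true correlation
res$p.value    # p-value for H0: true correlation = 0
```

With the real data the same call would take `train$LotArea` and `train$SalePrice` as its first two arguments.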
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
##
## Pearson's product-moment correlation
##
## data: train$GarageArea and train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
The 80% confidence interval for the correlation between LotArea and SalePrice is (0.2323, 0.2948), around a weak positive correlation of 0.2638.
The 80% confidence interval for the correlation between YearBuilt and SalePrice is (0.4981, 0.5469), around a moderate positive correlation of 0.5229.
The 80% confidence interval for the correlation between GarageArea and SalePrice is (0.6025, 0.6435), around a strong positive correlation of 0.6234.
The 80% confidence interval for the correlation between OverallQual and SalePrice is (0.7781, 0.8032), around the strongest positive correlation, 0.7910.
None of the intervals contains 0 and every p-value is below 2.2e-16, so each null hypothesis is rejected. Only four tests were run, and their p-values are small enough to remain significant after any standard multiple-comparison correction, so familywise error is not a serious concern here.
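The familywise concern can be made concrete with a short calculation. The sketch below assumes a per-test level of 0.20 (matching the 80% intervals) and, for the first formula only, that the four tests are independent, which is a simplification because all four share SalePrice. The reported p-values are upper bounds (`< 2.2e-16`), so the adjusted values are upper bounds too.

```r
alpha <- 0.20   # assumed per-test significance level (80% intervals)
m     <- 4      # number of pairwise tests reported above

# Probability of at least one false rejection if all nulls were true
# and the tests were independent (a simplifying assumption)
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni adjustment applied to the reported upper bounds
p_reported <- rep(2.2e-16, m)
p_bonf <- p.adjust(p_reported, method = "bonferroni")
p_bonf
all(p_bonf < alpha)
```

The uncorrected familywise rate at this level would be about 0.59, which is high, but the Bonferroni-adjusted p-values are still far below 0.20, so the conclusions do not change.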
Create correlation matrix with LotArea, SalePrice, GarageArea,YearBuilt,OverallQual
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea 1.00000000 0.2638434 0.1804028 0.01422765 0.1058057
## SalePrice 0.26384335 1.0000000 0.6234314 0.52289733 0.7909816
## GarageArea 0.18040276 0.6234314 1.0000000 0.47895382 0.5620218
## YearBuilt 0.01422765 0.5228973 0.4789538 1.00000000 0.5723228
## OverallQual 0.10580574 0.7909816 0.5620218 0.57232277 1.0000000
Invert the correlation matrix to obtain the precision matrix.
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea 1.12607 -0.52436 -0.09763 0.15516 0.26168
## SalePrice -0.52436 3.30933 -0.72380 -0.21230 -2.03385
## GarageArea -0.09763 -0.72380 1.74631 -0.33966 -0.20423
## YearBuilt 0.15516 -0.21230 -0.33966 1.59941 -0.57298
## OverallQual 0.26168 -2.03385 -0.20423 -0.57298 3.02376
Multiply the correlation matrix by the precision matrix. Since \(Precision = Correlation^{-1}\), the product \(Correlation \times Precision\) should equal the identity matrix \(I\).
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea 1 0 0 0 0
## SalePrice 0 1 0 0 0
## GarageArea 0 0 1 0 0
## YearBuilt 0 0 0 1 0
## OverallQual 0 0 0 0 1
Multiply the precision matrix by the correlation matrix; the product is again \(I\).
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea 1 0 0 0 0
## SalePrice 0 1 0 0 0
## GarageArea 0 0 1 0 0
## YearBuilt 0 0 0 1 0
## OverallQual 0 0 0 0 1
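The code for the inversion and the two products is not shown. A minimal sketch using base R's `solve()` and `%*%`, with the correlation values typed in from the printed matrix above (so they carry its rounding); `cor_mat` and `prec_mat` are illustrative names.

```r
vars <- c("LotArea", "SalePrice", "GarageArea", "YearBuilt", "OverallQual")

# Correlation matrix copied from the printed output above
cor_mat <- matrix(c(
  1.00000000, 0.26384335, 0.18040276, 0.01422765, 0.10580574,
  0.26384335, 1.00000000, 0.62343140, 0.52289733, 0.79098160,
  0.18040276, 0.62343140, 1.00000000, 0.47895382, 0.56202180,
  0.01422765, 0.52289733, 0.47895382, 1.00000000, 0.57232277,
  0.10580574, 0.79098160, 0.56202180, 0.57232277, 1.00000000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Precision matrix: the inverse of the correlation matrix
prec_mat <- solve(cor_mat)
round(prec_mat, 5)

# Both products should be the identity, up to floating-point error
round(cor_mat %*% prec_mat, 10)
round(prec_mat %*% cor_mat, 10)

# The diagonal of the precision matrix holds the variance inflation factors
diag(prec_mat)
```

The largest diagonal entries belong to SalePrice and OverallQual, the pair with the strongest correlation in the matrix.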
#matrix factorization (From Assignment 2)
LU_factorization <- function(A) {
# Check whether matrix is square or not
if (dim(A)[1]!=dim(A)[2]) {
return(NA)
}
U <- A
n <- dim(A)[1]
L <- diag(n)
if (n==1) {
return(list(L,U))
}
for(i in 2:n) {
for(j in 1:(i-1)) {
multiplier <- -U[i,j] / U[j,j]
U[i, ] <- multiplier * U[j, ] + U[i, ]
L[i,j] <- -multiplier
}
}
return(list(L,U))
}
LU_factorization(mat)
## [[1]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.0000000 0.0000000 0.000000 0
## [2,] 0.26384335 1.0000000 0.0000000 0.000000 0
## [3,] 0.18040276 0.6189183 1.0000000 0.000000 0
## [4,] 0.01422765 0.5579868 0.2537876 1.000000 0
## [5,] 0.10580574 0.8201595 0.1156332 0.189492 1
##
## [[2]]
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea 1 0.2638434 0.1804028 0.01422765 0.10580574
## SalePrice 0 0.9303867 0.5758334 0.51914346 0.76306546
## GarageArea 0 0.0000000 0.6110610 0.15507971 0.07065891
## YearBuilt 0 0.0000000 0.0000000 0.67076508 0.12710462
## OverallQual 0 0.0000000 0.0000000 0.00000000 0.33071396
## LotArea SalePrice GarageArea YearBuilt OverallQual
## LotArea TRUE TRUE TRUE TRUE TRUE
## SalePrice TRUE TRUE TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE TRUE TRUE
## YearBuilt TRUE TRUE TRUE TRUE TRUE
## OverallQual TRUE TRUE TRUE TRUE TRUE
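Every entry of the product of L and U matches the corresponding entry of the correlation matrix, which verifies the factorization.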
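An element-wise `==` comparison of floating-point matrices can report `FALSE` for differences that are only rounding error, so a tolerance-based check with `all.equal()` is a more robust way to confirm that L times U reproduces the input. The sketch below uses a compact Doolittle elimination without pivoting; `lu_doolittle` is a hypothetical helper written for this illustration, not the function defined above, and it is applied to the leading 3x3 block of the printed correlation matrix.

```r
# Hypothetical helper: Doolittle LU without pivoting (assumes nonzero pivots)
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # multiplier stored in L
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]   # eliminate entry (i, j)
    }
  }
  list(L = L, U = U)
}

# Leading 3x3 block (LotArea, SalePrice, GarageArea) of the correlation matrix
A <- matrix(c(
  1.00000000, 0.26384335, 0.18040276,
  0.26384335, 1.00000000, 0.62343140,
  0.18040276, 0.62343140, 1.00000000
), nrow = 3, byrow = TRUE)

f <- lu_doolittle(A)
f$L
f$U

# Tolerance-based verification that L %*% U recovers A
isTRUE(all.equal(f$L %*% f$U, A))
```

The factors of a leading block are the leading blocks of the full factors, so the entries here match the top-left corners of the 5x5 L and U printed above.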
It is reasonable to expect SalePrice to correlate with GarageArea: buyers value a larger garage, so houses with more garage area tend to sell for a higher price.
For this analysis I assign GarageArea to the variable \(X\).
Get some basic statistics about the GarageArea variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
## vars n mean sd median trimmed mad min max range skew
## X1 1 1460 472.98 213.8 480 469.81 177.91 0 1418 1418 0.18
## kurtosis se
## X1 0.9 5.6
Evaluating a few plots.
par(mfrow=c(1,2))
hist(train$GarageArea, breaks=40, border=F , col=rgb(0.97,0.51,0.47,1) , xlab="Distribution of GarageArea" , main="Histogram GarageArea")
boxplot(train$GarageArea, xlab="GarageArea" , col=rgb(0.97,0.51,0.47,1), main="Box plot of GarageArea")
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
## rate
## 0.002114254
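The rate above comes from `fitdistr()` in the MASS package, though the call itself is not shown. For the exponential distribution the maximum-likelihood estimate has a closed form, \(\hat{\lambda} = 1/\bar{x}\), which is why the reported rate matches the reciprocal of the GarageArea mean (1/472.98). A minimal sketch on simulated data; `garage_area` is a hypothetical stand-in for `train$GarageArea`.

```r
library(MASS)

set.seed(42)
# Hypothetical stand-in for train$GarageArea (NOT the real data)
garage_area <- rexp(1460, rate = 1 / 473)

# Maximum-likelihood fit of an exponential distribution
lambda <- fitdistr(garage_area, densfun = "exponential")
lambda$estimate

# Closed-form MLE for comparison: reciprocal of the sample mean
1 / mean(garage_area)
```

`fitdistr()` requires non-negative values for the exponential family, which is the reason the assignment asks for the variable to be shifted above zero when necessary.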
set.seed(100)
pdf.dist <- rexp(1000,lambda$estimate)
hist(pdf.dist, freq = FALSE, breaks = 200,
main ="Fitted Exponential PDF with GarageArea",
xlab = "PDF GarageArea",
xlim = c(1, quantile(pdf.dist, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
Plot a histogram and compare it with a histogram of your original variable.
set.seed(100)
# X holds GarageArea, so the sample is named after that variable
samp.GarageArea <- sample(X, 1000, replace=TRUE, prob=NULL)
exp.train <- data.frame(Expo=rexp(1000, lambda$estimate)) %>%
mutate(X = samp.GarageArea)
plotdist(exp.train$X, histo = TRUE, demp = TRUE)
hist(train$GarageArea, freq = FALSE, breaks = 100,
main ="Comparison with original GarageArea",
xlab="Original GarageArea",
xlim = c(1, quantile(X, 0.99)))
curve(dexp(x, rate = lambda$estimate), col = "red", add = TRUE)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
pdf.5th <- qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.5th <- round(pdf.5th, 4)
pdf.95th <- qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)
pdf.95th <- round(pdf.95th, 4)
The 5th percentile of the fitted exponential distribution is 24.2607 and the 95th percentile is 1416.9219.
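The percentiles follow directly from the exponential CDF, \(F(q) = 1 - e^{-\lambda q}\): setting \(F(q) = p\) and solving gives \(q = -\ln(1 - p)/\lambda\). The sketch below checks this closed form against `qexp()`, using the rate as printed earlier (0.002114254), so the last digits can differ slightly from values computed with the unrounded estimate.

```r
rate <- 0.002114254   # fitted rate, as printed above
p <- c(0.05, 0.95)

# Closed-form inverse of the exponential CDF
closed_form <- -log(1 - p) / rate

# Built-in quantile function for comparison
built_in <- qexp(p, rate = rate)

round(closed_form, 4)
isTRUE(all.equal(closed_form, built_in))
```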
Also generate a 95% confidence interval from the empirical data, assuming normality.
## upper mean lower
## 483.9563 472.9801 462.0040
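The code for the interval is not shown. The printed bounds are consistent with a t-based interval for the mean, \(\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}\); that reading is an inference from the numbers, not something the section states. A sketch using the summary statistics reported earlier (n = 1460, mean 472.9801, sd 213.8, the last of which is rounded):

```r
n    <- 1460       # number of observations
xbar <- 472.9801   # sample mean of GarageArea, as printed above
s    <- 213.8      # sample standard deviation (rounded in the output)

se     <- s / sqrt(n)                    # standard error of the mean
margin <- qt(0.975, df = n - 1) * se     # half-width of the 95% interval

c(upper = xbar + margin, mean = xbar, lower = xbar - margin)
```

With this many observations the t quantile is almost identical to the normal quantile 1.96, so a z-based interval would differ only in the second decimal.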
Finally, provide the empirical 5th percentile and 95th percentile of the data.
## 5% 95%
## 0.0 850.1
We are 95% confident that the mean of GarageArea lies between 462.004 and 483.9563. The exponential distribution is not a good fit: its 5th and 95th percentiles (24.26 and 1416.92) are far from the empirical ones (0 and 850.1), and its mass is concentrated near zero, to the left of the center of the empirical data.
Missing Data map
Data Cleaning
sd<- ldply(train, function(x) sum(is.na(x)))
sd<-sd[sd[2] > 0,]
sd<-sd[order(-sd$V1),]
sd$percent <- (sd$V1/1460)*100
sd
## .id V1 percent
## 73 PoolQC 1453 99.52054795
## 75 MiscFeature 1406 96.30136986
## 7 Alley 1369 93.76712329
## 74 Fence 1179 80.75342466
## 58 FireplaceQu 690 47.26027397
## 4 LotFrontage 259 17.73972603
## 59 GarageType 81 5.54794521
## 60 GarageYrBlt 81 5.54794521
## 61 GarageFinish 81 5.54794521
## 64 GarageQual 81 5.54794521
## 65 GarageCond 81 5.54794521
## 33 BsmtExposure 38 2.60273973
## 36 BsmtFinType2 38 2.60273973
## 31 BsmtQual 37 2.53424658
## 32 BsmtCond 37 2.53424658
## 34 BsmtFinType1 37 2.53424658
## 26 MasVnrType 8 0.54794521
## 27 MasVnrArea 8 0.54794521
## 43 Electrical 1 0.06849315
sd_test<- ldply(test, function(x) sum(is.na(x)))
sd_test<-sd_test[sd_test[2] > 0,]
sd_test<-sd_test[order(-sd_test$V1),]
sd_test$percent <- (sd_test$V1/1459)*100
sd_test
## .id V1 percent
## 73 PoolQC 1456 99.7943797
## 75 MiscFeature 1408 96.5044551
## 7 Alley 1352 92.6662097
## 74 Fence 1169 80.1233722
## 58 FireplaceQu 730 50.0342700
## 4 LotFrontage 227 15.5586018
## 60 GarageYrBlt 78 5.3461275
## 61 GarageFinish 78 5.3461275
## 64 GarageQual 78 5.3461275
## 65 GarageCond 78 5.3461275
## 59 GarageType 76 5.2090473
## 32 BsmtCond 45 3.0843043
## 31 BsmtQual 44 3.0157642
## 33 BsmtExposure 44 3.0157642
## 34 BsmtFinType1 42 2.8786840
## 36 BsmtFinType2 42 2.8786840
## 26 MasVnrType 16 1.0966415
## 27 MasVnrArea 15 1.0281014
## 3 MSZoning 4 0.2741604
## 10 Utilities 2 0.1370802
## 48 BsmtFullBath 2 0.1370802
## 49 BsmtHalfBath 2 0.1370802
## 56 Functional 2 0.1370802
## 24 Exterior1st 1 0.0685401
## 25 Exterior2nd 1 0.0685401
## 35 BsmtFinSF1 1 0.0685401
## 37 BsmtFinSF2 1 0.0685401
## 38 BsmtUnfSF 1 0.0685401
## 39 TotalBsmtSF 1 0.0685401
## 54 KitchenQual 1 0.0685401
## 62 GarageCars 1 0.0685401
## 63 GarageArea 1 0.0685401
## 79 SaleType 1 0.0685401
The dataset contains many numeric and categorical variables with “NA” values. I remove the Alley, PoolQC, and MiscFeature columns because more than 90 percent of their values are missing, so replacing them with 0 does not seem useful. The data dictionary explains what “NA” means for the categorical variables. For example, “NA” in the variable “GarageQual” means that the house does not have a garage, and the same holds for most of the other variables. Hence the categorical variables are converted to integer codes and their “NA” values are set to 0, which acts as a “None” level. Missing LotFrontage values are replaced with the column mean, and the remaining numeric “NA” values are replaced with 0.
#Remove
train <- subset(train, select=-c( Alley,PoolQC, MiscFeature))
train$LotFrontage[which(is.na(train$LotFrontage))]<-mean(train$LotFrontage,na.rm=TRUE)
#For Numerical data
replace_to_Zero<- function(df){
df %>%
mutate_if(is.numeric, ~replace(., is.na(.), 0))
}
#For Categorical data
replace_to_No<- function(df){
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.integer)
df[is.na(df)] <- 0
df
}
#for Num
train <- train %>%
replace_to_Zero()
#for Categorical data
train <- train %>%
replace_to_No()
SalePrice.log <- log(train$SalePrice)
sample.log <- sample(SalePrice.log, replace = TRUE, prob = NULL)
sample.origin <- sample(train$SalePrice, replace = TRUE, prob = NULL)
d.train <- train %>%
cbind(., SalePrice.log) %>%
dplyr::select(-SalePrice)
# multiple regression
reg.lm <- lm(SalePrice.log ~ . , data = d.train)
summary(reg.lm)
##
## Call:
## lm(formula = SalePrice.log ~ ., data = d.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.74212 -0.06206 0.00212 0.06817 0.53219
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.2083972166 5.7688604675 3.503 0.000475 ***
## Id -0.0000095051 0.0000088662 -1.072 0.283882
## MSSubClass -0.0002414112 0.0001981414 -1.218 0.223288
## MSZoning -0.0178000641 0.0066119271 -2.692 0.007186 **
## LotFrontage -0.0004470160 0.0002183949 -2.047 0.040864 *
## LotArea 0.0000015760 0.0000004670 3.374 0.000760 ***
## Street 0.1842472114 0.0610158922 3.020 0.002577 **
## LotShape -0.0060512080 0.0028840796 -2.098 0.036074 *
## LandContour 0.0107067706 0.0058767097 1.822 0.068686 .
## Utilities -0.1782016125 0.1451586319 -1.228 0.219793
## LotConfig -0.0016406429 0.0023873087 -0.687 0.492050
## LandSlope 0.0341146928 0.0166937085 2.044 0.041185 *
## Neighborhood 0.0008361481 0.0006862253 1.218 0.223251
## Condition1 0.0013589355 0.0044281870 0.307 0.758979
## Condition2 -0.0439037080 0.0146295282 -3.001 0.002739 **
## BldgType -0.0120771346 0.0065445066 -1.845 0.065195 .
## HouseStyle -0.0041547071 0.0028563076 -1.455 0.146014
## OverallQual 0.0701835521 0.0051757042 13.560 < 2e-16 ***
## OverallCond 0.0408832362 0.0045439157 8.997 < 2e-16 ***
## YearBuilt 0.0015439327 0.0003253952 4.745 2.30e-06 ***
## YearRemodAdd 0.0006623128 0.0002911466 2.275 0.023068 *
## RoofStyle 0.0055266097 0.0049023257 1.127 0.259792
## RoofMatl 0.0095921615 0.0065624011 1.462 0.144055
## Exterior1st -0.0039671839 0.0022815506 -1.739 0.082290 .
## Exterior2nd 0.0034076271 0.0020612326 1.653 0.098517 .
## MasVnrType 0.0047209422 0.0064176729 0.736 0.462089
## MasVnrArea 0.0000151897 0.0000261977 0.580 0.562136
## ExterQual -0.0071207196 0.0085836481 -0.830 0.406926
## ExterCond 0.0105575935 0.0054786005 1.927 0.054177 .
## Foundation 0.0123837867 0.0073374863 1.688 0.091686 .
## BsmtQual -0.0130525101 0.0059262427 -2.202 0.027795 *
## BsmtCond 0.0123170567 0.0057271156 2.151 0.031676 *
## BsmtExposure -0.0073255178 0.0038300312 -1.913 0.055999 .
## BsmtFinType1 -0.0059318791 0.0027200058 -2.181 0.029364 *
## BsmtFinSF1 0.0000397429 0.0000225967 1.759 0.078833 .
## BsmtFinType2 0.0155493247 0.0049075899 3.168 0.001566 **
## BsmtFinSF2 0.0001269949 0.0000345002 3.681 0.000241 ***
## BsmtUnfSF 0.0000296823 0.0000220592 1.346 0.178661
## TotalBsmtSF NA NA NA NA
## Heating -0.0039834371 0.0140428214 -0.284 0.776711
## HeatingQC -0.0076355112 0.0026930988 -2.835 0.004646 **
## CentralAir 0.0710187683 0.0195520808 3.632 0.000291 ***
## Electrical -0.0012517406 0.0039700408 -0.315 0.752584
## X1stFlrSF 0.0002121138 0.0000271530 7.812 1.11e-14 ***
## X2ndFlrSF 0.0001676512 0.0000209810 7.991 2.81e-15 ***
## LowQualFinSF 0.0001466739 0.0000813488 1.803 0.071602 .
## GrLivArea NA NA NA NA
## BsmtFullBath 0.0583586404 0.0106549359 5.477 5.13e-08 ***
## BsmtHalfBath 0.0233541822 0.0167569879 1.394 0.163633
## FullBath 0.0364778162 0.0117562121 3.103 0.001955 **
## HalfBath 0.0171432577 0.0110709067 1.548 0.121731
## BedroomAbvGr 0.0075243129 0.0072780510 1.034 0.301393
## KitchenAbvGr -0.0344663360 0.0220209007 -1.565 0.117773
## KitchenQual -0.0238857397 0.0063120664 -3.784 0.000161 ***
## TotRmsAbvGrd 0.0134558783 0.0051068668 2.635 0.008511 **
## Functional 0.0169482336 0.0041261378 4.108 4.23e-05 ***
## Fireplaces 0.0366280862 0.0114581717 3.197 0.001422 **
## FireplaceQu 0.0000093028 0.0034599476 0.003 0.997855
## GarageType -0.0040763241 0.0027950382 -1.458 0.144953
## GarageYrBlt 0.0000068324 0.0000255944 0.267 0.789547
## GarageFinish -0.0077657648 0.0064321899 -1.207 0.227512
## GarageCars 0.0633277607 0.0123105224 5.144 3.07e-07 ***
## GarageArea 0.0000130978 0.0000407236 0.322 0.747784
## GarageQual -0.0011246516 0.0077537784 -0.145 0.884696
## GarageCond 0.0094858868 0.0087831041 1.080 0.280324
## PavedDrive 0.0239891837 0.0090207865 2.659 0.007920 **
## WoodDeckSF 0.0001041250 0.0000327857 3.176 0.001527 **
## OpenPorchSF -0.0000341191 0.0000624097 -0.547 0.584676
## EnclosedPorch 0.0001528095 0.0000681650 2.242 0.025135 *
## X3SsnPorch 0.0001822153 0.0001269290 1.436 0.151351
## ScreenPorch 0.0003245969 0.0000700634 4.633 3.95e-06 ***
## PoolArea -0.0002682440 0.0000968468 -2.770 0.005684 **
## Fence -0.0035283660 0.0038853603 -0.908 0.363974
## MiscVal -0.0000009731 0.0000075798 -0.128 0.897870
## MoSold 0.0001120425 0.0013910234 0.081 0.935814
## YrSold -0.0071296102 0.0028511783 -2.501 0.012514 *
## SaleType -0.0015706813 0.0025039485 -0.627 0.530578
## SaleCondition 0.0225255707 0.0036071087 6.245 5.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1384 on 1384 degrees of freedom
## Multiple R-squared: 0.8861, Adjusted R-squared: 0.8799
## F-statistic: 143.5 on 75 and 1384 DF, p-value: < 2.2e-16
new.lm <- lm(SalePrice.log~LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd +
BsmtFinSF2 + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd +
Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch + ScreenPorch + PoolArea +
YrSold + MSZoning + Street + LotShape + LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond +
BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual + Functional + PavedDrive + SaleCondition , data = train)
summary(new.lm)
##
## Call:
## lm(formula = SalePrice.log ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF2 + X1stFlrSF +
## X2ndFlrSF + LowQualFinSF + BsmtFullBath + FullBath + TotRmsAbvGrd +
## Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + EnclosedPorch +
## ScreenPorch + PoolArea + YrSold + MSZoning + Street + LotShape +
## LandSlope + Condition2 + ExterCond + BsmtQual + BsmtCond +
## BsmtFinType1 + BsmtFinType2 + HeatingQC + CentralAir + KitchenQual +
## Functional + PavedDrive + SaleCondition, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84550 -0.06784 0.00439 0.07533 0.60612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.0069339606 5.6808114706 3.346 0.000842 ***
## LotFrontage 0.0001179172 0.0001975262 0.597 0.550623
## LotArea 0.0000020344 0.0000004583 4.439 9.75e-06 ***
## OverallQual 0.0743382480 0.0049141906 15.127 < 2e-16 ***
## OverallCond 0.0417037788 0.0044834877 9.302 < 2e-16 ***
## YearBuilt 0.0019908443 0.0002705939 7.357 3.16e-13 ***
## YearRemodAdd 0.0005886496 0.0002817191 2.089 0.036841 *
## BsmtFinSF2 0.0000977091 0.0000308109 3.171 0.001550 **
## X1stFlrSF 0.0002616408 0.0000192414 13.598 < 2e-16 ***
## X2ndFlrSF 0.0001717200 0.0000166824 10.293 < 2e-16 ***
## LowQualFinSF 0.0001418027 0.0000802028 1.768 0.077267 .
## BsmtFullBath 0.0547207151 0.0086031341 6.361 2.70e-10 ***
## FullBath 0.0207902498 0.0103767534 2.004 0.045309 *
## TotRmsAbvGrd 0.0144143988 0.0042591409 3.384 0.000733 ***
## Fireplaces 0.0370064098 0.0070765307 5.229 1.95e-07 ***
## GarageYrBlt 0.0000073516 0.0000112393 0.654 0.513156
## GarageCars 0.0667104795 0.0084912290 7.856 7.74e-15 ***
## WoodDeckSF 0.0001323260 0.0000323033 4.096 4.43e-05 ***
## EnclosedPorch 0.0001758041 0.0000680828 2.582 0.009916 **
## ScreenPorch 0.0003691134 0.0000695941 5.304 1.31e-07 ***
## PoolArea -0.0003342638 0.0000962343 -3.473 0.000529 ***
## YrSold -0.0070332251 0.0028187330 -2.495 0.012702 *
## MSZoning -0.0225238352 0.0063263677 -3.560 0.000383 ***
## Street 0.2012942557 0.0604028949 3.333 0.000883 ***
## LotShape -0.0086064548 0.0027859848 -3.089 0.002046 **
## LandSlope 0.0262383954 0.0154297574 1.701 0.089254 .
## Condition2 -0.0461749383 0.0144059001 -3.205 0.001379 **
## ExterCond 0.0086363514 0.0054347679 1.589 0.112262
## BsmtQual -0.0130464019 0.0055255188 -2.361 0.018354 *
## BsmtCond 0.0125105370 0.0055850943 2.240 0.025246 *
## BsmtFinType1 -0.0076824225 0.0024166360 -3.179 0.001510 **
## BsmtFinType2 0.0160077376 0.0046143394 3.469 0.000538 ***
## HeatingQC -0.0097685850 0.0025747749 -3.794 0.000154 ***
## CentralAir 0.0815765515 0.0181891485 4.485 7.88e-06 ***
## KitchenQual -0.0273717358 0.0058529588 -4.677 3.19e-06 ***
## Functional 0.0187185982 0.0040511486 4.621 4.17e-06 ***
## PavedDrive 0.0252290890 0.0089792816 2.810 0.005027 **
## SaleCondition 0.0239480877 0.0035286818 6.787 1.68e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1406 on 1422 degrees of freedom
## Multiple R-squared: 0.8792, Adjusted R-squared: 0.876
## F-statistic: 279.6 on 37 and 1422 DF, p-value: < 2.2e-16
I optimized the model using the stepAIC method, which simplifies the model without much impact on its performance.
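The stepwise call is not shown. A minimal sketch of backward selection with `MASS::stepAIC()`, run on the built-in `mtcars` data as a stand-in for the housing data; `full_fit` and `step_fit` are illustrative names, and the direction used in the original analysis is an assumption.

```r
library(MASS)

# Stand-in example on a built-in dataset (NOT the Kaggle data)
full_fit <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop terms while doing so lowers the AIC
step_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

formula(step_fit)
c(full = AIC(full_fit), stepwise = AIC(step_fit))
c(full = length(coef(full_fit)), stepwise = length(coef(step_fit)))
```

The returned object is an ordinary `lm` fit, so `summary()` and `predict()` work on it directly. In the output that follows, the selected model keeps 52 predictors and its adjusted R-squared (0.8808) is slightly above that of the full model (0.8799).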
##
## Call:
## lm(formula = SalePrice.log ~ MSZoning + LotFrontage + LotArea +
## Street + LotShape + LandContour + LandSlope + Condition2 +
## BldgType + HouseStyle + OverallQual + OverallCond + YearBuilt +
## YearRemodAdd + RoofStyle + RoofMatl + Exterior1st + Exterior2nd +
## ExterCond + Foundation + BsmtQual + BsmtCond + BsmtExposure +
## BsmtFinType1 + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + BsmtUnfSF +
## HeatingQC + CentralAir + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + KitchenAbvGr + KitchenQual +
## TotRmsAbvGrd + Functional + Fireplaces + GarageType + GarageCars +
## GarageCond + PavedDrive + WoodDeckSF + EnclosedPorch + X3SsnPorch +
## ScreenPorch + PoolArea + YrSold + SaleCondition, data = d.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77137 -0.06316 0.00248 0.06726 0.54763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.0941636102 5.6103729090 3.403 0.000684 ***
## MSZoning -0.0202002556 0.0062767755 -3.218 0.001319 **
## LotFrontage -0.0003989554 0.0002131635 -1.872 0.061470 .
## LotArea 0.0000015935 0.0000004561 3.494 0.000491 ***
## Street 0.1729126102 0.0596569328 2.898 0.003808 **
## LotShape -0.0070403716 0.0027731244 -2.539 0.011231 *
## LandContour 0.0108511603 0.0057816613 1.877 0.060749 .
## LandSlope 0.0330525183 0.0164527948 2.009 0.044735 *
## Condition2 -0.0435204826 0.0142228827 -3.060 0.002256 **
## BldgType -0.0184996556 0.0038672272 -4.784 0.0000019019428169 ***
## HouseStyle -0.0055384320 0.0025262182 -2.192 0.028515 *
## OverallQual 0.0712969651 0.0049926848 14.280 < 2e-16 ***
## OverallCond 0.0406912689 0.0044738316 9.095 < 2e-16 ***
## YearBuilt 0.0016742999 0.0003108430 5.386 0.0000000841904675 ***
## YearRemodAdd 0.0006857583 0.0002827847 2.425 0.015433 *
## RoofStyle 0.0068060432 0.0047111131 1.445 0.148771
## RoofMatl 0.0104214912 0.0064798098 1.608 0.107993
## Exterior1st -0.0037813668 0.0022345682 -1.692 0.090827 .
## Exterior2nd 0.0032280602 0.0020198385 1.598 0.110228
## ExterCond 0.0101450702 0.0053484051 1.897 0.058054 .
## Foundation 0.0113127482 0.0072015230 1.571 0.116435
## BsmtQual -0.0150420955 0.0056579443 -2.659 0.007936 **
## BsmtCond 0.0113908727 0.0055979701 2.035 0.042056 *
## BsmtExposure -0.0062319932 0.0037177145 -1.676 0.093902 .
## BsmtFinType1 -0.0062546458 0.0026775348 -2.336 0.019633 *
## BsmtFinSF1 0.0000508706 0.0000213454 2.383 0.017294 *
## BsmtFinType2 0.0153160896 0.0048139464 3.182 0.001497 **
## BsmtFinSF2 0.0001362318 0.0000336010 4.054 0.0000530160705969 ***
## BsmtUnfSF 0.0000372897 0.0000211227 1.765 0.077716 .
## HeatingQC -0.0078160552 0.0026035523 -3.002 0.002729 **
## CentralAir 0.0735189277 0.0180672864 4.069 0.0000498013295996 ***
## X1stFlrSF 0.0002105248 0.0000262289 8.026 0.0000000000000021 ***
## X2ndFlrSF 0.0001662925 0.0000197789 8.408 < 2e-16 ***
## LowQualFinSF 0.0001510518 0.0000797360 1.894 0.058377 .
## BsmtFullBath 0.0523058424 0.0100105579 5.225 0.0000002002971980 ***
## FullBath 0.0364845367 0.0113510722 3.214 0.001338 **
## HalfBath 0.0167502237 0.0108534722 1.543 0.122982
## KitchenAbvGr -0.0370844141 0.0209374752 -1.771 0.076744 .
## KitchenQual -0.0256771129 0.0057989140 -4.428 0.0000102503377568 ***
## TotRmsAbvGrd 0.0156700167 0.0045161976 3.470 0.000537 ***
## Functional 0.0171254851 0.0040536132 4.225 0.0000254590424833 ***
## Fireplaces 0.0358797990 0.0071020662 5.052 0.0000004944812149 ***
## GarageType -0.0048716103 0.0025412843 -1.917 0.055442 .
## GarageCars 0.0679560692 0.0081082167 8.381 < 2e-16 ***
## GarageCond 0.0073783311 0.0043205692 1.708 0.087909 .
## PavedDrive 0.0244141767 0.0089224749 2.736 0.006292 **
## WoodDeckSF 0.0001071295 0.0000321940 3.328 0.000899 ***
## EnclosedPorch 0.0001670790 0.0000671401 2.489 0.012943 *
## X3SsnPorch 0.0001902291 0.0001252604 1.519 0.129070
## ScreenPorch 0.0003199921 0.0000690929 4.631 0.0000039689302813 ***
## PoolArea -0.0002888713 0.0000952074 -3.034 0.002457 **
## YrSold -0.0068123313 0.0027797043 -2.451 0.014377 *
## SaleCondition 0.0223617466 0.0034782246 6.429 0.0000000001754091 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1379 on 1407 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8808
## F-statistic: 208.3 on 52 and 1407 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = SalePrice.log ~ ., data = step.lm$model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77137 -0.06316 0.00248 0.06726 0.54763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.0941636102 5.6103729090 3.403 0.000684 ***
## MSZoning -0.0202002556 0.0062767755 -3.218 0.001319 **
## LotFrontage -0.0003989554 0.0002131635 -1.872 0.061470 .
## LotArea 0.0000015935 0.0000004561 3.494 0.000491 ***
## Street 0.1729126102 0.0596569328 2.898 0.003808 **
## LotShape -0.0070403716 0.0027731244 -2.539 0.011231 *
## LandContour 0.0108511603 0.0057816613 1.877 0.060749 .
## LandSlope 0.0330525183 0.0164527948 2.009 0.044735 *
## Condition2 -0.0435204826 0.0142228827 -3.060 0.002256 **
## BldgType -0.0184996556 0.0038672272 -4.784 0.0000019019428169 ***
## HouseStyle -0.0055384320 0.0025262182 -2.192 0.028515 *
## OverallQual 0.0712969651 0.0049926848 14.280 < 2e-16 ***
## OverallCond 0.0406912689 0.0044738316 9.095 < 2e-16 ***
## YearBuilt 0.0016742999 0.0003108430 5.386 0.0000000841904675 ***
## YearRemodAdd 0.0006857583 0.0002827847 2.425 0.015433 *
## RoofStyle 0.0068060432 0.0047111131 1.445 0.148771
## RoofMatl 0.0104214912 0.0064798098 1.608 0.107993
## Exterior1st -0.0037813668 0.0022345682 -1.692 0.090827 .
## Exterior2nd 0.0032280602 0.0020198385 1.598 0.110228
## ExterCond 0.0101450702 0.0053484051 1.897 0.058054 .
## Foundation 0.0113127482 0.0072015230 1.571 0.116435
## BsmtQual -0.0150420955 0.0056579443 -2.659 0.007936 **
## BsmtCond 0.0113908727 0.0055979701 2.035 0.042056 *
## BsmtExposure -0.0062319932 0.0037177145 -1.676 0.093902 .
## BsmtFinType1 -0.0062546458 0.0026775348 -2.336 0.019633 *
## BsmtFinSF1 0.0000508706 0.0000213454 2.383 0.017294 *
## BsmtFinType2 0.0153160896 0.0048139464 3.182 0.001497 **
## BsmtFinSF2 0.0001362318 0.0000336010 4.054 0.0000530160705969 ***
## BsmtUnfSF 0.0000372897 0.0000211227 1.765 0.077716 .
## HeatingQC -0.0078160552 0.0026035523 -3.002 0.002729 **
## CentralAir 0.0735189277 0.0180672864 4.069 0.0000498013295996 ***
## X1stFlrSF 0.0002105248 0.0000262289 8.026 0.0000000000000021 ***
## X2ndFlrSF 0.0001662925 0.0000197789 8.408 < 2e-16 ***
## LowQualFinSF 0.0001510518 0.0000797360 1.894 0.058377 .
## BsmtFullBath 0.0523058424 0.0100105579 5.225 0.0000002002971980 ***
## FullBath 0.0364845367 0.0113510722 3.214 0.001338 **
## HalfBath 0.0167502237 0.0108534722 1.543 0.122982
## KitchenAbvGr -0.0370844141 0.0209374752 -1.771 0.076744 .
## KitchenQual -0.0256771129 0.0057989140 -4.428 0.0000102503377568 ***
## TotRmsAbvGrd 0.0156700167 0.0045161976 3.470 0.000537 ***
## Functional 0.0171254851 0.0040536132 4.225 0.0000254590424833 ***
## Fireplaces 0.0358797990 0.0071020662 5.052 0.0000004944812149 ***
## GarageType -0.0048716103 0.0025412843 -1.917 0.055442 .
## GarageCars 0.0679560692 0.0081082167 8.381 < 2e-16 ***
## GarageCond 0.0073783311 0.0043205692 1.708 0.087909 .
## PavedDrive 0.0244141767 0.0089224749 2.736 0.006292 **
## WoodDeckSF 0.0001071295 0.0000321940 3.328 0.000899 ***
## EnclosedPorch 0.0001670790 0.0000671401 2.489 0.012943 *
## X3SsnPorch 0.0001902291 0.0001252604 1.519 0.129070
## ScreenPorch 0.0003199921 0.0000690929 4.631 0.0000039689302813 ***
## PoolArea -0.0002888713 0.0000952074 -3.034 0.002457 **
## YrSold -0.0068123313 0.0027797043 -2.451 0.014377 *
## SaleCondition 0.0223617466 0.0034782246 6.429 0.0000000001754091 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1379 on 1407 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8808
## F-statistic: 208.3 on 52 and 1407 DF, p-value: < 2.2e-16
plot(final_lm$fitted.values, final_lm$residuals,
xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs. Residuals")
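abline(h=0, col='blue')
The residuals are centered on zero with no obvious trend across the fitted values, which is consistent with approximately normally distributed errors; a normal Q-Q plot of the residuals would check normality more directly.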
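A fitted-versus-residuals plot mainly shows trend and unequal variance, so normality is usually checked separately. The sketch below draws a normal Q-Q plot and runs a Shapiro-Wilk test on a small stand-in model fitted to `mtcars`; the model and object names are illustrative, and the same calls could be applied to the final model's residuals.

```r
# Stand-in model on a built-in dataset (NOT the housing model)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- rstandard(fit)   # standardized residuals

# Points close to the reference line suggest approximately normal residuals
qqnorm(res, main = "Normal Q-Q plot of standardized residuals")
qqline(res, col = "blue")

# Formal test; a small p-value indicates departure from normality
sw <- shapiro.test(res)
sw$p.value
```

`shapiro.test()` accepts between 3 and 5000 observations, which covers the 1,460 rows of the training data.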
#fresh import
test <- read.csv('https://raw.githubusercontent.com/Vinayak234/DATA605/master/test.csv')
test <- test %>%
replace_to_Zero()
test <- test %>%
replace_to_No()
# Building the prediction
pred <- predict(final_lm, newdata = test)
pred.exp <- sapply(pred, exp)
Id <- test$Id
SalePrice <- pred.exp
submission <- data.frame(Id, SalePrice)
head(submission)
## Id SalePrice
## 1 1461 114840.2
## 2 1462 152591.5
## 3 1463 168443.5
## 4 1464 197045.4
## 5 1465 183060.7
## 6 1466 172234.3
Kaggle Score : 0.13251 Team Name : Vinayak#2
1st Attempt.
2nd Attempt.
3rd Attempt.