house_df <- read.csv('data/train.csv',header = T,stringsAsFactors = F)Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 4th quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.
a. \(P(X>x | Y>y)\)
\(P(LotArea> 4th_Quartile | SalePrice>2nd_Quartile) = P(LotArea> 4th_Quartile & SalePrice>2nd_Quartile)/P(SalePrice>2nd_Quartile)\)
xQ4 =quantile(house_df$LotArea,1)
yQ2 = quantile(house_df$SalePrice,.50)
a = (nrow(filter(house_df,(LotArea > xQ4) & (SalePrice>yQ2)))/nrow(house_df))/(nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df))
a## [1] 0
b. \(P(X>x, Y>y)\)
\(P(LotArea> 4th_Quartile & SalePrice>2nd_Quartile) = P(LotArea> 4th_Quartile) * P(SalePrice>2nd_Quartile)\)
b = nrow(filter(house_df,(LotArea > xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)
b## [1] 0
c. \(P(X<x | Y>y)\)
\(P(LotArea< 4th_Quartile | SalePrice>2nd_Quartile) = P(LotArea< 4th_Quartile & SalePrice>2nd_Quartile)/P(SalePrice>2nd_Quartile)\)
c = (nrow(filter(house_df,(LotArea < xQ4) & (SalePrice>yQ2)))/nrow(house_df))/(nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df))
c## [1] 0.9986264
(nrow(filter(house_df,(LotArea < xQ4) & (SalePrice>yQ2)))/nrow(house_df))## [1] 0.4979452
nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)## [1] 0.4982886
Does splitting the training data in this fashion make them independent? In other words, does P(XY) = P(X)P(Y) or P(X|Y)= P(X) ?
No. Splitting the training data does not make them independent.
#P(X|Y) = P(X)P(Y)
P_XY = nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)
P_XY## [1] 0.4982886
P_X = nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df)
P_X## [1] 0.9993151
Both the values are not equal. So the Lot area and Sale price are not independent.
Evaluate by running a Chi Square test for association
\(H_0\): LotArea and SalePrice are independent \(H_A\): LotArea and SalePrice are not independent
#As xQ4 is 0, subsituting xQ2
xQ2 = quantile(house_df$LotArea,.50)
x_11= nrow(filter(house_df,(LotArea > xQ2) & (SalePrice>yQ2)))
x_12= nrow(filter(house_df,(LotArea <= xQ2) & (SalePrice>yQ2)))
y_21 = nrow(filter(house_df,(LotArea > xQ2) & (SalePrice<=yQ2)))
y_22 = nrow(filter(house_df,(LotArea <=xQ2) & (SalePrice<=yQ2)))
#counts tablea
matrix(c(x_11,y_21,x_12,y_22),ncol=2)## [,1] [,2]
## [1,] 477 251
## [2,] 253 479
#Chisquared test
chisq.test(matrix(c(x_11,y_21,x_12,y_22),ncol=2))##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix(c(x_11, y_21, x_12, y_22), ncol = 2)
## X-squared = 138.7, df = 1, p-value < 2.2e-16
As the p-value is very low, we reject the null hypothesis.
Provide univariate descriptive statistics and appropriate plots for both variables.
X <- house_df$LotArea
Y <- house_df$SalePrice
#Mean and min/max values, etc
describe(X)## X
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 1073 1 10517 5718 3312 5000
## .25 .50 .75 .90 .95
## 7554 9478 11602 14382 17401
##
## lowest : 1300 1477 1491 1526 1533, highest: 70761 115149 159000 164660 215245
#Mean, Median, Quartile
summary(X)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10520 11600 215200
#Standard Deviation
sd(X)## [1] 9981.265
#Mean and min/max values, etc
describe(Y)## Y
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 663 1 180921 81086 88000 106475
## .25 .50 .75 .90 .95
## 129975 163000 214000 278000 326100
##
## lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
#Mean, Median, Quartile
summary(Y)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
#Standard Deviation
sd(Y)## [1] 79442.5
#Histogram of Sale Price
ggplot(house_df,aes(x = SalePrice)) +geom_histogram(bins=100) + ggtitle("Histogram of Sale Price")#Histogram of Lot area
ggplot(house_df,aes(x = LotArea)) +geom_histogram(bins=100) + ggtitle("Histogram of Lot area")#Scatterplot of LotArea vs SalePrice
ggplot(house_df,aes(x = LotArea, y= SalePrice)) + geom_point(aes(color=SalePrice)) +geom_smooth() + ggtitle("Scatterplot of LotArea vs SalePrice")## `geom_smooth()` using method = 'gam'
#Scatterplot of LotArea vs SalePrice(Excluding outliers)
house_df %>% filter(LotArea <50000) %>% ggplot(aes(x = LotArea, y= SalePrice)) + geom_point(aes(color=SalePrice)) +geom_smooth() + ggtitle("Scatterplot of LotArea vs SalePrice(Excluding outliers > 50000)")## `geom_smooth()` using method = 'gam'
Transform both variables simultaneously using Box-Cox transformations. Using the transformed variables, run a correlation analysis and interpret.Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
#Histogram of Lot Area
qplot(X,bins=100) #qqplot of Lot Area
qqnorm(X); qqline(X); #Create a linear model between Lot Area and Sale Price for Boxcox transformations
area_scale_lm = lm(house_df$SalePrice ~house_df$LotArea)
area_scale_lm##
## Call:
## lm(formula = house_df$SalePrice ~ house_df$LotArea)
##
## Coefficients:
## (Intercept) house_df$LotArea
## 158836.2 2.1
#Perform a box-cox transformation to find the correct lambda
area_scale_bc = boxcox(area_scale_lm)#Optimal lambda
optimal_lambda = with(area_scale_bc,x[which.max(y)])
optimal_lambda## [1] -0.06060606
For this lambda, it can be approximated as zero. Apply log transformation for the source variable.
#Create a new dataframe with required columns
house_df2 <- data.frame(X = house_df$LotArea , Y= house_df$SalePrice)
#Transform the column
house_df2 <- house_df2 %>% mutate(X_dash = log10(X))
ggplot(house_df2,aes(x = X, y= Y)) + geom_point(aes(color=Y)) +geom_smooth() + ggtitle("Lot Area vs Sale Price")## `geom_smooth()` using method = 'gam'
#Scatter plot of transformed column(Lot Area) vs Sale Price
ggplot(house_df2,aes(x = X_dash, y= Y)) + geom_point(aes(color=Y)) +geom_smooth() + ggtitle("Lot Area(Transformed column) vs Sale Price")## `geom_smooth()` using method = 'gam'
#Histogram of Transformed column
ggplot(house_df2,aes(x = X_dash)) +geom_histogram(bins=100) + ggtitle("Histogram of Lot area")#QQPlot of Transformed column
qqnorm(house_df2$X_dash); qqline(house_df2$X_dash)#QQPlot of Transformed column with confidence interval
qqPlot(house_df2$X_dash,envelope = .99)\(H_0\): Transformed variable(Lot Area) is normally distributed \(H_A\): Transformed variable(Lot Area) is not normally distributed
#Shapiro-Wilk Normality Test
shapiro.test(house_df2$X_dash)##
## Shapiro-Wilk normality test
##
## data: house_df2$X_dash
## W = 0.90543, p-value < 2.2e-16
#Anderson-Darling test for normality
ad.test(house_df2$X_dash)##
## Anderson-Darling normality test
##
## data: house_df2$X_dash
## A = 39.984, p-value < 2.2e-16
p-value is very low. Hence we reject null hypothesis and conclude that the transformed variables are not normally distributed.
\(H_0\): Correlation between LotArea and Sale Price variables is 0 \(H_A\): Correlation between LotArea and Sale Price variables is not 0
#Correlation between Lot Area and Sale Price
cor(house_df$LotArea,house_df$SalePrice)## [1] 0.2638434
#Correlation between Lot Area(Transformed) and Sale Price
cor(house_df2$X_dash,house_df2$Y)## [1] 0.3885203
#Correlation Test between Lot Area and Sale Price
cor.test(house_df2$X,house_df2$Y)##
## Pearson's product-moment correlation
##
## data: house_df2$X and house_df2$Y
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2154574 0.3109369
## sample estimates:
## cor
## 0.2638434
#Correlation Test between Lot Area(Transformed) and Sale Price
cortest = cor.test(house_df2$X_dash,house_df2$Y, conf.level = .99)
cortest##
## Pearson's product-moment correlation
##
## data: house_df2$X_dash and house_df2$Y
## t = 16.1, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.3297735 0.4442697
## sample estimates:
## cor
## 0.3885203
This correlation test shows that the p-value is almost 0. Hence we can reject null hypothesis and conclude that the correlation is not 0 between these variables.
Confidence interval for correlation coefficient is 0.3297735, 0.4442697
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
house_values <- data.frame(house_df$LotArea,house_df$BsmtFinSF1,house_df$BsmtUnfSF,house_df$TotalBsmtSF,house_df$X1stFlrSF,house_df$X2ndFlrSF,house_df$GrLivArea,house_df$SalePrice)
house_cor = cor(house_values)
#Precision matrix
house_pre = solve(house_cor)
house_pre## house_df.LotArea house_df.BsmtFinSF1
## house_df.LotArea 1.14126098 0.2098455
## house_df.BsmtFinSF1 0.20984550 8.9488307
## house_df.BsmtUnfSF 0.33461507 7.7022462
## house_df.TotalBsmtSF -0.29389180 -7.7249641
## house_df.X1stFlrSF -0.21687173 -0.5589164
## house_df.X2ndFlrSF -0.03939661 -0.7777032
## house_df.GrLivArea -0.09832565 0.9677546
## house_df.SalePrice -0.05999538 -0.5242619
## house_df.BsmtUnfSF house_df.TotalBsmtSF
## house_df.LotArea 0.3346151 -0.2938918
## house_df.BsmtFinSF1 7.7022462 -7.7249641
## house_df.BsmtUnfSF 7.8883506 -7.3309432
## house_df.TotalBsmtSF -7.3309432 10.4386078
## house_df.X1stFlrSF -0.2076697 -2.0005691
## house_df.X2ndFlrSF -0.5055645 0.7936117
## house_df.GrLivArea 0.3868336 -0.2806713
## house_df.SalePrice -0.2452075 -0.6124594
## house_df.X1stFlrSF house_df.X2ndFlrSF
## house_df.LotArea -0.2168717 -0.03939661
## house_df.BsmtFinSF1 -0.5589164 -0.77770319
## house_df.BsmtUnfSF -0.2076697 -0.50556451
## house_df.TotalBsmtSF -2.0005691 0.79361171
## house_df.X1stFlrSF 68.8621780 74.46669220
## house_df.X2ndFlrSF 74.4666922 84.97783277
## house_df.GrLivArea -87.8180705 -99.21040799
## house_df.SalePrice -1.7247510 -2.01677635
## house_df.GrLivArea house_df.SalePrice
## house_df.LotArea -0.09832565 -0.05999538
## house_df.BsmtFinSF1 0.96775461 -0.52426187
## house_df.BsmtUnfSF 0.38683363 -0.24520746
## house_df.TotalBsmtSF -0.28067131 -0.61245942
## house_df.X1stFlrSF -87.81807049 -1.72475101
## house_df.X2ndFlrSF -99.21040799 -2.01677635
## house_df.GrLivArea 118.11679134 0.92678691
## house_df.SalePrice 0.92678691 2.67902475
house_cor_pre = house_cor %*% house_pre
house_pre_cor = house_pre %*% house_cor
#Correlation matrix of all variables in number
corrplot(house_cor,method="number")#Correlation matrix of all variables in square
corrplot(house_cor,method="square")#Correlation matrix by precision matrix
corrplot(house_cor_pre,method="circle")#Precision matrix by Correlation matrix
corrplot(house_pre_cor,method="circle")Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable
hist(house_df$LotArea,xlab = "Lot Area",main = "Histogram of Lot Area")dist_values = fitdistr(house_df$LotArea, densfun="exponential")
rexp_lotarea <- rexp(1000,dist_values$estimate[1])
hist(rexp_lotarea,xlab = "Lot Area",main = "Histogram of Random exponential values")Test the distribution of randomly generated exponential variables.
#Kolmogorov-Smirnov Tests
ks.test(rexp_lotarea,"pexp")##
## One-sample Kolmogorov-Smirnov test
##
## data: rexp_lotarea
## D = 0.99893, p-value < 2.2e-16
## alternative hypothesis: two-sided
Plot all the distributions and try to fit the exact curve.
descdist(rexp_lotarea, discrete = F)## summary statistics
## ------
## min: 3.143102 max: 71755.07
## median: 7092.833
## mean: 10263.07
## estimated sd: 9985.043
## estimated skewness: 1.792638
## estimated kurtosis: 7.218779
Above plot shows that the ‘gamma’ and ‘exponential’ distribution are nearly close to the observation.
#Adding very low value to the random values
rexp_lotarea1 <- rexp_lotarea+.000001
fit.exp <- fitdist(rexp_lotarea1,"exp",method="mme",lower = c(0, 0))
fit.gamma <- fitdist(rexp_lotarea1,"gamma",method="mme",lower = c(0, 0))
#Exponential Distribution Plot
plot(fit.exp)#Gamma Distribution Plot
plot(fit.gamma)fit.exp$aic## [1] 20474.61
fit.gamma$aic## [1] 20477.09
From above plot and AIC values, we can conclude that exponential distribution is the exact fit of randomly generated exponential values.
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
house_df$Id <- factor(house_df$Id)
house_df$YearBuilt <- factor(house_df$YearBuilt)
house_df$YrSold <- factor(house_df$YrSold)
house_df$MoSold <- factor(house_df$MoSold)
#To find all the quantitative and categorical columns
split(names(house_df),sapply(house_df,function(x) paste(class(x),collapse=",")))## $character
## [1] "MSZoning" "Street" "Alley" "LotShape"
## [5] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [9] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [13] "HouseStyle" "RoofStyle" "RoofMatl" "Exterior1st"
## [17] "Exterior2nd" "MasVnrType" "ExterQual" "ExterCond"
## [21] "Foundation" "BsmtQual" "BsmtCond" "BsmtExposure"
## [25] "BsmtFinType1" "BsmtFinType2" "Heating" "HeatingQC"
## [29] "CentralAir" "Electrical" "KitchenQual" "Functional"
## [33] "FireplaceQu" "GarageType" "GarageFinish" "GarageQual"
## [37] "GarageCond" "PavedDrive" "PoolQC" "Fence"
## [41] "MiscFeature" "SaleType" "SaleCondition"
##
## $factor
## [1] "Id" "YearBuilt" "MoSold" "YrSold"
##
## $integer
## [1] "MSSubClass" "LotFrontage" "LotArea" "OverallQual"
## [5] "OverallCond" "YearRemodAdd" "MasVnrArea" "BsmtFinSF1"
## [9] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "X1stFlrSF"
## [13] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [17] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [21] "KitchenAbvGr" "TotRmsAbvGrd" "Fireplaces" "GarageYrBlt"
## [25] "GarageCars" "GarageArea" "WoodDeckSF" "OpenPorchSF"
## [29] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [33] "MiscVal" "SalePrice"
#Distribution of numerical variables
house_df %>% select(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond ,
MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath, HalfBath,
BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea, MiscVal,SalePrice) %>% gather() %>% ggplot(aes(value)) + facet_wrap(~key,scales ="free") + geom_density() #####Data cleaning
#data cleaning
Num_NA<-sapply(house_df,function(y)length(which(is.na(y)==T)))
NA_Count<- data.frame(Item=colnames(house_df),Count=Num_NA)
#NA count list
NA_Count## Item Count
## Id Id 0
## MSSubClass MSSubClass 0
## MSZoning MSZoning 0
## LotFrontage LotFrontage 259
## LotArea LotArea 0
## Street Street 0
## Alley Alley 1369
## LotShape LotShape 0
## LandContour LandContour 0
## Utilities Utilities 0
## LotConfig LotConfig 0
## LandSlope LandSlope 0
## Neighborhood Neighborhood 0
## Condition1 Condition1 0
## Condition2 Condition2 0
## BldgType BldgType 0
## HouseStyle HouseStyle 0
## OverallQual OverallQual 0
## OverallCond OverallCond 0
## YearBuilt YearBuilt 0
## YearRemodAdd YearRemodAdd 0
## RoofStyle RoofStyle 0
## RoofMatl RoofMatl 0
## Exterior1st Exterior1st 0
## Exterior2nd Exterior2nd 0
## MasVnrType MasVnrType 8
## MasVnrArea MasVnrArea 8
## ExterQual ExterQual 0
## ExterCond ExterCond 0
## Foundation Foundation 0
## BsmtQual BsmtQual 37
## BsmtCond BsmtCond 37
## BsmtExposure BsmtExposure 38
## BsmtFinType1 BsmtFinType1 37
## BsmtFinSF1 BsmtFinSF1 0
## BsmtFinType2 BsmtFinType2 38
## BsmtFinSF2 BsmtFinSF2 0
## BsmtUnfSF BsmtUnfSF 0
## TotalBsmtSF TotalBsmtSF 0
## Heating Heating 0
## HeatingQC HeatingQC 0
## CentralAir CentralAir 0
## Electrical Electrical 1
## X1stFlrSF X1stFlrSF 0
## X2ndFlrSF X2ndFlrSF 0
## LowQualFinSF LowQualFinSF 0
## GrLivArea GrLivArea 0
## BsmtFullBath BsmtFullBath 0
## BsmtHalfBath BsmtHalfBath 0
## FullBath FullBath 0
## HalfBath HalfBath 0
## BedroomAbvGr BedroomAbvGr 0
## KitchenAbvGr KitchenAbvGr 0
## KitchenQual KitchenQual 0
## TotRmsAbvGrd TotRmsAbvGrd 0
## Functional Functional 0
## Fireplaces Fireplaces 0
## FireplaceQu FireplaceQu 690
## GarageType GarageType 81
## GarageYrBlt GarageYrBlt 81
## GarageFinish GarageFinish 81
## GarageCars GarageCars 0
## GarageArea GarageArea 0
## GarageQual GarageQual 81
## GarageCond GarageCond 81
## PavedDrive PavedDrive 0
## WoodDeckSF WoodDeckSF 0
## OpenPorchSF OpenPorchSF 0
## EnclosedPorch EnclosedPorch 0
## X3SsnPorch X3SsnPorch 0
## ScreenPorch ScreenPorch 0
## PoolArea PoolArea 0
## PoolQC PoolQC 1453
## Fence Fence 1179
## MiscFeature MiscFeature 1406
## MiscVal MiscVal 0
## MoSold MoSold 0
## YrSold YrSold 0
## SaleType SaleType 0
## SaleCondition SaleCondition 0
## SalePrice SalePrice 0
house_clean <- house_df
mask <- is.na(house_clean$MasVnrType)
house_clean$MasVnrType[mask] <- 'None'
mask <- is.na(house_clean$MasVnrType)
house_clean$MasVnrType[mask] <- 'None'
mask <- is.na(house_clean$BsmtCond)
house_clean$BsmtCond[mask] <- 'NoBasement'
mask <- is.na(house_clean$BsmtFinType1)
house_clean$BsmtFinType1[mask] <- 'None'
mask <- is.na(house_clean$BsmtFinType2)
house_clean$BsmtFinType2[mask] <- 'None'
mask <- is.na(house_clean$FireplaceQu)
house_clean$FireplaceQu[mask] <- 'NoFireplace'
mask <- is.na(house_clean$GarageFinish)
house_clean$GarageFinish[mask] <- 'NoGarage'
mask <- is.na(house_clean$GarageQual)
house_clean$GarageQual[mask] <- 'NoGarage'
mask <- is.na(house_clean$MiscFeature)
house_clean$MiscFeature[mask] <- 'NoMisc'
mask <- is.na(house_clean$BsmtExposure)
house_clean$BsmtExposure[mask] <- 'NoBasement'
mask <- is.na(house_clean$PoolQC)
house_clean$PoolQC[mask] <- 'NoPool'
mask <- is.na(house_clean$Fence)
house_clean$Fence[mask] <- 'NoFence'
mask <- is.na(house_clean$GarageType)
house_clean$GarageType[mask] <- 'NoGarage'
mask <- is.na(house_clean$GarageCond)
house_clean$GarageCond[mask] <- 'NoGarage'
mask <- is.na(house_clean$GarageArea)
house_clean$GarageArea[mask] <- 0
mask <- is.na(house_clean$GarageCars)
house_clean$GarageCars[mask] <- 0
mask <- is.na(house_clean$LotFrontage)
house_clean$LotFrontage[mask] <- 0#Segregate training and testing dataset
#Split into training and test dataset
set.seed(7340)
row <- sample(nrow(house_clean))
house_split <- house_clean[row,]
split <- round(nrow(house_split)*.80)
training <- data.frame(house_split[1:split,])
test <- data.frame(house_split[(split+1):nrow(house_split),])
test_x<- select(test,c(MSSubClass,MSZoning, LotFrontage,LotArea,
Street,LotShape, LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType, HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st,Exterior2nd ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation, BsmtCond, BsmtExposure , BsmtFinType1, BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,
HalfBath, BedroomAbvGr , KitchenAbvGr, KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea, PoolQC,Fence, MiscFeature ,
MiscVal,MoSold,YrSold,SaleType, SaleCondition)) %>% data.frame()
test_y <- select(test,SalePrice) %>% data.frame()
test_or1 <- read.csv('data/test.csv',header = T,stringsAsFactors = F)
test_original <- test_or1 %>% select(c(MSSubClass,MSZoning, LotFrontage,LotArea,
Street,LotShape, LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType, HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st,Exterior2nd ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation, BsmtCond, BsmtExposure , BsmtFinType1, BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,
HalfBath, BedroomAbvGr , KitchenAbvGr, KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea, PoolQC,Fence, MiscFeature ,
MiscVal,MoSold,YrSold,SaleType, SaleCondition)) %>% data.frame()
#Clean Test dataset
mask <- is.na(test_original$MasVnrType)
test_original$MasVnrType[mask] <- 'None'
mask <- is.na(test_original$MasVnrType)
test_original$MasVnrType[mask] <- 'None'
mask <- is.na(test_original$BsmtCond)
test_original$BsmtCond[mask] <- 'NoBasement'
mask <- is.na(test_original$BsmtFinType1)
test_original$BsmtFinType1[mask] <- 'None'
mask <- is.na(test_original$BsmtFinType2)
test_original$BsmtFinType2[mask] <- 'None'
mask <- is.na(test_original$FireplaceQu)
test_original$FireplaceQu[mask] <- 'NoFireplace'
mask <- is.na(test_original$GarageFinish)
test_original$GarageFinish[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageQual)
test_original$GarageQual[mask] <- 'NoGarage'
mask <- is.na(test_original$MiscFeature)
test_original$MiscFeature[mask] <- 'NoMisc'
mask <- is.na(test_original$BsmtExposure)
test_original$BsmtExposure[mask] <- 'NoBasement'
mask <- is.na(test_original$PoolQC)
test_original$PoolQC[mask] <- 'NoPool'
mask <- is.na(test_original$Fence)
test_original$Fence[mask] <- 'NoFence'
mask <- is.na(test_original$GarageType)
test_original$GarageType[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageCond)
test_original$GarageCond[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageArea)
test_original$GarageArea[mask] <- 0
mask <- is.na(test_original$GarageCars)
test_original$GarageCars[mask] <- 0
mask <- is.na(test_original$LotFrontage)
test_original$LotFrontage[mask] <- 0
test_original$GarageCond[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageArea)
test_original$GarageArea[mask] <- 0
mask <- is.na(test_original$GarageCars)
test_original$GarageCars[mask] <- 0
mask <- is.na(test_original$LotFrontage)
test_original$LotFrontage[mask] <- 0
#Having 0 for all null columns
test_original <- mutate_each(test_original,funs(replace(.,is.na(.),0)))
training$YearBuilt <- as.integer(training$YearBuilt )
#Split with between valid variables
#All variables
house_all <- select(training,c(MSSubClass,MSZoning, LotFrontage,LotArea,
Street,LotShape, LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType, HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st,Exterior2nd ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation, BsmtCond, BsmtExposure , BsmtFinType1, BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,
HalfBath, BedroomAbvGr , KitchenAbvGr, KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea, PoolQC,Fence, MiscFeature ,
MiscVal,MoSold,YrSold,SaleType, SaleCondition, SalePrice)) %>% data.frame()
house_all_x <- select(training,c(MSSubClass,MSZoning, LotFrontage,LotArea,
Street,LotShape, LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType, HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st,Exterior2nd ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation, BsmtCond, BsmtExposure , BsmtFinType1, BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,
HalfBath, BedroomAbvGr , KitchenAbvGr, KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea, PoolQC,Fence, MiscFeature ,
MiscVal,MoSold,YrSold,SaleType, SaleCondition)) %>% data.frame()
#Electrical, OverallQual, Alley
house_num <- select(training,c(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond , MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath, HalfBath,
BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea, MiscVal,SalePrice))
#Numeric variables
house_num_x <- select(training,c(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond , MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath, HalfBath,
BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea, MiscVal))
#Dependent variable
house_y <- select(training,SalePrice) %>% data.frame()model_glm <- glm(SalePrice~ ., data = house_num)
predicted <- predict(model_glm,newdata=test)
predicted_up <- ifelse(is.na(predicted)==TRUE,0,predicted)
RMSE(test_y$SalePrice, predicted_up) # 42286.25## [1] 52438.93
#Model 2 - Process the original dataset with cleanup using
model_radomforest <- train(x=house_all_x, y=house_y$SalePrice, method = "ranger",preProcess = c("knnImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))## Loading required package: e1071
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
## Loading required package: ranger
## + Fold1: mtry= 2
## - Fold1: mtry= 2
## + Fold1: mtry=39
## - Fold1: mtry=39
## + Fold1: mtry=76
## - Fold1: mtry=76
## + Fold2: mtry= 2
## - Fold2: mtry= 2
## + Fold2: mtry=39
## - Fold2: mtry=39
## + Fold2: mtry=76
## - Fold2: mtry=76
## + Fold3: mtry= 2
## - Fold3: mtry= 2
## + Fold3: mtry=39
## - Fold3: mtry=39
## + Fold3: mtry=76
## - Fold3: mtry=76
## + Fold4: mtry= 2
## - Fold4: mtry= 2
## + Fold4: mtry=39
## - Fold4: mtry=39
## + Fold4: mtry=76
## - Fold4: mtry=76
## + Fold5: mtry= 2
## - Fold5: mtry= 2
## + Fold5: mtry=39
## - Fold5: mtry=39
## + Fold5: mtry=76
## - Fold5: mtry=76
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 39 on full training set
#Test dataset
predicted <- predict(model_radomforest,newdata=test)
RMSE(test_y$SalePrice, predicted) # 24427.89## [1] 27867.67
#It is low around 38 variables
plot(model_radomforest)#Model 3 -Removing low-variance variables
model_zv <- train(x=house_all_x, y=house_y$SalePrice, method = "ranger",preProcess = c("zv","medianImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))
predicted <- predict(model_zv,newdata=test)## Warning in Ops.factor(left, right): '-' not meaningful for factors
RMSE(test_y$SalePrice, predicted) # 24480.04#Model 4 -Creating using glmnet with numeric values
model_glmnet <- train(x=house_num_x, y=house_y$SalePrice, method = "glmnet",preProcess = c("medianImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))## Loading required package: glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: foreach
## Loaded glmnet 2.0-5
##
## Attaching package: 'glmnet'
## The following object is masked from 'package:Metrics':
##
## auc
## + Fold1: alpha=0.10, lambda=12742
## - Fold1: alpha=0.10, lambda=12742
## + Fold1: alpha=0.55, lambda=12742
## - Fold1: alpha=0.55, lambda=12742
## + Fold1: alpha=1.00, lambda=12742
## - Fold1: alpha=1.00, lambda=12742
## + Fold2: alpha=0.10, lambda=12742
## - Fold2: alpha=0.10, lambda=12742
## + Fold2: alpha=0.55, lambda=12742
## - Fold2: alpha=0.55, lambda=12742
## + Fold2: alpha=1.00, lambda=12742
## - Fold2: alpha=1.00, lambda=12742
## + Fold3: alpha=0.10, lambda=12742
## - Fold3: alpha=0.10, lambda=12742
## + Fold3: alpha=0.55, lambda=12742
## - Fold3: alpha=0.55, lambda=12742
## + Fold3: alpha=1.00, lambda=12742
## - Fold3: alpha=1.00, lambda=12742
## + Fold4: alpha=0.10, lambda=12742
## - Fold4: alpha=0.10, lambda=12742
## + Fold4: alpha=0.55, lambda=12742
## - Fold4: alpha=0.55, lambda=12742
## + Fold4: alpha=1.00, lambda=12742
## - Fold4: alpha=1.00, lambda=12742
## + Fold5: alpha=0.10, lambda=12742
## - Fold5: alpha=0.10, lambda=12742
## + Fold5: alpha=0.55, lambda=12742
## - Fold5: alpha=0.55, lambda=12742
## + Fold5: alpha=1.00, lambda=12742
## - Fold5: alpha=1.00, lambda=12742
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 1274 on full training set
predicted <- predict(model_glmnet,newdata=test)
RMSE(test_y$SalePrice, predicted) ## [1] 47764.56
After analyzing various models, it seems below is the best model.
#Predict with original test data
predicted <- predict(model_radomforest,newdata=test_original)
final_output <- data.frame(test_or1,predicted)
final_output %>% write_csv(path = 'submission.csv')
final_output %>% select(c(Id, predicted)) %>% head()## Id predicted
## 1 1461 130999.0
## 2 1462 159878.4
## 3 1463 180786.3
## 4 1464 181695.3
## 5 1465 200290.4
## 6 1466 180546.9
rm(list=ls())