Dataset

The House Prices: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
house_df <- read.csv('data/train.csv',header = T,stringsAsFactors = F)

Probability

Calculate Probabilities

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 4th quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.

a. \(P(X>x | Y>y)\)

\(P(LotArea> 4th_Quartile | SalePrice>2nd_Quartile) = P(LotArea> 4th_Quartile & SalePrice>2nd_Quartile)/P(SalePrice>2nd_Quartile)\)

xQ4 =quantile(house_df$LotArea,1)
yQ2 = quantile(house_df$SalePrice,.50)


a = (nrow(filter(house_df,(LotArea > xQ4) & (SalePrice>yQ2)))/nrow(house_df))/(nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df))

a
## [1] 0

b. \(P(X>x, Y>y)\)

\(P(LotArea> 4th_Quartile & SalePrice>2nd_Quartile) = P(LotArea> 4th_Quartile) * P(SalePrice>2nd_Quartile)\)

b = nrow(filter(house_df,(LotArea > xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)

b
## [1] 0

c. \(P(X<x | Y>y)\)

\(P(LotArea< 4th_Quartile | SalePrice>2nd_Quartile) = P(LotArea< 4th_Quartile & SalePrice>2nd_Quartile)/P(SalePrice>2nd_Quartile)\)

c = (nrow(filter(house_df,(LotArea < xQ4) & (SalePrice>yQ2)))/nrow(house_df))/(nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df))
c
## [1] 0.9986264
(nrow(filter(house_df,(LotArea < xQ4) & (SalePrice>yQ2)))/nrow(house_df))
## [1] 0.4979452
nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)
## [1] 0.4982886
Independence

Does splitting the training data in this fashion make them independent? In other words, does P(XY) = P(X)P(Y) or P(X|Y)= P(X) ?

No. Splitting the training data does not make them independent.

#P(X|Y) = P(X)P(Y)
P_XY = nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df) *
nrow(filter(house_df,SalePrice>yQ2))/nrow(house_df)

P_XY
## [1] 0.4982886
P_X = nrow(filter(house_df,(LotArea < xQ4)))/nrow(house_df)
                                                  
P_X
## [1] 0.9993151

Both the values are not equal. So the Lot area and Sale price are not independent.

Chi Square test

Evaluate by running a Chi Square test for association

\(H_0\): LotArea and SalePrice are independent \(H_A\): LotArea and SalePrice are not independent

#As xQ4 is 0, subsituting xQ2
xQ2 = quantile(house_df$LotArea,.50)

x_11= nrow(filter(house_df,(LotArea >  xQ2) & (SalePrice>yQ2)))

x_12= nrow(filter(house_df,(LotArea <= xQ2) & (SalePrice>yQ2)))

y_21 = nrow(filter(house_df,(LotArea > xQ2) & (SalePrice<=yQ2)))

y_22 = nrow(filter(house_df,(LotArea <=xQ2) & (SalePrice<=yQ2)))

#counts tablea
matrix(c(x_11,y_21,x_12,y_22),ncol=2)
##      [,1] [,2]
## [1,]  477  251
## [2,]  253  479
#Chisquared test
chisq.test(matrix(c(x_11,y_21,x_12,y_22),ncol=2))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix(c(x_11, y_21, x_12, y_22), ncol = 2)
## X-squared = 138.7, df = 1, p-value < 2.2e-16

As the p-value is very low, we reject the null hypothesis.

Descriptive and Inferential Statistics.

Descriptive statistics

Provide univariate descriptive statistics and appropriate plots for both variables.

X <- house_df$LotArea
Y <- house_df$SalePrice

#Mean and min/max values, etc
describe(X)
## X 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0     1073        1    10517     5718     3312     5000 
##      .25      .50      .75      .90      .95 
##     7554     9478    11602    14382    17401 
## 
## lowest :   1300   1477   1491   1526   1533, highest:  70761 115149 159000 164660 215245
#Mean, Median, Quartile
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10520   11600  215200
#Standard Deviation
sd(X)
## [1] 9981.265
#Mean and min/max values, etc
describe(Y)
## Y 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1460        0      663        1   180921    81086    88000   106475 
##      .25      .50      .75      .90      .95 
##   129975   163000   214000   278000   326100 
## 
## lowest :  34900  35311  37900  39300  40000, highest: 582933 611657 625000 745000 755000
#Mean, Median, Quartile
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  130000  163000  180900  214000  755000
#Standard Deviation
sd(Y)
## [1] 79442.5
Provide a scatterplot of X and Y.
#Histogram of Sale Price
ggplot(house_df,aes(x = SalePrice)) +geom_histogram(bins=100) + ggtitle("Histogram of Sale Price")

#Histogram of Lot area
ggplot(house_df,aes(x = LotArea)) +geom_histogram(bins=100) + ggtitle("Histogram of Lot area")

#Scatterplot of LotArea vs SalePrice
ggplot(house_df,aes(x = LotArea, y= SalePrice)) + geom_point(aes(color=SalePrice)) +geom_smooth() + ggtitle("Scatterplot of LotArea vs SalePrice")
## `geom_smooth()` using method = 'gam'

#Scatterplot of LotArea vs SalePrice(Excluding outliers)
house_df %>% filter(LotArea <50000) %>% ggplot(aes(x = LotArea, y= SalePrice)) + geom_point(aes(color=SalePrice)) +geom_smooth() + ggtitle("Scatterplot of LotArea vs SalePrice(Excluding outliers > 50000)")
## `geom_smooth()` using method = 'gam'

Box-Cox transformations

Transform both variables simultaneously using Box-Cox transformations. Using the transformed variables, run a correlation analysis and interpret.Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

#Histogram of Lot Area
qplot(X,bins=100) 

#qqplot of Lot Area
qqnorm(X); qqline(X); 

#Create a linear model between Lot Area and Sale Price for Boxcox transformations
area_scale_lm = lm(house_df$SalePrice ~house_df$LotArea)
area_scale_lm
## 
## Call:
## lm(formula = house_df$SalePrice ~ house_df$LotArea)
## 
## Coefficients:
##      (Intercept)  house_df$LotArea  
##         158836.2               2.1
#Perform a box-cox transformation to find the correct lambda

area_scale_bc = boxcox(area_scale_lm)

#Optimal lambda
optimal_lambda = with(area_scale_bc,x[which.max(y)])
optimal_lambda
## [1] -0.06060606

For this lambda, it can be approximated as zero. Apply log transformation for the source variable.

#Create a new dataframe with required columns
house_df2 <- data.frame(X = house_df$LotArea , Y= house_df$SalePrice)

#Transform the column
house_df2 <- house_df2 %>%  mutate(X_dash = log10(X))


ggplot(house_df2,aes(x = X, y= Y)) + geom_point(aes(color=Y)) +geom_smooth() + ggtitle("Lot Area vs Sale Price")
## `geom_smooth()` using method = 'gam'

#Scatter plot of transformed column(Lot Area) vs Sale Price
ggplot(house_df2,aes(x = X_dash, y= Y)) + geom_point(aes(color=Y)) +geom_smooth() + ggtitle("Lot Area(Transformed column) vs Sale Price")
## `geom_smooth()` using method = 'gam'

#Histogram of Transformed column
ggplot(house_df2,aes(x = X_dash)) +geom_histogram(bins=100) + ggtitle("Histogram of Lot area")

#QQPlot of Transformed column
qqnorm(house_df2$X_dash); qqline(house_df2$X_dash)

#QQPlot of Transformed column with confidence interval
qqPlot(house_df2$X_dash,envelope = .99)

Normality test

\(H_0\): Transformed variable(Lot Area) is normally distributed \(H_A\): Transformed variable(Lot Area) is not normally distributed

#Shapiro-Wilk Normality Test

shapiro.test(house_df2$X_dash)
## 
##  Shapiro-Wilk normality test
## 
## data:  house_df2$X_dash
## W = 0.90543, p-value < 2.2e-16
#Anderson-Darling test for normality
ad.test(house_df2$X_dash)
## 
##  Anderson-Darling normality test
## 
## data:  house_df2$X_dash
## A = 39.984, p-value < 2.2e-16

p-value is very low. Hence we reject null hypothesis and conclude that the transformed variables are not normally distributed.

Correlation test

\(H_0\): Correlation between LotArea and Sale Price variables is 0 \(H_A\): Correlation between LotArea and Sale Price variables is not 0

#Correlation between Lot Area and Sale Price
cor(house_df$LotArea,house_df$SalePrice)
## [1] 0.2638434
#Correlation between Lot Area(Transformed) and Sale Price
cor(house_df2$X_dash,house_df2$Y)
## [1] 0.3885203
#Correlation Test between Lot Area and Sale Price
cor.test(house_df2$X,house_df2$Y)
## 
##  Pearson's product-moment correlation
## 
## data:  house_df2$X and house_df2$Y
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2154574 0.3109369
## sample estimates:
##       cor 
## 0.2638434
#Correlation Test between Lot Area(Transformed) and Sale Price
cortest = cor.test(house_df2$X_dash,house_df2$Y, conf.level = .99)
cortest
## 
##  Pearson's product-moment correlation
## 
## data:  house_df2$X_dash and house_df2$Y
## t = 16.1, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.3297735 0.4442697
## sample estimates:
##       cor 
## 0.3885203

This correlation test shows that the p-value is almost 0. Hence we can reject null hypothesis and conclude that the correlation is not 0 between these variables.

Confidence interval for correlation coefficient is 0.3297735, 0.4442697

Linear Algebra and Correlation.

Correlation

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html


house_values <- data.frame(house_df$LotArea,house_df$BsmtFinSF1,house_df$BsmtUnfSF,house_df$TotalBsmtSF,house_df$X1stFlrSF,house_df$X2ndFlrSF,house_df$GrLivArea,house_df$SalePrice)


house_cor = cor(house_values)

#Precision matrix
house_pre = solve(house_cor)
house_pre
##                      house_df.LotArea house_df.BsmtFinSF1
## house_df.LotArea           1.14126098           0.2098455
## house_df.BsmtFinSF1        0.20984550           8.9488307
## house_df.BsmtUnfSF         0.33461507           7.7022462
## house_df.TotalBsmtSF      -0.29389180          -7.7249641
## house_df.X1stFlrSF        -0.21687173          -0.5589164
## house_df.X2ndFlrSF        -0.03939661          -0.7777032
## house_df.GrLivArea        -0.09832565           0.9677546
## house_df.SalePrice        -0.05999538          -0.5242619
##                      house_df.BsmtUnfSF house_df.TotalBsmtSF
## house_df.LotArea              0.3346151           -0.2938918
## house_df.BsmtFinSF1           7.7022462           -7.7249641
## house_df.BsmtUnfSF            7.8883506           -7.3309432
## house_df.TotalBsmtSF         -7.3309432           10.4386078
## house_df.X1stFlrSF           -0.2076697           -2.0005691
## house_df.X2ndFlrSF           -0.5055645            0.7936117
## house_df.GrLivArea            0.3868336           -0.2806713
## house_df.SalePrice           -0.2452075           -0.6124594
##                      house_df.X1stFlrSF house_df.X2ndFlrSF
## house_df.LotArea             -0.2168717        -0.03939661
## house_df.BsmtFinSF1          -0.5589164        -0.77770319
## house_df.BsmtUnfSF           -0.2076697        -0.50556451
## house_df.TotalBsmtSF         -2.0005691         0.79361171
## house_df.X1stFlrSF           68.8621780        74.46669220
## house_df.X2ndFlrSF           74.4666922        84.97783277
## house_df.GrLivArea          -87.8180705       -99.21040799
## house_df.SalePrice           -1.7247510        -2.01677635
##                      house_df.GrLivArea house_df.SalePrice
## house_df.LotArea            -0.09832565        -0.05999538
## house_df.BsmtFinSF1          0.96775461        -0.52426187
## house_df.BsmtUnfSF           0.38683363        -0.24520746
## house_df.TotalBsmtSF        -0.28067131        -0.61245942
## house_df.X1stFlrSF         -87.81807049        -1.72475101
## house_df.X2ndFlrSF         -99.21040799        -2.01677635
## house_df.GrLivArea         118.11679134         0.92678691
## house_df.SalePrice           0.92678691         2.67902475
house_cor_pre = house_cor %*% house_pre

house_pre_cor = house_pre %*% house_cor

#Correlation matrix of all variables in number
corrplot(house_cor,method="number")

#Correlation matrix of all variables in square
corrplot(house_cor,method="square")

#Correlation matrix by precision matrix
corrplot(house_cor_pre,method="circle")

#Precision matrix by Correlation matrix
corrplot(house_pre_cor,method="circle")

Calculus-Based Probability & Statistics.

Distribution

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable

hist(house_df$LotArea,xlab = "Lot Area",main = "Histogram of Lot Area")

dist_values = fitdistr(house_df$LotArea, densfun="exponential")

rexp_lotarea <- rexp(1000,dist_values$estimate[1])
                     

hist(rexp_lotarea,xlab = "Lot Area",main = "Histogram of Random exponential values")

Test the distribution of randomly generated exponential variables.

#Kolmogorov-Smirnov Tests

ks.test(rexp_lotarea,"pexp")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rexp_lotarea
## D = 0.99893, p-value < 2.2e-16
## alternative hypothesis: two-sided

Plot all the distributions and try to fit the exact curve.

descdist(rexp_lotarea, discrete = F)

## summary statistics
## ------
## min:  3.143102   max:  71755.07 
## median:  7092.833 
## mean:  10263.07 
## estimated sd:  9985.043 
## estimated skewness:  1.792638 
## estimated kurtosis:  7.218779

Above plot shows that the ‘gamma’ and ‘exponential’ distribution are nearly close to the observation.

#Adding very low value to the random values
rexp_lotarea1 <- rexp_lotarea+.000001

fit.exp <- fitdist(rexp_lotarea1,"exp",method="mme",lower = c(0, 0))
fit.gamma <- fitdist(rexp_lotarea1,"gamma",method="mme",lower = c(0, 0))

#Exponential Distribution Plot
plot(fit.exp)

#Gamma Distribution Plot
plot(fit.gamma)

fit.exp$aic
## [1] 20474.61
fit.gamma$aic
## [1] 20477.09

From above plot and AIC values, we can conclude that exponential distribution is the exact fit of randomly generated exponential values.

Modeling

Create Model

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.

house_df$Id <- factor(house_df$Id)

house_df$YearBuilt <- factor(house_df$YearBuilt)

house_df$YrSold <- factor(house_df$YrSold)


house_df$MoSold <- factor(house_df$MoSold)

#To find all the quantitative and categorical columns
split(names(house_df),sapply(house_df,function(x) paste(class(x),collapse=",")))
## $character
##  [1] "MSZoning"      "Street"        "Alley"         "LotShape"     
##  [5] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
##  [9] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [13] "HouseStyle"    "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [17] "Exterior2nd"   "MasVnrType"    "ExterQual"     "ExterCond"    
## [21] "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure" 
## [25] "BsmtFinType1"  "BsmtFinType2"  "Heating"       "HeatingQC"    
## [29] "CentralAir"    "Electrical"    "KitchenQual"   "Functional"   
## [33] "FireplaceQu"   "GarageType"    "GarageFinish"  "GarageQual"   
## [37] "GarageCond"    "PavedDrive"    "PoolQC"        "Fence"        
## [41] "MiscFeature"   "SaleType"      "SaleCondition"
## 
## $factor
## [1] "Id"        "YearBuilt" "MoSold"    "YrSold"   
## 
## $integer
##  [1] "MSSubClass"    "LotFrontage"   "LotArea"       "OverallQual"  
##  [5] "OverallCond"   "YearRemodAdd"  "MasVnrArea"    "BsmtFinSF1"   
##  [9] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "X1stFlrSF"    
## [13] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [17] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [21] "KitchenAbvGr"  "TotRmsAbvGrd"  "Fireplaces"    "GarageYrBlt"  
## [25] "GarageCars"    "GarageArea"    "WoodDeckSF"    "OpenPorchSF"  
## [29] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [33] "MiscVal"       "SalePrice"
#Distribution of numerical variables
house_df %>% select(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond , 
 MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,  HalfBath, 
BedroomAbvGr,  KitchenAbvGr,  TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea,  MiscVal,SalePrice) %>% gather() %>%  ggplot(aes(value)) + facet_wrap(~key,scales ="free") + geom_density()

#####Data cleaning

#data cleaning

Num_NA<-sapply(house_df,function(y)length(which(is.na(y)==T)))
NA_Count<- data.frame(Item=colnames(house_df),Count=Num_NA)

#NA count list
NA_Count
##                        Item Count
## Id                       Id     0
## MSSubClass       MSSubClass     0
## MSZoning           MSZoning     0
## LotFrontage     LotFrontage   259
## LotArea             LotArea     0
## Street               Street     0
## Alley                 Alley  1369
## LotShape           LotShape     0
## LandContour     LandContour     0
## Utilities         Utilities     0
## LotConfig         LotConfig     0
## LandSlope         LandSlope     0
## Neighborhood   Neighborhood     0
## Condition1       Condition1     0
## Condition2       Condition2     0
## BldgType           BldgType     0
## HouseStyle       HouseStyle     0
## OverallQual     OverallQual     0
## OverallCond     OverallCond     0
## YearBuilt         YearBuilt     0
## YearRemodAdd   YearRemodAdd     0
## RoofStyle         RoofStyle     0
## RoofMatl           RoofMatl     0
## Exterior1st     Exterior1st     0
## Exterior2nd     Exterior2nd     0
## MasVnrType       MasVnrType     8
## MasVnrArea       MasVnrArea     8
## ExterQual         ExterQual     0
## ExterCond         ExterCond     0
## Foundation       Foundation     0
## BsmtQual           BsmtQual    37
## BsmtCond           BsmtCond    37
## BsmtExposure   BsmtExposure    38
## BsmtFinType1   BsmtFinType1    37
## BsmtFinSF1       BsmtFinSF1     0
## BsmtFinType2   BsmtFinType2    38
## BsmtFinSF2       BsmtFinSF2     0
## BsmtUnfSF         BsmtUnfSF     0
## TotalBsmtSF     TotalBsmtSF     0
## Heating             Heating     0
## HeatingQC         HeatingQC     0
## CentralAir       CentralAir     0
## Electrical       Electrical     1
## X1stFlrSF         X1stFlrSF     0
## X2ndFlrSF         X2ndFlrSF     0
## LowQualFinSF   LowQualFinSF     0
## GrLivArea         GrLivArea     0
## BsmtFullBath   BsmtFullBath     0
## BsmtHalfBath   BsmtHalfBath     0
## FullBath           FullBath     0
## HalfBath           HalfBath     0
## BedroomAbvGr   BedroomAbvGr     0
## KitchenAbvGr   KitchenAbvGr     0
## KitchenQual     KitchenQual     0
## TotRmsAbvGrd   TotRmsAbvGrd     0
## Functional       Functional     0
## Fireplaces       Fireplaces     0
## FireplaceQu     FireplaceQu   690
## GarageType       GarageType    81
## GarageYrBlt     GarageYrBlt    81
## GarageFinish   GarageFinish    81
## GarageCars       GarageCars     0
## GarageArea       GarageArea     0
## GarageQual       GarageQual    81
## GarageCond       GarageCond    81
## PavedDrive       PavedDrive     0
## WoodDeckSF       WoodDeckSF     0
## OpenPorchSF     OpenPorchSF     0
## EnclosedPorch EnclosedPorch     0
## X3SsnPorch       X3SsnPorch     0
## ScreenPorch     ScreenPorch     0
## PoolArea           PoolArea     0
## PoolQC               PoolQC  1453
## Fence                 Fence  1179
## MiscFeature     MiscFeature  1406
## MiscVal             MiscVal     0
## MoSold               MoSold     0
## YrSold               YrSold     0
## SaleType           SaleType     0
## SaleCondition SaleCondition     0
## SalePrice         SalePrice     0
house_clean <- house_df

mask <- is.na(house_clean$MasVnrType)
house_clean$MasVnrType[mask] <- 'None'

mask <- is.na(house_clean$MasVnrType)
house_clean$MasVnrType[mask] <- 'None'

mask <- is.na(house_clean$BsmtCond)
house_clean$BsmtCond[mask] <- 'NoBasement'


mask <- is.na(house_clean$BsmtFinType1)
house_clean$BsmtFinType1[mask] <- 'None'
mask <- is.na(house_clean$BsmtFinType2)
house_clean$BsmtFinType2[mask] <- 'None'
mask <- is.na(house_clean$FireplaceQu)
house_clean$FireplaceQu[mask] <- 'NoFireplace'
mask <- is.na(house_clean$GarageFinish)
house_clean$GarageFinish[mask] <- 'NoGarage'

mask <- is.na(house_clean$GarageQual)
house_clean$GarageQual[mask] <- 'NoGarage'
mask <- is.na(house_clean$MiscFeature)
house_clean$MiscFeature[mask] <- 'NoMisc'


mask <- is.na(house_clean$BsmtExposure)
house_clean$BsmtExposure[mask] <- 'NoBasement'
mask <- is.na(house_clean$PoolQC)
house_clean$PoolQC[mask] <- 'NoPool'
mask <- is.na(house_clean$Fence)
house_clean$Fence[mask] <- 'NoFence'
mask <- is.na(house_clean$GarageType)
house_clean$GarageType[mask] <- 'NoGarage'
mask <- is.na(house_clean$GarageCond)
house_clean$GarageCond[mask] <- 'NoGarage'
mask <- is.na(house_clean$GarageArea)
house_clean$GarageArea[mask] <- 0
mask <- is.na(house_clean$GarageCars)
house_clean$GarageCars[mask] <- 0
mask <- is.na(house_clean$LotFrontage)
house_clean$LotFrontage[mask] <- 0
Data Splitting and getting required columns
#Segregate training and testing dataset

#Split into training and test dataset

set.seed(7340)

row <- sample(nrow(house_clean))

house_split <- house_clean[row,]

split <- round(nrow(house_split)*.80)
training <- data.frame(house_split[1:split,])
test <- data.frame(house_split[(split+1):nrow(house_split),])

test_x<- select(test,c(MSSubClass,MSZoning,  LotFrontage,LotArea,
Street,LotShape,  LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType,  HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd,  RoofStyle, RoofMatl,  Exterior1st,Exterior2nd  ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation,  BsmtCond,  BsmtExposure , BsmtFinType1,  BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF,  GrLivArea, BsmtFullBath,  BsmtHalfBath , FullBath,
HalfBath,  BedroomAbvGr , KitchenAbvGr,  KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea,  PoolQC,Fence, MiscFeature  ,
MiscVal,MoSold,YrSold,SaleType,  SaleCondition)) %>% data.frame()

test_y <- select(test,SalePrice) %>% data.frame()

test_or1 <- read.csv('data/test.csv',header = T,stringsAsFactors = F) 

test_original <- test_or1 %>%  select(c(MSSubClass,MSZoning,  LotFrontage,LotArea,
Street,LotShape,  LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType,  HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd,  RoofStyle, RoofMatl,  Exterior1st,Exterior2nd  ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation,  BsmtCond,  BsmtExposure , BsmtFinType1,  BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF,  GrLivArea, BsmtFullBath,  BsmtHalfBath , FullBath,
HalfBath,  BedroomAbvGr , KitchenAbvGr,  KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea,  PoolQC,Fence, MiscFeature  ,
MiscVal,MoSold,YrSold,SaleType,  SaleCondition)) %>% data.frame()

#Clean Test dataset

mask <- is.na(test_original$MasVnrType)
test_original$MasVnrType[mask] <- 'None'
mask <- is.na(test_original$MasVnrType)
test_original$MasVnrType[mask] <- 'None'
mask <- is.na(test_original$BsmtCond)
test_original$BsmtCond[mask] <- 'NoBasement'

mask <- is.na(test_original$BsmtFinType1)
test_original$BsmtFinType1[mask] <- 'None'
mask <- is.na(test_original$BsmtFinType2)
test_original$BsmtFinType2[mask] <- 'None'
mask <- is.na(test_original$FireplaceQu)
test_original$FireplaceQu[mask] <- 'NoFireplace'
mask <- is.na(test_original$GarageFinish)
test_original$GarageFinish[mask] <- 'NoGarage'

mask <- is.na(test_original$GarageQual)
test_original$GarageQual[mask] <- 'NoGarage'
mask <- is.na(test_original$MiscFeature)
test_original$MiscFeature[mask] <- 'NoMisc'


mask <- is.na(test_original$BsmtExposure)
test_original$BsmtExposure[mask] <- 'NoBasement'
mask <- is.na(test_original$PoolQC)
test_original$PoolQC[mask] <- 'NoPool'
mask <- is.na(test_original$Fence)
test_original$Fence[mask] <- 'NoFence'
mask <- is.na(test_original$GarageType)
test_original$GarageType[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageCond)
test_original$GarageCond[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageArea)
test_original$GarageArea[mask] <- 0
mask <- is.na(test_original$GarageCars)
test_original$GarageCars[mask] <- 0
mask <- is.na(test_original$LotFrontage)
test_original$LotFrontage[mask] <- 0

test_original$GarageCond[mask] <- 'NoGarage'
mask <- is.na(test_original$GarageArea)
test_original$GarageArea[mask] <- 0
mask <- is.na(test_original$GarageCars)
test_original$GarageCars[mask] <- 0
mask <- is.na(test_original$LotFrontage)
test_original$LotFrontage[mask] <- 0

#Having 0 for all null columns
test_original <- mutate_each(test_original,funs(replace(.,is.na(.),0)))



training$YearBuilt <- as.integer(training$YearBuilt )

#Split with between valid variables

#All variables
house_all <- select(training,c(MSSubClass,MSZoning,  LotFrontage,LotArea,
Street,LotShape,  LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType,  HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd,  RoofStyle, RoofMatl,  Exterior1st,Exterior2nd  ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation,  BsmtCond,  BsmtExposure , BsmtFinType1,  BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF,  GrLivArea, BsmtFullBath,  BsmtHalfBath , FullBath,
HalfBath,  BedroomAbvGr , KitchenAbvGr,  KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea,  PoolQC,Fence, MiscFeature  ,
MiscVal,MoSold,YrSold,SaleType,  SaleCondition, SalePrice)) %>% data.frame()


house_all_x <- select(training,c(MSSubClass,MSZoning,  LotFrontage,LotArea,
Street,LotShape,  LandContour,Utilities,
LotConfig, LandSlope, Neighborhood , Condition1,Condition2,
BldgType,  HouseStyle,OverallQual,OverallCond,YearBuilt, YearRemodAdd,  RoofStyle, RoofMatl,  Exterior1st,Exterior2nd  ,MasVnrType,MasVnrArea,ExterQual, ExterCond, Foundation,  BsmtCond,  BsmtExposure , BsmtFinType1,  BsmtFinSF1,
BsmtFinType2 , BsmtFinSF2,BsmtUnfSF, TotalBsmtSF,Heating,
HeatingQC, CentralAir,X1stFlrSF, X2ndFlrSF,
LowQualFinSF,  GrLivArea, BsmtFullBath,  BsmtHalfBath , FullBath,
HalfBath,  BedroomAbvGr , KitchenAbvGr,  KitchenQual,TotRmsAbvGrd ,
Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt ,
GarageFinish , GarageCars,GarageArea,GarageQual,GarageCond,
PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch, X3SsnPorch, ScreenPorch,PoolArea,  PoolQC,Fence, MiscFeature  ,
MiscVal,MoSold,YrSold,SaleType,  SaleCondition)) %>% data.frame()

#Electrical, OverallQual, Alley

house_num <- select(training,c(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond ,  MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,  HalfBath, 
BedroomAbvGr,  KitchenAbvGr,  TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea,  MiscVal,SalePrice))

#Numeric variables
house_num_x <- select(training,c(MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond ,  MasVnrArea,BsmtFinSF1,BsmtFinSF2,
BsmtUnfSF, TotalBsmtSF,X1stFlrSF, X2ndFlrSF, LowQualFinSF ,
GrLivArea, BsmtFullBath, BsmtHalfBath , FullBath,  HalfBath, 
BedroomAbvGr,  KitchenAbvGr,  TotRmsAbvGrd , Fireplaces,GarageYrBlt ,
GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,
X3SsnPorch,ScreenPorch,PoolArea,  MiscVal))


#Dependent variable
house_y <- select(training,SalePrice) %>% data.frame()
Model 1 - Simple GLM with all Numeric variables
model_glm <- glm(SalePrice~ ., data = house_num)

predicted <- predict(model_glm,newdata=test)

predicted_up <- ifelse(is.na(predicted)==TRUE,0,predicted)

RMSE(test_y$SalePrice, predicted_up) # 42286.25
## [1] 52438.93
Model 2 - Process the original dataset with cleanup using KnnImpute
#Model 2 - Process the original dataset with cleanup using 

model_radomforest <- train(x=house_all_x, y=house_y$SalePrice, method = "ranger",preProcess = c("knnImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))
## Loading required package: e1071
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
## Loading required package: ranger
## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=39 
## - Fold1: mtry=39 
## + Fold1: mtry=76 
## - Fold1: mtry=76 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=39 
## - Fold2: mtry=39 
## + Fold2: mtry=76 
## - Fold2: mtry=76 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=39 
## - Fold3: mtry=39 
## + Fold3: mtry=76 
## - Fold3: mtry=76 
## + Fold4: mtry= 2 
## - Fold4: mtry= 2 
## + Fold4: mtry=39 
## - Fold4: mtry=39 
## + Fold4: mtry=76 
## - Fold4: mtry=76 
## + Fold5: mtry= 2 
## - Fold5: mtry= 2 
## + Fold5: mtry=39 
## - Fold5: mtry=39 
## + Fold5: mtry=76 
## - Fold5: mtry=76 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 39 on full training set
#Test dataset

predicted <- predict(model_radomforest,newdata=test)


RMSE(test_y$SalePrice, predicted) # 24427.89
## [1] 27867.67
#It is low around 38 variables
plot(model_radomforest)

Model 3 -Removing low-variance variables
#Model 3 -Removing low-variance variables 

model_zv <- train(x=house_all_x, y=house_y$SalePrice, method = "ranger",preProcess = c("zv","medianImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))


predicted <- predict(model_zv,newdata=test)
## Warning in Ops.factor(left, right): '-' not meaningful for factors
RMSE(test_y$SalePrice, predicted) # 24480.04
Model 4 -Creating using glmnet with numeric values
#Model 4 -Creating using glmnet with numeric values

model_glmnet <- train(x=house_num_x, y=house_y$SalePrice, method = "glmnet",preProcess = c("medianImpute","center","scale"), trControl = trainControl(method ="cv",number =5,verboseIter=TRUE))
## Loading required package: glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## Loaded glmnet 2.0-5
## 
## Attaching package: 'glmnet'
## The following object is masked from 'package:Metrics':
## 
##     auc
## + Fold1: alpha=0.10, lambda=12742 
## - Fold1: alpha=0.10, lambda=12742 
## + Fold1: alpha=0.55, lambda=12742 
## - Fold1: alpha=0.55, lambda=12742 
## + Fold1: alpha=1.00, lambda=12742 
## - Fold1: alpha=1.00, lambda=12742 
## + Fold2: alpha=0.10, lambda=12742 
## - Fold2: alpha=0.10, lambda=12742 
## + Fold2: alpha=0.55, lambda=12742 
## - Fold2: alpha=0.55, lambda=12742 
## + Fold2: alpha=1.00, lambda=12742 
## - Fold2: alpha=1.00, lambda=12742 
## + Fold3: alpha=0.10, lambda=12742 
## - Fold3: alpha=0.10, lambda=12742 
## + Fold3: alpha=0.55, lambda=12742 
## - Fold3: alpha=0.55, lambda=12742 
## + Fold3: alpha=1.00, lambda=12742 
## - Fold3: alpha=1.00, lambda=12742 
## + Fold4: alpha=0.10, lambda=12742 
## - Fold4: alpha=0.10, lambda=12742 
## + Fold4: alpha=0.55, lambda=12742 
## - Fold4: alpha=0.55, lambda=12742 
## + Fold4: alpha=1.00, lambda=12742 
## - Fold4: alpha=1.00, lambda=12742 
## + Fold5: alpha=0.10, lambda=12742 
## - Fold5: alpha=0.10, lambda=12742 
## + Fold5: alpha=0.55, lambda=12742 
## - Fold5: alpha=0.55, lambda=12742 
## + Fold5: alpha=1.00, lambda=12742 
## - Fold5: alpha=1.00, lambda=12742 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.1, lambda = 1274 on full training set
predicted <- predict(model_glmnet,newdata=test)

RMSE(test_y$SalePrice, predicted) 
## [1] 47764.56
Selecting the best model and apply to test data

After analyzing various models, it seems below is the best model.

#Predict with original test data
predicted <- predict(model_radomforest,newdata=test_original)


final_output <- data.frame(test_or1,predicted) 

final_output %>%  write_csv(path = 'submission.csv')

final_output %>% select(c(Id, predicted)) %>% head()
##     Id predicted
## 1 1461  130999.0
## 2 1462  159878.4
## 3 1463  180786.3
## 4 1464  181695.3
## 5 1465  200290.4
## 6 1466  180546.9
rm(list=ls())