Introduction

Assignment 2, due September 4 (A) / 5 (B), 2019 at 5:30pm.

Unless you have already done so, go to Kaggle’s website and register an account: https://www.kaggle.com/account/register. Read up on what Kaggle is and reflect on how you may use it in your future jobs. Go to the following competition on house price prediction: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

INDIVIDUAL

1. [10 pts]: Read the instructions on Kaggle. Learn (by yourself) how to join a Kaggle competition and submit your results.
2. [10 pts]: You will find that some predictors contain “missing data”, NA. Figure out (by yourself) how to handle missing data in regression.
3. [50 pts]: Build a regression model for house price prediction and write a report explaining how you approached the task, the steps you took, how you revised your model (must explore both LASSO and ridge regression) as your analyses progressed, etc. Comment on the quality of your predictions. Include your model as an Appendix in your report. Submit the PDF of your report, and your model file(s) (R code, Excel spreadsheet, etc.)
4. [10 pts]: On the front page of your report, include your position on the leaderboard at the time of your last submission. Please also include the screenshot showing your position on the leaderboard in the Appendix.

Data exploration and visualization

Load data

require(readr)
require(tidyverse)
require(DataExplorer)
require(dlookr)

train <- read_csv("/Users/milin/Desktop/train.csv")
test <- read_csv("/Users/milin/Desktop/test.csv")


diagnose(train) %>% arrange(desc(missing_count))
## # A tibble: 81 x 6
##    variables  types  missing_count missing_percent unique_count unique_rate
##    <chr>      <chr>          <int>           <dbl>        <int>       <dbl>
##  1 PoolQC     chara…          1453           99.5             4     0.00274
##  2 MiscFeatu… chara…          1406           96.3             5     0.00342
##  3 Alley      chara…          1369           93.8             3     0.00205
##  4 Fence      chara…          1179           80.8             5     0.00342
##  5 Fireplace… chara…           690           47.3             6     0.00411
##  6 LotFronta… numer…           259           17.7           111     0.0760 
##  7 GarageType chara…            81            5.55            7     0.00479
##  8 GarageYrB… numer…            81            5.55           98     0.0671 
##  9 GarageFin… chara…            81            5.55            4     0.00274
## 10 GarageQual chara…            81            5.55            6     0.00411
## # … with 71 more rows
diagnose(test) %>% arrange(desc(missing_count))
## # A tibble: 80 x 6
##    variables  types  missing_count missing_percent unique_count unique_rate
##    <chr>      <chr>          <int>           <dbl>        <int>       <dbl>
##  1 PoolQC     chara…          1456           99.8             3     0.00206
##  2 MiscFeatu… chara…          1408           96.5             4     0.00274
##  3 Alley      chara…          1352           92.7             3     0.00206
##  4 Fence      chara…          1169           80.1             5     0.00343
##  5 Fireplace… chara…           730           50.0             6     0.00411
##  6 LotFronta… numer…           227           15.6           116     0.0795 
##  7 GarageYrB… numer…            78            5.35           98     0.0672 
##  8 GarageFin… chara…            78            5.35            4     0.00274
##  9 GarageQual chara…            78            5.35            5     0.00343
## 10 GarageCond chara…            78            5.35            6     0.00411
## # … with 70 more rows

The results above show that PoolQC, MiscFeature, Alley and Fence are each more than 80% missing in both the training and test sets, so these four variables are dropped from both.

train <- train %>% dplyr::select(-PoolQC,-MiscFeature,-Alley,-Fence)
test <- test %>% dplyr::select(-PoolQC,-MiscFeature,-Alley,-Fence)
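The same selection can be made programmatically rather than by naming the columns. The following is a minimal base R sketch on a made-up data frame; the 80% cut-off mirrors the rule used above, and the names `df`, `threshold` and `df_kept` are illustrative only.

```r
# Toy data frame standing in for the training data (values are made up).
df <- data.frame(
  a = c(1:9, NA),
  b = c(rep(NA, 9), 1),
  c = rep(c("x", "y"), 5),
  stringsAsFactors = FALSE
)

# Share of missing values per column.
missing_share <- colMeans(is.na(df))

# Assumed cut-off: drop columns with more than 80% of their values missing.
threshold <- 0.8
drop_cols <- names(missing_share)[missing_share > threshold]
df_kept <- df[, setdiff(names(df), drop_cols), drop = FALSE]

print(missing_share)
print(drop_cols)
```

Applying one list of dropped columns to both the training and test sets keeps the two data frames aligned.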

Handle missing values

Use KNN to fill in the remaining missing values. For each row with a missing value, KNN imputation finds the K rows most similar to it and fills the gap with an aggregate of those neighbours' values, such as their mean or a weighted mean (or the most frequent category for a categorical variable).

require(VIM)
train1 <- kNN(train)
train1 <- train1[,c(1:77)]
test1 <- kNN(test)
test1 <- test1[,c(1:76)]

The data no longer contains any missing values.
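To make the idea behind KNN imputation concrete, here is a small base R sketch for purely numeric data. It is not the algorithm inside VIM::kNN(), which also handles categorical columns and uses its own distance and aggregation defaults; the function `knn_impute` and the toy matrix are hypothetical.

```r
# Impute missing values in a numeric matrix with the mean of the k nearest rows.
# Distance is Euclidean over the columns that are observed in the incomplete row.
knn_impute <- function(x, k = 3) {
  filled <- x
  complete_rows <- which(rowSums(is.na(x)) == 0)   # donors: rows with no NA
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss <- is.na(x[i, ])
    target <- matrix(x[i, !miss], nrow = length(complete_rows),
                     ncol = sum(!miss), byrow = TRUE)
    d <- sqrt(rowSums((x[complete_rows, !miss, drop = FALSE] - target)^2))
    nearest <- complete_rows[order(d)][seq_len(min(k, length(complete_rows)))]
    filled[i, miss] <- colMeans(x[nearest, miss, drop = FALSE])
  }
  filled
}

# Made-up data: the last row is missing its second value.
x <- rbind(c(1, 10), c(2, 20), c(3, 30), c(100, 1000), c(2, NA))
imputed <- knn_impute(x, k = 3)
print(imputed)
```

The three nearest rows to the incomplete one are the first three, so the gap is filled with the mean of 10, 20 and 30; the distant fourth row has no influence.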

One-hot encoding

First, identify the categorical (character) variables in the data set and separate them from the numeric ones.

a <- sapply(train1,class)
a
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##     "numeric"     "numeric"   "character"     "numeric"     "numeric" 
##        Street      LotShape   LandContour     Utilities     LotConfig 
##   "character"   "character"   "character"   "character"   "character" 
##     LandSlope  Neighborhood    Condition1    Condition2      BldgType 
##   "character"   "character"   "character"   "character"   "character" 
##    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
##   "character"     "numeric"     "numeric"     "numeric"     "numeric" 
##     RoofStyle      RoofMatl   Exterior1st   Exterior2nd    MasVnrType 
##   "character"   "character"   "character"   "character"   "character" 
##    MasVnrArea     ExterQual     ExterCond    Foundation      BsmtQual 
##     "numeric"   "character"   "character"   "character"   "character" 
##      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2 
##   "character"   "character"   "character"     "numeric"   "character" 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC 
##     "numeric"     "numeric"     "numeric"   "character"   "character" 
##    CentralAir    Electrical      1stFlrSF      2ndFlrSF  LowQualFinSF 
##   "character"   "character"     "numeric"     "numeric"     "numeric" 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##     "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
##  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
##     "numeric"     "numeric"   "character"     "numeric"   "character" 
##    Fireplaces   FireplaceQu    GarageType   GarageYrBlt  GarageFinish 
##     "numeric"   "character"   "character"     "numeric"   "character" 
##    GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive 
##     "numeric"     "numeric"   "character"   "character"   "character" 
##    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch   ScreenPorch 
##     "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
##      PoolArea       MiscVal        MoSold        YrSold      SaleType 
##     "numeric"     "numeric"     "numeric"     "numeric"   "character" 
## SaleCondition     SalePrice 
##   "character"     "numeric"
train_num <- train1[,a == "numeric"]
train_cat <- train1[,a == "character"]

test_num <- test1[,a[-77] == "numeric"]
test_cat <- test1[,a[-77] == "character"]

Apply one-hot encoding to the categorical variables:

library(caret)
test_cat <- test_cat %>% dplyr::select(-Utilities)
train_cat <- train_cat %>% dplyr::select(- Utilities)

alldata <- rbind(train_cat,test_cat)
for (i in 1:dim(alldata)[2]) {
  alldata[,i] <- as.factor(alldata[,i])
}

train_cat <- alldata[1:dim(train_cat)[1],]
test_cat <- alldata[-c(1:dim(train_cat)[1]),]
  
dum <- dummyVars(~.,data = alldata)

train_cat_dum <- predict(dum,train_cat)

test_cat_dum <- predict(dum,test_cat)

train_num <- train_num %>% scale() %>% data.frame()
test_num <- test_num %>% scale() %>% data.frame()

traindata <- cbind(train_num,train_cat_dum)
testdata <- cbind(test_num,test_cat_dum)
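Note that the code above scales test_num with its own means and standard deviations and also standardizes SalePrice, so predictions come out in standardized units. A common alternative, sketched below on made-up numbers, is to reuse the training means and standard deviations for the test set and to map predictions back to the original units. All object names here are illustrative and not part of the analysis above.

```r
# Made-up predictors and target (target is in original units).
train_x <- data.frame(area = c(50, 80, 120, 200), rooms = c(2, 3, 4, 6))
test_x  <- data.frame(area = c(60, 150), rooms = c(2, 5))
price   <- c(100, 160, 240, 400)

# Centre and scale the training predictors, and remember the parameters.
train_scaled <- scale(train_x)
mu  <- attr(train_scaled, "scaled:center")
sdv <- attr(train_scaled, "scaled:scale")

# Apply the *training* parameters to the test predictors.
test_scaled <- scale(test_x, center = mu, scale = sdv)

# Standardize the target, fit, predict, then map back to original units.
y_mu <- mean(price)
y_sd <- sd(price)
fit <- lm(y ~ area, data = data.frame(y = (price - y_mu) / y_sd,
                                      area = train_scaled[, "area"]))
pred_std <- predict(fit, newdata = data.frame(area = test_scaled[, "area"]))
pred <- pred_std * y_sd + y_mu
print(pred)
```

Scaling both sets with the same parameters ensures that a given raw value maps to the same standardized value in training and prediction.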

Build linear regression model

liner_model <- lm(SalePrice~.,data = traindata[,-1])
summary(liner_model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = traindata[, -1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27393 -0.11872  0.00439  0.12326  2.27393 
## 
## Coefficients: (41 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.7829291  0.6098678   1.284 0.199465    
## MSSubClass            -0.0100617  0.0433555  -0.232 0.816518    
## LotFrontage            0.0084555  0.0128841   0.656 0.511772    
## LotArea                0.0953944  0.0137751   6.925 7.01e-12 ***
## OverallQual            0.1121311  0.0176608   6.349 3.04e-10 ***
## OverallCond            0.0789049  0.0122690   6.431 1.81e-10 ***
## YearBuilt              0.1264605  0.0295237   4.283 1.98e-05 ***
## YearRemodAdd           0.0281537  0.0144102   1.954 0.050959 .  
## MasVnrArea             0.0442863  0.0130872   3.384 0.000737 ***
## BsmtFinSF1             0.1935575  0.0262856   7.364 3.27e-13 ***
## BsmtFinSF2             0.0490585  0.0173106   2.834 0.004672 ** 
## BsmtUnfSF              0.0814681  0.0222564   3.660 0.000262 ***
## TotalBsmtSF                   NA         NA      NA       NA    
## X1stFlrSF              0.2516488  0.0254857   9.874  < 2e-16 ***
## X2ndFlrSF              0.3701033  0.0305567  12.112  < 2e-16 ***
## LowQualFinSF           0.0086469  0.0112590   0.768 0.442637    
## GrLivArea                     NA         NA      NA       NA    
## BsmtFullBath           0.0034939  0.0129224   0.270 0.786918    
## BsmtHalfBath          -0.0032221  0.0090747  -0.355 0.722607    
## FullBath               0.0272426  0.0152974   1.781 0.075182 .  
## HalfBath               0.0067917  0.0132392   0.513 0.608042    
## BedroomAbvGr          -0.0350209  0.0140885  -2.486 0.013059 *  
## KitchenAbvGr          -0.0334483  0.0157877  -2.119 0.034323 *  
## TotRmsAbvGrd           0.0195016  0.0193582   1.007 0.313938    
## Fireplaces             0.0194008  0.0109063   1.779 0.075509 .  
## GarageYrBlt           -0.0100750  0.0191513  -0.526 0.598932    
## GarageCars             0.0285761  0.0204818   1.395 0.163210    
## GarageArea             0.0439676  0.0209526   2.098 0.036071 *  
## WoodDeckSF             0.0220741  0.0092295   2.392 0.016921 *  
## OpenPorchSF            0.0029809  0.0095684   0.312 0.755450    
## EnclosedPorch          0.0039454  0.0095824   0.412 0.680608    
## X3SsnPorch             0.0123958  0.0083004   1.493 0.135592    
## ScreenPorch            0.0196735  0.0086060   2.286 0.022423 *  
## PoolArea               0.0422263  0.0093080   4.537 6.28e-06 ***
## MiscVal                0.0004489  0.0088876   0.051 0.959724    
## MoSold                -0.0144309  0.0083248  -1.733 0.083261 .  
## YrSold                -0.0073439  0.0086424  -0.850 0.395629    
## `MSZoning.C (all)`    -0.2947011  0.1202538  -2.451 0.014398 *  
## MSZoning.FV            0.1250766  0.0936348   1.336 0.181865    
## MSZoning.RH            0.0307292  0.0960397   0.320 0.749050    
## MSZoning.RL            0.0467834  0.0473143   0.989 0.322967    
## MSZoning.RM                   NA         NA      NA       NA    
## Street.Grvl           -0.3865099  0.1522345  -2.539 0.011242 *  
## Street.Pave                   NA         NA      NA       NA    
## LotShape.IR1          -0.0253964  0.0202674  -1.253 0.210421    
## LotShape.IR2           0.0221497  0.0548512   0.404 0.686420    
## LotShape.IR3           0.0341039  0.1128361   0.302 0.762518    
## LotShape.Reg                  NA         NA      NA       NA    
## LandContour.Bnk       -0.0751658  0.0465630  -1.614 0.106723    
## LandContour.HLS        0.0303031  0.0513660   0.590 0.555335    
## LandContour.Low       -0.2000293  0.0718021  -2.786 0.005421 ** 
## LandContour.Lvl               NA         NA      NA       NA    
## LotConfig.Corner       0.0178353  0.0226428   0.788 0.431034    
## LotConfig.CulDSac      0.1223906  0.0378066   3.237 0.001239 ** 
## LotConfig.FR2         -0.0837507  0.0476849  -1.756 0.079281 .  
## LotConfig.FR3         -0.1789558  0.1581663  -1.131 0.258090    
## LotConfig.Inside              NA         NA      NA       NA    
## LandSlope.Gtl          0.5570847  0.1441020   3.866 0.000116 ***
## LandSlope.Mod          0.6366039  0.1439807   4.421 1.07e-05 ***
## LandSlope.Sev                 NA         NA      NA       NA    
## Neighborhood.Blmngtn  -0.0163826  0.1326784  -0.123 0.901751    
## Neighborhood.Blueste  -0.0231539  0.2419336  -0.096 0.923772    
## Neighborhood.BrDale   -0.0218450  0.1478908  -0.148 0.882595    
## Neighborhood.BrkSide  -0.0682929  0.1195018  -0.571 0.567779    
## Neighborhood.ClearCr  -0.1852409  0.1210091  -1.531 0.126076    
## Neighborhood.CollgCr  -0.1365754  0.1045494  -1.306 0.191687    
## Neighborhood.Crawfor   0.1362654  0.1127566   1.208 0.227091    
## Neighborhood.Edwards  -0.2587996  0.1081781  -2.392 0.016891 *  
## Neighborhood.Gilbert  -0.1572176  0.1093944  -1.437 0.150926    
## Neighborhood.IDOTRR   -0.1452660  0.1342510  -1.082 0.279444    
## Neighborhood.MeadowV  -0.0795678  0.1481597  -0.537 0.591337    
## Neighborhood.Mitchel  -0.2792718  0.1104663  -2.528 0.011592 *  
## Neighborhood.NAmes    -0.2086532  0.1034452  -2.017 0.043909 *  
## Neighborhood.NoRidge   0.3169246  0.1142131   2.775 0.005606 ** 
## Neighborhood.NPkVill   0.1662988  0.1835156   0.906 0.365017    
## Neighborhood.NridgHt   0.2157803  0.1110475   1.943 0.052228 .  
## Neighborhood.NWAmes   -0.2365850  0.1063531  -2.225 0.026295 *  
## Neighborhood.OldTown  -0.1831072  0.1201491  -1.524 0.127766    
## Neighborhood.Sawyer   -0.1491827  0.1072843  -1.391 0.164618    
## Neighborhood.SawyerW  -0.0719011  0.1075084  -0.669 0.503752    
## Neighborhood.Somerst  -0.0304059  0.1243646  -0.244 0.806892    
## Neighborhood.StoneBr   0.4691535  0.1207296   3.886 0.000107 ***
## Neighborhood.SWISU    -0.1194861  0.1284579  -0.930 0.352472    
## Neighborhood.Timber   -0.1470783  0.1140498  -1.290 0.197434    
## Neighborhood.Veenker          NA         NA      NA       NA    
## Condition1.Artery     -0.0727275  0.1619269  -0.449 0.653411    
## Condition1.Feedr       0.0071980  0.1574186   0.046 0.963537    
## Condition1.Norm        0.1149882  0.1534334   0.749 0.453739    
## Condition1.PosA       -0.0022810  0.1916658  -0.012 0.990506    
## Condition1.PosN        0.0749043  0.1707127   0.439 0.660902    
## Condition1.RRAe       -0.2828955  0.1854076  -1.526 0.127317    
## Condition1.RRAn        0.0272921  0.1608552   0.170 0.865299    
## Condition1.RRNe       -0.0984167  0.2622391  -0.375 0.707507    
## Condition1.RRNn               NA         NA      NA       NA    
## Condition2.Artery     -0.0135599  0.3422183  -0.040 0.968400    
## Condition2.Feedr      -0.0832005  0.2677304  -0.311 0.756034    
## Condition2.Norm       -0.1048598  0.2244725  -0.467 0.640484    
## Condition2.PosA        0.4799276  0.4699100   1.021 0.307305    
## Condition2.PosN       -3.0194807  0.3276853  -9.215  < 2e-16 ***
## Condition2.RRAe       -1.6132953  0.5621913  -2.870 0.004180 ** 
## Condition2.RRAn       -0.2861964  0.3737727  -0.766 0.444004    
## Condition2.RRNn               NA         NA      NA       NA    
## BldgType.1Fam          0.2373701  0.1116121   2.127 0.033641 *  
## BldgType.2fmCon        0.1476361  0.0964433   1.531 0.126075    
## BldgType.Duplex        0.1241621  0.1104364   1.124 0.261112    
## BldgType.Twnhs        -0.0397140  0.0635087  -0.625 0.531870    
## BldgType.TwnhsE               NA         NA      NA       NA    
## HouseStyle.1.5Fin     -0.0203677  0.0690041  -0.295 0.767917    
## HouseStyle.1.5Unf      0.1446251  0.1111727   1.301 0.193535    
## HouseStyle.1Story      0.0902978  0.0741225   1.218 0.223373    
## HouseStyle.2.5Fin     -0.3210785  0.1701864  -1.887 0.059446 .  
## HouseStyle.2.5Unf     -0.1475800  0.1266428  -1.165 0.244114    
## HouseStyle.2Story     -0.0957260  0.0656995  -1.457 0.145364    
## HouseStyle.SFoyer      0.0023435  0.0689864   0.034 0.972906    
## HouseStyle.SLvl               NA         NA      NA       NA    
## RoofStyle.Flat        -1.2316953  0.4383731  -2.810 0.005038 ** 
## RoofStyle.Gable       -1.1355395  0.3801298  -2.987 0.002871 ** 
## RoofStyle.Gambrel     -1.1126240  0.3922273  -2.837 0.004633 ** 
## RoofStyle.Hip         -1.1357168  0.3804251  -2.985 0.002888 ** 
## RoofStyle.Mansard     -0.9788928  0.3755304  -2.607 0.009253 ** 
## RoofStyle.Shed                NA         NA      NA       NA    
## RoofMatl.ClyTile      -9.0448265  0.4339942 -20.841  < 2e-16 ***
## RoofMatl.CompShg      -0.6798950  0.1441380  -4.717 2.67e-06 ***
## RoofMatl.Membran       0.5132349  0.4395102   1.168 0.243137    
## RoofMatl.Metal         0.1821753  0.4280163   0.426 0.670454    
## RoofMatl.Roll         -0.8638422  0.3529558  -2.447 0.014526 *  
## `RoofMatl.Tar&Grv`    -0.6152070  0.2696740  -2.281 0.022701 *  
## RoofMatl.WdShake      -0.7968688  0.2352131  -3.388 0.000727 ***
## RoofMatl.WdShngl              NA         NA      NA       NA    
## Exterior1st.AsbShng    0.1846407  0.1678237   1.100 0.271458    
## Exterior1st.AsphShn   -0.0088736  0.3991928  -0.022 0.982269    
## Exterior1st.BrkComm   -0.0027317  0.3250248  -0.008 0.993295    
## Exterior1st.BrkFace    0.2071514  0.0983777   2.106 0.035435 *  
## Exterior1st.CBlock    -0.1828724  0.3406839  -0.537 0.591517    
## Exterior1st.CemntBd   -0.0294134  0.1984546  -0.148 0.882200    
## Exterior1st.HdBoard   -0.0648821  0.0906831  -0.715 0.474447    
## Exterior1st.ImStucc   -0.4583262  0.3247645  -1.411 0.158422    
## Exterior1st.MetalSd    0.0519716  0.1264717   0.411 0.681192    
## Exterior1st.Plywood   -0.0794513  0.0911610  -0.872 0.383625    
## Exterior1st.Stone     -0.0041374  0.2709762  -0.015 0.987820    
## Exterior1st.Stucco     0.0490505  0.1222378   0.401 0.688290    
## Exterior1st.VinylSd   -0.0640784  0.1102961  -0.581 0.561369    
## `Exterior1st.Wd Sdng` -0.0609934  0.0841158  -0.725 0.468522    
## Exterior1st.WdShing           NA         NA      NA       NA    
## Exterior2nd.AsbShng   -0.1170960  0.1568482  -0.747 0.455474    
## Exterior2nd.AsphShn    0.0693135  0.2473283   0.280 0.779334    
## `Exterior2nd.Brk Cmn`  0.0183198  0.2180709   0.084 0.933064    
## Exterior2nd.BrkFace   -0.0048462  0.1072163  -0.045 0.963955    
## Exterior2nd.CBlock            NA         NA      NA       NA    
## Exterior2nd.CmentBd    0.0849450  0.1931208   0.440 0.660120    
## Exterior2nd.HdBoard    0.0536032  0.0854417   0.627 0.530536    
## Exterior2nd.ImStucc    0.2805027  0.1261046   2.224 0.026305 *  
## Exterior2nd.MetalSd   -0.0051335  0.1214637  -0.042 0.966296    
## Exterior2nd.Other     -0.2667455  0.3196219  -0.835 0.404125    
## Exterior2nd.Plywood    0.0283555  0.0815611   0.348 0.728155    
## Exterior2nd.Stone     -0.1926621  0.1673965  -1.151 0.249984    
## Exterior2nd.Stucco     0.0182728  0.1161089   0.157 0.874974    
## Exterior2nd.VinylSd    0.1059547  0.0983921   1.077 0.281754    
## `Exterior2nd.Wd Sdng`  0.0803743  0.0735140   1.093 0.274468    
## `Exterior2nd.Wd Shng`         NA         NA      NA       NA    
## MasVnrType.BrkCmn     -0.1565906  0.0911352  -1.718 0.086008 .  
## MasVnrType.BrkFace    -0.0710737  0.0364648  -1.949 0.051510 .  
## MasVnrType.None       -0.0294090  0.0395995  -0.743 0.457829    
## MasVnrType.Stone              NA         NA      NA       NA    
## ExterQual.Ex           0.2674941  0.0668751   4.000 6.72e-05 ***
## ExterQual.Fa           0.1832265  0.1216908   1.506 0.132409    
## ExterQual.Gd           0.0093229  0.0307510   0.303 0.761809    
## ExterQual.TA                  NA         NA      NA       NA    
## ExterCond.Ex           0.0647113  0.2168816   0.298 0.765470    
## ExterCond.Fa           0.0162770  0.0726673   0.224 0.822800    
## ExterCond.Gd          -0.0356658  0.0299334  -1.192 0.233686    
## ExterCond.Po           0.1444063  0.3278414   0.440 0.659670    
## ExterCond.TA                  NA         NA      NA       NA    
## Foundation.BrkTil      0.3431521  0.1860593   1.844 0.065378 .  
## Foundation.CBlock      0.3870318  0.1832827   2.112 0.034917 *  
## Foundation.PConc       0.4036620  0.1821741   2.216 0.026888 *  
## Foundation.Slab        0.4175869  0.2040571   2.046 0.040928 *  
## Foundation.Stone       0.4260338  0.2289475   1.861 0.063006 .  
## Foundation.Wood               NA         NA      NA       NA    
## BsmtQual.Ex            0.2023700  0.0521055   3.884 0.000108 ***
## BsmtQual.Fa            0.0277965  0.0614838   0.452 0.651280    
## BsmtQual.Gd           -0.0355402  0.0318394  -1.116 0.264542    
## BsmtQual.TA                   NA         NA      NA       NA    
## BsmtCond.Fa           -0.0486079  0.0536276  -0.906 0.364903    
## BsmtCond.Gd           -0.0417521  0.0408683  -1.022 0.307160    
## BsmtCond.Po            0.8416787  0.3824721   2.201 0.027948 *  
## BsmtCond.TA                   NA         NA      NA       NA    
## BsmtExposure.Av        0.0788182  0.0275564   2.860 0.004305 ** 
## BsmtExposure.Gd        0.2515593  0.0376943   6.674 3.77e-11 ***
## BsmtExposure.Mn        0.0225927  0.0319735   0.707 0.479945    
## BsmtExposure.No               NA         NA      NA       NA    
## BsmtFinType1.ALQ      -0.0433020  0.0361019  -1.199 0.230590    
## BsmtFinType1.BLQ      -0.0155023  0.0385632  -0.402 0.687755    
## BsmtFinType1.GLQ       0.0319311  0.0337512   0.946 0.344297    
## BsmtFinType1.LwQ      -0.0878497  0.0466405  -1.884 0.059862 .  
## BsmtFinType1.Rec      -0.0464868  0.0393374  -1.182 0.237537    
## BsmtFinType1.Unf              NA         NA      NA       NA    
## BsmtFinType2.ALQ       0.1304617  0.0949491   1.374 0.169687    
## BsmtFinType2.BLQ      -0.0416537  0.0642417  -0.648 0.516853    
## BsmtFinType2.GLQ       0.0916211  0.1116101   0.821 0.411861    
## BsmtFinType2.LwQ      -0.0784112  0.0578225  -1.356 0.175328    
## BsmtFinType2.Rec      -0.0203589  0.0603171  -0.338 0.735774    
## BsmtFinType2.Unf              NA         NA      NA       NA    
## Heating.Floor         -0.1121599  0.3615427  -0.310 0.756442    
## Heating.GasA          -0.1474185  0.1808174  -0.815 0.415065    
## Heating.GasW          -0.1944473  0.1950471  -0.997 0.318998    
## Heating.Grav          -0.2184253  0.2205561  -0.990 0.322204    
## Heating.OthW          -0.3918132  0.2931851  -1.336 0.181666    
## Heating.Wall                  NA         NA      NA       NA    
## HeatingQC.Ex           0.0435737  0.0261059   1.669 0.095350 .  
## HeatingQC.Fa           0.0503580  0.0561769   0.896 0.370205    
## HeatingQC.Gd          -0.0004289  0.0272075  -0.016 0.987424    
## HeatingQC.Po           0.0773969  0.3368706   0.230 0.818322    
## HeatingQC.TA                  NA         NA      NA       NA    
## CentralAir.N           0.0024397  0.0489669   0.050 0.960272    
## CentralAir.Y                  NA         NA      NA       NA    
## Electrical.FuseA       0.0179955  0.0371137   0.485 0.627852    
## Electrical.FuseF       0.0189307  0.0697092   0.272 0.786001    
## Electrical.FuseP      -0.0703400  0.2319079  -0.303 0.761705    
## Electrical.Mix        -0.5692347  0.5660117  -1.006 0.314761    
## Electrical.SBrkr              NA         NA      NA       NA    
## KitchenQual.Ex         0.2865060  0.0494137   5.798 8.52e-09 ***
## KitchenQual.Fa         0.0375162  0.0617756   0.607 0.543766    
## KitchenQual.Gd        -0.0301679  0.0268681  -1.123 0.261735    
## KitchenQual.TA                NA         NA      NA       NA    
## Functional.Maj1       -0.2796233  0.0942351  -2.967 0.003063 ** 
## Functional.Maj2       -0.2696161  0.1579777  -1.707 0.088136 .  
## Functional.Min1       -0.1499215  0.0596696  -2.513 0.012114 *  
## Functional.Min2       -0.1221749  0.0587374  -2.080 0.037731 *  
## Functional.Mod        -0.2674614  0.0944682  -2.831 0.004713 ** 
## Functional.Sev        -0.8034093  0.3605439  -2.228 0.026039 *  
## Functional.Typ                NA         NA      NA       NA    
## FireplaceQu.Ex        -0.0412153  0.0667570  -0.617 0.537091    
## FireplaceQu.Fa        -0.0350022  0.0373848  -0.936 0.349319    
## FireplaceQu.Gd        -0.0141711  0.0209403  -0.677 0.498700    
## FireplaceQu.Po         0.0334541  0.0507692   0.659 0.510055    
## FireplaceQu.TA                NA         NA      NA       NA    
## GarageType.2Types     -0.2873757  0.1382961  -2.078 0.037919 *  
## GarageType.Attchd     -0.0498285  0.0281665  -1.769 0.077131 .  
## GarageType.Basment    -0.0378261  0.0822411  -0.460 0.645639    
## GarageType.BuiltIn    -0.0683768  0.0500805  -1.365 0.172397    
## GarageType.CarPort     0.0300819  0.1234183   0.244 0.807473    
## GarageType.Detchd             NA         NA      NA       NA    
## GarageFinish.Fin       0.0035003  0.0302392   0.116 0.907865    
## GarageFinish.RFn      -0.0345935  0.0267843  -1.292 0.196754    
## GarageFinish.Unf              NA         NA      NA       NA    
## GarageQual.Ex          1.4044270  0.3763530   3.732 0.000199 ***
## GarageQual.Fa         -0.0875616  0.0583846  -1.500 0.133939    
## GarageQual.Gd          0.0140980  0.0959472   0.147 0.883207    
## GarageQual.Po         -0.2496823  0.3020676  -0.827 0.408637    
## GarageQual.TA                 NA         NA      NA       NA    
## GarageCond.Ex         -1.3431174  0.4334841  -3.098 0.001990 ** 
## GarageCond.Fa         -0.0213811  0.0651126  -0.328 0.742687    
## GarageCond.Gd         -0.0210578  0.1132845  -0.186 0.852566    
## GarageCond.Po          0.0110423  0.1734435   0.064 0.949247    
## GarageCond.TA                 NA         NA      NA       NA    
## PavedDrive.N           0.0167766  0.0431129   0.389 0.697247    
## PavedDrive.P          -0.0400452  0.0611446  -0.655 0.512638    
## PavedDrive.Y                  NA         NA      NA       NA    
## SaleType.COD          -0.0018453  0.0526469  -0.035 0.972046    
## SaleType.Con           0.3501666  0.2154136   1.626 0.104301    
## SaleType.ConLD         0.2081313  0.1121069   1.857 0.063616 .  
## SaleType.ConLI         0.0683223  0.1364445   0.501 0.616649    
## SaleType.ConLw         0.0051400  0.1436284   0.036 0.971458    
## SaleType.CWD           0.1848426  0.1539761   1.200 0.230191    
## SaleType.New           0.2858321  0.1886345   1.515 0.129962    
## SaleType.Oth           0.0752830  0.1793278   0.420 0.674700    
## SaleType.WD                   NA         NA      NA       NA    
## SaleCondition.Abnorml  0.0301204  0.1880973   0.160 0.872803    
## SaleCondition.AdjLand  0.1840785  0.2581343   0.713 0.475912    
## SaleCondition.Alloca   0.0822735  0.2128666   0.387 0.699192    
## SaleCondition.Family   0.0142096  0.1975827   0.072 0.942679    
## SaleCondition.Normal   0.0978700  0.1861911   0.526 0.599231    
## SaleCondition.Partial         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2874 on 1227 degrees of freedom
## Multiple R-squared:  0.9306, Adjusted R-squared:  0.9174 
## F-statistic: 70.87 on 232 and 1227 DF,  p-value: < 2.2e-16
# liner_model_step <- step(liner_model)
 #summary(liner_model_step)
hist(liner_model$residuals)

qqnorm(liner_model$residuals)

qqline(liner_model$residuals)

shapiro.test(liner_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  liner_model$residuals
## W = 0.84738, p-value < 2.2e-16
# residuals against fitted values
plot(liner_model$fitted.values, liner_model$residuals)

The histogram of the residuals is roughly bell-shaped, and in the Q-Q plot the points lie close to the reference line. The Shapiro-Wilk test nevertheless rejects normality (W = 0.847, p < 2.2e-16), so the residuals are only approximately normal; the extreme residuals (about ±2.27, against quartiles of about ±0.12) point to heavy tails.
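For reference, the behaviour of shapiro.test() can be illustrated on simulated data. The sketch below compares a normal sample with a heavy-tailed one; both samples are simulated and are not the model residuals, so the numbers only show how the statistic reacts to heavy tails.

```r
set.seed(1)
normal_res <- rnorm(1000)        # simulated normal "residuals"
heavy_res  <- rt(1000, df = 2)   # simulated heavy-tailed "residuals"

sw_normal <- shapiro.test(normal_res)
sw_heavy  <- shapiro.test(heavy_res)

# W close to 1 is consistent with normality; a small p-value rejects it.
print(c(W_normal = unname(sw_normal$statistic),
        W_heavy  = unname(sw_heavy$statistic)))
print(c(p_normal = sw_normal$p.value, p_heavy = sw_heavy$p.value))
```

With large samples the test detects even modest departures from normality, so it is worth reading it together with the histogram and the Q-Q plot.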

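The in-sample fit is good: the residual standard error is 0.2874 on the standardized SalePrice scale, and the model explains about 93% of the variance (multiple R-squared 0.9306, adjusted 0.9174). These figures are computed on the training data, so they are likely to overstate accuracy on new data.

The residuals are centered near zero (median 0.004) and roughly symmetric (quartiles -0.119 and 0.123), so the linear model is a reasonable first fit for this data. The linear model also has good interpretability.

The significance tests show which features have no detectable effect once the others are in the model (large p-values), and, because the numeric predictors are standardized, the size of a coefficient indicates how strongly that feature influences the result. The 41 coefficients reported as NA were not estimated because of exact collinearity: the one-hot columns of each categorical variable sum to one, duplicating the intercept, and TotalBsmtSF and GrLivArea are sums of other area columns in the data.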
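The note "41 not defined because of singularities" in the summary above is typical of a full one-hot encoding combined with an intercept. The toy base R example below reproduces the effect with one three-level factor; the data are made up. In caret, dummyVars() has a fullRank argument that drops one level per factor for this reason.

```r
# Toy data: one categorical predictor with three levels (values are made up).
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)

# Full one-hot encoding: one column per level.
full <- model.matrix(~ g - 1, data = d)

# The level columns sum to 1 in every row, duplicating the intercept,
# so lm() cannot estimate all of them and reports one as NA.
fit_full <- lm(d$y ~ full)

# Reference coding: the first level is absorbed into the intercept.
fit_ref <- lm(y ~ g, data = d)

print(coef(fit_full))
print(coef(fit_ref))
```

Both fits give the same predictions; only the parameterization differs.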

Ridge

Use glmnet with alpha=0 for ridge. If we use lambda=0, we should get the OLS estimate.

Instead of using cross-validation to determine the terms in the model, CV will be used to determine lambda. For that we’ll use cv.glmnet.
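The claim that lambda = 0 recovers the OLS estimate can be checked with the closed-form ridge solution. The sketch below uses simulated data and a hand-written `ridge_coef` helper (hypothetical, not part of glmnet). glmnet parameterizes the penalty differently, so its lambda values are not directly comparable with the ones used here.

```r
set.seed(42)
n <- 50
X <- scale(matrix(rnorm(n * 3), n, 3))     # standardized predictors
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                           # centered response, so no intercept

# Closed-form ridge estimate: (X'X + lambda * I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

b_ols   <- unname(coef(lm(y ~ X - 1)))
b_zero  <- ridge_coef(X, y, lambda = 0)
b_large <- ridge_coef(X, y, lambda = 1000)

print(rbind(ols = b_ols, ridge0 = b_zero, ridge1000 = b_large))
```

With lambda = 0 the ridge estimate matches OLS, and a large lambda pulls all coefficients toward zero.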

library(glmnet)
set.seed(123)
k=5
folds = rep(1:k, length.out=nrow(traindata))
fold = sample(folds, nrow(traindata), replace=F)


lambdas = c(0, 10^seq(-2, 10, length=1000))  # grid of lambda values to check
# Columns 1 and 38 of traindata are Id and SalePrice; exclude them from the predictors.
r1 = cv.glmnet(x=as.matrix(traindata[,-c(1,38)]), y=traindata$SalePrice, alpha=0, lambda=lambdas, foldid=fold, thresh=1e-16)

Find the best lambda

best.lambda.r1=r1$lambda.min
best.lambda.r1
## [1] 0.5517492

Cross-validated MSE for that lambda

r1$cvm[r1$lambda==best.lambda.r1]
## [1] 0.1599373
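What cv.glmnet does with the fold labels can be sketched by hand: fit on all folds but one, measure the squared error on the held-out fold, and average over folds for each lambda. The example below uses simulated data and the closed-form ridge solution; it only illustrates the procedure and does not reproduce glmnet's fitting or penalty scaling. The helper `ridge_fit` is hypothetical.

```r
set.seed(123)
n <- 60
X <- scale(matrix(rnorm(n * 4), n, 4))
y <- drop(X %*% c(1.5, 0, -1, 0) + rnorm(n))

k <- 5
fold <- sample(rep(1:k, length.out = n))   # balanced random fold labels
lambdas <- c(0, 0.1, 1, 10, 100)

ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Mean squared prediction error over the held-out folds, for each lambda.
cv_mse <- sapply(lambdas, function(lam) {
  mean(sapply(1:k, function(f) {
    tr <- fold != f
    ym <- mean(y[tr])                      # intercept taken from the training part
    b <- ridge_fit(X[tr, ], y[tr] - ym, lam)
    mean((y[!tr] - ym - X[!tr, ] %*% b)^2)
  }))
})

best_lambda <- lambdas[which.min(cv_mse)]
print(data.frame(lambda = lambdas, cv_mse = cv_mse))
print(best_lambda)
```

Using the same fold labels for every model, as the foldid argument does above, makes the cross-validated errors of different models comparable.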

Find the coefficients for each value of lambda

coefs = coef(r1, s=lambdas)
colnames(coefs) = paste0('lambda', round(lambdas,5))
head(coefs[,1:5],10)
## 10 x 5 sparse Matrix of class "dgCMatrix"
##                   lambda0   lambda0.01 lambda0.01028 lambda0.01057
## (Intercept)  -0.727109349 -0.650123600  -0.649352907  -0.648565557
## MSSubClass   -0.010061711 -0.020991731  -0.021162599  -0.021333511
## LotFrontage   0.008455491  0.009155435   0.009172338   0.009189599
## LotArea       0.095394453  0.090356912   0.090227707   0.090095526
## OverallQual   0.112131052  0.112120227   0.112112386   0.112103998
## OverallCond   0.078904901  0.075362489   0.075274496   0.075184572
## YearBuilt     0.126460527  0.106283963   0.105820135   0.105347903
## YearRemodAdd  0.028153745  0.029075086   0.029095190   0.029115608
## MasVnrArea    0.044286271  0.045666954   0.045698397   0.045730381
## BsmtFinSF1    0.099085055  0.092648442   0.092547734   0.092444663
##              lambda0.01087
## (Intercept)   -0.647761275
## MSSubClass    -0.021504356
## LotFrontage    0.009207223
## LotArea        0.089960317
## OverallQual    0.112095036
## OverallCond    0.075092684
## YearBuilt      0.104867214
## YearRemodAdd   0.029136338
## MasVnrArea     0.045762906
## BsmtFinSF1     0.092339188

Reorganize the coefs data to prepare for ggplot.

coefs = as.data.frame(as.matrix(coefs))
coefs$var = rownames(coefs)
coefs = gather(coefs, key=lambda, value=coef, -var)
coefs$lambda = as.numeric(gsub('lambda', '', coefs$lambda))
coefs=coefs[!grepl('Intercept', coefs$var),]
head(coefs)
##           var lambda         coef
## 2  MSSubClass      0 -0.010061711
## 3 LotFrontage      0  0.008455491
## 4     LotArea      0  0.095394453
## 5 OverallQual      0  0.112131052
## 6 OverallCond      0  0.078904901
## 7   YearBuilt      0  0.126460527

Plot the coefficients as a function of lambda

require(ggplot2)
g= ggplot(data=coefs, aes(x=lambda, y=coef, group=var))+geom_line(alpha=0.7)+xlim(c(0,1000000))+
  geom_vline(xintercept=best.lambda.r1, color='red')+
  geom_vline(xintercept=0)
g
## Warning: Removed 90909 rows containing missing values (geom_path).

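The ridge model predicts well: at the selected lambda (0.5517) the cross-validated MSE is 0.1599 on the standardized SalePrice scale.

Because this error is estimated on held-out folds, it is a fairer indication of predictive performance than an in-sample fit, and it suggests that ridge regression suits this data set.

glmnet does not report significance tests, so individual features cannot be judged by p-values here. Because the numeric predictors are standardized, the size of a coefficient still indicates how strongly that feature influences the result. Ridge shrinks all coefficients toward zero but does not set any of them exactly to zero.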

Lasso

las1 = cv.glmnet(x=as.matrix(traindata[,-c(1,38)]), y=traindata$SalePrice, alpha=1,lambda=lambdas, foldid=fold, thresh=1e-16)

Find the best lambda

best.lambda.las1=las1$lambda.min
best.lambda.las1
## [1] 0.02230223

Cross-validated MSE for that lambda

las1$cvm[las1$lambda==best.lambda.las1]
## [1] 0.1695319

This is slightly worse than ridge (0.1599) but better than OLS (lambda = 0).

Let’s see how the coefficients change for different values of lambda. First we’ll find the coefficients for each value of lambda.

coefs = coef(las1, s=lambdas)
colnames(coefs) = paste0('lambda', round(lambdas,5))
head(coefs[,1:5],10)
## 10 x 5 sparse Matrix of class "dgCMatrix"
##                   lambda0   lambda0.01 lambda0.01028 lambda0.01057
## (Intercept)  -0.584282518 -0.372344471  -0.368949806  -0.366746433
## MSSubClass   -0.010061713 -0.034087211  -0.033808365  -0.033485623
## LotFrontage   0.008455489  0.006247016   0.005875402   0.005487881
## LotArea       0.095394455  0.047762134   0.047277461   0.046780403
## OverallQual   0.112131060  0.139301294   0.140278234   0.141269725
## OverallCond   0.078904901  0.053398405   0.052998294   0.052589690
## YearBuilt     0.126460565  0.084562408   0.084519006   0.084434524
## YearRemodAdd  0.028153751  0.028473983   0.028745958   0.029013614
## MasVnrArea    0.044286273  0.030778401   0.030232088   0.029680172
## BsmtFinSF1    0.117928743  0.087311289   0.087224290   0.087124384
##              lambda0.01087
## (Intercept)   -0.364879910
## MSSubClass    -0.033142655
## LotFrontage    0.005087789
## LotArea        0.046269778
## OverallQual    0.142285045
## OverallCond    0.052170476
## YearBuilt      0.084335329
## YearRemodAdd   0.029285073
## MasVnrArea     0.029115786
## BsmtFinSF1     0.087018431

Reorganize the coefs data to prepare for ggplot.

coefs = as.data.frame(as.matrix(coefs))
coefs$var = rownames(coefs)
coefs = gather(coefs, key=lambda, value=coef, -var)
coefs$lambda = as.numeric(gsub('lambda', '', coefs$lambda))
coefs=coefs[!grepl('Intercept', coefs$var),]
head(coefs)
##           var lambda         coef
## 2  MSSubClass      0 -0.010061713
## 3 LotFrontage      0  0.008455489
## 4     LotArea      0  0.095394455
## 5 OverallQual      0  0.112131060
## 6 OverallCond      0  0.078904901
## 7   YearBuilt      0  0.126460565

Plot the coefficients as a function of lambda

g= ggplot(data=coefs, aes(x=lambda, y=coef, group=var))+geom_line(alpha=0.7)+xlim(c(0,50000))+
  geom_vline(xintercept=best.lambda.las1, color='red')+
  geom_vline(xintercept=0)
g
## Warning: Removed 120666 rows containing missing values (geom_path).

The cross-validated MSE at the selected lambda (0.0223) is 0.1695 on the standardized SalePrice scale, so the lasso model also predicts well, although slightly less well than ridge.

Because this error is estimated on held-out folds, it suggests that the lasso model is also suitable for this data set.

As with ridge, glmnet does not report significance tests. Instead, lasso performs feature selection directly: features whose coefficients are shrunk exactly to zero are dropped from the model, and, because the numeric predictors are standardized, the size of a remaining coefficient indicates how strongly that feature influences the result.
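The different behaviour of the two penalties is easiest to see in the special case of orthonormal predictors (X'X = I), where both estimates have closed forms. The sketch below assumes that case and the loss 0.5 * RSS plus penalty; the OLS coefficients are made up, and glmnet's own scaling of lambda differs.

```r
# Soft-thresholding operator: shrink toward zero and clip at zero.
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

b_ols  <- c(3, -1.5, 0.4, -0.2)   # made-up OLS coefficients
lambda <- 0.5

# Lasso (penalty lambda * sum(abs(b))): small coefficients become exactly 0.
b_lasso <- soft_threshold(b_ols, lambda)

# Ridge (penalty 0.5 * lambda * sum(b^2)): proportional shrinkage, never exactly 0.
b_ridge <- b_ols / (1 + lambda)

print(rbind(ols = b_ols, lasso = b_lasso, ridge = b_ridge))
```

In this setting lasso subtracts a fixed amount from each coefficient and removes those smaller than lambda, while ridge rescales every coefficient by the same factor.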

Predict

liner_predict = predict(liner_model , newdata=testdata)
## Warning in predict.lm(liner_model, newdata = testdata): prediction from a
## rank-deficient fit may be misleading
sampledata <- read_csv("/Users/milin/Downloads/rid1.csv")
## Parsed with column specification:
## cols(
##   Id = col_double(),
##   SalePrice = col_double()
## )
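# Id and SalePrice were standardized by scale() above, so take the Ids from the
# original test file and map the predictions back to the original price scale.
price_mean <- mean(train1$SalePrice)
price_sd <- sd(train1$SalePrice)

sub <- data.frame(id = test$Id, value = liner_predict * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub, "liner.csv")


rid <- predict(r1, s=best.lambda.r1, newx=as.matrix(testdata[,-1]))

sub <- data.frame(id = test$Id, value = as.numeric(rid) * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub, "rid.csv")

lasso <- predict(las1, s=best.lambda.las1, newx=as.matrix(testdata[,-1]))

sub <- data.frame(id = test$Id, value = as.numeric(lasso) * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub, "lasso.csv")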