Assignment 2, Due September 4(A)/5(B), 2019 at 5:30pm

Unless you have already done so, go to Kaggle’s website and register an account, https://www.kaggle.com/account/register. Read up on what Kaggle is and reflect on how you may use it in your future jobs. Go to the following competition on house price prediction: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview

INDIVIDUAL
1. [10 pts]: Read the instructions on Kaggle. Learn (by yourself) how to join a Kaggle competition and submit your results.
2. [10 pts]: You will find that some predictors contain “missing data”, NA. Figure out (by yourself) how to handle missing data in regression.
3. [50 pts]: Build a regression model for house price prediction and write a report explaining how you approached the task, the steps you took, how you revised your model (must explore both LASSO and ridge regression) as your analyses progressed, etc. Comment on the quality of your predictions. Include your model as an Appendix in your report. Submit the PDF of your report, and your model file(s) (R code, Excel spreadsheet, etc.)
4. [10 pts]: On the front page of your report, include your position on the leaderboard at the time of your last submission. Please also include the screenshot showing your position on the leaderboard in the Appendix.
Load data
require(readr)
require(tidyverse)
require(DataExplorer)
require(dlookr)
train <- read_csv("/Users/milin/Desktop/train.csv")
test <- read_csv("/Users/milin/Desktop/test.csv")
diagnose(train) %>% arrange(desc(missing_count))
## # A tibble: 81 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 PoolQC chara… 1453 99.5 4 0.00274
## 2 MiscFeatu… chara… 1406 96.3 5 0.00342
## 3 Alley chara… 1369 93.8 3 0.00205
## 4 Fence chara… 1179 80.8 5 0.00342
## 5 Fireplace… chara… 690 47.3 6 0.00411
## 6 LotFronta… numer… 259 17.7 111 0.0760
## 7 GarageType chara… 81 5.55 7 0.00479
## 8 GarageYrB… numer… 81 5.55 98 0.0671
## 9 GarageFin… chara… 81 5.55 4 0.00274
## 10 GarageQual chara… 81 5.55 6 0.00411
## # … with 71 more rows
diagnose(test) %>% arrange(desc(missing_count))
## # A tibble: 80 x 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 PoolQC chara… 1456 99.8 3 0.00206
## 2 MiscFeatu… chara… 1408 96.5 4 0.00274
## 3 Alley chara… 1352 92.7 3 0.00206
## 4 Fence chara… 1169 80.1 5 0.00343
## 5 Fireplace… chara… 730 50.0 6 0.00411
## 6 LotFronta… numer… 227 15.6 116 0.0795
## 7 GarageYrB… numer… 78 5.35 98 0.0672
## 8 GarageFin… chara… 78 5.35 4 0.00274
## 9 GarageQual chara… 78 5.35 5 0.00343
## 10 GarageCond chara… 78 5.35 6 0.00411
## # … with 70 more rows
The results above show that PoolQC, MiscFeature, Alley and Fence are missing in more than 80% of the rows in both the training and the test set, so these four variables are dropped entirely.
train <- train %>% dplyr::select(-PoolQC,-MiscFeature,-Alley,-Fence)
test <- test %>% dplyr::select(-PoolQC,-MiscFeature,-Alley,-Fence)
Use k-nearest-neighbour (KNN) imputation to fill in the remaining missing values. For each row with a missing value, KNN finds the K rows that are most similar to it and fills the gap with an aggregate of their values, such as the mean or a weighted mean. VIM's kNN() uses k = 5 by default and aggregates with the median for numeric variables and the most frequent category for categorical variables.
require(VIM)
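train1 <- kNN(train)
# kNN() appends one logical *_imp indicator column per variable; keep only the 77 data columns
train1 <- train1[,c(1:77)]
test1 <- kNN(test)
# the test set has no SalePrice, so it has 76 data columns
test1 <- test1[,c(1:76)]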
The data no longer contains any missing values.
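The KNN idea described above can be sketched in base R. This is a minimal illustration on made-up data, assuming Euclidean distance and a plain mean of the neighbours; the helper `knn_impute_column` and the toy columns are hypothetical and are not how VIM's kNN() is implemented.

```r
# Toy data: row 3 is missing its x3 value
toy <- data.frame(
  x1 = c(1, 2, 3, 10, 11),
  x2 = c(1, 2, 3, 10, 11),
  x3 = c(5, 6, NA, 50, 52)
)

# Hypothetical helper: fill NAs in `target` with the mean of the k nearest donor rows
knn_impute_column <- function(df, target, k = 2) {
  predictors <- setdiff(names(df), target)
  missing_rows <- which(is.na(df[[target]]))
  donor_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every donor row, using the predictors only
    diffs <- sweep(as.matrix(df[donor_rows, predictors, drop = FALSE]), 2,
                   unlist(df[i, predictors]))
    d <- sqrt(rowSums(diffs^2))
    nearest <- donor_rows[order(d)[seq_len(k)]]
    df[i, target] <- mean(df[nearest, target])
  }
  df
}

imputed <- knn_impute_column(toy, "x3", k = 2)
imputed$x3[3]  # mean of the x3 values in rows 1 and 2, the two closest rows
```

Rows 1 and 2 are the closest to row 3, so the gap is filled from them rather than from the much larger values in rows 4 and 5.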
First, separate the categorical (character) variables in the data set from the numeric ones.
a <- sapply(train1,class)
a
## Id MSSubClass MSZoning LotFrontage LotArea
## "numeric" "numeric" "character" "numeric" "numeric"
## Street LotShape LandContour Utilities LotConfig
## "character" "character" "character" "character" "character"
## LandSlope Neighborhood Condition1 Condition2 BldgType
## "character" "character" "character" "character" "character"
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## "character" "numeric" "numeric" "numeric" "numeric"
## RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## "character" "character" "character" "character" "character"
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## "numeric" "character" "character" "character" "character"
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## "character" "character" "character" "numeric" "character"
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## "numeric" "numeric" "numeric" "character" "character"
## CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF
## "character" "character" "numeric" "numeric" "numeric"
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## "numeric" "numeric" "numeric" "numeric" "numeric"
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## "numeric" "numeric" "character" "numeric" "character"
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## "numeric" "character" "character" "numeric" "character"
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## "numeric" "numeric" "character" "character" "character"
## WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch
## "numeric" "numeric" "numeric" "numeric" "numeric"
## PoolArea MiscVal MoSold YrSold SaleType
## "numeric" "numeric" "numeric" "numeric" "character"
## SaleCondition SalePrice
## "character" "numeric"
train_num <- train1[,a == "numeric"]
train_cat <- train1[,a == "character"]
test_num <- test1[,a[-77] == "numeric"]
test_cat <- test1[,a[-77] == "character"]
One-hot encode the categorical variables:
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
test_cat <- test_cat %>% dplyr::select(-Utilities)
train_cat <- train_cat %>% dplyr::select(- Utilities)
alldata <- rbind(train_cat,test_cat)
for (i in 1:dim(alldata)[2]) {
alldata[,i] <- as.factor(alldata[,i])
}
train_cat <- alldata[1:dim(train_cat)[1],]
test_cat <- alldata[-c(1:dim(train_cat)[1]),]
dum <- dummyVars(~.,data = alldata)
train_cat_dum <- predict(dum,train_cat)
test_cat_dum <- predict(dum,test_cat)
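# Standardize the numeric columns. Note that train and test are each scaled with
# their own mean and sd, and that Id and SalePrice are standardized as well.
train_num <- train_num %>% scale() %>% data.frame()
test_num <- test_num %>% scale() %>% data.frame()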
traindata <- cbind(train_num,train_cat_dum)
testdata <- cbind(test_num,test_cat_dum)
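Because the response is standardized along with the predictors, predictions from the models below come out in standardized units. The following sketch, on made-up data, shows one common way to handle this: scale the test set with the training means and standard deviations, then convert predictions back to the original price scale. The data frames `train_toy` and `test_toy` are hypothetical and this is not the code used for the results in this report.

```r
# Toy data in which price is exactly 2 * area
train_toy <- data.frame(area = c(50, 80, 120, 150),
                        price = c(100, 160, 240, 300))
test_toy <- data.frame(area = c(100, 200))

# Standardize the training data and keep the parameters that scale() used
train_scaled <- scale(train_toy)
centers <- attr(train_scaled, "scaled:center")
scales <- attr(train_scaled, "scaled:scale")

# Apply the TRAINING mean and sd to the test predictor
test_scaled <- scale(test_toy, center = centers["area"], scale = scales["area"])

# Fit on the standardized data and predict in standardized units
fit <- lm(price ~ area, data = as.data.frame(train_scaled))
pred_std <- predict(fit, newdata = as.data.frame(test_scaled))

# Undo the standardization of the response to get prices back
pred_price <- unname(pred_std * scales["price"] + centers["price"])
pred_price
```

Scaling the test set with its own mean and standard deviation would map the same house to a different standardized value than in training, which is why the training parameters are reused here.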
liner_model <- lm(SalePrice~.,data = traindata[,-1])
summary(liner_model)
##
## Call:
## lm(formula = SalePrice ~ ., data = traindata[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27393 -0.11872 0.00439 0.12326 2.27393
##
## Coefficients: (41 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7829291 0.6098678 1.284 0.199465
## MSSubClass -0.0100617 0.0433555 -0.232 0.816518
## LotFrontage 0.0084555 0.0128841 0.656 0.511772
## LotArea 0.0953944 0.0137751 6.925 7.01e-12 ***
## OverallQual 0.1121311 0.0176608 6.349 3.04e-10 ***
## OverallCond 0.0789049 0.0122690 6.431 1.81e-10 ***
## YearBuilt 0.1264605 0.0295237 4.283 1.98e-05 ***
## YearRemodAdd 0.0281537 0.0144102 1.954 0.050959 .
## MasVnrArea 0.0442863 0.0130872 3.384 0.000737 ***
## BsmtFinSF1 0.1935575 0.0262856 7.364 3.27e-13 ***
## BsmtFinSF2 0.0490585 0.0173106 2.834 0.004672 **
## BsmtUnfSF 0.0814681 0.0222564 3.660 0.000262 ***
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 0.2516488 0.0254857 9.874 < 2e-16 ***
## X2ndFlrSF 0.3701033 0.0305567 12.112 < 2e-16 ***
## LowQualFinSF 0.0086469 0.0112590 0.768 0.442637
## GrLivArea NA NA NA NA
## BsmtFullBath 0.0034939 0.0129224 0.270 0.786918
## BsmtHalfBath -0.0032221 0.0090747 -0.355 0.722607
## FullBath 0.0272426 0.0152974 1.781 0.075182 .
## HalfBath 0.0067917 0.0132392 0.513 0.608042
## BedroomAbvGr -0.0350209 0.0140885 -2.486 0.013059 *
## KitchenAbvGr -0.0334483 0.0157877 -2.119 0.034323 *
## TotRmsAbvGrd 0.0195016 0.0193582 1.007 0.313938
## Fireplaces 0.0194008 0.0109063 1.779 0.075509 .
## GarageYrBlt -0.0100750 0.0191513 -0.526 0.598932
## GarageCars 0.0285761 0.0204818 1.395 0.163210
## GarageArea 0.0439676 0.0209526 2.098 0.036071 *
## WoodDeckSF 0.0220741 0.0092295 2.392 0.016921 *
## OpenPorchSF 0.0029809 0.0095684 0.312 0.755450
## EnclosedPorch 0.0039454 0.0095824 0.412 0.680608
## X3SsnPorch 0.0123958 0.0083004 1.493 0.135592
## ScreenPorch 0.0196735 0.0086060 2.286 0.022423 *
## PoolArea 0.0422263 0.0093080 4.537 6.28e-06 ***
## MiscVal 0.0004489 0.0088876 0.051 0.959724
## MoSold -0.0144309 0.0083248 -1.733 0.083261 .
## YrSold -0.0073439 0.0086424 -0.850 0.395629
## `MSZoning.C (all)` -0.2947011 0.1202538 -2.451 0.014398 *
## MSZoning.FV 0.1250766 0.0936348 1.336 0.181865
## MSZoning.RH 0.0307292 0.0960397 0.320 0.749050
## MSZoning.RL 0.0467834 0.0473143 0.989 0.322967
## MSZoning.RM NA NA NA NA
## Street.Grvl -0.3865099 0.1522345 -2.539 0.011242 *
## Street.Pave NA NA NA NA
## LotShape.IR1 -0.0253964 0.0202674 -1.253 0.210421
## LotShape.IR2 0.0221497 0.0548512 0.404 0.686420
## LotShape.IR3 0.0341039 0.1128361 0.302 0.762518
## LotShape.Reg NA NA NA NA
## LandContour.Bnk -0.0751658 0.0465630 -1.614 0.106723
## LandContour.HLS 0.0303031 0.0513660 0.590 0.555335
## LandContour.Low -0.2000293 0.0718021 -2.786 0.005421 **
## LandContour.Lvl NA NA NA NA
## LotConfig.Corner 0.0178353 0.0226428 0.788 0.431034
## LotConfig.CulDSac 0.1223906 0.0378066 3.237 0.001239 **
## LotConfig.FR2 -0.0837507 0.0476849 -1.756 0.079281 .
## LotConfig.FR3 -0.1789558 0.1581663 -1.131 0.258090
## LotConfig.Inside NA NA NA NA
## LandSlope.Gtl 0.5570847 0.1441020 3.866 0.000116 ***
## LandSlope.Mod 0.6366039 0.1439807 4.421 1.07e-05 ***
## LandSlope.Sev NA NA NA NA
## Neighborhood.Blmngtn -0.0163826 0.1326784 -0.123 0.901751
## Neighborhood.Blueste -0.0231539 0.2419336 -0.096 0.923772
## Neighborhood.BrDale -0.0218450 0.1478908 -0.148 0.882595
## Neighborhood.BrkSide -0.0682929 0.1195018 -0.571 0.567779
## Neighborhood.ClearCr -0.1852409 0.1210091 -1.531 0.126076
## Neighborhood.CollgCr -0.1365754 0.1045494 -1.306 0.191687
## Neighborhood.Crawfor 0.1362654 0.1127566 1.208 0.227091
## Neighborhood.Edwards -0.2587996 0.1081781 -2.392 0.016891 *
## Neighborhood.Gilbert -0.1572176 0.1093944 -1.437 0.150926
## Neighborhood.IDOTRR -0.1452660 0.1342510 -1.082 0.279444
## Neighborhood.MeadowV -0.0795678 0.1481597 -0.537 0.591337
## Neighborhood.Mitchel -0.2792718 0.1104663 -2.528 0.011592 *
## Neighborhood.NAmes -0.2086532 0.1034452 -2.017 0.043909 *
## Neighborhood.NoRidge 0.3169246 0.1142131 2.775 0.005606 **
## Neighborhood.NPkVill 0.1662988 0.1835156 0.906 0.365017
## Neighborhood.NridgHt 0.2157803 0.1110475 1.943 0.052228 .
## Neighborhood.NWAmes -0.2365850 0.1063531 -2.225 0.026295 *
## Neighborhood.OldTown -0.1831072 0.1201491 -1.524 0.127766
## Neighborhood.Sawyer -0.1491827 0.1072843 -1.391 0.164618
## Neighborhood.SawyerW -0.0719011 0.1075084 -0.669 0.503752
## Neighborhood.Somerst -0.0304059 0.1243646 -0.244 0.806892
## Neighborhood.StoneBr 0.4691535 0.1207296 3.886 0.000107 ***
## Neighborhood.SWISU -0.1194861 0.1284579 -0.930 0.352472
## Neighborhood.Timber -0.1470783 0.1140498 -1.290 0.197434
## Neighborhood.Veenker NA NA NA NA
## Condition1.Artery -0.0727275 0.1619269 -0.449 0.653411
## Condition1.Feedr 0.0071980 0.1574186 0.046 0.963537
## Condition1.Norm 0.1149882 0.1534334 0.749 0.453739
## Condition1.PosA -0.0022810 0.1916658 -0.012 0.990506
## Condition1.PosN 0.0749043 0.1707127 0.439 0.660902
## Condition1.RRAe -0.2828955 0.1854076 -1.526 0.127317
## Condition1.RRAn 0.0272921 0.1608552 0.170 0.865299
## Condition1.RRNe -0.0984167 0.2622391 -0.375 0.707507
## Condition1.RRNn NA NA NA NA
## Condition2.Artery -0.0135599 0.3422183 -0.040 0.968400
## Condition2.Feedr -0.0832005 0.2677304 -0.311 0.756034
## Condition2.Norm -0.1048598 0.2244725 -0.467 0.640484
## Condition2.PosA 0.4799276 0.4699100 1.021 0.307305
## Condition2.PosN -3.0194807 0.3276853 -9.215 < 2e-16 ***
## Condition2.RRAe -1.6132953 0.5621913 -2.870 0.004180 **
## Condition2.RRAn -0.2861964 0.3737727 -0.766 0.444004
## Condition2.RRNn NA NA NA NA
## BldgType.1Fam 0.2373701 0.1116121 2.127 0.033641 *
## BldgType.2fmCon 0.1476361 0.0964433 1.531 0.126075
## BldgType.Duplex 0.1241621 0.1104364 1.124 0.261112
## BldgType.Twnhs -0.0397140 0.0635087 -0.625 0.531870
## BldgType.TwnhsE NA NA NA NA
## HouseStyle.1.5Fin -0.0203677 0.0690041 -0.295 0.767917
## HouseStyle.1.5Unf 0.1446251 0.1111727 1.301 0.193535
## HouseStyle.1Story 0.0902978 0.0741225 1.218 0.223373
## HouseStyle.2.5Fin -0.3210785 0.1701864 -1.887 0.059446 .
## HouseStyle.2.5Unf -0.1475800 0.1266428 -1.165 0.244114
## HouseStyle.2Story -0.0957260 0.0656995 -1.457 0.145364
## HouseStyle.SFoyer 0.0023435 0.0689864 0.034 0.972906
## HouseStyle.SLvl NA NA NA NA
## RoofStyle.Flat -1.2316953 0.4383731 -2.810 0.005038 **
## RoofStyle.Gable -1.1355395 0.3801298 -2.987 0.002871 **
## RoofStyle.Gambrel -1.1126240 0.3922273 -2.837 0.004633 **
## RoofStyle.Hip -1.1357168 0.3804251 -2.985 0.002888 **
## RoofStyle.Mansard -0.9788928 0.3755304 -2.607 0.009253 **
## RoofStyle.Shed NA NA NA NA
## RoofMatl.ClyTile -9.0448265 0.4339942 -20.841 < 2e-16 ***
## RoofMatl.CompShg -0.6798950 0.1441380 -4.717 2.67e-06 ***
## RoofMatl.Membran 0.5132349 0.4395102 1.168 0.243137
## RoofMatl.Metal 0.1821753 0.4280163 0.426 0.670454
## RoofMatl.Roll -0.8638422 0.3529558 -2.447 0.014526 *
## `RoofMatl.Tar&Grv` -0.6152070 0.2696740 -2.281 0.022701 *
## RoofMatl.WdShake -0.7968688 0.2352131 -3.388 0.000727 ***
## RoofMatl.WdShngl NA NA NA NA
## Exterior1st.AsbShng 0.1846407 0.1678237 1.100 0.271458
## Exterior1st.AsphShn -0.0088736 0.3991928 -0.022 0.982269
## Exterior1st.BrkComm -0.0027317 0.3250248 -0.008 0.993295
## Exterior1st.BrkFace 0.2071514 0.0983777 2.106 0.035435 *
## Exterior1st.CBlock -0.1828724 0.3406839 -0.537 0.591517
## Exterior1st.CemntBd -0.0294134 0.1984546 -0.148 0.882200
## Exterior1st.HdBoard -0.0648821 0.0906831 -0.715 0.474447
## Exterior1st.ImStucc -0.4583262 0.3247645 -1.411 0.158422
## Exterior1st.MetalSd 0.0519716 0.1264717 0.411 0.681192
## Exterior1st.Plywood -0.0794513 0.0911610 -0.872 0.383625
## Exterior1st.Stone -0.0041374 0.2709762 -0.015 0.987820
## Exterior1st.Stucco 0.0490505 0.1222378 0.401 0.688290
## Exterior1st.VinylSd -0.0640784 0.1102961 -0.581 0.561369
## `Exterior1st.Wd Sdng` -0.0609934 0.0841158 -0.725 0.468522
## Exterior1st.WdShing NA NA NA NA
## Exterior2nd.AsbShng -0.1170960 0.1568482 -0.747 0.455474
## Exterior2nd.AsphShn 0.0693135 0.2473283 0.280 0.779334
## `Exterior2nd.Brk Cmn` 0.0183198 0.2180709 0.084 0.933064
## Exterior2nd.BrkFace -0.0048462 0.1072163 -0.045 0.963955
## Exterior2nd.CBlock NA NA NA NA
## Exterior2nd.CmentBd 0.0849450 0.1931208 0.440 0.660120
## Exterior2nd.HdBoard 0.0536032 0.0854417 0.627 0.530536
## Exterior2nd.ImStucc 0.2805027 0.1261046 2.224 0.026305 *
## Exterior2nd.MetalSd -0.0051335 0.1214637 -0.042 0.966296
## Exterior2nd.Other -0.2667455 0.3196219 -0.835 0.404125
## Exterior2nd.Plywood 0.0283555 0.0815611 0.348 0.728155
## Exterior2nd.Stone -0.1926621 0.1673965 -1.151 0.249984
## Exterior2nd.Stucco 0.0182728 0.1161089 0.157 0.874974
## Exterior2nd.VinylSd 0.1059547 0.0983921 1.077 0.281754
## `Exterior2nd.Wd Sdng` 0.0803743 0.0735140 1.093 0.274468
## `Exterior2nd.Wd Shng` NA NA NA NA
## MasVnrType.BrkCmn -0.1565906 0.0911352 -1.718 0.086008 .
## MasVnrType.BrkFace -0.0710737 0.0364648 -1.949 0.051510 .
## MasVnrType.None -0.0294090 0.0395995 -0.743 0.457829
## MasVnrType.Stone NA NA NA NA
## ExterQual.Ex 0.2674941 0.0668751 4.000 6.72e-05 ***
## ExterQual.Fa 0.1832265 0.1216908 1.506 0.132409
## ExterQual.Gd 0.0093229 0.0307510 0.303 0.761809
## ExterQual.TA NA NA NA NA
## ExterCond.Ex 0.0647113 0.2168816 0.298 0.765470
## ExterCond.Fa 0.0162770 0.0726673 0.224 0.822800
## ExterCond.Gd -0.0356658 0.0299334 -1.192 0.233686
## ExterCond.Po 0.1444063 0.3278414 0.440 0.659670
## ExterCond.TA NA NA NA NA
## Foundation.BrkTil 0.3431521 0.1860593 1.844 0.065378 .
## Foundation.CBlock 0.3870318 0.1832827 2.112 0.034917 *
## Foundation.PConc 0.4036620 0.1821741 2.216 0.026888 *
## Foundation.Slab 0.4175869 0.2040571 2.046 0.040928 *
## Foundation.Stone 0.4260338 0.2289475 1.861 0.063006 .
## Foundation.Wood NA NA NA NA
## BsmtQual.Ex 0.2023700 0.0521055 3.884 0.000108 ***
## BsmtQual.Fa 0.0277965 0.0614838 0.452 0.651280
## BsmtQual.Gd -0.0355402 0.0318394 -1.116 0.264542
## BsmtQual.TA NA NA NA NA
## BsmtCond.Fa -0.0486079 0.0536276 -0.906 0.364903
## BsmtCond.Gd -0.0417521 0.0408683 -1.022 0.307160
## BsmtCond.Po 0.8416787 0.3824721 2.201 0.027948 *
## BsmtCond.TA NA NA NA NA
## BsmtExposure.Av 0.0788182 0.0275564 2.860 0.004305 **
## BsmtExposure.Gd 0.2515593 0.0376943 6.674 3.77e-11 ***
## BsmtExposure.Mn 0.0225927 0.0319735 0.707 0.479945
## BsmtExposure.No NA NA NA NA
## BsmtFinType1.ALQ -0.0433020 0.0361019 -1.199 0.230590
## BsmtFinType1.BLQ -0.0155023 0.0385632 -0.402 0.687755
## BsmtFinType1.GLQ 0.0319311 0.0337512 0.946 0.344297
## BsmtFinType1.LwQ -0.0878497 0.0466405 -1.884 0.059862 .
## BsmtFinType1.Rec -0.0464868 0.0393374 -1.182 0.237537
## BsmtFinType1.Unf NA NA NA NA
## BsmtFinType2.ALQ 0.1304617 0.0949491 1.374 0.169687
## BsmtFinType2.BLQ -0.0416537 0.0642417 -0.648 0.516853
## BsmtFinType2.GLQ 0.0916211 0.1116101 0.821 0.411861
## BsmtFinType2.LwQ -0.0784112 0.0578225 -1.356 0.175328
## BsmtFinType2.Rec -0.0203589 0.0603171 -0.338 0.735774
## BsmtFinType2.Unf NA NA NA NA
## Heating.Floor -0.1121599 0.3615427 -0.310 0.756442
## Heating.GasA -0.1474185 0.1808174 -0.815 0.415065
## Heating.GasW -0.1944473 0.1950471 -0.997 0.318998
## Heating.Grav -0.2184253 0.2205561 -0.990 0.322204
## Heating.OthW -0.3918132 0.2931851 -1.336 0.181666
## Heating.Wall NA NA NA NA
## HeatingQC.Ex 0.0435737 0.0261059 1.669 0.095350 .
## HeatingQC.Fa 0.0503580 0.0561769 0.896 0.370205
## HeatingQC.Gd -0.0004289 0.0272075 -0.016 0.987424
## HeatingQC.Po 0.0773969 0.3368706 0.230 0.818322
## HeatingQC.TA NA NA NA NA
## CentralAir.N 0.0024397 0.0489669 0.050 0.960272
## CentralAir.Y NA NA NA NA
## Electrical.FuseA 0.0179955 0.0371137 0.485 0.627852
## Electrical.FuseF 0.0189307 0.0697092 0.272 0.786001
## Electrical.FuseP -0.0703400 0.2319079 -0.303 0.761705
## Electrical.Mix -0.5692347 0.5660117 -1.006 0.314761
## Electrical.SBrkr NA NA NA NA
## KitchenQual.Ex 0.2865060 0.0494137 5.798 8.52e-09 ***
## KitchenQual.Fa 0.0375162 0.0617756 0.607 0.543766
## KitchenQual.Gd -0.0301679 0.0268681 -1.123 0.261735
## KitchenQual.TA NA NA NA NA
## Functional.Maj1 -0.2796233 0.0942351 -2.967 0.003063 **
## Functional.Maj2 -0.2696161 0.1579777 -1.707 0.088136 .
## Functional.Min1 -0.1499215 0.0596696 -2.513 0.012114 *
## Functional.Min2 -0.1221749 0.0587374 -2.080 0.037731 *
## Functional.Mod -0.2674614 0.0944682 -2.831 0.004713 **
## Functional.Sev -0.8034093 0.3605439 -2.228 0.026039 *
## Functional.Typ NA NA NA NA
## FireplaceQu.Ex -0.0412153 0.0667570 -0.617 0.537091
## FireplaceQu.Fa -0.0350022 0.0373848 -0.936 0.349319
## FireplaceQu.Gd -0.0141711 0.0209403 -0.677 0.498700
## FireplaceQu.Po 0.0334541 0.0507692 0.659 0.510055
## FireplaceQu.TA NA NA NA NA
## GarageType.2Types -0.2873757 0.1382961 -2.078 0.037919 *
## GarageType.Attchd -0.0498285 0.0281665 -1.769 0.077131 .
## GarageType.Basment -0.0378261 0.0822411 -0.460 0.645639
## GarageType.BuiltIn -0.0683768 0.0500805 -1.365 0.172397
## GarageType.CarPort 0.0300819 0.1234183 0.244 0.807473
## GarageType.Detchd NA NA NA NA
## GarageFinish.Fin 0.0035003 0.0302392 0.116 0.907865
## GarageFinish.RFn -0.0345935 0.0267843 -1.292 0.196754
## GarageFinish.Unf NA NA NA NA
## GarageQual.Ex 1.4044270 0.3763530 3.732 0.000199 ***
## GarageQual.Fa -0.0875616 0.0583846 -1.500 0.133939
## GarageQual.Gd 0.0140980 0.0959472 0.147 0.883207
## GarageQual.Po -0.2496823 0.3020676 -0.827 0.408637
## GarageQual.TA NA NA NA NA
## GarageCond.Ex -1.3431174 0.4334841 -3.098 0.001990 **
## GarageCond.Fa -0.0213811 0.0651126 -0.328 0.742687
## GarageCond.Gd -0.0210578 0.1132845 -0.186 0.852566
## GarageCond.Po 0.0110423 0.1734435 0.064 0.949247
## GarageCond.TA NA NA NA NA
## PavedDrive.N 0.0167766 0.0431129 0.389 0.697247
## PavedDrive.P -0.0400452 0.0611446 -0.655 0.512638
## PavedDrive.Y NA NA NA NA
## SaleType.COD -0.0018453 0.0526469 -0.035 0.972046
## SaleType.Con 0.3501666 0.2154136 1.626 0.104301
## SaleType.ConLD 0.2081313 0.1121069 1.857 0.063616 .
## SaleType.ConLI 0.0683223 0.1364445 0.501 0.616649
## SaleType.ConLw 0.0051400 0.1436284 0.036 0.971458
## SaleType.CWD 0.1848426 0.1539761 1.200 0.230191
## SaleType.New 0.2858321 0.1886345 1.515 0.129962
## SaleType.Oth 0.0752830 0.1793278 0.420 0.674700
## SaleType.WD NA NA NA NA
## SaleCondition.Abnorml 0.0301204 0.1880973 0.160 0.872803
## SaleCondition.AdjLand 0.1840785 0.2581343 0.713 0.475912
## SaleCondition.Alloca 0.0822735 0.2128666 0.387 0.699192
## SaleCondition.Family 0.0142096 0.1975827 0.072 0.942679
## SaleCondition.Normal 0.0978700 0.1861911 0.526 0.599231
## SaleCondition.Partial NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2874 on 1227 degrees of freedom
## Multiple R-squared: 0.9306, Adjusted R-squared: 0.9174
## F-statistic: 70.87 on 232 and 1227 DF, p-value: < 2.2e-16
# liner_model_step <- step(liner_model)
#summary(liner_model_step)
hist(liner_model$residuals)
qqnorm(liner_model$residuals)
qqline(liner_model$residuals)
shapiro.test(liner_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: liner_model$residuals
## W = 0.84738, p-value < 2.2e-16
plot(liner_model$fitted.values, liner_model$residuals)
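The histogram of the residuals is roughly bell-shaped and centered at zero, and in the QQ plot the points lie along the diagonal. The Shapiro-Wilk test, however, rejects normality (W = 0.847, p < 2.2e-16). The residual summary points the same way: the quartiles are about ±0.12 while the extremes reach ±2.27, so the tails are much heavier than a normal distribution would give. The errors therefore cannot be treated as purely normal random noise, and a few houses are fitted poorly.
The fit of this model is not bad: the residual standard error is 0.2874 on the standardized SalePrice scale and the adjusted R-squared is 0.9174. These figures are computed on the training data, so they are optimistic as a measure of prediction quality on new houses.
The linear model has good interpretability. The t-tests in the summary show which coefficients cannot be distinguished from zero given the other predictors, and because the numeric predictors are standardized, the size of a coefficient indicates the degree of influence of that feature on the result.
The 41 coefficients reported as NA come from exact collinearity: every categorical variable is one-hot encoded with all of its levels next to an intercept, and TotalBsmtSF and GrLivArea are sums of other area columns.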
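The "not defined because of singularities" note in the summary can be reproduced on a small example. The sketch below uses made-up data: a full one-hot encoding plus an intercept is perfectly collinear, so lm() reports NA for one dummy, while the usual reference-level coding gives the same fit without any NA. As far as the caret documentation goes, dummyVars() has a fullRank = TRUE argument that drops one level per factor in the same way.

```r
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 10))
y <- rnorm(30) + as.numeric(g)

# Full one-hot encoding: one indicator column per level (ga, gb, gc)
onehot <- model.matrix(~ g - 1)

# Intercept + all three indicators: the indicators sum to the intercept column
fit_full <- lm(y ~ onehot)
n_na_full <- sum(is.na(coef(fit_full)))

# Reference coding: level "a" is absorbed into the intercept
fit_ref <- lm(y ~ g)
n_na_ref <- sum(is.na(coef(fit_ref)))

c(full = n_na_full, reference = n_na_ref)
```

Both fits give the same fitted values; only the parameterization differs.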
Use glmnet with alpha=0 for ridge. If we use lambda=0, we should get the OLS estimate.
Instead of using cross-validation to determine the terms in the model, CV will be used to determine lambda. For that we’ll use cv.glmnet.
library(glmnet)
set.seed(123)
k=5
folds = rep(1:k, length.out=nrow(traindata))
fold = sample(folds, nrow(traindata), replace=F)
lambdas= c(0,10^seq(-2, 10, length=1000)) ## define a bunch of lambdas to check.
r1 = cv.glmnet(x=as.matrix(traindata[,-c(1,38)]), y=traindata$SalePrice, alpha=0, lambda=lambdas, foldid=fold, thresh=1e-16)
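The statement that lambda=0 recovers the OLS estimate can be checked with the textbook closed form for ridge regression. This is a sketch on simulated data, assuming centered data without an intercept and the penalty written as lambda times the sum of squared coefficients; glmnet scales its objective differently, so its lambda values are not directly comparable. The helper `ridge_coef` is hypothetical.

```r
set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- as.numeric(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)  # centered response, so no intercept is needed

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols <- coef(lm(y ~ X - 1))
b_0 <- ridge_coef(X, y, 0)
b_10 <- ridge_coef(X, y, 10)
b_1000 <- ridge_coef(X, y, 1000)

# Squared length of the coefficient vector shrinks as lambda grows
c(lambda0 = sum(b_0^2), lambda10 = sum(b_10^2), lambda1000 = sum(b_1000^2))
```

With lambda = 0 the formula reduces to the normal equations, and larger values pull all coefficients toward zero.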
Find the best lambda
best.lambda.r1=r1$lambda.min
best.lambda.r1
## [1] 0.5517492
MSE for that lambda
r1$cvm[r1$lambda==best.lambda.r1]
## [1] 0.1599373
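To make the cross-validated MSE concrete, here is a hand-rolled version of the same procedure on simulated data: assign folds, fit on all folds but one, measure squared error on the held-out fold, and average over folds for each lambda. It is a sketch only, assuming the closed-form ridge solution on centered data; the hypothetical helper `ridge_coef` stands in for glmnet, so the numbers are not comparable with the output above.

```r
set.seed(123)
n <- 100
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(1, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)

# Balanced fold labels, shuffled
k <- 5
fold <- sample(rep(1:k, length.out = n))
lambdas <- c(0, 10^seq(-2, 3, length.out = 50))

# Closed-form ridge solution (no intercept, data already centered)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# For each lambda: mean over folds of the held-out mean squared error
cv_mse <- sapply(lambdas, function(lam) {
  fold_mse <- sapply(1:k, function(j) {
    held_out <- fold == j
    b <- ridge_coef(X[!held_out, , drop = FALSE], y[!held_out], lam)
    mean((y[held_out] - X[held_out, , drop = FALSE] %*% b)^2)
  })
  mean(fold_mse)
})

best_lambda <- lambdas[which.min(cv_mse)]
```

The entry of `cv_mse` at `best_lambda` plays the role of the cvm value reported for lambda.min.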
Find the coefficients for each value of lambda
coefs = coef(r1, s=lambdas)
colnames(coefs) = paste0('lambda', round(lambdas,5))
head(coefs[,1:5],10)
## 10 x 5 sparse Matrix of class "dgCMatrix"
## lambda0 lambda0.01 lambda0.01028 lambda0.01057
## (Intercept) -0.727109349 -0.650123600 -0.649352907 -0.648565557
## MSSubClass -0.010061711 -0.020991731 -0.021162599 -0.021333511
## LotFrontage 0.008455491 0.009155435 0.009172338 0.009189599
## LotArea 0.095394453 0.090356912 0.090227707 0.090095526
## OverallQual 0.112131052 0.112120227 0.112112386 0.112103998
## OverallCond 0.078904901 0.075362489 0.075274496 0.075184572
## YearBuilt 0.126460527 0.106283963 0.105820135 0.105347903
## YearRemodAdd 0.028153745 0.029075086 0.029095190 0.029115608
## MasVnrArea 0.044286271 0.045666954 0.045698397 0.045730381
## BsmtFinSF1 0.099085055 0.092648442 0.092547734 0.092444663
## lambda0.01087
## (Intercept) -0.647761275
## MSSubClass -0.021504356
## LotFrontage 0.009207223
## LotArea 0.089960317
## OverallQual 0.112095036
## OverallCond 0.075092684
## YearBuilt 0.104867214
## YearRemodAdd 0.029136338
## MasVnrArea 0.045762906
## BsmtFinSF1 0.092339188
Reorganize the coefs data to prepare for ggplot.
coefs = as.data.frame(as.matrix(coefs))
coefs$var = rownames(coefs)
coefs = gather(coefs, key=lambda, value=coef, -var)
coefs$lambda = as.numeric(gsub('lambda', '', coefs$lambda))
coefs=coefs[!grepl('Intercept', coefs$var),]
head(coefs)
## var lambda coef
## 2 MSSubClass 0 -0.010061711
## 3 LotFrontage 0 0.008455491
## 4 LotArea 0 0.095394453
## 5 OverallQual 0 0.112131052
## 6 OverallCond 0 0.078904901
## 7 YearBuilt 0 0.126460527
Plot the coefficients as a function of lambda
require(ggplot2)
g= ggplot(data=coefs, aes(x=lambda, y=coef, group=var))+geom_line(alpha=0.7)+xlim(c(0,1000000))+
geom_vline(xintercept=best.lambda.r1, color='red')+
geom_vline(xintercept=0)
g
## Warning: Removed 90909 rows containing missing values (geom_path).
The cross-validated MSE at the best lambda is 0.160 on the standardized SalePrice scale, so the ridge model predicts reasonably well on held-out folds and is suitable for this data set.
Ridge regression does not provide significance tests for the features. Instead, the penalty shrinks all coefficients toward zero without setting any of them exactly to zero. Because the numeric predictors are standardized, the size of a coefficient still indicates the degree of influence of that feature on the result.
Use glmnet with alpha=1 for LASSO, with the same folds and the same grid of lambdas.
las1 = cv.glmnet(x=as.matrix(traindata[,-c(1,38)]), y=traindata$SalePrice, alpha=1,lambda=lambdas, foldid=fold, thresh=1e-16)
Find the best lambda
best.lambda.las1=las1$lambda.min
best.lambda.las1
## [1] 0.02230223
MSE for that lambda
las1$cvm[las1$lambda==best.lambda.las1]
## [1] 0.1695319
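This is slightly worse than ridge (cross-validated MSE of 0.1695 against 0.1599), but better than OLS, which corresponds to lambda=0 in the same grid.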
Let’s see how the coefficients change for different values of lambda. First we’ll find the coefficients for each value of lambda.
coefs = coef(las1, s=lambdas)
colnames(coefs) = paste0('lambda', round(lambdas,5))
head(coefs[,1:5],10)
## 10 x 5 sparse Matrix of class "dgCMatrix"
## lambda0 lambda0.01 lambda0.01028 lambda0.01057
## (Intercept) -0.584282518 -0.372344471 -0.368949806 -0.366746433
## MSSubClass -0.010061713 -0.034087211 -0.033808365 -0.033485623
## LotFrontage 0.008455489 0.006247016 0.005875402 0.005487881
## LotArea 0.095394455 0.047762134 0.047277461 0.046780403
## OverallQual 0.112131060 0.139301294 0.140278234 0.141269725
## OverallCond 0.078904901 0.053398405 0.052998294 0.052589690
## YearBuilt 0.126460565 0.084562408 0.084519006 0.084434524
## YearRemodAdd 0.028153751 0.028473983 0.028745958 0.029013614
## MasVnrArea 0.044286273 0.030778401 0.030232088 0.029680172
## BsmtFinSF1 0.117928743 0.087311289 0.087224290 0.087124384
## lambda0.01087
## (Intercept) -0.364879910
## MSSubClass -0.033142655
## LotFrontage 0.005087789
## LotArea 0.046269778
## OverallQual 0.142285045
## OverallCond 0.052170476
## YearBuilt 0.084335329
## YearRemodAdd 0.029285073
## MasVnrArea 0.029115786
## BsmtFinSF1 0.087018431
Reorganize the coefs data to prepare for ggplot.
coefs = as.data.frame(as.matrix(coefs))
coefs$var = rownames(coefs)
coefs = gather(coefs, key=lambda, value=coef, -var)
coefs$lambda = as.numeric(gsub('lambda', '', coefs$lambda))
coefs=coefs[!grepl('Intercept', coefs$var),]
head(coefs)
## var lambda coef
## 2 MSSubClass 0 -0.010061713
## 3 LotFrontage 0 0.008455489
## 4 LotArea 0 0.095394455
## 5 OverallQual 0 0.112131060
## 6 OverallCond 0 0.078904901
## 7 YearBuilt 0 0.126460565
Plot the coefficients as a function of lambda
g= ggplot(data=coefs, aes(x=lambda, y=coef, group=var))+geom_line(alpha=0.7)+xlim(c(0,50000))+
geom_vline(xintercept=best.lambda.las1, color='red')+
geom_vline(xintercept=0)
g
## Warning: Removed 120666 rows containing missing values (geom_path).
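The cross-validated MSE at the best lambda is 0.170 on the standardized SalePrice scale, so the LASSO model also predicts reasonably well on this data set, although slightly worse than ridge.
LASSO does not provide significance tests for the features either. Unlike ridge, its penalty sets some coefficients exactly to zero, so it performs feature selection: features whose coefficient is zero at the best lambda contribute nothing to the prediction, and the size of the remaining coefficients indicates the degree of influence of each feature on the result.
Finally, predict on the test set with each of the three models and write the submission files.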
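The different behavior of the two penalties can be seen in the simplest textbook case. Assuming an orthonormal design (X'X = I), minimizing 0.5 * ||y - Xb||^2 + lambda * sum(|b|) soft-thresholds each OLS coefficient, while minimizing 0.5 * ||y - Xb||^2 + 0.5 * lambda * sum(b^2) divides each one by (1 + lambda). The coefficient values below are made up for illustration.

```r
# Soft-thresholding: shrink toward zero by lambda and clip at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

ols_coef <- c(2.0, -0.5, 0.1, -1.2)  # hypothetical OLS estimates
lambda <- 0.6

lasso_coef <- soft_threshold(ols_coef, lambda)  # small coefficients become exactly 0
ridge_coef <- ols_coef / (1 + lambda)           # every coefficient shrinks, none is 0

rbind(ols = ols_coef, lasso = lasso_coef, ridge = ridge_coef)
```

Coefficients smaller in magnitude than lambda are removed by LASSO, whereas ridge keeps all of them at a reduced size.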
liner_predict = predict(liner_model , newdata=testdata)
## Warning in predict.lm(liner_model, newdata = testdata): prediction from a
## rank-deficient fit may be misleading
sampledata <- read_csv("/Users/milin/Downloads/rid1.csv")
## Parsed with column specification:
## cols(
## Id = col_double(),
## SalePrice = col_double()
## )
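# SalePrice and Id were standardized together with the other numeric columns.
# Take Id from test1 (unscaled) and convert predictions back to dollars using
# the mean and sd of the unscaled training SalePrice.
price_mean <- mean(train1$SalePrice)
price_sd <- sd(train1$SalePrice)
sub <- data.frame(id = test1$Id,value = liner_predict * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub,"liner.csv")
rid <- predict(r1,s=best.lambda.r1,newx=as.matrix(testdata[,-1]))
sub <- data.frame(id = test1$Id,value = as.numeric(rid) * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub,"rid.csv")
lasso <- predict(las1, s=best.lambda.las1,newx=as.matrix(testdata[,-1]))
sub <- data.frame(id = test1$Id,value = as.numeric(lasso) * price_sd + price_mean)
names(sub) <- names(sampledata)
write_csv(sub,"lasso.csv")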