Author's book web page: https://dataninja.me/ipds-kr/

Setting up the R environment

First, let's load the essential tidyverse package along with a few packages for machine learning. (suppressMessages() is used to hide the loading messages.)

# install.packages("tidyverse")
suppressMessages(library(tidyverse))

# install.packages(c("gridExtra", "ROCR", "MASS", "glmnet", "randomForest", "gbm", "rpart", "boot"))
suppressMessages(library(gridExtra))
suppressMessages(library(ROCR))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
## Warning: package 'glmnet' was built under R version 3.4.2
suppressMessages(library(randomForest))
suppressMessages(library(gbm))
suppressMessages(library(rpart))
suppressMessages(library(boot))

As described in the book, let's define the RMSE (root mean squared error), MAE (mean absolute error), and panel.cor functions:

rmse <- function(yi, yhat_i){
  sqrt(mean((yi - yhat_i)^2))
}

mae <- function(yi, yhat_i){
  mean(abs(yi - yhat_i))
}

# taken from example(pairs)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
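
As a quick sanity check of the two error measures, here is a minimal sketch on made-up numbers. The vectors y and yhat are illustrative assumptions, not values from the Ames data; the two functions are repeated so the snippet runs on its own.

```r
# Same definitions as above, repeated so this snippet is self-contained
rmse <- function(yi, yhat_i) sqrt(mean((yi - yhat_i)^2))
mae  <- function(yi, yhat_i) mean(abs(yi - yhat_i))

# Made-up observed and predicted values (illustration only)
y    <- c(100, 200, 300, 400)
yhat <- c(110, 190, 330, 400)

rmse(y, yhat)  # sqrt(mean(c(100, 100, 900, 0))) = sqrt(275), about 16.58
mae(y, yhat)   # mean(c(10, 10, 30, 0)) = 12.5
```

RMSE is larger than MAE here because squaring gives the single large error (30) more weight.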

13-1. (Analysis of Iowa housing price data)

Obtain the housing price data for Ames, Iowa (De Cock, 2011) and perform a regression analysis. The data are available from https://goo.gl/ul7Ub7 (https://www.kaggle.com/c/house-prices-advanced-regression-techniques), https://goo.gl/8gKgaT (http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls), or https://goo.gl/qgVg2z (https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt).

For a description of the variables, see https://goo.gl/2vcCfT (https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt). The document https://ww2.amstat.org/publications/jse/v19n3/decock.pdf is also a useful reference.

Perform a regression analysis on this data. Which of the regression methods described in the text gives the most accurate results? Summarize the results in a report.

Obtaining the data

First, download the data with the following commands:

wget https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt
wget https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls
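
If wget is not available, the same download can be sketched from within R. This is only a minimal sketch and not part of the original solution: fetch_if_missing is a hypothetical helper name, and it assumes the URLs above are still reachable. It uses only base R's file.exists() and download.file().

```r
# Hypothetical helper (not from the book): download a file only if it is
# not already present in the working directory.
fetch_if_missing <- function(url, destfile = basename(url)) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # already there, nothing to do
  }
  # mode = "wb" keeps binary files such as .xls intact on Windows
  download.file(url, destfile, mode = "wb")
  invisible(TRUE)
}

# Usage (needs network access, so it is left commented out):
# fetch_if_missing("https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt")
```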

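After reading the data into R, convert the variable names as follows:

  1. Use make.names(..., unique=TRUE) to turn the names into syntactically valid R names.
  2. Replace periods (.) with underscores (_).
  3. Convert everything to lowercase.

Then remove the ID variables order and pid.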

df1 <- read_tsv("AmesHousing.txt")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Order = col_integer(),
##   `Lot Frontage` = col_integer(),
##   `Lot Area` = col_integer(),
##   `Overall Qual` = col_integer(),
##   `Overall Cond` = col_integer(),
##   `Year Built` = col_integer(),
##   `Year Remod/Add` = col_integer(),
##   `Mas Vnr Area` = col_integer(),
##   `BsmtFin SF 1` = col_integer(),
##   `BsmtFin SF 2` = col_integer(),
##   `Bsmt Unf SF` = col_integer(),
##   `Total Bsmt SF` = col_integer(),
##   `1st Flr SF` = col_integer(),
##   `2nd Flr SF` = col_integer(),
##   `Low Qual Fin SF` = col_integer(),
##   `Gr Liv Area` = col_integer(),
##   `Bsmt Full Bath` = col_integer(),
##   `Bsmt Half Bath` = col_integer(),
##   `Full Bath` = col_integer(),
##   `Half Bath` = col_integer()
##   # ... with 17 more columns
## )
## See spec(...) for full column specifications.
names(df1) <- tolower(gsub("\\.", "_", make.names(names(df1), unique=TRUE)))
df1 <- df1 %>% dplyr::select(-order, -pid)
glimpse(df1)
## Observations: 2,930
## Variables: 80
## $ ms_subclass     <chr> "020", "020", "020", "020", "060", "060", "120...
## $ ms_zoning       <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL"...
## $ lot_frontage    <int> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, N...
## $ lot_area        <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"...
## $ alley           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lot_shape       <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg...
## $ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl...
## $ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPu...
## $ lot_config      <chr> "Corner", "Inside", "Corner", "Corner", "Insid...
## $ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl...
## $ neighborhood    <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert",...
## $ condition_1     <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm...
## $ condition_2     <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"...
## $ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam"...
## $ house_style     <chr> "1Story", "1Story", "1Story", "1Story", "2Stor...
## $ overall_qual    <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond    <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built      <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add  <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ roof_style      <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable"...
## $ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "C...
## $ exterior_1st    <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ exterior_2nd    <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ mas_vnr_type    <chr> "Stone", "None", "BrkFace", "None", "None", "B...
## $ mas_vnr_area    <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ exter_qual      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd"...
## $ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ foundation      <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc...
## $ bsmt_qual       <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd"...
## $ bsmt_cond       <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ bsmt_exposure   <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No"...
## $ bsmtfin_type_1  <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ...
## $ bsmtfin_sf_1    <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_type_2  <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf...
## $ bsmtfin_sf_2    <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf     <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf   <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"...
## $ heating_qc      <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex"...
## $ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "...
## $ x1st_flr_sf     <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf     <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area     <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath  <int> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath       <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath       <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr   <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ kitchen_qual    <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd"...
## $ totrms_abvgrd   <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ...
## $ fireplaces      <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ fireplace_qu    <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, NA, "TA", ...
## $ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attch...
## $ garage_yr_blt   <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_finish   <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin...
## $ garage_cars     <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area     <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ garage_qual     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ paved_drive     <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ wood_deck_sf    <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf   <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch  <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch    <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ pool_qc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fence           <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA, NA, NA, ...
## $ misc_feature    <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, NA, NA, NA...
## $ misc_val        <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold         <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6...
## $ yr_sold         <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD"...
## $ sale_condition  <chr> "Normal", "Normal", "Normal", "Normal", "Norma...
## $ saleprice       <int> 215000, 105000, 172000, 244000, 189900, 195500...
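
To see what the renaming step does, the following sketch applies the same make.names() / gsub() / tolower() chain to a handful of the original column names. The vector raw_names is just an illustrative subset; only base R is used.

```r
# A few of the original column names, as they appear in AmesHousing.txt
raw_names <- c("Order", "PID", "MS SubClass", "1st Flr SF",
               "Year Remod/Add", "3Ssn Porch")

# Step 1: make.names() replaces invalid characters with "." and
#         prefixes names that start with a digit with "X"
step1 <- make.names(raw_names, unique = TRUE)

# Step 2: periods become underscores; step 3: everything to lowercase
clean <- tolower(gsub("\\.", "_", step1))

clean
# "order" "pid" "ms_subclass" "x1st_flr_sf" "year_remod_add" "x3ssn_porch"
```

This explains the x prefix in names such as x1st_flr_sf and x3ssn_porch in the output above.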

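Handling missing values

Several variables in the data contain missing values. A simple way to find them is the summary() function:

# summary(df1)

Another way is the summarize_all() + funs() trick, which here computes the proportion of missing values in each column: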

df1 %>%
  summarize_all(funs(length(which(is.na(.)))/length(.))) %>% 
  glimpse()
## Observations: 1
## Variables: 80
## $ ms_subclass     <dbl> 0
## $ ms_zoning       <dbl> 0
## $ lot_frontage    <dbl> 0.1672355
## $ lot_area        <dbl> 0
## $ street          <dbl> 0
## $ alley           <dbl> 0.9324232
## $ lot_shape       <dbl> 0
## $ land_contour    <dbl> 0
## $ utilities       <dbl> 0
## $ lot_config      <dbl> 0
## $ land_slope      <dbl> 0
## $ neighborhood    <dbl> 0
## $ condition_1     <dbl> 0
## $ condition_2     <dbl> 0
## $ bldg_type       <dbl> 0
## $ house_style     <dbl> 0
## $ overall_qual    <dbl> 0
## $ overall_cond    <dbl> 0
## $ year_built      <dbl> 0
## $ year_remod_add  <dbl> 0
## $ roof_style      <dbl> 0
## $ roof_matl       <dbl> 0
## $ exterior_1st    <dbl> 0
## $ exterior_2nd    <dbl> 0
## $ mas_vnr_type    <dbl> 0.007849829
## $ mas_vnr_area    <dbl> 0.007849829
## $ exter_qual      <dbl> 0
## $ exter_cond      <dbl> 0
## $ foundation      <dbl> 0
## $ bsmt_qual       <dbl> 0.02730375
## $ bsmt_cond       <dbl> 0.02730375
## $ bsmt_exposure   <dbl> 0.02832765
## $ bsmtfin_type_1  <dbl> 0.02730375
## $ bsmtfin_sf_1    <dbl> 0.0003412969
## $ bsmtfin_type_2  <dbl> 0.02764505
## $ bsmtfin_sf_2    <dbl> 0.0003412969
## $ bsmt_unf_sf     <dbl> 0.0003412969
## $ total_bsmt_sf   <dbl> 0.0003412969
## $ heating         <dbl> 0
## $ heating_qc      <dbl> 0
## $ central_air     <dbl> 0
## $ electrical      <dbl> 0.0003412969
## $ x1st_flr_sf     <dbl> 0
## $ x2nd_flr_sf     <dbl> 0
## $ low_qual_fin_sf <dbl> 0
## $ gr_liv_area     <dbl> 0
## $ bsmt_full_bath  <dbl> 0.0006825939
## $ bsmt_half_bath  <dbl> 0.0006825939
## $ full_bath       <dbl> 0
## $ half_bath       <dbl> 0
## $ bedroom_abvgr   <dbl> 0
## $ kitchen_abvgr   <dbl> 0
## $ kitchen_qual    <dbl> 0
## $ totrms_abvgrd   <dbl> 0
## $ functional      <dbl> 0
## $ fireplaces      <dbl> 0
## $ fireplace_qu    <dbl> 0.4853242
## $ garage_type     <dbl> 0.05358362
## $ garage_yr_blt   <dbl> 0.05426621
## $ garage_finish   <dbl> 0.05426621
## $ garage_cars     <dbl> 0.0003412969
## $ garage_area     <dbl> 0.0003412969
## $ garage_qual     <dbl> 0.05426621
## $ garage_cond     <dbl> 0.05426621
## $ paved_drive     <dbl> 0
## $ wood_deck_sf    <dbl> 0
## $ open_porch_sf   <dbl> 0
## $ enclosed_porch  <dbl> 0
## $ x3ssn_porch     <dbl> 0
## $ screen_porch    <dbl> 0
## $ pool_area       <dbl> 0
## $ pool_qc         <dbl> 0.9955631
## $ fence           <dbl> 0.8047782
## $ misc_feature    <dbl> 0.9638225
## $ misc_val        <dbl> 0
## $ mo_sold         <dbl> 0
## $ yr_sold         <dbl> 0
## $ sale_type       <dbl> 0
## $ sale_condition  <dbl> 0
## $ saleprice       <dbl> 0
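
Note that funs() has been deprecated in dplyr since version 0.8.0, so the call above may produce warnings or fail on a newer installation. The same per-column missing rate can be computed with base R alone; the sketch below shows this on a made-up data frame (toy is an assumption for illustration, not the Ames data).

```r
# Made-up data frame with some missing values (illustration only)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4,
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; the column means are the missing rates
na_rate <- colMeans(is.na(toy))
na_rate
#    a    b    c
# 0.50 0.25 0.00

# Keep only the columns that actually have missing values
na_rate[na_rate > 0]

# On the real data this would be colMeans(is.na(df1)); with dplyr >= 1.0.0,
# summarize(across(everything(), ~ mean(is.na(.x)))) is an equivalent form.
```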

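This shows that many variables have missing values. Most of the affected columns are missing only a small fraction of their values, but alley, pool_qc, fence, and misc_feature are missing in more than 80% of the rows.

There are many ways to deal with missing values, but here we keep it simple:

  1. Numeric variables are imputed with the median.
  2. Character variables are imputed with the string "NA".

The commands below do this using mutate_if(), rename_all(), and related functions: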

df2 <- df1 %>%
  mutate_if(is.numeric, funs(imp=ifelse(is.na(.), median(., na.rm=TRUE), .))) %>%
  # mutate_if(is.character, funs(imp=ifelse(is.na(.), sort(table(.), decreasing=TRUE)[1], .))) %>%
  mutate_if(is.character, funs(imp=ifelse(is.na(.), "NA", .))) %>%
  dplyr::select(ends_with("_imp")) %>%
  rename_all(funs(gsub("_imp", "", .)))
df2 %>% glimpse()
## Observations: 2,930
## Variables: 80
## $ lot_frontage    <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 6...
## $ lot_area        <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ overall_qual    <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond    <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built      <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add  <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ mas_vnr_area    <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bsmtfin_sf_1    <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_sf_2    <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf     <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf   <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x1st_flr_sf     <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf     <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area     <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath       <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath       <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr   <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ totrms_abvgrd   <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ fireplaces      <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ garage_yr_blt   <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_cars     <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area     <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ wood_deck_sf    <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf   <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch  <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch    <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ misc_val        <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold         <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6...
## $ yr_sold         <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ saleprice       <int> 215000, 105000, 172000, 244000, 189900, 195500...
## $ ms_subclass     <chr> "020", "020", "020", "020", "060", "060", "120...
## $ ms_zoning       <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL"...
## $ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"...
## $ alley           <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"...
## $ lot_shape       <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg...
## $ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl...
## $ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPu...
## $ lot_config      <chr> "Corner", "Inside", "Corner", "Corner", "Insid...
## $ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl...
## $ neighborhood    <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert",...
## $ condition_1     <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm...
## $ condition_2     <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"...
## $ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam"...
## $ house_style     <chr> "1Story", "1Story", "1Story", "1Story", "2Stor...
## $ roof_style      <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable"...
## $ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "C...
## $ exterior_1st    <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ exterior_2nd    <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ mas_vnr_type    <chr> "Stone", "None", "BrkFace", "None", "None", "B...
## $ exter_qual      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd"...
## $ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ foundation      <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc...
## $ bsmt_qual       <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd"...
## $ bsmt_cond       <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ bsmt_exposure   <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No"...
## $ bsmtfin_type_1  <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ...
## $ bsmtfin_type_2  <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf...
## $ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"...
## $ heating_qc      <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex"...
## $ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "...
## $ kitchen_qual    <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd"...
## $ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ...
## $ fireplace_qu    <chr> "Gd", "NA", "NA", "TA", "TA", "Gd", "NA", "NA"...
## $ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attch...
## $ garage_finish   <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin...
## $ garage_qual     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ paved_drive     <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ pool_qc         <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"...
## $ fence           <chr> "NA", "MnPrv", "NA", "NA", "MnPrv", "NA", "NA"...
## $ misc_feature    <chr> "NA", "NA", "Gar2", "NA", "NA", "NA", "NA", "N...
## $ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD"...
## $ sale_condition  <chr> "Normal", "Normal", "Normal", "Normal", "Norma...
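
The imputation rule itself is independent of the dplyr helpers used above. As a minimal sketch, assuming a made-up data frame toy and a hypothetical helper impute_column (neither is from the book), the same logic in base R looks like this:

```r
# Made-up data with missing values in a numeric and a character column
toy <- data.frame(
  area  = c(10, NA, 30, 50),
  fence = c("MnPrv", NA, NA, "GdWo"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: median for numeric columns, the string "NA" for
# character columns; any other column type is returned unchanged
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else if (is.character(x)) {
    x[is.na(x)] <- "NA"
  }
  x
}

toy_imp <- as.data.frame(lapply(toy, impute_column), stringsAsFactors = FALSE)
toy_imp
# area:  10 30 30 50   (median of 10, 30, 50 is 30)
# fence: "MnPrv" "NA" "NA" "GdWo"
```

After this step "NA" is an ordinary category label, so is.na() no longer flags those rows.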

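The mo_sold variable was read in as numeric, but it is probably better treated as categorical. Going through the variables one by one would suggest many more preprocessing steps, but for now let's save the data transformed as above as our analysis dataset: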

df <- df2 %>% mutate(mo_sold=as.character(mo_sold))

Splitting into training, validation, and test sets

Let's split the original data into training, validation, and test sets in a 6:2:2 ratio. (set.seed() is used for reproducibility.)

set.seed(2017)
n <- nrow(df)
idx <- 1:n
training_idx <- sample(idx, n * .60)
idx <- setdiff(idx, training_idx)
validate_idx <- sample(idx, n * .20)
test_idx <- setdiff(idx, validate_idx)
length(training_idx)
## [1] 1758
length(validate_idx)
## [1] 586
length(test_idx)
## [1] 586
training <- df[training_idx,]
validation <- df[validate_idx,]
test <- df[test_idx,]
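
It is worth confirming that the three index sets are disjoint and together cover every row. The sketch below repeats the split on row numbers only, with n fixed at the 2,930 rows reported above, and uses round() to make the set sizes explicit (an assumption added here; the original passes n * .60 directly). Since R 3.6.0 changed the default sampling algorithm, the particular rows drawn may differ from those of the original run, but the set sizes do not.

```r
set.seed(2017)
n <- 2930                      # number of rows in the Ames data
idx <- seq_len(n)

training_idx <- sample(idx, round(n * 0.60))
rest         <- setdiff(idx, training_idx)
validate_idx <- sample(rest, round(n * 0.20))
test_idx     <- setdiff(rest, validate_idx)

# Sizes of the three sets: 1758, 586, 586
sizes <- c(length(training_idx), length(validate_idx), length(test_idx))
sizes

# No row appears in two sets, and every row appears in exactly one
no_overlap <- length(intersect(training_idx, validate_idx)) == 0 &&
  length(intersect(training_idx, test_idx)) == 0 &&
  length(intersect(validate_idx, test_idx)) == 0
full_cover <- setequal(c(training_idx, validate_idx, test_idx), idx)
```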

Some modeling functions do not automatically convert character variables to factors, so let's also build the following datasets, using mutate_if().

dff <- df %>% mutate_if(is.character, as.factor)
glimpse(dff)
## Observations: 2,930
## Variables: 80
## $ lot_frontage    <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 6...
## $ lot_area        <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ overall_qual    <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond    <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built      <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add  <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ mas_vnr_area    <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bsmtfin_sf_1    <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_sf_2    <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf     <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf   <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x1st_flr_sf     <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf     <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area     <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath       <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath       <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr   <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ totrms_abvgrd   <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ fireplaces      <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ garage_yr_blt   <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_cars     <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area     <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ wood_deck_sf    <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf   <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch  <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch    <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ misc_val        <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold         <fctr> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, ...
## $ yr_sold         <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ saleprice       <int> 215000, 105000, 172000, 244000, 189900, 195500...
## $ ms_subclass     <fctr> 020, 020, 020, 020, 060, 060, 120, 120, 120, ...
## $ ms_zoning       <fctr> RL, RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, R...
## $ street          <fctr> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav...
## $ alley           <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lot_shape       <fctr> IR1, Reg, IR1, Reg, IR1, IR1, Reg, IR1, IR1, ...
## $ land_contour    <fctr> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, ...
## $ utilities       <fctr> AllPub, AllPub, AllPub, AllPub, AllPub, AllPu...
## $ lot_config      <fctr> Corner, Inside, Corner, Corner, Inside, Insid...
## $ land_slope      <fctr> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, ...
## $ neighborhood    <fctr> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert,...
## $ condition_1     <fctr> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, No...
## $ condition_2     <fctr> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor...
## $ bldg_type       <fctr> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, T...
## $ house_style     <fctr> 1Story, 1Story, 1Story, 1Story, 2Story, 2Stor...
## $ roof_style      <fctr> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Ga...
## $ roof_matl       <fctr> CompShg, CompShg, CompShg, CompShg, CompShg, ...
## $ exterior_1st    <fctr> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, ...
## $ exterior_2nd    <fctr> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, ...
## $ mas_vnr_type    <fctr> Stone, None, BrkFace, None, None, BrkFace, No...
## $ exter_qual      <fctr> TA, TA, TA, Gd, TA, TA, Gd, Gd, Gd, TA, TA, T...
## $ exter_cond      <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, G...
## $ foundation      <fctr> CBlock, CBlock, CBlock, CBlock, PConc, PConc,...
## $ bsmt_qual       <fctr> TA, TA, TA, TA, Gd, TA, Gd, Gd, Gd, TA, Gd, G...
## $ bsmt_cond       <fctr> Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ bsmt_exposure   <fctr> Gd, No, No, No, No, No, Mn, No, No, No, No, N...
## $ bsmtfin_type_1  <fctr> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, ...
## $ bsmtfin_type_2  <fctr> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, ...
## $ heating         <fctr> GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas...
## $ heating_qc      <fctr> Fa, TA, TA, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Gd, E...
## $ central_air     <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ electrical      <fctr> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBr...
## $ kitchen_qual    <fctr> TA, TA, Gd, Ex, TA, Gd, Gd, Gd, Gd, Gd, TA, T...
## $ functional      <fctr> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, ...
## $ fireplace_qu    <fctr> Gd, NA, NA, TA, TA, Gd, NA, NA, TA, TA, TA, N...
## $ garage_type     <fctr> Attchd, Attchd, Attchd, Attchd, Attchd, Attch...
## $ garage_finish   <fctr> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, ...
## $ garage_qual     <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ garage_cond     <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ paved_drive     <fctr> P, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ pool_qc         <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ fence           <fctr> NA, MnPrv, NA, NA, MnPrv, NA, NA, NA, NA, NA,...
## $ misc_feature    <fctr> NA, NA, Gar2, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ sale_type       <fctr> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, W...
## $ sale_condition  <fctr> Normal, Normal, Normal, Normal, Normal, Norma...
training_f <- dff[training_idx, ]
validation_f <- dff[validate_idx, ]
test_f <- dff[test_idx, ]
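
One practical benefit of converting to factors on the full data before splitting is that every subset keeps the complete set of levels, even for categories that happen not to occur in it. The sketch below illustrates this on a made-up data frame (toy and its values are assumptions, not Ames rows).

```r
# Made-up data: "FV" occurs only in the last row
toy <- data.frame(
  zone  = c("RL", "RH", "RL", "FV"),
  price = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Convert on the full data first, then subset
toy_f <- toy
toy_f$zone <- as.factor(toy_f$zone)
train_f <- toy_f[1:3, ]

levels(train_f$zone)
# "FV" "RH" "RL"  (FV is kept although no training row has it)

# Converting after subsetting would drop the unseen level instead
levels(as.factor(toy$zone[1:3]))
# "RH" "RL"
```

With a shared level set, the training, validation, and test frames encode each category consistently.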

A. Regression analysis

First, let's fit a linear model that includes all of the variables:

df_lm_full <- lm(saleprice ~ ., data=training_f)
summary(df_lm_full)
## 
## Call:
## lm(formula = saleprice ~ ., data = training_f)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -257041  -10232       0    9485  152022 
## 
## Coefficients: (9 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.557e+04  9.486e+05   0.048 0.961687    
## lot_frontage           4.009e+01  4.198e+01   0.955 0.339757    
## lot_area               5.299e-01  1.263e-01   4.197 2.87e-05 ***
## overall_qual           5.773e+03  9.274e+02   6.225 6.27e-10 ***
## overall_cond           6.343e+03  7.781e+02   8.151 7.61e-16 ***
## year_built             4.088e+02  7.537e+01   5.424 6.82e-08 ***
## year_remod_add         8.607e+01  4.990e+01   1.725 0.084788 .  
## mas_vnr_area           2.872e+01  5.090e+00   5.642 2.01e-08 ***
## bsmtfin_sf_1           4.225e+01  4.390e+00   9.623  < 2e-16 ***
## bsmtfin_sf_2           3.891e+01  7.774e+00   5.004 6.27e-07 ***
## bsmt_unf_sf            2.036e+01  4.048e+00   5.029 5.53e-07 ***
## total_bsmt_sf                 NA         NA      NA       NA    
## x1st_flr_sf            5.013e+01  4.844e+00  10.348  < 2e-16 ***
## x2nd_flr_sf            5.361e+01  4.963e+00  10.803  < 2e-16 ***
## low_qual_fin_sf        2.650e+01  1.346e+01   1.969 0.049153 *  
## gr_liv_area                   NA         NA      NA       NA    
## bsmt_full_bath         2.572e+03  1.712e+03   1.503 0.133072    
## bsmt_half_bath         6.793e+02  2.617e+03   0.260 0.795230    
## full_bath              4.939e+03  1.921e+03   2.571 0.010225 *  
## half_bath              2.598e+03  1.841e+03   1.411 0.158410    
## bedroom_abvgr         -3.978e+03  1.203e+03  -3.307 0.000966 ***
## kitchen_abvgr         -1.360e+04  5.698e+03  -2.386 0.017139 *  
## totrms_abvgrd          1.779e+03  8.168e+02   2.178 0.029546 *  
## fireplaces             8.786e+03  2.352e+03   3.736 0.000194 ***
## garage_yr_blt          2.580e+01  4.993e+01   0.517 0.605487    
## garage_cars            3.532e+03  2.074e+03   1.703 0.088728 .  
## garage_area            1.421e+01  7.051e+00   2.016 0.044014 *  
## wood_deck_sf           6.186e+00  5.335e+00   1.159 0.246487    
## open_porch_sf         -3.697e+00  9.952e+00  -0.371 0.710353    
## enclosed_porch         1.187e+01  1.143e+01   1.039 0.298995    
## x3ssn_porch            1.344e+01  2.161e+01   0.622 0.534212    
## screen_porch           3.779e+01  1.061e+01   3.563 0.000378 ***
## pool_area             -1.826e+02  1.462e+02  -1.249 0.211928    
## misc_val               1.989e-01  1.948e+00   0.102 0.918706    
## mo_sold10             -1.128e+04  3.778e+03  -2.985 0.002879 ** 
## mo_sold11             -6.455e+03  3.840e+03  -1.681 0.093001 .  
## mo_sold12             -6.455e+01  4.169e+03  -0.015 0.987649    
## mo_sold2              -2.761e+03  3.987e+03  -0.692 0.488736    
## mo_sold3              -4.585e+03  3.544e+03  -1.294 0.195902    
## mo_sold4              -5.168e+03  3.440e+03  -1.502 0.133195    
## mo_sold5              -2.309e+03  3.302e+03  -0.699 0.484532    
## mo_sold6              -4.234e+03  3.181e+03  -1.331 0.183396    
## mo_sold7              -1.997e+03  3.206e+03  -0.623 0.533395    
## mo_sold8              -8.717e+03  3.579e+03  -2.436 0.014986 *  
## mo_sold9              -8.820e+03  3.798e+03  -2.322 0.020360 *  
## yr_sold               -7.759e+02  4.649e+02  -1.669 0.095382 .  
## ms_subclass030         6.697e+03  4.148e+03   1.615 0.106575    
## ms_subclass040         1.962e+04  1.293e+04   1.517 0.129515    
## ms_subclass045         3.141e+03  1.775e+04   0.177 0.859572    
## ms_subclass050         7.805e+03  7.293e+03   1.070 0.284697    
## ms_subclass060         2.113e+03  7.066e+03   0.299 0.764951    
## ms_subclass070         1.199e+04  7.746e+03   1.548 0.121898    
## ms_subclass075         2.403e+04  1.544e+04   1.556 0.119870    
## ms_subclass080        -1.697e+04  1.301e+04  -1.304 0.192327    
## ms_subclass085        -5.618e+03  8.583e+03  -0.655 0.512866    
## ms_subclass090        -1.647e+04  6.969e+03  -2.363 0.018253 *  
## ms_subclass120        -2.091e+04  1.229e+04  -1.701 0.089147 .  
## ms_subclass150        -5.152e+04  3.002e+04  -1.716 0.086338 .  
## ms_subclass160        -2.418e+04  1.482e+04  -1.631 0.103070    
## ms_subclass180        -2.325e+04  1.739e+04  -1.337 0.181369    
## ms_subclass190        -1.096e+04  6.878e+03  -1.593 0.111298    
## ms_zoningFV           -6.488e+03  1.118e+04  -0.580 0.561830    
## ms_zoningI (all)       5.009e+04  3.676e+04   1.362 0.173258    
## ms_zoningRH            1.898e+04  1.145e+04   1.657 0.097633 .  
## ms_zoningRL            5.054e+03  9.546e+03   0.529 0.596566    
## ms_zoningRM           -3.379e+02  8.969e+03  -0.038 0.969957    
## streetPave             1.891e+04  1.079e+04   1.751 0.080093 .  
## alleyNA               -2.612e+02  3.571e+03  -0.073 0.941699    
## alleyPave             -2.561e+03  5.382e+03  -0.476 0.634240    
## lot_shapeIR2           2.580e+03  3.804e+03   0.678 0.497843    
## lot_shapeIR3           7.752e+03  8.850e+03   0.876 0.381160    
## lot_shapeReg           1.304e+03  1.473e+03   0.885 0.376064    
## land_contourHLS        8.955e+03  4.501e+03   1.989 0.046843 *  
## land_contourLow       -1.397e+03  5.703e+03  -0.245 0.806475    
## land_contourLvl        5.155e+03  3.297e+03   1.564 0.118083    
## utilitiesNoSewr       -3.843e+04  2.487e+04  -1.545 0.122473    
## lot_configCulDSac      3.128e+03  3.071e+03   1.019 0.308559    
## lot_configFR2         -4.623e+03  3.791e+03  -1.219 0.222881    
## lot_configFR3          9.548e+02  9.550e+03   0.100 0.920370    
## lot_configInside       1.902e+02  1.659e+03   0.115 0.908693    
## land_slopeMod          1.508e+03  3.635e+03   0.415 0.678368    
## land_slopeSev         -2.150e+04  1.075e+04  -2.001 0.045616 *  
## neighborhoodBlueste   -2.537e+03  1.548e+04  -0.164 0.869811    
## neighborhoodBrDale    -2.395e+03  1.049e+04  -0.228 0.819521    
## neighborhoodBrkSide   -1.078e+04  8.162e+03  -1.321 0.186757    
## neighborhoodClearCr   -1.328e+04  8.636e+03  -1.538 0.124374    
## neighborhoodCollgCr   -1.395e+04  6.385e+03  -2.185 0.029073 *  
## neighborhoodCrawfor   -3.369e+03  7.471e+03  -0.451 0.652051    
## neighborhoodEdwards   -2.723e+04  7.061e+03  -3.856 0.000120 ***
## neighborhoodGilbert   -1.425e+04  6.691e+03  -2.130 0.033327 *  
## neighborhoodGreens     4.442e+03  1.291e+04   0.344 0.730782    
## neighborhoodGrnHill    6.562e+04  2.516e+04   2.609 0.009181 ** 
## neighborhoodIDOTRR    -1.757e+04  9.354e+03  -1.879 0.060490 .  
## neighborhoodMeadowV   -3.490e+03  1.035e+04  -0.337 0.736073    
## neighborhoodMitchel   -2.211e+04  7.126e+03  -3.103 0.001951 ** 
## neighborhoodNAmes     -2.103e+04  6.932e+03  -3.034 0.002453 ** 
## neighborhoodNoRidge    1.655e+04  7.442e+03   2.224 0.026288 *  
## neighborhoodNPkVill    2.116e+03  1.370e+04   0.154 0.877300    
## neighborhoodNridgHt    7.636e+03  6.596e+03   1.158 0.247194    
## neighborhoodNWAmes    -2.173e+04  7.063e+03  -3.076 0.002133 ** 
## neighborhoodOldTown   -2.313e+04  8.269e+03  -2.798 0.005215 ** 
## neighborhoodSawyer    -1.694e+04  7.179e+03  -2.360 0.018415 *  
## neighborhoodSawyerW   -1.499e+04  6.778e+03  -2.212 0.027142 *  
## neighborhoodSomerst    1.372e+04  7.964e+03   1.722 0.085238 .  
## neighborhoodStoneBr    4.069e+04  7.442e+03   5.467 5.36e-08 ***
## neighborhoodSWISU     -2.468e+04  8.494e+03  -2.906 0.003717 ** 
## neighborhoodTimber    -1.887e+04  7.093e+03  -2.661 0.007883 ** 
## neighborhoodVeenker   -1.952e+04  9.624e+03  -2.028 0.042764 *  
## condition_1Feedr       4.726e+03  4.452e+03   1.062 0.288554    
## condition_1Norm        1.187e+04  3.657e+03   3.247 0.001194 ** 
## condition_1PosA        1.577e+04  8.511e+03   1.853 0.064086 .  
## condition_1PosN        1.826e+04  6.491e+03   2.812 0.004981 ** 
## condition_1RRAe       -6.776e+02  6.536e+03  -0.104 0.917448    
## condition_1RRAn        6.524e+03  6.281e+03   1.039 0.299118    
## condition_1RRNe        2.147e+03  1.317e+04   0.163 0.870500    
## condition_1RRNn       -1.033e+04  1.151e+04  -0.898 0.369572    
## condition_2Feedr       4.009e+03  1.727e+04   0.232 0.816462    
## condition_2Norm        1.676e+04  1.595e+04   1.051 0.293401    
## condition_2PosA        6.685e+04  2.083e+04   3.210 0.001356 ** 
## condition_2PosN       -1.363e+05  2.207e+04  -6.176 8.47e-10 ***
## condition_2RRAn        1.392e+04  2.855e+04   0.487 0.626012    
## bldg_type2fmCon               NA         NA      NA       NA    
## bldg_typeDuplex               NA         NA      NA       NA    
## bldg_typeTwnhs        -3.752e+03  1.313e+04  -0.286 0.775080    
## bldg_typeTwnhsE        6.377e+02  1.231e+04   0.052 0.958710    
## house_style1.5Unf      7.990e+03  1.742e+04   0.459 0.646569    
## house_style1Story      6.291e+03  7.235e+03   0.870 0.384696    
## house_style2.5Fin     -1.779e+04  1.935e+04  -0.919 0.358082    
## house_style2.5Unf     -1.512e+04  1.479e+04  -1.022 0.306904    
## house_style2Story     -2.873e+00  7.276e+03   0.000 0.999685    
## house_styleSFoyer      7.365e+03  9.445e+03   0.780 0.435676    
## house_styleSLvl        2.082e+04  1.348e+04   1.544 0.122869    
## roof_styleGable        3.220e+03  2.562e+04   0.126 0.899999    
## roof_styleGambrel     -6.711e+03  2.647e+04  -0.254 0.799873    
## roof_styleHip          4.708e+03  2.567e+04   0.183 0.854489    
## roof_styleMansard     -1.150e+04  2.764e+04  -0.416 0.677310    
## roof_styleShed         2.959e+04  3.228e+04   0.917 0.359422    
## roof_matlMetal         2.728e+04  3.622e+04   0.753 0.451463    
## roof_matlRoll         -6.539e+03  2.487e+04  -0.263 0.792628    
## roof_matlTar&Grv      -8.631e+02  2.410e+04  -0.036 0.971430    
## roof_matlWdShake      -1.278e+04  1.348e+04  -0.948 0.343303    
## roof_matlWdShngl      -2.699e+03  2.029e+04  -0.133 0.894187    
## exterior_1stBrkComm    1.801e+04  1.785e+04   1.009 0.313285    
## exterior_1stBrkFace    2.571e+04  1.245e+04   2.065 0.039086 *  
## exterior_1stCemntBd   -1.619e+04  2.050e+04  -0.790 0.429699    
## exterior_1stHdBoard    1.677e+03  1.210e+04   0.139 0.889769    
## exterior_1stMetalSd    1.003e+04  1.365e+04   0.735 0.462595    
## exterior_1stPlywood    7.851e+03  1.198e+04   0.655 0.512306    
## exterior_1stPreCast    3.762e+04  2.624e+04   1.434 0.151791    
## exterior_1stStone      5.655e+03  3.013e+04   0.188 0.851154    
## exterior_1stStucco     5.442e+03  1.348e+04   0.404 0.686396    
## exterior_1stVinylSd    3.885e+03  1.318e+04   0.295 0.768222    
## exterior_1stWd Sdng    1.618e+03  1.185e+04   0.137 0.891395    
## exterior_1stWdShing    3.631e+03  1.268e+04   0.286 0.774604    
## exterior_2ndAsphShn    8.215e+03  2.148e+04   0.383 0.702142    
## exterior_2ndBrk Cmn   -1.025e+04  1.736e+04  -0.590 0.554953    
## exterior_2ndBrkFace   -1.522e+04  1.335e+04  -1.141 0.254181    
## exterior_2ndCBlock    -7.449e+03  2.985e+04  -0.250 0.802993    
## exterior_2ndCmentBd    9.969e+03  2.059e+04   0.484 0.628396    
## exterior_2ndHdBoard   -4.972e+03  1.214e+04  -0.409 0.682271    
## exterior_2ndImStucc   -2.424e+04  1.583e+04  -1.531 0.125890    
## exterior_2ndMetalSd   -9.364e+03  1.368e+04  -0.684 0.493810    
## exterior_2ndPlywood   -9.937e+03  1.179e+04  -0.843 0.399577    
## exterior_2ndPreCast           NA         NA      NA       NA    
## exterior_2ndStone     -3.962e+04  2.356e+04  -1.682 0.092874 .  
## exterior_2ndStucco    -1.749e+02  1.338e+04  -0.013 0.989570    
## exterior_2ndVinylSd   -6.997e+03  1.310e+04  -0.534 0.593331    
## exterior_2ndWd Sdng   -3.278e+03  1.188e+04  -0.276 0.782597    
## exterior_2ndWd Shng   -7.669e+03  1.235e+04  -0.621 0.534596    
## mas_vnr_typeBrkFace    1.441e+04  8.981e+03   1.605 0.108706    
## mas_vnr_typeNA         1.836e+04  1.146e+04   1.602 0.109474    
## mas_vnr_typeNone       2.064e+04  9.054e+03   2.280 0.022771 *  
## mas_vnr_typeStone      2.087e+04  9.218e+03   2.264 0.023711 *  
## exter_qualFa          -1.914e+04  8.782e+03  -2.179 0.029462 *  
## exter_qualGd          -2.242e+04  4.039e+03  -5.552 3.35e-08 ***
## exter_qualTA          -2.394e+04  4.606e+03  -5.197 2.31e-07 ***
## exter_condFa          -2.434e+03  1.057e+04  -0.230 0.817822    
## exter_condGd           1.234e+03  9.261e+03   0.133 0.893993    
## exter_condPo          -5.491e+03  2.039e+04  -0.269 0.787760    
## exter_condTA           3.804e+03  9.214e+03   0.413 0.679809    
## foundationCBlock      -9.680e+02  2.985e+03  -0.324 0.745752    
## foundationPConc        5.323e+02  3.123e+03   0.170 0.864675    
## foundationSlab        -5.579e+03  8.654e+03  -0.645 0.519204    
## foundationStone        1.890e+03  1.201e+04   0.157 0.874998    
## foundationWood        -2.322e+03  1.271e+04  -0.183 0.855015    
## bsmt_qualFa           -1.801e+04  5.222e+03  -3.449 0.000579 ***
## bsmt_qualGd           -2.048e+04  2.931e+03  -6.987 4.22e-12 ***
## bsmt_qualNA            3.149e+04  3.485e+04   0.904 0.366292    
## bsmt_qualPo           -1.470e+04  3.121e+04  -0.471 0.637757    
## bsmt_qualTA           -1.821e+04  3.679e+03  -4.948 8.34e-07 ***
## bsmt_condFa           -2.380e+02  1.836e+04  -0.013 0.989660    
## bsmt_condGd            1.208e+03  1.823e+04   0.066 0.947179    
## bsmt_condNA                   NA         NA      NA       NA    
## bsmt_condPo            3.054e+04  2.696e+04   1.133 0.257447    
## bsmt_condTA            2.587e+03  1.804e+04   0.143 0.885977    
## bsmt_exposureGd        1.192e+04  2.629e+03   4.536 6.20e-06 ***
## bsmt_exposureMn       -6.628e+03  2.736e+03  -2.423 0.015531 *  
## bsmt_exposureNA       -1.624e+04  1.335e+04  -1.217 0.223897    
## bsmt_exposureNo       -5.595e+03  1.987e+03  -2.816 0.004932 ** 
## bsmtfin_type_1BLQ      7.058e+02  2.586e+03   0.273 0.784936    
## bsmtfin_type_1GLQ      8.890e+02  2.256e+03   0.394 0.693614    
## bsmtfin_type_1LwQ     -5.453e+03  3.291e+03  -1.657 0.097733 .  
## bsmtfin_type_1NA              NA         NA      NA       NA    
## bsmtfin_type_1Rec     -2.001e+03  2.577e+03  -0.776 0.437615    
## bsmtfin_type_1Unf      2.068e+03  2.589e+03   0.799 0.424585    
## bsmtfin_type_2BLQ     -6.555e+03  5.939e+03  -1.104 0.269940    
## bsmtfin_type_2GLQ      1.063e+04  7.627e+03   1.394 0.163655    
## bsmtfin_type_2LwQ     -8.222e+03  5.699e+03  -1.443 0.149275    
## bsmtfin_type_2NA      -1.684e+04  2.407e+04  -0.700 0.484279    
## bsmtfin_type_2Rec     -5.434e+03  5.558e+03  -0.978 0.328354    
## bsmtfin_type_2Unf      2.859e+02  5.699e+03   0.050 0.959986    
## heatingGasA            1.443e+04  2.483e+04   0.581 0.561103    
## heatingGasW            1.389e+04  2.591e+04   0.536 0.591872    
## heatingGrav            8.876e+02  2.776e+04   0.032 0.974497    
## heatingOthW           -1.552e+04  3.068e+04  -0.506 0.613090    
## heatingWall            3.521e+04  3.250e+04   1.084 0.278755    
## heating_qcFa          -2.522e+03  4.140e+03  -0.609 0.542459    
## heating_qcGd          -2.447e+03  1.868e+03  -1.310 0.190485    
## heating_qcPo          -9.940e+03  1.834e+04  -0.542 0.587838    
## heating_qcTA          -2.442e+03  1.845e+03  -1.323 0.185982    
## central_airY          -3.777e+03  3.231e+03  -1.169 0.242600    
## electricalFuseF       -1.688e+03  5.743e+03  -0.294 0.768815    
## electricalFuseP        1.255e+03  1.121e+04   0.112 0.910879    
## electricalMix          4.063e+02  3.594e+04   0.011 0.990984    
## electricalNA           1.606e+04  2.338e+04   0.687 0.492278    
## electricalSBrkr        1.853e+02  2.619e+03   0.071 0.943596    
## kitchen_qualFa        -1.820e+04  5.627e+03  -3.235 0.001245 ** 
## kitchen_qualGd        -2.129e+04  3.161e+03  -6.736 2.32e-11 ***
## kitchen_qualTA        -2.191e+04  3.608e+03  -6.072 1.60e-09 ***
## functionalMaj2        -4.872e+03  1.370e+04  -0.356 0.722260    
## functionalMin1         9.031e+03  8.622e+03   1.047 0.295065    
## functionalMin2         1.162e+04  8.546e+03   1.360 0.174081    
## functionalMod         -5.091e+01  9.579e+03  -0.005 0.995760    
## functionalSal          1.796e+04  2.679e+04   0.670 0.502700    
## functionalTyp          1.850e+04  7.643e+03   2.420 0.015641 *  
## fireplace_quFa        -3.287e+03  6.210e+03  -0.529 0.596632    
## fireplace_quGd         7.123e+02  4.817e+03   0.148 0.882471    
## fireplace_quNA         8.313e+03  5.664e+03   1.468 0.142393    
## fireplace_quPo         2.355e+03  6.541e+03   0.360 0.718884    
## fireplace_quTA         3.300e+02  4.976e+03   0.066 0.947131    
## garage_typeAttchd      7.529e+03  6.447e+03   1.168 0.243062    
## garage_typeBasment     1.415e+04  8.604e+03   1.645 0.100161    
## garage_typeBuiltIn     8.217e+03  7.045e+03   1.166 0.243641    
## garage_typeCarPort     5.241e+03  1.073e+04   0.489 0.625204    
## garage_typeDetchd      9.157e+03  6.444e+03   1.421 0.155536    
## garage_typeNA          3.356e+04  2.653e+04   1.265 0.206035    
## garage_finishNA       -2.584e+04  2.996e+04  -0.863 0.388527    
## garage_finishRFn      -8.534e+02  1.773e+03  -0.481 0.630415    
## garage_finishUnf       2.644e+03  2.167e+03   1.220 0.222624    
## garage_qualFa         -5.088e+04  2.985e+04  -1.705 0.088459 .  
## garage_qualGd         -3.233e+04  2.854e+04  -1.133 0.257603    
## garage_qualNA                 NA         NA      NA       NA    
## garage_qualPo         -5.442e+04  3.553e+04  -1.532 0.125769    
## garage_qualTA         -4.787e+04  2.955e+04  -1.620 0.105412    
## garage_condFa          3.748e+04  2.470e+04   1.518 0.129323    
## garage_condGd          2.544e+04  2.520e+04   1.010 0.312887    
## garage_condNA                 NA         NA      NA       NA    
## garage_condPo          3.600e+04  2.735e+04   1.316 0.188292    
## garage_condTA          3.559e+04  2.436e+04   1.461 0.144211    
## paved_driveP          -1.177e+03  4.804e+03  -0.245 0.806543    
## paved_driveY           6.479e+02  3.015e+03   0.215 0.829846    
## pool_qcFa              1.536e+04  7.518e+04   0.204 0.838139    
## pool_qcGd              1.740e+03  7.626e+04   0.023 0.981795    
## pool_qcNA             -1.194e+05  3.152e+04  -3.789 0.000157 ***
## pool_qcTA             -2.353e+04  4.160e+04  -0.566 0.571683    
## fenceGdWo             -1.437e+03  4.503e+03  -0.319 0.749660    
## fenceMnPrv            -1.915e+03  3.721e+03  -0.515 0.606835    
## fenceMnWw             -8.587e+03  8.707e+03  -0.986 0.324167    
## fenceNA               -3.212e+03  3.393e+03  -0.947 0.343931    
## misc_featureGar2       5.472e+05  3.276e+04  16.706  < 2e-16 ***
## misc_featureNA         5.399e+05  4.248e+04  12.710  < 2e-16 ***
## misc_featureShed       5.410e+05  4.132e+04  13.091  < 2e-16 ***
## sale_typeCon           2.661e+04  1.491e+04   1.784 0.074553 .  
## sale_typeConLD         7.827e+03  7.276e+03   1.076 0.282200    
## sale_typeConLI        -3.709e+02  9.181e+03  -0.040 0.967778    
## sale_typeConLw         8.879e+03  1.004e+04   0.884 0.376759    
## sale_typeCWD           1.151e+04  1.106e+04   1.041 0.298228    
## sale_typeNew           5.967e+02  1.545e+04   0.039 0.969204    
## sale_typeOth          -8.473e+04  2.396e+04  -3.536 0.000418 ***
## sale_typeVWD          -4.993e+03  2.420e+04  -0.206 0.836560    
## sale_typeWD            1.176e+03  3.641e+03   0.323 0.746671    
## sale_conditionAdjLand  2.697e+04  1.047e+04   2.577 0.010059 *  
## sale_conditionAlloca   7.267e+03  8.378e+03   0.867 0.385854    
## sale_conditionFamily  -5.064e+03  5.246e+03  -0.965 0.334599    
## sale_conditionNormal   7.584e+03  2.642e+03   2.870 0.004159 ** 
## sale_conditionPartial  1.849e+04  1.515e+04   1.220 0.222517    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22280 on 1481 degrees of freedom
## Multiple R-squared:  0.9371, Adjusted R-squared:  0.9254 
## F-statistic: 80.01 on 276 and 1481 DF,  p-value: < 2.2e-16

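Several variables come out as statistically significant.

Unfortunately, predicting on the validation set with this linear model raises the following error, because a factor level that is absent from the training set appears in the validation set.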

y_obs <- validation$saleprice
yhat_lm <- predict(df_lm_full, newdata=validation_f)
 Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor ms_zoning has new levels A (agr) 

(Advanced exercise: how could the error above be resolved?)
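One possible way to approach the exercise is sketched below on toy data. The data frames `train` and `valid` and their values are hypothetical stand-ins for the real training and validation sets, not the Ames data. The sketch reproduces the error and then predicts only on rows whose factor level was seen during training. This is one option among several; collapsing rare levels into an "other" category before splitting is another.

```r
# Toy stand-ins for the training and validation sets (hypothetical values).
train <- data.frame(
  ms_zoning = c("RL", "RL", "RM", "RM", "FV", "FV"),
  area      = c(1000, 1200, 900, 1150, 1500, 1600),
  saleprice = c(150, 172, 120, 141, 210, 223)
)
valid <- data.frame(
  ms_zoning = c("RL", "A (agr)", "FV"),
  area      = c(1050, 800, 1550),
  saleprice = c(155, 90, 215)
)

fit <- lm(saleprice ~ ms_zoning + area, data = train)

# Reproduce the problem: "A (agr)" was never seen in training.
res <- try(predict(fit, newdata = valid), silent = TRUE)
inherits(res, "try-error")

# Keep only validation rows whose level is among the training levels,
# which lm() records in fit$xlevels.
seen <- valid$ms_zoning %in% fit$xlevels$ms_zoning
yhat <- predict(fit, newdata = valid[seen, ])

# RMSE on the rows that could be scored.
sqrt(mean((valid$saleprice[seen] - yhat)^2))
```

Dropping rows changes the set on which the error is measured, so an RMSE computed this way is not directly comparable with models scored on the full validation set.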

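Because the full linear model by itself generally does not predict very well, we can perform variable selection with a stepwise procedure as follows. (The run is omitted here because of its running time.) Readers are encouraged to run it on their own machines.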

df_step <- stepAIC(df_lm_full, scope = list(upper = ~ ., lower = ~1))
df_step
anova(df_step)
summary(df_step)
length(coef(df_step))
length(coef(df_lm_full))

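For reference, when run on the author's machine, the original model had 286 parameters and the model after stepwise selection had 147.

If the df_lm_full and df_step models above work properly, the RMSE on the validation set can be computed as follows.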
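For readers who want to see the stepwise procedure finish quickly, here is a minimal sketch that uses the same `stepAIC()` call pattern on R's built-in `mtcars` data. The dataset and the names `full_fit` and `step_fit` are assumptions for illustration only and are not part of the housing analysis.

```r
library(MASS)

# Small built-in dataset used only as a fast stand-in.
full_fit <- lm(mpg ~ ., data = mtcars)

# Same scope as in the text: the full model as upper bound, intercept-only as lower.
step_fit <- stepAIC(full_fit,
                    scope = list(upper = ~ ., lower = ~ 1),
                    trace = FALSE)

# Compare the number of coefficients before and after selection.
length(coef(full_fit))
length(coef(step_fit))

# extractAIC() returns c(edf, AIC); stepAIC() only moves to a model with lower AIC.
extractAIC(full_fit)[2]
extractAIC(step_fit)[2]
```

Setting `trace = FALSE` suppresses the per-step log, which becomes very long for a model with hundreds of dummy variables such as df_lm_full.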

y_obs <- validation$saleprice
yhat_lm <- predict(df_lm_full, newdata=validation)
yhat_step <- predict(df_step, newdata=validation)
rmse(y_obs, yhat_lm)
rmse(y_obs, yhat_step)

B. Lasso, ridge regression, and variable selection with glmnet

xx <- model.matrix(saleprice ~ .-1, df)
x <- xx[training_idx, ]
y <- training$saleprice
df_cvfit <- cv.glmnet(x, y)

The cross-validated error as a function of the lambda parameter is shown below:

plot(df_cvfit)

# coef(df_cvfit, s = c("lambda.1se"))
# coef(df_cvfit, s = c("lambda.min"))
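The call to `cv.glmnet()` above uses the default `alpha = 1`, which is the lasso. As a rough sketch of how ridge regression differs, the example below fits both on simulated data; `x_sim`, `y_sim`, and the helper `n_nonzero()` are hypothetical names introduced only for this illustration.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
# Only the first two columns carry signal; the rest are noise.
y_sim <- 3 * x_sim[, 1] - 2 * x_sim[, 2] + rnorm(n)

cv_lasso <- cv.glmnet(x_sim, y_sim, alpha = 1)  # lasso (the default)
cv_ridge <- cv.glmnet(x_sim, y_sim, alpha = 0)  # ridge regression

# Count nonzero slope coefficients (the intercept in row 1 is excluded).
n_nonzero <- function(cvfit, s) {
  sum(as.matrix(coef(cvfit, s = s))[-1, 1] != 0)
}

n_nonzero(cv_lasso, "lambda.min")
n_nonzero(cv_ridge, "lambda.min")
```

The lasso can set coefficients exactly to zero and so performs variable selection, whereas ridge only shrinks them, which is why ridge typically keeps every column.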

The RMSE and MAE of the lasso model are:

y_obs <- validation$saleprice
yhat_glmnet <- predict(df_cvfit, s="lambda.min", newx=xx[validate_idx,])
yhat_glmnet <- yhat_glmnet[,1] # change to a vector from [n*1] matrix
rmse(y_obs, yhat_glmnet)
## [1] 44968.73
mae(y_obs, yhat_glmnet)
## [1] 17917.51

C. Tree model

Let us fit a regression tree with the rpart::rpart() function.

df_tr <- rpart(saleprice ~ ., data = training)
df_tr
## n= 1758 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1758 1.169331e+13 181888.40  
##    2) overall_qual< 7.5 1469 3.558473e+12 156662.20  
##      4) neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 868 1.156217e+12 131366.40  
##        8) overall_qual< 4.5 157 1.291966e+11  96895.25 *
##        9) overall_qual>=4.5 711 7.992695e+11 138978.10  
##         18) x1st_flr_sf< 1322 590 3.598974e+11 132189.30 *
##         19) x1st_flr_sf>=1322 121 2.795925e+11 172080.50 *
##      5) neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,GrnHill,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 601 1.044673e+12 193196.00  
##       10) gr_liv_area< 1752 438 4.335084e+11 178563.90  
##         20) gr_liv_area< 1204 99 4.230077e+10 147199.70 *
##         21) gr_liv_area>=1204 339 2.653791e+11 187723.40 *
##       11) gr_liv_area>=1752 163 2.654050e+11 232514.20 *
##    3) overall_qual>=7.5 289 2.448313e+12 310114.40  
##      6) total_bsmt_sf< 1721.5 206 7.977165e+11 277138.10  
##       12) gr_liv_area< 2374.5 164 4.407878e+11 260563.90  
##         24) bsmt_qual=Fa,Gd,TA 107 1.971039e+11 239877.10 *
##         25) bsmt_qual=Ex 57 1.119365e+11 299397.20 *
##       13) gr_liv_area>=2374.5 42 1.359639e+11 341856.10 *
##      7) total_bsmt_sf>=1721.5 83 8.706003e+11 391959.40  
##       14) gr_liv_area< 2225.5 48 1.512143e+11 341111.20 *
##       15) gr_liv_area>=2225.5 35 4.250778e+11 461694.10  
##         30) neighborhood=Edwards,NAmes,NWAmes,Somerst,Veenker 7 7.128389e+10 319764.10 *
##         31) neighborhood=CollgCr,NoRidge,NridgHt,StoneBr 28 1.775330e+11 497176.50 *
# printcp(df_tr)
# summary(df_tr)
opar <- par(mfrow = c(1,1), xpd = NA)
plot(df_tr)
text(df_tr, use.n = TRUE)