Author's book web page: https://dataninja.me/ipds-kr/
First, load tidyverse, which is required throughout, along with several packages for machine learning. (suppressMessages() is used to hide the loading messages.)
# install.packages("tidyverse")
suppressMessages(library(tidyverse))
# install.packages(c("gridExtra", "ROCR", "MASS", "glmnet", "randomForest", "gbm", "rpart", "boot"))
suppressMessages(library(gridExtra))
suppressMessages(library(ROCR))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
## Warning: package 'glmnet' was built under R version 3.4.2
suppressMessages(library(randomForest))
suppressMessages(library(gbm))
suppressMessages(library(rpart))
suppressMessages(library(boot))
As described in the book, define the RMSE (root mean squared error), MAE (mean absolute error), and panel.cor functions:
rmse <- function(yi, yhat_i){
  sqrt(mean((yi - yhat_i)^2))
}
mae <- function(yi, yhat_i){
  mean(abs(yi - yhat_i))
}
# Adapted from example(pairs)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
  # Save the plot region coordinates and restore them on exit
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  # Scale the label size by the absolute correlation
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
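As a quick sanity check of the two error metrics, the sketch below evaluates them on a small made-up pair of vectors (the values are illustrative assumptions, not taken from the Ames data). RMSE squares the errors before averaging, so it penalizes the single large miss more heavily than MAE does.

```r
# Same definitions as above, repeated so the snippet runs on its own
rmse <- function(yi, yhat_i) sqrt(mean((yi - yhat_i)^2))
mae  <- function(yi, yhat_i) mean(abs(yi - yhat_i))

# Hypothetical observed and predicted values (illustration only)
y_obs  <- c(100, 200, 300, 400)
y_pred <- c(110, 190, 300, 440)

# The errors are -10, 10, 0, -40
mae_val  <- mae(y_obs, y_pred)   # (10 + 10 + 0 + 40) / 4 = 15
rmse_val <- rmse(y_obs, y_pred)  # sqrt((100 + 100 + 0 + 1600) / 4) = sqrt(450)
print(c(mae = mae_val, rmse = rmse_val))
```

Both functions take the observed values first and the predictions second, although the result is the same either way because the errors are squared or taken in absolute value.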
Obtain the Ames, Iowa housing price data (De Cock, 2011) and run a regression analysis. The data are available from https://goo.gl/ul7Ub7 (https://www.kaggle.com/c/house-prices-advanced-regression-techniques), https://goo.gl/8gKgaT (http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls), or https://goo.gl/qgVg2z (https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt).
For variable descriptions, see https://goo.gl/2vcCfT (https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt). The paper at https://ww2.amstat.org/publications/jse/v19n3/decock.pdf is also a useful reference.
Run a regression analysis on this data. Which of the regression methods described in the text gives the most accurate results? Summarize the results in a report.
First, download the data with the following commands:
wget https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.txt
wget https://ww2.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls
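After reading the data into R, convert the variable names as follows:
The make.names(..., unique=TRUE) function turns the variable names into names that are easy to use in R; the resulting dots are then replaced with underscores and the names are lowercased. After that, the id variables order and pid are removed.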
df1 <- read_tsv("AmesHousing.txt")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Order = col_integer(),
## `Lot Frontage` = col_integer(),
## `Lot Area` = col_integer(),
## `Overall Qual` = col_integer(),
## `Overall Cond` = col_integer(),
## `Year Built` = col_integer(),
## `Year Remod/Add` = col_integer(),
## `Mas Vnr Area` = col_integer(),
## `BsmtFin SF 1` = col_integer(),
## `BsmtFin SF 2` = col_integer(),
## `Bsmt Unf SF` = col_integer(),
## `Total Bsmt SF` = col_integer(),
## `1st Flr SF` = col_integer(),
## `2nd Flr SF` = col_integer(),
## `Low Qual Fin SF` = col_integer(),
## `Gr Liv Area` = col_integer(),
## `Bsmt Full Bath` = col_integer(),
## `Bsmt Half Bath` = col_integer(),
## `Full Bath` = col_integer(),
## `Half Bath` = col_integer()
## # ... with 17 more columns
## )
## See spec(...) for full column specifications.
names(df1) <- tolower(gsub("\\.", "_", make.names(names(df1), unique=TRUE)))
df1 <- df1 %>% dplyr::select(-order, -pid)
glimpse(df1)
## Observations: 2,930
## Variables: 80
## $ ms_subclass <chr> "020", "020", "020", "020", "060", "060", "120...
## $ ms_zoning <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL"...
## $ lot_frontage <int> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, N...
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"...
## $ alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lot_shape <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg...
## $ land_contour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl...
## $ utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPu...
## $ lot_config <chr> "Corner", "Inside", "Corner", "Corner", "Insid...
## $ land_slope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl...
## $ neighborhood <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert",...
## $ condition_1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm...
## $ condition_2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"...
## $ bldg_type <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam"...
## $ house_style <chr> "1Story", "1Story", "1Story", "1Story", "2Stor...
## $ overall_qual <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ roof_style <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable"...
## $ roof_matl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "C...
## $ exterior_1st <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ exterior_2nd <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ mas_vnr_type <chr> "Stone", "None", "BrkFace", "None", "None", "B...
## $ mas_vnr_area <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ exter_qual <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd"...
## $ exter_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ foundation <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc...
## $ bsmt_qual <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd"...
## $ bsmt_cond <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ bsmt_exposure <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No"...
## $ bsmtfin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ...
## $ bsmtfin_sf_1 <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf...
## $ bsmtfin_sf_2 <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"...
## $ heating_qc <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex"...
## $ central_air <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "...
## $ x1st_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath <int> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ kitchen_qual <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd"...
## $ totrms_abvgrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ...
## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ fireplace_qu <chr> "Gd", NA, NA, "TA", "TA", "Gd", NA, NA, "TA", ...
## $ garage_type <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attch...
## $ garage_yr_blt <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_finish <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin...
## $ garage_cars <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ garage_qual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ garage_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ paved_drive <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ pool_qc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fence <chr> NA, "MnPrv", NA, NA, "MnPrv", NA, NA, NA, NA, ...
## $ misc_feature <chr> NA, NA, "Gar2", NA, NA, NA, NA, NA, NA, NA, NA...
## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6...
## $ yr_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ sale_type <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD"...
## $ sale_condition <chr> "Normal", "Normal", "Normal", "Normal", "Norma...
## $ saleprice <int> 215000, 105000, 172000, 244000, 189900, 195500...
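The renaming step used above can be traced on a handful of the raw column names that appear in the column specification. This is only a small sketch of the same three calls (make.names(), gsub(), tolower()) applied to a hand-picked vector; the intermediate variable names are assumptions introduced for illustration.

```r
# A few raw column names copied from the column specification
raw_names <- c("Order", "Lot Frontage", "Year Remod/Add", "1st Flr SF")

# Step 1: make syntactically valid R names (spaces and "/" become ".",
# and a leading digit gets an "X" prefix)
step1 <- make.names(raw_names, unique = TRUE)

# Step 2: replace every "." with "_"
step2 <- gsub("\\.", "_", step1)

# Step 3: lowercase everything
clean_names <- tolower(step2)
print(clean_names)
# "order" "lot_frontage" "year_remod_add" "x1st_flr_sf"
```

This explains why columns such as x1st_flr_sf carry an x prefix in the glimpse() output.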
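Several variables in the data contain missing values. A simple way to find them is the summary() function:
# summary(df1)
Another way is the summarize_all() + funs() trick, which here computes the fraction of missing values in each column:
df1 %>%
  summarize_all(funs(length(which(is.na(.)))/length(.))) %>%
  glimpse()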
## Observations: 1
## Variables: 80
## $ ms_subclass <dbl> 0
## $ ms_zoning <dbl> 0
## $ lot_frontage <dbl> 0.1672355
## $ lot_area <dbl> 0
## $ street <dbl> 0
## $ alley <dbl> 0.9324232
## $ lot_shape <dbl> 0
## $ land_contour <dbl> 0
## $ utilities <dbl> 0
## $ lot_config <dbl> 0
## $ land_slope <dbl> 0
## $ neighborhood <dbl> 0
## $ condition_1 <dbl> 0
## $ condition_2 <dbl> 0
## $ bldg_type <dbl> 0
## $ house_style <dbl> 0
## $ overall_qual <dbl> 0
## $ overall_cond <dbl> 0
## $ year_built <dbl> 0
## $ year_remod_add <dbl> 0
## $ roof_style <dbl> 0
## $ roof_matl <dbl> 0
## $ exterior_1st <dbl> 0
## $ exterior_2nd <dbl> 0
## $ mas_vnr_type <dbl> 0.007849829
## $ mas_vnr_area <dbl> 0.007849829
## $ exter_qual <dbl> 0
## $ exter_cond <dbl> 0
## $ foundation <dbl> 0
## $ bsmt_qual <dbl> 0.02730375
## $ bsmt_cond <dbl> 0.02730375
## $ bsmt_exposure <dbl> 0.02832765
## $ bsmtfin_type_1 <dbl> 0.02730375
## $ bsmtfin_sf_1 <dbl> 0.0003412969
## $ bsmtfin_type_2 <dbl> 0.02764505
## $ bsmtfin_sf_2 <dbl> 0.0003412969
## $ bsmt_unf_sf <dbl> 0.0003412969
## $ total_bsmt_sf <dbl> 0.0003412969
## $ heating <dbl> 0
## $ heating_qc <dbl> 0
## $ central_air <dbl> 0
## $ electrical <dbl> 0.0003412969
## $ x1st_flr_sf <dbl> 0
## $ x2nd_flr_sf <dbl> 0
## $ low_qual_fin_sf <dbl> 0
## $ gr_liv_area <dbl> 0
## $ bsmt_full_bath <dbl> 0.0006825939
## $ bsmt_half_bath <dbl> 0.0006825939
## $ full_bath <dbl> 0
## $ half_bath <dbl> 0
## $ bedroom_abvgr <dbl> 0
## $ kitchen_abvgr <dbl> 0
## $ kitchen_qual <dbl> 0
## $ totrms_abvgrd <dbl> 0
## $ functional <dbl> 0
## $ fireplaces <dbl> 0
## $ fireplace_qu <dbl> 0.4853242
## $ garage_type <dbl> 0.05358362
## $ garage_yr_blt <dbl> 0.05426621
## $ garage_finish <dbl> 0.05426621
## $ garage_cars <dbl> 0.0003412969
## $ garage_area <dbl> 0.0003412969
## $ garage_qual <dbl> 0.05426621
## $ garage_cond <dbl> 0.05426621
## $ paved_drive <dbl> 0
## $ wood_deck_sf <dbl> 0
## $ open_porch_sf <dbl> 0
## $ enclosed_porch <dbl> 0
## $ x3ssn_porch <dbl> 0
## $ screen_porch <dbl> 0
## $ pool_area <dbl> 0
## $ pool_qc <dbl> 0.9955631
## $ fence <dbl> 0.8047782
## $ misc_feature <dbl> 0.9638225
## $ misc_val <dbl> 0
## $ mo_sold <dbl> 0
## $ yr_sold <dbl> 0
## $ sale_type <dbl> 0
## $ sale_condition <dbl> 0
## $ saleprice <dbl> 0
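This shows that many variables have missing values; pool_qc, misc_feature, alley, and fence are missing in more than 80% of the rows.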
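The same per-column missing fraction can also be computed in base R, which makes it easy to sort and keep only the affected columns. The sketch below uses a tiny made-up data frame (an assumption for illustration, not the Ames data); with the real data one would pass df1 instead of toy.

```r
# Toy data frame with some missing values (illustration only)
toy <- data.frame(
  lot_frontage = c(141, NA, 81, 93),
  alley        = c(NA, NA, NA, "Pave"),
  street       = c("Pave", "Pave", "Grvl", "Pave"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; column means are the NA fractions
na_frac <- colMeans(is.na(toy))

# Keep only columns that have missing values, largest fraction first
na_frac_sorted <- sort(na_frac[na_frac > 0], decreasing = TRUE)
print(na_frac_sorted)
```

Sorting puts the mostly-empty columns at the top, which is often the first thing to look at when deciding how to handle them.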
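There are many ways to handle missing values, but here we treat them simply: missing values in numeric variables are replaced with the median, and missing values in character variables are replaced with the string "NA".
The command below performs this treatment using mutate_if(), rename_all(), and related functions: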
df2 <- df1 %>%
  mutate_if(is.numeric, funs(imp=ifelse(is.na(.), median(., na.rm=TRUE), .))) %>%
  # mutate_if(is.character, funs(imp=ifelse(is.na(.), sort(table(.), decreasing=TRUE)[1], .))) %>%
  mutate_if(is.character, funs(imp=ifelse(is.na(.), "NA", .))) %>%
  dplyr::select(ends_with("_imp")) %>%
  rename_all(funs(gsub("_imp", "", .)))
df2 %>% glimpse()
## Observations: 2,930
## Variables: 80
## $ lot_frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 6...
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ overall_qual <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ mas_vnr_area <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bsmtfin_sf_1 <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_sf_2 <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x1st_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ totrms_abvgrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ garage_yr_blt <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_cars <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6...
## $ yr_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ saleprice <int> 215000, 105000, 172000, 244000, 189900, 195500...
## $ ms_subclass <chr> "020", "020", "020", "020", "060", "060", "120...
## $ ms_zoning <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL"...
## $ street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"...
## $ alley <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"...
## $ lot_shape <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg...
## $ land_contour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl...
## $ utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPu...
## $ lot_config <chr> "Corner", "Inside", "Corner", "Corner", "Insid...
## $ land_slope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl...
## $ neighborhood <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert",...
## $ condition_1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm...
## $ condition_2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"...
## $ bldg_type <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam"...
## $ house_style <chr> "1Story", "1Story", "1Story", "1Story", "2Stor...
## $ roof_style <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable"...
## $ roof_matl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "C...
## $ exterior_1st <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ exterior_2nd <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "V...
## $ mas_vnr_type <chr> "Stone", "None", "BrkFace", "None", "None", "B...
## $ exter_qual <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd"...
## $ exter_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ foundation <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc...
## $ bsmt_qual <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd"...
## $ bsmt_cond <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ bsmt_exposure <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No"...
## $ bsmtfin_type_1 <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ...
## $ bsmtfin_type_2 <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf...
## $ heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"...
## $ heating_qc <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex"...
## $ central_air <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "...
## $ kitchen_qual <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd"...
## $ functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ...
## $ fireplace_qu <chr> "Gd", "NA", "NA", "TA", "TA", "Gd", "NA", "NA"...
## $ garage_type <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attch...
## $ garage_finish <chr> "Fin", "Unf", "Unf", "Fin", "Fin", "Fin", "Fin...
## $ garage_qual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ garage_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA"...
## $ paved_drive <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ pool_qc <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"...
## $ fence <chr> "NA", "MnPrv", "NA", "NA", "MnPrv", "NA", "NA"...
## $ misc_feature <chr> "NA", "NA", "Gar2", "NA", "NA", "NA", "NA", "N...
## $ sale_type <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD"...
## $ sale_condition <chr> "Normal", "Normal", "Normal", "Normal", "Norma...
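funs() has been deprecated since dplyr 0.8.0, so on a recent installation the imputation above may produce warnings. A minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or later and a made-up two-column tibble, could look like this. Unlike the original pipeline, it overwrites columns in place, so the original column order is kept and no _imp suffix is needed.

```r
library(dplyr)

# Toy data (illustration only)
toy <- tibble(
  lot_frontage = c(141, NA, 81),
  alley        = c(NA, "Pave", NA)
)

toy_imputed <- toy %>%
  mutate(
    # Numeric columns: replace NA with the column median
    across(where(is.numeric),
           ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)),
    # Character columns: replace NA with the literal string "NA"
    across(where(is.character),
           ~ ifelse(is.na(.x), "NA", .x))
  )
print(toy_imputed)
```

Here the missing lot_frontage becomes 111, the median of 141 and 81, and the missing alley values become the string "NA".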
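In addition, mo_sold was read in as a numeric variable, but it is better treated as categorical than numeric. Examining the remaining variables one by one would suggest much more preprocessing, but for now let's save the data transformed as above as our analysis data: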
df <- df2 %>% mutate(mo_sold=as.character(mo_sold))
Split the original data into training, validation, and test sets in a 6:2:2 ratio. (set.seed() is used for reproducibility.)
set.seed(2017)
n <- nrow(df)
idx <- 1:n
training_idx <- sample(idx, n * .60)
idx <- setdiff(idx, training_idx)
validate_idx <- sample(idx, n * .20)
test_idx <- setdiff(idx, validate_idx)
length(training_idx)
## [1] 1758
length(validate_idx)
## [1] 586
length(test_idx)
## [1] 586
training <- df[training_idx,]
validation <- df[validate_idx,]
test <- df[test_idx,]
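It is worth confirming that a split like the one above is a true partition: the three index sets should be disjoint and together cover every row. The following sketch repeats the splitting logic on bare row indices (n is hard-coded to the 2,930 rows reported by glimpse(), and round() is added here as a defensive assumption so the sample sizes are whole numbers).

```r
set.seed(2017)
n <- 2930                      # number of rows in the Ames data
idx <- seq_len(n)

training_idx <- sample(idx, round(n * 0.60))
rest         <- setdiff(idx, training_idx)
validate_idx <- sample(rest, round(n * 0.20))
test_idx     <- setdiff(rest, validate_idx)

# No row may appear in two sets
no_overlap <- length(intersect(training_idx, validate_idx)) == 0 &&
  length(intersect(training_idx, test_idx)) == 0 &&
  length(intersect(validate_idx, test_idx)) == 0

# Every row must appear in exactly one set
covers_all <- setequal(c(training_idx, validate_idx, test_idx), idx)

print(c(train = length(training_idx),
        validate = length(validate_idx),
        test = length(test_idx)))
print(c(no_overlap = no_overlap, covers_all = covers_all))
```

The set sizes come out as 1758, 586, and 586, matching the lengths printed above.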
Some analysis functions do not automatically convert character variables to factors, so let's also create the following data set, using the mutate_if() function.
dff <- df %>% mutate_if(is.character, as.factor)
glimpse(dff)
## Observations: 2,930
## Variables: 80
## $ lot_frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 6...
## $ lot_area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920,...
## $ overall_qual <int> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8...
## $ overall_cond <int> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5...
## $ year_built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ year_remod_add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992...
## $ mas_vnr_area <int> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bsmtfin_sf_1 <int> 639, 468, 923, 1065, 791, 602, 616, 263, 1180,...
## $ bsmtfin_sf_2 <int> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11...
## $ bsmt_unf_sf <int> 441, 270, 406, 1045, 137, 324, 722, 1017, 415,...
## $ total_bsmt_sf <int> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x1st_flr_sf <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1...
## $ x2nd_flr_sf <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 67...
## $ low_qual_fin_sf <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gr_liv_area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280,...
## $ bsmt_full_bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1...
## $ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ full_bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3...
## $ half_bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1...
## $ bedroom_abvgr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4...
## $ kitchen_abvgr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ totrms_abvgrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 1...
## $ fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1...
## $ garage_yr_blt <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992...
## $ garage_cars <int> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3...
## $ garage_area <int> 528, 730, 312, 522, 482, 470, 582, 506, 608, 4...
## $ wood_deck_sf <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 15...
## $ open_porch_sf <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, ...
## $ enclosed_porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ x3ssn_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ screen_porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, ...
## $ pool_area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ misc_val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0...
## $ mo_sold <fctr> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, ...
## $ yr_sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010...
## $ saleprice <int> 215000, 105000, 172000, 244000, 189900, 195500...
## $ ms_subclass <fctr> 020, 020, 020, 020, 060, 060, 120, 120, 120, ...
## $ ms_zoning <fctr> RL, RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, R...
## $ street <fctr> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav...
## $ alley <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lot_shape <fctr> IR1, Reg, IR1, Reg, IR1, IR1, Reg, IR1, IR1, ...
## $ land_contour <fctr> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, ...
## $ utilities <fctr> AllPub, AllPub, AllPub, AllPub, AllPub, AllPu...
## $ lot_config <fctr> Corner, Inside, Corner, Corner, Inside, Insid...
## $ land_slope <fctr> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, ...
## $ neighborhood <fctr> NAmes, NAmes, NAmes, NAmes, Gilbert, Gilbert,...
## $ condition_1 <fctr> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, No...
## $ condition_2 <fctr> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor...
## $ bldg_type <fctr> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, T...
## $ house_style <fctr> 1Story, 1Story, 1Story, 1Story, 2Story, 2Stor...
## $ roof_style <fctr> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Ga...
## $ roof_matl <fctr> CompShg, CompShg, CompShg, CompShg, CompShg, ...
## $ exterior_1st <fctr> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, ...
## $ exterior_2nd <fctr> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, ...
## $ mas_vnr_type <fctr> Stone, None, BrkFace, None, None, BrkFace, No...
## $ exter_qual <fctr> TA, TA, TA, Gd, TA, TA, Gd, Gd, Gd, TA, TA, T...
## $ exter_cond <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, G...
## $ foundation <fctr> CBlock, CBlock, CBlock, CBlock, PConc, PConc,...
## $ bsmt_qual <fctr> TA, TA, TA, TA, Gd, TA, Gd, Gd, Gd, TA, Gd, G...
## $ bsmt_cond <fctr> Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ bsmt_exposure <fctr> Gd, No, No, No, No, No, Mn, No, No, No, No, N...
## $ bsmtfin_type_1 <fctr> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, ...
## $ bsmtfin_type_2 <fctr> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, ...
## $ heating <fctr> GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas...
## $ heating_qc <fctr> Fa, TA, TA, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Gd, E...
## $ central_air <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ electrical <fctr> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBr...
## $ kitchen_qual <fctr> TA, TA, Gd, Ex, TA, Gd, Gd, Gd, Gd, Gd, TA, T...
## $ functional <fctr> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, ...
## $ fireplace_qu <fctr> Gd, NA, NA, TA, TA, Gd, NA, NA, TA, TA, TA, N...
## $ garage_type <fctr> Attchd, Attchd, Attchd, Attchd, Attchd, Attch...
## $ garage_finish <fctr> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, ...
## $ garage_qual <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ garage_cond <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ paved_drive <fctr> P, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ pool_qc <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ fence <fctr> NA, MnPrv, NA, NA, MnPrv, NA, NA, NA, NA, NA,...
## $ misc_feature <fctr> NA, NA, Gar2, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ sale_type <fctr> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, W...
## $ sale_condition <fctr> Normal, Normal, Normal, Normal, Normal, Norma...
training_f <- dff[training_idx, ]
validation_f <- dff[validate_idx, ]
test_f <- dff[test_idx, ]
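One consequence of converting to factors before splitting, offered here as an interpretation rather than something the text states, is that every subset keeps the full set of levels, even levels that never occur in that subset. The toy example below (made-up zoning codes) illustrates this.

```r
# Toy factor column (illustration only)
toy <- data.frame(zone = c("RL", "RH", "RL", "FV", "RL"),
                  stringsAsFactors = FALSE)
toy$zone <- as.factor(toy$zone)

# Split after the conversion
train_part <- toy[1:3, , drop = FALSE]
test_part  <- toy[4:5, , drop = FALSE]

# "FV" never occurs in the training rows ...
observed_in_train <- unique(as.character(train_part$zone))
print(observed_in_train)

# ... but it is still a level of the training factor
print(levels(train_part$zone))
```

Because the training, validation, and test factors share identical levels, functions that compare factor levels at prediction time are less likely to complain. A level with no training rows still cannot be estimated, so predictions for such rows should be treated with care.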
To start, fit a linear model that includes all the variables:
df_lm_full <- lm(saleprice ~ ., data=training_f)
summary(df_lm_full)
##
## Call:
## lm(formula = saleprice ~ ., data = training_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257041 -10232 0 9485 152022
##
## Coefficients: (9 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.557e+04 9.486e+05 0.048 0.961687
## lot_frontage 4.009e+01 4.198e+01 0.955 0.339757
## lot_area 5.299e-01 1.263e-01 4.197 2.87e-05 ***
## overall_qual 5.773e+03 9.274e+02 6.225 6.27e-10 ***
## overall_cond 6.343e+03 7.781e+02 8.151 7.61e-16 ***
## year_built 4.088e+02 7.537e+01 5.424 6.82e-08 ***
## year_remod_add 8.607e+01 4.990e+01 1.725 0.084788 .
## mas_vnr_area 2.872e+01 5.090e+00 5.642 2.01e-08 ***
## bsmtfin_sf_1 4.225e+01 4.390e+00 9.623 < 2e-16 ***
## bsmtfin_sf_2 3.891e+01 7.774e+00 5.004 6.27e-07 ***
## bsmt_unf_sf 2.036e+01 4.048e+00 5.029 5.53e-07 ***
## total_bsmt_sf NA NA NA NA
## x1st_flr_sf 5.013e+01 4.844e+00 10.348 < 2e-16 ***
## x2nd_flr_sf 5.361e+01 4.963e+00 10.803 < 2e-16 ***
## low_qual_fin_sf 2.650e+01 1.346e+01 1.969 0.049153 *
## gr_liv_area NA NA NA NA
## bsmt_full_bath 2.572e+03 1.712e+03 1.503 0.133072
## bsmt_half_bath 6.793e+02 2.617e+03 0.260 0.795230
## full_bath 4.939e+03 1.921e+03 2.571 0.010225 *
## half_bath 2.598e+03 1.841e+03 1.411 0.158410
## bedroom_abvgr -3.978e+03 1.203e+03 -3.307 0.000966 ***
## kitchen_abvgr -1.360e+04 5.698e+03 -2.386 0.017139 *
## totrms_abvgrd 1.779e+03 8.168e+02 2.178 0.029546 *
## fireplaces 8.786e+03 2.352e+03 3.736 0.000194 ***
## garage_yr_blt 2.580e+01 4.993e+01 0.517 0.605487
## garage_cars 3.532e+03 2.074e+03 1.703 0.088728 .
## garage_area 1.421e+01 7.051e+00 2.016 0.044014 *
## wood_deck_sf 6.186e+00 5.335e+00 1.159 0.246487
## open_porch_sf -3.697e+00 9.952e+00 -0.371 0.710353
## enclosed_porch 1.187e+01 1.143e+01 1.039 0.298995
## x3ssn_porch 1.344e+01 2.161e+01 0.622 0.534212
## screen_porch 3.779e+01 1.061e+01 3.563 0.000378 ***
## pool_area -1.826e+02 1.462e+02 -1.249 0.211928
## misc_val 1.989e-01 1.948e+00 0.102 0.918706
## mo_sold10 -1.128e+04 3.778e+03 -2.985 0.002879 **
## mo_sold11 -6.455e+03 3.840e+03 -1.681 0.093001 .
## mo_sold12 -6.455e+01 4.169e+03 -0.015 0.987649
## mo_sold2 -2.761e+03 3.987e+03 -0.692 0.488736
## mo_sold3 -4.585e+03 3.544e+03 -1.294 0.195902
## mo_sold4 -5.168e+03 3.440e+03 -1.502 0.133195
## mo_sold5 -2.309e+03 3.302e+03 -0.699 0.484532
## mo_sold6 -4.234e+03 3.181e+03 -1.331 0.183396
## mo_sold7 -1.997e+03 3.206e+03 -0.623 0.533395
## mo_sold8 -8.717e+03 3.579e+03 -2.436 0.014986 *
## mo_sold9 -8.820e+03 3.798e+03 -2.322 0.020360 *
## yr_sold -7.759e+02 4.649e+02 -1.669 0.095382 .
## ms_subclass030 6.697e+03 4.148e+03 1.615 0.106575
## ms_subclass040 1.962e+04 1.293e+04 1.517 0.129515
## ms_subclass045 3.141e+03 1.775e+04 0.177 0.859572
## ms_subclass050 7.805e+03 7.293e+03 1.070 0.284697
## ms_subclass060 2.113e+03 7.066e+03 0.299 0.764951
## ms_subclass070 1.199e+04 7.746e+03 1.548 0.121898
## ms_subclass075 2.403e+04 1.544e+04 1.556 0.119870
## ms_subclass080 -1.697e+04 1.301e+04 -1.304 0.192327
## ms_subclass085 -5.618e+03 8.583e+03 -0.655 0.512866
## ms_subclass090 -1.647e+04 6.969e+03 -2.363 0.018253 *
## ms_subclass120 -2.091e+04 1.229e+04 -1.701 0.089147 .
## ms_subclass150 -5.152e+04 3.002e+04 -1.716 0.086338 .
## ms_subclass160 -2.418e+04 1.482e+04 -1.631 0.103070
## ms_subclass180 -2.325e+04 1.739e+04 -1.337 0.181369
## ms_subclass190 -1.096e+04 6.878e+03 -1.593 0.111298
## ms_zoningFV -6.488e+03 1.118e+04 -0.580 0.561830
## ms_zoningI (all) 5.009e+04 3.676e+04 1.362 0.173258
## ms_zoningRH 1.898e+04 1.145e+04 1.657 0.097633 .
## ms_zoningRL 5.054e+03 9.546e+03 0.529 0.596566
## ms_zoningRM -3.379e+02 8.969e+03 -0.038 0.969957
## streetPave 1.891e+04 1.079e+04 1.751 0.080093 .
## alleyNA -2.612e+02 3.571e+03 -0.073 0.941699
## alleyPave -2.561e+03 5.382e+03 -0.476 0.634240
## lot_shapeIR2 2.580e+03 3.804e+03 0.678 0.497843
## lot_shapeIR3 7.752e+03 8.850e+03 0.876 0.381160
## lot_shapeReg 1.304e+03 1.473e+03 0.885 0.376064
## land_contourHLS 8.955e+03 4.501e+03 1.989 0.046843 *
## land_contourLow -1.397e+03 5.703e+03 -0.245 0.806475
## land_contourLvl 5.155e+03 3.297e+03 1.564 0.118083
## utilitiesNoSewr -3.843e+04 2.487e+04 -1.545 0.122473
## lot_configCulDSac 3.128e+03 3.071e+03 1.019 0.308559
## lot_configFR2 -4.623e+03 3.791e+03 -1.219 0.222881
## lot_configFR3 9.548e+02 9.550e+03 0.100 0.920370
## lot_configInside 1.902e+02 1.659e+03 0.115 0.908693
## land_slopeMod 1.508e+03 3.635e+03 0.415 0.678368
## land_slopeSev -2.150e+04 1.075e+04 -2.001 0.045616 *
## neighborhoodBlueste -2.537e+03 1.548e+04 -0.164 0.869811
## neighborhoodBrDale -2.395e+03 1.049e+04 -0.228 0.819521
## neighborhoodBrkSide -1.078e+04 8.162e+03 -1.321 0.186757
## neighborhoodClearCr -1.328e+04 8.636e+03 -1.538 0.124374
## neighborhoodCollgCr -1.395e+04 6.385e+03 -2.185 0.029073 *
## neighborhoodCrawfor -3.369e+03 7.471e+03 -0.451 0.652051
## neighborhoodEdwards -2.723e+04 7.061e+03 -3.856 0.000120 ***
## neighborhoodGilbert -1.425e+04 6.691e+03 -2.130 0.033327 *
## neighborhoodGreens 4.442e+03 1.291e+04 0.344 0.730782
## neighborhoodGrnHill 6.562e+04 2.516e+04 2.609 0.009181 **
## neighborhoodIDOTRR -1.757e+04 9.354e+03 -1.879 0.060490 .
## neighborhoodMeadowV -3.490e+03 1.035e+04 -0.337 0.736073
## neighborhoodMitchel -2.211e+04 7.126e+03 -3.103 0.001951 **
## neighborhoodNAmes -2.103e+04 6.932e+03 -3.034 0.002453 **
## neighborhoodNoRidge 1.655e+04 7.442e+03 2.224 0.026288 *
## neighborhoodNPkVill 2.116e+03 1.370e+04 0.154 0.877300
## neighborhoodNridgHt 7.636e+03 6.596e+03 1.158 0.247194
## neighborhoodNWAmes -2.173e+04 7.063e+03 -3.076 0.002133 **
## neighborhoodOldTown -2.313e+04 8.269e+03 -2.798 0.005215 **
## neighborhoodSawyer -1.694e+04 7.179e+03 -2.360 0.018415 *
## neighborhoodSawyerW -1.499e+04 6.778e+03 -2.212 0.027142 *
## neighborhoodSomerst 1.372e+04 7.964e+03 1.722 0.085238 .
## neighborhoodStoneBr 4.069e+04 7.442e+03 5.467 5.36e-08 ***
## neighborhoodSWISU -2.468e+04 8.494e+03 -2.906 0.003717 **
## neighborhoodTimber -1.887e+04 7.093e+03 -2.661 0.007883 **
## neighborhoodVeenker -1.952e+04 9.624e+03 -2.028 0.042764 *
## condition_1Feedr 4.726e+03 4.452e+03 1.062 0.288554
## condition_1Norm 1.187e+04 3.657e+03 3.247 0.001194 **
## condition_1PosA 1.577e+04 8.511e+03 1.853 0.064086 .
## condition_1PosN 1.826e+04 6.491e+03 2.812 0.004981 **
## condition_1RRAe -6.776e+02 6.536e+03 -0.104 0.917448
## condition_1RRAn 6.524e+03 6.281e+03 1.039 0.299118
## condition_1RRNe 2.147e+03 1.317e+04 0.163 0.870500
## condition_1RRNn -1.033e+04 1.151e+04 -0.898 0.369572
## condition_2Feedr 4.009e+03 1.727e+04 0.232 0.816462
## condition_2Norm 1.676e+04 1.595e+04 1.051 0.293401
## condition_2PosA 6.685e+04 2.083e+04 3.210 0.001356 **
## condition_2PosN -1.363e+05 2.207e+04 -6.176 8.47e-10 ***
## condition_2RRAn 1.392e+04 2.855e+04 0.487 0.626012
## bldg_type2fmCon NA NA NA NA
## bldg_typeDuplex NA NA NA NA
## bldg_typeTwnhs -3.752e+03 1.313e+04 -0.286 0.775080
## bldg_typeTwnhsE 6.377e+02 1.231e+04 0.052 0.958710
## house_style1.5Unf 7.990e+03 1.742e+04 0.459 0.646569
## house_style1Story 6.291e+03 7.235e+03 0.870 0.384696
## house_style2.5Fin -1.779e+04 1.935e+04 -0.919 0.358082
## house_style2.5Unf -1.512e+04 1.479e+04 -1.022 0.306904
## house_style2Story -2.873e+00 7.276e+03 0.000 0.999685
## house_styleSFoyer 7.365e+03 9.445e+03 0.780 0.435676
## house_styleSLvl 2.082e+04 1.348e+04 1.544 0.122869
## roof_styleGable 3.220e+03 2.562e+04 0.126 0.899999
## roof_styleGambrel -6.711e+03 2.647e+04 -0.254 0.799873
## roof_styleHip 4.708e+03 2.567e+04 0.183 0.854489
## roof_styleMansard -1.150e+04 2.764e+04 -0.416 0.677310
## roof_styleShed 2.959e+04 3.228e+04 0.917 0.359422
## roof_matlMetal 2.728e+04 3.622e+04 0.753 0.451463
## roof_matlRoll -6.539e+03 2.487e+04 -0.263 0.792628
## roof_matlTar&Grv -8.631e+02 2.410e+04 -0.036 0.971430
## roof_matlWdShake -1.278e+04 1.348e+04 -0.948 0.343303
## roof_matlWdShngl -2.699e+03 2.029e+04 -0.133 0.894187
## exterior_1stBrkComm 1.801e+04 1.785e+04 1.009 0.313285
## exterior_1stBrkFace 2.571e+04 1.245e+04 2.065 0.039086 *
## exterior_1stCemntBd -1.619e+04 2.050e+04 -0.790 0.429699
## exterior_1stHdBoard 1.677e+03 1.210e+04 0.139 0.889769
## exterior_1stMetalSd 1.003e+04 1.365e+04 0.735 0.462595
## exterior_1stPlywood 7.851e+03 1.198e+04 0.655 0.512306
## exterior_1stPreCast 3.762e+04 2.624e+04 1.434 0.151791
## exterior_1stStone 5.655e+03 3.013e+04 0.188 0.851154
## exterior_1stStucco 5.442e+03 1.348e+04 0.404 0.686396
## exterior_1stVinylSd 3.885e+03 1.318e+04 0.295 0.768222
## exterior_1stWd Sdng 1.618e+03 1.185e+04 0.137 0.891395
## exterior_1stWdShing 3.631e+03 1.268e+04 0.286 0.774604
## exterior_2ndAsphShn 8.215e+03 2.148e+04 0.383 0.702142
## exterior_2ndBrk Cmn -1.025e+04 1.736e+04 -0.590 0.554953
## exterior_2ndBrkFace -1.522e+04 1.335e+04 -1.141 0.254181
## exterior_2ndCBlock -7.449e+03 2.985e+04 -0.250 0.802993
## exterior_2ndCmentBd 9.969e+03 2.059e+04 0.484 0.628396
## exterior_2ndHdBoard -4.972e+03 1.214e+04 -0.409 0.682271
## exterior_2ndImStucc -2.424e+04 1.583e+04 -1.531 0.125890
## exterior_2ndMetalSd -9.364e+03 1.368e+04 -0.684 0.493810
## exterior_2ndPlywood -9.937e+03 1.179e+04 -0.843 0.399577
## exterior_2ndPreCast NA NA NA NA
## exterior_2ndStone -3.962e+04 2.356e+04 -1.682 0.092874 .
## exterior_2ndStucco -1.749e+02 1.338e+04 -0.013 0.989570
## exterior_2ndVinylSd -6.997e+03 1.310e+04 -0.534 0.593331
## exterior_2ndWd Sdng -3.278e+03 1.188e+04 -0.276 0.782597
## exterior_2ndWd Shng -7.669e+03 1.235e+04 -0.621 0.534596
## mas_vnr_typeBrkFace 1.441e+04 8.981e+03 1.605 0.108706
## mas_vnr_typeNA 1.836e+04 1.146e+04 1.602 0.109474
## mas_vnr_typeNone 2.064e+04 9.054e+03 2.280 0.022771 *
## mas_vnr_typeStone 2.087e+04 9.218e+03 2.264 0.023711 *
## exter_qualFa -1.914e+04 8.782e+03 -2.179 0.029462 *
## exter_qualGd -2.242e+04 4.039e+03 -5.552 3.35e-08 ***
## exter_qualTA -2.394e+04 4.606e+03 -5.197 2.31e-07 ***
## exter_condFa -2.434e+03 1.057e+04 -0.230 0.817822
## exter_condGd 1.234e+03 9.261e+03 0.133 0.893993
## exter_condPo -5.491e+03 2.039e+04 -0.269 0.787760
## exter_condTA 3.804e+03 9.214e+03 0.413 0.679809
## foundationCBlock -9.680e+02 2.985e+03 -0.324 0.745752
## foundationPConc 5.323e+02 3.123e+03 0.170 0.864675
## foundationSlab -5.579e+03 8.654e+03 -0.645 0.519204
## foundationStone 1.890e+03 1.201e+04 0.157 0.874998
## foundationWood -2.322e+03 1.271e+04 -0.183 0.855015
## bsmt_qualFa -1.801e+04 5.222e+03 -3.449 0.000579 ***
## bsmt_qualGd -2.048e+04 2.931e+03 -6.987 4.22e-12 ***
## bsmt_qualNA 3.149e+04 3.485e+04 0.904 0.366292
## bsmt_qualPo -1.470e+04 3.121e+04 -0.471 0.637757
## bsmt_qualTA -1.821e+04 3.679e+03 -4.948 8.34e-07 ***
## bsmt_condFa -2.380e+02 1.836e+04 -0.013 0.989660
## bsmt_condGd 1.208e+03 1.823e+04 0.066 0.947179
## bsmt_condNA NA NA NA NA
## bsmt_condPo 3.054e+04 2.696e+04 1.133 0.257447
## bsmt_condTA 2.587e+03 1.804e+04 0.143 0.885977
## bsmt_exposureGd 1.192e+04 2.629e+03 4.536 6.20e-06 ***
## bsmt_exposureMn -6.628e+03 2.736e+03 -2.423 0.015531 *
## bsmt_exposureNA -1.624e+04 1.335e+04 -1.217 0.223897
## bsmt_exposureNo -5.595e+03 1.987e+03 -2.816 0.004932 **
## bsmtfin_type_1BLQ 7.058e+02 2.586e+03 0.273 0.784936
## bsmtfin_type_1GLQ 8.890e+02 2.256e+03 0.394 0.693614
## bsmtfin_type_1LwQ -5.453e+03 3.291e+03 -1.657 0.097733 .
## bsmtfin_type_1NA NA NA NA NA
## bsmtfin_type_1Rec -2.001e+03 2.577e+03 -0.776 0.437615
## bsmtfin_type_1Unf 2.068e+03 2.589e+03 0.799 0.424585
## bsmtfin_type_2BLQ -6.555e+03 5.939e+03 -1.104 0.269940
## bsmtfin_type_2GLQ 1.063e+04 7.627e+03 1.394 0.163655
## bsmtfin_type_2LwQ -8.222e+03 5.699e+03 -1.443 0.149275
## bsmtfin_type_2NA -1.684e+04 2.407e+04 -0.700 0.484279
## bsmtfin_type_2Rec -5.434e+03 5.558e+03 -0.978 0.328354
## bsmtfin_type_2Unf 2.859e+02 5.699e+03 0.050 0.959986
## heatingGasA 1.443e+04 2.483e+04 0.581 0.561103
## heatingGasW 1.389e+04 2.591e+04 0.536 0.591872
## heatingGrav 8.876e+02 2.776e+04 0.032 0.974497
## heatingOthW -1.552e+04 3.068e+04 -0.506 0.613090
## heatingWall 3.521e+04 3.250e+04 1.084 0.278755
## heating_qcFa -2.522e+03 4.140e+03 -0.609 0.542459
## heating_qcGd -2.447e+03 1.868e+03 -1.310 0.190485
## heating_qcPo -9.940e+03 1.834e+04 -0.542 0.587838
## heating_qcTA -2.442e+03 1.845e+03 -1.323 0.185982
## central_airY -3.777e+03 3.231e+03 -1.169 0.242600
## electricalFuseF -1.688e+03 5.743e+03 -0.294 0.768815
## electricalFuseP 1.255e+03 1.121e+04 0.112 0.910879
## electricalMix 4.063e+02 3.594e+04 0.011 0.990984
## electricalNA 1.606e+04 2.338e+04 0.687 0.492278
## electricalSBrkr 1.853e+02 2.619e+03 0.071 0.943596
## kitchen_qualFa -1.820e+04 5.627e+03 -3.235 0.001245 **
## kitchen_qualGd -2.129e+04 3.161e+03 -6.736 2.32e-11 ***
## kitchen_qualTA -2.191e+04 3.608e+03 -6.072 1.60e-09 ***
## functionalMaj2 -4.872e+03 1.370e+04 -0.356 0.722260
## functionalMin1 9.031e+03 8.622e+03 1.047 0.295065
## functionalMin2 1.162e+04 8.546e+03 1.360 0.174081
## functionalMod -5.091e+01 9.579e+03 -0.005 0.995760
## functionalSal 1.796e+04 2.679e+04 0.670 0.502700
## functionalTyp 1.850e+04 7.643e+03 2.420 0.015641 *
## fireplace_quFa -3.287e+03 6.210e+03 -0.529 0.596632
## fireplace_quGd 7.123e+02 4.817e+03 0.148 0.882471
## fireplace_quNA 8.313e+03 5.664e+03 1.468 0.142393
## fireplace_quPo 2.355e+03 6.541e+03 0.360 0.718884
## fireplace_quTA 3.300e+02 4.976e+03 0.066 0.947131
## garage_typeAttchd 7.529e+03 6.447e+03 1.168 0.243062
## garage_typeBasment 1.415e+04 8.604e+03 1.645 0.100161
## garage_typeBuiltIn 8.217e+03 7.045e+03 1.166 0.243641
## garage_typeCarPort 5.241e+03 1.073e+04 0.489 0.625204
## garage_typeDetchd 9.157e+03 6.444e+03 1.421 0.155536
## garage_typeNA 3.356e+04 2.653e+04 1.265 0.206035
## garage_finishNA -2.584e+04 2.996e+04 -0.863 0.388527
## garage_finishRFn -8.534e+02 1.773e+03 -0.481 0.630415
## garage_finishUnf 2.644e+03 2.167e+03 1.220 0.222624
## garage_qualFa -5.088e+04 2.985e+04 -1.705 0.088459 .
## garage_qualGd -3.233e+04 2.854e+04 -1.133 0.257603
## garage_qualNA NA NA NA NA
## garage_qualPo -5.442e+04 3.553e+04 -1.532 0.125769
## garage_qualTA -4.787e+04 2.955e+04 -1.620 0.105412
## garage_condFa 3.748e+04 2.470e+04 1.518 0.129323
## garage_condGd 2.544e+04 2.520e+04 1.010 0.312887
## garage_condNA NA NA NA NA
## garage_condPo 3.600e+04 2.735e+04 1.316 0.188292
## garage_condTA 3.559e+04 2.436e+04 1.461 0.144211
## paved_driveP -1.177e+03 4.804e+03 -0.245 0.806543
## paved_driveY 6.479e+02 3.015e+03 0.215 0.829846
## pool_qcFa 1.536e+04 7.518e+04 0.204 0.838139
## pool_qcGd 1.740e+03 7.626e+04 0.023 0.981795
## pool_qcNA -1.194e+05 3.152e+04 -3.789 0.000157 ***
## pool_qcTA -2.353e+04 4.160e+04 -0.566 0.571683
## fenceGdWo -1.437e+03 4.503e+03 -0.319 0.749660
## fenceMnPrv -1.915e+03 3.721e+03 -0.515 0.606835
## fenceMnWw -8.587e+03 8.707e+03 -0.986 0.324167
## fenceNA -3.212e+03 3.393e+03 -0.947 0.343931
## misc_featureGar2 5.472e+05 3.276e+04 16.706 < 2e-16 ***
## misc_featureNA 5.399e+05 4.248e+04 12.710 < 2e-16 ***
## misc_featureShed 5.410e+05 4.132e+04 13.091 < 2e-16 ***
## sale_typeCon 2.661e+04 1.491e+04 1.784 0.074553 .
## sale_typeConLD 7.827e+03 7.276e+03 1.076 0.282200
## sale_typeConLI -3.709e+02 9.181e+03 -0.040 0.967778
## sale_typeConLw 8.879e+03 1.004e+04 0.884 0.376759
## sale_typeCWD 1.151e+04 1.106e+04 1.041 0.298228
## sale_typeNew 5.967e+02 1.545e+04 0.039 0.969204
## sale_typeOth -8.473e+04 2.396e+04 -3.536 0.000418 ***
## sale_typeVWD -4.993e+03 2.420e+04 -0.206 0.836560
## sale_typeWD 1.176e+03 3.641e+03 0.323 0.746671
## sale_conditionAdjLand 2.697e+04 1.047e+04 2.577 0.010059 *
## sale_conditionAlloca 7.267e+03 8.378e+03 0.867 0.385854
## sale_conditionFamily -5.064e+03 5.246e+03 -0.965 0.334599
## sale_conditionNormal 7.584e+03 2.642e+03 2.870 0.004159 **
## sale_conditionPartial 1.849e+04 1.515e+04 1.220 0.222517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22280 on 1481 degrees of freedom
## Multiple R-squared: 0.9371, Adjusted R-squared: 0.9254
## F-statistic: 80.01 on 276 and 1481 DF, p-value: < 2.2e-16
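Many of the variables come out as statistically significant.
Unfortunately, trying to predict on the validation set with the linear model produces the following error, because a factor level that is absent from the training set appears in the validation set.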
y_obs <- validation$saleprice
yhat_lm <- predict(df_lm_full, newdata=validation_f)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor ms_zoning has new levels A (agr)
(Advanced exercise: how can the error above be resolved?)
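One possible way to approach this exercise is sketched below on toy data. It is not the author's solution: the helper name mask_new_levels and the toy data frames are assumptions made for illustration. The idea is to replace any factor level that the fitted model has never seen with NA, so that predict() returns NA for those rows instead of stopping with an error.

```r
# Toy data: level "c" appears only in the new data (hypothetical example)
train_toy <- data.frame(y = c(1, 2, 3, 4),
                        g = factor(c("a", "a", "b", "b")))
new_toy   <- data.frame(g = factor(c("a", "b", "c")))

fit_toy <- lm(y ~ g, data = train_toy)

# Hypothetical helper: set levels unseen at fit time to NA.
# fit$xlevels holds the factor levels that lm() actually used.
mask_new_levels <- function(fit, newdata) {
  for (v in names(fit$xlevels)) {
    x <- as.character(newdata[[v]])
    x[!(x %in% fit$xlevels[[v]])] <- NA
    newdata[[v]] <- factor(x, levels = fit$xlevels[[v]])
  }
  newdata
}

# Rows with an unseen level get an NA prediction instead of an error
pred_toy <- predict(fit_toy, newdata = mask_new_levels(fit_toy, new_toy))
pred_toy
```

The NA predictions then have to be dropped or imputed before computing RMSE or MAE, which changes the set of rows being compared across models. Merging rare levels into an "other" category before splitting the data is another option.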
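The full linear model by itself does not usually predict very well, so we can apply stepwise variable selection as follows. (Omitted here because of its run time; readers are encouraged to run it on their own machines.)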
df_step <- stepAIC(df_lm_full, scope = list(upper = ~ ., lower = ~1))
df_step
anova(df_step)
summary(df_step)
length(coef(df_step))
length(coef(df_lm_full))
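For reference, when run on the author's machine, the original model had 286 parameters and the model after stepwise selection had 147.
If the df_lm_full and df_step models above work properly, the RMSE on the validation set can be computed as follows.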
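Because the stepwise run on the Ames data is slow, the same stepAIC() call can be tried first on a small data set. The sketch below uses the built-in mtcars data as a stand-in (an assumption for illustration only; the object names fit_full and fit_step are not from the original) and compares the number of coefficients before and after selection.

```r
library(MASS)  # provides stepAIC()

# Full model on a small built-in data set (mtcars stands in for the Ames data)
fit_full <- lm(mpg ~ ., data = mtcars)

# Stepwise search between the intercept-only model and the full model
fit_step <- stepAIC(fit_full,
                    scope = list(upper = ~ ., lower = ~ 1),
                    trace = FALSE)

# Number of coefficients before and after selection
c(full = length(coef(fit_full)), step = length(coef(fit_step)))
```

Setting trace = FALSE only suppresses the step-by-step log; the selected model is the same.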
y_obs <- validation$saleprice
yhat_lm <- predict(df_lm_full, newdata=validation)
yhat_step <- predict(df_step, newdata=validation)
rmse(y_obs, yhat_lm)
rmse(y_obs, yhat_step)
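# Lasso via glmnet: build a numeric design matrix from all predictors (no intercept column)
xx <- model.matrix(saleprice ~ .-1, df)
x <- xx[training_idx, ]
y <- training$saleprice
# Cross-validate over the lambda path (cv.glmnet defaults: 10 folds, alpha = 1, i.e. lasso)
df_cvfit <- cv.glmnet(x, y)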
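glmnet needs a numeric matrix rather than a data frame, which is why model.matrix() is called first. The toy sketch below (the data frame toy and its columns are made up for illustration) shows how the formula saleprice ~ . - 1 drops the response, keeps numeric columns as they are, and expands a factor into indicator columns.

```r
# Hypothetical toy data with one numeric and one categorical predictor
toy <- data.frame(
  saleprice = c(100, 150, 120, 180),
  area      = c(50, 80, 60, 95),
  zone      = factor(c("RL", "RM", "RL", "FV"))
)

# Same formula pattern as in the text: all predictors, no intercept column
xx_toy <- model.matrix(saleprice ~ . - 1, toy)

colnames(xx_toy)  # numeric column plus one indicator column per factor level
dim(xx_toy)
```

On the real data this expansion is what turns a few dozen categorical variables into a few hundred columns, which is the setting where lasso shrinkage is useful.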
The cross-validated error changes with the value of the lambda parameter as follows:
plot(df_cvfit)
# coef(df_cvfit, s = c("lambda.1se"))
# coef(df_cvfit, s = c("lambda.min"))
The RMSE and MAE of the lasso model are:
y_obs <- validation$saleprice
yhat_glmnet <- predict(df_cvfit, s="lambda.min", newx=xx[validate_idx,])
yhat_glmnet <- yhat_glmnet[,1] # change to a vector from [n*1] matrix
rmse(y_obs, yhat_glmnet)
## [1] 44968.73
mae(y_obs, yhat_glmnet)
## [1] 17917.51
Let us fit a regression tree model with the rpart::rpart() function.
df_tr <- rpart(saleprice ~ ., data = training)
df_tr
## n= 1758
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1758 1.169331e+13 181888.40
## 2) overall_qual< 7.5 1469 3.558473e+12 156662.20
## 4) neighborhood=Blueste,BrDale,BrkSide,Edwards,IDOTRR,MeadowV,Mitchel,NAmes,NPkVill,OldTown,Sawyer,SWISU 868 1.156217e+12 131366.40
## 8) overall_qual< 4.5 157 1.291966e+11 96895.25 *
## 9) overall_qual>=4.5 711 7.992695e+11 138978.10
## 18) x1st_flr_sf< 1322 590 3.598974e+11 132189.30 *
## 19) x1st_flr_sf>=1322 121 2.795925e+11 172080.50 *
## 5) neighborhood=Blmngtn,ClearCr,CollgCr,Crawfor,Gilbert,GrnHill,NoRidge,NridgHt,NWAmes,SawyerW,Somerst,StoneBr,Timber,Veenker 601 1.044673e+12 193196.00
## 10) gr_liv_area< 1752 438 4.335084e+11 178563.90
## 20) gr_liv_area< 1204 99 4.230077e+10 147199.70 *
## 21) gr_liv_area>=1204 339 2.653791e+11 187723.40 *
## 11) gr_liv_area>=1752 163 2.654050e+11 232514.20 *
## 3) overall_qual>=7.5 289 2.448313e+12 310114.40
## 6) total_bsmt_sf< 1721.5 206 7.977165e+11 277138.10
## 12) gr_liv_area< 2374.5 164 4.407878e+11 260563.90
## 24) bsmt_qual=Fa,Gd,TA 107 1.971039e+11 239877.10 *
## 25) bsmt_qual=Ex 57 1.119365e+11 299397.20 *
## 13) gr_liv_area>=2374.5 42 1.359639e+11 341856.10 *
## 7) total_bsmt_sf>=1721.5 83 8.706003e+11 391959.40
## 14) gr_liv_area< 2225.5 48 1.512143e+11 341111.20 *
## 15) gr_liv_area>=2225.5 35 4.250778e+11 461694.10
## 30) neighborhood=Edwards,NAmes,NWAmes,Somerst,Veenker 7 7.128389e+10 319764.10 *
## 31) neighborhood=CollgCr,NoRidge,NridgHt,StoneBr 28 1.775330e+11 497176.50 *
# printcp(df_tr)
# summary(df_tr)
opar <- par(mfrow = c(1,1), xpd = NA)
plot(df_tr)
text(df_tr, use.n = TRUE)
par(opar)
Reading the tree output, the combination of splits that leads to the highest house prices is:
3) overall_qual>=7.5 289 2.448313e+12 310114.40
7) total_bsmt_sf>=1721.5 83 8.706003e+11 391959.40
15) gr_liv_area>=2225.5 35 4.250778e+11 461694.10
31) neighborhood=CollgCr,NoRidge,NridgHt,StoneBr 28 1.775330e+11 497176.50 *
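The same path can also be pulled out of a fitted tree programmatically rather than read off the printout. The sketch below is a minimal illustration on the built-in mtcars data (a stand-in; the names fit_toy, leaves and best_node are assumptions). It finds the terminal node with the largest fitted value and asks path.rpart() for the splits leading to it.

```r
library(rpart)

# Small regression tree on a built-in data set (stand-in for the Ames tree)
fit_toy <- rpart(mpg ~ ., data = mtcars)

# Terminal nodes are the rows of fit$frame whose var is "<leaf>";
# row names of fit$frame are the node numbers shown in the printout
leaves <- fit_toy$frame[fit_toy$frame$var == "<leaf>", c("n", "yval")]
best_node <- as.numeric(rownames(leaves)[which.max(leaves$yval)])

# Splits leading from the root to the leaf with the highest fitted value
path.rpart(fit_toy, nodes = best_node, print.it = FALSE)
```

Applied to df_tr, the same steps should point to node 31 in the output above, since that leaf has the largest yval.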
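Unfortunately, the rpart::rpart model also fails to predict when it meets a factor level that was not observed in the training set, and it reports the same kind of error as before: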
yhat_tr <- predict(df_tr, validation)
# rmse(y_obs, yhat_tr)
When applying randomForest(), character variables among the X predictors must be converted to factors. We use the training_f data frame created earlier.
set.seed(2017)
df_rf <- randomForest(saleprice ~ ., training_f)
df_rf
##
## Call:
## randomForest(formula = saleprice ~ ., data = training_f)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 26
##
## Mean of squared residuals: 670582153
## % Var explained: 89.92
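How training_f and validation_f were built is not shown in this section. One way to do the character-to-factor conversion so that every split shares the same level set is sketched below on toy data (train_toy, valid_toy and their columns are hypothetical). Sharing the levels matters because tree-based predict() methods generally expect the factor levels in new data to match those seen at fit time.

```r
# Hypothetical toy splits with a character predictor
train_toy <- data.frame(price = c(100, 150, 120),
                        zone  = c("RL", "RM", "RL"),
                        stringsAsFactors = FALSE)
valid_toy <- data.frame(price = c(130, 90),
                        zone  = c("RM", "FV"),
                        stringsAsFactors = FALSE)

# Convert each character column to a factor using the union of levels
chr_cols <- names(train_toy)[vapply(train_toy, is.character, logical(1))]
for (v in chr_cols) {
  lv <- sort(unique(c(train_toy[[v]], valid_toy[[v]])))
  train_toy[[v]] <- factor(train_toy[[v]], levels = lv)
  valid_toy[[v]] <- factor(valid_toy[[v]], levels = lv)
}

str(train_toy)
levels(valid_toy$zone)
```

An equivalent approach is to convert the full data frame once and split it afterwards, so that the subsets inherit the levels automatically.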
The error of the random forest model decreases with the number of trees as follows:
plot(df_rf)
The contribution of each variable to the model is as follows:
varImpPlot(df_rf)
The prediction results of the random forest model are as follows:
yhat_rf <- predict(df_rf, newdata=validation_f)
rmse(y_obs, yhat_rf)
## [1] 25543.2
mae(y_obs, yhat_rf)
## [1] 15081.66
A boosting model can be fit with the gbm::gbm() function. As with random forest, character variables among the X predictors must be converted to factors. (Omitted here because of its run time.)
set.seed(2017)
df_gbm <- gbm(saleprice ~ ., data=training_f,
n.trees=40000, cv.folds=3, verbose = TRUE)
(best_iter = gbm.perf(df_gbm, method="cv"))
yhat_gbm <- predict(df_gbm, n.trees=best_iter, newdata=validation_f)
rmse(y_obs, yhat_gbm)
Of the models evaluated on the validation set, random forest has the best predictive performance (the smallest RMSE and MAE):
tibble(method=c("glmnet", "rf"),
rmse=c(rmse(y_obs, yhat_glmnet), rmse(y_obs, yhat_rf)),
mae=c(mae(y_obs, yhat_glmnet), mae(y_obs, yhat_rf)))
## # A tibble: 2 x 3
## method rmse mae
## <chr> <dbl> <dbl>
## 1 glmnet 44968.73 17917.51
## 2 rf 25543.20 15081.66
Let us use the test set to estimate how well the random forest model generalizes:
y_obs_test <- test$saleprice
yhat_rf_test <- predict(df_rf, newdata=test_f)
rmse(y_obs_test, yhat_rf_test)
## [1] 25765.92
mae(y_obs_test, yhat_rf_test)
## [1] 15794.82
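The following plot compares the error distributions of the prediction models. Compared with glmnet, the random forest model produces fewer very large prediction errors. In that sense the random forest model is somewhat more robust.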
boxplot(list(# lm = y_obs-yhat_step,
# gbm = y_obs-yhat_gbm,
glmnet = y_obs-yhat_glmnet,
rf = y_obs-yhat_rf
), ylab="Error in Validation Set")
abline(h=0, lty=2, col='blue')
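The RMSE and MAE values reported earlier point the same way as the boxplot. For glmnet the RMSE (44968.73) is about 2.5 times the MAE (17917.51), while for random forest the ratio is about 1.7 (25543.2 versus 15081.66). Since RMSE squares the errors, a few very large errors inflate it much more than they inflate MAE. The toy numbers below (made up for illustration) show two error patterns with the same MAE but different RMSE.

```r
# Same definitions as at the start of the document
rmse <- function(yi, yhat_i) sqrt(mean((yi - yhat_i)^2))
mae  <- function(yi, yhat_i) mean(abs(yi - yhat_i))

y_toy      <- rep(0, 5)
yhat_even  <- rep(10, 5)          # five errors of size 10
yhat_spiky <- c(1, 1, 1, 1, 46)   # four small errors and one large one

# MAE is the same for both patterns ...
c(even = mae(y_toy, yhat_even), spiky = mae(y_toy, yhat_spiky))

# ... but RMSE is larger when the error is concentrated in one observation
c(even = rmse(y_toy, yhat_even), spiky = rmse(y_toy, yhat_spiky))
```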
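The next plot shows the relationship among the glmnet predictions, the random forest predictions, and the observed values. In line with the RMSE and MAE results, the random forest predictions are also more strongly correlated with the observed values: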
pairs(data.frame(y_obs=y_obs,
# yhat_lm=yhat_lm,
yhat_glmnet=c(yhat_glmnet),
# yhat_tr=yhat_tr,
yhat_rf=yhat_rf),
lower.panel=function(x,y){ points(x,y); abline(0, 1, col='red')},
upper.panel = panel.cor)
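This data set was hard to analyze: it is high-dimensional and has many missing values. Even so, a random forest model with fairly high predictive power could be fit with relatively little code.
Interested readers are encouraged to try the following additional analyses on this data: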
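The error "factor ... has new levels ..." occurs. How can this error be resolved? (Explain using the df_lm and df_tr models above.)
Run a principal component analysis of the X variables with prcomp().
Use the pls library to run a principal component analysis. How large are the RMSE and MAE?
Run a regression analysis on the red wine data (winequality-red.csv) described in the text. Summarize the results in about 10 slides.
(Omitted; see the main text of the book.)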
Apply regression analysis to the data at https://goo.gl/R0Pyrt (http://archive.ics.uci.edu/ml/datasets/Abalone) and summarize the results in about 10 slides.
(Omitted; this is a simple problem with no missing values and only a few variables.)
Rings: integer; adding 1.5 gives the age in years.
Apply regression analysis to the data at https://goo.gl/etZcrE (http://archive.ics.uci.edu/ml/datasets/Air+Quality) and summarize the results in about 10 slides.
(Omitted; this data set is well suited to time series analysis.)
Find another high-dimensional regression data set at https://goo.gl/hmyTre (https://archive.ics.uci.edu/ml/datasets.html) or https://goo.gl/zSrO3C (https://www.kaggle.com/datasets), run the analyses described in the text, and summarize the results in about 10 slides.
(Omitted)