Publicación del trabajo

Este trabajo está incluido en un proceso de evaluación del “Master de BigData - UNED” y desarrolla un análisis de principio a fin en la competición de kaggle “House Prices: Advanced Regression Techniques”.

El informe completo se encuentra publicado:

En los próximos días se publicará en kaggle en forma de “kernel”
Se ha publicado en Rpubs

Resumen

El objetivo del análisis es la predicción del precio de venta de las viviendas en el mercado inmobiliario de Ames, Iowa. Para la predicción de esta variable cuantitativa disponemos de 79 predictores; tanto cuantitativos como cualitativos referidos a múltiples características sobre una vivienda.

En el trabajo se reparan los valores perdidos, se construyen nuevas variables, se hace un tratamiento de los casos atípicos y se preparan los datos para llevar a cabo el modelo que ajuste el precio de venta de la vivienda. Para ello se ha elegido la metodología “Lasso regression” obteniendo una RMSE de 0.1122484.

1 Habilitar librerías y cargar los datos en R

El siguiente código comprueba que los paquetes de R que van a ser usados estén instalados. Si es necesario, el código los instala y después los carga.

# Nos aseguramos que están instalados los paquetes y, sino los instalamos
if(!is.element("knitr", installed.packages()[, 1]))
      install.packages("knitr")
if(!is.element("ggplot2", installed.packages()[, 1]))
      install.packages("ggplot2")
if(!is.element("plyr", installed.packages()[, 1]))
      install.packages("plyr")
if(!is.element("dplyr", installed.packages()[, 1]))
      install.packages("dplyr")
if(!is.element("corrplot", installed.packages()[, 1]))
      install.packages("corrplot")
if(!is.element("caret", installed.packages()[, 1]))
      install.packages("caret")
if(!is.element("gridExtra", installed.packages()[, 1]))
      install.packages("gridExtra")
if(!is.element("scales", installed.packages()[, 1]))
      install.packages("scales")
if(!is.element("Rmisc", installed.packages()[, 1]))
      install.packages("Rmisc")
if(!is.element("ggrepel", installed.packages()[, 1]))
      install.packages("ggrepel")
if(!is.element("randomForest", installed.packages()[, 1]))
      install.packages("randomForest")
if(!is.element("psych", installed.packages()[, 1]))
      install.packages("psych")

# Los cargamos
library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(caret)
library(gridExtra)
library(scales)
library(Rmisc)
library(ggrepel)
library(randomForest)
library(psych)

A continuación, cargamos los datos de los conjuntos de entrenamiento y prueba. En este enlace se pueden descargar a un archivo zip. Dentro del archivo encontramos los ficheros con los conjuntos de datos y la descripción de las variables.

train <- read.csv("C:/Users/ignac/OneDrive/R WorkingDirectory/input/train.csv", stringsAsFactors = F)
test <- read.csv("C:/Users/ignac/OneDrive/R WorkingDirectory/input/test.csv", stringsAsFactors = F)

1.1 Dimensiones y estructura del conjunto de datos

El conjunto de entrenamiento está formado por varibles numéricas y categóricas. Respecto a las variables categóricas se han incorporado al conjunto de datos como character variables para llevar a cabo con más comodidad el preprocesamiento de los datos: limpieza, transformación de variables, tratamiento de NA’s, etc. Más adelante en el trabajo, se convertirán a factores o factores ordenados según corresponda.

En total el conjunto presenta 81 columnas/variables, de las cuales la última corresponde a la variable objetivo (SalePrice). La primera columna (Id) la vamos a eliminar, conservando en un vector las observaciones correspondientes al conjunto de prueba, que será necesario para componer el archivo de subida a Kaggle con las predicciones llevadas a cabo.

dim(train)

## [1] 1460   81

str(train[,c(1:10, 81)]) #muestra la estructura de las 10 primeras variables junto con la variable objetivo

## 'data.frame':    1460 obs. of  11 variables:
##  $ Id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning   : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage: int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea    : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street     : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley      : chr  NA NA NA NA ...
##  $ LotShape   : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour: chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities  : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ SalePrice  : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

head(train[,c(1:10, 81)])

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1  1         60       RL          65    8450   Pave  <NA>      Reg
## 2  2         20       RL          80    9600   Pave  <NA>      Reg
## 3  3         60       RL          68   11250   Pave  <NA>      IR1
## 4  4         70       RL          60    9550   Pave  <NA>      IR1
## 5  5         60       RL          84   14260   Pave  <NA>      IR1
## 6  6         50       RL          85   14115   Pave  <NA>      IR1
##   LandContour Utilities SalePrice
## 1         Lvl    AllPub    208500
## 2         Lvl    AllPub    181500
## 3         Lvl    AllPub    223500
## 4         Lvl    AllPub    140000
## 5         Lvl    AllPub    250000
## 6         Lvl    AllPub    143000

#Eliminar la columna Id, conservando las correspondientes al conjunto de prueba
test_labels <- test$Id
test$Id <- NULL
train$Id <- NULL

test$SalePrice <- NA #añadimos una columna SalePrice al conjunto de prueba
full <- rbind(train, test) #combinamos ambos conjuntos a lo largo
dim(full)

## [1] 2919   80

Construimos un vector que indica qué variables son numéricas. Lo usaremos más adelante:

numericVars <- which(sapply(full, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on

2 Limpieza y preprocesamiento de los datos

2.1 Valores perdidos

El primer paso consiste en comprobar qué variables contienen valores perdidos NA. Para ello ejecutamos el siguiente código.

NAcol <- which(colSums(is.na(full)) > 0)
sort(colSums(sapply(full[NAcol], is.na)), decreasing = TRUE)

##       PoolQC  MiscFeature        Alley        Fence    SalePrice 
##         2909         2814         2721         2348         1459 
##  FireplaceQu  LotFrontage  GarageYrBlt GarageFinish   GarageQual 
##         1420          486          159          159          159 
##   GarageCond   GarageType     BsmtCond BsmtExposure     BsmtQual 
##          159          157           82           82           81 
## BsmtFinType2 BsmtFinType1   MasVnrType   MasVnrArea     MSZoning 
##           80           79           24           23            4 
##    Utilities BsmtFullBath BsmtHalfBath   Functional  Exterior1st 
##            2            2            2            2            1 
##  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1

cat('Hay', length(NAcol), 'columnas con valores perdidos')

## Hay 35 columnas con valores perdidos

Los 1459 valores perdidos que aparecen en la variable SalePrice se corresponden, por supuesto, con las observaciones del conjunto de prueba (que es aquel para el que tenemos que predecir el valor de SalePrice). Por lo tanto, tenemos 34 regresores para los que habrá que reparar los valores perdidos.

Debido al abultado número de variables con valores perdidos y puesto que algunas forman grupos de variables, vamos a organizar esta sección en pestañas para hacer más legible el informe. Abordamos el tratamiento de las variables en orden descendente según el número de valores perdidos. La descripción de las variables podemos encontarla en el documento data_description.txt que se encuentra dentro del archivo zip cuyo enlace de descarga se encuentra más arriba.

2.1.1 Pool variables

Pool Quality y PoolArea

PoolQC: Pool quality

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool

La variable PoolQC es un factor ordenado que indica la calidad de la piscina. Los NAs se corresponden con viviendas que no tienen piscina, así que asignamos ‘No pool’ a esos valores. Además, vamos a recodificar la variable. Para ello usaremos un vector con los nuevos valores y la función ‘revalue’ del paquete plyr.

Qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

#aunque PoolQC no tiene modalidad 'Poor', otras variables que miden al calidad de distintos elementos (Fireplaces, Garages, etc) sí la incluyen. Por ello vamos a usar este vector con la escala de 0 a 5, incluyendo el valor 1 para el valor 'Poor'

full$PoolQC[is.na(full$PoolQC)] <- 'None' #imputamos el valor 'None' a los NAs

full$PoolQC <- as.integer(revalue(full$PoolQC, Qualities)) #recodificamos la variable
table(full$PoolQC) #mostramos la distribución de frecuencias

## 
##    0    2    4    5 
## 2909    2    4    4

Sin embargo, la variable PoolQC presenta otro problema. Como se ve a continuación, hay tres viviendas con ‘No Pool’ que sí tienen una medida para la variable PoolArea, lo cual debe de tratarse de un error. Asumimos que la variable correctamente recogida es PoolArea y asignamos el valor de PoolQC en función de la variable OverallQual. Ésta última recoge la calidad global de la vivienda en una escala de 1 a 10.

full[full$PoolArea>0, c('PoolArea', 'PoolQC', 'OverallQual')]

##      PoolArea PoolQC OverallQual
## 198       512      5           8
## 811       648      2           6
## 1171      576      4           6
## 1183      555      5          10
## 1299      480      4          10
## 1387      519      2           7
## 1424      738      4           6
## 1975      144      5          10
## 2421      368      0           4
## 2504      444      0           6
## 2574      228      5           8
## 2600      561      0           3
## 2711      800      4           7

full$PoolQC[2421] <- 2
full$PoolQC[2504] <- 3
full$PoolQC[2600] <- 2
table(full$PoolQC) #mostramos la distribución de frecuencias

## 
##    0    2    3    4    5 
## 2906    4    1    4    4

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.2 Miscellaneous Feature

Características varias no cubiertas en otras categorías

MiscFeature: Miscellaneous Feature

La variable MiscFeature contiene 2814 NAs. Es categórica nominal, es decir, la convertimos en factor (no ordenado). La descripción nos muestra los siguientes valores:

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA     None

Asignamos la categoría ‘None’ a los NAs y convertirmos la columna en factor.

full$MiscFeature[is.na(full$MiscFeature)] <- 'None'
full$MiscFeature <- as.factor(full$MiscFeature)

table(full$MiscFeature)

## 
## Gar2 None Othr Shed TenC 
##    5 2814    4   95    1

El siguiente gráfico nos ayuda a comprobar que la variable no es ordinal. Mostramos el precio mediano de las viviendas según los valores de MiscFeature.

ggplot(full[!is.na(full$SalePrice),], aes(x=MiscFeature, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

table(full[!is.na(full$SalePrice),]$MiscFeature) #frecuencias en el conjunto de entrenamiento

## 
## Gar2 None Othr Shed TenC 
##    2 1406    2   49    1

Observando la distribución de frecuencias y la naturaleza del fénomeno que recoge, parece que la información que aporta esta variable no vaya a ser relevante ni de calidad.

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.3 Alley

Tipo de callejón de acceso (pavimento)

Alley: Type of alley access to property

La variable contiene 2721 NAs. No podemos considerarla variable ordinal, así que la convertimos en factor. Descripción:

   Grvl Gravel
   Pave Paved
   NA   No alley access

full$Alley[is.na(full$Alley)] <- 'None'
full$Alley <- as.factor(full$Alley)

full[!is.na(full$SalePrice),] %>% group_by(Alley) %>% summarise(median = median(SalePrice), counts=n())

## # A tibble: 3 x 3
##   Alley median counts
##   <fct>  <dbl>  <int>
## 1 Grvl  119500     50
## 2 None  165000   1369
## 3 Pave  172500     41

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.4 Fence

Calidad del vallado

Fence: Fence quality

La variable Fence contiene 2348 NAs que se corresponden con viviendas sin valla. Descripción:

   GdPrv    Good Privacy
   MnPrv    Minimum Privacy
   GdWo   Good Wood
   MnWw   Minimum Wood/Wire
   NA       No Fence

full$Fence[is.na(full$Fence)] <- 'None'
table(full$Fence)

## 
## GdPrv  GdWo MnPrv  MnWw  None 
##   118   112   329    12  2348

ggplot(full[!is.na(full$SalePrice),], aes(x=Fence, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 200000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

table(full[!is.na(full$SalePrice),]$Fence) #frecuencias en el conjunto de entrenamiento

## 
## GdPrv  GdWo MnPrv  MnWw  None 
##    59    54   157    11  1179

En base al gráfico, donde el “no vallado” es mejor en términos el precio mediano de venta de las viviendas, vamos a convertir la variable en factor no ordenado.

full$Fence <- as.factor(full$Fence)

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.5 Fireplace variables

FireplaceQu(Fireplace Quality) y Fireplaces(Number of fireplaces)

La variable Fireplace Quality contiene 1420 NAs. Number of fireplaces no contiene ningún valor perdido.

Fireplace quality

FireplaceQu: Fireplace quality

Esta variable viene medida en la misma escala que Pool Quality, añadiendo la modalidad ‘Poor’. Asignaremos el valor adecuado a los NAs y convertiremos la variable en factor ordenado. Descripción:

   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace

table(full$FireplaceQu, useNA = "ifany")

## 
##   Ex   Fa   Gd   Po   TA <NA> 
##   43   74  744   46  592 1420

Asignamos el valor ‘No Fireplace’ a los NAs y usamos el vector Qualities creado anteriormente para transformar en ordinal.

full$FireplaceQu[is.na(full$FireplaceQu)] <- 'None'
full$FireplaceQu<-as.integer(revalue(full$FireplaceQu, Qualities))
table(full$FireplaceQu)

## 
##    0    1    2    3    4    5 
## 1420   46   74  592  744   43

Number of fireplaces

Fireplaces: Number of fireplaces

Fireplaces es una variable numérica discreta y no contiene valores perdidos.

table(full$Fireplaces) #hay 1420 observaciones con 0 "fireplaces"

## 
##    0    1    2    3    4 
## 1420 1268  219   11    1

sum(table(full$Fireplaces))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.6 Lot variables

LotFrontage: Linear feet of street connected to property

486 NAs. Se trata de una medida cuantitativa continua. Para corregir los valores perdidos vamos a utilizar el vecindario de la vivienda (variable Neighborhood: Physical locations within Ames city limits). Al tratarse de una característica que podemos intuir asociada a la construcción de barrios o áreas de la ciudad, distinguiremos por vecindario e imputaremos la mediana de LotFrontage del vecindario al que pertenece la vivienda con NA.

ggplot(full[!is.na(full$LotFrontage),], aes(x=as.factor(Neighborhood), y=LotFrontage)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

for (i in 1:nrow(full)){
        if(is.na(full$LotFrontage[i])){
               full$LotFrontage[i] <- as.integer(median(full$LotFrontage[full$Neighborhood==full$Neighborhood[i]], na.rm=TRUE)) 
        }
}
cat("LotFrontage contiene", sum(is.na(full$LotFrontage)), "valores perdidos")

## LotFrontage contiene 0 valores perdidos

LotShape: General shape of property

No NAs. La variable parece ordinal, con valores Reg>IR1>IR2>IR3. Descripción:

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular

#convertimos LotShape en factor ordenado
full$LotShape<-as.integer(revalue(full$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))

table(full$LotShape)

## 
##    0    1    2    3 
##   16   76  968 1859

sum(table(full$LotShape))

## [1] 2919

LotConfig: Lot configuration

No NAs. La variable no parece ordinal. El gráfico parece confirmarlo, por lo que transformamos LotConfig en factor (no ordenado). Descripción:

   Inside     Inside lot
   Corner     Corner lot
   CulDSac  Cul-de-sac
   FR2      Frontage on 2 sides of property
   FR3      Frontage on 3 sides of property

ggplot(full[!is.na(full$SalePrice),], aes(x=as.factor(LotConfig), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

full$LotConfig <- as.factor(full$LotConfig)
table(full$LotConfig)

## 
##  Corner CulDSac     FR2     FR3  Inside 
##     511     176      85      14    2133

sum(table(full$LotConfig))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.7 Garage variables

En conjunto son 7 variables las que forman este grupo

Dos de ellas tienen un solo valor NA (GarageCars y GarageArea), una de ellas tiene 157 NAs (GarageType) y las otras 4 variables tienen 159 NAs.

Primero reemplazamos los 159 NAs de GarageYrBlt: Year garage was built usando los valores de la variable que recoge la fecha de construcción de la vivienda (YearBuilt: Original construction date).

full$GarageYrBlt[is.na(full$GarageYrBlt)] <- full$YearBuilt[is.na(full$GarageYrBlt)]

En las variables GarageType, GarageFinish, GarageCond y GarageQual los valores NAs significan ‘No Garage’. GarageType tiene 157 valores perdidos, mientras que als otras 3 variables tienen 159 NAs.

#comprobamos que los 157 NAs pertenecen a las mismas observaciones en todas las variables
length(which(is.na(full$GarageType) & is.na(full$GarageFinish) & is.na(full$GarageCond) & is.na(full$GarageQual)))

## [1] 157

#buscamos los 2 NAs adicionales
kable(full[!is.na(full$GarageType) & is.na(full$GarageFinish), c('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])

	GarageCars	GarageArea	GarageType	GarageCond	GarageQual	GarageFinish
2127	1	360	Detchd	NA	NA	NA
2577	NA	NA	Detchd	NA	NA	NA

The 157 NAs within GarageType all turn out to be NA in GarageCondition, GarageQuality, and GarageFinish as well. The differences are found in houses 2127 and 2577. As you can see, house 2127 actually does seem to have a Garage and house 2577 does not. Therefore, there should be 158 houses without a Garage. To fix house 2127, I will imputate the most common values (modes) for GarageCond, GarageQual, and GarageFinish.

#imputamos las modas
full$GarageCond[2127] <- names(sort(table(full$GarageCond), decreasing = T))[1]
full$GarageQual[2127] <- names(sort(table(full$GarageQual), decreasing = T))[1]
full$GarageFinish[2127] <- names(sort(table(full$GarageFinish), decreasing = T))[1]

#mostramos la vivienda "ajustada"
kable(full[2127, c('GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])

	GarageYrBlt	GarageCars	GarageArea	GarageType	GarageCond	GarageQual	GarageFinish
2127	1910	1	360	Detchd	TA	TA	Unf

GarageCars and GarageArea: Size of garage in car capacity and Size of garage in square

Ambas tienen 1 NA. Como se ven arriba, se trata de la vivienda 2577 en ambos casos. Vamos a considerar la observación como una vivienda sin garaje:

#aplicamos a la vivienda 2577 una situación de 'No garage'
full$GarageCars[2577] <- 0
full$GarageArea[2577] <- 0
full$GarageType[2577] <- NA

#comprobamos que el número de perdidos sea 158
length(which(is.na(full$GarageType) & is.na(full$GarageFinish) & is.na(full$GarageCond) & is.na(full$GarageQual)))

## [1] 158

Una vez que la 4 variables categóricas del grupo tienen el mismo conjunto de valores perdidos vamos a ocuparnos de ellos una por una.

GarageType: Garage location

No parece una variable ordinal. Descripción:

   2Types     More than one type of garage
   Attchd     Attached to home
   Basment  Basement Garage
   BuiltIn  Built-In (Garage part of house - typically has room above garage)
   CarPort  Car Port
   Detchd     Detached from home
   NA         No Garage

#la convertimos en factor no ordenado
full$GarageType[is.na(full$GarageType)] <- 'No Garage'
full$GarageType <- as.factor(full$GarageType)
table(full$GarageType)

## 
##    2Types    Attchd   Basment   BuiltIn   CarPort    Detchd No Garage 
##        23      1723        36       186        15       778       158

GarageFinish: Interior finish of the garage

Es una característica con valores ordenados. La convertiremos en factor ordenado. Descripción:

   Fin  Finished
   RFn  Rough Finished  
   Unf  Unfinished
   NA     No Garage

#imputamos 'No Garage' a los NAs
full$GarageFinish[is.na(full$GarageFinish)] <- 'None' 

#construimos el vector de valores ordenados
Finish <- c('None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)

#convertimos en factor y chequeamos
full$GarageFinish<-as.integer(revalue(full$GarageFinish, Finish))
table(full$GarageFinish)

## 
##    0    1    2    3 
##  158 1231  811  719

GarageQual: Garage quality

Otra variable que convertimos en ordinal con el vector Qualities.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage

full$GarageQual[is.na(full$GarageQual)] <- 'None'
full$GarageQual<-as.integer(revalue(full$GarageQual, Qualities))
table(full$GarageQual)

## 
##    0    1    2    3    4    5 
##  158    5  124 2605   24    3

GarageCond: Garage condition

Otra variable que convertimos en ordinal con el vector Qualities.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage

full$GarageCond[is.na(full$GarageCond)] <- 'None'
full$GarageCond<-as.integer(revalue(full$GarageCond, Qualities))
table(full$GarageCond)

## 
##    0    1    2    3    4    5 
##  158   14   74 2655   15    3

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.8 Basement Variables

Forman un conjunto de 11 variables

Cinco de ellas tienen 79-82 NAs, seis contienen uno o dos NAs.

#entre las variables con 80+ NAs comprobamos que los 79 NAs sean los mismos
length(which(is.na(full$BsmtQual) & is.na(full$BsmtCond) & is.na(full$BsmtExposure) & is.na(full$BsmtFinType1) & is.na(full$BsmtFinType2)))

## [1] 79

#Buscamos los NAs restantes
full[!is.na(full$BsmtFinType1) & (is.na(full$BsmtCond)|is.na(full$BsmtQual)|is.na(full$BsmtExposure)|is.na(full$BsmtFinType2)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')]

##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## 333        Gd       TA           No          GLQ         <NA>
## 949        Gd       TA         <NA>          Unf          Unf
## 1488       Gd       TA         <NA>          Unf          Unf
## 2041       Gd     <NA>           Mn          GLQ          Rec
## 2186       TA     <NA>           No          BLQ          Unf
## 2218     <NA>       Fa           No          Unf          Unf
## 2219     <NA>       TA           No          Unf          Unf
## 2349       Gd       TA         <NA>          Unf          Unf
## 2525       TA     <NA>           Av          ALQ          Unf

So altogether, it seems as if there are 79 houses without a basement, because the basement variables of the other houses with missing values are all 80% complete (missing 1 out of 5 values). I am going to impute the modes to fix those 9 houses.

Basándonos en la distribución de los NAs entre las variables, asumimos que estas 9 observaciones son viviendas con sótano. Por tanto, vamos a imputar la moda de cada variable a los valores perdidos respectivamente.

#imputamos las modas a los NAs
full$BsmtFinType2[333] <- names(sort(-table(full$BsmtFinType2)))[1]
full$BsmtExposure[c(949, 1488, 2349)] <- names(sort(-table(full$BsmtExposure)))[1]
full$BsmtCond[c(2041, 2186, 2525)] <- names(sort(-table(full$BsmtCond)))[1]
full$BsmtQual[c(2218, 2219)] <- names(sort(-table(full$BsmtQual)))[1]

Now that the 5 variables considered agree upon 79 houses with ‘no basement’, I am going to factorize/hot encode them below.

Ahora que las 5 variables tienen las mismas 79 viviendas con ‘No Basement’ vamos a factorizarlas.

BsmtQual: Evaluates the height of the basement

Otra variable que convertimos en ordinal con el vector Qualities.

   Ex   Excellent (100+ inches) 
   Gd   Good (90-99 inches)
   TA   Typical (80-89 inches)
   Fa   Fair (70-79 inches)
   Po   Poor (<70 inches
   NA   No Basement

full$BsmtQual[is.na(full$BsmtQual)] <- 'None'
full$BsmtQual<-as.integer(revalue(full$BsmtQual, Qualities))
table(full$BsmtQual)

## 
##    0    2    3    4    5 
##   79   88 1285 1209  258

BsmtCond: Evaluates the general condition of the basement

Otra variable que convertimos en ordinal con el vector Qualities.

   Ex   Excellent
   Gd   Good
   TA   Typical - slight dampness allowed
   Fa   Fair - dampness or some cracking or settling
   Po   Poor - Severe cracking, settling, or wetness
   NA   No Basement

full$BsmtCond[is.na(full$BsmtCond)] <- 'None'
full$BsmtCond<-as.integer(revalue(full$BsmtCond, Qualities))
table(full$BsmtCond)

## 
##    0    1    2    3    4 
##   79    5  104 2609  122

BsmtExposure: Refers to walkout or garden level walls

Variable ordinal a la que factorizamos ordenadamente creando un vector auxiliar Exposure.

   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Mimimum Exposure
   No   No Exposure
   NA   No Basement

full$BsmtExposure[is.na(full$BsmtExposure)] <- 'None'
Exposure <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)

full$BsmtExposure<-as.integer(revalue(full$BsmtExposure, Exposure))
table(full$BsmtExposure)

## 
##    0    1    2    3    4 
##   79 1907  239  418  276

BsmtFinType1: Rating of basement finished area

Variable ordinal a la que factorizamos ordenadamente creando un vector auxiliar FinType.

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA     No Basement

full$BsmtFinType1[is.na(full$BsmtFinType1)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

full$BsmtFinType1<-as.integer(revalue(full$BsmtFinType1, FinType))
table(full$BsmtFinType1)

## 
##   0   1   2   3   4   5   6 
##  79 851 154 288 269 429 849

BsmtFinType2: Rating of basement finished area (if multiple types)

Variable ordinal a la que factorizamos ordenadamente usando el vector FinType.

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA     No Basement

full$BsmtFinType2[is.na(full$BsmtFinType2)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

full$BsmtFinType2<-as.integer(revalue(full$BsmtFinType2, FinType))
table(full$BsmtFinType2)

## 
##    0    1    2    3    4    5    6 
##   79 2494   87  105   68   52   34

Resto de variables con algunos NAs

Aún tenemos que tratar seis variables que continen uno o dos NAs.

#mostramos los NAs que quedan usando BsmtQual como referencia
full[(is.na(full$BsmtFullBath)|is.na(full$BsmtHalfBath)|is.na(full$BsmtFinSF1)|is.na(full$BsmtFinSF2)|is.na(full$BsmtUnfSF)|is.na(full$TotalBsmtSF)), c('BsmtQual', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]

##      BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121        0           NA           NA         NA         NA        NA
## 2189        0           NA           NA          0          0         0
##      TotalBsmtSF
## 2121          NA
## 2189           0

Resulta obvio que estas dos observaciones son viviendas sin sótano. A continuación corregimos los NAs:

BsmtFullBath: Basement full bathrooms

Variable numérica.

full$BsmtFullBath[is.na(full$BsmtFullBath)] <-0
table(full$BsmtFullBath)

## 
##    0    1    2    3 
## 1707 1172   38    2

BsmtHalfBath: Basement half bathrooms

Variable numérica.

full$BsmtHalfBath[is.na(full$BsmtHalfBath)] <-0
table(full$BsmtHalfBath)

## 
##    0    1    2 
## 2744  171    4

BsmtFinSF1: Type 1 finished square feet

Variable numérica.

full$BsmtFinSF1[is.na(full$BsmtFinSF1)] <-0

BsmtFinSF2: Type 2 finished square feet

Variable numérica.

full$BsmtFinSF2[is.na(full$BsmtFinSF2)] <-0

BsmtUnfSF: Unfinished square feet of basement area

Variable numérica.

full$BsmtUnfSF[is.na(full$BsmtUnfSF)] <-0

TotalBsmtSF: Total square feet of basement area

Variable numérica.

full$TotalBsmtSF[is.na(full$TotalBsmtSF)] <-0

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.9 Masonry variables

MasVnrType(Masonry veneer type) y MasVnrArea(Masonry veneer area in square feet)

MasVnrType tiene 24 NAs. MasVnrArea tiene 23 NAs. Primero vamos a arreglar la diferencia entre los valores perdidos de ambas variables.

#comprobamos que coincidan 23 NAs
length(which(is.na(full$MasVnrType) & is.na(full$MasVnrArea)))

## [1] 23

#buscamos el NA que tiene MasVnrType
full[is.na(full$MasVnrType) & !is.na(full$MasVnrArea), c('MasVnrType', 'MasVnrArea')]

##      MasVnrType MasVnrArea
## 2611       <NA>        198

Como una vivienda con valor en MasVnrArea también debe de incluir valor en MasVnrType, vamos a imputar el valor modal para corregir el NA:

#imputamos la moda, pero incluimos el 2nd valor porque el 1st es 'None'
full$MasVnrType[2611] <- names(sort(-table(full$MasVnrType)))[2] 
full[2611, c('MasVnrType', 'MasVnrArea')]

##      MasVnrType MasVnrArea
## 2611    BrkFace        198

Ahora tenemos 23 viviendas con ‘No Masonry’.

Masonry veneer type

Comprobamos la ordinalidad más abajo. Descripción:

   BrkCmn     Brick Common
   BrkFace  Brick Face
   CBlock     Cinder Block
   None     None
   Stone      Stone

full$MasVnrType[is.na(full$MasVnrType)] <- 'None'

full[!is.na(full$SalePrice),] %>% group_by(MasVnrType) %>% summarise(median = median(SalePrice), counts=n()) %>% arrange(median)

## # A tibble: 4 x 3
##   MasVnrType median counts
##   <chr>       <dbl>  <int>
## 1 BrkCmn     139000     15
## 2 None       143125    872
## 3 BrkFace    181000    445
## 4 Stone      246839    128

Parece que se trata de una variable ordinal. Como no parece haber una diferencia significativa entre ‘BrkCmn’ y ‘None’, convertimos la variable a un factor ordenado con tres categorías:

Masonry <- c('None'=0, 'BrkCmn'=0, 'BrkFace'=1, 'Stone'=2)
full$MasVnrType<-as.integer(revalue(full$MasVnrType, Masonry))
table(full$MasVnrType)

## 
##    0    1    2 
## 1790  880  249

MasVnrArea: Masonry veneer area in square feet

Variable numérica.

full$MasVnrArea[is.na(full$MasVnrArea)] <-0

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.10 MS Zoning

MSZoning: Identifies the general zoning classification of the sale

4 NAs. Los valores no son ordenados. Descripción:

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density

#asignamos el valor modal a los casos perdidos
full$MSZoning[is.na(full$MSZoning)] <- names(sort(-table(full$MSZoning)))[1]
full$MSZoning <- as.factor(full$MSZoning)
table(full$MSZoning)

## 
## C (all)      FV      RH      RL      RM 
##      25     139      26    2269     460

sum(table(full$MSZoning))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.11 Kitchen variables

KitchenQual(Kitchen quality) y KitchenAbvGr(Number of Kitchens above grade)

KitchenAual tiene 1 NA. Number of Kitchens above grade is complete.

Kitchen quality

1 NA. La convertimos en factor ordenado usando el vector Qualities.

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor

#imputamos la moda
full$KitchenQual[is.na(full$KitchenQual)] <- names(sort(-table(full$KitchenQual)))[1]
full$KitchenQual<-as.integer(revalue(full$KitchenQual, Qualities))
table(full$KitchenQual)

## 
##    2    3    4    5 
##   70 1493 1151  205

sum(table(full$KitchenQual))

## [1] 2919

Number of Kitchens above grade

Se trata de una variable númerica sin valores perdidos.

table(full$KitchenAbvGr)

## 
##    0    1    2    3 
##    3 2785  129    2

sum(table(full$KitchenAbvGr))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.12 Utilities

Utilities: Type of utilities available

2 NAs. Variable ordinal con valores AllPub>NoSewr>NoSeWa>ELO. Descripción:

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO    Electricity only

However, the table below shows that only one house does not have all public utilities. This house is in the train set. Therefore, imputing ‘AllPub’ for the NAs means that all houses in the test set will have ‘AllPub’. This makes the variable useless for prediction. Consequently, I will get rid of it.

La siguiente tabla muestra que solo hay una vivienda con valor distinto a ‘AllPub’. Esta vivienda se encuentra en el conjunto de entrenamiento. Por tanto, en el conjunto de prueba todas las viviendas tienen el mismo valor ‘AllPub’. Debido a esto, la variable para a convertirse en constante y pierde cualquier capacidad predictiva. Así que la eliminamos:

table(full$Utilities)

## 
## AllPub NoSeWa 
##   2916      1

kable(full[is.na(full$Utilities) | full$Utilities=='NoSeWa', 1:9])

	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities
945	20	RL	82	14375	Pave	None	2	Lvl	NoSeWa
1916	30	RL	109	21780	Grvl	None	3	Lvl	NA
1946	20	RL	64	31220	Pave	None	2	Bnk	NA

full$Utilities <- NULL

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.13 Home functionality

Functional: Home functionality

1 NA. Parece tratarse de un factor ordinal con valores Typ>..>Sal. Descripción:

   Typ  Typical Functionality
   Min1 Minor Deductions 1
   Min2 Minor Deductions 2
   Mod  Moderate Deductions
   Maj1 Major Deductions 1
   Maj2 Major Deductions 2
   Sev  Severely Damaged
   Sal  Salvage only

#imputamos la moda al único NA
full$Functional[is.na(full$Functional)] <- names(sort(-table(full$Functional)))[1]

full$Functional <- as.integer(revalue(full$Functional, c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)))
table(full$Functional)

## 
##    1    2    3    4    5    6    7 
##    2    9   19   35   70   65 2719

sum(table(full$Functional))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.14 Exterior variables

Grupo con 4 variables

2 variables tienen 1 NA, 2 variables no tienen NAs.

Exterior1st: Exterior covering on house

1 NA. Variable categórica (no ordenada). Descripción:

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock     Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other      Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone      Stone
   Stucco     Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

#imputamos la moda
full$Exterior1st[is.na(full$Exterior1st)] <- names(sort(-table(full$Exterior1st)))[1]

full$Exterior1st <- as.factor(full$Exterior1st)
table(full$Exterior1st)

## 
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd 
##      44       2       6      87       2     126     442       1     450 
## Plywood   Stone  Stucco VinylSd Wd Sdng WdShing 
##     221       2      43    1026     411      56

sum(table(full$Exterior1st))

## [1] 2919

Exterior2nd: Exterior covering on house (if more than one material)

1 NA. Variable categórica (no ordenada). Descripción:

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock     Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other      Other
   Plywood  Plywood
   PreCast  PreCast
   Stone      Stone
   Stucco     Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

#imputamos la moda
full$Exterior2nd[is.na(full$Exterior2nd)] <- names(sort(-table(full$Exterior2nd)))[1]

full$Exterior2nd <- as.factor(full$Exterior2nd)
table(full$Exterior2nd)

## 
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd 
##      38       4      22      47       3     126     406      15     447 
##   Other Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng 
##       1     270       6      47    1015     391      81

sum(table(full$Exterior2nd))

## [1] 2919

ExterQual: Evaluates the quality of the material on the exterior

Sin NAs. Usamos el vector Qualitiespara convertirla en variable ordinal. Descripción:

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

full$ExterQual<-as.integer(revalue(full$ExterQual, Qualities))

## The following `from` values were not present in `x`: None, Po

table(full$ExterQual)

## 
##    2    3    4    5 
##   35 1798  979  107

sum(table(full$ExterQual))

## [1] 2919

ExterCond: Evaluates the present condition of the material on the exterior

Sin NAs. Usamos el vector Qualitiespara convertirla en variable ordinal. Descripción:

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

full$ExterCond<-as.integer(revalue(full$ExterCond, Qualities))

## The following `from` values were not present in `x`: None

table(full$ExterCond)

## 
##    1    2    3    4    5 
##    3   67 2538  299   12

sum(table(full$ExterCond))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.15 Electrical system

Electrical: Electrical system

1 NA. Variable categórica no ordenada.

   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix    Mixed

#imputamos la moda
full$Electrical[is.na(full$Electrical)] <- names(sort(-table(full$Electrical)))[1]

full$Electrical <- as.factor(full$Electrical)
table(full$Electrical)

## 
## FuseA FuseF FuseP   Mix SBrkr 
##   188    50     8     1  2672

sum(table(full$Electrical))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.1.16 Sale Type and Condition

SaleType: Type of sale

1 NA. Variable categórica no ordenada.

   WD     Warranty Deed - Conventional
   CWD    Warranty Deed - Cash
   VWD    Warranty Deed - VA Loan
   New    Home just constructed and sold
   COD    Court Officer Deed/Estate
   Con    Contract 15% Down payment regular terms
   ConLw    Contract Low Down payment and low interest
   ConLI    Contract Low Interest
   ConLD    Contract Low Down
   Oth    Other

#imputamos la moda
full$SaleType[is.na(full$SaleType)] <- names(sort(-table(full$SaleType)))[1]

full$SaleType <- as.factor(full$SaleType)
table(full$SaleType)

## 
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
##    87     5    26     9     8    12   239     7  2526

sum(table(full$SaleType))

## [1] 2919

SaleCondition: Condition of sale

Sin NA. Variable categórica no ordenada.

   Normal     Normal Sale
   Abnorml  Abnormal Sale -  trade, foreclosure, short sale
   AdjLand  Adjoining Land Purchase
   Alloca     Allocation - two linked properties with separate deeds, typically condo with a                  garage unit   
   Family     Sale between family members
   Partial  Home was not completed when last assessed (associated with New Homes)

full$SaleCondition <- as.factor(full$SaleCondition)
table(full$SaleCondition)

## 
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##     190      12      24      46    2402     245

sum(table(full$SaleCondition))

## [1] 2919

Puede elegir otra pestaña más arriba para cambiar de grupo de variables

2.2 Recodificación/factorización de las variables categóricas restantes

En este punto todas las variables con NAs han sido tratadas. Ahora, aún tenemos varias variables categóricas que están almacenadas como character variable y vamos a recodificar en factores.

Charcol <- names(full[,sapply(full, is.character)])
Charcol

##  [1] "Street"       "LandContour"  "LandSlope"    "Neighborhood"
##  [5] "Condition1"   "Condition2"   "BldgType"     "HouseStyle"  
##  [9] "RoofStyle"    "RoofMatl"     "Foundation"   "Heating"     
## [13] "HeatingQC"    "CentralAir"   "PavedDrive"

cat('Hay', length(Charcol), 'columnas con variables tipo "character')

## Hay 15 columnas con variables tipo "character

Foundation: Type of foundation

    BrkTil        Brick & Tile
    CBlock        Cinder Block
    PConc           Poured Contrete 
    Slab            Slab
    Stone           Stone
    Wood            Wood

#No es ordinal, convertimos a factor
full$Foundation <- as.factor(full$Foundation)
table(full$Foundation)

## 
## BrkTil CBlock  PConc   Slab  Stone   Wood 
##    311   1235   1308     49     11      5

sum(table(full$Foundation))

## [1] 2919

Heating: Type of heating

   Floor    Floor Furnace
   GasA   Gas forced warm air furnace
   GasW   Gas hot water or steam heat
   Grav   Gravity furnace   
   OthW   Hot water or steam heat other than gas
   Wall   Wall furnace

#No es ordinal, convertimos a factor
full$Heating <- as.factor(full$Heating)
table(full$Heating)

## 
## Floor  GasA  GasW  Grav  OthW  Wall 
##     1  2874    27     9     2     6

sum(table(full$Heating))

## [1] 2919

HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

#convertimos a factor ordenado usando el vector Qualities
full$HeatingQC<-as.integer(revalue(full$HeatingQC, Qualities))

## The following `from` values were not present in `x`: None

table(full$HeatingQC)

## 
##    1    2    3    4    5 
##    3   92  857  474 1493

sum(table(full$HeatingQC))

## [1] 2919

CentralAir: Central air conditioning

   N    No
   Y    Yes

full$CentralAir<-as.integer(revalue(full$CentralAir, c('N'=0, 'Y'=1)))
table(full$CentralAir)

## 
##    0    1 
##  196 2723

sum(table(full$CentralAir))

## [1] 2919

RoofStyle: Type of roof

   Flat     Flat
   Gable      Gable
   Gambrel  Gabrel (Barn)
   Hip      Hip
   Mansard  Mansard
   Shed     Shed

#No es ordinal, convertimos a factor
full$RoofStyle <- as.factor(full$RoofStyle)
table(full$RoofStyle)

## 
##    Flat   Gable Gambrel     Hip Mansard    Shed 
##      20    2310      22     551      11       5

sum(table(full$RoofStyle))

## [1] 2919

RoofMatl: Roof material

   ClyTile  Clay or Tile
   CompShg  Standard (Composite) Shingle
   Membran  Membrane
   Metal      Metal
   Roll     Roll
   Tar&Grv  Gravel & Tar
   WdShake  Wood Shakes
   WdShngl  Wood Shingles

#No es ordinal, convertimos a factor
full$RoofMatl <- as.factor(full$RoofMatl)
table(full$RoofMatl)

## 
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    2876       1       1       1      23       9       7

sum(table(full$RoofMatl))

## [1] 2919

LandContour: Flatness of the property

   Lvl  Near Flat/Level 
   Bnk  Banked - Quick and significant rise from street grade to building
   HLS  Hillside - Significant slope from side to side
   Low  Depression

#No es ordinal, convertimos a factor
full$LandContour <- as.factor(full$LandContour)
table(full$LandContour)

## 
##  Bnk  HLS  Low  Lvl 
##  117  120   60 2622

sum(table(full$LandContour))

## [1] 2919

LandSlope: Slope of property

   Gtl  Gentle slope
   Mod  Moderate Slope  
   Sev  Severe Slope

#Ordinal
full$LandSlope<-as.integer(revalue(full$LandSlope, c('Sev'=0, 'Mod'=1, 'Gtl'=2)))
table(full$LandSlope)

## 
##    0    1    2 
##   16  125 2778

sum(table(full$LandSlope))

## [1] 2919

BldgType: Type of dwelling

   1Fam   Single-family Detached    
   2FmCon   Two-family Conversion; originally built as one-family dwelling
   Duplx    Duplex
   TwnhsE   Townhouse End Unit
   TwnhsI   Townhouse Inside Unit

Parece una variable ordinal con 1Fam=best. Vamos a visualizarla para decidir mejor:

ggplot(full[!is.na(full$SalePrice),], aes(x=as.factor(BldgType), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue')+
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

En base al gráfico consideramos que se trata de un factor no ordenado:

#No es ordinal, convertimos a factor
full$BldgType <- as.factor(full$BldgType)
table(full$BldgType)

## 
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   2425     62    109     96    227

sum(table(full$BldgType))

## [1] 2919

HouseStyle: Style of dwelling

   1Story   One story
   1.5Fin   One and one-half story: 2nd level finished
   1.5Unf   One and one-half story: 2nd level unfinished
   2Story   Two story
   2.5Fin   Two and one-half story: 2nd level finished
   2.5Unf   Two and one-half story: 2nd level unfinished
   SFoyer   Split Foyer
   SLvl   Split Level

#No es ordinal, convertimos a factor
full$HouseStyle <- as.factor(full$HouseStyle)
table(full$HouseStyle)

## 
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    314     19   1471      8     24    872     83    128

sum(table(full$HouseStyle))

## [1] 2919

Neighborhood: Physical locations within Ames city limits

Es una variable categórica no ordenada. Descripción:

   Blmngtn  Bloomington Heights
   Blueste  Bluestem
   BrDale     Briardale
   BrkSide  Brookside
   ClearCr  Clear Creek
   CollgCr  College Creek
   Crawfor  Crawford
   Edwards  Edwards
   Gilbert  Gilbert
   IDOTRR     Iowa DOT and Rail Road
   MeadowV  Meadow Village
   Mitchel  Mitchell
   Names      North Ames
   NoRidge  Northridge
   NPkVill  Northpark Villa
   NridgHt  Northridge Heights
   NWAmes     Northwest Ames
   OldTown  Old Town
   SWISU      South & West of Iowa State University
   Sawyer     Sawyer
   SawyerW  Sawyer West
   Somerst  Somerset
   StoneBr  Stone Brook
   Timber     Timberland
   Veenker  Veenker

full$Neighborhood <- as.factor(full$Neighborhood)
table(full$Neighborhood)

## 
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      28      10      30     108      44     267     103     194     165 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      93      37     114     443      71      23     166     131     239 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##     151     125     182      51      48      72      24

sum(table(full$Neighborhood))

## [1] 2919

Condition1: Proximity to various conditions

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm   Normal    
   RRNn   Within 200' of North-South Railroad
   RRAn   Adjacent to North-South Railroad
   PosN   Near positive off-site feature--park, greenbelt, etc.
   PosA   Adjacent to postive off-site feature
   RRNe   Within 200' of East-West Railroad
   RRAe   Adjacent to East-West Railroad

#convertimos a factor no ordenado
full$Condition1 <- as.factor(full$Condition1)
table(full$Condition1)

## 
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
##     92    164   2511     20     39     28     50      6      9

sum(table(full$Condition1))

## [1] 2919

Condition2: Proximity to various conditions (if more than one is present)

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm   Normal    
   RRNn   Within 200' of North-South Railroad
   RRAn   Adjacent to North-South Railroad
   PosN   Near positive off-site feature--park, greenbelt, etc.
   PosA   Adjacent to postive off-site feature
   RRNe   Within 200' of East-West Railroad
   RRAe   Adjacent to East-West Railroad

#convertimos a factor no ordenado
full$Condition2 <- as.factor(full$Condition2)
table(full$Condition2)

## 
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNn 
##      5     13   2889      4      4      1      1      2

sum(table(full$Condition2))

## [1] 2919

Street: Type of road access to property

   Grvl Gravel  
   Pave Paved

#Ordinal
full$Street<-as.integer(revalue(full$Street, c('Grvl'=0, 'Pave'=1)))
table(full$Street)

## 
##    0    1 
##   12 2907

sum(table(full$Street))

## [1] 2919

PavedDrive: Paved driveway

   Y    Paved 
   P    Partial Pavement
   N    Dirt/Gravel

#Ordinal
full$PavedDrive<-as.integer(revalue(full$PavedDrive, c('N'=0, 'P'=1, 'Y'=2)))
table(full$PavedDrive)

## 
##    0    1    2 
##  216   62 2641

sum(table(full$PavedDrive))

## [1] 2919

2.3 Transformación a factor de algunas variables numéricas

Una vez que todas las variables están completas y se han asignado etiquetas numéricas a las variables categóricas (ordinales y nominales), es momento de recodificar tres variables que aun siendo numéricas deben ser consideradas como factores.

2.3.1 Year and Month Sold

Mientras que el año de construcción (YearBuilt) puede ser considerado numérico, la variable que recoge el año de venta (YrSold) solo tiene 5 valores distintos. Por tanto la consideramos factor. Sin embargo, no la recodificaremos hasta más tarde; una vez que la hayamos usado para calcular una nueva variable: Age(Edad de la vivienda)

MoSold (Moth of Sold) es transformada a factor.

str(full$YrSold)

##  int [1:2919] 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...

str(full$MoSold)

##  int [1:2919] 2 5 9 2 12 10 8 11 4 1 ...

full$MoSold <- as.factor(full$MoSold)

2.3.2 MSSubClass

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

Los valores están codificados como números pero realmente se trata de modalidades de un factor:

str(full$MSSubClass)

##  int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...

full$MSSubClass <- as.factor(full$MSSubClass)

#modificamos los nombres de las etiquetas
full$MSSubClass<-revalue(full$MSSubClass, c('20'='1 story 1946+', '30'='1 story 1945-', '40'='1 story unf attic', '45'='1,5 story unf', '50'='1,5 story fin', '60'='2 story 1946+', '70'='2 story 1945-', '75'='2,5 story all ages', '80'='split/multi level', '85'='split foyer', '90'='duplex all style/age', '120'='1 story PUD 1946+', '150'='1,5 story PUD all', '160'='2 story PUD 1946+', '180'='PUD multilevel', '190'='2 family conversion'))

str(full$MSSubClass)

##  Factor w/ 16 levels "1 story 1946+",..: 6 1 6 7 6 5 1 6 5 16 ...

3 Visualización de variables

numericVars <- which(sapply(full, is.numeric)) #index vector numeric variables
factorVars <- which(sapply(full, is.factor)) #index vector factor variables
cat('Hay', length(numericVars), 'variables numéricas, y', length(factorVars), 'variables categóricas')

## Hay 56 variables numéricas, y 23 variables categóricas

3.1 Correlaciones

Vamos a visualizar qué variables presentan un coeficiente de correlación mayor a 0.5 en valor absoluto con la variable de respuesta (SalePrice):

full_numVar <- full[, numericVars]
cor_numVar <- cor(full_numVar, use="pairwise.complete.obs") #correlacions para todas las variables numéricas

#ordenamos de mayor a menor correlación abosulta con SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[,'SalePrice'], decreasing = TRUE))
#nos quedamos con las correlaciones altas
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]

corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)

3.2 Random Forest para determinar las variables más importantes

Aunque el estudio de las correlaciones nos informa de las variables más relevantes y acerca de la multicolinealidad presente entre las predictoras numéricas, vamos a llevar a cabo un análisis de tipo Random Forest para detectar qué variables son más importantes (incluidas las categóricas) para ayudarnos a guiar el estudio gráfico.

set.seed(2018)
quick_RF <- randomForest(x=full[1:1460,-79], y=full$SalePrice[1:1460], ntree=100, importance=TRUE)
imp_RF <- importance(quick_RF)
imp_DF <- data.frame(Variables = row.names(imp_RF), MSE = imp_RF[,1])
imp_DF <- imp_DF[order(imp_DF$MSE, decreasing = TRUE),]

ggplot(imp_DF[1:20,], aes(x=reorder(Variables, MSE), y=MSE, fill=MSE)) + geom_bar(stat = 'identity') + labs(x = 'Variables', y= '% increase MSE if variable is randomly permuted') + coord_flip() + theme(legend.position="none")

Las variables categóricas Neighborhood, MSSubClass y GarageType presentan alta importancia atendiendo al RF llevado a cabo.

3.2.1 Neighborhood. La variable categórica más importante

El primer gráfico muestra la mediana de SalePrice según Neighborhood. La frecuencia de cada Neighborhood se corresponde con el número de viviendas en el conjunto de entrenamiento.

El segundo gráfico representa la distribución de frecuencias para el conjunto entero.

n1 <- ggplot(full[!is.na(full$SalePrice),], aes(x=Neighborhood, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
n2 <- ggplot(data=full, aes(x=Neighborhood)) +
        geom_histogram(stat='count')+
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3)+
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(n1, n2)

3.2.2 Overall Quality y otras variables de calidad

q1 <- ggplot(data=full, aes(x=as.factor(OverallQual))) +
        geom_histogram(stat='count')
q2 <- ggplot(data=full, aes(x=as.factor(ExterQual))) +
        geom_histogram(stat='count')
q3 <- ggplot(data=full, aes(x=as.factor(BsmtQual))) +
        geom_histogram(stat='count')
q4 <- ggplot(data=full, aes(x=as.factor(KitchenQual))) +
        geom_histogram(stat='count')
q5 <- ggplot(data=full, aes(x=as.factor(GarageQual))) +
        geom_histogram(stat='count')
q6 <- ggplot(data=full, aes(x=as.factor(FireplaceQu))) +
        geom_histogram(stat='count')
q7 <- ggplot(data=full, aes(x=as.factor(PoolQC))) +
        geom_histogram(stat='count')

layout <- matrix(c(1,2,8,3,4,8,5,6,7),3,3,byrow=TRUE)
multiplot(q1, q2, q3, q4, q5, q6, q7, layout=layout)

OverallQual es muy importante y también mejor distribuida que las otras variables. ExterQual también es importante, pero tiene una alta correlación con OverallQual (0,73). KitchenQual también parece una variable para mantener, ya que todas las casas tienen una cocina y hay una variabilidad útil en la gráfica. GarageQual no parece distinguir mucho, ya que la mayoría de los garajes tienen Q3. FireplaceQu está en la lista de altas correlaciones y en la lista de variables importantes. PoolQC es muy escaso (los 13 grupos ni siquiera se pueden ver en esta escala). Trataré de crear una variable ‘has pool’ más adelante.

3.2.3 MSSubClass. La segunda variable categórica más importante

El primer gráfico muestra la mediana de SalePrice según MSSubClass La frecuencia de cada MSSubClass se corresponde con el número de viviendas en el conjunto de entrenamiento.

El segundo gráfico representa la distribución de frecuencias para el conjunto entero.

ms1 <- ggplot(full[!is.na(full$SalePrice),], aes(x=MSSubClass, y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
ms2 <- ggplot(data=full, aes(x=MSSubClass)) +
        geom_histogram(stat='count')+
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(ms1, ms2)

3.2.4 Garage variables

Varias variables del grupo Garage tienen una alta correlación con SalePrice, y también están en la lista de las 20 principales del Random Forest anterior. Sin embargo, hay una multicolinealidad entre ellos y creo que 7 variables de garaje son demasiadas de todos modos. Puede que algo como 3 variables debería ser suficiente (posiblemente GarageCars, GarageType y una medición de calidad), pero antes de hacer cualquier selección, las visualizo todas en esta sección.

#corregir un error en GarageYrBlt
full$GarageYrBlt[2593] <- 2007 #debe ser un error de entrada GarageYrBlt=2207, YearBuilt=2006, YearRemodAdd=2007.

g1 <- ggplot(data=full[full$GarageCars !=0,], aes(x=GarageYrBlt)) +
        geom_histogram()
g2 <- ggplot(data=full, aes(x=as.factor(GarageCars))) +
        geom_histogram(stat='count')
g3 <- ggplot(data= full, aes(x=GarageArea)) +
        geom_density()
g4 <- ggplot(data=full, aes(x=as.factor(GarageCond))) +
        geom_histogram(stat='count')
g5 <- ggplot(data=full, aes(x=GarageType)) +
        geom_histogram(stat='count')
g6 <- ggplot(data=full, aes(x=as.factor(GarageQual))) +
        geom_histogram(stat='count')
g7 <- ggplot(data=full, aes(x=as.factor(GarageFinish))) +
        geom_histogram(stat='count')

layout <- matrix(c(1,5,5,2,3,8,6,4,7),3,3,byrow=TRUE)
multiplot(g1, g2, g3, g4, g5, g6, g7, layout=layout)

GarageCars y GarageArea están altamente correladas. Además, GarageQual and GarageCond también parecen muy correlacionadas, con gran presencia del nivel=3.

3.2.5 Basement variables

Al igual que las variables de garaje, las variables del sótano múltiple son importantes en la matriz de correlaciones y en la lista de los 20 principales predictores de RF. Sin embargo, 11 variables del sótano parecen una exageración. Antes de decidir qué voy a hacer con ellos, estoy visualizando 8 de ellos a continuación. Las 2 variables “Bathroom” se tratan en Feature Engineering (siguiente sección).

b1 <- ggplot(data=full, aes(x=BsmtFinSF1)) +
        geom_histogram() + labs(x='Type 1 finished square feet')
b2 <- ggplot(data=full, aes(x=BsmtFinSF2)) +
        geom_histogram()+ labs(x='Type 2 finished square feet')
b3 <- ggplot(data=full, aes(x=BsmtUnfSF)) +
        geom_histogram()+ labs(x='Unfinished square feet')
b4 <- ggplot(data=full, aes(x=as.factor(BsmtFinType1))) +
        geom_histogram(stat='count')+ labs(x='Rating of Type 1 finished area')
b5 <- ggplot(data=full, aes(x=as.factor(BsmtFinType2))) +
        geom_histogram(stat='count')+ labs(x='Rating of Type 2 finished area')
b6 <- ggplot(data=full, aes(x=as.factor(BsmtQual))) +
        geom_histogram(stat='count')+ labs(x='Height of the basement')
b7 <- ggplot(data=full, aes(x=as.factor(BsmtCond))) +
        geom_histogram(stat='count')+ labs(x='Rating of general condition')
b8 <- ggplot(data=full, aes(x=as.factor(BsmtExposure))) +
        geom_histogram(stat='count')+ labs(x='Walkout or garden level walls')

layout <- matrix(c(1,2,3,4,5,9,6,7,8),3,3,byrow=TRUE)
multiplot(b1, b2, b3, b4, b5, b6, b7, b8, layout=layout)

4 Feature engineering

4.1 Número total de baños

Hay 4 variables referidas a los baños. Individualmente no son muy importantes, sin embargo, asumimos que sumando estos predictores obtendremos otro más relevante.

full$TotBathrooms <- full$FullBath + (full$HalfBath*0.5) + full$BsmtFullBath + (full$BsmtHalfBath*0.5) #variables con 'half bathroom' ponderan la mitad

Como se observa en el primer gráfico parece haber una relación positiva entre el total de baños y el precio de venta (es de 0.63). El segundo gráfico representa el histograma de TotBathrooms.

tb1 <- ggplot(data=full[!is.na(full$SalePrice),], aes(x=as.factor(TotBathrooms), y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
tb2 <- ggplot(data=full, aes(x=as.factor(TotBathrooms))) +
        geom_histogram(stat='count')
grid.arrange(tb1, tb2)

4.2 Añadiendo ‘Age’, ‘Remodeled (Yes/No)’ y variable IsNew

Hay tres variables relacionadas con la edad de la vivienda: YearBlt, YearRemodAdd, and YearSold. YearRemodAdd coincide con YearBuilt si no ha habido remodelación. Usaré YearRemodeled y YearSold para determinar Age. Sin embargo, como partes de construcciones antiguas siempre permanecerán y solo partes de la casa podrían haber sido renovadas, también presentaré una variable remodelada Sí / No. Esto debería verse como algún tipo de parámetro de penalización que indica que si Age se basa en una fecha de remodelación, probablemente valga menos que las casas que se construyeron desde cero ese mismo año.

full$Remod <- ifelse(full$YearBuilt==full$YearRemodAdd, 0, 1) #0=No Remodeling, 1=Remodeling
full$Age <- as.numeric(full$YrSold)-full$YearRemodAdd

ggplot(data=full[!is.na(full$SalePrice),], aes(x=Age, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

El gráfico muestra una relación negativa entre edad y precio de venta (como era de esperar).

cor(full$SalePrice[!is.na(full$SalePrice)], full$Age[!is.na(full$SalePrice)])

## [1] -0.5090787

Además, observamos que como suponíamos las casas remodeladas tiene menor precio.

ggplot(full[!is.na(full$SalePrice),], aes(x=as.factor(Remod), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        theme_grey(base_size = 18) +
        geom_hline(yintercept=median((full$SalePrice)[!is.na(full$SalePrice)]), linetype="dashed") #linea punteada representa el precio mediano global

Finalmente, construimos la variable IsNew (dicotómica). En conjunto hay 116 viviendas nuevas que se han vendido.

full$IsNew <- ifelse(full$YrSold==full$YearBuilt, 1, 0)
table(full$IsNew)

## 
##    0    1 
## 2803  116

Estas 116 casas nuevas se distribuyen de manera bastante uniforme entre los conjuntos de entrenamiento y prueba, y como puede ver, las casas nuevas valen mucho más en promedio.

ggplot(full[!is.na(full$SalePrice),], aes(x=as.factor(IsNew), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=6) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        theme_grey(base_size = 18) +
        geom_hline(yintercept=median((full$SalePrice)[!is.na(full$SalePrice)]), linetype="dashed")

full$YrSold <- as.factor(full$YrSold) #la convertimos en factor una vez que ya ha sido usada

4.3 Binning Neighborhood

nb1 <- ggplot(full[!is.na(full$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') + labs(x='Neighborhood', y='Median SalePrice') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=median((full$SalePrice)[!is.na(full$SalePrice)]), linetype="dashed", color = "red")

nb2 <- ggplot(full[!is.na(full$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=mean), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "mean", fill='blue') + labs(x='Neighborhood', y="Mean SalePrice") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=median((full$SalePrice)[!is.na(full$SalePrice)]), linetype="dashed", color = "red") 

grid.arrange(nb1, nb2)

La variable Neighborhood presenta demasiadas modalidades. Atendiendo a los gráficos de medianas y medias, decidimos construir una nueva variable que sustituya a Neighborhood. La llamamos NeighRich y presenta tres modalidades ordenando los barrios en pobres=0, normales=1 y ricos=2.

full$NeighRich[full$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
full$NeighRich[!full$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
full$NeighRich[full$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0

table(full$NeighRich)

## 
##    0    1    2 
##  160 2471  288

4.4 Total Square Feet

Calculamos la superficie total de la vivienda.

full$TotalSqFeet <- full$GrLivArea + full$TotalBsmtSF

ggplot(data=full[!is.na(full$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_text_repel(aes(label = ifelse(full$GrLivArea[!is.na(full$SalePrice)]>4500, rownames(full), '')))

As expected, the correlation with SalePrice is very strong indeed (0.78).

Si eliminamos los dos “outliers” la correlación aumenta un 5%.

cor(full$SalePrice[-c(524, 1299)], full$TotalSqFeet[-c(524, 1299)], use= "pairwise.complete.obs")

## [1] 0.829042

4.5 Consolidando Porch variables

Son 6 las variables relacionadas con ‘Porch’:

WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet

En función de la naturaleza de estas variables, vamos a dejar sola a WoodDeckSF y agrupar las otras cuatro.

full$TotalPorchSF <- full$OpenPorchSF + full$EnclosedPorch + full$X3SsnPorch + full$ScreenPorch

La correlación obtenida no parece demasiado alta.

cor(full$SalePrice, full$TotalPorchSF, use= "pairwise.complete.obs")

## [1] 0.1957389

ggplot(data=full[!is.na(full$SalePrice),], aes(x=TotalPorchSF, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

5 Preparando los datos para el modelado

5.1 Tratando con la multicolinealidad

Elegimos las variables que presentan colinealidad con otros predictores, elegimos aquellos menos relacionados con la variable objetivo. Para ello usamos la matriz de correlaciones mostrada más arriba.

dropVars <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'TotalBsmtSF', 'TotalRmsAbvGrd', 'BsmtFinSF1')

full <- full[,!(names(full) %in% dropVars)]

5.2 Tratando con los casos atípicos

Eliminamos las viviendas que hemos encontrado con precios de venta anormalmente bajos y superficies muy grandes.

full <- full[-c(524, 1299),]

5.3 Preprocesamiento de los predictores

Antes de modelar, necesitamos centrar y escalar las variables realmente numéricas (no las ordinales codificadas como integer) y crear variables ficticias para los predictores categóricos.

numericVarNames <- numericVarNames[!(numericVarNames %in% c('MSSubClass', 'MoSold', 'YrSold', 'SalePrice', 'OverallQual', 'OverallCond'))] 
numericVarNames <- append(numericVarNames, c('Age', 'TotalPorchSF', 'TotBathrooms', 'TotalSqFeet'))

DFnumeric <- full[, names(full) %in% numericVarNames]

DFfactors <- full[, !(names(full) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']

cat('There are', length(DFnumeric), 'numeric variables, and', length(DFfactors), 'factor variables')

## There are 30 numeric variables, and 49 factor variables

5.3.1 Asimetría y normalización de predictores numéricos

Asimetría

Para corregir la asimetría, vamos a aplicar la transformación logarítmica a las varibles con asímetría (izquierda o derecha) mayor a 0.8.

for(i in 1:ncol(DFnumeric)){
        if (abs(skew(DFnumeric[,i]))>0.8){
                DFnumeric[,i] <- log(DFnumeric[,i] +1)
        }
}

Normalizando los datos

PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)

## Created from 2917 samples and 30 variables
## 
## Pre-processing:
##   - centered (30)
##   - ignored (0)
##   - scaled (30)

DFnorm <- predict(PreNum, DFnumeric)
dim(DFnorm)

## [1] 2917   30

5.3.2 One-hot-encoding para las categóricas

Dicotomizamos las variables categóricas.

DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)

## [1] 2917  201

5.3.3 Eliminar los niveles con ninguna o pocas observaciones

#comprobamos si hay valores ausentes en el conjunto de prueba
ZerocolTest <- which(colSums(DFdummies[(nrow(full[!is.na(full$SalePrice),])+1):nrow(full),])==0)
colnames(DFdummies[ZerocolTest])

##  [1] "Condition2RRAe"     "Condition2RRAn"     "Condition2RRNn"    
##  [4] "HouseStyle2.5Fin"   "RoofMatlMembran"    "RoofMatlMetal"     
##  [7] "RoofMatlRoll"       "Exterior1stImStucc" "Exterior1stStone"  
## [10] "Exterior2ndOther"   "HeatingOthW"        "ElectricalMix"     
## [13] "MiscFeatureTenC"

DFdummies <- DFdummies[,-ZerocolTest] #eliminar predictores

#comprobamos si hay valores ausentes en el conjunto de entrenamiento
ZerocolTrain <- which(colSums(DFdummies[1:nrow(full[!is.na(full$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])

## [1] "MSSubClass1,5 story PUD all"

DFdummies <- DFdummies[,-ZerocolTrain] #eliminar predictores

También eliminamos variables con menos de diez “unos” en el conjunto de entrenamiento.

fewOnes <- which(colSums(DFdummies[1:nrow(full[!is.na(full$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])

##  [1] "MSSubClass1 story unf attic" "LotConfigFR3"               
##  [3] "NeighborhoodBlueste"         "NeighborhoodNPkVill"        
##  [5] "Condition1PosA"              "Condition1RRNe"             
##  [7] "Condition1RRNn"              "Condition2Feedr"            
##  [9] "Condition2PosA"              "Condition2PosN"             
## [11] "RoofStyleMansard"            "RoofStyleShed"              
## [13] "RoofMatlWdShake"             "RoofMatlWdShngl"            
## [15] "Exterior1stAsphShn"          "Exterior1stBrkComm"         
## [17] "Exterior1stCBlock"           "Exterior2ndAsphShn"         
## [19] "Exterior2ndBrk Cmn"          "Exterior2ndCBlock"          
## [21] "Exterior2ndStone"            "FoundationStone"            
## [23] "FoundationWood"              "HeatingGrav"                
## [25] "HeatingWall"                 "ElectricalFuseP"            
## [27] "GarageTypeCarPort"           "MiscFeatureOthr"            
## [29] "SaleTypeCon"                 "SaleTypeConLD"              
## [31] "SaleTypeConLI"               "SaleTypeConLw"              
## [33] "SaleTypeCWD"                 "SaleTypeOth"                
## [35] "SaleConditionAdjLand"

DFdummies <- DFdummies[,-fewOnes] #eliminar predictores
dim(DFdummies)

## [1] 2917  152

combined <- cbind(DFnorm, DFdummies) #obtenemos un dataframe numérico

5.4 Tratando la asimetría de la variable respuesta

skew(full$SalePrice)

## [1] 1.877427

qqnorm(full$SalePrice)
qqline(full$SalePrice)

El coeficiente de asimetría de 1.87 indica asimetría a la derecha muy elevada y el gráfico Q-Q muestra que SalePrice no está normalamente distribuida. Aplicamos logaritmos:

full$SalePrice <- log(full$SalePrice) #no hay "ceros" aplicamos log()
skew(full$SalePrice)

## [1] 0.1213182

qqnorm(full$SalePrice)
qqline(full$SalePrice)

5.5 Obtener conjunto combinado de entrenamiento y prueba

train1 <- combined[!is.na(full$SalePrice),]
test1 <- combined[is.na(full$SalePrice),]

6 Modelado

6.1 Lasso regression model

Usamos un modelo de tipo “Lasso regression” ya que es adecuado para predecir el valor de una variable cuantitiva en presencia de predictores muy correlacionados. Si un grupo de regresores presentan mucha multicolinealidad esta técnica tiende a escoger uno de los coeficientes y descartar los demás.

Usaremos validación cruzada para encontrar el mejor valor del parámetro lambda, que es el único hiperparámetro que se necesita fijar para el modelo lasso.

set.seed(27042018)
my_control <-trainControl(method="cv", number=5)
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.001,0.1,by = 0.0005))

lasso_mod <- train(x=train1, y=full$SalePrice[!is.na(full$SalePrice)], method='glmnet', trControl= my_control, tuneGrid=lassoGrid) 
lasso_mod$bestTune

##   alpha lambda
## 4     1 0.0025

min(lasso_mod$results$RMSE)

## [1] 0.1121579

LassoPred <- predict(lasso_mod, test1)
predictions_lasso <- exp(LassoPred) #need to reverse the log to the real values
head(predictions_lasso)

##     1461     1462     1463     1464     1465     1466 
## 114351.8 162204.8 179455.3 197564.7 205952.8 169839.8

submission <- data.frame(Id = test_labels, SalePrice = predictions_lasso)
head(submission)

##        Id SalePrice
## 1461 1461  114351.8
## 1462 1462  162204.8
## 1463 1463  179455.3
## 1464 1464  197564.7
## 1465 1465  205952.8
## 1466 1466  169839.8

write.csv(submission, file = 'submission.csv', row.names = F)

Información de la sesión

sessionInfo()

## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
## [5] LC_TIME=Spanish_Spain.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] psych_1.8.12        randomForest_4.6-14 ggrepel_0.8.1      
##  [4] Rmisc_1.5           scales_1.0.0        gridExtra_2.3      
##  [7] caret_6.0-84        lattice_0.20-38     corrplot_0.84      
## [10] dplyr_0.8.3         plyr_1.8.4          ggplot2_3.2.1      
## [13] knitr_1.24         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         lubridate_1.7.4    class_7.3-15      
##  [4] glmnet_2.0-18      assertthat_0.2.1   zeallot_0.1.0     
##  [7] digest_0.6.20      ipred_0.9-9        foreach_1.4.7     
## [10] utf8_1.1.4         R6_2.4.0           backports_1.1.4   
## [13] stats4_3.5.3       evaluate_0.14      highr_0.8         
## [16] pillar_1.4.2       rlang_0.4.0        lazyeval_0.2.2    
## [19] data.table_1.12.2  rpart_4.1-15       Matrix_1.2-15     
## [22] rmarkdown_1.15     labeling_0.3       splines_3.5.3     
## [25] gower_0.2.1        stringr_1.4.0.9000 foreign_0.8-71    
## [28] munsell_0.5.0      compiler_3.5.3     xfun_0.9          
## [31] pkgconfig_2.0.2    mnormt_1.5-5       htmltools_0.3.6   
## [34] nnet_7.3-12        tidyselect_0.2.5   tibble_2.1.3      
## [37] prodlim_2018.04.18 codetools_0.2-16   fansi_0.4.0       
## [40] crayon_1.3.4       withr_2.1.2        MASS_7.3-51.1     
## [43] recipes_0.1.7      ModelMetrics_1.2.2 grid_3.5.3        
## [46] nlme_3.1-137       gtable_0.3.0       magrittr_1.5      
## [49] cli_1.1.0          stringi_1.4.3      reshape2_1.4.3    
## [52] timeDate_3043.102  generics_0.0.2     vctrs_0.2.0       
## [55] lava_1.6.6         iterators_1.0.12   tools_3.5.3       
## [58] glue_1.3.1         purrr_0.3.2        parallel_3.5.3    
## [61] survival_2.43-3    yaml_2.2.0         colorspace_1.4-1

Sys.getenv()

## ALLUSERSPROFILE       C:\ProgramData
## APPDATA               C:\Users\ignac\AppData\Roaming
## CLICOLOR_FORCE        1
## CommonProgramFiles    C:\Program Files\Common Files
## CommonProgramFiles(x86)
##                       C:\Program Files (x86)\Common Files
## CommonProgramW6432    C:\Program Files\Common Files
## COMPUTERNAME          DESKTOP-NR3SJNM
## ComSpec               C:\WINDOWS\system32\cmd.exe
## DISPLAY               :0
## DriverData            C:\Windows\System32\Drivers\DriverData
## FPS_BROWSER_APP_PROFILE_STRING
##                       Internet Explorer
## FPS_BROWSER_USER_PROFILE_STRING
##                       Default
## GFORTRAN_STDERR_UNIT
##                       -1
## GFORTRAN_STDOUT_UNIT
##                       -1
## GIT_ASKPASS           rpostback-askpass
## HOME                  C:/Users/ignac/OneDrive/Documents
## HOMEDRIVE             C:
## HOMEPATH              \Users\ignac
## LOCALAPPDATA          C:\Users\ignac\AppData\Local
## LOGONSERVER           \\DESKTOP-NR3SJNM
## MSYS2_ENV_CONV_EXCL   R_ARCH
## NOT_CRAN              true
## NUMBER_OF_PROCESSORS
##                       12
## OneDrive              C:\Users\ignac\OneDrive
## OneDriveConsumer      C:\Users\ignac\OneDrive
## OS                    Windows_NT
## PATH                  C:\Program
##                       Files\R\R-3.5.3\bin\x64;C:\Program Files
##                       (x86)\Common
##                       Files\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program
##                       Files (x86)\NVIDIA
##                       Corporation\PhysX\Common;C:\Program
##                       Files\NVIDIA Corporation\NVIDIA
##                       NvDLISR;C:\Program
##                       Files\AutoFirma\AutoFirma;C:\Program Files
##                       (x86)\Microsoft SQL
##                       Server\150\DTS\Binn\;C:\HashiCorp\Vagrant\bin;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;%SYSTEMROOT%\System32\WindowsPowerShell\v1.0\;%SYSTEMROOT%\System32\OpenSSH\;C:\Users\ignac\AppData\Local\Microsoft\WindowsApps;
## PATHEXT               .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
## PROCESSOR_ARCHITECTURE
##                       AMD64
## PROCESSOR_IDENTIFIER
##                       AMD64 Family 23 Model 8 Stepping 2,
##                       AuthenticAMD
## PROCESSOR_LEVEL       23
## PROCESSOR_REVISION    0802
## ProgramData           C:\ProgramData
## ProgramFiles          C:\Program Files
## ProgramFiles(x86)     C:\Program Files (x86)
## ProgramW6432          C:\Program Files
## PSModulePath          C:\Program
##                       Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
## PUBLIC                C:\Users\Public
## R_ARCH                /x64
## R_COMPILED_BY         gcc 4.9.3
## R_DOC_DIR             C:/PROGRA~1/R/R-35~1.3/doc
## R_HOME                C:/PROGRA~1/R/R-35~1.3
## R_LIBS                C:/Users/ignac/OneDrive/Documents/R/win-library/3.5;C:/Program
##                       Files/R/R-3.5.3/library
## R_LIBS_USER           C:/Users/ignac/OneDrive/Documents/R/win-library/3.5
## R_USER                C:/Users/ignac/OneDrive/Documents
## RMARKDOWN_MATHJAX_PATH
##                       C:/Program
##                       Files/RStudio/resources/mathjax-26
## RMARKDOWN_PREVIEW_DIR
##                       C:\Users\ignac\AppData\Local\Temp\RtmpuKHkHh
## RS_LOCAL_PEER         \\.\pipe\10485-rsession
## RS_RPOSTBACK_PATH     C:/Program Files/RStudio/bin/rpostback
## RS_SHARED_SECRET      63341846741
## RSTUDIO               1
## RSTUDIO_CONSOLE_COLOR
##                       256
## RSTUDIO_CONSOLE_WIDTH
##                       84
## RSTUDIO_MSYS_SSH      C:/Program
##                       Files/RStudio/bin/msys-ssh-1000-18
## RSTUDIO_PANDOC        C:/Program Files/RStudio/bin/pandoc
## RSTUDIO_SESSION_PORT
##                       10485
## RSTUDIO_USER_IDENTITY
##                       ignac
## RSTUDIO_WINUTILS      C:/Program Files/RStudio/bin/winutils
## SESSIONNAME           Console
## SSH_ASKPASS           rpostback-askpass
## SystemDrive           C:
## SystemRoot            C:\WINDOWS
## TEMP                  C:\Users\ignac\AppData\Local\Temp
## TERM                  xterm-256color
## TMP                   C:\Users\ignac\AppData\Local\Temp
## TZDIR                 C:/PROGRA~1/R/R-35~1.3/share/zoneinfo
## USERDOMAIN            DESKTOP-NR3SJNM
## USERDOMAIN_ROAMINGPROFILE
##                       DESKTOP-NR3SJNM
## USERNAME              ignac
## USERPROFILE           C:\Users\ignac
## VBOX_MSI_INSTALL_PATH
##                       C:\Program Files\Oracle\VirtualBox\
## windir                C:\WINDOWS

Sys.info()

##           sysname           release           version          nodename 
##         "Windows"          "10 x64"     "build 18362" "DESKTOP-NR3SJNM" 
##           machine             login              user    effective_user 
##          "x86-64"           "ignac"           "ignac"           "ignac"

Predicción del Precio de la Vivienda: Modelos de Regresión Avanzados

J. Ignacio Nicolás Serna (jnicolas86@alumno.uned.es)

16 de octubre de 2019

1 Habilitar librerías y cargar los datos en R

1.1 Dimensiones y estructura del conjunto de datos

2 Limpieza y preprocesamiento de los datos

2.1 Valores perdidos

2.1.1 Pool variables

2.1.2 Miscellaneous Feature

2.1.3 Alley

2.1.4 Fence

2.1.5 Fireplace variables

2.1.6 Lot variables

2.1.7 Garage variables

2.1.8 Basement Variables

2.1.9 Masonry variables

2.1.10 MS Zoning

2.1.11 Kitchen variables

2.1.12 Utilities

2.1.13 Home functionality

2.1.14 Exterior variables

2.1.15 Electrical system

2.1.16 Sale Type and Condition

2.2 Recodificación/factorización de las variables categóricas restantes

2.3 Transformación a factor de algunas variables numéricas

2.3.1 Year and Month Sold

2.3.2 MSSubClass

3 Visualización de variables

3.1 Correlaciones

3.2 Random Forest para determinar las variables más importantes

3.2.1 Neighborhood. La variable categórica más importante

3.2.2 Overall Quality y otras variables de calidad

3.2.3 MSSubClass. La segunda variable categórica más importante

3.2.4 Garage variables

3.2.5 Basement variables

4 Feature engineering

4.1 Número total de baños

4.2 Añadiendo ‘Age’, ‘Remodeled (Yes/No)’ y variable IsNew

4.3 Binning Neighborhood

4.4 Total Square Feet

4.5 Consolidando Porch variables

5 Preparando los datos para el modelado

5.1 Tratando con la multicolinealidad

5.2 Tratando con los casos atípicos

5.3 Preprocesamiento de los predictores

5.3.1 Asimetría y normalización de predictores numéricos

5.3.2 One-hot-encoding para las categóricas

5.3.3 Eliminar los niveles con ninguna o pocas observaciones

5.4 Tratando la asimetría de la variable respuesta

5.5 Obtener conjunto combinado de entrenamiento y prueba

6 Modelado

6.1 Lasso regression model