Dissecting NA Values Linear Regression Model

In this post, I will be discussing how to deal with NA values in results of lm() function. Dataset I will be using is House Prices: Advanced Regression Techniques from Kaggle website.

To keep the discussion relevant, I will be using four columns TotalBsmtSF, BsmtQual, BsmtExposure and SalePrice.

Load Dataset

#options(scipen=999)
library(tidyverse)
library(DMwR)
library(knitr)    # Report display, table format
library(kableExtra)
library(VIM)

house.data.df <- read.csv("https://raw.githubusercontent.com/akulapa/DATA621-Blog3/master/train.csv", header= TRUE, stringsAsFactors = F)

Based on the description of each variable provided by Kaggle, both BsmtQual and BsmtExposure are categorical variables and TotalBsmtSF and SalePrice are continous variables. We will add additional category NB(no basement) for observations where TotalBsmtSF values is zero

house.df <- house.data.df %>% select(TotalBsmtSF, BsmtQual, BsmtExposure)

house.df$BsmtQual <- as.character(house.df$BsmtQual)
house.df$BsmtQual <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtQual) #NB - No Basement
house.df$BsmtQual <- factor(house.df$BsmtQual)

house.df$BsmtExposure <- as.character(house.df$BsmtExposure)
house.df$BsmtExposure <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtExposure) #NB - No Basement
house.df$BsmtExposure <- factor(house.df$BsmtExposure)

#Impute missing vaues using kNN
house.knn <- knnImputation(house.df, 11, meth='weighAvg')

#Apply SalePrice
house.knn <- cbind(Id = house.data.df$Id, house.knn, SalePrice = house.data.df$SalePrice)

house.knn %>% head(10) %>% 
    kable("html",caption = "Sample Data Set", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

Sample Data Set
Id	TotalBsmtSF	BsmtQual	BsmtExposure	SalePrice
1	856	Gd	No	208500
2	1262	Gd	Gd	181500
3	920	Gd	Mn	223500
4	756	TA	No	140000
5	1145	Gd	Av	250000
6	796	Gd	No	143000
7	1686	Ex	Av	307000
8	1107	Gd	Mn	200000
9	952	TA	No	129900
10	991	TA	No	118000

#Categories of Basement quality
levels(house.df$BsmtQual)

## [1] "Ex" "Fa" "Gd" "NB" "TA"

Description of each category,

Ex - Excellent (100+ inches) -
Gd - Good (90-99 inches)
TA - Typical (80-89 inches)
Fa - Fair (70-79 inches)
Po - Poor (<70 inches
NB - No Basement

#Categories of Basement exposure
levels(house.df$BsmtExposure)

## [1] "Av" "Gd" "Mn" "NB" "No"

Description of each category,

Gd - Good Exposure
Av - Average Exposure (split levels or foyers typically score average or above) -
Mn - Mimimum Exposure
No - No Exposure
NB - No Basement

Let’s build linear regression model,

house.lm1 <- lm(SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure, data = house.knn)
summary(house.lm1)

## 
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure, 
##     data = house.knn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureNB         NA         NA      NA       NA    
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

The model suggests each variable is significant to the model. F-Statistic greater than zero indicates current model is better than null model. However, there is something odd about the summary. There are two lines,

## Coefficients: (1 not defined because of singularities)

## BsmtExposureNB         NA         NA      NA       NA

Let’s explore to see if our dataset has any NA values,

#Let list of missing values 'NA'
aggr_plot <- aggr(house.knn, numbers=F, sortVars=F, labels=names(house.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"), plot=T)

Plot suggests there are no missing values in the dataset.

What Are `NA` Values In The Summary?

There are 37 observations without basement based on BsmtQual variable. And same is true for BsmtExposure. Both variables are linearly dependent on TotalBsmtSF.

table(house.knn$BsmtQual)

## 
##  Ex  Fa  Gd  NB  TA 
## 121  35 618  37 649

table(house.knn$BsmtExposure)

## 
##  Av  Gd  Mn  NB  No 
## 221 134 115  37 953

As we have seen from lm summary model internally converts each categorical value into its own column, then BsmtQualNB and BsmtExposureNB will have same exact matches, suggesting one column is sufficient to understand variance in SalePrice. One way to solve the issue is by removing one of the variables either BsmtQual or BsmtExposure from the dataset. In this case, the dataset is small, and we were able to manually identify the condition, how do we resolve this condition in larger datasets?

Unfortunately, we cannot emilinate observations from the dataset because of this condition.

Possible Solution

Let’s manually convert categorical values into their own columns. Mark them 1 if data meets the condition and with 0 if it fails. In other words, we will converting dataset into wider form, converting rows to columns.

#Get Column types
colTypes <- sapply(house.knn, class) %>% data.frame(stringsAsFactors = F)
colnames(colTypes) = c("colType")
colTypes <- tibble::rownames_to_column(colTypes, "colNames")

#Get Factor columns
colFactorNames <- colTypes %>% dplyr::filter(colType=="factor")
colFactorNames$id = seq(1:nrow(colFactorNames))

house.knn.wide <- house.knn

#Loop through each factor column and convert into wide form
for(i in seq(1:max(colFactorNames$id))){
  colName <- colFactorNames %>% filter(id == i) %>% select(colNames)
  
  facName <- colName$colNames
  
  strCmd <- paste0("house.knn.wide %>% gather(", facName, ",name,starts_with('", facName)
  strCmd <- paste0(strCmd, "')) %>% mutate(", facName, "= paste0(", facName, ",name)) %>%")
  strCmd <- paste0(strCmd, "mutate(present = 1) %>% select(-name) %>% spread(", facName, ",present,fill = 0)")

  house.knn.wide <- eval(parse(text=strCmd))
  
  strCmd <- paste0("house.knn.wide %>% select(starts_with('", facName)
  strCmd <- paste0(strCmd, "')) %>% names() %>% data.frame(stringsAsFactors = F) %>% mutate(ActualCol ='", facName, "')")
  
  #Save converted name and actual name mapping
  if(i==1){
    ActualNames.df <- eval(parse(text=strCmd))
    names(ActualNames.df) <- c("ConvertedName","ActualName")
  } else {
    tmp <- eval(parse(text=strCmd))
    names(tmp) <- c("ConvertedName","ActualName")
    ActualNames.df <- rbind(ActualNames.df, tmp)
  }
}

Columns BsmtQual and BsmtExposure are expanded to,

ActualNames.df %>% 
    kable("html",caption = "Column Conversions", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

Column Conversions
ConvertedName	ActualName
BsmtQualEx	BsmtQual
BsmtQualFa	BsmtQual
BsmtQualGd	BsmtQual
BsmtQualNB	BsmtQual
BsmtQualTA	BsmtQual
BsmtExposureAv	BsmtExposure
BsmtExposureGd	BsmtExposure
BsmtExposureMn	BsmtExposure
BsmtExposureNB	BsmtExposure
BsmtExposureNo	BsmtExposure

Wide dataset looks as,

house.knn.wide %>% head(10) %>% 
    kable("html",caption = "Sample Data", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
  scroll_box(width = "100%", height = "200px")

Sample Data
Id	TotalBsmtSF	SalePrice	BsmtQualEx	BsmtQualGd	BsmtQualTA	BsmtExposureAv	BsmtExposureGd	BsmtExposureMn	BsmtExposureNo
1	856	208500	0	1	0	0	0	0	1
2	1262	181500	0	1	0	0	1	0	0
3	920	223500	0	1	0	0	0	1	0
4	756	140000	0	0	1	0	0	0	1
5	1145	250000	0	1	0	1	0	0	0
6	796	143000	0	1	0	0	0	0	1
7	1686	307000	1	0	0	1	0	0	0
8	1107	200000	0	1	0	0	0	1	0
9	952	129900	0	0	1	0	0	0	1
10	991	118000	0	0	1	0	0	0	1

Let’s see if columns BsmtQualNB and BsmtExposureNB in newly created dataset matches 100%. All 1460 rows match.

house.knn.wide %>% select(BsmtQualNB, BsmtExposureNB) %>% filter(BsmtQualNB == BsmtExposureNB) %>% nrow()

## [1] 1460

Now let’s build a linear regression model

house.lm2 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB + BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF, data = house.knn.wide)

summary(house.lm2)

## 
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + 
##     BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB + 
##     BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF, 
##     data = house.knn.wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients: (3 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtQualEx             NA         NA      NA       NA    
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## BsmtExposureNB         NA         NA      NA       NA    
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## BsmtExposureAv         NA         NA      NA       NA    
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

Now, our model summary has more than one variable that has coefficients as NA values.

Between BsmtQualNB and BsmtExposureNB, we only need one of them, and model chose BsmtQualNB .
Rule of thumb while creating dummy variables, we only need n - 1, but we created n columns. Hence, between BsmtQualTA, BsmtQualNB, BsmtQualGd, BsmtQualFa and BsmtQualEx model chose to eliminate BsmtQualEx .
Same thing between BsmtExposureNo, BsmtExposureNB, BsmtExposureMn, BsmtExposureGd and BsmtExposureAv, model already eliminated BsmtExposureNB, it still needs to eliminate one more column hence BsmtExposureAv was chosen.

Since each column is independent, we can remove them model. However, all observations are still part of the model.

house.lm3 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd + TotalBsmtSF, data = house.knn.wide)

summary(house.lm3)

## 
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + 
##     BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd + 
##     TotalBsmtSF, data = house.knn.wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

Finally, you have a clean model. One observation coefficients and standard errors did not change from model to model.

Dissecting NA Values Linear Regression Model

Pavan Akula

May 19, 2018

Load Dataset

What Are `NA` Values In The Summary?

Possible Solution

References

Dissecting NA Values Linear Regression Model

Pavan Akula

May 19, 2018

Load Dataset

What Are NA Values In The Summary?

Possible Solution

References

What Are `NA` Values In The Summary?