In this post, I will be discussing how to deal with NA values in results of lm() function. Dataset I will be using is House Prices: Advanced Regression Techniques from Kaggle website.

To keep the discussion relevant, I will be using four columns TotalBsmtSF, BsmtQual, BsmtExposure and SalePrice.

Load Dataset
#options(scipen=999)
library(tidyverse)
library(DMwR)
library(knitr)    # Report display, table format
library(kableExtra)
library(VIM)

house.data.df <- read.csv("https://raw.githubusercontent.com/akulapa/DATA621-Blog3/master/train.csv", header= TRUE, stringsAsFactors = F)

Based on the description of each variable provided by Kaggle, both BsmtQual and BsmtExposure are categorical variables and TotalBsmtSF and SalePrice are continous variables. We will add additional category NB(no basement) for observations where TotalBsmtSF values is zero

house.df <- house.data.df %>% select(TotalBsmtSF, BsmtQual, BsmtExposure)

house.df$BsmtQual <- as.character(house.df$BsmtQual)
house.df$BsmtQual <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtQual) #NB - No Basement
house.df$BsmtQual <- factor(house.df$BsmtQual)

house.df$BsmtExposure <- as.character(house.df$BsmtExposure)
house.df$BsmtExposure <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtExposure) #NB - No Basement
house.df$BsmtExposure <- factor(house.df$BsmtExposure)

#Impute missing vaues using kNN
house.knn <- knnImputation(house.df, 11, meth='weighAvg')

#Apply SalePrice
house.knn <- cbind(Id = house.data.df$Id, house.knn, SalePrice = house.data.df$SalePrice)

house.knn %>% head(10) %>% 
    kable("html",caption = "Sample Data Set", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Sample Data Set
Id TotalBsmtSF BsmtQual BsmtExposure SalePrice
1 856 Gd No 208500
2 1262 Gd Gd 181500
3 920 Gd Mn 223500
4 756 TA No 140000
5 1145 Gd Av 250000
6 796 Gd No 143000
7 1686 Ex Av 307000
8 1107 Gd Mn 200000
9 952 TA No 129900
10 991 TA No 118000
#Categories of Basement quality
levels(house.df$BsmtQual)
## [1] "Ex" "Fa" "Gd" "NB" "TA"

Description of each category,

#Categories of Basement exposure
levels(house.df$BsmtExposure)
## [1] "Av" "Gd" "Mn" "NB" "No"

Description of each category,

Let’s build linear regression model,

house.lm1 <- lm(SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure, data = house.knn)
summary(house.lm1)
## 
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure, 
##     data = house.knn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureNB         NA         NA      NA       NA    
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

The model suggests each variable is significant to the model. F-Statistic greater than zero indicates current model is better than null model. However, there is something odd about the summary. There are two lines,

## Coefficients: (1 not defined because of singularities)

## BsmtExposureNB         NA         NA      NA       NA   

Let’s explore to see if our dataset has any NA values,

#Let list of missing values 'NA'
aggr_plot <- aggr(house.knn, numbers=F, sortVars=F, labels=names(house.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"), plot=T)

Plot suggests there are no missing values in the dataset.

What Are NA Values In The Summary?

There are 37 observations without basement based on BsmtQual variable. And same is true for BsmtExposure. Both variables are linearly dependent on TotalBsmtSF.

table(house.knn$BsmtQual)
## 
##  Ex  Fa  Gd  NB  TA 
## 121  35 618  37 649
table(house.knn$BsmtExposure)
## 
##  Av  Gd  Mn  NB  No 
## 221 134 115  37 953

As we have seen from lm summary model internally converts each categorical value into its own column, then BsmtQualNB and BsmtExposureNB will have same exact matches, suggesting one column is sufficient to understand variance in SalePrice. One way to solve the issue is by removing one of the variables either BsmtQual or BsmtExposure from the dataset. In this case, the dataset is small, and we were able to manually identify the condition, how do we resolve this condition in larger datasets?

Unfortunately, we cannot emilinate observations from the dataset because of this condition.

Possible Solution

Let’s manually convert categorical values into their own columns. Mark them 1 if data meets the condition and with 0 if it fails. In other words, we will converting dataset into wider form, converting rows to columns.

#Get Column types
colTypes <- sapply(house.knn, class) %>% data.frame(stringsAsFactors = F)
colnames(colTypes) = c("colType")
colTypes <- tibble::rownames_to_column(colTypes, "colNames")

#Get Factor columns
colFactorNames <- colTypes %>% dplyr::filter(colType=="factor")
colFactorNames$id = seq(1:nrow(colFactorNames))

house.knn.wide <- house.knn

#Loop through each factor column and convert into wide form
for(i in seq(1:max(colFactorNames$id))){
  colName <- colFactorNames %>% filter(id == i) %>% select(colNames)
  
  facName <- colName$colNames
  
  strCmd <- paste0("house.knn.wide %>% gather(", facName, ",name,starts_with('", facName)
  strCmd <- paste0(strCmd, "')) %>% mutate(", facName, "= paste0(", facName, ",name)) %>%")
  strCmd <- paste0(strCmd, "mutate(present = 1) %>% select(-name) %>% spread(", facName, ",present,fill = 0)")

  house.knn.wide <- eval(parse(text=strCmd))
  
  strCmd <- paste0("house.knn.wide %>% select(starts_with('", facName)
  strCmd <- paste0(strCmd, "')) %>% names() %>% data.frame(stringsAsFactors = F) %>% mutate(ActualCol ='", facName, "')")
  
  #Save converted name and actual name mapping
  if(i==1){
    ActualNames.df <- eval(parse(text=strCmd))
    names(ActualNames.df) <- c("ConvertedName","ActualName")
  } else {
    tmp <- eval(parse(text=strCmd))
    names(tmp) <- c("ConvertedName","ActualName")
    ActualNames.df <- rbind(ActualNames.df, tmp)
  }
}

Columns BsmtQual and BsmtExposure are expanded to,

ActualNames.df %>% 
    kable("html",caption = "Column Conversions", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Column Conversions
ConvertedName ActualName
BsmtQualEx BsmtQual
BsmtQualFa BsmtQual
BsmtQualGd BsmtQual
BsmtQualNB BsmtQual
BsmtQualTA BsmtQual
BsmtExposureAv BsmtExposure
BsmtExposureGd BsmtExposure
BsmtExposureMn BsmtExposure
BsmtExposureNB BsmtExposure
BsmtExposureNo BsmtExposure

Wide dataset looks as,

house.knn.wide %>% head(10) %>% 
    kable("html",caption = "Sample Data", row.names = F, digits = 2) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
  scroll_box(width = "100%", height = "200px")
Sample Data
Id TotalBsmtSF SalePrice BsmtQualEx BsmtQualFa BsmtQualGd BsmtQualNB BsmtQualTA BsmtExposureAv BsmtExposureGd BsmtExposureMn BsmtExposureNB BsmtExposureNo
1 856 208500 0 0 1 0 0 0 0 0 0 1
2 1262 181500 0 0 1 0 0 0 1 0 0 0
3 920 223500 0 0 1 0 0 0 0 1 0 0
4 756 140000 0 0 0 0 1 0 0 0 0 1
5 1145 250000 0 0 1 0 0 1 0 0 0 0
6 796 143000 0 0 1 0 0 0 0 0 0 1
7 1686 307000 1 0 0 0 0 1 0 0 0 0
8 1107 200000 0 0 1 0 0 0 0 1 0 0
9 952 129900 0 0 0 0 1 0 0 0 0 1
10 991 118000 0 0 0 0 1 0 0 0 0 1

Let’s see if columns BsmtQualNB and BsmtExposureNB in newly created dataset matches 100%. All 1460 rows match.

house.knn.wide %>% select(BsmtQualNB, BsmtExposureNB) %>% filter(BsmtQualNB == BsmtExposureNB) %>% nrow()
## [1] 1460

Now let’s build a linear regression model

house.lm2 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB + BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF, data = house.knn.wide)

summary(house.lm2)
## 
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + 
##     BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB + 
##     BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF, 
##     data = house.knn.wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients: (3 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtQualEx             NA         NA      NA       NA    
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## BsmtExposureNB         NA         NA      NA       NA    
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## BsmtExposureAv         NA         NA      NA       NA    
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

Now, our model summary has more than one variable that has coefficients as NA values.

Since each column is independent, we can remove them model. However, all observations are still part of the model.

house.lm3 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd + TotalBsmtSF, data = house.knn.wide)

summary(house.lm3)
## 
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + 
##     BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd + 
##     TotalBsmtSF, data = house.knn.wide)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -490119  -27436   -5375   21243  373166 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.079e+05  8.344e+03  24.911  < 2e-16 ***
## BsmtQualTA     -1.322e+05  5.842e+03 -22.630  < 2e-16 ***
## BsmtQualNB     -1.022e+05  1.189e+04  -8.593  < 2e-16 ***
## BsmtQualGd     -8.462e+04  5.510e+03 -15.356  < 2e-16 ***
## BsmtQualFa     -1.422e+05  1.057e+04 -13.464  < 2e-16 ***
## BsmtExposureNo -2.537e+03  4.001e+03  -0.634    0.526    
## BsmtExposureMn  9.274e+03  5.973e+03   1.553    0.121    
## BsmtExposureGd  2.411e+04  5.742e+03   4.199 2.85e-05 ***
## TotalBsmtSF     6.844e+01  3.912e+00  17.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared:  0.581,  Adjusted R-squared:  0.5787 
## F-statistic: 251.5 on 8 and 1451 DF,  p-value: < 2.2e-16

Finally, you have a clean model. One observation coefficients and standard errors did not change from model to model.

References