In this post, I will be discussing how to deal with NA values in results of lm() function. Dataset I will be using is House Prices: Advanced Regression Techniques from Kaggle website.
To keep the discussion relevant, I will be using four columns TotalBsmtSF, BsmtQual, BsmtExposure and SalePrice.
#options(scipen=999)
library(tidyverse)
library(DMwR)
library(knitr) # Report display, table format
library(kableExtra)
library(VIM)
house.data.df <- read.csv("https://raw.githubusercontent.com/akulapa/DATA621-Blog3/master/train.csv", header= TRUE, stringsAsFactors = F)
Based on the description of each variable provided by Kaggle, both BsmtQual and BsmtExposure are categorical variables and TotalBsmtSF and SalePrice are continous variables. We will add additional category NB(no basement) for observations where TotalBsmtSF values is zero
house.df <- house.data.df %>% select(TotalBsmtSF, BsmtQual, BsmtExposure)
house.df$BsmtQual <- as.character(house.df$BsmtQual)
house.df$BsmtQual <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtQual) #NB - No Basement
house.df$BsmtQual <- factor(house.df$BsmtQual)
house.df$BsmtExposure <- as.character(house.df$BsmtExposure)
house.df$BsmtExposure <- ifelse(house.df$TotalBsmtSF == 0, 'NB', house.df$BsmtExposure) #NB - No Basement
house.df$BsmtExposure <- factor(house.df$BsmtExposure)
#Impute missing vaues using kNN
house.knn <- knnImputation(house.df, 11, meth='weighAvg')
#Apply SalePrice
house.knn <- cbind(Id = house.data.df$Id, house.knn, SalePrice = house.data.df$SalePrice)
house.knn %>% head(10) %>%
kable("html",caption = "Sample Data Set", row.names = F, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| Id | TotalBsmtSF | BsmtQual | BsmtExposure | SalePrice |
|---|---|---|---|---|
| 1 | 856 | Gd | No | 208500 |
| 2 | 1262 | Gd | Gd | 181500 |
| 3 | 920 | Gd | Mn | 223500 |
| 4 | 756 | TA | No | 140000 |
| 5 | 1145 | Gd | Av | 250000 |
| 6 | 796 | Gd | No | 143000 |
| 7 | 1686 | Ex | Av | 307000 |
| 8 | 1107 | Gd | Mn | 200000 |
| 9 | 952 | TA | No | 129900 |
| 10 | 991 | TA | No | 118000 |
#Categories of Basement quality
levels(house.df$BsmtQual)
## [1] "Ex" "Fa" "Gd" "NB" "TA"
Description of each category,
#Categories of Basement exposure
levels(house.df$BsmtExposure)
## [1] "Av" "Gd" "Mn" "NB" "No"
Description of each category,
Let’s build linear regression model,
house.lm1 <- lm(SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure, data = house.knn)
summary(house.lm1)
##
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF + BsmtQual + BsmtExposure,
## data = house.knn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -490119 -27436 -5375 21243 373166
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.079e+05 8.344e+03 24.911 < 2e-16 ***
## TotalBsmtSF 6.844e+01 3.912e+00 17.496 < 2e-16 ***
## BsmtQualFa -1.422e+05 1.057e+04 -13.464 < 2e-16 ***
## BsmtQualGd -8.462e+04 5.510e+03 -15.356 < 2e-16 ***
## BsmtQualNB -1.022e+05 1.189e+04 -8.593 < 2e-16 ***
## BsmtQualTA -1.322e+05 5.842e+03 -22.630 < 2e-16 ***
## BsmtExposureGd 2.411e+04 5.742e+03 4.199 2.85e-05 ***
## BsmtExposureMn 9.274e+03 5.973e+03 1.553 0.121
## BsmtExposureNB NA NA NA NA
## BsmtExposureNo -2.537e+03 4.001e+03 -0.634 0.526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared: 0.581, Adjusted R-squared: 0.5787
## F-statistic: 251.5 on 8 and 1451 DF, p-value: < 2.2e-16
The model suggests each variable is significant to the model. F-Statistic greater than zero indicates current model is better than null model. However, there is something odd about the summary. There are two lines,
## Coefficients: (1 not defined because of singularities)
## BsmtExposureNB NA NA NA NA
Let’s explore to see if our dataset has any NA values,
#Let list of missing values 'NA'
aggr_plot <- aggr(house.knn, numbers=F, sortVars=F, labels=names(house.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"), plot=T)
Plot suggests there are no missing values in the dataset.
NA Values In The Summary?There are 37 observations without basement based on BsmtQual variable. And same is true for BsmtExposure. Both variables are linearly dependent on TotalBsmtSF.
table(house.knn$BsmtQual)
##
## Ex Fa Gd NB TA
## 121 35 618 37 649
table(house.knn$BsmtExposure)
##
## Av Gd Mn NB No
## 221 134 115 37 953
As we have seen from lm summary model internally converts each categorical value into its own column, then BsmtQualNB and BsmtExposureNB will have same exact matches, suggesting one column is sufficient to understand variance in SalePrice. One way to solve the issue is by removing one of the variables either BsmtQual or BsmtExposure from the dataset. In this case, the dataset is small, and we were able to manually identify the condition, how do we resolve this condition in larger datasets?
Unfortunately, we cannot emilinate observations from the dataset because of this condition.
Let’s manually convert categorical values into their own columns. Mark them 1 if data meets the condition and with 0 if it fails. In other words, we will converting dataset into wider form, converting rows to columns.
#Get Column types
colTypes <- sapply(house.knn, class) %>% data.frame(stringsAsFactors = F)
colnames(colTypes) = c("colType")
colTypes <- tibble::rownames_to_column(colTypes, "colNames")
#Get Factor columns
colFactorNames <- colTypes %>% dplyr::filter(colType=="factor")
colFactorNames$id = seq(1:nrow(colFactorNames))
house.knn.wide <- house.knn
#Loop through each factor column and convert into wide form
for(i in seq(1:max(colFactorNames$id))){
colName <- colFactorNames %>% filter(id == i) %>% select(colNames)
facName <- colName$colNames
strCmd <- paste0("house.knn.wide %>% gather(", facName, ",name,starts_with('", facName)
strCmd <- paste0(strCmd, "')) %>% mutate(", facName, "= paste0(", facName, ",name)) %>%")
strCmd <- paste0(strCmd, "mutate(present = 1) %>% select(-name) %>% spread(", facName, ",present,fill = 0)")
house.knn.wide <- eval(parse(text=strCmd))
strCmd <- paste0("house.knn.wide %>% select(starts_with('", facName)
strCmd <- paste0(strCmd, "')) %>% names() %>% data.frame(stringsAsFactors = F) %>% mutate(ActualCol ='", facName, "')")
#Save converted name and actual name mapping
if(i==1){
ActualNames.df <- eval(parse(text=strCmd))
names(ActualNames.df) <- c("ConvertedName","ActualName")
} else {
tmp <- eval(parse(text=strCmd))
names(tmp) <- c("ConvertedName","ActualName")
ActualNames.df <- rbind(ActualNames.df, tmp)
}
}
Columns BsmtQual and BsmtExposure are expanded to,
ActualNames.df %>%
kable("html",caption = "Column Conversions", row.names = F, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| ConvertedName | ActualName |
|---|---|
| BsmtQualEx | BsmtQual |
| BsmtQualFa | BsmtQual |
| BsmtQualGd | BsmtQual |
| BsmtQualNB | BsmtQual |
| BsmtQualTA | BsmtQual |
| BsmtExposureAv | BsmtExposure |
| BsmtExposureGd | BsmtExposure |
| BsmtExposureMn | BsmtExposure |
| BsmtExposureNB | BsmtExposure |
| BsmtExposureNo | BsmtExposure |
Wide dataset looks as,
house.knn.wide %>% head(10) %>%
kable("html",caption = "Sample Data", row.names = F, digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| Id | TotalBsmtSF | SalePrice | BsmtQualEx | BsmtQualFa | BsmtQualGd | BsmtQualNB | BsmtQualTA | BsmtExposureAv | BsmtExposureGd | BsmtExposureMn | BsmtExposureNB | BsmtExposureNo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 856 | 208500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1262 | 181500 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 920 | 223500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 756 | 140000 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 1145 | 250000 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 796 | 143000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1686 | 307000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 1107 | 200000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9 | 952 | 129900 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 10 | 991 | 118000 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
Let’s see if columns BsmtQualNB and BsmtExposureNB in newly created dataset matches 100%. All 1460 rows match.
house.knn.wide %>% select(BsmtQualNB, BsmtExposureNB) %>% filter(BsmtQualNB == BsmtExposureNB) %>% nrow()
## [1] 1460
Now let’s build a linear regression model
house.lm2 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB + BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF, data = house.knn.wide)
summary(house.lm2)
##
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd +
## BsmtQualFa + BsmtQualEx + BsmtExposureNo + BsmtExposureNB +
## BsmtExposureMn + BsmtExposureGd + BsmtExposureAv + TotalBsmtSF,
## data = house.knn.wide)
##
## Residuals:
## Min 1Q Median 3Q Max
## -490119 -27436 -5375 21243 373166
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.079e+05 8.344e+03 24.911 < 2e-16 ***
## BsmtQualTA -1.322e+05 5.842e+03 -22.630 < 2e-16 ***
## BsmtQualNB -1.022e+05 1.189e+04 -8.593 < 2e-16 ***
## BsmtQualGd -8.462e+04 5.510e+03 -15.356 < 2e-16 ***
## BsmtQualFa -1.422e+05 1.057e+04 -13.464 < 2e-16 ***
## BsmtQualEx NA NA NA NA
## BsmtExposureNo -2.537e+03 4.001e+03 -0.634 0.526
## BsmtExposureNB NA NA NA NA
## BsmtExposureMn 9.274e+03 5.973e+03 1.553 0.121
## BsmtExposureGd 2.411e+04 5.742e+03 4.199 2.85e-05 ***
## BsmtExposureAv NA NA NA NA
## TotalBsmtSF 6.844e+01 3.912e+00 17.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared: 0.581, Adjusted R-squared: 0.5787
## F-statistic: 251.5 on 8 and 1451 DF, p-value: < 2.2e-16
Now, our model summary has more than one variable that has coefficients as NA values.
BsmtQualNB and BsmtExposureNB, we only need one of them, and model chose BsmtQualNB .n - 1, but we created n columns. Hence, between BsmtQualTA, BsmtQualNB, BsmtQualGd, BsmtQualFa and BsmtQualEx model chose to eliminate BsmtQualEx .BsmtExposureNo, BsmtExposureNB, BsmtExposureMn, BsmtExposureGd and BsmtExposureAv, model already eliminated BsmtExposureNB, it still needs to eliminate one more column hence BsmtExposureAv was chosen.Since each column is independent, we can remove them model. However, all observations are still part of the model.
house.lm3 <- lm(SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd + BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd + TotalBsmtSF, data = house.knn.wide)
summary(house.lm3)
##
## Call:
## lm(formula = SalePrice ~ BsmtQualTA + BsmtQualNB + BsmtQualGd +
## BsmtQualFa + BsmtExposureNo + BsmtExposureMn + BsmtExposureGd +
## TotalBsmtSF, data = house.knn.wide)
##
## Residuals:
## Min 1Q Median 3Q Max
## -490119 -27436 -5375 21243 373166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.079e+05 8.344e+03 24.911 < 2e-16 ***
## BsmtQualTA -1.322e+05 5.842e+03 -22.630 < 2e-16 ***
## BsmtQualNB -1.022e+05 1.189e+04 -8.593 < 2e-16 ***
## BsmtQualGd -8.462e+04 5.510e+03 -15.356 < 2e-16 ***
## BsmtQualFa -1.422e+05 1.057e+04 -13.464 < 2e-16 ***
## BsmtExposureNo -2.537e+03 4.001e+03 -0.634 0.526
## BsmtExposureMn 9.274e+03 5.973e+03 1.553 0.121
## BsmtExposureGd 2.411e+04 5.742e+03 4.199 2.85e-05 ***
## TotalBsmtSF 6.844e+01 3.912e+00 17.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51560 on 1451 degrees of freedom
## Multiple R-squared: 0.581, Adjusted R-squared: 0.5787
## F-statistic: 251.5 on 8 and 1451 DF, p-value: < 2.2e-16
Finally, you have a clean model. One observation coefficients and standard errors did not change from model to model.