Introduction

Data cleansing is the process of identifying and/or removing any data that is incorrect. Incorrect data can be mistaken, outdated, missing, or irrelevant. Depending on your purpose, data cleansing can mean a variety of things and can have a variety of names.

During my time as a master’s student, I had to conduct a practicum research project with a data set of my choosing. The data cleansing portion of the project took a semester to complete before deciding that the data was adequate to proceed to any sort of analysis. This meant that most of my time on my practicum dealt with adjusting, imputing, and correcting any mistakes that were found throughout the data. Note that, my focus for this entry will be on the imputation methods I used to fill in for missing values.

Why is it important that the data is cleansed before beginning any sort of analysis? Since we are relying on the data set to make any assumptions, claims, or predictions it is important that the data be as “correct” as possible. In a business sector your findings could be the reason why the company decides to proceed or withdraw from investing large amounts of money into projects. It could be the reason you make or loss money. In the medical sector you are dealing with individuals’ lives, so your results could be dangerous given that the data was incorrect or misleading. Simply, crucial decisions are being made based on this data, so you want it to be as reliable as possible.

Dataset

In this dataset we have many attributes (such as county, zip code, city, number of bedrooms, bathrooms, garages and pools, square footage, year built etc.) of homes that are under foreclosure in Harris County. We have over 3,000 data entries.

Data Cleansing

Before the analysis takes place, we check if the data is entered correctly and if it is complete. For example if you look at the data we are dealing with below, a common recording error will be to have the same “Mortagee.Bank.Name” entered differently; as “WELLS FARGO BANK NA” or “WELLS FARGO BANK N.A”. Since we are using R for analysis, we may need to check if the recorded data are recognized for their correct data type by R. For example, the ($7,418) under “Estimated. Equity” below is a negative value in Excel, yet R would not identify that. The dataset can be incomplete because of missing values. For this entry we will talk in detail about my thought process of imputing missing values for the number of bedrooms.

master <- read.csv("D:/Practicum/Cleaning/cleaned.master.csv")
head(master)
##       County     City Zip.Code Bed Bath Sq.Ft Garage Pool Yr.Bt
## 1     HARRIS  Houston    77082   3  3.0 1,998      2    N  1983
## 2     HARRIS  Houston    77082   3  3.0 2,346      4    N  2003
## 3     HARRIS  Houston    77040   3  2.0 1,631      2    N  1972
## 4     HARRIS   Spring    77388   3  2.0 1,670      2    N  1973
## 5 MONTGOMERY   Spring    77381  NA  1.5 1,462      2    N  1979
## 6     HARRIS La Porte    77571   4  2.0 1,610      2    N  1963
##        Mortgagee.Bank.Name Estimated.Equity Equity..
## 1      WELLS FARGO BANK NA         $83,136       47%
## 2        QUICKEN LOANS INC         $36,565       19%
## 3         CITIMORTGAGE INC         $67,544       49%
## 4  LIVE WELL FINANCIAL INC         ($7,418)      -5%
## 5  NATIONSTAR MORTGAGE LLC          $1,061        1%
## 6 PHH MORTGAGE CORPORATION        $102,267       50%
##   Estimated.Unpaid.Balance Orig.Loan.amt Historical.Interest.Rate
## 1                 $93,802      $100,000                     3.44%
## 2                $153,016      $163,127                     3.44%
## 3                 $69,422       $86,406                     5.00%
## 4                $150,572      $171,000                     4.07%
## 5                $158,379      $192,000                     4.99%
## 6                $102,221      $123,766                     6.48%
##   Assessed.Value Loan.Origination.year Loan.Type Loan.Expiration.Year
## 1                                 2016        HE                 2031
## 2                                 2016       FHA                 2046
## 3                                 2009       FHA                 2039
## 4      $130,763                   2013       FHA                 <NA>
## 5      $159,440                   2010       FHA                 <NA>
## 6      $189,193                   2008       FHA                 2038
##            Ownership Missing N.A Total Remove
## 1     OWNER OCCUPIED       1   0     1      0
## 2 NOT OWNER OCCUPIED       1   0     1      0
## 3     OWNER OCCUPIED       1   0     1      0
## 4     OWNER OCCUPIED       0   1     1      0
## 5     OWNER OCCUPIED       0   2     2      0
## 6     OWNER OCCUPIED       0   0     0      0

Below, on an excel file we had labeled the entries with missing values, 0, and those that did not have missing values, 1. All those are labeled 1 are used to find a way to impute values for those that are labeled 0.

missing.bed <- read.csv("D:/Practicum/Cleaning/filling.bed.csv")
missing.bed$Bed <- as.numeric(missing.bed$Bed)
missing.bed$Sq.Ft <- as.numeric(missing.bed$Sq.Ft)
bed <- missing.bed[!(missing.bed$Remove==1),] #removing the missing bed values
not.bed <- missing.bed[!(missing.bed$Remove==0),] #removing the missing bed values
not.bed <- subset(not.bed, select=c(Sq.Ft))

Imputing Methods

There are a variety of ways that we could impute missing data. Below are the methods we used.

Pros & Cons of the Imputing Methods

Median/Mean Value:

Simple Linear Regression Model:

Training and Testing Datasets

First, we keep aside the records in the dataset with missing values for number of bedrooms. Then, we randomly split the rest of the dataset in 70:30 ratio into a training and a testing dataset to get an idea of the accuracy of the methods we are using before we conduct the imputation of the missing values.

set.seed(123)
sample.data <- sample(2, nrow(bed), replace = TRUE, prob =c(0.7, 0.3))
train <- bed[sample.data==1,]
testing <- bed[sample.data==2,]

Median/Mean Imputation

The mean and the median of the training dataset are 3.4771 and 3, respectively. Since the number of bedrooms is a discrete measure, we use 3 as our predicted value for the number of bedrooms in the testing dataset.

mean(train$Bed)
## [1] 3.477116
median(train$Bed)
## [1] 3
guess <- c(rep.int(3, 994))

Following is a cross comparison between the predicted and the actual values for the number of bedrooms in the testing dataset. We see that only 467 of the houses in the testing dataset had 3 bedrooms, the predicted value. This is an accuracy of 46.98%.

##      
## guess   1   2   3   4   5   6  12
##     3   6  61 467 375  75   9   1
## [1] 0.4698189

Simple Linear Regression Imputation

The easiest way to see if we can see a relationship between two variables is to view a scatterplot of the data. In our case, regressing the number of bedrooms on square footage seems to be fitting as the size of a house determines the amount of bedrooms in it; it also helps that there is a low number of missing data in square footage.

Given below is the scatterplot between the number of bedrooms and the square footage of houses. We see that since the number of bedrooms is a discrete measure, the scatterplot seems to show a step-like pattern. However, we can still notice that in overall, the data follows a positive relationship. There are some leverage and influential points present in the data, but we will not be discussing those here.

plot(x=missing.bed$Sq.Ft, y=missing.bed$Bed)

We will now use the training dataset to fit the regression model.

set.seed(123)
sample.data <- sample(2, nrow(bed), replace = TRUE, prob =c(0.7, 0.3))
train <- bed[sample.data==1,]
testing <- bed[sample.data==2,]
fit70 <- lm(Bed~Sq.Ft, data=train)
summary(fit70)
## 
## Call:
## lm(formula = Bed ~ Sq.Ft, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5718 -0.3206 -0.0158  0.3636  6.6406 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.172e+00  3.238e-02   67.08   <2e-16 ***
## Sq.Ft       5.837e-04  1.336e-05   43.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6042 on 2314 degrees of freedom
## Multiple R-squared:  0.4522, Adjusted R-squared:  0.452 
## F-statistic:  1910 on 1 and 2314 DF,  p-value: < 2.2e-16

The fitted simple regression model between number of bedroom and the square footage of the foreclosed houses is,

\(bed = 2.172 + .0005837*Sq.Ft\)

Using the fitted model, we can now predict the number of bedrooms in the testing dataset.

From the following we can get an idea about the spread of the values predicted. We observe that the simple regression model predicted 3 and 4 most frequently as the number of bedrooms for the testing dataset.

pred <- predict(fit70, testing)
pred <- round(pred)
table(pred)
## pred
##   3   4   5   6   9 
## 596 343  48   6   1

Comparing the predicted bedroom values with the actual values, we obtain an accuracy of 67.20% for this method.

table(pred, testing$Bed)
##     
## pred   1   2   3   4   5   6  12
##    3   5  60 408 122   1   0   0
##    4   1   1  59 234  46   2   0
##    5   0   0   0  17  25   5   1
##    6   0   0   0   2   3   1   0
##    9   0   0   0   0   0   1   0
mean(pred==testing$Bed)
## [1] 0.6720322

Conclusion

After analyzing our two suggested procedures, we can see that it is of benefit to use the simple linear regression imputation for this data as its accuracy is 20% more than that of the mean/median imputation method.

It should be noted that missing value imputation is a broad topic. For this dataset and its context, the two methods used here can be considered relatively simple approaches when compared with many other available imputation methods.