Data Cleaning of Nashville Housing Dataset for Analysis and Modeling

Nashville Skyline

Image source: Getty Images – Kruck20 / gettyimages.com

Introduction

In this project, I demonstrate my data cleaning skills by processing the Nashville Housing dataset, which contains property sale records from 2013-2016. The goal is to prepare the dataset for detailed analysis, ensuring its quality and reliability for predictive modeling.

Click here for Data License

Step 1: Load Libraries and Dataset

In this section, I load the required libraries and import the Nashville Housing dataset. This step ensures the data is ready for inspection and subsequent cleaning. I also perform an initial structure check to understand the shape and type of data I’ll be working with.

knitr::opts_chunk$set(echo = TRUE)

# Load required libraries
library(tidyverse)   # for data manipulation and visualization

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)   # for working with date formats
library(janitor)     # for cleaning column names quickly

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(visdat)      # for visualizing missing data

# Load the dataset
nashville <- read_csv("Nashville_housing_data_2013_2016.csv")

## New names:
## Rows: 56636 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (17): Parcel ID, Land Use, Property Address, Suite/ Condo #, Property... dbl
## (13): ...1, Unnamed: 0, Sale Price, Acreage, Neighborhood, Land Value, ... date
## (1): Sale Date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

# Preview structure of the dataset
glimpse(nashville)

## Rows: 56,636
## Columns: 31
## $ ...1                                <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ `Unnamed: 0`                        <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ `Parcel ID`                         <chr> "105 03 0D 008.00", "105 11 0 080.…
## $ `Land Use`                          <chr> "RESIDENTIAL CONDO", "SINGLE FAMIL…
## $ `Property Address`                  <chr> "1208  3RD AVE S", "1802  STEWART …
## $ `Suite/ Condo   #`                  <chr> "8", NA, NA, NA, NA, NA, NA, NA, N…
## $ `Property City`                     <chr> "NASHVILLE", "NASHVILLE", "NASHVIL…
## $ `Sale Date`                         <date> 2013-01-24, 2013-01-11, 2013-01-1…
## $ `Sale Price`                        <dbl> 132000, 191500, 202000, 32000, 102…
## $ `Legal Reference`                   <chr> "20130128-0008725", "20130118-0006…
## $ `Sold As Vacant`                    <chr> "No", "No", "No", "No", "No", "No"…
## $ `Multiple Parcels Involved in Sale` <chr> "No", "No", "No", "No", "No", "No"…
## $ `Owner Name`                        <chr> NA, "STINSON, LAURA M.", "NUNES, J…
## $ Address                             <chr> NA, "1802  STEWART PL", "2761  ROS…
## $ City                                <chr> NA, "NASHVILLE", "NASHVILLE", "NAS…
## $ State                               <chr> NA, "TN", "TN", "TN", "TN", "TN", …
## $ Acreage                             <dbl> NA, 0.17, 0.11, 0.17, 0.34, 0.17, …
## $ `Tax District`                      <chr> NA, "URBAN SERVICES DISTRICT", "CI…
## $ Neighborhood                        <dbl> NA, 3127, 9126, 3130, 3130, 3130, …
## $ image                               <chr> NA, "\\114000\\910001.JPG", "\\131…
## $ `Land Value`                        <dbl> NA, 32000, 34000, 25000, 25000, 25…
## $ `Building Value`                    <dbl> NA, 134400, 157800, 243700, 138100…
## $ `Total Value`                       <dbl> NA, 168300, 191800, 268700, 164800…
## $ `Finished Area`                     <dbl> NA, 1149.000, 2090.825, 2145.600, …
## $ `Foundation Type`                   <chr> NA, "PT BSMT", "SLAB", "FULL BSMT"…
## $ `Year Built`                        <dbl> NA, 1941, 2000, 1948, 1910, 1945, …
## $ `Exterior Wall`                     <chr> NA, "BRICK", "BRICK/FRAME", "BRICK…
## $ Grade                               <chr> NA, "C", "C", "B", "C", "C", NA, "…
## $ Bedrooms                            <dbl> NA, 2, 3, 4, 2, 2, NA, 2, 2, 2, 2,…
## $ `Full Bath`                         <dbl> NA, 1, 2, 2, 1, 1, NA, 1, 1, 1, 1,…
## $ `Half Bath`                         <dbl> NA, 0, 1, 0, 0, 0, NA, 0, 0, 0, 0,…

# Summary statistics for numerical columns
summary(nashville)

##       ...1         Unnamed: 0     Parcel ID           Land Use        
##  Min.   :    0   Min.   :    0   Length:56636       Length:56636      
##  1st Qu.:14159   1st Qu.:14159   Class :character   Class :character  
##  Median :28318   Median :28318   Mode  :character   Mode  :character  
##  Mean   :28318   Mean   :28318                                        
##  3rd Qu.:42476   3rd Qu.:42476                                        
##  Max.   :56635   Max.   :56635                                        
##                                                                       
##  Property Address   Suite/ Condo   #   Property City        Sale Date         
##  Length:56636       Length:56636       Length:56636       Min.   :2013-01-02  
##  Class :character   Class :character   Class :character   1st Qu.:2014-04-11  
##  Mode  :character   Mode  :character   Mode  :character   Median :2015-03-24  
##                                                           Mean   :2015-02-07  
##                                                           3rd Qu.:2015-12-30  
##                                                           Max.   :2016-10-31  
##                                                                               
##    Sale Price       Legal Reference    Sold As Vacant    
##  Min.   :      50   Length:56636       Length:56636      
##  1st Qu.:  135000   Class :character   Class :character  
##  Median :  205450   Mode  :character   Mode  :character  
##  Mean   :  327211                                        
##  3rd Qu.:  329000                                        
##  Max.   :54278060                                        
##                                                          
##  Multiple Parcels Involved in Sale  Owner Name          Address         
##  Length:56636                      Length:56636       Length:56636      
##  Class :character                  Class :character   Class :character  
##  Mode  :character                  Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##      City              State              Acreage        Tax District      
##  Length:56636       Length:56636       Min.   :  0.010   Length:56636      
##  Class :character   Class :character   1st Qu.:  0.180   Class :character  
##  Mode  :character   Mode  :character   Median :  0.270   Mode  :character  
##                                        Mean   :  0.499                     
##                                        3rd Qu.:  0.450                     
##                                        Max.   :160.060                     
##                                        NA's   :30619                       
##   Neighborhood      image             Land Value      Building Value    
##  Min.   : 107    Length:56636       Min.   :    100   Min.   :       0  
##  1st Qu.:3126    Class :character   1st Qu.:  21000   1st Qu.:   75900  
##  Median :3929    Mode  :character   Median :  28800   Median :  111400  
##  Mean   :4356                       Mean   :  69073   Mean   :  160802  
##  3rd Qu.:6228                       3rd Qu.:  60000   3rd Qu.:  180700  
##  Max.   :9530                       Max.   :2772000   Max.   :12971800  
##  NA's   :30619                      NA's   :30619     NA's   :30619     
##   Total Value       Finished Area    Foundation Type      Year Built   
##  Min.   :     100   Min.   :     0   Length:56636       Min.   :1799   
##  1st Qu.:  102800   1st Qu.:  1239   Class :character   1st Qu.:1948   
##  Median :  148500   Median :  1632   Mode  :character   Median :1960   
##  Mean   :  232397   Mean   :  1927                      Mean   :1964   
##  3rd Qu.:  268500   3rd Qu.:  2212                      3rd Qu.:1983   
##  Max.   :13940400   Max.   :197988                      Max.   :2017   
##  NA's   :30619      NA's   :32470                       NA's   :32471  
##  Exterior Wall         Grade              Bedrooms       Full Bath    
##  Length:56636       Length:56636       Min.   : 0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 3.00   1st Qu.: 1.00  
##  Mode  :character   Mode  :character   Median : 3.00   Median : 2.00  
##                                        Mean   : 3.09   Mean   : 1.89  
##                                        3rd Qu.: 3.00   3rd Qu.: 2.00  
##                                        Max.   :11.00   Max.   :10.00  
##                                        NA's   :32477   NA's   :32359  
##    Half Bath    
##  Min.   :0.00   
##  1st Qu.:0.00   
##  Median :0.00   
##  Mean   :0.28   
##  3rd Qu.:1.00   
##  Max.   :3.00   
##  NA's   :32490

Step 2: Clean Column Names

This section focuses on standardizing column names to ensure they follow consistent naming conventions. I’ll use the janitor package to clean the column names into a clear, readable format, making the dataset easier to work with in subsequent steps.

# Clean column names using janitor's clean_names() function
nashville <- janitor::clean_names(nashville)

# Preview the cleaned column names
colnames(nashville)

##  [1] "x1"                                "unnamed_0"                        
##  [3] "parcel_id"                         "land_use"                         
##  [5] "property_address"                  "suite_condo_number"               
##  [7] "property_city"                     "sale_date"                        
##  [9] "sale_price"                        "legal_reference"                  
## [11] "sold_as_vacant"                    "multiple_parcels_involved_in_sale"
## [13] "owner_name"                        "address"                          
## [15] "city"                              "state"                            
## [17] "acreage"                           "tax_district"                     
## [19] "neighborhood"                      "image"                            
## [21] "land_value"                        "building_value"                   
## [23] "total_value"                       "finished_area"                    
## [25] "foundation_type"                   "year_built"                       
## [27] "exterior_wall"                     "grade"                            
## [29] "bedrooms"                          "full_bath"                        
## [31] "half_bath"

Step 3: Handle Missing Values

In this section, I’ll identify any missing values in the dataset. Handling missing data is essential to ensuring the integrity and completeness of the analysis. Depending on the situation, I may choose to impute, drop, or flag rows/columns with missing values.

# Check for missing values in the dataset
colSums(is.na(nashville))

##                                x1                         unnamed_0 
##                                 0                                 0 
##                         parcel_id                          land_use 
##                                 0                                 0 
##                  property_address                suite_condo_number 
##                               159                             50527 
##                     property_city                         sale_date 
##                               159                                 0 
##                        sale_price                   legal_reference 
##                                 0                                 0 
##                    sold_as_vacant multiple_parcels_involved_in_sale 
##                                 0                                 0 
##                        owner_name                           address 
##                             31375                             30619 
##                              city                             state 
##                             30619                             30619 
##                           acreage                      tax_district 
##                             30619                             30619 
##                      neighborhood                             image 
##                             30619                             31301 
##                        land_value                    building_value 
##                             30619                             30619 
##                       total_value                     finished_area 
##                             30619                             32470 
##                   foundation_type                        year_built 
##                             32472                             32471 
##                     exterior_wall                             grade 
##                             32471                             32471 
##                          bedrooms                         full_bath 
##                             32477                             32359 
##                         half_bath 
##                             32490

# Take a random sample of 500 rows to visualize
library(dplyr)
nashville_sample <- slice_sample(nashville, n = 500)

# Now visualize missing data
vis_miss(nashville_sample)

# Decide how to handle missing data
# For example, I may drop rows where essential variables like SalePrice are missing
nashville_cleaned <- nashville %>%
  filter(!is.na(sale_price))  # Removing rows where sale_price is NA

# Summary of dataset to identify columns with missing data
summary(nashville_cleaned)

##        x1          unnamed_0      parcel_id           land_use        
##  Min.   :    0   Min.   :    0   Length:56636       Length:56636      
##  1st Qu.:14159   1st Qu.:14159   Class :character   Class :character  
##  Median :28318   Median :28318   Mode  :character   Mode  :character  
##  Mean   :28318   Mean   :28318                                        
##  3rd Qu.:42476   3rd Qu.:42476                                        
##  Max.   :56635   Max.   :56635                                        
##                                                                       
##  property_address   suite_condo_number property_city        sale_date         
##  Length:56636       Length:56636       Length:56636       Min.   :2013-01-02  
##  Class :character   Class :character   Class :character   1st Qu.:2014-04-11  
##  Mode  :character   Mode  :character   Mode  :character   Median :2015-03-24  
##                                                           Mean   :2015-02-07  
##                                                           3rd Qu.:2015-12-30  
##                                                           Max.   :2016-10-31  
##                                                                               
##    sale_price       legal_reference    sold_as_vacant    
##  Min.   :      50   Length:56636       Length:56636      
##  1st Qu.:  135000   Class :character   Class :character  
##  Median :  205450   Mode  :character   Mode  :character  
##  Mean   :  327211                                        
##  3rd Qu.:  329000                                        
##  Max.   :54278060                                        
##                                                          
##  multiple_parcels_involved_in_sale  owner_name          address         
##  Length:56636                      Length:56636       Length:56636      
##  Class :character                  Class :character   Class :character  
##  Mode  :character                  Mode  :character   Mode  :character  
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##      city              state              acreage        tax_district      
##  Length:56636       Length:56636       Min.   :  0.010   Length:56636      
##  Class :character   Class :character   1st Qu.:  0.180   Class :character  
##  Mode  :character   Mode  :character   Median :  0.270   Mode  :character  
##                                        Mean   :  0.499                     
##                                        3rd Qu.:  0.450                     
##                                        Max.   :160.060                     
##                                        NA's   :30619                       
##   neighborhood      image             land_value      building_value    
##  Min.   : 107    Length:56636       Min.   :    100   Min.   :       0  
##  1st Qu.:3126    Class :character   1st Qu.:  21000   1st Qu.:   75900  
##  Median :3929    Mode  :character   Median :  28800   Median :  111400  
##  Mean   :4356                       Mean   :  69073   Mean   :  160802  
##  3rd Qu.:6228                       3rd Qu.:  60000   3rd Qu.:  180700  
##  Max.   :9530                       Max.   :2772000   Max.   :12971800  
##  NA's   :30619                      NA's   :30619     NA's   :30619     
##   total_value       finished_area    foundation_type      year_built   
##  Min.   :     100   Min.   :     0   Length:56636       Min.   :1799   
##  1st Qu.:  102800   1st Qu.:  1239   Class :character   1st Qu.:1948   
##  Median :  148500   Median :  1632   Mode  :character   Median :1960   
##  Mean   :  232397   Mean   :  1927                      Mean   :1964   
##  3rd Qu.:  268500   3rd Qu.:  2212                      3rd Qu.:1983   
##  Max.   :13940400   Max.   :197988                      Max.   :2017   
##  NA's   :30619      NA's   :32470                       NA's   :32471  
##  exterior_wall         grade              bedrooms       full_bath    
##  Length:56636       Length:56636       Min.   : 0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 3.00   1st Qu.: 1.00  
##  Mode  :character   Mode  :character   Median : 3.00   Median : 2.00  
##                                        Mean   : 3.09   Mean   : 1.89  
##                                        3rd Qu.: 3.00   3rd Qu.: 2.00  
##                                        Max.   :11.00   Max.   :10.00  
##                                        NA's   :32477   NA's   :32359  
##    half_bath    
##  Min.   :0.00   
##  1st Qu.:0.00   
##  Median :0.00   
##  Mean   :0.28   
##  3rd Qu.:1.00   
##  Max.   :3.00   
##  NA's   :32490

# Handling missing data by replacing NAs with appropriate values

nashville_cleaned <- nashville_cleaned %>%
  mutate(
    # Replace missing sale_price with the median value
    sale_price = ifelse(is.na(sale_price), median(sale_price, na.rm = TRUE), sale_price),
    
    # Replace missing owner_name with "Unknown"
    owner_name = ifelse(is.na(owner_name), "Unknown", owner_name)
  )

# Visualize missing data after handling

# Downsample to 10,000 rows for visualization
nashville_sampled <- nashville_cleaned %>%
  slice_sample(n = 10000)

# Visualize missing data in the downsampled dataset
vis_miss(nashville_sampled)


##  Step 4: Confirm Date Format

The sale_date column is already properly formatted as a Date object, which is ideal for time-based analysis. This step confirms that no additional transformation is needed.


``` r
# Confirm the format of the sale_date column
head(nashville$sale_date)

## [1] "2013-01-24" "2013-01-11" "2013-01-18" "2013-01-18" "2013-01-23"
## [6] "2013-01-04"

class(nashville$sale_date)

## [1] "Date"

# Assign the already-clean date column to the cleaned dataset
nashville_cleaned <- nashville_cleaned %>%
  mutate(sale_date = nashville$sale_date)

Step 5: Standardize Text Fields

In this step, I clean and standardize inconsistent text fields such as property types or ownership names. This prevents grouping issues during analysis and ensures accuracy in aggregations or filters.

# Standardize text fields with appropriate cleaning
nashville_cleaned <- nashville_cleaned %>%
  mutate(
    # Standardizing 'land_use' field: Capitalize first letter of each word
    land_use = str_to_title(str_trim(land_use)),  
    
    # Standardizing 'owner_name': Proper capitalization
    owner_name = str_to_title(str_trim(owner_name)),  
    
    # Removing extra spaces from property address and standardizing case
    property_address = str_to_title(str_trim(property_address)),
    property_city = str_to_title(str_trim(property_city)),
    
    # Capitalizing tax district and foundation type
    tax_district = str_to_title(str_trim(tax_district)),
    foundation_type = str_to_title(str_trim(foundation_type))
  )

Step 6: Outlier Detection

Outliers are values that deviate significantly from other observations. Detecting and handling outliers is important for ensuring the accuracy of your analysis.

# Boxplot to identify potential outliers in sale_price and acreage
library(ggplot2)

# Remove rows with NA or infinite values for the relevant columns
nashville_cleaned <- nashville_cleaned %>%
  filter(!is.na(sale_price) & !is.na(acreage)) %>%
  filter(is.finite(sale_price) & is.finite(acreage))

# Plot the boxplot for sale_price
ggplot(nashville_cleaned, aes(x = sale_price)) +
  geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
  ggtitle("Boxplot for Sale Price") +
  xlab("Sale Price") +
  ylab("Frequency") +
  theme_minimal(base_size = 15) + # Clean minimal theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

# Plot the boxplot for acreage
ggplot(nashville_cleaned, aes(x = acreage)) +
  geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
  ggtitle("Boxplot for Acreage") +
  xlab("Acreage") +
  ylab("Frequency") +
  theme_minimal(base_size = 15) + # Clean minimal theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

# Removing outliers
nashville_cleaned <- nashville_cleaned %>%
  filter(sale_price < 10000000)  # Remove properties above $10 million

# Outlier detection using boxplot for sale_price
ggplot(nashville_cleaned, aes(x = sale_price)) +
  geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
  ggtitle("Boxplot for Sale Price") +
  xlab("Sale Price") +
  ylab("Frequency") +
  theme_minimal(base_size = 15) + # Clean minimal theme
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12)
  )

Conclusion

In this project, I demonstrated key data cleaning techniques that are critical in preparing a dataset for analysis. Through the following steps, I ensured that the Nashville Housing dataset was well-prepared for further analysis:

Cleaning column names for consistency and readability using the janitor package.
Handling missing data by identifying and replacing NAs, ensuring that the dataset was complete and could be used effectively in further analysis.
Standardizing text fields such as addresses and owner names to prevent inconsistencies and errors during analysis.
Outlier detection and removal to ensure that extreme values did not distort the dataset and influence results.

By performing these steps, I not only cleaned and organized the data but also prepared it for more detailed analyses, such as exploring trends in housing prices, property features, and regional patterns.

The next steps in this analysis would involve further exploration of the cleaned data, including generating descriptive statistics, visualizing key trends, and possibly building predictive models to forecast housing prices. Additionally, performing geographic analyses could provide insights into how location affects property values.

This data cleaning process is crucial as it lays the foundation for reliable, actionable insights. The cleaned dataset is now ready for any further analysis or modeling required to gain deeper insights into the Nashville housing market.

Exporting

# Export the cleaned dataset to a CSV file
write_csv(nashville_cleaned, "nashville_cleaned.csv")