Image source: Getty Images – Kruck20 / gettyimages.com
In this project, I demonstrate my data cleaning skills by processing the Nashville Housing dataset, which contains property sale records from 2013-2016. The goal is to prepare the dataset for detailed analysis, ensuring its quality and reliability for predictive modeling.
In this section, I load the required libraries and import the Nashville Housing dataset. This step ensures the data is ready for inspection and subsequent cleaning. I also perform an initial structure check to understand the shape and type of data I’ll be working with.
knitr::opts_chunk$set(echo = TRUE)
# Load required libraries
library(tidyverse) # for data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) # for working with date formats
library(janitor) # for cleaning column names quickly
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(visdat) # for visualizing missing data
# Load the dataset
nashville <- read_csv("Nashville_housing_data_2013_2016.csv")
## New names:
## Rows: 56636 Columns: 31
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (17): Parcel ID, Land Use, Property Address, Suite/ Condo #, Property... dbl
## (13): ...1, Unnamed: 0, Sale Price, Acreage, Neighborhood, Land Value, ... date
## (1): Sale Date
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
# Preview structure of the dataset
glimpse(nashville)
## Rows: 56,636
## Columns: 31
## $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ `Unnamed: 0` <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## $ `Parcel ID` <chr> "105 03 0D 008.00", "105 11 0 080.…
## $ `Land Use` <chr> "RESIDENTIAL CONDO", "SINGLE FAMIL…
## $ `Property Address` <chr> "1208 3RD AVE S", "1802 STEWART …
## $ `Suite/ Condo #` <chr> "8", NA, NA, NA, NA, NA, NA, NA, N…
## $ `Property City` <chr> "NASHVILLE", "NASHVILLE", "NASHVIL…
## $ `Sale Date` <date> 2013-01-24, 2013-01-11, 2013-01-1…
## $ `Sale Price` <dbl> 132000, 191500, 202000, 32000, 102…
## $ `Legal Reference` <chr> "20130128-0008725", "20130118-0006…
## $ `Sold As Vacant` <chr> "No", "No", "No", "No", "No", "No"…
## $ `Multiple Parcels Involved in Sale` <chr> "No", "No", "No", "No", "No", "No"…
## $ `Owner Name` <chr> NA, "STINSON, LAURA M.", "NUNES, J…
## $ Address <chr> NA, "1802 STEWART PL", "2761 ROS…
## $ City <chr> NA, "NASHVILLE", "NASHVILLE", "NAS…
## $ State <chr> NA, "TN", "TN", "TN", "TN", "TN", …
## $ Acreage <dbl> NA, 0.17, 0.11, 0.17, 0.34, 0.17, …
## $ `Tax District` <chr> NA, "URBAN SERVICES DISTRICT", "CI…
## $ Neighborhood <dbl> NA, 3127, 9126, 3130, 3130, 3130, …
## $ image <chr> NA, "\\114000\\910001.JPG", "\\131…
## $ `Land Value` <dbl> NA, 32000, 34000, 25000, 25000, 25…
## $ `Building Value` <dbl> NA, 134400, 157800, 243700, 138100…
## $ `Total Value` <dbl> NA, 168300, 191800, 268700, 164800…
## $ `Finished Area` <dbl> NA, 1149.000, 2090.825, 2145.600, …
## $ `Foundation Type` <chr> NA, "PT BSMT", "SLAB", "FULL BSMT"…
## $ `Year Built` <dbl> NA, 1941, 2000, 1948, 1910, 1945, …
## $ `Exterior Wall` <chr> NA, "BRICK", "BRICK/FRAME", "BRICK…
## $ Grade <chr> NA, "C", "C", "B", "C", "C", NA, "…
## $ Bedrooms <dbl> NA, 2, 3, 4, 2, 2, NA, 2, 2, 2, 2,…
## $ `Full Bath` <dbl> NA, 1, 2, 2, 1, 1, NA, 1, 1, 1, 1,…
## $ `Half Bath` <dbl> NA, 0, 1, 0, 0, 0, NA, 0, 0, 0, 0,…
# Summary statistics for numerical columns
summary(nashville)
## ...1 Unnamed: 0 Parcel ID Land Use
## Min. : 0 Min. : 0 Length:56636 Length:56636
## 1st Qu.:14159 1st Qu.:14159 Class :character Class :character
## Median :28318 Median :28318 Mode :character Mode :character
## Mean :28318 Mean :28318
## 3rd Qu.:42476 3rd Qu.:42476
## Max. :56635 Max. :56635
##
## Property Address Suite/ Condo # Property City Sale Date
## Length:56636 Length:56636 Length:56636 Min. :2013-01-02
## Class :character Class :character Class :character 1st Qu.:2014-04-11
## Mode :character Mode :character Mode :character Median :2015-03-24
## Mean :2015-02-07
## 3rd Qu.:2015-12-30
## Max. :2016-10-31
##
## Sale Price Legal Reference Sold As Vacant
## Min. : 50 Length:56636 Length:56636
## 1st Qu.: 135000 Class :character Class :character
## Median : 205450 Mode :character Mode :character
## Mean : 327211
## 3rd Qu.: 329000
## Max. :54278060
##
## Multiple Parcels Involved in Sale Owner Name Address
## Length:56636 Length:56636 Length:56636
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## City State Acreage Tax District
## Length:56636 Length:56636 Min. : 0.010 Length:56636
## Class :character Class :character 1st Qu.: 0.180 Class :character
## Mode :character Mode :character Median : 0.270 Mode :character
## Mean : 0.499
## 3rd Qu.: 0.450
## Max. :160.060
## NA's :30619
## Neighborhood image Land Value Building Value
## Min. : 107 Length:56636 Min. : 100 Min. : 0
## 1st Qu.:3126 Class :character 1st Qu.: 21000 1st Qu.: 75900
## Median :3929 Mode :character Median : 28800 Median : 111400
## Mean :4356 Mean : 69073 Mean : 160802
## 3rd Qu.:6228 3rd Qu.: 60000 3rd Qu.: 180700
## Max. :9530 Max. :2772000 Max. :12971800
## NA's :30619 NA's :30619 NA's :30619
## Total Value Finished Area Foundation Type Year Built
## Min. : 100 Min. : 0 Length:56636 Min. :1799
## 1st Qu.: 102800 1st Qu.: 1239 Class :character 1st Qu.:1948
## Median : 148500 Median : 1632 Mode :character Median :1960
## Mean : 232397 Mean : 1927 Mean :1964
## 3rd Qu.: 268500 3rd Qu.: 2212 3rd Qu.:1983
## Max. :13940400 Max. :197988 Max. :2017
## NA's :30619 NA's :32470 NA's :32471
## Exterior Wall Grade Bedrooms Full Bath
## Length:56636 Length:56636 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 3.00 1st Qu.: 1.00
## Mode :character Mode :character Median : 3.00 Median : 2.00
## Mean : 3.09 Mean : 1.89
## 3rd Qu.: 3.00 3rd Qu.: 2.00
## Max. :11.00 Max. :10.00
## NA's :32477 NA's :32359
## Half Bath
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.28
## 3rd Qu.:1.00
## Max. :3.00
## NA's :32490
This section focuses on standardizing column names to ensure they follow consistent naming conventions. I’ll use the janitor package to clean the column names into a clear, readable format, making the dataset easier to work with in subsequent steps.
# Clean column names using janitor's clean_names() function
nashville <- janitor::clean_names(nashville)
# Preview the cleaned column names
colnames(nashville)
## [1] "x1" "unnamed_0"
## [3] "parcel_id" "land_use"
## [5] "property_address" "suite_condo_number"
## [7] "property_city" "sale_date"
## [9] "sale_price" "legal_reference"
## [11] "sold_as_vacant" "multiple_parcels_involved_in_sale"
## [13] "owner_name" "address"
## [15] "city" "state"
## [17] "acreage" "tax_district"
## [19] "neighborhood" "image"
## [21] "land_value" "building_value"
## [23] "total_value" "finished_area"
## [25] "foundation_type" "year_built"
## [27] "exterior_wall" "grade"
## [29] "bedrooms" "full_bath"
## [31] "half_bath"
In this section, I’ll identify any missing values in the dataset. Handling missing data is essential to ensuring the integrity and completeness of the analysis. Depending on the situation, I may choose to impute, drop, or flag rows/columns with missing values.
# Check for missing values in the dataset
colSums(is.na(nashville))
## x1 unnamed_0
## 0 0
## parcel_id land_use
## 0 0
## property_address suite_condo_number
## 159 50527
## property_city sale_date
## 159 0
## sale_price legal_reference
## 0 0
## sold_as_vacant multiple_parcels_involved_in_sale
## 0 0
## owner_name address
## 31375 30619
## city state
## 30619 30619
## acreage tax_district
## 30619 30619
## neighborhood image
## 30619 31301
## land_value building_value
## 30619 30619
## total_value finished_area
## 30619 32470
## foundation_type year_built
## 32472 32471
## exterior_wall grade
## 32471 32471
## bedrooms full_bath
## 32477 32359
## half_bath
## 32490
# Take a random sample of 500 rows to visualize
library(dplyr)
nashville_sample <- slice_sample(nashville, n = 500)
# Now visualize missing data
vis_miss(nashville_sample)
# Decide how to handle missing data
# For example, I may drop rows where essential variables like SalePrice are missing
nashville_cleaned <- nashville %>%
filter(!is.na(sale_price)) # Removing rows where sale_price is NA
# Summary of dataset to identify columns with missing data
summary(nashville_cleaned)
## x1 unnamed_0 parcel_id land_use
## Min. : 0 Min. : 0 Length:56636 Length:56636
## 1st Qu.:14159 1st Qu.:14159 Class :character Class :character
## Median :28318 Median :28318 Mode :character Mode :character
## Mean :28318 Mean :28318
## 3rd Qu.:42476 3rd Qu.:42476
## Max. :56635 Max. :56635
##
## property_address suite_condo_number property_city sale_date
## Length:56636 Length:56636 Length:56636 Min. :2013-01-02
## Class :character Class :character Class :character 1st Qu.:2014-04-11
## Mode :character Mode :character Mode :character Median :2015-03-24
## Mean :2015-02-07
## 3rd Qu.:2015-12-30
## Max. :2016-10-31
##
## sale_price legal_reference sold_as_vacant
## Min. : 50 Length:56636 Length:56636
## 1st Qu.: 135000 Class :character Class :character
## Median : 205450 Mode :character Mode :character
## Mean : 327211
## 3rd Qu.: 329000
## Max. :54278060
##
## multiple_parcels_involved_in_sale owner_name address
## Length:56636 Length:56636 Length:56636
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## city state acreage tax_district
## Length:56636 Length:56636 Min. : 0.010 Length:56636
## Class :character Class :character 1st Qu.: 0.180 Class :character
## Mode :character Mode :character Median : 0.270 Mode :character
## Mean : 0.499
## 3rd Qu.: 0.450
## Max. :160.060
## NA's :30619
## neighborhood image land_value building_value
## Min. : 107 Length:56636 Min. : 100 Min. : 0
## 1st Qu.:3126 Class :character 1st Qu.: 21000 1st Qu.: 75900
## Median :3929 Mode :character Median : 28800 Median : 111400
## Mean :4356 Mean : 69073 Mean : 160802
## 3rd Qu.:6228 3rd Qu.: 60000 3rd Qu.: 180700
## Max. :9530 Max. :2772000 Max. :12971800
## NA's :30619 NA's :30619 NA's :30619
## total_value finished_area foundation_type year_built
## Min. : 100 Min. : 0 Length:56636 Min. :1799
## 1st Qu.: 102800 1st Qu.: 1239 Class :character 1st Qu.:1948
## Median : 148500 Median : 1632 Mode :character Median :1960
## Mean : 232397 Mean : 1927 Mean :1964
## 3rd Qu.: 268500 3rd Qu.: 2212 3rd Qu.:1983
## Max. :13940400 Max. :197988 Max. :2017
## NA's :30619 NA's :32470 NA's :32471
## exterior_wall grade bedrooms full_bath
## Length:56636 Length:56636 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 3.00 1st Qu.: 1.00
## Mode :character Mode :character Median : 3.00 Median : 2.00
## Mean : 3.09 Mean : 1.89
## 3rd Qu.: 3.00 3rd Qu.: 2.00
## Max. :11.00 Max. :10.00
## NA's :32477 NA's :32359
## half_bath
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.28
## 3rd Qu.:1.00
## Max. :3.00
## NA's :32490
# Handling missing data by replacing NAs with appropriate values
nashville_cleaned <- nashville_cleaned %>%
mutate(
# Replace missing sale_price with the median value
sale_price = ifelse(is.na(sale_price), median(sale_price, na.rm = TRUE), sale_price),
# Replace missing owner_name with "Unknown"
owner_name = ifelse(is.na(owner_name), "Unknown", owner_name)
)
# Visualize missing data after handling
# Downsample to 10,000 rows for visualization
nashville_sampled <- nashville_cleaned %>%
slice_sample(n = 10000)
# Visualize missing data in the downsampled dataset
vis_miss(nashville_sampled)
## Step 4: Confirm Date Format
The sale_date column is already properly formatted as a Date object, which is ideal for time-based analysis. This step confirms that no additional transformation is needed.
``` r
# Confirm the format of the sale_date column
head(nashville$sale_date)
## [1] "2013-01-24" "2013-01-11" "2013-01-18" "2013-01-18" "2013-01-23"
## [6] "2013-01-04"
class(nashville$sale_date)
## [1] "Date"
# Assign the already-clean date column to the cleaned dataset
nashville_cleaned <- nashville_cleaned %>%
mutate(sale_date = nashville$sale_date)
In this step, I clean and standardize inconsistent text fields such as property types or ownership names. This prevents grouping issues during analysis and ensures accuracy in aggregations or filters.
# Standardize text fields with appropriate cleaning
nashville_cleaned <- nashville_cleaned %>%
mutate(
# Standardizing 'land_use' field: Capitalize first letter of each word
land_use = str_to_title(str_trim(land_use)),
# Standardizing 'owner_name': Proper capitalization
owner_name = str_to_title(str_trim(owner_name)),
# Removing extra spaces from property address and standardizing case
property_address = str_to_title(str_trim(property_address)),
property_city = str_to_title(str_trim(property_city)),
# Capitalizing tax district and foundation type
tax_district = str_to_title(str_trim(tax_district)),
foundation_type = str_to_title(str_trim(foundation_type))
)
Outliers are values that deviate significantly from other observations. Detecting and handling outliers is important for ensuring the accuracy of your analysis.
# Boxplot to identify potential outliers in sale_price and acreage
library(ggplot2)
# Remove rows with NA or infinite values for the relevant columns
nashville_cleaned <- nashville_cleaned %>%
filter(!is.na(sale_price) & !is.na(acreage)) %>%
filter(is.finite(sale_price) & is.finite(acreage))
# Plot the boxplot for sale_price
ggplot(nashville_cleaned, aes(x = sale_price)) +
geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
ggtitle("Boxplot for Sale Price") +
xlab("Sale Price") +
ylab("Frequency") +
theme_minimal(base_size = 15) + # Clean minimal theme
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
# Plot the boxplot for acreage
ggplot(nashville_cleaned, aes(x = acreage)) +
geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
ggtitle("Boxplot for Acreage") +
xlab("Acreage") +
ylab("Frequency") +
theme_minimal(base_size = 15) + # Clean minimal theme
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
# Removing outliers
nashville_cleaned <- nashville_cleaned %>%
filter(sale_price < 10000000) # Remove properties above $10 million
# Outlier detection using boxplot for sale_price
ggplot(nashville_cleaned, aes(x = sale_price)) +
geom_boxplot(fill = "#69b3a2", color = "#404040", outlier.colour = "#D55E00", outlier.size = 2) +
ggtitle("Boxplot for Sale Price") +
xlab("Sale Price") +
ylab("Frequency") +
theme_minimal(base_size = 15) + # Clean minimal theme
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12)
)
In this project, I demonstrated key data cleaning techniques that are critical in preparing a dataset for analysis. Through the following steps, I ensured that the Nashville Housing dataset was well-prepared for further analysis:
Cleaning column names for consistency and readability using the janitor package.
Handling missing data by identifying and replacing NAs, ensuring that the dataset was complete and could be used effectively in further analysis.
Standardizing text fields such as addresses and owner names to prevent inconsistencies and errors during analysis.
Outlier detection and removal to ensure that extreme values did not distort the dataset and influence results.
By performing these steps, I not only cleaned and organized the data but also prepared it for more detailed analyses, such as exploring trends in housing prices, property features, and regional patterns.
The next steps in this analysis would involve further exploration of the cleaned data, including generating descriptive statistics, visualizing key trends, and possibly building predictive models to forecast housing prices. Additionally, performing geographic analyses could provide insights into how location affects property values.
This data cleaning process is crucial as it lays the foundation for reliable, actionable insights. The cleaned dataset is now ready for any further analysis or modeling required to gain deeper insights into the Nashville housing market.
# Export the cleaned dataset to a CSV file
write_csv(nashville_cleaned, "nashville_cleaned.csv")