Missing value handling

Detecting Missing Values

To perform a more comprehensive analysis and visualization for detecting missing values, we’ll use additional R functions and techniques. Below are the steps with corresponding R code and explanations.

Step 1: Load the Data

We’ll start by loading the dataset into R.

R Code:

# Load necessary libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(naniar)

## Warning: package 'naniar' was built under R version 4.3.3

library(visdat)

## Warning: package 'visdat' was built under R version 4.3.3

# Load the dataset
airbnb_data <- read.csv("Data.csv")

# Preview the dataset
head(airbnb_data)

##        id                                             NAME     host.id
## 1 1001254               Clean & quiet apt home by the park 80014485718
## 2 1002102                            Skylit Midtown Castle 52335172823
## 3 1002403              THE VILLAGE OF HARLEM....NEW YORK ! 78829239556
## 4 1002755                                                  85098326012
## 5 1003689 Entire Apt: Spacious Studio/Loft by central park 92037596077
## 6 1004098        Large Cozy 1 BR Apartment In Midtown East 45498551794
##   host_identity_verified host.name neighbourhood.group neighbourhood      lat
## 1            unconfirmed  Madaline            Brooklyn    Kensington 40.64749
## 2               verified     Jenna           Manhattan       Midtown 40.75362
## 3                            Elise           Manhattan        Harlem 40.80902
## 4            unconfirmed     Garry            Brooklyn  Clinton Hill 40.68514
## 5               verified    Lyndon           Manhattan   East Harlem 40.79851
## 6               verified  Michelle           Manhattan   Murray Hill 40.74767
##        long       country country.code instant_bookable cancellation_policy
## 1 -73.97237 United States           US            FALSE              strict
## 2 -73.98377 United States           US            FALSE            moderate
## 3 -73.94190 United States           US             TRUE            flexible
## 4 -73.95976 United States           US             TRUE            moderate
## 5 -73.94399 United States           US            FALSE            moderate
## 6 -73.97500 United States           US             TRUE            flexible
##         room.type Construction.year price service.fee minimum.nights
## 1    Private room              2020 $966        $193              10
## 2 Entire home/apt              2007 $142         $28              30
## 3    Private room              2005 $620        $124               3
## 4 Entire home/apt              2005 $368         $74              30
## 5 Entire home/apt              2009 $204         $41              10
## 6 Entire home/apt              2013 $577        $115               3
##   number.of.reviews last.review reviews.per.month review.rate.number
## 1                 9  10/19/2021              0.21                  4
## 2                45   5/21/2022              0.38                  4
## 3                 0                            NA                  5
## 4               270    7/5/2019              4.64                  4
## 5                 9  11/19/2018              0.10                  3
## 6                74   6/22/2019              0.59                  3
##   calculated.host.listings.count availability.365
## 1                              6              286
## 2                              2              228
## 3                              1              352
## 4                              1              322
## 5                              1              289
## 6                              1              374
##                                                                                                                                                                                                                                                                                                                                                                                                           house_rules
## 1                                                                                                                                                                                                                                                                                                                                Clean up and treat the home the way you'd like your home to be treated.  No smoking.
## 2 Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. I have a cute and quiet mixed chihuahua. I could accept more guests (for an extra fee) but this also needs to be confirmed beforehand. Also friends traveling together could sleep in separate beds for an extra fee (the second bed is either a sofa bed or inflatable bed). Smoking is only allowed on the porch.
## 3                              I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge to use the washer/dryer in the basement.  No smoking, inside or outside. Come home as late as you want.  If you come home stumbling drunk, it's OK the first time. If you do it again, and you wake up me or the neighbors downstairs, we will be annoyed.  (Just so you know . . . )
## 4                                                                                                                                                                                                                                                                                                                                                                                                                    
## 5                                                                                                                                                                                                                                                    Please no smoking in the house, porch or on the property (you can go to the nearby corner).  Reasonable quiet after 10:30 pm.  Please remove shoes in the house.
## 6                                                                                                                                                                                                                                                                                                                                                                                   No smoking, please, and no drugs.
##   license
## 1        
## 2        
## 3        
## 4        
## 5        
## 6

Step 2: Detect Missing Values using numerical method

To detect missing values, we’ll use functions from naniar and base R.

R Code:

# Summary of missing values in the dataset
summary(airbnb_data)

##        id               NAME              host.id         
##  Min.   : 1001254   Length:102599      Min.   :1.236e+08  
##  1st Qu.:15085814   Class :character   1st Qu.:2.458e+10  
##  Median :29136603   Mode  :character   Median :4.912e+10  
##  Mean   :29146235                      Mean   :4.925e+10  
##  3rd Qu.:43201198                      3rd Qu.:7.400e+10  
##  Max.   :57367417                      Max.   :9.876e+10  
##                                                           
##  host_identity_verified  host.name         neighbourhood.group
##  Length:102599          Length:102599      Length:102599      
##  Class :character       Class :character   Class :character   
##  Mode  :character       Mode  :character   Mode  :character   
##                                                               
##                                                               
##                                                               
##                                                               
##  neighbourhood           lat             long          country         
##  Length:102599      Min.   :40.50   Min.   :-74.25   Length:102599     
##  Class :character   1st Qu.:40.69   1st Qu.:-73.98   Class :character  
##  Mode  :character   Median :40.72   Median :-73.95   Mode  :character  
##                     Mean   :40.73   Mean   :-73.95                     
##                     3rd Qu.:40.76   3rd Qu.:-73.93                     
##                     Max.   :40.92   Max.   :-73.71                     
##                     NA's   :8       NA's   :8                          
##  country.code       instant_bookable cancellation_policy  room.type        
##  Length:102599      Mode :logical    Length:102599       Length:102599     
##  Class :character   FALSE:51474      Class :character    Class :character  
##  Mode  :character   TRUE :51020      Mode  :character    Mode  :character  
##                     NA's :105                                              
##                                                                            
##                                                                            
##                                                                            
##  Construction.year    price           service.fee        minimum.nights     
##  Min.   :2003      Length:102599      Length:102599      Min.   :-1223.000  
##  1st Qu.:2007      Class :character   Class :character   1st Qu.:    2.000  
##  Median :2012      Mode  :character   Mode  :character   Median :    3.000  
##  Mean   :2012                                            Mean   :    8.136  
##  3rd Qu.:2017                                            3rd Qu.:    5.000  
##  Max.   :2022                                            Max.   : 5645.000  
##  NA's   :214                                             NA's   :409        
##  number.of.reviews last.review        reviews.per.month review.rate.number
##  Min.   :   0.00   Length:102599      Min.   : 0.010    Min.   :1.000     
##  1st Qu.:   1.00   Class :character   1st Qu.: 0.220    1st Qu.:2.000     
##  Median :   7.00   Mode  :character   Median : 0.740    Median :3.000     
##  Mean   :  27.48                      Mean   : 1.374    Mean   :3.279     
##  3rd Qu.:  30.00                      3rd Qu.: 2.000    3rd Qu.:4.000     
##  Max.   :1024.00                      Max.   :90.000    Max.   :5.000     
##  NA's   :183                          NA's   :15879     NA's   :326       
##  calculated.host.listings.count availability.365 house_rules       
##  Min.   :  1.000                Min.   : -10.0   Length:102599     
##  1st Qu.:  1.000                1st Qu.:   3.0   Class :character  
##  Median :  1.000                Median :  96.0   Mode  :character  
##  Mean   :  7.937                Mean   : 141.1                     
##  3rd Qu.:  2.000                3rd Qu.: 269.0                     
##  Max.   :332.000                Max.   :3677.0                     
##  NA's   :319                    NA's   :448                        
##    license         
##  Length:102599     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Explanation:

summary() provides a quick overview of each column, including counts of missing values

The summary of the Airbnb dataset reveals various aspects of the data. The dataset consists of 102,599 observations across multiple columns. The id column, representing unique listings, ranges from 1,001,254 to 57,367,417. The host.id also shows a wide range, indicating diverse hosts, with a mean value of approximately 49.25 billion. Columns like host_identity_verified and host.name are character types, with no immediate indication of missing values in host_identity_verified, but missing entries are present in host.name.

Geographical information includes lat and long, both of which have a small number of missing values (8 each), and the dataset covers locations from latitude 40.50 to 40.92 and longitude -74.25 to -73.71. neighbourhood.group and neighbourhood are character fields, with neighbourhood containing 8 missing values.

Columns related to booking, such as instant_bookable, show a binary distribution (TRUE/FALSE), with a few missing entries (105). cancellation_policy and room.type are complete. The Construction.year field ranges from 2003 to 2022, with 214 missing values.

The price and service.fee columns are character types, suggesting a need for transformation into numeric values for analysis. minimum.nights shows a wide range, with a minimum of -1,223 (likely an error) to a maximum of 5,645, with 409 missing values. number.of.reviews ranges from 0 to 1,024, with 183 missing entries.

The last.review date field is missing for a significant number of rows (15,879 missing). The reviews.per.month column, ranging from 0.01 to 90, also has 15,879 missing entries, indicating that many listings either lack reviews or their review data is incomplete. The review.rate.number has missing values in 326 entries, while the calculated.host.listings.count field shows a maximum of 332 listings for a single host, with 319 missing values. The availability.365 field, indicating the number of available days per year, contains outliers and negative values, with 448 missing entries. Lastly, the license field is also missing for many entries.

This summary highlights the presence of missing and inconsistent data across various columns, necessitating careful cleaning and imputation for meaningful analysis.

Srep 02 : visualize missing value in dataset

vis_miss() from the naniar package offers a visual representation of where missing values are in the dataset.

R code:

# Visualizing missing data
vis_miss(airbnb_data, warn_large_data = FALSE) +
  labs(title = "Missing Data Pattern in Full Data")

The vis_miss function from the naniar package is used to visually represent missing data within a dataset. This method helps in quickly identifying which columns have missing values and the extent of those missing values across the entire dataset. By setting warn_large_data = FALSE, the function is allowed to visualize large datasets without issuing warnings about their size.

Interpretation of the Output:

Title: “Missing Data Pattern in Full Data” clearly indicates the focus on the dataset’s missing data.

X-axis: Represents the variables (columns) in the dataset. Each variable’s name is listed with a percentage indicating the proportion of missing data.

Y-axis: Represents the total number of observations (rows) in the dataset.

Bars: The gray bars represent the presence of data (99.3%), while black lines indicate the absence (0.7%).

Key Insights:

Most columns in the dataset have no missing data (indicated by continuous gray bars).

A few columns, such as reviews.per.month, license, and last.review, show notable missing values (black lines). Specifically: 1. reviews.per.month and last.review have around 15% missing values. 2. license is largely missing, indicating potential data quality issues or optional data entry for that field.

This visualization helps in quickly spotting which variables require attention for handling missing values through imputation, exclusion, or other strategies.

step 03: Analyze Missing Data Distribution

we can use these method for analyze missing datadistribution.

gg_miss_var(): Generates a bar plot showing the number of missing values per variable.

R code:

# Load necessary libraries
library(tidyverse)
library(naniar)

# Visualize the distribution of missing values
gg_miss_var(airbnb_data,) +
  labs(title = "Missing Values by Variable",
       x = "Variables",
       y = "Count of Missing Values")

The gg_miss_var function from the naniar package provides a clear visualization of the missing data distribution across all variables in the dataset. This method allows for an easy comparison of the extent of missing values per variable, aiding in prioritizing which variables require cleaning.

Interpretation of the Output:

Title: “Missing Values by Variable” clearly indicates that the plot focuses on the distribution of missing data.

X-axis: Represents the count of missing values for each variable.

Y-axis: Lists all the variables (columns) from the dataset.

Plot Description: Each dot on the plot corresponds to a variable. The position along the X-axis shows the number of missing values in that variable.

Key Insights:

The reviews.per.month variable has the highest count of missing values, exceeding 15,000 entries.Other variables such as availability.365, minimum.nights, and review.rate.number also exhibit a noticeable number of missing entries but are significantly less compared to reviews.per.month.Variables like host_identity_verified, host_name, and host.id appear to have little to no missing data, as they are not visually represented with dots extending far on the X-axis.

Step 04: Identify Patterns of Missingness

vis_miss() with cluster = TRUE: Helps identify patterns in missing data by clustering similar patterns.

gg_miss_upset(): Visualizes combinations of missingness across different variables using an UpSet plot.

R Code:

# Examine pairwise missingness
# Load necessary libraries
library(naniar)
library(ggplot2)

# UpSet plot for missing data combinations
gg_miss_upset(airbnb_data, nsets = 10)

The gg_miss_upset function from the naniar package is used to visualize combinations of missing values across multiple variables in a dataset. This method is particularly useful for understanding how missingness in different variables co-occurs, which can help identify patterns or dependencies in missing data.

Interpretation of the Output:

Set Size (Bottom Axis): This represents the total number of missing values for each variable, shown as bars. For example, reviews.per.month has the highest number of missing values.

Intersection Size (Left Axis): This shows the count of observations that have a specific combination of missing values across multiple variables. The largest bar at the top indicates the most common missing pattern, which involves only one variable being missing (reviews.per.month).

Dot Matrix (Middle): Each row of dots corresponds to a combination of variables with missing values. The connected dots illustrate which variables are missing together.

Summary:

The plot reveals that reviews.per.month is often missing alone, which makes it a significant issue to address.

There are other smaller intersections where multiple variables are missing together, such as availability_365 and minimum_nights. Key Insights:

Dominant Missing Pattern: The most frequent missing pattern is singular, with a single variable (reviews.per.month) missing in many cases.

Co-occurrence: While most of the missing data occurs in isolation, there are cases where multiple variables are missing simultaneously, suggesting a potential dependency or a common cause for missingness.

Handling Missing Values

We’ll address missing values using different strategies depending on the nature of the data.

Let’s break down the methods used in the provided code to handle missing values in the airbnb_data dataset:

1. Removing Rows with Missing Values

R code:

cleaned_data <- airbnb_data %>% drop_na()
summary(cleaned_data)

##        id               NAME              host.id         
##  Min.   : 1001254   Length:85444       Min.   :1.236e+08  
##  1st Qu.:15465935   Class :character   1st Qu.:2.447e+10  
##  Median :30792674   Mode  :character   Median :4.904e+10  
##  Mean   :29558170                      Mean   :4.920e+10  
##  3rd Qu.:43237512                      3rd Qu.:7.388e+10  
##  Max.   :57367417                      Max.   :9.876e+10  
##  host_identity_verified  host.name         neighbourhood.group
##  Length:85444           Length:85444       Length:85444       
##  Class :character       Class :character   Class :character   
##  Mode  :character       Mode  :character   Mode  :character   
##                                                               
##                                                               
##                                                               
##  neighbourhood           lat             long          country         
##  Length:85444       Min.   :40.50   Min.   :-74.25   Length:85444      
##  Class :character   1st Qu.:40.69   1st Qu.:-73.98   Class :character  
##  Mode  :character   Median :40.72   Median :-73.95   Mode  :character  
##                     Mean   :40.73   Mean   :-73.95                     
##                     3rd Qu.:40.76   3rd Qu.:-73.93                     
##                     Max.   :40.92   Max.   :-73.71                     
##  country.code       instant_bookable cancellation_policy  room.type        
##  Length:85444       Mode :logical    Length:85444        Length:85444      
##  Class :character   FALSE:42944      Class :character    Class :character  
##  Mode  :character   TRUE :42500      Mode  :character    Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##  Construction.year    price           service.fee        minimum.nights    
##  Min.   :2003      Length:85444       Length:85444       Min.   :-365.000  
##  1st Qu.:2007      Class :character   Class :character   1st Qu.:   2.000  
##  Median :2012      Mode  :character   Mode  :character   Median :   3.000  
##  Mean   :2012                                            Mean   :   7.469  
##  3rd Qu.:2017                                            3rd Qu.:   5.000  
##  Max.   :2022                                            Max.   :5645.000  
##  number.of.reviews last.review        reviews.per.month review.rate.number
##  Min.   :   1.00   Length:85444       Min.   : 0.010    Min.   :1.00      
##  1st Qu.:   3.00   Class :character   1st Qu.: 0.220    1st Qu.:2.00      
##  Median :  11.00   Mode  :character   Median : 0.740    Median :3.00      
##  Mean   :  32.17                      Mean   : 1.373    Mean   :3.28      
##  3rd Qu.:  38.00                      3rd Qu.: 2.000    3rd Qu.:4.00      
##  Max.   :1024.00                      Max.   :90.000    Max.   :5.00      
##  calculated.host.listings.count availability.365 house_rules       
##  Min.   :  1.000                Min.   : -10.0   Length:85444      
##  1st Qu.:  1.000                1st Qu.:   6.0   Class :character  
##  Median :  1.000                Median : 100.0   Mode  :character  
##  Mean   :  7.044                Mean   : 141.7                     
##  3rd Qu.:  2.000                3rd Qu.: 266.0                     
##  Max.   :332.000                Max.   :3677.0                     
##    license         
##  Length:85444      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

All columns now have complete data after removing rows with missing values. This method is a straightforward way to handle missing data by simply excluding incomplete records.

2. Imputing Missing Values

This approach replaces missing values with meaningful substitutes to retain all data entries while addressing the missing data problem.

a. Numerical Columns - Imputation with the Mean

R code:

airbnb_data_mean_mutated <- airbnb_data %>%
  mutate_if(is.numeric, ~ifelse(is.na(.), mean(., na.rm = TRUE), .))

# Count of missing values per column
null_counts <- colSums(is.na(airbnb_data_mean_mutated))
null_counts

##                             id                           NAME 
##                              0                              0 
##                        host.id         host_identity_verified 
##                              0                              0 
##                      host.name            neighbourhood.group 
##                              0                              0 
##                  neighbourhood                            lat 
##                              0                              0 
##                           long                        country 
##                              0                              0 
##                   country.code               instant_bookable 
##                              0                            105 
##            cancellation_policy                      room.type 
##                              0                              0 
##              Construction.year                          price 
##                              0                              0 
##                    service.fee                 minimum.nights 
##                              0                              0 
##              number.of.reviews                    last.review 
##                              0                              0 
##              reviews.per.month             review.rate.number 
##                              0                              0 
## calculated.host.listings.count               availability.365 
##                              0                              0 
##                    house_rules                        license 
##                              0                              0

The provided R code snippet appears to have two main steps:

Data Imputation:

The mutate_if() function is used to fill missing values in numeric columns with their respective column means. This ensures that no numeric column in the dataset has missing values after this step.

Counting Missing Values:

The colSums(is.na()) function calculates the number of missing values in each column of the airbnb_data_mean_mutated dataset. The resulting null_counts variable displays the number of missing values per column.

Key Output Insights:

Most columns now have 0 missing values, indicating that any previous missing numeric data has been replaced with the mean of that column. The instant_bookable column still has 105 missing values, likely because it’s a logical (binary) column, not numeric, and thus wasn’t affected by the mean imputation. This approach effectively handles missing numeric values but leaves non-numeric columns with missing data to be addressed separately.

b. Categorical Columns - Imputation with the Mode

For each categorical (character) column, missing values are replaced with the mode (the most frequently occurring value) of that column.

R code:

fill_mode <- function(x) {
  ux <- unique(na.omit(x))
  ux[which.max(tabulate(match(x, ux)))]
}
airbnb_data_c_mutated <- airbnb_data %>%
  mutate_if(is.character, ~ifelse(is.na(.), fill_mode(.), .))

# Count of missing values per column
null_counts <- colSums(is.na(airbnb_data_c_mutated))
null_counts

##                             id                           NAME 
##                              0                              0 
##                        host.id         host_identity_verified 
##                              0                              0 
##                      host.name            neighbourhood.group 
##                              0                              0 
##                  neighbourhood                            lat 
##                              0                              8 
##                           long                        country 
##                              8                              0 
##                   country.code               instant_bookable 
##                              0                            105 
##            cancellation_policy                      room.type 
##                              0                              0 
##              Construction.year                          price 
##                            214                              0 
##                    service.fee                 minimum.nights 
##                              0                            409 
##              number.of.reviews                    last.review 
##                            183                              0 
##              reviews.per.month             review.rate.number 
##                          15879                            326 
## calculated.host.listings.count               availability.365 
##                            319                            448 
##                    house_rules                        license 
##                              0                              0

Function for Mode Imputation (fill_mode):

fill_mode <- function(x) { ux <- unique(na.omit(x)) ux[which.max(tabulate(match(x, ux)))] }

Purpose: This function calculates the mode of a categorical variable.

Steps: 1. unique(na.omit(x)): Removes NA values and finds the unique elements in the column. 2. tabulate(match(x, ux)): Counts the frequency of each unique value. 3. which.max(): Identifies the value with the maximum frequency (i.e., the mode).

Use: This function is used to fill missing values in categorical columns with their mode.

Imputing Missing Values for Categorical Data:

airbnb_data_c_mutated <- airbnb_data %>% mutate_if(is.character, ~ifelse(is.na(.), fill_mode(.), .))

Purpose: This line applies the fill_mode function to impute missing values for all categorical (character) columns in the dataset.

mutate_if(is.character, …): Selects all character columns and applies the imputation where values are NA.

The mode imputation successfully fills missing values in categorical variables, as evidenced by the 0 count for most categorical columns.

Missing value handling

SC/2020/11798

2024-11-04

Detecting Missing Values

Step 1: Load the Data

Step 2: Detect Missing Values using numerical method

Explanation:

Srep 02 : visualize missing value in dataset

Interpretation of the Output:

Key Insights:

step 03: Analyze Missing Data Distribution

Interpretation of the Output:

Key Insights:

Step 04: Identify Patterns of Missingness

Interpretation of the Output:

Summary:

Handling Missing Values

1. Removing Rows with Missing Values

2. Imputing Missing Values

a. Numerical Columns - Imputation with the Mean

Data Imputation:

Counting Missing Values:

Key Output Insights:

b. Categorical Columns - Imputation with the Mode

Function for Mode Imputation (fill_mode):

Imputing Missing Values for Categorical Data: