To perform a more comprehensive analysis and visualization for detecting missing values, we’ll use additional R functions and techniques. Below are the steps with corresponding R code and explanations.
We’ll start by loading the dataset into R.
R Code:
# Load necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(naniar)
## Warning: package 'naniar' was built under R version 4.3.3
library(visdat)
## Warning: package 'visdat' was built under R version 4.3.3
# Load the dataset
airbnb_data <- read.csv("Data.csv")
# Preview the dataset
head(airbnb_data)
## id NAME host.id
## 1 1001254 Clean & quiet apt home by the park 80014485718
## 2 1002102 Skylit Midtown Castle 52335172823
## 3 1002403 THE VILLAGE OF HARLEM....NEW YORK ! 78829239556
## 4 1002755 85098326012
## 5 1003689 Entire Apt: Spacious Studio/Loft by central park 92037596077
## 6 1004098 Large Cozy 1 BR Apartment In Midtown East 45498551794
## host_identity_verified host.name neighbourhood.group neighbourhood lat
## 1 unconfirmed Madaline Brooklyn Kensington 40.64749
## 2 verified Jenna Manhattan Midtown 40.75362
## 3 Elise Manhattan Harlem 40.80902
## 4 unconfirmed Garry Brooklyn Clinton Hill 40.68514
## 5 verified Lyndon Manhattan East Harlem 40.79851
## 6 verified Michelle Manhattan Murray Hill 40.74767
## long country country.code instant_bookable cancellation_policy
## 1 -73.97237 United States US FALSE strict
## 2 -73.98377 United States US FALSE moderate
## 3 -73.94190 United States US TRUE flexible
## 4 -73.95976 United States US TRUE moderate
## 5 -73.94399 United States US FALSE moderate
## 6 -73.97500 United States US TRUE flexible
## room.type Construction.year price service.fee minimum.nights
## 1 Private room 2020 $966 $193 10
## 2 Entire home/apt 2007 $142 $28 30
## 3 Private room 2005 $620 $124 3
## 4 Entire home/apt 2005 $368 $74 30
## 5 Entire home/apt 2009 $204 $41 10
## 6 Entire home/apt 2013 $577 $115 3
## number.of.reviews last.review reviews.per.month review.rate.number
## 1 9 10/19/2021 0.21 4
## 2 45 5/21/2022 0.38 4
## 3 0 NA 5
## 4 270 7/5/2019 4.64 4
## 5 9 11/19/2018 0.10 3
## 6 74 6/22/2019 0.59 3
## calculated.host.listings.count availability.365
## 1 6 286
## 2 2 228
## 3 1 352
## 4 1 322
## 5 1 289
## 6 1 374
## house_rules
## 1 Clean up and treat the home the way you'd like your home to be treated. No smoking.
## 2 Pet friendly but please confirm with me if the pet you are planning on bringing with you is OK. I have a cute and quiet mixed chihuahua. I could accept more guests (for an extra fee) but this also needs to be confirmed beforehand. Also friends traveling together could sleep in separate beds for an extra fee (the second bed is either a sofa bed or inflatable bed). Smoking is only allowed on the porch.
## 3 I encourage you to use my kitchen, cooking and laundry facilities. There is no additional charge to use the washer/dryer in the basement. No smoking, inside or outside. Come home as late as you want. If you come home stumbling drunk, it's OK the first time. If you do it again, and you wake up me or the neighbors downstairs, we will be annoyed. (Just so you know . . . )
## 4
## 5 Please no smoking in the house, porch or on the property (you can go to the nearby corner). Reasonable quiet after 10:30 pm. Please remove shoes in the house.
## 6 No smoking, please, and no drugs.
## license
## 1
## 2
## 3
## 4
## 5
## 6
To detect missing values, we’ll use functions from naniar and base R.
R Code:
# Summary of missing values in the dataset
summary(airbnb_data)
## id NAME host.id
## Min. : 1001254 Length:102599 Min. :1.236e+08
## 1st Qu.:15085814 Class :character 1st Qu.:2.458e+10
## Median :29136603 Mode :character Median :4.912e+10
## Mean :29146235 Mean :4.925e+10
## 3rd Qu.:43201198 3rd Qu.:7.400e+10
## Max. :57367417 Max. :9.876e+10
##
## host_identity_verified host.name neighbourhood.group
## Length:102599 Length:102599 Length:102599
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## neighbourhood lat long country
## Length:102599 Min. :40.50 Min. :-74.25 Length:102599
## Class :character 1st Qu.:40.69 1st Qu.:-73.98 Class :character
## Mode :character Median :40.72 Median :-73.95 Mode :character
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.93
## Max. :40.92 Max. :-73.71
## NA's :8 NA's :8
## country.code instant_bookable cancellation_policy room.type
## Length:102599 Mode :logical Length:102599 Length:102599
## Class :character FALSE:51474 Class :character Class :character
## Mode :character TRUE :51020 Mode :character Mode :character
## NA's :105
##
##
##
## Construction.year price service.fee minimum.nights
## Min. :2003 Length:102599 Length:102599 Min. :-1223.000
## 1st Qu.:2007 Class :character Class :character 1st Qu.: 2.000
## Median :2012 Mode :character Mode :character Median : 3.000
## Mean :2012 Mean : 8.136
## 3rd Qu.:2017 3rd Qu.: 5.000
## Max. :2022 Max. : 5645.000
## NA's :214 NA's :409
## number.of.reviews last.review reviews.per.month review.rate.number
## Min. : 0.00 Length:102599 Min. : 0.010 Min. :1.000
## 1st Qu.: 1.00 Class :character 1st Qu.: 0.220 1st Qu.:2.000
## Median : 7.00 Mode :character Median : 0.740 Median :3.000
## Mean : 27.48 Mean : 1.374 Mean :3.279
## 3rd Qu.: 30.00 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :1024.00 Max. :90.000 Max. :5.000
## NA's :183 NA's :15879 NA's :326
## calculated.host.listings.count availability.365 house_rules
## Min. : 1.000 Min. : -10.0 Length:102599
## 1st Qu.: 1.000 1st Qu.: 3.0 Class :character
## Median : 1.000 Median : 96.0 Mode :character
## Mean : 7.937 Mean : 141.1
## 3rd Qu.: 2.000 3rd Qu.: 269.0
## Max. :332.000 Max. :3677.0
## NA's :319 NA's :448
## license
## Length:102599
## Class :character
## Mode :character
##
##
##
##
summary() provides a quick overview of each column, including counts of missing values
The summary of the Airbnb dataset reveals various aspects of the
data. The dataset consists of 102,599 observations across multiple
columns. The id column, representing unique listings,
ranges from 1,001,254 to 57,367,417. The host.id also shows
a wide range, indicating diverse hosts, with a mean value of
approximately 49.25 billion. Columns like
host_identity_verified and host.name are
character types, with no immediate indication of missing values in
host_identity_verified, but missing entries are present in
host.name.
Geographical information includes lat and
long, both of which have a small number of missing values
(8 each), and the dataset covers locations from latitude 40.50 to 40.92
and longitude -74.25 to -73.71. neighbourhood.group and
neighbourhood are character fields, with
neighbourhood containing 8 missing values.
Columns related to booking, such as instant_bookable,
show a binary distribution (TRUE/FALSE), with a few missing entries
(105). cancellation_policy and room.type are
complete. The Construction.year field ranges from 2003 to
2022, with 214 missing values.
The price and service.fee columns are
character types, suggesting a need for transformation into numeric
values for analysis. minimum.nights shows a wide range,
with a minimum of -1,223 (likely an error) to a maximum of 5,645, with
409 missing values. number.of.reviews ranges from 0 to
1,024, with 183 missing entries.
The last.review date field is missing for a significant
number of rows (15,879 missing). The reviews.per.month
column, ranging from 0.01 to 90, also has 15,879 missing entries,
indicating that many listings either lack reviews or their review data
is incomplete. The review.rate.number has missing values in
326 entries, while the calculated.host.listings.count field
shows a maximum of 332 listings for a single host, with 319 missing
values. The availability.365 field, indicating the number
of available days per year, contains outliers and negative values, with
448 missing entries. Lastly, the license field is also
missing for many entries.
This summary highlights the presence of missing and inconsistent data across various columns, necessitating careful cleaning and imputation for meaningful analysis.
vis_miss() from the naniar package offers a visual representation of where missing values are in the dataset.
R code:
# Visualizing missing data
vis_miss(airbnb_data, warn_large_data = FALSE) +
labs(title = "Missing Data Pattern in Full Data")
The vis_miss function from the naniar package is used to visually
represent missing data within a dataset. This method helps in quickly
identifying which columns have missing values and the extent of those
missing values across the entire dataset. By setting warn_large_data =
FALSE, the function is allowed to visualize large datasets without
issuing warnings about their size.
Title: “Missing Data Pattern in Full Data” clearly indicates the focus on the dataset’s missing data.
X-axis: Represents the variables (columns) in the dataset. Each variable’s name is listed with a percentage indicating the proportion of missing data.
Y-axis: Represents the total number of observations (rows) in the dataset.
Bars: The gray bars represent the presence of data (99.3%), while black lines indicate the absence (0.7%).
Most columns in the dataset have no missing data (indicated by continuous gray bars).
A few columns, such as reviews.per.month, license, and last.review, show notable missing values (black lines). Specifically: 1. reviews.per.month and last.review have around 15% missing values. 2. license is largely missing, indicating potential data quality issues or optional data entry for that field.
This visualization helps in quickly spotting which variables require attention for handling missing values through imputation, exclusion, or other strategies.
we can use these method for analyze missing datadistribution.
gg_miss_var(): Generates a bar plot showing the number of missing values per variable.
R code:
# Load necessary libraries
library(tidyverse)
library(naniar)
# Visualize the distribution of missing values
gg_miss_var(airbnb_data,) +
labs(title = "Missing Values by Variable",
x = "Variables",
y = "Count of Missing Values")
The gg_miss_var function from the naniar package provides a clear
visualization of the missing data distribution across all variables in
the dataset. This method allows for an easy comparison of the extent of
missing values per variable, aiding in prioritizing which variables
require cleaning.
Title: “Missing Values by Variable” clearly indicates that the plot focuses on the distribution of missing data.
X-axis: Represents the count of missing values for each variable.
Y-axis: Lists all the variables (columns) from the dataset.
Plot Description: Each dot on the plot corresponds to a variable. The position along the X-axis shows the number of missing values in that variable.
The reviews.per.month variable has the highest count of missing values, exceeding 15,000 entries.Other variables such as availability.365, minimum.nights, and review.rate.number also exhibit a noticeable number of missing entries but are significantly less compared to reviews.per.month.Variables like host_identity_verified, host_name, and host.id appear to have little to no missing data, as they are not visually represented with dots extending far on the X-axis.
vis_miss() with cluster = TRUE: Helps identify patterns in missing data by clustering similar patterns.
gg_miss_upset(): Visualizes combinations of missingness across different variables using an UpSet plot.
R Code:
# Examine pairwise missingness
# Load necessary libraries
library(naniar)
library(ggplot2)
# UpSet plot for missing data combinations
gg_miss_upset(airbnb_data, nsets = 10)
The gg_miss_upset function from the naniar package is used to visualize combinations of missing values across multiple variables in a dataset. This method is particularly useful for understanding how missingness in different variables co-occurs, which can help identify patterns or dependencies in missing data.
Set Size (Bottom Axis): This represents the total number of missing values for each variable, shown as bars. For example, reviews.per.month has the highest number of missing values.
Intersection Size (Left Axis): This shows the count of observations that have a specific combination of missing values across multiple variables. The largest bar at the top indicates the most common missing pattern, which involves only one variable being missing (reviews.per.month).
Dot Matrix (Middle): Each row of dots corresponds to a combination of variables with missing values. The connected dots illustrate which variables are missing together.
The plot reveals that reviews.per.month is often missing alone, which makes it a significant issue to address.
There are other smaller intersections where multiple variables are missing together, such as availability_365 and minimum_nights. Key Insights:
Dominant Missing Pattern: The most frequent missing pattern is singular, with a single variable (reviews.per.month) missing in many cases.
Co-occurrence: While most of the missing data occurs in isolation, there are cases where multiple variables are missing simultaneously, suggesting a potential dependency or a common cause for missingness.
We’ll address missing values using different strategies depending on the nature of the data.
Let’s break down the methods used in the provided code to handle missing values in the airbnb_data dataset:
R code:
cleaned_data <- airbnb_data %>% drop_na()
summary(cleaned_data)
## id NAME host.id
## Min. : 1001254 Length:85444 Min. :1.236e+08
## 1st Qu.:15465935 Class :character 1st Qu.:2.447e+10
## Median :30792674 Mode :character Median :4.904e+10
## Mean :29558170 Mean :4.920e+10
## 3rd Qu.:43237512 3rd Qu.:7.388e+10
## Max. :57367417 Max. :9.876e+10
## host_identity_verified host.name neighbourhood.group
## Length:85444 Length:85444 Length:85444
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## neighbourhood lat long country
## Length:85444 Min. :40.50 Min. :-74.25 Length:85444
## Class :character 1st Qu.:40.69 1st Qu.:-73.98 Class :character
## Mode :character Median :40.72 Median :-73.95 Mode :character
## Mean :40.73 Mean :-73.95
## 3rd Qu.:40.76 3rd Qu.:-73.93
## Max. :40.92 Max. :-73.71
## country.code instant_bookable cancellation_policy room.type
## Length:85444 Mode :logical Length:85444 Length:85444
## Class :character FALSE:42944 Class :character Class :character
## Mode :character TRUE :42500 Mode :character Mode :character
##
##
##
## Construction.year price service.fee minimum.nights
## Min. :2003 Length:85444 Length:85444 Min. :-365.000
## 1st Qu.:2007 Class :character Class :character 1st Qu.: 2.000
## Median :2012 Mode :character Mode :character Median : 3.000
## Mean :2012 Mean : 7.469
## 3rd Qu.:2017 3rd Qu.: 5.000
## Max. :2022 Max. :5645.000
## number.of.reviews last.review reviews.per.month review.rate.number
## Min. : 1.00 Length:85444 Min. : 0.010 Min. :1.00
## 1st Qu.: 3.00 Class :character 1st Qu.: 0.220 1st Qu.:2.00
## Median : 11.00 Mode :character Median : 0.740 Median :3.00
## Mean : 32.17 Mean : 1.373 Mean :3.28
## 3rd Qu.: 38.00 3rd Qu.: 2.000 3rd Qu.:4.00
## Max. :1024.00 Max. :90.000 Max. :5.00
## calculated.host.listings.count availability.365 house_rules
## Min. : 1.000 Min. : -10.0 Length:85444
## 1st Qu.: 1.000 1st Qu.: 6.0 Class :character
## Median : 1.000 Median : 100.0 Mode :character
## Mean : 7.044 Mean : 141.7
## 3rd Qu.: 2.000 3rd Qu.: 266.0
## Max. :332.000 Max. :3677.0
## license
## Length:85444
## Class :character
## Mode :character
##
##
##
All columns now have complete data after removing rows with missing values. This method is a straightforward way to handle missing data by simply excluding incomplete records.
This approach replaces missing values with meaningful substitutes to retain all data entries while addressing the missing data problem.
R code:
airbnb_data_mean_mutated <- airbnb_data %>%
mutate_if(is.numeric, ~ifelse(is.na(.), mean(., na.rm = TRUE), .))
# Count of missing values per column
null_counts <- colSums(is.na(airbnb_data_mean_mutated))
null_counts
## id NAME
## 0 0
## host.id host_identity_verified
## 0 0
## host.name neighbourhood.group
## 0 0
## neighbourhood lat
## 0 0
## long country
## 0 0
## country.code instant_bookable
## 0 105
## cancellation_policy room.type
## 0 0
## Construction.year price
## 0 0
## service.fee minimum.nights
## 0 0
## number.of.reviews last.review
## 0 0
## reviews.per.month review.rate.number
## 0 0
## calculated.host.listings.count availability.365
## 0 0
## house_rules license
## 0 0
The provided R code snippet appears to have two main steps:
The mutate_if() function is used to fill missing values in numeric columns with their respective column means. This ensures that no numeric column in the dataset has missing values after this step.
The colSums(is.na()) function calculates the number of missing values in each column of the airbnb_data_mean_mutated dataset. The resulting null_counts variable displays the number of missing values per column.
Most columns now have 0 missing values, indicating that any previous missing numeric data has been replaced with the mean of that column. The instant_bookable column still has 105 missing values, likely because it’s a logical (binary) column, not numeric, and thus wasn’t affected by the mean imputation. This approach effectively handles missing numeric values but leaves non-numeric columns with missing data to be addressed separately.
For each categorical (character) column, missing values are replaced with the mode (the most frequently occurring value) of that column.
R code:
fill_mode <- function(x) {
ux <- unique(na.omit(x))
ux[which.max(tabulate(match(x, ux)))]
}
airbnb_data_c_mutated <- airbnb_data %>%
mutate_if(is.character, ~ifelse(is.na(.), fill_mode(.), .))
# Count of missing values per column
null_counts <- colSums(is.na(airbnb_data_c_mutated))
null_counts
## id NAME
## 0 0
## host.id host_identity_verified
## 0 0
## host.name neighbourhood.group
## 0 0
## neighbourhood lat
## 0 8
## long country
## 8 0
## country.code instant_bookable
## 0 105
## cancellation_policy room.type
## 0 0
## Construction.year price
## 214 0
## service.fee minimum.nights
## 0 409
## number.of.reviews last.review
## 183 0
## reviews.per.month review.rate.number
## 15879 326
## calculated.host.listings.count availability.365
## 319 448
## house_rules license
## 0 0
fill_mode <- function(x) { ux <- unique(na.omit(x)) ux[which.max(tabulate(match(x, ux)))] }
Purpose: This function calculates the mode of a categorical variable.
Steps: 1. unique(na.omit(x)): Removes NA values and finds the unique elements in the column. 2. tabulate(match(x, ux)): Counts the frequency of each unique value. 3. which.max(): Identifies the value with the maximum frequency (i.e., the mode).
Use: This function is used to fill missing values in categorical columns with their mode.
airbnb_data_c_mutated <- airbnb_data %>% mutate_if(is.character, ~ifelse(is.na(.), fill_mode(.), .))
Purpose: This line applies the fill_mode function to impute missing values for all categorical (character) columns in the dataset.
mutate_if(is.character, …): Selects all character columns and applies the imputation where values are NA.
The mode imputation successfully fills missing values in categorical variables, as evidenced by the 0 count for most categorical columns.