Intro to the code.
Missing values need to be handled before a dataset is analysed.
Commonly used approaches are:
1. Replacing the missing value with a reasonable value, which can be the mean, median, mode, or a constant
2. Replacing the missing value with the previous value in the dataset
3. Deleting the rows with missing values (not a preferred option)
We will use the built-in dataset “airquality” in this code.
library(tidyverse)
## -- Attaching packages ------------------
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
dat = airquality
Use the summary function to check whether a variable has missing values
summary(dat$Solar.R)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.0 115.8 205.0 185.9 258.8 334.0 7
To see which rows have missing values
which(is.na(dat$Solar.R))
## [1] 5 6 11 27 96 97 98
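The check above looks at a single column. As a minimal sketch, the same idea can be applied to every column at once using only base R (`colSums` and `is.na`); the name `na_counts` is introduced here for illustration and is not part of the original code.

```r
# Count missing values in each column of the built-in airquality data
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Keep only the columns that actually contain missing values
print(na_counts[na_counts > 0])
```

This shows which variables need attention before choosing an imputation approach.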
To replace missing values with the median
dat$Solar.R = impute(dat$Solar.R, # the variable with missing values
median # function that computes the imputed value (e.g. mean, median), or a constant
)
summary(dat$Solar.R)
##
## 7 values imputed to 205
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 120.0 205.0 186.8 256.0 334.0
To replace missing values with the mean
dat = airquality
dat$Solar.R = impute(dat$Solar.R,
mean)
summary(dat$Solar.R)
##
## 7 values imputed to 185.9315
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 120.0 194.0 185.9 256.0 334.0
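For readers who prefer not to load Hmisc, the two imputations above can be sketched in base R. This is an illustrative equivalent, not the method used in the original code; the names `x`, `x_median`, and `x_mean` are assumptions made for this example. The last step compares the standard deviation before and after imputation, since filling every gap with one constant tends to shrink the spread of the variable.

```r
x <- airquality$Solar.R

# Replace NAs with the median of the observed values
x_median <- x
x_median[is.na(x_median)] <- median(x, na.rm = TRUE)

# Replace NAs with the mean of the observed values
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Compare the spread before and after imputation
spread <- c(observed       = sd(x, na.rm = TRUE),
            median_imputed = sd(x_median),
            mean_imputed   = sd(x_mean))
print(spread)
```

Mean imputation leaves the mean unchanged but lowers the standard deviation, which is worth keeping in mind when the imputed variable is later used in tests or models.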
To replace each missing value with the previous non-missing value in the dataset
dat = airquality
dat = dat %>% fill(Solar.R)
summary(dat$Solar.R)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 112 203 186 259 334
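To make the "previous value" rule concrete, here is a minimal base R sketch of carrying the last observation forward. The helper `locf` is hypothetical and written only for this illustration; it is not a function from tidyr or Hmisc. One consequence of the rule is that a missing value at the very start of a column has no previous value and therefore stays missing.

```r
# Hypothetical helper: carry the last observed value forward over NAs
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]   # idx == 0 means no previous value, so NA
}

# Small example: the leading NA cannot be filled
small <- locf(c(NA, 1, NA, NA, 4))
print(small)

# Applied to Solar.R: rows 5 and 6 take the value from row 4
filled <- locf(airquality$Solar.R)
print(filled[4:6])
```

In airquality the first missing Solar.R value is in row 5, so every gap has an earlier value to copy from.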
To delete rows with missing values
dat = airquality
dim(dat)
## [1] 153 6
dat = dat %>%
filter(!is.na(Solar.R)) # keep only rows where Solar.R is not missing
dim(dat)
## [1] 146 6
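Deleting rows can be scoped to one variable or to the whole data frame, and the difference matters. The sketch below uses base R only; the names `solar_complete` and `all_complete` are assumptions for this example. It also shows that a comparison involving NA returns NA rather than TRUE or FALSE, which is why `is.na()` is the reliable test for missingness.

```r
# A comparison with NA yields NA, not TRUE or FALSE
print(NA != "NA")

# Drop rows where Solar.R is missing
solar_complete <- airquality[!is.na(airquality$Solar.R), ]
print(dim(solar_complete))

# Drop rows with a missing value in any column (stricter)
all_complete <- na.omit(airquality)
print(dim(all_complete))
```

Because Ozone also has missing values, removing incomplete rows across all columns discards more data than filtering on Solar.R alone.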
Thank You