Question 3

Read the data. How many missing values are there in the Petal.Length variable?

dirty_iris <- read.csv("https://raw.githubusercontent.com/edwindj/datacleaning/master/data/dirty_iris.csv")

sum(is.na(dirty_iris$Petal.Length))
## [1] 19

Question 4

Calculate the number and the percentage of observations that are complete.

n_complete <- sum(complete.cases(dirty_iris))

percent_complete <- n_complete / nrow(dirty_iris) * 100

n_complete
## [1] 96
percent_complete
## [1] 64
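
Because complete.cases() returns a logical vector, the percentage can also be taken directly as its mean. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the dirty_iris data):

```r
# Toy data frame (hypothetical values, not dirty_iris)
toy <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

ok <- complete.cases(toy)   # TRUE for rows with no NA in any column
n_ok <- sum(ok)             # number of complete rows
pct_ok <- mean(ok) * 100    # share of complete rows, in percent

n_ok     # 2
pct_ok   # 50
```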

Question 5

Besides missing values, are there other types of special values in the numeric columns?

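# Check numeric columns only: as.matrix() on the whole data frame gives a character
# matrix (Species is text), and is.infinite() is always FALSE for character data.
sum(sapply(dirty_iris[sapply(dirty_iris, is.numeric)], is.infinite))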
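
One pitfall worth checking when scanning a whole data frame: as.matrix() returns a character matrix as soon as one column is non-numeric, and is.infinite() is FALSE for every character element. The sketch below uses a made-up data frame (`toy` is hypothetical) to compare a whole-frame check with a numeric-columns-only check:

```r
# Toy data frame with one Inf and one text column (hypothetical values)
toy <- data.frame(
  width   = c(0.2, Inf, 1.5),
  species = c("setosa", "setosa", "virginica")
)

# Whole-frame check: the matrix is character, so nothing is flagged
whole_is_char <- is.character(as.matrix(toy))
whole_count   <- sum(is.infinite(as.matrix(toy)))

# Numeric columns only: count Inf/-Inf per column
num_cols   <- sapply(toy, is.numeric)
inf_counts <- vapply(toy[num_cols], function(x) sum(is.infinite(x)), numeric(1))

whole_is_char   # TRUE
whole_count     # 0
inf_counts      # width: 1
```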

Question 6

Write R code to locate the special values identified above and replace them with the missing-value placeholder NA.

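num_cols <- sapply(dirty_iris, is.numeric)  # is.infinite() never flags character data
dirty_iris[num_cols] <- lapply(dirty_iris[num_cols], function(x) replace(x, is.infinite(x), NA))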

Question 7

How many observations violate the rules checked below (Sepal.Width > 0 and Sepal.Length <= 30)?

violations <- subset(dirty_iris, Sepal.Width <= 0 | Sepal.Length > 30)

violations
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 16           5.0          -3          3.5         1.0 versicolor
## 28          73.0          29         63.0          NA  virginica
## 125         49.0          30         14.0         2.0     setosa
## 130          5.7           0          1.7         0.3     setosa
nrow(violations)
## [1] 4
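
To see which rule each row breaks, and to keep track of rows where a rule cannot be evaluated because of an NA, each rule can be stored as its own logical column. A sketch on made-up measurements (`toy` and the rule names are hypothetical; the two rules mirror the conditions used above):

```r
# Toy measurements (hypothetical values, not dirty_iris)
toy <- data.frame(
  Sepal.Length = c(5.1, 49.0, NA, 5.7),
  Sepal.Width  = c(3.5, 30.0, 2.9, 0.0)
)

# One logical column per rule: TRUE = satisfied, FALSE = violated, NA = cannot tell
rules <- cbind(
  width_positive = toy$Sepal.Width > 0,
  length_max_30  = toy$Sepal.Length <= 30
)

n_violations <- colSums(!rules, na.rm = TRUE)               # violations per rule
n_unknown    <- colSums(is.na(rules))                       # rows that could not be checked
violating    <- which(rowSums(!rules, na.rm = TRUE) > 0)    # rows breaking any rule

n_violations
n_unknown
violating
```

Note that subset() silently drops rows where the condition evaluates to NA, so counting the unknown rows separately shows how many observations were never actually checked.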

Question 8

Locate the observations that violate the rule “Sepal.Width > 0” and make reasonable corrections.

neg_index <- which(dirty_iris$Sepal.Width < 0)

dirty_iris$Sepal.Width[neg_index] <-
  abs(dirty_iris$Sepal.Width[neg_index])

zero_index <- which(dirty_iris$Sepal.Width == 0)

dirty_iris$Sepal.Width[zero_index] <- NA

dirty_iris[neg_index, ]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 16            5           3          3.5           1 versicolor
dirty_iris[zero_index, ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 130          5.7          NA          1.7         0.3  setosa

Question 9

Use the four methods we learned to impute the missing values, one method per column:

library(VIM)
dirty_iris <- read.csv("https://raw.githubusercontent.com/edwindj/datacleaning/master/data/dirty_iris.csv")

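num_cols <- sapply(dirty_iris, is.numeric)  # is.infinite() never flags character data
dirty_iris[num_cols] <- lapply(dirty_iris[num_cols], function(x) replace(x, is.infinite(x), NA))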

dirty_iris$Sepal.Width[is.na(dirty_iris$Sepal.Width)] <-
  mean(dirty_iris$Sepal.Width, na.rm = TRUE)

dirty_iris$Petal.Length[is.na(dirty_iris$Petal.Length)] <-
  median(dirty_iris$Petal.Length, na.rm = TRUE)

dirty_iris <- kNN(dirty_iris, variable = "Petal.Width", k = 5)

dirty_iris <- dirty_iris[, !grepl("_imp$", names(dirty_iris))]

model_data <- dirty_iris[!is.na(dirty_iris$Sepal.Length), ]

colSums(is.na(dirty_iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           10            0            0            0            0
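
The summary above still shows 10 missing values in Sepal.Length. One way to fill them is regression imputation: fit a linear model on the rows where Sepal.Length is observed, then predict it for the remaining rows. The sketch below is an assumption about how that step could look, not the original solution. It uses the built-in iris data with a few values blanked out, and the predictors (Species and Petal.Length) are an illustrative choice.

```r
# Toy setup: built-in iris with some Sepal.Length values set to NA
toy <- iris
miss_idx <- c(3, 57, 120)            # arbitrary rows, chosen for illustration
toy$Sepal.Length[miss_idx] <- NA

# Fit on the rows where the response is observed
fit_data <- toy[!is.na(toy$Sepal.Length), ]
fit <- lm(Sepal.Length ~ Species + Petal.Length, data = fit_data)

# Predict only for the rows where the response is missing
to_fill <- is.na(toy$Sepal.Length)
toy$Sepal.Length[to_fill] <- predict(fit, newdata = toy[to_fill, ])

n_missing_after <- sum(is.na(toy$Sepal.Length))
n_missing_after   # 0
```

The predictors must themselves be non-missing in the rows being filled, which is why this kind of step usually comes after the other columns have been imputed.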