data("airquality")
aq<-data.frame(airquality)
summary(aq)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
dim(aq)
## [1] 153 6
Now we know that our dataset consists of 153 rows and 6 columns. Let’s check the data types of our variables.
sapply(aq, class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "integer" "numeric" "integer" "integer" "integer"
It seems that all of our variables are quantitative. Now let’s take a look at the first few rows of our dataset.
head(aq)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Here you can clearly see that there are missing values (NA) in our dataset. We need to know exactly how many values are missing and where they are. Cleaning the dataset is one of the most important steps toward a meaningful analysis, so let’s see how many missing values there are in each column of our dataset.
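One way to get these counts is sketched below. It assumes `aq` is the untouched copy of `airquality` created at the start, and the name `na_counts` is only illustrative.

```r
# Reload the data so this snippet runs on its own
data("airquality")
aq <- data.frame(airquality)

# is.na(aq) is a logical matrix (TRUE where a value is missing);
# colSums() adds up the TRUEs in each column
na_counts <- colSums(is.na(aq))
print(na_counts)
##   Ozone Solar.R    Wind    Temp   Month     Day
##      37       7       0       0       0       0
```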
Note that the colSums() function computes column sums for numeric arrays (or data frames). The is.na() function returns TRUE wherever a value is missing, so colSums(is.na(aq)) counts the missing values in each column.
According to the output, there are 37 missing values in the “Ozone” column and 7 missing values in the “Solar.R” column. The other columns are free of missing values.
As a rule of thumb, a variable with more than 5% missing values is often a candidate for being left out. Therefore let’s output the percentage (%) of missing values in each column. We use a small function for this.
percentmissing<-function(aq){
(colSums(is.na(aq))/nrow(aq))*100
}
percentmissing(aq)
## Ozone Solar.R Wind Temp Month Day
## 24.183007 4.575163 0.000000 0.000000 0.000000 0.000000
It seems that the “Ozone” column has over 24% missing values, while the “Solar.R” column has only about 4.6%.
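The comparison against the 5% rule of thumb can also be done in code rather than by eye. The following is a minimal sketch; `threshold`, `pct_missing` and `over_threshold` are illustrative names, and the 5% cutoff is simply the value quoted in this post.

```r
data("airquality")

# colMeans() of a logical matrix gives the share of TRUEs per column
pct_missing <- colMeans(is.na(airquality)) * 100

# Assumed cutoff: the 5% rule of thumb mentioned in the text
threshold <- 5
over_threshold <- names(pct_missing)[pct_missing > threshold]
print(over_threshold)
## [1] "Ozone"
```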
For the “Solar.R” column we can substitute the column mean in place of each missing value. This is a reasonable approximation because we only have 7 missing values to deal with.
But for the “Ozone” column, we have a problem. One option is to drop all rows with a missing value in the “Ozone” column. But by doing that, we lose over 24% of our rows (37 of 153). Therefore that method is not practical.
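To make the cost of dropping rows concrete, here is a small sketch that works on a separate copy (`aq_complete`, an illustrative name) so the main `aq` data frame is left alone.

```r
data("airquality")

# Keep only the rows where Ozone was actually measured
aq_complete <- airquality[!is.na(airquality$Ozone), ]

# How many rows survive, and how many were thrown away
rows_lost <- nrow(airquality) - nrow(aq_complete)
print(nrow(aq_complete))
## [1] 116
print(rows_lost)
## [1] 37
```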
Another option is to substitute the column mean in place of the missing values. But since there are 37 of them, we may be reducing the integrity of our dataset considerably, because wherever the actual value is far from the mean, we introduce a serious error.
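One visible side effect of this can be checked directly: filling many gaps with the same value leaves the column mean unchanged but shrinks the spread. The sketch below works on a copy of the “Ozone” column (`ozone_imputed` is an illustrative name) and only illustrates the concern; it is not a full assessment of imputation error.

```r
data("airquality")
ozone <- airquality$Ozone

# Mean-impute a copy of the column
ozone_imputed <- ozone
ozone_imputed[is.na(ozone_imputed)] <- mean(ozone, na.rm = TRUE)

# The mean stays the same, but the standard deviation drops,
# because 37 values now sit exactly on the mean
sd_before <- sd(ozone, na.rm = TRUE)
sd_after <- sd(ozone_imputed)
print(c(sd_before = sd_before, sd_after = sd_after))
```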
Therefore it is clear that we need more sophisticated algorithms to handle the missing values in this case. But since this is our first data cleaning project, we will use simple mean imputation. We will discuss a more complex yet more accurate method for dealing with missing values in our next post.
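# Replace each NA with the mean of its own column (not the mean of the whole data frame)
aq$Ozone[is.na(aq$Ozone)]<-mean(aq$Ozone,na.rm=TRUE)
aq$Solar.R[is.na(aq$Solar.R)]<-mean(aq$Solar.R,na.rm=TRUE)
head(aq)
## Ozone Solar.R Wind Temp Month Day
## 1 41.00000 190.0000 7.4 67 5 1
## 2 36.00000 118.0000 8.0 72 5 2
## 3 12.00000 149.0000 12.6 74 5 3
## 4 18.00000 313.0000 11.5 62 5 4
## 5 42.12931 185.9315 14.3 56 5 5
## 6 28.00000 185.9315 14.9 66 5 6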
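If more than a couple of columns need this treatment, the column-mean substitution described above could be wrapped in a small helper. The function name `impute_col_means` is hypothetical, and the sketch assumes every column with missing values is numeric.

```r
# Hypothetical helper: fill NAs in each numeric column with that column's mean
impute_col_means <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

data("airquality")
aq_filled <- impute_col_means(airquality)

# Confirm that no missing values remain
print(anyNA(aq_filled))
## [1] FALSE
```

Looping over columns keeps each mean tied to its own variable, which matters here because “Ozone” and “Solar.R” are measured on very different scales.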
So this is how you can handle missing values in R using a simple method. Mean imputation like this can be used for small datasets with only a few missing values. When the dataset gets bigger and more complex, we have to implement more complex algorithms. We will discuss those in the next tutorial. Visit https://datasciencelk.com/. Till then, see you! Goodbye!