Air Quality Data Cleaning

Summary of the dataframe and its dimensions

summary(aq)

##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
##

dim(aq)

## [1] 153   6

Now we know that our dataset consists of 153 rows and 6 columns. Let’s check the data types of our variables.

sapply(aq, class)

##     Ozone   Solar.R      Wind      Temp     Month       Day 
## "integer" "integer" "numeric" "integer" "integer" "integer"

All of our variables are quantitative (stored as integer or numeric). Now let’s take a look at the first few rows of the dataset.

head(aq)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Here you can clearly see that there are missing values (NA) in our dataset. We need an exact picture of how many there are and where they sit. Cleaning the dataset is an essential step for any meaningful analysis, so let’s see how many missing values each column contains.
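One way to get these counts is to combine is.na() with colSums(). This is a minimal sketch that assumes aq is a copy of R's built-in airquality data frame, which matches the dimensions and column names shown above.

```r
# Assumption: aq is a copy of the built-in airquality dataset
aq <- airquality

# is.na(aq) is a logical matrix; summing each column counts its NAs
na_counts <- colSums(is.na(aq))
print(na_counts)
# Ozone: 37, Solar.R: 7, all other columns: 0
```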

Note that colSums() returns the sum of each column of a numeric array (or data frame). The function is.na() returns TRUE for every missing value, so summing its result by column counts the missing values in each column.
The counts show 37 missing values in the “Ozone” column and 7 missing values in the “Solar.R” column. The other columns are free of missing values.

A common rule of thumb is to leave out any variable that has more than 5% missing values. Therefore let’s output the percentage (%) of missing values in each column. We use a small function for this.

# Percentage of missing values in each column of a data frame
percentmissing <- function(df) {
  colSums(is.na(df)) / nrow(df) * 100
}
percentmissing(aq)

##     Ozone   Solar.R      Wind      Temp     Month       Day 
## 24.183007  4.575163  0.000000  0.000000  0.000000  0.000000

The “Ozone” column has over 24% missing values, well above the 5% threshold. The “Solar.R” column has about 4.6% missing values, just under it.
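The threshold check can also be done in code. The sketch below again assumes aq is a copy of the built-in airquality data frame. The names pct_missing, threshold and over_threshold are only illustrative.

```r
# Assumption: aq is a copy of the built-in airquality dataset
aq <- airquality

# colMeans() of the logical NA matrix gives the fraction missing per column
pct_missing <- colMeans(is.na(aq)) * 100

threshold <- 5  # the 5% rule of thumb used in this post
over_threshold <- names(pct_missing)[pct_missing > threshold]

print(round(pct_missing, 2))
print(over_threshold)  # only "Ozone" exceeds the threshold
```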

For the “Solar.R” column we can substitute the column mean in place of each missing value. This is a reasonable approximation because only 7 values are missing.
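As a rough sketch of what this looks like for a single column, assuming aq is a copy of the built-in airquality data frame (solar_mean is just an illustrative name):

```r
# Assumption: aq is a copy of the built-in airquality dataset
aq <- airquality

# Mean of the observed Solar.R values only (NAs are ignored)
solar_mean <- mean(aq$Solar.R, na.rm = TRUE)

# Overwrite only the missing entries of Solar.R with that mean
aq$Solar.R[is.na(aq$Solar.R)] <- solar_mean

sum(is.na(aq$Solar.R))  # no missing values remain in this column
```

Filling with the mean leaves the column mean itself unchanged, which is one reason this method is popular when only a few values are missing.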

But for the “Ozone” column, we have a problem. One option is to drop all rows that have a missing value in the “Ozone” column. But by doing that, we lose over 24% of our rows, including their valid values in the other columns. Therefore that method is not practical.
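To see the cost of dropping rows, here is a small sketch, assuming aq is a copy of the built-in airquality data frame (aq_complete, rows_lost and pct_lost are illustrative names):

```r
# Assumption: aq is a copy of the built-in airquality dataset
aq <- airquality

# Keep only the rows where Ozone was actually observed
aq_complete <- aq[!is.na(aq$Ozone), ]

rows_lost <- nrow(aq) - nrow(aq_complete)
pct_lost  <- rows_lost / nrow(aq) * 100

rows_lost  # 37 rows are dropped
pct_lost   # a little over 24% of the 153 rows
```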

Another option is to substitute the column mean in place of the missing values. But since there are 37 missing values, we may reduce the integrity of our dataset considerably, because whenever the actual data point is far from the mean, the imputed value introduces a serious error.
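One visible side effect is that mean imputation shrinks the spread of the column, because every filled-in value sits exactly at the mean. The sketch below illustrates this, assuming aq is a copy of the built-in airquality data frame. The variable names are illustrative.

```r
# Assumption: aq is a copy of the built-in airquality dataset
aq <- airquality

ozone_observed <- aq$Ozone[!is.na(aq$Ozone)]

# Fill the 37 gaps with the mean of the observed values
ozone_imputed <- aq$Ozone
ozone_imputed[is.na(ozone_imputed)] <- mean(ozone_observed)

sd_before <- sd(ozone_observed)
sd_after  <- sd(ozone_imputed)

# The standard deviation after imputation is smaller than before
c(before = sd_before, after = sd_after)
```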

Therefore a more sophisticated method would be the better way to handle the missing values in this case. But since this is our first data cleaning project, we will use simple mean imputation for both columns. We will discuss a more complex but more accurate method for dealing with missing values in our next post.

# Replace the NAs in each column with that column's own mean
aq[] <- lapply(aq, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
head(aq)

##      Ozone  Solar.R Wind Temp Month Day
## 1 41.00000 190.0000  7.4   67     5   1
## 2 36.00000 118.0000  8.0   72     5   2
## 3 12.00000 149.0000 12.6   74     5   3
## 4 18.00000 313.0000 11.5   62     5   4
## 5 42.12931 185.9315 14.3   56     5   5
## 6 28.00000 185.9315 14.9   66     5   6

So this is how you can handle missing values in R using a simple method. Mean imputation is a reasonable choice for small datasets with only a few missing values. When the dataset gets bigger and more complex, we have to implement more sophisticated algorithms. We will discuss those in the next tutorial. Visit https://datasciencelk.com/ for more. Till then, see you! Goodbye!


Ravindu Abeygunasekara

4/7/2020

Importing the dataset

Creating copy of the dataset to a new dataframe
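A minimal sketch of these two steps, assuming the data is R's built-in airquality data frame (its 153 rows and 6 columns match the output in this post):

```r
# Assumption: the dataset is airquality from R's datasets package
data("airquality")

# Work on a copy so the original data frame stays untouched
aq <- airquality

dim(aq)  # 153 rows, 6 columns
```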
