Source of the Data:

https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=city_day.csv

Process of calculating air quality index:

https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial

What are the safe limits of these pollutants in air for India?

http://www.arthapedia.in/index.php?title=Ambient_Air_Quality_Standards_in_India


About the Dataset:


Load Dataset(s)

Load Pollution dataset

pollution <- read.csv('https://raw.githubusercontent.com/learning-monk/datasets/master/Indian_cities_daily_pollution_2015-2020.csv', stringsAsFactors = FALSE, na.strings = c(""))

str(pollution)
## 'data.frame':    29531 obs. of  17 variables:
##  $ City             : chr  "Ahmedabad" "Ahmedabad" "Ahmedabad" "Ahmedabad" ...
##  $ Date             : chr  "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
##  $ PM2.5            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PM10             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ NO2              : num  18.2 15.7 19.3 18.5 21.4 ...
##  $ NOx              : num  17.1 16.5 29.7 18 37.8 ...
##  $ NH3              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ SO2              : num  27.6 24.6 29.1 18.6 39.3 ...
##  $ O3               : num  133.4 34.1 30.7 36.1 39.3 ...
##  $ Benzene          : num  0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
##  $ Toluene          : num  0.02 5.5 16.4 10.14 18.89 ...
##  $ Xylene           : num  0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
##  $ AQI              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AQI_Bucket       : chr  NA NA NA NA ...
##  $ Metropolitan_Area: int  1 1 1 1 1 1 1 1 1 1 ...


Data Wrangling steps

## 'data.frame':    29531 obs. of  19 variables:
##  $ City             : Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date             : Date, format: "2015-01-01" "2015-01-02" ...
##  $ PM2.5            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PM10             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ NO2              : num  18.2 15.7 19.3 18.5 21.4 ...
##  $ NOx              : num  17.1 16.5 29.7 18 37.8 ...
##  $ NH3              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ SO2              : num  27.6 24.6 29.1 18.6 39.3 ...
##  $ O3               : num  133.4 34.1 30.7 36.1 39.3 ...
##  $ Benzene          : num  0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
##  $ Toluene          : num  0.02 5.5 16.4 10.14 18.89 ...
##  $ Xylene           : num  0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
##  $ AQI              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AQI_Bucket       : Factor w/ 6 levels "Severe","Very Poor",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Metropolitan_Area: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Month            : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year             : Factor w/ 6 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
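The wrangling code itself is not shown here. A minimal sketch that would give a structure like the one above, assuming a small made-up data frame (`pollution_demo` is a hypothetical stand-in for the real data) and taking the AQI_Bucket level order from the output above:

```r
# Toy stand-in for the real data; the values are made up for illustration.
pollution_demo <- data.frame(
  City = c("Ahmedabad", "Delhi", "Delhi"),
  Date = c("2015-01-01", "2016-02-15", "2020-07-01"),
  AQI_Bucket = c(NA, "Severe", "Good"),
  stringsAsFactors = FALSE
)

# Categorical columns become factors, the date string becomes a Date.
pollution_demo$City <- factor(pollution_demo$City)
pollution_demo$Date <- as.Date(pollution_demo$Date, format = "%Y-%m-%d")
pollution_demo$AQI_Bucket <- factor(
  pollution_demo$AQI_Bucket,
  levels = c("Severe", "Very Poor", "Poor", "Moderate", "Satisfactory", "Good")
)

# Month (Jan..Dec, in calendar order) and Year are derived from Date.
pollution_demo$Month <- factor(
  month.abb[as.integer(format(pollution_demo$Date, "%m"))],
  levels = month.abb
)
pollution_demo$Year <- factor(format(pollution_demo$Date, "%Y"))

str(pollution_demo)
```

Fixing the `Month` levels to `month.abb` keeps the months in calendar order in later plots instead of alphabetical order.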


Find missing values in the data set

##                   na_count
## City                     0
## Date                     0
## PM2.5                 4598
## PM10                 11140
## NO                    3582
## NO2                   3585
## NOx                   4185
## NH3                  10328
## CO                    2059
## SO2                   3854
## O3                    4022
## Benzene               5623
## Toluene               8041
## Xylene               18109
## AQI                   4681
## AQI_Bucket            4681
## Metropolitan_Area        0
## Month                    0
## Year                     0
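A table of this shape can be built with `colSums(is.na(...))`. The sketch below uses a made-up data frame (`na_demo` is hypothetical), not the real data:

```r
# Toy data with a few missing values; made up for illustration.
na_demo <- data.frame(
  City  = c("A", "B", "C"),
  PM2.5 = c(NA, 10, NA),
  CO    = c(1, NA, 3)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_count <- data.frame(na_count = colSums(is.na(na_demo)))
print(na_count)
```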

Every pollutant column, along with AQI and AQI_Bucket, contains NAs; only City, Date, Metropolitan_Area, Month and Year are complete. Let’s not impute these missing values, as they are not spread evenly across the columns. We will ignore them in the plots for now.


How is AQI_Bucket distributed?

## 
##       Severe    Very Poor         Poor     Moderate Satisfactory         Good 
##         1338         2337         2781         8829         8224         1341
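The counts above add up to 24850, which is 29531 rows minus the 4681 missing AQI_Bucket values, because `table()` drops NAs by default. A small sketch on a made-up vector (`aqi_bucket` and its values are hypothetical) shows the difference:

```r
# Toy AQI_Bucket vector; the values are made up for illustration.
bucket_levels <- c("Severe", "Very Poor", "Poor", "Moderate", "Satisfactory", "Good")
aqi_bucket <- factor(
  c("Good", NA, "Moderate", "Moderate", "Severe", NA),
  levels = bucket_levels
)

counts <- table(aqi_bucket)                            # NAs are dropped
counts_with_na <- table(aqi_bucket, useNA = "ifany")   # NAs get their own cell
shares <- round(prop.table(counts), 2)                 # share of non-missing days

print(counts)
print(counts_with_na)
print(shares)
```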

Plot the distribution of AQI levels

Using treemap, let’s visualize Air quality status


Different pollutants are measured on different scales and their ranges vary widely, so we need to scale their values to a common range before they can be compared with each other in the same visualization.

How is Scaling/Normalization different from Standardization?

Scaling doesn’t change the shape of the distribution; it only changes the range of values. Standardization rescales each variable to a mean of 0 and a standard deviation of 1. It is often applied before machine learning algorithms or statistical tests that assume normally distributed data, although on its own it does not change the shape of the distribution either.

In our case, let’s scale our pollutant measurements to the range of 0 to 1. This makes comparison easy and lets us plot all the pollutants in a single chart to see the changes over the years. The downside of this technique, compared with Standardization, is that the bounded range gives smaller standard deviations, and a single extreme value compresses the rest of the data towards 0 (the medians in the scaled summary below are all under 0.15).

There are many scaling techniques available. In our case, let’s use the MinMax Scaler.

MinMax Scaler:

Scaling our data

Each pollutant measurement is brought to a common scale of 0 to 1 so that comparison would make sense.
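Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. A minimal sketch, assuming a hypothetical helper `min_max()` and a made-up data frame rather than the code actually used for the summary below:

```r
# Hypothetical helper: rescale a numeric vector to [0, 1], keeping NAs as NA.
# A constant column (max == min) would divide by zero and is not handled here.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy pollutant columns; the values are made up for illustration.
scale_demo <- data.frame(
  PM2.5 = c(NA, 20, 60, 420),
  CO    = c(0.5, 1.0, NA, 40)
)

scaled <- as.data.frame(lapply(scale_demo, min_max))
summary(scaled)
```

In the toy `PM2.5` column the single large value (420) pushes 60 down to 0.1, which is the same compression visible in the low medians of the real summary.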

##      PM2.5            PM10             NO             NO2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.030   1st Qu.:0.056   1st Qu.:0.014   1st Qu.:0.032  
##  Median :0.051   Median :0.096   Median :0.025   Median :0.060  
##  Mean   :0.071   Mean   :0.118   Mean   :0.045   Mean   :0.079  
##  3rd Qu.:0.085   3rd Qu.:0.150   3rd Qu.:0.051   3rd Qu.:0.104  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##  NA's   :4598    NA's   :11140   NA's   :3582    NA's   :3585   
##       NOx             NH3              CO              SO2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.027   1st Qu.:0.024   1st Qu.:0.0029   1st Qu.:0.029  
##  Median :0.050   Median :0.045   Median :0.0051   Median :0.047  
##  Mean   :0.069   Mean   :0.067   Mean   :0.0128   Mean   :0.075  
##  3rd Qu.:0.086   3rd Qu.:0.085   3rd Qu.:0.0082   3rd Qu.:0.078  
##  Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.000  
##  NA's   :4185    NA's   :10328   NA's   :2059     NA's   :3854   
##        O3           Benzene         Toluene          Xylene     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.073   1st Qu.:0.000   1st Qu.:0.001   1st Qu.:0.001  
##  Median :0.120   Median :0.002   Median :0.007   Median :0.006  
##  Mean   :0.134   Mean   :0.007   Mean   :0.019   Mean   :0.018  
##  3rd Qu.:0.177   3rd Qu.:0.007   3rd Qu.:0.020   3rd Qu.:0.020  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##  NA's   :4022    NA's   :5623    NA's   :8041    NA's   :18109

Standardization:

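The pollutant columns are standardized by subtracting the column mean from each value and dividing by the column standard deviation, so that every column has a mean of 0 and a standard deviation of 1. This does not reshape the data into a bell curve: the large maxima in the summary below show that the distributions remain heavily right-skewed.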

##      PM2.5             PM10              NO              NO2        
##  Min.   :-1.043   Min.   :-1.304   Min.   :-0.770   Min.   :-1.167  
##  1st Qu.:-0.597   1st Qu.:-0.683   1st Qu.:-0.524   1st Qu.:-0.687  
##  Median :-0.292   Median :-0.248   Median :-0.337   Median :-0.281  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.203   3rd Qu.: 0.349   3rd Qu.: 0.104   3rd Qu.: 0.370  
##  Max.   :13.649   Max.   : 9.733   Max.   :16.372   Max.   :13.635  
##  NA's   :4598     NA's   :11140    NA's   :3582     NA's   :3585    
##       NOx              NH3               CO               SO2        
##  Min.   :-1.021   Min.   :-0.914   Min.   :-0.3233   Min.   :-0.801  
##  1st Qu.:-0.616   1st Qu.:-0.580   1st Qu.:-0.2500   1st Qu.:-0.489  
##  Median :-0.278   Median :-0.297   Median :-0.1954   Median :-0.296  
##  Mean   : 0.000   Mean   : 0.000   Mean   :-0.0002   Mean   : 0.000  
##  3rd Qu.: 0.247   3rd Qu.: 0.255   3rd Qu.:-0.1149   3rd Qu.: 0.038  
##  Max.   :13.754   Max.   :12.827   Max.   :24.9368   Max.   : 9.891  
##  NA's   :4185     NA's   :10328    NA's   :2059      NA's   :3854    
##        O3            Benzene          Toluene           Xylene      
##  Min.   :-1.590   Min.   :-0.207   Min.   :-0.436   Min.   :-0.486  
##  1st Qu.:-0.721   1st Qu.:-0.200   1st Qu.:-0.406   1st Qu.:-0.464  
##  Median :-0.168   Median :-0.140   Median :-0.287   Median :-0.331  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.511   3rd Qu.:-0.013   3rd Qu.: 0.023   3rd Qu.: 0.044  
##  Max.   :10.292   Max.   :28.574   Max.   :22.341   Max.   :26.472  
##  NA's   :4022     NA's   :5623     NA's   :8041     NA's   :18109
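Base R's `scale()` performs this kind of standardization column by column and ignores NAs when computing the mean and standard deviation. A small sketch on made-up data (`std_demo` is hypothetical; this may not be the exact call used for the summary above):

```r
# Toy pollutant columns; the values are made up for illustration.
std_demo <- data.frame(
  PM2.5 = c(NA, 20, 60, 420),
  CO    = c(0.5, 1.0, NA, 40)
)

# scale() centers each column on its mean and divides by its standard deviation.
standardized <- as.data.frame(scale(std_demo))

col_means <- colMeans(standardized, na.rm = TRUE)   # approximately 0
col_sds   <- sapply(standardized, sd, na.rm = TRUE) # 1 for every column

print(round(col_means, 10))
print(col_sds)
```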

The two variables PM2.5 and PM10 are positively and strongly correlated. There are visible outliers.
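The strength of such a relationship can be quantified with Pearson's correlation. Because both columns contain NAs, `cor()` needs `use = "complete.obs"` so that only rows where both values are present are used. A sketch on made-up numbers (`cor_demo` is hypothetical):

```r
# Toy PM2.5 / PM10 pairs with some missing values; made up for illustration.
cor_demo <- data.frame(
  PM2.5 = c(30, 55, NA, 120, 200, 80),
  PM10  = c(70, 110, 90, NA, 380, 150)
)

# Rows with an NA in either column are dropped before computing r.
r <- cor(cor_demo$PM2.5, cor_demo$PM10, use = "complete.obs")
n_pairs <- sum(complete.cases(cor_demo))

print(r)
print(n_pairs)
```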


Scatter plot matrix:

Let’s visualize correlation between different pollutants using scatter plot matrix.

What is a scatter plot matrix?

A scatter plot matrix is a collection of scatter plots displayed as a grid, with one panel for each pair of variables.

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2


Let’s see the distribution of pollutants over the years by City.

We will use Violin plots for this purpose.

What are Violin plots?

Violin plots are similar to kernel-density plots but are mirrored and rotated 90 degrees.

To understand more about Violin plots, check this interesting blog:

https://mode.com/blog/violin-plot-examples/

Distribution of PM2.5 by City over the years 2015-2020

Let’s see how the PM2.5 pollutant is distributed by months for each year for the above cities (Delhi, Gurugram, Lucknow, Patna)

As suspected, the winter months (Nov, Dec, Jan) recorded high levels of PM2.5 for Delhi.

Gurugram also seems to show high levels during winter. Interestingly, 2017 showed high levels of PM2.5 in the middle of the year as well.

Lucknow seems to follow a similar pattern to Delhi, with high PM2.5 during winter.

Patna also seems to follow a similar pattern to Delhi, with high PM2.5 during winter.
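The winter pattern read off the plots can also be checked numerically by taking the median per month. A sketch on made-up readings for a single city (`month_demo` is hypothetical):

```r
# Toy daily PM2.5 readings for one city; the values are made up for illustration.
month_demo <- data.frame(
  Date  = as.Date(c("2019-01-10", "2019-01-20", "2019-06-10",
                    "2019-06-20", "2019-12-05", "2019-12-25")),
  PM2.5 = c(210, 190, 50, NA, 180, 220)
)
month_demo$Month <- factor(
  month.abb[as.integer(format(month_demo$Date, "%m"))],
  levels = month.abb
)

# Median per month, ignoring NAs; months without any data come back as NA.
monthly_median <- tapply(month_demo$PM2.5, month_demo$Month, median, na.rm = TRUE)
print(monthly_median[!is.na(monthly_median)])
```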

  • The permissible limits for PM2.5 are 40 µg/m³ (annual average) and 60 µg/m³ (24-hour average)
  • The median levels for Delhi, Gurugram, Lucknow and Patna are close to 125 µg/m³, roughly double the 24-hour limit
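A comparison like this can be reproduced by computing the median per city and checking it against the limit. The sketch below uses made-up readings (`limit_demo` is hypothetical) and assumes the 24-hour limit of 60 as the threshold:

```r
# Toy readings for two cities; the values are made up for illustration.
limit_demo <- data.frame(
  City  = c("Delhi", "Delhi", "Delhi", "Chennai", "Chennai", "Chennai"),
  PM2.5 = c(90, 130, NA, 30, 45, 50)
)
limit_24h <- 60  # assumed threshold: the 24-hour PM2.5 limit

# The formula interface of aggregate() drops rows with NA by default.
city_median <- aggregate(PM2.5 ~ City, data = limit_demo, FUN = median)
city_median$above_limit <- city_median$PM2.5 > limit_24h
print(city_median)
```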


Distribution of PM10 by City over the years 2015-2020

Cities which caught my attention with scaled PM10 levels close to or over 0.50 through the years:

Delhi, Gurugram, Jorapokhar, Kolkata, Talcher.

Note: Jorapokhar in Jharkhand state and Talcher in Odisha state are not metro cities

Let’s see how PM10 is distributed by month over the years for the cities Delhi, Gurugram, Jorapokhar, Kolkata and Talcher

PM10 follows a similar pattern to PM2.5 for Delhi, with a visible increase in levels during the winter months (Nov, Dec, Jan).

PM10 was recorded only for the years 2018, 2019 and 2020 for Gurugram. It follows the winter pattern as well, with high levels during the winter months.

Jorapokhar also follows the winter pattern, with high levels during the winter months.

  • The permissible limits for PM10 are 60 µg/m³ (annual average) and 100 µg/m³ (24-hour average)
  • The median PM10 level of Delhi is over 200, which is alarming
  • The median levels of Gurugram, Jorapokhar and Talcher are over 100, which is above the 24-hour limit
  • The median level of Kolkata is just under 100


Distribution of CO by City over the years 2015-2020

High scaled CO (Carbon Monoxide) levels, close to 0.50, are observed in Ahmedabad.

Let’s see the distribution of CO by month for each year for Ahmedabad

No clear visible pattern observed

  • The permissible limits for CO are 2 mg/m³ (8-hour average) and 4 mg/m³ (1-hour average)
  • The median level of CO for Ahmedabad is over 10, which is alarming
  • Visible outliers are observed above 40


Distribution of NOx by City over the years 2015-2020

NOx levels seem to be high for Ahmedabad, Mumbai, Delhi and Patna, with scaled levels close to or over 0.50.


Distribution of SO2 by City over the years 2015-2020

SO2 levels are high for Ahmedabad, Jorapokhar, Patna


Let’s use plotly to visualize PM2.5, PM10, NOx, CO and SO2 over the years

I have chosen the metros Delhi, Hyderabad, Bengaluru and Chennai for in-depth analysis.

Note: You may double click on the legend values to filter a particular pollutant


Observations: Delhi Pollution levels

Observations: Mumbai Pollution levels

Observations: Hyderabad Pollution levels

Observations: Bengaluru Pollution Levels

Observations: Chennai Pollution levels