Air Quality in Indian Cities 2015-2020

Source of the Data:

https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=city_day.csv

Process of calculating air quality index:

https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial

What are the safe limits of these pollutants in air for India?

http://www.arthapedia.in/index.php?title=Ambient_Air_Quality_Standards_in_India

About the Dataset:

The dataset has air quality readings of major Indian cities for the period 2015-2020.
This dataset captures the air quality parameters of PM2.5, PM10, Nitrogen Oxide (NO), Nitrogen Dioxide (NO2), Ammonia (NH3), Carbon Monoxide (CO), Sulphur Dioxide (SO2), Ozone (O3), Benzene, Toluene, Xylene.
Computed Air Quality Index included in the dataset

Load Dataset(s)

Load Pollution dataset

pollution <- read.csv('https://raw.githubusercontent.com/learning-monk/datasets/master/Indian_cities_daily_pollution_2015-2020.csv', stringsAsFactor=FALSE, na.strings=c(""))

str(pollution)

## 'data.frame':    29531 obs. of  17 variables:
##  $ City             : chr  "Ahmedabad" "Ahmedabad" "Ahmedabad" "Ahmedabad" ...
##  $ Date             : chr  "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
##  $ PM2.5            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PM10             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ NO2              : num  18.2 15.7 19.3 18.5 21.4 ...
##  $ NOx              : num  17.1 16.5 29.7 18 37.8 ...
##  $ NH3              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ SO2              : num  27.6 24.6 29.1 18.6 39.3 ...
##  $ O3               : num  133.4 34.1 30.7 36.1 39.3 ...
##  $ Benzene          : num  0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
##  $ Toluene          : num  0.02 5.5 16.4 10.14 18.89 ...
##  $ Xylene           : num  0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
##  $ AQI              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AQI_Bucket       : chr  NA NA NA NA ...
##  $ Metropolitan_Area: int  1 1 1 1 1 1 1 1 1 1 ...

Data Wrangling steps

Convert date column to Date format
Convert City and AQI_Bucket to factor variables and Date column to Date
Add Month and Year to pollution
Re-order Month Values
Re-order bucket values in logical order in AQI_Bucket columns

## 'data.frame':    29531 obs. of  19 variables:
##  $ City             : Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date             : Date, format: "2015-01-01" "2015-01-02" ...
##  $ PM2.5            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ PM10             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ NO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ NO2              : num  18.2 15.7 19.3 18.5 21.4 ...
##  $ NOx              : num  17.1 16.5 29.7 18 37.8 ...
##  $ NH3              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CO               : num  0.92 0.97 17.4 1.7 22.1 ...
##  $ SO2              : num  27.6 24.6 29.1 18.6 39.3 ...
##  $ O3               : num  133.4 34.1 30.7 36.1 39.3 ...
##  $ Benzene          : num  0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
##  $ Toluene          : num  0.02 5.5 16.4 10.14 18.89 ...
##  $ Xylene           : num  0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
##  $ AQI              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AQI_Bucket       : Factor w/ 6 levels "Severe","Very Poor",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Metropolitan_Area: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Month            : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year             : Factor w/ 6 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...

Find missing values in the data set

##                   na_count
## City                     0
## Date                     0
## PM2.5                 4598
## PM10                 11140
## NO                    3582
## NO2                   3585
## NOx                   4185
## NH3                  10328
## CO                    2059
## SO2                   3854
## O3                    4022
## Benzene               5623
## Toluene               8041
## Xylene               18109
## AQI                   4681
## AQI_Bucket            4681
## Metropolitan_Area        0
## Month                    0
## Year                     0

Except City and Date columns, rest of the columns have NAs. Let’s not impute these missing values as they are not evenly missing. We will ignore these values in the plots for now.

How is AQI_Bucket distributed?

## 
##       Severe    Very Poor         Poor     Moderate Satisfactory         Good 
##         1338         2337         2781         8829         8224         1341

Plot the distribution of AQI levels

Using treemap, let’s visualize Air quality status

As different pollutants are measured on different scales and their range of values are fluctuating, it is important that we scale their values so that they can be brought under a common range and can be compared with each other in the same visualization.

How is Scaling/Normalization different from Standardization?

Scaling doesn’t change the distribution but only changes the range of values. Whereas Standardization is a technique used to normalize underlying distrbution so that machine learning algorithms / statistical tests which assumes data to be normal can be applied.

In our case, let’s scale our pollutant measurements to the range of 0 to 1. This process makes comparison easy and also helps us in plotting all the pollutants in a single chart to see the changes over the years. The downside of this technique in contrast to Standardization is we will end up with smaller standard deviations which can suppress the effect of outliers.

There are lot of scaling techniques available for use. In our case, let’s use MinMax Scaler.

MinMax Scaler:

Retains the shape of the original distribution
Doesn’t change the information in the data
Doesn’t effect outliers
The default range is 0 to 1

Scaling our data

Each pollutant measurement is brought to a common scale of 0 to 1 so that comparison would make sense.

##      PM2.5            PM10             NO             NO2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.030   1st Qu.:0.056   1st Qu.:0.014   1st Qu.:0.032  
##  Median :0.051   Median :0.096   Median :0.025   Median :0.060  
##  Mean   :0.071   Mean   :0.118   Mean   :0.045   Mean   :0.079  
##  3rd Qu.:0.085   3rd Qu.:0.150   3rd Qu.:0.051   3rd Qu.:0.104  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##  NA's   :4598    NA's   :11140   NA's   :3582    NA's   :3585   
##       NOx             NH3              CO              SO2       
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.027   1st Qu.:0.024   1st Qu.:0.0029   1st Qu.:0.029  
##  Median :0.050   Median :0.045   Median :0.0051   Median :0.047  
##  Mean   :0.069   Mean   :0.067   Mean   :0.0128   Mean   :0.075  
##  3rd Qu.:0.086   3rd Qu.:0.085   3rd Qu.:0.0082   3rd Qu.:0.078  
##  Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.000  
##  NA's   :4185    NA's   :10328   NA's   :2059     NA's   :3854   
##        O3           Benzene         Toluene          Xylene     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.073   1st Qu.:0.000   1st Qu.:0.001   1st Qu.:0.001  
##  Median :0.120   Median :0.002   Median :0.007   Median :0.006  
##  Mean   :0.134   Mean   :0.007   Mean   :0.019   Mean   :0.018  
##  3rd Qu.:0.177   3rd Qu.:0.007   3rd Qu.:0.020   3rd Qu.:0.020  
##  Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##  NA's   :4022    NA's   :5623    NA's   :8041    NA's   :18109

Standardization:

The below code normalizes our pollutant columns by subtracting mean from each value and dividing by standard deviation so that the data is normalized and takes the shape of bell curve.

##      PM2.5             PM10              NO              NO2        
##  Min.   :-1.043   Min.   :-1.304   Min.   :-0.770   Min.   :-1.167  
##  1st Qu.:-0.597   1st Qu.:-0.683   1st Qu.:-0.524   1st Qu.:-0.687  
##  Median :-0.292   Median :-0.248   Median :-0.337   Median :-0.281  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.203   3rd Qu.: 0.349   3rd Qu.: 0.104   3rd Qu.: 0.370  
##  Max.   :13.649   Max.   : 9.733   Max.   :16.372   Max.   :13.635  
##  NA's   :4598     NA's   :11140    NA's   :3582     NA's   :3585    
##       NOx              NH3               CO               SO2        
##  Min.   :-1.021   Min.   :-0.914   Min.   :-0.3233   Min.   :-0.801  
##  1st Qu.:-0.616   1st Qu.:-0.580   1st Qu.:-0.2500   1st Qu.:-0.489  
##  Median :-0.278   Median :-0.297   Median :-0.1954   Median :-0.296  
##  Mean   : 0.000   Mean   : 0.000   Mean   :-0.0002   Mean   : 0.000  
##  3rd Qu.: 0.247   3rd Qu.: 0.255   3rd Qu.:-0.1149   3rd Qu.: 0.038  
##  Max.   :13.754   Max.   :12.827   Max.   :24.9368   Max.   : 9.891  
##  NA's   :4185     NA's   :10328    NA's   :2059      NA's   :3854    
##        O3            Benzene          Toluene           Xylene      
##  Min.   :-1.590   Min.   :-0.207   Min.   :-0.436   Min.   :-0.486  
##  1st Qu.:-0.721   1st Qu.:-0.200   1st Qu.:-0.406   1st Qu.:-0.464  
##  Median :-0.168   Median :-0.140   Median :-0.287   Median :-0.331  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.511   3rd Qu.:-0.013   3rd Qu.: 0.023   3rd Qu.: 0.044  
##  Max.   :10.292   Max.   :28.574   Max.   :22.341   Max.   :26.472  
##  NA's   :4022     NA's   :5623     NA's   :8041     NA's   :18109

The two variables PM2.5 and PM10 are positively and strongly correlated. There are visible outliers.

Scatter plot matrix:

Let’s visualize correlation between different pollutants using scatter plot matrix.

What is a scatter plot matrix?

A scatter plot matrix is collection of scatter plots displayed as a grid.

It is similar to correlation plot but instead of displaying correlations, displays underlying data.
Scatter plot matrix is created using ggpairs function from GGally package.

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Except PM2.5 and PM10 rest of the pollutants plotted above have weak positive correlation among themselves Whereas PM2.5 and PM10 are strongly positively correlated above 0.8.

Benzene and Toluene shows somewhat stronger positive relationship correlation coefficient above 0.7.

Let’s see the distribution of pollutants over the years by City.

We will use Violin plots for this purpose.

What are Violin plots?

Violin plots are similar to kernel-density plots but are mirrored and rotated 90 degrees.

To understand more about Violin plots, check this interesting blog:

https://mode.com/blog/violin-plot-examples/

Distribution of PM2.5 by City over the years 2015-2020

Cities which caught my attention with PM2.5 pollution levels close to or over 0.50: Delhi, Gurugram, Lucknow, Patna
Visible outliers found in Amritsar, Guwahati, Shillong

Let’s see how the PM2.5 pollutant is distributed by months for each year for the above cities (Delhi, Gurugram, Lucknow, Patna)

As suspected, winter (Nov, Dec, Jan) months recorded high levels of PM2.5 for Delhi

Gurugram seems to show high levels during winter. But, interestingly 2017 year showed high levels of PM2.5 during mid year.

Lucknow seems to be following similar patterns as Delhi with high PM2.5 during winter.

Patna also seems to be following similar patterns as Delhi with high PM2.5 during winter.

The daily permissible limits of PM2.5 is between 40 - 60
The median levels of Delhi, Gurugram, Lucknow, Patna are close to 125 which is way above the max permissible limit

Distribution of PM10 by City over the years 2015-2020

Cities which caught my attention with PM10 levels close to and over 0.50 through the years:

Delhi, Gurugram, Jorapokhar, Kolkata, Talcher.

Interestingly metros like Bengaluru, Hyderabad seems to be consistent and median levels under 0.50 over the years.
There are visible outliers in Ahmedabad, Amritsar, Gawahati and Shillong.

Note: Jorapokhar in Jharkhand state and Talcher in Odisha state are not metro cities

Let’s see how PM10 is distributed by month over the years for the cities Delhi, Gurugram, Jorapokhar, Kolkata and Talcher

PM10 is also following similar patters to PM2.5 for Delhi with visible increase in levels during winter months (Nov, Dec, Jan)

PM10 was recorded only for the years 2018, 2019 and 2020 for Gurugram. It is following winter pattern (with high levels during winter months) as well.

Jorapokhar is also following winter pattern (with high levels during winter months) as well.

Daily permissible levels of PM10 is between 60 - 100
The median PM10 level of Delhi is over 200 which is alarming.
The median levels of Gurugram, Jorapokhar and Talcher is over 100 which is over the permissible limit
The median level of Kolkata is under and close to 100

Distribution of CO by City over the years 2015-2020

High CO (Carbon Monoxide) levels close to 0.50 are observed in Ahmedabad.

Let’s see the distribution of CO by month for each year for Ahmedabad

No clear visible pattern observed

The daily permissible levels of CO is between 2 - 4
The median level of CO for Ahmedabad is over 10 which is alarming.
Visible outliers observed over 40

Distribution of NOx by City over the years 2015-2020

NOx levels seems to be high for Ahmedabad, Mumbai, Delhi and Patna with levels closer or over 0.50

The daily permissible levels of NOx is between 40 - 80
The median levels of NOx for these cities seems to be under 80.
Visible outliers observed which are over 200

Distribution of SO2 by City over the years 2015-2020

SO2 levels are high for Ahmedabad, Jorapokhar, Patna

The daily permissible levels of SO2 is between 50 - 80.
Though the median levels are less than 50, it is important to keep an eye on this pollutant.
There are visible outliers recorded over 80(which is the max permissible limit).

Let’s use `plotly` to visualize PM2.5, PM10, NOx, CO and SO2 over the years

I have choosen metros Delhi, Hyderabad, Bengaluru, Chennai for indepth analysis.

Note: You may double click on the legend values to filter a particular pollutant

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Observations: Delhi Pollution levels

PM2.5: Clear visible peaks observed during late 2016, 2017 and 2019.
PM10: PM10 levels followed similar patterns as PM2.5. One visible peak observed during mid 2018.
NOx: Considerable peaks observed during 2015, 2016 and end of 2018
CO: Started off 2015 with high peaks and came under control during later years. One abnormal visible peak observed in Dec 2015
SO2: SO2 levels recorded high during 2016 with visible high peaks towards the end of 2016. Missing values observed during mid 2017.

Observations: Mumbai Pollution levels

PM2.5: Peaks observed during Jan 2019 and Jan 2020.
PM10: Peaks observed during Jan 2019.
NOx: Missing values observed during the period mid 2016 till end of 2017. Abnormal peaks observed during mid 2016 before values went missing.
CO: Peaks observed during the end of 2018 and starting 2019.
SO2: No visible abnormal peaks observed. Levels seems to be consistent.

Observations: Hyderabad Pollution levels

PM2.5: PM2.5 levels are abnormally high during 2015 & 2016 and came under control in the subsequent years. There are visible peaks only during these years.
PM10: PM10 levels are high during winter months of each year. There is an abnormal peak during the year 2017.
NOx: NOx levels started off on a high note in 2015 but reduced gradually. There were high levels during winter and early months of 2017 till 2020 dipping during mid years.
CO: CO level seems to be high during 2015 and 2016. There is one abnormal visible peak recorded during the end of 2016. After that abnormal peak, levels seems to be low will end of 2017. From 2018, levels are more or less consistent.
SO2: SO2 levels were high from late 2016 till early 2017. From 2018, levels seems to be more or less under control.

Observations: Bengaluru Pollution Levels

PM2.5: Visible peaks observed over the years and levels seems to be high during winter of each year
PM10: PM10 levels were recorded from late 2015. PM10 levels are following similar patterns as PM2.5 with elevated levels during winter months of each year.
NOx: Looks like NOx levels were not recorded from late 2016 till late 2017. High peaks observed during early 2016. Levels dipped past early months of 2020.
CO: CO levels were high between 2015 and 2016 and largely under control after 2016 except for few inconsistent visible peaks here and there.
SO2: SO2 levels were high between 2015 to 2016 and largely under control post 2016.

Observations: Chennai Pollution levels

PM2.5 has been consistent over the years, on a decreasing trend from late 2019 through 2020. But occasional peaks are visible.
PM10 levels are recorded from mid 2019 for Chennai. Visible peaks are observed.
NOx levels are high from mid 2018 with a visible peak during the mid of 2018.
CO levels are bit high compared to Hyderabad and Bengaluru during 2015 and till late 2016.
SO2 levels were fluctuating over the years and recorded peaks during 2015, late 2017, early 2019 and 2020.