Source of the Data:
https://www.kaggle.com/rohanrao/air-quality-data-in-india?select=city_day.csv
Process of calculating air quality index:
https://www.kaggle.com/rohanrao/calculating-aqi-air-quality-index-tutorial
What are the safe limits of these pollutants in air for India?
http://www.arthapedia.in/index.php?title=Ambient_Air_Quality_Standards_in_India
The dataset contains air quality readings for major Indian cities for the period 2015-2020.
It captures the following air quality parameters: PM2.5, PM10, Nitric Oxide (NO), Nitrogen Dioxide (NO2), Nitrogen Oxides (NOx), Ammonia (NH3), Carbon Monoxide (CO), Sulphur Dioxide (SO2), Ozone (O3), Benzene, Toluene and Xylene.
The computed Air Quality Index (AQI) is also included in the dataset.
Load Pollution dataset
pollution <- read.csv('https://raw.githubusercontent.com/learning-monk/datasets/master/Indian_cities_daily_pollution_2015-2020.csv', stringsAsFactors = FALSE, na.strings = c(""))
str(pollution)
## 'data.frame': 29531 obs. of 17 variables:
## $ City : chr "Ahmedabad" "Ahmedabad" "Ahmedabad" "Ahmedabad" ...
## $ Date : chr "2015-01-01" "2015-01-02" "2015-01-03" "2015-01-04" ...
## $ PM2.5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ PM10 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ NO2 : num 18.2 15.7 19.3 18.5 21.4 ...
## $ NOx : num 17.1 16.5 29.7 18 37.8 ...
## $ NH3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ SO2 : num 27.6 24.6 29.1 18.6 39.3 ...
## $ O3 : num 133.4 34.1 30.7 36.1 39.3 ...
## $ Benzene : num 0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
## $ Toluene : num 0.02 5.5 16.4 10.14 18.89 ...
## $ Xylene : num 0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
## $ AQI : int NA NA NA NA NA NA NA NA NA NA ...
## $ AQI_Bucket : chr NA NA NA NA ...
## $ Metropolitan_Area: int 1 1 1 1 1 1 1 1 1 1 ...
## 'data.frame': 29531 obs. of 19 variables:
## $ City : Factor w/ 26 levels "Ahmedabad","Aizawl",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2015-01-01" "2015-01-02" ...
## $ PM2.5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ PM10 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ NO2 : num 18.2 15.7 19.3 18.5 21.4 ...
## $ NOx : num 17.1 16.5 29.7 18 37.8 ...
## $ NH3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num 0.92 0.97 17.4 1.7 22.1 ...
## $ SO2 : num 27.6 24.6 29.1 18.6 39.3 ...
## $ O3 : num 133.4 34.1 30.7 36.1 39.3 ...
## $ Benzene : num 0 3.68 6.8 4.43 7.01 5.42 0 0 0 0 ...
## $ Toluene : num 0.02 5.5 16.4 10.14 18.89 ...
## $ Xylene : num 0 3.77 2.25 1 2.78 1.93 0 0 0 0 ...
## $ AQI : int NA NA NA NA NA NA NA NA NA NA ...
## $ AQI_Bucket : Factor w/ 6 levels "Severe","Very Poor",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Metropolitan_Area: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 6 levels "2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
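The second structure listing above shows City and AQI_Bucket as factors, Date as a Date, and two new columns, Month and Year. The code for that step is not shown, so the following is only a sketch of one way to get there, run on a small made-up data frame. The names pollution_demo and bucket_levels are introduced here for illustration, and the bucket order is taken from the AQI_Bucket frequency table in this document.

```r
# Toy stand-in for the pollution data frame (values are illustrative only)
pollution_demo <- data.frame(
  City       = c("CityA", "CityB", "CityB"),
  Date       = c("2015-01-01", "2016-02-15", "2020-06-30"),
  AQI_Bucket = c(NA, "Severe", "Good"),
  stringsAsFactors = FALSE
)

# Bucket order copied from the frequency table in this document
bucket_levels <- c("Severe", "Very Poor", "Poor",
                   "Moderate", "Satisfactory", "Good")

pollution_demo$City       <- factor(pollution_demo$City)
pollution_demo$Date       <- as.Date(pollution_demo$Date, format = "%Y-%m-%d")
pollution_demo$AQI_Bucket <- factor(pollution_demo$AQI_Bucket,
                                    levels = bucket_levels)

# month.abb is the built-in vector "Jan", "Feb", ..., "Dec"
month_number <- as.integer(format(pollution_demo$Date, "%m"))
pollution_demo$Month <- factor(month.abb[month_number], levels = month.abb)
pollution_demo$Year  <- factor(format(pollution_demo$Date, "%Y"))

str(pollution_demo)
```

Fixing the factor levels explicitly keeps months in calendar order and buckets in severity order in later tables and plots, instead of the alphabetical order factor() would otherwise pick.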
Find missing values in the data set
## na_count
## City 0
## Date 0
## PM2.5 4598
## PM10 11140
## NO 3582
## NO2 3585
## NOx 4185
## NH3 10328
## CO 2059
## SO2 3854
## O3 4022
## Benzene 5623
## Toluene 8041
## Xylene 18109
## AQI 4681
## AQI_Bucket 4681
## Metropolitan_Area 0
## Month 0
## Year 0
Apart from City, Date, Metropolitan_Area, Month and Year, all columns have NAs. Let’s not impute these missing values, as they are not evenly distributed across the columns. We will ignore them in the plots for now.
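The per-column NA counts listed above can be obtained with colSums(is.na(...)). The sketch below assumes nothing about the original chunk; it uses a made-up data frame called demo and reuses the na_count column name seen in the output.

```r
# Small made-up data frame with gaps (not the real readings)
demo <- data.frame(
  City  = c("CityA", "CityA", "CityB"),
  PM2.5 = c(NA, 10, NA),
  PM10  = c(5, NA, 7)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_count <- data.frame(na_count = colSums(is.na(demo)))
print(na_count)

# Share of missing values per column, in percent
na_share <- round(100 * colMeans(is.na(demo)), 1)
print(na_share)
```

Looking at the share as well as the count makes it easier to see which pollutants are too sparse to plot reliably.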
How is AQI_Bucket distributed?
##
## Severe Very Poor Poor Moderate Satisfactory Good
## 1338 2337 2781 8829 8224 1341
Plot the distribution of AQI levels
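As a rough sketch of such a plot, the counts from the AQI_Bucket table above can be fed straight into a base R bar chart. The original chart is not shown here and may well have used a different plotting package; bucket_counts and bucket_share are names introduced for this example.

```r
# Counts copied from the AQI_Bucket frequency table above
bucket_counts <- c("Severe" = 1338, "Very Poor" = 2337, "Poor" = 2781,
                   "Moderate" = 8829, "Satisfactory" = 8224, "Good" = 1341)

# Share of rows with a recorded bucket that fall in each category
bucket_share <- round(100 * bucket_counts / sum(bucket_counts), 1)
print(bucket_share)

# Bar chart of the raw counts; las = 2 rotates the axis labels
barplot(bucket_counts, las = 2,
        ylab = "Number of rows",
        main = "Distribution of AQI buckets")
```

The counts add up to the 29531 rows minus the 4681 rows where AQI_Bucket is missing, which is a quick consistency check on the table.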
Using treemap, let’s visualize Air quality status
As different pollutants are measured on different scales and their ranges of values differ widely, it is important to scale them so that they fall in a common range and can be compared with each other in the same visualization.
How is Scaling/Normalization different from Standardization?
Min-max scaling (often called normalization) changes only the range of the values; the shape of the distribution stays the same. Standardization, in contrast, re-centres each variable to a mean of 0 and rescales it to a standard deviation of 1. It does not make skewed data normal either, but it puts variables on a comparable footing for machine learning algorithms and statistical tests that are sensitive to differences in scale.
In our case, let’s scale our pollutant measurements to the range of 0 to 1. This makes comparison easy and also lets us plot all the pollutants in a single chart to see the changes over the years. The downside of this technique, in contrast to Standardization, is that the bounded range leaves us with smaller standard deviations, and a single extreme value can squeeze the remaining observations into a narrow band near 0.
There are a lot of scaling techniques available. In our case, let’s use the MinMax Scaler.
MinMax Scaler:
Scaling our data
Each pollutant measurement is brought to a common scale of 0 to 1 so that comparison would make sense.
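Min-max scaling maps each value x to (x - min) / (max - min), so the smallest observation becomes 0 and the largest becomes 1. A minimal sketch follows, assuming NAs are skipped when finding the minimum and maximum; min_max and demo are illustrative names, not taken from the original code.

```r
# Rescale a numeric vector to [0, 1], ignoring NAs when finding the range
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up readings on two very different scales
demo <- data.frame(
  PM2.5 = c(NA, 20, 60, 220),
  CO    = c(0.5, 1.5, NA, 10.5)
)

# Apply the scaler column by column; NAs stay NA
scaled <- as.data.frame(lapply(demo, min_max))
summary(scaled)
```

Note that a constant column would give a zero denominator, so such columns need separate handling before scaling.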
## PM2.5 PM10 NO NO2
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.030 1st Qu.:0.056 1st Qu.:0.014 1st Qu.:0.032
## Median :0.051 Median :0.096 Median :0.025 Median :0.060
## Mean :0.071 Mean :0.118 Mean :0.045 Mean :0.079
## 3rd Qu.:0.085 3rd Qu.:0.150 3rd Qu.:0.051 3rd Qu.:0.104
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## NA's :4598 NA's :11140 NA's :3582 NA's :3585
## NOx NH3 CO SO2
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.027 1st Qu.:0.024 1st Qu.:0.0029 1st Qu.:0.029
## Median :0.050 Median :0.045 Median :0.0051 Median :0.047
## Mean :0.069 Mean :0.067 Mean :0.0128 Mean :0.075
## 3rd Qu.:0.086 3rd Qu.:0.085 3rd Qu.:0.0082 3rd Qu.:0.078
## Max. :1.000 Max. :1.000 Max. :1.0000 Max. :1.000
## NA's :4185 NA's :10328 NA's :2059 NA's :3854
## O3 Benzene Toluene Xylene
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.073 1st Qu.:0.000 1st Qu.:0.001 1st Qu.:0.001
## Median :0.120 Median :0.002 Median :0.007 Median :0.006
## Mean :0.134 Mean :0.007 Mean :0.019 Mean :0.018
## 3rd Qu.:0.177 3rd Qu.:0.007 3rd Qu.:0.020 3rd Qu.:0.020
## Max. :1.000 Max. :1.000 Max. :1.000 Max. :1.000
## NA's :4022 NA's :5623 NA's :8041 NA's :18109
Standardization:
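This step standardizes our pollutant columns by subtracting the column mean from each value and dividing by the column standard deviation, so that every column has a mean of 0 and a standard deviation of 1. This changes the location and spread of the data but not its shape: the maxima of roughly 10 to 28 standard deviations in the summary below show that the pollutant distributions remain strongly right-skewed.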
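A minimal sketch of standardization is given here, assuming NAs are ignored when computing the mean and standard deviation. The helper name standardize and the sample vector are made up for illustration; base R's scale() performs the same computation.

```r
# z-score: subtract the mean, divide by the standard deviation
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Made-up, right-skewed sample with one missing value and one extreme value
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA, 100)
z <- standardize(x)

mean(z, na.rm = TRUE)  # numerically zero
sd(z, na.rm = TRUE)    # one

# scale() centres and scales in the same way, also skipping NAs
all.equal(z, as.numeric(scale(x)))
```

The extreme value stays extreme after standardization; only its units change from raw measurements to standard deviations.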
## PM2.5 PM10 NO NO2
## Min. :-1.043 Min. :-1.304 Min. :-0.770 Min. :-1.167
## 1st Qu.:-0.597 1st Qu.:-0.683 1st Qu.:-0.524 1st Qu.:-0.687
## Median :-0.292 Median :-0.248 Median :-0.337 Median :-0.281
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 0.203 3rd Qu.: 0.349 3rd Qu.: 0.104 3rd Qu.: 0.370
## Max. :13.649 Max. : 9.733 Max. :16.372 Max. :13.635
## NA's :4598 NA's :11140 NA's :3582 NA's :3585
## NOx NH3 CO SO2
## Min. :-1.021 Min. :-0.914 Min. :-0.3233 Min. :-0.801
## 1st Qu.:-0.616 1st Qu.:-0.580 1st Qu.:-0.2500 1st Qu.:-0.489
## Median :-0.278 Median :-0.297 Median :-0.1954 Median :-0.296
## Mean : 0.000 Mean : 0.000 Mean :-0.0002 Mean : 0.000
## 3rd Qu.: 0.247 3rd Qu.: 0.255 3rd Qu.:-0.1149 3rd Qu.: 0.038
## Max. :13.754 Max. :12.827 Max. :24.9368 Max. : 9.891
## NA's :4185 NA's :10328 NA's :2059 NA's :3854
## O3 Benzene Toluene Xylene
## Min. :-1.590 Min. :-0.207 Min. :-0.436 Min. :-0.486
## 1st Qu.:-0.721 1st Qu.:-0.200 1st Qu.:-0.406 1st Qu.:-0.464
## Median :-0.168 Median :-0.140 Median :-0.287 Median :-0.331
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
## 3rd Qu.: 0.511 3rd Qu.:-0.013 3rd Qu.: 0.023 3rd Qu.: 0.044
## Max. :10.292 Max. :28.574 Max. :22.341 Max. :26.472
## NA's :4022 NA's :5623 NA's :8041 NA's :18109
The two variables PM2.5 and PM10 are positively and strongly correlated. There are visible outliers.
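The strength of such a relationship can be quantified with cor(). The sketch below uses made-up numbers, not the real readings. It drops incomplete pairs with use = "complete.obs" and also computes the rank-based Spearman coefficient, which is less affected by outliers than Pearson's.

```r
# Made-up paired readings with gaps and one extreme day
demo <- data.frame(
  PM2.5 = c(30, 55, NA, 120, 80, 400),
  PM10  = c(60, 110, 90, NA, 150, 700)
)

# Pearson correlation on the rows where both values are present
pearson <- cor(demo$PM2.5, demo$PM10, use = "complete.obs")

# Spearman correlation works on ranks, so extreme values carry less weight
spearman <- cor(demo$PM2.5, demo$PM10,
                use = "complete.obs", method = "spearman")

c(pearson = pearson, spearman = spearman)
```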
Let’s visualize correlation between different pollutants using scatter plot matrix.
What is a scatter plot matrix?
A scatter plot matrix is a collection of scatter plots displayed as a grid.
We will use the ggpairs function from the GGally package.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
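The analysis here relies on ggpairs from GGally. As a dependency-free sketch of the same idea, base R's pairs() draws the grid of scatter plots and cor() gives the matching matrix of coefficients. The data below is simulated, so the numbers say nothing about the real pollutants.

```r
set.seed(1)
n <- 50

# Simulated columns: Toluene is built to track Benzene, Xylene is unrelated
benzene <- rexp(n)
demo <- data.frame(
  Benzene = benzene,
  Toluene = 2 * benzene + rnorm(n, sd = 0.5),
  Xylene  = rexp(n)
)
demo$Xylene[1:5] <- NA  # mimic missing readings

# Each coefficient uses all rows where that particular pair is complete
cor_mat <- cor(demo, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Grid of pairwise scatter plots; points with NAs are left out
pairs(demo, main = "Scatter plot matrix (simulated data)")
```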
Benzene and Toluene show a somewhat stronger positive relationship, with a correlation coefficient above 0.7.
Let’s see the distribution of pollutants over the years by City.
We will use Violin plots for this purpose.
What are Violin plots?
Violin plots are similar to kernel-density plots but are mirrored and rotated 90 degrees.
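To make the "mirrored and rotated density" idea concrete, the sketch below builds a violin outline by hand from base R's density(). It is not how a plotting package draws violins internally, only an illustration on simulated data; violin_outline is a hypothetical helper name.

```r
# Build the outline of a violin: the density curve plus its mirror image
violin_outline <- function(x, ...) {
  d <- density(x[!is.na(x)], ...)
  data.frame(
    value = c(d$x, rev(d$x)),   # vertical axis: the measurement itself
    width = c(d$y, -rev(d$y))   # horizontal axis: density and its mirror
  )
}

set.seed(42)
# Simulated scaled readings: many low values and a smaller high cluster
x <- c(rnorm(200, mean = 0.1, sd = 0.03),
       rnorm(50,  mean = 0.4, sd = 0.05), NA)

outline <- violin_outline(x)

# Plot density on the x-axis and the value on the y-axis (the 90 degree turn)
plot(outline$width, outline$value, type = "n",
     xlab = "Density (mirrored)", ylab = "Simulated scaled value")
polygon(outline$width, outline$value, col = "grey80")
```

The wide part of the shape marks where most observations lie, and a second bulge further up reveals the smaller cluster of high values.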
To understand more about Violin plots, check this interesting blog:
Cities which caught my attention with PM2.5 pollution levels close to or over 0.50: Delhi, Gurugram, Lucknow, Patna
Visible outliers found in Amritsar, Guwahati, Shillong
As suspected, winter (Nov, Dec, Jan) months recorded high levels of PM2.5 for Delhi
Gurugram seems to show high levels during winter. Interestingly, though, 2017 showed high levels of PM2.5 in the middle of the year.
Lucknow seems to be following similar patterns as Delhi with high PM2.5 during winter.
Patna also seems to be following similar patterns as Delhi with high PM2.5 during winter.
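A seasonal pattern like the one described above can be checked numerically by averaging readings per city and month. The following is a sketch on made-up values for a placeholder city; the names demo and monthly are assumptions for this example.

```r
# Made-up daily readings for one placeholder city
demo <- data.frame(
  City  = "CityA",
  Date  = as.Date(c("2019-01-10", "2019-01-20", "2019-07-05",
                    "2019-07-25", "2019-12-01", "2019-12-15")),
  PM2.5 = c(250, 310, 45, NA, 280, 220)
)

# Calendar-ordered month factor derived from the date
demo$Month <- factor(month.abb[as.integer(format(demo$Date, "%m"))],
                     levels = month.abb)

# Mean per city and month; the formula interface drops rows with NA
monthly <- aggregate(PM2.5 ~ City + Month, data = demo, FUN = mean)
print(monthly)
```

Only months with at least one non-missing reading appear in the result, which matters for pollutants with long gaps.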
The permissible limit for PM2.5 is between 40 and 60.
Median levels of Delhi, Gurugram, Lucknow and Patna are close to 125, which is way above the max permissible limit.
Cities which caught my attention with PM10 levels close to and over 0.50 through the years:
Delhi, Gurugram, Jorapokhar, Kolkata, Talcher.
Note: Jorapokhar in Jharkhand state and Talcher in Odisha state are not metro cities
PM10 also follows a pattern similar to PM2.5 for Delhi, with a visible increase in levels during the winter months (Nov, Dec, Jan).
PM10 was recorded only for the years 2018, 2019 and 2020 for Gurugram. It follows the winter pattern (with high levels during winter months) as well.
Jorapokhar also follows the winter pattern (with high levels during winter months).
The permissible limit for PM10 is between 60 and 100.
The median PM10 level of Delhi is over 200, which is alarming.
Median levels of Gurugram, Jorapokhar and Talcher are over 100, which is above the permissible limit.
The median level of Kolkata is under, but close to, 100.
High CO (Carbon Monoxide) levels close to 0.50 are observed in Ahmedabad.
No clear visible pattern observed
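The median-versus-limit comparisons made in these observations can be sketched as follows. The readings are invented for two placeholder cities, and the limits vector simply takes the upper end of the ranges quoted in this document for PM10 and CO; demo, limits, medians and exceeds are names introduced here.

```r
# Upper ends of the permissible ranges quoted in this document
limits <- c(PM10 = 100, CO = 4)

# Made-up unscaled readings for two placeholder cities
demo <- data.frame(
  City = c("CityA", "CityA", "CityA", "CityB", "CityB", "CityB"),
  PM10 = c(210, NA, 250, 30, 40, NA),
  CO   = c(11, 13, NA, 0.8, NA, 1.2)
)

# Median per city; na.pass keeps rows with NAs so na.rm can handle them
medians <- aggregate(cbind(PM10, CO) ~ City, data = demo,
                     FUN = median, na.rm = TRUE, na.action = na.pass)
print(medians)

# TRUE where a city's median is above the limit for that pollutant
exceeds <- sapply(names(limits), function(p) medians[[p]] > limits[[p]])
rownames(exceeds) <- medians$City
print(exceeds)
```

The comparison has to be done on the original measurements rather than the 0 to 1 scaled values, since the limits are expressed in the original units.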
The permissible limit for CO is between 2 and 4.
CO for Ahmedabad is over 10, which is alarming.
[...] 40
NOx levels seem to be high for Ahmedabad, Mumbai, Delhi and Patna, with levels close to or over 0.50.
The permissible limit for NOx is between 40 and 80.
Median levels of NOx for these cities seem to be under 80.
[...] 200
SO2 levels are high for Ahmedabad, Jorapokhar and Patna.
The permissible limit for SO2 is between 50 and 80.
[...] 50, it is important to keep an eye on this pollutant.
[...] 80 (which is the max permissible limit).
Using plotly to visualize PM2.5, PM10, NOx, CO and SO2 over the years
I have chosen the metros Delhi, Mumbai, Hyderabad, Bengaluru and Chennai for in-depth analysis.
Note: You may double click on the legend values to filter a particular pollutant
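For a chart with one legend entry per pollutant, the data usually has to be in long format, with one row per date and pollutant. The sketch below reshapes a tiny made-up table with base R and, only if the plotly package is installed, builds a line chart from it. The column names Pollutant and Value are assumptions, and the original chart may have been built differently.

```r
# Made-up scaled readings in wide format (one column per pollutant)
wide <- data.frame(
  Date  = as.Date("2019-01-01") + 0:2,
  PM2.5 = c(0.30, 0.28, NA),
  PM10  = c(0.40, 0.35, 0.33),
  SO2   = c(0.05, NA, 0.07)
)

# Stack the pollutant columns on top of each other
pollutants <- setdiff(names(wide), "Date")
long <- data.frame(
  Date      = rep(wide$Date, times = length(pollutants)),
  Pollutant = rep(pollutants, each = nrow(wide)),
  Value     = unlist(wide[pollutants], use.names = FALSE)
)
print(long)

# One line per pollutant; each pollutant becomes a legend entry
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(long, x = ~Date, y = ~Value, color = ~Pollutant,
                         type = "scatter", mode = "lines")
  # Typing fig at an interactive console displays the chart
}
```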
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Observations: Delhi Pollution levels
[...] Dec 2015
SO2 levels were high during 2016, with visible peaks towards the end of 2016. Missing values are observed in mid-2017.
Observations: Mumbai Pollution levels
Observations: Hyderabad Pollution levels
Observations: Bengaluru Pollution Levels
Observations: Chennai Pollution levels