Air quality of New Delhi

1. Background:

According to a recent study published in the journal Lancet Planetary health titled “The impact of air pollution on deaths, disease burden, and life expectancy across the states of India: the Global Burden of Disease Study 2017” nearly 12.5 percent of total deaths recorded in India for the year of 2017 could be attributed to the toxic air quality. That amounts to 1.24 million people that lost their lives in 2017 due to pollution. Furthermore, the worsening pollution in the capital city of India has recently been the source of numerous debates and news articles. The problem is so severe and pervasive that the state government of Delhi was forced to close all schools in the metro region for multiple days.

As a former native of the country and having family currently residing in Delhi, I wanted to the data on Delhi air pollution over the past 5 years and see if I could identify any trends/factors that have been contributing to the decline in air quality. The government of Delhi has made several claims and in this project my focus will be to test a few of them: In this project I will be doing the following:

  1. Claim: The pollution might be bad but it has been on a downward trend year over year.

To test this claim I will obtain air quality data and compare the trends year over year

  1. Claim: The Delhi government has put forward a hypothesis that increase in the breathable particulate matter is a direct result of “stubble burning” by farmers in the neighboring state of Punjab.

To test this claim I will review the annual harvest yields for a winter crop such as what and see if the yields seem to have a correlation with the pollution.

  1. Claim: Pollution has been bad because of prevailing wind conditions.

To test this claim I will source weather data and compare these against the pollution numbers to determine if there is a correlation. Additionally, I will also determine the prevailing wind direction and see if westerly winds have a positive correlation with pollution ( since Punjab is directly west of New Delhi, it would follow that westerly winds would carry more stubble smoke into Delhi).


2. The Workflow

In order to be as effective and efficient as possible I followed the OSEMN data science workflow. The following sections are arranged to coincide with this workflow

3a. Gather the Data

Pollution Data

Sourcing the data for pollution in Delhi was fairly straight forward. The US Embassy in New Delhi records daily level of various common pollutants and makes them publicly available here. For the purposes of this project we focused on two indicators of pollution namely: - The Air Quality Index (AQI) reason being that the AQI is an aggregated index calculated based on the air concentration of various pollutants - pm2.5 which is the concentration of particulate matter 2.5 micrometers or less (in diameter) in the air. According to the EPA “When inhaled, particle pollution can travel deep into the lungs and cause or aggravate heart and lung diseases. Some”primary sources" of this pollutant are – Incomplete combustion – Automobile emissions – Dust – Cooking Hence pm2.5 is widely considered as a effective measure of overall pollution.

Time period: 2015-2019(YTD)

Frequency - Hourly

Weather Data

Obtaining the weather data proved to be a bit more challenging that I had anticipated. The era of the free api is over. In the end there was no convenient way to obtain bulk historical weather data and I ended up using the World Weather Online api which provided 2 months of free usage. However, that the api is limited to providing a single month of historical data at a time. As such, I had to reduce my temporal scope where weather data was required.

Time period: July 2018 - March 2019

Frequency - every 3 hours

Harvest Yields Data

This was sourced from a open data website hosted by the Indian government here

Time period:2015-2018

Frequency - Annual

Note: All static datasets used in this study are available on github

3b. Extract and clean the data

Extraction

Pollution

The pollution data was downloaded from the link provided above in a csv format. The column names and formatting was not very conducive for efficient coding, as such I was able to find a great function which helps standardize data upon import. This function can be found here. The credit for this function belongs to William Doane.

# A tibble: 6 x 14
  site  parameter date_lt  year month day   hour  now_cast_conc   aqi
  <chr> <chr>     <chr>   <dbl> <chr> <chr> <chr>         <dbl> <dbl>
1 New … PM2.5 - … 2015-0…  2015 01    01    01             -999  -999
2 New … PM2.5 - … 2015-0…  2015 01    01    02             -999  -999
3 New … PM2.5 - … 2015-0…  2015 01    01    03             -999  -999
4 New … PM2.5 - … 2015-0…  2015 01    01    04             -999  -999
5 New … PM2.5 - … 2015-0…  2015 01    01    05             -999  -999
6 New … PM2.5 - … 2015-0…  2015 01    01    06             -999  -999
# … with 5 more variables: aqi_category <chr>, raw_conc <dbl>, conc_unit <chr>,
#   duration <chr>, qc_name <chr>

Weather data

Obtained from the api listed above one month at a time.

library(jsonlite)

jul <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-07-01&enddate=2018-07-31")
aug <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-08-01&enddate=2018-08-31")
sep <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-09-01&enddate=2018-09-30")
oct <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-10-01&enddate=2018-10-31")
nov <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-11-01&enddate=2018-11-30")
dec <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-12-01&enddate=2018-12-31")
jan <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-01-01&enddate=2019-01-31")
feb <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-02-01&enddate=2019-02-27")
mar <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-03-01&enddate=2019-03-31")



# bind the datasets together

wdf <- rbind(jul$data$weather, aug$data$weather, sep$data$weatheroct$data$weather, 
    nov$data$weather, dec$data$weather, jan$data$weather, feb$data$weather, mar$data$weather)

# review my data
head(wdf)
        date                                                   astronomy
1 2018-07-01  05:27 AM, 07:23 PM, 09:42 PM, 07:57 AM, Waning Gibbous, 75
2 2018-07-02  05:28 AM, 07:23 PM, 10:19 PM, 08:49 AM, Waning Gibbous, 68
3 2018-07-03  05:28 AM, 07:23 PM, 10:55 PM, 09:42 AM, Waning Gibbous, 61
4 2018-07-04    05:28 AM, 07:23 PM, 11:30 PM, 10:35 AM, Last Quarter, 54
5 2018-07-05 05:29 AM, 07:23 PM, No moonrise, 11:29 AM, Last Quarter, 46
6 2018-07-06    05:29 AM, 07:23 PM, 12:04 AM, 12:23 PM, Last Quarter, 39
  maxtempC maxtempF mintempC mintempF avgtempC avgtempF totalSnow_cm sunHour
1       38      101       32       89       35       95          0.0    13.3
2       39      101       32       90       35       96          0.0    14.0
3       37       99       31       89       34       93          0.0    14.0
4       35       95       30       85       33       91          0.0    13.3
5       39      102       30       87       35       95          0.0    13.9
6       38      101       32       90       35       96          0.0    13.9
  uvIndex
1       1
2       1
3       1
4       1
5       1
6       1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         hourly
1                  0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 32, 34, 37, 38, 37, 36, 34, 92, 89, 93, 98, 101, 99, 97, 93, 3, 4, 8, 9, 10, 7, 9, 11, 5, 7, 13, 15, 16, 11, 14, 18, 263, 260, 275, 283, 277, 262, 266, 250, W, W, W, WNW, W, W, W, WSW, 116, 119, 356, 116, 116, 116, 116, 119, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Partly cloudy, Cloudy, Moderate or heavy rain shower, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Cloudy, 0.0, 0.0, 2.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 52, 59, 51, 44, 39, 42, 45, 58, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 998, 998, 999, 999, 997, 995, 995, 996, 30, 30, 30, 30, 30, 30, 30, 30, 47, 64, 68, 55, 46, 44, 48, 78, 41, 38, 40, 44, 46, 45, 43, 41, 106, 101, 105, 111, 114, 112, 110, 106, 23, 24, 23, 23, 22, 23, 23, 24, 74, 74, 73, 73, 72, 73, 73, 76, 35, 33, 34, 37, 39, 38, 37, 34, 94, 91, 94, 99, 102, 100, 98, 93, 3, 5, 13, 11, 11, 8, 10, 13, 5, 8, 21, 17, 18, 13, 17, 21, 41, 38, 40, 44, 46, 45, 43, 41, 106, 101, 105, 111, 114, 112, 110, 106, 1, 1, 1, 1, 1, 1, 1, 1
2 0, 300, 600, 900, 1200, 1500, 1800, 2100, 33, 32, 34, 35, 39, 38, 37, 35, 91, 90, 93, 95, 101, 101, 99, 94, 13, 8, 11, 10, 9, 5, 2, 9, 21, 13, 17, 16, 14, 8, 4, 15, 270, 264, 263, 259, 269, 258, 226, 198, W, W, W, W, W, WSW, SW, SSW, 356, 116, 176, 116, 176, 113, 113, 116, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0025_light_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0009_light_rain_showers.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Moderate or heavy rain shower, Partly cloudy, Patchy rain possible, Partly cloudy, Patchy rain possible, Sunny, Clear, Partly cloudy, 2.8, 0.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 58, 57, 49, 47, 36, 34, 36, 45, 7, 10, 9, 10, 9, 10, 10, 10, 4, 6, 5, 6, 5, 6, 6, 6, 994, 995, 995, 996, 994, 994, 994, 995, 30, 30, 30, 30, 30, 30, 30, 30, 60, 45, 26, 25, 21, 15, 19, 41, 40, 38, 40, 42, 45, 45, 43, 39, 103, 100, 103, 107, 113, 112, 110, 103, 24, 23, 22, 23, 21, 21, 21, 21, 75, 73, 72, 73, 70, 69, 69, 70, 33, 33, 35, 36, 39, 39, 38, 35, 92, 91, 94, 96, 103, 103, 101, 95, 21, 10, 13, 12, 10, 5, 3, 12, 33, 16, 21, 19, 17, 9, 5, 20, 40, 38, 40, 42, 45, 45, 43, 39, 103, 100, 103, 107, 113, 112, 110, 103, 1, 1, 1, 1, 1, 1, 1, 1
3                          0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 31, 32, 35, 37, 34, 34, 32, 94, 89, 90, 95, 99, 94, 94, 89, 12, 6, 6, 4, 3, 4, 5, 4, 19, 10, 9, 6, 5, 6, 9, 7, 241, 215, 219, 26, 167, 72, 134, 116, WSW, SW, SW, NNE, SSE, ENE, SE, ESE, 113, 116, 356, 116, 116, 116, 116, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Partly cloudy, Moderate or heavy rain shower, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Clear, 0.0, 0.0, 3.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 47, 61, 56, 46, 38, 49, 49, 58, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 994, 996, 997, 997, 995, 994, 994, 995, 30, 30, 30, 30, 30, 30, 30, 30, 17, 27, 45, 26, 28, 45, 41, 15, 40, 38, 39, 42, 44, 41, 41, 39, 104, 101, 102, 108, 110, 106, 106, 101, 22, 24, 23, 23, 21, 23, 23, 23, 72, 75, 74, 73, 70, 73, 73, 74, 35, 32, 33, 36, 38, 35, 35, 33, 95, 90, 92, 97, 101, 95, 95, 91, 13, 7, 8, 5, 4, 4, 6, 5, 22, 12, 14, 7, 6, 7, 10, 8, 40, 38, 39, 42, 44, 41, 41, 39, 104, 101, 102, 108, 110, 106, 106, 101, 1, 1, 1, 1, 1, 1, 1, 1
4                   0, 300, 600, 900, 1200, 1500, 1800, 2100, 31, 30, 31, 34, 35, 35, 35, 32, 88, 85, 89, 94, 95, 95, 95, 89, 6, 4, 3, 7, 6, 2, 2, 6, 10, 6, 5, 11, 9, 4, 4, 9, 117, 44, 133, 305, 311, 258, 211, 61, ESE, NE, SE, NW, NW, WSW, SSW, ENE, 356, 113, 176, 119, 116, 116, 116, 116, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0025_light_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0003_white_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Moderate or heavy rain shower, Clear, Patchy rain possible, Cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, 3.6, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 61, 66, 59, 48, 46, 47, 48, 59, 7, 10, 9, 10, 10, 10, 10, 10, 4, 6, 5, 6, 6, 6, 6, 6, 995, 995, 996, 998, 996, 994, 995, 997, 30, 30, 30, 30, 30, 30, 30, 30, 26, 24, 29, 69, 46, 37, 41, 26, 38, 36, 38, 40, 41, 42, 42, 38, 100, 97, 101, 104, 106, 108, 107, 100, 24, 24, 24, 22, 22, 23, 23, 23, 75, 75, 74, 72, 72, 73, 73, 74, 32, 31, 33, 35, 35, 36, 36, 33, 90, 87, 91, 94, 96, 97, 96, 91, 11, 4, 5, 8, 6, 3, 3, 7, 18, 7, 8, 12, 10, 5, 5, 11, 38, 36, 38, 40, 41, 42, 42, 38, 100, 97, 101, 104, 106, 108, 107, 100, 1, 1, 1, 1, 1, 1, 1, 1
5                                                                                               0, 300, 600, 900, 1200, 1500, 1800, 2100, 31, 30, 32, 36, 39, 38, 37, 34, 88, 87, 90, 97, 102, 100, 98, 94, 4, 3, 5, 3, 7, 7, 6, 3, 7, 5, 8, 5, 11, 11, 10, 5, 69, 116, 134, 280, 315, 306, 311, 317, ENE, ESE, SE, W, NW, NW, NW, NW, 113, 113, 356, 113, 113, 113, 113, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Clear, Moderate or heavy rain shower, Sunny, Sunny, Sunny, Clear, Clear, 0.0, 0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 60, 65, 57, 45, 33, 36, 39, 47, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 996, 996, 998, 999, 997, 995, 996, 998, 30, 30, 30, 30, 30, 30, 30, 30, 24, 23, 14, 13, 17, 17, 17, 13, 39, 38, 40, 46, 45, 44, 43, 41, 102, 100, 104, 115, 113, 111, 109, 106, 24, 24, 24, 24, 21, 21, 21, 22, 75, 76, 75, 75, 69, 70, 70, 72, 33, 32, 34, 38, 40, 39, 38, 36, 91, 89, 92, 100, 103, 102, 100, 96, 5, 4, 7, 3, 8, 8, 7, 3, 8, 6, 12, 5, 13, 12, 11, 5, 39, 38, 40, 46, 45, 44, 43, 41, 102, 100, 104, 115, 113, 111, 109, 106, 1, 1, 1, 1, 1, 1, 1, 1
6                                                                                                                     0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 32, 35, 37, 38, 37, 36, 34, 93, 90, 95, 98, 101, 99, 97, 93, 4, 3, 3, 7, 8, 7, 7, 3, 6, 5, 4, 12, 13, 11, 12, 5, 282, 321, 324, 305, 296, 295, 301, 322, WNW, NW, NW, NW, WNW, WNW, WNW, NW, 113, 113, 116, 113, 113, 113, 113, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Clear, Partly cloudy, Sunny, Sunny, Sunny, Clear, Clear, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 48, 53, 45, 37, 32, 37, 40, 46, 10, 10, 10, 10, 10, 10, 10, 10, 6, 6, 6, 6, 6, 6, 6, 6, 997, 997, 998, 999, 997, 995, 996, 997, 30, 30, 30, 30, 30, 30, 30, 30, 18, 24, 27, 25, 13, 19, 19, 4, 41, 39, 42, 42, 43, 43, 41, 43, 106, 103, 107, 108, 110, 109, 106, 109, 23, 23, 22, 21, 19, 21, 21, 23, 73, 73, 72, 69, 67, 70, 69, 73, 35, 34, 36, 38, 39, 38, 37, 36, 96, 93, 97, 100, 103, 100, 98, 98, 5, 4, 3, 9, 9, 8, 9, 3, 7, 6, 5, 14, 15, 13, 14, 5, 41, 39, 42, 42, 43, 43, 41, 43, 106, 103, 107, 108, 110, 109, 106, 109, 1, 1, 1, 1, 1, 1, 1, 1

Wheat yield

Downloaded in csv format and then read in

# A tibble: 6 x 5
  `District/Year` `2018` `2017` `2016` `2015`
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>
1 Gurdaspur         4733   4606   4654   3356
2 Pathankot         4018   4066   4049   2727
3 Amritsar          4866   4948   4478   3914
4 Tarn Taran        4685   4791   4524   4132
5 Kapurthala        4842   4922   4259   3996
6 Jalandhar         5063   4847   4456   4102

Cleaning the data

Upon review of the pollution data extracted above, we discover a column called ‘qc_name’ which denotes if the values contained within the row have gone through a quality check. We also noted a significant amount of rows where either aqi or the pm concentration (raw_cont) were negative.

# A tibble: 1 x 1
      n
  <int>
1  2843
# A tibble: 1 x 1
      n
  <int>
1  2552
# A tibble: 1 x 1
      n
  <int>
1   780

Note that even though upwards of 2000 rows have negative values only 780 rows are marked as an invalid qc check. As such to clean our data, we will remove all negative values from the data.

# A tibble: 1 x 1
      n
  <int>
1     0
# A tibble: 1 x 1
      n
  <int>
1     0

Even after trying multiple times I was unable to convert the ‘date’ field into a date time object. However, the dataset also has three columns containing the day month and year. We will use this to create a date column called ‘dt’. Once this is accomplished we will reduce our dataset down to only the relevant columns.

     site            parameter           date_lt               year     
 Length:38725       Length:38725       Length:38725       Min.   :2015  
 Class :character   Class :character   Class :character   1st Qu.:2016  
 Mode  :character   Mode  :character   Mode  :character   Median :2017  
                                                          Mean   :2017  
    month               day                hour           now_cast_conc   
 Length:38725       Length:38725       Length:38725       Min.   :   0.3  
 Class :character   Class :character   Class :character   1st Qu.:  39.1  
 Mode  :character   Mode  :character   Mode  :character   Median :  69.2  
                                                          Mean   : 105.6  
      aqi        aqi_category          raw_conc      conc_unit        
 Min.   :  1.0   Length:38725       Min.   :  1.0   Length:38725      
 1st Qu.:110.0   Class :character   1st Qu.: 38.0   Class :character  
 Median :158.0   Mode  :character   Median : 69.0   Mode  :character  
 Mean   :170.8                      Mean   :105.7                     
   duration           qc_name             date1          
 Length:38725       Length:38725       Length:38725      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
       dt                     
 Min.   :2015-02-24 06:00:00  
 1st Qu.:2016-04-21 21:00:00  
 Median :2017-06-21 18:00:00  
 Mean   :2017-07-01 19:41:06  
 [ reached getOption("max.print") -- omitted 2 rows ]
       dt                           aqi             year     
 Min.   :2015-02-24 06:00:00   Min.   :  1.0   Min.   :2015  
 1st Qu.:2016-04-21 21:00:00   1st Qu.:110.0   1st Qu.:2016  
 Median :2017-06-21 18:00:00   Median :158.0   Median :2017  
 Mean   :2017-07-01 19:41:06   Mean   :170.8   Mean   :2017  
 3rd Qu.:2018-08-29 00:00:00   3rd Qu.:195.0   3rd Qu.:2018  
 Max.   :2019-12-01 00:00:00   Max.   :878.0   Max.   :2019  
    month               day               raw_conc    
 Length:38725       Length:38725       Min.   :  1.0  
 Class :character   Class :character   1st Qu.: 38.0  
 Mode  :character   Mode  :character   Median : 69.0  
                                       Mean   :105.7  
                                       3rd Qu.:140.0  
                                       Max.   :998.0  

The pollution data is still in an hourly format. We will resample into daily averages:

Now our pollution data is ready.

The hourly weather data currently is stored as a list within the wdf dataframe. To make it useful, we will first have to unlist it.

# A tibble: 6 x 45
  date  sunrise sunset moonrise moonset moon_phase moon_illuminati… maxtempC
  <chr> <chr>   <chr>  <chr>    <chr>   <chr>      <chr>            <chr>   
1 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
2 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
3 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
4 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
5 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
6 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75               38      
# … with 37 more variables: maxtempF <chr>, mintempC <chr>, mintempF <chr>,
#   avgtempC <chr>, avgtempF <chr>, totalSnow_cm <chr>, sunHour <chr>,
#   uvIndex <chr>, time <chr>, tempC <chr>, tempF <chr>, windspeedMiles <chr>,
#   windspeedKmph <chr>, winddirDegree <chr>, winddir16Point <chr>,
#   weatherCode <chr>, weatherIconUrl <list>, weatherDesc <list>,
#   precipMM <chr>, precipInches <chr>, humidity <chr>, visibility <chr>,
#   visibilityMiles <chr>, pressure <chr>, pressureInches <chr>,
#   cloudcover <chr>, HeatIndexC <chr>, HeatIndexF <chr>, DewPointC <chr>,
#   DewPointF <chr>, WindChillC <chr>, WindChillF <chr>, WindGustMiles <chr>,
#   WindGustKmph <chr>, FeelsLikeC <chr>, FeelsLikeF <chr>, uvIndex1 <chr>

For the purposes of this analysis we will focus on a single column which is the wind speed in miles per hour.Additionally , we will convert the dates from character to a date object and wind from character to numeric.

# A tibble: 6 x 2
  dt          wind
  <date>     <dbl>
1 2018-07-01     3
2 2018-07-01     4
3 2018-07-01     8
4 2018-07-01     9
5 2018-07-01    10
6 2018-07-01     7

In this current format the data is hourly. We are going to resample this data and aggregate the weather data by calculating the daily mean.

# A tibble: 6 x 2
  dt          wind
  <date>     <dbl>
1 2018-07-01  7.62
2 2018-07-02  8.38
3 2018-07-03  5.5 
4 2018-07-04  4.5 
5 2018-07-05  4.75
6 2018-07-06  5.25

Our Weather data is now ready to work with !!


4. Exploration

Comparing pollution data with the wind data

In order to compare the windspeed with pollution levels we are going to join the daily pollution dataset with the daily wind dataset.

Let’s see if we can discern if there is a correlation between pollution and wind:

There seems to be a small negative correlation between aqi and wind. We can confirm with a correlation matrix:

As expected there is a correlation between air quality, pm2.5 and wind with a slightly higher correlation with aqi.

5. Model building

Due to the small size of the dataset (176 observations) we will build a default model to see if we can predict aqi based on wind speed


Call:
lm(formula = aqi ~ wind, data = stats)

Residuals:
     Min       1Q   Median       3Q      Max 
-147.635  -59.684    0.988   54.599  209.235 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   269.23      15.94  16.895  < 2e-16 ***
wind          -16.54       2.97  -5.569 9.64e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 77.37 on 173 degrees of freedom
Multiple R-squared:  0.152, Adjusted R-squared:  0.1471 
F-statistic: 31.01 on 1 and 173 DF,  p-value: 9.636e-08

Is there a relationship between predictor and response variable? F=31.01 is far greater than 1. It can be concluded that there is a relationship between predictor and response variable. However with a R2 score of .152 this would not be considered a very good model.

Conclusion: What does all mean?

My conclusions from the analysis above are that I find all of the claims made by the Delhi Government suspect because :

  1. it does not seem that pollution has really decreased by much between 2015-2019( avg. aqi of 162.44 and 160.07 or -1.46%)

  2. any decrease in the pollution can potentially be attributed to a general increase in average wind speed

  3. even though the yields of agriculture have gone up the pollution has decreased marginally (yields of wheat increased by 18 % between 2015 and 2018)

Next steps would be to try and source more accurate data , to do more indepth modeling and use predictors other than wind.

12/11/2019