Air quality of New Delhi
1. Background:
According to a recent study published in the journal Lancet Planetary health titled “The impact of air pollution on deaths, disease burden, and life expectancy across the states of India: the Global Burden of Disease Study 2017” nearly 12.5 percent of total deaths recorded in India for the year of 2017 could be attributed to the toxic air quality. That amounts to 1.24 million people that lost their lives in 2017 due to pollution. Furthermore, the worsening pollution in the capital city of India has recently been the source of numerous debates and news articles. The problem is so severe and pervasive that the state government of Delhi was forced to close all schools in the metro region for multiple days.
As a former native of the country and having family currently residing in Delhi, I wanted to the data on Delhi air pollution over the past 5 years and see if I could identify any trends/factors that have been contributing to the decline in air quality. The government of Delhi has made several claims and in this project my focus will be to test a few of them: In this project I will be doing the following:
- Claim: The pollution might be bad but it has been on a downward trend year over year.
To test this claim I will obtain air quality data and compare the trends year over year
- Claim: The Delhi government has put forward a hypothesis that increase in the breathable particulate matter is a direct result of “stubble burning” by farmers in the neighboring state of Punjab.
To test this claim I will review the annual harvest yields for a winter crop such as what and see if the yields seem to have a correlation with the pollution.
- Claim: Pollution has been bad because of prevailing wind conditions.
To test this claim I will source weather data and compare these against the pollution numbers to determine if there is a correlation. Additionally, I will also determine the prevailing wind direction and see if westerly winds have a positive correlation with pollution ( since Punjab is directly west of New Delhi, it would follow that westerly winds would carry more stubble smoke into Delhi).
2. The Workflow
In order to be as effective and efficient as possible I followed the OSEMN data science workflow. The following sections are arranged to coincide with this workflow
3a. Gather the Data
Pollution Data
Sourcing the data for pollution in Delhi was fairly straight forward. The US Embassy in New Delhi records daily level of various common pollutants and makes them publicly available here. For the purposes of this project we focused on two indicators of pollution namely: - The Air Quality Index (AQI) reason being that the AQI is an aggregated index calculated based on the air concentration of various pollutants - pm2.5 which is the concentration of particulate matter 2.5 micrometers or less (in diameter) in the air. According to the EPA “When inhaled, particle pollution can travel deep into the lungs and cause or aggravate heart and lung diseases. Some”primary sources" of this pollutant are – Incomplete combustion – Automobile emissions – Dust – Cooking Hence pm2.5 is widely considered as a effective measure of overall pollution.
Time period: 2015-2019(YTD)
Frequency - Hourly
Weather Data
Obtaining the weather data proved to be a bit more challenging that I had anticipated. The era of the free api is over. In the end there was no convenient way to obtain bulk historical weather data and I ended up using the World Weather Online api which provided 2 months of free usage. However, that the api is limited to providing a single month of historical data at a time. As such, I had to reduce my temporal scope where weather data was required.
Time period: July 2018 - March 2019
Frequency - every 3 hours
3b. Extract and clean the data
Extraction
Pollution
The pollution data was downloaded from the link provided above in a csv format. The column names and formatting was not very conducive for efficient coding, as such I was able to find a great function which helps standardize data upon import. This function can be found here. The credit for this function belongs to William Doane.
# Clean column names function
clean_names <- function(.data, unique = FALSE) {
n <- if (is.data.frame(.data))
colnames(.data) else .data
n <- gsub("%+", "_pct_", n)
n <- gsub("\\$+", "_dollars_", n)
n <- gsub("\\++", "_plus_", n)
n <- gsub("-+", "_minus_", n)
n <- gsub("\\*+", "_star_", n)
n <- gsub("#+", "_cnt_", n)
n <- gsub("&+", "_and_", n)
n <- gsub("@+", "_at_", n)
n <- gsub("[^a-zA-Z0-9_]+", "_", n)
n <- gsub("([A-Z][a-z])", "_\\1", n)
n <- tolower(trimws(n))
n <- gsub("(^_+|_+$)", "", n)
n <- gsub("_+", "_", n)
if (unique)
n <- make.unique(n, sep = "_")
if (is.data.frame(.data)) {
colnames(.data) <- n
.data
} else {
n
}
}
library(tidyverse)
# Importing the data and applying the cleaning function
del19 <- read_csv("https://raw.githubusercontent.com/DevMeh/FinalProject/master/NewDelhi_PM2.5_2019_YTD.csv") %>%
clean_names()
del18 <- read_csv("https://raw.githubusercontent.com/DevMeh/FinalProject/master/NewDelhi_PM2.5_2018_YTD.csv") %>%
clean_names()
del17 <- read_csv("https://raw.githubusercontent.com/DevMeh/FinalProject/master/NewDelhi_PM2.5_2017_YTD.csv") %>%
clean_names()
del16 <- read_csv("https://raw.githubusercontent.com/DevMeh/FinalProject/master/NewDelhi_PM2.5_2016_YTD.csv") %>%
clean_names()
del15 <- read_csv("https://raw.githubusercontent.com/DevMeh/FinalProject/master/NewDelhi_PM2.5_2015_YTD.csv") %>%
clean_names()
# Binding the datasets together to create a single dataset
del <- rbind(del15, del16, del17, del18, del19)
# Reviewing my data
head(del)# A tibble: 6 x 14
site parameter date_lt year month day hour now_cast_conc aqi
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 New … PM2.5 - … 2015-0… 2015 01 01 01 -999 -999
2 New … PM2.5 - … 2015-0… 2015 01 01 02 -999 -999
3 New … PM2.5 - … 2015-0… 2015 01 01 03 -999 -999
4 New … PM2.5 - … 2015-0… 2015 01 01 04 -999 -999
5 New … PM2.5 - … 2015-0… 2015 01 01 05 -999 -999
6 New … PM2.5 - … 2015-0… 2015 01 01 06 -999 -999
# … with 5 more variables: aqi_category <chr>, raw_conc <dbl>, conc_unit <chr>,
# duration <chr>, qc_name <chr>
Weather data
Obtained from the api listed above one month at a time.
library(jsonlite)
jul <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-07-01&enddate=2018-07-31")
aug <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-08-01&enddate=2018-08-31")
sep <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-09-01&enddate=2018-09-30")
oct <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-10-01&enddate=2018-10-31")
nov <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-11-01&enddate=2018-11-30")
dec <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2018-12-01&enddate=2018-12-31")
jan <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-01-01&enddate=2019-01-31")
feb <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-02-01&enddate=2019-02-27")
mar <- fromJSON("http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=79fdc1f7e84f4515b69112551191112&q=new%20delhi,%20india&format=json&date=2019-03-01&enddate=2019-03-31")
# bind the datasets together
wdf <- rbind(jul$data$weather, aug$data$weather, sep$data$weatheroct$data$weather,
nov$data$weather, dec$data$weather, jan$data$weather, feb$data$weather, mar$data$weather)
# review my data
head(wdf) date astronomy
1 2018-07-01 05:27 AM, 07:23 PM, 09:42 PM, 07:57 AM, Waning Gibbous, 75
2 2018-07-02 05:28 AM, 07:23 PM, 10:19 PM, 08:49 AM, Waning Gibbous, 68
3 2018-07-03 05:28 AM, 07:23 PM, 10:55 PM, 09:42 AM, Waning Gibbous, 61
4 2018-07-04 05:28 AM, 07:23 PM, 11:30 PM, 10:35 AM, Last Quarter, 54
5 2018-07-05 05:29 AM, 07:23 PM, No moonrise, 11:29 AM, Last Quarter, 46
6 2018-07-06 05:29 AM, 07:23 PM, 12:04 AM, 12:23 PM, Last Quarter, 39
maxtempC maxtempF mintempC mintempF avgtempC avgtempF totalSnow_cm sunHour
1 38 101 32 89 35 95 0.0 13.3
2 39 101 32 90 35 96 0.0 14.0
3 37 99 31 89 34 93 0.0 14.0
4 35 95 30 85 33 91 0.0 13.3
5 39 102 30 87 35 95 0.0 13.9
6 38 101 32 90 35 96 0.0 13.9
uvIndex
1 1
2 1
3 1
4 1
5 1
6 1
hourly
1 0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 32, 34, 37, 38, 37, 36, 34, 92, 89, 93, 98, 101, 99, 97, 93, 3, 4, 8, 9, 10, 7, 9, 11, 5, 7, 13, 15, 16, 11, 14, 18, 263, 260, 275, 283, 277, 262, 266, 250, W, W, W, WNW, W, W, W, WSW, 116, 119, 356, 116, 116, 116, 116, 119, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Partly cloudy, Cloudy, Moderate or heavy rain shower, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Cloudy, 0.0, 0.0, 2.8, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 52, 59, 51, 44, 39, 42, 45, 58, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 998, 998, 999, 999, 997, 995, 995, 996, 30, 30, 30, 30, 30, 30, 30, 30, 47, 64, 68, 55, 46, 44, 48, 78, 41, 38, 40, 44, 46, 45, 43, 41, 106, 101, 105, 111, 114, 112, 110, 106, 23, 24, 23, 23, 22, 23, 23, 24, 74, 74, 73, 73, 72, 73, 73, 76, 35, 33, 34, 37, 39, 38, 37, 34, 94, 91, 94, 99, 102, 100, 98, 93, 3, 5, 13, 11, 11, 8, 10, 13, 5, 8, 21, 17, 18, 13, 17, 21, 41, 38, 40, 44, 46, 45, 43, 41, 106, 101, 105, 111, 114, 112, 110, 106, 1, 1, 1, 1, 1, 1, 1, 1
2 0, 300, 600, 900, 1200, 1500, 1800, 2100, 33, 32, 34, 35, 39, 38, 37, 35, 91, 90, 93, 95, 101, 101, 99, 94, 13, 8, 11, 10, 9, 5, 2, 9, 21, 13, 17, 16, 14, 8, 4, 15, 270, 264, 263, 259, 269, 258, 226, 198, W, W, W, W, W, WSW, SW, SSW, 356, 116, 176, 116, 176, 113, 113, 116, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0025_light_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0009_light_rain_showers.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Moderate or heavy rain shower, Partly cloudy, Patchy rain possible, Partly cloudy, Patchy rain possible, Sunny, Clear, Partly cloudy, 2.8, 0.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 58, 57, 49, 47, 36, 34, 36, 45, 7, 10, 9, 10, 9, 10, 10, 10, 4, 6, 5, 6, 5, 6, 6, 6, 994, 995, 995, 996, 994, 994, 994, 995, 30, 30, 30, 30, 30, 30, 30, 30, 60, 45, 26, 25, 21, 15, 19, 41, 40, 38, 40, 42, 45, 45, 43, 39, 103, 100, 103, 107, 113, 112, 110, 103, 24, 23, 22, 23, 21, 21, 21, 21, 75, 73, 72, 73, 70, 69, 69, 70, 33, 33, 35, 36, 39, 39, 38, 35, 92, 91, 94, 96, 103, 103, 101, 95, 21, 10, 13, 12, 10, 5, 3, 12, 33, 16, 21, 19, 17, 9, 5, 20, 40, 38, 40, 42, 45, 45, 43, 39, 103, 100, 103, 107, 113, 112, 110, 103, 1, 1, 1, 1, 1, 1, 1, 1
3 0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 31, 32, 35, 37, 34, 34, 32, 94, 89, 90, 95, 99, 94, 94, 89, 12, 6, 6, 4, 3, 4, 5, 4, 19, 10, 9, 6, 5, 6, 9, 7, 241, 215, 219, 26, 167, 72, 134, 116, WSW, SW, SW, NNE, SSE, ENE, SE, ESE, 113, 116, 356, 116, 116, 116, 116, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Partly cloudy, Moderate or heavy rain shower, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Clear, 0.0, 0.0, 3.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 47, 61, 56, 46, 38, 49, 49, 58, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 994, 996, 997, 997, 995, 994, 994, 995, 30, 30, 30, 30, 30, 30, 30, 30, 17, 27, 45, 26, 28, 45, 41, 15, 40, 38, 39, 42, 44, 41, 41, 39, 104, 101, 102, 108, 110, 106, 106, 101, 22, 24, 23, 23, 21, 23, 23, 23, 72, 75, 74, 73, 70, 73, 73, 74, 35, 32, 33, 36, 38, 35, 35, 33, 95, 90, 92, 97, 101, 95, 95, 91, 13, 7, 8, 5, 4, 4, 6, 5, 22, 12, 14, 7, 6, 7, 10, 8, 40, 38, 39, 42, 44, 41, 41, 39, 104, 101, 102, 108, 110, 106, 106, 101, 1, 1, 1, 1, 1, 1, 1, 1
4 0, 300, 600, 900, 1200, 1500, 1800, 2100, 31, 30, 31, 34, 35, 35, 35, 32, 88, 85, 89, 94, 95, 95, 95, 89, 6, 4, 3, 7, 6, 2, 2, 6, 10, 6, 5, 11, 9, 4, 4, 9, 117, 44, 133, 305, 311, 258, 211, 61, ESE, NE, SE, NW, NW, WSW, SSW, ENE, 356, 113, 176, 119, 116, 116, 116, 116, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0025_light_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0003_white_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0002_sunny_intervals.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, Moderate or heavy rain shower, Clear, Patchy rain possible, Cloudy, Partly cloudy, Partly cloudy, Partly cloudy, Partly cloudy, 3.6, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 61, 66, 59, 48, 46, 47, 48, 59, 7, 10, 9, 10, 10, 10, 10, 10, 4, 6, 5, 6, 6, 6, 6, 6, 995, 995, 996, 998, 996, 994, 995, 997, 30, 30, 30, 30, 30, 30, 30, 30, 26, 24, 29, 69, 46, 37, 41, 26, 38, 36, 38, 40, 41, 42, 42, 38, 100, 97, 101, 104, 106, 108, 107, 100, 24, 24, 24, 22, 22, 23, 23, 23, 75, 75, 74, 72, 72, 73, 73, 74, 32, 31, 33, 35, 35, 36, 36, 33, 90, 87, 91, 94, 96, 97, 96, 91, 11, 4, 5, 8, 6, 3, 3, 7, 18, 7, 8, 12, 10, 5, 5, 11, 38, 36, 38, 40, 41, 42, 42, 38, 100, 97, 101, 104, 106, 108, 107, 100, 1, 1, 1, 1, 1, 1, 1, 1
5 0, 300, 600, 900, 1200, 1500, 1800, 2100, 31, 30, 32, 36, 39, 38, 37, 34, 88, 87, 90, 97, 102, 100, 98, 94, 4, 3, 5, 3, 7, 7, 6, 3, 7, 5, 8, 5, 11, 11, 10, 5, 69, 116, 134, 280, 315, 306, 311, 317, ENE, ESE, SE, W, NW, NW, NW, NW, 113, 113, 356, 113, 113, 113, 113, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0026_heavy_rain_showers_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Clear, Moderate or heavy rain shower, Sunny, Sunny, Sunny, Clear, Clear, 0.0, 0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 60, 65, 57, 45, 33, 36, 39, 47, 10, 10, 7, 10, 10, 10, 10, 10, 6, 6, 4, 6, 6, 6, 6, 6, 996, 996, 998, 999, 997, 995, 996, 998, 30, 30, 30, 30, 30, 30, 30, 30, 24, 23, 14, 13, 17, 17, 17, 13, 39, 38, 40, 46, 45, 44, 43, 41, 102, 100, 104, 115, 113, 111, 109, 106, 24, 24, 24, 24, 21, 21, 21, 22, 75, 76, 75, 75, 69, 70, 70, 72, 33, 32, 34, 38, 40, 39, 38, 36, 91, 89, 92, 100, 103, 102, 100, 96, 5, 4, 7, 3, 8, 8, 7, 3, 8, 6, 12, 5, 13, 12, 11, 5, 39, 38, 40, 46, 45, 44, 43, 41, 102, 100, 104, 115, 113, 111, 109, 106, 1, 1, 1, 1, 1, 1, 1, 1
6 0, 300, 600, 900, 1200, 1500, 1800, 2100, 34, 32, 35, 37, 38, 37, 36, 34, 93, 90, 95, 98, 101, 99, 97, 93, 4, 3, 3, 7, 8, 7, 7, 3, 6, 5, 4, 12, 13, 11, 12, 5, 282, 321, 324, 305, 296, 295, 301, 322, WNW, NW, NW, NW, WNW, WNW, WNW, NW, 113, 113, 116, 113, 113, 113, 113, 113, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0004_black_low_cloud.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0001_sunny.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, http://cdn.worldweatheronline.net/images/wsymbols01_png_64/wsymbol_0008_clear_sky_night.png, Clear, Clear, Partly cloudy, Sunny, Sunny, Sunny, Clear, Clear, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 48, 53, 45, 37, 32, 37, 40, 46, 10, 10, 10, 10, 10, 10, 10, 10, 6, 6, 6, 6, 6, 6, 6, 6, 997, 997, 998, 999, 997, 995, 996, 997, 30, 30, 30, 30, 30, 30, 30, 30, 18, 24, 27, 25, 13, 19, 19, 4, 41, 39, 42, 42, 43, 43, 41, 43, 106, 103, 107, 108, 110, 109, 106, 109, 23, 23, 22, 21, 19, 21, 21, 23, 73, 73, 72, 69, 67, 70, 69, 73, 35, 34, 36, 38, 39, 38, 37, 36, 96, 93, 97, 100, 103, 100, 98, 98, 5, 4, 3, 9, 9, 8, 9, 3, 7, 6, 5, 14, 15, 13, 14, 5, 41, 39, 42, 42, 43, 43, 41, 43, 106, 103, 107, 108, 110, 109, 106, 109, 1, 1, 1, 1, 1, 1, 1, 1
Wheat yield
Downloaded in csv format and then read in
# A tibble: 6 x 5
`District/Year` `2018` `2017` `2016` `2015`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Gurdaspur 4733 4606 4654 3356
2 Pathankot 4018 4066 4049 2727
3 Amritsar 4866 4948 4478 3914
4 Tarn Taran 4685 4791 4524 4132
5 Kapurthala 4842 4922 4259 3996
6 Jalandhar 5063 4847 4456 4102
Cleaning the data
Upon review of the pollution data extracted above, we discover a column called ‘qc_name’ which denotes if the values contained within the row have gone through a quality check. We also noted a significant amount of rows where either aqi or the pm concentration (raw_cont) were negative.
# Count of negative values in relevant columns and rows noted having an invalid
# qc check
del %>% filter(aqi < 0) %>% count()# A tibble: 1 x 1
n
<int>
1 2843
# A tibble: 1 x 1
n
<int>
1 2552
# A tibble: 1 x 1
n
<int>
1 780
Note that even though upwards of 2000 rows have negative values only 780 rows are marked as an invalid qc check. As such to clean our data, we will remove all negative values from the data.
# Remove negative values
temp_pol <- filter(del, aqi > 0 & raw_conc > 0)
# Remove any values for aqi and pm2.5 over 999 since the level of 500 is
# considered very hazardous
temp_pol <- filter(temp_pol, aqi < 999 & raw_conc < 999)
# sanity check
temp_pol %>% filter(aqi < 0) %>% count()# A tibble: 1 x 1
n
<int>
1 0
# A tibble: 1 x 1
n
<int>
1 0
Even after trying multiple times I was unable to convert the ‘date’ field into a date time object. However, the dataset also has three columns containing the day month and year. We will use this to create a date column called ‘dt’. Once this is accomplished we will reduce our dataset down to only the relevant columns.
library(lubridate)
temp1 <- temp_pol %>% mutate(date1 = paste(year, month, day, hour), dt = ymd_h(date1))
summary(temp1) site parameter date_lt year
Length:38725 Length:38725 Length:38725 Min. :2015
Class :character Class :character Class :character 1st Qu.:2016
Mode :character Mode :character Mode :character Median :2017
Mean :2017
month day hour now_cast_conc
Length:38725 Length:38725 Length:38725 Min. : 0.3
Class :character Class :character Class :character 1st Qu.: 39.1
Mode :character Mode :character Mode :character Median : 69.2
Mean : 105.6
aqi aqi_category raw_conc conc_unit
Min. : 1.0 Length:38725 Min. : 1.0 Length:38725
1st Qu.:110.0 Class :character 1st Qu.: 38.0 Class :character
Median :158.0 Mode :character Median : 69.0 Mode :character
Mean :170.8 Mean :105.7
duration qc_name date1
Length:38725 Length:38725 Length:38725
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
dt
Min. :2015-02-24 06:00:00
1st Qu.:2016-04-21 21:00:00
Median :2017-06-21 18:00:00
Mean :2017-07-01 19:41:06
[ reached getOption("max.print") -- omitted 2 rows ]
dt aqi year
Min. :2015-02-24 06:00:00 Min. : 1.0 Min. :2015
1st Qu.:2016-04-21 21:00:00 1st Qu.:110.0 1st Qu.:2016
Median :2017-06-21 18:00:00 Median :158.0 Median :2017
Mean :2017-07-01 19:41:06 Mean :170.8 Mean :2017
3rd Qu.:2018-08-29 00:00:00 3rd Qu.:195.0 3rd Qu.:2018
Max. :2019-12-01 00:00:00 Max. :878.0 Max. :2019
month day raw_conc
Length:38725 Length:38725 Min. : 1.0
Class :character Class :character 1st Qu.: 38.0
Mode :character Mode :character Median : 69.0
Mean :105.7
3rd Qu.:140.0
Max. :998.0
# Convert month and day to numeric for graphing and grouping
data <- mutate(data, monthn = as.numeric(month))
data <- mutate(data, dayn = as.numeric(day))The pollution data is still in an hourly format. We will resample into daily averages:
library(tidyquant)
dfday <- data %>% tq_transmute(mutate_fun = apply.daily, FUN = mean, na.rm = TRUE)Now our pollution data is ready.
The hourly weather data currently is stored as a list within the wdf dataframe. To make it useful, we will first have to unlist it.
# A tibble: 6 x 45
date sunrise sunset moonrise moonset moon_phase moon_illuminati… maxtempC
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
2 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
3 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
4 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
5 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
6 2018… 05:27 … 07:23… 09:42 PM 07:57 … Waning Gi… 75 38
# … with 37 more variables: maxtempF <chr>, mintempC <chr>, mintempF <chr>,
# avgtempC <chr>, avgtempF <chr>, totalSnow_cm <chr>, sunHour <chr>,
# uvIndex <chr>, time <chr>, tempC <chr>, tempF <chr>, windspeedMiles <chr>,
# windspeedKmph <chr>, winddirDegree <chr>, winddir16Point <chr>,
# weatherCode <chr>, weatherIconUrl <list>, weatherDesc <list>,
# precipMM <chr>, precipInches <chr>, humidity <chr>, visibility <chr>,
# visibilityMiles <chr>, pressure <chr>, pressureInches <chr>,
# cloudcover <chr>, HeatIndexC <chr>, HeatIndexF <chr>, DewPointC <chr>,
# DewPointF <chr>, WindChillC <chr>, WindChillF <chr>, WindGustMiles <chr>,
# WindGustKmph <chr>, FeelsLikeC <chr>, FeelsLikeF <chr>, uvIndex1 <chr>
For the purposes of this analysis we will focus on a single column which is the wind speed in miles per hour.Additionally , we will convert the dates from character to a date object and wind from character to numeric.
temp <- weather %>% mutate(dt = ymd(date))
vars <- c("dt", "windspeedMiles")
wind_data <- temp[vars]
wind_data$wind <- as.numeric(wind_data$windspeedMiles)
wind <- select(wind_data, c(dt, wind))
head(wind)# A tibble: 6 x 2
dt wind
<date> <dbl>
1 2018-07-01 3
2 2018-07-01 4
3 2018-07-01 8
4 2018-07-01 9
5 2018-07-01 10
6 2018-07-01 7
In this current format the data is hourly. We are going to resample this data and aggregate the weather data by calculating the daily mean.
library(tidyquant)
weatherday <- wind %>% tq_transmute(mutate_fun = apply.daily, FUN = mean, na.rm = TRUE)
head(weatherday)# A tibble: 6 x 2
dt wind
<date> <dbl>
1 2018-07-01 7.62
2 2018-07-02 8.38
3 2018-07-03 5.5
4 2018-07-04 4.5
5 2018-07-05 4.75
6 2018-07-06 5.25
Our Weather data is now ready to work with !!
4. Exploration
Pollution
Visualizing the pollution trends for pm2.5. We will resample again to a weekly frequency in order to reduce noise and volatility within the data
library(tidyquant)
dfweek <- data %>% tq_transmute(mutate_fun = apply.weekly, FUN = mean, na.rm = TRUE)
# pm2.5 weekly
ggplot(dfweek, aes(x = dt, y = raw_conc)) + geom_line(aes(color = year))# pm2.5 weekly
ggplot(dfweek, aes(x = monthn, y = raw_conc, group = year)) + geom_line(aes(color = year))We note that even Weekly data is too volatile due to data quality issues so we will review this data on a monthly cadence.
dfmonth <- data %>% tq_transmute(mutate_fun = apply.monthly, FUN = mean, na.rm = TRUE)
ggplot(dfmonth, aes(x = monthn, y = raw_conc, group = year)) + geom_line(aes(color = factor(year)))# AQI
ggplot(dfmonth, aes(x = monthn, y = aqi, group = year)) + geom_line(aes(color = factor(year)))Comparing pollution data with the wind data
In order to compare the windspeed with pollution levels we are going to join the daily pollution dataset with the daily wind dataset.
# convert the date time object to date time
dfday$dt <- as_date(dfday$dt)
# Join
final_day <- left_join(weatherday, dfday, by = "dt")Let’s see if we can discern if there is a correlation between pollution and wind:
# Scatter between pm2.5 and wind
ggplot(final_day, aes(x = raw_conc, y = wind, group = year)) + geom_point(aes(color = factor(year)))# Scatter between aqi and wind
ggplot(final_day, aes(x = aqi, y = wind, group = year)) + geom_point(aes(color = factor(year)))There seems to be a small negative correlation between aqi and wind. We can confirm with a correlation matrix:
# create a temp dataset for the correlation matrix
stats <- select(final_day, c(aqi, wind, raw_conc))
# remove any NAs
stats <- drop_na(stats)
library(corrplot)
corrplot(cor(stats), method = "color", addCoef.col = "black", tl.col = "black")As expected there is a correlation between air quality, pm2.5 and wind with a slightly higher correlation with aqi.
Wheat Yield
# Creating a long form dataset
wheat <- pivot_longer(pun, cols = starts_with("20"), names_to = "year", values_to = "prod")
# filtering out Punjab
wtemp <- filter(wheat, wheat[1] == "Punjab")
# PLotting
ggplot(wtemp, aes(x = year, y = prod, group = 1)) + geom_line()We see that the wheat production has consistently gone up since 2015
5. Model building
Due to the small size of the dataset (176 observations) we will build a default model to see if we can predict aqi based on wind speed
Call:
lm(formula = aqi ~ wind, data = stats)
Residuals:
Min 1Q Median 3Q Max
-147.635 -59.684 0.988 54.599 209.235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 269.23 15.94 16.895 < 2e-16 ***
wind -16.54 2.97 -5.569 9.64e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 77.37 on 173 degrees of freedom
Multiple R-squared: 0.152, Adjusted R-squared: 0.1471
F-statistic: 31.01 on 1 and 173 DF, p-value: 9.636e-08
Is there a relationship between predictor and response variable? F=31.01 is far greater than 1. It can be concluded that there is a relationship between predictor and response variable. However with a R2 score of .152 this would not be considered a very good model.
Conclusion: What does all mean?
My conclusions from the analysis above are that I find all of the claims made by the Delhi Government suspect because :
it does not seem that pollution has really decreased by much between 2015-2019( avg. aqi of 162.44 and 160.07 or -1.46%)
any decrease in the pollution can potentially be attributed to a general increase in average wind speed
even though the yields of agriculture have gone up the pollution has decreased marginally (yields of wheat increased by 18 % between 2015 and 2018)
Next steps would be to try and source more accurate data , to do more indepth modeling and use predictors other than wind.