Where to invest to combat air pollution in India?

Identifying the cities that require immediate attention to their increasing air pollution

Air pollution is most common in large cities where emissions from many different sources are concentrated. Sometimes, mountains or tall buildings prevent air pollution from spreading out. This air pollution often appears as a cloud making the air murky. It is called smog. The word “smog” comes from combining the words “smoke” and “fog.”

Large cities in poor and developing nations tend to have more air pollution than cities in developed nations. According to the World Health Organization (WHO), some of the world's most polluted cities are Karachi, Pakistan; New Delhi, India; Beijing, China; Lima, Peru; and Cairo, Egypt. However, many developed nations also have air pollution problems. Los Angeles, California, is nicknamed Smog City.

Objective

In short, the task is to make a case that would convince a rich uncle to invest in improving the air quality of a given city in India. The case must be backed by data-based evidence and must include a rough plan for how the work should be done and how progress can be measured. A maximum of 3 cities can be chosen from the 25+ cities present in the dataset at the time of this analysis.

Executive Summary

The three cities chosen are Gurugram, Lucknow and Ahmedabad.
Gurugram should receive the initial investment for the next 3 years. If this is successful, investment should go to Lucknow and Ahmedabad before any other city in this dataset.

To improve air quality in Gurugram, it would be best to collaborate with the tech giants based there (almost 50% of Fortune 500 companies have offices in the city). Together with the Haryana administration, they could tackle the problem as part of their CSR activities, which would also improve the living standards of their employees.

About Air Quality Index (AQI)

India’s National Air Quality Index programme was put into effect in the year 2015 as a step to monitor the air quality in the country. It was initially started in 14 cities and later extended to 34. (Source)

As per the AQI classification, any AQI measure can fall into a particular bucket.
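
As a rough illustration of how an AQI value maps to a bucket, the sketch below uses `cut()` with the six categories of India's National Air Quality Index as I understand them (Good, Satisfactory, Moderate, Poor, Very Poor, Severe). The breakpoints and the helper name `classify_aqi` are assumptions made for this example; the official scale stops at 500, and the open upper bound is only a convenience.

```r
# Assumed NAQI breakpoints; upper bounds are inclusive (e.g. 100 is "Satisfactory").
aqi_breaks <- c(0, 50, 100, 200, 300, 400, Inf)
aqi_labels <- c("Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe")

# Hypothetical helper: returns the bucket for each AQI value as a factor.
classify_aqi <- function(aqi) {
  cut(aqi, breaks = aqi_breaks, labels = aqi_labels,
      include.lowest = TRUE, right = TRUE)
}

# Illustrative values only, one per bucket.
example_aqi <- c(42, 100, 101, 250, 350, 480)
example_buckets <- classify_aqi(example_aqi)
print(data.frame(AQI = example_aqi, Bucket = example_buckets))
```

Using `cut()` keeps the boundaries in one place, which makes it easier to check how values that sit exactly on a breakpoint are treated.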

Methodology

Step 1: Understanding the Data

The 5 .csv datasets were studied, and basic exploratory analysis was done on each of them.

Step 2: Formulate Questions

The main problem given was to recommend the top 3 cities for investment. During this step, further questions were formulated to help arrive at the answer to that problem.

Step 3: Data Analysis

Analysis was done on the available records to answer all the questions listed in Step 2.

Step 4: Creating a Model

A model was created to understand which city is likely to go from bad to worse by calculating the probability based on historical data.

Step 5: Understanding the Causes

The factors influencing pollution in the finalized cities were analysed to come up with relevant recommendations.

Step 6: Recommendations and Tracking

A rough plan suggesting ideas to make improvements in the 3 chosen cities was created and a method to track the progress was recommended.

Library Used

library(dplyr)
library(ggplot2)
library(gridExtra)
library(grid)
library(lattice)
library(lubridate)
library(data.table)

Step 1: Understanding the Data

Importing the Datasets

AQI_Station <- read.csv("F:/Work/R Programming/Data/Air Quality Data in India (2015 - 2020)/stations.csv", header=T, sep=',')
Station_Hour <- read.csv("F:/Work/R Programming/Data/Air Quality Data in India (2015 - 2020)/station_hour.csv", header=T, sep=',')
Station_Day <- read.csv("F:/Work/R Programming/Data/Air Quality Data in India (2015 - 2020)/station_day.csv", header=T, sep=',')
City_Hour <- read.csv("F:/Work/R Programming/Data/Air Quality Data in India (2015 - 2020)/city_hour.csv", header=T, sep=',')
City_Day <- read.csv("F:/Work/R Programming/Data/Air Quality Data in India (2015 - 2020)/city_day.csv", header=T, sep=',')

Analysis of available records
If we are going to compare cities, the comparison should ideally be between cities with a comparable number of records. However, some cities have many more records than others, so this had to be taken into consideration while building the case.

# Split the daily data into rows with a missing AQI value (column 15 of city_day.csv) and rows with a recorded AQI
NA_Cases <- City_Day[is.na(City_Day$AQI),]
City_Day <- City_Day[!is.na(City_Day$AQI),]
Records_perCity <- as.data.frame(City_Day %>% group_by(City) %>% tally(sort = TRUE))
Records_perCity$City <- factor(Records_perCity$City,levels=Records_perCity$City)

#plotting the number of daily records available for each city
gp <- ggplot(data=Records_perCity, aes(x=City, y=n, label = n))
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_text(size = 3, position = position_stack(vjust = .9))
gp <- gp + theme(axis.text.x = element_text(angle = 90),
                 panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- gp + ggtitle("Cities with most Daily Records") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("City") + ylab("Number of Days")
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

Which cities have the most faulty records? It is important to inform the administration about faulty monitors so that the AQI is measured properly.

NA_Cases_perCity <- as.data.frame(NA_Cases %>% group_by(City) %>% tally(sort = TRUE))
NA_Cases_perCity$City <- factor(NA_Cases_perCity$City,levels=NA_Cases_perCity$City)
#plotting the NA Cases to understand where monitors might be faulty in each city
gp <- ggplot(data=NA_Cases_perCity, aes(x=City, y=n, label = n))
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_text(size = 3, position = position_stack(vjust = .9))
gp <- gp + theme(axis.text.x = element_text(angle = 90), 
                 panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- gp + ggtitle("Cities with most Faulty Daily Records") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("City") + ylab("Number of Days")
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

Step 2: Formulate Questions

Questions Asked
1. Do we have enough data for all cities to do proper analysis?
2. Why don’t we have enough data for all cities?
3. Which are the worst cities in terms of pollution?
4. What factors are affecting the pollution in each city?
5. What can we do to combat the air pollution?

Step 3: Data Analysis

Which cities have the most stations where AQI is being monitored, and are the monitors installed at these stations enough to measure the AQI effectively? Taking London as a reference, 1 monitor can effectively measure the air quality over an area of 11 sq km. Accordingly, the estimated number of monitors required has been calculated from the area of each city and plotted against the number of monitors currently installed. This is also critical information that needs to be communicated to the administration if we want to measure AQI properly.
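
The rule of thumb above comes down to dividing a city's area by 11 (the factor 0.09 used in the code is approximately 1/11). The sketch below works through that arithmetic for two made-up cities; the city names, areas and station counts are hypothetical and are not taken from the dataset.

```r
# Coverage assumed in the text: one monitor per 11 sq km (the London reference).
area_per_monitor_sqkm <- 11

# Hypothetical city areas and station counts, for illustration only.
cities <- data.frame(
  City = c("CityA", "CityB"),
  Area_sqkm = c(440, 1100),
  Current_Stations = c(5, 38)
)

# Ideal number of stations, rounded up so that the whole area is covered.
cities$Ideal_Stations <- ceiling(cities$Area_sqkm / area_per_monitor_sqkm)

# Shortfall is zero when a city already has enough stations.
cities$Shortfall <- pmax(cities$Ideal_Stations - cities$Current_Stations, 0)
print(cities)
```

Rounding up with `ceiling()` is a design choice for this sketch; the analysis code rounds to the nearest integer instead.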

#Some exploratory research on the number of stations employed in each city
Area_of_Cities <- readRDS("City_Area.RData")
Station_Count <- AQI_Station %>% group_by(City) %>% tally()
most_station_city <- as.data.frame(Station_Count %>% arrange(desc(n)) %>% top_n(10, wt=n))
most_station_city$City <- factor(most_station_city$City,levels=most_station_city$City)
most_station_city$CityArea <- Area_of_Cities
most_station_city$Ideal_Num_Stations <- round(0.09 * most_station_city$CityArea, digits=0)

#plotting the cities along with the number of stations in each city
gp <- ggplot(data=most_station_city, aes(x=City, y=n, label = n))
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_line(data=most_station_city, aes(x=City, y=Ideal_Num_Stations), color=rgb(0.1,0.1,0.7,0.5), group = 1, size = 0.5, linetype="dashed") 
gp <- gp + geom_point(data=most_station_city, aes(x=City, y=Ideal_Num_Stations))
gp <- gp + geom_text(size = 3, position = position_stack(vjust = 1.1))
gp <- gp + theme(axis.text.x = element_text(angle = 90),  
                 panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- gp + scale_color_manual(name = "Stations", labels = c("Current", "Ideal"))
gp <- gp + ggtitle("Cities with most stations vs Ideal number of Stations Required") + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position="right")
gp <- gp + xlab("City") + ylab("Number of Stations")
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

Next, we calculate the number of months in a year in which the average AQI was at or above the threshold (100). The plan is to target cities that see fewer than 6 months of clean air in a year. For the experiment, we would prefer a relatively cleaner city among these, so that the variables leading to polluted air can be controlled effectively.
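
To make the selection rule concrete, here is a minimal base-R sketch on made-up monthly averages for a single year. The city names and AQI values are hypothetical; only the threshold of 100 and the "fewer than 6 clean months" rule come from the text.

```r
threshold <- 100

# Hypothetical monthly average AQI for two cities over one year.
monthly <- data.frame(
  City = rep(c("CityA", "CityB"), each = 12),
  Month = rep(1:12, times = 2),
  Average_AQI = c(
    180, 160, 120, 100, 90, 80, 70, 75, 95, 130, 210, 220,  # CityA
    110, 95, 80, 70, 60, 55, 50, 52, 60, 85, 120, 99        # CityB
  )
)

# A month counts as polluted when its average AQI is at or above the threshold.
monthly$Above <- monthly$Average_AQI >= threshold
months_above <- aggregate(Above ~ City, data = monthly, FUN = sum)

# Target cities with fewer than 6 clean months in the year.
months_above$Clean_Months <- 12 - months_above$Above
months_above$Target <- months_above$Clean_Months < 6
print(months_above)
```

The sketch assumes all 12 months are present for each city. With real data, a year with missing months would have fewer polluted months counted simply because there are fewer months to count.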

#exploratory analysis of City_Day data 
#Converting date into months and year for further analysis 
City_Day$Date <- as.Date(City_Day$Date, "%Y-%m-%d")
City_Day$Month <- as.numeric(format(City_Day$Date,'%m'))
City_Day$Year <- as.numeric(format(City_Day$Date,'%Y'))

#saving monthly data - The output will be the average AQI for each city per month for each year 
City_Month <- City_Day %>% group_by(City, Year, Month) %>% summarize(Average_AQI = mean(AQI, na.rm = TRUE))

threshold <- 100

No_of_Months_with_AQI_thresh <- City_Month %>% group_by(City, Year) %>% summarize(Number_Months = length(which(Average_AQI>=threshold))) 
#Calculating average number of months for each city with AQI more than threshold
No_of_Months_with_AQI_thresh<- No_of_Months_with_AQI_thresh %>% group_by(City) %>% mutate(Average_Number_Months = mean(Number_Months))
#Removing the cities which have 0 such months to make data look tidy
No_of_Months_with_AQI_thresh <- No_of_Months_with_AQI_thresh[!(No_of_Months_with_AQI_thresh$Average_Number_Months==0),]
#plotting the data
Cities_Months_AQI_thresh <- No_of_Months_with_AQI_thresh %>% group_by(City) %>% summarize(Average_Number_Months = mean(Number_Months))
Cities_Months_AQI_thresh <- Cities_Months_AQI_thresh %>% arrange(desc(Average_Number_Months)) %>% top_n(10, wt=Average_Number_Months)
Cities_Months_AQI_thresh$City <- factor(Cities_Months_AQI_thresh$City,levels=Cities_Months_AQI_thresh$City)
Cities_Months_AQI_thresh$Average_Number_Months <- round(Cities_Months_AQI_thresh$Average_Number_Months, digits = 1)

gp <- ggplot(Cities_Months_AQI_thresh, aes(y=Average_Number_Months, x=City, label = Average_Number_Months)) 
gp <- gp + geom_bar(color=rgb(1,0.5,0.4,0.5), fill=rgb(1,0.5,0.4,0.5), stat="identity") 
gp <- gp + geom_text(size = 3, position = position_stack(vjust = 0.95))
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = rgb(0.8,0.1,0.3,0.2), size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(1,0.5,0.4,0.1)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(1,0.5,0.4,0.1)))
gp <- gp + ggtitle(paste("Number of months with AQI more than",threshold, "for each city")) + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("City") + ylab("Number of Months")
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

Shortlisted_Cities <- Cities_Months_AQI_thresh %>% top_n(5, wt=Average_Number_Months)
Shortlisted_Cities_Data <- as.data.frame(City_Day %>% subset(City %in% Shortlisted_Cities$City))
Shortlisted_Cities_Data$AQI_Classification <- ifelse(Shortlisted_Cities_Data$AQI<=300, ifelse(Shortlisted_Cities_Data$AQI>250,as.integer(Shortlisted_Cities_Data$AQI/50)-1,as.integer(Shortlisted_Cities_Data$AQI/50)), 5)

Percentage_Days <- as.data.frame(Shortlisted_Cities_Data %>% group_by(City, AQI_Classification) %>%
        summarise(Number_of_Days = n()) %>%
        mutate(Percent_of_Days = Number_of_Days*100 / sum(Number_of_Days)))

Percentage_Days$Percent_of_Days <- round(Percentage_Days$Percent_of_Days, digits = 1)
#positions <- c("","Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe")
#Percentage_Days$AQI_Classification <- as.data.frame(unlist(Percentage_Days$AQI_Classification))


gp <- ggplot(Percentage_Days, aes(y=Percent_of_Days, x=City, label = Percent_of_Days, fill = as.factor(AQI_Classification))) 
gp <- gp + geom_bar(position="stack", stat="identity", width = 0.3) 
gp <- gp + geom_text(size = 2, position = position_stack(vjust = .9))
gp <- gp + scale_fill_manual(values=c("#15B04F","#92D04F","#FFFF10","#FFC012","#FE0116","#C01511"))
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- gp + ggtitle("Distribution of days according to AQI Buckets") + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position = "none")
gp <- gp + xlab("City") + ylab("Percentage of Days")
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

According to the graph, Ahmedabad and Delhi are heavily polluted, with the majority of days in a year classified as Hazardous or Very Unhealthy. Cleaning the air in these cities would be a huge project and would require a huge investment. We therefore analyse further which cities are deteriorating at a faster pace and have the highest probability of going from bad to worse. The faster the deterioration, the more urgently action should be taken.

Step 4: Creating a Model

The AQI level of a city has not been uniform over the last 5 years: at times it has been harmful to human respiration, and at other times harmless. Here we hypothesize that the AQI level of a city on day t depends on the AQI level of the same city on day (t-1). On that basis, we calculate the probability of a city going from bad to worse on each day.
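
The quantity behind this idea is a conditional probability, P(B|A) = P(A and B) / P(A), where A is "the level got worse on day t" and B is "the level got worse on day t+1". The sketch below computes it for one short, made-up sequence of daily AQI classes. It treats a higher class number as worse air, which is an assumption of this example, and it is a simplified illustration rather than the monthly procedure used in the analysis code.

```r
# Hypothetical daily AQI classes for one city (higher class = worse air).
aqi_class <- c(2, 3, 4, 4, 3, 4, 5, 5, 4, 3)

# TRUE where the class rose relative to the previous day.
worse <- diff(aqi_class) > 0

# Pair each day with the following day:
# Event A = worse on day t, Event B = worse on day t+1.
a <- worse[-length(worse)]
b <- worse[-1]

p_a <- mean(a)                   # P(A)
p_a_and_b <- mean(a & b)         # P(A and B)
p_b_given_a <- p_a_and_b / p_a   # P(B | A)
print(c(P_A = p_a, P_A_and_B = p_a_and_b, P_B_given_A = p_b_given_a))
```

If a city never gets worse in the observed period, P(A) is zero and the ratio is undefined, so a real implementation would need to handle that case explicitly.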

Shortlisted_Cities_Data <- as.data.frame(Shortlisted_Cities_Data %>%
                                group_by(City) %>%
                                mutate(Diff_Level = AQI_Classification - lag(AQI_Classification, default = AQI_Classification[1])))


Shortlisted_Cities_RL <- as.data.frame(Shortlisted_Cities_Data %>% select(City,Date,AQI, Month, Year, AQI_Classification, Diff_Level))
setDT(Shortlisted_Cities_RL)
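## flag days where the AQI class is lower than on the previous day (Diff_Level < 0);
## note that a move into a worse (higher) class corresponds to Diff_Level > 0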
Shortlisted_Cities_RL[, Negatives := Diff_Level < 0]
## use run-length-encoding by each City on the Negatives column
Shortlisted_Cities_RL[, rl := rleid(Negatives), by = .(City, Month)]
## identify how many of each 'rl' are in each group
Shortlisted_Cities_RL[, rl_len := .N, by=.(City, Month, rl)]

# Event A : Probability that the AQI level of city decreased from day t-1 to day t

Shortlisted_Cities_ProbA_ByYear <- Shortlisted_Cities_RL %>% group_by(City, Year, Month) %>% summarise(Prob_A = length(which(Negatives==TRUE)))
Shortlisted_Cities_ProbA_ByYear$Prob_A <- Shortlisted_Cities_ProbA_ByYear$Prob_A/days_in_month(Shortlisted_Cities_ProbA_ByYear$Month)

Shortlisted_Cities_ProbA_ByMonth <- Shortlisted_Cities_ProbA_ByYear %>% group_by(City, Month) %>% summarise(Prob_A = mean(Prob_A))
Shortlisted_Cities_ProbA_ByMonth$Prob_A <- round(Shortlisted_Cities_ProbA_ByMonth$Prob_A, digits = 3)

# Event B : Probability that the AQI level of city decreased from day t to day t+1
# Event AnB : Probability that the AQI level of city decreased 2 days in a row
# Event BbyA : Probability that the AQI level of city will get worse tomorrow(t+1) given that it got worse today(t)
Shortlisted_Cities_RL$NumberofPairs <- ifelse(Shortlisted_Cities_RL$Negatives==TRUE,ifelse(Shortlisted_Cities_RL$rl_len>=2, (Shortlisted_Cities_RL$rl_len-1)/Shortlisted_Cities_RL$rl_len, 0),0)
Shortlisted_Cities_RL$NumberofPairs <- round(Shortlisted_Cities_RL$NumberofPairs, digits = 2)
Shortlisted_Cities_ProbAnB_ByYear <- Shortlisted_Cities_RL %>% group_by(City, Year, Month) %>% summarise(Pairs = sum(NumberofPairs))
Shortlisted_Cities_ProbAnB_ByYear$ProbAnB <- Shortlisted_Cities_ProbAnB_ByYear$Pairs/days_in_month(Shortlisted_Cities_ProbAnB_ByYear$Month)

Shortlisted_Cities_ProbAnB_ByMonth <- Shortlisted_Cities_ProbAnB_ByYear %>% group_by(City, Month) %>% summarise(ProbAnB = mean(ProbAnB))
Shortlisted_Cities_ProbAnB_ByMonth$ProbAnB <- round(Shortlisted_Cities_ProbAnB_ByMonth$ProbAnB, digits = 3)

Shortlisted_cities_Prob <- cbind(Shortlisted_Cities_ProbA_ByMonth, "ProbAnB" = Shortlisted_Cities_ProbAnB_ByMonth$ProbAnB)
Shortlisted_cities_Prob$ProbBbyA <- Shortlisted_cities_Prob$ProbAnB/Shortlisted_cities_Prob$Prob_A
Shortlisted_cities_Prob$ProbBbyA <- round(Shortlisted_cities_Prob$ProbBbyA, digits = 3)

Shortlisted_cities_AvgProb <- Shortlisted_cities_Prob %>% group_by(City) %>% summarise(ProbBbyA = mean(ProbBbyA)) %>% arrange(desc(ProbBbyA))
Shortlisted_cities_AvgProb$ProbBbyA <- round((Shortlisted_cities_AvgProb$ProbBbyA)*100, digits = 2)
Shortlisted_cities_AvgProb$City <- factor(Shortlisted_cities_AvgProb$City,levels=Shortlisted_cities_AvgProb$City)

gp <- ggplot(Shortlisted_cities_AvgProb, aes(y=ProbBbyA, x=City, label = ProbBbyA,  fill = ProbBbyA)) 
gp <- gp + geom_bar(stat="identity",  width = 0.5) 
gp <- gp + geom_text(size = 4, position = position_stack(vjust = .9))
gp <- gp + ggtitle("Probability(%) of a City going from Bad to Worse") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + scale_fill_gradient(low = rgb(0.4,1,0.4,0.7), high = rgb(1,0.5,0.4,0.7))
gp <- gp + xlab("City") + ylab("Probability(%)")
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

The city deteriorating at the fastest pace is Gurugram, on which we will focus our initial investment for the next 3 years. After that, we will focus the investment on Lucknow and Ahmedabad, in that order.

Step 5: Understanding the Causes

For the 3 cities selected, we will analyse the distribution of the different pollutants that contribute to the AQI.

The main sources of these pollutants have been identified to understand the avenues where the investment will need to be made:

PM2.5 and PM10 - Vehicular exhaust, industrial and biomass combustion, road and construction dust. Fumes from paint, varnish, aerosol sprays and other solvents can also contribute
SO2 - Various industrial processes. Coal and petroleum combustion
CO - Combustion of fuel such as natural gas, coal or wood. Vehicular exhaust
NH3 - Agricultural waste
O3 - Combustion of fossil fuel
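
The comparison used in this step is the ratio of each measured pollutant to its recommended threshold, averaged per city. Below is a compact sketch of that calculation on made-up readings. The threshold values mirror the ones used in this analysis, while the city names and readings are hypothetical; `sweep()` and `aggregate()` are used here only as one possible way to handle all pollutants at once.

```r
# Thresholds per pollutant (same values as the Pollutants table used in this analysis).
thresholds <- c(PM2.5 = 25, PM10 = 50, NH3 = 27, CO = 10, SO2 = 20, O3 = 100)

# Hypothetical daily readings for two cities, for illustration only.
readings <- data.frame(
  City = c("CityA", "CityA", "CityB", "CityB"),
  PM2.5 = c(50, 100, 25, NA),
  PM10 = c(100, 150, 50, 50),
  NH3 = c(27, 27, 13.5, 13.5),
  CO = c(1, 3, 5, 5),
  SO2 = c(10, 30, 20, 20),
  O3 = c(50, 50, 100, 200),
  check.names = FALSE
)

# Divide every pollutant column by its own threshold.
ratios <- sweep(as.matrix(readings[names(thresholds)]), 2, thresholds, "/")

# Average the ratios per city, ignoring missing readings.
mean_ratio <- aggregate(as.data.frame(ratios), by = list(City = readings$City),
                        FUN = mean, na.rm = TRUE)
print(mean_ratio)
```

A ratio above 1 means the pollutant exceeds its threshold on average, so the largest ratios point to the pollutants that need attention first.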

Finalized_Cities <- as.data.frame(Shortlisted_cities_AvgProb %>% top_n(3, wt=ProbBbyA))
Finalized_Cities_Data <- as.data.frame(City_Day %>% subset(City %in% Finalized_Cities$City))
Pollutants <- data.frame(Pollutant = c("PM2.5", "PM10", "NOX", "NH3", "CO", "SO2", "O3"), threshold = c(25,50,200,27, 10, 20,100))

PercentPM25 <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentPM25 = round(mean((PM2.5/Pollutants[which(Pollutants$Pollutant=="PM2.5"),]$threshold), na.rm = TRUE), digits = 2)))
PercentPM10 <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentPM10 = round(mean((PM10/Pollutants[which(Pollutants$Pollutant=="PM10"),]$threshold), na.rm = TRUE), digits = 2)))
PercentNH3 <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentNH3 = round(mean((NH3/Pollutants[which(Pollutants$Pollutant=="NH3"),]$threshold), na.rm = TRUE), digits = 2)))
PercentCO <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentCO = round(mean((CO/Pollutants[which(Pollutants$Pollutant=="CO"),]$threshold), na.rm = TRUE), digits = 2)))
PercentSO2 <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentSO2 = round(mean((SO2/Pollutants[which(Pollutants$Pollutant=="SO2"),]$threshold), na.rm = TRUE), digits = 2)))
PercentO3 <- as.data.frame(Finalized_Cities_Data %>% group_by(City) %>%  summarise(PercentO3 = round(mean((O3/Pollutants[which(Pollutants$Pollutant=="O3"),]$threshold), na.rm = TRUE), digits = 2)))


Pollutant_Percent <- merge(PercentPM25, PercentPM10, by = "City")
Pollutant_Percent <- merge(Pollutant_Percent, PercentNH3, by = "City")
Pollutant_Percent <- merge(Pollutant_Percent, PercentCO, by = "City")
Pollutant_Percent <- merge(Pollutant_Percent, PercentSO2, by = "City")
Pollutant_Percent <- merge(Pollutant_Percent, PercentO3, by = "City")


#Transposing the dataframe to make plotting easier
CityNames <- as.character(Pollutant_Percent$City)  # keep city names as text even if City is a factor
Pollutant_Percent <- as.data.frame(t(Pollutant_Percent[,-1]))
colnames(Pollutant_Percent) <- CityNames
Pollutant_Percent$Pollutants <- factor(row.names(Pollutant_Percent))

gp <- ggplot(Pollutant_Percent, aes(y=Gurugram, x=Pollutants, label = Gurugram)) 
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_text(size = 4, position = position_stack(vjust = .9))
gp <- gp + ggtitle("Ratio of Pollutants Found over Recommended Value for Gurugram") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("Pollutant") + ylab("Ratio of Pollutants")
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

#Understanding the monthly trend of each pollutant in our main city Gurugram
Gurugram_Data <- Finalized_Cities_Data %>% subset(City =="Gurugram")
AveragePM25 <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((PM2.5/Pollutants[which(Pollutants$Pollutant=="PM2.5"),]$threshold), na.rm = TRUE), digits = 2)))
AveragePM25$Pollutant <- c("PM25")
AveragePM25$PollutantType <- c("Particle")

AveragePM10 <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((PM10/Pollutants[which(Pollutants$Pollutant=="PM10"),]$threshold), na.rm = TRUE), digits = 2)))
AveragePM10$Pollutant <- c("PM10")
AveragePM10$PollutantType <- c("Particle")

AverageNH3 <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((NH3/Pollutants[which(Pollutants$Pollutant=="NH3"),]$threshold), na.rm = TRUE), digits = 2)))
#AverageNH3$mean <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean(NH3, na.rm = TRUE), digits = 2)))
AverageNH3$Pollutant <- c("NH3")
AverageNH3$PollutantType <- c("Gases")

AverageCO <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((CO/Pollutants[which(Pollutants$Pollutant=="CO"),]$threshold), na.rm = TRUE), digits = 2)))
AverageCO$Pollutant <- c("CO")
AverageCO$PollutantType <- c("Gases")

AverageSO2 <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((SO2/Pollutants[which(Pollutants$Pollutant=="SO2"),]$threshold), na.rm = TRUE), digits = 2)))
AverageSO2$Pollutant <- c("SO2")
AverageSO2$PollutantType <- c("Gases")

AverageO3 <- as.data.frame(Gurugram_Data %>% group_by(Month) %>%  summarise(Ratio = round(mean((O3/Pollutants[which(Pollutants$Pollutant=="O3"),]$threshold), na.rm = TRUE), digits = 2)))
AverageO3$Pollutant <- c("O3")
AverageO3$PollutantType <- c("Gases")

AveragePollutants <- rbind(AveragePM25,AveragePM10, AverageNH3, AverageCO, AverageSO2, AverageO3)

gp <- ggplot(AveragePollutants, aes(y=Ratio, x=Month, label = Ratio)) 
gp <- gp + geom_line(aes(color = Pollutant))
gp <- gp + geom_text(size = 2.5)
gp <- gp + facet_grid(PollutantType ~ ., scales = "free")
gp <- gp + theme(strip.background =element_rect(fill=rgb(0.4,0.8,1,0.2)))
gp <- gp + ggtitle("Monthly Ratio of Pollutants Found over Recommended Value for Gurugram") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("Month") + ylab("Ratio of Pollutants")
gp <- gp + scale_x_discrete(labels=c("1" = "Jan", "2" = "Feb", "3" = "Mar", "4" = "Apr", "5" = "May", "6" = "June", 
                                     "7" = "July", "8" = "Aug", "9" = "Sep", "10" = "Oct", "11" = "Nov", "12" = "Dec"),
                            limits = factor(1:12))
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = rgb(0.4,0.8,1,0.2), size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

# Further analysis for our next 2 cities - Lucknow and Ahmedabad
gp <- ggplot(Pollutant_Percent, aes(y=Lucknow, x=Pollutants, label = Lucknow)) 
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_text(size = 4, position = position_stack(vjust = .9))
gp <- gp + ggtitle("Ratio of Pollutants Found over Recommended Value for Lucknow") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("Pollutant") + ylab("Ratio of Pollutants")
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

gp <- ggplot(Pollutant_Percent, aes(y=Ahmedabad, x=Pollutants, label = Ahmedabad)) 
gp <- gp + geom_bar(color=rgb(0.4,0.8,1,0.7), fill=rgb(0.4,0.8,1,0.7), stat="identity") 
gp <- gp + geom_text(size = 4, position = position_stack(vjust = .9))
gp <- gp + ggtitle("Ratio of Pollutants Found over Recommended Value for Ahmedabad") + theme(plot.title = element_text(hjust = 0.5))
gp <- gp + xlab("Pollutant") + ylab("Ratio of Pollutants")
gp <- gp + theme(panel.background = element_rect(fill = "white", colour = "lightblue", size = 0.5, linetype = "solid"),
                 panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)), 
                 panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                                 colour = rgb(0.4,0.8,1,0.2)))
gp <- grid.arrange(gp,bottom = textGrob("Source: Central Pollution Control Board of Government of India", gp = gpar(fontsize = 5), x = 1, hjust = 1))

Step 6: Recommendations and Tracking

This section focuses on some of the opportunities the investment can be used for in order to improve air quality in Gurugram, Lucknow and Ahmedabad.

According to the analysis, there are 2 major causes of pollution:
1. Fossil Fuels used in Industries
2. Vehicular Exhaust

1. Fossil Fuels used in Industries

There are two approaches through which pollution can be reduced:
1. Reducing consumption or usage of a polluting product
2. Treatment of wastes, discharges and disposals of a pollutant

Using Microbes for Waste Treatment
Many jurisdictions, including the E.U., Switzerland, Canada and the U.S., have effectively implemented systems that treat waste water for most chemicals, yet significant improvement in methods is possible. In such improvements, priority should be given to using microbes or fungi to clean up heavy metals and hard-to-degrade organic compounds, because of their high efficiency relative to chemical or physical methods.
Raising Awareness among Consumers
The reasons corporations reduce their pollution are based on consumer preference for low-pollution goods and the high cost of noncompliance with environmental regulations. Consumers and governments need to do their part to push companies to decrease pollution. Although pollution prevention can provide a financial incentive for private corporations, consumer pressure is still necessary to develop company awareness of pollution issues.

2. Vehicular Exhaust

Clean technology is any process, product or service that reduces negative environmental impacts through significant energy efficiency improvements, the sustainable use of resources, or environmental protection activities. A couple of ways the uncle could invest his money would be to

Invest in a cleaner transport medium like The Delhi Metro
The Delhi Metro (DM) is a mass rapid transit system serving the National Capital Region of India. It is also the world’s first rail project to earn carbon credits under the Clean Development Mechanism of the United Nations for reductions in CO2 emissions. Looking at the period 2004-2006, one of the larger rail extensions of the DM led to a 34 percent reduction in localized CO at a major traffic intersection in the city. (Source)
Set up a Money-lending firm or bank
Provide low-interest loans for people to convert their petrol-run or diesel-run vehicles to CNG. Provide incentives to those who are willing to give up their old fuel-driven vehicles and switch to electric vehicles.
Partner with Taxi service companies to incentivize drivers with electric cars
Provide free rides and offers to attract people. Autorickshaw drivers can be targeted and brought into the revolution here as well.

How to measure progress

Tracking the number of stations where monitors are installed and the number of daily records monitored. This will make sure that the data we are collecting is correct and valid for tracking progress. Based on this data we can tweak the experiment during the first phase of the investment.
Tracking AQI to understand if the efforts taken are impacting the overall quality of air in the selected city
Tracking the ratio of pollutants to the recommended threshold to make sure that we are focusing on reducing all the pollutants in the air and not just a few
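
As a rough sketch of how these metrics could be tracked year over year, the example below compares each year against a baseline year. All figures are hypothetical, the column names are made up for this illustration, and record coverage assumes a 365-day year.

```r
# Hypothetical yearly summaries for one city; Year counts years since the investment began.
progress <- data.frame(
  Year = 0:3,
  Stations = c(1, 2, 3, 4),
  Daily_Records = c(300, 330, 350, 360),
  Mean_AQI = c(200, 190, 171, 150)
)

# Use the first year as the baseline for comparison.
baseline <- progress[1, ]

# Share of days in the year with a valid daily record.
progress$Record_Coverage <- round(progress$Daily_Records / 365, 2)

# Percentage change in mean AQI relative to the baseline (negative = improvement).
progress$AQI_Change_Pct <- round(
  100 * (progress$Mean_AQI - baseline$Mean_AQI) / baseline$Mean_AQI, 1
)
print(progress)
```

Reading the coverage column alongside the AQI change helps separate a real improvement from an apparent one caused by missing records.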

More metrics can certainly be derived from the rough plan outlined above. In a real-world scenario, metrics would need to be defined by taking many more parameters into consideration. Hence, I shall stop at these 3 simple metrics.

References

The data has been made publicly available by the Central Pollution Control Board: Website Link which is the official portal of Government of India. They also have a real-time monitoring app: App Link

Some Exploratory Analysis

Air Pollution India

Analysis by Deepti Singh Chauhan
