Rationale

The aim of this work was to explore air pollution trends in India. The specific objectives were 1) data profiling and exploratory data analysis of the dataset, summarizing and visualizing characteristic pollutant trends by a) state and b) 'type' (referred to as zones in the analyses), and 2) assessing the efficacy of the zone categorization, here assumed to be based on the pollution data. Ideally, the categories should be properly classifiable from the available data using regression techniques.

The dataset has been imported from Kaggle.

The dataset was loaded and written to a PostgreSQL database using DBI and RPostgreSQL in R. This was done because the dataset is large (>450,000 rows) and hence easier to wrangle with SQL. (The loading code has been commented out since it is a one-time process.)
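Below is a minimal sketch of that one-time import, reproduced for reference and kept commented out as in the original workflow; the CSV file name and the connection details are placeholders (assumptions) that would need to match the local setup.

# One-time load of the Kaggle CSV into the local PostgreSQL table "India_pollution"
# library(DBI)
# library(RPostgreSQL)
# con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
#                       host = "localhost", dbname = "pollution_db",
#                       user = "postgres", password = Sys.getenv("PGPASSWORD"))
# pollution_raw <- read.csv("india_air_quality.csv", stringsAsFactors = FALSE)
# DBI::dbWriteTable(con, "India_pollution", pollution_raw, row.names = FALSE)
# DBI::dbDisconnect(con)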

0.1 Obtaining the data types of different fields in the dataset


select
column_name,
data_type
from
information_schema."columns"
where table_name = 'India_pollution'
;
8 records
column_name data_type
state text
location text
type text
so2 double precision
no2 double precision
rspm double precision
spm double precision
date text

0.2 Dataset cleaning
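
--1. Trimming whitespace and standardizing case in the character fields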



update "India_pollution"
set "state" = trim(initcap("state"))
;
commit;


update "India_pollution"
set "location" = trim(initcap("location"))
;
commit;


update "India_pollution"
set "type" = trim(initcap("type"))
;
commit;


--2. Reformatting the date field from month/day/year to year/month/day and casting it to a date


update "India_pollution"
set "date"= (split_part(date,'/',3) || '/' || split_part(date,'/',1) || '/'|| split_part(date,'/',2))::date 
;
commit;


--3. Setting empty or null character fields to 'Data_not_available'


--1. state


update "India_pollution"
set "state" = 'Data_not_available'
where "state" = ''
;
commit;


--2. location


update "India_pollution"
set "location" = 'Data_not_available'
where "location" = ''
;
commit;


--3. type


update "India_pollution"
set "type" = 'Data_not_available'
where "type" is null
;
commit;

Re-checking the data after cleaning


select
 *
 from
"India_pollution"
 limit 10;
Displaying records 1 - 10
state location type so2 no2 rspm spm date
Maharashtra Akola Industrial 9.0 11.000000 127.00000 NA 2009-09-07
Kerala Kozhikode Rural_Residential 2.0 8.666667 47.33333 93 2012-05-10
Tamil Nadu Chennai Industrial NA NA 29.00000 NA 2006-06-18
Odisha Talcher Industrial 10.0 21.000000 79.00000 NA 2008-10-13
Chhattisgarh Bhilai Rural_Residential NA NA NA NA 2014-09-22
Uttar Pradesh Kanpur Rural_Residential 8.0 40.000000 246.00000 NA 2004-12-23
Odisha Angul Industrial 9.0 26.000000 91.00000 NA 2008-11-03
Delhi Delhi Industrial 4.2 41.200000 124.00000 296 2014-07-16
Delhi Delhi Industrial 4.2 60.300000 140.00000 485 2014-07-16
Kerala Kottayam Rural_Residential 6.0 22.000000 51.00000 NA 2012-04-19

Visual inspection of the data suggested that there were far fewer observations before 2005 than after. Hence, counts were obtained for both periods.

0.2.1 Observations prior to 2005


select
count(*) as obs_prior_2005
from
"India_pollution"
where date <= '2005-01-01'
;
1 records
obs_prior_2005
42796

0.2.2 Observations post 2005


select
count(*) as obs_post_2005
from
"India_pollution"
where date >= '2005-01-01'
;
1 records
obs_post_2005
391655

The counts clearly show that observations prior to 2005 are comparatively few, so only records from 2005 onwards were selected and used for further analysis.


select
*
from
"India_pollution"
where date >= '2005-01-01'
;

0.3 Data Profiling, Summarization and Visualization

0.3.1 Total row counts having non-null values in the data


select
sum(case when "state" is not null then 1 else 0 end) as state_not_nulls,
sum(case when "location" is not null then 1 else 0 end) as location_not_nulls,
sum(case when "type" is not null then 1 else 0 end) as type_not_nulls,
sum(case when "so2" is not null then 1 else 0 end) as so2_not_nulls,
sum(case when "no2" is not null then 1 else 0 end) as no2_not_nulls,
sum(case when "rspm" is not null then 1 else 0 end) as rspm_not_nulls,
sum(case when "spm" is not null then 1 else 0 end) as spm_not_nulls,
sum(case when "date" is not null then 1 else 0 end) as date_not_null
FROM "India_pollution"
where date >= '2005-01-01'
;
1 records
state_not_nulls location_not_nulls type_not_nulls so2_not_nulls no2_not_nulls rspm_not_nulls spm_not_nulls date_not_null
391655 391655 391655 359338 376680 352772 184555 391655

0.3.2 Distinct counts of some important fields


select
count(distinct "state") as states,
count(distinct "location") as locations,
count(distinct "type") as types
FROM "India_pollution"
where date >= '2005-01-01'
;
1 records
states locations types
31 269 5

0.3.3 Modifying the date field and removing NaN values (if any) to get the final dataset for further analysis

#str(pollution_data) shows that date is stored as character rather than as a date, hence it is converted to the Date class

pollution_data$date<-as.Date(pollution_data$date,"%Y-%m-%d")

# All entries of the category Data_not_available are removed from the dataset (by keeping only the four valid zone types), followed by removing NaNs (if any) using the IDPmisc library

require(IDPmisc)
## Loading required package: IDPmisc
india_pollution_data<-pollution_data%>%
    dplyr::filter(type %in% c("Sensitive",
                              "Residential",
                              "Industrial",
                              "Rural_Residential"))%>%
    IDPmisc::NaRV.omit(.)

head(india_pollution_data,5)
##          state  location              type  so2       no2      rspm spm
## 2       Kerala Kozhikode Rural_Residential  2.0  8.666667  47.33333  93
## 7        Delhi     Delhi        Industrial  4.2 41.200000 124.00000 296
## 8        Delhi     Delhi        Industrial  4.2 60.300000 140.00000 485
## 14       Bihar     Patna       Residential 11.9 41.000000 142.00000 440
## 28 Maharashtra  Kolhapur       Residential 10.3 15.500000  71.00000 150
##          date
## 2  2012-05-10
## 7  2014-07-16
## 8  2014-07-16
## 14 2015-01-23
## 28 2011-01-07

0.3.4 Exploratory data analysis of the cleaned data

Exploring the distribution of the different pollutants

india_pollution_data%>%
    dplyr::mutate(month_id=lubridate::year(date))%>%
    gather(env_var,
           values,
           so2:spm)%>%
    group_by(env_var)%>%
    do(
        plot = plot_ly(data = ., x=~values) %>%
            add_histogram(name = ~env_var)
    ) %>%
    subplot(nrows = 4)%>%
    layout(title="Histograms of distribution of pollutant values")

The distributions clearly show that all the pollutants are right-skewed, with a few very large values. Hence, these values require some transformation.

Log (x+1) transformation

india_polltn_data_trans<-india_pollution_data%>%
      mutate_at(vars(contains('no2'),
                     contains('so2'),
                     contains('rspm'),
                     contains('spm')),
            log1p)

Plotting the above data to check the distributions after transformation, using the psych package

require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
psych::multi.hist(subset(india_polltn_data_trans,
                         select = so2:spm))

The data now look more normally distributed than before and were therefore used for the subsequent analyses

Associations between the independent variables, to check for collinearity (if any)

#Collinearity is checked using multiple correlations across all variables in the data

# Multiple correlation using Pearson's correlation coefficient. The data need to be in matrix form and hence are converted within the function call itself. Spearman's correlation can also be used as per requirement


mult_correlation<-Hmisc::rcorr(as.matrix(subset(india_polltn_data_trans,
                         select = so2:spm)),
                               type=c('pearson'))

#getting the correlation coefficients

require(DT)
## Loading required package: DT
DT::datatable(round(mult_correlation$r,3))
##Visualization of the multiple correlations using ggplot2 and GGally (with histograms and correlation coefficients in the same plot)

require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw as default

ggpairs(subset(india_polltn_data_trans,
                         select = so2:spm),
        title = "Associations of pollutants")

The scatterplots and the correlation coefficients show a clear association between rspm and spm. Since rspm has more non-null values than spm (and hence more data), rspm was used in the subsequent analyses.

Since the objective is to study how the different zones compare with respect to the pollutants, the further EDA and the inferential and/or predictive analyses focus on those aspects only.

0.3.5 Data summary of pollutant data w.r.t. states and zones

india_data_type_summ<-india_pollution_data%>%
    dplyr::select(state,
                  type,
                  so2,no2,rspm)%>%
    drop_na(.)%>%
  tidyr::gather(env_var,
         values,
         so2:rspm)%>%
  dplyr::group_by(state,type,env_var)%>%
  dplyr::summarise(mean=mean(values,
                      na.rm = TRUE),
            sd=sd(values,
                  na.rm = TRUE),
            median=median(values,
                          na.rm = TRUE),
            min=min(values),
            max=max(values))%>%
           tidyr::drop_na(.)%>%
    dplyr::mutate_if(is.numeric,
                     round,2)
## `mutate_if()` ignored the following grouping variables:
## Columns `state`, `type`
DT::datatable(india_data_type_summ)

0.3.6 The data summary visualization w.r.t. different states and zones (types)

Here, to give an order to the sequence of states, GIS data (the centroid latitude of each state) are used

# Webpage is scraped using the rvest package

#install.packages('rvest')

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
# URL (with Indian states GIS data)

webpage <- read_html("http://www.quickgs.com/latitudinal-and-longitudinal-extents-of-india-indian-states-and-cities/")

# Obtaining the data

tables_list <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)


table_df<-tables_list[1]%>%
  data.frame(.)

# The data are further cleaned by 1) removing the °N and °E symbols from the numbers and 2) keeping only the state name and latitude/longitude

clean_table<-table_df%>%
  tidyr::separate(Latitude,
           c("lat","deg_N"),## separating the column to separate "°" from numbers
           sep="°")%>%
   tidyr::separate(Longitude,
           c("lon","deg_E"),
           sep="°")%>% ## separating the column to separate "°" from numbers
  tidyr::separate(lon,c("lon_d",
                 "lon_m"),sep="\\.")%>% ## separating the numeric column to separate degrees and minutes
  tidyr::separate(lat,c("lat_d",
               "lat_m"),
         sep = "\\.")%>% ## separating the numeric column to separate degrees and minutes
    dplyr::select(State,
                  lat_d,
                  lat_m,
                  lon_d,
                  lon_m)%>%
  dplyr::mutate_at(vars(contains('lat'),
                 contains('lon')),
            as.numeric)


# The lat/long data are in degrees and minutes and hence are converted to decimal degrees

clean_table2<-clean_table%>%
  dplyr::transmute(lat_dec=lat_d + lat_m/60,
            long_dec=lon_d + lon_m/60)%>%
  dplyr::mutate(state=clean_table$State)

# Observing the state names in both datasets

clean_table2$state
##  [1] "Andhra Pradesh"    "Arunachal Pradesh" "Assam"            
##  [4] "Bihar"             "Chhattisgarh"      "Goa"              
##  [7] "Gujarat"           "Haryana"           "Himachal Pradesh" 
## [10] "Jammu & Kashmir"   "Jharkhand"         "Karnataka"        
## [13] "Kerala"            "Madhya Pradesh"    "Maharashtra"      
## [16] "Manipur"           "Meghalaya"         "Mizoram"          
## [19] "Nagaland"          "Odisha"            "Punjab"           
## [22] "Rajasthan"         "Sikkim"            "Tamil Nadu"       
## [25] "Tripura"           "Uttarakhand"       "Uttar Pradesh"    
## [28] "West Bengal"
unique(india_data_type_summ$state)
##  [1] "Andhra Pradesh"       "Assam"                "Bihar"               
##  [4] "Chandigarh"           "Chhattisgarh"         "Dadra & Nagar Haveli"
##  [7] "Daman & Diu"          "Delhi"                "Goa"                 
## [10] "Gujarat"              "Haryana"              "Himachal Pradesh"    
## [13] "Jammu & Kashmir"      "Jharkhand"            "Karnataka"           
## [16] "Kerala"               "Madhya Pradesh"       "Maharashtra"         
## [19] "Manipur"              "Meghalaya"            "Mizoram"             
## [22] "Nagaland"             "Odisha"               "Puducherry"          
## [25] "Punjab"               "Rajasthan"            "Tamil Nadu"          
## [28] "Uttar Pradesh"
setdiff(unique(india_data_type_summ$state),clean_table2$state)
## [1] "Chandigarh"           "Dadra & Nagar Haveli" "Daman & Diu"         
## [4] "Delhi"                "Puducherry"
# Comparing the two state lists shows that the pollution data contain union territories (and Chandigarh) whose GIS data are not provided in the scraped table. Coordinates for Dadra & Nagar Haveli, Daman & Diu, Puducherry and Delhi are therefore added manually; Chandigarh is excluded later.

# Additional data

add_gis_data<-data.frame(lat_dec=c(20.27,20.42,11.9416,28.7041),
                         long_dec=c(73.02,72.83,79.8083,77.1025),
                         state=c("Dadra & Nagar Haveli","Daman & Diu","Puducherry","Delhi"))

# Binding with the GIS data 

clean_table3<-data.frame(rbind(clean_table2,
                               add_gis_data))


# This data is then combined with the data summary

india_data_type_summ2<-india_data_type_summ%>%
  left_join(clean_table3,
            by=c("state"))%>%
    dplyr::filter(state!='Chandigarh') # Removing Chandigarh (since it is not a state)

# The states are ordered by their latitude using the forcats package, and the data are visualized with plotly and crosstalk. Crosstalk provides limited interactivity for viewing specific pollutant trends for specific zones

require(forcats)

require(leaflet)
## Loading required package: leaflet
require(plotly)    

require(crosstalk)
## Loading required package: crosstalk
# Creating a shared object

india_data_type_sum_shr<-crosstalk::SharedData$new(india_data_type_summ2,
                                                   group = "individual")

#Plot

plot_india_polltn<-india_data_type_sum_shr%>%
    plot_ly(x = ~forcats::fct_reorder(state,lat_dec), 
            y = ~mean, 
            color = ~type,
            colors = "Set1",
            width = "50%",
            height = 450)%>%
    group_by(type)%>%
    add_markers(size=4)%>%
    layout(title = 'Statewise pollution data',
           xaxis = list(title = "States"),
           yaxis = list(title = "Mean values"),
           showlegend=FALSE)

bscols(list(filter_checkbox(id = "type",
                       sharedData = india_data_type_sum_shr,
                       group = ~type,
                       label = "Zones",
                       inline = TRUE),
       filter_checkbox(id = "env_var",
                       sharedData = india_data_type_sum_shr,
                       group = ~env_var,
                       label = "Pollutant",
                       inline = TRUE)),
       widths = c(3.5,8.5),
       plot_india_polltn)

Yearly trends in pollution based on the zone types

# Obtaining the data

india_polln_date_type_sum<-india_pollution_data%>%
    dplyr::select(date,
                  type,
                  so2,no2,rspm)%>%
    drop_na(.)%>%
    dplyr::mutate(years=lubridate::year(date))%>%
    tidyr::gather(env_var,
           values,
           so2:rspm,-date)%>%
       dplyr::group_by(type,years,env_var)%>%
    dplyr::summarise(mean=mean(values,
                        na.rm = TRUE),
              sd=sd(values,
                    na.rm = TRUE),
              median=median(values,
                            na.rm = TRUE),
              min=min(values),
              max=max(values))
 
# Shared data object

india_poln_yr_type_shr<-crosstalk::SharedData$new(india_polln_date_type_sum, group = "individual")
                                   
# Plot

plot_india_poln_yr<-india_poln_yr_type_shr%>%
    plot_ly(x = ~years, 
            y = ~mean,
            color = ~type,
            colors = "Set2",
            height = 400,
            width="80%")%>%
    group_by(type)%>%
        add_markers(size=5)%>%
    layout(title = 'Yearly pollution data of different zones',
           xaxis = list(title = "Years"),
           yaxis = list(title = "Mean values"),
           showlegend=FALSE)

bscols(list(filter_checkbox(id = "env_var",
                       sharedData = india_poln_yr_type_shr,
                       group = ~env_var,
                       label = "Pollutant"),
       filter_select(id = "type",
                     sharedData = india_poln_yr_type_shr,
                     group = ~type,
                     label = "Zones")),
       plot_india_poln_yr,
       widths = c(3.5,8.5))

0.4 Studying the efficacy of the pollutants in categorizing the sampling sites into specific zones (Industrial, Residential, Rural_Residential and Sensitive)

A multinomial logistic regression was performed. This type of regression was selected because the dependent variable, the zone category of the localities, has more than two possible outcomes (not just 0 and 1) and the outcomes have no predetermined ordering or weight (i.e., it is a nominal variable).

First, the data were modified for the analysis and observed

##Multinomial regression to test the efficacy of the environmental variables in classifying the types of areas


library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#create an index to be used for selecting training and test samples from the dataset

india_data_for_reg<-india_polltn_data_trans%>%
    dplyr::select(type:no2,rspm)%>%
    dplyr::mutate_at(vars('type'),as.factor)%>%
    drop_na(.)

table(india_data_for_reg$type)
## 
##        Industrial       Residential Rural_Residential         Sensitive 
##             45561             58118             21867              4494

As seen from the above results, the sample sizes of the categories differ widely, which could bias the results. Therefore, a few different approaches were undertaken for training the model, to check how (if at all) the accuracy changes with each approach.

0.4.1 Defining training, test data and the regression formula

## Splitting the data into training and test datasets to assess predictive performance (a 70-30 split has been used throughout). The package 'caret' is used for partitioning the dataset

require(caret)

set.seed(123)

poll_data_india_reg <- caret::createDataPartition(y = india_data_for_reg$type, 
                                                  p= 0.7, 
                                                  list = FALSE)

training_data <- india_data_for_reg[poll_data_india_reg,]


test_data <- india_data_for_reg[-poll_data_india_reg,]


# Defining the formula for the regression (only selected independent variables have been used in the analysis)

reg.formula<-reformulate(names(india_data_for_reg[c(2:4)]), 
                         names(india_data_for_reg[1]))

reg.formula
## type ~ so2 + no2 + rspm

1. Training using all the observations from the data

# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.

library(nnet)


ind_poll_model.train <- nnet::multinom(reg.formula, 
                                       data = training_data)
## # weights:  20 (12 variable)
## initial  value 126192.989398 
## iter  10 value 121098.773288
## iter  20 value 100989.038983
## final  value 100988.409681 
## converged
# Summarize the model


summary(ind_poll_model.train)
## Call:
## nnet::multinom(formula = reg.formula, data = training_data)
## 
## Coefficients:
##                   (Intercept)        so2         no2       rspm
## Residential         1.7524049 -0.6740847  0.28037890 -0.1758999
## Rural_Residential   2.3798928 -0.9180857 -0.08000808 -0.1684136
## Sensitive           0.7756912 -1.3932132  0.36218372 -0.2457527
## 
## Std. Errors:
##                   (Intercept)        so2        no2       rspm
## Residential         0.0615881 0.01391856 0.01782615 0.01317751
## Rural_Residential   0.0730101 0.01911033 0.02290091 0.01673334
## Sensitive           0.1171593 0.03721592 0.04119710 0.02991015
## 
## Residual Deviance: 201976.8 
## AIC: 202000.8
# Predictions using the test data


mult_reg_test<-ind_poll_model.train%>%
    predict(test_data)

# Visualization of predicted probabilities using a Gain Curve Plot

require(WVPlots)
## Loading required package: WVPlots
mult_reg_test_prob<-test_data%>%
    dplyr::mutate(predicted=mult_reg_test)

WVPlots::GainCurvePlot(mult_reg_test_prob,"predicted","type","Gain Curve of pollutant model")

# Accuracy of the model

accuracy_model_all<-mean(mult_reg_test == test_data$type)

accuracy_model_all
## [1] 0.4904001

The gain curve plot and the accuracy value (~0.49) show that the model sorts the types almost at random, which limits its reliability
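For a class-level view of these misclassifications, the same caret::confusionMatrix() call used for the later models could also be applied to this first model (a sketch; its output is not reproduced here):

# Confusion matrix of the all-observations multinomial model on the test set
caret::confusionMatrix(data = mult_reg_test,
                       reference = test_data$type)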

2. Sampling equal numbers from each category, based on the smallest category (here, the Sensitive zone), using the training dataset

# sampling

training_eq_samp<-training_data%>%
    group_by(type)%>%
    sample_n(table(training_data$type)['Sensitive'])

# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.

library(nnet)


ind_poll_train2 <- nnet::multinom(reg.formula, 
                                       data = training_eq_samp)
## # weights:  20 (12 variable)
## initial  value 17445.128240 
## iter  10 value 16941.645493
## final  value 16790.913779 
## converged
# Summarize the model


summary(ind_poll_train2)
## Call:
## nnet::multinom(formula = reg.formula, data = training_eq_samp)
## 
## Coefficients:
##                   (Intercept)        so2        no2       rspm
## Residential          1.640562 -0.7698662 0.35342941 -0.2051672
## Rural_Residential    3.205044 -0.9537267 0.02679519 -0.2396730
## Sensitive            3.546419 -1.4579364 0.41073270 -0.3463862
## 
## Std. Errors:
##                   (Intercept)        so2        no2       rspm
## Residential         0.2035322 0.04688569 0.06058598 0.04351234
## Rural_Residential   0.1984365 0.04851360 0.06076798 0.04357841
## Sensitive           0.1994551 0.05080364 0.06153884 0.04408358
## 
## Residual Deviance: 33581.83 
## AIC: 33605.83
# Predictions using the test data


mult_reg_test2<-ind_poll_train2%>%
    predict(test_data)


# Accuracy of the model

accuracy_model_equal<-mean(mult_reg_test2 == test_data$type)

accuracy_model_equal
## [1] 0.3039655

3. Using k-fold cross-validation

require(caret)

training_kfold<-train(
    form = reg.formula,
    data = training_data,
    trControl = trainControl(method = "cv", 
                             number = 10),
    method = "multinom")
## (nnet iteration traces omitted: the multinomial model was refit for each of the
##  10 cross-validation folds and each of the 3 candidate decay values; every fit
##  converged, with fold-level final values between roughly 90851 and 90927, and the
##  final refit on the full training set converging at value 100989.698840)
# Complete results of the training model 

training_kfold
## Penalized Multinomial Regression 
## 
## 91029 samples
##     3 predictor
##     4 classes: 'Industrial', 'Residential', 'Rural_Residential', 'Sensitive' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 81926, 81926, 81926, 81927, 81925, 81926, ... 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.4951718  0.1276838
##   1e-04  0.4951718  0.1276838
##   1e-01  0.4951938  0.1277121
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
# Final model

training_kfold$finalModel
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
## 
## Coefficients:
##                   (Intercept)        so2        no2       rspm
## Residential         1.7493945 -0.6739294  0.2806017 -0.1754878
## Rural_Residential   2.3758973 -0.9178653 -0.0797057 -0.1678638
## Sensitive           0.7710164 -1.3926222  0.3622813 -0.2450777
## 
## Residual Deviance: 201979.4 
## AIC: 202003.4
# Accuracy of the model using test data

accuracy_model_kfold<-mean(test_data$type == predict(training_kfold$finalModel, 
                               newdata = test_data))

accuracy_model_kfold
## [1] 0.4903745
# Calculate the confusion matrix for the test set

class_pred_kfold<-predict(object = training_kfold$finalModel,    
                            newdata = test_data,  
                            type = "class")

confusion_mat_pred_kfold <- caret::confusionMatrix(data = class_pred_kfold ,         # predicted classes
                      reference = test_data$type)  # actual classes

print(confusion_mat_pred_kfold)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Industrial Residential Rural_Residential Sensitive
##   Industrial              4990        3398              1341       117
##   Residential             8602       14037              5117      1231
##   Rural_Residential         76           0               102         0
##   Sensitive                  0           0                 0         0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4903          
##                  95% CI : (0.4854, 0.4953)
##     No Information Rate : 0.4469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1193          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Industrial Class: Residential
## Sensitivity                     0.3651             0.8051
## Specificity                     0.8084             0.3071
## Pos Pred Value                  0.5068             0.4843
## Neg Pred Value                  0.7025             0.6610
## Prevalence                      0.3504             0.4469
## Detection Rate                  0.1279             0.3598
## Detection Prevalence            0.2524             0.7430
## Balanced Accuracy               0.5867             0.5561
##                      Class: Rural_Residential Class: Sensitive
## Sensitivity                          0.015549          0.00000
## Specificity                          0.997658          1.00000
## Pos Pred Value                       0.573034              NaN
## Neg Pred Value                       0.833698          0.96545
## Prevalence                           0.168158          0.03455
## Detection Rate                       0.002615          0.00000
## Detection Prevalence                 0.004563          0.00000
## Balanced Accuracy                    0.506603          0.50000

4. Random forest approach

set.seed(123)

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
model_train <- randomForest(formula = reg.formula,
                             data = training_data)

# Print the model output      
                       
print(model_train)
## 
## Call:
##  randomForest(formula = reg.formula, data = training_data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 40.41%
## Confusion matrix:
##                   Industrial Residential Rural_Residential Sensitive
## Industrial             15966       13432              2343       152
## Residential             8628       30629              1124       302
## Rural_Residential       3492        4804              6942        69
## Sensitive                607        1688               147       704
##                   class.error
## Industrial          0.4993886
## Residential         0.2471303
## Rural_Residential   0.5464820
## Sensitive           0.7762238
# predicted classes using the model

class_prediction2 <- predict(object = model_train,    
                            newdata = test_data,  
                            type = "class") 

# Calculate the confusion matrix for the test set

confusion_mat_pred <- caret::confusionMatrix(data = class_prediction2 ,         # predicted classes
                      reference = test_data$type)  # actual classes

print(confusion_mat_pred)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Industrial Residential Rural_Residential Sensitive
##   Industrial              6828        3712              1467       254
##   Residential             5763       13094              2072       712
##   Rural_Residential        997         491              2983        61
##   Sensitive                 80         138                38       321
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5954          
##                  95% CI : (0.5905, 0.6002)
##     No Information Rate : 0.4469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3494          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Industrial Class: Residential
## Sensitivity                     0.4996             0.7510
## Specificity                     0.7856             0.6039
## Pos Pred Value                  0.5569             0.6051
## Neg Pred Value                  0.7443             0.7501
## Prevalence                      0.3504             0.4469
## Detection Rate                  0.1750             0.3356
## Detection Prevalence            0.3143             0.5547
## Balanced Accuracy               0.6426             0.6774
##                      Class: Rural_Residential Class: Sensitive
## Sensitivity                           0.45473         0.238131
## Specificity                           0.95227         0.993203
## Pos Pred Value                        0.65821         0.556326
## Neg Pred Value                        0.89626         0.973279
## Prevalence                            0.16816         0.034554
## Detection Rate                        0.07647         0.008228
## Detection Prevalence                  0.11617         0.014791
## Balanced Accuracy                     0.70350         0.615667

From the results it can be seen that the accuracy of the random forest model is about 60% and that the highest classification error is for the Sensitive zone category. A Kappa value of about 0.35 points to only fair agreement between the model's categorization of the observations and the actual observed classes. The results indicate that if the given pollutant data alone were used for defining such categories, the classification would not be very efficient. Efficiency could be increased by 1) incorporating more environmental variables for the same sampled sites, 2) carrying out a longer temporal survey and thereby generating more data, and/or 3) incorporating more study sites
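As a first step towards the first suggestion, the relative contribution of the three pollutants used here can be inspected through the random forest's variable importance measures (a sketch using the model fitted above):

# Mean decrease in Gini impurity for each pollutant in the fitted random forest
randomForest::importance(model_train)
randomForest::varImpPlot(model_train,
                         main = "Variable importance of pollutants")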