Rationale

The aim of this work was to explore air pollution trends in India. The specific objectives were 1) data profiling and exploratory data analysis of the dataset, summarizing and visualizing characteristic pollutant trends by a) state and b) 'type' (referred to as zones in the analyses), and 2) assessing the efficacy of the zone categorization, here assumed to be based on the pollution data. Ideally, the categories should be properly classifiable from the available data using regression techniques.

The dataset has been imported from Kaggle.

The dataset was loaded and written to a PostgreSQL database using DBI and RPostgreSQL in R. This was done because the dataset is large (>450,000 rows) and hence easier to wrangle with SQL. (The loading code has been commented out since it is a one-time process.)
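Below is a minimal sketch of that one-time import, reproduced for reference and kept commented out as in the original workflow; the CSV file name and the connection details are placeholders (assumptions) that would need to match the local setup.

# One-time load of the Kaggle CSV into the local PostgreSQL table "India_pollution"
# library(DBI)
# library(RPostgreSQL)
# con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
#                       host = "localhost", dbname = "pollution_db",
#                       user = "postgres", password = Sys.getenv("PGPASSWORD"))
# pollution_raw <- read.csv("india_air_quality.csv", stringsAsFactors = FALSE)
# DBI::dbWriteTable(con, "India_pollution", pollution_raw, row.names = FALSE)
# DBI::dbDisconnect(con)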

0.1 Obtaining the data types of different fields in the dataset


select
column_name,
data_type
from
information_schema."columns"
where table_name = 'India_pollution'
;
8 records
column_name data_type
state text
location text
type text
so2 double precision
no2 double precision
rspm double precision
spm double precision
date text

0.2 Dataset cleaning
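
--1. Trimming whitespace and standardizing case in the character fields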



update "India_pollution"
set "state" = trim(initcap("state"))
;
commit;


update "India_pollution"
set "location" = trim(initcap("location"))
;
commit;


update "India_pollution"
set "type" = trim(initcap("type"))
;
commit;


--2. Reformatting the date field from month/day/year to year/month/day and casting it to a date


update "India_pollution"
set "date"= (split_part(date,'/',3) || '/' || split_part(date,'/',1) || '/'|| split_part(date,'/',2))::date 
;
commit;


--3. Setting empty or null character fields to 'Data_not_available'


--1. state


update "India_pollution"
set "state" = 'Data_not_available'
where "state" = ''
;
commit;


--2. location


update "India_pollution"
set "location" = 'Data_not_available'
where "location" = ''
;
commit;


--3. type


update "India_pollution"
set "type" = 'Data_not_available'
where "type" is null
;
commit;

Re-checking the data after cleaning


select
 *
 from
"India_pollution"
 limit 10;
Displaying records 1 - 10
state location type so2 no2 rspm spm date
Maharashtra Akola Industrial 9.0 11.000000 127.00000 NA 2009-09-07
Kerala Kozhikode Rural_Residential 2.0 8.666667 47.33333 93 2012-05-10
Tamil Nadu Chennai Industrial NA NA 29.00000 NA 2006-06-18
Odisha Talcher Industrial 10.0 21.000000 79.00000 NA 2008-10-13
Chhattisgarh Bhilai Rural_Residential NA NA NA NA 2014-09-22
Uttar Pradesh Kanpur Rural_Residential 8.0 40.000000 246.00000 NA 2004-12-23
Odisha Angul Industrial 9.0 26.000000 91.00000 NA 2008-11-03
Delhi Delhi Industrial 4.2 41.200000 124.00000 296 2014-07-16
Delhi Delhi Industrial 4.2 60.300000 140.00000 485 2014-07-16
Kerala Kottayam Rural_Residential 6.0 22.000000 51.00000 NA 2012-04-19

Visual inspection of the data suggested that there were far fewer observations before 2005 than after. Hence, counts were obtained for both periods.

0.2.1 Observations prior to 2005


select
count(*) as obs_prior_2005
from
"India_pollution"
where date <= '2005-01-01'
;
1 records
obs_prior_2005
42796

0.2.2 Observations post 2005


select
count(*) as obs_post_2005
from
"India_pollution"
where date >= '2005-01-01'
;
1 records
obs_post_2005
391655

The counts clearly show that observations prior to 2005 are comparatively few, so only records from 2005 onwards were selected and used for further analysis.


select
*
from
"India_pollution"
where date >= '2005-01-01'
;

0.3 Data Profiling, Summarization and Visualization

0.3.1 Total row counts having non-null values in the data


select
sum(case when "state" is not null then 1 else 0 end) as state_not_nulls,
sum(case when "location" is not null then 1 else 0 end) as location_not_nulls,
sum(case when "type" is not null then 1 else 0 end) as type_not_nulls,
sum(case when "so2" is not null then 1 else 0 end) as so2_not_nulls,
sum(case when "no2" is not null then 1 else 0 end) as no2_not_nulls,
sum(case when "rspm" is not null then 1 else 0 end) as rspm_not_nulls,
sum(case when "spm" is not null then 1 else 0 end) as spm_not_nulls,
sum(case when "date" is not null then 1 else 0 end) as date_not_null
FROM "India_pollution"
where date >= '2005-01-01'
;
1 records
state_not_nulls location_not_nulls type_not_nulls so2_not_nulls no2_not_nulls rspm_not_nulls spm_not_nulls date_not_null
391655 391655 391655 359338 376680 352772 184555 391655

0.3.2 Distinct counts of some important fields


select
count(distinct "state") as states,
count(distinct "location") as locations,
count(distinct "type") as types
FROM "India_pollution"
where date >= '2005-01-01'
;
1 records
states locations types
31 269 5

0.3.3 Modifying the date field and removing NaN values (if any) to get the final dataset for further analysis

#str(pollution_data) shows that date is stored as character rather than as a date, hence it is converted to the Date class

pollution_data$date<-as.Date(pollution_data$date,"%Y-%m-%d")

# All entries of the category Data_not_available are removed from the dataset (by keeping only the four valid zone types), followed by removing NaNs (if any) using the IDPmisc library

require(IDPmisc)
## Loading required package: IDPmisc
india_pollution_data<-pollution_data%>%
    dplyr::filter(type %in% c("Sensitive",
                              "Residential",
                              "Industrial",
                              "Rural_Residential"))%>%
    IDPmisc::NaRV.omit(.)

head(india_pollution_data,5)
##          state  location              type  so2       no2      rspm spm
## 2       Kerala Kozhikode Rural_Residential  2.0  8.666667  47.33333  93
## 7        Delhi     Delhi        Industrial  4.2 41.200000 124.00000 296
## 8        Delhi     Delhi        Industrial  4.2 60.300000 140.00000 485
## 14       Bihar     Patna       Residential 11.9 41.000000 142.00000 440
## 28 Maharashtra  Kolhapur       Residential 10.3 15.500000  71.00000 150
##          date
## 2  2012-05-10
## 7  2014-07-16
## 8  2014-07-16
## 14 2015-01-23
## 28 2011-01-07

0.3.4 Exploratory data analysis of the cleaned data

Exploring the distribution of the different pollutants

india_pollution_data%>%
    dplyr::mutate(month_id=lubridate::year(date))%>%
    gather(env_var,
           values,
           so2:spm)%>%
    group_by(env_var)%>%
    do(
        plot = plot_ly(data = ., x=~values) %>%
            add_histogram(name = ~env_var)
    ) %>%
    subplot(nrows = 4)%>%
    layout(title="Histograms of distribution of pollutant values")

The distributions clearly show that all the pollutants are right-skewed, with a few very large values. Hence, these values require some transformation.

Log (x+1) transformation

india_polltn_data_trans<-india_pollution_data%>%
      mutate_at(vars(contains('no2'),
                     contains('so2'),
                     contains('rspm'),
                     contains('spm')),
            log1p)

Plotting the above data to check the distributions after transformation, using the psych package

require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
psych::multi.hist(subset(india_polltn_data_trans,
                         select = so2:spm))

The data now look more normally distributed than before and were therefore used for the subsequent analyses

Associations between the independent variables, to check for collinearity (if any)

#Collinearity is checked using multiple correlations across all variables in the data

# Multiple correlation using Pearson's correlation coefficient. The data need to be in matrix form and hence are converted within the function call itself. Spearman's correlation can also be used as per requirement


mult_correlation<-Hmisc::rcorr(as.matrix(subset(india_polltn_data_trans,
                         select = so2:spm)),
                               type=c('pearson'))

#getting the correlation coefficients

require(DT)
## Loading required package: DT
DT::datatable(round(mult_correlation$r,3))
##Visualization of the multiple correlations using ggplot2 and GGally (with histograms and correlation coefficients in the same plot)

require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw as default

ggpairs(subset(india_polltn_data_trans,
                         select = so2:spm),
        title = "Associations of pollutants")

The scatterplots and the correlation coefficients show a clear association between rspm and spm. Since rspm has more non-null values than spm (and hence more data), rspm was used in the subsequent analyses.

Since the objective is to study how the different zones compare with respect to the pollutants, the further EDA and the inferential and/or predictive analyses focus on those aspects only.

0.3.5 Data summary of pollutant data w.r.t. states and zones

india_data_type_summ<-india_pollution_data%>%
    dplyr::select(state,
                  type,
                  so2,no2,rspm)%>%
    drop_na(.)%>%
  tidyr::gather(env_var,
         values,
         so2:rspm)%>%
  dplyr::group_by(state,type,env_var)%>%
  dplyr::summarise(mean=mean(values,
                      na.rm = TRUE),
            sd=sd(values,
                  na.rm = TRUE),
            median=median(values,
                          na.rm = TRUE),
            min=min(values),
            max=max(values))%>%
           tidyr::drop_na(.)%>%
    dplyr::mutate_if(is.numeric,
                     round,2)
## `mutate_if()` ignored the following grouping variables:
## Columns `state`, `type`
DT::datatable(india_data_type_summ)

0.3.6 The data summary visualization w.r.t. different states and zones (types)

Here, to give an order to the sequence of states, GIS data (the centroid latitude of each state) are used

# Webpage is scraped using the rvest package

#install.packages('rvest')

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
# URL (with Indian states GIS data)

webpage <- read_html("http://www.quickgs.com/latitudinal-and-longitudinal-extents-of-india-indian-states-and-cities/")

# Obtaining the data

tables_list <- webpage %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)


table_df<-tables_list[1]%>%
  data.frame(.)

# The data are further cleaned by 1) removing the °N and °E symbols from the numbers and 2) keeping only the state name and latitude/longitude

clean_table<-table_df%>%
  tidyr::separate(Latitude,
           c("lat","deg_N"),## separating the column to separate "°" from numbers
           sep="°")%>%
   tidyr::separate(Longitude,
           c("lon","deg_E"),
           sep="°")%>% ## separating the column to separate "°" from numbers
  tidyr::separate(lon,c("lon_d",
                 "lon_m"),sep="\\.")%>% ## separating the numeric column to separate degrees and minutes
  tidyr::separate(lat,c("lat_d",
               "lat_m"),
         sep = "\\.")%>% ## separating the numeric column to separate degrees and minutes
    dplyr::select(State,
                  lat_d,
                  lat_m,
                  lon_d,
                  lon_m)%>%
  dplyr::mutate_at(vars(contains('lat'),
                 contains('lon')),
            as.numeric)


# The lat/long data are in degrees and minutes and hence are converted to decimal degrees

clean_table2<-clean_table%>%
  dplyr::transmute(lat_dec=lat_d + lat_m/60,
            long_dec=lon_d + lon_m/60)%>%
  dplyr::mutate(state=clean_table$State)

# Observing the state names in both datasets

clean_table2$state
##  [1] "Andhra Pradesh"    "Arunachal Pradesh" "Assam"            
##  [4] "Bihar"             "Chhattisgarh"      "Goa"              
##  [7] "Gujarat"           "Haryana"           "Himachal Pradesh" 
## [10] "Jammu & Kashmir"   "Jharkhand"         "Karnataka"        
## [13] "Kerala"            "Madhya Pradesh"    "Maharashtra"      
## [16] "Manipur"           "Meghalaya"         "Mizoram"          
## [19] "Nagaland"          "Odisha"            "Punjab"           
## [22] "Rajasthan"         "Sikkim"            "Tamil Nadu"       
## [25] "Tripura"           "Uttarakhand"       "Uttar Pradesh"    
## [28] "West Bengal"
unique(india_data_type_summ$state)
##  [1] "Andhra Pradesh"       "Assam"                "Bihar"               
##  [4] "Chandigarh"           "Chhattisgarh"         "Dadra & Nagar Haveli"
##  [7] "Daman & Diu"          "Delhi"                "Goa"                 
## [10] "Gujarat"              "Haryana"              "Himachal Pradesh"    
## [13] "Jammu & Kashmir"      "Jharkhand"            "Karnataka"           
## [16] "Kerala"               "Madhya Pradesh"       "Maharashtra"         
## [19] "Manipur"              "Meghalaya"            "Mizoram"             
## [22] "Nagaland"             "Odisha"               "Puducherry"          
## [25] "Punjab"               "Rajasthan"            "Tamil Nadu"          
## [28] "Uttar Pradesh"
setdiff(unique(india_data_type_summ$state),clean_table2$state)
## [1] "Chandigarh"           "Dadra & Nagar Haveli" "Daman & Diu"         
## [4] "Delhi"                "Puducherry"
# Comparing the two state lists shows that the pollution data contain union territories (and Chandigarh) whose GIS data are not provided in the scraped table. Coordinates for Dadra & Nagar Haveli, Daman & Diu, Puducherry and Delhi are therefore added manually; Chandigarh is excluded later.

# Additional data

add_gis_data<-data.frame(lat_dec=c(20.27,20.42,11.9416,28.7041),
                         long_dec=c(73.02,72.83,79.8083,77.1025),
                         state=c("Dadra & Nagar Haveli","Daman & Diu","Puducherry","Delhi"))

# Binding with the GIS data 

clean_table3<-data.frame(rbind(clean_table2,
                               add_gis_data))


# This data is then combined with the data summary

india_data_type_summ2<-india_data_type_summ%>%
  left_join(clean_table3,
            by=c("state"))%>%
    dplyr::filter(state!='Chandigarh') # Removing Chandigarh (since it is not a state)

# The states are ordered by their latitude using the forcats package, and the data are visualized with plotly and crosstalk. Crosstalk provides limited interactivity for viewing specific pollutant trends for specific zones

require(forcats)

require(leaflet)
## Loading required package: leaflet
require(plotly)    

require(crosstalk)
## Loading required package: crosstalk
# Creating a shared object

india_data_type_sum_shr<-crosstalk::SharedData$new(india_data_type_summ2,
                                                   group = "individual")

#Plot

plot_india_polltn<-india_data_type_sum_shr%>%
    plot_ly(x = ~forcats::fct_reorder(state,lat_dec), 
            y = ~mean, 
            color = ~type,
            colors = "Set1",
            width = "50%",
            height = 450)%>%
    group_by(type)%>%
    add_markers(size=4)%>%
    layout(title = 'Statewise pollution data',
           xaxis = list(title = "States"),
           yaxis = list(title = "Mean values"),
           showlegend=FALSE)

bscols(list(filter_checkbox(id = "type",
                       sharedData = india_data_type_sum_shr,
                       group = ~type,
                       label = "Zones",
                       inline = TRUE),
       filter_checkbox(id = "env_var",
                       sharedData = india_data_type_sum_shr,
                       group = ~env_var,
                       label = "Pollutant",
                       inline = TRUE)),
       widths = c(3.5,8.5),
       plot_india_polltn)

Yearly trends in pollution based on the zone types

# Obtaining the data

india_polln_date_type_sum<-india_pollution_data%>%
    dplyr::select(date,
                  type,
                  so2,no2,rspm)%>%
    drop_na(.)%>%
    dplyr::mutate(years=lubridate::year(date))%>%
    tidyr::gather(env_var,
           values,
           so2:rspm,-date)%>%
       dplyr::group_by(type,years,env_var)%>%
    dplyr::summarise(mean=mean(values,
                        na.rm = TRUE),
              sd=sd(values,
                    na.rm = TRUE),
              median=median(values,
                            na.rm = TRUE),
              min=min(values),
              max=max(values))
 
# Shared data object

india_poln_yr_type_shr<-crosstalk::SharedData$new(india_polln_date_type_sum, group = "individual")
                                   
# Plot

plot_india_poln_yr<-india_poln_yr_type_shr%>%
    plot_ly(x = ~years, 
            y = ~mean,
            color = ~type,
            colors = "Set2",
            height = 400,
            width="80%")%>%
    group_by(type)%>%
        add_markers(size=5)%>%
    layout(title = 'Yearly pollution data of different zones',
           xaxis = list(title = "Years"),
           yaxis = list(title = "Mean values"),
           showlegend=FALSE)

bscols(list(filter_checkbox(id = "env_var",
                       sharedData = india_poln_yr_type_shr,
                       group = ~env_var,
                       label = "Pollutant"),
       filter_select(id = "type",
                     sharedData = india_poln_yr_type_shr,
                     group = ~type,
                     label = "Zones")),
       plot_india_poln_yr,
       widths = c(3.5,8.5))

0.4 Studying the efficacy of the pollutants in categorizing the sampling sites into specific zones (Industrial, Residential, Rural_Residential and Sensitive)

A multinomial logistic regression was performed. This type of regression was selected because the dependent variable, the zone category of the localities, has more than two possible outcomes (not just 0 and 1) and the outcomes have no predetermined ordering or weight (i.e., it is a nominal variable).

First, the data were modified for the analysis and observed

##Multinomial regression to test the efficacy of the environmental variables in classifying the types of areas


library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#create an index to be used for selecting training and test samples from the dataset

india_data_for_reg<-india_polltn_data_trans%>%
    dplyr::select(type:no2,rspm)%>%
    dplyr::mutate_at(vars('type'),as.factor)%>%
    drop_na(.)

table(india_data_for_reg$type)
## 
##        Industrial       Residential Rural_Residential         Sensitive 
##             45561             58118             21867              4494

As seen from the above results, the sample sizes of the categories differ widely, which could bias the results. Therefore, a few different approaches were undertaken for training the model, to check how (if at all) the accuracy changes with each approach.

0.4.1 Defining training, test data and the regression formula

## Splitting the data into training and test datasets to assess predictive performance (a 70-30 split has been used throughout). The package 'caret' is used for partitioning the dataset

require(caret)

set.seed(123)

poll_data_india_reg <- caret::createDataPartition(y = india_data_for_reg$type, 
                                                  p= 0.7, 
                                                  list = FALSE)

training_data <- india_data_for_reg[poll_data_india_reg,]


test_data <- india_data_for_reg[-poll_data_india_reg,]


# Defining the formula for the regression (only selected independent variables have been used in the analysis)

reg.formula<-reformulate(names(india_data_for_reg[c(2:4)]), 
                         names(india_data_for_reg[1]))

reg.formula
## type ~ so2 + no2 + rspm

1. Training using all the observations from the data

# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.

library(nnet)


ind_poll_model.train <- nnet::multinom(reg.formula, 
                                       data = training_data)
## # weights:  20 (12 variable)
## initial  value 126192.989398 
## iter  10 value 121098.773288
## iter  20 value 100989.038983
## final  value 100988.409681 
## converged
# Summarize the model


summary(ind_poll_model.train)
## Call:
## nnet::multinom(formula = reg.formula, data = training_data)
## 
## Coefficients:
##                   (Intercept)        so2         no2       rspm
## Residential         1.7524049 -0.6740847  0.28037890 -0.1758999
## Rural_Residential   2.3798928 -0.9180857 -0.08000808 -0.1684136
## Sensitive           0.7756912 -1.3932132  0.36218372 -0.2457527
## 
## Std. Errors:
##                   (Intercept)        so2        no2       rspm
## Residential         0.0615881 0.01391856 0.01782615 0.01317751
## Rural_Residential   0.0730101 0.01911033 0.02290091 0.01673334
## Sensitive           0.1171593 0.03721592 0.04119710 0.02991015
## 
## Residual Deviance: 201976.8 
## AIC: 202000.8
# Predictions using the test data


mult_reg_test<-ind_poll_model.train%>%
    predict(test_data)

# Visualization of predicted probabilities using a Gain Curve Plot

require(WVPlots)
## Loading required package: WVPlots
mult_reg_test_prob<-test_data%>%
    dplyr::mutate(predicted=mult_reg_test)

WVPlots::GainCurvePlot(mult_reg_test_prob,"predicted","type","Gain Curve of pollutant model")

# Accuracy of the model

accuracy_model_all<-mean(mult_reg_test == test_data$type)

accuracy_model_all
## [1] 0.4904001

The gain curve plot and the accuracy value (~0.49) show that the model sorts the types almost at random, which limits its reliability
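For a class-level view of these misclassifications, the same caret::confusionMatrix() call used for the later models could also be applied to this first model (a sketch; its output is not reproduced here):

# Confusion matrix of the all-observations multinomial model on the test set
caret::confusionMatrix(data = mult_reg_test,
                       reference = test_data$type)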

2. Sampling equal numbers from each category, based on the smallest category (here, the Sensitive zone), using the training dataset

# sampling

training_eq_samp<-training_data%>%
    group_by(type)%>%
    sample_n(table(training_data$type)['Sensitive'])

# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.

library(nnet)


ind_poll_train2 <- nnet::multinom(reg.formula, 
                                       data = training_eq_samp)
## # weights:  20 (12 variable)
## initial  value 17445.128240 
## iter  10 value 16941.645493
## final  value 16790.913779 
## converged
# Summarize the model


summary(ind_poll_train2)
## Call:
## nnet::multinom(formula = reg.formula, data = training_eq_samp)
## 
## Coefficients:
##                   (Intercept)        so2        no2       rspm
## Residential          1.640562 -0.7698662 0.35342941 -0.2051672
## Rural_Residential    3.205044 -0.9537267 0.02679519 -0.2396730
## Sensitive            3.546419 -1.4579364 0.41073270 -0.3463862
## 
## Std. Errors:
##                   (Intercept)        so2        no2       rspm
## Residential         0.2035322 0.04688569 0.06058598 0.04351234
## Rural_Residential   0.1984365 0.04851360 0.06076798 0.04357841
## Sensitive           0.1994551 0.05080364 0.06153884 0.04408358
## 
## Residual Deviance: 33581.83 
## AIC: 33605.83
# Predictions using the test data


mult_reg_test2<-ind_poll_train2%>%
    predict(test_data)


# Accuracy of the model

accuracy_model_equal<-mean(mult_reg_test2 == test_data$type)

accuracy_model_equal
## [1] 0.3039655

3. Using k-fold cross-validation

require(caret)

training_kfold<-train(
    form = reg.formula,
    data = training_data,
    trControl = trainControl(method = "cv", 
                             number = 10),
    method = "multinom")
## (nnet iteration traces omitted: the multinomial model was refit for each of the
##  10 cross-validation folds and each of the 3 candidate decay values; every fit
##  converged, with fold-level final values between roughly 90851 and 90927, and the
##  final refit on the full training set converging at value 100989.698840)
# Complete results of the training model 

training_kfold
## Penalized Multinomial Regression 
## 
## 91029 samples
##     3 predictor
##     4 classes: 'Industrial', 'Residential', 'Rural_Residential', 'Sensitive' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 81926, 81926, 81926, 81927, 81925, 81926, ... 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.4951718  0.1276838
##   1e-04  0.4951718  0.1276838
##   1e-01  0.4951938  0.1277121
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
# Final model

training_kfold$finalModel
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
## 
## Coefficients:
##                   (Intercept)        so2        no2       rspm
## Residential         1.7493945 -0.6739294  0.2806017 -0.1754878
## Rural_Residential   2.3758973 -0.9178653 -0.0797057 -0.1678638
## Sensitive           0.7710164 -1.3926222  0.3622813 -0.2450777
## 
## Residual Deviance: 201979.4 
## AIC: 202003.4
# Accuracy of the model using test data

accuracy_model_kfold<-mean(test_data$type == predict(training_kfold$finalModel, 
                               newdata = test_data))

accuracy_model_kfold
## [1] 0.4903745
# Calculate the confusion matrix for the test set

class_pred_kfold<-predict(object = training_kfold$finalModel,    
                            newdata = test_data,  
                            type = "class")

confusion_mat_pred_kfold <- caret::confusionMatrix(data = class_pred_kfold ,         # predicted classes
                      reference = test_data$type)  # actual classes

print(confusion_mat_pred_kfold)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Industrial Residential Rural_Residential Sensitive
##   Industrial              4990        3398              1341       117
##   Residential             8602       14037              5117      1231
##   Rural_Residential         76           0               102         0
##   Sensitive                  0           0                 0         0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4903          
##                  95% CI : (0.4854, 0.4953)
##     No Information Rate : 0.4469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1193          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Industrial Class: Residential
## Sensitivity                     0.3651             0.8051
## Specificity                     0.8084             0.3071
## Pos Pred Value                  0.5068             0.4843
## Neg Pred Value                  0.7025             0.6610
## Prevalence                      0.3504             0.4469
## Detection Rate                  0.1279             0.3598
## Detection Prevalence            0.2524             0.7430
## Balanced Accuracy               0.5867             0.5561
##                      Class: Rural_Residential Class: Sensitive
## Sensitivity                          0.015549          0.00000
## Specificity                          0.997658          1.00000
## Pos Pred Value                       0.573034              NaN
## Neg Pred Value                       0.833698          0.96545
## Prevalence                           0.168158          0.03455
## Detection Rate                       0.002615          0.00000
## Detection Prevalence                 0.004563          0.00000
## Balanced Accuracy                    0.506603          0.50000

4. Random forest approach

set.seed(123)

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
model_train <- randomForest(formula = reg.formula,
                             data = training_data)

# Print the model output      
                       
print(model_train)
## 
## Call:
##  randomForest(formula = reg.formula, data = training_data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 40.41%
## Confusion matrix:
##                   Industrial Residential Rural_Residential Sensitive
## Industrial             15966       13432              2343       152
## Residential             8628       30629              1124       302
## Rural_Residential       3492        4804              6942        69
## Sensitive                607        1688               147       704
##                   class.error
## Industrial          0.4993886
## Residential         0.2471303
## Rural_Residential   0.5464820
## Sensitive           0.7762238
# predicted classes using the model

class_prediction2 <- predict(object = model_train,    
                            newdata = test_data,  
                            type = "class") 

# Calculate the confusion matrix for the test set

confusion_mat_pred <- caret::confusionMatrix(data = class_prediction2 ,         # predicted classes
                      reference = test_data$type)  # actual classes

print(confusion_mat_pred)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Industrial Residential Rural_Residential Sensitive
##   Industrial              6828        3712              1467       254
##   Residential             5763       13094              2072       712
##   Rural_Residential        997         491              2983        61
##   Sensitive                 80         138                38       321
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5954          
##                  95% CI : (0.5905, 0.6002)
##     No Information Rate : 0.4469          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3494          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Industrial Class: Residential
## Sensitivity                     0.4996             0.7510
## Specificity                     0.7856             0.6039
## Pos Pred Value                  0.5569             0.6051
## Neg Pred Value                  0.7443             0.7501
## Prevalence                      0.3504             0.4469
## Detection Rate                  0.1750             0.3356
## Detection Prevalence            0.3143             0.5547
## Balanced Accuracy               0.6426             0.6774
##                      Class: Rural_Residential Class: Sensitive
## Sensitivity                           0.45473         0.238131
## Specificity                           0.95227         0.993203
## Pos Pred Value                        0.65821         0.556326
## Neg Pred Value                        0.89626         0.973279
## Prevalence                            0.16816         0.034554
## Detection Rate                        0.07647         0.008228
## Detection Prevalence                  0.11617         0.014791
## Balanced Accuracy                     0.70350         0.615667

From the results it can be seen that the accuracy of the random forest model is about 60% and that the highest classification error is for the Sensitive zone category. A Kappa value of about 0.35 points to only fair agreement between the model's categorization of the observations and the actual observed classes. The results indicate that if the given pollutant data alone were used for defining such categories, the classification would not be very efficient. Efficiency could be increased by 1) incorporating more environmental variables for the same sampled sites, 2) carrying out a longer temporal survey and thereby generating more data, and/or 3) incorporating more study sites
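As a first step towards the first suggestion, the relative contribution of the three pollutants used here can be inspected through the random forest's variable importance measures (a sketch using the model fitted above):

# Mean decrease in Gini impurity for each pollutant in the fitted random forest
randomForest::importance(model_train)
randomForest::varImpPlot(model_train,
                         main = "Variable importance of pollutants")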