Rationale
The aim of this work was to explore air pollution trends in India. Specific objectives included: 1) data profiling and exploratory data analysis of the dataset, summarizing and visualizing characteristic trends of the pollutants by a) state and b) 'type' (referred to as zones in the analyses); and 2) assessing the efficacy of the zone categorization, here assumed to be based on the pollution data. Ideally, the categories should be properly classifiable from the available data using regression techniques.
The dataset was imported from Kaggle.
The dataset was loaded into and written to a PostgreSQL database using DBI and RPostgreSQL in R. This was done because the dataset is large (>450,000 rows) and therefore easier to wrangle with SQL. (The loading code has been commented out since it is a one-time process.)
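The load itself is not shown in the original; the sketch below shows what that one-time step might look like. The CSV file name, database name and credentials are illustrative assumptions, not taken from the original analysis.
# One-time load of the Kaggle CSV into a local PostgreSQL table (sketch; file name and credentials are placeholders)
library(DBI)
library(RPostgreSQL)
con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                      dbname = "pollution_db",
                      host = "localhost",
                      user = "postgres",
                      password = "postgres")
raw_data <- read.csv("india_air_quality.csv", stringsAsFactors = FALSE)
DBI::dbWriteTable(con, "India_pollution", raw_data, row.names = FALSE)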
select
column_name,
data_type
from
information_schema."columns"
where table_name = 'India_pollution'
;
| column_name | data_type |
|---|---|
| state | text |
| location | text |
| type | text |
| so2 | double precision |
| no2 | double precision |
| rspm | double precision |
| spm | double precision |
| date | text |
update "India_pollution"
set "state" = trim(initcap("state"))
;
commit;
update "India_pollution"
set "location" = trim(initcap("location"))
;
commit;
update "India_pollution"
set "type" = trim(initcap("type"))
;
commit;
--2. Editing and changing the date field
update "India_pollution"
set "date"= (split_part(date,'/',3) || '/' || split_part(date,'/',1) || '/'|| split_part(date,'/',2))::date
;
commit;
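Note that the date column is still stored as text after this update (the ::date result is cast back to text on assignment). If a proper date type is preferred, the column type could be changed; a hedged sketch using the DBI connection assumed above:
# Optional: convert the cleaned text column to a true DATE type (not part of the original workflow)
DBI::dbExecute(con, 'ALTER TABLE "India_pollution" ALTER COLUMN "date" TYPE date USING "date"::date')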
--3. Adding 'Data_not_available' to empty character field cells
--1. state
update "India_pollution"
set "state" = 'Data_not_available'
where "state" = ''
;
commit;
--2. location
update "India_pollution"
set "location" = 'Data_not_available'
where "location" = ''
;
commit;
--3. type
update "India_pollution"
set "type" = 'Data_not_available'
where "type" is null
;
commit;
Re-checking the data after cleaning
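(The query used to pull the sample rows below is not shown in the original; one way to obtain a random sample for a quick visual check, assuming the same DBI connection, is sketched here.)
# Pull a random sample of cleaned rows for inspection (illustrative query)
sample_rows <- DBI::dbGetQuery(con, 'select * from "India_pollution" order by random() limit 10')
sample_rows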
| state | location | type | so2 | no2 | rspm | spm | date |
|---|---|---|---|---|---|---|---|
| Maharashtra | Akola | Industrial | 9.0 | 11.000000 | 127.00000 | NA | 2009-09-07 |
| Kerala | Kozhikode | Rural_Residential | 2.0 | 8.666667 | 47.33333 | 93 | 2012-05-10 |
| Tamil Nadu | Chennai | Industrial | NA | NA | 29.00000 | NA | 2006-06-18 |
| Odisha | Talcher | Industrial | 10.0 | 21.000000 | 79.00000 | NA | 2008-10-13 |
| Chhattisgarh | Bhilai | Rural_Residential | NA | NA | NA | NA | 2014-09-22 |
| Uttar Pradesh | Kanpur | Rural_Residential | 8.0 | 40.000000 | 246.00000 | NA | 2004-12-23 |
| Odisha | Angul | Industrial | 9.0 | 26.000000 | 91.00000 | NA | 2008-11-03 |
| Delhi | Delhi | Industrial | 4.2 | 41.200000 | 124.00000 | 296 | 2014-07-16 |
| Delhi | Delhi | Industrial | 4.2 | 60.300000 | 140.00000 | 485 | 2014-07-16 |
| Kerala | Kottayam | Rural_Residential | 6.0 | 22.000000 | 51.00000 | NA | 2012-04-19 |
From a visual inspection of the data, there appeared to be far fewer observations before 2005 than from 2005 onwards. Hence, counts were obtained for both groups.
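(A sketch of how the two counts might be obtained through the same DBI connection; the exact queries used in the original are not shown.)
# Count observations before and from 2005 onwards (illustrative query)
DBI::dbGetQuery(con, "
  select
    sum(case when \"date\" <  '2005-01-01' then 1 else 0 end) as obs_prior_2005,
    sum(case when \"date\" >= '2005-01-01' then 1 else 0 end) as obs_post_2005
  from \"India_pollution\"")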
| obs_prior_2005 |
|---|
| 42796 |
| obs_post_2005 |
|---|
| 391655 |
The observations prior to 2005 were clearly far fewer; hence, only records from 2005 onwards were selected and used for further analysis.
select
sum(case when "state" is not null then 1 else 0 end) as state_not_nulls,
sum(case when "location" is not null then 1 else 0 end) as location_not_nulls,
sum(case when "type" is not null then 1 else 0 end) as type_not_nulls,
sum(case when "so2" is not null then 1 else 0 end) as so2_not_nulls,
sum(case when "no2" is not null then 1 else 0 end) as no2_not_nulls,
sum(case when "rspm" is not null then 1 else 0 end) as rspm_not_nulls,
sum(case when "spm" is not null then 1 else 0 end) as spm_not_nulls,
sum(case when "date" is not null then 1 else 0 end) as date_not_null
FROM "India_pollution"
where date >= '2005-01-01'
;
| state_not_nulls | location_not_nulls | type_not_nulls | so2_not_nulls | no2_not_nulls | rspm_not_nulls | spm_not_nulls | date_not_null |
|---|---|---|---|---|---|---|---|
| 391655 | 391655 | 391655 | 359338 | 376680 | 352772 | 184555 | 391655 |
select
count(distinct "state") as states,
count(distinct "location") as locations,
count(distinct "type") as types
FROM "India_pollution"
where date >= '2005-01-01'
;
| states | locations | types |
|---|---|---|
| 31 | 269 | 5 |
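The data frame pollution_data used in the R code from here on is not created in the chunks shown above; a minimal sketch of how it might be pulled from the PostgreSQL table, assuming the connection con and the 2005 cutoff decided above, is:
# Read the post-2005 records into R for the remaining analysis (sketch)
pollution_data <- DBI::dbGetQuery(con, "select * from \"India_pollution\" where \"date\" >= '2005-01-01'")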
#str(pollution_data) shows that date is a character rather than a date and hence changed to date
pollution_data$date<-as.Date(pollution_data$date,"%Y-%m-%d")
# All entries of the category Data_not_available are removed from the dataset, followed by removing NaNs (if any) using the IDPmisc package
require(IDPmisc)
## Loading required package: IDPmisc
india_pollution_data<-pollution_data%>%
dplyr::filter(type %in% c("Sensitive",
"Residential",
"Industrial",
"Rural_Residential"))%>%
IDPmisc::NaRV.omit(.)
head(india_pollution_data,5)
## state location type so2 no2 rspm spm
## 2 Kerala Kozhikode Rural_Residential 2.0 8.666667 47.33333 93
## 7 Delhi Delhi Industrial 4.2 41.200000 124.00000 296
## 8 Delhi Delhi Industrial 4.2 60.300000 140.00000 485
## 14 Bihar Patna Residential 11.9 41.000000 142.00000 440
## 28 Maharashtra Kolhapur Residential 10.3 15.500000 71.00000 150
## date
## 2 2012-05-10
## 7 2014-07-16
## 8 2014-07-16
## 14 2015-01-23
## 28 2011-01-07
Exploring the distribution of the different pollutants
india_pollution_data%>%
dplyr::mutate(month_id=lubridate::year(date))%>%
gather(env_var,
values,
so2:spm)%>%
group_by(env_var)%>%
do(
plot = plot_ly(data = ., x=~values) %>%
add_histogram(name = ~env_var)
) %>%
subplot(nrows = 4)%>%
layout(title="Histograms of distribution of pollutant values")The distributions clearly show that all the pollutants have a right skewed distribution having a few but very large values. Hence, these volumes would require some transformation
Log (x+1) transformation
india_polltn_data_trans<-india_pollution_data%>%
mutate_at(vars(contains('no2'),
contains('so2'),
contains('rspm'),
contains('spm')),
log1p)
Plotting the above data to check the distributions after transformation, using the psych package
## Loading required package: psych
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
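The plotting call itself is not echoed above; a minimal sketch using psych::multi.hist on the transformed pollutant columns might look like:
# Histograms (with density curves) of the log1p-transformed pollutants
psych::multi.hist(subset(india_polltn_data_trans, select = so2:spm))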
The data now look more normally distributed than earlier and hence were used for subsequent analysis
Associations among the independent variables to check for collinearity, if any
#Collinearity checked using multiple correlation using all variables in the data
# Multiple correlation using Pearson's correlation coefficient. The data need to be a matrix, hence converted within the function call itself. Spearman's correlation can also be used as required
mult_correlation<-Hmisc::rcorr(as.matrix(subset(india_polltn_data_trans,
select = so2:spm)),
type=c('pearson'))
#getting the correlation coefficients
require(DT)
## Loading required package: DT
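The DT call that displays the coefficients is not shown; one way the correlation matrix could be rendered as an interactive table (a sketch, not the original code) is:
# Interactive table of the Pearson correlation coefficients
DT::datatable(round(mult_correlation$r, 2),
              caption = "Pairwise correlations among pollutants")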
##Visualization of the multiple correlation using ggplot and Ggally (with histograms and correlation indices in the same plot)
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
ggplot2::theme_set(ggplot2::theme_bw()) # set theme_bw as default
ggpairs(subset(india_polltn_data_trans,
select = so2:spm),
title = "Associations of pollutants")There seems to be association between rspm and spm seen from the scatterplots and the correlation co-efficient. Given the count of not null values of rspm being more than spm (hence more data), rspm was therefore used in subsequent analysis
Since the objective is to study how the different zones compare with respect to the pollutants, the further EDA and the inferential and/or predictive analyses focus only on those aspects.
india_data_type_summ<-india_pollution_data%>%
dplyr::select(state,
type,
so2,no2,rspm)%>%
drop_na(.)%>%
tidyr::gather(env_var,
values,
so2:rspm)%>%
dplyr::group_by(state,type,env_var)%>%
dplyr::summarise(mean=mean(values,
na.rm = TRUE),
sd=sd(values,
na.rm = TRUE),
median=median(values,
na.rm = TRUE),
min=min(values),
max=max(values))%>%
tidyr::drop_na(.)%>%
dplyr::mutate_if(is.numeric,
round,2)
## `mutate_if()` ignored the following grouping variables:
## Columns `state`, `type`
Here, to give an order to the sequence of states, GIS data (centroid latitude) of each state is used
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
# URL (with Indian states GIS data)
webpage <- read_html("http://www.quickgs.com/latitudinal-and-longitudinal-extents-of-india-indian-states-and-cities/")
# Obtaining the data
tables_list <- webpage %>%
html_nodes("table") %>%
html_table(fill = TRUE)
table_df<-tables_list[1]%>%
data.frame(.)
# Data are further cleaned by 1) removing °N and °E from the numbers, and 2) selecting only the state name and lat/long
clean_table<-table_df%>%
tidyr::separate(Latitude,
c("lat","deg_N"),## separating the column to separate "°" from numbers
sep="°")%>%
tidyr::separate(Longitude,
c("lon","deg_E"),
sep="°")%>% ## separating the column to separate "°" from numbers
tidyr::separate(lon,c("lon_d",
"lon_m"),sep="\\.")%>% ## separating the numeric column to separate degrees and minutes
tidyr::separate(lat,c("lat_d",
"lat_m"),
sep = "\\.")%>% ## separating the numeric column to separate degrees and minutes
dplyr::select(State,
lat_d,
lat_m,
lon_d,
lon_m)%>%
dplyr::mutate_at(vars(contains('lat'),
contains('lon')),
as.numeric)
# The lat/long data are in degrees and minutes and hence are converted to decimal degrees
clean_table2<-clean_table%>%
dplyr::transmute(lat_dec=lat_d + lat_m/60,
long_dec=lon_d + lon_m/60)%>%
dplyr::mutate(state=clean_table$State)
# Observing the state names in both datasets
clean_table2$state
## [1] "Andhra Pradesh" "Arunachal Pradesh" "Assam"
## [4] "Bihar" "Chhattisgarh" "Goa"
## [7] "Gujarat" "Haryana" "Himachal Pradesh"
## [10] "Jammu & Kashmir" "Jharkhand" "Karnataka"
## [13] "Kerala" "Madhya Pradesh" "Maharashtra"
## [16] "Manipur" "Meghalaya" "Mizoram"
## [19] "Nagaland" "Odisha" "Punjab"
## [22] "Rajasthan" "Sikkim" "Tamil Nadu"
## [25] "Tripura" "Uttarakhand" "Uttar Pradesh"
## [28] "West Bengal"
## [1] "Andhra Pradesh" "Assam" "Bihar"
## [4] "Chandigarh" "Chhattisgarh" "Dadra & Nagar Haveli"
## [7] "Daman & Diu" "Delhi" "Goa"
## [10] "Gujarat" "Haryana" "Himachal Pradesh"
## [13] "Jammu & Kashmir" "Jharkhand" "Karnataka"
## [16] "Kerala" "Madhya Pradesh" "Maharashtra"
## [19] "Manipur" "Meghalaya" "Mizoram"
## [22] "Nagaland" "Odisha" "Puducherry"
## [25] "Punjab" "Rajasthan" "Tamil Nadu"
## [28] "Uttar Pradesh"
## [1] "Chandigarh" "Dadra & Nagar Haveli" "Daman & Diu"
## [4] "Delhi" "Puducherry"
# Comparing the pollutant and GIS datasets, some Union Territories appear in the pollution data but their GIS data are not provided in the scraped table, and Chandigarh (which is not a state) also appears. Therefore, these data are manually edited.
# Additional data
add_gis_data<-data.frame(lat_dec=c(20.27,20.42,11.9416,28.7041),
long_dec=c(73.02,72.83,79.8083,77.1025),
state=c("Dadra & Nagar Haveli","Daman & Diu","Puducherry","Delhi"))
# Binding with the GIS data
clean_table3<-data.frame(rbind(clean_table2,
add_gis_data))
# This data is then combined with the data summary
india_data_type_summ2<-india_data_type_summ%>%
left_join(clean_table3,
by=c("state"))%>%
dplyr::filter(state!='Chandigarh') # Removing Chandigarh (since its not a state)
# The states are ordered by their latitude using the forcats package, and the data are visualized using plotly and crosstalk. Crosstalk provides limited interactivity for viewing specific pollutant trends for specific zones
require(forcats)
require(leaflet)
## Loading required package: leaflet
## Loading required package: crosstalk
# Creating a shared object
india_data_type_sum_shr<-crosstalk::SharedData$new(india_data_type_summ2,
group = "individual")
#Plot
plot_india_polltn<-india_data_type_sum_shr%>%
plot_ly(x = ~forcats::fct_reorder(state,lat_dec),
y = ~mean,
color = ~type,
colors = "Set1",
width = "50%",
height = 450)%>%
group_by(type)%>%
add_markers(size=4)%>%
layout(title = 'Statewise pollution data',
xaxis = list(title = "States"),
yaxis = list(title = "Mean values"),
showlegend=FALSE)
bscols(list(filter_checkbox(id = "type",
sharedData = india_data_type_sum_shr,
group = ~type,
label = "Zones",
inline = TRUE),
filter_checkbox(id = "env_var",
sharedData = india_data_type_sum_shr,
group = ~env_var,
label = "Pollutant",
inline = TRUE)),
widths = c(3.5,8.5),
plot_india_polltn)
Yearwise trends in pollution based on the types
# Obtaining the data
india_polln_date_type_sum<-india_pollution_data%>%
dplyr::select(date,
type,
so2,no2,rspm)%>%
drop_na(.)%>%
dplyr::mutate(years=lubridate::year(date))%>%
tidyr::gather(env_var,
values,
so2:rspm,-date)%>%
dplyr::group_by(type,years,env_var)%>%
dplyr::summarise(mean=mean(values,
na.rm = TRUE),
sd=sd(values,
na.rm = TRUE),
median=median(values,
na.rm = TRUE),
min=min(values),
max=max(values))
# Shared data object
india_poln_yr_type_shr<-crosstalk::SharedData$new(india_polln_date_type_sum, group = "individual")
# Plot
plot_india_poln_yr<-india_poln_yr_type_shr%>%
plot_ly(x = ~years,
y = ~mean,
color = ~type,
colors = "Set2",
height = 400,
width="80%")%>%
group_by(type)%>%
add_markers(size=5)%>%
layout(title = 'Yearly pollution data of different zones',
xaxis = list(title = "Years"),
yaxis = list(title = "Mean values"),
showlegend=FALSE)
bscols(list(filter_checkbox(id = "env_var",
sharedData = india_poln_yr_type_shr,
group = ~env_var,
label = "Pollutant"),
filter_select(id = "type",
sharedData = india_poln_yr_type_shr,
group = ~type,
label = "Zones")),
plot_india_poln_yr,
widths = c(3.5,8.5))
A multinomial regression was performed. This type of regression was selected because the dependent variable, the category of localities (zones), has more than two outcomes (not just 0 and 1) and the outcomes have no inherent ordering or weighting (i.e., it is a nominal variable).
First, the data were modified for the analysis and observed
##Multinomial regression to test the efficacy of the environmental variables in classifying the types of areas
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#create an index to be used for selecting training and test samples from the dataset
india_data_for_reg<-india_polltn_data_trans%>%
dplyr::select(type:no2,rspm)%>%
dplyr::mutate_at(vars('type'),as.factor)%>%
drop_na(.)
table(india_data_for_reg$type)
##
## Industrial Residential Rural_Residential Sensitive
## 45561 58118 21867 4494
As seen from the above results, the sample sizes of the categories differ considerably, which could bias the results. Therefore, a few different approaches were undertaken for training the model to check how (and whether) the accuracy changes with each type of training.
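A quick look at the class proportions makes the imbalance explicit (a small sketch; from the counts above, the Sensitive zone accounts for only about 3.5% of the records):
# Proportion of observations in each zone category
round(prop.table(table(india_data_for_reg$type)), 3)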
## Splitting the data into training and testing datasets for predictive performance (70 -30 split has been used throughout). Package 'caret' is used for partitioning the dataset
require(caret)
set.seed(123)
poll_data_india_reg <- caret::createDataPartition(y = india_data_for_reg$type,
p= 0.7,
list = FALSE)
training_data <- india_data_for_reg[poll_data_india_reg,]
test_data <- india_data_for_reg[-poll_data_india_reg,]
# Defining the formula for the regression (only certain independent variables are used in the analysis)
reg.formula<-reformulate(names(india_data_for_reg[c(2:4)]),
names(india_data_for_reg[1]))
reg.formula
## type ~ so2 + no2 + rspm
1. Training using all the observations from the data
# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.
library(nnet)
ind_poll_model.train <- nnet::multinom(reg.formula,
data = training_data)
## # weights: 20 (12 variable)
## initial value 126192.989398
## iter 10 value 121098.773288
## iter 20 value 100989.038983
## final value 100988.409681
## converged
## Call:
## nnet::multinom(formula = reg.formula, data = training_data)
##
## Coefficients:
## (Intercept) so2 no2 rspm
## Residential 1.7524049 -0.6740847 0.28037890 -0.1758999
## Rural_Residential 2.3798928 -0.9180857 -0.08000808 -0.1684136
## Sensitive 0.7756912 -1.3932132 0.36218372 -0.2457527
##
## Std. Errors:
## (Intercept) so2 no2 rspm
## Residential 0.0615881 0.01391856 0.01782615 0.01317751
## Rural_Residential 0.0730101 0.01911033 0.02290091 0.01673334
## Sensitive 0.1171593 0.03721592 0.04119710 0.02991015
##
## Residual Deviance: 201976.8
## AIC: 202000.8
# Predictions using the test data
mult_reg_test<-ind_poll_model.train%>%
predict(test_data)
# Visualization of predicted probabilities using a Gain Curve Plot
require(WVPlots)
## Loading required package: WVPlots
mult_reg_test_prob<-test_data%>%
dplyr::mutate(predicted=mult_reg_test)
WVPlots::GainCurvePlot(mult_reg_test_prob,"predicted","type","Gain Curve of pollutant model")
# Accuracy of the model
accuracy_model_all<-mean(mult_reg_test == test_data$type)
accuracy_model_all
## [1] 0.4904001
The plot and the accuracy value show that the model classifies the types almost at random, which limits its reliability.
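For context, a naive classifier that always predicts the most frequent zone in the test data would reach roughly the same accuracy; a small sketch of that baseline comparison:
# Accuracy of a majority-class predictor on the test data
baseline_accuracy <- max(prop.table(table(test_data$type)))
baseline_accuracy  # roughly 0.45 (Residential), close to the ~0.49 achieved by the multinomial model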
2. Selecting equal sample sizes based on the category with the fewest observations (here, the Sensitive zone), using the training dataset
# sampling
training_eq_samp<-training_data%>%
group_by(type)%>%
sample_n(table(training_data$type)['Sensitive'])
# Using the training dataset for making the model (Code similar to the one used above for the complete data). Same formula 'reg.formula' is used.
library(nnet)
ind_poll_train2 <- nnet::multinom(reg.formula,
data = training_eq_samp)
## # weights: 20 (12 variable)
## initial value 17445.128240
## iter 10 value 16941.645493
## final value 16790.913779
## converged
## Call:
## nnet::multinom(formula = reg.formula, data = training_eq_samp)
##
## Coefficients:
## (Intercept) so2 no2 rspm
## Residential 1.640562 -0.7698662 0.35342941 -0.2051672
## Rural_Residential 3.205044 -0.9537267 0.02679519 -0.2396730
## Sensitive 3.546419 -1.4579364 0.41073270 -0.3463862
##
## Std. Errors:
## (Intercept) so2 no2 rspm
## Residential 0.2035322 0.04688569 0.06058598 0.04351234
## Rural_Residential 0.1984365 0.04851360 0.06076798 0.04357841
## Sensitive 0.1994551 0.05080364 0.06153884 0.04408358
##
## Residual Deviance: 33581.83
## AIC: 33605.83
# Predictions using the test data
mult_reg_test2<-ind_poll_train2%>%
predict(test_data)
# Accuracy of the model
accuracy_model_equal<-mean(mult_reg_test2 == test_data$type)
accuracy_model_equal
## [1] 0.3039655
3. Using k-fold cross-validation
require(caret)
training_kfold<-train(
form = reg.formula,
data = training_data,
trControl = trainControl(method = "cv",
number = 10),
method = "multinom")## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104731.995171
## iter 20 value 90872.697144
## final value 90872.157832
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104732.804581
## iter 20 value 90879.526062
## final value 90873.444704
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104731.995973
## iter 20 value 90872.699207
## final value 90872.159121
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105095.427915
## iter 20 value 90917.010987
## final value 90916.929529
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105095.271342
## iter 20 value 90918.218096
## final value 90918.170389
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105095.427740
## iter 20 value 90917.012187
## final value 90916.930772
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105329.015585
## iter 20 value 90852.076131
## final value 90851.461754
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105328.868366
## iter 20 value 90853.023843
## final value 90852.783352
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105329.015384
## iter 20 value 90852.077695
## final value 90851.463078
## converged
## # weights: 20 (12 variable)
## initial value 113574.938123
## iter 10 value 105648.367681
## iter 20 value 90886.518813
## final value 90885.951591
## converged
## # weights: 20 (12 variable)
## initial value 113574.938123
## iter 10 value 105594.377641
## iter 20 value 90887.284027
## final value 90887.253137
## converged
## # weights: 20 (12 variable)
## initial value 113574.938123
## iter 10 value 105648.402797
## iter 20 value 90886.520086
## final value 90885.952895
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104990.863459
## iter 20 value 90863.279225
## final value 90862.959185
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104991.416779
## iter 20 value 90864.635941
## final value 90864.288406
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104990.863994
## iter 20 value 90863.280572
## final value 90862.960516
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104722.669159
## iter 20 value 90889.962034
## final value 90889.720703
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104725.392560
## iter 20 value 90891.717637
## final value 90891.012357
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104722.671885
## iter 20 value 90889.963276
## final value 90889.721996
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104706.452369
## iter 20 value 90926.474598
## final value 90926.092528
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104709.192092
## iter 20 value 90927.711018
## final value 90927.334493
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 104706.455111
## iter 20 value 90926.475838
## final value 90926.093772
## converged
## # weights: 20 (12 variable)
## initial value 113576.324418
## iter 10 value 105219.574546
## iter 20 value 90896.773252
## final value 90896.607918
## converged
## # weights: 20 (12 variable)
## initial value 113576.324418
## iter 10 value 105231.033734
## iter 20 value 90898.075054
## final value 90897.897504
## converged
## # weights: 20 (12 variable)
## initial value 113576.324418
## iter 10 value 105219.584544
## iter 20 value 90896.774533
## final value 90896.609210
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105694.301662
## iter 20 value 90874.844385
## final value 90874.577994
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105688.007050
## iter 20 value 90876.218515
## final value 90875.906692
## converged
## # weights: 20 (12 variable)
## initial value 113573.551829
## iter 10 value 105694.295327
## iter 20 value 90874.845758
## final value 90874.579325
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104855.965512
## iter 20 value 90915.069531
## final value 90915.036431
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104855.711764
## iter 20 value 90916.332771
## final value 90916.298268
## converged
## # weights: 20 (12 variable)
## initial value 113572.165535
## iter 10 value 104855.965248
## iter 20 value 90915.070797
## final value 90915.037695
## converged
## # weights: 20 (12 variable)
## initial value 126192.989398
## iter 10 value 121095.951571
## iter 20 value 100990.195053
## final value 100989.698840
## converged
## Penalized Multinomial Regression
##
## 91029 samples
## 3 predictor
## 4 classes: 'Industrial', 'Residential', 'Rural_Residential', 'Sensitive'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 81926, 81926, 81926, 81927, 81925, 81926, ...
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.4951718 0.1276838
## 1e-04 0.4951718 0.1276838
## 1e-01 0.4951938 0.1277121
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
##
## Coefficients:
## (Intercept) so2 no2 rspm
## Residential 1.7493945 -0.6739294 0.2806017 -0.1754878
## Rural_Residential 2.3758973 -0.9178653 -0.0797057 -0.1678638
## Sensitive 0.7710164 -1.3926222 0.3622813 -0.2450777
##
## Residual Deviance: 201979.4
## AIC: 202003.4
# Accuracy of the model using test data
accuracy_model_kfold<-mean(test_data$type == predict(training_kfold$finalModel,
newdata = test_data))
accuracy_model_kfold
## [1] 0.4903745
# Calculate the confusion matrix for the test set
class_pred_kfold<-predict(object = training_kfold$finalModel,
newdata = test_data,
type = "class")
confusion_mat_pred_kfold <- caret::confusionMatrix(data = class_pred_kfold , # predicted classes
reference = test_data$type) # actual classes
print(confusion_mat_pred_kfold)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Industrial Residential Rural_Residential Sensitive
## Industrial 4990 3398 1341 117
## Residential 8602 14037 5117 1231
## Rural_Residential 76 0 102 0
## Sensitive 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4903
## 95% CI : (0.4854, 0.4953)
## No Information Rate : 0.4469
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1193
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Industrial Class: Residential
## Sensitivity 0.3651 0.8051
## Specificity 0.8084 0.3071
## Pos Pred Value 0.5068 0.4843
## Neg Pred Value 0.7025 0.6610
## Prevalence 0.3504 0.4469
## Detection Rate 0.1279 0.3598
## Detection Prevalence 0.2524 0.7430
## Balanced Accuracy 0.5867 0.5561
## Class: Rural_Residential Class: Sensitive
## Sensitivity 0.015549 0.00000
## Specificity 0.997658 1.00000
## Pos Pred Value 0.573034 NaN
## Neg Pred Value 0.833698 0.96545
## Prevalence 0.168158 0.03455
## Detection Rate 0.002615 0.00000
## Detection Prevalence 0.004563 0.00000
## Balanced Accuracy 0.506603 0.50000
4. Random forest approach
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
model_train <- randomForest(formula = reg.formula,
data = training_data)
# Print the model output
print(model_train)
##
## Call:
## randomForest(formula = reg.formula, data = training_data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 40.41%
## Confusion matrix:
## Industrial Residential Rural_Residential Sensitive
## Industrial 15966 13432 2343 152
## Residential 8628 30629 1124 302
## Rural_Residential 3492 4804 6942 69
## Sensitive 607 1688 147 704
## class.error
## Industrial 0.4993886
## Residential 0.2471303
## Rural_Residential 0.5464820
## Sensitive 0.7762238
# predicted classes using the model
class_prediction2 <- predict(object = model_train,
newdata = test_data,
type = "class")
# Calculate the confusion matrix for the test set
confusion_mat_pred <- caret::confusionMatrix(data = class_prediction2 , # predicted classes
reference = test_data$type) # actual classes
print(confusion_mat_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Industrial Residential Rural_Residential Sensitive
## Industrial 6828 3712 1467 254
## Residential 5763 13094 2072 712
## Rural_Residential 997 491 2983 61
## Sensitive 80 138 38 321
##
## Overall Statistics
##
## Accuracy : 0.5954
## 95% CI : (0.5905, 0.6002)
## No Information Rate : 0.4469
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3494
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: Industrial Class: Residential
## Sensitivity 0.4996 0.7510
## Specificity 0.7856 0.6039
## Pos Pred Value 0.5569 0.6051
## Neg Pred Value 0.7443 0.7501
## Prevalence 0.3504 0.4469
## Detection Rate 0.1750 0.3356
## Detection Prevalence 0.3143 0.5547
## Balanced Accuracy 0.6426 0.6774
## Class: Rural_Residential Class: Sensitive
## Sensitivity 0.45473 0.238131
## Specificity 0.95227 0.993203
## Pos Pred Value 0.65821 0.556326
## Neg Pred Value 0.89626 0.973279
## Prevalence 0.16816 0.034554
## Detection Rate 0.07647 0.008228
## Detection Prevalence 0.11617 0.014791
## Balanced Accuracy 0.70350 0.615667
From the results it can be seen that the accuracy of the random forest model is about 60% and that the highest classification error occurs for the Sensitive zone category. A Kappa value of about 0.35 indicates only fair agreement between the model's categorization and the actual observed classes. The results indicate that if the given pollutant data alone were used to define such categories, the classification would not be very efficient. Efficiency could be increased by 1) incorporating more environmental variables for the same sampled sites, and/or 2) carrying out a longer temporal survey, thereby generating more data, and/or 3) incorporating more study sites.