Also assign meaningful column names. Study the distribution of different fields.
setwd("D:\\Survey Problem\\Working")
survey_data <- read.csv("20221013 NF SSP Program Endline Assessment FINAL Cleaned Database V1.0.csv", stringsAsFactors = TRUE)
colnames(survey_data) <- c("WAE ID","District", "Block", "Village","Education","Age","HH_Members","HH_Children","Livelihood","Income","Studied_Till","Land_Ownership","Land_Acres","Family_Land_Acres","Caste","Business","Business_Subsector","Business_Owner","Business_Owner_Relation","Business_Tenure","Seasonality","Average_Revenue","Digital_Score","Digital_Score_Category")
summary (survey_data)
## WAE ID District Block Village
## NF-WAE-O144: 2 Latur :101 Ausa :49 Dawanhipparga: 10
## NF-WAE-T218: 2 Osmanabad: 90 Deoni :49 Mogarga : 8
## NF-WAE-A001: 1 Osmanabad:59 Waruda : 8
## NF-WAE-A035: 1 Tuljapur :34 Mahadevwadi : 6
## NF-WAE-A041: 1 Nagtirthawadi: 5
## NF-WAE-A042: 1 Jawali : 4
## (Other) :183 (Other) :150
## Education Age HH_Members HH_Children
## 12th Grade:65 Min. :20.00 Min. : 2.000 Min. :0.000
## 10th Grade:57 1st Qu.:32.00 1st Qu.: 4.000 1st Qu.:1.000
## 9th Grade :28 Median :36.00 Median : 5.000 Median :2.000
## Graduate :15 Mean :35.81 Mean : 5.209 Mean :1.775
## 11th Grade:10 3rd Qu.:40.00 3rd Qu.: 6.000 3rd Qu.:2.000
## 8th Grade : 7 Max. :50.00 Max. :19.000 Max. :7.000
## (Other) : 9
## Livelihood Income Studied_Till Land_Ownership
## Agriculture :173 1 Lakh to 4 Lakhs :107 12th Grade:65 NO :163
## Business : 14 4 Lakh to 7 Lakhs : 2 10th Grade:57 YES : 28
## Livestock : 1 Less than 1 lakh : 82 9th Grade :28
## Salary : 3 Graduate :15
## 11th Grade:10
## 8th Grade : 7
## (Other) : 9
## Land_Acres Family_Land_Acres Caste
## Min. :0.000 Min. : 0.000 General :150
## 1st Qu.:0.000 1st Qu.: 2.000 Other Backward Class OBC OBC: 26
## Median :0.000 Median : 3.000 Scheduled Caste SC SC : 10
## Mean :0.389 Mean : 3.627 Scheduled Tribe ST ST : 5
## 3rd Qu.:0.000 3rd Qu.: 5.000
## Max. :8.000 Max. :22.000
##
## Business Business_Subsector
## Dairy :61 Subsector 1:78
## Vegetables :17 Subsector 2:19
## Vegetables, Dairy:16 Subsector 3:79
## Vegetables :10 Subsector 4:15
## Dairy, Flour Mill: 7
## Flour Mill : 7
## (Other) :73
## Business_Owner
## Exclusively run by her without any support from MALE family members :85
## Predominantly run by her with some support from MALE family members :80
## Run jointly by her and one or more MALE family members e.g., husband. :26
##
##
##
##
## Business_Owner_Relation Business_Tenure Seasonality
## Father-in-law : 3 Min. : 2.0 Min. : 3.000
## Husband :120 1st Qu.: 24.0 1st Qu.: 8.000
## Husband, Father-in-law : 5 Median : 36.0 Median : 9.000
## Husband, Son : 15 Mean : 42.2 Mean : 9.597
## Mother-in-Law : 3 3rd Qu.: 48.0 3rd Qu.:12.000
## None : 35 Max. :240.0 Max. :48.000
## Other : 10
## Average_Revenue Digital_Score Digital_Score_Category
## Min. : 7 Min. : 2.00 Between 10 and 15:35
## 1st Qu.: 30000 1st Qu.:15.00 Between 15 and 20:59
## Median : 60000 Median :19.00 Between 20 and 25:57
## Mean : 93175 Mean :18.63 Less than 10 :21
## 3rd Qu.: 119000 3rd Qu.:23.00 More than 25 :19
## Max. :1500000 Max. :32.00
##
Boxplots to analyze relation of Digital Score with all the categorical variables. Correlation Coefficient is used for numerical variables.
boxplot (Digital_Score~District, data = survey_data)
boxplot (Digital_Score~Block, data = survey_data)
boxplot (Digital_Score~Education, data = survey_data)
cor (survey_data$Digital_Score, survey_data$Age)
## [1] 0.0718775
cor (survey_data$Digital_Score, survey_data$HH_Members)
## [1] -0.06999911
cor (survey_data$Digital_Score, survey_data$HH_Children)
## [1] -0.1539235
boxplot (Digital_Score~Livelihood, data = survey_data)
boxplot (Digital_Score~Income, data = survey_data)
boxplot (Digital_Score~Studied_Till, data = survey_data)
boxplot (Digital_Score~Land_Ownership, data = survey_data)
cor (survey_data$Digital_Score, survey_data$Land_Acres)
## [1] 0.006381921
cor (survey_data$Digital_Score, survey_data$Family_Land_Acres)
## [1] 0.05708554
boxplot (Digital_Score~Caste, data = survey_data)
boxplot (Digital_Score~Business_Subsector, data = survey_data)
boxplot (Digital_Score~Business_Owner, data = survey_data)
boxplot (Digital_Score~Business_Owner_Relation, data = survey_data)
cor (survey_data$Digital_Score, survey_data$Business_Tenure)
## [1] -0.06534205
cor (survey_data$Digital_Score, survey_data$Seasonality)
## [1] 0.03150432
cor (survey_data$Digital_Score, survey_data$Average_Revenue)
## [1] 0.0140504
From the results of the boxplots and correlation coefficients it appears only these variables have weak relation with Digital Score - Block, Livelihood, HH_Children, Income, Business_Subsector, Caste, Business_Subsector, Business_Owner_Relation. Others do not have any visible relation.
Next we build regression models with Digital Score as target variable and combination of the above factors. For each regression model we analyze how much the model is able to explain the variations in the target variable using ANOVA.
survey_model <- lm(Digital_Score ~ factor(District) + factor (Livelihood) + HH_Children + factor(Income) + factor (Business_Subsector) + factor (Caste) + factor (Business_Subsector) + factor (Business_Owner_Relation), data = survey_data)
anova(survey_model)
## Analysis of Variance Table
##
## Response: Digital_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(District) 1 369.1 369.14 12.6608 0.0004834 ***
## factor(Livelihood) 3 8.2 2.74 0.0940 0.9632724
## HH_Children 1 159.2 159.17 5.4591 0.0206263 *
## factor(Income) 2 217.0 108.51 3.7216 0.0261778 *
## factor(Business_Subsector) 3 176.8 58.93 2.0213 0.1127732
## factor(Caste) 3 311.6 103.87 3.5626 0.0154742 *
## factor(Business_Owner_Relation) 6 182.9 30.49 1.0457 0.3975780
## Residuals 171 4985.7 29.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
survey_model <- lm(Digital_Score ~ factor(District) , data = survey_data)
anova(survey_model)
## Analysis of Variance Table
##
## Response: Digital_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(District) 1 369.1 369.14 11.548 0.000827 ***
## Residuals 189 6041.5 31.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
survey_model <- lm(Digital_Score ~ HH_Children , data = survey_data)
anova(survey_model)
## Analysis of Variance Table
##
## Response: Digital_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## HH_Children 1 151.9 151.883 4.5865 0.0335 *
## Residuals 189 6258.7 33.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
survey_model <- lm(Digital_Score ~ factor(Block) + factor (Livelihood) + HH_Children + factor(Income) + factor (Business_Subsector) + factor (Caste) + factor (Business_Subsector) + factor (Business_Owner_Relation), data = survey_data)
anova(survey_model)
## Analysis of Variance Table
##
## Response: Digital_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Block) 3 554.5 184.818 6.3114 0.00044 ***
## factor(Livelihood) 3 12.6 4.202 0.1435 0.93372
## HH_Children 1 140.7 140.685 4.8043 0.02976 *
## factor(Income) 2 203.5 101.731 3.4740 0.03322 *
## factor(Business_Subsector) 3 116.6 38.870 1.3274 0.26718
## factor(Caste) 3 277.7 92.578 3.1615 0.02609 *
## factor(Business_Owner_Relation) 6 156.2 26.032 0.8890 0.50431
## Residuals 169 4948.9 29.283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
survey_model <- lm(Digital_Score ~ factor(Block) + factor(Income) + factor (Caste) , data = survey_data)
anova(survey_model)
## Analysis of Variance Table
##
## Response: Digital_Score
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Block) 3 554.5 184.818 6.3391 0.0004121 ***
## factor(Income) 2 208.0 103.990 3.5668 0.0302383 *
## factor(Caste) 3 341.9 113.967 3.9090 0.0097683 **
## Residuals 182 5306.3 29.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All 5 models deliver very weak results. The best performance is from 4th model which has most number of variables but that also can explain only about 22% of the variations as evident from the ANOVA.
We make further attempts using classification models next, using Random Forest. Here we use the Digital Score Category as target variable.
We use a 70:30 Training:Test partitions on the datasets. We also remove the following fields from the dataframe - WAE ID (it is an id column), District (it is contained within Block), Village (too many distinct values), Business (too many combinations) and Digital_Score (since we will use the category as target now).
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
survey_data <- subset(survey_data , select = -c (1,2,4,16,23))
inTrain <- createDataPartition(y=survey_data$Digital_Score_Category, p=0.70, list=FALSE)
training <- survey_data[inTrain , ]
modelFit <- train(Digital_Score_Category ~ ., method = "rf", data = training, trControl = trainControl(method = "cv"), number = 5)
modelFit$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.2578571 -0.05518595 0.11668545 0.15407692
## 2 23 0.3038095 0.04301708 0.07700563 0.09653444
## 3 45 0.2807326 0.00908776 0.06828404 0.09545889
testing <- survey_data[-inTrain , ]
test_Digital_Score_Category <- predict (modelFit, newdata = testing )
confusionMatrix(testing$Digital_Score_Category, test_Digital_Score_Category )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Between 10 and 15 Between 15 and 20 Between 20 and 25
## Between 10 and 15 2 5 3
## Between 15 and 20 4 7 5
## Between 20 and 25 2 7 8
## Less than 10 1 3 2
## More than 25 1 4 0
## Reference
## Prediction Less than 10 More than 25
## Between 10 and 15 0 0
## Between 15 and 20 0 1
## Between 20 and 25 0 0
## Less than 10 0 0
## More than 25 0 0
##
## Overall Statistics
##
## Accuracy : 0.3091
## 95% CI : (0.1914, 0.4481)
## No Information Rate : 0.4727
## P-Value [Acc > NIR] : 0.9954
##
## Kappa : 0.0378
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Between 10 and 15 Class: Between 15 and 20
## Sensitivity 0.20000 0.2692
## Specificity 0.82222 0.6552
## Pos Pred Value 0.20000 0.4118
## Neg Pred Value 0.82222 0.5000
## Prevalence 0.18182 0.4727
## Detection Rate 0.03636 0.1273
## Detection Prevalence 0.18182 0.3091
## Balanced Accuracy 0.51111 0.4622
## Class: Between 20 and 25 Class: Less than 10
## Sensitivity 0.4444 NA
## Specificity 0.7568 0.8909
## Pos Pred Value 0.4706 NA
## Neg Pred Value 0.7368 NA
## Prevalence 0.3273 0.0000
## Detection Rate 0.1455 0.0000
## Detection Prevalence 0.3091 0.1091
## Balanced Accuracy 0.6006 NA
## Class: More than 25
## Sensitivity 0.00000
## Specificity 0.90741
## Pos Pred Value 0.00000
## Neg Pred Value 0.98000
## Prevalence 0.01818
## Detection Rate 0.00000
## Detection Prevalence 0.09091
## Balanced Accuracy 0.45370
The accuracy from Random Forest is approx.30%.