Survey Problem

Read the csv file

Also assign meaningful column names. Study the distribution of different fields.

setwd("D:\\Survey Problem\\Working")
survey_data <- read.csv("20221013 NF SSP Program Endline Assessment FINAL Cleaned Database V1.0.csv", stringsAsFactors = TRUE)
colnames(survey_data) <- c("WAE ID","District", "Block", "Village","Education","Age","HH_Members","HH_Children","Livelihood","Income","Studied_Till","Land_Ownership","Land_Acres","Family_Land_Acres","Caste","Business","Business_Subsector","Business_Owner","Business_Owner_Relation","Business_Tenure","Seasonality","Average_Revenue","Digital_Score","Digital_Score_Category")
summary (survey_data)

##          WAE ID         District         Block             Village   
##  NF-WAE-O144:  2   Latur    :101   Ausa     :49   Dawanhipparga: 10  
##  NF-WAE-T218:  2   Osmanabad: 90   Deoni    :49   Mogarga      :  8  
##  NF-WAE-A001:  1                   Osmanabad:59   Waruda       :  8  
##  NF-WAE-A035:  1                   Tuljapur :34   Mahadevwadi  :  6  
##  NF-WAE-A041:  1                                  Nagtirthawadi:  5  
##  NF-WAE-A042:  1                                  Jawali       :  4  
##  (Other)    :183                                  (Other)      :150  
##       Education       Age          HH_Members      HH_Children   
##  12th Grade:65   Min.   :20.00   Min.   : 2.000   Min.   :0.000  
##  10th Grade:57   1st Qu.:32.00   1st Qu.: 4.000   1st Qu.:1.000  
##  9th Grade :28   Median :36.00   Median : 5.000   Median :2.000  
##  Graduate  :15   Mean   :35.81   Mean   : 5.209   Mean   :1.775  
##  11th Grade:10   3rd Qu.:40.00   3rd Qu.: 6.000   3rd Qu.:2.000  
##  8th Grade : 7   Max.   :50.00   Max.   :19.000   Max.   :7.000  
##  (Other)   : 9                                                   
##         Livelihood                 Income        Studied_Till Land_Ownership
##  Agriculture :173   1 Lakh to 4 Lakhs :107   12th Grade:65    NO  :163      
##  Business    : 14   4 Lakh to 7 Lakhs :  2   10th Grade:57    YES : 28      
##  Livestock   :  1   Less than 1 lakh  : 82   9th Grade :28                  
##  Salary      :  3                            Graduate  :15                  
##                                              11th Grade:10                  
##                                              8th Grade : 7                  
##                                              (Other)   : 9                  
##    Land_Acres    Family_Land_Acres                          Caste    
##  Min.   :0.000   Min.   : 0.000    General                     :150  
##  1st Qu.:0.000   1st Qu.: 2.000    Other Backward Class OBC OBC: 26  
##  Median :0.000   Median : 3.000    Scheduled Caste SC SC       : 10  
##  Mean   :0.389   Mean   : 3.627    Scheduled Tribe ST ST       :  5  
##  3rd Qu.:0.000   3rd Qu.: 5.000                                      
##  Max.   :8.000   Max.   :22.000                                      
##                                                                      
##               Business    Business_Subsector
##  Dairy            :61   Subsector 1:78      
##  Vegetables       :17   Subsector 2:19      
##  Vegetables, Dairy:16   Subsector 3:79      
##  Vegetables       :10   Subsector 4:15      
##  Dairy, Flour Mill: 7                       
##  Flour Mill       : 7                       
##  (Other)          :73                       
##                                                                 Business_Owner
##  Exclusively run by her without any support from MALE family members   :85    
##  Predominantly run by her with some support from MALE family members   :80    
##  Run jointly by her and one or more MALE family members e.g., husband. :26    
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##             Business_Owner_Relation Business_Tenure  Seasonality    
##  Father-in-law          :  3        Min.   :  2.0   Min.   : 3.000  
##  Husband                :120        1st Qu.: 24.0   1st Qu.: 8.000  
##  Husband, Father-in-law :  5        Median : 36.0   Median : 9.000  
##  Husband, Son           : 15        Mean   : 42.2   Mean   : 9.597  
##  Mother-in-Law          :  3        3rd Qu.: 48.0   3rd Qu.:12.000  
##  None                   : 35        Max.   :240.0   Max.   :48.000  
##  Other                  : 10                                        
##  Average_Revenue   Digital_Score         Digital_Score_Category
##  Min.   :      7   Min.   : 2.00   Between 10 and 15:35        
##  1st Qu.:  30000   1st Qu.:15.00   Between 15 and 20:59        
##  Median :  60000   Median :19.00   Between 20 and 25:57        
##  Mean   :  93175   Mean   :18.63   Less than 10     :21        
##  3rd Qu.: 119000   3rd Qu.:23.00   More than 25     :19        
##  Max.   :1500000   Max.   :32.00                               
##

Exploratory Data Analysis

Boxplots to analyze relation of Digital Score with all the categorical variables. Correlation Coefficient is used for numerical variables.

boxplot (Digital_Score~District, data = survey_data)

boxplot (Digital_Score~Block, data = survey_data)

boxplot (Digital_Score~Education, data = survey_data)

cor (survey_data$Digital_Score, survey_data$Age)

## [1] 0.0718775

cor (survey_data$Digital_Score, survey_data$HH_Members)

## [1] -0.06999911

cor (survey_data$Digital_Score, survey_data$HH_Children)

## [1] -0.1539235

boxplot (Digital_Score~Livelihood, data = survey_data)

boxplot (Digital_Score~Income, data = survey_data)

boxplot (Digital_Score~Studied_Till, data = survey_data)

boxplot (Digital_Score~Land_Ownership, data = survey_data)

cor (survey_data$Digital_Score, survey_data$Land_Acres)

## [1] 0.006381921

cor (survey_data$Digital_Score, survey_data$Family_Land_Acres)

## [1] 0.05708554

boxplot (Digital_Score~Caste, data = survey_data)

boxplot (Digital_Score~Business_Subsector, data = survey_data)

boxplot (Digital_Score~Business_Owner, data = survey_data)

boxplot (Digital_Score~Business_Owner_Relation, data = survey_data)

cor (survey_data$Digital_Score, survey_data$Business_Tenure)

## [1] -0.06534205

cor (survey_data$Digital_Score, survey_data$Seasonality)

## [1] 0.03150432

cor (survey_data$Digital_Score, survey_data$Average_Revenue)

## [1] 0.0140504

Analysing results and building regression models

From the results of the boxplots and correlation coefficients it appears only these variables have weak relation with Digital Score - Block, Livelihood, HH_Children, Income, Business_Subsector, Caste, Business_Subsector, Business_Owner_Relation. Others do not have any visible relation.

Next we build regression models with Digital Score as target variable and combination of the above factors. For each regression model we analyze how much the model is able to explain the variations in the target variable using ANOVA.

survey_model <- lm(Digital_Score ~ factor(District) + factor (Livelihood) + HH_Children + factor(Income) + factor (Business_Subsector) + factor (Caste) + factor (Business_Subsector) + factor (Business_Owner_Relation), data = survey_data)
anova(survey_model)

## Analysis of Variance Table
## 
## Response: Digital_Score
##                                  Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(District)                  1  369.1  369.14 12.6608 0.0004834 ***
## factor(Livelihood)                3    8.2    2.74  0.0940 0.9632724    
## HH_Children                       1  159.2  159.17  5.4591 0.0206263 *  
## factor(Income)                    2  217.0  108.51  3.7216 0.0261778 *  
## factor(Business_Subsector)        3  176.8   58.93  2.0213 0.1127732    
## factor(Caste)                     3  311.6  103.87  3.5626 0.0154742 *  
## factor(Business_Owner_Relation)   6  182.9   30.49  1.0457 0.3975780    
## Residuals                       171 4985.7   29.16                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

survey_model <- lm(Digital_Score ~ factor(District) , data = survey_data)
anova(survey_model)

## Analysis of Variance Table
## 
## Response: Digital_Score
##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(District)   1  369.1  369.14  11.548 0.000827 ***
## Residuals        189 6041.5   31.97                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

survey_model <- lm(Digital_Score ~ HH_Children  , data = survey_data)
anova(survey_model)

## Analysis of Variance Table
## 
## Response: Digital_Score
##              Df Sum Sq Mean Sq F value Pr(>F)  
## HH_Children   1  151.9 151.883  4.5865 0.0335 *
## Residuals   189 6258.7  33.115                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

survey_model <- lm(Digital_Score ~ factor(Block) + factor (Livelihood) + HH_Children + factor(Income) + factor (Business_Subsector) + factor (Caste) + factor (Business_Subsector) + factor (Business_Owner_Relation), data = survey_data)
anova(survey_model)

## Analysis of Variance Table
## 
## Response: Digital_Score
##                                  Df Sum Sq Mean Sq F value  Pr(>F)    
## factor(Block)                     3  554.5 184.818  6.3114 0.00044 ***
## factor(Livelihood)                3   12.6   4.202  0.1435 0.93372    
## HH_Children                       1  140.7 140.685  4.8043 0.02976 *  
## factor(Income)                    2  203.5 101.731  3.4740 0.03322 *  
## factor(Business_Subsector)        3  116.6  38.870  1.3274 0.26718    
## factor(Caste)                     3  277.7  92.578  3.1615 0.02609 *  
## factor(Business_Owner_Relation)   6  156.2  26.032  0.8890 0.50431    
## Residuals                       169 4948.9  29.283                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

survey_model <- lm(Digital_Score ~ factor(Block) + factor(Income) + factor (Caste) , data = survey_data)
anova(survey_model)

## Analysis of Variance Table
## 
## Response: Digital_Score
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## factor(Block)    3  554.5 184.818  6.3391 0.0004121 ***
## factor(Income)   2  208.0 103.990  3.5668 0.0302383 *  
## factor(Caste)    3  341.9 113.967  3.9090 0.0097683 ** 
## Residuals      182 5306.3  29.155                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysing results of regression models and next steps

All 5 models deliver very weak results. The best performance is from 4th model which has most number of variables but that also can explain only about 22% of the variations as evident from the ANOVA.

We make further attempts using classification models next, using Random Forest. Here we use the Digital Score Category as target variable.

We use a 70:30 Training:Test partitions on the datasets. We also remove the following fields from the dataframe - WAE ID (it is an id column), District (it is contained within Block), Village (too many distinct values), Business (too many combinations) and Digital_Score (since we will use the category as target now).

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

survey_data  <- subset(survey_data , select = -c (1,2,4,16,23))
inTrain <- createDataPartition(y=survey_data$Digital_Score_Category, p=0.70, list=FALSE)
training <- survey_data[inTrain , ]
modelFit <- train(Digital_Score_Category ~ ., method = "rf", data = training, trControl = trainControl(method = "cv"), number = 5)
modelFit$results

##   mtry  Accuracy       Kappa AccuracySD    KappaSD
## 1    2 0.2578571 -0.05518595 0.11668545 0.15407692
## 2   23 0.3038095  0.04301708 0.07700563 0.09653444
## 3   45 0.2807326  0.00908776 0.06828404 0.09545889

testing <- survey_data[-inTrain , ]
test_Digital_Score_Category <- predict (modelFit, newdata = testing )
confusionMatrix(testing$Digital_Score_Category, test_Digital_Score_Category )

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Between 10 and 15 Between 15 and 20 Between 20 and 25
##   Between 10 and 15                 2                 5                 3
##   Between 15 and 20                 4                 7                 5
##   Between 20 and 25                 2                 7                 8
##   Less than 10                      1                 3                 2
##   More than 25                      1                 4                 0
##                    Reference
## Prediction          Less than 10 More than 25
##   Between 10 and 15            0            0
##   Between 15 and 20            0            1
##   Between 20 and 25            0            0
##   Less than 10                 0            0
##   More than 25                 0            0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3091          
##                  95% CI : (0.1914, 0.4481)
##     No Information Rate : 0.4727          
##     P-Value [Acc > NIR] : 0.9954          
##                                           
##                   Kappa : 0.0378          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Between 10 and 15 Class: Between 15 and 20
## Sensitivity                           0.20000                   0.2692
## Specificity                           0.82222                   0.6552
## Pos Pred Value                        0.20000                   0.4118
## Neg Pred Value                        0.82222                   0.5000
## Prevalence                            0.18182                   0.4727
## Detection Rate                        0.03636                   0.1273
## Detection Prevalence                  0.18182                   0.3091
## Balanced Accuracy                     0.51111                   0.4622
##                      Class: Between 20 and 25 Class: Less than 10
## Sensitivity                            0.4444                  NA
## Specificity                            0.7568              0.8909
## Pos Pred Value                         0.4706                  NA
## Neg Pred Value                         0.7368                  NA
## Prevalence                             0.3273              0.0000
## Detection Rate                         0.1455              0.0000
## Detection Prevalence                   0.3091              0.1091
## Balanced Accuracy                      0.6006                  NA
##                      Class: More than 25
## Sensitivity                      0.00000
## Specificity                      0.90741
## Pos Pred Value                   0.00000
## Neg Pred Value                   0.98000
## Prevalence                       0.01818
## Detection Rate                   0.00000
## Detection Prevalence             0.09091
## Balanced Accuracy                0.45370

The accuracy from Random Forest is approx.30%.