This project analyzes the impact of COVID-19 on education using statistical and data science techniques. The goal is to understand dropout patterns and identify key influencing factors.
data <- read.csv("C:/Users/micha/OneDrive/Desktop/open_one_time_covid_education_impact.csv")
str(data)
## 'data.frame': 4436 obs. of 27 variables:
## $ submission_id : num 4.57e+15 6.44e+15 5.00e+15 5.52e+15 5.03e+15 ...
## $ submission_date : chr "2021-03-17" "2021-03-29" "2021-03-18" "2021-03-24" ...
## $ gender : chr "Female" "Male" "Female" "Male" ...
## $ age : chr "Over 45 years old" "26 to 35 years old" "26 to 35 years old" "36 to 45 years old" ...
## $ geography : chr "Suburban/Peri-urban" "Suburban/Peri-urban" "City center or metropolitan area" "Suburban/Peri-urban" ...
## $ financial_situation : chr "I can afford food and regular expenses, but nothing else" "I cannot afford enough food for my family" "I can comfortably afford food, clothes, and furniture, and I have savings" "I can afford food, but nothing else" ...
## $ education : chr "University or college degree completed" "University or college degree completed" "University or college degree completed" "University or college degree completed" ...
## $ employment_status : chr "I am unemployed" "I am unemployed" "I work full-time, either as an employee or self-employed" "I work full-time, either as an employee or self-employed" ...
## $ submission_state : chr "Miranda" "Miranda" "Miranda" "Miranda" ...
## $ are_there_children_0_to_2_yrs_out_of_educational_system : int 0 0 1 0 0 0 0 0 0 1 ...
## $ were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school : int 1 1 1 0 1 0 1 0 0 1 ...
## $ are_there_children_who_stopped_enrolling_in_primary_education : int 1 0 1 0 0 1 0 0 0 0 ...
## $ are_there_children_who_stopped_enrolling_in_secondary_education : int 0 0 1 0 0 1 0 0 0 0 ...
## $ are_children_attending_face_to_face_classes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ can_children_observe_deterioration_of_basic_services_of_school : int 1 1 1 1 1 0 1 1 1 1 ...
## $ do_children_3_and_17_yrs_receive_regular_school_meals : chr "Every day" "No" "No" "No" ...
## $ are_there_teachers_at_scheduled_class_hours : chr "Irregularly" "Irregularly" "There are not enough" "There are enough" ...
## $ are_children_3_to_17_yrs_dealing_with_irregular_school_activity : int 0 1 1 1 1 0 1 1 0 0 ...
## $ are_children_being_teached_by_unqualified_people : int 0 0 1 1 0 1 0 0 1 0 ...
## $ did_teachers_leave_the_educational_system : int 0 1 1 1 1 1 0 1 1 0 ...
## $ do_school_and_the_teachers_have_internet_connection : int 1 0 0 0 0 1 1 0 1 1 ...
## $ do_children_have_internet_connection : int 1 1 1 1 1 0 1 0 0 1 ...
## $ do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity : int 0 1 0 0 1 0 0 1 1 0 ...
## $ does_home_shows_severe_deficit_of_electricity : int 0 0 1 0 0 0 0 0 0 1 ...
## $ does_home_shows_severe_deficit_of_internet : int 0 0 0 0 0 0 0 0 0 0 ...
## $ do_children_3_to_17_yrs_miss_class_or_in_lower_grade : int 0 0 0 0 0 0 0 0 0 0 ...
## $ are_children_promoted_with_a_modality_different_from_formal_evaluation: int 0 0 1 0 1 1 0 0 1 0 ...
summary(data)
## submission_id submission_date gender age
## Min. :4.504e+15 Length:4436 Length:4436 Length:4436
## 1st Qu.:5.077e+15 Class :character Class :character Class :character
## Median :5.642e+15 Mode :character Mode :character Mode :character
## Mean :5.633e+15
## 3rd Qu.:6.188e+15
## Max. :6.755e+15
## geography financial_situation education employment_status
## Length:4436 Length:4436 Length:4436 Length:4436
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## submission_state are_there_children_0_to_2_yrs_out_of_educational_system
## Length:4436 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2949
## 3rd Qu.:1.0000
## Max. :1.0000
## were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6132
## 3rd Qu.:1.0000
## Max. :1.0000
## are_there_children_who_stopped_enrolling_in_primary_education
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2065
## 3rd Qu.:0.0000
## Max. :1.0000
## are_there_children_who_stopped_enrolling_in_secondary_education
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1943
## 3rd Qu.:0.0000
## Max. :1.0000
## are_children_attending_face_to_face_classes
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1637
## 3rd Qu.:0.0000
## Max. :1.0000
## can_children_observe_deterioration_of_basic_services_of_school
## Min. :0.0000
## 1st Qu.:1.0000
## Median :1.0000
## Mean :0.8005
## 3rd Qu.:1.0000
## Max. :1.0000
## do_children_3_and_17_yrs_receive_regular_school_meals
## Length:4436
## Class :character
## Mode :character
##
##
##
## are_there_teachers_at_scheduled_class_hours
## Length:4436
## Class :character
## Mode :character
##
##
##
## are_children_3_to_17_yrs_dealing_with_irregular_school_activity
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6431
## 3rd Qu.:1.0000
## Max. :1.0000
## are_children_being_teached_by_unqualified_people
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3165
## 3rd Qu.:1.0000
## Max. :1.0000
## did_teachers_leave_the_educational_system
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6643
## 3rd Qu.:1.0000
## Max. :1.0000
## do_school_and_the_teachers_have_internet_connection
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5604
## 3rd Qu.:1.0000
## Max. :1.0000
## do_children_have_internet_connection
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6285
## 3rd Qu.:1.0000
## Max. :1.0000
## do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6655
## 3rd Qu.:1.0000
## Max. :1.0000
## does_home_shows_severe_deficit_of_electricity
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2845
## 3rd Qu.:1.0000
## Max. :1.0000
## does_home_shows_severe_deficit_of_internet
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5791
## 3rd Qu.:1.0000
## Max. :1.0000
## do_children_3_to_17_yrs_miss_class_or_in_lower_grade
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2464
## 3rd Qu.:0.0000
## Max. :1.0000
## are_children_promoted_with_a_modality_different_from_formal_evaluation
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4272
## 3rd Qu.:1.0000
## Max. :1.0000
colnames(data)
## [1] "submission_id"
## [2] "submission_date"
## [3] "gender"
## [4] "age"
## [5] "geography"
## [6] "financial_situation"
## [7] "education"
## [8] "employment_status"
## [9] "submission_state"
## [10] "are_there_children_0_to_2_yrs_out_of_educational_system"
## [11] "were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school"
## [12] "are_there_children_who_stopped_enrolling_in_primary_education"
## [13] "are_there_children_who_stopped_enrolling_in_secondary_education"
## [14] "are_children_attending_face_to_face_classes"
## [15] "can_children_observe_deterioration_of_basic_services_of_school"
## [16] "do_children_3_and_17_yrs_receive_regular_school_meals"
## [17] "are_there_teachers_at_scheduled_class_hours"
## [18] "are_children_3_to_17_yrs_dealing_with_irregular_school_activity"
## [19] "are_children_being_teached_by_unqualified_people"
## [20] "did_teachers_leave_the_educational_system"
## [21] "do_school_and_the_teachers_have_internet_connection"
## [22] "do_children_have_internet_connection"
## [23] "do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity"
## [24] "does_home_shows_severe_deficit_of_electricity"
## [25] "does_home_shows_severe_deficit_of_internet"
## [26] "do_children_3_to_17_yrs_miss_class_or_in_lower_grade"
## [27] "are_children_promoted_with_a_modality_different_from_formal_evaluation"
dim(data)
## [1] 4436 27
sum(is.na(data))
## [1] 0
colSums(is.na(data))
## submission_id
## 0
## submission_date
## 0
## gender
## 0
## age
## 0
## geography
## 0
## financial_situation
## 0
## education
## 0
## employment_status
## 0
## submission_state
## 0
## are_there_children_0_to_2_yrs_out_of_educational_system
## 0
## were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school
## 0
## are_there_children_who_stopped_enrolling_in_primary_education
## 0
## are_there_children_who_stopped_enrolling_in_secondary_education
## 0
## are_children_attending_face_to_face_classes
## 0
## can_children_observe_deterioration_of_basic_services_of_school
## 0
## do_children_3_and_17_yrs_receive_regular_school_meals
## 0
## are_there_teachers_at_scheduled_class_hours
## 0
## are_children_3_to_17_yrs_dealing_with_irregular_school_activity
## 0
## are_children_being_teached_by_unqualified_people
## 0
## did_teachers_leave_the_educational_system
## 0
## do_school_and_the_teachers_have_internet_connection
## 0
## do_children_have_internet_connection
## 0
## do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity
## 0
## does_home_shows_severe_deficit_of_electricity
## 0
## does_home_shows_severe_deficit_of_internet
## 0
## do_children_3_to_17_yrs_miss_class_or_in_lower_grade
## 0
## are_children_promoted_with_a_modality_different_from_formal_evaluation
## 0
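Insight: The dataset has 4,436 responses and 27 variables, mixing categorical (character) and binary (0/1) columns. No cells are NA, but a few answers are coded as the string "Not Available", so missing responses still need handling during preprocessing.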
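Because is.na() does not detect sentinel strings, a separate count is useful. The sketch below uses a small made-up data frame (toy and its values are illustrative, not the survey data) to show one way of counting "Not Available" in every character column.

```r
# Toy data frame standing in for the survey (values are illustrative only)
toy <- data.frame(
  geography = c("Rural", "Not Available", "Rural"),
  financial_situation = c("Not Available", "Prefer not to answer", "Not Available"),
  internet = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because the missing answers are ordinary strings
sum(is.na(toy))

# Count the sentinel string in every character column
is_chr <- vapply(toy, is.character, logical(1))
sentinel_counts <- vapply(
  toy[is_chr],
  function(col) sum(col == "Not Available"),
  integer(1)
)
sentinel_counts
```

Applied to the real data, the same pattern would flag the single "Not Available" rows that appear later in the geography and financial_situation tables.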
data$gender <- as.factor(data$gender)
data$age <- as.factor(data$age)
data$geography <- as.factor(data$geography)
data$education <- as.factor(data$education)
data$employment_status <- as.factor(data$employment_status)
Insight: Categorical variables are converted into factors for proper statistical modeling.
mean(data$do_children_have_internet_connection) * 100
## [1] 62.84941
table(data$do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity)
##
## 0 1
## 1484 2952
table(data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0 1
## 1716 2720
mean(data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school) * 100
## [1] 61.3165
table(data$are_children_attending_face_to_face_classes)
##
## 0 1
## 3710 726
table(data$age, data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0 1
## 16 to 25 years old 549 753
## 26 to 35 years old 464 840
## 36 to 45 years old 413 674
## Not Available 1 2
## Over 45 years old 289 450
## Under 16 0 1
table(data$financial_situation)
##
## I can afford food and regular expenses, but nothing else
## 1060
## I can afford food, but nothing else
## 1445
## I can afford food, regular expenses, and clothes, but nothing else
## 244
## I can comfortably afford food, clothes, and furniture, and I have savings
## 157
## I can comfortably afford food, clothes, and furniture, but I don’t have savings
## 127
## I cannot afford enough food for my family
## 1163
## Not Available
## 1
## Prefer not to answer
## 239
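Insight: About 62.8% of respondents report that their children have an internet connection, 61.3% report enrolled children aged 3 to 17 who did not return to school, and only 726 of 4,436 (16.4%) report children attending face-to-face classes. Most respondents say they can afford, at most, food and regular expenses.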
table(data$do_children_have_internet_connection,
data$do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity)
##
## 0 1
## 0 485 1163
## 1 999 1789
table(data$does_home_shows_severe_deficit_of_electricity,
data$do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity)
##
## 0 1
## 0 1270 1904
## 1 214 1048
table(data$financial_situation,
data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0
## I can afford food and regular expenses, but nothing else 442
## I can afford food, but nothing else 530
## I can afford food, regular expenses, and clothes, but nothing else 89
## I can comfortably afford food, clothes, and furniture, and I have savings 56
## I can comfortably afford food, clothes, and furniture, but I don’t have savings 54
## I cannot afford enough food for my family 434
## Not Available 1
## Prefer not to answer 110
##
## 1
## I can afford food and regular expenses, but nothing else 618
## I can afford food, but nothing else 915
## I can afford food, regular expenses, and clothes, but nothing else 155
## I can comfortably afford food, clothes, and furniture, and I have savings 101
## I can comfortably afford food, clothes, and furniture, but I don’t have savings 73
## I cannot afford enough food for my family 729
## Not Available 0
## Prefer not to answer 129
table(data$does_home_shows_severe_deficit_of_internet,
data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0 1
## 0 828 1039
## 1 888 1681
table(data$geography,
data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0 1
## City center or metropolitan area 748 1172
## Not Available 1 0
## Rural 406 735
## Suburban/Peri-urban 561 813
table(data$are_children_3_to_17_yrs_dealing_with_irregular_school_activity,
data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
##
## 0 1
## 0 748 835
## 1 968 1885
table(data$did_teachers_leave_the_educational_system,
data$are_children_3_to_17_yrs_dealing_with_irregular_school_activity)
##
## 0 1
## 0 934 555
## 1 649 2298
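Insight: Non-return is more common where homes report a severe internet deficit (1,681 of 2,569, about 65%) than where they do not (1,039 of 1,867, about 56%), and irregular school activity is far more common where teachers have left the system (2,298 of 2,947, about 78%, versus 555 of 1,489, about 37%). Differences across financial categories are smaller and do not follow a clear gradient. These are descriptive associations, not causal effects.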
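To check whether a cross-tab difference is larger than chance would produce, row proportions and a chi-squared test can be computed directly from the printed counts. This is a sketch using the internet-deficit table above; the object names tab, row_share and chi are introduced here for illustration.

```r
# Counts copied from the cross-tab of severe home internet deficit (rows)
# against non-return to school (columns); matrix() fills column by column
tab <- matrix(
  c(828, 888, 1039, 1681),
  nrow = 2,
  dimnames = list(internet_deficit = c("0", "1"),
                  did_not_return = c("0", "1"))
)

# Share of each row that did / did not return
row_share <- prop.table(tab, margin = 1)
round(row_share, 3)

# Pearson chi-squared test of independence (Yates correction by default for 2x2)
chi <- chisq.test(tab)
chi$p.value
```

A small p-value here only says the two variables are associated in this sample; it does not establish that the internet deficit causes non-return.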
# Short factor aliases for the modeling step
data$internet_access <- as.factor(data$do_children_have_internet_connection)
# Note: despite its name, return_to_school = 1 means children did NOT return to school
data$return_to_school <- as.factor(data$were_children_3_to_17_yrs_enrolled_and_did_not_return_to_school)
data$irregular_activity <- as.factor(data$are_children_3_to_17_yrs_dealing_with_irregular_school_activity)
data$electricity_issue <- as.factor(data$does_home_shows_severe_deficit_of_electricity)
data$financial_status <- as.factor(data$financial_situation)
model1 <- glm(do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity ~ internet_access,
data = data, family = "binomial")
model2 <- glm(return_to_school ~ internet_access + electricity_issue + financial_status + geography,
data = data, family = "binomial")
model3 <- glm(irregular_activity ~ internet_access + electricity_issue + financial_status,
data = data, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = do_children_3_to_17_yrs_miss_virtual_class_due_to_lack_of_electricity ~
## internet_access, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.87461 0.05405 16.181 < 2e-16 ***
## internet_access1 -0.29195 0.06695 -4.361 1.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5654.5 on 4435 degrees of freedom
## Residual deviance: 5635.3 on 4434 degrees of freedom
## AIC: 5639.3
##
## Number of Fisher Scoring iterations: 4
summary(model2)
##
## Call:
## glm(formula = return_to_school ~ internet_access + electricity_issue +
## financial_status + geography, family = "binomial", data = data)
##
## Coefficients: (1 not defined because of singularities)
## Estimate
## (Intercept) 0.17902
## internet_access1 0.09999
## electricity_issue1 0.42152
## financial_statusI can afford food, but nothing else 0.18449
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 0.22088
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 0.24278
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings -0.06844
## financial_statusI cannot afford enough food for my family 0.13011
## financial_statusNot Available -11.84506
## financial_statusPrefer not to answer -0.18498
## geographyNot Available NA
## geographyRural 0.10113
## geographySuburban/Peri-urban -0.07416
## Std. Error
## (Intercept) 0.08630
## internet_access1 0.06518
## electricity_issue1 0.07147
## financial_statusI can afford food, but nothing else 0.08388
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 0.14751
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 0.17919
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings 0.19128
## financial_statusI cannot afford enough food for my family 0.08908
## financial_statusNot Available 196.96769
## financial_statusPrefer not to answer 0.14494
## geographyNot Available NA
## geographyRural 0.08023
## geographySuburban/Peri-urban 0.07303
## z value
## (Intercept) 2.074
## internet_access1 1.534
## electricity_issue1 5.898
## financial_statusI can afford food, but nothing else 2.199
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 1.497
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 1.355
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings -0.358
## financial_statusI cannot afford enough food for my family 1.461
## financial_statusNot Available -0.060
## financial_statusPrefer not to answer -1.276
## geographyNot Available NA
## geographyRural 1.260
## geographySuburban/Peri-urban -1.015
## Pr(>|z|)
## (Intercept) 0.0380
## internet_access1 0.1250
## electricity_issue1 3.69e-09
## financial_statusI can afford food, but nothing else 0.0279
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 0.1343
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 0.1755
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings 0.7205
## financial_statusI cannot afford enough food for my family 0.1441
## financial_statusNot Available 0.9520
## financial_statusPrefer not to answer 0.2019
## geographyNot Available NA
## geographyRural 0.2075
## geographySuburban/Peri-urban 0.3099
##
## (Intercept) *
## internet_access1
## electricity_issue1 ***
## financial_statusI can afford food, but nothing else *
## financial_statusI can afford food, regular expenses, and clothes, but nothing else
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings
## financial_statusI cannot afford enough food for my family
## financial_statusNot Available
## financial_statusPrefer not to answer
## geographyNot Available
## geographyRural
## geographySuburban/Peri-urban
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5920.4 on 4435 degrees of freedom
## Residual deviance: 5861.5 on 4424 degrees of freedom
## AIC: 5885.5
##
## Number of Fisher Scoring iterations: 10
summary(model3)
##
## Call:
## glm(formula = irregular_activity ~ internet_access + electricity_issue +
## financial_status, family = "binomial", data = data)
##
## Coefficients:
## Estimate
## (Intercept) 0.645181
## internet_access1 -0.366553
## electricity_issue1 0.900427
## financial_statusI can afford food, but nothing else -0.056385
## financial_statusI can afford food, regular expenses, and clothes, but nothing else -0.084548
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings -0.002933
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings -0.323386
## financial_statusI cannot afford enough food for my family 0.023017
## financial_statusNot Available 11.287426
## financial_statusPrefer not to answer -0.361595
## Std. Error
## (Intercept) 0.082887
## internet_access1 0.068379
## electricity_issue1 0.078209
## financial_statusI can afford food, but nothing else 0.086171
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 0.149326
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 0.181161
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings 0.195118
## financial_statusI cannot afford enough food for my family 0.092080
## financial_statusNot Available 196.967691
## financial_statusPrefer not to answer 0.147988
## z value
## (Intercept) 7.784
## internet_access1 -5.361
## electricity_issue1 11.513
## financial_statusI can afford food, but nothing else -0.654
## financial_statusI can afford food, regular expenses, and clothes, but nothing else -0.566
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings -0.016
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings -1.657
## financial_statusI cannot afford enough food for my family 0.250
## financial_statusNot Available 0.057
## financial_statusPrefer not to answer -2.443
## Pr(>|z|)
## (Intercept) 7.04e-15
## internet_access1 8.29e-08
## electricity_issue1 < 2e-16
## financial_statusI can afford food, but nothing else 0.5129
## financial_statusI can afford food, regular expenses, and clothes, but nothing else 0.5713
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings 0.9871
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings 0.0974
## financial_statusI cannot afford enough food for my family 0.8026
## financial_statusNot Available 0.9543
## financial_statusPrefer not to answer 0.0145
##
## (Intercept) ***
## internet_access1 ***
## electricity_issue1 ***
## financial_statusI can afford food, but nothing else
## financial_statusI can afford food, regular expenses, and clothes, but nothing else
## financial_statusI can comfortably afford food, clothes, and furniture, and I have savings
## financial_statusI can comfortably afford food, clothes, and furniture, but I don’t have savings .
## financial_statusI cannot afford enough food for my family
## financial_statusNot Available
## financial_statusPrefer not to answer *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5780.9 on 4435 degrees of freedom
## Residual deviance: 5575.1 on 4426 degrees of freedom
## AIC: 5595.1
##
## Number of Fisher Scoring iterations: 10
data$dropout_risk <- predict(model2, type = "response")
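Insight: The fitted probabilities of non-return from Model 2 are stored as dropout_risk. Among its predictors only electricity_issue1 is strongly significant, and the deviance falls only from 5920.4 to 5861.5, so the model separates respondents weakly. The "Not Available" rows (standard error near 197, or NA) come from a single respondent and carry no usable information.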
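Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for two Model 2 coefficients, using the estimates and standard errors printed above and a normal-approximation (Wald) 95% interval; the vector names are illustrative.

```r
# Estimates and standard errors copied from summary(model2)
est <- c(electricity_issue1 = 0.42152, internet_access1 = 0.09999)
se  <- c(electricity_issue1 = 0.07147, internet_access1 = 0.06518)

# Odds ratios with Wald 95% confidence intervals
odds_ratio <- exp(est)
ci_lower   <- exp(est - qnorm(0.975) * se)
ci_upper   <- exp(est + qnorm(0.975) * se)

round(cbind(odds_ratio, ci_lower, ci_upper), 2)
```

An interval that excludes 1 (electricity_issue1) matches a significant coefficient, while an interval that contains 1 (internet_access1) matches a non-significant one. With the fitted model object available, exp(coef(model2)) gives the same odds ratios for all terms.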
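# as.numeric() on a factor returns level codes (1, 2, ...), not the original 0/1 values.
# For the binary factors this only shifts the scale, which leaves correlations unchanged.
data$internet_num <- as.numeric(data$internet_access)
data$return_num <- as.numeric(data$return_to_school)
data$electricity_num <- as.numeric(data$electricity_issue)
# financial_status is nominal: its codes follow alphabetical level order, not financial well-being
data$financial_num <- as.numeric(data$financial_status)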
cor_matrix <- cor(data[, c("internet_num","return_num","electricity_num","financial_num","dropout_risk")])
cor_matrix
## internet_num return_num electricity_num financial_num
## internet_num 1.000000000 0.009096232 -0.11804453 -0.049153182
## return_num 0.009096232 1.000000000 0.09149783 -0.002601488
## electricity_num -0.118044528 0.091497829 1.00000000 0.050075172
## financial_num -0.049153182 -0.002601488 0.05007517 1.000000000
## dropout_risk 0.079628692 0.114286379 0.80097447 -0.022773434
## dropout_risk
## internet_num 0.07962869
## return_num 0.11428638
## electricity_num 0.80097447
## financial_num -0.02277343
## dropout_risk 1.00000000
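Insight: Shows the strength and direction of linear relationships among variables. The 0.80 correlation between electricity_num and dropout_risk arises by construction, because dropout_risk is predicted from electricity_issue. The observed outcome return_num correlates only weakly with every predictor (at most 0.09 in absolute value). financial_num holds arbitrary level codes, so its correlations are not interpretable.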
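The behaviour of as.numeric() on factors explains why the binary correlations above are still valid while the financial_num ones are not. A minimal sketch with made-up vectors (f, g and y are illustrative, not survey columns):

```r
# Binary factor: level codes are the original values shifted by one
f <- factor(c(0, 1, 1, 0))
codes  <- as.numeric(f)                # 1 2 2 1
values <- as.numeric(as.character(f))  # 0 1 1 0

# Correlation is unchanged by a constant shift, so both versions agree
y <- c(0.2, 0.9, 0.7, 0.1)
cor(codes, y)
cor(values, y)

# Nominal factor: codes follow alphabetical level order and carry no ranking
g <- factor(c("Rural", "City", "Suburban"))
g_codes <- as.numeric(g)               # 2 1 3
```

For a nominal variable such as financial_status, comparing group means or using the factor directly in a model is safer than correlating its codes.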
prop.table(table(data$internet_access, data$return_to_school), 1)
##
## 0 1
## 0 0.3925971 0.6074029
## 1 0.3834290 0.6165710
high_risk_students <- data[data$dropout_risk > 0.7, ]
nrow(high_risk_students)
## [1] 381
aggregate(dropout_risk ~ financial_status + geography, data = data, mean)
## financial_status
## 1 I can afford food and regular expenses, but nothing else
## 2 I can afford food, but nothing else
## 3 I can afford food, regular expenses, and clothes, but nothing else
## 4 I can comfortably afford food, clothes, and furniture, and I have savings
## 5 I can comfortably afford food, clothes, and furniture, but I don’t have savings
## 6 I cannot afford enough food for my family
## 7 Prefer not to answer
## 8 Not Available
## 9 I can afford food and regular expenses, but nothing else
## 10 I can afford food, but nothing else
## 11 I can afford food, regular expenses, and clothes, but nothing else
## 12 I can comfortably afford food, clothes, and furniture, and I have savings
## 13 I can comfortably afford food, clothes, and furniture, but I don’t have savings
## 14 I cannot afford enough food for my family
## 15 Prefer not to answer
## 16 I can afford food and regular expenses, but nothing else
## 17 I can afford food, but nothing else
## 18 I can afford food, regular expenses, and clothes, but nothing else
## 19 I can comfortably afford food, clothes, and furniture, and I have savings
## 20 I can comfortably afford food, clothes, and furniture, but I don’t have savings
## 21 I cannot afford enough food for my family
## 22 Prefer not to answer
## geography dropout_risk
## 1 City center or metropolitan area 5.856496e-01
## 2 City center or metropolitan area 6.288990e-01
## 3 City center or metropolitan area 6.378863e-01
## 4 City center or metropolitan area 6.459355e-01
## 5 City center or metropolitan area 5.726864e-01
## 6 City center or metropolitan area 6.237103e-01
## 7 City center or metropolitan area 5.373316e-01
## 8 Not Available 9.482496e-06
## 9 Rural 6.161654e-01
## 10 Rural 6.597386e-01
## 11 Rural 6.578040e-01
## 12 Rural 6.699145e-01
## 13 Rural 6.086975e-01
## 14 Rural 6.499037e-01
## 15 Rural 5.712774e-01
## 16 Suburban/Peri-urban 5.652749e-01
## 17 Suburban/Peri-urban 6.128347e-01
## 18 Suburban/Peri-urban 6.207245e-01
## 19 Suburban/Peri-urban 6.225738e-01
## 20 Suburban/Peri-urban 5.633299e-01
## 21 Suburban/Peri-urban 6.026952e-01
## 22 Suburban/Peri-urban 5.130942e-01
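Insight: Within every financial category, average predicted risk is highest for rural respondents (roughly 0.57 to 0.67), followed by city-center and then suburban respondents. Row 8 is a single respondent with "Not Available" answers and should be ignored. The 381 respondents with predicted risk above 0.7 are candidates for targeted interventions.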
median(data$dropout_risk)
## [1] 0.6039146
names(sort(table(data$return_to_school), decreasing = TRUE))[1]
## [1] "1"
var(data$dropout_risk)
## [1] 0.003095888
sd(data$dropout_risk)
## [1] 0.0556407
quantile(data$dropout_risk)
## 0% 25% 50% 75% 100%
## 9.482496e-06 5.718382e-01 6.039146e-01 6.494608e-01 7.397042e-01
IQR(data$dropout_risk)
## [1] 0.0776226
quantile(data$dropout_risk, probs = c(0.25,0.5,0.75,0.9))
## 25% 50% 75% 90%
## 0.5718382 0.6039146 0.6494608 0.6967302
Insight: Describes central tendency and variability of dropout risk.
# The column has no NA values; missing answers are coded as the string "Not Available"
data$financial_situation[data$financial_situation == "Not Available"] <- "Unknown"
# internet_access is a factor with levels "0" and "1" (there is no "Yes" level)
data$internet_binary <- ifelse(data$internet_access == "1", 1, 0)
reduced_data <- data[, c("internet_binary","financial_num","dropout_risk")]
Insight: Recoding missing answers improves data quality, and keeping only three columns gives a smaller dataset for later steps.
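library(ggplot2)  # ggplot(), geom_*(), theme_minimal()
library(GGally)   # ggpairs(), used further below
ggplot(data, aes(x = internet_access, fill = internet_access)) +
geom_bar() + theme_minimal()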
ggplot(data, aes(x = return_to_school, fill = return_to_school)) +
geom_bar() + theme_minimal()
ggplot(data, aes(x = financial_status, y = dropout_risk, fill = financial_status)) +
geom_boxplot() + theme_minimal()
ggplot(data, aes(x = financial_num, y = dropout_risk)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
pie(table(data$internet_access),
main = "Internet Access Distribution")
plot(ecdf(data$dropout_risk),
main = "CDF of Dropout Risk",
sub = "Shows % of students below risk level")
ggpairs(data[, c("internet_num","financial_num","dropout_risk")])
ggplot(data, aes(x = dropout_risk)) +
geom_histogram(bins = 20, fill = "purple")
ggplot(data, aes(x = dropout_risk, fill = return_to_school)) +
geom_density(alpha = 0.4)
heat <- as.data.frame(table(data$internet_access, data$return_to_school))
ggplot(heat, aes(Var1, Var2, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq))
ggplot(data, aes(x = internet_access, fill = return_to_school)) +
geom_bar(position = "dodge") +
facet_wrap(~ geography)
ggplot(data, aes(x = internet_access, y = dropout_risk)) +
geom_jitter(alpha = 0.3) +
stat_summary(fun = mean, geom = "point", color = "red")
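# skewness() and kurtosis() are not in base R; the moments package is assumed here
library(moments)
skewness(data$dropout_risk)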
## [1] 0.003830186
kurtosis(data$dropout_risk)
## [1] 5.629831
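Insight: Skewness near 0 (0.004) indicates a roughly symmetric distribution, while a kurtosis of 5.63 indicates heavier tails than a normal distribution, that is, a few extreme values such as the single near-zero prediction.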
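In the regression output:
- electricity_issue1 has a p-value below 0.001 in Models 2 and 3 → a severe home electricity deficit is significantly associated with both non-return to school and irregular school activity
- internet_access1 in Model 1 has a p-value ≈ 0.0000129 → internet access is statistically significant there (negative coefficient) and in Model 3, but not in Model 2 (p = 0.125)
Conclusion: Lower p-values = stronger evidence against the hypothesis of no association. A p-value does not measure how large an effect is; the size of the coefficient does.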
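The p-values in the coefficient tables come from the z-values through the standard normal distribution. As a quick check, the two-sided p-value for internet_access1 in Model 1 can be reproduced from its printed z-value (the variable names are illustrative):

```r
# z-value for internet_access1 copied from summary(model1)
z <- -4.361

# Two-sided p-value: probability of a standard normal value at least this extreme
p_value <- 2 * pnorm(-abs(z))
p_value
```

The result agrees with the 1.29e-05 shown in the output, up to rounding of the printed z-value.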
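Although logistic regression reports z-values, linear regression uses the F-statistic to test whether the model as a whole explains significantly more variation than an intercept-only model.
Formula conceptually: F = (variance explained by the model per degree of freedom) / (residual variance per degree of freedom)
Interpretation:
- Large F-value → the predictors jointly explain more variation than chance alone would
- F-value near 1 → the model adds little over the intercept-only model
In this project:
- We rely on z-values and p-values instead of the F-statistic because we used logistic regression (glm); the closest counterpart is the drop from null deviance to residual deviance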
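For a logistic model, the usual overall test is a likelihood-ratio test on the deviance drop, which plays the role the F-test plays in linear regression. A sketch using the deviances and degrees of freedom printed for Model 2 (the variable names are illustrative):

```r
# Values copied from summary(model2)
null_dev  <- 5920.4   # null deviance on 4435 degrees of freedom
resid_dev <- 5861.5   # residual deviance on 4424 degrees of freedom

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
df_diff <- 4435 - 4424

# Under the null hypothesis the statistic is approximately chi-squared
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```

With the model object available, anova(model2, test = "Chisq") reports the same kind of test term by term. A significant overall test is compatible with a weak model: here the deviance drops by only about 1%.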
From output:
- SD of dropout_risk = 0.0556
Interpretation:
- The SD shows how far predicted dropout probabilities typically lie from their mean
- Small SD → values are closely grouped
- Large SD → high variation
In this dataset: Predicted dropout risk varies little; the middle 50% of respondents fall between about 0.57 and 0.65 (the 25th and 75th percentiles).
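The CDF graph shows, for each risk level on the x-axis, the proportion of respondents whose predicted dropout risk is at or below that level.
Example:
- If CDF at 0.6 = 0.5 → 50% of respondents have dropout risk ≤ 0.6 (the median here is 0.604, so this is close to the actual curve)
Interpretation in this project:
- Helps understand the percentage of respondents below a certain risk level
- Useful for identifying thresholds (e.g., high-risk respondents)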
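The ecdf() function used for the plot also returns a function that can be evaluated at any threshold. The sketch below uses five made-up probabilities (risk is hypothetical) and then converts the reported count of 381 respondents above 0.7 into the corresponding CDF value.

```r
# Hypothetical predicted probabilities, for illustration only
risk <- c(0.52, 0.57, 0.60, 0.65, 0.72)
risk_cdf <- ecdf(risk)

# Proportion of values at or below 0.60 (3 of 5)
below_060 <- risk_cdf(0.60)

# Proportion strictly above 0.70 (1 of 5)
above_070 <- 1 - risk_cdf(0.70)

# Using the counts reported above: 381 of 4436 respondents exceed 0.7,
# so the CDF of dropout_risk at 0.7 is about 0.914
cdf_at_070 <- 1 - 381 / 4436
round(cdf_at_070, 3)
```

Reading the curve this way makes the choice of a high-risk cutoff explicit: a cutoff of 0.7 selects roughly the top 9% of respondents.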
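Variance = 0.003096
Interpretation:
- Measures the spread of the data as the average squared deviation from the mean; it is the square of the SD (0.0556² ≈ 0.0031)
- Low variance → data points are close to the mean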
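The variance and the SD reported above are two views of the same quantity, which can be confirmed from the printed values. In this sketch x is a made-up vector used only to show that var() and sd() agree in general.

```r
# Variance of dropout_risk as printed by var()
variance <- 0.003095888

# The SD is its square root and should match the printed sd() value, 0.0556407
std_dev <- sqrt(variance)
std_dev

# The same relationship on an arbitrary illustrative vector
x <- c(0.55, 0.60, 0.62, 0.66)
same <- isTRUE(all.equal(sd(x), sqrt(var(x))))
same
```

Because the SD is in the same units as the probabilities, it is usually the easier of the two to interpret.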
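Final Recommendation: A severe electricity deficit at home is the factor most consistently associated with non-return and irregular school activity in these models, so reliable electricity should be a priority alongside digital access. Improving digital access and providing financial support may also reduce dropout rates and improve educational outcomes, although the evidence for them here is weaker and purely associational.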