SaiTeja_Final

Objective

The objective of this analysis is to investigate the factors that affect student dropout. In particular, it examines the relationship between gender and student dropout.

Target Audience

Educational institutions

Reason for Objective and Target Audience

Practical application: Understanding the impact of gender on student dropout could help educational institutions improve the environment they provide for students.

Relevance to the audience: Heads of institutions will gain insight into the factors affecting students' academic performance.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(tsibble)

## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

setwd("/Users/saitejaravulapalli/Documents/IUPUI_SEM 01/Intro to Statistic in R/DATA SET")
data <- read.csv("student dropout.csv", sep = ";", header = TRUE)



status_counts <- table(data$Target)


graduate_count <- status_counts["Graduate"]
enrolled_count <- status_counts["Enrolled"]
dropout_count <- status_counts["Dropout"]


cat("Number of graduates:", graduate_count, "\n")

## Number of graduates: 2209

cat("Number of enrolled students:", enrolled_count, "\n")

## Number of enrolled students: 794

cat("Number of dropouts:", dropout_count, "\n")

## Number of dropouts: 1421

Creating a Pie Chart

Visualising the student status percentages on a pie chart to show the share of students who dropped out, are still enrolled, or graduated.

total_count <- nrow(data)

graduate_percent <- (graduate_count / total_count) * 100
enrolled_percent <- (enrolled_count / total_count) * 100
dropout_percent <- (dropout_count / total_count) * 100


status_percentages <- c(Graduate = graduate_percent, Enrolled = enrolled_percent, Dropout = dropout_percent)

# Define colors for the pie chart
status_colors <- c(Graduate = "red", Enrolled = "green", Dropout = "blue")

# Plot the pie chart
pie(status_percentages, labels = paste0(names(status_percentages), ": ", percent(status_percentages / 100)), col = status_colors, main = "Distribution of Statuses")

# Add a legend
legend("right", legend = names(status_percentages), fill = status_colors, title = "Status Categories")
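The percentage calculation above can also be sketched with `prop.table()`, which converts counts into proportions in one step. The block below rebuilds the status counts printed earlier (2209, 794, 1421) as a small stand-in vector; the `*_demo` names are hypothetical and are not objects from the original script.

```r
# Stand-in for table(data$Target), using the counts printed earlier in this report
status_counts_demo <- c(Graduate = 2209, Enrolled = 794, Dropout = 1421)

# Convert counts to proportions, then to percentages rounded to one decimal
status_share_demo <- prop.table(status_counts_demo)
status_percent_demo <- round(100 * status_share_demo, 1)

print(status_percent_demo)
```

Working from proportions avoids repeating the divide-by-total step for each status separately.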

Segregation

Separating the courses and counting the number of students who dropped out of each course, to better understand the factors affecting the completion of education.

dropout_count <- data %>%
  filter(Target == 'Dropout') %>%
  group_by(Course) %>%
  summarise(Count = n())


course_mapping <- data.frame(
  CourseCode = c("33", "171", "8014", "9003", "9070", "9085", "9119", "9130", "9147", "9238", "9254", "9500", "9556", "9670", "9773", "9853", "9991"),
  CourseName = c("Biofuel Production Technologies", "Animation and Multimedia Design", "Social Service (evening attendance)", "Agronomy", "Communication Design", "Veterinary Nursing", "Informatics Engineering", "Equinculture", "Management", "Social Service", "Tourism", "Nursing", "Oral Hygiene", "Advertising and Marketing Management", "Journalism and Communication", "Basic Education", "Management (evening attendance)")
)


dropout_count_with_names <- merge(dropout_count, course_mapping, by.x = "Course", by.y = "CourseCode", all.x = TRUE)


ggplot(dropout_count_with_names, aes(x = CourseName, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Number of Students Dropout for Each Course", 
       x = "Course", y = "Number of Dropouts") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
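The count-and-lookup step above can also be sketched with `count()` and `left_join()` from dplyr. The example uses a small hypothetical data frame (`toy_data`, `toy_course_names`) rather than the real data set, and keeps the course codes as integers on both sides so that the join keys share a type.

```r
library(dplyr)

# Hypothetical toy data standing in for the real data set
toy_data <- data.frame(
  Course = c(33L, 33L, 171L, 171L, 171L, 9003L),
  Target = c("Dropout", "Graduate", "Dropout", "Dropout", "Enrolled", "Graduate")
)

# Lookup table with integer codes, so the join key types match the data
toy_course_names <- data.frame(
  Course = c(33L, 171L, 9003L),
  CourseName = c("Biofuel Production Technologies",
                 "Animation and Multimedia Design",
                 "Agronomy")
)

# Count dropouts per course and attach the readable course name
toy_dropouts <- toy_data %>%
  filter(Target == "Dropout") %>%
  count(Course, name = "Count") %>%
  left_join(toy_course_names, by = "Course")

print(toy_dropouts)
```

Courses with no dropouts do not appear in the result, which mirrors the behaviour of the filter-then-summarise approach.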

Grouping and Chi-square Test

Grouping the students by course and performing a chi-square test for each course, to determine in which departments gender is an important factor in students dropping out.

filtered_data <- data %>%
  filter(Target %in% c('Dropout', 'Graduate', 'Enrolled'))


tables <- filtered_data %>%
  group_by(Course, Gender, Target) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  spread(Target, Count, fill = 0)

## `summarise()` has grouped output by 'Course', 'Gender'. You can override using
## the `.groups` argument.

results <- lapply(unique(tables$Course), function(course) {
  subset_table <- tables[tables$Course == course, ]
  chi_sq <- chisq.test(subset_table[, c('Dropout', 'Graduate', 'Enrolled')])
  return(list(course = course, chi_sq = chi_sq))
})

## Warning in chisq.test(subset_table[, c("Dropout", "Graduate", "Enrolled")]):
## Chi-squared approximation may be incorrect

## Warning in chisq.test(subset_table[, c("Dropout", "Graduate", "Enrolled")]):
## Chi-squared approximation may be incorrect

## Warning in chisq.test(subset_table[, c("Dropout", "Graduate", "Enrolled")]):
## Chi-squared approximation may be incorrect

## Warning in chisq.test(subset_table[, c("Dropout", "Graduate", "Enrolled")]):
## Chi-squared approximation may be incorrect

## Warning in chisq.test(subset_table[, c("Dropout", "Graduate", "Enrolled")]):
## Chi-squared approximation may be incorrect
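The "Chi-squared approximation may be incorrect" warning is raised by `chisq.test()` when some expected cell counts are small (below 5). A minimal sketch of how such a table could be inspected follows. The counts in `toy_table` are hypothetical, and Fisher's exact test is shown only as one possible alternative, not as the method used in this report.

```r
# Hypothetical 2 x 3 gender-by-status table for one small course
toy_table <- matrix(
  c(3, 8, 2,
    6, 4, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Status = c("Dropout", "Graduate", "Enrolled"))
)

# Expected counts under independence; suppressWarnings() hides the same
# approximation warning so the expected counts can be inspected directly
toy_test <- suppressWarnings(chisq.test(toy_table))
print(toy_test$expected)

# Flag the table if any expected count falls below 5
has_small_cells <- any(toy_test$expected < 5)

# One possible alternative for sparse tables: Fisher's exact test
toy_exact <- fisher.test(toy_table)
print(toy_exact$p.value)
```

Checking `has_small_cells` per course would show which of the chi-square results should be read with more caution.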

for (res in results) {
  cat("Course:", res$course, "\n")
  print(res$chi_sq)
  
  
  p_value <- res$chi_sq$p.value
  
  
  if (p_value <= 0.05) {
    cat("Null Hypothesis Rejected: There is evidence of an association between variables.\n")
  } else {
    cat("Null Hypothesis Accepted: There is no significant evidence of an association between variables.\n")
  }
}

## Course: 33 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 3.7778, df = 2, p-value = 0.1512
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 171 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 0.91357, df = 2, p-value = 0.6333
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 8014 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 1.9612, df = 2, p-value = 0.3751
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 9003 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 7.916, df = 2, p-value = 0.0191
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9070 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 6.6709, df = 2, p-value = 0.0356
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9085 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 23.239, df = 2, p-value = 8.989e-06
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9119 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 3.0064, df = 2, p-value = 0.2224
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 9130 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 3.8191, df = 2, p-value = 0.1482
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 9147 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 30.128, df = 2, p-value = 2.87e-07
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9238 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 30.838, df = 2, p-value = 2.012e-07
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9254 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 14.36, df = 2, p-value = 0.0007615
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9500 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 16.02, df = 2, p-value = 0.0003321
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9556 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 6.6222, df = 2, p-value = 0.03648
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9670 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 8.7452, df = 2, p-value = 0.01262
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.
## Course: 9773 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 2.8412, df = 2, p-value = 0.2416
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 9853 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 0.37011, df = 2, p-value = 0.8311
## 
## Null Hypothesis Accepted: There is no significant evidence of an association between variables.
## Course: 9991 
## 
##  Pearson's Chi-squared test
## 
## data:  subset_table[, c("Dropout", "Graduate", "Enrolled")]
## X-squared = 14.723, df = 2, p-value = 0.0006354
## 
## Null Hypothesis Rejected: There is evidence of an association between variables.

course_p_values <- sapply(results, function(res) res$chi_sq$p.value)
course_codes <- sapply(results, function(res) res$course)


p_value_data <- data.frame(CourseCode = course_codes, P_Value = course_p_values)


p_value_data <- merge(p_value_data, course_mapping, by.x = "CourseCode", by.y = "CourseCode", all.x = TRUE)


significance_level <- 0.05


# Label each course by whether its p-value is at or below the significance level
p_value_data$Significance <- ifelse(p_value_data$P_Value <= significance_level, "Significant", "Not significant")


ggplot(p_value_data, aes(x = CourseName, y = P_Value, fill = Significance)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Significant" = "green", "Not significant" = "red")) +
  labs(title = "P-Values for Each Course", x = "Course Name", y = "P-Value") +
  geom_hline(yintercept = significance_level, linetype = "dashed", color = "blue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
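The per-course results can be tallied directly from the p-values. The sketch below copies the p-values from the chi-square output printed earlier in this report. The Holm adjustment at the end is an optional extra check for running 17 tests at once and is not part of the original analysis; the names `course_p`, `alpha`, `n_significant`, and `course_p_holm` are hypothetical.

```r
# p-values copied from the chi-square output above, in course-code order
course_p <- c(
  `33` = 0.1512, `171` = 0.6333, `8014` = 0.3751, `9003` = 0.0191,
  `9070` = 0.0356, `9085` = 8.989e-06, `9119` = 0.2224, `9130` = 0.1482,
  `9147` = 2.87e-07, `9238` = 2.012e-07, `9254` = 0.0007615,
  `9500` = 0.0003321, `9556` = 0.03648, `9670` = 0.01262,
  `9773` = 0.2416, `9853` = 0.8311, `9991` = 0.0006354
)

alpha <- 0.05

# Number of courses where the null hypothesis is rejected at alpha
n_significant <- sum(course_p <= alpha)
cat("Significant courses:", n_significant, "of", length(course_p), "\n")

# Optional (not part of the original analysis): Holm adjustment for 17 tests
course_p_holm <- p.adjust(course_p, method = "holm")
print(names(course_p_holm)[course_p_holm <= alpha])
```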

Finding the Probability

Finding the probability that a male or a female student drops out of each course, in order to quantify the significance of gender in student dropout.

library(ggplot2)


course_names <- data.frame(
  Course = c(33, 171, 8014, 9003, 9070, 9085, 9119, 9130, 9147, 9238, 9254, 9500, 9556, 9670, 9773, 9853, 9991),
  Course_Name = c(
    "Biofuel Production Technologies", "Animation and Multimedia Design", "Social Service (evening attendance)",
    "Agronomy", "Communication Design", "Veterinary Nursing", "Informatics Engineering", "Equinculture", 
    "Management", "Social Service", "Tourism", "Nursing", "Oral Hygiene", "Advertising and Marketing Management", 
    "Journalism and Communication", "Basic Education", "Management (evening attendance)"
  )
)


prob_df <- data.frame(Course = integer(), Gender = character(), Probability = numeric(), Course_Name = character())

for (course in course_names$Course) {

  course_data <- data[data$Course == course, ]
  

  female_students <- course_data[course_data$Gender == 0, ]
  total_female <- nrow(female_students)
  
 
  male_students <- course_data[course_data$Gender == 1, ]
  total_male <- nrow(male_students)
  

  dropout_female <- sum(female_students$Target == "Dropout")
  dropout_male <- sum(male_students$Target == "Dropout")
  
 
  dropout_prob_female <- dropout_female / total_female
  dropout_prob_male <- dropout_male / total_male
  

  prob_df <- rbind(prob_df, data.frame(Course = rep(course, 2),
                                       Gender = c("Female", "Male"),
                                       Probability = c(dropout_prob_female, dropout_prob_male),
                                       Course_Name = rep(course_names$Course_Name[course_names$Course == course], 2)))
}


ggplot(prob_df, aes(x = Course_Name, y = Probability, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "Probability of Dropout for Female and Male Students in Different Courses",
       x = "Course", y = "Probability") +
  scale_fill_manual(values = c("Female" = "blue", "Male" = "red"),
                    labels = c("Female", "Male")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
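The loop above can also be sketched as a grouped summary: the dropout probability for a group is the mean of the logical vector `Target == "Dropout"`. The data frame `toy_students` is hypothetical, and it assumes the same gender coding as the loop (0 = female, 1 = male).

```r
library(dplyr)

# Hypothetical toy data; Gender is coded 0 = female, 1 = male as in the loop above
toy_students <- data.frame(
  Course = c(33L, 33L, 33L, 33L, 171L, 171L, 171L, 171L),
  Gender = c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L),
  Target = c("Dropout", "Graduate", "Dropout", "Dropout",
             "Graduate", "Graduate", "Dropout", "Enrolled")
)

# Share of dropouts within each course and gender group
toy_prob <- toy_students %>%
  mutate(Gender = if_else(Gender == 0L, "Female", "Male")) %>%
  group_by(Course, Gender) %>%
  summarise(Probability = mean(Target == "Dropout"), .groups = "drop")

print(toy_prob)
```

This form avoids growing a data frame with `rbind()` inside a loop, and setting `.groups = "drop"` suppresses the grouping message from `summarise()`.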

Hypothesis

Null Hypothesis (H0): There is no association between the variables “Gender” and “Student Status” for each course. In other words, gender and student status are independent of each other.

Alternative Hypothesis (H1): There is an association between the variables “Gender” and “Student Status” for each course. In other words, there is a relationship or dependency between gender and student status within a course.
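For reference, the Pearson chi-square statistic reported for each course compares the observed counts with the counts expected under independence. The notation here is introduced only for illustration: O_ij and E_ij are the observed and expected counts for gender i and status j, R_i and C_j are the row and column totals, and N is the number of students in the course. With two gender rows and three status columns, the degrees of freedom come to 2, which matches the `df = 2` shown in the test output.

```latex
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{ij} = \frac{R_i \, C_j}{N},
\qquad \mathrm{df} = (2-1)(3-1) = 2
```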

Conclusion: In summary, this statistical analysis set out to identify reasons for student dropout, focusing on the effect of a student’s gender. The chi-square tests rejected the null hypothesis at the 0.05 level in 10 of the 17 courses, which indicates an association between gender and student status in those courses; in the remaining 7 courses there was no significant evidence of an association. Educational institutions should strengthen support in the relevant areas so that students with a higher probability of dropping out can continue their studies.

SaiTeja_Final_Project

2023-12-02