Importing data into “dataset”:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/MSKR/MASTERS_ADS/STATISTICS_SEM1/DATA_SET_1.csv", delim = ",")
## Rows: 4424 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Target
## dbl (36): Marital status, Application mode, Application order, Course, Dayti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Creating a working copy of the table and recoding the necessary categorical columns:
dataset_1 <- dataset
# Recode the numeric marital-status codes as labels; any other code becomes "no"
dataset_1 <- mutate(dataset_1, marital_status = case_when(
  `Marital status` == 1 ~ "single",
  `Marital status` == 2 ~ "married",
  `Marital status` == 3 ~ "widower",
  `Marital status` == 4 ~ "divorced",
  `Marital status` == 5 ~ "facto union",
  `Marital status` == 6 ~ "legally separated",
  TRUE ~ "no"))
# Label attendance: 1 = day, anything else = evening
dataset_1 <- mutate(dataset_1, day_eve_class = ifelse(`Daytime/evening attendance ` == 1, "day", "evening"))
# Numeric outcome code; NA (rather than the string "no") keeps the column numeric
dataset_1 <- mutate(dataset_1, target = case_when(
  Target == "Graduate" ~ 2,
  Target == "Enrolled" ~ 1,
  Target == "Dropout" ~ 0,
  TRUE ~ NA_real_))
# Average of the 1st- and 2nd-semester grades
dataset_1 <- mutate(dataset_1, sem_results = rowMeans(select(dataset_1, `Curricular units 1st sem (grade)`, `Curricular units 2nd sem (grade)`)))
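The recoding step above maps numeric codes to labels. As a minimal sketch of the same idea in base R, a named lookup vector can replace the chain of conditions. The `codes` vector below is hypothetical toy input, not data from DATA_SET_1.csv, and the labels are the ones used in this report.

```r
# Hypothetical toy codes (not taken from DATA_SET_1.csv); 9 is an unknown code
codes <- c(1, 2, 4, 6, 9)

# Lookup table: names are the numeric codes, values are the labels
labels <- c("1" = "single", "2" = "married", "3" = "widower",
            "4" = "divorced", "5" = "facto union", "6" = "legally separated")

# Index the lookup by code; codes missing from the table come back as NA
marital_status <- unname(labels[as.character(codes)])

# Fall back to "no" for unknown codes, as in the recoding above
marital_status[is.na(marital_status)] <- "no"
marital_status
```

A lookup vector keeps the code-to-label mapping in one place, which may be easier to check against the data dictionary than nested conditions.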
options(repos = c(CRAN = "https://cloud.r-project.org/"))
install.packages("pwr")
## Installing package into 'C:/Users/MSKR/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'pwr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\MSKR\AppData\Local\Temp\RtmpKSOKpd\downloaded_packages
library(pwr)
Null Hypothesis: Among students who either graduated or dropped out, the true proportion of dropouts is 0.5 (dropouts and graduates are equally common).
Alternative Hypothesis: Among students who either graduated or dropped out, the true proportion of dropouts is not equal to 0.5.
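Effect Size (0.1): the smallest difference in proportions worth detecting (0.5 versus 0.6), converted to Cohen's h with ES.h().
Alpha (alpha = 0.05): the significance level of the test.
Power (0.80): the probability of detecting an effect of that size if it exists.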
# Define parameters for sample size calculation
effect_size <- 0.1 # Minimum difference in proportions to detect (10 percentage points)
alpha <- 0.05
power <- 0.80
# Calculate the sample size required in each group (pwr.2p.test assumes two equal-sized groups)
n <- pwr.2p.test(h = ES.h(0.5, 0.5 + effect_size), sig.level = alpha, power = power)$n
n
## [1] 387.1677
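To show where this number comes from, the calculation can be sketched in base R without the pwr package. This assumes the usual normal-approximation formula for a two-sided, two-sample test and the arcsine definition of Cohen's h. The names `p1`, `p2`, `z_alpha`, `z_power` and `n_per_group` are illustrative and do not appear in the code above.

```r
# Cohen's h for two proportions (arcsine transformation), the quantity ES.h() returns
p1 <- 0.5
p2 <- 0.6
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

alpha <- 0.05
power <- 0.80

# Normal quantiles for the two-sided significance level and the target power
z_alpha <- qnorm(1 - alpha / 2)
z_power <- qnorm(power)

# Approximate sample size per group for a two-sample test of proportions
n_per_group <- 2 * ((z_alpha + z_power) / h)^2
n_per_group
```

The result should land very close to the 387.1677 reported by pwr.2p.test(), so roughly 388 students per group are needed. Note that this is a two-sample calculation. For a one-sample test against a fixed value of 0.5, the pwr package also offers pwr.p.test(), and the same approximation without the factor of 2 gives about half that number.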
# Drop the Enrolled students, then tabulate Dropout vs. Graduate counts
dataset_2 <- filter(dataset_1, Target != "Enrolled")
contingency_table <- table(dataset_2$Target)
contingency_table
##
## Dropout Graduate
## 1421 2209
prop_test <- prop.test(contingency_table)
prop_test
##
## 1-sample proportions test with continuity correction
##
## data: contingency_table, null probability 0.5
## X-squared = 170.63, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3755686 0.4075829
## sample estimates:
## p
## 0.3914601
The p-value (< 2.2e-16) is far below the chosen alpha (0.05); therefore, the Null Hypothesis is rejected. The value 0.3914601 is the observed proportion of dropouts, not the p-value.
Summarizing the results in short:
Test Type: One-sample proportions test with continuity correction.
Test Statistic: Chi-squared value is 170.63 with 1 degree of freedom.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true proportion of dropouts is estimated to be between approximately 0.3756 and 0.4076, which excludes 0.5.
Sample Estimate: The observed proportion of dropouts in the data is approximately 0.3915 (1421 of 3630).
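The test statistic in this summary can be reproduced by hand from the two counts in the contingency table. The sketch below assumes the standard chi-squared formula with Yates' continuity correction, which is what a one-sample prop.test() applies by default. The variable names are illustrative.

```r
# Counts taken from the contingency table above
dropout <- 1421
graduate <- 2209
n <- dropout + graduate
p0 <- 0.5 # proportion under the null hypothesis

observed <- c(dropout, graduate)
expected <- c(n * p0, n * (1 - p0))

# Yates' continuity correction subtracts 0.5 from each absolute deviation
chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail probability of the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(chi_sq, 2) # compare with X-squared = 170.63
p_value          # compare with p-value < 2.2e-16
dropout / n      # compare with the sample estimate 0.3914601
```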
# Create a bar plot of the share of dropouts and graduates
ggplot(dataset_2, aes(x = Target)) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
  labs(title = "Proportion of Graduates vs. Dropouts",
       y = "Proportion",
       x = "Target") +
  scale_y_continuous(labels = scales::percent)
The test result and bar plot give strong evidence that the true proportion of dropouts differs from 0.5, so the Null Hypothesis is rejected in favor of the Alternative Hypothesis.
Null Hypothesis: There is no difference between the average semester grades of Graduate and Enrolled students.
Alternative Hypothesis: The average semester grades of Graduate and Enrolled students are different.
# Filter data for graduates and enrolled students
graduates <- dataset_1 |> filter(Target == "Graduate")
enrolled <- dataset_1 |> filter(Target == "Enrolled")
t_test_result <- t.test(graduates$sem_results, enrolled$sem_results)
t_test_result
##
## Welch Two Sample t-test
##
## data: graduates$sem_results and enrolled$sem_results
## t = 11.611, df = 1153.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.287354 1.810905
## sample estimates:
## mean of x mean of y
## 12.67056 11.12143
if (t_test_result$p.value < 0.05) {
  print("Reject H0: The average semester grades of Graduate and Enrolled students are significantly different.")
} else {
  print("Fail to reject H0: No significant difference was found between the average semester grades of Graduate and Enrolled students.")
}
## [1] "Reject H0: The average semester grades of Graduate and Enrolled students are significantly different."
Summarizing the results in short:
Test Type: Welch Two Sample t-test comparing the means of two groups (graduates and enrolled).
Test Statistic: t = 11.611 with degrees of freedom (df) ≈ 1153.5.
P-value: Less than 2.2e-16, indicating strong evidence against the null hypothesis.
95% Confidence Interval: The true difference in means is estimated to be between approximately 1.2874 and 1.8109.
Sample Estimates:
Mean of graduates (x): 12.67056
Mean of enrolled (y): 11.12143
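As a consistency check on this summary, the confidence interval can be rebuilt from the reported means, t statistic and degrees of freedom. This is a sketch that uses only the rounded values printed above, so small rounding differences are expected. The variable names are illustrative.

```r
# Values copied from the Welch t-test output above
mean_grad <- 12.67056
mean_enrolled <- 11.12143
t_stat <- 11.611
df <- 1153.5

# The t statistic is the difference in means divided by its standard error
diff_means <- mean_grad - mean_enrolled
std_error <- diff_means / t_stat

# A 95% interval is the difference plus or minus the t quantile times the standard error
half_width <- qt(0.975, df) * std_error
ci <- c(diff_means - half_width, diff_means + half_width)
round(ci, 3) # compare with the reported interval 1.287354 to 1.810905
```

The interval lies entirely above zero, which matches the small p-value: graduates average about 1.55 grade points more than enrolled students.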
# Create a box plot of sem_results for each Target group
ggplot(dataset_1, aes(x = Target, y = sem_results)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Comparison of sem_results by Target group",
       y = "sem_results",
       x = "Target")
The test rejects the Null Hypothesis: the average semester grade of Graduate students (12.67) is significantly higher than that of Enrolled students (11.12), with an estimated difference of about 1.55 grade points (95% confidence interval 1.29 to 1.81).
From the box plot, the Enrolled students' grades appear closer to those of the Graduates than to those of the Dropouts. This is a good signal that the enrolled students are progressing in the right direction.