1. Is there a significant difference in admission grades between students attending daytime and evening classes?
2. How do the unemployment rate, inflation rate, and GDP correlate with students’ enrollment status?
Null Hypothesis (H0): There is no significant difference in the admission grades between students attending daytime and evening classes.
Alternative Hypothesis (H1): There is a significant difference in the admission grades between students attending daytime and evening classes.
Null Hypothesis (H0): There is no correlation between the economic factors (unemployment rate, inflation rate, GDP) and students’ enrollment status.
Alternative Hypothesis (H1): There is a significant correlation between the economic factors (unemployment rate, inflation rate, GDP) and students’ enrollment status.
Admission Grades and Daytime/Evening Classes:
Alpha Level (α): 0.05 (standard significance level). Power Level (1 - β): 0.8 (standard power level). Minimum Effect Size: Cohen’s d of 0.3 (between Cohen’s conventional benchmarks of 0.2 for a small effect and 0.5 for a medium effect), chosen for practical significance.
Economic Factors and Enrollment Status:
Alpha Level (α): 0.05. Power Level (1 - β): 0.8. Minimum Effect Size: Pearson correlation coefficient (r) of 0.2 (considered a small positive correlation).
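As a rough illustration of what these settings imply, the sample size needed to detect a correlation of r = 0.2 can be approximated with the Fisher z-transformation. This is a minimal sketch under the stated alpha and power, assuming a two-sided test of a Pearson correlation; the variable names are illustrative and not part of the original analysis.

```r
# Approximate sample size for detecting a Pearson correlation,
# using the Fisher z-transformation (two-sided test)
alpha <- 0.05
power <- 0.80
r_min <- 0.2  # minimum correlation of interest

z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)

# n = ((z_alpha + z_beta) / atanh(r))^2 + 3
n_corr <- ceiling(((z_alpha + z_beta) / atanh(r_min))^2 + 3)
cat("Approximate sample size for r =", r_min, ":", n_corr, "\n")
```

Under this approximation, roughly 194 observations are needed.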
Explanation:
Alpha Level: The significance level, i.e. the probability of rejecting a true null hypothesis (a Type I error). Commonly set to 0.05.
Power Level: The probability of correctly rejecting a false null hypothesis. Commonly set to 0.8.
Effect Size: A measure of the strength of a phenomenon. The minimum effect size is chosen so that any effect the test is powered to detect is also practically significant.
These are example values and can be adjusted based on the context of the study and the specific characteristics of the dataset.
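To make the effect size definition concrete, Cohen's d for two independent groups can be computed as the difference in means divided by the pooled standard deviation. The helper `cohens_d` and the toy vectors below are hypothetical and serve only as an illustration; they are not taken from the dataset.

```r
# Cohen's d for two independent samples, using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy data for illustration only
group_a <- c(2, 4, 6)
group_b <- c(1, 3, 5)
d <- cohens_d(group_a, group_b)
cat("Cohen's d:", d, "\n")
```

For these toy vectors the means differ by 1 and the pooled standard deviation is 2, so d is 0.5.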
df <- read.csv('./Downloads/students_dropout_and_academic_success.csv')
alpha <- 0.05
power <- 0.80
# Assuming a two-sample t-test
# Minimum difference of interest. The formula below scales it against the
# pooled SD, so it acts as a raw difference in grade points rather than a
# standardized Cohen's d
effect_size <- 0.5
# Calculate means and standard deviations
# (exact column name, including the trailing dot, to avoid partial matching)
is_daytime <- df$Daytime_evening_attendance. == 1
mean_daytime <- mean(df$Admission_grade[is_daytime])
mean_evening <- mean(df$Admission_grade[!is_daytime])
sd_daytime <- sd(df$Admission_grade[is_daytime])
sd_evening <- sd(df$Admission_grade[!is_daytime])
# Calculating the minimum required sample size (normal approximation):
# n = (z_{1 - alpha/2} + z_{power})^2 * sd^2 / difference^2
# This is the one-group form; a two-group comparison needs twice this figure
pooled_sd <- sqrt((sd_daytime^2 + sd_evening^2) / 2)
sample_size <- (qnorm(1 - alpha/2) + qnorm(power))^2 * (pooled_sd^2) / effect_size^2
# Rounding up to the nearest integer
sample_size <- ceiling(sample_size)
cat("Minimum Sample Size:", sample_size, "\n")
## Minimum Sample Size: 7560
The dataset contains only 4424 rows, while the suggested minimum sample size for the Neyman-Pearson hypothesis test is 15120 (twice the 7560 printed above, to cover the two groups being compared), so I do not have enough data to perform the test at the desired power. With a smaller sample, I might fail to detect a significant effect even if it truly exists. For the second question we are dealing with correlation, so a correlation test is needed; Neyman-Pearson hypothesis testing, which is commonly used for hypotheses about population means, may not be the most suitable framework there.
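The closed-form calculation can be cross-checked against `power.t.test` from base R's `stats` package, which solves the same design with the noncentral t distribution. The sketch below uses an assumed pooled standard deviation of 15 purely for illustration; it is not the value estimated from the dataset.

```r
# Hypothetical inputs for illustration only (not taken from the dataset)
alpha <- 0.05
power <- 0.80
delta <- 0.5      # raw difference in grade points to detect
pooled_sd <- 15   # assumed pooled standard deviation

# Normal approximation, per group, for two independent groups
n_normal <- ceiling(2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 *
                      pooled_sd^2 / delta^2)

# Same design solved via the noncentral t distribution
res <- power.t.test(delta = delta, sd = pooled_sd, sig.level = alpha,
                    power = power, type = "two.sample",
                    alternative = "two.sided")
n_t <- ceiling(res$n)

cat("Normal approximation (per group):", n_normal, "\n")
cat("power.t.test (per group):", n_t, "\n")
```

The two figures should agree to within a few observations at sample sizes this large, which makes the comparison a quick sanity check on the hand-written formula.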
# Subset the data
daytime_grades <- df$Admission_grade[df$Daytime_evening_attendance. == 1]
evening_grades <- df$Admission_grade[df$Daytime_evening_attendance. == 0]
# Perform an F test comparing the two variances (var.test)
fisher_test_result <- var.test(daytime_grades, evening_grades)
print(fisher_test_result)
##
## F test to compare two variances
##
## data: daytime_grades and evening_grades
## F = 0.71705, num df = 3940, denom df = 482, p-value = 3.283e-07
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6249765 0.8168656
## sample estimates:
## ratio of variances
## 0.7170487
The p-value is very small (well below the typical significance level of 0.05), so the null hypothesis of equal variances is rejected. The alternative hypothesis, in this case, is that the true ratio of variances is not equal to 1. The 95 percent confidence interval for the ratio of variances (0.625 to 0.817) does not include 1, further supporting the evidence against equal variances.
Based on this F test, there is statistical evidence that the variances of admission grades differ between students attending daytime and evening classes. This result concerns the spread of the grades only; it does not show that the means differ. A difference in means has to be tested directly, for example with Welch's t-test, which does not assume equal variances.
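Because the variances differ, one reasonable way to compare the group means is Welch's t-test, which `t.test` runs when `var.equal = FALSE`. The sketch below uses simulated grades with made-up means and standard deviations, so its output says nothing about the real data; on the actual dataset the two vectors would be `daytime_grades` and `evening_grades`.

```r
# Simulated admission grades for illustration only
set.seed(42)
daytime_sim <- rnorm(400, mean = 127, sd = 14)
evening_sim <- rnorm(60, mean = 125, sd = 17)

# Welch's two-sample t-test (does not assume equal variances)
welch <- t.test(daytime_sim, evening_sim, var.equal = FALSE)
print(welch)
```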
# Assuming df is your data frame
# Assuming Target is a categorical variable with more than two levels
# Fit ANOVA model
model <- aov(Unemployment_rate ~ Target, data = df)
# Conduct the F-test
anova_result <- anova(model)
# Print the results
print(anova_result)
## Analysis of Variance Table
##
## Response: Unemployment_rate
## Df Sum Sq Mean Sq F value Pr(>F)
## Target 2 83.9 41.933 5.9225 0.0027 **
## Residuals 4421 31302.2 7.080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of the Analysis of Variance (ANOVA) table indicate that there is a significant difference in the mean Unemployment_rate across different levels of the Target variable (enrollment status). The p-value associated with the F-statistic is 0.0027, which is less than the commonly used significance level of 0.05.
Therefore, based on the p-value, you would reject the null hypothesis and conclude that there is a statistically significant difference in the mean Unemployment_rate among different enrollment statuses.
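The ANOVA F-test indicates only that at least one group mean differs, not which groups differ. A possible follow-up is Tukey's HSD on the fitted `aov` model, sketched here on simulated data. The level names "Dropout", "Enrolled" and "Graduate" and the simulated means are assumptions made for illustration.

```r
# Simulated data for illustration only; level names and means are assumed
set.seed(1)
sim <- data.frame(
  Target = factor(rep(c("Dropout", "Enrolled", "Graduate"), each = 100)),
  Unemployment_rate = c(rnorm(100, mean = 11.8, sd = 2.6),
                        rnorm(100, mean = 11.3, sd = 2.6),
                        rnorm(100, mean = 11.6, sd = 2.6))
)

# Fit the one-way ANOVA, then compare every pair of groups
sim_model <- aov(Unemployment_rate ~ Target, data = sim)
tukey <- TukeyHSD(sim_model)
print(tukey)
```

Each row of the result gives the estimated difference between two groups, its confidence interval, and a p-value adjusted for the multiple comparisons.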
# Boxplot for visualizing the distribution of unemployment rates by enrollment status
boxplot(df$Unemployment_rate ~ df$Target,
        main = "Unemployment Rate by Enrollment Status",
        ylab = "Unemployment Rate",
        xlab = "Enrollment Status")
library(ggplot2)
# Kernel density plot for visualizing the distribution of admission grades by attendance
ggplot(df, aes(x = Admission_grade, fill = as.factor(Daytime_evening_attendance.))) +
  geom_density(alpha = 0.5) +
  labs(title = "Kernel Density Plot - Admission Grades by Daytime/Evening Classes",
       y = "Density", x = "Admission Grade") +
  theme_minimal()