For research and hypothesis testing, we use the “Student Behavior”
dataset (https://www.kaggle.com/datasets/gunapro/student-behavior).
We chose it because it offers enough data for analysis and the topic
interested us.
The dataset contains 19 columns, including gender, grades, hobbies,
daily studying time, travel time, and salary expectation. Using these
data, we can formulate hypotheses and test them.
Our goal is to analyze how college grades depend on student behavior.
To do this, we explore the dataset, formulate hypotheses, test them,
and draw conclusions from the results using knowledge from the P&S
course.
We focus on the columns Gender, 12th Mark, College Mark, and Daily Studying Time.
library(readr)
library(ggplot2)
data <- read_csv("./Student_Behaviour.csv")
## Rows: 235 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Certification Course, Gender, Department, hobbies, daily studing t...
## dbl (6): Height(CM), Weight(KG), 10th Mark, 12th Mark, college mark, salary...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
## # A tibble: 235 × 19
## `Certification Course` Gender Department `Height(CM)` `Weight(KG)`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 No Male BCA 100 58
## 2 No Female BCA 90 40
## 3 Yes Male BCA 159 78
## 4 Yes Female BCA 147 20
## 5 No Male BCA 170 54
## 6 Yes Female BCA 139 33
## 7 Yes Male BCA 165 50
## 8 No Male BCA 152 43
## 9 No Male BCA 190 85
## 10 No Male BCA 150 84
## # ℹ 225 more rows
## # ℹ 14 more variables: `10th Mark` <dbl>, `12th Mark` <dbl>,
## # `college mark` <dbl>, hobbies <chr>, `daily studing time` <chr>,
## # `prefer to study in` <chr>, `salary expectation` <dbl>,
## # `Do you like your degree?` <chr>,
## # `willingness to pursue a career based on their degree` <chr>,
## # `social medai & video` <chr>, `Travelling Time` <chr>, …
# Select the columns that we will use to test our hypotheses
data$`12th Mark` <- as.numeric(data$`12th Mark`)
data$`college mark` <- as.numeric(data$`college mark`)
data_subset <- data[, c('Gender', '12th Mark', 'daily studing time', 'college mark')]
data_subset
## # A tibble: 235 × 4
## Gender `12th Mark` `daily studing time` `college mark`
## <chr> <dbl> <chr> <dbl>
## 1 Male 65 0 - 30 minute 80
## 2 Female 80 30 - 60 minute 70
## 3 Male 61 1 - 2 Hour 55
## 4 Female 59 1 - 2 Hour 58
## 5 Male 65 30 - 60 minute 30
## 6 Female 75 30 - 60 minute 70
## 7 Male 63 1 - 2 Hour 3
## 8 Male 61.7 1 - 2 Hour 75
## 9 Male 67.5 0 - 30 minute 60
## 10 Male 65 0 - 30 minute 70
## # ℹ 225 more rows
# Enhanced Density Plot of 'college mark' by Daily Study Time
ggplot(data_subset, aes(x=`college mark`, fill=`daily studing time`, color=`daily studing time`)) +
geom_density(alpha=0.5, linewidth=1.2) +
geom_rug(aes(color=`daily studing time`), alpha=0.5) +
scale_fill_brewer(palette="Set3") +
scale_color_brewer(palette="Set2") +
labs(title="Density Plot of College Marks by Daily Study Time",
x="College Mark",
y="Density",
fill="Daily Study Time",
color="Daily Study Time") +
theme_minimal() +
theme(legend.position="right")
mean_college_mark_0_30 <- mean(data$"college mark"[data$"daily studing time" == "0 - 30 minute"])
mean_college_mark_30_60 <- mean(data$"college mark"[data$"daily studing time" == "30 - 60 minute"])
mean_college_mark_1_2 <- mean(data$"college mark"[data$"daily studing time" == "1 - 2 Hour"])
mean_college_mark_2_3 <- mean(data$"college mark"[data$"daily studing time" == "2 - 3 hour"])
mean_college_mark_3_4 <- mean(data$"college mark"[data$"daily studing time" == "3 - 4 hour"])
mean_college_mark_more_4 <- mean(data$"college mark"[data$"daily studing time" == "More Than 4 hour"])
# Print results
cat("Mean of college marks relative to 0 - 30 minute daily study time = ",mean_college_mark_0_30 , "\n")
## Mean of college marks relative to 0 - 30 minute daily study time = 68.69043
cat("Mean of college marks relative to 30 - 60 minute daily study time = ",mean_college_mark_30_60 , "\n")
## Mean of college marks relative to 30 - 60 minute daily study time = 69.8642
cat("Mean of college marks relative to 1 - 2 hour daily study time = ",mean_college_mark_1_2 , "\n")
## Mean of college marks relative to 1 - 2 hour daily study time = 70.31754
cat("Mean of college marks relative to 2 - 3 hour daily study time = ",mean_college_mark_2_3 , "\n")
## Mean of college marks relative to 2 - 3 hour daily study time = 75.92917
cat("Mean of college marks relative to 3 - 4 hour daily study time = ",mean_college_mark_3_4 , "\n")
## Mean of college marks relative to 3 - 4 hour daily study time = 75.52
cat("Mean of college marks relative to more than 4 hour daily study time = ",mean_college_mark_more_4 , "\n")
## Mean of college marks relative to more than 4 hour daily study time = 67.75
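The six per-group means above can also be computed in a single call. The following is a minimal sketch on a small made-up data frame (the name `toy` and all values in it are illustrative assumptions, not taken from the dataset); `tapply()` splits the marks by study-time category and applies `mean()` to each group.

```r
# Toy data: hypothetical values, used only to illustrate the pattern.
toy <- data.frame(
  study_time = c("0 - 30 minute", "0 - 30 minute", "1 - 2 Hour", "1 - 2 Hour", "2 - 3 hour"),
  college_mark = c(60, 70, 65, 75, 80)
)

# Split college_mark by study_time and average each group.
group_means <- tapply(toy$college_mark, toy$study_time, mean, na.rm = TRUE)
print(group_means)

# With the real data the analogous call would be:
# tapply(data$`college mark`, data$`daily studing time`, mean, na.rm = TRUE)
```

Compared with one `mean()` call per category, this form picks up every category present in the column, so a misspelled label cannot silently produce an empty group.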
# Find some statistics for college marks
mean_mark <- mean(data$`college mark`, na.rm = TRUE)
median_mark <- median(data$`college mark`, na.rm = TRUE)
mode_mark <- as.numeric(names(sort(table(data$`college mark`), decreasing = TRUE)[1]))
skewness <- function(x) mean((x-mean(x))^3) / (mean((x-mean(x))^2))^(3/2)
# Plotting the histogram for 'college mark'
ggplot(data, aes(x=`college mark`)) +
geom_histogram(bins=30, fill="skyblue", color="black") +
geom_vline(aes(xintercept=mean_mark), color="red", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=median_mark), color="darkgreen", linetype="dashed", linewidth=1) +
geom_vline(aes(xintercept=mode_mark), color="blue", linetype="dashed", linewidth=1) +
labs(title="Histogram of College Marks", x="College Mark", y="Frequency") +
theme_minimal()
cat("Mean value = ",mean_mark , "\n")
## Mean value = 70.66055
cat ("Median value = ",median_mark, "\n")
## Median value = 70
cat ("Mode value = " ,mode_mark, "\n")
## Mode value = 70
cat("Skewness:", skewness(data$`college mark`), "\n")
## Skewness: -1.619283
Hypothesis I
H0: the duration of study time (whether 0-2 hours or
2 and more hours) does not lead to any significant difference in the
distribution of college marks.
H1: there is a significant difference in college marks
distribution between students who study for different durations (0-2
hours vs. 2 and more hours).
This question is natural and relevant, as it directly relates to student behavior and educational strategies. On the one hand, it is generally accepted that the time allocated to study matters; on the other hand, this opinion needs to be confirmed with statistical methods.
To test our hypotheses, we apply the one-sided two-sample Kolmogorov-Smirnov test (KS test). This test compares two samples to determine whether they come from the same distribution. We use the one-sided variant because the direction of interest is whether students who study less tend to have lower marks. The KS test is non-parametric, which makes it suitable when we cannot assume that the marks are normally distributed.
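To make the test statistic concrete, here is a minimal sketch that computes the one-sided statistic D^+ by hand from the two empirical CDFs and cross-checks it against `ks.test()`. The samples `x` and `y` are small hypothetical vectors, not values from the dataset.

```r
# Two small hypothetical samples (not from the dataset).
x <- c(55, 60, 62, 70, 71)
y <- c(65, 72, 75, 80, 85, 90)

# Empirical CDFs of both samples.
Fx <- ecdf(x)
Fy <- ecdf(y)

# Evaluate both ECDFs at every pooled sample point.
grid <- sort(unique(c(x, y)))

# One-sided statistic D^+ = max(Fx - Fy); two-sided D = max|Fx - Fy|.
d_plus <- max(Fx(grid) - Fy(grid))
d_two_sided <- max(abs(Fx(grid) - Fy(grid)))

# Cross-check against the built-in test with the same alternative.
ks_plus <- ks.test(x, y, alternative = "greater")
print(c(by_hand = d_plus, ks_test = unname(ks_plus$statistic)))
```

A large D^+ means the CDF of `x` rises well above the CDF of `y` somewhere, that is, `x` tends to take smaller values than `y`.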
data_group1 <- data$`college mark`[data$`daily studing time` == '0 - 30 minute' |
data$`daily studing time` == '30 - 60 minute' |
data$`daily studing time` == '1 - 2 Hour']
data_group2 <- data$`college mark`[data$`daily studing time` == '2 - 3 hour' |
data$`daily studing time` == '3 - 4 hour' |
data$`daily studing time` == 'More Than 4 hour']
ggplot(data = data.frame(data_group1), aes(x = data_group1)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightpink', color = 'black') +
labs(title = 'Histogram and Density of College Marks for 0 - 2 Hours Study Time',
x = 'College Mark',
y = 'Density') +
theme_minimal()
ggplot(data = data.frame(data_group2), aes(x = data_group2)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightgreen', color = 'black') +
labs(title = 'Histogram and Density of College Marks for 2 and more Hours Study Time',
x = 'College Mark',
y = 'Density') +
theme_minimal()
ks_test_result <- ks.test(data_group1, data_group2, alternative = "greater")
ks_test_result
##
## Exact two-sample Kolmogorov-Smirnov test
##
## data: data_group1 and data_group2
## D^+ = 0.20745, p-value = 0.02001
## alternative hypothesis: the CDF of x lies above that of y
In the two-sample case, alternative = “greater” covers distributions
for which x is stochastically smaller than y (the CDF of x lies above,
and hence to the left of, that of y). The Kolmogorov-Smirnov test gives
a p-value of 0.02001, which is below the significance level of 0.05.
Under H0, a gap between the empirical CDFs at least as large as the
observed D^+ = 0.20745 would occur only about 2% of the time.
So we have sufficient evidence to reject the null hypothesis (H0) at
the 5% level in favor of the alternative: the distribution of college
marks for students who study up to 2 hours a day (data_group1) is
shifted toward lower values compared with students who study 2 hours or
more (data_group2).
This indicates that daily study time is associated with college marks.
Since the data are observational, the test alone does not show that
studying longer causes higher marks.
Let’s present the data we’ll use for comparison.
# Basic statistical summary for columns '12th Mark' and 'college mark'
data <- data_subset
stats_summary <- summary(data[, c('12th Mark', 'college mark')])
stats_summary
## 12th Mark college mark
## Min. :45.00 Min. : 1.00
## 1st Qu.:60.00 1st Qu.: 60.00
## Median :69.00 Median : 70.00
## Mean :68.78 Mean : 70.66
## 3rd Qu.:76.00 3rd Qu.: 80.00
## Max. :94.00 Max. :100.00
# Create a data frame of 12th marks for the entire dataset
data_df <- data.frame(
marks_12 = data$`12th Mark`,
Gender = "Entire Dataset" # Add a label for the entire dataset
)
# Create a data frame for male students
male_df <- data.frame(
marks_12 = data$`12th Mark`[data$Gender == "Male"],
Gender = "Male"
)
# Create a data frame for female students
female_df <- data.frame(
marks_12 = data$`12th Mark`[data$Gender == "Female"],
Gender = "Female"
)
# List of data frames
data_frames_list <- list(data_df, male_df, female_df)
# Function to create density plot
create_density_plot <- function(data_frame, title) {
ggplot(data_frame, aes(x = marks_12, fill = Gender, color = Gender)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.7, position = "identity") +
geom_density(alpha = 0.5, position = "identity") +
labs(
title = title,
x = "12th Mark",
y = "Density",
fill = "Gender",
color = "Gender"
) +
theme_minimal()
}
# Plot density histograms for each data frame
for (i in seq_along(data_frames_list)) {
current_data_frame <- data_frames_list[[i]]
# Gender is a character column here, so use unique() rather than levels() to get the label
current_title <- paste("Combined Density Histogram of 12th Marks -", unique(current_data_frame$Gender))
current_plot <- create_density_plot(current_data_frame, current_title)
print(current_plot)
}
Here are additional functions for calculating the mean, median, mode, and skewness.
# Function to compute various measures of a dataset
measure <- function(data_set) {
mean_mark <- mean(data_set, na.rm = TRUE)
median_mark <- median(data_set, na.rm = TRUE)
tab <- table(data_set)
mode_mark <- as.numeric(names(tab[tab == max(tab)]))
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^(3/2)
# Return a list of computed measures
return(list(mean_mark = mean_mark, median_mark = median_mark, mode_mark = mode_mark, skewness = skewness(data_set)))
}
# Function to print the computed measures
ploted <- function(measure) {
cat("Mean value = ", measure$mean_mark, "\n")
cat("Median value = ", measure$median_mark, "\n")
cat("Mode value = ", measure$mode_mark, "\n")
cat("Skewness:", measure$skewness, "\n")
cat("\n")
}
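For reference, the `skewness` helper inside `measure` computes the moment-based sample skewness. Written out (the symbols \(g_1\), \(m_2\), and \(m_3\) are notation introduced here, not names from the code):

```latex
g_1 = \frac{m_3}{m_2^{3/2}}, \qquad
m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k .
```

A negative \(g_1\) indicates a longer left tail and a positive \(g_1\) a longer right tail. Note that the helper calls `mean()` without `na.rm = TRUE`, so it assumes the input has no missing values.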
Now we print the values.
# Print header for 12th marks
cat("12th marks\n\n")
## 12th marks
cat("For all dataset\n")
## For all dataset
ploted(measure(data$`12th Mark`))
## Mean value = 68.78013
## Median value = 69
## Mode value = 60 70
## Skewness: 0.06783603
# Print statistics for male students
cat("For male\n")
## For male
ploted(measure(data$`12th Mark`[data$Gender == "Male"]))
## Mean value = 67.29378
## Median value = 67
## Mode value = 60
## Skewness: 0.2074527
# Print statistics for female students
cat("For female\n")
## For female
ploted(measure(data$`12th Mark`[data$Gender == "Female"]))
## Mean value = 71.71519
## Median value = 74
## Mode value = 75
## Skewness: -0.3037386
Upon visualizing these values alongside the histogram, a notable alignment is observed, affirming the consistency between our computed statistics and the graphical representation of the dataset.
# Create a data frame for College marks
data_df <- data.frame(
college_mark = data$`college mark`,
Gender = "Entire Dataset" # Add a label for the entire dataset
)
# Create a data frame for male students
male_df <- data.frame(
college_mark = data$`college mark`[data$Gender == "Male"],
Gender = "Male"
)
# Create a data frame for female students
female_df <- data.frame(
college_mark = data$`college mark`[data$Gender == "Female"],
Gender = "Female"
)
# Combine data frames
combined_data <- rbind(data_df, male_df, female_df)
# List of data frames
data_frames_list <- list(data_df, male_df, female_df)
# Function to create density plot
create_density_plot <- function(data_frame, title) {
ggplot(data_frame, aes(x = college_mark, fill = Gender, color = Gender)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.7, position = "identity") +
geom_density(alpha = 0.5, position = "identity") +
labs(
title = title,
x = "College Mark",
y = "Density",
fill = "Gender",
color = "Gender"
) +
theme_minimal()
}
# Plot density histograms for each data frame
for (i in seq_along(data_frames_list)) {
current_data_frame <- data_frames_list[[i]]
# Gender is a character column here, so use unique() rather than levels() to get the label
current_title <- paste("Combined Density Histogram of College Marks -", unique(current_data_frame$Gender))
current_plot <- create_density_plot(current_data_frame, current_title)
print(current_plot)
}
Now we print the values.
cat("College marks\n\n")
## College marks
# Print statistics for the entire dataset
cat("For all dataset\n")
## For all dataset
ploted(measure(data$`college mark`))
## Mean value = 70.66055
## Median value = 70
## Mode value = 70
## Skewness: -1.619283
# Print statistics for male students
cat("For male\n")
## For male
ploted(measure(data$"college mark"[data$"Gender" == "Male"]))
## Mean value = 67.51558
## Median value = 70
## Mode value = 70
## Skewness: -1.690257
# Print statistics for female students
cat("For female\n")
## For female
ploted(measure(data$"college mark"[data$"Gender" == "Female"]))
## Mean value = 76.87089
## Median value = 80
## Mode value = 70 80 85
## Skewness: -1.632203
We can see that all the printed values agree with the histograms.
Next, we define functions that calculate the measures and plot them on a histogram.
# Function to calculate measures
measure <- function(data_set) {
mean_mark <- mean(data_set, na.rm = TRUE)
median_mark <- median(data_set, na.rm = TRUE)
tab <- table(data_set)
mode_mark <- as.numeric(names(tab[tab == max(tab)]))
skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^(3/2)
return(list(mean_mark = mean_mark, median_mark = median_mark, mode_mark = mode_mark, skewness = skewness(data_set)))
}
# Function to plot histogram
histo <- function(data_set, description, measurement) {
ggplot(data_set, aes(x = `12th Mark`)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
geom_vline(aes(xintercept = measurement$mean_mark), color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = measurement$median_mark), color = "darkgreen", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = measurement$mode_mark[1]), color = "blue", linetype = "dashed", linewidth = 1) +
labs(title = description, x = "12th Mark", y = "Frequency") +
theme_minimal()
}
Now call our functions to plot histograms.
# Plot histogram for all students
all_measures <- measure(data$`12th Mark`)
histo(data, "Histogram of 12th Marks for All Students", all_measures)
# Plot histogram for male students
male_data <- data[data$Gender == "Male", ]
male_measures <- measure(male_data$`12th Mark`)
histo(male_data, "Histogram of 12th Marks for Male Students", male_measures)
# Plot histogram for female students
female_data <- data[data$Gender == "Female", ]
female_measures <- measure(female_data$`12th Mark`)
histo(female_data, "Histogram of 12th Marks for Female Students", female_measures)
The function that we will use to test the hypothesis:
data_compare <- function(data_set, data_college, description) {
# Linear regression: data_set (12th marks) regressed on data_college (college marks)
regression_model <- lm(data_set ~ data_college)
# Summary of the regression
cat("Summary of", description, "\n")
print(summary(regression_model))
# Plot with the predictor on the x-axis so that the fitted line matches the model
plot(data_college, data_set, pch = 16, col = "blue", main = description,
xlab = "College mark", ylab = "12th mark")
abline(regression_model, col = "red")
}
Hypothesis II
H0: There is no statistically significant correlation
between the marks obtained in the 12th grade and the marks obtained in
college (the regression slope is zero).
H1: There is a statistically significant correlation
between the marks obtained in the 12th grade and the marks obtained in
college (the regression slope is non-zero).
We investigate whether college grades are related to how well students did in school. We use the college marks and the 12th-grade marks, and we compare the relationship separately for male students, female students, and all students together.
Background: This hypothesis is based on the common assumption that academic performance in high school (12th grade) may have a direct impact on academic performance in college. The idea is that students who perform well in high school are likely to continue performing well in college, and vice versa.
Educational Continuity: The hypothesis is grounded in the concept of educational continuity, where the academic abilities and habits developed in high school are expected to persist into the college years. To test this empirically, we state it as a null hypothesis (H0) and an alternative hypothesis (H1).
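In a simple linear regression with one predictor, the t-test on the slope is equivalent to the test of zero Pearson correlation, which is why a regression summary can be used to test for correlation. The sketch below illustrates this on synthetic data; the variables `school_mark` and `college_mark` and all numbers used to generate them are assumptions for illustration only.

```r
# Synthetic data with a built-in positive linear relationship (hypothetical).
set.seed(1)
school_mark <- rnorm(100, mean = 70, sd = 10)
college_mark <- 20 + 0.7 * school_mark + rnorm(100, sd = 8)

# Slope test from a simple linear regression.
fit <- lm(college_mark ~ school_mark)
p_slope <- summary(fit)$coefficients["school_mark", "Pr(>|t|)"]

# Pearson correlation test on the same two variables.
p_cor <- cor.test(school_mark, college_mark)$p.value

# The two p-values coincide, and R-squared equals the squared correlation.
r_squared <- summary(fit)$r.squared
r_from_cor <- cor(school_mark, college_mark)
print(c(p_slope = p_slope, p_cor = p_cor))
print(c(r_squared = r_squared, r_from_cor_squared = r_from_cor^2))
```

Because the correlation is symmetric in its two arguments, the p-value is the same whichever variable is treated as the response.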
data_12_all <- data$`12th Mark`
data_college_all <- data$`college mark`
data_12_male <- data$`12th Mark`[data$Gender == "Male"]
data_college_male <- data$`college mark`[data$Gender == "Male"]
data_12_female <- data$`12th Mark`[data$Gender == "Female"]
data_college_female <- data$`college mark`[data$Gender == "Female"]
data_compare(data_12_all, data_college_all, "Linear Regression Example")
## Summary of Linear Regression Example
##
## Call:
## lm(formula = data_set ~ data_college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5502 -7.1056 -0.2502 7.1610 23.9637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.73856 3.00646 15.879 < 2e-16 ***
## data_college 0.29778 0.04154 7.169 9.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.993 on 233 degrees of freedom
## Multiple R-squared: 0.1807, Adjusted R-squared: 0.1772
## F-statistic: 51.4 on 1 and 233 DF, p-value: 9.898e-12
data_compare(data_12_male, data_college_male, "Linear Regression Example for Male")
## Summary of Linear Regression Example for Male
##
## Call:
## lm(formula = data_set ~ data_college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8475 -7.0192 -0.2903 6.7797 21.3814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.24525 3.35251 15.584 < 2e-16 ***
## data_college 0.22289 0.04832 4.613 8.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.636 on 154 degrees of freedom
## Multiple R-squared: 0.1214, Adjusted R-squared: 0.1157
## F-statistic: 21.28 on 1 and 154 DF, p-value: 8.313e-06
data_compare(data_12_female, data_college_female, "Linear Regression Example for Female")
## Summary of Linear Regression Example for Female
##
## Call:
## lm(formula = data_set ~ data_college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.9429 -6.6307 -0.1501 7.2643 28.0321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.46511 6.94128 5.253 1.29e-06 ***
## data_college 0.45856 0.08902 5.151 1.94e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 77 degrees of freedom
## Multiple R-squared: 0.2563, Adjusted R-squared: 0.2466
## F-statistic: 26.54 on 1 and 77 DF, p-value: 1.936e-06
1. Linear Regression Example (Entire Dataset):
The regression equation is given by \(y=47.74+0.30x\), where \(x\) is the college mark and \(y\) is the 12th mark.
The p-value is \(9.898 \times 10^{-12}\), which is less than the significance level of 0.05.
Conclusion: We reject the null hypothesis that the slope is zero (no linear relationship). There is a statistically significant positive correlation between the marks obtained in the 12th grade and the marks obtained in college for the entire dataset.
2. Linear Regression Example for Male Students and Female Students:
The regression equations are given by \(y=52.25+0.22x\) and \(y=36.47+0.46x\) respectively.
The p-values are \(8.313 \times 10^{-6}\) and \(1.936 \times 10^{-6}\) respectively, both less than the significance level of 0.05.
Conclusion: As in the first test on all data, we reject the null hypothesis of no linear relationship for both groups.
The results from the linear regression analyses for the entire dataset, as well as for male and female students separately, consistently indicate a statistically significant positive correlation between the marks obtained in the 12th grade and the marks obtained in college. The p-values in all cases are well below the chosen significance level of 0.05, providing strong evidence against a zero slope in each scenario. The relationship is modest, however: R-squared is 0.18 for the entire dataset, 0.12 for male students, and 0.26 for female students, so the two sets of marks share only part of their variation. Based on the available data, we conclude that 12th-grade marks and college marks are positively associated, and that in this sample the association is stronger among female students than among male students.
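Since R-squared in a simple linear regression equals the squared Pearson correlation, the strength of each linear association can be recovered from the summaries printed above. A minimal sketch, with the R-squared values copied by hand from those outputs (the vector names are illustrative):

```r
# R-squared values copied from the three regression summaries above.
r_squared <- c(all = 0.1807, male = 0.1214, female = 0.2563)

# All three estimated slopes are positive, so r is the positive square root.
r <- sqrt(r_squared)
print(round(r, 3))
```

This gives correlations of roughly 0.43 for the entire dataset, 0.35 for male students, and 0.51 for female students.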