Dataset

For research and hypothesis testing, we use the dataset “Student Behavior” (https://www.kaggle.com/datasets/gunapro/student-behavior).

We chose it because it offers plenty of data for analysis and the topic itself interested us.

The dataset contains 19 columns, including gender, grades, hobbies, daily studying time, travel time, and salary expectation. These data let us formulate hypotheses and test them.

Project aim

The aim is to analyze how college grades depend on student behavior. To do this, we explore the dataset, formulate hypotheses, test them, and draw conclusions from the results using knowledge from the P&S course.

We focus on the columns Gender, 12th Mark, College Mark, and Daily Studying Time.

Data description

library(readr)
library(ggplot2)
data <- read_csv("./Student_Behaviour.csv")

## Rows: 235 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Certification Course, Gender, Department, hobbies, daily studing t...
## dbl  (6): Height(CM), Weight(KG), 10th Mark, 12th Mark, college mark, salary...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data

## # A tibble: 235 × 19
##    `Certification Course` Gender Department `Height(CM)` `Weight(KG)`
##    <chr>                  <chr>  <chr>             <dbl>        <dbl>
##  1 No                     Male   BCA                 100           58
##  2 No                     Female BCA                  90           40
##  3 Yes                    Male   BCA                 159           78
##  4 Yes                    Female BCA                 147           20
##  5 No                     Male   BCA                 170           54
##  6 Yes                    Female BCA                 139           33
##  7 Yes                    Male   BCA                 165           50
##  8 No                     Male   BCA                 152           43
##  9 No                     Male   BCA                 190           85
## 10 No                     Male   BCA                 150           84
## # ℹ 225 more rows
## # ℹ 14 more variables: `10th Mark` <dbl>, `12th Mark` <dbl>,
## #   `college mark` <dbl>, hobbies <chr>, `daily studing time` <chr>,
## #   `prefer to study in` <chr>, `salary expectation` <dbl>,
## #   `Do you like your degree?` <chr>,
## #   `willingness to pursue a career based on their degree` <chr>,
## #   `social medai & video` <chr>, `Travelling Time` <chr>, …

# Select the columns that we will use to test our hypotheses
data$`12th Mark` <- as.numeric(data$`12th Mark`)
data$`college mark` <- as.numeric(data$`college mark`)
data_subset <- data[, c('Gender', '12th Mark', 'daily studing time', 'college mark')]
data_subset

## # A tibble: 235 × 4
##    Gender `12th Mark` `daily studing time` `college mark`
##    <chr>        <dbl> <chr>                         <dbl>
##  1 Male          65   0 - 30 minute                    80
##  2 Female        80   30 - 60 minute                   70
##  3 Male          61   1 - 2 Hour                       55
##  4 Female        59   1 - 2 Hour                       58
##  5 Male          65   30 - 60 minute                   30
##  6 Female        75   30 - 60 minute                   70
##  7 Male          63   1 - 2 Hour                        3
##  8 Male          61.7 1 - 2 Hour                       75
##  9 Male          67.5 0 - 30 minute                    60
## 10 Male          65   0 - 30 minute                    70
## # ℹ 225 more rows
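
The study-time column is stored as text, so plots and tables order its categories alphabetically rather than by duration. A minimal sketch of one way to handle this, assuming the six category labels used later in this report and a hypothetical mini-sample (`study_time`, `time_levels` and `study_time_f` are illustrative names, not part of the original analysis):

```r
# Hypothetical mini-sample using category labels that appear in the dataset
study_time <- c("1 - 2 Hour", "0 - 30 minute", "More Than 4 hour", "30 - 60 minute")

# Assumed ordering of the categories, from shortest to longest study time
time_levels <- c("0 - 30 minute", "30 - 60 minute", "1 - 2 Hour",
                 "2 - 3 hour", "3 - 4 hour", "More Than 4 hour")

# An ordered factor keeps the categories in duration order when sorting or plotting
study_time_f <- factor(study_time, levels = time_levels, ordered = TRUE)
sort(study_time_f)
```

Applied to the real column, the same conversion would make the legend of a ggplot2 chart follow the duration order.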

# Enhanced Density Plot of 'college mark' by Daily Study Time
ggplot(data_subset, aes(x=`college mark`, fill=`daily studing time`, color=`daily studing time`)) +
  geom_density(alpha=0.5, linewidth=1.2) +  
  geom_rug(aes(color=`daily studing time`), alpha=0.5) +  
  scale_fill_brewer(palette="Set3") +  
  scale_color_brewer(palette="Set2") +
  labs(title="Density Plot of College Marks by Daily Study Time",
       x="College Mark",
       y="Density",
       fill="Daily Study Time",
       color="Daily Study Time") +
  theme_minimal() +
  theme(legend.position="right")

mean_college_mark_0_30 <- mean(data$"college mark"[data$"daily studing time" == "0 - 30 minute"])
mean_college_mark_30_60 <- mean(data$"college mark"[data$"daily studing time" == "30 - 60 minute"])
mean_college_mark_1_2 <- mean(data$"college mark"[data$"daily studing time" == "1 - 2 Hour"])
mean_college_mark_2_3 <- mean(data$"college mark"[data$"daily studing time" == "2 - 3 hour"])
mean_college_mark_3_4 <- mean(data$"college mark"[data$"daily studing time" == "3 - 4 hour"])
mean_college_mark_more_4 <- mean(data$"college mark"[data$"daily studing time" == "More Than 4 hour"])


# Print results
cat("Mean of college marks relative to 0 -30 minute daily study time = ",mean_college_mark_0_30 , "\n")

## Mean of college marks relative to 0 -30 minute daily study time =  68.69043

cat("Mean of college marks relative to 30 - 60 minute daily study time = ",mean_college_mark_30_60 , "\n")

## Mean of college marks relative to 30 - 60 minute daily study time =  69.8642

cat("Mean of college marks relative to 1 - 2 hour daily study time = ",mean_college_mark_1_2 , "\n")

## Mean of college marks relative to 1 - 2 hour daily study time =  70.31754

cat("Mean of college marks relative to 2 -3 hour daily study time = ",mean_college_mark_2_3 , "\n")

## Mean of college marks relative to 2 -3 hour daily study time =  75.92917

cat("Mean of college marks relative to 3 - 4 hour daily study time = ",mean_college_mark_3_4 , "\n")

## Mean of college marks relative to 3 - 4 hour daily study time =  75.52

cat("Mean of college marks relative to more than 4 hour daily study time = ",mean_college_mark_more_4 , "\n")

## Mean of college marks relative to more than 4 hour daily study time =  67.75
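
The six group means above can also be computed in one call. A minimal sketch with `tapply()` on a hypothetical toy data frame (the `toy` data and its column names are assumptions for illustration, not values from the dataset):

```r
# Hypothetical toy data: two study-time groups with two marks each
toy <- data.frame(
  study_time = c("0 - 30 minute", "0 - 30 minute", "1 - 2 Hour", "1 - 2 Hour"),
  college_mark = c(60, 70, 75, 85)
)

# Mean mark per study-time group; na.rm = TRUE is passed on to mean()
group_means <- tapply(toy$college_mark, toy$study_time, mean, na.rm = TRUE)
group_means
```

The result is a named vector with one mean per category, which avoids repeating the same subsetting expression for every group.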

# Find some statistics for college marks
mean_mark <- mean(data$`college mark`, na.rm = TRUE)
median_mark <- median(data$`college mark`, na.rm = TRUE)
mode_mark <- as.numeric(names(sort(table(data$`college mark`), decreasing = TRUE)[1]))
skewness <- function(x) mean((x-mean(x))^3) / (mean((x-mean(x))^2))^(3/2)


# Plotting the histogram for 'college mark'
ggplot(data, aes(x=`college mark`)) +
  geom_histogram(bins=30, fill="skyblue", color="black") +
  geom_vline(aes(xintercept=mean_mark), color="red", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=median_mark), color="darkgreen", linetype="dashed", linewidth=1) +
  geom_vline(aes(xintercept=mode_mark), color="blue", linetype="dashed", linewidth=1) +
  labs(title="Histogram of College Marks", x="College Mark", y="Frequency") +
  theme_minimal()

cat("Mean value = ",mean_mark , "\n")

## Mean value =  70.66055

cat ("Median value = ",median_mark, "\n")

## Median value =  70

cat ("Mode value = " ,mode_mark, "\n")

## Mode value =  70

cat("Skewness:", skewness(data$`college mark`), "\n")

## Skewness: -1.619283
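
The `skewness` helper defined above computes the moment-based sample skewness. Written out, with \(n\) observations \(x_i\) and sample mean \(\bar{x}\) (notation introduced here for illustration):

```latex
\[
g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}
           {\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}
\]
```

A negative value such as -1.619283 indicates a longer left tail, which is consistent with the few very low marks in the data (the minimum college mark is 1).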

Testing hypothesis 1

I hypotheses

H0: daily study time (0-2 hours versus 2 or more hours) makes no difference to the distribution of college marks.
H1: the distribution of college marks differs between students who study 0-2 hours and students who study 2 or more hours per day.

This question is natural and relevant, as it directly relates to student behavior and educational strategies. On the one hand, there is a generally accepted opinion about the importance of the time allocated to study; on the other, this opinion needs to be confirmed with statistical methods.

To test our hypotheses, we apply the one-sided two-sample Kolmogorov-Smirnov test (KS test). This test compares two samples to determine whether they come from the same distribution. We use the one-sided version because we expect students who study longer to have higher marks. The KS test is non-parametric, which makes it suitable when we cannot be sure that the marks are normally distributed.
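
To make the test statistic concrete, here is a minimal sketch that computes the one-sided statistic by hand and compares it with `ks.test()`. The samples `x` and `y` are hypothetical toy values without ties, not taken from the dataset:

```r
# Hypothetical toy samples without ties (not taken from the dataset)
x <- c(1, 2, 3, 4)
y <- c(3.5, 4.5, 5.5, 6.5)

# Evaluate both empirical CDFs on the pooled sample
pooled <- sort(c(x, y))

# D^+ is the largest amount by which the CDF of x exceeds the CDF of y
d_plus <- max(ecdf(x)(pooled) - ecdf(y)(pooled))

# The same statistic as reported by ks.test() with alternative = "greater"
ks_stat <- unname(ks.test(x, y, alternative = "greater")$statistic)
c(by_hand = d_plus, ks_test = ks_stat)
```

A large D^+ means that small values are more frequent in `x` than in `y`, which is the direction tested with `alternative = "greater"`.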

data_group1 <- data$`college mark`[data$`daily studing time` == '0 - 30 minute' | 
                                   data$`daily studing time` == '30 - 60 minute' | 
                                   data$`daily studing time` == '1 - 2 Hour']

data_group2 <- data$`college mark`[data$`daily studing time` == '2 - 3 hour' | 
                                   data$`daily studing time` == '3 - 4 hour' | 
                                   data$`daily studing time` == 'More Than 4 hour']

ggplot(data = data.frame(data_group1), aes(x = data_group1)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightpink', color = 'black') +
  labs(title = 'Histogram and Density of College Marks for 0 - 2 Hours Study Time', 
       x = 'College Mark', 
       y = 'Density') +
  theme_minimal()

ggplot(data = data.frame(data_group2), aes(x = data_group2)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = 'lightgreen', color = 'black') +
  labs(title = 'Histogram and Density of College Marks for 2 and more Hours Study Time', 
       x = 'College Mark', 
       y = 'Density') +
  theme_minimal()

ks_test_result <- ks.test(data_group1, data_group2, alternative = "greater")
ks_test_result

## 
##  Exact two-sample Kolmogorov-Smirnov test
## 
## data:  data_group1 and data_group2
## D^+ = 0.20745, p-value = 0.02001
## alternative hypothesis: the CDF of x lies above that of y

Conclusion on first hypothesis

In the two-sample case, alternative = “greater” means that the alternative hypothesis is that x is stochastically smaller than y (the CDF of x lies above, and hence to the left of, that of y). Here x is data_group1 (0-2 hours) and y is data_group2 (2 or more hours). The Kolmogorov-Smirnov test gives D^+ = 0.20745 with a p-value of 0.02001, which is below the significance level of 0.05.
So we have sufficient evidence to reject the null hypothesis (H0) that the two groups share the same distribution of college marks.
By rejecting the null hypothesis, we accept the alternative hypothesis (H1): the distributions of college marks differ between the two groups, and the direction of the test indicates that students who study less than 2 hours a day tend to have lower college marks than students who study 2 or more hours. Because the data are observational, this shows an association between daily study time and college marks rather than a proven causal effect.

Let’s present the data we’ll use for comparison.

# Basic statistical summary for columns '12th Mark' and 'college mark'
data <- data_subset
stats_summary <- summary(data[, c('12th Mark', 'college mark')])
stats_summary

##    12th Mark      college mark   
##  Min.   :45.00   Min.   :  1.00  
##  1st Qu.:60.00   1st Qu.: 60.00  
##  Median :69.00   Median : 70.00  
##  Mean   :68.78   Mean   : 70.66  
##  3rd Qu.:76.00   3rd Qu.: 80.00  
##  Max.   :94.00   Max.   :100.00

# Create a data frame of 12th marks for the entire dataset

data_df <- data.frame(
  marks_12 = data$`12th Mark`,
  Gender = "Entire Dataset"  # Add a label for the entire dataset
)

# Create a data frame for male students
male_df <- data.frame(
  marks_12 = data$`12th Mark`[data$Gender == "Male"],
  Gender = "Male"
)

# Create a data frame for female students
female_df <- data.frame(
  marks_12 = data$`12th Mark`[data$Gender == "Female"],
  Gender = "Female"
)


# List of data frames
data_frames_list <- list(data_df, male_df, female_df)

# Function to create density plot
create_density_plot <- function(data_frame, title) {
  ggplot(data_frame, aes(x = marks_12, fill = Gender, color = Gender)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.7, position = "identity") +
    geom_density(alpha = 0.5, position = "identity") +
    labs(
      title = title,
      x = "12th Mark",
      y = "Density",
      fill = "Gender",
      color = "Gender"
    ) +
    theme_minimal()
}

# Plot density histograms for each data frame
for (i in seq_along(data_frames_list)) {
  current_data_frame <- data_frames_list[[i]]
  # Gender is a character column, so use unique() rather than levels() to get the label
  current_title <- paste("Combined Density Histogram of 12th Marks -", unique(current_data_frame$Gender))
  current_plot <- create_density_plot(current_data_frame, current_title)
  print(current_plot)
}

Here are additional functions for calculating the mean, median, mode, and skewness.

# Function to compute various measures of a dataset
measure <- function(data_set) {
  mean_mark <- mean(data_set, na.rm = TRUE)
  median_mark <- median(data_set, na.rm = TRUE)
  tab <- table(data_set)
  mode_mark <- as.numeric(names(tab[tab == max(tab)]))
  skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^(3/2)
  
  # Return a list of computed measures
  return(list(mean_mark = mean_mark, median_mark = median_mark, mode_mark = mode_mark, skewness = skewness(data_set)))
}

# Function to print the computed measures
ploted <- function(measure) {
  cat("Mean value = ", measure$mean_mark, "\n")
  cat("Median value = ", measure$median_mark, "\n")
  cat("Mode value = ", measure$mode_mark, "\n")
  cat("Skewness:", measure$skewness, "\n")
  cat("\n")
}

Now print the values.

# Print header for 12th marks
cat("12th marks\n\n")

## 12th marks

cat("For all dataset\n")

## For all dataset

ploted(measure(data$`12th Mark`))

## Mean value =  68.78013 
## Median value =  69 
## Mode value =  60 70 
## Skewness: 0.06783603
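
The Mode line above lists two values because `measure()` returns every value that ties for the highest frequency. A minimal sketch of that step on a hypothetical vector (`marks` is illustrative, not taken from the dataset):

```r
# Hypothetical marks in which 60 and 70 both occur twice
marks <- c(60, 70, 70, 60, 80)

# Frequency table of the distinct values
tab <- table(marks)

# Keep every value whose count equals the maximum count
modes <- as.numeric(names(tab[tab == max(tab)]))
modes
```

When the histogram helper later draws a single mode line, it uses only the first of these tied values.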

# Print statistics for male students
cat("For male\n")

## For male

ploted(measure(data$`12th Mark`[data$Gender == "Male"]))

## Mean value =  67.29378 
## Median value =  67 
## Mode value =  60 
## Skewness: 0.2074527

# Print statistics for female students
cat("For female\n")

## For female

ploted(measure(data$`12th Mark`[data$Gender == "Female"]))

## Mean value =  71.71519 
## Median value =  74 
## Mode value =  75 
## Skewness: -0.3037386

Viewed alongside the histograms, these values line up with the shape of the distributions, which confirms that the computed statistics are consistent with the graphical representation of the dataset.

# Create a data frame for College marks
data_df <- data.frame(
  college_mark = data$`college mark`,
  Gender = "Entire Dataset"  # Add a label for the entire dataset
)

# Create a data frame for male students
male_df <- data.frame(
  college_mark = data$`college mark`[data$Gender == "Male"],
  Gender = "Male"
)

# Create a data frame for female students
female_df <- data.frame(
  college_mark = data$`college mark`[data$Gender == "Female"],
  Gender = "Female"
)

# Combine data frames
combined_data <- rbind(data_df, male_df, female_df)

# List of data frames
data_frames_list <- list(data_df, male_df, female_df)

# Function to create density plot
create_density_plot <- function(data_frame, title) {
  ggplot(data_frame, aes(x = college_mark, fill = Gender, color = Gender)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 5, alpha = 0.7, position = "identity") +
    geom_density(alpha = 0.5, position = "identity") +
    labs(
      title = title,
      x = "College Mark",
      y = "Density",
      fill = "Gender",
      color = "Gender"
    ) +
    theme_minimal()
}

# Plot density histograms for each data frame
for (i in seq_along(data_frames_list)) {
  current_data_frame <- data_frames_list[[i]]
  # Gender is a character column, so use unique() rather than levels() to get the label
  current_title <- paste("Combined Density Histogram of College Marks -", unique(current_data_frame$Gender))
  current_plot <- create_density_plot(current_data_frame, current_title)
  print(current_plot)
}

Now print the values.

cat("College marks\n\n")

## College marks

# Print statistics for the entire dataset
cat("For all dataset\n")

## For all dataset

ploted(measure(data$`college mark`))

## Mean value =  70.66055 
## Median value =  70 
## Mode value =  70 
## Skewness: -1.619283

# Print statistics for male students
cat("For male\n")

## For male

ploted(measure(data$"college mark"[data$"Gender" == "Male"]))

## Mean value =  67.51558 
## Median value =  70 
## Mode value =  70 
## Skewness: -1.690257

# Print statistics for female students
cat("For female\n")

## For female

ploted(measure(data$"college mark"[data$"Gender" == "Female"]))

## Mean value =  76.87089 
## Median value =  80 
## Mode value =  70 80 85 
## Skewness: -1.632203

We can see that all the printed values agree with the histograms.

Now we define functions that calculate the measures and plot them.

# Function to calculate measures
measure <- function(data_set) {
  mean_mark <- mean(data_set, na.rm = TRUE)
  median_mark <- median(data_set, na.rm = TRUE)
  tab <- table(data_set)
  mode_mark <- as.numeric(names(tab[tab == max(tab)]))
  skewness <- function(x) mean((x - mean(x))^3) / (mean((x - mean(x))^2))^(3/2)
  
  return(list(mean_mark = mean_mark, median_mark = median_mark, mode_mark = mode_mark, skewness = skewness(data_set)))
}


# Function to plot histogram
histo <- function(data_set, description, measurement) {
  ggplot(data_set, aes(x = `12th Mark`)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    geom_vline(aes(xintercept = measurement$mean_mark), color = "red", linetype = "dashed", linewidth = 1) +
    geom_vline(aes(xintercept = measurement$median_mark), color = "darkgreen", linetype = "dashed", linewidth = 1) +
    geom_vline(aes(xintercept = measurement$mode_mark[1]), color = "blue", linetype = "dashed", linewidth = 1) +
    labs(title = description, x = "12th Mark", y = "Frequency") +
    theme_minimal()
}

Now call our functions to plot histograms.

# Plot histogram for all students
all_measures <- measure(data$`12th Mark`)
histo(data, "Histogram of 12th Marks for All Students", all_measures)

# Plot histogram for male students
male_data <- data[data$Gender == "Male", ]
male_measures <- measure(male_data$`12th Mark`)
histo(male_data, "Histogram of 12th Marks for Male Students", male_measures)

# Plot histogram for female students
female_data <- data[data$Gender == "Female", ]
female_measures <- measure(female_data$`12th Mark`)
histo(female_data, "Histogram of 12th Marks for Female Students", female_measures)

Here is the function that we will use to test the hypothesis.

data_compare <- function(data_set, data_college, description) {

  # Linear regression: data_set (12th marks) regressed on data_college (college marks)
  regression_model <- lm(data_set ~ data_college)
  
  # Summary of the regression
  cat("Summary of", description, "\n")
  print(summary(regression_model))
  # Plot the data with the predictor on the x-axis so that the fitted line matches the model
  plot(data_college, data_set, pch = 16, col = "blue", main = description,
       xlab = "College mark", ylab = "12th mark")
  abline(regression_model, col = "red")
}

Testing hypothesis 2

II hypotheses

H0: There is no statistically significant correlation between the marks obtained in the 12th grade and the marks obtained in college.
H1: There is a statistically significant correlation between the marks obtained in the 12th grade and the marks obtained in college.

We investigate whether grades in college depend on how well students did in school. We will use the data on college marks and 12th marks. We also want to compare the correlation separately for male students, for female students, and for all students together.

Background: This hypothesis is based on the common assumption that academic performance in high school (12th grade) may have a direct impact on academic performance in college. The idea is that students who perform well in high school are likely to continue performing well in college, and vice versa.

Educational Continuity: The hypothesis is grounded in the concept of educational continuity, where one’s academic abilities and habits developed in high school are expected to persist into the college years. However, this hypothesis is formulated as a null hypothesis (H0) and an alternative hypothesis (H1) to allow for empirical testing.
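
For a regression with a single predictor, the test of the slope is equivalent to the test of the Pearson correlation, which is why a linear model can be used to test a hypothesis about correlation. A minimal sketch on simulated data (the seed, sample size and the coefficient 0.3 are arbitrary assumptions, not values from the dataset):

```r
set.seed(1)  # arbitrary seed so that the toy data are reproducible

# Simulated toy data with a weak positive linear relationship
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)

# p-value of the slope in a simple linear regression
p_slope <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]

# p-value of the Pearson correlation test (null hypothesis: correlation is zero)
p_cor <- cor.test(x, y)$p.value

all.equal(p_slope, p_cor)
```

The two p-values agree, and they stay the same whichever variable is treated as the response, so the direction of the regression does not affect the conclusion of the test.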

data_12_all = data$`12th Mark`
data_college_all = data$`college mark`

data_12_male = data$`12th Mark`[data$Gender == "Male"]
data_college_male = data$`college mark`[data$Gender == "Male"]

data_12_female = data$`12th Mark`[data$Gender == "Female"]
data_college_female = data$`college mark`[data$Gender == "Female"]

data_compare(data_12_all, data_college_all, "Linear Regression Example")

## Summary of Linear Regression Example 
## 
## Call:
## lm(formula = data_set ~ data_college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5502  -7.1056  -0.2502   7.1610  23.9637 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.73856    3.00646  15.879  < 2e-16 ***
## data_college  0.29778    0.04154   7.169  9.9e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.993 on 233 degrees of freedom
## Multiple R-squared:  0.1807, Adjusted R-squared:  0.1772 
## F-statistic:  51.4 on 1 and 233 DF,  p-value: 9.898e-12

data_compare(data_12_male, data_college_male, "Linear Regression Example for Male")

## Summary of Linear Regression Example for Male 
## 
## Call:
## lm(formula = data_set ~ data_college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8475  -7.0192  -0.2903   6.7797  21.3814 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.24525    3.35251  15.584  < 2e-16 ***
## data_college  0.22289    0.04832   4.613 8.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.636 on 154 degrees of freedom
## Multiple R-squared:  0.1214, Adjusted R-squared:  0.1157 
## F-statistic: 21.28 on 1 and 154 DF,  p-value: 8.313e-06

data_compare(data_12_female, data_college_female, "Linear Regression Example for Female")

## Summary of Linear Regression Example for Female 
## 
## Call:
## lm(formula = data_set ~ data_college)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9429  -6.6307  -0.1501   7.2643  28.0321 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.46511    6.94128   5.253 1.29e-06 ***
## data_college  0.45856    0.08902   5.151 1.94e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 77 degrees of freedom
## Multiple R-squared:  0.2563, Adjusted R-squared:  0.2466 
## F-statistic: 26.54 on 1 and 77 DF,  p-value: 1.936e-06

1. Linear Regression Example (Entire Dataset):

The regression equation is \(y=47.74+0.30x\), where \(y\) is the 12th mark and \(x\) is the college mark.
The p-value is \(9.898 \times 10^{-12}\), which is less than the significance level of 0.05.
Conclusion: We reject the hypothesis of no correlation. The slope differs significantly from zero, so there is a statistically significant positive linear relationship between the marks obtained in the 12th grade and the marks obtained in college for the entire dataset.

2. Linear Regression Example for Male Students and Female Students:

The regression equations are \(y=52.25+0.22x\) and \(y=36.47+0.46x\) respectively.
The p-values are \(8.313 \times 10^{-6}\) and \(1.936 \times 10^{-6}\) respectively, both less than the significance level of 0.05.
Conclusion: As in the test on the entire dataset, we reject the hypothesis of no correlation for both groups. The relationship is positive and statistically significant for male and for female students.

Conclusion on second hypothesis

The results from the linear regression analyses for the entire dataset, as well as for male and female students separately, consistently indicate a statistically significant positive correlation between the marks obtained in the 12th grade and the marks obtained in college. The p-values for all cases are well below the chosen significance level of 0.05, providing strong evidence against the hypothesis of no correlation in each scenario. The relationship is modest, however: R-squared is 0.18 for the entire dataset, 0.12 for male students, and 0.26 for female students, so the two sets of marks share only part of their variation. Consequently, based on the available data, we conclude that the marks obtained in the 12th grade and the marks obtained in college are significantly, though not strongly, related.
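
In a regression with one predictor, the Multiple R-squared equals the squared Pearson correlation, so the regression summaries above also give the size of the correlation. A short sketch that converts the three reported R-squared values (the vector names are illustrative; the sign is taken as positive because all three estimated slopes are positive):

```r
# Multiple R-squared values copied from the three regression summaries above
r_squared <- c(all = 0.1807, male = 0.1214, female = 0.2563)

# Correlation coefficient: square root of R-squared, positive because the slopes are positive
r <- sqrt(r_squared)
round(r, 2)
```

This gives correlations of roughly 0.43, 0.35 and 0.51 for all, male and female students respectively.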