Domen Novak

Data source: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression

Research question 1: Is there a significant difference in the average performance index between students who study for a high number of hours and those who study for a low number of hours, considering variables such as previous exam scores, hours of sleep, and the number of mock exam tests solved?

H0: There is no significant difference in the average performance index between students who study for a high number of hours and those who study for a low number of hours.

H1: There is a significant difference in the average performance index between students who study for a high number of hours and those who study for a low number of hours.

data <- read.csv("Student_Performance.csv") #naming the data table as 'data' for easier coding

summary(data) #descriptive statistics

##  Hours.Studied   Previous.Scores Extracurricular.Activities  Sleep.Hours   
##  Min.   :1.000   Min.   :40.00   Min.   :0.0000             Min.   :4.000  
##  1st Qu.:3.000   1st Qu.:54.00   1st Qu.:0.0000             1st Qu.:5.000  
##  Median :5.000   Median :69.00   Median :0.0000             Median :7.000  
##  Mean   :4.993   Mean   :69.45   Mean   :0.4948             Mean   :6.531  
##  3rd Qu.:7.000   3rd Qu.:85.00   3rd Qu.:1.0000             3rd Qu.:8.000  
##  Max.   :9.000   Max.   :99.00   Max.   :1.0000             Max.   :9.000  
##  Sample.Question.Papers.Practiced Performance.Index
##  Min.   :0.000                    Min.   : 10.00   
##  1st Qu.:2.000                    1st Qu.: 40.00   
##  Median :5.000                    Median : 55.00   
##  Mean   :4.583                    Mean   : 55.22   
##  3rd Qu.:7.000                    3rd Qu.: 71.00   
##  Max.   :9.000                    Max.   :100.00

Hours.Studied:

Min.: The minimum amount of hours studied is 1.

Median: The median or the middle value is 5, meaning half of the values are from 1 to 5, whereas the other half is from 5 to 9.

Mean: The mean or average hours studied is approximately 4.99.

Max.: The maximum hours studied is 9.

Previous.Scores:

Min.: The minimum value in the data set for previous exam scores is 40.

Median: The median (50th percentile) is 69, suggesting that half of the individuals have previous exam scores of 69 or lower.

Mean: The mean or average previous exam score is approximately 69.45.

Max.: The maximum value is 99, indicating that the maximum previous exam score is 99.

Extracurricular.Activities:

The minimum and maximum values here are 0 and 1, coding “no” and “yes” for extracurricular activities.

Median: The median is 0, which means that at least half of the individuals do not take part in extracurricular activities.

Mean: The mean is approximately 0.49. Because the variable is coded 0/1, this means that about 49% of the individuals take part in extracurricular activities.
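For a variable coded 0/1, the mean is simply the share of ones. A minimal sketch with made-up values (the vector `extra` is hypothetical and is not taken from the data set):

```r
# Hypothetical 0/1 indicator: 1 = takes part in extracurricular activities.
extra <- c(1, 0, 0, 1, 1, 0, 0, 0)

share_yes  <- sum(extra == 1) / length(extra)  # proportion of ones
mean_value <- mean(extra)                      # arithmetic mean

# Both quantities are the same number.
print(c(share_yes = share_yes, mean_value = mean_value))
```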

Sleep.Hours:

Min.: The minimum value in the data set for sleep hours is 4, indicating that some individuals have as little as 4 hours of sleep.

Median: The median (50th percentile) is 7, suggesting that half of the individuals have 7 hours of sleep or fewer.

Mean: The mean or average sleep hours is approximately 6.53.

Max.: The maximum value is 9, indicating that the maximum number of sleep hours is 9.

Sample.Question.Papers.Practiced:

Min.: The minimum value in the data set for the number of sample question papers practiced is 0, indicating that some individuals have not practiced any sample question papers.

Median: The median (50th percentile) is 5, suggesting that half of the individuals have practiced 5 or fewer sample question papers.

Mean: The mean or average number of sample question papers practiced is approximately 4.58.

Max.: The maximum value is 9, indicating that some individuals have practiced as many as 9 sample question papers.

Performance.Index:

Min.: The minimum value in the data set for the performance index is 10.

Median: The median (50th percentile) is 55, suggesting that half of the individuals have a performance index of 55 or lower.

Mean: The mean or average performance index is approximately 55.22.

Max.: The maximum value is 100, indicating that the maximum performance index is 100.

t.test(data$Hours.Studied,data$Performance.Index, #parametric test
       paired = FALSE,
       var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  data$Hours.Studied and data$Performance.Index
## t = -259.11, df = 10362, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -50.61191 -49.85189
## sample estimates:
## mean of x mean of y 
##    4.9929   55.2248

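Based on the extremely low p-value and the confidence interval not including zero, we reject the null hypothesis of equal means (p<0.005). Note, however, what this call compares: its two samples are the Hours.Studied column (mean 4.99) and the Performance.Index column (mean 55.22), so the estimated difference of about -50 is the gap between the averages of two different variables measured on different scales. It does not compare the performance index of students who study for a high number of hours with that of students who study for a low number of hours. Answering the research question directly would require splitting the students into a high-hours group and a low-hours group and comparing Performance.Index between the two groups.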
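A comparison of high-hours and low-hours students needs an explicit grouping variable. The following is a minimal sketch on simulated data, not on the Kaggle file: the data frame `sim`, the median cutoff and the column `Study.Group` are assumptions made for illustration, and other cutoffs would be equally defensible.

```r
# Simulated stand-in for the real data; column names mirror the data set.
set.seed(1)
n <- 200
sim <- data.frame(Hours.Studied = sample(1:9, n, replace = TRUE))
sim$Performance.Index <- 40 + 3 * sim$Hours.Studied + rnorm(n, sd = 10)

# Hypothetical grouping rule: "high" = more hours than the median student.
cutoff <- median(sim$Hours.Studied)
sim$Study.Group <- factor(ifelse(sim$Hours.Studied > cutoff, "high", "low"),
                          levels = c("low", "high"))

# Welch two-sample t-test of Performance.Index between the two groups.
welch <- t.test(Performance.Index ~ Study.Group, data = sim, var.equal = FALSE)
print(welch)
```

With this setup the two group means reported by the test are both performance index averages, so their difference is directly interpretable.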

kruskal.test(data$Hours.Studied, data$Performance.Index) #non-parametric test

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$Hours.Studied and data$Performance.Index
## Kruskal-Wallis chi-squared = 2041.5, df = 90, p-value < 2.2e-16

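The Kruskal-Wallis rank sum test returns a highly significant p-value (< 2.2e-16), so we reject its null hypothesis. In kruskal.test(x, g) the second argument is the grouping variable, so this call treats each distinct Performance.Index value as a group (hence df = 90, i.e. 91 groups) and tests whether the distribution of Hours.Studied differs across those groups. The result therefore shows that hours studied and performance index are associated, but the test does not by itself give the direction of the difference between students who study for a high number of hours and those who study for a low number of hours.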
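In the formula interface of kruskal.test() the response goes on the left and the grouping factor on the right. A minimal sketch on simulated data follows; the data frame `sim` and the median-based `Study.Group` factor are assumptions for illustration only.

```r
# Simulated stand-in for the real data; column names mirror the data set.
set.seed(1)
n <- 200
sim <- data.frame(Hours.Studied = sample(1:9, n, replace = TRUE))
sim$Performance.Index <- 40 + 3 * sim$Hours.Studied + rnorm(n, sd = 10)

# Hypothetical two-level grouping by the median number of hours studied.
sim$Study.Group <- factor(
  ifelse(sim$Hours.Studied > median(sim$Hours.Studied), "high", "low"),
  levels = c("low", "high")
)

# Response on the left, grouping factor on the right.
kw <- kruskal.test(Performance.Index ~ Study.Group, data = sim)
print(kw)

# Degrees of freedom equal the number of groups minus one.
print(unname(kw$parameter))
```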

To decide which test is actually better in this case, we must look at assumptions. For parametric tests (t-test), data is assumed to be normally distributed. However, non-parametric tests do not assume normal distribution. To find out, we run a Shapiro-Wilk normality test.

sample_indices <- sample(1:nrow(data), size = 4999) #shapiro.test() accepts at most 5000 observations and the data set has 10,000 rows, so a random subset of row indices is drawn

shapiro.test(sample_indices) #checking normality of distribution

## 
##  Shapiro-Wilk normality test
## 
## data:  sample_indices
## W = 0.95464, p-value < 2.2e-16

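The Shapiro-Wilk test returns an extremely small p-value (p<0.05), so we reject the null hypothesis (H0: data is normally distributed) in favour of the alternative (H1: data is not normally distributed). Note, however, that the test above was applied to sample_indices, i.e. to the randomly drawn row numbers, and not to the values of a variable such as Performance.Index. Row numbers drawn from 1 to 10,000 are spread roughly uniformly, so this rejection describes the row numbers rather than the data; to test a variable, the sampled indices would have to be used to subset that column first.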
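The difference between testing row indices and testing the sampled values can be illustrated on simulated data. In the sketch below the vector `values` is an assumption standing in for a data column; it is drawn from a normal distribution on purpose, so that any rejection for the indices cannot come from the data.

```r
# Simulated column that is normal by construction (stand-in for a real column).
set.seed(2)
values <- rnorm(10000, mean = 55, sd = 19)

# Draw 4999 row indices, within the 5000-observation limit of shapiro.test().
idx <- sample(seq_along(values), size = 4999)

test_on_indices <- shapiro.test(idx)          # tests the row numbers
test_on_values  <- shapiro.test(values[idx])  # tests the sampled observations

print(test_on_indices$p.value)
print(test_on_values$p.value)
```

Only the second call says anything about the distribution of the variable itself.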

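Because normality is rejected, a non-parametric test is the safer choice. This means that the Kruskal-Wallis test is more appropriate than the t-test in this case.

Conclusion: Because of the low p-value obtained from the Kruskal-Wallis test, we can reject the null hypothesis at the p<0.005 significance level. We accept the alternative hypothesis, which states that there is a significant difference in the average performance index between students who study for a high number of hours and those who study for a low number of hours. Strictly, the tests above relate the Hours.Studied and Performance.Index columns to each other rather than comparing a high-hours group with a low-hours group, so this conclusion should be read as evidence of an association between study hours and performance.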

Research Question 2: How do the variables hours studied, previous scores, sleep hours, sample question papers practiced and extracurricular activities impact the performance index of students in a university?

H0: None of the variables have a significant impact on performance index.

H1: At least one of the coefficients of hours studied, previous scores, sleep hours, sample question papers practiced and extracurricular activities in the regression model is non-zero, indicating a significant impact on performance index.

The expected effects by the explanatory variables on the dependent variable are as follows:

Hours studied: positive effect on performance index - more hours mean higher performance index.

Previous scores: a higher previous score would not necessarily mean a higher performance index.

Sleep hours: positive effect on performance index - more hours of sleep mean higher performance index, but possibly only up to a certain point (i.e. there would be no significant difference between a person who sleeps 8 hours and a person who sleeps 10 hours).

Sample question papers practiced: positive effect on performance index - more papers practiced means a higher performance index.

Extracurricular activities: positive effect on performance index - if a person does extracurricular activities we can expect a higher performance index, since activities such as sports are usually good for mental as well as physical capabilities.

NOTE: Since the data source is the same in both questions, I have not repeated the descriptive statistics - please find them above under Question 1.

fit <- lm(data$Performance.Index ~ data$Hours.Studied + data$Previous.Scores + data$Sleep.Hours + #linear regression function 
            data$Sample.Question.Papers.Practiced + data$Extracurricular.Activities,
                       data = data)

summary(fit)

## 
## Call:
## lm(formula = data$Performance.Index ~ data$Hours.Studied + data$Previous.Scores + 
##     data$Sleep.Hours + data$Sample.Question.Papers.Practiced + 
##     data$Extracurricular.Activities, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6333 -1.3684 -0.0311  1.3556  8.7932 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           -34.075588   0.127143 -268.01   <2e-16
## data$Hours.Studied                      2.852982   0.007873  362.35   <2e-16
## data$Previous.Scores                    1.018434   0.001175  866.45   <2e-16
## data$Sleep.Hours                        0.480560   0.012022   39.97   <2e-16
## data$Sample.Question.Papers.Practiced   0.193802   0.007110   27.26   <2e-16
## data$Extracurricular.Activities         0.612898   0.040781   15.03   <2e-16
##                                          
## (Intercept)                           ***
## data$Hours.Studied                    ***
## data$Previous.Scores                  ***
## data$Sleep.Hours                      ***
## data$Sample.Question.Papers.Practiced ***
## data$Extracurricular.Activities       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.038 on 9994 degrees of freedom
## Multiple R-squared:  0.9888, Adjusted R-squared:  0.9887 
## F-statistic: 1.757e+05 on 5 and 9994 DF,  p-value: < 2.2e-16

Each coefficient provides the estimated change in the dependent variable (Performance.Index) for a one-unit change in the corresponding predictor variable, holding other variables constant.

For example, the coefficient for data$Hours.Studied is 2.852982. This suggests that, on average, for every additional hour studied, the predicted Performance.Index increases by approximately 2.85 points, assuming all other variables remain unchanged.
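The "holding other variables constant" reading can be checked by hand with the estimates from the output above. In the sketch below the helper `predict_index()` and the example student (5 or 6 hours, previous score 69, 7 hours of sleep, 5 papers, no extracurricular activities) are hypothetical choices made for illustration.

```r
# Coefficient estimates copied from the summary(fit) output.
coefs <- c(intercept = -34.075588, hours = 2.852982, previous = 1.018434,
           sleep = 0.480560, papers = 0.193802, extra = 0.612898)

# Hypothetical helper: predicted Performance.Index for one student.
predict_index <- function(hours, previous, sleep, papers, extra) {
  unname(coefs["intercept"] + coefs["hours"] * hours +
           coefs["previous"] * previous + coefs["sleep"] * sleep +
           coefs["papers"] * papers + coefs["extra"] * extra)
}

base     <- predict_index(hours = 5, previous = 69, sleep = 7, papers = 5, extra = 0)
plus_one <- predict_index(hours = 6, previous = 69, sleep = 7, papers = 5, extra = 0)

# The difference equals the Hours.Studied coefficient.
print(plus_one - base)
```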

The significance codes (*) indicate the statistical significance of each coefficient. All coefficients have very low p-values (< 0.001), suggesting that they are highly significant in predicting the Performance.Index.

The high R-squared (0.9888) indicates that 98.88% of the variance in the dependent variable (Performance.Index) is explained by the linear effect of the explanatory variables.

The extremely high F-statistic (1.757e+05) with a very low p-value (< 2.2e-16, well below 0.05) suggests that the model is statistically significant. This means that at least one predictor variable is significantly related to the Performance.Index.
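The F-statistic and R-squared are two views of the same fit: F = (R²/k) / ((1 - R²)/df), with k slope coefficients and df residual degrees of freedom. A short check using the rounded values from the output; because R-squared is reported to only four decimals, the result should match the reported 1.757e+05 only approximately.

```r
# Values as reported in the summary(fit) output (rounded).
r_squared <- 0.9888
k         <- 5      # number of slope coefficients
df_resid  <- 9994   # residual degrees of freedom

# Overall F-statistic implied by R-squared.
f_from_r2 <- (r_squared / k) / ((1 - r_squared) / df_resid)
print(f_from_r2)
```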

Now, let us check assumptions of regression to see if we can generalize these results to the entire population:

Assumption 1 - Linearity

# Fit a linear regression model
model <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Sleep.Hours + Sample.Question.Papers.Practiced + Extracurricular.Activities, data = data)

# residuals vs. fitted values plot
par(mfrow = c(1, 1))
plot(model$fitted.values, model$residuals, xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs. Fitted Values")

Based on the above plot we can confirm linearity in parameters and we can say that Assumption 1 holds true.

Assumption 2 - Expected value of errors equals 0

sum_residuals <- sum(residuals(model))
print(sum_residuals)

## [1] -2.184832e-13

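Since the sum of residuals is very close to 0 (the remainder is floating-point rounding), we can say that Assumption 2 holds true. Note that OLS with an intercept forces the residuals to sum to zero, so this check confirms the fit rather than testing the assumption about the unobserved errors.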
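That the residuals of an OLS fit with an intercept sum to zero by construction can be seen on a deliberately misspecified model. The sketch below uses simulated data; the variables `x`, `y` and `wrong_fit` are assumptions for illustration.

```r
# Simulated data with a quadratic relationship.
set.seed(3)
x <- runif(500, min = 0, max = 10)
y <- 2 + x^2 + rnorm(500)

# Deliberately misspecified: a straight line fitted to curved data.
wrong_fit <- lm(y ~ x)

# The residuals still sum to (numerically) zero.
print(sum(residuals(wrong_fit)))
```

A near-zero sum therefore cannot distinguish a well-specified model from a poorly specified one.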

Assumption 3 - Homoskedasticity

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(model) #Breusch-Pagan test for heteroskedasticity

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                     Data                      
##  ---------------------------------------------
##  Response : Performance.Index 
##  Variables: fitted values of Performance.Index 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    0.2624659 
##  Prob > Chi2   =    0.6084311

From the above test we got a high p-value (p-value = 0.6084311 > 0.05), so we cannot reject the null hypothesis of constant variance. There is no statistical evidence of heteroskedasticity in the residuals of the regression model. We say that Assumption 3 holds true.
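The idea behind the test can be sketched in base R: regress the squared residuals on the fitted values and compare n times the R-squared of that auxiliary regression with a chi-squared distribution with 1 degree of freedom. This is the studentized (Koenker) variant, written here as an assumption-laden illustration on simulated data; it is not claimed to reproduce the exact computation inside olsrr, so the numbers need not coincide with ols_test_breusch_pagan().

```r
# Simulated data with constant error variance.
set.seed(4)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit_sim <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on fitted values.
e2   <- residuals(fit_sim)^2
yhat <- fitted(fit_sim)
aux  <- lm(e2 ~ yhat)

# LM statistic and its p-value (chi-squared, 1 degree of freedom).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
print(c(statistic = lm_stat, p.value = p_value))
```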

Assumption 4 - Normality of errors

hist(residuals(model), main = "Histogram of Residuals") # graphical test for normality of errors

Since the histogram follows a bell-shaped curve, we can say that the residuals are approximately normally distributed. We can assume that Assumption 4 holds true.
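A normal Q-Q comparison is a common complement to the histogram: if the residuals are close to normal, their ordered values line up with the theoretical normal quantiles. A minimal sketch on simulated data (the model `fit_sim` is an assumption; `plot.it = FALSE` only suppresses the plot so that the quantile pairs can be inspected numerically):

```r
# Simulated regression with normal errors.
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit_sim <- lm(y ~ x)

# Theoretical vs. sample quantiles of the residuals, without drawing a plot.
qq <- qqnorm(residuals(fit_sim), plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Calling qqnorm() without `plot.it = FALSE`, followed by qqline(), draws the usual diagnostic plot.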

Assumption 5 - Errors are independent

Since there is no panel data or hierarchical structure we can assume that Assumption 5 holds true.

Assumption 6 - No perfect multicollinearity

library(car)

## Loading required package: carData

mean(vif(model)) # Average variance inflation factor to check multicollinearity

## [1] 1.000553

Since average VIF is very close to 1, we can say that Assumption 6 holds true.
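The variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A base-R sketch on simulated, independent predictors (the data frame `X` and its column names are assumptions for illustration):

```r
# Simulated, mutually independent predictors.
set.seed(6)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# VIF of each predictor: regress it on the others, then 1 / (1 - R^2).
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(vif_by_hand)
print(mean(vif_by_hand))
```

Values near 1 mean that a predictor is almost uncorrelated with the others, which is the pattern the average VIF above points to.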

Assumption 7 - Number of units must be greater than the number of estimated parameters

length(coefficients(model)) # this command gives total number of estimated parameters in the regression model

## [1] 6

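The model estimates 6 parameters (the intercept and five slopes). The data set contains 10,000 units (9,994 residual degrees of freedom plus 6 estimated parameters), which is far more than 6, so we can say that Assumption 7 holds true.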
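The comparison of units and parameters can be made explicit with nobs() and coef(). A minimal sketch on a small simulated model (the data frame `sim` and the model `fit_sim` are assumptions for illustration):

```r
# Small simulated data set with two predictors.
set.seed(7)
sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim$y <- 1 + sim$x1 + sim$x2 + rnorm(50)
fit_sim <- lm(y ~ x1 + x2, data = sim)

n_units  <- nobs(fit_sim)          # number of observations used in the fit
n_params <- length(coef(fit_sim))  # intercept plus slopes

print(n_units > n_params)

# Residual degrees of freedom are the difference between the two.
print(df.residual(fit_sim) == n_units - n_params)
```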

Conclusion: Since all of the seven assumptions are met, we can conclude that the estimated coefficients are unbiased and efficient. The p-values associated with the coefficients provide reliable information about their statistical significance. The overall fit of the model, as indicated by the F-statistic, is valid. The estimated effects of the explanatory variables on the dependent variable are all positive, in line with our expectations; the one exception is previous scores, for which we did not necessarily expect an effect but which has a significant positive coefficient.