MVA Homework Assignment 1

Importing and preparation of the data

Source: Student Performance Factors (Keggle)

#Importing the data

mydata <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW1/StudentPerformanceFactors.csv", header=TRUE, sep=",", dec=".")
head(mydata)

##   Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1            23         84                  Low                High
## 2            19         64                  Low              Medium
## 3            24         98               Medium              Medium
## 4            29         89                  Low              Medium
## 5            19         92               Medium              Medium
## 6            19         88               Medium              Medium
##   Extracurricular_Activities Sleep_Hours Previous_Scores
## 1                         No           7              73
## 2                         No           8              59
## 3                        Yes           7              91
## 4                        Yes           8              98
## 5                        Yes           6              65
## 6                        Yes           8              89
##   Motivation_Level Internet_Access Tutoring_Sessions Family_Income
## 1              Low             Yes                 0           Low
## 2              Low             Yes                 2        Medium
## 3           Medium             Yes                 2        Medium
## 4           Medium             Yes                 1        Medium
## 5           Medium             Yes                 3        Medium
## 6           Medium             Yes                 3        Medium
##   Teacher_Quality School_Type Peer_Influence Physical_Activity
## 1          Medium      Public       Positive                 3
## 2          Medium      Public       Negative                 4
## 3          Medium      Public        Neutral                 4
## 4          Medium      Public       Negative                 4
## 5            High      Public        Neutral                 4
## 6          Medium      Public       Positive                 3
##   Learning_Disabilities Parental_Education_Level Distance_from_Home
## 1                    No              High School               Near
## 2                    No                  College           Moderate
## 3                    No             Postgraduate               Near
## 4                    No              High School           Moderate
## 5                    No                  College               Near
## 6                    No             Postgraduate               Near
##   Gender Exam_Score
## 1   Male         67
## 2 Female         61
## 3   Male         74
## 4   Male         71
## 5 Female         70
## 6   Male         71

Unit of observation is one individual student.

Description of all variables before the data manipulation:

Hours_Studied: Number of hours spent studying per week.
Attendance: Percentage of classes attended.
Parental_Involvement: Level of parental involvement in the student’s education (Low, Medium, High).
Access_to_Resources: Availability of educational resources (Low, Medium, High).
Extracurricular_Activities: Participation in extracurricular activities (Yes, No).
Sleep_Hours: Average number of hours of sleep per night.
Previous_Scores: Scores from previous exams.
Motivation_Level: Student’s level of motivation (Low, Medium, High).
Internet_Access: Availability of internet access (Yes, No).
Tutoring_Sessions: Number of tutoring sessions attended per month.
Family_Income: Family income level (Low, Medium, High).
Teacher_Quality: Quality of the teachers (Low, Medium, High).
School_Type: Type of school attended (Public, Private).
Peer_Influence: Influence of peers on academic performance (Positive, Neutral, Negative).
Physical_Activity: Average number of hours of physical activity per week.
Learning_Disabilities: Presence of learning disabilities (Yes, No).
Parental_Education_Level: Highest education level of parents (High School, College, Postgraduate).
Distance_from_Home: Distance from home to school (Near, Moderate, Far).
Gender: Gender of the student (Male, Female).
Exam_Score: Final exam score.

summary(mydata)

##  Hours_Studied     Attendance     Parental_Involvement
##  Min.   : 1.00   Min.   : 60.00   Length:6607         
##  1st Qu.:16.00   1st Qu.: 70.00   Class :character    
##  Median :20.00   Median : 80.00   Mode  :character    
##  Mean   :19.98   Mean   : 79.98                       
##  3rd Qu.:24.00   3rd Qu.: 90.00                       
##  Max.   :44.00   Max.   :100.00                       
##  Access_to_Resources Extracurricular_Activities  Sleep_Hours    
##  Length:6607         Length:6607                Min.   : 4.000  
##  Class :character    Class :character           1st Qu.: 6.000  
##  Mode  :character    Mode  :character           Median : 7.000  
##                                                 Mean   : 7.029  
##                                                 3rd Qu.: 8.000  
##                                                 Max.   :10.000  
##  Previous_Scores  Motivation_Level   Internet_Access   
##  Min.   : 50.00   Length:6607        Length:6607       
##  1st Qu.: 63.00   Class :character   Class :character  
##  Median : 75.00   Mode  :character   Mode  :character  
##  Mean   : 75.07                                        
##  3rd Qu.: 88.00                                        
##  Max.   :100.00                                        
##  Tutoring_Sessions Family_Income      Teacher_Quality   
##  Min.   :0.000     Length:6607        Length:6607       
##  1st Qu.:1.000     Class :character   Class :character  
##  Median :1.000     Mode  :character   Mode  :character  
##  Mean   :1.494                                          
##  3rd Qu.:2.000                                          
##  Max.   :8.000                                          
##  School_Type        Peer_Influence     Physical_Activity
##  Length:6607        Length:6607        Min.   :0.000    
##  Class :character   Class :character   1st Qu.:2.000    
##  Mode  :character   Mode  :character   Median :3.000    
##                                        Mean   :2.968    
##                                        3rd Qu.:4.000    
##                                        Max.   :6.000    
##  Learning_Disabilities Parental_Education_Level Distance_from_Home
##  Length:6607           Length:6607              Length:6607       
##  Class :character      Class :character         Class :character  
##  Mode  :character      Mode  :character         Mode  :character  
##                                                                   
##                                                                   
##                                                                   
##     Gender            Exam_Score    
##  Length:6607        Min.   : 55.00  
##  Class :character   1st Qu.: 65.00  
##  Mode  :character   Median : 67.00  
##                     Mean   : 67.24  
##                     3rd Qu.: 69.00  
##                     Max.   :101.00

Here we can clearly see that data needs to be adjusted. Since we have a lot of variables, I decided to delete some of the variables (columns) that I believe are not important for my research. My further steps continue as followed.

# Keep only the necessary variables for the analysis
mydata2 <- mydata[, colnames(mydata) %in% c("Gender", "Exam_Score", "Hours_Studied", "Extracurricular_Activities", "Parental_Involvement", "Motivation_Level")]

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Adding additional column ID
mydata2$ID <- seq(1, nrow(mydata2))
mydata2 <- mydata2 %>%
  select(ID, everything())

#Factoring variables 

mydata2$Gender <- factor(mydata2$Gender,
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))

mydata2$Extracurricular_Activities <- factor(mydata2$Extracurricular_Activities,
                         levels = c("Yes", "No"),
                         labels = c("Yes", "No"))

mydata2$Parental_Involvement <- factor(mydata2$Parental_Involvement,
                         levels = c("High", "Medium", "Low"),
                         labels = c("High", "Medium", "Low"))

mydata2$Motivation_Level <- factor(mydata2$Motivation_Level,
                         levels = c("High", "Medium", "Low"),
                         labels = c("High", "Medium", "Low"))

summary(mydata2)

##        ID       Hours_Studied   Parental_Involvement
##  Min.   :   1   Min.   : 1.00   High  :1908         
##  1st Qu.:1652   1st Qu.:16.00   Medium:3362         
##  Median :3304   Median :20.00   Low   :1337         
##  Mean   :3304   Mean   :19.98                       
##  3rd Qu.:4956   3rd Qu.:24.00                       
##  Max.   :6607   Max.   :44.00                       
##  Extracurricular_Activities Motivation_Level    Gender    
##  Yes:3938                   High  :1319      Male  :3814  
##  No :2669                   Medium:3351      Female:2793  
##                             Low   :1937                   
##                                                           
##                                                           
##                                                           
##    Exam_Score    
##  Min.   : 55.00  
##  1st Qu.: 65.00  
##  Median : 67.00  
##  Mean   : 67.24  
##  3rd Qu.: 69.00  
##  Max.   :101.00

We can see that this already looks much better, due to factoring.

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

mydata_stat.desc <- mydata2[ , c(-1, -3, -4, -5,-6)]
round(stat.desc(mydata_stat.desc), 2)

##              Hours_Studied Exam_Score
## nbr.val            6607.00    6607.00
## nbr.null              0.00       0.00
## nbr.na                0.00       0.00
## min                   1.00      55.00
## max                  44.00     101.00
## range                43.00      46.00
## sum              131977.00  444226.00
## median               20.00      67.00
## mean                 19.98      67.24
## SE.mean               0.07       0.05
## CI.mean.0.95          0.14       0.09
## var                  35.89      15.14
## std.dev               5.99       3.89
## coef.var              0.30       0.06

The maximum hours of studying for an exam was 44 hours, and the minimal was 1 hour. In total all students together studied for 131.977 hours. The mean of studying hours was 19.98 hours and the median was 20 hours. Meaning 50% of students studied for up to including 20 hours for the exam and 50% of students studied for more than 20 hours.

The highest exam score was 101 points, and the minimal was 55 points, meaning the range of points was 46. The average amount of earned points was 67.27 points and the median was 67 points. Meaning 50% of students achieved up to including 67 points for the exam and 50% of students earned more than 67 points on the exam.

Since we see we do not have any n/a we do not have to perform the removal of non-availables.

Let’s take a look at how our data looks like with function head before we continue to our first research question:

head(mydata2)

##   ID Hours_Studied Parental_Involvement Extracurricular_Activities
## 1  1            23                  Low                         No
## 2  2            19                  Low                         No
## 3  3            24               Medium                        Yes
## 4  4            29                  Low                        Yes
## 5  5            19               Medium                        Yes
## 6  6            19               Medium                        Yes
##   Motivation_Level Gender Exam_Score
## 1              Low   Male         67
## 2              Low Female         61
## 3           Medium   Male         74
## 4           Medium   Male         71
## 5           Medium Female         70
## 6           Medium   Male         71

# Summary of the data we are actually going to use for this analysis

library(psych)
summary(mydata2)

##        ID       Hours_Studied   Parental_Involvement
##  Min.   :   1   Min.   : 1.00   High  :1908         
##  1st Qu.:1652   1st Qu.:16.00   Medium:3362         
##  Median :3304   Median :20.00   Low   :1337         
##  Mean   :3304   Mean   :19.98                       
##  3rd Qu.:4956   3rd Qu.:24.00                       
##  Max.   :6607   Max.   :44.00                       
##  Extracurricular_Activities Motivation_Level    Gender    
##  Yes:3938                   High  :1319      Male  :3814  
##  No :2669                   Medium:3351      Female:2793  
##                             Low   :1937                   
##                                                           
##                                                           
##                                                           
##    Exam_Score    
##  Min.   : 55.00  
##  1st Qu.: 65.00  
##  Median : 67.00  
##  Mean   : 67.24  
##  3rd Qu.: 69.00  
##  Max.   :101.00

RQ1: Are students who participate in extracurricular activities outside of school more successful in exams compared to students who do not participate in extracurricular activities?

Here we are checking if participation in extracurricular activities affects the final exam score.

We are going to conduct the analysis of two independent samples .

#Descriptive statistics by group - Male and Female

library(psych)
describeBy(mydata2$Exam_Score, mydata2$Extracurricular_Activities)

## 
##  Descriptive statistics by group 
## group: Yes
##    vars    n  mean   sd median trimmed  mad min max range skew
## X1    1 3938 67.44 3.94     67    67.3 2.97  57 101    44  1.8
##    kurtosis   se
## X1    11.58 0.06
## ---------------------------------------------------- 
## group: No
##    vars    n  mean   sd median trimmed  mad min max range skew
## X1    1 2669 66.93 3.79     67   66.84 2.97  55  97    42 1.39
##    kurtosis   se
## X1     8.78 0.07

For the analysis, the following assumptions need to be completed:

Variable is numeric: this is assumption is fulfilled since exam score is numeric variable
The distribution of variable is normal in both populations: I will test this with Shapiro-Wilk normality test
The data must come from two independent populations: this is true
Variable has the same variance in both populations: since this assumption is often violated we apply Welch correction

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ECA_yes <- ggplot(mydata2[mydata2$Extracurricular_Activities == "Yes",  ], aes(x = Exam_Score)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, col = "black", fill = "orchid1") +
  ylab("Frequency") +
  ggtitle("Exam score distribution of students with extracurricular activities")

ECA_no <- ggplot(mydata2[mydata2$Extracurricular_Activities == "No",  ], aes(x = Exam_Score)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 1, col = "black", fill = "steelblue1") +
  ylab("Frequency") +
  ggtitle("Exam score distribution of students without extracurricular activities")

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggarrange(ECA_yes, ECA_no,
          ncol = 2, nrow = 1)

library(ggpubr)
ggqqplot(mydata2,
         "Exam_Score",
         facet.by = "Extracurricular_Activities")

Based on the graphs and quantile quantile plot above It is possible that we have a normal distribution and possibly some outliers. But let’s check it.

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata2 %>%
  group_by(Extracurricular_Activities) %>%
  shapiro_test(Exam_Score)

## # A tibble: 2 × 4
##   Extracurricular_Activities variable   statistic        p
##   <fct>                      <chr>          <dbl>    <dbl>
## 1 Yes                        Exam_Score     0.892 3.40e-46
## 2 No                         Exam_Score     0.918 7.87e-36

Shapiro test Hypothesis:

H0: The variable Exam Score is normally distributed.
H1: The variable Exam Score is not normally distributed.

Based on the executed Shapiro test we can reject the null hypothesis H0 that the variable Exam Score is normally distributed based on the p-value p<0.001 for both of the groups of students.

Since normality is not confirmed we could only continue with non-parametric tests but later on we will also conduct the parametric tests for educational purposes.

With this we did not manage to confirmed the second assumption for two independent sample arithmetic means testing.

I decided not to remove any of the units, since I do not see them as outliers but only as highly performing and ambitious students that had equal possibilities as all the other students and they managed to perform well.

Since we have independent sample of 2 population (students who do and do not participate in extracurricular activities), we will start our parametric test with t-test with Welch correction (Welch two sample test). This is the test you usually use when normality is confirmed (which is not our case).

# Independent samples t-test

t.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata2$Exam_Score by mydata2$Extracurricular_Activities
## t = 5.2826, df = 5873.3, p-value = 1.319e-07
## alternative hypothesis: true difference in means between group Yes and group No is not equal to 0
## 95 percent confidence interval:
##  0.3210011 0.6998262
## sample estimates:
## mean in group Yes  mean in group No 
##          67.44185          66.93143

The hypothesis:

ECA_yes = Students who participate in extracurricular activities

ECA_no = Students who do not participate in extracurricular activities

H0: mean(ECA_yes) = mean(ECA_no) or mean(ECA_yes) - mean(ECA_no) = 0
H1: mean(ECA_yes) =/ mean(ECA_no) or mean(ECA_yes) - mean(ECA_no) =/ 0

With Welch two sample test we can reject H0 with p<0.01, therefore reject that the difference in means between students who do and students who do not participate in extracurricular activities is equal to 0.

I also measured the effect size with Cohen’s d statistics:

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities, 
                     pooled_sd = FALSE)

## Cohen's d |       95% CI
## ------------------------
## 0.13      | [0.08, 0.18]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.13, rules = "sawilowsky2009")

## [1] "very small"
## (Rules: sawilowsky2009)

We can observe very small differences in the average exam score results between student who do and students who do not participate in extracurricular activities.

Moving on to our non-parametric alternative that we use in case if normality was rejected with Shapiro-Wilk test, as happened in our case:

wilcox.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities, 
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Exam_Score by mydata2$Extracurricular_Activities
## W = 5649817, p-value = 1.924e-07
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location of exam score is the same for student who do and students who do not participate in extracurricular activities.
H1: Distribution location of exam score is not the same for student who do and students who do not participate in extracurricular activities.

Based on the p-value p<0.001, we can reject H0 that the distribution location of of exam score is the same for student who do and students who do not participate in extracurricular activities.

library(ggplot2)
ggplot(mydata2, aes(x = Exam_Score, fill = Extracurricular_Activities)) +
  geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
  scale_x_continuous(breaks = seq(0, 75, 5)) +
  scale_fill_manual(values = c("No" = "steelblue1", "Yes" = "orchid1")) +
  ylab("Frequency") +
  labs(fill = "Exam Score")

The function scale_fill_manual allowed me to manually specify aesthetic fill values for the graph to match the previous one in colors for both groups of students.

library(effectsize)
effectsize(wilcox.test(mydata2$Exam_Score ~ mydata2$Extracurricular_Activities, 
           correct = FALSE,
           exact = FALSE,
           alternative = "two.sided"))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.08              | [0.05, 0.10]

Based on Funder&Ozer 2019 scale, the difference in distribution location is very small.

#RQ1 Conclusion and answer

In conclusion, I believe that a better test to do is non-parametric Wilcoxon Rank Sum Test since normality was not confirmed. Based on the p-value p<0.001, we can reject H0 that the distribution location of of exam score is the same for student who do and students who do not participate in extracurricular activities.

I would conclude that the answer to my research question is that participation in extracurricular activities has an effect on the final exam score of students, yet it is proven to be very small (r=0.08). Based on the calculated average exam score we could conclude that students with extracurricular activities indeed perform a bit better than students who do not have extracurricular activities.

RQ2: Is there a relationship between the number of hours studied and the final exam score?

library(psych)
psych::describe(mydata2[ , c("Hours_Studied", "Exam_Score")])

##               vars    n  mean   sd median trimmed  mad min max range
## Hours_Studied    1 6607 19.98 5.99     20   19.97 5.93   1  44    43
## Exam_Score       2 6607 67.24 3.89     67   67.11 2.97  55 101    46
##               skew kurtosis   se
## Hours_Studied 0.01     0.02 0.07
## Exam_Score    1.64    10.56 0.05

library(ggplot2)
ggplot(mydata2, aes(x = Hours_Studied, y = Exam_Score)) +
  geom_point()

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(mydata2[ , c(2, 7)], smooth = FALSE)

We can see that this graphical representations are suggesting a possible positive correlation. Let’s check the linear relationship with descriptive approach.

Since we have two numerical variables we are going to conduct Pearson Chi squared test:

cor(mydata2$Hours_Studied, mydata2$Exam_Score, 
    method = "pearson",
    use = "complete.obs")

## [1] 0.445455

Based on the result we can say that the linear relationship between hours spent studying for the exam and actual exam score is positive and semi strong (r=0.445).

We still conduct the test of correlation coefficient with:

cor.test(mydata2$Hours_Studied, mydata2$Exam_Score, 
         method = "pearson",
         use = "complete.obs")

## 
##  Pearson's product-moment correlation
## 
## data:  mydata2$Hours_Studied and mydata2$Exam_Score
## t = 40.436, df = 6605, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4259164 0.4645782
## sample estimates:
##      cor 
## 0.445455

H0: ρ = 0
H1: ρ =/ 0

Based on the p-value p<0.001 we can reject the null hypothesis H0 that correlation coefficient is equal to 0.

#RQ2 Conclusion and answer

Based on the statistical tests above we can conclude that there is a statistically significant correlation between hours spent studying and exam results.

In conclusion, the answer to my research question is that there is a positive linear relationship between the number of hours studied and the final exam score.

RQ3: Is there an association between parental involvement and the motivation level of students?

# Pearson Chi2 test
results <- chisq.test(mydata2$Parental_Involvement, mydata2$Motivation_Level, 
                      correct = FALSE)

results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata2$Parental_Involvement and mydata2$Motivation_Level
## X-squared = 4.3704, df = 4, p-value = 0.3582

H0: There is no association between parental involvement and student’s motivation level
H1: There is association between parental involvement and student’s motivation level

We cannot reject H0 that there is no association between parental involvement and student’s motivation level based on the p-value p>0.05.

addmargins(results$observed)

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement High Medium  Low  Sum
##                       High    359    975  574 1908
##                       Medium  682   1685  995 3362
##                       Low     278    691  368 1337
##                       Sum    1319   3351 1937 6607

Here R created the table with observed (empirical) findings from our data set.

For example: Number 359 means that there were 359 students that experienced high parental involvement and high intrinsic motivation and 278 students that experienced low parental involvement and high intrinsic motivation. Together in the right column we see the sums of parental involvement by it’s categories and in the bottom row we can see sums of observed units based on motivation level of a student. And in the right bottom corner is the sum of all units which matches the number of units in the sample (n=6607).

We continue with calculating a table of expected frequencies.

#Calculating expected frequencies
round(results$expected, 2)

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement   High  Medium    Low
##                       High   380.91  967.72 559.38
##                       Medium 671.18 1705.17 985.65
##                       Low    266.91  678.11 391.97

In this step we calculated the expected frequencies. This means how many student we would expect to see in each paired category if there wouldn’t be any association between the variables parental involvement and motivation level.

For example: If there wouldn’t be association between parental involvement and motivation level we would expect 391.97 student with low parental support and low motivation level.

With this calculations we also check one of the required assumptions for this statistical analysis. Assumptions for association analysis between two categorical variables has 2 main assumptions that we checked in previous steps:

The observations are independent of each other: This is true
All expected frequencies are greater than 5: We can see in the table above that this is true

Since we checked that all assumptions hold we can continue with parametric tests for checking our hypothesis.

# Calculation of standardized residuals

round(results$res, 2)

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement  High Medium   Low
##                       High   -1.12   0.23  0.62
##                       Medium  0.42  -0.49  0.30
##                       Low     0.68   0.49 -1.21

Here we can see that non of the standardized residuals are significant (all are bellow 1.96), meaning there are no significant differences found. We could say we didn’t find any statistical effect of parental involvement on student’s motivation level.

addmargins(round(prop.table(results$observed), 3))

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement  High Medium   Low   Sum
##                       High   0.054  0.148 0.087 0.289
##                       Medium 0.103  0.255 0.151 0.509
##                       Low    0.042  0.105 0.056 0.203
##                       Sum    0.199  0.508 0.294 1.001

Here is a structured table where all of the data together sums up to 1 (around 1 or 1.001 in our case). For example in our data there is 10.3% of students with medium parental involvement and high motivation level and 5.6% of students with low motivation level and low parental involvement. There is also 19.9% of students with high motivation level and 50.9% of students of medium level of parental involvement.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement  High Medium   Low   Sum
##                       High   0.188  0.511 0.301 1.000
##                       Medium 0.203  0.501 0.296 1.000
##                       Low    0.208  0.517 0.275 1.000

In this table we focused on the parental involvement. We observe that also from the table where all of the parental involvement categories sum up to one. Results form this table can be interpreted as followed:

30.1% of students with high parental involvement experience low motivation level and 20.8% of students with low parental support experience high motivation level.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##                             mydata2$Motivation_Level
## mydata2$Parental_Involvement  High Medium   Low
##                       High   0.272  0.291 0.296
##                       Medium 0.517  0.503 0.514
##                       Low    0.211  0.206 0.190
##                       Sum    1.000  1.000 1.000

In this table we can observe data based on the motivation level variable. We can spot this from the bottom row with the sum of ones.

The data from this table can be interpreted as followed: Out of all people with high motivation there is 27.7% students with high parental involvement, 51.7% of students with medium parental involvement and 21.1% of students with low parental involvement (and all of these sum to 100% of students with high motivation).

library(effectsize)
effectsize::cramers_v(mydata2$Parental_Involvement, mydata2$Motivation_Level)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 5.29e-03          | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(5.29e-03)

## [1] "tiny"
## (Rules: funder2019)

Based on Cramers statistics we can say there are tiny differences between motivation level of students and parental involvement.

#RQ3 Conclusion and answer

Based on the test and checked standardized residuals, We can conclude that there is no association found between parental involvement and the motivation level of students, since we cannot reject the null hypothesis of there being no association between parental involvement and motivation level of a student (p>0.05). However the Cramers test suggests some differences, but tiny (r=0.00529).

Therefore, official answer to third research question is that there is no statistically significant association between parental involvement and student’s motivation level.

MVA Homework Assignment 1

Veronika Avbar

2025-01-13

Importing and preparation of the data

RQ1: Are students who participate in extracurricular activities outside of school more successful in exams compared to students who do not participate in extracurricular activities?

RQ2: Is there a relationship between the number of hours studied and the final exam score?

RQ3: Is there an association between parental involvement and the motivation level of students?