Homework_1_Tobias_Schöfbeck

Homework assignment 1

Exercises 1, 2, 3, 4, 5 and part of 6

data <- read.csv("StudentPerformanceFactors.csv", header = TRUE)
head(data)
  Hours_Studied Attendance Parental_Involvement Access_to_Resources
1            23         84                  Low                High
2            19         64                  Low              Medium
3            24         98               Medium              Medium
4            29         89                  Low              Medium
5            19         92               Medium              Medium
6            19         88               Medium              Medium
  Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
1                         No           7              73              Low
2                         No           8              59              Low
3                        Yes           7              91           Medium
4                        Yes           8              98           Medium
5                        Yes           6              65           Medium
6                        Yes           8              89           Medium
  Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
1             Yes                 0           Low          Medium      Public
2             Yes                 2        Medium          Medium      Public
3             Yes                 2        Medium          Medium      Public
4             Yes                 1        Medium          Medium      Public
5             Yes                 3        Medium            High      Public
6             Yes                 3        Medium          Medium      Public
  Peer_Influence Physical_Activity Learning_Disabilities
1       Positive                 3                    No
2       Negative                 4                    No
3        Neutral                 4                    No
4       Negative                 4                    No
5        Neutral                 4                    No
6       Positive                 3                    No
  Parental_Education_Level Distance_from_Home Gender Exam_Score
1              High School               Near   Male         67
2                  College           Moderate Female         61
3             Postgraduate               Near   Male         74
4              High School           Moderate   Male         71
5                  College               Near Female         70
6             Postgraduate               Near   Male         71
data <- data[1:200, c(-3,-4,-5,-6,-8,-9,-10,-12,-13,-14,-15,-16,-17,-18)]
head(data)
  Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1            23         84              73           Low   Male         67
2            19         64              59        Medium Female         61
3            24         98              91        Medium   Male         74
4            29         89              98        Medium   Male         71
5            19         92              65        Medium Female         70
6            19         88              89        Medium   Male         71

I found the dataset on Kaggle under the link: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors?resource=download

First, I reduced the dataset, which originally has 6607 observations and 20 variables. After removing observations and variables, the data has 200 observations and the six variables Hours_Studied, Attendance, Previous_Scores, Family_Income, Gender and Exam_Score.
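Selecting columns by name is an alternative to the negative column indices used above, and it makes the kept variables explicit. The following is a minimal sketch on a small hand-built data frame (the first three rows of the output above, with two of the dropped columns included for illustration). The names `raw`, `keep` and `small` are hypothetical.

```r
# Toy data: first three rows from the original head(data) output,
# including two columns (Sleep_Hours, School_Type) that should be dropped
raw <- data.frame(
  Hours_Studied   = c(23, 19, 24),
  Attendance      = c(84, 64, 98),
  Sleep_Hours     = c(7, 8, 7),
  Previous_Scores = c(73, 59, 91),
  Family_Income   = c("Low", "Medium", "Medium"),
  School_Type     = c("Public", "Public", "Public"),
  Gender          = c("Male", "Female", "Male"),
  Exam_Score      = c(67, 61, 74)
)

# Columns to keep, listed by name instead of by negative position
keep <- c("Hours_Studied", "Attendance", "Previous_Scores",
          "Family_Income", "Gender", "Exam_Score")

# head(x, n) returns the first n rows; on the full data n would be 200
small <- head(raw[, keep], 2)
dim(small)
```

On the full dataset the same idea would read `head(data[, keep], 200)`.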

Hours_Studied is the number of hours spent per week studying.

Attendance is the percentage of classes attended.

Previous_Scores is the score for the previous exam.

Family_Income is the family income level in three categories (Low, Medium, High).

Gender is the gender of the student (Male, Female).

Exam_Score is the final exam score.

Exercise 6

data$Gender <- factor(data$Gender)
data$Family_Income <- factor(data$Family_Income)
summary(data)
 Hours_Studied     Attendance    Previous_Scores  Family_Income    Gender   
 Min.   : 4.00   Min.   : 60.0   Min.   : 50.00   High  :36     Female: 73  
 1st Qu.:16.00   1st Qu.: 70.0   1st Qu.: 65.00   Low   :76     Male  :127  
 Median :20.00   Median : 79.0   Median : 77.50   Medium:88                 
 Mean   :19.79   Mean   : 80.2   Mean   : 76.44                             
 3rd Qu.:23.00   3rd Qu.: 91.0   3rd Qu.: 89.00                             
 Max.   :36.00   Max.   :100.0   Max.   :100.00                             
   Exam_Score    
 Min.   : 60.00  
 1st Qu.: 65.00  
 Median : 67.00  
 Mean   : 67.33  
 3rd Qu.: 69.00  
 Max.   :100.00  

I then converted Gender and Family_Income into factors. The summary shows that there are 73 female and 127 male students. By family income, 36 students fall into the category High, 76 into Low and 88 into Medium.
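By default, factor() sorts the levels alphabetically, which is why summary() lists High, Low, Medium in that order. If a natural order from low to high is preferred in tables and plots, the levels can be given explicitly. A minimal sketch on a toy vector (the names `income` and `income_f` are hypothetical):

```r
# Toy income vector, not the real data
income <- c("Low", "Medium", "Medium", "High")

# Default behaviour: levels are sorted alphabetically
levels(factor(income))

# Explicit level order from lowest to highest income
income_f <- factor(income, levels = c("Low", "Medium", "High"))
levels(income_f)

# Frequency table now follows the chosen order
table(income_f)
```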

data[2,3] <- 60

After that, I changed the previous score of one student (row 2) from 59 to 60. I changed it by only one point because I did not want to alter the data too much.
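Indexing by column name instead of by position makes it easier to see which value is being changed, and it keeps working if the column order changes later. A minimal sketch on a toy data frame (the names `scores` and `old_value` are hypothetical; the values are the first three rows shown above):

```r
# Toy data frame with the first three rows of two columns
scores <- data.frame(
  Attendance      = c(84, 64, 98),
  Previous_Scores = c(73, 59, 91)
)

# Remember the old value, then overwrite row 2 of the named column
old_value <- scores[2, "Previous_Scores"]
scores[2, "Previous_Scores"] <- 60

c(old = old_value, new = scores[2, "Previous_Scores"])
```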

dataF <- data[data$Gender == "Female", ]
head(dataF)
   Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
2             19         64              60        Medium Female         61
5             19         92              65        Medium Female         70
16            17         68              70        Medium Female         64
18            22         70              82           Low Female         65
19            15         80              91           Low Female         67
21            29         78              99          High Female         69

First, I created a subset containing only female students. This subset has 73 observations.

hattendance <- data[data$Attendance >= 80,]
head(hattendance)
  Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1            23         84              73           Low   Male         67
3            24         98              91        Medium   Male         74
4            29         89              98        Medium   Male         71
5            19         92              65        Medium Female         70
6            19         88              89        Medium   Male         71
7            29         84              68           Low   Male         67

Then I created a subset of students with an attendance of at least 80% (the condition is >= 80), which has 99 observations.
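The condition in the code is `>= 80`, so students with an attendance of exactly 80% are included. Whether `>=` or `>` is used can change the size of the subset, and nrow() is a quick way to check it. A minimal sketch with hypothetical attendance values (the data frame `toy` is not the real data):

```r
# Hypothetical attendance values, including one student at exactly 80
toy <- data.frame(Attendance = c(84, 64, 80, 98))

# drop = FALSE keeps the result a data frame even with a single column
at_least_80 <- toy[toy$Attendance >= 80, , drop = FALSE]
above_80    <- toy[toy$Attendance > 80, , drop = FALSE]

nrow(at_least_80)  # 3: the student at exactly 80 is included
nrow(above_80)     # 2: the student at exactly 80 is excluded
```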

library(dplyr)
data <- data %>% rename(Household_Income = Family_Income)
head(data)
  Hours_Studied Attendance Previous_Scores Household_Income Gender Exam_Score
1            23         84              73              Low   Male         67
2            19         64              60           Medium Female         61
3            24         98              91           Medium   Male         74
4            29         89              98           Medium   Male         71
5            19         92              65           Medium Female         70
6            19         88              89           Medium   Male         71

Using the rename() function from the dplyr package, I renamed Family_Income to Household_Income.
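The same renaming can also be done in base R without loading an extra package. This is only an alternative sketch on a toy data frame (the name `toy` is hypothetical), not the approach used in the homework:

```r
# Toy data frame with the column to be renamed
toy <- data.frame(
  Family_Income = c("Low", "Medium"),
  Exam_Score    = c(67, 61)
)

# Replace the matching column name in place
names(toy)[names(toy) == "Family_Income"] <- "Household_Income"
names(toy)
```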

Exercise 7

library(psych)
describeBy(data$Exam_Score, data$Gender)

 Descriptive statistics by group 
group: Female
   vars  n  mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 73 67.47 4.98     67   66.95 2.97  61 100    39 3.88    22.62 0.58
------------------------------------------------------------ 
group: Male
   vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 127 67.26 3.15     67   67.25 2.97  60  76    16 0.08    -0.47 0.28

With the function describeBy we can compare male and female students. Both groups have very similar means of around 67 (67.47 for females, 67.26 for males), and both medians are 67, so in each group about half of the scores lie at or below 67 and half at or above it. Female students have the larger range of exam scores, from 61 to 100 (a range of 39), while male students score between 60 and 76 (a range of 16). The maximum of 100 among the female students is consistent with their larger standard deviation (4.98 compared to 3.15) and their strong positive skew (3.88 compared to 0.08).
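The group-wise mean, median and range can also be reproduced in base R with tapply(), which helps to see how these numbers are derived. A minimal sketch using only the six rows printed by head(data) above, so the values differ from the full 200 observations (the name `toy` is hypothetical):

```r
# First six rows of Gender and Exam_Score from head(data)
toy <- data.frame(
  Gender     = c("Male", "Female", "Male", "Male", "Female", "Male"),
  Exam_Score = c(67, 61, 74, 71, 70, 71)
)

# Apply a summary function to Exam_Score within each gender
group_mean   <- tapply(toy$Exam_Score, toy$Gender, mean)
group_median <- tapply(toy$Exam_Score, toy$Gender, median)
group_range  <- tapply(toy$Exam_Score, toy$Gender, function(x) diff(range(x)))

# Combine into one table with one row per group
cbind(mean = group_mean, median = group_median, range = group_range)
```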

Exercise 8

library(ggplot2)
ggplot(data, aes(x=Exam_Score)) +
  geom_histogram(position = "dodge", binwidth = 6, colour = "darkblue", fill = "cyan3") +
  facet_wrap(~Gender, ncol = 1)

These histograms show the frequency of exam scores for each gender. The shape of the two distributions appears similar, but the bars for male students are taller because there are more males in the data (127 compared to 73).

ggplot(data, aes(x = Household_Income, y = Exam_Score, fill = Gender)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral") + 
  xlab("Household Income") + 
  ylab("Exam Score") + 
  labs(fill = "Gender")

These boxplots show how exam results vary with household income and whether that pattern differs by gender. Household income seems to have a bigger influence for male students, but overall it is unclear whether there is any significant difference in performance due to household income. There are two outlier observations that clearly outperform the other students. Both are female.
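Points drawn as outliers in a boxplot are values that lie more than 1.5 times the interquartile range beyond the box. The base R function boxplot.stats() applies this rule and returns the flagged values. A minimal sketch, assuming an illustrative vector made of the six female scores printed by head(dataF) plus the maximum of 100 reported by describeBy (this is not the full female subset, and ggplot2 computes the box edges slightly differently, so results on the real data may differ):

```r
# Illustrative scores: six values from head(dataF) plus the reported maximum
female_scores <- c(61, 70, 64, 65, 67, 69, 100)

# boxplot.stats() returns the five-number summary and the outlying values
stats <- boxplot.stats(female_scores)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values beyond 1.5 * IQR from the box
```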

library(car)
scatterplot(Exam_Score ~ Previous_Scores | Gender, ylab = "Exam result",
            xlab = "Previous Exam result", smooth = FALSE, data = data)

In this scatterplot we can see the relation between the previous exam result and the current exam result for the two genders. The exam results are lower on average this time and much less spread out: most students have around 60 to 75 points, whereas the previous scores range from 50 to 100 points. Only one female student reached 100 points. There doesn't seem to be a clear correlation between previous exam score and final exam score.
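A visual impression of correlation can be checked numerically with cor(), which returns the Pearson correlation coefficient (values near 0 indicate little linear relation, values near 1 or -1 a strong one). A minimal sketch using only the six rows printed by head(data) after the edit to row 2; six rows are far too few to say anything about the full 200 observations, so this only shows the call (the names `previous`, `exam` and `r` are hypothetical):

```r
# First six rows of Previous_Scores and Exam_Score from head(data)
previous <- c(73, 60, 91, 98, 65, 89)
exam     <- c(67, 61, 74, 71, 70, 71)

# Pearson correlation coefficient between the two score vectors
r <- cor(previous, exam)
r
```

On the full data the corresponding call would be `cor(data$Previous_Scores, data$Exam_Score)`.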