data <- read.csv("StudentPerformanceFactors.csv", header = TRUE)
head(data)
Hours_Studied Attendance Parental_Involvement Access_to_Resources
1 23 84 Low High
2 19 64 Low Medium
3 24 98 Medium Medium
4 29 89 Low Medium
5 19 92 Medium Medium
6 19 88 Medium Medium
Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
1 No 7 73 Low
2 No 8 59 Low
3 Yes 7 91 Medium
4 Yes 8 98 Medium
5 Yes 6 65 Medium
6 Yes 8 89 Medium
Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
1 Yes 0 Low Medium Public
2 Yes 2 Medium Medium Public
3 Yes 2 Medium Medium Public
4 Yes 1 Medium Medium Public
5 Yes 3 Medium High Public
6 Yes 3 Medium Medium Public
Peer_Influence Physical_Activity Learning_Disabilities
1 Positive 3 No
2 Negative 4 No
3 Neutral 4 No
4 Negative 4 No
5 Neutral 4 No
6 Positive 3 No
Parental_Education_Level Distance_from_Home Gender Exam_Score
1 High School Near Male 67
2 College Moderate Female 61
3 Postgraduate Near Male 74
4 High School Moderate Male 71
5 College Near Female 70
6 Postgraduate Near Male 71
data <- data[1:200, c(-3,-4,-5,-6,-8,-9,-10,-12,-13,-14,-15,-16,-17,-18)]
head(data)
Hours_Studied Attendance Previous_Scores Family_Income Gender Exam_Score
1 23 84 73 Low Male 67
2 19 64 59 Medium Female 61
3 24 98 91 Medium Male 74
4 29 89 98 Medium Male 71
5 19 92 65 Medium Female 70
6 19 88 89 Medium Male 71
First, I reduced the dataset so that I would not have to work with all 6607 observations and 20 variables. After dropping rows and columns, the data contains 200 observations and six variables: Hours_Studied, Attendance, Previous_Scores, Family_Income, Gender, and Exam_Score.
Hours_Studied is the number of hours spent per week studying.
Attendance is the percentage of classes attended.
Previous_Scores is the score for the previous exam.
Family_Income is the family income level in three categories (Low, Medium, High).
Gender is the gender of the student (Male, Female).
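The column selection above relies on negative positions, which are easy to get wrong when the file layout changes. As a minimal sketch, the same six columns can be picked by name. The vector all_cols is copied from the head(data) output above; keep, by_position and the helper reduce_data are hypothetical names introduced here for illustration and are not part of the original analysis.

```r
# Column names of the full dataset, in the order shown by head(data) above
all_cols <- c(
  "Hours_Studied", "Attendance", "Parental_Involvement", "Access_to_Resources",
  "Extracurricular_Activities", "Sleep_Hours", "Previous_Scores",
  "Motivation_Level", "Internet_Access", "Tutoring_Sessions", "Family_Income",
  "Teacher_Quality", "School_Type", "Peer_Influence", "Physical_Activity",
  "Learning_Disabilities", "Parental_Education_Level", "Distance_from_Home",
  "Gender", "Exam_Score"
)

# Columns to keep, written out by name
keep <- c("Hours_Studied", "Attendance", "Previous_Scores",
          "Family_Income", "Gender", "Exam_Score")

# Dropping positions 3-6, 8-10 and 12-18 leaves exactly these six columns
by_position <- all_cols[-c(3:6, 8:10, 12:18)]
print(identical(by_position, keep))

# Hypothetical helper: first n rows (200 by default), named columns only
reduce_data <- function(df, n = 200) {
  df[seq_len(min(n, nrow(df))), keep, drop = FALSE]
}
```

With the full file loaded, reduce_data(data) should give the same result as the positional subset, under the assumption that the column order matches the output shown above.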
Hours_Studied Attendance Previous_Scores Family_Income Gender
Min. : 4.00 Min. : 60.0 Min. : 50.00 High :36 Female: 73
1st Qu.:16.00 1st Qu.: 70.0 1st Qu.: 65.00 Low :76 Male :127
Median :20.00 Median : 79.0 Median : 77.50 Medium:88
Mean :19.79 Mean : 80.2 Mean : 76.44
3rd Qu.:23.00 3rd Qu.: 91.0 3rd Qu.: 89.00
Max. :36.00 Max. :100.0 Max. :100.00
Exam_Score
Min. : 60.00
1st Qu.: 65.00
Median : 67.00
Mean : 67.33
3rd Qu.: 69.00
Max. :100.00
I then converted Gender and Family_Income into factors. The summary shows 73 female and 127 male students. By family income, the 200 students split into 36 in the High category, 76 in Low, and 88 in Medium.
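The conversion code itself is not shown, so the following is only a sketch of how it might look. It uses the six rows printed by head(data) above as sample_rows (a name introduced here). Setting an explicit level order for Family_Income is an assumption; the summary above lists High, Low, Medium, which suggests the original used the default alphabetical order.

```r
# First six rows of the reduced data, copied from the head(data) output above
sample_rows <- data.frame(
  Hours_Studied   = c(23, 19, 24, 29, 19, 19),
  Attendance      = c(84, 64, 98, 89, 92, 88),
  Previous_Scores = c(73, 59, 91, 98, 65, 89),
  Family_Income   = c("Low", "Medium", "Medium", "Medium", "Medium", "Medium"),
  Gender          = c("Male", "Female", "Male", "Male", "Female", "Male"),
  Exam_Score      = c(67, 61, 74, 71, 70, 71),
  stringsAsFactors = FALSE
)

# Convert the two categorical columns into factors
sample_rows$Gender <- factor(sample_rows$Gender)
# Assumption: explicit level order; factor() alone would sort alphabetically
sample_rows$Family_Income <- factor(sample_rows$Family_Income,
                                    levels = c("Low", "Medium", "High"))

# summary() now reports counts per level for the factor columns
print(summary(sample_rows))

# The same counts, one factor at a time
gender_counts <- table(sample_rows$Gender)
income_counts <- table(sample_rows$Family_Income)
print(gender_counts)
print(income_counts)
```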
data[2, 3] <- 60
After that, I changed the Previous_Scores value of one student (row 2) by only a few points, as I didn’t want
Descriptive statistics by group
group: Female
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 73 67.47 4.98 67 66.95 2.97 61 100 39 3.88 22.62 0.58
------------------------------------------------------------
group: Male
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 127 67.26 3.15 67 67.25 2.97 60 76 16 0.08 -0.47 0.28
With the function describeBy we can now compare male and female students. The two groups have very similar means of around 67 (67.47 for females, 67.26 for males), and both medians are 67, so in each group roughly half of the scores lie at or below 67. Female students have the wider range of exam scores, from 61 to 100 (a range of 39), while male students only score between 60 and 76 (a range of 16).
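The call that produced the table is not shown; with the psych package it was presumably something like describeBy(data$Exam_Score, group = data$Gender). As a rough base R sketch of the same idea, the block below computes a few of those statistics per gender for the six rows from head(data) above. The helper score_stats is a hypothetical name, and it covers only part of what describeBy reports.

```r
# Exam scores and genders of the six rows shown by head(data) above
exam_score <- c(67, 61, 74, 71, 70, 71)
gender <- factor(c("Male", "Female", "Male", "Male", "Female", "Male"))

# Hypothetical helper: a subset of the statistics describeBy reports
score_stats <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
    min = min(x), max = max(x), range = max(x) - min(x))
}

# Split the scores by gender, summarise each group, one row per group
by_gender <- t(sapply(split(exam_score, gender), score_stats))
print(round(by_gender, 2))
```

On the full data, replacing the two vectors with data$Exam_Score and data$Gender should give the n, mean, sd, median, min, max and range columns of the table above.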
In these graphs we can see the frequency of exam scores for each gender. The two distributions have a similar shape, although there are more male students in the data.
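Because the two groups differ in size, raw frequencies are harder to compare than shares within each group. The sketch below shows one way to tabulate that, again using only the six rows from head(data) above. The bin edges are an assumption chosen to cover these six scores; they are not taken from the original plots.

```r
# Exam scores and genders of the six rows shown by head(data) above
exam_score <- c(67, 61, 74, 71, 70, 71)
gender <- factor(c("Male", "Female", "Male", "Male", "Female", "Male"))

# Assumed bin edges, wide enough to cover the six sample scores
score_bin <- cut(exam_score, breaks = c(60, 65, 70, 75), include.lowest = TRUE)

# Counts per gender and score bin
counts <- table(gender, score_bin)

# Shares within each gender: every row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```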
library(ggplot2)

# The income column is named Family_Income in the reduced data shown above
ggplot(data, aes(x = Family_Income, y = Exam_Score, fill = Gender)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral") +
  xlab("Household Income") +
  ylab("Exam Score") +
  labs(fill = "Gender")
These boxplots show how family income relates to the exam results and whether that relationship differs by gender. Family income seems to have a bigger influence for male students, but overall it is unclear whether household income makes any significant difference in performance. There are two outlier observations that outperform the other students; both are female.
In this scatterplot we can see the relation between the previous exam result and the current exam result for the two genders. The exam results are noticeably lower this time: most students score around 60 to 75 points, whereas the previous scores ranged from 50 to 100 points. Only one female student reached 100 points. There doesn't seem to be a correlation between previous exam score and final exam score.
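The visual impression from the scatterplot can be checked numerically with a correlation coefficient. The sketch below only illustrates the calls, using the six rows from head(data) above (before the manual change to row 2). Six rows are far too few to say anything about the 200 students, so the value it prints should not be read as a result of this analysis.

```r
# Previous and current scores of the six rows shown by head(data) above
previous_scores <- c(73, 59, 91, 98, 65, 89)
exam_score <- c(67, 61, 74, 71, 70, 71)

# Pearson correlation coefficient (the default method of cor)
r <- cor(previous_scores, exam_score)

# cor.test adds a p-value and a confidence interval for the same coefficient
test <- cor.test(previous_scores, exam_score)

print(r)
print(test$p.value)
```

On the full data the same two calls would take data$Previous_Scores and data$Exam_Score; a coefficient close to zero would support the reading of the scatterplot.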