Fontys

Fontys University of Applied Sciences

Professor: Langenhuizen, Marco

Subject: Statistics


ABSTRACT

The purpose of this report is to identify factors that can influence students’ academic performance, so that these factors can be managed and students’ performance improved. The raw data analysed here come from research carried out by the author. The data set covers gender, age, sleeping hours, gaming hours, music, and distance from home to school, all of which may affect students’ studies. This report analyses the data set to see whether these factors really matter to students.


1. INTRODUCTION

Students are the assets of a nation. In other words, the number of educated people shapes the strength of a nation in terms of employment rate, high-skilled workers, professional careers, and so on. Therefore, investing time and effort in understanding how students perform in school and which factors influence their studies is a valuable subject of research.

Additionally, it is clear that students’ performance in school has a strong effect on the economy and society later on. If students are given the best conditions in which to develop, this will bring a bright future for them and for society as well. Much research has been carried out to determine which elements affect students’ academic performance, and the results vary.

This research focuses on the personal problems that every student faces during their education. The data set covers 100 random students from around the world and was collected by the author. It includes the following attributes: Gender, Age, Sleeping Hours, Music, Gaming Hours, Average Score, Distance from home to school, and Student’s Performance. The graph below indicates the relations between student’s performance and the factors:

[Figure: Relationships]

2. METHODS AND MATERIALS

2.1 RESEARCH QUESTION

What Factors May Influence Students’ Academic Performance?

2.2 RESEARCH OBJECTIVE

The research objective is to identify as many factors affecting students’ academic performance as possible, so that we have a clearer view, from the students’ perspective, of what prevents them from achieving high scores in school.

2.3 METHOD

From the data set, the author would like to know the relationship between each attribute and students’ academic performance. Additionally, the correlation between the factors and performance needs to be examined in order to test whether the hypotheses stated later in the document hold.

To analyse a data set, it is necessary to understand its general picture, such as how many factors or numerical attributes it contains. From that, the author can decide which method is best to use for each attribute.

First of all, we find general information about the data set, such as the average, minimum, maximum, and frequency of each element. This gives an overview of what we are analysing and how to proceed. The central tendency as well as the dispersion of the data set need to be found.

Functions: summary(), hist()…
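As a minimal sketch of the summary measures named in this step, the base R functions below compute measures of centre and spread for a single numeric attribute. The vector sleeping_hours holds hypothetical values chosen for illustration; it is not taken from the author's data set.

```r
# Hypothetical sleeping hours for eight students (illustrative values only,
# not taken from the report's data set).
sleeping_hours <- c(4, 5, 5, 6, 7, 8, 8, 10)

# Measures of centre
mean_hours   <- mean(sleeping_hours)
median_hours <- median(sleeping_hours)

# Measures of spread
sd_hours    <- sd(sleeping_hours)     # sample standard deviation
iqr_hours   <- IQR(sleeping_hours)    # interquartile range (Q3 - Q1)
range_hours <- range(sleeping_hours)  # minimum and maximum

print(c(mean = mean_hours, median = median_hours,
        sd = sd_hours, IQR = iqr_hours))
```

summary() reports the same minimum, quartiles, median, mean, and maximum in one call, while sd() and IQR() add the spread measures that summary() does not print directly.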

Secondly, the relations between the factors and students’ academic performance are plotted to see whether the factors affect students’ studies.

Functions: ggplot(), abline(), par(), multiplot()…

Thirdly, the correlations in the data set are calculated and then visualized in a graph; from that, we can conclude whether the relations are strong or weak.

Functions: ggplot(), ggcorr(), lm(), geom_boxplot()…

Finally, summing up all the information, we can draw a conclusion about which factors should be seriously considered as influences on students’ academic performance.

Note: The explanation is given right below every visualization.

2.4 DATA ANALYSIS AND INTERPRETATION OF THE RESULTS

Loading Dataset

dfStudent <- read.csv("D:\\ICT English Stream\\ERP&BI Minor-Steering companies using ERP, BI and Big Data\\Statistics (STAT)\\Datasets\\Factors Affect Studying Performance.csv") 
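The counts per category shown by summary() for text columns (for example F and M) rely on those columns being read as factors. The sketch below illustrates this with a small stand-in file; the file contents, the name demo_path, and the data frame dfDemo are hypothetical and only stand in for the author's CSV. It assumes a current R version, where read.csv() no longer converts strings to factors by default (the default changed in R 4.0.0).

```r
# Write a tiny stand-in CSV to a temporary file
# (hypothetical rows, not the report's data).
demo_path <- tempfile(fileext = ".csv")
writeLines(c("Gender,Age,Music,AverageScore",
             "F,19,Yes,8.1",
             "M,27,No,6.4",
             "F,31,No,7.0"),
           demo_path)

# stringsAsFactors = TRUE reads text columns as factors, so that
# summary() prints a count per category instead of "character".
dfDemo <- read.csv(demo_path, stringsAsFactors = TRUE)

str(dfDemo)      # column types
summary(dfDemo)  # counts for factors, quartiles for numeric columns
```

Reading the real file with the same argument should give the same kind of summary regardless of the R version in use.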

General information about the dataset is given below:

summary(dfStudent)
##  Gender      Age        SleepingHours   Music     GamingHours  
##  F:49   Min.   :17.00   Min.   : 4.00   No :53   Min.   :0.00  
##  M:51   1st Qu.:20.00   1st Qu.: 5.00   Yes:47   1st Qu.:1.00  
##         Median :27.00   Median : 6.50            Median :2.00  
##         Mean   :26.23   Mean   : 6.48            Mean   :2.09  
##         3rd Qu.:31.00   3rd Qu.: 8.00            3rd Qu.:3.00  
##         Max.   :35.00   Max.   :10.00            Max.   :4.00  
##     Distance      AverageScore   AcademicPerformance
##  Min.   : 2193   Min.   :2.200   Bad      :24       
##  1st Qu.:23718   1st Qu.:6.600   Excellent:35       
##  Median :58258   Median :7.600   Good     :41       
##  Mean   :52050   Mean   :7.136                      
##  3rd Qu.:76755   3rd Qu.:8.900                      
##  Max.   :98436   Max.   :9.900

As the summary shows, the data set is fairly balanced between female and male students (49 and 51) and between students who do and do not often listen to music or play instruments (47 and 53). Academic performance looks quite good overall (41 Good and 35 Excellent), but 24 students are still rated Bad; therefore, more analysis is carried out to find the causes.

# Arrange the following histograms in a 3 x 2 grid
par(mfrow = c(3, 2))
hist(dfStudent$Age, col = "skyblue", xlab = "Student Age", main = "Dispersity in Student Age")

hist(dfStudent$SleepingHours, col = "skyblue", xlab = "Sleeping Hours", main = "Dispersity in Sleeping Hours")

hist(dfStudent$GamingHours, col = "skyblue", xlab = "Gaming Hours", main = "Dispersity in Gaming Hours")

hist(dfStudent$Distance, col = "skyblue", xlab = "Distance Between Home and School", main = "Dispersity in Distance")

hist(dfStudent$AverageScore, col = "skyblue", xlab = "Average Score", main = "Dispersity in Average Score")

The five graphs above show the dispersion of each numeric attribute in the data set. In “Dispersity in Student Age”, student ages are widely spread, although the youngest group, around age 17, contains the most students. Many students sleep very little, as few as 4 hours per day; this may be the factor that affects study performance the most, and it might be explained by the time spent on gaming. Additionally, most students live far from school, which could also cut into their sleeping time. Finally, the average scores lean towards the positive side, with more good and excellent students than bad ones.

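library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
# Box plots of average score by gender and by music habit
p1 <- ggplot(dfStudent, aes(Gender, AverageScore, fill = Gender)) + geom_boxplot() + ggtitle("Gender vs Average Score")

p2 <- ggplot(dfStudent, aes(Music, AverageScore, fill = Music)) + geom_boxplot() + ggtitle("Music vs Average Score")

# multiplot() is not part of ggplot2; it must be defined or loaded beforehand
# (assumed here to be the helper function published in Cookbook for R)
multiplot(p1, p2, cols = 2)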

The graphs above indicate that both male and female students study fairly well. However, there are more low-scoring female students than male students, which suggests that male students perform somewhat better than female students in this sample. In the music graph, students who often listen to or play music clearly have better scores than students who do not.
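A visual difference between two box plots can be checked with a formal test. The sketch below is one possible way to do so, using Welch's two-sample t-test from base R on simulated scores. The data frame scores_demo, its group means, and the seed are all hypothetical assumptions for illustration; this is not the author's analysis and does not reproduce the report's data.

```r
set.seed(42)  # make the simulated (hypothetical) data reproducible

# Simulated average scores for 30 female and 30 male students.
# The group means and standard deviation are illustrative assumptions.
scores_demo <- data.frame(
  Gender       = factor(rep(c("F", "M"), each = 30)),
  AverageScore = c(rnorm(30, mean = 7.0, sd = 1.5),
                   rnorm(30, mean = 7.3, sd = 1.5))
)

# Welch two-sample t-test: do the mean scores of the two groups differ?
gender_test <- t.test(AverageScore ~ Gender, data = scores_demo)
print(gender_test)

# A p-value below a chosen threshold (0.05 is a common convention)
# would suggest a difference beyond what chance alone explains.
gender_p <- gender_test$p.value
```

The same call could be applied to the Music column, since it also splits the students into two groups.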

library(ggplot2)

# Box plots of each numeric factor, grouped by academic performance
p3 <- ggplot(dfStudent, aes(AverageScore, Age, fill = AcademicPerformance)) + geom_boxplot() +
    ggtitle("Age vs Average Score")

p4 <- ggplot(dfStudent, aes(AverageScore, SleepingHours, fill = AcademicPerformance)) + geom_boxplot() +
    ggtitle("Sleeping Hours vs Average Score")

p5 <- ggplot(dfStudent, aes(AverageScore, GamingHours, fill = AcademicPerformance)) + geom_boxplot() +
    ggtitle("Gaming Hours vs Average Score")

p6 <- ggplot(dfStudent, aes(AverageScore, Distance, fill = AcademicPerformance)) + geom_boxplot() +
    ggtitle("Distance vs Average Score")

multiplot(p3, p4, p5, p6, cols = 2)

The relationships between the four factors and students’ academic performance are visible in the graphs above. A short explanation of each graph follows:

In the first graph, older students seem to perform worse than younger ones: students in the 25-35 age range tend to have lower results than younger students.

In the “Sleeping Hours” graph, students who sleep less have worse results than those who sleep 6 or more hours per day.

Gaming time ranges from 0 to 4 hours per day. According to the graph, students who play for a reasonable amount of time (mostly under 2 hours) study better than those who spend more than 2 hours per day on games.

For the “Distance” factor, distance appears to matter to students’ results. Students who live far away, around 30 km from school, have lower performance than others.

2.5 TESTING OF HYPOTHESES

To carry out further analysis, the author formulated 5 hypothesis questions.

H1: Is there any relationship between gender and student’s performance?

H2: Is there any relationship between age and student’s performance?

H3: Is there any relationship between sleeping hours and student’s performance?

H4: Is there any relationship between gaming hours and student’s performance?

H5: Is there any relationship between the distance from home to school and student’s performance?

To answer whether the stated hypotheses hold, a linear regression between each factor and the average score (student’s performance) is fitted. The correlation between them is also calculated so that we can determine how strong each relationship is.

library(ggplot2)
# Fit a simple linear regression of average score on each factor and
# overlay the fitted line on a scatter plot
mod1 <- lm(AverageScore ~ Age, data = dfStudent)
r1 <- ggplot(dfStudent, aes(Age, AverageScore)) +
  geom_point(pch = 19, col = "steelblue") +
  geom_abline(intercept = mod1$coefficients[1], slope = mod1$coefficients[2], col = "firebrick") +
  ggtitle("Age vs Average Score")

mod2 <- lm(AverageScore ~ SleepingHours, data = dfStudent)
r2 <- ggplot(dfStudent, aes(SleepingHours, AverageScore)) +
  geom_point(pch = 19, col = "steelblue") +
  geom_abline(intercept = mod2$coefficients[1], slope = mod2$coefficients[2], col = "firebrick") +
  ggtitle("Sleeping(h) vs Average Score")

mod3 <- lm(AverageScore ~ GamingHours, data = dfStudent)
r3 <- ggplot(dfStudent, aes(GamingHours, AverageScore)) +
  geom_point(pch = 19, col = "steelblue") +
  geom_abline(intercept = mod3$coefficients[1], slope = mod3$coefficients[2], col = "firebrick") +
  ggtitle("Gaming(h) vs Average Score")

mod4 <- lm(AverageScore ~ Distance, data = dfStudent)
r4 <- ggplot(dfStudent, aes(Distance, AverageScore)) +
  geom_point(pch = 19, col = "steelblue") +
  geom_abline(intercept = mod4$coefficients[1], slope = mod4$coefficients[2], col = "firebrick") +
  ggtitle("Distance(m) vs Average Score")

multiplot(r1, r2, r3, r4, cols = 2)

Looking at the graphs above, the factors mentioned all appear to be connected to students’ results in some way. Each graph contains a fitted regression line; the predicted score differs from point to point along the line, but the direction of the slope tells us whether the score increases or decreases with the factor. Taking hypothesis 2 (age) as an example, the slope of the line shows that older students have lower average scores than younger ones. All the lines in the graphs are sloped, which suggests that all four factors are related to students’ academic performance, although a slope alone does not show how strong or statistically significant a relationship is.

The graphs above give a first impression of whether each relationship is strong or weak. To give a clearer view, we calculate the correlations between all the attributes and draw a ggcorr plot, which makes the strength of each relationship easier to read.
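The direction of a fitted line can be complemented by a numerical check of the slope. The sketch below shows one way this might be done with base R: summary() of the fitted model gives the slope estimate with its p-value, and cor.test() gives the Pearson correlation with its own test. The data frame sleep_demo is simulated under assumed values (intercept 4, slope 0.5, noise standard deviation 1); it is hypothetical and does not reproduce the author's data or results.

```r
set.seed(1)  # make the simulated (hypothetical) data reproducible

# Simulate 100 students whose score rises with sleeping hours.
# The intercept, slope, and noise level are illustrative assumptions.
n <- 100
sleep_demo <- data.frame(SleepingHours = runif(n, min = 4, max = 10))
sleep_demo$AverageScore <- 4 + 0.5 * sleep_demo$SleepingHours + rnorm(n, sd = 1)

# Simple linear regression of score on sleeping hours
mod_demo   <- lm(AverageScore ~ SleepingHours, data = sleep_demo)
coef_table <- summary(mod_demo)$coefficients
print(coef_table)  # estimate, standard error, t value, p-value

# p-value for the null hypothesis "the slope is zero"
slope_p <- coef_table["SleepingHours", "Pr(>|t|)"]

# Pearson correlation with its significance test
cor_demo <- cor.test(sleep_demo$SleepingHours, sleep_demo$AverageScore)
print(cor_demo$estimate)
```

With a single predictor, the p-value of the slope and the p-value of the Pearson correlation test coincide, so either can be used to judge whether a relationship is distinguishable from chance.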

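# Pairwise Pearson correlations between the numeric attributes
correl <- cor(dfStudent[, c("AverageScore", "Age", "SleepingHours", "GamingHours", "Distance")])

library(ggplot2)
source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")

# correl is already a correlation matrix, so it is passed via cor_matrix;
# passing it as data would make ggcorr() correlate the correlations again
ggcorr(data = NULL, cor_matrix = correl, label = TRUE) + ggtitle("Factors Bothering Student's Academic Performance")
## Warning: package 'reshape2' was built under R version 3.2.4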

Strength of Correlation

The correlation plot above shows the strength of the relationships among the attributes in the data set. Here we look only at the last row, which gives the strength of the relationship between Average Score (student’s performance) and the other four factors: Age, Sleeping Hours, Gaming Hours, and Distance.

From the visualization, we can conclude that all four factors have a moderate correlation with students’ results. In other words, they can be seen as factors that influence students’ rank in school.
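The values read from the plot can also be taken directly from the correlation matrix. The sketch below extracts the correlations with the average score and attaches a verbal label to each. The data frame factors_demo is simulated and hypothetical, and the cut-offs of 0.3 and 0.7 are an illustrative assumption; conventions for "weak", "moderate", and "strong" vary between sources, and the report does not state which one it uses.

```r
set.seed(7)  # make the simulated (hypothetical) data reproducible

# Simulated stand-in for the numeric attributes (illustrative only)
n <- 100
factors_demo <- data.frame(SleepingHours = runif(n, min = 4, max = 10),
                           GamingHours   = runif(n, min = 0, max = 4))
factors_demo$AverageScore <- 5 + 0.4 * factors_demo$SleepingHours -
  0.3 * factors_demo$GamingHours + rnorm(n)

# Pairwise Pearson correlations
correl_demo <- cor(factors_demo[, c("AverageScore", "SleepingHours", "GamingHours")])

# Keep only the correlations with AverageScore, dropping its self-correlation
score_cor <- correl_demo["AverageScore", colnames(correl_demo) != "AverageScore"]

# Label the absolute correlation using assumed cut-offs (0.3 and 0.7)
strength <- cut(abs(score_cor),
                breaks = c(0, 0.3, 0.7, 1),
                labels = c("weak", "moderate", "strong"),
                include.lowest = TRUE)

print(data.frame(correlation = round(score_cor, 2), strength = strength))
```

The sign of each correlation is kept in the printed table, so a negative value still indicates that the score falls as the factor rises.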

3. CONCLUSIONS

According to the analysis, the four factors examined in the data set all appear to affect students’ academic performance. In more detail, sleeping hours have the strongest relationship with performance; in other words, students should get enough sleep, with more than 6 hours per day recommended, to achieve the best study results. The other factors, namely age, time spent on games, and travelling from home to school, also affect students at a moderate level, as shown in the ggcorr plot. These factors should be taken seriously as well, so that students can understand why they are not achieving high performance and can use their time more effectively. Age and distance are difficult to change, but gaming hours are easy to adjust; students should therefore find a way to balance these factors to keep their studies at their best.

In conclusion, the hypotheses stated earlier in the report are supported: the four factors truly matter, and students should consider them carefully to improve their school performance. The table below again confirms the answers to the author’s hypothesis questions:
[Table: hypotheses]
