The City University of New York School of Professional Studies
Statistics and Probability for Data Analytics (DATA 606)

Final Project: Data Insights to Improve the School Education System

Alexis Mekueko

12/08/2020

Part 2 - Introduction

Many students fail in school for reasons other than their intelligence. Numerous factors contribute to student success, and that success depends in part on the ability of the school education system to act on them. These factors include weekly study time, extracurricular activities, travel time to school, family educational support, the student's desire to pursue higher education, companionship, and parents' job type. In this project, we study these factors to look for any correlation with student failure. If we find none, we would like to determine which factors contribute most to success. The goal is to help the school education system keep track of success and address the factors that negatively affect it.

Github Link: https://github.com/asmozo24/DATA606_Final_Project

Web link: https://rpubs.com/amekueko/697306

Part 2a - Benefits

A study of school data can help school officials make decisions about improving the school education system. This project seeks to make the data collected from the two schools (“GP” - Gabriel Pereira and “MS” - Mousinho da Silveira) reveal useful information, and so to help school officials plan strategies for a better education system. Ultimately, I plan to become a consultant, using my skills as a data scientist across various domains to present meaningful reports to government entities, companies, and organizations and help them make decisions. This project therefore also contributes to building the skills needed to succeed in data science.

Part 2b - Research question

Do students from Gabriel Pereira (GP) school do better in the Math course than those from Mousinho da Silveira (MS) school? We could also explore the correlation between time-related factors and student performance, and check some popular assumptions. For instance, some studies suggest that study time affects student performance. Let’s verify that in this study: do students who study at least 10 hours weekly do better in the Math course than those who study less?

Part 3 - Data

Part 3a - Data Acquisition

Data is collected or made available by archive.ics.uci.edu: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.

Part 3b - Data source

We found an interesting dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/. The data come from a study of students taking a Math and/or Portuguese language course. Each case represents a student at one of the two schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira). The Math dataset has 395 observations and 33 variables. The data is fairly rich and comes with a txt file that describes all the variables, so there is no need to rename the columns. The original data format is comma delimited, and reading it into R was not easy, so we fixed it in Excel in one attempt. We are interested in the students taking the Math course.

Data available at https://github.com/asmozo24/DATA606_Project_Proposal

Using R to acquire data
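The loading step itself is not shown in this report. As a minimal sketch, assuming the cleaned file is comma delimited, it could look like the following. To stay self-contained, the example reads a few columns of the first six rows printed in Part 5 from inline text; the object name `sample_csv` and the file name mentioned in the comment are assumptions, not the author's actual code.

```r
# Hypothetical sketch: a few columns of the first six rows shown in this
# report, stored as comma-delimited text (the real Math file has 395 rows
# and 33 columns).
sample_csv <- "school,sex,age,studytime,absences,G1,G2,G3
GP,F,18,2,6,5,6,6
GP,F,17,2,4,5,5,6
GP,F,15,2,10,7,8,10
GP,F,15,3,2,15,14,15
GP,F,16,2,4,6,10,10
GP,M,16,2,10,15,15,15"

# read.csv() accepts inline text through its `text` argument. For a file on
# disk, something like read.csv("student-mat.csv") would be used instead
# (that file name is an assumption).
student_math <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

dim(student_math)    # number of rows and columns
head(student_math)   # first rows, as in Part 5
```

One detail worth keeping in mind: a column that contains only the values "T" or "F" is read as logical by `read.csv()`, so the sample deliberately includes a row with sex "M".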

Part 4 - Data Preparation / Data Wrangling

Part 4a - Cleaning data

What is the structure of the data?

## Rows: 395
## Columns: 33
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
## [1] "Data frame is composed of character, boolean and numerical."
## [1] 0
## [1] 0
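The output above lists only character and integer columns, and the two zeros are presumably counts of missing values (one per data frame). A minimal sketch of such checks in base R follows; the small data frame is made up for illustration (only the column names follow the dataset), and it contains one missing value on purpose so that the counts are not trivially zero.

```r
# Illustrative data frame: values are invented for this sketch, only the
# column names follow the dataset.
df <- data.frame(
  school   = c("GP", "GP", "MS"),
  age      = c(18L, 17L, 19L),
  absences = c(6L, 4L, NA),
  stringsAsFactors = FALSE
)

str(df)                                # column types, as in the output above

total_missing  <- sum(is.na(df))       # total number of missing cells
missing_by_col <- colSums(is.na(df))   # missing cells per column
column_types   <- sapply(df, class)    # class of each column

total_missing
missing_by_col
column_types
```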

Part 5 - Explore Data

Let’s take a look at the data frame…

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3 Var Var2
## 1        6  5  6  6 All    1
## 2        4  5  5  6 All    2
## 3       10  7  8 10 All    3
## 4        2 15 14 15 All    4
## 5        4  6 10 10 All    5
## 6       10 15 15 15 All    6
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        0       yes     no   no         no
## 4   mother          1         3        0        no    yes   no        yes
## 5   father          1         2        0        no    yes   no         no
## 6   mother          1         2        0        no    yes   no        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3 Var
## 1        4  0 11 11 All
## 2        2  9 11 11 All
## 3        6 12 13 12 All
## 4        0 14 14 14 All
## 5        0 11 13 13 All
## 6        6 12 12 13 All

The data frame presents 30 factors and 3 grade variables (G1, G2 and G3). These 3 variables are of particular interest because they are the students’ grades.

G1: first period grade (numeric: from 0 to 20)
G2: second period grade (numeric: from 0 to 20)
G3: final grade (numeric: from 0 to 20)

Let’s keep the research questions in mind. Do students at “GP” - Gabriel Pereira school or “MS” - Mousinho da Silveira school perform well? If yes, which factors contribute to students’ success? If no, which factors lead to students’ poor performance? One way to approach these questions is to look at the 3 grade variables, which together summarize one key element: student performance.

Let’s take a closer look at these 3 variables. We might introduce a bias by neglecting the fact that there are two schools in the data frame. How much does each school contribute to the data frame?

## student_math$school 
##        n  missing distinct 
##      395        0        2 
##                       
## Value         GP    MS
## Frequency    349    46
## Proportion 0.884 0.116
## [1] "Student distribution by school: 88.4% of students are from Gabriel Pereira School and 11.6% are from Mousinho da Silveira School"

## student_math$sex 
##        n  missing distinct 
##      395        0        2 
##                       
## Value          F     M
## Frequency    208   187
## Proportion 0.527 0.473
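The frequency tables above appear to come from a `describe()`-style summary. As a hedged base-R sketch, the same school proportions can be recovered with `table()` and `prop.table()`. Only the two counts (349 and 46) are taken from this report; the vector `school` is rebuilt from them for illustration.

```r
# Rebuild the school column from the counts reported above
# (349 GP students and 46 MS students).
school <- rep(c("GP", "MS"), times = c(349, 46))

counts      <- table(school)                   # frequency per school
proportions <- round(prop.table(counts), 3)    # share of each school

counts
proportions
```

With such an unbalanced split, any comparison between the two schools rests on only 46 MS students, which is worth remembering when reading the later results.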

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob reason
## 1     MS   M  18       R     GT3       T    3    2    other    other course
## 2     MS   M  19       R     GT3       T    1    1    other services   home
## 3     MS   M  17       U     GT3       T    3    3   health    other course
## 4     MS   M  18       U     LE3       T    1    3  at_home services course
## 5     MS   M  19       R     GT3       T    1    1    other    other   home
## 6     MS   M  17       R     GT3       T    4    3 services    other   home
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         1        1        no    yes   no         no
## 2    other          3         2        3        no     no   no         no
## 3   mother          2         2        0        no    yes  yes         no
## 4   mother          1         1        1        no     no   no         no
## 5    other          3         1        1        no    yes   no         no
## 6   mother          2         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1      no    yes      yes       no      2        5     5    5    5      5
## 2     yes    yes      yes       no      5        4     4    3    3      2
## 3     yes    yes      yes       no      4        5     4    2    3      3
## 4     yes     no      yes      yes      4        3     3    2    3      3
## 5     yes    yes      yes       no      4        4     4    3    3      5
## 6      no    yes      yes      yes      4        5     5    1    3      2
##   absences G1 G2 G3 Var Var2 grade1 grade2 grade3
## 1       10 11 13 13 All  350      C      C      C
## 2        8  8  7  8 All  351      D      D      D
## 3        2 13 13 13 All  352      C      C      C
## 4        7  8  7  8 All  353      D      D      D
## 5        4  8  8  8 All  354      D      D      D
## 6        4 13 11 11 All  355      C      C      C
## Let's do summary on Math result 1 for students from Gabriel Pereira School
## student_math_GP$G1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      349        0       17    0.992    10.94    3.791        6        7 
##      .25      .50      .75      .90      .95 
##        8       11       13       16       16 
## 
## lowest :  3  4  5  6  7, highest: 15 16 17 18 19
##                                                                             
## Value          3     4     5     6     7     8     9    10    11    12    13
## Frequency      1     1     7    19    32    35    30    45    34    32    27
## Proportion 0.003 0.003 0.020 0.054 0.092 0.100 0.086 0.129 0.097 0.092 0.077
##                                               
## Value         14    15    16    17    18    19
## Frequency     27    21    21     8     7     2
## Proportion 0.077 0.060 0.060 0.023 0.020 0.006

Let’s see the summary statistics (mean, max, and quartiles) of G1 for students from Gabriel Pereira School

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   11.00   10.94   13.00   19.00

Part 6 - Data Analysis

## 
##  Students performance in Math Exam 1 from Gabriel Pereira School

## A better representation is graded letters
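The report does not state how numeric grades (0 to 20) are converted to letters. A minimal sketch using `cut()` is shown below. The cut points are an assumption chosen only because they reproduce the `grade3` letters of the six Mousinho da Silveira students printed earlier; the author's actual thresholds may differ.

```r
# Final grades (G3) of the six Mousinho da Silveira students shown earlier.
G3 <- c(13, 8, 13, 8, 8, 11)

# ASSUMED cut points (not stated in the report):
# F: 0-6, D: 7-9, C: 10-13, B: 14-16, A: 17-20.
grade3 <- cut(G3,
              breaks = c(-Inf, 6, 9, 13, 16, Inf),
              labels = c("F", "D", "C", "B", "A"))

as.character(grade3)   # letter grade per student
table(grade3)          # count per letter
```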

Let’s see the final Math grades (grade3), as letter grades, for the two schools.

## student_math_GP$grade3 
##        n  missing distinct 
##      349        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency     17    76   143    59    54
## Proportion 0.049 0.218 0.410 0.169 0.155
## student_math_MS$grade3 
##        n  missing distinct 
##       46        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency      1     6    22    10     7
## Proportion 0.022 0.130 0.478 0.217 0.152
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   11.00   10.49   14.00   20.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.000  10.000   9.848  12.750  19.000

## student_portuguese_GP$grade3 
##        n  missing distinct 
##      423        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency     10   136   245    27     5
## Proportion 0.024 0.322 0.579 0.064 0.012
## student_portuguese_MS$grade3 
##        n  missing distinct 
##      226        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency      7    41   110    53    15
## Proportion 0.031 0.181 0.487 0.235 0.066

At this point, we can either explore the students who did well in Math, with a final grade of A or B, to see whether they do something differently from those whose final grade is below C, or alternatively explore the students with a final grade of D or F.

Part 7 - Inference

Part 7a - Problem

Average number of absences for the top students (first summary) and the bottom students (second summary) registered in the Math course.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    2.00    3.78    6.00   24.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   4.000   6.762  10.000  75.000

Now, let’s create the two groups of students: top (T) and bottom (B).

Distributions of absences among top and bottom students registered in Math course.

t-test

## 
##  Welch Two Sample t-test
## 
## data:  absences by TB
## t = 2.9118, df = 184.84, p-value = 0.004036
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9613824 5.0016945
## sample estimates:
## mean in group B mean in group T 
##        6.761538        3.780000
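Read at the usual 5% level, the output above (p-value = 0.004) indicates that the mean number of absences differs between the two groups, with bottom students missing about 3 more classes on average. The call that produces such output can be sketched as below. The data here are simulated, not the real absences; the group sizes of 100 and 130 are an assumption inferred from the letter-grade counts (A or B versus D or F), and the Poisson means are only loosely based on the summaries above.

```r
set.seed(606)  # arbitrary seed so the simulated example is reproducible

# Simulated absences (NOT the real data): 100 "top" students (T) and
# 130 "bottom" students (B).
absences <- c(rpois(100, lambda = 3.8), rpois(130, lambda = 6.8))
TB <- factor(rep(c("T", "B"), times = c(100, 130)))

# t.test() uses the Welch version (unequal variances) by default.
result <- t.test(absences ~ TB)

result$method     # name of the test that was run
result$estimate   # group means, B first because levels are alphabetical
result$conf.int   # 95% confidence interval for mean(B) - mean(T)
result$p.value
```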

Part 7b - Correlation between amount of study time and result

We conduct a hypothesis test to evaluate whether the average grade differs between students who study at least ten hours a week and those who don’t.

- H_null: there is no difference in the average grade between students who study at least ten hours a week and those who don’t.
- H_alt: there is a difference in the average grade between students who study at least ten hours a week and those who don’t.
- case = students enrolled in the Math course
- sample = all students from both schools (GP and MS)

Let’s compare the mean final Math grade of students who study at least 10 hours weekly with that of students who study less.

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   studyTime10 meanFinal_grade
##   <chr>                 <dbl>
## 1 no                     10.4
## 2 yes                    11.3
## study10plus$grade3 
##        n  missing distinct 
##       27        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency      3     7    10     3     4
## Proportion 0.111 0.259 0.370 0.111 0.148
## 
## Let's see the math final grade distribution from the two schools based on 10+hrs weekly study time

## study10Less$grade3 
##        n  missing distinct 
##      368        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency     15    75   155    66    57
## Proportion 0.041 0.204 0.421 0.179 0.155
## 
## Let's see the math final grade distribution from the two schools based on 10+hrs weekly study time

## [1] -1.238795
## [1] 3.050792
## [1] 0.05
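The code behind the three values above is not shown. One common way to test a difference in means by hand is sketched below, under stated assumptions: the grades are simulated (not the real data), the group sizes 27 and 368 follow the tables above, the standard deviation of 4.5 is invented, and the degrees of freedom use the conservative "smaller group minus one" rule.

```r
set.seed(606)  # arbitrary seed for a reproducible simulated example

# Simulated final grades (NOT the real data); group sizes follow the report.
grade_10plus <- rnorm(27,  mean = 11.3, sd = 4.5)
grade_less   <- rnorm(368, mean = 10.4, sd = 4.5)

# Point estimate and standard error of the difference in means.
diff_means <- mean(grade_10plus) - mean(grade_less)
se <- sqrt(var(grade_10plus) / length(grade_10plus) +
           var(grade_less)   / length(grade_less))

# Conservative degrees of freedom: smaller group size minus one.
df <- min(length(grade_10plus), length(grade_less)) - 1

t_crit <- qt(0.975, df)                      # two-sided 95% interval
ci <- diff_means + c(-1, 1) * t_crit * se    # lower and upper bound

t_stat  <- diff_means / se
p_value <- 2 * pt(-abs(t_stat), df)          # two-sided p-value

ci
p_value
```

A 95% interval that contains 0 always goes with a two-sided p-value above 0.05 from the same t distribution, so the interval and the p-value should tell the same story.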

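Taking the reported p-value of 0.05 against a significance level of alpha = 0.1, we reject the null hypothesis and conclude that the average final grade differs between students who study at least ten hours a week and those who don’t. This conclusion should be read with caution: the two values printed just before the p-value (-1.24 and 3.05) are centered on the observed difference in means (about 0.9), and if they are the bounds of a confidence interval for that difference, the interval contains 0, which would not support rejecting the null hypothesis.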

Part 8 - Conclusion

Part 8a - Findings

Part 8b - Challenges
