The City University of New York School of Professional Studies

Data Acquisition and Management (DATA 607)


Final Project
Data Insights to Improve school Education System

Alexis Mekueko and DH Kim
email:


Alexis Mekueko

12/9/2020

Introduction

Github Link: https://github.com/asmozo24/DATA607_Final_Project

Web link: https://rpubs.com/amekueko/697306

Benefits

Research question

Data Acquisition

Data collection

Data is collected or made available by archive.ics.uci.edu: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.

Data source

We found some interesting dataset from -> data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00320/. This data is about a study on students(395) taking math or/and portuguese language course. Each case represents a student at one of the two schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira). There are 395 observations in the given dataset. The data is pretty rich with a txt file that described all variables in the data. therefore there is no need to rename the column. The orignal data format is comma delimited and rendering from R was not easy. So, we used excel with one attemp to fix it. We are interested in the student taking Math course. with 33 variables.

Using R to acquire data

Using SQL to acquire data

Data Preparation / Data Wrangling

Cleaning data

What is the structure of data?

## Rows: 395
## Columns: 33
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## [1] 649  33
## [1] 0
## [1] 0

Explore Data

Let’s take a look at the data frame…

Visualize students distribution per school in Math Course

## student_math$G3 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      395        0       18    0.992    10.42    4.992      0.0      5.0 
##      .25      .50      .75      .90      .95 
##      8.0     11.0     14.0     15.6     17.0 
## 
## lowest :  0  4  5  6  7, highest: 16 17 18 19 20
##                                                                             
## Value          0     4     5     6     7     8     9    10    11    12    13
## Frequency     38     1     7    15     9    32    28    56    47    31    31
## Proportion 0.096 0.003 0.018 0.038 0.023 0.081 0.071 0.142 0.119 0.078 0.078
##                                                     
## Value         14    15    16    17    18    19    20
## Frequency     27    33    16     6    12     5     1
## Proportion 0.068 0.084 0.041 0.015 0.030 0.013 0.003

- First, we need to organize the data frame into two data frame that represents the two schools.

## student_math_GP$G1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      349        0       17    0.992    10.94    3.791        6        7 
##      .25      .50      .75      .90      .95 
##        8       11       13       16       16 
## 
## lowest :  3  4  5  6  7, highest: 15 16 17 18 19
##                                                                             
## Value          3     4     5     6     7     8     9    10    11    12    13
## Frequency      1     1     7    19    32    35    30    45    34    32    27
## Proportion 0.003 0.003 0.020 0.054 0.092 0.100 0.086 0.129 0.097 0.092 0.077
##                                               
## Value         14    15    16    17    18    19
## Frequency     27    21    21     8     7     2
## Proportion 0.077 0.060 0.060 0.023 0.020 0.006
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   11.00   10.94   13.00   19.00

Data Analysis

- Let’s see the math exam1 graded from the two schools

Let’s see the Portuguese Final grade from the two schools.

- Stats summary from Gabriel Pereira School

## student_math_GP$grade3 
##        n  missing distinct 
##      349        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency     17    76   143    59    54
## Proportion 0.049 0.218 0.410 0.169 0.155
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   11.00   10.49   14.00   20.00
## student_math_MS$grade3 
##        n  missing distinct 
##       46        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency      1     6    22    10     7
## Proportion 0.022 0.130 0.478 0.217 0.152
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   8.000  10.000   9.848  12.750  19.000

- Let’s visualize the average score in Math course from the two schools.

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
## graphical parameter
## Warning in axis(1, at = 1:length(means), labels = legends, ...): "frame" is not
## a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
## graphical parameter

- Correlation between amount of study time and result

Conducting a hypothesis test to evaluate whether the average grade is different for those who study at least ten times a week than those who don’t.

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   studyTime10 meanFinal_grade
##   <chr>                 <dbl>
## 1 no                     10.4
## 2 yes                    11.3

- Let’s see the statical information about students final grade in Math based on 10+hrs weekly study time

## study10plus$grade3 
##        n  missing distinct 
##       27        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency      3     7    10     3     4
## Proportion 0.111 0.259 0.370 0.111 0.148

Let’s visualize the Math final grade distribution from the two schools based on 10+hrs weekly study time

- Statical information about students final grade in Math based on less than 10hrs Weekly study time

## study10Less$grade3 
##        n  missing distinct 
##      368        0        5 
## 
## lowest : A B C D F, highest: A B C D F
##                                         
## Value          A     B     C     D     F
## Frequency     15    75   155    66    57
## Proportion 0.041 0.204 0.421 0.179 0.155

- Overall students performance in Math course from the two school

# This can be rewrite as a function

n_yes <- nrow(study10plus) 
n_no <- nrow(study10Less)
df <- n_yes - 1
mean_no <- mean(study10Less$G3)
mean_yes <- mean(study10plus$G3)
sd_no <- sd(study10Less$G3)
sd_yes <- sd(study10plus$G3)
SE <- sqrt( (sd_yes^2)/n_yes + (sd_no^2)/n_no)
t_value <- qt(0.05/2, df, lower.tail = FALSE)
point_estimate <-  mean_yes - mean_no
lower_CI <- point_estimate - t_value * SE
upper_CI <- point_estimate + t_value * SE
lower_CI 
## [1] -1.238795
upper_CI
## [1] 3.050792
p_value <- 2*pt(t_value, df, lower.tail = FALSE)
p_value
## [1] 0.05

The p-value = 0.05 < alpha (0.1), thus we reject the null hypothesis. Thus, there is difference in the average grade for those who study at at least ten times a week than those who don’t.

Interpret Results

Challenges

References

  1. https://fall2020.data606.net/assignments/labs/

  2. file:///C:/Users/Petit%20Mandela/Documents/R/DATA606_Lab7/DATA606_Lab7/DATA606_Lab7.html

  3. https://www.statisticshowto.com/least-squares-regression-line/

  4. https://rcompanion.org/handbook/C_04.html

  5. https://data-flair.training/blogs/t-tests-in-r/

  6. https://rstatisticsblog.com/data-science-in-action/data-preprocessing/hypothesis-testing-in-r-with-examples-interpretations/

  7. https://www.r-graph-gallery.com/all-graphs.html

  8. http://www.sthda.com/english/wiki/ggplot2-barplot-easy-bar-graphs-in-r-software-using-ggplot2