Alexis Mekueko
12/9/2020
Github Link: https://github.com/asmozo24/DATA607_Final_Project
Web link: https://rpubs.com/amekueko/697306
Data is collected or made available by archive.ics.uci.edu: The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged.
We found some interesting dataset from -> data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00320/. This data is about a study on students(395) taking math or/and portuguese language course. Each case represents a student at one of the two schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira). There are 395 observations in the given dataset. The data is pretty rich with a txt file that described all variables in the data. therefore there is no need to rename the column. The orignal data format is comma delimited and rendering from R was not easy. So, we used excel with one attemp to fix it. We are interested in the student taking Math course. with 33 variables.
Using R to acquire data
Using SQL to acquire data
What is the structure of data?
## Rows: 395
## Columns: 33
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## [1] 649 33
## [1] 0
## [1] 0
Let’s take a look at the data frame…
Amount the 33 variables in the data frame, there are 03 variables (G1, G2 and G3) which represent the students’s grades.
These 03 variables are interesting as there are measures of students performances in the registered courses.
G1: first period grade (numeric: from 0 to 20) G2: second period grade (numeric: from 0 to 20) G3: final grade (numeric: from 0 to 20)
## student_math$G3
## n missing distinct Info Mean Gmd .05 .10
## 395 0 18 0.992 10.42 4.992 0.0 5.0
## .25 .50 .75 .90 .95
## 8.0 11.0 14.0 15.6 17.0
##
## lowest : 0 4 5 6 7, highest: 16 17 18 19 20
##
## Value 0 4 5 6 7 8 9 10 11 12 13
## Frequency 38 1 7 15 9 32 28 56 47 31 31
## Proportion 0.096 0.003 0.018 0.038 0.023 0.081 0.071 0.142 0.119 0.078 0.078
##
## Value 14 15 16 17 18 19 20
## Frequency 27 33 16 6 12 5 1
## Proportion 0.068 0.084 0.041 0.015 0.030 0.013 0.003
## student_math_GP$G1
## n missing distinct Info Mean Gmd .05 .10
## 349 0 17 0.992 10.94 3.791 6 7
## .25 .50 .75 .90 .95
## 8 11 13 16 16
##
## lowest : 3 4 5 6 7, highest: 15 16 17 18 19
##
## Value 3 4 5 6 7 8 9 10 11 12 13
## Frequency 1 1 7 19 32 35 30 45 34 32 27
## Proportion 0.003 0.003 0.020 0.054 0.092 0.100 0.086 0.129 0.097 0.092 0.077
##
## Value 14 15 16 17 18 19
## Frequency 27 21 21 8 7 2
## Proportion 0.077 0.060 0.060 0.023 0.020 0.006
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 8.00 11.00 10.94 13.00 19.00
## student_math_GP$grade3
## n missing distinct
## 349 0 5
##
## lowest : A B C D F, highest: A B C D F
##
## Value A B C D F
## Frequency 17 76 143 59 54
## Proportion 0.049 0.218 0.410 0.169 0.155
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 11.00 10.49 14.00 20.00
## student_math_MS$grade3
## n missing distinct
## 46 0 5
##
## lowest : A B C D F, highest: A B C D F
##
## Value A B C D F
## Frequency 1 6 22 10 7
## Proportion 0.022 0.130 0.478 0.217 0.152
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.000 10.000 9.848 12.750 19.000
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
## graphical parameter
## Warning in axis(1, at = 1:length(means), labels = legends, ...): "frame" is not
## a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "frame" is not a
## graphical parameter
Conducting a hypothesis test to evaluate whether the average grade is different for those who study at least ten times a week than those who don’t.
H_null: there is no difference in the average grade for those who study at at least ten times a week than those who don’t.
H_alt: there is difference in the average grade for those who study at at least ten times a week than those who don’t.
case = students enrolled in Math course.
sample is all students from both school (GP and MS).
Let’s see the difference between weekly study time and students final grade in Math
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## studyTime10 meanFinal_grade
## <chr> <dbl>
## 1 no 10.4
## 2 yes 11.3
## study10plus$grade3
## n missing distinct
## 27 0 5
##
## lowest : A B C D F, highest: A B C D F
##
## Value A B C D F
## Frequency 3 7 10 3 4
## Proportion 0.111 0.259 0.370 0.111 0.148
Let’s visualize the Math final grade distribution from the two schools based on 10+hrs weekly study time
## study10Less$grade3
## n missing distinct
## 368 0 5
##
## lowest : A B C D F, highest: A B C D F
##
## Value A B C D F
## Frequency 15 75 155 66 57
## Proportion 0.041 0.204 0.421 0.179 0.155
# This can be rewrite as a function
n_yes <- nrow(study10plus)
n_no <- nrow(study10Less)
df <- n_yes - 1
mean_no <- mean(study10Less$G3)
mean_yes <- mean(study10plus$G3)
sd_no <- sd(study10Less$G3)
sd_yes <- sd(study10plus$G3)
SE <- sqrt( (sd_yes^2)/n_yes + (sd_no^2)/n_no)
t_value <- qt(0.05/2, df, lower.tail = FALSE)
point_estimate <- mean_yes - mean_no
lower_CI <- point_estimate - t_value * SE
upper_CI <- point_estimate + t_value * SE
lower_CI
## [1] -1.238795
## [1] 3.050792
## [1] 0.05
The p-value = 0.05 < alpha (0.1), thus we reject the null hypothesis. Thus, there is difference in the average grade for those who study at at least ten times a week than those who don’t.