Context:
The data were obtained in a survey of students in math and Portuguese language courses in secondary school. They contain social, gender and study information about the students.
Content:
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
Source Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
# Load plyr before the tidyverse: attaching plyr after dplyr masks dplyr's
# arrange(), count(), mutate(), rename(), summarise() and other verbs.
library(plyr)       # provides mapvalues()
library(tidyverse)  # attaches ggplot2, tibble, tidyr, readr, purrr and dplyr
student_info <- read_csv("~/2 MSSA/463/datasets/student-por.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## age = col_integer(),
## Medu = col_integer(),
## Fedu = col_integer(),
## traveltime = col_integer(),
## studytime = col_integer(),
## failures = col_integer(),
## famrel = col_integer(),
## freetime = col_integer(),
## goout = col_integer(),
## Dalc = col_integer(),
## Walc = col_integer(),
## health = col_integer(),
## absences = col_integer(),
## G1 = col_integer(),
## G2 = col_integer(),
## G3 = col_integer()
## )
## See spec(...) for full column specifications.
View(student_info)
glimpse(student_info)
## Observations: 649
## Variables: 33
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "G...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other",...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "...
## $ reason <chr> "course", "course", "other", "home", "home", "reput...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", ...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes"...
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "ye...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no"...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10,...
## $ G1 <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 1...
## $ G2 <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13,...
## $ G3 <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12,...
I converted the weekend alcohol consumption variable (Walc) from integer to factor and relabeled its levels from 1 ("Very Low") to 5 ("Very High").
student_info$Walc <- as.factor(student_info$Walc)
student_info$Walc <- mapvalues(student_info$Walc,
                               from = 1:5,
                               to = c("Very Low", "Low", "Medium", "High", "Very High"))
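The same relabeling can also be done without plyr. The following is a minimal sketch using base R's factor(); walc_codes is a hypothetical toy vector standing in for student_info$Walc, not data from the survey.

```r
# Hypothetical toy vector standing in for student_info$Walc (integer codes 1-5).
walc_codes <- c(1L, 3L, 5L, 2L, 1L, 4L)

walc_labels <- c("Very Low", "Low", "Medium", "High", "Very High")

# factor() maps each code in `levels` to the label at the same position
# and keeps the levels in that order, so plots and tables sort them
# from "Very Low" to "Very High" rather than alphabetically.
walc_factor <- factor(walc_codes, levels = 1:5, labels = walc_labels)

print(walc_factor)
print(levels(walc_factor))
```

Because the level order is fixed explicitly, facets and legends built from this factor should follow the intended low-to-high order.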
ggplot(student_info, aes(x = age, fill = Walc)) +
  geom_histogram(binwidth = 1, colour = "black") +
  facet_grid(~Walc) +
  theme_minimal() +
  ggtitle("Weekend alcohol consumption per age") +
  xlab("Student's age")
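The counts that the faceted histogram displays can also be inspected as a contingency table. This is a small sketch with base R's table(); the toy data frame is hypothetical and only stands in for the age and Walc columns of student_info.

```r
# Hypothetical toy data; in the analysis above these would be
# student_info$age and student_info$Walc.
toy <- data.frame(
  age  = c(15, 15, 16, 16, 16, 17),
  Walc = factor(c(1, 3, 1, 1, 5, 3), levels = 1:5,
                labels = c("Very Low", "Low", "Medium", "High", "Very High"))
)

# Rows are ages, columns are consumption levels; each cell is a student count.
counts <- table(toy$age, toy$Walc)
print(counts)
```

Each column of the table corresponds to one facet of the histogram, and each row to one age bin when binwidth is 1.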
ggplot(student_info, aes(x = age, y = freetime, color = age)) +
  geom_jitter() +
  facet_grid(~reason, scales = "free") +
  scale_colour_gradientn(colours = rainbow(7))