Introductory Statistics (CRN: 6896)



Objective

Read the dataset, answer the questions below, and create a knitted PDF file. Then send it to me.

  • You’re allowed to look up the lab material from previous classes.
  • For each question, you need to include the code followed by your answer to the question (i.e., the output of the code).
  • If you don’t know the answer to a question, skip it.


The dataset

This data is from a survey of students in two secondary schools (Gabriel Pereira and Mousinho da Silveira). It contains a lot of interesting social, gender and study information about the students. You can see the table of variables and their descriptions below.


Variable    Description
school      student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex         student’s sex (binary: ‘F’ - female or ‘M’ - male)
age         student’s age (numeric: from 15 to 22)
address     student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize     family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus     parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu        mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu        father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob        mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob        father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason      reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian    student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime  Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime   Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures    number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup   extra educational support (binary: yes or no)
famsup      family educational support (binary: yes or no)
paid        extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities  extra-curricular activities (binary: yes or no)
nursery     attended nursery school (binary: yes or no)
higher      wants to take higher education (binary: yes or no)
internet    Internet access at home (binary: yes or no)
romantic    with a romantic relationship (binary: yes or no)
famrel      quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime    free time after school (numeric: from 1 - very low to 5 - very high)
goout       going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc        workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc        weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health      current health status (numeric: from 1 - very bad to 5 - very good)
absences    number of school absences (numeric: from 0 to 93)
G1          first period grade in math (numeric: from 0 to 20)
G2          second period grade in math (numeric: from 0 to 20)
G3          final grade in math (numeric: from 0 to 20, output target)
Open a new file
midterm <- read.csv("./students.csv")
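The str() output further down lists the text columns as factors. Since R 4.0.0, read.csv() returns them as character vectors unless stringsAsFactors = TRUE is passed. A minimal sketch on a made-up three-row file (the file contents and the name demo are assumptions, not part of students.csv):

```r
# Hypothetical three-row stand-in for students.csv (values are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("school,sex,age", "GP,F,18", "GP,M,17", "MS,F,15"), csv_path)

# Ask explicitly for factors so the result does not depend on the R version.
demo <- read.csv(csv_path, stringsAsFactors = TRUE)

str(demo)
is.factor(demo$sex)  # TRUE
```

Functions such as table() and by() group character columns the same way, so the answers below should not depend on this choice.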


Answer these questions:
  1. How many students are there in this dataset?
dim(midterm)
## [1] 395  33
nrow(midterm)
## [1] 395
str(midterm)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
  2. How many male and female students are there?
table(midterm$sex)
## 
##   F   M 
## 208 187
  3. What is the average age of male and female students?
by(midterm$age, midterm$sex, mean)
## midterm$sex: F
## [1] 16.73077
## ------------------------------------------------------------ 
## midterm$sex: M
## [1] 16.65775
  4. Do the two schools differ in the average age of students? How?
by(midterm$age, midterm$school, mean)
## midterm$school: GP
## [1] 16.52149
## ------------------------------------------------------------ 
## midterm$school: MS
## [1] 18.02174
  5. What percentage of each school has a family size greater than 3?
prop.table(table(midterm$famsize, midterm$school), margin = 2)
##      
##              GP        MS
##   GT3 0.7220630 0.6304348
##   LE3 0.2779370 0.3695652
  6. What percentage of parents in the Mousinho da Silveira school live separately?
prop.table(table(midterm$school, midterm$Pstatus), margin = 1)
##     
##               A          T
##   GP 0.10888252 0.89111748
##   MS 0.06521739 0.93478261
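When a question asks for a percentage within each school, the choice of denominator matters. prop.table() divides every cell by the grand total unless a margin is given: margin = 1 divides by row totals and margin = 2 by column totals. A minimal sketch on made-up counts (the data frame demo is an assumption, not the survey data):

```r
# Hypothetical counts, not the survey data.
demo <- data.frame(
  school  = c("GP", "GP", "GP", "GP", "MS", "MS"),
  famsize = c("GT3", "GT3", "GT3", "LE3", "GT3", "LE3")
)
counts <- table(demo$famsize, demo$school)  # rows: famsize, columns: school

joint     <- prop.table(counts)              # each cell / all students
by_school <- prop.table(counts, margin = 2)  # each cell / its school's total

joint["GT3", "GP"]      # 0.5: 3 of all 6 students
by_school["GT3", "GP"]  # 0.75: 3 of the 4 GP students
by_school["GT3", "MS"]  # 0.5: 1 of the 2 MS students
```

If the schools are in the rows instead, as in table(school, Pstatus), the same per-school shares come from margin = 1.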
  7. Which school has a higher average number of absences?
by(midterm$absences, midterm$school, mean)
## midterm$school: GP
## [1] 5.965616
## ------------------------------------------------------------ 
## midterm$school: MS
## [1] 3.76087
  8. Do students who have internet access at home drink more or less alcohol during the week? What about on weekends?
by(data = midterm$Walc, midterm$internet, mean)
## midterm$internet: no
## [1] 2.257576
## ------------------------------------------------------------ 
## midterm$internet: yes
## [1] 2.297872
by(data = midterm$Dalc, midterm$internet, mean)
## midterm$internet: no
## [1] 1.409091
## ------------------------------------------------------------ 
## midterm$internet: yes
## [1] 1.495441
  9. How does having internet access influence study time?
by(midterm$studytime, midterm$internet, mean)
## midterm$internet: no
## [1] 1.924242
## ------------------------------------------------------------ 
## midterm$internet: yes
## [1] 2.057751
  10. Does having a romantic relationship influence final math grades?
by(midterm$G3, midterm$romantic, mean)
## midterm$romantic: no
## [1] 10.8365
## ------------------------------------------------------------ 
## midterm$romantic: yes
## [1] 9.575758
  11. Of the three math grades (G1, G2, and G3), which one has the highest variability?
sd(midterm$G1)
## [1] 3.319195
sd(midterm$G2)
## [1] 3.761505
summary(midterm$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   11.00   10.42   14.00   20.00
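summary() reports the quartiles and the mean but not the standard deviation, so its output cannot be compared directly with the sd() values for G1 and G2. One way to put all three grades on the same footing is to apply sd() to each column. A minimal sketch on made-up grades (the data frame grades is an assumption, not the survey data):

```r
# Hypothetical grades, not the survey data.
grades <- data.frame(
  G1 = c(10, 12, 14),
  G2 = c(9, 12, 15),
  G3 = c(0, 12, 18)
)

# Standard deviation of every column in one call.
spread <- sapply(grades, sd)
spread

# Name of the column with the largest standard deviation.
names(which.max(spread))  # "G3"
```

With the real data, the equivalent call would be sapply(midterm[, c("G1", "G2", "G3")], sd).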
  12. How do students with varying records of past class failures score in the final math grades?
by(midterm$G3, midterm$failures, mean)
## midterm$failures: 0
## [1] 11.25321
## ------------------------------------------------------------ 
## midterm$failures: 1
## [1] 8.12
## ------------------------------------------------------------ 
## midterm$failures: 2
## [1] 6.235294
## ------------------------------------------------------------ 
## midterm$failures: 3
## [1] 5.6875
  13. Do urban and rural students differ in the variability of their mothers’ education? What about their fathers’?
by(midterm$Medu, midterm$address, mean)
## midterm$address: R
## [1] 2.465909
## ------------------------------------------------------------ 
## midterm$address: U
## [1] 2.830619
by(midterm$Fedu, midterm$address, mean)
## midterm$address: R
## [1] 2.375
## ------------------------------------------------------------ 
## midterm$address: U
## [1] 2.563518
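The group means describe the typical education level in each group, not how much it varies. To compare variability, the same by() call can take sd (or var) in place of mean. A minimal sketch on made-up values (the data frame demo is an assumption, not the survey data):

```r
# Hypothetical education levels, not the survey data.
demo <- data.frame(
  address = c("R", "R", "R", "U", "U", "U"),
  Medu    = c(1, 2, 3, 0, 2, 4)
)

# Same pattern as the calls above, with sd instead of mean.
by(demo$Medu, demo$address, sd)

# tapply() returns the same numbers as a plain named vector.
spread <- tapply(demo$Medu, demo$address, sd)
spread  # R = 1, U = 2
```

With the real data, this would be by(midterm$Medu, midterm$address, sd), and the same call with midterm$Fedu for fathers.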
  14. How does the quality of family relationships differ between students whose parents live together and those whose parents live apart?
by(midterm$famrel, midterm$Pstatus, mean)
## midterm$Pstatus: A
## [1] 3.878049
## ------------------------------------------------------------ 
## midterm$Pstatus: T
## [1] 3.951977
  15. What is the mean age in this dataset? What is the SD for age? Based on this info and what you learned from the handouts, can you calculate the 95% confidence interval for age?
mean(midterm$age)
## [1] 16.6962
sd(midterm$age)
## [1] 1.276043
sd(midterm$age)/(sqrt(nrow(midterm)))
## [1] 0.06420468

CI (95%): 16.6962 ± 1.96 * 0.06420468
Lower bound: 16.6962 - 1.96 * 0.06420468 = 16.57036
Upper bound: 16.6962 + 1.96 * 0.06420468 = 16.82204
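The interval can also be computed in R instead of by hand. A minimal sketch, assuming the normal approximation (mean ± z × standard error) and using the summary values printed above; the variable names are illustrative:

```r
# Summary values copied from the output above.
age_mean <- 16.6962
age_sd   <- 1.276043
n        <- 395

se <- age_sd / sqrt(n)  # standard error of the mean
z  <- qnorm(0.975)      # two-sided 95% critical value, about 1.96

ci <- c(lower = age_mean - z * se, upper = age_mean + z * se)
round(ci, 2)  # lower 16.57, upper 16.82
```

qnorm(0.975) is used in place of the rounded 1.96; at this sample size the two give the same interval to two decimals.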