Load the data

Get the data before loading!

Fill in the questionnaire at this link: bit.do/survey2K20

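# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0)
data <- read.csv("test survey.csv", stringsAsFactors = TRUE)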
head(data)

##                     Timestamp  Your.group
## 1 2020/01/26 8:44:48 pm GMT+3 test option
## 2 2020/01/26 8:45:14 pm GMT+3 test option
## 3 2020/01/26 8:46:43 pm GMT+3 test option
## 4 2020/01/26 8:47:12 pm GMT+3 test option
## 5 2020/01/26 8:47:35 pm GMT+3 test option
## 6 2020/01/26 8:48:00 pm GMT+3 test option
##   Did.your.mother.graduate.from.a.university.
## 1                                          No
## 2                                         Yes
## 3                                         Yes
## 4                                         Yes
## 5                                          No
## 6                                          No
##   Did.your.father.graduate.from.a.university.
## 1                                         Yes
## 2                                         Yes
## 3                                         Yes
## 4                                          No
## 5                                         Yes
## 6                                         Yes
##   Which.one.of.those.four.pets.do.you.favour.more.
## 1                                             Fish
## 2                                              Cat
## 3                                          Hamster
## 4                                          Hamster
## 5                                              Cat
## 6                                              Dog
##   What.is.the.run.time.of.your.favourite.film..in.minutes.
## 1                                                       90
## 2                                                      186
## 3                                                      120
## 4                                                      150
## 5                                                       95
## 6                                                       99
##   How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.
## 1                                                                          68
## 2                                                                          60
## 3                                                                          45
## 4                                                                          55
## 5                                                                          95
## 6                                                                          61
##   What.is.the.colour.of.your.eyes.
## 1                           green?
## 2                             blue
## 3                             gray
## 4                             grey
## 5                             Grey
## 6                            hazel

colnames(data)

## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."

str(data)

## 'data.frame':    7 obs. of  8 variables:
##  $ Timestamp                                                                  : Factor w/ 7 levels "2020/01/26 8:44:48 pm GMT+3",..: 1 2 3 4 5 6 7
##  $ Your.group                                                                 : Factor w/ 1 level "test option": 1 1 1 1 1 1 1
##  $ Did.your.mother.graduate.from.a.university.                                : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1
##  $ Did.your.father.graduate.from.a.university.                                : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 1
##  $ Which.one.of.those.four.pets.do.you.favour.more.                           : Factor w/ 4 levels "Cat","Dog","Fish",..: 3 1 4 4 1 2 4
##  $ What.is.the.run.time.of.your.favourite.film..in.minutes.                   : int  90 186 120 150 95 99 195
##  $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.: int  68 60 45 55 95 61 45
##  $ What.is.the.colour.of.your.eyes.                                           : Factor w/ 6 levels "blue","gray",..: 3 1 2 4 5 6 1

library(tidyverse)
glimpse(data)

## Observations: 7
## Variables: 8
## $ Timestamp                                                                   <fct> ...
## $ Your.group                                                                  <fct> ...
## $ Did.your.mother.graduate.from.a.university.                                 <fct> ...
## $ Did.your.father.graduate.from.a.university.                                 <fct> ...
## $ Which.one.of.those.four.pets.do.you.favour.more.                            <fct> ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes.                    <int> ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes. <int> ...
## $ What.is.the.colour.of.your.eyes.                                            <fct> ...
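The long dotted column names come from read.csv() itself: by default it passes the headers through make.names(), which replaces spaces and punctuation with dots. A minimal sketch, assuming a two-column toy file written to a temporary path (the file contents are made up for illustration):

```r
# Write a tiny CSV whose headers contain spaces and a question mark
# (hypothetical data, only to illustrate the name conversion).
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Your group","Did your mother graduate from a university?"',
             '"test option","No"',
             '"test option","Yes"'),
           tmp)

# Default behaviour: check.names = TRUE makes the names syntactically valid
toy <- read.csv(tmp, stringsAsFactors = TRUE)
colnames(toy)

# check.names = FALSE keeps the original question texts as they are
toy_raw <- read.csv(tmp, check.names = FALSE)
colnames(toy_raw)

unlink(tmp)  # remove the temporary file
```

Keeping the default is usually the more convenient choice, since names with spaces have to be wrapped in backticks every time they are used.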

Select relevant variables

colnames(data) # get variable names

## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."

Boomers do:

data1 <- data[c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes.")]
dim(data1) #7 variables

## [1] 7 7

Millennials do:

data2 <- subset(data, 
                select = c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data2) #7 variables, the same result

## [1] 7 7

Zoomers do:

#library(dplyr)
data3 <- select(data, c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data3) #7 variables, the same result

## [1] 7 7

Sometimes other ways work too, such as dropping the first column (Timestamp) by position:

data4 <- data[,-1]
dim(data4)

## [1] 7 7
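Dropping a column by name is usually safer than dropping it by position, because it keeps working if the column order changes. A sketch on a made-up three-column data frame (`toy` and its values are hypothetical):

```r
library(dplyr)

toy <- data.frame(Timestamp  = c("t1", "t2"),
                  Your.group = c("test option", "test option"),
                  favpet     = c("Cat", "Dog"))

by_position <- toy[, -1]                                   # base R: drop the first column
by_name     <- select(toy, -Timestamp)                     # dplyr: drop a column by name
by_base     <- toy[, setdiff(colnames(toy), "Timestamp")]  # base R: drop a column by name

colnames(by_name)
```

All three objects contain the same two columns, Your.group and favpet.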

Name the variables the way you like

data1 <- rename(data1, 
                 #new name = old name,
              studygroup = Your.group,
              mothereduc = Did.your.mother.graduate.from.a.university.,
              fathereduc = Did.your.father.graduate.from.a.university.,
              favpet = Which.one.of.those.four.pets.do.you.favour.more.,
               runtime = What.is.the.run.time.of.your.favourite.film..in.minutes., 
               traveltime = How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.,
               eyecolor = What.is.the.colour.of.your.eyes.)

Check yourself: If you run the code above twice, it won’t work the second time. Why?

After changing the variable names, always check the result:

colnames(data1)

## [1] "studygroup" "mothereduc" "fathereduc" "favpet"     "runtime"   
## [6] "traveltime" "eyecolor"

Now you have shorter variable names and know how to learn what they are.

Some important rules of naming:

keep the names short (rfrgrtp is better than refrigeratortype)
keep them informative (rfrgrtp is better than V1)
do not duplicate function names
do not begin a variable name with a number
use underscore (_) to separate words in names (e.g. my_biscuit)
avoid CamelCase names and dots; base R already uses them for function and argument names (e.g. read.csv, stringsAsFactors)

Useful link to the Coding Style Guide in R http://adv-r.had.co.nz/Style.html

Tidyverse Coding Guide: https://style.tidyverse.org/syntax.html#object-names
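Base R can show what happens to names that break some of these rules. A small sketch using make.names(), which converts arbitrary strings into syntactically valid names (the example strings are invented):

```r
candidates <- c("1st_score", "my biscuit", "my_biscuit")

# make.names() prefixes an "X" to names starting with a digit
# and replaces invalid characters such as spaces with dots
fixed <- make.names(candidates)
fixed

# A name was already valid if make.names() left it unchanged
candidates == fixed
```

Only my_biscuit passes unchanged; the other two would have to be wrapped in backticks to be used as written.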

Create a subset based on criteria

Select the answers from one group (studygroup) only

table(data1$studygroup)

## 
## test option 
##           7

Boomers do:

data181 <- data1[data1$studygroup == "test option", ]
dim(data181) # with the real group labels: studygroup == "BSC181"

## [1] 7 7

Millennials do:

data182 <- subset(data1, studygroup == "test option")
dim(data182) # with the real group labels: studygroup == "BSC182"

## [1] 7 7

Zoomers do:

data183 <- filter(data1, studygroup == "test option") # filter() looks up column names in data1, no data1$ needed
dim(data183) # with the real group labels: studygroup == "BSC183"

## [1] 7 7

Now let’s filter by two conditions at once, keeping the members of one group who travel for less than an hour:

data183_1 <- filter(data1, studygroup == "test option" & traveltime < 60)
head(data183_1)

##    studygroup mothereduc fathereduc  favpet runtime traveltime eyecolor
## 1 test option        Yes        Yes Hamster     120         45     gray
## 2 test option        Yes         No Hamster     150         55     grey
## 3 test option         No         No Hamster     195         45     blue

Or keep the rows that meet at least one of several conditions:

data183_2 <- filter(data1, studygroup == "test option" | traveltime < 60)
head(data183_2)

##    studygroup mothereduc fathereduc  favpet runtime traveltime eyecolor
## 1 test option         No        Yes    Fish      90         68   green?
## 2 test option        Yes        Yes     Cat     186         60     blue
## 3 test option        Yes        Yes Hamster     120         45     gray
## 4 test option        Yes         No Hamster     150         55     grey
## 5 test option         No        Yes     Cat      95         95     Grey
## 6 test option         No        Yes     Dog      99         61    hazel
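With a single group in the test file, the difference between & and | is hard to see. A sketch on a hypothetical mini-dataset with two groups (the group labels and travel times are invented):

```r
library(dplyr)

# Hypothetical mini-dataset with two study groups
toy <- data.frame(studygroup = c("BSC181", "BSC181", "BSC182", "BSC182"),
                  traveltime = c(45, 95, 55, 61))

both   <- filter(toy, studygroup == "BSC181" & traveltime < 60)  # both conditions hold
either <- filter(toy, studygroup == "BSC181" | traveltime < 60)  # at least one holds
listed <- filter(toy, studygroup %in% c("BSC181", "BSC182"))     # group is any of those listed

nrow(both)    # 1
nrow(either)  # 3
nrow(listed)  # 4
```

The %in% operator is a compact alternative to chaining several == comparisons with |.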

Recap: we are here

Load the data - DONE
Select relevant variables - DONE
Name the variables the way you like - DONE
Create a subset based on criteria - DONE
Recode variables when you need this
Summarise the dataset as a whole and by groups
Create suitable graphs to summarise one or two variables
Pass the short test on scales of measurement

Recode variables when you need this

(Get back to data1 for this.)

Let’s create a new variable which says “Yes” if at least one of the two parents had higher education and “No” if neither did.

library(dplyr)
data1$eduhi <- if_else(data1$mothereduc == "Yes" | data1$fathereduc == "Yes", 
                      "Yes", 
                      "No")

Always check the results of recoding:

table(data1$eduhi)

## 
##  No Yes 
##   1   6
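The test data have no missing answers, but real survey data often do. A sketch of how the same recoding behaves with NA values, assuming invented answers and using the `missing` argument of if_else():

```r
library(dplyr)

# Hypothetical answers, including missing ones
mothereduc <- c("Yes", NA,    "No", NA)
fathereduc <- c("No",  "Yes", "No", "No")

# NA | TRUE is TRUE, but NA | FALSE is NA, so the last case stays undecided
eduhi <- if_else(mothereduc == "Yes" | fathereduc == "Yes",
                 "Yes", "No",
                 missing = "Unknown")
eduhi
```

The second respondent is coded "Yes" because the father's answer alone settles the question; the fourth is coded "Unknown" because the missing answer could change the result.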

Now let’s create a categorical variable for average travel time to university: “less than an hour” for trips below 60 minutes, and “an hour or more” for the rest.

data1$time2[data1$traveltime < 60] <- "less than an hour"
data1$time2[data1$traveltime >= 60] <- "an hour or more"
table(data1$time2)

## 
##   an hour or more less than an hour 
##                 4                 3

table(data1$traveltime, data1$time2)

##     
##      an hour or more less than an hour
##   45               0                 2
##   55               0                 1
##   60               1                 0
##   61               1                 0
##   68               1                 0
##   95               1                 0

Let’s collapse several categories into fewer ones.

(We create a new variable so as not to overwrite the original data.)

table(data1$favpet)

## 
##     Cat     Dog    Fish Hamster 
##       2       1       1       3

data1$pet2[data1$favpet == "Cat" | 
              data1$favpet == "Dog" ] <- "big_pet"
data1$pet2[data1$favpet == "Fish" | 
              data1$favpet == "Hamster"] <- "small_pet"
table(data1$favpet, data1$pet2)

##          
##           big_pet small_pet
##   Cat           2         0
##   Dog           1         0
##   Fish          0         1
##   Hamster       0         3
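For more than two target categories, chained assignments get long. A sketch of the same recoding with dplyr's case_when(), offered as one possible alternative; the pet values are the ones printed by str(data) earlier:

```r
library(dplyr)

# Favourite pets as printed by str(data) above
favpet <- c("Fish", "Cat", "Hamster", "Hamster", "Cat", "Dog", "Hamster")

pet2 <- case_when(
  favpet %in% c("Cat", "Dog")      ~ "big_pet",
  favpet %in% c("Fish", "Hamster") ~ "small_pet"
)  # values matching no condition become NA

table(favpet, pet2)
```

Cross-tabulating the old and the new variable, as above, remains the quickest way to check that every category landed where intended.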

Summarise the dataset as a whole and by groups

Boomers do:

summary(data1)

##        studygroup mothereduc fathereduc     favpet     runtime     
##  test option:7    No :4      No :2      Cat    :2   Min.   : 90.0  
##                   Yes:3      Yes:5      Dog    :1   1st Qu.: 97.0  
##                                         Fish   :1   Median :120.0  
##                                         Hamster:3   Mean   :133.6  
##                                                     3rd Qu.:168.0  
##                                                     Max.   :195.0  
##    traveltime      eyecolor    eduhi              time2          
##  Min.   :45.00   blue  :2   Length:7           Length:7          
##  1st Qu.:50.00   gray  :1   Class :character   Class :character  
##  Median :60.00   green?:1   Mode  :character   Mode  :character  
##  Mean   :61.29   grey  :1                                        
##  3rd Qu.:64.50   Grey  :1                                        
##  Max.   :95.00   hazel :1                                        
##      pet2          
##  Length:7          
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Millennials do:

library(psych)
describe(data1)

##             vars n   mean    sd median trimmed   mad min  max range  skew
## studygroup*    1 7   1.00  0.00      1    1.00  0.00   1    1     0   NaN
## mothereduc*    2 7   1.43  0.53      1    1.43  0.00   1    2     1  0.23
## fathereduc*    3 7   1.71  0.49      2    1.71  0.00   1    2     1 -0.75
## favpet*        4 7   2.71  1.38      3    2.71  1.48   1    4     3 -0.22
## runtime        5 7 133.57 43.89    120  133.57 44.48  90  195   105  0.33
## traveltime     6 7  61.29 17.09     60   61.29 11.86  45   95    50  0.85
## eyecolor*      7 7   3.14  1.95      3    3.14  2.97   1    6     5  0.18
## eduhi*         8 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## time2*         9 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## pet2*         10 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.20  0.20
## fathereduc*    -1.60  0.18
## favpet*        -1.99  0.52
## runtime        -1.88 16.59
## traveltime     -0.60  6.46
## eyecolor*      -1.79  0.74
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA

describeBy(data1, data1$pet2) # describe the data by groups

## 
##  Descriptive statistics by group 
## group: big_pet
##             vars n   mean    sd median trimmed  mad min  max range  skew
## studygroup*    1 3   1.00  0.00      1    1.00 0.00   1    1     0   NaN
## mothereduc*    2 3   1.33  0.58      1    1.33 0.00   1    2     1  0.38
## fathereduc*    3 3   2.00  0.00      2    2.00 0.00   2    2     0   NaN
## favpet*        4 3   1.33  0.58      1    1.33 0.00   1    2     1  0.38
## runtime        5 3 126.67 51.42     99  126.67 5.93  95  186    91  0.38
## traveltime     6 3  72.00 19.92     61   72.00 1.48  60   95    35  0.38
## eyecolor*      7 3   4.00  2.65      5    4.00 1.48   1    6     5 -0.32
## eduhi*         8 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
## time2*         9 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
## pet2*         10 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.33  0.33
## fathereduc*      NaN  0.00
## favpet*        -2.33  0.33
## runtime        -2.33 29.69
## traveltime     -2.33 11.50
## eyecolor*      -2.33  1.53
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA
## -------------------------------------------------------- 
## group: small_pet
##             vars n   mean    sd median trimmed   mad min  max range  skew
## studygroup*    1 4   1.00  0.00    1.0    1.00  0.00   1    1     0   NaN
## mothereduc*    2 4   1.50  0.58    1.5    1.50  0.74   1    2     1  0.00
## fathereduc*    3 4   1.50  0.58    1.5    1.50  0.74   1    2     1  0.00
## favpet*        4 4   3.75  0.50    4.0    3.75  0.00   3    4     1 -0.75
## runtime        5 4 138.75 44.79  135.0  138.75 44.48  90  195   105  0.16
## traveltime     6 4  53.25 10.90   50.0   53.25  7.41  45   68    23  0.40
## eyecolor*      7 4   2.50  1.29    2.5    2.50  1.48   1    4     3  0.00
## eduhi*         8 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## time2*         9 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## pet2*         10 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.44  0.29
## fathereduc*    -2.44  0.29
## favpet*        -1.69  0.25
## runtime        -2.02 22.40
## traveltime     -2.00  5.45
## eyecolor*      -2.08  0.65
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA

Zoomers do:

library(magrittr)
data1 %>%
  group_by(pet2) %>%
  summarise(avg_runtime = mean(runtime),
            mdn_runtime = median(runtime),
            n = n())

## # A tibble: 2 x 4
##   pet2      avg_runtime mdn_runtime     n
##   <chr>           <dbl>       <dbl> <int>
## 1 big_pet          127.          99     3
## 2 small_pet        139.         135     4

Millennials have been quicker here.

Help page here: https://www.earthdatascience.org/workshops/clean-coding-tidyverse-intro/summarise-data-in-R-tidyverse/
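Base R can also summarise by groups without extra packages. A sketch with aggregate() and tapply(), using the run times and pet groups from the outputs above:

```r
# Values taken from the outputs above
toy <- data.frame(
  runtime = c(90, 186, 120, 150, 95, 99, 195),
  pet2    = c("small_pet", "big_pet", "small_pet", "small_pet",
              "big_pet", "big_pet", "small_pet")
)

# Mean run time per group, returned as a data frame
means <- aggregate(runtime ~ pet2, data = toy, FUN = mean)
means

# Median run time per group, returned as a named vector
medians <- tapply(toy$runtime, toy$pet2, median)
medians
```

The results agree with the group_by() and summarise() output: means of about 127 and 139, medians of 99 and 135.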

Create suitable graphs to summarise one or two variables

For univariate distributions:

A CATEGORICAL VARIABLE needs a bar plot (there is space between bars)

Boomers do:

barplot(table(data1$favpet))

Zoomers do:

library(ggplot2)
ggplot(data1, aes(x = favpet)) +
  geom_bar()

A CONTINUOUS VARIABLE needs a histogram

Boomers do:

hist(data1$traveltime)

Zoomers do:

ggplot(data1, aes(x = traveltime)) +
  geom_histogram()
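Called without arguments, geom_histogram() uses 30 bins and prints a message suggesting a better value. A sketch with an explicit bin width; the width of 10 minutes and the boundary at 40 are arbitrary choices for illustration:

```r
library(ggplot2)

# Travel times as printed by str(data) above
toy <- data.frame(traveltime = c(68, 60, 45, 55, 95, 61, 45))

# binwidth = 10 groups the trips into 10-minute bins;
# boundary = 40 makes the bin edges fall on 40, 50, 60, ...
p <- ggplot(toy, aes(x = traveltime)) +
  geom_histogram(binwidth = 10, boundary = 40, colour = "white")
p
```

With only seven observations, a few wide bins show the shape more clearly than 30 narrow ones.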

For bivariate distributions:

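when a categorical and a continuous variable meet, use a boxplot

How to read a boxplot: the box spans the first to the third quartile, the line inside it marks the median, the whiskers reach the most extreme values within 1.5 interquartile ranges of the box, and points beyond the whiskers are drawn individually as outliers.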

Boomers:

boxplot(data1$runtime ~ data1$eduhi)

Zoomers:

ggplot(data1, aes(x = eduhi, y = runtime)) +
  geom_boxplot()
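The numbers a boxplot draws can also be computed directly. A sketch with boxplot.stats() on the run times of the whole sample (not split by eduhi, to keep it short):

```r
# Run times as printed by str(data) above
runtime <- c(90, 186, 120, 150, 95, 99, 195)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- boxplot.stats(runtime)
stats$stats

# Observations beyond the whiskers (none in this sample)
stats$out
```

Here the hinges, 97 and 168, coincide with the 1st and 3rd quartiles reported by summary(data1).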

when two continuous variables meet together, use a scatterplot

Boomers:

plot(data1$runtime, data1$traveltime)

Zoomers:

ggplot(data1, aes(x = runtime, y = traveltime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # straight regression line, no confidence band

See help here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
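The straight line added by geom_smooth() with the lm method is an ordinary linear regression of y on x. A sketch that fits the same model by hand, using the values printed by str(data) earlier:

```r
# Values as printed by str(data) above
toy <- data.frame(runtime    = c(90, 186, 120, 150, 95, 99, 195),
                  traveltime = c(68, 60, 45, 55, 95, 61, 45))

# Linear model of traveltime (y) on runtime (x)
fit <- lm(traveltime ~ runtime, data = toy)
coef(fit)  # intercept and slope of the fitted line

# Pearson correlation between the two variables
r <- cor(toy$runtime, toy$traveltime)
r
```

With seven observations, the slope and the correlation are only a description of this sample and should not be read as evidence of a relationship.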

when two categorical variables meet, use a stacked barchart:

Millennials:

library(sjPlot)
plot_xtab(data1$pet2, data1$eduhi, 
         margin = "row",
         bar.pos = "stack")

Zoomers:

ggplot(data1, aes(eduhi)) +
  geom_bar(aes(fill = pet2))

Help: https://ggplot2.tidyverse.org/reference/geom_bar.html
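The numbers behind a stacked barchart are a cross-tabulation. A sketch with table() and prop.table(), using values derived from the recoding steps above:

```r
# Values derived from the recoding steps above
pet2  <- c("small_pet", "big_pet", "small_pet", "small_pet",
           "big_pet", "big_pet", "small_pet")
eduhi <- c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")

# Counts for every combination of the two variables
counts <- table(pet2, eduhi)
counts

# margin = 1 gives row proportions: each pet2 row sums to 1
row_props <- prop.table(counts, margin = 1)
row_props
```

Row proportions correspond to the `margin = "row"` setting in the plot_xtab() call above.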

Useful functions (overview)*

Eyeballing the dataset: str(), glimpse(), describeBy(), group_by() %>% summarise()
Selecting variables: select(data, var)
Subsetting: subset(data, condition), filter(data, condition)
Renaming variables: rename(data, new_name = old_name)
Recoding variables: if_else(data$var == 0, "value if true", "value if false")
Graphs for variable distributions:
Barplot barplot(table(data$var)), ggplot(data, aes(var)) + geom_bar()
Histogram hist(data$var), ggplot(data, aes(var)) + geom_histogram()
Boxplot boxplot(data$cont ~ data$group), ggplot(data, aes(x = group, y = cont)) + geom_boxplot()
Scatterplot plot(data$var, data$var2), ggplot(data, aes(x = var, y = var2)) + geom_point()
Stacked barchart plot_xtab(data$var, data$var2, ...), ggplot(data, aes(var)) + geom_bar(aes(fill = var2))
Checking the results of manipulation: dim(data), table(data$var, data$var_new)

*These are just some of the working solutions

Behold, ye creatures, basic data manipulation. Act I

Remember: If cleaning the data gets hard at times, it’s okay. Go on to learn and practice!