Behold, ye creatures, basic data manipulation. Act I

Volchenko, Shirokanova

January 26, 2020

Goals of this class

  1. Load the data
  2. Select relevant variables
  3. Name the variables the way you like
  4. Create a subset based on criteria
  5. Recode variables when you need this
  6. Summarise the dataset as a whole and by groups
  7. Create suitable graphs to summarise one or two variables
  8. Pass the short test on scales of measurement

Load the data

Get the data before loading!

Fill in the questionnaire at this link: bit.do/survey2K20

data <- read.csv("test survey.csv")
head(data)
##                     Timestamp  Your.group
## 1 2020/01/26 8:44:48 pm GMT+3 test option
## 2 2020/01/26 8:45:14 pm GMT+3 test option
## 3 2020/01/26 8:46:43 pm GMT+3 test option
## 4 2020/01/26 8:47:12 pm GMT+3 test option
## 5 2020/01/26 8:47:35 pm GMT+3 test option
## 6 2020/01/26 8:48:00 pm GMT+3 test option
##   Did.your.mother.graduate.from.a.university.
## 1                                          No
## 2                                         Yes
## 3                                         Yes
## 4                                         Yes
## 5                                          No
## 6                                          No
##   Did.your.father.graduate.from.a.university.
## 1                                         Yes
## 2                                         Yes
## 3                                         Yes
## 4                                          No
## 5                                         Yes
## 6                                         Yes
##   Which.one.of.those.four.pets.do.you.favour.more.
## 1                                             Fish
## 2                                              Cat
## 3                                          Hamster
## 4                                          Hamster
## 5                                              Cat
## 6                                              Dog
##   What.is.the.run.time.of.your.favourite.film..in.minutes.
## 1                                                       90
## 2                                                      186
## 3                                                      120
## 4                                                      150
## 5                                                       95
## 6                                                       99
##   How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.
## 1                                                                          68
## 2                                                                          60
## 3                                                                          45
## 4                                                                          55
## 5                                                                          95
## 6                                                                          61
##   What.is.the.colour.of.your.eyes.
## 1                           green?
## 2                             blue
## 3                             gray
## 4                             grey
## 5                             Grey
## 6                            hazel
colnames(data)
## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
str(data)
## 'data.frame':    7 obs. of  8 variables:
##  $ Timestamp                                                                  : Factor w/ 7 levels "2020/01/26 8:44:48 pm GMT+3",..: 1 2 3 4 5 6 7
##  $ Your.group                                                                 : Factor w/ 1 level "test option": 1 1 1 1 1 1 1
##  $ Did.your.mother.graduate.from.a.university.                                : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1
##  $ Did.your.father.graduate.from.a.university.                                : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 1
##  $ Which.one.of.those.four.pets.do.you.favour.more.                           : Factor w/ 4 levels "Cat","Dog","Fish",..: 3 1 4 4 1 2 4
##  $ What.is.the.run.time.of.your.favourite.film..in.minutes.                   : int  90 186 120 150 95 99 195
##  $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.: int  68 60 45 55 95 61 45
##  $ What.is.the.colour.of.your.eyes.                                           : Factor w/ 6 levels "blue","gray",..: 3 1 2 4 5 6 1
library(tidyverse)
glimpse(data)
## Observations: 7
## Variables: 8
## $ Timestamp                                                                   <fct> ...
## $ Your.group                                                                  <fct> ...
## $ Did.your.mother.graduate.from.a.university.                                 <fct> ...
## $ Did.your.father.graduate.from.a.university.                                 <fct> ...
## $ Which.one.of.those.four.pets.do.you.favour.more.                            <fct> ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes.                    <int> ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes. <int> ...
## $ What.is.the.colour.of.your.eyes.                                            <fct> ...
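
The dots in the column names above come from `read.csv()`, which by default rewrites headers into syntactically valid R names. Below is a minimal sketch with a made-up two-column file (the temporary file and its two rows are illustrative, not the real survey export). Also note that the `Factor` columns shown above are what R versions before 4.0 produce; from R 4.0.0 on, text columns are read as character unless you ask for factors with `stringsAsFactors = TRUE`.

```r
# Write a tiny illustrative CSV to a temporary file (not the real survey file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"Your group","What is the colour of your eyes?"',
  '"test option","blue"',
  '"test option","grey"'
), csv_path)

# Spaces and "?" in the header are replaced with dots on import
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

colnames(toy)  # the cleaned-up names
str(toy)       # both columns are factors because of stringsAsFactors = TRUE
```

If you prefer to keep text as character and convert only selected columns later, leave `stringsAsFactors` at its default.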

Select relevant variables

colnames(data) # get variable names
## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."

Boomers do:

data1 <- data[c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes.")]
dim(data1) #7 variables
## [1] 7 7

Millennials do:

data2 <- subset(data, 
                select = c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data2) #7 variables, the same result
## [1] 7 7

Zoomers do:

#library(dplyr)
data3 <- select(data, c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data3) #7 variables, the same result
## [1] 7 7

Other approaches also work, such as dropping the first column by its position:

data4 <- data[,-1]
dim(data4)
## [1] 7 7
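
Dropping by position breaks silently if the column order ever changes, so dropping by name is often safer. Here is a small sketch on a toy data frame (the data frame `toy` and its values are stand-ins, not the survey data) that compares the two approaches:

```r
library(dplyr)

# Toy stand-in for the survey data (values are illustrative)
toy <- data.frame(
  Timestamp  = c("2020/01/26 8:44:48 pm", "2020/01/26 8:45:14 pm"),
  Your.group = c("test option", "test option"),
  runtime    = c(90, 186)
)

by_position <- toy[, -1]                # drop the first column by position
by_name     <- select(toy, -Timestamp)  # drop the same column by name

colnames(by_position)
colnames(by_name)
```

Both calls leave the same two columns; only the second one says which variable is being removed.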

Name the variables the way you like

data1 <- rename(data1,
                # new name = old name
                studygroup = Your.group,
                mothereduc = Did.your.mother.graduate.from.a.university.,
                fathereduc = Did.your.father.graduate.from.a.university.,
                favpet     = Which.one.of.those.four.pets.do.you.favour.more.,
                runtime    = What.is.the.run.time.of.your.favourite.film..in.minutes.,
                traveltime = How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.,
                eyecolor   = What.is.the.colour.of.your.eyes.)

Check yourself: If you run the code above twice, it won’t work the second time. Why?

After changing the variable names, always check the result:

colnames(data1)
## [1] "studygroup" "mothereduc" "fathereduc" "favpet"     "runtime"   
## [6] "traveltime" "eyecolor"

Now you have shorter variable names and know how to check them.

Some important rules of naming are set out in these style guides:

Coding Style Guide in R: http://adv-r.had.co.nz/Style.html

Tidyverse Coding Guide: https://style.tidyverse.org/syntax.html#object-names

Create a subset based on criteria

Select the answers from one group (studygroup) only

table(data1$studygroup)
## 
## test option 
##           7

Boomers do:

data181 <- data1[data1$studygroup == "test option", ]
dim(data181)#studygroup == "BSC181"
## [1] 7 7

Millennials do:

data182 <- subset(data1, studygroup == "test option")
dim(data182)#studygroup == "BSC182"
## [1] 7 7

Zoomers do:

data183 <- filter(data1, studygroup == "test option")
dim(data183) #studygroup == "BSC183"
## [1] 7 7

Now let’s filter by two conditions at once: members of one group who spend less than an hour travelling:

data183_1 <- filter(data1, studygroup == "test option" & traveltime < 60)
head(data183_1)
##    studygroup mothereduc fathereduc  favpet runtime traveltime eyecolor
## 1 test option        Yes        Yes Hamster     120         45     gray
## 2 test option        Yes         No Hamster     150         55     grey
## 3 test option         No         No Hamster     195         45     blue

Alternatively, you can keep the rows that meet at least one of several conditions:

data183_2 <- filter(data1, studygroup == "test option" | traveltime < 60)
head(data183_2)
##    studygroup mothereduc fathereduc  favpet runtime traveltime eyecolor
## 1 test option         No        Yes    Fish      90         68   green?
## 2 test option        Yes        Yes     Cat     186         60     blue
## 3 test option        Yes        Yes Hamster     120         45     gray
## 4 test option        Yes         No Hamster     150         55     grey
## 5 test option         No        Yes     Cat      95         95     Grey
## 6 test option         No        Yes     Dog      99         61    hazel
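
Because every respondent here belongs to "test option", the `|` filter above returns all rows. To see the difference between `&` and `|` more clearly, here is a sketch in base R using two columns from the output above (mother's education and travel time), typed in by hand:

```r
# Values copied from the survey output shown earlier
mothereduc <- c("No", "Yes", "Yes", "Yes", "No", "No", "No")
traveltime <- c(68, 60, 45, 55, 95, 61, 45)

mother_uni <- mothereduc == "Yes"  # TRUE where the mother graduated
short_trip <- traveltime < 60      # TRUE where the trip is under an hour

both   <- mother_uni & short_trip  # both conditions must hold
either <- mother_uni | short_trip  # at least one condition must hold

sum(both)      # number of rows passing the AND filter
sum(either)    # number of rows passing the OR filter
which(either)  # row numbers kept by the OR filter
```

The AND filter can only keep as many rows as the stricter condition allows, while the OR filter keeps every row that passes either test.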

Recap: we are here

  1. Load the data - DONE
  2. Select relevant variables - DONE
  3. Name the variables the way you like - DONE
  4. Create a subset based on criteria - DONE
  5. Recode variables when you need this
  6. Summarise the dataset as a whole and by groups
  7. Create suitable graphs to summarise one or two variables
  8. Pass the short test on scales of measurement

Recode variables when you need this

(Go back to data1 for this.)

Let’s create a new variable which says “Yes” if at least one of the two parents had higher education and “No” if neither did.

library(dplyr)
data1$eduhi <- if_else(data1$mothereduc == "Yes" | data1$fathereduc == "Yes", 
                      "Yes", 
                      "No")

Always check the results of recoding:

table(data1$eduhi)
## 
##  No Yes 
##   1   6

Now let’s create a categorical version of average travel time to university: “less than an hour” for trips below 60 minutes, and “an hour or more” for the rest.

data1$time2[data1$traveltime < 60] <- "less than an hour"
data1$time2[data1$traveltime >= 60] <- "an hour or more"
table(data1$time2)
## 
##   an hour or more less than an hour 
##                 4                 3
table(data1$traveltime, data1$time2)
##     
##      an hour or more less than an hour
##   45               0                 2
##   55               0                 1
##   60               1                 0
##   61               1                 0
##   68               1                 0
##   95               1                 0

Let’s collapse several categories into fewer ones.

(We create a new variable so as not to overwrite the original data.)

table(data1$favpet)
## 
##     Cat     Dog    Fish Hamster 
##       2       1       1       3
data1$pet2[data1$favpet == "Cat" | 
              data1$favpet == "Dog" ] <- "big_pet"
data1$pet2[data1$favpet == "Fish" | 
              data1$favpet == "Hamster"] <- "small_pet"
table(data1$favpet, data1$pet2)
##          
##           big_pet small_pet
##   Cat           2         0
##   Dog           1         0
##   Fish          0         1
##   Hamster       0         3
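
The same two-way recode can be sketched in one step with `if_else()` and `%in%`. This is just one alternative, not a replacement for the approach above; the `favpet` vector is typed in from the survey output shown earlier.

```r
library(dplyr)

# Favourite pets, copied from the survey output shown earlier
favpet <- c("Fish", "Cat", "Hamster", "Hamster", "Cat", "Dog", "Hamster")

# Cats and dogs become "big_pet"; everything else becomes "small_pet"
pet2 <- if_else(favpet %in% c("Cat", "Dog"), "big_pet", "small_pet")

# Always check the results of recoding
table(favpet, pet2)
```

One difference to keep in mind: with this version any value that is not "Cat" or "Dog", including a missing value, ends up as "small_pet", whereas the indexed assignments above leave unmatched values as `NA`.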

Summarise the dataset as a whole and by groups

Boomers do:

summary(data1)
##        studygroup mothereduc fathereduc     favpet     runtime     
##  test option:7    No :4      No :2      Cat    :2   Min.   : 90.0  
##                   Yes:3      Yes:5      Dog    :1   1st Qu.: 97.0  
##                                         Fish   :1   Median :120.0  
##                                         Hamster:3   Mean   :133.6  
##                                                     3rd Qu.:168.0  
##                                                     Max.   :195.0  
##    traveltime      eyecolor    eduhi              time2          
##  Min.   :45.00   blue  :2   Length:7           Length:7          
##  1st Qu.:50.00   gray  :1   Class :character   Class :character  
##  Median :60.00   green?:1   Mode  :character   Mode  :character  
##  Mean   :61.29   grey  :1                                        
##  3rd Qu.:64.50   Grey  :1                                        
##  Max.   :95.00   hazel :1                                        
##      pet2          
##  Length:7          
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
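
In the output above, the recoded variables (`eduhi`, `time2`, `pet2`) show only `Length`, `Class` and `Mode`, because they are stored as character. A small sketch of how converting to a factor changes what `summary()` reports (the `eduhi` values are reconstructed by hand from the parents' answers shown earlier):

```r
# Recoded parental education, reconstructed from the survey output
eduhi <- c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")

summary(eduhi)  # character vector: only length, class and mode

# Convert to a factor with an explicit level order
eduhi_f <- factor(eduhi, levels = c("No", "Yes"))

summary(eduhi_f)  # factor: counts per level
```

Inside the data frame the same idea would read `data1$eduhi <- factor(data1$eduhi)`.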

Millennials do:

library(psych)
describe(data1)
##             vars n   mean    sd median trimmed   mad min  max range  skew
## studygroup*    1 7   1.00  0.00      1    1.00  0.00   1    1     0   NaN
## mothereduc*    2 7   1.43  0.53      1    1.43  0.00   1    2     1  0.23
## fathereduc*    3 7   1.71  0.49      2    1.71  0.00   1    2     1 -0.75
## favpet*        4 7   2.71  1.38      3    2.71  1.48   1    4     3 -0.22
## runtime        5 7 133.57 43.89    120  133.57 44.48  90  195   105  0.33
## traveltime     6 7  61.29 17.09     60   61.29 11.86  45   95    50  0.85
## eyecolor*      7 7   3.14  1.95      3    3.14  2.97   1    6     5  0.18
## eduhi*         8 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## time2*         9 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## pet2*         10 7    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.20  0.20
## fathereduc*    -1.60  0.18
## favpet*        -1.99  0.52
## runtime        -1.88 16.59
## traveltime     -0.60  6.46
## eyecolor*      -1.79  0.74
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA
describeBy(data1, data1$pet2) # describe the data by groups
## 
##  Descriptive statistics by group 
## group: big_pet
##             vars n   mean    sd median trimmed  mad min  max range  skew
## studygroup*    1 3   1.00  0.00      1    1.00 0.00   1    1     0   NaN
## mothereduc*    2 3   1.33  0.58      1    1.33 0.00   1    2     1  0.38
## fathereduc*    3 3   2.00  0.00      2    2.00 0.00   2    2     0   NaN
## favpet*        4 3   1.33  0.58      1    1.33 0.00   1    2     1  0.38
## runtime        5 3 126.67 51.42     99  126.67 5.93  95  186    91  0.38
## traveltime     6 3  72.00 19.92     61   72.00 1.48  60   95    35  0.38
## eyecolor*      7 3   4.00  2.65      5    4.00 1.48   1    6     5 -0.32
## eduhi*         8 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
## time2*         9 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
## pet2*         10 3    NaN    NA     NA     NaN   NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.33  0.33
## fathereduc*      NaN  0.00
## favpet*        -2.33  0.33
## runtime        -2.33 29.69
## traveltime     -2.33 11.50
## eyecolor*      -2.33  1.53
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA
## -------------------------------------------------------- 
## group: small_pet
##             vars n   mean    sd median trimmed   mad min  max range  skew
## studygroup*    1 4   1.00  0.00    1.0    1.00  0.00   1    1     0   NaN
## mothereduc*    2 4   1.50  0.58    1.5    1.50  0.74   1    2     1  0.00
## fathereduc*    3 4   1.50  0.58    1.5    1.50  0.74   1    2     1  0.00
## favpet*        4 4   3.75  0.50    4.0    3.75  0.00   3    4     1 -0.75
## runtime        5 4 138.75 44.79  135.0  138.75 44.48  90  195   105  0.16
## traveltime     6 4  53.25 10.90   50.0   53.25  7.41  45   68    23  0.40
## eyecolor*      7 4   2.50  1.29    2.5    2.50  1.48   1    4     3  0.00
## eduhi*         8 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## time2*         9 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
## pet2*         10 4    NaN    NA     NA     NaN    NA Inf -Inf  -Inf    NA
##             kurtosis    se
## studygroup*      NaN  0.00
## mothereduc*    -2.44  0.29
## fathereduc*    -2.44  0.29
## favpet*        -1.69  0.25
## runtime        -2.02 22.40
## traveltime     -2.00  5.45
## eyecolor*      -2.08  0.65
## eduhi*            NA    NA
## time2*            NA    NA
## pet2*             NA    NA

Zoomers do:

library(magrittr)
data1 %>%
  group_by(pet2) %>%
  summarise(avg_runtime = mean(runtime),
            mdn_runtime = median(runtime),
            n = n())
## # A tibble: 2 x 4
##   pet2      avg_runtime mdn_runtime     n
##   <chr>           <dbl>       <dbl> <int>
## 1 big_pet          127.          99     3
## 2 small_pet        139.         135     4

Millennials were quicker here.

Help page here: https://www.earthdatascience.org/workshops/clean-coding-tidyverse-intro/summarise-data-in-R-tidyverse/
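
The grouped summary above can also be cross-checked in base R with `tapply()`. This is a sketch using the `runtime` and `pet2` values typed in from the earlier output, not a replacement for the tidyverse version:

```r
# Film run times and recoded pet groups, copied from the output above
runtime <- c(90, 186, 120, 150, 95, 99, 195)
pet2 <- c("small_pet", "big_pet", "small_pet", "small_pet",
          "big_pet", "big_pet", "small_pet")

avg_runtime <- tapply(runtime, pet2, mean)    # mean run time per group
mdn_runtime <- tapply(runtime, pet2, median)  # median run time per group
group_n     <- table(pet2)                    # group sizes

avg_runtime
mdn_runtime
group_n
```

The three results line up with the `avg_runtime`, `mdn_runtime` and `n` columns of the tibble.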

Create suitable graphs to summarise one or two variables

For univariate distributions:

A CATEGORICAL VARIABLE needs a bar plot (there is space between bars)

Boomers do:

barplot(table(data1$favpet))

Zoomers do:

library(ggplot2)
ggplot(data1, aes(x = favpet)) +
  geom_bar()

A CONTINUOUS VARIABLE needs a histogram

Boomers do:

hist(data1$traveltime)

Zoomers do:

ggplot(data1, aes(x = traveltime)) +
  geom_histogram()
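
A histogram is only a count of values per interval, so it helps to look at those counts directly. The sketch below uses base `hist()` with `plot = FALSE` and 10-minute bins; the bin edges from 40 to 100 are an assumption chosen for this example, and the travel times are typed in from the output above.

```r
# Travel times in minutes, copied from the survey output
traveltime <- c(68, 60, 45, 55, 95, 61, 45)

# Compute the bins without drawing anything
bins <- hist(traveltime, breaks = seq(40, 100, by = 10), plot = FALSE)

bins$breaks  # bin edges
bins$counts  # number of observations in each bin
```

By default each bin includes its right edge, so a 60-minute trip is counted in the 50 to 60 bin. In ggplot2, a `binwidth` can be passed to `geom_histogram()` in the same spirit, although its default bin edges are placed differently.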

For bivariate distributions:

How to read a boxplot:
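
A boxplot draws five numbers: the minimum, the lower hinge (about the first quartile), the median, the upper hinge (about the third quartile) and the maximum. Points further than 1.5 box heights from the box are drawn separately as outliers. Here is a sketch computing these values for the film run times (values typed in from the output above):

```r
# Film run times in minutes, copied from the survey output
runtime <- c(90, 186, 120, 150, 95, 99, 195)

# Tukey's five-number summary, the numbers a boxplot is built from
five <- fivenum(runtime)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# The box spans the hinges; the fences sit 1.5 box heights beyond them
box_height <- five[["upper hinge"]] - five[["lower hinge"]]
fence_low  <- five[["lower hinge"]] - 1.5 * box_height
fence_high <- five[["upper hinge"]] + 1.5 * box_height

# Values outside the fences would be plotted as individual points
outliers <- runtime[runtime < fence_low | runtime > fence_high]
outliers
```

For these data the hinges match the 1st and 3rd quartiles printed by `summary(data1)`, and no run time falls outside the fences.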

Boomers:

boxplot(data1$runtime ~ data1$eduhi)

Zoomers:

ggplot(data1, aes(x = eduhi, y = runtime)) +
  geom_boxplot()

Boomers:

plot(data1$runtime, data1$traveltime)

Zoomers:

ggplot(data1, aes(x = runtime, y = traveltime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

See help here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
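
The straight line added by `geom_smooth()` with the `lm` method is an ordinary least-squares fit. To see the numbers behind it, here is a sketch that fits the same model with `lm()` (values typed in from the output above):

```r
# Run times and travel times in minutes, copied from the survey output
runtime    <- c(90, 186, 120, 150, 95, 99, 195)
traveltime <- c(68, 60, 45, 55, 95, 61, 45)

# Regress travel time (y axis) on run time (x axis), as in the plot
fit <- lm(traveltime ~ runtime)

coef(fit)                  # intercept and slope of the fitted line
cor(runtime, traveltime)   # direction and strength of the linear relation
```

In this tiny sample the slope is negative, but with seven observations it should not be read as evidence of anything.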

Millennials:

library(sjPlot)
plot_xtab(data1$pet2, data1$eduhi, 
         margin = "row",
         bar.pos = "stack")

Zoomers:

ggplot(data1, aes(eduhi)) +
  geom_bar(aes(fill = pet2))

Help: https://ggplot2.tidyverse.org/reference/geom_bar.html
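
The stacked bars above are a picture of a cross-tabulation. Below is a sketch of the underlying table and its row percentages in base R (both vectors are reconstructed by hand from the earlier output):

```r
# Recoded variables, reconstructed from the survey output
pet2  <- c("small_pet", "big_pet", "small_pet", "small_pet",
           "big_pet", "big_pet", "small_pet")
eduhi <- c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")

counts <- table(pet2, eduhi)                 # raw counts per combination
row_share <- prop.table(counts, margin = 1)  # shares within each pet group

counts
row_share
```

`margin = 1` gives shares within rows, which corresponds to the `margin = "row"` setting in the `plot_xtab()` call above.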

Useful functions (overview)*

*These are just some of the working solutions

Pass the short test on the scales of measurement

Remember: If cleaning the data gets hard at times, it’s okay. Keep learning and practising!