Volchenko, Shirokanova
January 26, 2020
Get the data before loading!
Fill in the questionnaire at this link: bit.do/survey2K20
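Once the responses are exported as a CSV file, loading them might look like the sketch below. The file path and the two sample rows are made up for illustration. The point is that `read.csv()` replaces spaces and punctuation in the question text with dots, which is where the long column names in this document come from. `stringsAsFactors = TRUE` is an assumption chosen to match the factor columns shown here (it was the default before R 4.0).

```r
# Hypothetical export: write two sample responses to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"Timestamp","Your group","What is the colour of your eyes?"',
  '"2020/01/26 8:44:48 pm GMT+3","test option","green?"',
  '"2020/01/26 8:45:14 pm GMT+3","test option","blue"'
), csv_path)

# check.names = TRUE (the default) makes the column names syntactically valid
data <- read.csv(csv_path, stringsAsFactors = TRUE)

head(data)   # first rows of the data frame
names(data)  # spaces and "?" have been replaced by dots
```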
## Timestamp Your.group
## 1 2020/01/26 8:44:48 pm GMT+3 test option
## 2 2020/01/26 8:45:14 pm GMT+3 test option
## 3 2020/01/26 8:46:43 pm GMT+3 test option
## 4 2020/01/26 8:47:12 pm GMT+3 test option
## 5 2020/01/26 8:47:35 pm GMT+3 test option
## 6 2020/01/26 8:48:00 pm GMT+3 test option
## Did.your.mother.graduate.from.a.university.
## 1 No
## 2 Yes
## 3 Yes
## 4 Yes
## 5 No
## 6 No
## Did.your.father.graduate.from.a.university.
## 1 Yes
## 2 Yes
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
## Which.one.of.those.four.pets.do.you.favour.more.
## 1 Fish
## 2 Cat
## 3 Hamster
## 4 Hamster
## 5 Cat
## 6 Dog
## What.is.the.run.time.of.your.favourite.film..in.minutes.
## 1 90
## 2 186
## 3 120
## 4 150
## 5 95
## 6 99
## How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.
## 1 68
## 2 60
## 3 45
## 4 55
## 5 95
## 6 61
## What.is.the.colour.of.your.eyes.
## 1 green?
## 2 blue
## 3 gray
## 4 grey
## 5 Grey
## 6 hazel
## [1] "Timestamp"
## [2] "Your.group"
## [3] "Did.your.mother.graduate.from.a.university."
## [4] "Did.your.father.graduate.from.a.university."
## [5] "Which.one.of.those.four.pets.do.you.favour.more."
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
## 'data.frame': 7 obs. of 8 variables:
## $ Timestamp : Factor w/ 7 levels "2020/01/26 8:44:48 pm GMT+3",..: 1 2 3 4 5 6 7
## $ Your.group : Factor w/ 1 level "test option": 1 1 1 1 1 1 1
## $ Did.your.mother.graduate.from.a.university. : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1
## $ Did.your.father.graduate.from.a.university. : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 1
## $ Which.one.of.those.four.pets.do.you.favour.more. : Factor w/ 4 levels "Cat","Dog","Fish",..: 3 1 4 4 1 2 4
## $ What.is.the.run.time.of.your.favourite.film..in.minutes. : int 90 186 120 150 95 99 195
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.: int 68 60 45 55 95 61 45
## $ What.is.the.colour.of.your.eyes. : Factor w/ 6 levels "blue","gray",..: 3 1 2 4 5 6 1
## Observations: 7
## Variables: 8
## $ Timestamp <fct> ...
## $ Your.group <fct> ...
## $ Did.your.mother.graduate.from.a.university. <fct> ...
## $ Did.your.father.graduate.from.a.university. <fct> ...
## $ Which.one.of.those.four.pets.do.you.favour.more. <fct> ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes. <int> ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes. <int> ...
## $ What.is.the.colour.of.your.eyes. <fct> ...
## [1] "Timestamp"
## [2] "Your.group"
## [3] "Did.your.mother.graduate.from.a.university."
## [4] "Did.your.father.graduate.from.a.university."
## [5] "Which.one.of.those.four.pets.do.you.favour.more."
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
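The listings above have the layout of `head()`, `names()`, `str()` and `glimpse()` output; the calls themselves are not shown. A sketch of the same inspection steps on a small made-up data frame (`toy` is a hypothetical name):

```r
# Made-up data frame with two of the survey columns
toy <- data.frame(
  Your.group = factor(c("test option", "test option", "test option")),
  What.is.the.colour.of.your.eyes. = factor(c("blue", "grey", "hazel"))
)

head(toy, 2)  # first two rows
names(toy)    # column names as a character vector
str(toy)      # compact structure: class and first values of each column
dim(toy)      # number of rows, then number of columns

# With dplyr attached, glimpse(toy) gives an overview similar to str(toy)
```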
Boomers do:
data1 <- data[c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes.")]
dim(data1) # 7 variables
## [1] 7 7
Millennials do:
data2 <- subset(data,
select = c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes."))
dim(data2) # 7 variables, the same result
## [1] 7 7
Zoomers do:
# library(dplyr) # uncomment if dplyr is not loaded yet; select() comes from dplyr
data3 <- select(data, c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes."))
dim(data3) # 7 variables, the same result
## [1] 7 7
Other approaches work as well:
## [1] 7 7
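The code for this step is not shown; the `[1] 7 7` output is consistent with dropping the unwanted column in some other way. A sketch of two base R options on made-up data (an assumption, not necessarily what was run here):

```r
# Made-up data frame: one unwanted column (Timestamp) and two to keep
toy <- data.frame(
  Timestamp = c("t1", "t2", "t3"),
  Your.group = c("a", "a", "b"),
  runtime = c(90, 120, 150)
)

# Drop by position: a negative index removes the first column
toy_a <- toy[, -1]

# Drop by name: keep every column whose name is not "Timestamp"
toy_b <- toy[, names(toy) != "Timestamp"]

dim(toy_a)  # 3 rows, 2 columns
dim(toy_b)  # the same result
```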
data1 <- rename(data1,
#new name = old name,
studygroup = Your.group,
mothereduc = Did.your.mother.graduate.from.a.university.,
fathereduc = Did.your.father.graduate.from.a.university.,
favpet = Which.one.of.those.four.pets.do.you.favour.more.,
runtime = What.is.the.run.time.of.your.favourite.film..in.minutes.,
traveltime = How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.,
eyecolor = What.is.the.colour.of.your.eyes.)
Check yourself: if you run the code above twice, it will fail the second time. Why?
After changing the variable names, always check the result:
## [1] "studygroup" "mothereduc" "fathereduc" "favpet" "runtime"
## [6] "traveltime" "eyecolor"
Now you have shorter variable names and know how to look them up.
Some important rules of naming:
short (rfrgrtp is better than refrigeratortype)
meaningful (rfrgrtp is better than V1)
consistently styled, for example lowercase words joined by underscores (my_biscuit)
Useful link to the Coding Style Guide in R: http://adv-r.had.co.nz/Style.html
Tidyverse Coding Guide: https://style.tidyverse.org/syntax.html#object-names
Select the answers from one group (studygroup) only
##
## test option
## 7
Boomers do:
## [1] 7 7
Millennials do:
## [1] 7 7
Zoomers do:
## [1] 7 7
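Only the `dim()` results survive for the three variants above. A sketch of what the three calls might look like, on made-up data with two hypothetical group labels ("A" and "B"):

```r
library(dplyr)

# Made-up data with two groups
toy <- data.frame(
  studygroup = c("A", "B", "A", "B"),
  traveltime = c(45, 60, 95, 55)
)

# Base R: logical row index, all columns kept
g1 <- toy[toy$studygroup == "A", ]

# subset(): the condition is evaluated inside the data frame
g2 <- subset(toy, studygroup == "A")

# dplyr: filter() keeps the rows where the condition is TRUE
g3 <- filter(toy, studygroup == "A")

dim(g1); dim(g2); dim(g3)  # each has 2 rows and 2 columns
```

The base R and `subset()` results keep the original row names, while `filter()` renumbers the rows; the selected values are the same.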
Now let’s filter by two conditions: members of one group who spend less than an hour travelling:
data183_1 <- filter(data1, studygroup == "test option" & traveltime < 60)
head(data183_1)
## studygroup mothereduc fathereduc favpet runtime traveltime eyecolor
## 1 test option Yes Yes Hamster 120 45 gray
## 2 test option Yes No Hamster 150 55 grey
## 3 test option No No Hamster 195 45 blue
Alternatively, keep the rows that satisfy at least one of the conditions:
data183_2 <- filter(data1, studygroup == "test option" | traveltime < 60)
head(data183_2)
## studygroup mothereduc fathereduc favpet runtime traveltime eyecolor
## 1 test option No Yes Fish 90 68 green?
## 2 test option Yes Yes Cat 186 60 blue
## 3 test option Yes Yes Hamster 120 45 gray
## 4 test option Yes No Hamster 150 55 grey
## 5 test option No Yes Cat 95 95 Grey
## 6 test option No Yes Dog 99 61 hazel
(Go back to data1 for this.)
Let’s create a new variable which says “Yes” if at least one of the two parents had higher education and “No” if neither did.
library(dplyr)
data1$eduhi <- if_else(data1$mothereduc == "Yes" | data1$fathereduc == "Yes",
"Yes",
"No")
Always check the results of recoding:
##
## No Yes
## 1 6
Now let’s create a categorical variable for average travel time to university: “less than an hour” for trips below 60 minutes, and “an hour or more” for the rest.
data1$time2[data1$traveltime < 60] <- "less than an hour"
data1$time2[data1$traveltime >= 60] <- "an hour or more"
table(data1$time2)
##
## an hour or more less than an hour
## 4 3
##
## an hour or more less than an hour
## 45 0 2
## 55 0 1
## 60 1 0
## 61 1 0
## 68 1 0
## 95 1 0
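The second table above cross-tabulates the original variable against the recoded one, which is a quick way to confirm where the cut-off falls. A sketch on made-up values follows; the `cut()` variant is an addition for comparison, not the method used above.

```r
# Made-up travel times in minutes
toy <- data.frame(traveltime = c(45, 55, 60, 61, 95))

# Recode as in the text: below 60 versus 60 and above
toy$time2 <- NA_character_
toy$time2[toy$traveltime < 60]  <- "less than an hour"
toy$time2[toy$traveltime >= 60] <- "an hour or more"

# Cross-tabulate old against new: each row should fall in exactly one column
table(toy$traveltime, toy$time2)

# Alternative: cut() with right = FALSE builds the intervals [0, 60) and [60, Inf)
toy$time3 <- cut(toy$traveltime,
                 breaks = c(0, 60, Inf),
                 labels = c("less than an hour", "an hour or more"),
                 right = FALSE)
table(toy$time2, toy$time3)
```

`cut()` returns a factor directly, whereas the indexing approach leaves a character vector.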
Let’s collapse several categories into fewer ones.
(We create a new variable so as not to overwrite the original data.)
##
## Cat Dog Fish Hamster
## 2 1 1 3
data1$pet2[data1$favpet == "Cat" |
data1$favpet == "Dog" ] <- "big_pet"
data1$pet2[data1$favpet == "Fish" |
data1$favpet == "Hamster"] <- "small_pet"
table(data1$favpet, data1$pet2)
##
## big_pet small_pet
## Cat 2 0
## Dog 1 0
## Fish 0 1
## Hamster 0 3
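When the list of categories grows, chaining `|` gets long. One compact base R alternative is `%in%` combined with `ifelse()`; the sketch below uses made-up data and is not the approach taken above.

```r
# Made-up answers to the pet question
toy <- data.frame(favpet = c("Fish", "Cat", "Hamster", "Dog"))

# ifelse() returns "big_pet" where the test is TRUE and "small_pet" otherwise
toy$pet2 <- ifelse(toy$favpet %in% c("Cat", "Dog"), "big_pet", "small_pet")

# Check the recoding against the original variable
table(toy$favpet, toy$pet2)
```

Note that `%in%` returns FALSE for missing values, so an `NA` in `favpet` would end up as "small_pet" here; the indexing approach above would leave it as `NA`.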
Boomers do:
## studygroup mothereduc fathereduc favpet runtime
## test option:7 No :4 No :2 Cat :2 Min. : 90.0
## Yes:3 Yes:5 Dog :1 1st Qu.: 97.0
## Fish :1 Median :120.0
## Hamster:3 Mean :133.6
## 3rd Qu.:168.0
## Max. :195.0
## traveltime eyecolor eduhi time2
## Min. :45.00 blue :2 Length:7 Length:7
## 1st Qu.:50.00 gray :1 Class :character Class :character
## Median :60.00 green?:1 Mode :character Mode :character
## Mean :61.29 grey :1
## 3rd Qu.:64.50 Grey :1
## Max. :95.00 hazel :1
## pet2
## Length:7
## Class :character
## Mode :character
##
##
##
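The table above has the layout of `summary()` output. Note that the recoded variables are reported only as `Length`, `Class` and `Mode` because they are character vectors. A sketch on made-up data showing how converting to a factor changes that:

```r
# Made-up data: one numeric and one character column
toy <- data.frame(
  runtime = c(90, 186, 120, 150),
  pet2 = c("small_pet", "big_pet", "small_pet", "small_pet"),
  stringsAsFactors = FALSE
)

summary(toy)  # pet2 is character: only Length, Class and Mode are reported

# Converting to a factor makes summary() count the categories
toy$pet2 <- factor(toy$pet2)
summary(toy)
pet_counts <- summary(toy$pet2)
pet_counts
```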
Millennials do:
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 7 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## mothereduc* 2 7 1.43 0.53 1 1.43 0.00 1 2 1 0.23
## fathereduc* 3 7 1.71 0.49 2 1.71 0.00 1 2 1 -0.75
## favpet* 4 7 2.71 1.38 3 2.71 1.48 1 4 3 -0.22
## runtime 5 7 133.57 43.89 120 133.57 44.48 90 195 105 0.33
## traveltime 6 7 61.29 17.09 60 61.29 11.86 45 95 50 0.85
## eyecolor* 7 7 3.14 1.95 3 3.14 2.97 1 6 5 0.18
## eduhi* 8 7 NaN NA NA NaN NA Inf -Inf -Inf NA
## time2* 9 7 NaN NA NA NaN NA Inf -Inf -Inf NA
## pet2* 10 7 NaN NA NA NaN NA Inf -Inf -Inf NA
## kurtosis se
## studygroup* NaN 0.00
## mothereduc* -2.20 0.20
## fathereduc* -1.60 0.18
## favpet* -1.99 0.52
## runtime -1.88 16.59
## traveltime -0.60 6.46
## eyecolor* -1.79 0.74
## eduhi* NA NA
## time2* NA NA
## pet2* NA NA
##
## Descriptive statistics by group
## group: big_pet
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 3 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## mothereduc* 2 3 1.33 0.58 1 1.33 0.00 1 2 1 0.38
## fathereduc* 3 3 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## favpet* 4 3 1.33 0.58 1 1.33 0.00 1 2 1 0.38
## runtime 5 3 126.67 51.42 99 126.67 5.93 95 186 91 0.38
## traveltime 6 3 72.00 19.92 61 72.00 1.48 60 95 35 0.38
## eyecolor* 7 3 4.00 2.65 5 4.00 1.48 1 6 5 -0.32
## eduhi* 8 3 NaN NA NA NaN NA Inf -Inf -Inf NA
## time2* 9 3 NaN NA NA NaN NA Inf -Inf -Inf NA
## pet2* 10 3 NaN NA NA NaN NA Inf -Inf -Inf NA
## kurtosis se
## studygroup* NaN 0.00
## mothereduc* -2.33 0.33
## fathereduc* NaN 0.00
## favpet* -2.33 0.33
## runtime -2.33 29.69
## traveltime -2.33 11.50
## eyecolor* -2.33 1.53
## eduhi* NA NA
## time2* NA NA
## pet2* NA NA
## --------------------------------------------------------
## group: small_pet
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 4 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN
## mothereduc* 2 4 1.50 0.58 1.5 1.50 0.74 1 2 1 0.00
## fathereduc* 3 4 1.50 0.58 1.5 1.50 0.74 1 2 1 0.00
## favpet* 4 4 3.75 0.50 4.0 3.75 0.00 3 4 1 -0.75
## runtime 5 4 138.75 44.79 135.0 138.75 44.48 90 195 105 0.16
## traveltime 6 4 53.25 10.90 50.0 53.25 7.41 45 68 23 0.40
## eyecolor* 7 4 2.50 1.29 2.5 2.50 1.48 1 4 3 0.00
## eduhi* 8 4 NaN NA NA NaN NA Inf -Inf -Inf NA
## time2* 9 4 NaN NA NA NaN NA Inf -Inf -Inf NA
## pet2* 10 4 NaN NA NA NaN NA Inf -Inf -Inf NA
## kurtosis se
## studygroup* NaN 0.00
## mothereduc* -2.44 0.29
## fathereduc* -2.44 0.29
## favpet* -1.69 0.25
## runtime -2.02 22.40
## traveltime -2.00 5.45
## eyecolor* -2.08 0.65
## eduhi* NA NA
## time2* NA NA
## pet2* NA NA
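The tables above look like output from `describe()` and `describeBy()` in the psych package; the calls are not shown. For comparison, a base R sketch of grouped statistics on made-up data, using `tapply()` and `aggregate()` (these are substitutes, not the functions that produced the tables):

```r
# Made-up data: run time by pet group
toy <- data.frame(
  pet2 = c("big_pet", "small_pet", "small_pet", "big_pet"),
  runtime = c(186, 90, 120, 99)
)

# Mean run time per group, returned as a named vector
group_means <- tapply(toy$runtime, toy$pet2, mean)
group_means

# Several statistics per group in one data frame
by_pet <- aggregate(runtime ~ pet2, data = toy,
                    FUN = function(x) c(mean = mean(x), median = median(x), n = length(x)))
by_pet

# With psych installed, psych::describeBy(toy$runtime, group = toy$pet2)
# prints a fuller table for each group
```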
Zoomers do:
library(magrittr)
data1 %>%
group_by(pet2) %>%
summarise(avg_runtime = mean(runtime),
mdn_runtime = median(runtime),
n = n())
## # A tibble: 2 x 4
## pet2 avg_runtime mdn_runtime n
## <chr> <dbl> <dbl> <int>
## 1 big_pet 127. 99 3
## 2 small_pet 139. 135 4
Millennials were quicker here.
Help page here: https://www.earthdatascience.org/workshops/clean-coding-tidyverse-intro/summarise-data-in-R-tidyverse/
For univariate distributions:
A CATEGORICAL VARIABLE needs a bar plot (there is space between bars)
Boomers do:
Zoomers do:
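The plots are not reproduced here. A sketch of both variants on made-up data, following the `barplot(table(...))` and `geom_bar()` calls named in the summary at the end of this document; axis labels are illustrative choices.

```r
library(ggplot2)

# Made-up answers to the pet question
toy <- data.frame(favpet = c("Fish", "Cat", "Hamster", "Hamster", "Cat", "Dog"))

# Base R: count the categories first, then draw the counts
pet_counts <- table(toy$favpet)
barplot(pet_counts, xlab = "Favourite pet", ylab = "Count")

# ggplot2: geom_bar() counts the rows in each category itself
p <- ggplot(toy, aes(x = favpet)) +
  geom_bar() +
  labs(x = "Favourite pet", y = "Count")
print(p)
```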
A CONTINUOUS VARIABLE needs a histogram
Boomers do:
Zoomers do:
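A sketch of both histogram variants on made-up run times, following the `hist()` and `geom_histogram()` calls named in the summary at the end of this document. The bin width of 30 minutes is an arbitrary choice for illustration.

```r
library(ggplot2)

# Made-up film run times in minutes
toy <- data.frame(runtime = c(90, 186, 120, 150, 95, 99, 195))

# Base R: hist() picks the bins automatically and returns them invisibly
h <- hist(toy$runtime, main = "", xlab = "Run time (minutes)")

# ggplot2: setting binwidth explicitly avoids the default of 30 bins
p <- ggplot(toy, aes(x = runtime)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Run time (minutes)", y = "Count")
print(p)
```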
For bivariate distributions:
How to read a boxplot:
Boomers:
Zoomers:
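A sketch of a boxplot of a continuous variable by group, in base R and in ggplot2, on made-up data. The variable names follow the ones used earlier in this document; the values are illustrative.

```r
library(ggplot2)

# Made-up data: run time by pet group
toy <- data.frame(
  pet2 = factor(c("small_pet", "big_pet", "small_pet", "small_pet",
                  "big_pet", "big_pet", "small_pet")),
  runtime = c(90, 186, 120, 150, 95, 99, 195)
)

# Base R: formula interface, continuous ~ group
b <- boxplot(runtime ~ pet2, data = toy,
             xlab = "Pet group", ylab = "Run time (minutes)")

# ggplot2: group on the x axis, continuous variable on the y axis
p <- ggplot(toy, aes(x = pet2, y = runtime)) +
  geom_boxplot() +
  labs(x = "Pet group", y = "Run time (minutes)")
print(p)
```

The thick line inside each box is the group median; `b$stats` holds the five numbers drawn for each group.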
Boomers:
Zoomers:
See help here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
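A sketch of a scatterplot of two continuous variables, in base R and in ggplot2, on made-up data. It follows the `plot()` and `geom_point()` calls named in the summary at the end of this document.

```r
library(ggplot2)

# Made-up data: travel time and film run time, both in minutes
toy <- data.frame(
  traveltime = c(68, 60, 45, 55, 95, 61, 45),
  runtime = c(90, 186, 120, 150, 95, 99, 195)
)

# Base R: first argument goes on the x axis, second on the y axis
plot(toy$traveltime, toy$runtime,
     xlab = "Travel time (minutes)", ylab = "Run time (minutes)")

# ggplot2: one point per row
p <- ggplot(toy, aes(x = traveltime, y = runtime)) +
  geom_point() +
  labs(x = "Travel time (minutes)", y = "Run time (minutes)")
print(p)
```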
Millennials:
Zoomers:
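For two categorical variables, the summary at the end of this document names `sjp.xtab()` and `geom_bar(aes(fill = ...))`. The sketch below covers only a plain cross-table and the ggplot2 variant on made-up data; `sjp.xtab()` is left out because its availability depends on the installed sjPlot version.

```r
library(ggplot2)

# Made-up data: favourite pet and mother's education
toy <- data.frame(
  favpet = c("Fish", "Cat", "Hamster", "Hamster", "Cat", "Dog"),
  mothereduc = c("No", "Yes", "Yes", "Yes", "No", "No")
)

# Cross-table of the two variables
xt <- table(toy$favpet, toy$mothereduc)
xt

# ggplot2: one bar per pet, filled by the second variable (stacked by default)
p <- ggplot(toy, aes(x = favpet)) +
  geom_bar(aes(fill = mothereduc)) +
  labs(x = "Favourite pet", y = "Count", fill = "Mother has a degree")
print(p)
```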
str(), glimpse(), describeBy(), group_by() %>% summarise()
select(data, var)
subset(data, var), filter(data, condition)
rename(data, new_name = old_name), if_else(data$var == 0, "value if true", "value if false")
barplot(table(data$var)), ggplot(data, aes(var)) + geom_bar()
hist(data$var), ggplot(data, aes(var)) + geom_histogram()
boxplot(data$cont ~ data$group), ggplot(data, aes(x = group, y = cont)) + geom_boxplot()
plot(data$var, data$var2), ggplot(data, aes(x = var, y = var2)) + geom_point()
sjp.xtab(data$var, data$var2, ...), ggplot(data, aes(var)) + geom_bar(aes(fill = var2))
dim(data), table(data$var, data$var_new)
*These are just some of the working solutions.