Olesya Volchenko, Anna Shirokanova
January 13, 2021
In R, a function call takes the form function.name(x, y, z),
where x, y and z are the arguments of the function.
For example, the function sum():
## [1] 25
## [1] 2.718282
## [1] 2.302585
## [1] 1.414214
## [1] 6
## [1] 10
## [1] 1 2 3
## [1] 1 1 1 1 1 1 1 1 1 1
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
## [1] "numeric"
## [1] "character"
## [1] "logical"
## [1] yes no yes maybe maybe no maybe no no
## Levels: maybe no yes
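The calls that produced the output above were not preserved, so the following is only a sketch of calls that follow the function.name(arguments) pattern and return values of the same kind. The specific arguments and the `answers` vector are assumptions chosen for illustration.

```r
# Arithmetic helpers: each call is function.name(arguments)
sum(1, 2, 3)     # adds its arguments
exp(1)           # Euler's number, prints 2.718282
log(10)          # natural logarithm, prints 2.302585
sqrt(2)          # square root, prints 1.414214

# Building vectors
c(1, 2, 3)       # combine values into one vector
rep(1, 10)       # repeat the value 1 ten times
1:5              # integer sequence from 1 to 5
seq(1, 5)        # the same sequence written with seq()

# Inspecting the type of a value
class(3.14)      # "numeric"
class("text")    # "character"
class(TRUE)      # "logical"

# A factor stores categories as levels (sorted alphabetically by default)
answers <- factor(c("yes", "no", "yes", "maybe"))
levels(answers)  # "maybe" "no" "yes"
```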
name <- c("Masha", "Vasya", "Anya", "Petya", "Vanya")
age <- c(18, 17, 19, 21, 20)
weight <- c(45, 80, 69, 92, 60)
height <- c(1.62, 1.75, 1.82, 1.92, 1.70)
gender <- c("F", "M", "F", "M", "M")
course <- c(1, 1, 2, 3, 4)
students <- data.frame(name, age, weight, height, gender, course)
students
## name age weight height gender course
## 1 Masha 18 45 1.62 F 1
## 2 Vasya 17 80 1.75 M 1
## 3 Anya 19 69 1.82 F 2
## 4 Petya 21 92 1.92 M 3
## 5 Vanya 20 60 1.70 M 4
Let’s call the “gender” variable from the “students” dataset:
## [1] "F" "M" "F" "M" "M"
## [1] "Masha"
## name age weight height gender course
## 1 Masha 18 45 1.62 F 1
## [1] "Masha" "Vasya" "Anya" "Petya" "Vanya"
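The indexing calls behind the output above did not survive, so here is a minimal sketch of the usual ways to reach into a data frame. It rebuilds a two-column version of `students` so that it runs on its own; which exact calls the original used is an assumption.

```r
# A cut-down copy of the students data frame from above
students <- data.frame(
  name   = c("Masha", "Vasya", "Anya", "Petya", "Vanya"),
  gender = c("F", "M", "F", "M", "M")
)

students$gender      # one column by name, returned as a vector
students$name[1]     # the first element of a column
students[1, ]        # the first row with all its columns
students[, "name"]   # one column via [row, column] indexing
```

The general pattern is data[rows, columns]; leaving one side empty means "take everything".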
head() - the first 6 rows of your dataset
tail() - the last 6 rows of your dataset
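A quick sketch of both functions on a made-up ten-row data frame (the `numbers` object is an assumption used only for illustration):

```r
# Ten rows, so that head() and tail() return different slices
numbers <- data.frame(id = 1:10, square = (1:10)^2)

head(numbers)      # rows 1 to 6
tail(numbers)      # rows 5 to 10
head(numbers, 3)   # an optional second argument sets the number of rows
```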
The working directory (wd) is the folder on your computer where R looks for files and saves them.
## [1] "C:/Users/lssi7/Downloads/Data_Analysis_in_Sociology"
Avoid Cyrillic characters in the path to your working directory.
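A minimal sketch of checking and changing the working directory with getwd() and setwd(). Here tempdir() only stands in for your own project folder, which is an assumption of this example.

```r
old_wd <- getwd()    # where R currently looks for files
old_wd

setwd(tempdir())     # tempdir() is a placeholder for your own folder path
getwd()              # confirm that the directory has changed

setwd(old_wd)        # switch back to where we started
```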
Use these functions to load external data files:
Get the data before loading!
Fill in the questionnaire at this link: bit.do/survey2K20
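The list of loading functions was lost from this page. As one possibility, base R's read.csv() reads a comma-separated file into a data frame. The sketch below first writes a tiny made-up file so that it is runnable; the file name `survey_demo.csv` and its contents are hypothetical.

```r
# Create a small CSV file to read (a stand-in for the downloaded survey file)
csv_path <- file.path(tempdir(), "survey_demo.csv")
writeLines(c("Your group,Favourite pet",
             "BSC181,Cat",
             "BSC182,Dog"),
           csv_path)

data_demo <- read.csv(csv_path)   # the first line is used as the header
head(data_demo)
names(data_demo)                  # spaces in headers are replaced by dots
```

This is why the survey questions show up below as long variable names with dots in place of spaces and punctuation.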
## Timestamp Your.group
## 1 2020/01/26 8:44:48 pm GMT+3 test option
## 2 2020/01/26 8:45:14 pm GMT+3 test option
## 3 2020/01/26 8:46:43 pm GMT+3 test option
## 4 2020/01/26 8:47:12 pm GMT+3 test option
## 5 2020/01/26 8:47:35 pm GMT+3 test option
## 6 2020/01/26 8:48:00 pm GMT+3 test option
## Did.your.mother.graduate.from.a.university.
## 1 No
## 2 Yes
## 3 Yes
## 4 Yes
## 5 No
## 6 No
## Did.your.father.graduate.from.a.university.
## 1 Yes
## 2 Yes
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
## Which.one.of.those.four.pets.do.you.favour.more.
## 1 Fish
## 2 Cat
## 3 Hamster
## 4 Hamster
## 5 Cat
## 6 Dog
## What.is.the.run.time.of.your.favourite.film..in.minutes.
## 1 90
## 2 186
## 3 120
## 4 150
## 5 95
## 6 99
## How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.
## 1 68
## 2 60
## 3 45
## 4 55
## 5 95
## 6 61
## What.is.the.colour.of.your.eyes.
## 1 green?
## 2 blue
## 3 gray
## 4 grey
## 5 Grey
## 6 hazel
## [1] "Timestamp"
## [2] "Your.group"
## [3] "Did.your.mother.graduate.from.a.university."
## [4] "Did.your.father.graduate.from.a.university."
## [5] "Which.one.of.those.four.pets.do.you.favour.more."
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
## 'data.frame': 63 obs. of 8 variables:
## $ Timestamp : chr "2020/01/26 8:44:48 pm GMT+3" "2020/01/26 8:45:14 pm GMT+3" "2020/01/26 8:46:43 pm GMT+3" "2020/01/26 8:47:12 pm GMT+3" ...
## $ Your.group : chr "test option" "test option" "test option" "test option" ...
## $ Did.your.mother.graduate.from.a.university. : chr "No" "Yes" "Yes" "Yes" ...
## $ Did.your.father.graduate.from.a.university. : chr "Yes" "Yes" "Yes" "No" ...
## $ Which.one.of.those.four.pets.do.you.favour.more. : chr "Fish" "Cat" "Hamster" "Hamster" ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes. : int 90 186 120 150 95 99 195 218 128 96 ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.: int 68 60 45 55 95 61 45 10 85 75 ...
## $ What.is.the.colour.of.your.eyes. : chr "green?" "blue" "gray" "grey" ...
## Rows: 63
## Columns: 8
## $ Timestamp <chr> ...
## $ Your.group <chr> ...
## $ Did.your.mother.graduate.from.a.university. <chr> ...
## $ Did.your.father.graduate.from.a.university. <chr> ...
## $ Which.one.of.those.four.pets.do.you.favour.more. <chr> ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes. <int> ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes. <int> ...
## $ What.is.the.colour.of.your.eyes. <chr> ...
## [1] "Timestamp"
## [2] "Your.group"
## [3] "Did.your.mother.graduate.from.a.university."
## [4] "Did.your.father.graduate.from.a.university."
## [5] "Which.one.of.those.four.pets.do.you.favour.more."
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
Boomers do:
data1 <- data[c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes.")]
dim(data1) # 7 variables
## [1] 63 7
Millennials do:
data2 <- subset(data,
select = c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes."))
dim(data2) # 7 variables, the same result
## [1] 63 7
Zoomers do:
library(dplyr) # select() comes from dplyr
data3 <- select(data, c("Your.group",
"Did.your.mother.graduate.from.a.university.",
"Did.your.father.graduate.from.a.university.",
"Which.one.of.those.four.pets.do.you.favour.more.",
"What.is.the.run.time.of.your.favourite.film..in.minutes.",
"How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
"What.is.the.colour.of.your.eyes."))
dim(data3) # 7 variables, the same result
## [1] 63 7
Other approaches also work:
## [1] 63 7
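The code for this alternative was not preserved. One common option, shown here as an assumption rather than the original solution, is to drop the unwanted column instead of listing all the ones to keep. The `survey_demo` data frame is made up.

```r
# Toy data frame with a Timestamp column we do not need
survey_demo <- data.frame(
  Timestamp  = c("t1", "t2"),
  Your.group = c("BSC181", "BSC182"),
  pet        = c("Cat", "Dog")
)

kept_a <- survey_demo[, -1]                                 # drop the first column by position
kept_b <- survey_demo[, names(survey_demo) != "Timestamp"]  # drop a column by name

dim(kept_a)   # one column fewer than the original
```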
data1 <- rename(data1,
#new name = old name,
studygroup = Your.group,
mothereduc = Did.your.mother.graduate.from.a.university.,
fathereduc = Did.your.father.graduate.from.a.university.,
favpet = Which.one.of.those.four.pets.do.you.favour.more.,
runtime = What.is.the.run.time.of.your.favourite.film..in.minutes.,
traveltime = How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.,
eyecolor = What.is.the.colour.of.your.eyes.)
Check yourself: If you run the code above twice, it won’t work the second time. Why?
After changing the variable names, always check the result:
## [1] "studygroup" "mothereduc" "fathereduc" "favpet" "runtime"
## [6] "traveltime" "eyecolor"
Now you have shorter variable names and know how to learn what they are.
Some important rules of naming:
Keep names short (rfrgrtp is better than refrigeratortype)
Keep names meaningful (rfrgrtp is better than V1)
Use an underscore to separate words (my_biscuit)
Useful link to the Coding Style Guide in R: http://adv-r.had.co.nz/Style.html
Tidyverse Coding Guide: https://style.tidyverse.org/syntax.html#object-names
Select the answers from one group (studygroup) only
##
## BSC181 BSC182 BSC183 test option
## 16 17 23 7
Boomers do:
## [1] 16 7
Millennials do:
## [1] 16 7
Zoomers do:
## [1] 16 7
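The three filtering calls themselves are missing above. A sketch of what they could look like, following the same indexing / subset() / dplyr pattern used for selecting columns, is given below. The `group_demo` data frame and its values are made up.

```r
library(dplyr)

# Toy stand-in for data1
group_demo <- data.frame(
  studygroup = c("BSC181", "BSC182", "BSC181", "BSC183"),
  traveltime = c(50, 80, 20, 65)
)

by_index  <- group_demo[group_demo$studygroup == "BSC181", ]  # logical row index
by_subset <- subset(group_demo, studygroup == "BSC181")       # base R subset()
by_filter <- filter(group_demo, studygroup == "BSC181")       # dplyr filter()

dim(by_index)   # all three keep the same rows
```

Note the comma in the first version: the condition goes in the row position, and the empty column position keeps all variables.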
Now let’s filter by two conditions: members of one group who spend less than an hour on travel time:
## studygroup mothereduc fathereduc favpet runtime traveltime eyecolor
## 1 BSC181 Yes No Dog 80 50 Green
## 2 BSC181 No No Cat 90 50 Green
## 3 BSC181 Yes Yes Fish 205 39 brown
## 4 BSC181 Yes Yes Cat 188 40 Brown
## 5 BSC181 Yes Yes Fish 96 18 Green
## 6 BSC181 Yes Yes Cat 113 20 Brown
## [1] 8 7
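A sketch of combining two conditions with `&`, which keeps a row only when both are TRUE. The original call is not shown above, so the use of subset() and the `trips_and` data are assumptions.

```r
# Toy stand-in for data1
trips_and <- data.frame(
  studygroup = c("BSC181", "BSC181", "BSC182", "BSC183"),
  traveltime = c(50, 75, 40, 10)
)

# Members of BSC181 AND less than 60 minutes of travel
short_trips <- subset(trips_and, studygroup == "BSC181" & traveltime < 60)
short_trips
dim(short_trips)
```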
Alternatively, you can keep the rows that meet at least one condition from a set:
## studygroup mothereduc fathereduc favpet runtime traveltime eyecolor
## 1 test option Yes Yes Hamster 120 45 gray
## 2 test option Yes No Hamster 150 55 grey
## 3 test option No No Hamster 195 45 blue
## 4 BSC183 No Yes Dog 218 10 Blue
## 5 BSC183 Yes No Dog 155 30 green
## 6 BSC183 Yes Yes Cat 106 28 blue
## [1] 34 7
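A sketch of the "at least one condition" case with `|`. Which conditions the original combined is not recoverable from the output, so the ones below, like the `trips_or` data, are only an illustration.

```r
# Toy stand-in for data1
trips_or <- data.frame(
  studygroup = c("BSC181", "BSC181", "BSC182", "BSC183", "BSC183"),
  traveltime = c(50, 75, 40, 10, 90)
)

# Keep a row if it is in BSC183 OR the trip is under 60 minutes
either <- subset(trips_or, studygroup == "BSC183" | traveltime < 60)
either
dim(either)
```

Only the second row is dropped here: it is neither in BSC183 nor under an hour.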
(Go back to data1 for this.)
Let’s create a new variable which says “Yes” if at least one of the two parents had higher education and “No” if neither did.
library(dplyr)
data1$eduhi <- if_else(data1$mothereduc == "Yes" | data1$fathereduc == "Yes",
"Yes",
"No")
Always check the results of recoding:
##
## No Yes
## 7 56
##
## No Yes
## No 7 0
## Yes 10 46
##
## No Yes
## No 7 0
## Yes 7 49
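The checks above look like frequency tables, so a sketch of that kind of check with table() follows. The `parents` data are made up, and base ifelse() is swapped in for dplyr's if_else() only so that the example needs no packages.

```r
# Toy data: one row per respondent
parents <- data.frame(
  mothereduc = c("Yes", "No", "No", "Yes"),
  fathereduc = c("No", "No", "Yes", "Yes")
)

# Base ifelse() used here instead of dplyr::if_else()
parents$eduhi <- ifelse(parents$mothereduc == "Yes" | parents$fathereduc == "Yes",
                        "Yes", "No")

table(parents$eduhi)   # how many "Yes" and "No" in the new variable

# Cross-tabulate the new variable against an old one
check <- table(eduhi = parents$eduhi, mothereduc = parents$mothereduc)
check
```

If the recoding is right, the cell for eduhi "No" and mothereduc "Yes" must be zero, just as in the output above.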
Now let’s create a categorical variable for average travel time to university: “less than an hour” for trips below 60 minutes, and “an hour or more” for the rest.
data1$time2[data1$traveltime < 60] <- "less than an hour"
data1$time2[data1$traveltime >= 60] <- "an hour or more"
table(data1$time2)
##
## an hour or more less than an hour
## 37 26
##
## an hour or more less than an hour
## 10 0 1
## 15 0 1
## 18 0 1
## 20 0 1
## 25 0 2
## 28 0 1
## 30 0 2
## 35 0 1
## 39 0 1
## 40 0 4
## 45 0 3
## 50 0 5
## 55 0 3
## 60 5 0
## 61 1 0
## 65 4 0
## 68 1 0
## 70 5 0
## 75 6 0
## 80 8 0
## 81 1 0
## 85 1 0
## 90 3 0
## 95 1 0
## 180 1 0
Let’s collapse several categories into fewer ones.
(We create a new variable so as not to overwrite the original data.)
##
## Cat Dog Fish Hamster
## 32 23 4 4
data1$pet2[data1$favpet == "Cat" |
data1$favpet == "Dog" ] <- "big_pet"
data1$pet2[data1$favpet == "Fish" |
data1$favpet == "Hamster"] <- "small_pet"
table(data1$favpet, data1$pet2)
##
## big_pet small_pet
## Cat 32 0
## Dog 23 0
## Fish 0 4
## Hamster 0 4
Boomers do:
## studygroup mothereduc fathereduc favpet
## Length:63 Length:63 Length:63 Length:63
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## runtime traveltime eyecolor eduhi
## Min. : 70.0 Min. : 10.00 Length:63 Length:63
## 1st Qu.: 99.5 1st Qu.: 45.00 Class :character Class :character
## Median :120.0 Median : 61.00 Mode :character Mode :character
## Mean :126.1 Mean : 60.71
## 3rd Qu.:146.0 3rd Qu.: 75.00
## Max. :218.0 Max. :180.00
## time2 pet2
## Length:63 Length:63
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Millennials do:
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 63 2.33 0.98 2 2.29 1.48 1 4 3 0.01
## mothereduc* 2 63 1.73 0.45 2 1.78 0.00 1 2 1 -1.01
## fathereduc* 3 63 1.78 0.42 2 1.84 0.00 1 2 1 -1.30
## favpet* 4 63 1.68 0.86 1 1.53 0.00 1 4 3 1.24
## runtime 5 63 126.06 34.87 120 122.69 34.10 70 218 148 0.84
## traveltime 6 63 60.71 25.92 61 60.43 23.72 10 180 170 1.17
## eyecolor* 7 63 9.38 6.15 8 8.94 5.93 1 24 23 0.52
## eduhi* 8 63 1.89 0.32 2 1.98 0.00 1 2 1 -2.42
## time2* 9 63 1.41 0.50 1 1.39 0.00 1 2 1 0.35
## pet2* 10 63 1.13 0.34 1 1.04 0.00 1 2 1 2.19
## kurtosis se
## studygroup* -1.15 0.12
## mothereduc* -0.99 0.06
## fathereduc* -0.30 0.05
## favpet* 0.94 0.11
## runtime -0.10 4.39
## traveltime 5.19 3.27
## eyecolor* -0.59 0.77
## eduhi* 3.90 0.04
## time2* -1.91 0.06
## pet2* 2.83 0.04
##
## Descriptive statistics by group
## group: big_pet
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 55 2.27 0.89 2 2.27 1.48 1 4 3 -0.08
## mothereduc* 2 55 1.75 0.44 2 1.80 0.00 1 2 1 -1.10
## fathereduc* 3 55 1.80 0.40 2 1.87 0.00 1 2 1 -1.46
## favpet* 4 55 1.42 0.50 1 1.40 0.00 1 2 1 0.32
## runtime 5 55 124.45 33.71 120 121.47 31.13 70 218 148 0.85
## traveltime 6 55 61.71 26.51 65 61.20 22.24 10 180 170 1.22
## eyecolor* 7 55 8.78 5.57 8 8.40 4.45 1 22 21 0.50
## eduhi* 8 55 1.91 0.29 2 2.00 0.00 1 2 1 -2.77
## time2* 9 55 1.38 0.49 1 1.36 0.00 1 2 1 0.47
## pet2* 10 55 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## kurtosis se
## studygroup* -1.07 0.12
## mothereduc* -0.81 0.06
## fathereduc* 0.13 0.05
## favpet* -1.93 0.07
## runtime 0.04 4.55
## traveltime 5.24 3.57
## eyecolor* -0.51 0.75
## eduhi* 5.77 0.04
## time2* -1.81 0.07
## pet2* NaN 0.00
## ------------------------------------------------------------
## group: small_pet
## vars n mean sd median trimmed mad min max range skew
## studygroup* 1 8 2.12 0.99 2.5 2.12 0.74 1 3 2 -0.20
## mothereduc* 2 8 1.62 0.52 2.0 1.62 0.00 1 2 1 -0.42
## fathereduc* 3 8 1.62 0.52 2.0 1.62 0.00 1 2 1 -0.42
## favpet* 4 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00
## runtime 5 8 137.12 42.90 120.5 137.12 40.03 90 205 115 0.51
## traveltime 6 8 53.88 21.66 50.0 53.88 21.50 18 81 63 -0.13
## eyecolor* 7 8 3.75 2.12 3.5 3.75 2.22 1 7 6 0.21
## eduhi* 8 8 1.75 0.46 2.0 1.75 0.00 1 2 1 -0.95
## time2* 9 8 1.62 0.52 2.0 1.62 0.00 1 2 1 -0.42
## pet2* 10 8 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN
## kurtosis se
## studygroup* -2.07 0.35
## mothereduc* -2.03 0.18
## fathereduc* -2.03 0.18
## favpet* -2.23 0.19
## runtime -1.50 15.17
## traveltime -1.43 7.66
## eyecolor* -1.67 0.75
## eduhi* -1.21 0.16
## time2* -2.03 0.18
## pet2* NaN 0.00
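The calls behind the two outputs above are not shown. The first looks like base summary(), and describeBy() is named in the summary list at the end of this page, so the sketch below assumes those two. The `films_demo` data are made up, and the psych call runs only if that package is installed.

```r
# Toy stand-in for data1
films_demo <- data.frame(
  pet2    = c("big_pet", "big_pet", "small_pet", "small_pet"),
  runtime = c(90, 120, 100, 200)
)

summary(films_demo)   # quartiles and mean for numeric columns, length and class for text

# Base R per-group mean, as a package-free cross-check
group_means <- tapply(films_demo$runtime, films_demo$pet2, mean)
group_means

# Descriptive statistics by group, if the psych package is available
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(films_demo$runtime, group = films_demo$pet2))
}
```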
Zoomers do:
library(magrittr) # the %>% pipe
library(dplyr) # group_by(), summarise(), n()
data1 %>%
group_by(pet2) %>%
summarise(avg_runtime = mean(runtime),
mdn_runtime = median(runtime),
n = n())
## # A tibble: 2 x 4
## pet2 avg_runtime mdn_runtime n
## <chr> <dbl> <dbl> <int>
## 1 big_pet 124. 120 55
## 2 small_pet 137. 120. 8
Millennials have been quicker here.
Help page here: https://www.earthdatascience.org/workshops/clean-coding-tidyverse-intro/summarise-data-in-R-tidyverse/
For univariate distributions:
A CATEGORICAL VARIABLE needs a bar plot (there is space between bars)
Boomers do:
Zoomers do:
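The plotting code and figures were lost here. A sketch of both versions follows, using the calls listed in the summary at the end of this page; the `pets_demo` data are made up.

```r
library(ggplot2)

# Toy categorical variable
pets_demo <- data.frame(favpet = c("Cat", "Dog", "Cat", "Fish", "Cat", "Dog"))

# Base R: count the categories first, then plot the counts
pet_counts <- table(pets_demo$favpet)
barplot(pet_counts)

# ggplot2: geom_bar() counts the categories itself
bar_gg <- ggplot(pets_demo, aes(favpet)) + geom_bar()
bar_gg
```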
A CONTINUOUS VARIABLE needs a histogram
Boomers do:
Zoomers do:
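Again the original code and figures are missing, so this is a sketch with the calls from the summary list. The ten runtime values are the first ones shown in the str() output above; the `binwidth` of 30 minutes is an arbitrary choice for illustration.

```r
library(ggplot2)

# First ten film run times from the survey output above
films_hist <- data.frame(runtime = c(90, 186, 120, 150, 95, 99, 195, 218, 128, 96))

# Base R: hist() picks the bins automatically
runtime_hist <- hist(films_hist$runtime)

# ggplot2: set the bin width explicitly
hist_gg <- ggplot(films_hist, aes(runtime)) + geom_histogram(binwidth = 30)
hist_gg
```

Unlike in a bar plot, the bars of a histogram touch each other, because the variable is continuous.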
For bivariate distributions:
How to read a boxplot:
Boomers:
Zoomers:
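A sketch of a boxplot of a continuous variable split by a grouping variable, in both styles. The `box_demo` data are made up, and the aes() mapping in the ggplot2 version is an assumption about how the original was written.

```r
library(ggplot2)

# Toy data: a continuous variable and a two-level group
box_demo <- data.frame(
  pet2    = c("big_pet", "big_pet", "big_pet", "small_pet", "small_pet", "small_pet"),
  runtime = c(90, 120, 150, 100, 130, 200)
)

# Base R: formula interface, continuous ~ group
box_stats <- boxplot(runtime ~ pet2, data = box_demo)

# ggplot2: group on the x axis, continuous variable on the y axis
box_gg <- ggplot(box_demo, aes(x = pet2, y = runtime)) + geom_boxplot()
box_gg
```

The thick line inside each box is the median, and the box itself spans the first to the third quartile.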
Boomers:
Zoomers:
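A sketch of a scatterplot of two continuous variables, using the calls from the summary list. The values are the first ten run times and travel times shown in the str() output above.

```r
library(ggplot2)

# First ten observations from the survey output above
scatter_demo <- data.frame(
  runtime    = c(90, 186, 120, 150, 95, 99, 195, 218, 128, 96),
  traveltime = c(68, 60, 45, 55, 95, 61, 45, 10, 85, 75)
)

# Base R: x first, then y
plot(scatter_demo$traveltime, scatter_demo$runtime)

# ggplot2: map the variables to x and y, then add points
scatter_gg <- ggplot(scatter_demo, aes(x = traveltime, y = runtime)) + geom_point()
scatter_gg
```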
See help here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
Millennials:
Zoomers:
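For two categorical variables, the summary list names a filled bar plot in ggplot2. A sketch follows, with a plain table() as the numeric counterpart; the `xtab_demo` data are made up, and the sjPlot version is left out.

```r
library(ggplot2)

# Toy data: two categorical variables
xtab_demo <- data.frame(
  favpet     = c("Cat", "Cat", "Dog", "Dog", "Cat", "Fish"),
  mothereduc = c("Yes", "No", "Yes", "Yes", "Yes", "No")
)

# The counts behind the plot
xtab <- table(xtab_demo$favpet, xtab_demo$mothereduc)
xtab

# One bar per pet, filled by mother's education
stacked_gg <- ggplot(xtab_demo, aes(favpet)) + geom_bar(aes(fill = mothereduc))
stacked_gg
```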
str(), glimpse(), describeBy(), group_by() %>% summarise()
select(data, var)
subset(data, condition), filter(data, condition)
rename(data, new_name = old_name), if_else(data$var == 0, "value if true", "value if false")
barplot(table(data$var)), ggplot(data, aes(var)) + geom_bar()
hist(data$var), ggplot(data, aes(var)) + geom_histogram()
boxplot(data$cont ~ data$group), ggplot(data, aes(x = group, y = cont)) + geom_boxplot()
plot(data$var, data$var2), ggplot(data, aes(x = var, y = var2)) + geom_point()
sjp.xtab(data$var, data$var2, ...), ggplot(data, aes(var)) + geom_bar(aes(fill = var2))
dim(data), table(data$var, data$var_new)
*These are just some of the working solutions.