Introduction and Basic Data Manipulation

Olesya Volchenko, Anna Shirokanova

January 13, 2021

What’s going on?

Stages of R

Structure of R

Installing R

Installing RStudio

Where To Start

Creating an Object

Assignment

x <- 5
x
## [1] 5
x <- x + 5

Functions

In R, a function call takes the form: function.name(x, y, z)
where x, y and z are the arguments of the function.
For example, the function sum():

sum(10, 15) # sum() adds up all the values passed to it
## [1] 25
x1 <- sum(10, 15)
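As a small illustration of how arguments are passed (a sketch with made-up variable names): arguments can be matched by name or by position, and sum() simply adds up every value it receives.

```r
# round() has two named arguments: x and digits.
rounded_by_name <- round(x = 3.14159, digits = 2)
rounded_by_position <- round(3.14159, 2)  # same call, matched by position

# sum() accepts any number of values.
total <- sum(10, 15, 20)

print(rounded_by_name)
print(total)
```

Naming the arguments makes a call easier to read when a function has many of them.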

R as a calculator

2 + 2
## [1] 4
2 * 3
## [1] 6
2^3
## [1] 8
(100-25)/3
## [1] 25

R as a calculator

exp(1)
## [1] 2.718282
log(10)
## [1] 2.302585
sqrt(2)
## [1] 1.414214
prod(2,3)
## [1] 6
abs(-10)
## [1] 10

Data types in R

Vector

a <- c(1, 2, 3)
a
## [1] 1 2 3
b <- rep(1, 10)
b
##  [1] 1 1 1 1 1 1 1 1 1 1
c <- seq(1, 5, 1)
c
## [1] 1 2 3 4 5
d <- 1:5
d
## [1] 1 2 3 4 5

Data classes

x <- c(1, 5, 9.5)
y <- c("good", "satisf", "bad")
l <- c(TRUE, FALSE, FALSE) # TRUE/FALSE are safer than T/F, which can be overwritten
class(x)
## [1] "numeric"
class(y)
## [1] "character"
class(l)
## [1] "logical"

Factors

f <- factor(c("yes","no","yes","maybe","maybe","no","maybe","no","no"))
f
## [1] yes   no    yes   maybe maybe no    maybe no    no   
## Levels: maybe no yes
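By default the levels are sorted alphabetically, as the output above shows. If a different order is needed, the levels argument of factor() can set it explicitly. A minimal sketch (the chosen order is only an example):

```r
# Same answers as above, but with an explicit order of levels.
f <- factor(c("yes", "no", "yes", "maybe", "maybe", "no", "maybe", "no", "no"),
            levels = c("no", "maybe", "yes"))

levels(f)   # the categories, in the order we set
table(f)    # how many times each category occurs
```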

Data frame

name <- c("Masha", "Vasya", "Anya", "Petya", "Vanya")
age <- c(18, 17, 19, 21, 20)
weight <- c(45, 80, 69, 92, 60)
height <- c(1.62, 1.75, 1.82, 1.92, 1.70)
gender <- c("F", "M", "F", "M", "M")
course <- c(1, 1, 2, 3, 4)
students <- data.frame(name, age, weight, height, gender, course)
students
##    name age weight height gender course
## 1 Masha  18     45   1.62      F      1
## 2 Vasya  17     80   1.75      M      1
## 3  Anya  19     69   1.82      F      2
## 4 Petya  21     92   1.92      M      3
## 5 Vanya  20     60   1.70      M      4

Taking a look at a part of an object

Let’s extract the “gender” variable from the “students” data frame:

students$gender
## [1] "F" "M" "F" "M" "M"
students[1, 1] # the value in the 1st row and 1st column
## [1] "Masha"
students[1, ] #the 1st row 
##    name age weight height gender course
## 1 Masha  18     45   1.62      F      1
students[ , 1] #the 1st column
## [1] "Masha" "Vasya" "Anya"  "Petya" "Vanya"

head() shows the first 6 rows of your dataset (by default)

tail() shows the last 6 rows of your dataset (by default)
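Both functions take an n argument if a different number of rows is wanted. A short sketch on a made-up data frame (students_demo is hypothetical and only serves the example):

```r
# A toy data frame with 10 rows.
students_demo <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

first_three <- head(students_demo, n = 3)  # rows 1 to 3
last_two <- tail(students_demo, n = 2)     # rows 9 and 10

print(first_three)
print(last_two)
```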

Working directory, environment

wd (working directory) - the folder on your computer where R looks for files and saves them by default

getwd() # take a look at existing/default wd
## [1] "C:/Users/lssi7/Downloads/Data_Analysis_in_Sociology"
#setwd() # set a new wd
#dir()
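A minimal sketch of how these three functions fit together. It uses the session's temporary folder (tempdir()) as a stand-in for a real project folder, which is an assumption made only so that the example runs anywhere:

```r
old_wd <- getwd()      # remember the current working directory
setwd(tempdir())       # switch to another folder (here: a temporary one)
files_here <- dir()    # list the files in the new working directory
setwd(old_wd)          # switch back to where we started

print(files_here)
```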

Avoid Cyrillic characters in the path to your working directory

But I want to work with real data!

Use these functions to load external data files:
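One possibility, shown as a sketch: read.csv() from base R reads comma-separated files, which is the function used on the following slides. Other formats usually need an extra package, for example read_excel() from readxl for Excel files or read_sav() from haven for SPSS files. The example below writes a tiny made-up CSV file to a temporary location and reads it back, so it does not depend on any real file:

```r
# Create a small CSV file in a temporary location (made-up data).
csv_path <- tempfile(fileext = ".csv")
demo <- data.frame(group = c("A", "B"), minutes = c(45, 60))
write.csv(demo, csv_path, row.names = FALSE)

# Load it back into R and inspect its structure.
loaded <- read.csv(csv_path)
str(loaded)
```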

Next Slides Will Show How To:

  1. Load the data
  2. Select relevant variables
  3. Name the variables the way you like
  4. Create a subset based on criteria
  5. Recode variables when you need this
  6. Summarise the dataset as a whole and by groups
  7. Create suitable graphs to summarise one or two variables

Load the data

Get the data before loading!

Fill in the questionnaire at the link: bit.do/survey2K20

data <- read.csv("test survey1.csv")
head(data)
##                     Timestamp  Your.group
## 1 2020/01/26 8:44:48 pm GMT+3 test option
## 2 2020/01/26 8:45:14 pm GMT+3 test option
## 3 2020/01/26 8:46:43 pm GMT+3 test option
## 4 2020/01/26 8:47:12 pm GMT+3 test option
## 5 2020/01/26 8:47:35 pm GMT+3 test option
## 6 2020/01/26 8:48:00 pm GMT+3 test option
##   Did.your.mother.graduate.from.a.university.
## 1                                          No
## 2                                         Yes
## 3                                         Yes
## 4                                         Yes
## 5                                          No
## 6                                          No
##   Did.your.father.graduate.from.a.university.
## 1                                         Yes
## 2                                         Yes
## 3                                         Yes
## 4                                          No
## 5                                         Yes
## 6                                         Yes
##   Which.one.of.those.four.pets.do.you.favour.more.
## 1                                             Fish
## 2                                              Cat
## 3                                          Hamster
## 4                                          Hamster
## 5                                              Cat
## 6                                              Dog
##   What.is.the.run.time.of.your.favourite.film..in.minutes.
## 1                                                       90
## 2                                                      186
## 3                                                      120
## 4                                                      150
## 5                                                       95
## 6                                                       99
##   How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.
## 1                                                                          68
## 2                                                                          60
## 3                                                                          45
## 4                                                                          55
## 5                                                                          95
## 6                                                                          61
##   What.is.the.colour.of.your.eyes.
## 1                           green?
## 2                             blue
## 3                             gray
## 4                             grey
## 5                             Grey
## 6                            hazel
colnames(data)
## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."
str(data)
## 'data.frame':    63 obs. of  8 variables:
##  $ Timestamp                                                                  : chr  "2020/01/26 8:44:48 pm GMT+3" "2020/01/26 8:45:14 pm GMT+3" "2020/01/26 8:46:43 pm GMT+3" "2020/01/26 8:47:12 pm GMT+3" ...
##  $ Your.group                                                                 : chr  "test option" "test option" "test option" "test option" ...
##  $ Did.your.mother.graduate.from.a.university.                                : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Did.your.father.graduate.from.a.university.                                : chr  "Yes" "Yes" "Yes" "No" ...
##  $ Which.one.of.those.four.pets.do.you.favour.more.                           : chr  "Fish" "Cat" "Hamster" "Hamster" ...
##  $ What.is.the.run.time.of.your.favourite.film..in.minutes.                   : int  90 186 120 150 95 99 195 218 128 96 ...
##  $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.: int  68 60 45 55 95 61 45 10 85 75 ...
##  $ What.is.the.colour.of.your.eyes.                                           : chr  "green?" "blue" "gray" "grey" ...
library(tidyverse)
glimpse(data)
## Rows: 63
## Columns: 8
## $ Timestamp                                                                   <chr> ...
## $ Your.group                                                                  <chr> ...
## $ Did.your.mother.graduate.from.a.university.                                 <chr> ...
## $ Did.your.father.graduate.from.a.university.                                 <chr> ...
## $ Which.one.of.those.four.pets.do.you.favour.more.                            <chr> ...
## $ What.is.the.run.time.of.your.favourite.film..in.minutes.                    <int> ...
## $ How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes. <int> ...
## $ What.is.the.colour.of.your.eyes.                                            <chr> ...

Select relevant variables

colnames(data) # get variable names
## [1] "Timestamp"                                                                  
## [2] "Your.group"                                                                 
## [3] "Did.your.mother.graduate.from.a.university."                                
## [4] "Did.your.father.graduate.from.a.university."                                
## [5] "Which.one.of.those.four.pets.do.you.favour.more."                           
## [6] "What.is.the.run.time.of.your.favourite.film..in.minutes."                   
## [7] "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes."
## [8] "What.is.the.colour.of.your.eyes."

Boomers do:

data1 <- data[c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes.")]
dim(data1) #7 variables
## [1] 63  7

Millennials do:

data2 <- subset(data, 
                select = c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data2) #7 variables, the same result
## [1] 63  7

Zoomers do:

#library(dplyr)
data3 <- select(data, c("Your.group", 
              "Did.your.mother.graduate.from.a.university.",
              "Did.your.father.graduate.from.a.university.",
              "Which.one.of.those.four.pets.do.you.favour.more.",
              "What.is.the.run.time.of.your.favourite.film..in.minutes.",
              "How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.",
              "What.is.the.colour.of.your.eyes."))
dim(data3) #7 variables, the same result
## [1] 63  7

Other ways work too, such as dropping a column by its position:

data4 <- data[,-1]
dim(data4)
## [1] 63  7

Name the variables the way you like

data1 <- rename(data1, 
                 #new name = old name,
              studygroup = Your.group,
              mothereduc = Did.your.mother.graduate.from.a.university.,
              fathereduc = Did.your.father.graduate.from.a.university.,
              favpet = Which.one.of.those.four.pets.do.you.favour.more.,
               runtime = What.is.the.run.time.of.your.favourite.film..in.minutes., 
               traveltime = How.much.time.does.it.take.you.on.average.to.get.to.university..in.minutes.,
               eyecolor = What.is.the.colour.of.your.eyes.)

Check yourself: If you run the code above twice, it won’t work the second time. Why?

After changing the variable names, always check the result:

colnames(data1)
## [1] "studygroup" "mothereduc" "fathereduc" "favpet"     "runtime"   
## [6] "traveltime" "eyecolor"

Now you have shorter variable names and know how to look them up.

Some important rules of naming:

A useful link to a coding style guide for R: http://adv-r.had.co.nz/Style.html

Tidyverse Coding Guide: https://style.tidyverse.org/syntax.html#object-names

Create a subset based on criteria

Select the answers from one group (studygroup) only

table(data1$studygroup)
## 
##      BSC181      BSC182      BSC183 test option 
##          16          17          23           7

Boomers do:

data1_1 <- data1[data1$studygroup == "BSC181", ]
dim(data1_1)
## [1] 16  7

Millennials do:

data1_2 <- subset(data1, studygroup == "BSC181")
dim(data1_2)
## [1] 16  7

Zoomers do:

data1_3 <- filter(data1, studygroup == "BSC181") # inside filter(), columns are referred to by name
dim(data1_3)
## [1] 16  7

Now let’s filter by two conditions at once: members of one group who spend less than an hour travelling:

data1_4 <- filter(data1, studygroup == "BSC181" & traveltime < 60)
head(data1_4)
##   studygroup mothereduc fathereduc favpet runtime traveltime eyecolor
## 1     BSC181        Yes         No    Dog      80         50    Green
## 2     BSC181         No         No    Cat      90         50    Green
## 3     BSC181        Yes        Yes   Fish     205         39    brown
## 4     BSC181        Yes        Yes    Cat     188         40    Brown
## 5     BSC181        Yes        Yes   Fish      96         18    Green
## 6     BSC181        Yes        Yes    Cat     113         20   Brown
dim(data1_4)
## [1] 8 7

Alternatively, you can keep the rows that meet at least one of several conditions:

data1_5 <- filter(data1, studygroup == "BSC181" | traveltime < 60)
head(data1_5)
##    studygroup mothereduc fathereduc  favpet runtime traveltime eyecolor
## 1 test option        Yes        Yes Hamster     120         45     gray
## 2 test option        Yes         No Hamster     150         55     grey
## 3 test option         No         No Hamster     195         45     blue
## 4      BSC183         No        Yes     Dog     218         10     Blue
## 5      BSC183        Yes         No     Dog     155         30    green
## 6      BSC183        Yes        Yes     Cat     106         28     blue
dim(data1_5)
## [1] 34  7
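The difference between & ("and") and | ("or") is easiest to see on a tiny example. The two vectors below are made up for illustration; each condition produces a vector of TRUE/FALSE values, one per row:

```r
# Four hypothetical respondents.
group <- c("BSC181", "BSC181", "BSC182", "BSC183")
travel <- c(50, 75, 40, 90)

both <- group == "BSC181" & travel < 60    # TRUE only if both conditions hold
either <- group == "BSC181" | travel < 60  # TRUE if at least one condition holds

print(both)
print(either)
sum(both)    # number of rows kept with "and"
sum(either)  # number of rows kept with "or"
```

This is why the "or" filter above returns more rows than the "and" filter.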

Recap: we are here

  1. Load the data - DONE
  2. Select relevant variables - DONE
  3. Name the variables the way you like - DONE
  4. Create a subset based on criteria - DONE
  5. Recode variables when you need this
  6. Summarise the dataset as a whole and by groups
  7. Create suitable graphs to summarise one or two variables

Recode variables when you need this

(Go back to data1 for this.)

Let’s create a new variable which says “Yes” if at least one of the two parents had higher education and “No” if neither did.

library(dplyr)
data1$eduhi <- if_else(data1$mothereduc == "Yes" | data1$fathereduc == "Yes", 
                      "Yes", 
                      "No")

Always check the results of recoding:

table(data1$eduhi)
## 
##  No Yes 
##   7  56
table(data1$eduhi, data1$mothereduc)
##      
##       No Yes
##   No   7   0
##   Yes 10  46
table(data1$eduhi, data1$fathereduc)
##      
##       No Yes
##   No   7   0
##   Yes  7  49

Now let’s create a categorical variable for average travel time to university: “less than an hour” for trips below 60 minutes, and “an hour or more” for the rest.

data1$time2[data1$traveltime < 60] <- "less than an hour"
data1$time2[data1$traveltime >= 60] <- "an hour or more"
table(data1$time2)
## 
##   an hour or more less than an hour 
##                37                26
table(data1$traveltime, data1$time2)
##      
##       an hour or more less than an hour
##   10                0                 1
##   15                0                 1
##   18                0                 1
##   20                0                 1
##   25                0                 2
##   28                0                 1
##   30                0                 2
##   35                0                 1
##   39                0                 1
##   40                0                 4
##   45                0                 3
##   50                0                 5
##   55                0                 3
##   60                5                 0
##   61                1                 0
##   65                4                 0
##   68                1                 0
##   70                5                 0
##   75                6                 0
##   80                8                 0
##   81                1                 0
##   85                1                 0
##   90                3                 0
##   95                1                 0
##   180               1                 0
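The assignments above store time2 as a character vector. If a true factor is wanted, one alternative (a sketch, not the approach used on the slide) is cut(), which splits a numeric variable into intervals. With right = FALSE the intervals are closed on the left, so exactly 60 minutes falls into “an hour or more”. The vector traveltime_demo is made up for the example:

```r
traveltime_demo <- c(10, 59, 60, 180)

time_factor <- cut(traveltime_demo,
                   breaks = c(-Inf, 60, Inf),  # below 60 / 60 and above
                   labels = c("less than an hour", "an hour or more"),
                   right = FALSE)              # intervals are [a, b)

table(time_factor)
```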

Let’s collapse several categories into fewer ones.

(We create a new variable so as not to overwrite the old data.)

table(data1$favpet)
## 
##     Cat     Dog    Fish Hamster 
##      32      23       4       4
data1$pet2[data1$favpet == "Cat" | 
              data1$favpet == "Dog" ] <- "big_pet"
data1$pet2[data1$favpet == "Fish" | 
              data1$favpet == "Hamster"] <- "small_pet"
table(data1$favpet, data1$pet2)
##          
##           big_pet small_pet
##   Cat          32         0
##   Dog          23         0
##   Fish          0         4
##   Hamster       0         4
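When many categories have to be merged, the %in% operator can be shorter than a chain of | comparisons. A sketch on a made-up vector (favpet_demo is hypothetical); note that, unlike the code above, every value not listed, including a missing one, ends up in "small_pet", so the cross-table check is still worth running:

```r
favpet_demo <- c("Cat", "Dog", "Fish", "Hamster", "Cat")

# TRUE where the value is one of the listed categories.
pet_size <- ifelse(favpet_demo %in% c("Cat", "Dog"), "big_pet", "small_pet")

table(favpet_demo, pet_size)
```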

Summarise the dataset as a whole and by groups

Boomers do:

summary(data1)
##   studygroup         mothereduc         fathereduc           favpet         
##  Length:63          Length:63          Length:63          Length:63         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     runtime        traveltime       eyecolor            eduhi          
##  Min.   : 70.0   Min.   : 10.00   Length:63          Length:63         
##  1st Qu.: 99.5   1st Qu.: 45.00   Class :character   Class :character  
##  Median :120.0   Median : 61.00   Mode  :character   Mode  :character  
##  Mean   :126.1   Mean   : 60.71                                        
##  3rd Qu.:146.0   3rd Qu.: 75.00                                        
##  Max.   :218.0   Max.   :180.00                                        
##     time2               pet2          
##  Length:63          Length:63         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Millennials do:

library(psych)
describe(data1)
##             vars  n   mean    sd median trimmed   mad min max range  skew
## studygroup*    1 63   2.33  0.98      2    2.29  1.48   1   4     3  0.01
## mothereduc*    2 63   1.73  0.45      2    1.78  0.00   1   2     1 -1.01
## fathereduc*    3 63   1.78  0.42      2    1.84  0.00   1   2     1 -1.30
## favpet*        4 63   1.68  0.86      1    1.53  0.00   1   4     3  1.24
## runtime        5 63 126.06 34.87    120  122.69 34.10  70 218   148  0.84
## traveltime     6 63  60.71 25.92     61   60.43 23.72  10 180   170  1.17
## eyecolor*      7 63   9.38  6.15      8    8.94  5.93   1  24    23  0.52
## eduhi*         8 63   1.89  0.32      2    1.98  0.00   1   2     1 -2.42
## time2*         9 63   1.41  0.50      1    1.39  0.00   1   2     1  0.35
## pet2*         10 63   1.13  0.34      1    1.04  0.00   1   2     1  2.19
##             kurtosis   se
## studygroup*    -1.15 0.12
## mothereduc*    -0.99 0.06
## fathereduc*    -0.30 0.05
## favpet*         0.94 0.11
## runtime        -0.10 4.39
## traveltime      5.19 3.27
## eyecolor*      -0.59 0.77
## eduhi*          3.90 0.04
## time2*         -1.91 0.06
## pet2*           2.83 0.04
describeBy(data1, data1$pet2) # describe the data by groups
## 
##  Descriptive statistics by group 
## group: big_pet
##             vars  n   mean    sd median trimmed   mad min max range  skew
## studygroup*    1 55   2.27  0.89      2    2.27  1.48   1   4     3 -0.08
## mothereduc*    2 55   1.75  0.44      2    1.80  0.00   1   2     1 -1.10
## fathereduc*    3 55   1.80  0.40      2    1.87  0.00   1   2     1 -1.46
## favpet*        4 55   1.42  0.50      1    1.40  0.00   1   2     1  0.32
## runtime        5 55 124.45 33.71    120  121.47 31.13  70 218   148  0.85
## traveltime     6 55  61.71 26.51     65   61.20 22.24  10 180   170  1.22
## eyecolor*      7 55   8.78  5.57      8    8.40  4.45   1  22    21  0.50
## eduhi*         8 55   1.91  0.29      2    2.00  0.00   1   2     1 -2.77
## time2*         9 55   1.38  0.49      1    1.36  0.00   1   2     1  0.47
## pet2*         10 55   1.00  0.00      1    1.00  0.00   1   1     0   NaN
##             kurtosis   se
## studygroup*    -1.07 0.12
## mothereduc*    -0.81 0.06
## fathereduc*     0.13 0.05
## favpet*        -1.93 0.07
## runtime         0.04 4.55
## traveltime      5.24 3.57
## eyecolor*      -0.51 0.75
## eduhi*          5.77 0.04
## time2*         -1.81 0.07
## pet2*            NaN 0.00
## ------------------------------------------------------------ 
## group: small_pet
##             vars n   mean    sd median trimmed   mad min max range  skew
## studygroup*    1 8   2.12  0.99    2.5    2.12  0.74   1   3     2 -0.20
## mothereduc*    2 8   1.62  0.52    2.0    1.62  0.00   1   2     1 -0.42
## fathereduc*    3 8   1.62  0.52    2.0    1.62  0.00   1   2     1 -0.42
## favpet*        4 8   1.50  0.53    1.5    1.50  0.74   1   2     1  0.00
## runtime        5 8 137.12 42.90  120.5  137.12 40.03  90 205   115  0.51
## traveltime     6 8  53.88 21.66   50.0   53.88 21.50  18  81    63 -0.13
## eyecolor*      7 8   3.75  2.12    3.5    3.75  2.22   1   7     6  0.21
## eduhi*         8 8   1.75  0.46    2.0    1.75  0.00   1   2     1 -0.95
## time2*         9 8   1.62  0.52    2.0    1.62  0.00   1   2     1 -0.42
## pet2*         10 8   1.00  0.00    1.0    1.00  0.00   1   1     0   NaN
##             kurtosis    se
## studygroup*    -2.07  0.35
## mothereduc*    -2.03  0.18
## fathereduc*    -2.03  0.18
## favpet*        -2.23  0.19
## runtime        -1.50 15.17
## traveltime     -1.43  7.66
## eyecolor*      -1.67  0.75
## eduhi*         -1.21  0.16
## time2*         -2.03  0.18
## pet2*            NaN  0.00

Zoomers do:

library(magrittr)
data1 %>%
  group_by(pet2) %>%
  summarise(avg_runtime = mean(runtime),
            mdn_runtime = median(runtime),
            n = n())
## # A tibble: 2 x 4
##   pet2      avg_runtime mdn_runtime     n
##   <chr>           <dbl>       <dbl> <int>
## 1 big_pet          124.        120     55
## 2 small_pet        137.        120.     8

Millennials have been quicker here.

Help page here: https://www.earthdatascience.org/workshops/clean-coding-tidyverse-intro/summarise-data-in-R-tidyverse/
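For comparison, grouped summaries can also be computed in base R without extra packages. A minimal sketch on made-up data (the data frame demo is hypothetical): tapply() applies a function to one variable within each group, and aggregate() returns the result as a data frame.

```r
demo <- data.frame(
  pet2 = c("big_pet", "big_pet", "small_pet", "small_pet", "big_pet"),
  runtime = c(100, 120, 90, 150, 170)
)

# Mean runtime within each group, returned as a named vector.
avg_runtime <- tapply(demo$runtime, demo$pet2, mean)

# Median runtime within each group, returned as a data frame.
mdn_runtime <- aggregate(runtime ~ pet2, data = demo, FUN = median)

print(avg_runtime)
print(mdn_runtime)
```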

Create suitable graphs to summarise one or two variables

For univariate distributions:

A CATEGORICAL VARIABLE needs a bar plot (there is space between bars)

Boomers do:

barplot(table(data1$favpet))

Zoomers do:

library(ggplot2)
ggplot(data1, aes(x = favpet)) +
  geom_bar()

A CONTINUOUS VARIABLE needs a histogram

Boomers do:

hist(data1$traveltime)

Zoomers do:

ggplot(data1, aes(x = traveltime)) +
  geom_histogram()

For bivariate distributions:

How to read a boxplot:
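In R's default boxplot, the thick line is the median, the box runs from the lower to the upper hinge (roughly the 1st and 3rd quartiles), the whiskers reach the most extreme values that lie within 1.5 box lengths of the box, and anything further away is drawn as a separate point. The function boxplot.stats() returns exactly these numbers, so it can be used to check a reading of the plot. A sketch on a made-up vector:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values drawn as individual points (outliers)
```

Here the value 100 lies far beyond the upper whisker, so it is reported in bs$out rather than stretching the whisker.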

Boomers:

boxplot(data1$runtime ~ data1$eduhi)

Zoomers:

ggplot(data1, aes(x = eduhi, y = runtime)) +
  geom_boxplot()

Boomers:

plot(data1$runtime, data1$traveltime)

Zoomers:

ggplot(data1, aes(x = runtime, y = traveltime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # straight regression line, no confidence band

See help here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/

Millennials:

library(sjPlot)
plot_xtab(data1$eduhi, data1$pet2,
         margin = "row",
         bar.pos = "stack")

Zoomers:

ggplot(data1, aes(eduhi, fill = pet2)) +
  geom_bar(position="fill")

ggplot(data1, aes(x = eduhi, fill = pet2)) +
  geom_bar(position="stack")

Help: https://ggplot2.tidyverse.org/reference/geom_bar.html

Useful functions (overview)*

*These are just some of the working solutions

Remember: If cleaning the data gets hard at times, it’s okay. Keep learning and practising!