Sally Chen
8/25/2018
Yijun (Sally) Chen
Undergraduate at Fudan University
4th-year Marketing PhD student at Olin
Data-driven research in social science
Social network, Peer effect
Teaching assistant for Olin courses
Runner, Barca supporter, mom of a cat
Welcome to Wash U!
If you have not installed R and RStudio, use the RStudio Cloud version for today’s session
Let’s type the following code directly in the console and press Enter to see the result
Assign value to object
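The code for this slide did not survive the export; only its printed output remains. A minimal sketch of assignment and type checking, with hypothetical object names `a`, `b`, and `d`, might look like this:

```r
# Hypothetical object names; the original code for this slide is not preserved.
a <- 100        # a numeric value
b <- "hello"    # a character value
d <- TRUE       # a logical value

class(a)  # "numeric"
class(b)  # "character"
class(d)  # "logical"
a         # typing the object name prints its value: 100
```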
## [1] "numeric"
## [1] "character"
## [1] "logical"
## [1] 100
## [1] 0.02
## Error: <text>:1:2: unexpected symbol
## 1: 1x
## ^
## [1] 0.02
## [1] 5.02
## [1] 0.02
## [1] 5.02
## [1] "character"
## Error in "TRUE" + 1: non-numeric argument to binary operator
## [1] "logical"
## [1] 2
## Error in 2 + "2": non-numeric argument to binary operator
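A sketch of the distinction behind these errors, assuming the original compared quoted text with true logical and numeric values:

```r
class("TRUE")   # "character": the quotes make it text
class(TRUE)     # "logical"
TRUE + 1        # 2, because TRUE counts as 1 in arithmetic

# Arithmetic on text fails. tryCatch() is used here only to capture the
# error message instead of stopping the script.
msg <- tryCatch("TRUE" + 1, error = function(e) conditionMessage(e))
msg
```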
## [1] 0 0 0
## [1] 0 0 0
## [1] "" "" "" "" ""
## [1] "" "" "" "" ""
## [1] "1" "hello" "TRUE"
## [1] 1 2 3 4
## [1] 4
## [1] "numeric"
## [1] "1" "2" "hello" "R"
## [1] 4
## [1] "character"
## [1] 1 2 3 4
## [1] 3 4 5 6 7
## [1] 1 3 5 7 9
## [1] 3 3 3 3
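Calls that would produce the four vectors above, as a sketch (the exact calls used in the session are an assumption):

```r
v1 <- 1:4                  # 1 2 3 4
v2 <- 3:7                  # 3 4 5 6 7
v3 <- seq(1, 9, by = 2)    # 1 3 5 7 9
v4 <- rep(3, times = 4)    # 3 3 3 3
```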
## [1] 1 2 3 4 5
## [1] "numeric"
## [1] 5
## [1] 1.581139
## [1] 1
## [1] 5
## [1] 2 3 4 5 6
## [1] 2.718282 7.389056 20.085537 54.598150 148.413159
## [1] 11
## [1] 1
## [1] 11
## [1] 11 2 3 4 5
## [1] 11
## [1] 11 2 3
## [1] 11 3
## [1] 11 2 4 5
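A sketch of vector indexing consistent with the output above; the name `x` is an assumption:

```r
x <- c(1, 2, 3, 4, 5)
x[1] <- 11       # replace the first element
x[1]             # 11
x[1:3]           # 11 2 3
x[c(1, 3)]       # 11 3
x[-3]            # 11 2 4 5 (a negative index drops that position)
```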
## [1] "character"
## Warning in mean.default(z): argument is not numeric or logical: returning
## NA
## [1] NA
## [1] FALSE
## [1] 1 2 3
## [1] 2
## [1] "1" "2" "3"
## [1] NA
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [1] TRUE
## [1] 12.375
## [1] 10
## [1] 12
## [1] 4 6 8 10 12 14 16 18 20 22
## [1] 13
## [1] 4
## [1] 22
## [1] 6.055301
## [1] "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
R uses function() to create a function
Structure of an R function
myfunction <- function(arg1, arg2) {
  statements
  return(object)
}
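A concrete instance of this template, as a small sketch. `add_two` and its default value are hypothetical, chosen only for illustration:

```r
# Hypothetical example function following the template above.
add_two <- function(arg1, arg2 = 1) {
  result <- arg1 + arg2   # the statements
  return(result)          # the object handed back to the caller
}

add_two(2, 3)   # 5
add_two(2)      # 3, because arg2 falls back to its default of 1
```

An argument without a default must be supplied, which is why a call such as `rnorm()` with no `n` fails.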
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
## [1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0
## [1] 1.1 1.4 1.7 2.0 2.3 2.6 2.9 3.2 3.5
## [1] 1.1 1.4 1.7 2.0 2.3 2.6 2.9 3.2 3.5
## [1] 1.1 1.4 1.7 2.0 2.3 2.6 2.9 3.2 3.5
## Error in rnorm(): argument "n" is missing, with no default
## [1] 0.03828554
## [1] 0.9784108
## [1] 2.841117
## [1] 5.028204
## [1] 0.002355184
## [1] 0.9977875
## [1] 0.4973576
## [1] -4.996475
## [1] 4.96678
## [1] -0.01792917
Let’s take a short break of 10 minutes
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
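The two matrices above differ only in fill order. A sketch that reproduces them, with `a` and `b` as assumed names:

```r
a <- matrix(1:6, nrow = 3, ncol = 2)                 # filled column by column (default)
b <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)   # filled row by row

a[1, ]   # 1 4
b[1, ]   # 1 2
dim(a)   # 3 2
```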
## a1 a2
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## a1 a2
## [1,] 1 4
## [2,] 2 5
## [3,] 3 4
## [4,] 4 5
## [5,] 5 4
## [6,] 6 5
## [,1] [,2] [,3]
## a1 1 2 3
## a2 4 5 6
## [,1] [,2]
## [1,] "1" "4"
## [2,] "2" "5"
## [3,] "3" "hello"
## [1] 3
## [1] 2
## [1] 3 2
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 1
## [1] 1 4
## [1] 4 5 6
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] 2 5
## [2,] 3 6
## [3,] 4 7
## [,1] [,2]
## [1,] 2 8
## [2,] 4 10
## [3,] 6 12
## [,1] [,2]
## [1,] 2 8
## [2,] 4 10
## [3,] 6 12
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## Error in a + b: non-conformable arrays
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2]
## [1,] 1 16
## [2,] 4 25
## [3,] 9 36
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 3 2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [1] 2 3
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [,1] [,2] [,3]
## [1,] 17 22 27
## [2,] 22 29 36
## [3,] 27 36 45
## [,1] [,2]
## [1,] 14 32
## [2,] 32 77
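A sketch of the transpose and matrix product that give the results above, assuming `a` is the 3 x 2 matrix holding 1 to 6:

```r
a <- matrix(1:6, nrow = 3)

dim(t(a))            # 2 3: t() swaps rows and columns
p1 <- a %*% t(a)     # (3 x 2) times (2 x 3) gives a 3 x 3 matrix
p2 <- t(a) %*% a     # (2 x 3) times (3 x 2) gives a 2 x 2 matrix
a * a                # element-wise product, not a matrix product
```

`%*%` requires the number of columns on the left to equal the number of rows on the right, so `a %*% a` is an error for a 3 x 2 matrix.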
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## Error in a + t(a): non-conformable arrays
## Error in a %*% a: non-conformable arguments
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 21
## [1] 3.5
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 5 7 9
## [1] 2 5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 6 10 14 18
## [2,] 22 26 30 34 38
## [3,] 42 46 50 54 58
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 14 24 34 44
## [2,] 26 36 46 56 66
## [3,] 48 58 68 78 88
## [,1] [,2] [,3]
## [1,] 940 2340 3740
## [2,] 1040 2640 4240
## [3,] 1140 2940 4740
See you back at 1:30pm
It’s R time again… Any questions?
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
## mpg cyl disp hp drat wt qsec vs am gear carb
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
## Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
## [1] 11
## [1] 32
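A sketch of the inspection calls that produce output like the above on the built-in `mtcars` data set (the exact calls used in the session are an assumption):

```r
head(mtcars, 2)     # first two rows
tail(mtcars, 2)     # last two rows
colnames(mtcars)    # variable names
rownames(mtcars)    # car names
ncol(mtcars)        # 11
nrow(mtcars)        # 32
```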
name = c("Messi", "Ronaldo", "Neymar") # a character vector
age = c(31, 33, 26) # a numeric vector
golden_ball = c(TRUE, TRUE, FALSE)
players = data.frame(name, age, golden_ball)
head(players)
## name age golden_ball
## 1 Messi 31 TRUE
## 2 Ronaldo 33 TRUE
## 3 Neymar 26 FALSE
## [1] 3
## [1] 2
## Error in data.frame(name, golden_ball): arguments imply differing number of rows: 3, 2
golden_ball = c(TRUE, TRUE, NA) # use NA to indicate missing value
data.frame(name, age, golden_ball)
## name age golden_ball
## 1 Messi 31 TRUE
## 2 Ronaldo 33 TRUE
## 3 Neymar 26 NA
## [1] "matrix"
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [1] "data.frame"
## V1 V2 V3
## 1 1 3 5
## 2 2 4 6
Rows and columns of a data.frame can be accessed with the same [row, column] operator used for matrix objects
The 1st row
## name age golden_ball
## 1 Messi 31 TRUE
## [1] 31 33 26
## name age golden_ball
## 1 Messi 31 TRUE
## 2 Ronaldo 33 TRUE
## name age
## 1 Messi 31
## 2 Ronaldo 33
## 3 Neymar 26
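A sketch of `[row, column]` indexing on the `players` data frame. The data frame is rebuilt here so the block runs on its own:

```r
players <- data.frame(
  name = c("Messi", "Ronaldo", "Neymar"),
  age = c(31, 33, 26),
  golden_ball = c(TRUE, TRUE, FALSE)
)

players[1, ]      # the 1st row, all columns
players[, 2]      # the 2nd column as a vector: 31 33 26
players[1:2, ]    # the first two rows
players[, 1:2]    # the first two columns (name and age)
```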
Particular columns of the data.frame object can be accessed with the $ access operator followed by the column name
The name column of the data.frame players
## [1] Messi Ronaldo Neymar
## Levels: Messi Neymar Ronaldo
## [1] Ronaldo
## Levels: Messi Neymar Ronaldo
## [1] 33 35 28
## [1] 34
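A sketch of the `$` operator on a rebuilt `players` data frame. Only some of the calls behind the output above can be inferred, so this is illustrative rather than a reconstruction:

```r
players <- data.frame(
  name = c("Messi", "Ronaldo", "Neymar"),
  age = c(31, 33, 26)
)

players$name        # the name column
players$name[2]     # its 2nd element: Ronaldo
players$age + 2     # 33 35 28
```

In R 4.0 and later, `data.frame()` keeps strings as character vectors by default, so the `Levels:` lines only appear when the column is a factor.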
## name age golden_ball assists
## 1 Messi 31 TRUE 9
## 2 Ronaldo 33 TRUE 8
## 3 Neymar 26 FALSE 3
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
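Two common ways to add columns, sketched on a rebuilt `players` data frame; which one the session used is an assumption:

```r
players <- data.frame(
  name = c("Messi", "Ronaldo", "Neymar"),
  age = c(31, 33, 26),
  golden_ball = c(TRUE, TRUE, FALSE)
)

players$assists <- c(9, 8, 3)                       # assign to a new column name
players <- cbind(players, goals = c(20, 30, 10))    # or bind a new column on the right
players
```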
new_player = data.frame(name = "Suarez", age = 31, golden_ball = FALSE, goals = 40,
assists = 4) # add a new row to existing data.frame
rbind(players, new_player)
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
## 4 Suarez 31 FALSE 4 40
## [1] 20.09062
## [1] 8
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.42 19.20 20.09 22.80 33.90
| Name | Gender | TenK | PR | Qualified |
|---|---|---|---|---|
| Sally | F | 55 | 52 | FALSE |
| Mike | M | 46 | 44 | TRUE |
| Carol | F | 62 | 58 | FALSE |
| HalfMarathon |
|---|
| 120 |
| 100 |
| 140 |
| Name | Gender | TenK | PR | Qualified | HalfMarathon |
|---|---|---|---|---|---|
| Sage | M | 40 | 42 | TRUE | 81 |
Name = c("Sally", "Mike", "Carol")
Gender = c("F", "M", "F")
TenK = c(55, 46, 62)
PR = c(52, 44, 58)
Qualified = c(FALSE, TRUE, FALSE)
running = data.frame(Name, Gender, TenK, PR, Qualified)
running
## Name Gender TenK PR Qualified
## 1 Sally F 55 52 FALSE
## 2 Mike M 46 44 TRUE
## 3 Carol F 62 58 FALSE
## Name Gender TenK PR Qualified HalfMarathon
## 1 Sally F 55 52 FALSE 120
## 2 Mike M 46 44 TRUE 100
## 3 Carol F 62 58 FALSE 140
newrunner = data.frame(Name = "Sage", Gender = "M", TenK = 40, PR = 42, Qualified = TRUE,
HalfMarathon = 81)
running = rbind(running, newrunner)
running
## Name Gender TenK PR Qualified HalfMarathon
## 1 Sally F 55 52 FALSE 120
## 2 Mike M 46 44 TRUE 100
## 3 Carol F 62 58 FALSE 140
## 4 Sage M 40 42 TRUE 81
R uses lists for complex, hierarchical objects
Use list() to construct a list
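A minimal sketch of building a list from objects of different types. The contents are hypothetical stand-ins for the session's objects:

```r
# Hypothetical contents; any mix of object types can go into one list.
my_list <- list(
  data.frame(id = 1:2, score = c(90, 85)),   # a data frame
  c(1, 2, 3),                                # a numeric vector
  matrix(1:6, nrow = 2)                      # a 2 x 3 matrix
)

class(my_list)    # "list"
length(my_list)   # 3
```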
## [[1]]
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
## 4 Suarez 31 FALSE 4 40
##
## [[2]]
## [1] 1 2 3
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
## 4 Suarez 31 FALSE 4 40
## [1] "data.frame"
## name age golden_ball assists goals
## Messi :1 Min. :26.00 Mode :logical Min. :3.00 Min. :10.0
## Neymar :1 1st Qu.:29.75 FALSE:2 1st Qu.:3.75 1st Qu.:17.5
## Ronaldo:1 Median :31.00 TRUE :2 Median :6.00 Median :25.0
## Suarez :1 Mean :30.25 NA's :0 Mean :6.00 Mean :25.0
## 3rd Qu.:31.50 3rd Qu.:8.25 3rd Qu.:32.5
## Max. :33.00 Max. :9.00 Max. :40.0
## [[1]]
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
## 4 Suarez 31 FALSE 4 40
## [1] "list"
## Warning in mean.default(my_list[1]): argument is not numeric or logical:
## returning NA
## [1] NA
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [[1]]
## name age golden_ball assists goals
## 1 Messi 31 TRUE 9 20
## 2 Ronaldo 33 TRUE 8 30
## 3 Neymar 26 FALSE 3 10
## 4 Suarez 31 FALSE 4 40
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [1] 1 2 3
## [[1]]
## [1] 1 2 3
## [1] 2
## Warning in mean.default(my_list[2]): argument is not numeric or logical:
## returning NA
## [1] NA
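The warnings above come from the difference between single and double brackets. A sketch with a small hypothetical list:

```r
my_list <- list(c(1, 2, 3), matrix(1:6, nrow = 2))

class(my_list[1])      # "list": single brackets return a sub-list
class(my_list[[1]])    # "numeric": double brackets return the element itself
mean(my_list[[1]])     # 2
```

`mean(my_list[1])` would pass a list to `mean()`, which is why it returns `NA` with a warning.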
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [1] 0.5553994 0.5335197 0.6975176 0.2739090 0.2006292
## Error in my_list[[4]]: subscript out of bounds
## [1] 3
## [[1]]
## [1] -0.77500241 1.50246652 1.39566400 0.18635356 -0.24914866
## [6] -1.71086730 -0.10150976 0.63850541 -0.07045173 -1.14813332
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
##
## [[3]]
## Name Gender TenK PR Qualified HalfMarathon
## 1 Sally F 55 52 FALSE 120
## 2 Mike M 46 44 TRUE 100
## 3 Carol F 62 58 FALSE 140
## 4 Sage M 40 42 TRUE 81
## [1] 55
direction = c("North", "West", "North", "East", "South", "West", "North", "South") # create a character vector
direction
## [1] "North" "West" "North" "East" "South" "West" "North" "South"
## [1] "character"
## [1] North West North East South West North South
## Levels: East North South West
## [1] "factor"
## [1] "East" "North" "South" "West"
## factor_direction
## East North South West
## 1 3 2 2
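A sketch of the conversion behind this output; the name `factor_direction` appears in the table header above, while the individual calls are an assumption:

```r
direction <- c("North", "West", "North", "East", "South", "West", "North", "South")
factor_direction <- factor(direction)

class(factor_direction)     # "factor"
levels(factor_direction)    # "East" "North" "South" "West" (sorted alphabetically by default)
counts <- table(factor_direction)   # number of observations per level
counts
```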
seasons = c("Spring", "Fall", "Summer", "Spring", "Fall", "Winter", "Winter")
factor_seasons = factor(seasons, levels = c("Spring", "Summer", "Fall", "Winter"),
ordered = TRUE)
factor_seasons[1] < factor_seasons[2]
## [1] TRUE
## [1] FALSE
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## factor_height
## (58,62.7] (62.7,67.3] (67.3,72]
## 5 5 5
## factor_height
## Low Medium High
## 5 5 5
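A sketch of binning a numeric variable into a factor with `cut()`, assuming the heights come from the built-in `women` data set (its values match the vector printed above):

```r
height <- women$height                        # 58, 59, ..., 72
factor_height <- cut(height, breaks = 3)      # three equal-width intervals
table(factor_height)

# The same bins with readable labels
factor_height <- cut(height, breaks = 3, labels = c("Low", "Medium", "High"))
table(factor_height)
```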
mons = factor(c("March", "April", "January", "November", "January", "September",
"October", "September", "November", "August", "January", "November", "November",
"February", "May", "August", "July", "December", "August", "August", "September",
"November", "February", "April"), levels = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October", "November",
"December"), ordered = TRUE)
table(mons)
## mons
## January February March April May June July
## 3 2 1 2 1 0 1
## August September October November December
## 4 3 1 5 1
## [1] Low Low Low Low Low Low Low Low Low High High High High High
## [15] High
## Levels: Low High
## factor_weight
## Low High
## 9 6
Any Questions?
See you tomorrow at 8:30am