Now we will work with a csv-file. CSV stands for comma separated values, so it is a text file where columns are separated with a comma, like this:
a,b,c
1,5,2
0,4,3
At first, we will load a csv-file via a link. To do this we need the function read.csv()
. And then we put the link into brackets (don’t forget quotes):
dat <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/Chi.kuk.2007.csv")
Of course, in real life we do not have such links and load data from our laptops. So as not to spend more time on discussing working directories and writing paths to files, let’s consider an interative function file.choose()
that will ask to choose a file from a folder:
dat2 <- read.csv(file.choose())
So, it works like many other programs that suggest us to choose a file for working.
Look at our data in a convenient way:
View(dat)
This file contains data on the following research (description is taken from here).
The majority of examples in that presentation are based on Hau 2007. Experiment consisted of a perception and judgment test aimed at measuring the correlation between acoustic cues and perceived sexual orientation. Naïve Cantonese speakers were asked to listen to the Cantonese speech samples collected in Experiment and judge whether the speakers were gay or heterosexual. There are 14 speakers and following parameters:
s.duration.ms
)vowel.duration.ms
)average.f0.Hz
)f0.range.Hz
)perceived.as.homo
)perceived.as.hetero
)orientation
)age
)When data are loaded, we can look at general info using str()
function:
str(dat)
## 'data.frame': 14 obs. of 10 variables:
## $ speaker : Factor w/ 14 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ s.duration.ms : num 61.4 63.9 55.1 78.1 64.7 ...
## $ vowel.duration.ms : num 112.6 126.5 126.8 119.2 93.7 ...
## $ average.f0.Hz : num 120 100 115 127 131 ...
## $ f0.range.Hz : num 52.5 114 103.2 58.8 37.4 ...
## $ perceived.as.homo : int 7 20 9 15 10 17 20 21 20 8 ...
## $ perceived.as.hetero : int 18 5 16 10 15 8 5 4 5 17 ...
## $ perceived.as.homo.percent: num 0.28 0.8 0.36 0.6 0.4 0.68 0.8 0.84 0.8 0.32 ...
## $ orientation : Factor w/ 2 levels "hetero","homo": 1 1 2 2 2 2 1 1 2 2 ...
## $ age : int 30 19 29 36 27 33 28 22 22 40 ...
This function is very helpful since it returns a lot of information at the same time: number of observations (rows), number of variables (columns), names of all columns, their types and first values in each column.
To get descriptive statisitics for all columns, we will need summary()
function:
summary(dat)
## speaker s.duration.ms vowel.duration.ms average.f0.Hz
## A :1 Min. :45.13 Min. : 93.68 Min. :100.3
## B :1 1st Qu.:58.15 1st Qu.:118.31 1st Qu.:116.0
## C :1 Median :61.93 Median :123.75 Median :122.7
## D :1 Mean :61.22 Mean :124.06 Mean :125.2
## E :1 3rd Qu.:64.51 3rd Qu.:132.27 3rd Qu.:130.3
## F :1 Max. :78.11 Max. :147.52 Max. :155.3
## (Other):8
## f0.range.Hz perceived.as.homo perceived.as.hetero
## Min. : 37.40 Min. : 4.00 Min. : 4.00
## 1st Qu.: 53.30 1st Qu.: 8.25 1st Qu.: 5.00
## Median : 73.20 Median :12.50 Median :12.50
## Mean : 76.66 Mean :13.50 Mean :11.50
## 3rd Qu.:102.53 3rd Qu.:20.00 3rd Qu.:16.75
## Max. :118.20 Max. :21.00 Max. :21.00
##
## perceived.as.homo.percent orientation age
## Min. :0.16 hetero:7 Min. :19.00
## 1st Qu.:0.33 homo :7 1st Qu.:22.75
## Median :0.50 Median :28.50
## Mean :0.54 Mean :27.86
## 3rd Qu.:0.80 3rd Qu.:30.00
## Max. :0.84 Max. :40.00
##
For numeric columns it returns descriptive statistics, for character or factor ones it returns absolute frequencies.
To get the number of observations in a data frame, we can use nrow()
function:
nrow(dat)
## [1] 14
Let’s choose a column by its name. Take age
, for example:
# $ and then the name of a column
dat$age
## [1] 30 19 29 36 27 33 28 22 22 40 30 25 20 29
Then we can work with a separate column and R will treat it as a vector:
# summary of a column
summary(dat$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 22.75 28.50 27.86 30.00 40.00
# average age
mean(dat$age)
## [1] 27.85714
# histogram for a column
hist(dat$age, col="red")
Now let’s see how to choose rows and columns by their index (position in a data frame). As a data frame is a table with rows and columns, to choose a certain cell from this table we have to specify the row number and the column number. In R row numbers usually go first:
# 1st row, 2nd column
dat[1, 2] # 1st speaker, 2nd column - s.duration.ms
## [1] 61.4
So as to select the 1st row and all the columns, so, all the characteristics of the 1st speaker, we should leave the second position blank:
dat[1,] # type nothing after ,
## speaker s.duration.ms vowel.duration.ms average.f0.Hz f0.range.Hz
## 1 A 61.4 112.6 119.51 52.5
## perceived.as.homo perceived.as.hetero perceived.as.homo.percent
## 1 7 18 0.28
## orientation age
## 1 hetero 30
The same can be done for columns:
# all rows, only the 3rd column
dat[,3]
## [1] 112.60 126.49 126.81 119.17 93.68 127.87 147.52 120.13 140.44 121.01
## [11] 137.37 112.05 133.74 118.02
Of course, in practice we often have to filter our data based on some conditions rather than choose specific columns or rows by their indices. The logics of filtering rows of a data frame is the same as in vectors: we have to write the condition in square brackets.
# take all speakers older than 32 year old
dat[dat$age > 32,]
## speaker s.duration.ms vowel.duration.ms average.f0.Hz f0.range.Hz
## 4 D 78.11 119.17 126.61 58.8
## 6 F 67.00 127.87 150.79 42.0
## 10 J 59.59 121.01 123.90 111.7
## perceived.as.homo perceived.as.hetero perceived.as.homo.percent
## 4 15 10 0.60
## 6 17 8 0.68
## 10 8 17 0.32
## orientation age
## 4 homo 36
## 6 homo 33
## 10 homo 40
However, now we should not forget that our condition is tested for rows, so it should be placed before a comma. If we miss this comma, we will get incorrect results:
# all the speakers, of any age!
dat[dat$age > 32]
## average.f0.Hz perceived.as.homo age
## 1 119.51 7 30
## 2 100.29 20 19
## 3 114.90 9 29
## 4 126.61 15 36
## 5 130.76 10 27
## 6 150.79 17 33
## 7 128.96 20 28
## 8 105.26 21 22
## 9 109.86 20 22
## 10 123.90 8 40
## 11 119.48 21 30
## 12 146.20 8 25
## 13 155.34 9 20
## 14 121.48 4 29
We can test multiple conditions at the same time. For example, let’s choose speakers older than 20 year old that are homosexual:
# & - at the same time
dat[dat$age > 20 & dat$orientation == "homo",]
## speaker s.duration.ms vowel.duration.ms average.f0.Hz f0.range.Hz
## 3 C 55.08 126.81 114.90 103.2
## 4 D 78.11 119.17 126.61 58.8
## 5 E 64.71 93.68 130.76 37.4
## 6 F 67.00 127.87 150.79 42.0
## 9 I 60.45 140.44 109.86 96.4
## 10 J 59.59 121.01 123.90 111.7
## 11 K 62.94 137.37 119.48 87.6
## perceived.as.homo perceived.as.hetero perceived.as.homo.percent
## 3 9 16 0.36
## 4 15 10 0.60
## 5 10 15 0.40
## 6 17 8 0.68
## 9 20 5 0.80
## 10 8 17 0.32
## 11 21 4 0.84
## orientation age
## 3 homo 29
## 4 homo 36
## 5 homo 27
## 6 homo 33
## 9 homo 22
## 10 homo 40
## 11 homo 30
Again to join these conditions we use &
that stands for simultaneously true (recall working with vectors). Now let’s calculate how many homosexuals older than 20 are in our data frame:
nrow(dat[dat$age > 20 & dat$orientation == "homo",])
## [1] 7
To do this, we use nrow()
function. Why we do not need length()
?
length(dat[dat$age > 20 & dat$orientation == "homo",])
## [1] 10
As R treats any data frame as a vector of vectors, length()
function returns number of elements, so number of vectors in the data frame. Any column is a vector, so we obtain 10 here that is the number of columns.
As we have seen, looking at data that appear in the console is not convenient. So, we can save a subset of a data frame in a variable and then use View()
to see it in a separate tab:
dat_small <- dat[dat$age > 20 & dat$orientation == "homo",]
View(dat_small)
Now let’s suggest criteria you are interested in and we will choose speakers that satisfy these criteria.
Choose homosexuals that are mostly perceived as homosexuals.
Note: we decided that mostly perceived as homosexulas are speakers with the proportion of listeners assigned them to homosexuals is more than 0.5.
homo <- dat[dat$perceived.as.homo.percent > 0.5 &
dat$orientation == "homo",]
View(homo)
Now let’s calculate the percentage of homosexuals perceived as homosexuals.
nrow(homo)/nrow(dat) * 100
## [1] 28.57143
Choose speakers who are mostly perceived as homosexuals with either the intonation greater than the average or with vowel duration greater than the average.
First, to avoid overloading, we can calculate the average of vowel duration and the average of intonation (f0
):
# save
mean_duration <- mean(dat$vowel.duration.ms)
mean_intonation <- mean(dat$average.f0.Hz)
Then we can choose rows needed:
homo2 <- dat[(dat$vowel.duration.ms > mean_duration |
dat$average.f0.Hz > mean_intonation) &
dat$perceived.as.homo.percent > 0.5, ]
nrow(homo2)
## [1] 6
Note: the structure of our conditions is the following: (cond1 | cond2) & cond3
. So, cond3
should be true all the time (choose only homosexuals) while between cond1
and cond2
at least one should be true. In other words, our filter captures the following situations:
One more crucial point: brackets ()
here are compulsory; without them we will get a different condition, like this: cond1 | (cond2 & cond3)
. This is because operator &
is stronger.
Plot a histogram of vowel duration for speakers chosen at the previous step:
hist(homo2$vowel.duration.ms, col="yellow")