This set is based on (Coretta 2017, link. This dissertation dealt with the relation between vowel duration and aspiration in consonants. Author carried out a data collection with 5 natives speakers of Icelandic. Then he extracted the duration of vowels followed by aspirated versus non-aspirated consonants. Check out whether vowels before aspirated consonants (like in Icelandic takka ‘key’ [tʰaʰka]) are significantly shorter than vowels followed by non-aspirated consonants (like in kagga ‘barrel’ [kʰakka]). Link to the dataset.
df <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv")
A general boxplot:
boxplot(df$vowel.dur)
Get the number of outliers:
length(boxplot(df$vowel.dur)$out)
## [1] 27
Look at number of observations by groups (aspirated and non-aspirated cases):
table(df$aspriration)
## < table of extent 0 >
Choose two subsamples, one for words where vowels are followed by aspirated consonants and another for non-aspirated consonants.
asp <- df[df$aspiration == 'yes',]
nasp <- df[df$aspiration == 'no',]
Summary for aspirated and non-aspirated cases:
summary(asp$vowel.dur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.78 64.96 77.60 78.76 91.46 166.56
summary(nasp$vowel.dur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.98 77.56 91.91 94.69 103.88 214.48
Boxplot by groups:
boxplot(df$vowel.dur ~ df$aspiration)
More interesting - let us create a boxplot by all groups (see the field cons1
):
boxplot(df$vowel.dur ~ df$cons1)
You can compare distribution of vowel.dur
in asp(irated), fri(cative), nasp(non-aspirated), voi(ced), etc.
We can limit our data to just one type of vowels, say, middle vowels. Therefore, we will work with the same type of a consonant:
asp <- df[df$aspiration == 'yes' & df$height == 'mid', ]
nasp <- df[df$aspiration == 'no' & df$height == 'mid', ]
Again, here is a summary for a corrected case:
summary(asp$vowel.dur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 38.67 71.41 81.92 82.65 95.19 150.46
summary(nasp$vowel.dur)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.98 80.90 97.97 98.73 110.51 190.93
nrow(asp)
## [1] 156
nrow(nasp)
## [1] 174
Let us formulate the null hypothesis, the alternative hypotesis, and apply t-test to our dataset.
t.test(asp$vowel.dur, nasp$vowel.dur)
##
## Welch Two Sample t-test
##
## data: asp$vowel.dur and nasp$vowel.dur
## t = -6.4869, df = 317.72, p-value = 3.356e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -20.94772 -11.19801
## sample estimates:
## mean of x mean of y
## 82.65371 98.72657
By default, R calculates t.test with regard to the bi-directional alternative hypothesis, such as \(\mu_1 \neq \mu_2\).
H1: \(\mu_{asp} \lt \mu_{nasp}\)
t.test(asp$vowel.dur, nasp$vowel.dur, alternative = "less")
##
## Welch Two Sample t-test
##
## data: asp$vowel.dur and nasp$vowel.dur
## t = -6.4869, df = 317.72, p-value = 1.678e-10
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -11.98542
## sample estimates:
## mean of x mean of y
## 82.65371 98.72657
require(tidyverse)
require(dplyr)
Let’s get a descriptive summary of our data in a dplyr style.
df %>%
group_by(aspiration) %>%
summarise(mean = mean(vowel.dur),
st.dev = sd(vowel.dur))
Density plots can be thought of as plots of smoothed histograms.
library(ggplot2)
df %>%
ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
geom_density(alpha = 0.4)+
geom_rug()+
labs(title = "Vowel duration density plot",
caption = "Data from (Coretta 2017)",
x = "vowel duration")
Density plot by speaker:
df %>%
ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
geom_density(alpha = 0.4)+
geom_rug()+
facet_wrap(~speaker)+
labs(title = "Vowel duration density plot, by speaker",
caption = "Data from (Coretta 2017)",
x = "vowel duration")
and descriptive statistics:
df %>%
group_by(aspiration, speaker) %>%
summarise(mean = mean(vowel.dur),
st.dev = sd(vowel.dur))