Aspiration and vowel duration in Icelandic

This set is based on Coretta (2017). The dissertation dealt with the relation between vowel duration and aspiration in consonants. The author collected data from 5 native speakers of Icelandic and then extracted the duration of vowels followed by aspirated versus non-aspirated consonants. The task is to check whether vowels before aspirated consonants (as in Icelandic takka ‘key’ [tʰaʰka]) are significantly shorter than vowels followed by non-aspirated consonants (as in kagga ‘barrel’ [kʰakka]). The dataset can be read directly from the following URL:

df <- read.csv("http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv")

Descriptive statistics

A general boxplot:

boxplot(df$vowel.dur)

Get the number of outliers:

length(boxplot(df$vowel.dur)$out)

## [1] 27
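For reference, boxplot() flags as outliers the points that lie more than 1.5 times the hinge spread (roughly the interquartile range) beyond the hinges. Below is a minimal sketch of that rule on simulated data; the vector x is hypothetical and is not the Icelandic dataset.

```r
# Simulated durations (hypothetical data, used only to illustrate the rule)
set.seed(1)
x <- c(rnorm(200, mean = 85, sd = 20), 170, 190, 5)

# boxplot.stats() applies the same rule as boxplot() without drawing anything
out_builtin <- boxplot.stats(x)$out

# The same rule by hand: hinges from fivenum(), fences at 1.5 * hinge spread
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)
out_manual <- x[x < hinges[1] - 1.5 * spread | x > hinges[2] + 1.5 * spread]

length(out_builtin)
length(out_manual)
```

Using boxplot.stats() (or boxplot(..., plot = FALSE)) avoids redrawing the plot when only the outlier count is needed.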

Look at the number of observations by group (aspirated and non-aspirated cases):

table(df$aspiration)

Choose two subsamples, one for words where vowels are followed by aspirated consonants and another for non-aspirated consonants.

asp <- df[df$aspiration == 'yes',]
nasp <- df[df$aspiration == 'no',]

Summary for aspirated and non-aspirated cases:

summary(asp$vowel.dur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.78   64.96   77.60   78.76   91.46  166.56
summary(nasp$vowel.dur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.98   77.56   91.91   94.69  103.88  214.48

Boxplot by groups:

boxplot(df$vowel.dur ~ df$aspiration)

More interestingly, let us create a boxplot for all consonant groups (see the field cons1):

boxplot(df$vowel.dur ~ df$cons1)

You can compare the distribution of vowel.dur in asp (aspirated), fri (fricative), nasp (non-aspirated), voi (voiced), etc.

We can limit our data to just one type of vowel, say, mid vowels. This way, the two subsamples contain vowels of the same height and differ only in the aspiration of the following consonant:

asp <- df[df$aspiration == 'yes' & df$height == 'mid', ]
nasp <- df[df$aspiration == 'no' & df$height == 'mid', ]

Again, here is a summary for the restricted subsamples:

summary(asp$vowel.dur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38.67   71.41   81.92   82.65   95.19  150.46
summary(nasp$vowel.dur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.98   80.90   97.97   98.73  110.51  190.93
nrow(asp)
## [1] 156
nrow(nasp)
## [1] 174

T-test

Let us formulate the null hypothesis, H0: \(\mu_{asp} = \mu_{nasp}\), and the alternative hypothesis, H1: \(\mu_{asp} \neq \mu_{nasp}\), and apply the t-test to our dataset.

t.test(asp$vowel.dur, nasp$vowel.dur)
## 
##  Welch Two Sample t-test
## 
## data:  asp$vowel.dur and nasp$vowel.dur
## t = -6.4869, df = 317.72, p-value = 3.356e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -20.94772 -11.19801
## sample estimates:
## mean of x mean of y 
##  82.65371  98.72657

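By default, t.test() in R tests against the two-sided alternative hypothesis, \(\mu_{asp} \neq \mu_{nasp}\). As the output header shows, it also uses the Welch version of the test, which does not assume equal variances in the two groups.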
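To see where the numbers in the t.test() output come from, here is a minimal sketch that recomputes the Welch statistic, the degrees of freedom, the p-value and the confidence interval by hand. The samples x and y are simulated (hypothetical values; only the sample sizes mirror the subsamples above), so the results will not match the output for the real data.

```r
# Simulated samples (hypothetical; sizes mirror the subsamples above)
set.seed(42)
x <- rnorm(156, mean = 83, sd = 20)
y <- rnorm(174, mean = 99, sd = 25)

# Standard error of the difference without assuming equal variances
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / se
dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_two <- 2 * pt(-abs(t_stat), df = dof)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = dof) * se

# Compare with the built-in test
res <- t.test(x, y)
c(manual = t_stat, builtin = unname(res$statistic))
c(manual = dof, builtin = unname(res$parameter))
c(manual = p_two, builtin = res$p.value)
```

The non-integer degrees of freedom in the output (df = 317.72) are a consequence of the Welch-Satterthwaite approximation.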

One-sided t-test

H1: \(\mu_{asp} \lt \mu_{nasp}\)

t.test(asp$vowel.dur, nasp$vowel.dur, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  asp$vowel.dur and nasp$vowel.dur
## t = -6.4869, df = 317.72, p-value = 1.678e-10
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -11.98542
## sample estimates:
## mean of x mean of y 
##  82.65371  98.72657
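The one-sided p-value reported here (1.678e-10) is half of the two-sided one (3.356e-10), because the t distribution is symmetric and the observed statistic points in the hypothesized direction. A minimal sketch of that relation, using the t statistic and degrees of freedom copied from the outputs above (the variable names are illustrative):

```r
# Test statistic and degrees of freedom as reported in the outputs above
t_stat <- -6.4869
dof <- 317.72

# Two-sided: probability mass in both tails
p_two <- 2 * pt(-abs(t_stat), df = dof)

# One-sided, H1: mu_asp < mu_nasp, so only the lower tail counts
p_less <- pt(t_stat, df = dof)

# The opposite one-sided alternative would use the upper tail instead
p_greater <- pt(t_stat, df = dof, lower.tail = FALSE)

c(two.sided = p_two, less = p_less, greater = p_greater)
```

Had the alternative been stated in the opposite direction (alternative = "greater"), the p-value would be close to 1, so the direction should be chosen before looking at the data.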

Density plots

library(tidyverse)  # attaches dplyr and ggplot2, among others

Let’s get a descriptive summary of our data in dplyr style.

df %>% 
  group_by(aspiration) %>%
  summarise(mean = mean(vowel.dur),
            st.dev = sd(vowel.dur))

Density plots can be thought of as plots of smoothed histograms.
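The "smoothed histogram" idea can be illustrated in base R by drawing a kernel density estimate over a histogram on the density scale. This is a minimal sketch on simulated data (the vector x is hypothetical, standing in for vowel.dur); the bandwidth settings are only for illustration.

```r
# Simulated durations (hypothetical data standing in for vowel.dur)
set.seed(7)
x <- rnorm(300, mean = 90, sd = 22)

# Histogram on the density scale, so the bar areas sum to 1
h <- hist(x, breaks = 20, freq = FALSE,
          main = "Histogram with kernel density estimate",
          xlab = "vowel duration")

# Kernel density estimate: a smooth curve over the same data
d <- density(x)
lines(d, lwd = 2)

# A larger bandwidth smooths more (dashed line)
lines(density(x, adjust = 2), lty = 2)
```

Because both the histogram and the curve are on the density scale, the area under each is 1, which is what makes them directly comparable.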

library(ggplot2)
df %>% 
  ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
  geom_density(alpha = 0.4)+
  geom_rug()+
  labs(title = "Vowel duration density plot",
       caption = "Data from (Coretta 2017)",
       x = "vowel duration")

Density plot by speaker:

df %>% 
  ggplot(aes(vowel.dur, fill = aspiration, color = aspiration))+
  geom_density(alpha = 0.4)+
  geom_rug()+
  facet_wrap(~speaker)+
  labs(title = "Vowel duration density plot, by speaker",
       caption = "Data from (Coretta 2017)",
       x = "vowel duration")

And the descriptive statistics by aspiration and speaker:

df %>% 
  group_by(aspiration, speaker) %>%
  summarise(mean = mean(vowel.dur),
            st.dev = sd(vowel.dur))