Goodkind - Lab 1

Question 1

# use read.delim() to read txt files to df's
gb.df <- read.delim('goldrickBlumstein.txt')
bach.df <- read.delim('bach.txt')

# use $ and mean() to access fields in df
(mean.voiced <- mean(gb.df$VOT[gb.df$OnsetVoicing=='voiced']))

## [1] 22.66587

(mean.voiceless <- mean(gb.df$VOT[gb.df$OnsetVoicing=='voiceless']))

## [1] 65.2869

Voiced consonants have much shorter VOT than voiceless consonants.
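The two subset-and-average calls above can also be written as one grouped summary. The sketch below shows the pattern with tapply() on a small made-up data frame; toy.df and its values are hypothetical and are not the Goldrick and Blumstein data.

```r
# Hypothetical toy data standing in for gb.df; the values are invented
toy.df <- data.frame(
  VOT = c(20, 25, 60, 70),
  OnsetVoicing = c('voiced', 'voiced', 'voiceless', 'voiceless')
)

# tapply() splits VOT by OnsetVoicing and applies mean() to each group
(group.means <- tapply(toy.df$VOT, toy.df$OnsetVoicing, mean))
```

With the real data, the equivalent call would be tapply(gb.df$VOT, gb.df$OnsetVoicing, mean).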

# use $ and <- to specify and add a new field
gb.df$NVOT <- gb.df$VOT/gb.df$VowelLength

# use $ and mean() to access fields in df
(mean.aboveC <- mean(bach.df$Duration[bach.df$Note >= 60]))

## [1] 1.007399

(mean.belowC <- mean(bach.df$Duration[bach.df$Note < 60]))

## [1] 0.9338945

The two means are much closer to each other than the VOT means were. Notes at or above middle C (Note >= 60) last only slightly longer on average than notes below it.

Question 2

# use hist() and subset data
hist(gb.df$VOT[gb.df$OnsetVoicing=='voiceless'])

The bulk of the data is symmetrically centered around the mean.
The data has one mode, around which the data is centered.
These values are limited by human physiology. It is impossible to switch from a consonant to a vowel in 0 milliseconds, although the switch can be very rapid. It is similarly difficult to sustain a consonant for long before switching to a vowel, though consonants can be sustained slightly.

# use hist() and subset data
hist(gb.df$VowelLength[gb.df$OnsetVoicing=='voiced'])

Some of the data is symmetric around the center. However, there is a large mass at the lower end.
Although it is difficult to call the lower mass an additional mode, it is clear from the histogram that the data has multiple sources.
A vowel cannot have a length of 0; otherwise it does not exist. I assume there is also a physical limit below which a human cannot produce a sound. At the upper end, a vowel can be sustained for as long as a speaker can expel air.

# use hist() and subset data
hist(gb.df$NVOT[gb.df$OnsetVoicing=='voiceless'])

The data is right skewed, with a longer right tail.
The data has a single mode.
Apart from the lower bound at zero, the data has no theoretical limits, since the ratio of VOT to vowel length can in principle take many values.

# import ggplot
library(ggplot2)

# put data in ggplot subsetted to middle C and visualize with geom_histogram()
ggplot(data=subset(bach.df, Note==60), aes(x=Duration)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data does not appear to have a clear center. Possibly, the histogram is right skewed.
The data has multiple modes. One mode produces counts below 250. A second mode produces counts greater than 1,000.
There are no non-zero theoretical limits. A note can be sustained for an arbitrarily short duration or for an arbitrarily long one.
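The stat_bin() message above is a reminder that the apparent number of modes depends on the bin width. A minimal sketch of that effect follows, using simulated durations (toy.duration is hypothetical and is not drawn from bach.df) and base R hist() with explicit breaks.

```r
# Hypothetical note durations clustered at a few fixed values
set.seed(1)
toy.duration <- sample(c(0.25, 0.5, 1, 2), size = 200, replace = TRUE)

# plot = FALSE returns the bin counts without drawing anything
narrow <- hist(toy.duration, breaks = seq(0, 2, by = 0.25), plot = FALSE)
wide   <- hist(toy.duration, breaks = seq(0, 2, by = 1), plot = FALSE)

narrow$counts  # narrow bins keep the separate clusters apart
wide$counts    # wide bins merge neighboring clusters
```

In ggplot2, the same choice is made through the binwidth argument of geom_histogram().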

Question 3

The speech data has a much smoother distribution, most likely because of the natural variability of human speech production. The Bach data, by contrast, shows the more regimented durations of MIDI notes, with large gaps between the durations that occur.
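One rough way to put a number on this contrast is the share of observations that are distinct values. The sketch below uses simulated stand-ins (toy.speech and toy.midi are hypothetical, as is the helper distinct.share); real VOT measurements rounded to the millisecond would give a lower share than the continuous sample here.

```r
# Hypothetical samples: continuous "speech-like" vs. quantized "MIDI-like" values
set.seed(1)
toy.speech <- rnorm(500, mean = 65, sd = 15)
toy.midi   <- sample(c(0.25, 0.5, 1, 2), size = 500, replace = TRUE)

# Share of observations that are distinct values (hypothetical helper)
distinct.share <- function(x) length(unique(x)) / length(x)

distinct.share(toy.speech)  # close to 1 for continuous data
distinct.share(toy.midi)    # close to 0 when only a few values occur
```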

Question 4

4A

# use nrow, min and max of VOT for CVOT, but shorten min/max by 15
CVOT <- runif(nrow(gb.df), min=min(gb.df$VOT)-15, max=max(gb.df$VOT)-15)
# use nrow, min and max of VowelLength for CVowelLength, but lengthen min/max by 15
CVowelLength <- runif(nrow(gb.df), min=min(gb.df$VowelLength)+15, 
                      max=max(gb.df$VowelLength)+15)
# CNVOT = CVOT/CVowelLength
CNVOT <- CVOT/CVowelLength

# add additional values above to data frame
gb.df$CVOT <- CVOT
gb.df$CVowelLength <- CVowelLength
gb.df$CNVOT <- CNVOT

# plot VOT histograms on single plot
hist(gb.df$VOT, col=rgb(1,0,0,0.5), xlab='VOT', main='Comparison of VOT')
hist(gb.df$CVOT, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("VOT", "CVOT"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)

# plot VowelLength histograms on single plot
hist(gb.df$VowelLength, col=rgb(1,0,0,0.5), xlab='VowelLength', main='Comparison of VowelLength')
hist(gb.df$CVowelLength, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("VowelLength", "CVowelLength"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)

# plot NVOT histograms on single plot
hist(gb.df$NVOT, col=rgb(1,0,0,0.5), xlab='NVOT', main='Comparison of NVOT')
hist(gb.df$CNVOT, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("NVOT", "CNVOT"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)
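The three overlay plots above repeat the same pattern, so they could be wrapped in a small helper. The sketch below is one possible version; overlay.hist and its arguments are hypothetical names. It also uses shared breaks for both histograms, which is an assumption about what makes the comparison easier to read and differs from the independent default breaks used above.

```r
# Hypothetical helper: overlay two histograms using the same breaks
overlay.hist <- function(x, y, label.x, label.y, n.breaks = 20) {
  cols <- c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))
  # shared breaks spanning both vectors so bar widths are comparable
  breaks <- seq(min(x, y), max(x, y), length.out = n.breaks + 1)
  hx <- hist(x, breaks = breaks, plot = FALSE)
  hy <- hist(y, breaks = breaks, plot = FALSE)
  # draw the first histogram, leaving room for the taller of the two
  plot(hx, col = cols[1], xlab = label.x,
       main = paste('Comparison of', label.x),
       ylim = c(0, max(hx$counts, hy$counts)))
  plot(hy, col = cols[2], add = TRUE)
  legend('topright', c(label.x, label.y), col = cols, lwd = 10)
  invisible(breaks)
}

# Made-up inputs for illustration only
set.seed(1)
toy.breaks <- overlay.hist(rnorm(100, 50, 10), runif(100, 20, 80),
                           'VOT', 'CVOT')
```

With the lab data, a call might look like overlay.hist(gb.df$VOT, gb.df$CVOT, 'VOT', 'CVOT').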

The errors seem to primarily affect VOT and vowel length. They make Clyde’s measurements uniformly distributed, rather than normally distributed as in the Goldrick and Blumstein data.

However, Clyde’s normalized values (CNVOT) still look approximately normal. The main difference is that the CNVOT distribution has less kurtosis than the original NVOT distribution. A bell-like shape is plausible for a ratio of two random quantities (VOT and vowel length), although a ratio of two uniform variables is not exactly normal.
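The kurtosis comparison can be checked numerically rather than by eye. The sketch below simulates a ratio of two uniform variables and computes excess kurtosis by hand; the ranges, the sim.* names, and the excess.kurtosis helper are all hypothetical and are not taken from gb.df.

```r
# Hypothetical ranges chosen for illustration; not taken from gb.df
set.seed(1)
n <- 10000
sim.vot   <- runif(n, min = 0, max = 100)
sim.vowel <- runif(n, min = 100, max = 300)
sim.ratio <- sim.vot / sim.vowel

# Excess kurtosis: fourth standardized moment minus 3
# (0 for a normal distribution, -1.2 for a uniform distribution)
excess.kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

excess.kurtosis(sim.vot)    # a uniform sample should come out near -1.2
excess.kurtosis(sim.ratio)  # compare against 0 to judge normality

# visual check of the ratio against a normal distribution
qqnorm(sim.ratio)
qqline(sim.ratio)
```

Applying excess.kurtosis() to gb.df$NVOT and gb.df$CNVOT would give the corresponding comparison for the lab data.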