Question 1
1A
# use read.delim() to read the tab-delimited .txt files into data frames
gb.df <- read.delim('goldrickBlumstein.txt')
bach.df <- read.delim('bach.txt')
1B
# use $ and mean() to access fields in df
(mean.voiced <- mean(gb.df$VOT[gb.df$OnsetVoicing=='voiced']))
## [1] 22.66587
(mean.voiceless <- mean(gb.df$VOT[gb.df$OnsetVoicing=='voiceless']))
## [1] 65.2869
Voiced consonants have a much shorter mean VOT (about 22.7) than voiceless consonants (about 65.3).
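The same comparison can be made in one call with `tapply()`, which splits a vector by a grouping variable and applies a function to each group. The sketch below is only an illustration: `toy.df` and its values are made up so that the block runs on its own, and they are not taken from goldrickBlumstein.txt.

```r
# Toy stand-in for gb.df; values are illustrative only, not from the real file
toy.df <- data.frame(
  OnsetVoicing = c('voiced', 'voiced', 'voiceless', 'voiceless'),
  VOT = c(20, 24, 60, 70)
)

# split VOT by OnsetVoicing and take the mean of each group
group.means <- tapply(toy.df$VOT, toy.df$OnsetVoicing, mean)
print(group.means)
```

With the real data frame, the equivalent call would be `tapply(gb.df$VOT, gb.df$OnsetVoicing, mean)`.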
1C
# use $ and <- to specify and add a new field
gb.df$NVOT <- gb.df$VOT/gb.df$VowelLength
1D
# use $ and mean() to access fields in df
(mean.aboveC <- mean(bach.df$Duration[bach.df$Note >= 60]))
## [1] 1.007399
(mean.belowC <- mean(bach.df$Duration[bach.df$Note < 60]))
## [1] 0.9338945
The two means are much closer to each other than in the VOT comparison, although the mean duration is slightly longer for notes at or above middle C (Note 60).
Question 2
2A
# use hist() and subset data
hist(gb.df$VOT[gb.df$OnsetVoicing=='voiceless'])
2B
# use hist() and subset data
hist(gb.df$VowelLength[gb.df$OnsetVoicing=='voiced'])
2C
# use hist() and subset data
hist(gb.df$NVOT[gb.df$OnsetVoicing=='voiceless'])
2D
# load ggplot2
library(ggplot2)
# put data in ggplot subsetted to middle C and visualize with geom_histogram()
ggplot(data=subset(bach.df, Note==60), aes(x=Duration)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question 3
The speech data has a much smoother distribution, most likely because of the natural variability of human speech production. The Bach data, by contrast, shows the more regimented durations of MIDI notes: values cluster at a few durations, with large gaps between them.
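One rough way to put a number on "smooth" versus "regimented" is the proportion of distinct values in a vector: continuous measurements rarely repeat, while quantized ones repeat often. The sketch below uses simulated stand-ins, not the assignment's data, and `prop.unique` is a hypothetical helper name introduced here for illustration.

```r
set.seed(42)

# Simulated stand-ins; neither vector comes from the assignment's files
continuous <- rnorm(500, mean = 60, sd = 15)                   # speech-like, continuous
quantized <- sample(c(0.25, 0.5, 1, 2), 500, replace = TRUE)   # note-like, few levels

# proportion of values in x that are distinct
prop.unique <- function(x) length(unique(x)) / length(x)

prop.unique(continuous)
prop.unique(quantized)
```

Applied to the real data, this would be `prop.unique(gb.df$VOT)` and `prop.unique(bach.df$Duration)`; a much smaller value for the Bach durations would be consistent with the gaps seen in the histogram.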
Question 4
4A
# draw nrow(gb.df) uniform values for CVOT, with the VOT min and max both shifted down by 15
CVOT <- runif(nrow(gb.df), min=min(gb.df$VOT)-15, max=max(gb.df$VOT)-15)
# draw nrow(gb.df) uniform values for CVowelLength, with the VowelLength min and max both shifted up by 15
CVowelLength <- runif(nrow(gb.df), min=min(gb.df$VowelLength)+15,
                      max=max(gb.df$VowelLength)+15)
# CNVOT = CVOT/CVowelLength
CNVOT <- CVOT/CVowelLength
# add additional values above to data frame
gb.df$CVOT <- CVOT
gb.df$CVowelLength <- CVowelLength
gb.df$CNVOT <- CNVOT
# plot VOT histograms on single plot
hist(gb.df$VOT, col=rgb(1,0,0,0.5), xlab='VOT', main='Comparison of VOT')
hist(gb.df$CVOT, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("VOT", "CVOT"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)
# plot VowelLength histograms on single plot
hist(gb.df$VowelLength, col=rgb(1,0,0,0.5), xlab='VowelLength', main='Comparison of VowelLength')
hist(gb.df$CVowelLength, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("VowelLength", "CVowelLength"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)
# plot NVOT histograms on single plot
hist(gb.df$NVOT, col=rgb(1,0,0,0.5), xlab='NVOT', main='Comparison of NVOT')
hist(gb.df$CNVOT, col=rgb(0,0,1,0.5), add=TRUE)
legend("topright", c("NVOT", "CNVOT"), col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)), lwd=10)
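Because the second histogram is drawn with `add=TRUE`, it inherits the axes of the first, so any bars that fall outside the first histogram's range are clipped. A possible way around this is to compute shared breaks and axis limits first. The helper below is a sketch only: `compare.hist` and its arguments are hypothetical names, not part of the assignment, and the demonstration call uses simulated values.

```r
# Hypothetical helper: overlay two histograms on shared breaks and axis limits
compare.hist <- function(a, b, labels, n.breaks = 20) {
  rng <- range(c(a, b))                                    # common x range
  breaks <- seq(rng[1], rng[2], length.out = n.breaks + 1)
  h.a <- hist(a, breaks = breaks, plot = FALSE)
  h.b <- hist(b, breaks = breaks, plot = FALSE)
  y.max <- max(h.a$counts, h.b$counts)                     # common y range
  cols <- c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))
  plot(h.a, col = cols[1], xlim = rng, ylim = c(0, y.max),
       xlab = labels[1], main = paste('Comparison of', labels[1]))
  plot(h.b, col = cols[2], add = TRUE)
  legend('topright', labels, col = cols, lwd = 10)
  invisible(breaks)
}

# Demonstration on simulated values (not the assignment's data)
set.seed(1)
demo.breaks <- compare.hist(rnorm(200, 50, 10), runif(200, 0, 80),
                            c('VOT', 'CVOT'))
```

With the real data this would be called as, for example, `compare.hist(gb.df$VOT, gb.df$CVOT, c('VOT', 'CVOT'))`.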
Clyde's errors primarily affect the raw VOT and vowel length measurements: they make his measurements uniformly distributed rather than approximately normally distributed, as in the Goldrick and Blumstein data.
However, the normalized values (CNVOT) still look roughly normal. The main difference is that Clyde's normalized values have lower kurtosis than the original NVOT. A single-peaked shape makes sense, since the ratio of two independent uniform variables (CVOT and CVowelLength) is no longer flat, although it is not exactly normal.
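The kurtosis comparison can be checked numerically instead of by eye. The sketch below computes sample kurtosis by hand, so no extra package is needed. The `kurt` helper and the ranges passed to `runif()` are assumptions chosen only for illustration; they are not the ranges in gb.df. For reference, a uniform distribution has kurtosis 1.8 and a normal distribution has kurtosis 3.

```r
set.seed(1)
n <- 10000

# Hypothetical ranges, chosen only for illustration
x <- runif(n, min = 10, max = 100)    # stands in for CVOT
y <- runif(n, min = 100, max = 300)   # stands in for CVowelLength
ratio <- x / y                        # stands in for CNVOT

# sample kurtosis: fourth central moment over squared variance
kurt <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2
}

kurt(x)          # uniform input
kurt(ratio)      # ratio of two uniforms
kurt(rnorm(n))   # normal reference
```

On the real data, comparing `kurt(gb.df$NVOT)` with `kurt(gb.df$CNVOT)` would show directly whether Clyde's normalized values are flatter than the originals.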