Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.
Dataset:
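A minimal sketch of the import step, assuming the CSV sits in the working directory. The column names `frequency` and `part_of_speech` are taken from the code later in this report; the path variable `bnc_path` and the guard around the read are illustrative additions.

```r
# Assumption: BNC_wordfreq.csv is in the current working directory
bnc_path <- "BNC_wordfreq.csv"

if (file.exists(bnc_path)) {
  # Read the word-frequency table; keep text columns as character vectors
  data <- read.csv(bnc_path, stringsAsFactors = FALSE)

  # Confirm that the columns used later in the report are present
  stopifnot(all(c("frequency", "part_of_speech") %in% names(data)))

  # Quick look at the structure and the first few rows
  str(data)
  head(data)
}
```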
Calculate the following statistics on the word frequencies from the BNC data:
Summary Stats:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 800 1282 2335 13567 6050 6187267
Mode:
## [1] 822
Standard deviation:
## [1] 123948.4
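The report does not show the code behind these statistics, so the following is only one way they could be computed. It uses a small toy vector in place of `data$frequency`, and the helper `stat_mode` is a hypothetical name. Base R's `mode()` returns the storage type of an object, not the most frequent value, which is why a helper is needed.

```r
# Toy frequencies standing in for data$frequency (illustrative values only)
freq <- c(800, 822, 822, 1282, 2335, 6050)

# Minimum, quartiles, median, mean, and maximum
summary(freq)

# Most frequent value: tabulate the values and pick the largest count.
# With ties, the smallest tied value is returned.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(freq)

# Sample standard deviation (n - 1 denominator)
sd(freq)
```

On the real data, `freq` would be replaced by `data$frequency`.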
Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.
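# Show the three diagnostic plots side by side
par(mfrow = c(1, 3))
# Histogram of the raw frequencies
hist(data$frequency,
     main = "Histogram of word frequencies",
     xlab = "Word frequencies")
# Kernel density estimate of the same values
plot(density(data$frequency),
     main = "Density plot of word frequencies",
     xlab = "Word frequencies")
# Normal Q-Q plot with a reference line
qqnorm(data$frequency)
qqline(data$frequency)
Create a plot displaying Zipf’s Law on the BNC data.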
# Two panels: all words, then the 100 most frequent
par(mfrow = c(1, 2))
# Frequencies sorted from most to least frequent, so the x-axis is rank
plot(sort(data$frequency, decreasing = TRUE),
     type = "b", main = "Zipf's law using all words",
     xlab = "Rank", ylab = "Word frequency")
plot(head(sort(data$frequency, decreasing = TRUE), 100),
     type = "b", main = "Zipf's law using top 100 words",
     xlab = "Rank", ylab = "Word frequency")
Using your results from above, answer the following:
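Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so on logarithmic axes the rank-frequency curve should look approximately like a straight line with slope near -1. The sketch below illustrates that check with synthetic frequencies that follow the law exactly; it is a complement to the linear-scale plots in this report, not the method used here, and the names `sorted_freq` and `zipf_slope` are illustrative.

```r
# Synthetic frequencies that follow Zipf's law exactly: f(r) = C / r
# (illustrative only; replace with data$frequency for the real corpus)
ranks <- 1:1000
freq <- 6000000 / ranks

# Sort from most to least frequent so that position equals rank
sorted_freq <- sort(freq, decreasing = TRUE)

# Rank-frequency curve with both axes on a log scale
plot(seq_along(sorted_freq), sorted_freq, log = "xy", type = "l",
     main = "Zipf's law on log-log axes",
     xlab = "Rank", ylab = "Word frequency")

# Slope of log(frequency) regressed on log(rank);
# Zipf's law predicts a value near -1
fit <- lm(log(sorted_freq) ~ log(seq_along(sorted_freq)))
zipf_slope <- unname(coef(fit)[2])
zipf_slope
```

Real corpus data will not fit this line perfectly, so the fitted slope is best read as a rough indicator.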
Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.
A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb
Word Count for parts of speech:
##                 a               adv              conj               det
##              1124               427                34                47
## infinitive-marker      interjection             modal                 n
##                 1                13                12              3262
##              prep              pron                 v
##                71                46              1281
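The counts above can be ranked and converted to percentages. The sketch below re-enters the printed counts by hand, so it runs without the CSV, and shows two possible denominators: all words, or only the five largest categories. The variable names are illustrative.

```r
# Counts copied from the table printed above
pos_counts <- c(a = 1124, adv = 427, conj = 34, det = 47,
                "infinitive-marker" = 1, interjection = 13, modal = 12,
                n = 3262, prep = 71, pron = 46, v = 1281)

# Five most frequent parts of speech, largest first
top5 <- sort(pos_counts, decreasing = TRUE)[1:5]

# Share of each top-5 category among all words in the table
share_all <- round(100 * top5 / sum(pos_counts), 2)

# Share of each top-5 category among the top five only
share_top5 <- round(100 * prop.table(top5), 2)

rbind(count = top5, share_all = share_all, share_top5 = share_top5)
```

The two percentage rows differ because the second one ignores the smaller categories; which one to report depends on what the chart is meant to show.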
Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.
# Count words per part of speech and keep the five largest categories
pos_table <- table(data$part_of_speech)
part_of_speech.t <- sort(pos_table, decreasing = TRUE)[1:5]
# Readable labels in the same order as the sorted counts (n, v, a, adv, prep)
pos_labels <- c("Noun", "Verb", "Adjective", "Adverb", "Preposition")
cols <- c("darkblue", "cyan", "deeppink", "darkmagenta", "chartreuse")
pie(part_of_speech.t,
    main = "Pie chart of parts of speech",
    col = cols, labels = NA)
# Legend percentages are shares of the top-5 total, not of all words
legend("bottomright",
       paste(pos_labels, ":", round(prop.table(part_of_speech.t) * 100, 2), "%"),
       cex = 0.8, fill = cols)
The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.
freqreg = c(95385672, 90344134, 91044778, 187245672) # word counts of the COCA categories
shaving = c(25, 175, 40, 273) # frequencies for shaving
descriptor = c(6, 7, 462, 40) # frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") # categories of the corpus
# Normalized deviation of proportions (DP) between observed and expected counts
dev_prop = function(observed_count, expected_count) {
  # Half the summed absolute difference between the two proportion vectors
  DP_value = sum(abs(prop.table(observed_count) - prop.table(expected_count))) / 2
  # Rescale by the largest attainable value so the result lies between 0 and 1
  DP_normal = DP_value / (1 - min(prop.table(expected_count)))
  return(DP_normal)
}
Overall proportion table for COCA categories:
## Spoken Fiction Academic Press
## 0.2055636 0.1946987 0.1962086 0.4035291
Proportion table for the word, “shaving”:
## Spoken Fiction Academic Press
## 0.04873294 0.34113060 0.07797271 0.53216374
Proportion table for the word, “descriptor”:
## Spoken Fiction Academic Press
## 0.01165049 0.01359223 0.89708738 0.07766990
Proportion deviation for the word, “shaving”:
## [1] 0.3415698
Proportion deviation for the word, “descriptor”:
## [1] 0.8703311
Inter-proportion deviation between the words, “shaving” and “descriptor”:
## [1] 0.8287702
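The inter-proportion deviation (0.83) is close to 1, so the two words are distributed very differently across writing types. Measured against the corpus as a whole, shaving (0.34) is spread comparatively evenly, with most of its occurrences in Press and Fiction, whereas descriptor (0.87) is concentrated in Academic writing, which accounts for about 90% of its occurrences. Hence these words are not equally spread across writing types.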
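The calls that produced the three deviation values are not shown in the report. The sketch below reproduces the printed numbers using the inputs and function given above; the argument order in the word-against-word call is inferred from the printed result, and the `dp_*` variable names are illustrative.

```r
# Inputs copied from the report
freqreg <- c(Spoken = 95385672, Fiction = 90344134,
             Academic = 91044778, Press = 187245672)
shaving <- c(Spoken = 25, Fiction = 175, Academic = 40, Press = 273)
descriptor <- c(Spoken = 6, Fiction = 7, Academic = 462, Press = 40)

# Normalized deviation of proportions, as defined in the report
dev_prop <- function(observed_count, expected_count) {
  DP_value <- sum(abs(prop.table(observed_count) - prop.table(expected_count))) / 2
  DP_value / (1 - min(prop.table(expected_count)))
}

# Each word compared with the sizes of the corpus categories
dp_shaving <- dev_prop(shaving, freqreg)
dp_descriptor <- dev_prop(descriptor, freqreg)

# One word compared with the other (argument order is an inference)
dp_between <- dev_prop(shaving, descriptor)

round(c(shaving = dp_shaving, descriptor = dp_descriptor,
        between = dp_between), 7)
```

Because the normalizing term depends on the second argument, `dev_prop(descriptor, shaving)` would give a slightly different value from `dev_prop(shaving, descriptor)`.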
Using your results from above, answer the following: