Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.
library(readr)
BNC_wordfreq <- read_csv("~/Downloads/BNC_wordfreq.csv")
## Parsed with column specification:
## cols(
## frequency = col_double(),
## word = col_character(),
## part_of_speech = col_character()
## )
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Calculate the following statistics on the word frequencies from the BNC data:
- Dispersion: min/max, 1st/3rd quartile, standard deviation
- Location: mean, median, mode
summary(BNC_wordfreq$frequency)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 800 1282 2335 13567 6050 6187267
mean(BNC_wordfreq$frequency)
## [1] 13566.67
median(BNC_wordfreq$frequency)
## [1] 2335
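# mode() returns the storage mode ("numeric"), not the statistical mode,
# so tabulate the values and keep the most common one(s) instead
freq_counts <- table(BNC_wordfreq$frequency)
as.numeric(names(freq_counts)[freq_counts == max(freq_counts)])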
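The dispersion statistics requested above also include the standard deviation, which summary() does not report. A minimal sketch of collecting them in one place follows; `toy_freq` is a hypothetical stand-in for BNC_wordfreq$frequency and `dispersion_summary` is an illustrative helper name, not something from the assignment.

```r
# Hypothetical stand-in for BNC_wordfreq$frequency (not real BNC values)
toy_freq <- c(800, 900, 1500, 4000, 60000)

# Illustrative helper: gathers the dispersion statistics into one named vector
dispersion_summary <- function(x) {
  c(min = min(x),
    max = max(x),
    q1  = unname(quantile(x, 0.25)),  # 1st quartile
    q3  = unname(quantile(x, 0.75)),  # 3rd quartile
    iqr = IQR(x),                     # spread of the middle 50%
    sd  = sd(x))                      # standard deviation
}

toy_dispersion <- dispersion_summary(toy_freq)
toy_dispersion
```

On the real data the call would be dispersion_summary(BNC_wordfreq$frequency).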
Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.
data1<-aggregate(BNC_wordfreq$frequency, list(BNC_wordfreq$part_of_speech), FUN=mean)
barplot(data1$x, main = "Mean frequency by part of speech in the British National Corpus", xlab = "Part of speech", ylab = "Mean frequency", names.arg = data1$Group.1)
boxplot(frequency~word,data=BNC_wordfreq)
boxplot(frequency~part_of_speech,data=BNC_wordfreq)
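The prompt asks for a histogram, a density plot, and a qqnorm plot, which the bar plot and box plots above do not cover. One possible sketch using base graphics is shown here; `toy_freq` is simulated, hypothetical data and `plot_distribution` is an illustrative helper name.

```r
# Hypothetical right-skewed data standing in for BNC_wordfreq$frequency
set.seed(1)
toy_freq <- round(rlnorm(500, meanlog = 8, sdlog = 1.5))

# Illustrative helper: draws the three requested plots for a numeric vector
plot_distribution <- function(x) {
  hist(x, breaks = 50, main = "Histogram of word frequencies", xlab = "Frequency")
  plot(density(x), main = "Density of word frequencies", xlab = "Frequency")
  qqnorm(x, main = "Normal Q-Q plot of word frequencies")
  qqline(x)  # reference line: normally distributed data would fall close to it
  invisible(NULL)
}

plot_distribution(toy_freq)
```

On the real data the call would be plot_distribution(BNC_wordfreq$frequency). A long right tail in the histogram and points curving away from the Q-Q line would both argue against normality.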
Create a plot displaying Zipf’s Law on the BNC data.
plot(BNC_wordfreq$frequency)
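Zipf's Law is usually checked by plotting frequency against rank on log-log axes, where it predicts a roughly straight line with a slope near -1. The sketch below assumes that approach; `toy_freq` is a hypothetical, idealized Zipfian vector and `zipf_table` is an illustrative helper name.

```r
# Illustrative helper: rank the frequencies from most to least frequent
zipf_table <- function(freq) {
  sorted <- sort(freq, decreasing = TRUE)
  data.frame(rank = seq_along(sorted), frequency = sorted)
}

# Hypothetical idealized data: frequency proportional to 1 / rank
toy_freq <- round(100000 / (1:200))
z <- zipf_table(toy_freq)

# Log-log plot of frequency against rank
plot(z$rank, z$frequency, log = "xy",
     main = "Zipf's Law", xlab = "Rank (log scale)", ylab = "Frequency (log scale)")

# Slope of the log-log regression; Zipf's Law predicts a value near -1
fit <- lm(log(frequency) ~ log(rank), data = z)
zipf_slope <- coef(fit)[["log(rank)"]]
zipf_slope
```

On the real data the call would be zipf_table(BNC_wordfreq$frequency).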
Using your results from above, answer the following:
- What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc. In this question, look at the dispersion statistics and explain what they mean to a naive audience.
# The data are widely dispersed, ranging from a minimum of 800 to a maximum of 6187267. In particular, the maximum is about 1,023 times higher than the 3rd quartile (6050).
- What do the mean, median, and mode tell us about the use of words in the English Language? How frequent are words in general (mean/median), and what is the most common frequency in the data?
# The charts above show that the most frequent words are “it”, “the”, and so on. In addition, the most frequently used word types are “det” and “verb”.
- In looking at the pictures of the frequency data, does word frequency appear to be normally distributed (explain, not just yes/no)?
- Does the frequency in the BNC follow Zipf’s Law (explain, not just yes/no)?
Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.
A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb
summary(BNC_wordfreq$part_of_speech)
## Length Class Mode
## 6318 character character
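Because part_of_speech is a character column, summary() reports only its length and class. A sketch of counting the categories with table() instead is given below; `toy_pos` is a hypothetical stand-in for BNC_wordfreq$part_of_speech, and its labels are made up for illustration.

```r
# Hypothetical stand-in for BNC_wordfreq$part_of_speech
toy_pos <- c("n", "v", "n", "a", "adv", "n", "v")

# Count each part of speech and sort from most to least common
pos_counts <- sort(table(toy_pos), decreasing = TRUE)
pos_counts

# The same counts expressed as percentages of all words
pos_percent <- round(prop.table(pos_counts) * 100, 1)
pos_percent
```

On the real data, summary(factor(BNC_wordfreq$part_of_speech)) would give the same counts in alphabetical order.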
Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.
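pos_table <- table(BNC_wordfreq$part_of_speech)
pos_table_top5 <- sort(pos_table, decreasing = TRUE)[1:5] # the five most common parts of speech
# The legend labels are hard-coded and assume this is the sorted order of the top 5; check against names(pos_table_top5)
pos_labels <- c("Noun", "Verb", "Adjective", "Adverb", "Preposition")
cols <- c("grey", "orange", "green", "blue", "red")
pie(pos_table_top5, main = "% of top 5 parts of speech", col = cols, labels = NA)
# The percentages are shares within the top 5 only, not of all words in the data
legend("bottomleft", paste(pos_labels, ":", round(prop.table(pos_table_top5)*100, 2), "%"), cex = 0.6, fill = cols)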
The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.
freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") #categories of the corpus
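The function from class is not shown in this report. The sketch below assumes it is Gries's Deviation of Proportions (DP), which matches the zero-to-one interpretation described in the questions that follow; `deviation_of_proportions` is an illustrative name, and the vectors are repeated from above so the block runs on its own.

```r
# Corpus section sizes and word counts, repeated from the assignment
freqreg    <- c(Spoken = 95385672, Fiction = 90344134, Academic = 91044778, Press = 187245672)
shaving    <- c(Spoken = 25, Fiction = 175, Academic = 40, Press = 273)
descriptor <- c(Spoken = 6, Fiction = 7, Academic = 462, Press = 40)

# Assumed definition (Deviation of Proportions): half the summed absolute
# difference between each section's share of the word and its share of the corpus
deviation_of_proportions <- function(observed, sizes) {
  expected_share <- sizes / sum(sizes)        # share of the corpus in each section
  observed_share <- observed / sum(observed)  # share of the word in each section
  sum(abs(observed_share - expected_share)) / 2
}

dp_shaving    <- deviation_of_proportions(shaving, freqreg)
dp_descriptor <- deviation_of_proportions(descriptor, freqreg)
c(shaving = dp_shaving, descriptor = dp_descriptor)
```

Under this definition descriptor gets the larger value, since most of its occurrences fall in the Academic section.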
Using your results from above, answer the following:
- What are the most frequent parts of speech by type (i.e., we are only considering each word individually, not the frequency of the words as well)?
- What are the least frequent parts of speech by type?
- Was shaving or descriptor more “dispersed” throughout the texts? Remember that values close to zero mean that the concept is represented evenly across corpora, while values closer to one are more likely to appear in one corpus over another.