Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.
getwd ()
## [1] "/Users/pallavisaitu"
mydata = read.csv("/Users/pallavisaitu/Downloads/BNC_wordfreq.csv")
head(mydata)
## frequency word part_of_speech
## 1 6187267 the det
## 2 4239632 be v
## 3 3093444 of prep
## 4 2687863 and conj
## 5 2186369 a det
## 6 1924315 in prep
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.5.2
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
Calculate the following statistics on the word frequencies from the BNC data:
- Dispersion: min/max, 1st/3rd quantile, standard deviation
- Location: mean, median, mode
# Mean, Median, Min, Max, Quantiles
summary(mydata$frequency)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 800 1282 2335 13567 6050 6187267
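The `summary()` call above does not report the standard deviation or the mode. The following is a minimal sketch of how they could be computed. It assumes the six frequencies printed by `head(mydata)` as a stand-in for `mydata$frequency`, and `get_mode` is a hypothetical helper name, not a base R function.

```r
# Stand-in data: the six frequencies shown by head(mydata) above.
# In the report itself, freq <- mydata$frequency would be used instead.
freq <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)

# Hypothetical helper: the most common value in a numeric vector.
# If several values tie, the smallest of the tied values is returned.
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

freq_sd    <- sd(freq)        # standard deviation (dispersion)
freq_range <- range(freq)     # minimum and maximum (dispersion)
freq_iqr   <- IQR(freq)       # distance between 1st and 3rd quartile
freq_mode  <- get_mode(freq)  # most common value (location)
```

On the full data, the mode is only informative if some frequency values repeat; otherwise every value ties with a count of one.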
Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.
mydata_new <- aggregate(mydata$frequency, list(mydata$part_of_speech), FUN=mean)
barplot(mydata_new$x, main="Frequency Vs Part of Speech", xlab="Group",ylab = "Frequency", names.arg = mydata_new$Group.1)
boxplot(frequency~word, data=mydata)
boxplot(frequency~part_of_speech, data=mydata)
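The prompt asks for a histogram, a density plot, and a normal Q-Q plot of the frequencies, while the code above draws a bar plot and box plots. A sketch of the three requested plots in base R might look like this. It again assumes the six values from `head(mydata)` as stand-in data; in the report, `mydata$frequency` would take their place.

```r
# Stand-in data: the six frequencies shown by head(mydata).
# In the report itself, freq <- mydata$frequency would be used instead.
freq <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)

# Histogram of the raw frequencies
h <- hist(freq, main = "Histogram of Word Frequencies", xlab = "Frequency")

# Kernel density estimate of the same values
d <- density(freq)
plot(d, main = "Density of Word Frequencies", xlab = "Frequency")

# Normal Q-Q plot with a reference line; points that bend away from
# the line indicate a departure from a normal distribution
qqnorm(freq, main = "Normal Q-Q Plot of Word Frequencies")
qqline(freq)
```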
Create a plot displaying Zipf’s Law on the BNC data.
plot(mydata$frequency)
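Zipf’s Law says that a word’s frequency is roughly inversely proportional to its rank, so plotting rank against frequency on log-log axes should give points close to a straight, downward-sloping line. The sketch below shows one way to draw that. It assumes the six values from `head(mydata)` as stand-in data, and the names `word_rank`, `zipf_fit`, and `zipf_slope` are chosen here for illustration.

```r
# Stand-in data: the six frequencies shown by head(mydata).
# In the report itself, freq <- mydata$frequency would be used instead.
freq <- c(6187267, 4239632, 3093444, 2687863, 2186369, 1924315)

# Rank 1 is the most frequent word
freq_sorted <- sort(freq, decreasing = TRUE)
word_rank <- seq_along(freq_sorted)

# Rank versus frequency with both axes on a log scale
plot(word_rank, freq_sorted, log = "xy",
     main = "Zipf's Law on BNC Word Frequencies",
     xlab = "Rank (log scale)", ylab = "Frequency (log scale)")

# Straight-line fit on the log-log scale; Zipf's Law predicts a
# negative slope, classically near -1
zipf_fit <- lm(log(freq_sorted) ~ log(word_rank))
zipf_slope <- coef(zipf_fit)[2]
```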
Using your results from above, answer the following:
- What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc. In this question, look at the dispersion statistics and explain what they mean to a naive audience.
Ans: The minimum frequency is 800, the median is 2335, and the maximum is 6187267. The 1st quartile is 1282 and the 3rd quartile is 6050, so the middle half of the words occur between 1282 and 6050 times, while the maximum lies far above that range. The mean (13567) is well above the median, which shows that the frequency data is skewed to the right.
- What do the mean, median, and mode tell us about the use of words in the English Language? How frequent are words in general (mean/median), and what is the most common frequency in the data?
Ans: The mean frequency (13567) is much larger than the median (2335). This means that a small number of very common words pull the average up, while a typical English word is used far less often.
- In looking at the pictures of the frequency data, does word frequency appear to be normally distributed (explain, not just yes/no)?
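Ans: No, the data is right skewed: the mean (13567) is far above the median (2335), whereas in a normal distribution the two would be close together.
- Does the frequency in the BNC follow Zipf’s Law (explain, not just yes/no)?
Ans: Yes, the plot follows a similar pattern: a few words have very high frequencies and the frequencies drop off steeply after that.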
Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.
A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb
# mydata is already in memory from read.csv() above; data() only loads datasets shipped with packages
head(mydata)
## frequency word part_of_speech
## 1 6187267 the det
## 2 4239632 be v
## 3 3093444 of prep
## 4 2687863 and conj
## 5 2186369 a det
## 6 1924315 in prep
summary(mydata$part_of_speech)
## a adv conj det
## 1124 427 34 47
## infinitive-marker interjection modal n
## 1 13 12 3262
## prep pron v
## 71 46 1281
Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.
pos_table = table(mydata$part_of_speech) #create a table here
pos_table_top5 = sort(pos_table, decreasing = TRUE)[1:5] #this gives you top 5
pos_table_top5
##
## n v a adv prep
## 3262 1281 1124 427 71
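pct = round(prop.table(pos_table_top5) * 100, 1) #percentage share within the top 5
pie(pos_table_top5, main = "Top 5 Parts of Speech in the BNC", col = c("yellow", "red", "green", "grey", "blue"), labels = paste0(names(pos_table_top5), " (", pct, "%)"))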
The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.
freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") #categories of the corpus
merged_data <- data.frame(rbind(shaving, descriptor))
merged_data
## Spoken Fiction Academic Press
## shaving 25 175 40 273
## descriptor 6 7 462 40
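#function from class: normalized deviation of proportions (DP)
dev_prop <- function(observed_count, expected_count) {
  #half the summed absolute gap between observed and expected proportions
  DP_value <- sum(abs(prop.table(observed_count) - prop.table(expected_count)))/2
  #normalize by the smallest expected proportion
  DP_normal <- DP_value / (1 - min(prop.table(expected_count)))
  return(DP_normal)
}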
dev_prop(merged_data['shaving', ], freqreg)
## [1] 0.3415698
dev_prop(merged_data['descriptor', ], freqreg)
## [1] 0.8703311
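To see where the two values come from, the calculation can be broken into its steps: the expected proportion of each category, the observed proportion of each word, and the gap between them. The sketch below recomputes both values with plain named vectors instead of data frame rows. The variable names (`expected`, `gap_shaving`, and so on) are chosen here for illustration.

```r
# Word counts of the COCA categories and of the two nouns (from above)
freqreg    <- c(Spoken = 95385672, Fiction = 90344134,
                Academic = 91044778, Press = 187245672)
shaving    <- c(Spoken = 25, Fiction = 175, Academic = 40, Press = 273)
descriptor <- c(Spoken = 6, Fiction = 7, Academic = 462, Press = 40)

# Share of the corpus in each category (what an even spread would look like)
expected <- prop.table(freqreg)

# Gap between each word's observed share and the expected share
gap_shaving    <- prop.table(shaving) - expected
gap_descriptor <- prop.table(descriptor) - expected

# Normalized DP: half the summed absolute gaps, divided by
# (1 - smallest expected share)
dp_shaving    <- sum(abs(gap_shaving)) / 2 / (1 - min(expected))
dp_descriptor <- sum(abs(gap_descriptor)) / 2 / (1 - min(expected))

round(c(shaving = dp_shaving, descriptor = dp_descriptor), 4)
```

Printing `gap_descriptor` shows which category drives the high value: the Academic share is far above its expected share, and the other three are below theirs.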
Using your results from above, answer the following:
Ans: Noun and Verb are the two most frequent parts of speech, respectively.
Ans: Infinitive-marker
Ans: Yes, because the value for descriptor (0.87) is closer to 1, so it is more likely to appear in one category of the corpus than in the others.