Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.
library(readr)
BNC_wordfreq <- read_csv("C:/Users/Emily/Desktop/540/BNC_wordfreq.csv")
## Parsed with column specification:
## cols(
## frequency = col_double(),
## word = col_character(),
## part_of_speech = col_character()
## )
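As an optional sanity check, it can help to glance at the first few rows to confirm the file was read in as expected:
# Preview the first rows: one word per row, with its frequency and part of speech
head(BNC_wordfreq)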
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
library(psych)
library(plotrix)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.2.0 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Calculate the following statistics on the word frequencies from the BNC data:
- Dispersion: min/max, 1st/3rd quartile, standard deviation
- Location: mean, median, mode
# Return the most frequent value (mode) of a vector
calc_mode <- function(x) {
  unique_x <- unique(x)
  # tabulate() counts occurrences of each unique value; which.max() picks the winner
  unique_x[which.max(tabulate(match(x, unique_x)))]
}
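A quick check on a toy vector confirms the helper behaves as intended:
# 2 occurs three times, more than any other value, so calc_mode() returns 2
calc_mode(c(1, 2, 2, 3, 2, 5))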
BNC_wordfreq %>%
  summarize(
    min_freq = min(frequency),
    max_freq = max(frequency),
    quantile_1st = quantile(frequency, probs = 0.25),
    quantile_3rd = quantile(frequency, probs = 0.75),
    sd_freq = sd(frequency),
    mean_freq = mean(frequency),
    median_freq = median(frequency),
    mode = calc_mode(frequency)
  ) ->
  q_stats
q_stats
## # A tibble: 1 x 8
## min_freq max_freq quantile_1st quantile_3rd sd_freq mean_freq median_freq
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 800 6187267 1282. 6050. 123948. 13567. 2335
## # ... with 1 more variable: mode <dbl>
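The printed tibble truncates the last column; printing at full width also shows the mode (822, as used in the answers below):
# Print all columns of the summary, including the truncated mode column
print(q_stats, width = Inf)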
Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.
## Histogram
ggplot(data = BNC_wordfreq) +
  geom_histogram(aes(x = frequency), color = "white", fill = "tomato2", bins = 200) +
  theme_bw() +
  labs(title = "Histogram of the Frequencies")

## Density plot
ggplot(data = BNC_wordfreq) +
  geom_density(aes(x = frequency), color = "tomato2") +
  theme_bw() +
  labs(title = "Density of the Frequencies")

## qqnorm plot
ggplot(data = BNC_wordfreq, aes(sample = frequency)) +
  stat_qq() + stat_qq_line()
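Because the frequencies span roughly four orders of magnitude, the raw-scale histogram is dominated by the long tail. An optional variant with a log10 x-axis (a sketch reusing the plotting code above) makes the shape easier to inspect:
## Histogram on a log10 scale (optional)
ggplot(data = BNC_wordfreq) +
  geom_histogram(aes(x = frequency), color = "white", fill = "tomato2", bins = 50) +
  scale_x_log10() +
  theme_bw() +
  labs(title = "Histogram of the Frequencies (log10 scale)")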
Create a plot displaying Zipf’s Law on the BNC data.
alpha <- 1
BNC_wordfreq %>%
  arrange(desc(frequency)) %>% # make sure rows are ordered from most to least frequent
  mutate(
    word = factor(word),
    word_rank = row_number(), # rank 1 = most frequent word
    # Zipf's law predicts the rank-r word's frequency = top frequency / r^alpha
    # (at rank 1 this is just the top frequency itself)
    zipfs_freq = dplyr::first(frequency) / word_rank ^ alpha
  ) ->
  BNC_wordfreq_2
## Zipf's plot
ggplot(BNC_wordfreq_2, aes(x = word_rank, y = frequency)) +
  geom_point(aes(color = "observed")) +
  geom_point(aes(y = zipfs_freq, color = "theoretical")) +
  theme_bw() +
  labs(x = "Word rank", y = "Frequency", title = "Zipf's law visualization") +
  scale_colour_manual(name = "Word count", values = c("theoretical" = "red", "observed" = "black"))
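A common complementary check, not required here, is to draw the same data on log-log axes, where Zipf's law predicts an approximately straight line with slope -alpha:
## Zipf's plot on log-log axes (optional)
ggplot(BNC_wordfreq_2, aes(x = word_rank, y = frequency)) +
  geom_point(aes(color = "observed")) +
  geom_point(aes(y = zipfs_freq, color = "theoretical")) +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw() +
  labs(x = "Word rank (log scale)", y = "Frequency (log scale)", title = "Zipf's law on log-log axes") +
  scale_colour_manual(name = "Word count", values = c("theoretical" = "red", "observed" = "black"))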
Using your results from above, answer the following:
- What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc.? In this question, look at the dispersion statistics and explain what they mean to a naive audience. The minimum frequency is 800 and the maximum is 6,187,267, so the values span an enormous range. The 1st quartile is 1,282 and the 3rd quartile is 6,050, meaning the middle 50% of words occur between 1,282 and 6,050 times, and 75% of all words occur 6,050 times or fewer. The standard deviation (about 123,948) is huge compared to these quartiles, because a handful of extremely frequent words sit far above the typical word.
- What do the mean, median, and mode tell us about the use of words in the English Language? How frequent are words in general (mean/median), and what is the most common frequency in the data? The median is 2,335 and the mean is 13,567. Since the mean is far larger than the median, a small number of very frequent words pull the average up; a typical word occurs about 2,335 times. The most common frequency in the data (the mode) is 822.
- In looking at the pictures of the frequency data, does word frequency appear to be normally distributed (explain, not just yes/no)? It is not normally distributed; it is strongly right-skewed. Most words have low frequencies, with a long right tail of a few very frequent words, and the points in the qqnorm plot curve away from the reference line rather than following it.
- Does the frequency in the BNC follow Zipf’s Law (explain, not just yes/no)? Yes. In the Zipf’s plot, the observed frequencies fall very close to the theoretical points predicted by frequency ∝ 1/rank, so the BNC data follow Zipf’s Law closely.
Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.
A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb
library(dplyr)
BNC_wordfreq %>%
tabyl(part_of_speech) %>%
adorn_pct_formatting(digits = 2)
## part_of_speech n percent
## a 1124 17.79%
## adv 427 6.76%
## conj 34 0.54%
## det 47 0.74%
## infinitive-marker 1 0.02%
## interjection 13 0.21%
## modal 12 0.19%
## n 3262 51.63%
## prep 71 1.12%
## pron 46 0.73%
## v 1281 20.28%
Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.
pos_table <- table(BNC_wordfreq$part_of_speech) # create a table of counts per part of speech
pos_table_top5 <- sort(pos_table, decreasing = TRUE)[1:5] # this gives you the top 5
# Map the abbreviations to full names by name, so the labels stay
# correct even if the slice order changes
pos_labels <- c(n = "Noun", v = "Verb", a = "Adjective",
                adv = "Adverb", prep = "Preposition")
# pie chart
pie(pos_table_top5, labels = pos_labels[names(pos_table_top5)])
title("The top 5 parts of speech for the BNC data")
The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.
freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") #categories of the corpus
df1 <- rbind(shaving, descriptor) %>%
as.data.frame()
df1
## Spoken Fiction Academic Press
## shaving 25 175 40 273
## descriptor 6 7 462 40
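Before computing the deviations, it can help to look at each word's frequencies as row proportions; even by eye, shaving is spread across several registers while descriptor sits almost entirely in Academic:
# Each word's distribution across the four registers as proportions
round(prop.table(as.matrix(df1), margin = 1), 2)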
# Deviation of proportions (DP): 0 means a word is spread in proportion to
# corpus-part sizes; values near 1 mean it is concentrated in one part
dev_prop <- function(observed_count, expected_count) {
  # half the sum of absolute differences between observed and expected proportions
  DP_value <- sum(abs(prop.table(observed_count) - prop.table(expected_count))) / 2
  DP_normal <- DP_value / (1 - min(prop.table(expected_count))) # normalized DP
  return(DP_normal)
}
dev_prop(df1['shaving', ], freqreg)
## [1] 0.3415698
dev_prop(df1['descriptor', ], freqreg)
## [1] 0.8703311
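As a sanity check on dev_prop (using hypothetical inputs, not assignment data): a word spread exactly in proportion to the corpus sizes scores ~0, while a word confined to a single register scores close to 1:
# Hypothetical word whose frequencies mirror the corpus sizes: DP of ~0
evenly_spread <- freqreg / 10000
dev_prop(evenly_spread, freqreg)
# Hypothetical word found only in Academic texts: DP close to 1
academic_only <- c(Spoken = 0, Fiction = 0, Academic = 100, Press = 0)
dev_prop(academic_only, freqreg)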
Using your results from above, answer the following:
- What are the most frequent parts of speech by type (i.e., we are only considering each word individually, not the frequency of the words as well)? The top 5 parts of speech by type are Noun, Verb, Adjective, Adverb, and Preposition.
- What are the least frequent parts of speech by type? The least frequent parts of speech by type are infinitive-marker, modal, interjection, conjunction, and pronoun.
- Was shaving or descriptor more “dispersed” throughout the texts? Remember that values close to zero mean that the concept is represented evenly across corpora, while values closer to one are more likely to appear in one corpus over another. Descriptor deviates far more from an even spread: its deviation of about 0.87 is close to 1, meaning it is concentrated in one writing type (academic texts), whereas shaving (about 0.34) is spread more evenly across the registers.