ANLY540 - Analysis of Human Language

Introduction

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(dplyr)
library(Rling)
library(modeest)
library(scales)

Import the Data

Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.

Dataset:

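# fileEncoding = "UTF-8-BOM" strips the byte-order mark that otherwise
# turns the first column name into "ï..frequency"
data <- read.csv("BNC_wordfreq.csv", fileEncoding = "UTF-8-BOM", stringsAsFactors = TRUE)
head(data)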

Continuous Data Analysis

Continuous Basic Statistics

Calculate the following statistics on the word frequencies from the BNC data:

Dispersion: min/max, 1st/3rd quartile, standard deviation
Location: mean, median, mode

Summary Stats:

summary(data$frequency)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     800    1282    2335   13567    6050 6187267

Mode:

# method = "mfv" returns the most frequent value, i.e. the mode of discrete data
mlv(data$frequency, method = "mfv")

## [1] 822

Standard deviation:

sd(data$frequency)

## [1] 123948.4
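
The dispersion statistics above can be supplemented with the range, the interquartile range, and a skewness value. The following is a minimal sketch on a made-up toy vector (`x` is hypothetical, not BNC data). The helper `skewness_moment` is an illustrative name defined here, not a function from any of the loaded packages.

```r
# Toy right-skewed vector (hypothetical values, not taken from the BNC file)
x <- c(1, 2, 2, 3, 5, 8, 1000)

value_range <- diff(range(x))   # maximum minus minimum
iqr_value   <- IQR(x)           # 3rd quartile minus 1st quartile

# Moment-based skewness: positive values indicate a long right tail
skewness_moment <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}
skew <- skewness_moment(x)

c(range = value_range, IQR = iqr_value, skewness = skew)
```

On the real data, the same calls would presumably be applied to `data$frequency` instead of `x`.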

Continuous Graphical Displays

Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.

par(mfrow =c(1,3))

hist(data$frequency, 
     main = "Histogram of word frequencies",
     xlab = "Word frequencies")

plot(density(data$frequency),
     main = "Density plot of word frequencies",
     xlab = "Word frequencies")

{qqnorm(data$frequency)
  qqline(data$frequency)}
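
When a handful of very large values dominate, most observations end up in the first histogram bin, and a log-scale view can make the shape easier to read. The sketch below uses simulated log-normal counts as a stand-in: the vector `sim_freq`, its parameters, and the seed are assumptions for illustration, not BNC values.

```r
set.seed(540)  # arbitrary seed so the simulated stand-in is reproducible

# Simulated right-skewed counts (assumption: stand-in for data$frequency)
sim_freq <- round(rlnorm(1000, meanlog = 8, sdlog = 1.5)) + 1
log_freq <- log10(sim_freq)

# Same three displays as above, but on the log10 scale
par(mfrow = c(1, 3))
hist(log_freq, main = "Histogram of log10 frequencies", xlab = "log10(frequency)")
plot(density(log_freq), main = "Density of log10 frequencies", xlab = "log10(frequency)")
qqnorm(log_freq)
qqline(log_freq)
```

To try this on the report's data, `sim_freq` could be replaced with `data$frequency`.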

Zipf’s Law

Create a plot displaying Zipf’s Law on the BNC data.

par(mfrow =c(1,2))

plot(sort(data$frequency, decreasing = TRUE),
     type = "b", main = "Zipf's law using all words",
     xlab = "Frequency rank", ylab = "Word frequency")

plot(head(sort(data$frequency, decreasing = TRUE), 100),
     type = "b", main = "Zipf's law using top 100 words",
     xlab = "Frequency rank", ylab = "Word frequency")

Interpretation of the Frequency Data

Using your results from above, answer the following:

What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc.? In this question, look at the dispersion statistics and explain what they mean to a naive audience.
- The data are extremely dispersed. The most frequently occurring word, “the”, appears more than six million times (6,187,267), while the least frequently occurring words, “wildly”, “reformer”, “quantum”, and “considering”, appear just 800 times. The first quartile is 1,282, which means that 75% of the words in the dataset have a frequency greater than 1,282. The third quartile is 6,050, which means that only 25% of the words have a frequency greater than 6,050. The middle half of the words therefore falls within a span of 4,768 (6,050 - 1,282), a tiny fraction of the full range. The standard deviation of about 124,000, more than nine times the mean, shows how strongly a few very frequent words stretch the spread.
What do the mean, median, and mode tell us about the use of words in the English Language? How frequent are words in general (mean/median), and what is the most common frequency in the data?
- The mean is the average frequency of all the words in the dataset. Here it is 13,567, which is more than twice the third quartile (6,050). A mean above the third quartile indicates that the distribution of frequencies is extremely right-skewed: a few very frequent words pull the average far above what is typical.
- If the words are sorted by frequency, the frequency of the word in the middle is the median. In other words, it is the 2nd quartile: 50% of the words have a lower frequency and 50% have a higher frequency. The median frequency for this dataset is 2,335. This is much lower than the mean, again indicating that the distribution of frequencies is right-skewed.
- The most common frequency in the dataset, that is, the specific frequency shared by the largest number of words, is 822.
- Taken together, these statistics tell us that the overwhelming majority of words are used relatively infrequently, while a small number of words are used extremely often.
In looking at the pictures of the frequency data, does word frequency appear to be normally distributed (explain, not just yes/no)?
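- No, word frequency does not appear to be normally distributed. First, the histogram, density plot, and Q-Q plot make it clear that the distribution is extremely right-skewed.
- Second, in a normal distribution the mean and median would roughly coincide. Here the mean (13,567) is greater than even the 3rd quartile (6,050), which shows how extremely right-skewed and non-normal the distribution is.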
Does the frequency in the BNC follow Zipf’s Law (explain, not just yes/no)?
- Zipf’s law states that a word’s frequency is inversely proportional to its frequency rank. This implies that the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third, and so on.
- In the plot using all the words, the curve resembles a power-law curve, but it is difficult to judge because of the large number of ranks (words).
- Inspecting just the top 100 most frequently used words, we see a clear power-law curve. We can therefore conclude that the distribution of frequencies in the BNC is consistent with Zipf’s law.
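
One way to make the Zipf judgement less dependent on eyeballing a curve is to plot frequency against rank on log-log axes, where an inverse relationship becomes a straight line with a slope near -1. The following is a minimal sketch on idealized frequencies: `zipf_rank`, `zipf_freq`, and the constant `1e6` are hypothetical and chosen only for illustration, not taken from the BNC file.

```r
# Idealized Zipfian frequencies: frequency proportional to 1 / rank
# (hypothetical constant 1e6; not BNC values)
zipf_rank <- 1:1000
zipf_freq <- round(1e6 / zipf_rank)

# On log-log axes a power law becomes a straight line
fit <- lm(log10(zipf_freq) ~ log10(zipf_rank))
zipf_slope <- unname(coef(fit)[2])
zipf_slope  # close to -1 for this idealized example

plot(log10(zipf_rank), log10(zipf_freq),
     main = "Idealized Zipf curve on log-log axes",
     xlab = "log10(rank)", ylab = "log10(frequency)")
abline(fit)
```

For the BNC data, `zipf_freq` could be replaced with `sort(data$frequency, decreasing = TRUE)` and `zipf_rank` with `seq_along()` of that vector. Because the lowest frequency in this file is 800, the low-frequency tail is not represented, so the fitted slope would describe only the upper part of the distribution.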

Categorical Data Analysis

Categorical Basic Statistics

Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.

A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb

Word Count for parts of speech:

summary(data$part_of_speech)

##                 a               adv              conj               det 
##              1124               427                34                47 
## infinitive-marker      interjection             modal                 n 
##                 1                13                12              3262 
##              prep              pron                 v 
##                71                46              1281

Categorical Graphical Displays

Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.

pos_table <- table(data$part_of_speech)
part_of_speech.t <- sort(pos_table, decreasing = TRUE)[1:5]

# Labels follow the sorted order of the top 5 counts (n, v, a, adv, prep)
pos_labels <- c("Noun", "Verb", "Adjective", "Adverb", "Preposition")
cols <- c("darkblue", "cyan", "deeppink", "darkmagenta", "chartreuse")

pie(part_of_speech.t,
    main = "Pie chart of the top 5 parts of speech",
    col = cols, labels = NA)

# Percentages are shares within the top 5 categories, not of all words
legend("bottomright",
       paste0(pos_labels, ": ", round(prop.table(part_of_speech.t) * 100, 2), "%"),
       cex = 0.8, fill = cols)
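
The legend percentages come from `prop.table()` applied to the five retained categories, so they sum to 100% within the top 5 rather than across all words. The sketch below, which re-enters the counts printed in the summary above (the vector name `pos_counts` is introduced here for illustration), shows how the two kinds of share differ.

```r
# Counts copied from the summary of data$part_of_speech shown above
pos_counts <- c(a = 1124, adv = 427, conj = 34, det = 47,
                "infinitive-marker" = 1, interjection = 13, modal = 12,
                n = 3262, prep = 71, pron = 46, v = 1281)

top5 <- sort(pos_counts, decreasing = TRUE)[1:5]

share_of_all  <- round(top5 / sum(pos_counts) * 100, 2)  # share of all words
share_of_top5 <- round(prop.table(top5) * 100, 2)        # share within the top 5

rbind(share_of_all, share_of_top5)
```

Since the six omitted categories are small, the two sets of percentages should differ only slightly, but the chart title or caption should say which one is shown.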

Categorical Dispersion

The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.

freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") #categories of the corpus

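dev_prop = function (observed_count, expected_count){
  # Half the summed absolute difference between observed and expected proportions
  DP_value = sum(abs(prop.table(observed_count)-prop.table(expected_count)))/2
  # Normalize by the largest attainable value so that the result ranges from 0 to 1
  DP_normal = DP_value / (1 - min(prop.table(expected_count)))
  return(DP_normal)
}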

Overall Proportion table for COCA categories:

prop.table(freqreg)

##    Spoken   Fiction  Academic     Press 
## 0.2055636 0.1946987 0.1962086 0.4035291

Proportion table for the word, “shaving”:

prop.table(shaving)

##     Spoken    Fiction   Academic      Press 
## 0.04873294 0.34113060 0.07797271 0.53216374

Proportion table for the word, “descriptor”:

prop.table(descriptor)

##     Spoken    Fiction   Academic      Press 
## 0.01165049 0.01359223 0.89708738 0.07766990

Proportion deviation for the word, “shaving”:

dev_prop(shaving, freqreg)

## [1] 0.3415698

Proportion deviation for the word, “descriptor”:

dev_prop(descriptor, freqreg)

## [1] 0.8703311

Inter-proportion deviation between the words, “shaving” and “descriptor”:

dev_prop(shaving, descriptor)

## [1] 0.8287702

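The deviation of proportions for “shaving” (0.34) is moderate, so it is spread fairly evenly across writing types, whereas the value for “descriptor” (0.87) is close to 1 because almost 90% of its occurrences are in Academic writing. The inter-proportion deviation between the two words (0.83) is also close to 1. Hence, despite their nearly identical overall frequencies, these words are not equally spread across writing types.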
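
To see which writing type drives a high deviation, the calculation inside `dev_prop` can be unpacked step by step. The sketch below repeats the arithmetic for “descriptor” using the counts given above; the intermediate names (`expected`, `observed`, `diffs`, `dp_raw`, `dp_norm`) are introduced here for illustration only.

```r
# Counts as given in the assignment
freqreg    <- c(Spoken = 95385672, Fiction = 90344134, Academic = 91044778, Press = 187245672)
descriptor <- c(Spoken = 6, Fiction = 7, Academic = 462, Press = 40)

expected <- prop.table(freqreg)      # share of the corpus in each writing type
observed <- prop.table(descriptor)   # share of the word's occurrences in each type

diffs   <- observed - expected       # positive = over-represented in that type
dp_raw  <- sum(abs(diffs)) / 2       # unnormalized deviation of proportions
dp_norm <- dp_raw / (1 - min(expected))  # same normalization as dev_prop

round(diffs, 3)
names(which.max(diffs))  # writing type where the word is most over-represented
dp_norm
```

The signed differences show the direction of the imbalance, which the single deviation value hides.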

Interpretation of the Categorical Data