Assignment 1: Descriptive Statistics

Import the Data

Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.

bnc <- read.csv("BNC_wordfreq.csv", header = TRUE)

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.2     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.1.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Continuous Basic Statistics

Calculate the following statistics on the word frequencies from the BNC data:
- Dispersion: min/max, 1st/3rd quartile, standard deviation
- Location: mean, median, mode

summary(bnc$frequency)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     800    1282    2335   13570    6050 6187000

sd(bnc$frequency)

## [1] 123948.4
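The spans asked about in the interpretation section (minimum to maximum, and 1st to 3rd quartile) can be computed directly with `range()` and `IQR()`. The sketch below uses `toy_freq`, a made-up vector standing in for `bnc$frequency`, so the numbers it produces are illustrative only and are not results from the BNC file.

```r
# Made-up stand-in for bnc$frequency (hypothetical values, not the real data)
toy_freq <- c(800, 1200, 2300, 6000, 6187267)

# Full span: distance from the smallest to the largest value
full_span <- diff(range(toy_freq))

# Interquartile range: width of the middle 50% of the values
middle_span <- IQR(toy_freq)

c(full_span = full_span, middle_span = middle_span)
```

A full span that is vastly wider than the interquartile range is a quick sign that a few extreme values dominate the scale.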

mean(bnc$frequency)

## [1] 13566.67

median(bnc$frequency)

## [1] 2335

# mode() returns the storage type, not the most common value; tabulate instead
as.numeric(names(which.max(table(bnc$frequency))))

Continuous Graphical Displays

Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.

par(mfrow = c(1, 3))
hist(bnc$frequency, main = "Histogram of BNC Frequency", 
     xlab = "frequencies")

plot(density(bnc$frequency), main = "Density Plot of BNC Frequency", 
     xlab = "frequencies")

{qqnorm(bnc$frequency)
qqline(bnc$frequency)}

Zipf’s Law

Create a plot displaying Zipf’s Law on the BNC data.

plot(sort(bnc$frequency, decreasing = TRUE), 
     type = "b", main = "Zipf's Law", ylab = "Frequency")
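A common way to check Zipf's Law is to plot rank against frequency on log-log axes, where a Zipfian distribution appears as a roughly straight line with a slope near -1. The sketch below is a minimal illustration on synthetic data: `toy_freq`, its constant, and its length are assumptions made up for the example, not values from the BNC file.

```r
# Synthetic stand-in for bnc$frequency: an ideal Zipfian list where
# frequency = C / rank (C and the list length are made-up values)
toy_freq <- 6000000 / (1:500)

# Rank the words from most to least frequent
sorted_freq <- sort(toy_freq, decreasing = TRUE)
word_rank <- seq_along(sorted_freq)

# On log-log axes a Zipfian distribution looks like a straight line
plot(log10(word_rank), log10(sorted_freq), type = "l",
     main = "Zipf's Law (log-log)",
     xlab = "log10(rank)", ylab = "log10(frequency)")

# The slope of that line estimates the Zipf exponent
zipf_fit <- lm(log10(sorted_freq) ~ log10(word_rank))
zipf_slope <- unname(coef(zipf_fit)[2])
zipf_slope
```

With the real data, `toy_freq` would be replaced by `bnc$frequency`; a fitted slope close to -1 would be consistent with Zipf's Law, while a clearly curved line would suggest a departure from it.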

Interpretation of the Frequency Data

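Using your results from above, answer the following:

- What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc.? In this question, look at the dispersion statistics and explain what they mean to a naive audience.
The dispersion is large for this dataset. Frequencies run from a minimum of 800 to a maximum of about 6.19 million, yet the middle half of the words falls between 1,282 and 6,050. The standard deviation (123,948) is about nine times the mean (13,567), so a small number of extremely frequent words stretches the scale, and the data are not normally distributed.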

- What do the mean, median, and mode tell us about the use of words in the English Language? How frequent are words in general (mean/median), and what is the most common frequency in the data?
The mean (13,567) is far larger than the median (2,335), which means that words are used very unevenly in English. Some words are used so often that they can appear more than once in a single sentence, while most words are rarely seen.

- In looking at the pictures of the frequency data, does word frequency appear to be normally distributed (explain, not just yes/no)?
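No. The data are not normally distributed; they are right-skewed. Most words sit at the low end of the frequency scale, while a few very frequent words form a long tail to the right, which is also why the mean lies far above the median.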

- Does the frequency in the BNC follow Zipf's Law (explain, not just yes/no)?
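Yes. The pattern in the plot is very close to what Zipf's Law predicts: frequency drops steeply over the first few ranks and then flattens into a long tail of less frequent words.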

Categorical Basic Statistics

Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.

A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb

head(bnc)

##   frequency word part_of_speech
## 1   6187267  the            det
## 2   4239632   be              v
## 3   3093444   of           prep
## 4   2687863  and           conj
## 5   2186369    a            det
## 6   1924315   in           prep

summary(bnc$part_of_speech)

##                 a               adv              conj               det 
##              1124               427                34                47 
## infinitive-marker      interjection             modal                 n 
##                 1                13                12              3262 
##              prep              pron                 v 
##                71                46              1281

Categorical Graphical Displays

Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.

pos_table = table(bnc$part_of_speech) #count the words in each part of speech
pos_table_top5 = sort(pos_table, decreasing = TRUE)[1:5] #this gives you top 5
pos_pct = round(prop.table(pos_table_top5) * 100, 1) #share of each within the top 5
pie(pos_table_top5, main = "Top 5 Parts of Speech in the BNC",
    col = c("black", "grey", "yellow", "blue", "green"),
    labels = paste0(names(pos_table_top5), " (", pos_pct, "%)"))

Categorical Dispersion

The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.

freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press")

df1 <- rbind(shaving, descriptor) %>% 
    as.data.frame() 
df1

##            Spoken Fiction Academic Press
## shaving        25     175       40   273
## descriptor      6       7      462    40

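# normalized deviation of proportions (DP): 0 = spread evenly across
# the corpus categories, values near 1 = concentrated in few categories
dev_prop <- function(observed_count, expected_count) {
  # half the summed absolute difference between observed and expected shares
  DP_value <- sum(abs(prop.table(observed_count) - prop.table(expected_count)))/2
  # rescale by the smallest expected share so the maximum is 1
  DP_normal <- DP_value / (1 - min(prop.table(expected_count)))
  return(DP_normal)
}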

dev_prop(df1['shaving', ], freqreg)

## [1] 0.3415698

dev_prop(df1['descriptor', ], freqreg)

## [1] 0.8703311
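Read off the `dev_prop` function above, the calculation can be written out as follows. The symbols `o_i` and `e_i` are notation assumed here for illustration, and the worked numbers for shaving are rounded to five decimals.

```latex
% o_i: observed share of the word in category i (notation assumed here)
% e_i: expected share, i.e. category i's share of the whole corpus
\[
DP = \frac{1}{2}\sum_{i=1}^{4} \lvert o_i - e_i \rvert,
\qquad
DP_{\text{norm}} = \frac{DP}{1 - \min_i e_i}
\]
% Worked for shaving (Spoken, Fiction, Academic, Press):
\[
DP \approx \tfrac{1}{2}\,(0.15683 + 0.14643 + 0.11824 + 0.12863) \approx 0.27507,
\qquad
DP_{\text{norm}} \approx \frac{0.27507}{1 - 0.19470} \approx 0.3416
\]
```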

Interpretation of the Categorical Data

Using your results from above, answer the following:

- What are the most frequent parts of speech by type (i.e., we are only considering each word individually, not the frequency of the words as well)?
The most frequent part of speech is the noun (3,262 words), followed by the verb (1,281) and the adjective (1,124).

- What are the least frequent parts of speech by type?

The infinitive marker is the least frequent part of speech, with only one word.

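- Was shaving or descriptor more "dispersed" throughout the texts? Remember that values close to zero mean that the concept is represented evenly across corpora, while values closer to one are more likely to appear in one corpus over another.
Shaving was more evenly dispersed. Its normalized deviation (0.34) is closer to 0 than that of descriptor (0.87). The value for descriptor is close to 1 because 462 of its 515 occurrences come from the Academic category.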
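To make the two ends of the scale concrete, the sketch below restates the deviation calculation with plain vectors and applies it to two hypothetical words: one spread exactly in proportion to category size, and one found only in Fiction, the smallest category. Both words and their counts are invented for illustration; only `freqreg` and the shaving counts come from the assignment.

```r
# Same normalized deviation of proportions as in the report, restated so
# this sketch runs on its own
dev_prop <- function(observed_count, expected_count) {
  observed_share <- observed_count / sum(observed_count)
  expected_share <- expected_count / sum(expected_count)
  sum(abs(observed_share - expected_share)) / 2 / (1 - min(expected_share))
}

freqreg <- c(Spoken = 95385672, Fiction = 90344134,
             Academic = 91044778, Press = 187245672)

# Hypothetical word spread exactly in proportion to category size
even_word <- freqreg / 1000000
# Hypothetical word found only in Fiction, the smallest category
one_place_word <- c(Spoken = 0, Fiction = 50, Academic = 0, Press = 0)

dp_even <- dev_prop(even_word, freqreg)            # lower end of the scale
dp_one_place <- dev_prop(one_place_word, freqreg)  # upper end of the scale
c(even = dp_even, one_place = dp_one_place)
```

The evenly spread word scores 0 and the single-category word scores 1, so a lower value always means a more even spread.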