Assignment 1: Descriptive Statistics

Import the Data

Import the included dataset BNC_wordfreq.csv. This file contains the compiled word frequencies for the British National Corpus.

library(readr)

## Warning: package 'readr' was built under R version 3.5.3

aks_BNC_wordfreq= read_csv("BNC_wordfreq.csv")

## Parsed with column specification:
## cols(
##   frequency = col_double(),
##   word = col_character(),
##   part_of_speech = col_character()
## )

aks_BNC_wordfreq

## # A tibble: 6,318 x 3
##    frequency word  part_of_speech   
##        <dbl> <chr> <chr>            
##  1   6187267 the   det              
##  2   4239632 be    v                
##  3   3093444 of    prep             
##  4   2687863 and   conj             
##  5   2186369 a     det              
##  6   1924315 in    prep             
##  7   1620850 to    infinitive-marker
##  8   1375636 have  v                
##  9   1090186 it    pron             
## 10   1039323 to    prep             
## # ... with 6,308 more rows

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages ---------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v purrr   0.3.2     v forcats 0.4.0

## Warning: package 'tibble' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'stringr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

## -- Conflicts ------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Continuous Basic Statistics

Calculate the following statistics on the word frequencies from the BNC data:
- Dispersion: min/max, 1st/3rd quartile, standard deviation
- Location: mean, median, mode

#dispersion and location: summary() reports min/max, quartiles, median, and mean
summary(aks_BNC_wordfreq$frequency)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     800    1282    2335   13567    6050 6187267

mean(aks_BNC_wordfreq$frequency)

## [1] 13566.67

median(aks_BNC_wordfreq$frequency)

## [1] 2335

sd(aks_BNC_wordfreq$frequency)

## [1] 123948.4
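
The dispersion statistics requested above can also be computed one at a time. The following is a minimal sketch on a toy vector; `toy_freq` and the other `toy_` names are hypothetical and are not part of the BNC data.

```r
# Toy frequencies (hypothetical values, not the BNC data)
toy_freq <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

toy_range <- range(toy_freq)                        # minimum and maximum
toy_quartiles <- quantile(toy_freq, c(0.25, 0.75))  # 1st and 3rd quartiles
toy_iqr <- IQR(toy_freq)                            # width of the middle 50%
toy_sd <- sd(toy_freq)                              # standard deviation

toy_range
toy_quartiles
toy_iqr
toy_sd
```

Replacing `toy_freq` with `aks_BNC_wordfreq$frequency` should reproduce the quartiles that `summary()` reports.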

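#Location: note that base R mode() returns the storage type of the vector ("numeric"), not the most common frequency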
mode(aks_BNC_wordfreq$frequency)

## [1] "numeric"
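
Base R has no built-in function for the statistical mode; `mode()` reports how a vector is stored. One possible way to get the most common value is to tabulate the data and keep the value or values with the highest count. The helper `stat_mode` and the toy vector below are hypothetical names used only for illustration.

```r
# Hypothetical helper: returns the most common value(s) in a numeric vector
stat_mode <- function(x) {
  counts <- table(x)                                       # occurrences of each distinct value
  top <- names(counts)[as.vector(counts == max(counts))]   # keep all ties
  as.numeric(top)                                          # table() names are character
}

# Toy frequencies (hypothetical values, not the BNC data)
toy_freq <- c(800, 801, 801, 950, 1200, 801, 950)
stat_mode(toy_freq)
```

Applied to `aks_BNC_wordfreq$frequency`, this would return the frequency value shared by the largest number of words.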

Continuous Graphical Displays

Create a histogram, density plot, and qqnorm plot of the frequencies from the British National Corpus.

###The distribution of frequencies is strongly right-skewed and is not normally distributed
#histogram- the majority of the words appear <500k times
ggplot(aks_BNC_wordfreq, aes(x = frequency)) + geom_histogram(alpha = 0.8, bins = 14) + theme_grey() + theme(text = element_text(family = "sans", face = "plain", color = "#000000", size = 15, hjust = 0.5, vjust = 0.5)) + ggtitle("Histogram") + xlab("Frequency") + ylab("Count")

#density- shows a right-side skew
ggplot(aks_BNC_wordfreq, aes(x = frequency)) + geom_density(alpha = 0.5) + theme_grey() + theme(text = element_text(family = "sans", face = "plain", color = "#000000", size = 15, hjust = 0.5, vjust = 0.5)) + ggtitle("Density Plot") + xlab("Frequency") + ylab("density")

#qqnorm- points that fall away from the reference line indicate a departure from normality
ggplot(aks_BNC_wordfreq, aes(sample = frequency)) + stat_qq() + stat_qq_line()
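
When a few values are orders of magnitude larger than the rest, a histogram on the raw scale puts almost every observation in the first bin. A log10 transform is one common way to see more of the shape. The sketch below uses base R and a hypothetical toy vector, not the BNC data.

```r
# Hypothetical right-skewed counts standing in for word frequencies
toy_freq <- c(800, 900, 1000, 1500, 2000, 3000, 6000, 50000, 1000000, 6000000)

# plot = FALSE returns the bin counts without drawing anything
raw_hist <- hist(toy_freq, breaks = 5, plot = FALSE)
log_hist <- hist(log10(toy_freq), breaks = 5, plot = FALSE)

raw_hist$counts  # nearly everything falls in the first bin
log_hist$counts  # counts are spread over more bins
```

The same idea carries over to ggplot2 by adding `scale_x_log10()` to a histogram.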

Zipf’s Law

Create a plot displaying Zipf’s Law on the BNC data.

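#frequencies sorted from most to least frequent, so the x-axis is the rank of each word
plot(sort(aks_BNC_wordfreq$frequency, decreasing = TRUE), type = "b", main = "Zipf's Law Plot", xlab = "Rank", ylab = "Frequency")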
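
Zipf's Law says that frequency is roughly proportional to 1/rank, so on log-log axes the curve should look like a straight line with a slope near -1. The sketch below illustrates this with idealized data; `zipf_rank`, `zipf_freq`, and the constant `1e6` are assumptions made for the example and are not taken from the BNC.

```r
# Idealized Zipfian counts (hypothetical, not the BNC data): frequency = C / rank
zipf_rank <- 1:1000
zipf_freq <- 1e6 / zipf_rank

# Fit a straight line on the log-log scale and pull out the slope
zipf_fit <- lm(log10(zipf_freq) ~ log10(zipf_rank))
zipf_slope <- unname(coef(zipf_fit)[2])
zipf_slope

# On log-log axes an exactly Zipfian curve is a straight line
plot(zipf_rank, zipf_freq, log = "xy", type = "l",
     main = "Idealized Zipf curve", xlab = "Rank", ylab = "Frequency")
```

Running the same fit on the sorted BNC frequencies and their ranks would show how close the real data come to a slope of -1.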

Interpretation of the Frequency Data

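Using your results from above, answer the following:

1. What does the dispersion of the data look like? How much does the data span from minimum to maximum, 1st-3rd quartile, etc. In this question, look at the dispersion statistics and explain what they mean to a naive audience.
Answer 1: The dispersion of the data is large, and the data are right-skewed rather than normally distributed. Frequencies range from a minimum of 800 to a maximum of 6,187,267, while the middle half of the words fall between 1,282 (1st quartile) and 6,050 (3rd quartile). The mean (13,567) is far above the median (2,335), and the standard deviation (123,948) is about nine times the mean, so a few very frequent words dominate.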

2. What do the mean, median, and mode tell us about the use of words in the English Language? How frequent 
are words in general (mean/median), and what is the most common frequency in the data?
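Answer 2: The mean (13,567) is far larger than the median (2,335), which indicates that a small number of words are used much more frequently than the rest; a typical word appears about 2,335 times. The mode() call above returned the storage type ("numeric") rather than a frequency, so the most common frequency cannot be read from that output.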

3. In looking at the pictures of the frequency data, does word frequency appear to be normally distributed 
(explain, not just yes/no)?
Answer 3: No, it does not: the Q-Q plot shows that the data are skewed to the right and are not normally distributed.

4. Does the frequency in the BNC follow Zipf's Law (explain, not just yes/no)?
Answer 4: Yes, because the pattern of the graph is similar to the theoretical distribution seen in Zipf's law: a few words have very high frequencies, and frequency drops off steeply as rank increases.

Categorical Basic Statistics

Included in the BNC frequency data is the part of speech of each word. Create a summary of the types of parts of speech.

A = Adjective, Adv = Adverb, Conj = Conjunction, Det = Determiner, N = Noun, Prep = Preposition, Pron = Pronoun, V = Verb

head(aks_BNC_wordfreq)

## # A tibble: 6 x 3
##   frequency word  part_of_speech
##       <dbl> <chr> <chr>         
## 1   6187267 the   det           
## 2   4239632 be    v             
## 3   3093444 of    prep          
## 4   2687863 and   conj          
## 5   2186369 a     det           
## 6   1924315 in    prep

summary(aks_BNC_wordfreq$part_of_speech)

##    Length     Class      Mode 
##      6318 character character
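
Calling `summary()` on a character column reports only its length and type. To count how many words fall under each part of speech, `table()` is the usual tool. The sketch below uses a hypothetical toy vector, `toy_pos`, in place of the real column.

```r
# Toy part-of-speech tags (hypothetical values, not the BNC data)
toy_pos <- c("n", "v", "n", "a", "n", "v", "prep")

# Count how many times each tag occurs
toy_counts <- table(toy_pos)

# Most frequent tags first
toy_sorted <- sort(toy_counts, decreasing = TRUE)
toy_sorted
```

Converting the column with `as.factor()` before calling `summary()` is another option that gives per-category counts.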

Categorical Graphical Displays

Create a pie chart of the top 5 parts of speech for the BNC data. You can use base plot or ggplot2.

pos_table = table(aks_BNC_wordfreq$part_of_speech)
pos_table_top5 = sort(pos_table, decreasing = TRUE)[1:5] #this gives you top 5
pos_table_bottom = sort(pos_table, decreasing = FALSE) #least frequent parts of speech first
#pie chart: look labels up by tag name so each slice gets the label that matches its tag
pos_labels = c('n' = 'Noun', 'v' = 'Verb', 'a' = "Adjective", 'adv' = 'Adverb', 'prep' = 'Preposition')
pie(pos_table_top5, labels = pos_labels[names(pos_table_top5)])
title("The top 5 parts of speech for the BNC data")

Categorical Dispersion

The nouns shaving and descriptor have very similar frequencies in the Corpus of Contemporary American English (COCA), namely, 513 and 515. Do you think these words are equally spread across writing types? Use the information below (and the function from class) to calculate their deviations.

freqreg = c(95385672, 90344134, 91044778, 187245672) #word count of COCA categories
shaving = c(25, 175, 40, 273) #frequencies for shaving
descriptor = c(6, 7, 462, 40) #frequencies for descriptor
names(freqreg) = names(shaving) = names(descriptor) = c("Spoken", "Fiction", "Academic", "Press") #categories of the corpus

aks_df1 = rbind(shaving, descriptor) %>% 
    as.data.frame() 
aks_df1

##            Spoken Fiction Academic Press
## shaving        25     175       40   273
## descriptor      6       7      462    40

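#normalized deviation of proportions: 0 = spread in line with category sizes, 1 = concentrated in one category
aks_dev = function(observed_count, expected_count) {
  #half the summed absolute difference between observed and expected proportions
  aks_DPvalue= sum(abs(prop.table(observed_count) - prop.table(expected_count)))/2
  #rescale so that the maximum possible value is 1
  aks_DPnormal = aks_DPvalue / (1 - min(prop.table(expected_count)))
  return(aks_DPnormal)
}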
aks_dev(aks_df1['shaving', ], freqreg)

## [1] 0.3415698

aks_dev(aks_df1['descriptor', ], freqreg)

## [1] 0.8703311
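
The calculation inside `aks_dev` can be traced step by step on plain named vectors, without the data frame. The sketch below reuses the counts given above; the function name `dp_norm` and the intermediate variable names are hypothetical and chosen only for this illustration.

```r
# Counts copied from the assignment
freqreg <- c(Spoken = 95385672, Fiction = 90344134, Academic = 91044778, Press = 187245672)
shaving <- c(Spoken = 25, Fiction = 175, Academic = 40, Press = 273)
descriptor <- c(Spoken = 6, Fiction = 7, Academic = 462, Press = 40)

# Hypothetical re-statement of the deviation function, one step per line
dp_norm <- function(observed, expected) {
  obs_prop <- observed / sum(observed)         # share of the word's uses in each category
  exp_prop <- expected / sum(expected)         # share of the corpus in each category
  dp <- sum(abs(obs_prop - exp_prop)) / 2      # deviation of proportions
  dp / (1 - min(exp_prop))                     # normalize so the maximum is 1
}

dp_shaving <- dp_norm(shaving, freqreg)
dp_descriptor <- dp_norm(descriptor, freqreg)
round(c(shaving = dp_shaving, descriptor = dp_descriptor), 4)

# Per-category gap between observed and expected share for descriptor
round(descriptor / sum(descriptor) - freqreg / sum(freqreg), 3)
```

The two values should match the output of `aks_dev` above (about 0.3416 and 0.8703), and the per-category gaps show which writing type drives the deviation for descriptor.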

Interpretation of the Categorical Data

Using your results from above, answer the following:

1. What are the most frequent parts of speech by type (i.e., we are only considering each word individually, not the frequency of the words as well)?
Answer 1: Noun is the most frequent part of speech.

2. What are the least frequent parts of speech by type?
Answer 2: infinitive-marker (1), modal (12), interjection (13), conj (34), pron (46)

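3. Was shaving or descriptor more "dispersed" throughout the texts? Remember that values close to zero mean that the concept is represented evenly across corpora, while values closer to one mean that it is more likely to appear in one corpus than another.
Answer 3: Shaving is more evenly dispersed throughout the texts: its normalized deviation (0.34) is much closer to 0 than that of descriptor (0.87). Descriptor is concentrated in one category, with 462 of its 515 occurrences in Academic writing. These values are deviations of proportions, not standard deviations.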