This is a coding exercise I was asked to do as part of an interview. The aim is to plot the distribution of word lengths in a paragraph given on the handout. Here I have recreated the process to identify issues that may arise in practice.
library(readtext)
library(quanteda)
library(stringi)
library(dplyr)
library(ggplot2)
data <- readtext("data.txt")
mycorpus <- corpus(data)
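# Tokenise first, then build the document-feature matrix. Recent quanteda releases
# expect these options on tokens() rather than dfm(), and remove_hyphens has been
# renamed split_hyphens.
toks <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, split_hyphens = TRUE)
my_dfm <- dfm(toks, tolower = TRUE)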
freq.table <- textstat_frequency(my_dfm)
head(freq.table)
## feature frequency rank docfreq group
## 1 the 17 1 1 all
## 2 and 10 2 1 all
## 3 in 9 3 1 all
## 4 is 4 4 1 all
## 5 company 4 5 1 all
## 6 for 4 6 1 all
freq.table$wordlength <- stri_length(freq.table$feature)
head(freq.table)
## feature frequency rank docfreq group wordlength
## 1 the 17 1 1 all 3
## 2 and 10 2 1 all 3
## 3 in 9 3 1 all 2
## 4 is 4 4 1 all 2
## 5 company 4 5 1 all 7
## 6 for 4 6 1 all 3
dat.for.graph <- freq.table %>% group_by(wordlength) %>% summarise(frequency=sum(frequency))
head(dat.for.graph)
## # A tibble: 6 x 2
## wordlength frequency
## <int> <dbl>
## 1 1 5
## 2 2 32
## 3 3 44
## 4 4 18
## 5 5 23
## 6 6 20
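To make the aggregation step concrete, here is a small sketch that repeats it on the six most frequent features shown in the earlier output. The `toy` data frame and the `toy_summary` name are made up for illustration; only the words and their counts come from the output above.

```r
library(dplyr)

# Toy table built by hand from the head(freq.table) output (illustration only).
toy <- data.frame(
  feature = c("the", "and", "in", "is", "company", "for"),
  frequency = c(17, 10, 9, 4, 4, 4),
  stringsAsFactors = FALSE
)
toy$wordlength <- nchar(toy$feature)

# Sum the counts of all words that share a length.
toy_summary <- toy %>%
  group_by(wordlength) %>%
  summarise(frequency = sum(frequency))
toy_summary
```

For these six words, lengths 2, 3 and 7 receive totals of 13, 31 and 4, which is the same per-length summing applied to the full table.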
# fill is a fixed colour, so it is set outside aes(); geom_col() plots the pre-computed counts.
word_length_plot <- ggplot(dat.for.graph, aes(wordlength, frequency)) +
  geom_col(fill = "red") +
  ggtitle("Frequency distribution of word lengths") +
  labs(x = "Word Length", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = seq(0, max(dat.for.graph$wordlength), by = 1))
word_length_plot
Abbreviations such as the middle initial L and Ltd are counted as words.
The s after an apostrophe (e.g. abc’s) is counted as a separate word.
The current analysis appears to remove words ending in digits (e.g. abc3). This requires further investigation.
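One way to start that investigation might be to tokenise a short test string with and without `remove_numbers` and compare the results. This is only a sketch: `sample_text` is an invented string, not the handout paragraph, and what happens to a mixed token such as abc3 may depend on the installed quanteda version, so no expected result is stated for it.

```r
library(quanteda)

# Invented test string (assumption: not taken from data.txt).
sample_text <- "abc3 costs 42 pounds in 2019"

# Tokenise twice, changing only the remove_numbers option.
with_numbers <- as.character(tokens(sample_text, remove_numbers = FALSE))
without_numbers <- as.character(tokens(sample_text, remove_numbers = TRUE))

# Tokens that disappear when remove_numbers = TRUE.
dropped <- setdiff(with_numbers, without_numbers)
dropped
```

If abc3 appears in `dropped`, the option is responsible for the missing words; if it does not, the cause lies elsewhere in the pipeline.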
A possible fix is to use a regex to remove ’s and the initial L.
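A minimal sketch of that clean-up, using stringi since it is already loaded. The sample sentence and the two patterns are my own assumptions rather than a tested solution for the handout text; in particular, the second pattern removes any single capital letter followed by a full stop, so it would also strip abbreviations such as U.S.

```r
library(stringi)

# Invented sample sentence (assumption: the real input is data.txt).
raw_text <- "John L. Smith\u2019s company, ABC Ltd, reported abc's results."

# Drop a possessive 's written with a straight or curly apostrophe.
no_possessive <- stri_replace_all_regex(raw_text, "['\u2019]s\\b", "")

# Drop a single capital letter followed by a full stop (e.g. a middle initial).
cleaned <- stri_replace_all_regex(no_possessive, "\\b[A-Z]\\.\\s*", "")
cleaned
```

Running the clean-up on the raw text before building the corpus would keep the stray s and the initial out of the frequency table; Ltd is left in place here because the note above only targets ’s and L.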