This is a coding exercise I was asked to do as part of an interview. The aim is to plot the distribution of word lengths in a paragraph given on the handout. Here I’ve recreated the process to identify issues that may arise in practice.

Packages required:

library(readtext)
library(quanteda)
library(quanteda.textstats)  # provides textstat_frequency() in quanteda >= 3.0
library(stringi)
library(dplyr)
library(ggplot2)

Load the data

data <- readtext("data.txt")

Build a corpus

mycorpus <- corpus(data)

Build a document-feature matrix, removing punctuation marks, numbers and hyphens, and converting to lowercase

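my_toks <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE, split_hyphens = TRUE)  # quanteda >= 2.0: these options belong to tokens(); remove_hyphens is now split_hyphens
my_dfm <- dfm(my_toks, tolower = TRUE)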

Create feature frequency table

freq.table <- textstat_frequency(my_dfm)
head(freq.table)
##   feature frequency rank docfreq group
## 1     the        17    1       1   all
## 2     and        10    2       1   all
## 3      in         9    3       1   all
## 4      is         4    4       1   all
## 5 company         4    5       1   all
## 6     for         4    6       1   all

Add a new column called wordlength to the feature frequency table

freq.table$wordlength <- stri_length(freq.table$feature)
head(freq.table)
##   feature frequency rank docfreq group wordlength
## 1     the        17    1       1   all          3
## 2     and        10    2       1   all          3
## 3      in         9    3       1   all          2
## 4      is         4    4       1   all          2
## 5 company         4    5       1   all          7
## 6     for         4    6       1   all          3

Aggregate ‘frequency’ by wordlength

dat.for.graph <- freq.table %>% group_by(wordlength) %>% summarise(frequency=sum(frequency))
head(dat.for.graph)
## # A tibble: 6 x 2
##   wordlength frequency
##        <int>     <dbl>
## 1          1         5
## 2          2        32
## 3          3        44
## 4          4        18
## 5          5        23
## 6          6        20
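
The counts are summed rather than the rows counted because each row of the table is one distinct word: counting rows would give the number of distinct words of each length, not the number of word occurrences. A minimal base R sketch of the same aggregation, using only the first four rows printed above (the toy data frame is just for illustration):

```r
# Toy stand-in for freq.table: the first four rows shown in the output above.
toy <- data.frame(
  feature   = c("the", "and", "in", "is"),
  frequency = c(17, 10, 9, 4),
  stringsAsFactors = FALSE
)
toy$wordlength <- nchar(toy$feature)

# Sum occurrences within each word length; base R equivalent of
# group_by(wordlength) %>% summarise(frequency = sum(frequency)).
by_length <- aggregate(frequency ~ wordlength, data = toy, FUN = sum)
print(by_length)
```

Here the two 2-letter words contribute 9 + 4 = 13 occurrences and the two 3-letter words 17 + 10 = 27.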

Plot graph

plot <- ggplot(dat.for.graph, aes(wordlength, frequency)) + 
        geom_col(fill = "red") +  # a fixed colour is set outside aes(), so no legend is created
        ggtitle("Frequency distribution of word lengths") + 
        labs(x = "Word Length", y = "Frequency") +
        theme(plot.title = element_text(hjust = 0.5)) + 
        scale_x_continuous(breaks = seq(0, max(dat.for.graph$wordlength), by = 1)) 

plot

Issues I could have raised and attempted to address:

  1. Abbreviations, such as the middle initial L and Ltd, are counted as words.

  2. The s after an apostrophe (e.g. abc’s) is counted as a word.

  3. The current analysis seems to remove words ending in digits (e.g. abc3). This requires further investigation.
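
One way to investigate the third issue is to list the tokens in the raw text that mix letters and digits, then check which of them are missing from the feature table. The sketch below uses base R only; raw_text and features are hypothetical stand-ins for the handout paragraph and freq.table$feature, not the real data.

```r
# Hypothetical inputs for illustration: raw_text stands in for the paragraph,
# features for freq.table$feature.
raw_text <- "Call abc3 on 12 May, then rerun abc3."
features <- c("call", "on", "may", "then", "rerun")

# Split on anything that is not a letter or digit, lowercase, keep unique tokens.
raw_tokens <- unique(tolower(unlist(strsplit(raw_text, "[^[:alnum:]]+"))))
raw_tokens <- raw_tokens[nzchar(raw_tokens)]

# Tokens that contain both letters and digits, e.g. "abc3" but not "12".
mixed <- raw_tokens[grepl("[[:alpha:]]", raw_tokens) & grepl("[[:digit:]]", raw_tokens)]

# Mixed tokens that never made it into the feature list.
dropped <- setdiff(mixed, features)
print(dropped)
```

If dropped is non-empty on the real data, the number-removal option is the first thing to look at.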

Solution

Use a regular expression to remove ’s and the initial L.
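
A minimal sketch of that clean-up in base R, assuming it is applied to the raw text before the corpus is built. The sample sentence is made up, and the second pattern removes any single capital letter followed by a full stop, which may be too broad for other texts.

```r
# Hypothetical sample sentence; the real paragraph is on the handout.
txt <- "John L. Smith joined the company's board at Smith Ltd."

# Drop a possessive s after a straight or curly apostrophe.
txt <- gsub("(?:'|\u2019)s\\b", "", txt, perl = TRUE)

# Drop single-capital-letter initials followed by a full stop, e.g. "L.".
txt <- gsub("\\b[A-Z]\\.\\s*", "", txt, perl = TRUE)
print(txt)
```

With readtext, the same two gsub() calls could be applied to the text column of data before calling corpus(). Ltd is left in place here; whether it should count as a word is a separate decision.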