Exploratory Analysis of Text Data for Predictive Text

Author

Liu Yi

Introduction

This document presents an exploratory analysis of a large text dataset consisting of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The goal is to understand the characteristics of the data and lay the foundation for developing a predictive text algorithm and a Shiny app.

Loading the Data

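Because en_US.twitter.txt is large, the code below reads it in chunks of 10,000 lines and appends each chunk to a single character vector.

# Open the Twitter file for reading (the path is specific to the author's machine)
con <- file("D:/Datascience/Projects/Finak JHU Data Science/en_US/en_US.twitter.txt", "r")
chunk_size <- 10000  # number of lines to read per call
lines <- character()
# Keep reading until readLines() returns no more lines
while (length(new_lines <- readLines(con, n = chunk_size)) > 0) {
  lines <- c(lines, new_lines)
}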

Warning in readLines(con, n = chunk_size): line 7155 appears to contain an
embedded nul

Warning in readLines(con, n = chunk_size): line 8547 appears to contain an
embedded nul

Warning in readLines(con, n = chunk_size): line 4086 appears to contain an
embedded nul

Warning in readLines(con, n = chunk_size): line 9032 appears to contain an
embedded nul

close(con)
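
The warnings above mean that some lines contain embedded NUL bytes, which readLines() reports by default. The following is a minimal sketch of one way to handle them, assuming it is acceptable to drop those bytes: readLines() accepts skipNul = TRUE, and the connection is opened in binary mode ("rb") to avoid platform-specific text-mode translation. The helper name read_lines_chunked is hypothetical and not part of the original analysis; counts produced after reading this way may differ slightly from the figures reported in this document.

```r
# Hypothetical helper (an assumption, not the author's code): read a text file
# in chunks, silently dropping embedded NUL bytes.
read_lines_chunked <- function(path, chunk_size = 10000) {
  con <- file(path, open = "rb")   # binary mode: no text-mode translation
  on.exit(close(con))              # close the connection even if an error occurs
  chunks <- list()
  repeat {
    new_lines <- readLines(con, n = chunk_size, skipNul = TRUE, encoding = "UTF-8")
    if (length(new_lines) == 0) break
    chunks[[length(chunks) + 1]] <- new_lines
  }
  # as.character() turns the NULL from an empty list into character(0)
  as.character(unlist(chunks, use.names = FALSE))
}

# Small demonstration on a temporary file containing one NUL byte
demo_path <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("c\nxyz\n")), demo_path)
demo_lines <- read_lines_chunked(demo_path, chunk_size = 1)
print(demo_lines)
```

Collecting the chunks in a list and combining them once at the end also avoids repeatedly copying a growing vector.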

Basic Summaries

Word Counts

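The code below joins all Twitter lines into one string, splits it on whitespace, and counts the resulting tokens. No lowercasing or punctuation removal is applied, so the total is a count of raw whitespace-separated tokens.

# Split the full text on runs of whitespace to get one token per element
words <- unlist(strsplit(paste(lines, collapse = " "), "\\s+"))
word_counts <- table(words)       # frequency of each distinct token
total_words <- sum(word_counts)   # total number of tokens
print(paste("Total number of words in en_US.twitter.txt:", total_words))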

[1] "Total number of words in en_US.twitter.txt: 30373541"
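
The total above treats every whitespace-separated token as a word. A stricter count can be sketched as follows, assuming that lowercasing and stripping everything except letters and apostrophes is wanted; the original analysis does not do this, and the helper name count_clean_words is hypothetical.

```r
# Hypothetical helper (an assumption for illustration): count words after
# lowercasing and removing everything except a-z and apostrophes.
count_clean_words <- function(lines) {
  text <- tolower(lines)
  text <- gsub("[^a-z']+", " ", text)     # replace other characters with a space
  tokens <- unlist(strsplit(text, " +"))  # split on runs of spaces
  tokens <- tokens[nzchar(tokens)]        # drop empty tokens
  table(tokens)
}

# Toy input standing in for the Twitter lines
clean_counts <- count_clean_words(c("The cat, the DOG!", "A cat."))
print(clean_counts)
print(sum(clean_counts))
```

This simple pattern discards digits, emoji, and non-English letters, so it suits only a rough English-only count.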

Line Counts

line_count <- length(lines)
print(paste("Number of lines in en_US.twitter.txt:", line_count))

[1] "Number of lines in en_US.twitter.txt: 2360148"

Basic Data Table

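The table below shows the first ten entries of word_counts. Because table() orders its entries by token rather than by count, these are the first ten tokens in sort order (mostly invisible characters and emoji), not the ten most frequent words.

# First ten entries of the frequency table (ordered by token, not by count)
first_words <- head(word_counts, 10)
top_words <- data.frame(Word = names(first_words), Frequency = as.vector(first_words))
print(top_words)

           Word Frequency
1             ‏        16
2            ⚽        11
3             �        61
4             ‎         9
5                       3
6            ⚾        26
7          ⚾⚾         3
8  ⚾⚾⚾⚾⚾⚾         1
9        ⚾⚾⚾         3
10   ⚾⚾⚾⚾⚾         2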
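
Since table() orders its result by token, a list of the most frequent words requires sorting by count first. A minimal sketch, using a hypothetical helper top_n_words and a toy vector in place of the full words vector:

```r
# Hypothetical helper (not from the original report): return the n most
# frequent tokens as a two-column data frame.
top_n_words <- function(words, n = 10) {
  words <- words[nzchar(words)]                    # drop empty tokens
  counts <- sort(table(words), decreasing = TRUE)  # most frequent first
  top <- head(counts, n)
  data.frame(Word = names(top),
             Frequency = as.vector(top),
             stringsAsFactors = FALSE)
}

# Toy input standing in for the full `words` vector
toy_words <- c("the", "cat", "the", "dog", "the", "cat")
toy_top <- top_n_words(toy_words, n = 2)
print(toy_top)
```

Passing the names and counts as separate vectors keeps the result to exactly two columns, Word and Frequency.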

Plots

A histogram can be used to visualize the frequency distribution of word lengths.

library(ggplot2)

# Number of characters in each token
word_lengths <- nchar(words)
# One bar per word length
ggplot(data.frame(Length = word_lengths), aes(x = Length)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of Word Lengths in en_US.twitter.txt",
       x = "Word Length", y = "Frequency")
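
Very long tokens such as URLs or runs of emoji can stretch the horizontal axis of a histogram like this one. The sketch below shows one way to tabulate the same distribution numerically while pooling long tokens into a final bin. The helper word_length_summary and its max_length cutoff are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: frequency of each word length, with lengths above
# max_length pooled into the max_length bin.
word_length_summary <- function(words, max_length = 20) {
  words <- words[nzchar(words)]                            # drop empty tokens
  len <- nchar(words, type = "chars", allowNA = TRUE)      # NA for invalid strings
  capped <- pmin(len, max_length)                          # pool long tokens
  as.data.frame(table(Length = capped), responseName = "Frequency")
}

# Toy input standing in for the full `words` vector
toy_lengths <- word_length_summary(c("a", "bb", "bb", "cccccc"), max_length = 3)
print(toy_lengths)
```

The resulting data frame could also be passed to ggplot() with geom_col() to draw the capped distribution.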

Conclusion

This exploratory analysis provides basic summaries and visualizations of the en_US.twitter.txt file. Similar analyses can be performed on the other two files to gain a comprehensive understanding of the entire dataset. The findings from this analysis will inform the development of the predictive text algorithm and Shiny app.
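
As a starting point for repeating these summaries on the other two files, the sketch below computes line and word counts for one file. The helper summarise_text_file is hypothetical, and the file paths in the commented usage are placeholders. Words are counted per line after trimming leading and trailing whitespace, so totals may differ slightly from the count reported above.

```r
# Hypothetical helper (an assumption, not the author's code): line and word
# counts for a single text file.
summarise_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
  tokens <- strsplit(trimws(lines), "\\s+")   # whitespace-separated tokens per line
  data.frame(File = basename(path),
             Lines = length(lines),
             Words = sum(lengths(tokens)),
             stringsAsFactors = FALSE)
}

# Demonstration on a small temporary file
demo_file <- tempfile(fileext = ".txt")
writeLines(c("hello world", "foo bar baz"), demo_file)
demo_summary <- summarise_text_file(demo_file)
print(demo_summary)

# Possible usage on the three corpus files (placeholder paths):
# paths <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
# do.call(rbind, lapply(paths, summarise_text_file))
```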