Exploratory Analysis of Text Data for Predictive Text
Author
Liu Yi
Introduction
This document presents an exploratory analysis of a large text dataset consisting of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The goal is to understand the characteristics of the data and lay the foundation for developing a predictive text algorithm and a Shiny app.
Loading the Data
Because en_US.twitter.txt is large, the following code reads it in chunks of 10,000 lines and appends each chunk to a single character vector.
con <- file("D:/Datascience/Projects/Finak JHU Data Science/en_US/en_US.twitter.txt", "r")
chunk_size <- 10000
lines <- character()
# Read chunk_size lines at a time until the connection is exhausted
while (length(new_lines <- readLines(con, n = chunk_size)) > 0) {
  lines <- c(lines, new_lines)
}
Warning in readLines(con, n = chunk_size): line 7155 appears to contain an
embedded nul
Warning in readLines(con, n = chunk_size): line 8547 appears to contain an
embedded nul
Warning in readLines(con, n = chunk_size): line 4086 appears to contain an
embedded nul
Warning in readLines(con, n = chunk_size): line 9032 appears to contain an
embedded nul
close(con)
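The "embedded nul" warnings above come from nul bytes in the raw file. One possible way to avoid them, sketched below, is to pass skipNul = TRUE to readLines() and to collect chunks in a list rather than growing a vector on every iteration. The helper name read_in_chunks and the small temporary demo file are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: read a text file in chunks, skipping embedded nuls.
read_in_chunks <- function(path, chunk_size = 10000) {
  con <- file(path, "r")
  on.exit(close(con))  # close the connection even if an error occurs
  chunks <- list()
  repeat {
    new_lines <- readLines(con, n = chunk_size, skipNul = TRUE, warn = FALSE)
    if (length(new_lines) == 0) break
    chunks[[length(chunks) + 1]] <- new_lines
  }
  if (length(chunks) == 0) return(character())
  unlist(chunks, use.names = FALSE)
}

# Demo on a small temporary file standing in for en_US.twitter.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), tmp)
demo_lines <- read_in_chunks(tmp, chunk_size = 2)
print(length(demo_lines))
```

Collecting chunks in a list and calling unlist() once avoids copying the growing vector on every pass through the loop, which may matter for a file with millions of lines.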
Basic Summaries
Word Counts
To get a sense of the word counts in the Twitter data, we split the text on whitespace and tabulate the resulting tokens:
library(stringr)
# Join all lines, then split on runs of whitespace to get individual tokens
words <- unlist(strsplit(paste(lines, collapse = " "), "\\s+"))
word_counts <- table(words)
total_words <- sum(word_counts)
print(paste("Total number of words in en_US.twitter.txt:", total_words))
[1] "Total number of words in en_US.twitter.txt: 30373541"
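Pasting every line into one string before splitting creates a very large intermediate value. A lighter alternative, sketched here on a toy vector, is to split each line separately and sum the per-line token counts. The names demo_lines and words_per_line are illustrative assumptions, and the total may differ marginally from the figure above if the first line of the real file starts with whitespace.

```r
# Toy input standing in for the lines read from en_US.twitter.txt
demo_lines <- c("hello world", "  leading space here", "", "one")

# trimws() removes leading whitespace so strsplit() does not emit an
# empty first token; lengths() then counts the tokens on each line.
words_per_line <- lengths(strsplit(trimws(demo_lines), "\\s+"))
demo_total <- sum(words_per_line)

print(words_per_line)
print(demo_total)
```

The per-line counts are also useful on their own, for example to look at the distribution of tweet lengths in words.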
Line Counts
line_count <- length(lines)
print(paste("Number of lines in en_US.twitter.txt:", line_count))
[1] "Number of lines in en_US.twitter.txt: 2360148"
Basic Data Table
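# table() orders entries alphabetically, so sort by frequency first;
# otherwise head() returns the first ten tokens, not the ten most common.
top_counts <- head(sort(word_counts, decreasing = TRUE), 10)
top_words <- data.frame(Word = names(top_counts), Frequency = as.integer(top_counts))
print(top_words)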
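The table above counts raw whitespace-separated tokens, so "The", "the" and "the," are tallied as different words. A minimal sketch of one possible normalization step, lowercasing and stripping punctuation before counting, is shown below on a toy vector. The variable names are illustrative assumptions, and removing all punctuation is a simplification: it also alters contractions, hashtags and mentions, which may or may not be desirable for a predictive text model.

```r
# Toy tokens standing in for the `words` vector
demo_words <- c("The", "the", "cat,", "cat", "sat.", "the")

# Lowercase, strip punctuation, and drop tokens that become empty
clean <- tolower(gsub("[[:punct:]]+", "", demo_words))
clean <- clean[nzchar(clean)]

# Tabulate and sort so the most frequent words come first
demo_counts <- sort(table(clean), decreasing = TRUE)
demo_top <- data.frame(Word = names(demo_counts),
                       Frequency = as.integer(demo_counts))
print(demo_top)
```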
A histogram can be used to visualize the frequency distribution of word lengths.
word_lengths <- nchar(words)
library(ggplot2)
ggplot(data.frame(Length = word_lengths), aes(x = Length)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of Word Lengths in en_US.twitter.txt",
       x = "Word Length", y = "Frequency")
Conclusion
This exploratory analysis provides basic summaries and visualizations of the en_US.twitter.txt file. Similar analyses can be performed on the other two files to gain a comprehensive understanding of the entire dataset. The findings from this analysis will inform the development of the predictive text algorithm and Shiny app.
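As a rough sketch of how the same summaries might be extended to the other files, the line and word counts can be wrapped in a function and applied to each path. The helper summarize_file and the temporary stand-in files are assumptions for illustration; for the real corpus the paths would point at en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, and chunked reading could replace the single readLines() call.

```r
# Hypothetical helper: line and word counts for one text file
summarize_file <- function(path) {
  txt <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(
    File  = basename(path),
    Lines = length(txt),
    Words = sum(lengths(strsplit(trimws(txt), "\\s+")))
  )
}

# Small stand-in files so the example runs without the real corpus
demo_dir <- tempfile("corpus_")
dir.create(demo_dir)
writeLines(c("a short blog post", "another post"),
           file.path(demo_dir, "blogs.txt"))
writeLines("one news line here", file.path(demo_dir, "news.txt"))

# Summarize each file and stack the results into one table
paths <- file.path(demo_dir, c("blogs.txt", "news.txt"))
corpus_summary <- do.call(rbind, lapply(paths, summarize_file))
print(corpus_summary)
```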