knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Introduction
The goal of this project is to build a predictive text algorithm using the English SwiftKey dataset, which contains samples from blogs, news, and Twitter.
This report summarizes the exploratory analysis of the data and outlines plans for building a next-word prediction model and a Shiny app for end users.
# Set working directory
setwd("C:/Users/salya/Desktop/Coursera-SwiftKey/final/en_US")
# Load datasets (skipNul = TRUE skips embedded NUL characters, which occur in the Twitter file)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
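# Function to estimate words per line: counts runs of non-word characters and adds 1.
# Caveat: gregexpr() returns a single -1 when nothing matches, so empty and
# one-word lines are both counted as 2, and trailing punctuation adds 1 to the count.
word_count <- function(x) sapply(gregexpr("\\W+", x), length) + 1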
# Apply function
blogs_words <- word_count(blogs)
news_words <- word_count(news)
twitter_words <- word_count(twitter)
# Summary table
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  AvgWordsPerLine = c(mean(blogs_words), mean(news_words), mean(twitter_words)),
  MaxWordsPerLine = c(max(blogs_words), max(news_words), max(twitter_words)),
  MinWordsPerLine = c(min(blogs_words), min(news_words), min(twitter_words))
)
summary_table
##      File   Lines AvgWordsPerLine MaxWordsPerLine MinWordsPerLine
## 1   Blogs  899288        43.50290            6852               2
## 2    News 1010206        36.34854            1929               2
## 3 Twitter 2360148        13.89476              47               2
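The minimum of 2 in every row of the table above looks like an artifact of the counting function rather than a property of the data: `gregexpr()` returns a single -1 when there is no match, so a one-word line is counted as 1 + 1. A minimal alternative is sketched below. The name `count_words` and the definition of a word as a run of letters, digits, or apostrophes are assumptions for illustration, not part of the original analysis, and the table above was not produced with it.

```r
# Hypothetical alternative to word_count(): count the tokens themselves
# instead of the separators between them.
# Assumption: a "word" is a run of letters, digits, or apostrophes.
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("[[:alnum:]']+", x)))
}

count_words(c("", "hello", "it's a test", "  two  words  "))
# returns 0 1 3 2
```

Counting matches rather than separators means empty lines get 0 and leading or trailing punctuation does not inflate the result.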
library(ggplot2)
df_plot <- data.frame(
  Words = c(blogs_words, news_words, twitter_words),
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs_words), length(news_words), length(twitter_words)))
)
ggplot(df_plot, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.6, position = "identity") +
  xlim(0, 100) +
  labs(title = "Distribution of Words per Line", x = "Words per line", y = "Count") +
  theme_minimal()
library(dplyr)
library(tidytext)
blogs_df <- data.frame(text = blogs, stringsAsFactors = FALSE)
# Tokenize into lowercase words and count occurrences
blogs_tokens <- blogs_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
# Keep the 10 most frequent words (slice_max() supersedes top_n() in dplyr >= 1.0.0)
top10 <- blogs_tokens %>% slice_max(n, n = 10)
ggplot(top10, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Words in Blogs", x = "Word", y = "Frequency") +
  theme_minimal()
Observations
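Twitter posts are very short: about 14 words per line on average and at most 47, by the approximate counts above.
Blogs have the longest lines: about 44 words on average and up to 6,852 words, with some lines exceeding 40,000 characters.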
Common words like “the”, “and”, “you” dominate, suggesting stopwords will need to be handled.
Some profanity and hashtags appear in Twitter, which may need filtering.
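One way the hashtag and profanity filtering could look is sketched below in base R. The function name `clean_text` and the `banned_words` vector are hypothetical; the two entries are harmless placeholders, and a real filter would load a curated word list. Whether to drop only the offending word (as here) or the whole line is a design choice this report has not yet made.

```r
# Hypothetical placeholder list; a real filter would load a curated word list.
# Assumption: entries are plain words with no regex metacharacters.
banned_words <- c("darn", "heck")

clean_text <- function(x, banned = banned_words) {
  x <- tolower(x)
  # Remove hashtags such as "#mondaymotivation"
  x <- gsub("#\\w+", " ", x, perl = TRUE)
  # Remove banned words, matching whole words only
  if (length(banned) > 0) {
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  # Collapse repeated whitespace and trim the ends
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_text("What the heck  #MondayMotivation is this")
# returns "what the is this"
```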
Plans for Prediction Algorithm
Build a next-word prediction model using n-grams (uni-, bi-, tri-grams).
For unseen word sequences, use backoff strategies and smoothing to assign small probabilities.
Optimize the model for speed and memory, suitable for deployment in a Shiny app.
The final app will allow users to type text and see predicted next words in real time.
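The plan above can be illustrated with a toy sketch in base R. It is not the final model: the four-line `corpus`, the helper names `make_ngrams` and `predict_next`, and the choice to return only the single most frequent continuation are all assumptions for illustration. It implements plain backoff (try the trigram table, then bigrams, then unigrams) and does not implement smoothing, which would be needed to assign probabilities to unseen sequences.

```r
# Toy corpus standing in for a cleaned sample of the data (an assumption).
corpus <- c("i love new york", "i love new shoes", "i love you", "new york is big")

# Build all n-grams of order n from a character vector of lines.
make_ngrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# One frequency table per order (1 = unigrams, 2 = bigrams, 3 = trigrams),
# each sorted by descending count.
freq <- lapply(1:3, function(n) sort(table(make_ngrams(corpus, n)), decreasing = TRUE))

# Predict the next word: use the longest available context first,
# then back off to shorter contexts, and finally to the top unigram.
predict_next <- function(input, freq) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  max_context <- min(length(freq) - 1, length(words))
  for (k in rev(seq_len(max_context))) {
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    tab <- freq[[k + 1]]
    hits <- tab[startsWith(names(tab), prefix)]
    if (length(hits) > 0) {
      # Tables are sorted, so the first hit is the most frequent continuation.
      return(sub(".* ", "", names(hits)[1]))
    }
  }
  names(freq[[1]])[1]
}

predict_next("i love", freq)     # returns "new" (found in the trigram table)
predict_next("brand new", freq)  # returns "york" (backs off to bigrams)
```

Storing each order as a sorted table keeps lookup simple for a sketch; for the Shiny app, a keyed structure and pruning of rare n-grams would likely be needed to meet the speed and memory goals.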