Milestone Report: Data Science Capstone

The goal of the project

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm.

This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Download and read the data

Data Source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
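The download step itself is not shown in this report. A minimal sketch of it could look like the following, assuming the archive unpacks into a top-level `final/` folder (as the paths used below suggest). The helper name `fetch_corpus` and its arguments are hypothetical.

```r
# Hypothetical helper: download the archive once and unzip it once.
# Both steps are skipped when their result is already on disk.
fetch_corpus <- function(url, zip_path = "Coursera-SwiftKey.zip", exdir = ".") {
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  if (!dir.exists(file.path(exdir, "final"))) {
    unzip(zip_path, exdir = exdir)
  }
  # Return the folder that holds the English text files
  invisible(file.path(exdir, "final", "en_US"))
}

# Usage (a large download, so it is left commented out here):
# fetch_corpus("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
```

Guarding each step with an existence check means the report can be re-knitted without downloading the archive again.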

library(stringr)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding= "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding= "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding= "UTF-8", skipNul = TRUE)

Statistics summary

First, I compute size and word-count statistics for the three text files (blogs, news, and Twitter): the number of lines, the average line length in characters, the total word count, and the average word count per line. I then combine these statistics with the file sizes in megabytes into a summary table.

library(stringr)

all_paths <- c(blogs = "final/en_US/en_US.blogs.txt", twitter = "final/en_US/en_US.twitter.txt", news = "final/en_US/en_US.news.txt")
# File sizes in megabytes (bytes / 1024^2), in the same order as `files` below
dimensions <- file.info(all_paths)$size / 1024^2

files <- list(blogs = blogs, twitter = twitter, news = news)

# Summarize one dataset; `lines` is a character vector with one element per line
analysis <- function(lines, name) {
  # Count whitespace-delimited tokens once and reuse the result
  words_per_line <- str_count(lines, "\\S+")

  data.frame(
    dataset = name,
    nlines = length(lines),
    avg_line_length = mean(nchar(lines)),
    total_word_count = sum(words_per_line),
    avg_word_count_per_line = mean(words_per_line)
  )
}

# Build one row per dataset and stack the rows into a single table
table_stats <- do.call(rbind, lapply(names(files), function(name) {
  analysis(files[[name]], name)
}))
table_stats$megabytes <- dimensions
print(table_stats)

##   dataset  nlines avg_line_length total_word_count avg_word_count_per_line
## 1   blogs  899288       229.98695         37334131                41.51521
## 2 twitter 2360148        68.68054         30373583                12.86936
## 3    news 1010242       201.16285         34372530                34.02406
##   megabytes
## 1  200.4242
## 2  159.3641
## 3  196.2775
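The word counts in this table treat every run of non-whitespace characters (the pattern `\\S+`) as one word. As a rough cross-check that needs no extra packages, the same definition can be sketched in base R. The function name `count_words` is hypothetical and is not part of the analysis above.

```r
# Count whitespace-delimited tokens per line using base R only.
# regmatches() returns character(0) for lines without a match,
# so empty lines are counted as 0 words.
count_words <- function(lines) {
  lengths(regmatches(lines, gregexpr("\\S+", lines, perl = TRUE)))
}

toy_lines <- c("Hello, world!", "", "  don't   stop  ")
count_words(toy_lines)
# 2 0 2
```

Under this definition punctuation stays attached to its word ("Hello," is one token) and a contraction such as "don't" counts once.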

Text cleaning

Due to the large size of the data, I sample 1% of each dataset to continue the analysis. After sampling, I pre-process the text data to clean it by converting it to lowercase, removing punctuation, numbers, stop words, and extra white spaces.
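As a small illustration of the sampling step, the sketch below draws a fixed fraction of lines reproducibly. The helper `sample_fraction` and the toy data are hypothetical; the seed value mirrors the one used in this report.

```r
# Draw a reproducible random sample containing `fraction` of the lines.
sample_fraction <- function(lines, fraction = 0.01, seed = 12345) {
  set.seed(seed)                             # same seed, same sample
  size <- floor(length(lines) * fraction)    # whole number of lines
  sample(lines, size)
}

toy_lines <- paste("line", 1:1000)
toy_sample <- sample_fraction(toy_lines)
length(toy_sample)
# 10
```

Fixing the seed inside the helper means repeated calls return the same lines, which keeps downstream tables and plots stable between runs.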

Here’s a summary of the cleaning process:

Lowercased the text to ensure consistency.
Removed punctuation and numbers to focus on meaningful words.
Removed stop words (common words like “the”, “and”, “is”) to emphasize more important terms.
Removed extra white spaces for a cleaner dataset.

library(NLP)
library(tm)

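set.seed(12345)  # fixed seed so the 1% samples are reproducible

# floor() makes the sample size an explicit whole number of lines
s_blogs <- sample(blogs, floor(length(blogs) * 0.01))
s_news <- sample(news, floor(length(news) * 0.01))
s_twitter <- sample(twitter, floor(length(twitter) * 0.01))
s_data <- list(s_blogs = s_blogs, s_news = s_news, s_twitter = s_twitter)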


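# Clean a character vector of text lines using tm's transformation helpers
clean_text <- function(text) {
  text <- tolower(text)
  # Note: punctuation is stripped before stop words are removed, so a
  # contraction such as "don't" becomes "dont" and no longer matches the
  # apostrophe forms in stopwords("en").
  text <- removePunctuation(text)
  text <- removeNumbers(text)
  text <- removeWords(text, stopwords("en"))
  text <- stripWhitespace(text)

  return(text)
}

cleaned_datasets <- lapply(s_data, clean_text)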

cleaned_blogs <- cleaned_datasets$s_blogs
cleaned_news <- cleaned_datasets$s_news
cleaned_twitter <- cleaned_datasets$s_twitter

cleaned_blogs <- cleaned_blogs[cleaned_blogs != ""]
cleaned_news <- cleaned_news[cleaned_news != ""]
cleaned_twitter <- cleaned_twitter[cleaned_twitter != ""]
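To make the effect of the cleaning steps concrete, here is a base-R approximation applied to one toy sentence. It is only a sketch: `clean_text_base` and the five-word `toy_stopwords` list are hypothetical stand-ins, not the tm functions used above, and unlike `stripWhitespace` it also trims leading and trailing spaces.

```r
# Hypothetical, tiny stand-in for tm::stopwords("en")
toy_stopwords <- c("the", "and", "is", "a", "of")

# Base-R approximation of the cleaning steps, in the same order
clean_text_base <- function(text, stop_words = toy_stopwords) {
  text <- tolower(text)                          # lowercase
  text <- gsub("[[:punct:]]+", "", text)         # drop punctuation
  text <- gsub("[[:digit:]]+", "", text)         # drop numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  text <- gsub(pattern, "", text, perl = TRUE)   # drop stop words
  trimws(gsub("\\s+", " ", text))                # collapse extra spaces
}

clean_text_base("The cat, and the 2 dogs!")
# "cat dogs"
```

Only the content-bearing words survive, which is why the word clouds and n-gram counts that follow are dominated by nouns, verbs, and adjectives rather than function words.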

Word Clouds

After cleaning the data, I visualize the most frequent words using word clouds. The size of the word indicates how frequently it appears in the dataset. Below are the word clouds for blogs, news, and Twitter.

library(RColorBrewer)
library(wordcloud)

par(mfrow = c(1, 3), mar = c(2,2,2,2))
suppressWarnings(wordcloud(cleaned_blogs, max.words = 50, random.order = FALSE, use.r.layout = FALSE, colors = brewer.pal(4, "RdYlBu")))

suppressWarnings(wordcloud(cleaned_news, max.words = 50, random.order = FALSE, use.r.layout = FALSE, colors = brewer.pal(4, "PiYG")))

suppressWarnings(wordcloud(cleaned_twitter, max.words = 50, random.order = FALSE, use.r.layout = FALSE, colors = brewer.pal(4, "PRGn")))
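The word sizes in a cloud reflect word frequencies, so the same information can also be read as a table. A minimal base-R sketch, using hypothetical toy input and a hypothetical helper name `word_freq`:

```r
# Tabulate the most frequent words in a character vector of cleaned text.
word_freq <- function(text, top = 5) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]             # drop empty tokens
  freq <- sort(table(words), decreasing = TRUE)
  head(freq, top)
}

toy_text <- c("good morning", "good day", "good morning friends")
word_freq(toy_text, top = 2)
# "good" appears 3 times and "morning" 2 times
```

A table like this is a useful companion to the clouds when exact counts matter, for example when comparing the three sources.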

N-grams Analysis

Next, I analyze the most frequent n-grams in each of the three datasets: blogs, news, and Twitter.

N-grams are sequences of N words that appear together in the text. In this case I focus on 1, 2 and 3 word sequences, respectively known as unigrams, bigrams, and trigrams.
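As a small illustration of this definition, the sketch below splits one sentence into its n-grams with base R. The function `make_ngrams` is hypothetical and only meant to show the idea; the analysis itself relies on tidytext.

```r
# Return all sequences of `n` consecutive words in a single sentence.
make_ngrams <- function(sentence, n) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  }, character(1))
}

make_ngrams("thanks for the follow", 2)
# "thanks for" "for the" "the follow"
make_ngrams("thanks for the follow", 3)
# "thanks for the" "for the follow"
```

A sentence of k words yields k - n + 1 n-grams, so longer n-grams are both fewer and rarer, which is why trigram counts are much lower than unigram counts.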

I create and visualize the top 5 n-grams from each dataset using ggplot2, with different color palettes assigned for each dataset.

library(tidytext)
library(dplyr)
library(ggplot2)
library(gridExtra)

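# Count n-grams of length `n` and keep those that occur more than once
create_ngrams <- function(text_data, n) {
  tibble(text = text_data) %>%
    unnest_tokens(output = "ngram", input = text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram)) %>%   # lines with fewer than n words yield NA
    count(ngram, sort = TRUE) %>%
    # Inside filter(), `n` is the count column created by count(),
    # not the function argument `n`
    filter(n > 1)
}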

ngram_blogs_1gram <- create_ngrams(cleaned_blogs, 1)
ngram_blogs_2gram <- create_ngrams(cleaned_blogs, 2)
ngram_blogs_3gram <- create_ngrams(cleaned_blogs, 3)

ngram_news_1gram <- create_ngrams(cleaned_news, 1)
ngram_news_2gram <- create_ngrams(cleaned_news, 2)
ngram_news_3gram <- create_ngrams(cleaned_news, 3)

ngram_twitter_1gram <- create_ngrams(cleaned_twitter, 1)
ngram_twitter_2gram <- create_ngrams(cleaned_twitter, 2)
ngram_twitter_3gram <- create_ngrams(cleaned_twitter, 3)

palettes <- list(brewer.pal(5, "RdYlBu"), brewer.pal(5, "PiYG"), brewer.pal(5, "PRGn"))
 
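# Bar chart of n-gram frequencies, colored with the given palette
plot_ngram <- function(ngram_data, title, palette) {
  ggplot(ngram_data, aes(x = reorder(ngram, n), y = n, fill = n)) +
    geom_col() +
    scale_fill_gradientn(colors = palette) +
    labs(title = title, x = "N-grams", y = "Frequency") +
    # Apply the complete theme first; calling theme_minimal() last would
    # reset the axis-text rotation set in theme()
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}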

b1 <- plot_ngram(head(ngram_blogs_1gram, 5), "Top 5 Blogs", palettes[[1]])
n1 <- plot_ngram(head(ngram_news_1gram, 5), "Top 5 News", palettes[[2]])
t1 <- plot_ngram(head(ngram_twitter_1gram, 5), "Top 5 Twitter", palettes[[3]])

grid.arrange(b1,n1,t1, heights= c(4,4,4), top="Unigrams")

b2 <- plot_ngram(head(ngram_blogs_2gram, 5), "Top 5 Blogs", palettes[[1]])
n2 <- plot_ngram(head(ngram_news_2gram, 5), "Top 5 News", palettes[[2]])
t2 <- plot_ngram(head(ngram_twitter_2gram, 5), "Top 5 Twitter", palettes[[3]])

grid.arrange(b2,n2,t2, heights= c(4,4,4), top="Bigrams")

b3 <- plot_ngram(head(ngram_blogs_3gram, 5), "Top 5 Blogs", palettes[[1]])
n3 <- plot_ngram(head(ngram_news_3gram, 5), "Top 5 News", palettes[[2]])
t3 <- plot_ngram(head(ngram_twitter_3gram, 5), "Top 5 Twitter", palettes[[3]])

grid.arrange(b3,n3,t3, heights= c(4,4,4), top="Trigrams")

To conclude, I have loaded the blogs, news, and Twitter data, cleaned a 1% sample of each, and examined word and n-gram frequencies. The word clouds and n-gram counts show which words and word sequences occur most often in each source, which is the information needed to predict the next word from the words already typed.

This analysis sets the stage for building a prediction model that can suggest words when part of the sentence is typed.

The next step is to develop an algorithm that can make these predictions, and then bring it all together in a Shiny app that users can easily interact with.
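The prediction algorithm has not been designed yet. Purely as an illustration of how n-gram counts could drive a suggestion, the sketch below looks up the most frequent continuation of a word in a table of bigram counts. Everything in it is an assumption: the counts are made-up toy values, `predict_next` is a hypothetical name, and the final model may work quite differently.

```r
# Hypothetical toy bigram counts, shaped like the output of a count step
# (columns `ngram` and `n`); the numbers are invented for illustration.
bigram_counts <- data.frame(
  ngram = c("thank you", "thank goodness", "you are", "you can"),
  n = c(10, 2, 7, 5),
  stringsAsFactors = FALSE
)

# Suggest the `top` most frequent words that follow `last_word`.
predict_next <- function(last_word, counts, top = 1) {
  parts <- strsplit(counts$ngram, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == tolower(last_word)
  if (!any(hits)) return(character(0))      # unseen word: no suggestion
  ranked <- order(counts$n[hits], decreasing = TRUE)
  head(second[hits][ranked], top)
}

predict_next("thank", bigram_counts)
# "you"
predict_next("you", bigram_counts, top = 2)
# "are" "can"
```

A lookup like this returns nothing for words it has never seen, so a practical version would need a fallback, for example suggesting the most frequent unigrams.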