Introduction

The purpose of this report is to demonstrate familiarity with the text data provided for the Data Science Capstone Project and to outline plans for building a next-word prediction algorithm and a Shiny web application.

This document provides:

  • An overview of the datasets
  • Basic summary statistics
  • Key findings from exploratory analysis
  • A brief description of the planned prediction algorithm and Shiny app

The report is written in a concise, non-technical style suitable for a manager who is not a data scientist.

Data Overview

The data used in this project comes from the SwiftKey corpus and includes English-language text from three sources:

  • Blogs (en_US.blogs.txt)
  • News articles (en_US.news.txt)
  • Twitter posts (en_US.twitter.txt)
Loading Required Libraries

library(tm)
library(ggplot2)
library(stringi)
library(dplyr)

Data Loading

Set working directory to the folder containing the data

setwd("path_to_your_data")

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Confirm lengths

print(c(length(blogs), length(news), length(twitter)))

All three datasets were successfully loaded into R.
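As an optional sanity check (a minimal sketch, assuming the three files sit in the working directory set above), the size of each file on disk can also be inspected:

# Approximate size of each source file in megabytes
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
round(file.size(files) / 1024^2, 1)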


Basic Summary Statistics

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter)),
  Words  = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
print(data_summary)

Key Observations

  • Twitter contains the largest number of lines.
  • Blog and news data contain longer text per line (a quick check of this observation is sketched after this list).
  • The dataset is sufficiently large to support language modeling.
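
A simple, optional check of the per-line length observation, computed on the full datasets loaded above (the names used here are illustrative):

# Average words per line by source
avg_words <- c(
  Blogs   = mean(stri_count_words(blogs)),
  News    = mean(stri_count_words(news)),
  Twitter = mean(stri_count_words(twitter))
)
round(avg_words, 1)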

Sampling the Data

set.seed(1234)

sample_blogs   <- sample(blogs,   as.integer(length(blogs)   * 0.01))
sample_news    <- sample(news,    as.integer(length(news)    * 0.01))
sample_twitter <- sample(twitter, as.integer(length(twitter) * 0.01))

sample_data <- c(sample_blogs, sample_news, sample_twitter)


Data Cleaning

corpus <- VCorpus(VectorSource(sample_data))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
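
As a quick spot check (not part of the analysis itself), one cleaned document can be printed to confirm the transformations behaved as expected:

# Inspect the first cleaned document
writeLines(as.character(corpus[[1]]))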


Exploratory Data Analysis

Words per Line

words_per_line <- stri_count_words(sample_data)
df_words <- data.frame(words = words_per_line)

ggplot(df_words, aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Words per Line",
    x = "Number of Words",
    y = "Frequency"
  )

This histogram shows that most lines are relatively short, especially due to Twitter content.
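
To see how much of this is driven by Twitter, the same distribution can be split by source. This is an optional sketch that reuses the three samples drawn earlier; the faceted layout is an illustrative choice, not a requirement of the analysis:

# Words per line, broken out by source
df_by_source <- data.frame(
  words  = c(stri_count_words(sample_blogs),
             stri_count_words(sample_news),
             stri_count_words(sample_twitter)),
  source = rep(c("Blogs", "News", "Twitter"),
               c(length(sample_blogs), length(sample_news), length(sample_twitter)))
)

ggplot(df_by_source, aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  facet_wrap(~ source, scales = "free_y") +
  labs(title = "Words per Line by Source", x = "Number of Words", y = "Frequency")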


Word Frequency Analysis

tdm <- TermDocumentMatrix(corpus)
term_freq <- rowSums(as.matrix(tdm))
term_freq <- sort(term_freq, decreasing = TRUE)

top_words <- data.frame(
  word = names(term_freq)[1:10],
  frequency = term_freq[1:10]
)

print(top_words)

Most Frequent Words

ggplot(top_words, aes(x = reorder(word, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 10 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

A small number of words account for a large portion of the text, which is typical of natural language data.
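
One rough way to quantify this, reusing the sorted term_freq vector computed above (the 50% and 90% thresholds are illustrative choices, not values prescribed by the assignment):

# How many unique words are needed to cover 50% and 90% of word occurrences?
coverage <- cumsum(term_freq) / sum(term_freq)
c(words_for_50_percent = sum(coverage < 0.50) + 1,
  words_for_90_percent = sum(coverage < 0.90) + 1)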



Reducing the Sample Size

If memory or run time becomes a constraint, the exploratory analysis can be repeated on a smaller 0.1% sample of each source:

sample_blogs   <- sample(blogs,   as.integer(length(blogs)   * 0.001))
sample_news    <- sample(news,    as.integer(length(news)    * 0.001))
sample_twitter <- sample(twitter, as.integer(length(twitter) * 0.001))