Milestone Report

Introduction

This report summarizes an exploratory analysis of three large English text files (blogs, news, and Twitter). The goal is to understand the data and outline a plan for building a next-word prediction algorithm and Shiny web application.
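
As a rough illustration of the next-word prediction goal, the sketch below counts word pairs (bigrams) in a tiny made-up corpus and returns the most frequent follower of a given word. It is only a minimal base-R sketch under stated assumptions, not the planned algorithm: `toy_lines` is hypothetical data standing in for the real files, and `predict_next()` is a hypothetical helper name.

```r
# Toy corpus standing in for the blogs/news/Twitter lines (hypothetical data).
toy_lines <- c("the cat sat on the mat",
               "the cat ate the fish",
               "the dog sat on the rug")

# Lowercase each line and split it into words on whitespace.
tokens <- strsplit(tolower(toy_lines), "\\s+")

# Build (first word, next word) pairs within each line.
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), nxt = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each pair occurs and drop pairs that never occur.
bigram_counts <- as.data.frame(table(first = bigrams$first, nxt = bigrams$nxt),
                               stringsAsFactors = FALSE)
bigram_counts <- bigram_counts[bigram_counts$Freq > 0, ]

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(word) {
  cand <- bigram_counts[bigram_counts$first == tolower(word), ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$nxt[which.max(cand$Freq)]
}

predict_next("sat")  # "on"
```

On the full data set a sketch like this would need sampling, cleaning, and a fallback for unseen words, which are left out here.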

Data Summary

The three data sources differ in size and style. Blog lines are the longest (about 230 characters on average), news lines are somewhat shorter (about 201) and more formal in tone, and Twitter messages are the shortest (about 69) and the most conversational. Table 1 shows the number of lines, total characters, and average characters per line for each source.

# Read each file as UTF-8; skipNul = TRUE drops embedded NUL characters
twitter <- readLines("final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",
                     encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",
                     encoding = "UTF-8", skipNul = TRUE)

library(dplyr)

summary_stats <- tibble(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter)),
  Chars  = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
) %>%
  mutate(AvgCharsPerLine = round(Chars / Lines, 1))

knitr::kable(summary_stats, caption = "Basic summary of text files")

Table 1: Basic summary of text files

Source    Lines      Chars        AvgCharsPerLine
Blogs     899288     206824505    230.0
News      1010242    203223159    201.2
Twitter   2360148    162096241    68.7

library(stringr)
library(ggplot2)
library(dplyr)

line_lengths <- tibble(
  Source = rep(c("Blogs","News","Twitter"),
               times = c(length(blogs), length(news), length(twitter))),
  Words  = c(str_count(blogs, "\\S+"),
             str_count(news, "\\S+"),
             str_count(twitter, "\\S+"))
)

ggplot(line_lengths, aes(x = Words, fill = Source)) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  xlab("Words per line") + ylab("Count") +
  ggtitle("Distribution of line lengths")

Uttam

12/3/2025