Overview

This is the milestone report for week 2 of the Johns Hopkins University Data Science Capstone project on Coursera. The overall goal of the Capstone project is to build a predictive text model using Natural Language Processing (NLP), along with a predictive text application that determines the most likely next word when a user inputs a word or a phrase. The purpose of this milestone report is to demonstrate how the data was downloaded, imported into R, and cleaned. The report also contains an exploratory analysis of the data, including summary statistics, graphics that illustrate features of the data, interesting findings discovered along the way, and an outline of the next steps toward building the predictive application.
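
To make the "most likely next word" goal concrete, here is a minimal sketch of one common approach, a bigram frequency lookup. It is not necessarily the model this project will adopt. The toy corpus and the `predict_next()` helper are made up for illustration.

```r
# Toy corpus (made up for illustration; not the project's data)
corpus <- c("the cat sat on the mat",
            "the cat ate the fish",
            "the dog sat on the rug")

# Build (previous word, next word) pairs within each sentence
pairs <- do.call(rbind, lapply(strsplit(tolower(corpus), "\\s+"), function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the word that most often follows `word`,
# or NA if `word` was never seen as a previous word
predict_next <- function(word, pairs) {
  followers <- pairs$nxt[pairs$prev == tolower(word)]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

predict_next("sat", pairs)  # "on"
predict_next("the", pairs)  # "cat"
```

A lookup like this only knows words it has already seen, which is one reason the exploratory word counts later in this report matter.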

library(sentimentr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringi)
library(stringr)
library(tidytext)
library(gutenbergr)
library(dplyr)
library(textdata)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)

Set the working directory

setwd("C:/Users/USER/OneDrive/Desktop/Coursera")

Downloading and importing data

The data is downloaded from Project Gutenberg.

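# `lines` is assumed to be a character vector holding the downloaded text, one element per line
puffin <- tibble(line = seq_along(lines), text = lines)
puffin
# Split each line into one lowercase word per row
puffin_tokens <- puffin %>% unnest_tokens(word, text)
puffin_tokens %>% count(word, sort = TRUE)
# Load the stop word lexicon that ships with tidytext
data("stop_words")
stop_words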
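
To show what the tokenization step does, here is a minimal sketch on a made-up two-line input. The `toy` tibble is hypothetical and stands in for `puffin`.

```r
library(dplyr)
library(tidytext)

# Toy input (made up for illustration; not the report's data)
toy <- tibble(line = 1:2,
              text = c("The Puffin flies.", "The puffin dives, then flies!"))

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_tokens <- toy %>% unnest_tokens(word, text)

# Word frequencies, most frequent first
toy_tokens %>% count(word, sort = TRUE)
```

Because the tokens are lowercased, "Puffin" and "puffin" are counted as the same word.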

Next, the data needs to be cleaned. Common text mining cleaning tasks include:

Removing punctuation marks, numbers, extra whitespace, and stop words (common words like “and”, “or”, “is”, “in”, etc.)

Filtering out unwanted words

library(tidytext)
library(dplyr)
library(ggplot2)
# Drop stop words by anti-joining against the stop_words lexicon
important_puffin_tokens <- puffin_tokens %>%
  anti_join(stop_words, by = "word")
important_puffin_tokens
# Count the remaining words, most frequent first
important_puffin_tokens %>% count(word, sort = TRUE)
# Repeat the pipeline from the raw lines, keeping only words that occur more than 50 times
counts <- puffin %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  filter(n > 50)
# Order the word factor by frequency so the bar chart comes out sorted
counts <- counts %>% mutate(word = reorder(word, n))
counts
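
The pipeline above removes stop words. `unnest_tokens()` already lowercases the text and strips punctuation, but number tokens pass through it. Here is a minimal sketch of dropping them on a made-up input, assuming tokens made only of digits are unwanted. The `toy` and `cleaned` names are hypothetical.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy input (made up for illustration; not the report's data)
toy <- tibble(text = "In 1912,   the  puffin laid 2 eggs; the colony grew.")

cleaned <- toy %>%
  unnest_tokens(word, text) %>%              # lowercases, drops punctuation and extra whitespace
  filter(!str_detect(word, "^[0-9]+$")) %>%  # drop tokens made only of digits
  anti_join(stop_words, by = "word")         # drop common stop words

cleaned$word
```

The same `filter()` line could be added to the `counts` pipeline before `count()` if numbers turn out to clutter the frequency table.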

Visualize the Data

The final step is to create visualizations of the data.

counts %>% ggplot(aes(n, word)) + geom_col() + labs(y = NULL)

A word cloud is another interesting way to visualize the data. Word clouds are easy to understand as the words with the highest frequency stand out better. Word clouds are also visually engaging and work well for presentations.

library(wordcloud)  # wordcloud() comes from the wordcloud package
# Create a word cloud; words are plotted in decreasing order of frequency
wordcloud(words = counts$word,
          freq = counts$n,
          min.freq = 1,
          max.words = 100,
          random.order = FALSE)
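
A similar cloud can also be drawn in ggplot2 style with the `geom_text_wordcloud()` layer from the ggwordcloud package. This is a minimal sketch on made-up counts. The `toy_counts` data frame is hypothetical and stands in for `counts`.

```r
library(ggplot2)
library(ggwordcloud)

# Made-up word frequencies standing in for `counts`
toy_counts <- data.frame(
  word = c("puffin", "sea", "cliff", "egg", "wing"),
  n    = c(120, 95, 80, 64, 51)
)

# Map each word to the label and its frequency to the text size
p <- ggplot(toy_counts, aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  labs(title = "Word Cloud") +
  theme_minimal()

p
```

Because this builds a ggplot object, titles and themes can be added with the usual ggplot2 functions.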