Data Science Capstone Project: Milestone Report

Introduction

Finaly here we are at the first steps to complete the Data Science Specialization doing this Capstone Project. The goal is get a group of texts, so random and comprehensive as possible, in a bundle, and statistically analyse it searching for clues that could help to predict the next letters in a word and the next words in a phrase. There are a lot of theories about how to do this, considering the roots of the idiom being analysed, creating personal rules, mixing different flavors of analyse but, in this work, we will try to create some machine decisions based on statistics and tools available to data scientists.

First step: create the bundle

Swiftkey, partner of Johns Hopkins Health Public School for the Coursera Data Science Specialization, provided three groups of texts: texts from blogs texts from news texts from twitters. We need to know how big are these files and the profile of the contents.

To perform an exploratory data analysis (EDA) on the SwiftKey dataset, which includes text data from blogs, news, and Twitter, we need to:

Determine the size of each file: This will help us understand the storage requirements and memory management needed. Profile the contents of each file: This involves analyzing the number of lines, words, characters, and other relevant statistics to understand the nature of the text data.

Review

Vectorization with sapply:

Instead of repeating the readLines and length operations for each file separately, this code uses sapply to apply the get_num_lines function to each file in the file_names vector. This approach reduces code duplication and makes the script more concise and easier to maintain.

Encapsulation in a Function:

The get_num_lines function encapsulates the logic for reading a file and counting its lines. This improves code readability and allows you to reuse the function if needed for other tasks.

Naming and Printing Results:

The names function assigns the file names as names to the line_counts vector for better clarity when printing the results.

# Set the working directory
setwd("C:/Users/Carlos/Desktop/Diplomados y Cursos/DataScienceCoursera")

# List of file names
file_names <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Function to get the number of lines in a file
get_num_lines <- function(file_name) {
  length(readLines(file_name))
}

# Get number of lines for each file
line_counts <- sapply(file_names, get_num_lines)

# Print the results
names(line_counts) <- file_names
line_counts

##   en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
##            899288             77259           2360148

We need to know if there are enough texts and if there are diversity in the contents. So, we will count the lines and characters, and proceed a cleaning for eliminate too much blank spaces, too much repeted letters and eliminate some badwords that, for the pourpose of this project, are not considered.

Review

Using lapply for Reading Files:

lapply is used to apply the readLines function to each file in file_paths. This approach is more concise and scalable compared to calling readLines multiple times separately.

Creating Data Frame:

The data.frame function is used to create the bundle data frame directly from the list sample_data. This is more efficient and readable than converting each list element to a data frame and then combining them.

Naming Columns:

The data.frame function allows for direct column naming, which simplifies the process and avoids the need to manually assign column names afterward.

Avoiding Unnecessary Conversions:

By creating the data frame directly from the list of character vectors, the need for intermediate conversions is eliminated, reducing potential errors and improving code efficiency.

##     Blogs               News             Twitter         
##  Length:400         Length:400         Length:400        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Review

The R code is designed to preprocess text data and analyze word frequencies using the tm and wordcloud packages. Here’s a breakdown of its process and an overall review:

Library Imports The code starts by loading the necessary libraries: tm for text mining, wordcloud for visualization, and NLP for natural language processing capabilities. This setup ensures all required tools are available for text preprocessing and analysis.
Text Preprocessing with tokenmaker Function The tokenmaker function is central to the text preprocessing workflow. It takes raw text input and processes it through several steps to normalize and clean the data:

Lowercasing: Converts all text to lowercase to ensure consistency (e.g., “Hello” and “hello” are treated the same). Removing Punctuation and Numbers: Strips out punctuation and numbers to focus on meaningful words. Whitespace Trimming: Removes extra spaces, which can result from punctuation or other formatting issues. Stopwords Removal: Eliminates common English words (like “the”, “is”, etc.) that don’t contribute to the uniqueness of the text content. Stemming: Reduces words to their base or root form (e.g., “running” becomes “run”), which helps in consolidating word variants and reduces dimensionality. The function then returns the cleaned corpus, which is ready for further analysis.

Word Frequency Analysis with wordcounter Function The wordcounter function takes the preprocessed corpus and performs word frequency analysis:

Document-Term Matrix (DTM) Creation: Converts the corpus into a matrix where rows represent documents and columns represent terms (words), with each cell indicating the frequency of the word in the document. Matrix Conversion and Summation: Converts the DTM to a standard matrix format and sums the occurrences of each term across all documents. Sorting Frequencies: Orders the terms by frequency in descending order, highlighting the most common words in the dataset. The output is a sorted vector of word frequencies, which can be used for further analysis or visualization, such as word clouds.

library(tm)

## Loading required package: NLP

library(wordcloud)

## Loading required package: RColorBrewer

library(NLP)  # Ensures the NLP package is loaded

# Function to preprocess text data
tokenmaker <- function(x) {
    corpus <- Corpus(VectorSource(x))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, PlainTextDocument)
    corpus <- tm_map(corpus, stemDocument)
    
    # Return the corpus
    return(corpus)
}  

# Function to count word frequencies
wordcounter <- function(corpus) {
    dtm <- DocumentTermMatrix(corpus)
    dtm_matrix <- as.matrix(dtm)
    word_freq <- colSums(dtm_matrix)
    word_freq <- sort(word_freq, decreasing = TRUE)
    return(word_freq)
}  

# Example usage
# Sample text data
texts <- c("This is a sample text.", "Another sample text for analysis.")

# Preprocess text data
corpus <- tokenmaker(texts)

# Count word frequencies
word_freq <- wordcounter(corpus)

# View results
print(head(word_freq, 10))

##   sampl    text analysi   anoth 
##       2       2       1       1

Then, we will manage the three column, aplying the two functions, making a summary of each kind of data, inspect it´s profile printing the header and making an curious graphical expression with the 100 more common words in each subject. This graphical wordcloud has a “scientific”” expression in the histogram displayed after it.

library(tm)
library(wordcloud)

# Function to preprocess text data
tokenmaker <- function(x) {
  # Create a Corpus object from the vector of text
  corpus <- Corpus(VectorSource(x))
  
  # Preprocess each document in the Corpus
  corpus <- tm_map(corpus, content_transformer(tolower))          # Convert to lowercase
  corpus <- tm_map(corpus, removePunctuation)                     # Remove punctuation
  corpus <- tm_map(corpus, stripWhitespace)                       # Remove extra whitespace
  corpus <- tm_map(corpus, removeWords, stopwords("english"))     # Remove stopwords
  corpus <- tm_map(corpus, removeNumbers)                         # Remove numbers
  corpus <- tm_map(corpus, stemDocument)                          # Apply stemming
  
  return(corpus)
}

# Function to compute word frequencies
wordcounter <- function(corpus) {
  if (length(corpus) == 0) {
    stop("Error: The corpus is empty.")
  }
  
  # Create DocumentTermMatrix
  dtm <- DocumentTermMatrix(corpus)
  
  if (ncol(dtm) == 0 || nrow(dtm) == 0) {
    stop("Error: DocumentTermMatrix is empty.")
  }
  
  # Convert DocumentTermMatrix to matrix and calculate frequencies
  dtm_matrix <- as.matrix(dtm)
  word_freq <- colSums(dtm_matrix)
  word_freq <- sort(word_freq, decreasing = TRUE)
  return(list(names(word_freq), word_freq))
}

# Ensure 'bundle' is a valid data frame
if (!is.data.frame(bundle)) {
  stop("Error: 'bundle' must be a data frame.")
}

# Preprocess and tokenize the blog texts
blogs_token <- tokenmaker(bundle[, 1])

# Compute word frequencies
blogs_words <- wordcounter(blogs_token)

# Display summary statistics
summary_stats <- summary(nchar(bundle[, 1]))
print(summary_stats)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    47.0   180.5   248.2   360.2  1461.0

# Preview the first few blog entries
print(head(bundle[, 1]))

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

# Generate a word cloud if there are valid words
if (length(blogs_words[[1]]) > 0 && length(blogs_words[[2]]) > 0) {
  wordcloud(blogs_words[[1]], blogs_words[[2]], max.words = 100)
} else {
  warning("No words available to generate a word cloud.")
}

## Explanation of the above process

Library Imports:

tm: For text mining and preprocessing. wordcloud: For visualizing word frequencies.

Text Preprocessing (tokenmaker function):

Input: A vector of text documents (bundle[, 1]).

Process: Corpus Creation: Creates a Corpus object from the text vector using VectorSource. Text Processing: Applies a series of transformations to clean and normalize the text: Converts text to lowercase. Removes punctuation. Strips extra whitespace. Removes common stopwords. Removes numbers. Applies stemming to reduce words to their root forms. Output: A preprocessed Corpus object containing cleaned documents.

Word Frequency Calculation (wordcounter function):

Input: The preprocessed Corpus. Process: DocumentTermMatrix Creation: Creates a DocumentTermMatrix from the Corpus. Frequency Calculation: Converts the DTM to a matrix, computes the frequency of each term, and sorts these frequencies in descending order. Output: A list of word names and their corresponding frequencies.

Data Validation and Visualization:

Validates the bundle data frame to ensure it’s properly formatted. Processes and tokenizes blog texts. Computes word frequencies and provides a summary of text lengths. Generates a word cloud if there are valid words.

Explanation of the next process

1.Creating TermDocumentMatrix:

TermDocumentMatrix: Constructs a matrix where each row represents a term and each column represents a document. The entries are the term frequencies. Input: A Corpus of preprocessed text documents. Output: A TermDocumentMatrix object.

2.Converting to Matrix:

as.matrix(tdm_Blogs): Converts the TermDocumentMatrix to a standard matrix where rows represent terms and columns represent documents. Output: A matrix where each entry indicates the frequency of a term in a document.

Calculating Word Frequencies:

rowSums(m_Blogs): Sums the term frequencies across all documents for each term to get the total frequency of each term. sort(rowSums(m_Blogs), decreasing = TRUE): Sorts the terms by their total frequency in descending order. Output: A named vector of term frequencies.

Creating a Data Frame:

data.frame(word = names(v_Blogs), freq = v_Blogs): Creates a data frame with terms and their frequencies. Output: A data frame suitable for plotting.

Plotting with ggplot2:

ggplot(subset(d_Blogs, freq > 30), aes(x = reorder(word, freq), y = freq)): Initializes the plot with terms having a frequency greater than 30, reorders words by frequency for better visualization. geom_bar(stat = “identity”): Creates a bar plot where the height of bars represents the term frequencies. theme(axis.text.x = element_text(angle = 45, hjust = 1)): Rotates x-axis labels to avoid overlap. labs(x = “Word”, y = “Frequency”, title = “Word Frequencies in Blogs”): Adds labels and a title to the plot.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

# Create TermDocumentMatrix
tdm_Blogs <- TermDocumentMatrix(blogs_token)

# Convert TermDocumentMatrix to matrix
m_Blogs <- as.matrix(tdm_Blogs)

# Calculate word frequencies
v_Blogs <- sort(rowSums(m_Blogs), decreasing = TRUE)

# Create a data frame of words and their frequencies
d_Blogs <- data.frame(word = names(v_Blogs), freq = v_Blogs)

# Display the top 25 words by frequency
head(d_Blogs, 25)

##        word freq
## like   like   74
## time   time   69
## just   just   58
## one     one   57
## will   will   54
## get     get   52
## make   make   45
## work   work   40
## know   know   38
## can     can   38
## day     day   38
## ’s       ’s   37
## good   good   34
## use     use   32
## now     now   32
## much   much   31
## think think   31
## way     way   30
## year   year   29
## want   want   29
## come   come   29
## thing thing   28
## first first   27
## peopl peopl   27
## even   even   27

# Plot word frequencies using ggplot2
p <- ggplot(subset(d_Blogs, freq > 30), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Word", y = "Frequency", title = "Word Frequencies in Blogs")

print(p)

Now: Data Preprocessing: Converts and cleans news text data, ensuring it’s in a suitable format for analysis. Frequency Analysis: Calculates and sorts word frequencies from the processed news text corpus. Text Length Summary: Provides insights into the length of news articles, which can be useful for understanding data characteristics and ensuring proper preprocessing.

# Process and tokenize the news texts
news_token <- tokenmaker(bundle[, 2])

# Compute word frequencies
news_words <- wordcounter(news_token)

# Display summary statistics for the news texts
summary_stats_news <- summary(nchar(bundle[, 2]))
print(summary_stats_news)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   108.8   183.0   199.9   270.0   982.0

head(bundle [,2])

## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."

Purpose: Creates a visual representation of word frequencies where the size of each word indicates its frequency.

wordcloud() Function:

Purpose: Creates a visual representation of word frequencies where the size of each word indicates its frequency. Inputs: words: A character vector of words. In this case, it comes from news_words[[1]], which contains the words sorted by frequency. freq: A numeric vector of frequencies corresponding to the words. It comes from news_words[[2]], which contains the frequencies of the words. max.words: Limits the maximum number of words to display in the word cloud. Here, it’s set to 100. scale: A vector specifying the range of word sizes. The first value is the largest word size, and the second value is the smallest. colors: Specifies the color palette for the word cloud. Here, brewer.pal(8, “Dark2”) is used to provide a color palette from the RColorBrewer package.

Generating the Word Cloud:

The function plots a word cloud where more frequent words are displayed larger and less frequent words are smaller. Visual Enhancement: The scale parameter ensures that the sizes of the words are within a specified range, making the cloud visually appealing. The colors parameter enhances readability by providing a palette of distinct colors.

wordcloud(words = news_words[[1]], 
          freq = news_words[[2]], 
          max.words = 100, 
          scale = c(3, 0.5), 
          colors = brewer.pal(8, "Dark2"))

# Create TermDocumentMatrix from the news tokenized corpus
tdm_News <- TermDocumentMatrix(news_token)

# Convert TermDocumentMatrix to matrix
m_News <- as.matrix(tdm_News)

# Calculate the frequency of each word
v_News <- sort(rowSums(m_News), decreasing = TRUE)

# Create a data frame with words and their frequencies
d_News <- data.frame(word = names(v_News), freq = v_News)

# Display the top 25 words by frequency
top_words <- head(d_News, 25)
print(top_words)

##          word freq
## said     said  122
## year     year   40
## will     will   34
## time     time   33
## new       new   31
## two       two   30
## first   first   27
## like     like   27
## make     make   24
## school school   24
## one       one   23
## last     last   22
## work     work   22
## day       day   22
## —           —   22
## polic   polic   22
## say       say   21
## citi     citi   20
## also     also   20
## get       get   20
## now       now   19
## peopl   peopl   18
## just     just   18
## state   state   18
## come     come   18

library(ggplot2)

# Create a bar plot of word frequencies for terms with frequency greater than 20
p <- ggplot(subset(d_News, freq > 20), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Word", y = "Frequency", title = "Top Words in News Texts")

# Display the plot
print(p)

# Generate tokens from the Twitter data
twitter_token <- tokenmaker(bundle[, 3])

# Extract and view the first few documents from the Corpus
inspect(twitter_token[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] btw thank rt gonna dc anytim soon love see way way long                
## [2] meet someon special youll know heart will beat rapid youll smile reason
## [3] theyv decid fun dont

# Generate tokens from the Twitter data
twitter_token <- tokenmaker(bundle[, 3])

# Compute word frequencies from the tokenized Twitter data
twitter_words <- wordcounter(twitter_token)

# Display summary statistics for the length of Twitter texts
summary_stats_twitter <- summary(nchar(bundle[, 3]))
print(summary_stats_twitter)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   34.75   63.00   66.45   92.00  140.00

head(bundle [,3])

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

library(wordcloud)
library(RColorBrewer)

# Generate a word cloud for the Twitter data
wordcloud(words = twitter_words[[1]], 
          freq = twitter_words[[2]], 
          max.words = 100, 
          scale = c(3, 0.5), 
          colors = brewer.pal(8, "Dark2"))

# Create TermDocumentMatrix from the tokenized Twitter data
tdm_Twitter <- TermDocumentMatrix(twitter_token)

# Convert TermDocumentMatrix to matrix
m_Twitter <- as.matrix(tdm_Twitter)

# Calculate the frequency of each word
v_Twitter <- sort(rowSums(m_Twitter), decreasing = TRUE)

# Create a data frame with words and their frequencies
d_Twitter <- data.frame(word = names(v_Twitter), freq = v_Twitter)

# Display the top 25 words by frequency
top_words_twitter <- head(d_Twitter, 25)
print(top_words_twitter)

##            word freq
## like       like   27
## day         day   24
## get         get   24
## just       just   24
## love       love   22
## good       good   22
## can         can   22
## thank     thank   20
## will       will   20
## one         one   19
## know       know   18
## need       need   17
## new         new   15
## dont       dont   14
## great     great   14
## time       time   14
## now         now   13
## follow   follow   13
## show       show   12
## tonight tonight   12
## got         got   12
## right     right   11
## much       much   11
## look       look   10
## let         let   10

library(ggplot2)

# Create a bar plot of word frequencies for terms with frequency greater than 15
p <- ggplot(subset(d_Twitter, freq > 15), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Word", y = "Frequency", title = "Top Words in Twitter Texts")

# Display the plot
print(p)

Second step: Profanity filtering - removing profanity and other words you do not want to predict

After the execution of his task, a new wordcount of the dataset is made, to know the remaining words in the bundle, without the profan words provided inthe badwords.txt file.

# Print the number of words before filtering the selectd badwords
# Create a data frame with the dataset names and their respective word counts
word_counts <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  WordCount = c(length(blogs_words[[1]]), length(news_words[[1]]), length(twitter_words[[1]]))
)

# Print the data frame
print(word_counts)

##   Dataset WordCount
## 1   Blogs      3552
## 2    News      3342
## 3 Twitter      1413

# Function to remove bad words from a corpus
badwordremover <- function(corpus) {
  # Check if the bad words file exists; if not, download it
  if (!file.exists("badwords.txt")) {
    download.file(url = "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
                  destfile = "badwords.txt")
  }
  
  # Read the list of bad words
  profanity <- readLines("badwords.txt")
  
  # Remove bad words from the corpus
  corpus <- tm_map(corpus, removeWords, profanity)
  
  return(corpus)
}

# Apply bad word removal to each dataset and compute word counts
blogs_token <- Corpus(VectorSource(blogs_token))
blogs_token <- badwordremover(blogs_token)
blogs_words <- wordcounter(blogs_token)

news_token <- Corpus(VectorSource(news_token))
news_token <- badwordremover(news_token)
news_words <- wordcounter(news_token)

twitter_token <- Corpus(VectorSource(twitter_token))
twitter_token <- badwordremover(twitter_token)
twitter_words <- wordcounter(twitter_token)

# Print the number of words after removing bad words
print(data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  WordCount = c(length(blogs_words[[1]]), length(news_words[[1]]), length(twitter_words[[1]]))
))

##   Dataset WordCount
## 1   Blogs      3987
## 2    News      3786
## 3 Twitter      1722

Explanation of the Process

Function Definition (badwordremover):

badwordremover(corpus): Function that removes profane words from a text corpus. Check File Existence: if (!file.exists(“badwords.txt”)): Checks if the bad words file is already downloaded; if not, it downloads it from a URL. Read Bad Words: profanity <- readLines(“badwords.txt”): Reads the list of bad words from the file. Remove Bad Words: tm_map(corpus, removeWords, profanity): Applies the removeWords function to remove profane words from the corpus.

Apply Profanity Filtering:

Corpus(VectorSource(…)): Converts text data into a text corpus for processing. badwordremover(…): Filters out bad words from the corpus. wordcounter(…): Counts the words in the filtered corpus.

Display Updated Word Counts:

data.frame(Dataset = …, WordCount = …): Creates a data frame to display the number of words after profanity removal. print(data.frame(…)): Prints the data frame.

Third step: Prepare to next step

We have maneged the three text files, undestand their profile and prepared tools to start working with the whole bunch of texts concerarning about time to process and memory needed and star searching not only words alone but pairs , trigrams and other sets of words and their frequency of occurence.

Feedback for Next Steps

Refine the Prediction Algorithm:

Feature Selection: Use the word frequency analysis and term-document matrices to select features that will be used in the prediction model. Model Training: Implement and train machine learning algorithms (e.g., logistic regression, Naive Bayes) using the processed text data.

Develop and Test the Shiny App:

UI/UX Design: Design the user interface of the Shiny app to include interactive elements like text input boxes, visualization panels, and analysis results. Integration: Integrate the text processing functions and prediction model into the app, ensuring that users can interact with and view results dynamically. Testing: Test the app with different inputs to ensure it performs well and provides accurate predictions and visualizations.

Evaluate and Iterate:

Model Evaluation: Assess the performance of the prediction algorithm using metrics such as accuracy, precision, and recall. User Feedback: Gather feedback from users of the Shiny app to identify any improvements or additional features that may enhance its functionality.

##Summary

The code I’ve developed prepares me well for creating a prediction algorithm and Shiny app by focusing on essential data preparation steps, feature extraction, and initial visualizations. By refining my prediction model and developing an interactive Shiny app, I’ll be able to leverage this work to build a robust and user-friendly solution for analyzing and predicting text data.

Brief Project Report: Text Data Analysis and Prediction

Objective

Our goal is to prepare and analyze text data from different sources (blogs, news, Twitter) to develop a prediction algorithm and an interactive Shiny app.

1.Data Overview

Data Sources:

Blogs: Text data from blog posts. News: Text data from news articles. Twitter: Text data from tweets.

Data Size:

Each dataset varies in size, with the number of words ranging from thousands to over a thousand in each source.

2.Key Steps in the Analysis

Data Cleaning:

Loaded and Inspected: We loaded the data and checked its size to understand the scope of our analysis. Removed Profanity: We filtered out inappropriate words to ensure the analysis focuses on meaningful content. Feature Extraction:

Term Frequencies: Calculated how often words appear in each dataset to identify common terms and trends. Visualization: Created visual representations (word clouds and bar plots) to illustrate the most frequent words in each text source.

3.Key Findings

Word Distribution: Identified the most common words in blogs, news, and tweets. This helps in understanding the main themes and topics across different text sources. Data Quality: Removing profane words and irrelevant terms improves the quality of data for further analysis.

4.Next Steps

Develop Prediction Algorithm:

Use the processed text data to build a model that can predict outcomes based on word patterns. Create Shiny App:

Build an interactive app to allow users to input text, view processed results, and analyze word frequencies in real-time.

Evaluate and Improve:

Test the prediction model and Shiny app, gather feedback, and make necessary improvements to enhance functionality and accuracy.

Summary

We have effectively prepared and cleaned the text data, extracted valuable features, and visualized key findings. The next phase involves developing a predictive model and an interactive application to provide insightful analysis and predictions based on text data.

Data Science Capstone Project: Milestone Report

Carlos Vargas

2024-08-30

Introduction

First step: create the bundle

Review

Review

Review

Explanation of the next process

Purpose: Creates a visual representation of word frequencies where the size of each word indicates its frequency.

Second step: Profanity filtering - removing profanity and other words you do not want to predict

Explanation of the Process

Third step: Prepare to next step

Feedback for Next Steps

Brief Project Report: Text Data Analysis and Prediction