Introduction

This report captures the exploratory data analysis performed on the text dataset. It covers the en_us language files.

Load the libraries
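The package-loading chunk is not echoed in the rendered output; a minimal sketch, assuming the tm, SnowballC, and ggplot2 packages that are used later in the report:

# Packages for text mining, stemming, and plotting
library(tm)        # corpus creation and cleaning
library(SnowballC) # stemming via stemDocument()
library(ggplot2)   # plotting word frequencies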

Load the dataset

# Choose the text file interactively and read it in line by line
file_path <- file.choose()
text_data <- readLines(file_path)

Verify that the file loaded properly and check whether it is empty

file.exists(file_path)
## [1] TRUE
length(text_data)
## [1] 109

Display the first 10 lines of the data

head(text_data,10)
##  [1] "---"                                                                                                                                    
##  [2] "title: \"Capstone project\""                                                                                                            
##  [3] "author: \"Mayank Gaur\""                                                                                                                
##  [4] "date: \"2024-08-04\""                                                                                                                   
##  [5] "output: html_document"                                                                                                                  
##  [6] "---"                                                                                                                                    
##  [7] ""                                                                                                                                       
##  [8] "## Introduction"                                                                                                                        
##  [9] "In this report, I have tried to capture exploratory data analysis performed on the text dataset. I have covered for the language en_us."
## [10] ""

Creating a corpus

In NLP and text mining, a corpus is the collection of documents that holds the data for analysis.
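The chunk that builds the corpus is not echoed; a minimal sketch, assuming the first 10 lines of text_data are wrapped in a VectorSource (the output below shows a SimpleCorpus of 10 documents):

# Build a corpus from the first 10 lines of the text and inspect it
corpus <- Corpus(VectorSource(head(text_data, 10)))
inspect(corpus)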

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
## 
##  [1] ---                                                                                                                                    
##  [2] title: "Capstone project"                                                                                                              
##  [3] author: "Mayank Gaur"                                                                                                                  
##  [4] date: "2024-08-04"                                                                                                                     
##  [5] output: html_document                                                                                                                  
##  [6] ---                                                                                                                                    
##  [7]                                                                                                                                        
##  [8] ## Introduction                                                                                                                        
##  [9] In this report, I have tried to capture exploratory data analysis performed on the text dataset. I have covered for the language en_us.
## [10]

Clean the corpus

We need to clean the corpus to reduce noise and standardize the text, so that the analysis focuses on meaningful words and becomes more accurate. The tm package is used for text mining.
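The cleaning chunk is not echoed; a minimal sketch of the transformations implied by the two inspect() outputs below (lower-casing, removing punctuation, numbers, and English stopwords, stripping whitespace, then stemming):

# Standard tm cleaning pipeline
clean_corpus <- tm_map(corpus, content_transformer(tolower))
clean_corpus <- tm_map(clean_corpus, removePunctuation)
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("english"))
clean_corpus <- tm_map(clean_corpus, stripWhitespace)
inspect(clean_corpus)

# Reduce each word to its stem (e.g. "capstone" -> "capston")
clean_corpus <- tm_map(clean_corpus, stemDocument)
inspect(clean_corpus)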

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
## 
##  [1]                                                                                             
##  [2] title capstone project                                                                      
##  [3] author mayank gaur                                                                          
##  [4] date                                                                                        
##  [5] output htmldocument                                                                         
##  [6]                                                                                             
##  [7]                                                                                             
##  [8]  introduction                                                                               
##  [9]  report tried capture exploratory data analysis performed text dataset covered language enus
## [10]
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
## 
##  [1]                                                                                   
##  [2] titl capston project                                                              
##  [3] author mayank gaur                                                                
##  [4] date                                                                              
##  [5] output htmldocument                                                               
##  [6]                                                                                   
##  [7]                                                                                   
##  [8] introduct                                                                         
##  [9] report tri captur exploratori data analysi perform text dataset cover languag enus
## [10]

Exploratory Data Analysis

Word frequency analysis
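The chunk that builds the document-term matrix is not echoed; a minimal sketch, assuming a DocumentTermMatrix built from a cleaned corpus with term frequencies taken as column sums. The output below reports 109 documents, so the full file (not just the 10-line sample) appears to have been cleaned and used here:

# Assumes the cleaned corpus covers the full file (the output reports 109 documents)
dtm <- DocumentTermMatrix(clean_corpus)
dtm                                   # matrix summary
term_freq <- colSums(as.matrix(dtm))  # total count of each term across documents
head(term_freq)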

## <<DocumentTermMatrix (documents: 109, terms: 169)>>
## Non-/sparse entries: 250/18171
## Sparsity           : 99%
## Maximal term length: 40
## Weighting          : term frequency (tf)
##        data warningfals      corpus        word        term    frequenc 
##          12           6           6          15           6           6
Term Frequencies

word          freq
data            12
warningfals      6
corpus           6
word            15
term             6
frequenc         6
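The chunk that prepares the plotting data is not shown; a minimal sketch, assuming top_words holds the 20 most frequent terms arranged as a word/freq data frame (freq_df is a name introduced here for illustration):

# Build a word/freq data frame and keep the 20 most frequent terms
freq_df <- data.frame(word = names(term_freq), freq = term_freq)
top_words <- head(freq_df[order(-freq_df$freq), ], 20)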

# Plot top 20 words using ggplot2
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency") +
  theme_minimal()

Summary of data

The dataset consists of text data that has been processed to remove punctuation, numbers, and common stopwords. Below are the key findings from the data.