This report presents the exploratory data analysis performed on the text dataset. The analysis covers the en_us language only.
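# Pick the input file interactively, then read it line by line
file_path <- file.choose()
text_data <- readLines(file_path)
# Confirm that the chosen path exists
file.exists(file_path)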
## [1] TRUE
length(text_data)
## [1] 109
head(text_data,10)
## [1] "---"
## [2] "title: \"Capstone project\""
## [3] "author: \"Mayank Gaur\""
## [4] "date: \"2024-08-04\""
## [5] "output: html_document"
## [6] "---"
## [7] ""
## [8] "## Introduction"
## [9] "In this report, I have tried to capture exploratory data analysis performed on the text dataset. I have covered for the language en_us."
## [10] ""
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 10
##
## [1] ---
## [2] title: "Capstone project"
## [3] author: "Mayank Gaur"
## [4] date: "2024-08-04"
## [5] output: html_document
## [6] ---
## [7]
## [8] ## Introduction
## [9] In this report, I have tried to capture exploratory data analysis performed on the text dataset. I have covered for the language en_us.
## [10]
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 10
##
## [1]
## [2] title capstone project
## [3] author mayank gaur
## [4] date
## [5] output htmldocument
## [6]
## [7]
## [8] introduction
## [9] report tried capture exploratory data analysis performed text dataset covered language enus
## [10]
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 10
##
## [1]
## [2] titl capston project
## [3] author mayank gaur
## [4] date
## [5] output htmldocument
## [6]
## [7]
## [8] introduct
## [9] report tri captur exploratori data analysi perform text dataset cover languag enus
## [10]
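The code behind the three corpus printouts above is not shown in the report. A minimal sketch of how such a raw, cleaned, and stemmed corpus could be produced with the `tm` package follows. It assumes `tm` and `SnowballC` are installed; the variable names `raw_lines`, `corpus`, `cleaned`, and `stemmed` are hypothetical, and the order of the cleaning steps is an assumption rather than the author's confirmed pipeline.

```r
library(tm)  # stemDocument() additionally relies on the SnowballC package

# A few sample lines copied from the head(text_data, 10) output above
raw_lines <- c(
  "title: \"Capstone project\"",
  "date: \"2024-08-04\"",
  "output: html_document",
  "## Introduction"
)

# Build a corpus with one document per line
corpus <- SimpleCorpus(VectorSource(raw_lines))

# Assumed cleaning order: lower-case, strip punctuation and digits,
# drop English stopwords, then collapse repeated whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Cleaned text with leading/trailing blanks trimmed for readability
cleaned <- trimws(content(corpus))

# Stemming reduces each word to its stem
stemmed <- trimws(content(tm_map(corpus, stemDocument)))

print(cleaned)
print(stemmed)
```

Because `removePunctuation` also deletes underscores, `html_document` collapses into the single token `htmldocument`, which matches the printout above.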
## <<DocumentTermMatrix (documents: 109, terms: 169)>>
## Non-/sparse entries: 250/18171
## Sparsity : 99%
## Maximal term length: 40
## Weighting : term frequency (tf)
##        data warningfals      corpus        word        term    frequenc
##          12           6           6          15           6           6
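The document-term matrix and the frequency vector above could be obtained along the following lines. This is a sketch under assumptions: the three sample documents are invented for illustration, the names `docs`, `dtm`, `term_freq`, and `frequent` are hypothetical, and the cut-off of 2 is chosen only to suit the tiny sample.

```r
library(tm)

# Hypothetical, already-cleaned documents (not taken from the real dataset)
docs <- c(
  "report data word",
  "data word term",
  "word frequenc"
)

corpus <- SimpleCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are raw term counts (tf)
dtm <- DocumentTermMatrix(corpus)

# Total count of each term across all documents
term_freq <- colSums(as.matrix(dtm))

# Keep only the terms that occur at least twice in this toy example
frequent <- term_freq[term_freq >= 2]
print(frequent)
```

Converting with `as.matrix()` is fine for a matrix of this size; for a much larger corpus a sparse-aware summary would be the safer choice.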
| word | freq |
|---|---|
| data | 12 |
| warningfals | 6 |
| corpus | 6 |
| word | 15 |
| term | 6 |
| frequenc | 6 |
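The plotting code that follows expects a data frame named `top_words` with `word` and `freq` columns, but the report does not show how it was built. One possible construction in base R, using the counts reported in the table above, is sketched here; the intermediate name `term_freq` is an assumption.

```r
# Term counts as reported in the table above
term_freq <- c(data = 12, warningfals = 6, corpus = 6,
               word = 15, term = 6, frequenc = 6)

# One row per term, with the term as a regular column
top_words <- data.frame(
  word = names(term_freq),
  freq = as.numeric(term_freq),
  stringsAsFactors = FALSE
)

# Sort by descending frequency and keep at most the top 20 rows
top_words <- head(top_words[order(-top_words$freq), ], 20)
rownames(top_words) <- NULL

print(top_words)
```

Resetting the row names avoids the duplicated word column that appears when the terms are kept as both row names and a column.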
# Plot top 20 words using ggplot2
library(ggplot2)
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency") +
  theme_minimal()
The data consists of text that has been processed to remove punctuation, numbers, and common stopwords. Below are the key findings from the data.