## My Dataset

The data is the Project Gutenberg eBook of The Heel of Achilles in plain-text format. I formatted it into an n-gram data frame, shown below with head(ngram_dataset, 10).
# Path to the text file
file_path <- "/cloud/project/pg73352.txt"
# Read the text file and convert it to lower case
text <- tolower(readLines(file_path, warn = FALSE))
# Collapse the lines into a single string
text <- paste(text, collapse = " ")
# Remove punctuation and extra whitespace
text <- gsub("[[:punct:]]", "", text)
text <- gsub("\\s+", " ", text)
# Split the text into words
words <- strsplit(text, " ")[[1]]
# Create an empty data frame for the n-grams
ngram_dataset <- data.frame()
# Build the n-grams
n <- 3 # order of the n-gram
for (i in 1:(length(words) - n + 1)) {
  ngram <- paste(words[i:(i + n - 1)], collapse = " ")
  next_word <- ifelse(i < (length(words) - n + 1), words[i + n], NA)
  ngram_dataset <- rbind(ngram_dataset, c(ngram, next_word))
}
# Name the columns
names(ngram_dataset) <- c("word", "next_word")
# Show the first rows of the dataset
head(ngram_dataset, 10)
## word next_word
## 1 i am an orphan
## 2 am an orphan reflected
## 3 an orphan reflected lydia
## 4 orphan reflected lydia raymond
## 5 reflected lydia raymond with
## 6 lydia raymond with immense
## 7 raymond with immense satisfaction
## 8 with immense satisfaction she
## 9 immense satisfaction she was
## 10 satisfaction she was a
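A side note on the construction above: growing a data frame with rbind inside a loop becomes slow as the text gets larger. A vectorized version could look like the following minimal sketch, assuming the same words vector and n as above; ngram_dataset2 is just an illustrative name, and it omits the final row whose next word would be NA.

# Vectorized alternative to the rbind loop (sketch only)
starts <- seq_len(length(words) - n)
ngram_dataset2 <- data.frame(
  word      = vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                     character(1)),
  next_word = words[starts + n],
  stringsAsFactors = FALSE
)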
The goal of this project is to analyze the ngram_dataset
and summarize the major features of the data, as well as outline plans
for creating a next word prediction algorithm and Shiny app.
Let’s start by examining some summary statistics of the
ngram_dataset:
summary(ngram_dataset)
## word next_word
## Length:55717 Length:55717
## Class :character Class :character
## Mode :character Mode :character
## Analysis of Data

### Frequency of N-grams

We can calculate the frequency of each n-gram in the word column:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
word_freq <- ngram_dataset %>%
group_by(word) %>%
summarise(count = n()) %>%
arrange(desc(count))
head(word_freq, 10)
## # A tibble: 10 × 2
## word count
## <chr> <int>
## 1 the project gutenberg 30
## 2 said aunt beryl 27
## 3 that she had 21
## 4 aunt beryl and 20
## 5 and uncle george 18
## 6 mr monteagle almond 18
## 7 of her own 18
## 8 project gutenberg electronic 18
## 9 i dont know 16
## 10 she did not 16
library(ggplot2)
# Plot the most frequent n-grams
ggplot(head(word_freq, 15), aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Top 15 Most Frequent N-grams",
       x = "N-gram",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Most Common Next Words

Let's also identify the most common next words in the next_word column:
next_word_freq <- ngram_dataset %>%
group_by(next_word) %>%
summarise(count = n()) %>%
arrange(desc(count))
head(next_word_freq, 10)
## # A tibble: 10 × 2
## next_word count
## <chr> <int>
## 1 the 2469
## 2 and 1596
## 3 to 1573
## 4 of 1557
## 5 a 1159
## 6 her 1122
## 7 she 891
## 8 in 825
## 9 was 761
## 10 that 756
## Plans for Prediction Algorithm and Shiny App

As we can see, the data needs further preparation: stop words and the frequently repeated articles and pronouns that dominate the next_word counts should be removed. The next step will be to use the ngram_dataset to train a prediction algorithm. We will develop an n-gram model to predict the next word in a sequence of words; a first sketch of such a lookup is shown below. Additionally, we plan to create a Shiny app interface for users to interact with the prediction algorithm.
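As a starting point for the n-gram model, a simple lookup can return the most frequent observed next word for a given three-word context. This is only a minimal sketch based on the ngram_dataset built above; the helper name predict_next_word is hypothetical, and a real model would need smoothing and back-off to shorter contexts for unseen phrases.

library(dplyr)

# Conditional counts: how often each next_word follows each three-word context
context_counts <- ngram_dataset %>%
  filter(!is.na(next_word)) %>%
  count(word, next_word, name = "count")

# Return the most frequent observed next word for a context (NA if unseen)
predict_next_word <- function(context) {
  context <- tolower(trimws(context))
  candidates <- context_counts %>%
    filter(word == context) %>%
    arrange(desc(count))
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$next_word[1]
}

# Example: returns whichever word most often follows this context in the book
predict_next_word("the project gutenberg")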
## Conclusion

This document provides a summary of the ngram_dataset and outlines plans for creating a next word prediction algorithm and Shiny app.
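As a final illustration of the planned app, a minimal Shiny interface wiring a text input to the hypothetical predict_next_word helper sketched above might look like this rough sketch (not the final design):

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a three-word phrase:", value = ""),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next_word(input$phrase)
  })
}

shinyApp(ui, server)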