# Read the full text of "The Adventures of Sherlock Holmes" (Project
# Gutenberg ebook #1661) and normalize it for tokenization.
file_path <- "/cloud/project/1661-0.txt"
text <- tolower(readLines(file_path))   # read all lines, lowercase
text <- paste(text, collapse = " ")     # collapse into one string
text <- gsub("[[:punct:]]", "", text)   # strip punctuation
text <- gsub("\\s+", " ", text)         # squeeze runs of whitespace
words <- strsplit(text, " ")[[1]]       # tokenize on single spaces
# Build a 4-gram table: each row holds a window of n = 3 consecutive words
# plus the word that follows the window (NA for the final window).
n <- 3
my_dataset <- data.frame(matrix(nrow = length(words) - n + 1, ncol = n + 1))
colnames(my_dataset) <- c("word1", "word2", "word3", "word4")
for (i in 1:(length(words) - n + 1)) {
  my_dataset[i, 1:n] <- words[i:(i + n - 1)]   # the three-word context
  my_dataset[i, n + 1] <- ifelse(i < (length(words) - n + 1), words[i + n], NA)   # the next word
}
head(my_dataset, 30)
## word1 word2 word3 word4
## 1 project gutenbergs the adventures
## 2 gutenbergs the adventures of
## 3 the adventures of sherlock
## 4 adventures of sherlock holmes
## 5 of sherlock holmes by
## 6 sherlock holmes by arthur
## 7 holmes by arthur conan
## 8 by arthur conan doyle
## 9 arthur conan doyle this
## 10 conan doyle this ebook
## 11 doyle this ebook is
## 12 this ebook is for
## 13 ebook is for the
## 14 is for the use
## 15 for the use of
## 16 the use of anyone
## 17 use of anyone anywhere
## 18 of anyone anywhere at
## 19 anyone anywhere at no
## 20 anywhere at no cost
## 21 at no cost and
## 22 no cost and with
## 23 cost and with almost
## 24 and with almost no
## 25 with almost no restrictions
## 26 almost no restrictions whatsoever
## 27 no restrictions whatsoever you
## 28 restrictions whatsoever you may
## 29 whatsoever you may copy
## 30 you may copy it
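As an aside, filling the data frame one row at a time is slow for a corpus of this size. A vectorized sketch of the same construction using plain indexing (it drops the final NA-padded row instead of keeping it):

# Vectorized alternative (sketch): build the four columns directly by
# offsetting the word vector, keeping only complete 4-grams.
m <- length(words) - n
ngram_df <- data.frame(word1 = words[1:m],
                       word2 = words[2:(m + 1)],
                       word3 = words[3:(m + 2)],
                       word4 = words[4:(m + 3)],
                       stringsAsFactors = FALSE)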
The purpose of this report is to analyze the n-gram dataset built above (my_dataset) and to report on progress towards a Next Word Prediction algorithm. This document is kept concise, focusing on the key characteristics of the dataset and outlining plans for developing the prediction algorithm and Shiny app.
Let’s begin by examining some summary statistics of the dataset:
summary(my_dataset)
## word1 word2 word3 word4
## Length:107545 Length:107545 Length:107545 Length:107545
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Analysis of Data

### Frequency of Word Combinations

We can calculate the frequency of word combinations in the dataset to identify common patterns:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Combine each 4-gram into a single phrase string, then count occurrences
my_dataset$phrase <- paste(my_dataset$word1, my_dataset$word2,
                           my_dataset$word3, my_dataset$word4, sep = " ")
phrase_freq <- my_dataset %>%
  group_by(phrase) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
head(phrase_freq, 10)
## # A tibble: 10 × 2
## phrase count
## <chr> <int>
## 1 i have no doubt 17
## 2 i think that i 17
## 3 have no doubt that 14
## 4 the adventure of the 14
## 5 gutenberg literary archive foundation 13
## 6 i do not know 13
## 7 project gutenberg literary archive 13
## 8 the project gutenberg literary 13
## 9 project gutenbergtm electronic works 12
## 10 it seemed to me 11
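For the prediction task itself, the more useful quantity is the conditional count: how often each fourth word follows a given three-word context. A sketch of that aggregation with the same dplyr verbs (the name next_word_freq is illustrative):

# Count how often each next word (word4) follows each three-word context
next_word_freq <- my_dataset %>%
  filter(!is.na(word4)) %>%
  count(word1, word2, word3, word4, sort = TRUE)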
### Visualization of Word Combinations

Let’s visualize the top 20 most frequent word combinations:
library(ggplot2)
# Plot the 20 most frequent word combinations as a bar chart
ggplot(head(phrase_freq, 20), aes(x = reorder(phrase, -count), y = count)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Top 20 Most Frequent Word Combinations",
       x = "Word Combination",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Plans for Prediction Algorithm and Shiny App

The next step will be to use this data to train a prediction algorithm that, given the preceding word combination, predicts the most likely next word. That algorithm will then be made available through a Shiny app.
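A minimal sketch of what a simple lookup-based predictor could look like, assuming the model is just the my_dataset counts (the function name and interface are illustrative, and a real version would need backoff for unseen contexts):

# Hypothetical helper: return the word most often seen after the given
# three-word context in my_dataset (NA when the context never occurs).
predict_next_word <- function(w1, w2, w3, data = my_dataset) {
  candidates <- data %>%
    filter(word1 == w1, word2 == w2, word3 == w3, !is.na(word4)) %>%
    count(word4, sort = TRUE)
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word4[1]
}
predict_next_word("i", "have", "no")   # plausibly "doubt", given the counts above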