As new phones and electronic devices emerge, the need to complete tasks quickly is also rising. SwiftKey is a leading keyboard text-prediction product that predicts the next word or phrase as one starts typing. The goal of this project is to predict a possible next word as users type, in order to suggest words and improve typing speed using SwiftKey data. This report focuses mainly on the exploratory analysis of the data to develop insight into it. The full code can be accessed here: My github
## NLP tools
library("R.utils")
library(NLP)
library(tm)
## Visualization
library(cowplot) #multi panel plot
library(ggplot2)
library("wordcloud") # word-cloud generator
library(plotly) #interactive plot
## Text cleaning
library(stringr); library("tidytext")
## For messy data cleaning
library(dplyr); library(tidyverse)
set.seed(2222)
This dataset was downloaded from SwiftKey, a leading keyboard text-prediction product, through the Coursera Data Science Capstone offered by Johns Hopkins University, and can be accessed here.
Data
data_folder <- "Data"
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_name <- "Coursera-SwiftKey.zip"
download_path <- paste(getwd(), zip_name, sep = "/")
if (!file.exists(download_path)){
download.file(zip_url, destfile=download_path, method="curl")
}
unzip(zipfile=download_path, exdir=data_folder) ## extracted folder is "final"
Read all text files in the English-language directory and find, for each file, the number of lines, the index of the longest line, and the total number of words.
flist <- list.files("Data/final", pattern = ".*en_.*.txt", recursive = T)
get_info <- lapply(paste("Data/final", flist, sep = "/"), function(filename){
## Check file size
fsize = file.size(filename)[1]/1024/1024
## Read subset of data
con <- file(filename, "r") #open connection
all_lines = readLines(con)
#readLines(con, 5) ## read first 5 lines
## check number of lines in the file
#NROW(readLines(con_en_us_twitter))
nlines <- sapply(filename, countLines)
##longest line
nchars <- nchar(all_lines) ## character count per line (vectorized)
longest_line <- which.max(nchars) ## determine index of the longest line
## split based on one or more white spaces and find length
nwords <- sum(sapply(strsplit(all_lines, "\\s+"), length))
close(con)
return(c(filename, round(as.numeric(fsize),2), nlines, longest_line, nwords))
})
us_df <- data.frame(matrix(unlist(get_info), nrow=length(get_info), byrow=T))
colnames(us_df) <- c("file", "fsize", "num_of_lines", "longest_line", "nwords")
us_df
## file fsize num_of_lines longest_line nwords
## 1 Data/en_US/en_US.blogs.txt 200.42 899288 483415 37334149
## 2 Data/en_US/en_US.news.txt 196.28 1010242 123628 34372814
## 3 Data/en_US/en_US.twitter.txt 159.36 2360148 26 30373565
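Since plotly was loaded above for interactive plots, the word counts in this summary can also be compared visually. The following is a minimal sketch, assuming the us_df data frame built above; note that data.frame(matrix(unlist(...))) stores every column as character, so the count is converted to numeric first.
## Illustrative interactive bar chart of word counts per file (sketch, not part of the original analysis)
us_df$nwords <- as.numeric(us_df$nwords)
plot_ly(us_df, x = ~file, y = ~nwords, type = "bar") %>%
  layout(title = "Word counts per US source", yaxis = list(title = "Number of words"))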
Sample 10% of the data to explore the most frequent words in each source.
count_words_l <- lapply(paste("Data/final", flist, sep = "/"), function(path){
## Read file
#path = "Data/en_US/en_US.blogs.txt"
con <- file(path, "r") # open connection
all_lines = readLines(con)
close(con) #connection
## number of lines
nlines = sapply(path, countLines)
## Sample 10% of words
sub_lines <- all_lines[sample(1:nlines, nlines*0.1, replace = FALSE)]
#remove paragraphs
sub_lines <- paste(sub_lines, collapse = " ") %>% stringr::str_replace_all(pattern = "\n", replacement = "")
text_df <- tibble(Text = sub_lines) ## put neatly into a data frame
text_words <- text_df %>% unnest_tokens(output = word, input = Text) # make each row per word
## Remove stop words
data("stop_words")
text_words <- text_words %>% anti_join(stop_words)
##check for most frequent words
count_words <- text_words %>% count(word, sort = TRUE)
return(count_words)
})
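The wordcloud package loaded above is a natural way to show these frequencies. Below is a minimal sketch, assuming the count_words_l list built in the previous chunk, where the first element corresponds to the blogs sample.
## Word cloud of the most frequent blog words (illustrative sketch)
library(RColorBrewer)
blog_counts <- count_words_l[[1]]
wordcloud(words = blog_counts$word, freq = blog_counts$n,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))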
topn <- 20
### ggplot summary of the top words
top_n_words_blogs <- count_words_l[[1]] %>% filter(n>count_words_l[[1]]$n[topn]) %>%
mutate(word = reorder(word, n))
top_n_words_news <- count_words_l[[2]] %>%
filter(n>count_words_l[[2]]$n[topn]) %>% mutate(word = reorder(word, n))
top_n_words_tweet <- count_words_l[[3]] %>%
filter(n>count_words_l[[3]]$n[topn]) %>% mutate(word = reorder(word, n))
US Blogs have the highest number of words and the longest lines relative to US News and Twitter; Twitter does not use long lines.
US News contains more lines than Blogs and Twitter.
US Blogs focus more on words such as time, love, and people; News words are more about schools, county, and city, while the most frequent words on Twitter are follow, people, night, and weekend.
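The side-by-side comparison behind these observations can be drawn with ggplot2 and cowplot (both loaded above, cowplot for multi-panel plots). A minimal sketch, assuming the top_n_words_* data frames created in the previous chunk:
## Multi-panel bar chart of the top words per source (illustrative sketch)
p_blogs <- ggplot(top_n_words_blogs, aes(x = word, y = n)) +
  geom_col() + coord_flip() + labs(title = "US Blogs", x = NULL, y = "Count")
p_news <- ggplot(top_n_words_news, aes(x = word, y = n)) +
  geom_col() + coord_flip() + labs(title = "US News", x = NULL, y = "Count")
p_tweet <- ggplot(top_n_words_tweet, aes(x = word, y = n)) +
  geom_col() + coord_flip() + labs(title = "US Twitter", x = NULL, y = "Count")
plot_grid(p_blogs, p_news, p_tweet, ncol = 3)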