Objective

This project lets you apply what you have learned in the course to a new dataset. Although the final project is structured as a research project, the same knowledge can be used to answer questions in any field, and the project also builds the practical skill of writing a report that others can read. The dataset must be related to language or language processing in some way, and you must use an analysis we learned in class.

This assignment is preparation for the final project and focuses on text cleaning. You will find a dataset that matches what you are interested in for your final project (likely sentiment analysis, but entity recognition or another classification problem would be acceptable as well). You will import your dataset and clean the data using the steps listed below. You can change datasets between now and the final, but this project should get the code ready for the data cleaning section.

Method - Data - Variables

Explain the data you have selected to study. You can find data through the many available corpora or other datasets online (be sure to ask for help here!). How was the data collected? Who or what is in the data?

Clean the Data

You should include code to perform the following steps:

  • Lower case
  • Remove symbols
  • Remove contractions
  • Fix spelling errors
  • Lemmatize the words
  • Remove stopwords

You can perform this analysis in Python or R. You will turn in a knitted file that shows the steps of the code, along with a final printout of the first few words of the finalized data. Be sure to save the data at each step, and do not print it until the end (you can print it temporarily for yourself, but the final report should not be pages and pages of printed text).
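As a rough illustration of the six steps above, and of keeping each intermediate result while printing only a short preview at the end, here is a minimal Python sketch. It uses only the standard library. The lookup tables (CONTRACTIONS, SPELLING, LEMMAS, STOPWORDS), the helper names, and the sample reviews are all hypothetical stand-ins for a real contraction list, spell checker, lemmatizer, and stopword list. One assumption worth noting: the symbol step keeps apostrophes so that the contraction step can still match them.

```python
import re

# Toy lookup tables (hypothetical, for illustration only). A real project
# would use a full contraction list, spell checker, lemmatizer, and stopword list.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
SPELLING = {"fone": "phone", "grate": "great"}
LEMMAS = {"phones": "phone", "batteries": "battery", "dies": "die"}
STOPWORDS = {"the", "a", "is", "it", "do", "not", "but", "and", "my", "i"}


def remove_symbols(text):
    # Normalise curly apostrophes, then replace everything except lowercase
    # letters, digits, whitespace, and apostrophes with a space.
    text = text.replace("\u2019", "'")
    return re.sub(r"[^a-z0-9\s']", " ", text)


def expand_contractions(text):
    # Replace each known contraction with its expanded form.
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text


def map_words(text, table):
    # Replace whole words found in the lookup table; leave others unchanged.
    return " ".join(table.get(word, word) for word in text.split())


def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def clean_reviews(reviews):
    # Save the data at every step so each stage can be inspected later.
    steps = {}
    steps["lowered"] = [r.lower() for r in reviews]
    steps["no_symbol"] = [remove_symbols(r) for r in steps["lowered"]]
    steps["no_contraction"] = [expand_contractions(r) for r in steps["no_symbol"]]
    steps["spell_fixed"] = [map_words(r, SPELLING) for r in steps["no_contraction"]]
    steps["lemmatized"] = [map_words(r, LEMMAS) for r in steps["spell_fixed"]]
    steps["no_stop"] = [remove_stopwords(r) for r in steps["lemmatized"]]
    return steps


sample_reviews = [
    "It's a GRATE fone, but the batteries don't last!!!",
    "My phones can\u2019t hold a charge :(",
]
steps = clean_reviews(sample_reviews)

# Print only a short preview of the final step, not every intermediate stage.
print(steps["no_stop"][:2])
# ['great phone battery last', 'phone cannot hold charge']
```

Because every stage is stored in the steps dictionary, any intermediate result (for example steps["no_symbol"]) can be checked while debugging without adding pages of output to the final report.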

library("xml2")
library("rvest")
library("stringr")
library("stringi")
library("textclean")
library("NLP")
library("tm")
library(readr)
## 
## Attaching package: 'readr'
## The following object is masked from 'package:rvest':
## 
##     guess_encoding
reviews <- read_csv("20190928-reviews.CSV")
## Parsed with column specification:
## cols(
##   asin = col_character(),
##   name = col_character(),
##   rating = col_double(),
##   date = col_character(),
##   verified = col_logical(),
##   title = col_character(),
##   body = col_character(),
##   helpfulVotes = col_double()
## )
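## Keep the body text of the first 500 reviews
phone_review <- reviews$body[1:500]
## Lower case
lowered_review <- tolower(phone_review)
## Remove symbols: transliterate non-ASCII characters (e.g. curly quotes,
## accented letters) to plain ASCII; ASCII punctuation such as "$" is left in place
no_symbol_review <- stri_trans_general(str = lowered_review,
                                       id = "Latin-ASCII")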
## Remove contractions by expanding them (e.g. "don't" becomes "do not")
clean_review <- replace_contraction(no_symbol_review,
                    contraction.key = lexicon::key_contractions, # default
                    ignore.case = TRUE)
library(hunspell)

# Spell check the words; hunspell() returns one vector of errors per review,
# so flatten the list and keep each misspelled word once
spelling.errors <- unique(unlist(hunspell(clean_review)))
spelling.sugg <- hunspell_suggest(spelling.errors, dict = dictionary("en_US"))

# Pick the first suggestion (NA when hunspell offers none)
spelling.sugg <- vapply(spelling.sugg, function(x) x[1], character(1),
                        USE.NAMES = FALSE)
# One row per misspelled word, so errors and suggestions stay aligned
spelling.dict <- data.frame(spelling.errors, spelling.sugg,
                            stringsAsFactors = FALSE)
spelling.dict <- spelling.dict[!is.na(spelling.dict$spelling.sugg), ]
spelling.dict$spelling.pattern <- paste0("\\b", spelling.dict$spelling.errors, "\\b")

no_error_review<-stri_replace_all_regex(str = clean_review,
                       pattern = spelling.dict$spelling.pattern,
                       replacement = spelling.dict$spelling.sugg,
                       vectorize_all = FALSE)
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
## 
##   available.koRpus.lang()
## 
## and see ?install.koRpus.lang()
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
## 
##     tokenize
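## Lemmatize the words
lemmatize_review <- lemmatize_strings(no_error_review)
## Remove stopwords using the SMART stopword list from tm
no_stop_review <- removeWords(lemmatize_review, stopwords(kind = "SMART"))
## Print only the first 10 cleaned reviews
no_stop_review[1:10]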
##  [1] "   rename a600  awhile   absolute doo doo.   read  review    detect  rage   stupid thing.  finally die        phone  buy   garage sale  $1.     sell   cheap?... bad: ===>  hate  menu.   forever           scroll endlessly.  phone  number category    simply press  #       . ===>    pain  put   silent  vibrate.     class   ring,    turn   immediately.    fast   silence  damn thing.  remember  put   silent!  learn   hard . ===>    true   case.    mission      break ur nail   process. ,   damage  case  time  .   reason  phone start give  problem    succeed  open . ===> button    bite big. vibration   strong. good: ===> reception    shabby.       elevator    remarkable feat    phone  lose service  simply putting    pocket. ===> compare    rename,  phone work  good.  ring tone  loud   hear   phone  charge quickly   great battery life.    heat    potatoe   oven   long phone convos. ===> nice bright, large screen. ===> cute   customize . scroll bar   set  purple, pink, aqua, orange, . :  phone.  serve  purpose   pale  comparison    phone    sprint.         great?"
##  [2] "due   software issue    sprint  phone' text message capability   work  sprint' system      software patch   \"  time     month \".     spend   1 hour  sprint' award win customer service team  find    admit   .  problem    design  phone   incoming message  retrieve quickly   view \" offline \"    provider work. sprint, ,    people hook    server    stay connect, burn minute   check  inbox, compose  reply  wait   sprint server  respond    send  . innovation  money - make   fine."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [3] "   great, reliable phone.   purchase  phone   rename a460 die.  menu  easily comprehendable  speed dial     300 number. voice dial    nice feature,    long  speed dial.   thing  bother    game...      snake ( 1  2 )   phone.    skydiving game, bowl,  tennis (  pong ).  ringer   nice,   feature    choose   ringer   person call. , ringtones    online  download   phone.   pretty  stick    .   vibrate ringtones  regular ( midi ) polyphonic tone.     cover   reasonable price range..."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
##  [4] " love  phone  ,      ,     expect  price   bill   receive . , I    phone      month       receive  free accessory   suppose     phone.  time  call  company,      wait  couple  week,      receive  shortly.   ,   love  phone          ;  I    talk  make  phone call!: )"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [5] " phone   great   purpose  offer,   day  buy  -  couldnt   case .     case   put   picture   jaket   super cool,      back   store -  employee       hard   . good,   barely     -   close  snap  case  half.            isnt  big   deal,    dirty   clear case     dirty . make      case   !      charge     time."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [6] ",    phone      decide  buy  flip phone.     problem   battery   case - -     fish case      stay  good   original .   add ring tone,  music     good   - -   matter  fact     sell  phone,  home charger   car charger   great deal - - email  - - (... )"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [7] "cool. cheap. color: 3 word  describe   3588 perfectly.  ,     ?  beauty'  classic, sleek   jerk       3310' .  3588    amaze ringtones.     cellphone shop,     experience.  rate   5th good phone  I       store.     prompt  buy , pal? good,    !:   3599 include  alarm clock, calendar, voice recorder, calculator  stopwatch.  trust ,  good   money      tight budget.  ,          teenager,       phone;    crazy      color lcd (  happen     oversize ).   phone' goin'   adult,   hit da soda pop ,  lucky person!    convince   mobile' real! (   )."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
##  [8] " 3599i    nice phone,    make  universal headset jack incompatible   2. 5mm universal headset (    ).  phone call  tech support  require  find      buy  headset  plug   pop - port accessory connector   bottom   phone.  end  suggest  hdb - 5 headset."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
##  [9] "I      phone ,       .     phone Samsung,  reception  great...    signal strength show   low.  quality   phone  great ,  feel sturdy  weight.  highly recommend  phone."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [10] " good im  school     text message     phone    month  text message  suppose  start work 2weeks ago.  conflict  software type stuff  btween   sprint. (    4 star )   fact   download ringtones. ( 3stars )  game ( 2 star )    beat  nearfree  free color phone ( back  3 star )      text message    4 star."