This project lets you apply what you have learned in the course to a new dataset. Although the final project is structured more like a research project, the same knowledge can be used to answer questions in any field, and you will also practice the skill of writing a report that others can read. The dataset must be related to language or language processing in some way, and you must use an analysis we learned in class.
This assignment is preparation for the final project and focuses on text cleaning. You will find a dataset that matches what you are interested in for your final project (likely sentiment analysis, but entity recognition or another classification problem would be acceptable as well). You will import your dataset and clean the data using the steps listed below. You can change datasets between now and the final, but this project should get the code ready for the data cleaning section.
Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data?
You should include code to perform the following steps:
You can perform this analysis in Python or R. You will turn in a knitted file that shows the steps of the code, along with the final print out of the first few words for the finalized data. Be sure to save the data at each step and do not print it out until the end (you can make it print temporarily for yourself, but the final report should not be pages and pages of text printed out).
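For anyone doing the assignment in Python rather than R, the "save each step, print only at the end" workflow might look roughly like the sketch below. It uses only the standard library. The sample reviews, the `CONTRACTIONS` and `STOPWORDS` tables, and all function names are hypothetical placeholders, not part of the assignment; a real project would use fuller word lists.

```python
import unicodedata

# Hypothetical sample reviews standing in for the real dataset
reviews = [
    "I can\u2019t BELIEVE it\u2019s this good!",
    "The caf\u00e9 near me doesn't sell this phone.",
]

# Tiny illustrative lookup tables (assumptions, not complete lists)
CONTRACTIONS = {"can't": "cannot", "it's": "it is", "doesn't": "does not"}
STOPWORDS = {"i", "the", "it", "is", "this", "me", "near"}


def to_ascii(text):
    """Map curly apostrophes to ASCII and strip accents from letters."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def expand_contractions(text):
    """Replace each known contraction with its full form."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def remove_stopwords(text):
    """Drop whitespace-separated tokens that are in the stopword set."""
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# Save the result of every step under its own name instead of printing it
steps = {}
steps["lowered"] = [r.lower() for r in reviews]
steps["ascii"] = [to_ascii(r) for r in steps["lowered"]]
steps["expanded"] = [expand_contractions(r) for r in steps["ascii"]]
steps["no_stop"] = [remove_stopwords(r) for r in steps["expanded"]]

# Print only a short preview of the final data
print(steps["no_stop"][:2])
```

Keeping every intermediate result in `steps` makes it easy to inspect one stage while debugging without filling the final report with printed text.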
library("xml2")
library("rvest")
library("stringr")
library("stringi")
library("textclean")
library("NLP")
library("tm")
library(readr)
##
## Attaching package: 'readr'
## The following object is masked from 'package:rvest':
##
## guess_encoding
reviews <- read_csv("20190928-reviews.CSV")
## Parsed with column specification:
## cols(
## asin = col_character(),
## name = col_character(),
## rating = col_double(),
## date = col_character(),
## verified = col_logical(),
## title = col_character(),
## body = col_character(),
## helpfulVotes = col_double()
## )
phone_review<-reviews$body[1:500]
lowered_review<-tolower(phone_review)
## transliterate non-ASCII characters (accented letters, curly quotes) to ASCII
no_symbol_review<-stri_trans_general(str = lowered_review,
id = "Latin-ASCII")
## expand contractions into their full forms (textclean is already loaded above)
clean_review <- replace_contraction(no_symbol_review,
  contraction.key = lexicon::key_contractions, # default
  ignore.case = TRUE)
library(hunspell)
# Spell check the words; keep each distinct misspelling once
spelling.errors <- unique(unlist(hunspell(clean_review)))
spelling.sugg <- hunspell_suggest(spelling.errors, dict = dictionary("en_US"))
# Pick the first suggestion for each misspelling (NA when there is none)
spelling.sugg <- unlist(lapply(spelling.sugg, function(x) x[1]))
# One row per misspelling, so both columns have the same length
spelling.dict <- data.frame(spelling.errors, spelling.sugg, stringsAsFactors = FALSE)
# Drop misspellings that have no suggestion
spelling.dict <- spelling.dict[!is.na(spelling.dict$spelling.sugg), ]
spelling.dict$spelling.pattern <- paste0("\\b", spelling.dict$spelling.errors, "\\b")
no_error_review <- stri_replace_all_regex(str = clean_review,
  pattern = spelling.dict$spelling.pattern,
  replacement = spelling.dict$spelling.sugg,
  vectorize_all = FALSE)
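The whole-word replacement idea used in the spelling step (wrapping each misspelling in `\b` word boundaries) can also be sketched in Python with the standard `re` module. The `corrections` table and the sample reviews below are hypothetical; in practice the table would come from a spell checker, with words that have no suggestion left out.

```python
import re

# Hypothetical misspelling -> correction table (an assumption for illustration)
corrections = {"teh": "the", "recieve": "receive", "fone": "phone"}

# One pattern that matches any listed misspelling, but only as a whole word
pattern = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
)


def correct_spelling(text):
    """Replace each whole-word misspelling with its listed correction."""
    return pattern.sub(lambda match: corrections[match.group(1)], text)


# Hypothetical, already lowercased reviews
reviews = ["i recieve teh fone today", "a flight to tehran"]
corrected = [correct_spelling(r) for r in reviews]
print(corrected)
```

Because of the `\b` boundaries, "teh" is corrected when it stands alone but left untouched inside "tehran".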
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
##
## available.koRpus.lang()
##
## and see ?install.koRpus.lang()
##
## Attaching package: 'koRpus'
## The following object is masked from 'package:readr':
##
## tokenize
lemmatize_review<-lemmatize_strings(no_error_review)
no_stop_review<-removeWords(lemmatize_review, stopwords(kind = "SMART"))
no_stop_review[1:10]
## [1] " rename a600 awhile absolute doo doo. read review detect rage stupid thing. finally die phone buy garage sale $1. sell cheap?... bad: ===> hate menu. forever scroll endlessly. phone number category simply press # . ===> pain put silent vibrate. class ring, turn immediately. fast silence damn thing. remember put silent! learn hard . ===> true case. mission break ur nail process. , damage case time . reason phone start give problem succeed open . ===> button bite big. vibration strong. good: ===> reception shabby. elevator remarkable feat phone lose service simply putting pocket. ===> compare rename, phone work good. ring tone loud hear phone charge quickly great battery life. heat potatoe oven long phone convos. ===> nice bright, large screen. ===> cute customize . scroll bar set purple, pink, aqua, orange, . : phone. serve purpose pale comparison phone sprint. great?"
## [2] "due software issue sprint phone' text message capability work sprint' system software patch \" time month \". spend 1 hour sprint' award win customer service team find admit . problem design phone incoming message retrieve quickly view \" offline \" provider work. sprint, , people hook server stay connect, burn minute check inbox, compose reply wait sprint server respond send . innovation money - make fine."
## [3] " great, reliable phone. purchase phone rename a460 die. menu easily comprehendable speed dial 300 number. voice dial nice feature, long speed dial. thing bother game... snake ( 1 2 ) phone. skydiving game, bowl, tennis ( pong ). ringer nice, feature choose ringer person call. , ringtones online download phone. pretty stick . vibrate ringtones regular ( midi ) polyphonic tone. cover reasonable price range..."
## [4] " love phone , , expect price bill receive . , I phone month receive free accessory suppose phone. time call company, wait couple week, receive shortly. , love phone ; I talk make phone call!: )"
## [5] " phone great purpose offer, day buy - couldnt case . case put picture jaket super cool, back store - employee hard . good, barely - close snap case half. isnt big deal, dirty clear case dirty . make case ! charge time."
## [6] ", phone decide buy flip phone. problem battery case - - fish case stay good original . add ring tone, music good - - matter fact sell phone, home charger car charger great deal - - email - - (... )"
## [7] "cool. cheap. color: 3 word describe 3588 perfectly. , ? beauty' classic, sleek jerk 3310' . 3588 amaze ringtones. cellphone shop, experience. rate 5th good phone I store. prompt buy , pal? good, !: 3599 include alarm clock, calendar, voice recorder, calculator stopwatch. trust , good money tight budget. , teenager, phone; crazy color lcd ( happen oversize ). phone' goin' adult, hit da soda pop , lucky person! convince mobile' real! ( )."
## [8] " 3599i nice phone, make universal headset jack incompatible 2. 5mm universal headset ( ). phone call tech support require find buy headset plug pop - port accessory connector bottom phone. end suggest hdb - 5 headset."
## [9] "I phone , . phone Samsung, reception great... signal strength show low. quality phone great , feel sturdy weight. highly recommend phone."
## [10] " good im school text message phone month text message suppose start work 2weeks ago. conflict software type stuff btween sprint. ( 4 star ) fact download ringtones. ( 3stars ) game ( 2 star ) beat nearfree free color phone ( back 3 star ) text message 4 star."