In each step, you will process your data for common text data issues. Be sure to complete each step in R and Python separately, creating a clean text version in each language for comparison at the end. Update the saved clean text at each step; do not simply print it out.
##r chunk
##python chunk
Use the `rvest` package to import a webpage and process that text for html codes (i.e. take them out)!
##r chunk
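A minimal sketch of this step, assuming the Wikipedia article on natural language processing as the example page (any webpage works) and saving the result as `clean_text_r`:

```{r}
library(rvest)

# assumption: example URL only - swap in whatever page you are using
url <- "https://en.wikipedia.org/wiki/Natural_language_processing"
page <- read_html(url)                  # import the webpage
clean_text_r <- page %>%
  html_elements("p") %>%                # keep just the paragraph nodes
  html_text2() %>%                      # drop the html codes, keep the text
  paste(collapse = " ")                 # save as one character string
```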
Use the `requests` package to import the same webpage and use `BeautifulSoup` to clean up the html codes.
##python chunk
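A matching sketch in python, assuming the same example URL and saving the result as `clean_text_py`:

```{python}
import requests
from bs4 import BeautifulSoup

# assumption: same example URL as the R chunk above
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# keep only the paragraph text, dropping the html codes
clean_text_py = " ".join(p.get_text() for p in soup.find_all("p"))
```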
##r chunk
##python chunk
Use the `stringi` package to remove any symbols from your text.
##r chunk
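One way to do this with `stringi`, assuming `clean_text_r` is carried over from the earlier chunks: transliterate accented characters, then keep only letters, numbers, and spaces.

```{r}
library(stringi)

clean_text_r <- stri_trans_general(clean_text_r, "Latin-ASCII")             # accents to plain ASCII
clean_text_r <- stri_replace_all_regex(clean_text_r, "[^A-Za-z0-9 ]", " ")  # drop symbols
clean_text_r <- stri_replace_all_regex(clean_text_r, "\\s+", " ")           # collapse extra spaces
```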
Use `unicodedata` in python to remove any symbols from your text.
##python chunk
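The python counterpart, assuming `clean_text_py` from the earlier chunks; `unicodedata` normalizes the characters and a regular expression removes whatever symbols remain.

```{python}
import unicodedata
import re

# normalize, then drop characters that do not survive an ASCII encoding
clean_text_py = unicodedata.normalize("NFKD", clean_text_py)
clean_text_py = clean_text_py.encode("ascii", "ignore").decode("ascii")
# keep only letters, numbers, and spaces, then collapse extra whitespace
clean_text_py = re.sub(r"[^A-Za-z0-9 ]", " ", clean_text_py)
clean_text_py = re.sub(r"\s+", " ", clean_text_py).strip()
```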
##r chunk
##python chunk
Check and correct the spelling of your text using the `hunspell` package in R - it's ok to use the first, most probable option, like we did in class.
##r chunk
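A sketch of one way to apply those corrections, assuming `clean_text_r` is still a single space-separated string; each flagged word is replaced with its first suggestion.

```{r}
library(hunspell)

words <- unlist(strsplit(clean_text_r, " "))
correct <- hunspell_check(words)                   # TRUE when the word is spelled correctly
suggestions <- hunspell_suggest(words[!correct])   # suggestion lists for the flagged words
# take the first, most probable option; keep the original word if nothing is suggested
words[!correct] <- mapply(function(word, opts) {
  if (length(opts) > 0) opts[1] else word
}, words[!correct], suggestions)
clean_text_r <- paste(words, collapse = " ")
```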
Check and correct the spelling using `textblob` from python.
##python chunk
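A short sketch with `textblob`, which picks the most likely spelling for each word (this step can be slow on long texts):

```{python}
from textblob import TextBlob

# correct() returns a TextBlob, so convert back to a plain string
clean_text_py = str(TextBlob(clean_text_py).correct())
```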
Lemmatize your text using the `textstem` package in R.
##r chunk
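A minimal sketch with `textstem`, assuming the goal is lemmas rather than stems:

```{r}
library(textstem)

# lemmatize_strings() replaces each word with its dictionary lemma
clean_text_r <- lemmatize_strings(clean_text_r)
```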
Lemmatize your text using `spacy` in python.
##python chunk
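The spaCy version, assuming the small English model (`en_core_web_sm`) has already been downloaded:

```{python}
import spacy

# assumption: run `python -m spacy download en_core_web_sm` once beforehand
nlp = spacy.load("en_core_web_sm")
doc = nlp(clean_text_py)
clean_text_py = " ".join(token.lemma_ for token in doc)
```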
##r chunk
##python chunk
Use the `tokenize_words` function to create a set of words for your R clean text.
##r chunk
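A sketch using `tokenize_words()` from the `tokenizers` package; it returns a list, so the result is unlisted into a simple vector of words.

```{r}
library(tokenizers)

r_tokens <- unlist(tokenize_words(clean_text_r))   # one lowercase word per element
head(r_tokens)
```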
Use `nltk` or `spacy` to tokenize the words from your python clean text.
##python chunk
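A sketch using the `nltk` option, assuming the `punkt` tokenizer models are available; the text is lower cased so the tokens line up with the default behavior of `tokenize_words()` in R.

```{python}
import nltk
nltk.download("punkt", quiet=True)        # tokenizer models, only needed once
from nltk.tokenize import word_tokenize

py_tokens = word_tokenize(clean_text_py.lower())
py_tokens[:10]
```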
##r chunk
##python chunk
Note: here you can print out, summarize, or otherwise view your text in any way you want.