In each step, you will process your data for common text data issues. Be sure to complete each step in R and Python separately, creating a clean text version in each language for comparison at the end. Update the saved clean text at each step; do not simply print it out.
##r chunk
##python chunk
Use the rvest package to import a webpage and process that text for HTML codes (i.e., take them out)!

##r chunk
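For example, a minimal sketch of this step might look like the following; the URL and the clean_text_r object name are placeholders, not requirements of the assignment.

```r
library(rvest)

# placeholder URL - swap in the page you chose for the assignment
webpage <- read_html("https://en.wikipedia.org/wiki/Natural_language_processing")

# pull the visible text out of the paragraph nodes, dropping the HTML tags
clean_text_r <- webpage %>%
  html_nodes("p") %>%
  html_text()

# collapse into one character string and save it as the working clean text
clean_text_r <- paste(clean_text_r, collapse = " ")
```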
Use the requests package to import the same webpage and use BeautifulSoup to clean up the HTML codes.

##python chunk
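A matching sketch in Python, again assuming the placeholder URL and the clean_text_py object name:

```python
import requests
from bs4 import BeautifulSoup

# same placeholder URL as the R chunk - use the page you picked
page = requests.get("https://en.wikipedia.org/wiki/Natural_language_processing")
soup = BeautifulSoup(page.text, "html.parser")

# keep only the visible paragraph text, dropping the HTML tags
clean_text_py = " ".join(p.get_text() for p in soup.find_all("p"))
```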
##r chunk
##python chunk
Use the stringi package to remove any symbols from your text.

##r chunk
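One possible sketch, assuming the running clean text is stored in clean_text_r; exactly which characters you keep (for example, apostrophes for contractions) is up to you:

```r
library(stringi)

# transliterate accented characters to plain ASCII
clean_text_r <- stri_trans_general(clean_text_r, "Latin-ASCII")

# drop anything that is not a letter, number, apostrophe, or whitespace
clean_text_r <- stri_replace_all_regex(clean_text_r, "[^\\p{L}\\p{N}\\s']", " ")

# squeeze the extra whitespace left behind
clean_text_r <- stri_replace_all_regex(clean_text_r, "\\s+", " ")
```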
Use unicodedata in Python to remove any symbols from your text.

##python chunk
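A matching sketch in Python with unicodedata, assuming clean_text_py holds the running clean text; this decomposes accented characters and keeps only the ASCII pieces:

```python
import unicodedata

# decompose accented characters, drop the non-ASCII pieces, and re-decode to a plain string
clean_text_py = (unicodedata.normalize("NFKD", clean_text_py)
                 .encode("ascii", "ignore")
                 .decode("utf-8", "ignore"))
```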
##r chunk
##python chunk
Check the spelling in your text with the hunspell package in R - it's ok to use the first, most probable option, like we did in class.

##r chunk
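A sketch of one way to take the first suggestion for each flagged word; clean_text_r is the assumed running clean text, and replacing every flagged token blindly is a simplification you may want to refine:

```r
library(hunspell)
library(stringi)

# flag the misspelled tokens and pull the suggestion list for each one
misspelled <- unique(unlist(hunspell(clean_text_r)))
suggestions <- hunspell_suggest(misspelled)

# keep the first (most probable) suggestion, falling back to the original word
first_choice <- vapply(seq_along(misspelled), function(i) {
  if (length(suggestions[[i]]) > 0) suggestions[[i]][1] else misspelled[i]
}, character(1))

# swap each misspelling for its correction in the saved clean text
if (length(misspelled) > 0) {
  clean_text_r <- stri_replace_all_fixed(clean_text_r,
                                         pattern = misspelled,
                                         replacement = first_choice,
                                         vectorize_all = FALSE)
}
```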
Check the spelling in your text with textblob from Python.

##python chunk
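A minimal sketch with TextBlob's built-in corrector, assuming clean_text_py; note that .correct() can be slow on long pages:

```python
from textblob import TextBlob

# .correct() applies the most probable spelling fix to each flagged word
clean_text_py = str(TextBlob(clean_text_py).correct())
```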
Lemmatize your R clean text using textstem.

##r chunk
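A minimal sketch, assuming the running clean text is in clean_text_r:

```r
library(textstem)

# replace every word with its dictionary lemma and save it back
clean_text_r <- lemmatize_strings(clean_text_r)
```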
Lemmatize your Python clean text using spacy.

##python chunk
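A minimal sketch, assuming clean_text_py and that the small English model has already been installed (python -m spacy download en_core_web_sm):

```python
import spacy

# load the small English pipeline (an assumed model choice)
nlp = spacy.load("en_core_web_sm")

# rebuild the clean text from each token's lemma
doc = nlp(clean_text_py)
clean_text_py = " ".join(token.lemma_ for token in doc)
```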
##r chunk
##python chunk
Use the tokenize_words function to create a set of words for your R clean text.

##r chunk
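A minimal sketch, assuming clean_text_r and that tokenize_words() comes from the tokenizers package:

```r
library(tokenizers)

# split the clean text into a single vector of word tokens
clean_tokens_r <- unlist(tokenize_words(clean_text_r))
head(clean_tokens_r, 20)
```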
Use nltk or spacy to tokenize the words from your Python clean text.

##python chunk
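A minimal sketch using nltk's word tokenizer, assuming clean_text_py; the tokenizer model is downloaded on first use:

```python
import nltk
nltk.download("punkt", quiet=True)  # newer nltk releases may also need "punkt_tab"
from nltk.tokenize import word_tokenize

# split the clean text into a list of word tokens
clean_tokens_py = word_tokenize(clean_text_py)
clean_tokens_py[:20]
```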
##r chunk
##python chunk
Note: here you can print out, summarize, or otherwise view your text in any way you want.