In each step, you will process your data for common text data issues. Be sure to complete each step in R and Python separately, creating a clean text version in each language for comparison at the end. Update the saved clean text at each step; do not simply print it out.
##r chunk
##python chunk
Use the `rvest` package to import a webpage and process that text for html codes (i.e. take them out)!
##r chunk
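A minimal sketch of this step, assuming the Wikipedia article on natural language processing as the example page (any webpage works) and saving the result as `clean_text_r`:

```{r}
library(rvest)

# assumption: example URL only - swap in whatever page you are using
url <- "https://en.wikipedia.org/wiki/Natural_language_processing"
page <- read_html(url)                  # import the webpage
clean_text_r <- page %>%
  html_elements("p") %>%                # keep just the paragraph nodes
  html_text2() %>%                      # drop the html codes, keep the text
  paste(collapse = " ")                 # save as one character string
```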
Use the `requests` package to import the same webpage and use `BeautifulSoup` to clean up the html codes.
##python chunk
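A matching sketch in python, assuming the same example URL and saving the result as `clean_text_py`:

```{python}
import requests
from bs4 import BeautifulSoup

# assumption: same example URL as the R chunk above
url = "https://en.wikipedia.org/wiki/Natural_language_processing"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# keep only the paragraph text, dropping the html codes
clean_text_py = " ".join(p.get_text() for p in soup.find_all("p"))
```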
##r chunk
##python chunk
Use the `stringi` package to remove any symbols from your text.
##r chunk
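One way to do this with `stringi`, assuming `clean_text_r` is carried over from the earlier chunks: transliterate accented characters, then keep only letters, numbers, and spaces.

```{r}
library(stringi)

clean_text_r <- stri_trans_general(clean_text_r, "Latin-ASCII")             # accents to plain ASCII
clean_text_r <- stri_replace_all_regex(clean_text_r, "[^A-Za-z0-9 ]", " ")  # drop symbols
clean_text_r <- stri_replace_all_regex(clean_text_r, "\\s+", " ")           # collapse extra spaces
```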
Use `unicodedata` in python to remove any symbols from your text.
##python chunk
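The python counterpart, assuming `clean_text_py` from the earlier chunks; `unicodedata` normalizes the characters and a regular expression removes whatever symbols remain.

```{python}
import unicodedata
import re

# normalize, then drop characters that do not survive an ASCII encoding
clean_text_py = unicodedata.normalize("NFKD", clean_text_py)
clean_text_py = clean_text_py.encode("ascii", "ignore").decode("ascii")
# keep only letters, numbers, and spaces, then collapse extra whitespace
clean_text_py = re.sub(r"[^A-Za-z0-9 ]", " ", clean_text_py)
clean_text_py = re.sub(r"\s+", " ", clean_text_py).strip()
```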
##r chunk
##python chunk
Check and correct the spelling of your text using the `hunspell` package in R - it's ok to use the first, most probable option, like we did in class.
##r chunk
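A sketch of one way to apply those corrections, assuming `clean_text_r` is still a single space-separated string; each flagged word is replaced with its first suggestion.

```{r}
library(hunspell)

words <- unlist(strsplit(clean_text_r, " "))
correct <- hunspell_check(words)                   # TRUE when the word is spelled correctly
suggestions <- hunspell_suggest(words[!correct])   # suggestion lists for the flagged words
# take the first, most probable option; keep the original word if nothing is suggested
words[!correct] <- mapply(function(word, opts) {
  if (length(opts) > 0) opts[1] else word
}, words[!correct], suggestions)
clean_text_r <- paste(words, collapse = " ")
```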
Check and correct the spelling using `textblob` from python.
##python chunk
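A short sketch with `textblob`, which picks the most likely spelling for each word (this step can be slow on long texts):

```{python}
from textblob import TextBlob

# correct() returns a TextBlob, so convert back to a plain string
clean_text_py = str(TextBlob(clean_text_py).correct())
```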
Lemmatize your text using the `textstem` package in R.
##r chunk
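A minimal sketch with `textstem`, assuming the goal is lemmas rather than stems:

```{r}
library(textstem)

# lemmatize_strings() replaces each word with its dictionary lemma
clean_text_r <- lemmatize_strings(clean_text_r)
```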
Lemmatize your text using `spacy` in python.
##python chunk
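The spaCy version, assuming the small English model (`en_core_web_sm`) has already been downloaded:

```{python}
import spacy

# assumption: run `python -m spacy download en_core_web_sm` once beforehand
nlp = spacy.load("en_core_web_sm")
doc = nlp(clean_text_py)
clean_text_py = " ".join(token.lemma_ for token in doc)
```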
##r chunk
##python chunk
Use the `tokenize_words` function to create a set of words for your R clean text.
##r chunk
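A sketch using `tokenize_words()` from the `tokenizers` package; it returns a list, so the result is unlisted into a simple vector of words.

```{r}
library(tokenizers)

r_tokens <- unlist(tokenize_words(clean_text_r))   # one lowercase word per element
head(r_tokens)
```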
Use `nltk` or `spacy` to tokenize the words from your python clean text.
##python chunk
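A sketch using the `nltk` option, assuming the `punkt` tokenizer models are available; the text is lower cased so the tokens line up with the default behavior of `tokenize_words()` in R.

```{python}
import nltk
nltk.download("punkt", quiet=True)        # tokenizer models, only needed once
from nltk.tokenize import word_tokenize

py_tokens = word_tokenize(clean_text_py.lower())
py_tokens[:10]
```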
##r chunk
##python chunk
Note: here you can print out, summarize, or otherwise view your text in any way you want.