Objective

This project should allow you to apply what you’ve learned in the course to a new dataset. Although the final project will be structured more like a research project, the same knowledge can be used to answer questions in many fields, and the project also builds the practical skill of writing a report that others can read. The dataset must be related to language or language processing in some way. You must use an analysis we learned in class.

This assignment is preparation for the final project focused on text cleaning. You will find a dataset that matches what you are interested in for your final project (likely sentiment analysis, but entity recognition or another classification problem would be acceptable as well). You will import your dataset and clean the data using the steps listed below. You can change datasets between now and the final, but this project should get the code ready for the data cleaning section.

Method - Data - Variables

Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data?

The data is for news classification. It covers four news categories: world news, sports news, business news, and science-technology news. The training set is from Kaggle, and its categories are labeled 1, 2, 3, and 4. The test dataset will be pulled through an API from live news.

Clean the Data

import spacy
import nltk
import pandas as pd
news = pd.read_csv('test.csv')
news.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7600 entries, 0 to 7599
## Data columns (total 3 columns):
##  #   Column       Non-Null Count  Dtype 
## ---  ------       --------------  ----- 
##  0   Class Index  7600 non-null   int64 
##  1   Title        7600 non-null   object
##  2   Description  7600 non-null   object
## dtypes: int64(1), object(2)
## memory usage: 178.2+ KB
news.describe()
# show the basic information of the data
##        Class Index
## count  7600.000000
## mean      2.500000
## std       1.118108
## min       1.000000
## 25%       1.750000
## 50%       2.500000
## 75%       3.250000
## max       4.000000

You should include code to perform the following steps:

Lower case

news['Description_1'] = news['Description'].str.lower()

Remove symbols/non-Latin characters (unless you are interested in emoticons)

# ASCII punctuation to strip; the raw string keeps the backslash literal
punctuation_signs = list(r"!#$%&'()*+,-./:;<=>?@[\]^_`{|}~")

news['Description_2'] = news['Description_1']

# regex=False treats each sign as a literal character rather than a regex pattern
for i in punctuation_signs:
    news['Description_2'] = news['Description_2'].str.replace(i, ' ', regex=False)

# the double quote is handled separately because it delimits the string above
news['Description_2'] = news['Description_2'].str.replace('"', ' ', regex=False)
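As an alternative to looping over one sign at a time, the same replacement can be sketched with a single compiled character class. This is a minimal illustration on a plain string, not the method used in this report; `PUNCT_PATTERN` and `strip_punctuation` are hypothetical names, and `string.punctuation` is assumed to be an acceptable stand-in for the hand-written list (it also covers the double quote).

```python
import re
import string

# Character class built from string.punctuation; re.escape keeps characters
# such as "\" and "]" literal instead of letting them act as regex syntax.
PUNCT_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punctuation(text):
    """Replace every ASCII punctuation character with a single space."""
    return PUNCT_PATTERN.sub(" ", text)

sample = "space.com - toronto, canada -- a second team"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Each punctuation character becomes one space, so the text keeps its length and words are never merged. On a pandas column, the compiled pattern could be passed to `.str.replace(PUNCT_PATTERN, ' ', regex=True)`.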

Remove contractions

import re
def remove_contraction(description):
    # Note: every pattern below matches an apostrophe, but Description_2 has
    # already had apostrophes replaced by spaces, so this step only changes
    # the text if it runs before the punctuation removal.

    # specific
    description = re.sub(r"won\'t", "will not", description)
    description = re.sub(r"can\'t", "can not", description)

    # general
    description = re.sub(r"n\'t", " not", description)
    description = re.sub(r"\'re", " are", description)
    # caution: this also rewrites possessives such as "team's"
    description = re.sub(r"\'s", " is", description)
    description = re.sub(r"\'d", " would", description)
    description = re.sub(r"\'ll", " will", description)
    description = re.sub(r"\'t", " not", description)
    description = re.sub(r"\'ve", " have", description)
    description = re.sub(r"\'m", " am", description)
    return description
news['Description_3']=news['Description_2'].apply(remove_contraction)
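The order of the cleaning steps matters for contractions, because the patterns above rely on the apostrophe still being present. The sketch below illustrates this on a plain string. It is only an illustration under stated assumptions: `CONTRACTIONS`, `expand_contractions`, and `strip_apostrophes` are hypothetical names, the table is deliberately short, and the ambiguous `'s` rule is left out.

```python
import re

# Hypothetical, deliberately short contraction table (specific cases first)
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(text):
    """Apply each contraction rule in order."""
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text

def strip_apostrophes(text):
    """Stand-in for the punctuation step: replace apostrophes with spaces."""
    return text.replace("'", " ")

raw = "they won't say if it's ready, but we've heard it isn't"

# Expanding first lets the patterns see the apostrophes
expand_first = strip_apostrophes(expand_contractions(raw))
# Stripping first leaves nothing for the patterns to match
strip_first = expand_contractions(strip_apostrophes(raw))

print(expand_first)
print(strip_first)
```

When apostrophes are stripped first, fragments such as "won t" and "we ve" remain in the text, which is consistent with `Description_2` and `Description_3` being identical in the example output of this report.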

Fix spelling errors

#from autocorrect import Speller
#Check = Speller(lang='en')
#news['Description_4']=news['Description_3'].apply(Check)

Lemmatize the words

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize each space-separated token as a verb (pos="v")
    words = text.split(" ")
    return " ".join(wordnet_lemmatizer.lemmatize(word, pos="v") for word in words)

news['Description_5'] = news['Description_3'].apply(lemmatize_text)

Remove stopwords

# import package for stop words
import nltk
from nltk.corpus import stopwords

# load English stop words  
stop_words = list(stopwords.words('english'))

# show examples of stop words
stop_words[10:20]
## ["you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']
news['Description_6']= news['Description_5']

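# remove stop words; regex=True is needed for the \b word boundaries,
# and re.escape keeps any special characters in a stop word literal
import re
for stop_word in stop_words:
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    news['Description_6'] = news['Description_6'].str.replace(regex_stopword, ' ', regex=True)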
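A token-based filter is another way to drop stop words, sketched here on a plain string. This is a swapped-in technique and not the regex loop used above; `STOP_WORDS` is a hypothetical mini list standing in for `stopwords.words('english')`, and `remove_stop_words` is an illustrative name.

```python
# Hypothetical mini stop list; in the report this would come from
# nltk.corpus.stopwords.words('english').
STOP_WORDS = {"a", "of", "for", "the", "has", "its"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace, drop stop words, and rejoin with single spaces."""
    tokens = text.split()
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)

sample = "a second team of rocketeers   compete for the prize"
filtered = remove_stop_words(sample)
print(filtered)
```

Unlike the regex loop, this version collapses runs of spaces as a side effect, and the set lookup avoids one pass over the column per stop word.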

Show Example of processed description

Note: the extra spaces left between words should not affect later model fitting or future prediction, as long as the tokenizer splits on runs of whitespace.
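Whether extra spaces are harmless depends on how the text is tokenized later. The sketch below, on a hypothetical padded string, contrasts `split()` with `split(" ")` and shows one way to normalize the spacing if needed.

```python
padded = "space com   toronto  canada      second team   rocketeers"

# split() with no argument splits on runs of whitespace and drops empty strings
tokens_default = padded.split()

# split(" ") splits on every single space, so runs of spaces yield empty strings
tokens_single = padded.split(" ")

# Rejoining the default tokens gives single-spaced text
normalized = " ".join(tokens_default)

print(tokens_default)
print(len(tokens_single))
print(normalized)
```

A tokenizer that behaves like `split()` sees the same tokens with or without the padding, while one that behaves like `split(" ")` would also see empty tokens.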

news.loc[1]['Description']
## 'SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.'
news.loc[1]['Description_1']
## 'space.com - toronto, canada -- a second\\team of rocketeers competing for the  #36;10 million ansari x prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.'
news.loc[1]['Description_2']
## 'space com   toronto  canada    a second team of rocketeers competing for the   36 10 million ansari x prize  a contest for privately funded suborbital space flight  has officially announced the first launch date for its manned rocket '
news.loc[1]['Description_3']
## 'space com   toronto  canada    a second team of rocketeers competing for the   36 10 million ansari x prize  a contest for privately funded suborbital space flight  has officially announced the first launch date for its manned rocket '
news.loc[1]['Description_5']
## 'space com   toronto  canada    a second team of rocketeers compete for the   36 10 million ansari x prize  a contest for privately fund suborbital space flight  have officially announce the first launch date for its man rocket '
news.loc[1]['Description_6']
## 'space com   toronto  canada      second team   rocketeers compete       36 10 million ansari x prize    contest   privately fund suborbital space flight    officially announce   first launch date     man rocket '

You can perform this analysis in Python or R. You will turn in a knitted file that shows the steps of the code, along with the final print out of the first few words for the finalized data. Be sure to save the data at each step and do not print it out until the end (you can make it print temporarily for yourself, but the final report should not be pages and pages of text printed out).
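The requirement to save the data at each step and print only the first few words at the end can be sketched as follows. This is a minimal illustration on a single string; `run_pipeline`, `first_words`, and the two toy steps are hypothetical and stand in for the cleaning functions in this report.

```python
def run_pipeline(text, steps):
    """Apply each (name, function) step in order and keep every intermediate result."""
    results = {"raw": text}
    current = text
    for name, func in steps:
        current = func(current)
        results[name] = current
    return results

def first_words(text, n=5):
    """Return the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])

# Toy steps for illustration only
steps = [
    ("lower", str.lower),
    ("no_commas", lambda s: s.replace(",", " ")),
]

results = run_pipeline("TORONTO, Canada -- A second team of rocketeers", steps)

# Only the final step is printed, and only its first few words
print(first_words(results["no_commas"]))
```

Keeping every intermediate result in one dictionary mirrors the separate `Description_1` to `Description_6` columns, while the final print stays short.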