This project lets you apply what you have learned in the course to a new dataset. Although the final project is structured as a research project, the same knowledge can be used to answer questions in any field, and the project also builds the practical skill of writing a report that others can read. The dataset must be related to language or language processing in some way, and you must use an analysis we learned in class.
This assignment is preparation for the final project focused on text cleaning. You will find a dataset that matches what you are interested in for your final project (likely sentiment analysis, but entity recognition or another classification problem would be acceptable as well). You will import your dataset and clean the data using the steps listed below. You can change datasets between now and the final, but this project should get the code ready for the data cleaning section.
Explain the data you have selected to study. You can find data through many available corpora or other datasets online (ask for help here for sure!). How was the data collected? Who/what is in the data?
The data is for news classification. It covers four categories of news: world, sports, business, and science/technology. The training set is from Kaggle, and each article is labeled 1, 2, 3, or 4. The test dataset will be pulled from live news through an API.
import spacy
import nltk
import pandas as pd
news = pd.read_csv('test.csv')
news.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7600 entries, 0 to 7599
## Data columns (total 3 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Class Index 7600 non-null int64
## 1 Title 7600 non-null object
## 2 Description 7600 non-null object
## dtypes: int64(1), object(2)
## memory usage: 178.2+ KB
news.describe()
# show the basic information of the data
## Class Index
## count 7600.000000
## mean 2.500000
## std 1.118108
## min 1.000000
## 25% 1.750000
## 50% 2.500000
## 75% 3.250000
## max 4.000000
You should include code to perform the following steps:
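# convert the descriptions to lowercase
news['Description_1'] = news['Description'].str.lower()
# punctuation characters to strip (the list also contains a space)
punctuation_signs = list("!#$%&'()*+, -./:;<=>?@[\\]^_`{|}~")
news['Description_2'] = news['Description_1']
for i in punctuation_signs:
    print(i)
    # regex=False treats each character literally; several are regex metacharacters
    news['Description_2'] = news['Description_2'].str.replace(i, ' ', regex=False)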
## !
## #
## $
## %
## &
## '
## (
## )
## *
## +
## ,
##
## -
## .
## /
## :
## ;
## <
## =
## >
## ?
## @
## [
## \
## ]
## ^
## _
## `
## {
## |
## }
## ~
news['Description_2'] = news['Description_2'].str.replace('"', ' ')
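The character-by-character loop above can also be written as a single pass over each string. The following is a minimal sketch on plain Python strings rather than the pandas column; `strip_punctuation` and `PUNCTUATION` are hypothetical names introduced for illustration, not part of the report's code.

```python
# Same punctuation characters as the loop above, plus the double quote
PUNCTUATION = "!#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\""

# Translation table mapping every punctuation character to one space
_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def strip_punctuation(text):
    """Replace each punctuation character with a space (illustrative helper)."""
    return text.translate(_TABLE)


print(strip_punctuation("space.com - toronto, canada"))
```

On the DataFrame this could be applied with `news['Description_1'].apply(strip_punctuation)`, which should give the same result as the loop while scanning each description only once.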
import re
def remove_contraction(description):
    # specific
    description = re.sub(r"won\'t", "will not", description)
    description = re.sub(r"can\'t", "can not", description)
    # general
    description = re.sub(r"n\'t", " not", description)
    description = re.sub(r"\'re", " are", description)
    description = re.sub(r"\'s", " is", description)
    description = re.sub(r"\'d", " would", description)
    description = re.sub(r"\'ll", " will", description)
    description = re.sub(r"\'t", " not", description)
    description = re.sub(r"\'ve", " have", description)
    description = re.sub(r"\'m", " am", description)
    return description
# note: apostrophes were already replaced in Description_2, so these patterns
# find nothing here; running this step on Description_1 would expand contractions
news['Description_3']=news['Description_2'].apply(remove_contraction)
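Because the apostrophe is in the punctuation list, the order of the two steps matters: contractions can only be expanded while the apostrophe is still present. The sketch below illustrates this on a made-up sentence, using a trimmed, hypothetical `expand_contractions` with only a few of the rules and a hypothetical `strip_apostrophes` standing in for the punctuation step.

```python
import re


def expand_contractions(text):
    """Trimmed, illustrative subset of the contraction rules."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    return text


def strip_apostrophes(text):
    """Stand-in for the punctuation step: replace apostrophes with spaces."""
    return text.replace("'", " ")


sentence = "they won't say why prices aren't falling"

# Expand first, then strip: the contractions are resolved
expanded_first = strip_apostrophes(expand_contractions(sentence))
# Strip first, then expand: nothing is left for the patterns to match
stripped_first = expand_contractions(strip_apostrophes(sentence))

print(expanded_first)  # they will not say why prices are not falling
print(stripped_first)  # they won t say why prices aren t falling
```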
#from autocorrect import Speller
#Check = Speller(lang='en')
#news['Description_4']=news['Description_3'].apply(Check)
from nltk.stem import WordNetLemmatizer
# requires the WordNet data, e.g. nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
nrows = len(news)
lemmatized_text_list = []
for i in range(0, nrows):
    lemmatized_list = []
    text = news.loc[i, 'Description_3']
    text_words = text.split(" ")
    for word in text_words:
        # pos="v" lemmatizes every token as a verb
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
    lemmatized_text = " ".join(lemmatized_list)
    lemmatized_text_list.append(lemmatized_text)
news['Description_5'] = lemmatized_text_list
# import package for stop words
import nltk
from nltk.corpus import stopwords
# load English stop words
stop_words = list(stopwords.words('english'))
# show examples of stop words
stop_words[10:20]
## ["you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']
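news['Description_6'] = news['Description_5']
# remove stop words
for stop_word in stop_words:
    # \b restricts the match to whole words; re.escape guards any special characters
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    # regex=True is required because the pattern is a regular expression
    news['Description_6'] = news['Description_6'].str.replace(regex_stopword, ' ', regex=True)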
Note: the extra spaces left between words by these replacements will not affect the later model fit or future predictions.
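If single-spaced text is preferred anyway, for example for the final printout, stop words can be dropped token by token instead of by regex replacement. This is a minimal alternative sketch, not the method used above: `drop_stop_words` is a hypothetical helper, and `STOP_WORDS` is a small made-up set standing in for NLTK's full English list.

```python
# Small illustrative stop-word set; the report uses NLTK's full English list
STOP_WORDS = {"a", "of", "for", "the", "its", "have"}


def drop_stop_words(text, stop_words=STOP_WORDS):
    """Remove stop words token by token; joining on one space leaves no gaps."""
    kept = [token for token in text.split() if token not in stop_words]
    return " ".join(kept)


sample = "a second team of rocketeers compete for the 36 10 million ansari x prize"
print(drop_stop_words(sample))
# second team rocketeers compete 36 10 million ansari x prize
```

Splitting on whitespace only works here because punctuation has already been removed, so every stop word appears as its own token.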
news.loc[1]['Description']
## 'SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.'
news.loc[1]['Description_1']
## 'space.com - toronto, canada -- a second\\team of rocketeers competing for the #36;10 million ansari x prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.'
news.loc[1]['Description_2']
## 'space com toronto canada a second team of rocketeers competing for the 36 10 million ansari x prize a contest for privately funded suborbital space flight has officially announced the first launch date for its manned rocket '
news.loc[1]['Description_3']
## 'space com toronto canada a second team of rocketeers competing for the 36 10 million ansari x prize a contest for privately funded suborbital space flight has officially announced the first launch date for its manned rocket '
news.loc[1]['Description_5']
## 'space com toronto canada a second team of rocketeers compete for the 36 10 million ansari x prize a contest for privately fund suborbital space flight have officially announce the first launch date for its man rocket '
news.loc[1]['Description_6']
## 'space com toronto canada second team rocketeers compete 36 10 million ansari x prize contest privately fund suborbital space flight officially announce first launch date man rocket '
You can perform this analysis in Python or R. You will turn in a knitted file that shows the steps of the code, along with the final print out of the first few words for the finalized data. Be sure to save the data at each step and do not print it out until the end (you can make it print temporarily for yourself, but the final report should not be pages and pages of text printed out).
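To keep the final printout short, only the first few words of each cleaned description need to be shown. A minimal sketch, assuming the cleaned text is an ordinary string; `first_words` is a hypothetical helper name and the default of 10 words is an arbitrary choice.

```python
def first_words(text, n=10):
    """Return the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])


cleaned = "space com toronto canada second team rocketeers compete 36 10 million ansari x prize"
print(first_words(cleaned, n=5))
# space com toronto canada second
```

On the DataFrame, something like `news['Description_6'].head().apply(first_words)` would preview the first five rows without printing the full text.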