Please do not reorder the assignment - fill in each chunk as requested.
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
library(reticulate)
py_config()
## python: C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python.exe
## libpython: C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python36.dll
## pythonhome: C:/Users/punthakur/AppData/Local/Programs/Python/Python36
## version: 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Users/punthakur/AppData/Local/Programs/Python/Python36/Lib/site-packages/numpy
## numpy_version: 1.19.2
##
## python versions found:
## C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python.exe
## C:/Program Files (x86)/Microsoft Visual Studio/Shared/Python37_64/python.exe
library(plyr)
## Warning: package 'plyr' was built under R version 3.6.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.6.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(gbm)
## Warning: package 'gbm' was built under R version 3.6.3
## Loaded gbm 2.1.8
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.6.3
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:lattice':
##
## panel.fill
## The following object is masked from 'package:plyr':
##
## baseball
Load the Python libraries or functions that you will use for that section.
##python chunk
from textsearch import TextSearch
import spacy
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
ps = PorterStemmer()
import nltk
stopwords = nltk.corpus.stopwords.words('english')
import unicodedata
from contractions import contractions_dict
The dataset is a set of YouTube comments that have been coded as:
- 1: spam YouTube messages
- 0: good YouTube messages
- This data is stored in the CLASS column
Import the data using either R or Python. I put a Python chunk here because you will need one to import the data, but if you want to first import into R, that’s fine.
youtube <- read.csv("youtube_spam.csv")
head(youtube)
## COMMENT_ID AUTHOR
## 1 LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU Julius NM
## 2 LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A adam riyati
## 3 LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 Evgeny Murashkin
## 4 z13jhp0bxqncu512g22wvzkasxmvvzjaz04 ElNino Melendez
## 5 z13fwbwp1oujthgqj04chlngpvzmtt3r3dw GsMega
## 6 LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc Jason Haddad
## DATE
## 1 2013-11-07T06:20:48
## 2 2013-11-07T12:37:15
## 3 2013-11-08T17:34:21
## 4 2013-11-09T08:28:43
## 5 2013-11-10T16:05:38
## 6 2013-11-26T02:55:11
## CONTENT
## 1 Huh, anyway check out this you[tube] channel: kobyoshi02
## 2 Hey guys check out my new channel and our first vid THIS IS US THE MONKEYS!!! I'm the monkey in the white shirt,please leave a like comment and please subscribe!!!!
## 3 just for test I have to say murdev.com
## 4 me shaking my sexy ass on my channel enjoy ^_^
## 5 watch?v=vtaRGgvGtWQ Check this out .
## 6 Hey, check out my new website!! This site is about kids stuff. kidsmediausa . com
## CLASS
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
str(youtube)
## 'data.frame': 1956 obs. of 5 variables:
## $ COMMENT_ID: Factor w/ 1953 levels "_2viQ_Qnc6-_qc98D_T8ICCw3meS1f1YJqU9SA-X1t4",..: 392 390 395 1508 1416 393 1553 461 1766 548 ...
## $ AUTHOR : Factor w/ 1792 levels " Berty Winata",..: 878 46 559 530 665 794 581 258 368 219 ...
## $ DATE : Factor w/ 1710 levels "","2013-07-12T22:33:27.916000",..: 201 202 203 204 205 206 207 208 209 210 ...
## $ CONTENT : Factor w/ 1760 levels "''Little Psy, only 5 months left.. Tumor in the head :( WE WILL MISS U <3",..: 719 536 974 1114 1623 584 1432 884 1750 145 ...
## $ CLASS : int 1 1 1 1 1 1 1 0 1 1 ...
summary(youtube)
## COMMENT_ID AUTHOR
## _2viQ_Qnc68fX3dYsfYuM-m4ELMJvxOQBmBOFHqGOk0: 2 M.E.S : 8
## LneaDw26bFuH6iFsSrjlJLJIX3qD4R8-emuZ-aGUj0o: 2 5000palo : 7
## LneaDw26bFvPh9xBHNw1btQoyP60ay_WWthtvXCx37s: 2 Louis Bryant : 7
## _2viQ_Qnc6-_qc98D_T8ICCw3meS1f1YJqU9SA-X1t4: 1 Shadrach Grentz: 7
## _2viQ_Qnc6-_SpDzX8DwHaw3bUkE-owmcb7eOEPPurs: 1 DanteBTV : 6
## _2viQ_Qnc6-1fj_YPI5S4X9e9VnvAzoykRbwZGAlYgo: 1 Derek Moya : 5
## (Other) :1947 (Other) :1916
## DATE
## : 245
## 2013-10-05T00:57:25.078000: 2
## 2014-11-07T19:33:46 : 2
## 2013-07-12T22:33:27.916000: 1
## 2013-07-13T11:17:52.308000: 1
## 2013-07-13T12:09:31.188000: 1
## (Other) :1704
## CONTENT
## Check out this video on YouTube: : 97
## Check out this playlist on YouTube: : 21
## Check Out The New Hot Video By Dante B Called Riled Up : 6
## hey its M.E.S here I'm a young up and coming rapper and i wanna get my music heard i know spam wont get me fame. but at the moment i got no way of getting a little attention so please do me a favour and check out my channel and drop a sub if you enjoy yourself. im just getting started so i really appreciate those who take time to leave constructive criticism i already got 200 subscribers and 4000 views on my first vid ive been told i have potential : 4
## Hey Music Fans I really appreciate any of you who will take the time to read this, and check my music out! I'm just a 15 year old boy DREAMING of being a successful MUSICIAN in the music world. I do lots of covers, and piano covers. But I don't have money to advertise. A simple thumbs up to my comment, a comment on my videos or a SUBSCRIPTION would be a step forward! It will only be a few seconds of your life that you won't regret!!! Thank u to all the people who just give me a chance! :) : 4
## Hi. Check out and share our songs. : 4
## (Other) :1820
## CLASS
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.5138
## 3rd Qu.:1.0000
## Max. :1.0000
##
##python chunk
import pandas as pd
from pandas import DataFrame
def func_nb():
    # read_csv already returns a DataFrame, so no extra conversion is needed
    return pd.read_csv("youtube_spam.csv")
data_df = func_nb()
data_df.head()
## COMMENT_ID ... CLASS
## 0 LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU ... 1
## 1 LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A ... 1
## 2 LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 ... 1
## 3 z13jhp0bxqncu512g22wvzkasxmvvzjaz04 ... 1
## 4 z13fwbwp1oujthgqj04chlngpvzmtt3r3dw ... 1
##
## [5 rows x 5 columns]
Use one of our clean text functions to clean up the CONTENT column in the dataset.
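clean.text = function(x)
{
  # convert to lower case
  x = tolower(x)
  # remove the retweet marker "rt" only as a whole word
  # (a bare "rt" pattern also strips it from words such as "shirt")
  x = gsub("\\brt\\b", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links (with punctuation gone, a URL is one run of letters starting with http)
  x = gsub("http\\w+", "", x)
  # collapse runs of spaces or tabs into one space so adjacent words are not joined
  x = gsub("[ \t]{2,}", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  return(x)
}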
youtubecontent <- youtube$CONTENT
youtubecontent = clean.text (youtubecontent)
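For comparison, the same cleaning steps can be sketched in Python with the standard `re` module. This is only an illustration, not one of the course's clean text functions: the name `clean_text` and the sample comment are assumptions made for the example, links are removed before punctuation, and "rt" is dropped only as a whole word.

```python
import re

def clean_text(text):
    """Hypothetical Python counterpart of the R clean.text steps."""
    text = text.lower()                     # lower case
    text = re.sub(r"\brt\b", "", text)      # drop the retweet marker as a whole word
    text = re.sub(r"@\w+", "", text)        # drop @mentions
    text = re.sub(r"http\S+", "", text)     # drop links before punctuation is stripped
    text = re.sub(r"[^\w\s]|_", "", text)   # drop punctuation
    text = re.sub(r"\d+", "", text)         # drop digits
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()                     # trim both ends

# Made-up comment used only to exercise the function.
sample = "RT @bob: Check out my NEW channel!!! http://example.com/abc 123"
print(clean_text(sample))
```

Words that merely contain "rt", such as "shirt", are left intact because of the word-boundary anchors.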
##python chunk
Content = r.youtubecontent
##Normalize our corpus
#remove html
data_df['CONTENT'] = [BeautifulSoup(str(text), "html.parser").get_text() for text in data_df['CONTENT'].tolist()]
#lower case
## C:\Users\PUNTHA~1\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py:421: MarkupResemblesLocatorWarning: "https://twitter.com/GBphotographyGB" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
## MarkupResemblesLocatorWarning
data_df['CONTENT'] = data_df['CONTENT'].str.lower()
#unicode
data_df['CONTENT'] = [unicodedata.normalize('NFKD', str(text)).encode('ascii', 'ignore').decode('utf-8', 'ignore') for text in data_df['CONTENT'].tolist()]
#take out special characters (raw string and regex=True so the pattern is applied as a regular expression)
data_df['CONTENT'] = data_df['CONTENT'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
#stemming
data_df['CONTENT'] = [' '.join([ps.stem(word) for word in text.split()]) for text in data_df['CONTENT'].tolist()]
#stop words
data_df['CONTENT'] = [' '.join([word for word in text.split() if word not in stopwords]) for text in data_df['CONTENT'].tolist()]
#drop rows with any missing value (the cleaned CONTENT is never null here, so this mainly removes comments with a blank DATE)
data_df = data_df.dropna().reset_index(drop=True)
#save it
data_df.to_csv('clean_youtube_spam.csv', index=False)
data_df = pd.read_csv('clean_youtube_spam.csv')
data_df.head()
## COMMENT_ID ... CLASS
## 0 LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU ... 1
## 1 LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A ... 1
## 2 LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 ... 1
## 3 z13jhp0bxqncu512g22wvzkasxmvvzjaz04 ... 1
## 4 z13fwbwp1oujthgqj04chlngpvzmtt3r3dw ... 1
##
## [5 rows x 5 columns]
print(data_df)
## COMMENT_ID ... CLASS
## 0 LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU ... 1
## 1 LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A ... 1
## 2 LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 ... 1
## 3 z13jhp0bxqncu512g22wvzkasxmvvzjaz04 ... 1
## 4 z13fwbwp1oujthgqj04chlngpvzmtt3r3dw ... 1
## ... ... ... ...
## 1706 z12sjp3zgtqnvlysj23zuxxaolrvd1oj504 ... 0
## 1707 z132enrpoy35yxpoe04cjr4zur3jvbyq3xo0k ... 0
## 1708 z132jbmxfqm4fjysg23nwjfb2mv2vxnua ... 1
## 1709 z12cdlswetvnejcri04cex0jfwy2u3tzj54 ... 0
## 1710 z120e5uautvcuper304ccf4bjrjugdpbwrc0k ... 0
##
## [1711 rows x 5 columns]
Split the data into testing and training data.
##python chunk
from sklearn.model_selection import train_test_split
train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names = train_test_split(
    np.array(data_df['CONTENT'].apply(lambda x: np.str_(x))),
    np.array(data_df['CLASS']),
    np.array(data_df['CLASS']),
    test_size=0.20, random_state=42)
train_corpus.shape, test_corpus.shape
## ((1368,), (343,))
from collections import Counter
trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))
(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd],
columns=['Target Label', 'Train Count', 'Test Count']).sort_values(by=['Train Count', 'Test Count'], ascending=False))
## Target Label Train Count Test Count
## 0 0 753 198
## 1 1 615 145
For word2vec, create the tokenized vectors of the text.
##python chunk
tokenized_train = [nltk.tokenize.word_tokenize(text)
for text in train_corpus]
tokenized_test = [nltk.tokenize.word_tokenize(text)
for text in test_corpus]
import gensim
# build word2vec model
# (argument names follow gensim 3.x; gensim 4 renamed size/iter to vector_size/epochs)
w2v_num_features = 300
w2v_model = gensim.models.Word2Vec(tokenized_train, #corpus
                                   size=w2v_num_features, #number of features
                                   window=10, #size of moving window
                                   min_count=2, #ignore words that appear fewer than 2 times
                                   sg=0, #cbow model
                                   iter=5, workers=5) #iterations and cores
Create a TF-IDF matrix.
##python chunk
from sklearn.feature_extraction.text import TfidfVectorizer
# build BOW with TFIDF features on train class
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
# apply to train and test
tv_train_features = tv.fit_transform(train_corpus)
tv_test_features = tv.transform(test_corpus)
# look at feature shape
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)
## TFIDF model:> Train features shape: (1368, 2534) Test features shape: (343, 2534)
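To make the weighting concrete, the following minimal sketch computes TF-IDF by hand for a toy corpus. It assumes plain whitespace tokenization and uses the smoothed formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`); the function name `tfidf_rows` and the three toy documents are made up for the example and are not part of the assignment data.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalized rows for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy documents, assumed for illustration only.
toy_docs = ["check out my channel", "love this song", "check this out"]
vocab, rows = tfidf_rows(toy_docs)
print(vocab)
print([round(v, 3) for v in rows[0]])
```

A term that appears in fewer documents ("channel") receives a larger weight than one shared across documents ("check"), which is the property the classifier relies on.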
Build the word2vec model.
##python chunk
Convert the word2vec model into a set of features to use in our classifier.
##python chunk
#create flattening function
def document_vectorizer(corpus, model, num_features):
    # words known to the trained model (gensim 3.x attribute)
    vocabulary = set(model.wv.index2word)
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        for word in words:
            if word in vocabulary:
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        # average only when at least one word was found; otherwise keep the zero vector
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)
        return feature_vector
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)
# generate averaged word vector features from word2vec model
avg_wv_train_features = document_vectorizer(corpus=tokenized_train, model=w2v_model,
                                            num_features=w2v_num_features)
avg_wv_test_features = document_vectorizer(corpus=tokenized_test, model=w2v_model,
                                           num_features=w2v_num_features)
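The averaging step can be checked on a tiny example without training a model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of a trained word2vec model; `toy_vectors`, `average_vector`, and the documents are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model.
toy_vectors = {
    "check":   np.array([1.0, 0.0, 0.0]),
    "channel": np.array([0.0, 1.0, 0.0]),
    "song":    np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, vectors, num_features):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

# Each document is a list of tokens; "unknownword" is deliberately out of vocabulary.
docs = [["check", "channel"], ["song", "unknownword"], ["unknownword"]]
features = np.array([average_vector(d, toy_vectors, 3) for d in docs])
print(features.shape)
print(features)
```

Every document becomes one fixed-length row regardless of its word count, and a document with no known words maps to all zeros, which is why rare-word comments carry little signal in this representation.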
In class, we used a few algorithms to test which model might be the best. Pick one of the algorithms to use here (logistic regression, Naive Bayes, or support vector machine).
Run your algorithm on both the TF-IDF matrix and the output from word2vec.
##python chunk
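One way this step could look, assuming logistic regression is the chosen algorithm, is sketched below. The helper `fit_and_predict`, the toy comments, and the random dense matrix (a stand-in for averaged word2vec vectors) are all assumptions made so the example runs on its own; it is not the report's fitted model and produces no results for the YouTube data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fit_and_predict(train_features, train_labels, test_features):
    """Fit a logistic regression and return predictions for the test features."""
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(train_features, train_labels)
    return model.predict(test_features)

# Toy stand-ins for the corpora and labels (1 = spam, 0 = not spam).
toy_train = ["check out my channel", "subscribe to my channel please",
             "love this song", "this song is great"]
toy_train_labels = np.array([1, 1, 0, 0])
toy_test = ["please check my channel", "great song"]

# Sparse TF-IDF features: fit on the training text, then transform the test text.
toy_tv = TfidfVectorizer()
tfidf_predictions = fit_and_predict(toy_tv.fit_transform(toy_train),
                                    toy_train_labels,
                                    toy_tv.transform(toy_test))

# Dense features standing in for averaged word2vec vectors (random, illustration only).
rng = np.random.default_rng(42)
dense_train = rng.normal(size=(4, 5)) + toy_train_labels[:, None]
dense_test = rng.normal(size=(2, 5))
dense_predictions = fit_and_predict(dense_train, toy_train_labels, dense_test)

print(tfidf_predictions, dense_predictions)
```

In this report the same helper would presumably be called once with `tv_train_features` and `tv_test_features` and once with `avg_wv_train_features` and `avg_wv_test_features`, using `train_label_nums` as the labels in both cases.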
Print out the accuracy, recall, and precision of both of your models.
##python chunk
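The three requested metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them in plain Python; the function name `classification_scores` and the label lists are made up for illustration and are not results from this report.

```python
def classification_scores(true_labels, predicted_labels, positive=1):
    """Return accuracy, precision and recall for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels for illustration only; these are not results from the report.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
accuracy, precision, recall = classification_scores(toy_true, toy_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The same quantities are available from `sklearn.metrics` as `accuracy_score`, `precision_score`, and `recall_score`, applied to `test_label_nums` and each model's predictions.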