Classification

Please do not reorder the assignment - fill in each chunk as requested.

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.

##r chunk

library(reticulate)
py_config()

## python:         C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python.exe
## libpython:      C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python36.dll
## pythonhome:     C:/Users/punthakur/AppData/Local/Programs/Python/Python36
## version:        3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/punthakur/AppData/Local/Programs/Python/Python36/Lib/site-packages/numpy
## numpy_version:  1.19.2
## 
## python versions found: 
##  C:/Users/punthakur/AppData/Local/Programs/Python/Python36/python.exe
##  C:/Program Files (x86)/Microsoft Visual Studio/Shared/Python37_64/python.exe

library(plyr)

## Warning: package 'plyr' was built under R version 3.6.3

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.6.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(caret)

## Warning: package 'caret' was built under R version 3.6.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 3.6.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.6.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(gbm)

## Warning: package 'gbm' was built under R version 3.6.3

## Loaded gbm 2.1.8

library(corrgram)

## Warning: package 'corrgram' was built under R version 3.6.3

## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus

## 
## Attaching package: 'corrgram'

## The following object is masked from 'package:lattice':
## 
##     panel.fill

## The following object is masked from 'package:plyr':
## 
##     baseball

Load the Python libraries or functions that you will use for that section.

##python chunk

from textsearch import TextSearch
import spacy
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from bs4 import BeautifulSoup 
from nltk.stem import PorterStemmer
ps = PorterStemmer()
import nltk
stopwords = nltk.corpus.stopwords.words('english')
import unicodedata
from contractions import contractions_dict

The Data

The dataset is a set of YouTube comments that have been coded as:

- 1: spam YouTube messages
- 0: good YouTube messages

This data is stored in the CLASS column.

Import the data using either R or Python. I put a Python chunk here because you will need one to import the data, but if you want to first import into R, that’s fine.

youtube <- read.csv("youtube_spam.csv")
head(youtube)

##                                    COMMENT_ID           AUTHOR
## 1 LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU        Julius NM
## 2 LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A      adam riyati
## 3 LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 Evgeny Murashkin
## 4         z13jhp0bxqncu512g22wvzkasxmvvzjaz04  ElNino Melendez
## 5         z13fwbwp1oujthgqj04chlngpvzmtt3r3dw           GsMega
## 6 LZQPQhLyRh9-wNRtlZDM90f1k0BrdVdJyN_YsaSwfxc     Jason Haddad
##                  DATE
## 1 2013-11-07T06:20:48
## 2 2013-11-07T12:37:15
## 3 2013-11-08T17:34:21
## 4 2013-11-09T08:28:43
## 5 2013-11-10T16:05:38
## 6 2013-11-26T02:55:11
##                                                                                                                                                                  CONTENT
## 1                                                                                                               Huh, anyway check out this you[tube] channel: kobyoshi02
## 2 Hey guys check out my new channel and our first vid THIS IS US THE  MONKEYS!!! I'm the monkey in the white shirt,please leave a like comment  and please subscribe!!!!
## 3                                                                                                                                 just for test I have to say murdev.com
## 4                                                                                                                     me shaking my sexy ass on my channel enjoy ^_^ ï»¿
## 5                                                                                                                              watch?v=vtaRGgvGtWQ   Check this out .ï»¿
## 6                                                                                     Hey, check out my new website!! This site is about kids stuff. kidsmediausa  . com
##   CLASS
## 1     1
## 2     1
## 3     1
## 4     1
## 5     1
## 6     1

str(youtube)

## 'data.frame':    1956 obs. of  5 variables:
##  $ COMMENT_ID: Factor w/ 1953 levels "_2viQ_Qnc6-_qc98D_T8ICCw3meS1f1YJqU9SA-X1t4",..: 392 390 395 1508 1416 393 1553 461 1766 548 ...
##  $ AUTHOR    : Factor w/ 1792 levels "   Berty  Winata",..: 878 46 559 530 665 794 581 258 368 219 ...
##  $ DATE      : Factor w/ 1710 levels "","2013-07-12T22:33:27.916000",..: 201 202 203 204 205 206 207 208 209 210 ...
##  $ CONTENT   : Factor w/ 1760 levels "''Little Psy, only 5 months left.. Tumor in the head :( WE WILL MISS U &lt;3ï»¿",..: 719 536 974 1114 1623 584 1432 884 1750 145 ...
##  $ CLASS     : int  1 1 1 1 1 1 1 0 1 1 ...

summary(youtube)

##                                        COMMENT_ID               AUTHOR    
##  _2viQ_Qnc68fX3dYsfYuM-m4ELMJvxOQBmBOFHqGOk0:   2   M.E.S          :   8  
##  LneaDw26bFuH6iFsSrjlJLJIX3qD4R8-emuZ-aGUj0o:   2   5000palo       :   7  
##  LneaDw26bFvPh9xBHNw1btQoyP60ay_WWthtvXCx37s:   2   Louis Bryant   :   7  
##  _2viQ_Qnc6-_qc98D_T8ICCw3meS1f1YJqU9SA-X1t4:   1   Shadrach Grentz:   7  
##  _2viQ_Qnc6-_SpDzX8DwHaw3bUkE-owmcb7eOEPPurs:   1   DanteBTV       :   6  
##  _2viQ_Qnc6-1fj_YPI5S4X9e9VnvAzoykRbwZGAlYgo:   1   Derek Moya     :   5  
##  (Other)                                    :1947   (Other)        :1916  
##                          DATE     
##                            : 245  
##  2013-10-05T00:57:25.078000:   2  
##  2014-11-07T19:33:46       :   2  
##  2013-07-12T22:33:27.916000:   1  
##  2013-07-13T11:17:52.308000:   1  
##  2013-07-13T12:09:31.188000:   1  
##  (Other)                   :1704  
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        CONTENT    
##  Check out this video on YouTube:ï»¿                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       :  97  
##  Check out this playlist on YouTube:ï»¿                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    :  21  
##  Check Out The New Hot Video By Dante B Called Riled Up                                                                                                                                                                                                                                                                                                                                                                                                                                                                    :   6  
##  hey its M.E.S here I&#39;m a young up and coming rapper and i wanna get my music heard i know spam wont get me fame. but at the moment i got no way of getting a little attention so please do me a favour and check out my channel and drop a sub if you enjoy yourself. im just getting started so i really appreciate those who take time to leave constructive criticism i already got 200 subscribers and 4000 views on my first vid ive been told i have potential                                                  :   4  
##  Hey Music Fans I really appreciate any of you who will take the time to read this, and check my music out! I&#39;m just a 15 year old boy DREAMING of being a successful MUSICIAN in the music world. I do lots of covers, and piano covers. But I don&#39;t have money to advertise. A simple thumbs up to my comment, a comment on my videos or a SUBSCRIPTION would be a step forward! It will only be a few seconds of your life that you won&#39;t regret!!! Thank u to all the people who just give me a chance! :) :   4  
##  Hi. Check out and share our songs.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        :   4  
##  (Other)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   :1820  
##      CLASS       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5138  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

##python chunk

import pandas as pd


def func_nb():
   # read_csv already returns a DataFrame, so no extra wrapping is needed
   return pd.read_csv("youtube_spam.csv")

data_df = func_nb()
data_df.head()

##                                     COMMENT_ID  ... CLASS
## 0  LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU  ...     1
## 1  LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A  ...     1
## 2  LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8  ...     1
## 3          z13jhp0bxqncu512g22wvzkasxmvvzjaz04  ...     1
## 4          z13fwbwp1oujthgqj04chlngpvzmtt3r3dw  ...     1
## 
## [5 rows x 5 columns]

Clean up the data (text normalization)

Use one of our clean text functions to clean up the CONTENT column in the dataset.

clean.text = function(x)
{
  # tolower
  x = tolower(x)
  # remove the retweet marker "rt" only when it stands alone as a word
  x = gsub("\\brt\\b", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove numbers
  x = gsub("[[:digit:]]", "", x)
  # remove links http
  x = gsub("http\\w+", "", x)
  # collapse runs of spaces or tabs into a single space
  x = gsub("[ \t]{2,}", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  return(x)
}

youtubecontent <- youtube$CONTENT

youtubecontent = clean.text (youtubecontent)

##python chunk


Content = r.youtubecontent

##Normalize our corpus

#remove html
data_df['CONTENT'] = [BeautifulSoup(str(text), "html.parser").get_text() for text in data_df['CONTENT'].tolist()]


#lower case

## C:\Users\PUNTHA~1\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py:421: MarkupResemblesLocatorWarning: "https://twitter.com/GBphotographyGB" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
##   MarkupResemblesLocatorWarning

data_df['CONTENT'] = data_df['CONTENT'].str.lower()


#unicode
data_df['CONTENT'] = [unicodedata.normalize('NFKD', str(text)).encode('ascii', 'ignore').decode('utf-8', 'ignore') for text in data_df['CONTENT'].tolist()]


#take out special characters
data_df['CONTENT'] = data_df['CONTENT'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

#stemming
data_df['CONTENT'] = [' '.join([ps.stem(word) for word in text.split()]) for text in data_df['CONTENT'].tolist()]


#stop words
data_df['CONTENT'] = [' '.join([word for word in text.split() if word not in stopwords]) for text in data_df['CONTENT'].tolist()]

#drop rows with a missing value in any column (not only CONTENT)
data_df = data_df.dropna().reset_index(drop=True)


#save it
data_df.to_csv('clean_youtube_spam.csv', index=False)

data_df = pd.read_csv('clean_youtube_spam.csv')
data_df.head()

##                                     COMMENT_ID  ... CLASS
## 0  LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU  ...     1
## 1  LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A  ...     1
## 2  LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8  ...     1
## 3          z13jhp0bxqncu512g22wvzkasxmvvzjaz04  ...     1
## 4          z13fwbwp1oujthgqj04chlngpvzmtt3r3dw  ...     1
## 
## [5 rows x 5 columns]

print(data_df)

##                                        COMMENT_ID  ... CLASS
## 0     LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU  ...     1
## 1     LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A  ...     1
## 2     LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8  ...     1
## 3             z13jhp0bxqncu512g22wvzkasxmvvzjaz04  ...     1
## 4             z13fwbwp1oujthgqj04chlngpvzmtt3r3dw  ...     1
## ...                                           ...  ...   ...
## 1706          z12sjp3zgtqnvlysj23zuxxaolrvd1oj504  ...     0
## 1707        z132enrpoy35yxpoe04cjr4zur3jvbyq3xo0k  ...     0
## 1708            z132jbmxfqm4fjysg23nwjfb2mv2vxnua  ...     1
## 1709          z12cdlswetvnejcri04cex0jfwy2u3tzj54  ...     0
## 1710        z120e5uautvcuper304ccf4bjrjugdpbwrc0k  ...     0
## 
## [1711 rows x 5 columns]

Split the data

Split the data into testing and training data.

##python chunk


from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names = train_test_split(np.array(data_df['CONTENT'].apply(lambda x:np.str_(x))), np.array(data_df['CLASS']), np.array(data_df['CLASS']), test_size=0.20, random_state=42)

train_corpus.shape, test_corpus.shape

## ((1368,), (343,))

from collections import Counter

trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd], 
             columns=['Target Label', 'Train Count', 'Test Count']).sort_values(by=['Train Count', 'Test Count'], ascending=False))

##    Target Label  Train Count  Test Count
## 0             0          753         198
## 1             1          615         145

Process the data

For word2vec, create the tokenized vectors of the text.

##python chunk

tokenized_train = [nltk.tokenize.word_tokenize(text)
                   for text in train_corpus]
tokenized_test = [nltk.tokenize.word_tokenize(text)
                   for text in test_corpus]


import gensim
# build word2vec model
w2v_num_features = 300
w2v_model = gensim.models.Word2Vec(tokenized_train, #corpus
            size=w2v_num_features, #number of features
            window=10, #size of moving window
            min_count=2, #ignore words that appear fewer than 2 times
            sg = 0, #cbow model
            iter=5, workers=5) #training epochs and worker threads


TF-IDF

Create a TF-IDF matrix.

##python chunk

from sklearn.feature_extraction.text import TfidfVectorizer

# build bag-of-words TF-IDF features, fitted on the training corpus only
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)

# apply to train and test
tv_train_features = tv.fit_transform(train_corpus)
tv_test_features = tv.transform(test_corpus)

# look at feature shape
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

## TFIDF model:> Train features shape: (1368, 2534)  Test features shape: (343, 2534)
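
To make the numbers in that matrix less opaque, here is a small hand-rolled sketch of the weighting. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that `TfidfVectorizer` applies by default; the three toy comments and the helper `tfidf_row` are hypothetical and are not part of the assignment data.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and cleaned.
docs = [
    "check out my channel",
    "love this song",
    "check out this song",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Raw term counts times IDF, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
for word, weight in zip(vocab, matrix[0]):
    print(f"{word:8s} {weight:.3f}")
```

Words that appear in fewer documents (such as "channel") receive a larger IDF than words shared across documents (such as "check"), which is why rare but distinctive terms dominate a row.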

Word2Vec

Build the word2vec model.

##python chunk

Convert the model

Convert the word2vec model into a set of features to use in our classifier.

##python chunk

#create flattening function
def document_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
    
# generate averaged word vector features from word2vec model
avg_wv_train_features = document_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_wv_test_features = document_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

Build a classifier model

In class, we used a few algorithms to test which model might be the best. Pick one of the algorithms to use here (logistic regression, naive bayes, support vector machine).

Run your algorithm on both the TF-IDF matrix and the output from word2vec.

##python chunk
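
As an illustration of one of the listed options, naive Bayes, the sketch below implements a multinomial naive Bayes classifier from scratch on a hypothetical toy corpus. It is not the scikit-learn estimator used in class: the class `TinyMultinomialNB`, its parameters, and the toy comments are assumptions made for this example. With the features built above, the same idea would more likely be run through scikit-learn's `MultinomialNB` (for the TF-IDF matrix, since it needs non-negative features) or `LogisticRegression` (for both the TF-IDF and the averaged word2vec features).

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        class_counts = Counter(labels)
        n = len(labels)
        # Log prior: share of training documents in each class.
        self.log_prior = {c: math.log(class_counts[c] / n) for c in self.classes}
        # Total number of tokens seen per class.
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def _log_likelihood(self, word, c):
        num = self.word_counts[c][word] + self.alpha
        den = self.total[c] + self.alpha * len(self.vocab)
        return math.log(num / den)

    def predict(self, docs):
        preds = []
        for doc in docs:
            scores = {}
            for c in self.classes:
                # Words never seen in training are skipped.
                scores[c] = self.log_prior[c] + sum(
                    self._log_likelihood(w, c) for w in doc if w in self.vocab
                )
            preds.append(max(scores, key=scores.get))
        return preds


# Hypothetical toy data: 1 = spam, 0 = good comment.
toy_train = [
    ["check", "out", "my", "channel"],
    ["subscribe", "to", "my", "channel"],
    ["love", "this", "song"],
    ["great", "song", "and", "video"],
]
toy_labels = [1, 1, 0, 0]
toy_test = [["check", "my", "channel"], ["love", "this", "video"]]

nb = TinyMultinomialNB().fit(toy_train, toy_labels)
predictions = nb.predict(toy_test)
print(predictions)
```

The classifier scores each class by adding the log prior to the summed log likelihoods of the comment's words and then picks the class with the highest score.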

Examine the results

Print out the accuracy, recall, and precision of both of your models.

##python chunk
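
The three requested metrics can all be derived from the counts of true and false positives and negatives. The sketch below shows that arithmetic on hypothetical label vectors; `binary_metrics` and the example labels are invented for illustration and are not results from this dataset. In scikit-learn, `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics` report the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    return {
        # Share of all comments labeled correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Of the comments flagged as spam, how many really were spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam comments, how many were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 1 = spam, 0 = good comment.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function once on the TF-IDF predictions and once on the word2vec predictions would give a side-by-side comparison of the two models.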

Interpretation

Were you able to predict the spam messages from the real comments?
Which model provided you with a better prediction?