Unit 5

# Unit 5 - Twitter

# IMPORTANT NOTE
# In the following video, we ask you to install the "tm" package to perform the pre-processing steps. Due to function changes that occurred after this video was recorded, you will need to run the following command immediately after converting all of the words to lowercase letters (it converts all documents in the corpus to the PlainTextDocument type):

# corpus = tm_map(corpus, PlainTextDocument)

# Then you can continue with the R commands as they are in the video.

# Video 5

# Read in the data

tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)

str(tweets)

## 'data.frame':    1181 obs. of  2 variables:
##  $ Tweet: chr  "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
##  $ Avg  : num  2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...

# Create dependent variable

tweets$Negative = as.factor(tweets$Avg <= -1)
table(tweets$Negative)

## 
## FALSE  TRUE 
##   999   182
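
# A minimal sketch of how the dependent variable is built, using made-up scores (the vector `avg` is illustrative and not taken from tweets.csv): comparing a numeric vector with -1 yields TRUE/FALSE values, and as.factor() turns them into a two-level factor.

```r
# Hypothetical average sentiment scores, invented for illustration
avg = c(2, 0.4, -1, -1.6, -0.8)

# TRUE marks a tweet as negative (score of -1 or lower)
negative = as.factor(avg <= -1)

levels(negative)   # the two classes
table(negative)    # number of tweets in each class
```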

# Install new packages
options(repos = c(CRAN = "http://cran.rstudio.com"))
# install.packages("tm")
library(tm)

## Loading required package: NLP

# install.packages("SnowballC")
library(SnowballC)

# Create corpus
 
corpus = Corpus(VectorSource(tweets$Tweet))

# Look at corpus
corpus

## <<VCorpus (documents: 1181, metadata (corpus/indexed): 0/0)>>

corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
## I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore

# Convert to lower-case

corpus = tm_map(corpus, tolower)
corpus[[1]]

## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts each document in the corpus to the PlainTextDocument type). This is needed because of a change in how tm handles the tolower function, made after this video was recorded.

corpus = tm_map(corpus, PlainTextDocument)
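
# In recent versions of tm, an alternative to the two-step workaround above is to wrap tolower in content_transformer(), so that each document keeps its document class. A hedged sketch on a one-document toy corpus (`demo` and `lowered` are hypothetical names; the block is skipped when tm is not installed):

```r
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # One-document toy corpus; the text is one of the tweets shown earlier
  demo = VCorpus(VectorSource("LOVE U @APPLE"))

  # content_transformer() lets a plain character function act on documents
  demo = tm_map(demo, content_transformer(tolower))

  # Extract the text of the first document
  lowered = as.character(demo[[1]])
  print(lowered)
}
```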

# Remove punctuation

corpus = tm_map(corpus, removePunctuation)
corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
## i have to say apple has by far the best customer care service i have ever received apple appstore

# Look at stop words 
stopwords("english")[1:10]

##  [1] "i"         "me"        "my"        "myself"    "we"       
##  [6] "our"       "ours"      "ourselves" "you"       "your"

# Remove stopwords and apple

corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
##    say    far  best customer care service   ever received  appstore
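
# To see roughly what removeWords does, the sketch below deletes whole words with a base R regular expression. It illustrates the idea only and is not tm's actual implementation; the `drop` vector is a hand-picked subset of stop words plus "apple".

```r
text = "i have to say apple has by far the best customer care service"

# Hand-picked words to delete (illustrative subset, not the full stop word list)
drop = c("apple", "i", "have", "to", "has", "by", "the")

# \\b anchors each alternative at word boundaries so only whole words match
pattern = paste0("\\b(", paste(drop, collapse = "|"), ")\\b")
cleaned = gsub(pattern, "", text, perl = TRUE)

# Deleted words leave their surrounding spaces behind
cleaned

# Collapse the leftover whitespace to read the result more easily
trimws(gsub(" +", " ", cleaned))
```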

# Stem document 

corpus = tm_map(corpus, stemDocument)
corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
##    say    far  best custom care servic   ever receiv  appstor

# Video 6

# Create matrix

frequencies = DocumentTermMatrix(corpus)
frequencies

## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
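
# Conceptually, a document-term matrix has one row per document and one column per term, and each cell counts how often the term occurs in the document. A minimal base R sketch on three made-up documents (`docs`, `terms` and `dtm_toy` are hypothetical names; DocumentTermMatrix additionally stores the result in a sparse format):

```r
# Three invented, already-stemmed documents
docs = c("love new iphon", "iphon batteri die", "love love itun")

# Split each document into words
tokens = strsplit(docs, " ")

# Vocabulary: every distinct term, sorted alphabetically
terms = sort(unique(unlist(tokens)))

# Count each term in each document (one column per document at this point)
counts = vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = terms))),
  integer(length(terms))
)

# Transpose so that rows are documents and columns are terms
dtm_toy = t(counts)
dimnames(dtm_toy) = list(paste0("doc", seq_along(docs)), terms)

dtm_toy
```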

# Look at matrix 
inspect(frequencies[1000:1005,505:515])

## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           cheapen cheaper check cheep cheer cheerio cherylcol chief
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     0       0         0     0
##   character(0)       0       0     0     0     1       0         0     0
##               Terms
## Docs           chiiiiqu child children
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0
##   character(0)        0     0        0

# Check for sparsity
findFreqTerms(frequencies, lowfreq=20)

##  [1] "android"              "anyon"                "app"                 
##  [4] "appl"                 "back"                 "batteri"             
##  [7] "better"               "buy"                  "can"                 
## [10] "cant"                 "come"                 "dont"                
## [13] "fingerprint"          "freak"                "get"                 
## [16] "googl"                "ios7"                 "ipad"                
## [19] "iphon"                "iphone5"              "iphone5c"            
## [22] "ipod"                 "ipodplayerpromo"      "itun"                
## [25] "just"                 "like"                 "lol"                 
## [28] "look"                 "love"                 "make"                
## [31] "market"               "microsoft"            "need"                
## [34] "new"                  "now"                  "one"                 
## [37] "phone"                "pleas"                "promo"               
## [40] "promoipodplayerpromo" "realli"               "releas"              
## [43] "samsung"              "say"                  "store"               
## [46] "thank"                "think"                "time"                
## [49] "twitter"              "updat"                "use"                 
## [52] "via"                  "want"                 "well"                
## [55] "will"                 "work"

# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
sparse

## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity           : 99%
## Maximal term length: 20
## Weighting          : term frequency (tf)
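
# The threshold 0.995 means a term is kept only if it appears in more than 0.5% of the documents; with 1181 tweets that is more than 1181 * 0.005 = 5.905, so in at least 6 tweets. A base R sketch of that filtering rule on a toy count matrix (the matrix and the 0.7 threshold are made up, and the rule is an assumption about how removeSparseTerms behaves, not code copied from tm):

```r
# Toy document-term counts: 4 documents, 3 terms (values invented)
m = matrix(c(1, 0, 2,
             0, 0, 1,
             3, 0, 1,
             0, 1, 1),
           nrow = 4, byrow = TRUE,
           dimnames = list(paste0("doc", 1:4), c("iphon", "rare", "new")))

sparse_threshold = 0.7

# Number of documents in which each term occurs at least once
doc_freq = colSums(m > 0)

# Keep terms that occur in more than (1 - threshold) of the documents
keep = doc_freq > nrow(m) * (1 - sparse_threshold)
m_reduced = m[, keep, drop = FALSE]

m_reduced
```

# Here "rare" occurs in only 1 of 4 documents, which is not more than 4 * 0.3 = 1.2, so it is dropped.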

# Convert to a data frame

tweetsSparse = as.data.frame(as.matrix(sparse))

# Make all variable names R-friendly

colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
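
# make.names matters because some terms start with a digit, which is not valid at the start of an R variable name, and model formulas need valid names. A short illustration with example column names (`raw_names` and `clean_names` are hypothetical):

```r
raw_names = c("100", "2001", "love", "iphone5")

# Names starting with a digit get an "X" prefix; valid names are unchanged
clean_names = make.names(raw_names)
clean_names
```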

# Add dependent variable

tweetsSparse$Negative = tweets$Negative

# Split the data
# install.packages("caTools")
library(caTools)

set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)


# QUICK QUESTION  (1 point possible)
# In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)
findFreqTerms(frequencies, lowfreq=100)

## [1] "iphon" "itun"  "new"

# Video 7

# Build a CART model
library(rpart)

# install.packages("rpart.plot")
library(rpart.plot)

tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

prp(tweetCART)

# Evaluate the performance of the model
predictCART = predict(tweetCART, newdata=testSparse, type="class")

table(testSparse$Negative, predictCART)

##        predictCART
##         FALSE TRUE
##   FALSE   294    6
##   TRUE     37   18

# Compute accuracy

(294+18)/(294+6+37+18)

## [1] 0.8788732

# Baseline accuracy 

table(testSparse$Negative)

## 
## FALSE  TRUE 
##   300    55

300/(300+55)

## [1] 0.8450704
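
# The same accuracy figures can be computed from the confusion matrix instead of typing the counts by hand. A sketch that re-enters the CART counts shown above into a matrix (`cm` is a hypothetical name): accuracy is the diagonal sum over the total, and the baseline always predicts the most frequent actual class.

```r
# CART confusion matrix from above: rows are actual, columns are predicted
cm = matrix(c(294, 6,
              37, 18),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("FALSE", "TRUE"),
                            predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal
accuracy = sum(diag(cm)) / sum(cm)

# Baseline: always predict the most common actual class
baseline = max(rowSums(cm)) / sum(cm)

round(c(accuracy = accuracy, baseline = baseline), 4)
```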

# Random forest model
# install.packages("randomForest")
library(randomForest)

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

set.seed(123)

tweetRF = randomForest(Negative ~ ., data=trainSparse)

# Make predictions:
predictRF = predict(tweetRF, newdata=testSparse)

table(testSparse$Negative, predictRF)

##        predictRF
##         FALSE TRUE
##   FALSE   293    7
##   TRUE     34   21

# Accuracy:
(293+21)/(293+7+34+21)

## [1] 0.884507

# QUICK QUESTION  (1 point possible)
# In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well logistic regression does. Build a logistic regression model (using the training set) to predict "Negative" using all of the independent variables. You may get a warning message after building your model - don't worry (we explain what it means in the explanation).
tweet.logistic = glm(Negative ~ ., data=trainSparse, family="binomial")

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Now, make predictions using the logistic regression model:

predictions = predict(tweet.logistic, newdata=testSparse, type="response")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

# Here "tweet.logistic" is the name of the logistic regression model built above. You might also get a warning message after this command, but don't worry - it is due to the same problem as the previous warning message.

# Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model. What is the accuracy?
table(testSparse$Negative, predictions > 0.5)

##        
##         FALSE TRUE
##   FALSE   253   47
##   TRUE     23   32

(253+32)/nrow(testSparse) # Accuracy is 0.8028169

## [1] 0.8028169

# ===========================================
# Unit 5 - Recitation


# Video 2
# Load the dataset

emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)
str(emails)

## 'data.frame':    855 obs. of  2 variables:
##  $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
##  $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...

# Look at emails

emails$email[1]

## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. \"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. 
\"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"

emails$responsive[1]

## [1] 0

emails$email[2]

## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the  California State Auditor. 
I look forward to seeing you at The Aspen  Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"

emails$responsive[2]

## [1] 1

# Responsive emails

table(emails$responsive)

## 
##   0   1 
## 716 139

# Note: strwrap() can break down a long email into many lines
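
# For example, strwrap() splits one long string into a character vector of lines narrower than a chosen width, which makes a long email easier to read in the console. A sketch on a sentence from the first email (the width of 40 is an arbitrary choice, and `long_text` and `wrapped` are hypothetical names):

```r
long_text = "The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States."

# Break the string into lines of fewer than 40 characters each
wrapped = strwrap(long_text, width = 40)

# writeLines prints one wrapped line per row
writeLines(wrapped)
```

# On the data loaded above, the equivalent call would be strwrap(emails$email[1]).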

# Video 3


# Load tm package

library(tm)


# Create corpus

corpus = Corpus(VectorSource(emails$email))

corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
## North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. "Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated," said Janine Ferretti, executive director of the CEC. "We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment." The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. "Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships," said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. "We need to achieve this new level of cooperation in our environmental approaches as well." The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. "We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector," said Ferretti. "How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?" 
The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called "pollution havens" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. "The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region," said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********

# Pre-process data
corpus = tm_map(corpus, tolower)

# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts each document in the corpus to the PlainTextDocument type). This is needed because of a change in how tm handles the tolower function, made after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)


corpus = tm_map(corpus, removePunctuation)

corpus = tm_map(corpus, removeWords, stopwords("english"))

corpus = tm_map(corpus, stemDocument)

# Look at first email
corpus[[1]]

## <<PlainTextDocument (metadata: 7)>>
## north america integr electr market requir cooper  environment polici commiss  environment cooper releas work paper  north america electr market montreal 27 novemb 2001   north american commiss  environment cooper cec  releas  work paper highlight  trend toward increas trade competit  crossbord invest  electr  canada mexico   unit state   hope   work paper environment challeng  opportun   evolv north american electr market will stimul public discuss around  cec symposium    titl   need  coordin environment polici trinat   north americawid electr market develop  cec symposium will take place  san diego  2930 novemb  will bring togeth lead expert  industri academia ngos   govern  canada mexico   unit state  consid  impact   evolv continent electr market  human health   environ  goal   work paper   symposium   highlight key environment issu  must  address   electr market  north america becom    integr said janin ferretti execut director   cec  want  stimul discuss around  import polici question  rais   countri can cooper   approach  energi   environ  cec  intern organ creat   environment side agreement  nafta known   north american agreement  environment cooper  establish  address region environment concern help prevent potenti trade  environment conflict  promot  effect enforc  environment law  cec secretariat believ  greater north american cooper  environment polici regard  continent electr market  necessari    protect air qualiti  mitig climat chang   minim  possibl  environmentbas trade disput   ensur  depend suppli  reason price electr across north america   avoid creation  pollut haven    ensur local  nation environment measur remain effect  chang market  work paper profil  rapid chang north american electr market  exampl  2001  us  project  export 131 thousand gigawatthour gwh  electr  canada  mexico  2007  number  project  grow  169 thousand gwh  electr   past  decad  north american electr market  develop   complex array  crossbord transact  relationship 
said phil sharp former us congressman  chairman   cec electr advisori board  need  achiev  new level  cooper   environment approach  well  environment profil   electr sector  electr sector   singl largest sourc  nation report toxin   unit state  canada   larg sourc  mexico   us  electr sector emit approxim 25 percent   nox emiss rough 35 percent   co2 emiss 25 percent   mercuri emiss  almost 70 percent  so2 emiss  emiss   larg impact  airsh watersh  migratori speci corridor   often share   three north american countri  want  discuss  possibl outcom  greater effort  coordin feder state  provinci environment law  polici  relat   electr sector said ferretti  can  develop  compat environment approach  help make domest environment polici  effect  effect   integr electr market one key issu rais   paper   effect  market integr   competit  particular fuel   coal natur gas  renew fuel choic larg determin environment impact   specif facil along  pollut control technolog perform standard  regul  paper highlight  impact   high competit market  well  exampl concern   call pollut haven aris  signific differ  environment law  enforc practic induc power compani  locat  oper  jurisdict  lower standard  cec secretariat  explor  addit environment polici will work   restructur market    polici can  adapt  ensur   enhanc competit  benefit  entir region said sharp  trade rule  polici measur direct influenc  variabl  drive  success integr north american electr market  work paper also address fuel choic technolog pollut control strategi  subsidi  cec will use  inform gather   discuss period  develop  final report  will  submit   council  earli 2002   inform   view  live video webcast   symposium pleas go  httpwwwcecorgelectr  may download  work paper   support document  httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss  environment cooper 393 rue stjacqu ouest bureau 200 montréal québec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg

# Video 4

# Create matrix

dtm = DocumentTermMatrix(corpus)
dtm

## <<DocumentTermMatrix (documents: 855, terms: 21735)>>
## Non-/sparse entries: 102511/18480914
## Sparsity           : 99%
## Maximal term length: 113
## Weighting          : term frequency (tf)

# Remove sparse terms
dtm = removeSparseTerms(dtm, 0.97)
dtm

## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51644/622096
## Sparsity           : 92%
## Maximal term length: 19
## Weighting          : term frequency (tf)

# Create data frame
labeledTerms = as.data.frame(as.matrix(dtm))

# Add in the outcome variable
labeledTerms$responsive = emails$responsive

str(labeledTerms)

## 'data.frame':    855 obs. of  789 variables:
##  $ 100                : num  0 0 0 0 0 0 5 0 0 0 ...
##  $ 1400               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 1999               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 2000               : num  0 0 1 0 1 0 6 0 1 0 ...
##  $ 2001               : num  2 1 0 0 0 0 7 0 0 0 ...
##  $ 713                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 77002              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl                : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ accept             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ access             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accord             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ account            : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ act                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ action             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ activ              : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ actual             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ add                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ addit              : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ address            : num  3 0 0 0 2 0 0 0 0 1 ...
##  $ administr          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ advanc             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ advis              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ affect             : num  0 0 0 0 2 0 0 0 0 0 ...
##  $ afternoon          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agenc              : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ ago                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ agre               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agreement          : num  2 0 0 0 2 0 1 0 0 1 ...
##  $ alan               : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ allow              : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ along              : num  1 0 0 0 1 0 1 0 0 0 ...
##  $ alreadi            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ also               : num  1 0 0 0 0 0 8 0 0 0 ...
##  $ altern             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ although           : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ amend              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ america            : num  4 0 0 0 0 0 0 0 1 0 ...
##  $ among              : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ amount             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ analysi            : num  0 0 0 2 0 0 0 0 0 0 ...
##  $ analyst            : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ andor              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ andrew             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ announc            : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ anoth              : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ answer             : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ anyon              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anyth              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appear             : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ appli              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ applic             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appreci            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ approach           : num  3 0 0 0 0 0 1 0 0 0 ...
##  $ appropri           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ approv             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ approxim           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ april              : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ area               : num  0 0 0 0 1 0 3 0 0 0 ...
##  $ around             : num  2 0 0 0 0 0 1 0 0 0 ...
##  $ arrang             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ articl             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ ask                : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ asset              : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ assist             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ associ             : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ assum              : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ attach             : num  0 1 0 1 1 0 1 0 3 1 ...
##  $ attend             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ attent             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ attorney           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ august             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ author             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ avail              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ averag             : num  0 0 0 0 0 0 5 0 0 0 ...
##  $ avoid              : num  1 0 0 0 1 0 2 0 0 0 ...
##  $ awar               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ back               : num  0 0 0 0 1 1 1 0 0 0 ...
##  $ balanc             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bank               : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ base               : num  0 0 0 0 1 0 9 0 0 0 ...
##  $ basi               : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ becom              : num  1 0 0 0 0 0 4 0 0 0 ...
##  $ begin              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ believ             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ benefit            : num  1 0 0 0 0 0 5 0 0 0 ...
##  $ best               : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ better             : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ bid                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ big                : num  0 0 0 0 0 1 6 0 0 0 ...
##  $ bill               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ billion            : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ bit                : num  0 0 0 0 0 1 2 0 0 0 ...
##  $ board              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ bob                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ book               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ brian              : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ brief              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bring              : num  1 0 0 0 0 0 2 0 0 0 ...
##  $ build              : num  0 0 0 0 0 0 7 0 1 0 ...
##   [list output truncated]

# Video 5


# Split the data

library(caTools)

set.seed(144)

spl = sample.split(labeledTerms$responsive, 0.7)

train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)

# Build a CART model

library(rpart)
library(rpart.plot)

emailCART = rpart(responsive~., data=train, method="class")

prp(emailCART)

# Video 6

# Make predictions on the test set

pred = predict(emailCART, newdata=test)
pred[1:10,]

##                        0          1
## character(0)   0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000

pred.prob = pred[,2]

# Compute accuracy

table(test$responsive, pred.prob >= 0.5)

##    
##     FALSE TRUE
##   0   195   20
##   1    17   25

(195+25)/(195+25+17+20)

## [1] 0.8560311
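
# A minimal sketch (not part of the lecture code): when the rows and columns
# of a confusion matrix are in the same class order, accuracy is the sum of
# the diagonal divided by the total. The counts are copied from the table
# above; the name "cm" is an assumption made for this illustration.
cm = matrix(c(195, 17, 20, 25), nrow = 2,  # filled column by column
            dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE")))
sum(diag(cm)) / sum(cm)  # same arithmetic as (195+25)/(195+25+17+20)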

# Baseline model accuracy

table(test$responsive)

## 
##   0   1 
## 215  42

215/(215+42)

## [1] 0.8365759

# Video 7

# ROC curve
install.packages("ROCR")

## 
## The downloaded binary packages are in
##  /var/folders/r_/jg_fymdd069b2cw6jsqwvxd80000gn/T//RtmpjOo4mS/downloaded_packages

library(ROCR)

## Warning: package 'ROCR' was built under R version 3.1.3

## Loading required package: gplots
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess

predROCR = prediction(pred.prob, test$responsive)

perfROCR = performance(predROCR, "tpr", "fpr")

plot(perfROCR, colorize=TRUE)

# Compute AUC

performance(predROCR, "auc")@y.values

## [[1]]
## [1] 0.7936323
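
# A hedged sketch of what the AUC measures, using only base R: it is the
# probability that a randomly chosen positive case gets a higher score than a
# randomly chosen negative case (ties count as one half). The function name
# "rankAUC" and the toy vectors are assumptions made for illustration; this is
# not a description of how ROCR computes the value internally.
rankAUC = function(scores, labels) {
  r = rank(scores)                 # average ranks handle tied scores
  nPos = sum(labels == 1)
  nNeg = sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}
toyScores = c(0.9, 0.8, 0.4, 0.3, 0.1)
toyLabels = c(1, 0, 1, 0, 0)
rankAUC(toyScores, toyLabels)      # 5 of the 6 positive/negative pairs are ordered correctly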

# ===================
# Assignment 5.1: DETECTING VANDALISM ON WIKIPEDIA

# PROBLEM 1.1 - BAGS OF WORDS  (1 point possible)
# Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame "wiki". 

wiki = read.csv("wiki.csv", stringsAsFactors=FALSE)
str(wiki)

## 'data.frame':    3876 obs. of  7 variables:
##  $ X.1     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Vandal  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Minor   : int  1 1 0 1 1 0 0 0 1 0 ...
##  $ Loggedin: int  1 1 1 0 1 1 1 1 1 0 ...
##  $ Added   : chr  "  represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
##  $ Removed : chr  " " " talklanguagetalk" " regarded as technologytechnologies human first" "  represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...

# Convert the "Vandal" column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).
wiki$Vandal = as.factor(wiki$Vandal)

# How many cases of vandalism were detected in the history of this page? ANS 1815
table(wiki$Vandal)

## 
##    0    1 
## 2061 1815

# PROBLEM 1.2 - BAGS OF WORDS  (2 points possible)
# We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning to removing rude words. We'll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:

# 1) Create the corpus for the Added column, and call it "corpusAdded".

corpusAdded = Corpus(VectorSource(wiki$Added))
corpusAdded[[1]]

## <<PlainTextDocument (metadata: 7)>>
##   represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationships linguistics regarded writing languages to other listing xmlspacepreservelanguages metaverse formal term philology common each including phonologyphonology often ten list humans affiliation see computer are speechpathologyspeech our what for ways dialects please artificial written body be of quite hypothesis found alone refers by about language profanity study programming priorities rosenfelders technologytechnologies makes or first among useful languagephilosophy one sounds use area create phrases mark their genetic basic families complete but sapirwhorfhypothesissapirwhorf with talklanguagetalk population animals this science up vocal can concepts called at and topics locations as numbers have in pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy most important meaningful a historicallinguisticsorphilologyhistorical semanticssemantics patterns the oral

# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpusAdded = tm_map(corpusAdded, PlainTextDocument)


# 2) Remove the English-language stopwords.
corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))
corpusAdded[[1]]

## <<PlainTextDocument (metadata: 7)>>
##   represent psycholinguisticspsycholinguistics orthographyorthography help text  actions  human ethnologue relationships linguistics regarded writing languages   listing xmlspacepreservelanguages metaverse formal term philology common  including phonologyphonology often ten list humans affiliation see computer  speechpathologyspeech    ways dialects please artificial written body   quite hypothesis found alone refers   language profanity study programming priorities rosenfelders technologytechnologies makes  first among useful languagephilosophy one sounds use area create phrases mark  genetic basic families complete  sapirwhorfhypothesissapirwhorf  talklanguagetalk population animals  science  vocal can concepts called   topics locations  numbers   pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy  important meaningful  historicallinguisticsorphilologyhistorical semanticssemantics patterns  oral

# 3) Stem the words.
corpusAdded = tm_map(corpusAdded, stemDocument)
corpusAdded[[1]]

## <<PlainTextDocument (metadata: 7)>>
##   repres psycholinguisticspsycholinguist orthographyorthographi help text  action  human ethnologu relationship linguist regard write languag   list xmlspacepreservelanguag metavers formal term philolog common  includ phonologyphonolog often ten list human affili see comput  speechpathologyspeech    way dialect pleas artifici written bodi   quit hypothesi found alon refer   languag profan studi program prioriti rosenfeld technologytechnolog make  first among use languagephilosophi one sound use area creat phrase mark  genet basic famili complet  sapirwhorfhypothesissapirwhorf  talklanguagetalk popul anim  scienc  vocal can concept call   topic locat  number   patholog differ develop 4000 thing idea group complex anim mathemat fair literatur httpwwwzompistcom philosophi  import meaning  historicallinguisticsorphilologyhistor semanticssemant pattern  oral

# 4) Build the DocumentTermMatrix, and call it dtmAdded.

dtmAdded = DocumentTermMatrix(corpusAdded)
dtmAdded

## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

# If the code length(stopwords("english")) does not return 174 for you [it does], then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusAdded, removeWords, sw) instead of tm_map(corpusAdded, removeWords, stopwords("english")).

# How many terms appear in dtmAdded? # ANS 6675

# PROBLEM 1.3 - BAGS OF WORDS  (1 point possible)
# Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?
sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded

## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
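
# A rough sketch of the 0.997 threshold, assuming removeSparseTerms keeps a
# term only when it occurs in more than (1 - 0.997) of the documents. With
# 3876 revisions that means more than 11.628, i.e. at least 12 revisions. The
# toy matrix and the names "toyDTM", "docFreq", and "keep" are made up for
# this illustration, and the toy threshold of 0.5 is chosen only to fit the
# small example.
3876 * (1 - 0.997)
toyDTM = matrix(c(1, 0, 0, 0,  2, 1, 0, 1,  0, 0, 0, 3), nrow = 4,  # one column per term
                dimnames = list(NULL, c("rare", "common", "once")))
docFreq = colSums(toyDTM > 0)               # number of documents containing each term
keep = docFreq > nrow(toyDTM) * (1 - 0.5)   # keep terms found in more than half the documents
toyDTM[, keep, drop = FALSE]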

# PROBLEM 1.4 - BAGS OF WORDS  (2 points possible)
# Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

wordsAdded = as.data.frame(as.matrix(sparseAdded))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
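
# A note on the prefix, illustrated with made-up stems ("toyStems" is an
# assumption): paste() joins its arguments with a space by default, so the new
# column names look like "A word" rather than "Aword". paste0() is a base R
# alternative that joins without a separator; it is shown only for comparison
# and is not what the assignment uses.
toyStems = c("languag", "use")
paste("A", toyStems)
paste0("A", toyStems)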

# Now repeat all of the steps we've done so far:
# create a corpus, 
corpusRemoved = Corpus(VectorSource(wiki$Removed))

# remove stop words, 
corpusRemoved = tm_map(corpusRemoved, removeWords, stopwords("english"))
corpusRemoved[[1]]

## <<PlainTextDocument (metadata: 7)>>
##

# stem the document, 
corpusRemoved = tm_map(corpusRemoved, stemDocument)
corpusRemoved[[1]]

## <<PlainTextDocument (metadata: 7)>>

# create a sparse document term matrix, and
dtmRemoved = DocumentTermMatrix(corpusRemoved)
dtmRemoved

## <<DocumentTermMatrix (documents: 3876, terms: 5403)>>
## Non-/sparse entries: 13293/20928735
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

sparseRemoved = removeSparseTerms(dtmRemoved, 0.997)
sparseRemoved

## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

# convert it to a data frame, creating a Removed bag-of-words data frame called wordsRemoved. This time, prepend all of the words with the letter R:

wordsRemoved = as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

# How many words are in the wordsRemoved data frame?
ncol(wordsRemoved) # ANS 162

## [1] 162

# PROBLEM 1.5 - BAGS OF WORDS  (2 points possible)
# Combine the two data frames into a data frame called wikiWords with the following line of code:

wikiWords = cbind(wordsAdded, wordsRemoved)

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3,4,5,6,7,8,9,10,... [list output truncated]

# The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture). 
wikiWords$Vandal = wiki$Vandal

# Set the random seed to 123 and then split the data set using sample.split from the "caTools" package to put 70% in the training set.


set.seed(123)

split = sample.split(wikiWords$Vandal, SplitRatio = 0.7)
trainSparse = subset(wikiWords, split==TRUE)
testSparse = subset(wikiWords, split==FALSE)
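
# A base R stand-in (not caTools' own algorithm, and it will not reproduce the
# same random draws) for what a stratified 70/30 split does: sample 70% of the
# rows within each outcome class, so both sets keep a similar class balance.
# The names "toyY", "inTrain", and "idx" are assumptions for this sketch.
set.seed(123)
toyY = factor(rep(c(0, 1), times = c(60, 40)))
inTrain = rep(FALSE, length(toyY))
for (lev in levels(toyY)) {
  idx = which(toyY == lev)                                   # rows in this class
  inTrain[idx[sample.int(length(idx), round(0.7 * length(idx)))]] = TRUE
}
table(toyY, inTrain)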

# What is the accuracy on the test set of a baseline method that always predicts "not vandalism" (the most frequent outcome)?
table(testSparse$Vandal)

## 
##   0   1 
## 618 545

618/nrow(testSparse) # ANS  0.5313844

## [1] 0.5313844
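
# The same baseline can be computed without reading the counts off the table.
# A sketch on a made-up vector ("toyOutcome" is not assignment data): the
# accuracy of always predicting the most frequent class is the largest class
# count divided by the number of observations.
toyOutcome = factor(c(0, 0, 0, 1, 1))
max(table(toyOutcome)) / length(toyOutcome)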

# PROBLEM 1.6 - BAGS OF WORDS  (2 points possible)
# Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don't set values for minbucket or cp).

# What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)

VandalCART = rpart(Vandal ~ ., data=trainSparse, method="class")

prp(VandalCART)

# Evaluate the performance of the model
predict.Vandal.CART = predict(VandalCART, newdata=testSparse, type="class")

table(testSparse$Vandal, predict.Vandal.CART)

##    predict.Vandal.CART
##       0   1
##   0 618   0
##   1 533  12

# Compute accuracy

(618+12)/nrow(testSparse) # ANS 0.5417025

## [1] 0.5417025

# PROBLEM 1.7 - BAGS OF WORDS  (1 point possible)
# Plot the CART tree. How many word stems does the CART model use?

prp(VandalCART) # ANS 2

# PROBLEM 1.8 - BAGS OF WORDS  (1 point possible)
# Given the performance of the CART model relative to the baseline, what is the best explanation of these results? ANS Although it beats the baseline, bag of words is not very predictive for this problem. - correct

# PROBLEM 2.1 - PROBLEM-SPECIFIC KNOWLEDGE  (1 point possible)
# We weren't able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.

# The key class of words we will use are website addresses. "Website addresses" (also known as URLs - Uniform Resource Locators) consist of two main parts. An example would be "http://www.google.com". The first part is the protocol, which is usually "http" (HyperText Transfer Protocol). The second part is the address of the site, e.g. "www.google.com". We have stripped all punctuation so links to websites appear in the data as one word, e.g. "httpwwwgooglecom". We hypothesize that, given that a lot of vandalism seems to be adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.

# We can search for the presence of a web address in the words added by searching for "http" in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.

# grepl("cat","dogs and cats",fixed=TRUE) # TRUE

# grepl("cat","dogs and rats",fixed=TRUE) # FALSE

# Create a copy of your dataframe from the previous question:

wikiWords2 = wikiWords

# Make a new column in wikiWords2 that is 1 if "http" was in Added:

wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)

# Based on this new column, how many revisions added a link?
sum(wikiWords2$HTTP) # 217

## [1] 217
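
# grepl is vectorized over its second argument, which is why a single call
# flags every revision at once. A small illustration on made-up strings (the
# vectors "toyAdded" and "toyHTTP" are assumptions, not assignment data):
toyAdded = c("see httpwwwgooglecom", "plain edit", "")
toyHTTP = ifelse(grepl("http", toyAdded, fixed = TRUE), 1, 0)
toyHTTP
sum(toyHTTP)   # number of toy strings that contain "http"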

# PROBLEM 2.2 - PROBLEM-SPECIFIC KNOWLEDGE  (2 points possible)
# In problem 1.5, you computed a vector (called "spl" in the assignment; this script named it "split") that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets:

wikiTrain2 = subset(wikiWords2, split==TRUE)

wikiTest2 = subset(wikiWords2, split==FALSE)

# Then create a new CART model using this new variable as one of the independent variables (ie, it was added earlier, and is now an additional independent variable)

# What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

VandalCART2 = rpart(Vandal ~ ., data=wikiTrain2, method="class")

#prp(VandalCART)

# Evaluate the performance of the model
predict.Vandal.CART2 = predict(VandalCART2, newdata=wikiTest2, type="class")

table(wikiTest2$Vandal, predict.Vandal.CART2)

##    predict.Vandal.CART2
##       0   1
##   0 609   9
##   1 488  57

# Compute accuracy

(609+57)/nrow(wikiTest2) # ANS 0.5726569

## [1] 0.5726569

# PROBLEM 2.3 - PROBLEM-SPECIFIC KNOWLEDGE  (1 point possible)
# Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).

# Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:

wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))

wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))

# What is the average number of words added?

mean(wikiWords2$NumWordsAdded) # ANS 4.050052

## [1] 4.050052
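
# What rowSums does here, sketched on a made-up 3-document matrix ("toyCounts"
# is an assumption): each row is a revision and each column a term, so a row
# sum is the total number of counted terms in that revision.
toyCounts = matrix(c(2, 0, 1,  0, 0, 3), nrow = 3,  # one column per term
                   dimnames = list(NULL, c("term1", "term2")))
rowSums(toyCounts)
mean(rowSums(toyCounts))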

# PROBLEM 2.4 - PROBLEM-SPECIFIC KNOWLEDGE  (2 points possible)
# In problem 1.5, you computed a vector (called "spl" in the assignment; this script named it "split") that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).

# What is the new accuracy of the CART model on the test set? (Note: variables added in 2.3)
wikiTrain3 = subset(wikiWords2, split==TRUE)

wikiTest3 = subset(wikiWords2, split==FALSE)

VandalCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")

#prp(VandalCART)

# Evaluate the performance of the model
predict.Vandal.CART3 = predict(VandalCART3, newdata=wikiTest3, type="class")

table(wikiTest3$Vandal, predict.Vandal.CART3)

##    predict.Vandal.CART3
##       0   1
##   0 514 104
##   1 297 248

# Compute accuracy

(514+248)/nrow(wikiTest3) # ANS 0.6552021

## [1] 0.6552021

# PROBLEM 3.1 - USING NON-TEXTUAL DATA  (2 points possible)
# We have two pieces of "metadata" (data about data) that we haven't yet used. Make a copy of wikiWords2, and call it wikiWords3:

wikiWords3 = wikiWords2

# Then add the two original variables Minor and Loggedin to this new data frame:

wikiWords3$Minor = wiki$Minor

wikiWords3$Loggedin = wiki$Loggedin

# In problem 1.5, you computed a vector (called "spl" in the assignment; this script named it "split") that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.

# Build a CART model using all the training data. What is the accuracy of the model on the test set?
wikiTrain4 = subset(wikiWords3, split==TRUE)

wikiTest4 = subset(wikiWords3, split==FALSE)

VandalCART4 = rpart(Vandal ~ ., data=wikiTrain4, method="class")



# Evaluate the performance of the model
predict.Vandal.CART4 = predict(VandalCART4, newdata=wikiTest4, type="class")

table(wikiTest4$Vandal, predict.Vandal.CART4)

##    predict.Vandal.CART4
##       0   1
##   0 595  23
##   1 304 241

# Compute accuracy

(595+241)/nrow(wikiTest4) # ANS 0.7188306

## [1] 0.7188306

# PROBLEM 3.2 - USING NON-TEXTUAL DATA  (1 point possible)
# There is a substantial difference in the accuracy of the model using the metadata. Is this because we made a more complicated model?

# Plot the CART tree. How many splits are there in the tree?

prp(VandalCART4) # ANS 3

# UNIT 5 Section 2: AUTOMATING REVIEWS IN MEDICINE

# PROBLEM 1.1 - LOADING THE DATA  (1 point possible)
# Load clinical_trial.csv into a data frame called trials (remembering to add the argument stringsAsFactors=FALSE), and investigate the data frame with summary() and str().

trials = read.csv("clinical_trial.csv", stringsAsFactors=FALSE)
summary(trials)

##     title             abstract             trial       
##  Length:1860        Length:1860        Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :0.0000  
##                                        Mean   :0.4392  
##                                        3rd Qu.:1.0000  
##                                        Max.   :1.0000

str(trials)

## 'data.frame':    1860 obs. of  3 variables:
##  $ title   : chr  "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breast cancer: phase "| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for early breast can"| __truncated__ ...
##  $ abstract: chr  "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population before and after in"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with unfavorable outcom"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a consequence, they "| __truncated__ ...
##  $ trial   : int  1 0 1 1 1 0 1 0 0 0 ...

# IMPORTANT NOTE: Some students have been getting errors like "invalid multibyte string" when performing certain parts of this homework question. If this is happening to you, use the argument fileEncoding="latin1" when reading in the file with read.csv. This should cause those errors to go away.

# We can use R's string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text. Using the nchar() function on the variables in the data frame, answer the following questions:



# How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)

max(nchar(trials$abstract)) # ANS 3708

## [1] 3708

summary(nchar(trials$abstract)) # Alternative method

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1195    1583    1480    1820    3708
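
# A minimal sketch of the nchar() pattern used above, run on a made-up character vector. The name abstracts_demo and its contents are hypothetical and are not part of the trials data.

```r
# Hypothetical toy data standing in for trials$abstract
abstracts_demo <- c("", "short text", "a somewhat longer abstract")

# nchar() is vectorised: it returns one character count per element
lens <- nchar(abstracts_demo)

# Length of the longest entry, plus the same summary() view used above
longest <- max(lens)
print(longest)
print(summary(lens))
```

# max() answers the question directly; summary() gives the same maximum along with the minimum and quartiles, which is handy for spotting empty entries.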

# PROBLEM 1.2 - LOADING THE DATA  (1 point possible)
# How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)

table(nchar(trials$abstract)) # ANS 112. Easier to read by taking head() of this result, or by using one of the alternatives further down

## 
##    0  243  273  282  288  290  332  337  345  363  378  420  434  444  447 
##  112    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  454  463  464  465  468  469  477  482  483  489  491  492  501  507  511 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
##  514  519  528  541  543  548  559  563  566  567  576  584  585  588  591 
##    1    1    1    1    1    2    1    1    2    1    1    1    1    1    1 
##  593  600  601  604  610  615  617  620  627  628  631  634  639  644  645 
##    1    1    1    1    1    1    2    1    1    2    1    1    1    2    1 
##  647  655  656  660  666  671  673  675  681  685  688  695  700  701  713 
##    1    1    2    2    1    1    1    1    1    1    1    3    1    1    1 
##  717  720  721  722  723  725  730  733  735  739  740  741  765  773  775 
##    2    1    1    1    1    1    1    1    1    1    1    1    2    1    2 
##  777  781  782  783  788  795  798  802  805  806  808  811  820  823  825 
##    1    3    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  829  832  834  836  837  838  840  842  846  851  852  857  860  861  865 
##    1    1    1    2    2    1    1    1    1    1    1    1    1    1    1 
##  868  871  874  878  882  885  888  891  892  900  902  904  906  909  910 
##    1    1    1    1    1    3    2    1    2    1    1    1    1    3    1 
##  913  919  920  921  922  924  925  926  927  930  932  937  939  940  942 
##    1    2    1    2    1    1    2    1    3    2    1    1    1    1    1 
##  948  953  957  958  959  960  962  964  965  968  969  973  974  980  981 
##    1    1    1    1    1    2    1    1    2    3    1    1    1    1    1 
##  984  987  989  990  991  994  996 1000 1005 1006 1007 1009 1016 1018 1020 
##    1    2    3    2    1    1    1    1    1    1    2    2    1    1    2 
## 1021 1022 1024 1025 1026 1028 1029 1030 1031 1033 1034 1035 1037 1038 1041 
##    1    2    1    1    1    1    2    1    1    1    3    2    1    1    1 
## 1045 1047 1049 1050 1052 1054 1063 1064 1065 1066 1067 1069 1070 1071 1073 
##    2    4    1    1    2    1    1    1    2    1    1    2    2    1    1 
## 1077 1078 1079 1081 1082 1083 1085 1087 1088 1093 1094 1097 1098 1101 1103 
##    1    3    1    1    2    2    1    1    1    1    2    1    2    2    1 
## 1105 1107 1108 1111 1112 1114 1115 1116 1121 1122 1123 1124 1125 1126 1127 
##    1    1    1    1    3    1    2    1    2    1    1    1    1    2    2 
## 1128 1133 1134 1136 1137 1138 1140 1145 1148 1149 1159 1161 1165 1167 1169 
##    1    1    2    1    2    1    1    1    1    1    1    1    1    2    3 
## 1170 1171 1173 1175 1176 1177 1179 1181 1182 1184 1185 1188 1189 1191 1192 
##    1    2    1    1    1    1    1    2    2    4    1    1    1    1    3 
## 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1207 1208 1209 1210 1211 
##    3    2    1    1    1    1    1    1    2    1    2    1    2    2    1 
## 1212 1214 1215 1216 1217 1221 1222 1223 1225 1229 1231 1234 1235 1236 1239 
##    2    2    3    3    1    2    2    1    2    1    1    1    1    1    2 
## 1240 1241 1243 1245 1250 1253 1254 1255 1256 1258 1259 1261 1262 1265 1267 
##    1    2    2    1    1    1    1    1    2    1    1    1    1    1    1 
## 1268 1269 1270 1271 1272 1273 1274 1276 1277 1278 1279 1281 1284 1288 1289 
##    1    2    2    1    1    1    1    1    2    2    1    3    1    1    2 
## 1290 1291 1293 1294 1295 1296 1298 1300 1302 1303 1305 1306 1307 1308 1309 
##    1    2    1    3    1    1    1    1    1    1    3    4    2    2    1 
## 1310 1312 1313 1316 1317 1321 1323 1324 1325 1326 1329 1330 1331 1333 1335 
##    1    1    2    2    1    2    2    2    1    2    1    1    4    1    1 
## 1336 1337 1339 1340 1341 1342 1343 1344 1346 1347 1348 1349 1350 1351 1352 
##    1    3    1    2    1    1    1    3    2    1    2    2    1    3    2 
## 1354 1355 1356 1358 1359 1360 1361 1362 1363 1364 1367 1368 1369 1370 1371 
##    1    1    1    2    2    1    2    1    1    4    2    2    3    3    1 
## 1372 1373 1375 1376 1377 1380 1381 1383 1386 1387 1389 1390 1391 1392 1393 
##    2    2    2    1    4    2    1    1    1    1    2    2    1    1    2 
## 1394 1396 1397 1399 1400 1401 1403 1404 1405 1407 1408 1409 1410 1412 1413 
##    2    3    1    2    1    3    1    1    2    3    2    2    1    3    4 
## 1414 1415 1416 1418 1419 1421 1422 1423 1424 1425 1428 1432 1433 1434 1436 
##    1    3    3    1    1    3    1    1    4    2    1    2    1    2    1 
## 1437 1438 1441 1442 1445 1446 1447 1448 1449 1450 1451 1452 1453 1455 1456 
##    1    1    1    1    2    1    2    2    1    3    2    1    2    2    1 
## 1459 1460 1461 1463 1466 1467 1468 1470 1472 1474 1476 1477 1478 1479 1480 
##    2    1    3    2    2    2    3    1    1    2    2    4    3    3    1 
## 1481 1484 1485 1487 1489 1490 1491 1492 1493 1494 1496 1497 1499 1500 1501 
##    3    1    1    1    2    1    1    1    1    1    1    2    1    5    3 
## 1503 1504 1505 1506 1507 1508 1509 1510 1512 1514 1515 1516 1517 1518 1519 
##    2    1    2    3    1    3    3    1    1    1    1    3    3    1    1 
## 1520 1521 1522 1524 1525 1526 1528 1529 1530 1532 1533 1534 1535 1536 1537 
##    1    2    1    2    3    3    2    1    1    1    2    1    3    3    1 
## 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 
##    1    1    3    2    2    5    1    1    1    2    1    3    1    4    1 
## 1555 1556 1561 1563 1564 1566 1569 1570 1571 1572 1574 1575 1577 1578 1580 
##    4    1    1    1    1    3    6    1    1    2    2    2    3    1    1 
## 1581 1582 1583 1584 1585 1586 1588 1589 1591 1593 1594 1596 1597 1598 1600 
##    2    2    3    2    1    2    1    2    1    1    2    1    4    1    1 
## 1601 1602 1603 1604 1605 1606 1609 1610 1613 1614 1615 1616 1617 1618 1619 
##    4    3    2    2    1    3    3    2    4    1    3    2    1    1    1 
## 1620 1622 1623 1624 1625 1626 1627 1628 1630 1631 1632 1633 1634 1635 1636 
##    1    1    4    2    2    3    1    1    2    1    3    1    3    1    2 
## 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 
##    2    2    2    4    1    1    2    3    1    1    1    1    5    3    2 
## 1652 1653 1654 1655 1656 1657 1658 1659 1660 1662 1663 1665 1666 1667 1668 
##    1    1    1    3    2    1    4    2    2    4    1    1    1    3    3 
## 1669 1670 1671 1672 1673 1674 1675 1676 1678 1679 1680 1681 1682 1683 1684 
##    2    3    2    3    1    5    3    2    1    1    4    1    3    4    2 
## 1685 1686 1687 1688 1689 1690 1692 1696 1697 1698 1699 1700 1702 1703 1704 
##    2    1    1    1    1    2    4    2    3    3    1    3    2    1    1 
## 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 
##    2    3    2    2    2    1    3    1    2    4    1    1    3    2    2 
## 1720 1721 1722 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 
##    1    1    1    3    2    1    2    4    5    3    1    1    3    2    1 
## 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1747 1748 1750 1751 1753 
##    5    2    6    2    5    2    3    4    3    2    1    1    1    4    4 
## 1754 1755 1756 1759 1760 1761 1762 1763 1764 1765 1766 1768 1769 1770 1771 
##    4    4    3    3    4    2    4    3    2    1    2    2    2    2    1 
## 1772 1773 1774 1775 1776 1777 1779 1780 1781 1782 1783 1784 1786 1787 1788 
##    5    3    2    4    2    3    3    1    1    2    2    2    3    3    1 
## 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 
##    3    1    4    1    2    1    1    4    1    3    1    3    2    2    4 
## 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 
##    2    2    2    3    1    2    1    2    4    5    2    2    2    3    4 
## 1819 1820 1821 1822 1824 1825 1827 1828 1829 1830 1831 1832 1833 1834 1837 
##    2    2    1    3    3    2    2    3    1    2    1    3    4    2    3 
## 1838 1839 1840 1842 1843 1844 1846 1848 1850 1852 1853 1854 1855 1856 1857 
##    3    2    2    1    1    2    2    1    3    3    1    1    1    2    2 
## 1858 1859 1860 1861 1862 1863 1865 1867 1869 1870 1871 1872 1873 1875 1882 
##    2    1    1    3    1    2    1    2    3    1    1    1    2    2    2 
## 1883 1884 1885 1887 1888 1890 1891 1892 1893 1895 1896 1897 1898 1899 1901 
##    1    1    1    3    3    1    3    2    1    1    1    3    2    2    3 
## 1902 1904 1906 1908 1909 1910 1911 1912 1914 1916 1917 1918 1919 1920 1921 
##    2    1    1    1    3    2    1    1    2    2    1    1    1    3    1 
## 1922 1923 1924 1925 1927 1928 1929 1930 1933 1934 1935 1936 1937 1938 1939 
##    3    1    1    3    2    2    3    1    1    1    1    1    2    2    1 
## 1940 1941 1942 1943 1944 1945 1946 1947 1948 1951 1953 1954 1955 1956 1957 
##    3    1    2    1    4    1    5    1    1    2    1    1    1    1    2 
## 1958 1959 1960 1962 1965 1970 1971 1973 1974 1975 1976 1979 1980 1981 1982 
##    1    1    1    2    1    1    1    1    2    1    3    1    1    2    1 
## 1985 1986 1987 1990 1992 1993 1995 1996 1998 1999 2001 2002 2005 2008 2011 
##    1    1    2    2    3    1    1    1    2    3    1    2    1    1    2 
## 2012 2013 2015 2016 2018 2019 2020 2024 2025 2028 2029 2030 2031 2037 2039 
##    2    2    2    1    1    1    1    1    2    1    1    1    1    1    1 
## 2040 2041 2043 2044 2046 2049 2050 2051 2052 2053 2056 2059 2061 2062 2069 
##    1    2    1    1    1    1    1    1    1    1    1    4    1    1    1 
## 2071 2072 2074 2075 2076 2078 2080 2081 2082 2089 2093 2095 2097 2101 2105 
##    1    1    1    2    1    1    1    1    2    1    1    1    1    2    1 
## 2106 2108 2110 2117 2120 2125 2126 2127 2131 2133 2134 2137 2138 2141 2142 
##    1    1    1    1    1    1    1    2    1    2    1    1    1    2    1 
## 2146 2151 2153 2154 2170 2173 2175 2177 2182 2183 2190 2191 2193 2194 2195 
##    1    1    1    1    1    1    2    1    1    1    1    1    1    1    1 
## 2196 2206 2207 2208 2212 2215 2219 2221 2223 2228 2229 2232 2235 2236 2238 
##    1    2    1    1    1    1    1    1    2    1    2    1    1    1    1 
## 2240 2241 2242 2247 2248 2250 2252 2254 2257 2259 2262 2263 2265 2267 2271 
##    2    1    2    1    1    1    1    1    1    2    1    1    1    1    1 
## 2276 2280 2282 2287 2290 2292 2293 2297 2299 2300 2302 2303 2311 2325 2329 
##    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
## 2334 2336 2345 2348 2349 2358 2360 2367 2376 2380 2382 2387 2389 2394 2404 
##    1    1    2    1    1    1    1    1    1    1    1    1    2    2    1 
## 2410 2412 2422 2428 2430 2431 2436 2441 2450 2474 2479 2481 2482 2483 2489 
##    1    1    1    1    1    2    1    1    2    1    1    1    1    1    1 
## 2494 2496 2497 2504 2511 2516 2522 2528 2529 2530 2532 2533 2534 2539 2547 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 2550 2555 2584 2602 2606 2607 2633 2636 2682 2685 2700 2709 2712 2722 2723 
##    1    1    2    1    1    1    1    1    2    1    1    1    1    1    1 
## 2746 2751 2752 2779 2790 2791 2796 2797 2798 2816 2838 2891 2901 2905 2910 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 2942 2961 2965 3050 3104 3138 3178 3214 3298 3465 3642 3708 
##    1    1    1    1    1    1    1    1    1    1    1    1

table(nchar(trials$abstract)==0) # Alt, gives summary

## 
## FALSE  TRUE 
##  1748   112

sum(nchar(trials$abstract)==0) # Alt

## [1] 112
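
# Why sum() works here: comparing with == 0 gives a logical vector, and R treats TRUE as 1 when summing. A small sketch on hypothetical data (abstracts_demo is made up for illustration):

```r
# Hypothetical toy data: two of the four entries are empty strings
abstracts_demo <- c("", "some text", "", "more text")

# The comparison yields a logical vector, one TRUE/FALSE per element
is_empty <- nchar(abstracts_demo) == 0

# Summing the logical vector counts the TRUE values
n_empty <- sum(is_empty)
print(n_empty)

# table() reports the FALSE and TRUE counts together
print(table(is_empty))
```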

# PROBLEM 1.3 - LOADING THE DATA  (1 point possible)
# Find the observation with the minimum number of characters in the title (the variable "title") out of all of the observations in this dataset. 

which.min(nchar(trials$title)) # ANS observation 1258

## [1] 1258

summary(nchar(trials$title)) # ANS 28 characters

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    28.0   102.0   133.5   138.8   168.0   336.0

nchar(trials$title[which.min(nchar(trials$title))]) # Alt for 28 characters

## [1] 28

# What is the text of the title of this article? Include capitalization and punctuation in your response, but don't include the quotes.
trials$title[which.min(nchar(trials$title))] # ANS A decade of letrozole: FACE.

## [1] "A decade of letrozole: FACE."
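
# which.min() returns a position, not a value, so it can be used directly as an index. The sketch below uses a hypothetical vector (titles_demo) that contains a tie, to show that only the first minimum is reported.

```r
# Hypothetical toy titles; "Tiny" and "Four" are both 4 characters long
titles_demo <- c("A long example title", "Tiny", "Mid-size title", "Four")

lens <- nchar(titles_demo)

# which.min() gives the position of the FIRST minimum only
idx <- which.min(lens)
shortest <- titles_demo[idx]
print(shortest)

# To see every position that ties for the minimum, compare against min()
all_idx <- which(lens == min(lens))
print(all_idx)
```

# In the trials data a single title has 28 characters according to the answer above, so which.min() alone is enough there; the which(... == min(...)) form is only a safeguard for data with ties.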

# PROBLEM 2.1 - PREPARING THE CORPUS  (4 points possible)
# Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.

# Following the commands from lecture, perform the following tasks (you might need to load the "tm" package first if it isn't already loaded). Make sure to perform them in this order.

# 1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.

corpusTitle = Corpus(VectorSource(trials$title))
corpusAbstract = Corpus(VectorSource(trials$abstract))

# 2) Convert corpusTitle and corpusAbstract to lowercase. 

corpusTitle = tm_map(corpusTitle, tolower)
corpusAbstract = tm_map(corpusAbstract, tolower)

# After performing this step, remember to run the lines:

corpusTitle = tm_map(corpusTitle, PlainTextDocument)
corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)

# 3) Remove the punctuation in corpusTitle and corpusAbstract.

corpusTitle = tm_map(corpusTitle, removePunctuation)
corpusAbstract = tm_map(corpusAbstract, removePunctuation)

# 4) Remove the English language stop words from corpusTitle and corpusAbstract.

corpusTitle = tm_map(corpusTitle, removeWords, stopwords("english"))
corpusAbstract = tm_map(corpusAbstract, removeWords, stopwords("english"))

# 5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).

corpusTitle = tm_map(corpusTitle, stemDocument)
corpusAbstract = tm_map(corpusAbstract, stemDocument)

# 6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.

dtmTitle = DocumentTermMatrix(corpusTitle)
dtmAbstract = DocumentTermMatrix(corpusAbstract)

# 7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).

dtmTitle = removeSparseTerms(dtmTitle, 0.95)
dtmAbstract = removeSparseTerms(dtmAbstract, 0.95)

# 8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).

dtmTitle = as.data.frame(as.matrix(dtmTitle))
dtmAbstract = as.data.frame(as.matrix(dtmAbstract))

# If the code length(stopwords("english")) does not return 174 for you [Note: it does], then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")). 
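
# To make the first few cleaning steps concrete, here is a rough base-R approximation of lowercasing, punctuation removal and stop-word removal. It is only a sketch: it does not use tm, does not stem, and the names titles_demo and stop_demo (a tiny hand-picked stop-word list) are hypothetical.

```r
# Hypothetical toy titles and a tiny, made-up stop-word list
titles_demo <- c("Treatment of Breast Cancer: a Trial.", "The Effect of Therapy!")
stop_demo <- c("of", "a", "the")

# Step 2 analogue: convert to lowercase
clean <- tolower(titles_demo)

# Step 3 analogue: strip punctuation characters
clean <- gsub("[[:punct:]]", "", clean)

# Step 4 analogue: split on whitespace and drop the stop words
tokens <- strsplit(clean, "\\s+")
tokens <- lapply(tokens, function(w) w[!(w %in% stop_demo)])
print(tokens)
```

# The order matters: the stop-word list is lowercase, so "The" would survive if stop words were removed before lowercasing.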

# How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?
ncol(dtmTitle) # ANS 31

## [1] 31

# How many terms remain in dtmAbstract?
ncol(dtmAbstract)  # ANS 335

## [1] 335

# PROBLEM 2.2 - PREPARING THE CORPUS  (1 point possible)
# What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?

# ANS Abstracts tend to have many more words than titles: Because titles are so short, a word needs to be very common to appear in 5% of titles. Because abstracts have many more words, a word can be much less common and still appear in 5% of abstracts. While abstracts may have wider vocabulary, this is a secondary effect. As we saw in the previous subsection, all papers have titles, but not all have abstracts.
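
# The 5% rule can be illustrated on a small made-up matrix. This is a sketch of the idea only: dtm_demo is hypothetical, the filter is written by hand with colMeans(), and tm's removeSparseTerms() may treat a term that sits exactly on the 5% boundary differently.

```r
# Hypothetical document-term matrix: 40 documents (rows), 3 terms (columns)
n_docs <- 40
dtm_demo <- cbind(
  common = rep(c(1, 0), length.out = n_docs),  # in 20 of 40 documents (50%)
  medium = c(rep(1, 4), rep(0, n_docs - 4)),   # in 4 of 40 documents (10%)
  rare   = c(1, rep(0, n_docs - 1))            # in 1 of 40 documents (2.5%)
)

# Share of documents in which each term appears at least once
doc_share <- colMeans(dtm_demo > 0)
print(doc_share)

# Keep only the terms that appear in at least 5% of documents
kept <- dtm_demo[, doc_share >= 0.05, drop = FALSE]
print(colnames(kept))
```

# A longer text gives each word more chances to appear at least once, which is why many more abstract terms than title terms clear the same threshold.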

# PROBLEM 2.3 - PREPARING THE CORPUS  (1 point possible)
# What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.

which.max(colSums(dtmAbstract)) # ANS patient (the 212 printed with it is the column position, not the word frequency)

## patient 
##     212

tail(sort(colSums(dtmAbstract))) # Alternative

## chemotherapi        group    treatment       cancer       breast 
##         2344         2668         2894         3726         3859 
##      patient 
##         8381
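
# A small sketch of how colSums() and which.max() fit together, using a hypothetical three-column data frame (dtm_demo is made up and much smaller than dtmAbstract):

```r
# Hypothetical term counts: rows are documents, columns are word stems
dtm_demo <- data.frame(
  breast  = c(2, 0, 1),
  cancer  = c(1, 1, 0),
  patient = c(3, 2, 4)
)

# Total frequency of each stem across all documents
freq <- colSums(dtm_demo)

# which.max() returns a named integer: the name is the stem,
# the value is the column POSITION of the maximum
top <- which.max(freq)
print(names(top))

# Index back into freq to get the actual frequency
print(freq[top])

# Sorting shows the ranking of every stem at once
print(sort(freq))
```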

# PROBLEM 3.1 - BUILDING A MODEL  (1 point possible)
# We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:

colnames(dtmTitle) = paste0("T", colnames(dtmTitle))

colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))

# What was the effect of these functions? ANS Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
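
# A minimal sketch of why the prefixes matter when two data frames are combined column-wise with cbind(). The data frames title_demo and abstract_demo are hypothetical and share one column name on purpose.

```r
# Hypothetical term counts; both data frames have a column called "cancer"
title_demo <- data.frame(cancer = c(1, 0), trial = c(0, 1))
abstract_demo <- data.frame(cancer = c(3, 2), patient = c(5, 1))

# Column names that would clash after combining
shared <- intersect(colnames(title_demo), colnames(abstract_demo))
print(shared)

# paste0() is vectorised, so every column name gets the prefix
colnames(title_demo) <- paste0("T", colnames(title_demo))
colnames(abstract_demo) <- paste0("A", colnames(abstract_demo))

# After prefixing, the combined data frame has unique column names
combined <- cbind(title_demo, abstract_demo)
print(colnames(combined))
```

# Without the prefixes the combined data frame would carry two columns named "cancer", which makes a model formula ambiguous.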

# PROBLEM 3.2 - BUILDING A MODEL  (1 point possible)
# Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:

dtm = cbind(dtmTitle, dtmAbstract)

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3,4,5,6,7,8,9,10, ... ,1852,1853,1854,18 [consecutive row numbers elided; the warning text is cut off at this point in the original output]

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527
,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1

# As we did in class, add the dependent variable "trial" to dtm, copying it from the original data frame called trials. 

dtm$trial = trials$trial

# How many columns are in this combined data frame? ANS 367

# PROBLEM 3.3 - BUILDING A MODEL  (1 point possible)
# Now that we have prepared our data frame, it's time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named "train" and "test", putting 70% of the data in the training set.

set.seed(144)
split = sample.split(dtm$trial, SplitRatio = 0.7)
train = subset(dtm, split==TRUE)
test = subset(dtm, split==FALSE)

# What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)

max(table(train$trial))/nrow(train) # ANS 0.5606759

## [1] 0.5606759

730/nrow(train) # Alternative (divide by the number of rows, not columns)

## [1] 0.5606759

730/(730+572) # Alternative

## [1] 0.5606759
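
# A minimal sketch of the baseline calculation, assuming only the class counts implied by the numbers above (730 observations of class 0 and 572 of class 1 in the training set). The vector "labels" is a hypothetical stand-in for train$trial.

```r
# Hypothetical stand-in for train$trial, rebuilt from the class counts
labels = c(rep(0, 730), rep(1, 572))

# The baseline predicts the most frequent class for every observation
counts = table(labels)
baselineClass = names(counts)[which.max(counts)]
baselineAccuracy = max(counts) / length(labels)

baselineClass
baselineAccuracy
```

# The denominator is the number of observations (rows), which is why nrow() rather than ncol() belongs in the calculation.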

# PROBLEM 3.4 - BUILDING A MODEL  (2 points possible)
# Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don't add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.

trialCART = rpart(trial ~ ., data=train, method="class")
prp(trialCART)

# What is the name of the first variable the model split on? ANS Tphase

# PROBLEM 3.5 - BUILDING A MODEL  (1 point possible)
# Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output). What is the maximum predicted probability for any result?

predict.trialCART.train = predict(trialCART, newdata=train)
max(predict.trialCART.train[,2]) # ANS 0.8718861

## [1] 0.8718861

# Alt method
# predict.trialCART.train = predict(trialCART)[,2]
# summary(predict.trialCART.train)

# PROBLEM 3.6 - BUILDING A MODEL  (1 point possible)
# Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set? 
# ANS The maximum predicted probability will likely be exactly the same in the testing set. Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.
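
# A small sketch of why this holds, using the kyphosis data that ships with rpart as a stand-in for the trial data (the split into toyTrain and toyTest is hypothetical). A CART model can only output the handful of probabilities stored in its leaves, so every test prediction is a value already seen on the training set. The maximum matches whenever at least one test observation lands in the highest-probability leaf.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the trial data here
set.seed(144)
idx = sample(nrow(kyphosis), 60)
toyTrain = kyphosis[idx, ]
toyTest  = kyphosis[-idx, ]

fit = rpart(Kyphosis ~ ., data = toyTrain, method = "class")

trainProbs = predict(fit)[, 2]
testProbs  = predict(fit, newdata = toyTest)[, 2]

# The distinct predicted probabilities: at most one per leaf
sort(unique(trainProbs))

# Every test prediction is one of the training-set leaf probabilities
all(testProbs %in% trainProbs)
```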

# PROBLEM 3.7 - BUILDING A MODEL  (3 points possible)
# For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.

# What is the training set accuracy of the CART model?

table(train$trial, predict.trialCART.train[,2] >= 0.5)

##    
##     FALSE TRUE
##   0   631   99
##   1   131  441

(631+441)/nrow(train) # Accuracy = 0.8233487

## [1] 0.8233487

# What is the training set sensitivity of the CART model?
441/(441+131) # ANS 0.770979

## [1] 0.770979

# What is the training set specificity of the CART model?
631/(631+99) # ANS 0.8643836

## [1] 0.8643836
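
# The three training set metrics can also be computed from one confusion matrix. This is a sketch that re-enters the counts from the table above by hand; the name "cm" is introduced here for illustration only.

```r
# Confusion matrix from the table above: rows are actual, columns are predicted
cm = matrix(c(631, 131, 99, 441), nrow = 2,
            dimnames = list(actual = c("0", "1"),
                            predicted = c("FALSE", "TRUE")))

accuracy    = sum(diag(cm)) / sum(cm)            # (TN + TP) / all
sensitivity = cm["1", "TRUE"] / sum(cm["1", ])   # TP / (TP + FN)
specificity = cm["0", "FALSE"] / sum(cm["0", ])  # TN / (TN + FP)

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```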

# PROBLEM 4.1 - EVALUATING THE MODEL ON THE TESTING SET (2 points possible)
# Evaluate the CART model on the testing set using the predict function and creating a vector of predicted probabilities predTest.

predTest = predict(trialCART, newdata=test)
max(predTest[,2]) # ANS 0.8718861

## [1] 0.8718861

# What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?
table(test$trial, predTest[,2] >= 0.5)

##    
##     FALSE TRUE
##   0   261   52
##   1    83  162

(261+162)/nrow(test) # Accuracy = 0.7580645

## [1] 0.7580645

# Alt 
# predTest = predict(trialCART, newdata=test)[,2]
# table(test$trial, predTest >= 0.5)

# PROBLEM 4.2 - EVALUATING THE MODEL ON THE TESTING SET (2 points possible)
# Using the ROCR package, what is the testing set AUC of the prediction model?

predTestROCR = prediction(predTest[,2], test$trial)

# Plot not needed in this problem
perfROCR = performance(predTestROCR, "tpr", "fpr") 
plot(perfROCR, colorize=TRUE)

# Compute AUC
performance(predTestROCR, "auc")@y.values

## [[1]]
## [1] 0.8371063

# Alt as.numeric(performance(predTestROCR, "auc")@y.values)
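
# For intuition, AUC is the probability that a randomly chosen positive gets a higher predicted probability than a randomly chosen negative. The sketch below computes it with a rank-based (Mann-Whitney) formula on hypothetical toy data. It is not how ROCR is implemented, and the function name aucByRank is made up for this example.

```r
# Hypothetical predicted probabilities and true labels
probs  = c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels = c(1,   1,   0,   1,   0,   0)

# Share of (positive, negative) pairs in which the positive is ranked higher
aucByRank = function(probs, labels) {
  r = rank(probs)                 # ties receive average ranks
  nPos = sum(labels == 1)
  nNeg = sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

aucByRank(probs, labels)
```

# Here 8 of the 9 positive/negative pairs are ordered correctly, so the function returns 8/9.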

# PART 5: DECISION-MAKER TRADEOFFS
# The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:

# 1) For all of the papers retrieved in the PubMed Search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)

# 2) Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study's detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.

# 3) Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.

# PROBLEM 5.1 - DECISION-MAKER TRADEOFFS  (1 point possible)
# What is the cost associated with the model in Step 1 making a false negative prediction?
# ANS A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3. 

# PROBLEM 5.2 - DECISION-MAKER TRADEOFFS  (1 point possible)
# What is the cost associated with the model in Step 1 making a false positive prediction?
# ANS A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3. 

# PROBLEM 5.3 - DECISION-MAKER TRADEOFFS  (1 point possible)
# Given the costs associated with false positives and false negatives, which of the following is most accurate?
# ANS A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model. 
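
# A toy illustration of the tradeoff, using hypothetical probabilities and labels rather than the homework data. Lowering the threshold moves borderline papers into Set A, which trades missed trials (false negatives) for extra manual review (false positives).

```r
# Hypothetical predicted probabilities and true labels (1 = clinical trial)
probs  = c(0.95, 0.80, 0.45, 0.35, 0.60, 0.30, 0.10, 0.05)
labels = c(1,    1,    1,    1,    0,    0,    0,    0)

# Count both kinds of error at a given probability threshold
countErrors = function(threshold) {
  pred = probs >= threshold
  c(falseNegatives = sum(labels == 1 & !pred),
    falsePositives = sum(labels == 0 & pred))
}

countErrors(0.5)
countErrors(0.3)
```

# In this toy data, moving the threshold from 0.5 to 0.3 removes both false negatives at the price of one additional false positive.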

# Unit 5 Part 3: SEPARATING SPAM FROM HAM (PART 1)

# PROBLEM 1.1 - LOADING THE DATASET  (1 point possible)
# Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.


emails = read.csv("emails.csv", stringsAsFactors=FALSE)

# How many emails are in the dataset?
summary(emails)

##      text                spam       
##  Length:5728        Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.2388  
##                     3rd Qu.:0.0000  
##                     Max.   :1.0000

str(emails) # ANS 5728

## 'data.frame':    5728 obs. of  2 variables:
##  $ text: chr  "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special  request additional information now ! click here  click here for a printable version of our o"| __truncated__ ...
##  $ spam: int  1 1 1 1 1 1 1 1 1 1 ...

# Alt nrow(emails)

# PROBLEM 1.2 - LOADING THE DATASET  (1 point possible)
# How many of the emails are spam?
table(emails$spam)

## 
##    0    1 
## 4360 1368

sum(emails$spam==1) # ANS 1368

## [1] 1368

# PROBLEM 1.3 - LOADING THE DATASET  (1 point possible)
# Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.

head(emails$text)

## [1] "Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
## [2] "Subject: the stock trading gunslinger  fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ramble is segovia not group try slung kansas tanzania yes chameleon or continuant clothesman no  libretto is chesapeake but tight not waterway herald and hawthorn like chisel morristown superior is deoxyribonucleic not clockwork try hall incredible mcdougall yes hepburn or einsteinian earmark no  sapling is boar but duane not plain palfrey and inflexible like huzzah pepperoni bedtime is nameable not attire try edt chronography optima yes pirogue or diffusion albeit no "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [3] "Subject: unbelievable new homes made easy  im wanting to show you this  homeowner  you have been pre - approved for a $ 454 , 169 home loan at a 3 . 72 fixed rate .  this offer is being extended to you unconditionally and your credit is in no way a factor .  to take advantage of this limited time opportunity  all we ask is that you visit our website and complete  the 1 minute post approval form  look foward to hearing from you ,  dorcas pittman"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [4] "Subject: 4 color printing special  request additional information now ! click here  click here for a printable version of our order form ( pdf format )  phone : ( 626 ) 338 - 8090 fax : ( 626 ) 338 - 8102 e - mail : ramsey @ goldengraphix . com  request additional information now ! click here  click here for a printable version of our order form ( pdf format )  golden graphix & printing 5110 azusa canyon rd . irwindale , ca 91706 this e - mail message is an advertisement and / or solicitation . "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [5] "Subject: do not have money , get software cds from here !  software compatibility . . . . ain ' t it great ?  grow old along with me the best is yet to be .  all tradgedies are finish ' d by death . all comedies are ended by marriage ."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [6] "Subject: great nnews  hello , welcome to medzonline sh groundsel op  we are pleased to introduce ourselves as one of the ieading online phar felicitation maceuticai shops .  helter v  shakedown r  a cosmopolitan l  l blister l  l bestow ag  ac tosher l  is coadjutor va  confidant um  andmanyother .  - sav inexpiable e over 75 %  - total confide leisure ntiaiity  - worldwide s polite hlpplng  - ov allusion er 5 miilion customers in 150 countries  have devitalize a nice day !"

# PROBLEM 1.4 - LOADING THE DATASET  (1 point possible)
# Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?

# ANS Yes -- the number of times the word appears might help us differentiate spam from ham.

# PROBLEM 1.5 - LOADING THE DATASET  (1 point possible)
# The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?
nchar(emails$text[which.max(nchar(emails$text))]) # ANS 43952

## [1] 43952

# Alt max(nchar(emails$text))

# PROBLEM 1.6 - LOADING THE DATASET  (1 point possible)
# Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)

which.min(nchar(emails$text)) # ANS 1992

## [1] 1992

# ALT which(nchar(emails$text) == min(nchar(emails$text)) )

# PROBLEM 2.1 - PREPARING THE CORPUS  (2 points possible)
# Follow the standard steps to build and pre-process the corpus:

# 1) Build a new corpus variable called corpus.

corpus = Corpus(VectorSource(emails$text))

# 2) Using tm_map, convert the text to lowercase.

corpus = tm_map(corpus, tolower)

# After performing this step, remember to run the following line:
corpus = tm_map(corpus, PlainTextDocument)

# 3) Using tm_map, remove all punctuation from the corpus.

corpus = tm_map(corpus, removePunctuation)

# 4) Using tm_map, remove all English stopwords from the corpus.

corpus = tm_map(corpus, removeWords, stopwords("english"))

# 5) Using tm_map, stem the words in the corpus.

corpus = tm_map(corpus, stemDocument)

# 6) Build a document term matrix from the corpus, called dtm.

dtm = DocumentTermMatrix(corpus)
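
# To make the end result concrete, here is a base R sketch of what a document term matrix holds: one row per document, one column per term, and counts in the cells. It uses three hypothetical documents, skips stop word removal and stemming, and is not how the tm package builds its (sparse) matrix.

```r
# Three hypothetical documents
docs = c("Subject: Buy now, buy cheap!",
         "Subject: meeting notes",
         "Buy the notes.")

# Lowercase, strip punctuation, split on whitespace
clean  = gsub("[[:punct:]]", "", tolower(docs))
tokens = strsplit(clean, "\\s+")

# One row per document, one column per distinct term
vocab  = sort(unique(unlist(tokens)))
toyDtm = t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

toyDtm
```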

# If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
length(stopwords("english")) # checks out

## [1] 174

# How many terms are in dtm?

dtm # 28687 terms

## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity           : 100%
## Maximal term length: 24
## Weighting          : term frequency (tf)

ncol(dtm) # Alternative

## [1] 28687

# PROBLEM 2.2 - PREPARING THE CORPUS  (1 point possible)
# To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don't overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?

spdtm = removeSparseTerms(dtm, 0.95)
spdtm # ANS 330

## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity           : 89%
## Maximal term length: 10
## Weighting          : term frequency (tf)
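
# A base R sketch of the idea behind the 0.95 threshold, on a hypothetical count matrix: a term is kept when the share of documents containing it is high enough. The toy example uses a 50% cutoff so the effect is visible with four documents. How tm's removeSparseTerms treats terms exactly at the boundary is not checked here.

```r
# Hypothetical count matrix: 4 documents (rows) by 3 terms (columns)
counts = matrix(c(1, 0, 2, 1,
                  0, 0, 0, 3,
                  1, 1, 0, 1),
                nrow = 4,
                dimnames = list(NULL, c("common", "rare", "frequent")))

# Share of documents in which each term appears at least once
docShare = colMeans(counts > 0)

# Keep terms used in at least half of the documents (5% in the homework)
kept = counts[, docShare >= 0.5, drop = FALSE]

docShare
colnames(kept)
```

# Note that "rare" is dropped even though its total count is 3: the filter looks at how many documents use a term, not at how often it occurs overall.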

# PROBLEM 2.3 - PREPARING THE CORPUS  (2 points possible)
# Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid.

# Convert to a data frame
emailsSparse = as.data.frame(as.matrix(spdtm))


# Make all variable names R-friendly
colnames(emailsSparse) = make.names(colnames(emailsSparse))
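
# A quick look at what make.names does, using a few sample stems. Names that start with a digit get an "X" prefix, and reserved words such as "next" get a trailing dot, which is why columns like X000 and next. appear in the output.

```r
# Sample stems as they might come out of the document term matrix
rawNames = c("000", "2001", "abl", "next")

fixedNames = make.names(rawNames)
fixedNames
```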

str(emailsSparse)

## 'data.frame':    5728 obs. of  330 variables:
##  $ X000      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ X2000     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2001     : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ X713      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X853      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ access    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ account   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ addit     : num  0 0 0 2 0 0 1 0 0 0 ...
##  $ address   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ allow     : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ alreadi   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ also      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ analysi   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anoth     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ applic    : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ appreci   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ approv    : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ april     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ area      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ arrang    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ask       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ assist    : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ associ    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attach    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ attend    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ avail     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ back      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ base      : num  0 0 0 0 0 0 3 0 2 0 ...
##  $ begin     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ believ    : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ best      : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ better    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ book      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bring     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ busi      : num  2 0 0 0 0 0 3 0 2 0 ...
##  $ buy       : num  0 0 0 0 0 0 1 1 0 1 ...
##  $ call      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ can       : num  0 0 0 0 0 0 11 1 0 1 ...
##  $ case      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ chang     : num  2 0 0 0 0 0 1 0 0 0 ...
##  $ check     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ click     : num  0 0 0 4 0 0 0 0 0 0 ...
##  $ com       : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ come      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ comment   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ communic  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ compani   : num  3 0 0 0 0 0 16 0 0 0 ...
##  $ complet   : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ confer    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ confirm   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ contact   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ continu   : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ contract  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ copi      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ corp      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ corpor    : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ cost      : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ cours     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ creat     : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ credit    : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ crenshaw  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ current   : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ custom    : num  0 0 0 0 0 1 4 0 0 0 ...
##  $ data      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ date      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ day       : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ deal      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dear      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ depart    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ deriv     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ design    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ detail    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ develop   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ differ    : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ direct    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ director  : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ discuss   : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ doc       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ don       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ done      : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ due       : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ ect       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ edu       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ effect    : num  2 0 0 0 0 0 0 1 0 1 ...
##  $ effort    : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ either    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ email     : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ end       : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ energi    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ engin     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ enron     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ etc       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ even      : num  1 0 0 0 0 0 1 1 0 1 ...
##  $ event     : num  0 0 0 0 0 0 4 0 0 0 ...
##  $ expect    : num  0 0 0 0 0 0 4 0 0 0 ...
##  $ experi    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fax       : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ feel      : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

# colSums() is an R function that returns the sum of values for each variable in our data frame. Our data frame contains the number of times each word stem (columns) appeared in each email (rows). Therefore, colSums(emailsSparse) returns the number of times a word stem appeared across all the emails in the dataset. What is the word stem that shows up most frequently across all the emails in the dataset? Hint: think about how you can use sort() or which.max() to pick out the maximum frequency.
which.max(sapply(emailsSparse, sum)) # ANS enron (the 92 below is its column index, not its frequency)

## enron 
##    92
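
# A toy sketch of how to read which.max output, using a hypothetical three-column data frame. which.max returns the position of the largest value, labelled with its name; the frequency itself comes from max.

```r
# Hypothetical word counts: 3 emails (rows) by 3 stems (columns)
toy = data.frame(enron = c(3, 0, 5), vinc = c(1, 1, 0), hou = c(0, 2, 2))

totals = colSums(toy)

which.max(totals)          # named index: position of the largest total
names(which.max(totals))   # just the column name
max(totals)                # the frequency itself
```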

sort(colSums(emailsSparse)) # Alt #1

##     vkamin      begin     either       done      sorri        lot 
##        301        317        318        337        343        348 
##    mention    thought      bring       idea     better     immedi 
##        355        367        374        378        383        385 
##    without       mean      write      happi      repli       life 
##        389        390        390        396        397        400 
##     experi     involv     specif     arrang      creat       read 
##        405        405        407        410        413        413 
##       wish       open     realli       link        say    respond 
##        414        416        417        421        423        430 
##      sever       keep        etc      anoth        run       info 
##        430        431        434        435        437        438 
##     togeth      short     sincer        buy        due    alreadi 
##        438        439        441        442        445        446 
##       line      allow     recent    special      given     believ 
##        448        450        451        451        453        456 
##     design        put      remov       X853  wednesday       type 
##        457        458        460        462        464        466 
##     public       full       hear       join     effect     effort 
##        468        469        469        469        471        473 
##    tuesday     robert      locat      check       area      final 
##        474        482        485        488        489        490 
##    increas       soon    analysi       sure       deal     return 
##        491        492        495        495        498        509 
##      place      onlin    success       sinc understand      still 
##        516        518        519        521        521        523 
##     import    comment    confirm      hello       long      thing 
##        530        531        532        534        534        535 
##      point    appreci       feel      howev     member       hour 
##        536        541        543        545        545        548 
##        net    continu      event     expect    suggest       unit 
##        548        552        552        554        554        554 
##    resourc       case    version     corpor     applic      engin 
##        556        561        564        565        567        571 
##       part     attend   thursday      might       morn        abl 
##        571        573        575        577        586        590 
##     assist     differ     intern      updat       move       mark 
##        598        598        606        606        612        613 
##     depart       even       made   internet       high      cours 
##        621        622        622        623        624        626 
##   contract     gibner        end      right        per      invit 
##        629        633        635        639        642        647 
##     approv       real     monday     result     school      kevin 
##        648        648        649        655        655        656 
##     direct       home     detail        tri       form    problem 
##        657        660        661        661        664        666 
##        web        doc      deriv        don      april       note 
##        668        675        676        676        682        688 
##      relat     websit       juli   director    complet       rate 
##        694        700        701        705        707        717 
##       valu      futur    student        set     within     requir 
##        721        722        726        727        732        736 
##    softwar       book       mani     person      click       file 
##        739        756        758        767        769        770 
##      addit      money     associ   particip       term     access 
##        774        776        777        782        786        789 
##     custom    possibl       copi       oper       cost    respons 
##        796        796        797        820        821        824 
##      today    account       base      great       dear     london 
##        828        829        837        837        838        843 
##     friday    support      secur       hope       much       back 
##        854        854        857        858        861        864 
##        way       find     invest        ask      start      shall 
##        864        867        867        871        880        884 
##     origin       come       plan    financi        two       site 
##        892        903        904        909        911        913 
##   opportun       team      first      resum       issu       data 
##        918        926        929        933        944        955 
##      month      peopl     credit   industri    process     review 
##        958        958        960        970        975        976 
##       talk       last      phone       X000      chang        fax 
##        981        998       1001       1007       1035       1038 
##       john    current    stinson       give    univers      offic 
##       1042       1044       1051       1055       1059       1068 
##        gas    schedul     financ      state       name       X713 
##       1070       1071       1073       1086       1089       1097 
##       good      posit   crenshaw     system       well       sent 
##       1097       1104       1115       1118       1125       1126 
##      visit       free      next.      avail   question    address 
##       1126       1141       1145       1152       1152       1154 
##      offer     attach     number       date    product      order 
##       1171       1176       1182       1187       1197       1210 
##      think     includ     report       best     confer        now 
##       1216       1238       1279       1291       1297       1300 
##        www    discuss  interview     servic   communic    request 
##       1323       1326       1333       1337       1343       1344 
##       just       take      trade       send     provid       list 
##       1354       1361       1366       1379       1405       1410 
##       help    program     option       want    project    contact 
##       1430       1438       1488       1488       1522       1543 
##    present     follow     receiv        see    houston       http 
##       1543       1552       1557       1567       1582       1609 
##        edu       call    shirley       corp       week   interest 
##       1627       1687       1689       1692       1758       1814 
##        day       also    develop       make       year        let 
##       1860       1864       1882       1884       1890       1963 
##     messag       look     regard      email        one      power 
##       1983       2003       2045       2066       2108       2117 
##     energi      model       risk       mail        new    compani 
##       2179       2199       2267       2269       2281       2290 
##       busi       need        use       like        get        may 
##       2313       2328       2330       2352       2462       2465 
##      manag      group       know       meet      price     inform 
##       2600       2604       2614       2623       2694       2701 
##       work     market   research      X2001       time    forward 
##       2708       2750       2820       3089       3145       3161 
##      thank        can   kaminski      X2000      pleas        com 
##       3730       4257       4801       4967       5113       5443 
##        hou       will       vinc    subject        ect      enron 
##       5577       8252       8532      10202      11427      13388

which.max(colSums(emailsSparse)) # Alt #2

## enron 
##    92

# PROBLEM 2.4 - PREPARING THE CORPUS  (1 point possible)
# Add a variable called "spam" to emailsSparse containing the email spam labels. You can do this by copying over the "spam" variable from the original data frame (remember how we did this in the Twitter lecture).
which(colnames(emailsSparse)=="spam")

## integer(0)

emailsSparse$spam = emails$spam

# How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.

ham.mails = subset(emailsSparse,emailsSparse$spam==0)
sum(colSums(ham.mails)>=5000) # ANS 6

## [1] 6

sort(colSums(subset(emailsSparse, spam == 0))) # Alternative

##       spam       life      remov      money      onlin    without 
##          0         80        103        114        173        191 
##     websit      click    special       wish      repli        buy 
##        194        217        226        229        239        243 
##        net       link     immedi       done       mean     design 
##        243        247        249        254        259        261 
##        lot     effect       info     either       read      write 
##        268        270        273        279        279        286 
##       line      begin      sorri    success     involv      creat 
##        289        291        293        293        294        299 
##    softwar     better     vkamin        say       keep      bring 
##        299        301        301        305        306        311 
##     believ       full    increas     realli    mention    thought 
##        313        317        320        324        325        325 
##       idea     invest      secur     specif      sever     experi 
##        327        327        337        338        340        346 
##      thing      allow      check        due       type      happi 
##        347        348        351        351        352        354 
##     return     expect      short     effort       open   internet 
##        355        356        357        358        360        361 
##     sincer     public     recent      anoth    alreadi       home 
##        361        364        368        369        372        375 
##       made    respond      given        etc        put     within 
##        380        382        383        385        385        386 
##      place      right    version      hello       sure       area 
##        388        390        390        395        396        397 
##        run     arrang    account       join       hour      locat 
##        398        399        401        403        404        406 
##     togeth      engin     import        per     corpor       high 
##        406        411        411        412        414        416 
##     result       hear      final       deal     applic       even 
##        418        420        422        423        428        429 
##        web     custom       soon       long       sinc      futur 
##        430        433        435        436        439        440 
##     member       X000      event        don       part       feel 
##        446        447        447        450        450        453 
##    tuesday  wednesday      still       unit       site       X853 
##        454        456        457        457        458        461 
##    continu understand    resourc     robert    analysi       form 
##        464        464        466        466        468        468 
##      point     assist    confirm     differ     intern      might 
##        474        475        485        489        489        490 
##       real       case      howev    comment        abl    complet 
##        490        492        496        505        515        515 
##       rate    appreci        tri       move      updat     approv 
##        516        518        521        526        527        533 
##    suggest       free   contract     detail       morn        end 
##        533        535        544        546        546        550 
##       mani     attend   thursday     direct     requir      cours 
##        550        558        558        561        562        567 
##     person      relat     depart      today      start        way 
##        569        573        575        577        580        586 
##       mark       valu    problem      peopl       note     school 
##        588        590        593        599        600        607 
##      invit     access       term       juli     monday     gibner 
##        614        617        625        630        630        633 
##       base   director      offer       cost      addit      kevin 
##        635        640        643        646        648        654 
##      great        set       file       find       much       oper 
##        655        658        659        665        669        669 
##      order      deriv        doc      april       book    address 
##        669        673        673        677        680        693 
##       copi    financi      month    student    respons    possibl 
##        700        702        709        710        711        712 
##     associ   particip        now      first   industri       dear 
##        715        717        725        726        731        734 
##    support       plan       back       name       come   opportun 
##        734        738        739        745        748        760 
##     report    product        two     origin        ask     credit 
##        772        776        787        796        797        798 
##      state     system    process       hope     london       just 
##        806        816        826        828        828        830 
##     receiv      chang     review    current      shall     friday 
##        830        831        834        841        844        847 
##       team      phone       issu       data      avail       last 
##        850        858        865        868        872        874 
##       good       give        www        gas       list      posit 
##        876        883        897        905        907        917 
##      visit     includ      resum       best      offic     servic 
##        920        924        928        933        935        942 
##       talk     number       well        fax     provid       sent 
##        943        951        961        963        970        971 
##      next.       send       http       john    univers     financ 
##        975        986       1009       1022       1025       1038 
##    stinson    schedul       take       date       want   question 
##       1051       1054       1057       1060       1068       1069 
##    program      think       X713   crenshaw     attach      trade 
##       1080       1084       1097       1115       1155       1167 
##       help      email    compani    request        see   communic 
##       1168       1201       1225       1227       1238       1251 
##     confer    discuss       make    contact     follow  interview 
##       1264       1270       1281       1301       1308       1320 
##    project       mail    present       busi   interest     option 
##       1328       1352       1397       1416       1429       1432 
##        day       call        one       year       week     messag 
##       1440       1497       1516       1523       1527       1538 
##    houston       also       look        edu       corp    shirley 
##       1577       1604       1607       1620       1643       1687 
##    develop        get        new        use        let     regard 
##       1691       1768       1777       1784       1856       1859 
##     inform       need      power        may       like       risk 
##       1883       1890       1972       1976       1980       2097 
##     energi     market      model      price       work      manag 
##       2124       2150       2170       2191       2293       2334 
##       know      group       meet       time   research    forward 
##       2345       2474       2544       2552       2752       2952 
##      X2001        can      thank        com      pleas   kaminski 
##       3060       3426       3558       4444       4494       4801 
##      X2000        hou       will       vinc    subject        ect 
##       4935       5569       6802       8531       8625      11417 
##      enron 
##      13388

# PROBLEM 2.5 - PREPARING THE CORPUS  (1 point possible)
# How many word stems appear at least 1000 times in the spam emails in the dataset?

spam.mails = subset(emailsSparse, spam==1)
tail(sort(colSums(spam.mails) >= 1000)) # ANS 3 (four columns are TRUE, but "spam" is the dependent variable and is not counted)

##     www    year compani subject    will    spam 
##   FALSE   FALSE    TRUE    TRUE    TRUE    TRUE

tail(sort(colSums(spam.mails)))

##    mail     com compani    spam    will subject 
##     917     999    1065    1368    1450    1577

sort(colSums(subset(emailsSparse, spam == 1))) # Alternative

##       X713   crenshaw      enron     gibner   kaminski    stinson 
##          0          0          0          0          0          0 
##     vkamin       X853       vinc        doc      kevin    shirley 
##          0          1          1          2          2          2 
##      deriv      april    houston      resum        edu     friday 
##          3          5          5          5          7          7 
##        hou  wednesday        ect     arrang  interview     attend 
##          8          8         10         11         13         15 
##     london     robert    student    schedul   thursday     monday 
##         15         16         16         17         17         19 
##       john    tuesday     attach    suggest    appreci       mark 
##         20         20         21         21         23         25 
##      begin    comment    analysi      X2001      model       hope 
##         26         26         27         29         29         30 
##    mention      X2000     togeth     confer      invit    univers 
##         30         32         32         33         33         34 
##     financ       talk     either        run       morn      shall 
##         35         38         39         39         40         40 
##      happi    thought     depart    confirm    respond     school 
##         42         42         46         47         48         48 
##       corp        etc       hear      howev      sorri       idea 
##         49         49         49         49         50         51 
##     energi    discuss       open     option       soon understand 
##         55         56         56         56         57         57 
##      cours     experi     associ      point      bring   director 
##         59         59         62         62         63         65 
##   particip      anoth       join      still      final   research 
##         65         66         66         66         68         68 
##       case        set     specif      given       juli    problem 
##         69         69         69         70         71         73 
##        put    alreadi        ask        abl       deal        fax 
##         73         74         74         75         75         75 
##       book       team       issu      locat       meet      updat 
##         76         76         79         79         79         79 
##        lot     sincer     better      short       sinc       done 
##         80         80         82         82         82         83 
##   question     recent    possibl   contract        end       move 
##         83         83         84         85         85         86 
##       data      might    continu       note       feel    resourc 
##         87         87         88         88         90         90 
##      sever       area   communic     realli        due     direct 
##         90         92         92         93         94         96 
##     origin       copi       unit       long     member       sure 
##         96         97         97         98         99         99 
##      allow       dear     public      write      event        let 
##        102        104        104        104        105        107 
##     differ       file     involv    respons      creat       type 
##        109        111        111        113        114        114 
##     approv     detail     effort     intern    request        say 
##        115        115        115        117        117        118 
##     import    support       part      relat     assist       last 
##        119        120        121        121        123        124 
##        two       back       keep      addit       date      place 
##        124        125        125        126        127        128 
##      group       mean       valu      think      offic       read 
##        130        131        131        132        133        134 
##     immedi      check     applic      hello        tri     review 
##        136        137        139        139        140        142 
##     believ      phone       hour      power    present    process 
##        143        143        144        145        146        149 
##     corpor       oper       full     return       come       sent 
##        151        151        152        154        155        155 
##   opportun       real      repli       line      engin       term 
##        158        158        158        159        160        161 
##     credit       well        gas       info       plan      next. 
##        162        164        165        165        166        170 
##       risk    increas     access       give      thank       link 
##        170        171        172        172        172        174 
##     requir    version       cost      great       wish     regard 
##        174        174        175        182        185        186 
##      posit      thing       call    develop    complet       much 
##        187        188        190        191        192        192 
##       even    project     design       form     expect     person 
##        193        194        196        196        198        198 
##    without        buy      trade     effect       rate       base 
##        198        199        199        201        201        202 
##       find    current      first      chang      visit    financi 
##        202        203        203        204        206        207 
##       high       mani    forward       good    special        don 
##        208        208        209        221        225        226 
##    success        per     number       week     result        web 
##        226        230        231        231        237        238 
##   industri    contact       made     follow      month      right 
##        239        242        242        244        249        249 
##      today       also       help   internet      manag       know 
##        251        260        262        262        266        269 
##        way      avail      state      futur       home      start 
##        278        280        280        282        285        300 
##     system       take        net     includ       life        see 
##        302        304        305        314        320        329 
##       name      onlin     within      remov       best    program 
##        344        345        346        357        358        358 
##      peopl     custom       year       like   interest       send 
##        359        363        367        372        385        393 
##     servic       look       work        day       want    product 
##        395        396        415        420        420        421 
##        www    account     provid       need    softwar     messag 
##        426        428        435        438        440        445 
##       site    address        may       list      price        new 
##        455        461        489        503        503        504 
##     websit     report      secur       just      offer     invest 
##        506        507        520        524        528        540 
##      order        use      click       X000        now        one 
##        541        546        552        560        575        592 
##       time       http     market       make       free      pleas 
##        593        600        600        603        606        619 
##      money        get     receiv     inform        can      email 
##        662        694        727        818        831        865 
##       busi       mail        com    compani       spam       will 
##        897        917        999       1065       1368       1450 
##    subject 
##       1577
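
# The hint about not counting the dependent variable can also be handled in code by dropping the "spam" column before counting. The following is only a sketch on a hypothetical toy data frame (the column names, values, and the threshold of 2 are made up for illustration and are not the email data).

```r
# Hypothetical toy data: three word-stem columns plus the "spam" label column
toy = data.frame(offer = c(3, 0, 4),
                 meet  = c(0, 5, 0),
                 free  = c(2, 0, 2),
                 spam  = c(1, 0, 1))

toy.spam = subset(toy, spam == 1)

# Counting over all columns also counts the label column itself
with.label = sum(colSums(toy.spam) >= 2)

# Dropping the label column first counts word stems only
stem.totals = colSums(toy.spam[, setdiff(colnames(toy.spam), "spam")])
n.frequent = sum(stem.totals >= 2)

with.label
n.frequent
```

# On the toy data the two counts differ by one, which is the same adjustment made by hand in the answer above.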

# PROBLEM 2.6 - PREPARING THE CORPUS  (1 point possible)
# The lists of most common words are significantly different between the spam and ham emails. What does this likely imply?

# ANS The frequencies of these most common words are likely to help differentiate between spam and ham.

# PROBLEM 2.7 - PREPARING THE CORPUS  (1 point possible)
# Several of the most common word stems from the ham documents, such as "enron", "hou" (short for Houston), "vinc" (the word stem of "Vince") and "kaminski", are likely specific to Vincent Kaminski's inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?

# ANS The models we build are personalized, and would need to be further tested before being used as a spam filter for another person. 

# PROBLEM 3.1 - BUILDING MACHINE LEARNING MODELS (3 points possible)
# First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".

emailsSparse$spam = as.factor(emailsSparse$spam)

# Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called "train" and a testing set called "test". Make sure to perform this step on emailsSparse instead of emails.

set.seed(123)

split = sample.split(emailsSparse$spam, SplitRatio = 0.7)

train = subset(emailsSparse, split==TRUE)
test = subset(emailsSparse, split==FALSE)
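
# sample.split (from the caTools package) splits while preserving the relative proportion of each label. The base R sketch below illustrates that idea on hypothetical labels. It is not the caTools implementation, the helper name stratified.split is made up, and it will not select the same rows as sample.split.

```r
set.seed(123)

# Hypothetical labels: 70 ham (0) and 30 spam (1)
y = factor(rep(c(0, 1), times = c(70, 30)))

# Stratified split: sample the requested share of row indices within each class
stratified.split = function(labels, ratio) {
  in.train = rep(FALSE, length(labels))
  for (lev in levels(labels)) {
    idx = which(labels == lev)
    chosen = idx[sample.int(length(idx), round(ratio * length(idx)))]
    in.train[chosen] = TRUE
  }
  in.train
}

split.toy = stratified.split(y, 0.7)

# Rows are the labels, columns show whether the observation goes to training
table(y, split.toy)
```

# Because the sampling is done per class, the training and testing sets keep the same spam/ham ratio as the full data.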

# Using the training set, train the following three machine learning models. The models should predict the dependent variable "spam", using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.

# 1) A logistic regression model called spamLog. You may see a warning message here - we'll discuss this more later.

spamLog = glm(spam ~ ., data=train, family="binomial")

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# 2) A CART model called spamCART, using the default parameters to train the model (don't worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.

spamCART = rpart(spam ~ ., data=train, method="class")

# 3) A random forest model called spamRF, using the default parameters to train the model (don't worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we've already done this earlier in the problem, it's important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).

set.seed(123)
spamRF = randomForest(spam ~ ., data=train)

# For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.

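# Note: predict() has no "data" argument (new data is passed as "newdata"). With no newdata, glm and rpart return predictions for the training observations, and randomForest returns out-of-bag predictions for them.
predict.spamLog = predict(spamLog, type="response")
predict.spamCART = predict(spamCART)[,2]
predict.spamRF = predict(spamRF, type="prob")[,2]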
hist(predict.spamLog)
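
# A point worth checking when calling predict(): the observations to score are passed through "newdata". An argument named "data" is not recognised and is silently ignored. The sketch below shows this on the built-in mtcars data with a small logistic regression (a stand-in chosen for illustration, not the email models).

```r
# Small logistic regression on built-in data (not the email data)
fit = glm(am ~ wt, data = mtcars, family = "binomial")
other = mtcars[1:5, ]

# "data" is not an argument of predict(), so it is ignored and the
# fitted values for all training rows are returned
p.ignored = predict(fit, data = other, type = "response")

# "newdata" is the argument that selects the rows to score
p.new = predict(fit, newdata = other, type = "response")

length(p.ignored)
length(p.new)
```

# For glm and rpart, omitting newdata gives training-set predictions. For randomForest, omitting newdata gives out-of-bag predictions, which are generally less optimistic than scoring the training rows through newdata.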

# You may have noticed that training the logistic regression model yielded the messages "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let's investigate the predicted probabilities from the logistic regression model.
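
# These warnings can be reproduced on a tiny example. The sketch below uses hypothetical, perfectly separated data (y is 1 exactly when x > 5), so a logistic regression can fit the training observations perfectly. It only illustrates the symptom and is not a diagnosis of spamLog itself.

```r
# Hypothetical perfectly separated data
x = 1:10
y = as.numeric(x > 5)

# Collect the warnings raised by glm() instead of printing them
msgs = character(0)
fit.sep = withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs
range(fitted(fit.sep))
```

# With separated data the coefficient estimates keep growing from one iteration to the next, which pushes the fitted probabilities toward 0 and 1.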

# How many of the training set predicted probabilities from spamLog are less than 0.00001?

sum(predict.spamLog < 0.00001) # ANS 3046

## [1] 3046

table(predict.spamLog < 0.00001) # Alternative

## 
## FALSE  TRUE 
##   964  3046

# How many of the training set predicted probabilities from spamLog are more than 0.99999?

sum(predict.spamLog > 0.99999) # ANS 954

## [1] 954

table(predict.spamLog > 0.99999) # Alternative

## 
## FALSE  TRUE 
##  3056   954

# How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?
sum((predict.spamLog <= 0.99999) & (predict.spamLog >= 0.00001)) # ANS 10

## [1] 10

table(predict.spamLog <= 0.99999 & predict.spamLog >= 0.00001) # Alternative

## 
## FALSE  TRUE 
##  4000    10

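# Note: & works element by element on vectors, whereas && is meant for single TRUE/FALSE values (older R versions used only the first element of a vector; R 4.3.0 and later raise an error).
length(predict.spamLog) # Note: 4010 elements in total (3046 + 954 + 10)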

## [1] 4010
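
# The three counts above partition the predictions, so they should add up to the length of the vector. A minimal sketch of the same check on hypothetical probabilities (made-up values, not the spamLog output):

```r
# Hypothetical predicted probabilities
p = c(0, 1e-7, 0.3, 0.5, 0.999999, 1)

n.low  = sum(p < 0.00001)
n.high = sum(p > 0.99999)
# & compares element by element, giving one TRUE/FALSE per probability
n.mid  = sum(p >= 0.00001 & p <= 0.99999)

c(low = n.low, mid = n.mid, high = n.high)

# The three buckets cover every element exactly once
n.low + n.mid + n.high == length(p)
```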

# PROBLEM 3.2 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?

summary(spamLog) # ANS 0: no variable is labeled significant at p=0.05, a symptom of the logistic regression not converging

## 
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.011   0.000   0.000   0.000   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01  1.055e+04  -0.003    0.998
## X000         1.474e+01  1.058e+04   0.001    0.999
## X2000       -3.631e+01  1.556e+04  -0.002    0.998
## X2001       -3.215e+01  1.318e+04  -0.002    0.998
## X713        -2.427e+01  2.914e+04  -0.001    0.999
## X853        -1.212e+00  5.942e+04   0.000    1.000
## abl         -2.049e+00  2.088e+04   0.000    1.000
## access      -1.480e+01  1.335e+04  -0.001    0.999
## account      2.488e+01  8.165e+03   0.003    0.998
## addit        1.463e+00  2.703e+04   0.000    1.000
## address     -4.613e+00  1.113e+04   0.000    1.000
## allow        1.899e+01  6.436e+03   0.003    0.998
## alreadi     -2.407e+01  3.319e+04  -0.001    0.999
## also         2.990e+01  1.378e+04   0.002    0.998
## analysi     -2.405e+01  3.860e+04  -0.001    1.000
## anoth       -8.744e+00  2.032e+04   0.000    1.000
## applic      -2.649e+00  1.674e+04   0.000    1.000
## appreci     -2.145e+01  2.762e+04  -0.001    0.999
## approv      -1.302e+00  1.589e+04   0.000    1.000
## april       -2.620e+01  2.208e+04  -0.001    0.999
## area         2.041e+01  2.266e+04   0.001    0.999
## arrang       1.069e+01  2.135e+04   0.001    1.000
## ask         -7.746e+00  1.976e+04   0.000    1.000
## assist      -1.128e+01  2.490e+04   0.000    1.000
## associ       9.049e+00  1.909e+04   0.000    1.000
## attach      -1.037e+01  1.534e+04  -0.001    0.999
## attend      -3.451e+01  3.257e+04  -0.001    0.999
## avail        8.651e+00  1.709e+04   0.001    1.000
## back        -1.323e+01  2.272e+04  -0.001    1.000
## base        -1.354e+01  2.122e+04  -0.001    0.999
## begin        2.228e+01  2.973e+04   0.001    0.999
## believ       3.233e+01  2.136e+04   0.002    0.999
## best        -8.201e+00  1.333e+03  -0.006    0.995
## better       4.263e+01  2.360e+04   0.002    0.999
## book         4.301e+00  2.024e+04   0.000    1.000
## bring        1.607e+01  6.767e+04   0.000    1.000
## busi        -4.803e+00  1.000e+04   0.000    1.000
## buy          4.170e+01  3.892e+04   0.001    0.999
## call        -1.145e+00  1.111e+04   0.000    1.000
## can          3.762e+00  7.674e+03   0.000    1.000
## case        -3.372e+01  2.880e+04  -0.001    0.999
## chang       -2.717e+01  2.215e+04  -0.001    0.999
## check        1.425e+00  1.963e+04   0.000    1.000
## click        1.376e+01  7.077e+03   0.002    0.998
## com          1.936e+00  4.039e+03   0.000    1.000
## come        -1.166e+00  1.511e+04   0.000    1.000
## comment     -3.251e+00  3.387e+04   0.000    1.000
## communic     1.580e+01  8.958e+03   0.002    0.999
## compani      4.781e+00  9.186e+03   0.001    1.000
## complet     -1.363e+01  2.024e+04  -0.001    0.999
## confer      -7.503e-01  8.557e+03   0.000    1.000
## confirm     -1.300e+01  1.514e+04  -0.001    0.999
## contact      1.530e+00  1.262e+04   0.000    1.000
## continu      1.487e+01  1.535e+04   0.001    0.999
## contract    -1.295e+01  1.498e+04  -0.001    0.999
## copi        -4.274e+01  3.070e+04  -0.001    0.999
## corp         1.606e+01  2.708e+04   0.001    1.000
## corpor      -8.286e-01  2.818e+04   0.000    1.000
## cost        -1.938e+00  1.833e+04   0.000    1.000
## cours        1.665e+01  1.834e+04   0.001    0.999
## creat        1.338e+01  3.946e+04   0.000    1.000
## credit       2.617e+01  1.314e+04   0.002    0.998
## crenshaw     9.994e+01  6.769e+04   0.001    0.999
## current      3.629e+00  1.707e+04   0.000    1.000
## custom       1.829e+01  1.008e+04   0.002    0.999
## data        -2.609e+01  2.271e+04  -0.001    0.999
## date        -2.786e+00  1.699e+04   0.000    1.000
## day         -6.100e+00  5.866e+03  -0.001    0.999
## deal        -1.129e+01  1.448e+04  -0.001    0.999
## dear        -2.313e+00  2.306e+04   0.000    1.000
## depart      -4.068e+01  2.509e+04  -0.002    0.999
## deriv       -4.971e+01  3.587e+04  -0.001    0.999
## design      -7.923e+00  2.939e+04   0.000    1.000
## detail       1.197e+01  2.301e+04   0.001    1.000
## develop      5.976e+00  9.455e+03   0.001    0.999
## differ      -2.293e+00  1.075e+04   0.000    1.000
## direct      -2.051e+01  3.194e+04  -0.001    0.999
## director    -1.770e+01  1.793e+04  -0.001    0.999
## discuss     -1.051e+01  1.915e+04  -0.001    1.000
## doc         -2.597e+01  2.603e+04  -0.001    0.999
## don          2.129e+01  1.456e+04   0.001    0.999
## done         6.828e+00  1.882e+04   0.000    1.000
## due         -4.163e+00  3.532e+04   0.000    1.000
## ect          8.685e-01  5.342e+03   0.000    1.000
## edu         -2.122e-01  6.917e+02   0.000    1.000
## effect       1.948e+01  2.100e+04   0.001    0.999
## effort       1.606e+01  5.670e+04   0.000    1.000
## either      -2.744e+01  4.000e+04  -0.001    0.999
## email        3.833e+00  1.186e+04   0.000    1.000
## end         -1.311e+01  2.938e+04   0.000    1.000
## energi      -1.620e+01  1.646e+04  -0.001    0.999
## engin        2.664e+01  2.394e+04   0.001    0.999
## enron       -8.789e+00  5.719e+03  -0.002    0.999
## etc          9.470e-01  1.569e+04   0.000    1.000
## even        -1.654e+01  2.289e+04  -0.001    0.999
## event        1.694e+01  1.851e+04   0.001    0.999
## expect      -1.179e+01  1.914e+04  -0.001    1.000
## experi       2.460e+00  2.240e+04   0.000    1.000
## fax          3.537e+00  3.386e+04   0.000    1.000
## feel         2.596e+00  2.348e+04   0.000    1.000
## file        -2.943e+01  2.165e+04  -0.001    0.999
## final        8.075e+00  5.008e+04   0.000    1.000
## financ      -9.122e+00  7.524e+03  -0.001    0.999
## financi     -9.747e+00  1.727e+04  -0.001    1.000
## find        -2.623e+00  9.727e+03   0.000    1.000
## first       -4.666e-01  2.043e+04   0.000    1.000
## follow       1.766e+01  3.080e+03   0.006    0.995
## form         8.483e+00  1.674e+04   0.001    1.000
## forward     -3.484e+00  1.864e+04   0.000    1.000
## free         6.113e+00  8.121e+03   0.001    0.999
## friday      -1.146e+01  1.996e+04  -0.001    1.000
## full         2.125e+01  2.190e+04   0.001    0.999
## futur        4.146e+01  1.439e+04   0.003    0.998
## gas         -3.901e+00  4.160e+03  -0.001    0.999
## get          5.154e+00  9.737e+03   0.001    1.000
## gibner       2.901e+01  2.460e+04   0.001    0.999
## give        -2.518e+01  2.130e+04  -0.001    0.999
## given       -2.186e+01  5.426e+04   0.000    1.000
## good         5.399e+00  1.619e+04   0.000    1.000
## great        1.222e+01  1.090e+04   0.001    0.999
## group        5.264e-01  1.037e+04   0.000    1.000
## happi        1.939e-02  1.202e+04   0.000    1.000
## hear         2.887e+01  2.281e+04   0.001    0.999
## hello        2.166e+01  1.361e+04   0.002    0.999
## help         1.731e+01  2.791e+03   0.006    0.995
## high        -1.982e+00  2.554e+04   0.000    1.000
## home         5.973e+00  8.965e+03   0.001    0.999
## hope        -1.435e+01  2.179e+04  -0.001    0.999
## hou          6.852e+00  6.437e+03   0.001    0.999
## hour         2.478e+00  1.333e+04   0.000    1.000
## houston     -1.855e+01  7.305e+03  -0.003    0.998
## howev       -3.449e+01  3.562e+04  -0.001    0.999
## http         2.528e+01  2.107e+04   0.001    0.999
## idea        -1.845e+01  3.892e+04   0.000    1.000
## immedi       6.285e+01  3.346e+04   0.002    0.999
## import      -1.859e+00  2.236e+04   0.000    1.000
## includ      -3.454e+00  1.799e+04   0.000    1.000
## increas      6.476e+00  2.329e+04   0.000    1.000
## industri    -3.160e+01  2.373e+04  -0.001    0.999
## info        -1.255e+00  4.857e+03   0.000    1.000
## inform       2.078e+01  8.549e+03   0.002    0.998
## interest     2.698e+01  1.159e+04   0.002    0.998
## intern      -7.991e+00  3.351e+04   0.000    1.000
## internet     8.749e+00  1.100e+04   0.001    0.999
## interview   -1.640e+01  1.873e+04  -0.001    0.999
## invest       3.201e+01  2.393e+04   0.001    0.999
## invit        4.304e+00  2.215e+04   0.000    1.000
## involv       3.815e+01  3.315e+04   0.001    0.999
## issu        -3.708e+01  3.396e+04  -0.001    0.999
## john        -5.326e-01  2.856e+04   0.000    1.000
## join        -3.824e+01  2.334e+04  -0.002    0.999
## juli        -1.358e+01  3.009e+04   0.000    1.000
## just        -1.021e+01  1.114e+04  -0.001    0.999
## kaminski    -1.812e+01  6.029e+03  -0.003    0.998
## keep         1.867e+01  2.782e+04   0.001    0.999
## kevin       -3.779e+01  4.738e+04  -0.001    0.999
## know         1.277e+01  1.526e+04   0.001    0.999
## last         1.046e+00  1.372e+04   0.000    1.000
## let         -2.763e+01  1.462e+04  -0.002    0.998
## life         5.812e+01  3.864e+04   0.002    0.999
## like         5.649e+00  7.660e+03   0.001    0.999
## line         8.743e+00  1.236e+04   0.001    0.999
## link        -6.929e+00  1.345e+04  -0.001    1.000
## list        -8.692e+00  2.149e+03  -0.004    0.997
## locat        2.073e+01  1.597e+04   0.001    0.999
## london       6.745e+00  1.642e+04   0.000    1.000
## long        -1.489e+01  1.934e+04  -0.001    0.999
## look        -7.031e+00  1.563e+04   0.000    1.000
## lot         -1.964e+01  1.321e+04  -0.001    0.999
## made         2.820e+00  2.743e+04   0.000    1.000
## mail         7.584e+00  1.021e+04   0.001    0.999
## make         2.901e+01  1.528e+04   0.002    0.998
## manag        6.014e+00  1.445e+04   0.000    1.000
## mani         1.885e+01  1.442e+04   0.001    0.999
## mark        -3.350e+01  3.208e+04  -0.001    0.999
## market       7.895e+00  8.012e+03   0.001    0.999
## may         -9.434e+00  1.397e+04  -0.001    0.999
## mean         6.078e-01  2.952e+04   0.000    1.000
## meet        -1.063e+00  1.263e+04   0.000    1.000
## member       1.381e+01  2.343e+04   0.001    1.000
## mention     -2.279e+01  2.714e+04  -0.001    0.999
## messag       1.716e+01  2.562e+03   0.007    0.995
## might        1.244e+01  1.753e+04   0.001    0.999
## model       -2.292e+01  1.049e+04  -0.002    0.998
## monday      -1.034e+00  3.233e+04   0.000    1.000
## money        3.264e+01  1.321e+04   0.002    0.998
## month       -3.727e+00  1.112e+04   0.000    1.000
## morn        -2.645e+01  3.403e+04  -0.001    0.999
## move        -3.834e+01  3.011e+04  -0.001    0.999
## much         3.775e-01  1.392e+04   0.000    1.000
## name         1.672e+01  1.322e+04   0.001    0.999
## need         8.437e-01  1.221e+04   0.000    1.000
## net          1.256e+01  2.197e+04   0.001    1.000
## new          1.003e+00  1.009e+04   0.000    1.000
## next.        1.492e+01  1.724e+04   0.001    0.999
## note         1.446e+01  2.294e+04   0.001    0.999
## now          3.790e+01  1.219e+04   0.003    0.998
## number      -9.622e+00  1.591e+04  -0.001    1.000
## offer        1.174e+01  1.084e+04   0.001    0.999
## offic       -1.344e+01  2.311e+04  -0.001    1.000
## one          1.241e+01  6.652e+03   0.002    0.999
## onlin        3.589e+01  1.665e+04   0.002    0.998
## open         2.114e+01  2.961e+04   0.001    0.999
## oper        -1.696e+01  2.757e+04  -0.001    1.000
## opportun    -4.131e+00  1.918e+04   0.000    1.000
## option      -1.085e+00  9.325e+03   0.000    1.000
## order        6.533e+00  1.242e+04   0.001    1.000
## origin       3.226e+01  3.818e+04   0.001    0.999
## part         4.594e+00  3.483e+04   0.000    1.000
## particip    -1.154e+01  1.738e+04  -0.001    0.999
## peopl       -1.864e+01  1.439e+04  -0.001    0.999
## per          1.367e+01  1.273e+04   0.001    0.999
## person       1.870e+01  9.575e+03   0.002    0.998
## phone       -6.957e+00  1.172e+04  -0.001    1.000
## place        9.005e+00  3.661e+04   0.000    1.000
## plan        -1.830e+01  6.320e+03  -0.003    0.998
## pleas       -7.961e+00  9.484e+03  -0.001    0.999
## point        5.498e+00  3.403e+04   0.000    1.000
## posit       -1.543e+01  2.316e+04  -0.001    0.999
## possibl     -1.366e+01  2.492e+04  -0.001    1.000
## power       -5.643e+00  1.173e+04   0.000    1.000
## present     -6.163e+00  1.278e+04   0.000    1.000
## price        3.428e+00  7.850e+03   0.000    1.000
## problem      1.262e+01  9.763e+03   0.001    0.999
## process     -2.957e-01  1.191e+04   0.000    1.000
## product      1.016e+01  1.345e+04   0.001    0.999
## program      1.444e+00  1.183e+04   0.000    1.000
## project      2.173e+00  1.497e+04   0.000    1.000
## provid       2.422e-01  1.859e+04   0.000    1.000
## public      -5.250e+01  2.341e+04  -0.002    0.998
## put         -1.052e+01  2.681e+04   0.000    1.000
## question    -3.467e+01  1.859e+04  -0.002    0.999
## rate        -3.112e+00  1.319e+04   0.000    1.000
## read        -1.527e+01  2.145e+04  -0.001    0.999
## real         2.046e+01  2.358e+04   0.001    0.999
## realli      -2.667e+01  4.640e+04  -0.001    1.000
## receiv       5.765e-01  1.585e+04   0.000    1.000
## recent      -2.067e+00  1.780e+04   0.000    1.000
## regard      -3.668e+00  1.511e+04   0.000    1.000
## relat       -5.114e+01  1.793e+04  -0.003    0.998
## remov        2.325e+01  2.484e+04   0.001    0.999
## repli        1.538e+01  2.916e+04   0.001    1.000
## report      -1.482e+01  1.477e+04  -0.001    0.999
## request     -1.232e+01  1.167e+04  -0.001    0.999
## requir       5.004e-01  2.937e+04   0.000    1.000
## research    -2.826e+01  1.553e+04  -0.002    0.999
## resourc     -2.735e+01  3.522e+04  -0.001    0.999
## respond      2.974e+01  3.888e+04   0.001    0.999
## respons     -1.960e+01  3.667e+04  -0.001    1.000
## result      -5.002e-01  3.140e+04   0.000    1.000
## resum       -9.219e+00  2.100e+04   0.000    1.000
## return       1.745e+01  1.844e+04   0.001    0.999
## review      -4.825e+00  1.013e+04   0.000    1.000
## right        2.312e+01  1.590e+04   0.001    0.999
## risk        -4.001e+00  1.718e+04   0.000    1.000
## robert      -2.096e+01  2.907e+04  -0.001    0.999
## run         -5.162e+01  4.434e+04  -0.001    0.999
## say          7.366e+00  2.217e+04   0.000    1.000
## schedul      1.919e+00  3.580e+04   0.000    1.000
## school      -3.870e+00  2.882e+04   0.000    1.000
## secur       -1.604e+01  2.201e+03  -0.007    0.994
## see         -1.120e+01  1.293e+04  -0.001    0.999
## send        -2.427e+01  1.222e+04  -0.002    0.998
## sent        -1.488e+01  2.195e+04  -0.001    0.999
## servic      -7.164e+00  1.235e+04  -0.001    1.000
## set         -9.353e+00  2.627e+04   0.000    1.000
## sever        2.041e+01  3.093e+04   0.001    0.999
## shall        1.930e+01  3.075e+04   0.001    0.999
## shirley     -7.133e+01  6.329e+04  -0.001    0.999
## short       -8.974e+00  1.721e+04  -0.001    1.000
## sinc        -3.438e+00  3.546e+04   0.000    1.000
## sincer      -2.073e+01  3.515e+04  -0.001    1.000
## site         8.689e+00  1.496e+04   0.001    1.000
## softwar      2.575e+01  1.059e+04   0.002    0.998
## soon         2.350e+01  3.731e+04   0.001    0.999
## sorri        6.036e+00  2.299e+04   0.000    1.000
## special      1.777e+01  2.755e+04   0.001    0.999
## specif      -2.337e+01  3.083e+04  -0.001    0.999
## start        1.437e+01  1.897e+04   0.001    0.999
## state        1.221e+01  1.677e+04   0.001    0.999
## still        3.878e+00  2.622e+04   0.000    1.000
## stinson     -4.345e+01  2.697e+04  -0.002    0.999
## student     -1.815e+01  2.186e+04  -0.001    0.999
## subject      3.041e+01  1.055e+04   0.003    0.998
## success      4.344e+00  2.783e+04   0.000    1.000
## suggest     -3.842e+01  4.475e+04  -0.001    0.999
## support     -1.539e+01  1.976e+04  -0.001    0.999
## sure        -5.503e+00  2.078e+04   0.000    1.000
## system       3.778e+00  9.149e+03   0.000    1.000
## take         5.731e+00  1.716e+04   0.000    1.000
## talk        -1.011e+01  2.021e+04  -0.001    1.000
## team         7.940e+00  2.570e+04   0.000    1.000
## term         2.013e+01  2.303e+04   0.001    0.999
## thank       -3.890e+01  1.059e+04  -0.004    0.997
## thing        2.579e+01  1.341e+04   0.002    0.998
## think       -1.218e+01  2.077e+04  -0.001    1.000
## thought      1.243e+01  3.023e+04   0.000    1.000
## thursday    -1.491e+01  3.262e+04   0.000    1.000
## time        -5.921e+00  8.335e+03  -0.001    0.999
## today       -1.762e+01  1.965e+04  -0.001    0.999
## togeth      -2.355e+01  1.869e+04  -0.001    0.999
## trade       -1.755e+01  1.483e+04  -0.001    0.999
## tri          9.278e-01  1.282e+04   0.000    1.000
## tuesday     -2.808e+01  3.959e+04  -0.001    0.999
## two         -2.573e+01  1.844e+04  -0.001    0.999
## type        -1.447e+01  2.755e+04  -0.001    1.000
## understand   9.307e+00  2.342e+04   0.000    1.000
## unit        -4.020e+00  3.008e+04   0.000    1.000
## univers      1.228e+01  2.197e+04   0.001    1.000
## updat       -1.510e+01  1.448e+04  -0.001    0.999
## use         -1.385e+01  9.382e+03  -0.001    0.999
## valu         9.024e-01  1.360e+04   0.000    1.000
## version     -3.606e+01  2.939e+04  -0.001    0.999
## vinc        -3.735e+01  8.647e+03  -0.004    0.997
## visit        2.585e+01  1.170e+04   0.002    0.998
## vkamin      -6.649e+01  5.703e+04  -0.001    0.999
## want        -2.555e+00  1.106e+04   0.000    1.000
## way          1.339e+01  1.138e+04   0.001    0.999
## web          2.791e+00  1.686e+04   0.000    1.000
## websit      -2.563e+01  1.848e+04  -0.001    0.999
## wednesday   -1.526e+01  2.642e+04  -0.001    1.000
## week        -6.795e+00  1.046e+04  -0.001    0.999
## well        -2.222e+01  9.713e+03  -0.002    0.998
## will        -1.119e+01  5.980e+03  -0.002    0.999
## wish         1.173e+01  3.175e+04   0.000    1.000
## within       2.900e+01  2.163e+04   0.001    0.999
## without      1.942e+01  1.763e+04   0.001    0.999
## work        -1.099e+01  1.160e+04  -0.001    0.999
## write        4.406e+01  2.825e+04   0.002    0.999
## www         -7.867e+00  2.224e+04   0.000    1.000
## year        -1.010e+01  1.039e+04  -0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4409.49  on 4009  degrees of freedom
## Residual deviance:   13.46  on 3679  degrees of freedom
## AIC: 675.46
## 
## Number of Fisher Scoring iterations: 25

# PROBLEM 3.3 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# How many of the word stems "enron", "hou", "vinc", and "kaminski" appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.

prp(spamCART) # ANS 2: the stems "enron" and "vinc" appear in the tree

# PROBLEM 3.4 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?

table(train$spam, predict.spamLog >0.5)

##    
##     FALSE TRUE
##   0  3052    0
##   1     4  954

(3052+954)/nrow(train) # ANS 0.9990025

## [1] 0.9990025
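
# A minimal sketch, not part of the original solution: accuracy can also be read off a confusion matrix as the sum of the diagonal divided by the total, which avoids retyping counts. The matrix below re-enters the counts from the spamLog table above by hand; the names confLog and accLog are illustrative only.

```r
# Counts copied from the training-set table for spamLog (rows = actual, columns = predicted)
confLog = matrix(c(3052, 4, 0, 954), nrow = 2,
                 dimnames = list(actual = c("0", "1"),
                                 predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal
accLog = sum(diag(confLog)) / sum(confLog)
accLog
```

# With a real table object the same expression applies, as long as the table is square (both FALSE and TRUE columns are present).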

# PROBLEM 3.5 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set AUC of spamLog?
library(ROCR)

predLog = prediction(predict.spamLog, train$spam)
performance(predLog, "auc")@y.values

## [[1]]
## [1] 0.9999959

as.numeric(performance(predLog, "auc")@y.values) # Alternative

## [1] 0.9999959
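
# A rough sketch of what the AUC measures, using base R only: it is the fraction of (spam, ham) pairs in which the spam message gets the higher predicted probability. The helper aucRank and the toy scores below are assumptions made for illustration; they are not part of ROCR or of the original solution.

```r
# Toy data (made up): four predicted probabilities and their true labels
scores = c(0.1, 0.4, 0.35, 0.8)
labels = c(0, 0, 1, 1)

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity
aucRank = function(scores, labels) {
  r = rank(scores)               # average ranks, so tied scores count as half
  nPos = sum(labels == 1)
  nNeg = sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

toyAUC = aucRank(scores, labels)
toyAUC   # 0.75: three of the four (positive, negative) pairs are ordered correctly
```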

# PROBLEM 3.6 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions? (Remember that if you used the type="class" argument when making predictions, you automatically used a threshold of 0.5. If you did not add in the type argument to the predict function, the probabilities are in the second column of the predict output.)

table(train$spam, predict.spamCART>0.5)

##    
##     FALSE TRUE
##   0  2885  167
##   1    64  894

(2885+894)/nrow(train) # ANS 0.942394

## [1] 0.942394

# PROBLEM 3.7 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set AUC of spamCART? (Remember that you have to pass the prediction function predicted probabilities, so don't include the type argument when making predictions for your CART model.)

predCART = prediction(predict.spamCART, train$spam)
performance(predCART, "auc")@y.values # ANS 0.9696044

## [[1]]
## [1] 0.9696044

as.numeric(performance(predCART, "auc")@y.values) # Alternative

## [1] 0.9696044

# PROBLEM 3.8 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions? (Remember that your answer might not match ours exactly, due to random behavior in the random forest algorithm on different operating systems.)

table(train$spam, predict.spamRF>0.5)

##    
##     FALSE TRUE
##   0  3013   39
##   1    44  914

(3013+914)/nrow(train) # ANS 0.9793017

## [1] 0.9793017

# PROBLEM 3.9 - BUILDING MACHINE LEARNING MODELS (2 points possible)
# What is the training set AUC of spamRF? (Remember to pass the argument type="prob" to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)


predRF = prediction(predict.spamRF, train$spam)
performance(predRF, "auc")@y.values # ANS 0.9979116

## [[1]]
## [1] 0.9979116

as.numeric(performance(predRF, "auc")@y.values) # Alternative

## [1] 0.9979116

# PROBLEM 3.10 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# Which model had the best training set performance, in terms of accuracy and AUC? ANS Logistic

# PROBLEM 4.1 - EVALUATING ON THE TEST SET  (1 point possible)
# Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained.

predict.spamLog.test = predict(spamLog,newdata=test, type="response")
predict.spamCART.test = predict(spamCART, newdata=test)[,2]
predict.spamRF.test = predict(spamRF,newdata=test, type="prob")[,2]

# What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?

table(test$spam, predict.spamLog.test >0.5)

##    
##     FALSE TRUE
##   0  1257   51
##   1    34  376

(1257+376)/nrow(test) # ANS 0.9505239

## [1] 0.9505239

# PROBLEM 4.2 - EVALUATING ON THE TEST SET  (1 point possible)
# What is the testing set AUC of spamLog? ANS 0.9627517


predLog.test = prediction(predict.spamLog.test, test$spam)
performance(predLog.test, "auc")@y.values

## [[1]]
## [1] 0.9627517

as.numeric(performance(predLog.test, "auc")@y.values) # Alternative

## [1] 0.9627517

# PROBLEM 4.3 - EVALUATING ON THE TEST SET  (1 point possible)
# What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?


table(test$spam, predict.spamCART.test >0.5)

##    
##     FALSE TRUE
##   0  1228   80
##   1    24  386

(1228+386)/nrow(test) # ANS 0.9394645

## [1] 0.9394645

# PROBLEM 4.4 - EVALUATING ON THE TEST SET  (1 point possible)
# What is the testing set AUC of spamCART?

predCART.test = prediction(predict.spamCART.test, test$spam)
as.numeric(performance(predCART.test, "auc")@y.values) # 0.963176

## [1] 0.963176

# PROBLEM 4.5 - EVALUATING ON THE TEST SET  (1 point possible)
# What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?

table(test$spam, predict.spamRF.test >0.5)

##    
##     FALSE TRUE
##   0  1290   18
##   1    25  385

(1290+385)/nrow(test) # ANS 0.9749709

## [1] 0.9749709

# PROBLEM 4.6 - EVALUATING ON THE TEST SET  (1 point possible)
# What is the testing set AUC of spamRF?

predRF.test = prediction(predict.spamRF.test, test$spam)
as.numeric(performance(predRF.test, "auc")@y.values)

## [1] 0.9975656

# ANS 0.9975656

# PROBLEM 4.7 - EVALUATING ON THE TEST SET  (1/1 point)
# Which model had the best testing set performance, in terms of accuracy and AUC? ANS Random Forest

# PROBLEM 4.8 - EVALUATING ON THE TEST SET  (1/1 point)
# Which model demonstrated the greatest degree of overfitting?
# ANS Logistic. Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.
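
# One way to make the overfitting comparison concrete is to compute the gap between training and testing accuracy for each model. The sketch below re-enters the diagonal counts from the confusion tables shown earlier (4010 training and 1718 testing emails); the vector names are illustrative and this is not part of the original solution.

```r
# Accuracy = correct predictions / number of emails, from the tables above
trainAcc = c(Logistic     = (3052 + 954) / 4010,
             CART         = (2885 + 894) / 4010,
             RandomForest = (3013 + 914) / 4010)

testAcc  = c(Logistic     = (1257 + 376) / 1718,
             CART         = (1228 + 386) / 1718,
             RandomForest = (1290 + 385) / 1718)

# A large positive gap suggests the model fit noise in the training set
gap = trainAcc - testAcc
round(gap, 4)
names(which.max(gap))   # model with the largest train/test gap
```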

# PROBLEM 6.1 - INTEGRATING WORD COUNT INFORMATION  
# While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.

# First, we will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case, word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix is equal to the number of terms present in the document corresponding to the row. Obtain the word counts for each email with the command:

wordCount = rowSums(as.matrix(dtm))
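
# To see why the row sums give word counts, here is a small sketch on a made-up matrix with the same layout as a document term matrix (emails as rows, word stems as columns). The matrix toyDtm and its values are invented for illustration.

```r
# Toy document-term matrix: 2 emails, 3 word stems, cells are term frequencies
toyDtm = matrix(c(2, 0, 1,
                  0, 3, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("email1", "email2"),
                                c("enron", "free", "meet")))

# Summing across each row adds up every term occurrence in that email
toyWordCount = rowSums(toyDtm)
toyWordCount   # email1 = 3, email2 = 4
```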

# PROBLEM 6.4 - INTEGRATING WORD COUNT INFORMATION  
# Create a variable called logWordCount in emailsSparse2 that is equal to log(wordCount). Use the boxplot() command to plot logWordCount against whether a message is spam. Which of the following best describes the box plot?

emailsSparse2 = emailsSparse
emailsSparse2$logWordCount = log(wordCount)

# Use the formula interface so that logWordCount is split by spam status (one box per class)
boxplot(emailsSparse2$logWordCount ~ emailsSparse2$spam)
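
# A small sketch of a grouped box plot, assuming a made-up data frame named toy: the formula y ~ group draws one box per group, and tapply gives the per-group medians that the boxes display. The data values are invented and say nothing about the real emails.

```r
# Toy data (made up): three ham and three spam messages with word counts
toy = data.frame(spam = c(0, 0, 0, 1, 1, 1),
                 wordCount = c(120, 300, 80, 40, 60, 500))
toy$logWordCount = log(toy$wordCount)

# One box per value of spam
boxplot(logWordCount ~ spam, data = toy,
        xlab = "spam", ylab = "log(word count)")

# Medians shown as the center line of each box
toyMedians = tapply(toy$logWordCount, toy$spam, median)
toyMedians
```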

Unit 5

Todd Curtis

April 7, 2015