This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
# Unit 5 - Twitter
# IMPORTANT NOTE
# In the following video, we ask you to install the "tm" package to perform the pre-processing steps. Due to function changes that occurred after this video was recorded, you will need to run the following command immediately after converting all of the words to lowercase letters (it converts all documents in the corpus to the PlainTextDocument type):
# corpus = tm_map(corpus, PlainTextDocument)
# Then you can continue with the R commands as they are in the video.
# Video 5
# Read in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
str(tweets)
## 'data.frame': 1181 obs. of 2 variables:
## $ Tweet: chr "I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore" "iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple" "LOVE U @APPLE" "Thank you @apple, loving my new iPhone 5S!!!!! #apple #iphone5S pic.twitter.com/XmHJCU4pcb" ...
## $ Avg : num 2 2 1.8 1.8 1.8 1.8 1.8 1.6 1.6 1.6 ...
# Create dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
table(tweets$Negative)
##
## FALSE TRUE
## 999 182
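# A minimal sketch of how the comparison above builds the label, using made-up scores rather than the values in tweets.csv: every average at or below -1 becomes TRUE, everything else FALSE.

```r
# Hypothetical average sentiment scores (assumed for illustration only)
avg = c(2, -1.4, 0.2, -1, -0.8)

# The comparison yields a logical vector; as.factor turns it into a two-level factor
negative = as.factor(avg <= -1)
table(negative)
```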
# Install new packages
options(repos = c(CRAN = "http://cran.rstudio.com"))
# install.packages("tm")
library(tm)
## Loading required package: NLP
# install.packages("SnowballC")
library(SnowballC)
# Create corpus
corpus = Corpus(VectorSource(tweets$Tweet))
# Look at corpus
corpus
## <<VCorpus (documents: 1181, metadata (corpus/indexed): 0/0)>>
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore
# Convert to lower-case
corpus = tm_map(corpus, tolower)
corpus[[1]]
## [1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"
# IMPORTANT NOTE: If you are using the latest version of the tm package, run the following line before continuing. It converts every document in the corpus to the PlainTextDocument type. This is needed because of a change to how tolower is handled that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
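# On recent tm versions, a commonly used alternative is to wrap tolower in content_transformer(), which lower-cases the text while keeping each document's type intact. The sketch below assumes tm 0.6 or later is installed and uses two sample tweets in a throwaway corpus named demo (a hypothetical name); it is not the approach shown in the video.

```r
# Runs only when the tm package is available
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  demo = VCorpus(VectorSource(c("LOVE U @APPLE", "iOS 7 is so smooth")))
  # content_transformer() lets tm_map apply a plain character function to each document
  demo = tm_map(demo, content_transformer(tolower))
  lowered = as.character(demo[[1]])
  print(lowered)
}
```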
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## i have to say apple has by far the best customer care service i have ever received apple appstore
# Look at stop words
stopwords("english")[1:10]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
# Remove stopwords and apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## say far best customer care service ever received appstore
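# Conceptually, removing stop words amounts to deleting whole-word matches from each document. The base R sketch below applies a short, hand-picked word list (an assumption; the real stopwords("english") list is much longer) to the first tweet as it looks after punctuation removal. It illustrates the idea and is not tm's own code.

```r
text = "i have to say apple has by far the best customer care service i have ever received apple appstore"

# Hypothetical short list: "apple" plus the stop words that occur in this tweet
drop_words = c("apple", "i", "have", "to", "has", "by", "the")

# \\b anchors the match at word boundaries, so "to" inside "customer" is left alone
pattern = paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
cleaned = gsub(pattern, "", text, perl = TRUE)

# Collapse the leftover runs of spaces and trim the ends
cleaned = trimws(gsub(" +", " ", cleaned))
cleaned
```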
# Stem document
corpus = tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## say far best custom care servic ever receiv appstor
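# stemDocument relies on the stemmer provided by SnowballC. Assuming SnowballC is installed, wordStem() can be called directly on a few words to see the effect in isolation (the word list is chosen for illustration):

```r
# Runs only when the SnowballC package is available
if (requireNamespace("SnowballC", quietly = TRUE)) {
  words = c("customer", "service", "received")
  # Each word is reduced to its stem, which need not be a dictionary word
  stems = SnowballC::wordStem(words)
  print(stems)
}
```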
# Video 6
# Create matrix
frequencies = DocumentTermMatrix(corpus)
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3289)>>
## Non-/sparse entries: 8980/3875329
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
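# The summary figures above can be checked by hand: the matrix has documents x terms cells, and sparsity is the share of those cells that are zero. A quick sketch using the numbers printed above (variable names are illustrative):

```r
docs = 1181
terms = 3289
nonzero = 8980

cells = docs * terms            # total number of cells in the matrix
zero_cells = cells - nonzero    # should match the sparse count in the printout
sparsity = zero_cells / cells

zero_cells
round(100 * sparsity, 2)        # shown as 100% once rounded to a whole percent
```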
# Look at matrix
inspect(frequencies[1000:1005,505:515])
## <<DocumentTermMatrix (documents: 6, terms: 11)>>
## Non-/sparse entries: 1/65
## Sparsity : 98%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs cheapen cheaper check cheep cheer cheerio cherylcol chief
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 1 0 0 0
## Terms
## Docs chiiiiqu child children
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
## character(0) 0 0 0
# Check for sparsity
findFreqTerms(frequencies, lowfreq=20)
## [1] "android" "anyon" "app"
## [4] "appl" "back" "batteri"
## [7] "better" "buy" "can"
## [10] "cant" "come" "dont"
## [13] "fingerprint" "freak" "get"
## [16] "googl" "ios7" "ipad"
## [19] "iphon" "iphone5" "iphone5c"
## [22] "ipod" "ipodplayerpromo" "itun"
## [25] "just" "like" "lol"
## [28] "look" "love" "make"
## [31] "market" "microsoft" "need"
## [34] "new" "now" "one"
## [37] "phone" "pleas" "promo"
## [40] "promoipodplayerpromo" "realli" "releas"
## [43] "samsung" "say" "store"
## [46] "thank" "think" "time"
## [49] "twitter" "updat" "use"
## [52] "via" "want" "well"
## [55] "will" "work"
# Remove sparse terms
sparse = removeSparseTerms(frequencies, 0.995)
sparse
## <<DocumentTermMatrix (documents: 1181, terms: 309)>>
## Non-/sparse entries: 4669/360260
## Sparsity : 99%
## Maximal term length: 20
## Weighting : term frequency (tf)
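# The idea behind the 0.995 threshold: keep only terms that appear in more than 0.5% of the tweets. The base R sketch below imitates that rule on a tiny made-up count matrix with a threshold of 0.6; it illustrates the concept and is not tm's own code.

```r
# Hypothetical document-term counts: 4 documents (rows), 4 terms (columns)
m = matrix(c(1, 0, 0, 0,
             1, 1, 0, 0,
             0, 1, 0, 1,
             1, 0, 0, 0),
           nrow = 4, byrow = TRUE,
           dimnames = list(NULL, c("iphon", "love", "rare", "promo")))

threshold = 0.6
doc_freq = colSums(m > 0)                      # number of documents containing each term
keep = doc_freq > nrow(m) * (1 - threshold)    # term must appear in more than 40% of documents
kept = m[, keep, drop = FALSE]
colnames(kept)
```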
# Convert to a data frame
tweetsSparse = as.data.frame(as.matrix(sparse))
# Make all variable names R-friendly
colnames(tweetsSparse) = make.names(colnames(tweetsSparse))
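# make.names matters here because some terms are not valid R variable names, for example terms that start with a digit. A small sketch with hypothetical terms:

```r
terms = c("100", "2001", "iphon", "love")

# Names starting with a digit get an "X" prefix; valid names are left unchanged
fixed = make.names(terms)
fixed
```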
# Add dependent variable
tweetsSparse$Negative = tweets$Negative
# Split the data
install.packages("caTools")
##
## The downloaded binary packages are in
## /var/folders/r_/jg_fymdd069b2cw6jsqwvxd80000gn/T//RtmpjOo4mS/downloaded_packages
library(caTools)
set.seed(123)
split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)
trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)
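# sample.split is used instead of a plain random draw because it keeps the share of each outcome class roughly equal in the training and test sets. The base R sketch below shows that idea on a made-up outcome vector; it is a conceptual illustration, not caTools' implementation, and all names in it are hypothetical.

```r
set.seed(123)

# Hypothetical outcome: 20 FALSE and 10 TRUE labels
y = c(rep(FALSE, 20), rep(TRUE, 10))

in_train = logical(length(y))
for (level in unique(y)) {
  idx = which(y == level)
  n_take = round(0.7 * length(idx))
  # Draw n_take positions within this class and mark them as training rows
  in_train[idx[sample.int(length(idx), n_take)]] = TRUE
}

# Both classes are split 70/30
table(y, in_train)
```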
# QUICK QUESTION (1 point possible)
# In the previous video, we showed a list of all words that appear at least 20 times in our tweets. Which of the following words appear at least 100 times? Select all that apply. (HINT: use the findFreqTerms function)
findFreqTerms(frequencies, lowfreq=100)
## [1] "iphon" "itun" "new"
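# findFreqTerms compares each term's total count across all documents with lowfreq. The same idea on a tiny made-up count matrix in base R (a sketch of the concept, not tm's implementation):

```r
# Hypothetical counts: 3 documents (rows), 3 terms (columns)
m = matrix(c(2, 0, 1,
             3, 1, 0,
             1, 0, 0),
           nrow = 3, byrow = TRUE,
           dimnames = list(NULL, c("iphon", "itun", "new")))

totals = colSums(m)                     # total occurrences of each term
frequent = names(totals)[totals >= 2]   # terms with at least 2 occurrences overall
frequent
```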
# Video 7
# Build a CART model
install.packages("rpart")
##
## The downloaded binary packages are in
## /var/folders/r_/jg_fymdd069b2cw6jsqwvxd80000gn/T//RtmpjOo4mS/downloaded_packages
library(rpart)
install.packages("rpart.plot")
##
## The downloaded binary packages are in
## /var/folders/r_/jg_fymdd069b2cw6jsqwvxd80000gn/T//RtmpjOo4mS/downloaded_packages
library(rpart.plot)
tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")
prp(tweetCART)
# Evaluate the performance of the model
predictCART = predict(tweetCART, newdata=testSparse, type="class")
table(testSparse$Negative, predictCART)
## predictCART
## FALSE TRUE
## FALSE 294 6
## TRUE 37 18
# Compute accuracy
(294+18)/(294+6+37+18)
## [1] 0.8788732
# Baseline accuracy
table(testSparse$Negative)
##
## FALSE TRUE
## 300 55
300/(300+55)
## [1] 0.8450704
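# The accuracy and baseline figures above can be read off the confusion matrix directly. The sketch below rebuilds the CART confusion matrix from the counts printed above and also derives sensitivity and specificity, two measures the video code does not compute (the variable names are illustrative).

```r
cm = matrix(c(294, 6,
              37, 18),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual = c("FALSE", "TRUE"),
                            predicted = c("FALSE", "TRUE")))

accuracy = sum(diag(cm)) / sum(cm)                        # correct predictions / all predictions
baseline = max(rowSums(cm)) / sum(cm)                     # always predict the most common class
sensitivity = cm["TRUE", "TRUE"] / sum(cm["TRUE", ])      # share of negative tweets that are caught
specificity = cm["FALSE", "FALSE"] / sum(cm["FALSE", ])   # share of non-negative tweets left alone

round(c(accuracy = accuracy, baseline = baseline,
        sensitivity = sensitivity, specificity = specificity), 4)
```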
# Random forest model
install.packages("randomForest")
##
## The downloaded binary packages are in
## /var/folders/r_/jg_fymdd069b2cw6jsqwvxd80000gn/T//RtmpjOo4mS/downloaded_packages
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(123)
tweetRF = randomForest(Negative ~ ., data=trainSparse)
# Make predictions:
predictRF = predict(tweetRF, newdata=testSparse)
table(testSparse$Negative, predictRF)
## predictRF
## FALSE TRUE
## FALSE 293 7
## TRUE 34 21
# Accuracy:
(293+21)/(293+7+34+21)
## [1] 0.884507
# QUICK QUESTION (1 point possible)
# In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well logistic regression does. Build a logistic regression model (using the training set) to predict "Negative" using all of the independent variables. You may get a warning message after building your model - don't worry (we explain what it means in the explanation).
tweet.logistic = glm(Negative ~ ., data=trainSparse, family="binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
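# Warnings like these usually indicate that some combination of predictors separates the classes perfectly in the training data, which is easy to trigger with several hundred word-count variables and fewer than a thousand tweets; it is a symptom of overfitting. The sketch below provokes the same kind of warning on a tiny made-up dataset (x and y are hypothetical) and collects the messages instead of printing them.

```r
x = c(1, 2, 3, 4, 5, 6)
y = c(0, 0, 0, 1, 1, 1)   # perfectly separated: y is 1 exactly when x > 3.5

msgs = character(0)
fit = withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    # Record the warning text, then suppress the default printing
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
msgs
```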
# Now, make predictions using the logistic regression model:
predictions = predict(tweet.logistic, newdata=testSparse, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
# Here "tweet.logistic" is the name of the logistic regression model. You might also get a warning message after this command, but don't worry - it is due to the same problem as the previous warning message.
# Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model. What is the accuracy?
table(testSparse$Negative, predictions > 0.5)
##
## FALSE TRUE
## FALSE 253 47
## TRUE 23 32
(253+32)/nrow(testSparse) # Accuracy is 0.8028169
## [1] 0.8028169
# ===========================================
# Unit 5 - Recitation
# Video 2
# Load the dataset
emails = read.csv("energy_bids.csv", stringsAsFactors=FALSE)
str(emails)
## 'data.frame': 855 obs. of 2 variables:
## $ email : chr "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
## $ responsive: int 0 1 0 1 0 0 1 0 0 0 ...
# Look at emails
emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. \"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. 
\"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
emails$responsive[1]
## [1] 0
emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the California State Auditor. 
I look forward to seeing you at The Aspen Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
emails$responsive[2]
## [1] 1
# Responsive emails
table(emails$responsive)
##
## 0 1
## 716 139
# Note: strwrap() can break down a long email into many lines
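# A short sketch of strwrap on a made-up long string; with the real data the call would be something like strwrap(emails$email[1]). The width of 60 is an arbitrary choice.

```r
# Hypothetical long text built by repeating one sentence
long_text = paste(rep("The CEC symposium will take place in San Diego on 29-30 November.", 4),
                  collapse = " ")

# strwrap returns a character vector with one element per wrapped line
wrapped = strwrap(long_text, width = 60)
wrapped
```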
# Video 3
# Load tm package
library(tm)
# Create corpus
corpus = Corpus(VectorSource(emails$email))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. "Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated," said Janine Ferretti, executive director of the CEC. "We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment." The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: * protect air quality and mitigate climate change, * minimize the possibility of environment-based trade disputes, * ensure a dependable supply of reasonably priced electricity across North America * avoid creation of pollution havens, and * ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. "Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships," said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. "We need to achieve this new level of cooperation in our environmental approaches as well." The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. "We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector," said Ferretti. "How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?" 
The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called "pollution havens" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. "The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region," said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********
# Pre-process data
corpus = tm_map(corpus, tolower)
# IMPORTANT NOTE: If you are using the latest version of the tm package, run the following line before continuing. It converts every document in the corpus to the PlainTextDocument type. This is needed because of a change to how tolower is handled that occurred after this video was recorded.
corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
# Look at first email
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## north america integr electr market requir cooper environment polici commiss environment cooper releas work paper north america electr market montreal 27 novemb 2001 north american commiss environment cooper cec releas work paper highlight trend toward increas trade competit crossbord invest electr canada mexico unit state hope work paper environment challeng opportun evolv north american electr market will stimul public discuss around cec symposium titl need coordin environment polici trinat north americawid electr market develop cec symposium will take place san diego 2930 novemb will bring togeth lead expert industri academia ngos govern canada mexico unit state consid impact evolv continent electr market human health environ goal work paper symposium highlight key environment issu must address electr market north america becom integr said janin ferretti execut director cec want stimul discuss around import polici question rais countri can cooper approach energi environ cec intern organ creat environment side agreement nafta known north american agreement environment cooper establish address region environment concern help prevent potenti trade environment conflict promot effect enforc environment law cec secretariat believ greater north american cooper environment polici regard continent electr market necessari protect air qualiti mitig climat chang minim possibl environmentbas trade disput ensur depend suppli reason price electr across north america avoid creation pollut haven ensur local nation environment measur remain effect chang market work paper profil rapid chang north american electr market exampl 2001 us project export 131 thousand gigawatthour gwh electr canada mexico 2007 number project grow 169 thousand gwh electr past decad north american electr market develop complex array crossbord transact relationship said phil sharp former us congressman chairman cec electr advisori board need achiev new level cooper environment approach well environment 
profil electr sector electr sector singl largest sourc nation report toxin unit state canada larg sourc mexico us electr sector emit approxim 25 percent nox emiss rough 35 percent co2 emiss 25 percent mercuri emiss almost 70 percent so2 emiss emiss larg impact airsh watersh migratori speci corridor often share three north american countri want discuss possibl outcom greater effort coordin feder state provinci environment law polici relat electr sector said ferretti can develop compat environment approach help make domest environment polici effect effect integr electr market one key issu rais paper effect market integr competit particular fuel coal natur gas renew fuel choic larg determin environment impact specif facil along pollut control technolog perform standard regul paper highlight impact high competit market well exampl concern call pollut haven aris signific differ environment law enforc practic induc power compani locat oper jurisdict lower standard cec secretariat explor addit environment polici will work restructur market polici can adapt ensur enhanc competit benefit entir region said sharp trade rule polici measur direct influenc variabl drive success integr north american electr market work paper also address fuel choic technolog pollut control strategi subsidi cec will use inform gather discuss period develop final report will submit council earli 2002 inform view live video webcast symposium pleas go httpwwwcecorgelectr may download work paper support document httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss environment cooper 393 rue stjacqu ouest bureau 200 montréal québec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg
# Video 4
# Create matrix
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 21735)>>
## Non-/sparse entries: 102511/18480914
## Sparsity : 99%
## Maximal term length: 113
## Weighting : term frequency (tf)
# Remove sparse terms
dtm = removeSparseTerms(dtm, 0.97)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51644/622096
## Sparsity : 92%
## Maximal term length: 19
## Weighting : term frequency (tf)
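# With 855 emails, a sparsity threshold of 0.97 keeps terms that occur in more than 3% of the emails. A short sketch of that arithmetic (variable names are illustrative):

```r
n_docs = 855
threshold = 0.97

cutoff = n_docs * (1 - threshold)   # a term must appear in more documents than this
min_docs = floor(cutoff) + 1        # smallest whole number of emails above the cutoff
min_docs
```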
# Create data frame
labeledTerms = as.data.frame(as.matrix(dtm))
# Add in the outcome variable
labeledTerms$responsive = emails$responsive
str(labeledTerms)
## 'data.frame': 855 obs. of 789 variables:
## $ 100 : num 0 0 0 0 0 0 5 0 0 0 ...
## $ 1400 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 1999 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 2000 : num 0 0 1 0 1 0 6 0 1 0 ...
## $ 2001 : num 2 1 0 0 0 0 7 0 0 0 ...
## $ 713 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 77002 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ abl : num 0 0 0 0 0 0 2 0 0 0 ...
## $ accept : num 0 0 0 0 0 0 1 0 0 0 ...
## $ access : num 0 0 0 0 0 0 0 0 0 0 ...
## $ accord : num 0 0 0 0 0 0 1 0 0 0 ...
## $ account : num 0 0 0 0 0 0 3 0 0 0 ...
## $ act : num 0 0 0 0 0 0 1 0 0 0 ...
## $ action : num 0 0 0 0 1 0 0 0 0 0 ...
## $ activ : num 0 0 1 0 1 0 1 0 0 0 ...
## $ actual : num 0 0 0 0 0 0 0 0 0 0 ...
## $ add : num 0 0 0 0 0 0 1 0 0 0 ...
## $ addit : num 1 0 0 0 0 0 1 0 0 0 ...
## $ address : num 3 0 0 0 2 0 0 0 0 1 ...
## $ administr : num 0 0 0 0 0 0 1 0 0 0 ...
## $ advanc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ advis : num 0 0 0 0 0 0 0 0 0 0 ...
## $ affect : num 0 0 0 0 2 0 0 0 0 0 ...
## $ afternoon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ agenc : num 0 0 0 0 1 0 0 0 0 0 ...
## $ ago : num 0 0 0 0 0 0 1 0 0 0 ...
## $ agre : num 0 0 0 0 0 0 0 0 0 0 ...
## $ agreement : num 2 0 0 0 2 0 1 0 0 1 ...
## $ alan : num 0 0 0 0 0 1 0 0 0 0 ...
## $ allow : num 0 0 0 0 0 0 2 0 0 0 ...
## $ along : num 1 0 0 0 1 0 1 0 0 0 ...
## $ alreadi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ also : num 1 0 0 0 0 0 8 0 0 0 ...
## $ altern : num 0 0 0 0 0 0 0 0 1 0 ...
## $ although : num 0 0 0 0 0 0 6 0 0 0 ...
## $ amend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ america : num 4 0 0 0 0 0 0 0 1 0 ...
## $ among : num 0 0 0 0 0 0 3 0 0 0 ...
## $ amount : num 0 0 0 0 0 0 1 0 0 0 ...
## $ analysi : num 0 0 0 2 0 0 0 0 0 0 ...
## $ analyst : num 0 0 0 0 0 0 6 0 0 0 ...
## $ andor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ andrew : num 0 0 0 0 0 0 0 0 0 0 ...
## $ announc : num 0 0 0 0 0 0 2 0 0 0 ...
## $ anoth : num 0 0 0 0 0 0 6 0 0 0 ...
## $ answer : num 0 0 0 0 0 0 2 0 0 0 ...
## $ anyon : num 0 0 0 0 0 0 0 0 0 0 ...
## $ anyth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ appear : num 0 0 0 0 0 0 3 0 0 0 ...
## $ appli : num 0 0 0 0 0 0 0 0 0 0 ...
## $ applic : num 0 0 0 0 0 0 0 0 0 0 ...
## $ appreci : num 0 0 0 0 1 0 0 0 0 0 ...
## $ approach : num 3 0 0 0 0 0 1 0 0 0 ...
## $ appropri : num 0 0 0 0 0 0 0 1 0 0 ...
## $ approv : num 0 0 0 0 0 0 1 0 0 0 ...
## $ approxim : num 1 0 0 0 0 0 1 0 0 0 ...
## $ april : num 0 0 0 0 0 0 3 0 0 0 ...
## $ area : num 0 0 0 0 1 0 3 0 0 0 ...
## $ around : num 2 0 0 0 0 0 1 0 0 0 ...
## $ arrang : num 0 0 0 0 0 0 0 0 0 0 ...
## $ articl : num 0 0 0 0 0 0 1 0 0 0 ...
## $ ask : num 0 0 0 0 0 1 0 0 0 0 ...
## $ asset : num 0 0 0 0 0 0 2 0 0 0 ...
## $ assist : num 0 0 0 0 0 0 0 0 0 0 ...
## $ associ : num 0 0 1 0 1 0 0 0 0 0 ...
## $ assum : num 0 0 0 0 0 1 0 0 0 0 ...
## $ attach : num 0 1 0 1 1 0 1 0 3 1 ...
## $ attend : num 0 0 0 0 0 0 0 0 1 0 ...
## $ attent : num 0 0 0 0 0 0 1 0 0 0 ...
## $ attorney : num 0 0 0 0 0 0 0 0 0 0 ...
## $ august : num 0 0 0 0 0 0 0 0 0 0 ...
## $ author : num 0 0 1 0 0 0 0 0 0 0 ...
## $ avail : num 0 0 0 0 0 0 0 0 0 0 ...
## $ averag : num 0 0 0 0 0 0 5 0 0 0 ...
## $ avoid : num 1 0 0 0 1 0 2 0 0 0 ...
## $ awar : num 0 0 0 0 0 0 0 0 0 0 ...
## $ back : num 0 0 0 0 1 1 1 0 0 0 ...
## $ balanc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bank : num 0 0 0 0 0 0 2 0 0 0 ...
## $ base : num 0 0 0 0 1 0 9 0 0 0 ...
## $ basi : num 0 0 0 0 0 0 1 0 0 0 ...
## $ becom : num 1 0 0 0 0 0 4 0 0 0 ...
## $ begin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ believ : num 1 0 0 0 0 0 0 0 0 0 ...
## $ benefit : num 1 0 0 0 0 0 5 0 0 0 ...
## $ best : num 0 0 0 0 0 0 0 0 0 1 ...
## $ better : num 0 0 0 0 0 0 2 0 0 0 ...
## $ bid : num 0 0 0 0 0 0 1 0 0 0 ...
## $ big : num 0 0 0 0 0 1 6 0 0 0 ...
## $ bill : num 0 0 0 0 0 0 0 0 0 0 ...
## $ billion : num 0 0 0 0 0 0 2 0 0 0 ...
## $ bit : num 0 0 0 0 0 1 2 0 0 0 ...
## $ board : num 1 0 0 0 0 0 0 0 0 0 ...
## $ bob : num 0 0 0 0 0 0 0 0 0 0 ...
## $ book : num 0 0 0 0 0 0 0 0 0 0 ...
## $ brian : num 0 1 0 0 0 0 0 0 0 0 ...
## $ brief : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bring : num 1 0 0 0 0 0 2 0 0 0 ...
## $ build : num 0 0 0 0 0 0 7 0 1 0 ...
## [list output truncated]
# Video 5
# Split the data
library(caTools)
set.seed(144)
spl = sample.split(labeledTerms$responsive, 0.7)
train = subset(labeledTerms, spl == TRUE)
test = subset(labeledTerms, spl == FALSE)
# Build a CART model
library(rpart)
library(rpart.plot)
emailCART = rpart(responsive~., data=train, method="class")
prp(emailCART)
# Video 6
# Make predictions on the test set
pred = predict(emailCART, newdata=test)
pred[1:10,]
## 0 1
## character(0) 0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000
pred.prob = pred[,2]
# Compute accuracy
table(test$responsive, pred.prob >= 0.5)
##
## FALSE TRUE
## 0 195 20
## 1 17 25
(195+25)/(195+25+17+20)
## [1] 0.8560311
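The hand-typed accuracy above can also be computed directly from the confusion matrix. A minimal sketch, assuming the four counts printed above are re-entered by hand; `cm` and `accuracy` are hypothetical names that do not appear in the original script.

```r
# Confusion matrix re-entered from the output above
# (rows = actual class, columns = predicted probability >= 0.5).
# `cm` is a hypothetical name used only for this illustration.
cm <- matrix(c(195, 20,
                17, 25),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Working from the matrix avoids retyping the individual counts and makes it harder to mix up the cells.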
# Baseline model accuracy
table(test$responsive)
##
## 0 1
## 215 42
215/(215+42)
## [1] 0.8365759
# Video 7
# ROC curve
# install.packages("ROCR")
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.1.3
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
predROCR = prediction(pred.prob, test$responsive)
perfROCR = performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize=TRUE)
# Compute AUC
performance(predROCR, "auc")@y.values
## [[1]]
## [1] 0.7936323
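One way to interpret the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it with the rank-based (Mann-Whitney) formula on hypothetical toy data. It is an illustration of the idea, not ROCR's implementation, and `auc_by_rank`, `scores` and `labels` are assumed names.

```r
# Hypothetical toy data, not taken from the email data set.
scores <- c(0.10, 0.40, 0.35, 0.80)  # predicted probabilities
labels <- c(0, 0, 1, 1)              # true classes

auc_by_rank <- function(scores, labels) {
  r <- rank(scores)                  # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic for the positive class, scaled to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_by_rank(scores, labels)
```

Here three of the four positive/negative pairs are ordered correctly, so the function returns 0.75.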
# ===================
# Assignment 5.1: DETECTING VANDALISM ON WIKIPEDIA
# PROBLEM 1.1 - BAGS OF WORDS (1 point possible)
# Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame "wiki".
wiki = read.csv("wiki.csv", stringsAsFactors=FALSE)
str(wiki)
## 'data.frame': 3876 obs. of 7 variables:
## $ X.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Vandal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minor : int 1 1 0 1 1 0 0 0 1 0 ...
## $ Loggedin: int 1 1 1 0 1 1 1 1 1 0 ...
## $ Added : chr " represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
## $ Removed : chr " " " talklanguagetalk" " regarded as technologytechnologies human first" " represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
# Convert the "Vandal" column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).
wiki$Vandal = as.factor(wiki$Vandal)
# How many cases of vandalism were detected in the history of this page? ANS 1815
table(wiki$Vandal)
##
## 0 1
## 2061 1815
# PROBLEM 1.2 - BAGS OF WORDS (2 points possible)
# We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning to removing rude words. We'll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:
# 1) Create the corpus for the Added column, and call it "corpusAdded".
corpusAdded = Corpus(VectorSource(wiki$Added))
corpusAdded[[1]]
## <<PlainTextDocument (metadata: 7)>>
## represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationships linguistics regarded writing languages to other listing xmlspacepreservelanguages metaverse formal term philology common each including phonologyphonology often ten list humans affiliation see computer are speechpathologyspeech our what for ways dialects please artificial written body be of quite hypothesis found alone refers by about language profanity study programming priorities rosenfelders technologytechnologies makes or first among useful languagephilosophy one sounds use area create phrases mark their genetic basic families complete but sapirwhorfhypothesissapirwhorf with talklanguagetalk population animals this science up vocal can concepts called at and topics locations as numbers have in pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy most important meaningful a historicallinguisticsorphilologyhistorical semanticssemantics patterns the oral
# IMPORTANT NOTE: If you are using the latest version of the tm package, you will need to run the following line before continuing (it converts corpus to a Plain Text Document). This is a recent change having to do with the tolower function that occurred after this video was recorded.
corpusAdded = tm_map(corpusAdded, PlainTextDocument)
# 2) Remove the English-language stopwords.
corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))
corpusAdded[[1]]
## <<PlainTextDocument (metadata: 7)>>
## represent psycholinguisticspsycholinguistics orthographyorthography help text actions human ethnologue relationships linguistics regarded writing languages listing xmlspacepreservelanguages metaverse formal term philology common including phonologyphonology often ten list humans affiliation see computer speechpathologyspeech ways dialects please artificial written body quite hypothesis found alone refers language profanity study programming priorities rosenfelders technologytechnologies makes first among useful languagephilosophy one sounds use area create phrases mark genetic basic families complete sapirwhorfhypothesissapirwhorf talklanguagetalk population animals science vocal can concepts called topics locations numbers pathology different develop 4000 things ideas grouped complex animal mathematics fairly literature httpwwwzompistcom philosophy important meaningful historicallinguisticsorphilologyhistorical semanticssemantics patterns oral
# 3) Stem the words.
corpusAdded = tm_map(corpusAdded, stemDocument)
corpusAdded[[1]]
## <<PlainTextDocument (metadata: 7)>>
## repres psycholinguisticspsycholinguist orthographyorthographi help text action human ethnologu relationship linguist regard write languag list xmlspacepreservelanguag metavers formal term philolog common includ phonologyphonolog often ten list human affili see comput speechpathologyspeech way dialect pleas artifici written bodi quit hypothesi found alon refer languag profan studi program prioriti rosenfeld technologytechnolog make first among use languagephilosophi one sound use area creat phrase mark genet basic famili complet sapirwhorfhypothesissapirwhorf talklanguagetalk popul anim scienc vocal can concept call topic locat number patholog differ develop 4000 thing idea group complex anim mathemat fair literatur httpwwwzompistcom philosophi import meaning historicallinguisticsorphilologyhistor semanticssemant pattern oral
# 4) Build the DocumentTermMatrix, and call it dtmAdded.
dtmAdded = DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity : 100%
## Maximal term length: 784
## Weighting : term frequency (tf)
# If the code length(stopwords("english")) does not return 174 for you [it does], then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusAdded, removeWords, sw) instead of tm_map(corpusAdded, removeWords, stopwords("english")).
# How many terms appear in dtmAdded? # ANS 6675
# PROBLEM 1.3 - BAGS OF WORDS (1 point possible)
# Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?
sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
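To see what the 0.997 threshold means, the sketch below applies the same idea to a hypothetical toy count matrix in base R. It assumes a term is kept when the number of documents containing it exceeds `nrow * (1 - sparse)`; this mirrors the idea behind `removeSparseTerms` and is not a copy of tm's code. `toy` and `keep_dense_terms` are made-up names.

```r
# Hypothetical toy document-term counts: 5 documents, 3 terms.
toy <- matrix(c(1, 0, 2,
                0, 0, 1,
                3, 0, 0,
                0, 1, 1,
                1, 0, 4),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("common", "rare", "frequent")))

keep_dense_terms <- function(m, sparse) {
  doc_freq <- colSums(m > 0)   # number of documents containing each term
  # Keep terms whose document count exceeds the cutoff (assumed rule).
  m[, doc_freq > nrow(m) * (1 - sparse), drop = FALSE]
}

kept <- colnames(keep_dense_terms(toy, 0.7))
kept

# Cutoff for the wiki data: number of revisions a term must exceed.
3876 * (1 - 0.997)
```

Under that assumption, the cutoff for the 3876 revisions is about 11.6, so a term has to appear in at least 12 revisions to survive.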
# PROBLEM 1.4 - BAGS OF WORDS (2 points possible)
# Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:
wordsAdded = as.data.frame(as.matrix(sparseAdded))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
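Note that `paste` inserts a space between its arguments by default, so the command above produces column names such as "A word". The sketch below contrasts this with `paste0` and `make.names` on hypothetical stems; it is only an illustration and does not change the names used in the rest of this script.

```r
stems <- c("cat", "dog")                 # hypothetical stems for illustration

with_space <- paste("A", stems)          # default sep = " "
no_space   <- paste0("A", stems)         # no separator
valid      <- make.names(with_space)     # syntactically valid R names

with_space
no_space
valid
```

Names without spaces are easier to reference later with `$` or in a formula, which is why `paste0` or `make.names` is often preferred.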
# Now repeat all of the steps we've done so far:
# create a corpus,
corpusRemoved = Corpus(VectorSource(wiki$Removed))
# remove stop words,
corpusRemoved = tm_map(corpusRemoved, removeWords, stopwords("english"))
corpusRemoved[[1]]
## <<PlainTextDocument (metadata: 7)>>
##
# stem the document,
corpusRemoved = tm_map(corpusRemoved, stemDocument)
corpusRemoved[[1]]
## <<PlainTextDocument (metadata: 7)>>
# create a sparse document term matrix, and
dtmRemoved = DocumentTermMatrix(corpusRemoved)
dtmRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 5403)>>
## Non-/sparse entries: 13293/20928735
## Sparsity : 100%
## Maximal term length: 784
## Weighting : term frequency (tf)
sparseRemoved = removeSparseTerms(dtmRemoved, 0.997)
sparseRemoved
## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
# convert it to a data frame to create a Removed bag-of-words data frame, called wordsRemoved, except this time, prepend all of the words with the letter R:
wordsRemoved = as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))
# How many words are in the wordsRemoved data frame?
ncol(wordsRemoved) # ANS 162
## [1] 162
# PROBLEM 1.5 - BAGS OF WORDS (2 points possible)
# Combine the two data frames into a data frame called wikiWords with the following line of code:
wikiWords = cbind(wordsAdded, wordsRemoved)
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3,4,5,6,7,8,9,10,... [list of duplicated row names truncated]
# The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture).
wikiWords$Vandal = wiki$Vandal
# Set the random seed to 123 and then split the data set using sample.split from the "caTools" package to put 70% in the training set.
set.seed(123)
split = sample.split(wikiWords$Vandal, SplitRatio = 0.7)
trainSparse = subset(wikiWords, split==TRUE)
testSparse = subset(wikiWords, split==FALSE)
# What is the accuracy on the test set of a baseline method that always predicts "not vandalism" (the most frequent outcome)?
table(testSparse$Vandal)
##
## 0 1
## 618 545
618/nrow(testSparse) # ANS 0.5313844
## [1] 0.5313844
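The baseline accuracy is simply the share of the most frequent class. A minimal sketch, assuming the class counts printed above are re-entered by hand; `counts`, `baseline_class` and `baseline_accuracy` are hypothetical names.

```r
# Class counts re-entered from table(testSparse$Vandal) above.
counts <- c("0" = 618, "1" = 545)

baseline_class    <- names(which.max(counts))   # most frequent outcome
baseline_accuracy <- max(counts) / sum(counts)  # accuracy of always predicting it

baseline_class
baseline_accuracy
```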
# PROBLEM 1.6 - BAGS OF WORDS (2 points possible)
# Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don't set values for minbucket or cp).
# What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)
VandalCART = rpart(Vandal ~ ., data=trainSparse, method="class")
prp(VandalCART)
# Evaluate the performance of the model
predict.Vandal.CART = predict(VandalCART, newdata=testSparse, type="class")
table(testSparse$Vandal, predict.Vandal.CART)
## predict.Vandal.CART
## 0 1
## 0 618 0
## 1 533 12
# Compute accuracy
(618+12)/nrow(testSparse) # ANS 0.5417025
## [1] 0.5417025
# PROBLEM 1.7 - BAGS OF WORDS (1 point possible)
# Plot the CART tree. How many word stems does the CART model use?
prp(VandalCART) # ANS 2
# PROBLEM 1.8 - BAGS OF WORDS (1 point possible)
# Given the performance of the CART model relative to the baseline, what is the best explanation of these results? ANS Although it beats the baseline, bag of words is not very predictive for this problem. - correct
# PROBLEM 2.1 - PROBLEM-SPECIFIC KNOWLEDGE (1 point possible)
# We weren't able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.
# The key class of words we will use are website addresses. "Website addresses" (also known as URLs - Uniform Resource Locators) consist of two main parts. An example would be "http://www.google.com". The first part is the protocol, which is usually "http" (HyperText Transfer Protocol). The second part is the address of the site, e.g. "www.google.com". We have stripped all punctuation so links to websites appear in the data as one word, e.g. "httpwwwgooglecom". We hypothesize that, because a lot of vandalism seems to involve adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.
# We can search for the presence of a web address in the words added by searching for "http" in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.
# grepl("cat","dogs and cats",fixed=TRUE) # TRUE
# grepl("cat","dogs and rats",fixed=TRUE) # FALSE
# Create a copy of your dataframe from the previous question:
wikiWords2 = wikiWords
# Make a new column in wikiWords2 that is 1 if "http" was in Added:
wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
# Based on this new column, how many revisions added a link?
sum(wikiWords2$HTTP) # ANS 217
## [1] 217
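The same `grepl` and `ifelse` pattern can be tried on a few hand-made strings to see how the indicator behaves. The vector `added` below is hypothetical toy data, not rows from wiki.csv.

```r
# Hypothetical toy "Added" text, with punctuation already stripped.
added <- c("see httpwwwexamplecom for more", "plain text edit", "")

# fixed = TRUE treats "http" as a literal substring, not a regular expression.
has_link <- ifelse(grepl("http", added, fixed = TRUE), 1, 0)

has_link
sum(has_link)   # number of toy revisions that contain a link
```

Only the first string contains "http", so the indicator is 1 for that element and 0 for the others, including the empty string.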
# PROBLEM 2.2 - PROBLEM-SPECIFIC KNOWLEDGE (2 points possible)
# In problem 1.5, you computed a vector that identified the observations to put in the training and testing sets (the assignment calls it "spl"; this script named it "split"). Use that variable (do not recompute it with sample.split) to make new training and testing sets:
wikiTrain2 = subset(wikiWords2, split==TRUE)
wikiTest2 = subset(wikiWords2, split==FALSE)
# Then create a new CART model using this new variable as one of the independent variables (i.e., it was added earlier, and is now an additional independent variable).
# What is the new accuracy of the CART model on the test set, using a threshold of 0.5?
VandalCART2 = rpart(Vandal ~ ., data=wikiTrain2, method="class")
# prp(VandalCART2)
# Evaluate the performance of the model
predict.Vandal.CART2 = predict(VandalCART2, newdata=wikiTest2, type="class")
table(wikiTest2$Vandal, predict.Vandal.CART2)
## predict.Vandal.CART2
## 0 1
## 0 609 9
## 1 488 57
# Compute accuracy
(609+57)/nrow(wikiTest2) # ANS 0.5726569
## [1] 0.5726569
# PROBLEM 2.3 - PROBLEM-SPECIFIC KNOWLEDGE (1 point possible)
# Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).
# Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:
wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))
# What is the average number of words added?
mean(wikiWords2$NumWordsAdded) # ANS 4.050052
## [1] 4.050052
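Because each row of a document-term matrix holds the term counts for one revision, summing a row gives the number of words in that revision. Since the matrices were built after stop-word removal and stemming, stop words are not counted. The sketch below shows this on a hypothetical 3-revision matrix; `toy_dtm` and `num_words` are made-up names.

```r
# Hypothetical toy document-term matrix: 3 revisions, 3 stems.
toy_dtm <- matrix(c(2, 0, 1,
                    0, 0, 0,
                    1, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("rev", 1:3),
                                  c("languag", "human", "studi")))

num_words <- rowSums(toy_dtm)   # total term count per revision
num_words
mean(num_words)                 # average number of words per revision
```

A revision that added nothing, like the second toy row, contributes a zero to the average.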
# PROBLEM 2.4 - PROBLEM-SPECIFIC KNOWLEDGE (2 points possible)
# In problem 1.5, you computed a vector that identified the observations to put in the training and testing sets (the assignment calls it "spl"; this script named it "split"). Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).
# What is the new accuracy of the CART model on the test set? (Note: variables added in 2.3)
wikiTrain3 = subset(wikiWords2, split==TRUE)
wikiTest3 = subset(wikiWords2, split==FALSE)
VandalCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")
# prp(VandalCART3)
# Evaluate the performance of the model
predict.Vandal.CART3 = predict(VandalCART3, newdata=wikiTest3, type="class")
table(wikiTest3$Vandal, predict.Vandal.CART3)
## predict.Vandal.CART3
## 0 1
## 0 514 104
## 1 297 248
# Compute accuracy
(514+248)/nrow(wikiTest3) # ANS 0.6552021
## [1] 0.6552021
# PROBLEM 3.1 - USING NON-TEXTUAL DATA (2 points possible)
# We have two pieces of "metadata" (data about data) that we haven't yet used. Make a copy of wikiWords2, and call it wikiWords3:
wikiWords3 = wikiWords2
# Then add the two original variables Minor and Loggedin to this new data frame:
wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin
# In problem 1.5, you computed a vector that identified the observations to put in the training and testing sets (the assignment calls it "spl"; this script named it "split"). Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.
# Build a CART model using all the training data. What is the accuracy of the model on the test set?
wikiTrain4 = subset(wikiWords3, split==TRUE)
wikiTest4 = subset(wikiWords3, split==FALSE)
VandalCART4 = rpart(Vandal ~ ., data=wikiTrain4, method="class")
# Evaluate the performance of the model
predict.Vandal.CART4 = predict(VandalCART4, newdata=wikiTest4, type="class")
table(wikiTest4$Vandal, predict.Vandal.CART4)
## predict.Vandal.CART4
## 0 1
## 0 595 23
## 1 304 241
# Compute accuracy
(595+241)/nrow(wikiTest4) # ANS 0.7188306
## [1] 0.7188306
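Accuracy alone hides how the errors are distributed between the two classes. A minimal sketch, assuming the confusion matrix printed above is re-entered by hand, that also reports sensitivity and specificity; `cm4` and the other names are hypothetical.

```r
# Confusion matrix re-entered from table(wikiTest4$Vandal, predict.Vandal.CART4).
cm4 <- matrix(c(595,  23,
                304, 241),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy4   <- sum(diag(cm4)) / sum(cm4)
sensitivity <- cm4["1", "1"] / sum(cm4["1", ])  # share of vandalism correctly flagged
specificity <- cm4["0", "0"] / sum(cm4["0", ])  # share of good edits correctly passed

c(accuracy = accuracy4, sensitivity = sensitivity, specificity = specificity)
```

On these counts the model passes nearly all legitimate edits but flags fewer than half of the vandalism cases.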
# PROBLEM 3.2 - USING NON-TEXTUAL DATA (1 point possible)
# There is a substantial difference in the accuracy of the model using the metadata. Is this because we made a more complicated model?
# Plot the CART tree. How many splits are there in the tree?
prp(VandalCART4) # ANS 3
# UNIT 5 Section 2: AUTOMATING REVIEWS IN MEDICINE
# PROBLEM 1.1 - LOADING THE DATA (1 point possible)
# Load clinical_trial.csv into a data frame called trials (remembering to add the argument stringsAsFactors=FALSE), and investigate the data frame with summary() and str().
trials = read.csv("clinical_trial.csv", stringsAsFactors=FALSE)
summary(trials)
## title abstract trial
## Length:1860 Length:1860 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.4392
## 3rd Qu.:1.0000
## Max. :1.0000
str(trials)
## 'data.frame': 1860 obs. of 3 variables:
## $ title : chr "Treatment of Hodgkin's disease and other cancers with 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU; NSC-409962)." "Cell mediated immune status in malignancy--pretherapy and post-therapy assessment." "Neoadjuvant vinorelbine-capecitabine versus docetaxel-doxorubicin-cyclophosphamide in early nonresponsive breast cancer: phase "| __truncated__ "Randomized phase 3 trial of fluorouracil, epirubicin, and cyclophosphamide alone or followed by Paclitaxel for early breast can"| __truncated__ ...
## $ abstract: chr "" "Twenty-eight cases of malignancies of different kinds were studied to assess T-cell activity and population before and after in"| __truncated__ "BACKGROUND: Among breast cancer patients, nonresponse to initial neoadjuvant chemotherapy is associated with unfavorable outcom"| __truncated__ "BACKGROUND: Taxanes are among the most active drugs for the treatment of metastatic breast cancer, and, as a consequence, they "| __truncated__ ...
## $ trial : int 1 0 1 1 1 0 1 0 0 0 ...
# IMPORTANT NOTE: Some students have been getting errors like "invalid multibyte string" when performing certain parts of this homework question. If this is happening to you, use the argument fileEncoding="latin1" when reading in the file with read.csv. This should cause those errors to go away.
# We can use R's string functions to learn more about the titles and abstracts of the located papers. The nchar() function counts the number of characters in a piece of text. Using the nchar() function on the variables in the data frame, answer the following questions:
# How many characters are there in the longest abstract? (Longest here is defined as the abstract with the largest number of characters.)
max(nchar(trials$abstract)) # ANS 3708
## [1] 3708
summary(nchar(trials$abstract)) # Alternative method
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1195 1583 1480 1820 3708
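The behaviour of `nchar` is easy to check on a few short strings. The vector `abstracts` below is hypothetical toy data, not text from clinical_trial.csv.

```r
# Hypothetical toy abstracts for illustration.
abstracts <- c("", "Short abstract.", "A somewhat longer abstract.")

n_chars <- nchar(abstracts)   # number of characters in each string
n_chars
max(n_chars)                  # length of the longest abstract
which.max(n_chars)            # position of the longest abstract
```

`which.max` is handy when you also want to inspect the longest abstract itself, for example with `abstracts[which.max(n_chars)]`.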
# PROBLEM 1.2 - LOADING THE DATA (1 point possible)
# How many search results provided no abstract? (HINT: A search result provided no abstract if the number of characters in the abstract field is zero.)
table(nchar(trials$abstract)) # ANS 112. Easier to see if getting the head() of this result, or
##
## 0 243 273 282 288 290 332 337 345 363 378 420 434 444 447
## 112 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 454 463 464 465 468 469 477 482 483 489 491 492 501 507 511
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 514 519 528 541 543 548 559 563 566 567 576 584 585 588 591
## 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1
## 593 600 601 604 610 615 617 620 627 628 631 634 639 644 645
## 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1
## 647 655 656 660 666 671 673 675 681 685 688 695 700 701 713
## 1 1 2 2 1 1 1 1 1 1 1 3 1 1 1
## 717 720 721 722 723 725 730 733 735 739 740 741 765 773 775
## 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2
## 777 781 782 783 788 795 798 802 805 806 808 811 820 823 825
## 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1
## 829 832 834 836 837 838 840 842 846 851 852 857 860 861 865
## 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1
## 868 871 874 878 882 885 888 891 892 900 902 904 906 909 910
## 1 1 1 1 1 3 2 1 2 1 1 1 1 3 1
## 913 919 920 921 922 924 925 926 927 930 932 937 939 940 942
## 1 2 1 2 1 1 2 1 3 2 1 1 1 1 1
## 948 953 957 958 959 960 962 964 965 968 969 973 974 980 981
## 1 1 1 1 1 2 1 1 2 3 1 1 1 1 1
## 984 987 989 990 991 994 996 1000 1005 1006 1007 1009 1016 1018 1020
## 1 2 3 2 1 1 1 1 1 1 2 2 1 1 2
## 1021 1022 1024 1025 1026 1028 1029 1030 1031 1033 1034 1035 1037 1038 1041
## 1 2 1 1 1 1 2 1 1 1 3 2 1 1 1
## 1045 1047 1049 1050 1052 1054 1063 1064 1065 1066 1067 1069 1070 1071 1073
## 2 4 1 1 2 1 1 1 2 1 1 2 2 1 1
## 1077 1078 1079 1081 1082 1083 1085 1087 1088 1093 1094 1097 1098 1101 1103
## 1 3 1 1 2 2 1 1 1 1 2 1 2 2 1
## 1105 1107 1108 1111 1112 1114 1115 1116 1121 1122 1123 1124 1125 1126 1127
## 1 1 1 1 3 1 2 1 2 1 1 1 1 2 2
## 1128 1133 1134 1136 1137 1138 1140 1145 1148 1149 1159 1161 1165 1167 1169
## 1 1 2 1 2 1 1 1 1 1 1 1 1 2 3
## 1170 1171 1173 1175 1176 1177 1179 1181 1182 1184 1185 1188 1189 1191 1192
## 1 2 1 1 1 1 1 2 2 4 1 1 1 1 3
## 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1207 1208 1209 1210 1211
## 3 2 1 1 1 1 1 1 2 1 2 1 2 2 1
## 1212 1214 1215 1216 1217 1221 1222 1223 1225 1229 1231 1234 1235 1236 1239
## 2 2 3 3 1 2 2 1 2 1 1 1 1 1 2
## 1240 1241 1243 1245 1250 1253 1254 1255 1256 1258 1259 1261 1262 1265 1267
## 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1
## 1268 1269 1270 1271 1272 1273 1274 1276 1277 1278 1279 1281 1284 1288 1289
## 1 2 2 1 1 1 1 1 2 2 1 3 1 1 2
## 1290 1291 1293 1294 1295 1296 1298 1300 1302 1303 1305 1306 1307 1308 1309
## 1 2 1 3 1 1 1 1 1 1 3 4 2 2 1
## 1310 1312 1313 1316 1317 1321 1323 1324 1325 1326 1329 1330 1331 1333 1335
## 1 1 2 2 1 2 2 2 1 2 1 1 4 1 1
## 1336 1337 1339 1340 1341 1342 1343 1344 1346 1347 1348 1349 1350 1351 1352
## 1 3 1 2 1 1 1 3 2 1 2 2 1 3 2
## 1354 1355 1356 1358 1359 1360 1361 1362 1363 1364 1367 1368 1369 1370 1371
## 1 1 1 2 2 1 2 1 1 4 2 2 3 3 1
## 1372 1373 1375 1376 1377 1380 1381 1383 1386 1387 1389 1390 1391 1392 1393
## 2 2 2 1 4 2 1 1 1 1 2 2 1 1 2
## 1394 1396 1397 1399 1400 1401 1403 1404 1405 1407 1408 1409 1410 1412 1413
## 2 3 1 2 1 3 1 1 2 3 2 2 1 3 4
## 1414 1415 1416 1418 1419 1421 1422 1423 1424 1425 1428 1432 1433 1434 1436
## 1 3 3 1 1 3 1 1 4 2 1 2 1 2 1
## 1437 1438 1441 1442 1445 1446 1447 1448 1449 1450 1451 1452 1453 1455 1456
## 1 1 1 1 2 1 2 2 1 3 2 1 2 2 1
## 1459 1460 1461 1463 1466 1467 1468 1470 1472 1474 1476 1477 1478 1479 1480
## 2 1 3 2 2 2 3 1 1 2 2 4 3 3 1
## 1481 1484 1485 1487 1489 1490 1491 1492 1493 1494 1496 1497 1499 1500 1501
## 3 1 1 1 2 1 1 1 1 1 1 2 1 5 3
## 1503 1504 1505 1506 1507 1508 1509 1510 1512 1514 1515 1516 1517 1518 1519
## 2 1 2 3 1 3 3 1 1 1 1 3 3 1 1
## 1520 1521 1522 1524 1525 1526 1528 1529 1530 1532 1533 1534 1535 1536 1537
## 1 2 1 2 3 3 2 1 1 1 2 1 3 3 1
## 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553
## 1 1 3 2 2 5 1 1 1 2 1 3 1 4 1
## 1555 1556 1561 1563 1564 1566 1569 1570 1571 1572 1574 1575 1577 1578 1580
## 4 1 1 1 1 3 6 1 1 2 2 2 3 1 1
## 1581 1582 1583 1584 1585 1586 1588 1589 1591 1593 1594 1596 1597 1598 1600
## 2 2 3 2 1 2 1 2 1 1 2 1 4 1 1
## 1601 1602 1603 1604 1605 1606 1609 1610 1613 1614 1615 1616 1617 1618 1619
## 4 3 2 2 1 3 3 2 4 1 3 2 1 1 1
## 1620 1622 1623 1624 1625 1626 1627 1628 1630 1631 1632 1633 1634 1635 1636
## 1 1 4 2 2 3 1 1 2 1 3 1 3 1 2
## 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651
## 2 2 2 4 1 1 2 3 1 1 1 1 5 3 2
## 1652 1653 1654 1655 1656 1657 1658 1659 1660 1662 1663 1665 1666 1667 1668
## 1 1 1 3 2 1 4 2 2 4 1 1 1 3 3
## 1669 1670 1671 1672 1673 1674 1675 1676 1678 1679 1680 1681 1682 1683 1684
## 2 3 2 3 1 5 3 2 1 1 4 1 3 4 2
## 1685 1686 1687 1688 1689 1690 1692 1696 1697 1698 1699 1700 1702 1703 1704
## 2 1 1 1 1 2 4 2 3 3 1 3 2 1 1
## 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719
## 2 3 2 2 2 1 3 1 2 4 1 1 3 2 2
## 1720 1721 1722 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735
## 1 1 1 3 2 1 2 4 5 3 1 1 3 2 1
## 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1747 1748 1750 1751 1753
## 5 2 6 2 5 2 3 4 3 2 1 1 1 4 4
## 1754 1755 1756 1759 1760 1761 1762 1763 1764 1765 1766 1768 1769 1770 1771
## 4 4 3 3 4 2 4 3 2 1 2 2 2 2 1
## 1772 1773 1774 1775 1776 1777 1779 1780 1781 1782 1783 1784 1786 1787 1788
## 5 3 2 4 2 3 3 1 1 2 2 2 3 3 1
## 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803
## 3 1 4 1 2 1 1 4 1 3 1 3 2 2 4
## 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818
## 2 2 2 3 1 2 1 2 4 5 2 2 2 3 4
## 1819 1820 1821 1822 1824 1825 1827 1828 1829 1830 1831 1832 1833 1834 1837
## 2 2 1 3 3 2 2 3 1 2 1 3 4 2 3
## 1838 1839 1840 1842 1843 1844 1846 1848 1850 1852 1853 1854 1855 1856 1857
## 3 2 2 1 1 2 2 1 3 3 1 1 1 2 2
## 1858 1859 1860 1861 1862 1863 1865 1867 1869 1870 1871 1872 1873 1875 1882
## 2 1 1 3 1 2 1 2 3 1 1 1 2 2 2
## 1883 1884 1885 1887 1888 1890 1891 1892 1893 1895 1896 1897 1898 1899 1901
## 1 1 1 3 3 1 3 2 1 1 1 3 2 2 3
## 1902 1904 1906 1908 1909 1910 1911 1912 1914 1916 1917 1918 1919 1920 1921
## 2 1 1 1 3 2 1 1 2 2 1 1 1 3 1
## 1922 1923 1924 1925 1927 1928 1929 1930 1933 1934 1935 1936 1937 1938 1939
## 3 1 1 3 2 2 3 1 1 1 1 1 2 2 1
## 1940 1941 1942 1943 1944 1945 1946 1947 1948 1951 1953 1954 1955 1956 1957
## 3 1 2 1 4 1 5 1 1 2 1 1 1 1 2
## 1958 1959 1960 1962 1965 1970 1971 1973 1974 1975 1976 1979 1980 1981 1982
## 1 1 1 2 1 1 1 1 2 1 3 1 1 2 1
## 1985 1986 1987 1990 1992 1993 1995 1996 1998 1999 2001 2002 2005 2008 2011
## 1 1 2 2 3 1 1 1 2 3 1 2 1 1 2
## 2012 2013 2015 2016 2018 2019 2020 2024 2025 2028 2029 2030 2031 2037 2039
## 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1
## 2040 2041 2043 2044 2046 2049 2050 2051 2052 2053 2056 2059 2061 2062 2069
## 1 2 1 1 1 1 1 1 1 1 1 4 1 1 1
## 2071 2072 2074 2075 2076 2078 2080 2081 2082 2089 2093 2095 2097 2101 2105
## 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1
## 2106 2108 2110 2117 2120 2125 2126 2127 2131 2133 2134 2137 2138 2141 2142
## 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1
## 2146 2151 2153 2154 2170 2173 2175 2177 2182 2183 2190 2191 2193 2194 2195
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 2196 2206 2207 2208 2212 2215 2219 2221 2223 2228 2229 2232 2235 2236 2238
## 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1
## 2240 2241 2242 2247 2248 2250 2252 2254 2257 2259 2262 2263 2265 2267 2271
## 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1
## 2276 2280 2282 2287 2290 2292 2293 2297 2299 2300 2302 2303 2311 2325 2329
## 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 2334 2336 2345 2348 2349 2358 2360 2367 2376 2380 2382 2387 2389 2394 2404
## 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1
## 2410 2412 2422 2428 2430 2431 2436 2441 2450 2474 2479 2481 2482 2483 2489
## 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1
## 2494 2496 2497 2504 2511 2516 2522 2528 2529 2530 2532 2533 2534 2539 2547
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2550 2555 2584 2602 2606 2607 2633 2636 2682 2685 2700 2709 2712 2722 2723
## 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1
## 2746 2751 2752 2779 2790 2791 2796 2797 2798 2816 2838 2891 2901 2905 2910
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2942 2961 2965 3050 3104 3138 3178 3214 3298 3465 3642 3708
## 1 1 1 1 1 1 1 1 1 1 1 1
table(nchar(trials$abstract)==0) # Alt, gives summary
##
## FALSE TRUE
## 1748 112
sum(nchar(trials$abstract)==0) # Alt
## [1] 112
# PROBLEM 1.3 - LOADING THE DATA (1 point possible)
# Find the observation with the minimum number of characters in the title (the variable "title") out of all of the observations in this dataset.
which.min(nchar(trials$title)) # ANS observation 1258
## [1] 1258
summary(nchar(trials$title)) # ANS 28 characters
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.0 102.0 133.5 138.8 168.0 336.0
nchar(trials$title[which.min(nchar(trials$title))])# Alt for 28 characters
## [1] 28
# What is the text of the title of this article? Include capitalization and punctuation in your response, but don't include the quotes.
trials$title[which.min(nchar(trials$title))] # ANS A decade of letrozole: FACE.
## [1] "A decade of letrozole: FACE."
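# A minimal sketch of the same nchar() / which.min() pattern on toy data. The three titles below are hypothetical and are not taken from the trials dataset.
```r
# Hypothetical toy titles (not from the dataset)
titles = c("A long title about trials", "Short one.", "Medium-length title")
lens = nchar(titles)          # number of characters in each title
shortest = which.min(lens)    # index of the first minimum
titles[shortest]              # text of the shortest title
lens[shortest]                # its length in characters
```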
# PROBLEM 2.1 - PREPARING THE CORPUS (4 points possible)
# Because we have both title and abstract information for trials, we need to build two corpora instead of one. Name them corpusTitle and corpusAbstract.
# Following the commands from lecture, perform the following tasks (you might need to load the "tm" package first if it isn't already loaded). Make sure to perform them in this order.
# 1) Convert the title variable to corpusTitle and the abstract variable to corpusAbstract.
corpusTitle = Corpus(VectorSource(trials$title))
corpusAbstract = Corpus(VectorSource(trials$abstract))
# 2) Convert corpusTitle and corpusAbstract to lowercase.
corpusTitle = tm_map(corpusTitle, tolower)
corpusAbstract = tm_map(corpusAbstract, tolower)
# After performing this step, remember to run the lines:
corpusTitle = tm_map(corpusTitle, PlainTextDocument)
corpusAbstract = tm_map(corpusAbstract, PlainTextDocument)
# 3) Remove the punctuation in corpusTitle and corpusAbstract.
corpusTitle = tm_map(corpusTitle, removePunctuation)
corpusAbstract = tm_map(corpusAbstract, removePunctuation)
# 4) Remove the English language stop words from corpusTitle and corpusAbstract.
corpusTitle = tm_map(corpusTitle, removeWords, stopwords("english"))
corpusAbstract = tm_map(corpusAbstract, removeWords, stopwords("english"))
# 5) Stem the words in corpusTitle and corpusAbstract (each stemming might take a few minutes).
corpusTitle = tm_map(corpusTitle, stemDocument)
corpusAbstract = tm_map(corpusAbstract, stemDocument)
# 6) Build a document term matrix called dtmTitle from corpusTitle and dtmAbstract from corpusAbstract.
dtmTitle = DocumentTermMatrix(corpusTitle)
dtmAbstract = DocumentTermMatrix(corpusAbstract)
# 7) Limit dtmTitle and dtmAbstract to terms with sparseness of at most 95% (aka terms that appear in at least 5% of documents).
dtmTitle = removeSparseTerms(dtmTitle, 0.95)
dtmAbstract = removeSparseTerms(dtmAbstract, 0.95)
# 8) Convert dtmTitle and dtmAbstract to data frames (keep the names dtmTitle and dtmAbstract).
dtmTitle = as.data.frame(as.matrix(dtmTitle))
dtmAbstract = as.data.frame(as.matrix(dtmAbstract))
# If the code length(stopwords("english")) does not return 174 for you [Note: it does], then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusTitle, removeWords, sw) and tm_map(corpusAbstract, removeWords, sw) instead of tm_map(corpusTitle, removeWords, stopwords("english")) and tm_map(corpusAbstract, removeWords, stopwords("english")).
# How many terms remain in dtmTitle after removing sparse terms (aka how many columns does it have)?
ncol(dtmTitle) # ANS 31
## [1] 31
# How many terms remain in dtmAbstract?
ncol(dtmAbstract) # ANS 335
## [1] 335
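# To make the preprocessing steps above concrete, here is a rough base-R sketch of lowercasing, punctuation removal, stop-word removal and term counting. It does not use tm, it skips stemming, and the documents and the four-word stop list are hypothetical, so it only approximates what the tm pipeline does.
```r
# Toy documents and a tiny stop-word list (both hypothetical)
docs = c("The drug was tested; the trial FAILED.",
         "A randomized trial of the drug.")
stop_words = c("the", "a", "of", "was")

clean = tolower(docs)                       # step 2: lowercase
clean = gsub("[[:punct:]]", "", clean)      # step 3: remove punctuation
tokens = strsplit(clean, "[[:space:]]+")    # split each document into words
tokens = lapply(tokens, function(w) w[!(w %in% stop_words)])  # step 4: drop stop words

# step 6: one row per document, one column per term, cells hold counts
terms = sort(unique(unlist(tokens)))
toy_dtm = t(vapply(tokens,
                   function(w) as.integer(table(factor(w, levels = terms))),
                   integer(length(terms))))
colnames(toy_dtm) = terms
toy_dtm
```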
# PROBLEM 2.2 - PREPARING THE CORPUS (1 point possible)
# What is the most likely reason why dtmAbstract has so many more terms than dtmTitle?
# ANS Abstracts tend to have many more words than titles: Because titles are so short, a word needs to be very common to appear in 5% of titles. Because abstracts have many more words, a word can be much less common and still appear in 5% of abstracts. While abstracts may have wider vocabulary, this is a secondary effect. As we saw in the previous subsection, all papers have titles, but not all have abstracts.
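# The 5% rule can be illustrated with a small base-R sketch. The matrix below is hypothetical (40 documents, 3 made-up terms), and the filter is only an analogue of removeSparseTerms(dtm, 0.95), which as far as I know keeps terms found in more than 5% of documents.
```r
# Toy 0/1 matrix: 40 hypothetical documents, 3 hypothetical terms
n_docs = 40
toy = cbind(
  common = rep(c(1, 0), times = c(30, 10)),  # appears in 30 of 40 documents
  medium = rep(c(1, 0), times = c(3, 37)),   # appears in 3 of 40 documents (7.5%)
  rare   = rep(c(1, 0), times = c(1, 39))    # appears in 1 of 40 documents (2.5%)
)
doc_freq = colSums(toy > 0) / n_docs   # share of documents containing each term
keep = doc_freq > 0.05                 # analogue of a 0.95 sparsity cutoff
colnames(toy)[keep]                    # terms that survive the filter
```
# A short title contains few distinct words, so few terms clear this bar; a long abstract gives each term many more chances to appear.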
# PROBLEM 2.3 - PREPARING THE CORPUS (1 point possible)
# What is the most frequent word stem across all the abstracts? Hint: you can use colSums() to compute the frequency of a word across all the abstracts.
which.max(colSums(dtmAbstract)) # ANS patient
## patient
## 212
tail(sort(colSums(dtmAbstract))) # Alternative
## chemotherapi group treatment cancer breast
## 2344 2668 2894 3726 3859
## patient
## 8381
# PROBLEM 3.1 - BUILDING A MODEL (1 point possible)
# We want to combine dtmTitle and dtmAbstract into a single data frame to make predictions. However, some of the variables in these data frames have the same names. To fix this issue, run the following commands:
colnames(dtmTitle) = paste0("T", colnames(dtmTitle))
colnames(dtmAbstract) = paste0("A", colnames(dtmAbstract))
# What was the effect of these functions? ANS Adding the letter T in front of all the title variable names and adding the letter A in front of all the abstract variable names.
# PROBLEM 3.2 - BUILDING A MODEL (1 point possible)
# Using cbind(), combine dtmTitle and dtmAbstract into a single data frame called dtm:
dtm = cbind(dtmTitle, dtmAbstract)
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3,4,5,6,7,8,9,10, ... ,1852,1853,1854,18
# As we did in class, add the dependent variable "trial" to dtm, copying it from the original data frame called trials.
dtm$trial = trials$trial
# How many columns are in this combined data frame? ANS 367
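# A small sketch of why the prefixes matter and where 367 comes from. The two toy data frames are hypothetical stand-ins for dtmTitle and dtmAbstract, and "outcome" is a made-up name for the dependent variable.
```r
# Hypothetical stand-ins; both contain a column called "trial"
toy_title    = data.frame(phase = c(1, 0), trial = c(1, 1))
toy_abstract = data.frame(patient = c(2, 5), trial = c(0, 3), cancer = c(1, 1))

# Prefix the names so that columns from the two sources stay distinct
colnames(toy_title)    = paste0("T", colnames(toy_title))
colnames(toy_abstract) = paste0("A", colnames(toy_abstract))

toy = cbind(toy_title, toy_abstract)
toy$outcome = c(1, 0)   # the dependent variable adds one more column
ncol(toy)               # 2 + 3 + 1

# Same arithmetic with the term counts reported above
31 + 335 + 1
```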
# PROBLEM 3.3 - BUILDING A MODEL (1 point possible)
# Now that we have prepared our data frame, it's time to split it into a training and testing set and to build regression models. Set the random seed to 144 and use the sample.split function from the caTools package to split dtm into data frames named "train" and "test", putting 70% of the data in the training set.
library(caTools) # provides sample.split; harmless if already loaded
set.seed(144)
split = sample.split(dtm$trial, SplitRatio = 0.7)
train = subset(dtm, split==TRUE)
test = subset(dtm, split==FALSE)
# What is the accuracy of the baseline model on the training set? (Remember that the baseline model predicts the most frequent outcome in the training set for all observations.)
max(table(train$trial))/nrow(train) # ANS 0.5606759
## [1] 0.5606759
730/nrow(train) # Alternative
## [1] 0.5606759
730/(730+572) # Alternative
## [1] 0.5606759
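# The baseline calculation can be reproduced without the data by rebuilding the outcome vector from the two counts used above (730 and 572). Treating 0 as the majority label is an assumption based on the confusion matrices shown later.
```r
# Rebuild the training-set outcome counts: 730 zeros, 572 ones
y = c(rep(0, 730), rep(1, 572))
counts = table(y)
baseline_class = names(counts)[which.max(counts)]  # most frequent outcome
baseline_acc = max(counts) / length(y)             # accuracy of always predicting it
baseline_class
baseline_acc
```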
# PROBLEM 3.4 - BUILDING A MODEL (2 points possible)
# Build a CART model called trialCART, using all the independent variables in the training set to train the model, and then plot the CART model. Just use the default parameters to build the model (don't add a minbucket or cp value). Remember to add the method="class" argument, since this is a classification problem.
trialCART = rpart(trial ~ ., data=train, method="class")
prp(trialCART)
# What is the name of the first variable the model split on? ANS Tphase
# PROBLEM 3.5 - BUILDING A MODEL (1 point possible)
# Obtain the training set predictions for the model (do not yet predict on the test set). Extract the predicted probability of a result being a trial (recall that this involves not setting a type argument, and keeping only the second column of the predict output). What is the maximum predicted probability for any result?
predict.trialCART.train = predict(trialCART, newdata=train) # predict() takes newdata, not data
max(predict.trialCART.train[,2]) # ANS 0.8718861
## [1] 0.8718861
# Alt method
# predict.trialCART.train = predict(trialCART)[,2]
# summary(predict.trialCART.train)
# PROBLEM 3.6 - BUILDING A MODEL (1 point possible)
# Without running the analysis, how do you expect the maximum predicted probability to differ in the testing set?
# ANS The maximum predicted probability will likely be exactly the same in the testing set. Because the CART tree assigns the same predicted probability to each leaf node and there are a small number of leaf nodes compared to data points, we expect exactly the same maximum predicted probability.
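# A hedged illustration of the leaf argument, using the kyphosis example data bundled with rpart rather than the trials data. The odd/even split is arbitrary and hypothetical. It shows that test-set probabilities are drawn from the same small set of leaf values as the training-set probabilities; whether the maximum is actually reached on a given test set still depends on some test observation landing in that leaf.
```r
library(rpart)  # recommended package that ships with standard R installations

# Arbitrary, hypothetical train/test division of the bundled example data
train_toy = kyphosis[seq(1, nrow(kyphosis), by = 2), ]
test_toy  = kyphosis[seq(2, nrow(kyphosis), by = 2), ]

fit = rpart(Kyphosis ~ ., data = train_toy, method = "class")
p_train = predict(fit, newdata = train_toy)[, 2]
p_test  = predict(fit, newdata = test_toy)[, 2]

sort(unique(p_train))      # one probability per leaf
all(p_test %in% p_train)   # test predictions reuse the same leaf values
```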
# PROBLEM 3.7 - BUILDING A MODEL (3 points possible)
# For these questions, use a threshold probability of 0.5 to predict that an observation is a clinical trial.
# What is the training set accuracy of the CART model?
table(train$trial, predict.trialCART.train[,2] >= 0.5)
##
## FALSE TRUE
## 0 631 99
## 1 131 441
(631+441)/nrow(train) # Accuracy = 0.8233487
## [1] 0.8233487
# What is the training set sensitivity of the CART model?
441/(441+131) # ANS 0.770979
## [1] 0.770979
# What is the training set specificity of the CART model?
631/(631+99) # ANS 0.8643836
## [1] 0.8643836
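# The three metrics can be derived from one confusion matrix. The sketch below re-enters the training-set counts from the table above by hand; the variable names are illustrative.
```r
# Confusion matrix from the table above: rows = actual, columns = predicted
cm = matrix(c(631, 131, 99, 441), nrow = 2,
            dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE")))
TN = cm["0", "FALSE"]; FP = cm["0", "TRUE"]
FN = cm["1", "FALSE"]; TP = cm["1", "TRUE"]

accuracy    = (TP + TN) / sum(cm)   # share of all predictions that are correct
sensitivity = TP / (TP + FN)        # share of actual trials that are caught
specificity = TN / (TN + FP)        # share of non-trials correctly rejected
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```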
# PROBLEM 4.1 - EVALUATING THE MODEL ON THE TESTING SET (2 points possible)
# Evaluate the CART model on the testing set using the predict function and creating a vector of predicted probabilities predTest.
predTest = predict(trialCART, newdata=test)
max(predTest[,2]) # ANS 0.8718861
## [1] 0.8718861
# What is the testing set accuracy, assuming a probability threshold of 0.5 for predicting that a result is a clinical trial?
table(test$trial, predTest[,2] >= 0.5)
##
## FALSE TRUE
## 0 261 52
## 1 83 162
(261+162)/nrow(test) # Accuracy = 0.7580645
## [1] 0.7580645
# Alt
# predTest = predict(trialCART, newdata=test)[,2]
# table(test$trial, predTest >= 0.5)
# PROBLEM 4.2 - EVALUATING THE MODEL ON THE TESTING SET (2 points possible)
# Using the ROCR package, what is the testing set AUC of the prediction model?
library(ROCR) # provides prediction() and performance(); harmless if already loaded
predTestROCR = prediction(predTest[,2], test$trial)
# Plot not needed in this problem
perfROCR = performance(predTestROCR, "tpr", "fpr")
plot(perfROCR, colorize=TRUE)
# Compute AUC
performance(predTestROCR, "auc")@y.values
## [[1]]
## [1] 0.8371063
# Alt as.numeric(performance(predTestROCR, "auc")@y.values)
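# For intuition about what the AUC measures, here is a base-R sketch on hypothetical toy scores. It uses the rank (Mann-Whitney) formulation rather than ROCR: the AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative.
```r
# Toy scores and labels (hypothetical); 1 = positive, 0 = negative
scores = c(0.10, 0.40, 0.35, 0.80)
labels = c(0, 0, 1, 1)

n_pos = sum(labels == 1)
n_neg = sum(labels == 0)
r = rank(scores)   # average ranks handle tied scores
auc = (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force over all positive/negative pairs (ties count 0.5)
pos = scores[labels == 1]
neg = scores[labels == 0]
pair_wins = outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
mean(pair_wins)
auc
```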
# PART 5: DECISION-MAKER TRADEOFFS
# The decision maker for this problem, a researcher performing a review of the medical literature, would use a model (like the CART one we built here) in the following workflow:
# 1) For all of the papers retrieved in the PubMed Search, predict which papers are clinical trials using the model. This yields some initial Set A of papers predicted to be trials, and some Set B of papers predicted not to be trials. (See the figure below.)
# 2) Then, the decision maker manually reviews all papers in Set A, verifying that each paper meets the study's detailed inclusion criteria (for the purposes of this analysis, we assume this manual review is 100% accurate at identifying whether a paper in Set A is relevant to the study). This yields a more limited set of papers to be included in the study, which would ideally be all papers in the medical literature meeting the detailed inclusion criteria for the study.
# 3) Perform the study-specific analysis, using data extracted from the limited set of papers identified in step 2.
# PROBLEM 5.1 - DECISION-MAKER TRADEOFFS (1 point possible)
# What is the cost associated with the model in Step 1 making a false negative prediction?
# ANS A paper that should have been included in Set A will be missed, affecting the quality of the results of Step 3.
# PROBLEM 5.2 - DECISION-MAKER TRADEOFFS (1 point possible)
# What is the cost associated with the model in Step 1 making a false positive prediction?
# ANS A paper will be mistakenly added to Set A, yielding additional work in Step 2 of the process but not affecting the quality of the results of Step 3.
# PROBLEM 5.3 - DECISION-MAKER TRADEOFFS (1 point possible)
# Given the costs associated with false positives and false negatives, which of the following is most accurate?
# ANS A false negative is more costly than a false positive; the decision maker should use a probability threshold less than 0.5 for the machine learning model.
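# The effect of lowering the threshold can be sketched on toy data. The probabilities, labels, and the helper count_errors below are made up for illustration; they are not output from trialCART.

```r
# Toy predicted probabilities and true labels (1 = clinical trial)
prob   <- c(0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.2, 0.1)
actual <- c(1,   1,   1,    1,    0,   0,   0,   0)

# Hypothetical helper: count both error types at a given threshold
count_errors <- function(threshold) {
  predicted <- prob >= threshold
  c(false_neg = sum(actual == 1 & !predicted),
    false_pos = sum(actual == 0 & predicted))
}

err_50 <- count_errors(0.5)
err_30 <- count_errors(0.3)
rbind(threshold_0.5 = err_50, threshold_0.3 = err_30)
```

# Dropping the threshold trades missed trials (false negatives) for extra manual review (false positives), which is the cheaper error in this workflow.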
# Unit 5 Part 3: SEPARATING SPAM FROM HAM (PART 1)
# PROBLEM 1.1 - LOADING THE DATASET (1 point possible)
# Begin by loading the dataset emails.csv into a data frame called emails. Remember to pass the stringsAsFactors=FALSE option when loading the data.
emails = read.csv("emails.csv", stringsAsFactors=FALSE)
# How many emails are in the dataset?
summary(emails)
## text spam
## Length:5728 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2388
## 3rd Qu.:0.0000
## Max. :1.0000
str(emails) # ANS 5728
## 'data.frame': 5728 obs. of 2 variables:
## $ text: chr "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqg"| __truncated__ "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ra"| __truncated__ "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved for a $ 454 , 1"| __truncated__ "Subject: 4 color printing special request additional information now ! click here click here for a printable version of our o"| __truncated__ ...
## $ spam: int 1 1 1 1 1 1 1 1 1 1 ...
# Alt nrow(emails)
# PROBLEM 1.2 - LOADING THE DATASET (1 point possible)
# How many of the emails are spam?
table(emails$spam)
##
## 0 1
## 4360 1368
sum(emails$spam==1) # ANS 1368
## [1] 1368
# PROBLEM 1.3 - LOADING THE DATASET (1 point possible)
# Which word appears at the beginning of every email in the dataset? Respond as a lower-case word with punctuation removed.
head(emails$text) # ANS subject
## [1] "Subject: naturally irresistible your corporate identity lt is really hard to recollect a company : the market is full of suqgestions and the information isoverwhelminq ; but a good catchy logo , stylish statlonery and outstanding website will make the task much easier . we do not promise that havinq ordered a iogo your company will automaticaily become a world ieader : it isguite ciear that without good products , effective business organization and practicable aim it will be hotat nowadays market ; but we do promise that your marketing efforts will become much more effective . here is the list of clear benefits : creativeness : hand - made , original logos , specially done to reflect your distinctive company image . convenience : logo and stationery are provided in all formats ; easy - to - use content management system letsyou change your website content and even its structure . promptness : you will see logo drafts within three business days . affordability : your marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction guaranteed : we provide unlimited amount of changes with no extra fees for you to be surethat you will love the result of this collaboration . have a look at our portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
## [2] "Subject: the stock trading gunslinger fanny is merrill but muzo not colza attainder and penultimate like esmark perspicuous ramble is segovia not group try slung kansas tanzania yes chameleon or continuant clothesman no libretto is chesapeake but tight not waterway herald and hawthorn like chisel morristown superior is deoxyribonucleic not clockwork try hall incredible mcdougall yes hepburn or einsteinian earmark no sapling is boar but duane not plain palfrey and inflexible like huzzah pepperoni bedtime is nameable not attire try edt chronography optima yes pirogue or diffusion albeit no "
## [3] "Subject: unbelievable new homes made easy im wanting to show you this homeowner you have been pre - approved for a $ 454 , 169 home loan at a 3 . 72 fixed rate . this offer is being extended to you unconditionally and your credit is in no way a factor . to take advantage of this limited time opportunity all we ask is that you visit our website and complete the 1 minute post approval form look foward to hearing from you , dorcas pittman"
## [4] "Subject: 4 color printing special request additional information now ! click here click here for a printable version of our order form ( pdf format ) phone : ( 626 ) 338 - 8090 fax : ( 626 ) 338 - 8102 e - mail : ramsey @ goldengraphix . com request additional information now ! click here click here for a printable version of our order form ( pdf format ) golden graphix & printing 5110 azusa canyon rd . irwindale , ca 91706 this e - mail message is an advertisement and / or solicitation . "
## [5] "Subject: do not have money , get software cds from here ! software compatibility . . . . ain ' t it great ? grow old along with me the best is yet to be . all tradgedies are finish ' d by death . all comedies are ended by marriage ."
## [6] "Subject: great nnews hello , welcome to medzonline sh groundsel op we are pleased to introduce ourselves as one of the ieading online phar felicitation maceuticai shops . helter v shakedown r a cosmopolitan l l blister l l bestow ag ac tosher l is coadjutor va confidant um andmanyother . - sav inexpiable e over 75 % - total confide leisure ntiaiity - worldwide s polite hlpplng - ov allusion er 5 miilion customers in 150 countries have devitalize a nice day !"
# PROBLEM 1.4 - LOADING THE DATASET (1 point possible)
# Could a spam classifier potentially benefit from including the frequency of the word that appears in every email?
# ANS Yes -- the number of times the word appears might help us differentiate spam from ham.
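# A small base-R sketch of counting how often one word occurs in each message. The two toy messages are invented; only the word "subject" comes from the dataset.

```r
# Toy messages, not taken from emails.csv
texts <- c("Subject: re: subject line changed", "Subject: hello")

# Count non-overlapping occurrences of "subject" in each lower-cased text
lowered <- tolower(texts)
hits <- gregexpr("subject", lowered, fixed = TRUE)
subject_count <- lengths(regmatches(lowered, hits))
subject_count
```

# A reply chain repeats the word, so the count can vary between messages even though every message contains it at least once.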
# PROBLEM 1.5 - LOADING THE DATASET (1 point possible)
# The nchar() function counts the number of characters in a piece of text. How many characters are in the longest email in the dataset (where longest is measured in terms of the maximum number of characters)?
nchar(emails$text[which.max(nchar(emails$text))]) # ANS 43952
## [1] 43952
# Alt max(nchar(emails$text))
# PROBLEM 1.6 - LOADING THE DATASET (1 point possible)
# Which row contains the shortest email in the dataset? (Just like in the previous problem, shortest is measured in terms of the fewest number of characters.)
which.min(nchar(emails$text)) # ANS 1992
## [1] 1992
# ALT which(nchar(emails$text) == min(nchar(emails$text)) )
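# One difference between the two approaches above is worth seeing on toy data: which.min() and which.max() return only the first matching row, while which(x == min(x)) returns every tied row. The three strings below are invented.

```r
texts <- c("Subject: hi", "Subject: a longer message", "Subject: yo")
len <- nchar(texts)

longest_row   <- which.max(len)          # first row with the maximum length
shortest_row  <- which.min(len)          # first row with the minimum length
shortest_rows <- which(len == min(len))  # every row tied for the minimum

list(longest_row = longest_row, shortest_row = shortest_row,
     shortest_rows = shortest_rows)
```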
# PROBLEM 2.1 - PREPARING THE CORPUS (2 points possible)
# Follow the standard steps to build and pre-process the corpus:
# 1) Build a new corpus variable called corpus.
corpus = Corpus(VectorSource(emails$text))
# 2) Using tm_map, convert the text to lowercase.
corpus = tm_map(corpus, tolower)
# After performing this step, remember to run the line:
corpus = tm_map(corpus, PlainTextDocument)
# 3) Using tm_map, remove all punctuation from the corpus.
corpus = tm_map(corpus, removePunctuation)
# 4) Using tm_map, remove all English stopwords from the corpus.
corpus = tm_map(corpus, removeWords, stopwords("english"))
# 5) Using tm_map, stem the words in the corpus.
corpus = tm_map(corpus, stemDocument)
# 6) Build a document term matrix from the corpus, called dtm.
dtm = DocumentTermMatrix(corpus)
# If the code length(stopwords("english")) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpus, removeWords, sw) instead of tm_map(corpus, removeWords, stopwords("english")).
length(stopwords("english")) # checks out
## [1] 174
# How many terms are in dtm?
dtm # 28687 terms
## <<DocumentTermMatrix (documents: 5728, terms: 28687)>>
## Non-/sparse entries: 481719/163837417
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
ncol(dtm) # Alternative
## [1] 28687
# PROBLEM 2.2 - PREPARING THE CORPUS (1 point possible)
# To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents, and store this result as spdtm (don't overwrite dtm, because we will use it in a later step of this homework). How many terms are in spdtm?
spdtm = removeSparseTerms(dtm, 0.95)
spdtm # ANS 330
## <<DocumentTermMatrix (documents: 5728, terms: 330)>>
## Non-/sparse entries: 213551/1676689
## Sparsity : 89%
## Maximal term length: 10
## Weighting : term frequency (tf)
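# The idea behind removeSparseTerms(dtm, 0.95) can be approximated in base R: keep a term only if it occurs in more than 5% of documents. This is a sketch on an invented count matrix, not tm's implementation.

```r
# Toy count matrix: 40 documents (rows) by 3 terms (columns)
m <- matrix(0, nrow = 40, ncol = 3,
            dimnames = list(NULL, c("subject", "enron", "rare")))
m[, "subject"]  <- 1           # appears in every document
m[1:3, "enron"] <- c(2, 1, 5)  # appears in 3 of 40 documents (7.5%)
m[7, "rare"]    <- 1           # appears in 1 of 40 documents (2.5%)

doc_share <- colMeans(m > 0)   # share of documents containing each term
kept <- m[, doc_share > 0.05, drop = FALSE]
colnames(kept)
```

# Note that the filter looks at how many documents contain a term, not at the term's total count.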
# PROBLEM 2.3 - PREPARING THE CORPUS (2 points possible)
# Build a data frame called emailsSparse from spdtm, and use the make.names function to make the variable names of emailsSparse valid.
# Convert to a data frame
emailsSparse = as.data.frame(as.matrix(spdtm))
# Make all variable names R-friendly
colnames(emailsSparse) = make.names(colnames(emailsSparse))
str(emailsSparse)
## 'data.frame': 5728 obs. of 330 variables:
## $ X000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ X2000 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X2001 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ X713 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X853 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ abl : num 0 0 0 0 0 0 0 0 0 0 ...
## $ access : num 0 0 0 0 0 0 1 0 0 0 ...
## $ account : num 0 0 0 0 0 0 1 0 0 0 ...
## $ addit : num 0 0 0 2 0 0 1 0 0 0 ...
## $ address : num 0 0 0 0 0 0 0 0 0 0 ...
## $ allow : num 0 0 0 0 0 0 1 0 0 0 ...
## $ alreadi : num 0 0 0 0 0 0 1 0 0 0 ...
## $ also : num 0 0 0 0 0 0 1 0 0 0 ...
## $ analysi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ anoth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ applic : num 0 0 0 0 0 0 3 0 0 0 ...
## $ appreci : num 0 0 0 0 0 0 0 0 0 0 ...
## $ approv : num 0 0 2 0 0 0 0 0 0 0 ...
## $ april : num 0 0 0 0 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 1 0 0 0 ...
## $ arrang : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ask : num 0 0 1 0 0 0 0 0 0 0 ...
## $ assist : num 0 0 0 0 0 0 3 0 0 0 ...
## $ associ : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attach : num 0 0 0 0 0 0 0 0 0 0 ...
## $ attend : num 0 0 0 0 0 0 0 0 0 0 ...
## $ avail : num 0 0 0 0 0 0 0 0 0 0 ...
## $ back : num 0 0 0 0 0 0 0 0 0 0 ...
## $ base : num 0 0 0 0 0 0 3 0 2 0 ...
## $ begin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ believ : num 0 0 0 0 0 0 2 0 0 0 ...
## $ best : num 0 0 0 0 1 0 0 0 0 0 ...
## $ better : num 0 0 0 0 0 0 0 0 0 0 ...
## $ book : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bring : num 0 0 0 0 0 0 0 0 0 0 ...
## $ busi : num 2 0 0 0 0 0 3 0 2 0 ...
## $ buy : num 0 0 0 0 0 0 1 1 0 1 ...
## $ call : num 0 0 0 0 0 0 0 0 0 0 ...
## $ can : num 0 0 0 0 0 0 11 1 0 1 ...
## $ case : num 0 0 0 0 0 0 0 0 0 0 ...
## $ chang : num 2 0 0 0 0 0 1 0 0 0 ...
## $ check : num 0 0 0 0 0 0 0 0 0 0 ...
## $ click : num 0 0 0 4 0 0 0 0 0 0 ...
## $ com : num 0 0 0 1 0 0 0 0 1 0 ...
## $ come : num 0 0 0 0 0 0 0 0 0 0 ...
## $ comment : num 0 0 0 0 0 0 0 0 0 0 ...
## $ communic : num 0 0 0 0 0 0 0 0 0 0 ...
## $ compani : num 3 0 0 0 0 0 16 0 0 0 ...
## $ complet : num 0 0 1 0 0 0 1 0 0 0 ...
## $ confer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ confirm : num 0 0 0 0 0 0 0 0 0 0 ...
## $ contact : num 0 0 0 0 0 0 0 0 0 0 ...
## $ continu : num 0 1 0 0 0 0 0 0 0 0 ...
## $ contract : num 0 0 0 0 0 0 0 0 0 0 ...
## $ copi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ corp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ corpor : num 1 0 0 0 0 0 0 0 0 0 ...
## $ cost : num 0 0 0 0 0 0 1 0 0 0 ...
## $ cours : num 0 0 0 0 0 0 0 0 0 0 ...
## $ creat : num 0 0 0 0 0 0 1 0 0 0 ...
## $ credit : num 0 0 1 0 0 0 0 0 0 0 ...
## $ crenshaw : num 0 0 0 0 0 0 0 0 0 0 ...
## $ current : num 0 0 0 0 0 0 3 0 0 0 ...
## $ custom : num 0 0 0 0 0 1 4 0 0 0 ...
## $ data : num 0 0 0 0 0 0 0 0 0 0 ...
## $ date : num 0 0 0 0 0 0 0 0 0 0 ...
## $ day : num 1 0 0 0 0 1 0 0 0 0 ...
## $ deal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dear : num 0 0 0 0 0 0 0 0 0 0 ...
## $ depart : num 0 0 0 0 0 0 0 0 0 0 ...
## $ deriv : num 0 0 0 0 0 0 0 0 0 0 ...
## $ design : num 0 0 0 0 0 0 1 0 0 0 ...
## $ detail : num 0 0 0 0 0 0 0 0 0 0 ...
## $ develop : num 0 0 0 0 0 0 1 0 0 0 ...
## $ differ : num 0 0 0 0 0 0 2 0 0 0 ...
## $ direct : num 0 0 0 0 0 0 0 0 0 0 ...
## $ director : num 0 0 0 0 0 0 1 0 0 0 ...
## $ discuss : num 0 0 0 0 0 0 1 0 0 0 ...
## $ doc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ don : num 0 0 0 0 0 0 0 0 0 0 ...
## $ done : num 1 0 0 0 0 0 0 0 0 0 ...
## $ due : num 0 0 0 0 0 0 2 0 0 0 ...
## $ ect : num 0 0 0 0 0 0 0 0 0 0 ...
## $ edu : num 0 0 0 0 0 0 0 0 0 0 ...
## $ effect : num 2 0 0 0 0 0 0 1 0 1 ...
## $ effort : num 1 0 0 0 0 0 1 0 0 0 ...
## $ either : num 0 0 0 0 0 0 0 0 0 0 ...
## $ email : num 0 0 0 0 0 0 2 0 0 0 ...
## $ end : num 0 0 0 0 1 0 0 0 0 0 ...
## $ energi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ engin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ enron : num 0 0 0 0 0 0 0 0 0 0 ...
## $ etc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ even : num 1 0 0 0 0 0 1 1 0 1 ...
## $ event : num 0 0 0 0 0 0 4 0 0 0 ...
## $ expect : num 0 0 0 0 0 0 4 0 0 0 ...
## $ experi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fax : num 0 0 0 1 0 0 0 0 0 0 ...
## $ feel : num 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
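# The X-prefixed names in the listing above come from make.names(). A short sketch with four sample terms shows the two rewrites it applies: names that start with a digit get an "X" prefix, and reserved words get a trailing dot.

```r
raw_terms  <- c("000", "2000", "next", "enron")
safe_terms <- make.names(raw_terms)
safe_terms
```

# This is also why the stem "next" shows up as "next." in the frequency tables later in this section.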
# colSums() is an R function that returns the sum of values for each variable in our data frame. Our data frame contains the number of times each word stem (columns) appeared in each email (rows). Therefore, colSums(emailsSparse) returns the number of times a word stem appeared across all the emails in the dataset. What is the word stem that shows up most frequently across all the emails in the dataset? Hint: think about how you can use sort() or which.max() to pick out the maximum frequency.
which.max((sapply(emailsSparse,sum))) # enron
## enron
## 92
sort(colSums(emailsSparse)) # Alt #1
## vkamin begin either done sorri lot
## 301 317 318 337 343 348
## mention thought bring idea better immedi
## 355 367 374 378 383 385
## without mean write happi repli life
## 389 390 390 396 397 400
## experi involv specif arrang creat read
## 405 405 407 410 413 413
## wish open realli link say respond
## 414 416 417 421 423 430
## sever keep etc anoth run info
## 430 431 434 435 437 438
## togeth short sincer buy due alreadi
## 438 439 441 442 445 446
## line allow recent special given believ
## 448 450 451 451 453 456
## design put remov X853 wednesday type
## 457 458 460 462 464 466
## public full hear join effect effort
## 468 469 469 469 471 473
## tuesday robert locat check area final
## 474 482 485 488 489 490
## increas soon analysi sure deal return
## 491 492 495 495 498 509
## place onlin success sinc understand still
## 516 518 519 521 521 523
## import comment confirm hello long thing
## 530 531 532 534 534 535
## point appreci feel howev member hour
## 536 541 543 545 545 548
## net continu event expect suggest unit
## 548 552 552 554 554 554
## resourc case version corpor applic engin
## 556 561 564 565 567 571
## part attend thursday might morn abl
## 571 573 575 577 586 590
## assist differ intern updat move mark
## 598 598 606 606 612 613
## depart even made internet high cours
## 621 622 622 623 624 626
## contract gibner end right per invit
## 629 633 635 639 642 647
## approv real monday result school kevin
## 648 648 649 655 655 656
## direct home detail tri form problem
## 657 660 661 661 664 666
## web doc deriv don april note
## 668 675 676 676 682 688
## relat websit juli director complet rate
## 694 700 701 705 707 717
## valu futur student set within requir
## 721 722 726 727 732 736
## softwar book mani person click file
## 739 756 758 767 769 770
## addit money associ particip term access
## 774 776 777 782 786 789
## custom possibl copi oper cost respons
## 796 796 797 820 821 824
## today account base great dear london
## 828 829 837 837 838 843
## friday support secur hope much back
## 854 854 857 858 861 864
## way find invest ask start shall
## 864 867 867 871 880 884
## origin come plan financi two site
## 892 903 904 909 911 913
## opportun team first resum issu data
## 918 926 929 933 944 955
## month peopl credit industri process review
## 958 958 960 970 975 976
## talk last phone X000 chang fax
## 981 998 1001 1007 1035 1038
## john current stinson give univers offic
## 1042 1044 1051 1055 1059 1068
## gas schedul financ state name X713
## 1070 1071 1073 1086 1089 1097
## good posit crenshaw system well sent
## 1097 1104 1115 1118 1125 1126
## visit free next. avail question address
## 1126 1141 1145 1152 1152 1154
## offer attach number date product order
## 1171 1176 1182 1187 1197 1210
## think includ report best confer now
## 1216 1238 1279 1291 1297 1300
## www discuss interview servic communic request
## 1323 1326 1333 1337 1343 1344
## just take trade send provid list
## 1354 1361 1366 1379 1405 1410
## help program option want project contact
## 1430 1438 1488 1488 1522 1543
## present follow receiv see houston http
## 1543 1552 1557 1567 1582 1609
## edu call shirley corp week interest
## 1627 1687 1689 1692 1758 1814
## day also develop make year let
## 1860 1864 1882 1884 1890 1963
## messag look regard email one power
## 1983 2003 2045 2066 2108 2117
## energi model risk mail new compani
## 2179 2199 2267 2269 2281 2290
## busi need use like get may
## 2313 2328 2330 2352 2462 2465
## manag group know meet price inform
## 2600 2604 2614 2623 2694 2701
## work market research X2001 time forward
## 2708 2750 2820 3089 3145 3161
## thank can kaminski X2000 pleas com
## 3730 4257 4801 4967 5113 5443
## hou will vinc subject ect enron
## 5577 8252 8532 10202 11427 13388
which.max(colSums(emailsSparse)) # Alt #2
## enron
## 92
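# Reading the which.max() output: the name is the stem and the number is its column position, not its frequency (92 above is the position of enron; its count, 13388, appears in the sort() output). A toy data frame with invented counts makes the distinction visible.

```r
toy <- data.frame(enron = c(3, 0, 2), will = c(1, 1, 0), money = c(0, 4, 0))

totals <- colSums(toy)    # total count per stem
top <- which.max(totals)  # named integer: name = stem, value = column position

names(top)                # the most frequent stem
totals[top]               # its total count
sort(totals, decreasing = TRUE)
```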
# PROBLEM 2.4 - PREPARING THE CORPUS (1 point possible)
# Add a variable called "spam" to emailsSparse containing the email spam labels. You can do this by copying over the "spam" variable from the original data frame (remember how we did this in the Twitter lecture).
which(colnames(emailsSparse)=="spam")
## integer(0)
emailsSparse$spam = emails$spam
# How many word stems appear at least 5000 times in the ham emails in the dataset? Hint: in this and the next question, remember not to count the dependent variable we just added.
ham.mails = subset(emailsSparse,emailsSparse$spam==0)
sum(colSums(ham.mails)>=5000) # ANS 6
## [1] 6
sort(colSums(subset(emailsSparse, spam == 0))) # Alternative
## spam life remov money onlin without
## 0 80 103 114 173 191
## websit click special wish repli buy
## 194 217 226 229 239 243
## net link immedi done mean design
## 243 247 249 254 259 261
## lot effect info either read write
## 268 270 273 279 279 286
## line begin sorri success involv creat
## 289 291 293 293 294 299
## softwar better vkamin say keep bring
## 299 301 301 305 306 311
## believ full increas realli mention thought
## 313 317 320 324 325 325
## idea invest secur specif sever experi
## 327 327 337 338 340 346
## thing allow check due type happi
## 347 348 351 351 352 354
## return expect short effort open internet
## 355 356 357 358 360 361
## sincer public recent anoth alreadi home
## 361 364 368 369 372 375
## made respond given etc put within
## 380 382 383 385 385 386
## place right version hello sure area
## 388 390 390 395 396 397
## run arrang account join hour locat
## 398 399 401 403 404 406
## togeth engin import per corpor high
## 406 411 411 412 414 416
## result hear final deal applic even
## 418 420 422 423 428 429
## web custom soon long sinc futur
## 430 433 435 436 439 440
## member X000 event don part feel
## 446 447 447 450 450 453
## tuesday wednesday still unit site X853
## 454 456 457 457 458 461
## continu understand resourc robert analysi form
## 464 464 466 466 468 468
## point assist confirm differ intern might
## 474 475 485 489 489 490
## real case howev comment abl complet
## 490 492 496 505 515 515
## rate appreci tri move updat approv
## 516 518 521 526 527 533
## suggest free contract detail morn end
## 533 535 544 546 546 550
## mani attend thursday direct requir cours
## 550 558 558 561 562 567
## person relat depart today start way
## 569 573 575 577 580 586
## mark valu problem peopl note school
## 588 590 593 599 600 607
## invit access term juli monday gibner
## 614 617 625 630 630 633
## base director offer cost addit kevin
## 635 640 643 646 648 654
## great set file find much oper
## 655 658 659 665 669 669
## order deriv doc april book address
## 669 673 673 677 680 693
## copi financi month student respons possibl
## 700 702 709 710 711 712
## associ particip now first industri dear
## 715 717 725 726 731 734
## support plan back name come opportun
## 734 738 739 745 748 760
## report product two origin ask credit
## 772 776 787 796 797 798
## state system process hope london just
## 806 816 826 828 828 830
## receiv chang review current shall friday
## 830 831 834 841 844 847
## team phone issu data avail last
## 850 858 865 868 872 874
## good give www gas list posit
## 876 883 897 905 907 917
## visit includ resum best offic servic
## 920 924 928 933 935 942
## talk number well fax provid sent
## 943 951 961 963 970 971
## next. send http john univers financ
## 975 986 1009 1022 1025 1038
## stinson schedul take date want question
## 1051 1054 1057 1060 1068 1069
## program think X713 crenshaw attach trade
## 1080 1084 1097 1115 1155 1167
## help email compani request see communic
## 1168 1201 1225 1227 1238 1251
## confer discuss make contact follow interview
## 1264 1270 1281 1301 1308 1320
## project mail present busi interest option
## 1328 1352 1397 1416 1429 1432
## day call one year week messag
## 1440 1497 1516 1523 1527 1538
## houston also look edu corp shirley
## 1577 1604 1607 1620 1643 1687
## develop get new use let regard
## 1691 1768 1777 1784 1856 1859
## inform need power may like risk
## 1883 1890 1972 1976 1980 2097
## energi market model price work manag
## 2124 2150 2170 2191 2293 2334
## know group meet time research forward
## 2345 2474 2544 2552 2752 2952
## X2001 can thank com pleas kaminski
## 3060 3426 3558 4444 4494 4801
## X2000 hou will vinc subject ect
## 4935 5569 6802 8531 8625 11417
## enron
## 13388
# PROBLEM 2.5 - PREPARING THE CORPUS (1 point possible)
# How many word stems appear at least 1000 times in the spam emails in the dataset?
spam.mails = subset(emailsSparse, spam==1)
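tail(sort(colSums(spam.mails) >= 1000)) # ANS 3 (compani, subject, will); the fourth TRUE is the spam label column itself and must not be counted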
## www year compani subject will spam
## FALSE FALSE TRUE TRUE TRUE TRUE
tail(sort(colSums(spam.mails)))
## mail com compani spam will subject
## 917 999 1065 1368 1450 1577
sort(colSums(subset(emailsSparse, spam == 1))) # Alternative
## X713 crenshaw enron gibner kaminski stinson
## 0 0 0 0 0 0
## vkamin X853 vinc doc kevin shirley
## 0 1 1 2 2 2
## deriv april houston resum edu friday
## 3 5 5 5 7 7
## hou wednesday ect arrang interview attend
## 8 8 10 11 13 15
## london robert student schedul thursday monday
## 15 16 16 17 17 19
## john tuesday attach suggest appreci mark
## 20 20 21 21 23 25
## begin comment analysi X2001 model hope
## 26 26 27 29 29 30
## mention X2000 togeth confer invit univers
## 30 32 32 33 33 34
## financ talk either run morn shall
## 35 38 39 39 40 40
## happi thought depart confirm respond school
## 42 42 46 47 48 48
## corp etc hear howev sorri idea
## 49 49 49 49 50 51
## energi discuss open option soon understand
## 55 56 56 56 57 57
## cours experi associ point bring director
## 59 59 62 62 63 65
## particip anoth join still final research
## 65 66 66 66 68 68
## case set specif given juli problem
## 69 69 69 70 71 73
## put alreadi ask abl deal fax
## 73 74 74 75 75 75
## book team issu locat meet updat
## 76 76 79 79 79 79
## lot sincer better short sinc done
## 80 80 82 82 82 83
## question recent possibl contract end move
## 83 83 84 85 85 86
## data might continu note feel resourc
## 87 87 88 88 90 90
## sever area communic realli due direct
## 90 92 92 93 94 96
## origin copi unit long member sure
## 96 97 97 98 99 99
## allow dear public write event let
## 102 104 104 104 105 107
## differ file involv respons creat type
## 109 111 111 113 114 114
## approv detail effort intern request say
## 115 115 115 117 117 118
## import support part relat assist last
## 119 120 121 121 123 124
## two back keep addit date place
## 124 125 125 126 127 128
## group mean valu think offic read
## 130 131 131 132 133 134
## immedi check applic hello tri review
## 136 137 139 139 140 142
## believ phone hour power present process
## 143 143 144 145 146 149
## corpor oper full return come sent
## 151 151 152 154 155 155
## opportun real repli line engin term
## 158 158 158 159 160 161
## credit well gas info plan next.
## 162 164 165 165 166 170
## risk increas access give thank link
## 170 171 172 172 172 174
## requir version cost great wish regard
## 174 174 175 182 185 186
## posit thing call develop complet much
## 187 188 190 191 192 192
## even project design form expect person
## 193 194 196 196 198 198
## without buy trade effect rate base
## 198 199 199 201 201 202
## find current first chang visit financi
## 202 203 203 204 206 207
## high mani forward good special don
## 208 208 209 221 225 226
## success per number week result web
## 226 230 231 231 237 238
## industri contact made follow month right
## 239 242 242 244 249 249
## today also help internet manag know
## 251 260 262 262 266 269
## way avail state futur home start
## 278 280 280 282 285 300
## system take net includ life see
## 302 304 305 314 320 329
## name onlin within remov best program
## 344 345 346 357 358 358
## peopl custom year like interest send
## 359 363 367 372 385 393
## servic look work day want product
## 395 396 415 420 420 421
## www account provid need softwar messag
## 426 428 435 438 440 445
## site address may list price new
## 455 461 489 503 503 504
## websit report secur just offer invest
## 506 507 520 524 528 540
## order use click X000 now one
## 541 546 552 560 575 592
## time http market make free pleas
## 593 600 600 603 606 619
## money get receiv inform can email
## 662 694 727 818 831 865
## busi mail com compani spam will
## 897 917 999 1065 1368 1450
## subject
## 1577
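# To avoid counting the dependent variable by accident, the label column can be dropped before summing. A sketch on an invented data frame; the names toy, word_cols, n_with_label and n_words are illustrative.

```r
toy <- data.frame(will = c(3, 2, 0, 1), subject = c(1, 1, 1, 1),
                  enron = c(0, 0, 4, 5), spam = c(1, 1, 0, 0))
spam_rows <- subset(toy, spam == 1)

# Counting over all columns also counts the label column
n_with_label <- sum(colSums(spam_rows) >= 2)

# Restrict to the word-stem columns first
word_cols <- setdiff(names(spam_rows), "spam")
n_words <- sum(colSums(spam_rows[, word_cols]) >= 2)

c(with_label = n_with_label, words_only = n_words)
```

# In the ham subset the label column sums to 0, so it never passes the threshold there; in the spam subset it sums to the number of spam emails, which is why the hint matters for this problem.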
# PROBLEM 2.6 - PREPARING THE CORPUS (1 point possible)
# The lists of most common words are significantly different between the spam and ham emails. What does this likely imply?
# ANS The frequencies of these most common words are likely to help differentiate between spam and ham.
# PROBLEM 2.7 - PREPARING THE CORPUS (1 point possible)
# Several of the most common word stems from the ham documents, such as "enron", "hou" (short for Houston), "vinc" (the word stem of "Vince") and "kaminski", are likely specific to Vincent Kaminski's inbox. What does this mean about the applicability of the text analytics models we will train for the spam filtering problem?
# ANS The models we build are personalized, and would need to be further tested before being used as a spam filter for another person.
# PROBLEM 3.1 - BUILDING MACHINE LEARNING MODELS (3 points possible)
# First, convert the dependent variable to a factor with "emailsSparse$spam = as.factor(emailsSparse$spam)".
emailsSparse$spam = as.factor(emailsSparse$spam)
# Next, set the random seed to 123 and use the sample.split function to split emailsSparse 70/30 into a training set called "train" and a testing set called "test". Make sure to perform this step on emailsSparse instead of emails.
set.seed(123)
split = sample.split(emailsSparse$spam, SplitRatio = 0.7)
train = subset(emailsSparse, split==TRUE)
test = subset(emailsSparse, split==FALSE)
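# sample.split() splits so that the outcome keeps roughly the same class balance in both sets. The base-R helper below (stratified_split, a hypothetical name) is only an approximation of that idea on invented labels; it will not reproduce the exact rows that sample.split() selects.

```r
# Hypothetical helper: sample the same fraction from every class
stratified_split <- function(y, ratio) {
  in_train <- rep(FALSE, length(y))
  for (lvl in levels(y)) {
    idx <- which(y == lvl)
    n_pick <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_pick)]] <- TRUE
  }
  in_train
}

y <- factor(rep(c(0, 1), times = c(70, 30)))  # toy labels: 70 ham, 30 spam
set.seed(123)
in_train <- stratified_split(y, 0.7)

table(y[in_train])   # class counts in the training part
table(y[!in_train])  # class counts in the testing part
```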
# Using the training set, train the following three machine learning models. The models should predict the dependent variable "spam", using all other available variables as independent variables. Please be patient, as these models may take a few minutes to train.
# 1) A logistic regression model called spamLog. You may see a warning message here - we'll discuss this more later.
spamLog = glm(spam ~ ., data=train, family="binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# 2) A CART model called spamCART, using the default parameters to train the model (don't worry about adding minbucket or cp). Remember to add the argument method="class" since this is a binary classification problem.
spamCART = rpart(spam ~ ., data=train, method="class")
# 3) A random forest model called spamRF, using the default parameters to train the model (don't worry about specifying ntree or nodesize). Directly before training the random forest model, set the random seed to 123 (even though we've already done this earlier in the problem, it's important to set the seed right before training the model so we all obtain the same results. Keep in mind though that on certain operating systems, your results might still be slightly different).
set.seed(123)
spamRF = randomForest(spam ~ ., data=train)
# For each model, obtain the predicted spam probabilities for the training set. Be careful to obtain probabilities instead of predicted classes, because we will be using these values to compute training set AUC values. Recall that you can obtain probabilities for CART models by not passing any type parameter to the predict() function, and you can obtain probabilities from a random forest by adding the argument type="prob". For CART and random forest, you need to select the second column of the output of the predict() function, corresponding to the probability of a message being spam.
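# Note: predict() takes newdata=, not data=; an unrecognized data= argument is silently ignored. Omitting newdata returns the fitted training-set values for glm and rpart, and the out-of-bag predictions for randomForest.
predict.spamLog = predict(spamLog, type="response")
predict.spamCART = predict(spamCART)[,2]
predict.spamRF = predict(spamRF, type="prob")[,2]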
hist(predict.spamLog)
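# A quick check of how predict() treats its data arguments, using the built-in mtcars data as a stand-in for the spam training set. The argument that selects rows to score is newdata; anything passed as data= is ignored and the fitted values for the original training rows come back.

```r
# Small logistic regression on built-in data
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
first_five <- mtcars[1:5, ]

ignored <- predict(fit, data = first_five, type = "response")     # data= is not used
scored  <- predict(fit, newdata = first_five, type = "response")  # scores only these rows

length(ignored)  # one value per training row
length(scored)   # one value per supplied row
```

# For glm and rpart the two calls agree on the training rows, so the slip is harmless there. For a random forest, omitting newdata returns out-of-bag predictions, which differ from scoring the training set directly.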
# You may have noticed that training the logistic regression model yielded the messages "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Both of these messages often indicate overfitting and the first indicates particularly severe overfitting, often to the point that the training set observations are fit perfectly by the model. Let's investigate the predicted probabilities from the logistic regression model.
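# The same kind of warning can be provoked on a tiny invented dataset in which one variable separates the classes perfectly. This is a sketch of the mechanism only; the exact warnings and estimates may vary by R version.

```r
# Toy data: y is 1 exactly when x is greater than 3
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)

# Collect warning messages instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs                       # warnings raised while fitting
round(fitted(fit), 5)      # fitted probabilities pile up near 0 and 1
summary(fit)$coefficients  # very large estimates and standard errors
```

# With 330 word stems and about 4000 training emails, some combination of stems can separate spam from ham almost perfectly, which is consistent with the probabilities examined next.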
# How many of the training set predicted probabilities from spamLog are less than 0.00001?
sum(predict.spamLog < 0.00001) # ANS 3046
## [1] 3046
table(predict.spamLog < 0.00001) # Alternative
##
## FALSE TRUE
## 964 3046
# How many of the training set predicted probabilities from spamLog are more than 0.99999?
sum(predict.spamLog > 0.99999) # ANS 954
## [1] 954
table(predict.spamLog > 0.99999) # Alternative
##
## FALSE TRUE
## 3056 954
# How many of the training set predicted probabilities from spamLog are between 0.00001 and 0.99999?
sum((predict.spamLog <= 0.99999) & (predict.spamLog >= 0.00001)) # ANS 10
## [1] 10
table(predict.spamLog <= 0.99999 & predict.spamLog >= 0.00001) # Alternative
##
## FALSE TRUE
## 4000 10
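# Note: & compares vectors element by element; && expects single TRUE/FALSE values (older R versions silently used only the first element of each vector, and R 4.3.0 and later raise an error instead)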
length(predict.spamLog) # Note 4010 total elements
## [1] 4010
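# The three counts printed above should add up to the vector length (3046 + 954 + 10 = 4010). The same consistency check on a toy vector of invented probabilities:

```r
p <- c(0, 0.000001, 0.5, 0.99, 0.999999, 1)  # toy probabilities

n_low  <- sum(p < 0.00001)
n_high <- sum(p > 0.99999)
n_mid  <- sum(p >= 0.00001 & p <= 0.99999)   # & works element by element

c(low = n_low, mid = n_mid, high = n_high)
n_low + n_mid + n_high == length(p)          # the three groups cover every element
```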
# PROBLEM 3.2 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# How many variables are labeled as significant (at the p=0.05 level) in the logistic regression summary output?
summary(spamLog) # ANS 0 (none). The huge standard errors are a symptom of the logistic regression not converging
##
## Call:
## glm(formula = spam ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.011 0.000 0.000 0.000 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.082e+01 1.055e+04 -0.003 0.998
## X000 1.474e+01 1.058e+04 0.001 0.999
## X2000 -3.631e+01 1.556e+04 -0.002 0.998
## X2001 -3.215e+01 1.318e+04 -0.002 0.998
## X713 -2.427e+01 2.914e+04 -0.001 0.999
## X853 -1.212e+00 5.942e+04 0.000 1.000
## abl -2.049e+00 2.088e+04 0.000 1.000
## access -1.480e+01 1.335e+04 -0.001 0.999
## account 2.488e+01 8.165e+03 0.003 0.998
## addit 1.463e+00 2.703e+04 0.000 1.000
## address -4.613e+00 1.113e+04 0.000 1.000
## allow 1.899e+01 6.436e+03 0.003 0.998
## alreadi -2.407e+01 3.319e+04 -0.001 0.999
## also 2.990e+01 1.378e+04 0.002 0.998
## analysi -2.405e+01 3.860e+04 -0.001 1.000
## anoth -8.744e+00 2.032e+04 0.000 1.000
## applic -2.649e+00 1.674e+04 0.000 1.000
## appreci -2.145e+01 2.762e+04 -0.001 0.999
## approv -1.302e+00 1.589e+04 0.000 1.000
## april -2.620e+01 2.208e+04 -0.001 0.999
## area 2.041e+01 2.266e+04 0.001 0.999
## arrang 1.069e+01 2.135e+04 0.001 1.000
## ask -7.746e+00 1.976e+04 0.000 1.000
## assist -1.128e+01 2.490e+04 0.000 1.000
## associ 9.049e+00 1.909e+04 0.000 1.000
## attach -1.037e+01 1.534e+04 -0.001 0.999
## attend -3.451e+01 3.257e+04 -0.001 0.999
## avail 8.651e+00 1.709e+04 0.001 1.000
## back -1.323e+01 2.272e+04 -0.001 1.000
## base -1.354e+01 2.122e+04 -0.001 0.999
## begin 2.228e+01 2.973e+04 0.001 0.999
## believ 3.233e+01 2.136e+04 0.002 0.999
## best -8.201e+00 1.333e+03 -0.006 0.995
## better 4.263e+01 2.360e+04 0.002 0.999
## book 4.301e+00 2.024e+04 0.000 1.000
## bring 1.607e+01 6.767e+04 0.000 1.000
## busi -4.803e+00 1.000e+04 0.000 1.000
## buy 4.170e+01 3.892e+04 0.001 0.999
## call -1.145e+00 1.111e+04 0.000 1.000
## can 3.762e+00 7.674e+03 0.000 1.000
## case -3.372e+01 2.880e+04 -0.001 0.999
## chang -2.717e+01 2.215e+04 -0.001 0.999
## check 1.425e+00 1.963e+04 0.000 1.000
## click 1.376e+01 7.077e+03 0.002 0.998
## com 1.936e+00 4.039e+03 0.000 1.000
## come -1.166e+00 1.511e+04 0.000 1.000
## comment -3.251e+00 3.387e+04 0.000 1.000
## communic 1.580e+01 8.958e+03 0.002 0.999
## compani 4.781e+00 9.186e+03 0.001 1.000
## complet -1.363e+01 2.024e+04 -0.001 0.999
## confer -7.503e-01 8.557e+03 0.000 1.000
## confirm -1.300e+01 1.514e+04 -0.001 0.999
## contact 1.530e+00 1.262e+04 0.000 1.000
## continu 1.487e+01 1.535e+04 0.001 0.999
## contract -1.295e+01 1.498e+04 -0.001 0.999
## copi -4.274e+01 3.070e+04 -0.001 0.999
## corp 1.606e+01 2.708e+04 0.001 1.000
## corpor -8.286e-01 2.818e+04 0.000 1.000
## cost -1.938e+00 1.833e+04 0.000 1.000
## cours 1.665e+01 1.834e+04 0.001 0.999
## creat 1.338e+01 3.946e+04 0.000 1.000
## credit 2.617e+01 1.314e+04 0.002 0.998
## crenshaw 9.994e+01 6.769e+04 0.001 0.999
## current 3.629e+00 1.707e+04 0.000 1.000
## custom 1.829e+01 1.008e+04 0.002 0.999
## data -2.609e+01 2.271e+04 -0.001 0.999
## date -2.786e+00 1.699e+04 0.000 1.000
## day -6.100e+00 5.866e+03 -0.001 0.999
## deal -1.129e+01 1.448e+04 -0.001 0.999
## dear -2.313e+00 2.306e+04 0.000 1.000
## depart -4.068e+01 2.509e+04 -0.002 0.999
## deriv -4.971e+01 3.587e+04 -0.001 0.999
## design -7.923e+00 2.939e+04 0.000 1.000
## detail 1.197e+01 2.301e+04 0.001 1.000
## develop 5.976e+00 9.455e+03 0.001 0.999
## differ -2.293e+00 1.075e+04 0.000 1.000
## direct -2.051e+01 3.194e+04 -0.001 0.999
## director -1.770e+01 1.793e+04 -0.001 0.999
## discuss -1.051e+01 1.915e+04 -0.001 1.000
## doc -2.597e+01 2.603e+04 -0.001 0.999
## don 2.129e+01 1.456e+04 0.001 0.999
## done 6.828e+00 1.882e+04 0.000 1.000
## due -4.163e+00 3.532e+04 0.000 1.000
## ect 8.685e-01 5.342e+03 0.000 1.000
## edu -2.122e-01 6.917e+02 0.000 1.000
## effect 1.948e+01 2.100e+04 0.001 0.999
## effort 1.606e+01 5.670e+04 0.000 1.000
## either -2.744e+01 4.000e+04 -0.001 0.999
## email 3.833e+00 1.186e+04 0.000 1.000
## end -1.311e+01 2.938e+04 0.000 1.000
## energi -1.620e+01 1.646e+04 -0.001 0.999
## engin 2.664e+01 2.394e+04 0.001 0.999
## enron -8.789e+00 5.719e+03 -0.002 0.999
## etc 9.470e-01 1.569e+04 0.000 1.000
## even -1.654e+01 2.289e+04 -0.001 0.999
## event 1.694e+01 1.851e+04 0.001 0.999
## expect -1.179e+01 1.914e+04 -0.001 1.000
## experi 2.460e+00 2.240e+04 0.000 1.000
## fax 3.537e+00 3.386e+04 0.000 1.000
## feel 2.596e+00 2.348e+04 0.000 1.000
## file -2.943e+01 2.165e+04 -0.001 0.999
## final 8.075e+00 5.008e+04 0.000 1.000
## financ -9.122e+00 7.524e+03 -0.001 0.999
## financi -9.747e+00 1.727e+04 -0.001 1.000
## find -2.623e+00 9.727e+03 0.000 1.000
## first -4.666e-01 2.043e+04 0.000 1.000
## follow 1.766e+01 3.080e+03 0.006 0.995
## form 8.483e+00 1.674e+04 0.001 1.000
## forward -3.484e+00 1.864e+04 0.000 1.000
## free 6.113e+00 8.121e+03 0.001 0.999
## friday -1.146e+01 1.996e+04 -0.001 1.000
## full 2.125e+01 2.190e+04 0.001 0.999
## futur 4.146e+01 1.439e+04 0.003 0.998
## gas -3.901e+00 4.160e+03 -0.001 0.999
## get 5.154e+00 9.737e+03 0.001 1.000
## gibner 2.901e+01 2.460e+04 0.001 0.999
## give -2.518e+01 2.130e+04 -0.001 0.999
## given -2.186e+01 5.426e+04 0.000 1.000
## good 5.399e+00 1.619e+04 0.000 1.000
## great 1.222e+01 1.090e+04 0.001 0.999
## group 5.264e-01 1.037e+04 0.000 1.000
## happi 1.939e-02 1.202e+04 0.000 1.000
## hear 2.887e+01 2.281e+04 0.001 0.999
## hello 2.166e+01 1.361e+04 0.002 0.999
## help 1.731e+01 2.791e+03 0.006 0.995
## high -1.982e+00 2.554e+04 0.000 1.000
## home 5.973e+00 8.965e+03 0.001 0.999
## hope -1.435e+01 2.179e+04 -0.001 0.999
## hou 6.852e+00 6.437e+03 0.001 0.999
## hour 2.478e+00 1.333e+04 0.000 1.000
## houston -1.855e+01 7.305e+03 -0.003 0.998
## howev -3.449e+01 3.562e+04 -0.001 0.999
## http 2.528e+01 2.107e+04 0.001 0.999
## idea -1.845e+01 3.892e+04 0.000 1.000
## immedi 6.285e+01 3.346e+04 0.002 0.999
## import -1.859e+00 2.236e+04 0.000 1.000
## includ -3.454e+00 1.799e+04 0.000 1.000
## increas 6.476e+00 2.329e+04 0.000 1.000
## industri -3.160e+01 2.373e+04 -0.001 0.999
## info -1.255e+00 4.857e+03 0.000 1.000
## inform 2.078e+01 8.549e+03 0.002 0.998
## interest 2.698e+01 1.159e+04 0.002 0.998
## intern -7.991e+00 3.351e+04 0.000 1.000
## internet 8.749e+00 1.100e+04 0.001 0.999
## interview -1.640e+01 1.873e+04 -0.001 0.999
## invest 3.201e+01 2.393e+04 0.001 0.999
## invit 4.304e+00 2.215e+04 0.000 1.000
## involv 3.815e+01 3.315e+04 0.001 0.999
## issu -3.708e+01 3.396e+04 -0.001 0.999
## john -5.326e-01 2.856e+04 0.000 1.000
## join -3.824e+01 2.334e+04 -0.002 0.999
## juli -1.358e+01 3.009e+04 0.000 1.000
## just -1.021e+01 1.114e+04 -0.001 0.999
## kaminski -1.812e+01 6.029e+03 -0.003 0.998
## keep 1.867e+01 2.782e+04 0.001 0.999
## kevin -3.779e+01 4.738e+04 -0.001 0.999
## know 1.277e+01 1.526e+04 0.001 0.999
## last 1.046e+00 1.372e+04 0.000 1.000
## let -2.763e+01 1.462e+04 -0.002 0.998
## life 5.812e+01 3.864e+04 0.002 0.999
## like 5.649e+00 7.660e+03 0.001 0.999
## line 8.743e+00 1.236e+04 0.001 0.999
## link -6.929e+00 1.345e+04 -0.001 1.000
## list -8.692e+00 2.149e+03 -0.004 0.997
## locat 2.073e+01 1.597e+04 0.001 0.999
## london 6.745e+00 1.642e+04 0.000 1.000
## long -1.489e+01 1.934e+04 -0.001 0.999
## look -7.031e+00 1.563e+04 0.000 1.000
## lot -1.964e+01 1.321e+04 -0.001 0.999
## made 2.820e+00 2.743e+04 0.000 1.000
## mail 7.584e+00 1.021e+04 0.001 0.999
## make 2.901e+01 1.528e+04 0.002 0.998
## manag 6.014e+00 1.445e+04 0.000 1.000
## mani 1.885e+01 1.442e+04 0.001 0.999
## mark -3.350e+01 3.208e+04 -0.001 0.999
## market 7.895e+00 8.012e+03 0.001 0.999
## may -9.434e+00 1.397e+04 -0.001 0.999
## mean 6.078e-01 2.952e+04 0.000 1.000
## meet -1.063e+00 1.263e+04 0.000 1.000
## member 1.381e+01 2.343e+04 0.001 1.000
## mention -2.279e+01 2.714e+04 -0.001 0.999
## messag 1.716e+01 2.562e+03 0.007 0.995
## might 1.244e+01 1.753e+04 0.001 0.999
## model -2.292e+01 1.049e+04 -0.002 0.998
## monday -1.034e+00 3.233e+04 0.000 1.000
## money 3.264e+01 1.321e+04 0.002 0.998
## month -3.727e+00 1.112e+04 0.000 1.000
## morn -2.645e+01 3.403e+04 -0.001 0.999
## move -3.834e+01 3.011e+04 -0.001 0.999
## much 3.775e-01 1.392e+04 0.000 1.000
## name 1.672e+01 1.322e+04 0.001 0.999
## need 8.437e-01 1.221e+04 0.000 1.000
## net 1.256e+01 2.197e+04 0.001 1.000
## new 1.003e+00 1.009e+04 0.000 1.000
## next. 1.492e+01 1.724e+04 0.001 0.999
## note 1.446e+01 2.294e+04 0.001 0.999
## now 3.790e+01 1.219e+04 0.003 0.998
## number -9.622e+00 1.591e+04 -0.001 1.000
## offer 1.174e+01 1.084e+04 0.001 0.999
## offic -1.344e+01 2.311e+04 -0.001 1.000
## one 1.241e+01 6.652e+03 0.002 0.999
## onlin 3.589e+01 1.665e+04 0.002 0.998
## open 2.114e+01 2.961e+04 0.001 0.999
## oper -1.696e+01 2.757e+04 -0.001 1.000
## opportun -4.131e+00 1.918e+04 0.000 1.000
## option -1.085e+00 9.325e+03 0.000 1.000
## order 6.533e+00 1.242e+04 0.001 1.000
## origin 3.226e+01 3.818e+04 0.001 0.999
## part 4.594e+00 3.483e+04 0.000 1.000
## particip -1.154e+01 1.738e+04 -0.001 0.999
## peopl -1.864e+01 1.439e+04 -0.001 0.999
## per 1.367e+01 1.273e+04 0.001 0.999
## person 1.870e+01 9.575e+03 0.002 0.998
## phone -6.957e+00 1.172e+04 -0.001 1.000
## place 9.005e+00 3.661e+04 0.000 1.000
## plan -1.830e+01 6.320e+03 -0.003 0.998
## pleas -7.961e+00 9.484e+03 -0.001 0.999
## point 5.498e+00 3.403e+04 0.000 1.000
## posit -1.543e+01 2.316e+04 -0.001 0.999
## possibl -1.366e+01 2.492e+04 -0.001 1.000
## power -5.643e+00 1.173e+04 0.000 1.000
## present -6.163e+00 1.278e+04 0.000 1.000
## price 3.428e+00 7.850e+03 0.000 1.000
## problem 1.262e+01 9.763e+03 0.001 0.999
## process -2.957e-01 1.191e+04 0.000 1.000
## product 1.016e+01 1.345e+04 0.001 0.999
## program 1.444e+00 1.183e+04 0.000 1.000
## project 2.173e+00 1.497e+04 0.000 1.000
## provid 2.422e-01 1.859e+04 0.000 1.000
## public -5.250e+01 2.341e+04 -0.002 0.998
## put -1.052e+01 2.681e+04 0.000 1.000
## question -3.467e+01 1.859e+04 -0.002 0.999
## rate -3.112e+00 1.319e+04 0.000 1.000
## read -1.527e+01 2.145e+04 -0.001 0.999
## real 2.046e+01 2.358e+04 0.001 0.999
## realli -2.667e+01 4.640e+04 -0.001 1.000
## receiv 5.765e-01 1.585e+04 0.000 1.000
## recent -2.067e+00 1.780e+04 0.000 1.000
## regard -3.668e+00 1.511e+04 0.000 1.000
## relat -5.114e+01 1.793e+04 -0.003 0.998
## remov 2.325e+01 2.484e+04 0.001 0.999
## repli 1.538e+01 2.916e+04 0.001 1.000
## report -1.482e+01 1.477e+04 -0.001 0.999
## request -1.232e+01 1.167e+04 -0.001 0.999
## requir 5.004e-01 2.937e+04 0.000 1.000
## research -2.826e+01 1.553e+04 -0.002 0.999
## resourc -2.735e+01 3.522e+04 -0.001 0.999
## respond 2.974e+01 3.888e+04 0.001 0.999
## respons -1.960e+01 3.667e+04 -0.001 1.000
## result -5.002e-01 3.140e+04 0.000 1.000
## resum -9.219e+00 2.100e+04 0.000 1.000
## return 1.745e+01 1.844e+04 0.001 0.999
## review -4.825e+00 1.013e+04 0.000 1.000
## right 2.312e+01 1.590e+04 0.001 0.999
## risk -4.001e+00 1.718e+04 0.000 1.000
## robert -2.096e+01 2.907e+04 -0.001 0.999
## run -5.162e+01 4.434e+04 -0.001 0.999
## say 7.366e+00 2.217e+04 0.000 1.000
## schedul 1.919e+00 3.580e+04 0.000 1.000
## school -3.870e+00 2.882e+04 0.000 1.000
## secur -1.604e+01 2.201e+03 -0.007 0.994
## see -1.120e+01 1.293e+04 -0.001 0.999
## send -2.427e+01 1.222e+04 -0.002 0.998
## sent -1.488e+01 2.195e+04 -0.001 0.999
## servic -7.164e+00 1.235e+04 -0.001 1.000
## set -9.353e+00 2.627e+04 0.000 1.000
## sever 2.041e+01 3.093e+04 0.001 0.999
## shall 1.930e+01 3.075e+04 0.001 0.999
## shirley -7.133e+01 6.329e+04 -0.001 0.999
## short -8.974e+00 1.721e+04 -0.001 1.000
## sinc -3.438e+00 3.546e+04 0.000 1.000
## sincer -2.073e+01 3.515e+04 -0.001 1.000
## site 8.689e+00 1.496e+04 0.001 1.000
## softwar 2.575e+01 1.059e+04 0.002 0.998
## soon 2.350e+01 3.731e+04 0.001 0.999
## sorri 6.036e+00 2.299e+04 0.000 1.000
## special 1.777e+01 2.755e+04 0.001 0.999
## specif -2.337e+01 3.083e+04 -0.001 0.999
## start 1.437e+01 1.897e+04 0.001 0.999
## state 1.221e+01 1.677e+04 0.001 0.999
## still 3.878e+00 2.622e+04 0.000 1.000
## stinson -4.345e+01 2.697e+04 -0.002 0.999
## student -1.815e+01 2.186e+04 -0.001 0.999
## subject 3.041e+01 1.055e+04 0.003 0.998
## success 4.344e+00 2.783e+04 0.000 1.000
## suggest -3.842e+01 4.475e+04 -0.001 0.999
## support -1.539e+01 1.976e+04 -0.001 0.999
## sure -5.503e+00 2.078e+04 0.000 1.000
## system 3.778e+00 9.149e+03 0.000 1.000
## take 5.731e+00 1.716e+04 0.000 1.000
## talk -1.011e+01 2.021e+04 -0.001 1.000
## team 7.940e+00 2.570e+04 0.000 1.000
## term 2.013e+01 2.303e+04 0.001 0.999
## thank -3.890e+01 1.059e+04 -0.004 0.997
## thing 2.579e+01 1.341e+04 0.002 0.998
## think -1.218e+01 2.077e+04 -0.001 1.000
## thought 1.243e+01 3.023e+04 0.000 1.000
## thursday -1.491e+01 3.262e+04 0.000 1.000
## time -5.921e+00 8.335e+03 -0.001 0.999
## today -1.762e+01 1.965e+04 -0.001 0.999
## togeth -2.355e+01 1.869e+04 -0.001 0.999
## trade -1.755e+01 1.483e+04 -0.001 0.999
## tri 9.278e-01 1.282e+04 0.000 1.000
## tuesday -2.808e+01 3.959e+04 -0.001 0.999
## two -2.573e+01 1.844e+04 -0.001 0.999
## type -1.447e+01 2.755e+04 -0.001 1.000
## understand 9.307e+00 2.342e+04 0.000 1.000
## unit -4.020e+00 3.008e+04 0.000 1.000
## univers 1.228e+01 2.197e+04 0.001 1.000
## updat -1.510e+01 1.448e+04 -0.001 0.999
## use -1.385e+01 9.382e+03 -0.001 0.999
## valu 9.024e-01 1.360e+04 0.000 1.000
## version -3.606e+01 2.939e+04 -0.001 0.999
## vinc -3.735e+01 8.647e+03 -0.004 0.997
## visit 2.585e+01 1.170e+04 0.002 0.998
## vkamin -6.649e+01 5.703e+04 -0.001 0.999
## want -2.555e+00 1.106e+04 0.000 1.000
## way 1.339e+01 1.138e+04 0.001 0.999
## web 2.791e+00 1.686e+04 0.000 1.000
## websit -2.563e+01 1.848e+04 -0.001 0.999
## wednesday -1.526e+01 2.642e+04 -0.001 1.000
## week -6.795e+00 1.046e+04 -0.001 0.999
## well -2.222e+01 9.713e+03 -0.002 0.998
## will -1.119e+01 5.980e+03 -0.002 0.999
## wish 1.173e+01 3.175e+04 0.000 1.000
## within 2.900e+01 2.163e+04 0.001 0.999
## without 1.942e+01 1.763e+04 0.001 0.999
## work -1.099e+01 1.160e+04 -0.001 0.999
## write 4.406e+01 2.825e+04 0.002 0.999
## www -7.867e+00 2.224e+04 0.000 1.000
## year -1.010e+01 1.039e+04 -0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4409.49 on 4009 degrees of freedom
## Residual deviance: 13.46 on 3679 degrees of freedom
## AIC: 675.46
##
## Number of Fisher Scoring iterations: 25
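# The pattern in this summary (large estimates, much larger standard errors, p-values near 1) is typical when the classes can be separated perfectly. A minimal sketch on made-up, perfectly separable data, assuming nothing about the spam data itself; toy, sepFit, coefTable and pValues are illustrative names.

```r
# Made-up, perfectly separable data: y is 0 whenever x <= 5 and 1 otherwise.
toy <- data.frame(x = 1:10, y = rep(c(0, 1), each = 5))

# glm() warns here for the same reasons as spamLog; warnings are muted for brevity.
sepFit <- suppressWarnings(glm(y ~ x, data = toy, family = "binomial"))

coefTable <- summary(sepFit)$coefficients
coefTable                       # large estimates, far larger standard errors
pValues <- coefTable[, "Pr(>|z|)"]
any(pValues < 0.05)             # no coefficient is flagged as significant
round(fitted(sepFit), 3)        # fitted probabilities sit at (numerically) 0 or 1
```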
# PROBLEM 3.3 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# How many of the word stems "enron", "hou", "vinc", and "kaminski" appear in the CART tree? Recall that we suspect these word stems are specific to Vincent Kaminski and might affect the generalizability of a spam filter built with his ham data.
prp(spamCART) # ANS 2: the stems "vinc" and "enron" appear in the tree
# PROBLEM 3.4 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamLog, using a threshold of 0.5 for predictions?
table(train$spam, predict.spamLog >0.5)
##
## FALSE TRUE
## 0 3052 0
## 1 4 954
(3052+954)/nrow(train) # ANS 0.9990025
## [1] 0.9990025
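# Typing the counts by hand is error-prone, so one option is to compute accuracy from the confusion matrix itself. A sketch that rebuilds the matrix reported above (confMat and accuracy are illustrative names):

```r
# Confusion matrix reported above for spamLog on the training set
# (rows: actual class, columns: predicted probability > 0.5).
confMat <- matrix(c(3052, 4, 0, 954), nrow = 2,
                  dimnames = list(actual = c("0", "1"),
                                  predicted = c("FALSE", "TRUE")))

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy
```

# The same expression can be applied directly to a table() result, provided both predicted classes occur so the table is 2 x 2.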
# PROBLEM 3.5 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set AUC of spamLog?
library(ROCR)
predLog = prediction(predict.spamLog, train$spam)
performance(predLog, "auc")@y.values
## [[1]]
## [1] 0.9999959
as.numeric(performance(predLog, "auc")@y.values) # Alternative
## [1] 0.9999959
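# For intuition, AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it in base R with the rank-sum identity. It is an illustration on made-up scores, not a description of how ROCR computes the value, and aucByRank is a hypothetical helper name.

```r
# AUC via the rank-sum (Mann-Whitney) identity; `aucByRank` is an illustrative name.
aucByRank <- function(scores, labels) {
  r  <- rank(scores)              # average ranks, so ties are handled
  n1 <- sum(labels == 1)          # number of positives
  n0 <- sum(labels == 0)          # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up scores: 3 of the 4 (positive, negative) pairs are ordered correctly.
aucByRank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))  # 0.75
```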
# PROBLEM 3.6 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamCART, using a threshold of 0.5 for predictions? (Remember that if you used the type="class" argument when making predictions, you automatically used a threshold of 0.5. If you did not add in the type argument to the predict function, the probabilities are in the second column of the predict output.)
table(train$spam, predict.spamCART>0.5)
##
## FALSE TRUE
## 0 2885 167
## 1 64 894
(2885+894)/nrow(train) # ANS 0.942394
## [1] 0.942394
# PROBLEM 3.7 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set AUC of spamCART? (Remember that you have to pass the prediction function predicted probabilities, so don't include the type argument when making predictions for your CART model.)
predCART = prediction(predict.spamCART, train$spam)
performance(predCART, "auc")@y.values # ANS 0.9696044
## [[1]]
## [1] 0.9696044
as.numeric(performance(predCART, "auc")@y.values) # Alternative
## [1] 0.9696044
# PROBLEM 3.8 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# What is the training set accuracy of spamRF, using a threshold of 0.5 for predictions? (Remember that your answer might not match ours exactly, due to random behavior in the random forest algorithm on different operating systems.)
table(train$spam, predict.spamRF>0.5)
##
## FALSE TRUE
## 0 3013 39
## 1 44 914
(3013+914)/nrow(train) # ANS 0.9793017
## [1] 0.9793017
# PROBLEM 3.9 - BUILDING MACHINE LEARNING MODELS (2 points possible)
# What is the training set AUC of spamRF? (Remember to pass the argument type="prob" to the predict function to get predicted probabilities for a random forest model. The probabilities will be the second column of the output.)
predRF = prediction(predict.spamRF, train$spam)
performance(predRF, "auc")@y.values # ANS 0.9979116
## [[1]]
## [1] 0.9979116
as.numeric(performance(predRF, "auc")@y.values) # Alternative
## [1] 0.9979116
# PROBLEM 3.10 - BUILDING MACHINE LEARNING MODELS (1 point possible)
# Which model had the best training set performance, in terms of accuracy and AUC? ANS Logistic
# PROBLEM 4.1 - EVALUATING ON THE TEST SET (1 point possible)
# Obtain predicted probabilities for the testing set for each of the models, again ensuring that probabilities instead of classes are obtained.
predict.spamLog.test = predict(spamLog,newdata=test, type="response")
predict.spamCART.test = predict(spamCART, newdata=test)[,2]
predict.spamRF.test = predict(spamRF,newdata=test, type="prob")[,2]
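# A minimal sketch of what newdata does, using a made-up split of the built-in mtcars data (toyTrain, toyTest and toyFit are illustrative names). Without newdata, predict() falls back to the data the model was fitted on, which is easy to mistake for test set predictions.

```r
# Toy split of the built-in mtcars data (not the spam data).
toyTrain <- mtcars[1:22, ]
toyTest  <- mtcars[23:32, ]
toyFit   <- glm(am ~ wt, data = toyTrain, family = "binomial")

# Without newdata, predict() returns one value per training row.
length(predict(toyFit, type = "response"))                     # 22
# With newdata, it returns one value per row of the supplied data frame.
length(predict(toyFit, newdata = toyTest, type = "response"))  # 10
```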
# What is the testing set accuracy of spamLog, using a threshold of 0.5 for predictions?
table(test$spam, predict.spamLog.test >0.5)
##
## FALSE TRUE
## 0 1257 51
## 1 34 376
(1257+376)/nrow(test) # ANS 0.9505239
## [1] 0.9505239
# PROBLEM 4.2 - EVALUATING ON THE TEST SET (1 point possible)
# What is the testing set AUC of spamLog? ANS 0.9627517
predLog.test = prediction(predict.spamLog.test, test$spam)
performance(predLog.test, "auc")@y.values
## [[1]]
## [1] 0.9627517
as.numeric(performance(predLog.test, "auc")@y.values) # Alternative
## [1] 0.9627517
# PROBLEM 4.3 - EVALUATING ON THE TEST SET (1 point possible)
# What is the testing set accuracy of spamCART, using a threshold of 0.5 for predictions?
table(test$spam, predict.spamCART.test >0.5)
##
## FALSE TRUE
## 0 1228 80
## 1 24 386
(1228+386)/nrow(test) # ANS 0.9394645
## [1] 0.9394645
# PROBLEM 4.4 - EVALUATING ON THE TEST SET (1 point possible)
# What is the testing set AUC of spamCART?
predCART.test = prediction(predict.spamCART.test, test$spam)
as.numeric(performance(predCART.test, "auc")@y.values) # 0.963176
## [1] 0.963176
# PROBLEM 4.5 - EVALUATING ON THE TEST SET (1 point possible)
# What is the testing set accuracy of spamRF, using a threshold of 0.5 for predictions?
table(test$spam, predict.spamRF.test >0.5)
##
## FALSE TRUE
## 0 1290 18
## 1 25 385
(1290+385)/nrow(test) # ANS 0.9749709
## [1] 0.9749709
# PROBLEM 4.6 - EVALUATING ON THE TEST SET (1 point possible)
# What is the testing set AUC of spamRF?
predRF.test = prediction(predict.spamRF.test, test$spam)
as.numeric(performance(predRF.test, "auc")@y.values)
## [1] 0.9975656
# ANS 0.9975656
# PROBLEM 4.7 - EVALUATING ON THE TEST SET (1/1 point)
# Which model had the best testing set performance, in terms of accuracy and AUC? ANS Random Forest
# PROBLEM 4.8 - EVALUATING ON THE TEST SET (1/1 point)
# Which model demonstrated the greatest degree of overfitting?
# ANS Logistic. Both CART and random forest had very similar accuracies on the training and testing sets. However, logistic regression obtained nearly perfect accuracy and AUC on the training set and had far-from-perfect performance on the testing set. This is an indicator of overfitting.
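# One way to make the overfitting comparison concrete is to tabulate the drop from training to testing performance. The sketch below uses the AUC values reported above; aucSummary and its column names are illustrative.

```r
# AUC values copied from the outputs reported above.
aucSummary <- data.frame(
  model    = c("Logistic", "CART", "Random Forest"),
  trainAUC = c(0.9999959, 0.9696044, 0.9979116),
  testAUC  = c(0.9627517, 0.963176, 0.9975656)
)

# A larger drop from training to testing suggests more overfitting.
aucSummary$drop <- aucSummary$trainAUC - aucSummary$testAUC
aucSummary[order(-aucSummary$drop), ]
as.character(aucSummary$model[which.max(aucSummary$drop)])  # "Logistic"
```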
# PROBLEM 6.1 - INTEGRATING WORD COUNT INFORMATION
# While we have thus far mostly dealt with frequencies of specific words in our analysis, we can extract other information from text. The last two sections of this problem will deal with two other types of information we can extract.
# First, we will use the number of words in each email as an independent variable. We can use the original document term matrix called dtm for this task. The document term matrix has documents (in this case, emails) as its rows, terms (in this case, word stems) as its columns, and frequencies as its values. As a result, the sum of all the elements in a row of the document term matrix is equal to the number of terms present in the document corresponding to that row. Obtain the word counts for each email with the command:
wordCount = rowSums(as.matrix(dtm))
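# A small sketch of the row-sum idea on a made-up document term matrix (toyDtm and toyWordCount are illustrative names, and the counts are invented):

```r
# Made-up document-term matrix: 3 documents (rows) by 4 word stems (columns).
toyDtm <- matrix(c(2, 0, 1, 0,
                   0, 3, 0, 1,
                   1, 1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("enron", "free", "meet", "money")))

# Each row sum is the number of counted term occurrences in that document.
toyWordCount <- rowSums(toyDtm)
toyWordCount        # doc1 = 3, doc2 = 4, doc3 = 4
log(toyWordCount)   # the log transform used for logWordCount
```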
# PROBLEM 6.4 - INTEGRATING WORD COUNT INFORMATION
# Create a variable called logWordCount in emailsSparse2 that is equal to log(wordCount). Use the boxplot() command to plot logWordCount against whether a message is spam. Which of the following best describes the box plot?
emailsSparse2 = emailsSparse
emailsSparse2$logWordCount = log(wordCount)
boxplot(emailsSparse2$logWordCount ~ emailsSparse2$spam) # Formula interface: one box per spam class
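# A sketch of how boxplot() treats a formula versus two comma-separated vectors, on made-up data (toyBox, byGroup and twoVectors are illustrative names). plot = FALSE is used so that only the summary statistics are returned.

```r
# Made-up data: a numeric value and a 0/1 group label.
toyBox <- data.frame(logWordCount = c(4.1, 5.0, 5.6, 6.2, 3.2, 3.9, 4.4, 5.1),
                     spam         = c(0, 0, 0, 0, 1, 1, 1, 1))

# value ~ group draws one box per group; plot = FALSE returns the statistics only.
byGroup <- boxplot(logWordCount ~ spam, data = toyBox, plot = FALSE)
byGroup$names        # "0" "1"
byGroup$stats[3, ]   # median of each group

# Passing two vectors separated by a comma draws a box for each vector instead.
twoVectors <- boxplot(toyBox$logWordCount, toyBox$spam, plot = FALSE)
ncol(twoVectors$stats)  # 2 boxes, but the second summarises the 0/1 labels
```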