Assignment 5: R Summer Programming

houses <- read.csv("~/Desktop/Houses.csv")
Part A: House Data
The variable SQFT gives the size of the home in square feet. We would like to predict the Total price of homes based on the size of the home (SQFT). Create a linear model that makes this prediction. Give the name “model_1” to this model. Give a summary() command of this model.
model_1 <- lm(Total ~ SQFT, data = houses)
summary(model_1)
Call:
lm(formula = Total ~ SQFT, data = houses)

Residuals:
     Min       1Q   Median       3Q      Max 
 -190007   -91711   -64236   -26105  4689712 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept)  23807.07  126487.61   0.188   0.8511  
SQFT           135.74      67.58   2.009   0.0473 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 483700 on 98 degrees of freedom
Multiple R-squared:  0.03955,	Adjusted R-squared:  0.02974 
F-statistic: 4.035 on 1 and 98 DF,  p-value: 0.04732
From model_1 report the slope of the least squares line. What does this value tell us?
The slope of the least squares line is 135.74. This means that for each additional square foot of the house, the model predicts the total price to increase by about $135.74.
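As a quick check, the slope can also be pulled straight from the fitted object rather than read off the summary (a minimal sketch using the model_1 object fit above):

# extract the SQFT slope directly from the fitted model
coef(model_1)["SQFT"]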
From model_1 report the value of the R-squared value. What does this value tell us?
The R-squared value is 0.03955. This value is interesting: the p-value is under 0.05, meaning the relationship between SQFT and Total is statistically significant, but the R-squared being so low means that size does NOT account for much of the variation in price. This suggests that there are other variables and other relationships that have to be evaluated to get a better understanding of a home's Total price.
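For reference, the R-squared can also be extracted programmatically from the summary object (a small sketch using model_1 from above):

# pull R-squared out of the summary object
summary(model_1)$r.squared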
This data contains an outlier that has a Total value of approximately 4.9 million dollars. Filter your data to create a new data frame that removes this observation. Use the filtered data to recreate your linear model. Give the name “model_2” to this model. Provide the summary() command of this new model.
# remove outlier
houses2 <- subset(houses, Total < 4000000)

# same process
model_2 <- lm(Total ~ SQFT, data = houses2)
summary(model_2)
Call:
lm(formula = Total ~ SQFT, data = houses2)

Residuals:
    Min      1Q  Median      3Q     Max 
-178469  -50742   -6400   25185  418136 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -75819.6    21764.1  -3.484 0.000744 ***
SQFT           165.9       11.6  14.299  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 82960 on 97 degrees of freedom
Multiple R-squared:  0.6782,	Adjusted R-squared:  0.6749 
F-statistic: 204.5 on 1 and 97 DF,  p-value: < 2.2e-16
Examine the R-squared value. How did this value change between the two models?
The new R-squared value is 0.6782. Such a drastic increase shows that the outlier house above 4 million dollars carried a significant amount of weight and heavily skewed the original analysis. With that house filtered out, we can now say that almost 68% of the variation in the total price of houses is explained by the SQFT variable.
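A side-by-side comparison makes the jump explicit (a sketch assuming model_1 and model_2 as fitted above):

# compare R-squared with and without the outlier
c(with_outlier    = summary(model_1)$r.squared,
  without_outlier = summary(model_2)$r.squared)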
The Acres variable gives the amount of land on which the house is located. The outlier that is approximately 4.9 million dollars is due to the large number of acres of the property. We would like to incorporate the Acres variable in our model. Create a new model called “model_3” that uses all of the data. Predict the Total value of the home with both the size of the home (SQFT) and the Acres variable. Produce a summary() command of this new model.
# multiple regression
model_3 <- lm(Total ~ SQFT + Acres, data = houses)
summary(model_3)
Call:
lm(formula = Total ~ SQFT + Acres, data = houses)

Residuals:
    Min      1Q  Median      3Q     Max 
-189677  -48727   -5686   30044  448248 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -99554.31   24126.44  -4.126 7.79e-05 ***
SQFT            151.79      12.83  11.832  < 2e-16 ***
Acres        120884.58    2360.00  51.222  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 91800 on 97 degrees of freedom
Multiple R-squared:  0.9658,	Adjusted R-squared:  0.9651 
F-statistic:  1368 on 2 and 97 DF,  p-value: < 2.2e-16
Examine the R-squared value of this model. How did this value change from the model that removed the observation?
Including Acres in the model significantly improved the results, as shown by the R-squared value increasing to 0.9658. This means that nearly 97% of the variation in total home price is explained by both square footage and land size. By keeping the outlier in model_3 and adding land size as a predictor, we can now understand that the high price was largely due to the number of acres. Recognizing this factor made the model much stronger and more accurate overall.
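To confirm that the expensive home really does sit on an unusually large lot, one could inspect that row directly (a quick sketch, assuming the Total and Acres column names used in the models above):

# look at the outlier's Acres value
subset(houses, Total > 4000000)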
We have a home that has a size of 2750 square feet and sits on 0.5 acres of land. Use your model to predict the value of this home.
# can use the built-in predict function
sample_house <- data.frame(SQFT = 2750, Acres = 0.5)

# predict
predict(model_3, newdata = sample_house)
       1 
378310.1 
Our model predicts the value of the sample house of 2750 SQFT on 0.5 acres of land to be about $378,310.10.
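If we wanted a sense of the uncertainty around this point estimate, predict() for lm objects can also return a prediction interval (a minimal sketch; the default level is 95%):

# point estimate plus a 95% prediction interval
predict(model_3, newdata = sample_house, interval = "prediction")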
Part B: YouTube Data
YouTube is one of the largest social media platforms. Along with uploading videos, a large social aspect of the platform is the comments that users post about videos. A major problem with such posts is that some are “spam” messages that include undesirable advertising or even malware. In this assignment, we will examine a selection of comments from three music videos posted on YouTube. The videos include “Roar” by Katy Perry, “Gangnam Style” by Psy, and “Love The Way You Lie” by Eminem. The comments include both legitimate and spam messages. We will use the tools of text analysis to compare the spam and legitimate messages. The data set contains 4 variables. The author variable indicates the YouTube user who wrote the comment. Content is the comment itself. Spam is a binary variable that is 1 if the comment is spam and 0 if non-spam. The video variable indicates which of the three videos the comment is from.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# reading in
yt_comments_a <- read.csv("~/Desktop/yt_comments_a.csv")
glimpse(yt_comments_a)
Rows: 999
Columns: 4
$ author <chr> "lekanaVEVO1", "Pyunghee", "Erica Ross", "Aviel Haimov", "John…
$ content <chr> "i love this so much. AND also I Generate Free Leads on Auto P…
$ spam <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ video <chr> "Katie Perry", "Katie Perry", "Katie Perry", "Katie Perry", "K…
Consider again the YouTube file discussed previously. The variable spam is 1 or 0. We would like to change this variable to a character variable. Use an ifelse command to change the spam variable to “yes” and “no”. Use dplyr verbs to count the number of comments that are spam and the number that are not.
yt_comments_a <- yt_comments_a %>%
  mutate(spam = ifelse(spam == 1, "yes", "no"))

# count
yt_comments_a %>%
  count(spam)
  spam   n
1   no 477
2  yes 522
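The same counts can be turned into proportions to show the class balance (a small sketch using the recoded spam variable):

# share of spam vs non-spam comments
yt_comments_a %>%
  count(spam) %>%
  mutate(prop = n / sum(n))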
Just as you did in assignment 4, use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains “http”. Convert your variable to a factor. Create a table that is grouped by your new variable with the count of spam and non-spam comments.
library(stringr)

# true if comment contains http
yt_comments_a <- yt_comments_a %>%
  mutate(contains_http = str_detect(content, "http")) %>%
  mutate(contains_http = as.factor(contains_http))

# spam w/ http
yt_comments_a %>%
  group_by(contains_http, spam) %>%
  summarise(count = n(), .groups = "drop")
# A tibble: 4 × 3
  contains_http spam  count
  <fct>         <chr> <int>
1 FALSE         no      469
2 FALSE         yes     386
3 TRUE          no        8
4 TRUE          yes     136
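The table above suggests “http” is a strong spam signal; computing the spam rate within each group makes that concrete (a sketch, assuming the recoded “yes”/“no” spam variable):

# spam rate among comments with and without "http"
yt_comments_a %>%
  group_by(contains_http) %>%
  summarise(spam_rate = mean(spam == "yes"), .groups = "drop")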
Use the rpart package to create a decision tree model that predicts spam with your variable that detected “http”. Call this “model_4” and provide the text output of the model object.
library(rpart)

# tree
model_4 <- rpart(spam ~ contains_http, data = yt_comments_a, method = "class")
model_4
n= 999 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 999 477 yes (0.47747748 0.52252252)  
  2) contains_http=FALSE 855 386 no (0.54853801 0.45146199) *
  3) contains_http=TRUE 144 8 yes (0.05555556 0.94444444) *
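As a rough check of the fit, the tree's in-sample accuracy can be computed by comparing its predicted classes to the actual labels (a sketch using model_4 from above; predict() with no newdata uses the training data):

# in-sample accuracy of the single-split tree
mean(predict(model_4, type = "class") == yt_comments_a$spam)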
Use the partykit package to create a visualization of your decision tree from model_4.
# won't render with install.packages() in the chunk, so left as a comment
# install.packages("partykit")
library(partykit)
Warning: package 'partykit' was built under R version 4.4.1
Loading required package: grid
Loading required package: libcoin
Warning: package 'libcoin' was built under R version 4.4.1
Loading required package: mvtnorm
Warning: package 'mvtnorm' was built under R version 4.4.1
# convert to party object, then plot
plot(as.party(model_4))
Just as you did in assignment 4, create variables that detect each of the following: the phrase “subscribe”, 3 or more punctuation marks in a row, and 3 or more numbers in a row. Convert each new variable to a factor. Give a glimpse() of your data frame with these additional variables.
# from last week

# numbers
yt_comments_a <- yt_comments_a %>%
  mutate(contains_numbers = str_detect(content, "[0-9]{3,}"),
         contains_numbers = as.factor(contains_numbers))

# punctuation
yt_comments_a <- yt_comments_a %>%
  mutate(junk_words = str_detect(content, "[[:punct:]]{3,}"),
         junk_words = as.factor(junk_words))

# subscribe
yt_comments_a <- yt_comments_a %>%
  mutate(subscribe = str_detect(str_to_lower(content), "subscribe"),
         subscribe = as.factor(subscribe))

# check
glimpse(yt_comments_a)
Rows: 999
Columns: 8
$ author           <chr> "lekanaVEVO1", "Pyunghee", "Erica Ross", "Aviel Haimo…
$ content          <chr> "i love this so much. AND also I Generate Free Leads …
$ spam             <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
$ video            <chr> "Katie Perry", "Katie Perry", "Katie Perry", "Katie P…
$ contains_http    <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ contains_numbers <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, …
$ junk_words       <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ subscribe        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
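As a sanity check on these flags, one could look at how often each one fires (a sketch; since the flags are factors, they are compared to the string "TRUE"):

# proportion of comments triggering each flag
yt_comments_a %>%
  summarise(across(c(contains_http, contains_numbers, junk_words, subscribe),
                   ~ mean(.x == "TRUE")))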
Use rpart to create a decision tree model that predicts the spam variable from all 4 of the variables you created. Give it the name “model_5”. Use the partykit package to create a visualization of your decision tree.
# tree
model_5 <- rpart(spam ~ contains_http + subscribe + junk_words + contains_numbers,
                 data = yt_comments_a, method = "class")
plot(as.party(model_5))
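Beyond the plot, printing the fitted object or its variable importance shows which flags the tree actually split on (a sketch; variable.importance is a standard component of rpart fits):

# text form of the tree and relative variable importance
model_5
model_5$variable.importance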
We might like to use our decision tree to eliminate new comments that are spam. The file Youtube B contains an additional 149 comments that you did not use previously. Load this data file and create each of the four predictor variables for this new data. Be sure to use the same names as before and convert them to factors. Give a glimpse() of this new data.
# reading in b
yt_comments_b <- read.csv("~/Desktop/yt_comments_b.csv")
library(dplyr)
library(stringr)

# all four predictor variables
yt_comments_b <- yt_comments_b %>%
  mutate(
    contains_http = str_detect(content, "http"),
    subscribe = str_detect(str_to_lower(content), "subscribe"),
    junk_words = str_detect(content, "[[:punct:]]{3,}"),
    contains_numbers = str_detect(content, "[0-9]{3,}"),
    contains_http = as.factor(contains_http),
    subscribe = as.factor(subscribe),
    junk_words = as.factor(junk_words),
    contains_numbers = as.factor(contains_numbers)
  )
glimpse(yt_comments_b)
Rows: 149
Columns: 9
$ comment          <int> 1029, 718, 911, 72, 892, 952, 239, 324, 871, 619, 8, …
$ author           <chr> "Huda Salah", "TMCB production (Instrumental Beats)",…
$ content          <chr> "Looooved <U+FEFF>", "Check out our Channel for nice …
$ spam             <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,…
$ video            <chr> "Eminem", "Eminem", "Eminem", "Katie Perry", "Eminem"…
$ contains_http    <fct> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
$ subscribe        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ junk_words       <fct> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,…
$ contains_numbers <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,…
Use your model_5 to predict if these comments are spam. Use dplyr verbs to group the new comments by your predictions and count the amount of true spam. In a few sentences, do you think your tree does a good job identifying spam in new comments?
# use model_5
yt_comments_b$predicted_spam <- predict(model_5, newdata = yt_comments_b, type = "class")

# predicted value and actual
yt_comments_b %>%
  group_by(predicted_spam, spam) %>%
  summarise(count = n(), .groups = "drop")
# A tibble: 4 × 3
  predicted_spam  spam count
  <fct>          <int> <int>
1 no                 0    75
2 no                 1    31
3 yes                0     1
4 yes                1    42
Does the tree do a good job? The decision tree accurately predicted 42 spam comments and 75 non-spam comments. However, it also incorrectly identified one non-spam comment as spam (a false positive) and missed 31 real spam comments (false negatives). This implies that while the tree is effective at detecting the majority of spam and non-spam comments and rarely mislabels legitimate ones, it still misses a substantial share of spam.
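Those counts translate into a few standard metrics. A minimal sketch computing accuracy, sensitivity, and specificity from the same two columns (note that spam is still 0/1 in yt_comments_b):

# overall accuracy, plus how well the tree finds spam (sensitivity)
# and keeps legitimate comments (specificity)
actual <- ifelse(yt_comments_b$spam == 1, "yes", "no")
predicted <- as.character(yt_comments_b$predicted_spam)
c(accuracy    = mean(predicted == actual),
  sensitivity = mean(predicted[actual == "yes"] == "yes"),
  specificity = mean(predicted[actual == "no"] == "no"))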