Assignment 5

Author

Kelly Ratigan

Published

July 20, 2025

Assignment 5: R Summer Programming

Part A: House Data

houses <- read.csv("~/Desktop/Houses.csv")
  1. The variable SQFT gives the size of the home in square feet.  We would like to predict the Total price of homes based on the size of the home (SQFT).   Create a linear model that makes this prediction. Give the name “model_1” to this model.   Give a summary() command of this model.

    model_1 <- lm(Total ~ SQFT, data = houses)
    summary(model_1)
    
    Call:
    lm(formula = Total ~ SQFT, data = houses)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -190007  -91711  -64236  -26105 4689712 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  23807.07  126487.61   0.188   0.8511  
    SQFT           135.74      67.58   2.009   0.0473 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 483700 on 98 degrees of freedom
    Multiple R-squared:  0.03955,   Adjusted R-squared:  0.02974 
    F-statistic: 4.035 on 1 and 98 DF,  p-value: 0.04732
  2. From model_1 report the slope of the least squares line.  What does this value tell us?

    The slope of the least squares line is 135.74. This means that for each additional square foot of living space, the model predicts that the Total price of the home increases by about $135.74 on average.
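
    To see the slope in action (a quick sketch using the fitted coefficients above), we can compute the predicted price difference between two otherwise identical homes that differ by 100 square feet:

    # predicted difference in Total for 100 extra square feet (about $13,574)
    unname(coef(model_1)["SQFT"]) * 100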

  3. From model_1 report the value of the R-squared value.  What does this value tell us?

    The R-squared value is 0.03955. This value is interesting: the p-value is below 0.05, so the relationship between SQFT and Total is statistically significant, yet the R-squared is so low that size accounts for very little of the variation in price. This suggests that other variables and relationships must be evaluated to get a better understanding of a home’s Total price.
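
    If we want this value programmatically rather than reading it off the printout, the summary object stores it:

    # extract R-squared directly from the model summary
    summary(model_1)$r.squared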

  4. This data contains an outlier that has a Total value of approximately 4.9 million dollars.  Filter your data to create a new data frame that removes this observation.  Use the filtered data to recreate your linear model.  Give the name “model_2” to this model.  Provide the summary() command of this new model.

    # remove outlier
    houses2 <- subset(houses, Total < 4000000)
    # same process, refit on the filtered data
    model_2 <- lm(Total ~ SQFT, data = houses2)
    summary(model_2)
    
    Call:
    lm(formula = Total ~ SQFT, data = houses2)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -178469  -50742   -6400   25185  418136 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -75819.6    21764.1  -3.484 0.000744 ***
    SQFT           165.9       11.6  14.299  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 82960 on 97 degrees of freedom
    Multiple R-squared:  0.6782,    Adjusted R-squared:  0.6749 
    F-statistic: 204.5 on 1 and 97 DF,  p-value: < 2.2e-16
  5. Examine the R-squared value.  How did this value change between the two models?

    The new R-squared value is 0.6782. Such a drastic increase shows that the single house above 4 million dollars carried a significant amount of weight and heavily skewed the original fit. With that house filtered out, we can now say that almost 68% of the variation in the Total price of houses is explained by the SQFT variable.
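
    A quick side-by-side check makes the change explicit (a sketch reusing the two models above):

    # compare R-squared before and after removing the outlier
    c(model_1 = summary(model_1)$r.squared,
      model_2 = summary(model_2)$r.squared)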

  6. The Acres variable gives the amount of land on which the house is located.   The outlier that is approximately 4.9 million dollars is due to the large number of acres of the property.  We would like to incorporate the Acres variable in our model.   Create a new model called “model_3”  that uses all of the data.  Predict the Total value of the home with both the size of the home (SQFT) and the Acres variable.  Produce a summary() command of this new model.

    # multiple regression with both predictors
    model_3 <- lm(Total ~ SQFT + Acres, data = houses)
    summary(model_3)
    
    Call:
    lm(formula = Total ~ SQFT + Acres, data = houses)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -189677  -48727   -5686   30044  448248 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -99554.31   24126.44  -4.126 7.79e-05 ***
    SQFT           151.79      12.83  11.832  < 2e-16 ***
    Acres       120884.58    2360.00  51.222  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 91800 on 97 degrees of freedom
    Multiple R-squared:  0.9658,    Adjusted R-squared:  0.9651 
    F-statistic:  1368 on 2 and 97 DF,  p-value: < 2.2e-16
  7. Examine the R-squared value of this model.  How did this value change from the model that removed the observation?

    Including Acres in the model improved the fit dramatically, as shown by the R-squared value increasing to 0.9658. This means that nearly 97% of the variation in Total home price is explained by square footage and land size together. By keeping the outlier in model_3 and adding land size as a predictor, we can now see that its high price was largely due to its number of acres. Accounting for this factor made the model much stronger and more accurate overall.
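
    To confirm this explanation, we can inspect the outlier directly (a quick sketch using the column names from the data above):

    # the ~4.9 million dollar home should show an unusually large Acres value
    subset(houses, Total > 4000000, select = c(SQFT, Acres, Total))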

  8. We have a home that has a size of  2750 square feet and sits on 0.5 acres of land.  Use your model to predict the value of this home.

    # can use the built-in predict() function
    sample_house <- data.frame(SQFT = 2750, Acres = 0.5)
    
    #predict
    predict(model_3, newdata = sample_house)
           1 
    378310.1 

Our model predicts the value of a 2,750-square-foot house on 0.5 acres of land to be $378,310.10.
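
As a sanity check (a sketch; small differences from predict() come only from coefficient rounding), the prediction can be reproduced by hand from the fitted coefficients:

# rebuild the prediction from model_3's coefficients
b <- coef(model_3)
unname(b["(Intercept)"] + b["SQFT"] * 2750 + b["Acres"] * 0.5)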

Part B: YouTube Data

YouTube is one of the largest social media platforms. Along with uploading videos, a large social aspect of the platform is the comments that users post about videos. A major problem with such posts is that some are “spam” messages that include undesirable advertising or even malware. In this assignment, we will examine a selection of comments from three music videos posted on YouTube. The videos include “Roar” by Katy Perry, “Gangnam Style” by Psy, and “Love The Way You Lie” by Eminem. The comments include both legitimate and spam messages. We will use the tools of text analysis to compare the spam and legitimate messages. The data set contains 4 variables. The author variable indicates the YouTube user who wrote the comment. Content is the comment itself. Spam is a binary variable that is 1 if the comment is spam and 0 if non-spam. The video variable indicates which of the three videos the comment is from.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
#reading in
yt_comments_a <- read.csv("~/Desktop/yt_comments_a.csv")
glimpse(yt_comments_a)
Rows: 999
Columns: 4
$ author  <chr> "lekanaVEVO1", "Pyunghee", "Erica Ross", "Aviel Haimov", "John…
$ content <chr> "i love this so much. AND also I Generate Free Leads on Auto P…
$ spam    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ video   <chr> "Katie Perry", "Katie Perry", "Katie Perry", "Katie Perry", "K…
  1. Consider again the YouTube file discussed previously. The variable spam is 1 or 0. We would like to change this variable to a character variable. Use an ifelse command to change the spam variable to “yes” and “no”. Use dplyr verbs to count the number of comments that are spam and the number that are not.

    yt_comments_a <- yt_comments_a %>%
      mutate(spam = ifelse(spam == 1, "yes", "no"))
    
    # count
    yt_comments_a %>%
      count(spam)
      spam   n
    1   no 477
    2  yes 522
  2. Just as you did in assignment 4, use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains “http”. Convert your variable to a factor. Create a table that is grouped by your new variable with the count of spam and non-spam comments.

    library(stringr)
    
    # TRUE if the comment contains "http"
    yt_comments_a <- yt_comments_a %>%
      mutate(contains_http = str_detect(content, "http")) %>%
      mutate(contains_http = as.factor(contains_http))
    
    # spam vs. non-spam counts by http flag
    yt_comments_a %>%
      group_by(contains_http, spam) %>%
      summarise(count = n(), .groups = "drop")
    # A tibble: 4 × 3
      contains_http spam  count
      <fct>         <chr> <int>
    1 FALSE         no      469
    2 FALSE         yes     386
    3 TRUE          no        8
    4 TRUE          yes     136
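
    Converting these counts to proportions shows just how predictive the http flag is (a quick sketch reusing the data frame above):

    # spam rate within each level of contains_http
    yt_comments_a %>%
      group_by(contains_http) %>%
      summarise(spam_rate = mean(spam == "yes"))
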
  3. Use the rpart package to create a decision tree model that predicts spam with your variable that detected “http”.  Call this “model_4” and provide the text output of the model object. 

    library(rpart)
    
    # tree
    model_4 <- rpart(spam ~ contains_http, data = yt_comments_a, method = "class")
    model_4
    n= 999 
    
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    
    1) root 999 477 yes (0.47747748 0.52252252)  
      2) contains_http=FALSE 855 386 no (0.54853801 0.45146199) *
      3) contains_http=TRUE 144   8 yes (0.05555556 0.94444444) *
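
    The pair of yprob values in each node gives the (no, yes) class proportions. The same probabilities can be read off programmatically (a sketch; the factor levels are assumed to match the training data):

    # predicted spam probability for each value of the http flag
    predict(model_4,
            newdata = data.frame(contains_http = factor(c(FALSE, TRUE))),
            type = "prob")
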
  4. Use the partykit package to create a visualization of your decision tree from model_4.

    # install.packages() won't render in the document, so it is commented out
    #install.packages("partykit")
    library(partykit)
    Warning: package 'partykit' was built under R version 4.4.1
    Loading required package: grid
    Loading required package: libcoin
    Warning: package 'libcoin' was built under R version 4.4.1
    Loading required package: mvtnorm
    Warning: package 'mvtnorm' was built under R version 4.4.1
    #party object, plot
    plot(as.party(model_4))

  5. Just as you did in assignment 4, create a variable that detects each of the following: the phrase “subscribe”, 3 or more punctuation marks in a row, and 3 or more numbers in a row. Convert each new variable to a factor. Give a glimpse() of your data frame with these additional variables.

    #from last week
    # numbers
    yt_comments_a <- yt_comments_a %>%
      mutate(contains_numbers = str_detect(content, "[0-9]{3,}"),
             contains_numbers = as.factor(contains_numbers))
    
    # punctuation
    yt_comments_a <- yt_comments_a %>%
      mutate(junk_words = str_detect(content, "[[:punct:]]{3,}"),
             junk_words = as.factor(junk_words))
    
    #subscribe
    yt_comments_a <- yt_comments_a %>%
      mutate(subscribe = str_detect(str_to_lower(content), "subscribe"),
             subscribe = as.factor(subscribe))
    
    # check
    glimpse(yt_comments_a)
    Rows: 999
    Columns: 8
    $ author           <chr> "lekanaVEVO1", "Pyunghee", "Erica Ross", "Aviel Haimo…
    $ content          <chr> "i love this so much. AND also I Generate Free Leads …
    $ spam             <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
    $ video            <chr> "Katie Perry", "Katie Perry", "Katie Perry", "Katie P…
    $ contains_http    <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
    $ contains_numbers <fct> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, …
    $ junk_words       <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
    $ subscribe        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
  6. Use rpart to create a decision tree model that predicts the spam variable from all 4 of the variables you created.  Give it the name “model_5”.   Use the partykit package to create a visualization of your decision tree.

    #tree
    model_5 <- rpart(spam ~ contains_http + subscribe + junk_words + contains_numbers,
                     data = yt_comments_a,
                     method = "class")
    
    plot(as.party(model_5))

  7. We might like to use our decision tree to eliminate new comments that are spam. The file Youtube B contains an additional 149 comments that you did not use previously. Load this data file and create each of the four predictor variables for this new data. Be sure to use the same variable names as before and convert them to factors. Give a glimpse() of this new data.

    #reading in b
    yt_comments_b <- read.csv("~/Desktop/yt_comments_b.csv")
    library(dplyr)
    library(stringr)
    
    # vars all
    yt_comments_b <- yt_comments_b %>%
      mutate(
        contains_http = str_detect(content, "http"),
        subscribe = str_detect(str_to_lower(content), "subscribe"),
        junk_words = str_detect(content, "[[:punct:]]{3,}"),
        contains_numbers = str_detect(content, "[0-9]{3,}"),
        contains_http = as.factor(contains_http),
        subscribe = as.factor(subscribe),
        junk_words = as.factor(junk_words),
        contains_numbers = as.factor(contains_numbers)
      )
    
    glimpse(yt_comments_b)
    Rows: 149
    Columns: 9
    $ comment          <int> 1029, 718, 911, 72, 892, 952, 239, 324, 871, 619, 8, …
    $ author           <chr> "Huda Salah", "TMCB production (Instrumental Beats)",…
    $ content          <chr> "Looooved <U+FEFF>", "Check out our Channel for nice …
    $ spam             <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,…
    $ video            <chr> "Eminem", "Eminem", "Eminem", "Katie Perry", "Eminem"…
    $ contains_http    <fct> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
    $ subscribe        <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
    $ junk_words       <fct> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,…
    $ contains_numbers <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,…
  8. Use your model_5 to predict if these comments are spam.   Use dplyr verbs to group the new comments by your predictions and count the amount of true spam.  In a few sentences do you think your tree does a good job identifying spam in new comments?

    # use model_5
    yt_comments_b$predicted_spam <- predict(model_5, newdata = yt_comments_b, type = "class")
    
    # predicted value and actual
    yt_comments_b %>%
      group_by(predicted_spam, spam) %>%
      summarise(count = n(), .groups = "drop")
    # A tibble: 4 × 3
      predicted_spam  spam count
      <fct>          <int> <int>
    1 no                 0    75
    2 no                 1    31
    3 yes                0     1
    4 yes                1    42

Does the tree do a good job? The decision tree correctly predicted 42 spam comments and 75 non-spam comments. However, it also flagged one legitimate comment as spam (a false positive) and missed 31 real spam comments (false negatives). This suggests that while the tree handles the majority of spam and non-spam comments correctly, it still misclassifies a fair number of them, mostly by letting real spam through.
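
A few summary measures computed from the confusion counts above make this trade-off concrete (a quick sketch):

# overall accuracy, spam recall, and precision from the counts above
accuracy  <- (75 + 42) / 149   # share of all comments classified correctly, ~0.785
recall    <- 42 / (42 + 31)    # share of true spam that was caught, ~0.575
precision <- 42 / (42 + 1)     # share of flagged comments that are truly spam, ~0.977
c(accuracy = accuracy, recall = recall, precision = precision)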