Assignment 4

Author

Kelly Ratigan

Published

July 13, 2025

Assignment 4: R Programming Summer 2025

Part A - YouTube Data

YouTube is one of the largest social media platforms. Along with uploading videos, a large social aspect of the platform is the comments that users post about videos. A major problem with such posts is that some are “spam” messages that include undesirable advertising or even malware. In this assignment, we will examine a selection of comments from three music videos posted on YouTube. The videos include “Roar” by Katy Perry, “Gangnam Style” by Psy, and “Love The Way You Lie” by Eminem. The comments include both legitimate and spam messages. We will use the tools of text analysis to compare the spam and legitimate messages. The data set contains 4 variables. The author variable indicates the YouTube user who wrote the comment. Content is the comment itself. Spam is a binary variable that is 1 if the comment is spam and 0 if non-spam. The video variable indicates which of the three videos the comment is from.

  1. Use R to import the file.  Examine the variables in the data frame by performing a glimpse(). 

    library(dplyr)
    
    Attaching package: 'dplyr'
    The following objects are masked from 'package:stats':
    
        filter, lag
    The following objects are masked from 'package:base':
    
        intersect, setdiff, setequal, union
    #used the top right manual entry to read in data:
    yt_comments_a <- read.csv("~/Desktop/yt_comments_a.csv")
    glimpse(yt_comments_a)
    Rows: 999
    Columns: 4
    $ author  <chr> "lekanaVEVO1", "Pyunghee", "Erica Ross", "Aviel Haimov", "John…
    $ content <chr> "i love this so much. AND also I Generate Free Leads on Auto P…
    $ spam    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
    $ video   <chr> "Katie Perry", "Katie Perry", "Katie Perry", "Katie Perry", "K…
  2. We would like to know how many spam and non-spam comments come from each of the artists (given in the variable called “video”).   Use dplyr verbs to create a table that is grouped by the video variable with the count of spam and non-spam comments.

    yt_comments_a %>%
      group_by(video, spam) %>%
      summarise(count = n(), .groups = "drop")
    # A tibble: 6 × 3
      video        spam count
      <chr>       <int> <int>
    1 Eminem          0   171
    2 Eminem          1   226
    3 Katie Perry     0   151
    4 Katie Perry     1   144
    5 Psy             0   155
    6 Psy             1   152
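
The grouped count above can also be written more compactly. The following is a minimal sketch on hypothetical toy data (the `toy` data frame and its values are made up for illustration); it shows `count()` as shorthand for `group_by()` plus `summarise(n())`, and uses the mean of the 0/1 spam variable to get the share of spam per video.

```r
library(dplyr)

# Hypothetical toy data standing in for yt_comments_a
toy <- data.frame(
  video = c("Eminem", "Eminem", "Eminem", "Psy", "Psy"),
  spam  = c(1L, 1L, 0L, 0L, 1L)
)

# count() is shorthand for group_by() + summarise(n())
counts <- toy %>% count(video, spam, name = "count")

# The mean of a 0/1 variable is the proportion of 1s (here: share of spam)
spam_share <- toy %>%
  group_by(video) %>%
  summarise(spam_share = mean(spam), .groups = "drop")

print(counts)
print(spam_share)
```

A proportion per video can be easier to compare than raw counts when the videos have different numbers of comments.
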
  3. One element that might be indicative of spam comments is a link to other websites.  Use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains “http”.  Create a table that is grouped by your new variable with the count of spam and non-spam comments. 

    library(stringr)
    
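    #true if the comment contains http
    #note: in "http?://" the ? makes the p optional, so "http://" matches but "https://" does not;
    #"https?://" (or plain "http") would catch both. The counts below come from the pattern as written.
    yt_comments_a <- yt_comments_a %>%
      mutate(contains_http = str_detect(content, "http?://"))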
    
    #count
    yt_comments_a %>%
      group_by(contains_http, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'contains_http'. You can override using the
    `.groups` argument.
    # A tibble: 4 × 3
    # Groups:   contains_http [2]
      contains_http  spam count
      <lgl>         <int> <int>
    1 FALSE             0   469
    2 FALSE             1   450
    3 TRUE              0     8
    4 TRUE              1    72
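
In the pattern `"http?://"` used above, the `?` applies to the `p`, so the pattern matches `http://` but not `https://`. The sketch below compares it with two alternatives on a few made-up example comments (the strings are hypothetical, not taken from the data set).

```r
library(stringr)

# Hypothetical example comments (not from the data set)
examples <- c(
  "check out http://example.com",
  "check out https://example.com",
  "visit www.example.com",
  "no link here"
)

# Pattern as written above: the "?" makes the preceding "p" optional
p_optional <- str_detect(examples, "http?://")

# Making the "s" optional instead matches both schemes
s_optional <- str_detect(examples, "https?://")

# Matching the bare string "http", as the question is worded
bare_http <- str_detect(examples, "http")

data.frame(examples, p_optional, s_optional, bare_http)
```

None of the three patterns catches links written without a scheme, such as `www.example.com`; whether that matters depends on how links appear in the actual comments.
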
  4. Another element that might be indicative of spam comments is the use of numbers.  Use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains 3 or more numbers in a row.  Create a table that is grouped by your new variable with the count of spam and non-spam comments.

    #3+ digits in a row
    yt_comments_a <- yt_comments_a %>%
      mutate(contains_numbers = str_detect(content, "[0-9]{3,}"))
    
    # count 
    yt_comments_a %>%
      group_by(contains_numbers, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'contains_numbers'. You can override using
    the `.groups` argument.
    # A tibble: 4 × 3
    # Groups:   contains_numbers [2]
      contains_numbers  spam count
      <lgl>            <int> <int>
    1 FALSE                0   415
    2 FALSE                1   386
    3 TRUE                 0    62
    4 TRUE                 1   136
  5. Comments that use lots of punctuation marks (e.g. !!!!?!??!) might also be spam.  Use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains 3 or more punctuation marks in a row.  Create a table that is grouped by your new variable with the count of spam and non-spam comments.

    #new variable junk_words: true if the comment has 3+ punctuation marks in a row
    yt_comments_a <- yt_comments_a %>%
      mutate(junk_words = str_detect(content, "[[:punct:]]{3,}"))
    
    # count 
    yt_comments_a %>%
      group_by(junk_words, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'junk_words'. You can override using the
    `.groups` argument.
    # A tibble: 4 × 3
    # Groups:   junk_words [2]
      junk_words  spam count
      <lgl>      <int> <int>
    1 FALSE          0   390
    2 FALSE          1   328
    3 TRUE           0    87
    4 TRUE           1   194
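
One thing to keep in mind with `[[:punct:]]{3,}` is that it matches any run of punctuation, including the `://` inside a link and an ellipsis. The sketch below, on made-up example strings, contrasts it with a narrower pattern that only looks for runs of `!` and `?`. The narrower pattern is an illustrative alternative, not something the assignment asks for.

```r
library(stringr)

# Hypothetical example comments (not from the data set)
examples <- c(
  "WOW!!!!",
  "hmm...",
  "great song!",
  "what?!?!",
  "see http://example.com"
)

# Any three or more punctuation characters in a row
any_punct <- str_detect(examples, "[[:punct:]]{3,}")

# Only runs of exclamation and question marks
shouting <- str_detect(examples, "[!?]{3,}")

data.frame(examples, any_punct, shouting)
```

Because links also trigger the broad pattern, a punctuation flag built this way partly overlaps with a link flag.
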
  6. Some phrases might also indicate spam.  Use dplyr verbs and regex functions to create a new logical variable in your data set that is true if the comment contains the phrase subscribe.  Create a table that is grouped by your new variable with the count of spam and non-spam comments.

    #true if subscribe
    yt_comments_a <- yt_comments_a %>%
      mutate(subscribe = str_detect(str_to_lower(content), "subscribe"))
    
    #and spam
    yt_comments_a %>%
      group_by(subscribe, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'subscribe'. You can override using the
    `.groups` argument.
    # A tibble: 4 × 3
    # Groups:   subscribe [2]
      subscribe  spam count
      <lgl>     <int> <int>
    1 FALSE         0   475
    2 FALSE         1   398
    3 TRUE          0     2
    4 TRUE          1   124
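
Lower-casing the text before matching is one way to ignore case. A roughly equivalent sketch uses stringr's `regex()` helper with `ignore_case = TRUE`, shown below on made-up strings. The word-boundary variant is an optional refinement, added here only to show that the plain pattern also matches longer words such as "subscribers".

```r
library(stringr)

# Hypothetical example comments (not from the data set)
examples <- c(
  "SUBSCRIBE to my channel",
  "Please Subscribe!",
  "10 million subscribers",
  "nice video"
)

# Approach used above: lower-case first, then match
lowered <- str_detect(str_to_lower(examples), "subscribe")

# Same result without modifying the text
ignore_case <- str_detect(examples, regex("subscribe", ignore_case = TRUE))

# Stricter: "subscribe" as a whole word only
whole_word <- str_detect(examples, regex("\\bsubscribe\\b", ignore_case = TRUE))

data.frame(examples, lowered, ignore_case, whole_word)
```
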
  7. Examine the spam and non-spam comments in the data.  What other phrases or characteristics do you think might be useful for identifying spam?  Create another logical variable that identifies your chosen characteristic of the comments.  Create a table that is grouped by your new variable with the count of spam and non-spam comments.

    #alongside links, spam may ask for direct contact,
    #e.g. "DM me" or "text me"
    library(dplyr)
    library(stringr)
    
    #dm me
    yt_comments_a <- yt_comments_a %>%
      mutate(dm_me = str_detect(str_to_lower(content), "dm me"))
    
    # spam
    yt_comments_a %>%
      group_by(dm_me, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'dm_me'. You can override using the
    `.groups` argument.
    # A tibble: 2 × 3
    # Groups:   dm_me [1]
      dm_me  spam count
      <lgl> <int> <int>
    1 FALSE     0   477
    2 FALSE     1   522
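
The table above has only FALSE rows, which means the phrase "dm me" never occurs in the comments. One way to avoid picking a phrase with no matches is to screen several candidates first. The sketch below does this on made-up comments; the `comments` vector and the `candidates` list are hypothetical choices for illustration.

```r
library(stringr)

# Hypothetical toy comments standing in for yt_comments_a$content
comments <- c(
  "Check out my channel please",
  "check out my new video",
  "love this song",
  "Follow me on twitter",
  "best song ever"
)

# Candidate phrases to screen (illustrative choices, not from the assignment)
candidates <- c("dm me", "check out", "follow me")

# Number of comments matching each candidate, ignoring case
hits <- sapply(candidates, function(p) {
  sum(str_detect(comments, regex(p, ignore_case = TRUE)))
})

hits
```

A candidate with zero hits cannot separate spam from non-spam, so it can be dropped before building the grouped table.
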
  8. Review all of the logical variables you have created (http, numbers, etc).  Which of the variables did the best at distinguishing spam from non-spam?  Explain your choice.

    The variable that best indicates whether a comment is spam is most likely contains_http: 72 of the 80 comments it flags are spam. Most spam messages try to redirect the viewer to an external website, and the easiest way to do that is to include a direct link. Because a link offers the least resistance for someone to click away from the video, links are frequently used in spam comments.
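
One way to put numbers behind a comparison like this is to compute, for each flag, the share of flagged comments that are spam (precision) and the share of all spam that the flag catches (recall). The sketch below uses the counts printed in the four tables above; the column names `flagged_spam`, `flagged_nonspam`, and `missed_spam` are made up for this example.

```r
# Counts copied from the four grouped tables above
flags <- data.frame(
  flag            = c("contains_http", "contains_numbers", "junk_words", "subscribe"),
  flagged_spam    = c(72, 136, 194, 124),  # flag TRUE,  spam == 1
  flagged_nonspam = c(8, 62, 87, 2),       # flag TRUE,  spam == 0
  missed_spam     = c(450, 386, 328, 398)  # flag FALSE, spam == 1
)

# Precision: share of flagged comments that are spam
flags$precision <- flags$flagged_spam / (flags$flagged_spam + flags$flagged_nonspam)

# Recall: share of all spam comments that the flag catches
flags$recall <- flags$flagged_spam / (flags$flagged_spam + flags$missed_spam)

print(flags[, c("flag", "precision", "recall")], digits = 3)
```

Which flag looks best depends on which of the two measures matters more: a flag can be very precise while still missing most of the spam.
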

  9. Again review the logical variables you created.  Can you create a combination of these variables that would better distinguish spam and non-spam?  Create the combination and provide a table grouped by your combination with a count of the spam and non-spam comments.

    # combine the flags: true if the comment has a link, a punctuation run, or "subscribe"
    yt_comments_a <- yt_comments_a %>%
      mutate(spam_combo = contains_http | junk_words | subscribe)
    
    # count
    yt_comments_a %>%
      group_by(spam_combo, spam) %>%
      summarise(count = n())
    `summarise()` has grouped output by 'spam_combo'. You can override using the
    `.groups` argument.
    # A tibble: 4 × 3
    # Groups:   spam_combo [2]
      spam_combo  spam count
      <lgl>      <int> <int>
    1 FALSE          0   389
    2 FALSE          1   224
    3 TRUE           0    88
    4 TRUE           1   298
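
The combination above flags a comment when any one of the three indicators is true. A stricter variant is to count how many indicators fire and require at least two. The sketch below shows both rules on a hypothetical toy data frame (the values are made up); it is an illustration of the idea, not a claim about which rule works better on the real data.

```r
library(dplyr)

# Hypothetical toy flags standing in for the columns created above
toy <- data.frame(
  contains_http = c(TRUE, FALSE, FALSE, TRUE),
  junk_words    = c(TRUE, TRUE, FALSE, FALSE),
  subscribe     = c(TRUE, FALSE, FALSE, TRUE),
  spam          = c(1L, 0L, 0L, 1L)
)

toy <- toy %>%
  mutate(
    n_flags     = contains_http + junk_words + subscribe,  # logicals add as 0/1
    any_flag    = n_flags >= 1,  # same as contains_http | junk_words | subscribe
    two_or_more = n_flags >= 2   # stricter rule
  )

toy %>% count(two_or_more, spam, name = "count")
```

Requiring more flags usually trades fewer false alarms for more missed spam, so the threshold is a judgment call.
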

Part B - House Data

  1. Open both the Houses and the Houses Geocode data in R. Perform a join to put these two datasets together. Provide a glimpse of the resulting data set.

    #read in 
    houses_geocode <- read.csv("~/Desktop/houses_geocode.csv")
    houses <- read.csv("~/Desktop/Houses.csv")
    
    # join data
    houses_both <- left_join(houses, houses_geocode, by = "Address")
    glimpse(houses_both)
    Rows: 100
    Columns: 15
    $ ID.        <int> 153159, 155943, 44627, 119645, 78570, 53841, 44951, 109325,…
    $ Year       <int> 1990, 1986, 1955, 1983, 2000, 1967, 1924, 1980, 1989, 1994,…
    $ SQFT       <dbl> 2102.1, 1740.0, 795.0, 1152.0, 1404.0, 1752.0, 1829.0, 1246…
    $ Story      <dbl> 1.00, 1.50, 1.00, 1.00, 1.00, 1.00, 1.50, 1.00, 2.00, 2.50,…
    $ Acres      <dbl> 0.77, 0.06, 0.30, 0.68, 39.38, 0.29, 0.25, 0.29, 0.25, 0.36…
    $ Baths      <dbl> NA, 3.0, 1.0, 2.0, 2.0, 1.5, 3.0, 2.0, 2.5, 3.0, 2.0, 1.0, …
    $ Fireplaces <int> 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
    $ Total      <dbl> 203200.0, 119096.0, 71666.3, 131103.0, 4904102.0, 144452.0,…
    $ land       <dbl> 44000.0, 22000.0, 28000.2, 50000.0, 4797750.0, 52000.0, 292…
    $ building   <dbl> 159200.0, 97096.0, 43666.1, 81103.0, 106352.0, 92452.0, 203…
    $ Zip.x      <int> 27603, 27604, 27610, 27616, 27519, 27604, 27607, 27511, 276…
    $ Address    <chr> "924 Shadywood Ln", "3655 Top of The Pines Ct", "428 Lansin…
    $ lon        <dbl> -78.62990, -78.56623, -78.60988, -78.55883, -78.89214, -78.…
    $ lat        <dbl> 35.64545, 35.83101, 35.77383, 35.87781, 35.84087, 35.82526,…
    $ Zip.y      <int> 27603, 27604, 27610, 27616, 27519, 27604, 27607, 27511, 276…
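
The `Zip.x` and `Zip.y` columns in the glimpse appear because both tables contain a `Zip` column and the join key is `Address` only. If the two `Zip` columns are assumed to agree for every address, joining on both columns keeps a single `Zip`. The sketch below shows the difference on hypothetical toy tables (`houses_toy` and `geocode_toy` are made up for illustration).

```r
library(dplyr)

# Hypothetical toy tables mirroring the structure of houses and houses_geocode
houses_toy <- data.frame(
  Address = c("1 Main St", "2 Oak Ave"),
  Zip     = c(27603L, 27604L),
  SQFT    = c(1500, 2100)
)
geocode_toy <- data.frame(
  Address = c("1 Main St", "2 Oak Ave"),
  Zip     = c(27603L, 27604L),
  lon     = c(-78.63, -78.57),
  lat     = c(35.65, 35.83)
)

# Joining on Address only duplicates the shared Zip column as Zip.x / Zip.y
by_address <- left_join(houses_toy, geocode_toy, by = "Address")
names(by_address)

# Joining on both shared columns keeps a single Zip column
by_both <- left_join(houses_toy, geocode_toy, by = c("Address", "Zip"))
names(by_both)
```

If the zip codes could differ between the two files, joining on both columns would leave some rows without coordinates, so it is worth checking that assumption first.
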
  2. To create a map of these houses we need a bounding box. Create an appropriate bounding box that will allow all of the data to be mapped.  Using your bounding box, create a map of the Wake County North Carolina area that shows the locations of the houses in the data set.  Use a “toner” map as your map layer.

    library(tidyverse)
    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ forcats   1.0.0     ✔ readr     2.1.5
    ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
    ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
    ✔ purrr     1.0.2     
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
    library(ggmap)
    Warning: package 'ggmap' was built under R version 4.4.1
    ℹ Google's Terms of Service: <https://mapsplatform.google.com>
      Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service>
      OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles>
    ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
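    #API key: read from an environment variable rather than hard-coding it in the document
    #(the variable name STADIA_API_KEY is an assumption; set it in ~/.Renviron)
    register_stadiamaps(key = Sys.getenv("STADIA_API_KEY"), write = FALSE)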
    
    #bounding box with lat and lon
    bbox <- c(
      left = min(houses_both$lon, na.rm = TRUE),
      bottom = min(houses_both$lat, na.rm = TRUE),
      right = max(houses_both$lon, na.rm = TRUE),
      top = max(houses_both$lat, na.rm = TRUE)
    )
    
    #w/ toner
    wake_map <- get_stadiamap(
      bbox = bbox,
      maptype = "stamen_toner",
      zoom = 11
    )
    ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
    #house plot
    ggmap(wake_map) +
      geom_point(data = houses_both, aes(x = lon, y = lat), color = "blue", alpha = 0.6, size = 1.5)
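
A bounding box built from the exact minimum and maximum coordinates places the outermost houses right on the map edge. A small margin can help. The sketch below is one possible helper, written in base R; the function name `padded_bbox`, the 5% default margin, and the sample coordinates are all assumptions made for illustration.

```r
# Hypothetical coordinates standing in for houses_both$lon and houses_both$lat
lon <- c(-78.89, -78.63, -78.56)
lat <- c(35.65, 35.77, 35.88)

# Pad each side by a fraction of the data range so edge points are not
# drawn on the map border (the 5% default is an arbitrary choice)
padded_bbox <- function(lon, lat, pad = 0.05) {
  lon_rng <- range(lon, na.rm = TRUE)
  lat_rng <- range(lat, na.rm = TRUE)
  lon_pad <- diff(lon_rng) * pad
  lat_pad <- diff(lat_rng) * pad
  c(
    left   = lon_rng[1] - lon_pad,
    bottom = lat_rng[1] - lat_pad,
    right  = lon_rng[2] + lon_pad,
    top    = lat_rng[2] + lat_pad
  )
}

bbox_padded <- padded_bbox(lon, lat)
bbox_padded
```

The result has the same `left`, `bottom`, `right`, `top` names as the `bbox` vector above, so it could be passed to the `bbox` argument in the same way.
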

  3. Refine your map to include color coding for the variable “SQFT.” 

    # map by SQFT
    ggmap(wake_map) +
      geom_point(
        data = houses_both,
        aes(x = lon, y = lat, color = SQFT),
        alpha = 0.7,
        size = 2
      ) +
      scale_color_viridis_c(option = "plasma") +
      labs(title = "WC Houses by Square Footage", color = "SQFT") 

  4. Refine your map of the SQFT variable by creating a facet grid of the map based on the Story variable.  Consider this map. Do you think visualization provides insights to the story behind the data?  Explain in a few sentences.

    ggmap(wake_map) +
      geom_point(
        data = houses_both,
        aes(x = lon, y = lat, color = SQFT),
        alpha = 0.7,
        size = 2
      ) +
      scale_color_viridis_c(option = "plasma") +
      facet_wrap(~ Story) +
      labs(title = "WC Houses by SQFT and # of Stories", color = "SQFT") 

    Written Answer: This plot absolutely provides insight into the story behind the data. Faceting the map by number of stories gives a better understanding of what the most common houses in the area look like, why they may have less square footage, where they are, and how they compare to the larger homes in the area.

  5. Costco warehouses are a popular shopping location. The data set “costco” contains the street addresses for the three Costco warehouses in Wake County. Load this data into R and geocode their locations using the tidygeocoder. Provide a glimpse of the resulting data set.

    Costco <- read.csv("~/Desktop/Costco.csv")
    library(tidygeocoder)
    Warning: package 'tidygeocoder' was built under R version 4.4.1
    
    Attaching package: 'tidygeocoder'
    The following object is masked from 'package:ggmap':
    
        geocode
    library(dplyr)
    
    #geocode
    costco_geo <- Costco %>%
      geocode(address = Address, method = "osm", lat = lat, long = lon)
    Passing 3 addresses to the Nominatim single address geocoder
    Query completed in: 3 seconds
    #glimpse
    glimpse(costco_geo)
    Rows: 3
    Columns: 6
    $ Address <chr> "7800 Fayetteville Rd", "1021 Pine Plaza Dr", "2838 Wake Fores…
    $ City    <chr> " Raleigh", " Apex", " Raleigh"
    $ State   <chr> "NC", "NC", "NC"
    $ Zip     <int> 27603, 27523, 27609
    $ lat     <dbl> 35.66883, 35.74665, 35.81807
    $ lon     <dbl> -78.69932, -78.82686, -78.62147
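
The geocoding call above passes only the street address, while the city, state, and zip sit in separate columns (and `City` has a leading space in the glimpse). A possible refinement is to build one full address string first. The sketch below does that for two rows copied from the glimpse output; the column name `full_address` is made up for illustration, and no geocoding request is sent.

```r
library(dplyr)
library(stringr)

# Two rows shaped like the Costco data above (note the leading space in City)
costco_toy <- data.frame(
  Address = c("7800 Fayetteville Rd", "1021 Pine Plaza Dr"),
  City    = c(" Raleigh", " Apex"),
  State   = c("NC", "NC"),
  Zip     = c(27603L, 27523L)
)

costco_toy <- costco_toy %>%
  mutate(
    City         = str_trim(City),  # drop the stray leading space
    full_address = paste0(Address, ", ", City, ", ", State, " ", Zip)
  )

costco_toy$full_address
```

The combined column could then be supplied as the `address` argument of `geocode()` in place of `Address`; whether that changes the returned coordinates for these three stores is not something this sketch checks.
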
  6. Recreate your map of the SQFT variable (without facet wrapping) and include the locations of the Costco warehouses. Use a distinctive marker to indicate the locations. Consider using the shape argument in ggplot.

    ggmap(wake_map) +
      geom_point(
        data = houses_both,
        aes(x = lon, y = lat, color = SQFT),
        alpha = 0.7,
        size = 2
      ) +
      geom_point(
        data = costco_geo,
        aes(x = lon, y = lat),
        shape = 17,   # triangle
        color = "red",
        size = 4
      ) +
      scale_color_viridis_c(option = "plasma") +
      labs(
        title = "WC Houses by SQFT with Costco Spots",
        color = "SQFT"
      ) 

  7. Recreate your map from the previous question, but make use of the terrain version of the Stamen map.

    wake_map_terrain <- get_stadiamap(
      bbox = bbox,
      maptype = "stamen_terrain",
      zoom = 11
    )
    ℹ © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
    #sqft
    ggmap(wake_map_terrain) +
      geom_point(
        data = houses_both,
        aes(x = lon, y = lat, color = SQFT),
        alpha = 0.7,
        size = 2
      ) +
      geom_point(
        data = costco_geo,
        aes(x = lon, y = lat),
        shape = 17,
        color = "red",
        size = 4
      ) +
      scale_color_viridis_c(option = "plasma") +
      labs(
        title = "WC Houses by SQFT (Terrain)",
        color = "SQFT"
      )