Assignment 4

Author

Garrett DeVoe

Go to the shared posit.cloud workspace for this class and open the assign04 project. Open the assign04.qmd file and complete the exercises.

We will be using pay-per-click (PPC) data from a 31-day campaign run by a company that sells USB keys and USB hubs. The data set has 555 observations and 3 columns; each row represents one click on an internet ad triggered by a keyword search.

In this assignment you will be examining each column for data validity. Each exercise presents one or more questions for you to answer.

We’ll start by loading the tidyverse family of packages along with the janitor and skimr packages, and our data.

library(tidyverse)
library(janitor)
library(skimr)
ppc_data <- read_csv("https://jsuleiman.com/datasets/ppc_data.csv")
glimpse(ppc_data)
Rows: 555
Columns: 3
$ day     <dbl> 9, 30, 19, 4, 30, 17, 5, 8, 17, 21, 4, 13, 29, 25, 25, 7, 4, 5…
$ keyword <chr> "usb", "usb hub", "usb hub", "usb key", "key", "usb hub", "hub…
$ price   <dbl> 5.9, 8.0, 2.8, 7.7, 1.7, 5.5, 5.2, 3.8, 6.0, 2.0, 9.7, 7.0, 8.…

Exercises

There are six exercises in this assignment. The Grading Rubric is available at the end of this document.

Exercise 1

Create a graph of number of clicks (i.e., observations) for each day (1-31). Use geom_bar() for your geometry. In the narrative below your code note which days had zero clicks.

ppc_data |>
  ggplot(aes(x = day)) + 
  geom_bar() + 
  labs(title = "Number of clicks for each day",
       y = "Number of clicks",
       x = "Day") +
  theme_minimal()

The days that had zero clicks were days 1 and 27.
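
One way to double-check the zero-click days without reading them off the chart is to compare the full set of campaign days with the days that actually appear in the data. The sketch below uses a small made-up `day` vector and a shortened 10-day campaign (both hypothetical, not the real `ppc_data`); with the real data the same idea would be `setdiff(1:31, ppc_data$day)`.

```r
# Hypothetical click days for a 10-day campaign (illustrative only)
day <- c(2, 3, 3, 4, 5, 5, 5, 6, 8, 9, 10)

# Days in the campaign window that never appear in the data
campaign_days <- 1:10
zero_click_days <- setdiff(campaign_days, day)
print(zero_click_days)
```

Because `geom_bar()` simply draws no bar for a day with no rows, a check like this is less error-prone than spotting gaps by eye.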

Exercise 2

Insert a code cell to show how many NA (i.e., missing) values there are in price. In the narrative below that code cell write out how many NA values there are for price and what percent of the observations that represents.

ppc_data |> 
  tabyl(price) |>
  filter(is.na(price))
 price n    percent valid_percent
    NA 6 0.01081081            NA

There are 6 NA values for price. This represents 1.08% of the observations.
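
The count and percentage can also be computed directly, which avoids reading them off a frequency table. This is a minimal sketch on a made-up `price` vector (hypothetical values, not the real data):

```r
# Hypothetical prices with two missing values (illustrative only)
price <- c(5.9, NA, 2.8, 0.0, 7.7, NA, 1.7, 3.2)

n_missing   <- sum(is.na(price))          # number of NA values
pct_missing <- mean(is.na(price)) * 100   # share of all observations, in percent

cat(n_missing, "NA values,", round(pct_missing, 2), "% of observations\n")
```

Applied to the real data, the same arithmetic is 6 / 555 * 100, which rounds to the 1.08% reported above.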

Exercise 3

Valid values for price are 0.1 or greater. Insert a code cell that displays the number of values of price that are less than 0.1. In the narrative below that code cell write how many values are below 0.1.

ppc_data |> 
  tabyl(price) |>
  filter(price < 0.1)
 price  n    percent valid_percent
     0 10 0.01801802    0.01821494

There are 10 values below 0.1, and all of them are 0.0. They represent 1.80% of all observations and 1.82% of the valid (non-NA) observations.
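
The two percentages in the tabyl output differ only in the denominator: `percent` divides by all 555 rows, while `valid_percent` divides by the 549 rows whose price is not NA (555 minus the 6 missing values found in Exercise 2). A quick check of that arithmetic, using only counts already reported in this document:

```r
n_total   <- 555   # all observations
n_missing <- 6     # NA prices (Exercise 2)
n_below   <- 10    # prices below 0.1 (tabyl output above)

percent       <- n_below / n_total
valid_percent <- n_below / (n_total - n_missing)

# Express both shares in percent, rounded to two decimals
round(c(percent = percent, valid_percent = valid_percent) * 100, 2)
```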

Exercise 4

Insert a code cell that drops all of the rows that contain invalid or NA values for price.

ppc_data <- ppc_data |> 
  filter(!is.na(price) & price >= 0.1)
ppc_data |>
  tabyl(price)
 price  n     percent
   0.1 10 0.018552876
   0.2  5 0.009276438
   0.3  2 0.003710575
   0.4  6 0.011131725
   0.5  6 0.011131725
   0.6  6 0.011131725
   0.7  9 0.016697588
   0.8  8 0.014842301
   0.9  6 0.011131725
   1.0  5 0.009276438
   1.1  4 0.007421150
   1.2  5 0.009276438
   1.3 10 0.018552876
   1.4  3 0.005565863
   1.5  8 0.014842301
   1.6  2 0.003710575
   1.7  9 0.016697588
   1.8  5 0.009276438
   1.9  5 0.009276438
   2.0  1 0.001855288
   2.1 10 0.018552876
   2.2  8 0.014842301
   2.3  4 0.007421150
   2.4  3 0.005565863
   2.5  4 0.007421150
   2.6 10 0.018552876
   2.7  3 0.005565863
   2.8  5 0.009276438
   2.9  4 0.007421150
   3.0  9 0.016697588
   3.1  8 0.014842301
   3.2  7 0.012987013
   3.3  5 0.009276438
   3.4  8 0.014842301
   3.5  4 0.007421150
   3.6  7 0.012987013
   3.7  8 0.014842301
   3.8  2 0.003710575
   3.9  3 0.005565863
   4.0  7 0.012987013
   4.1  6 0.011131725
   4.2  7 0.012987013
   4.3  3 0.005565863
   4.5  4 0.007421150
   4.6  8 0.014842301
   4.7 11 0.020408163
   4.8  5 0.009276438
   4.9  4 0.007421150
   5.0  2 0.003710575
   5.1  3 0.005565863
   5.2 11 0.020408163
   5.3  4 0.007421150
   5.4  4 0.007421150
   5.5  7 0.012987013
   5.6  6 0.011131725
   5.7  5 0.009276438
   5.8  4 0.007421150
   5.9  6 0.011131725
   6.0  6 0.011131725
   6.1  3 0.005565863
   6.2  7 0.012987013
   6.3  4 0.007421150
   6.4  4 0.007421150
   6.5  8 0.014842301
   6.6  4 0.007421150
   6.7  6 0.011131725
   6.8  2 0.003710575
   6.9  8 0.014842301
   7.0  4 0.007421150
   7.1  4 0.007421150
   7.2  7 0.012987013
   7.3  6 0.011131725
   7.4  5 0.009276438
   7.5  4 0.007421150
   7.6  5 0.009276438
   7.7  7 0.012987013
   7.8  7 0.012987013
   7.9  2 0.003710575
   8.0  7 0.012987013
   8.1  7 0.012987013
   8.2  4 0.007421150
   8.3  6 0.011131725
   8.4  4 0.007421150
   8.5  7 0.012987013
   8.6  7 0.012987013
   8.7  7 0.012987013
   8.8  1 0.001855288
   8.9  3 0.005565863
   9.0  2 0.003710575
   9.1  5 0.009276438
   9.2  4 0.007421150
   9.3  4 0.007421150
   9.4  5 0.009276438
   9.5  5 0.009276438
   9.6  4 0.007421150
   9.7  7 0.012987013
   9.8  5 0.009276438
   9.9  5 0.009276438
  10.0  3 0.005565863
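
A quick sanity check on the result: 555 rows minus 6 NA prices minus 10 prices below 0.1 leaves 539 rows, which is consistent with the first row of the table (10 / 539 ≈ 0.01855). The sketch below, on a small made-up tibble (hypothetical values), also illustrates that `dplyr::filter()` drops rows where the condition evaluates to NA, so the explicit `!is.na(price)` is a readability choice rather than a strict requirement.

```r
library(dplyr)

# Hypothetical data (illustrative only): one NA and one invalid price
toy <- tibble(price = c(5.9, NA, 0.0, 0.1, 7.7))

explicit <- toy |> filter(!is.na(price) & price >= 0.1)
implicit <- toy |> filter(price >= 0.1)   # NA >= 0.1 is NA, and filter() drops it

nrow(explicit)                             # rows kept by the explicit version
identical(explicit$price, implicit$price)  # both versions keep the same prices
```

Keeping the explicit NA test documents the intent, which can matter if the code is later ported to base R subsetting, where NA conditions behave differently.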

Exercise 5

Insert a code cell that shows a tabyl of the counts of each keyword. In the narrative below the code cell, list the misspellings and counts if there are any.

ppc_data |>
  tabyl(keyword)
 keyword   n    percent
     hub  89 0.16512059
     key  81 0.15027829
     ubs  10 0.01855288
 ubs key  10 0.01855288
     usb  73 0.13543599
 usb hub 119 0.22077922
 usb key 157 0.29128015

The misspellings are “ubs” (10 values) and “ubs key” (10 values), which should be “usb” and “usb key”, respectively.
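
Misspellings can also be flagged programmatically by comparing the observed keywords with a list of expected ones. In the sketch below, the `valid_keywords` vector is an assumption (the assignment does not define an official list), and the `keyword` vector is made up for illustration.

```r
# Assumed list of correctly spelled keywords (not given in the assignment)
valid_keywords <- c("hub", "key", "usb", "usb hub", "usb key")

# Hypothetical keyword column (illustrative only)
keyword <- c("usb", "ubs", "usb key", "ubs key", "hub", "ubs", "key")

# Keep only the values that are not in the expected list, then count them
suspect <- keyword[!keyword %in% valid_keywords]
table(suspect)
```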

Exercise 6

Insert a code cell that corrects all the misspellings for keyword, then rerun tabyl to verify.

ppc_data <- ppc_data |>
  mutate(keyword = replace(keyword, keyword == "ubs", "usb")) |>
  mutate(keyword = replace(keyword, keyword == "ubs key", "usb key"))
ppc_data |> 
  tabyl(keyword)
 keyword   n   percent
     hub  89 0.1651206
     key  81 0.1502783
     usb  83 0.1539889
 usb hub 119 0.2207792
 usb key 167 0.3098330
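
The corrected counts are consistent with the earlier table: 73 + 10 = 83 for “usb” and 157 + 10 = 167 for “usb key”. Because both misspellings share the same transposed prefix, a single pattern replacement could also cover them. This is one possible alternative, sketched on a made-up tibble (hypothetical values); it assumes that no valid keyword begins with “ubs”.

```r
library(dplyr)

# Hypothetical keyword data (illustrative only)
toy <- tibble(keyword = c("ubs", "ubs key", "usb hub", "hub", "usb"))

# Replace a leading "ubs" with "usb"; other keywords are left unchanged
fixed <- toy |>
  mutate(keyword = sub("^ubs", "usb", keyword))

fixed$keyword
```

The two explicit `replace()` calls used in the exercise are more conservative, since they touch only exact matches; the pattern version trades that safety for brevity.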

Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document
  • Render your document to HTML and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Each item is scored at one of four levels: 100% (flawless), 67% (minor issues), 33% (moderate issues), or 0% (major issues or not attempted). The percent of the overall grade for each item is shown in parentheses.

  • Narrative: typos and grammatical errors (7%)
  • Document formatting: correctly implemented instructions (7%)
  • Exercises (13% each)
  • Submitted properly to Brightspace (8%). The 67% and 33% levels are NA for this item. You must submit according to instructions to receive any credit for this portion.