Download and open the assign04.qmd file and complete the exercises.
We will be using pay-per-click (PPC) data from a 31-day campaign run by a company that sells USB keys and USB hubs. Each of the 555 observations (rows) represents a click on an internet ad triggered by a keyword search, and there are 3 columns:
day - represents the day of the campaign. Valid days are 1-31.
price - represents the price of the campaign. Price can’t be below 0.10.
keyword - represents the keyword purchased. Every keyword must be spelled correctly; there aren’t many keywords, and each is some combination of “usb” and/or “key” or “hub”.
In this assignment you will be examining each column for data validity. Each exercise presents one or more questions for you to answer.
We’ll start by loading the tidyverse family of packages along with the janitor and skimr packages, and our data. Make sure janitor and skimr are installed in RStudio before calling the library() functions below.
library(tidyverse)
library(janitor)
Warning: package 'janitor' was built under R version 4.4.3
library(skimr)
Warning: package 'skimr' was built under R version 4.4.3
There are six exercises in this assignment. The Grading Rubric is available at the end of this document.
Exercise 1
Create a graph of the number of clicks (i.e., observations) for each day (1-31). Use geom_bar() for your geometry. In the narrative below your code, note which days had zero clicks.
The chart shows that day 27 had zero clicks.
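The chart code itself is not shown in this document. As a minimal sketch only, the following uses a small made-up tibble (`toy_clicks`, a hypothetical stand-in for `ppc_data`) to illustrate one way to draw the bar chart and to confirm zero-click days without relying on the eye. Because `geom_bar()` only draws bars for values that occur in the data, a zero-click day shows up as a gap; `setdiff()` lists such days explicitly.

```r
library(tidyverse)

# Hypothetical stand-in for ppc_data: a 5-day campaign with no clicks on day 4
toy_clicks <- tibble(day = c(1, 1, 2, 3, 3, 3, 5))

# One bar per day; a break on every day makes gaps easy to spot
click_plot <- ggplot(toy_clicks, aes(x = day)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:5) +
  labs(x = "Day of campaign", y = "Number of clicks")
click_plot

# Days in the valid range that never occur in the data
zero_click_days <- setdiff(1:5, toy_clicks$day)
zero_click_days
```

For the assignment data, the same idea would presumably use `1:31` as the valid range and `ppc_data$day` as the observed days.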
Exercise 2
Insert a code cell to show how many NA (i.e., missing) values there are in price. In the narrative below that code cell write out how many NA values there are for price and what percent of the observations that represents.
ppc_data |> filter(is.na(price))
# A tibble: 6 × 3
day keyword price
<dbl> <chr> <dbl>
1 18 key NA
2 25 usb key NA
3 9 usb hub NA
4 16 usb NA
5 13 usb NA
6 13 usb key NA
Out of the 555 observations, 6 have NA values for price, which is 1.08% of the observations (6 / 555).
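The filter above lists the rows with missing prices, and the count and percentage are then read off by hand. As a hedged alternative, the sketch below computes both numbers directly with `summarise()`. It runs on a made-up tibble (`toy_ppc`, hypothetical), not on the assignment data.

```r
library(tidyverse)

# Hypothetical stand-in for ppc_data: 8 clicks, 2 with a missing price
toy_ppc <- tibble(
  day   = c(1, 1, 2, 3, 4, 4, 5, 6),
  price = c(5.9, NA, 2.8, 7.7, NA, 5.5, 5.2, 3.8)
)

na_summary <- toy_ppc |>
  summarise(
    n_missing   = sum(is.na(price)),        # count of NA prices
    pct_missing = mean(is.na(price)) * 100  # share of all rows, in percent
  )
na_summary
```

Applied to the assignment data, the same calculation corresponds to 6 / 555 × 100 ≈ 1.08%.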
Exercise 3
Valid values for price are 0.1 or greater. Insert a code cell that displays the number of values of price that are less than 0.1. In the narrative below that code cell write how many values are below 0.1.
ppc_data |> filter(price < 0.1)
# A tibble: 10 × 3
day keyword price
<dbl> <chr> <dbl>
1 5 key 0
2 4 key 0
3 18 ubs 0
4 6 usb key 0
5 19 key 0
6 14 usb hub 0
7 16 usb key 0
8 23 usb key 0
9 10 hub 0
10 19 usb key 0
Ten price values are less than 0.1; all ten are 0.
Exercise 4
Insert a code cell that drops all of the rows that contain invalid or NA values for price.
ppc_data |> filter(!is.na(price), price >= 0.1)
# A tibble: 539 × 3
day keyword price
<dbl> <chr> <dbl>
1 9 usb 5.9
2 30 usb hub 8
3 19 usb hub 2.8
4 4 usb key 7.7
5 30 key 1.7
6 17 usb hub 5.5
7 5 hub 5.2
8 8 key 3.8
9 17 usb key 6
10 21 usb key 2
# ℹ 529 more rows
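The result above is printed but not stored, so later steps still see all 555 rows (the keyword counts in Exercise 5 sum to 555). The following is a minimal sketch of keeping the cleaned rows in a new object, run on a made-up tibble; `toy_ppc` and `toy_clean` are hypothetical names. Note that `filter()` keeps only rows where the condition is `TRUE`, so a condition such as `price >= 0.1` drops rows with a missing price as well.

```r
library(tidyverse)

# Hypothetical stand-in for ppc_data: one NA price and one price below 0.1
toy_ppc <- tibble(
  day     = c(9, 30, 18, 4, 5),
  keyword = c("usb", "usb hub", "key", "usb key", "key"),
  price   = c(5.9, 8, NA, 7.7, 0)
)

# Comparing NA with 0.1 yields NA, which filter() treats as "do not keep"
toy_clean <- toy_ppc |>
  filter(price >= 0.1)

nrow(toy_clean)
```

Writing `!is.na(price)` alongside the range check is still a reasonable choice, since it makes the intent explicit to the reader.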
Exercise 5
Insert a code cell that shows a tabyl of the counts of each keyword. In the narrative below the code cell, list the misspellings and counts if there are any.
ppc_data |> tabyl(keyword)
keyword n percent
hub 90 0.16216216
key 85 0.15315315
ubs 11 0.01981982
ubs key 10 0.01801802
usb 75 0.13513514
usb hub 121 0.21801802
usb key 163 0.29369369
There are two misspellings: “ubs” appears 11 times instead of “usb”, and “ubs key” appears 10 times instead of “usb key”.
Exercise 6
Insert a code cell that corrects all the misspellings for keyword, then rerun tabyl to verify.
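No answer is shown for this exercise. As a sketch under stated assumptions, one possible approach is to replace the transposed letters with `str_replace()` and then rerun `tabyl()`. The example uses a made-up tibble (`toy_ppc`, hypothetical) containing the two misspellings found in Exercise 5; `toy_fixed` is also a hypothetical name.

```r
library(tidyverse)
library(janitor)

# Hypothetical stand-in for the keyword column, including both misspellings
toy_ppc <- tibble(
  keyword = c("usb", "ubs", "usb key", "ubs key", "hub", "key", "usb hub")
)

# Replace "ubs" with "usb" wherever it occurs as a whole word
toy_fixed <- toy_ppc |>
  mutate(keyword = str_replace(keyword, "\\bubs\\b", "usb"))

# Verify: no "ubs" or "ubs key" rows should remain
toy_fixed |> tabyl(keyword)
```

On the assignment data, the counts for “usb” and “usb key” would be expected to rise by 11 and 10 respectively (to 86 and 173), with the total still 555, provided the corrected result is assigned back before `tabyl()` is rerun.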