Assignment 4

Author

Constance Nahimana

Download and open the assign04.qmd file and complete the exercises.

We will be using pay-per-click (PPC) data from a 31-day campaign run by a company that sells USB keys and USB hubs. Each of the 555 observations (rows) represents a click on an internet ad triggered by a keyword search, and there are three columns.

day - represents the day of the campaign. Valid days are 1-31.
price - represents the price of the campaign. Price can’t be a number below 0.10.
keyword - represents the keyword purchased. Everything must be spelled correctly. There aren’t many keywords, and each is some combination of “usb” and/or “key” or “hub”.

In this assignment you will be examining each column for data validity. Each exercise presents one or more questions for you to answer.

We’ll start by loading the tidyverse family of packages along with the janitor and skimr packages, and our data. Make sure you install these two packages in your RStudio prior to calling the library() functions below.

library(tidyverse)
library(janitor)

library(skimr)

ppc_data <- read_csv("https://jsuleiman.com/datasets/ppc_data.csv")
glimpse(ppc_data)

Rows: 555
Columns: 3
$ day     <dbl> 9, 30, 19, 4, 30, 17, 5, 8, 17, 21, 4, 13, 29, 25, 25, 7, 4, 5…
$ keyword <chr> "usb", "usb hub", "usb hub", "usb key", "key", "usb hub", "hub…
$ price   <dbl> 5.9, 8.0, 2.8, 7.7, 1.7, 5.5, 5.2, 3.8, 6.0, 2.0, 9.7, 7.0, 8.…

Exercises

There are six exercises in this assignment. The Grading Rubric is available at the end of this document.

Exercise 1

Create a graph of the number of clicks (i.e., observations) for each day (1-31). Use geom_bar() for your geometry. In the narrative below your code, note which days had zero clicks.

# Count clicks (rows) per day; fixing the levels at 1-31 keeps zero-click days on the axis
ggplot(ppc_data, aes(x = factor(day, levels = 1:31))) +
  geom_bar(fill = "steelblue") +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Day of campaign", y = "Number of clicks")
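
The zero-click days can also be listed directly instead of being read off the chart. The sketch below uses a small made-up data frame (toy_clicks and its 7-day range are assumptions for illustration, not the assignment data) and compares the valid day range against the days that actually appear. For the real data the same idea would be setdiff(1:31, ppc_data$day). The skim() summary in Exercise 2 reports a minimum day of 2, so day 1, at least, has no clicks.

```r
# Hypothetical toy data: a 7-day campaign with no clicks on days 1 and 5
toy_clicks <- data.frame(day = c(2, 2, 3, 4, 4, 4, 6, 7))

# Days in the valid range that never appear in the data
zero_click_days <- setdiff(1:7, toy_clicks$day)
zero_click_days
```

Any day returned here is a day with no observations, which is exactly what an empty slot on the bar chart represents.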

Exercise 2

Insert a code cell to show how many NA (i.e., missing) values there are in price. In the narrative below that code cell write out how many NA values there are for price and what percent of the observations that represents.

# skimr is already loaded; n_missing reports the NA count for each column
ppc_data |>
  skim()

Data summary
Name	ppc_data
Number of rows	555
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
keyword	0	1	3	7	0	7	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
day	0	1.00	16.41	8.34	2	9.0	17.0	23.0	31	▆▇▇▇▆
price	6	0.99	4.74	2.92	0	2.2	4.7	7.3	10	▇▇▇▆▆

There are 6 NA values for price, which is about 1.08% of the 555 observations (6 / 555).
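
The count and the percentage can also be computed in one step rather than worked out by hand from the skim() table. This is a minimal sketch on made-up values (toy_prices is an assumption for illustration). With the real data, the same summarise() call would be applied to ppc_data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 8 prices, 2 of them missing
toy_prices <- tibble(price = c(5.9, NA, 2.8, 7.7, NA, 1.7, 5.5, 5.2))

na_summary <- toy_prices |>
  summarise(
    n_missing   = sum(is.na(price)),        # number of NA values
    pct_missing = 100 * mean(is.na(price))  # share of rows that are NA, in percent
  )
na_summary
```

Because is.na() returns TRUE/FALSE, sum() gives the count and mean() gives the proportion.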

Exercise 3

Valid values for price are 0.1 or greater. Insert a code cell that displays the number of values of price that are less than 0.1. In the narrative below that code cell write how many values are below 0.1.

ppc_data |>
  tabyl(price)

 price  n     percent valid_percent
   0.0 10 0.018018018   0.018214936
   0.1 10 0.018018018   0.018214936
   0.2  5 0.009009009   0.009107468
   0.3  2 0.003603604   0.003642987
   0.4  6 0.010810811   0.010928962
   0.5  6 0.010810811   0.010928962
   0.6  6 0.010810811   0.010928962
   0.7  9 0.016216216   0.016393443
   0.8  8 0.014414414   0.014571949
   0.9  6 0.010810811   0.010928962
   1.0  5 0.009009009   0.009107468
   1.1  4 0.007207207   0.007285974
   1.2  5 0.009009009   0.009107468
   1.3 10 0.018018018   0.018214936
   1.4  3 0.005405405   0.005464481
   1.5  8 0.014414414   0.014571949
   1.6  2 0.003603604   0.003642987
   1.7  9 0.016216216   0.016393443
   1.8  5 0.009009009   0.009107468
   1.9  5 0.009009009   0.009107468
   2.0  1 0.001801802   0.001821494
   2.1 10 0.018018018   0.018214936
   2.2  8 0.014414414   0.014571949
   2.3  4 0.007207207   0.007285974
   2.4  3 0.005405405   0.005464481
   2.5  4 0.007207207   0.007285974
   2.6 10 0.018018018   0.018214936
   2.7  3 0.005405405   0.005464481
   2.8  5 0.009009009   0.009107468
   2.9  4 0.007207207   0.007285974
   3.0  9 0.016216216   0.016393443
   3.1  8 0.014414414   0.014571949
   3.2  7 0.012612613   0.012750455
   3.3  5 0.009009009   0.009107468
   3.4  8 0.014414414   0.014571949
   3.5  4 0.007207207   0.007285974
   3.6  7 0.012612613   0.012750455
   3.7  8 0.014414414   0.014571949
   3.8  2 0.003603604   0.003642987
   3.9  3 0.005405405   0.005464481
   4.0  7 0.012612613   0.012750455
   4.1  6 0.010810811   0.010928962
   4.2  7 0.012612613   0.012750455
   4.3  3 0.005405405   0.005464481
   4.5  4 0.007207207   0.007285974
   4.6  8 0.014414414   0.014571949
   4.7 11 0.019819820   0.020036430
   4.8  5 0.009009009   0.009107468
   4.9  4 0.007207207   0.007285974
   5.0  2 0.003603604   0.003642987
   5.1  3 0.005405405   0.005464481
   5.2 11 0.019819820   0.020036430
   5.3  4 0.007207207   0.007285974
   5.4  4 0.007207207   0.007285974
   5.5  7 0.012612613   0.012750455
   5.6  6 0.010810811   0.010928962
   5.7  5 0.009009009   0.009107468
   5.8  4 0.007207207   0.007285974
   5.9  6 0.010810811   0.010928962
   6.0  6 0.010810811   0.010928962
   6.1  3 0.005405405   0.005464481
   6.2  7 0.012612613   0.012750455
   6.3  4 0.007207207   0.007285974
   6.4  4 0.007207207   0.007285974
   6.5  8 0.014414414   0.014571949
   6.6  4 0.007207207   0.007285974
   6.7  6 0.010810811   0.010928962
   6.8  2 0.003603604   0.003642987
   6.9  8 0.014414414   0.014571949
   7.0  4 0.007207207   0.007285974
   7.1  4 0.007207207   0.007285974
   7.2  7 0.012612613   0.012750455
   7.3  6 0.010810811   0.010928962
   7.4  5 0.009009009   0.009107468
   7.5  4 0.007207207   0.007285974
   7.6  5 0.009009009   0.009107468
   7.7  7 0.012612613   0.012750455
   7.8  7 0.012612613   0.012750455
   7.9  2 0.003603604   0.003642987
   8.0  7 0.012612613   0.012750455
   8.1  7 0.012612613   0.012750455
   8.2  4 0.007207207   0.007285974
   8.3  6 0.010810811   0.010928962
   8.4  4 0.007207207   0.007285974
   8.5  7 0.012612613   0.012750455
   8.6  7 0.012612613   0.012750455
   8.7  7 0.012612613   0.012750455
   8.8  1 0.001801802   0.001821494
   8.9  3 0.005405405   0.005464481
   9.0  2 0.003603604   0.003642987
   9.1  5 0.009009009   0.009107468
   9.2  4 0.007207207   0.007285974
   9.3  4 0.007207207   0.007285974
   9.4  5 0.009009009   0.009107468
   9.5  5 0.009009009   0.009107468
   9.6  4 0.007207207   0.007285974
   9.7  7 0.012612613   0.012750455
   9.8  5 0.009009009   0.009107468
   9.9  5 0.009009009   0.009107468
  10.0  3 0.005405405   0.005464481
    NA  6 0.010810811            NA

# Count how many price values are less than 0.1
ppc_data %>%
  filter(price < 0.1) %>%
  summarise(count = n())

# A tibble: 1 × 1
  count
  <int>
1    10

Ten values of price are below 0.1; as the tabyl output shows, all ten are 0.0.

Exercise 4

Insert a code cell that drops all of the rows that contain invalid or NA values for price.

# Keep only rows whose price is present and at least 0.1
ppc_data |>
  filter(!is.na(price), price >= 0.1)

# A tibble: 539 × 3
     day keyword price
   <dbl> <chr>   <dbl>
 1     9 usb       5.9
 2    30 usb hub   8  
 3    19 usb hub   2.8
 4     4 usb key   7.7
 5    30 key       1.7
 6    17 usb hub   5.5
 7     5 hub       5.2
 8     8 key       3.8
 9    17 usb key   6  
10    21 usb key   2  
# ℹ 529 more rows

This drops 16 rows: 6 with NA values for price and 10 with a price below 0.1, leaving 539 of the original 555 rows.
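
Before dropping rows, it can help to see why each row would be dropped. The sketch below labels every price as missing, below the minimum, or valid, then counts each group. The data frame toy_rows and the status labels are assumptions made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two NA prices, two below 0.1, three valid
toy_rows <- tibble(price = c(5.9, NA, 0.0, 0.1, 2.8, NA, 0.05))

price_status <- toy_rows |>
  mutate(status = case_when(
    is.na(price) ~ "missing",        # NA must be checked first
    price < 0.1  ~ "below minimum",  # invalid according to the data rules
    TRUE         ~ "valid"
  )) |>
  count(status)
price_status
```

The missing and below-minimum counts together should equal the number of rows the filter removes.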

Exercise 5

Insert a code cell that shows a tabyl of the counts of each keyword. In the narrative below the code cell, list the misspellings and counts if there are any.

ppc_data |>
  tabyl(keyword)

 keyword   n    percent
     hub  90 0.16216216
     key  85 0.15315315
     ubs  11 0.01981982
 ubs key  10 0.01801802
     usb  75 0.13513514
 usb hub 121 0.21801802
 usb key 163 0.29369369

Misspellings: “ubs” appears 11 times and “ubs key” appears 10 times. In both, “ubs” should be “usb”.

Exercise 6

Insert a code cell that corrects all the misspellings for keyword, then rerun tabyl to verify.

# Correct the misspelled keywords ("ubs" should be "usb")
ppc_data <- ppc_data |>
  mutate(keyword = case_when(
    keyword == "ubs" ~ "usb",
    keyword == "ubs key" ~ "usb key",
    TRUE ~ keyword))

# Rerun tabyl to verify the corrected keyword counts
ppc_data |>
  tabyl(keyword)

 keyword   n   percent
     hub  90 0.1621622
     key  85 0.1531532
     usb  86 0.1549550
 usb hub 121 0.2180180
 usb key 173 0.3117117
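
One way to confirm that no misspellings remain is to compare the keywords against a list of allowed values. The sketch below is an alternative approach, not the method used above. It fixes a leading "ubs" with stringr::str_replace() and then looks for rows outside the allowed list. Both toy_keywords and valid_keywords are assumptions for illustration, and the allowed list is taken from the correctly spelled values in the tabyl output.

```r
library(dplyr)
library(stringr)
library(tibble)

# Assumed list of valid keywords (the correctly spelled values seen in the data)
valid_keywords <- c("hub", "key", "usb", "usb hub", "usb key")

# Hypothetical toy data containing both misspellings
toy_keywords <- tibble(keyword = c("usb", "ubs", "usb key", "ubs key", "hub"))

# Replace a leading "ubs" with "usb"; this covers "ubs" and "ubs key"
fixed <- toy_keywords |>
  mutate(keyword = str_replace(keyword, "^ubs", "usb"))

# Rows whose keyword is still not in the allowed list
leftover <- fixed |>
  filter(!keyword %in% valid_keywords)
nrow(leftover)
```

A count of zero means every keyword now matches the allowed list. A typo in the correction code, such as an underscore where the data has a space, would leave rows behind and be caught here.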

Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document.
Render your document to HTML and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Item (percent overall)	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Narrative: typos and grammatical errors (7%)
Document formatting: correctly implemented instructions (7%)
Exercises (13% each)
Submitted properly to Brightspace (8%)	NA	NA	You must submit according to instructions to receive any credit for this portion.