Download and open the assign04.qmd file and complete the exercises.
We will be using pay-per-click (PPC) data from a 31-day campaign run by a company that sells USB keys and USB hubs. Each of the 555 observations (rows) represents a click on an internet ad based on a keyword search, and there are three columns:
day - represents the day of the campaign. Valid days are 1-31.
price - represents the price of the campaign. Price can’t be a number below 0.10.
keyword - represents the keyword purchased. Everything must be spelled correctly. There aren’t many keywords, and each is some combination of “usb” and/or “key” or “hub”.
In this assignment you will be examining each column for data validity. Each exercise presents one or more questions for you to answer.
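The three column rules above can be checked together in one pass. The following is only a sketch on a made-up five-row tibble: `toy_ppc` and the `valid_keywords` vector are assumptions for illustration, not the assignment data. It adds one TRUE/FALSE flag per rule so that invalid rows are easy to spot.

```r
library(dplyr)

# Hypothetical toy data standing in for ppc_data (not the real campaign data)
toy_ppc <- tibble(
  day     = c(1, 15, 32, 7, 20),
  price   = c(0.25, NA, 0.40, 0.05, 0.30),
  keyword = c("usb key", "usb hub", "hub", "ubs", "key")
)

# Assumed set of correctly spelled keywords
valid_keywords <- c("usb", "key", "hub", "usb key", "usb hub")

# One logical flag per validity rule
validity <- toy_ppc %>%
  mutate(
    day_ok     = day %in% 1:31,
    price_ok   = !is.na(price) & price >= 0.1,
    keyword_ok = keyword %in% valid_keywords
  )

validity
```

Rows where any flag is FALSE are the ones the exercises below ask you to find and handle.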
We’ll start by loading the tidyverse family of packages along with the janitor and skimr packages, and our data. Make sure you install janitor and skimr in RStudio before calling the library() functions below.
There are six exercises in this assignment. The Grading Rubric is available at the end of this document.
Exercise 1
Create a graph of number of clicks (i.e., observations) for each day (1-31). Use geom_bar() for your geometry. In the narrative below your code note which days had zero clicks.
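# Count clicks per day; .drop = FALSE keeps days with zero clicks
# so that they still appear on the graph
click_counts <- ppc_data %>%
  count(day = factor(day, levels = 1:31), .drop = FALSE)

ggplot(click_counts, aes(x = day, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(
    title = "Number of Clicks per Day",
    x = "Day of the Month",
    y = "Number of Clicks"
  ) +
  theme_minimal()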
The graph shows no days with zero clicks.
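Reading zero-click days off a 31-bar chart is error-prone, so it can help to compute them directly. The sketch below assumes a made-up vector `observed_days` standing in for the `day` column; it is an illustration, not the assignment data.

```r
# Hypothetical observed campaign days (stand-in for ppc_data$day)
observed_days <- c(1, 2, 2, 3, 5, 5, 31)

# Days in the valid 1-31 range that never appear in the data
zero_click_days <- setdiff(1:31, observed_days)

zero_click_days
```

With the real data, the same `setdiff()` call on `ppc_data$day` would return an empty vector if every day had at least one click.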
Exercise 2
Insert a code cell to show how many NA (i.e., missing) values there are in price. In the narrative below that code cell write out how many NA values there are for price and what percent of the observations that represents.
library(dplyr)

# Calculate missing value info
na_summary <- ppc_data %>%
  summarise(
    total_rows = n(),
    na_count = sum(is.na(price)),
    na_percent = round(100 * na_count / total_rows, 2)
  )

# Rename columns for a prettier display
na_summary <- na_summary %>%
  rename(
    `Total Rows` = total_rows,
    `Missing Values (NA)` = na_count,
    `Percent Missing (%)` = na_percent
  )

# View the summary table
na_summary
There are 6 missing values in the price column, which represents 1.08% of the total observations.
Exercise 3
Valid values for price are 0.1 or greater. Insert a code cell that displays the number of values of price that are less than 0.1. In the narrative below that code cell write how many values are below 0.1.
There are 10 values in the price column that are less than 0.1, which represents 1.8% of the total observations.
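One way such a count could be produced is sketched below. The tibble `toy_prices` is made up for illustration and is not the assignment data. Note `na.rm = TRUE`: comparing `NA` with 0.1 yields `NA`, which would otherwise make the whole sum `NA`.

```r
library(dplyr)

# Hypothetical prices standing in for ppc_data$price
toy_prices <- tibble(price = c(0.25, 0.05, NA, 0.10, 0.09, 0.40))

low_price_summary <- toy_prices %>%
  summarise(
    # Count prices strictly below the 0.1 minimum, ignoring NA
    below_min     = sum(price < 0.1, na.rm = TRUE),
    # Share of all rows (including NA rows) that are below the minimum
    percent_below = round(100 * below_min / n(), 2)
  )

low_price_summary
```

A price of exactly 0.10 is valid under the rule, so it is not counted.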
Exercise 4
Insert a code cell that drops all of the rows that contain invalid or NA values for price.
# Create a new tibble with the invalid rows removed
ppc_data_clean <- ppc_data %>%
  filter(!is.na(price) & price >= 0.1)

# Inspect the structure of the cleaned data
glimpse(ppc_data_clean)
Exercise 5
Insert a code cell that shows a tabyl of the counts of each keyword. In the narrative below the code cell, list the misspellings and counts if there are any.
# Create a tabyl of the counts of each keyword
ppc_data_clean %>%
  tabyl(keyword)
keyword    n    percent
hub        89   0.16512059
key        81   0.15027829
ubs        10   0.01855288
ubs key    10   0.01855288
usb        73   0.13543599
usb hub    119  0.22077922
usb key    157  0.29128015
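# Count how many misspellings there are for keyword.
# Any keyword outside the set of correctly spelled values is a misspelling.
valid_keywords <- c("usb", "key", "hub", "usb key", "usb hub")

misspellings <- ppc_data_clean %>%
  filter(!keyword %in% valid_keywords) %>%
  count(keyword)

misspellings

There are two misspellings: “ubs” (10 clicks) and “ubs key” (10 clicks).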
Exercise 6
Insert a code cell that corrects all the misspellings for keyword, then rerun tabyl to verify.
# Correct the misspellings for keyword
ppc_data_clean <- ppc_data_clean %>%
  mutate(
    keyword = case_when(
      keyword == "ubs key" ~ "usb key",
      keyword == "ubs" ~ "usb",
      TRUE ~ keyword
    )
  )

# Show the tabyl of the counts of each keyword
ppc_data_clean %>%
  tabyl(keyword)
keyword    n    percent
hub        89   0.1651206
key        81   0.1502783
usb        83   0.1539889
usb hub    119  0.2207792
usb key    167  0.3098330
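Besides reading the table, the correction can be verified programmatically. The sketch below uses a made-up vector `cleaned_keywords` and an assumed `valid_keywords` set, both illustrative only; it stops with an error if any unexpected keyword remains.

```r
# Hypothetical cleaned keyword values (stand-in for ppc_data_clean$keyword)
cleaned_keywords <- c("usb", "usb key", "hub", "key", "usb hub", "usb key")

# Assumed set of correctly spelled keywords
valid_keywords <- c("usb", "key", "hub", "usb key", "usb hub")

# Keywords still present that are not in the valid set
leftover <- setdiff(cleaned_keywords, valid_keywords)

# Halts with an error if any misspelling survived the correction
stopifnot(length(leftover) == 0)
```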
Submission
To submit your assignment:
Change the author name to your name in the YAML portion at the top of this document.
Render your document to HTML and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.