This PDF was generated using an RMarkdown (.Rmd) file.
You can edit the .Rmd file directly to add in your code and
fill out the answers. To edit .Rmd files and compile them
into PDFs, you will need to install RStudio, as well as a
working LaTeX build such as TeX Live
or MacTeX. See the RMarkdown
cheatsheet for tips on how to work with files of this format.
You must submit your answers knitted to a readable
PDF to receive credit; we will not grade assignments submitted in
.Rmd or .R formats. Make sure the final PDF
output is easy to read. I recommend regularly knitting the
RMarkdown file as you work on the assignment, so that you catch errors
early and don’t get overwhelmed by a bunch of errors all at once at the
very end.
First, use the space below to load in packages or do any other setup you might need.
#load your packages here
library(readr)
library(microbenchmark)
library(dplyr)
library(tidyverse)
Let’s start with a basic review of the different data types in R. Consider the following four objects:
x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE
Use the typeof() function
to check the data type of each of the 4 objects created above. Explain
in words the difference between each of these data types.
#put your code for Q1 here
typeof(x1)
## [1] "double"
typeof(x2)
## [1] "integer"
typeof(x3)
## [1] "character"
typeof(x4)
## [1] "logical"
Answer: put your text answer for Q1 here
x1 is a double: a numeric value stored in floating point, so it can hold a decimal part. x2 is an integer, created with the L suffix, so it cannot hold a decimal part. x3 is a character string: the “1” is stored as text rather than as a number. x4 is a logical, which can only take one of two values, TRUE or FALSE.
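As a hedged illustration of these distinctions, the sketch below compares typeof() with class() and is.numeric() on the same four objects. It is not part of the assignment; it only makes the storage differences concrete.

```r
x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE
objs = list(x1 = x1, x2 = x2, x3 = x3, x4 = x4)

# typeof() reports the internal storage type; class() reports the high-level class
types = sapply(objs, typeof)
classes = sapply(objs, class)
print(types)
print(classes)

# Doubles and integers both count as "numeric"; characters and logicals do not
is.numeric(x1)  # TRUE
is.numeric(x2)  # TRUE
is.numeric(x3)  # FALSE
is.numeric(x4)  # FALSE

# x1 and x2 are equal in value even though they are stored differently
x1 == x2           # TRUE
identical(x1, x2)  # FALSE
```

The last two lines show why the type matters even when the printed values look the same: == compares values after coercion, while identical() also compares the storage type.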
x1 + 1, x2 + 1, and
so on.
#put your code for Q2 here
x1+1
## [1] 2
x2+1
## [1] 2
as.numeric(x3)+1
## [1] 2
x4+1
## [1] 2
Answer: put your text answer for Q2 here
x3+1 returns an error because the numeric value 1 cannot be added to a character string. To fix this, x3 must first be converted to a numeric type with as.numeric. x1, x2, and x4 all give valid outputs, and the result in each case is of type “double”. This is because the literal 1 is itself a double, and adding a double to an integer, logical, or double value always produces a double.
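The coercion rules described in this answer can be checked directly. The following is a minimal sketch; the variable x3_result is a name introduced here only for illustration.

```r
x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE

typeof(x2 + 1)   # "double": the literal 1 is a double, so the integer is promoted
typeof(x2 + 1L)  # "integer": both operands are integers
typeof(x4 + 1L)  # "integer": TRUE is treated as 1L
typeof(x4 + 1)   # "double"

# Adding a number to a character string is an error; tryCatch() captures the message
x3_result = tryCatch(x3 + 1, error = function(e) conditionMessage(e))
print(x3_result)

# Explicit conversion makes the addition valid
as.numeric(x3) + 1  # 2
```

Writing 1L instead of 1 is the way to keep an integer result when that matters.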
Now consider the following vector of (fictional) test scores for a (fictional) set of students:
test_scores = c("Kei" = 90,
"Tahir" = 75,
"Adhyavu" = 89,
"Hao"= 83,
"Sangmin" = 79,
"Youssef" = 63,
"Hannah" = 93,
"Tanner" = 66,
"Xueying" = 70,
"Henrique" = 93)
test_scores
## Kei Tahir Adhyavu Hao Sangmin Youssef Hannah Tanner
## 90 75 89 83 79 63 93 66
## Xueying Henrique
## 70 93
test_scores. Write your
code to be general, i.e., it should work on any named vector
test_scores regardless of the number of students and the
specific names/scores of the students.
grep or grepl function)
%% a.k.a. modulo function)
#put your code for Q3 here
#Sort alphabetically:
test_scores[order(names(test_scores))]
## Adhyavu Hannah Hao Henrique Kei Sangmin Tahir Tanner
## 89 93 83 93 90 79 75 66
## Xueying Youssef
## 70 63
#Sort from highest to lowest:
rev(sort(test_scores))
## Henrique Hannah Kei Adhyavu Hao Sangmin Tahir Xueying
## 93 93 90 89 83 79 75 70
## Tanner Youssef
## 66 63
#Subset to only students with scores of at least 80:
test_scores[test_scores >= 80]
## Kei Adhyavu Hao Hannah Henrique
## 90 89 83 93 93
#Subset to only students with names starting with "T":
test_scores[grepl('^T', names(test_scores))]
## Tahir Tanner
## 75 66
#Subset to only students with even numbered scores:
test_scores[test_scores%%2 == 0]
## Kei Tanner Xueying
## 90 66 70
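One detail worth checking in the name filter is whether the pattern is anchored to the start of the name. The sketch below uses a hypothetical vector (the name "McTavish" is made up for this purpose) to show how an unanchored pattern, an anchored pattern, and base R's startsWith() behave.

```r
# Hypothetical scores; "McTavish" contains a capital "T" that is not the first letter
scores = c("Tahir" = 75, "McTavish" = 88, "Tanner" = 66, "Kei" = 90)

# Unanchored pattern: matches a "T" anywhere in the name
contains_t = scores[grepl("T", names(scores))]
contains_t

# Anchored pattern: "^" restricts the match to the start of the name
starts_with_t = scores[grepl("^T", names(scores))]
starts_with_t

# startsWith() is a base R alternative that avoids regular expressions
same_result = scores[startsWith(names(scores), "T")]
```

On the assignment's own vector both patterns happen to select the same two students, but only the anchored version stays correct for arbitrary names.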
In R, we often need to deal with factors, a data type which looks like a character string (text) but works a bit differently. Factors are used to represent categorical variables, where a vector of character strings can only take on a certain number of values. Consider the following vector, where we convert the earlier vector of test scores into factors instead of numeric values. Now when we print the vector, it shows all the different scores that appear in the vector underneath, which are called “levels.” The levels are stored as character strings.
test_scores_factor = factor(test_scores)
test_scores_factor
## Kei Tahir Adhyavu Hao Sangmin Youssef Hannah Tanner
## 90 75 89 83 79 63 93 66
## Xueying Henrique
## 70 93
## Levels: 63 66 70 75 79 83 89 90 93
levels(test_scores_factor)
## [1] "63" "66" "70" "75" "79" "83" "89" "90" "93"
test_scores_factor back
into a numeric vector using the as.numeric function?
Explain why this happens (look at the documentation of the
factor function and/or search on Google or Stack Overflow
to help you figure it out).
test_scores_factor back into a numeric vector so that we
don’t run into the problem above.
#put your code for Q4 here
as.numeric(as.character(test_scores_factor))
## [1] 90 75 89 83 79 63 93 66 70 93
Answer: put your text answer for Q4 here
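If as.numeric is applied directly to a factor, the result is the factor’s underlying integer codes, i.e., the position of each value among the levels (here 8 4 7 6 5 1 9 2 3 9), not the original scores. This happens because a factor is stored internally as integers that index into the character vector of levels. To fix this, first convert the factor to character and then to numeric: as.numeric(as.character(test_scores_factor)).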
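To make the factor behavior concrete, here is a small sketch on the same scores (names omitted for brevity). The variables via_character and via_levels are introduced here for illustration; the second approach is the one suggested in the ?factor help page.

```r
test_scores_factor = factor(c(90, 75, 89, 83, 79, 63, 93, 66, 70, 93))

# Direct conversion returns the integer codes (positions in levels()), not the scores
as.numeric(test_scores_factor)
# [1] 8 4 7 6 5 1 9 2 3 9

# Converting through character recovers the original values
via_character = as.numeric(as.character(test_scores_factor))

# Alternative from ?factor: convert the levels once, then index them by the factor
via_levels = as.numeric(levels(test_scores_factor))[test_scores_factor]

identical(via_character, via_levels)  # TRUE
```

The second form converts only the distinct levels, so it does less work when a long factor has few levels.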
In this section, let’s get some practice writing functions and benchmarking code efficiency through a simple problem: computing the sample average of a numeric vector.
y_1, y_2, and y_3. Make the size
of the vectors (i.e., the number of simulations) 100, 10,000, and
1,000,000 respectively.
#put your code for Q5 here
y_1 = rnorm(100)
y_2 = rnorm(10000)
y_3 = rnorm(1000000)
mean_loop which takes a single input x and
returns the sample average of x. Compute the average using
a for loop, looping over the elements of x to
add them up, then dividing by the number of elements at the end.
#put your code for Q6 here
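mean_loop = function(x){
  # Running total; named "total" to avoid masking the base function sum()
  total = 0
  for(val in x){
    total = total + val
  }
  # Divide the accumulated total by the number of elements
  total/length(x)
}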
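Before benchmarking, it can help to confirm that the loop-based function agrees with the built-in mean(). The following is a minimal sketch that restates the function so it runs on its own; the seed and the vector y are arbitrary choices made for this example.

```r
mean_loop = function(x) {
  total = 0
  for (val in x) {
    total = total + val
  }
  total / length(x)
}

set.seed(42)  # arbitrary seed, only so this sketch is reproducible
y = rnorm(1000)

# all.equal() compares with a numerical tolerance, which is appropriate because
# mean() and a naive running sum can differ in the last few floating-point digits
all.equal(mean_loop(y), mean(y))  # TRUE

# Edge case: an empty vector gives 0 / 0 = NaN, the same value mean() returns
mean_loop(numeric(0))
```

Comparing with == instead of all.equal() could report a spurious mismatch, since the two functions do not accumulate rounding error in the same way.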
microbenchmark (from the package
microbenchmark) lets you measure how fast code runs. If you
give it code to evaluate, it will evaluate it 100 times, and return
summary statistics on how long the code took to run. If you give it
multiple expressions, it will do this for each expression. Use
microbenchmark to compare how fast mean_loop
is compared to the default R function mean in computing the
mean of the vector y_1 that you generated earlier. Repeat
this exercise for y_2 and y_3. How do the
compute times compare? How does the speed of each function scale by how
big the input is?
#put your code for Q7 here
library(microbenchmark)
microbenchmark(mean(y_1), mean_loop(y_1),
               mean(y_2), mean_loop(y_2),
               mean(y_3), mean_loop(y_3))
Answer: put your text answer for Q7 here
The average run time of each function grows with the size of the input, but not exponentially. y_1 is a vector of 100 random numbers, y_2 is 100 times larger than y_1, and y_3 is 100 times larger than y_2. The built-in function mean() takes about 3 times longer on y_2 than on y_1, and about 63 times longer on y_3 than on y_2. For mean_loop(), the run time for y_2 is about 30 times that for y_1, and the run time for y_3 is roughly 120 times that for y_2. For small inputs a fixed per-call overhead dominates, which is why the first ratios are well below 100; for large inputs the ratios are of the same order as the 100-fold increase in input size, which is consistent with both functions scaling roughly linearly in the length of the vector.
For a fixed input size, one function can be substantially faster than the other. For a relatively small input such as y_1, mean_loop() takes on average roughly half the time that the built-in mean() needs. As the input grows, the ordering reverses: for y_2, mean() is almost five times faster than mean_loop(), and for an input as large as y_3, mean() is about 10 times faster.
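One way to examine scaling more directly is to compute the ratio of median run times between consecutive input sizes. The sketch below does this for mean() only; the seed, the choice of times = 20L, and the names median_us and ratios are assumptions made for this example, and the actual timings will vary by machine.

```r
library(microbenchmark)

set.seed(1)  # arbitrary seed for reproducibility of the simulated vectors
sizes = c(100, 10000, 1000000)

# Median run time (in microseconds) of mean() at each input size
median_us = sapply(sizes, function(n) {
  y = rnorm(n)
  bench = microbenchmark(mean(y), times = 20L)
  median(bench$time) / 1000  # bench$time is recorded in nanoseconds
})

# Ratio of run times between consecutive sizes; each step is a 100-fold
# increase in input length, so ratios near 100 suggest linear scaling
ratios = median_us[-1] / median_us[-length(median_us)]
print(median_us)
print(ratios)
```

Using the median instead of the mean makes the comparison less sensitive to occasional slow runs, for example from garbage collection.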
Now we’ll get some practice loading in and visualizing actual data.
For this problem, we’ll use data from the grocery delivery service Instacart.
A random subset of the data, containing complete data on 10,000
Instacart orders (already merged across several tables) is in the same
folder as this file, with filename
instacart_sample.csv.
Each row consists of 1 product ordered by a customer, with the
order_id and user_id columns indicating the
order number and customer number associated with that product. A full
description of the columns is given in the data
dictionary.
I’ll write hints and suggested functions assuming you are using the
tidyverse family of packages (namely dplyr and
ggplot2) but you are welcome to use other packages such as
data.table to do the data processing.
instacart_sample.csv (e.g., using the read_csv
function) and store it as an object. Print the number of rows and
columns the table has, and print the names of the columns.
#put your code for Q8 here
insta = read_csv('instacart_sample.csv', col_types = "ccdccccccdccccc")
column_names = colnames(insta)
num_cols = length(column_names)
num_cols
## [1] 15
column_names
## [1] "order_id" "product_id" "add_to_cart_order"
## [4] "reordered" "user_id" "eval_set"
## [7] "order_number" "order_dow" "order_hour_of_day"
## [10] "days_since_prior_order" "product_name" "aisle_id"
## [13] "department_id" "aisle" "department"
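The prompt also asks for the number of rows, which can be obtained the same way as the number of columns. The sketch below uses a small made-up table named toy (column names borrowed from the Instacart data, values invented) to show nrow(), ncol(), and dim().

```r
# Small stand-in table; the values are made up for illustration
toy = data.frame(
  order_id = c("1", "1", "2"),
  product_name = c("Bananas", "Milk", "Bread"),
  add_to_cart_order = c(1, 2, 1)
)

nrow(toy)   # 3
ncol(toy)   # 3
dim(toy)    # 3 3  (rows first, then columns)
names(toy)  # column names, equivalent to colnames() for a data frame
```

Applied to the assignment's object, nrow(insta) or dim(insta) would report the row count alongside the 15 columns.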
summary to tabulate some summary
statistics about the distribution of order sizes. Hint: The function
count may be helpful to count the order sizes.
#put your code for Q9 here
# Each row is one product, so the number of rows per order_id is the order size
order_sizes = insta %>%
  count(order_id, sort = TRUE)
summary(order_sizes)
## order_id n
## Length:10000 Min. : 1.00
## Class :character 1st Qu.: 5.00
## Mode :character Median : 8.00
## Mean :10.06
## 3rd Qu.:14.00
## Max. :72.00
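To see what the counting step does, here is a minimal sketch on a made-up table with three orders. The table toy and the column name order_size are hypothetical and chosen only for this example.

```r
library(dplyr)

# Made-up data: order "a" has 3 products, "b" has 1, "c" has 2
toy = tibble(
  order_id = c("a", "a", "a", "b", "c", "c"),
  product_id = c("1", "2", "3", "1", "2", "4")
)

# count() returns one row per order_id with the number of rows in that group
order_sizes = toy %>% count(order_id, name = "order_size")
order_sizes

# summary() on the count column gives the distribution of order sizes
summary(order_sizes$order_size)
```

Because each row is one product, no helper column of ones is needed: count() tallies rows per group by default.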
data.frame with this information, then use the
kable function from the knitr package to
convert the matrix/data.frame into a print table. What is
the most purchased category? What is the least purchased category?
#put your code for Q10 here
# Count the products ordered in each department, then divide by the total
# number of products ordered to get each department's share
dept_each_df = insta%>%
  count(department, sort = TRUE)%>%
  mutate(n = n/sum(n))
colnames(dept_each_df) = c("Department Names", "Proportion of Total Orders")
dept_each_table = knitr::kable(dept_each_df)
dept_each_table
| Department Names | Proportion of Total Orders |
|---|---|
| produce | 0.2991399 |
| dairy eggs | 0.1661314 |
| snacks | 0.0870979 |
| beverages | 0.0822453 |
| frozen | 0.0685228 |
| pantry | 0.0557749 |
| bakery | 0.0365236 |
| canned goods | 0.0330234 |
| deli | 0.0317506 |
| dry goods pasta | 0.0269378 |
| household | 0.0227017 |
| meat seafood | 0.0225227 |
| breakfast | 0.0209218 |
| personal care | 0.0137722 |
| babies | 0.0125889 |
| international | 0.0081639 |
| alcohol | 0.0047233 |
| pets | 0.0030528 |
| missing | 0.0020285 |
| bulk | 0.0013126 |
| other | 0.0010640 |
#The most purchased category is "produce". The least purchased category is "other".
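As a cross-check on the shares in the table, the same kind of calculation can be done in base R with table() and prop.table(). This is a sketch on a made-up vector of department labels, not the Instacart data.

```r
# Made-up department labels, one per ordered product
toy_departments = c("produce", "produce", "dairy eggs",
                    "snacks", "produce", "dairy eggs")

# table() counts the rows per department; prop.table() divides by the total
shares = sort(prop.table(table(toy_departments)), decreasing = TRUE)
shares

sum(shares)             # the proportions sum to 1
round(100 * shares, 1)  # the same shares expressed as percentages
```

Multiplying by 100, as in the last line, is one way to report percentages when a column is labelled as a percent.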
order_hour_of_day) that orders are placed
(e.g., as a barplot or density plot).
#put your code for Q11 here
# Keep one row per order so that each order is counted once, regardless of its size
orders_by_hour = insta%>%
  distinct(order_id, order_hour_of_day)%>%
  # order_hour_of_day was read in as character, so convert it for a numeric x axis
  mutate(order_hour_of_day = as.numeric(order_hour_of_day))%>%
  count(order_hour_of_day, name = "frequency")
# Map the columns inside aes() so ggplot reads them from the data frame
ggplot(data = orders_by_hour)+
  geom_col(mapping = aes(x = order_hour_of_day, y = frequency))+
  labs(x = "Order hour of day", y = "Number of orders")
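An alternative to counting by hand is to let ggplot2 do the tally: geom_bar() counts rows per x value by default. The sketch below uses a made-up table toy_orders with one row per order and the hour already stored as a number; both are assumptions for this example.

```r
library(ggplot2)

# Made-up data: one row per order, hour stored as a number
toy_orders = data.frame(
  order_id = c("a", "b", "c", "d", "e", "f"),
  order_hour_of_day = c(9, 9, 10, 14, 14, 14)
)

# geom_bar() uses stat = "count" by default, so ggplot2 computes the per-hour tally
p = ggplot(toy_orders, aes(x = order_hour_of_day)) +
  geom_bar() +
  labs(x = "Order hour of day", y = "Number of orders")
p
```

This only gives the right picture when the data has one row per order; on the product-level table, larger orders would be counted once per product.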