This PDF was generated using an RMarkdown (.Rmd) file. You can edit the .Rmd file directly to add in your code and fill out the answers. To edit .Rmd files and compile them into PDFs, you will need to install RStudio, as well as a working LaTeX distribution such as TeX Live or MacTeX. See the RMarkdown cheatsheet for tips on how to work with files of this format.

You must submit your answers knitted to a readable PDF to receive credit; we will not grade assignments submitted in .Rmd or .R formats. Make sure the final PDF output is easy to read. I recommend regularly knitting the RMarkdown file as you work on the assignment, so that you catch errors early and don’t get overwhelmed by a bunch of errors all at once at the very end.

First, use the space below to load in packages or do any other setup you might need.

#load your packages here
library(readr)
library(microbenchmark)
library(dplyr)
library(tidyverse)

Basics: data types and vectors

Let’s start with a basic review of the different data types in R. Consider the following four objects:

x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE
#put your code for Q1 here
typeof(x1)
## [1] "double"
typeof(x2)
## [1] "integer"
typeof(x3)
## [1] "character"
typeof(x4)
## [1] "logical"

Answer: put your text answer for Q1 here

x1 is a double: a numeric value that can hold a decimal part. x2 is an integer, created with the L suffix, so it cannot hold a decimal part. x3 stores “1” as text (a character string) instead of a number. x4 is a logical value, which can only take one of two outcomes, TRUE or FALSE.

#put your code for Q2 here
x1+1
## [1] 2
x2+1
## [1] 2
as.numeric(x3)+1
## [1] 2
x4+1
## [1] 2

Answer: put your text answer for Q2 here

x3+1 returns an error because the numeric value 1 cannot be added to a character string. To fix this, x3 must first be converted to a numeric data type with as.numeric(). x1, x2, and x4 all produce valid output, and the result in each case has type “double”. This is because the literal 1 is a double, and adding a double to an integer, logical, or double value always yields a double.
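The coercion rule described above can be checked directly with typeof(). The following is a small sketch that reuses the four objects from the question; the extra cases with 1L and the tryCatch() wrapper are illustrative additions, not part of the assignment.

```r
# Same objects as in the question
x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE

# The literal 1 is a double, so the other operand is promoted to double
type_int_plus_dbl = typeof(x2 + 1)    # "double"
type_lgl_plus_dbl = typeof(x4 + 1)    # "double"

# With an integer literal (1L) no promotion to double is needed
type_int_plus_int = typeof(x2 + 1L)   # "integer"
type_lgl_plus_int = typeof(x4 + 1L)   # "integer" (TRUE is treated as 1L)

# Adding a number to a character string raises an error instead of coercing
chr_result = tryCatch(x3 + 1, error = function(e) "error")
chr_result                            # "error"
```

So the result type is the "highest" type among the operands (logical, then integer, then double), while character strings are never converted implicitly in arithmetic.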

Now consider the following vector of (fictional) test scores for a (fictional) set of students:

test_scores = c("Kei" = 90,
                "Tahir" = 75,
                "Adhyavu" = 89,
                "Hao"= 83,
                "Sangmin" = 79,
                "Youssef" = 63,
                "Hannah" = 93,
                "Tanner" = 66,
                "Xueying" = 70,
                "Henrique" = 93)
test_scores
##      Kei    Tahir  Adhyavu      Hao  Sangmin  Youssef   Hannah   Tanner 
##       90       75       89       83       79       63       93       66 
##  Xueying Henrique 
##       70       93
#put your code for Q3 here
#Sort alphabetically:
test_scores[order(names(test_scores))]
##  Adhyavu   Hannah      Hao Henrique      Kei  Sangmin    Tahir   Tanner 
##       89       93       83       93       90       79       75       66 
##  Xueying  Youssef 
##       70       63
#Sort from highest to lowest:
rev(sort(test_scores))
## Henrique   Hannah      Kei  Adhyavu      Hao  Sangmin    Tahir  Xueying 
##       93       93       90       89       83       79       75       70 
##   Tanner  Youssef 
##       66       63
#Subset to only students with scores of at least 80:
test_scores[test_scores >= 80]
##      Kei  Adhyavu      Hao   Hannah Henrique 
##       90       89       83       93       93
#Subset to only students with names starting with "T":
test_scores[grepl('^T', names(test_scores))]  # '^' anchors the match to the start of the name
##  Tahir Tanner 
##     75     66
#Subset to only students with even numbered scores:
test_scores[test_scores%%2 == 0]
##     Kei  Tanner Xueying 
##      90      66      70

In R, we often need to deal with factors, a data type which looks like a character string (text) but works a bit differently. Factors are used to represent categorical variables, where a vector of character strings can only take on a certain number of values. Consider the following vector, where we convert the earlier vector of test scores into factors instead of numeric values. Now when we print the vector, it shows all the different scores that appear in the vector underneath, which are called “levels.” The levels are stored as character strings.

test_scores_factor = factor(test_scores)
test_scores_factor
##      Kei    Tahir  Adhyavu      Hao  Sangmin  Youssef   Hannah   Tanner 
##       90       75       89       83       79       63       93       66 
##  Xueying Henrique 
##       70       93 
## Levels: 63 66 70 75 79 83 89 90 93
levels(test_scores_factor)
## [1] "63" "66" "70" "75" "79" "83" "89" "90" "93"
#put your code for Q4 here
as.numeric(as.character(test_scores_factor))
##  [1] 90 75 89 83 79 63 93 66 70 93

Answer: put your text answer for Q4 here

If as.numeric() is applied directly to a factor, the result is the factor's underlying integer codes (the position of each value among the levels, e.g. 8 for a score of 90), not the original scores. To recover the scores, convert the factor to character first and then to numeric: as.numeric(as.character(test_scores_factor)).
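A short illustration of the difference, using a made-up three-element factor rather than the full test_scores vector (the object names here are hypothetical):

```r
# A small factor built from numeric scores
scores = factor(c(90, 75, 93))
levels(scores)                                    # "75" "90" "93"

# Direct conversion returns the integer codes (positions in levels)
codes = as.numeric(scores)                        # 2 1 3

# Going through character recovers the original values
values = as.numeric(as.character(scores))         # 90 75 93

# An equivalent approach: convert the levels once, then index by the factor
values_alt = as.numeric(levels(scores))[scores]   # 90 75 93
```

The second form converts each level only once, which may matter for long factors with few levels; for a vector of ten scores either approach is fine.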

Writing functions: ways of computing sample averages

In this section, let’s get some practice writing functions and benchmarking code efficiency through a simple problem: computing the sample average of a numeric vector.

#put your code for Q5 here
y_1 = rnorm(100)
y_2 = rnorm(10000)
y_3 = rnorm(1000000)
#put your code for Q6 here
mean_loop = function(x){
  total = 0                 # running sum; named "total" to avoid masking base::sum
  for(val in x){
    total = total + val
  }
  total/length(x)           # sample average
}
#put your code for Q7 here
library(microbenchmark)
microbenchmark(mean(y_1), mean_loop(y_1),
               mean(y_2), mean_loop(y_2),
               mean(y_3), mean_loop(y_3))

Answer: put your text answer for Q7 here

The average run time of each function grows with the size of the input, and for large inputs it grows roughly in proportion to it rather than exponentially. The input y_1 is a vector of 100 random numbers, y_2 is 100 times larger than y_1, and y_3 is 100 times larger than y_2. The built-in mean() takes about 3 times longer on y_2 than on y_1, and about 63 times longer on y_3 than on y_2. For mean_loop(), the run time for y_2 is about 30 times that for y_1, and the run time for y_3 is roughly 120 times that for y_2. The small ratios between y_1 and y_2 are likely due to fixed per-call overhead, which dominates when the input is short.

For a fixed input size, one function can be significantly faster than the other. When the input is relatively small, as with y_1, mean_loop() takes on average roughly half the time that the built-in mean() needs. In contrast, for a larger input such as y_2, the built-in mean() is almost five times faster than mean_loop(). For an input as large as y_3, mean() is about 10 times faster than mean_loop().
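Timing comparisons like these are only meaningful if both functions return the same answer. The following is a minimal sanity check, assuming the loop-based definition from Q6; the seed and the vector y are arbitrary choices made here for reproducibility and are not part of the assignment.

```r
# Loop-based sample average, as in Q6
mean_loop = function(x) {
  total = 0
  for (val in x) {
    total = total + val
  }
  total / length(x)
}

set.seed(1)        # arbitrary seed, only so the check is reproducible
y = rnorm(1000)

# all.equal() compares up to a small numerical tolerance, which is
# appropriate because the two functions accumulate the sum differently
loop_matches = isTRUE(all.equal(mean_loop(y), mean(y)))

# A vectorized alternative to the explicit loop
vectorized_matches = isTRUE(all.equal(sum(y) / length(y), mean(y)))
```

Using all.equal() rather than == avoids spurious mismatches from floating-point rounding.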

Loading in and summarizing/visualizing data

Now we’ll get some practice loading in and visualizing actual data. For this problem, we’ll use data from the grocery delivery service Instacart. A random subset of the data, containing complete data on 10,000 Instacart orders (already merged across several tables) is in the same folder as this file, with filename instacart_sample.csv.

Each row consists of 1 product ordered by a customer, with the order_id and user_id columns indicating the order number and customer number associated with that product. A full description of the columns is given in the data dictionary.

I’ll write hints and suggested functions assuming you are using the tidyverse family of packages (namely dplyr and ggplot2) but you are welcome to use other packages such as data.table to do the data processing.

#put your code for Q8 here
insta = read_csv('instacart_sample.csv', col_types = "ccdccccccdccccc")
column_names = colnames(insta)
num_cols = length(column_names)

num_cols
## [1] 15
column_names
##  [1] "order_id"               "product_id"             "add_to_cart_order"     
##  [4] "reordered"              "user_id"                "eval_set"              
##  [7] "order_number"           "order_dow"              "order_hour_of_day"     
## [10] "days_since_prior_order" "product_name"           "aisle_id"              
## [13] "department_id"          "aisle"                  "department"
#put your code for Q9 here
# Number of products (rows) in each order, largest orders first
order_sizes = insta %>%
  count(order_id, sort = TRUE)


summary(order_sizes)
##    order_id               n        
##  Length:10000       Min.   : 1.00  
##  Class :character   1st Qu.: 5.00  
##  Mode  :character   Median : 8.00  
##                     Mean   :10.06  
##                     3rd Qu.:14.00  
##                     Max.   :72.00
#put your code for Q10 here
# Each row is one product ordered, so counting rows per department and
# dividing by the total row count gives each department's share of products
dept_each_df = insta %>%
  count(department, sort = TRUE) %>%
  mutate(proportion = n / sum(n)) %>%
  select(department, proportion)

colnames(dept_each_df) = c("Department Names", "Proportion of Products Ordered")

dept_each_table = knitr::kable(dept_each_df)

dept_each_table
Department Names    Proportion of Products Ordered
produce             0.2991399
dairy eggs          0.1661314
snacks              0.0870979
beverages           0.0822453
frozen              0.0685228
pantry              0.0557749
bakery              0.0365236
canned goods        0.0330234
deli                0.0317506
dry goods pasta     0.0269378
household           0.0227017
meat seafood        0.0225227
breakfast           0.0209218
personal care       0.0137722
babies              0.0125889
international       0.0081639
alcohol             0.0047233
pets                0.0030528
missing             0.0020285
bulk                0.0013126
other               0.0010640
#The most purchased category is from the "produce" department. The least purchased category is "other".
#put your code for Q11 here

# One row per order with the hour it was placed (all 10000 orders).
# The column was read in as character, so convert it to integer to get
# the hours in numeric order on the x axis.
order_hours = insta %>%
  distinct(order_id, order_hour_of_day) %>%
  mutate(order_hour_of_day = as.integer(order_hour_of_day))

# Number of orders placed in each hour of the day
frequency_counter = order_hours %>%
  count(order_hour_of_day, name = "frequency")

# Bar plot of order counts by hour; geom_col() plots the y values as given
ggplot(data = frequency_counter) +
  geom_col(mapping = aes(x = order_hour_of_day, y = frequency)) +
  labs(x = "Order Hour of Day", y = "Frequency")