MSMK Pre-Term Assignment: R Programming Component

This PDF was generated using an RMarkdown (.Rmd) file. You can edit the .Rmd file directly to add in your code and fill out the answers. To edit .Rmd files and compile them into PDFs, you will need to install RStudio, as well as a working TeX distribution such as TeX Live or MacTeX. See the RMarkdown cheatsheet for tips on how to work with files of this format.

You must submit your answers knitted to a readable PDF to receive credit; we will not grade assignments submitted in .Rmd or .R formats. Make sure the final PDF output is easy to read. I recommend regularly knitting the RMarkdown file as you work on the assignment, so that you catch errors early and don’t get overwhelmed by a bunch of errors all at once at the very end.

First, use the space below to load in packages or do any other setup you might need.

#load your packages here
library(readr)
library(microbenchmark)
library(dplyr)
library(tidyverse)

Basics: data types and vectors

Let’s start with a basic review of the different data types in R. Consider the following four objects:

x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE

Question 1: Use the typeof() function to check the data type of each of the 4 objects created above. Explain in words the difference between each of these data types.

#put your code for Q1 here
typeof(x1)

## [1] "double"

typeof(x2)

## [1] "integer"

typeof(x3)

## [1] "character"

typeof(x4)

## [1] "logical"

Answer: put your text answer for Q1 here

x1 is a double: a numeric value stored as a floating-point number, so it can hold a fractional part. x2 is an integer: the L suffix makes 1L an integer literal, so it can only hold whole numbers. x3 is a character string: the quotes make R treat “1” as text rather than as a number. x4 is a logical, which can only take the values TRUE and FALSE (or NA).

Question 2: Suppose we attempt to add 1 to each of the 4 objects by running x1 + 1, x2 + 1, and so on.
- For which of the 4 objects will running this code return an error, and for which will it return a valid output?
- For the object(s) that result in an error, explain why an error occurs and what we can add to the code to fix it.
- For the object(s) that result in valid output, what will be the data type of the output and why does this occur?

#put your code for Q2 here
x1+1

## [1] 2

x2+1

## [1] 2

as.numeric(x3)+1

## [1] 2

x4+1

## [1] 2

Answer: put your text answer for Q2 here

x3 + 1 returns an error because a number cannot be added to a character string. To fix this, convert x3 to a numeric type first, e.g. as.numeric(x3) + 1. x1 + 1, x2 + 1, and x4 + 1 all return valid output, and in each case the result has type “double”. This is because the literal 1 is itself a double, and when a double is combined with an integer or a logical, R coerces the other operand up to double (TRUE becomes 1).
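The coercion behaviour described in this answer can be checked directly. The sketch below re-creates the four objects so that it runs on its own; the tryCatch wrapper is only there to capture the error from x3 + 1 instead of stopping, and the name err is illustrative. The error text shown is what R reports in an English locale.

```r
# Re-create the four objects from the assignment.
x1 = 1.0; x2 = 1L; x3 = "1"; x4 = TRUE

# Adding the double literal 1 promotes integer and logical to double.
typeof(x1 + 1)   # "double"
typeof(x2 + 1)   # "double"
typeof(x4 + 1)   # "double"

# Adding the integer literal 1L keeps integer and logical as integer.
typeof(x2 + 1L)  # "integer"
typeof(x4 + 1L)  # "integer"

# Character input fails; capture the error message instead of stopping.
err = tryCatch(x3 + 1, error = function(e) conditionMessage(e))
err              # "non-numeric argument to binary operator"

# Explicit conversion fixes it.
as.numeric(x3) + 1   # 2
```

The 1L lines show that the result type depends on both operands: the output is only “double” in Q2 because the 1 being added is a double.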

Now consider the following vector of (fictional) test scores for a (fictional) set of students:

test_scores = c("Kei" = 90,
                "Tahir" = 75,
                "Adhyavu" = 89,
                "Hao"= 83,
                "Sangmin" = 79,
                "Youssef" = 63,
                "Hannah" = 93,
                "Tanner" = 66,
                "Xueying" = 70,
                "Henrique" = 93)
test_scores

##      Kei    Tahir  Adhyavu      Hao  Sangmin  Youssef   Hannah   Tanner 
##       90       75       89       83       79       63       93       66 
##  Xueying Henrique 
##       70       93

Question 3: Print the following sorted and/or subsetted versions of the vector test_scores. Write your code to be general, i.e., it should work on any named vector test_scores regardless of the number of students and the specific names/scores of the students.
- Sorted alphabetically
- Sorted from highest to lowest score
- Subset to only students with scores of at least 80
- Subset to only students with names starting with “T” (hint: try using the grep or grepl function)
- Subset to only students with even numbered scores (hint: try using the %% a.k.a. modulo function)

#put your code for Q3 here
#Sort alphabetically:
test_scores[order(names(test_scores))]

##  Adhyavu   Hannah      Hao Henrique      Kei  Sangmin    Tahir   Tanner 
##       89       93       83       93       90       79       75       66 
##  Xueying  Youssef 
##       70       63

#Sort from highest to lowest:
rev(sort(test_scores))

## Henrique   Hannah      Kei  Adhyavu      Hao  Sangmin    Tahir  Xueying 
##       93       93       90       89       83       79       75       70 
##   Tanner  Youssef 
##       66       63

#Subset to only students with scores of at least 80:
test_scores[test_scores >= 80]

##      Kei  Adhyavu      Hao   Hannah Henrique 
##       90       89       83       93       93

#Subset to only students with names starting with "T":
test_scores[grepl('^T', names(test_scores))]

##  Tahir Tanner 
##     75     66

#Subset to only students with even numbered scores:
test_scores[test_scores%%2 == 0]

##     Kei  Tanner Xueying 
##      90      66      70

In R, we often need to deal with factors, a data type which looks like a character string (text) but works a bit differently. Factors are used to represent categorical variables, where a vector of character strings can only take on a certain number of values. Consider the following vector, where we convert the earlier vector of test scores into factors instead of numeric values. Now when we print the vector, it shows all the different scores that appear in the vector underneath, which are called “levels.” The levels are stored as character strings.

test_scores_factor = factor(test_scores)
test_scores_factor

##      Kei    Tahir  Adhyavu      Hao  Sangmin  Youssef   Hannah   Tanner 
##       90       75       89       83       79       63       93       66 
##  Xueying Henrique 
##       70       93 
## Levels: 63 66 70 75 79 83 89 90 93

levels(test_scores_factor)

## [1] "63" "66" "70" "75" "79" "83" "89" "90" "93"

Question 4: Working with factors can be convenient for a number of reasons, but they can sometimes behave in weird/unexpected ways when you try to convert them to other data types. Let’s take a look at an example of this.
- What happens when you coerce test_scores_factor back into a numeric vector using the as.numeric function? Explain why this happens (look at the documentation of the factor function and/or search on Google or Stack Overflow to help you figure it out).
- Show another way to (correctly) convert test_scores_factor back into a numeric vector so that we don’t run into the problem above.

#put your code for Q4 here
as.numeric(as.character(test_scores_factor))

##  [1] 90 75 89 83 79 63 93 66 70 93

Answer: put your text answer for Q4 here

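Applying as.numeric directly to a factor returns the factor's underlying integer codes (each element's position in the vector of levels, here 1 through 9) rather than the original scores. This happens because a factor is stored internally as an integer vector, with the levels attached as character labels. To recover the scores, convert to character first: as.numeric(as.character(test_scores_factor)).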
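A small sketch of the behaviour discussed in this answer, using an unnamed copy of the scores so that the block is self-contained. The object names scores, f, codes, via_character and via_levels are illustrative only. The second conversion, as.numeric(levels(f))[f], is the alternative suggested in the documentation of factor.

```r
# Scores stored as a factor; the levels are the sorted unique values as strings.
scores = c(90, 75, 89, 83, 79, 63, 93, 66, 70, 93)
f = factor(scores)

# as.numeric() on a factor returns the integer codes, i.e. each element's
# position in levels(f), not the original scores.
codes = as.numeric(f)
codes                                      # 8 4 7 6 5 1 9 2 3 9

# Two ways to recover the original values.
via_character = as.numeric(as.character(f))
via_levels    = as.numeric(levels(f))[f]   # converts each level only once

identical(via_character, scores)           # TRUE
identical(via_levels, scores)              # TRUE
```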

Writing functions: ways of computing sample averages

In this section, let’s get some practice writing functions and benchmarking code efficiency through a simple problem: computing the sample average of a numeric vector.

Question 5: Simulate 3 random vectors from a standard normal distribution, and store them with object names y_1, y_2, and y_3. Make the size of the vectors (i.e., the number of simulations) 100, 10,000, and 1,000,000 respectively.

#put your code for Q5 here
y_1 = rnorm(100)
y_2 = rnorm(10000)
y_3 = rnorm(1000000)

Question 6: Write a function called mean_loop which takes a single input x and returns the sample average of x. Compute the average using a for loop, looping over the elements of x to add them up, then dividing by the number of elements at the end.

#put your code for Q6 here
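mean_loop = function(x){
  # running total of the elements (named total to avoid masking base::sum)
  total = 0
  for(val in x){
    total = total + val
  }
  # divide the total by the number of elements to get the sample average
  total/length(x)
}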
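Before benchmarking, it may be worth confirming that the loop-based mean agrees with the built-in one. The sketch below redefines the function so that it runs on its own; the seed and the test vector x are arbitrary choices made for this illustration.

```r
# Loop-based mean, redefined here so the block runs on its own.
mean_loop = function(x) {
  total = 0
  for (val in x) {
    total = total + val
  }
  total / length(x)
}

set.seed(1)       # arbitrary seed, for reproducibility only
x = rnorm(1000)

# Compare with a tolerance rather than ==, since the two functions may
# accumulate floating-point rounding error differently.
all.equal(mean_loop(x), mean(x))            # TRUE
all.equal(mean_loop(c(1, 2, 3, 4)), 2.5)    # TRUE
```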

Question 7: The function microbenchmark (from the package microbenchmark) lets you measure how fast code runs. If you give it code to evaluate, it will evaluate it 100 times, and return summary statistics on how long the code took to run. If you give it multiple expressions, it will do this for each expression. Use microbenchmark to compare how fast mean_loop is compared to the default R function mean in computing the mean of the vector y_1 that you generated earlier. Repeat this exercise for y_2 and y_3. How do the compute times compare? How does the speed of each function scale by how big the input is?

#put your code for Q7 here
library(microbenchmark)
microbenchmark(mean(y_1), mean_loop(y_1), mean(y_2), mean_loop(y_2), mean(y_3), mean_loop(y_3))

Answer: put your text answer for Q7 here

For both functions, run time grows with the size of the input, but by no more than roughly the same factor as the input, so the growth is about linear at large sizes rather than exponential. y_1 contains 100 random numbers, y_2 is 100 times larger than y_1, and y_3 is 100 times larger than y_2. The built-in mean() takes about 3 times longer on y_2 than on y_1, and about 63 times longer on y_3 than on y_2. mean_loop() takes about 30 times longer on y_2 than on y_1, and roughly 120 times longer on y_3 than on y_2. The small ratios at the low end suggest that fixed per-call overhead dominates for small inputs.

For a fixed input size, one function can be significantly faster than the other. For a small input such as y_1, mean_loop() takes on average roughly half the time that the built-in mean() needs. In contrast, for a larger input such as y_2, the built-in mean() is almost five times faster than mean_loop(), and for an input as large as y_3, mean() is about 10 times faster than mean_loop().
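One rough way to check how run time scales with input size, using only base R, is to look at the time per element: if run time is linear in the input size, that figure should level off as the input grows. The helper time_per_element and its reps argument are hypothetical names introduced for this sketch, and system.time is much coarser than microbenchmark, so very small inputs may simply report 0.

```r
# Time f on a random vector of length n, repeated reps times,
# and return the average elapsed seconds per element.
time_per_element = function(f, n, reps = 5) {
  x = rnorm(n)
  elapsed = system.time(for (i in seq_len(reps)) f(x))[["elapsed"]]
  elapsed / (reps * n)
}

sizes = c(1e4, 1e5, 1e6)
per_element = sapply(sizes, function(n) time_per_element(mean, n))

# One row per input size; timings vary from machine to machine.
data.frame(n = sizes, seconds_per_element = per_element)
```

Passing mean_loop instead of mean gives the same comparison for the loop-based version.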

Loading in and summarizing/visualizing data

Now we’ll get some practice loading in and visualizing actual data. For this problem, we’ll use data from the grocery delivery service Instacart. A random subset of the data, containing complete data on 10,000 Instacart orders (already merged across several tables) is in the same folder as this file, with filename instacart_sample.csv.

Each row consists of 1 product ordered by a customer, with the order_id and user_id columns indicating the order number and customer number associated with that product. A full description of the columns is given in the data dictionary.

I’ll write hints and suggested functions assuming you are using the tidyverse family of packages (namely dplyr and ggplot2) but you are welcome to use other packages such as data.table to do the data processing.

Question 8: Read in the file instacart_sample.csv (e.g., using the read_csv function) and store it as an object. Print the number of rows and columns the table has, and print the names of the columns.

#put your code for Q8 here
insta = read_csv('instacart_sample.csv', col_types = "ccdccccccdccccc")
column_names = colnames(insta)
num_cols = length(column_names)

num_cols

## [1] 15

column_names

##  [1] "order_id"               "product_id"             "add_to_cart_order"     
##  [4] "reordered"              "user_id"                "eval_set"              
##  [7] "order_number"           "order_dow"              "order_hour_of_day"     
## [10] "days_since_prior_order" "product_name"           "aisle_id"              
## [13] "department_id"          "aisle"                  "department"

Question 9: For each of the 10,000 orders in the data, count how big the order was (i.e., how many items the customer ordered). Use the function summary to tabulate some summary statistics about the distribution of order sizes. Hint: The function count may be helpful to count the order sizes.

#put your code for Q9 here
order_sizes = insta %>%
  count(order_id, sort = TRUE)   # one row per order; n = number of items in that order


summary(order_sizes)

##    order_id               n        
##  Length:10000       Min.   : 1.00  
##  Class :character   1st Qu.: 5.00  
##  Mode  :character   Median : 8.00  
##                     Mean   :10.06  
##                     3rd Qu.:14.00  
##                     Max.   :72.00
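To make the counting step concrete, here is a base-R sketch on a hypothetical toy table (the data frame toy and its contents are invented for illustration and are not part of the Instacart sample). Each row is one product, so counting rows per order_id gives the order size.

```r
# Hypothetical toy data: one row per product ordered.
toy = data.frame(
  order_id = c("A", "A", "A", "B", "C", "C"),
  product  = c("milk", "eggs", "kale", "rice", "tea", "jam")
)

# table() counts rows per order_id, i.e. items per order.
sizes = table(toy$order_id)
sizes[["A"]]                 # 3

# Summary statistics of the order sizes (here the sizes are 3, 1 and 2).
summary(as.vector(sizes))
```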

Question 10: Create a table which shows the breakdown of the “department” of items: one column should show the name of each department, and one column should show the percent of items ordered which fall under that department. First create a matrix or data.frame with this information, then use the kable function from the knitr package to convert the matrix/data.frame into a print table. What is the most purchased category? What is the least purchased category?

#put your code for Q10 here
# Count items per department, then convert the counts to shares of all items
dept_each_df = insta %>%
  count(department, sort = TRUE) %>%
  mutate(share = n / sum(n)) %>%
  select(department, share)

colnames(dept_each_df) = c("Department", "Share of Items Ordered")

dept_each_table = knitr::kable(dept_each_df)

dept_each_table

Department	Share of Items Ordered
produce	0.2991399
dairy eggs	0.1661314
snacks	0.0870979
beverages	0.0822453
frozen	0.0685228
pantry	0.0557749
bakery	0.0365236
canned goods	0.0330234
deli	0.0317506
dry goods pasta	0.0269378
household	0.0227017
meat seafood	0.0225227
breakfast	0.0209218
personal care	0.0137722
babies	0.0125889
international	0.0081639
alcohol	0.0047233
pets	0.0030528
missing	0.0020285
bulk	0.0013126
other	0.0010640

#The most purchased category is "produce". The least purchased category is "other".
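The question asks for percentages, which are the shares above multiplied by 100. A base-R sketch on a hypothetical toy vector (the department values below are invented for illustration) shows one way to build such a table with prop.table.

```r
# Hypothetical toy data: the department of each ordered item.
department = c("produce", "produce", "dairy eggs", "snacks", "produce")

# Percent of items per department, sorted from most to least purchased.
pct = sort(100 * prop.table(table(department)), decreasing = TRUE)
pct_df = data.frame(department = names(pct), percent = as.vector(pct))
pct_df

# The most purchased department is the first row of the sorted table.
pct_df$department[1]    # "produce"
```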

Question 11: Plot the distribution of the time-of-day (order_hour_of_day) that orders are placed (e.g., as a barplot or density plot).

#put your code for Q11 here

# Keep one row per order, then count orders by hour of day.
# order_hour_of_day was read in as character, so convert it to integer
# to keep the hours in numeric order on the x axis.
orders_by_hour = insta %>%
  distinct(order_id, order_hour_of_day) %>%
  mutate(order_hour_of_day = as.integer(order_hour_of_day)) %>%
  count(order_hour_of_day, name = "frequency")

ggplot(data = orders_by_hour) +
  geom_col(mapping = aes(x = order_hour_of_day, y = frequency)) +
  labs(x = "Order Hour of Day", y = "Frequency")

Shin Oblander <–Ziye Shao

Fall 2022
