** The Zoom session will be recorded **

** Please click all the tabs (in sequence) to get the entire set of information in these pages. **

Overview

Format

This is a 10-hour course: 3 one-hour meetings per week for 3 weeks, plus a final class where you get to present your project. Classes will be held on Zoom, 3-4 pm (CA time) on MWF (starting July 20). Sessions will be quite interactive and hands-on: you will follow along and write your own code. There are no fees as long as you bring a good attitude and a computer with a good Wi-Fi connection.

Learning Objectives

You will learn how to do basic data manipulation - including automatically fetching lots of data from web sites, rearranging and querying the data, doing statistical analysis, and building useful visualizations - so that you can answer questions such as: Do box office returns for movies predict secondary sales (DVD, etc.) and overall profitability? Or: Are early election polls good predictors of final outcomes?

Course Requirements and Preparation

  • Go to rstudio.cloud and log in with Google.
  • Create a new file (using File/New File/RMarkdown) and give it any name ending in .Rmd. A prompt will pop up asking to install packages; say yes.

Note: the code chunk above shows you how to add images to your project and story.

Getting Set Up to Run RStudio

Some R Preliminaries

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)

The lines above just declare some document-wide configuration preferences.
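
To see what the two options() settings do, here is a minimal sketch that uses only base R. The value 999 for scipen is an assumption chosen for illustration (any sufficiently large penalty switches off scientific notation for everyday numbers), and the variable names are made up for this example.

```r
# Save the current settings and start from R's defaults
old <- options(scipen = 0, digits = 7)

default.big <- format(1e10)   # large numbers use scientific notation by default
default.pi  <- format(pi)     # 7 significant digits by default

# Penalize scientific notation and show 3 significant digits
# (999 is an illustrative value, not the one used in the course file)
options(scipen = 999, digits = 3)

plain.big <- format(1e10)     # now written out in full
short.pi  <- format(pi)       # now rounded to 3 significant digits

print(c(default.big, plain.big))
print(c(default.pi, short.pi))

# Restore whatever settings were active before this example
options(old)
```

Saving the old settings and restoring them at the end keeps the example from changing how the rest of your document prints.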

install.packages("devtools")
devtools::install_github("jacobkap/boxoffice")

R is an extensible system. It is extended by installing new “packages” or “libraries” via the install.packages command. This is a one-time step, but it has to be repeated when R itself gets upgraded.

Then you must “add” the packages you need to your runtime environment through the library() command (which has to be done each time you execute the “knit” command or reload your workspace).

library(knitr)

library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice)
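
The install-once, load-every-time pattern described above can be sketched as follows. The helper ensure_package is hypothetical (it is not part of the course code), and the example uses the stats package only because it ships with R, so nothing needs to be downloaded.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # one-time step, skipped when already installed
  }
  # TRUE if the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an install
ok <- ensure_package("stats")

# Loading still has to happen in every new session or knit
library(stats)
print(ok)
```

For a package that is not yet installed, the same call would run install.packages first, which needs an internet connection.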

Working with Data

The basic building block in R is a data frame. Think of it as a table with lots of rows (or observations) and columns (or variables). As you work with data sets and as you write and execute code, these elements get loaded into your workspace. Here’s an example, which also introduces a few basic commands: the assignment operator; picking items from a data set; ways to create sequences; sampling data; and arithmetic operations.

# build an example data frame
df.eg <- data.frame(col1 = letters[1:5], col2 = seq(2,4,length=5), col3 = sample(50,5))
df.eg$col4 = (100*df.eg$col2)/df.eg$col3

# print the first few rows
print(head(df.eg))
##   col1 col2 col3  col4
## 1    a  2.0   22  9.09
## 2    b  2.5   16 15.62
## 3    c  3.0   28 10.71
## 4    d  3.5   49  7.14
## 5    e  4.0   39 10.26

# Last date of the year for every year this decade
print(paste(2010:2019,"-12-31",sep=""))
##  [1] "2010-12-31" "2011-12-31" "2012-12-31" "2013-12-31" "2014-12-31"
##  [6] "2015-12-31" "2016-12-31" "2017-12-31" "2018-12-31" "2019-12-31"
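
Picking items out of a data frame can be sketched as below. This is a minimal illustration, not course code: the fixed seed and the name big.rows are assumptions added so the example is reproducible and self-contained.

```r
# Fixed seed (an assumption for this sketch) so the sampled column is repeatable
set.seed(42)

df.eg <- data.frame(col1 = letters[1:5],
                    col2 = seq(2, 4, length.out = 5),
                    col3 = sample(50, 5),
                    stringsAsFactors = FALSE)

# Pick a whole column by name (returns a vector)
print(df.eg$col2)

# Pick a single cell: row 2 of the column called "col1"
print(df.eg[2, "col1"])

# Pick the rows that satisfy a condition (note the trailing comma: all columns)
big.rows <- df.eg[df.eg$col2 >= 3, ]
print(big.rows)
```

The pattern df[rows, columns] is the general form: leave either side empty to keep everything along that dimension.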

Writing Code Chunks

Anytime you want to write code (which is executed by the program), you need to enclose it between special begin and end lines as shown below. Every code chunk has to have a unique name. In the picture below I called it new.code.chunk, but you should select a meaningful name for each chunk you write.
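
In standard R Markdown, the begin line is three backticks followed by {r chunk.name}, and the end line is three backticks on their own. As a rough sketch, the R code below builds a tiny .Rmd file containing one named chunk and prints it so you can see the layout. The file contents and the variable names are made up for illustration.

```r
# Three backticks, built with strrep() so they are easy to spot in the output
fence <- strrep("`", 3)

# Lines of a tiny R Markdown file: one line of notes, then one named chunk
rmd_lines <- c(
  "Some notes to myself about the data.",
  "",
  paste0(fence, "{r new.code.chunk}"),   # begin line, with the chunk name
  "x <- 1:10",
  "mean(x)",
  fence                                  # end line
)

# Write the lines to a temporary .Rmd file and show what the file looks like
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")
```

In RStudio you would normally type these lines directly into your .Rmd file (or use the Insert Chunk button) rather than generate them with code.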

Class Sessions

Session 1: Setup and Examples

Session Overview

We’ll get everyone set up (quickly, as long as you did the prerequisites and preparation steps above), get familiar with RStudio, and learn how to extend it with new packages (and load them). Then we’ll look at some good examples of getting useful insights or making tough decisions with the help of data analysis, and finally see how to get large amounts of data from external websites.

  • Set up your RStudio account, install important packages
  • Create a file, add useful packages (via the library() command)
  • Have fun playing around with data about movie sales
  • Write notes (to yourself), storytelling, and code inside the file.
  • Execute chunks of the file or the entire file.
  • Publish your results to the web!

After the basic steps of learning our way around RStudio, we’ll do something fun! We’ll write some code to download movie sales data, process it, and visualize it.

For the rest and details, see https://rpubs.com/hkb/DAX-Session1.

Session Plan

Later Sessions

Session 2: Learning the Basics of RStudio and R

Session 3

  • Basics of R: assignment and referencing data elements, creating and referencing data frames, rows, columns, cells

  • Dig into the code, learn how to write a function, and learn some new useful functions

  • Understand the meaning of the visualizations we did, and how to reproduce them

  • Identify interesting questions to ask about the topic you’re studying with data (here, movies) and how to answer those questions through data analysis

Session 4: Summarizing and Reshaping Data

Basic R operations on data frames and rearranging data. For example: how many times does each distributor feature in the top 10 each year?
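
A question like the distributor count above can be sketched with a cross-tabulation. The data frame below is made-up toy data (invented years, ranks, and studio names), and it uses a top 3 instead of a top 10 to keep the example short.

```r
# Toy data, invented for illustration: one row per movie in a yearly ranking
top.movies <- data.frame(
  year        = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  rank        = c(1, 2, 3, 4, 1, 2, 3, 4),
  distributor = c("Studio A", "Studio B", "Studio A", "Studio C",
                  "Studio A", "Studio C", "Studio B", "Studio B"),
  stringsAsFactors = FALSE
)

top.n <- 3   # the course question uses 10; 3 keeps the toy data small

# Keep only the movies ranked inside the top list
in.top <- top.movies[top.movies$rank <= top.n, ]

# Count appearances: one row per year, one column per distributor
counts <- table(year = in.top$year, distributor = in.top$distributor)
print(counts)
```

With dplyr loaded, count(in.top, year, distributor) gives the same counts in long form, one row per year and distributor pair.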

Session 5: Correlation, Association and Causation

Movie sales data - the association between number of days in theater and total_gross revenues.

Google mobility data.

Covid spread data - the association between cases and deaths.
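
Each of the associations listed above can be explored with a correlation coefficient and a simple linear fit. Here is a minimal sketch for the movie case. The numbers are invented toy values, not real box office data, and the column name days_in_theater is an assumption for this example.

```r
# Toy data, invented for illustration (gross in millions of dollars)
movies <- data.frame(
  days_in_theater = c(10, 20, 30, 40, 50),
  total_gross     = c(5, 9, 16, 18, 27)
)

# Pearson correlation: how closely the two columns move together (-1 to 1)
r <- cor(movies$days_in_theater, movies$total_gross)

# Straight-line fit: how much gross changes per extra day in theaters
fit <- lm(total_gross ~ days_in_theater, data = movies)
slope <- coef(fit)[["days_in_theater"]]

print(r)
print(slope)
```

A strong correlation only says the two quantities move together. It does not say that more days in theaters cause higher revenue; the reverse direction or a third factor could explain the same pattern.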

Session 3

Assignment: Watch the movie Moneyball

Session 6: Guest lecture by Thomas

Appendix