** The Zoom session will be recorded **
** Please click all the tabs (in sequence) to get the entire set of information in these pages. **
This is a 10-hour course: 3 meetings a week of 1 hour each for 3 weeks, plus a final class where you get to present your project. Classes will be held on Zoom, 3-4 pm (CA time) on MWF (starting July 20). Sessions will be quite interactive and hands-on: you will follow along and write your own code. There are no fees as long as you bring a good attitude and a computer with a good Wi-Fi connection.
You will learn how to do basic data manipulation, including automatically fetching lots of data from websites, rearranging and querying the data, doing statistical analysis, and building useful visualizations, so that you can answer questions such as: Do box office returns for movies predict secondary sales (DVD etc.) and overall profitability? Are early election polls good predictors of final outcomes?
Skim through the graphics on this webpage, and identify 2 that are high priority for you to learn: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
Watch this video: https://www.youtube.com/watch?v=jbkSRLYSojo, or a slightly older version of it with a younger Hans Rosling, https://www.youtube.com/watch?v=WjVHvC9EeB4, and here’s another cool one by Hans, https://www.youtube.com/watch?v=ezVk1ahRF78.
Note: the code chunk above shows you how to add images to your project and story.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
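The lines above set document-wide preferences: echo = TRUE shows the code alongside its output, warning = FALSE and message = FALSE hide warnings and messages, a large scipen value discourages scientific notation, and digits = 3 asks R to print about 3 significant digits.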
install.packages("devtools")
devtools::install_github("jacobkap/boxoffice")
R is an extensible system. It is extended by installing new “packages” or “libraries” via the install.packages command. This is a one-time step, but it has to be repeated when R itself gets upgraded.
Then you must “add” the packages you need to your runtime environment through the library() command (which has to be done each time you execute the “knit” command or reload your workspace).
library(knitr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice)
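Since installation is a one-time step, it can help to check which packages are missing before installing anything. The following is a minimal sketch, not part of the course code: missing_packages is a hypothetical helper name, and it relies only on base R's requireNamespace, which loads a package's namespace without attaching it.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("knitr", "dplyr", "ggplot2")
to_install <- missing_packages(wanted)

# Report what still needs the one-time install.packages() step.
if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
} else {
  message("All packages are installed; load them with library().")
}
```

Running install.packages(to_install) once would then install only what is missing; library() is still needed in every session.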
The basic building block in R is a data frame: think of it as a table with lots of rows (or observations) and columns (or variables). As you work with data sets and as you write and execute code, these elements get loaded into your workspace. Here’s an example, which also introduces a few basic commands: the assignment operator; picking items from a data set; ways to create sequences; sampling data; and arithmetic operations.
# build an example data frame
df.eg <- data.frame(col1 = letters[1:5], col2 = seq(2,4,length=5), col3 = sample(50,5))
df.eg$col4 = (100*df.eg$col2)/df.eg$col3
# print the first few rows
print(head(df.eg))
##   col1 col2 col3  col4
## 1    a  2.0   22  9.09
## 2    b  2.5   16 15.62
## 3    c  3.0   28 10.71
## 4    d  3.5   49  7.14
## 5    e  4.0   39 10.26
# Last date of the year for every year this decade
print(paste(2010:2019,"-12-31",sep=""))
## [1] "2010-12-31" "2011-12-31" "2012-12-31" "2013-12-31" "2014-12-31"
## [6] "2015-12-31" "2016-12-31" "2017-12-31" "2018-12-31" "2019-12-31"
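The example above builds a data frame; picking individual items out of it can be sketched as follows. The call to set.seed is an addition (not in the original example) so that sample returns the same values on every run, which means col3 will differ from the output printed above.

```r
# Rebuild the example data frame; set.seed() is added here for repeatability.
set.seed(1)
df.eg <- data.frame(col1 = letters[1:5],
                    col2 = seq(2, 4, length.out = 5),
                    col3 = sample(50, 5))

df.eg$col2               # one column, returned as a vector
df.eg[2, ]               # the second row, as a one-row data frame
df.eg[2, "col2"]         # a single cell: row 2 of col2, which is 2.5
df.eg[df.eg$col2 > 3, ]  # only the rows where col2 is greater than 3
nrow(df.eg)              # number of rows: 5
mean(df.eg$col2)         # a statistical function applied to a column: 3
```

The pattern df[rows, columns] covers most cases: leave one side blank to keep everything on that side.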
Anytime you want to write code (which is executed by the program) you need to enclose it between special begin and end lines as shown below. Chunk names must be unique within a document. In the picture below I called it new.code.chunk, but you should select a meaningful name for each chunk you write.
We’ll get everyone set up (quickly, as long as you did the Prerequisites and Preparation Steps above), get familiar with RStudio, and learn how to extend it with new packages (and load them). Then we’ll look at some good examples of getting useful insights or making tough decisions with the help of data analysis, and finally see how to get large amounts of data from external websites.
After the basic steps of learning our way around RStudio, we’ll do something fun! We’ll write some code to download movie sales data, process it, and visualize it.
For the rest and details, see https://rpubs.com/hkb/DAX-Session1.
Storytelling with RMarkdown: Headers, text, comments, R code chunks;
Knowing your way around RStudio
The basics of R:
assignment and referencing data frames, rows, columns, cells;
Basics of R: assignment and referencing data elements, creating and referencing data frames, rows, columns, cells
Dig into the code, learn how to write a function, and learn some new useful functions
Understand the meaning of the visualizations we did, and how to reproduce them
Identify interesting questions to ask about the topic you’re studying with data (here, movies) and how to answer those questions through data analysis
arithmetic and statistical functions; logical operations; vectors and applying functions on vectors; writing your own function.
Data manipulation with dplyr package.
dplyr package : https://dplyr.tidyverse.org/, https://blog.exploratory.io/filter-data-with-aggregate-and-window-functions-88e3b2353c00
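The base R topics listed above (arithmetic and statistical functions, logical operations, vectors, and writing your own function) can be sketched in a few lines. The numbers and the names gross, pct_of_total and size_label are made up for illustration; they are not taken from the course data.

```r
# A numeric vector: arithmetic and comparisons apply to every element at once.
gross <- c(120, 45, 300, 80)   # made-up numbers, for illustration only
gross * 2                      # 240  90 600 160
gross > 100                    # TRUE FALSE TRUE FALSE

# Statistical functions summarise a whole vector.
mean(gross)                    # 136.25
median(gross)                  # 100

# Writing your own function: each value's share of the total, in percent.
pct_of_total <- function(x) {
  100 * x / sum(x)
}
shares <- pct_of_total(gross)

# Applying a function to each element of a vector with sapply().
size_label <- sapply(gross, function(g) if (g > 100) "high" else "low")
print(size_label)
```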
Basic R operations on data frames, rearranging data: e.g., how many times each distributor features in the top-10 every year?
Movie sales data - association between number of days in theater and total_gross revenues.
Google mobility data.
Covid spread data - the association between cases and deaths.
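As a rough illustration of the distributor question above, here is a minimal dplyr sketch on a toy table. The data are invented, and the column names year and distributor are assumptions about how the real movie data might look (total_gross is the name used in the examples above). The toy example keeps the top 2 per year instead of the top 10 so that the result is easy to check by eye.

```r
library(dplyr)

# Made-up toy data, for illustration only: one row per movie per year.
movies <- data.frame(
  year        = c(2018, 2018, 2018, 2019, 2019, 2019),
  distributor = c("A", "A", "B", "A", "B", "B"),
  total_gross = c(500, 300, 200, 400, 350, 100)
)

top_n_per_year <- 2  # the course question uses 10; 2 keeps the toy table small

distributor_counts <- movies %>%
  group_by(year) %>%                                          # work year by year
  filter(min_rank(desc(total_gross)) <= top_n_per_year) %>%   # keep the top movies
  ungroup() %>%
  count(year, distributor, name = "n_in_top")                 # appearances per distributor

print(as.data.frame(distributor_counts))
```

Here min_rank is a window function: it ranks movies within each year, which is why the group_by step comes first.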
Assignment: Watch the movie Moneyball