Announcements
New Markdown/HTML format
My discussions with students have confirmed my experience:
It is difficult for students (and me) to switch between PowerPoint, R, and the Interactive Questions.
To simplify lectures, I will keep everything in R Markdown and provide script files.
To that end, I spent some time experimenting with new HTML styles for lecture presentations.
This one, called downcute, has a ‘dark’ option that may be easier on the eyes.
What do you think?
Projects:
As mentioned in Lecture 10, if you submitted your project survey, you have been assigned to a project group.
Please reach out to your group and spend some time before Thursday, 3/25, finding 2 or 3 candidate data sources.
Data criteria:
Exportable to Excel,
or copy friendly to Excel, e.g., BoxOfficeMojo,
or able to be directly imported into R, e.g., Yahoo Finance
Has at least one (or more) components that are currently being updated (weekly or more often)
Interesting to you as students
Quiz 1 is on Thursday, 3/18 during class time.
80% of the quiz will be timed during lecture (50 min. over Zoom)
Create a subset of the flights data as specified (a rough sketch of this workflow follows the list below)
Answer questions about that subset
Create 1 or 2 basic plots (not formatted)
Export data summaries
Submit R script file (with basic comments)
20% will be take-home (Due Friday 3/19/21 at midnight)
Creating a mini-table data set (R Markdown not required)
Format plot from quiz to be presentation ready (axes, etc.)
Answering a question or two in complete sentences.
Submit R script file of take-home work
R Markdown skills will not be on Quiz 1, but may be on Quiz 2
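For review, here is a minimal sketch of what a script for the in-class tasks above might look like. It assumes the flights data come from the nycflights13 package and uses a made-up subset condition; the actual subset, questions, and summaries will be specified on the quiz.
# load packages (assumed: nycflights13 for the flights data, ggplot2 for plots)
library(nycflights13)
library(ggplot2)

# create a subset of the flights data (example condition: January flights from JFK)
jfk_jan <- flights[flights$origin == "JFK" & flights$month == 1, ]

# answer a question about the subset, e.g., the mean departure delay
mean(jfk_jan$dep_delay, na.rm = TRUE)

# create a basic (unformatted) plot
ggplot(data = jfk_jan, mapping = aes(x = dep_delay)) +
  geom_histogram()

# export a data summary to a .csv file
write.csv(cbind(summary(jfk_jan$dep_delay)), "dep_delay_summary.csv")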
Review for Quiz 1
Concepts from basic statistics course (MAS 261 or equivalent)
CV = Standard Deviation/Sample Mean
CV is Coefficient of Variation
CV is better for comparing variability between groups with different means (a quick sketch follows below).
Standard deviation = square root of variance
In R: sd(…) or sqrt(var(…))
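A quick sketch of why the CV is useful (the two toy samples below are made up for illustration; the second is the first divided by 10):
# toy samples with very different means
group_a <- c(98, 102, 101, 99, 100)
group_b <- group_a/10

# the standard deviations differ by a factor of 10...
sd(group_a)
sd(group_b)

# ...but the CVs are identical, so the relative variability is the same
sd(group_a)/mean(group_a)
sd(group_b)/mean(group_b)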
Doing basic calculations in R
Recall Lectures 1 and 2 and HW 2
Data from a variable were summarized to find a probability:
IRRITATING SIDE NOTE R Markdown does not recognize data imported as an .rds file.
# Recall that All_Scores simulated 10000 'first hands' of Blackjack
# import .csv file because .rds doesn't work with R Markdown
All_Scores <- read.csv("All_Scores.csv")
# remind yourself what this calculation is finding:
# create a variable with a shorter name
all <- All_Scores$All_Scores
# calculate probability
round(length(all[all == 21])/length(all), 2)
[1] 0.05
- Useful Base R commands from Lectures 1 - 9 (in alphabetical order):
R Function | Description |
---|---|
as.numeric(…) | forces values to be treated as numeric |
c(…) | concatenates |
data.frame(…) | creates a data frame with input variables |
dim(…) | outputs number of rows and number of columns of a data frame |
head(…) | shows first 6 obs. by default |
ifelse(…) | creates a new two category variable based on test in first input |
length(…) | outputs how many values are in a vector |
* mean(…) | calculates the mean of a vector |
* median(…) | calculates the median of a vector |
rbind(…) | row binds or stacks values |
rep(…) | replicates or repeats specified value or object |
round(…) | rounds to decimal precision specified |
row.names(…) | outputs the row names of a data set |
sample(…) | samples a vector |
* sd(…) | calculates the sample standard deviation |
sqrt(…) | calculates the square root |
str(…) | examines structure of a data frame or tibble |
* sum(…) | sums values |
summary(…) | outputs numerical summary values |
tail(…) | shows last 6 obs. by default |
* var(…) | calculates the sample variance |
which.min(…) | identifies obs. or row number of min. value |
which.max(…) | identifies obs. or row number of max. value |
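A quick sketch exercising a few of the commands above that are not re-demonstrated later in this review (the toy vector is made up for illustration):
# toy vector of exam scores
scores <- c(88, 72, 95, 61, 79)

# ifelse creates a two-category variable from a logical test
ifelse(scores >= 70, "Pass", "Fail")

# rep repeats a value; sample draws randomly from a vector
rep("B", times = 3)
sample(scores, size = 2)

# rbind stacks vectors into rows; dim reports rows and columns
m <- rbind(scores, scores/100)
dim(m)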
In the previous (updated) list, please notice the caption and the starred (*) commands.
If a vector contains missing values, R commands like mean will not work without the option na.rm = T:
We will talk more about NAs after Quiz 1.
NAs have some unique characteristics in R (a short preview follows the example below).
# numeric vector with missing values
example <- c(1, 5, 7, NA, 3, 19, NA, -2)
# attempt to calculate mean without removing missing values will not output a
# value
mean(example)
[1] NA
# calculate mean after removing missing values outputs a value
mean(example, na.rm = T)
[1] 5.5
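As a brief preview of those characteristics (a minimal sketch using the example vector above; the full discussion comes after the quiz):
# NA means 'unknown', so comparisons involving NA return NA rather than TRUE/FALSE
NA == NA
example == 3

# use is.na() to test for missing values and sum() to count them
is.na(example)
sum(is.na(example))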
Also, which.max and which.min are identical in function to code you have already seen:
# create mini-data set from High and Low temps 3/8 - 3/14
<- c("Mon.", "Tue.", "Wed.", "Thu.", "Fri.", "Sat.", "Sun.")
Day <- c(37, 49, 63, 72, 58, 40, 39)
High <- c(8, 28, 26, 49, 30, 25, 23)
Low
# create and display data frame
<- data.frame(Day, High, Low)
Last_Wk_df Last_Wk_df
Day High Low
1 Mon. 37 8
2 Tue. 49 28
3 Wed. 63 26
4 Thu. 72 49
5 Fri. 58 30
6 Sat. 40 25
7 Sun. 39 23
# create tibble (fancy modern data frame) from data frame and display it
Last_Wk_tb <- as_tibble(Last_Wk_df)
Last_Wk_tb
# A tibble: 7 x 3
Day High Low
<chr> <dbl> <dbl>
1 Mon. 37 8
2 Tue. 49 28
3 Wed. 63 26
4 Thu. 72 49
5 Fri. 58 30
6 Sat. 40 25
7 Sun. 39 23
# structure of data frame
str(Last_Wk_df)
'data.frame': 7 obs. of 3 variables:
$ Day : chr "Mon." "Tue." "Wed." "Thu." ...
$ High: num 37 49 63 72 58 40 39
$ Low : num 8 28 26 49 30 25 23
# structure of tibble
str(Last_Wk_tb)
tibble [7 x 3] (S3: tbl_df/tbl/data.frame)
$ Day : chr [1:7] "Mon." "Tue." "Wed." "Thu." ...
$ High: num [1:7] 37 49 63 72 58 40 39
$ Low : num [1:7] 8 28 26 49 30 25 23
# Example of which.max using data frame
Last_Wk_df[which.max(Last_Wk_df$High), ]
Day High Low
4 Thu. 72 49
# Notice which.max is identical in function to this code from HW 3
Last_Wk_df[Last_Wk_df$High == max(Last_Wk_df$High), ]
Day High Low
4 Thu. 72 49
Interactive Question 1:
What is the correct R code to find the day of the lowest low (min. of Low) using which.min in the Last_Wk_df data set?
- Common Operators Used in R:
R Operator | Description |
---|---|
<- | assign |
(…) | round parentheses are used for function inputs |
[…] | square brackets are used for subsetting |
{…} | curly brackets are used for loops and functions |
%in% | finds elements in or belonging to |
* | multiply by |
/ | divide by |
- Logical Operators in R
R Logic Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
!x | not x |
x|y | x OR y |
x&y | x AND y |
isTRUE(x) | test if x is TRUE |
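A short sketch combining these operators with the square-bracket subsetting from the operator table, using the Last_Wk_df data set from above:
# days with a high above 50 AND a low above 25
Last_Wk_df[Last_Wk_df$High > 50 & Last_Wk_df$Low > 25, ]

# weekend days, using %in%
Last_Wk_df[Last_Wk_df$Day %in% c("Sat.", "Sun."), ]

# logical test: was the warmest high exactly 72?
isTRUE(max(Last_Wk_df$High) == 72)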
Interactive Question 2:
What values would appear if I submitted the R code
Last_Wk_df[5, c(1,2)]
Starting in Lecture 6, we moved on to importing, cleaning, and managing data.
Data can be imported directly from sources like Yahoo Finance using the Quantmod package.
The R command to import data from Yahoo Finance is getSymbols(…) (a brief sketch appears below)
Due to international complications with data sourcing:
importing data directly from Yahoo Finance will not be tested
students should STILL know how to do this (It could be a short answer or matching question)
students are welcome to use these data for their projects.
Once data are downloaded to R, students from any country can access them.
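Since getSymbols(…) could appear as a short-answer or matching question, here is a minimal sketch of the quantmod workflow (the ticker symbol and date range are chosen for illustration, not taken from lecture):
# load the quantmod package
library(quantmod)

# download daily prices from Yahoo Finance; by default this creates an object
# named after the ticker symbol (here, SPY)
getSymbols("SPY", src = "yahoo", from = "2020-01-01", to = "2021-03-01")

# once downloaded, the data can be examined like any other R object
head(SPY)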
Data can also be imported or copied into Excel and saved as a .csv file.
Best practice is to do “as little as possible” to clean the data in Excel
The more you can do with R code, the more easily you can reproduce the steps when data are updated.
Lecture 6 and HW 3 provided an introduction to common data cleaning tasks
These steps can be done with Base R, or tidyverse commands
Some tidyverse commands and Base R commands may conflict or be incompatible
Before talking about plots (Lectures 7 and 9), we’ll review Lecture 8:
The types of data
The modern “verbs” of data management using the ‘dplyr’ package
The str(…) command, which we have talked about previously and used in HW 3, allows us to examine the structure of the data and the variables.
Here is a table (from Lecture 8) showing the variable types defined in R.
Abbreviation | Definition |
---|---|
int | integers |
dbl (num) | doubles or real numbers (can be decimal; called num in Base R) |
chr | character or text strings (can be used as a categorical variable) |
dttm | date-time (a date + a time) |
lgl | logical (TRUE or FALSE) |
fctr (Factor) | factor (categorical variable) |
date | date |
Note that the terms in parentheses in the Abbreviation column refer to the names of these variable terms used in Base R.
Function | Use |
---|---|
filter() | Pick (subset) observations by their value |
arrange() | Reorder the rows |
select() | Pick variables by their names |
rename() | Rename a variable in a data set |
mutate() | Create new variables with functions of existing variables |
summarise() | Collapse many values down to a single summary |
group_by() | Used in conjunction with these functions to change scope, e.g., by category |
Lecture 8 provides an example of each of these functions and demonstrates how to use group_by(…) and summarize(…) together.
HW 4 also provides some practice using these commands.
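A quick sketch applying a few of these verbs to the toy Last_Wk_df data set from earlier (the new column names, like Range and Weekend, are made up for illustration):
# dplyr is loaded as part of the tidyverse
library(dplyr)

# filter: keep only days where the high reached 50 or more
filter(Last_Wk_df, High >= 50)

# arrange: order the rows from coldest to warmest high
arrange(Last_Wk_df, High)

# mutate: create a new variable from existing variables
mutate(Last_Wk_df, Range = High - Low)

# rename: rename a variable
rename(Last_Wk_df, Weekday = Day)

# group_by + summarise: create a category, then summarise within it
Wk <- mutate(Last_Wk_df, Weekend = ifelse(Day %in% c("Sat.", "Sun."), "Yes", "No"))
summarise(group_by(Wk, Weekend), Mean_High = mean(High))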
Interactive Question 3.
Using our “toy” data set, Last_Wk_df, create a new data set named Last_Wk_Lows that only includes ‘Day’ and ‘Low’ and omits ‘High’
There are two ways to do this using one of the functions above.
Lectures 7 and 9 focused on ggplot and examining data for data management:
There are many, MANY types of plots.
We covered three plots that are useful for exploring data.
Scatterplot - geom_point
Histogram - geom_histogram
Frequency Distribution(s) - geom_freqpoly
Scatterplot - geom_point
During a timed quiz, you should be able to create the most basic scatterplot:
# import data from Lecture 7
<- read.csv("credit_scores.csv")
cred
# Basic scatterplot
<- ggplot(data = cred, mapping = aes(x = LIMIT, y = DEBT)) +
p geom_point()
p
In Lecture 7 we covered some minimal data management for plotting data and iteratively improved this scatterplot to end up with this:
# suppress scientific notation
options(scipen = 100)
# Divide our LIMIT (credit limit) and DEBT variables by 1000
cred$Limit_k <- cred$LIMIT/1000
cred$Debt_k <- cred$DEBT/1000

# exclude Divorced/Widow due to small sample size (both commands achieve same
# goal)
cred1 <- filter(cred, MARITAL != "Divorced/Widow")
cred1 <- cred[cred$MARITAL %in% c("Married", "Single"), ]

# Final Plot with facets:
p <- ggplot(data = cred1, mapping = aes(x = Limit_k, y = Debt_k)) +
  geom_point(mapping = aes(size = LATE, alpha = 0.1, color = GENDER)) +
  scale_color_manual(values = c("steelblue1", "lightgreen")) +
  labs(title = "Credit Limit vs Debt", x = "Credit Limit ($K)", y = "Credit Card Debt ($K)") +
  facet_wrap(facets = c("DEFAULT", "MARITAL")) +
  theme_classic()
p
Histogram - geom_histogram
During a timed quiz, you should also be able to create the most basic histogram:
# basic histogram
h <- ggplot(data = cred1, aes(x = Debt_k)) +
  geom_histogram()
h
In Lecture 7, we numerically summarized the debt load of those who defaulted to find the median as a reference:
# use filter to subset data to those who defaulted
cred_default <- filter(cred1, cred1$DEFAULT == "Yes")
# numerically summarize debt distribution of those who defaulted
round(summary(cred_default$Debt_k), 1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 10.8 114.4 258.1 309.3 2691.5
# display numerical summary as a column (useful for tables)
cbind(round(summary(cred_default$Debt_k), 1))
[,1]
Min. 0.0
1st Qu. 10.8
Median 114.4
Mean 258.1
3rd Qu. 309.3
Max. 2691.5
In Lecture 7, we also log transformed the x-axis of the histogram (not the data) and adjusted the debt variable.
Debt was adjusted so that 0 values (those with no debt) were still 0 on the natural log scale (ln(1) = 0).
Recall that a log transformation was necessary because the data were highly skewed.
Knowing how and why to transform an axis is an important aspect of data management.
# final histogram of credit card default data
# same plot as directly above but with x axis breaks specified
cred1$debtk_adj <- round(cred1$Debt_k + 1)

h <- ggplot(data = cred1, aes(x = debtk_adj)) +
  geom_histogram(color = "darkblue", fill = "lightblue") +
  scale_x_continuous(trans = "log", breaks = c(10, 100, 1000, 3000)) +
  geom_vline(aes(xintercept = 114.43), linetype = "dashed", size = 1, col = "darkred") +
  labs(title = "Distribution of Credit Card Debt", x = "Debt ($K)", y = "Frequency",
       caption = "50% of people who defaulted had a debt $114K or greater") +
  theme_classic()
h
***
Lastly, in Lecture 9 (and HW 4) we covered:
parsing text using separate(…) to create new variables
joining text using mutate(…) and paste(…) to create new variables
plotting data by category using geom_freqpoly
Study suggestions
- Review R code carefully from HW 4
- Make sure you know how to parse and join text (a small refresher sketch follows below)
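A minimal refresher sketch of the parse/join pattern (the small data frame and column names here are made up for illustration, not the HW 4 data):
# separate() is from the tidyr package; mutate() is from dplyr
library(tidyr)
library(dplyr)

# toy data with a combined date field
movies <- data.frame(Release = c("2020-03-13", "2020-07-04"),
                     Title = c("Movie A", "Movie B"))

# parse text: split one column into several with separate()
movies <- separate(movies, col = Release, into = c("Year", "Month", "Day"), sep = "-")

# join text: combine columns into a new variable with mutate() and paste()
movies <- mutate(movies, Title_Year = paste(Title, Year, sep = " - "))
movies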
To review these plots, I have saved a version of the data with the created variables that is imported below:
# Import data with Weekend and pandemic categorical variables
<- read.csv("bo20_Wknd_Pndmc.csv")
bo20_Wknd_Pndmc head(bo20_Wknd_Pndmc)
Date DayofWk DayofYr Gross_Top10 Pct_Chg_Day Pct_Chg_Wk Num_Releases
1 31-Dec-20 Thursday 366 1543267 -0.96 -0.96 15
2 30-Dec-20 Wednesday 365 1432164 0.01 0.55 14
3 29-Dec-20 Tuesday 364 1414703 0.12 0.51 15
4 28-Dec-20 Monday 363 1267981 -0.34 0.66 15
Num1 Gross_Num1 Date_new Weekend Year Month Day Pandemic
1 News of the World 515100 2020-12-31 No 2020 12 31 Yes
2 The Croods: A New Age 533180 2020-12-30 No 2020 12 30 Yes
3 The Croods: A New Age 520430 2020-12-29 No 2020 12 29 Yes
4 The Croods: A New Age 440330 2020-12-28 No 2020 12 28 Yes
Weekend_Pandemic
1 No - Yes
2 No - Yes
3 No - Yes
4 No - Yes
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
Note that the basic freqpoly plot does include a color option to differentiate between groups:
# basic plot
h <- ggplot(data = bo20_Wknd_Pndmc, mapping = aes(x = Gross_Top10, color = Weekend_Pandemic)) +
  geom_freqpoly()
h
In the final plot, we have:
divided the X variable, Top 10 Gross, by 1000
logged the X-axis and specified the breaks
formatted the axis labels
made the distribution lines thicker
removed the background grid
Interactive Question 4:
In the ‘labs’ option for the freqpoly plot, one line says color = “Weekend - Pandemic”
What does that line do to the plot format?
# Convert data to $K (divide by 1000)
bo20_Wknd_Pndmc <- mutate(bo20_Wknd_Pndmc, Gross_Top10_K = Gross_Top10/1000)
# final plot
h <- ggplot(data = bo20_Wknd_Pndmc,
            mapping = aes(x = Gross_Top10_K, color = Weekend_Pandemic)) +
  geom_freqpoly(size = 1) +
  scale_x_continuous(trans = 'log',
                     breaks = c(0, 1, 10, 100, 1000, 10000, 100000)) +
  labs(x = "Dist. of Gross of Top 10 ($K)",
       y = "Frequency",
       color = "Weekend - Pandemic",
       title = "2020 Daily Top 10 Movie Gross") +
  theme_classic()
h