BUA 455 - Lecture 11

2021-04-27

Announcements


New Markdown/HTML format

My discussions with students have confirmed my experience.

It is difficult for students (and me) to switch between PowerPoint, R, and the Interactive Questions.

To simplify lectures, I will keep everything in R Markdown and provide script files.

To that end, I spent some time experimenting with new HTML styles for lecture presentations.

This one, called downcute, has a ‘dark’ option that may be easier on the eyes.

What do you think?


Projects:

As mentioned in Lecture 10, if you submitted your project survey, you have been assigned to a project group.

Please reach out to your group and spend some time before Thursday, 3/25, finding 2 or 3 candidate data sources.

Data criteria:

  • Exportable to Excel,
    or copy friendly to Excel, e.g., BoxOfficeMojo,
    or able to be directly imported into R, e.g. Yahoo Finance

  • Has at least one component that is currently being updated (weekly or more often)

  • Interesting to you as students


Quiz 1 is on Thursday, 3/18 during class time.

  • 80% of quiz will be timed during lecture (50 min. over Zoom)

    • Create a subset of the flights data as specified

    • Answer questions about that subset

    • Create 1 or 2 basic plots (not formatted)

    • Export data summaries

    • Submit R script file (with basic comments)

  • 20% will be take-home (Due Friday 3/19/21 at midnight)

    • Creating a mini-table data set (R Markdown not required)

    • Format plot from quiz to be presentation ready (axes, etc.)

    • Answering a question or two in complete sentences.

    • Submit R script file of take-home work

  • R Markdown skills will not be on Quiz 1, but may be on Quiz 2


Review for Quiz 1

  • Concepts from basic statistics course (MAS 261 or equivalent)

    • CV = Standard Deviation/Sample Mean

    • CV is Coefficient of Variation

    • CV is better for comparing variability between groups with different means.

    • Standard deviation = square root of variance

    • In R: sd(…) or sqrt(var(…))

  • Doing basic calculations in R

    • Recall Lectures 1 and 2 and HW 2

    • Data from a variable were summarized to find a probability:

    • IRRITATING SIDE NOTE: R Markdown does not recognize data imported as an .rds file.

# Recall that All_Scores simulated 10000 'first hands' of Blackjack.
# Import the .csv file because .rds doesn't work with R Markdown
All_Scores <- read.csv("All_Scores.csv")

# remind yourself what this calculation is finding:

# create a variable with a shorter name
all <- All_Scores$All_Scores

# calculate probability
round(length(all[all == 21])/length(all), 2)
[1] 0.05
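
The CV formula from the statistics review above can be sketched in R as follows. The two vectors and the helper name cv are made up for illustration and are not from the lectures; only sd(…), mean(…), sqrt(…), and var(…) come from the course material.

```r
# hypothetical data: two groups with very different means
group_a <- c(10, 12, 14, 16, 18)
group_b <- c(100, 110, 120, 130, 140)

# hypothetical helper: CV = standard deviation / sample mean
cv <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# standard deviation is the square root of the variance
sd(group_a)
sqrt(var(group_a))

# compare raw variability (sd) with relative variability (CV)
sd(group_a)
sd(group_b)
round(cv(group_a), 3)
round(cv(group_b), 3)
```

In this made-up example, group_b has the larger standard deviation, but group_a has the larger CV, which is why CV is the better comparison when the group means differ.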

  • Useful Base R commands from Lectures 1 - 9 (in alphabetical order):
* indicates na.rm=TRUE is needed if there are missing values in data

    R Function       Description
    as.numeric(…)    forces values to be treated as numeric
    c(…)             concatenates
    data.frame(…)    creates a data frame with input variables
    dim(…)           outputs number of rows and number of columns of a data frame
    head(…)          shows first 6 obs. by default
    ifelse(…)        creates a new two category variable based on test in first input
    length(…)        outputs how many values are in a vector
  * mean(…)          calculates the mean of a vector
  * median(…)        calculates the median of a vector
    rbind(…)         row binds or stacks values
    rep(…)           replicates or repeats specified value or object
    round(…)         rounds to decimal precision specified
    row.names(…)     outputs the row names of a data set
    sample(…)        samples a vector
  * sd(…)            calculates the sample standard deviation
    sqrt(…)          calculates the square root
    str(…)           examines structure of a data frame or tibble
  * sum(…)           sums values
    summary(…)       outputs numerical summary values
    tail(…)          shows last 6 obs. by default
  * var(…)           calculates the sample variance
    which.min(…)     identifies obs. or row number of min. value
    which.max(…)     identifies obs. or row number of max. value

In the previous (updated) list, please notice the caption.

If a vector contains missing values, R commands like mean will return NA unless the option na.rm=TRUE is included:

We will talk more about NAs after Quiz 1.

NAs have some unique characteristics in R.

# numeric vector with missing values
example <- c(1, 5, 7, NA, 3, 19, NA, -2)

# attempt to calculate mean without removing missing values will not output a
# value
mean(example)
[1] NA
# calculate mean after removing missing values outputs a value
mean(example, na.rm = TRUE)
[1] 5.5
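
As a small preview of those characteristics, here is a sketch using the same example vector. The functions is.na(…), sum(…), and length(…) are Base R; the variable names on the left of each assignment are chosen only for this illustration.

```r
# numeric vector with missing values (same values as above)
example <- c(1, 5, 7, NA, 3, 19, NA, -2)

# is.na() flags each missing value with TRUE
missing <- is.na(example)
missing

# count the missing values (each TRUE counts as 1 when summed)
n_missing <- sum(missing)
n_missing

# length() still counts the NAs as values
n_total <- length(example)
n_total

# comparing anything to NA gives NA, so test with is.na() instead of == NA
na_compare <- NA == NA
na_compare
```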

Also, which.max and which.min give the same result as code you have already seen (as long as the max. or min. value is not tied):

# create mini-data set from High and Low temps 3/8 - 3/14
Day <- c("Mon.", "Tue.", "Wed.", "Thu.", "Fri.", "Sat.", "Sun.")
High <- c(37, 49, 63, 72, 58, 40, 39)
Low <- c(8, 28, 26, 49, 30, 25, 23)

# create and display data frame
Last_Wk_df <- data.frame(Day, High, Low)
Last_Wk_df
   Day High Low
1 Mon.   37   8
2 Tue.   49  28
3 Wed.   63  26
4 Thu.   72  49
5 Fri.   58  30
6 Sat.   40  25
7 Sun.   39  23
# create tibble (fancy modern data frame) from data frame and display it
# as_tibble() requires the tibble package, which is loaded with the tidyverse
library(tidyverse)
Last_Wk_tb <- as_tibble(Last_Wk_df)
Last_Wk_tb
# A tibble: 7 x 3
  Day    High   Low
  <chr> <dbl> <dbl>
1 Mon.     37     8
2 Tue.     49    28
3 Wed.     63    26
4 Thu.     72    49
5 Fri.     58    30
6 Sat.     40    25
7 Sun.     39    23
# structure of data frame
str(Last_Wk_df)
'data.frame':   7 obs. of  3 variables:
 $ Day : chr  "Mon." "Tue." "Wed." "Thu." ...
 $ High: num  37 49 63 72 58 40 39
 $ Low : num  8 28 26 49 30 25 23
# structure of tibble
str(Last_Wk_tb)
tibble [7 x 3] (S3: tbl_df/tbl/data.frame)
 $ Day : chr [1:7] "Mon." "Tue." "Wed." "Thu." ...
 $ High: num [1:7] 37 49 63 72 58 40 39
 $ Low : num [1:7] 8 28 26 49 30 25 23
# Example of which.max using data frame
Last_Wk_df[which.max(Last_Wk_df$High), ]
   Day High Low
4 Thu.   72  49
# Notice which.max is identical in function to this code from HW 3
Last_Wk_df[Last_Wk_df$High == max(Last_Wk_df$High), ]
   Day High Low
4 Thu.   72  49
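
One caveat: the two approaches match only when the maximum is unique. The sketch below uses a made-up three-day data set (not Last_Wk_df) with a tied high temperature to show the difference; the object names temps, first_max, and all_max are chosen only for this illustration.

```r
# hypothetical mini-data set in which two days share the highest High
temps <- data.frame(Day = c("Mon.", "Tue.", "Wed."),
                    High = c(70, 72, 72))

# which.max returns only the FIRST position of the maximum
first_max <- temps[which.max(temps$High), ]
first_max

# the logical test returns EVERY row that equals the maximum
all_max <- temps[temps$High == max(temps$High), ]
all_max
```

With no ties, as in Last_Wk_df, both versions return the same single row.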

Interactive Question 1:

What is the correct R code to find the day of the lowest low (min. of Low) using which.min in the Last_Wk_df data set?


  • Common Operators Used in R:
    R Operator    Description
    <-            assign
    (…)           round parentheses are used for function inputs
    […]           square brackets are used for subsetting
    {…}           curly brackets are used for loops and functions
    %in%          finds elements in or belonging to
    *             multiply by
    /             divide by
  • Logical Operators in R
    R Logic Operator    Description
    <                   less than
    <=                  less than or equal to
    >                   greater than
    >=                  greater than or equal to
    ==                  exactly equal to
    !=                  not equal to
    !x                  not x
    x|y                 x OR y
    x&y                 x AND y
    isTRUE(x)           test if x is TRUE
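
To see several of these operators working together, here is a short sketch on a made-up numeric vector. The vector x and the result names are assumptions for illustration only.

```r
# hypothetical vector of values
x <- c(3, 8, 15, 22)

# AND: TRUE only where both tests are TRUE
both <- x > 5 & x < 20

# OR: TRUE where at least one test is TRUE
either <- x < 5 | x > 20

# not equal to
not_15 <- x != 15

# %in% tests whether each value belongs to a set
in_set <- x %in% c(8, 22)

# a logical vector inside square brackets subsets the data
x[both]
x[in_set]
```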

Interactive Question 2:

What values would appear if I submitted the following R code?

Last_Wk_df[5, c(1,2)]


Starting in Lecture 6, we moved on to importing, cleaning, and managing data.

  • Data can be imported directly from sources like Yahoo Finance using the quantmod package.

  • R command to import data from Yahoo Finance is getSymbols(…)

  • Due to international complications with data sourcing:

    • importing data directly from Yahoo Finance will not be tested

    • students should STILL know how to do this (It could be a short answer or matching question)

    • students are welcome to use these data for their projects.

    • Once data are downloaded to R, students from any country can access them.

  • Data can also be imported or copied into Excel and saved as a .csv file.

    • Best practice is to do “as little as possible” to clean the data in Excel

    • The more you can do with R code, the more easily you can reproduce the steps when data are updated.

    • Lecture 6 and HW 3 provided an introduction to common data cleaning tasks

    • These steps can be done with Base R, or tidyverse commands

    • Some tidyverse commands and Base R commands may conflict or be incompatible
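
As a rough illustration of cleaning with R code instead of Excel, the sketch below writes a tiny made-up .csv file and then cleans it with Base R. The file contents, column names, and the specific cleaning steps are assumptions chosen for this example; they are not the exact steps from Lecture 6 or HW 3.

```r
# write a tiny hypothetical "raw" file so the example is self-contained
raw_path <- tempfile(fileext = ".csv")
writeLines(c('Date,Gross',
             '12/31/2020,"1,543,267"',
             '12/30/2020,"1,432,164"'),
           raw_path)

# import the file; Gross is read as text because of the commas
raw <- read.csv(raw_path)
str(raw)

# remove the commas and force the values to be treated as numeric
raw$Gross <- as.numeric(gsub(",", "", raw$Gross))

# convert the text dates to a date variable
raw$Date_new <- as.Date(raw$Date, format = "%m/%d/%Y")

str(raw)
```

Because every step is code, the same script can be rerun without manual edits when an updated file is downloaded.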


Before talking about plots (Lectures 7 and 9), we’ll review Lecture 8:

  • The types of data

  • The modern “verbs” of data management using the ‘dplyr’ package


The str(…) command, which we have talked about previously and used in HW 3, allows us to examine the structure of the data and the variables.

Here is a table (from Lecture 8) showing the variable types defined in R.

Variable Type Abbreviations in R
    Abbreviation     Definition
    int              integers
    dbl (num)        doubles or real numbers (can be decimal; called num in Base R)
    chr              character or text strings (can be used as a categorical variable)
    dttm             date-time (a date + a time)
    lgl              logical (TRUE or FALSE)
    fctr (Factor)    factor (categorical variable)
    date             date

Note that the terms in parentheses in the Abbreviation column are the names used for these variable types in Base R.
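
To see these types in practice, here is a minimal sketch that builds a made-up data frame with one variable of most types and examines it with str(…). The data frame types_df and its column names are hypothetical.

```r
# hypothetical data frame with one variable of each common type
types_df <- data.frame(
  count = c(1L, 2L, 3L),                      # integer (the L suffix forces integer)
  price = c(1.5, 2.25, 3),                    # double / num
  name = c("a", "b", "c"),                    # character
  flag = c(TRUE, FALSE, TRUE),                # logical
  group = factor(c("x", "y", "x")),           # factor
  day = as.Date(c("2021-03-08", "2021-03-09", "2021-03-10")),  # date
  stringsAsFactors = FALSE
)

# examine the structure: each line shows the variable name and its type
str(types_df)

# class() reports the Base R type name for a single variable
class(types_df$price)
class(types_df$group)
```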


Details about each function are in Chapter 5 of R for Data Science
    Function       Use
    filter()       Pick (subset) observations by their value
    arrange()      Reorder the rows
    select()       Pick variables by their names
    rename()       Rename a variable in a data set
    mutate()       Create new variables with functions of existing variables
    summarise()    Collapse many values down to a single summary
    group_by()     Used in conjunction with these functions to change scope, e.g., by category

Lecture 8 provides an example of each of these functions and demonstrates how to use group_by(…) and summarize(…) together.

HW 4 also provides some practice using these commands.
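
As a quick refresher, here is a minimal sketch of mutate, group_by, summarise, and arrange working together, using the temperature values from the toy data set. The Range and Weekend variables and the object names temps and by_wknd are assumptions added for this illustration, not part of the lecture data.

```r
library(dplyr)

# same values as the toy temperature data set
temps <- data.frame(
  Day = c("Mon.", "Tue.", "Wed.", "Thu.", "Fri.", "Sat.", "Sun."),
  High = c(37, 49, 63, 72, 58, 40, 39),
  Low = c(8, 28, 26, 49, 30, 25, 23)
)

# mutate: create new variables from existing ones
temps <- mutate(temps,
                Range = High - Low,
                Weekend = ifelse(Day %in% c("Sat.", "Sun."), "Yes", "No"))

# group_by + summarise: one summary row per category,
# then arrange the rows from highest to lowest mean high
by_wknd <- temps %>%
  group_by(Weekend) %>%
  summarise(n_days = n(), mean_high = mean(High)) %>%
  arrange(desc(mean_high))

by_wknd
```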

Interactive Question 3.

Using our “toy” data set, Last_Wk_df, create a new data set named Last_Wk_Lows that only includes ‘Day’ and ‘Low’ and omits ‘High’

There are two ways to do this using one of the functions above.


Lectures 7 and 9 focused on ggplot and examining data for data management:

  • There are many, MANY types of plots.

  • We covered three plots that are useful for exploring data.

    • Scatterplot - geom_point

    • Histogram - geom_histogram

    • Frequency Distribution(s) - geom_freqpoly


Scatterplot - geom_point

During a timed quiz, you should be able to create the most basic scatterplot:

# import data from Lecture 7
cred <- read.csv("credit_scores.csv")

# Basic scatterplot
p <- ggplot(data = cred, mapping = aes(x = LIMIT, y = DEBT)) + 
    geom_point()

p


In Lecture 7, we covered some minimal data management for plotting data and iteratively improved this scatterplot to end up with this:

# suppress scientific notation
options(scipen = 100)

# Divide our LIMIT (credit limit) and DEBT variables by 1000
cred$Limit_k <- cred$LIMIT/1000
cred$Debt_k <- cred$DEBT/1000

# exclude Divorced/Widow due to small sample size (both commands achieve the
# same goal)
cred1 <- filter(cred, MARITAL != "Divorced/Widow")
cred1 <- cred[cred$MARITAL %in% c("Married", "Single"), ]

# Final Plot with facets:
p <- ggplot(data = cred1, mapping = aes(x = Limit_k, y = Debt_k)) + 
    # alpha is a constant, so it is set outside aes()
    geom_point(mapping = aes(size = LATE, color = GENDER), alpha = 0.1) + 
    scale_color_manual(values = c("steelblue1", "lightgreen")) + 
    labs(title = "Credit Limit vs Debt", x = "Credit Limit ($K)", y = "Credit Card Debt ($K)") + 
    facet_wrap(facets = c("DEFAULT", "MARITAL")) + 
    theme_classic()

p


Histogram - geom_histogram

During a timed quiz, you should also be able to create the most basic histogram:

# basic histogram

h <- ggplot(data = cred1, aes(x = Debt_k)) + 
    geom_histogram()

h


In Lecture 7, we numerically summarized the debt load of those who defaulted to find the median as a reference:

# use filter to subset data to those who defaulted
cred_default <- filter(cred1, DEFAULT == "Yes")

# numerically summarize debt distribution of those who defaulted
round(summary(cred_default$Debt_k), 1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    10.8   114.4   258.1   309.3  2691.5 
# display numerical summary as a column (useful for tables)
cbind(round(summary(cred_default$Debt_k), 1))
          [,1]
Min.       0.0
1st Qu.   10.8
Median   114.4
Mean     258.1
3rd Qu.  309.3
Max.    2691.5

In Lecture 7, we also log transformed the x-axis of the histogram, not the data, and adjusted the debt variable.

Debt was adjusted by adding 1 so that 0 values (those with no debt) are still 0 on the natural log scale, since LN(1) = 0.

Recall that a log transformation was necessary because the data were highly skewed.

Knowing how and why to transform an axis is an important aspect of data management.
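
The reason for the adjustment can be checked with a few lines of Base R. The debt values below are made up for illustration (they loosely echo the summary values shown earlier and are not the actual data).

```r
# hypothetical debt values in $K, including someone with no debt
debt_k <- c(0, 10.8, 114.4, 309.3, 2691.5)

# the natural log of 0 is -Inf, so a zero cannot be placed on a log axis
log(0)

# adding 1 before taking the log keeps the zero-debt value at 0
debt_adj <- round(debt_k + 1)
log_debt <- log(debt_adj)
round(log_debt, 2)

# the raw values span a far wider range than the logged values,
# which is why the log scale makes the skewed distribution easier to see
range(debt_k)
round(range(log_debt), 2)
```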

# final histogram of credit card default data

# add 1 to debt (in $K) so that zero debt is 0 on the log scale: log(1) = 0
cred1$debtk_adj <- round(cred1$Debt_k + 1)

# log-scaled x axis with breaks specified; dashed line marks the median debt
# of those who defaulted
h <- ggplot(data = cred1, aes(x = debtk_adj)) + 
    geom_histogram(color = "darkblue", fill = "lightblue") + 
    scale_x_continuous(trans = "log", breaks = c(10, 100, 1000, 3000)) + 
    geom_vline(xintercept = 114.43, linetype = "dashed", size = 1, col = "darkred") + 
    labs(title = "Distribution of Credit Card Debt", x = "Debt ($K)", y = "Frequency", 
        caption = "50% of people who defaulted had a debt of $114K or greater") + 
    theme_classic()

h


Lastly, in Lecture 9 (and HW 4) we covered:

  • parsing text using separate(…) to create new variables

  • joining text using mutate(…) and paste(…) to create new variables

  • plotting data by category using geom_freqpoly

  • Study suggestions

    • Review R code carefully from HW 4
    • Make sure you know how to parse and join text
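
As a refresher on parsing and joining text, here is a minimal sketch using separate(…) from tidyr and mutate(…) with paste(…). The two-row data set bo_mini is made up, modeled on the box office data; the exact steps used in Lecture 9 and HW 4 may differ.

```r
library(dplyr)
library(tidyr)

# hypothetical two-row data set modeled on the box office data
bo_mini <- data.frame(
  Date = c("31-Dec-20", "30-Dec-20"),
  Weekend = c("No", "No"),
  Pandemic = c("Yes", "Yes")
)

# parse: split the text date at each hyphen into three new variables
# (remove = FALSE keeps the original Date variable)
bo_mini <- separate(bo_mini, Date,
                    into = c("Day", "Month", "Year"),
                    sep = "-", remove = FALSE)

# join: paste two text variables together to create a combined category
bo_mini <- mutate(bo_mini,
                  Weekend_Pandemic = paste(Weekend, Pandemic, sep = " - "))

bo_mini
```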

To review these plots, I have saved a version of the data with the created variables that is imported below:

# Import data with Weekend and pandemic categorical variables
bo20_Wknd_Pndmc <- read.csv("bo20_Wknd_Pndmc.csv")
head(bo20_Wknd_Pndmc)
       Date   DayofWk DayofYr Gross_Top10 Pct_Chg_Day Pct_Chg_Wk Num_Releases
1 31-Dec-20  Thursday     366     1543267       -0.96      -0.96           15
2 30-Dec-20 Wednesday     365     1432164        0.01       0.55           14
3 29-Dec-20   Tuesday     364     1414703        0.12       0.51           15
4 28-Dec-20    Monday     363     1267981       -0.34       0.66           15
                   Num1 Gross_Num1   Date_new Weekend Year Month Day Pandemic
1     News of the World     515100 2020-12-31      No 2020    12  31      Yes
2 The Croods: A New Age     533180 2020-12-30      No 2020    12  30      Yes
3 The Croods: A New Age     520430 2020-12-29      No 2020    12  29      Yes
4 The Croods: A New Age     440330 2020-12-28      No 2020    12  28      Yes
  Weekend_Pandemic
1         No - Yes
2         No - Yes
3         No - Yes
4         No - Yes
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

Note that the basic freqpoly plot does include the color option to differentiate between groups:

# basic plot
h <- ggplot(data = bo20_Wknd_Pndmc, mapping = aes(x = Gross_Top10, color = Weekend_Pandemic)) + 
    geom_freqpoly()
h


In the final plot, we have:

  • divided the X variable, Top 10 Gross, by 1000

  • logged the X-axis and specified the breaks

  • formatted the axis labels

  • made the distribution lines thicker

  • removed the background grid

Interactive Question 4:

In the ‘labs’ option for the freqpoly plot, one line says color = “Weekend - Pandemic”

What does that line do to the plot format?

# Convert data to $K (divide by 1000)
bo20_Wknd_Pndmc <- mutate(bo20_Wknd_Pndmc, Gross_Top10_K = Gross_Top10/1000)

# final plot
h <- ggplot(data = bo20_Wknd_Pndmc, 
            mapping = aes(x = Gross_Top10_K, color = Weekend_Pandemic)) + 
    geom_freqpoly(size = 1) + 
    # breaks must be positive on a log scale
    scale_x_continuous(trans = "log", 
                       breaks = c(1, 10, 100, 1000, 10000, 100000)) + 
    labs(x = "Dist. of Gross of Top 10 ($K)", 
         y = "Frequency", 
         color = "Weekend - Pandemic", 
         title = "2020 Daily Top 10 Movie Gross") + 
    theme_classic()
 
h