Lecture 4 - Data Visualizations, Boxplots, Outliers

Penelope Pooler Eisenbies
MAS 261

2023-09-06

Housekeeping

  • Today’s plan 📋

    • Review Question about Measures of Variability

    • A few minutes for R Questions 🪄

    • More about Boxplots

      • Review of summary values and where they are shown
    • Outliers

      • IQR - Inter-quartile Range

      • UL - Upper Limit and LL - Lower Limit to detect outliers

    • In-class Exercises

Review: R and RStudio 🪄

  • Review: You have two options to facilitate your introduction to R and RStudio:

  • If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.

    • We will use Posit Cloud for Quizzes.
  • If you are nervous about coding: Choose Option 2.

  • For both options: I can help with download/install issues during office hours.

  • What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.

  • NOTE: We will use R and RStudio in class during MOST lectures

    • You can use either Posit Cloud or your laptop.

💥 Lecture 4 In-class Exercises - Q1 (Review) 💥

In lecture 3 we discussed the measures of variability, including TSS, variance, standard deviation, CV, and range.

Recall that all of these measures, except range, are closely related and can be calculated from each other.


An online electronics store has a selection of 31 \((n=31)\) different headphones. The sample mean \((\overline{X})\) price is $97.

Recall that \(SD = \sqrt{Var}\) and \(CV = \frac{SD}{\overline{X}}\)

Also recall that the R console (lower left pane) can be used like a calculator.


If the variance in these prices is 1256, what is the coefficient of variation, CV? Round your answer to two decimal places.
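
As a reminder of how the console works as a calculator, here is a minimal sketch of the two formulas above. The numbers are made up for illustration (they are NOT the headphone values), and the names `var_x`, `mean_x`, `sd_x`, and `cv_x` are hypothetical.

```r
# Hypothetical values for illustration only (not the exercise data)
var_x  <- 400   # sample variance
mean_x <- 50    # sample mean

# SD is the square root of the variance
(sd_x <- sqrt(var_x))

# CV is the SD divided by the sample mean
(cv_x <- sd_x / mean_x)

# Round to two decimal places
round(cv_x, 2)
```

Wrapping an assignment in parentheses saves the value and prints it, so each step can be checked as you go.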

Boxplot Review Annotated with Five Number Summary

The technical term is a Box and Whiskers Plot. This version marks where the top whisker ends and shows the circles plotted above it.
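
A minimal sketch of how the values on a boxplot can be inspected in R, using a small made-up vector `x` (not the course data). `boxplot()` draws the plot and also returns the values it used, which are saved here under the hypothetical name `bp`.

```r
# Hypothetical data for illustration only
x <- c(2, 4, 5, 7, 8, 10, 12, 14, 30)

# Summary values: Min, Q1, Median, Mean, Q3, Max
summary(x)

# Draw the boxplot and keep the values it is built from
bp <- boxplot(x, main = "Boxplot of hypothetical data")

# Lower whisker end, lower box edge, median, upper box edge, upper whisker end
bp$stats

# Values drawn as individual circles beyond the whiskers
bp$out
```

For this vector the box edges match Q1 and Q3 from `summary()`. For other sample sizes, the box edges that `boxplot()` draws can differ slightly from the quartiles that `summary()` reports, because the two use slightly different rules.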

Some new terms

Terminology

  • IQR - Interquartile Range \(IQR = Q3 - Q1\)
  • UL - Upper Limit \(UL = Q3 + 1.5\times IQR\)
  • LL - Lower Limit \(LL = Q1 - 1.5\times IQR\)

Notes

  • Values above the Upper Limit are HIGH outliers.
  • Values below the Lower Limit are LOW outliers.
  • Not all datasets have high and/or low outliers.
  • Vermont and Maine data have no low outliers.
  • Lower Limit can be negative even if all data are positive.
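
A short worked example of the three formulas, assuming made-up quartiles Q1 = 6 and Q3 = 16 (hypothetical values, not from the course data). It also illustrates the last note: the lower limit comes out negative even though both quartiles are positive.

```r
# Hypothetical quartiles for illustration only
q1 <- 6
q3 <- 16

# IQR = Q3 - Q1
(iqr_x <- q3 - q1)

# LL = Q1 - 1.5 * IQR
(ll_x <- q1 - 1.5 * iqr_x)

# UL = Q3 + 1.5 * IQR
(ul_x <- q3 + 1.5 * iqr_x)
```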

Steps for Determining Outliers

  1. It is useful, but not required, to look at a boxplot (or histogram) to examine the data distribution.

    • We will introduce histograms at the end of this lecture.

    • Data visualizations can indicate if there are high or low outliers present.

  2. Find Q1 (25th Percentile) and Q3 (75th Percentile)

    • In R these values are found using the summary command.
  3. Calculate IQR, the Interquartile Range, \(IQR = Q3 - Q1\)

  4. Calculate the LL, Lower Limit and UL, Upper Limit for determining outliers.

    • \(LL = Q1 - 1.5\times IQR\)

    • \(UL = Q3 + 1.5\times IQR\)

  5. Examine values in sorted data to determine which values are

    • HIGH Outliers, values above the UL

    • LOW Outliers, values below the LL
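
The steps above can be sketched as one small function. This is only an illustration under stated assumptions: `find_outliers` is a hypothetical helper (not a built-in R function), the data are made up, and `quantile()` with its default settings is used for Q1 and Q3 in place of reading them from `summary()`.

```r
# Hypothetical helper: applies steps 2-5 to a numeric vector
find_outliers <- function(x) {
  x <- sort(x)                              # sort() also drops missing values
  q1 <- quantile(x, 0.25, names = FALSE)    # step 2: Q1
  q3 <- quantile(x, 0.75, names = FALSE)    # step 2: Q3
  iqr <- q3 - q1                            # step 3: IQR
  ll <- q1 - 1.5 * iqr                      # step 4: Lower Limit
  ul <- q3 + 1.5 * iqr                      # step 4: Upper Limit
  list(q1 = q1, q3 = q3, iqr = iqr, ll = ll, ul = ul,
       low = x[x < ll],                     # step 5: LOW outliers
       high = x[x > ul])                    # step 5: HIGH outliers
}

# Made-up data for illustration only
res <- find_outliers(c(21, 1, 25, 15, 60, 18, 23, 20, 26))
res$ll
res$ul
res$low
res$high
```

Returning the limits together with the outliers makes it easy to check each step by hand against the formulas.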

Boxplots of Lifetime Gross for All Three Markets

💥 Lecture 4 In-class Exercises - Q2 💥

Based on the boxplots below, which market, domestic or foreign, has a higher median for the lifetime gross data?

A. Domestic

B. Foreign

C. The median values for these two markets are approximately equal.

💥 Lecture 4 In-class Exercises - Q3 💥


Which statement(s) are true about all three movie gross markets?


A. There are no outliers in these data.

B. There are only low outliers in these data.

C. There are only high outliers in these data.

D. There are low and high outliers in these data.

Calculations to Determine Outliers in Domestic Data

NOTE: Each saved calculation is enclosed in parentheses so that R both saves the value and displays it.

  1. Use the summary command to find Q1 (25th Percentile or Quantile) and Q3 (75th Percentile or Quantile).
summary(movie_gross$domestic_gross_mil)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.12  218.60  301.43  318.49  380.70  936.66       2 
  2. Calculate \(IQR = Q3 - Q1\) for domestic data and save the calculation.
(iqr_dom = 380.70 - 218.60)
[1] 162.1
  3. Calculate and save calculations for LL (Lower Limit) and UL (Upper Limit) for domestic data.
(ll_dom = 218.60 - 1.5*iqr_dom)
[1] -24.55
(ul_dom = 380.70 + 1.5*iqr_dom)
[1] 623.85
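
The summary output above reports 2 NA's. As an alternative to retyping Q1 and Q3, the quartiles can be pulled out directly with `quantile()`, which needs `na.rm = TRUE` when missing values are present. A minimal sketch, assuming a small made-up vector `gross` that stands in for the domestic data (the real values are not reproduced here); the names `q`, `iqr_g`, `ll_g`, and `ul_g` are hypothetical.

```r
# Hypothetical stand-in for movie_gross$domestic_gross_mil, including two NA's
gross <- c(0.5, 120, 210, NA, 250, 300, 310, 390, NA, 420, 900)

# Q1 and Q3; without na.rm = TRUE, quantile() stops with an error on NA's
q <- quantile(gross, c(0.25, 0.75), na.rm = TRUE, names = FALSE)

# IQR, Lower Limit, and Upper Limit, saved and displayed
(iqr_g <- q[2] - q[1])
(ll_g <- q[1] - 1.5 * iqr_g)
(ul_g <- q[2] + 1.5 * iqr_g)

# Built-in shortcut for the IQR, shown as a cross-check
IQR(gross, na.rm = TRUE)
```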

Final Step to determine outliers in Domestic Data

Examine domestic data to determine if there are

  • HIGH Outliers - values above the upper limit (UL = ul_dom)
  • LOW Outliers - values below the lower limit (LL = ll_dom)
ul_dom
[1] 623.85
movie_gross$domestic_gross_mil |> 
  sort(decreasing = TRUE) |> head(15) |> cbind()
        [,1]
 [1,] 936.66
 [2,] 858.37
 [3,] 814.12
 [4,] 785.22
 [5,] 718.73
 [6,] 700.43
 [7,] 684.08
 [8,] 678.82
 [9,] 674.29
[10,] 653.41
[11,] 623.36
[12,] 620.18
[13,] 608.58
[14,] 574.15
[15,] 543.64
ll_dom
[1] -24.55
movie_gross$domestic_gross_mil |> 
  sort(decreasing = TRUE) |> tail(15) |> cbind()
        [,1]
 [1,] 160.89
 [2,] 159.56
 [3,] 149.26
 [4,] 145.96
 [5,] 144.17
 [6,] 142.61
 [7,] 137.72
 [8,] 130.17
 [9,] 124.99
[10,]   5.97
[11,]   3.70
[12,]   2.72
[13,]   1.54
[14,]   0.34
[15,]   0.12
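
Reading the sorted values by eye works well for a short list. The same check can also be done with logical comparisons. A minimal sketch, assuming a made-up vector `gross` (NOT the domestic data, so the counts below do not answer the exercise questions); all object names here are hypothetical.

```r
# Hypothetical data for illustration only, including two NA's
gross <- c(1, 150, 200, 220, NA, 260, 300, 320, 350, 380, 400, 690, NA, 700, 950)

# Limits computed from Q1 and Q3 of the made-up data
q <- quantile(gross, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
ll_g <- q[1] - 1.5 * (q[2] - q[1])
ul_g <- q[2] + 1.5 * (q[2] - q[1])

# Count values above the upper limit and below the lower limit
(n_high <- sum(gross > ul_g, na.rm = TRUE))
(n_low <- sum(gross < ll_g, na.rm = TRUE))

# Lowest value that is still a HIGH outlier; which() skips the NA's
(lowest_high <- min(gross[which(gross > ul_g)]))
```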

💥 Lecture 4 In-class Exercises - Q4, Q5, & Q6 💥

  • Q4: How many LOW outliers are in the domestic gross data?

    • Hint: This confirms what we saw in the boxplot data visualization.


  • Q5: How many HIGH outliers are in the domestic gross data?


  • Q6: What is the LOWEST value in the domestic gross data that is also a HIGH outlier?

Finding Outliers in the Foreign Gross Data

Instructions

    1. Use the previous example and the summary command to find Q1 and Q3 for the foreign gross data.
    2. Find the IQR, LL, and UL for the foreign gross data.
    3. Examine the sorted foreign gross data to determine how many high outliers are present.

💥 Lecture 4 In-class Exercises - Q7, Q8, Q9, Q10 💥

  • Q7: What is the Inter-quartile range (IQR) of the foreign gross data?


  • Q8: What is the upper limit (UL) for the foreign gross data?


  • Q9: How many HIGH outliers are in the foreign gross data?


  • Q10: What is the LOWEST value in the foreign gross data that is also a HIGH outlier?

Another way to look at data - Histograms

  • Boxplots and side-by-side boxplots are great for comparing the central tendency and variability of two or more groups of data.

  • Another tool for examining the entire distribution of values is a histogram.
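
A minimal sketch of a histogram in base R, using a small made-up vector `x` (not the course data). The bin boundaries passed to `breaks` are an arbitrary choice for this illustration, and `h` is a hypothetical name for the saved result.

```r
# Hypothetical data for illustration only
x <- c(3, 5, 6, 9, 11, 14, 16, 20, 45)

# Draw the histogram with bins of width 10 and keep the values it used
h <- hist(x, breaks = seq(0, 50, by = 10),
          main = "Histogram of hypothetical data", xlab = "Value")

# Number of values that fall in each bin
h$counts
```

The single bar far to the right of the others corresponds to the value 45, which is the kind of point a boxplot would draw as a circle.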

Looking Ahead

  • Next lecture we’re going to examine both categorical and quantitative data.

  • For categorical data, we’ll look at

    • Frequency Tables and Terminology

    • Bar Charts

    • Pie Charts

  • For quantitative data we’ll talk more about histograms

    • How are they created by the software?

    • How can we modify what the software is doing to better understand the data?

    • What does the shape of the histogram tell us? For example, are the data

      • left-skewed?
      • right-skewed?
      • normally distributed?

Key Points from Today

  • More about Boxplots

  • Identifying outliers visually from a boxplot

  • Identifying outliers numerically

    • Defining and calculating the Inter-quartile Range (IQR)

    • Defining upper and lower limits for determining outliers

    • Using lower and upper limits to identify outliers

  • Introduction to histograms


To submit an Engagement Question or Comment about material from Lecture 4: Submit by midnight today (day of lecture). Click on the link under Lecture 4.