MAS 261 - Lecture 4

Data Visualizations, Boxplots, Outliers

Penelope Pooler Eisenbies

2024-09-05

Housekeeping

  • Today’s plan 📋

    • Review Question about Measures of Variability

    • A few minutes for R Questions 🪄

    • More about Boxplots

      • Review of summary values and where they are shown
    • Outliers

      • IQR - Interquartile Range

      • UL (Upper Limit) and LL (Lower Limit) to detect outliers

    • In-class Exercises

R and RStudio

  • In this course we will use R and RStudio to understand statistical concepts.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access through the links provided.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I will demo how to download completed work so that you can use this allotment efficiently.

    • For those who want to go further with R/RStudio:

      • After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

💥Lecture 4 In-class Exercises - Q1 💥

Session ID: MAS261f24

In Lecture 3 we discussed the measures of variability, including TSS, variance, standard deviation, CV, and range.

Recall that all of these measures, except range, are closely related and can be calculated from each other.


An online electronics store has a selection of 31 \((n=31)\) different headphones. The sample mean \((\overline{X})\) price is $97.

Recall that \(SD = \sqrt{Var}\) and \(CV = \frac{SD}{\overline{X}}\).

Also recall that the R console (lower left pane) can be used like a calculator.


If the variance in these prices is 1256, what is the coefficient of variation, CV? Round answer to two decimal places.
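
As a sketch of how the console can work as a calculator for this kind of question, the lines below apply the same two formulas to made-up values (a variance of 400 and a mean of 50, which are not the headphone data). The object names are illustrative only.

```r
# Hypothetical values for illustration only (not the headphone data)
var_x  = 400   # sample variance
mean_x = 50    # sample mean

(sd_x = sqrt(var_x))      # SD = sqrt(Var)
(cv_x = sd_x / mean_x)    # CV = SD / mean

round(cv_x, 2)            # rounded to two decimal places
```

Swapping in the variance and mean from the question gives the answer for the headphone prices.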

Boxplot Review

Annotated with Five Number Summary

Boxplot Review and More Information

The technical term is a box-and-whisker plot. This version shows where the top whisker ends and draws circles for the values above it.
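
A minimal sketch of drawing such a plot in base R, using a made-up vector (the values and object names are assumptions, not course data). `boxplot()` also returns the numbers it plotted, so the whisker ends and the circled points can be checked. Note that `boxplot()` bases the box on hinges, which for some sample sizes can differ slightly from the Q1 and Q3 reported by `summary()`.

```r
# Made-up sample for illustration
x = c(2, 4, 6, 8, 10, 12, 14, 16, 50)

# Draw the boxplot and keep the values R used to draw it
b = boxplot(x, ylab = "Value", main = "Boxplot of made-up values")

b$stats   # lower whisker end, box bottom, median, box top, upper whisker end
b$out     # values drawn as circles beyond the whiskers
```

Here the top whisker stops at 16, the largest value that is not flagged, and 50 is drawn as a circle.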

Some new terms

Terminology

  • IQR - Interquartile Range \(IQR = Q3 - Q1\)
  • UL - Upper Limit \(UL = Q3 + 1.5\times IQR\)
  • LL - Lower Limit \(LL = Q1 - 1.5\times IQR\)

Notes

  • Values above the Upper Limit are HIGH outliers.
  • Values below the Lower Limit are LOW outliers.
  • Not all datasets have high and/or low outliers.
  • Vermont and Maine data have no low outliers.
  • The Lower Limit can be negative even if all data are positive.
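
To illustrate the last note, here is a small sketch that computes IQR, LL, and UL from two made-up quartiles. The helper `outlier_limits()` is hypothetical (it is not a built-in R function), and the quartile values are chosen only for easy arithmetic.

```r
# Hypothetical helper (not part of base R): limits from the two quartiles
outlier_limits = function(q1, q3) {
  iqr = q3 - q1
  c(IQR = iqr, LL = q1 - 1.5 * iqr, UL = q3 + 1.5 * iqr)
}

# Made-up quartiles: both are positive, yet the lower limit is negative
(lims = outlier_limits(q1 = 10, q3 = 30))
```

With Q1 = 10 and Q3 = 30, the IQR is 20, so LL = 10 - 30 = -20 and UL = 30 + 30 = 60.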

Steps for Determining Outliers

  1. It is useful, but not required, to look at a boxplot (or histogram) first to see the data distribution.

    • We will introduce histograms at the end of this lecture.

    • Data visualizations can indicate if there are high or low outliers present.

  2. Find Q1 (25th Percentile) and Q3 (75th Percentile)

    • In R these values are found using the summary command.
  3. Calculate IQR, the Interquartile Range, \(IQR = Q3 - Q1\)

  4. Calculate the LL, Lower Limit and UL, Upper Limit for determining outliers.

    • \(LL = Q1 - 1.5\times IQR\)

    • \(UL = Q3 + 1.5\times IQR\)

  5. Examine values in sorted data to determine which values are

    • HIGH outliers, values above the UL

    • LOW outliers, values below the LL
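
The steps above can be sketched in R on a small made-up sample (the vector and object names are assumptions for illustration). This sketch uses `quantile()` rather than reading Q1 and Q3 off the `summary` output; by default both use the same method, so the quartiles agree.

```r
# Made-up sample for illustration
x = c(2, 4, 6, 8, 10, 12, 14, 16, 50)

# Step 2: Q1 and Q3
q1 = quantile(x, 0.25, names = FALSE)
q3 = quantile(x, 0.75, names = FALSE)

# Step 3: interquartile range
iqr = q3 - q1

# Step 4: lower and upper limits
(ll = q1 - 1.5 * iqr)
(ul = q3 + 1.5 * iqr)

# Step 5: look at the sorted data and pick out values beyond the limits
sort(x)
(high_outliers = x[x > ul])
(low_outliers  = x[x < ll])
```

For this sample Q1 = 6 and Q3 = 14, so LL = -6 and UL = 26. The value 50 is a high outlier and there are no low outliers.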

Boxplots - Lifetime Gross for All 3 Markets

💥Lecture 4 In-class Exercises - Q2 💥

Session ID: MAS261f24

Based on the boxplots below, which market, domestic or foreign, has a higher median for the lifetime gross data?

A. Domestic

B. Foreign

C. The median values for these two markets are approximately equal.

💥Lecture 4 In-class Exercises - Q3 💥

Session ID: MAS261f24


Which statement(s) are true about all three movie gross markets?


A. There are no outliers in these data.

B. There are only low outliers in these data.

C. There are only high outliers in these data.

D. There are low and high outliers in these data.

Calculations to Determine Outliers - Domestic Data

NOTE: All saved calculations are enclosed in parentheses so they will ALSO be displayed.

  1. Use the summary command to find Q1 (25th Percentile or Quantile) and Q3 (75th Percentile or Quantile).
summary(movie_gross$domestic_gross_mil)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.12  226.70  305.79  326.43  387.83  936.66       2 
  2. Calculate \(IQR = Q3 - Q1\) for domestic data and save calculation
(iqr_dom = 387.83 - 226.7)  # values from summary
[1] 161.13
  3. Calculate and save calculations for LL (Lower Limit) and UL (Upper Limit) for domestic data.
(ll_dom = 226.70 - 1.5*iqr_dom)
[1] -14.995
(ul_dom = 387.83 + 1.5*iqr_dom)
[1] 629.525
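
The quartiles above were retyped from the summary output. One possible alternative, sketched here on a made-up vector (`gross` and its values are assumptions, not the movie data), is to compute Q1 and Q3 directly from the data. Because the domestic column has missing values (NA's), `quantile()` needs `na.rm = TRUE`; without it, the call stops with an error.

```r
# Made-up values with missing entries, standing in for a column with NA's
gross = c(120, 200, NA, 310, 400, 950, NA)

# Q1 and Q3 computed from the data; na.rm = TRUE drops the missing values
(q = quantile(gross, c(0.25, 0.75), na.rm = TRUE, names = FALSE))

(iqr_g = q[2] - q[1])          # IQR = Q3 - Q1
(ll_g = q[1] - 1.5 * iqr_g)    # lower limit
(ul_g = q[2] + 1.5 * iqr_g)    # upper limit
```

For these made-up values Q1 = 200 and Q3 = 400, giving IQR = 200, LL = -100, and UL = 700.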

Final Step to determine outliers in Domestic Data

Examine domestic data to determine if there are

  • HIGH Outliers - values above the upper limit (UL = ul_dom)
  • LOW Outliers - values below the lower limit (LL = ll_dom)
ul_dom
[1] 629.525
movie_gross$domestic_gross_mil |> 
  sort(decreasing = TRUE) |> head(15) |> cbind()
        [,1]
 [1,] 936.66
 [2,] 858.37
 [3,] 814.87
 [4,] 785.22
 [5,] 718.73
 [6,] 700.43
 [7,] 684.08
 [8,] 678.82
 [9,] 674.29
[10,] 653.41
[11,] 651.25
[12,] 636.24
[13,] 623.36
[14,] 620.18
[15,] 608.58
ll_dom
[1] -14.995
movie_gross$domestic_gross_mil |> 
  sort(decreasing = TRUE) |> tail(15) |> cbind()
        [,1]
 [1,] 165.25
 [2,] 161.32
 [3,] 160.89
 [4,] 159.56
 [5,] 146.13
 [6,] 144.33
 [7,] 137.72
 [8,] 130.17
 [9,] 124.99
[10,]   5.97
[11,]   3.70
[12,]   2.72
[13,]   1.54
[14,]   0.34
[15,]   0.12
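
Reading the sorted lists works well for a small number of values. As a hedged alternative, R can also count the values beyond each limit. The sketch below reuses the two limits computed above but applies them to a made-up vector (`gross` is an assumption, not the movie data), so its counts are not the answers for the domestic data.

```r
# Limits computed above for the domestic data
ll_dom = -14.995
ul_dom = 629.525

# Made-up values for illustration (not the movie data)
gross = c(0.5, 150, 300, 320, 410, 640, 700, NA)

# which() skips NA comparisons, so missing values are not counted
high = gross[which(gross > ul_dom)]
low  = gross[which(gross < ll_dom)]

length(high)   # number of high outliers
length(low)    # number of low outliers
min(high)      # lowest value that is still a high outlier
```

In this made-up vector, 640 and 700 are above the upper limit and nothing is below the lower limit.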

💥 Lecture 4 In-class Exercises - Q4-Q6 💥

Session ID: MAS261f24

  • Q4: How many LOW outliers are in the domestic gross data?

    • Hint: This confirms what we saw in the boxplot data visualization.


  • Q5: How many HIGH outliers are in the domestic gross data?


  • Q6: What is the LOWEST value in the domestic gross data that is also a HIGH outlier?

Finding Outliers in the Foreign Gross Data

Instructions

  1. Use the previous example and the summary command to find Q1 and Q3 for the foreign gross data.

  2. Find the IQR, LL, and UL for the foreign gross data.

  3. Examine the sorted foreign gross data to determine how many high outliers are present.

💥 Lecture 4 In-class Exercises - Q7-Q10 💥

  • Q7: What is the Interquartile Range (IQR) of the foreign gross data?


  • Q8: What is the upper limit (UL) for the foreign gross data?


  • Q9: How many HIGH outliers are in the foreign gross data?


  • Q10: What is the LOWEST value in the foreign gross data that is also a HIGH outlier?

Another way to look at data - Histograms

  • Boxplots and side-by-side boxplots are great for comparing the central tendency and variability of two or more groups of data.

  • Another tool for examining the entire distribution of values is a histogram.
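
A minimal sketch of a histogram in base R, using a made-up vector (the values, title, and axis label are assumptions for illustration). `hist()` draws the plot and also returns the bins it chose, which can be saved and inspected.

```r
# Made-up, right-skewed values for illustration
x = c(2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12, 20)

# Draw the histogram and keep the bin information R used
h = hist(x, main = "Histogram of made-up values", xlab = "Value")

h$breaks   # bin boundaries chosen by R
h$counts   # number of values in each bin
```

Every value falls into exactly one bin, so the counts add up to the number of values in `x`.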

Looking Ahead

  • Next lecture we’re going to examine both categorical and quantitative data.

  • For categorical data, we’ll look at

    • Frequency Tables and Terminology

    • Bar Charts

    • Pie Charts

  • For quantitative data we’ll talk more about histograms

    • How are they created by the software?

    • How can we modify what the software is doing to better understand the data?

    • What does the shape of the histogram tell us? For example, are the data

      • left-skewed?
      • right-skewed?
      • normally distributed?

Key Points from Today

  • More about Boxplots

  • Identifying outliers visually from a boxplot

  • Identifying outliers numerically

    • Defining and calculating the Interquartile Range (IQR)

    • Defining upper and lower limits for determining outliers

    • Using lower and upper limits to identify outliers

  • Introduction to histograms


To submit an Engagement Question or Comment about material from Lecture 4, do so by midnight today (the day of the lecture).