Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.12 226.70 305.79 326.43 387.83 936.66 2
MAS 261 - Lecture 4
Data Visualizations, Boxplots, Outliers
Housekeeping
Today’s plan
Review Question about Measures of Variability
A few minutes for R Questions
More about Boxplots
- Review of summary values and where they are shown
Outliers
IQR - Inter-quartile Range
UL - Upper Limit and LL - Lower Limit to detect outliers
In-class Exercises
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.
Lecture 4 In-class Exercises - Q1
Session ID: MAS261f24
In Lecture 3 we discussed the measures of variability, including TSS, variance, standard deviation, CV, and range.
Recall that all of these measures, except range, are closely related and can be calculated from each other.
An online electronics store has a selection of 31 \((n=31)\) different headphones. The sample mean \((\overline{X})\) price is $97.
Recall that \(SD = \sqrt{Var}\) and \(CV = \frac{SD}{\overline{X}}\).
Also recall that the R console (lower left pane) can be used like a calculator.
If the variance in these prices is 1256, what is the coefficient of variation, CV? Round answer to two decimal places.
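As a minimal sketch of using R like a calculator here, the two formulas above can be chained together. The variable names (`x_bar`, `var_x`, `sd_x`, `cv_x`) are illustrative assumptions, not names from the course files.

```r
# Values given in the question
x_bar <- 97      # sample mean price
var_x <- 1256    # sample variance of the prices

sd_x <- sqrt(var_x)    # SD = sqrt(Var)
cv_x <- sd_x / x_bar   # CV = SD / mean

round(cv_x, 2)         # CV rounded to two decimal places
```

The same result can be obtained in one line with `round(sqrt(1256) / 97, 2)`; saving the intermediate values just makes each step visible.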
Boxplot Review
Annotated with Five Number Summary
Boxplot Review and More Information
The technical term is a Box and Whiskers Plot. This version shows where the top whisker ends, with circles marking the values above it.
Some new terms
Terminology
- IQR - Interquartile Range \(IQR = Q3 - Q1\)
- UL - Upper Limit \(UL = Q3 + 1.5\times IQR\)
- LL - Lower Limit \(LL = Q1 - 1.5\times IQR\)
Notes
- Values above the Upper Limit are HIGH outliers.
- Values below the Lower Limit are LOW outliers.
- Not all datasets have high and/or low outliers.
- Vermont and Maine data have no low outliers.
- Lower Limit can be negative even if all data are positive.
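The three definitions can be tried out with a small sketch. The quartile values below are hypothetical, chosen only so the arithmetic is easy to follow, and the variable names are assumptions.

```r
# Hypothetical quartiles, chosen only for illustration
q1 <- 10
q3 <- 30

iqr <- q3 - q1          # IQR = Q3 - Q1
ul  <- q3 + 1.5 * iqr   # UL  = Q3 + 1.5 * IQR
ll  <- q1 - 1.5 * iqr   # LL  = Q1 - 1.5 * IQR

c(IQR = iqr, UL = ul, LL = ll)
```

Here the IQR is 20, the UL is 60, and the LL is -20, so the Lower Limit is negative even though both quartiles are positive.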
Steps for Determining Outliers
It is useful, but not required, to view a boxplot (or histogram) to examine the data distribution.
We will introduce histograms at the end of this lecture.
Data visualizations can indicate if there are high or low outliers present.
Find Q1 (25th Percentile) and Q3 (75th Percentile)
- In R these values are found using the summary command.
Calculate IQR, the Interquartile Range, \(IQR = Q3 - Q1\)
Calculate the LL, Lower Limit and UL, Upper Limit for determining outliers.
\(LL = Q1 - 1.5\times IQR\)
\(UL = Q3 + 1.5\times IQR\)
Examine values in sorted data to determine which values are
- HIGH outliers, values above the UL
- LOW outliers, values below the LL
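The steps above can be sketched end to end on a small made-up vector. This is an illustration under stated assumptions: the data are hypothetical, and `quantile()` is swapped in for reading Q1 and Q3 off the `summary` output so the values can be saved directly.

```r
# Hypothetical data; the NA is included to show missing-value handling
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 30, NA)

# Step: find Q1 and Q3 (na.rm = TRUE ignores the missing value)
q1 <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)

# Step: calculate the IQR, then the lower and upper limits
iqr <- q3 - q1
ll  <- q1 - 1.5 * iqr
ul  <- q3 + 1.5 * iqr

# Step: examine the sorted data (sort() drops NA by default)
x_sorted <- sort(x, decreasing = TRUE)
high_outliers <- x_sorted[x_sorted > ul]   # values above the UL
low_outliers  <- x_sorted[x_sorted < ll]   # values below the LL

high_outliers
low_outliers
```

In this made-up example only the value 30 is above the UL and nothing is below the LL, so there is one high outlier and no low outliers.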
Boxplots - Lifetime Gross for All 3 Markets
Lecture 4 In-class Exercises - Q2
Based on the boxplots below, which market, domestic or foreign, has a higher median for the lifetime gross data?
A. Domestic
B. Foreign
C. The median values for these two markets are approximately equal.
Lecture 4 In-class Exercises - Q3
Which statement(s) are true about all three movie gross markets?
A. There are no outliers in these data.
B. There are only low outliers in these data.
C. There are only high outliers in these data.
D. There are low and high outliers in these data.
Calculations to Determine Outliers - Domestic Data
NOTE: All saved calculations are enclosed in parentheses so they will ALSO be displayed.
- Use the summary command to find Q1 (25th Percentile or Quantile) and Q3 (75th Percentile or Quantile).
- Calculate \(IQR = Q3 - Q1\) for domestic data and save calculation
- Calculate and save calculations for LL (Lower Limit) and UL (Upper Limit) for domestic data.
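A sketch of these calculations, assuming Q1 = 226.70 and Q3 = 387.83 as read from the summary output reproduced in these notes. The names `ul_dom` and `ll_dom` come from the slides; `q1_dom`, `q3_dom`, and `iqr_dom` are assumed names used only for illustration.

```r
# Q1 and Q3 as displayed in the summary output
q1_dom <- 226.70
q3_dom <- 387.83

# Wrapping an assignment in parentheses saves the value AND displays it
(iqr_dom <- q3_dom - q1_dom)
(ul_dom  <- q3_dom + 1.5 * iqr_dom)
(ll_dom  <- q1_dom - 1.5 * iqr_dom)
```

Without the outer parentheses, each line would still save its result but print nothing to the console.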
Final Step to determine outliers in Domestic Data
Examine domestic data to determine if there are
- HIGH Outliers - values above the upper limit (UL = ul_dom)
- LOW Outliers - values below the lower limit (LL = ll_dom)
Upper limit (ul_dom), followed by the 15 largest domestic values sorted from highest to lowest:
[1] 629.525
[,1]
[1,] 936.66
[2,] 858.37
[3,] 814.87
[4,] 785.22
[5,] 718.73
[6,] 700.43
[7,] 684.08
[8,] 678.82
[9,] 674.29
[10,] 653.41
[11,] 651.25
[12,] 636.24
[13,] 623.36
[14,] 620.18
[15,] 608.58
Lower limit (ll_dom), followed by the 15 smallest domestic values sorted from highest to lowest:
[1] -14.995
[,1]
[1,] 165.25
[2,] 161.32
[3,] 160.89
[4,] 159.56
[5,] 146.13
[6,] 144.33
[7,] 137.72
[8,] 130.17
[9,] 124.99
[10,] 5.97
[11,] 3.70
[12,] 2.72
[13,] 1.54
[14,] 0.34
[15,] 0.12
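Instead of comparing by eye, the displayed values can be checked against the limits in code. This is a sketch only: the values are copied from the output above, and `top15`, `bottom15`, `high_out`, and `low_out` are assumed names. Using just the displayed values is enough here because they are sorted and the smallest of the 15 largest values is already below the upper limit.

```r
# Limits as displayed in the output
ul_dom <- 629.525
ll_dom <- -14.995

# The 15 largest and 15 smallest domestic values, copied from the output
top15 <- c(936.66, 858.37, 814.87, 785.22, 718.73, 700.43, 684.08, 678.82,
           674.29, 653.41, 651.25, 636.24, 623.36, 620.18, 608.58)
bottom15 <- c(165.25, 161.32, 160.89, 159.56, 146.13, 144.33, 137.72, 130.17,
              124.99, 5.97, 3.70, 2.72, 1.54, 0.34, 0.12)

high_out <- top15[top15 > ul_dom]         # values above the upper limit
low_out  <- bottom15[bottom15 < ll_dom]   # values below the lower limit

length(high_out)   # number of HIGH outliers
min(high_out)      # smallest value that is still a HIGH outlier
length(low_out)    # number of LOW outliers
```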
Lecture 4 In-class Exercises - Q4-Q6
- Q4: How many LOW outliers are in the domestic gross data?
  - Hint: This confirms what we saw in the boxplot data visualization.
- Q5: How many HIGH outliers are in the domestic gross data?
- Q6: What is the LOWEST value in the domestic gross data that is also a HIGH outlier?
Finding Outliers in the Foreign Gross Data
Instructions
Use the previous example and the summary command to find Q1 and Q3 for the foreign gross data.
Find the IQR, LL, and UL for the foreign gross data.
Examine the sorted foreign gross data to determine how many high outliers are present.
Lecture 4 In-class Exercises - Q7-Q10
- Q7: What is the Inter-quartile range (IQR) of the foreign gross data?
- Q8: What is the upper limit (UL) for the foreign gross data?
- Q9: How many HIGH outliers are in the foreign gross data?
- Q10: What is the LOWEST value in the foreign gross data that is also a HIGH outlier?
Another way to look at data - Histograms
Boxplots and side-by-side boxplots are great for comparing the central tendency and variability of two or more groups of data.
Another tool for examining the entire distribution of values is a histogram.
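As a rough preview, a histogram can be drawn with base R's `hist()` function. The data below are randomly generated and purely hypothetical, and the course materials may use a different plotting approach.

```r
# Hypothetical right-skewed sample, generated only for illustration
set.seed(261)
x <- rexp(200, rate = 1 / 300)

# breaks = 20 is a suggestion to R for the approximate number of bins
h <- hist(x, breaks = 20,
          main = "Histogram of hypothetical values",
          xlab = "Value")

sum(h$counts)   # every observation falls into exactly one bin
```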
Looking Ahead
Next lecture we’re going to examine both categorical and quantitative data.
For categorical data, we’ll look at
Frequency Tables and Terminology
Bar Charts
Pie Charts
For quantitative data we’ll talk more about histograms
How are they created by the software?
How can we modify what the software is doing to better understand the data?
What does the shape of the histogram tell us? For example, are the data
- left-skewed?
- right-skewed?
- normally distributed?
Key Points from Today
More about Boxplots
Identifying outliers visually from a boxplot
Identifying outliers numerically
Defining and calculating the Inter-quartile Range (IQR)
Defining upper and lower limits for determining outliers
Using lower and upper limits to identify outliers
Introduction to histograms
To submit an Engagement Question or Comment about material from Lecture 4: Submit it by midnight today (day of lecture).