MAS 261 - Lecture 5
Visualizing Categorical and Quantitative Data
Housekeeping
Today’s plan
Review Question about Outliers
A few minutes for R Questions
Introduction to Histograms
Frequency Tables and Terminology
Frequency Tables for Quantitative Data
Frequency Tables for Categorical Data
Visualizing Frequency Data
Histograms for Continuous Data
Bar Charts and Pie Charts for Categorical data
In-class Exercises
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.
Lecture 5 In-class Exercises - Q1
Session ID: MAS261f24
In lecture 4 we discussed how to visually and quantitatively identify outliers.
- Run the following R code to load the starwars dataset and remove missing values from the height variable.
- Use the summary function and additional calculations to find Q1, Q3, IQR, LL, UL
Min. 1st Qu. Median Mean 3rd Qu. Max.
66.0 167.0 180.0 174.6 191.0 264.0
Is Yoda a high or a low outlier in terms of height?
A. Yoda’s height is a high outlier.
B. Yoda’s height is a low outlier.
C. Yoda’s height is not a high or low outlier.
D. We do not have enough information.
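A minimal sketch of the "additional calculations", assuming the Lecture 4 limits are the usual 1.5 x IQR fences (LL = Q1 - 1.5 x IQR, UL = Q3 + 1.5 x IQR). The quartiles are copied from the summary output above; the variable names and the helper function are illustrative only. A height is a low outlier if it falls below LL and a high outlier if it falls above UL.

```r
# Quartiles copied from the summary() output above
q1 <- 167
q3 <- 191

# Interquartile range and outlier limits
# (assumption: the 1.5 x IQR rule defines LL and UL)
iqr <- q3 - q1
ll  <- q1 - 1.5 * iqr   # lower limit
ul  <- q3 + 1.5 * iqr   # upper limit

# Hypothetical helper that labels a single height (in cm)
classify_height <- function(h) {
  if (h < ll) {
    "low outlier"
  } else if (h > ul) {
    "high outlier"
  } else {
    "not an outlier"
  }
}

c(Q1 = q1, Q3 = q3, IQR = iqr, LL = ll, UL = ul)
classify_height(180)   # the median height from the summary
```

To answer the question, pass the character's height to `classify_height()`.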
Histograms in R
Recall that a Histogram gives a detailed visualization of a quantitative data distribution.
- In simple terms: the plot shows how many observations are in each interval or vertical bar.
We’ll talk more about interpreting a histogram in lecture 6 when we introduce Normal data.
Today, let’s look at the histogram of the Star Wars height data.
R has many options for displaying a histogram.
- In MAS 261, you will run code I provide to view plots and interpret them.
Default Histogram of Star Wars Height Data
Modifying the Histogram of Star Wars Height Data
The plotting code uses ggplot2, which defaults to 30 bins (intervals) for histograms.
Alternatively, we can specify the interval width:
Star Wars heights subdivided into 20 cm intervals (Bins)
Bin 1: 50.01 cm - 70 cm
Bin 2: 70.01 cm - 90 cm
Bin 3: 90.01 cm - 110 cm
etc.
Interval | Freq |
---|---|
(50,70] | 1 |
(70,90] | 2 |
(90,110] | 4 |
(110,130] | 2 |
(130,150] | 3 |
(150,170] | 15 |
(170,190] | 32 |
(190,210] | 15 |
(210,230] | 5 |
(230,250] | 1 |
(250,270] | 1 |
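One way to produce intervals like these in base R is `cut()`, which by default builds right-closed intervals such as (50,70]. The sketch below uses a short, made-up vector of heights rather than the actual Star Wars data, so its counts differ from the table above. It is an illustration in base R, not necessarily the code used to build the slides.

```r
# Hypothetical heights in cm (not the actual starwars data)
heights <- c(66, 88, 96, 150, 165, 170, 172, 183, 188, 202, 228, 264)

# Break points every 20 cm, from 50 to 270
breaks <- seq(50, 270, by = 20)

# Assign each height to a right-closed interval, e.g. (50,70]
bins <- cut(heights, breaks = breaks)

# Count the observations in each interval
freq <- table(bins)
freq

# hist() accepts the same break points and draws the matching histogram
h <- hist(heights, breaks = breaks,
          main = "Heights in 20 cm bins", xlab = "Height (cm)")
```

Note that a height of exactly 170 cm lands in (150,170], not (170,190], because each interval includes its upper end point.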
Height Histogram with 20 cm Intervals (Bins)
Frequency Table for Star Wars Heights Data
Interval | Freq | Cum_Freq | Rel_Freq | Cum_Rel_Freq | Pct_Freq | Cum_Pct_Freq |
---|---|---|---|---|---|---|
(50,70] | 1 | 1 | 0.0123 | 0.0123 | 1.23 | 1.23 |
(70,90] | 2 | 3 | 0.0247 | 0.0370 | 2.47 | 3.70 |
(90,110] | 4 | 7 | 0.0494 | 0.0864 | 4.94 | 8.64 |
(110,130] | 2 | 9 | 0.0247 | 0.1111 | 2.47 | 11.11 |
(130,150] | 3 | 12 | 0.0370 | 0.1481 | 3.70 | 14.81 |
(150,170] | 15 | 27 | 0.1852 | 0.3333 | 18.52 | 33.33 |
(170,190] | 32 | 59 | 0.3951 | 0.7284 | 39.51 | 72.84 |
(190,210] | 15 | 74 | 0.1852 | 0.9136 | 18.52 | 91.36 |
(210,230] | 5 | 79 | 0.0617 | 0.9753 | 6.17 | 97.53 |
(230,250] | 1 | 80 | 0.0123 | 0.9877 | 1.23 | 98.77 |
(250,270] | 1 | 81 | 0.0123 | 1.0000 | 1.23 | 100.00 |
Frequency Table Definitions
Frequency (Freq.): Number of observations in each INTERVAL
Cumulative Frequency (Cum_Freq): Sum of Observations in each INTERVAL plus observations in LOWER INTERVALS.
Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each INTERVAL.
Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each INTERVAL plus relative frequencies in LOWER INTERVALS.
Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each INTERVAL.
Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each INTERVAL plus percent frequencies in LOWER INTERVALS.
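The definitions above can be sketched in a few lines of base R. The interval labels and frequencies are copied from the Star Wars height table; the object names are illustrative, and this is not necessarily how the table on the slide was produced.

```r
# Interval labels and frequencies copied from the 20 cm table above
interval <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]", "(130,150]",
              "(150,170]", "(170,190]", "(190,210]", "(210,230]",
              "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

n <- sum(freq)          # total sample size
rel_freq <- freq / n    # proportion of the data in each interval

freq_table <- data.frame(
  Interval     = interval,
  Freq         = freq,
  Cum_Freq     = cumsum(freq),                    # running total of counts
  Rel_Freq     = round(rel_freq, 4),
  Cum_Rel_Freq = round(cumsum(rel_freq), 4),      # running total of proportions
  Pct_Freq     = round(100 * rel_freq, 2),
  Cum_Pct_Freq = round(100 * cumsum(rel_freq), 2)
)
freq_table
```

Each cumulative column is just `cumsum()` applied to the matching non-cumulative column, which is why the last row always ends at the sample size, 1, or 100.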
Lecture 5 In-class Exercises - Q2
Session ID: MAS261f24
What PERCENT of the Star Wars characters have a height of 130 cm or less?
HINT: Percent values are between 0 and 100.
# A tibble: 11 × 7
Interval Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (50,70] 1 1 0.0123 0.0123 1.23 1.23
2 (70,90] 2 3 0.0247 0.037 2.47 3.7
3 (90,110] 4 7 0.0494 0.0864 4.94 8.64
4 (110,130] 2 9 0.0247 0.111 2.47 11.1
5 (130,150] 3 12 0.037 0.148 3.7 14.8
6 (150,170] 15 27 0.185 0.333 18.5 33.3
7 (170,190] 32 59 0.395 0.728 39.5 72.8
8 (190,210] 15 74 0.185 0.914 18.5 91.4
9 (210,230] 5 79 0.0617 0.975 6.17 97.5
10 (230,250] 1 80 0.0123 0.988 1.23 98.8
11 (250,270] 1 81 0.0123 1 1.23 100
Lecture 5 In-class Exercises - Q3
Session ID: MAS261f24
What PROPORTION of the Star Wars characters have a height of 170.01 cm or more?
HINTS:
Proportion values are between 0 and 1.
For this question you will have to sum values in the correct column manually from the bottom.
# A tibble: 11 × 7
Interval Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (50,70] 1 1 0.0123 0.0123 1.23 1.23
2 (70,90] 2 3 0.0247 0.037 2.47 3.7
3 (90,110] 4 7 0.0494 0.0864 4.94 8.64
4 (110,130] 2 9 0.0247 0.111 2.47 11.1
5 (130,150] 3 12 0.037 0.148 3.7 14.8
6 (150,170] 15 27 0.185 0.333 18.5 33.3
7 (170,190] 32 59 0.395 0.728 39.5 72.8
8 (190,210] 15 74 0.185 0.914 18.5 91.4
9 (210,230] 5 79 0.0617 0.975 6.17 97.5
10 (230,250] 1 80 0.0123 0.988 1.23 98.8
11 (250,270] 1 81 0.0123 1 1.23 100
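Summing from the bottom and subtracting a cumulative value from 1 are two routes to the same number. The sketch below uses the frequencies from the table but a different cutoff than the question: heights of 210.01 cm or more, which covers the last three intervals. Object names are illustrative.

```r
# Frequencies copied from the table, in interval order
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)
rel_freq <- freq / sum(freq)

# Approach 1: add the relative frequencies of the last three intervals
from_bottom <- sum(tail(rel_freq, 3))

# Approach 2: 1 minus the cumulative relative frequency through (190,210],
# which is the 8th interval
cum_rel_freq <- cumsum(rel_freq)
complement <- 1 - cum_rel_freq[8]

round(c(from_bottom = from_bottom, complement = complement), 4)
```

Both approaches give 7/81, about 0.0864. The same two steps apply to any other cutoff that falls on an interval boundary.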
Lecture 5 In-class Exercises - Q4
Session ID: MAS261f24
Recall:
A numerical mode is the value that occurs most often in the data.
A distributional mode is an interval of the data where there is a large number of observations.
The Star Wars height data has a small mode and a large mode.
The large mode appears to be in the interval 150.01 to 230.
The small mode in the height histogram is the interval ___.
A. 50.01 - 70
B. 70.01 - 90
C. 90.01 - 110
D. 110.01 - 130
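The two kinds of mode recalled above can be told apart with a small example. This sketch uses made-up values, not the Star Wars heights, and the bin width of 10 is an arbitrary choice for illustration.

```r
# Hypothetical data (not the starwars heights)
x <- c(12, 15, 15, 18, 21, 22, 22, 22, 25, 41, 43, 44)

# Numerical mode: the single value that occurs most often
value_counts <- table(x)
numerical_mode <- as.numeric(names(value_counts)[which.max(value_counts)])

# Distributional mode: the interval holding the most observations
bin_counts <- table(cut(x, breaks = seq(10, 50, by = 10)))
modal_interval <- names(bin_counts)[which.max(bin_counts)]

numerical_mode    # the most frequent value
modal_interval    # the interval with the highest count
bin_counts        # counts per interval
```

Here the interval counts rise, drop to zero, and rise again, so a histogram of `x` would show one large peak and one smaller, separate peak.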
Summarizing and Visualizing Categorical Data
A histogram is ONLY used for quantitative data.
- HOWEVER the terminology for summarizing data into quantitative intervals is ALSO used for summarizing categorical data.
Categorical frequencies in tables are summarized from the top category down.
- The order of categories is intuitive when the data are ordinal; otherwise the order is subjective.
To visualize categorical frequency data, we typically use bar charts or pie charts.
Categorical Frequency Table - Educational Attainment
The next example is based on a survey of 2537 people in the United States.
Question Asked: What level of education have you completed?
Barchart for Educational Attainment Survey
Numbers above each bar show number of observations in each category.
Frequency Table for Educational Attainment Data
Categorical frequencies are summarized from the top category down.
The order of categories is intuitive when the data are ordinal; otherwise the order is subjective.
Educational attainment data are ordinal and go from lowest education level to highest.
Highest_Degree | Freq | Cum_Freq | Rel_Freq | Cum_Rel_Freq | Pct_Freq | Cum_Pct_Freq |
---|---|---|---|---|---|---|
Left high school | 330 | 330 | 0.130 | 0.130 | 13.0 | 13.0 |
High school | 1269 | 1599 | 0.500 | 0.630 | 50.0 | 63.0 |
Junior college | 186 | 1785 | 0.073 | 0.704 | 7.3 | 70.4 |
Bachelor’s | 472 | 2257 | 0.186 | 0.890 | 18.6 | 89.0 |
Graduate | 280 | 2537 | 0.110 | 1.000 | 11.0 | 100.0 |
Frequency Table Definitions for Categorical data
Frequency (Freq.): Number of observations in each CATEGORY
Cumulative Frequency (Cum_Freq): Sum of Observations in each CATEGORY plus observations in CATEGORIES THAT APPEAR ABOVE in the frequency table.
Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each CATEGORY.
Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each CATEGORY plus relative frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.
Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each CATEGORY.
Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each CATEGORY plus percent frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.
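For categorical data, "categories that appear above" depends on how the categories are ordered. A sketch in base R, using the category names and counts from the educational attainment table: storing the responses as a factor with explicit levels keeps the lowest-to-highest order, whereas without explicit levels R sorts the categories alphabetically. Object names are illustrative.

```r
# Categories (lowest to highest) and counts copied from the table
degree <- c("Left high school", "High school", "Junior college",
            "Bachelor's", "Graduate")
freq <- c(330, 1269, 186, 472, 280)

# Rebuild one response per person, stored as an ordered factor
# so that table() lists categories from lowest to highest level
responses <- factor(rep(degree, times = freq), levels = degree, ordered = TRUE)
counts <- table(responses)

# Without explicit levels, table() sorts the categories alphabetically
alphabetical <- table(rep(degree, times = freq))

rel_freq <- as.vector(counts) / sum(counts)
edu_table <- data.frame(
  Highest_Degree = names(counts),
  Freq           = as.vector(counts),
  Rel_Freq       = round(rel_freq, 3),
  Cum_Rel_Freq   = round(cumsum(rel_freq), 3)
)
edu_table
names(alphabetical)   # same counts, but in a less useful order
```

Cumulative columns are only meaningful in the first ordering, because "Bachelor's or lower" has no sensible reading once the rows are alphabetical.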
Lecture 5 In-class Exercises - Q5
Session ID: MAS261f24
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1599 0.5 0.63 50 63
3 Junior college 186 1785 0.073 0.704 7.3 70.4
4 Bachelor's 472 2257 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
What PROPORTION of survey respondents have a Bachelor’s degree or a lower level of education?
HINT: Proportion values are between 0 and 1.
Lecture 5 In-class Exercises - Q6
Session ID: MAS261f24
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1599 0.5 0.63 50 63
3 Junior college 186 1785 0.073 0.704 7.3 70.4
4 Bachelor's 472 2257 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
What PERCENTAGE of the survey respondents have a high school degree or a higher degree?
HINTS: Percentages are between 0 and 100.
For this question you will have to sum values in the correct column manually from the bottom.
Or you can calculate 100 minus the sum of the remaining categories from the top.
Lecture 5 In-class Exercises - Q7
Session ID: MAS261f24
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1599 0.5 0.63 50 63
3 Junior college 186 1785 0.073 0.704 7.3 70.4
4 Bachelor's 472 2257 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
Categorical data can have a mode as well; it’s the category that is most prevalent in the data.
Which response in the education level data is the mode and is the answer given by half of the respondents?
A. Left high school
B. High school
C. Junior college
D. Bachelor’s
E. Graduate
Bar Chart vs. Pie Chart
Two good ways to show categorical data
Pie charts are better for data with fewer categories.
Bar charts are effective in color or black and white.
In pie charts, frequencies are often replaced by percents.
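A minimal sketch of both chart types using the educational attainment counts. It uses base R `barplot()` and `pie()` so that it runs without extra packages; this is a swapped-in alternative and not necessarily the code behind the slide images. Object names are illustrative.

```r
# Categories and counts copied from the educational attainment table
degree <- c("Left high school", "High school", "Junior college",
            "Bachelor's", "Graduate")
freq <- c(330, 1269, 186, 472, 280)

# Bar chart: one bar per category, with the count printed above each bar
bar_mid <- barplot(freq, names.arg = degree,
                   ylim = c(0, 1.1 * max(freq)),
                   ylab = "Number of respondents", cex.names = 0.7)
text(x = bar_mid, y = freq, labels = freq, pos = 3)

# Pie chart: slices labeled with percents instead of frequencies
pct <- round(100 * freq / sum(freq), 1)
pie_labels <- paste0(degree, " (", pct, "%)")
pie(freq, labels = pie_labels)
```

`barplot()` returns the horizontal midpoint of each bar, which is what lets `text()` place the counts directly above the bars.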
Bar Chart vs. Histogram
- Bar charts are used for CATEGORICAL data
- Histograms show the distribution of Quantitative Data
Histograms of Different Distributions
Histograms are an effective tool for examining the distribution of the data.
LEFT SKEWED
Tail pulled out to LEFT
Low outliers
e.g. Human Lifespan
NORMAL/SYMMETRIC
Data appear in a symmetric bell-shaped curve
No graphic evidence of outliers
e.g. Test scores
RIGHT SKEWED
Tail pulled out to RIGHT
High outliers
e.g. Movie Gross values
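The three shapes can be reproduced with simulated data. This is a sketch only: the values are randomly generated, not lifespan, test score, or movie data, and the distributions and parameters are arbitrary choices made to get the right shapes.

```r
set.seed(261)  # arbitrary seed so the sketch is reproducible

symmetric    <- rnorm(1000, mean = 75, sd = 8)    # bell-shaped
right_skewed <- rexp(1000, rate = 1 / 50)         # long tail to the right
left_skewed  <- 100 - rexp(1000, rate = 1 / 10)   # mirrored: long tail to the left

# Draw the three histograms side by side
old_par <- par(mfrow = c(1, 3))
hist(left_skewed,  main = "Left skewed",      xlab = "Value")
hist(symmetric,    main = "Normal/symmetric", xlab = "Value")
hist(right_skewed, main = "Right skewed",     xlab = "Value")
par(old_par)
```

In the left-skewed panel the few extreme values sit far below the bulk of the data, and in the right-skewed panel they sit far above it.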
Histogram of Movie Data from Lecture 4
Key Points from Today
Frequency Tables
Definitions of Frequency Table Terms
Terms are essentially the same for categorical and quantitative data.
Quantitative data are subdivided into intervals (bins)
For categorical data, order of table categories is subjective if data are not ordinal.
A histogram is a visual representation of quantitative frequency data.
- A histogram CANNOT BE CREATED for categorical data.
Bar charts and Pie charts are two common ways to represent categorical frequency data.
To submit an Engagement Question or Comment about material from Lecture 5: Submit it by midnight today (day of lecture).