Visualizing Categorical and Quantitative Data
2024-09-14
Today’s plan 📋
Review Question about Outliers
A few minutes for R Questions 🪄
Introduction to Histograms
Frequency Tables and Terminology
Frequency Tables for Quantitative Data
Frequency Tables for Categorical Data
Visualizing Frequency Data
Histograms for Continuous Data
Bar Charts and Pie Charts for Categorical Data
In-class Exercises
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access through the links I provide.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
Session ID: MAS261f24
In lecture 4 we discussed how to visually and quantitatively identify outliers.
Min. 1st Qu. Median Mean 3rd Qu. Max.
66.0 167.0 180.0 174.6 191.0 264.0
Is Yoda a high or a low outlier in terms of height?
A. Yoda’s height is a high outlier.
B. Yoda’s height is a low outlier.
C. Yoda’s height is neither a high nor a low outlier.
D. We do not have enough information.
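One way to check this quantitatively is sketched below. It assumes the 1.5 × IQR fence rule for flagging outliers and assumes that Yoda is the 66 cm minimum in the summary above; neither assumption is stated on this slide, and the variable names are only illustrative.

```r
# Values copied from the summary output above
q1 <- 167
q3 <- 191
min_height <- 66    # assumed to be Yoda's height (cm)
max_height <- 264

# Assumed rule: a value is an outlier if it lies more than
# 1.5 * IQR below Q1 or above Q3
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_low_outlier  <- min_height < lower_fence
is_high_outlier <- max_height > upper_fence

cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Minimum is a low outlier: ", is_low_outlier, "\n")
cat("Maximum is a high outlier:", is_high_outlier, "\n")
```

Any height below the lower fence counts as a low outlier under this rule, and any height above the upper fence counts as a high outlier.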
Recall that a Histogram gives a detailed visualization of a quantitative data distribution.
We’ll talk more about interpreting a histogram in lecture 6 when we introduce Normal data.
Today, let’s look at the histogram of the Star Wars height data.
R has many options for displaying a histogram.
R uses a default number of bins, or intervals, for histograms (for example, 30 bins in ggplot2's geom_histogram()).
Alternatively, we can specify the interval width:
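A minimal sketch of setting the interval width, using base R's hist() rather than whichever plotting function the course demo uses. The heights vector is a small hypothetical sample for illustration only, not the full Star Wars data.

```r
# Hypothetical heights in cm (illustration only, not the full data set)
heights <- c(66, 96, 97, 150, 165, 167, 170, 172,
             178, 180, 183, 188, 191, 202, 228, 264)

# Bin edges every 20 cm from 50 to 270 give 11 intervals
bin_edges <- seq(50, 270, by = 20)

# By default hist() uses right-closed intervals such as (50,70]
h <- hist(heights, breaks = bin_edges,
          main = "Heights in 20 cm intervals",
          xlab = "Height (cm)")

# The counts behind the bars, one per interval
h$counts
```

The object returned by hist() stores the bin edges in h$breaks and the bar heights in h$counts, so the same call also yields the frequencies.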
Star Wars heights subdivided into 20 cm intervals (Bins)
Bin 1: 50.01 cm - 70 cm
Bin 2: 70.01 cm - 90 cm
Bin 3: 90.01 cm - 110 cm
etc.
Interval | Freq |
---|---|
(50,70] | 1 |
(70,90] | 2 |
(90,110] | 4 |
(110,130] | 2 |
(130,150] | 3 |
(150,170] | 15 |
(170,190] | 32 |
(190,210] | 15 |
(210,230] | 5 |
(230,250] | 1 |
(250,270] | 1 |
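The interval labels above use the notation (a,b], meaning greater than a and less than or equal to b. The sketch below uses base R's cut() to show where boundary values land; the heights are hypothetical values chosen to sit on or near the bin edges.

```r
# Hypothetical heights chosen to sit on or near bin edges
heights <- c(70, 70.01, 90, 170, 170.01, 264)

# Right-closed 20 cm intervals: (50,70], (70,90], ..., (250,270]
bins <- cut(heights, breaks = seq(50, 270, by = 20))

# Which interval each height falls into
data.frame(height = heights, interval = bins)

# Frequency of each interval (empty intervals show 0)
table(bins)
```

A height of exactly 170 cm is counted in (150,170], while 170.01 cm is counted in (170,190].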
Interval | Freq | Cum_Freq | Rel_Freq | Cum_Rel_Freq | Pct_Freq | Cum_Pct_Freq |
---|---|---|---|---|---|---|
(50,70] | 1 | 1 | 0.0123 | 0.0123 | 1.23 | 1.23 |
(70,90] | 2 | 3 | 0.0247 | 0.0370 | 2.47 | 3.70 |
(90,110] | 4 | 7 | 0.0494 | 0.0864 | 4.94 | 8.64 |
(110,130] | 2 | 9 | 0.0247 | 0.1111 | 2.47 | 11.11 |
(130,150] | 3 | 12 | 0.0370 | 0.1481 | 3.70 | 14.81 |
(150,170] | 15 | 27 | 0.1852 | 0.3333 | 18.52 | 33.33 |
(170,190] | 32 | 59 | 0.3951 | 0.7284 | 39.51 | 72.84 |
(190,210] | 15 | 74 | 0.1852 | 0.9136 | 18.52 | 91.36 |
(210,230] | 5 | 79 | 0.0617 | 0.9753 | 6.17 | 97.53 |
(230,250] | 1 | 80 | 0.0123 | 0.9877 | 1.23 | 98.77 |
(250,270] | 1 | 81 | 0.0123 | 1.0000 | 1.23 | 100.00 |
Frequency (Freq): Number of observations in each INTERVAL.
Cumulative Frequency (Cum_Freq): Number of observations in each INTERVAL plus the observations in all LOWER INTERVALS.
Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each INTERVAL.
Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each INTERVAL plus relative frequencies in LOWER INTERVALS.
Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each INTERVAL.
Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each INTERVAL plus percent frequencies in LOWER INTERVALS.
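The definitions above can be sketched in base R starting from the Freq column of the table. The column names match the table; the object names (freq, n, freq_table) are illustrative assumptions, and this is not necessarily how the table in the slides was produced.

```r
# Interval labels and frequencies copied from the table above
intervals <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]",
               "(130,150]", "(150,170]", "(170,190]", "(190,210]",
               "(210,230]", "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

n <- sum(freq)   # total sample size

freq_table <- data.frame(
  Interval     = intervals,
  Freq         = freq,
  Cum_Freq     = cumsum(freq),                    # running total of counts
  Rel_Freq     = round(freq / n, 4),              # proportion in each interval
  Cum_Rel_Freq = round(cumsum(freq) / n, 4),      # running proportion
  Pct_Freq     = round(100 * freq / n, 2),        # percent in each interval
  Cum_Pct_Freq = round(100 * cumsum(freq) / n, 2) # running percent
)

print(freq_table)
```

Each cumulative column is a running total of the column beside it, so the last row always reaches the full sample size, 1, or 100.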
What PERCENT of the Star Wars characters have a height of 130 cm or less?
HINT: Percent values are between 0 and 100.
# A tibble: 11 × 7
Interval Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (50,70] 1 1 0.0123 0.0123 1.23 1.23
2 (70,90] 2 3 0.0247 0.037 2.47 3.7
3 (90,110] 4 7 0.0494 0.0864 4.94 8.64
4 (110,130] 2 9 0.0247 0.111 2.47 11.1
5 (130,150] 3 12 0.037 0.148 3.7 14.8
6 (150,170] 15 27 0.185 0.333 18.5 33.3
7 (170,190] 32 59 0.395 0.728 39.5 72.8
8 (190,210] 15 74 0.185 0.914 18.5 91.4
9 (210,230] 5 79 0.0617 0.975 6.17 97.5
10 (230,250] 1 80 0.0123 0.988 1.23 98.8
11 (250,270] 1 81 0.0123 1 1.23 100
What PROPORTION of the Star Wars characters have a height of 170.01 cm or more?
HINTS:
Proportion values are between 0 and 1.
For this question you will have to sum values in the correct column manually from the bottom.
# A tibble: 11 × 7
Interval Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (50,70] 1 1 0.0123 0.0123 1.23 1.23
2 (70,90] 2 3 0.0247 0.037 2.47 3.7
3 (90,110] 4 7 0.0494 0.0864 4.94 8.64
4 (110,130] 2 9 0.0247 0.111 2.47 11.1
5 (130,150] 3 12 0.037 0.148 3.7 14.8
6 (150,170] 15 27 0.185 0.333 18.5 33.3
7 (170,190] 32 59 0.395 0.728 39.5 72.8
8 (190,210] 15 74 0.185 0.914 18.5 91.4
9 (210,230] 5 79 0.0617 0.975 6.17 97.5
10 (230,250] 1 80 0.0123 0.988 1.23 98.8
11 (250,270] 1 81 0.0123 1 1.23 100
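Summing from the bottom can be cross-checked with a complement: the proportion above a cutoff equals 1 minus the cumulative relative frequency up to that cutoff. A minimal sketch using the Freq column above (object names are illustrative):

```r
# Frequencies for the 11 intervals, copied from the table above
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)
rel_freq <- freq / sum(freq)

# Approach 1: add relative frequencies from the bottom,
# rows 7 to 11 are the intervals above 170 cm
from_bottom <- sum(rel_freq[7:11])

# Approach 2: 1 minus the cumulative relative frequency
# through row 6, the interval (150,170]
from_top <- 1 - cumsum(rel_freq)[6]

round(c(from_bottom = from_bottom, from_top = from_top), 4)
```

Both approaches give the same proportion, so either can be used to check the other.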
Recall:
A numerical mode is the value that occurs most often in the data.
A distributional mode is an interval of the data where there is a large number of observations.
The Star Wars height data have a small mode and a large mode.
The large mode appears to be in the interval 150.01 to 230.
The small mode in the height histogram is the interval ___.
A. 50.01 - 70
B. 70.01 - 90
C. 90.01 - 110
D. 110.01 - 130
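One rough way to locate distributional modes from a frequency table is to look for intervals whose count is higher than the counts on both sides. This is an illustrative heuristic, not a definition from the lecture, and it only flags the peak interval rather than the full range of a mode.

```r
# Interval labels and frequencies copied from the height table
intervals <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]",
               "(130,150]", "(150,170]", "(170,190]", "(190,210]",
               "(210,230]", "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

# Count in the neighbouring interval on each side
# (-Inf pads the two ends so edge bins can still be compared)
left  <- c(-Inf, head(freq, -1))
right <- c(tail(freq, -1), -Inf)

# A local peak has a strictly higher count than both neighbours
peak_rows <- which(freq > left & freq > right)

data.frame(Interval = intervals[peak_rows], Freq = freq[peak_rows])
```

With real data this heuristic can also flag small bumps that are just noise, so it is best used alongside the histogram itself.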
A histogram is ONLY used for quantitative data.
Categorical frequencies in tables are summarized from the top category down.
To visualize categorical frequency data, we typically use bar charts or pie charts.
Categorical Frequency Table - Educational Attainment
The next example is based on a survey of 2537 people in the United States.
Question Asked: What level of education have you completed?
Numbers above each bar show the number of observations in each category.
Categorical frequencies are summarized from the top category down.
The order of categories is intuitive when the data are ordinal, but not all categorical data are ordinal.
Educational attainment data are ordinal and go from the lowest education level to the highest.
Highest_Degree | Freq | Cum_Freq | Rel_Freq | Cum_Rel_Freq | Pct_Freq | Cum_Pct_Freq |
---|---|---|---|---|---|---|
Left high school | 330 | 330 | 0.130 | 0.130 | 13.0 | 13.0 |
High school | 1269 | 1598 | 0.500 | 0.630 | 50.0 | 63.0 |
Junior college | 186 | 1786 | 0.073 | 0.704 | 7.3 | 70.4 |
Bachelor’s | 472 | 2258 | 0.186 | 0.890 | 18.6 | 89.0 |
Graduate | 280 | 2537 | 0.110 | 1.000 | 11.0 | 100.0 |
Frequency (Freq): Number of observations in each CATEGORY.
Cumulative Frequency (Cum_Freq): Number of observations in each CATEGORY plus the observations in CATEGORIES THAT APPEAR ABOVE in the frequency table.
Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each CATEGORY.
Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each CATEGORY plus relative frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.
Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each CATEGORY.
Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each CATEGORY plus percent frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.
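The same calculations work for categorical data once the categories are in the intended order. The sketch below rebuilds the columns from the Freq values in the table; the object names are illustrative. Because cumsum() adds the rounded Freq values shown above, a few cumulative counts can differ by one from the Cum_Freq column in the table, which may have been computed before rounding (an assumption).

```r
# Categories in ordinal order, lowest to highest, with counts from the table
degree <- c("Left high school", "High school", "Junior college",
            "Bachelor's", "Graduate")
freq <- c(330, 1269, 186, 472, 280)

n <- sum(freq)   # number of survey respondents

edu_table <- data.frame(
  # factor() with explicit levels keeps the ordinal order
  Highest_Degree = factor(degree, levels = degree),
  Freq           = freq,
  Cum_Freq       = cumsum(freq),
  Rel_Freq       = round(freq / n, 3),
  Cum_Rel_Freq   = round(cumsum(freq) / n, 3),
  Pct_Freq       = round(100 * freq / n, 1),
  Cum_Pct_Freq   = round(100 * cumsum(freq) / n, 1)
)

print(edu_table)
```

Setting the factor levels explicitly matters because R otherwise sorts categories alphabetically, which would scramble the cumulative columns.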
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1598 0.5 0.63 50 63
3 Junior college 186 1786 0.073 0.704 7.3 70.4
4 Bachelor's 472 2258 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
What PROPORTION of survey respondents have a Bachelor’s degree or a lower level of education?
HINT: Proportion values are between 0 and 1.
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1598 0.5 0.63 50 63
3 Junior college 186 1786 0.073 0.704 7.3 70.4
4 Bachelor's 472 2258 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
What PERCENTAGE of the survey respondents have a high school degree or a higher degree?
HINTS: Percentages are between 0 and 100.
For this question you will have to sum values in the correct column manually from the bottom.
Or you can calculate 100 minus the sum of the percent frequencies for the remaining categories at the top of the table.
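Both hints can be sketched in base R from the Freq column. The object names are illustrative, and percents are computed from the counts rather than from the rounded Pct_Freq column so that the two approaches agree exactly.

```r
# Counts from the education table, in ordinal order
freq <- c("Left high school" = 330, "High school" = 1269,
          "Junior college" = 186, "Bachelor's" = 472, "Graduate" = 280)
pct <- 100 * freq / sum(freq)

# Hint 1: add percent frequencies from the bottom,
# rows 2 to 5 are high school and every higher degree
from_bottom <- sum(pct[2:5])

# Hint 2: 100 minus the percent in the one category above high school
from_top <- 100 - unname(pct[1])

round(c(from_bottom = from_bottom, from_top = from_top), 1)
```

If the rounded Pct_Freq values from the table are summed by hand instead, the two results can differ slightly in the last digit because of rounding.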
# A tibble: 5 × 7
Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Left high school 330 330 0.13 0.13 13 13
2 High school 1269 1598 0.5 0.63 50 63
3 Junior college 186 1786 0.073 0.704 7.3 70.4
4 Bachelor's 472 2258 0.186 0.89 18.6 89
5 Graduate 280 2537 0.11 1 11 100
Categorical data can have a mode as well; it’s the category that is most prevalent in the data.
Which response in the education level data is the mode, the answer given by half of the respondents?
A. Left high school
B. High school
C. Junior college
D. Bachelor’s
E. Graduate
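A categorical mode can be read off a table by finding the largest frequency. A minimal sketch with base R's which.max(), using the counts from the education table (object names are illustrative):

```r
# Counts from the education table
freq <- c("Left high school" = 330, "High school" = 1269,
          "Junior college" = 186, "Bachelor's" = 472, "Graduate" = 280)

# which.max() returns the position of the largest count
mode_category <- names(freq)[which.max(freq)]

# Share of all respondents in that category
mode_share <- max(freq) / sum(freq)

cat("Mode:", mode_category, "\n")
cat("Proportion of respondents:", round(mode_share, 3), "\n")
```

Note that which.max() returns only the first maximum, so a tie between two categories would need a separate check.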
Two good ways to show categorical data
Pie charts work better for data with fewer categories.
Bar charts are effective in color or in black and white.
In pie charts, frequencies are often replaced by percents.
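Both chart types can be sketched with base R graphics, using the counts from the education table. This is one possible approach and not necessarily the one used to draw the charts in the slides; titles and object names are illustrative.

```r
# Counts from the education table, in ordinal order
freq <- c("Left high school" = 330, "High school" = 1269,
          "Junior college" = 186, "Bachelor's" = 472, "Graduate" = 280)

# Bar chart: barplot() returns the x position of each bar,
# which is used to print the count above it
bar_mid <- barplot(freq,
                   ylim = c(0, max(freq) * 1.15),  # headroom for labels
                   cex.names = 0.7,
                   ylab = "Number of respondents",
                   main = "Highest degree completed")
text(bar_mid, freq, labels = freq, pos = 3)

# Pie chart: labels show percents instead of counts
pct <- round(100 * freq / sum(freq), 1)
pie_labels <- paste0(names(freq), " (", pct, "%)")
pie(freq, labels = pie_labels, main = "Highest degree completed")
```

The bar chart keeps the ordinal order of the categories along the axis, which is harder to convey in a pie chart.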
Histograms are an effective tool for examining the distribution of the data.
LEFT SKEWED
Tail pulled out to LEFT
Low outliers
e.g. Human Lifespan
NORMAL/SYMMETRIC
Data appear in a symmetric bell-shaped curve
No graphic evidence of outliers
e.g. Test scores
RIGHT SKEWED
Tail pulled out to RIGHT
High outliers
e.g. Movie Gross values
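A common rule of thumb, not stated in the slides, is that outliers pull the mean toward the tail, so comparing the mean with the median gives a quick hint about skew. The sketch below applies it to the Star Wars height summary from earlier in the lecture; describe_skew() is a hypothetical helper, and the heuristic is only a rough check that does not replace looking at the histogram.

```r
# Hypothetical helper: compares the mean and median of a distribution
describe_skew <- function(mean_value, median_value) {
  if (mean_value < median_value) {
    "mean < median: suggests a left skew (tail or outliers on the low side)"
  } else if (mean_value > median_value) {
    "mean > median: suggests a right skew (tail or outliers on the high side)"
  } else {
    "mean = median: consistent with a symmetric shape"
  }
}

# Mean and median copied from the Star Wars height summary
height_shape <- describe_skew(mean_value = 174.6, median_value = 180)
height_shape
```

For the height data the mean sits below the median, which is consistent with the short characters pulling the mean down.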
Frequency Tables
Definitions of Frequency Table Terms
Terms are essentially the same for categorical and quantitative data.
Quantitative data are subdivided into intervals (bins).
For categorical data, order of table categories is subjective if data are not ordinal.
A histogram is a visual representation of quantitative frequency data.
Bar charts and Pie charts are two common ways to represent categorical frequency data.
To submit an Engagement Question or Comment about material from Lecture 5: Submit it by midnight today (day of lecture).