Lecture 5 - Visualizing Categorical and Quantitative Data

Penelope Pooler Eisenbies
MAS 261

2023-09-11

Housekeeping

  • Today’s plan 📋

    • Review Question about Outliers

    • A few minutes for R Questions 🪄

    • Introduction to Histograms

    • Frequency Tables and Terminology

      • Frequency Tables for Quantitative Data

      • Frequency Tables for Categorical Data

    • Visualizing Frequency Data

      • Histograms for Continuous Data

      • Bar Charts and Pie Charts for Categorical data

    • In-class Exercises

Review: R and RStudio 🪄

  • Review: You have two options to facilitate your introduction to R and RStudio:

  • If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.

    • We will use Posit Cloud for Quizzes.
  • If you are nervous about coding: Choose Option 2.

  • For both options: I can help with download/install issues during office hours.

  • What I do: I maintain a Posit Cloud account for helping students, but I do most of my work on my laptop.

  • NOTE: We will use R and RStudio in class during MOST lectures

    • You can use either Posit Cloud or your laptop.

💥 Lecture 5 In-class Exercises - Q1 (Review) 💥

In lecture 4 we discussed how to visually and quantitatively identify outliers.

  1. Run the following R code to load the starwars dataset and remove missing values from the height variable.

  2. Use the summary function to find Q1 and Q3, then calculate the IQR, the lower limit (LL), and the upper limit (UL).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   66.0   167.0   180.0   174.4   191.0   264.0 
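
The code from step 1 is not reproduced in these notes. As a minimal sketch, the limits in step 2 can be worked out from the summary output above, assuming the usual 1.5 x IQR rule for the lower and upper limits. The variable names are illustrative, not the course code.

```r
# With the dplyr package installed, the heights could be loaded with:
#   heights <- na.omit(dplyr::starwars$height)
# Here the quartiles are copied from the summary() output above.
q1 <- 167
q3 <- 191

# Interquartile range
iqr <- q3 - q1

# Lower and upper limits, assuming the 1.5 x IQR rule
ll <- q1 - 1.5 * iqr
ul <- q3 + 1.5 * iqr

c(Q1 = q1, Q3 = q3, IQR = iqr, LL = ll, UL = ul)
```

Under this rule, a height below LL is a low outlier and a height above UL is a high outlier.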


Is Yoda a high or a low outlier in terms of height?

A. Yoda’s height is a high outlier.

B. Yoda’s height is a low outlier.

C. Yoda’s height is neither a high nor a low outlier.

D. We do not have enough information.

Histograms in R

  • Recall that a Histogram gives a detailed visualization of a quantitative data distribution.

    • In simple terms: each vertical bar shows how many observations fall in one interval.
  • We’ll talk more about interpreting a histogram in lecture 6 when we introduce Normal data.

  • Today, let’s look at the histogram of the Star Wars height data.

  • R has many options for displaying a histogram.

    • In MAS 261, you will run code I provide to view plots and interpret them.

Default Histogram of Star Wars Height Data

Modifying Histogram of Star Wars Height Data

  • By default, R's ggplot2 package uses 30 bins (intervals) for histograms.

  • Alternatively, we can specify the interval width:

    • Star Wars heights subdivided into 20 cm intervals (Bins)

    • Bin 1: 50.01 cm - 70 cm

    • Bin 2: 70.01 cm - 90 cm

    • Bin 3: 90.01 cm - 110 cm

    • etc.

Interval Freq
(50,70] 1
(70,90] 2
(90,110] 4
(110,130] 2
(130,150] 3
(150,170] 15
(170,190] 32
(190,210] 15
(210,230] 5
(230,250] 1
(250,270] 1
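
One way intervals like these can be produced in base R is with cut(), which assigns each value to a right-closed interval such as (50,70]. The sketch below is not the course code. Apart from 66 and 264 (the minimum and maximum in the summary output), the heights are made-up values chosen to show what happens at the bin edges.

```r
# Illustrative heights in cm (mostly hypothetical; see the note above)
heights <- c(66, 96, 170, 172, 190, 264)

# Bin edges every 20 cm from 50 to 270
breaks <- seq(50, 270, by = 20)

# cut() uses right-closed intervals by default, so 170 falls in (150,170]
bins <- cut(heights, breaks = breaks)
as.character(bins)

# table() counts the observations in every interval, including empty ones
table(bins)
```

Because the intervals are closed on the right, a height of exactly 170 cm is counted in (150,170] and not in (170,190].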

Height Histogram with 20 CM Intervals (Bins)
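
A histogram with these bins can be sketched in base R with hist(). The raw heights are not listed in these notes, so the example below uses a stand-in data set that places every observation at the midpoint of its interval. This reproduces the bin counts but not the exact heights, and the course plot may have been made with different code.

```r
# Frequencies from the 20 cm interval table
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

# Stand-in data: each observation sits at the midpoint of its interval
midpoints <- seq(60, 260, by = 20)
heights_approx <- rep(midpoints, times = freq)

# Compute the histogram with 20 cm bins, then draw it
h <- hist(heights_approx, breaks = seq(50, 270, by = 20), plot = FALSE)
plot(h, main = "Star Wars Heights (20 cm bins)", xlab = "Height (cm)")

# Bar heights match the Freq column of the table
h$counts
```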

Complete Frequency Table for Star Wars Heights Data

Interval Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
(50,70] 1 1 0.0123 0.0123 1.23 1.23
(70,90] 2 3 0.0247 0.0370 2.47 3.70
(90,110] 4 7 0.0494 0.0864 4.94 8.64
(110,130] 2 9 0.0247 0.1111 2.47 11.11
(130,150] 3 12 0.0370 0.1481 3.70 14.81
(150,170] 15 27 0.1852 0.3333 18.52 33.33
(170,190] 32 59 0.3951 0.7284 39.51 72.84
(190,210] 15 74 0.1852 0.9136 18.52 91.36
(210,230] 5 79 0.0617 0.9753 6.17 97.53
(230,250] 1 80 0.0123 0.9877 1.23 98.77
(250,270] 1 81 0.0123 1.0000 1.23 100.00

Frequency Table Definitions

Frequency (Freq.): Number of observations in each INTERVAL

Cumulative Frequency (Cum_Freq): Sum of Observations in each INTERVAL plus observations in LOWER INTERVALS.

Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each INTERVAL.

Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each INTERVAL plus relative frequencies in LOWER INTERVALS.

Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each INTERVAL.

Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each INTERVAL plus percent frequencies in LOWER INTERVALS.
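
The definitions above can be sketched in base R starting from the Freq column alone. This is an illustrative reconstruction, not the code used to produce the course table. The column names follow the table, and the rounding (4 decimals for proportions, 2 for percents) is chosen to match it.

```r
intervals <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]", "(130,150]",
               "(150,170]", "(170,190]", "(190,210]", "(210,230]",
               "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

# Total sample size
n <- sum(freq)

freq_table <- data.frame(
  Interval     = intervals,
  Freq         = freq,
  Cum_Freq     = cumsum(freq),                       # running total of counts
  Rel_Freq     = round(freq / n, 4),                 # proportion in each interval
  Cum_Rel_Freq = round(cumsum(freq) / n, 4),         # running total of proportions
  Pct_Freq     = round(100 * freq / n, 2),           # percent in each interval
  Cum_Pct_Freq = round(100 * cumsum(freq) / n, 2)    # running total of percents
)
freq_table
```

The cumulative columns are computed from the cumulative counts and not by adding the rounded values, which avoids rounding drift in the last rows.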

💥 Lecture 5 In-class Exercises - Q2 💥

What PERCENT of the Star Wars characters have a height of 130 cm or less?


HINT: Percent values are between 0 and 100.


# A tibble: 11 × 7
   Interval  Freq        Cum_Freq Rel_Freq    Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <table[1d]>    <int> <table[1d]>        <dbl> <table[>        <dbl>
 1 (50,70]    1                 1 0.0123            0.0123  1.23            1.23
 2 (70,90]    2                 3 0.0247            0.037   2.47            3.7 
 3 (90,110]   4                 7 0.0494            0.0864  4.94            8.64
 4 (110,130]  2                 9 0.0247            0.111   2.47           11.1 
 5 (130,150]  3                12 0.0370            0.148   3.70           14.8 
 6 (150,170] 15                27 0.1852            0.333  18.52           33.3 
 7 (170,190] 32                59 0.3951            0.728  39.51           72.8 
 8 (190,210] 15                74 0.1852            0.914  18.52           91.4 
 9 (210,230]  5                79 0.0617            0.975   6.17           97.5 
10 (230,250]  1                80 0.0123            0.988   1.23           98.8 
11 (250,270]  1                81 0.0123            1       1.23          100   

💥 Lecture 5 In-class Exercises - Q3 💥

What PROPORTION of the Star Wars characters have a height of 170.01 cm or more?


HINTS:

Proportion values are between 0 and 1.

For this question you will have to sum values in the correct column manually from the bottom.


# A tibble: 11 × 7
   Interval  Freq        Cum_Freq Rel_Freq    Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <table[1d]>    <int> <table[1d]>        <dbl> <table[>        <dbl>
 1 (50,70]    1                 1 0.0123            0.0123  1.23            1.23
 2 (70,90]    2                 3 0.0247            0.037   2.47            3.7 
 3 (90,110]   4                 7 0.0494            0.0864  4.94            8.64
 4 (110,130]  2                 9 0.0247            0.111   2.47           11.1 
 5 (130,150]  3                12 0.0370            0.148   3.70           14.8 
 6 (150,170] 15                27 0.1852            0.333  18.52           33.3 
 7 (170,190] 32                59 0.3951            0.728  39.51           72.8 
 8 (190,210] 15                74 0.1852            0.914  18.52           91.4 
 9 (210,230]  5                79 0.0617            0.975   6.17           97.5 
10 (230,250]  1                80 0.0123            0.988   1.23           98.8 
11 (250,270]  1                81 0.0123            1       1.23          100   

💥 Lecture 5 In-class Exercises - Q4 💥

Recall:

A numerical mode is the value that occurs most often in the data.

A distributional mode is an interval of the data where there is a large number of observations.

The Star Wars height data have a small mode and a large mode.

The large mode appears to be in the interval 150.01 to 230.


The small mode in the height histogram is the interval ___

A. 50.01 - 70

B. 70.01 - 90

C. 90.01 - 110

D. 110.01 - 130

Summarizing and Visualizing Categorical Data

  • A histogram is ONLY used for quantitative data.

    • HOWEVER the terminology for summarizing data into quantitative intervals is ALSO used for summarizing categorical data.
  • Categorical frequencies in tables are summarized from the top category down.

    • The order of categories is sometimes, but not always, intuitive; it is intuitive when the data are ordinal.
  • To visualize categorical frequency data, we typically use bar charts or pie charts.


Categorical Frequency Table - Educational Attainment

  • The next example is based on a survey of 2537 people in the United States.

  • Question Asked: What level of education have you completed?

Default Barchart for Educational Attainment Survey

Numbers above each bar show number of observations in each category.

Frequency Table for Educational Attainment Data

Categorical frequencies in tables are summarized from the top category down.

  • The order of categories is sometimes, but not always, intuitive; it is intuitive when the data are ordinal.

  • Educational attainment data are ordinal and go from the lowest education level to the highest.


Highest_Degree Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
Left high school 330 330 0.130 0.130 13.0 13.0
High school 1269 1598 0.500 0.630 50.0 63.0
Junior college 186 1786 0.073 0.704 7.3 70.4
Bachelor's 472 2258 0.186 0.890 18.6 89.0
Graduate 280 2537 0.110 1.000 11.0 100.0

Frequency Table Definitions for Categorical data

Frequency (Freq.): Number of observations in each CATEGORY

Cumulative Frequency (Cum_Freq): Sum of Observations in each CATEGORY plus observations in CATEGORIES THAT APPEAR ABOVE in the frequency table.

Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each CATEGORY.

Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each CATEGORY plus relative frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.

Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each CATEGORY.

Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each CATEGORY plus percent frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.
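
For categorical data, the order of the rows has to be chosen before the cumulative columns mean anything. As a minimal sketch in base R, a factor with explicit levels fixes that order. The responses below are hypothetical and are not the survey data; only the category names come from the table.

```r
# Hypothetical responses (not the survey data)
responses <- c("High school", "Graduate", "Left high school", "High school",
               "Bachelor's", "High school", "Junior college", "Bachelor's",
               "High school", "Graduate")

# Category order from lowest to highest education level
degree_levels <- c("Left high school", "High school", "Junior college",
                   "Bachelor's", "Graduate")

# A factor with explicit levels makes table() follow that order
degree <- factor(responses, levels = degree_levels)
freq <- as.vector(table(degree))

cat_table <- data.frame(
  Highest_Degree = degree_levels,
  Freq           = freq,
  Cum_Freq       = cumsum(freq),              # this category plus those above it
  Rel_Freq       = freq / sum(freq),
  Cum_Rel_Freq   = cumsum(freq) / sum(freq)
)
cat_table
```

Without the levels argument, R would sort the categories alphabetically, and the cumulative columns would no longer run from the lowest education level to the highest.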

💥 Lecture 5 In-class Exercises - Q5 💥

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  


What PROPORTION of survey respondents have a Bachelor’s degree or a lower level of education?


HINT: Proportion values are between 0 and 1.

💥 Lecture 5 In-class Exercises - Q6 💥

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  


What PERCENTAGE of the survey respondents have a high school degree or a higher degree?

HINTS: Percentages are between 0 and 100.

  • For this question you will have to sum values in the correct column manually from the bottom.

  • OR you can calculate 100 minus the sum of the remaining categories from the top.

💥 Lecture 5 In-class Exercises - Q7 💥

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  


Categorical data can have a mode as well; it’s the category that is most prevalent in the data.


Which response was the mode and was the answer given by half of the respondents?

A. “Left high school”

B. “High school”

C. “Junior college”

D. “Bachelor’s”

E. “Graduate”

Bar Chart vs. Pie Chart

  • Two good ways to show categorical data

  • Pie charts are better for data with fewer categories.

  • Bar charts are effective in color or in black and white.

  • In pie charts, frequencies are often replaced by percents.
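
Both chart types can be sketched in base R from the Freq and Pct_Freq columns of the educational attainment table. This is an illustration under the assumption that base graphics are acceptable; the course plots may have been made with different code and styling.

```r
degree <- c("Left high school", "High school", "Junior college",
            "Bachelor's", "Graduate")
freq <- c(330, 1269, 186, 472, 280)      # Freq column
pct  <- c(13.0, 50.0, 7.3, 18.6, 11.0)   # Pct_Freq column

# Bar chart; barplot() returns the x position of each bar
bar_x <- barplot(freq, names.arg = degree, cex.names = 0.7,
                 ylim = c(0, 1.1 * max(freq)),
                 main = "Educational Attainment", ylab = "Frequency")

# Print the number of observations above each bar
text(x = bar_x, y = freq, labels = freq, pos = 3)

# Pie chart with the frequencies replaced by percents in the labels
pie(freq, labels = paste0(degree, " (", pct, "%)"),
    main = "Educational Attainment")
```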

Bar Chart vs. Histogram

  • Bar charts are used for CATEGORICAL data

  • Histograms show the distribution of Quantitative Data

Histograms of Different Distributions

Histograms are an effective tool for examining the distribution of the data.

LEFT SKEWED

Tail pulled out to LEFT

Low outliers

e.g. Human Lifespan

NORMAL/SYMMETRIC

Data appear in a symmetric bell-shaped curve

No graphic evidence of outliers

e.g. Test scores

RIGHT SKEWED

Tail pulled out to RIGHT

High outliers

e.g. Movie Gross values
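
The three shapes can be illustrated with simulated data. The sketch below does not use any course data set: the distributions, sample size, and seed are arbitrary choices made only to produce one example of each shape.

```r
set.seed(261)  # arbitrary seed so the example is reproducible

right_skewed <- rexp(1000, rate = 1)            # long tail to the right
left_skewed  <- 10 - rexp(1000, rate = 1)       # mirror image: long tail to the left
symmetric    <- rnorm(1000, mean = 5, sd = 1)   # bell-shaped

# Draw the three histograms side by side
par(mfrow = c(1, 3))
hist(left_skewed,  main = "Left skewed",      xlab = "Value")
hist(symmetric,    main = "Normal/symmetric", xlab = "Value")
hist(right_skewed, main = "Right skewed",     xlab = "Value")
par(mfrow = c(1, 1))

# Distance from the median to each end of the data
tail_lengths <- function(x) {
  c(left = median(x) - min(x), right = max(x) - median(x))
}
tail_lengths(left_skewed)
tail_lengths(right_skewed)
```

In the skewed samples, the longer of the two distances points in the direction of the tail.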

Histogram of Movie Data from Lecture 4

Key Points from Today

  • Frequency Tables

    • Definitions of Frequency Table Terms

    • Terms are essentially the same for categorical and quantitative data.

      • Quantitative data are subdivided into intervals (bins)

      • For categorical data, order of table categories is subjective if data are not ordinal.

    • A histogram is a visual representation of quantitative frequency data.

      • A histogram CANNOT BE CREATED for categorical data.
    • Bar charts and Pie charts are two common ways to represent categorical frequency data.


To submit an Engagement Question or Comment about material from Lecture 5: Submit by midnight today (day of lecture). Click on the link under Lecture 5.