Visualizing Categorical and Quantitative Data

Author

Penelope Pooler Eisenbies

Published

September 14, 2024

Housekeeping

  • Today’s plan

    • Review Question about Outliers

    • A few minutes for R Questions 🪄

    • Introduction to Histograms

    • Frequency Tables and Terminology

      • Frequency Tables for Quantitative Data

      • Frequency Tables for Categorical Data

    • Visualizing Frequency Data

      • Histograms for Continuous Data

      • Bar Charts and Pie Charts for Categorical Data

    • In-class Exercises

R and RStudio

  • In this course we will use R and RStudio to understand statistical concepts.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access through the provided links.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I will demo how to download completed work so that you can use this allotment efficiently.

    • For those who want to go further with R/RStudio:

      • After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 5 In-class Exercises - Q1

Session ID: MAS261f24

In lecture 4 we discussed how to visually and quantitatively identify outliers.

  1. Run the following R code to load the starwars dataset and remove missing values from the height variable.
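```{r load starwars, echo=T}
library(dplyr)  # provides starwars, select(), and filter()

# Keep three columns and drop rows with a missing height
my_starwars <- starwars |>
  dplyr::select(name, species, height) |>
  filter(!is.na(height))
```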
  2. Use the summary function and additional calculations to find Q1, Q3, IQR, LL, and UL.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   66.0   167.0   180.0   174.6   191.0   264.0 

Is Yoda a high or a low outlier in terms of height?

A. Yoda’s height is a high outlier.

B. Yoda’s height is a low outlier.

C. Yoda’s height is not a high or low outlier.

D. We do not have enough information.
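
The additional calculations mentioned above can be sketched as follows. This assumes LL and UL are the lower and upper limits from the 1.5 × IQR rule (an assumption here; use the rule exactly as defined in Lecture 4). The quartiles are typed in from the summary output, and the variable names are illustrative.

```r
# Quartiles copied from the summary() output above
q1 <- 167
q3 <- 191

# Interquartile range
iqr <- q3 - q1

# Lower and upper limits, ASSUMING the 1.5 x IQR rule
ll <- q1 - 1.5 * iqr
ul <- q3 + 1.5 * iqr

c(Q1 = q1, Q3 = q3, IQR = iqr, LL = ll, UL = ul)
```

A height below LL would be a low outlier and a height above UL would be a high outlier, so the question comes down to comparing Yoda’s height with these two limits.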

Histograms in R

  • Recall that a Histogram gives a detailed visualization of a quantitative data distribution.

    • In simple terms: the plot shows how many observations are in each interval or vertical bar.
  • We’ll talk more about interpreting a histogram in lecture 6 when we introduce Normal data.

  • Today, let’s look at the histogram of the Star Wars height data.

  • R has many options for displaying a histogram.

    • In MAS 261, you will run code I provide to view plots and interpret them.

Default Histogram of Star Wars Height Data
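
A minimal sketch of how a default histogram of the height data could be drawn, assuming the `dplyr` and `ggplot2` packages are installed. The axis labels and the object name `default_hist` are illustrative choices, not necessarily those used for the lecture figure.

```r
library(dplyr)    # starwars data, select(), filter()
library(ggplot2)  # plotting

my_starwars <- starwars |>
  select(name, species, height) |>
  filter(!is.na(height))

# No bins or binwidth is given, so ggplot2 falls back to its default of 30 bins
default_hist <- ggplot(my_starwars, aes(x = height)) +
  geom_histogram(color = "white") +
  labs(x = "Height (cm)", y = "Number of characters")

default_hist
```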

Modifying the Histogram of Star Wars Height Data

  • By default, ggplot2 divides a histogram into 30 bins (intervals).

  • Alternatively, we can specify the interval width:

    • Star Wars heights subdivided into 20 cm intervals (Bins)

    • Bin 1: 50.01 cm - 70 cm

    • Bin 2: 70.01 cm - 90 cm

    • Bin 3: 90.01 cm - 110 cm

    • etc.

Interval   Freq
(50,70]       1
(70,90]       2
(90,110]      4
(110,130]     2
(130,150]     3
(150,170]    15
(170,190]    32
(190,210]    15
(210,230]     5
(230,250]     1
(250,270]     1
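
Interval labels such as (150,170] are what base R's `cut()` produces. The sketch below shows the idea on a small set of made-up heights (not the full Star Wars data); the vector `heights` and its values are purely illustrative.

```r
# Illustrative heights in cm (NOT the full Star Wars data)
heights <- c(66, 96, 150, 167, 170, 172, 188, 202, 264)

# Break points every 20 cm from 50 to 270
breaks <- seq(50, 270, by = 20)

# cut() assigns each height to an interval; by default an interval
# excludes its lower edge and includes its upper edge, e.g. (150,170]
bins <- cut(heights, breaks = breaks)

# Count the observations in each interval
counts <- table(bins)
counts
```

Note that a height of exactly 170 falls in (150,170], not (170,190], which is why the bins are described as starting at 50.01, 70.01, and so on.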

Height Histogram with 20 cm Intervals (Bins)
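
A histogram with 20 cm bins could be produced along these lines, assuming `dplyr` and `ggplot2` are installed. The `boundary = 50` setting is an assumption chosen so that the bin edges match the intervals listed above; the lecture's own plotting code is not shown here.

```r
library(dplyr)    # starwars data, select(), filter()
library(ggplot2)  # plotting

my_starwars <- starwars |>
  select(name, species, height) |>
  filter(!is.na(height))

# binwidth sets the interval width; boundary anchors a bin edge at 50 so the
# bins are (50,70], (70,90], ...; closed = "right" includes the upper edge
hist_20cm <- ggplot(my_starwars, aes(x = height)) +
  geom_histogram(binwidth = 20, boundary = 50, closed = "right",
                 color = "white") +
  labs(x = "Height (cm)", y = "Number of characters")

hist_20cm
```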

Frequency Table for Star Wars Heights Data

Interval   Freq  Cum_Freq  Rel_Freq  Cum_Rel_Freq  Pct_Freq  Cum_Pct_Freq
(50,70]       1         1    0.0123        0.0123      1.23          1.23
(70,90]       2         3    0.0247        0.0370      2.47          3.70
(90,110]      4         7    0.0494        0.0864      4.94          8.64
(110,130]     2         9    0.0247        0.1111      2.47         11.11
(130,150]     3        12    0.0370        0.1481      3.70         14.81
(150,170]    15        27    0.1852        0.3333     18.52         33.33
(170,190]    32        59    0.3951        0.7284     39.51         72.84
(190,210]    15        74    0.1852        0.9136     18.52         91.36
(210,230]     5        79    0.0617        0.9753      6.17         97.53
(230,250]     1        80    0.0123        0.9877      1.23         98.77
(250,270]     1        81    0.0123        1.0000      1.23        100.00

Frequency Table Definitions

Frequency (Freq): Number of observations in each INTERVAL.

Cumulative Frequency (Cum_Freq): Number of observations in each INTERVAL plus the observations in all LOWER INTERVALS.

Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each INTERVAL.

Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each INTERVAL plus relative frequencies in LOWER INTERVALS.

Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each INTERVAL.

Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each INTERVAL plus percent frequencies in LOWER INTERVALS.
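
The definitions above translate directly into a few lines of base R. This is a sketch that starts from the interval counts in the Star Wars table; the object names are illustrative, and the rounding (4 and 2 decimal places) is chosen to match the printed table.

```r
# Interval labels and counts copied from the Star Wars frequency table
interval <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]", "(130,150]",
              "(150,170]", "(170,190]", "(190,210]", "(210,230]",
              "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

n <- sum(freq)  # total sample size

freq_table <- data.frame(
  Interval     = interval,
  Freq         = freq,
  Cum_Freq     = cumsum(freq),                      # running total of counts
  Rel_Freq     = round(freq / n, 4),                # proportion per interval
  Cum_Rel_Freq = round(cumsum(freq) / n, 4),        # running proportion
  Pct_Freq     = round(100 * freq / n, 2),          # percent per interval
  Cum_Pct_Freq = round(100 * cumsum(freq) / n, 2)   # running percent
)

freq_table
```

Each cumulative column is a running total (`cumsum()`) of the column to its left, so the last row always ends at the sample size, 1, or 100.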

Lecture 5 In-class Exercises - Q2

What PERCENT of the Star Wars characters have a height of 130 cm or less?

HINT: Percent values are between 0 and 100.

# A tibble: 11 × 7
   Interval   Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
 1 (50,70]       1        1   0.0123       0.0123     1.23         1.23
 2 (70,90]       2        3   0.0247       0.037      2.47         3.7 
 3 (90,110]      4        7   0.0494       0.0864     4.94         8.64
 4 (110,130]     2        9   0.0247       0.111      2.47        11.1 
 5 (130,150]     3       12   0.037        0.148      3.7         14.8 
 6 (150,170]    15       27   0.185        0.333     18.5         33.3 
 7 (170,190]    32       59   0.395        0.728     39.5         72.8 
 8 (190,210]    15       74   0.185        0.914     18.5         91.4 
 9 (210,230]     5       79   0.0617       0.975      6.17        97.5 
10 (230,250]     1       80   0.0123       0.988      1.23        98.8 
11 (250,270]     1       81   0.0123       1          1.23       100   

Lecture 5 In-class Exercises - Q3

What PROPORTION of the Star Wars characters have a height of 170.01 cm or more?

HINTS:

Proportion values are between 0 and 1.

For this question you will have to sum values in the correct column manually from the bottom.

# A tibble: 11 × 7
   Interval   Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
 1 (50,70]       1        1   0.0123       0.0123     1.23         1.23
 2 (70,90]       2        3   0.0247       0.037      2.47         3.7 
 3 (90,110]      4        7   0.0494       0.0864     4.94         8.64
 4 (110,130]     2        9   0.0247       0.111      2.47        11.1 
 5 (130,150]     3       12   0.037        0.148      3.7         14.8 
 6 (150,170]    15       27   0.185        0.333     18.5         33.3 
 7 (170,190]    32       59   0.395        0.728     39.5         72.8 
 8 (190,210]    15       74   0.185        0.914     18.5         91.4 
 9 (210,230]     5       79   0.0617       0.975      6.17        97.5 
10 (230,250]     1       80   0.0123       0.988      1.23        98.8 
11 (250,270]     1       81   0.0123       1          1.23       100   
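
Summing from the bottom of the table gives the same result as subtracting a cumulative value from 1. The sketch below checks this in base R for a different cutoff than the question (heights above 210 cm), so the exercise itself is left open; the counts are copied from the table and the variable names are illustrative.

```r
# Counts copied from the Star Wars frequency table
freq  <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)
upper <- seq(70, 270, by = 20)  # upper edge of each interval
n     <- sum(freq)

# Approach 1: add the relative frequencies from the bottom of the table
from_bottom <- sum(freq[upper > 210]) / n

# Approach 2: 1 minus the cumulative relative frequency at 210 cm
complement <- 1 - sum(freq[upper <= 210]) / n

round(c(from_bottom = from_bottom, complement = complement), 4)
```

With the rounded values printed in the table, the two approaches can differ in the last decimal place; computing from the counts avoids that.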

Lecture 5 In-class Exercises - Q4

Recall:

A numerical mode is the value that occurs most often in the data.

A distributional mode is an interval of the data where there is a large number of observations.

The Star Wars height data has a small mode and a large mode.

The large mode appears to be in the interval 150.01 to 230.


The small mode in the height histogram is the interval ___.

A. 50.01 - 70

B. 70.01 - 90

C. 90.01 - 110

D. 110.01 - 130

Summarizing and Visualizing Categorical Data

  • A histogram is ONLY used for quantitative data.

    • HOWEVER, the terminology for summarizing quantitative data into intervals is ALSO used for summarizing categorical data.
  • Categorical frequencies in tables are summarized from the top category down.

    • The order of categories is intuitive when the data are ordinal; otherwise, the order is subjective.
  • To visualize categorical frequency data, we typically use bar charts or pie charts.


Categorical Frequency Table - Educational Attainment

  • The next example is based on a survey of 2537 people in the United States.

  • Question Asked: What level of education have you completed?

Barchart for Educational Attainment Survey

Numbers above each bar show number of observations in each category.
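
A bar chart with the count printed above each bar could be sketched as below, assuming `ggplot2` is installed. The counts are copied from the frequency table for this survey; the fill color, axis labels, and object names are illustrative, and this is not necessarily the code behind the lecture figure.

```r
library(ggplot2)

# Counts copied from the Educational Attainment frequency table
educ <- data.frame(
  Highest_Degree = c("Left high school", "High school", "Junior college",
                     "Bachelor's", "Graduate"),
  Freq = c(330, 1269, 186, 472, 280)
)

# Fix the ordinal order; otherwise ggplot2 sorts the categories alphabetically
educ$Highest_Degree <- factor(educ$Highest_Degree,
                              levels = educ$Highest_Degree)

educ_bar <- ggplot(educ, aes(x = Highest_Degree, y = Freq)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = Freq), vjust = -0.5) +  # counts above each bar
  labs(x = "Highest degree completed", y = "Number of respondents")

educ_bar
```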

Frequency Table for Educational Attainment Data

Categorical frequencies are summarized from the top category down.

  • The order of categories is intuitive when the data are ordinal; otherwise, the order is subjective.

  • Educational attainment data are ordinal and go from the lowest education level to the highest.

Highest_Degree    Freq  Cum_Freq  Rel_Freq  Cum_Rel_Freq  Pct_Freq  Cum_Pct_Freq
Left high school   330       330     0.130         0.130      13.0          13.0
High school       1269      1598     0.500         0.630      50.0          63.0
Junior college     186      1786     0.073         0.704       7.3          70.4
Bachelor’s         472      2258     0.186         0.890      18.6          89.0
Graduate           280      2537     0.110         1.000      11.0         100.0

Frequency Table Definitions for Categorical data

Frequency (Freq): Number of observations in each CATEGORY.

Cumulative Frequency (Cum_Freq): Number of observations in each CATEGORY plus the observations in all CATEGORIES THAT APPEAR ABOVE it in the frequency table.

Relative Frequency (Rel_Freq): Frequency/Total sample Size; values sum to 1 and indicate proportion of data in each CATEGORY.

Cumulative Relative Frequency (Cum_Rel_Freq): Sum of relative frequencies in each CATEGORY plus relative frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.

Percent Frequency (Pct_Freq): Relative Frequency x 100%; values sum to 100 and indicate percent of data in each CATEGORY.

Cumulative Percent Frequency (Cum_Pct_Freq): Sum of percent frequencies in each CATEGORY plus percent frequencies in CATEGORIES THAT APPEAR ABOVE in the frequency table.

Lecture 5 In-class Exercises - Q5

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  


What PROPORTION of survey respondents have a Bachelor’s degree or a lower level of education?


HINT: Proportion values are between 0 and 1.

Lecture 5 In-class Exercises - Q6

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  


What PERCENTAGE of the survey respondents have a high school degree or a higher degree?

HINTS: Percentages are between 0 and 100.

  • For this question you will have to sum values in the correct column manually from the bottom.

  • OR you can calculate 100 minus the sum of the remaining categories from the top.

Lecture 5 In-class Exercises - Q7

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1598    0.5          0.63      50           63  
3 Junior college     186     1786    0.073        0.704      7.3         70.4
4 Bachelor's         472     2258    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100  

Categorical data can have a mode as well; it’s the category that is most prevalent in the data.

Which response in the education level data is the mode, the answer given by about half of the respondents?

A. Left high school

B. High school

C. Junior college

D. Bachelor’s

E. Graduate

Bar Chart vs. Pie Chart

  • Two good ways to show categorical data

  • Pie charts are better for data with fewer categories.

  • Bar charts are effective in color or in black and white.

  • In pie charts, frequencies are often replaced by percents.
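
A pie chart labeled with percents could be sketched as follows, assuming `ggplot2` is installed. ggplot2 has no dedicated pie geom, so this uses a common workaround (a single stacked bar drawn in polar coordinates); it is an illustration with counts copied from the Educational Attainment table, not necessarily how the lecture figure was made.

```r
library(ggplot2)

# Counts copied from the Educational Attainment frequency table
educ <- data.frame(
  Highest_Degree = c("Left high school", "High school", "Junior college",
                     "Bachelor's", "Graduate"),
  Freq = c(330, 1269, 186, 472, 280)
)
educ$Highest_Degree <- factor(educ$Highest_Degree,
                              levels = educ$Highest_Degree)

# Replace frequencies with percents for the slice labels
educ$Pct <- round(100 * educ$Freq / sum(educ$Freq), 1)

educ_pie <- ggplot(educ, aes(x = "", y = Freq, fill = Highest_Degree)) +
  geom_col(width = 1, color = "white") +      # one stacked bar
  coord_polar(theta = "y") +                  # wrap the bar into a circle
  geom_text(aes(label = paste0(Pct, "%")),
            position = position_stack(vjust = 0.5)) +
  theme_void() +
  labs(fill = "Highest degree")

educ_pie
```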


Bar Chart vs. Histogram

  • Bar charts are used for CATEGORICAL data

  • Histograms show the distribution of Quantitative Data

Histograms of Different Distributions

Histograms are an effective tool for examining the distribution of the data.

LEFT SKEWED

Tail pulled out to LEFT

Low outliers

e.g. Human Lifespan

NORMAL/SYMMETRIC

Data appear in a symmetric bell-shaped curve

No graphic evidence of outliers

e.g. Test scores

RIGHT SKEWED

Tail pulled out to RIGHT

High outliers

e.g. Movie Gross values
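
The three shapes can be reproduced with simulated data. The sketch below uses base R only; the distributions, sample size, and seed are arbitrary choices made for illustration and are not the data behind the lecture's figures.

```r
set.seed(261)  # arbitrary seed so the sketch is reproducible

# Simulated data, for illustration only
right_skewed <- rexp(500, rate = 1)           # long tail to the right
symmetric    <- rnorm(500, mean = 0, sd = 1)  # bell shaped
left_skewed  <- 10 - rexp(500, rate = 1)      # mirror image: tail to the left

old_par <- par(mfrow = c(1, 3))  # three panels side by side
hist(left_skewed,  main = "Left skewed",      xlab = "Value")
hist(symmetric,    main = "Normal/symmetric", xlab = "Value")
hist(right_skewed, main = "Right skewed",     xlab = "Value")
par(old_par)                     # restore the previous plotting layout
```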

Histogram of Movie Data from Lecture 4


Key Points from Today

  • Frequency Tables

    • Definitions of Frequency Table Terms

    • Terms are essentially the same for categorical and quantitative data.

      • Quantitative data are subdivided into intervals (bins)

      • For categorical data, order of table categories is subjective if data are not ordinal.

    • A histogram is a visual representation of quantitative frequency data.

      • A histogram CANNOT BE CREATED for categorical data.
    • Bar charts and Pie charts are two common ways to represent categorical frequency data.

To submit an Engagement Question or Comment about material from Lecture 5: Submit it by midnight today (day of lecture).