MAS 261 - Lecture 5

Visualizing Categorical and Quantitative Data

Author

Penelope Pooler Eisenbies

Published

September 14, 2024

Housekeeping

Today’s plan
- Review Question about Outliers
- A few minutes for R Questions
- Introduction to Histograms
- Frequency Tables and Terminology
  - Frequency Tables for Quantitative Data
  - Frequency Tables for Categorical Data
- Visualizing Frequency Data
  - Histograms for Continuous Data
  - Bar Charts and Pie Charts for Categorical Data
- In-class Exercises

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through the links provided.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 5 In-class Exercises - Q1

Session ID: MAS261f24

In Lecture 4, we discussed how to identify outliers visually and quantitatively.

Run the following R code to load the starwars dataset and remove missing values from the height variable.

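```{r load starwars, echo=T}
library(dplyr)  # provides the starwars data, select(), and filter()
# Keep three columns and drop characters with a missing height
my_starwars <- starwars |>
  dplyr::select(name, species, height) |>
  filter(!is.na(height))
```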

Use the summary function and additional calculations to find Q1, Q3, the IQR, and the lower and upper limits for outliers (LL and UL).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   66.0   167.0   180.0   174.6   191.0   264.0
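
The "additional calculations" can be sketched as follows, assuming the 1.5 × IQR rule for outlier limits from Lecture 4 and copying Q1 and Q3 by hand from the summary output above. The variable names are illustrative only.

```r
# Quartiles read off the summary() output above
q1 <- 167   # 1st Qu.
q3 <- 191   # 3rd Qu.

iqr <- q3 - q1          # interquartile range
ll  <- q1 - 1.5 * iqr   # lower limit: heights below this are low outliers
ul  <- q3 + 1.5 * iqr   # upper limit: heights above this are high outliers

c(IQR = iqr, LL = ll, UL = ul)
```

To answer the question that follows, look up Yoda's height in `my_starwars` and compare it with LL and UL.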

Is Yoda a high or a low outlier in terms of height?

A. Yoda’s height is a high outlier.

B. Yoda’s height is a low outlier.

C. Yoda’s height is neither a high nor a low outlier.

D. We do not have enough information.

Histograms in R

Recall that a Histogram gives a detailed visualization of a quantitative data distribution.
- In simple terms: the plot shows how many observations are in each interval or vertical bar.
We’ll talk more about interpreting a histogram in Lecture 6 when we introduce Normal data.
Today, let’s look at the histogram of the Star Wars height data.
R has many options for displaying a histogram.
- In MAS 261, you will run code I provide to view plots and interpret them.

Default Histogram of Star Wars Height Data

Modifying the Histogram of Star Wars Height Data

By default, R's ggplot2 package draws histograms with 30 bins (intervals).
Alternatively, we can specify the interval width:
- Star Wars heights subdivided into 20 cm intervals (Bins)
- Bin 1: 50.01 cm - 70 cm
- Bin 2: 70.01 cm - 90 cm
- Bin 3: 90.01 cm - 110 cm
- etc.

Interval	Freq
(50,70]	1
(70,90]	2
(90,110]	4
(110,130]	2
(130,150]	3
(150,170]	15
(170,190]	32
(190,210]	15
(210,230]	5
(230,250]	1
(250,270]	1
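
One way to assign heights to 20 cm bins is base R's `cut()` followed by `table()`. The sketch below uses a short made-up vector of heights (an assumption for illustration, not the course data) so the bin assignments are easy to check by hand.

```r
# Hypothetical heights in cm (illustration only, not the starwars data)
heights <- c(66, 96, 150, 167, 172, 178, 202, 228)

# Break points every 20 cm from 50 to 270; each bin is open on the left
# and closed on the right, e.g. (50,70] holds heights above 50 up to 70
bins <- cut(heights, breaks = seq(50, 270, 20))

# Count the observations in each bin (empty bins show a count of 0)
freq <- table(bins)
freq
```

Note that a height of exactly 150 falls in (130,150], not (150,170], because the right endpoint of each interval is included.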

Height Histogram with 20 cm Intervals (Bins)
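
A histogram with 20 cm bins might be drawn along these lines. This is a sketch assuming the dplyr and ggplot2 packages are installed; the object names `heights` and `sw_hist_20` and the `boundary = 50` setting are choices made here for illustration, not necessarily what the lecture plot uses.

```r
library(dplyr)    # starwars data and filter()
library(ggplot2)  # plotting

# Drop characters with a missing height
heights <- starwars |>
  filter(!is.na(height))

# binwidth sets the interval width in cm; boundary = 50 places the bin
# edges at 50, 70, 90, ... so the bars line up with 20 cm intervals
sw_hist_20 <- ggplot(heights) +
  geom_histogram(aes(x = height), binwidth = 20, boundary = 50,
                 color = "darkblue", fill = "lightblue") +
  scale_x_continuous(breaks = seq(50, 270, 20)) +
  theme_classic() +
  labs(x = "Height (cm)", y = "Frequency",
       title = "Distribution of Heights of Star Wars Characters")

sw_hist_20
```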

Frequency Table for Star Wars Heights Data

Interval	Freq	Cum_Freq	Rel_Freq	Cum_Rel_Freq	Pct_Freq	Cum_Pct_Freq
(50,70]	1	1	0.0123	0.0123	1.23	1.23
(70,90]	2	3	0.0247	0.0370	2.47	3.70
(90,110]	4	7	0.0494	0.0864	4.94	8.64
(110,130]	2	9	0.0247	0.1111	2.47	11.11
(130,150]	3	12	0.0370	0.1481	3.70	14.81
(150,170]	15	27	0.1852	0.3333	18.52	33.33
(170,190]	32	59	0.3951	0.7284	39.51	72.84
(190,210]	15	74	0.1852	0.9136	18.52	91.36
(210,230]	5	79	0.0617	0.9753	6.17	97.53
(230,250]	1	80	0.0123	0.9877	1.23	98.77
(250,270]	1	81	0.0123	1.0000	1.23	100.00

Frequency Table Definitions

Frequency (Freq): Number of observations in each INTERVAL.

Cumulative Frequency (Cum_Freq): Frequency of an INTERVAL plus the frequencies of all LOWER INTERVALS.

Relative Frequency (Rel_Freq): Frequency / total sample size; values sum to 1 and indicate the proportion of the data in each INTERVAL.

Cumulative Relative Frequency (Cum_Rel_Freq): Relative frequency of an INTERVAL plus the relative frequencies of all LOWER INTERVALS.

Percent Frequency (Pct_Freq): Relative Frequency x 100; values sum to 100 and indicate the percent of the data in each INTERVAL.

Cumulative Percent Frequency (Cum_Pct_Freq): Percent frequency of an INTERVAL plus the percent frequencies of all LOWER INTERVALS.
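
The definitions above translate directly into a few lines of R. The sketch below starts from the Freq column of the Star Wars height table and rebuilds the other columns; the object names are illustrative, and the rounding (4 decimals for proportions, 2 for percents) is chosen to match the table.

```r
# Frequencies for the eleven 20 cm height intervals, copied from the table
interval <- c("(50,70]", "(70,90]", "(90,110]", "(110,130]", "(130,150]",
              "(150,170]", "(170,190]", "(190,210]", "(210,230]",
              "(230,250]", "(250,270]")
freq <- c(1, 2, 4, 2, 3, 15, 32, 15, 5, 1, 1)

n <- sum(freq)  # total sample size

freq_table <- data.frame(
  Interval     = interval,
  Freq         = freq,
  Cum_Freq     = cumsum(freq),                      # running total of counts
  Rel_Freq     = round(freq / n, 4),                # proportion per interval
  Cum_Rel_Freq = round(cumsum(freq) / n, 4),        # running proportion
  Pct_Freq     = round(100 * freq / n, 2),          # percent per interval
  Cum_Pct_Freq = round(100 * cumsum(freq) / n, 2)   # running percent
)
freq_table
```

Computing the cumulative columns from the unrounded counts, and rounding only at the end, avoids accumulating rounding error down the table.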

Lecture 5 In-class Exercises - Q2

Session ID: MAS261f24

What PERCENT of the Star Wars characters have a height of 130 cm or less?

HINT: Percent values are between 0 and 100.

# A tibble: 11 × 7
   Interval   Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
 1 (50,70]       1        1   0.0123       0.0123     1.23         1.23
 2 (70,90]       2        3   0.0247       0.037      2.47         3.7 
 3 (90,110]      4        7   0.0494       0.0864     4.94         8.64
 4 (110,130]     2        9   0.0247       0.111      2.47        11.1 
 5 (130,150]     3       12   0.037        0.148      3.7         14.8 
 6 (150,170]    15       27   0.185        0.333     18.5         33.3 
 7 (170,190]    32       59   0.395        0.728     39.5         72.8 
 8 (190,210]    15       74   0.185        0.914     18.5         91.4 
 9 (210,230]     5       79   0.0617       0.975      6.17        97.5 
10 (230,250]     1       80   0.0123       0.988      1.23        98.8 
11 (250,270]     1       81   0.0123       1          1.23       100

Lecture 5 In-class Exercises - Q3

Session ID: MAS261f24

What PROPORTION of the Star Wars characters have a height of 170.01 cm or more?

HINTS:

Proportion values are between 0 and 1.

For this question you will have to sum values in the correct column manually from the bottom.

# A tibble: 11 × 7
   Interval   Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
   <chr>     <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
 1 (50,70]       1        1   0.0123       0.0123     1.23         1.23
 2 (70,90]       2        3   0.0247       0.037      2.47         3.7 
 3 (90,110]      4        7   0.0494       0.0864     4.94         8.64
 4 (110,130]     2        9   0.0247       0.111      2.47        11.1 
 5 (130,150]     3       12   0.037        0.148      3.7         14.8 
 6 (150,170]    15       27   0.185        0.333     18.5         33.3 
 7 (170,190]    32       59   0.395        0.728     39.5         72.8 
 8 (190,210]    15       74   0.185        0.914     18.5         91.4 
 9 (210,230]     5       79   0.0617       0.975      6.17        97.5 
10 (230,250]     1       80   0.0123       0.988      1.23        98.8 
11 (250,270]     1       81   0.0123       1          1.23       100

Lecture 5 In-class Exercises - Q4

Session ID: MAS261f24

Recall:

A numerical mode is the value that occurs most often in the data.

A distributional mode is an interval of the data where there is a large number of observations.
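
As a small illustration of a numerical mode, the sketch below counts how often each value occurs in a made-up vector (hypothetical values, not the course data) and picks the most frequent one.

```r
# Hypothetical heights in cm (illustration only)
values <- c(180, 175, 180, 183, 170, 180, 183)

counts <- table(values)   # how often each distinct value occurs

# which.max() returns the position of the largest count; if several values
# tie for the largest count, only the first one is returned
numerical_mode <- as.numeric(names(counts)[which.max(counts)])
numerical_mode
```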

The Star Wars height data has a small mode and a large mode.

The large mode appears to be in the interval 150.01 to 230.

The small mode in the height histogram is the interval ___.

A. 50.01 - 70

B. 70.01 - 90

C. 90.01 - 110

D. 110.01 - 130

Summarizing and Visualizing Categorical Data

A histogram is ONLY used for quantitative data.
- HOWEVER the terminology for summarizing data into quantitative intervals is ALSO used for summarizing categorical data.
Categorical frequencies in tables are summarized from the top category down.
- The order of categories is intuitive when the data are ordinal, but not all categorical data are ordinal.
To visualize categorical frequency data, we typically use bar charts or pie charts.

Categorical Frequency Table - Educational Attainment

The next example is based on a survey of 2537 people in the United States.
Question Asked: What level of education have you completed?

Bar Chart for Educational Attainment Survey

The numbers above each bar show the number of observations in each category.

Frequency Table for Educational Attainment Data

Categorical frequencies are summarized from the top category down.

The order of categories is intuitive when the data are ordinal, but not all categorical data are ordinal.
Educational attainment data are ordinal and go from the lowest education level to the highest.

Highest_Degree	Freq	Cum_Freq	Rel_Freq	Cum_Rel_Freq	Pct_Freq	Cum_Pct_Freq
Left high school	330	330	0.130	0.130	13.0	13.0
High school	1269	1599	0.500	0.630	50.0	63.0
Junior college	186	1785	0.073	0.704	7.3	70.4
Bachelor’s	472	2257	0.186	0.890	18.6	89.0
Graduate	280	2537	0.110	1.000	11.0	100.0

Frequency Table Definitions for Categorical data

Frequency (Freq): Number of observations in each CATEGORY.

Cumulative Frequency (Cum_Freq): Frequency of a CATEGORY plus the frequencies of all CATEGORIES THAT APPEAR ABOVE it in the frequency table.

Relative Frequency (Rel_Freq): Frequency / total sample size; values sum to 1 and indicate the proportion of the data in each CATEGORY.

Cumulative Relative Frequency (Cum_Rel_Freq): Relative frequency of a CATEGORY plus the relative frequencies of all CATEGORIES THAT APPEAR ABOVE it in the frequency table.

Percent Frequency (Pct_Freq): Relative Frequency x 100; values sum to 100 and indicate the percent of the data in each CATEGORY.

Cumulative Percent Frequency (Cum_Pct_Freq): Percent frequency of a CATEGORY plus the percent frequencies of all CATEGORIES THAT APPEAR ABOVE it in the frequency table.
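
For ordinal categories, the row order of the table comes from the order of the factor levels. The sketch below uses a handful of made-up survey responses (hypothetical, not the survey data) to show how setting the levels controls which categories count as "above" when cumulative frequencies are computed.

```r
# Category order from lowest to highest education level
degree_levels <- c("Left high school", "High school", "Junior college",
                   "Bachelor's", "Graduate")

# Hypothetical survey responses (illustration only, not the survey data)
responses <- c("High school", "Graduate", "High school", "Left high school",
               "Bachelor's", "High school", "Junior college", "Bachelor's")

# levels = ... fixes the row order of the table; without it, R would
# sort the categories alphabetically
degree <- factor(responses, levels = degree_levels)

freq     <- table(degree)   # frequency of each category, in level order
cum_freq <- cumsum(freq)    # running total from the top category down

data.frame(Highest_Degree = names(freq),
           Freq           = as.vector(freq),
           Cum_Freq       = as.vector(cum_freq))
```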

Lecture 5 In-class Exercises - Q5

Session ID: MAS261f24

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1599    0.5          0.63      50           63  
3 Junior college     186     1785    0.073        0.704      7.3         70.4
4 Bachelor's         472     2257    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100

What PROPORTION of survey respondents have a Bachelor’s degree or a lower level of education?

HINT: Proportion values are between 0 and 1.

Lecture 5 In-class Exercises - Q6

Session ID: MAS261f24

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1599    0.5          0.63      50           63  
3 Junior college     186     1785    0.073        0.704      7.3         70.4
4 Bachelor's         472     2257    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100

What PERCENTAGE of the survey respondents have a high school degree or a higher degree?

HINTS: Percentages are between 0 and 100.

For this question you will have to sum values in the correct column manually from the bottom.
OR you can calculate 100 minus the sum of the percent frequencies of the remaining categories at the top of the table.
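
The two approaches in the hint give the same result. The sketch below checks this on a made-up table of percent frequencies (hypothetical categories and values, not the survey data).

```r
# Hypothetical percent frequencies for four ordered categories
pct_freq <- c(Low = 10, Medium = 20, High = 30, Highest = 40)

# Percent in "High" or above, summed from the bottom of the table
from_bottom <- sum(pct_freq[c("High", "Highest")])

# Same quantity as 100 minus the categories that appear above "High"
from_top <- 100 - sum(pct_freq[c("Low", "Medium")])

c(from_bottom = from_bottom, from_top = from_top)
```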

Lecture 5 In-class Exercises - Q7

Session ID: MAS261f24

# A tibble: 5 × 7
  Highest_Degree    Freq Cum_Freq Rel_Freq Cum_Rel_Freq Pct_Freq Cum_Pct_Freq
  <chr>            <dbl>    <dbl>    <dbl>        <dbl>    <dbl>        <dbl>
1 Left high school   330      330    0.13         0.13      13           13  
2 High school       1269     1599    0.5          0.63      50           63  
3 Junior college     186     1785    0.073        0.704      7.3         70.4
4 Bachelor's         472     2257    0.186        0.89      18.6         89  
5 Graduate           280     2537    0.11         1         11          100

Categorical data can have a mode as well; it’s the category that is most prevalent in the data.

Which response in the education level data is the mode and is the answer given by half of the respondents?

A. Left high school

B. High school

C. Junior college

D. Bachelor’s

E. Graduate

Bar Chart vs. Pie Chart

Bar charts and pie charts are two good ways to show categorical data.
Pie charts work better for data with few categories.
Bar charts are effective in color or in black and white.
In pie charts, frequencies are often replaced by percents.
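
Both chart types can be drawn from the same frequency table. The sketch below assumes ggplot2 is installed and uses the Freq column of the educational attainment table; the object names and styling choices are illustrative, and the pie chart is built as a single stacked bar wrapped into a circle with `coord_polar()`.

```r
library(ggplot2)

# Frequencies copied from the Freq column of the educational attainment table
degree_levels <- c("Left high school", "High school", "Junior college",
                   "Bachelor's", "Graduate")
edu <- data.frame(
  Highest_Degree = factor(degree_levels, levels = degree_levels),
  Freq = c(330, 1269, 186, 472, 280)
)
edu$Pct <- round(100 * edu$Freq / sum(edu$Freq), 1)  # percent per category

# Bar chart: one bar per category, bar height = frequency
edu_bar <- ggplot(edu, aes(x = Highest_Degree, y = Freq)) +
  geom_col(fill = "lightblue", color = "black") +
  geom_text(aes(label = Freq), vjust = -0.5) +
  theme_classic() +
  labs(x = "Highest Degree", y = "Frequency")

# Pie chart: slices sized by frequency and labeled with percents
edu_pie <- ggplot(edu, aes(x = "", y = Freq, fill = Highest_Degree)) +
  geom_col(color = "black") +
  geom_text(aes(label = paste0(Pct, "%")),
            position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(fill = "Highest Degree")

edu_bar
edu_pie
```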

Bar Chart vs. Histogram

Bar charts are used for CATEGORICAL data

Histograms show the distribution of Quantitative Data

Histograms of Different Distributions

Histograms are an effective tool for examining the distribution of the data.

LEFT SKEWED

Tail pulled out to LEFT

Low outliers

e.g. Human Lifespan

NORMAL/SYMMETRIC

Data appear in a symmetric bell-shaped curve

No graphic evidence of outliers

e.g. Test scores

RIGHT SKEWED

Tail pulled out to RIGHT

High outliers

e.g. Movie Gross values
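
To see the three shapes side by side, one option is to simulate data and plot it with base R's `hist()`. The distributions and parameters below are assumptions chosen only to produce the shapes described above; they are not real lifespan, test score, or movie data.

```r
set.seed(261)  # make the simulated values reproducible

# Simulated values (illustration only)
left_skewed  <- 100 * rbeta(1000, shape1 = 8, shape2 = 2)  # tail to the left
symmetric    <- rnorm(1000, mean = 75, sd = 8)             # bell shaped
right_skewed <- rexp(1000, rate = 1 / 50)                  # tail to the right

# Draw the three histograms in one row, then restore the plot settings
op <- par(mfrow = c(1, 3))
hist(left_skewed,  main = "Left skewed",      xlab = "Value")
hist(symmetric,    main = "Normal/symmetric", xlab = "Value")
hist(right_skewed, main = "Right skewed",     xlab = "Value")
par(op)
```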

Histogram of Movie Data from Lecture 4

Key Points from Today

Frequency Tables
- Definitions of Frequency Table Terms
- Terms are essentially the same for categorical and quantitative data.
  - Quantitative data are subdivided into intervals (bins)
  - For categorical data, the order of the table categories is subjective if the data are not ordinal.
- A histogram is a visual representation of quantitative frequency data.
  - A histogram CANNOT BE CREATED for categorical data.
- Bar charts and Pie charts are two common ways to represent categorical frequency data.

To submit an Engagement Question or Comment about material from Lecture 5: Submit it by midnight today (day of lecture).
