MAS 261 - Lecture 2 - Notes

Measures of Central Tendency

Author

Penelope Pooler Eisenbies

Published

August 21, 2024

Housekeeping

Today’s plan
- Review Question
- A few minutes for R Questions
- Measures of Central Tendency
- Random Variables and Parameters
- Calculations in R and Excel
  - Excel can be used for Lecture 2
  - Upcoming material is easier using R
- In-class Exercises

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I will demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

Lecture 2 In-class Exercises - Q1

Session ID: MAS261f24

Students often ask: How do we determine if a categorical variable is ordinal or nominal?

Answer: Examine the variable entries and data dictionary (if provided). Is there an objective way to order the variable categories? If so, it is an ordinal variable. If not, it is a nominal variable.

All of the following variables are CATEGORICAL. Which of the following variables are ALSO ORDINAL? Select all correct answers.

A. Age Groups: 0-17, 18-45, 46-70, 70+

B. Upstate NY Cities: Syracuse, Rochester, Buffalo, etc.

C. Course Grades: A, A-, B+, B, B-, etc.

D. Hair Colors: Blonde, Brown, Black, Red

E. Credit Ratings: Poor, Fair, Good, Excellent
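Optional R illustration (not required for MAS 261): a minimal sketch of how the ordinal/nominal distinction can be recorded in R. The variables `shirt_size` and `paint_color` are hypothetical and were chosen only for illustration; they are not part of the course data.

```r
# Hypothetical categories, used only to illustrate the idea
shirt_size <- factor(c("Medium", "Small", "Large", "Small"),
                     levels = c("Small", "Medium", "Large"),
                     ordered = TRUE)                        # ordinal: objective order exists
paint_color <- factor(c("Teal", "Gray", "Teal", "White"))   # nominal: no objective order

is.ordered(shirt_size)          # TRUE
is.ordered(paint_color)         # FALSE
shirt_size[1] > shirt_size[2]   # TRUE: Medium is larger than Small
```

Comparisons such as `>` are only meaningful for the ordered (ordinal) variable.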

Types of Variables in a Dataset

Recall: There are four main types of data.

Today we will focus on summarizing QUANTITATIVE DATA

Measures of Central Tendency - MEAN

Measures of central tendency tell us WHERE on the number line our data values are mostly located, i.e., where they TEND TO BE.

Mean is the arithmetic average

Sum up data values and divide by number of values.
Simple Example:
- Data values: 3, 5, 6, 8, 10
- Sum: 3 + 5 + 6 + 8 + 10 = 32
- Mean: 32/5 = 6.4

Calculation In R:

Code

```{r demo of mean calc, echo=T}
sum(3,5,6,8,10)/5
x <- c(3,5,6,8,10)
mean(x)
```

[1] 6.4
[1] 6.4
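The "sum the values and divide by the number of values" rule can also be written without typing the count by hand. This is a small sketch using the same five values; the name `manual_mean` is just an illustrative choice.

```r
x <- c(3, 5, 6, 8, 10)

n <- length(x)                # number of values: 5
manual_mean <- sum(x) / n     # 32 / 5
manual_mean                   # 6.4
mean(x)                       # 6.4, same result from the built-in function
```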

Calculating and Saving a Mean in R

Saving the Data to Global Environment

Code

```{r cars data, echo=T}
my_cars <- mtcars # save R dataset mtcars to Global Environment
```

Two Ways to Calculate Mean of a Variable

Code

```{r calculating mean, echo=T}
mean(my_cars$mpg)              # traditional way
my_cars |> pull(mpg) |> mean() # with piping symbol |>
```

[1] 20.09062
[1] 20.09062

The calculations above are not saved.
We can save any calculation result to the Global Environment by assigning a name.
To save a calculation AND print it to the screen, enclose the line(s) in parentheses.

Calculating a Mean, Saving It, and Displaying It

Code

```{r saving mean to global and printing it to screen, echo=T}
mean_mpg1 <- my_cars |> pull(mpg) |> mean() # result is saved but not displayed
(mean_mpg1 <- my_cars |> pull(mpg) |> mean()) # save result and display it by enclosing the command in parentheses
(mean_mpg2 <- mean(my_cars$mpg))
```

[1] 20.09062
[1] 20.09062

Using AI to find R syntax

AI tools like ChatGPT, Copilot, and Gemini are familiar with R and R datasets.
AI-generated code may differ from my code but will usually work.

Measures of Central Tendency - MEDIAN

Median is the middle value of sorted data.

Example 1: If dataset has an odd number of values, median is middle value.
- Data values: 3, 5, 6, 8, 10
- Median: 6 (middle value)

Code

```{r median 1, echo=T}
x <- c(3,5,6,8,10)
median(x)
```

[1] 6

Example 2: If dataset has an even number of values, median is average of two middle values.
- Data Values: 3, 5, 6, 8, 10, 15
- Median: (6 + 8)/2 = 14/2 = 7

Calculation In R:

Code

```{r median 2, echo=T}
y <- c(3,5,6,8,10,15)
median(y)
```

[1] 7
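To see what `median()` is doing in both cases, here is a minimal sketch that sorts the data and picks the middle value(s) by hand. The helper `manual_median` is hypothetical and written only for illustration; in practice, use `median()`.

```r
# Hypothetical helper that mirrors the two rules above
manual_median <- function(v) {
  s <- sort(v)                        # the median is defined on SORTED data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

manual_median(c(3, 5, 6, 8, 10))      # 6
manual_median(c(3, 5, 6, 8, 10, 15))  # 7
```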

Calculating and Saving a Median in R

Two Ways to Calculate Median of a Variable

Code

```{r calculating median, echo=T}
median(my_cars$mpg)              # traditional way
my_cars |> pull(mpg) |> median() # with piping symbol |>
```

[1] 19.2
[1] 19.2

The calculations above are not saved.
We can save any calculation result to the Global Environment by assigning a name.
To save a calculation AND print it to the screen, enclose the line(s) in parentheses.

Code

```{r saving median to global and printing it to screen, echo=T}
median_mpg1 <- my_cars |> pull(mpg) |> median()  # result is saved but not displayed
(median_mpg1 <- my_cars |> pull(mpg) |> median()) # save result and display it by enclosing the command in parentheses
(median_mpg2 <- median(my_cars$mpg))
```

[1] 19.2
[1] 19.2

Using AI to find R syntax

AI tools like ChatGPT, Copilot, and Gemini are familiar with R and R datasets.
AI-generated code may differ from my code but will usually work.

Means of each Category

Note that the R code required to subdivide data by category is not required in MAS 261.

The small table below helps demonstrate how mean and median can differ.
The dataset has cars with Automatic and Manual transmissions.
Here we show the mean and median of each car category.
Looking at central tendency by category is a good way to understand the data.

Transmission	Mean_MPG	Median_MPG
Automatic	17.15	17.3
Manual	24.39	22.8
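For the curious (again, NOT REQUIRED): one possible way to compute a by-category summary like this with base R is `tapply()`. This is a sketch, not necessarily the code used to build the table above. It assumes the `mtcars` coding in which `am = 0` means Automatic and `am = 1` means Manual.

```r
# Label the transmission variable (in mtcars, am: 0 = automatic, 1 = manual)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# Apply mean() and median() to mpg separately within each transmission group
mean_mpg_by_group   <- round(tapply(mtcars$mpg, transmission, mean), 2)
median_mpg_by_group <- tapply(mtcars$mpg, transmission, median)

mean_mpg_by_group     # Automatic 17.15, Manual 24.39
median_mpg_by_group   # Automatic 17.3,  Manual 22.8
```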

Understanding Central Tendency Visually

Dotplots show each observation.

Also note that some values are duplicated; the most commonly repeated value is the mode.

Measures of Central Tendency - MODE

A mode or modal value is the value that occurs most often in the data.
Modal values don’t always exist (no duplicate values means no mode) and may not be interesting unless the value is very prevalent.
Distributional modes (where most of the data are concentrated) are more interesting.
R does not have a simple command for finding a modal value, but I will show you how this value can be determined.
Below I show all horsepower (hp) values for the cars dataset that appear more than once and how often they appear (n).
The R code to do this is NOT REQUIRED, but understanding the output is.

hp	66	110	123	150	175	180	245
n	2	3	2	2	3	3	2
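One possible way to pull the modal value(s) out of a frequency count like this uses base R's `table()`. This is an optional sketch (NOT REQUIRED) and may differ from the code used to build the table above; the names `hp_counts` and `modal_hp` are illustrative.

```r
hp_counts <- table(mtcars$hp)      # how often each hp value appears

# Keep every value whose count equals the largest count (ties are possible)
modal_hp <- as.numeric(names(hp_counts)[hp_counts == max(hp_counts)])

max(hp_counts)   # 3
modal_hp         # 110 175 180
```

Here three values tie for most frequent, so these data have three modes, each appearing only 3 times. This is an example of a mode that is not especially informative.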

Mean, Median, and Mode in Excel

The R dataset mtcars has been exported to Excel and can be accessed here.
Notice that the file also includes the calculations for mean, median, and mode:

Mean, Median, and how they are different

Recall that MEAN is the arithmetic average.
- If there are EXTREME VALUES (high or low extremes) they will PULL the mean towards the extreme.
- Preview/Review - What is the term for extreme values in the data?
The MEDIAN is NOT affected by extremes.
- Regardless of extreme values, median represents center value(s).
IF mean and median are similar, that is a good indication that there are no extreme values in the data.
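A quick sketch of this effect, using the small dataset from the mean example and then replacing its largest value with a hypothetical extreme value of 100:

```r
typical      <- c(3, 5, 6, 8, 10)
with_extreme <- c(3, 5, 6, 8, 100)   # same data, but 10 replaced by an extreme value

mean(typical)          # 6.4
median(typical)        # 6
mean(with_extreme)     # 24.4  (pulled toward the extreme value)
median(with_extreme)   # 6     (unchanged)
```

In the first dataset the mean and median are close; in the second they are far apart, which signals an extreme value.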

US Median and Mean Household Income By County

Median map is more commonly used. Why?

Notice the map of means is DARKER.
The mean income for many counties is HIGHER than the median because it is affected by unusually wealthy households.
The median income is unaffected.

Lecture 2 In-class Exercises - Q2

Session ID: MAS261f24

If you want to find the central tendency, i.e. where most of the data are located, of the following 7 numbers, which measure is best?

Data: 3, 4, 7, 5, 9, 11, 89

A. Mean

B. Median

C. Mode

D. All three of the above measures are equally informative.

E. There is no mode, but mean and median are equally appropriate for these data.

Population Mean and Sample Mean

Map data includes ALL U.S. counties for which data were available in 2019.

We have data for the full POPULATION of counties.
POPULATION: The whole group of objects about which you want information
Population Mean: symbolized as $\mu$ (Greek Letter mu)

Usually, we don’t have the time or resources to collect data from an entire population.

Instead, we SAMPLE the population to ESTIMATE information about the population.
Sample Mean: symbolized as $\overline{X}$ (Referred to as x bar)

Population Mean ($\mu$) and Sample Mean ($\overline{X}$) are calculated the same way BUT…

$\mu$ AND $\overline{X}$ are interpreted differently.
A population and a sample from that population are different.
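Written as formulas, the two calculations have the same form. The notation here is an assumption for illustration: $N$ is the number of values in the population, $n$ is the number of values in the sample, and $x_i$ is the $i$-th value.

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

The arithmetic is identical; what differs is which values are included (all of the population versus only the sampled values).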

Population Values are FIXED constants

The following HUGE dataset from Kaggle contains the selling price for ALL U.S. residential properties from a couple of years ago.

This is the POPULATION of for-sale homes in the U.S. at that time.
We can find the population mean ($\mu$) and median.
These values are FIXED constants based on all of the data and are called PARAMETERS.

Code

homes <- read_csv("data/realtor_data.csv", show_col_types = F) |> # import data
  select(state, city, bed:acre_lot, house_size, price)

(pop_mean <- mean(homes$price))  
(pop_median <- median(homes$price))

[1] 768092.4
[1] 460000

Sample Values CHANGE with each new Sample

Below, three random samples were selected from our population of housing prices.

The mean and median housing price were calculated for each sample.

Summary Table: What do you notice?

Data	Mean	Median
Population	768092.4	460000
Sample 1	790539.6	475000
Sample 2	723155.7	439900
Sample 3	815676.3	467000

Every NEW sample results in a DIFFERENT sample mean and median.
Sample summary values are RANDOM VARIABLES because they vary with each new sample.
Population values are PARAMETERS, FIXED constants that don’t change, but may be unknown.
- It is rare to have data for the total population.
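The same idea can be sketched without the realtor file. The code below builds a hypothetical right-skewed "population" of 10,000 simulated prices (an assumption made only so the example is self-contained; it is not the Kaggle data) and then draws three samples from it.

```r
set.seed(261)   # arbitrary seed so the sketch is reproducible

# Hypothetical population of 10,000 simulated prices (NOT the realtor data)
population <- rlnorm(10000, meanlog = 13, sdlog = 0.8)

# Parameter: computed once from the whole population, so it never changes
pop_mean_demo <- mean(population)

# Statistics: draw three samples of 1,000 values and compute each sample mean
sample_means <- replicate(3, mean(sample(population, 1000, replace = FALSE)))

pop_mean_demo   # one fixed value
sample_means    # three different values, each an estimate of pop_mean_demo
```

Rerunning the last two lines without resetting the seed gives new sample means, while `pop_mean_demo` stays the same.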

Lecture 2 In-class Exercises - Q3

Session ID: mas261f24

R Practice: Click on the green triangle to run the following code to create a sample of the realtor_data.

Code

```{r class exercise realtor data sample, echo=T}
homes <- read_csv("data/realtor_data.csv", show_col_types = F) |> # import data
  select(state, city, bed:acre_lot, house_size, price)            # select variables

set.seed(1001) # set.seed used so everyone gets same sample 

homes_q3 <- homes |> slice(sample(1:306000, 500, replace=F)) |>   # Q3 Sample Data
  filter(!is.na(acre_lot))      # remove rows with missing values for acre_lot
```

This R code will import the data from the data file and then create a sample.
All students will get the SAME sample created by specifying set.seed.
NOTE: Students in MAS 261 do not have to know this code, but you are required to use the R environment and run code I provide.

Examine the homes_q3 dataset in the Global Environment.

How many observations are in the homes_q3 sample dataset created by running this provided chunk of R code?
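Optional sketch: the number of observations can also be checked in code with `nrow()`. The tiny data frame `lots` below is hypothetical and only illustrates how removing rows with missing values changes the count.

```r
# Hypothetical mini dataset with two missing lot sizes
lots <- data.frame(city     = c("A", "B", "C", "D", "E"),
                   acre_lot = c(0.25, NA, 1.5, NA, 0.1))

# Keep only rows where acre_lot is not missing
lots_complete <- lots[!is.na(lots$acre_lot), ]

nrow(lots)            # 5 rows before removing missing values
nrow(lots_complete)   # 3 rows after
```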

Lecture 2 In-class Exercises - Q4

Session ID: mas261f24

In the next EMPTY R chunk provided in the file for Lecture 2, use the following command to calculate the mean lot size.

mean(homes_q3$acre_lot)

This command calculates the mean of acre_lot in the homes_q3 sample dataset.
- Recall that $ is used to specify a variable within a dataset.
Additional OPTIONAL code is provided in the R file to demonstrate:
- how to both save this calculated mean and print it to the screen.
- how to round a calculated value to two decimal places.

What is the mean lot size in the homes_q3 sample dataset? Round to two decimal places.
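A minimal sketch of saving, printing, and rounding a mean. It uses a hypothetical vector `lot_sizes` rather than the course data, and it may differ from the optional code in the course R file.

```r
lot_sizes <- c(0.12, 0.33, 0.5, 1.25, 2, 3.1)   # hypothetical lot sizes in acres

(mean_lot <- mean(lot_sizes))                # save AND print the unrounded mean
(mean_lot_rounded <- round(mean_lot, 2))     # round to two decimal places: 1.22
```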

Lecture 2 In-class Exercises - Q5

Session ID: mas261f24

Copy and paste the code from the previous question, but now modify it as follows:

Change mean(homes_q3$acre_lot) to median(homes_q3$acre_lot)

Additional OPTIONAL code is provided in the R file to demonstrate how to save this calculated median and print it to the screen.

What is the median lot size in this sample? Round answer to two decimal places.

Lecture 2 In-class Exercises - Q6

Session ID: mas261f24

Thinking questions:

The mean lot size in this sample is MUCH larger than the median (14 times as large). Why?

Is the mean or the median more representative of the central tendency, i.e., the typical values of lot sizes in this sample of data?

Hint: Examine data by clicking on dataset homes_q3 in the Global Environment in R.

A. Median

B. Mean

C. Both measures are equally representative of the central tendency of lot size in this sample dataset even though these values are very different.

Visualizing these Sample Data

Notes:

Mean is much higher than median.
Mean is ‘pulled up’ by a few properties with hundreds of acres.
Median is representative of where most of the lot size data are located.

Side Note:

The y-axis of the plot was log-transformed so the data would be more spread out.
- More to come on data transformations at the end of this course.

Key Points from Today

Summarizing QUANTITATIVE DATA
- Measures of Central Tendency: Mean, Median, and Mode
  - Mean is arithmetic average and is affected by extreme values
  - Median is middle value or average of two middle values
    - Median is NOT affected by extreme values
  - Numeric mode(s) - value(s) that appear most often
    - Not always interesting or useful
Population and Sample Summary Values
- Population values are fixed constants, referred to as PARAMETERS. They may be unknown.
- Sample statistics vary with each new sample and are RANDOM VARIABLES.

To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).
