MAS 261 - Lecture 2

Measures of Central Tendency

Penelope Pooler Eisenbies

2024-08-17

Housekeeping

  • Today’s plan 📋

    • Review Question

    • A few minutes for R Questions 🪄

    • Measures of Central Tendency

    • Random Variables and Parameters

    • Calculations in R and Excel

      • Excel can be used for Lecture 2

      • Upcoming material is easier using R

    • In-class Exercises

R and RStudio

  • In this course we will use R and RStudio to understand statistical concepts.

  • You will access R and RStudio through Posit Cloud.

  • I will post R/RStudio files on Posit Cloud that you can access in provided links.

  • I will also provide demo videos that show how to access files and complete exercises.

  • NOTE: The free Posit Cloud account is limited to 25 hours per month.

    • I will demo how to download completed work so that you can use this allotment efficiently.

    • For those who want to go further with R/RStudio:

      • After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.

💥Lecture 2 In-class Exercises - Q1 💥

Session ID: MAS261f24

Students often ask: How do we determine if a categorical variable is ordinal or nominal?

Answer: Examine the variable entries and data dictionary (if provided). Is there an objective way to order the variable categories? If so, it is an ordinal variable.


All of the following variables are CATEGORICAL. Which of the following variables are ALSO ORDINAL? Select all correct answers.

A. Age Groups: 0-17, 18-45, 46-70, 71+

B. Upstate NY Cities: Syracuse, Rochester, Buffalo, etc.

C. Course Grades: A, A-, B+, B, B-, etc.

D. Hair Colors: Blonde, Brown, Black, Red

E. Credit Ratings: Poor, Fair, Good, Excellent

🧮 Types of Variables in a Dataset 🧮

Recall: There are four main types of data.

Today we will focus on summarizing QUANTITATIVE DATA

Measures of Central Tendency - MEAN

Measures of central tendency tell us WHERE on the number line our data values are mostly located, i.e., where they TEND TO BE.

Mean is the arithmetic average

  • Sum up data values and divide by number of values.

  • Simple Example:

    • Data values: 3, 5, 6, 8, 10
    • Sum: 3 + 5 + 6 + 8 + 10 = 32
    • Mean: 32/5 = 6.4

Calculation In R:

sum(3,5,6,8,10)/5
[1] 6.4
x <- c(3,5,6,8,10)
mean(x)
[1] 6.4

🤯 Calculating and Saving a Mean in R 🤯

Saving the Data to Global Environment

library(tidyverse) # assumed setup: loads dplyr (pull) and readr (read_csv) used in this lecture
my_cars <- mtcars  # save R dataset mtcars to Global Environment

Two Ways to Calculate Mean of a Variable

mean(my_cars$mpg)              # traditional way
[1] 20.09062
my_cars |> pull(mpg) |> mean() # with piping symbol |>
[1] 20.09062
  • The calculations above are not saved.
  • We can save any calculation result to the Global Environment by assigning a name.
  • To save a calculation AND print it to the screen, enclose the line(s) in parentheses.

Calculating a Mean, Saving It, and Displaying It

mean_mpg1 <- my_cars |> pull(mpg) |> mean() # result is saved but not displayed
(mean_mpg1 <- my_cars |> pull(mpg) |> mean()) # save result and display it by enclosing the command in parentheses
[1] 20.09062
(mean_mpg2 <- mean(my_cars$mpg))
[1] 20.09062

🤖 Using AI to find R syntax 🤖

  • AI tools like ChatGPT, Copilot, and Gemini are familiar with R and R datasets.

  • AI generated code may differ from my code but will usually work.

Measures of Central Tendency - MEDIAN

Median is the middle value of sorted data.

  • Example 1: If dataset has an odd number of values, median is middle value.

    • Data values: 3, 5, 6, 8, 10
    • Median: 6 (middle value)
x <- c(3,5,6,8,10)
median(x)
[1] 6
  • Example 2: If dataset has an even number of values, median is average of two middle values.

    • Data Values: 3, 5, 6, 8, 10, 15
    • Median: (6 + 8)/2 = 14/2 = 7

Calculation In R:

y <- c(3,5,6,8,10,15)
median(y)
[1] 7
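
The odd/even rule in Examples 1 and 2 can also be written out step by step. The function below is only an illustrative sketch (the name median_by_hand is hypothetical and not part of the course files); in practice, R's built-in median() does the same job.

```r
# Hypothetical helper: compute a median "by hand" to mirror Examples 1 and 2
median_by_hand <- function(values) {
  s <- sort(values)          # the median is defined on SORTED data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]           # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])  # even n: average of the two middle values
  }
}

median_by_hand(c(3, 5, 6, 8, 10))      # 6, same as median(x)
median_by_hand(c(3, 5, 6, 8, 10, 15))  # 7, same as median(y)
```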

🤯 Calculating and Saving a Median in R 🤯

Two Ways to Calculate Median of a Variable

median(my_cars$mpg)              # traditional way
[1] 19.2
my_cars |> pull(mpg) |> median() # with piping symbol |>
[1] 19.2
  • The calculations above are not saved.
  • We can save any calculation result to the Global Environment by assigning a name.
  • To save a calculation AND print it to the screen, enclose the line(s) in parentheses.
median_mpg1 <- my_cars |> pull(mpg) |> median()  # result is saved but not displayed
(median_mpg1 <- my_cars |> pull(mpg) |> median()) # save result and display it by enclosing the command in parentheses
[1] 19.2
(median_mpg2 <- median(my_cars$mpg))
[1] 19.2

Means of each Category

The R code to subdivide data by categories is not required in MAS 261.

  • The small table below helps demonstrate how mean and median can differ.

  • The dataset has cars with Automatic and Manual transmissions.

  • Here we show the mean and median of each car category.

  • Looking at central tendency by category is a good way to understand the data.


Transmission  Mean_MPG  Median_MPG
Automatic        17.15        17.3
Manual           24.39        22.8
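
For the curious, one possible way to get these numbers is sketched below using base R. This is not necessarily the code used to build the table, and it is NOT required for MAS 261. It relies on the mtcars variable am, which is coded 0 for automatic and 1 for manual; the name transmission is introduced here only for illustration.

```r
# Label each car by transmission type (am: 0 = automatic, 1 = manual)
transmission <- ifelse(mtcars$am == 0, "Automatic", "Manual")

# tapply() applies a function to mpg separately within each category
mean_by_trans   <- round(tapply(mtcars$mpg, transmission, mean), 2)
median_by_trans <- tapply(mtcars$mpg, transmission, median)

mean_by_trans    # Automatic 17.15, Manual 24.39
median_by_trans  # Automatic 17.3,  Manual 22.8
```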

Understanding Central Tendency Visually

Dotplots show each observation.

  • Also note that some values are duplicated; the most commonly repeated value is the mode.

Measures of Central Tendency - MODE

  • A mode or modal value is the value that occurs most often in the data.

  • Modal values don’t always exist (no duplicate values means no mode) and may not be interesting unless the value is very prevalent.

  • Distributional modes (where most of the data are concentrated) are more interesting.

  • R does not have a simple command for finding a modal value, but I will show you how this value can be determined.

  • Below I show all horsepower (hp) values for the cars dataset that appear more than once and how often they appear (n).

  • The R code to do this is NOT REQUIRED, but understanding the output is.

hp   66  110  123  150  175  180  245
n     2    3    2    2    3    3    2
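
A minimal base R sketch of how counts like these could be found is shown below. It is an assumption about one workable approach, not necessarily the code behind the table, and it is NOT required. The names hp_counts, repeated_hp, and modal_hp are hypothetical.

```r
# Count how many times each horsepower value occurs in mtcars
hp_counts <- table(mtcars$hp)

# Keep only the values that appear more than once
repeated_hp <- hp_counts[hp_counts > 1]
repeated_hp

# The modal value(s) are those with the highest count
modal_hp <- as.numeric(names(hp_counts)[hp_counts == max(hp_counts)])
modal_hp   # 110 175 180 (three values tie, each appearing 3 times)
```

Because three values tie for most frequent, these data have more than one mode, which is one reason a single modal value is often not very informative.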

Mean, Median, and Mode in Excel

Mean, Median, and how they are different

  • Recall that MEAN is the arithmetic average.

    • If there are EXTREME VALUES (high or low extremes) they will PULL the mean towards the extreme.

    • Preview/Review - What is the term for extreme values in the data?

  • The MEDIAN is NOT affected by extremes

    • Regardless of extreme values, median represents center value(s).
  • IF mean and median are similar, that is a good indication that there are no extreme values in the data.
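
A quick sketch of this effect, reusing the five values from the earlier mean example and a second, made-up vector in which the largest value is replaced by an extreme one (the vector names are hypothetical):

```r
typical <- c(3, 5, 6, 8, 10)    # values from the earlier mean example
extreme <- c(3, 5, 6, 8, 100)   # hypothetical: largest value replaced by an extreme

mean(typical)     # 6.4
median(typical)   # 6

mean(extreme)     # 24.4, pulled toward the extreme value
median(extreme)   # 6, unchanged
```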

US Median and Mean Household Income By County

Median map is more commonly used. Why?

  • Notice the map of means is DARKER.
  • The mean income for many counties is HIGHER than the median because it is affected by unusually wealthy households.
  • The median income is unaffected.

💥Lecture 2 In-class Exercises - Q2 💥

Session ID: MAS261f24


If you want to find the central tendency, i.e. where most of the data are located, of the following 7 numbers, which measure is best?


Data: 3, 4, 7, 5, 9, 11, 89

A. Mean

B. Median

C. Mode

D. All three of the above measures are equally informative.

E. There is no mode, but mean and median are equally appropriate for these data.

Population Mean and Sample Mean

Map data includes ALL U.S. counties for which data were available in 2019.

  • We have data for the full POPULATION of counties.

  • POPULATION: The whole group of objects about which you want information

  • Population Mean: symbolized as \(\mu\) (Greek Letter mu)

Usually, we don’t have the time or resources to collect data from an entire population.

  • Instead, we SAMPLE the population to ESTIMATE information about the population.

  • Sample Mean: symbolized as \(\overline{X}\) (Referred to as x bar)

Population Mean (\(\mu\)) and Sample Mean (\(\overline{X}\)) are calculated the same way BUT…

  • \(\mu\) AND \(\overline{X}\) are interpreted differently.
  • A population and a sample from that population are different.
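
Written out, both means use the same arithmetic; only the set of values being averaged differs. The notation here is a common convention and an assumption on my part (the slides do not define it): \(N\) is the number of values in the population, \(n\) is the number of values in the sample, and \(x_i\) are the individual values.

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]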

Population Values are FIXED constants

The following HUGE dataset from Kaggle contains the selling price for ALL U.S. residential properties from a couple of years ago.

  • This is the POPULATION of for-sale homes in the U.S. at that time.

  • We can find the population mean (\(\mu\)) and median.

  • These values are FIXED constants based on all of the data and are called PARAMETERS


homes <- read_csv("data/realtor_data.csv", show_col_types = FALSE) |> # import data
  select(state, city, bed:acre_lot, house_size, price)

(pop_mean <- mean(homes$price))  
(pop_median <- median(homes$price))
[1] 768092.4
[1] 460000

Sample Values CHANGE with each new Sample

Below, three random samples were selected from our population of housing prices.

The mean and median housing price were calculated for each sample.


Summary Table: What do you notice?

Data          Mean      Median
Population    768092.4  460000
Sample 1      790539.6  475000
Sample 2      723155.7  439900
Sample 3      815676.3  467000
  • Every NEW sample results in a DIFFERENT sample mean and median.

  • Sample summary values are RANDOM VARIABLES because they vary with each new sample.

  • Population values are PARAMETERS, FIXED constants that don’t change, but may be unknown.

    • It is rare to have data for the total population.
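
The same idea can be tried without the housing file. The sketch below is only an illustration: it treats the 32 cars in mtcars as if they were a full population (an assumption made purely for demonstration) and draws three random samples of 10 cars each. The seed value and object names are arbitrary choices.

```r
population <- mtcars$mpg          # pretend these 32 values are the whole population
pop_mean   <- mean(population)    # a PARAMETER: fixed, about 20.09

set.seed(261)                     # arbitrary seed so the samples are reproducible
# Draw 3 samples of 10 cars (without replacement) and compute each sample mean
sample_means <- replicate(3, mean(sample(population, size = 10)))

pop_mean       # never changes
sample_means   # three sample means, each based on a different random sample
```

Rerunning the last two steps with a different seed gives different sample means, while pop_mean stays the same.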

💥 Lecture 2 In-class Exercises - Q3 💥

Session ID: mas261f24

R Practice: Click on the green triangle to run the following code to create a sample of the realtor_data.

homes <- read_csv("data/realtor_data.csv", show_col_types = FALSE) |> # import data
  select(state, city, bed:acre_lot, house_size, price)            # select variables

set.seed(1001) # set.seed used so everyone gets same sample 

homes_q3 <- homes |> slice(sample(1:306000, 500, replace = FALSE)) |>   # Q3 Sample Data
  filter(!is.na(acre_lot))      # remove rows with missing values for acre_lot
  • This R code will import the data from the data file and then create a sample.

  • All students will get the SAME sample created by specifying set.seed.

  • NOTE: Students in MAS 261 do not have to know this code, but you are required to use the R environment and run code I provide.

Examine the homes_q3 dataset in the Global Environment.

How many observations are in the homes_q3 sample dataset created by running this provided chunk of R code?

💥 Lecture 2 In-class Exercises - Q4 💥

Session ID: mas261f24

In the next EMPTY R chunk provided in the file for Lecture 2, use the following command to calculate the mean lot size.


mean(homes_q3$acre_lot)


  • This command calculates the mean of acre_lot in the homes_q3 sample dataset.

    • Recall that $ is used to specify a variable within a dataset.
  • Additional OPTIONAL code is provided in the R file to demonstrate:

    • how to both save this calculated mean and print it to the screen.

    • how to round a calculated value to two decimal places.
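
As a rough illustration of both steps, here is a sketch that uses the built-in mtcars data rather than homes_q3, so it does not give away the answer. The object name mean_mpg_rounded is hypothetical, and the optional code in the R file may look different.

```r
# round(value, 2) rounds to two decimal places;
# wrapping the whole line in parentheses saves the result AND prints it
(mean_mpg_rounded <- round(mean(mtcars$mpg), 2))   # prints 20.09
```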

What is the mean lot size in the homes_q3 sample dataset? Round to two decimal places.

💥 Lecture 2 In-class Exercises - Q5 💥

Session ID: mas261f24

Copy and paste the code from the previous question, but now modify it as follows:


Change mean(homes_q3$acre_lot) to median(homes_q3$acre_lot)


Additional OPTIONAL code is provided in R file to demonstrate how to save this calculated median and print it to the screen.


What is the median lot size in this sample? Round answer to two decimal places.

💥 Lecture 2 In-class Exercises - Q6 💥

Session ID: mas261f24

Thinking questions:

The mean lot size in this sample is MUCH larger than the median (14 times as large). Why?


Is the mean or the median more representative of the central tendency, i.e., the typical values of lot sizes in this sample of data?

Hint: Examine data by clicking on dataset homes_q3 in the Global Environment in R.


A. Median

B. Mean

C. Both measures are equally representative of the central tendency of lot size in this sample dataset even though these values are very different.

Visualizing these Sample Data

Notes:

  • Mean is much higher than median.

  • Mean is ‘pulled up’ by a few properties with hundreds of acres.

  • Median is representative of where most of the lot size data are located.


Side Note:

  • Y-axis of plot was transformed so data would be more spread out.
    • More to come on data transformations at the end of this course.

Key Points from Today

  • Summarizing QUANTITATIVE DATA
    • Measures of Central Tendency: Mean, Median, and Mode
      • Mean is arithmetic average and is affected by extreme values
      • Median is middle value or average of two middle values
        • Median is NOT affected by extreme values
      • Numeric mode(s) - value(s) that appear most often
        • Not always interesting or useful
  • Population and Sample Summary Values
    • Population values are fixed constants, referred to as PARAMETERS. May be unknown
    • Sample statistics vary with each new sample and are RANDOM VARIABLES

To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).