Measures of Central Tendency
2024-08-17
Today’s plan 📋
Review Question
A few minutes for R Questions 🪄
Measures of Central Tendency
Random Variables and Parameters
Calculations in R and Excel
Excel can be used for Lecture 2
Upcoming material is easier using R
In-class Exercises
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
Session ID: MAS261f24
Students often ask: How do we determine if a categorical variable is ordinal or nominal?
Answer: Examine the variable entries and data dictionary (if provided). Is there an objective way to order the variable categories? If so, it is an ordinal variable.
All of the following variables are CATEGORICAL. Which of the following variables are ALSO ORDINAL? Select all correct answers.
A. Age Groups: 0-17, 18-45, 46-70, 70+
B. Upstate NY Cities: Syracuse, Rochester, Buffalo, etc.
C. Course Grades: A, A-, B+, B, B-, etc.
D. Hair Colors: Blonde, Brown, Black, Red
E. Credit Ratings: Poor, Fair, Good, Excellent
Recall: There are four main types of data.
Today we will focus on summarizing QUANTITATIVE DATA
Measures of central tendency tell us WHERE on the number line our data values are mostly located, i.e., where they TEND TO BE.
Mean is the arithmetic average
Sum up data values and divide by number of values.
Simple Example:
Saving the Data to Global Environment
Two Ways to Calculate Mean of a Variable
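A minimal sketch of the calculation, assuming a small made-up vector named `x` (the values are hypothetical, not the example from lecture). Assigning with `<-` saves `x` to the Global Environment. The sketch then shows two equivalent routes to the mean: applying the definition by hand, and calling R's built-in `mean()`. The lecture file may pair different methods.

```r
# Hypothetical data values (not from the lecture slides)
x <- c(4, 8, 6, 5, 9)

# Way 1: apply the definition (sum the values, divide by how many there are)
mean_by_hand <- sum(x) / length(x)

# Way 2: use the built-in mean() function
mean_builtin <- mean(x)

mean_by_hand   # 6.4
mean_builtin   # 6.4

# The same function works on a dataset variable, e.g. mpg in the built-in mtcars data
mean(mtcars$mpg)
```

Both routes give the same answer; `mean()` is simply shorter to type.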
AI tools like ChatGPT, Copilot, and Gemini are familiar with R and R datasets.
AI-generated code may differ from my code but will usually work.
Median is the middle value of sorted data.
Example 1: If dataset has an odd number of values, median is middle value.
Example 2: If dataset has an even number of values, median is average of two middle values.
Two Ways to Calculate Median of a Variable
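A minimal sketch of both cases, assuming two made-up vectors (hypothetical values, not the examples from lecture). It finds the median by sorting and picking the middle position, then checks the result against R's built-in `median()`. The lecture file may pair different methods.

```r
# Hypothetical data values (not from the lecture slides)
odd_values  <- c(7, 3, 9, 1, 5)        # 5 values (odd count)
even_values <- c(7, 3, 9, 1, 5, 11)    # 6 values (even count)

# Way 1: sort the data and pick out the middle by position
sorted_odd <- sort(odd_values)                  # 1 3 5 7 9
median_odd_by_hand <- sorted_odd[3]             # 3rd of 5 values is the middle

sorted_even <- sort(even_values)                # 1 3 5 7 9 11
median_even_by_hand <- mean(sorted_even[3:4])   # average of the two middle values

# Way 2: use the built-in median() function, which handles both cases
median_odd_builtin  <- median(odd_values)       # 5
median_even_builtin <- median(even_values)      # 6
```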
The R code to subdivide data by categories is not required in MAS 261.
The small table below helps demonstrate how mean and median can differ.
The dataset has cars with Automatic and Manual transmissions.
Here we show the mean and median of each car category.
Looking at central tendency by category is a good way to understand the data.
Transmission | Mean_MPG | Median_MPG |
---|---|---|
Automatic | 17.15 | 17.3 |
Manual | 24.39 | 22.8 |
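For the curious, one way a table like this could be built is sketched below in base R. This is an assumption about the approach, not the code used for the slide. It relies on the built-in `mtcars` data, where `am` is coded 0 for automatic and 1 for manual; the names `trans` and `summary_tbl` are made up for illustration.

```r
# Label the transmission codes in the built-in mtcars data (0 = automatic, 1 = manual)
trans <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Compute the mean and median of mpg separately for each transmission type
mean_mpg   <- tapply(mtcars$mpg, trans, mean)
median_mpg <- tapply(mtcars$mpg, trans, median)

# Assemble the results into a small summary table
summary_tbl <- data.frame(
  Transmission = names(mean_mpg),
  Mean_MPG     = round(as.numeric(mean_mpg), 2),
  Median_MPG   = as.numeric(median_mpg)
)
summary_tbl
```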
Dotplots show each observation.
A mode or modal value is the value that occurs most often in the data.
Modal values don’t always exist (no duplicate values means no mode) and may not be interesting unless the value is very prevalent.
Distributional modes (where most of the data are concentrated) are more interesting.
R does not have a simple command for finding a modal value, but I will show you how this value can be determined.
Below I show all horsepower (hp) values for the mtcars dataset that appear more than once and how often they appear (n).
The R code to do this is NOT REQUIRED, but understanding the output is.
hp | 66 | 110 | 123 | 150 | 175 | 180 | 245 |
n | 2 | 3 | 2 | 2 | 3 | 3 | 2 |
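A base R sketch of how counts like these could be produced is given here. It is one possible approach, not necessarily the code behind the slide, and the names `hp_counts`, `repeated_hp`, and `modal_hp` are made up for illustration.

```r
# Count how often each horsepower value appears in the built-in mtcars data
hp_counts <- table(mtcars$hp)

# Keep only the values that appear more than once
repeated_hp <- hp_counts[hp_counts > 1]
repeated_hp

# The modal value(s): whichever value(s) have the highest count
modal_hp <- as.numeric(names(hp_counts)[hp_counts == max(hp_counts)])
modal_hp
```

Here three values (110, 175, and 180) tie for most frequent, so there is no single modal value.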
The R dataset, mtcars, has been exported to Excel and can be accessed here.
Notice that the file also includes the calculations for mean, median, and mode:
Recall that MEAN is the arithmetic average.
If there are EXTREME VALUES (high or low extremes) they will PULL the mean towards the extreme.
Preview/Review - What is the term for extreme values in the data?
The MEDIAN is NOT affected by extremes
IF mean and median are similar, that is a good indication that there are no extreme values in the data.
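A small sketch of this effect, assuming two made-up vectors (hypothetical values, not lecture data) that differ only in their largest value:

```r
# Hypothetical data values (not from the lecture slides)
no_extreme   <- c(2, 4, 6, 8, 10)
with_extreme <- c(2, 4, 6, 8, 100)   # same data, but the largest value is now extreme

mean(no_extreme)      # 6
median(no_extreme)    # 6

mean(with_extreme)    # 24: pulled toward the extreme value
median(with_extreme)  # 6: unchanged
```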
Median map is more commonly used. Why?
If you want to find the central tendency, i.e. where most of the data are located, of the following 7 numbers, which measure is best?
Data: 3, 4, 7, 5, 9, 11, 89
A. Mean
B. Median
C. Mode
D. All three of the above measures are equally informative.
E. There is no mode, but mean and median are equally appropriate for these data.
Map data includes ALL U.S. counties for which data were available in 2019.
We have data for the full POPULATION of counties.
POPULATION: The whole group of objects about which you want information
Population Mean: symbolized as \(\mu\) (Greek Letter mu)
Usually, we don’t have the time or resources to collect data from an entire population.
Instead, we SAMPLE the population to ESTIMATE information about the population.
Sample Mean: symbolized as \(\overline{X}\) (Referred to as x bar)
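In formula form, a sketch assuming the population has \(N\) values and the sample has \(n\) values (the symbols \(N\), \(n\), and \(x_i\) are notation assumed here, not taken from the slides):

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\qquad
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]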
Population Mean (\(\mu\)) and Sample Mean (\(\overline{X}\)) are calculated the same way BUT…
The following HUGE dataset from Kaggle contains the selling price for ALL U.S. residential properties from a couple of years ago.
This is the POPULATION of for-sale homes in the U.S. at that time.
We can find the population mean (\(\mu\)) and median.
These values are FIXED constants based on all of the data and are called PARAMETERS
Below, three random samples were selected from our population of housing prices.
The mean and median housing price were calculated for each sample.
Summary Table: What do you notice?
Data | Mean | Median |
---|---|---|
Population | 768092.4 | 460000 |
Sample 1 | 790539.6 | 475000 |
Sample 2 | 723155.7 | 439900 |
Sample 3 | 815676.3 | 467000 |
Every NEW sample results in a DIFFERENT sample mean and median.
Sample summary values are RANDOM VARIABLES because they vary with each new sample.
Population values are PARAMETERS, FIXED constants that don’t change, but may be unknown.
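The idea can be tried out with a simulated population. The sketch below is an illustration only: the "prices" are randomly generated, not the Kaggle housing data, and the seed, population size, and distribution settings are arbitrary assumptions.

```r
set.seed(261)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "population" of 10,000 prices (simulated, not real data)
population <- rlnorm(10000, meanlog = 13, sdlog = 0.8)

# Parameters: fixed values computed from the whole population
pop_mean   <- mean(population)
pop_median <- median(population)

# Draw three random samples of 500 and compute each sample's mean and median
sample_stats <- t(sapply(1:3, function(i) {
  s <- sample(population, size = 500, replace = FALSE)
  c(Mean = mean(s), Median = median(s))
}))
rownames(sample_stats) <- paste("Sample", 1:3)

# Stack the population row on top of the three sample rows
rbind(Population = c(Mean = pop_mean, Median = pop_median), sample_stats)
```

Rerunning the sampling step without resetting the seed gives three new samples with different means and medians, while `pop_mean` and `pop_median` stay the same.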
R Practice: Click on the green triangle to run the following code to create a sample of the realtor_data.
library(tidyverse) # provides read_csv(), select(), slice(), and filter()
homes <- read_csv("data/realtor_data.csv", show_col_types = FALSE) |> # import data
  select(state, city, bed:acre_lot, house_size, price) # select variables
set.seed(1001) # set.seed used so everyone gets the same sample
homes_q3 <- homes |> slice(sample(1:306000, 500, replace = FALSE)) |> # Q3 sample data
  filter(!is.na(acre_lot)) # remove rows with missing values for acre_lot
This R code will import the data from the data file and then create a sample.
All students will get the SAME sample, created by specifying set.seed.
NOTE: Students in MAS 261 do not have to know this code, but you are required to use the R environment and run the code I provide.
Examine the homes_q3 dataset in the Global Environment.
How many observations are in the homes_q3 sample dataset created by running this provided chunk of R code?
In the next EMPTY R chunk provided in the file for Lecture 2, use the following command to calculate the mean lot size.
mean(homes_q3$acre_lot)
This command calculates the mean of acre_lot in the homes_q3 sample dataset.
$ is used to specify a variable within a dataset.
Additional OPTIONAL code is provided in the R file to demonstrate:
how to both save this calculated mean and print it to the screen.
how to round a calculated value to two decimal places.
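The optional steps might look something like the sketch below. It uses a tiny made-up data frame named `lots` (hypothetical values) as a stand-in so it runs on its own; in the lecture file you would use `homes_q3` instead, and the code posted there may differ.

```r
# Hypothetical stand-in data (not the homes_q3 sample)
lots <- data.frame(acre_lot = c(0.12, 0.25, 0.18, 0.33, 1.50))

# Save the calculated mean under a name, then print it by typing the name
mean_lot <- mean(lots$acre_lot)
mean_lot

# Wrapping the assignment in parentheses saves AND prints in one step
(mean_lot <- mean(lots$acre_lot))

# Round the saved value to two decimal places
mean_lot_rounded <- round(mean_lot, 2)
mean_lot_rounded   # 0.48
```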
What is the mean lot size in the homes_q3 sample dataset? Round to two decimal places.
Copy and paste the code from the previous question, but now modify it as follows:
Change mean(homes_q3$acre_lot) to median(homes_q3$acre_lot)
Additional OPTIONAL code is provided in the R file to demonstrate how to save this calculated median and print it to the screen.
What is the median lot size in this sample? Round answer to two decimal places.
Thinking questions:
The mean lot size in this sample is MUCH larger than the median (14 times as large). Why?
Is the mean or the median more representative of the central tendency, i.e., the typical values of lot sizes in this sample of data?
Hint: Examine data by clicking on dataset homes_q3 in the Global Environment in R.
A. Median
B. Mean
C. Both measures are equally representative of the central tendency of lot size in this sample dataset even though these values are very different.
Notes:
Mean is much higher than median.
Mean is ‘pulled up’ by a few properties with hundreds of acres.
Median is representative of where most of the lot size data are located.
Side Note:
To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).