MAS 261 - Lecture 2 - Notes
Measures of Central Tendency
Housekeeping
Today’s plan
Review Question
A few minutes for R Questions
Measures of Central Tendency
Random Variables and Parameters
Calculations in R and Excel
Excel can be used for Lecture 2
Upcoming material is easier using R
In-class Exercises
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access through links I provide.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.
Lecture 2 In-class Exercises - Q1
Session ID: MAS261f24
Students often ask: How do we determine if a categorical variable is ordinal or nominal?
Answer: Examine the variable entries and data dictionary (if provided). Is there an objective way to order the variable categories? If so, it is an ordinal variable.
All of the following variables are CATEGORICAL. Which of the following variables are ALSO ORDINAL? Select all correct answers.
A. Age Groups: 0-17, 18-45, 46-70, 70+
B. Upstate NY Cities: Syracuse, Rochester, Buffalo, etc.
C. Course Grades: A, A-, B+, B, B-, etc.
D. Hair Colors: Blonde, Brown, Black, Red
E. Credit Ratings: Poor, Fair, Good, Excellent
Types of Variables in a Dataset
Recall: There are four main types of data.
Today we will focus on summarizing QUANTITATIVE DATA
Measures of Central Tendency - MEAN
Measures of central tendency tell us WHERE on the number line our data values are mostly located, i.e., where they TEND TO BE.
Mean is the arithmetic average
Sum up data values and divide by number of values.
Simple Example:
- Data values: 3, 5, 6, 8, 10
- Sum: 3 + 5 + 6 + 8 + 10 = 32
- Mean: 32/5 = 6.4
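The same arithmetic can be checked in R. This is a minimal sketch; the object names (x, total, n, manual_mean, builtin_mean) are made up for illustration and are not from the course files.

```r
x <- c(3, 5, 6, 8, 10)    # data values from the simple example

total <- sum(x)           # add up the data values
n     <- length(x)        # count how many values there are
manual_mean <- total / n  # sum divided by number of values

builtin_mean <- mean(x)   # R's built-in mean() does both steps at once

manual_mean
builtin_mean
```

Both approaches give the same value, so in practice mean() is all you need.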
Calculating and Saving a Mean in R
Saving the Data to Global Environment
Two Ways to Calculate Mean of a Variable
- The calculations above are not saved.
- We can save any calculation result to the Global Environment by assigning a name.
- To save a calculation AND print it to the screen, enclose the line(s) in parentheses.
Calculating a Mean, Saving It, and Displaying It
[1] 20.09062
[1] 20.09062
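The R code that produced the output above is not included in these notes. The sketch below mirrors the pattern used in the median section and is only an approximation of it. It ASSUMES that my_cars is a copy of R's built-in mtcars dataset (the printed mean, 20.09062, matches the mean of mtcars$mpg) and that the dplyr package is available for pull(). The names mean_mpg1 and mean_mpg2 are hypothetical.

```r
library(dplyr)  # provides pull(); assumed to be available in the course project

my_cars <- mtcars  # ASSUMPTION: my_cars is a copy of the built-in mtcars data

# Two ways to calculate the mean (result is printed but NOT saved)
my_cars |> pull(mpg) |> mean()
mean(my_cars$mpg)

# Save the result AND display it by enclosing the command in parentheses
(mean_mpg1 <- my_cars |> pull(mpg) |> mean())
(mean_mpg2 <- mean(my_cars$mpg))
```

After running this, mean_mpg1 and mean_mpg2 should both appear in the Global Environment.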
Using AI to find R syntax
AI tools like ChatGPT, Copilot, and Gemini are familiar with R and R datasets.
AI generated code may differ from my code but will usually work.
Measures of Central Tendency - MEDIAN
Median is the middle value of sorted data.
Example 1: If dataset has an odd number of values, median is middle value.
- Data values: 3, 5, 6, 8, 10
- Median: 6 (middle value)
Example 2: If dataset has an even number of values, median is average of two middle values.
- Data Values: 3, 5, 6, 8, 10, 15
- Median: (6 + 8)/2 = 14/2 = 7
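Both examples can be checked in R. This is a minimal sketch; the object names are made up for illustration. The last few lines repeat the even-count case by hand to show what median() is doing.

```r
odd_data  <- c(3, 5, 6, 8, 10)      # Example 1: odd number of values
even_data <- c(3, 5, 6, 8, 10, 15)  # Example 2: even number of values

median(odd_data)   # the middle value
median(even_data)  # the average of the two middle values

# Example 2 by hand: sort, then average the two middle values
sorted <- sort(even_data)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
manual_median
```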
Calculating and Saving a Median in R
Two Ways to Calculate Median of a Variable
- The calculations above are not saved.
- We can save any calculation result to the Global Environment by assigning a name.
- To save a calculation AND print it to the screen, enclose the line(s) in parentheses.
```{r saving median to global and printing it to screen, echo=T}
median_mpg1 <- my_cars |> pull(mpg) |> median() # result is saved but not displayed
(median_mpg1 <- my_cars |> pull(mpg) |> median()) # save result and display it by enclosing the command in parentheses
(median_mpg2 <- median(my_cars$mpg))
```
[1] 19.2
[1] 19.2
Means of each Category
Note that the R code required to subdivide data by category is not required in MAS 261.
The small table below helps demonstrate how mean and median can differ.
The dataset has cars with Automatic and Manual transmissions.
Here we show the mean and median of each car category.
Looking at central tendency by category is a good way to understand the data.
Transmission | Mean_MPG | Median_MPG |
---|---|---|
Automatic | 17.15 | 17.3 |
Manual | 24.39 | 22.8 |
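For those who are curious, here is one OPTIONAL way to build a table like this using only base R. This is a sketch, not the code used to make the slide. It ASSUMES the data are R's built-in mtcars, where the variable am is coded 0 for Automatic and 1 for Manual; the object names are hypothetical.

```r
# ASSUMPTION: data are the built-in mtcars; am = 0 is Automatic, am = 1 is Manual
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("Automatic", "Manual"))

# tapply() applies a function (mean or median) separately to each category
mean_mpg   <- tapply(mtcars$mpg, transmission, mean)
median_mpg <- tapply(mtcars$mpg, transmission, median)

# Combine the results into a small summary table
data.frame(
  Transmission = names(mean_mpg),
  Mean_MPG     = round(as.numeric(mean_mpg), 2),
  Median_MPG   = as.numeric(median_mpg)
)
```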
Understanding Central Tendency Visually
Dotplots show each observation.
- Also note that some values are duplicated; the most commonly repeated value is the mode.
Measures of Central Tendency - MODE
A mode or modal value is the value that occurs most often in the data.
Modal values don’t always exist (no duplicate values means no mode) and may not be interesting unless the value is very prevalent.
Distributional modes (where most of the data are concentrated) are more interesting.
R does not have a simple command for finding a modal value, but I will show you how this value can be determined.
Below I show all horsepower (hp) values for the cars dataset that appear more than once and how often they appear (n).
The R code to do this is NOT REQUIRED, but understanding the output is.
hp | 66 | 110 | 123 | 150 | 175 | 180 | 245 |
n | 2 | 3 | 2 | 2 | 3 | 3 | 2 |
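One OPTIONAL way to produce counts like these is the base R function table(). This is a sketch and may differ from the code used to make the slide. It ASSUMES the cars dataset is R's built-in mtcars; the object names hp_counts and repeated are hypothetical.

```r
# ASSUMPTION: the cars dataset is the built-in mtcars
hp_counts <- table(mtcars$hp)          # how many times each hp value appears
repeated  <- hp_counts[hp_counts > 1]  # keep only values that appear more than once
repeated

# The modal value(s): hp value(s) with the highest count
names(repeated)[repeated == max(repeated)]
```

Notice that three hp values are tied for the highest count, so these data have more than one mode.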
Mean, Median, and Mode in Excel
The R dataset, mtcars, has been exported to Excel and can be accessed here. Notice that the file also includes the calculations for mean, median, and mode:
Mean, Median, and how they are different
Recall that MEAN is the arithmetic average.
If there are EXTREME VALUES (high or low extremes) they will PULL the mean towards the extreme.
Preview/Review - What is the term for extreme values in the data?
The MEDIAN is NOT affected by extremes
- Regardless of extreme values, median represents center value(s).
IF mean and median are similar, that is a good indication that there are no extreme values in the data.
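A quick way to see this is to change one value in a small dataset. The sketch below reuses the data from the simple mean example and then replaces the largest value with 1000. The value 1000 is HYPOTHETICAL and chosen only to create an extreme; the object names are made up for illustration.

```r
typical <- c(3, 5, 6, 8, 10)    # data from the simple example
extreme <- c(3, 5, 6, 8, 1000)  # HYPOTHETICAL: largest value replaced by 1000

mean(typical)    # mean and median are close together
median(typical)

mean(extreme)    # mean is PULLED far toward the extreme value
median(extreme)  # median does not move
```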
US Median and Mean Household Income By County
Median map is more commonly used. Why?
- Notice the map of means is DARKER.
- The mean income for many counties is HIGHER than the median because it is affected by unusually wealthy households.
- The median income is unaffected.
Lecture 2 In-class Exercises - Q2
Session ID: MAS261f24
If you want to find the central tendency, i.e. where most of the data are located, of the following 7 numbers, which measure is best?
Data: 3, 4, 7, 5, 9, 11, 89
A. Mean
B. Median
C. Mode
D. All three of the above measures are equally informative.
E. There is no mode, but mean and median are equally appropriate for these data.
Population Mean and Sample Mean
Map data includes ALL U.S. counties for which data were available in 2019.
We have data for the full POPULATION of counties.
POPULATION: The whole group of objects about which you want information
Population Mean: symbolized as \(\mu\) (Greek Letter mu)
Usually, we don’t have the time or resources to collect data from an entire population.
Instead, we SAMPLE the population to ESTIMATE information about the population.
Sample Mean: symbolized as \(\overline{X}\) (Referred to as x bar)
Population Mean (\(\mu\)) and Sample Mean (\(\overline{X}\)) are calculated the same way BUT…
- \(\mu\) AND \(\overline{X}\) are interpreted differently.
- A population and a sample from that population are different.
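Written as formulas, the two calculations look the same except for which values are included. The symbols \(N\) (number of values in the population), \(n\) (number of values in the sample), and \(x_i\) (an individual data value) are notation assumed here for illustration and may differ from the textbook's.

\[
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad \text{and} \qquad
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]

In both cases we sum the data values and divide by the number of values. The difference is that \(\mu\) uses every value in the population, while \(\overline{X}\) uses only the values that happened to be sampled.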
Population Values are FIXED constants
The following HUGE dataset from Kaggle contains the selling price for ALL U.S. residential properties from a couple years ago.
This is the POPULATION of for-sale homes in the U.S. at that time.
We can find the population mean (\(\mu\)) and median.
These values are FIXED constants based on all of the data and are called PARAMETERS
Sample Values CHANGE with each new Sample
Below, three random samples were selected from our population of housing prices.
The mean and median housing price were calculated for each sample.
Summary Table: What do you notice?
Data | Mean | Median |
---|---|---|
Population | 768092.4 | 460000 |
Sample 1 | 790539.6 | 475000 |
Sample 2 | 723155.7 | 439900 |
Sample 3 | 815676.3 | 467000 |
Every NEW sample results in a DIFFERENT sample mean and median.
Sample summary values are RANDOM VARIABLES because they vary with each new sample.
Population values are PARAMETERS, FIXED constants that don’t change, but may be unknown.
- It is rare to have data for the total population.
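The idea can be illustrated with a small simulation. This is an OPTIONAL sketch: the population below is HYPOTHETICAL (randomly generated, right-skewed values), not the Kaggle housing data, and the seed, population size, and object names are arbitrary choices made for illustration.

```r
set.seed(261)  # arbitrary seed so the sketch is reproducible

# HYPOTHETICAL population of 100,000 right-skewed "prices" (NOT the Kaggle data)
population <- rlnorm(100000, meanlog = 13, sdlog = 0.8)

# PARAMETERS: calculated from the whole population, so they are fixed
mu         <- mean(population)
pop_median <- median(population)

# STATISTICS: each new random sample of 500 gives a different sample mean
sample_means <- replicate(3, mean(sample(population, 500)))

mu            # one fixed value
sample_means  # three different values, one per sample
```

Rerunning only the replicate() line without resetting the seed would give three new sample means, while mu would stay the same.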
Lecture 2 In-class Exercises - Q3
Session ID: mas261f24
R Practice: Click on the green triangle to run the following code to create a sample of the realtor_data.
```{r class exercise realtor data sample, echo=T}
library(tidyverse) # provides read_csv(), select(), slice(), filter(); skip if already loaded
homes <- read_csv("data/realtor_data.csv", show_col_types = FALSE) |> # import data
  select(state, city, bed:acre_lot, house_size, price) # select variables
set.seed(1001) # set.seed used so everyone gets same sample
homes_q3 <- homes |>
  slice(sample(1:306000, 500, replace = FALSE)) |> # Q3 sample: 500 randomly chosen rows
  filter(!is.na(acre_lot)) # remove rows with missing values for acre_lot
```
This R code will import the data from the data file and then create a sample.
All students will get the SAME sample because the code specifies set.seed.
NOTE: Students in MAS 261 do not have to know this code, but you are required to use the R environment and run code I provide.
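To see why set.seed gives everyone the same sample, here is an OPTIONAL base R sketch. It uses the same seed as the chunk above but only draws 5 row numbers, so it is an illustration and not the Q3 sample itself. The object names are hypothetical.

```r
set.seed(1001)                      # same seed as the class exercise
first_draw <- sample(1:306000, 5)   # draw 5 random row numbers

set.seed(1001)                      # reset to the SAME seed
second_draw <- sample(1:306000, 5)  # draw again

identical(first_draw, second_draw)  # TRUE: same seed, same "random" draw

third_draw <- sample(1:306000, 5)   # seed NOT reset: a new, different draw is expected
third_draw
```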
Examine the homes_q3 dataset in the Global Environment.
How many observations are in the homes_q3 sample dataset created by running this provided chunk of R code?
Lecture 2 In-class Exercises - Q4
Session ID: mas261f24
In the next EMPTY R chunk provided in the file for Lecture 2, use the following command to calculate the mean lot size.
mean(homes_q3$acre_lot)
This command calculates the mean of acre_lot in the homes_q3 sample dataset.
- Recall that $ is used to specify a variable within a dataset.
Additional OPTIONAL code is provided in the R file to demonstrate:
- how to both save this calculated mean and print it to the screen.
- how to round a calculated value to two decimal places.
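The pattern for saving, printing, and rounding looks like the sketch below. So that it runs without the data file, it ASSUMES the built-in mtcars$mpg as a stand-in for homes_q3$acre_lot; the names mean_value and mean_rounded are hypothetical and the code in the R file may differ.

```r
# ASSUMPTION: mtcars$mpg stands in for homes_q3$acre_lot in this sketch
(mean_value <- mean(mtcars$mpg))        # parentheses: save AND print the mean

(mean_rounded <- round(mean_value, 2))  # round() with 2 gives two decimal places
```

For the exercise itself, the same pattern applies with homes_q3$acre_lot in place of mtcars$mpg.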
What is the mean lot size in the homes_q3 sample dataset? Round to two decimal places.
Lecture 2 In-class Exercises - Q5
Session ID: mas261f24
Copy and paste the code from the previous question, but now modify it as follows:
Change mean(homes_q3$acre_lot)
to median(homes_q3$acre_lot)
Additional OPTIONAL code is provided in the R file to demonstrate how to save this calculated median and print it to the screen.
What is the median lot size in this sample? Round answer to two decimal places.
Lecture 2 In-class Exercises - Q6
Session ID: mas261f24
Thinking questions:
The mean lot size in this sample is MUCH larger than the median (14 times as large). Why?
Is the mean or the median more representative of the central tendency, i.e., the typical values of lot sizes in this sample of data?
Hint: Examine data by clicking on dataset homes_q3 in the Global Environment in R.
A. Median
B. Mean
C. Both measures are equally representative of the central tendency of lot size in this sample dataset even though these values are very different.
Visualizing these Sample Data
Notes:
Mean is much higher than median.
Mean is ‘pulled up’ by a few properties with hundreds of acres.
Median is representative of where most of the lot size data are located.
Side Note:
- Y-axis of plot was transformed so data would be more spread out.
- More to come on data transformations at the end of this course.
Key Points from Today
- Summarizing QUANTITATIVE DATA
- Measures of Central Tendency: Mean, Median, and Mode
- Mean is arithmetic average and is affected by extreme values
- Median is middle value or average of two middle values
- Median is NOT affected by extreme values
- Numeric mode(s) - value(s) that appear most often
- Not always interesting or useful
- Population and Sample Summary Values
- Population values are fixed constants, referred to as PARAMETERS. They may be unknown.
- Sample statistics vary with each new sample and are RANDOM VARIABLES
To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).