I will post R/RStudio files on Posit Cloud that you can access through the provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I will demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
After Test 1, I will provide videos on how to download the software (R/RStudio/Quarto) and lecture files to your computer.
💥Lecture 3 In-class Exercises - Q1 💥
Session ID: MAS261f24
In Lecture 2 we discussed the measures of central tendency, the mean, median, and mode.
Which of these measures can ALWAYS be found and is a good measure of central tendency regardless of whether the data has extreme values (high or low), referred to as outliers?
A. Mode
B. Mean
C. Median
Measures of Variability and Deviations
Measures of Central Tendency such as MEAN and MEDIAN indicate where the center of the data is located.
Measures of Variability indicate how spread out the data observations are.
The building block used to calculate many measures of variability is called a DEVIATION.
DEVIATION: How far an observation is above or below the data mean.
DEVIATION (of an observation) = Observation - Mean
Demonstration of calculating deviations (2 options shown):
Notice that deviations are both positive (above the mean) and negative (below the mean)
🤯 R Demo of Calculating Deviations
Notice that some deviations are positive and others are negative.
A positive deviation indicates an observation is above the mean.
A negative deviation indicates an observation is below the mean.
                    mpg mean_mpg   mpg_dev
Mazda RX4          21.0 20.09062  0.909375
Mazda RX4 Wag      21.0 20.09062  0.909375
Datsun 710         22.8 20.09062  2.709375
Hornet 4 Drive     21.4 20.09062  1.309375
Hornet Sportabout  18.7 20.09062 -1.390625
Valiant            18.1 20.09062 -1.990625
Duster 360         14.3 20.09062 -5.790625
Merc 240D          24.4 20.09062  4.309375
Merc 230           22.8 20.09062  2.709375
Merc 280           19.2 20.09062 -0.890625
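The values listed above can be reproduced with a few lines of R. This is a minimal sketch, assuming `my_cars` is the built-in `mtcars` data (its mpg values and mean match the demo); the object name `dev_table` is made up for illustration, and the actual course code may differ.

```r
# Assumption: my_cars is the built-in mtcars dataset
my_cars <- mtcars

# Mean of mpg across all cars in the data
mean_mpg <- mean(my_cars$mpg)

# Deviation = observation - mean, computed for every car at once
mpg_dev <- my_cars$mpg - mean_mpg

# Combine the three columns and keep the car names as row labels
dev_table <- data.frame(
  mpg = my_cars$mpg,
  mean_mpg = mean_mpg,
  mpg_dev = mpg_dev,
  row.names = rownames(my_cars)
)

# Show the first 10 cars
head(dev_table, 10)
```

Because R subtracts the mean from the whole column in one step, no loop is needed.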
Total Sums of Squares (TSS)
We want a measure of the overall spread of the data.
If we sum all of these deviations, we would get zero because the positive and negative deviations cancel out.
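A quick check of this claim, and of why squaring the deviations gives a usable measure of spread. This is a sketch that assumes the built-in `mtcars` data is the data behind the mpg demo; the object names are made up for illustration.

```r
mpg <- mtcars$mpg
dev <- mpg - mean(mpg)

# Positive and negative deviations cancel, so the sum is zero
# (up to tiny floating-point rounding error)
sum_dev <- sum(dev)
round(sum_dev, 10)

# Squaring makes every term non-negative, so the sum no longer cancels.
# This sum of squared deviations is the Total Sum of Squares.
tss <- sum(dev^2)
tss
```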
How many observations (rows) are in this filtered dataset with three variables?
💥Lecture 3 In-class Exercises - Q3 💥
Session ID: MAS261f24
Recall that the deviation for an individual observation is observation minus mean.
What is the deviation from the mean for the height of R2-D2? Round answer to closest whole number and include negative sign if needed.
Height for R2-D2 is 96 cm. Use this code to answer this question.
96 - mean(my_starwars$height)
💥Lecture 3 In-class Exercises - Q4 💥
Session ID: MAS261f24
Recall that \(SST = Variance\times(n-1)\) and \(Variance = \frac{SST}{n-1}\)
What is the SST (the Total Sum of Squares, also written TSS), the total sum of the squared deviations, for the height variable in the StarWars dataset? Round your answer to the closest whole number.
To find this answer, find the variance of height and multiply it by the sample size minus 1 (here, 81 - 1 = 80).
var(my_starwars$height)*80
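The 80 in the code above is the sample size minus 1. As a sketch, the same calculation can take n from the data instead of typing it in. The `heights` vector below is a small made-up example (not the StarWars data), and `sst_from_var` is a hypothetical helper name.

```r
# Hypothetical helper: SST = Variance * (n - 1), with n counted from the data
sst_from_var <- function(x) {
  x <- x[!is.na(x)]   # drop missing values before counting
  n <- length(x)
  var(x) * (n - 1)
}

# Made-up heights in cm, including one missing value (illustration only)
heights <- c(96, 167, 172, 202, 150, 178, NA)

sst_from_var(heights)

# Same quantity computed directly as the sum of squared deviations
sst_direct <- sum((heights - mean(heights, na.rm = TRUE))^2, na.rm = TRUE)
sst_direct
```

Both lines should print the same number, which confirms the relationship between SST and variance.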
💥Lecture 3 In-class Exercises - Q5 💥
Session ID: MAS261f24
Recall that \(CV = \frac{SD}{\overline{X}}\), Standard Deviation divided by sample mean, \(\overline{X}\).
CV is useful for comparing data with different units, e.g., centimeters to inches or US dollars to British pounds, because the units cancel out.
What is the CV, the coefficient of variation, for the height variable in the StarWars dataset? Round your answer to two decimal places.
sd(my_starwars$height)/mean(my_starwars$height)
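To see why CV does not depend on units, here is a sketch with made-up heights (not the StarWars data) measured in centimeters and then converted to inches. The function name `cv` is a hypothetical helper.

```r
# Hypothetical helper: CV = standard deviation / mean
cv <- function(x) sd(x) / mean(x)

# Made-up heights (illustration only)
height_cm <- c(96, 167, 172, 202, 150, 178)
height_in <- height_cm / 2.54   # 1 inch = 2.54 cm

# The standard deviations differ because the units differ
sd(height_cm)
sd(height_in)

# The CVs are the same because the units cancel in the ratio
cv(height_cm)
cv(height_in)
```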
Min, Max, Median, Quartiles, and Percentiles
There are values in the data that are used to understand its variability and distribution.
We’ve already discussed these values:
Minimum: lowest value in the data
Maximum: highest value in the data
Range: Maximum minus Minimum
Median: The middle value or average of two middle values
Also referred to as the 50th percentile or the 2nd Quartile
50% (two quarters) of the observations are below this value
50% (two quarters) of the observations are above this value
Two Additional Informative Values:
25th percentile, also called the 1st Quartile: 25% of the data is below this value
75th percentile, also called the 3rd Quartile: 75% of the data is below this value
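Each of these values can be found individually in R. This is a sketch using the built-in `mtcars` data (an assumption about the data behind `my_cars`). Note that R interpolates between observations when a quartile falls between two values, so other software may report slightly different quartiles.

```r
mpg <- mtcars$mpg

min(mpg)             # Minimum
max(mpg)             # Maximum
diff(range(mpg))     # Range = Maximum - Minimum
median(mpg)          # Median (50th percentile, 2nd Quartile)

# 25th, 50th, and 75th percentiles (Q1, Q2, Q3)
q <- quantile(mpg, probs = c(0.25, 0.50, 0.75))
q
```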
Five Number Summary
Five Number Summary:
Minimum, 25th Percentile, Median, 75th Percentile, Maximum
In R, we use the command summary to calculate these values and also get the mean as a bonus.
summary(my_cars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
NOTE: The summary command ONLY works if R recognizes data as numeric.
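A short illustration of this note, using made-up values stored as text (which can happen when data is imported). The variable names are hypothetical.

```r
# Numbers stored as text instead of as numeric values
mpg_text <- c("21.0", "22.8", "18.7", "14.3")

is.numeric(mpg_text)   # FALSE: R sees these as character data
summary(mpg_text)      # reports Length, Class, and Mode, not Min/Quartiles/Max

# Convert to numeric so that summary gives the expected values
mpg_num <- as.numeric(mpg_text)

is.numeric(mpg_num)    # TRUE
summary(mpg_num)
```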
Visualizing Data
In Lecture 4, we will talk more about visualizing data to understand
Measures of Central Tendency
Measures of Variability
Quartiles and Percentiles (Percentiles are also called Quantiles)
Extreme Values also referred to as Outliers
Comparing different categories
For today, let’s examine how the five number summary can be visualized.
Five Number Summary and Boxplots
Recall the large real estate dataset from Lecture 2.
Today, we filter this dataset to create realtor1 with:
two states, Maine and Vermont.
houses with prices of $1.2 Million or less.
We also make 2 separate datasets:
realtor_VT includes data for Vermont only
realtor_ME includes data for Maine only
MAS 261 students are not responsible for R code to import and filter data.
Five Number Summary for Each State
We use the summary command in R on the price data for each state to find summary values:
Minimum, Q1, Median (Q2), Q3, Maximum and Mean
summary(realtor_VT$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   89000  200000  250091  343750 1200000
summary(realtor_ME$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14999  174500  349900  391567  569000 1200000
In each state summary, the mean is substantially higher than the median, which suggests that a relatively small number of high-priced houses pull the mean upward.
Summary values for Vermont are shown on a boxplot (next slide).
In HW 2, you will annotate a boxplot of the commute-time data used for that assignment.
Excel is NOT RECOMMENDED for finding these values!
Boxplot Annotated with Five Number Summary
Five Number summary for Vermont:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   89000  200000  250091  343750 1200000
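The Vermont data is not reproduced here, so the following sketch shows the same idea with the built-in `mtcars` mpg data. It is one possible way to mark the five values on a boxplot using base R graphics, not necessarily how the slide figure was made.

```r
mpg <- mtcars$mpg

# Five number summary: Min, Q1, Median, Q3, Max
five <- quantile(mpg, probs = c(0, 0.25, 0.50, 0.75, 1))
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
five

# range = 0 extends the whiskers all the way to the minimum and maximum
boxplot(mpg, horizontal = TRUE, range = 0, xlab = "Miles per gallon")

# Mark each of the five values with a dotted line and a label on the top axis
abline(v = five, lty = 3)
axis(side = 3, at = five, labels = names(five))
```

R draws the edges of the box using a slightly different quartile rule than `summary`, so the Q1 and Q3 lines may sit a hair away from the box edges.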
Key Points from Today
Measures of Variability
Deviation is the ‘building block’
Deviation = Observation - Mean
TSS, Variance, Standard Deviation and CV are all related
Range is Maximum minus Minimum
Quartiles, Q1, Q2, Q3 also help describe variability
summary command shows Min., Q1, Q2, Q3, Max. and Mean
Boxplot shows Min., Q1, Q2, Q3, Max.
Boxplot shows additional information (Lecture 4)
To submit an Engagement Question or Comment about material from Lecture 3: Submit it by midnight today (day of lecture).