Option 2: Start with a free Posit Cloud account, then later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures.
You can use either Posit Cloud or your laptop.
💥 Lecture 3 In-class Exercises - Q1 (Review) 💥
Session ID: mas261f23
In lecture 2 we discussed the measures of central tendency, the mean, median, and mode.
Which of these measures can ALWAYS be found and is a good measure of central tendency regardless of whether the data has extreme values (high or low), referred to as outliers?
A. Mode
B. Mean
C. Median
Measures of variability - Deviations are Building Blocks
Measures of Central Tendency such as MEAN and MEDIAN indicate where the center of the data is located.
Measures of Variability indicate how spread out the data observations are.
The building block used to calculate these measures is called a DEVIATION.
DEVIATION: How far an observation is above or below the data mean.
DEVIATION (of an observation) = Observation - Mean
Demonstration of calculating deviations (2 options shown):
Notice that deviations can be positive (above the mean) or negative (below the mean).
🤯 R Demo of Calculating Deviations
Notice that some deviations are positive and others are negative.
A positive deviation indicates an observation is above the mean.
A negative deviation indicates an observation is below the mean.
                    mpg mean_mpg   mpg_dev
Mazda RX4          21.0 20.09062  0.909375
Mazda RX4 Wag      21.0 20.09062  0.909375
Datsun 710         22.8 20.09062  2.709375
Hornet 4 Drive     21.4 20.09062  1.309375
Hornet Sportabout  18.7 20.09062 -1.390625
Valiant            18.1 20.09062 -1.990625
Duster 360         14.3 20.09062 -5.790625
Merc 240D          24.4 20.09062  4.309375
Merc 230           22.8 20.09062  2.709375
Merc 280           19.2 20.09062 -0.890625
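The table above can be reproduced with a short sketch. It assumes the values come from R's built-in mtcars data (the car names and mpg values match that dataset); the data frame name cars_dev is hypothetical and chosen only for this example.

```r
# Assumption: the table above is based on R's built-in mtcars data.
# cars_dev is a hypothetical name chosen for this sketch.
cars_dev <- data.frame(mpg = mtcars$mpg, row.names = rownames(mtcars))

# Mean of all mpg values, repeated on every row
cars_dev$mean_mpg <- mean(cars_dev$mpg)

# Deviation = observation - mean
cars_dev$mpg_dev <- cars_dev$mpg - cars_dev$mean_mpg

# First 10 rows, as in the table
head(cars_dev, 10)
```

Subtracting a single number (the mean) from a whole column works because R applies the subtraction to every row at once.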
Total Sums of Squares (TSS)
We want a measure of the overall spread of the data.
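If we summed all of these deviations, we would get zero, because the positive and negative deviations cancel each other out. Squaring each deviation before summing avoids this cancellation.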
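A minimal sketch of why the deviations are squared before they are summed. It assumes the mpg values from R's built-in mtcars data; the names mpg_dev, sum_dev, and tss are hypothetical.

```r
# Assumption: mpg from R's built-in mtcars data, as in the demo table.
mpg <- mtcars$mpg
mpg_dev <- mpg - mean(mpg)

# Positive and negative deviations cancel: zero up to rounding error
sum_dev <- sum(mpg_dev)

# Squaring makes every term non-negative, so the sum measures spread
tss <- sum(mpg_dev^2)

sum_dev
tss
```

The first result may print as a very small number rather than exactly 0 because of floating-point rounding.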
How many observations (rows) are in this filtered dataset with three variables?
💥 Lecture 3 In-class Exercises - Q3 💥
Recall that the deviation for an individual observation is observation minus mean.
What is the deviation from the mean for the height of R2-D2? Round answer to closest whole number and include negative sign if needed.
Height for R2-D2 is 96 cm. Use this code to answer this question.
96 - mean(my_starwars$height)
💥 Lecture 3 In-class Exercises - Q4 💥
Recall that \(SST = Variance\times(n-1)\) and \(Variance = \frac{SST}{n-1}\)
What is the SST (called TSS earlier in this lecture), the total sum of the squared deviations, for the height variable in the StarWars dataset? Round answer to closest whole number.
To find this answer:
find the variance of height and multiply it by the sample size minus 1
var(my_starwars$height)*80
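In the code above, 80 is the sample size minus 1. The sketch below shows the same calculation with n counted from the data instead of typed in, and checks it against the sum of squared deviations. The vector heights is a short hypothetical stand-in, not the full my_starwars data.

```r
# Hypothetical stand-in for my_starwars$height (cm); not the full dataset
heights <- c(172, 167, 96, 202, 150)

n <- length(heights)                    # sample size, counted from the data
sst_from_var <- var(heights) * (n - 1)  # SST = Variance x (n - 1)

# Same quantity computed directly from squared deviations
sst_direct <- sum((heights - mean(heights))^2)

sst_from_var
sst_direct
```

If a column contains missing values (NA), var() returns NA unless na.rm = TRUE is supplied, and n should then count only the non-missing values.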
💥 Lecture 3 In-class Exercises - Q5 💥
Recall that \(CV = \frac{SD}{\overline{X}}\), Standard Deviation divided by sample mean, \(\overline{X}\).
CV is useful for comparing data with different units, e.g. centimeter data to inches, or US dollars to British pounds because the units are factored out.
What is the CV, the coefficient of variation, for the height variable in the StarWars dataset? Round answer to two decimal places.
sd(my_starwars$height)/mean(my_starwars$height)
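A small sketch of why CV is unit-free: converting the same heights from centimeters to inches changes the standard deviation and the mean, but not their ratio. The data values and the helper cv() are hypothetical, defined here only for illustration.

```r
# Hypothetical heights in cm (a stand-in for my_starwars$height)
heights_cm <- c(172, 167, 96, 202, 150)
heights_in <- heights_cm / 2.54   # same heights converted to inches

# cv() is a helper defined here for illustration, not a base R function
cv <- function(x) sd(x) / mean(x)

cv(heights_cm)
cv(heights_in)   # same value: the unit conversion cancels in the ratio
```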
Quartiles, Percentiles, and Five Number Summary
There are other key values in the data that are used to understand its variability and distribution.
We’ve already discussed these values:
Minimum: lowest value in the data
Maximum: highest value in the data
Range: Maximum minus Minimum
Median: The middle value or average of two middle values
Also referred to as the 50th percentile or the 2nd Quartile
50% (two quarters) of the observations are below this value
50% (two quarters) of the observations are above this value
Two Additional Informative Values:
25th percentile, also called the 1st Quartile: 25% of the data is below this value
75th percentile, also called the 3rd Quartile: 75% of the data is below this value
Five Number Summary:
Minimum, 25th Percentile, Median, 75th Percentile, Maximum
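As a sketch, the five values can also be requested one percentile at a time with quantile(). This assumes the mpg values from R's built-in mtcars data; the name five_num is hypothetical.

```r
# Assumption: mpg from R's built-in mtcars data.
mpg <- mtcars$mpg

# Minimum, Q1, median, Q3, maximum via quantile() with its default method
five_num <- quantile(mpg, probs = c(0, 0.25, 0.50, 0.75, 1))
five_num

# Range = maximum - minimum
max(mpg) - min(mpg)
```

Different software (and different quantile methods) can give slightly different quartiles on small datasets, so small discrepancies between tools are expected.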
Five Number Summary in R (with a Bonus Mean)
In R, you can use the command summary to find these values:
summary(my_cars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
NOTE: The summary command ONLY works if R recognizes data as numeric.
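A short sketch of what the note above means in practice. The vector mpg_text is a hypothetical example of numbers that were accidentally stored as text.

```r
# Hypothetical example: numbers accidentally stored as text
mpg_text <- c("21.0", "22.8", "18.7")

is.numeric(mpg_text)        # FALSE: R sees character data
summary(mpg_text)           # reports Length, Class, Mode instead of quartiles

# Convert to numeric, then summary() gives Min., quartiles, Mean, Max.
mpg_num <- as.numeric(mpg_text)
is.numeric(mpg_num)         # TRUE
summary(mpg_num)
```

Checking is.numeric() first is a quick way to find out why summary() is not showing the expected values.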
Visualizing Data
In Lecture 4, we will talk more about visualizing data to understand:
Measures of Central Tendency
Measures of Variability
Quartiles and Percentiles (Percentiles are also called Quantiles)
Extreme Values also referred to as Outliers
Comparing different categories
For today, let’s examine how the five number summary can be visualized.
Five Number Summary and Boxplots
Recall the large real estate dataset from Lecture 2.
Today, we filter this dataset to create realtor1 with:
two states, Maine and Vermont.
houses with prices of $1.2 Million or less.
We also make 2 separate datasets:
realtor_VT includes data for Vermont only
realtor_ME includes data for Maine only
MAS 261 students are not responsible for R code to import and filter data.
Five Number Summary for Each State
We use the summary command in R on the price data for each state to find summary values:
Minimum, Q1, Median (Q2), Q3, Maximum and Mean
summary(realtor_VT$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   89000  200000  250091  343750 1200000
summary(realtor_ME$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14999  174500  349900  391567  569000 1200000
Notice that in each summary, the mean is substantially higher than the median.
Summary values for Vermont are shown on the following box plot.
In HW 1, you will annotate a boxplot of the commute time data used for that assignment.
Excel is NOT RECOMMENDED for finding these values!
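To put a number on "substantially higher", the sketch below uses only the mean and median values printed in the two summaries above. A mean well above the median often suggests that a few high-priced houses are pulling the mean upward. The object names are hypothetical.

```r
# Values copied from the summary() output above
vt <- c(median = 200000, mean = 250091)   # Vermont
me <- c(median = 349900, mean = 391567)   # Maine

# How far the mean sits above the median, in dollars
vt_gap <- unname(vt["mean"] - vt["median"])
me_gap <- unname(me["mean"] - me["median"])

vt_gap
me_gap
```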
Boxplot Annotated with Five Number Summary
Five Number Summary for Vermont:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10000   89000  200000  250091  343750 1200000
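A rough sketch of how a boxplot labeled with its five values could be drawn in base R. The real estate data is not loaded here, so mpg from R's built-in mtcars data stands in for price; this is not the code used to make the slide.

```r
# Assumption: mpg from R's built-in mtcars data stands in for the price data.
mpg <- mtcars$mpg

# range = 0 extends the whiskers to the minimum and maximum
b <- boxplot(mpg, horizontal = TRUE, range = 0, main = "mpg (mtcars)")

# b$stats holds the five values that boxplot() draws
five <- as.numeric(b$stats)

# Write each value just above the box, at its position on the axis
text(x = five, y = 1.45, labels = round(five, 1), cex = 0.8)
```

The box edges drawn by boxplot() can differ slightly from the 1st Qu. and 3rd Qu. values reported by summary(), because the two functions use slightly different quartile rules.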
Key Points from Today
Continuation of summarizing QUANTITATIVE DATA
Measures of Variability
Deviation is ‘building block’
Deviation = Observation - Mean
TSS, Variance, Standard Deviation and CV are all related
Range is Maximum minus Minimum
Quartiles, Q1, Q2, Q3 also help describe variability
summary command shows Min., Q1, Q2, Q3, Max. and Mean
Boxplot shows Min., Q1, Q2, Q3, Max.
Boxplot shows additional information (Lecture 4)
To submit an Engagement Question or Comment about material from Lecture 3, submit it by midnight today (day of lecture). Click on the link next to the ❓ under Lecture 3.