POLS3316, Instructor: Tom Hanna, Fall 2023, University of Houston
2023-09-13
Measures of Dispersion (Variation or Spread)
Sample data from the USArrests data in R, specifically the data on arrests for assault and percent urban population.
State level data
50 observations
In this code, I specify that I only want columns 2 and 3 with [2:3]
#Create an object named arrests_data
#assign with the left assignment operator
#built-in R dataset USArrests, columns 2 and 3 [2:3]
arrests_data <- USArrests[2:3]
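Selecting columns by position works, but it depends on the column order of the dataset. A minimal sketch of the same selection by column name, assuming the built-in USArrests column names Assault and UrbanPop (the object name arrests_by_name is made up for illustration):

```r
# Select the same two columns by name instead of by position
arrests_by_name <- USArrests[c("Assault", "UrbanPop")]

# Check that both approaches give the same data frame
identical(arrests_by_name, USArrests[2:3])
```

Selecting by name is a little longer to type, but it keeps working if the columns are ever reordered.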
Look at the data
First, we’ll get acquainted with our data by looking at the head of the dataframe and two measures of central tendency that we talked about yesterday, the mean and median.
#get the first 6 rows of the data
head(arrests_data)
We need to know the center to find the spread around the center
The mean is part of the formula for variance
#get the mean for both variables and store the means in objects
mean_assault_arrests <- mean(arrests_data$Assault)
mean_urban_population <- mean(arrests_data$UrbanPop)
#print the means to the screen
mean_assault_arrests
[1] 170.76
mean_urban_population
[1] 65.54
Find the Center (Median)
Why? We want to know if the data is skewed
#get the medians and store as objects
median_assault_arrests <- median(arrests_data$Assault)
median_urban_population <- median(arrests_data$UrbanPop)
#return the values to the screen
cat('Median Assault Arrests')
Median Assault Arrests
median_assault_arrests
[1] 159
cat('Median Urban Population')
Median Urban Population
median_urban_population
[1] 66
Skewed distribution - when mean and median are different
The three measures of central tendency (mean, median, and mode) are often different for the same sample or population.
Example:
Negatively skewed, Normal, and Positively Skewed distributions
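The mean versus median comparison can be sketched on the arrests data. This is only a rule of thumb, not a formal test of skewness, and the object names assault_gap and urban_gap are made up for illustration:

```r
# Gap between mean and median for each variable (built-in USArrests data)
assault_gap <- mean(USArrests$Assault) - median(USArrests$Assault)
urban_gap <- mean(USArrests$UrbanPop) - median(USArrests$UrbanPop)

# A positive gap (mean above median) hints at positive (right) skew;
# a negative gap hints at negative (left) skew
assault_gap
urban_gap
```

Using the values printed earlier, the assault gap is 170.76 - 159 = 11.76, which suggests some positive skew, while the urban population gap is 65.54 - 66 = -0.46, which is close to symmetric.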
Scattered around the mean
Measures of dispersion typically look at how the data is scattered around the mean.
Creating a measure of dispersion: distance to mean
So, we could define a measure of dispersion or variation that is the total length of the colored lines.
Our formula in English would be “the sum of the differences between each observation and the mean”
Problem with sum of distances
The problem is that, by the definition of the mean, the positive distances cancel out the negative ones, so this measure of dispersion or variation would always be zero!
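A quick way to see the cancellation on real data, as a minimal sketch using the assault arrests column (the object names assault_deviations and sum_of_deviations are made up for illustration):

```r
# Signed distances from the mean for assault arrests (built-in USArrests data)
assault_deviations <- USArrests$Assault - mean(USArrests$Assault)

# The positive and negative distances cancel, so the sum is zero
# (up to tiny floating point rounding error)
sum_of_deviations <- sum(assault_deviations)
abs(sum_of_deviations) < 1e-8
```

The comparison against a small tolerance is there because computers store decimals approximately, so the sum may print as a very small number rather than exactly 0.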
Simple Data Example
Suppose we had a very simple data set with only two observations - 5 and 15. The mean is 10. One is 5 above the mean and one is 5 below the mean.
Code for Example
simple_data <- c(5,15) #creating simple data
point1 <- simple_data[1] #The 1st data point called from simple_data
cat('Point 1') #outputs the text Point 1
Point 1
point1
[1] 5
point2 <- simple_data[2] #2nd data point
cat('Point 2') #Outputs the text "Point 2"
Point 2
point2 #outputs the content of object named point2
[1] 15
cat("\n") #Outputs a line break or enter (\n)
simple_mean <- mean(simple_data) #the mean
cat('The mean') #Outputs the text "The mean"
The mean
simple_mean #outputs the contents of object simple_mean
[1] 10
Results from Example
Point 1
[1] 5
Point 2
[1] 15
The mean
[1] 10
Distance from mean code
So, the distance from the mean is:
point1_dist <- simple_data[1] - simple_mean #math using objects to create new object
cat('Distance 1')
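The rest of the calculation for this two-point example can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the names point2_dist and squared_distances are hypothetical, and simple_variance is assumed to be the population variance (dividing by n) that the later code takes the square root of.

```r
simple_data <- c(5, 15)            # the two observations
simple_mean <- mean(simple_data)   # the mean, 10

# Signed distance of each point from the mean
point1_dist <- simple_data[1] - simple_mean   # 5 - 10 = -5
point2_dist <- simple_data[2] - simple_mean   # 15 - 10 = 5

# The signed distances cancel out
point1_dist + point2_dist                     # 0

# Squaring removes the signs; averaging over n gives the population variance
squared_distances <- c(point1_dist, point2_dist)^2
simple_variance <- mean(squared_distances)    # (25 + 25) / 2 = 25
simple_variance
```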
#This is the population variance
#R actually computes the sample variance
#To convert the sample variance to the population variance we can multiply by (n-1)/n
#In this case, n = 2
simple_variance_check <- var(simple_data) * (2-1)/2
cat('Checking with R')
Checking with R
simple_variance_check
[1] 25
Solution: Average Squared Difference - Variance
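We want the average of the distances, or more precisely the average of the squared differences.
So our measure of variance, in its simplest form, is the sum of the squared differences between each observation and the mean, divided by the number of observations: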
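In symbols, assuming the population form that divides by n (which is the version that gives 25 for the two-point example), and with symbols chosen here for illustration: x_i is each observation, x-bar is the mean, and n is the number of observations.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
\qquad\text{e.g.}\qquad
\sigma^2 = \frac{(5 - 10)^2 + (15 - 10)^2}{2} = \frac{25 + 25}{2} = 25
```

The sample variance that R's var() returns uses n - 1 in the denominator instead of n.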
The variance:
[1] 25
Checking with R
[1] 25
Squares inflate the results
Squares inflate the numbers relative to the size of the mean.
25 is 2.5 times the mean.
But the distances aren’t really that big
Average distance is still 5
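The inflation is easy to see by putting the two averages side by side. A minimal sketch for the two-point example, with the object names mean_abs_distance and mean_sq_distance made up for illustration:

```r
simple_data <- c(5, 15)
simple_mean <- mean(simple_data)

# Average absolute distance from the mean: (5 + 5) / 2
mean_abs_distance <- mean(abs(simple_data - simple_mean))

# Average squared distance from the mean (population variance): (25 + 25) / 2
mean_sq_distance <- mean((simple_data - simple_mean)^2)

# How large the variance is relative to the mean
mean_sq_distance / simple_mean
```

The average absolute distance is 5, the average squared distance is 25, and the ratio of the variance to the mean is 2.5.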
Solution: Square root of the variance
To partially account for this we can take the square root of the variance
That gives us our next measure
Standard deviation
The standard deviation is the square root of the variance.
simple_standard_deviation <- sqrt(simple_variance) #take the square root
cat('Standard deviation is:') #print to screen
Standard deviation is:
simple_standard_deviation #print to screen
[1] 5
simple_sd_check <- sqrt(simple_variance_check) #take the square root
cat('Checking with R:')
Checking with R:
simple_sd_check
[1] 5
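R also has a built-in sd() function, but like var() it computes the sample version (dividing by n - 1). A minimal sketch of converting it to the population standard deviation used in this example; the object names sample_sd and population_sd are made up for illustration:

```r
simple_data <- c(5, 15)
n <- length(simple_data)

# sd() returns the sample standard deviation (divides by n - 1)
sample_sd <- sd(simple_data)

# Rescale to the population standard deviation (divides by n)
population_sd <- sample_sd * sqrt((n - 1) / n)
population_sd
```

Here the sample standard deviation is the square root of 50 (about 7.07), and the rescaled population value matches the 5 found above. With only two observations the two versions differ a lot; with larger samples such as the 50 states the difference is much smaller.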
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.