Measures of Dispersion

POLS3316, Instructor: Tom Hanna, Fall 2023, University of Houston

2023-09-27

Plan for today

- Lecture, discussion, a little bit of code
  - A short overview of scalars, vectors, matrices
  - Some practice computing the measures so far
  - More on standard deviation and variance
  - Percentiles, deciles, quartiles, ranges

Data sets for those who needed help (and others)

- I'll post several to Canvas with topics
- Pick one and use it

Measures of Dispersion Continued

Yesterday we covered variance in detail and I told you that standard deviation is the square root of variance. Why do we care?

Example

Two companies with the same mean salary - $70,000.
Company A has a variance of 2,500.
Company B has a variance of 250,000.

Example: Assumptions:

In most companies entry level workers with no seniority are paid the least and workers with the most seniority are paid the most.
You have the same chance of getting hired at Company A or Company B
Salaries are normally distributed

As a new college graduate with interviews at both companies on the same day, which is more appealing?

The distributions: code

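# Simulate salaries for Companies A and B from the mean and standard deviation
set.seed(7385)
A <- rnorm(n = 10000, mean = 70000, sd = 50)   # sd = sqrt(2,500)
B <- rnorm(n = 10000, mean = 70000, sd = 500)  # sd = sqrt(250,000)

# Sample mean and standard deviation (square root of the sample variance)
A_mean <- mean(A)
stdA <- sqrt(var(A))
B_mean <- mean(B)
stdB <- sqrt(var(B))

# Histogram of each company's salaries with the fitted normal curve overlaid
hist(A, density = 20, prob = TRUE,
     main = "Company A: histogram with normal curve", col = "lightblue")
curve(dnorm(x, mean = A_mean, sd = stdA), col = "blue", add = TRUE)

hist(B, density = 20, prob = TRUE,
     main = "Company B: histogram with normal curve", col = "red")
curve(dnorm(x, mean = B_mean, sd = stdB), col = "red", add = TRUE)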

The issue?

The variance vastly overstates the difference in spread: 250,000 is 100 times 2,500.
Why? Because variance is measured in squared units (here, dollars squared).

The answer

By taking the square root of the variance, we get back to the original unit of measure. So, in the example:

Company A has a standard deviation of 50.

Company B has a standard deviation of 500.
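The arithmetic behind those two numbers can be checked in R. This is a minimal sketch using only the variances from the example; the variable names (var_A, sd_A, and so on) are made up for illustration.

```r
# Variances given in the example (units: dollars squared)
var_A <- 2500
var_B <- 250000

# Standard deviation is the square root of variance (units: dollars)
sd_A <- sqrt(var_A)   # 50
sd_B <- sqrt(var_B)   # 500

# How much more spread out is B than A on each scale?
var_ratio <- var_B / var_A   # 100 on the variance scale
sd_ratio  <- sd_B / sd_A     # 10 on the standard deviation scale

print(c(sd_A = sd_A, sd_B = sd_B, var_ratio = var_ratio, sd_ratio = sd_ratio))
```

On the variance scale Company B looks 100 times as spread out as Company A; in dollars it is 10 times.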

The answer

One standard deviation below the mean is $69,950 at Company A.

One standard deviation below the mean is $69,500 at Company B.

The answer

Because of some rules we’ll discuss more later, about 99.7% of observations in a normal distribution fall within 3 standard deviations of the mean.

So, 99.7% of Company A employees make between $69,850 and $70,150 a year.

99.7% of Company B employees make between $68,500 and $71,500 a year.

So, you’d probably better look hard at things besides salary, because there isn’t that much difference after all.
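The salary intervals above can be reproduced with a few lines of R. This is a rough sketch that takes the mean and standard deviations from the example and assumes salaries are exactly normal; the variable names are made up for illustration.

```r
mean_salary <- 70000
sd_A <- 50
sd_B <- 500

# Mean minus and plus 3 standard deviations
range_A <- mean_salary + c(-3, 3) * sd_A   # 69850 70150
range_B <- mean_salary + c(-3, 3) * sd_B   # 68500 71500

# Share of a normal distribution that lies within 3 sd of its mean
share_within_3sd <- pnorm(3) - pnorm(-3)   # about 0.9973

print(range_A)
print(range_B)
print(share_within_3sd)
```

pnorm() is the normal cumulative distribution function, so the difference pnorm(3) - pnorm(-3) is the probability of landing between 3 standard deviations below and 3 above the mean.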

Other Measures

Here are a few other measures you should be aware of both for the midterm and final tests and because they are commonly used:

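Quartiles: divide the data into four equal-sized chunks
Deciles: divide the data into 10 equal-sized chunks
Percentiles: divide the data into 100 equal-sized chunks
Interquartile range: the distance between the 1st and 3rd quartiles, which covers the middle 50% of the data
Minimum: the smallest observation
Maximum: the largest observation
Range: the distance between the minimum and the maximum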
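Each of these measures has a base R function. The sketch below re-simulates Company A's salaries with the same seed used earlier and computes the minimum, maximum, range, and interquartile range; the object names (min_A, spread_A, and so on) are made up for illustration.

```r
# Re-create Company A's simulated salaries
set.seed(7385)
A <- rnorm(n = 10000, mean = 70000, sd = 50)

min_A <- min(A)               # smallest observation
max_A <- max(A)               # largest observation
range_A <- range(A)           # vector holding the minimum and the maximum
spread_A <- diff(range_A)     # range as a single number: max - min

q_A <- quantile(A, probs = c(0.25, 0.75))   # 1st and 3rd quartiles
iqr_A <- IQR(A)                             # 3rd quartile - 1st quartile

print(range_A)
print(spread_A)
print(q_A)
print(iqr_A)
```

Note that range() returns the two endpoints rather than their difference, so diff() is needed to get the range as one number.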

Examples: Quartiles without code

Company A


======================================================
0%         25%        50%        75%        100%
------------------------------------------------------
69,805.630 69,965.440 69,999.740 70,033.350 70,205.860
------------------------------------------------------

median A

[1] 69999.75

Company B


======================================================
0%         25%        50%        75%        100%
------------------------------------------------------
67,958.840 69,664.430 70,005.900 70,346.460 71,966.900
------------------------------------------------------

median B

[1] 70005.9

Examples: Quartiles with code

library(stargazer) # load stargazer to make the output neater
cat('Company A') #This just prints "Company A"

Company A

stargazer(quantile(A), type = "text") # without stargazer, quantile(A) produces the same results, just not as neatly formatted


======================================================
0%         25%        50%        75%        100%
------------------------------------------------------
69,805.630 69,965.440 69,999.740 70,033.350 70,205.860
------------------------------------------------------

cat('median A') #this just prints "median A"

median A

median(A) #this actually prints the median of A

[1] 69999.75

Examples: Quartiles with code (continued)

cat('\n') #prints a blank line

cat('Company B') #prints "Company B"

Company B

stargazer(quantile(B), type = "text") #stargazer makes quantile function print more neatly


======================================================
0%         25%        50%        75%        100%
------------------------------------------------------
67,958.840 69,664.430 70,005.900 70,346.460 71,966.900
------------------------------------------------------

cat('median B')

median B

median(B)

[1] 70005.9

Boxplot or Whisker Plot

The bottom line of the box is the first quartile and the top line is the third quartile, so the height of the box is the interquartile range. The heavy center line is the median. The “whiskers” extend to the smallest and largest observations that fall within 1.5 times the box height of the nearest box edge, or “hinge”; observations beyond the whiskers are drawn as individual points. In this case the plot doesn’t tell us a whole lot, but it can be useful when the distributions are more varied.

Boxplot or Whisker Plot with code

boxplot(A) #create boxplot 

boxplot(B) #create boxplot
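To see the numbers a boxplot is drawn from, base R offers boxplot.stats(). The sketch below re-simulates both companies with the same seed used earlier; the object name stats_A and the side-by-side plot with its title and axis label are illustrative additions, not part of the original slides.

```r
# Re-create both simulated salary vectors
set.seed(7385)
A <- rnorm(n = 10000, mean = 70000, sd = 50)
B <- rnorm(n = 10000, mean = 70000, sd = 500)

# boxplot.stats() returns the values behind a boxplot:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out   = observations beyond the whiskers (drawn as individual points)
stats_A <- boxplot.stats(A)
print(stats_A$stats)
print(length(stats_A$out))

# Drawing both companies on one axis makes the difference in spread visible
boxplot(list(A = A, B = B),
        main = "Salaries at Company A and Company B",
        ylab = "Salary (dollars)")
```

Putting both boxes on a shared axis is one way to compare spreads at a glance, since separate plots each rescale their own axis.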

Deciles: without code

Company A: Deciles

     10%      20%      30%      40%      50%      60%      70%      80% 
69934.94 69957.03 69973.41 69987.09 69999.75 70012.54 70025.64 70040.94 
     90% 
70062.32

Company B: Deciles


Computed the same way: quantile(B, probs = seq(.1, .9, by = .1)) returns the nine cut points that split Company B's salaries into ten equal-sized groups.

Deciles: code

cat('Company A: Deciles')

Company A: Deciles

# probs = seq(.1, .9, by = .1) tells R that, instead of the default
# quartiles, we want the quantiles at a sequence running from .1 to .9
# in steps of .1, that is 0.1, 0.2, 0.3 and so on up to 0.9. These nine
# cut points split the data into ten equal-sized groups: the deciles.

quantile(A, probs = seq(.1, .9, by = .1))

     10%      20%      30%      40%      50%      60%      70%      80% 
69934.94 69957.03 69973.41 69987.09 69999.75 70012.54 70025.64 70040.94 
     90% 
70062.32

cat('Company B: Deciles')

Company B: Deciles

# probs is an argument of quantile(), not stargazer(), so it goes inside the quantile() call
stargazer(quantile(B, probs = seq(.1, .9, by = .1)), type = "text")

Specific percentiles: without code

What if we want to know a specific percentile? Say we want the cutoffs for the bottom and top 1% of salaries. Those are the 1st and 99th percentiles:

Top and bottom 1% salaries at Company A


=====================
1%         99%
---------------------
69,883.270 70,113.590
---------------------

Top and bottom 1% salaries at Company B


=====================
1%         99%
---------------------
68,804.120 71,168.830
---------------------

Specific percentiles: with code

What if we want to know a specific percentile? Say we want the cutoffs for the bottom and top 1% of salaries. Those are the 1st and 99th percentiles:

cat('Top and bottom 1% salaries at Company A')

Top and bottom 1% salaries at Company A

stargazer(quantile(A, probs = c(.01,.99)), type = "text")


=====================
1%         99%
---------------------
69,883.270 70,113.590
---------------------

cat('Top and bottom 1% salaries at Company B')

Top and bottom 1% salaries at Company B

stargazer(quantile(B, probs = c(.01,.99)), type = "text")


=====================
1%         99%
---------------------
68,804.120 71,168.830
---------------------

Summary statistics

Don’t forget we can also get several summary statistics that include the quartiles with summary():

Company A Summary

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  69806   69965   70000   69999   70033   70206

Company B Summary

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  67959   69664   70006   70002   70346   71967
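Summaries like the ones above come from a single call per company. This is a minimal sketch that re-simulates the data with the same seed used earlier; the object names sum_A and sum_B are made up for illustration.

```r
# Re-create both simulated salary vectors
set.seed(7385)
A <- rnorm(n = 10000, mean = 70000, sd = 50)
B <- rnorm(n = 10000, mean = 70000, sd = 500)

# summary() reports the minimum, 1st quartile, median, mean,
# 3rd quartile, and maximum in one call
sum_A <- summary(A)
sum_B <- summary(B)

cat('Company A Summary\n')
print(sum_A)
cat('Company B Summary\n')
print(sum_B)
```

summary() rounds what it prints, so use quantile(), mean(), min(), and max() directly when more digits are needed.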