This example will work through some conceptual issues we face when analyzing data, then cover some basic descriptive statistics and their strengths and weaknesses. We will describe measures of central tendency and variability and how these are affected by outliers in our data.

We then examine the 2015 American Community Survey microdata using some common tidyverse verbs.

Units of Analysis

We can measure phenomena at many levels: individuals, households, neighborhoods, counties, states, or nations. The level at which we measure and analyze data is our unit of analysis.

The Ecological Fallacy

Suppose, for example, that counties with higher poverty rates also have higher mortality rates, and we conclude that any poor individual in such a county faces higher mortality chances. This is not necessarily true: by assigning an association based on aggregate data to an individual, you have just committed the ecological fallacy.

Describing a Single Variable

Measures of central tendency

We can use graphical methods to describe what data ‘look like’ in a visual sense, but graphs are rarely useful for making precise comparisons. To make comparisons, you need to rely on a numerical summary of the data rather than a graphical one (I’m not saying statistical graphics aren’t useful; they are!).

Numerical measures tell us a lot about the form of a distribution without resorting to graphical methods. The first kind of summary statistics we will see are measures of central tendency, which tell us about the central part of the distribution.

Mode

The mode is the most commonly occurring value in the data. It is like the peak in a histogram, the point that occurs most often, and it is an actual value observed in the data.

There may be more than one mode for a variable; such a variable is called multi-modal. Below, x is a unimodal variable and y is a bimodal one:

source("https://raw.githubusercontent.com/coreysparks/Rcode/master/mymode.R")
#make a variable
x<-rpois(1000, lambda = 20)

#make another variable
y<-c(rnorm(500, 20, 10), rnorm(500, 75, 10))


mx<-Mode(x)
Mode(x)
## [1] 21
#make some plots
hist(x,main="Unimodal varable")
abline(v=mx, col="red", lwd=3) #add the mode in as a red line

hist(y,main="Multi-modal varable")

Median

The median is the middle value when observations are ranked from lowest to highest. It divides the data into two groups of equal size; in the language of percentiles, the median is the 50th percentile.

Exactly 50% of the observations will be less than the median and 50% will be greater than the median.

If there is an even number of observations, the average of the two middle values is taken as the median.

There is only one median for a variable and, like the mode, it is usually an actual observed value (the exception is the even-n case above, where the two middle values are averaged).
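
A quick illustration of the even-n case:

median(c(1, 2, 3, 4)) #returns 2.5, the average of the two middle values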

median(x)
## [1] 20
quantile(x)
##   0%  25%  50%  75% 100% 
##    8   17   20   23   35
hist(x, main="Distribution of x")
abline(v=median(x), col="red")

plot(ecdf(x), main="Cumulative distribution function of x showing the median")
abline(v=median(x), col="red")

plot(ecdf(x), main="Cumulative distribution function of x showing quintiles")
abline(v=quantile(x), col=rainbow(5))

plot(ecdf(y), main="Cumulative distribution function of x showing the median")
abline(v=median(y), col="red")

plot(ecdf(y), main="Cumulative distribution function of x showing quintiles")
abline(v=quantile(y), col=rainbow(5))

Mean

The arithmetic mean is the average value of a variable.

There is only one mean for a variable, and it doesn’t necessarily have to be an observed value! It is a parameter of a distribution that we attempt to estimate using data.

Because of the mean’s usage in many more complicated settings, we give it special notation: the population mean is denoted using the Greek letter \(\mu\). \(\mu\) is never observed unless you have the entire population; we try to learn about \(\mu\) by taking samples from the population. When we do this, we calculate the sample mean, written as the variable name with a bar over it, or \(\bar{x}\). \(\bar{x}\) is calculated as:

\(\bar{x}=\frac{\sum_{i=1}^n x_i}{n}\)
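
In R, the built-in mean() does this for us; as a quick check against the formula, using the x simulated above:

mean(x) #built-in
sum(x)/length(x) #the same value, computed from the formula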

The mean of a binary (0/1) variable is a proportion: it gives the share of observations coded 1.
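
A quick sketch with a made-up 0/1 indicator:

z<-c(0, 1, 1, 0, 1) #hypothetical binary variable
mean(z) #0.6, the proportion of observations coded 1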

The mean is a useful statistic (and a widely overused one) for describing the central tendency of a distribution, but it can be severely affected by extreme values, called outliers.

Outliers are values that occur in the extreme tails of a distribution and can artificially inflate (or deflate) the mean value.

Think of money in your pocket as an example. Let’s say we have, on average, $22 in our pockets, but I won the lottery today and have $10,000 in mine. If we calculate the mean using my extreme observation, we get a much more inflated sense of how rich our class is!

Here is an example:

money<-rnorm(10, 20, 5) #ten people with about $20 each
money
##  [1] 10.85954 17.86182 24.28613 23.68658 20.06346 16.36147 30.21948
##  [8] 13.49595 23.28423 15.47240
mo_money<-c(money, 10000) #add one lottery winner
mo_money
##  [1]    10.85954    17.86182    24.28613    23.68658    20.06346
##  [6]    16.36147    30.21948    13.49595    23.28423    15.47240
## [11] 10000.00000
mean(money)
## [1] 19.55911
mean(mo_money) #the single outlier drags the mean far upward
## [1] 926.8719

To deal with this we often use a trimmed mean, which drops a fraction of the observations from each tail of the distribution and takes the average of the rest. This reduces the effect of the extreme values and gives us a more realistic estimate of the mean.

In practice, we may trim between 1 and 10 percent of the observations from each tail to estimate the trimmed mean.

Here, I compare the untrimmed mean to the 10% trimmed mean from the example above.

mean(mo_money)
## [1] 926.8719
mean(mo_money, trim=.1)
## [1] 20.52573

Much more realistic, and representative of the average amount of money in our pockets.
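
The median is similarly robust here; because the outlier occupies only one rank in the sorted data, the middle value barely moves:

median(money)
median(mo_money) #nearly identical: the $10,000 outlier shifts the midpoint by only one rank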

Measures of variation

While measures of central tendency tell us something about the center of a variable's distribution, the variation in that variable is equally important (maybe even more so).

The simplest numerical measure of variability is the range: range = max(x) - min(x).

Unfortunately, the range is very sensitive to outliers; remember our pocket money example:

max(money)-min(money)
## [1] 19.35994
max(mo_money)-min(mo_money)
## [1] 9989.14

The range also gives us no idea about the pattern or shape of the variability, only the difference between the highest and lowest values.

Another measure (and one quickly becoming your instructor’s favorite) is the set of percentiles, or quantiles, of the distribution.

Each percentile represents a location on the cumulative distribution function. The cumulative distribution function gives the cumulative probability that the variable takes a value at or below a particular value.

It is denoted \(F(x)\), and \(F(x)= Pr \left( X \leqslant x \right )\)

If we arrange the n cases from lowest to highest value, the pth percentile of the data is the value below which p% of the data fall, with the remaining (100 - p)% of observations above it.
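
In R, ecdf() returns the empirical version of \(F(x)\) as a function that can be evaluated at any value:

Fx<-ecdf(x) #empirical CDF of the x simulated above
Fx(median(x)) #proportion of observations at or below the median, at least 0.50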

Certain percentiles are often used to describe distributions, for example, the quartiles (the 25th, 50th (median), and 75th percentiles).
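
In R, any percentile can be requested from quantile() via the probs argument; for example, the quartiles of x seen above:

quantile(x, probs = c(0.25, 0.50, 0.75)) #17, 20, 23 for the x simulated above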

The inter-quartile range (IQR) is another measure of variability, typically calculated as the value of x at the 75th percentile minus the value of x at the 25th percentile.

On its own the IQR is not terribly informative, but it is useful when comparing the same variable measured in multiple data sets.
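
Since the IQR is just the difference of two percentiles, we can also compute it by hand:

unname(quantile(x, 0.75) - quantile(x, 0.25)) #23 - 17 = 6, matching IQR(x) below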

One very frequently used set of descriptive statistics is the so-called five number summary, which consists of the minimum, lower quartile, median, upper quartile, and maximum values. This is often useful when the data are skewed (not symmetric). It is what you get when you use the summary() function in R.

summary(x) #actually includes the mean too, so a 6 number summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   17.00   20.00   20.26   23.00   35.00
#inter quartile range
IQR(x)
## [1] 6

Variance

If the data are symmetrically distributed around a single mode, two measures can usually describe the distribution: the mean and the sample variance.

The sample variance is calculated from the observation-specific deviations around the mean:

\(\text{deviation}_i = x_i - \bar{x}\)

A variable with small average deviations from the mean has a lower degree of variability than one with large average deviations. This allows us to compare the variability of two variables, saying one variance is smaller or larger than the other.

Again, like the mean, the variance is a parameter of a distribution that we attempt to estimate using data. In the population, the variance is denoted \(\sigma^2\), but again, this is unknowable using samples, so we resort to estimating the sample variance which is denoted as \(s^2\).

\(s^2 =\frac{\sum_{i=1}^n \left( x_i - \bar{x}\right )^2}{n-1}\)

We use n - 1 in the denominator because we first had to estimate the mean from the same data, which costs us one degree of freedom.

The variance can be thought of as the average squared deviation from the mean, which isn’t terribly interpretable on its own, so we typically take the square root of the variance to have a measure that is on the same scale as the original variable. This is called the standard deviation:

\(s = \sqrt{s^2}\)

var(x)
## [1] 20.03899
sd(x)
## [1] 4.476493
sqrt(var(x)) #same as using sd()
## [1] 4.476493
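
To connect the formulas to the functions, the sample variance and standard deviation can also be computed by hand:

sum((x - mean(x))^2)/(length(x) - 1) #same as var(x)
sqrt(sum((x - mean(x))^2)/(length(x) - 1)) #same as sd(x)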

Really Real data example

Now let’s open a ‘really real’ data file. This is a sample from the 2015 1-year American Community Survey microdata, meaning that each row in these data is a person who responded to the survey in 2015. I get these (and you should too) from the Minnesota Population Center’s IPUMS. IPUMS stands for “Integrated Public Use Microdata Series” and consists of individual person responses from decennial census returns going back to 1850 and from the American Community Survey data from 2001 to the present.

I’m using data from the US, but there is an IPUMS International data series too, which has data from 85 countries and over 300 censuses.

I’ve done an extract (do example in class) and stored the data in Stata format on my github data site. The file we are using is called usa_00045.dta.

There is also a codebook that describes the data and all the response levels for each variable. It is also on my github data page, and is called Codebook_DEM7273_IPUMS2015.

I can read it from github directly by using the read_dta() function in the haven library:

library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
names(ipums) #print the column names
##  [1] "year"      "datanum"   "serial"    "hhwt"      "statefip" 
##  [6] "met2013"   "puma"      "gq"        "pernum"    "perwt"    
## [11] "famsize"   "nchild"    "nchlt5"    "eldch"     "nsibs"    
## [16] "relate"    "related"   "sex"       "age"       "marst"    
## [21] "birthyr"   "fertyr"    "race"      "raced"     "hispan"   
## [26] "hispand"   "bpl"       "bpld"      "citizen"   "yrsusa1"  
## [31] "language"  "languaged" "speakeng"  "educ"      "educd"    
## [36] "empstat"   "empstatd"  "labforce"  "occ"       "ind"      
## [41] "inctot"    "incwage"   "poverty"   "hwsei"     "migrate1" 
## [46] "migrate1d" "carpool"   "trantime"

Now, most variables in the IPUMS can’t be used straight out of the box. For example, open the PDF codebook and find the variable “incwage”, which is a person’s income from wages in the previous year.

We specifically want to pay attention to the “Coder Instructions” (you’re the coder). Notice two codes (values) that are of special note: 999999 = N/A and 999998 = Missing.

So, if we did something silly like:

mean(ipums$incwage)
## [1] 205672.4

This is probably not a valid answer, since the 999999 and 999998 codes are polluting our data. We must recode those values to missing (NA).

Remember, the $ allows us to get a specific variable from a dataset and do something to it.
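
For instance, a base-R equivalent of the dplyr recode used below pulls the variable out with $ and applies ifelse() directly:

#base R version of the recode: set the special codes to missing
mywage_base<-ifelse(ipums$incwage%in%c(999998, 999999), NA, ipums$incwage)
mean(mywage_base, na.rm=TRUE) #matches the meannew column below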

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ipums%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
  summarise(meanold=mean(incwage), meannew=mean(mywage, na.rm=T), n=n())
## # A tibble: 1 x 3
##    meanold  meannew      n
##      <dbl>    <dbl>  <int>
## 1 205672.4 27489.69 300552

We see a difference of an order of magnitude in the mean. What about the median?

ipums%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
  summarise(medianold=median(incwage), mediannew=median(mywage, na.rm=T), n=n())
## # A tibble: 1 x 3
##   medianold mediannew      n
##       <dbl>     <dbl>  <int>
## 1     20000      7000 300552

OK, that seems really low; maybe we should limit the data to people who are in the labor force and who are 18 or older.

ipums%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
  filter(labforce==2, age>=18) %>%
  summarise(mednold=median(incwage), mednew=median(mywage, na.rm=T), n=n())
## # A tibble: 1 x 3
##   mednold mednew      n
##     <dbl>  <dbl>  <int>
## 1   31200  31200 144292

Nice.

Now what if we wanted to compare the incomes of men to women?

ipums%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
  filter(labforce==2, age>=18) %>%
  mutate(sexrecode=ifelse(sex==1, "male", "female")) %>%
  group_by(sexrecode)%>%
  summarise(mednew=median(mywage, na.rm=T), sdwage=sd(mywage, na.rm=T), n=n())
## # A tibble: 2 x 4
##   sexrecode mednew   sdwage     n
##       <chr>  <dbl>    <dbl> <int>
## 1    female  27000 43054.00 68579
## 2      male  38000 69252.82 75713

We see that men have a higher median income than women, and also that the variation (standard deviation) in men’s incomes is larger than in women’s.

We could also see how incomes are different in San Antonio (met2013==41700) compared to Dallas (met2013==19100).

ipums%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA, incwage))%>%
  filter(labforce==2, met2013%in%c(41700, 19100), age>18) %>%
  mutate(sexrecode=ifelse(sex==1, "male", "female"), city=ifelse(met2013==41700, "San Antonio", "Dallas")) %>%
  group_by(sexrecode, city)%>%
  summarise(medinc=median(mywage, na.rm=T),meaninc=mean(mywage, na.rm=T), sdwage=sd(mywage, na.rm=T), n=n())
## # A tibble: 4 x 6
## # Groups:   sexrecode [?]
##   sexrecode        city medinc  meaninc   sdwage     n
##       <chr>       <chr>  <dbl>    <dbl>    <dbl> <int>
## 1    female      Dallas  32000 42242.96 49528.57  1437
## 2    female San Antonio  25000 31570.21 34448.76   424
## 3      male      Dallas  40000 63841.47 82153.03  1744
## 4      male San Antonio  36000 49153.22 58692.36   466