1 Introduction

Let’s load some packages

library(tidyverse)
library(car)
library(dotwhisker)
library(ggcorrplot)

2 Mean and Median

One of the most basic principles of doing data analysis sounds straightforward, but it’s not. I know that most of our high school math teachers taught us about the average (which is also the mean). That’s how we calculate our grades for our courses, figure out how much we typically make it income each month, etc.

The mean works most of the time. However it doesn’t work all the time.

Here’s a key tenet of data analysis:

Don’t lie with data

A lot of politicans on both sides of the aisle use statistics that don’t accurately represent what the data really says. This recent example from Paul Ryan is a perfect illustration of that.

Any estimate about how much the “average” American will save under any tax plan is highly dependent on whether the analysis considering the mean income or the median income.

So, what’s the difference?

Mean - also known as the average. This takes all the values in a dataset and the divides them by the total number of individuals in the dataset.
Median - this takes the middle value of a range of values. If there is an even number of observations it finds the mid point between the two middle observations.

Let’s load some data and see what this looks like in the real world.

low <- read_csv("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/salary1.csv")
high <- read_csv("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/salary2.csv")
both <- read_csv("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/salary3.csv")

What you have here are three datasets. Let’s take a look at each individually.

low

## # A tibble: 100 x 3
##       X1 salary            names
##    <int>  <int>            <chr>
##  1     1   8354 Brendan Reynolds
##  2     2  49227     Jasmine Rios
##  3     3  16250    Yasa Choudhry
##  4     4  34698 Celeste Ratcliff
##  5     5  11044    Alex Hinojosa
##  6     6  31085  Kaylin Shattuck
##  7     7   1392     Erika Dragoo
##  8     8  22707     Joseph Mason
##  9     9  48113    John Santiago
## 10    10  36125        Tam Liles
## # ... with 90 more rows

So, we have a 100 individual’s names and yearly salaries. Let’s find both the mean and median of these 100 salaries.

low %>% summarise(mean = mean(salary), median = median(salary))

## # A tibble: 1 x 2
##       mean  median
##      <dbl>   <dbl>
## 1 26128.75 27551.5

So for this example the mean salary is $26,128 and the median salary is $27,551. Those numbers are relatively close to each other. Really either would be an appropriate one to display in an analysis because they accurately describe the distribution of values.

But, quickly take a look at the second salary dataset: high

high

## # A tibble: 5 x 3
##      X1   salary             names
##   <int>    <int>             <chr>
## 1     1 93411722 Jacqueline Risley
## 2     2 68547290     Abigail Burns
## 3     3 23384409     Marquis Small
## 4     4 58919855   Suheila Scanlon
## 5     5 44938191   Desiree Ramirez

Notice that these salaries are much higher than the previous salary dataset and that there are only 5 total entries here. So, what happens when we combine the 100 modest salaries from the first datasets with the 5 very high salaries from the second dataset? Well, that’s what both does, it combines them together.

Let’s find the mean and the median of this combined dataset.

both %>% summarise(mean = mean(salary), median = median(salary))

## # A tibble: 1 x 2
##      mean median
##     <dbl>  <int>
## 1 2779184  28523

Do you see the problem now? Is it technically accurate to say that the average salary in the dataset is $2.78 million dollars? Yes, it is. But does that accurately describe the financial position of most of the 105 individuals? No, it does not.

The median does a much better job of telling the whole story.

3 Outliers

The issue here is a statistical concept called outliers. An outlier is something that is drastically different than the rest of a statistical distribution. It’s value is abnormally high or abnormally low. If you want a quick way to figure out if your data has outliers, comparing the mean to the median is a pretty good place to start.

Notice how in the “low” dataset the mean and the median are close to each other? That means that there are not a lot of outliers. Actually the same is true for the “high” dataset. However when you compare the mean to the median in the combined dataset “both” you see a dramatic difference. In this case you have outliers. You can see if it we visualize it.

Here’s a visualization of that first dataset “low”

low %>% ggplot(., aes(x=salary)) + geom_histogram()

Compare that to the combined dataset “both”

both %>% ggplot(., aes(x=salary)) + geom_histogram()

You see those five little bars going across the X axis? Those are your outliers. In this situation you must use the median. It’s just more accurate.

By the way, this isn’t just some theoretical exercise. This is really how income is distributed in the United States

Those two bars are the far right are pretty tall, that’s becaused there are a decent number of Americans who make A LOT of money each year.

4 Standard Deviation

One of the most ubiquitous terms in statistics is standard deviation. It’s something that everyone who deals with numbers needs understand.

Here’s a simple definition:

standard deviation - a measure of how spread out numbers are.

Let’s load some data which is just 25000 individual weights and heights.

pop <- read.csv(url("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/population.csv"))

Let’s visualize the distribution of heights:

pop %>% ggplot(., aes(x=height)) + geom_histogram()

I don’t love how blocky that looks, so let’s add more bins.

pop %>% ggplot(., aes(x=height)) + geom_histogram(bins = 300)

That looks a lot better, right?

First, let’s figure out if this is a normal distribution. The simplest way to do that is compare the mean and the median.

pop %>% summarise(mean = mean(height), median = median(height))

##       mean  median
## 1 67.99311 67.9957

Because they are almost exactly alike, we can say that this distribution is about as normal as we are going to get.

Here’s that same graph with a red line for the mean.

pop %>% ggplot(., aes(x=height)) + geom_histogram(bins = 300) + geom_vline(xintercept = mean(pop$height), color = "red")

Now, let’s calculate the standard deviation.

pop %>% summarise(mean = mean(height), sd = sd(height))

##       mean       sd
## 1 67.99311 1.901679

So, the standard deviation is 1.9.

Here’s how that works. It’s called the 68-95-99 rule. What that means is that:

68% of your sample falls with 1 standard deviation of the mean.
95% of your sample falls with 2 standard deviations of the mean.
99% of your sample falls with 3 standard deviations of the mean.

So what does that mean in our example? To figure out one standard deviation just take the mean and add one standard deviation. Then, take the mean and subtract one standard deviations.

For us, here’s the math: 67.99-1.9= 66.09 and 67.99+1.9 = 69.89. So that means 68% of our respondents are between 66.09 inches and 69.89 inches tall. Let’s visualize.

pop %>% ggplot(., aes(x=height)) + geom_histogram(bins = 300) + geom_vline(xintercept = mean(pop$height), color = "red") + geom_vline(xintercept = (mean(pop$height)+ sd(pop$height)), color = "blue") + geom_vline(xintercept = (mean(pop$height)- sd(pop$height)), color = "blue")

So, what about two standard deviations and three standard deviations?

Two standard deviations are between 64.19 and 71.79. Three standard deviations are between 62.29 and 73.69

pop %>% ggplot(., aes(x=height)) + geom_histogram(bins = 300) + geom_vline(xintercept = mean(pop$height), color = "red") + geom_vline(xintercept = (mean(pop$height)+ sd(pop$height)), color = "blue") + geom_vline(xintercept = (mean(pop$height)- sd(pop$height)), color = "blue") + geom_vline(xintercept = (mean(pop$height)+ (sd(pop$height)*2)), color = "green") + geom_vline(xintercept = (mean(pop$height)- (sd(pop$height)*2)), color = "green") +  geom_vline(xintercept = (mean(pop$height)+ (sd(pop$height)*3)), color = "purple") + geom_vline(xintercept = (mean(pop$height)- (sd(pop$height)*3)), color = "purple")

The blue represents one standard deviation. The green is two standard deviations. The purple is three standard deviations.

To be precise we should see that 99.7% of our data falls within three standard deviations of the mean. So, is that the case here? We need to figure out how many people are shorter than 62.29 inches and how many are taller than 73.69 inches. Let’s just use filters to figure it out.

pop %>% filter(height > 73.69 | height < 62.29)

##        X   height    weight
## 1    140 73.90107 151.39130
## 2    176 73.83364 139.29830
## 3    414 62.01666 109.08480
## 4   1164 74.24899 150.21670
## 5   1385 74.19488 129.05970
## 6   1895 75.15280 146.97010
## 7   2397 73.99549 142.90160
## 8   2483 75.11519 153.95620
## 9   2653 60.61265  88.04646
## 10  3698 61.89340  95.74545
## 11  4193 74.03777 139.59530
## 12  4510 74.28376 147.78770
## 13  5643 60.86340 106.19390
## 14  6407 62.23548  94.80998
## 15  6483 61.59011  99.81074
## 16  6629 73.72628 142.81100
## 17  6943 61.40550 119.26520
## 18  7271 73.81695 140.09150
## 19  7841 73.85521 136.06670
## 20  8474 73.95409 145.26950
## 21  8830 74.27270 144.66000
## 22  9227 73.75335 153.10220
## 23  9494 74.05895 133.81720
## 24  9878 61.30021 120.88190
## 25 10242 61.93152  85.29040
## 26 10332 74.36328 164.66430
## 27 10637 73.88574 135.98160
## 28 11175 74.16797 142.77320
## 29 12033 60.86977 108.86330
## 30 13683 74.74047 155.54620
## 31 13973 60.27836 110.11380
## 32 14065 74.04804 149.63030
## 33 14108 61.90725  78.56785
## 34 15211 74.59993 147.03720
## 35 15968 74.25069 150.05670
## 36 16147 74.47517 130.90920
## 37 16387 73.88318 134.21790
## 38 16754 74.84890 122.16640
## 39 17081 74.29570 170.54790
## 40 19007 74.01942 124.23120
## 41 19200 61.82700 100.93910
## 42 19752 62.05222 120.43650
## 43 20610 60.80620 113.91450
## 44 21951 74.42744 141.74160
## 45 22473 74.51784 146.98670
## 46 22509 61.57720  96.81420
## 47 22771 74.19842 141.61480
## 48 22947 61.92639  78.01476
## 49 23041 73.95494 154.39870
## 50 24246 62.26498 104.13480
## 51 24803 74.53177 148.91040

So we have 51 people! That’s out of a sample of 25,000.

51/25000

## [1] 0.00204

That’s .2%, which is very close to the .3% that we should see. Not bad for grabbing some random data.

Here’s a handy graphic to help:

5 Confidence Intervals

When you see a poll result on a newscast or a website, you need to carefully assess what the numbers are. Here’s a good example

A quick glance at these numbers from 2012 indicate that Romney was ahead of Obama in North Carolina. However, take a look at the fine print in the bottom right. It says, “+/- 3%.” That’s usually called the “margin of error” or sometimes the “MoE.” In reality here’s what you have:

Romney’s actual support could be as high as 50% or as low as 44%.
Obama’s actual support could be as high as 46% or as low as 40%.

You know what that means? The race is a statistical deadheat in North Carolina.

We call that margin of error a 95% confidence interval. Remember our discussion of standard deviation? Yes that means we are including values that are about two standard deviations from the mean. Here’s a simple visualization of that concept:

The dots stand for the actual reported percentage. Here it’s Clinton 42% and Trump at 43%. But see the horizontal lines? Those represent the confidence intervals. In very simple statistically simple terms we are 95% confident that Clinton’s actual percentage is on her blue line somewhere and the same for Trump’s red line. That means this race is a dead heat as well.

Anytime we calculate using group_by and summarise in R, we need to include the confidence intervals so that we can determine if there is an actual statistical difference in the means of something.

Let’s return to that CCES data that we used previously.

cces <- read_csv("https://raw.githubusercontent.com/ryanburge/cces/master/CCES%20for%20Methods/small_cces.csv")

Codebook’s here

Let’s say I wanted to take a look at the mean party identification for each of the racial groups in the CCES. You should know how to do that!

cces %>% filter(pid7 <8) %>% group_by(race) %>% summarise(mean = mean(pid7))

## # A tibble: 8 x 2
##    race     mean
##   <int>    <dbl>
## 1     1 3.892838
## 2     2 1.980819
## 3     3 3.070871
## 4     4 3.289033
## 5     5 4.001969
## 6     6 3.190136
## 7     7 4.083791
## 8     8 3.139535

Recall that 1 is “Strong Democrat”, 4 is “Indepedent”, 7 is “Strong Republican”. Which group is the most liberal? It’s African Americans, right? Well is that statistically significant? We can figure that out.

plot <- cces %>% filter(pid7 <8) %>% group_by(race) %>% summarise(mean = mean(pid7),
            sd = sd(pid7), 
            n = n()) %>% 
  mutate(se = sd/sqrt(n),
         lower = mean - qt(1 - (0.05 /2),  n -1) * se,
         upper = mean + qt(1 - (0.05 /2),  n -1) * se)

You will notice that I added a bunch of syntax here. You don’t need to know really what all of it does. However what you do need to know is how to interpret a confidence interval. Let’s take a look at our results.

plot

## # A tibble: 8 x 7
##    race     mean       sd     n         se    lower    upper
##   <int>    <dbl>    <dbl> <int>      <dbl>    <dbl>    <dbl>
## 1     1 3.892838 2.146020 44904 0.01012724 3.872988 3.912688
## 2     2 1.980819 1.416066  7664 0.01617543 1.949111 2.012528
## 3     3 3.070871 1.967544  4995 0.02783919 3.016294 3.125448
## 4     4 3.289033 1.852062  2152 0.03992404 3.210740 3.367327
## 5     5 4.001969 1.946523   508 0.08636298 3.832295 4.171642
## 6     6 3.190136 1.769521  1399 0.04730934 3.097331 3.282941
## 7     7 4.083791 1.768550   728 0.06554683 3.955108 4.212475
## 8     8 3.139535 1.967486   129 0.17322752 2.796775 3.482295

If you look way over to the right you will see “lower” and “upper”. Can you guess what those are? Those are the top and bottom ends of your confidence intervals. So for African Americans, the mean pid is 1.98. It could be as low as 1.94 or as high as 2.02.

You want to visualize that? Of course you do!

plot %>% 
  ggplot(., aes(x=mean, y= race)) + geom_point() +  
  geom_errorbarh(aes(xmin = lower, xmax=upper))

There’s a bunch of stuff to take note of here.

Take a look at how wide some of confidence intervals are and how narrow they are for other groups. The smallest CIs are for white respondents. You want to guess why? That’s because there are nearly 45,000 of them. But look how wide they are for group 8 which is “other.” That’s because there are only 129 of them. It’s a mathematical fact- the larger the group; the smaller the CIs.

You always need to look vertically when comparing means like this. Here’s what I mean. Compare white respondents (Number 1) to Native American respondents (Number 5). If you just plotted the PID without CIs you would say that Native Americans are more conservative politically. But once you add the confidence intervals you see that they are not statistically different.

6 Correlation

Correlation is an important component of social science. It’s a simplified way of describe how two variables relate to each other. If variable 1 goes up does variable 2 go up as well? Or does it go down? We express correlation on a scale from -1 to 1. Obviously positive values mean a positive correlation and negative values equal a negative one.

Here’s how to interpret that:

So, let’s dig into some data.

mtcars is a dataset that comes preloaded with R.

mtcars %>% glimpse()

## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

It’s just a bunch of data about cars. Their weight, gas mileage, engine size, etc. A good starting place would be to see how the weight of a car and it’s gas mileage are related. Here’s how to calculate a correlation using the summarise function.

mtcars %>% summarise(cor(wt, mpg))

##   cor(wt, mpg)
## 1   -0.8676594

You can see that there’s pretty negative correlation with a coefficient of -.87. What that really means is that the slope of the line between the points -.87. Remember that slope is rise over run. Let’s visualize it.

mtcars %>% ggplot(.,aes(x=wt, y=mpg)) + geom_point()

You can almost see it, right? If you were to draw a line between the points what would that line look like? Yes, it would slope downhill. It would be high on the left and low on the right. That makes intuitive sense. The more a car weighs, the less efficient it is in gas mileage. Let’s go ahead and draw that line. Here’s the command.

mtcars %>% ggplot(.,aes(x=wt, y=mpg)) + geom_point() + geom_smooth(method = lm)

You can see that the blue line is downhill, just like we thought. And the darkened areas are 95% confidence intervals. Do you notice how they are pretty tightly packed around the blue line in the center and then “bell out” when you get to do the bottom right? That’s because we have fewer cases out there. Just like we talked about with 95% confidence intervals: the more cases you have, the lower level of uncertainty. You can see that happening right here.

Here’s a neat little addition you can make: how about comparing automatic transmissions to manual transmissions? There’s a variable in the dataset called “am” with 1 being manual and 0 being automatic. Let’s do it with summarise and then let’s visualize.

mtcars %>% group_by(am) %>% summarise(cor(wt, mpg))

## # A tibble: 2 x 2
##      am `cor(wt, mpg)`
##   <dbl>          <dbl>
## 1     0     -0.7676554
## 2     1     -0.9089148

You can see that both still have a negative correlation, but that downside slope is steeper for manual transmissions. Let’s visualize:

mtcars %>% ggplot(.,aes(x=wt, y=mpg, color= as.factor(am))) + geom_point() + geom_smooth(method = lm)

Can you tell which line is for manuals and which is for automatics? The steeper line (on the left) is manuals and the right is automatics. I’ve added a simple legend to make it easier, as well.

Sometimes we want to take a look at several possible correlations at the same time. We can do that pretty easily. First we need to create a small dataset of just the variables that we want to correlate.

We will use the select command and creat a new dataset called “small” that will contain just four variables.

small <- mtcars %>% select(wt, mpg, hp, qsec)

cor(small)

##              wt        mpg         hp       qsec
## wt    1.0000000 -0.8676594  0.6587479 -0.1747159
## mpg  -0.8676594  1.0000000 -0.7761684  0.4186840
## hp    0.6587479 -0.7761684  1.0000000 -0.7082234
## qsec -0.1747159  0.4186840 -0.7082234  1.0000000

Boom, we just took a look at a lot of correlations at once. But we can visualize it pretty easily. Just save that dataset you just made and then plot with ggcorrplot

plot <- cor(small)
ggcorrplot(plot)

You can see the strong positive correlations here in red and the strong negative correlations in purple. Obviously wt correlated 100% with wt which is way it’s dark red.

But, we must always consider this simple fact: correlation does not equal causation.

Some examples:

You can’t just run correlation models left and right and look for some type of relationship. There needs to be some type of theoretical basis between the two ideas.

Take this, for example:

The data is clearly trying to sell a narrative. If you send your kid to private school he or she is much more likely to graduate from a 4 year university in a timely manner. Whoever put this together wants you to be believe that there is something inherent in a private school education that makes your more college ready.

But this data doesn’t control for a really important factor:

Not many poor kids go to private school. And maybe it’s not the fact that private school is better than public, it’s the fact that rich kids are more likely to graduate college than poor kids are. We would need to control for income to determine any real relationship.

7 Regression

The big advantage of regression is that it controls for things. Correlation is just a simple linear relationship between two variables like vehicle weight and gas mileage. There are very few relationships in the social world that are as simple as one thing causing another thing. There are almost always other factors at play. That’s where regression comes in.

Let’s load up a little dataset and walk through a simple example.

df <- read.csv(url("https://raw.githubusercontent.com/ryanburge/pls2003_sp17/master/living.csv"))
df %>% glimpse()

## Observations: 3,617
## Variables: 9
## $ X        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ weight   <int> 208, 250, 165, 140, 100, 160, 213, 220, 130, 130, 135...
## $ height   <int> 68, 73, 69, 67, 61, 67, 63, 65, 66, 68, 67, 68, 67, 6...
## $ male     <int> 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0,...
## $ educ     <int> 3, 12, 9, 10, 14, 7, 12, 9, 12, 9, 12, 10, 14, 5, 6, ...
## $ physical <int> 1, 3, 5, 3, 2, 2, 3, 1, 4, 5, 1, 5, 1, 1, 4, 2, 2, 5,...
## $ stamps   <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ white    <int> 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,...
## $ black    <int> 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0,...

Most of what you are seeing here is self explanatory. “Physical” is amount of self reported physical activity on a 1-5 scale and “stamps” is a dichotomous variable with 1 meaning the person is on food stamps, while 0 means that they are not receiving food stamps.

Let’s run a really simple regression with weight as our dependent variable and height as our independent variable.

reg1 <- lm(weight ~ height, data = df)
summary(reg1)

## 
## Call:
## lm(formula = weight ~ height, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.956  -21.471   -4.964   16.039  191.540 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -135.4398     8.7148  -15.54   <2e-16 ***
## height         4.4986     0.1316   34.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.1 on 3536 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.2484, Adjusted R-squared:  0.2482 
## F-statistic:  1169 on 1 and 3536 DF,  p-value: < 2.2e-16

There’s a lot of output here, but don’t be scared. We are going to look at just a few things. Notice out to the far right there are three stars? That means we have statistical significance. That means that those results are statistically valid. In this case height does predict weight, but how so?

We need to do a little math. Let’s just grab a random example: someone is 70 inches tall. So how much should they weigh according to this data?

4.4986*70 = 314.902, but we can’t stop there because we have to consider the y-intercept. Because it’s negative let’s subtract it from our total.

314.902 - 135.4398 = 179.46. So this data predicts that someone who is 70 inches tall should weight about 179.5 pounds. But this is a really simple example because it doesn’t control for anything.

Gender is probably pretty predictive of weight. Let’s add that to the mix:

reg1 <- lm(weight ~ height + male, data = df)
summary(reg1)

## 
## Call:
## lm(formula = weight ~ height + male, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.398  -21.525   -4.828   15.507  191.030 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -113.5116    11.6647  -9.731  < 2e-16 ***
## height         4.1427     0.1821  22.749  < 2e-16 ***
## male           4.2102     1.4905   2.825  0.00476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.07 on 3535 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.2501, Adjusted R-squared:  0.2497 
## F-statistic: 589.5 on 2 and 3535 DF,  p-value: < 2.2e-16

We still have statistical significance for all variables, so we can use them to predict. Let’s take a MALE who is 72 inches tall. Same basic math as before.

72 * 4.1427 = 298.2744

298.2744-113.5116 = 184.7628

but now we have to add the male variable, which in this case adds 4.21 pounds to the predicted weight.

184.7628 + 4.21 = 188.97

So, if the person was a female we wouldn’t subtract 4.21, we just wouldn’t add it to our total. You can feel free to add as many variables as you want.

reg1 <- lm(weight ~ height + male + black + white + physical + stamps + educ, data = df)
summary(reg1)

## 
## Call:
## lm(formula = weight ~ height + male + black + white + physical + 
##     stamps + educ, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.588  -20.083   -4.453   15.593  190.005 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.1523    11.4468  -9.798  < 2e-16 ***
## height         4.2575     0.1818  23.423  < 2e-16 ***
## male           5.0693     1.4858   3.412 0.000652 ***
## black          8.4466     2.1781   3.878 0.000107 ***
## white         -0.3009     2.1060  -0.143 0.886395    
## physical      -1.4403     0.3727  -3.865 0.000113 ***
## stamps         4.7063     2.1610   2.178 0.029484 *  
## educ          -0.7017     0.1625  -4.319 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.45 on 3530 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.2809, Adjusted R-squared:  0.2795 
## F-statistic:   197 on 7 and 3530 DF,  p-value: < 2.2e-16

I just threw all the variables we had into the analysis. And there is something important to note: one of them does not reach statistical significance: being white. Let’s visualize that.

dwplot(reg1)  + geom_vline(xintercept = 0, colour = "grey50", linetype = 2)

This is called a coefficient plot or a dotwhisker plot. Look at all the variables we have. The dots you see are the coefficients listed in the regression outputs. However, the horizontal red lines represent the range of values that this dot could actually fall on. Every estimate has uncertainty built into it. The vertical dashed line is 0, which means that every variable to the right predicts MORE of something, in this case greater weight. Every variable to the left predicts a lower weight. BUT, look at white. Notice how that horizontal red line intersects with the vertical line? That means that being white can actually predict a greater weight or a lower weight. Because of that is does not have statistical significance and therefore it’s coefficient can be safely ignored.

But, here’s the problem with a dotwhisker plot: you can’t compare the magnitude of the coefficients. If you look at this chart we just plotted it looks like stamps and height have the same predictive power. But they don’t. You know why? Because this chart represents the magnitude of a ONE UNIT increase in that variable. So for height that is one inch taller, but for food stamps that means being on food stamps (remember that variable only has 0 and 1 values). So actually height has many values, so to accurately represent each variable in a dot whisker plot they need to be scaled. The best way to do that is make sure each ranges from 0 to 1. There are a few that needed changed.

summary(df)

##        X            weight          height           male       
##  Min.   :   1   Min.   : 80.0   Min.   :48.00   Min.   :0.0000  
##  1st Qu.: 905   1st Qu.:135.0   1st Qu.:63.00   1st Qu.:0.0000  
##  Median :1809   Median :160.0   Median :66.00   Median :0.0000  
##  Mean   :1809   Mean   :161.9   Mean   :66.09   Mean   :0.3754  
##  3rd Qu.:2713   3rd Qu.:183.0   3rd Qu.:69.00   3rd Qu.:1.0000  
##  Max.   :3617   Max.   :380.0   Max.   :84.00   Max.   :1.0000  
##                 NA's   :44      NA's   :43                      
##       educ          physical         stamps            white       
##  Min.   : 0.00   Min.   :1.000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:10.00   1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :12.00   Median :3.000   Median :0.00000   Median :1.0000  
##  Mean   :11.47   Mean   :2.758   Mean   :0.06525   Mean   :0.6096  
##  3rd Qu.:14.00   3rd Qu.:4.000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :17.00   Max.   :5.000   Max.   :1.00000   Max.   :1.0000  
##                                                                    
##      black       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3196  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

Look at the bottom row, that’s the max value for each variable. So, male is fine and so is stamps, white, and black because all have a max value of 1. But we need to do some recoding and run this regression again. Here’s how we do it:

df <- df %>% mutate(ht = height/84, ed2 = educ/17, phy = physical/5)

You see what that does? I just created three new variables (ht, ed2, and phy) which are the original variables divided by the maximum value for each variable.

Let’s regress again.

reg1 <- lm(weight ~ ht + male + black + white + phy + stamps + ed2, data = df)
summary(reg1)

## 
## Call:
## lm(formula = weight ~ ht + male + black + white + phy + stamps + 
##     ed2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.588  -20.083   -4.453   15.593  190.005 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -112.1523    11.4468  -9.798  < 2e-16 ***
## ht           357.6283    15.2682  23.423  < 2e-16 ***
## male           5.0693     1.4858   3.412 0.000652 ***
## black          8.4466     2.1781   3.878 0.000107 ***
## white         -0.3009     2.1060  -0.143 0.886395    
## phy           -7.2013     1.8633  -3.865 0.000113 ***
## stamps         4.7063     2.1610   2.178 0.029484 *  
## ed2          -11.9294     2.7619  -4.319 1.61e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.45 on 3530 degrees of freedom
##   (79 observations deleted due to missingness)
## Multiple R-squared:  0.2809, Adjusted R-squared:  0.2795 
## F-statistic:   197 on 7 and 3530 DF,  p-value: < 2.2e-16

dwplot(reg1)  + geom_vline(xintercept = 0, colour = "grey50", linetype = 2)

Now, we have an accurate representation. What this shows us is moving across the ENTIRE RANGE of values for each variable. As you can see the real effect for height is MUCH, MUCH bigger than it is for gender, education or anything else.

8 Conclusion

There is so much more to social science statistics. There are z-scores and t-tests and standard errors. But this should give you enough tools in your toolbox to answer a lot questions in a pretty rigorous quantitative way. If you have more questions, just let me know. This stuff is not always easy, but you have to be able to interpret regressions and understand confidence intervals to really read and understand social science.

Intermediate R

Dr. Ryan Burge

2017-11-16