library(tidyverse)
kids_id <- c(1:5)
kid1_spinach <- 1
kid2_spinach <- 2
kid3_spinach <- 3
kid4_spinach <- 4
kid5_spinach <- 5
kid1_height <- 2
kid2_height <- 2.5
kid3_height <- 2.9
kid4_height <- 3.2
kid5_height <- 3.6
spinach <- c(kid1_spinach,
             kid2_spinach,
             kid3_spinach,
             kid4_spinach,
             kid5_spinach)
height <- c(kid1_height,
            kid2_height,
            kid3_height,
            kid4_height,
            kid5_height)
cor_dat <- tibble(kid = kids_id, spinach = spinach, height = height)
plot(cor_dat$spinach,cor_dat$height)
QM Workbook 1
1. Description of Single Variables and Correlations between Two Variables
The fundamental job of quantitative analysts is, first, to describe features of the world using numbers. In a simple case, a quantitative analyst in a financial context might be asked, “What are our best performing stocks?” Or in another business context, an analyst might be asked, “How much are we paying out in insurance claims?” Or in an administrative context, an official in a government agency may want to know the average tax revenues being collected from communities across some political unit of organization, like a city or state. In cases like these, the job of the analyst is to go gather these data and present them in an easily digestible way to decision-makers. This kind of description is very important, because we live in a complex world, and just keeping track of all of our information and keeping it at our fingertips is a very big job.
A second-order job of quantitative analysts is to move beyond describing features of the world with numbers and instead describe the possible relationships between two or more features of the world. For a simple example from the medical world, an analyst might respond to the question, “How much does aspirin help resolve headaches?” In this setup, there are two features of the world that the analyst is interested in: (1) the degree of headache pain, let’s say from 0 (no pain) to 10 (worst pain of your life) and (2) the taking or not-taking of a dose of aspirin. The natural question to ask in this setting is whether (1) and (2) are related. Does aspirin help to relieve headaches?
Correlation
The way quantitative analysts usually think about the relationship between two features of the world is whether those features are correlated or not. If two features of the world tend to occur together, they are called positively correlated, and if one feature tends to occur when the other does not, they are called negatively correlated.
A note here on terminology: for reasons that we will get into in later chapters, quantitative analysts usually call features of the world variables. In short, we call features of the world variables because we believe that these features of the world could, conceptually, be different than they are. For instance, I’m six feet tall. That’s a feature of the world. However, if I had eaten more spinach when I was a kid, maybe I would have grown a little more and be 6’1” now. In a metaphysical sense, I can imagine this feature of my height being different than it is, and in fact my actual height is a function of a complex causal process, the exact parameters of which are unknown to me. You can think of anything this way, and it allows for some intellectual flexibility.
Now, let’s generalize. Let’s say, based on my claim that eating more spinach would have made me taller, that in general eating more spinach tends to make a person taller. Using R, I am going to make a scatterplot of this for five children. (I know this is terrible R code, but I’m being very explicit for clarity).
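The explicit version of this code appears in the chunk at the top of this workbook. For comparison, here is a sketch of a more compact way to build the same five-child data and draw the scatterplot with ggplot2 (this assumes the tidyverse is already loaded):

# The same five children, built directly as a tibble
cor_dat <- tibble(kid = 1:5,
                  spinach = c(1, 2, 3, 4, 5),
                  height = c(2, 2.5, 2.9, 3.2, 3.6))
# Scatterplot of spinach against height
ggplot(cor_dat, aes(x = spinach, y = height)) +
  geom_point()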
As you can see, these two variables, which are likely to vary from person to person, appear to be positively correlated in this sample. The basic job of statistics is to figure out whether this finding from a small sample should make us feel “good” about inferring that this same correlation would be found in other samples of people, and in fact the broader population.
Correlations are useful for at least two things, and possibly a third. First, correlations are definitely useful for describing the world. Obviously, it’s interesting to know that spinach consumption appears to be positively correlated with height. It might lead you to further investigate why this association appears to exist, and whether it is a causal relationship.
Second, our authors tell us that correlations might be useful for forecasting. Forecasting is the same thing as predicting, and I actually prefer the term prediction so we are going to use that more. Because we like to plan for the future, if we think one variable helps us predict the value of another, it helps us make plans. If we know, for instance, clouds in the sky are associated with rain, then we might want to take our umbrella with us when there are dark clouds in the sky, if we don’t want to get wet.
Closely associated with description and prediction is causal inference, which is the formal process of asserting that one variable is causally related to the other - in other words, that the association between variables exists because a change in one variable makes an actual, traceable impact on the second. Causality is a very complicated subject and we won’t get into exactly what it means philosophically or metaphysically, but it suffices to say we’d find our descriptions more interesting and our predictions more satisfying if we can prove an underlying causal relationship.
Identify two different features of the world that you suspect are positively correlated.
Create some fake data in R that “measures” these features of the world and is positively correlated. Make a scatterplot that visualizes the relationship.
### Your code here
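If you want a sense of what this might look like, here is one possible sketch; the variables (hours studied and quiz score) are just an illustration, so pick your own pair:

# Hypothetical example: hours studied and quiz score, positively correlated
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
score <- c(55, 60, 58, 70, 72, 80, 78, 90)   # tends to rise as hours increase
plot(hours, score)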
Identify two different variables that you suspect are negatively correlated.
Create some fake data in R that “measures” these features of the world and is negatively correlated. Make a scatterplot that visualizes the relationship.
Finally, think critically - if you had to generalize from a sample to a population, what characteristics about the sample do you think would be important in giving you confidence that the relationship would also hold in the general population?
### Your code here
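One possible sketch for the negative case; again, the variables (outdoor temperature and heating cost) are just an illustration:

# Hypothetical example: outdoor temperature and heating cost, negatively correlated
temperature <- c(0, 5, 10, 15, 20, 25, 30)
heating_cost <- c(210, 180, 160, 120, 90, 60, 40)   # tends to fall as temperature rises
plot(temperature, heating_cost)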
Measuring Single Variables and Correlations
So far, we haven’t actually discussed how to quantify the characteristics of a single variable, or the characteristics of a correlation, other than to say some correlations are positive and some are negative. However, there are ways to quantify correlations, and they build on the values we use to measure the characteristics of single variables. So, we start with descriptive statistics for single variables and then move on to descriptive statistics for correlations.
Descriptive Statistics for Single Variables
When we start studying the world, it’s best to start with one thing at a time. We start simply by looking at a single feature of the world - in other words, a single variable. Let’s start with something very simple: how tall are humans?
To answer this question, we first have to gather some data. We go out and measure the height of 100 people. Now, whether we know it or not, we have gathered what statisticians would call a vector of data. A vector is a series of numbers.
Now, 100 numbers is a lot of numbers, so we might want to find some ways to summarize 100 data points into a single statistic that captures some characteristics about these data.
The most common statistics you are all probably aware of are the mean and median. To that, we want to add several additional characteristics that help us understand the spread or distribution of data.
The mean and median are ways to approximate the overall value of the data. Both are measures of centrality - a single number that is in the “middle” of the data and thus does the best that a single number can do of summarizing all the values.
The mean is the average value, which is the sum of the values in the vector divided by the number of values. The Greek letter mu, \(\mu\), is often the symbol for the mean in statistics.
I know you all know what a mean is, but here it is in a formula, so you all get used to reading summation notation.
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_{i} \]
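The formula translates directly into R. A quick sketch with a made-up vector:

x <- c(2, 4, 6, 8)
sum(x) / length(x)   # the mean "by hand": (2 + 4 + 6 + 8) / 4 = 5
mean(x)              # R's built-in function gives the same answer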
The median is the “middle value” of the vector. You find the median, as you know, by putting a vector in sequential order from low to high value, then counting inwards and finding the middle value in the sequence. If the vector has an odd number of elements (numbers), the median is just the middle number. If the vector has an even number of elements, then the median is the sum of the two most central numbers, divided by two.
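The same steps can be written out in R; a sketch with a made-up vector of even length, where sort() puts the values in order:

x <- c(7, 1, 5, 3, 9, 11)
sorted_x <- sort(x)               # 1 3 5 7 9 11
(sorted_x[3] + sorted_x[4]) / 2   # average the two middle values: 6
median(x)                         # R's built-in function agrees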
You probably already know this as well, but there is a question about whether the mean or median is the preferable measure of centrality, given the skewness of a vector. If there are outliers in a vector, the median will not be sensitive to them, but the mean will be. Take this example, where the inclusion of an outlier dramatically shifts the mean but barely moves the median. In the data with the big outlier, which value is a better representation of the data as a whole? This decision requires some judgment from an analyst.
no_outliers <- c(2,3,4,2,3,4,5,6,8,1)
### Mean and median are pretty close
mean(no_outliers)
[1] 3.8
median(no_outliers)
[1] 3.5
add_an_outlier <- c(2,3,4,45,3,4,5,6,8,1)
### Mean and median are pretty different
mean(add_an_outlier)
[1] 8.1
median(add_an_outlier)
[1] 4
Practically, in most statistical applications, the mean is the more common measure of centrality, though some would argue that this is a pretty big problem, and more advanced econometric methods do take advantage of the median’s resistance to outliers. A deeper discussion of this is way outside the scope of this course, but it’s an intriguing issue.
Think of an actual situation where you might be faced with a very high outlier in a dataset. Given your hypothetical situation, would you prefer to use the mean or median as a measure of centrality? Why?
# your code here.
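As one hypothetical illustration of the kind of situation to think about, consider salaries at a small firm where the owner's pay is a large outlier:

# Hypothetical salaries (in thousands of dollars); the last value is the owner
salaries <- c(42, 45, 48, 50, 52, 55, 400)
mean(salaries)     # pulled far upward by the outlier
median(salaries)   # barely affected by the outlier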
Now, let’s think a little more carefully about how to characterize spread. The range is a very rough way to characterize how much variation there is in a data set, because it only takes into account the maximum and minimum values. In theory, a vector could have one very high outlier and one very low outlier, and this would make the range a poor indication of the spread of the common values in the dataset.
Create a vector of data where the range is a pretty reasonable measure of the “spread” of the data.
Then, create a vector of data where the range is a less representative measure of the spread of the data.
Why is range ok in one vector but not the other? Explain in your own words.
# your code here.
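A sketch of the kind of contrast you are after (both vectors are made up):

evenly_spread <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
range(evenly_spread)    # the min and max describe the spread reasonably well
mostly_bunched <- c(5, 5, 6, 6, 6, 7, 7, 7, 8, 100)
range(mostly_bunched)   # the single outlier makes the range misleading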
As you just showed, range can sometimes be a problematic way to characterize the spread of data. A better way to think about it, because it uses every value in the vector rather than just the two extremes, is variance.
The formula for variance follows this paragraph. Variance is usually abbreviated in Greek letters as sigma-squared, \(\sigma^{2}\). To calculate variance, take each ith value of the vector X, subtract from each of the ith values the mean of the vector X, and square the resulting difference (remembering that some of these differences will be negative). Do this for every element of vector X, so you have an equally long vector of squared differences. Add all those squared differences up, and divide by the number of numbers (length) of vector X. (Note - we will not be this explicit about all math formulas, but this is for practice).
\[ \sigma^2 = \frac{\sum\limits_{i=1}^N (X_i -\mu_X)^2}{N} \]
Essentially, variance measures the distance between each point in the data and the mean, which is the center of the data set. Then, we can imagine capturing the total variation in the vector by adding up these distances between the values and the “center.” The squaring of the differences just resolves the problem that points below the mean produce “negative” distances. You could theoretically solve this by using absolute values, but the standard practice in statistics is to square the differences, so you get all positive values. Then, you simply add up all of those squared differences and divide by the length of the vector, and that’s the variance.
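Here is a sketch of the variance calculation in R, done by hand and then checked. One caution: R's built-in var() function divides by N - 1 rather than N (for reasons related to sampling that we will not worry about here), so its answer will be slightly larger than the by-hand value from the formula above.

x <- c(1, 2, 5, 3, 9)
deviations <- x - mean(x)                           # distance of each value from the mean
by_hand_variance <- sum(deviations^2) / length(x)   # divide by N, as in the formula
by_hand_variance
sqrt(by_hand_variance)   # taking the square root gives the standard deviation, discussed below
var(x)                   # note: R divides by N - 1, so this is a bit larger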
This is somewhat easier to understand graphically. Study this figure. It’s a one-dimensional number line. The values are spread horizontally away from the center.
d <- data.frame(x = c(1,2,5,3,9,12,11,3,1,10),
                y = c(0,0,0,0,0,0,0,0,0,0))
ggplot(d, aes(x = x,y = y, label = x)) +
geom_point(aes(size=10)) +
geom_text(size = 3, color = "white") +
scale_color_manual(values = c("red", "darkblue")) +
theme(legend.position="none")
#theme(legend.position="none", axis.text.y=element_blank(), axis.ticks.y=element_blank() )
Now, for the sake of visualization, let’s stretch those points up the y-axis, putting each point at the same value on the y-axis as it has on the x-axis (for example, (5,5) in Cartesian plane notation). We’ll draw a line from each value to the mean value, which is 5.7.
mean(c(1,2,5,3,9,12,11,3,1,10))
[1] 5.7
d <- data.frame(x = c(1,2,5,4,9,12,11,3,1,10),
                y = c(1,2,5,4,9,12,11,3,1,10))
ggplot(d, aes(x = x,y = y, label = x)) +
geom_point(aes(size=10)) +
geom_text(size = 3, color = "white") +
theme(legend.position="none")+
#theme(legend.position="none", axis.text.y=element_blank(), axis.ticks.y=element_blank() )+
geom_vline(xintercept=5.7,linetype="dotted",color="red")+
geom_segment(aes(x=1.1,y=1,xend=5.7,yend=1))+
geom_segment(aes(x=2.1,y=2,xend=5.7,yend=2))+
geom_segment(aes(x=3.1,y=3,xend=5.7,yend=3))+
geom_segment(aes(x=4.1,y=4,xend=5.7,yend=4))+
geom_segment(aes(x=5.1,y=5,xend=5.7,yend=5))+
geom_segment(aes(x=8.9,y=9,xend=5.7,yend=9))+
geom_segment(aes(x=9.9,y=10,xend=5.7,yend=10))+
geom_segment(aes(x=10.9,y=11,xend=5.7,yend=11))+
geom_segment(aes(x=11.9,y=12,xend=5.7,yend=12))
If we put all of these line segments end to end, we would have a kind of measure of the amount of “spread” within this vector out from the mean.
Remember that in our calculation we square these distances, so a distance of -2 would be squared and yield a value of 4 ((-2) x (-2) = 4). This creates the problem that the variance is now in a different unit than the original data - we’ve squared it, so it doesn’t really reflect the “true” spread of the data, but a much higher value. So, the typical practice is to take the square root of the variance and call it the standard deviation - literally, you are standardizing the variance.
Create a vector of integers between the values of 0 and 10. Calculate the mean, variance, and standard deviation “by hand” in R - no using functions.
Then, use R’s built in functions to calculate the mean, variance, and standard deviation, to make sure your by-hand calculations were right.
Finally, create 10 vectors of 10 numbers each. The first vector should have a standard deviation of 0, and each following vector should have increasingly larger standard deviations. Calculate the standard deviation each time (use R’s sd() function). Ask yourself - do you see how variation in the vector increases with each iteration?
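If you want a starting point for the last part, one simple approach is to scale a base vector by larger and larger constants, since multiplying a vector by a constant multiplies its standard deviation by that same constant:

base <- c(-2, -1, 0, 1, 2)
sd(rep(5, 5))    # a constant vector has a standard deviation of 0
sd(base * 1)
sd(base * 2)     # twice the spread of base
sd(base * 5)     # five times the spread of base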
Measuring Correlations
Just as we can use descriptive statistics to summarize the vector of values for a single variable, we can use a related set of descriptive statistics to summarize correlations between variables.
Let’s say we have two different features of the world (put another way, variables) - let’s just say it’s study time and grades. Let’s think about students at Johns Hopkins University. We are going to assume that all students could spend their time studying, and all students receive grades.
Now, with univariate descriptive statistics, we can independently describe these two vectors of values, and we could graph them to see visually how they are related. However, it’s also helpful to be able to summarize the relationship between the two of them in a single number or statistic.
There are three statistics that are commonly used to summarize the association between two variables: covariance, correlation coefficient, and slope of the regression line.
Covariance:
Covariance is the average product of the deviations from the mean for two vectors X and Y. Similar to how the variance for a single vector is the sum of the squared deviations from the mean divided by the number of points, covariance is the sum of the products of each pair’s deviations from their respective means, divided by the number of pairs.
\[ cov_{x,y}=\frac{\sum_{i=1}^{N}(X_{i}-\mu_{X})(Y_{i}-\mu_{Y})}{N} \]
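As a quick illustration of the formula (using made-up vectors, not the exercise data below), note that R's built-in cov() function divides by N - 1 rather than N, so a strict by-hand version of the formula will come out a bit smaller:

a <- c(1, 3, 5, 7)
b <- c(2, 3, 7, 8)
sum((a - mean(a)) * (b - mean(b))) / length(a)   # covariance with N in the denominator
cov(a, b)                                        # R's version, which divides by N - 1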
Covariance might be unintuitive, so it’s useful to calculate it by hand. Sorry, not fun, but good for you.
Calculate \(cov (X, Y)\) by hand using R, no functions.
x <- c(1,2,3,4,5)
y <- c(1,2,3,2,4)
Page 28 of the textbook describes several “strong” versions of correlation that will lead to positive or negative covariance.
Create two pairs of X and Y vectors, one of which would show “strong” correlation leading to positive covariance and another which would show negative covariance.
Use your own words to describe what you are seeing.
Correlation Coefficient:
As you can probably see after grappling with the last exercise, the sign of a covariance can tell us about the positive or negative correlation between two vectors, but the magnitude is hard to interpret, because the product of deviations can be much bigger or smaller depending on how much variance is baked into the vectors to begin with. So, the way to solve this is to divide the \(cov(X,Y)\) by the product of the standard deviations of X and Y, \(\sigma_X * \sigma_Y\). This creates a normalized value that must range between -1 and +1. You can choose to really think about this deeply and geometrically if you wanted, or you can just accept it. If two vectors covary as perfectly as possible, their covariance will be equal to the product of the two vectors’ respective standard deviations, so dividing the covariance by the product of the standard deviations will in that case create a value of 1 or -1. Inversely, if two vectors barely covary at all, the value will be 0 or very close to it.
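Written as a formula, the correlation coefficient is:

\[ r_{X,Y} = \frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}} \]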
I assume that because you were able to calculate the covariance by hand a minute ago, you’d be able to calculate the correlation coefficient by hand, so we’ll let R do the hard work. This next exercise requires you to build some intuition about the relationship between covariance and the correlation coefficient.
Create a dataset with two columns X and Y. In one dataset, make X and Y positively (but not perfectly) correlated, such that you can see the correlation with your eye. Graph the scatterplot of X and Y. Calculate the standard deviation of X, the standard deviation of Y, and cov(X,Y). Then, divide the covariance by the product of the standard deviations. Finally, check that you did your math right by calculating the correlation coefficient for X and Y using cor().
In the next dataset, do the same thing but make X and Y negatively (but not perfectly) correlated.
In the last dataset, make X and Y have a very weak, if any, correlation.
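A sketch of the first (positively correlated) case, just to show the moving parts; your own datasets should look different:

set.seed(1)                   # so the fake data is reproducible
X <- 1:50
Y <- X + rnorm(50, 0, 10)     # positively, but not perfectly, related to X
plot(X, Y)
cov(X, Y) / (sd(X) * sd(Y))   # correlation coefficient "by hand"
cor(X, Y)                     # should match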
Slope of the Regression Line:
A final way to quantify the correlation between two variables is the slope of the best-fit line, which is (for largely historical reasons) called the ordinary least squares regression line. We will talk much more about this, and you probably all have at least some working knowledge of regression analysis. But, for right now, let’s just wave some hands at this. There does exist a “best” straight line that passes through any cloud of data and does better than all other possible straight lines at summarizing the relationship between the two variables (in reality, the best-fitting relationship is often not a straight line at all, but straight lines are easy and common, so we focus on these for starters).
The regression coefficient is calculated (again for reasons we are not going to get into right now) as the covariance of X and Y divided by the variance of X:
\[ \frac{cov(X,Y)}{\sigma^{2}_{X}} \]
The regression coefficient tells you how much we should expect Y to change, on average, as X increases by one unit. This is the linear relationship between X and Y. The regression coefficient is helpful because it tells you something about the magnitude of the relationship between X and Y, which is not easily available from just a covariance or correlation coefficient. The slope allows you to predict values of Y based on the values of X (and vice versa).
I have created below some X and Y data where X and Y are positively correlated.
Using only the functions for covariance and variance, calculate the slope of the regression line. Then, use abline() to superimpose the best-fit regression line onto the scatterplot. Note: in this particular instance, the y-intercept of the best fit line is 0. Use ?abline() to figure out how to draw the abline.
x <- runif(100,1,10)
y <- (x*0+x*2) + rnorm(100,0,5)
#### calculate the correlation coefficient
### fill in abline() to draw the regression line
plot(x,y,xlim=c(0,10))
#abline(....)
If X is equal to 5, what is the best guess based on the regression line about the value of Y? Remember that the y-intercept is 0.
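If it helps, here is the general pattern on a different, made-up dataset; the exercise above follows the same shape:

# Hypothetical data where the true y-intercept is 0 and the true slope is 3
x_new <- runif(50, 1, 10)
y_new <- 3 * x_new + rnorm(50, 0, 2)
slope <- cov(x_new, y_new) / var(x_new)   # slope of the best-fit line
plot(x_new, y_new)
abline(a = 0, b = slope)                  # abline(a = intercept, b = slope)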
End of Lesson Questions
- Consider the following three statements. Which of these describe a correlation, and which do not? Why?
- Most professional data analysts took a statistics course in college.
- Among MLB baseball players, pitchers have lower than average batting averages.
- The candidate that wins Ohio tends to win the US presidency.
- The table below shows some data on which countries are major oil producers and which countries experienced a civil war between 1946 and 2004. How, if at all, are being a major oil producer and experiencing civil war positively correlated? Explain. Additionally, do you think this relationship might be causal? Why or why not?