Overview of Lesson

This lesson familiarizes you with R by having you create ‘fake’ (simulated) data. Using fake data provides an intuitive way to understand exactly what is going on when we tell R to run a regression. By the end of this lesson you should be able to:

  1. Create your own fake data.
  2. Use basic commands and plots to explore data.

What is regression anyway?

Regression analysis is a statistical technique that allows researchers to probe the relationship between variables. In bivariate analysis we assess the relationship between one predictor and one response variable. To give an example, a bivariate analysis might consider how being male (predictor) affects the probability of voting for Candidate A (response). Of course, bivariate analyses are rare because the questions that puzzle social scientists are often quite complex. In the previous example, voting for Candidate A surely cannot be entirely explained by gender alone. Multiple regression, where we have one response but multiple predictors, is much more common and forms the foundation of quantitative analysis.
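
As a quick preview of the two syntaxes in R, here is a minimal sketch. The variables male, years, and vote are invented purely for illustration and appear nowhere else in this lesson:

male  <- rbinom(50, 1, 0.5)                     #hypothetical dummy predictor
years <- rnorm(50, 45, 10)                      #hypothetical continuous predictor
vote  <- 0.3 + 0.1*male - 0.002*years +         #hypothetical response
         rnorm(50, 0, 0.1)

lm(vote ~ male)                                 #bivariate regression: one predictor
lm(vote ~ male + years)                         #multiple regression: several predictors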

Creating Data

Playing Nature

For each one of our lessons we will be playing with fake datasets. Typically, social scientists are confronted with real-world data. The data may come from a variety of sources and be of different types, experimental or observational. What is true of all data is that they are produced by a particular social process which is unseen by the researcher. We call this the data generating process, or DGP. By creating our own data we possess an advantage over the typical researcher, because we can set the particular DGP for each one of our variables. Another way of thinking of the DGP is to assume that there is some entity we call Nature that determines how each variable is created. Thus, we can play Nature by creating our own data for analysis, fully knowing their relationships ex ante.
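
As a minimal sketch of the idea (the numbers here are arbitrary and are not used later in the lesson):

set.seed(1)                   #arbitrary seed, so the toy example is replicable
x <- rnorm(50)                #Nature draws a predictor
y <- 2 + 3*x + rnorm(50)      #Nature decrees that y = 2 + 3x plus random noise
                              #A researcher only observes (x, y); we, as Nature,
                              #know the true DGP.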

Support for the mayor

Today’s running example will consider an individual’s support for the local mayor. The response variable is a feeling thermometer that ranges from 0 to 100, with warmer temperatures indicating more approving views of the mayor. Call this thermometer. The key predictors include:

  • age, a continuous random variable that is normally distributed,
  • college, a dichotomous or “dummy” random variable that takes a value of 1 if the individual completed a 4-year college degree,
  • income, a continuous random variable that indicates the individual’s income in thousands of $US.

We create each of these variables in turn in R using the following script:

## Create Predictors
## We take 100 draws from each of the following distributions.

set.seed(333)                 #We set the seed so that our work is replicable; any number is fine

age<-rnorm(100,45,15)         #This creates the variable *age*, which is drawn from the normal distribution
                              #with mean 45 and st. dev. 15.

college<-rbinom(100,1,0.3)    #This creates the variable *college*, which is drawn from the binomial
                              #distribution, where the probability of obtaining a draw with value 1 is 0.3

income<-10*rbeta(100,2,8)     #This creates the variable *income*, which is drawn from the beta distribution
                              #with shape parameters 2 and 8 (scaled by 10), giving it a positive skew.
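
Before moving on, it is good practice to verify that the simulated draws behave as intended. This quick sanity check is our own addition, not part of the original script:

mean(age); sd(age)            #should be close to 45 and 15
mean(college)                 #should be close to 0.3, the share with a college degree
range(income)                 #bounded between 0 and 10 by construction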

To create the response or dependent variable we will specify the following linear predictor:

\[\text{thermometer}_i = \alpha + \text{age}_i \cdot \beta_1 + \text{college}_i \cdot \beta_2 + \text{income}_i \cdot \beta_3 + \epsilon_i. \]

Understanding what this equation is saying is fundamental to understanding regression analysis. Remember, since we play the role of Nature, we can specify the exact relationship between thermometer and all the other variables. The equation above says that in our universe the value of thermometer for person \(i\) is a linear function of a constant parameter \(\alpha\), each person \(i\)’s values on age, college, and income, and a random error term \(\epsilon\). Crucially, the effect of each predictor age, college, and income is weighted by a term \(\beta_j\), where \(j\in\{1,2,3\}\). Regression analysis, as we will see later on, estimates these beta weights.

Since we are Nature, we set the values of these weights, in addition to the error term and the constant, and add them up to obtain our dependent variable. This is shown below.

#Create Dependent Variable

a<-80                                           #The intercept is the thermometer feeling towards the mayor
                                                #when all variables are set to zero. In this case it would
                                                #be the thermometer value when the person is age 0, has no 
                                                #college education and no income. This is nonsensical, which
                                                #highlights the fact that the intercept is not always
                                                #interpretable. 

b1<-(-.5)                                       #Beta1 says that a one year increase in age decreases support
                                                #for the mayor by .5 points. 

b2<-15                                          #Beta2 says that being college educated increases support for
                                                #the mayor by 15 points! (ugh, elites...)

b3<-(-2)                                        #Beta3 says that for every thousand dollar increase in income
                                                #person i will drop support for the mayor by 2 points.

e<-rnorm(100,0,2.5)                             #The error term is drawn from the normal, mean 0, st.dev 2.5

y<- a + age*b1 + college*b2 + income*b3 + e     #So our values for y are a function of a, plus each one of 
                                                #the predictors and their weights, plus an error term.

thermometer<-y                                  #I rename y 

mayor<-data.frame(thermometer,                  #I create a data set to use subsequently.
                  age,
                  college,
                  income)                       #Save your own data using the following command:

#write.table(mayor, file = "directory and file name", sep= ",", row.names=FALSE)
write.table(mayor,"/Users/thomasvargas/Dropbox/Learning R/Data/mayor",  sep= ",", row.names=FALSE)
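
To see the linear predictor at work, consider a purely hypothetical person: a 40-year-old college graduate earning $2,000. Setting the error term to its mean of zero, Nature’s equation predicts:

a + 40*b1 + 1*b2 + 2*b3                         #80 - 20 + 15 - 4 = 71 points on the thermometer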

Data Exploration and Visual Inspection

We now take off our ‘Nature’ hat and assume that our professor hands us a dataset she has collected using a recent survey. As with every statistical analysis, the first and most important step is to explore the data. The professor wishes to get a good handle on the data through graphs and summary statistics. In addition to giving one a sense of the data, these procedures alert the researcher to possibly miscoded data points, which can bias estimates. We will return to this point later.

First, let us look at the dependent variable thermometer.

summary(thermometer)            #this produces basic descriptive statistics of the variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.06   52.51   57.88   58.52   66.54   83.41
hist(thermometer)               #this plots a histogram of the variable

d_therm<-density(thermometer)   #this calculates the density of the variable
plot(d_therm)                   #this plots the density, which is neater/cooler than the histogram

Nothing odd here: the dependent variable is roughly normally distributed with a mean around 58. We could do this individually for all variables in our dataset, but this is sometimes impractical. We can look at all of our variables at once using some of the previous commands and the graphing package ggplot2. To install a package in R, simply type install.packages("") and write the name of the package within the quotes, as shown below.
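
For example, to install the two packages used below (run this once per machine; afterwards, load them with library() in each session):

install.packages(c("reshape2", "ggplot2"))      #installs both packages from CRAN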

summary(mayor)                                  #This summarizes all of the variables in the dataset.
##   thermometer         age           college         income      
##  Min.   :33.06   Min.   :12.18   Min.   :0.00   Min.   :0.1308  
##  1st Qu.:52.51   1st Qu.:35.84   1st Qu.:0.00   1st Qu.:1.1096  
##  Median :57.88   Median :45.73   Median :0.00   Median :1.8562  
##  Mean   :58.52   Mean   :44.66   Mean   :0.28   Mean   :1.9668  
##  3rd Qu.:66.54   3rd Qu.:51.35   3rd Qu.:1.00   3rd Qu.:2.7760  
##  Max.   :83.41   Max.   :74.02   Max.   :1.00   Max.   :4.8085
                                                #Notice also that this provides the number of missing obs if
                                                #there are any. None here.

library(reshape2)                               #this package transforms data from wide to long format   
library(ggplot2)                                #this is a powerful graphing package
d <- melt(mayor)                                #melt turns wide-format data into long-format data.
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) +                      #here we plot the data `d`
    facet_wrap(~variable,scales = "free") +     #here we indicate what to plot, and the scales for each plot
    geom_histogram()                            #here we specify the plot type
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These figures provide a visual inspection of the data. We can see that thermometer lies roughly between 30 and 85, age between 12 and 75, income between 0 and 5 (remember, this is in thousands of dollars), and about 70 observations have no college degree. Nothing jumps out as odd.
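
If something did look odd, say a thermometer value outside its valid 0-100 range, a boxplot or a simple logical filter makes it easy to isolate suspect observations. This is a sketch of our own; the data here contain no such cases:

boxplot(mayor$thermometer)                      #visual check for extreme values
mayor[mayor$thermometer<0 |                     #lists impossible thermometer values;
      mayor$thermometer>100, ]                  #returns zero rows for this dataset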

Correlations

Finally, before proceeding with more sophisticated analyses, it is often useful to inspect the correlations between variables. This serves as a (very rough) shorthand for regression. One way of doing this efficiently is by obtaining a correlation matrix, which shows how a given variable within a dataset moves in relation to every other variable. One easy way to do this is with the corrplot package:

library(corrplot)
M <- cor(mayor)
corrplot(M, method="circle",             #Here we symbolize corr with circles, matrix discards upper half.
         type="lower",
         order="hclust", 
         tl.col="black", 
         tl.srt=45)

Positive correlations are displayed in blue, whereas negative correlations are displayed in red, as given by the legend at the bottom of the plot. The size of each circle is proportional to the size of the coefficient. The diagonal is obviously blue and large because the correlation between every variable and itself is perfect, or 1. We can stick for now with the first column, since it tells us the relationship between thermometer and all the predictors. Notice that college has a positive correlation with thermometer, but age and income have negative correlations with thermometer. age in particular seems to have a large negative correlation coefficient, whereas income’s relationship to thermometer is much weaker. While this is informative about the true relationships between the variables, it does not return estimates of the values of these relationships (beta coefficients), nor of the uncertainty surrounding such relationships. For this we will have to turn to linear regression, in the upcoming session.
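
If you prefer numbers to circles, the matrix M computed above can simply be printed, rounded for readability:

round(M, 2)                   #numeric view of the correlation matrix plotted by corrplot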

Exercises

  1. The code below produces some data. Plot the data: is there anything odd about it?

Appendix

Some additional correlation plots, with much more coding, below:

cormat <- round(cor(mayor),2)                   #Create a correlation matrix. 
library(reshape2)                               #Use package to melt.
melted_cormat <- melt(cormat)                   #Melt the correlations matrix
head(melted_cormat)                             #This shows each variable's correlation with the others (first rows).
##          Var1        Var2 value
## 1 thermometer thermometer  1.00
## 2         age thermometer -0.68
## 3     college thermometer  0.59
## 4      income thermometer -0.39
## 5 thermometer         age -0.68
## 6         age         age  1.00
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()

#For a plot showing only one triangle of the matrix, write the following functions:

# Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
  
  
upper_tri <- get_upper_tri(cormat)
upper_tri
##             thermometer   age college income
## thermometer           1 -0.68    0.59  -0.39
## age                  NA  1.00    0.11   0.20
## college              NA    NA    1.00  -0.09
## income               NA    NA      NA   1.00
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()