Let us continue working with the toy example discussed in the lecture. We have the variable hours (hours spent preparing for the test) and the variable score (score for the test, out of 100). We can create two vectors with these values and join them into a small data frame.

hours <- c(4, 2, 5, 1, 3)
score <- c(82, 45, 100, 18, 65)
dat <- cbind.data.frame(hours, score)  # cbind - bind as columns

Note: of course, we could skip this step and work with the vectors themselves, but usually we work with data frames loaded from files, so let’s make our example more realistic and create a data frame dat.
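For instance, if the same two columns were stored in a CSV file, loading them could look like this (the file name scores.csv is purely hypothetical here, not a file provided with these notes):

dat <- read.csv("scores.csv")  # hypothetical file with columns hours and score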

dat
##   hours score
## 1     4    82
## 2     2    45
## 3     5   100
## 4     1    18
## 5     3    65

Now we can run a simple linear model (a bivariate regression with one predictor). To do this, we need the function lm() that stands for linear model. The same function will be used for multiple linear regression as well.

Dependent variable: score;

Independent variable: hours;

General equation: \(score = \hat{\beta_0} + \hat{\beta_1} \times hours\);

So, hours spent on preparation for a test affect the score for this test.

model <- lm(data = dat, score ~ hours)

Now let’s look at the summary of this model, which includes coefficients, test statistics, and indicators of model quality.

summary(model)
## 
## Call:
## lm(formula = score ~ hours, data = dat)
## 
## Residuals:
##    1    2    3    4    5 
## -0.1  3.1 -2.2 -3.8  3.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.700      3.728   0.456 0.679355    
## hours         20.100      1.124  17.883 0.000381 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.554 on 3 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9876 
## F-statistic: 319.8 on 1 and 3 DF,  p-value: 0.0003813

Interpretation:

  1. First of all, using this output, we can write the regression equation:

\[ score = 1.7 + 20.1 \times hours \]

So, if the time spent on preparation for the test increases by one hour, the score for this test increases by 20.1 on average. If a person does not prepare for the test at all (hours = 0), the expected score is 1.7.

  2. Then, we can check whether the effect of hours on score is statistically significant.

\[ H_0: \beta_1 = 0 \text{ (`hours` does not affect `score`)} \]

\[ H_1: \beta_1 \ne 0 \text{ (`hours` really affects `score`)} \]

The p-value for the coefficient of hours is 0.000381, which is less than any conventional significance level, so we can reject the null hypothesis about the absence of the hours effect. So, the time spent on preparation really affects the score for the test. To be more precise, if we consider the significance codes (stars) in the output, we can conclude that the null hypothesis is rejected at the 0.001 level, i.e. at the 0.1% significance level, as there are three stars (***) near the p-value. As it is usually said, the coefficient of hours is significant at the 0.1% significance level, or the effect of hours on score is significant at the 0.1% significance level.

  3. If we consider the model quality, we should look at \(R^2\), reported as Multiple R-squared here. As it shows the share of the variance of score explained by our model (“the share of reality explained”), the higher \(R^2\) is, the better the quality of the model. In our case it is extremely high, very close to 1, so we can conclude that our model is good. (A code sketch for extracting these numbers directly follows this list.)
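As promised above, the quantities we interpreted can also be extracted from summary(model) programmatically instead of being read off the printed output. A minimal sketch using the standard components of a summary.lm object:

s <- summary(model)
s$coefficients                        # table: Estimate, Std. Error, t value, Pr(>|t|)
s$coefficients["hours", "Pr(>|t|)"]   # p-value for the hours coefficient (0.000381)
s$r.squared                           # Multiple R-squared (0.9907)
s$adj.r.squared                       # Adjusted R-squared (0.9876)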

If needed, we can extract certain values from the model without asking for a detailed output. For example, take coefficients:

model$coefficients
## (Intercept)       hours 
##         1.7        20.1
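As a cross-check, the same two numbers can be computed by hand from the usual OLS formulas, \(\hat{\beta}_1 = \frac{cov(hours, score)}{var(hours)}\) and \(\hat{\beta}_0 = \overline{score} - \hat{\beta}_1 \times \overline{hours}\). A small sketch using base R only:

b1 <- cov(dat$hours, dat$score) / var(dat$hours)  # slope estimate
b0 <- mean(dat$score) - b1 * mean(dat$hours)      # intercept estimate
c(b0, b1)  # matches model$coefficients
## [1]  1.7 20.1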

Or fitted (predicted) values, \(\hat{y}\):

model$fitted.values
##     1     2     3     4     5 
##  82.1  41.9 102.2  21.8  62.0
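The same machinery can produce predictions for values of hours that are not in our data, via the standard predict() function (the values 2.5 and 3.5 below are just made-up illustrative inputs):

predict(model, newdata = data.frame(hours = c(2.5, 3.5)))
##     1     2 
## 51.95 72.05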

Or residuals, the prediction errors \(\hat{\varepsilon}\):

model$residuals
##    1    2    3    4    5 
## -0.1  3.1 -2.2 -3.8  3.0

We can make sure that residuals are actually \((y-\hat{y})\), i.e. score minus the fitted values:

dat$score - model$fitted.values
##    1    2    3    4    5 
## -0.1  3.1 -2.2 -3.8  3.0
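To confirm the match programmatically rather than by eye (unname() just drops the observation names so that only the numbers are compared):

all.equal(unname(model$residuals), unname(dat$score - model$fitted.values))
## [1] TRUE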

Now let us create a scatterplot with the regression line added. Load the ggplot2 package:

library(ggplot2)

We usually place the independent variable on the x-axis and the dependent one on the y-axis:

ggplot(data = dat, aes(x = hours, y = score)) + geom_point() +
  geom_smooth(method = lm)

The shaded area along the regression line is the 95% confidence band for the regression line itself, i.e. for the predicted mean of score at each value of hours (this is the default behaviour of geom_smooth()).
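If needed, the band can be widened, narrowed or switched off with the standard geom_smooth() arguments level and se (ordinary ggplot2 options, nothing specific to our model):

ggplot(data = dat, aes(x = hours, y = score)) + geom_point() +
  geom_smooth(method = lm, level = 0.99)  # 99% confidence band

ggplot(data = dat, aes(x = hours, y = score)) + geom_point() +
  geom_smooth(method = lm, se = FALSE)    # regression line only, no band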