Let us download the NALS dataset, load it, and look at its variables.

### Loading libraries
library(haven)
library(synthpop)
## Loading required package: lattice
## Loading required package: MASS
## Loading required package: nnet
## Loading required package: ggplot2
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.0.1 ✔ purrr 0.3.0
## ✔ tidyr 0.8.3 ✔ dplyr 0.7.8
## ✔ readr 1.3.1 ✔ stringr 1.3.1
## ✔ tibble 2.0.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
library(ggplot2)
library(jtools)
#library(summarytools)
library(kableExtra)
library(ggfortify)
nals <- read.csv("~/Desktop/stat2/nals_synthetic.csv")
head(nals)
For the estimation of conditional and joint probabilities we need to create contingency tables. This process is pretty easy in R using the prop.table() and table() functions. Let us analyze a two-way contingency table from the NALS dataset. The two variables we will be analyzing:
1. educational attainment - 6 possible values (Education)
2. risk of unemployment (unemp)
The first line of code shows how to create a frequency count table from the dataset:
table(nals$Education,nals$unemp)
##
## 0 1
## 1 1331 239
## 2 416 65
## 3 4900 502
## 4 1463 128
## 5 2120 111
## 6 1164 53
Now that we have the frequency counts, we can obtain the corresponding table of joint proportions in the following way:
prop.table(table(nals$Education,nals$unemp))
##
## 0 1
## 1 0.106548191 0.019132245
## 2 0.033301313 0.005203330
## 3 0.392251041 0.040185719
## 4 0.117114954 0.010246558
## 5 0.169708614 0.008885687
## 6 0.093179635 0.004242715
To get the same in percentages, just multiply by 100:
nals.contingency <- prop.table(table(nals$Education,nals$unemp))*100
nals.contingency
##
## 0 1
## 1 10.6548191 1.9132245
## 2 3.3301313 0.5203330
## 3 39.2251041 4.0185719
## 4 11.7114954 1.0246558
## 5 16.9708614 0.8885687
## 6 9.3179635 0.4242715
How do we get conditional percentages from this table?
To get them, we use the margin argument of the prop.table() function: margin = 1 computes percentages within each row (so each row sums to 100%), while margin = 2 computes them within each column (so each column sums to 100%).
Let us first condition on unemployment status (columns):
prop.table(table(nals$Education,nals$unemp), margin =2)*100
##
## 0 1
## 1 11.681587 21.766849
## 2 3.651044 5.919854
## 3 43.005090 45.719490
## 4 12.840091 11.657559
## 5 18.606284 10.109290
## 6 10.215903 4.826958
In the table above you can see that those who are unemployed are roughly twice as likely to have no degree as those who are employed. Conversely, those who are employed are about twice as likely to have a master's degree or higher than those who are unemployed.
We can also condition on educational attainment (rows):
prop.table(table(nals$Education,nals$unemp), margin =1)*100
##
## 0 1
## 1 84.777070 15.222930
## 2 86.486486 13.513514
## 3 90.707146 9.292854
## 4 91.954745 8.045255
## 5 95.024653 4.975347
## 6 95.645029 4.354971
In this table you can clearly see that as educational attainment increases, the risk of unemployment decreases.
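Note that the tables produced with the margin argument are conditional, not marginal, distributions. If you also want the true marginal percentages (the row and column totals of the joint table), base R's margin.table() and addmargins() can be applied to the nals.contingency table created above; a minimal sketch:
# marginal percentage distribution of educational attainment (row totals)
margin.table(nals.contingency, margin = 1)
# marginal percentage distribution of unemployment status (column totals)
margin.table(nals.contingency, margin = 2)
# or show the joint table together with both margins
addmargins(nals.contingency)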
Today we are going to analyze linear regression in R, using the NALS data.
The variables we are going to be using in the analysis
- X = parent years of education
- Z = respondent years of education
- Y = adult literacy
Let us select the variables we are going to need in this analysis - id, parented (X), yearsed (Z) and literacy (Y)
# we can assign this subsetted data into a new dataframe
nals.dataframe <- nals %>% select(id,parented,yearsed,literacy)
Now that we have subset the data, we can rename the columns as x, y, and z for the sake of our analysis.
# check the current column names
colnames(nals)
## [1] "id" "annearn" "occprest" "sei" "age"
## [6] "yearsed" "gender" "ln_earn" "literacy" "unemp"
## [11] "parented" "Ethnicity" "Language" "Education"
Let us rename them
nals <- nals %>% transmute(id=id, x=parented, z=yearsed, y=literacy)
# if you want to keep the old variables but add new ones
## then use `mutate` instead of `transmute`
Check the column names again
colnames(nals)
## [1] "id" "x" "z" "y"
We will not need the id variable in the analysis; to drop it you could use select(-id), but here we simply reorder the columns and keep it for reference.
nals <- nals %>% select(id, x, y, z)
The most basic method to look at the descriptives is
summary(nals)
## id x y z
## Min. :1.010e+10 Min. : 0.00 Min. : 49.25 Min. : 0.00
## 1st Qu.:3.110e+10 1st Qu.: 9.00 1st Qu.:253.73 1st Qu.:12.00
## Median :5.080e+10 Median :12.00 Median :292.81 Median :13.00
## Mean :4.791e+10 Mean :10.93 Mean :286.18 Mean :13.31
## 3rd Qu.:6.670e+10 3rd Qu.:13.00 3rd Qu.:329.13 3rd Qu.:16.00
## Max. :8.570e+10 Max. :18.00 Max. :441.40 Max. :18.00
Before jumping into the syntax of the linear model, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn:
Scatter plot: Visualize the linear relationship between the predictor and response
Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.
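Only the scatter plots are drawn below, so here is a minimal sketch of the box plots just described, using base R graphics on our renamed variables:
# box plots to spot potential outliers in the outcome and each predictor
boxplot(nals$y, main = "Adult literacy (y)")
boxplot(nals$x, main = "Parent years of education (x)")
boxplot(nals$z, main = "Respondent years of education (z)")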
Let us look at scatter plots for each of the variables in our data
ggplot(nals, aes(x,y)) + geom_point(color="blue") +
labs(title="Relationship between literacy and parent's years of education") +
geom_smooth(method="loess", linetype="dashed",color="darkred", fill="blue")
ggplot(nals, aes(z,y)) + geom_point(color="blue") +
labs(title="Relationship between literacy and years of education") + geom_smooth(method="lm")
The scatter plots along with the smoothing lines above suggest a linearly increasing relationship between y and x, and between y and z. This is a good thing, because one of the underlying assumptions of linear regression is that the relationship between the response and the predictor variables is linear and additive.
Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in pairs, just like parental education and adult literacy here. Correlation can take values between -1 and +1. If, for every instance where years of parental education increase, adult literacy also increases along with it, then there is a high positive correlation between them and the correlation will be close to +1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.
A value closer to 0 suggests a weak relationship between the variables. A low correlation (between -0.2 and 0.2) probably suggests that much of the variation in the response variable (y) is unexplained by the predictor (x), in which case we should probably look for better explanatory variables.
cor(nals$y, nals$x)
## [1] 0.4965022
cor(nals$y, nals$z)
## [1] 0.6472464
Now that we have looked at the graphical plots, let us look at how to model this linear relationship and what assumptions need to be satisfied to do so.
We will begin with the simple univariate predictor scenario where the response variable is adult literacy (y) and the predictor is parental education (x). The model would look like \[ adult\_literacy_i = \beta_0+\beta_1 parental\_education_i+\epsilon_i\] For the rest of the document we will refer to the above equation in its alternate form \[y_i = \beta_0+\beta_1x_i+\epsilon_i\] Given that \[y = \beta_0+\beta_1x+\epsilon,\] taking the expectation of both sides (conditional on x) gives \[\mathop{{}\mathbb{E}}[Y \mid X=x] = \mu(x)\] Note that we are talking about \(\mu(x)\) (the expected value of Y as a function of x) and not \(\mu_x\) (the expectation of x). Specifically,
\[ \mu(x) = \beta_0+\beta_1 x\] Now we want to estimate \(\beta_0\) and \(\beta_1\). Let us first look at the assumptions required to consistently estimate the \(\beta\) coefficients.
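Before turning to lm() below, here is a minimal sketch (an aside, not part of the original analysis) of how the OLS estimates of \(\beta_0\) and \(\beta_1\) can be computed directly from sample moments; the results should match the lm() output further down:
b1 <- cov(nals$x, nals$y) / var(nals$x)  # slope: sample covariance over variance of x
b0 <- mean(nals$y) - b1 * mean(nals$x)   # intercept: the fitted line passes through the means
c(intercept = b0, slope = b1)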
Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, let's see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes in two main arguments, namely a formula and the data.
The data is typically a data.frame and the formula is an object of class formula. The most common convention is to write out the formula directly in place of the argument, as written below.
linearMod <- lm(y ~ x, data=nals) # build linear regression model on full data
print(linearMod)
##
## Call:
## lm(formula = y ~ x, data = nals)
##
## Coefficients:
## (Intercept) x
## 192.47 8.57
Now that we have built the linear model, we have also established the relationship between the predictor and the response in the form of a mathematical formula for adult literacy (y) as a function of parental education (x).
For the above output, you can notice the ‘Coefficients’ part having two components:
Intercept: 192.47
x: 8.57
These are also called the beta coefficients. In other words,
\[ y = \beta_0 + \beta_1x + \epsilon\] is the same as \[ y = 192.47 + 8.57x + \epsilon \]
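As a quick aside (a sketch, not part of the original lab), the fitted coefficients can be extracted from the model object and used directly for prediction; the value of 12 years of parental education is just an illustrative input:
b <- coef(linearMod)  # named vector: (Intercept) and x
b
# predicted literacy when the parent has 12 years of education
b["(Intercept)"] + b["x"] * 12
# the same prediction via predict()
predict(linearMod, newdata = data.frame(x = 12))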
**What assumptions do we need to satisfy to be able to report these results?**
Linearity of data - The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
Normality of residuals - The residual errors are assumed to be normally distributed.
Homogeneity of residuals variance - The residuals are assumed to have a constant variance (homoscedasticity).
Independence of residuals error terms.
You should check whether or not these assumptions hold. Potential problems include non-linearity of the relationship, heteroscedasticity, and influential observations (outliers and high-leverage points).
All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing the residual errors.
Regression diagnostic plots can be created using the base R function plot() or the autoplot() function [ggfortify package], which creates ggplot2-based graphics.
par(mfrow = c(2, 2))
plot(linearMod)
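Since ggfortify is already loaded, the same diagnostics can also be drawn as ggplot2 graphics; a minimal sketch:
# ggplot2-based versions of the standard lm diagnostic plots
autoplot(linearMod)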
The residual plots can be used in four different ways:
Residuals vs. Fitted. Used to check the linear relationship assumption. A horizontal line without distinct patterns is an indication of a linear relationship, which is good.
Normal Q-Q. Used to examine whether the residuals are normally distributed. It's good if the residual points follow the straight dashed line.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
Residuals vs Leverage. Used to identify influential cases, that is extreme values that might influence the regression results when included or excluded from the analysis. This plot will be described further in the next sections.
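The quantities behind these plots can also be inspected directly; a small sketch (an aside, not part of the original lab):
# building blocks of the diagnostic plots
rstd <- rstandard(linearMod)  # standardized residuals (Normal Q-Q, Scale-Location)
lev <- hatvalues(linearMod)   # leverage (Residuals vs Leverage)
head(cbind(fitted = fitted(linearMod), std_resid = rstd, leverage = lev))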
Now the linear model is built and we have a formula that we can use to predict adult literacy if the corresponding parental years of education is known. Is this enough to actually use this model? No! Before using a regression model, you have to ensure that it is statistically significant. How do you ensure this? Let's begin by printing the summary statistics for linearMod.
summary(linearMod) # model summary
##
## Call:
## lm(formula = y ~ x, data = nals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -254.977 -30.513 4.537 36.879 163.180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 192.4736 1.5413 124.88 <2e-16 ***
## x 8.5697 0.1341 63.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.24 on 12490 degrees of freedom
## Multiple R-squared: 0.2465, Adjusted R-squared: 0.2465
## F-statistic: 4086 on 1 and 12490 DF, p-value: < 2.2e-16
The summary statistics above tell us a number of things. One of them is the model p-value (bottom last line), and another is the p-value of each individual predictor variable (extreme right column under ‘Coefficients’). The p-values are very important because we can consider a linear model to be statistically significant only when both of these p-values are less than the pre-determined significance level, which is conventionally 0.05. This is visually indicated by the significance stars at the end of each row: the more stars beside a variable's p-value, the more significant the variable.
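If you want to pull these quantities out programmatically rather than read them off the printout, a minimal sketch using base R accessors:
# coefficient table: estimates, standard errors, t values and p-values
coef(summary(linearMod))
# overall model (F-test) p-value, reconstructed from the stored F statistic
f <- summary(linearMod)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)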