good_1_ <- read.csv("good.csv") # for illustration
good <- good_1_ # rename the data frame
Install the packages that will be used for this tutorial.
install.packages("ppcor")
install.packages("ggplot2")
install.packages("ggthemes")
install.packages("data.table")
A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related.
Assumptions for the Pearson r correlation:
Variables should be continuous.
Both variables should be approximately normally distributed (variables have a bell-shaped curve).
The two variables have a linear relationship, which appears as a straight line on a scatterplot.
The data is homoscedastic, which means the variance around the best fit line is the same for all values of the predictor variable (X).
"pearson" is the default method (you will get the same result if you omit the method argument). Specify another method when appropriate, such as "kendall" or "spearman".
cor(good$read97, good$read02, use="complete.obs", method="pearson")
## [1] 0.6988714
# In the cor() function, if use = "complete.obs", missing values are handled by casewise deletion.
Besides cor(), we can use cor.test(). The difference is that cor.test() also reports a p-value and a 95% confidence interval for the correlation.
cor.test(good$read97, good$read02, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: good$read97 and good$read02
## t = 33.436, df = 1171, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6683699 0.7270266
## sample estimates:
## cor
## 0.6988714
# What is the p-value? What can you say about the correlation between read97 and read02?
# Read the R documentation for more details --> help("cor.test")
First, let's create a new data frame with the variables we will use. There are at least two ways to do this. First, we can use the cbind() function, which combines R objects by columns (rbind() combines them by rows).
goodMini <- cbind(good$read02, good$read97, good$AFDCpreg) # append columns
Let's look at the new object created by cbind() using View(). Note that "View" needs to be capitalized.
View(goodMini)
Can you see the column names? Not really! Let us rename the column names as follows.
colnames(goodMini, do.NULL = FALSE) # check current column names
## [1] "col1" "col2" "col3"
colnames(goodMini) <- c("read02","read97","AFDCpreg") # rename column names
View(goodMini)
# Examine the class or type of an object using class()
class(goodMini)
## [1] "matrix"
# Convert matrix to a data frame for running regression later on.
library(data.table)
goodMini <- data.table(goodMini)
Alternatively, we can subset our original data frame by selecting the named columns directly, which is quicker!
goodMini <- good[,c("read02","read97","AFDCpreg")] # subset data frame with named columns
In R, missing values are represented by the symbol NA (not available), and in many cases NA prevents us from running statistical analyses smoothly. Hence, it is good practice to always "clean up" your data first.
The na.omit() function returns the object with any rows that contain NA values removed.
goodMiniClean <- na.omit(goodMini)
What can you tell from the two summary results?
summary(goodMini)
## read02 read97 AFDCpreg
## Min. : 7.00 Min. : 2.00 Min. :0.0000
## 1st Qu.: 57.00 1st Qu.:45.00 1st Qu.:0.0000
## Median : 69.00 Median :61.00 Median :0.0000
## Mean : 66.46 Mean :56.29 Mean :0.1592
## 3rd Qu.: 79.00 3rd Qu.:71.00 3rd Qu.:0.0000
## Max. :100.00 Max. :97.00 Max. :1.0000
## NA's :1026 NA's :2041 NA's :246
summary(goodMiniClean)
## read02 read97 AFDCpreg
## Min. : 16.00 Min. : 2.00 Min. :0.0000
## 1st Qu.: 70.00 1st Qu.:44.75 1st Qu.:0.0000
## Median : 78.00 Median :60.00 Median :0.0000
## Mean : 76.03 Mean :55.89 Mean :0.1461
## 3rd Qu.: 84.00 3rd Qu.:71.00 3rd Qu.:0.0000
## Max. :100.00 Max. :97.00 Max. :1.0000
A partial correlation is a measure of the strength and direction of the linear relationship between two continuous variables, while controlling for the effect of one or more other continuous variables. Like the Pearson correlation, it applies to continuous variables only.
library(ppcor)
## Warning: package 'ppcor' was built under R version 3.5.2
## Loading required package: MASS
pcor(goodMiniClean, method = c("pearson"))
## $estimate
## read02 read97 AFDCpreg
## read02 1.0000000 0.690630225 -0.178061139
## read97 0.6906302 1.000000000 0.007543908
## AFDCpreg -0.1780611 0.007543908 1.000000000
##
## $p.value
## read02 read97 AFDCpreg
## read02 0.000000e+00 4.611467e-152 4.729960e-09
## read97 4.611467e-152 0.000000e+00 8.055768e-01
## AFDCpreg 4.729960e-09 8.055768e-01 0.000000e+00
##
## $statistic
## read02 read97 AFDCpreg
## read02 0.000000 31.1642823 -5.9052771
## read97 31.164282 0.0000000 0.2461975
## AFDCpreg -5.905277 0.2461975 0.0000000
##
## $n
## [1] 1068
##
## $gp
## [1] 1
##
## $method
## [1] "pearson"
# If the variables are not all numeric, you will see an error as follows:
# "Error in pcor(goodMiniClean, method = c("pearson")) : 'x' must be numeric"
What can you interpret from the results?
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. A scatter plot shows how two variables relate to one another; the strength and direction of that relationship is their correlation.
Pairwise Comparison Scatterplot
pairs(goodMiniClean, panel = panel.smooth) # Do you like this plot better?
Bivariate Scatterplot of Read 97 against Read 02
good$childnum <- as.factor(good$childnum)
plot(good$read97, good$read02, main="Scatterplot Example",
xlab="Reading Scores '97 ", ylab="Reading Scores '02 ", pch=10)Let the response \(y_i\) be the read02 and the explanatory variable \(x_i\) be read97 (\(i = 1, \dots, n=3563\)).
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
To be specific,
\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\]
\[\textbf{Var}(y_i | x_i) = \sigma^2\]
\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] or
\[y_i \overset{iid}{\sim} \mathcal{N}( \beta_0 + \beta_1 x_i, \sigma^2)\]
The least squares estimates \(b_0\) and \(b_1\) solve the minimization problem
\[\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}\]
The function lm() will be used extensively. This function solves the minimization problem defined above.
The basic output of lm() contains two elements: the Call and the Coefficients. The Call restates the regression model you estimated. The Coefficients are the estimated regression coefficients; in this case there are two, the intercept and the regression weight of our sole predictor. The general form lm(DV ~ IV1 + IV2 + ...) is used for fitting linear models.
General rule for regression: when we regress Y on X, we use the values of the explanatory variable X to predict those of the response variable Y. Let's regress reading score in 2002 (Y) on reading score in 1997 (X).
We will use the clean data, which has no missing values, for computing the linear regression.
reg <- lm(goodMiniClean$read02 ~ goodMiniClean$read97)
Let us look at the results of the fit by printing a summary of the model object. The output includes the model formula, the coefficient estimates with their standard errors, the residual standard error, and the multiple and adjusted R-squared.
What is the coefficient? The y-intercept? Is the coefficient statistically significant?
summary(reg)
##
## Call:
## lm(formula = goodMiniClean$read02 ~ goodMiniClean$read97)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.866 -4.265 0.523 4.722 39.537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.65799 0.74020 72.49 <2e-16 ***
## goodMiniClean$read97 0.40029 0.01249 32.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.041 on 1066 degrees of freedom
## Multiple R-squared: 0.4907, Adjusted R-squared: 0.4902
## F-statistic: 1027 on 1 and 1066 DF, p-value: < 2.2e-16
# Computes confidence intervals for the estimated coefficients in a fitted model.
confint(reg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 52.2055813 55.1103949
## goodMiniClean$read97 0.3757778 0.4247941
Before using the model, we need to check the model assumptions in the following steps:
Check linearity first; if linearity is satisfied, then
Check homoscedasticity; if homoscedasticity is satisfied, then
Check normality.
We plot the residuals against the fitted values to
check linearity, by verifying that the residuals scatter symmetrically around the horizontal line \(h=0\) with no systematic curvature;
check homoscedasticity, by verifying that the residuals fall within a band of roughly constant width (the variability is approximately the same across the range of fitted values).
plot(reg$fitted, reg$residuals,
pch = 16,
main = "residual plot")
abline(h=0, lwd=4, col="red")
We look at the Q-Q plot of the residuals to check normality.
qqnorm(reg$residuals)
qqline(reg$residuals, lwd=4, col="blue")
Create a column of predicted values of Y (Y') in the dataset.
goodMiniClean$predicted <- fitted(reg)
Create a column of residuals in the dataset.
goodMiniClean$residuals <- residuals(reg)
Get standardized regression coefficients (dividing x and y by their standard deviations).
Also called beta coefficients or beta weights, standardized coefficients are obtained by rescaling the dependent and independent variables so that each has variance equal to 1. Hence, a standardized coefficient tells you how many standard deviations the dependent variable is expected to change per one standard deviation increase in the predictor variable.
regStandard <- lm(scale(good$read02)~ scale(good$read97))
summary(regStandard)
##
## Call:
## lm(formula = scale(good$read02) ~ scale(good$read97))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.43302 -0.23944 0.02685 0.27489 2.35036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55556 0.01390 39.96 <2e-16 ***
## scale(good$read97) 0.46307 0.01385 33.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4761 on 1171 degrees of freedom
## (2390 observations deleted due to missingness)
## Multiple R-squared: 0.4884, Adjusted R-squared: 0.488
## F-statistic: 1118 on 1 and 1171 DF, p-value: < 2.2e-16
This will become more useful as we add more independent variables to our regression.
The ggplot2 package produces a scatter plot between two variables with more flexibility, making it easy to add a regression line as well as other nice features.
Base R
plot(goodMiniClean$read97, goodMiniClean$read02,
main="Scatterplot Example with LS line added",
xlab="Reading Scores '97 ",
ylab="Reading Scores '02 ",
pch=10)
abline(reg, col="red", lwd=4) # many other ways.
abline(h=mean(goodMiniClean$read02), lwd=4, col="blue") # add a horizontal line, y=mean(y)
ggplot
library(ggplot2)
library(ggthemes)
plot <- ggplot(goodMiniClean, aes(x=read97, y=read02)) + geom_point() +
  stat_smooth(method="lm", se=FALSE)
# can add titles, subtitles and a caption in ggplot
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)")# can change smooth line color # Smoothed conditional means
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)") +
geom_smooth(method="lm", colour="red", size=0.5, se = TRUE)# can add themes
plot + labs(title = "Scatterplot of Reading Scores in 1997 & Reading Scores in 2002",
subtitle = "Lab 2") + labs(caption = "(based on data from good package)") +
geom_point(color='lightblue4') +
geom_smooth(method="lm", colour="black", size=0.5) +
theme_economist(base_size = 10, base_family = "sans") # Other themes e.g., theme_grey, theme_bw, theme_linedraw, theme_light, theme_dark, theme_minimal or theme_classic
# customize your theme
# https://www.r-bloggers.com/custom-themes-in-ggplot2/
Dataset: "statekffdata" dataset
Documentation: “KFF_State_Codebook” on Canvas