Slides for Week 4 are available here.
In this tutorial, you will learn how to:
Tips: If you need to install a new package, remove #
first to run the install.packages
code. Install packages
only once, then put back #
in front of the codes to prevent
re-running the install.packages()
codes when knitting. You
will also need to load the library (library()
) each session
before you can use their corresponding functions.
smoking_data <- read.csv("birthweight_smoking.csv")
smoking_data1 <- smoking_data
# smoking_data1 is now the spare copy in case we "mess up" our `data`!
# smoking_data <- smoking_data1
Since we will explore correlation this week, let’s keep the data structure for now. Correlation function cannot work on a factor variable.
## nprevist alcohol tripre1 tripre2 tripre3
## Min. : 0 Min. :0.00 Min. :0.0 Min. :0.00 Min. :0.00
## 1st Qu.: 9 1st Qu.:0.00 1st Qu.:1.0 1st Qu.:0.00 1st Qu.:0.00
## Median :12 Median :0.00 Median :1.0 Median :0.00 Median :0.00
## Mean :11 Mean :0.02 Mean :0.8 Mean :0.15 Mean :0.03
## 3rd Qu.:13 3rd Qu.:0.00 3rd Qu.:1.0 3rd Qu.:0.00 3rd Qu.:0.00
## Max. :35 Max. :1.00 Max. :1.0 Max. :1.00 Max. :1.00
## tripre0 birthweight smoker unmarried educ
## Min. :0.00 Min. : 425 Min. :0.00 Min. :0.00 Min. : 0.0
## 1st Qu.:0.00 1st Qu.:3062 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:12.0
## Median :0.00 Median :3420 Median :0.00 Median :0.00 Median :12.0
## Mean :0.01 Mean :3383 Mean :0.19 Mean :0.23 Mean :12.9
## 3rd Qu.:0.00 3rd Qu.:3750 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:14.0
## Max. :1.00 Max. :5755 Max. :1.00 Max. :1.00 Max. :17.0
## age drinks
## Min. :14 Min. : 0.0
## 1st Qu.:23 1st Qu.: 0.0
## Median :27 Median : 0.0
## Mean :27 Mean : 0.1
## 3rd Qu.:31 3rd Qu.: 0.0
## Max. :44 Max. :21.0
## 'data.frame': 3000 obs. of 12 variables:
## $ nprevist : int 12 5 12 13 9 11 12 10 13 10 ...
## $ alcohol : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre1 : int 1 0 1 1 1 1 1 1 1 1 ...
## $ tripre2 : int 0 1 0 0 0 0 0 0 0 0 ...
## $ tripre3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tripre0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ birthweight: int 4253 3459 2920 2600 3742 3420 2325 4536 2850 2948 ...
## $ smoker : int 1 0 1 0 0 0 1 0 0 0 ...
## $ unmarried : int 1 0 0 0 0 0 0 0 0 0 ...
## $ educ : int 12 16 11 17 13 16 14 13 17 14 ...
## $ age : int 27 24 23 28 27 33 24 38 29 28 ...
## $ drinks : int 0 0 0 0 0 0 0 0 0 0 ...
A Pearson correlation is a number between -1 and 1 that indicates the
extent to which two variables are linearly related. Pearson
is the default (you will get the same result without specifying the
method actually). Specify other methods when appropriate, such as
kendall or spearman.
Check the Pearson correlation between two continuous variables.
## [1] 0.11
## [1] -0.032
Besides using the function cor()
, we can use
cor.test()
. The difference of the two is that the
cor.test()
function will output a p-value, 95% confidence
interval for your correlation. An \(\alpha\) of 0.05 indicates that the risk of
concluding that a correlation exists when, actually, no correlation
exists is 5%.
##
## Pearson's product-moment correlation
##
## data: smoking_data$birthweight and smoking_data$educ
## t = 6, df = 2998, p-value = 0.000000008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07 0.14
## sample estimates:
## cor
## 0.11
# What is the p-value?
# What can you say about the correlation between birthweight and educ?
# Read on the R documentation for more elaborations --> help("cor.test")
The test statistics show that the correlation coefficient for birthweight and education attainment (r = 0.11) is statistically significant at the 0.05 level with the p-value of 0.000000008 (lower than 0.05). We reject the null hypothesis that birthweight and education attainment are not correlated, though the correlation is very weak.
First, let’s create a new data frame with a few variables we will
use. We can subset our original data frame by selecting the columns
pretty quickly. Specify the column names through this code:
data[,c("variable1", "variable2")]
.
Check the Pearson correlation between two continuous variables. The correlation function cannot analyze factor structure, that’s why we do not transform the factor structure for categorical variables here.
## birthweight nprevist age educ drinks smoker alcohol
## birthweight 1.000 0.227 0.080 0.1052 -0.032 -0.169 -0.0336
## nprevist 0.227 1.000 0.147 0.2102 -0.061 -0.109 -0.0425
## age 0.080 0.147 1.000 0.4457 0.026 -0.144 0.0381
## educ 0.105 0.210 0.446 1.0000 -0.013 -0.233 0.0049
## drinks -0.032 -0.061 0.026 -0.0129 1.000 0.092 0.6041
## smoker -0.169 -0.109 -0.144 -0.2326 0.092 1.000 0.1209
## alcohol -0.034 -0.043 0.038 0.0049 0.604 0.121 1.0000
ggcorrplot()
# hierarchical cluster: order correlation coefficients from +ve to -ve
ggcorrplot(cor_df,
lab = TRUE,
hc.order = TRUE)
# Get the lower triangle, turn insignificant coefficients blank
ggcorrplot(cor_df,
lab = TRUE,
hc.order = TRUE,
type = "lower",
insig = "blank")
# change the themes and color palettes
ggcorrplot(cor_df, lab = TRUE, hc.order = TRUE, type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_minimal(),
colors = c("#6D9EC1", "white", "#E46726"))
Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. Scatter plots show how much one variable is affected by another, while the relationship between two variables is called their correlation.
Let the dependent variable \(y_i\)
be the birthweight
and the independent variable \(x_i\) be nprevist
(\(i = 1, \dots, n=3000\)).
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
\[\min_{b_0,\,b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^{2}\]
The function lm()
will be used extensively. This
function solves the minimization problem that we defined above.
The basic output of the lm(…) function contains two elements: the Call and the Coefficients. The former is used to tell you what regression it was that you estimated. The second contains the regression coefficients. In this case there are two coefficients: the intercept and the regression weight of our sole predictor. lm(DV ~ IV1 + IV2…) function is used for fitting linear models
General rule for regression: When we regress Y on X (y ~ x), we use the values of explanatory variable X to predict those of response variable Y.
Let’s start with a simple linear regression. Regress
birthweight
(y) on nprevist
(x) using the
small dataset.
Let us look at the results of the fit and print the model object. The output includes the model formula, the coefficients, parameter estimates, standard errors, as well the residual standard error, adjusted R-squared and multiple R-squared.
What is the coefficient? The y-intercept? Is the coefficient statistically significant?
##
## Call:
## lm(formula = birthweight ~ nprevist, data = dataMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2672.6 -302.7 27.4 367.4 2225.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2979.93 33.24 89.7 <0.0000000000000002 ***
## nprevist 36.66 2.87 12.8 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 577 on 2998 degrees of freedom
## Multiple R-squared: 0.0517, Adjusted R-squared: 0.0514
## F-statistic: 163 on 1 and 2998 DF, p-value: <0.0000000000000002
What is the average infant birthweight when nprevist
is
equal to zero?
The intercept is 2979.93 grams, meaning when all our variables are 0, our model predicts the average infant birthweight will be 2979.93 grams.
The value of \(\beta\) is the amount
by which the dependent variable changes when the independent variable
increases by one unit. So, \(\beta\) =
36.66 for nprevist
means that the dependent variable
infants’ birthweight (y) goes up by 36.66 grams on average when the
number of prenatal visits (x) goes up by 1 unit. The p-value is less
than 0.05, suggesting that the coefficient for nprevist
is
statistically significant at the 0.05 level.
The interval has a probability of 95 % to contain the true value of \(\beta_i\). So in 95 % of all samples that could be drawn, the confidence interval will cover the true value of \(\beta_i\).
# Computes confidence intervals for the estimated coefficients in a fitted model.
confint(model1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 2915 3045
## nprevist 31 42