Install the packages that will be used for this tutorial.
install.packages(“rsq”)
options(digits = 2) # results will round to two significant digits
good <- read.csv("good.csv")
attach(good) # This step is optional, remember if you decide not to attach,
# then you must specify ",data = good" in your regression model,
# or preface every variable mention with "good$"goodreg <- na.omit(good[,c("readss97","AGE97","HOME97","bthwht")])
# results are the same with using good,
# for convenience & efficiency by taking 4 variables out of 112)
# remove NAsummary(goodreg) ## readss97 AGE97 HOME97 bthwht
## Min. : 48 Min. : 3.0 Min. : 7.9 Min. :0.00
## 1st Qu.: 92 1st Qu.: 5.0 1st Qu.:18.1 1st Qu.:0.00
## Median :101 Median : 7.0 Median :20.5 Median :0.00
## Mean :102 Mean : 7.5 Mean :20.2 Mean :0.37
## 3rd Qu.:112 3rd Qu.:10.0 3rd Qu.:22.5 3rd Qu.:1.00
## Max. :166 Max. :13.0 Max. :27.0 Max. :1.00
str(goodreg)## 'data.frame': 1964 obs. of 4 variables:
## $ readss97: num 81 112 93 116 143 ...
## $ AGE97 : int 4 6 12 12 13 6 4 9 7 4 ...
## $ HOME97 : num 15.1 24 19.1 21 23.1 13.5 16 18.6 16.6 16 ...
## $ bthwht : int 1 0 1 1 1 1 1 0 0 1 ...
## - attr(*, "na.action")= 'omit' Named int 1 2 4 5 7 10 14 15 23 26 ...
## ..- attr(*, "names")= chr "1" "2" "4" "5" ...
Bar Chart for Dummy Variable
bw.tab <- table(goodreg$bthwht) # frequency table
bw.tab##
## 0 1
## 1238 726
prop.table(table(goodreg$bthwht)) # tables of proportions ##
## 0 1
## 0.63 0.37
# or prop.table(bw.tab)
barplot(bw.tab,xlab="Birthweight", ylab="N", col=c("grey", "light blue"),
main = "Graph 1: Bar Chart of Birthweight", ylim = c(0, 1500))Histogram and Kernel Density plot for Continuous Variable
Instead of counting the number of datapoints per bin, R can give the probability densities. For instance, Kernel Density plot produces a smooth curve estimating the probability density function of a continuous variable.
# Histogram
hist(goodreg$readss97, breaks = 50, col = "orange",
main = "Graph 2.1: Histogram of Readss97",
xlab = "Reading Scores in 1997")# Histogram with density plot
ggplot(goodreg, aes(x=readss97)) +
labs(title = "Graph 2.2: Histogram and Kernel Density of Readss97") +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666") # alpha: color opacity## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# simplified version of density plot
r <- density(goodreg$readss97)
plot(r)Your turn! Do the same for the other variables.
Look at their correlation and partial correlations: cor(), cor.test()
cor.test(goodreg$readss97,goodreg$AGE97)##
## Pearson's product-moment correlation
##
## data: goodreg$readss97 and goodreg$AGE97
## t = 5, df = 2000, p-value = 2e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.074 0.161
## sample estimates:
## cor
## 0.12
cor.test(goodreg$readss97,goodreg$HOME97)##
## Pearson's product-moment correlation
##
## data: goodreg$readss97 and goodreg$HOME97
## t = 20, df = 2000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.34 0.41
## sample estimates:
## cor
## 0.38
cor.test(goodreg$readss97,goodreg$bthwht)##
## Pearson's product-moment correlation
##
## data: goodreg$readss97 and goodreg$bthwht
## t = -4, df = 2000, p-value = 1e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.141 -0.054
## sample estimates:
## cor
## -0.098
# which IV has the largest correlation coefficient (r)?
pcor(goodreg, method = c("pearson")) ## $estimate
## readss97 AGE97 HOME97 bthwht
## readss97 1.000 0.075 0.351 -0.089
## AGE97 0.075 1.000 0.174 0.271
## HOME97 0.351 0.174 1.000 -0.095
## bthwht -0.089 0.271 -0.095 1.000
##
## $p.value
## readss97 AGE97 HOME97 bthwht
## readss97 0.0e+00 9.3e-04 4.9e-58 7.7e-05
## AGE97 9.3e-04 0.0e+00 9.6e-15 1.9e-34
## HOME97 4.9e-58 9.6e-15 0.0e+00 2.4e-05
## bthwht 7.7e-05 1.9e-34 2.4e-05 0.0e+00
##
## $statistic
## readss97 AGE97 HOME97 bthwht
## readss97 0.0 3.3 16.6 -4.0
## AGE97 3.3 0.0 7.8 12.5
## HOME97 16.6 7.8 0.0 -4.2
## bthwht -4.0 12.5 -4.2 0.0
##
## $n
## [1] 1964
##
## $gp
## [1] 2
##
## $method
## [1] "pearson"
# it runs when we cleared all NAs using na.omit() above# 'lm' function has na.omit as the factory default
lm <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg, na.action=na.omit)
lm1 <- lm(goodreg$readss97~goodreg$AGE97 + goodreg$HOME97 + goodreg$bthwht, data = good, na.action=na.omit)
lm2 <- lm(readss97~AGE97 + HOME97 + bthwht, data = goodreg, na.action=na.omit)
# All 3 options produce the same result and outputUse the summary() function to see your detailed regression output
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# What outputs can you see from the summary? # number of observations (i.e., rows) in your dataset
nrow(good)## [1] 3563
# regression summary tells you how many observations were deleted
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# 1599 observations deleted due to missingness
# Observations used = nrow(good) - observations deleted# Number of Observations used in your regression model
nobs(lm)## [1] 1964
The F value is the ratio of the mean regression sum of squaresdivided by the mean error sum of squares. It is the probability that the null hypothesis (H0) for the full model (i.e., all of the regression coefficients are jointly zero) is true. Null hypothesis (H0) is which we try to disprove or reject, e.g., at the 0.05 level.
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# It is significant at the .0001 level, we can reject H0.
# It means that there is at least one significant coefficient in the model.R^2 measures how close the data are to the fitted regression line. 0% indicates that the model explains none of the variability of the response data around its mean.
# How much of the variance in readss97 is explained by our independent variables?
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# It is .15, meaning that our model as predicted by age97, bthwht, home97 explains about 15% of the variance in readss97.# What is the average reading score when age, weight, and income are zero?
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# The Constant is 63.34, meaning when all our variables are 0,
# our model predicts the average reading score will be 63.34. # What are the coefficients for AGE97, HOME97, and bthwht? Are they statistically significant?
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# e.g Children with low birth weight are predicted to have reading scores
# almost 3 points lower than children without low birthweight,
# holding all other variables constant. The partial R-square (or coefficient of partial determination or a squared partial correlation) measures a fully partialled proportion of the variance in Y (the unique contribution) explained by Xi over the variance in Y that is not associated with any other predictors already included in the model.
rsq.partial(lm)## $adjustment
## [1] FALSE
##
## $variable
## [1] "AGE97" "HOME97" "bthwht"
##
## $partial.rsq
## [1] 0.0056 0.1233 0.0079
# Alternatively, you may use another package called "lmSupport"
# install.packages("lmSupport")
# library(lmSupport)
# modelEffectSizes(lm)Model coefficients are in different units, standardized coefficients helps us compare the effect size across multiple variables. Standardized Coefficient measures the effect of a one-SD change in Xj as opposed to a unit change in Xj.
By placing everything on a common metric of SD’s, we get comparability across βj’s. In other words it’s meant to be an interpretation aid. Statistical significance does not change!
goodreg$bthwht <- as.numeric(goodreg$bthwht) # standardized coefficient don't apply to dummy variable
linearM <- lm(scale(readss97)~scale(AGE97) + scale(HOME97) + scale(bthwht),
data = goodreg)
summary(linearM)##
## Call:
## lm(formula = scale(readss97) ~ scale(AGE97) + scale(HOME97) +
## scale(bthwht), data = goodreg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.432 -0.594 -0.019 0.605 3.403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.29e-16 2.08e-02 0.00 1.00000
## scale(AGE97) 7.29e-02 2.20e-02 3.32 0.00093 ***
## scale(HOME97) 3.55e-01 2.14e-02 16.61 < 2e-16 ***
## scale(bthwht) -8.59e-02 2.17e-02 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.92 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
coefficients(linearM)## (Intercept) scale(AGE97) scale(HOME97) scale(bthwht)
## -6.3e-16 7.3e-02 3.6e-01 -8.6e-02
lm3 <- lm(goodreg$readss97~goodreg$AGE97 + goodreg$bthwht)
summary(lm3)##
## Call:
## lm(formula = goodreg$readss97 ~ goodreg$AGE97 + goodreg$bthwht)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.68 -9.99 -0.56 10.15 62.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.728 0.982 99.52 < 2e-16 ***
## goodreg$AGE97 0.827 0.125 6.59 5.7e-11 ***
## goodreg$bthwht -4.469 0.761 -5.88 5.0e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16 on 1961 degrees of freedom
## Multiple R-squared: 0.031, Adjusted R-squared: 0.03
## F-statistic: 31.4 on 2 and 1961 DF, p-value: 3.94e-14
summary(lm)##
## Call:
## lm(formula = readss97 ~ AGE97 + HOME97 + bthwht, data = goodreg,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.96 -9.52 -0.31 9.70 54.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.342 2.266 27.96 < 2e-16 ***
## AGE97 0.399 0.120 3.32 0.00093 ***
## HOME97 1.829 0.110 16.61 < 2e-16 ***
## bthwht -2.848 0.719 -3.96 7.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15 on 1960 degrees of freedom
## Multiple R-squared: 0.151, Adjusted R-squared: 0.149
## F-statistic: 116 on 3 and 1960 DF, p-value: <2e-16
# Run a series of models adding variables and noting the change in R2
# Comparing model before and after you add independent variable(s)
# Examines the unique contribution of an independent variable above and beyond the other variables’ contribution and the covariation/contribution shared with other independent variables
# Note changes in R2 and F-Statisticplot <- ggplot(goodreg, aes(x = AGE97 + HOME97 + bthwht, y = readss97, col = bthwht)) +
labs(title = "Graph 3: Multiple Linear Regression of Reading Score in 1997",
subtitle = "Lab 3",
caption = "(based on data from good package)") +
geom_point(color='lightpink4', size=1.5) +
geom_smooth(method="lm", colour="black", size=1)
plotUse the dataset “statekffmultregression” in Canvas. It includes the following variables:
1) Run a model in which you use unemployment rate to predict the proportion of adults that rate their physical health as fair or poor. Call this model 1 and ask R for the summary. What do you find? What are the Model and Adjusted R-squared values? Interpret the coefficient of the unemployment rate.
2) Run a model in which you use the official poverty rate, the unemployment rate, and the percent of people that smoke to predict the proportion of adults that rate their physical health as fair or poor. Call this model 2. What are the Model and Adjusted R-squared values? What are the coefficients, and which are statistically significant?
3) Run model 2 with standardized coefficients, and call it model 3. Interpret your coefficients now.