Acknowledgment: This work was adapted from Gelman and Hill (2006).
Citation:
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Examples taken from: Gianluca Rossi, 31 October 2015
Logarithmic transformation and regression: consider the following regression:
\(\log(\text{weight}) = -3.5 + 2.0 \cdot \log(\text{height}) + \text{error}\)
where the errors have a standard deviation of 0.25. Weights are in pounds and heights are in inches.
Because the errors are on the log scale, approximately 68% of actual weights fall within a multiplicative factor of exp(-0.25) ≈ 0.78 and exp(0.25) ≈ 1.28 of their predicted values.
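A quick check of this arithmetic in R (exponentiating ±1 standard deviation on the log scale gives the multiplicative range):
exp(c(-0.25, 0.25))
## [1] 0.7788008 1.2840254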
require(arm)
## Loading required package: arm
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is /Users/user/Desktop/R - Spring
# simulate heights for 1000 males and 1000 females (a 50/50 split):
# male mean = 69.7 in (sd = 3.1), female mean = 63.8 in (sd = 3.0)
male_data <- rnorm(1000, 69.7,3.1)
female_data <- rnorm(1000,63.8,3)
height_data <- c(male_data, female_data)
log_height_data <- log(height_data)
# create a vector of log weights based on the formula given in the exercise
#$log(weight) = -3.5 + 2.0 * log(height) + error$
#log_weight_data <- -3.5 + 2.0 * log_height_data + error
log_weight_data <- -3.5 + 2.0 * log_height_data + rnorm(n = length(height_data), mean = 0, sd = .25)
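An aside, not in the original code: because rnorm() draws new random numbers on every run, the estimates below will vary slightly from run to run. Calling set.seed() before the draws makes the simulation reproducible, e.g.
set.seed(123)  # any fixed seed works; 123 is an arbitrary choice
placed before the rnorm() calls above.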
Now fit a model
model1 <- lm(log_weight_data ~ log_height_data)
summary(model1)
##
## Call:
## lm(formula = log_weight_data ~ log_height_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83443 -0.16650 0.00492 0.16823 0.80385
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4124 0.3664 -9.315 <2e-16 ***
## log_height_data 1.9787 0.0872 22.690 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2473 on 1998 degrees of freedom
## Multiple R-squared: 0.2049, Adjusted R-squared: 0.2045
## F-statistic: 514.8 on 1 and 1998 DF, p-value: < 2.2e-16
The estimates (intercept -3.41, slope 1.98, residual sd 0.247) recover the simulation values of -3.5, 2.0, and 0.25. Plot the regression line for the model:
plot(log_height_data, log_weight_data, xlab = "height", ylab = "weight")
abline(model1)
# what is the difference between abline and geom_smooth?
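To answer the question in the comment: abline() draws a line on a base-R plot using the coefficients of an already-fitted model, while geom_smooth(method = "lm") fits its own regression inside ggplot2. A rough ggplot2 equivalent of the base plot above (a sketch; ggplot2 is loaded later in this document):
library(ggplot2)
ggplot(data.frame(log_height_data, log_weight_data),
       aes(x = log_height_data, y = log_weight_data)) +
  geom_point() +
  geom_smooth(method = "lm") +  # refits lm(y ~ x) internally
  labs(x = "log(height)", y = "log(weight)")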
Next, consider these regressions of weight (in pounds) on age, from Gelman and Hill: age10 is age in decades, age10.sq is its square, and age30.44, age45.64, and age65up are indicators for age groups (under 30 is the baseline). Plot the fitted curve from each model.
lm(formula = weight ~ age10)
             coef.est coef.se
(Intercept)   161.0    7.3
age10           2.6    1.8
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00
lm(formula = weight ~ age10 + age10.sq)
             coef.est coef.se
(Intercept)    96.2   19.3
age10          33.6    8.7
age10.sq       -3.2    0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01
lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)   157.2    5.4
age30.44TRUE   19.1    7.0
age45.64TRUE   27.2    7.6
age65upTRUE     8.5    7.3
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01
# hint: use ?stat_function to learn how to use this function to draw a line in ggplot2
require(ggplot2)
## Loading required package: ggplot2
df <- data.frame(x = c(0,8))
function_wt_age <- stat_function(fun = function(x) 161 + 2.6 * x, color = "#ff0aac")
ggplot(df, aes(x)) + function_wt_age + labs(x = "age", y = "weight")
Can we walk through this in class briefly if there is enough time?
function_wt_age_sq <- stat_function(fun = function(x) 96.2 + 33.6 * x - 3.2 * x^2, color = "#ff0aac")
ggplot(df, aes(x)) + function_wt_age + function_wt_age_sq + labs(x = "Age", y = "Weight")
ggplot(data.frame(x = c(0, 8)), aes(x)) +
  stat_function(fun = function(x) 157.2 +
                  19.1 * ifelse(3 <= x & x <= 4.49, 1, 0) +
                  27.2 * ifelse(4.5 <= x & x <= 6.49, 1, 0) +
                  8.5 * ifelse(6.5 <= x, 1, 0),
                color = "#ff0aac") +
  labs(x = "Age", y = "Weight")
# I'm kind of confused here, both with the code and the interpretation
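To unpack the code above: the three ifelse() terms reproduce the age-group indicator variables, so each coefficient switches on only within its age range and the fitted curve is a step function, flat within each category, with the under-30 group as the baseline at 157.2. A sketch of an equivalent, perhaps more readable, version (the helper function weight_by_age is my own naming, not from the original):
# logicals coerce to 0/1, so each comparison acts as an indicator variable
weight_by_age <- function(x) {
  157.2 +
    19.1 * (3 <= x & x <= 4.49) +   # ages 30-44 on the age/10 scale
    27.2 * (4.5 <= x & x <= 6.49) + # ages 45-64
    8.5 * (6.5 <= x)                # ages 65 and up
}
ggplot(df, aes(x)) + stat_function(fun = weight_by_age, color = "#ff0aac") +
  labs(x = "Age", y = "Weight")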
Special-purpose transformations: for a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_{i}\) and \(R_{i}\). You would like to combine these into a single variable that can be included as an input in a model predicting the Democrats' vote share.
The simple difference, \(D_{i} − R_{i}\):
this seems like a good transformation because it is symmetric and centered at zero. One disadvantage is that it ignores the overall scale of fundraising. For instance, if Democrats raised $6M and Republicans $4M, the difference is $2M; if instead Democrats raised $3M and Republicans $1M, the absolute difference is still $2M. But $2M in the first example corresponds to a much closer gap between the two parties than in the second. This could limit the effectiveness of the predictor if districts differ widely in the average amount of money raised.
The ratio, \(\frac{D_{i}}{R_{i}}\):
this transformation has the disadvantages of being centered at 1 and asymmetric: it tends to zero when Republicans raise more money than Democrats, and to infinity in the opposite case. As a result it gives more weight to cases where Democrats out-raise Republicans than to the reverse.
The difference on the logarithmic scale, \(log(D_{i}) − log(R_{i})\):
this is similar to the first transformation, except that it is less sensitive to extreme values. It is centered at zero and symmetric. Another advantage is that it captures relative rather than absolute differences: a $2M gap between the parties in a district where each raises over $100M yields a smaller value than the same gap in a district where fundraising is far poorer.
The relative proportion, \(\frac{D_{i}}{D_{i} + R_{i}}\):
this transformation is centered at 0.5, symmetric, and bounded between 0 and 1; it can be read directly as the Democratic share of total fundraising.
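A toy comparison of the four measures, with invented fundraising totals for two hypothetical districts (the $6M-vs-$4M and $3M-vs-$1M examples from above):
D <- c(6e6, 3e6)  # Democratic fundraising in two hypothetical districts
R <- c(4e6, 1e6)  # Republican fundraising
D - R             # simple difference: 2e6 in both districts
D / R             # ratio: 1.5 vs 3.0
log(D) - log(R)   # log difference: 0.41 vs 1.10
D / (D + R)       # relative proportion: 0.60 vs 0.75
Only the simple difference fails to distinguish the close race from the lopsided one.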
This data set contains search times, in seconds, for finding a 1-inch target pipe among distractor pipes. Within a condition the distractor pipes were all the same length, and that length varied across conditions; the number of distractor pipes was 4, 6, or 8.
# make sure I am in the right directory where the data is stored
#setwd to set working directory
# Read in the data and take a peek
hs_data <- read.csv("Haptic_Search_Data_2_exp_psyc_213_example.csv", header = TRUE)
head(hs_data)
## Experiment_Number Blk Distractor_Number Distractor_Lengths Relative_Size
## 1 1 1 1 0.25 3
## 2 1 2 1 0.25 3
## 3 1 1 1 0.25 3
## 4 1 2 1 0.25 3
## 5 1 1 1 0.25 3
## 6 1 2 1 0.25 3
## Times Exp_Num_c
## 1 6.1157 0
## 2 7.2289 0
## 3 4.1537 0
## 4 4.8841 0
## 5 8.7824 0
## 6 11.3860 0
After looking at the data, are there any variables that you have questions about? What might those variables mean in relation to the experiment?
Answer: Distractor_Number
I have already cleaned the data set, so what should we look at first? Descriptive statistics.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following objects are masked from 'package:arm':
##
## logit, rescale, sim
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(hs_data)
## Experiment_Number Blk Distractor_Number Distractor_Lengths
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.2500
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.5000
## Median :1.000 Median :2.000 Median :2.000 Median :0.7500
## Mean :1.493 Mean :1.514 Mean :2.003 Mean :0.9975
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.5000
## Max. :2.000 Max. :2.000 Max. :3.000 Max. :1.7500
## Relative_Size Times Exp_Num_c
## Min. :0.2500 Min. : 0.5465 Min. :0.0000
## 1st Qu.:0.3333 1st Qu.: 7.1956 1st Qu.:0.0000
## Median :0.7500 Median :11.6495 Median :0.0000
## Mean :0.9803 Mean :14.7039 Mean :0.4926
## 3rd Qu.:1.0000 3rd Qu.:19.2105 3rd Qu.:1.0000
## Max. :3.0000 Max. :60.0490 Max. :1.0000
describe(hs_data)
## vars n mean sd median trimmed mad min max range
## Experiment_Number 1 3370 1.49 0.50 1.00 1.49 0.00 1.00 2.00 1.00
## Blk 2 3370 1.51 0.50 2.00 1.52 0.00 1.00 2.00 1.00
## Distractor_Number 3 3370 2.00 0.82 2.00 2.00 1.48 1.00 3.00 2.00
## Distractor_Lengths 4 3370 1.00 0.54 0.75 1.00 0.74 0.25 1.75 1.50
## Relative_Size 5 3370 0.98 0.94 0.75 0.82 0.37 0.25 3.00 2.75
## Times 6 3370 14.70 10.26 11.65 13.17 7.89 0.55 60.05 59.50
## Exp_Num_c 7 3370 0.49 0.50 0.00 0.49 0.00 0.00 1.00 1.00
## skew kurtosis se
## Experiment_Number 0.03 -2.00 0.01
## Blk -0.06 -2.00 0.01
## Distractor_Number -0.01 -1.51 0.01
## Distractor_Lengths 0.01 -1.51 0.01
## Relative_Size 1.50 0.64 0.02
## Times 1.39 1.83 0.18
## Exp_Num_c 0.03 -2.00 0.01
ggplot(data = hs_data) + geom_point(mapping = aes(x = Distractor_Number, y = Times))
Let’s make a different plot to look at the data
ggplot(data = hs_data, mapping = aes(x = Distractor_Number, y = Times)) +
  geom_jitter(mapping = aes(color = Distractor_Lengths))
Let’s test the most basic model
rt_model <- lm(data = hs_data, Times ~ Distractor_Number)
summary(rt_model)
##
## Call:
## lm(formula = Times ~ Distractor_Number, data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.24 -7.48 -2.98 4.62 44.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.5198 0.4659 26.874 < 2e-16 ***
## Distractor_Number 1.0905 0.2153 5.064 4.32e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.23 on 3368 degrees of freedom
## Multiple R-squared: 0.007558, Adjusted R-squared: 0.007263
## F-statistic: 25.65 on 1 and 3368 DF, p-value: 4.316e-07
What do the coefficients of this model tell us?
Answer: When Distractor_Number is held at zero, i.e., no distractors, the predicted reaction time is 12.5198 seconds (an extrapolation, since Distractor_Number ranges from 1 to 3 in these data). Each one-unit increase in distractor number adds 1.0905 seconds to the predicted reaction time. This model is not very good, with an adjusted R-squared of only 0.007.
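As a sanity check (my own addition, not in the original), the fitted line passes through the means of the data: plugging in the mean distractor number of about 2 gives 12.52 + 1.09 * 2 ≈ 14.70 seconds, matching the mean of Times in the summary above. The same prediction in R:
predict(rt_model, newdata = data.frame(Distractor_Number = 2))
## about 14.70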
What might be a better model?
rt_model_2 <- lm(data = hs_data, Times ~ Distractor_Number + Distractor_Lengths)
summary(rt_model_2)
##
## Call:
## lm(formula = Times ~ Distractor_Number + Distractor_Lengths,
## data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.888 -7.233 -2.843 4.490 43.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0641 0.5629 17.880 < 2e-16 ***
## Distractor_Number 1.0915 0.2135 5.112 3.36e-07 ***
## Distractor_Lengths 2.4597 0.3221 7.636 2.90e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.14 on 3367 degrees of freedom
## Multiple R-squared: 0.02445, Adjusted R-squared: 0.02387
## F-statistic: 42.2 on 2 and 3367 DF, p-value: < 2.2e-16
Does this model do a better job with the data? How do you know?
Answer: This model does do a better job: the adjusted R-squared increased from .007 to .024. But it is still a poor model.
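Beyond comparing adjusted R-squared values, one option (not in the original) is a nested-model F test, since rt_model is nested within rt_model_2:
anova(rt_model, rt_model_2)  # tests whether adding Distractor_Lengths significantly reduces residual error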
Let’s think about why this model might not fit the data well. Specifically, what might be going on with our variable of distractor lengths that might not work well with the linear model? - Distractor number and distractor length are on different scales
How might we account for this issue? - Use Relative_Size instead of Distractor_Lengths
Now let’s test the new model
rt_model_3 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size)
summary(rt_model_3)
##
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size, data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.401 -6.932 -2.326 4.241 42.101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3478 0.4811 31.901 < 2e-16 ***
## Distractor_Number 1.1104 0.2074 5.355 9.13e-08 ***
## Relative_Size -2.9257 0.1798 -16.273 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.848 on 3367 degrees of freedom
## Multiple R-squared: 0.07992, Adjusted R-squared: 0.07937
## F-statistic: 146.2 on 2 and 3367 DF, p-value: < 2.2e-16
Does this model do a better job of accounting for the data? - Yes, the adjusted R-squared increased from .024 to .079.
What do the coefficients of this model mean? Do they make sense? Explain how you know.
Answer: When distractor number and relative size are both held at zero, the predicted reaction time is 15.3478 seconds. Holding relative size constant, each one-unit increase in the number of distractors increases reaction time by 1.1104 seconds. Holding distractor number constant, each one-unit increase in the relative size of the distractors decreases reaction time by 2.9257 seconds. This does not quite make sense to me: how can the size of the distractors change when there are none? Maybe I have a poor understanding of the data.
ggplot(data = hs_data, mapping = aes(x = Distractor_Number, y = Times)) +
  geom_jitter(mapping = aes(color = Relative_Size)) +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
Now let’s use what we have learned and build a more advanced model. What should we test? The interaction between distractor number and relative size.
rt_model_4 <- lm(data = hs_data, Times ~ Distractor_Number * Relative_Size)
summary(rt_model_4)
##
## Call:
## lm(formula = Times ~ Distractor_Number * Relative_Size, data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.903 -6.941 -2.392 4.176 41.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9856 0.6458 21.657 < 2e-16 ***
## Distractor_Number 1.7888 0.2984 5.994 2.26e-09 ***
## Relative_Size -1.5271 0.4780 -3.195 0.00141 **
## Distractor_Number:Relative_Size -0.6949 0.2201 -3.157 0.00161 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.835 on 3366 degrees of freedom
## Multiple R-squared: 0.08264, Adjusted R-squared: 0.08182
## F-statistic: 101.1 on 3 and 3366 DF, p-value: < 2.2e-16
The t value is the estimate divided by its standard error: -0.6949 / 0.2201 ≈ -3.16, matching the output above.
When both the number and the relative size of the distractors are held at zero, the predicted reaction time is 13.9856 seconds. When relative size is held at zero, each one-unit increase in distractor number increases reaction time by 1.7888 seconds. When distractor number is held at zero, each one-unit increase in relative size decreases reaction time by 1.5271 seconds. The interaction means these slopes are not constant: each additional unit of distractor number makes the slope of relative size 0.6949 seconds more negative, and vice versa.
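A short sketch (my own addition) making the interaction concrete: the slope of Relative_Size at each distractor-number condition is the main effect plus the interaction times the condition value.
b <- coef(rt_model_4)
b["Relative_Size"] + b["Distractor_Number:Relative_Size"] * 1:3
## roughly -2.22, -2.92, -3.61 seconds per unit of relative size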
hs_data$Relative_Squared <- hs_data$Relative_Size^2
rt_model_5 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size + Experiment_Number)
summary(rt_model_5)
##
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size + Experiment_Number,
## data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.748 -6.881 -2.470 4.239 42.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.3563 0.6973 24.892 < 2e-16 ***
## Distractor_Number 1.1225 0.2069 5.425 6.21e-08 ***
## Relative_Size -2.9503 0.1795 -16.436 < 2e-16 ***
## Experiment_Number -1.3458 0.3388 -3.972 7.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.827 on 3366 degrees of freedom
## Multiple R-squared: 0.08421, Adjusted R-squared: 0.08339
## F-statistic: 103.2 on 3 and 3366 DF, p-value: < 2.2e-16
# model the quadratic of relative size
rt_model_6 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size + Experiment_Number + Relative_Squared)
summary(rt_model_6)
##
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size + Experiment_Number +
## Relative_Squared, data = hs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.371 -6.648 -2.303 4.273 40.927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.0531 0.8251 25.517 < 2e-16 ***
## Distractor_Number 1.1156 0.2049 5.444 5.59e-08 ***
## Relative_Size -11.2466 1.0288 -10.932 < 2e-16 ***
## Experiment_Number -1.4122 0.3357 -4.207 2.65e-05 ***
## Relative_Squared 2.4571 0.3001 8.187 3.76e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.732 on 3365 degrees of freedom
## Multiple R-squared: 0.1021, Adjusted R-squared: 0.101
## F-statistic: 95.65 on 4 and 3365 DF, p-value: < 2.2e-16
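As an aside (an alternative specification, not part of the original analysis), the quadratic term can also be written inline with I(), which avoids creating the extra Relative_Squared column:
rt_model_6_alt <- lm(Times ~ Distractor_Number + Relative_Size +
                       Experiment_Number + I(Relative_Size^2), data = hs_data)
# identical fit to rt_model_6; I() protects ^ inside the formula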