Acknowledgment: This work was adapted from Gelman and Hill (2006).
Citation:
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
Examples taken from: Gianluca Rossi, 31 October 2015
Logarithmic transformation and regression: consider the following regression:
\(\log(\text{weight}) = -3.5 + 2.0 \cdot \log(\text{height}) + \text{error}\)
where the errors have a standard deviation of 0.25. Weights are in pounds and heights are in inches.
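To get a feel for what the model implies, here is a quick worked example (the 66-inch height is a hypothetical value, not from the exercise):
exp(-3.5+2*log(66)) # predicted weight for a 66-inch person: about 131.5 pounds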
require(arm)
## Loading required package: arm
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is C:/Users/brand/Desktop/UCR/PhD CCN/PSYC213_Spring2023/Lab/Week 4
# simulate heights for 50 females and 50 males (a 50/50 split), with mean height
# 63.8 in (sd = 3) for females and 69.7 in (sd = 3.1) for males
height_f<-rnorm(n=50,mean=63.8,sd=3)
height_m<-rnorm(n=50,mean=69.7,sd=3.1)
heights<-c(height_f,height_m)
log_heights<-log(heights)
# create a vector of log weights based on the formula given in the exercise
log_weights<- -3.5+2*log_heights+rnorm(n=length(heights),mean=0,sd=.25)
Now fit a model
df_log_data<-data.frame(log_weights=log_weights,log_heights=log_heights)
model_1<-lm(log_weights~log_heights,data=df_log_data)
Plot the regression line for the model
plot(log_heights,log_weights,xlab="log(heights)",ylab="log(weights)")
abline(model_1)
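We can sanity-check the fit with arm's display() function; the intercept and slope should land near the true values of -3.5 and 2.0 (the exact numbers vary because the data are simulated):
# estimates should be close to the true values of -3.5 and 2.0
display(model_1)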
Graphing parameterized lines: the following regressions were fit to survey data on weight (in pounds) and age, where age10 is age in decades, age10.sq is its square, and age30.44, age45.64, and age65up are indicators for age categories. Use the estimates below to graph each fitted relationship.
lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)  161.0     7.3
age10          2.6     1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00
lm(formula = weight ~ age10 + age10.sq)
            coef.est coef.se
(Intercept)   96.2    19.3
age10         33.6     8.7
age10.sq      -3.2     0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01
lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)   157.2     5.4
age30.44TRUE   19.1     7.0
age45.64TRUE   27.2     7.6
age65upTRUE     8.5     7.3
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01
# hint: use ?stat_function to learn how to use this function to draw a line in ggplot2
require(ggplot2)
## Loading required package: ggplot2
df1<-data.frame(x=c(0,8))
func_1<-stat_function(fun=function(x) 161+2.6*x)
ggplot(df1,aes(x))+func_1+labs(x="Age in Decades",y="Weight in Pounds")
df1<-data.frame(x=c(0,8))
func_1<-stat_function(fun=function(x) 161+2.6*x)
func_2<-stat_function(fun=function(x) 96.2+33.6*x-3.2*(x)^2, col="red")
ggplot(df1,aes(x))+func_1+func_2+labs(x="Age in Decades",y="Weight in Pounds")
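One way to read the quadratic fit: the curve 96.2 + 33.6x - 3.2x^2 peaks where its derivative 33.6 - 6.4x equals zero:
33.6/(2*3.2) # age (in decades) at which the fitted quadratic peaks: 5.25, i.e. around age 52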
# indicator boundaries chosen so the age categories tile the axis with no gaps
ggplot(data.frame(x=c(0,8)),aes(x))+
  stat_function(fun=function(x) 157.2+
                  19.1*ifelse(3<=x & x<4.5,1,0)+
                  27.2*ifelse(4.5<=x & x<6.5,1,0)+
                  8.5*ifelse(6.5<=x,1,0))+
  labs(x="Age (in decades)",y="Weight (in pounds)")
Special-purpose transformations: for a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_{i}\) and \(R_{i}\). You would like to combine these into a single variable that can be included as an input variable in a model predicting vote share for the Democrats.
The simple difference, \(D_{i} − R_{i}\):
this seems like a good transformation because it is symmetric and centered at zero. One disadvantage is that it ignores the scale of the race. For instance, if Democrats raised $6M and Republicans $4M, the difference is $2M; if Democrats raised $3M and Republicans $1M, the difference is still $2M. However, the $2M gap in the first example corresponds to a much closer race than in the second. This could limit the effectiveness of the predictor if districts differ widely in the average amount of money raised.
The ratio, \(\frac{D_{i}}{R_{i}}\):
this transformation has the disadvantage of being centered at 1 and asymmetric. In particular, it tends to zero when Republicans raise more money than Democrats and to infinity in the opposite case. This means it gives disproportionate weight to districts where Democrats raised more money than the opposite party.
The difference on the logarithmic scale, \(log(D_{i}) − log(R_{i})\):
this is similar to the first transformation, except that it is less sensitive to outliers. It is centered at zero and symmetric. Another advantage is that it captures the relative rather than the absolute difference: a $2M gap between the parties in a district where each side raises over $100M yields a smaller value than the same $2M gap in a district where fundraising is far poorer.
The relative proportion, \(\frac{D_{i}}{D_{i} + R_{i}}\):
this transformation is symmetric, centered at 0.5, and bounded between 0 and 1.
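A quick numerical sketch of all four options, using the hypothetical $6M/$4M and $3M/$1M races from above:
# hypothetical fundraising (in $M) for two districts
D<-c(6,3) # Democratic candidates
R<-c(4,1) # Republican candidates
D-R # simple difference: 2, 2 (identical despite very different races)
D/R # ratio: 1.5, 3 (asymmetric, centered at 1)
log(D)-log(R) # log difference: 0.41, 1.10 (symmetric, relative)
D/(D+R) # relative proportion: 0.60, 0.75 (bounded, centered at 0.5)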
This data set contains search times (in seconds) for a 1-inch target pipe among a varying number of distractor pipes. Within a condition the distractor pipes were all the same length, and that length varied by condition. Additionally, the number of distractor pipes was either 4, 6, or 8.
# make sure I am in the right directory where the data is stored
setwd("C:/Users/brand/Desktop/UCR/PhD CCN/PSYC213_Spring2023/Lab/Week 4")
# Read in the data and take a peek
haptics_data_1 <- read.csv("Haptic_Search_Data_2_exp_psyc_213_example.csv", header = TRUE)
head(haptics_data_1)
## Experiment_Number Blk Distractor_Number Distractor_Lengths Relative_Size
## 1 1 1 1 0.25 3
## 2 1 2 1 0.25 3
## 3 1 1 1 0.25 3
## 4 1 2 1 0.25 3
## 5 1 1 1 0.25 3
## 6 1 2 1 0.25 3
## Times Exp_Num_c
## 1 6.1157 0
## 2 7.2289 0
## 3 4.1537 0
## 4 4.8841 0
## 5 8.7824 0
## 6 11.3860 0
After looking at the data, are there any variables that you have questions about? What might those variables mean in relation to the experiment?
Answer:
I have already cleaned the data set, so what should we look at first?
library(ggplot2)
summary(haptics_data_1)
## Experiment_Number Blk Distractor_Number Distractor_Lengths
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.2500
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.5000
## Median :1.000 Median :2.000 Median :2.000 Median :0.7500
## Mean :1.493 Mean :1.514 Mean :2.003 Mean :0.9975
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.5000
## Max. :2.000 Max. :2.000 Max. :3.000 Max. :1.7500
## Relative_Size Times Exp_Num_c
## Min. :0.2500 Min. : 0.5465 Min. :0.0000
## 1st Qu.:0.3333 1st Qu.: 7.1956 1st Qu.:0.0000
## Median :0.7500 Median :11.6495 Median :0.0000
## Mean :0.9803 Mean :14.7039 Mean :0.4926
## 3rd Qu.:1.0000 3rd Qu.:19.2105 3rd Qu.:1.0000
## Max. :3.0000 Max. :60.0490 Max. :1.0000
ggplot(data=haptics_data_1,aes(Distractor_Lengths,Times))+geom_jitter() # jitter avoids overplotting; stacking geom_point() on top would draw each point twice
#geom_point(mapping = aes( x=Distractor_Number, y=Times))
Let’s make a different plot to look at the data
ggplot(data=haptics_data_1,mapping=aes(x=Distractor_Number,y=Times,color=Distractor_Lengths))+geom_jitter()
Let’s test the most basic model
model_2<-lm(Times~Distractor_Lengths,data=haptics_data_1)
summary(model_2)
##
## Call:
## lm(formula = Times ~ Distractor_Lengths, data = haptics_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.799 -7.241 -2.932 4.278 44.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.2515 0.3671 33.375 < 2e-16 ***
## Distractor_Lengths 2.4586 0.3233 7.604 3.69e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.18 on 3368 degrees of freedom
## Multiple R-squared: 0.01688, Adjusted R-squared: 0.01659
## F-statistic: 57.82 on 1 and 3368 DF, p-value: 3.694e-14
What do the coefficients of this model tell us?
Answer: Our intercept is 12.25, meaning that with a distractor length of 0 the predicted mean search time is 12.25 seconds. For every 1-unit increase in Distractor_Lengths, the mean search time increases by about 2.46 seconds.
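As a quick sketch, the model's prediction at a (hypothetical) distractor length of 1:
predict(model_2,newdata=data.frame(Distractor_Lengths=1)) # 12.25 + 2.46 = ~14.71 seconds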
What might be a better model?
model_3<-lm(Times~Distractor_Lengths+Distractor_Number,data=haptics_data_1)
summary(model_3)
##
## Call:
## lm(formula = Times ~ Distractor_Lengths + Distractor_Number,
##     data = haptics_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.888 -7.233 -2.843 4.490 43.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0641 0.5629 17.880 < 2e-16 ***
## Distractor_Lengths 2.4597 0.3221 7.636 2.90e-14 ***
## Distractor_Number 1.0915 0.2135 5.112 3.36e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.14 on 3367 degrees of freedom
## Multiple R-squared: 0.02445, Adjusted R-squared: 0.02387
## F-statistic: 42.2 on 2 and 3367 DF, p-value: < 2.2e-16
Does this model do a better job with the data? How do you know?
Answer: Yes, this model does a slightly better job: it explains 2.39% of the variance (adjusted R-squared), while the last model explained only 1.66%.
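Because model_2 is nested inside model_3, we could also compare them formally with an F-test:
# formal comparison of the two nested models
anova(model_2,model_3)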
Let’s think about why this model might not fit the data well. Specifically, what might be going on with our Distractor_Lengths variable that might not work well in a linear model?
How might we account for this issue?
Now let’s test the new model
model_4<-lm(Times~Distractor_Number+Relative_Size,data=haptics_data_1)
summary(model_4)
##
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size, data = haptics_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.401 -6.932 -2.326 4.241 42.101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3478 0.4811 31.901 < 2e-16 ***
## Distractor_Number 1.1104 0.2074 5.355 9.13e-08 ***
## Relative_Size -2.9257 0.1798 -16.273 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.848 on 3367 degrees of freedom
## Multiple R-squared: 0.07992, Adjusted R-squared: 0.07937
## F-statistic: 146.2 on 2 and 3367 DF, p-value: < 2.2e-16
Does this model do a better job of accounting for the data?
What do the coefficients of this model mean? Do they make sense? Explain how you know.
Answer: This model does a better job of accounting for the data: it explains 7.94% of the variance. The intercept is hard to make sense of here. It marks the mean search time when Distractor_Number is 0 and Relative_Size is 0, and neither of those can be zero within the context of the experiment. The positive Distractor_Number coefficient does show that search times increase when there are more distractors. The negative Relative_Size coefficient shows that search times drop as the relative size difference grows, although a decrease that continues linearly as Relative_Size increases is difficult to interpret.
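One way to make the intercept interpretable (a sketch; DN_c, RS_c, and model_4c are names introduced here, not part of the original data) is to center the predictors at their sample means:
# center the predictors so the intercept becomes the predicted time
# at the average distractor number and relative size
haptics_data_1$DN_c<-haptics_data_1$Distractor_Number-mean(haptics_data_1$Distractor_Number)
haptics_data_1$RS_c<-haptics_data_1$Relative_Size-mean(haptics_data_1$Relative_Size)
model_4c<-lm(Times~DN_c+RS_c,data=haptics_data_1)
summary(model_4c) # slopes unchanged; intercept now refers to the sample means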
Now let’s use what we have learned and build a more advanced model. What should we test?
model_5<-lm(Times~Distractor_Number*Relative_Size,data=haptics_data_1)
summary(model_5)
##
## Call:
## lm(formula = Times ~ Distractor_Number * Relative_Size, data = haptics_data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.903 -6.941 -2.392 4.176 41.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9856 0.6458 21.657 < 2e-16 ***
## Distractor_Number 1.7888 0.2984 5.994 2.26e-09 ***
## Relative_Size -1.5271 0.4780 -3.195 0.00141 **
## Distractor_Number:Relative_Size -0.6949 0.2201 -3.157 0.00161 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.835 on 3366 degrees of freedom
## Multiple R-squared: 0.08264, Adjusted R-squared: 0.08182
## F-statistic: 101.1 on 3 and 3366 DF, p-value: < 2.2e-16
We could look at more interaction effects, maybe Blk x Distractor_Number or Blk x Relative_Size.
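For instance, a sketch of one such follow-up model (model_6 is a name introduced here):
# does the effect of relative size differ across blocks?
model_6<-lm(Times~Blk*Relative_Size,data=haptics_data_1)
summary(model_6)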