Aknowledgment: This work was adapted from the Gelman and Hill 2006 Book.

Citation:

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.

Examples taken from: Gianluca Rossi, 31 October 2015

Question 1

Logarithmic transformation and regression: consider the following regression:

\(log(weight) = -3.5 + 2.0 * log(height) + error\)

With error that have a standard deviation of .25. Weights are in pounds and heights are in inches

(a) Fill in the blanks: Approximately 68% of the persons will have weights with a factor of ___ and ___ of their predicted values from the regression.

Answer:

(b) Draw the regression line and scatterplot of log(weight) vs. log(height) that makes sense and is consistent with the fitted model. Be sure to label the axes!

require(arm)
## Loading required package: arm
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is C:/Users/brand/Desktop/UCR/PhD CCN/PSYC213_Spring2023/Lab/Week 4
# create some data of heights for males and females ( where 50% are male and 50% are female and the mean height for males = 69.7 and mean height for females = 63.8 with sd of 3.1 and 3 respectively)
height_f<-rnorm(n=50,mean=63.8,sd=3)
height_m<-rnorm(n=50,mean=69.7,sd=3.1)
heights<-c(height_f,height_m)
log_heights<-log(heights)
# create a vector of log weights based on the formula given in the exercise
log_weights<- -3.5+2*log_heights+rnorm(n=length(heights),mean=0,sd=.25)

Now fit a model

df_log_data<-data.frame(cbind(log_weights=log_weights,log_heights=log_heights))
model_1<-lm(log_weights~log_heights,data=df_log_data)

Plot the regression line for the model

plot(log_heights,log_weights,xlab="log(heights)",ylab="log(weights)")
abline(model_1)

Question 2 Plotting linear and nonlinear regressions: We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10)

              coef.est coef.se 
              

(Intercept) 161.0 7.3

age10 2.6 1.6

n = 2009, k = 2

residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)

              coef.est coef.se
              

(Intercept) 96.2 19.3

age10 33.6 8.7

age10.sq -3.2 0.9

n = 2009, k = 3,

residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)

              coef.est coef.se
              

(Intercept) 157.2 5.4

age30.44TRUE 19.1 7.0

age45.64TRUE 27.2 7.6

age65upTRUE 8.5 8.7

n = 2009, k = 4 residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.

# hint use ?stat_function to learn how to use this function to draw a line in ggplot2
require(ggplot2)
## Loading required package: ggplot2
df1<-data.frame(x=c(0,8))
func_1<-stat_function(fun=function(x) 161+2.6*x)
ggplot(df1,aes(x))+func_1+labs(x="Age in Decades",y="Weight in Pounds")

(b) On the same graph, draw the fitted regression line from the second model.

df1<-data.frame(x=c(0,8))
func_1<-stat_function(fun=function(x) 161+2.6*x)
func_2<-stat_function(fun=function(x) 96.2+33.6*x-3.2*(x)^2, col="red")
ggplot(df1,aes(x))+func_1+func_2+labs(x="Age in Decades",y="Weight in Pounds")

(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)

ggplot(data.frame(x=c(0,8)),aes(x))+
  stat_function(fun=function(x)157.2+
                  19.1*ifelse(3<=x & x<4.49,1,0)+
                  27.2*ifelse(4.5<=x & x<=6.49,1,0)+
                  8.5*ifelse(6.5<=x,1,0))+
  labs(x="Age (in decades)",y="Weight (in pounds)")

Special-purpose transformations: for a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_{i}\) and \(R_{i}\). You would like to combine these into a single variable that can be included as an input variable into a model predicting vote share for the Democrats

(a) Discuss the advantages and disadvantages of the following measures:

The simple difference, \(D_{i} − R_{i}\):

this seems like a good transformation because is symmetric and centered at zero. One disadvantage of this transformation is that it’s not proportional. For instance, if Democrats got $6M and Republicans just $4M the difference will be $2M. Now, in the hypothesis Democrats raised $3M and Republicans $1M the absolute difference will still be $2M. However, $2M in the first example corresponds to a much closed gap between the two parties than in the second example. This could limit the effectiveness of the predictor if districts differ widely in average money raised

The ratio, \(\frac{D_{i}}{R_{i}}\):

this transformation has the disadvantage of being centered at 1 and that is asymmetric. In particualar it tends to zero for case where the Republics have more money raised than Democrats, and tend to infinity on the opposite case. This means that it will weight more cases where Democrats raised more money than the opposite party

The difference on the logarithmic scale, \(log(D_{i}) − log(R_{i})\):

this is similar to the first transformation, with the difference that is less sensitive to outliers. It is centered to zero and is symmetric. Another advantage of this transformation is that is proportional to the magnitude of the difference. So a $2M difference between the parties on a county where each raise on average above $100M will have a lower value than the same difference on a district where fundraising is poorer

The relative proportion, \(\frac{D_{i}}{D_{i} + R_{i}}\):

this transformation is centered at 0.5 and symmetric

Question 4: Download the Haptic Search Data Set. Perform a series of analyses to understand the effects of set size and distractor lengths on search.

This data set is the search time in seconds for a 1 inch pipe among a varying number of distractor pipes. The distractor pipes were always the same length and that length varied by condition. Additonally, the number of pipes varied from either 4, 6, or 8 distractor pipes.

(a) Reading in data and peek at it

# make sure I am in the right directory where the data is stored

setwd("C:/Users/brand/Desktop/UCR/PhD CCN/PSYC213_Spring2023/Lab/Week 4")

# Read in the data and take a peek

haptics_data_1 <- read.csv("Haptic_Search_Data_2_exp_psyc_213_example.csv", header = TRUE)
head(haptics_data_1)
##   Experiment_Number Blk Distractor_Number Distractor_Lengths Relative_Size
## 1                 1   1                 1               0.25             3
## 2                 1   2                 1               0.25             3
## 3                 1   1                 1               0.25             3
## 4                 1   2                 1               0.25             3
## 5                 1   1                 1               0.25             3
## 6                 1   2                 1               0.25             3
##     Times Exp_Num_c
## 1  6.1157         0
## 2  7.2289         0
## 3  4.1537         0
## 4  4.8841         0
## 5  8.7824         0
## 6 11.3860         0

After looking at the data are there any variables that you have question about? What might those variables mean with relation to the experiment?

Answer:

I have already cleaned the data set, so what should we look at first?

library(ggplot2)
summary(haptics_data_1)
##  Experiment_Number      Blk        Distractor_Number Distractor_Lengths
##  Min.   :1.000     Min.   :1.000   Min.   :1.000     Min.   :0.2500    
##  1st Qu.:1.000     1st Qu.:1.000   1st Qu.:1.000     1st Qu.:0.5000    
##  Median :1.000     Median :2.000   Median :2.000     Median :0.7500    
##  Mean   :1.493     Mean   :1.514   Mean   :2.003     Mean   :0.9975    
##  3rd Qu.:2.000     3rd Qu.:2.000   3rd Qu.:3.000     3rd Qu.:1.5000    
##  Max.   :2.000     Max.   :2.000   Max.   :3.000     Max.   :1.7500    
##  Relative_Size        Times           Exp_Num_c     
##  Min.   :0.2500   Min.   : 0.5465   Min.   :0.0000  
##  1st Qu.:0.3333   1st Qu.: 7.1956   1st Qu.:0.0000  
##  Median :0.7500   Median :11.6495   Median :0.0000  
##  Mean   :0.9803   Mean   :14.7039   Mean   :0.4926  
##  3rd Qu.:1.0000   3rd Qu.:19.2105   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :60.0490   Max.   :1.0000
ggplot(data=haptics_data_1,aes(Distractor_Lengths,Times))+geom_point()+geom_jitter()

  #geom_point(mapping = aes( x=Distractor_Number, y=Times))

Let’s make a different plot to look at the data

ggplot(data=haptics_data_1,mapping=aes(x=Distractor_Number,y=Times,color=Distractor_Lengths))+geom_point()+geom_jitter()

(b) Model Building

Let’s test the most basic Model

model_2<-lm(data=haptics_data_1,"Times~Distractor_Lengths")
summary(model_2)
## 
## Call:
## lm(formula = "Times~Distractor_Lengths", data = haptics_data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.799  -7.241  -2.932   4.278  44.724 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         12.2515     0.3671  33.375  < 2e-16 ***
## Distractor_Lengths   2.4586     0.3233   7.604 3.69e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 3368 degrees of freedom
## Multiple R-squared:  0.01688,    Adjusted R-squared:  0.01659 
## F-statistic: 57.82 on 1 and 3368 DF,  p-value: 3.694e-14

What do the coefficients of this model tell us?

Answer: Our intercept is 12.25, meaning that with a distractor length of 0, our mean reaction time is 12.25 seconds. For every 1 unit increase in Distractor_Lengths, the reaction time increases 2.4586 seconds to the mean.

What might be a better model ?

model_3<-lm(data=haptics_data_1,"Times~Distractor_Lengths + Distractor_Number")
summary(model_3)
## 
## Call:
## lm(formula = "Times~Distractor_Lengths + Distractor_Number", 
##     data = haptics_data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.888  -7.233  -2.843   4.490  43.636 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         10.0641     0.5629  17.880  < 2e-16 ***
## Distractor_Lengths   2.4597     0.3221   7.636 2.90e-14 ***
## Distractor_Number    1.0915     0.2135   5.112 3.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.14 on 3367 degrees of freedom
## Multiple R-squared:  0.02445,    Adjusted R-squared:  0.02387 
## F-statistic:  42.2 on 2 and 3367 DF,  p-value: < 2.2e-16

Does this model do a better job with the data? How do you know?

Answer: Yes this model does a slightly better job at estimating the explained variance at 2.387%. The last model only explained 1.659% of the variance.

Let’s think about why this model might not fit the data well. Specifically what might be going on with our variable of Distractor lengths that might not work well with the linear model?

How might be account for this issue?

Now lets test the new model

model_4<-lm(data=haptics_data_1,"Times~Distractor_Number + Relative_Size")
summary(model_4)
## 
## Call:
## lm(formula = "Times~Distractor_Number + Relative_Size", data = haptics_data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.401  -6.932  -2.326   4.241  42.101 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.3478     0.4811  31.901  < 2e-16 ***
## Distractor_Number   1.1104     0.2074   5.355 9.13e-08 ***
## Relative_Size      -2.9257     0.1798 -16.273  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.848 on 3367 degrees of freedom
## Multiple R-squared:  0.07992,    Adjusted R-squared:  0.07937 
## F-statistic: 146.2 on 2 and 3367 DF,  p-value: < 2.2e-16

Does this model do a better job of accounting for the data?

What do the coefficients of this model mean? Do they make sense? Explain how you know.

Answer: This model does a better job at accounting for the data. The current model explains 7.937% of the variance. The Relative_Size coefficient is hard to make sense of in terms of the intercept. Our intercept marks the mean time that a participant would perform when the Distractor_Number is 0 and the Relative_Size is 0. Neither of those things can be zero within the context of the experiment, but it does show an increase in reaction times when there are more distractors. As the relative size difference increases it is difficult for me to interpret a decrease in reaction times that continues to get smaller as the Relative_Size difference increases

Now lets use what we have learned and build a more advanced model What Should we test?

model_5<-lm("Times~Distractor_Number*Relative_Size",data=haptics_data_1)
summary(model_5)
## 
## Call:
## lm(formula = "Times~Distractor_Number*Relative_Size", data = haptics_data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.903  -6.941  -2.392   4.176  41.600 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      13.9856     0.6458  21.657  < 2e-16 ***
## Distractor_Number                 1.7888     0.2984   5.994 2.26e-09 ***
## Relative_Size                    -1.5271     0.4780  -3.195  0.00141 ** 
## Distractor_Number:Relative_Size  -0.6949     0.2201  -3.157  0.00161 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.835 on 3366 degrees of freedom
## Multiple R-squared:  0.08264,    Adjusted R-squared:  0.08182 
## F-statistic: 101.1 on 3 and 3366 DF,  p-value: < 2.2e-16

(c) Knowing how this model fits what might we do to improve

We could look at more interaction effects.Maybe Blk x Distractor_Number, or Blk x Relative_Size