Acknowledgment: This work was adapted from Gelman and Hill (2006).

Citation:

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Examples taken from: Gianluca Rossi, 31 October 2015

Question 1

Logarithmic transformation and regression: consider the following regression:

\(log(weight) = -3.5 + 2.0 * log(height) + error\)

with errors that have a standard deviation of 0.25. Weights are in pounds and heights are in inches.

(a) Fill in the blanks: Approximately 68% of the persons will have weights within a factor of ___ and ___ of their predicted values from the regression.

Answer: Approximately 68% of persons will have weights within a factor of \(e^{-0.25} \approx 0.78\) and \(e^{0.25} \approx 1.28\) of their predicted values, since the residual standard deviation on the log scale is 0.25.
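A quick check of those factors in R (just exponentiating plus or minus one residual sd on the log scale):

# multiplicative factors corresponding to +/- one sd (0.25) on the log scale
exp(c(-0.25, 0.25))  # approximately 0.78 and 1.28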

(b) Draw the regression line and scatterplot of log(weight) vs. log(height) that makes sense and is consistent with the fitted model. Be sure to label the axes!

require(arm)
## Loading required package: arm
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is /Users/user/Desktop/R - Spring
# Simulate heights: 50% male, 50% female; mean male height = 69.7 in and
# mean female height = 63.8 in, with sds of 3.1 and 3.0 respectively
male_data <- rnorm(1000, 69.7,3.1)
female_data <- rnorm(1000,63.8,3)

height_data <- c(male_data, female_data)

log_height_data <- log(height_data)
# create a vector of log weights based on the formula given in the exercise
#$log(weight) = -3.5 + 2.0 * log(height) + error$

#log_weight_data <- -3.5 + 2.0 * log_height_data + error

log_weight_data <- -3.5 + 2.0 * log_height_data + rnorm(n = length(height_data), mean = 0, sd = .25)

Now fit a model

model1 <- lm(log_weight_data ~ log_height_data)
summary(model1)
## 
## Call:
## lm(formula = log_weight_data ~ log_height_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82864 -0.17709 -0.00718  0.17639  0.87072 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.99298    0.37048  -10.78   <2e-16 ***
## log_height_data  2.11920    0.08826   24.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.252 on 1998 degrees of freedom
## Multiple R-squared:  0.2239, Adjusted R-squared:  0.2235 
## F-statistic: 576.5 on 1 and 1998 DF,  p-value: < 2.2e-16

Plot the regression line for the model

plot(log_height_data, log_weight_data, xlab = "log(height)", ylab = "log(weight)")
abline(model1)

# What is the difference between abline() and geom_smooth()?
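To answer the comment above: abline() adds a straight line with a given intercept and slope (here, the fitted coefficients) to an existing base-R plot, while geom_smooth() fits a model inside a ggplot2 plot and draws the fitted curve, by default with a confidence ribbon. A sketch of the equivalent ggplot2 figure (plot_df is just a name introduced here):

require(ggplot2)
# put the simulated vectors in a data frame for ggplot
plot_df <- data.frame(log_height = log_height_data, log_weight = log_weight_data)
ggplot(plot_df, aes(x = log_height, y = log_weight)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") +  # fits lm(y ~ x) and draws the line with a ribbon
  labs(x = "log(height)", y = "log(weight)")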

Question 2

Plotting linear and nonlinear regressions: We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We first created new variables: age10 = age/10 and age10.sq = (age/10)^2, and indicators age18.29, age30.44, age45.64, and age65up for four age categories. We then fit some regressions, with the following results:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)    161.0     7.3
age10            2.6     1.6
n = 2009, k = 2
residual sd = 119.7, R-Squared = 0.00

lm(formula = weight ~ age10 + age10.sq)
            coef.est coef.se
(Intercept)     96.2    19.3
age10           33.6     8.7
age10.sq        -3.2     0.9
n = 2009, k = 3
residual sd = 119.3, R-Squared = 0.01

lm(formula = weight ~ age30.44 + age45.64 + age65up)
             coef.est coef.se
(Intercept)     157.2     5.4
age30.44TRUE     19.1     7.0
age45.64TRUE     27.2     7.6
age65upTRUE       8.5     8.7
n = 2009, k = 4
residual sd = 119.4, R-Squared = 0.01

(a) On a graph of weights versus age (that is, weight on y-axis, age on x-axis), draw the fitted regression line from the first model.

# hint use ?stat_function to learn how to use this function to draw a line in ggplot2
require(ggplot2)
## Loading required package: ggplot2
df <- data.frame(x = c(0,8))
function_wt_age <- stat_function(fun = function(x) 161 + 2.6 *x, color = "#ff0aac")
ggplot(df, aes(x)) + function_wt_age + labs(x = "age", y = "weight")

Can we walk through this in class briefly if there is enough time?

(b) On the same graph, draw the fitted regression line from the second model.

function_wt_age_sq <- stat_function(fun = function(x) 96.2 + 33.6 * x - 3.2 * x^2, color = "#ff0aac")

ggplot(df, aes(x)) + function_wt_age + function_wt_age_sq + labs(x = "Age", y = "Weight")

(c) On another graph with the same axes and scale, draw the fitted regression line from the third model. (It will be discontinuous.)

ggplot(data.frame(x = c(0, 8)), aes(x)) +
  stat_function(fun = function(x) {
    157.2 +
      19.1 * ifelse(3.0 <= x & x < 4.5, 1, 0) +
      27.2 * ifelse(4.5 <= x & x < 6.5, 1, 0) +
      8.5 * ifelse(x >= 6.5, 1, 0)
  }, color = "#ff0aac") +
  labs(x = "Age", y = "Weight")

# I'm kind of confused here, both with the code and the interpretation.

Interpretation: the third model treats age as four categories, with 18-29 as the baseline. The fitted "line" is therefore a step function on the age/10 scale: predicted weight is 157.2 lb for ages 18-29, 157.2 + 19.1 for 30-44, 157.2 + 27.2 for 45-64, and 157.2 + 8.5 for 65 and up. Each ifelse() in the code simply switches the corresponding indicator on within its age bracket.

Question 3

Special-purpose transformations: For a study of congressional elections, you would like a measure of the relative amount of money raised by each of the two major-party candidates in each district. Suppose that you know the amount of money raised by each candidate; label these dollar values \(D_{i}\) and \(R_{i}\). You would like to combine these into a single variable that can be included as an input variable in a model predicting vote share for the Democrats.

(a) Discuss the advantages and disadvantages of the following measures:

The simple difference, \(D_{i} − R_{i}\):

This seems like a good transformation because it is symmetric and centered at zero. One disadvantage is that it is not proportional: if Democrats raised $6M and Republicans $4M, the difference is $2M; if Democrats raised $3M and Republicans $1M, the absolute difference is still $2M. However, $2M in the first case corresponds to a much closer race than in the second. This could limit the usefulness of the predictor if districts differ widely in the average amount of money raised.

The ratio, \(\frac{D_{i}}{R_{i}}\):

This transformation has the disadvantages of being centered at 1 and asymmetric: it tends to zero when Republicans raise more money than Democrats, and to infinity in the opposite case. This means it gives more weight to cases where Democrats out-raise the opposing party.

The difference on the logarithmic scale, \(log(D_{i}) − log(R_{i})\):

This is similar to the first transformation, but it is less sensitive to outliers. It is centered at zero and symmetric, since \(log(D_{i}) − log(R_{i}) = log(D_{i}/R_{i})\). Another advantage is that it reflects relative rather than absolute differences: a $2M gap between the parties in a district where each candidate raises over $100M yields a smaller value than the same gap in a district where fundraising is poorer.

The relative proportion, \(\frac{D_{i}}{D_{i} + R_{i}}\):

This transformation is centered at 0.5, symmetric, and bounded between 0 and 1, so extreme fundraising gaps cannot dominate the measure.
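To make the comparison concrete, a small sketch with hypothetical dollar amounts (in millions) for two districts:

# hypothetical totals: district 1 is $6M vs $4M, district 2 is $3M vs $1M
D <- c(6, 3)
R <- c(4, 1)
data.frame(difference = D - R,            # 2 in both districts
           ratio      = D / R,            # 1.5 vs 3
           log_diff   = log(D) - log(R),  # about 0.41 vs 1.10
           proportion = D / (D + R))      # 0.60 vs 0.75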

Question 4: Download the Haptic Search Data Set. Perform a series of analyses to understand the effects of set size and distractor lengths on search.

This data set contains search times in seconds for a 1-inch pipe among a varying number of distractor pipes. Within a condition the distractor pipes were all the same length, and that length varied across conditions. The number of distractor pipes was 4, 6, or 8.

(a) Reading in data and peek at it

# make sure I am in the right directory where the data is stored

#setwd to set working directory

# Read in the data and take a peek

hs_data <- read.csv("Haptic_Search_Data_2_exp_psyc_213_example.csv", header = TRUE)
head(hs_data)
##   Experiment_Number Blk Distractor_Number Distractor_Lengths Relative_Size
## 1                 1   1                 1               0.25             3
## 2                 1   2                 1               0.25             3
## 3                 1   1                 1               0.25             3
## 4                 1   2                 1               0.25             3
## 5                 1   1                 1               0.25             3
## 6                 1   2                 1               0.25             3
##     Times Exp_Num_c
## 1  6.1157         0
## 2  7.2289         0
## 3  4.1537         0
## 4  4.8841         0
## 5  8.7824         0
## 6 11.3860         0

After looking at the data, are there any variables that you have questions about? What might those variables mean in relation to the experiment?

Answer: Distractor_Number. Its values are 1-3 rather than 4, 6, or 8, so it presumably indexes the set-size condition rather than the raw number of pipes.
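One quick way to check that guess (a sketch; I have not verified the coding against the experiment write-up):

# if Distractor_Number indexes the set-size condition (4, 6, or 8 pipes),
# it should take only the values 1, 2, 3, in roughly equal counts
table(hs_data$Distractor_Number)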

I have already cleaned the data set, so what should we look at first? Descriptive statistics.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following objects are masked from 'package:arm':
## 
##     logit, rescale, sim
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
summary(hs_data)
##  Experiment_Number      Blk        Distractor_Number Distractor_Lengths
##  Min.   :1.000     Min.   :1.000   Min.   :1.000     Min.   :0.2500    
##  1st Qu.:1.000     1st Qu.:1.000   1st Qu.:1.000     1st Qu.:0.5000    
##  Median :1.000     Median :2.000   Median :2.000     Median :0.7500    
##  Mean   :1.493     Mean   :1.514   Mean   :2.003     Mean   :0.9975    
##  3rd Qu.:2.000     3rd Qu.:2.000   3rd Qu.:3.000     3rd Qu.:1.5000    
##  Max.   :2.000     Max.   :2.000   Max.   :3.000     Max.   :1.7500    
##  Relative_Size        Times           Exp_Num_c     
##  Min.   :0.2500   Min.   : 0.5465   Min.   :0.0000  
##  1st Qu.:0.3333   1st Qu.: 7.1956   1st Qu.:0.0000  
##  Median :0.7500   Median :11.6495   Median :0.0000  
##  Mean   :0.9803   Mean   :14.7039   Mean   :0.4926  
##  3rd Qu.:1.0000   3rd Qu.:19.2105   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :60.0490   Max.   :1.0000
describe(hs_data)
##                    vars    n  mean    sd median trimmed  mad  min   max range
## Experiment_Number     1 3370  1.49  0.50   1.00    1.49 0.00 1.00  2.00  1.00
## Blk                   2 3370  1.51  0.50   2.00    1.52 0.00 1.00  2.00  1.00
## Distractor_Number     3 3370  2.00  0.82   2.00    2.00 1.48 1.00  3.00  2.00
## Distractor_Lengths    4 3370  1.00  0.54   0.75    1.00 0.74 0.25  1.75  1.50
## Relative_Size         5 3370  0.98  0.94   0.75    0.82 0.37 0.25  3.00  2.75
## Times                 6 3370 14.70 10.26  11.65   13.17 7.89 0.55 60.05 59.50
## Exp_Num_c             7 3370  0.49  0.50   0.00    0.49 0.00 0.00  1.00  1.00
##                     skew kurtosis   se
## Experiment_Number   0.03    -2.00 0.01
## Blk                -0.06    -2.00 0.01
## Distractor_Number  -0.01    -1.51 0.01
## Distractor_Lengths  0.01    -1.51 0.01
## Relative_Size       1.50     0.64 0.02
## Times               1.39     1.83 0.18
## Exp_Num_c           0.03    -2.00 0.01
ggplot(data = hs_data) + geom_point(mapping = aes(x = Distractor_Number, y = Times))

Let’s make a different plot to look at the data

ggplot(data = hs_data, mapping = aes(x = Distractor_Number, y = Times)) +
  geom_jitter(mapping = aes(color = Distractor_Lengths))

(b) Model Building

Let’s test the most basic model

rt_model <- lm(data = hs_data, Times ~ Distractor_Number)
summary(rt_model)
## 
## Call:
## lm(formula = Times ~ Distractor_Number, data = hs_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.24  -7.48  -2.98   4.62  44.26 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        12.5198     0.4659  26.874  < 2e-16 ***
## Distractor_Number   1.0905     0.2153   5.064 4.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.23 on 3368 degrees of freedom
## Multiple R-squared:  0.007558,   Adjusted R-squared:  0.007263 
## F-statistic: 25.65 on 1 and 3368 DF,  p-value: 4.316e-07

What do the coefficients of this model tell us?

Answer: When Distractor_Number is held at zero (i.e., there are no distractors), the model predicts a reaction time of 12.5198 seconds. For every one-unit increase in distractor number, reaction time increases by 1.0905 seconds. This model is not very good, with an adjusted R-squared of only 0.007.
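As a sanity check on this reading, the fitted line can be evaluated at the observed codes (a sketch using the model above):

# predicted times at Distractor_Number = 1, 2, 3
predict(rt_model, newdata = data.frame(Distractor_Number = 1:3))
# equivalently, by hand from the coefficients
coef(rt_model)[1] + coef(rt_model)[2] * 1:3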

What might be a better model?

rt_model_2 <- lm(data = hs_data, Times ~ Distractor_Number + Distractor_Lengths)
summary(rt_model_2)
## 
## Call:
## lm(formula = Times ~ Distractor_Number + Distractor_Lengths, 
##     data = hs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.888  -7.233  -2.843   4.490  43.636 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         10.0641     0.5629  17.880  < 2e-16 ***
## Distractor_Number    1.0915     0.2135   5.112 3.36e-07 ***
## Distractor_Lengths   2.4597     0.3221   7.636 2.90e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.14 on 3367 degrees of freedom
## Multiple R-squared:  0.02445,    Adjusted R-squared:  0.02387 
## F-statistic:  42.2 on 2 and 3367 DF,  p-value: < 2.2e-16

Does this model do a better job with the data? How do you know?

Answer: This model does do a better job, because the adjusted R-squared increased from .007 to .024, but it is still a poor model.
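Since the two models are nested, the adjusted R-squared comparison can be backed up with more formal checks (a sketch using base-R tools):

anova(rt_model, rt_model_2)  # incremental F-test for adding Distractor_Lengths
AIC(rt_model, rt_model_2)    # lower AIC = better fit/complexity trade-off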

Let’s think about why this model might not fit the data well. Specifically, what might be going on with our Distractor_Lengths variable that might not work well in a linear model? - Distractor number and distractor length are on different scales

How might we account for this issue? - Use Relative_Size instead of Distractor_Lengths
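If the concern were purely the difference in scales, another option would be to standardize the predictors; a sketch (this rescales the coefficients to per-sd units but leaves the fit statistics unchanged):

rt_model_2z <- lm(Times ~ scale(Distractor_Number) + scale(Distractor_Lengths),
                  data = hs_data)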

Now let’s test the new model

rt_model_3 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size)
summary(rt_model_3)
## 
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size, data = hs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.401  -6.932  -2.326   4.241  42.101 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        15.3478     0.4811  31.901  < 2e-16 ***
## Distractor_Number   1.1104     0.2074   5.355 9.13e-08 ***
## Relative_Size      -2.9257     0.1798 -16.273  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.848 on 3367 degrees of freedom
## Multiple R-squared:  0.07992,    Adjusted R-squared:  0.07937 
## F-statistic: 146.2 on 2 and 3367 DF,  p-value: < 2.2e-16

Does this model do a better job of accounting for the data? - Yes, the adjusted R-squared increased from .024 to .079.

What do the coefficients of this model mean? Do they make sense? Explain how you know.

Answer: When distractor number and relative size are both held at zero, the predicted reaction time is 15.3478 seconds. For every one-unit increase in the number of distractors, holding relative size constant, reaction time increases by 1.1104 seconds. For every one-unit increase in the relative size of the distractors, holding distractor number constant, reaction time decreases by 2.9257 seconds. At first this seems odd: how can distractor size matter when there are no distractors? The resolution is that the intercept is an extrapolation: Distractor_Number never equals zero in the data (it ranges from 1 to 3), so "held at zero" describes a point outside the observed range rather than a condition that actually occurred.
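One way to make the "held at zero" language sensible is to center the predictors at their means, so the intercept becomes the predicted time for an average trial; a sketch (DN_c and RS_c are names introduced here):

# centering changes the intercept's meaning but leaves the slopes unchanged
hs_data$DN_c <- hs_data$Distractor_Number - mean(hs_data$Distractor_Number)
hs_data$RS_c <- hs_data$Relative_Size - mean(hs_data$Relative_Size)
rt_model_3c <- lm(Times ~ DN_c + RS_c, data = hs_data)
coef(rt_model_3c)  # intercept is now roughly the mean search time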

ggplot(data = hs_data, mapping = aes(x = Distractor_Number, y = Times)) +
  geom_jitter(mapping = aes(color = Relative_Size)) +
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Now let’s use what we have learned and build a more advanced model. What should we test? The interaction between distractor number and relative size.

rt_model_4 <- lm(data = hs_data, Times ~ Distractor_Number * Relative_Size)
summary(rt_model_4)
## 
## Call:
## lm(formula = Times ~ Distractor_Number * Relative_Size, data = hs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.903  -6.941  -2.392   4.176  41.600 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      13.9856     0.6458  21.657  < 2e-16 ***
## Distractor_Number                 1.7888     0.2984   5.994 2.26e-09 ***
## Relative_Size                    -1.5271     0.4780  -3.195  0.00141 ** 
## Distractor_Number:Relative_Size  -0.6949     0.2201  -3.157  0.00161 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.835 on 3366 degrees of freedom
## Multiple R-squared:  0.08264,    Adjusted R-squared:  0.08182 
## F-statistic: 101.1 on 3 and 3366 DF,  p-value: < 2.2e-16

The t value for the interaction is -0.6949 / 0.2201 ≈ -3.16, i.e., the estimate divided by its standard error.

When both the number and the size of the distractors are held at zero, the predicted reaction time is 13.9856 seconds. For every one-unit increase in distractor number, with relative size held at zero, reaction time increases by 1.7888 seconds. For every one-unit increase in relative size, with distractor number held at zero, reaction time decreases by 1.5271 seconds. The interaction means these slopes depend on each other: each additional unit of distractor number makes the slope of relative size 0.6949 seconds more negative (and, symmetrically, each unit of relative size lowers the distractor-number slope by the same amount).
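To see what the interaction implies, the slope of Relative_Size can be computed at each observed set-size code (a sketch from the coefficients above):

# simple slope of Relative_Size at Distractor_Number = 1, 2, 3
b <- coef(rt_model_4)
b["Relative_Size"] + b["Distractor_Number:Relative_Size"] * 1:3
# roughly -2.22, -2.92, -3.61: the size effect steepens as set size grows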

(c) Knowing how this model fits, what might we do to improve it?

hs_data$Relative_Squared <- hs_data$Relative_Size^2  # quadratic term for relative size

rt_model_5 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size + Experiment_Number)
summary(rt_model_5)
## 
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size + Experiment_Number, 
##     data = hs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.748  -6.881  -2.470   4.239  42.549 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        17.3563     0.6973  24.892  < 2e-16 ***
## Distractor_Number   1.1225     0.2069   5.425 6.21e-08 ***
## Relative_Size      -2.9503     0.1795 -16.436  < 2e-16 ***
## Experiment_Number  -1.3458     0.3388  -3.972 7.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.827 on 3366 degrees of freedom
## Multiple R-squared:  0.08421,    Adjusted R-squared:  0.08339 
## F-statistic: 103.2 on 3 and 3366 DF,  p-value: < 2.2e-16
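As an aside, the quadratic term created above could also be written inline with I(), without adding a column to the data; a sketch of an equivalent call:

rt_model_6b <- lm(Times ~ Distractor_Number + Relative_Size + Experiment_Number +
                    I(Relative_Size^2), data = hs_data)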
#model the quadratic of relative size
rt_model_6 <- lm(data = hs_data, Times ~ Distractor_Number + Relative_Size + Experiment_Number + Relative_Squared)
summary(rt_model_6)
## 
## Call:
## lm(formula = Times ~ Distractor_Number + Relative_Size + Experiment_Number + 
##     Relative_Squared, data = hs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.371  -6.648  -2.303   4.273  40.927 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        21.0531     0.8251  25.517  < 2e-16 ***
## Distractor_Number   1.1156     0.2049   5.444 5.59e-08 ***
## Relative_Size     -11.2466     1.0288 -10.932  < 2e-16 ***
## Experiment_Number  -1.4122     0.3357  -4.207 2.65e-05 ***
## Relative_Squared    2.4571     0.3001   8.187 3.76e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.732 on 3365 degrees of freedom
## Multiple R-squared:  0.1021, Adjusted R-squared:  0.101 
## F-statistic: 95.65 on 4 and 3365 DF,  p-value: < 2.2e-16
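To visualize the fitted quadratic, predicted time can be plotted against Relative_Size with the other predictors held at representative values (a sketch; 2 and 1.5 are roughly the sample means of Distractor_Number and Experiment_Number):

# fitted curve over the observed range of Relative_Size (0.25 to 3)
b <- coef(rt_model_6)
curve(b["(Intercept)"] + b["Distractor_Number"] * 2 + b["Experiment_Number"] * 1.5 +
        b["Relative_Size"] * x + b["Relative_Squared"] * x^2,
      from = 0.25, to = 3,
      xlab = "Relative_Size", ylab = "Predicted search time (s)")

Because the quadratic coefficient is positive, the curve is U-shaped, with its minimum near Relative_Size = 11.2466 / (2 × 2.4571) ≈ 2.29.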