Assignment summary

Chapter 14: Even-numbered questions 2–18, plus question 19.
Chapter 15: Questions 1–5.

Chapter 14

Question 2:

Q: The formula for a regression equation based on a sample size of 25 observations is Y’ = 2X + 9.

  1. What would be the predicted score for a person scoring 6 on X?
  2. If someone’s predicted score was 14, what was this person’s score on X?

A: (a) With X = 6, we can simply calculate our Y as 2(6) + 9 = 21. (b) Reversing course, if our predicted value of Y = 14, then we know our X = (14 - 9)/2 = 2.5
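A quick sanity check in R (the helper names here are ours, purely for illustration):

predicted_y <- function(x) 2 * x + 9            # Y' = 2X + 9
predicted_y(6)                                  # 21
x_from_y <- function(y_pred) (y_pred - 9) / 2   # invert the equation
x_from_y(14)                                    # 2.5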

Question 4:

Q: What does the standard error of the estimate measure? What is the formula for the standard error of the estimate?

A: The standard error of the estimate measures the accuracy of our predictions: roughly, the typical size of a prediction error. The formula for the standard error of the estimate is sigma_est = sqrt[ sum((Y - Y’)^2) / N ], where Y - Y’ is the difference between each actual and predicted Y, and N is our sample size.
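A minimal R sketch of that formula (our own helper, following the book's definitional version, which divides by N; some texts divide by N - 2 for the inferential version):

# standard error of the estimate: sqrt of the mean squared prediction error
se_est <- function(Y, Y_pred) sqrt(sum((Y - Y_pred)^2) / length(Y))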

Question 6:

Q: For the X,Y data below, compute:

  1. r and determine if it is significantly different from zero.
  2. the slope of the regression line and test if it differs significantly from zero.
  3. the 95% confidence interval for the slope.
X <- c(2,4,4,5,6)
Y <- c(5,6,7,11,12)
XY <- cbind(X,Y)
XY
##      X  Y
## [1,] 2  5
## [2,] 4  6
## [3,] 4  7
## [4,] 5 11
## [5,] 6 12
XY.LM <- lm(Y ~ X)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##      0.1818       1.9091
summary(XY.LM)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##       1       2       3       4       5 
##  1.0000 -1.8182 -0.8182  1.2727  0.3636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.1818     2.2234   0.082   0.9400  
## X             1.9091     0.5048   3.782   0.0324 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.497 on 3 degrees of freedom
## Multiple R-squared:  0.8266, Adjusted R-squared:  0.7688 
## F-statistic:  14.3 on 1 and 3 DF,  p-value: 0.0324

A: (a) Our “un-adjusted” R^2 value is 0.8266, so r = sqrt(0.8266) = 0.9092, meaning our predictor explains the outcome very well. Testing r with t = r*sqrt(n-2)/sqrt(1-r^2) = 3.78 on n - 2 = 3 degrees of freedom gives p = 0.0324, so r is significantly different from zero at the .05 level. (b) The slope of our linear model is 1.9091 (0.1818 is the intercept); with a p-value of 0.0324 it too differs significantly from zero. (c) The 95% confidence interval for the slope is the estimate plus or minus t(0.975, df = 3) = 3.182 standard errors: 1.9091 +/- 3.182(0.5048), leaving us with [0.303, 3.515].
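R confirms both the correlation test and the interval directly (base-R functions, shown here as a check):

cor.test(X, Y)          # t = 3.78, df = 3, p-value ~0.032 (same test as the slope)
confint(XY.LM, "X")     # 95% CI for the slope: roughly 0.30 to 3.52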

Question 8:

Q: The correlation between years of education and salary in a sample of 20 people from a certain company is .4. Is this correlation statistically significant at the .05 level?

A: To find out whether the correlation is statistically significant I can calculate my t-stat and compare it to the critical value for df = n - 2 = 18 (equivalently, compare its p-value to alpha = 0.05). Here t-stat = [r * sqrt(n-2)] / sqrt(1-r^2) = [0.4 * sqrt(18)] / sqrt(0.84) = 1.8516.

Because my t-stat of 1.8516 is below the two-tailed critical value t(0.975, 18) = 2.101 (p = 0.08 > 0.05), our correlation is not statistically significant at the .05 level.
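The same test in R, with the numbers plugged in (a minimal sketch):

r <- 0.4
n <- 20
t.stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t.stat                              # 1.8516
2 * pt(-abs(t.stat), df = n - 2)    # p ~0.08, above .05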

Question 10:

Q: Using linear regression, find the predicted post-test score for someone with a score of 43 on the pre-test.

Pre <- c(59,52,44,51,42,42,41,45,27,63,54,44,50,47,55,49,45,57,46,60,65,64,50,74,59)
Post <- c(56,63,55,50,66,48,58,36,13,50,81,56,64,50,63,57,73,63,46,60,47,73,58,85,44)
Pre.Post <- cbind(Pre,Post)
Pre.Post
##       Pre Post
##  [1,]  59   56
##  [2,]  52   63
##  [3,]  44   55
##  [4,]  51   50
##  [5,]  42   66
##  [6,]  42   48
##  [7,]  41   58
##  [8,]  45   36
##  [9,]  27   13
## [10,]  63   50
## [11,]  54   81
## [12,]  44   56
## [13,]  50   64
## [14,]  47   50
## [15,]  55   63
## [16,]  49   57
## [17,]  45   73
## [18,]  57   63
## [19,]  46   46
## [20,]  60   60
## [21,]  65   47
## [22,]  64   73
## [23,]  50   58
## [24,]  74   85
## [25,]  59   44
Pre.Post.LM <- lm(Post ~ Pre)
Pre.Post.LM
## 
## Call:
## lm(formula = Post ~ Pre)
## 
## Coefficients:
## (Intercept)          Pre  
##     16.1552       0.7869
summary(Pre.Post.LM)
## 
## Call:
## lm(formula = Post ~ Pre)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.401  -6.351   2.288   6.486  22.354 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  16.1552    13.5774   1.190  0.24624   
## Pre           0.7869     0.2596   3.032  0.00593 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.61 on 23 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2544 
## F-statistic: 9.191 on 1 and 23 DF,  p-value: 0.005933

A: The results indicate our predictor coefficient is 0.7869 (m) and our y-intercept is 16.1552 (b). Given the equation of a line y = mx + b, if our x is 43, then our predicted post-test score is 0.7869(43) + 16.1552 = 49.9919

Pre.Post.DF <- data.frame(Pre,Post)
library(ggplot2)   # needed for ggplot(); assumed installed
ggplot(Pre.Post.DF, aes(x=Pre, y=Post)) + 
   geom_point(size = 2, shape = 23) + 
   geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

And looking at our plotted data we can see the predictive model seems to fit.
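The same prediction can also be read off the fitted model with predict(), as a check on the hand calculation:

predict(Pre.Post.LM, newdata = data.frame(Pre = 43))   # ~49.99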

Question 12:

Q: Based on the table below, compute the regression line that predicts Y from X.

##      Mx My  sX sY    r
## [1,] 10 12 2.5  3 -0.6

A: For a linear model our equation y = bx + A can be filled in using the means (Mx, My), standard deviations (sX, sY), and correlation (r) above.

# plug in the table values
Mx <- 10; My <- 12; sX <- 2.5; sY <- 3; r <- -0.6
# The slope (b) can be calculated as follows:
b <- r * sY / sX    # -0.6 * 3 / 2.5 = -0.72
# The intercept (A) can be calculated as:
A <- My - b * Mx    # 12 - (-0.72)(10) = 19.2

This yields the regression line y = -0.72*x + 19.2

Question 14:

Q: True/false: If the slope of a simple linear regression line is statistically significant, then the correlation will also always be significant.

A: True. In simple linear regression the test of the slope and the test of the correlation are the same test: t = r*sqrt(n-2)/sqrt(1-r^2) for the correlation is identical to b/SE(b) for the slope, with the same n - 2 degrees of freedom. Question 6 above illustrates this: both tests produced p = 0.0324. So if the slope is significant, the correlation will always be significant as well.

Question 16:

Q: True/false: If the correlation is .8, then 40% of the variance is explained.

A: False. The correlation is r, and the proportion of variance explained is r squared. In this case that would be 0.8^2 = 0.64, so 64% of the variance is explained.

The following questions use data from the Angry Moods (AM) case study.

Question 18:

Q: Find the regression line for predicting Anger-Out from Control-Out.

  1. What is the slope?
  2. What is the intercept?
  3. Is the relationship at least approximately linear?
  4. Test to see if the slope is significantly different from 0.
  5. What is the standard error of the estimate?

A:

q16file <- read.csv(file = "angry_moods.csv", header = TRUE)
AI.AO.LM <- lm(q16file$Anger.Out ~ q16file$Control.Out)
AI.AO.LM
## 
## Call:
## lm(formula = q16file$Anger.Out ~ q16file$Control.Out)
## 
## Coefficients:
##         (Intercept)  q16file$Control.Out  
##             28.4948              -0.5241
summary(AI.AO.LM)
## 
## Call:
## lm(formula = q16file$Anger.Out ~ q16file$Control.Out)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.488 -2.440 -0.295  2.193 10.560 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         28.49482    2.02477   14.07  < 2e-16 ***
## q16file$Control.Out -0.52413    0.08386   -6.25 2.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.45 on 76 degrees of freedom
## Multiple R-squared:  0.3395, Adjusted R-squared:  0.3308 
## F-statistic: 39.07 on 1 and 76 DF,  p-value: 2.183e-08
q16file.DF <- data.frame(q16file)
Control.Out <- q16file$Control.Out
Anger.Out <- q16file$Anger.Out
ggplot(q16file.DF, aes(x=Control.Out, y=Anger.Out)) + 
   geom_point(size = 2, shape = 23) + 
   geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  1. The slope is -0.5241
  2. The intercept is 28.4948
  3. The scatterplot looks at least approximately linear: the points show no obvious curvature around the fitted line
  4. The slope's t value is -6.25 with a p-value of 2.18e-08, well below the 0.05 threshold, so we can be confident it is significantly different from 0
  5. The standard error of the estimate is the residual standard error, 3.45 on 76 degrees of freedom; 0.08386 is the standard error of the slope, a different quantity (see the check below)
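For (5), the residual standard error reported by summary() can be reproduced directly from the residuals:

sqrt(sum(resid(AI.AO.LM)^2) / df.residual(AI.AO.LM))   # ~3.45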

The following question is from the SAT and GPA (SG) case study.

Question 19:

Q: Find the regression line for predicting the overall university GPA from the high school GPA.

  1. What is the slope?
  2. What is the y-intercept?
  3. If someone had a 2.2 GPA in high school, what is the best estimate of his or her college GPA?
  4. If someone had a 4.0 GPA in high school, what is the best estimate of his or her college GPA?

A:

q19file <- read.csv(file = "sat.csv", header = TRUE)
univ_GPA <- q19file$univ_GPA
high_GPA <- q19file$high_GPA
UGP.HGP.LM <- lm(univ_GPA ~ high_GPA)
UGP.HGP.LM
## 
## Call:
## lm(formula = univ_GPA ~ high_GPA)
## 
## Coefficients:
## (Intercept)     high_GPA  
##      1.0968       0.6748
summary(UGP.HGP.LM)
## 
## Call:
## lm(formula = univ_GPA ~ high_GPA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69040 -0.11922  0.03274  0.17397  0.91278 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09682    0.16663   6.583 1.98e-09 ***
## high_GPA     0.67483    0.05342  12.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2814 on 103 degrees of freedom
## Multiple R-squared:  0.6077, Adjusted R-squared:  0.6039 
## F-statistic: 159.6 on 1 and 103 DF,  p-value: < 2.2e-16
q19file.DF <- data.frame(q19file)
ggplot(q19file.DF, aes(x=high_GPA, y=univ_GPA)) + 
   geom_point(size = 2, shape = 23) + 
   geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  1. The slope is 0.6748
  2. The intercept is 1.0968
  3. If someone had a 2.2 GPA in high school, the estimated university GPA would be 1.0968 + 0.6748(2.2) = 2.58136
  4. If someone had a 4.0 GPA in high school, the estimated university GPA would be 1.0968 + 0.6748(4.0) = 3.796 (both estimates are verified with predict() below)
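As a check on those two hand calculations:

predict(UGP.HGP.LM, newdata = data.frame(high_GPA = c(2.2, 4.0)))   # ~2.58 and ~3.80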

Chapter 15

Question 1:

Q: What is the null hypothesis tested by analysis of variance?

A: “Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means.” The null hypothesis is that all of the population means are equal, i.e. that there are no differences between the means.

Question 2:

Q: What are the assumptions of between-subjects analysis of variance?

A: The assumptions are as follows from the book:
  1. The populations have the same variance. This assumption is called the assumption of homogeneity of variance.
  2. The populations are normally distributed.
  3. Each value is sampled independently from each other value. This assumption requires that each subject provide only one value. If a subject provides two scores, then the values are not independent. The analysis of data with two scores per subject is shown in the section on within-subjects ANOVA later in this chapter.

Question 3:

Q: What is a between-subjects variable?

A: A between-subjects variable is one for which each subject experiences only one condition (level) of the variable. E.g. in our Smiles and Leniency case study there were four conditions, but only one was assigned to each subject.

Question 4:

Q: Why not just compute t-tests among all pairs of means instead of computing an analysis of variance?

A: Computing t-tests among all pairs of means inflates the Type I error rate: each test carries its own chance of a false positive, so across many comparisons the probability of at least one spurious “significant” difference climbs well above the nominal .05. ANOVA tests all the means with a single test, holding the overall Type I error rate at alpha. (Avoiding many pairwise calculations on a large data set is also cheaper, but error control is the main reason.)
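A rough sketch of that inflation, assuming independent tests at alpha = .05 (pairwise t-tests on shared data are not fully independent, so this is an approximation):

alpha <- 0.05
k <- choose(4, 2)      # e.g. 6 pairwise comparisons among 4 group means
1 - (1 - alpha)^k      # ~0.26 chance of at least one false positive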

Question 5:

Q: What is the difference between “N” and “n”?

A: In this chapter's notation, lowercase n is the number of subjects in each group (the per-condition sample size), while capital N is the total number of observations across all groups.