Question 9

The tables below show the regression output of a multiple regression model relating \(\tt Salary\), the beginning salary in dollars of employees in a given company, to the following predictor variables: \(\tt Education\), \(\tt Experience\), and \(\tt STEM\), an indicator (0 or 1) of whether the employee has an undergraduate degree in a STEM field. (The units of both \(\tt Education\) and \(\tt Experience\) are years. You can assume an intercept model.)

  1. Fill in the missing values in the tables. You may not necessarily be able to compute everything, but be as explicit as possible. (2 points)
# ANOVA table: columns are Df, Sum Sq, Mean Sq, F value, Pr(>F);
# rows are Regression and Residuals
table1 = matrix(nrow = 2, ncol = 5)
table1[1,1] = 3                         # regression df: 3 predictors
table1[2,1] = 62                        # residual df
table1[,2] = c(2216338, 8913083)        # given sums of squares
options(digits = 8)
table1[1,3] = table1[1,2]/table1[1,1]   # mean square = SS / df
table1[2,3] = table1[2,2]/table1[2,1]
table1[1,4] = table1[1,3]/table1[2,3]   # F = MSreg / MSE
table1[1,5] = 1 - pf(table1[1,4], table1[1,1], table1[2,1])  # p-value
table1
##      [,1]    [,2]      [,3]      [,4]         [,5]
## [1,]    3 2216338 738779.33 5.1389983 0.0030778765
## [2,]   62 8913083 143759.40        NA           NA
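The completed ANOVA table also pins down the sample size, \(n = 3 + 62 + 1 = 66\), and the model's \(R^2\) through the decomposition \(R^2 = SS_{reg}/(SS_{reg}+SS_{res})\); a quick check:
SSreg = table1[1,2]; SSres = table1[2,2]
SSreg / (SSreg + SSres)  # R-squared, approximately 0.199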
# Coefficients table: columns are Estimate, Std. Error, t value, Pr(>|t|);
# rows are Intercept, Education, Experience, STEM; 0 marks a missing cell
table2 = matrix(nrow = 4, ncol = 4)
table2[,1] = c(3316.4, 850.0, 932.4, 0)  # given estimates (STEM missing)
table2[,2] = c(937.7, 0, 260.1, 330.1)   # given standard errors (Education missing)
table2[,3] = c(0, 3.646, 0, 1.675)       # given t values (Intercept, Experience missing)
table2[1,3] = table2[1,1]/table2[1,2]    # t = Estimate / SE
table2[2,2] = table2[2,1]/table2[2,3]    # SE = Estimate / t
table2[3,3] = table2[3,1]/table2[3,2]
table2[4,1] = table2[4,2]*table2[4,3]    # Estimate = SE * t
table2[,4] = 2 * pt(table2[,3], 62, lower = FALSE)  # two-sided p-values, 62 df
table2
table2
##           [,1]     [,2]      [,3]          [,4]
## [1,] 3316.4000 937.7000 3.5367388 0.00077332142
## [2,]  850.0000 233.1322 3.6460000 0.00054683456
## [3,]  932.4000 260.1000 3.5847751 0.00066450386
## [4,]  552.9175 330.1000 1.6750000 0.09897146444
  2. Test whether or not the linear regression model explains significantly more variability in \(\tt Salary\) than a model with no explanatory variables. What assumptions are you making? Specify the null and alternative hypotheses, the test used, and your conclusion using \(\alpha = 0.05\). (2 points)
  • The assumptions are:
  • \(Salary=\beta_0+\beta_1 Education+\beta_2 Experience+\beta_3 STEM+\varepsilon\), with \(\varepsilon\sim N(0, \sigma^{2})\);
  • the observations form an independent sample.
  • The null hypothesis is \(H_0:\beta_{1}=\beta_{2}=\beta_{3}=0\). The alternative hypothesis is \(H_1:\beta_{j}\neq 0\) for at least one \(j\in\{1,2,3\}\).
  • Test: overall F-test.
p_value = table1[1,5]
p_value
## [1] 0.0030778765
  • Conclusion: since the p-value \(= 0.00308 < 0.05\), we reject \(H_0:\beta_{1}=\beta_{2}=\beta_{3}=0\) and conclude that at least one of the predictors is needed.
  3. Is there a positive linear relationship between \(\tt Salary\) and \(\tt Experience\), after accounting for the effect of the variables \(\tt STEM\) and \(\tt Education\)? (Hint: one-sided test for individual regression coefficient) Specify the null and alternative hypotheses, the test used, and your conclusion using \(\alpha = 0.05\). (2 points)
  • The null hypothesis is \(H_0: \beta_{2}\le 0\). The alternative hypothesis is \(H_1:\beta_{2}>0\).
  • Test: one-sided t-test on the \(\tt Experience\) coefficient, with 62 degrees of freedom.
p_value = pt(table2[3,3], 62, lower = FALSE)
p_value
## [1] 0.00033225193
  • Conclusion: since the one-sided p-value \(= 0.00033 < 0.05\), we reject \(H_0\) and conclude that there is a positive linear relationship between Salary and Experience, after accounting for STEM and Education.
  4. What salary interval would you predict for an electrical engineer with 10 years of education and 5 years working in a related field? You may not necessarily be able to compute everything, but be as explicit as possible. Assume that the confidence level is \(1 - \alpha = 0.95\). (2 points)
beta_hat = table2[,1]
a = c(1, 10, 5, 1)       # intercept, Education = 10, Experience = 5, STEM = 1
y_hat = a %*% beta_hat   # point prediction
y_hat
##           [,1]
## [1,] 17031.318
sigma.hat = sqrt(table1[2,2]/table1[2,1])  # residual SD = sqrt(MSE), about 379.16
q = qt(1 - 0.05/2, 62)   # two-sided critical value with 62 residual df, about 1.999
  • The 95% prediction interval is \(\hat{y} \pm t_{0.025,62}\,\hat\sigma\sqrt{1 + a^T(X^TX)^{-1}a}\). The output does not report \((X^TX)^{-1}\), so the point prediction \(\hat{y} = 17031.32\) is as far as the numerical computation can go.
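If \((X^TX)^{-1}\) were available, the interval could be completed as in the sketch below; xtx_inv is a hypothetical placeholder for that unreported \(4\times 4\) matrix. The mean-response interval in the next part is identical except that the leading 1 under the square root is dropped.
# xtx_inv = solve(t(X) %*% X)  # hypothetical: X is the 66 x 4 design matrix, not given
se.pred = sigma.hat * sqrt(1 + t(a) %*% xtx_inv %*% a)  # SE for a new observation
c(y_hat - q * se.pred, y_hat + q * se.pred)  # 95% prediction interval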
  5. What salary interval would you predict, on average, for English majors with 10 years of education and 6 years in a related field? You may not necessarily be able to compute everything, but be as explicit as possible. Assume that the confidence level is \(1 - \alpha = 0.95\). (2 points)
beta_hat = table2[,1]
a = c(1, 10, 6, 0)       # intercept, Education = 10, Experience = 6, STEM = 0
y_hat = a %*% beta_hat   # estimated mean salary
y_hat
##         [,1]
## [1,] 17410.8
  • The 95% confidence interval for the mean response is \(\hat{y} \pm t_{0.025,62}\,\hat\sigma\sqrt{a^T(X^TX)^{-1}a}\); as in the previous part, it cannot be evaluated numerically without \((X^TX)^{-1}\), so we stop at \(\hat{y} = 17410.80\).

Question 10

This question is based on our textbook Page 86, Exercise 3.15. A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:

  • Age: Median age of a person living in a state.

  • HS: Percentage of people over 25 years of age in a state who had completed high school.

  • Income: Per capita personal income for a state (income in dollars).

  • Black: Percentage of blacks living in a state.

  • Female: Percentage of females living in a state.

  • Price: Weighted average price (in cents) of a pack of cigarettes in a state.

  • Sales: Number of packs of cigarettes sold in a state on a per capita basis.

The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt. [Hint: Use read.table() to read the data.]

In 1 - 2 below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.

  1. Test the hypothesis that the variable \(\tt Female\) is not needed in the regression equation relating \(\tt Sales\) to the six predictor variables. (2 points)
cigarettes = read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE)
head(cigarettes)
##   State  Age   HS Income Black Female Price Sales
## 1    AL 27.0 41.3   2948  26.2   51.7  42.7  89.8
## 2    AK 22.9 66.7   4644   3.0   45.7  41.8 121.3
## 3    AZ 26.3 58.1   3665   3.0   50.8  38.5 115.2
## 4    AR 29.1 39.9   2878  18.3   51.5  38.8 100.3
## 5    CA 28.1 62.6   4493   7.0   50.8  39.7 123.0
## 6    CO 26.2 63.9   3855   3.0   50.7  31.1 124.8
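As a quick structural check (a sketch), the study covers the 50 states plus the District of Columbia, so the data frame should have 51 rows and 8 columns:
dim(cigarettes)  # expect 51 rows and 8 columns (State plus 7 variables)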

\(Sales=\beta_0+\beta_1 Age+\beta_2 HS+\beta_3 Income+\beta_4 Black+\beta_5 Female+\beta_6 Price+\varepsilon\)

  • The null hypothesis is \(H_0:\beta_5=0\). The alternative hypothesis is \(H_1:\beta_5\neq 0\). Test: t-test on the \(\tt Female\) coefficient from the full model.
cigarettes.lm <- lm(Sales ~  Age + HS + Income + Black + Female + Price,data = cigarettes)
summary(cigarettes.lm)
## 
## Call:
## lm(formula = Sales ~ Age + HS + Income + Black + Female + Price, 
##     data = cigarettes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -48.3982 -12.3876  -5.3672   6.2703 133.2135 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.344846 245.607185  0.4208 0.675969   
## Age           4.520452   3.219768  1.4040 0.167348   
## HS           -0.061586   0.814684 -0.0756 0.940084   
## Income        0.018946   0.010216  1.8546 0.070364 . 
## Black         0.357535   0.487219  0.7338 0.466946   
## Female       -1.052859   5.561008 -0.1893 0.850706   
## Price        -3.254918   1.031407 -3.1558 0.002886 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.174 on 44 degrees of freedom
## Multiple R-squared:  0.32084,    Adjusted R-squared:  0.22823 
## F-statistic: 3.4644 on 6 and 44 DF,  p-value: 0.006857
  • Conclusion: the p-value \(= 0.8507 > 0.05\), so at significance level \(\alpha=0.05\) we fail to reject \(H_0:\beta_5=0\); \(\tt Female\) is not needed in the equation.
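As a cross-check (a sketch, output not shown), the same hypothesis can be tested with a partial F-test by dropping \(\tt Female\) alone; for a single coefficient \(F = t^2\), so the p-value agrees with the t-test above.
cigarettes.lm.noFemale = update(cigarettes.lm, . ~ . - Female)  # refit without Female
anova(cigarettes.lm.noFemale, cigarettes.lm)  # partial F-test; F equals t^2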
  2. Test the hypothesis that the variables \(\tt Female\) and \(\tt HS\) are not needed in the regression equation relating \(\tt Sales\) to the six predictor variables. (2 points)
  • The null hypothesis is \(H_0:\beta_2=\beta_5=0\). The alternative hypothesis is \(H_1:\) at least one of \(\beta_2,\beta_5\) is nonzero. Test: partial F-test comparing the reduced and full models.
cigarettes.lm.reduced = lm(Sales ~  Age + Income + Black + Price,data = cigarettes)
anova(cigarettes.lm.reduced, cigarettes.lm)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Age + Income + Black + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
##   Res.Df     RSS Df Sum of Sq       F  Pr(>F)
## 1     46 34959.8                             
## 2     44 34926.0  2   33.7986 0.02129 0.97895
  • Conclusion: the p-value \(= 0.97895 > 0.05\), so at significance level \(\alpha=0.05\) we fail to reject \(H_0:\beta_2=\beta_5=0\); the two variables are not needed.
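The same two-constraint hypothesis can also be written as a general linear hypothesis; a sketch assuming the car package is installed:
library(car)
linearHypothesis(cigarettes.lm, c("HS = 0", "Female = 0"))  # reproduces the F-test above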
  3. Compute a 95% confidence interval for the true regression coefficient of the variable \(\tt Income\) in the regression equation relating \(\tt Sales\) to the six predictor variables. (2 points)
confint(cigarettes.lm, c("Income"), level=0.95) 
##                2.5 %      97.5 %
## Income -0.0016425173 0.039535423
  • The interval \((-0.00164, 0.03954)\) contains 0, consistent with \(\tt Income\) not being significant at the 5% level in a two-sided t-test.
  4. What percentage of the variation in \(\tt Sales\) can be accounted for when \(\tt Income\) is removed from the regression equation relating \(\tt Sales\) to the six predictor variables? Which model did you use? (2 points)
cigarettes.lm2 <- lm(Sales ~  Age + HS + Black + Female + Price,data = cigarettes)
R_squared <- summary(cigarettes.lm2)$r.squared
R_squared
## [1] 0.26775262
  • We used the five-predictor model with \(\tt Income\) removed, Sales ~ Age + HS + Black + Female + Price: approximately 26.8% of the variation in \(\tt Sales\) is accounted for by this model.
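The drop from the full model's \(R^2 = 0.3208\) to 0.2678 measures the contribution of \(\tt Income\) given the other five predictors; as a sketch (output not shown), this increment can be tested with a partial F-test:
anova(cigarettes.lm2, cigarettes.lm)  # partial F-test for Income given the rest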
  5. What percentage of the variation in \(\tt Sales\) can be accounted for by the three variables: \(\tt Price\), \(\tt Age\), and \(\tt Income\)? Explain. (2 points)
cigarettes.lm3 <- lm(Sales ~  Age + Income + Price,data = cigarettes)
R_squared3 <- summary(cigarettes.lm3)$r.squared
R_squared3
## [1] 0.30324338
  • Approximately 30.3% of the variation in Sales can be accounted for by these three predictors.
  6. What percentage of the variation in \(\tt Sales\) can be accounted for by the variable \(\tt Income\), when \(\tt Sales\) is regressed on only \(\tt Income\)? Explain. (2 points)
cigarettes.lm4 <- lm(Sales ~ Income ,data = cigarettes)
R_squared4 <- summary(cigarettes.lm4)$r.squared
R_squared4
## [1] 0.10632027
  • Approximately 10.6% of the variation in \(\tt Sales\) can be accounted for by \(\tt Income\) alone; this is the \(R^2\) of the simple regression of \(\tt Sales\) on \(\tt Income\).

Question 11

A researcher in a scientific foundation wished to evaluate the relation between intermediate and senior level annual salaries of bachelor's and master's level mathematicians (\(\tt Y\), in thousand dollars) and an index of work quality (\(\tt X1\)), number of years of experience (\(\tt X2\)), and an index of publication success (\(\tt X3\)). The data for a sample of 24 bachelor's and master's level mathematicians can be found at http://www.stanford.edu/class/stats191/data/math-salaries.table. [Hint: Use read.table() to read the data.]

  1. Make the scatter plot matrix and the correlation matrix of the table. Summarize the results. (2 points)
data1 = read.table("http://www.stanford.edu/class/stats191/data/math-salaries.table",header = TRUE)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(data1)  # scatter plot matrix of Y, X1, X2, X3

cor(data1)
##             Y         X1         X2         X3
## Y  1.00000000 0.66709585 0.85855820 0.55819603
## X1 0.66709585 1.00000000 0.46695108 0.32276117
## X2 0.85855820 0.46695108 1.00000000 0.25375299
## X3 0.55819603 0.32276117 0.25375299 1.00000000
  • \(\tt Y\) is most strongly correlated with \(\tt X2\) (r = 0.86), and moderately correlated with \(\tt X1\) (r = 0.67) and \(\tt X3\) (r = 0.56); the correlations among the predictors themselves are modest (all below 0.47).
  2. Fit a linear regression model for salary based on \(\tt X1\), \(\tt X2\), \(\tt X3\). Report the fitted regression function. (2 points)

\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\varepsilon\)

Y.lm = lm(Y~X1+X2+X3,data = data1)
summary(Y.lm)
## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.24629 -0.95928  0.03772  1.19950  3.30886 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 17.846931   2.001876  8.9151 2.103e-08 ***
## X1           1.103130   0.329573  3.3471 0.0032092 ** 
## X2           0.321520   0.037109  8.6643 3.327e-08 ***
## X3           1.288941   0.298479  4.3184 0.0003342 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.7528 on 20 degrees of freedom
## Multiple R-squared:  0.91086,    Adjusted R-squared:  0.89749 
## F-statistic: 68.119 on 3 and 20 DF,  p-value: 1.124e-10
  • The fitted regression function is \(\hat{Y} = 17.847 + 1.103\,X_1 + 0.322\,X_2 + 1.289\,X_3\).
  3. Test the overall goodness of fit for the regression model at level \(\alpha = 0.10\). Specify the null and alternative hypotheses, as well as the test used. (2 points)
Y_reduced.lm <- lm(Y ~ 1, data = data1)
anova(Y_reduced.lm,Y.lm)
## Analysis of Variance Table
## 
## Model 1: Y ~ 1
## Model 2: Y ~ X1 + X2 + X3
##   Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
## 1     23 689.260                                   
## 2     20  61.443  3   627.817 68.1192 1.124e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The null hypothesis is \(H_0:\beta_1=\beta_2=\beta_3=0\). The alternative hypothesis is \(H_1: \beta_j\neq 0\) for at least one \(j\in\{1,2,3\}\).

  • Test: overall F-test.

  • Conclusion: the p-value \(= 1.124\times 10^{-10} < 0.1\), so at significance level \(\alpha=0.1\) we reject \(H_0: \beta_1=\beta_2=\beta_3=0\) and conclude that at least one of the predictors is needed.

  4. Give Bonferroni corrected simultaneous 90% confidence intervals for \(\beta_1\), \(\beta_2\), \(\beta_3\). (2 points)
confint(Y.lm,c("X1","X2","X3"),level = 1 - .1/3)
##        1.67 %   98.33 %
## X1 0.34989147 1.8563693
## X2 0.23670797 0.4063314
## X3 0.60676766 1.9711141
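Equivalently, each interval is estimate \(\pm\ t\cdot\)SE with per-interval two-sided level \(1 - 0.1/3\) and 20 residual degrees of freedom; a sketch of the same computation by hand:
est = coef(summary(Y.lm))[c("X1","X2","X3"), "Estimate"]
se = coef(summary(Y.lm))[c("X1","X2","X3"), "Std. Error"]
tcrit = qt(1 - 0.1/6, df = 20)  # Bonferroni-corrected critical value
cbind(lower = est - tcrit * se, upper = est + tcrit * se)  # matches confint above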
  5. What is the \(R^2\) of the model? How is the \(R^2\) interpreted? What is the adjusted \(R^2\)? (2 points)
R_squared <- summary(Y.lm)$r.squared 
R_squared
## [1] 0.91085657
  • Approximately 91% of the variation in Y can be accounted for by these three predictors.
Adjusted_R_squared <- summary(Y.lm)$adj.r.squared
Adjusted_R_squared
## [1] 0.89748505
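As a sanity check (a small sketch), the adjusted \(R^2\) can be recomputed from \(R^2\) via its definition, with \(n = 24\) observations and \(p = 3\) predictors:
n = nrow(data1); p = 3
1 - (1 - R_squared) * (n - 1) / (n - p - 1)  # matches 0.89748505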
  • The adjusted \(R^2\) is 0.897; unlike \(R^2\), it is penalized for the number of predictors, so it increases only when an added variable improves the fit more than chance would predict.
  6. The researcher wishes to find confidence interval estimates at certain levels of the \(\tt X\) variables found in http://stats191.stanford.edu/data/salary_levels.table. Construct Bonferroni corrected simultaneous 95% confidence intervals at each of the columns of the above table. (2 points)
data2 = read.table("http://stats191.stanford.edu/data/salary_levels.table",header = TRUE)
head(data2)
##    L1 L2 L3 L4 L5
## X1  5  7  3  5  4
## X2  6  3  2  5  7
## X3  4  2  3  6  6
L1_hat = predict(Y.lm,list(X1=data2[1,1],X2=data2[2,1],X3=data2[3,1]),interval = 'prediction',level = 1-0.05/3)
L1_hat
##         fit       lwr       upr
## 1 30.447464 25.317974 35.576954
L2_hat = predict(Y.lm,list(X1=data2[1,2],X2=data2[2,2],X3=data2[3,2]),interval = 'prediction',level = 1-0.05/3) 
L2_hat
##         fit       lwr       upr
## 1 29.111284 22.704834 35.517734
L3_hat = predict(Y.lm,list(X1=data2[1,3],X2=data2[2,3],X3=data2[3,3]),interval = 'prediction',level = 1-0.05/3)
L3_hat
##         fit       lwr       upr
## 1 25.666184 20.280119 31.052249
L4_hat = predict(Y.lm,list(X1=data2[1,4],X2=data2[2,4],X3=data2[3,4]),interval = 'prediction',level = 1-0.05/3) 
L4_hat
##         fit      lwr       upr
## 1 32.703826 27.68524 37.722413
L5_hat = predict(Y.lm,list(X1=data2[1,5],X2=data2[2,5],X3=data2[3,5]),interval = 'prediction',level = 1-0.05/3)
L5_hat
##         fit      lwr       upr
## 1 32.243735 27.29005 37.197421
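Note that with five columns, a Bonferroni correction across all five intervals uses per-interval level \(1 - 0.05/5 = 0.99\) rather than \(1 - 0.05/3\), so the intervals above are slightly too narrow for simultaneous 95% coverage. A compact sketch of the corrected computation, done in one call:
newX = as.data.frame(t(data2))  # rows L1..L5 become observations with columns X1, X2, X3
predict(Y.lm, newdata = newX, interval = "prediction", level = 1 - 0.05/5)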

Question 12

The dataset \(\tt iris\) in R gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.

data(iris)
  1. Fit a multiple linear regression model to the data with sepal length as the dependent variable and sepal width, petal length and petal width as the independent variables. (2 points)
iris.lm = lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width,data = iris)
summary(iris.lm)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.828164 -0.219894  0.018748  0.197093  0.845696 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   1.855997   0.250777  7.4010 9.854e-12 ***
## Sepal.Width   0.650837   0.066647  9.7654 < 2.2e-16 ***
## Petal.Length  0.709132   0.056719 12.5025 < 2.2e-16 ***
## Petal.Width  -0.556483   0.127548 -4.3629 2.413e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.31455 on 146 degrees of freedom
## Multiple R-squared:  0.85861,    Adjusted R-squared:  0.85571 
## F-statistic: 295.54 on 3 and 146 DF,  p-value: < 2.22e-16
  2. Test the reduced model of \(H_0 : \beta_{\rm sepalwidth} = \beta_{\rm petallength} = 0\) with an F-test at level \(\alpha = 0.05\). (2 points)
iris.lm.reduced = lm(Sepal.Length~Petal.Width,data = iris)
anova(iris.lm.reduced,iris.lm)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ Petal.Width
## Model 2: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
##   Res.Df     RSS Df Sum of Sq       F     Pr(>F)    
## 1    148 33.8149                                    
## 2    146 14.4454  2   19.3695 97.8839 < 2.22e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since \(F = 97.884\) and \(\Pr(>F) < 2.2\times 10^{-16} < 0.05\), we reject \(H_0 : \beta_{\rm sepalwidth} = \beta_{\rm petallength} = 0\).

  3. Test \(H_0 : \beta_{\rm sepalwidth} = \beta_{\rm petallength}\) at level \(\alpha = 0.05\). (2 points)
# Under H0 the two slopes are equal, so the model depends on the two
# predictors only through their sum
iris$syn = iris$Sepal.Width + iris$Petal.Length
equal.lm = lm(Sepal.Length ~ syn + Petal.Width, data = iris)
anova(equal.lm,iris.lm)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ syn + Petal.Width
## Model 2: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
##   Res.Df     RSS Df Sum of Sq       F  Pr(>F)
## 1    147 14.5080                             
## 2    146 14.4454  1  0.062559 0.63229 0.42781

Since \(F = 0.6323\) and \(\Pr(>F) = 0.4278 > 0.05\), we fail to reject \(H_0 : \beta_{\rm sepalwidth} = \beta_{\rm petallength}\).

  4. Test \(H_0 : \beta_{\rm sepalwidth} < \beta_{\rm petallength}\) at level \(\alpha = 0.05\). (2 points)
SSE.R = sum(resid(equal.lm)^2)  # RSS of the reduced (equal-slopes) model
SSE.F = sum(resid(iris.lm)^2)   # RSS of the full model
df.num = equal.lm$df.residual - iris.lm$df.residual  # 1 constraint
df.den = iris.lm$df.residual    # residual df of the full model (146)
Fv = ((SSE.R - SSE.F) / df.num) / (SSE.F / df.den)  # F statistic for the constraint
cutoff = qf(0.95, df.num, df.den)  # 5% critical value
Fv
## [1] 0.63228527
cutoff
## [1] 3.9059421

Since \(F = 0.6323 < 3.906 = F_{0.95;1,146}\), the equality constraint is not rejected; moreover, \(\hat\beta_{\rm sepalwidth} - \hat\beta_{\rm petallength} = 0.6508 - 0.7091 < 0\) lies in the direction of \(H_0\), so we fail to reject \(H_0 : \beta_{\rm sepalwidth} < \beta_{\rm petallength}\).
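Because the hypothesis involves a single linear combination, \(F = t^2\), and a genuinely one-sided check (a sketch) takes the sign of \(t\) from the estimated difference:
b = coef(iris.lm)
t_stat = sign(b["Sepal.Width"] - b["Petal.Length"]) * sqrt(Fv)  # approximately -0.795
pt(t_stat, df.den)  # one-sided p-value for the alternative beta_sw < beta_pl, about 0.21
Since this one-sided p-value is well above 0.05, there is no significant evidence that the two coefficients differ in either direction, consistent with the conclusion above.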