You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one HTML file.
Please use the file name as: 20200160XX_Name_03.html.
Grading scheme: \(\left\{ 0,~ 1,~ 2~ \right\}\) points per question.
2: submitted on time and more or less correct answer
1: submitted on time and partially correct answer
0: submitted with a completely incorrect answer or late submission (any day after the due date for more than one homework assignment).
Send your answers to: 0205@cupk.edu.cn. Due date: 24:00 October 15, 2022 (Saturday evening).
The tables below show the regression output of a multiple regression model relating \(\tt Salary\), the beginning salaries in dollars of employees in a given company to the following predictor variables: \(\tt Education\), \(\tt Experience\) and a variable \(\tt STEM\) indicating whether or not they have an undergraduate degree in a \(\tt STEM\) field or not (0 or 1). (The units of both \(\tt Education\) and \(\tt Experience\) are years. You can assume an intercept model.)
(2 points)table1 = matrix(nrow = 2,ncol = 5)
table1[2,1] = 62
table1[,2] = c(2216338,8913083)
table1[1,1] = 3
options(digits=8)
table1[1,3] = table1[1,2]/table1[1,1]
table1[2,3] = table1[2,2]/62
table1[1,4] = table1[1,3]/table1[2,3]
p.value = 1 - pf(table1[1,4], table1[1,1], table1[2,1])
table1[1,5] = p.value
table1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3 2216338 738779.33 5.1389983 0.0030778765
## [2,] 62 8913083 143759.40 NA NA
table2 = matrix(nrow = 4,ncol = 4)
table2[,1] = c(3316.4,850.0,932.4,0)
table2[,2] = c(937.7,0,260.1,330.1)
table2[,3] = c(0,3.646,0,1.675)
table2[1,3] = table2[1,1]/table2[1,2]
table2[1,4] = 2 * pt(table2[1,3], 62, lower = FALSE)
table2[2,2] = table2[2,1]/table2[2,3]
table2[2,4] = 2 * pt(table2[2,3], 62, lower = FALSE)
table2[3,3] = table2[3,1]/table2[3,2]
table2[3,4] = 2 * pt(table2[3,3], 62, lower = FALSE)
table2[4,1] = table2[4,2]*table2[4,3]
table2[4,4] = 2 * pt(table2[4,3], 62, lower = FALSE)
table2
## [,1] [,2] [,3] [,4]
## [1,] 3316.4000 937.7000 3.5367388 0.00077332142
## [2,] 850.0000 233.1322 3.6460000 0.00054683456
## [3,] 932.4000 260.1000 3.5847751 0.00066450386
## [4,] 552.9175 330.1000 1.6750000 0.09897146444
(2 points) p_value = table1[1,5]
p_value
## [1] 0.0030778765
(2 points)p_value = pt(table2[3,3], 62, lower = FALSE)
p_value
## [1] 0.00033225193
(2 points)beta_hat = table2[,1]
a = c(1,10,5,1)
y_hat = a %*% beta_hat
sigma.hat.sq = table1[2,2]/table1[2,1]
se = sqrt(sigma.hat.sq + sum(a %*% a))
df = table1[2,1]+3
q = qt((1 - 0.95)/2,df,lower.tail=FALSE)
upper = y_hat + se * q
lower = y_hat - se * q
lower
## [,1]
## [1,] 16273.756
upper
## [,1]
## [1,] 17788.879
(2 points)beta_hat = table2[,1]
a = c(1,10,6,0)
y_hat = a %*% beta_hat
sigma.hat.sq = table1[2,2]/table1[2,1]
se = sqrt(sum(a %*% a))
df = table1[2,1]+3
q = qt((1 - 0.95)/2,df,lower.tail=FALSE)
upper = y_hat + se * q
lower = y_hat - se * q
lower
## [,1]
## [1,] 17387.424
upper
## [,1]
## [1,] 17434.176
This question is based on our textbook Page 86, Exercise 3.15. A national insurance organization wanted to study the consumption pattern of cigarettes in all 50 states and the District of Columbia. The variables chosen for the study are:
Age: Median age of a person living in a state.
HS: Percentage of people over 25 years of age in a state who had completed high school.
Income: Per capita personal income for a state (income in dollars).
Black: Percentage of blacks living in a state.
Female: Percentage of females living in a state.
Price: Weighted average price (in cents) of a pack of cigarettes in a state.
Sales: Number of packs of cigarettes sold in a state on a per capita basis.
The data can be found at http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt. [Hint: Use read.table() to read the data.]
In 1 - 2 below, specify the null and alternative hypotheses, the test used, and your conclusion using a 5% level of significance.
(2 points)cigarettes = read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt", header = TRUE)
head(cigarettes)
## State Age HS Income Black Female Price Sales
## 1 AL 27.0 41.3 2948 26.2 51.7 42.7 89.8
## 2 AK 22.9 66.7 4644 3.0 45.7 41.8 121.3
## 3 AZ 26.3 58.1 3665 3.0 50.8 38.5 115.2
## 4 AR 29.1 39.9 2878 18.3 51.5 38.8 100.3
## 5 CA 28.1 62.6 4493 7.0 50.8 39.7 123.0
## 6 CO 26.2 63.9 3855 3.0 50.7 31.1 124.8
\(Sales=\beta_0+\beta_1Age+\beta_2Hs+\beta_3Income+\beta_4Black+\beta_5Female+\beta_6Priec+\varepsilon\)
cigarettes.lm <- lm(Sales ~ Age + HS + Income + Black + Female + Price,data = cigarettes)
summary(cigarettes.lm)
##
## Call:
## lm(formula = Sales ~ Age + HS + Income + Black + Female + Price,
## data = cigarettes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.3982 -12.3876 -5.3672 6.2703 133.2135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.344846 245.607185 0.4208 0.675969
## Age 4.520452 3.219768 1.4040 0.167348
## HS -0.061586 0.814684 -0.0756 0.940084
## Income 0.018946 0.010216 1.8546 0.070364 .
## Black 0.357535 0.487219 0.7338 0.466946
## Female -1.052859 5.561008 -0.1893 0.850706
## Price -3.254918 1.031407 -3.1558 0.002886 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.174 on 44 degrees of freedom
## Multiple R-squared: 0.32084, Adjusted R-squared: 0.22823
## F-statistic: 3.4644 on 6 and 44 DF, p-value: 0.006857
(2 points)cigarettes.lm.reduced = lm(Sales ~ Age + Income + Black + Price,data = cigarettes)
anova(cigarettes.lm.reduced, cigarettes.lm)
## Analysis of Variance Table
##
## Model 1: Sales ~ Age + Income + Black + Price
## Model 2: Sales ~ Age + HS + Income + Black + Female + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 34959.8
## 2 44 34926.0 2 33.7986 0.02129 0.97895
(2 points)confint(cigarettes.lm, c("Income"), level=0.95)
## 2.5 % 97.5 %
## Income -0.0016425173 0.039535423
(2 points)cigarettes.lm2 <- lm(Sales ~ Age + HS + Black + Female + Price,data = cigarettes)
R_squared <- summary(cigarettes.lm2)$r.squared
R_squared
## [1] 0.26775262
(2 points)cigarettes.lm3 <- lm(Sales ~ Age + Income + Price,data = cigarettes)
R_squared3 <- summary(cigarettes.lm3)$r.squared
R_squared3
## [1] 0.30324338
(2 points)cigarettes.lm4 <- lm(Sales ~ Income ,data = cigarettes)
R_squared4 <- summary(cigarettes.lm4)$r.squared
R_squared4
## [1] 0.10632027
A researcher in a scientific foundation wished to evaluate the relation between intermediate and senior level annuals salaries of bachelor’s and master’s level mathematicians (\(\tt Y\), in thousand dollars) and an index of work quality (\(\tt X1\)), number of years of experience (\(\tt X2\)), and an index of publication success (\(\tt X3\)). The data for a sample of 24 bachelor’s and master’s level mathematicians can be found at http://www.stanford.edu/class/stats191/data/math-salaries.table. [Hint: Use read.table() to read the data.]
(2 points)data1 = read.table("http://www.stanford.edu/class/stats191/data/math-salaries.table",header = TRUE)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(data1, mapping= aes(color="darkred"),cardinality_threshold = 38)
cor(data1)
## Y X1 X2 X3
## Y 1.00000000 0.66709585 0.85855820 0.55819603
## X1 0.66709585 1.00000000 0.46695108 0.32276117
## X2 0.85855820 0.46695108 1.00000000 0.25375299
## X3 0.55819603 0.32276117 0.25375299 1.00000000
(2 points)\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3\)
Y.lm = lm(Y~X1+X2+X3,data = data1)
summary(Y.lm)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.24629 -0.95928 0.03772 1.19950 3.30886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.846931 2.001876 8.9151 2.103e-08 ***
## X1 1.103130 0.329573 3.3471 0.0032092 **
## X2 0.321520 0.037109 8.6643 3.327e-08 ***
## X3 1.288941 0.298479 4.3184 0.0003342 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.7528 on 20 degrees of freedom
## Multiple R-squared: 0.91086, Adjusted R-squared: 0.89749
## F-statistic: 68.119 on 3 and 20 DF, p-value: 1.124e-10
(2 points)Y_reduced.lm <- lm(Y ~ 1, data = data1)
anova(Y_reduced.lm,Y.lm)
## Analysis of Variance Table
##
## Model 1: Y ~ 1
## Model 2: Y ~ X1 + X2 + X3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 689.260
## 2 20 61.443 3 627.817 68.1192 1.124e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is \(H_0:\beta_1=\beta_2=\beta_3=0\). The alternative hypothesis is \(H_1: \beta_j≠0\),j=1,2,3.
test:F test
Conclusion:p−value=1.124e−10<0.1, as the significant level α=0.1, we reject \(H_0: \beta_1=\beta_2=\beta_3=0\) and conclude that some or all of the predictors are needed.
(2 points)confint(Y.lm,c("X1","X2","X3"),level = 1 - .1/3)
## 1.67 % 98.33 %
## X1 0.34989147 1.8563693
## X2 0.23670797 0.4063314
## X3 0.60676766 1.9711141
(2 points)R_squared <- summary(Y.lm)$r.squared
R_squared
## [1] 0.91085657
Adjusted_R_squared <- summary(Y.lm)$adj.r.squared
Adjusted_R_squared
## [1] 0.89748505
(2 points)data2 = read.table("http://stats191.stanford.edu/data/salary_levels.table",header = TRUE)
head(data2)
## L1 L2 L3 L4 L5
## X1 5 7 3 5 4
## X2 6 3 2 5 7
## X3 4 2 3 6 6
L1_hat = predict(Y.lm,list(X1=data2[1,1],X2=data2[2,1],X3=data2[3,1]),interval = 'prediction',level = 1-0.05/3)
L1_hat
## fit lwr upr
## 1 30.447464 25.317974 35.576954
L2_hat = predict(Y.lm,list(X1=data2[1,2],X2=data2[2,2],X3=data2[3,2]),interval = 'prediction',level = 1-0.05/3)
L2_hat
## fit lwr upr
## 1 29.111284 22.704834 35.517734
L3_hat = predict(Y.lm,list(X1=data2[1,3],X2=data2[2,3],X3=data2[3,3]),interval = 'prediction',level = 1-0.05/3)
L3_hat
## fit lwr upr
## 1 25.666184 20.280119 31.052249
L4_hat = predict(Y.lm,list(X1=data2[1,4],X2=data2[2,4],X3=data2[3,4]),interval = 'prediction',level = 1-0.05/3)
L4_hat
## fit lwr upr
## 1 32.703826 27.68524 37.722413
L5_hat = predict(Y.lm,list(X1=data2[1,5],X2=data2[2,5],X3=data2[3,5]),interval = 'prediction',level = 1-0.05/3)
L5_hat
## fit lwr upr
## 1 32.243735 27.29005 37.197421
The dataset \(\tt iris\) in \(\Re\) gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.
data(iris)
(2 points)iris.lm = lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width,data = iris)
summary(iris.lm)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.828164 -0.219894 0.018748 0.197093 0.845696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.855997 0.250777 7.4010 9.854e-12 ***
## Sepal.Width 0.650837 0.066647 9.7654 < 2.2e-16 ***
## Petal.Length 0.709132 0.056719 12.5025 < 2.2e-16 ***
## Petal.Width -0.556483 0.127548 -4.3629 2.413e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.31455 on 146 degrees of freedom
## Multiple R-squared: 0.85861, Adjusted R-squared: 0.85571
## F-statistic: 295.54 on 3 and 146 DF, p-value: < 2.22e-16
(2 points)iris.lm.reduced = lm(Sepal.Length~Petal.Width,data = iris)
anova(iris.lm.reduced,iris.lm)
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ Petal.Width
## Model 2: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 148 33.8149
## 2 146 14.4454 2 19.3695 97.8839 < 2.22e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F:97.884 Pr(>F)< 2.2e-16<0.05,refuse\(H_0:\beta\)sepalwidth=\(\beta\)petallength=0
(2 points)iris$syn = iris$Sepal.Width+iris$Petal.Length
equal.lm=lm(Sepal.Length~syn+Petal.Width,data = iris)
anova(equal.lm,iris.lm)
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ syn + Petal.Width
## Model 2: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 147 14.5080
## 2 146 14.4454 1 0.062559 0.63229 0.42781
F:0.6323 Pr(>F) = 0.4278>0.05,accept\(H_0:\beta\)sepalwidth=\(\beta\)petallength.
(2 points)SSE.R = sum(resid(equal.lm)^2)
SSE.F = sum(resid(iris.lm)^2)
df.num = equal.lm$df - iris.lm$df
df.den = iris.lm$df
Fv = ((SSE.R - SSE.F) / df.num) / (SSE.F / df.den)
cutoff = qf(0.95,df.num,df.den)
Fv
## [1] 0.63228527
cutoff
## [1] 3.9059421
0.6322853<3.905942,accept\(H_0:~ \beta\)sepalwidth<\(\beta\)petallength