Question 1 part 1, 2
mstr = 5291
sse = 181289
xdf = 1
resdf = 48
sstr = mstr*xdf
F_value = (sstr/xdf)/(sse/resdf)
mse = mstr/F_value
P_value = 1 - pf(F_value, 1, 48)
print(paste("sstr:", sstr, "f-value:", F_value, "mse:", mse, "p value: ", P_value))
## [1] "sstr: 5291 f-value: 1.40090132330147 mse: 3776.85416666667 p value: 0.242401401487488"
print("Since the p value, 0.242 is greater than 0.05, we fail to reject the null hypothesis that B1 = 0")
## [1] "Since the p value, 0.242 is greater than 0.05, we fail to reject the null hypothesis that B1 = 0"
Part 3
one_sided_p_value = P_value/2
print("Since 0.1212 > 0.05, fail to reject the null hypothesis that B1 <= 0")
## [1] "Since 0.1212 > 0.05, fail to reject the null hypothesis that B1 <= 0"
Part 4, 5
var <- sstr/(sse + sstr)
print(var)
## [1] 0.02835781
print("The R^2 value will still be 0.0284. Switching the response and predictor variables has no effect
on the R^2 value.")
## [1] "The R^2 value will still be 0.0284. Switching the response and predictor variables has no effect\n on the R^2 value."
Question 2 Part 1
sstr = 2216338
sse = 8913083
res_df = 62
# r = 4 because there are 4 possible levels/outcomes
reg_df = 3
mstr = sstr/reg_df
mse = sse/res_df
f_val = mstr/mse
pval = 1 - pf(f_val, 3, 62)
print(paste("Regression df: ", reg_df, "mstr:", mstr, "mse:", mse, "f-value:", f_val, "p value", pval))
## [1] "Regression df: 3 mstr: 738779.333333333 mse: 143759.403225806 f-value: 5.13899833162854 p value 0.00307787648117097"
# t-value = estimate/standard error
int_t = 3316.4/937.7
edu_std = 850/3.646
exp_t = 932.4/260.1
stem_est = 1.675*330.1
print(paste("intercept t-value:", int_t, "education standard error:", edu_std, "Experience t-value:", exp_t, "Stem estimate: ", stem_est))
## [1] "intercept t-value: 3.5367388290498 education standard error: 233.132199670872 Experience t-value: 3.58477508650519 Stem estimate: 552.9175"
int_pval = 2*(1-pt(int_t, 62))
edu_pval = 2*(1-pt(3.646, 62))
exp_pval = 2*(1-pt(exp_t, 62))
stem_pval = 2*(1-pt(1.675, 62))
print(paste(int_pval, edu_pval, exp_pval, stem_pval))
## [1] "0.000773321422863127 0.000546834555301245 0.000664503855203114 0.0989714644421769"
Part 2 Null hypothesis: The model would be best as a reduced model with no predictor variables. Alternative hypothesis: the model would be best as a full model with three predictor variables. I am using an f test to determine which type of model would be best. I am assuming that the errors have constant variance, are independent, and are normally distributed. Since the p-value using the f-value was previously calculated to be 0.0031, we can reject the null hypothesis that the model would be best reduced since 0.0031 < 0.05.
Part 3 Null hypothesis: There is not a positive linear correlation between salary and experience. Alternative hypothesis: There is a positive linear correlation between salary and experience. Since 0.00033 < 0.05, we can reject the null hypothesis that there is not a positive linear correlation between salary and experience.
p_val_exp = 1-pt(exp_t, 62)
p_val_exp
## [1] 0.0003322519
Part 4
stem = 1
education = 10
experience = 5
conf_level <- qt(1-.025, df = 62)
std = sqrt(mse)
y_hat = stem*552.9175 + education*850 + experience*932.4 + 3316.4
upper = y_hat+(conf_level*std/sqrt(66))
lower = y_hat-(conf_level*std/sqrt(66))
print(paste("The 95% confidence interval for the predicted salary is between", lower, upper))
## [1] "The 95% confidence interval for the predicted salary is between 16938.0237030239 17124.6112969761"
Part 5
stem = 0
education = 10
experience = 6
conf_level <- qt(1-.025, df = 62)
std = sqrt(mse)
y_hat = stem*552.9175 + education*850 + experience*932.4 + 3316.4
upper = y_hat+(conf_level*std/sqrt(66))
lower = y_hat-(conf_level*std/sqrt(66))
print(paste("The 95% confidence interval for the predicted salary is between", lower, upper))
## [1] "The 95% confidence interval for the predicted salary is between 17317.5062030239 17504.0937969761"
Question 3
cigarettes <- read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt",header=T)
head(cigarettes)
## State Age HS Income Black Female Price Sales
## 1 AL 27.0 41.3 2948 26.2 51.7 42.7 89.8
## 2 AK 22.9 66.7 4644 3.0 45.7 41.8 121.3
## 3 AZ 26.3 58.1 3665 3.0 50.8 38.5 115.2
## 4 AR 29.1 39.9 2878 18.3 51.5 38.8 100.3
## 5 CA 28.1 62.6 4493 7.0 50.8 39.7 123.0
## 6 CO 26.2 63.9 3855 3.0 50.7 31.1 124.8
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cigarettes = cigarettes %>% select(-State)
Part 1 I will be conducting a t test. Null hypothesis: The predictor Female has no significant effect on sales. Alternative hypothesis: The predictor female has a significant effect on sales. Since the Pr(>|t|) = 0.85071 which is greater than 0.05, we fail to reject the null that female has no significant effect on sales.
model <- lm(data = cigarettes, Sales ~ .)
summary(model)
##
## Call:
## lm(formula = Sales ~ ., data = cigarettes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.398 -12.388 -5.367 6.270 133.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.34485 245.60719 0.421 0.67597
## Age 4.52045 3.21977 1.404 0.16735
## HS -0.06159 0.81468 -0.076 0.94008
## Income 0.01895 0.01022 1.855 0.07036 .
## Black 0.35754 0.48722 0.734 0.46695
## Female -1.05286 5.56101 -0.189 0.85071
## Price -3.25492 1.03141 -3.156 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2282
## F-statistic: 3.464 on 6 and 44 DF, p-value: 0.006857
Part 2 Null hypothesis: The reduced model is better. Alternative hypothesis: The full model is better. I will be conducting an anova f test. Since Pr(>F) = 0.9789 which is greater than 0.05, we fail to reject the null that the reduced model is better.
reduced = lm(data = cigarettes, Sales ~ Age + Income + Black + Price)
full = lm(data = cigarettes, Sales ~ .)
anova(full, reduced)
## Analysis of Variance Table
##
## Model 1: Sales ~ Age + HS + Income + Black + Female + Price
## Model 2: Sales ~ Age + Income + Black + Price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44 34926
## 2 46 34960 -2 -33.799 0.0213 0.9789
Part 3
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -3.916439e+02 598.33360254
## Age -1.968565e+00 11.00946945
## HS -1.703475e+00 1.58030249
## Income -1.642517e-03 0.03953542
## Black -6.243909e-01 1.33946122
## Female -1.226033e+01 10.15461632
## Price -5.333583e+00 -1.17625412
Part 4 Before: Adjusted R^2 = 0.2282 After: Adjusted R^2 = 0.1864
reduced_model <- lm(data = cigarettes, Sales ~ Age + Female + Black + Price + HS)
summary(reduced_model)
##
## Call:
## lm(formula = Sales ~ Age + Female + Black + Price + HS, data = cigarettes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.414 -16.454 -5.746 8.541 133.319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.3245 250.0537 0.649 0.51954
## Age 7.3073 2.9238 2.499 0.01616 *
## Female -3.7815 5.5063 -0.687 0.49576
## Black 0.8447 0.4213 2.005 0.05101 .
## Price -2.8603 1.0362 -2.760 0.00832 **
## HS 0.9717 0.6103 1.592 0.11836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.93 on 45 degrees of freedom
## Multiple R-squared: 0.2678, Adjusted R-squared: 0.1864
## F-statistic: 3.291 on 5 and 45 DF, p-value: 0.01287
Question 4
math = read.table("http://people.math.binghamton.edu/qiao/data501/data/math-salaries.table", header = T)
head(math)
## Y X1 X2 X3
## 1 33.2 3.5 9 6.1
## 2 40.3 5.3 20 6.4
## 3 38.7 5.1 18 7.4
## 4 46.8 5.8 33 6.7
## 5 41.4 4.2 31 7.5
## 6 37.5 6.0 13 5.9
Part 1
library(GGally)
## Loading required package: ggplot2
ggpairs(math)
Part 2 & 3 & 5 y_hat = 17.84693 + 1.10313X1 + 0.32152X2 + 1.28894*X3
I am using an F-test. Null hypothesis: The predictors do not have a significant effect on Y. Alternative hypothesis: The predictors do have a significant effect on Y. Since the p-value is less than 0.10, we reject the null that the predictors do not have a significant effect on Y.
The R^2 is 0.9109. 91.09% of the variability in Y can be explained by X1, X2, and X3. The adjusted R^2 is 0.8975
math_model <- lm(data = math, Y ~ .)
summary(math_model)
##
## Call:
## lm(formula = Y ~ ., data = math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2463 -0.9593 0.0377 1.1995 3.3089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.84693 2.00188 8.915 2.10e-08 ***
## X1 1.10313 0.32957 3.347 0.003209 **
## X2 0.32152 0.03711 8.664 3.33e-08 ***
## X3 1.28894 0.29848 4.318 0.000334 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.753 on 20 degrees of freedom
## Multiple R-squared: 0.9109, Adjusted R-squared: 0.8975
## F-statistic: 68.12 on 3 and 20 DF, p-value: 1.124e-10
Part 4
confint(math_model, level = 0.966666)
## 1.67 % 98.33 %
## (Intercept) 13.2716685 22.422193
## X1 0.3498947 1.856366
## X2 0.2367083 0.406331
## X3 0.6067705 1.971111
Part 6
newdata = read.table("http://people.math.binghamton.edu/qiao/data501/data/salary_levels.table",
sep="\t",row.names=1,header=T)
newdata
## L1 L2 L3 L4 L5
## X1 5 7 3 5 4
## X2 6 3 2 5 7
## X3 4 2 3 6 6
newdata_t <- data.frame(t(newdata))
newdata_t
## X1 X2 X3
## L1 5 6 4
## L2 7 3 2
## L3 3 2 3
## L4 5 5 6
## L5 4 7 6
predict(math_model, newdata = newdata_t, interval = "confidence", level = .99)
## fit lwr upr
## L1 30.44746 27.93019 32.96474
## L2 29.11128 24.23181 33.99076
## L3 25.66618 22.57800 28.75437
## L4 32.70383 30.46736 34.94029
## L5 32.24374 30.18604 34.30143
Question 5
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Part 1
iris_model <- lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width)
summary(iris_model)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82816 -0.21989 0.01875 0.19709 0.84570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.85600 0.25078 7.401 9.85e-12 ***
## Sepal.Width 0.65084 0.06665 9.765 < 2e-16 ***
## Petal.Length 0.70913 0.05672 12.502 < 2e-16 ***
## Petal.Width -0.55648 0.12755 -4.363 2.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.8557
## F-statistic: 295.5 on 3 and 146 DF, p-value: < 2.2e-16
Part 2
iris_model_2 <- lm(data = iris, Sepal.Length ~ Petal.Width)
summary(iris_model_2)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38822 -0.29358 -0.04393 0.26429 1.34521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.77763 0.07293 65.51 <2e-16 ***
## Petal.Width 0.88858 0.05137 17.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
## F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16
Since Pr(>F) is less than 0.05, we reject the null that the reduced model is better.
anova(iris_model, iris_model_2)
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Model 2: Sepal.Length ~ Petal.Width
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 146 14.445
## 2 148 33.815 -2 -19.369 97.884 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Part 3 Since Pr(>F) > 0.05, we fail to reject the null that Bsepalwidth = Bpetallength
iris$sum <- iris$Sepal.Width + iris$Petal.Length
iris_model3 <- lm(data = iris, Sepal.Length ~ sum + Petal.Width)
summary(iris_model3)
##
## Call:
## lm(formula = Sepal.Length ~ sum + Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82040 -0.21854 0.02115 0.20234 0.83697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.76198 0.22087 7.977 3.88e-13 ***
## sum 0.68663 0.04909 13.987 < 2e-16 ***
## Petal.Width -0.49880 0.10478 -4.760 4.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3142 on 147 degrees of freedom
## Multiple R-squared: 0.858, Adjusted R-squared: 0.8561
## F-statistic: 444.1 on 2 and 147 DF, p-value: < 2.2e-16
anova(iris_model, iris_model3)
## Analysis of Variance Table
##
## Model 1: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Model 2: Sepal.Length ~ sum + Petal.Width
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 146 14.445
## 2 147 14.508 -1 -0.062559 0.6323 0.4278
Part 4 Since the p-value is greater than 0.05, we fail to reject the null that Bsepalwidth = Bpetallength
a <- c(0,1,-1,0)
beta.hat <- coef(iris_model)
V <- vcov(iris_model)
theta.hat <- sum(a * beta.hat)
se.theta <- sqrt( as.numeric( t(a) %*% V %*% a ) )
t.stat <- theta.hat / se.theta
df <- df.residual(iris_model)
crit <- qt(0.05, df)
p_val <- pt(t.stat, df)
p_val
## [1] 0.2139045