Question 1 Part 1, 2

mstr = 5291     # mean square for regression (MSTR)
sse = 181289    # residual sum of squares (SSE)
xdf = 1         # regression degrees of freedom
resdf = 48      # residual degrees of freedom

sstr = mstr*xdf   # SSTR = MSTR * regression df

mse = sse/resdf      # MSE = SSE / residual df
F_value = mstr/mse   # F = MSTR / MSE

P_value = 1 - pf(F_value, xdf, resdf)

print(paste("sstr:", sstr, "f-value:", F_value, "mse:", mse, "p value: ", P_value))
## [1] "sstr: 5291 f-value: 1.40090132330147 mse: 3776.85416666667 p value:  0.242401401487488"
print("Since the p value, 0.242 is greater than 0.05, we fail to reject the null hypothesis that B1 = 0")
## [1] "Since the p value, 0.242 is greater than 0.05, we fail to reject the null hypothesis that B1 = 0"

Part 3

# Halving the two-sided p-value is valid provided the slope estimate lies on the
# side of the one-sided alternative, B1 > 0
one_sided_p_value = P_value/2
print("Since 0.1212 > 0.05, we fail to reject the null hypothesis that B1 <= 0")
## [1] "Since 0.1212 > 0.05, we fail to reject the null hypothesis that B1 <= 0"

Part 4, 5

r_squared <- sstr/(sse + sstr)   # R^2 = SSTR / SST; renamed from var to avoid masking stats::var
print(r_squared)
## [1] 0.02835781
print("The R^2 value will still be 0.0284: in simple regression R^2 is the squared correlation between x and y, which is symmetric, so switching the response and predictor variables has no effect on it.")
## [1] "The R^2 value will still be 0.0284: in simple regression R^2 is the squared correlation between x and y, which is symmetric, so switching the response and predictor variables has no effect on it."

Question 2 Part 1

sstr = 2216338
sse = 8913083
res_df = 62

# r = 4 regression parameters (intercept + 3 predictors), so regression df = r - 1 = 3
reg_df = 3
mstr = sstr/reg_df
mse = sse/res_df

f_val = mstr/mse
pval = 1 - pf(f_val, 3, 62)
print(paste("Regression df: ", reg_df, "mstr:", mstr, "mse:", mse, "f-value:", f_val, "p value", pval))
## [1] "Regression df:  3 mstr: 738779.333333333 mse: 143759.403225806 f-value: 5.13899833162854 p value 0.00307787648117097"
# Fill in the missing table entries using t = estimate / standard error:
int_t = 3316.4/937.7     # intercept t-value
edu_std = 850/3.646      # education SE = estimate / t
exp_t = 932.4/260.1      # experience t-value
stem_est = 1.675*330.1   # stem estimate = t * SE
print(paste("intercept t-value:", int_t, "education standard error:", edu_std, "Experience t-value:", exp_t, "Stem estimate: ", stem_est))
## [1] "intercept t-value: 3.5367388290498 education standard error: 233.132199670872 Experience t-value: 3.58477508650519 Stem estimate:  552.9175"
int_pval = 2*(1-pt(int_t, 62))
edu_pval = 2*(1-pt(3.646, 62))
exp_pval = 2*(1-pt(exp_t, 62))
stem_pval = 2*(1-pt(1.675, 62))

print(paste(int_pval, edu_pval, exp_pval, stem_pval))
## [1] "0.000773321422863127 0.000546834555301245 0.000664503855203114 0.0989714644421769"

Part 2 Null hypothesis: all slope coefficients are zero, i.e. the intercept-only reduced model is adequate. Alternative hypothesis: at least one slope coefficient is nonzero, i.e. the full model with three predictors is needed. I am using an overall F-test, and I am assuming that the errors have constant variance, are independent, and are normally distributed. Since the p-value was calculated above to be 0.0031 < 0.05, we reject the null hypothesis that the reduced model is adequate.
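The same F-statistic also falls out of the full-versus-reduced formulation, since the intercept-only model's SSE equals SST = SSTR + SSE; a quick check using the values above:

sse_reduced = sstr + sse   # SSE of the intercept-only model is the total SS
f_check = ((sse_reduced - sse)/reg_df) / (sse/res_df)
f_check                    # identical to f_val = 5.139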

Part 3 Null hypothesis: there is no positive linear relationship between salary and experience (B_experience <= 0). Alternative hypothesis: there is a positive linear relationship (B_experience > 0). Since the one-sided p-value 0.00033 < 0.05, we reject the null hypothesis and conclude there is a positive linear relationship between salary and experience.

p_val_exp = 1-pt(exp_t, 62)
p_val_exp
## [1] 0.0003322519

Part 4

stem = 1
education = 10
experience = 5

t_crit <- qt(1-.025, df = 62)   # two-sided 95% critical value
std = sqrt(mse)                 # residual standard error

y_hat = stem*552.9175 + education*850 + experience*932.4 + 3316.4

# Approximate SE of the fitted mean as s/sqrt(n) with n = 66; the exact SE
# would require the design matrix, which is not given in the problem
upper = y_hat+(t_crit*std/sqrt(66))
lower = y_hat-(t_crit*std/sqrt(66))

print(paste("The 95% confidence interval for the predicted salary is between", lower, upper))
## [1] "The 95% confidence interval for the predicted salary is between 16938.0237030239 17124.6112969761"

Part 5

stem = 0
education = 10
experience = 6

t_crit <- qt(1-.025, df = 62)   # two-sided 95% critical value
std = sqrt(mse)

y_hat = stem*552.9175 + education*850 + experience*932.4 + 3316.4

upper = y_hat+(t_crit*std/sqrt(66))
lower = y_hat-(t_crit*std/sqrt(66))

print(paste("The 95% confidence interval for the predicted salary is between", lower, upper))
## [1] "The 95% confidence interval for the predicted salary is between 17317.5062030239 17504.0937969761"

Question 3

cigarettes <- read.table("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/P088.txt",header=T)
head(cigarettes)
##   State  Age   HS Income Black Female Price Sales
## 1    AL 27.0 41.3   2948  26.2   51.7  42.7  89.8
## 2    AK 22.9 66.7   4644   3.0   45.7  41.8 121.3
## 3    AZ 26.3 58.1   3665   3.0   50.8  38.5 115.2
## 4    AR 29.1 39.9   2878  18.3   51.5  38.8 100.3
## 5    CA 28.1 62.6   4493   7.0   50.8  39.7 123.0
## 6    CO 26.2 63.9   3855   3.0   50.7  31.1 124.8
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cigarettes = cigarettes %>% select(-State)   # drop the non-numeric State identifier

Part 1 I will be conducting a t-test. Null hypothesis: the predictor Female has no effect on Sales. Alternative hypothesis: Female has an effect on Sales. Since Pr(>|t|) = 0.85071 > 0.05, we fail to reject the null hypothesis that Female has no effect on Sales.

model <- lm(data = cigarettes, Sales ~ .)
summary(model)
## 
## Call:
## lm(formula = Sales ~ ., data = cigarettes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

Part 2 Null hypothesis: the reduced model (without HS and Female) is adequate. Alternative hypothesis: the full model is needed. I will be conducting a partial F-test with anova(). Since Pr(>F) = 0.9789 > 0.05, we fail to reject the null hypothesis that the reduced model is adequate.

reduced = lm(data = cigarettes, Sales ~ Age + Income + Black + Price)
full = lm(data = cigarettes, Sales ~ .)

anova(full, reduced)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Age + HS + Income + Black + Female + Price
## Model 2: Sales ~ Age + Income + Black + Price
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1     44 34926                           
## 2     46 34960 -2   -33.799 0.0213 0.9789
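The partial F-statistic can be checked by hand from the table's RSS entries:

((34960 - 34926)/2) / (34926/44)   # ~0.021, matching F = 0.0213 up to rounding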

Part 3

confint(model, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept) -3.916439e+02 598.33360254
## Age         -1.968565e+00  11.00946945
## HS          -1.703475e+00   1.58030249
## Income      -1.642517e-03   0.03953542
## Black       -6.243909e-01   1.33946122
## Female      -1.226033e+01  10.15461632
## Price       -5.333583e+00  -1.17625412
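Any one of these intervals can be reproduced from the coefficient table, e.g. Price:

# estimate ± t_{0.025, 44} * SE, using the summary() values for Price
-3.25492 + c(-1, 1) * qt(0.975, df = 44) * 1.03141   # reproduces (-5.334, -1.176)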

Part 4 Dropping Income lowers the adjusted R^2 from 0.2282 (full model) to 0.1864 (reduced model below).

reduced_model <- lm(data = cigarettes, Sales ~ Age + Female + Black + Price + HS)
summary(reduced_model)
## 
## Call:
## lm(formula = Sales ~ Age + Female + Black + Price + HS, data = cigarettes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.414 -16.454  -5.746   8.541 133.319 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 162.3245   250.0537   0.649  0.51954   
## Age           7.3073     2.9238   2.499  0.01616 * 
## Female       -3.7815     5.5063  -0.687  0.49576   
## Black         0.8447     0.4213   2.005  0.05101 . 
## Price        -2.8603     1.0362  -2.760  0.00832 **
## HS            0.9717     0.6103   1.592  0.11836   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.93 on 45 degrees of freedom
## Multiple R-squared:  0.2678, Adjusted R-squared:  0.1864 
## F-statistic: 3.291 on 5 and 45 DF,  p-value: 0.01287
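Both adjusted R^2 values follow from the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1) with n = 51 observations:

1 - (1 - 0.3208)*50/44   # full model, 6 predictors: ~0.2282
1 - (1 - 0.2678)*50/45   # reduced model, 5 predictors: ~0.1864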

Question 4

math = read.table("http://people.math.binghamton.edu/qiao/data501/data/math-salaries.table", header = T)
head(math)
##      Y  X1 X2  X3
## 1 33.2 3.5  9 6.1
## 2 40.3 5.3 20 6.4
## 3 38.7 5.1 18 7.4
## 4 46.8 5.8 33 6.7
## 5 41.4 4.2 31 7.5
## 6 37.5 6.0 13 5.9

Part 1

library(GGally)
## Loading required package: ggplot2
ggpairs(math)

Part 2 & 3 & 5 y_hat = 17.84693 + 1.10313X1 + 0.32152X2 + 1.28894*X3

I am using an overall F-test. Null hypothesis: the predictors have no effect on Y (all slope coefficients are zero). Alternative hypothesis: at least one predictor has an effect on Y. Since the p-value (1.124e-10) is less than 0.10, we reject the null hypothesis that the predictors have no effect on Y.

The R^2 is 0.9109: 91.09% of the variability in Y can be explained by X1, X2, and X3. The adjusted R^2 is 0.8975.

math_model <- lm(data = math, Y ~ .)
summary(math_model)
## 
## Call:
## lm(formula = Y ~ ., data = math)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2463 -0.9593  0.0377  1.1995  3.3089 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.84693    2.00188   8.915 2.10e-08 ***
## X1           1.10313    0.32957   3.347 0.003209 ** 
## X2           0.32152    0.03711   8.664 3.33e-08 ***
## X3           1.28894    0.29848   4.318 0.000334 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.753 on 20 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.8975 
## F-statistic: 68.12 on 3 and 20 DF,  p-value: 1.124e-10

Part 4

confint(math_model, level = 0.966666)
##                 1.67 %   98.33 %
## (Intercept) 13.2716685 22.422193
## X1           0.3498947  1.856366
## X2           0.2367083  0.406331
## X3           0.6067705  1.971111
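The level 0.966666 appears to be a Bonferroni correction for three simultaneous intervals at family confidence level 90% (each interval at level 1 - 0.10/3); under that reading, one entry can be checked by hand:

# Hand check for X1, assuming the Bonferroni interpretation above
alpha <- 1 - 0.966666
1.10313 + c(-1, 1) * qt(1 - alpha/2, df = 20) * 0.32957   # reproduces (0.350, 1.856)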

Part 6

newdata = read.table("http://people.math.binghamton.edu/qiao/data501/data/salary_levels.table",
sep="\t",row.names=1,header=T)
newdata
##    L1 L2 L3 L4 L5
## X1  5  7  3  5  4
## X2  6  3  2  5  7
## X3  4  2  3  6  6
newdata_t <- data.frame(t(newdata))
newdata_t
##    X1 X2 X3
## L1  5  6  4
## L2  7  3  2
## L3  3  2  3
## L4  5  5  6
## L5  4  7  6
predict(math_model, newdata = newdata_t, interval = "confidence", level = .99)
##         fit      lwr      upr
## L1 30.44746 27.93019 32.96474
## L2 29.11128 24.23181 33.99076
## L3 25.66618 22.57800 28.75437
## L4 32.70383 30.46736 34.94029
## L5 32.24374 30.18604 34.30143
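The level 0.99 here is plausibly another Bonferroni correction (five intervals at family level 95%, 1 - 0.05/5), though that is an assumption. The fitted values themselves can be verified from the coefficient estimates, e.g. for L1:

17.84693 + 1.10313*5 + 0.32152*6 + 1.28894*4   # reproduces fit = 30.44746 for L1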

Question 5

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Part 1

iris_model <- lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width)
summary(iris_model)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

Part 2

iris_model_2 <- lm(data = iris, Sepal.Length ~ Petal.Width)
summary(iris_model_2)
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38822 -0.29358 -0.04393  0.26429  1.34521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.77763    0.07293   65.51   <2e-16 ***
## Petal.Width  0.88858    0.05137   17.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.478 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668 
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16

Since Pr(>F) is less than 0.05, we reject the null that the reduced model is better.

anova(iris_model, iris_model_2)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Model 2: Sepal.Length ~ Petal.Width
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    146 14.445                                  
## 2    148 33.815 -2   -19.369 97.884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
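As a hand check, the partial F follows from the RSS entries in the table:

((33.815 - 14.445)/2) / (14.445/146)   # ~97.9, matching F = 97.884 up to rounding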

Part 3 Since Pr(>F) > 0.05, we fail to reject the null that Bsepalwidth = Bpetallength

iris$sum <- iris$Sepal.Width + iris$Petal.Length   # single predictor for the constrained (equal-slopes) model
iris_model3 <- lm(data = iris, Sepal.Length ~ sum + Petal.Width)
summary(iris_model3)
## 
## Call:
## lm(formula = Sepal.Length ~ sum + Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82040 -0.21854  0.02115  0.20234  0.83697 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.76198    0.22087   7.977 3.88e-13 ***
## sum          0.68663    0.04909  13.987  < 2e-16 ***
## Petal.Width -0.49880    0.10478  -4.760 4.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3142 on 147 degrees of freedom
## Multiple R-squared:  0.858,  Adjusted R-squared:  0.8561 
## F-statistic: 444.1 on 2 and 147 DF,  p-value: < 2.2e-16
anova(iris_model, iris_model3)
## Analysis of Variance Table
## 
## Model 1: Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
## Model 2: Sepal.Length ~ sum + Petal.Width
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    146 14.445                           
## 2    147 14.508 -1 -0.062559 0.6323 0.4278

Part 4 Here the contrast theta = B_Sepal.Width - B_Petal.Length is tested directly from the coefficient vector and its covariance matrix. The lower-tail p-value pt(t.stat, df) corresponds to the one-sided alternative B_Sepal.Width < B_Petal.Length; since 0.2139 > 0.05, we fail to reject the null hypothesis.

a <- c(0, 1, -1, 0)     # contrast selecting B_Sepal.Width - B_Petal.Length
beta.hat <- coef(iris_model)
V <- vcov(iris_model)   # estimated covariance matrix of the coefficients

theta.hat <- sum(a * beta.hat)                       # estimated contrast
se.theta <- sqrt( as.numeric( t(a) %*% V %*% a ) )   # SE of the contrast
t.stat <- theta.hat / se.theta
df <- df.residual(iris_model)
crit <- qt(0.05, df)      # lower-tail critical value for the one-sided test
p_val <- pt(t.stat, df)   # one-sided (lower-tail) p-value
p_val
## [1] 0.2139045
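This agrees with the anova() comparison in Part 3: doubling the one-sided p-value gives the two-sided one, and squaring the t-statistic recovers the partial F for this single restriction.

2 * p_val   # 0.4278, matching Pr(>F) in Part 3
t.stat^2    # 0.6323, the partial F statistic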