BMGT430 HW #1

Problem 24

library(readxl)  # provides read_excel()
COSTEST3 <- read_excel("COSTEST3.xlsx")
attach(COSTEST3)
NUMBER<-COSTEST3$NUMBER 

fit <- lm(COST~NUMBER)
summary(fit)

## 
## Call:
## lm(formula = COST ~ NUMBER)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3022 -2.3110  0.5253  1.8948  5.2685 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.3107     4.0830   6.934 5.82e-07 ***
## NUMBER        2.1549     0.1437  14.995 4.94e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.84 on 22 degrees of freedom
## Multiple R-squared:  0.9109, Adjusted R-squared:  0.9068 
## F-statistic: 224.9 on 1 and 22 DF,  p-value: 4.942e-13

anova_vals <- anova(fit)

# part (a) What is the estimated regression equation relating y to x?

# y = 28.3107 + 2.1549(x), where x=NUMBER and y=COST

# part (b) What percentage of the variation in y has been explained by the regression?

SSR_percentage = (anova_vals[1,2]/(anova_vals[1,2]+anova_vals[2,2])) * 100

# 91.09% of the variation in y has been explained by the regression.
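# Cross-check for part (b): a minimal sketch that recovers R squared from the F statistic and residual degrees of freedom reported in the summary output above, without needing the data file. It relies on the identity R^2 = F / (F + df_resid), which holds for a regression with a single predictor. The object names are illustrative only.

```r
# Values copied from the summary(fit) output for Problem 24
f_stat   <- 224.9   # F-statistic on 1 and 22 DF
df_resid <- 22      # residual degrees of freedom

# With one predictor, F = (SSR / 1) / (SSE / df_resid), so
# SSR / (SSR + SSE) = F / (F + df_resid)
r_squared_check <- f_stat / (f_stat + df_resid)
round(r_squared_check * 100, 2)
```

# The result agrees with the Multiple R-squared of 0.9109 in the summary output, up to rounding of the reported F statistic.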

# part (c) Are x and y linearly related? Construct a hypothesis test to answer this question and use a 5% level of significance. State the hypothesis to be tested, the decision rule, the test statistic and your decision. What conclusion can be drawn from the result of the test?

# The hypotheses: H0: B_1 = 0, Ha: B_1 != 0
# Decision Rule: Reject H0 if t < -2.073873 or t > 2.073873. Do not reject H0 if -2.073873 <= t <= 2.073873.
# Test Statistic: t = 14.995.
# Decision: The resulting t-value is greater than 2.073873 so the p-value is less than 0.05, meaning we will reject the null hypothesis.
# Conclusion: There is sufficient evidence to conclude that a linear relationship between x (NUMBER) and y (COST) does exist
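# The critical value and p-value behind part (c) can be reproduced from the reported t statistic alone. A minimal sketch, assuming the t value and degrees of freedom shown in the summary output; the object names are illustrative.

```r
t_stat   <- 14.995  # t value for NUMBER from summary(fit)
df_resid <- 22      # residual degrees of freedom
alpha    <- 0.05    # significance level

# Two-sided critical value: upper alpha/2 quantile of the t distribution
t_crit <- qt(1 - alpha / 2, df_resid)

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

# Decision: reject H0 when |t| exceeds the critical value
reject_h0 <- abs(t_stat) > t_crit
t_crit
p_value
reject_h0
```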

# part (d) Estimate the fixed cost involved in the production process. Find a point estimate and a 95% confidence interval estimate.

# Point Estimate for B_0: 28.3107

# 95% Confidence Interval for B_0: [19.84308,36.77832]
B_0_lwr_bound <- 28.3107 + qt(0.025,22)*4.0830
B_0_upr_bound <- 28.3107 - qt(0.025,22)*4.0830
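# The two bounds above follow one pattern: estimate -/+ (t critical value * standard error). A small helper makes the pattern reusable for other coefficients and confidence levels. Note that `coef_ci` is a hypothetical name, not a base R function. When the fitted model is available, confint(fit) returns the same kind of interval from the unrounded estimates.

```r
# Hypothetical helper: t-based confidence interval for one coefficient
coef_ci <- function(estimate, std_error, df, level = 0.95) {
  alpha  <- 1 - level
  t_crit <- qt(1 - alpha / 2, df)   # upper alpha/2 quantile
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Intercept from Problem 24: estimate 28.3107, standard error 4.0830, 22 df
b0_ci <- coef_ci(28.3107, 4.0830, df = 22)
round(b0_ci, 5)
```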


# part (e) Estimate the variable cost involved in the production process. Find a point estimate and a 95% confidence interval estimate.

# Point Estimate for B_1: 2.1549

# 95% Confidence Interval for B_1: [1.856884, 2.452916]
B_1_lwr_bound <- 2.1549 + qt(0.025,22)*0.1437
B_1_upr_bound <- 2.1549 - qt(0.025,22)*0.1437

Problem 25

INCONS3 <- read_excel("INCONS3.xlsx")
attach(INCONS3)

fit <- lm(CONS~INCOME)
anova_vals <- anova(fit)

# part (a) What is the estimated regression equation relating y to x?

# y = 2521.3789 + 0.8269(x), where x=INCOME (in dollars) and y=CONSUMPTION (in dollars)

# part (b) What percentage of the variation in y has been explained by the regression?

SSR_percentage = (anova_vals[1,2]/(anova_vals[1,2]+anova_vals[2,2])) * 100

# 93.12% of the variation in y has been explained by regression

# part (c) Construct a 90% confidence interval estimate of B_1

# 90% Confidence Interval for B_1: [0.698034, 0.955766]
B_1_lwr_bound <- 0.8269 + qt(0.05,10)*0.0711
B_1_upr_bound <- 0.8269 - qt(0.05,10)*0.0711

# part (d) Use a t test to test the hypothesis H0: B_1 = 0, Ha: B_1 != 0 at the 5% significance level. State the decision rule, the test statistic, and your decision. What conclusion can be drawn from the test?

qt(0.025,10)

## [1] -2.228139

# The hypotheses: H0: B_1 = 0, Ha: B_1 != 0
# Decision Rule: Reject H0 if t < -2.228139 or t > 2.228139. Do not reject H0 if -2.228139 <= t <= 2.228139.
# Test Statistic: t = 11.630.
# Decision: The resulting t-value is greater than 2.228139 so the p-value is less than 0.05. This means we will reject the null hypothesis.
# Conclusion: There is sufficient evidence to conclude that a linear relationship between x (INCOME) and y (CONSUMPTION) does exist

# part (e) Use an F test to test the hypothesis H0: B_1 = 0, Ha: B_1 != 0 at the 5% significance level. State the decision rule, the test statistic, and your decision.

# The F test rejects only in the upper tail, so the 5% critical value is the 0.95 quantile of the F distribution with 1 and 10 degrees of freedom.
qf(0.95,1,10)

## [1] 4.964603

# Decision Rule: Reject H0 if f > 4.964603. Do not reject H0 if f <= 4.964603.
# Test Statistic: f = 135.25
# Decision: The resulting f-value is greater than 4.964603 so the p-value is less than 0.05. This means we will reject the null hypothesis.

# part (f) Can the F test be used to test the hypothesis H0: B_1 <= 0 and Ha: B_1 > 0 

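# No. The F statistic is the square of the t statistic, so it ignores the sign of the estimated slope and can only test the two-sided alternative B_1 != 0. A one-sided hypothesis requires the t test.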
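# The point in part (f) can be checked numerically. In a regression with one predictor, squaring the slope's t statistic gives the F statistic, so a positive and a negative slope of the same size produce the same F. A minimal sketch using the t value reported in part (d); the one-sided critical value is what a t test of Ha: B_1 > 0 would use. Object names are illustrative.

```r
t_stat   <- 11.630  # t value for INCOME reported in part (d)
df_resid <- 10      # residual degrees of freedom

# Squaring removes the sign, so both directions give the same F
f_from_pos_t <- t_stat^2      # agrees with f = 135.25 up to rounding
f_from_neg_t <- (-t_stat)^2

# One-sided t test of H0: B_1 <= 0 against Ha: B_1 > 0 at the 5% level
t_crit_upper     <- qt(0.95, df_resid)
reject_one_sided <- t_stat > t_crit_upper
```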

# part (g) Test the hypothesis H0: B_1 = 1, Ha: B_1 != 1 at the 5% significance level. State the decision rule, the test statistic, and your decision. What conclusion can be drawn from the test?

t_star = (0.8269 - 1)/0.0711
qt(0.025,10)

## [1] -2.228139

# Decision Rule: Reject H0 if t < -2.228139 or t > 2.228139. Do not reject H0 if -2.228139 <= t <= 2.228139.
# Test Statistic: t = -2.434599.
# Decision: The resulting t-value is less than -2.228139 so the p-value is less than 0.05. This means we will reject the null hypothesis.
# Conclusion: There is sufficient evidence to conclude that B1 does not equal 1.
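# The decision in part (g) can also be read off a confidence interval: a two-sided test of H0: B_1 = 1 at the 5% level rejects exactly when 1 lies outside the 95% confidence interval for B_1. A minimal sketch using the rounded estimate and standard error from above; object names are illustrative.

```r
b1       <- 0.8269  # estimated slope
se_b1    <- 0.0711  # standard error of the slope
df_resid <- 10      # residual degrees of freedom
null_val <- 1       # hypothesized value of B_1

t_crit <- qt(0.975, df_resid)
ci_95  <- c(lower = b1 - t_crit * se_b1,
            upper = b1 + t_crit * se_b1)

# Reject H0 when the hypothesized value falls outside the interval
reject_by_ci <- null_val < ci_95[["lower"]] || null_val > ci_95[["upper"]]

# Equivalent decision from the t statistic
t_star      <- (b1 - null_val) / se_b1
reject_by_t <- abs(t_star) > t_crit
```

# Both checks give the same decision because they compare the same quantities, rearranged.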

Problem 26

APEX3 <- read_excel("APEX3.xlsx")
attach(APEX3)

## The following object is masked from COSTEST3:
## 
##     COST

fit <- lm(COST~MACHINE)
anova_vals <- anova(fit)

# part (a) What is the estimated regression equation relating y to x?

# y = 206.86512 + 4.17880(x), where x=MACHINE (in hours) and y=COST (in thousands of dollars)

# part (b) What percentage of the variation in y has been explained by regression?

SSR_percentage = (anova_vals[1,2]/(anova_vals[1,2]+anova_vals[2,2])) * 100

# 99.48% of the variation in y has been explained by the regression.

# part (c) Are x and y linearly related? Construct a hypothesis test to answer this question and use a 5% level of significance. State the hypothesis to be tested, the decision rule, the test statistic and your decision. What conclusion can be drawn from the result of the test?

qt(0.025,25)

## [1] -2.059539

# The hypotheses: H0: B_1 = 0, Ha: B_1 != 0
# Decision Rule: Reject H0 if t < -2.059539 or t > 2.059539. Do not reject H0 if -2.059539 <= t <= 2.059539.
# Test Statistic: t = 69.05.
# Decision: The resulting t-value is greater than 2.059539 so the p-value is less than 0.05, meaning we will reject the null hypothesis.
# Conclusion: There is sufficient evidence to conclude that a linear relationship between x (MACHINE) and y (COST) does exist

# part (d) Use the equation developed to estimate the average manufacturing cost in a month with 350 machine hours. Find a point estimate and a 95% confidence interval. How reliable do you believe this forecast might be?

# Point Estimate: y = 206.86512 + 4.17880(x) = 206.86512 + 4.17880(350 machine hours) = 1669.445 thousands of dollars

# Confidence Interval:[1659.398 thousands of dollars, 1679.492 thousands of dollars]
new <- data.frame(MACHINE = 350)
predict(fit,new,interval = 'confidence')

##        fit      lwr      upr
## 1 1669.445 1659.398 1679.492

# We are 95% confident that the average manufacturing cost in a month with 350 machine hours is between 1659.398 and 1679.492 thousands of dollars.

# part (e) Use the equation developed to estimate the average manufacturing cost in a month with 550 machine hours. Find a point estimate and a 95% confidence interval. How reliable do you believe this forecast might be?

# Point Estimate: y = 206.86512 + 4.17880(x) = 206.86512 + 4.17880(550 machine hours) = 2505.205 thousands of dollars

# Confidence Interval: [2473.923 thousands of dollars, 2536.486 thousands of dollars]
new <- data.frame(MACHINE = 550)
predict(fit,new,interval = 'confidence')

##        fit      lwr      upr
## 1 2505.204 2473.923 2536.486

# We are 95% confident that the average manufacturing cost in a month with 550 machine hours is between 2473.923 and 2536.486 thousands of dollars.
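# On the reliability question in parts (d) and (e): the width of each interval shows how precisely the mean cost is estimated. A rough sketch that backs out the implied standard error of the fitted mean from the two predict() outputs above, assuming 25 residual degrees of freedom as in part (c). The interval at 550 hours is about three times as wide as the one at 350 hours. That is what one would expect if 550 lies much farther from the average machine hours in the sample, which is an assumption here because the data are not shown.

```r
t_crit <- qt(0.975, 25)  # 95% two-sided critical value, 25 df

# Half-widths of the reported confidence intervals
half_width_350 <- (1679.492 - 1659.398) / 2
half_width_550 <- (2536.486 - 2473.923) / 2

# Interval = fit +/- t_crit * se(fit), so se(fit) = half-width / t_crit
se_fit_350 <- half_width_350 / t_crit
se_fit_550 <- half_width_550 / t_crit

width_ratio <- half_width_550 / half_width_350
round(c(se_fit_350 = se_fit_350, se_fit_550 = se_fit_550, ratio = width_ratio), 3)
```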

Problem 27

NEWCON3 <- read_excel("NEWCON3.xlsx")
attach(NEWCON3)
year_t = c(1:11)
fit <- lm(NEWCON~year_t)
anova_vals <- anova(fit)

# part (a) Fit a linear trend to these data. What is the resulting regression equation?

# y = 368.23 + 42.99(x), where x=year_t (1 for 1991 through 11 for 2001) and y=NEWCON (in billions of dollars)

# part (b) What percentage of the variation in y has been explained by regression?

SSR_percentage = (anova_vals[1,2]/(anova_vals[1,2]+anova_vals[2,2])) * 100

# 98.91% of the variation in y has been explained by the regression.

# part (c) Based on your answer in part b and on any other regression results you obtain, how well does the equation fit the data? Does a good fit ensure that forecasts for future years will be accurate?

# Based on part (b) and the other regression results, the equation fits the data very well: the percentage in part (b) is the R squared value, so 98.91% of the variation in the response (New Construction, in billions of dollars) is explained by its linear relationship with the year. However, a good fit does not ensure that forecasts for future years will be accurate, because R squared measures in-sample fit and tends to overstate the quality of out-of-sample predictions.

# part (d) Use the equation developed to predict new construction in both 2002 and 2003. Find a point prediction and a 95% prediction interval. 

# Point Prediction 2002 (t = 12): y = 368.23 + 42.99(x) = 368.23 + 42.99(12) = 884.11 billions of dollars

# Prediction Interval 2002: [841.6819 billions of dollars, 926.5945 billions of dollars]
new <- data.frame(year_t = 12)
predict(fit,new,interval = 'prediction')

##        fit      lwr      upr
## 1 884.1382 841.6819 926.5945

# Point Prediction 2003 (t = 13): y = 368.23 + 42.99(x) = 368.23 + 42.99(13) = 927.10 billions of dollars

# Prediction Interval 2003: [882.941 billions of dollars, 971.3208 billions of dollars]
new <- data.frame(year_t = 13)
predict(fit,new,interval = 'prediction')

##        fit     lwr      upr
## 1 927.1309 882.941 971.3208
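# Why the 2003 interval is wider than the 2002 interval: a prediction interval has half-width t * s * sqrt(1 + 1/n + (t0 - mean(t))^2 / Sxx), which grows as t0 moves away from the center of the sample. A minimal sketch, assuming the trend variable is exactly 1 through 11 as defined above. It backs out the residual standard error s from the 2002 interval and then reproduces the 2003 half-width. The names `trend_t` and `pi_factor` are hypothetical.

```r
trend_t <- 1:11                               # same values as year_t above
n       <- length(trend_t)
sxx     <- sum((trend_t - mean(trend_t))^2)   # sum of squared deviations
t_crit  <- qt(0.975, n - 2)                   # 95% two-sided, n - 2 df

# Hypothetical helper: the square-root factor in the prediction interval
pi_factor <- function(t0) sqrt(1 + 1 / n + (t0 - mean(trend_t))^2 / sxx)

# Back out s from the reported 2002 interval (t0 = 12)
half_2002 <- (926.5945 - 841.6819) / 2
s_implied <- half_2002 / (t_crit * pi_factor(12))

# Reproduce the 2003 half-width (t0 = 13) and compare with the reported one
half_2003          <- t_crit * s_implied * pi_factor(13)
half_2003_reported <- (971.3208 - 882.941) / 2
round(c(computed = half_2003, reported = half_2003_reported), 2)
```

# The two half-widths agree to within rounding (about 44.19), so the extra width in 2003 comes entirely from t0 lying one step farther from the sample mean.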

# part (e) How reliable do you believe the forecast in part d might be?

# I am cautious about treating these forecasts as reliable because the x values used are outside the range of those used to fit the linear trend model. The relationship that holds over 1991-2001 may differ from the one that holds from 2002 onward, which makes me question the reliability of these forecasts.

Problem 28

USPOP3 <- read_excel("USPOP3.xlsx")
attach(USPOP3)

## The following object is masked from NEWCON3:
## 
##     YEAR

year_t = c(1:70)
fit <- lm(POPULATION~year_t)
anova_vals <- anova(fit)

# part (a) Fit a linear trend to these data. What is the resulting regression equation?

# y = 108961085 + 2311504(x), where x=year_t (1 for 1930 through 70 for 1999) and y=POPULATION

# part (b) What percentage of the variation in y has been explained by regression?

SSR_percentage = (anova_vals[1,2]/(anova_vals[1,2]+anova_vals[2,2])) * 100

# 99.41% of the variation in y has been explained by the regression.

# part (c) Based on your answer in part b and on any other regression results you obtain, how well does the equation fit the data? Does a good fit ensure that forecasts for future years will be accurate?

# Based on part (b) and the other regression results, the equation fits the data very well: the percentage in part (b) is the R squared value, so 99.41% of the variation in the response (Population) is explained by its linear relationship with the year. However, a good fit does not ensure that forecasts for future years will be accurate, because R squared measures in-sample fit and tends to overstate the quality of out-of-sample predictions. Additionally, the good fit of this linear model does not imply causation between x and y. The relationship within this range of values could be caused by a third variable, or it could be that y is causing x.

# part (d) Use the equation developed to predict population in both 2000 and 2001. Find a point prediction and a 95% prediction interval.

# Point Prediction 2000 (t = 71): y = 108961085 + 2311504(x) = 108961085 + 2311504(71) = 273077869 people

# Prediction Interval 2000: [265569416 people, 280586304 people]
new <- data.frame(year_t = 71)
predict(fit,new,interval = 'prediction')

##         fit       lwr       upr
## 1 273077860 265569416 280586304

# Point Prediction 2001 (t = 72): y = 108961085 + 2311504(x) = 108961085 + 2311504(72) = 275389373 people

# Prediction Interval 2001: [267871988 people, 282906739 people]
new <- data.frame(year_t = 72)
predict(fit,new,interval = 'prediction')

##         fit       lwr       upr
## 1 275389364 267871988 282906739

# part (e) How reliable do you believe the forecast in part d might be?

# I am cautious about treating these forecasts as reliable because the x values used are outside the range of the explanatory variable used to fit the linear trend model. The relationship that holds over 1930-1999 may differ from the one that holds from 2000 onward, which makes me question the reliability of these forecasts. Factors that might affect the accuracy of this model include improved medicine that helps people live longer, a war that causes many untimely deaths in a short period, the outbreak of a new virus for which no vaccine exists, or changes in the cost of living that lead people to have more or fewer children depending on their income and financial stability.

Melanie Abel

2/17/2018