A courier service advertises that its average delivery time is less than 6 hours for local deliveries. A random sample of times for 12 deliveries to addresses across town was recorded. The data is given as 3.13, 6.33, 6.50, 5.22, 3.76, 6.71, 7.65, 4.82, 7.96, 4.54, 5.09, 6.46.
Construct a quantile-quantile plot to check if the data is reasonably normally distributed. Interpret the plot. (4 points).
library(tinytex)
library(lattice)
library(psych)
library(pastecs)
library(MASS)
library(qualityTools)
##
## Attaching package: 'qualityTools'
## The following object is masked from 'package:stats':
##
## sigma
Delivery.Time_courier = c(3.13, 6.33, 6.50, 5.22, 3.76, 6.71, 7.65, 4.82, 7.96, 4.54, 5.09, 6.46)
#qqnorm(Delivery.Time_courier)
#qqline(Delivery.Time_courier)
qqPlot(Delivery.Time_courier)
Ans: Based on the QQ-Plot we can infer that the data follows a normal distribution. The data points are close to the red line which indicates that they are close to the expected value and stay within the blue curved lines
Carry out a hypothesis testing procedure to check the normality assumption. Clearly state the null and alternative hypothesis and the final conclusion based on the test. (5 points)
Null Hypothesis: \(H_0\): The population is normal
Alternative Hypothesis: \(H_1\) The population is not normal
### Shapiro-Wilk normality test based on the order statistics (quantiles)
shapiro.test(Delivery.Time_courier)
##
## Shapiro-Wilk normality test
##
## data: Delivery.Time_courier
## W = 0.96445, p-value = 0.845
Ans: Since our p-value is 0.845, which is greater than 0.05, we fail to reject the Null Hypothesis that the distribution of Delivery Times is Normal.
Find a two sided 90% confidence interval for the average delivery time. Count the number of sample data points that are outside this interval. Explain (with proper justification) whether you believe that there is something wrong in the 90% confidence interval based on the proportion of data points outside the interval. (4+2 = 6 points)
t.test(Delivery.Time_courier, mu=0, alternative="two.sided", conf.level=0.90) # leave the mu = 0 part
##
## One Sample t-test
##
## data: Delivery.Time_courier
## t = 13.151, df = 11, p-value = 4.514e-08
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 4.905060 6.456607
## sample estimates:
## mean of x
## 5.680833
## supplemental, same value as above t.test
num.samples <- length(Delivery.Time_courier)
x_bar = mean(Delivery.Time_courier)
STD_Delivery.Time = sd(Delivery.Time_courier)
t_a.Delivery.Time <- qt(1-.1/2, num.samples-1)
LCL <- x_bar - t_a.Delivery.Time*STD_Delivery.Time/sqrt(num.samples)
UCL <- x_bar + t_a.Delivery.Time*STD_Delivery.Time/sqrt(num.samples)
c(LCL, UCL)
## [1] 4.905060 6.456607
orderedDelivery4Data <- Delivery.Time_courier[order(Delivery.Time_courier)]
#print(Delivery.Time_courier)
#print(orderedDelivery4Data)
Delivery.Points_outsideInterval <- c(3.13, 3.76, 4.54, 4.82, 6.46, 6.50, 6.71, 7.65, 7.96) # here we will consider 6.46 out of the range of 6.456607 because its technically larger prior to rounding
print(Delivery.Points_outsideInterval)
## [1] 3.13 3.76 4.54 4.82 6.46 6.50 6.71 7.65 7.96
length(Delivery.Points_outsideInterval)
## [1] 9
### "A p-value less than 0.05 means we we *reject the null hypothesis*, and accept the alternative hypothesis.
There are 9 sample data points outside of this 90% confidence interval. I believe that there is nothing wrong in the 90% confidence interval based on the proportion of data points outside the interval, given the size of our sample. Although much of our data lies outside of the confidence interval, because of the Law of Large Numbers we do not have enough data points to draw an accurate conclusion for the entire population as a whole. This makes sense because we are looking at a small data set of only 12 deliveries.
Null Hypothesis: \(H_0\): mean = 0 (exact opposite of alternative hypothesis)
Alternative Hypothesis: \(H_1\) mean ≠ 0 (alternative hypothesis: true mean is not equal to 0)
In this instance the p-value is basically 0, which is smaller than 0.05, so we reject the Null Hypothesis that the mean is zero. This makes sense because our sample of estimates reports a mean of 5.680833
Test a hypothesis if there is evidence in favor of the claim that the average delivery time is less than 6 hours for local deliveries. State the null and alternative hypothesis, rejection rule and the final conclusion in the context. (5 points)
Step 1: Parameter of interest: \(\mu\), mean delivery time
Step 2: Set up hypothesis
Null Hypothesis: \(H_0\): mu = 6
Alternative Hypothesis: \(H_1\) mu < 6
or \(H_0:\mu \geq 6 \ vs. H_1: \mu < 6\)
Rejection Rule: When p-value is less than 0.05, we reject \(H_0\) (reject the null hypothesis, and accept the alternative hypothesis)
Delivery.Time_Average <- 6
t.test(Delivery.Time_courier, mu=Delivery.Time_Average, alternative="less", conf.level=0.95)
##
## One Sample t-test
##
## data: Delivery.Time_courier
## t = -0.73886, df = 11, p-value = 0.2377
## alternative hypothesis: true mean is less than 6
## 95 percent confidence interval:
## -Inf 6.456607
## sample estimates:
## mean of x
## 5.680833
##### A small p-value (*typically ≤ 0.05*) indicates strong evidence against the null hypothesis, so you *reject the null hypothesis.*
##### A large p-value (*> 0.05*) indicates weak evidence against the null hypothesis, so you *fail to reject the null hypothesis.*
Ans: Since our p-value is 0.2377, which is greater than 0.05, we fail to reject the Null Hypothesis (average delivery time = 6 hours). This means that the Null Hypothesis is true and the average delivery time is 6 hours.
NOTE: When p-value is less than 0.05, we reject \(H_0\)
Find a two sided 95% prediction interval of the delivery time for the next item delivered by the courier service. (5 points)
#convert to data frame
data.frame_assoc_Delivery.Time <- data.frame(location = c(1,2,3,4,5,6,7,8,9,10,11,12), AvgTime = Delivery.Time_courier)
#now we can use lm.fit to call upon corresponding values within data set
lm.fit = lm(data.frame_assoc_Delivery.Time) #lm is the function
lm.fit
##
## Call:
## lm(formula = data.frame_assoc_Delivery.Time)
##
## Coefficients:
## (Intercept) AvgTime
## 3.0901 0.6003
#predict
predict(lm.fit, level = 0.95, interval = 'prediction', newdata = data.frame_assoc_Delivery.Time)
## fit lwr upr
## 1 4.968855 -4.503584 14.44129
## 2 6.889664 -1.670482 15.44981
## 3 6.991707 -1.607765 15.59118
## 4 6.223384 -2.303689 14.75046
## 5 5.347015 -3.714512 14.40854
## 6 7.117760 -1.542508 15.77803
## 7 7.681998 -1.407532 16.77153
## 8 5.983282 -2.627185 14.59375
## 9 7.868076 -1.415235 17.15139
## 10 5.815212 -2.882794 14.51322
## 11 6.145351 -2.403366 14.69407
## 12 6.967697 -1.621720 15.55711
Ans: With 95% confidence we can predict that deliveries will range from 4 to 15 hours. Time will vary depending on how far apart delivery locations are from their prior address (ex. depends on which stop you are after the truck departs vs being the first delivery from another truck).
The president of a company that manufactures car seats has been concerned about the number and cost of machine breakdowns. The problem is that the machines are old and becoming quite unreliable. However, the cost of replacing them is quite high. To help make a decision about replacement, she gathered data about last month costs (in dollars) for repairs and the ages (in months) of the plants 20 welding machines. The goal of the project is to study the relationship between the repair cost for machines and their ages, more specifically to predict repair cost from age. The data for this problem is posted on canvas under the data module with the file name “Age_Repair.csv”.
Fit a linear regression model with repair cost as the response and age as the regressor. Create a scatter plot of the data and plot the fitted linear regression line on top of it. Report the fitted file and interpret the estimate of the slope. (10 points)
#Important Notes: costs of repairs unit = $; ages unit = months; number of machines = 20
# Fit a linear regression model with **repair cost as the response and age as the regressor**
##repairs = response <=> weight (from mondal rmd)
##age = regressor <=> height (from mondal rmd)
data.Age_Repair = read.csv("Age_Repair.csv")
#str(data.Age_Repair)
lm.fit = lm(Repairs ~ Age, data=data.Age_Repair) #lm is the function
### Create a scatterplot of the data, plot the fitted regression line on top the scatter plot.
attach(data.Age_Repair) ##attach is important here
plot(Age, Repairs)
abline(lm.fit, col="blue") #plots the straight line
## Report the fitted file and interpret the estimate of the slope.
lm.fit
##
## Call:
## lm(formula = Repairs ~ Age, data = data.Age_Repair)
##
## Coefficients:
## (Intercept) Age
## 114.852 2.473
#lm.fit$fitted.values
Ans: The estimated value of the slope is \(b_1 = 2.473\). Interpretation: The average cost of repair increases by $2.473 for every month in operation.
The estimated value of the slope is \(b_0 = 114.852\).
Further analyzing this, the president needs to consider how many months from now the max/min value will occur between cost of monthly operation, and the income created by these 20 machines. Not only this, but she will also have to use further statistical inference to project the rate at which each machine will “die” or become inoperable, and will need to consider that certain machines may last longer than others (ex. manufacturer, status when purchased, prior number of repairs, speed of machine, energy consumption for output).
Test an hypothesis for the significance in the linear regression model. Clearly state the null and alternative hypothesis and the final conclusion in the context. (5 points).
summary(lm.fit)
##
## Call:
## lm(formula = Repairs ~ Age, data = data.Age_Repair)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.547 -22.660 -0.981 22.233 83.354
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 114.8525 58.6854 1.957 0.06603 .
## Age 2.4733 0.5106 4.844 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.32 on 18 degrees of freedom
## Multiple R-squared: 0.5659, Adjusted R-squared: 0.5417
## F-statistic: 23.46 on 1 and 18 DF, p-value: 0.0001304
Ans:
Null hypothesis \(H_0:\beta_1=0\)
Alternative hypothesis \(H_1: \beta_1 \neq 0\)
Since the p-value of the t-test (or F test) is 0.0001304, which is very close to 0, we reject \(H_0\) and conclude that the linear regression model is significant.
Find the 95% prediction interval for the repair cost of a 115 month old machine and explain the interpretation of this prediction interval. (4 points)
# Find the 95% prediction interval for the **repair cost of a 115 month old machine** and explain the interpretation of this prediction interval.
predict(lm.fit, data.frame(Age = 115), level = 0.95, interval = 'prediction')
## fit lwr upr
## 1 399.2865 306.0135 492.5595
Ans: With 95% confidence we can predict that the repair cost of a 115 month old machine is between 306.0135 & 492.5595 dollars
Is a simple linear regression (SLR) model adequate for these data? Use the figures from the residual analysis to justify your answer. To receive full credit you must assess at least three model assumptions. (6 points)
plot(lm.fit)
Ans:
From the residual vs fitted values plot: At first it may seem to follow Non-linear plot since it slopes downward through residuals=0 taking a negative residual value, then increases again to come out of a negative residual value back into the positive residual. Despite having this trait, most of our reiduals are distributed around zero. This confirms that the linear model is adequate and the error variance is constant.
(a) Satisfactory:
(b) Funnel: Variance not constant no funnel shape here
(c) Double bow: Variance not constant no drastic symmetry in the shape of the angled square (reflected triage) about zero
(d) Non-linear: Include non-linear terms in the model there are dips in the curve that form a slight upside-down-parabola shape where you have a decreasing slope (passing through the residuals = 0 dotted line), that continues decreasing while under the residuals = 0 line, and then increasing again past the residuals = 0 line. While this may seem to fit to appear “pseudo” Non-Linear, it is still closely centered around zero and there is not enough of a drastic upside-down-parabola shape to constitute a Non-linear plot.
Normal qqplot: we can see that the data closely follows a straight line, which confirms the normality assumption of the population. There appears to be a slight sinusoidal curve around the normal distribution line, (a trait that almost seems visible in the heavy tail theoretical quantile) but the values are still very close to the projected line with what appears to be very little outliers.
Square root of the residuals vs fitted value plot (scale-location) also confirms a constant variance assumption since the spread of the values match the shape of the red line are not drastically far from the red line when there is a change in the slope of the red line
Leverage plot indicates that there are no drastic outliers in the data set since most values seem to hover about the red line.
Questions we are trying to answer with residual analysis: useful graphs for residual analysis
1. Are the errors (at least approx.) normally distributed?
2. Do the errors have (approx.) equal variance?
3. Are the errors independent?
4. Is the relationship between x and y linear?
Useful Graphs for residual analysis
a. Normal probability plot of residuals
b. Residual plot against the predictor variable(s) x
c. Residual plot against the fitted values y_hat
i. for the SLR model this is equivalent to plotting against the predictor
d. Histogram / Boxplot of residuals
e. Residual plot against time sequence (if applicable)
anova(lm.fit)
## Analysis of Variance Table
##
## Response: Repairs
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 44024 44024 23.461 0.0001304 ***
## Residuals 18 33777 1876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(data.Age_Repair)
Thank you, Have a good summer!
Best,
Jared Kokinos
End of STAT 312R Final Exam Summer 2021 8/3