library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
copier <- read.delim("~/Desktop/UVA/MSDS/STAT 6021/Day 1 - R Basics/copier.txt")
The response (y variable) is the number of minutes spent on the service call, and the predictor (x variable) is the number of copiers serviced.
There is a strong positive relationship: the more copiers that need to be serviced, the more minutes the call takes.
copier %>%
  ggplot(aes(x = Serviced, y = Minutes)) +
  geom_point(alpha = 0.5) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "Copiers Serviced", y = "Minutes to Service", title = "Copiers")
There is a strong positive correlation between the number of units serviced and the total minutes to service.
The correlation can be interpreted reliably given the scatter plot: there are no clear outliers and the relationship looks linear. A formal statistical test is still needed to confirm that the linear relationship exists.
cor(copier$Serviced, copier$Minutes)
## [1] 0.978517
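In simple linear regression with a single predictor, the squared correlation should match the Multiple R-squared reported by the fitted model. A quick check, with the correlation copied from the output above rather than recomputed from the data:

```r
# Correlation copied from the cor() output above
r <- 0.978517

# With one predictor, r^2 equals the model's R-squared
r_squared_from_r <- r^2
round(r_squared_from_r, 4)
```

The rounded value can be compared with the Multiple R-squared line in the regression summary.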
Beta 0: -0.5802 Beta 1: 15.0352 R^2: 0.9575 Sigma^2: 79.46 (the residual standard error, 8.914, squared)
Copier_results <- lm(Minutes~Serviced, data=copier)
summary(Copier_results)
##
## Call:
## lm(formula = Minutes ~ Serviced, data = copier)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -22.7723  -3.7371   0.3334   6.3334  15.4039
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.5802     2.8039  -0.207    0.837
## Serviced     15.0352     0.4831  31.123   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.914 on 43 degrees of freedom
## Multiple R-squared: 0.9575, Adjusted R-squared: 0.9565
## F-statistic: 968.7 on 1 and 43 DF, p-value: < 2.2e-16
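The estimates in this printout can also be extracted from the fitted object instead of being read off by eye, which avoids transcription slips. A minimal sketch, assuming a small made-up data frame `toy` (hypothetical values, not the copier data) so that it runs on its own; the same calls should apply to `Copier_results`:

```r
# Hypothetical toy data for illustration only, NOT the copier data set
toy <- data.frame(
  Serviced = c(1, 2, 3, 4, 5),
  Minutes  = c(14, 31, 44, 62, 74)
)

toy_fit <- lm(Minutes ~ Serviced, data = toy)
toy_summary <- summary(toy_fit)

# Estimated intercept and slope
beta0_hat <- unname(coef(toy_fit)["(Intercept)"])
beta1_hat <- unname(coef(toy_fit)["Serviced"])

# R-squared and the residual standard error (sigma hat)
r_squared <- toy_summary$r.squared
sigma_hat <- toy_summary$sigma

# The estimate of sigma^2 is the residual standard error squared,
# which is the residual Mean Sq in the ANOVA table
sigma2_hat <- sigma_hat^2
mse <- anova(toy_fit)["Residuals", "Mean Sq"]
```

Note that `summary()` reports sigma hat, not sigma hat squared, so the value has to be squared to get the variance estimate.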
Beta 0: -0.5802 Beta 1: 15.0352
Given the intercept and slope above, servicing 1 copier is predicted to take about 14.5 minutes (-0.5802 + 15.0352), and each additional copier adds about 15 minutes on average. If no copiers were serviced, you would expect the call to take no time at all, not negative half a minute. The intercept is not statistically distinguishable from zero, though (p = 0.837), so the negative estimate is consistent with sampling variability rather than a real effect. Taken literally, the negative intercept does not make sense in context.
Ho: The number of copiers serviced is not a linear predictor of the total number of minutes to service (Beta 1 = 0)
Ha: A linear relationship exists between the number of copiers serviced and the number of minutes to service (Beta 1 ≠ 0)
F statistic: 968.66
Critical F value: 4.067 at alpha = 0.05 with 1 and 43 degrees of freedom. Because the F statistic is larger than the critical F value, we reject the null hypothesis. Moreover, the p-value is extremely small (< 2.2e-16), giving us further evidence to reject the null hypothesis.
anova.Copier <- anova(Copier_results)
anova.Copier
## Analysis of Variance Table
##
## Response: Minutes
##           Df Sum Sq Mean Sq F value    Pr(>F)
## Serviced   1  76960   76960  968.66 < 2.2e-16 ***
## Residuals 43   3416      79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(.05,df1 = 1, df2 = 43,lower.tail=FALSE)
## [1] 4.067047
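The decision rule for the F test can be written out explicitly. The sketch below copies the F and t statistics from the output above (rather than refitting the model) and also checks that, with one predictor, the F statistic is the square of the slope's t statistic up to rounding:

```r
# Values copied from the anova() and summary() output above
f_stat   <- 968.66
t_stat   <- 31.123
df_model <- 1
df_resid <- 43
alpha    <- 0.05

# Upper-tail critical value; same as qf(alpha, ..., lower.tail = FALSE)
f_crit <- qf(1 - alpha, df1 = df_model, df2 = df_resid)

# p-value: probability of an F at least this large under the null
p_value <- pf(f_stat, df1 = df_model, df2 = df_resid, lower.tail = FALSE)

# Reject the null when the observed F exceeds the critical value
reject_null <- f_stat > f_crit

# In simple linear regression, F equals t^2 for the slope
f_from_t <- t_stat^2
```

Here `f_from_t` differs slightly from 968.66 only because the printed t value is rounded.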
# 95% confidence interval for the mean service time when 5 copiers are serviced
newdata2 <- data.frame(Serviced = 5)
predict(Copier_results, newdata2, level = 0.95, interval = "confidence")
##        fit      lwr      upr
## 1 74.59608 71.91422 77.27794
What is the value of the residual for the first observation? Interpret this value contextually.
Residual = actual value of y - predicted value of y
The first observation in copier has Serviced = 2 and Minutes = 20
Beta 0: -0.5802 Beta 1: 15.0352
Predicted y = -0.5802 + 15.0352(2) = 29.49
Residual = 20 - 29.49 = -9.49
The predicted value is greater than the actual value, so the residual is negative: the model over-predicts the first data point, i.e. servicing 2 copiers took about 9.5 minutes less than expected.
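The residual arithmetic can be reproduced in a few lines. This sketch uses the coefficients from the `summary()` output (-0.5802 and 15.0352) and the first observation as stated above; the variable names are illustrative only:

```r
# Coefficients copied from the summary() output
beta0_hat <- -0.5802
beta1_hat <- 15.0352

# First observation, as stated in the text
serviced_1 <- 2
minutes_1  <- 20

# Fitted value and residual (actual minus predicted)
predicted_1 <- beta0_hat + beta1_hat * serviced_1
residual_1  <- minutes_1 - predicted_1

# With the fitted model in memory, residuals(Copier_results)[1]
# should give the same value up to rounding of the coefficients.
```

A negative `residual_1` means the fitted line sits above the observed point.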