library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
copier <- read.delim("~/Desktop/UVA/MSDS/STAT 6021/Day 1 - R Basics/copier.txt")

Part A.)

The response (y variable) is the total number of minutes spent on the service call, and the predictor (x variable) is the number of copiers serviced.

Part B.)

There is a strong positive linear relationship: the more copiers that need to be serviced, the more minutes the service call takes.

copier %>%
  ggplot(aes(x = Serviced, y = Minutes)) +
  geom_point(alpha = 0.5) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x = "Copiers Serviced", y = "Minutes to Service", title = "Copiers")

Part C, D.)

There is a strong positive correlation (about 0.979) between the number of units serviced and the total minutes to service.

The correlation can be interpreted reliably because the scatter plot shows a linear relationship with no clear outliers. A formal hypothesis test is still needed to confirm that the linear relationship is statistically significant.

cor(copier$Serviced, copier$Minutes)
## [1] 0.978517
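
As a quick consistency check, in simple linear regression with one predictor the squared correlation equals R^2. The sketch below uses only the rounded correlation printed above, so the result is approximate; the names `r` and `r_squared` are introduced here for illustration.

```r
# Correlation as printed by cor(copier$Serviced, copier$Minutes)
r <- 0.978517

# In simple linear regression, R^2 is the square of the correlation
r_squared <- r^2
round(r_squared, 4)
```

The result rounds to 0.9575, which agrees with the Multiple R-squared reported by summary() in Part E.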

Part E.)

Beta 0: -0.5802 Beta 1: 15.0352 R^2: 0.9575 Sigma^2: 79.46 (the residual standard error, 8.914, squared)

Copier_results <- lm(Minutes~Serviced, data=copier)
summary(Copier_results)
## 
## Call:
## lm(formula = Minutes ~ Serviced, data = copier)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7723  -3.7371   0.3334   6.3334  15.4039 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5802     2.8039  -0.207    0.837    
## Serviced     15.0352     0.4831  31.123   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.914 on 43 degrees of freedom
## Multiple R-squared:  0.9575, Adjusted R-squared:  0.9565 
## F-statistic: 968.7 on 1 and 43 DF,  p-value: < 2.2e-16
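
The summary reports the residual standard error, which estimates sigma rather than sigma^2. The sketch below squares it, cross-checks the result against the residual sum of squares divided by its degrees of freedom (values from the ANOVA table in Part G), and reproduces the slope's t value. All inputs are typed in from the rounded printed output, so the numbers agree only up to rounding; the variable names are illustrative.

```r
# Values copied from the printed summary() and anova() output (rounded)
sigma_hat <- 8.914      # residual standard error
sse       <- 3416       # residual sum of squares from the ANOVA table
df_resid  <- 43         # residual degrees of freedom

sigma2_hat <- sigma_hat^2        # estimate of sigma^2, about 79.46
mse        <- sse / df_resid     # same quantity from the ANOVA table, about 79.44

# t value for the slope is estimate / standard error
t_slope <- 15.0352 / 0.4831      # about 31.12

c(sigma2_hat = sigma2_hat, mse = mse, t_slope = t_slope)
```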

Part F.)

Beta 0: -0.5802 Beta 1: 15.0352

The slope means that each additional copier serviced adds about 15 minutes to the predicted service time. For example, a call with 1 copier is predicted to take about 14 and a half minutes (-0.5802 + 15.0352 = 14.455). The intercept means that a call with no copiers to service is predicted to take about negative half of a minute. You would expect such a call to take no time at all, and a negative time is impossible, so the intercept does not make sense in context. It is also not significantly different from zero (p-value 0.837 in the summary above), so the negative sign is best read as estimation noise rather than a real effect.

Part G.)

Ho: Beta 1 = 0 (the number of copiers serviced is not a linear predictor of the total number of minutes to service)

Ha: Beta 1 ≠ 0 (a linear relationship exists between the number of copiers serviced and the total number of minutes to service)

F Statistic: 968.66

Critical F Value: 4.067 at an alpha of .05 with 1 and 43 degrees of freedom. Because the F statistic is far larger than the critical F value, we reject the null hypothesis. Moreover, the p-value (< 2.2e-16) is far below .05, giving us further evidence to reject the null hypothesis.

anova.Copier <- anova(Copier_results)
anova.Copier
## Analysis of Variance Table
## 
## Response: Minutes
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Serviced   1  76960   76960  968.66 < 2.2e-16 ***
## Residuals 43   3416      79                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qf(0.05, df1 = 1, df2 = 43, lower.tail = FALSE)
## [1] 4.067047
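
To make the test explicit, the F statistic can be rebuilt from the sums of squares in the ANOVA table and compared with the critical value. This is a sketch using the rounded table entries, so the computed statistic (about 968.8) differs slightly from the printed 968.66; the variable names are illustrative.

```r
# Values copied from the printed ANOVA table (rounded)
ss_reg   <- 76960   # regression sum of squares, 1 df
ss_res   <- 3416    # residual sum of squares, 43 df
df_reg   <- 1
df_resid <- 43

# F statistic = mean square regression / mean square residual
f_stat <- (ss_reg / df_reg) / (ss_res / df_resid)

# Critical value at alpha = 0.05 and the p-value from the F distribution
f_crit  <- qf(0.95, df1 = df_reg, df2 = df_resid)
p_value <- pf(f_stat, df1 = df_reg, df2 = df_resid, lower.tail = FALSE)

f_stat > f_crit   # TRUE, so the null hypothesis is rejected
```

Note that `qf(0.95, ...)` with the default lower tail returns the same critical value as `qf(0.05, ..., lower.tail = FALSE)`.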

Part H.)

newdata2 <- data.frame(Serviced = 5)
predict(Copier_results, newdata2, level = 0.95, interval = "confidence")
##        fit      lwr      upr
## 1 74.59608 71.91422 77.27794
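
Because interval = "confidence" was requested, this interval is for the mean service time: we can be 95% confident that the average time for calls with 5 copiers is between about 71.9 and 77.3 minutes. As a rough check, the sketch below reproduces the fitted value from the rounded coefficients and backs out the margin of error and the implied standard error from the printed bounds; the variable names are illustrative.

```r
# Coefficients copied from the summary() output (rounded)
b0 <- -0.5802
b1 <- 15.0352

# Estimated mean service time for a call with 5 copiers
fit_by_hand <- b0 + b1 * 5       # 74.5958

# Margin of error implied by the printed interval
lwr <- 71.91422
upr <- 77.27794
margin <- (upr - lwr) / 2        # about 2.68 minutes

# Standard error of the estimated mean implied by the margin (t with 43 df)
se_fit <- margin / qt(0.975, df = 43)   # about 1.33

c(fit_by_hand = fit_by_hand, margin = margin, se_fit = se_fit)
```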

Part I.)

What is the value of the residual for the first observation? Interpret this value contextually.

Residual = actual value of y - predicted value of y

The first observation in copier has Serviced = 2 and Minutes = 20.

Beta 0: -0.5802 Beta 1: 15.0352

Predicted y = -0.5802 + 15.0352(2) = 29.49

Residual = 20 - 29.49 = -9.49

The predicted value is greater than the actual value, so the residual is negative: the model over-predicts the first data point. In context, servicing these 2 copiers took about 9.5 minutes less than the model expected.
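
The residual arithmetic can be sketched in R using the coefficients printed in the summary() output (slope 15.0352) and the first observation stated above. The variable names are illustrative, and the inputs are rounded, so the result is approximate.

```r
# Coefficients copied from the summary() output (rounded)
b0 <- -0.5802
b1 <- 15.0352

# First observation, as stated above: 2 copiers serviced, 20 minutes
serviced_1 <- 2
minutes_1  <- 20

predicted_1 <- b0 + b1 * serviced_1      # 29.4902
residual_1  <- minutes_1 - predicted_1   # -9.4902

residual_1
```

With the fitted model still in the session, `residuals(Copier_results)[1]` should return the same value up to rounding.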