# Load Libraries
library(ggplot2)
library(car)
library(sandwich)
library(survival)
library(estimability)
library(gplots)
library(dplyr)
library(gridExtra)
library(lsmeans)
The objective of this assignment is to analyse the batteries dataset.
batteries <- read.csv("Bateries.csv", sep = ";")
str(batteries)
'data.frame': 36 obs. of 3 variables:
$ Material: int 1 1 1 1 1 1 1 1 1 1 ...
$ Temp : int 15 15 15 15 70 70 70 70 125 125 ...
$ Duration: int 130 74 155 180 34 40 80 75 20 70 ...
summary(batteries)
Material Temp Duration
Min. :1 Min. : 15 Min. : 20.0
1st Qu.:1 1st Qu.: 15 1st Qu.: 70.0
Median :2 Median : 70 Median :108.0
Mean :2 Mean : 70 Mean :105.5
3rd Qu.:3 3rd Qu.:125 3rd Qu.:141.8
Max. :3 Max. :125 Max. :188.0
First it is important to get to know the data set and to know what we are dealing with. From doing a summary and str of the data set, we know that we are dealing with 3 different variables Material, Temperature and Duration.
We are dealing Material and Temperature (Temp) are factors, with 3 modalities each. Now we know it is possible to perform a Two-Way Anova. It is important to know if the dataset is balanced, so we can know if the groups would have to have wieghts or not.
batteries %>% group_by(Material) %>%
summarise(freq=n()) %>% as.data.frame()
Material freq
1 1 12
2 2 12
3 3 12
batteries %>% group_by(Material,Temp) %>%
summarise(freq=n()) %>% as.data.frame()
Material Temp freq
1 1 15 4
2 1 70 4
3 1 125 4
4 2 15 4
5 2 70 4
6 2 125 4
7 3 15 4
8 3 70 4
9 3 125 4
So we can see our data is balanced, and we have the same number of instances in each modality of each factor.
To see the distribution of our Duration, given our Materials, we compute the following plot:
We can notice that Duration behaves differently depending to the Material and the Temperature of the battery. To have a better detail lets split the plots into seperate ones (Materials and Temperature).
Now we can plot the means:
Given the plots, we can see that Duration increases as the type of material changes in increasing order. At the same time, the Duration decreases as the temperature increases.
Now it is interesting to see if there exist a relationship between the different factors.
The mean duration for temperature 15 seems to be the highest and remains quite constant even when the materials changes, different that what we saw in the previous plot.
On the other hand, the mean for temperature 125 seems to be the lowest and barely increases when the material changes in increasing order.
Across all the Materials, when the Temperature is 70 the duration seems to vary. This means that we will probably have an interaction. We will further investigate it with an Analysis of Variance.
It is useful to calculate the mean, standar deviation and its ratio for each combination of groups.
b <- batteries %>% group_by(Material,Temp) %>%
summarize(mean=mean(Duration), sd= sd(Duration)) %>% mutate(cv = sd/mean) %>%
as.data.frame()
b
Material Temp mean sd cv
1 1 15 134.75 45.35324 0.3365732
2 1 70 57.25 23.59908 0.4122110
3 1 125 57.50 26.85144 0.4669816
4 2 15 155.75 25.61738 0.1644775
5 2 70 119.75 12.65899 0.1057118
6 2 125 49.50 19.26136 0.3891184
7 3 15 144.00 25.97435 0.1803774
8 3 70 145.75 22.54440 0.1546786
9 3 125 85.50 19.27866 0.2254814
By having different coefficients of variation for our different combination of factor levels, we can proceede with a Linear Regression Analysis.
Before generating the Linear Model, it is important to take into account that our predictors are Factors, otherwise the model would take them as continuous and the we would not be able to find the results for each given group of Temp and Material.
Since we are looking forward to do an ANOVA, we should make a linear model including the interaction of the Factors, but first lets start with a linear model without interaction, and then compare.
fit0 <- lm(Duration ~ factor(Material) + factor(Temp) , data=batteries)
summary(fit0)
Call:
lm(formula = Duration ~ factor(Material) + factor(Temp), data = batteries)
Residuals:
Min 1Q Median 3Q Max
-54.389 -21.681 2.694 17.215 57.528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.47 11.17 10.965 3.39e-12 ***
factor(Material)2 25.17 12.24 2.057 0.04819 *
factor(Material)3 41.92 12.24 3.426 0.00175 **
factor(Temp)70 -37.25 12.24 -3.044 0.00472 **
factor(Temp)125 -80.67 12.24 -6.593 2.30e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.97 on 31 degrees of freedom
Multiple R-squared: 0.6414, Adjusted R-squared: 0.5951
F-statistic: 13.86 on 4 and 31 DF, p-value: 1.367e-06
For this first model, we can observe that our data is not centered around the median. We see that all of our coefficients with respect our Intercept(Material 1 and Temperature 15), are significant. Also, the model explains 64.14% of our variance in our data.
Now lets see the model with interaction:
fit1 <- lm(Duration ~ factor(Material) * factor(Temp), data=batteries)
summary(fit1)
Call:
lm(formula = Duration ~ factor(Material) * factor(Temp), data = batteries)
Residuals:
Min 1Q Median 3Q Max
-60.750 -14.625 1.375 17.938 45.250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.75 12.99 10.371 6.46e-11 ***
factor(Material)2 21.00 18.37 1.143 0.263107
factor(Material)3 9.25 18.37 0.503 0.618747
factor(Temp)70 -77.50 18.37 -4.218 0.000248 ***
factor(Temp)125 -77.25 18.37 -4.204 0.000257 ***
factor(Material)2:factor(Temp)70 41.50 25.98 1.597 0.121886
factor(Material)3:factor(Temp)70 79.25 25.98 3.050 0.005083 **
factor(Material)2:factor(Temp)125 -29.00 25.98 -1.116 0.274242
factor(Material)3:factor(Temp)125 18.75 25.98 0.722 0.476759
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.98 on 27 degrees of freedom
Multiple R-squared: 0.7652, Adjusted R-squared: 0.6956
F-statistic: 11 on 8 and 27 DF, p-value: 9.426e-07
In this case we can observe that our model improves. 76% of the variance is explained by our model. We can see that our our Intercept (Material 1 and Temperature 15) is significant, and changing the temperature with respect to the Intercept, is also significant. Also the interaction between Material 3 and Temperature 70 is significant with respect to the Intercept.
Now we can confirm with an Analysis of Variance if the interaction of our variables is significant.
Fitting an Analysis of variance:
Anova(fit1)
Anova Table (Type II tests)
Response: Duration
Sum Sq Df F value Pr(>F)
factor(Material) 10684 2 7.9114 0.001976 **
factor(Temp) 39119 2 28.9677 1.909e-07 ***
factor(Material):factor(Temp) 9614 4 3.5595 0.018611 *
Residuals 18231 27
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both of our variables are significant. This means that the interaction of Material and Temperature of our process will affect the duration of the Battery produced.
We can compare the means in order to see which are actually similar.
cld(lsmeans(fit1, specs = c('Temp','Material')))
Temp Material lsmean SE df lower.CL upper.CL .group
125 2 49.50 12.99243 27 22.84174 76.15826 1
70 1 57.25 12.99243 27 30.59174 83.90826 1
125 1 57.50 12.99243 27 30.84174 84.15826 1
125 3 85.50 12.99243 27 58.84174 112.15826 12
70 2 119.75 12.99243 27 93.09174 146.40826 23
15 1 134.75 12.99243 27 108.09174 161.40826 23
15 3 144.00 12.99243 27 117.34174 170.65826 23
70 3 145.75 12.99243 27 119.09174 172.40826 23
15 2 155.75 12.99243 27 129.09174 182.40826 3
Confidence level used: 0.95
P value adjustment: tukey method for comparing a family of 9 estimates
significance level used: alpha = 0.05
Looking at the pairs, we can observe that there are 4 different groups of Material and Temperature. This means for example: Temperature 125 with Material 2, is no different from Temperature 70 with Material 1. But they are significantly different to an interaction of another group such as, Temperature 70 with Material 2.
To see if our residuals follow a normal distribution, we compute the Probability Plot (QQ-Plot)
qqPlot(fit1,test=F)
Since our points are linearly distributed, we can say that our residuals follow a normal distribution.
Now we can plot our residuals for each factor and for our fitted values.
residualPlots(fit1,test=F)
There is not an obvious trend on the Fitted values against our Pearson residuals, which basically means that we can approve this plot and state that our linear model is reasonable.
For the plots of each Factor against the Pearson residuals, all seem to have its mean centered at \(0\), if it wasn’t centered at \(0\), then we could not expect our model to be good for that specific factor or modality. Hence we can approve our residual plots.
Is it good to do a density plot to see the distribution of the duration of each plot?
- Not recommended since we only have 8 values (At least more than 50 is recommende)
- Instead we could use a boxplot or dotplot
* Duration has sense to take either Normal or Exponential distribution. in this case is Normal.
*When response variable is continuous, it is natural to assume normality.
* Linear model - Response is normal
* glm - (often) Binomial, Poisson responses.
##### For Comparing two models that are not nested
Better is the maximum loglikelihood or a small value for AIC
Ho: model with less parameters
H1: model with more parameters
Residual.Deviance full parameter - Residual Deviance
Exam
GLM, comparing nested models, comparing not nested models,
Analyse Datasets
the paper is important too