Batteries Dataset Analysis

# Load Libraries
library(ggplot2)
library(car)
library(sandwich)
library(survival)
library(estimability)
library(gplots)
library(dplyr)
library(gridExtra)
library(lsmeans)

Objective

The objective of this assignment is to analyse the batteries dataset.

Exploring the dataset

batteries <- read.csv("Bateries.csv", sep = ";")
str(batteries)

'data.frame':   36 obs. of  3 variables:
 $ Material: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Temp    : int  15 15 15 15 70 70 70 70 125 125 ...
 $ Duration: int  130 74 155 180 34 40 80 75 20 70 ...

summary(batteries)

    Material      Temp        Duration    
 Min.   :1   Min.   : 15   Min.   : 20.0  
 1st Qu.:1   1st Qu.: 15   1st Qu.: 70.0  
 Median :2   Median : 70   Median :108.0  
 Mean   :2   Mean   : 70   Mean   :105.5  
 3rd Qu.:3   3rd Qu.:125   3rd Qu.:141.8  
 Max.   :3   Max.   :125   Max.   :188.0

First it is important to get to know the data set and to know what we are dealing with. From doing a summary and str of the data set, we know that we are dealing with 3 different variables Material, Temperature and Duration.

We are dealing Material and Temperature (Temp) are factors, with 3 modalities each. Now we know it is possible to perform a Two-Way Anova. It is important to know if the dataset is balanced, so we can know if the groups would have to have wieghts or not.

batteries %>% group_by(Material) %>% 
  summarise(freq=n()) %>% as.data.frame()

  Material freq
1        1   12
2        2   12
3        3   12

batteries %>% group_by(Material,Temp) %>% 
  summarise(freq=n()) %>% as.data.frame()

  Material Temp freq
1        1   15    4
2        1   70    4
3        1  125    4
4        2   15    4
5        2   70    4
6        2  125    4
7        3   15    4
8        3   70    4
9        3  125    4

So we can see our data is balanced, and we have the same number of instances in each modality of each factor.

To see the distribution of our Duration, given our Materials, we compute the following plot:

We can notice that Duration behaves differently depending to the Material and the Temperature of the battery. To have a better detail lets split the plots into seperate ones (Materials and Temperature).

Now we can plot the means:

Given the plots, we can see that Duration increases as the type of material changes in increasing order. At the same time, the Duration decreases as the temperature increases.

Now it is interesting to see if there exist a relationship between the different factors.

The mean duration for temperature 15 seems to be the highest and remains quite constant even when the materials changes, different that what we saw in the previous plot.

On the other hand, the mean for temperature 125 seems to be the lowest and barely increases when the material changes in increasing order.

Across all the Materials, when the Temperature is 70 the duration seems to vary. This means that we will probably have an interaction. We will further investigate it with an Analysis of Variance.

It is useful to calculate the mean, standar deviation and its ratio for each combination of groups.

b <- batteries %>% group_by(Material,Temp) %>% 
  summarize(mean=mean(Duration), sd= sd(Duration)) %>% mutate(cv = sd/mean) %>%
  as.data.frame()
b

  Material Temp   mean       sd        cv
1        1   15 134.75 45.35324 0.3365732
2        1   70  57.25 23.59908 0.4122110
3        1  125  57.50 26.85144 0.4669816
4        2   15 155.75 25.61738 0.1644775
5        2   70 119.75 12.65899 0.1057118
6        2  125  49.50 19.26136 0.3891184
7        3   15 144.00 25.97435 0.1803774
8        3   70 145.75 22.54440 0.1546786
9        3  125  85.50 19.27866 0.2254814

By having different coefficients of variation for our different combination of factor levels, we can proceede with a Linear Regression Analysis.

Linear Model

Before generating the Linear Model, it is important to take into account that our predictors are Factors, otherwise the model would take them as continuous and the we would not be able to find the results for each given group of Temp and Material.

Since we are looking forward to do an ANOVA, we should make a linear model including the interaction of the Factors, but first lets start with a linear model without interaction, and then compare.

fit0 <-  lm(Duration ~ factor(Material) + factor(Temp) , data=batteries)
summary(fit0)


Call:
lm(formula = Duration ~ factor(Material) + factor(Temp), data = batteries)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.389 -21.681   2.694  17.215  57.528 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         122.47      11.17  10.965 3.39e-12 ***
factor(Material)2    25.17      12.24   2.057  0.04819 *  
factor(Material)3    41.92      12.24   3.426  0.00175 ** 
factor(Temp)70      -37.25      12.24  -3.044  0.00472 ** 
factor(Temp)125     -80.67      12.24  -6.593 2.30e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 29.97 on 31 degrees of freedom
Multiple R-squared:  0.6414,    Adjusted R-squared:  0.5951 
F-statistic: 13.86 on 4 and 31 DF,  p-value: 1.367e-06

For this first model, we can observe that our data is not centered around the median. We see that all of our coefficients with respect our Intercept(Material 1 and Temperature 15), are significant. Also, the model explains 64.14% of our variance in our data.

Now lets see the model with interaction:

fit1 <-  lm(Duration ~ factor(Material) * factor(Temp), data=batteries)
summary(fit1)


Call:
lm(formula = Duration ~ factor(Material) * factor(Temp), data = batteries)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.750 -14.625   1.375  17.938  45.250 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         134.75      12.99  10.371 6.46e-11 ***
factor(Material)2                    21.00      18.37   1.143 0.263107    
factor(Material)3                     9.25      18.37   0.503 0.618747    
factor(Temp)70                      -77.50      18.37  -4.218 0.000248 ***
factor(Temp)125                     -77.25      18.37  -4.204 0.000257 ***
factor(Material)2:factor(Temp)70     41.50      25.98   1.597 0.121886    
factor(Material)3:factor(Temp)70     79.25      25.98   3.050 0.005083 ** 
factor(Material)2:factor(Temp)125   -29.00      25.98  -1.116 0.274242    
factor(Material)3:factor(Temp)125    18.75      25.98   0.722 0.476759    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.98 on 27 degrees of freedom
Multiple R-squared:  0.7652,    Adjusted R-squared:  0.6956 
F-statistic:    11 on 8 and 27 DF,  p-value: 9.426e-07

In this case we can observe that our model improves. 76% of the variance is explained by our model. We can see that our our Intercept (Material 1 and Temperature 15) is significant, and changing the temperature with respect to the Intercept, is also significant. Also the interaction between Material 3 and Temperature 70 is significant with respect to the Intercept.

Now we can confirm with an Analysis of Variance if the interaction of our variables is significant.

ANOVA

Fitting an Analysis of variance:

Anova(fit1)

Anova Table (Type II tests)

Response: Duration
                              Sum Sq Df F value    Pr(>F)    
factor(Material)               10684  2  7.9114  0.001976 ** 
factor(Temp)                   39119  2 28.9677 1.909e-07 ***
factor(Material):factor(Temp)   9614  4  3.5595  0.018611 *  
Residuals                      18231 27                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both of our variables are significant. This means that the interaction of Material and Temperature of our process will affect the duration of the Battery produced.

We can compare the means in order to see which are actually similar.

cld(lsmeans(fit1, specs = c('Temp','Material')))

 Temp Material lsmean       SE df  lower.CL  upper.CL .group
  125        2  49.50 12.99243 27  22.84174  76.15826  1    
   70        1  57.25 12.99243 27  30.59174  83.90826  1    
  125        1  57.50 12.99243 27  30.84174  84.15826  1    
  125        3  85.50 12.99243 27  58.84174 112.15826  12   
   70        2 119.75 12.99243 27  93.09174 146.40826   23  
   15        1 134.75 12.99243 27 108.09174 161.40826   23  
   15        3 144.00 12.99243 27 117.34174 170.65826   23  
   70        3 145.75 12.99243 27 119.09174 172.40826   23  
   15        2 155.75 12.99243 27 129.09174 182.40826    3  

Confidence level used: 0.95 
P value adjustment: tukey method for comparing a family of 9 estimates 
significance level used: alpha = 0.05

Looking at the pairs, we can observe that there are 4 different groups of Material and Temperature. This means for example: Temperature 125 with Material 2, is no different from Temperature 70 with Material 1. But they are significantly different to an interaction of another group such as, Temperature 70 with Material 2.

Residuals

To see if our residuals follow a normal distribution, we compute the Probability Plot (QQ-Plot)

qqPlot(fit1,test=F)

Since our points are linearly distributed, we can say that our residuals follow a normal distribution.

Now we can plot our residuals for each factor and for our fitted values.

residualPlots(fit1,test=F)

There is not an obvious trend on the Fitted values against our Pearson residuals, which basically means that we can approve this plot and state that our linear model is reasonable.

For the plots of each Factor against the Pearson residuals, all seem to have its mean centered at \(0\), if it wasn’t centered at \(0\), then we could not expect our model to be good for that specific factor or modality. Hence we can approve our residual plots.

Is it good to do a density plot to see the distribution of the duration of each plot?
  - Not recommended since we only have 8 values (At least more than 50 is recommende)
  - Instead we could use a boxplot or dotplot

* Duration has sense to take either Normal or Exponential distribution. in this case is Normal.

*When response variable is continuous, it is natural to assume normality.

* Linear model - Response is normal
* glm - (often) Binomial, Poisson responses.


##### For Comparing two models that are not nested

Better is the maximum loglikelihood or a small value for AIC

Ho: model with less parameters
H1: model with more parameters

Residual.Deviance full parameter - Residual Deviance

Exam
GLM, comparing nested models, comparing not nested models, 
Analyse Datasets
the paper is important too