Exercise 9

a <- c(176, 125, 152, 180, 159, 168, 160, 151) 
b <- c(164, 121, 137, 169, 144, 145, 156, 139)
ab <- data.frame(a,b); ab
##     a   b
## 1 176 164
## 2 125 121
## 3 152 137
## 4 180 169
## 5 159 144
## 6 168 145
## 7 160 156
## 8 151 139
dim(ab)
## [1] 8 2


We structure the different observations related for Server a and Server b as a matrix. A data frame composed by 2 columns and 8 rows is obtained. Each row reports the processing times (in milliseconds) for several files uploaded by the servers considered. All the variables are quantitative (numeric); more specifically they are discrete.

At first glance, it is notable that the variables exhibit a similar variability among themselves. In particular, we observe that the max observation are respectively 180 and 169 for variable a and b, while min observations are 125 and 121.

9.0

require(skimr)
## Caricamento del pacchetto richiesto: skimr
Range <- function(x) {max(x)- min(x)}
my_skim <- skim_with(base = sfl(),
numeric = sfl(hist = NULL, range = Range))

my_skim(ab)
Data summary
Name ab
Number of rows 8
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable mean sd p0 p25 p50 p75 p100 range
a 158.88 17.24 125 151.75 159.5 170 180 55
b 146.88 15.62 121 138.50 144.5 158 169 48

We summaries the main results of descriptive analysis below:
- From the median, we can observe that 50% of file are uploaded within 159.5 milliseconds with server a. The mean assumes a very similar value (158.9), that can lead to a precondition of a reasonably symmetrical distribution. Quartile analysis shows that 25.0% of sample processed the files in less than 151.7 milliseconds, while another quarter performed the same task in 170.0 milliseconds. The max and the min are not equidistant from the center of the distribution and the range account a value of 59.0. Hence, The observation of quartiles cannot confirms the hypothesis of symmetry of the distribution. Also, regarding the value of the standard deviation (accounted for 17.2), we can’t state surely that the data of server a are normally distributed. To get more consistent results, we need a larger sample of size.
- The variable b shows a lower variability of standard deviation (15.6), but an analogue range (59) to the previously variable analysed. Respectively, the minimum and maximum values are equals to 121.0 and 169.0 milliseconds. The median is quite close to the mean: they measure 144.5 and 146.9. The distribution is not skewed, since p0 and p100 are almost equidistant to the median. Further, it can be said that on average the server b processed files in less time than the server a.


### Coefficient of variation

cv <- function(x) {sd(x)/mean(x)}
round(apply(ab, 2, cv), 2)
##    a    b 
## 0.11 0.11

After computing the coefficient of variability, the results show that the two variables account the same value, equal to 0.11. In other words, the observations are dispersed in the sample in same manner and comparatively the standard deviation is relatively small compared to the mean.
### Empirical cumulative distribution function
In the following we represent the empirical cumulative distribution function for each variable against a nosrmal distribution.

set.seed(123)
y <-rnorm(50)
par(mfrow = c(1, 2))
for (i in 1:ncol(ab)) {
plot(ecdf(ab[, i]),
do.points = TRUE,
main = colnames(ab)[i])
curve( pnorm(x, mean(ab[, i]), sd(ab[, i])),
from = 0, to = 200, add = TRUE, col='red', lwd = 2)

}


None of two variables takes negative values. While the center of the variable is narrower, it is noticeable the the difference between the tails; for both, the empirical cumulative function significantly deviates from the curve of the theoretical function. This phenomenon is most evident in the left tail of the two distributions. However, we cannot establish with confidence that the assumptions of normality have been violated, as the graphical analysis of the empirical cumulative distribution function is not very sensitive. Moreover, as mentioned above, we have few observations and therefore each of these can significantly bias future claims. For more in-depth analyses we rely on other tools, such as QQ-plot.

Q-Q plot


Below, we represent the quantile quantile (QQ) plot for each variable with reference to the Gaussian distribution.

par(mfrow = c(1, 2))
for (i in 1:ncol(ab)) {
qqnorm(ab[, i],
main = colnames(ab)[i])
qqline(ab[, i], col = "darkorange")
}


The graphs depict confirm the non-normality of a and b variables: the distribution of each eight observations deviates significantly from the reference line (y=x) of the normal quantile. Furthermore, we can see several overlap from above and under the line, resulting in a “s” curve. This indicates the presence of a particularly heavy left tail and the presence of asymmetry of the two variables.

We proceed to compute the skewness, by calculating the point estimate for the reference population with the library e1071 .

require(e1071)
sk <- e1071::skewness(ab)
## Warning in mean.default(x): l'argomento non è numerico o logico: si restituisce
## NA
sk
## [1] NA

We obtained a negative value, which suggest that the tail of left is longer than the tail of the right. However, having obtained a value around 0.25, it can be said that the data are fairly symmetrical.
## 9.1

Nonparametric bootstrap sampling offers a robust alternative to classic (parametric) methods for statistical inference.
The nonparametric bootstrap involves randomly sampling data with replacement to form a “new” sample of data, which is referred to as a bootstrap sample.In our case, we obtain two replicated sample with replacement of the units and same length of the original one. Also, We convert dtype of ab in matrix.

ab<- data.matrix(ab)
n <- length(ab)
ab1 <- sample(ab, n, replace = TRUE)
summary(ab1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   139.0   145.0   158.0   156.6   164.0   180.0


ab2 <- sample(ab, n, replace = TRUE)
summary(ab2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   121.0   144.8   152.0   150.1   159.0   168.0

for each bootstrap sample we calculate the corresponding value of the skewness and we save the realized values in an object s by using the library(e1071)

library(e1071)
s1<- skewness(ab1)
s2<- skewness(ab2)
s <- c(s1,s2)
s <- cbind(s1,s2);s
##              s1         s2
## [1,] 0.08236338 -0.8174142

We calculate the standard deviation of the bootstrap replications. This is the standard error we can associate to the point estimate obtained for the skewness in the original sample.

sd(s)
## [1] 0.6362388

Next, we compare the previous result of standard deviation with a bootstrap function with a new sampl with 1000 values, by using the library bootstrap.

B <- 1000
n <- length(ab)
Tboot <- rep(0, B)
set.seed(16253)
for (i in 1:B) {
Xstar <- sample(ab,
n,
replace = TRUE)
Tboot[i] <- e1071::skewness(Xstar)
}
summary(Tboot)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.658237 -0.460958 -0.229331 -0.230510  0.001117  1.068567
seTboot <- sd(Tboot); seTboot
## [1] 0.3437788

The standard deviation is much higher than the previous one. The standard error of the point estimated of the skewness of the reference population is 0.34.

9.2

We draw the bootstrap distribution on the following

hist(Tboot, breaks = 16, freq=FALSE, main = "Bootstrap distribution with n= 1000")
curve(dnorm(x, mean(x), sd(x)), add = TRUE)


The figure above represent the approximation of the sampling distribution of the sample ab. Both distributions have similar shape and spread. The mean of Tboot is -0.23, which is quite close to the median. The spread of the distribution appears to be quite consistent with the standard deviation we specified in the distribution.

Finally, we can also compute the values of 𝜇 ± 3𝜎; indeed it is known that, for the Gaussian distribution, more than 99% of the realizations are in the interval [𝜇 − 3𝜎, 𝜇 + 3𝜎]. We can therefore evaluate if the number of generated data outside this interval is very limited.

x_Max <- mean(Tboot)+3*sd(Tboot); x_Max
## [1] 0.8008262
x_Min <- mean(Tboot)-3*sd(Tboot); x_Min
## [1] -1.261847
length(Tboot[Tboot < x_Min])
## [1] 3
length(Tboot[Tboot > x_Max])
## [1] 4

We detect the presence of only 5 observations that fall outside the calculated interval, a very low number, which is consistent with the information known from a theoretical point of view; in particular, 3 of them are in the left tail and 4 in the right tail.

9.3

Now we compute the confidence interval at a confidence level of 95% with the percentile method and then we depict the corresponding histogram.

alpha <- 0.05
round(quantile(Tboot, c(0.025, 0.975)),2)
##  2.5% 97.5% 
## -0.94  0.41


the confident interval for the skewness computing at confidence level of 0.95 is [-0.94, 0.41]: the lower bound is (-0.94), while the upper bound is (0.41). This represent the expected range of the skewness in the population. So the Tboot has values that can range in this interval and the length is not very huge proportioned to the distribution of the observations. Indeed the precision is quite high.

sB<- mean(Tboot)
Q <- quantile(Tboot, c(0.025, 0.975))
Q[1]; Q[2]
##       2.5% 
## -0.9431137
##     97.5% 
## 0.4123592
hist(Tboot,
breaks = 60,
freq=FALSE,
main = "Bootstrap distribution with B = 1000 ",
xlab = "Skewness index",
col = "gray",
ylab = "Density")
#
abline( v = c(sk,sB,Q[1], Q[2]),
col = c("red", "blue", "green", "green"))
#
legend(0.5,0.5,
c("Estimate", "Mean of B",
"conf. int1", "conf. int2"),
col = c("red", "blue", "green", "green"),
lty = c(1,1,1,1),
lwd = c(3,3,3,3),
cex = 0.7)


In the figure are depict the realized values of the skewness that we calculated for each bootstrap sample. These are the possible values of the skewness with the reference of the population = 1000. The red line represents the point estimate of the skewness in the original data, the blue line explains the mean skewness generated by bootstrap method, while the green lines represent the extremes bound for the parameter. We observe that the two means are quite close and the value of the skewness fails in range [-0.94, 0.41], regarding of confidence of 95%.

9.3