a <- c(176, 125, 152, 180, 159, 168, 160, 151)
b <- c(164, 121, 137, 169, 144, 145, 156, 139)
ab <- data.frame(a,b); ab
## a b
## 1 176 164
## 2 125 121
## 3 152 137
## 4 180 169
## 5 159 144
## 6 168 145
## 7 160 156
## 8 151 139
dim(ab)
## [1] 8 2
We structure the different observations related for Server a
and Server b as a matrix. A data frame composed by 2 columns
and 8 rows is obtained. Each row reports the processing times (in
milliseconds) for several files uploaded by the servers considered. All
the variables are quantitative (numeric); more specifically they are
discrete.
At first glance, it is notable that the variables exhibit a similar
variability among themselves. In particular, we observe that the max
observation are respectively 180 and 169 for variable a and
b, while min observations are 125 and 121.
require(skimr)
## Caricamento del pacchetto richiesto: skimr
Range <- function(x) {max(x)- min(x)}
my_skim <- skim_with(base = sfl(),
numeric = sfl(hist = NULL, range = Range))
my_skim(ab)
| Name | ab |
| Number of rows | 8 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | mean | sd | p0 | p25 | p50 | p75 | p100 | range |
|---|---|---|---|---|---|---|---|---|
| a | 158.88 | 17.24 | 125 | 151.75 | 159.5 | 170 | 180 | 55 |
| b | 146.88 | 15.62 | 121 | 138.50 | 144.5 | 158 | 169 | 48 |
We summaries the main results of descriptive analysis below:
- From the median, we can observe that 50% of file are uploaded within
159.5 milliseconds with server a. The mean assumes a very
similar value (158.9), that can lead to a precondition of a reasonably
symmetrical distribution. Quartile analysis shows that 25.0% of sample
processed the files in less than 151.7 milliseconds, while another
quarter performed the same task in 170.0 milliseconds. The max and the
min are not equidistant from the center of the distribution and the
range account a value of 59.0. Hence, The observation of quartiles
cannot confirms the hypothesis of symmetry of the distribution. Also,
regarding the value of the standard deviation (accounted for 17.2), we
can’t state surely that the data of server a are normally
distributed. To get more consistent results, we need a larger sample of
size.
- The variable b shows a lower variability of standard
deviation (15.6), but an analogue range (59) to the previously variable
analysed. Respectively, the minimum and maximum values are equals to
121.0 and 169.0 milliseconds. The median is quite close to the mean:
they measure 144.5 and 146.9. The distribution is not skewed, since p0
and p100 are almost equidistant to the median. Further, it can be said
that on average the server b processed files in less time than
the server a.
### Coefficient of variation
cv <- function(x) {sd(x)/mean(x)}
round(apply(ab, 2, cv), 2)
## a b
## 0.11 0.11
After computing the coefficient of variability, the results show that
the two variables account the same value, equal to 0.11. In other words,
the observations are dispersed in the sample in same manner and
comparatively the standard deviation is relatively small compared to the
mean.
### Empirical cumulative distribution function
In the following we represent the empirical cumulative distribution
function for each variable against a nosrmal distribution.
set.seed(123)
y <-rnorm(50)
par(mfrow = c(1, 2))
for (i in 1:ncol(ab)) {
plot(ecdf(ab[, i]),
do.points = TRUE,
main = colnames(ab)[i])
curve( pnorm(x, mean(ab[, i]), sd(ab[, i])),
from = 0, to = 200, add = TRUE, col='red', lwd = 2)
}
None of two variables takes negative values. While the center of the
variable is narrower, it is noticeable the the difference between the
tails; for both, the empirical cumulative function significantly
deviates from the curve of the theoretical function. This phenomenon is
most evident in the left tail of the two distributions. However, we
cannot establish with confidence that the assumptions of normality have
been violated, as the graphical analysis of the empirical cumulative
distribution function is not very sensitive. Moreover, as mentioned
above, we have few observations and therefore each of these can
significantly bias future claims. For more in-depth analyses we rely on
other tools, such as QQ-plot.
Below, we represent the quantile quantile (QQ) plot for each variable
with reference to the Gaussian distribution.
par(mfrow = c(1, 2))
for (i in 1:ncol(ab)) {
qqnorm(ab[, i],
main = colnames(ab)[i])
qqline(ab[, i], col = "darkorange")
}
The graphs depict confirm the non-normality of a and b
variables: the distribution of each eight observations deviates
significantly from the reference line (y=x) of the normal quantile.
Furthermore, we can see several overlap from above and under the line,
resulting in a “s” curve. This indicates the presence of a particularly
heavy left tail and the presence of asymmetry of the two
variables.
We proceed to compute the skewness, by calculating the point estimate for the reference population with the library e1071 .
require(e1071)
sk <- e1071::skewness(ab)
## Warning in mean.default(x): l'argomento non è numerico o logico: si restituisce
## NA
sk
## [1] NA
We obtained a negative value, which suggest that the tail of left is
longer than the tail of the right. However, having obtained a value
around 0.25, it can be said that the data are fairly symmetrical.
## 9.1
Nonparametric bootstrap sampling offers a robust alternative to
classic (parametric) methods for statistical inference.
The nonparametric bootstrap involves randomly sampling data with
replacement to form a “new” sample of data, which is referred to as a
bootstrap sample.In our case, we obtain two replicated sample with
replacement of the units and same length of the original one. Also, We
convert dtype of ab in matrix.
ab<- data.matrix(ab)
n <- length(ab)
ab1 <- sample(ab, n, replace = TRUE)
summary(ab1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 139.0 145.0 158.0 156.6 164.0 180.0
ab2 <- sample(ab, n, replace = TRUE)
summary(ab2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 121.0 144.8 152.0 150.1 159.0 168.0
for each bootstrap sample we calculate the corresponding value of the skewness and we save the realized values in an object s by using the library(e1071)
library(e1071)
s1<- skewness(ab1)
s2<- skewness(ab2)
s <- c(s1,s2)
s <- cbind(s1,s2);s
## s1 s2
## [1,] 0.08236338 -0.8174142
We calculate the standard deviation of the bootstrap replications. This is the standard error we can associate to the point estimate obtained for the skewness in the original sample.
sd(s)
## [1] 0.6362388
Next, we compare the previous result of standard deviation with a bootstrap function with a new sampl with 1000 values, by using the library bootstrap.
B <- 1000
n <- length(ab)
Tboot <- rep(0, B)
set.seed(16253)
for (i in 1:B) {
Xstar <- sample(ab,
n,
replace = TRUE)
Tboot[i] <- e1071::skewness(Xstar)
}
summary(Tboot)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.658237 -0.460958 -0.229331 -0.230510 0.001117 1.068567
seTboot <- sd(Tboot); seTboot
## [1] 0.3437788
The standard deviation is much higher than the previous one. The standard error of the point estimated of the skewness of the reference population is 0.34.
We draw the bootstrap distribution on the following
hist(Tboot, breaks = 16, freq=FALSE, main = "Bootstrap distribution with n= 1000")
curve(dnorm(x, mean(x), sd(x)), add = TRUE)
The figure above represent the approximation of the sampling
distribution of the sample ab. Both distributions have similar
shape and spread. The mean of Tboot is -0.23, which is quite
close to the median. The spread of the distribution appears to be quite
consistent with the standard deviation we specified in the
distribution.
Finally, we can also compute the values of 𝜇 ± 3𝜎; indeed it is known that, for the Gaussian distribution, more than 99% of the realizations are in the interval [𝜇 − 3𝜎, 𝜇 + 3𝜎]. We can therefore evaluate if the number of generated data outside this interval is very limited.
x_Max <- mean(Tboot)+3*sd(Tboot); x_Max
## [1] 0.8008262
x_Min <- mean(Tboot)-3*sd(Tboot); x_Min
## [1] -1.261847
length(Tboot[Tboot < x_Min])
## [1] 3
length(Tboot[Tboot > x_Max])
## [1] 4
We detect the presence of only 5 observations that fall outside the
calculated interval, a very low number, which is consistent with the
information known from a theoretical point of view; in particular, 3 of
them are in the left tail and 4 in the right tail.
Now we compute the confidence interval at a confidence level of 95% with the percentile method and then we depict the corresponding histogram.
alpha <- 0.05
round(quantile(Tboot, c(0.025, 0.975)),2)
## 2.5% 97.5%
## -0.94 0.41
the confident interval for the skewness computing at confidence level of
0.95 is [-0.94, 0.41]: the lower bound is (-0.94), while the upper bound
is (0.41). This represent the expected range of the skewness in the
population. So the Tboot has values that can range in this
interval and the length is not very huge proportioned to the
distribution of the observations. Indeed the precision is quite
high.
sB<- mean(Tboot)
Q <- quantile(Tboot, c(0.025, 0.975))
Q[1]; Q[2]
## 2.5%
## -0.9431137
## 97.5%
## 0.4123592
hist(Tboot,
breaks = 60,
freq=FALSE,
main = "Bootstrap distribution with B = 1000 ",
xlab = "Skewness index",
col = "gray",
ylab = "Density")
#
abline( v = c(sk,sB,Q[1], Q[2]),
col = c("red", "blue", "green", "green"))
#
legend(0.5,0.5,
c("Estimate", "Mean of B",
"conf. int1", "conf. int2"),
col = c("red", "blue", "green", "green"),
lty = c(1,1,1,1),
lwd = c(3,3,3,3),
cex = 0.7)
In the figure are depict the realized values of the skewness that we
calculated for each bootstrap sample. These are the possible values of
the skewness with the reference of the population = 1000. The red line
represents the point estimate of the skewness in the original data, the
blue line explains the mean skewness generated by bootstrap method,
while the green lines represent the extremes bound for the parameter. We
observe that the two means are quite close and the value of the skewness
fails in range [-0.94, 0.41], regarding of confidence of 95%.