Question 1)
a)Pick a reshaping package (we discussed two in class) – research
them online and tell us why you picked it over others (provide any
helpful links that supported your decision).
I chose the reshape2 package. According to the R
documentation, tidyr “is designed specifically for tidying data, not
general reshaping (reshape2)”. In addition, I find the arguments of
reshape2's melt() clearer than those of tidyr, which makes it more useful for me.
b)Show the code to reshape the verizon_wide.csv sample.
time_wide <- data.table::fread("C:/R-language/BACS/verizon_wide.csv")
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
time_long <- melt(time_wide, na.rm = TRUE,
value.name = "time",
variable.name = "group")
## No id variables; using all as measure variables
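The wide file has one column per group, and the shorter column is padded with missing values, which is why na.rm = TRUE is passed to melt(). As a minimal sketch of the same wide-to-long idea without extra packages, the toy example below uses base R's stack() instead of reshape2 (this is a swapped-in technique, not the code used above). The data frame toy_wide and its values are hypothetical.

```r
# Hypothetical toy data: the shorter CLEC column is padded with NA
toy_wide <- data.frame(ILEC = c(17.5, 2.4, 0.0, 0.65),
                       CLEC = c(26.6, 5.3, NA, NA))

# stack() returns a long data frame with columns `values` and `ind`
toy_long <- stack(toy_wide)
names(toy_long) <- c("time", "group")

# Drop the padding rows, which is what na.rm = TRUE does in melt()
toy_long <- toy_long[!is.na(toy_long$time), c("group", "time")]

nrow(toy_long)          # 6 rows: 4 ILEC + 2 CLEC
table(toy_long$group)
```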
c)Show us the “head” and “tail” of the data to show that the
reshaping worked
head(time_long)
## group time
## 1 ILEC 17.50
## 2 ILEC 2.40
## 3 ILEC 0.00
## 4 ILEC 0.65
## 5 ILEC 22.23
## 6 ILEC 1.20
tail(time_long)
## group time
## 1682 CLEC 24.20
## 1683 CLEC 22.13
## 1684 CLEC 18.57
## 1685 CLEC 20.00
## 1686 CLEC 14.13
## 1687 CLEC 5.80
d)Visualize Verizon’s response times for ILEC vs. CLEC
customers
groups <- split(x = time_long$time, f = time_long$group)
plot(density(groups$ILEC), col="cornflowerblue", lwd=2, xlim=c(0, 200),main="Response times for ILEC vs. CLEC customers")
lines(density(groups$CLEC), col="coral3", lwd=2)
legend(150,0.12 , lty=1, c("CLEC", "ILEC"), col=c("coral3", "cornflowerblue"))

Question 2)
a)State the appropriate null and alternative hypotheses
(one-tailed)
Null Hypothesis (H0): the mean response time for CLEC
customers is the same as (or less than) the mean for ILEC customers.
Alternative Hypothesis (H1): the mean response time for
CLEC customers is greater than the mean for ILEC
customers.
b)Use the appropriate form of the t.test() function to test the
difference between the mean of ILEC versus CLEC response times at 1%
significance.
b-i)Conduct the test assuming variances of the two populations are
equal
t.test(groups$CLEC, groups$ILEC,
alt="greater", var.equal=TRUE)
##
## Two Sample t-test
##
## data: groups$CLEC and groups$ILEC
## t = 2.6125, df = 1685, p-value = 0.004534
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 2.996491 Inf
## sample estimates:
## mean of x mean of y
## 16.509130 8.411611
Since the p-value (0.004534) is smaller than 0.01, the result is
significant at the 1% level. We can reject the null hypothesis and
conclude that the mean response time for CLEC customers is greater
than for ILEC customers.
b-ii)Conduct the test assuming variances of the two populations are
not equal
t.test(groups$CLEC, groups$ILEC,
alt="greater", var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: groups$CLEC and groups$ILEC
## t = 1.9834, df = 22.346, p-value = 0.02987
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.091721 Inf
## sample estimates:
## mean of x mean of y
## 16.509130 8.411611
Since the p-value (0.02987) is larger than 0.01, the result is
not significant at the 1% level. We cannot reject the null hypothesis:
there is not enough evidence that the mean response time for CLEC
customers is greater than for ILEC customers.
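The two tests disagree because the Welch test does not pool the variances. Its degrees of freedom (22.346) are close to the CLEC sample size minus one (23 - 1 = 22), which indicates that the small CLEC group dominates the uncertainty of the difference. The sketch below recomputes the Welch statistic, the Welch-Satterthwaite degrees of freedom, and the one-tailed p-value by hand and compares them with t.test(). The samples x and y are simulated and hypothetical, not the Verizon data.

```r
set.seed(1)
x <- rnorm(23, mean = 16, sd = 19)    # hypothetical small, high-variance group
y <- rnorm(1664, mean = 8, sd = 15)   # hypothetical large group

# Variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-tailed p-value for the alternative "mean of x is greater"
p_welch <- pt(t_welch, df = df_welch, lower.tail = FALSE)

# Compare with the built-in test
builtin <- t.test(x, y, alternative = "greater", var.equal = FALSE)
all.equal(c(t_welch, df_welch, p_welch),
          unname(c(builtin$statistic, builtin$parameter, builtin$p.value)))
```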
c)Use a permutation test to compare the means of ILEC vs. CLEC
response times
c-i)Visualize the distribution of permuted differences, and indicate
the observed difference as well.
# Observed difference
observed_diff <- mean(groups$CLEC) - mean(groups$ILEC)
# Permutations: shuffle the values across the group labels
permute_diff <- function(values, group) {
  permuted <- sample(values, replace = FALSE)
  grouped <- split(permuted, group)
  # Return the difference in means for this permutation
  mean(grouped$CLEC) - mean(grouped$ILEC)
}
nperms <- 10000
permuted_diffs <- replicate(nperms, permute_diff(time_long$time, time_long$group))
# Visualize the permuted differences and mark the observed difference
hist(permuted_diffs, breaks = "fd", probability = TRUE)
lines(density(permuted_diffs), lwd=2)
abline(v = observed_diff, col = "red", lwd = 2)

c-ii)What are the one-tailed and two-tailed p-values of the
permutation test?
p_1tailed <- sum(permuted_diffs > observed_diff) / nperms
p_2tailed <- sum(abs(permuted_diffs) > abs(observed_diff)) / nperms
p_1tailed
## [1] 0.0171
p_2tailed
## [1] 0.0171
With 10,000 permutations, both the one-tailed and two-tailed
p-values are 0.0171, which means that in 1.71% of the 10,000
permutations the permuted difference (in absolute value for the
two-tailed case) was bigger than the observed difference.
c-iii)Would you reject the null hypothesis at 1% significance in a
one-tailed test?
The p-value is 0.0171, which is greater than 0.01. Therefore,
we cannot reject the null hypothesis at 1% significance. This means
that there is not enough evidence to suggest that the mean response
time for CLEC customers is greater than for ILEC customers.
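Because the permutations are random and no seed is set, the p-value changes slightly from run to run. A rough way to judge whether this matters for the decision is the Monte Carlo standard error of the estimated p-value. The sketch below assumes the binomial approximation sqrt(p(1 - p) / nperms) and plugs in the one-tailed value 0.0171 printed above; the names p_hat, se_p and ci_p are introduced here for illustration.

```r
p_hat  <- 0.0171   # one-tailed permutation p-value printed above
nperms <- 10000    # number of permutations used above

# Monte Carlo standard error of the estimated p-value (binomial approximation)
se_p <- sqrt(p_hat * (1 - p_hat) / nperms)

# Approximate 95% interval for the p-value across repeated runs
ci_p <- p_hat + c(-1, 1) * qnorm(0.975) * se_p

se_p   # about 0.0013
ci_p   # the whole interval lies above 0.01
```

Under this approximation the run-to-run variation is too small to move the p-value below 0.01, so the decision at 1% significance does not depend on the particular random draw. Calling set.seed() before replicate() would make the reported value reproducible.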
Question 3)
a)Compute the W statistic comparing the values.
gt_eq <- function(a, b) {
ifelse(a > b, 1, 0) + ifelse(a == b, 0.5, 0)
}
W <- sum(outer(groups$CLEC, groups$ILEC, FUN = gt_eq));W
## [1] 26820
b)Compute the one-tailed p-value for W.
n1 <- length(groups$CLEC) # 23
n2 <- length(groups$ILEC) # 1664
wilcox_p_1tail <- 1 - pwilcox(W,n1,n2);wilcox_p_1tail
## [1] 0.0003688341
c)Run the Wilcoxon Test again using the wilcox.test() function in
R
wilcox.test(groups$CLEC, groups$ILEC, alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: groups$CLEC and groups$ILEC
## W = 26820, p-value = 0.0004565
## alternative hypothesis: true location shift is greater than 0
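Both approaches give the same W, but the p-values differ slightly (0.0003688 versus 0.0004565). The header "with continuity correction" indicates that wilcox.test() used a normal approximation here, while pwilcox() evaluates the exact distribution of W, which assumes no ties. The toy sketch below, with hypothetical values x and y that contain no ties, shows that the pairwise count, the rank-sum formula, and wilcox.test() agree on W. It also shows the exact one-tailed p-value as wilcox.test() computes it, P(W' >= W), which is 1 - pwilcox(W - 1, ...) rather than 1 - pwilcox(W, ...).

```r
# Hypothetical toy samples without ties
x <- c(3.1, 7.4, 9.0, 12.5)
y <- c(1.2, 2.8, 4.4, 5.0, 6.3, 8.1)

# W as the number of (x, y) pairs with x > y, counting ties as 0.5
W_pairs <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# W from the rank sum of x in the pooled sample
r <- rank(c(x, y))
W_ranks <- sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2

# W and exact p-value from the built-in test (exact for small samples without ties)
builtin <- wilcox.test(x, y, alternative = "greater")
W_builtin <- unname(builtin$statistic)

# Exact one-tailed p-value: P(W' >= W)
p_exact <- pwilcox(W_pairs - 1, length(x), length(y), lower.tail = FALSE)

c(W_pairs, W_ranks, W_builtin)   # all 19
all.equal(p_exact, builtin$p.value)
```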
d)At 1% significance, and one-tailed, would you reject the null
hypothesis that the values of CLEC and ILEC are similar?
Since the p-value is 0.0004565, which is less than 0.01, we
can reject the null hypothesis. Therefore, the response times of CLEC
customers tend to be greater than those of ILEC customers.
Question 4)
a)Follow the following steps to create a function to see how a
distribution of values compares to a perfectly normal distribution.
a)Make a function called norm_qq_plot() that takes a set of
values:
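norm_qq_plot <- function(values){
  # Probabilities from 0 to 1 in steps of 0.001
  probs1000 <- seq(0,1,0.001)
  # Sample quantiles of the values
  q_vals <- quantile(values, probs = probs1000)
  # Quantiles of a normal distribution with the same mean and sd
  q_norm <- qnorm(probs1000,mean = mean(values), sd = sd(values))
  plot(q_norm, q_vals, xlab="normal quantiles", ylab="values quantiles")
  # Reference line: points fall on it if the values are normal
  abline( a = 0, b = 1 , col="red", lwd=2)
}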
b)Confirm that your function works by running it against the values
of our d123 distribution from week 3 and checking that it looks like the
plot on the right:
set.seed(978234)
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
d123 <- c(d1, d2, d3)
par( mfrow= c(1,2) )
plot(density(d123))
norm_qq_plot(d123)

b)Interpret the plot you produced (see this article on how to
interpret normal Q-Q plots) and tell us if it suggests whether d123 is
normally distributed or not.
The Q-Q plot suggests that the d123 distribution is not
normal, as the points do not fall on the red line.
Specifically, the distribution appears to be skewed to the
right: the points in both the upper tail and the lower tail of the plot
lie above the red line. However, overall, the deviation
from the line is not severe, so the distribution is only
roughly normal.
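A numeric check can complement the visual reading of the Q-Q plot. The sketch below rebuilds d123 with the same seed and the same order of rnorm() calls, then computes the sample skewness (third central moment divided by the variance to the power 1.5). A positive value indicates a longer right tail. The helper sample_skewness() is introduced here for illustration and is not part of the assignment.

```r
# Rebuild d123 exactly as above (same seed, same order of draws)
set.seed(978234)
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
d123 <- c(d1, d2, d3)

# Sample skewness: third central moment over variance^(3/2)
sample_skewness <- function(values) {
  centered <- values - mean(values)
  mean(centered^3) / mean(centered^2)^1.5
}

sample_skewness(d123)   # positive: right-skewed
```

The density plot also shows more than one peak, which the skewness alone does not capture, so the two views are best read together.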
c)Use your normal Q-Q plot function to check if the values from each
of the CLEC and ILEC samples we compared in question 2 could be normally
distributed. What’s your conclusion?
norm_qq_plot(groups$CLEC)

norm_qq_plot(groups$ILEC)

Neither CLEC nor ILEC is normally distributed, since the
points do not follow the red line well.
For CLEC, the points split into two clusters, and the
higher-value cluster contains fewer points than the lower one. This
makes the distribution skewed to the right: the points in the upper
tail of the plot deviate above the red line, while the lower tail
flattens out.
For ILEC, the distribution is clearly skewed to the right:
the points in the upper tail of the plot deviate far above the red
line, and the lower tail is almost completely flat.