Question 1)

I chose the reshape2 package. According to the R documentation, tidyr “is designed specifically for tidying data, not general reshaping (reshape2)”. In addition, I find the arguments of reshape2's melt() clearer than those of tidyr, which makes it more convenient for me.

b)Show the code to reshape the verizon_wide.csv sample.

time_wide <- data.table::fread("C:/R-language/BACS/verizon_wide.csv")
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
time_long <- melt(time_wide, na.rm = TRUE,
                  value.name = "time",
                  variable.name = "group")
## No id variables; using all as measure variables
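
To illustrate what the wide-to-long reshaping does without needing the CSV file, here is a minimal sketch in base R. The toy values and the use of stack() are assumptions for illustration only; they are not the Verizon data or the reshape2 call used above.

```r
# Toy wide table (made-up values). The shorter column is padded with NA,
# as happens when two groups have different numbers of observations.
wide <- data.frame(ILEC = c(17.5, 2.4, 0.0, 0.65),
                   CLEC = c(26.6, 8.6, NA, NA))

# stack() puts all values in one column and the source column name in another.
long <- stack(wide)
names(long) <- c("time", "group")

# Drop the padding rows, mirroring na.rm = TRUE in melt().
long <- long[!is.na(long$time), c("group", "time")]
rownames(long) <- NULL
print(long)
```

Each remaining row holds one observation and its group label, which is the shape that split() and t.test() work with later.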

c)Show us the “head” and “tail” of the data to show that the reshaping worked

head(time_long)
##   group  time
## 1  ILEC 17.50
## 2  ILEC  2.40
## 3  ILEC  0.00
## 4  ILEC  0.65
## 5  ILEC 22.23
## 6  ILEC  1.20
tail(time_long)
##      group  time
## 1682  CLEC 24.20
## 1683  CLEC 22.13
## 1684  CLEC 18.57
## 1685  CLEC 20.00
## 1686  CLEC 14.13
## 1687  CLEC  5.80

d)Visualize Verizon’s response times for ILEC vs. CLEC customers

groups <- split(x = time_long$time, f = time_long$group)
plot(density(groups$ILEC), col="cornflowerblue", lwd=2, xlim=c(0, 200),main="Response times for ILEC vs. CLEC customers")
lines(density(groups$CLEC), col="coral3", lwd=2)
legend(150,0.12 , lty=1, c("CLEC", "ILEC"), col=c("coral3", "cornflowerblue"))

Question 2)

a)State the appropriate null and alternative hypotheses (one-tailed)

Null Hypothesis (H0): the mean response time for CLEC customers is less than or equal to the mean response time for ILEC customers.

Alternative Hypothesis (H1): the mean response time for CLEC customers is greater than the mean response time for ILEC customers.

b)Use the appropriate form of the t.test() function to test the difference between the mean of ILEC versus CLEC response times at 1% significance.

b-i)Conduct the test assuming variances of the two populations are equal

t.test(groups$CLEC, groups$ILEC,
       alt="greater", var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  groups$CLEC and groups$ILEC
## t = 2.6125, df = 1685, p-value = 0.004534
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.996491      Inf
## sample estimates:
## mean of x mean of y 
## 16.509130  8.411611

Since the p-value of 0.004534 is smaller than 0.01, the result is significant at the 1% level and we reject the null hypothesis. Assuming equal variances, the mean response time for CLEC customers is significantly greater than for ILEC customers.

b-ii)Conduct the test assuming variances of the two populations are not equal

t.test(groups$CLEC, groups$ILEC,
       alt="greater", var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  groups$CLEC and groups$ILEC
## t = 1.9834, df = 22.346, p-value = 0.02987
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  1.091721      Inf
## sample estimates:
## mean of x mean of y 
## 16.509130  8.411611

Since the p-value of 0.02987 is larger than 0.01, the result is not significant at the 1% level and we cannot reject the null hypothesis. Without assuming equal variances, there is not enough evidence to conclude that the mean response time for CLEC customers is greater than for ILEC customers.
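
The two tests disagree largely because the Welch test uses far fewer degrees of freedom (22.346 instead of 1685). Its degrees of freedom are driven by the smaller, more variable group. The sketch below computes the Welch statistic, degrees of freedom and one-tailed p-value by hand and compares them with t.test(). The two samples are made-up values used only for illustration, not the Verizon data.

```r
# Hypothetical samples (illustrative values only).
x <- c(24.2, 22.1, 18.6, 20.0, 14.1, 5.8)          # "CLEC-like"
y <- c(17.5, 2.4, 0.0, 0.65, 22.2, 1.2, 3.1, 8.0)  # "ILEC-like"

# Per-group variance of the sample mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_welch  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-tailed p-value for the alternative "mean of x is greater".
p_welch <- pt(t_welch, df = df_welch, lower.tail = FALSE)

# Cross-check against the built-in test.
res <- t.test(x, y, alternative = "greater", var.equal = FALSE)
print(c(manual_t = t_welch, manual_df = df_welch, manual_p = p_welch))
print(c(builtin_t = unname(res$statistic), builtin_df = unname(res$parameter),
        builtin_p = res$p.value))
```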

c)Use a permutation test to compare the means of ILEC vs. CLEC response times

c-i)Visualize the distribution of permuted differences, and indicate the observed difference as well.

#Observed Difference
observed_diff <- mean(groups$CLEC) - mean(groups$ILEC)
#Permutations: switch elements between groups
permute_diff <- function(values, group) {
  permuted <- sample(values, replace = FALSE)
  grouped <- split(permuted, group)
  # Return the difference directly as the function's value
  mean(grouped$CLEC) - mean(grouped$ILEC)
}
nperms <- 10000
permuted_diffs <- replicate(nperms, permute_diff(time_long$time, time_long$group))
#Visualize it.
hist(permuted_diffs, breaks = "fd", probability = TRUE)
lines(density(permuted_diffs), lwd=2)
#Indicate the observed difference
abline(v = observed_diff, col = "red", lwd = 2)

c-ii)What are the one-tailed and two-tailed p-values of the permutation test?

p_1tailed <- sum(permuted_diffs > observed_diff) / nperms
p_2tailed <- sum(abs(permuted_diffs) > abs(observed_diff)) / nperms
p_1tailed
## [1] 0.0171
p_2tailed
## [1] 0.0171

With 10,000 permutations, both the one-tailed and two-tailed p-values are 0.0171, which means that 1.71% of the permuted differences were larger than the observed difference (in absolute value, for the two-tailed case).

c-iii)Would you reject the null hypothesis at 1% significance in a one-tailed test?

The one-tailed p-value is 0.0171, which is greater than 0.01. Therefore, we cannot reject the null hypothesis at 1% significance. There is not enough evidence to conclude that the mean response time for CLEC customers is greater than for ILEC customers.
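
Because the permutations are random, the p-value changes slightly from run to run unless a seed is fixed. The following is a minimal sketch of a reproducible permutation test. The function name perm_test(), the seeds and the simulated exponential data are all assumptions made for illustration; they are not part of the analysis above. It also counts ties with ">=", which is a common convention, whereas the code above uses ">".

```r
# Hypothetical helper: permutation test for mean(CLEC) - mean(ILEC).
# Note: set.seed() inside the function resets the global RNG state.
perm_test <- function(values, group, nperms = 2000, seed = 42) {
  set.seed(seed)
  diff_means <- function(v) mean(v[group == "CLEC"]) - mean(v[group == "ILEC"])
  observed <- diff_means(values)
  # Shuffle the values while keeping the group labels fixed.
  permuted <- replicate(nperms, diff_means(sample(values)))
  list(
    observed  = observed,
    p_1tailed = mean(permuted >= observed),
    p_2tailed = mean(abs(permuted) >= abs(observed))
  )
}

# Simulated, made-up data: a large group and a small group.
set.seed(1)
toy_values <- c(rexp(200, rate = 1 / 8), rexp(20, rate = 1 / 16))
toy_group  <- rep(c("ILEC", "CLEC"), times = c(200, 20))

run1 <- perm_test(toy_values, toy_group)
run2 <- perm_test(toy_values, toy_group)
print(identical(run1, run2))  # same seed, same result
```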

Question 3)

a)Compute the W statistic comparing the values.

gt_eq <- function(a, b) {
  ifelse(a > b, 1, 0) + ifelse(a == b, 0.5, 0)
}
W <- sum(outer(groups$CLEC, groups$ILEC, FUN = gt_eq));W
## [1] 26820

b)Compute the one-tailed p-value for W.

n1 <- length(groups$CLEC) # 23
n2 <- length(groups$ILEC) # 1664
wilcox_p_1tail <- 1 - pwilcox(W,n1,n2);wilcox_p_1tail
## [1] 0.0003688341

c)Run the Wilcoxon Test again using the wilcox.test() function in R

wilcox.test(groups$CLEC, groups$ILEC, alternative = "greater")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  groups$CLEC and groups$ILEC
## W = 26820, p-value = 0.0004565
## alternative hypothesis: true location shift is greater than 0

d)At 1% significance, and one-tailed, would you reject the null hypothesis that the values of CLEC and ILEC are similar?

Since the p-value is 0.0004565, which is less than 0.01, we can reject the null hypothesis. Therefore, the response times of CLEC customers tend to be greater than those of ILEC customers.
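
The manual p-value (0.0003688) and the wilcox.test() p-value (0.0004565) differ because, as its output header indicates, wilcox.test() used a normal approximation with continuity correction here, while pwilcox() uses the exact distribution of W without ties. The sketch below uses small made-up samples without ties, where wilcox.test() computes an exact p-value. It shows that W can be obtained from pairwise comparisons or from ranks, and that the exact one-tailed p-value P(W' >= W) corresponds to pwilcox(W - 1, ...). The sample values are assumptions for illustration only.

```r
# Hypothetical samples without ties (illustrative values only).
x <- c(24.2, 22.1, 18.6, 20.0, 14.1, 5.8)
y <- c(17.5, 2.4, 0.0, 0.65, 22.2, 1.2, 3.1, 8.0)
n1 <- length(x)
n2 <- length(y)

# W from pairwise comparisons: count x > y, with ties counted as 0.5.
W_pairs <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))

# W from ranks: rank sum of x in the pooled sample minus its minimum possible value.
r <- rank(c(x, y))
W_ranks <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Exact one-tailed p-value P(W' >= W) under the null hypothesis.
p_exact <- pwilcox(W_pairs - 1, n1, n2, lower.tail = FALSE)

# Cross-check against the built-in test (exact for small samples without ties).
res <- wilcox.test(x, y, alternative = "greater")
print(c(W_pairs = W_pairs, W_ranks = W_ranks, W_builtin = unname(res$statistic)))
print(c(p_exact = p_exact, p_builtin = res$p.value))
```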

Question 4)

a)Follow the following steps to create a function to see how a distribution of values compares to a perfectly normal distribution.

a)Make a function called norm_qq_plot() that takes a set of values:

norm_qq_plot <- function(values){
  probs1000 <- seq(0,1,0.001)
  q_vals <- quantile(values, probs = probs1000)
  q_norm <- qnorm(probs1000,mean = mean(values), sd = sd(values))
  plot(q_norm, q_vals, xlab="normal quantiles", ylab="values quantiles")
  abline( a = 0, b = 1 , col="red", lwd=2)
}

b)Confirm that your function works by running it against the values of our d123 distribution from week 3 and checking that it looks like the plot on the right:

set.seed(978234)
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
d123 <- c(d1, d2, d3)

par( mfrow= c(1,2) )
plot(density(d123))
norm_qq_plot(d123)

b)Interpret the plot you produced (see this article on how to interpret normal Q-Q plots) and tell us if it suggests whether d123 is normally distributed or not.

The Q-Q plot suggests that d123 is not normally distributed, as the points do not fall along the red line. Specifically, the points in the lower tail and the upper tail of the plot deviate above the red line, which indicates that the distribution is skewed to the right. This is consistent with how d123 was constructed: it combines three normal distributions with different means and sample sizes, so the combined values do not follow a single normal distribution.

c)Use your normal Q-Q plot function to check if the values from each of the CLEC and ILEC samples we compared in question 2 could be normally distributed. What’s your conclusion?

norm_qq_plot(groups$CLEC)

norm_qq_plot(groups$ILEC)

Neither the CLEC nor the ILEC sample appears to be normally distributed, because in both plots the points do not fall along the red line.

For CLEC, the points split into two parts, and the part with higher values contains fewer points than the part with lower values. This makes the distribution skewed to the right: the points in the upper tail of the plot deviate above the red line, while the lower tail flattens out.

For ILEC, the distribution is clearly skewed to the right: the points in the upper tail of the plot deviate far above the red line, while the lower tail is almost completely flat.
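
A numeric summary can complement the visual reading of a Q-Q plot. The sketch below computes the same quantile pairs as norm_qq_plot() without plotting them, and reports their correlation, which is close to 1 when the points lie near a straight line. The helper names qq_points() and qq_cor(), the simulated samples, and the choice to exclude the probabilities 0 and 1 (where qnorm() returns -Inf and Inf) are assumptions made for illustration.

```r
# Hypothetical helper: quantile pairs for a normal Q-Q comparison.
qq_points <- function(values, n_probs = 999) {
  probs <- seq_len(n_probs) / (n_probs + 1)  # 0.001, ..., 0.999 (endpoints excluded)
  data.frame(
    q_norm = qnorm(probs, mean = mean(values), sd = sd(values)),
    q_vals = unname(quantile(values, probs = probs))
  )
}

# Hypothetical helper: correlation between theoretical and sample quantiles.
qq_cor <- function(values) {
  pts <- qq_points(values)
  cor(pts$q_norm, pts$q_vals)
}

# Simulated, made-up samples: one normal, one right-skewed.
set.seed(1)
normal_sample <- rnorm(1000, mean = 10, sd = 2)
skewed_sample <- rexp(1000, rate = 1 / 8)

print(qq_cor(normal_sample))
print(qq_cor(skewed_sample))
```

The right-skewed sample should give a noticeably lower correlation than the normal one. The built-in qqnorm() and qqline() functions can be used as a further visual cross-check.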