Statistical Inference

K. Bautista | N. Menodiado

2023-07-11

Statistical Inference Case Study

Answer the following statistical analysis questions:

  1. Is there a significant difference in the G1, G2, and G3 exam grades for students coming from UP and Arneo at a 95% confidence level?
data_1 <- data[,c("school","G1")]
data_2 <- data[,c("school","G2")]
data_3 <- data[,c("school","G3")]

result1 <- t.test(G1~school,data_1)
result2 <- t.test(G2~school,data_2)
result3 <- t.test(G3~school,data_3)

Period_Grade <- c("G1","G2","G3")
conf_int <- c("(-1.3160832, 0.7842532)", "(-1.6624404, 0.4892748)", "(-1.9863568, 0.7020663)")
p_value <- c(0.6141,0.2799,0.3431)
remarks <- c("NSD", "NSD", "NSD")   # NSD = no significant difference
data.frame(Period_Grade,conf_int,p_value, remarks)
##   Period_Grade                 conf_int p_value remarks
## 1           G1  (-1.3160832, 0.7842532)  0.6141     NSD
## 2           G2  (-1.6624404, 0.4892748)  0.2799     NSD
## 3           G3  (-1.9863568, 0.7020663)  0.3431     NSD

Based on the p-values obtained for the G1, G2, and G3 exam grades, which are all greater than the 0.05 significance level, we conclude that there is no statistically significant difference (NSD) between the mean grades of the UP and Arneo groups for any exam period.
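
The three tests above follow one pattern, so they can also be run in a loop instead of column by column (the same approach applies to the sex comparison in the next question). A minimal sketch, assuming data holds the columns school, G1, G2, and G3:

grades <- c("G1", "G2", "G3")
results <- lapply(grades, function(g) {
  # Welch two-sample t-test of each grading period by school
  t.test(as.formula(paste(g, "~ school")), data = data)
})
# Collect the p-values into a named vector
setNames(sapply(results, function(r) r$p.value), grades)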

  2. Is sex a significant factor in the G1, G2, and G3 exam grades for students?
data_4 <- data[,c("sex","G1")]
data_5 <- data[,c("sex","G2")]
data_6 <- data[,c("sex","G3")]

result4 <- t.test(G1~sex,data_4)
result5 <- t.test(G2~sex,data_5)
result6 <- t.test(G3~sex,data_6)

Period_Grade <- c("G1","G2","G3")
conf_int <- c("(-1.26715575, 0.04764732)", "(-1.430978, 0.060092)", "(-1.85073226, -0.04545244)")
p_value <- c(0.06898,0.07144,0.03958)
remarks <- c("NSD", "NSD", "WSD")   # WSD = with significant difference
data.frame(Period_Grade,conf_int,p_value, remarks)
##   Period_Grade                    conf_int p_value remarks
## 1           G1   (-1.26715575, 0.04764732) 0.06898     NSD
## 2           G2       (-1.430978, 0.060092) 0.07144     NSD
## 3           G3  (-1.85073226, -0.04545244) 0.03958     WSD

Based on the summary, the p-values for the G1 and G2 exam periods are above 0.05, indicating that sex is not a statistically significant factor for those grades. However, the p-value for the G3 exam period (0.03958) is below 0.05, suggesting a significant difference in means between the female and male groups (WSD). Sex is therefore a significant factor for the G3 exam grades.

  3. Fit a Normal Distribution to the G2 scores. What are the parameters of this Normal Distribution?
data_mean = mean(data$G2)
data_sd = sd(data$G2)

# Simulate as many draws from the fitted normal as there are G2 scores
normal <- rnorm(length(data$G2), mean=data_mean, sd=data_sd)
hist(normal, freq=FALSE, main="Normal Distribution of G2 Scores", xlab="Scores", col="lavender")

x_norm <- seq(min(normal), max(normal), length=100)
y_norm <- dnorm(x_norm, mean=data_mean, sd=data_sd)

# Overlay the fitted density curve (freq=FALSE puts the histogram on the density scale)
lines(x_norm,y_norm,col="blue")

To fit a normal distribution to the G2 scores, the sample mean and standard deviation were taken as the distribution's parameters; rnorm() was then used to simulate draws from this fitted distribution for the plot above. The fitted parameters are:

## [1] "Distribution mean: 10.713924"
## [1] "Distribution standard deviation: 3.761505"
  4. Using the theory of the CLT, generate (set.seed(1)) 100,000 random samples that follow the normal distribution of the G2 scores. Plot these 100,000 random samples.
set.seed(1)
CLT<- rnorm(100000, data_mean, data_sd)

hist(CLT,
     main = "Distribution of 100,000 Random Samples",
     breaks = 50,
     freq = FALSE,
     col="lavender")

x <- seq(min(CLT), max(CLT), length=100)
y <- dnorm(x, mean=data_mean, sd=data_sd)

# Overlay the normal density curve (freq=FALSE puts the histogram on the density scale)
lines(x,y,col="blue")
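
As a quick sanity check, the empirical mean and standard deviation of the 100,000 simulated draws should land very close to the fitted parameters (roughly 10.71 and 3.76):

mean(CLT)   # should be close to data_mean, about 10.71
sd(CLT)     # should be close to data_sd, about 3.76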

  5. Generate a bootstrap sample (set.seed(2)) to increase the samples of Arneo students, since there are fewer Arneo students. Increase their samples such that the UP and Arneo students have the same number of rows. What R code is used here?
set.seed(2)
UP<-data[data$school == "UP",]
n_UP<- nrow(UP)

Arneo <- data[data$school == "Arneo",]
n_Arneo <- nrow(Arneo)

bootstrap_sample <- Arneo[sample(1:n_Arneo, size = n_UP, replace = TRUE), ]
UP_Arneo <- rbind(UP,bootstrap_sample)
rownames(UP_Arneo) <- 1:nrow(UP_Arneo)
length(which(UP_Arneo$school == "UP"))
## [1] 349
length(which(UP_Arneo$school == "Arneo"))
## [1] 349

Using sample() with replace = TRUE, we drew a bootstrap sample of the Arneo students equal in size to the UP group, so that both schools now contribute the same number of rows (349 each).
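
The same idea can be wrapped in a small helper function; boot_upsample below is an illustrative sketch (the name is ours, introduced for illustration):

# Bootstrap-resample a data frame up to n_target rows
boot_upsample <- function(df, n_target) {
  df[sample(seq_len(nrow(df)), size = n_target, replace = TRUE), ]
}
balanced <- rbind(UP, boot_upsample(Arneo, n_UP))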

The marketing group in an airline wanted to increase the number of business class seats sold on its off-peak flights. Key factors were identified as advertising level and pricing strategy. They trialed two advertising campaigns and three pricing strategies in geographically separate but demographically similar areas. As there was a small number of possible trials, they performed an experiment, as illustrated.

Do Advertising Level and Pricing Strategy affect the number of seats purchased?

In order to further analyze the given dataset, we check if the data (number of seats) are normally distributed.

seats = c(28,26,21,32,25,28,21,29,23,26,22,28,29,33,32,28,42,44,39,45,36,40,39,35)
qqnorm(seats)

shapiro.test(seats)
## 
##  Shapiro-Wilk normality test
## 
## data:  seats
## W = 0.94285, p-value = 0.1888

The Shapiro-Wilk p-value (0.1888) is above 0.05, so we fail to reject normality and can proceed with a parametric test. A two-way ANOVA was used to test the effects of the two treatments (advertising level and pricing strategy):

alevel = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2)
pstrat = c(1,1,1,1,2,2,2,2,3,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
y = data.frame(alevel,pstrat,seats)
y$alevel = as.factor(y$alevel)
y$pstrat = as.factor(y$pstrat)
anova_fit = aov(seats~alevel+pstrat, data = y)
summary(anova_fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## alevel       1  737.0   737.0  41.339 2.85e-06 ***
## pstrat       2  121.3    60.7   3.403   0.0535 .  
## Residuals   20  356.6    17.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing the p-values at the 0.05 level, advertising level has a highly significant effect on the number of seats sold (p ≈ 2.9e-06), while pricing strategy falls just short of significance (p ≈ 0.054). Advertising therefore plays the bigger role in selling business class seats.
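
Since this is a 2 x 3 factorial trial, a natural follow-up is to check whether the two factors interact; a sketch of that model (output not shown):

# Two-way ANOVA with an advertising-by-pricing interaction term
anova_int <- aov(seats ~ alevel * pstrat, data = y)
summary(anova_int)
# TukeyHSD(anova_int) would give pairwise comparisons for the significant factor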

A data scientist is investigating the usefulness of two Data Science languages (R and Python) in improving programming tasks. Twelve expert data scientists, familiar with both languages, are asked to code a standard function in each language, and the time (in minutes) is recorded. The data are as follows:

Find a 95% confidence interval on the difference in mean coding times. Is there any indication that one language is preferable?

Programmer <- c(1:12)
R <- c(17,16,21,14,18,24,16,14,21,23,13,18)
Python <- c(18,14,19,11,23,21,10,13,19,24,15,20)
data_coding <- data.frame(Programmer, R, Python)

# Null hypothesis: There is no significant difference between R and Python
# alpha rate: 0.05

# Paired t-test
coding_ttest <- t.test(data_coding$R, data_coding$Python, paired = TRUE, alternative = "two.sided", conf.level = 0.95)
coding_ttest
## 
##  Paired t-test
## 
## data:  data_coding$R and data_coding$Python
## t = 0.77904, df = 11, p-value = 0.4524
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -1.216846  2.550179
## sample estimates:
## mean difference 
##       0.6666667

Since the p-value (0.4524) is greater than 0.05 and the 95% confidence interval (-1.22, 2.55) contains zero, there is no significant difference between the mean coding times of the two languages, and no indication that one language is preferable to the other.
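
The reported interval can also be reproduced by hand from the paired differences, which makes the underlying formula, mean(d) +/- t(0.975, n-1) * sd(d)/sqrt(n), explicit:

d <- data_coding$R - data_coding$Python
n <- length(d)
# 95% CI for the mean paired difference; matches the t.test() interval above
mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)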

The time between calls to a hotline is exponentially distributed with a mean time between calls of 15 minutes.

The probability density of the exponential distribution is calculated and plotted:

mean_time = 15
decay = 1/mean_time   # rate parameter (lambda) of the exponential distribution
xexp = 1:100
yexp = dexp(xexp, rate = decay)   # equals decay * exp(-decay * xexp)

plot(xexp, yexp, xlab = "Time between Calls (min)", ylab = "Probability Density", main = "Probability Density")

  1. What is the probability that there are no calls within a 30-minute interval?
a = exp(-(decay*30))
a
## [1] 0.1353353
  2. What is the probability that at least one call arrives within a 10-minute interval?
b = 1-(exp(-decay*10))
b
## [1] 0.4865829
  3. What is the probability that the first call arrives between 5 and 10 minutes after opening?
prob = (1-(exp(-(decay*10)))) - (1-(exp(-(decay*5))))
prob
## [1] 0.2031142
  4. Determine the length of an interval of time such that the probability of at least one call in the interval is 0.90.
d = (log(1-0.9))/(-decay)
d
## [1] 34.53878
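
The four answers can be cross-checked against R's built-in exponential distribution functions:

pexp(30, decay, lower.tail = FALSE)   # (1) P(no call within 30 min), about 0.1353
pexp(10, decay)                       # (2) P(at least one call within 10 min), about 0.4866
pexp(10, decay) - pexp(5, decay)      # (3) P(first call between 5 and 10 min), about 0.2031
qexp(0.9, decay)                      # (4) interval with P(at least one call) = 0.9, about 34.54 min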

An experiment was performed to determine how the cutting speed of a tool v, in feet per minute, affected the lifetime of the tool t, in minutes. The operation was run at 40 to 110 feet per minute in 10-foot-per-minute increments, and tool life was recorded to the nearest minute. Three observations were taken at each speed and the order of the observations was completely randomized. The data are entered in the code below. The "*" indicates that the tool was broken before it wore out. Analyze the data to determine how tool life depends on speed. What speed gives the longest life?

Checking whether the data are normally distributed:

speed = c(40, 40, 40, 50, 50, 50, 60, 60, 60, 70, 70, 70, 80, 80, 80, 90, 90, 90, 100, 100, 100)
toollife = c(161, 153, 152, 144, 146, 148, 136, 138, 137, 137, 139, 128, 121, 118, 123,
             114, 119, 115, 108, 105, 104)
z = data.frame(speed, toollife)
shapiro.test(z$toollife)
## 
##  Shapiro-Wilk normality test
## 
## data:  z$toollife
## W = 0.95834, p-value = 0.4833
library(ggpubr)   # provides ggqqplot()
ggqqplot(z$speed)

ggqqplot(z$toollife)

The Shapiro-Wilk p-value (0.4833) is above 0.05 and the QQ plots track the reference line, so the data can be treated as normally distributed.

We can also calculate the correlation between speed and tool life as follows:

res <- cor.test(z$speed, z$toollife, 
                method = "pearson")
res
## 
##  Pearson's product-moment correlation
## 
## data:  z$speed and z$toollife
## t = -20.683, df = 19, p-value = 1.728e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9914124 -0.9467237
## sample estimates:
##        cor 
## -0.9785065

The results show a strong negative correlation between the two variables (r ≈ -0.98), and this correlation is highly significant (p ≈ 1.7e-14). We can also fit a linear regression to quantify the linear relationship between speed and tool life.

lin = lm(toollife~speed, data=z)
lin$coefficients
## (Intercept)       speed 
## 187.0952381  -0.8047619
plot(z$speed, z$toollife, xlab = "Speed", ylab = "Tool Life", main = "Tool Life vs Speed")
abline(lin, col = "red", lwd = 2)

Here we confirm that tool life decreases as speed increases: each additional foot per minute of cutting speed reduces expected tool life by about 0.80 minutes. Within the tested range, the slowest speed, 40 feet per minute, therefore gives the longest tool life.
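
Using the fitted coefficients, the expected tool life at any speed can be read off the regression line; for instance, at the slowest tested speed of 40 feet per minute:

# Predicted tool life at 40 ft/min: 187.095 - 0.805 * 40, about 154.9 minutes
predict(lin, newdata = data.frame(speed = 40))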