Suppose that we know that a random variable \(X\) follows the distribution given below:
\[ \text{PDF}\left(X=x \mid \theta\right)= \frac{{2 \choose x}\theta^x\left(1-\theta\right)^{2-x}}{1-\left(1-\theta\right)^2}, \quad x \in \{1,2\} \]
Imagine that we observe a sample of \(\textbf{n}\) i.i.d. random variables \(\mathbf{x}=\) \(\left(x_{1}, \ldots, x_{n}\right)\) and want to model them using this distribution. Please use the concept of maximum likelihood to estimate the parameter \(\theta\). (Check the back of the unit formula sheet for differentiation rules)
Given information: the sample \(\mathbf{x}=(x_1, x_2, \ldots, x_n)\) consists of independent and identically distributed (i.i.d.) observations.
From the formula we can see that the number of trials is 2 and the value \(x = 0\) is excluded, so the random variable \(X\) can only take the values 1 and 2. We can therefore assume the sample contains \(k\) observations equal to 1 and \(m\) observations equal to 2, with \(k + m = n\).
The likelihood function \(L(\theta \mid \mathbf{x})\) is the joint probability of the sample given \(\theta\):
\[L(\theta) = \left( \frac{2\theta(1-\theta)}{1-(1-\theta)^2} \right)^k \left( \frac{\theta^2}{1-(1-\theta)^2} \right)^m\]
Convert to the log-likelihood function to simplify the differentiation.
\[\ln(L(\theta)) = k \ln\left( \frac{2\theta(1-\theta)}{1-(1-\theta)^2} \right) + m \ln\left( \frac{\theta^2}{1-(1-\theta)^2} \right)\]
For a sample of size \(n = 2\), there are three possible \((k, m)\) scenarios to analyse further:
---For k = 2 and m = 0:
\[\ln(L(\theta)) = 2 \ln\left( \frac{2\theta(1-\theta)}{1-(1-\theta)^2} \right)\]
---For k = 1 and m = 1:
\[\ln(L(\theta)) = \ln\left( \frac{2\theta(1-\theta)}{1-(1-\theta)^2} \right) + \ln\left( \frac{\theta^2}{1-(1-\theta)^2} \right)\]
---For k = 0 and m = 2:
\[\ln(L(\theta)) = 2 \ln\left( \frac{\theta^2}{1-(1-\theta)^2} \right)\]
Then, differentiate \(\ln L(\theta)\) with respect to \(\theta\) and set the derivative equal to 0 to find the critical points:
\[\frac{d}{d\theta} \ln(L(\theta)) = 0\]
We need to set each of the three derivatives to zero and solve for \(\theta\).
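Before solving numerically, it helps to note (an added cross-check) that with the simplification \(1-(1-\theta)^2=\theta(2-\theta)\), the score equation for general counts \(k\) and \(m\) has a closed-form root:
\[ \frac{d}{d\theta}\ln L(\theta) = \frac{m}{\theta} - \frac{k}{1-\theta} + \frac{k+m}{2-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{2m}{k+2m} \]
For the three scenarios below this gives \(\hat{\theta}=0\), \(\hat{\theta}=2/3\) and \(\hat{\theta}=1\); in the boundary cases (\(m=0\) or \(k=0\)) the maximum sits at an endpoint of \([0,1]\).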
To solve the three equations numerically, we run the following code in R:
---scenario 1 : k = 2, m = 0
---scenario 2 : k = 1, m = 1
---scenario 3 : k = 0, m = 2
# Define the negative log-likelihood function
nll <- function(theta, k, m) {
-k * log((2*theta*(1-theta))/(1-(1-theta)^2)) - m * log((theta^2)/(1-(1-theta)^2))
}
# Set the observed values of k and m for each scenario
scenarios <- list(
scenario1 = c(k=2, m=0),
scenario2 = c(k=1, m=1),
scenario3 = c(k=0, m=2)
)
# Find the theta that minimizes the negative log-likelihood for each scenario
for (scenario in names(scenarios)) {
result <- optimize(f = nll, interval = c(0, 1), k = scenarios[[scenario]]['k'], m = scenarios[[scenario]]['m'])
cat(scenario, ": Estimated Theta = ", result$minimum, "\n")
}
## scenario1 : Estimated Theta ≈ 0 (optimize stops just inside the lower boundary)
## scenario2 : Estimated Theta ≈ 0.667
## scenario3 : Estimated Theta ≈ 1 (optimize stops just inside the upper boundary)
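These numerical results can be checked against the closed-form estimator derived above; a minimal sketch (theta_hat is our helper name):
# Closed-form MLE derived above: theta_hat = 2m / (k + 2m)
theta_hat <- function(k, m) 2 * m / (k + 2 * m)
theta_hat(2, 0) # scenario 1: 0 (boundary)
theta_hat(1, 1) # scenario 2: 2/3
theta_hat(0, 2) # scenario 3: 1 (boundary)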
Someone wants to investigate the subscription rate p of Netflix in Monash University, and they take the average frequency of all surveyed people as an estimate \(\hat{p}\) for p. Now it is necessary to ensure that there is at least 90% certainty that the difference between the surveyed rate \(\hat{p}\) and the actual rate p is not more than 5%. How many people at least should take the survey?
The required sample size comes from the margin-of-error formula for a proportion:
\[ E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \quad\Rightarrow\quad n = \frac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{E^2} \]
From the given information, we get:
(1) The z-score for a 90% confidence level: with 5% in each tail, z is the 95th percentile of the standard normal distribution. From z-tables, z = 1.645.
(2) We take \(\hat{p} = 0.5\), the value that maximises \(\hat{p}(1-\hat{p})\) and therefore gives the most conservative (largest) sample size.
(3) The difference between the surveyed rate \(\hat{p}\) and the actual rate \(p\) must be no more than 5%, so the margin of error is \(E = 0.05\).
Then,
\[n = \frac{z^2\,\hat{p}(1-\hat{p})}{E^2} = \frac{1.645^2 \times 0.5 \times (1-0.5)}{0.05^2} = 270.6025\]
Rounding up gives \(n = 271\).
So, to meet the requirement, at least 271 students should take the survey.
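A minimal R sketch of this calculation (qnorm supplies the z-quantile; the variable names are illustrative):
z <- qnorm(0.95)  # 90% confidence: 5% in each tail
p_hat <- 0.5      # most conservative choice
E <- 0.05         # required margin of error
ceiling(z^2 * p_hat * (1 - p_hat) / E^2)  # 271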
In Module 2, we mentioned the use of the weak law of large numbers which tells us that the sample estimator will converge to the population parameter if we have sufficiently large number of observations (or sample size). In this question, we would like to see how big the sample size should be in order to get the approximation error down to a certain level.
Assuming that we have a Binomial random variable \(X \sim \text{Bin}(n, 0.8)\), what is the smallest sample size \(\textbf{n}\) such that
\[ P\left(\left|\frac{X}{n} - p\right| >0.1\right) < 0.01 \]
The given information:
p = 0.8
According to the weak law of large numbers (Chebyshev's inequality applied to the sample proportion), for any tolerance \(\varepsilon > 0\):
\[ P\left(\left|\frac{X}{n} - p\right| > \varepsilon\right) \le \frac{p(1-p)}{n\,\varepsilon^2} \]
Putting \(\varepsilon = 0.1\) and \(p = 0.8\) into the inequality, we require the bound to be at most 0.01:
\[\frac{0.8 \times (1-0.8)}{n \times 0.1^2} \le 0.01 \;\Rightarrow\; \frac{0.16}{0.01\,n} \le 0.01 \;\Rightarrow\; n \ge \frac{0.16}{0.0001} = 1600\]
So the smallest sample size is \(n = 1600\): this ensures the probability of the sample proportion differing from the population proportion by more than 0.1 is no more than 0.01, giving us high confidence that the sample mean is within 0.1 of the population mean.
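A small illustrative check in R that the Chebyshev bound reaches the 0.01 target exactly at n = 1600 (chebyshev_bound is our helper name):
# Chebyshev bound on P(|X/n - p| > eps)
chebyshev_bound <- function(n, p = 0.8, eps = 0.1) p * (1 - p) / (n * eps^2)
chebyshev_bound(1599) # just above 0.01
chebyshev_bound(1600) # exactly 0.01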
Suppose you are a technical data analyst for a rally car team. There are two brands of engine oil in your warehouse, namely Shell and Speedline. Your team conducted 40 experiments for each brand to test the length of time they can keep the racing engine working normally. The data has been saved in the csv file, Engine_oil_info.csv. Please note that the first two columns are about Shell and Speedline, while the last column is about Castrol, new engine oil that your team wants to buy but has not yet decided on. We assume the samples are distributed as per Gaussian.
getwd()
## [1] "D:/Monash/Statistics/Assessment 2"
setwd("D:/Monash/Statistics/Assessment 2")
# Import the dataset
data = read.csv('Engine_oil_info.csv')
# To calculate the mean of the first column:
mu_1 <- mean(data[, 1])
# To calculate the mean of the second column:
mu_2 <- mean(data[, 2])
# To calculate the mean of the last column:
mu_3 <- mean(data[, ncol(data)])
summary(data)
## Shell Speedline Castrol
## Min. :264.7 Min. :291.7 Min. :321.5
## 1st Qu.:309.4 1st Qu.:316.5 1st Qu.:333.7
## Median :324.5 Median :327.3 Median :340.9
## Mean :322.8 Mean :329.5 Mean :341.8
## 3rd Qu.:337.6 3rd Qu.:342.7 3rd Qu.:345.4
## Max. :365.1 Max. :378.3 Max. :381.9
#Get n from dataset
n <- nrow(data)
# Get the sample standard deviations of Shell, Speedline and Castrol
s1 <-sd(data[, 1])
s2 <-sd(data[, 2])
s3 <-sd(data[, ncol(data)])
# Standard error
se1 <- s1 / sqrt(n)
se2 <- s2 / sqrt(n)
se3 <- s3 / sqrt(n)
# Confidence level 95%
alpha <- 0.05
# Confidence Interval for mu_1
ci_lower1 <- mu_1 - qnorm(1 - alpha/2) * se1
ci_upper1 <- mu_1 + qnorm(1 - alpha/2) * se1
# Display the confidence interval
cat("Confidence Interval for Shell: [", ci_lower1, ", ", ci_upper1, "]\n")
## Confidence Interval for Shell: [ 315.8 , 329.7073 ]
# Confidence Interval for mu_2
ci_lower2 <- mu_2 - qnorm(1 - alpha/2) * se2
ci_upper2 <- mu_2 + qnorm(1 - alpha/2) * se2
# Display the confidence interval
cat("Confidence Interval for Speedline: [", ci_lower2, ", ", ci_upper2, "]\n")
## Confidence Interval for Speedline: [ 323.3844 , 335.6373 ]
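A side note (an addition, not part of the original answer): since σ is unknown and estimated by s, a t-interval using qt with n − 1 degrees of freedom would be the more exact choice; with n = 40 the difference from the normal quantile is small:
# t quantile vs normal quantile for a 95% two-sided interval
qnorm(0.975)  # 1.96
qt(0.975, 39) # about 2.02: the t-interval is only slightly wider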
P.S. \(\mu_1\) is for mean of Shell working hours; \(\mu_2\) is for mean of Speedline working hours; \(\mu_3\) is for Castrol working hours.
You want to analyse the working hours for both Shell and Speedline. Determine 95 percent two-sided confidence intervals for both Shell \(\mu_1\) and Speedline \(\mu_2\). Also explain: what does a 95% confidence interval mean? What will the range of the confidence interval look like when the sample size increases and why? Can you also relate this observation to the weak law of large numbers?
We can calculate the intervals in R as above, or manually as follows:
From the given dataset, the mean value of the working hours of Shell (mu_1) is:
mu_1 = 322.75
The mean value of the working hours of Speedline (mu_2) is:
mu_2 = 329.51
The sample size n = 40
The sample standard deviations of Shell: s1 = 22.44
The sample standard deviations of Speedline: s2 =19.77
For a 95% confidence interval, z = 1.96 from the z-tables.
The formula for the confidence interval (using the sample standard deviation s to estimate σ):
CI = μ ± z*(s/sqrt(n))
Putting all the given information into the formula, we get:
CI_shell = 322.75 ± 1.96 *(22.44/sqrt(40))
≈ 322.75 ± 6.95
≈ (315.8, 329.7)
CI_speedline = 329.51 ± 1.96 *(19.77/sqrt(40))
≈ 329.51 ± 6.13
≈ (323.38,335.64)
(1) Explanation of the meaning of a 95% confidence interval:
A 95% confidence interval provides an estimated range for the mean working hours of Shell and Speedline; we are 95% confident that the true population mean falls within this interval. More precisely, if we took a large number of samples from the same population and computed a 95% confidence interval from each, we would expect about 95% of those intervals to contain the true population mean.
(2) The effect of increasing sample size, and the reason (a numerical sketch follows after point (3)):
Increasing the sample size makes our estimate of the population parameter more precise. If the sample size increases, the standard error (SE) decreases, since SE is the standard deviation divided by the square root of the sample size. By the confidence interval formula, with everything else equal, a smaller standard error gives a narrower confidence interval; hence the estimate is more precise for a larger sample size.
(3) Relating the observations to the weak law of large numbers:
According to the weak law of large numbers, as the sample size increases, the sample mean converges in probability to the population mean. In other words, as the sample size grows, the probability that the sample means μ1 and μ2 are close to their population means increases, which is consistent with the narrowing intervals observed above.
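As the numerical sketch promised in point (2) (the helper name half_width is ours; the numbers reuse the Shell values above):
# Half-width of a 95% z-interval for given s and n
half_width <- function(s, n) qnorm(0.975) * s / sqrt(n)
half_width(22.44, 40)  # about 6.95, matching the Shell interval above
half_width(22.44, 160) # quadrupling n halves the width to about 3.48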
You want to know which engine oil works longer. Determine a 95 percent two-sided confidence interval for \(\mu_1-\mu_2\), and explain your opinion based on your calculations which oil works longer, Shell or Speedline. Please note we assume \(\sigma_1=\sigma_2\).
The given assumption is σ1 = σ2, so we can use the pooled-variance t procedure to determine the 95% confidence interval for μ1 - μ2. The sample sizes of Shell and Speedline are both 40, so n = m = 40.
The formula is
(μ1 - μ2) ± t(α/2) * Sp * sqrt(1/n + 1/m)
where Sp^2 = [(n-1)*s1^2 + (m-1)*s2^2]/(n+m-2)
(1) Calculate the pooled variance Sp^2:
Sp^2 = [(n-1)* s1^2 +(m-1) *s2^2]/(n+m-2)
= [(40-1)* 22.44^2 + (40-1) *19.77^2]/(40+40-2)
= 447.2
(2) Calculate the standard error:
SE = sqrt( Sp^2/n + Sp^2/m)
= sqrt(447.2/40 + 447.2/40)
≈ 4.73
(3) t value for 95% confidence interval:
df= n + m - 2 = 78
Checking the t-table for a 95% two-sided confidence interval with 78 degrees of freedom,
we get: t = 1.99
Then,
CI = (μ1 - μ2) ± t * Sp * sqrt(1/n + 1/m)
= (μ1 - μ2) ± t * SE
= (322.75 - 329.51) ± 1.99 * 4.73
= -6.76 ± 9.41
≈ (-16.17, 2.65)
So the 95% confidence interval for the difference in working hours between Shell and Speedline is (-16.17, 2.65). Since this interval contains zero, we do not have enough evidence to say which engine oil works longer.
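The same interval can be reproduced in R; a sketch reusing mu_1, mu_2, s1 and s2 from the earlier code block (group sizes are both 40):
# Pooled-variance 95% CI for mu_1 - mu_2 (equal group sizes)
sp2 <- (39 * s1^2 + 39 * s2^2) / 78          # pooled variance, df = 78
se_diff <- sqrt(sp2 * (1/40 + 1/40))         # standard error of the difference
t_crit <- qt(0.975, df = 78)                 # two-sided 95% critical value
(mu_1 - mu_2) + c(-1, 1) * t_crit * se_diff  # approximately (-16.2, 2.7)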
According to Monash University Rally regulation, only when the engine oil’s working hours is greater than or equal to 340 hours, this engine oil can be used in this rally. You need to do a hypothesis testing to see whether Speedline is qualified for this rally. You have null hypothesis \(H_0: \mu_2 \ge 340\). Write down the alternative hypothesis and report the findings of your hypothesis test based on the significance level 0.05 (Answer should include whether Speedline is qualified for the Monash University Rally).
Based on the requirement, the hypotheses for the test are:
H0: μ2 ≥ 340 (Null Hypothesis)
H1: μ2 < 340 (Alternative Hypothesis)
This is a one-tailed (lower-tail) test of whether Speedline's mean working hours are significantly less than 340 hours.
(1) Given information and formula
To conduct the hypothesis test at a significance level of α=0.05
μ2 = 329.51
μ0 =340 (assumed true mean under the null hypothesis)
s2 = 19.77
n = 40
Given that we have sample data for Speedline with unknown population variance, we use a one-sample t-test. The t-statistic is given by the formula:
t = (μ2 - μ0)/ (s2/ sqrt(n))
(2) Put given values into the formula:
t = (329.51−340)/(19.77/sqrt(40))
= −10.49 /3.12
≈ −3.36
(3) Check t Value and make a conclusion:
For a one-tailed test at a significance level of α = 0.05 with 39 degrees of freedom (df = n - 1 = 39), the critical value from the t-table is -1.68. The calculated t-value of -3.36 is less than -1.68, so we reject the null hypothesis at the 0.05 significance level. This provides sufficient evidence to conclude that the mean working hours of Speedline oil are less than 340. Therefore, according to this hypothesis test, Speedline is not qualified for the Monash University Rally.
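As a cross-check (a sketch using base R's t.test on the Speedline column):
# One-sided t-test of H0: mu >= 340 against H1: mu < 340
t.test(data$Speedline, mu = 340, alternative = "less")
# t should be about -3.36 with a p-value well below 0.05: reject H0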
Your team manager wants to buy Castrol engine oil because it is much cheaper than Shell and Speedline. But your team engineer believes that Castrol’s working hours are less than or equal to 340. Remove outlier(s) for Castrol data and do the hypothesis testing \(H_0: \mu_3 \le 340\). Do you think your team engineer’s statement is true? Report your findings based on the hypothesis testing with significance level 0.05.
# To remove the outliers for castrol data
# Extract Castrol data from dataset
castrol <- data$Castrol
# Calculate the First Quartile (Q1) and Third Quartile (Q3)
Q1 <- quantile(castrol, 0.25)
Q3 <- quantile(castrol, 0.75)
# Calculate Interquartile Range (IQR)
IQR <- Q3 - Q1
# Define the Lower and Upper Bound to identify outliers
Lower_Bound <- Q1 - 1.5 * IQR
Upper_Bound <- Q3 + 1.5 * IQR
# Identify and Remove Outliers
castrol_no_outliers <- castrol[castrol > Lower_Bound & castrol < Upper_Bound]
# Print the new vector without outliers
print(castrol_no_outliers)
## [1] 354.9593 345.2779 332.1468 350.9831 345.2200 329.8432 342.6543 345.2523
## [9] 349.0391 341.7141 333.7169 333.7501 336.9743 332.0897 354.4785 340.8265
## [17] 344.4123 344.9136 335.7612 334.5009 328.8976 323.9727 345.8608 340.3311
## [25] 336.1584 329.0019 334.6342 341.0712 330.2548 339.4112 340.2970 346.2773
## [33] 344.1075 357.9083 321.4973 329.6985 342.6102
# Mean of Castrol after outlier removal
mu_3 <- mean(castrol_no_outliers)
# Sample size after outlier removal
n <- length(castrol_no_outliers)
# Sample standard deviation of Castrol after outlier removal
s3 <- sd(castrol_no_outliers)
# Standard error
se3 <- s3 / sqrt(n)
cat("The sample size, mean value and standard deviation of Castrol are: [", n, ", ", round(mu_3, 2), ", ", round(s3, 2), "]\n")
## The sample size, mean value and standard deviation of Castrol are: [ 37 , 339.47 , 8.55 ]
(1) Using R code to remove the outliers: calculate the first quartile (Q1) and the third quartile (Q3), compute the interquartile range (IQR = Q3 - Q1), then use the lower bound (Q1 - 1.5*IQR) and the upper bound (Q3 + 1.5*IQR) to identify and remove the outliers.
(2) After removing the outliers, the sample size is 37, and the mean value and standard deviation of Castrol are:
n = 37
μ = 339.47
μ0 =340
s = 8.55
(3) Hypothesis testing:
H0: μ3 ≤ 340 (Null hypothesis: the mean working hours of Castrol oil are less than or equal to 340 hours)
H1: μ3 > 340 (Alternative hypothesis: the mean working hours of Castrol oil are greater than 340 hours)
(4) The t-statistic is given by formula:
t = (μ - μ0)/ (s/ sqrt(n))
Put given values into the formula:
t = (339.47−340)/(8.55/sqrt(37))
= −0.53 /1.41
≈ −0.38
For a one-tailed test at α = 0.05 with 36 degrees of freedom, the critical t-value is approximately 1.69 from the t-distribution table.
(5) Conclusion:
The calculated t-value (-0.38) is less than the critical t-value (1.69), so we cannot reject the null hypothesis.
Based on the provided sample data, there is not enough evidence to support the claim that the mean working hours of Castrol oil are greater than 340 hours. Therefore, the team engineer's belief that Castrol's working hours are less than or equal to 340 may well be valid.
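The equivalent base-R cross-check (a sketch, run on the outlier-free vector defined above):
# One-sided t-test of H0: mu <= 340 against H1: mu > 340
t.test(castrol_no_outliers, mu = 340, alternative = "greater")
# t should be about -0.38 with a large p-value: fail to reject H0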
Imagine you are having painful surgery while your body being heavily paralysed under general anaesthesia; however, you are awake and can’t express this to the surgeons. Ouch! It’s a real nightmare! After the surgery you eventually recover but you are determined to not let the same thing happen to other people. So you set out to create a consciousness metre that analyses brain activity and determines a person’s level of consciousness to make sure no one will experience similar pain while being paralysed.
You run some experiments on 11 people while they are undergoing general anaesthesia and record 8 neurophysiological variables from 5 different locations in the brain. You also simultaneously obtain a behaviour-based measure of consciousness level that takes on the value \(100\) when the person is fully awake. This value will decrease to \(0\) as the person consciousness level fades to full unconsciousness.
Your goal is to create a consciousness metre by building a regression model that uses the neurophysiological variables to predict consciousness level. Such a model could then be used during the patient's surgery to make sure that he/she is not awake by recording his/her neurophysiological variables. This is to predict the consciousness level and verify that the patient's consciousness level is below a sufficient threshold to ensure that the patient won't experience pain.
You have been provided with three datasets,
Regression_train.csv, Regression_test.csv, and
Regression_new.csv. Using these datasets, you hope to build
a model that can predict consciousness level using the other variables.
Regression_train.csv and Regression_new.csv
come with the ground-truth target label (i.e. consciousness level)
whereas Regression_test.csv comes with independent
variables (input information) only.
The information of the attributes for these datasets can be found
below:
- sub_ind: participant ID number
- channel.num: indexes of the brain location from where the neurophysiological recording is taken (please note that there are 5 locations)
- aEP: average synaptic connection strength from the local excitatory interneuron population to the local pyramidal neuron population
- aIP: average synaptic connection strength from the local inhibitory interneuron population to the local pyramidal neuron population
- aPE: average synaptic connection strength from the local pyramidal neuron population to the local excitatory interneuron population
- aPI: average synaptic connection strength from the local pyramidal neuron population to the local inhibitory interneuron population
- input: spatially averaged neurophysiological input to the local brain area being recorded
- v_es: average membrane potential of the local excitatory interneuron population
- v_ii: average membrane potential of the local inhibitory interneuron population
- v_pyr: average membrane potential of the local pyramidal neuron population
- consc_lev: consciousness level of the study participant represented by a number from 0 to 100, 0 indicating fully unconscious and 100 indicating fully conscious/awake.
It can be noted that aEP, aIP, aPE, aPI, input, v_es, v_ii and v_pyr are the 8 neurophysiological variables being recorded at each of the 5 locations (indexed by channel.num) within the brains of the 11 study participants (indexed by sub_ind).
PLEASE NOTE THAT THE USE OF LIBRARIES ARE PROHIBITED IN THESE QUESTIONS UNLESS STATED OTHERWISE, ANSWERS USING LIBRARIES WILL RECEIVE 0 MARKS
Please load the Regression_train.csv and fit a \(\textbf{multiple linear regression
model}\) with consciousness level being the target variable.
According to the summary table, which predictors do you think are
possibly associated with the target variable (use the significance level
of \(0.01\)), and which are the
Top 5 strongest predictors? Please write an R script to
automatically fetch and print this information.
NOTE: Manually doing the above tasks will result in 0 marks.
# ANSWER BLOCK
# Read in the data here
train <- read.csv("Regression_train.csv")
# Build the multiple linear regression model; remove sub_ind from the predictors as it is the participant ID number.
lm.fit <- lm(consc_lev ~ . -sub_ind, data = train)
# Get the summary
fit.summary <- summary(lm.fit)
# Write a function to fetch the significant predictors as well as the top 5 strongest predictors:
top.predictors <- function(fit.summary){
  # Drop the intercept row: it is not a predictor
  coefs <- fit.summary$coefficients
  coefs <- coefs[rownames(coefs) != "(Intercept)", ]
  # Predictors significant at the 0.01 level (column 4 of the table holds the p-values)
  p_value <- coefs[, 4]
  coef.imp <- names(p_value[p_value < 0.01])
  # Rank predictors by absolute t-value (column 3 of the table) and keep the top 5
  t_value <- abs(coefs[, 3])
  coef.most <- names(sort(t_value, decreasing = TRUE)[1:5])
  # Print out the results
  if(length(coef.imp) > 0){
    print(paste("The important features are: ", paste(coef.imp, collapse = ', ')))
  } else {
    print("No features are significant at the 0.01 level.")
  }
  print(paste("The top 5 most important features are: ", paste(coef.most, collapse = ', ')))
}
# Call the function to print the results
top.predictors(fit.summary)
## [1] "The important features are: (Intercept), channel.num, aEP, aPE, aPI, input, v_ii, v_pyr"
## [1] "The top 5 most important features are: (Intercept), v_pyr, v_es, v_ii, input, aIP"
Rather than calling the lm() function, you would like to
write your own function to do the least square
estimation for the simple linear regression model parameters \(\boldsymbol{\beta_0}\) and \(\boldsymbol{\beta_1}\). The function takes
two input arguments with the first being the dataset name and the second
the predictor name, and outputs the fitted linear model with the form:
\[\textbf{E}[\text{consciousness level}]=\hat{\beta}_{0}+\hat{\beta}_{1}x\]
Code up this function in R and apply it to the two predictors input and v_pyr separately, and explain the effect that those two variables have on consc_lev.
# ANSWER BLOCK
# Least squared estimator function
lsq <- function(dataset, predictor){
# Extracting predictor and response variable from the dataset
x <- dataset[[predictor]]
y <- dataset$consc_lev
# Number of observations
n <- length(y)
# Calculate beta_1 and beta_0 by using the least-squares formula
beta_1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
beta_0 <- mean(y) - beta_1 * mean(x)
# Return the results
return(paste0('E[consc_lev]=', round(beta_0, 3),'+', round(beta_1, 3),'*',predictor))
}
# Applying the lsq function to the predictors input and v_pyr
print(lsq(train, 'input'))
## [1] "E[consc_lev]=51.95+0.034*input"
print(lsq(train, 'v_pyr'))
## [1] "E[consc_lev]=55.438+0.252*v_pyr"
From the above results, we can see that both "input" and "v_pyr" are positively related to consc_lev: as input or v_pyr increases, the expected consciousness level also increases.
However, the effect sizes differ:
β1 for "input" = 0.034: for every additional unit of "input", the consciousness level is expected to increase by about 0.034;
β1 for "v_pyr" = 0.252: for every additional unit of "v_pyr", the consciousness level is expected to increase by about 0.252.
Comparing the two slopes, "v_pyr" has a stronger simple linear relationship with the response variable than "input".
β0 represents the expected value of consc_lev when the predictor is 0: the estimated consciousness level is 51.95 when "input" is 0, and 55.438 when "v_pyr" is 0.
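As a quick consistency sketch (an addition, not required by the question), the same estimates should be returned by R's built-in lm():
# These coefficients should match the lsq output up to rounding
coef(lm(consc_lev ~ input, data = train))
coef(lm(consc_lev ~ v_pyr, data = train))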
R squared from the summary table reflects that the full model doesn’t fit the training dataset well; thus, you try to quantify the error between the values of the ground-truth and those of the model prediction. You want to write a function to predict consciousness level with the given dataset and calculate the root mean squared error (rMSE) between the model predictions and the ground truths from the training data. Please test this function on the full model and the training dataset.
# ANSWER BLOCK
rmse <- function(model, dataset){
# Using the predict function to get the predicted values for consc_lev from the model
predictions <- predict(model, newdata = dataset)
# Calculating residuals
residuals <- dataset$consc_lev - predictions
# Squaring residuals, averaging them, and then taking the square root
rMSE <- sqrt(mean(residuals^2))
# Return the computed rMSE
return(rMSE)
}
# Testing the rmse function on your model and training dataset
print(rmse(lm.fit, train))
## [1] 30.63296
The root mean squared error of 30.63 for this model on the training dataset indicates that, on average, our model's predictions of consciousness level are off by about 30.63 units from the true values. Given that the consciousness level can vary from 0 to 100, an error of this size means there is a substantial average difference between the predicted and actual consciousness levels in the dataset.
You find the full model complicated and try to reduce the complexity by performing bidirectional stepwise regression with BIC.
Calculate the rMSE of this new model from the training data with the function that you implemented previously. Are there any findings you can make? Explain your findings in 100 words.
# ANSWER BLOCK
# Performing bidirectional stepwise regression using BIC criterion
sw.fit <- step(lm.fit, direction="both", k=log(nrow(train)))
## Start: AIC=300667.9
## consc_lev ~ (sub_ind + channel.num + aEP + aIP + aPE + aPI +
## input + v_es + v_ii + v_pyr) - sub_ind
##
## Df Sum of Sq RSS AIC
## - v_es 1 83 41208951 300657
## - aIP 1 2899 41211767 300660
## - v_ii 1 9830 41218698 300668
## <none> 41208868 300668
## - v_pyr 1 19641 41228509 300678
## - aPE 1 30352 41239220 300690
## - aPI 1 32159 41241027 300691
## - input 1 193514 41402382 300863
## - aEP 1 1123775 42332643 301839
## - channel.num 1 2726554 43935422 303471
##
## Step: AIC=300657.3
## consc_lev ~ channel.num + aEP + aIP + aPE + aPI + input + v_ii +
## v_pyr
##
## Df Sum of Sq RSS AIC
## - aIP 1 2884 41211835 300650
## <none> 41208951 300657
## - v_pyr 1 19675 41228625 300668
## + v_es 1 83 41208868 300668
## - v_ii 1 37003 41245953 300686
## - aPI 1 48096 41257047 300698
## - aPE 1 64992 41273943 300716
## - input 1 193515 41402466 300852
## - aEP 1 1124174 42333125 301829
## - channel.num 1 2728301 43937251 303462
##
## Step: AIC=300649.7
## consc_lev ~ channel.num + aEP + aPE + aPI + input + v_ii + v_pyr
##
## Df Sum of Sq RSS AIC
## <none> 41211835 300650
## + aIP 1 2884 41208951 300657
## + v_es 1 68 41211767 300660
## - v_pyr 1 22630 41234465 300663
## - v_ii 1 37293 41249128 300679
## - aPE 1 66505 41278340 300710
## - aPI 1 79403 41291238 300723
## - input 1 317629 41529464 300976
## - aEP 1 1132975 42344810 301830
## - channel.num 1 2837353 44049188 303563
# Display the model summary
summary(sw.fit)
##
## Call:
## lm(formula = consc_lev ~ channel.num + aEP + aPE + aPI + input +
## v_ii + v_pyr, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.710 -30.597 3.061 29.811 54.020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.200e+01 5.171e-01 139.235 < 2e-16 ***
## channel.num -5.407e-03 9.834e-05 -54.981 < 2e-16 ***
## aEP -6.421e-03 1.848e-04 -34.743 < 2e-16 ***
## aPE 8.551e-04 1.016e-04 8.418 < 2e-16 ***
## aPI -9.147e-04 9.945e-05 -9.198 < 2e-16 ***
## input 2.285e-02 1.242e-03 18.396 < 2e-16 ***
## v_ii 1.816e-01 2.882e-02 6.303 2.94e-10 ***
## v_pyr -1.660e+00 3.382e-01 -4.910 9.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.64 on 43907 degrees of freedom
## Multiple R-squared: 0.1071, Adjusted R-squared: 0.1069
## F-statistic: 752.2 on 7 and 43907 DF, p-value: < 2.2e-16
# print the rMSE of the new model on the training data
print(rmse(sw.fit, train))
## [1] 30.63406
(1) Based on the above stepwise selection results, the model starts with all predictors and iteratively removes and adds terms to find the model with the lowest BIC. The final model selected contains the predictors channel.num, aEP, aPE, aPI, input, v_ii and v_pyr (aIP and v_es are dropped).
(2) Each predictor's coefficient indicates its effect on the consciousness level:
Both channel.num and aEP are highly significant and have a substantial impact in the model; both have negative coefficients, suggesting an inverse relationship with consc_lev.
aPI also has a negative coefficient, though its magnitude is smaller than that of aEP.
aPE has a positive coefficient, indicating a direct relationship with consciousness level.
input has a positive coefficient, implying that higher input values are associated with higher consciousness levels.
v_ii has a positive coefficient, while v_pyr has a negative impact on consciousness level.
(3) Significance: All predictors are significant at the 0.001 level, suggesting they play a critical role in the model.
(4) The rMSE value remains 30.63, essentially identical to the full model, implying that the simplified model achieves almost the same predictive accuracy as the more complex full model. The R-squared value is 0.1071, meaning the model explains only about 10.71% of the variability in the consciousness level. This is quite low, suggesting that there might be influential predictors missing from the dataset, or non-linear relationships that a linear model cannot capture.
(5) Conclusion: The stepwise regression helped to simplify the model, but the predictors still account for only a small proportion of the variance in consciousness level. This highlights the complex nature of predicting consciousness level and suggests that it may be influenced by other unmeasured factors or non-linear relationships. The low R-squared suggests the need to investigate the data further, possibly considering other analytic techniques or additional data sources to improve predictive capability.
You have been given a new dataset Regression_new.csv
obtained from different recording sessions where the subjects only
experienced moderate reductions in consciousness. You are going to apply
the new model sw.fit on the new dataset to evaluate the
model performance with using rMSE. When you look into
rMSE, what do you find? If you think
sw.fit works well on Regression_new.csv,
please explain why it does well. Otherwise, if you think your model
sw.fit doesn’t perform well on
Regression_new.csv, can you point out the potential
reason(s) for this?
# ANSWER BLOCK
# Import the new dataset
new <- read.csv("Regression_new.csv")
# Finding out the rMSE of the sw.fit model with respect to the new dataset
print(rmse(sw.fit,new))
## [1] 34.83331
(1) From the calculated result, the rMSE on the new dataset is 34.83, which is higher than the rMSE of 30.63 obtained on the training dataset. Therefore, the model sw.fit might not be performing well on the new dataset.
(2) Potential reasons for not performing well:
--Data size and diversity: the new dataset is approximately half the size of the training dataset; if it is not sufficiently representative of the variability in the full population, the model may perform poorly on it. The training dataset of 43,915 observations may also contain more diverse patterns and trends, making it harder for a model trained on it to generalise to a smaller and potentially less diverse dataset of 22,302 observations.
--Different characteristics: the new dataset comes from different recording sessions, so the distributions of the training and new data may differ, causing a decrease in model performance on the new dataset.
--Overfitting: the model might be overfitting the training data, capturing noise as if it were real structure, which leads to poor generalisation to the new dataset.
--Moderate reductions in consciousness level: the subjects in the new dataset experienced only moderate reductions in consciousness. If the training data had more variability in consciousness levels, the model may be less accurate in this narrower range, leading to a higher rMSE.
(3) Conclusion:
The higher rMSE of 34.83 on the new dataset suggests that the model sw.fit might not be generalising well to new, unseen data. This could be due to overfitting, different distributions and characteristics of the new data, or limitations of the model itself. Model refinement, or retraining on more diverse data, might be needed to improve performance on unseen datasets.
Although stepwise regression has reduced the model complexity
significantly, the model still contains a lot of variables that we want
to remove. Therefore, you are interested in lightweight linear
regression models with ONLY TWO predictors. Write a script to
automatically find the best lightweight model which corresponds to the
model with the least rMSE on the training dataset (Not
the new dataset). Compare the rMSE of the best
lightweight model with the rMSE of the full model -
lm.fit - that you built previously. Give an explanation for
these results based on consideration of the predictors involved.
# ANSWER BLOCK
# Initialize the minimum error to a large number
minimum_error <- Inf
# Initialize the variable to store the names of the best features
features <- c()
# Extracting the names of the predictors from the dataset excluding sub_ind and the target variable consc_lev
predictors <- colnames(train)[!colnames(train) %in% c("consc_lev", "sub_ind")]
# Looping through all possible pairs of predictors
for(i in seq_along(predictors)) {
for(j in seq_along(predictors)) {
# Ensure we have a pair of different predictors
if(i >= j) next
# Extracting the pair of predictors
pair <- predictors[c(i, j)]
# Building the linear model with the selected pair of predictors
formula <- as.formula(paste("consc_lev ~", paste(pair, collapse = " + ")))
model <- lm(formula, data = train)
# Calculating the rMSE for the current model
predictions <- predict(model, newdata = train)
error <- sqrt(mean((predictions - train$consc_lev)^2))
# Comparing the rMSE with the minimum error to find the best model
if(error < minimum_error) {
minimum_error <- error
features <- pair
}
}
}
# Printing the results
print(paste('The best features are', paste(features, collapse = ' and '), '; and the rMSE is', minimum_error))
## [1] "The best features are channel.num and aEP ; and the rMSE is 30.8144854188303"
(1) From the output, the best lightweight linear regression model uses the pair of predictors channel.num and aEP, with an rMSE of 30.81. This is very close to the full model's rMSE of 30.63, which implies that the lightweight model is quite efficient: removing the additional variables did not significantly affect the model's predictive accuracy on the training data.
(2) Consideration of predictors:
--The best lightweight model has "channel.num" and "aEP" as its predictors. These variables show the strongest individual influence on the target variable consc_lev, which explains why a model with just these two variables achieves an rMSE close to that of the full model with many more predictors.
--The full model contains additional variables such as "aPE", "aPI", "input", "v_ii" and "v_pyr", and its rMSE is slightly better than the lightweight model's. However, the improvement might not justify the increased complexity of including more variables.
--Variables such as "v_ii" and "v_pyr", which are present in the full model but not in the lightweight model, contribute little to reducing the error, as evidenced by the minimal difference in rMSE between the two models.
--The predictors "channel.num" and "aEP" could be capturing most of the variability in consc_lev that is otherwise explained by the remaining predictors in the full model, making those predictors largely redundant in the presence of "channel.num" and "aEP".
(3) Conclusion:
The difference in rMSE between the full and lightweight models is small, which suggests that "channel.num" and "aEP" are the crucial predictors of consc_lev. The lightweight model with these two predictors provides a balanced trade-off between model complexity and predictive accuracy. The full model is only slightly more accurate, and it includes several additional variables that might not add substantial value in explaining the variability in the target variable. Therefore, given this analysis and the cost of model complexity, the lightweight model may be preferable when parsimony is desired.