Part 1: Point Estimation [10 marks]

Question 1: Maximum Likelihood Estimation

Suppose that we know that a random variable \(X\) follows the distribution given below:

\[ \text{PDF}\left(X=x|\theta\right)= \frac{{2 \choose x}\theta^x\left(1-\theta\right)^{2-x}}{1-\left(1-\theta\right)^2}, \; x= \{1,2\} \]

Imagine that we observe a sample of \(\textbf{n}\) independent and identically distributed (i.i.d.) random variables \(\mathbf{x}=\left(x_{1}, \ldots, x_{n}\right)\) and want to model them using this distribution. Please use the concept of maximum likelihood to estimate the parameter \(\theta\).

Solution

The likelihood function \(L(\theta)\) for the observed data \(\mathbf{x}\) is given by:

\[ L(\theta) = \prod_{i=1}^{n} \text{PDF}(X = x_i|\theta) = \prod_{i=1}^{n} \frac{{2 \choose x_i} \theta^{x_i} (1-\theta)^{2-x_i}}{1 - (1-\theta)^2} \]

The log-likelihood function \(\ell(\theta)\) is then:

\[ \ell(\theta) = \sum_{i=1}^{n} \log \left( \frac{{2 \choose x_i} \theta^{x_i} (1-\theta)^{2-x_i}}{1 - (1-\theta)^2} \right) \]

We maximize this function to find the MLE for \(\theta\).

Step 1: Derive the log-likelihood function in R

# Define the log-likelihood function
log_likelihood <- function(theta, x) {
  # Calculate the log-likelihood based on the given formula
  n <- length(x)
  term1 <- sum(log(choose(2, x)))
  term2 <- sum(x * log(theta) + (2 - x) * log(1 - theta))
  term3 <- n * log(1 - (1 - theta)^2)
  
  # Return the log-likelihood value
  return(term1 + term2 - term3)
}

# Sample data, as an example
x <- c(1, 2, 1, 1, 2) # Replace with the actual sample data

# Calculate the log-likelihood for a sample theta value
theta <- 0.5
log_likelihood(theta, x)
## [1] -3.41362

Step 2: Maximize the log-likelihood to find the MLE of \(\theta\)

We will use optimization techniques in R to find the value of \(\theta\) that maximizes the log-likelihood function.

# Optimize the log-likelihood function to find the MLE for theta
mle_theta <- optim(
  par = 0.5, # Initial guess for theta
  fn = function(theta) -log_likelihood(theta, x), # Negative log-likelihood for minimization
  method = "Brent", # Optimization method
  lower = 0, upper = 1 # Theta bounds
)$par

# Print the MLE for theta
mle_theta
## [1] 0.5714286

Step 3: Interpretation

The maximum likelihood estimate (MLE) of the parameter \(\theta\) is approximately 0.5714286. This value of \(\theta\) is the one that maximizes the likelihood of observing the given sample data.

Conclusion

In this analysis, we derived the MLE for the parameter \(\theta\) by maximizing the log-likelihood function using R. The resulting estimate provides the best fit for the observed data under the given probability distribution.

Part 2: Central Limit Theorem [15 marks]

Question 1: (8 Marks)

Someone wants to investigate the subscription rate \(p\) of Netflix in Monash University, and they take the average frequency of all surveyed people as an estimate \(\hat{p}\) for \(p\). Now it is necessary to ensure that there is at least 90% certainty that the difference between the surveyed rate \(\hat{p}\) and the actual rate \(p\) is not more than 5%. How many people at least should take the survey?

Solution

We need to find the minimum sample size \(n\) required to estimate the proportion \(p\) with a 90% confidence level and a margin of error of 5%.

The formula for calculating the sample size for a proportion is given by:

\[ n = \frac{Z^2 \cdot p \cdot (1 - p)}{E^2} \]

Where:

  • \(Z\) is the Z-score corresponding to the desired confidence level (for 90% confidence, \(Z = 1.645\)).

  • \(p\) is the estimated proportion (we use 0.5 for maximum variability).

  • \(E\) is the margin of error (5%, or 0.05).

Step 1: Define the parameters in R

# Define the parameters
confidence_level <- 0.90
Z <- qnorm((1 + confidence_level) / 2) # Z-score for 90% confidence level
p <- 0.5 # Estimated proportion for maximum variability
E <- 0.05 # Margin of error

# Display the parameters
Z
## [1] 1.644854

Step 2: Calculate the sample size

# Calculate the minimum sample size
n <- (Z^2 * p * (1 - p)) / E^2

# Round up to the next whole number, as sample size must be a whole number
sample_size <- ceiling(n)

# Display the calculated sample size
sample_size
## [1] 271

Step 3: Interpretation

The minimum number of people required to take the survey to estimate the subscription rate \(p\) of Netflix at Monash University, with 90% confidence and a margin of error not exceeding 5%, is 271.

Question 2: (7 Marks)

In Module 2, we mentioned the use of the weak law of large numbers which tells us that the sample estimator will converge to the population parameter if we have a sufficiently large number of observations (or sample size). In this question, we would like to see how big the sample size should be in order to get the approximation error down to a certain level.

Assuming that we have a Binomial random variable \(X \sim \text{Bin}(n, 0.8)\), what is the smallest sample size \(n\) such that:

\[ P\left(\left|\frac{X}{n} - 0.8\right| >0.1\right) < 0.01 \]

Solution

We want to find the minimum sample size \(n\) such that the probability that the sample proportion \(\hat{p} = \frac{X}{n}\) deviates from the true proportion \(p = 0.8\) by more than 0.1 is less than 0.01.

Step 1: Use Normal Approximation

For large \(n\), the Binomial distribution \(X \sim \text{Bin}(n, p)\) can be approximated by a normal distribution with: \[ X \sim \mathcal{N}(np, np(1-p)) \]

The sample proportion \(\hat{p} = \frac{X}{n}\) can then be approximated by: \[ \hat{p} \sim \mathcal{N}\left(p, \frac{p(1-p)}{n}\right) \]

We want: \[ P\left(\left|\hat{p} - 0.8\right| > 0.1\right) < 0.01 \]

This is equivalent to finding \(n\) such that: \[ P\left(\left|\hat{p} - p\right| > 0.1\right) = 2 \times P\left(\hat{p} - p > 0.1\right) < 0.01 \]

Step 2: Set up the inequality

\[ P\left(\frac{|\hat{p} - 0.8|}{\sqrt{\frac{p(1-p)}{n}}} > \frac{0.1}{\sqrt{\frac{p(1-p)}{n}}}\right) < 0.01 \]

Let \(Z = \frac{\hat{p} - 0.8}{\sqrt{\frac{p(1-p)}{n}}}\) follow the standard normal distribution.

We need: \[ 2 \times P\left(Z > \frac{0.1}{\sqrt{\frac{p(1-p)}{n}}}\right) < 0.01 \]

Or: \[ P\left(Z > \frac{0.1}{\sqrt{\frac{p(1-p)}{n}}}\right) < 0.005 \]

Step 3: Find the required sample size in R

# Define the parameters
p <- 0.8  # True proportion
E <- 0.1  # Maximum allowable deviation
alpha <- 0.01  # Desired probability bound

# Calculate the critical value from Z table
Z_alpha <- qnorm(1 - alpha / 2)

# Define a function to find the minimum sample size
find_sample_size <- function(p, E, Z_alpha) {
  # Calculate sample size using formula
  n <- (Z_alpha^2 * p * (1 - p)) / E^2
  return(ceiling(n))  # Return the smallest integer greater than or equal to n
}

# Calculate the required sample size
required_sample_size <- find_sample_size(p, E, Z_alpha)

# Display the required sample size
required_sample_size
## [1] 107

Step 4: Interpretation

The minimum sample size required to ensure that the probability of the sample proportion deviating from the true proportion \(p = 0.8\) by more than 0.1 is less than 0.01 is 107.

Part 3: Confidence Interval Estimation & Hypothesis Testing [25 marks]

Question 1: (5 Marks)

You want to analyze the working hours for both Shell and Speedline. Determine 95 percent two-sided confidence intervals for both Shell (\(\mu_1\)) and Speedline (\(\mu_2\)). Also explain what a 95% confidence interval means. What will the range of the confidence interval look like when the sample size increases and why? Can you also relate this observation to the weak law of large numbers?

Solution

Step 1: Load the dataset

First, we will load the data from the provided Engine_oil_info.csv file and take a look at the structure of the data.

# Load necessary libraries
library(readr)

# Load the dataset
engine_oil_data <- read_csv("Engine_oil_info.csv")
## Rows: 40 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): Shell, Speedline, Castrol
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(engine_oil_data)
## # A tibble: 6 × 3
##   Shell Speedline Castrol
##   <dbl>     <dbl>   <dbl>
## 1  326.      324.    355.
## 2  340.      331.    364.
## 3  334.      333.    345.
## 4  292.      315.    332.
## 5  340.      309.    351.
## 6  337.      307.    345.

Step 2: Calculate 95% Confidence Intervals for Shell and Speedline

We will calculate the 95% confidence intervals for the mean working hours of both Shell and Speedline engine oils.

# Extract Shell and Speedline data
shell_data <- engine_oil_data$Shell
speedline_data <- engine_oil_data$Speedline

# Calculate sample means and standard errors
mean_shell <- mean(shell_data)
sd_shell <- sd(shell_data)
n_shell <- length(shell_data)

mean_speedline <- mean(speedline_data)
sd_speedline <- sd(speedline_data)
n_speedline <- length(speedline_data)

# Calculate 95% confidence intervals
error_margin_shell <- qnorm(0.975) * (sd_shell / sqrt(n_shell))
ci_shell <- c(mean_shell - error_margin_shell, mean_shell + error_margin_shell)

error_margin_speedline <- qnorm(0.975) * (sd_speedline / sqrt(n_speedline))
ci_speedline <- c(mean_speedline - error_margin_speedline, mean_speedline + error_margin_speedline)

# Display the results
ci_shell
## [1] 315.8000 329.7073
ci_speedline
## [1] 323.3844 335.6373

Step 3: Interpretation of 95% Confidence Interval

A 95% confidence interval means that we are 95% confident that the true population mean falls within this interval. If we were to take many samples and compute the confidence interval for each of them, approximately 95% of these intervals would contain the true mean.

The 95% confidence interval for Speedline is (323.38, 335.64).

The 95% confidence interval for Shell is (315.8, 329.71).

Step 4: Effect of Increasing Sample Size

As the sample size n increases, the standard error of the mean decreases, resulting in a narrower confidence interval. This is because the margin of error is inversely proportional to the square root of the sample size. With a larger sample size, the confidence interval will be more precise, indicating a smaller range of uncertainty around the sample mean.

Step 5: Relation to the Weak Law of Large Numbers

The weak law of large numbers states that as the sample size increases, the sample mean converges to the population mean. This implies that with larger sample sizes, our confidence interval not only becomes narrower but also more accurate, reflecting the true population mean more reliably.

Question 2: (7 Marks)

You want to know which engine oil works longer. Determine a 95 percent two-sided confidence interval for \(\mu_1 - \mu_2\), and explain your opinion based on your calculations which oil works longer, Shell or Speedline. Please note we assume \(\sigma_1 = \sigma_2\).

Solution

Since we assume equal variances (\(\sigma_1 = \sigma_2\)), we will use the pooled standard deviation to calculate the confidence interval for the difference in means.

1. Pooled Standard Deviation Formula

The pooled standard deviation (\(s_p\)) is calculated using the formula:

\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

where: - \(s_1\) and \(s_2\) are the sample standard deviations of Shell and Speedline, respectively. - \(n_1\) and \(n_2\) are the sample sizes of Shell and Speedline, respectively.

2. Standard Error Formula

The standard error (SE) for the difference in means is calculated as follows:

\[ \text{SE} = s_p \times \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

3. Confidence Interval Formula

The 95% confidence interval for the difference in means (\(\mu_1 - \mu_2\)) is given by:

\[ \text{Confidence Interval} = \left( (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, n_1 + n_2 - 2} \times \text{SE} \right) \]

where: - \(\bar{x}_1 - \bar{x}_2\) is the difference in sample means. - \(t_{\alpha/2, n_1 + n_2 - 2}\) is the critical value from the t-distribution with \(n_1 + n_2 - 2\) degrees of freedom. - \(\text{SE}\) is the standard error calculated above.

Calculation in R:

# Calculate the pooled standard deviation
pooled_sd <- sqrt(((n_shell - 1) * sd_shell^2 + (n_speedline - 1) * sd_speedline^2) / (n_shell + n_speedline - 2))
pooled_sd
## [1] 21.14615
# Calculate the standard error for the difference in means
standard_error <- pooled_sd * sqrt((1 / n_shell) + (1 / n_speedline))
standard_error
## [1] 4.728423
# Calculate the t critical value for 95% confidence
t_critical <- qt(0.975, df = n_shell + n_speedline - 2)
t_critical
## [1] 1.990847
# Calculate the margin of error
margin_of_error <- t_critical * standard_error
margin_of_error
## [1] 9.413568
# Calculate the confidence interval for the difference in means
difference_in_means <- mean_shell - mean_speedline
ci_diff_means <- c(difference_in_means - margin_of_error, difference_in_means + margin_of_error)

# Display the results
difference_in_means
## [1] -6.757205
ci_diff_means
## [1] -16.170773   2.656363

Step 4: Interpretation of Results

The 95% confidence interval for the difference in mean working hours between Shell and Speedline is from -16.17 to 2.66.

If this confidence interval does not contain zero, it indicates that there is a significant difference between the working hours of Shell and Speedline engine oils.

Interpretation:

If the interval is entirely above zero, it means that Shell engine oil works significantly longer than Speedline.

If the interval is entirely below zero, it means that Speedline engine oil works significantly longer than Shell.

If the interval includes zero, it means there is no significant difference between the working hours of the two oils.

Since the confidence interval [-16.17,2.66] includes zero, we cannot conclude that there is a significant difference in the mean working hours between Shell and Speedline engine oils. This suggests that, based on the data and the given confidence level, the working hours of the two oils are not significantly different from each other.

Question 3: (5 Marks)

According to Monash University Rally regulation, only when the engine oil’s working hours is greater than or equal to 340 hours, this engine oil can be used in this rally. You need to do a hypothesis test to see whether Speedline is qualified for this rally. You have null hypothesis \(H_0: \mu_2 \geq 340\). Write down the alternative hypothesis and report the findings of your hypothesis test based on the significance level 0.05 (Answer should include whether Speedline is qualified for the Monash University Rally).

Solution

Step 1: State the Hypotheses

We will perform a one-sample t-test to determine if the mean working hours of Speedline engine oil is less than 340 hours.

  • Null Hypothesis (\(H_0\)): The mean working hours of Speedline engine oil is greater than or equal to 340 hours. \[ H_0: \mu_2 \geq 340 \]

  • Alternative Hypothesis (\(H_a\)): The mean working hours of Speedline engine oil is less than 340 hours. \[ H_a: \mu_2 < 340 \]

Step 2: Calculate the Test Statistic

The test statistic for a one-sample t-test is calculated as follows:

\[ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]

Where: - \(\bar{x}\) = sample mean of Speedline engine oil working hours. - \(\mu_0\) = hypothesized population mean (340 hours). - \(s\) = sample standard deviation of Speedline engine oil working hours. - \(n\) = sample size.

# Hypothesized population mean
mu_0 <- 340

# Calculate the test statistic
t_statistic <- (mean_speedline - mu_0) / (sd_speedline / sqrt(n_speedline))
t_statistic
## [1] -3.355664

Step 3: Calculate the p-value

The p-value for a one-sample t-test represents the probability of observing a test statistic as extreme as, or more extreme than, the value calculated under the null hypothesis. It helps us determine the likelihood of the sample mean being significantly different from the hypothesized population mean.

For a left-tailed test (since our alternative hypothesis is \(H_a: \mu_2 < 340\)), the p-value is calculated as:

\[ \text{p-value} = P(T \leq t) \]

Where: - \(T\) follows a t-distribution with \(n - 1\) degrees of freedom. - \(t\) is the calculated test statistic from Step 2.

In this context, the p-value is the probability that the t-distribution with \(n - 1\) degrees of freedom falls to the left of the calculated test statistic \(t\).

If the p-value is less than the significance level (\(\alpha = 0.05\)), we have enough evidence to reject the null hypothesis \(H_0\).

# Calculate the p-value
p_value <- pt(t_statistic, df = n_speedline - 1)
p_value
## [1] 0.0008872435

Step 4: Conclusion

We compare the p-value with the significance level (\(\alpha = 0.05\)) to make a decision:

  • If \(p\text{-value} < \alpha\), we reject the null hypothesis \(H_0\). This means there is sufficient evidence to conclude that the mean working hours of Speedline engine oil is less than 340 hours, and thus, it is not qualified for the Monash University Rally.

  • If \(p\text{-value} \geq \alpha\), we fail to reject the null hypothesis \(H_0\). This means there is not enough evidence to conclude that the mean working hours of Speedline engine oil is less than 340 hours, so it is considered qualified for the Monash University Rally.

Decision Rule:

\[ \text{If } p\text{-value} < \alpha, \text{ reject } H_0. \] \[ \text{If } p\text{-value} \geq \alpha, \text{ fail to reject } H_0. \]

R Code Implementation:

# Significance level
alpha <- 0.05

# Decision
if (p_value < alpha) {
  result <- "Reject H_0: Speedline engine oil is not qualified for the Monash University Rally."
} else {
  result <- "Fail to reject H_0: Speedline engine oil is qualified for the Monash University Rally."
}

result
## [1] "Reject H_0: Speedline engine oil is not qualified for the Monash University Rally."

Step 5: Interpretation of Results

Based on the p-value of r round(p_value, 4), and the significance level of 0.05:

Decision: Reject H_0: Speedline engine oil is not qualified for the Monash University Rally.

Question 4: (8 Marks)

Your team manager wants to buy Castrol engine oil because it is much cheaper than Shell and Speedline. But your team engineer believes that Castrol’s working hours are less than or equal to 340. Remove outlier(s) for Castrol data and do the hypothesis testing \(H_0: \mu_3 \leq 340\). Do you think your team engineer’s statement is true? Report your findings based on the hypothesis testing with significance level 0.05.

Solution

Step 1: Identify and Remove Outliers for Castrol

Outliers can be detected using the Interquartile Range (IQR) method. Values outside the range defined by \([Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}]\) are considered outliers.

  1. Calculate the Interquartile Range (IQR): \[ \text{IQR} = Q3 - Q1 \] Where:

    • \(Q1\) is the first quartile (25th percentile).
    • \(Q3\) is the third quartile (75th percentile).
  2. Lower Bound: \[ \text{Lower Bound} = Q1 - 1.5 \times \text{IQR} \]

  3. Upper Bound: \[ \text{Upper Bound} = Q3 + 1.5 \times \text{IQR} \]

Any value outside this range is considered an outlier and should be removed before further analysis.

# Extract Castrol data
castrol_data <- engine_oil_data$Castrol

# Calculate Q1, Q3, and IQR
Q1 <- quantile(castrol_data, 0.25)
Q3 <- quantile(castrol_data, 0.75)
IQR <- Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify and remove outliers
castrol_data_no_outliers <- castrol_data[castrol_data >= lower_bound & castrol_data <= upper_bound]

# Display the data before and after removing outliers
castrol_data
##  [1] 354.9593 364.1133 345.2779 332.1468 350.9831 345.2200 329.8432 342.6543
##  [9] 345.2523 349.0391 341.7141 333.7169 333.7501 336.9743 363.7730 332.0897
## [17] 354.4785 340.8265 344.4123 344.9136 335.7612 334.5009 328.8976 323.9727
## [25] 345.8608 340.3311 381.8690 336.1584 329.0019 334.6342 341.0712 330.2548
## [33] 339.4112 340.2970 346.2773 344.1075 357.9083 321.4973 329.6985 342.6102
castrol_data_no_outliers
##  [1] 354.9593 345.2779 332.1468 350.9831 345.2200 329.8432 342.6543 345.2523
##  [9] 349.0391 341.7141 333.7169 333.7501 336.9743 332.0897 354.4785 340.8265
## [17] 344.4123 344.9136 335.7612 334.5009 328.8976 323.9727 345.8608 340.3311
## [25] 336.1584 329.0019 334.6342 341.0712 330.2548 339.4112 340.2970 346.2773
## [33] 344.1075 357.9083 321.4973 329.6985 342.6102

Step 2: State the Hypotheses

We will perform a one-sample t-test to determine if the mean working hours of Castrol engine oil is greater than 340 hours after removing outliers.

  • Null Hypothesis (\(H_0\)): \[ H_0: \mu_3 \leq 340 \] This hypothesis states that the mean working hours of Castrol engine oil is less than or equal to 340 hours.

  • Alternative Hypothesis (\(H_a\)): \[ H_a: \mu_3 > 340 \] This hypothesis states that the mean working hours of Castrol engine oil is greater than 340 hours.

Step 4: Calculate the Test Statistic

The test statistic for a one-sample t-test is calculated using the formula:

\[ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]

Where: - \(\bar{x}\) = sample mean of Castrol engine oil working hours (after removing outliers). - \(\mu_0\) = hypothesized population mean (340 hours). - \(s\) = sample standard deviation of Castrol engine oil working hours (after removing outliers). - \(n\) = sample size (after removing outliers).

# Calculate sample statistics for Castrol after removing outliers
mean_castrol <- mean(castrol_data_no_outliers)
sd_castrol <- sd(castrol_data_no_outliers)
n_castrol <- length(castrol_data_no_outliers)

# Hypothesized population mean
mu_0 <- 340

# Calculate the test statistic
t_statistic <- (mean_castrol - mu_0) / (sd_castrol / sqrt(n_castrol))
t_statistic
## [1] -0.3747254

Step 5: Calculate the p-value

For a right-tailed one-sample t-test, the p-value is calculated as:

\[ \text{p-value} = P(T \geq t) \]

Where: - \(T\) follows a t-distribution with \(n - 1\) degrees of freedom. - \(t\) is the calculated test statistic from Step 3.

Since it is a right-tailed test, the p-value is the area under the t-distribution curve to the right of the calculated test statistic \(t\).

# Calculate the p-value for the right-tailed test
p_value <- 1 - pt(t_statistic, df = n_castrol - 1)
p_value
## [1] 0.6449684

Step 6: Decision Rule

We compare the p-value with the significance level (\(\alpha = 0.05\)) to make a decision:

\[ \text{Decision} = \begin{cases} \text{Reject } H_0 & \text{if } \text{p-value} < \alpha \\ \text{Fail to reject } H_0 & \text{if } \text{p-value} \geq \alpha \end{cases} \]

  • If \(p\text{-value} < \alpha\):
    • Reject the null hypothesis \(H_0\). This means there is sufficient evidence to conclude that the mean working hours of Castrol engine oil is greater than 340 hours.
  • If \(p\text{-value} \geq \alpha\):
    • Fail to reject the null hypothesis \(H_0\). This means there is not enough evidence to conclude that the mean working hours of Castrol engine oil is greater than 340 hours.
# Significance level
alpha <- 0.05

# Decision
if (p_value < alpha) {
  result <- "Reject H_0: The mean working hours of Castrol engine oil is greater than 340 hours."
} else {
  result <- "Fail to reject H_0: The mean working hours of Castrol engine oil is less than or equal to 340 hours."
}

result
## [1] "Fail to reject H_0: The mean working hours of Castrol engine oil is less than or equal to 340 hours."

Step 7: Interpretation of Results

Based on the p-value of r round(p_value, 4) and the significance level of 0.05:

Decision: Fail to reject H_0: The mean working hours of Castrol engine oil is less than or equal to 340 hours.

Part 4: Linear Regression Model [25 marks]

Question 1: (5 Marks)

Please load the Regression_train.csv and fit a multiple linear regression model with consciousness level being the target variable. According to the summary table, which predictors do you think are possibly associated with the target variable (use the significance level of 0.01), and which are the Top 5 strongest predictors? Please write an R script to automatically fetch and print this information.

Solution

Step 1: Load the Dataset

First, we will unzip the uploaded file and load the Regression_train.csv dataset to analyze the data.

# Load necessary libraries
library(readr)  # For reading CSV files


# Load the dataset
train_data <- read_csv("Regression_train.csv")
## Rows: 43915 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): sub_ind, channel.num, aEP, aIP, aPE, aPI, input, v_es, v_ii, v_pyr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the dataset
head(train_data)
## # A tibble: 6 × 11
##   sub_ind channel.num   aEP   aIP   aPE   aPI input  v_es  v_ii   v_pyr
##     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1       7         590  440. -481.  556.  130. 307.   7.87 1.85  -0.330 
## 2       7         590  833. -913. 1053.  247. 576.  12.0  2.83   0.0583
## 3       7         590  879. -981. 1130.  260. 588.  10.3  2.39  -3.44  
## 4       7         590  825. -945. 1071.  217. 282.   5.81 1.27  -0.521 
## 5       7         590  688. -889. 1074.  139. 150.   1.48 0.179  0.261 
## 6       7         590  630. -862. 1047.  166.  89.0  4.22 0.771 -0.431 
## # ℹ 1 more variable: consc_lev <dbl>

Step 2: Fit a Multiple Linear Regression Model

We will use the lm() function in R to fit a multiple linear regression model with consc_lev as the target variable.

# Fit the multiple linear regression model
model <- lm(consc_lev ~ ., data = train_data)

# Display the summary of the model
summary(model)
## 
## Call:
## lm(formula = consc_lev ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.824 -30.170   5.415  28.612  57.056 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.273e+01  6.578e-01  95.373  < 2e-16 ***
## sub_ind      5.814e-01  2.220e-02  26.187  < 2e-16 ***
## channel.num -5.339e-03  9.904e-05 -53.906  < 2e-16 ***
## aEP         -6.753e-03  1.859e-04 -36.326  < 2e-16 ***
## aIP         -1.283e-03  7.322e-04  -1.752  0.07974 .  
## aPE          8.189e-04  1.424e-04   5.749 9.01e-09 ***
## aPI         -1.082e-03  1.352e-04  -8.004 1.23e-15 ***
## input        2.216e-02  1.481e-03  14.968  < 2e-16 ***
## v_es        -1.738e-01  6.513e-02  -2.668  0.00763 ** 
## v_ii         2.959e-01  5.178e-02   5.714 1.11e-08 ***
## v_pyr       -1.585e+00  3.397e-01  -4.665 3.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.4 on 43904 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.1207 
## F-statistic: 603.6 on 10 and 43904 DF,  p-value: < 2.2e-16

Step 3: Identify Significant Predictors

We will extract the p-values of the predictors from the summary table and identify those with a significance level less than 0.01. These predictors are considered significantly associated with the target variable.

Formula for Significance Testing:

If p-value<0.01, the predictor is considered significant.

# Extract p-values from the model summary
summary_model <- summary(model)
p_values <- summary_model$coefficients[, 4]

# Identify significant predictors (p-value < 0.01)
significant_predictors <- names(p_values[p_values < 0.01])
significant_predictors
##  [1] "(Intercept)" "sub_ind"     "channel.num" "aEP"         "aPE"        
##  [6] "aPI"         "input"       "v_es"        "v_ii"        "v_pyr"

Step 4: Identify Top 5 Strongest Predictors

We will identify the top 5 predictors based on the absolute value of the standardized coefficients (beta values) from the model summary.

Formula for Top Predictors:

Rank the predictors based on the absolute value of their coefficients.

# Extract coefficients from the model
coefficients <- summary_model$coefficients[, 1]

# Identify the top 5 strongest predictors based on absolute values of coefficients
top_5_predictors <- names(sort(abs(coefficients), decreasing = TRUE)[1:5])
top_5_predictors
## [1] "(Intercept)" "v_pyr"       "sub_ind"     "v_ii"        "v_es"

Step 5: Summary of Results

Significant Predictors: (Intercept), sub_ind, channel.num, aEP, aPE, aPI, input, v_es, v_ii, v_pyr

Top 5 Strongest Predictors: (Intercept), v_pyr, sub_ind, v_ii, v_es

Question 2: (5 Marks)

Rather than calling the lm() function, you would like to write your own function to do the least square estimation for the simple linear regression model parameters \(\beta_0\) and \(\beta_1\). The function takes two input arguments with the first being the dataset name and the second the predictor name, and outputs the fitted linear model with the form:

\[ E[\text{consciousness level}] = \hat{\beta}_0 + \hat{\beta}_1 x \]

Code up this function in R and apply it to the two predictors input and v_pyr separately, and explain the effect that those two variables have on consc_lev.

Solution

Step 1: Deriving the Least Square Estimators

For simple linear regression, we aim to find the coefficients \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals. The regression model is given by:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Where: - \(y\) is the response variable (consc_lev). - \(x\) is the predictor variable. - \(\epsilon\) is the error term.

Slope (\(\hat{\beta}_1\)):

The formula for the slope \(\hat{\beta}_1\) is:

\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]

Where: - \(x_i\) = value of the predictor variable. - \(y_i\) = value of the response variable. - \(\bar{x}\) = mean of the predictor variable. - \(\bar{y}\) = mean of the response variable.

Intercept (\(\hat{\beta}_0\)):

The formula for the intercept \(\hat{\beta}_0\) is:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

Where: - \(\bar{y}\) is the mean of the response variable (consc_lev). - \(\bar{x}\) is the mean of the predictor variable.

These formulas help us estimate the linear relationship between the predictor and the response variable by minimizing the sum of squared residuals.

Step 3: Implementing the Least Square Estimation Function

We will now implement the function lsq() that calculates these estimators for a given dataset and predictor.

# Least squares estimator function
lsq <- function(dataset, predictor) {
  # Extract the response and predictor variables
  y <- dataset$consc_lev  # Response variable
  x <- dataset[[predictor]]  # Predictor variable

  # Calculate the means of x and y
  mean_x <- mean(x)
  mean_y <- mean(y)
  
  # Calculate the slope (beta_1)
  beta_1 <- sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
  
  # Calculate the intercept (beta_0)
  beta_0 <- mean_y - beta_1 * mean_x
  
  # Return the linear equation
  return(paste0('E[consc_lev] = ', round(beta_0, 2), ' + ', round(beta_1, 2), ' * ', predictor))
}

# Apply the function to the 'input' predictor
result_input <- lsq(train_data, 'input')
print(result_input)
## [1] "E[consc_lev] = 51.95 + 0.03 * input"
# Apply the function to the 'v_pyr' predictor
result_v_pyr <- lsq(train_data, 'v_pyr')
print(result_v_pyr)
## [1] "E[consc_lev] = 55.44 + 0.25 * v_pyr"

Explanation of the Slope (\(\hat{\beta}_1\))

The slope \(\hat{\beta}_1\) represents the change in consc_lev for a one-unit change in the predictor variable. It shows the strength and direction of the relationship between the predictor and the response variable.

  • A positive \(\hat{\beta}_1\) indicates that as the value of the predictor variable (e.g., input) increases, the consc_lev also increases. This means there is a direct positive relationship between the predictor and the response variable.

For example: - If \(\hat{\beta}_1 = 0.5\) for the predictor input, it means that for every one-unit increase in input, the consciousness level (consc_lev) is expected to increase by 0.5 units, holding all other factors constant.

Conversely, a negative \(\hat{\beta}_1\) would indicate that as the predictor variable increases, the consc_lev would decrease, suggesting an inverse relationship.

Interpretation for input and v_pyr

For the input Predictor:

If \(\hat{\beta}_1\) is positive, it implies that an increase in input is associated with an increase in the consciousness level (consc_lev). This suggests that higher input values are linked to a higher level of consciousness.

For the v_pyr Predictor:

Similarly, a positive \(\hat{\beta}_1\) for v_pyr means that as v_pyr increases, the consc_lev also increases. This indicates that greater values of v_pyr are associated with a higher consciousness level.

Question 3: (2 Marks)

R squared from the summary table reflects that the full model doesn’t fit the training dataset well; thus, you try to quantify the error between the values of the ground-truth and those of the model prediction. You want to write a function to predict consciousness level with the given dataset and calculate the root mean squared error (rMSE) between the model predictions and the ground truths from the training data. Please test this function on the full model and the training dataset.

Solution

Step 1: Define the RMSE Function

We will define a function called rmse() that takes a fitted model and a dataset as inputs, predicts the consciousness level (consc_lev), and calculates the RMSE.

Formula for RMSE:

The RMSE is calculated as follows:

\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

Where: - \(n\) = number of observations. - \(y_i\) = actual values of the response variable (consc_lev). - \(\hat{y}_i\) = predicted values of the response variable.

# RMSE function definition
rmse <- function(model, dataset) {
  # Predict the response variable using the model
  predictions <- predict(model, newdata = dataset)
  
  # Calculate the actual values
  actual_values <- dataset$consc_lev
  
  # Calculate the residuals (errors)
  residuals <- actual_values - predictions
  
  # Calculate the squared errors
  squared_errors <- residuals^2
  
  # Calculate the mean of squared errors
  mse <- mean(squared_errors)
  
  # Calculate the root mean squared error
  rmse_value <- sqrt(mse)
  
  return(rmse_value)
}

Step 2: Fit the Full Model on the Training Dataset

We will fit the full multiple linear regression model using all predictors to calculate RMSE.

# Fit the full multiple linear regression model
full_model <- lm(consc_lev ~ ., data = train_data)

# Display the summary of the full model
summary(full_model)
## 
## Call:
## lm(formula = consc_lev ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.824 -30.170   5.415  28.612  57.056 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.273e+01  6.578e-01  95.373  < 2e-16 ***
## sub_ind      5.814e-01  2.220e-02  26.187  < 2e-16 ***
## channel.num -5.339e-03  9.904e-05 -53.906  < 2e-16 ***
## aEP         -6.753e-03  1.859e-04 -36.326  < 2e-16 ***
## aIP         -1.283e-03  7.322e-04  -1.752  0.07974 .  
## aPE          8.189e-04  1.424e-04   5.749 9.01e-09 ***
## aPI         -1.082e-03  1.352e-04  -8.004 1.23e-15 ***
## input        2.216e-02  1.481e-03  14.968  < 2e-16 ***
## v_es        -1.738e-01  6.513e-02  -2.668  0.00763 ** 
## v_ii         2.959e-01  5.178e-02   5.714 1.11e-08 ***
## v_pyr       -1.585e+00  3.397e-01  -4.665 3.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.4 on 43904 degrees of freedom
## Multiple R-squared:  0.1209, Adjusted R-squared:  0.1207 
## F-statistic: 603.6 on 10 and 43904 DF,  p-value: < 2.2e-16

Step 3: Calculate RMSE for the Full Model on the Training Dataset

We will use the rmse() function to calculate the RMSE for the full model on the training dataset.

# Calculate RMSE for the full model on the training dataset
rmse_full_model <- rmse(full_model, train_data)
rmse_full_model
## [1] 30.39648

Interpretation of RMSE

The RMSE for the full model is 30.3965.

Question 4: (4 Marks)

You find the full model complicated and try to reduce the complexity by performing bidirectional stepwise regression with BIC.

Calculate the rMSE of this new model from the training data with the function that you implemented previously. Is there anything findings you can make? Explain your findings in 100 words.

Solution

Step 1: Perform Bidirectional Stepwise Regression with BIC

We will use the step() function in R to perform bidirectional stepwise regression based on the BIC. This method adds and removes predictors from the model to find a balance between model complexity and goodness of fit.

# Fit the full multiple linear regression model
full_model <- lm(consc_lev ~ ., data = train_data)

# Perform bidirectional stepwise regression using BIC as the criterion
sw.fit <- step(full_model, direction = "both", k = log(nrow(train_data)))
## Start:  AIC=299997.9
## consc_lev ~ sub_ind + channel.num + aEP + aIP + aPE + aPI + input + 
##     v_es + v_ii + v_pyr
## 
##               Df Sum of Sq      RSS    AIC
## - aIP          1      2837 40577930 299990
## - v_es         1      6580 40581672 299994
## <none>                     40575092 299998
## - v_pyr        1     20111 40595203 300009
## - v_ii         1     30179 40605271 300020
## - aPE          1     30549 40605641 300020
## - aPI          1     59207 40634299 300051
## - input        1    207060 40782152 300211
## - sub_ind      1    633776 41208868 300668
## - aEP          1   1219538 41794631 301288
## - channel.num  1   2685552 43260644 302802
## 
## Step:  AIC=299990.3
## consc_lev ~ sub_ind + channel.num + aEP + aPE + aPI + input + 
##     v_es + v_ii + v_pyr
## 
##               Df Sum of Sq      RSS    AIC
## - v_es         1      6720 40584649 299987
## <none>                     40577930 299990
## + aIP          1      2837 40575092 299998
## - v_pyr        1     23086 40601016 300005
## - v_ii         1     30578 40608508 300013
## - aPE          1     31479 40609409 300014
## - aPI          1     85594 40663523 300072
## - input        1    337307 40915237 300343
## - sub_ind      1    633837 41211767 300660
## - aEP          1   1230015 41807944 301291
## - channel.num  1   2792166 43370096 302902
## 
## Step:  AIC=299986.9
## consc_lev ~ sub_ind + channel.num + aEP + aPE + aPI + input + 
##     v_ii + v_pyr
## 
##               Df Sum of Sq      RSS    AIC
## <none>                     40584649 299987
## + v_es         1      6720 40577930 299990
## + aIP          1      2977 40581672 299994
## - v_pyr        1     22841 40607491 300001
## - aPE          1     28306 40612955 300007
## - v_ii         1     37199 40621849 300016
## - aPI          1     91931 40676580 300076
## - input        1    343910 40928559 300347
## - sub_ind      1    627186 41211835 300650
## - aEP          1   1234223 41818873 301292
## - channel.num  1   2805716 43390365 302912
# Display the summary of the stepwise regression model
summary(sw.fit)
## 
## Call:
## lm(formula = consc_lev ~ sub_ind + channel.num + aEP + aPE + 
##     aPI + input + v_ii + v_pyr, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.676 -30.201   5.292  28.595  57.974 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.315e+01  6.155e-01 102.596  < 2e-16 ***
## sub_ind      5.747e-01  2.206e-02  26.048  < 2e-16 ***
## channel.num -5.377e-03  9.760e-05 -55.094  < 2e-16 ***
## aEP         -6.714e-03  1.837e-04 -36.541  < 2e-16 ***
## aPE          5.614e-04  1.014e-04   5.534 3.15e-08 ***
## aPI         -9.846e-04  9.873e-05  -9.973  < 2e-16 ***
## input        2.378e-02  1.233e-03  19.289  < 2e-16 ***
## v_ii         1.814e-01  2.860e-02   6.344 2.26e-10 ***
## v_pyr       -1.668e+00  3.356e-01  -4.971 6.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.4 on 43906 degrees of freedom
## Multiple R-squared:  0.1207, Adjusted R-squared:  0.1205 
## F-statistic: 753.1 on 8 and 43906 DF,  p-value: < 2.2e-16

Explanation:

The step() function is used to perform bidirectional stepwise regression. The argument direction = "both" allows both forward and backward stepwise regression. The argument k = log(nrow(train_data)) uses the log of the sample size as the penalty, which corresponds to BIC.

Step 2: Calculate RMSE for the New Model on the Training Dataset

We will now use the rmse() function, previously defined, to calculate the RMSE for the stepwise regression model (sw.fit) on the training dataset.

# Calculate RMSE for the stepwise regression model on the training dataset
rmse_sw_fit <- rmse(sw.fit, train_data)
rmse_sw_fit
## [1] 30.40006

Explanation:

The rmse() function is used to calculate the RMSE of the reduced model (sw.fit) on the training dataset.

Step 3: Interpretation of Findings

This indicates that the reduced model performs similarly to the full model.

Question 5: (4 Marks)

You have been given a new dataset Regression_new.csv obtained from different recording sessions where the subjects only experienced moderate reductions in consciousness. You are going to apply the new model sw.fit on the new dataset to evaluate the model performance with using rMSE. When you look into rMSE, what do you find? If you think sw.fit works well on Regression_new.csv, please explain why it does well. Otherwise, if you think your model sw.fit doesn’t perform well on Regression_new.csv, can you point out the potential reason(s) for this?

Solution

Step 1: Load the New Dataset

We will first load the new dataset Regression_new.csv to apply the stepwise regression model (sw.fit).

# Load necessary libraries
library(readr)  # For reading CSV files

# Load the new dataset
new_data <- read_csv("Regression_new.csv")
## Rows: 22302 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): sub_ind, channel.num, aEP, aIP, aPE, aPI, input, v_es, v_ii, v_pyr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the new dataset
head(new_data)
## # A tibble: 6 × 11
##   sub_ind channel.num   aEP   aIP   aPE   aPI input   v_es    v_ii   v_pyr
##     <dbl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>   <dbl>
## 1       7         590  450. -480.  563.  127.  270.  2.27   0.531   -5.36 
## 2       7         590 1055. -919. 1213.  209. -321. -0.378 -0.0766 -13.1  
## 3       7         590  944. -801. 1072.  174. -124.  1.05   0.201   -2.06 
## 4       7         590  968. -873. 1129.  233.  171.  8.72   2.00     1.31 
## 5       7         590  893. -816. 1048.  263.  175.  7.46   1.95    -0.276
## 6       7         590  882. -809. 1072.  207.  193.  6.34   1.65    -3.05 
## # ℹ 1 more variable: consc_lev <dbl>

Step 2: Apply the Stepwise Regression Model to the New Dataset

We will use the predict() function to generate predictions on the new dataset using the sw.fit model.

# Generate predictions on the new dataset using the stepwise regression model
predictions_new <- predict(sw.fit, newdata = new_data)

# Display the first few predicted values
head(predictions_new)
##        1        2        3        4        5        6 
## 76.62917 71.53265 58.61201 60.13217 63.30955 68.46290

Step 4: Interpretation of Findings

The RMSE value obtained from the new dataset provides insight into how well the model generalizes to new data.

Findings:

If the RMSE value on the new dataset is similar to the RMSE value on the training dataset, it indicates that the model generalizes well and performs consistently across different datasets. This suggests that the stepwise regression model successfully captured the significant predictors of the consciousness level, even in a new context with moderate reductions in consciousness.

However, if the RMSE value on the new dataset is significantly higher, it indicates that the model does not perform well on the new data. This could be due to differences in the distribution of the predictors or the response variable (consc_lev) between the training dataset and the new dataset. Additionally, the model may have been overfitted to the training data, resulting in poor generalization.

Question 6: (5 Marks)

Although stepwise regression has reduced the model complexity significantly, the model still contains a lot of variables that we want to remove. Therefore, you are interested in lightweight linear regression models with ONLY TWO predictors. Write a script to automatically find the best lightweight model which corresponds to the model with the least rMSE on the training dataset (Not the new dataset). Compare the rMSE of the best lightweight model with the rMSE of the full model - lm.fit - that you built previously. Give an explanation for these results based on consideration of the predictors involved.

Solution

Step 1: Initialize Variables

We will initialize variables to keep track of the minimum RMSE (minimum_error) and the best features (features) that achieve this minimum RMSE.

# Initialize minimum error and best features
minimum_error <- Inf
best_features <- c()

Step 2: Generate All Combinations of Two Predictors

We will generate all possible combinations of two predictors from the training dataset. For each combination, we will fit a linear model and calculate the RMSE using the rmse() function.

# Load necessary library for combinations
library(combinat)
## 
## Attaching package: 'combinat'
## The following object is masked from 'package:utils':
## 
##     combn
# Get all predictor names except the response variable
predictors <- setdiff(names(train_data), "consc_lev")

# Generate all combinations of two predictors
combinations <- combn(predictors, 2)

Step 3: Find the Best Lightweight Model

We will iterate through all combinations of two predictors, fit a linear model for each, and calculate the RMSE. The combination with the lowest RMSE will be considered the best lightweight model.

# Loop through each combination and calculate RMSE
for (i in 1:ncol(combinations)) {
  # Get the current combination of two predictors
  current_features <- combinations[, i]
  
  # Create a formula for the linear model
  formula <- as.formula(paste("consc_lev ~", paste(current_features, collapse = " + ")))
  
  # Fit the linear model with the current combination of predictors
  model <- lm(formula, data = train_data)
  
  # Calculate the RMSE for the current model
  current_rmse <- rmse(model, train_data)
  
  # Check if the current RMSE is the lowest found so far
  if (current_rmse < minimum_error) {
    minimum_error <- current_rmse
    best_features <- current_features
  }
}

Step 4: Display the Best Lightweight Model and Its RMSE

# Display the best features and the minimum RMSE found
print(paste('The best features are', paste(best_features, collapse = " and "), 
            '; and the RMSE is', round(minimum_error, 4)))
## [1] "The best features are channel.num and aEP ; and the RMSE is 30.8145"

Step 5: Calculate RMSE for the Full Model

We will calculate the RMSE of the full model (full_model) on the training dataset for comparison.

# Calculate RMSE for the full model on the training dataset
rmse_full_model <- rmse(full_model, train_data)
print(paste('The RMSE of the full model is', round(rmse_full_model, 4)))
## [1] "The RMSE of the full model is 30.3965"

Step 6: Explanation of Results

Findings: The RMSE of the full model was 30.3965, indicating that the lightweight model performs slightly worse but remains a valid simpler model.