Answer for Question 1a:

The R code to simulate the predictor X and the response Y is:

# Set seed for reproducibility
set.seed(7052)

# Parameters
n <- 100
mu_x <- 2
sd_x <- 0.1
sd_epsilon <- 0.5

# Generate predictor X ~ N(2, 0.1^2)
X <- rnorm(n, mean = mu_x, sd = sd_x)

# Generate error term epsilon ~ N(0, 0.5^2)
epsilon <- rnorm(n, mean = 0, sd = sd_epsilon)

# Generate response variable Y = 10 + 5*X + epsilon
Y <- 10 + 5*X + epsilon
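
As an optional sanity check (not part of the required answer), the sample moments of the simulated draws can be compared with the parameters specified above:

# Sanity check: sample moments should be close to mu_x = 2, sd_x = 0.1,
# and sd_epsilon = 0.5, up to sampling variability
c(mean(X), sd(X))
c(mean(epsilon), sd(epsilon))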

Answer for Question 1b:

Simulation Study: Summary Statistics, Correlation, and Scatter Plot

# Set seed and generate data
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + epsilon

# Summary statistics
summary_X <- summary(X)
summary_Y <- summary(Y)
sd_X <- sd(X)
sd_Y <- sd(Y)

# Display summary statistics
summary_table <- data.frame(
  Variable = c("X", "Y"),
  Min = c(summary_X[1], summary_Y[1]),
  `1st Qu.` = c(summary_X[2], summary_Y[2]),
  Median = c(summary_X[3], summary_Y[3]),
  Mean = c(round(mean(X),3), round(mean(Y),3)),
  `3rd Qu.` = c(summary_X[5], summary_Y[5]),
  Max = c(summary_X[6], summary_Y[6]),
  SD = c(round(sd_X,3), round(sd_Y,3))
)
summary_table
##   Variable       Min  X1st.Qu.    Median   Mean  X3rd.Qu.       Max    SD
## 1        X  1.725049  1.922755  2.000908  2.004  2.069657  2.242976 0.109
## 2        Y 18.091323 19.665378 20.109936 20.173 20.701137 21.795011 0.755
# Check for outliers (1.5*IQR rule: flag values more than 1.5*IQR beyond Q1 or Q3)
IQR_X <- IQR(X)
IQR_Y <- IQR(Y)
outliers_X <- X[X < (quantile(X,0.25) - 1.5*IQR_X) | X > (quantile(X,0.75) + 1.5*IQR_X)]
outliers_Y <- Y[Y < (quantile(Y,0.25) - 1.5*IQR_Y) | Y > (quantile(Y,0.75) + 1.5*IQR_Y)]
outliers_X
## numeric(0)
outliers_Y
## [1] 18.09132
# Correlation coefficient
cor_XY <- cor(X, Y)
cor_XY
## [1] 0.8042198
# Scatter plot
plot(X, Y,
     main = "Scatter Plot of Y vs X",
     xlab = "Predictor X",
     ylab = "Response Y",
     pch = 19, col = "blue")
abline(lm(Y ~ X), col = "red", lwd = 2)
legend("topleft", legend=c("Data points", "Regression line"),
       col=c("blue","red"), pch=c(19, NA), lty=c(NA,1), lwd=c(NA,2))

Explanation of Summary, Correlation, and Scatter Plot

  • Summary Statistics:
    The table shows the minimum, 1st quartile, median, mean, 3rd quartile, maximum, and standard deviation for the predictor X and response Y. These statistics help understand the central tendency, spread, and range of the data.

  • Outliers:
    Checked using the 1.5 × IQR rule: values outside \([Q_1 - 1.5 \times \mathrm{IQR},\ Q_3 + 1.5 \times \mathrm{IQR}]\) are flagged. Here X has no flagged values, while Y has one mild low value: the lower fence for Y is \(19.665 - 1.5 \times 1.036 \approx 18.112\), and the minimum observation 18.091 falls just below it. A single mild flag is unsurprising in 100 normally distributed observations.

  • Correlation Coefficient:
    The Pearson correlation coefficient (0.804 here) measures the strength and direction of the linear relationship between X and Y. A value close to +1 indicates a strong positive relationship, consistent with the simulated model; a short derivation of the population correlation implied by the simulation parameters follows this list.

  • Scatter Plot:
    The blue points represent the simulated data, and the red line is the fitted regression line (Y ~ X). The plot visually confirms the positive linear relationship between X and Y, showing that Y tends to increase as X increases.
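
As a check on the observed correlation (a derivation from the simulation parameters, not part of the required output): since \(Y = 10 + 5X + \epsilon\) with \(X\) and \(\epsilon\) independent,
\[
\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{5\sigma_X^2}{\sigma_X\sqrt{25\sigma_X^2 + \sigma_\epsilon^2}} = \frac{5 \times 0.1}{\sqrt{25 \times 0.01 + 0.25}} = \frac{0.5}{\sqrt{0.5}} \approx 0.707.
\]
The observed value 0.804 sits somewhat above this population value, which is plausible sampling variability for a single draw of \(n = 100\).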

Answer for Question 1c:

Simple Linear Regression Analysis

# Fit simple linear regression
model <- lm(Y ~ X)

# Display model summary
summary(model)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## X             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
# Extract estimated coefficients
coefs <- coef(model)   # renamed to avoid masking the base coefficients() function
intercept <- coefs[1]
slope <- coefs[2]

# Display estimated model
estimated_model <- paste0("Estimated Model: Y_hat = ", round(intercept,3), " + ", round(slope,3), " * X")
estimated_model
## [1] "Estimated Model: Y_hat = 9.022 + 5.565 * X"
# Compute residuals and Mean Squared Error (MSE)
res <- residuals(model)   # renamed to avoid masking the base residuals() function
mse <- mean(res^2)
mse
## [1] 0.1992276
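
As an optional cross-check (a sketch, reusing the X and Y vectors from above), the same estimates follow from the closed-form least-squares formulas:

# Closed-form least-squares estimates; should match coef(model)
Sxy <- sum((X - mean(X)) * (Y - mean(Y)))  # corrected sum of cross-products
Sxx <- sum((X - mean(X))^2)                # corrected sum of squares of X
b1 <- Sxy / Sxx                            # slope estimate
b0 <- mean(Y) - b1 * mean(X)               # intercept estimate
c(intercept = b0, slope = b1)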

Explanation of Outputs

  • Estimated Model:
    \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\), where \(\hat{\beta}_0\) is the estimated intercept and \(\hat{\beta}_1\) is the estimated slope.
    For this dataset, the estimated model is approximately:
    \(\hat{Y} = 9.022 + 5.565 X\)

  • Estimated Coefficients:

    Coefficient   Estimate
    Intercept        9.022
    Slope (X)        5.565
  • Interpretation:
    • Intercept (\(\hat{\beta}_0 = 9.022\)): Expected value of Y when X = 0. Because the simulated X values cluster near 2, this is an extrapolation outside the observed range.
    • Slope (\(\hat{\beta}_1 = 5.565\)): Expected change in Y for a one-unit increase in X. Both estimates are close to the true values (10 and 5) used to generate the data.
  • Mean Squared Error (MSE):
    MSE ≈ 0.199, the average squared difference between observed Y and predicted \(\hat{Y}\); a small MSE indicates that the model predictions are close to the actual data. Note that this version divides by n; the unbiased estimate divides by n − 2 and equals the squared residual standard error (0.4509² ≈ 0.203), as checked after this list.
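
A minimal check of the MSE relationship noted above (reusing the fitted model object):

# MSE with divisor n vs. the unbiased variance estimate with divisor n - 2;
# the latter matches summary(model)$sigma^2
rss <- sum(residuals(model)^2)
c(mse_n = rss / length(Y), s2_unbiased = rss / (length(Y) - 2),
  from_summary = summary(model)$sigma^2)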

Answer for Question 1d:

Sample Means and Regression Line with Mean Point

# Sample means
mean_X <- mean(X)
mean_Y <- mean(Y)

# Print sample means
mean_X
## [1] 2.003677
mean_Y
## [1] 20.17258
# Fit linear regression model
model <- lm(Y ~ X)

# Scatter plot
plot(X, Y,
     main = "Scatter Plot with Regression Line and Mean Point",
     xlab = "Predictor X",
     ylab = "Response Y",
     pch = 19, col = "blue")

# Add fitted regression line
abline(model, col = "red", lwd = 2)

# Add mean point
points(mean_X, mean_Y, pch = 19, col = "green", cex = 1.5)

# Add legend
legend("topleft", legend=c("Data points", "Regression line", "Mean point (X̄, Ȳ)"),
       col=c("blue","red","green"), pch=c(19, NA, 19), lty=c(NA,1,NA), lwd=c(NA,2,NA))

Explanation of Findings

  • Sample Means:
    The sample mean of X is approximately 2.004 and the sample mean of Y is approximately 20.173.

  • Regression Line and Mean Point:
    The green point (\(\bar{X}, \bar{Y}\)) lies on the fitted regression line. This is a built-in property of least squares: the intercept satisfies \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\), which rearranges to \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\).
    The red regression line fits the data points well, showing a strong positive linear relationship between X and Y.

  • Conclusion:
    The plot visually confirms that Y increases as X increases. The mean point lying on the line validates that the regression line passes through the centroid of the data, and apart from one mild low value in Y, no extreme outliers are present.

Summary and Analysis

Summary of Observations:
- The predictor X and response Y are approximately normally distributed; only one mild low value in Y is flagged by the 1.5 × IQR rule.
- Sample means: \(\bar{X} \approx 2.004\), \(\bar{Y} \approx 20.173\).
- The Pearson correlation coefficient between X and Y is approximately 0.804, indicating a strong positive linear relationship.
- The fitted regression model is \(\hat{Y} = 9.022 + 5.565 X\).
- The Mean Squared Error (MSE) is 0.199, showing low average deviation between predicted and observed values.
- The scatter plot with regression line and mean point confirms the positive linear relationship and that the regression line passes through the centroid (\(\bar{X}, \bar{Y}\)).

Analysis and Conclusion:
- The simulation successfully generated data consistent with the specified linear model \(Y = 10 + 5X + \epsilon\).
- Regression analysis validates that Y increases linearly with X.
- Both numerical and visual analyses (summary statistics, correlation, MSE, and scatter plot) indicate a good model fit with no influential outliers.
- Overall, the linear regression model provides a reliable summary of the relationship between X and Y in the simulated dataset.