# Set seed for reproducibility
set.seed(7052)
# Parameters
n <- 100
mu_x <- 2
sd_x <- 0.1
sd_epsilon <- 0.5
# Generate predictor X ~ N(2, 0.1^2)
X <- rnorm(n, mean = mu_x, sd = sd_x)
# Generate error term epsilon ~ N(0, 0.5^2)
epsilon <- rnorm(n, mean = 0, sd = sd_epsilon)
# Generate response variable Y = 10 + 5*X + epsilon
Y <- 10 + 5*X + epsilon
# Set seed and generate data
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5*X + epsilon
# Summary statistics
summary_X <- summary(X)
summary_Y <- summary(Y)
sd_X <- sd(X)
sd_Y <- sd(Y)
# Display summary statistics
summary_table <- data.frame(
  Variable = c("X", "Y"),
  Min = c(summary_X[1], summary_Y[1]),
  `1st Qu.` = c(summary_X[2], summary_Y[2]),
  Median = c(summary_X[3], summary_Y[3]),
  Mean = c(round(mean(X), 3), round(mean(Y), 3)),
  `3rd Qu.` = c(summary_X[5], summary_Y[5]),
  Max = c(summary_X[6], summary_Y[6]),
  SD = c(round(sd_X, 3), round(sd_Y, 3))
)
summary_table
## Variable Min X1st.Qu. Median Mean X3rd.Qu. Max SD
## 1 X 1.725049 1.922755 2.000908 2.004 2.069657 2.242976 0.109
## 2 Y 18.091323 19.665378 20.109936 20.173 20.701137 21.795011 0.755
# Check for outliers (simple rule: values beyond 1.5*IQR)
IQR_X <- IQR(X)
IQR_Y <- IQR(Y)
outliers_X <- X[X < (quantile(X,0.25) - 1.5*IQR_X) | X > (quantile(X,0.75) + 1.5*IQR_X)]
outliers_Y <- Y[Y < (quantile(Y,0.25) - 1.5*IQR_Y) | Y > (quantile(Y,0.75) + 1.5*IQR_Y)]
outliers_X
## numeric(0)
outliers_Y
## [1] 18.09132
# Correlation coefficient
cor_XY <- cor(X, Y)
cor_XY
## [1] 0.8042198
# Scatter plot
plot(X, Y,
     main = "Scatter Plot of Y vs X",
     xlab = "Predictor X",
     ylab = "Response Y",
     pch = 19, col = "blue")
abline(lm(Y ~ X), col = "red", lwd = 2)
legend("topleft", legend = c("Data points", "Regression line"),
       col = c("blue", "red"), pch = c(19, NA), lty = c(NA, 1), lwd = c(NA, 2))
Summary Statistics:
The table shows the minimum, 1st quartile, median, mean, 3rd quartile,
maximum, and standard deviation for the predictor X and
response Y. These statistics help understand the central
tendency, spread, and range of the data.
Outliers:
Checked using the 1.5 × IQR rule. Values outside the range \([Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]\) are
flagged as outliers. In this simulated dataset the rule flags no values of X
and one value of Y: the minimum, 18.09, which lies just below the lower
fence of about 18.11.
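The fence check can be wrapped in a small function so the same rule is applied to both variables. This is a minimal sketch: `iqr_outliers` and the toy vector are hypothetical names introduced here for illustration, not part of the original analysis.

```r
# Hypothetical helper: return the values of v that fall outside the
# k * IQR fences (k = 1.5 reproduces the rule used in this report).
iqr_outliers <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  margin <- k * (q[2] - q[1])
  v[v < q[1] - margin | v > q[2] + margin]
}

# Toy vector with one obviously extreme value
toy <- c(1, 2, 3, 4, 5, 100)
toy_outliers <- iqr_outliers(toy)
print(toy_outliers)
```

Calling `iqr_outliers(X)` and `iqr_outliers(Y)` on the simulated data should reproduce `outliers_X` and `outliers_Y`, since the helper uses the same default `quantile()` definition.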
Correlation Coefficient:
The Pearson correlation coefficient measures the strength and direction
of the linear relationship between X and Y. Here
\(r \approx 0.804\), which indicates a strong positive linear relationship,
consistent with the positive slope of the simulated model.
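For comparison, the population correlation implied by the simulation settings can be worked out from the model itself, assuming X and the error term are independent as in the data-generating code. A short sketch (the names `beta1`, `var_y`, `cov_xy`, and `rho` are introduced here for illustration):

```r
# Parameters taken from the simulation setup
beta1 <- 5          # true slope
sd_x <- 0.1         # standard deviation of X
sd_epsilon <- 0.5   # standard deviation of the error term

# Var(Y) = beta1^2 * Var(X) + Var(epsilon) when X and epsilon are independent
var_y <- beta1^2 * sd_x^2 + sd_epsilon^2

# Cov(X, Y) = beta1 * Var(X)
cov_xy <- beta1 * sd_x^2

# Population correlation between X and Y
rho <- cov_xy / (sd_x * sqrt(var_y))
rho
```

This gives a population value of about 0.707, so the sample estimate of 0.804 sits somewhat above it, which is plausible sampling variability for n = 100.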
Scatter Plot:
The blue points represent the simulated data, and the red line is the
fitted regression line (Y ~ X). The plot visually confirms
the positive linear relationship between X and
Y, showing that Y tends to increase as
X increases.
# Fit simple linear regression
model <- lm(Y ~ X)
# Display model summary
summary(model)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
# Extract estimated coefficients
coefficients <- coef(model)
intercept <- coefficients[1]
slope <- coefficients[2]
# Display estimated model
estimated_model <- paste0("Estimated Model: Y_hat = ", round(intercept,3), " + ", round(slope,3), " * X")
estimated_model
## [1] "Estimated Model: Y_hat = 9.022 + 5.565 * X"
# Compute residuals and Mean Squared Error (MSE)
residuals <- model$residuals
mse <- mean(residuals^2)
mse
## [1] 0.1992276
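The `mse` value above divides the residual sum of squares by n = 100, whereas the residual standard error reported by `summary(model)` divides by the residual degrees of freedom, n - 2 = 98. That is why 0.1992 is slightly smaller than 0.4509^2 ≈ 0.2033. A minimal sketch of the two conventions, re-simulating the data with the same seed so the block runs on its own (the names `rss`, `mse_n`, and `mse_df` are introduced here for illustration):

```r
# Re-create the simulated data and the fitted model
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)

# Residual sum of squares
rss <- sum(residuals(model)^2)

# Two common definitions of the mean squared error
mse_n <- rss / n                     # divides by n, as in mean(residuals^2)
mse_df <- rss / df.residual(model)   # divides by n - 2, unbiased for the error variance

c(mse_n = mse_n, mse_df = mse_df, sigma_squared = summary(model)$sigma^2)
```

The degrees-of-freedom version is the one that matches the squared residual standard error; with n = 100 the two differ by only about 2%.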
Estimated Model:
\(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\),
where \(\hat{\beta}_0\) is the estimated intercept and
\(\hat{\beta}_1\) is the estimated slope.
For this dataset, the estimated model is approximately:
\(\hat{Y} = 9.022 + 5.565 X\)
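In simple linear regression the least-squares estimates have a closed form: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept follows from the two sample means. The sketch below checks this against `lm()`; it re-simulates the data so it is self-contained, and `b0`, `b1`, `manual`, and `from_lm` are names introduced here for illustration.

```r
# Re-create the simulated data
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + epsilon

# Closed-form least-squares estimates
b1 <- cov(X, Y) / var(X)        # slope
b0 <- mean(Y) - b1 * mean(X)    # intercept

# Compare with the coefficients returned by lm()
manual <- c(b0, b1)
from_lm <- unname(coef(lm(Y ~ X)))
all.equal(manual, from_lm)
```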
Estimated Coefficients:
| Coefficient | Estimate |
|---|---|
| Intercept | 9.022 |
| Slope (X) | 5.565 |
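- Intercept: the expected value of Y when X = 0.
- Slope: the expected change in Y for a one-unit increase in X.
- MSE: the average squared difference between observed Y and predicted \(\hat{Y}\). A small MSE indicates that the
model predictions are close to the actual data.
# Sample means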
mean_X <- mean(X)
mean_Y <- mean(Y)
# Print sample means
mean_X
## [1] 2.003677
mean_Y
## [1] 20.17258
# Fit linear regression model
model <- lm(Y ~ X)
# Scatter plot
plot(X, Y,
     main = "Scatter Plot with Regression Line and Mean Point",
     xlab = "Predictor X",
     ylab = "Response Y",
     pch = 19, col = "blue")
# Add fitted regression line
abline(model, col = "red", lwd = 2)
# Add mean point
points(mean_X, mean_Y, pch = 19, col = "green", cex = 1.5)
# Add legend
legend("topleft", legend = c("Data points", "Regression line", "Mean point (X̄, Ȳ)"),
       col = c("blue", "red", "green"), pch = c(19, NA, 19), lty = c(NA, 1, NA), lwd = c(NA, 2, NA))
Sample Means:
The sample mean of X is approximately 2.004 and the sample mean of Y is
approximately 20.173.
Regression Line and Mean Point:
The green point (\(\bar{X}, \bar{Y}\))
lies on the fitted regression line, which is a property of simple linear
regression.
The red regression line fits the data points well, showing a strong
positive linear relationship between X and Y.
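The centroid property can also be checked numerically rather than read off the plot: the fitted value at the sample mean of X should equal the sample mean of Y, because the least-squares residuals sum to zero when the model includes an intercept. A minimal sketch, re-simulating the data so it runs on its own (`y_at_mean` is a name introduced here for illustration):

```r
# Re-create the simulated data and the fitted model
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)

# Fitted value at the sample mean of X
y_at_mean <- predict(model, newdata = data.frame(X = mean(X)))

# Should agree with the sample mean of Y up to floating-point error
all.equal(unname(y_at_mean), mean(Y))

# The residuals of a model with an intercept sum to (numerically) zero
sum(residuals(model))
```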
Conclusion:
The plot visually confirms that Y increases as X increases. The mean
point lies on the line, as expected, because a least-squares line with an
intercept always passes through the centroid of the data. Apart from the
single low value of Y flagged by the 1.5 × IQR rule, no extreme points are
visible.
Summary of Observations:
- The predictor X and response Y are normally distributed by construction.
The 1.5 × IQR rule flags no values of X and one low value of Y (18.09).
- Sample means: \(\bar{X} \approx 2.004\), \(\bar{Y} \approx 20.173\).
- The Pearson correlation coefficient between X and Y is approximately
0.804, indicating a strong positive linear relationship.
- The fitted regression model is \(\hat{Y} = 9.022 + 5.565 X\).
- The Mean Squared Error (MSE) is 0.199, comparable to the true error
variance \(0.5^2 = 0.25\), so the residual spread is in line with the
simulated noise level.
- The scatter plot with regression line and mean point confirms the
positive linear relationship and that the regression line passes through
the centroid \((\bar{X}, \bar{Y})\).
Analysis and Conclusion:
- The simulation generated data consistent with the
specified linear model \(Y = 10 + 5X + \epsilon\).
- The estimated slope (5.565, standard error 0.416) is positive and highly
significant, and lies within about 1.4 standard errors of the true value 5.
The estimated intercept (9.022, standard error 0.834) lies within about 1.2
standard errors of the true value 10.
- The model explains about 65% of the variance in Y
(\(R^2 \approx 0.647\)), and the 1.5 × IQR rule flags only one mild outlier
in Y.
- Overall, the linear regression model provides a reliable summary of
the relationship between X and Y in the simulated dataset.
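One way to make the comparison with the true parameters explicit is to compute confidence intervals for the coefficients and check whether they contain the values used in the simulation (intercept 10, slope 5). The sketch below uses `confint()` from base R; it re-simulates the data so it runs on its own, and the `comparison` data frame and its column names are introduced here for illustration.

```r
# Re-create the simulated data and the fitted model
set.seed(7052)
n <- 100
X <- rnorm(n, mean = 2, sd = 0.1)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
Y <- 10 + 5 * X + epsilon
model <- lm(Y ~ X)

# 95% confidence intervals for the intercept and slope
ci <- confint(model, level = 0.95)

# Compare each interval with the true parameter used in the simulation
comparison <- data.frame(
  estimate = unname(coef(model)),
  lower = ci[, 1],
  upper = ci[, 2],
  true = c(10, 5)
)
comparison$covered <- comparison$lower <= comparison$true &
  comparison$true <= comparison$upper
comparison
```

With only 100 observations and X confined to a narrow range around 2, the intercept is estimated far from the observed data, so its interval is expected to be noticeably wider than the one for the slope.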