1 Least Squares and matrices analyis (Question 2)
- 1.1 Statistical Explantation
- 1.2 Geometrical Explanation
2 Question # 3 Happiness and Income analyis
3 Question # 4 fertility
- 3.1 Interpretation of the Fit

datax <- c(6,7,8)
datay <- c(9,16,23)

X <- cbind(Intercept = 1, Slope = datax)

# Step 2: Construct the response vector Y
Y <- datay

# Step 3: Compute (X'X)
XtX <- t(X) %*% X

# Step 4: Compute (X'Y)
XtY <- t(X) %*% Y

# Step 5: Compute the inverse of (X'X)
XtX_inv <- solve(XtX)

# Step 6: Compute the regression coefficients (beta_hat)
beta_hat <- XtX_inv %*% XtY

# Display the estimated coefficients as a table
names(beta_hat) <- c("parameter","Coefficent")
head(beta_hat)

##           [,1]
## Intercept  -33
## Slope        7

# Step 7: Make predictions
# Predicted distances
Y_pred <- X %*% beta_hat
residuals <- Y - Y_pred
print(residuals)

##              [,1]
## [1,] 2.842171e-13
## [2,] 2.557954e-13
## [3,] 2.273737e-13







``` r
tabls <- captioner(prefix="Table.")
Table1 <-tabls(name="Table.1", " Least squares coefficients.")
 beta_hat %>%
   kbl(caption = Table1,digits = 2,align ="c") %>%
  kable_classic(full_width = F, html_font = "Cambria")

Table 0.1: Table. 1: Least squares coefficients.
Intercept	-33
Slope	7

datax <- c(6,7,8)
datay <- c(9,16,23)
dataf <- cbind(datax,datay)
dataf <- as.data.frame(dataf)
names(dataf) <- c("X","Y")
model <- lm(Y~X,data=dataf)
summary(model)

## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = Y ~ X, data = dataf)
## 
## Residuals:
##          1          2          3 
##  7.252e-16 -1.450e-15  7.252e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -3.300e+01  8.852e-15 -3.728e+15   <2e-16 ***
## X            7.000e+00  1.256e-15  5.573e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.776e-15 on 1 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.106e+31 on 1 and 1 DF,  p-value: < 2.2e-16

1 Least Squares and matrices analyis (Question 2)

datax <- c(2,2,2)
datay <- c(9,16,23)
dataf <- cbind(datax,datay)
dataf <- as.data.frame(dataf)
names(dataf) <- c("X","Y")
model2 <- lm(Y~X,data=dataf)
summary(model2)

## 
## Call:
## lm(formula = Y ~ X, data = dataf)
## 
## Residuals:
##          1          2          3 
## -7.000e+00 -3.553e-15  7.000e+00 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   16.000      4.041   3.959   0.0583 .
## X                 NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7 on 2 degrees of freedom

1.1 Statistical Explantation

In the case where X values are all the same, there is no variation in the explanatory variable X. Since ordinary least squares regression relies on the variation in X to explain the variation in Y, a lack of variation in X means that the regression model cannot distinguish between the effect of the intercept β0 and the slope β1 This results in an inability to estimate the slope β1 as the model cannot tell how much of the change in Y is due to changes in X versus random error.

1.2 Geometrical Explanation

Geometrically, if all the X values are identical, the data points lie on a vertical line in the X-Y plane. The regression line must pass through the mean of Y, but since the slope of the line cannot be determined (as there is no horizontal spread in X), the concept of a “best fit” line becomes meaningless. The problem is that multiple lines with different slopes could equally “fit” the data, leading to an indeterminate solution. This situation is known as perfect multicollinearity, where the design matrix X does not have full rank, making it impossible to compute unique coefficients. These explanations highlight why it’s crucial to have variation in the independent variables when performing regression analysis.

2 Question # 3 Happiness and Income analyis

# Set working directory
setwd("D:/2024/NewData")
happy_df <- read.csv("happy.csv",stringsAsFactors=FALSE)
model <- lm(happiness ~ income,data=happy_df)
coefficients <- summary(model)$coefficients
intercept <- round(coefficients[1, 1], 3)
slope <- round(coefficients[2, 1], 3)
paste("Happiness =", intercept, "+", slope, "* Income")

## [1] "Happiness = 0.204 + 0.714 * Income"

# Extract p-value for the slope
p_value <- coefficients[2, 4]
print(p_value)

## [1] 3.956245e-151

# Check significance
if (p_value < 0.05) {
  "Income has a significant effect on Happiness (p < 0.05)."
} else {
  "Income does not have a significant effect on Happiness (p ≥ 0.05)."
}

## [1] "Income has a significant effect on Happiness (p < 0.05)."

# Extract R-squared value
r_squared <- summary(model)$r.squared
paste("R-squared value:", round(r_squared, 3))

## [1] "R-squared value: 0.749"

The interpretation is that for every $10,000 increase in income Happines increses by 0.714 units. The R2 value is 0.714 indicating a resonable lack of it.

plot(model$residuals ~ model$fitted.values, 
     xlab = "Fitted Values (Predicted Happiness)", 
     ylab = "Residuals", 
     main = "")
     

# Add a horizontal line at y=0
abline(h = 0, col = "red")

Figure 2.1: Residual plot of happiness based on a linear model fit

The in a good model fit as shown in Figure 2.1, the residuals appear to be randomly scattered around zero, indicating that the model’s predictions are unbiased. The model fits the observed data with no large outliers that are apparent.

ggplot(happy_df, aes(x = income, y = happiness)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("") +
  xlab("Income (in $10,000)") +
  ylab("Happiness (scale 1-10)") +
  annotate("text", x = max(happy_df$income), y = max(happy_df$happiness), 
           label = paste("Happiness =", intercept, "+", slope, "* Income"),
           hjust = 1)

## `geom_smooth()` using formula = 'y ~ x'

Figure 2.2: Linear model fit of Income and Happines

The happines index predictions associated with the income values of 4.76,6.48, and 8.98 are 3.60, 4.83, and 6.62.

3 Question # 4 fertility

rate and age in female rhesus macaques from Cayo Santiago

macaque_data <- read.csv("macaque.csv")

ggplot(macaque_data, aes(x = age, y = mean_fertility)) +
  geom_point() +
  ggtitle("") +
  xlab("Age (years)") +
  ylab("Mean Fertility Rate") +
  theme_minimal()

Figure 3.1: Mean Age-Specific Fertility vs Age in Female Rhesus Macaques

In Figure 3.1, we observe that mean fertility rate increases up to 15 years of age and then starts to decline as females get older. This pattern is common in biological systems where reproductive capacity peaks at a certain age and diminishes with advancing age due to natural aging processes.

# Build the linear regression model
model <- lm(mean_fertility ~ age, data = macaque_data)

# Display the summary of the model
summary(model)

## 
## Call:
## lm(formula = mean_fertility ~ age, data = macaque_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49299 -0.05819  0.05484  0.09969  0.16112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.821764   0.080313  10.232 2.15e-09 ***
## age         -0.019259   0.005384  -3.577  0.00189 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1602 on 20 degrees of freedom
## Multiple R-squared:  0.3901, Adjusted R-squared:  0.3596 
## F-statistic: 12.79 on 1 and 20 DF,  p-value: 0.001887

# Extract coefficients
coefficients <- summary(model)$coefficients
intercept <- round(coefficients[1, 1], 3)
slope <- round(coefficients[2, 1], 3)

# Regression equation
paste("Mean Fertility Rate =", intercept, "+", slope, "* Age")

## [1] "Mean Fertility Rate = 0.822 + -0.019 * Age"

This equation represents the estimated mean fertility rate when age is zero. In biological terms, it may not have practical significance since age zero is not within the range of the data. The slope indicates the change in mean fertility rate for each one-year increase in age. A positive slope suggests an increase in fertility with age, while a negative slope indicates a decrease in fertility as age increases. in this case the mean fertility rate decrease by -0.019 for one unit increase in age.

 # Residual plot
par(mfrow = c(2, 2))
plot(model)

Figure 3.2: Residual plot predicted maen fertility of Female Rhesus Macaques

# Reset plotting layout
par(mfrow = c(1, 1))

ggplot(macaque_data, aes(x = age, y = mean_fertility)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("") +
  xlab("Age (years)") +
  ylab("Mean Fertility Rate") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Figure 3.3: Mean Age-Specific Fertility vs Age with Regression Lines

3.1 Interpretation of the Fit

Trend Observation: The Figure 3.3, shows a negative linear relationship between Age and Mean Fertility Rate. As the age of the female rhesus macaques increases, their mean fertility rate tends to decrease. This is evidenced by the downward-sloping blue regression line. Model Fit: The blue line in represents the best linear fit for the data, with the shaded region around it representing the confidence interval for the regression line. The confidence interval widens as age increases, particularly at the extreme ages, indicating higher uncertainty in the predictions for those ages. The spread of the data points around the regression line also suggests that while the linear model captures the general downward trend, there may be some variability in fertility rates that the linear model does not fully explain. Residuals and Model Adequacy: The scatter of data points Figure 3.2, especially at higher ages, it seems that there is some variability not captured by the model. This might suggest that the linear model, while a good first approximation, may not fully capture the complexity of the relationship, particularly at the extreme values of age where the fertility rates may deviate more from the predicted values. Biological Interpretation: The biological implication of this model is that as female rhesus macaques age, their fertility declines, which is a common pattern observed in many species. The linear model provides a straightforward way to describe this decline, although more sophisticated models could be considered if the relationship appears to be more complex than linear.

The age specific fertility predictions for the ages 6.95,14.85,and 19.45 are 0.69, 0.54, and 0.45.

STAT462-24S2 - Data Mining

Ron Taylor