Homework Assignment 1

# Load necessary libraries
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

library(car)

## Loading required package: carData

##Problem 1

# Read data
solar <- read.csv("SolarThermalEnergy.csv")

#Removing the last observation
solar <- solar[-nrow(solar), ]

##Problem 1a

# Fitting simple linear regression model
model <- lm(y ~ x4, data = solar)

# Summary of the model
summary(model)

## 
## Call:
## lm(formula = y ~ x4, data = solar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.7378  -5.0615   0.9373   7.2909  25.0246 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  606.677     42.576   14.25 8.52e-14 ***
## x4           -21.408      2.545   -8.41 6.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 26 degrees of freedom
## Multiple R-squared:  0.7312, Adjusted R-squared:  0.7209 
## F-statistic: 70.73 on 1 and 26 DF,  p-value: 6.84e-09

##Results
#The fitted regression equation is: y = 606.677 - 21.408(x4)

#Intercept (606.677, p < 0.001):
#This is the estimated total heat flux when the radial deflection (x4)
#is zero. It is statistically significant, but if x4 = 0 lies outside
#the range of the observed data, the intercept is an extrapolation and
#may not have a useful practical interpretation.

#Slope (-21.408, p < 0.001):
#There is a statistically significant negative relationship between
#radial deflection (x4) and total heat flux (y). For every 1 milliradian
#increase in radial deflection, the total heat flux decreases by 21.408
#kilowatts on average.


#Model Fit (R-squared = 0.7312)
#73.12% of the variation in heat flux is explained by the model.
#Since the R-squared is relatively high, the model fits the data
#reasonably well.

#F-Statistic (70.73, p-value: 6.84e-09)
#This very small p-value indicates that the model is statistically significant
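
#As a rough check on the slope test, the t value and p-value can be
#recomputed from the estimate and standard error printed in the summary
#above. This is only a sketch: the numbers are copied by hand from that
#output, and the variable names are illustrative, not part of the
#original analysis.

```r
# Values copied from the summary(model) output (assumed, hand-entered)
estimate  <- -21.408   # slope estimate for x4
std_error <- 2.545     # standard error of the slope
df_resid  <- 26        # residual degrees of freedom

# t statistic for H0: slope = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value_slope <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 2)
p_value_slope
```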

plot(solar$x4, solar$y, 
     main = "Scatter Plot of Total Heat Flux vs Radial Deflection",
     xlab = "Radial Deflection (milliradians)", 
     ylab = "Total Heat Flux (kilowatts)",
     col = "blue", pch = 16)

#Conclusion
#The scatter plot above shows a negative linear relationship: as radial
#deflection increases, total heat flux tends to decrease.

# Create scatter plot with regression line and loess smoother
ggplot(solar, aes(x = x4, y = y)) +
  geom_point() +  
  geom_smooth(method = "lm", color = "blue", se = TRUE) +  
  geom_smooth(method = "loess", color = "red", se = TRUE) +   
  labs(title = "Heat Flux vs. Radial Deflection",
       x = "Radial Deflection (milliradians)",
       y = "Total Heat Flux (kW)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#The linear regression line (blue) suggests a consistent downward trend,
#implying a simple negative relationship between the variables.

#The LOESS smoother (red), however, indicates slight deviations from a
#strict linear trend, suggesting that the rate of decrease in total heat
#flux is not entirely uniform.

#Conclusions
#The LOESS curve suggests some non-linearity in the relationship.
#A straight-line model might not fully capture the pattern, particularly
#at the higher range of x4 (radial deflection).
#If the goal is simplicity, a linear model may be a reasonable
#approximation, but if higher accuracy is required, a more flexible model
#such as polynomial regression or another non-linear approach should be considered.
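
#One possible way to check whether a more flexible model is worthwhile is
#to compare the straight-line fit with a quadratic fit using a partial
#F-test. The sketch below uses simulated data, NOT the solar data set;
#x_sim, y_sim, and the coefficients used to generate them are
#assumptions made only for illustration.

```r
# Simulated data for illustration only (not the solar data)
set.seed(1)
x_sim <- runif(30, min = 16, max = 18)
y_sim <- 600 - 21 * x_sim - 8 * (x_sim - 17)^2 + rnorm(30, sd = 10)

# Straight-line model and a model with an added quadratic term
fit_linear    <- lm(y_sim ~ x_sim)
fit_quadratic <- lm(y_sim ~ x_sim + I(x_sim^2))

# Partial F-test: does the quadratic term significantly reduce
# the residual sum of squares?
comparison <- anova(fit_linear, fit_quadratic)
comparison
```

#A small p-value in the second row would favour the quadratic model.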

##Problem 1b

#Construct the ANOVA table and test for significance of regression
#The hypotheses for the test are:
  #H0: The slope is zero (beta1 = 0), so the regression is not significant.
  #H1: The slope is not zero (beta1 != 0), so the regression is significant.

#Anova table
anova_table <- anova(model)
anova_table

# Check the p-value for the regression significance
p_value <- anova_table["x4", "Pr(>F)"]
if (p_value < 0.05) {
  print("The regression model is significant.")
} else {
  print("The regression model is not significant.")
}

## [1] "The regression model is significant."

#The p-value (6.84e-09) is less than the significance level of 0.05, so
#we reject the null hypothesis and conclude that the regression is
#statistically significant: radial deflection (x4) explains a significant
#share of the variation in total heat flux (y).
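
#In simple linear regression the ANOVA F test and the t test on the slope
#are equivalent (F = t^2). The sketch below checks this with the F and t
#values printed in the summary output for Problem 1a; the numbers are
#copied by hand and the variable names are illustrative.

```r
# Values copied from the model summary (assumed, hand-entered)
f_value    <- 70.73   # F-statistic on 1 and 26 DF
df1        <- 1
df2        <- 26
t_value_x4 <- -8.41   # t value for the slope of x4

# Upper-tail p-value of the F test
p_value_f <- pf(f_value, df1, df2, lower.tail = FALSE)

# Critical value at the 5% significance level
f_crit <- qf(0.95, df1, df2)

# With a single predictor, t^2 should match F up to rounding
t_value_x4^2
p_value_f
f_crit
```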

##Problem 1c

#Find the 99% confidence interval on the slope
slope = confint(model, level = 0.99)
slope

##                 0.5 %    99.5 %
## (Intercept) 488.36892 724.98419
## x4          -28.48064 -14.33447

#The 99% confidence interval for the slope of x4 is (-28.48064, -14.33447).
#The 99% confidence interval for the intercept is (488.36892, 724.98419).
#Because the slope interval lies entirely below zero, it further supports
#that an increase in radial deflection reduces total heat flux, with an
#estimated decrease of between 14.33 kW and 28.48 kW per milliradian
#increase in x4.
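
#The interval reported by confint() can be reproduced by hand as
#estimate +/- t(0.995, 26) * standard error. The sketch below uses the
#rounded estimate and standard error from the summary in Problem 1a, so
#it agrees with the confint() output only up to rounding. Variable names
#are illustrative.

```r
# Values copied from the summary(model) output (assumed, hand-entered)
estimate  <- -21.408   # slope estimate for x4
std_error <- 2.545     # standard error of the slope
df_resid  <- 26        # residual degrees of freedom

# Critical t value for a 99% two-sided interval
t_crit <- qt(1 - 0.01 / 2, df = df_resid)

# Lower and upper limits of the 99% confidence interval on the slope
ci_slope <- estimate + c(-1, 1) * t_crit * std_error
ci_slope
```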

##Problem 1d

#Calculate R-squared 
R_squared = summary(model)$r.squared
R_squared

## [1] 0.731211

##Comments
#The R-squared value obtained is 73.12%
#This implies that 73.12% of the variation in total heat flux
#y(kilowatts) can be explained by the radial deflection of the deflected
#rays x4 in the model.
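
#Two quantities follow directly from R-squared in a simple regression:
#the adjusted R-squared and the sample correlation between x4 and y
#(whose sign matches the slope). The sketch below assumes n = 28, which
#is inferred from the 26 residual degrees of freedom plus 2 estimated
#coefficients; the variable names are illustrative.

```r
r_squared <- 0.731211  # from summary(model)$r.squared
n <- 28                # assumed: 26 residual df + 2 coefficients
k <- 1                 # number of predictors

# Adjusted R-squared penalises for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# With one predictor, the correlation is the signed square root of
# R-squared; the slope is negative, so the correlation is negative
r_xy <- -sqrt(r_squared)

round(adj_r_squared, 4)
round(r_xy, 3)
```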

##Problem 1e

# 95% CI on the mean heat flux when the radial deflection is 16.5 milliradians.
# Define the new data point 
new_solar = data.frame(x4 = 16.5)

#Predict the mean heat flux at x4 = 16.5 milliradians
d2 = predict(model, new_solar, interval = "confidence", level = 0.95)
d2

##        fit      lwr      upr
## 1 253.4519 248.5842 258.3196

##Conclusion
#The predicted mean heat flux when the radial deflection is
#16.5 milliradians is 253.4519 kW.
#With 95% confidence, the true mean heat flux at x4 = 16.5 milliradians
#lies between 248.5842 kW and 258.3196 kW.

##Problem 1f

#To predict the mean total heat flux using the simple regression obtained in part (a) when x4 = 16.5.
# Compute the point estimate for x4 = 16.5
point_estimate <- predict(model, newdata = new_solar)
print(paste("Point estimate: ", point_estimate))

## [1] "Point estimate:  253.451881818248"

# Compute the 95% prediction interval for x4 = 16.5
prediction <- predict(model, newdata = new_solar, interval = "prediction", level = 0.95)
print("95% Prediction Interval:")

## [1] "95% Prediction Interval:"

print(prediction)

##        fit      lwr      upr
## 1 253.4519 227.8407 279.0631

##Results
#The predicted total heat flux when x4 = 16.5 is 253.45 kW.
#Prediction interval (227.84, 279.06): with 95% confidence, a single
#future observation of total heat flux at x4 = 16.5 milliradians will
#fall between 227.84 kW and 279.06 kW.

##Problem 1g

##Conclusion
#The prediction interval (227.84, 279.06) is wider than the confidence
#interval (248.58, 258.32). The confidence interval reflects only the
#uncertainty in the estimated mean response at x4 = 16.5, while the
#prediction interval also accounts for the variability of an individual
#future observation around that mean.
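
#The difference in width can be made concrete: the prediction standard
#error adds the residual variance s^2 to the squared standard error of
#the mean response. The sketch below rebuilds the prediction interval
#from the confidence interval and the residual standard error printed
#earlier. Inputs are hand-copied and rounded, so the result matches the
#predict() output only approximately; variable names are illustrative.

```r
# Values copied from earlier output (assumed, hand-entered)
fit      <- 253.4519                          # fitted value at x4 = 16.5
ci       <- c(lwr = 248.5842, upr = 258.3196) # 95% CI on the mean
s        <- 12.23                             # residual standard error
df_resid <- 26

t_crit <- qt(0.975, df = df_resid)

# Standard error of the estimated mean, recovered from the CI half-width
ci_half <- unname(ci["upr"] - ci["lwr"]) / 2
se_mean <- ci_half / t_crit

# Prediction standard error adds the residual variance
se_pred <- sqrt(s^2 + se_mean^2)
pi_half <- t_crit * se_pred

c(lwr = fit - pi_half, upr = fit + pi_half)
```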

##Problem 2a

# Multiple regression model relating total heat flux to all five predictor variables

model2 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = solar) 
summary(model2)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = solar)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6188  -2.7896   0.4168   4.3807  16.1877 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 345.15900   97.42412   3.543  0.00183 ** 
## x1            0.07105    0.02905   2.445  0.02293 *  
## x2            2.12624    1.30292   1.632  0.11693    
## x3            3.50171    1.48064   2.365  0.02727 *  
## x4          -22.92065    2.69262  -8.512 2.07e-08 ***
## x5            2.59559    1.80825   1.435  0.16523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.006 on 22 degrees of freedom
## Multiple R-squared:  0.9026, Adjusted R-squared:  0.8804 
## F-statistic: 40.76 on 5 and 22 DF,  p-value: 2.102e-10

#Model: A multiple regression model is used to predict y (total heat
#flux) using five predictors (x1, x2, x3, x4, x5).
#Total Heat Flux = 345.159 + 0.071(x1) + 2.126(x2) + 3.502(x3) - 22.921(x4) + 2.596(x5)

#Significant Predictors: x1, x3, and x4 are statistically significant
#with p-values less than 0.05.

#Non-significant Predictors: x2 and x5 are not statistically significant
#(p-values greater than 0.05).

#R-squared: 90.26% of the variance in y is explained by the model,
#which indicates a good fit.

#F-statistic: The model is significant overall, since the p-value
#(2.102e-10) is less than the significance level of 0.05.

##Problem 2b

#To test the significance of the regression, we look at the F-statistic and its associated p-value.

# Test the significance of the regression in (a)
#Null hypothesis (H0): All regression coefficients are equal to zero,
#that is, the model does not explain any significant variation in total
#heat flux.

#Alternative hypothesis (H1): At least one of the regression coefficients
#is not equal to zero.

#The F-statistic is 40.76 with a p-value of 2.102e-10, which is less than
#the significance level of 0.05. We therefore reject the null hypothesis
#and conclude that at least one of the predictors is significantly
#related to total heat flux.

#Thus, we conclude that the model explains a significant portion of the
#variation in total heat flux.
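
#The overall F-statistic can also be recovered from R-squared as
#F = (R^2 / k) / ((1 - R^2) / residual df). The sketch below uses the
#rounded R-squared from the summary in Problem 2a, so the result agrees
#with the reported 40.76 only up to rounding. Variable names are
#illustrative.

```r
# Values copied from the summary(model2) output (assumed, hand-entered)
r_squared <- 0.9026   # multiple R-squared
k         <- 5        # number of predictors
df_resid  <- 22       # residual degrees of freedom

# Overall F-statistic implied by R-squared
f_value <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Critical value at the 5% level and p-value for the reported F
f_crit    <- qf(0.95, k, df_resid)
p_value_f <- pf(40.76, k, df_resid, lower.tail = FALSE)

round(f_value, 2)
f_crit
p_value_f
```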

##Problem 2c

#T-tests assess the significance of each predictor by testing whether
#their coefficients are significantly different from zero.

#Null hypothesis (H0): The coefficient for the predictor is equal to zero
#(that is, the predictor has no effect on total heat flux).

#Alternative hypothesis (H1): The coefficient for the predictor is not
#equal to zero (that is, the predictor has an effect on total heat flux)

#Intercept: Estimate = 345.159, t-value = 3.543, p-value = 0.00183
# The intercept is statistically significant (p=0.00183), which means
#that when all predictor variables are zero, the total heat flux is
#significantly different from zero.


#x1: Estimate = 0.07105, t-value = 2.445, p-value = 0.02293
#This predictor is statistically significant (p=0.02293) at the 5%
#significance level. A unit increase in x1 is associated with a
#significant increase in total heat flux.

#x2: Estimate = 2.12624, t-value = 1.632, p-value = 0.11693
#This predictor is not statistically significant (p=0.11693). 
#The evidence suggests that x2 does not contribute significantly to
#explaining the variation in total heat flux at the 5% significance level.

#x3: Estimate = 3.50171, t-value = 2.365, p-value = 0.02727
#This predictor is statistically significant (p=0.02727).
#A unit increase in x3 leads to a significant increase in total heat flux.

#x4: Estimate = -22.92065, t-value = -8.512, p-value = 2.07e-08
#This predictor is highly statistically significant (p=2.07e-08), 
#with a very small p-value. A unit increase in x4 leads to a significant
#decrease in total heat flux.

#x5: Estimate = 2.59559, t-value = 1.435, p-value = 0.16523
#This predictor is not statistically significant (p=0.16523), meaning it
#does not contribute significantly to explaining the variation in total heat flux.

#In conclusion, x1, x3, and x4 are important predictors in the model
#while x2 and x5 do not appear to provide significant additional explanatory value.
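
#The individual t-tests above can be reproduced in one step from the
#estimates and standard errors in the coefficient table of Problem 2a.
#This is a sketch using hand-copied values; the vector names are
#illustrative.

```r
# Estimates and standard errors copied from summary(model2) (hand-entered)
estimates  <- c(x1 = 0.07105, x2 = 2.12624, x3 = 3.50171,
                x4 = -22.92065, x5 = 2.59559)
std_errors <- c(x1 = 0.02905, x2 = 1.30292, x3 = 1.48064,
                x4 = 2.69262, x5 = 1.80825)
df_resid <- 22

# t statistic for H0: coefficient = 0, and two-sided p-values
t_values <- estimates / std_errors
p_values <- 2 * pt(-abs(t_values), df = df_resid)

# Predictors significant at the 5% level
significant <- names(p_values)[p_values < 0.05]

round(cbind(t_values, p_values), 4)
significant
```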

##Problem 2d

#Contribution of x1 to the Model Given Other Predictors

#The coefficient of x1 is 0.07105, meaning that for each unit increase in
#x1, the total heat flux increases by approximately 0.07105 kilowatts,
#holding all other predictor variables constant.
#The p-value for x1 is 0.02293, which is below the significance level of 0.05. 

#x1 makes a statistically significant contribution to the model after
#accounting for the other predictors, as indicated by its positive
#coefficient, low p-value, and significant t-value. It explains
#additional variation in y that is not fully explained by x2, x3, x4, and x5.

#The coefficient of 0.07105 suggests that x1 has a small but meaningful positive effect on y when the effects of the other variables are controlled for.
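
#The contribution of x1 given the other predictors can also be framed as
#a partial F-test (extra sum of squares). For a single coefficient the
#partial F equals the square of its t value. The sketch below uses the
#rounded t value from the summary in Problem 2a; variable names are
#illustrative.

```r
t_x1     <- 2.445   # t value for x1 from summary(model2) (hand-entered)
df_resid <- 22      # residual degrees of freedom of the full model

# Partial F-statistic for adding x1 to a model with x2, x3, x4, x5
partial_f <- t_x1^2

# p-value on 1 and 22 degrees of freedom
p_partial <- pf(partial_f, 1, df_resid, lower.tail = FALSE)

partial_f
p_partial

# With the data loaded, the same test could be run as:
# anova(lm(y ~ x2 + x3 + x4 + x5, data = solar), model2)
```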

##Problem 2e

vif(model2)

##       x1       x2       x3       x4       x5 
## 2.303242 1.367654 3.268810 2.612310 5.377488

##Results
#There is moderate multicollinearity present for x5, which has a VIF of
#5.38, slightly above the commonly used threshold of 5.
#x1, x2, x3, and x4 have VIF values below 5, which suggests there is no
#strong multicollinearity for these predictors.
#The VIF for x5 indicates that it may be partly explained by the other
#predictor variables in the model.

#While it's not drastically high, it is worth investigating whether this
#variable is contributing redundantly with other variables and whether
#removing or transforming it might improve the model.
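
#Since VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
#predictor j on the other predictors, each VIF can be converted back to
#that R-squared. The sketch below does this with the VIF values printed
#above; the vector names are illustrative.

```r
# VIF values copied from the vif(model2) output (hand-entered)
vif_values <- c(x1 = 2.303242, x2 = 1.367654, x3 = 3.268810,
                x4 = 2.612310, x5 = 5.377488)

# Share of each predictor's variance explained by the other predictors
r2_aux <- 1 - 1 / vif_values
round(r2_aux, 3)

# Predictors exceeding the commonly used threshold of 5
names(vif_values)[vif_values > 5]
```

#For x5 this gives an R-squared of about 0.81, meaning roughly 81% of
#its variance is explained by x1 through x4.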

Doris Mbitazi Asongafac

2025-01-31