1 Model Fitting
1.1 Data
SolarRadPrediction <- read_csv("Data/SolarRadPrediction.csv")
colnames(SolarRadPrediction)[8] <- "WindDirection"
numData <- SolarRadPrediction[, -c(1,2,3,10,11)]
head(numData)| Radiation | Temperature | Pressure | Humidity | WindDirection | Speed |
|---|---|---|---|---|---|
| 1.21 | 48 | 30.46 | 59 | 177.39 | 5.62 |
| 1.21 | 48 | 30.46 | 58 | 176.78 | 3.37 |
| 1.23 | 48 | 30.46 | 57 | 158.75 | 3.37 |
| 1.21 | 48 | 30.46 | 60 | 137.71 | 3.37 |
| 1.17 | 48 | 30.46 | 62 | 104.95 | 5.62 |
| 1.21 | 48 | 30.46 | 64 | 120.20 | 5.62 |
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Radiation | 1 | 32686 | 207.124697 | 315.9163872 | 2.66 | 143.327743 | 2.194248 | 1.11 | 1601.26 | 1600.15 | 1.3695556 | 0.5106074 | 1.7473957 |
| Temperature | 2 | 32686 | 51.103255 | 6.2011574 | 50.00 | 50.705736 | 5.930400 | 34.00 | 71.00 | 37.00 | 0.5220547 | -0.3164319 | 0.0342998 |
| Pressure | 3 | 32686 | 30.422879 | 0.0546732 | 30.43 | 30.429294 | 0.044478 | 30.19 | 30.56 | 0.37 | -1.2268490 | 2.1190866 | 0.0003024 |
| Humidity | 4 | 32686 | 75.016307 | 25.9902185 | 85.00 | 77.997782 | 22.239000 | 8.00 | 103.00 | 95.00 | -0.7762092 | -0.7508877 | 0.1437570 |
| WindDirection | 5 | 32686 | 143.489821 | 83.1674996 | 147.70 | 137.081818 | 63.025326 | 0.09 | 359.95 | 359.86 | 0.5685598 | 0.2307129 | 0.4600158 |
| Speed | 6 | 32686 | 6.243869 | 3.4904736 | 5.62 | 5.975417 | 3.335850 | 0.00 | 40.50 | 40.50 | 1.4696796 | 6.6303588 | 0.0193065 |
1.2 Correlation
- We can see high positive correlation between Radiation & Temperature.
- Pressure & Temperature has positive low correlation.
- Humidity with Radiation, Temperature, and Pressure and WindDirection with Radiation, Temperature, and Pressure and Speed with Humidity are all having a same negative lower correlation.
1.3 Simple Linear Regression
1.3.1 Model 1
##
## Call:
## lm(formula = Radiation ~ Temperature, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -720.40 -130.80 -14.85 104.55 1132.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1706.2876 9.8369 -173.5 <2e-16 ***
## Temperature 37.4421 0.1911 195.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.2 on 32684 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.5401
## F-statistic: 3.839e+04 on 1 and 32684 DF, p-value: < 2.2e-16
ggplot(data = numData, aes(x = Temperature, y = Radiation)) +
geom_point(color="blue", alpha=0.35, size=2) +
geom_smooth(method=lm, color="red")1.3.2 Model 2
##
## Call:
## lm(formula = Radiation ~ Pressure, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -300.2 -217.6 -174.2 146.6 1361.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20714.84 965.46 -21.46 <2e-16 ***
## Pressure 687.70 31.73 21.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 313.7 on 32684 degrees of freedom
## Multiple R-squared: 0.01416, Adjusted R-squared: 0.01413
## F-statistic: 469.6 on 1 and 32684 DF, p-value: < 2.2e-16
ggplot(data = numData, aes(x = Pressure, y = Radiation)) +
geom_point(color="blue", alpha=0.35, size=2) +
geom_smooth(method=lm, color="red")1.3.3 Model 3
##
## Call:
## lm(formula = Radiation ~ Humidity, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -371.6 -186.7 -134.5 151.3 1443.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 413.35579 5.19952 79.50 <2e-16 ***
## Humidity -2.74915 0.06549 -41.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 307.7 on 32684 degrees of freedom
## Multiple R-squared: 0.05115, Adjusted R-squared: 0.05112
## F-statistic: 1762 on 1 and 32684 DF, p-value: < 2.2e-16
ggplot(data = numData, aes(x = Humidity, y = Radiation)) +
geom_point(color="blue", alpha=0.35, size=2) +
geom_smooth(method=lm, color="red")1.3.4 Model 4
##
## Call:
## lm(formula = Radiation ~ WindDirection, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -331.4 -196.8 -166.0 133.4 1271.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 332.66349 3.39100 98.10 <2e-16 ***
## WindDirection -0.87490 0.02045 -42.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 307.4 on 32684 degrees of freedom
## Multiple R-squared: 0.05305, Adjusted R-squared: 0.05302
## F-statistic: 1831 on 1 and 32684 DF, p-value: < 2.2e-16
ggplot(data = numData, aes(x = WindDirection, y = Radiation)) +
geom_point(color="blue", alpha=0.35, size=2) +
geom_smooth(method=lm, color="red")1.3.5 Model 5
##
## Call:
## lm(formula = Radiation ~ Speed, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.6 -208.9 -179.3 148.2 1375.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165.5166 3.5714 46.34 <2e-16 ***
## Speed 6.6638 0.4993 13.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 315.1 on 32684 degrees of freedom
## Multiple R-squared: 0.005421, Adjusted R-squared: 0.00539
## F-statistic: 178.1 on 1 and 32684 DF, p-value: < 2.2e-16
ggplot(data = numData, aes(x = Speed, y = Radiation)) +
geom_point(color="blue", alpha=0.35, size=2) +
geom_smooth(method=lm, color="red")So the final conclution for the Simple Linear model is first model, that is
\[\hat{Y} = \beta_0 + \beta_1x_1 + \epsilon\]
\[\hat{Radiation} = (-1706.2876) + (37.4421) * Temperature + \epsilon\]
Which gives,
| R-sq | R-sq(adj) |
|---|---|
| 0.5402 | 0.5401 |
1.3.5.1 What does the final model describe
##
## Call:
## lm(formula = Radiation ~ Temperature, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -720.40 -130.80 -14.85 104.55 1132.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1706.2876 9.8369 -173.5 <2e-16 ***
## Temperature 37.4421 0.1911 195.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214.2 on 32684 degrees of freedom
## Multiple R-squared: 0.5402, Adjusted R-squared: 0.5401
## F-statistic: 3.839e+04 on 1 and 32684 DF, p-value: < 2.2e-16
- Residuals: We look for a symmetrical distribution(Median close to 0, Min, Max has same distance from 0(Min & Max and 1Q & 3Q should have close in magnitude)). Here do not appear to be strongly symmetrical.
- Coefficients:
- Estimate: Predicting the response variable. For a example \(1^\circ F\) increment of Temperature, will result \(37.4421 kg/s^3\)(watts per meter squared) increment of Radiation.
- Std.Error: Average amount that the estimate varies from the actual value.(relatively low number than the estimates).
- t-value: \(\frac{Estimate}{Std.error}\)
- p-values: Want \(p-value < \alpha=0.05\) to be statistically significant.
- R-sq: x variable can explain R-sq% of the variation in y-variable.
- R-sq(adj): Scaled by the number of the parameters in the model.
- F-stat, p-val: Tells us that R-sq is significant or not.
1.3.5.2 Regression Assumptions
- Assumption for the linear models never perfectly met,
- But we must check if they are reasonable enough to work with.
- Assumptions,
- Y-values (or the error \(e\)) are independent.
- Y-values (or the error \(e\)) are normally distributed.
- Y-values can express by a linear function of X-variable.
- Homoscedasticity: variation of observation around the regression line is constant.
fit <- lm(formula = Radiation ~ Temperature, data = numData)
plot(Radiation ~ Temperature)
abline(fit, col = "red", lwd = 2)- Residual Plot: Should be no pattern, red line should be flat.
- Normal QQ: Should follow a roughly a diagonal line.
- Scale-Location: Shows increasing and decreasing diagnose.
1.4 Bayesian Simple Regression
1.4.1 Calculations
# Obtain residuals and n
resd <- residuals(fit)
n <- length(resd)
# Calculate MSE
MSE <- sum((resd^2))/(n-2)
MSE## [1] 45895.04
# Model
fit <- lm(formula = Radiation ~ Temperature, data = numData)
# Credible interval
output <- summary(fit)$coef[, 1:2]
#output
out <- cbind(output, confint(fit))
colnames(out) <- c("Posterior Mean", "Posterior Std", "2.5", "97.5")
round(out, 2)## Posterior Mean Posterior Std 2.5 97.5
## (Intercept) -1706.29 9.84 -1725.57 -1687.01
## Temperature 37.44 0.19 37.07 37.82
- Let’s discuss about above bayesian outcome.
- Interpretation: Based on the information we can say that there is a 95% chance that Radiation will increase \(37.07 kg/s^3\) up to \(37.82 kg/s^3\) with every additional \(1^\circ F\) increase in the Temperature.
1.5 Multiple Linear Regression
1.5.1 Model 1
- Having all the variables to the model.
##
## Call:
## lm(formula = Radiation ~ ., data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -752.37 -132.65 -20.19 102.98 1121.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.098e+04 6.959e+02 30.153 < 2e-16 ***
## Temperature 3.837e+01 2.064e-01 185.930 < 2e-16 ***
## Pressure -7.471e+02 2.290e+01 -32.622 < 2e-16 ***
## Humidity -2.691e-01 4.844e-02 -5.555 2.79e-08 ***
## WindDirection -2.694e-01 1.462e-02 -18.432 < 2e-16 ***
## Speed 7.875e+00 3.416e-01 23.053 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 208.2 on 32680 degrees of freedom
## Multiple R-squared: 0.5659, Adjusted R-squared: 0.5659
## F-statistic: 8522 on 5 and 32680 DF, p-value: < 2.2e-16
1.5.2 Model 2
- By looking at the correlations we remove least correlation having with Radiation. Which is Speed.
fit <- lm(formula = Radiation ~ Temperature+Pressure+Humidity+WindDirection,
data = numData)
summary(fit)##
## Call:
## lm(formula = Radiation ~ Temperature + Pressure + Humidity +
## WindDirection, data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -740.27 -133.86 -21.25 102.62 1171.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.282e+04 6.969e+02 32.75 <2e-16 ***
## Temperature 3.810e+01 2.077e-01 183.44 <2e-16 ***
## Pressure -8.048e+02 2.295e+01 -35.07 <2e-16 ***
## Humidity -5.382e-01 4.739e-02 -11.36 <2e-16 ***
## WindDirection -2.593e-01 1.473e-02 -17.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.8 on 32681 degrees of freedom
## Multiple R-squared: 0.5589, Adjusted R-squared: 0.5588
## F-statistic: 1.035e+04 on 4 and 32681 DF, p-value: < 2.2e-16
1.5.3 Model 3
- Remove pressure.
##
## Call:
## lm(formula = Radiation ~ Temperature + Humidity + WindDirection,
## data = numData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -709.90 -138.80 -21.75 106.72 1123.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.616e+03 1.288e+01 -125.494 < 2e-16 ***
## Temperature 3.654e+01 2.066e-01 176.821 < 2e-16 ***
## Humidity -2.649e-01 4.762e-02 -5.564 2.66e-08 ***
## WindDirection -1.683e-01 1.477e-02 -11.392 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 213.7 on 32682 degrees of freedom
## Multiple R-squared: 0.5423, Adjusted R-squared: 0.5422
## F-statistic: 1.291e+04 on 3 and 32682 DF, p-value: < 2.2e-16
- We can see only all variable including gives us higher R-sq value
1.6 Bayesian Multiple Linear Regression
# Bayesian Multiple Model fitting
bas.lmall = bas.lm(Radiation ~ ., data = numData,
prior = "BIC",
modelprior = Bernoulli(1),
include.always = ~ .,
n.models = 1)
# Coefficients
lmall.coef <- coef(bas.lmall)
lmall.coef##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 1 models
## post mean post SD post p(B != 0)
## Intercept 207.12470 1.15133 1.00000
## Temperature 38.37214 0.20638 1.00000
## Pressure -747.06869 22.90106 1.00000
## Humidity -0.26910 0.04844 1.00000
## WindDirection -0.26944 0.01462 1.00000
## Speed 7.87488 0.34160 1.00000
# Summary table
out = confint(lmall.coef)[, 1:2]
# Extract the upper and lower bounds of the credible intervals
names = c("posterior mean", "posterior std", colnames(out))
out = cbind(lmall.coef$postmean, lmall.coef$postsd, out)
colnames(out) = names
round(out, 2)## posterior mean posterior std 2.5% 97.5%
## Intercept 207.12 1.15 204.87 209.38
## Temperature 38.37 0.21 37.97 38.78
## Pressure -747.07 22.90 -791.96 -702.18
## Humidity -0.27 0.05 -0.36 -0.17
## WindDirection -0.27 0.01 -0.30 -0.24
## Speed 7.87 0.34 7.21 8.54
1.6.1 Model selection
- We start with the full model, with all possible predictors.
- i.e Temperature, Pressure, Humidity, WindDirection, Speed.
- We drop one variable at a time and record BIC value.
- Then finally choose the best model having minimum BIC value.
# Total # of observation
n <- nrow(numData)
# Full Model
rad.lm1 <- lm(Radiation ~ ., data = numData)
rad.step <- step(rad.lm1, k = log(n))## Start: AIC=349029.7
## Radiation ~ Temperature + Pressure + Humidity + WindDirection +
## Speed
##
## Df Sum of Sq RSS AIC
## <none> 1415934816 349030
## - Humidity 1 1337143 1417271959 349050
## - WindDirection 1 14720291 1430655107 349357
## - Speed 1 23025460 1438960276 349547
## - Pressure 1 46107462 1462042278 350067
## - Temperature 1 1497820497 2913755313 372607
# Let's drop each variable and test
## Tried, Speed,Pressure,WindDirection,Humidity,
rad.lmt <- lm(Radiation ~ Temperature+Pressure+Humidity+WindDirection,
data = numData)
rad.step <- step(rad.lmt, k = log(n))## Start: AIC=349546.5
## Radiation ~ Temperature + Pressure + Humidity + WindDirection
##
## Df Sum of Sq RSS AIC
## <none> 1438960276 349547
## - Humidity 1 5679155 1444639431 349665
## - WindDirection 1 13645524 1452605800 349845
## - Pressure 1 54157664 1493117940 350744
## - Temperature 1 1481702879 2920663155 372674
| Model | BIC value | BIC Difference from FM |
|---|---|---|
| FM | 349029.7 | 0 |
| FM-Temperature | 372607.2 | -23577.5 |
| FM-Pressure | 350066.7 | -1037 |
| FM-Humidity | 349050.1 | -20.4 |
| FM-WindDirection | 349357.3 | -327.6 |
| FM-Speed | 349546.5 | -516.8 |
- Tried removing each less correlated variable, but we got the minimum BIC on Full Model.
1.6.1.1 BAS to get best model
# Model
bas_model <- bas.lm(Radiation ~ ., data = numData,
prior = "BIC",
modelprior = uniform()) # equal prior to the model
#bas_model
bas_coef <- coef(bas_model)
bas_coef##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 32 models
## post mean post SD post p(B != 0)
## Intercept 207.12470 1.15133 1.00000
## Temperature 38.37215 0.20639 1.00000
## Pressure -747.06784 22.90148 1.00000
## Humidity -0.26909 0.04847 0.99996
## WindDirection -0.26944 0.01462 1.00000
## Speed 7.87490 0.34161 1.00000
# Best model
best <- which.max(bas_model$logmarg)
bestmodel <- bas_model$which[[best]]
#bestmodel
bestgamma <- rep(0, bas_model$n.vars)
bestgamma[bestmodel + 1] <- 1
#bestgamma
bas_bestmodel <- bas.lm(Radiation ~ ., data = numData,
prior = "BIC", n.models = 1, bestmodel = bestgamma,
modelprior = uniform())
## Coefficients
bas_coef <- coef(bas_bestmodel)
## creadible intervals
out <- confint(bas_coef)[, 1:2]
## Summary table
bas_summary <- cbind(bas_coef$postmean, bas_coef$postsd, out)
names <- c("Posterior Mean", "Posterior SD", colnames(out))
colnames(bas_summary) <- names
bas_summary## Posterior Mean Posterior SD 2.5% 97.5%
## Intercept 207.1246974 1.15132973 204.8680490 209.3813458
## Temperature 38.3721364 0.20637957 37.9676249 38.7766479
## Pressure -747.0686858 22.90105705 -791.9555954 -702.1817763
## Humidity -0.2691005 0.04844022 -0.3640451 -0.1741559
## WindDirection -0.2694445 0.01461813 -0.2980965 -0.2407924
## Speed 7.8748844 0.34160218 7.2053317 8.5444372
## Get posterior probability
bas_model <- bas.lm(Radiation ~ Temperature+Pressure+Humidity+WindDirection+Speed,
data = numData,
prior = "BIC",
modelprior = uniform())
round(summary(bas_model), 3)## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1 1.000 1.000 1.000 1.000
## Temperature 1 1.000 1.000 1.000 1.000
## Pressure 1 1.000 1.000 1.000 1.000
## Humidity 1 1.000 0.000 1.000 0.000
## WindDirection 1 1.000 1.000 0.000 0.000
## Speed 1 1.000 1.000 1.000 1.000
## BF NA 1.000 0.000 0.000 0.000
## PostProbs NA 1.000 0.000 0.000 0.000
## R2 NA 0.566 0.566 0.561 0.561
## dim NA 6.000 5.000 5.000 4.000
## logmarg NA -344395.449 -344405.678 -344559.279 -344561.106
## model 5
## Intercept 1.000
## Temperature 1.000
## Pressure 1.000
## Humidity 1.000
## WindDirection 1.000
## Speed 0.000
## BF 0.000
## PostProbs 0.000
## R2 0.559
## dim 5.000
## logmarg -344653.878
##
## Call:
## bas.lm(formula = Radiation ~ Temperature + Pressure + Humidity +
## WindDirection + Speed, data = numData, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept Temperature Pressure Humidity WindDirection
## 1 1 1 1 1
## Speed
## 1
2 Interpritation
3 Questions that I had…. have :)
- Standardization, Normalization is necessary in Regression modeling ?.
- How to set prior information to a model.