ST406 - Project 2 Model fitting

P.I.N.Kehelbedda

1/3/2021

1 Model Fitting

1.1 Data

SolarRadPrediction <- read_csv("Data/SolarRadPrediction.csv")
colnames(SolarRadPrediction)[8] <- "WindDirection"
numData <- SolarRadPrediction[, -c(1,2,3,10,11)]
head(numData)
Radiation Temperature Pressure Humidity WindDirection Speed
1.21 48 30.46 59 177.39 5.62
1.21 48 30.46 58 176.78 3.37
1.23 48 30.46 57 158.75 3.37
1.21 48 30.46 60 137.71 3.37
1.17 48 30.46 62 104.95 5.62
1.21 48 30.46 64 120.20 5.62
describe(numData)
vars n mean sd median trimmed mad min max range skew kurtosis se
Radiation 1 32686 207.124697 315.9163872 2.66 143.327743 2.194248 1.11 1601.26 1600.15 1.3695556 0.5106074 1.7473957
Temperature 2 32686 51.103255 6.2011574 50.00 50.705736 5.930400 34.00 71.00 37.00 0.5220547 -0.3164319 0.0342998
Pressure 3 32686 30.422879 0.0546732 30.43 30.429294 0.044478 30.19 30.56 0.37 -1.2268490 2.1190866 0.0003024
Humidity 4 32686 75.016307 25.9902185 85.00 77.997782 22.239000 8.00 103.00 95.00 -0.7762092 -0.7508877 0.1437570
WindDirection 5 32686 143.489821 83.1674996 147.70 137.081818 63.025326 0.09 359.95 359.86 0.5685598 0.2307129 0.4600158
Speed 6 32686 6.243869 3.4904736 5.62 5.975417 3.335850 0.00 40.50 40.50 1.4696796 6.6303588 0.0193065
attach(numData)

1.2 Correlation

m <- cor(numData)
m_round <- round(m, digits = 4)
corrplot(corr = m_round, method = "number")

corrplot(corr = m_round, method = "circle")

#ggpairs(numData, 
#        lower = list(continuous = wrap("smooth", alpha = 0.05, size=0.1)))
  • We can see high positive correlation between Radiation & Temperature.
  • Pressure & Temperature has positive low correlation.
  • Humidity with Radiation, Temperature, and Pressure and WindDirection with Radiation, Temperature, and Pressure and Speed with Humidity are all having a same negative lower correlation.

1.3 Simple Linear Regression

1.3.1 Model 1

fit <- lm(formula = Radiation ~ Temperature, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Temperature, data = numData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -720.40 -130.80  -14.85  104.55 1132.38 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1706.2876     9.8369  -173.5   <2e-16 ***
## Temperature    37.4421     0.1911   195.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.2 on 32684 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.5401 
## F-statistic: 3.839e+04 on 1 and 32684 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

ggplot(data = numData, aes(x = Temperature, y = Radiation)) + 
  geom_point(color="blue", alpha=0.35, size=2) +
  geom_smooth(method=lm, color="red")

  #geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2])

1.3.2 Model 2

fit <- lm(formula = Radiation ~ Pressure, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Pressure, data = numData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -300.2 -217.6 -174.2  146.6 1361.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20714.84     965.46  -21.46   <2e-16 ***
## Pressure       687.70      31.73   21.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 313.7 on 32684 degrees of freedom
## Multiple R-squared:  0.01416,    Adjusted R-squared:  0.01413 
## F-statistic: 469.6 on 1 and 32684 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

ggplot(data = numData, aes(x = Pressure, y = Radiation)) + 
  geom_point(color="blue", alpha=0.35, size=2) +
  geom_smooth(method=lm, color="red")

1.3.3 Model 3

fit <- lm(formula = Radiation ~ Humidity, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Humidity, data = numData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -371.6 -186.7 -134.5  151.3 1443.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 413.35579    5.19952   79.50   <2e-16 ***
## Humidity     -2.74915    0.06549  -41.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.7 on 32684 degrees of freedom
## Multiple R-squared:  0.05115,    Adjusted R-squared:  0.05112 
## F-statistic:  1762 on 1 and 32684 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

ggplot(data = numData, aes(x = Humidity, y = Radiation)) + 
  geom_point(color="blue", alpha=0.35, size=2) +
  geom_smooth(method=lm, color="red")

1.3.4 Model 4

fit <- lm(formula = Radiation ~ WindDirection, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ WindDirection, data = numData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -331.4 -196.8 -166.0  133.4 1271.7 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   332.66349    3.39100   98.10   <2e-16 ***
## WindDirection  -0.87490    0.02045  -42.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 307.4 on 32684 degrees of freedom
## Multiple R-squared:  0.05305,    Adjusted R-squared:  0.05302 
## F-statistic:  1831 on 1 and 32684 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

ggplot(data = numData, aes(x = WindDirection, y = Radiation)) + 
  geom_point(color="blue", alpha=0.35, size=2) +
  geom_smooth(method=lm, color="red")

1.3.5 Model 5

fit <- lm(formula = Radiation ~ Speed, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Speed, data = numData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -411.6 -208.9 -179.3  148.2 1375.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 165.5166     3.5714   46.34   <2e-16 ***
## Speed         6.6638     0.4993   13.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 315.1 on 32684 degrees of freedom
## Multiple R-squared:  0.005421,   Adjusted R-squared:  0.00539 
## F-statistic: 178.1 on 1 and 32684 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

ggplot(data = numData, aes(x = Speed, y = Radiation)) + 
  geom_point(color="blue", alpha=0.35, size=2) +
  geom_smooth(method=lm, color="red")

So the final conclution for the Simple Linear model is first model, that is

\[\hat{Y} = \beta_0 + \beta_1x_1 + \epsilon\]

\[\hat{Radiation} = (-1706.2876) + (37.4421) * Temperature + \epsilon\]

Which gives,

R-sq R-sq(adj)
0.5402 0.5401

1.3.5.1 What does the final model describe

fit <- lm(formula = Radiation ~ Temperature, data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Temperature, data = numData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -720.40 -130.80  -14.85  104.55 1132.38 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1706.2876     9.8369  -173.5   <2e-16 ***
## Temperature    37.4421     0.1911   195.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214.2 on 32684 degrees of freedom
## Multiple R-squared:  0.5402, Adjusted R-squared:  0.5401 
## F-statistic: 3.839e+04 on 1 and 32684 DF,  p-value: < 2.2e-16
    1. Residuals: We look for a symmetrical distribution(Median close to 0, Min, Max has same distance from 0(Min & Max and 1Q & 3Q should have close in magnitude)). Here do not appear to be strongly symmetrical.
    1. Coefficients:
    • Estimate: Predicting the response variable. For a example \(1^\circ F\) increment of Temperature, will result \(37.4421 kg/s^3\)(watts per meter squared) increment of Radiation.
    • Std.Error: Average amount that the estimate varies from the actual value.(relatively low number than the estimates).
    • t-value: \(\frac{Estimate}{Std.error}\)
    • p-values: Want \(p-value < \alpha=0.05\) to be statistically significant.
  • R-sq: x variable can explain R-sq% of the variation in y-variable.
  • R-sq(adj): Scaled by the number of the parameters in the model.
  • F-stat, p-val: Tells us that R-sq is significant or not.

1.3.5.2 Regression Assumptions

  • Assumption for the linear models never perfectly met,
  • But we must check if they are reasonable enough to work with.
  • Assumptions,
      1. Y-values (or the error \(e\)) are independent.
      1. Y-values (or the error \(e\)) are normally distributed.
      1. Y-values can express by a linear function of X-variable.
      1. Homoscedasticity: variation of observation around the regression line is constant.
fit <- lm(formula = Radiation ~ Temperature, data = numData)
plot(Radiation ~ Temperature)
abline(fit, col = "red", lwd = 2)

par(mfrow = c(2,2))
plot(fit)

  • Residual Plot: Should be no pattern, red line should be flat.
  • Normal QQ: Should follow a roughly a diagonal line.
  • Scale-Location: Shows increasing and decreasing diagnose.

1.4 Bayesian Simple Regression

1.4.1 Calculations

# Obtain residuals and n
resd <- residuals(fit)
n <- length(resd)
# Calculate MSE
MSE <- sum((resd^2))/(n-2)
MSE
## [1] 45895.04
# Model
fit <- lm(formula = Radiation ~ Temperature, data = numData)
# Credible interval 
output <- summary(fit)$coef[, 1:2]
#output

out <- cbind(output, confint(fit))
colnames(out) <- c("Posterior Mean", "Posterior Std", "2.5", "97.5")
round(out, 2)
##             Posterior Mean Posterior Std      2.5     97.5
## (Intercept)       -1706.29          9.84 -1725.57 -1687.01
## Temperature          37.44          0.19    37.07    37.82
  • Let’s discuss about above bayesian outcome.
  • Interpretation: Based on the information we can say that there is a 95% chance that Radiation will increase \(37.07 kg/s^3\) up to \(37.82 kg/s^3\) with every additional \(1^\circ F\) increase in the Temperature.

1.5 Multiple Linear Regression

1.5.1 Model 1

  • Having all the variables to the model.
fit <- lm(formula = Radiation ~ ., data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ ., data = numData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -752.37 -132.65  -20.19  102.98 1121.15 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.098e+04  6.959e+02  30.153  < 2e-16 ***
## Temperature    3.837e+01  2.064e-01 185.930  < 2e-16 ***
## Pressure      -7.471e+02  2.290e+01 -32.622  < 2e-16 ***
## Humidity      -2.691e-01  4.844e-02  -5.555 2.79e-08 ***
## WindDirection -2.694e-01  1.462e-02 -18.432  < 2e-16 ***
## Speed          7.875e+00  3.416e-01  23.053  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208.2 on 32680 degrees of freedom
## Multiple R-squared:  0.5659, Adjusted R-squared:  0.5659 
## F-statistic:  8522 on 5 and 32680 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit)

1.5.2 Model 2

  • By looking at the correlations we remove least correlation having with Radiation. Which is Speed.
fit <- lm(formula = Radiation ~ Temperature+Pressure+Humidity+WindDirection,
          data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Temperature + Pressure + Humidity + 
##     WindDirection, data = numData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -740.27 -133.86  -21.25  102.62 1171.57 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.282e+04  6.969e+02   32.75   <2e-16 ***
## Temperature    3.810e+01  2.077e-01  183.44   <2e-16 ***
## Pressure      -8.048e+02  2.295e+01  -35.07   <2e-16 ***
## Humidity      -5.382e-01  4.739e-02  -11.36   <2e-16 ***
## WindDirection -2.593e-01  1.473e-02  -17.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.8 on 32681 degrees of freedom
## Multiple R-squared:  0.5589, Adjusted R-squared:  0.5588 
## F-statistic: 1.035e+04 on 4 and 32681 DF,  p-value: < 2.2e-16

1.5.3 Model 3

  • Remove pressure.
fit <- lm(formula = Radiation ~ Temperature+Humidity+WindDirection,
          data = numData)
summary(fit)
## 
## Call:
## lm(formula = Radiation ~ Temperature + Humidity + WindDirection, 
##     data = numData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -709.90 -138.80  -21.75  106.72 1123.52 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -1.616e+03  1.288e+01 -125.494  < 2e-16 ***
## Temperature    3.654e+01  2.066e-01  176.821  < 2e-16 ***
## Humidity      -2.649e-01  4.762e-02   -5.564 2.66e-08 ***
## WindDirection -1.683e-01  1.477e-02  -11.392  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 213.7 on 32682 degrees of freedom
## Multiple R-squared:  0.5423, Adjusted R-squared:  0.5422 
## F-statistic: 1.291e+04 on 3 and 32682 DF,  p-value: < 2.2e-16
  • We can see only all variable including gives us higher R-sq value

1.6 Bayesian Multiple Linear Regression

# Bayesian Multiple Model fitting
bas.lmall = bas.lm(Radiation ~ ., data = numData,
                 prior = "BIC",
                 modelprior = Bernoulli(1),
                 include.always = ~ .,
                 n.models = 1)
# Coefficients
lmall.coef <- coef(bas.lmall)
lmall.coef
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  1 models 
##                post mean   post SD     post p(B != 0)
## Intercept       207.12470     1.15133     1.00000    
## Temperature      38.37214     0.20638     1.00000    
## Pressure       -747.06869    22.90106     1.00000    
## Humidity         -0.26910     0.04844     1.00000    
## WindDirection    -0.26944     0.01462     1.00000    
## Speed             7.87488     0.34160     1.00000
# Coefficients visualization
par(mfrow = c(2, 3)) 
plot(lmall.coef, ask = F)

# Summary table
out = confint(lmall.coef)[, 1:2]
# Extract the upper and lower bounds of the credible intervals 
names = c("posterior mean", "posterior std", colnames(out)) 
out = cbind(lmall.coef$postmean, lmall.coef$postsd, out) 
colnames(out) = names

round(out, 2)
##               posterior mean posterior std    2.5%   97.5%
## Intercept             207.12          1.15  204.87  209.38
## Temperature            38.37          0.21   37.97   38.78
## Pressure             -747.07         22.90 -791.96 -702.18
## Humidity               -0.27          0.05   -0.36   -0.17
## WindDirection          -0.27          0.01   -0.30   -0.24
## Speed                   7.87          0.34    7.21    8.54

1.6.1 Model selection

  • We start with the full model, with all possible predictors.
    • i.e Temperature, Pressure, Humidity, WindDirection, Speed.
  • We drop one variable at a time and record BIC value.
  • Then finally choose the best model having minimum BIC value.
# Total # of observation
n <- nrow(numData)

# Full Model
rad.lm1 <- lm(Radiation ~ ., data = numData)
rad.step <- step(rad.lm1, k = log(n))
## Start:  AIC=349029.7
## Radiation ~ Temperature + Pressure + Humidity + WindDirection + 
##     Speed
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        1415934816 349030
## - Humidity       1    1337143 1417271959 349050
## - WindDirection  1   14720291 1430655107 349357
## - Speed          1   23025460 1438960276 349547
## - Pressure       1   46107462 1462042278 350067
## - Temperature    1 1497820497 2913755313 372607
# Let's drop each variable and test
## Tried, Speed,Pressure,WindDirection,Humidity,
rad.lmt <- lm(Radiation ~ Temperature+Pressure+Humidity+WindDirection,
              data = numData)
rad.step <- step(rad.lmt, k = log(n))
## Start:  AIC=349546.5
## Radiation ~ Temperature + Pressure + Humidity + WindDirection
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        1438960276 349547
## - Humidity       1    5679155 1444639431 349665
## - WindDirection  1   13645524 1452605800 349845
## - Pressure       1   54157664 1493117940 350744
## - Temperature    1 1481702879 2920663155 372674
### Table
Model BIC value BIC Difference from FM
FM 349029.7 0
FM-Temperature 372607.2 -23577.5
FM-Pressure 350066.7 -1037
FM-Humidity 349050.1 -20.4
FM-WindDirection 349357.3 -327.6
FM-Speed 349546.5 -516.8
  • Tried removing each less correlated variable, but we got the minimum BIC on Full Model.

1.6.1.1 BAS to get best model

# Model
bas_model <- bas.lm(Radiation ~ ., data = numData,
                    prior = "BIC",
                    modelprior = uniform()) # equal prior to the model
#bas_model
bas_coef <- coef(bas_model)
bas_coef
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  32 models 
##                post mean   post SD     post p(B != 0)
## Intercept       207.12470     1.15133     1.00000    
## Temperature      38.37215     0.20639     1.00000    
## Pressure       -747.06784    22.90148     1.00000    
## Humidity         -0.26909     0.04847     0.99996    
## WindDirection    -0.26944     0.01462     1.00000    
## Speed             7.87490     0.34161     1.00000
# Best model
best <- which.max(bas_model$logmarg)
bestmodel <- bas_model$which[[best]]
#bestmodel

bestgamma <- rep(0, bas_model$n.vars)
bestgamma[bestmodel + 1] <- 1
#bestgamma

bas_bestmodel <- bas.lm(Radiation ~ ., data = numData,
                    prior = "BIC", n.models = 1, bestmodel = bestgamma,
                    modelprior = uniform())
## Coefficients
bas_coef <- coef(bas_bestmodel)
## creadible intervals
out <- confint(bas_coef)[, 1:2]

## Summary table
bas_summary <- cbind(bas_coef$postmean, bas_coef$postsd, out)
names <- c("Posterior Mean", "Posterior SD", colnames(out))
colnames(bas_summary) <- names
bas_summary
##               Posterior Mean Posterior SD         2.5%        97.5%
## Intercept        207.1246974   1.15132973  204.8680490  209.3813458
## Temperature       38.3721364   0.20637957   37.9676249   38.7766479
## Pressure        -747.0686858  22.90105705 -791.9555954 -702.1817763
## Humidity          -0.2691005   0.04844022   -0.3640451   -0.1741559
## WindDirection     -0.2694445   0.01461813   -0.2980965   -0.2407924
## Speed              7.8748844   0.34160218    7.2053317    8.5444372
## Get posterior probability
bas_model <- bas.lm(Radiation ~ Temperature+Pressure+Humidity+WindDirection+Speed, 
                    data = numData,
                    prior = "BIC",
                    modelprior = uniform())

round(summary(bas_model), 3)
##               P(B != 0 | Y)     model 1     model 2     model 3     model 4
## Intercept                 1       1.000       1.000       1.000       1.000
## Temperature               1       1.000       1.000       1.000       1.000
## Pressure                  1       1.000       1.000       1.000       1.000
## Humidity                  1       1.000       0.000       1.000       0.000
## WindDirection             1       1.000       1.000       0.000       0.000
## Speed                     1       1.000       1.000       1.000       1.000
## BF                       NA       1.000       0.000       0.000       0.000
## PostProbs                NA       1.000       0.000       0.000       0.000
## R2                       NA       0.566       0.566       0.561       0.561
## dim                      NA       6.000       5.000       5.000       4.000
## logmarg                  NA -344395.449 -344405.678 -344559.279 -344561.106
##                   model 5
## Intercept           1.000
## Temperature         1.000
## Pressure            1.000
## Humidity            1.000
## WindDirection       1.000
## Speed               0.000
## BF                  0.000
## PostProbs           0.000
## R2                  0.559
## dim                 5.000
## logmarg       -344653.878
## Marginal posterior inclusion probability
print(bas_model)
## 
## Call:
## bas.lm(formula = Radiation ~ Temperature + Pressure + Humidity + 
##     WindDirection + Speed, data = numData, prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##     Intercept    Temperature       Pressure       Humidity  WindDirection  
##             1              1              1              1              1  
##         Speed  
##             1

2 Interpritation

3 Questions that I had…. have :)

  • Standardization, Normalization is necessary in Regression modeling ?.
  • How to set prior information to a model.