class: center, middle, inverse, title-slide

.title[
# Electricity Output vs. Production Costs
]
.subtitle[
## Linear Regression
]
.author[
### Andrew Heneghan
]
.institute[
### West Chester University of Pennsylvania
]
.date[
### 09/27/2022
STA 490: Capstone Statistics
]

---
class: middle, center

## Research Question

### Do different types of production costs affect the amount of electrical output from public electricity supply authorities?

<img src = "https://www.nicepng.com/png/detail/23-235671_money-sign-dollar-sign-cash-clip-art-clipart.png" width="180" height="150">

---
class: top

.pull-left[

## <center>Description of Data</center>

- Source: Helmut Spaeth, Mathematical Algorithms for Linear Regression.
- 16 independent public electricity supply authorities were randomly sampled.
- The dependent variable is electricity output, in millions of kilowatts.
- The three independent cost variables are capital costs (to keep suppliers running), labor costs (to fund the workers), and energy costs.

]

.pull-right[

```r
# Read the raw file, skipping the 38 header lines before the data rows
x12 = read.table("https://people.sc.fsu.edu/~jburkardt/datasets/regression/x12.txt", skip = 38)
names(x12) = c("Index", "One", "Capital", "Labor", "Energy", "ElectricOutput")
# Set the output value in row 12 by hand, then keep only the analysis columns
x12[12,6] = 2.239
x12.data = x12[, c("Capital", "Labor", "Energy", "ElectricOutput")]
# Coerce the response column to numeric
x12.data$ElectricOutput = as.numeric(x12.data$ElectricOutput)
knitr::kable(head(x12.data), format = 'html')
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> Capital </th> <th style="text-align:right;"> Labor </th> <th style="text-align:right;"> Energy </th> <th style="text-align:right;"> ElectricOutput </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 98.288 </td> <td style="text-align:right;"> 0.386 </td> <td style="text-align:right;"> 13.219 </td> <td style="text-align:right;"> 1.270 </td> </tr>
  <tr> <td style="text-align:right;"> 255.068 </td> <td style="text-align:right;"> 1.179 </td> <td style="text-align:right;"> 49.145 </td> <td style="text-align:right;"> 4.597 </td> </tr>
  <tr> <td style="text-align:right;"> 208.904 </td> <td style="text-align:right;"> 0.532 </td> <td style="text-align:right;"> 18.005 </td> <td style="text-align:right;"> 1.985 </td> </tr>
  <tr> <td style="text-align:right;"> 528.864 </td> <td style="text-align:right;"> 1.836 </td> <td style="text-align:right;"> 75.639 </td> <td style="text-align:right;"> 9.897 </td> </tr>
  <tr> <td style="text-align:right;"> 307.419 </td> <td style="text-align:right;"> 1.136 </td> <td style="text-align:right;"> 52.234 </td> <td style="text-align:right;"> 5.907 </td> </tr>
  <tr> <td style="text-align:right;"> 138.283 </td> <td style="text-align:right;"> 1.085 </td> <td style="text-align:right;"> 9.027 </td> <td style="text-align:right;"> 1.832 </td> </tr>
</tbody>
</table>

]

---
class: top

## <center>Diagnostics</center>

.pull-left[

<font size = 3 color = "black">H<sub>0</sub>: β<sub>1</sub> = 0, β<sub>2</sub> = 0, β<sub>3</sub> = 0<br>H<sub>1</sub>: At least one β<sub>i</sub> ≠ 0<br></font>

```r
# Scatterplot matrix of the three cost variables and the response
pairs(x12.data)
```

<img src="data:image/png;base64,#Electricity-Output-vs-Production-Costs_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

]

.pull-right[

- All three cost variables are linearly correlated with the electricity output.
- The three cost variables do not appear to be linearly correlated with one another, so collinearity is unlikely.
- There are no extreme outliers or skewed distributions, so I will not perform discretization on the predictor variables.

]

---
class: top

## <center>Diagnostics (cont'd)</center>

.pull-left[

```r
# Initial additive model and its four standard residual plots
x12.model = lm(ElectricOutput ~ Capital + Labor + Energy, data = x12.data)
par(mfrow=c(2,2), mar=c(2,3,2,2))
plot(x12.model)
```

<img src="data:image/png;base64,#Electricity-Output-vs-Production-Costs_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

]

.pull-right[

- <font size = 4 color = "black">Observation 13 appears to be an outlier.</font>
- <font size = 4 color = "black">The line on the Residuals vs Fitted graph is not fully horizontal, which suggests that the relationship between the predictors and the response is not entirely linear.</font>
- <font size = 4 color = "black">Some of the points on the Normal QQ plot are not on the line.
Therefore, the residuals may not be normally distributed.</font>
- <font size = 4 color = "black">The line on the Scale-Location plot is curved instead of horizontal and the points are not spread evenly. Therefore, the assumption of homogeneity of variance appears to be violated.</font>

]

---
class: middle

## <center>Box-Cox Transformation</center>

```r
library(MASS)
# Profile log-likelihood of lambda for the additive model, over lambda in [0.5, 1.5]
boxcox(ElectricOutput ~ Capital + Labor + Energy, data = x12.data, lambda = seq(0.5, 1.5, length = 10), xlab=expression(paste(lambda)))
title(main = "Box-Cox Transformation: 95% CI of lambda", col.main = "navy", cex.main = 0.9)
```

<img src="data:image/png;base64,#Electricity-Output-vs-Production-Costs_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

<font size = 3 color = "black"> Since 1 is within the 95% confidence interval of λ, I will not perform the power transformation. However, because the optimal λ is closer to 0.5, I will also try a log transformation of the response. </font>

---
class: top

## <center>Log-Transformed Model</center>

.pull-left[

- The residual plots don't show any definite improvement in model fit.
- The line on the Residuals vs Fitted graph is less horizontal than in the graph for the initial model.
- Fewer points on the Normal QQ plot fall on the line than in the same plot for the initial model.
- The line on the Scale-Location plot is still quite curved and the points are not spread evenly, just as in the graph for the initial model.
- I believe it is better to use the initial response to build the final model.
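
The choice between the raw and the log-transformed response rests on the Box-Cox interval from the previous slide. As a rough sketch, that interval can also be read off numerically instead of from the plot. The data frame `demo` below is simulated purely for illustration (it is not the `x12` data), and the names `bc`, `lambda.hat`, and `ci` are hypothetical; with the real data, `demo` would be replaced by `x12.data`.

```r
library(MASS)

# Simulated stand-in data (an assumption for illustration, NOT the x12 data)
set.seed(490)
demo = data.frame(Capital = runif(16, 100, 500),
                  Labor   = runif(16, 0.4, 2),
                  Energy  = runif(16, 10, 80))
demo$ElectricOutput = exp(0.2 + 0.02 * demo$Energy + rnorm(16, sd = 0.1))

# Profile log-likelihood over a grid of lambda values, without drawing the plot
bc = boxcox(ElectricOutput ~ Capital + Labor + Energy, data = demo,
            lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Lambda that maximizes the profile log-likelihood
lambda.hat = bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum
ci = range(bc$x[bc$y >= max(bc$y) - qchisq(0.95, df = 1) / 2])

lambda.hat
ci
```

An interval that contains 1 supports leaving the response untransformed.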

]

.pull-right[

```r
# Full-interaction model on the log-transformed response
x12.transform = lm(log(ElectricOutput) ~ Capital * Labor * Energy, data = x12.data)
par(mfrow=c(2,2), mar = c(2,2,2,2))
plot(x12.transform)
```

```
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
```

<img src="data:image/png;base64,#Electricity-Output-vs-Production-Costs_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

]

---
class: top

## <center>Final Model</center>

<font size = 3 color = "black">Because none of the interaction effects is individually significant, I used automatic (backward stepwise) variable selection to find the final model.</font>

```r
# Start from the full-interaction model on the original response
x12.last = lm(ElectricOutput ~ Capital*Labor*Energy, data=x12.data)
# Backward stepwise selection
x12.final = step(x12.last, direction = "backward", trace = 0)
knitr::kable(summary(x12.final)$coef, caption = "Summarized statistics of the regression coefficients of the final model")
```

Table: Summarized statistics of the regression coefficients of the final model

|                     |   Estimate| Std. Error|    t value| Pr(>\|t\|)|
|:--------------------|----------:|----------:|----------:|----------:|
|(Intercept)          | -0.1179266|  0.8008111| -0.1472589|  0.8865712|
|Capital              | -0.0015419|  0.0112743| -0.1367634|  0.8945970|
|Labor                |  1.7545171|  1.5500180|  1.1319333|  0.2904482|
|Energy               |  0.0466167|  0.0579497|  0.8044334|  0.4443927|
|Capital:Labor        | -0.0022914|  0.0043735| -0.5239147|  0.6145383|
|Capital:Energy       |  0.0002759|  0.0001435|  1.9228627|  0.0907141|
|Labor:Energy         | -0.0262049|  0.0321856| -0.8141819|  0.4391016|
|Capital:Labor:Energy | -0.0000256|  0.0000201| -1.2752171|  0.2380164|

---
class: top

## <center>Choosing an Optimal Model</center>

```r
# Collect R-squared from each of the three candidate models
r.x12.model = summary(x12.model)$r.squared
r.x12.transform = summary(x12.transform)$r.squared
r.x12.final = summary(x12.final)$r.squared
Rsquare = cbind(x12.model = r.x12.model, x12.transform = r.x12.transform, x12.final = r.x12.final)
knitr::kable(Rsquare, caption="Coefficients of determination of the three candidate models")
```

Table: Coefficients of determination of the three candidate models

| x12.model| x12.transform| x12.final|
|---------:|-------------:|---------:|
|  0.921969|     0.9526196| 0.9709497|

<font size = 4 color = "black">The first model has an R<sup>2</sup> of 92.2%, the second model has an R<sup>2</sup> of 95.3%, and the third model has an R<sup>2</sup> of 97.1%. Both the first and third models are based on the initial response, but the second and third include interaction terms and are not as straightforward as the first. Because the first model has a simpler structure and a relatively high R<sup>2</sup>, and is easy to interpret, I will choose it as the final model to report. </font>

---
class: top

## <center>Conclusions</center>

```r
# Coefficient table of the additive model chosen above
summary.x12.model = summary(x12.model)$coef
knitr::kable(summary.x12.model, caption = "Summary of the final working model")
```

Table: Summary of the final working model

|            |   Estimate| Std. Error|    t value| Pr(>\|t\|)|
|:-----------|----------:|----------:|----------:|----------:|
|(Intercept) |  0.6371382|  0.4917738|  1.2955919|  0.2194840|
|Capital     |  0.0022478|  0.0051614|  0.4355009|  0.6709293|
|Labor       | -0.5995563|  0.9394285| -0.6382138|  0.5353245|
|Energy      |  0.0906942|  0.0233788|  3.8793313|  0.0021905|

<font size = 4 color = "black">In conclusion, since its p-value (0.0022) is less than 0.05, energy cost is statistically significant and is positively associated with electricity output. Holding capital and labor costs constant, a 1-dollar increase in energy costs is associated with an increase of about 0.0907 million kilowatts in electrical output. Capital and labor costs do not appear to be statistically significant predictors of electricity output, since their p-values (0.671 and 0.535) are greater than 0.05.</font>
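
To make the interpretation of the energy coefficient concrete, here is a small sketch that plugs the reported estimates into the fitted equation. The object `b` and the helper `predict_output` are hypothetical names introduced for illustration; the coefficients are copied from the table above, and the cost values are those of the first row of the data shown earlier.

```r
# Coefficient estimates copied from the "Summary of the final working model" table
b = c(Intercept = 0.6371382, Capital = 0.0022478,
      Labor = -0.5995563, Energy = 0.0906942)

# Hypothetical helper: fitted electricity output for given cost values
predict_output = function(capital, labor, energy) {
  unname(b["Intercept"] + b["Capital"] * capital +
         b["Labor"] * labor + b["Energy"] * energy)
}

# First row of the data: Capital = 98.288, Labor = 0.386, Energy = 13.219
base = predict_output(98.288, 0.386, 13.219)

# Same authority with energy cost raised by one unit, other costs held fixed
bumped = predict_output(98.288, 0.386, 13.219 + 1)

# The difference between the two fitted values equals the Energy coefficient
bumped - base
```

Because the model is additive, the difference does not depend on which row is used as the starting point.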