library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
wd<-getwd()
ECD<-read_xlsx("ECD.xlsx")
###2) Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
###DEPENDENT VARIABLE: Total Usage
###INDEPENDENT VARIABLES: 1. Total people in household 2. Annual household income
###3) create a linear model using the "lm()" command, save it to some object###
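# Bare column names are looked up in `data = ECD`. The fit is the same as with ECD$-prefixed terms; only the coefficient labels printed by summary() differ.
ECD_multiple <- lm(`2023 Total Usage` ~ `Total people in household` + `Annual Household Income`, data = ECD)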
###4) call a "summary()" on your new model###
summary(ECD_multiple)
##
## Call:
## lm(formula = ECD$`2023 Total Usage` ~ ECD$`Total people in household` +
## ECD$`Annual Household Income`, data = ECD)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -16254  -6609  -2249   3203  54081
##
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     1.441e+04  1.408e+03  10.238  < 2e-16 ***
## ECD$`Total people in household` 1.500e+03  2.700e+02   5.557 6.12e-08 ***
## ECD$`Annual Household Income`   4.056e-02  3.645e-02   1.113    0.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10060 on 296 degrees of freedom
## Multiple R-squared: 0.1005, Adjusted R-squared: 0.09443
## F-statistic: 16.54 on 2 and 296 DF, p-value: 1.554e-07
###5) interpret the model's r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?
###THE MULTIPLE R-SQUARED IS 0.1005 AND THE ADJUSTED R-SQUARED IS 0.09443, MEANING THE MODEL EXPLAINS ONLY ABOUT 9-10% OF THE VARIATION IN 2023 TOTAL USAGE, WHICH IS LOW.
###THE OVERALL F-TEST P-VALUE IS 1.554e-07, SO IT IS VERY UNLIKELY THAT WE WOULD SEE A FIT THIS GOOD IF NEITHER PREDICTOR WERE RELATED TO TOTAL USAGE. THE MODEL AS A WHOLE IS STATISTICALLY SIGNIFICANT EVEN THOUGH IT EXPLAINS LITTLE OF THE VARIATION.
###THE SIGNIFICANT VARIABLE IN THIS MODEL IS THE TOTAL PEOPLE IN THE HOUSEHOLD (P = 6.12e-08). THE 9-10% FIGURE DESCRIBES THE WHOLE MODEL, NOT THIS VARIABLE ALONE.
###THE INSIGNIFICANT VARIABLE HERE IS THE ANNUAL HOUSEHOLD INCOME (P = 0.267).
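###A minimal sketch of how the numbers read off the summary above can be pulled out of the model object directly. The data frame `toy` and its columns `people`, `income`, and `usage` are hypothetical stand-ins generated here so the block runs on its own; with the real data, `fit` would be replaced by `ECD_multiple`.

```r
# Hypothetical stand-in data; NOT the ECD data set.
set.seed(1)
n <- 200
toy <- data.frame(
  people = sample(1:6, n, replace = TRUE),
  income = runif(n, 20000, 150000)
)
toy$usage <- 14000 + 1500 * toy$people + rnorm(n, sd = 10000)

fit <- lm(usage ~ people + income, data = toy)
s <- summary(fit)

# Share of the variance in the response explained by the whole model.
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Overall F-test p-value: the null hypothesis is that every slope is zero.
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Per-variable p-values live in the coefficient table.
coef_p <- s$coefficients[, "Pr(>|t|)"]
significant <- setdiff(names(coef_p)[coef_p < 0.05], "(Intercept)")

print(c(r2 = r2, adj_r2 = adj_r2, overall_p = overall_p))
print(significant)
```

###The 0.05 cutoff is a conventional choice, not something fixed by the model. R-squared and the overall p-value belong to the model as a whole, while `coef_p` is what separates significant from insignificant variables.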
###6) Choose some significant independent variables. Interpret its Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?
###TOTAL PEOPLE IN THE HOUSEHOLD HAS AN ESTIMATE OF 1.500e+03, WHICH IS 1,500. BECAUSE THIS VARIABLE IS SIGNIFICANT, HOLDING ANNUAL HOUSEHOLD INCOME CONSTANT, EACH ADDITIONAL PERSON IN THE HOUSEHOLD IS ASSOCIATED WITH AN INCREASE OF ABOUT 1,500 IN 2023 TOTAL USAGE (IN THE UNITS OF THE USAGE COLUMN).
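###A short sketch of two points behind this interpretation: how to read scientific notation, and why a slope is the change in the prediction when one variable goes up by one while the others are held fixed. The data frame `toy` and its columns are hypothetical and generated only for illustration.

```r
# Scientific notation: 1.500e+03 means 1.5 * 10^3.
estimate <- 1.500e+03
plain <- format(estimate, scientific = FALSE)
print(plain)

# Hypothetical stand-in data; NOT the ECD data set.
set.seed(2)
toy <- data.frame(
  people = sample(1:6, 100, replace = TRUE),
  income = runif(100, 20000, 150000)
)
toy$usage <- 14000 + 1500 * toy$people + 0.04 * toy$income + rnorm(100, sd = 5000)

fit <- lm(usage ~ people + income, data = toy)

# Two households that differ only by one extra person.
households <- data.frame(people = c(3, 4), income = c(60000, 60000))
pred <- predict(fit, newdata = households)

# The gap between the two predictions equals the slope on `people`.
diff_pred <- unname(pred[2] - pred[1])
slope <- unname(coef(fit)["people"])
print(c(diff_pred = diff_pred, slope = slope))
```

###`predict()` with `newdata` works here because the formula uses bare column names; the same comparison at any other fixed income gives the same gap.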
###7) Does the model you create meet or violate the assumption of linearity? Show your work with "plot(x,which=1)"
###IN THE RESIDUALS VS FITTED PLOT, THE RESIDUALS ARE SOMEWHAT SCATTERED AROUND ZERO. ALTHOUGH THERE ARE DEVIATIONS, THE TREND LINE IS MOSTLY FLAT, SUGGESTING THAT THE RELATIONSHIP BETWEEN THE VARIABLES IS MOSTLY LINEAR AND THE ASSUMPTION OF LINEARITY IS REASONABLY MET.
plot(ECD_multiple, which = 1)
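###A rough sketch of what the residuals vs fitted plot is built from, using hypothetical data (`toy`, `x`, `y`) so it runs on its own. It draws the residuals against the fitted values and overlays a lowess smoother, which is the kind of trend line the default plot shows; `max_drift` is an ad hoc summary introduced here, not a standard diagnostic.

```r
# Hypothetical data with a truly linear relationship; NOT the ECD data set.
set.seed(3)
toy <- data.frame(x = runif(150, 0, 10))
toy$y <- 2 + 3 * toy$x + rnorm(150)

fit <- lm(y ~ x, data = toy)
res <- resid(fit)
fit_vals <- fitted(fit)

# Residuals against fitted values, with a reference line at zero.
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Smoothed trend of the residuals; a flat trend near zero supports linearity.
trend <- lowess(fit_vals, res)
lines(trend, col = "red")

# Ad hoc number: largest distance of the trend from zero, in residual SDs.
max_drift <- max(abs(trend$y)) / sd(res)
print(max_drift)
```

###A trend line that bends steadily away from zero (for example a U shape) would point to a nonlinear relationship that the straight-line model is missing.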
