library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)  # provides read_xlsx(); dplyr is already attached by library(tidyverse)
ECD <- read_xlsx("ECD.xlsx")  # read from the current working directory
###2) Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable

###DEPENDENT VARIABLE: Total Usage 

###INDEPENDENT VARIABLES: 1. Total people in household 2. Annual household income
###3) create a linear model using the "lm()" command, save it to some object###

ECD_multiple<-lm(ECD$`2023 Total Usage` ~ ECD$`Total people in household` + ECD$`Annual Household Income`, data=ECD)

###4) call a "summary()" on your new model###

summary(ECD_multiple)
## 
## Call:
## lm(formula = ECD$`2023 Total Usage` ~ ECD$`Total people in household` + 
##     ECD$`Annual Household Income`, data = ECD)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16254  -6609  -2249   3203  54081 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.441e+04  1.408e+03  10.238  < 2e-16 ***
## ECD$`Total people in household` 1.500e+03  2.700e+02   5.557 6.12e-08 ***
## ECD$`Annual Household Income`   4.056e-02  3.645e-02   1.113    0.267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10060 on 296 degrees of freedom
## Multiple R-squared:  0.1005, Adjusted R-squared:  0.09443 
## F-statistic: 16.54 on 2 and 296 DF,  p-value: 1.554e-07
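### Sketch (an assumption, not part of the original assignment): because data= is supplied, the formula can name
### the columns directly with backticks instead of repeating ECD$. The estimates are the same, the coefficient
### labels are shorter, and predict() can then be given new data. toy_ECD, new_home and all of their values are
### hypothetical and exist only to make this example self-contained.
toy_ECD <- data.frame(
  `2023 Total Usage` = c(12000, 15500, 16800, 21000, 19500, 26000),
  `Total people in household` = c(1, 2, 2, 4, 3, 5),
  `Annual Household Income` = c(30000, 45000, 80000, 60000, 52000, 95000),
  check.names = FALSE  # keep the column names with spaces exactly as written
)
# Same style as the model above: columns referenced through the data frame with $
toy_dollar <- lm(toy_ECD$`2023 Total Usage` ~ toy_ECD$`Total people in household` + toy_ECD$`Annual Household Income`, data = toy_ECD)
# Alternative style: bare (backticked) column names, looked up in data=
toy_bare <- lm(`2023 Total Usage` ~ `Total people in household` + `Annual Household Income`, data = toy_ECD)
# Both fits give the same estimates; only the coefficient labels differ
all.equal(unname(coef(toy_dollar)), unname(coef(toy_bare)))
# With bare names, a prediction for a new (hypothetical) household works as expected
new_home <- data.frame(`Total people in household` = 3, `Annual Household Income` = 50000, check.names = FALSE)
predict(toy_bare, newdata = new_home)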
###5) interpret the model's r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

###THE ADJUSTED R-SQUARED IS ONLY ABOUT 0.094, MEANING THE MODEL EXPLAINS ONLY ABOUT 9% OF THE VARIATION IN 2023 TOTAL USAGE (ABOUT 10% BY THE UNADJUSTED R-SQUARED), WHICH IS VERY LOW.

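###THE OVERALL P-VALUE (FROM THE F-TEST) IS 1.554e-07, WHICH MEANS IT IS VERY UNLIKELY WE WOULD SEE A FIT THIS GOOD BY RANDOM CHANCE IF NEITHER VARIABLE WERE RELATED TO TOTAL USAGE. IN OTHER WORDS, THE MODEL AS A WHOLE IS STATISTICALLY SIGNIFICANT, EVEN THOUGH IT EXPLAINS LITTLE OF THE VARIATION.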

###THE SIGNIFICANT VARIABLE IN THIS MODEL IS THE TOTAL PEOPLE IN THE HOUSEHOLD (P = 6.12e-08). THE 9% FIGURE DESCRIBES THE MODEL AS A WHOLE, NOT THIS VARIABLE ALONE.

###THE INSIGNIFICANT VARIABLE HERE IS THE ANNUAL HOUSEHOLD INCOME (P = 0.267, WELL ABOVE 0.05).
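### Sketch: the adjusted R-squared and the overall p-value can be recomputed from the printed summary alone.
### The variable names below are illustrative; the inputs are copied from the summary() output, so the results
### agree with it only up to rounding. With the fitted object available, summary(ECD_multiple)$adj.r.squared
### returns the adjusted R-squared directly.
r2       <- 0.1005                 # Multiple R-squared
df_resid <- 296                    # residual degrees of freedom
n_pred   <- 2                      # number of independent variables
n_obs    <- df_resid + n_pred + 1  # observations used in the fit
# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_resid
# Overall p-value: upper tail of the F distribution at the reported F-statistic
f_stat    <- 16.54
overall_p <- pf(f_stat, n_pred, df_resid, lower.tail = FALSE)
adj_r2
overall_p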
###6) Choose some significant independent variables. Interpret its Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable? 

###TOTAL PEOPLE IN THE HOUSEHOLD HAS AN ESTIMATE OF 1.500e+03, OR 1,500. BECAUSE THIS VARIABLE IS SIGNIFICANT, ALL ELSE BEING EQUAL (HOLDING ANNUAL HOUSEHOLD INCOME CONSTANT), EACH ADDITIONAL PERSON IN THE HOUSEHOLD IS ASSOCIATED WITH AN INCREASE OF ABOUT 1,500 IN 2023 TOTAL USAGE (IN WHATEVER UNITS USAGE IS RECORDED IN).
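### Sketch: reading the coefficient row for total people in household from the printed summary. The variable
### names are illustrative and the inputs are the rounded values from the output, so the t value matches the
### printed 5.557 only approximately. With the fitted object available, confint(ECD_multiple) gives the
### intervals directly.
estimate  <- 1.500e+03   # scientific notation: 1.5 * 10^3
std_error <- 2.700e+02
format(estimate, scientific = FALSE)   # shows the estimate in plain notation
# t value = estimate divided by its standard error
t_value <- estimate / std_error
# Approximate 95% confidence interval using the t distribution with 296 residual degrees of freedom
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = 296) * std_error
t_value
ci_95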
###7) Does the model you create meet or violate the assumption of linearity? Show your work with "plot(x,which=1)"

###IN THE RESIDUALS VS FITTED PLOT, THE RESIDUALS ARE SOMEWHAT SCATTERED. ALTHOUGH THERE ARE DEVIATIONS, THE TREND LINE IS MOSTLY STRAIGHT, SUGGESTING THAT THE RELATIONSHIP BETWEEN THE VARIABLES IS MOSTLY LINEAR AND THE MODEL DOES NOT CLEARLY VIOLATE THE LINEARITY ASSUMPTION.
plot(ECD_multiple, which = 1)