Linear Regression

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
district <- read_excel("district.xls")

model_multiple <- lm(DA0912DR21R ~ DA0AT21R + DA0CT21R, data = district)
summary(model_multiple)
## 
## Call:
## lm(formula = DA0912DR21R ~ DA0AT21R + DA0CT21R, data = district)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6637 -0.9424 -0.2303  0.6698 28.0421 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.626183   2.043604  29.666   <2e-16 ***
## DA0AT21R    -0.624078   0.021918 -28.473   <2e-16 ***
## DA0CT21R    -0.004277   0.002269  -1.885   0.0597 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.002 on 1078 degrees of freedom
##   (126 observations deleted due to missingness)
## Multiple R-squared:  0.462,  Adjusted R-squared:  0.461 
## F-statistic: 462.9 on 2 and 1078 DF,  p-value: < 2.2e-16
# My dependent variable is dropout rate (DA0912DR21R), and my independent variables are attendance rate (DA0AT21R) and college prep class participation (DA0CT21R).

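# The multiple regression shows that attendance is a significant predictor of dropout rate (p < 2e-16): each one-unit increase in attendance rate is associated with a decrease of about 0.62 in dropout rate, holding college prep participation constant. College prep participation has a much smaller estimate (-0.004) and is not significant at the 0.05 level (p = 0.0597). The R-squared of 0.462 means the model explains about 46.2% of the variance in dropout rate.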
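
The R-squared and p-values can also be pulled out of the model object instead of being read off the printout. The sketch below assumes a simulated data frame, `toy`, with made-up column names and values standing in for the district data; with the real data the same calls would be applied to `model_multiple`.

```r
# Assumption: `toy` is simulated stand-in data, not the district file.
set.seed(42)
toy <- data.frame(
  attendance   = runif(200, 85, 99),
  college_prep = runif(200, 0, 100)
)
toy$dropout <- 60 - 0.6 * toy$attendance + rnorm(200, sd = 2)

fit <- lm(dropout ~ attendance + college_prep, data = toy)
s <- summary(fit)

# Share of variance in the dependent variable explained by the model
r2 <- s$r.squared

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)

# Predictors (excluding the intercept) with p < 0.05
significant <- setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05], "(Intercept)")

cat(sprintf("R-squared: %.3f (%.1f%% of variance explained)\n", r2, 100 * r2))
print(significant)
```

Reading the p-values from `coef(summary(fit))` makes the 0.05 cutoff explicit, which helps with borderline cases such as a p-value just above 0.05.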

Regression, much like t-tests and correlations, is all about relationships. What is the relationship between X and Y? Or between X, Y and Z?

For very simple data, this is easy enough to see. You can just plot it:

ggplot(district, aes(x = DA0AT21R, y = DA0912DR21R)) + geom_point()
## Warning: Removed 112 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Graph 1 suggests a relationship between attendance rate and dropout rate.
ggplot(district, aes(x = DA0CT21R, y = DA0912DR21R)) + geom_point()
## Warning: Removed 126 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Graph 2 shows no clear relationship between college prep participation and dropout rate.
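
Judging a scatterplot by eye can be easier with a fitted straight line drawn over the points. The sketch below is one possible way to do that with `geom_smooth()`. It uses a simulated data frame, `toy`, whose names and values are assumptions for illustration only and are not the district data.

```r
library(ggplot2)

# Assumption: `toy` is simulated stand-in data, not the district file.
set.seed(42)
toy <- data.frame(attendance = runif(200, 85, 99))
toy$dropout <- 60 - 0.6 * toy$attendance + rnorm(200, sd = 2)

# Scatterplot with a least-squares line on top
p <- ggplot(toy, aes(x = attendance, y = dropout)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```

A clearly sloped line with points clustered around it suggests a linear relationship. A nearly flat line through a shapeless cloud of points suggests little or none.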

HOMEWORK

  1. Load your chosen dataset into R Markdown.
  2. Select the dependent variable you are interested in, along with the independent variables that you believe are causing the dependent variable.
  3. Create a linear model using the “lm()” command and save it to an object.
  4. Call “summary()” on your new model.
  5. Interpret the model’s R-squared and p-values. How much of the dependent variable does the overall model explain? Which variables are significant? Which are not?
  6. Choose some significant independent variables and interpret their Estimates (or Beta Coefficients). How does each independent variable individually affect the dependent variable?
  7. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
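
A minimal sketch of the linearity check in step 7 follows. It assumes a simulated data frame, `toy`, with made-up names and values rather than the district data. For the model fitted earlier in this document, the equivalent call would be `plot(model_multiple, which = 1)`.

```r
# Assumption: `toy` is simulated stand-in data, not the district file.
set.seed(42)
toy <- data.frame(
  attendance   = runif(200, 85, 99),
  college_prep = runif(200, 0, 100)
)
toy$dropout <- 60 - 0.6 * toy$attendance + rnorm(200, sd = 2)

fit <- lm(dropout ~ attendance + college_prep, data = toy)

# Residuals vs Fitted plot: the first of the diagnostic plots for an lm object
plot(fit, which = 1)
```

If the relationship is roughly linear, the residuals scatter evenly around zero and the smoothed line stays close to flat. A curved line, or a spread that widens with the fitted values, points to a violated assumption.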