library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  1. Load your chosen dataset into R Markdown
teacher_data <- read_csv("Teacher_Hiring_Certification_Turnover.csv")
## Rows: 33 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): REGION, distname, geotype_new, region_lea, Year
## dbl (20): district, schyr, intern, other_temp, oos_std, lag_starter, no_cert...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(teacher_data)
## # A tibble: 6 × 25
##   district schyr REGION intern other_temp oos_std lag_starter no_cert reenterer
##      <dbl> <dbl> <chr>   <dbl>      <dbl>   <dbl>       <dbl>   <dbl>     <dbl>
## 1   101902  2013 04        145         71      11          60      22       165
## 2   101902  2014 04        201        102       8          36      50       215
## 3   101902  2015 04        267        120      16          21      38       162
## 4   101902  2016 04        306        105      14          27      55       159
## 5   101902  2017 04        371        106      15          17      74       179
## 6   101902  2018 04        245         70       8           9      55       117
## # ℹ 16 more variables: emer <dbl>, std_all <dbl>, distname <chr>,
## #   geotype_new <chr>, total_new_hires <dbl>, region_lea <chr>, Year <chr>,
## #   total_teachers <dbl>, turnover_rate_teachers <dbl>, beg_year <dbl>,
## #   `1-5_years` <dbl>, `6-10_years` <dbl>, `11-20_years` <dbl>,
## #   over20_years <dbl>, `st-per-tch` <dbl>, num_st_mem <dbl>
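  1. Select the dependent variable you are interested in, along with the independent variables that you believe influence it

Dependent variable: teacher turnover rate (turnover_rate_teachers)

Independent variables: certification type (intern, other_temp, oos_std, lag_starter, no_cert, reenterer, emer, std_all), years of experience (beg_year, 1-5_years, 6-10_years, 11-20_years, over20_years), and student-teacher ratio (st-per-tch)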

  1. Create a linear model using the “lm()” command and save it to an object
turnover_model <- lm(
  turnover_rate_teachers ~ intern + other_temp + oos_std + lag_starter +
    no_cert + reenterer + emer + std_all + beg_year + `1-5_years` +
    `6-10_years` + `11-20_years` + over20_years + `st-per-tch`,
  data = teacher_data
)
  1. Call “summary()” on your new model
summary(turnover_model)
## 
## Call:
## lm(formula = turnover_rate_teachers ~ intern + other_temp + oos_std + 
##     lag_starter + no_cert + reenterer + emer + std_all + beg_year + 
##     `1-5_years` + `6-10_years` + `11-20_years` + over20_years + 
##     `st-per-tch`, data = teacher_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026441 -0.007286 -0.001636  0.005288  0.050850 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.860e-02  1.568e-01   0.119  0.90686    
## intern        -1.330e-04  1.608e-04  -0.827  0.41904    
## other_temp    -1.221e-03  7.402e-04  -1.649  0.11642    
## oos_std       -7.055e-04  1.931e-03  -0.365  0.71904    
## lag_starter   -5.347e-04  5.932e-04  -0.901  0.37926    
## no_cert        3.065e-04  1.589e-04   1.929  0.06967 .  
## reenterer     -3.633e-05  1.771e-04  -0.205  0.83977    
## emer           1.363e-03  3.719e-03   0.367  0.71826    
## std_all        1.039e-03  9.052e-04   1.148  0.26610    
## beg_year       1.802e-04  7.556e-05   2.385  0.02826 *  
## `1-5_years`    1.234e-04  4.134e-05   2.986  0.00792 ** 
## `6-10_years`   9.695e-05  1.013e-04   0.957  0.35123    
## `11-20_years`  2.001e-04  1.046e-04   1.913  0.07182 .  
## over20_years  -7.627e-04  1.528e-04  -4.991 9.48e-05 ***
## `st-per-tch`   6.035e-03  9.500e-03   0.635  0.53322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02133 on 18 degrees of freedom
## Multiple R-squared:  0.9225, Adjusted R-squared:  0.8622 
## F-statistic:  15.3 on 14 and 18 DF,  p-value: 3.288e-07
  1. Interpret the model’s R-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

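The model’s multiple R-squared is 0.9225, meaning it explains about 92.25% of the variation in the teacher turnover rate. The adjusted R-squared, which penalizes the model for its 14 predictors, is lower at 0.8622. The overall F-test (F = 15.3 on 14 and 18 degrees of freedom, p-value 3.288e-07) indicates that the model as a whole is statistically significant. With only 33 observations and 14 predictors, however, the high R-squared should be read with some caution.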
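The adjusted R-squared and F-statistic reported in the summary output can be reproduced by hand from the multiple R-squared, as a quick sanity check. The sketch below assumes all 33 rows were used in the fit (consistent with the 18 residual degrees of freedom); the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary() output and the read_csv() message
r_squared <- 0.9225   # Multiple R-squared
n_obs     <- 33       # rows in teacher_data (assumes none were dropped)
n_pred    <- 14       # predictors in the model formula

residual_df <- n_obs - n_pred - 1   # 18, matching the summary

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# Overall F-statistic and its p-value from the F distribution
f_statistic <- (r_squared / n_pred) / ((1 - r_squared) / residual_df)
f_p_value   <- pf(f_statistic, n_pred, residual_df, lower.tail = FALSE)

round(adj_r_squared, 4)
round(f_statistic, 1)
f_p_value
```

The rounded results should agree with the 0.8622 and 15.3 shown in the summary.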

Significant Variables (p-value below 0.05):

beg_year has a p-value of 0.02826

1-5_years has a p-value of 0.00792

over20_years has a p-value of 9.48e-05

Insignificant Variables: these variables have p-values above 0.05 and do not significantly predict turnover rates in this model: intern, other_temp, oos_std, lag_starter, no_cert, reenterer, emer, std_all, 6-10_years, 11-20_years, and st-per-tch. Of these, no_cert (0.06967) and 11-20_years (0.07182) are only marginally significant at the 0.10 level.
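The same split can be read off the coefficient table programmatically rather than by eye. This is a minimal sketch: split_by_significance() is a hypothetical helper written for illustration, and the 0.05 cutoff is an assumed default.

```r
# Hypothetical helper (not part of the original analysis): split a fitted
# lm's predictors by whether their p-value falls below alpha.
split_by_significance <- function(model, alpha = 0.05) {
  # Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
  coef_table <- coef(summary(model))
  # Drop the intercept so only predictors are classified
  coef_table <- coef_table[rownames(coef_table) != "(Intercept)", , drop = FALSE]
  p_values <- setNames(coef_table[, "Pr(>|t|)"], rownames(coef_table))
  list(
    significant   = names(p_values)[p_values < alpha],
    insignificant = names(p_values)[p_values >= alpha]
  )
}

# Usage with the model fitted above:
# split_by_significance(turnover_model)
```

Passing alpha = 0.10 instead would show which variables are only marginally significant.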

  1. Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?

beg_year: The positive coefficient (1.802e-04) suggests that each additional beginning teacher is associated with an increase of about 0.00018 in the turnover rate, holding the other variables constant

1-5_years: The positive coefficient (1.234e-04) suggests that each additional teacher with 1-5 years of experience is associated with an increase of about 0.00012 in the turnover rate, holding the other variables constant

over20_years: The negative coefficient (-7.627e-04) suggests that each additional teacher with over 20 years of experience is associated with a decrease of about 0.00076 in the turnover rate, holding the other variables constant. This is the largest effect of the three in absolute size
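Because the per-teacher effects are small, it can help to scale them up and to attach confidence intervals. The sketch below hard-codes the estimates and standard errors from the summary output; the 10-teacher scaling and the object names are illustrative assumptions. With the fitted model available, confint(turnover_model) would give the intervals directly.

```r
# Estimates and standard errors copied from the summary() output
estimates <- c(beg_year = 1.802e-04, "1-5_years" = 1.234e-04,
               over20_years = -7.627e-04)
std_errors <- c(beg_year = 7.556e-05, "1-5_years" = 4.134e-05,
                over20_years = 1.528e-04)

residual_df <- 18                       # from the summary output
t_crit <- qt(0.975, df = residual_df)   # critical value for a 95% interval

# 95% confidence interval for each coefficient
ci <- cbind(
  lower = estimates - t_crit * std_errors,
  upper = estimates + t_crit * std_errors
)

# Expected change in turnover rate for 10 additional teachers in each
# group, holding the other predictors constant (illustrative scaling)
effect_per_10 <- estimates * 10

ci
effect_per_10
```

None of the three intervals should cross zero, which matches their p-values being below 0.05.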

  1. Does the model you created meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
plot(turnover_model, which=1)

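Based on the residuals vs. fitted plot, the model appears to meet the assumption of linearity. The plot labels observations 33, 16, and 12, which have the largest residuals; these may be outliers that affect the model.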
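By default, plot(x, which = 1) labels the three points with the largest absolute residuals, so a labeled point is not automatically an outlier. One way to follow up is to list those observations alongside their Cook's distance. This is a sketch only: extreme_points() is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: list the n observations with the largest absolute
# residuals (the ones plot.lm labels by default) with their Cook's distance.
extreme_points <- function(model, n = 3) {
  res <- residuals(model)
  # Indices of the n largest residuals by absolute size
  top <- order(abs(res), decreasing = TRUE)[seq_len(min(n, length(res)))]
  data.frame(
    observation    = names(res)[top],
    residual       = unname(res[top]),
    cooks_distance = unname(cooks.distance(model)[top])
  )
}

# Usage with the model fitted above:
# extreme_points(turnover_model)
```

Observations that combine a large residual with a large Cook's distance are the ones most likely to be influencing the fit.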