# 1. Load dataset
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
eci <- read_csv("eci.csv")
## Rows: 274 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): County, CSCS, CSFA, TS, PPSC, TPPS
## dbl (1): B3P
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Checked for missing values
summary(eci)
##     County               B3P               CSCS               CSFA          
##  Length:274         Min.   :     2.0   Length:274         Length:274        
##  Class :character   1st Qu.:   383.0   Class :character   Class :character  
##  Mode  :character   Median :   993.5   Mode  :character   Mode  :character  
##                     Mean   :  6920.7                                        
##                     3rd Qu.:  2587.5                                        
##                     Max.   :316834.0                                        
##                     NA's   :20                                              
##       TS                PPSC               TPPS          
##  Length:274         Length:274         Length:274        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
# Removed rows with missing values
cleaned_data <- na.omit(eci)
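# 2. Dependent and Independent variables
# Convert both columns to numeric first; entries that are not plain numbers become NA.
cleaned_data$CSCS <- as.numeric(cleaned_data$CSCS)
## Warning: NAs introduced by coercion
cleaned_data$TS <- as.numeric(cleaned_data$TS)
## Warning: NAs introduced by coercion

dep_var <- cleaned_data$CSCS  # dependent variable (numeric)
ind_var <- cleaned_data$TS    # independent variable (numeric)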
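The coercion warnings mean that some entries in these columns are not plain numbers, and the model summary later reports 54 observations deleted due to missingness. A quick way to see which strings are lost is sketched below. The vector `x` is hypothetical, since the raw values of CSCS and TS are not shown here, and whether `readr::parse_number()` suits the real data is an assumption to verify against the file.

```r
# Hypothetical character column standing in for CSCS or TS (not the real data).
x <- c("1,250", "87", "n/a", "45%", "300")

# Entries that are present as text but become NA under as.numeric().
converted <- suppressWarnings(as.numeric(x))
failed <- !is.na(x) & is.na(converted)

x[failed]    # the raw strings that could not be converted
sum(failed)  # how many values would be lost

# readr::parse_number() ignores surrounding characters such as "," and "%",
# so it may recover values that as.numeric() drops.
recovered <- suppressWarnings(readr::parse_number(x))
recovered
```

Running the same check on `cleaned_data$CSCS` and `cleaned_data$TS` before converting them would show whether the lost rows hold recoverable numbers or genuinely missing entries.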
# 3. Create a linear model
model <- lm(CSCS ~ TS, data = cleaned_data)
# 4. Summary of the model
summary(model)
## 
## Call:
## lm(formula = CSCS ~ TS, data = cleaned_data)
## 
## Residuals:
##       7      16      28      30      31      38 
##  2.4684  3.9745 -3.7455 -1.3267 -1.7198  0.3491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.227507   1.694585  -7.216  0.00196 ** 
## TS            0.956326   0.006359 150.379 1.17e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.192 on 4 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9998 
## F-statistic: 2.261e+04 on 1 and 4 DF,  p-value: 1.173e-08
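# 5. Interpret the model
# R-squared: 0.9998, meaning the model explains 99.98% of the variation in the dependent variable
# p-value: 1.173e-08, indicating the model is statistically significant overall
# Note: only 6 observations were used (54 were deleted due to missingness), so these results rest on a very small sample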

Overall Model: The F-statistic is very high (22,610 on 1 and 4 degrees of freedom), with a very small p-value (1.173e-08), so the model as a whole is statistically significant and “TS” is strongly associated with “CSCS”.

The p-value associated with the F-statistic is less than 0.001, which is strong evidence against the null hypothesis that the slope for “TS” is zero. The model therefore explains a significant share of the variance in “CSCS”.
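In a regression with a single predictor, the overall F-statistic equals the square of the slope's t value, and its p-value is the upper tail of the F distribution. The sketch below reproduces both from the numbers printed in the summary. The variable names are illustrative, and small differences from the printed output come from rounding.

```r
# Values copied from the summary(model) output.
t_slope <- 150.379  # t value for TS
df1 <- 1            # numerator degrees of freedom
df2 <- 4            # residual degrees of freedom

# With one predictor, F is the squared t value of the slope.
f_stat <- t_slope^2

# Upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat
p_value
```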

Significance of Variables: The p-values for the intercept (0.00196) and for “TS” (1.17e-08) are both below 0.01, so both coefficients are statistically significant.

Beta Coefficients

Intercept (-12.23): This value is the predicted “CSCS” when “TS” is zero. It may not have a meaningful interpretation if a “TS” value of zero lies outside the observed range of the data.

Slope (0.956): The slope coefficient for “TS” is 0.956, so for each one-unit increase in “TS,” the predicted “CSCS” score increases by approximately 0.956 units. The positive sign means that “CSCS” tends to rise as “TS” rises.
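The slope interpretation can be checked by evaluating the fitted line by hand from the reported coefficients. This is a minimal sketch: the helper `predict_cscs` and the TS values 100 and 101 are hypothetical, chosen only for illustration, and are not taken from the data.

```r
# Coefficients copied from the summary(model) output.
intercept <- -12.227507
slope     <- 0.956326

# Hypothetical helper: predicted CSCS for a given TS under the fitted line.
predict_cscs <- function(ts) intercept + slope * ts

# Illustrative TS values (assumed, not from the dataset).
p100 <- predict_cscs(100)
p101 <- predict_cscs(101)

# A one-unit increase in TS changes the prediction by exactly the slope.
p101 - p100
```

With the fitted object available, `predict(model, newdata = data.frame(TS = c(100, 101)))` should give the same two predictions up to rounding.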

Residual Standard Error: The residual standard error is 3.192 on 4 degrees of freedom. It estimates the standard deviation of the residuals, so it indicates the typical size of the prediction errors in units of “CSCS”.
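The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below recomputes it from the six residuals printed in the summary. The variable names are illustrative only.

```r
# Residuals copied from the summary(model) output.
res <- c(2.4684, 3.9745, -3.7455, -1.3267, -1.7198, 0.3491)

# Residual degrees of freedom: 6 observations minus 2 estimated coefficients.
df_resid <- length(res) - 2

# Residual standard error: sqrt(RSS / df).
rse <- sqrt(sum(res^2) / df_resid)
rse
```

With the fitted object available, `sigma(model)` returns the same quantity directly.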

# 6. Interpret the coefficients
# The significant independent variable is "TS" (with a very low p-value of 1.17e-08). 

Beta Coefficient for TS (Estimate = 0.956): The coefficient, or slope, tells us that for each one-unit increase in “TS,” the predicted value of “CSCS” increases by approximately 0.956 units. This positive relationship implies that as “TS” scores go up, “CSCS” scores also tend to increase.

Impact of TS on CSCS: The high t-value (150.379) and extremely low p-value show that “TS” is a strong predictor of “CSCS” in this model. Together with the R-squared of 0.9998, this indicates that “TS” explains nearly all of the variance in “CSCS” among the 6 observations used.

The independent variable “TS” is positively associated with “CSCS,” and this effect is both statistically significant and large. Each unit increase in “TS” corresponds to an increase of nearly one unit in predicted “CSCS,” so the two variables move almost one-for-one in this model. Because the fit uses only 6 observations, the result should be read with caution.
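A confidence interval gives a sense of how precisely the slope is estimated. The sketch below builds a 95% interval from the reported estimate and standard error, using the t distribution with 4 residual degrees of freedom. The 95% level is an assumption chosen for illustration.

```r
# Values copied from the summary(model) output.
estimate  <- 0.956326  # slope for TS
std_error <- 0.006359  # standard error of the slope
df_resid  <- 4         # residual degrees of freedom

# Critical t value for a two-sided 95% interval (assumed level).
t_crit <- qt(0.975, df_resid)

# Interval: estimate plus or minus t_crit * standard error.
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

With the fitted object available, `confint(model, "TS", level = 0.95)` should return the same interval up to rounding.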

# 7. Check the linearity assumption
plot(model, which = 1)

# The residuals-vs-fitted plot shows no clear pattern around zero, so the linearity assumption appears to be met.
# With only 6 observations, this check is weak and should be treated as tentative.