1. Introduction

As part of the RCT analysis of Shamiri’s 4-week intervention in 6-schools, we are also studing the impact of the treatment on the academic outcomes between control and treatment students.

The academic outcomes consist of term wise, and form wise grades for 5 out of the 6 schools. The academic outcomes data for the 6th school could not be retrived. Moreover, majority of the Form 4 data for the other 5 schools could not be retrieved due to the issues with schools not releasing the grades in time.

Accounting for the missingness we can see that we have about 87% of the academic outcomes data for the 1600+ students in the RCT.

The intervention was conducted in Term 2. Therefore, Term 1 marks serve as a baseline for the analysis. Term 2 marks were immediately after the intervention and can therefore serve as the endline. Term 3 marks were a few months after the intervention (about 4 months) and can therefore serve as a follow-up data point post-intervention.

2. Preparing the academic outcomes for Data cleaning

Before the academic outcomes data could be cleaned for the analysis, we had to follow the steps mentioned below:

  1. Some sheets were saved as pdf, therefore, each of the sheets had to be converted into an excel file manually.
  2. Most of the sheets were password protected, hence a copy of the sheets had to be made.
  3. We followed the naming convention of each sheet as FormX_TermY_SchoolName. This helped organise the data set in a manner which could be extracted and pivoted for the cleaning process.
  4. Each school and each form had different manner of saving the data, hence the column named had to be checked for both manually and through R to ensure there is consistency in names so that when the data is merged across terms, forms and schools, it doesn’t run into any error. For example, in some files the subject Computer Science was marked as “C/S” and in some it was noted as “C_S”. Similarly, Total Marks was inconsistently named as TT MKS, Ttmks, T.T Mks etc. There were several other columns such as Average Marks, Admission Number and Mean_Marks.
  5. The academic outcomes data was uniquely identified for each student by their admission number and not Participant_ID which was used in the RCT. Hence it was important to retrieve a masterfile which linked the Participant_ID and the admission number to identify academic data for only those students who were part of the trial.

3. Steps followed in data cleaning

3.1: Loading the libaries and the datasets

We included 3 additional columns in each of the form/term/school-wise sheets: Form, Term and School_Name. These columns tool values from the sheet name. For example, in the sheet named “F1_T1_Aquinas”, the form column took value F1 for each row, Term column took value T1 for each row and the School_Name column took value of Aquinas for each row. This was done in order to retain individual identification of form, term and school in the merged data frame which contained all academic outcomes.

# Installing essential libraries
library(stargazer)
library(readxl)
library(dplyr)
library(knitr)
library(tidyverse)
library(lmtest)
library(sandwich)
library(fixest)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(reshape2)
library(gridExtra)
library(grid)
library(tableone)
library(writexl)
library(patchwork)
library(interactions)
library(kableExtra)
library(broom)




# Setting the working directory
setwd('/Users/ishangupta/Desktop/Books for Masters/R Practice')


# Path to your Excel file
file_path <- "/Users/ishangupta/Desktop/Books for Masters/R Practice/Mastersheet.xlsx"

# Get the sheet names
sheet_names <- excel_sheets(file_path)

# Function to check and convert "Admission_Number" to numeric
convert_to_numeric <- function(df, sheet_name) {
  if (!"Admission_Number" %in% colnames(df)) {
    warning(paste("Sheet", sheet_name, "does not have an 'Admission_Number' column."))
    return(df)
  }
  
  if (!is.numeric(df$Admission_Number)) {
    df$Admission_Number <- as.numeric(df$Admission_Number)
    if (any(is.na(df$Admission_Number))) {
      warning(paste("Sheet", sheet_name, "contains non-numeric 'Admission_Number' values that were converted to NA."))
    }
  }
  
  df
}

# Load all sheets into a named list of dataframes, add the "Names" column, check and convert "Admission_Number"
dfs <- lapply(setNames(sheet_names, sheet_names), function(sheet) {
  df <- read_excel(file_path, sheet = sheet)
  df$Names <- sheet
  df <- convert_to_numeric(df, sheet)
  df
})

# Assign each dataframe to the global environment with its sheet name
list2env(dfs, .GlobalEnv)
## <environment: R_GlobalEnv>

3.2: Editing each school/term/form wise data frame to include the respective terms, forms and school names

Each of these sheets had uneven set of columns with different names. This was due to schools not offering certain subjects or not recording certain data points. For example, Computer Science or Home Science was not offered in all schools. We ensured that all the sheets have equal number of columns. We added missing columns in each sheet with NA values from the master set of column names.

# Function to extract form, term, and school name from sheet name
parse_sheet_name <- function(sheet_name) {
  parts <- strsplit(sheet_name, "_")[[1]]
  list(
    Form = parts[1],
    Term = parts[2],
    School_Name = parts[3]
  )
}

# Function to add Form, Term, and School Name columns
add_columns <- function(df, sheet_name) {
  info <- parse_sheet_name(sheet_name)
  df %>%
    mutate(
      Form = info$Form,
      Term = info$Term,
      School_Name = info$School_Name
    )
}

# Load all sheets into a named list of dataframes, add columns
dfs <- lapply(setNames(sheet_names, sheet_names), function(sheet) {
  df <- read_excel(file_path, sheet = sheet)
  df <- add_columns(df, sheet)
  df
})

# Assign each dataframe to the global environment with its sheet name
list2env(dfs, .GlobalEnv)
## <environment: R_GlobalEnv>

3.3: Ensuring that all academic outcome data frames have equal number and same name of columns. Also converting all sheets into character data for merging purposes

Since the some of the sheets were converted from pdf, their data sturcture was not consistent. To ensure that all the sheets have the same data type for merging purposes. We forced convert all the columns into the character data type.

# Load necessary library
library(dplyr)

# Define the full list of columns
full_columns <- c("Admission_Number", "No_of_Sub", "STR", "Name", "ENG", "KIS", "MAT", "BIO", 
                   "PHY", "CHE", "HIS", "GEO", "CRE", "HSC", "AGR", "BST", "CS", "COM", 
                   "MW", "DD", "FRE", "KSL", "UPI", "Avg_Marks", "Total_Marks", 
                   "Total_Points", "DEV", "Grade", "VAP", "STR_POS", "OVR_POS", 
                   "Mean_Points", "KCPE", "Form", "Term", "School_Name")

# Function to add missing columns and ensure consistent column data types
add_missing_columns_and_convert_types <- function(df, full_columns) {
  # Determine which columns are missing
  missing_cols <- setdiff(full_columns, names(df))
  
  if (length(missing_cols) > 0) {
    # Add missing columns with NA values
    df <- df %>%
      bind_cols(set_names(as.data.frame(matrix(NA, nrow(df), length(missing_cols))), missing_cols))
  }
  
  # Ensure columns have the same data type
  for (col in full_columns) {
    if (col %in% names(df)) {
      # Determine the type of the column across all data frames
      column_types <- sapply(sheet_list, function(x) if (col %in% names(x)) typeof(x[[col]]) else NA)
      column_type <- na.omit(column_types)[1]  # Choose the first non-NA type as the consistent type
      
      # Convert column to the consistent type
      df[[col]] <- switch(column_type,
                          "character" = as.character(df[[col]]),
                          "double" = as.numeric(df[[col]]),
                          "integer" = as.integer(df[[col]]),
                          df[[col]])  # No conversion if type is not recognized
    }
  }
  
  return(df)
}

# List of data frames (sheets)
sheet_list <- list(F1_T1_Aquinas, F1_T1_Fatima, F1_T1_Kamkunji, F1_T1_Muthurwa, F1_T1_Ruiru, 
                   F1_T2_Aquinas, F1_T2_Fatima, F1_T2_Kamkunji, F1_T2_Muthurwa, F1_T2_Ruiru, 
                   F1_T3_Aquinas, F1_T3_Fatima, F1_T3_Kamkunji, F1_T3_Muthurwa, F1_T3_Ruiru, 
                   F2_T1_Aquinas, F2_T1_Fatima, F2_T1_Kamkunji, F2_T1_Muthurwa, F2_T1_Ruiru, 
                   F2_T2_Aquinas, F2_T2_Fatima, F2_T2_Kamkunji, F2_T2_Muthurwa, F2_T2_Ruiru, 
                   F2_T3_Aquinas, F2_T3_Fatima, F2_T3_Kamkunji, F2_T3_Muthurwa, F2_T3_Ruiru, 
                   F3_T1_Aquinas, F3_T1_Fatima, F3_T1_Kamkunji, F3_T1_Muthurwa, F3_T1_Ruiru, 
                   F3_T2_Aquinas, F3_T2_Fatima, F3_T2_Kamkunji, F3_T2_Muthurwa, F3_T2_Ruiru, 
                   F3_T3_Aquinas, F3_T3_Fatima, F3_T3_Kamkunji, F3_T3_Muthurwa, F3_T3_Ruiru, 
                   F4_T1_Kamkunji, F4_T2_Kamkunji)

# Add missing columns and ensure consistent data types
sheet_list_updated <- lapply(sheet_list, function(df) {
  add_missing_columns_and_convert_types(df, full_columns)
})

# Assign updated data frames back to original variable names
list2env(setNames(sheet_list_updated, c("F1_T1_Aquinas", "F1_T1_Fatima", "F1_T1_Kamkunji", 
                                         "F1_T1_Muthurwa", "F1_T1_Ruiru", "F1_T2_Aquinas", 
                                         "F1_T2_Fatima", "F1_T2_Kamkunji", "F1_T2_Muthurwa", 
                                         "F1_T2_Ruiru", "F1_T3_Aquinas", "F1_T3_Fatima", 
                                         "F1_T3_Kamkunji", "F1_T3_Muthurwa", "F1_T3_Ruiru", 
                                         "F2_T1_Aquinas", "F2_T1_Fatima", "F2_T1_Kamkunji", 
                                         "F2_T1_Muthurwa", "F2_T1_Ruiru", "F2_T2_Aquinas", 
                                         "F2_T2_Fatima", "F2_T2_Kamkunji", "F2_T2_Muthurwa", 
                                         "F2_T2_Ruiru", "F2_T3_Aquinas", "F2_T3_Fatima", 
                                         "F2_T3_Kamkunji", "F2_T3_Muthurwa", "F2_T3_Ruiru", 
                                         "F3_T1_Aquinas", "F3_T1_Fatima", "F3_T1_Kamkunji", 
                                         "F3_T1_Muthurwa", "F3_T1_Ruiru", "F3_T2_Aquinas", 
                                         "F3_T2_Fatima", "F3_T2_Kamkunji", "F3_T2_Muthurwa", 
                                         "F3_T2_Ruiru", "F3_T3_Aquinas", "F3_T3_Fatima", 
                                         "F3_T3_Kamkunji", "F3_T3_Muthurwa", "F3_T3_Ruiru", 
                                         "F4_T1_Kamkunji", "F4_T2_Kamkunji")), envir = .GlobalEnv)
## <environment: R_GlobalEnv>

3.4: Checking data type consistency across all sheets and merging them into a single data frame.

We merged all the academic outcomes sheets into a single file.

# Define your sheet list with names
sheet_names <- c(
  "F1_T1_Aquinas", "F1_T1_Fatima", "F1_T1_Kamkunji", "F1_T1_Muthurwa", "F1_T1_Ruiru",
  "F1_T2_Aquinas", "F1_T2_Fatima", "F1_T2_Kamkunji", "F1_T2_Muthurwa", "F1_T2_Ruiru",
  "F1_T3_Aquinas", "F1_T3_Fatima", "F1_T3_Kamkunji", "F1_T3_Muthurwa", "F1_T3_Ruiru",
  "F2_T1_Aquinas", "F2_T1_Fatima", "F2_T1_Kamkunji", "F2_T1_Muthurwa", "F2_T1_Ruiru",
  "F2_T2_Aquinas", "F2_T2_Fatima", "F2_T2_Kamkunji", "F2_T2_Muthurwa", "F2_T2_Ruiru",
  "F2_T3_Aquinas", "F2_T3_Fatima", "F2_T3_Kamkunji", "F2_T3_Muthurwa", "F2_T3_Ruiru",
  "F3_T1_Aquinas", "F3_T1_Fatima", "F3_T1_Kamkunji", "F3_T1_Muthurwa", "F3_T1_Ruiru",
  "F3_T2_Aquinas", "F3_T2_Fatima", "F3_T2_Kamkunji", "F3_T2_Muthurwa", "F3_T2_Ruiru",
  "F3_T3_Aquinas", "F3_T3_Fatima", "F3_T3_Kamkunji", "F3_T3_Muthurwa", "F3_T3_Ruiru",
  "F4_T1_Kamkunji", "F4_T2_Kamkunji"
)

# Define your sheet list with actual data frames
sheet_list <- list(
  F1_T1_Aquinas, F1_T1_Fatima, F1_T1_Kamkunji, F1_T1_Muthurwa, F1_T1_Ruiru,
  F1_T2_Aquinas, F1_T2_Fatima, F1_T2_Kamkunji, F1_T2_Muthurwa, F1_T2_Ruiru,
  F1_T3_Aquinas, F1_T3_Fatima, F1_T3_Kamkunji, F1_T3_Muthurwa, F1_T3_Ruiru,
  F2_T1_Aquinas, F2_T1_Fatima, F2_T1_Kamkunji, F2_T1_Muthurwa, F2_T1_Ruiru,
  F2_T2_Aquinas, F2_T2_Fatima, F2_T2_Kamkunji, F2_T2_Muthurwa, F2_T2_Ruiru,
  F2_T3_Aquinas, F2_T3_Fatima, F2_T3_Kamkunji, F2_T3_Muthurwa, F2_T3_Ruiru,
  F3_T1_Aquinas, F3_T1_Fatima, F3_T1_Kamkunji, F3_T1_Muthurwa, F3_T1_Ruiru,
  F3_T2_Aquinas, F3_T2_Fatima, F3_T2_Kamkunji, F3_T2_Muthurwa, F3_T2_Ruiru,
  F3_T3_Aquinas, F3_T3_Fatima, F3_T3_Kamkunji, F3_T3_Muthurwa, F3_T3_Ruiru,
  F4_T1_Kamkunji, F4_T2_Kamkunji
)

# Define a function to get column names and their data types in a specific format
get_column_types_for_sheet <- function(df, sheet_name) {
  tibble(
    Sheet = sheet_name,
    Column = names(df),
    Type = sapply(df, class)
  ) %>%
    pivot_wider(names_from = Column, values_from = Type)
}

# Apply the function to each sheet
column_types_list <- Map(get_column_types_for_sheet, sheet_list, sheet_names)

# Combine all data frames into a single data frame
combined_column_types <- bind_rows(column_types_list)

# Print the combined column types data frame
print(combined_column_types)
## # A tibble: 47 × 38
##    Sheet  Admission_Number UPI   NAME  STR   ENG   KIS   MAT   BIO   PHY   CHE  
##    <chr>  <chr>            <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 F1_T1… character        char… char… char… char… char… char… char… char… char…
##  2 F1_T1… character        char… char… char… char… char… char… char… char… char…
##  3 F1_T1… character        char… char… char… char… char… char… char… char… char…
##  4 F1_T1… character        char… <NA>  char… char… char… char… char… char… char…
##  5 F1_T1… character        char… char… char… char… char… char… char… char… char…
##  6 F1_T2… character        char… char… char… char… char… char… char… char… char…
##  7 F1_T2… character        char… char… char… char… char… char… char… char… char…
##  8 F1_T2… character        char… char… char… char… char… char… char… char… char…
##  9 F1_T2… character        char… <NA>  char… char… char… char… char… char… char…
## 10 F1_T2… character        char… char… char… char… char… char… char… char… char…
## # ℹ 37 more rows
## # ℹ 27 more variables: HIS <chr>, GEO <chr>, CRE <chr>, HSC <chr>, AGR <chr>,
## #   MW <chr>, COM <chr>, FRE <chr>, BST <chr>, No_of_Sub <chr>, KCPE <chr>,
## #   VAP <chr>, Avg_Marks <chr>, DEV <chr>, Grade <chr>, Total_Marks <chr>,
## #   Total_Points <chr>, Mean_Points <chr>, STR_POS <chr>, OVR_POS <chr>,
## #   Form <chr>, Term <chr>, School_Name <chr>, Name <chr>, CS <chr>, DD <chr>,
## #   KSL <chr>
merged_df <- bind_rows(sheet_list)

merged_df <- merged_df %>%
  mutate(across(everything(), ~ na_if(., "--")))

3.5: Splitting text to columns for subject scores which have marks and letter grades

Since all of the subject wise grades contained both marks and the letter grade achieved in the respective subject, we had to use split columns to text method using space as a delimiter to retrieve only the numeric marks.

# Define the columns to process
columns_to_process <- c("ENG", "KIS", "MAT", "BIO", "PHY", "CHE", "HIS", "GEO", "CRE", 
                         "HSC", "AGR", "MW", "FRE", "COM", "BST", "KCPE", "CS", "DD", "KSL")

# Function to keep only the part before the first space
clean_column <- function(column) {
  # This function will split the column by space and take the first part
  return(sapply(column, function(x) strsplit(as.character(x), " ")[[1]][1]))
}

# Apply the cleaning function to the specified columns
cleaned_df <- merged_df %>%
  mutate(across(all_of(columns_to_process), ~ clean_column(.)))

cleaned_df <- cleaned_df %>% 
  select(-UPI,-NAME,-Name)

school_lookup <- data.frame(
  School_ID = c("ANS23_School_47", "ANS23_School_25", "ANS23_School_18", 
                "ANS23_School_17", "ANS23_School_7", "ANS23_School_5"),
  School_Name = c("Makongeni", "Muthurwa", "Aquinas", 
                  "Kamkunji", "Ruiru", "Fatima"),
  stringsAsFactors = FALSE
)

cleaned_df <- cleaned_df %>%
  left_join(school_lookup, by = "School_Name")

3.6: Identifying academic outcomes for RCT students and pivoting wide the data frame using Term as an identifier

We used the MasterSheet which had the link between Participant_IDs and Admission_Number and joined it with the pivoted dataset from the step above. The resulting data set included only those students who were part of the RCT conducted by Shamiri.

The current data frame was in the format of:

Admission_Number Form Term Sub1 Sub2
12345 F1 T1 80 60
12345 F1 T2 90 70
12345 F1 T3 70 100

We had to pivot it by Term to create the following type of format:

Admission_Number Form T1_Sub1 T1_Sub2 T2_Sub1 T2_Sub2 T3_Sub1 T3_Sub2
12345 F1 80 60 90 70 70 100

We did this as each Participant_ID needed to have a unique entry for the regressions analysis.

Now since we had the entire data purely numeric, we were able to convert all the character data types into numeric for the regression analysis.

Once the pivoted data set was ready, for all the subject scores, average marks, etc. we demeaned the data set. This was done to ensure that we remove any inconsistencies across the schools and individual effects which may impact the regression results.

# Select only Participant_ID and Admission_Number from Mastersheet
mastersheet_reduced <- MasterSheet %>%
  select(Participant_ID, Admission_Number, School_ID)

final <- mastersheet_reduced %>%
  left_join(cleaned_df, by = c("School_ID", "Admission_Number"), relationship = "many-to-many")


final$Data_Avail <- apply(final[, 3:33], 1, function(row) {
  if (all(is.na(row))) {
    return(0)
  } else {
    return(1)
  }
})


# Define the columns to process
columns_to_process <- c("ENG", "KIS", "MAT", "BIO", "PHY", "CHE", "HIS", "GEO", "CRE", 
                         "HSC", "AGR", "MW", "FRE", "COM", "BST", "KCPE", "CS", "DD", "KSL")

# Function to keep only the part before the first space
clean_column <- function(column) {
  # Split each value by space and keep the first part
  return(sapply(column, function(x) {
    parts <- strsplit(as.character(x), " ")[[1]]
    if (length(parts) > 0) {
      return(parts[1])
    } else {
      return(NA)
    }
  }))
}

# Apply the cleaning function to the specified columns in the final dataframe
final <- final %>%
  mutate(across(all_of(columns_to_process), ~ clean_column(.)))



# List of columns to pivot
columns_to_pivot <- c("No_of_Sub", "ENG", "KIS", "MAT", "BIO", 
                      "PHY", "CHE", "HIS", "GEO", "CRE", "HSC", 
                      "AGR", "BST", "CS", "COM", "MW", "DD", 
                      "FRE", "KSL", "Avg_Marks", "Total_Marks", 
                      "Total_Points", "DEV", "Grade", "VAP", 
                      "STR_POS", "OVR_POS", "Mean_Points", "KCPE")

# Filter the data
filtered_final <- final %>%
  filter(Data_Avail == 1)

# Define the columns to rename
columns_to_rename <- c("No_of_Sub", "ENG", "KIS", "MAT", "BIO", "PHY", "CHE", 
                       "HIS", "GEO", "CRE", "HSC", "AGR", "BST", "CS", "COM", 
                       "MW", "DD", "FRE", "KSL", "Avg_Marks", "Total_Marks", 
                       "Total_Points", "DEV", "Grade", "VAP", "STR_POS", 
                       "OVR_POS", "Mean_Points", "KCPE")

# Function to rename columns with term prefix and remove Term column
rename_and_remove_term <- function(df, term_prefix) {
  df %>%
    rename_with(~ paste0(term_prefix, "_", .), all_of(columns_to_rename)) %>%
    select(-Term)
}

# Subset and rename for Term 1
filtered_final_T1 <- filtered_final %>%
  filter(Term == "T1") %>%
  rename_and_remove_term("T1")

# Subset and rename for Term 2
filtered_final_T2 <- filtered_final %>%
  filter(Term == "T2") %>%
  rename_and_remove_term("T2")

# Subset and rename for Term 3
filtered_final_T3 <- filtered_final %>%
  filter(Term == "T3") %>%
  rename_and_remove_term("T3")

# Combine all unique participant_IDs and Admission_Number
all_ids <- bind_rows(
  filtered_final_T1 %>% select(Participant_ID, Admission_Number),
  filtered_final_T2 %>% select(Participant_ID, Admission_Number),
  filtered_final_T3 %>% select(Participant_ID, Admission_Number)
) %>% distinct()

# Merge the data frames based on participant_ID and Admission_Number
merged_final <- all_ids %>%
  left_join(filtered_final_T1, by = c("Participant_ID", "Admission_Number")) %>%
  left_join(filtered_final_T2, by = c("Participant_ID", "Admission_Number")) %>%
  left_join(filtered_final_T3, by = c("Participant_ID", "Admission_Number"))



# List of columns to convert to numeric
cols_to_convert <- c("T1_No_of_Sub", "T1_ENG", "T1_KIS", "T1_MAT", "T1_BIO", 
                     "T1_PHY", "T1_CHE", "T1_HIS", "T1_GEO", "T1_CRE", "T1_HSC", 
                     "T1_AGR", "T1_BST", "T1_CS", "T1_COM", "T1_MW", "T1_DD", 
                     "T1_FRE", "T1_KSL", "T1_Avg_Marks", "T1_Total_Marks", 
                     "T1_Total_Points", "T1_DEV", "T1_VAP", "T1_STR_POS", 
                     "T1_OVR_POS", "T1_Mean_Points", "T1_KCPE", "T2_No_of_Sub", 
                     "T2_ENG", "T2_KIS", "T2_MAT", "T2_BIO", "T2_PHY", "T2_CHE", 
                     "T2_HIS", "T2_GEO", "T2_CRE", "T2_HSC", "T2_AGR", "T2_BST", 
                     "T2_CS", "T2_COM", "T2_MW", "T2_DD", "T2_FRE", "T2_KSL", 
                     "T2_Avg_Marks", "T2_Total_Marks", "T2_Total_Points", "T2_DEV", 
                     "T2_VAP", "T2_STR_POS", "T2_OVR_POS", "T2_Mean_Points", 
                     "T2_KCPE", "T3_No_of_Sub", "T3_ENG", "T3_KIS", "T3_MAT", 
                     "T3_BIO", "T3_PHY", "T3_CHE", "T3_HIS", "T3_GEO", "T3_CRE", 
                     "T3_HSC", "T3_AGR", "T3_BST", "T3_COM", "T3_MW", "T3_DD", 
                     "T3_FRE", "T3_KSL", "T3_Avg_Marks", "T3_Total_Marks", 
                     "T3_Total_Points", "T3_DEV", "T3_VAP", "T3_STR_POS", 
                     "T3_OVR_POS", "T3_Mean_Points", "T3_KCPE")

# Convert specified columns to numeric
merged_final <- merged_final %>%
  mutate(across(all_of(cols_to_convert), as.numeric))


# List of columns to convert to numeric and demean
cols_to_demean <- c("T1_ENG", "T1_KIS", "T1_MAT", "T1_BIO", "T1_PHY", "T1_CHE", "T1_HIS", "T1_GEO",
                    "T1_CRE", "T1_HSC", "T1_AGR", "T1_BST", "T1_CS", "T1_COM", "T1_MW", "T1_DD", 
                    "T1_FRE", "T1_KSL", "T1_Avg_Marks", "T1_Total_Marks", "T1_Total_Points", 
                    "T1_Mean_Points", "T1_KCPE", "T2_ENG", "T2_KIS", "T2_MAT", "T2_BIO", "T2_PHY", 
                    "T2_CHE", "T2_HIS", "T2_GEO", "T2_CRE", "T2_HSC", "T2_AGR", "T2_BST", "T2_CS", 
                    "T2_COM", "T2_MW", "T2_DD", "T2_FRE", "T2_KSL", "T2_Avg_Marks", "T2_Total_Marks", 
                    "T2_Total_Points", "T2_Mean_Points", "T2_KCPE", "T3_ENG", "T3_KIS", "T3_MAT", 
                    "T3_BIO", "T3_PHY", "T3_CHE", "T3_HIS", "T3_GEO", "T3_CRE", "T3_HSC", "T3_AGR", 
                    "T3_BST", "T3_COM", "T3_MW", "T3_DD", "T3_FRE", "T3_KSL", "T3_Avg_Marks", 
                    "T3_Total_Marks", "T3_Total_Points", "T3_Mean_Points", "T3_KCPE")


library(dplyr)

# Step 1: Create `trt` column
merged_final <- merged_final %>%
  mutate(trt = ifelse(startsWith(Participant_ID, "S"), 1, 
                      ifelse(startsWith(Participant_ID, "C"), 0, NA)))

temp <- merged_final

# Step 2: Calculate means for control students
control_means <- merged_final %>%
  filter(trt == 0) %>%
  group_by(School_ID) %>%
  summarize(across(all_of(cols_to_demean), ~ mean(as.numeric(.), na.rm = TRUE), .names = "mean_{.col}"))

# Step 3: Calculate standard deviations for control students
control_sds <- merged_final %>%
  filter(trt == 0) %>%
  group_by(School_ID) %>%
  summarize(across(all_of(cols_to_demean), ~ sd(as.numeric(.), na.rm = TRUE), .names = "sd_{.col}"))

# Step 4: Merge the means and sds with the original dataframe
merged_final <- merged_final %>%
  left_join(control_means, by = "School_ID") %>%
  left_join(control_sds, by = "School_ID")

# Step 5: Normalize columns in `cols_to_demean` without changing column names
for (col in cols_to_demean) {
  mean_col <- paste0("mean_", col)  # Name of the mean column
  sd_col <- paste0("sd_", col)  # Name of the sd column
  
  merged_final <- merged_final %>%
    mutate(!!sym(col) := (as.numeric(!!sym(col)) - !!sym(mean_col)) / !!sym(sd_col))
}

# Step 6: Select relevant columns, removing the mean and sd columns
final_final <- merged_final %>%
  select(-starts_with("mean_"), -starts_with("sd_"))

final_final <- final_final %>% 
  select(-School_Name.x,-School_ID.y,-School_Name.y, -School_Name,-STR.x,-STR.y,-STR,-Form.x,-Form.y,-Data_Avail.x,-Data_Avail.y, -trt)

3.7: Loading RCT data set and merging with Academic Outcomes

We then loaded the RCT data and pivoted it by time frame (0, 2, 4, and 8 weeks) and used the Participant_IDs in this dataframe to merge with the academic scores data set.

# Setting the working directory
setwd('/Users/ishangupta/Desktop/Books for Masters/R Practice')


# Loading the data in R
treatment <- read_xlsx("Treatment.xlsx", sheet = "2023Studentlongtrialdata")
control <- read_xlsx("Control.xlsx", sheet = "2023NaiKiambuControl")
six_mfu <- read.csv("six_mfu.csv")

# Renaming the columns in the treatment group
treatment <- treatment %>% 
  rename(
    Age = `student age`,
    Form = form,
    Gender = `student gender`,
    School_ID = `school id`
  ) %>% 
  mutate(trt = 1)

# List of school IDs to be dropped
school_ids_to_drop <- c("ANS23_School_31", "ANS23_School_32", "ANS23_School_33", 
                        "ANS23_School_34", "ANS23_School_35", "ANS23_School_36", 
                        "ANS23_School_37")

# Assuming 'treatment' is your dataframe with the treatment group data
# Filtering out rows with specified school IDs
treatment <- treatment %>%
  filter(!School_ID %in% school_ids_to_drop)

# Renaming the columns in the control group
control <- control %>% 
  rename(
    School_ID = `school id`
  ) %>% 
  mutate(supervisor_id = "NA") %>% 
  mutate(hub = "NA") %>% 
  mutate(fellow_gender = "NA") %>% 
  mutate(fellow_age = "NA") %>% 
  mutate(trt = 0)

# Creating the RCT data set
rct <- rbind(treatment,control)

rct <- rct %>% 
  rename(
    Participant_ID = participant_id
  ) %>% 
  mutate(student_female = case_when(
           is.na(Gender) ~ NA_real_,
           Gender == "Female" ~ 1,
           TRUE ~ 0),
         rural_smalltown = case_when(
           is.na(Home) ~ NA_real_,
           Home == 1 ~ 1,
           Home == 2 ~ 1,
           TRUE ~ 0),
         bigtown_city = case_when(
           is.na(Home) ~ NA_real_,
           Home == 3 ~ 1,
           Home == 4 ~ 1,
           TRUE ~ 0)
         )

# Converting the columns into numeric
cols_to_numeric <- c(1, 11, 12, 13, 103, 104, 17:101)
rct[cols_to_numeric] <- lapply(rct[cols_to_numeric], as.numeric)

rct <- rct %>% 
  select(-c(PHQ_Functioning, GAD_Functioning, GAD_Check))

rct <- rct %>% 
  select(-c(88:100))


phq_questions <- paste("PHQ", 1:8, sep = "_")
gad_questions <- paste("GAD", 1:7, sep = "_")
pcs_questions <- c("PCS_2", "PCS_4", "PCS_5", "PCS_6")
gq_questions <- c("GQ_2", "GQ_4")
ses_questions <- c("SES_2", "SES_3", "SES_11")


rct <- rct %>%
  mutate(
    across(all_of(phq_questions), ~ 3 - .x),
    across(all_of(gad_questions), ~ 3 - .x),
    across(all_of(pcs_questions), ~ 3 - .x),
    across(all_of(gq_questions), ~ 8 - .x),
    across(all_of(ses_questions), ~ 8 - .x)
  )

# List of column prefixes for each scale
scales <- c("PCS_", "PHQ_", "GAD_", "MSPSS_", "GQ_", "SWEMWBS_", "SES_")

# Function to calculate total score for each scale
calculate_total_score <- function(df, prefix) {
  scale_columns <- grep(paste0("^", prefix), names(df), value = TRUE)
  df[paste0(prefix, "Total")] <- rowSums(df[, scale_columns], na.rm = TRUE)
  return(df)
}

# Apply the function to calculate total scores
for (scale in scales) {
  rct <- calculate_total_score(rct, scale)
}


rct_pivoted <- rct %>%
  pivot_wider(names_from = time, values_from = c(starts_with("PHQ_"), starts_with("GAD_"), 
                                                 starts_with("MSPSS_"), starts_with("GQ_"), 
                                                 starts_with("SWEMWBS_"), starts_with("SES_"), 
                                                 starts_with("PCS_")), 
              names_sep = "_")



six_mfu <- six_mfu %>% 
  select(participant_id, treatment, time, starts_with("pcs"), starts_with("swembs"), 
         starts_with("phq"), starts_with("gad"), starts_with("gq"), 
         starts_with("mspss"), starts_with("ses")) %>% 
    rename_with(~ gsub("^swembs", "swemwbs", .), starts_with("swembs"))


six_mfu_filtered <- six_mfu %>%
  filter(time == 26) %>%  # Filter rows where time is 26
  select(-phq_functioning, -gad_functioning, -gad_check) %>%  # Remove specific columns
  rename_with(~ toupper(.), starts_with("pcs")) %>%  # Convert pcs columns to uppercase
  rename_with(~ toupper(.), starts_with("swemwbs")) %>%
  rename_with(~ toupper(.), starts_with("phq")) %>%
  rename_with(~ toupper(.), starts_with("gad")) %>%
  rename_with(~ toupper(.), starts_with("gq")) %>%
  rename_with(~ toupper(.), starts_with("mspss")) %>%
  rename_with(~ toupper(.), starts_with("ses")) %>%
  rename(Participant_ID = participant_id) %>%  # Rename participant_id to Participant_ID
  select(-treatment)  # Remove the treatment column


six_mfu_filtered <- six_mfu_filtered %>% 
   mutate(
    PHQ_Total = rowSums(select(., starts_with("PHQ")), na.rm = TRUE),
    GAD_Total = rowSums(select(., starts_with("GAD")), na.rm = TRUE),
    PCS_Total = rowSums(select(., starts_with("PCS")), na.rm = TRUE),
    SWEMWBS_Total = rowSums(select(., starts_with("SWEMWBS")), na.rm = TRUE),
    GQ_Total = rowSums(select(., starts_with("GQ")), na.rm = TRUE),
    MSPSS_Total = rowSums(select(., starts_with("MSPSS")), na.rm = TRUE),
    SES_Total = rowSums(select(., starts_with("SES")), na.rm = TRUE)
  )

six_mfu_filtered <- six_mfu_filtered %>%
  rename_with(~ paste0(., "_26"), starts_with("GAD")) %>%  # Add _26 to GAD columns
  rename_with(~ paste0(., "_26"), starts_with("PCS")) %>%  # Add _26 to PCS columns
  rename_with(~ paste0(., "_26"), starts_with("PHQ")) %>%  # Add _26 to PHQ columns
  rename_with(~ paste0(., "_26"), starts_with("SWEMWBS")) %>%  # Add _26 to SWEMWBS columns
  rename_with(~ paste0(., "_26"), starts_with("GQ")) %>%  # Add _26 to GQ columns
  rename_with(~ paste0(., "_26"), starts_with("MSPSS")) %>%  # Add _26 to MSPSS columns
  rename_with(~ paste0(., "_26"), starts_with("SES"))  # Add _26 to SES columns

six_mfu_filtered <- six_mfu_filtered %>% 
  select(-time)

merged_results <- rct_pivoted %>% 
  left_join(six_mfu_filtered, by = "Participant_ID")


combined <- merged_results %>%
  left_join(final_final, by = "Participant_ID")

combined <- combined %>% 
  select(-School_ID.x,-School_ID.y,-Form.y) %>% 
  rename (School_ID = School_ID.x.x, Form = Form.x) 

combined <- combined %>% 
      mutate(clinical = ifelse(PHQ_Total_0 <=14 | GAD_Total_0 <=14, 1, 0))


short <- combined %>% 
  select(Participant_ID, School_ID, PHQ_Total_0, GAD_Total_0, clinical) %>% 
  group_by(clinical) %>% summarise (count = n())

4. Analysis of Academic Outcomes

4.1: Baseline Analysis

4.1.1: Balance Test

Since T1 was right before the intervention, we can treat T1 scores as the baseline and therefore conduct balance test.

# Select columns that start with "T1" and the treatment variable
t1_columns <- grep("^T1", names(combined), value = TRUE)
selected_columns <- c("trt", t1_columns)

# Subset the dataframe to include only the selected columns
balance_df <- combined[, selected_columns]

balance_df <- balance_df %>% 
  select(-T1_Grade)

# Create a TableOne object
balance_table <- CreateTableOne(vars = t1_columns, strata = "trt", data = balance_df, test = TRUE)

# Convert the TableOne object to a data frame
balance_table_df <- as.data.frame(print(balance_table, smd = TRUE, printToggle = FALSE))

# Format the data frame for stargazer (adjust as needed)
balance_table_df$Variable <- rownames(balance_table_df)
rownames(balance_table_df) <- NULL

# Reorder the columns to fit stargazer input format
balance_table_df <- balance_table_df %>%
  select(Variable, everything())

# Use stargazer to display the table
stargazer(balance_table_df, type = "text", summary = FALSE, rownames = FALSE,
          title = "Table 1: Balance Table for Term 1 Academic Outcomes",
          digits = 2)
## 
## Table 1: Balance Table for Term 1 Academic Outcomes
## ==========================================================================
## Variable                          0              1          p   test  SMD 
## --------------------------------------------------------------------------
## n                                850            866                       
## T1_ENG (mean (SD))           0.00 (1.00)    0.02 (1.09)   0.710      0.020
## T1_KIS (mean (SD))           0.00 (1.00)    0.11 (1.02)   0.052      0.106
## T1_MAT (mean (SD))           0.00 (1.00)    0.06 (1.06)   0.302      0.056
## T1_BIO (mean (SD))           0.00 (1.00)    0.14 (1.07)   0.021      0.131
## T1_PHY (mean (SD))           0.00 (1.00)    0.10 (1.03)   0.095      0.100
## T1_CHE (mean (SD))           0.00 (1.00)    0.08 (1.05)   0.164      0.076
## T1_HIS (mean (SD))           0.00 (1.00)    0.08 (1.04)   0.159      0.080
## T1_GEO (mean (SD))           0.00 (1.00)    0.11 (0.99)   0.080      0.108
## T1_CRE (mean (SD))           0.00 (1.00)    -0.01 (1.04)  0.850      0.011
## T1_HSC (mean (SD))           0.00 (0.98)    0.13 (1.04)   0.258      0.127
## T1_AGR (mean (SD))           0.00 (0.99)    0.19 (1.64)   0.167      0.141
## T1_MW (mean (SD))            0.00 (1.00)    0.05 (0.57)   0.866      0.056
## T1_COM (mean (SD))           0.00 (0.99)    -0.01 (1.19)  0.899      0.012
## T1_FRE (mean (SD))           0.00 (0.99)    -0.03 (1.03)  0.873      0.033
## T1_BST (mean (SD))           0.00 (0.99)    0.14 (1.06)   0.033      0.133
## T1_No_of_Sub (mean (SD))     9.96 (1.92)    10.04 (1.75)  0.444      0.041
## T1_KCPE (mean (SD))          0.00 (1.00)    -0.04 (1.04)  0.457      0.044
## T1_VAP (mean (SD))          -11.20 (13.32) -9.76 (11.57)  0.050      0.115
## T1_Avg_Marks (mean (SD))     0.00 (1.00)    0.10 (1.03)   0.055      0.104
## T1_DEV (mean (SD))           -0.41 (4.12)   -0.30 (3.35)  0.609      0.031
## T1_Total_Marks (mean (SD))   0.00 (1.00)    0.10 (1.05)   0.066      0.102
## T1_Total_Points (mean (SD))  0.00 (1.00)    0.13 (1.15)   0.046      0.117
## T1_Mean_Points (mean (SD))   0.00 (1.00)    0.04 (0.97)   0.444      0.046
## T1_STR_POS (mean (SD))      25.58 (16.09)  26.02 (16.36)  0.616      0.027
## T1_OVR_POS (mean (SD))      120.44 (82.41) 120.20 (83.98) 0.956      0.003
## T1_CS (mean (SD))            0.00 (1.00)    -0.20 (0.71)  0.427      0.226
## T1_DD (mean (SD))            0.00 (1.00)    -0.42 (1.54)  0.390      0.327
## T1_KSL (mean (SD))           0.00 (1.00)    0.21 (0.78)   0.597      0.233
## --------------------------------------------------------------------------
school_stats <- merged_final %>%
  group_by(School_ID) %>%
  summarise(
    Mean_T1 = mean(T1_Avg_Marks, na.rm = TRUE),
    SD_T1 = sd(T1_Avg_Marks, na.rm = TRUE),
    Mean_T2 = mean(T2_Avg_Marks, na.rm = TRUE),
    SD_T2 = sd(T2_Avg_Marks, na.rm = TRUE),
    Mean_T3 = mean(T3_Avg_Marks, na.rm = TRUE),
    SD_T3 = sd(T3_Avg_Marks, na.rm = TRUE)
  )

# Use stargazer to present the summary statistics by School_ID
stargazer(school_stats,
          type = "text", 
          summary = FALSE,
          title = "Table 2: School-wise Mean and Standard Deviation of Average Marks Across Terms",
          digits = 2,
          rownames = FALSE)
## 
## Table 2: School-wise Mean and Standard Deviation of Average Marks Across Terms
## ===================================================================================================================================
## School_ID             Mean_T1              SD_T1             Mean_T2              SD_T2             Mean_T3             SD_T3      
## -----------------------------------------------------------------------------------------------------------------------------------
## ANS23_School_17 -0.00651313970601903 0.971935801389596 -0.0277842496996569  0.932567207786478 -0.0316167920505773 0.947113596803313
## ANS23_School_18  0.0907202805645688  0.91180438810084   0.0820121279239994  0.952576578182148 0.0542900329251336  0.953673279746632
## ANS23_School_25  0.0362225979087015  0.972663077088342 -0.00719284011220962 1.05279819743362  0.0301510452608446  1.00365493158475 
## ANS23_School_5  -0.0207080807112431  1.00879878567722  -0.0465681393745451  0.993421073159136 -0.047542718891748  1.02143797066025 
## ANS23_School_7   0.0813721641956309   1.1082482237558   0.0987877545908999  1.06652600870861  0.0852487666406166  1.03539466253287 
##                  0.497049467447658   1.29847414317251   0.386719861870872   1.29776183111221          NaN                NA        
## -----------------------------------------------------------------------------------------------------------------------------------

From the balance table above, we can note that majority of the academic outcomes for T1 are balanced (that is, there is no statistically significant difference between the treatment and control group). The only exceptions are BIO (Biology) and Total_Mean_Points.

Moreover, a small effect size (SMD) indicates that the differences in baseline characteristics between the control and the treatment group are minimal. This suggests that the two groups are fairly similar in terms of these characteristics at the baseline of the study.

4.1.2: Distribution of T1, T2, T3 Average Marks (Demeaned Data)

# Reshape data to long format for easier plotting
combined_long <- combined %>%
  select(School_ID, T1_Avg_Marks, T2_Avg_Marks, T3_Avg_Marks) %>%
  pivot_longer(cols = starts_with("T"), 
               names_to = "Term", 
               values_to = "Avg_Marks")

# Density plot for T1, T2, and T3 average marks
ggplot(combined_long, aes(x = Avg_Marks, fill = Term, color = Term)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Average Marks Across Terms (Demeaned Scores)", 
       x = "Average Marks", 
       y = "Density") +
  theme_minimal()

4.1.3: Distribution of T1, T2, T3 Average Marks (Actual Data)

# Reshape data to long format for easier plotting
temp_long <- temp %>%
  select(School_ID, T1_Avg_Marks, T2_Avg_Marks, T3_Avg_Marks) %>%
  pivot_longer(cols = starts_with("T"), 
               names_to = "Term", 
               values_to = "Avg_Marks")

# Density plot for T1, T2, and T3 average marks in merged_final
ggplot(temp_long, aes(x = Avg_Marks, fill = Term, color = Term)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Average Marks Across Terms (Actual Scores)", 
       x = "Average Marks", 
       y = "Density") +
  theme_minimal()

4.1.4: Correlation between T1, T2 and T3 Average Marks (Actual Data)

# Create the first plot: SES_Total_26 vs T3_Avg_Marks
plot_t2_t1 <- ggplot(temp, aes(x = T1_Avg_Marks, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "T1 Avg Marks vs T2 Avg Marks",
       x = "T1 Avg Marks",
       y = "T2 Avg Marks") +
  theme_minimal()

print(plot_t2_t1)

# Create the first plot: SES_Total_26 vs T3_Avg_Marks
plot_t2_t3 <- ggplot(temp, aes(x = T2_Avg_Marks, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "T2 Avg Marks vs T3 Avg Marks",
       x = "T2 Avg Marks",
       y = "T3 Avg Marks") +
  theme_minimal()

print(plot_t2_t3)

# Create the first plot: SES_Total_26 vs T3_Avg_Marks
plot_t1_t3 <- ggplot(temp, aes(x = T1_Avg_Marks, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "T1 Avg Marks vs T3 Avg Marks",
       x = "T1 Avg Marks",
       y = "T3 Avg Marks") +
  theme_minimal()

print(plot_t1_t3)

4.1.5: Checking the relation between Mental Health (GAD and PHQ) scores in baseline (0 week) and T1 scores

# Create the first plot: PHQ_Total_0 vs T1_Avg_Marks
plot1 <- ggplot(combined, aes(x = PHQ_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "PHQ_Total_0 vs T1_Avg_Marks",
       x = "PHQ_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

# Create the second plot: GAD_Total_0 vs T1_Avg_Marks
plot2 <- ggplot(combined, aes(x = GAD_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "GAD_Total_0 vs T1_Avg_Marks",
       x = "GAD_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot <- plot1 + plot2 + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot)

# Define models for PHQ without controls or interaction term
model_phq_t1 <- lm(T1_Avg_Marks ~ PHQ_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_phq_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)

# Same for GAD
model_gad_t1 <- lm(T1_Avg_Marks ~ GAD_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_gad_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)



stargazer(model_phq_t1, model_phq_t1_v1 , model_gad_t1, model_gad_t1_v1, type = "text", title = "Table 3: Term 1 Average Scores Regressed on PHQ and GAD baseline scores", notes = "Includes School fixed effects")
## 
## Table 3: Term 1 Average Scores Regressed on PHQ and GAD baseline scores
## =======================================================================
##                                           Dependent variable:          
##                                 ---------------------------------------
##                                              T1_Avg_Marks              
##                                    (1)       (2)       (3)       (4)   
## -----------------------------------------------------------------------
## PHQ_Total_0                     0.021***                               
##                                  (0.005)                               
##                                                                        
## trt                                        0.097*              0.097*  
##                                            (0.052)             (0.052) 
##                                                                        
## GAD_Total_0                                         0.014***           
##                                                      (0.005)           
##                                                                        
## T1_No_of_Sub                    0.171***  0.171***  0.170***  0.171*** 
##                                  (0.014)   (0.015)   (0.015)   (0.015) 
##                                                                        
## Constant                        -1.977*** -1.710*** -1.831*** -1.710***
##                                  (0.163)   (0.150)   (0.160)   (0.150) 
##                                                                        
## -----------------------------------------------------------------------
## Observations                      1,373     1,373     1,373     1,373  
## R2                                0.105     0.095     0.097     0.095  
## Adjusted R2                       0.104     0.093     0.096     0.093  
## Residual Std. Error (df = 1370)   0.959     0.964     0.963     0.964  
## F Statistic (df = 2; 1370)      80.485*** 71.632*** 73.717*** 71.632***
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01
##                                           Includes School fixed effects

4.1.6: Checking the relation between Individual Well-being (GQ, PCS, SWEMWBS) scores in baseline (0 week) and T1 scores

# Create scatter plots with regression lines for SWEMWBS, PCS, and GQ
plot_swemwbs <- ggplot(combined, aes(x = SWEMWBS_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SWEMWBS_Total_0 vs T1_Avg_Marks",
       x = "SWEMWBS_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

plot_pcs <- ggplot(combined, aes(x = PCS_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "PCS_Total_0 vs T1_Avg_Marks",
       x = "PCS_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

plot_gq <- ggplot(combined, aes(x = GQ_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "GQ_Total_0 vs T1_Avg_Marks",
       x = "GQ_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side
combined_plots <- plot_swemwbs + plot_pcs + plot_gq +
  plot_layout(guides = "collect")

# Print the combined plots
print(combined_plots)

model_swemwbs_t1 <- lm(T1_Avg_Marks ~ SWEMWBS_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_swemwbs_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)

model_pcs_t1 <- lm(T1_Avg_Marks ~ PCS_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_pcs_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)

model_gq_t1 <- lm(T1_Avg_Marks ~ GQ_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_gq_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)


stargazer(model_swemwbs_t1, model_swemwbs_t1_v1 , model_pcs_t1, model_pcs_t1_v1, type = "text", title = "Table 4: Term 1 Average Scores Regressed on SWEMWBS and PCS baseline scores", notes = "Includes School fixed effects")
## 
## Table 4: Term 1 Average Scores Regressed on SWEMWBS and PCS baseline scores
## =======================================================================
##                                           Dependent variable:          
##                                 ---------------------------------------
##                                              T1_Avg_Marks              
##                                    (1)       (2)       (3)       (4)   
## -----------------------------------------------------------------------
## SWEMWBS_Total_0                 0.020***                               
##                                  (0.004)                               
##                                                                        
## trt                                        0.097*              0.097*  
##                                            (0.052)             (0.052) 
##                                                                        
## PCS_Total_0                                         -0.062***          
##                                                      (0.009)           
##                                                                        
## T1_No_of_Sub                    0.170***  0.171***  0.169***  0.171*** 
##                                  (0.014)   (0.015)   (0.014)   (0.015) 
##                                                                        
## Constant                        -2.091*** -1.710*** -1.210*** -1.710***
##                                  (0.167)   (0.150)   (0.159)   (0.150) 
##                                                                        
## -----------------------------------------------------------------------
## Observations                      1,373     1,373     1,373     1,373  
## R2                                0.111     0.095     0.125     0.095  
## Adjusted R2                       0.110     0.093     0.124     0.093  
## Residual Std. Error (df = 1370)   0.956     0.964     0.948     0.964  
## F Statistic (df = 2; 1370)      85.469*** 71.632*** 97.745*** 71.632***
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01
##                                           Includes School fixed effects
stargazer(model_gq_t1, model_gq_t1_v1, type = "text",  title = "Table 5: Term 1 Average Scores Regressed on GQ baseline scores", notes = "Includes School fixed effects")
## 
## Table 5: Term 1 Average Scores Regressed on GQ baseline scores
## ==============================================================
##                                      Dependent variable:      
##                                 ------------------------------
##                                          T1_Avg_Marks         
##                                       (1)            (2)      
## --------------------------------------------------------------
## GQ_Total_0                         0.011***                   
##                                     (0.004)                   
##                                                               
## trt                                                 0.097*    
##                                                    (0.052)    
##                                                               
## T1_No_of_Sub                       0.172***        0.171***   
##                                     (0.014)        (0.015)    
##                                                               
## Constant                           -1.975***      -1.710***   
##                                     (0.180)        (0.150)    
##                                                               
## --------------------------------------------------------------
## Observations                         1,373          1,373     
## R2                                   0.098          0.095     
## Adjusted R2                          0.097          0.093     
## Residual Std. Error (df = 1370)      0.962          0.964     
## F Statistic (df = 2; 1370)         74.642***      71.632***   
## ==============================================================
## Note:                              *p<0.1; **p<0.05; ***p<0.01
##                                  Includes School fixed effects

4.1.7: Checking the relation between External Support System (MSPSS, SES) scores in baseline (0 week) and T1 scores

# Create the first plot: SES_Total_0 vs T1_Avg_Marks
plot_ses <- ggplot(combined, aes(x = SES_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SES_Total_0 vs T1_Avg_Marks",
       x = "SES_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

# Create the second plot: MSPSS_Total_0 vs T1_Avg_Marks
plot_mpss <- ggplot(combined, aes(x = MSPSS_Total_0, y = T1_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "MSPSS_Total_0 vs T1_Avg_Marks",
       x = "MSPSS_Total_0",
       y = "T1_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot_ses_mpss <- plot_ses + plot_mpss + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot_ses_mpss)

model_mspss_t1 <- lm(T1_Avg_Marks ~ MSPSS_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_mspss_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)

model_ses_t1 <- lm(T1_Avg_Marks ~ SES_Total_0 + T1_No_of_Sub, data = combined, cluster = ~School_ID)
model_ses_t1_v1 <- lm(T1_Avg_Marks ~ trt + T1_No_of_Sub, data = combined, cluster = ~School_ID)


stargazer(model_mspss_t1, model_mspss_t1_v1 , model_ses_t1, model_ses_t1_v1, type = "text", title = "Table 6: Term 1 Average Scores Regressed on MSPSS and SES baseline scores", notes = "Includes School fixed effects")
## 
## Table 6: Term 1 Average Scores Regressed on MSPSS and SES baseline scores
## =======================================================================
##                                           Dependent variable:          
##                                 ---------------------------------------
##                                              T1_Avg_Marks              
##                                    (1)       (2)       (3)       (4)   
## -----------------------------------------------------------------------
## MSPSS_Total_0                     0.001                                
##                                  (0.002)                               
##                                                                        
## trt                                        0.097*              0.097*  
##                                            (0.052)             (0.052) 
##                                                                        
## SES_Total_0                                         0.007***           
##                                                      (0.002)           
##                                                                        
## T1_No_of_Sub                    0.171***  0.171***  0.165***  0.171*** 
##                                  (0.015)   (0.015)   (0.014)   (0.015) 
##                                                                        
## Constant                        -1.692*** -1.710*** -2.075*** -1.710***
##                                  (0.167)   (0.150)   (0.171)   (0.150) 
##                                                                        
## -----------------------------------------------------------------------
## Observations                      1,373     1,373     1,373     1,373  
## R2                                0.092     0.095     0.107     0.095  
## Adjusted R2                       0.091     0.093     0.105     0.093  
## Residual Std. Error (df = 1370)   0.966     0.964     0.958     0.964  
## F Statistic (df = 2; 1370)      69.797*** 71.632*** 81.900*** 71.632***
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01
##                                           Includes School fixed effects

4.2: Endline Analysis and 6 Month Follow-up

4.2.1: Conducting the regression for endline scores

# Define the base model and incremental models
model_fe_1 <- lm(T2_Avg_Marks ~ trt, data = combined, cluster = ~School_ID)

model_fe_2 <- lm(T2_Avg_Marks ~ trt + student_female, data = combined, cluster = ~School_ID)

model_fe_3 <- lm(T2_Avg_Marks ~ trt + student_female + rural_smalltown, data = combined, cluster = ~School_ID)

model_fe_4 <- lm(T2_Avg_Marks ~ trt + student_female + rural_smalltown + bigtown_city, data = combined, cluster = ~School_ID)

model_fe_5 <- lm(T2_Avg_Marks ~ trt + student_female + rural_smalltown + bigtown_city + as.factor(Form), data = combined, cluster = ~School_ID)

# Create the stargazer table
stargazer(model_fe_1, model_fe_2, model_fe_3, model_fe_4, model_fe_5,
          type = "text", 
          title = "Table 7: Incremental Regression Results for Avg Marks in term 2 on Shamiri treatment",
          covariate.labels = c("Treatment", "Female Student", "Rural/Small Town", "Big Town/City", 
                               "Form 2 Dummy", "Form 3 Dummy", "Form 4 Dummy"),
          dep.var.labels = "Average Marks (T2)",
          omit.stat = c("f", "ser"),  # Adjust to show/hide specific statistics
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE, notes = "Includes School fixed effects")
## 
## Table 7: Incremental Regression Results for Avg Marks in term 2 on Shamiri treatment
## ============================================================
##                              Dependent variable:            
##                  -------------------------------------------
##                              Average Marks (T2)             
##                    (1)     (2)     (3)      (4)       (5)   
## ------------------------------------------------------------
## Treatment         0.056   0.055   0.048    0.048    -0.001  
##                  (0.055) (0.055) (0.055)  (0.055)   (0.054) 
## Female Student            0.022   0.001    -0.002    0.062  
##                          (0.056) (0.056)  (0.056)   (0.056) 
## Rural/Small Town                 0.235*** 0.365***  0.319*  
##                                  (0.056)  (0.128)   (0.165) 
## Big Town/City                              0.147     0.124  
##                                           (0.131)   (0.167) 
## Form 2 Dummy                                       -0.485***
##                                                     (0.065) 
## Form 3 Dummy                                       -0.673***
##                                                     (0.070) 
## Form 4 Dummy                                         0.056  
##                                                     (0.176) 
## Constant         -0.000  -0.009  -0.133** -0.262**   0.120  
##                  (0.040) (0.046) (0.054)  (0.126)   (0.171) 
## ------------------------------------------------------------
## Observations      1,346   1,346   1,346    1,346     1,301  
## R2                0.001   0.001   0.014    0.015     0.091  
## Adjusted R2      0.00001 -0.001   0.012    0.012     0.086  
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01
##                                Includes School fixed effects

4.2.2: Checking the relation between Mental Health (GAD and PHQ) scores in endline (8 week) and T2 scores

# Create the first plot: GAD_Total_8 vs T2_Avg_Marks
plot1 <- ggplot(combined, aes(x = GAD_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "GAD_Total_8 vs T2_Avg_Marks",
       x = "GAD_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

# Create the second plot: PHQ_Total_8 vs T2_Avg_Marks
plot2 <- ggplot(combined, aes(x = PHQ_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "PHQ_Total_8 vs T2_Avg_Marks",
       x = "PHQ_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot <- plot1 + plot2 + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot)

model_phq_t2 <- lm(T2_Avg_Marks ~ PHQ_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_phq_t2_v1 <- lm(T2_Avg_Marks ~  trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

model_gad_t2 <- lm(T2_Avg_Marks ~ GAD_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_gad_t2_v1 <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

stargazer(model_phq_t2, model_phq_t2_v1 , model_gad_t2, model_gad_t2_v1, type = "text", title = "Table 8: Term 1 Average Scores Regressed on GAD and PHQ endline scores", notes = "Includes School fixed effects")
## 
## Table 8: Term 1 Average Scores Regressed on GAD and PHQ endline scores
## ===========================================================================
##                                             Dependent variable:            
##                                 -------------------------------------------
##                                                T2_Avg_Marks                
##                                    (1)        (2)        (3)        (4)    
## ---------------------------------------------------------------------------
## PHQ_Total_8                      0.011***                                  
##                                  (0.002)                                   
##                                                                            
## trt                                          -0.013                -0.013  
##                                             (0.032)               (0.032)  
##                                                                            
## GAD_Total_8                                            0.010***            
##                                                        (0.002)             
##                                                                            
## T2_No_of_Sub                      -0.009     -0.006     -0.008     -0.006  
##                                  (0.009)    (0.009)    (0.009)    (0.009)  
##                                                                            
## T1_Avg_Marks                     0.823***   0.831***   0.825***   0.831*** 
##                                  (0.017)    (0.017)    (0.017)    (0.017)  
##                                                                            
## Constant                          -0.103     0.038      -0.074     0.038   
##                                  (0.092)    (0.091)    (0.092)    (0.091)  
##                                                                            
## ---------------------------------------------------------------------------
## Observations                      1,330      1,330      1,330      1,330   
## R2                                0.675      0.668      0.672      0.668   
## Adjusted R2                       0.674      0.667      0.671      0.667   
## Residual Std. Error (df = 1326)   0.580      0.587      0.583      0.587   
## F Statistic (df = 3; 1326)      917.726*** 888.114*** 905.462*** 888.114***
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01
##                                               Includes School fixed effects

4.2.3: Checking the relation between Individual Well-being (GQ, PCS, SWEMWBS) scores in endline (8 week) and T2 scores

# Create scatter plots with regression lines for SWEMWBS_8, PCS_8, and GQ_8
plot_swemwbs_8 <- ggplot(combined, aes(x = SWEMWBS_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SWEMWBS_Total_8 vs T2_Avg_Marks",
       x = "SWEMWBS_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

plot_pcs_8 <- ggplot(combined, aes(x = PCS_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "PCS_Total_8 vs T2_Avg_Marks",
       x = "PCS_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

plot_gq_8 <- ggplot(combined, aes(x = GQ_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "GQ_Total_8 vs T2_Avg_Marks",
       x = "GQ_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side
combined_plots_8 <- plot_swemwbs_8 + plot_pcs_8 + plot_gq_8 +
  plot_layout(guides = "collect")

# Print the combined plots
print(combined_plots_8)

model_swemwbs_t2 <- lm(T2_Avg_Marks ~ SWEMWBS_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_swemwbs_t2_v1 <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

model_pcs_t2 <- lm(T2_Avg_Marks ~ PCS_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_pcs_t2_v1 <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

model_gq_t2 <- lm(T2_Avg_Marks ~ GQ_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_gq_t2_v1 <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

stargazer(model_swemwbs_t2, model_swemwbs_t2_v1 , model_pcs_t2, model_pcs_t2_v1, type = "text", title = "Table 9: Term 2 Average Scores Regressed on SWEMWBS and PCS endline scores", notes = "Includes School fixed effects")
## 
## Table 9: Term 2 Average Scores Regressed on SWEMWBS and PCS endline scores
## ===========================================================================
##                                             Dependent variable:            
##                                 -------------------------------------------
##                                                T2_Avg_Marks                
##                                    (1)        (2)        (3)        (4)    
## ---------------------------------------------------------------------------
## SWEMWBS_Total_8                  0.008***                                  
##                                  (0.002)                                   
##                                                                            
## trt                                          -0.013                -0.013  
##                                             (0.032)               (0.032)  
##                                                                            
## PCS_Total_8                                             0.007              
##                                                        (0.005)             
##                                                                            
## T2_No_of_Sub                      -0.007     -0.006     -0.007     -0.006  
##                                  (0.009)    (0.009)    (0.009)    (0.009)  
##                                                                            
## T1_Avg_Marks                     0.824***   0.831***   0.832***   0.831*** 
##                                  (0.017)    (0.017)    (0.017)    (0.017)  
##                                                                            
## Constant                          -0.118     0.038     -0.00004    0.038   
##                                  (0.093)    (0.091)    (0.092)    (0.091)  
##                                                                            
## ---------------------------------------------------------------------------
## Observations                      1,330      1,330      1,330      1,330   
## R2                                0.674      0.668      0.668      0.668   
## Adjusted R2                       0.673      0.667      0.667      0.667   
## Residual Std. Error (df = 1326)   0.581      0.587      0.586      0.587   
## F Statistic (df = 3; 1326)      914.357*** 888.114*** 890.075*** 888.114***
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01
##                                               Includes School fixed effects
stargazer(model_gq_t2, model_gq_t2_v1, type = "text",  title = "Table 10: Term 2 Average Scores Regressed on GQ endline scores", notes = "Includes School fixed effects")
## 
## Table 10: Term 2 Average Scores Regressed on GQ endline scores
## ==============================================================
##                                      Dependent variable:      
##                                 ------------------------------
##                                          T2_Avg_Marks         
##                                       (1)            (2)      
## --------------------------------------------------------------
## GQ_Total_8                         0.006***                   
##                                     (0.001)                   
##                                                               
## trt                                                 -0.013    
##                                                    (0.032)    
##                                                               
## T2_No_of_Sub                        -0.008          -0.006    
##                                     (0.009)        (0.009)    
##                                                               
## T1_Avg_Marks                       0.828***        0.831***   
##                                     (0.017)        (0.017)    
##                                                               
## Constant                            -0.097          0.038     
##                                     (0.093)        (0.091)    
##                                                               
## --------------------------------------------------------------
## Observations                         1,330          1,330     
## R2                                   0.672          0.668     
## Adjusted R2                          0.672          0.667     
## Residual Std. Error (df = 1326)      0.582          0.587     
## F Statistic (df = 3; 1326)        906.725***      888.114***  
## ==============================================================
## Note:                              *p<0.1; **p<0.05; ***p<0.01
##                                  Includes School fixed effects

4.2.4: Checking the relation between Individual Well-being (MSPSS, SES) scores in endline (8 week) and T2 scores

# Create the first plot: SES_Total_8 vs T2_Avg_Marks
plot_ses_8 <- ggplot(combined, aes(x = SES_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SES_Total_8 vs T2_Avg_Marks",
       x = "SES_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

# Create the second plot: MSPSS_Total_8 vs T2_Avg_Marks
plot_mpss_8 <- ggplot(combined, aes(x = MSPSS_Total_8, y = T2_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "MSPSS_Total_8 vs T2_Avg_Marks",
       x = "MSPSS_Total_8",
       y = "T2_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot_ses_mpss_8 <- plot_ses_8 + plot_mpss_8 + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot_ses_mpss_8)

model_mspss_t2 <- lm(T2_Avg_Marks ~ MSPSS_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_mspss_t2_v1 <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

model_ses_t2 <- lm(T2_Avg_Marks ~  MSPSS_Total_8 + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)
model_ses_t2_v1 <- lm(T2_Avg_Marks ~  trt + T2_No_of_Sub + T1_Avg_Marks, data = combined, cluster = ~School_ID)

stargazer(model_mspss_t2, model_mspss_t2_v1 , model_ses_t2, model_ses_t2_v1, type = "text", title = "Table 11: Term 2 Average Scores Regressed on MSPSS and SES endline scores", notes = "Includes School fixed effects")
## 
## Table 11: Term 2 Average Scores Regressed on MSPSS and SES endline scores
## ===========================================================================
##                                             Dependent variable:            
##                                 -------------------------------------------
##                                                T2_Avg_Marks                
##                                    (1)        (2)        (3)        (4)    
## ---------------------------------------------------------------------------
## MSPSS_Total_8                    0.004***              0.004***            
##                                  (0.001)               (0.001)             
##                                                                            
## trt                                          -0.013                -0.013  
##                                             (0.032)               (0.032)  
##                                                                            
## T2_No_of_Sub                      -0.008     -0.006     -0.008     -0.006  
##                                  (0.009)    (0.009)    (0.009)    (0.009)  
##                                                                            
## T1_Avg_Marks                     0.827***   0.831***   0.827***   0.831*** 
##                                  (0.017)    (0.017)    (0.017)    (0.017)  
##                                                                            
## Constant                          -0.076     0.038      -0.076     0.038   
##                                  (0.093)    (0.091)    (0.093)    (0.091)  
##                                                                            
## ---------------------------------------------------------------------------
## Observations                      1,330      1,330      1,330      1,330   
## R2                                0.671      0.668      0.671      0.668   
## Adjusted R2                       0.671      0.667      0.671      0.667   
## Residual Std. Error (df = 1326)   0.583      0.587      0.583      0.587   
## F Statistic (df = 3; 1326)      902.941*** 888.114*** 902.941*** 888.114***
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01
##                                               Includes School fixed effects

4.2.5: Checking the relation between Mental Health (GAD and PHQ) scores in 6 month followup (26 weeks) and T3 scores

# Create the first plot: GAD_Total_26 vs T3_Avg_Marks
plot1_t3 <- ggplot(combined, aes(x = GAD_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "GAD_Total_26 vs T3_Avg_Marks",
       x = "GAD_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

# Create the second plot: PHQ_Total_26 vs T3_Avg_Marks
plot2_t3 <- ggplot(combined, aes(x = PHQ_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "PHQ_Total_26 vs T3_Avg_Marks",
       x = "PHQ_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot_t3 <- plot1_t3 + plot2_t3 + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot_t3)

model_phq_t3 <- lm(T3_Avg_Marks ~ PHQ_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_phq_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_gad_t3 <- lm(T3_Avg_Marks ~ GAD_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_gad_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)


stargazer(model_phq_t3, model_phq_t3_v1 , model_gad_t3, model_gad_t3_v1, type = "text", title = "Table 12: Term 3 Average Scores Regressed on GAD and PHQ 6MFU scores", notes = "Includes School fixed effects")
## 
## Table 12: Term 3 Average Scores Regressed on GAD and PHQ 6MFU scores
## =========================================================================================================================
##                                                              Dependent variable:                                         
##                     -----------------------------------------------------------------------------------------------------
##                                                                 T3_Avg_Marks                                             
##                               (1)                       (2)                      (3)                       (4)           
## -------------------------------------------------------------------------------------------------------------------------
## PHQ_Total_26                 -0.002                                                                                      
##                             (0.004)                                                                                      
##                                                                                                                          
## trt                                                   -0.025                                             -0.025          
##                                                       (0.027)                                            (0.027)         
##                                                                                                                          
## GAD_Total_26                                                                    0.004                                    
##                                                                                (0.004)                                   
##                                                                                                                          
## T3_No_of_Sub                0.032***                  0.024**                  0.032***                  0.024**         
##                             (0.010)                   (0.009)                  (0.010)                   (0.009)         
##                                                                                                                          
## T1_Avg_Marks                0.116***                 0.142***                  0.116***                 0.142***         
##                             (0.030)                   (0.027)                  (0.030)                   (0.027)         
##                                                                                                                          
## T2_Avg_Marks                0.770***                 0.729***                  0.770***                 0.729***         
##                             (0.028)                   (0.024)                  (0.028)                   (0.024)         
##                                                                                                                          
## Constant                   -0.283***                 -0.183**                 -0.294***                 -0.183**         
##                             (0.094)                   (0.085)                  (0.094)                   (0.085)         
##                                                                                                                          
## -------------------------------------------------------------------------------------------------------------------------
## Observations                  981                      1,266                     981                      1,266          
## R2                           0.785                     0.759                    0.785                     0.759          
## Adjusted R2                  0.784                     0.759                    0.784                     0.759          
## Residual Std. Error     0.460 (df = 976)         0.484 (df = 1261)         0.459 (df = 976)         0.484 (df = 1261)    
## F Statistic         891.065*** (df = 4; 976) 994.356*** (df = 4; 1261) 891.959*** (df = 4; 976) 994.356*** (df = 4; 1261)
## =========================================================================================================================
## Note:                                                                                         *p<0.1; **p<0.05; ***p<0.01
##                                                                                             Includes School fixed effects

4.2.6: Checking the relation between Individual Well-being (GQ, PCS, SWEMWBS) scores in 6 month followup (26 week) and T3 scores

# Create scatter plots with regression lines for SWEMWBS_26, PCS_26, and GQ_26
plot_swemwbs_26 <- ggplot(combined, aes(x = SWEMWBS_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SWEMWBS_Total_26 vs T3_Avg_Marks",
       x = "SWEMWBS_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

plot_pcs_26 <- ggplot(combined, aes(x = PCS_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "PCS_Total_26 vs T3_Avg_Marks",
       x = "PCS_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

plot_gq_26 <- ggplot(combined, aes(x = GQ_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "GQ_Total_26 vs T3_Avg_Marks",
       x = "GQ_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side
combined_plots_26 <- plot_swemwbs_26 + plot_pcs_26 + plot_gq_26 +
  plot_layout(guides = "collect")

# Print the combined plots
print(combined_plots_26)

model_swemwbs_t3 <- lm(T3_Avg_Marks ~ SWEMWBS_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_swemwbs_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)

model_pcs_t3 <- lm(T3_Avg_Marks ~ PCS_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_pcs_t3_v1 <- lm(T3_Avg_Marks ~  + trt + T3_No_of_Sub, data = combined, cluster = ~School_ID)

model_gq_t3 <- lm(T3_Avg_Marks ~ GQ_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_gq_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub, data = combined, cluster = ~School_ID)


stargazer(model_swemwbs_t3, model_swemwbs_t3_v1 , model_pcs_t3, model_pcs_t3_v1, type = "text", title = "Table 13: Term 3 Average Scores Regressed on SWEMWBS and PCS 6MFU scores", notes = "Includes School fixed effects")
## 
## Table 13: Term 3 Average Scores Regressed on SWEMWBS and PCS 6MFU scores
## ========================================================================================================================
##                                                             Dependent variable:                                         
##                     ----------------------------------------------------------------------------------------------------
##                                                                 T3_Avg_Marks                                            
##                               (1)                       (2)                      (3)                      (4)           
## ------------------------------------------------------------------------------------------------------------------------
## SWEMWBS_Total_26            0.003**                                                                                     
##                             (0.001)                                                                                     
##                                                                                                                         
## trt                                                   -0.025                                             -0.019         
##                                                       (0.027)                                           (0.051)         
##                                                                                                                         
## PCS_Total_26                                                                    0.003                                   
##                                                                                (0.002)                                  
##                                                                                                                         
## T3_No_of_Sub                0.033***                  0.024**                  0.032***                 0.191***        
##                             (0.010)                   (0.009)                  (0.010)                  (0.015)         
##                                                                                                                         
## T1_Avg_Marks                0.117***                 0.142***                  0.116***                                 
##                             (0.030)                   (0.027)                  (0.030)                                  
##                                                                                                                         
## T2_Avg_Marks                0.764***                 0.729***                  0.766***                                 
##                             (0.028)                   (0.024)                  (0.028)                                  
##                                                                                                                         
## Constant                   -0.344***                 -0.183**                 -0.323***                -1.685***        
##                             (0.097)                   (0.085)                  (0.096)                  (0.135)         
##                                                                                                                         
## ------------------------------------------------------------------------------------------------------------------------
## Observations                  981                      1,266                     981                     1,316          
## R2                           0.786                     0.759                    0.786                    0.113          
## Adjusted R2                  0.785                     0.759                    0.785                    0.112          
## Residual Std. Error     0.459 (df = 976)         0.484 (df = 1261)         0.459 (df = 976)        0.932 (df = 1313)    
## F Statistic         896.032*** (df = 4; 976) 994.356*** (df = 4; 1261) 893.767*** (df = 4; 976) 84.049*** (df = 2; 1313)
## ========================================================================================================================
## Note:                                                                                        *p<0.1; **p<0.05; ***p<0.01
##                                                                                            Includes School fixed effects
stargazer(model_gq_t3, model_gq_t3_v1, type = "text",  title = "Table 14: Term 3 Average Scores Regressed on GQ 6MFU scores", notes = "Includes School fixed effects")
## 
## Table 14: Term 3 Average Scores Regressed on GQ 6MFU scores
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                       T3_Avg_Marks                   
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## GQ_Total_26                 0.002**                                  
##                             (0.001)                                  
##                                                                      
## trt                                                   -0.019         
##                                                      (0.051)         
##                                                                      
## T3_No_of_Sub                0.033***                 0.191***        
##                             (0.010)                  (0.015)         
##                                                                      
## T1_Avg_Marks                0.117***                                 
##                             (0.030)                                  
##                                                                      
## T2_Avg_Marks                0.764***                                 
##                             (0.028)                                  
##                                                                      
## Constant                   -0.355***                -1.685***        
##                             (0.097)                  (0.135)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  981                     1,316          
## R2                           0.786                    0.113          
## Adjusted R2                  0.785                    0.112          
## Residual Std. Error     0.458 (df = 976)        0.932 (df = 1313)    
## F Statistic         898.160*** (df = 4; 976) 84.049*** (df = 2; 1313)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01
##                                         Includes School fixed effects

4.2.7: Checking the relation between Individual Well-being (MSPSS, SES) scores in 6 month followup (26 week) and T3 scores

# Create the first plot: SES_Total_26 vs T3_Avg_Marks
plot_ses_26 <- ggplot(combined, aes(x = SES_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "SES_Total_26 vs T3_Avg_Marks",
       x = "SES_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

# Create the second plot: MSPSS_Total_26 vs T3_Avg_Marks
plot_mpss_26 <- ggplot(combined, aes(x = MSPSS_Total_26, y = T3_Avg_Marks)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "MSPSS_Total_26 vs T3_Avg_Marks",
       x = "MSPSS_Total_26",
       y = "T3_Avg_Marks") +
  theme_minimal()

# Combine the plots side by side with the same y-axis scale
combined_plot_ses_mpss_26 <- plot_ses_26 + plot_mpss_26 + plot_layout(guides = "collect")

# Print the combined plot
print(combined_plot_ses_mpss_26)

model_mspss_t3 <- lm(T3_Avg_Marks ~ MSPSS_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_mspss_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)

model_ses_t3 <- lm(T3_Avg_Marks ~ SES_Total_26 + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)
model_ses_t3_v1 <- lm(T3_Avg_Marks ~ trt + T3_No_of_Sub + T1_Avg_Marks + T2_Avg_Marks, data = combined, cluster = ~School_ID)

stargazer(model_mspss_t3, model_mspss_t3_v1 , model_ses_t3, model_ses_t3_v1, type = "text", title = "Table 15: Term 3 Average Scores Regressed on MSPSS and GQ 6MFU scores", notes = "Includes School fixed effects")
## 
## Table 15: Term 3 Average Scores Regressed on MSPSS and GQ 6MFU scores
## =========================================================================================================================
##                                                              Dependent variable:                                         
##                     -----------------------------------------------------------------------------------------------------
##                                                                 T3_Avg_Marks                                             
##                               (1)                       (2)                      (3)                       (4)           
## -------------------------------------------------------------------------------------------------------------------------
## MSPSS_Total_26              0.002**                                                                                      
##                             (0.001)                                                                                      
##                                                                                                                          
## trt                                                   -0.025                                             -0.025          
##                                                       (0.027)                                            (0.027)         
##                                                                                                                          
## SES_Total_26                                                                    0.001*                                   
##                                                                                (0.0004)                                  
##                                                                                                                          
## T3_No_of_Sub                0.032***                  0.024**                  0.032***                  0.024**         
##                             (0.010)                   (0.009)                  (0.010)                   (0.009)         
##                                                                                                                          
## T1_Avg_Marks                0.119***                 0.142***                  0.117***                 0.142***         
##                             (0.030)                   (0.027)                  (0.030)                   (0.027)         
##                                                                                                                          
## T2_Avg_Marks                0.764***                 0.729***                  0.765***                 0.729***         
##                             (0.028)                   (0.024)                  (0.028)                   (0.024)         
##                                                                                                                          
## Constant                   -0.334***                 -0.183**                 -0.328***                 -0.183**         
##                             (0.096)                   (0.085)                  (0.096)                   (0.085)         
##                                                                                                                          
## -------------------------------------------------------------------------------------------------------------------------
## Observations                  981                      1,266                     981                      1,266          
## R2                           0.786                     0.759                    0.786                     0.759          
## Adjusted R2                  0.785                     0.759                    0.785                     0.759          
## Residual Std. Error     0.458 (df = 976)         0.484 (df = 1261)         0.459 (df = 976)         0.484 (df = 1261)    
## F Statistic         896.070*** (df = 4; 976) 994.356*** (df = 4; 1261) 894.323*** (df = 4; 976) 994.356*** (df = 4; 1261)
## =========================================================================================================================
## Note:                                                                                         *p<0.1; **p<0.05; ***p<0.01
##                                                                                             Includes School fixed effects

4.2.8: Comparison for clinical v/s non clinical samples in endline

clinical <- combined %>% 
  filter(clinical == 1)

nonclinical <- combined %>% 
  filter(clinical == 0)



# Fit the linear models
model_lm_clinical <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub, 
                        data = clinical, cluster = ~School_ID)

model_lm_nonclinical <- lm(T2_Avg_Marks ~ trt + T2_No_of_Sub, 
                           data = nonclinical, cluster = ~School_ID)

# Generate the stargazer output
stargazer(model_lm_clinical, model_lm_nonclinical,
          type = "text",
          title = "Table 16: Correlation of Treatment on Avg Marks of Term 2 for Clinical and Non-Clinical sample",
          covariate.labels = c("Treatment", "Number of Subjects"),
          dep.var.labels = "T2_Avg_Marks",
          column.labels = c("Clinical", "Non-Clinical"),
          no.space = TRUE,
          align = TRUE,
          digits = 3,
          out = "Impact_Treatment_T2.txt")
## 
## Table 16: Correlation of Treatment on Avg Marks of Term 2 for Clinical and Non-Clinical sample
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                           T2                       
##                            Clinical              Non-Clinical      
##                               (1)                     (2)          
## -------------------------------------------------------------------
## Treatment                    0.080                   0.002         
##                             (0.069)                 (0.085)        
## Number of Subjects         0.138***                0.129***        
##                             (0.019)                 (0.023)        
## Constant                   -1.400***               -1.096***       
##                             (0.183)                 (0.232)        
## -------------------------------------------------------------------
## Observations                  790                     556          
## R2                           0.066                   0.053         
## Adjusted R2                  0.064                   0.050         
## Residual Std. Error    0.967 (df = 787)        1.003 (df = 553)    
## F Statistic         27.882*** (df = 2; 787) 15.486*** (df = 2; 553)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

4.2.9: Regression of change in marks from T1 to T2 and change in PHQ/GAD scores from week 0 to week 8

combined <- combined %>% 
  mutate(phq_score_diff = PHQ_Total_8 - PHQ_Total_0,
         gad_score_diff = GAD_Total_8 - GAD_Total_0,
         avg_score_diff = T2_Avg_Marks - T1_Avg_Marks)

regress_change_phq <- lm(avg_score_diff ~ phq_score_diff, data = combined, cluster = ~School_ID)
regress_change_phq_trt <- lm(avg_score_diff ~ trt, data = combined, cluster = ~School_ID)

regress_change_gad <- lm(avg_score_diff ~ gad_score_diff, data = combined, cluster = ~School_ID)
regress_change_gad_trt <- lm(avg_score_diff ~ trt, data = combined, cluster = ~School_ID)

# Create stargazer table
stargazer(regress_change_phq, regress_change_phq_trt, 
          title = "Table 17: Difference in Avg Score Between T1 and T2 Regressed on PHQ Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T1 and T2 Average Score Difference",
          covariate.labels = c("PHQ Score diff (W0 to W8)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ PHQ", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 17: Difference in Avg Score Between T1 and T2 Regressed on PHQ Score Difference and Treatment
## ===================================================================
##                                         Dependent variable:        
##                                 -----------------------------------
##                                 T1 and T2 Average Score Difference 
##                                 Score diff ~ PHQ  Score diff ~ trt 
##                                        (1)               (2)       
## -------------------------------------------------------------------
## PHQ Score diff (W0 to W8)           0.007***                       
##                                      (0.002)                       
## Treatment (Trt)                                        -0.028      
##                                                        (0.034)     
## Constant                            -0.033**           -0.025      
##                                      (0.017)           (0.024)     
## -------------------------------------------------------------------
## Observations                          1,330             1,330      
## R2                                    0.010             0.001      
## Adjusted R2                           0.010            -0.0002     
## Residual Std. Error (df = 1328)       0.608             0.611      
## F Statistic (df = 1; 1328)          13.984***           0.720      
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
stargazer(regress_change_gad, regress_change_gad_trt, 
          title = "Table 18: Difference in Avg Score Between T1 and T2 Regressed on GAD Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T1 and T2 Average Score Difference",
          covariate.labels = c("GAD Score diff (W0 to W8)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ GAD", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 18: Difference in Avg Score Between T1 and T2 Regressed on GAD Score Difference and Treatment
## ===================================================================
##                                         Dependent variable:        
##                                 -----------------------------------
##                                 T1 and T2 Average Score Difference 
##                                 Score diff ~ GAD  Score diff ~ trt 
##                                        (1)               (2)       
## -------------------------------------------------------------------
## GAD Score diff (W0 to W8)           0.006***                       
##                                      (0.002)                       
## Treatment (Trt)                                        -0.028      
##                                                        (0.034)     
## Constant                            -0.034**           -0.025      
##                                      (0.017)           (0.024)     
## -------------------------------------------------------------------
## Observations                          1,330             1,330      
## R2                                    0.006             0.001      
## Adjusted R2                           0.005            -0.0002     
## Residual Std. Error (df = 1328)       0.610             0.611      
## F Statistic (df = 1; 1328)          8.089***            0.720      
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

4.2.10: Regression of change in marks from T1 to T3 and change in PHQ/GAD scores from week 0 to week 26

combined <- combined %>% 
  mutate(phq_score_diff_1 = PHQ_Total_26 - PHQ_Total_0,
         gad_score_diff_1 = GAD_Total_26 - GAD_Total_0,
         avg_score_diff_1 = T3_Avg_Marks - T1_Avg_Marks)

regress_change_phq_1 <- lm(avg_score_diff_1 ~ phq_score_diff_1, data = combined, cluster = ~School_ID)
regress_change_phq_trt_1 <- lm(avg_score_diff_1 ~ trt, data = combined, cluster = ~School_ID)

regress_change_gad_1 <- lm(avg_score_diff_1 ~ gad_score_diff_1, data = combined, cluster = ~School_ID)
regress_change_gad_trt_1 <- lm(avg_score_diff_1 ~ trt, data = combined, cluster = ~School_ID)


# Create stargazer table
stargazer(regress_change_phq_1, regress_change_phq_trt_1, 
          title = "Table 19: Difference in Avg Score Between T1 and T3 Regressed on PHQ Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T2 and T3 Average Score Difference",
          covariate.labels = c("PHQ Score diff (W0 to W26)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ PHQ", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 19: Difference in Avg Score Between T1 and T3 Regressed on PHQ Score Difference and Treatment
## ===================================================================
##                                      Dependent variable:           
##                            ----------------------------------------
##                               T2 and T3 Average Score Difference   
##                             Score diff ~ PHQ     Score diff ~ trt  
##                                    (1)                 (2)         
## -------------------------------------------------------------------
## PHQ Score diff (W0 to W26)        0.001                            
##                                  (0.003)                           
## Treatment (Trt)                                       -0.043       
##                                                      (0.038)       
## Constant                         -0.007               0.0003       
##                                  (0.045)             (0.027)       
## -------------------------------------------------------------------
## Observations                       991                1,300        
## R2                               0.0002               0.001        
## Adjusted R2                      -0.001               0.0002       
## Residual Std. Error         0.657 (df = 989)    0.689 (df = 1298)  
## F Statistic                0.212 (df = 1; 989) 1.255 (df = 1; 1298)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
stargazer(regress_change_gad_1, regress_change_gad_trt_1, 
          title = "Table 20: Difference in Avg Score Between T1 and T3 Regressed on GAD Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T2 and T3 Average Score Difference",
          covariate.labels = c("GAD Score diff (W0 to W26)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ GAD", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 20: Difference in Avg Score Between T1 and T3 Regressed on GAD Score Difference and Treatment
## ===================================================================
##                                      Dependent variable:           
##                            ----------------------------------------
##                               T2 and T3 Average Score Difference   
##                             Score diff ~ GAD     Score diff ~ trt  
##                                    (1)                 (2)         
## -------------------------------------------------------------------
## GAD Score diff (W0 to W26)        0.001                            
##                                  (0.003)                           
## Treatment (Trt)                                       -0.043       
##                                                      (0.038)       
## Constant                         -0.007               0.0003       
##                                  (0.042)             (0.027)       
## -------------------------------------------------------------------
## Observations                       991                1,300        
## R2                               0.0003               0.001        
## Adjusted R2                      -0.001               0.0002       
## Residual Std. Error         0.657 (df = 989)    0.689 (df = 1298)  
## F Statistic                0.250 (df = 1; 989) 1.255 (df = 1; 1298)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

4.2.11: Regression of change in marks from T2 to T3 and change in PHQ/GAD scores from week 8 to week 26

combined <- combined %>% 
  mutate(phq_score_diff_2 = PHQ_Total_26 - PHQ_Total_8,
         gad_score_diff_2 = GAD_Total_26 - GAD_Total_8,
         avg_score_diff_2 = T3_Avg_Marks - T2_Avg_Marks)

regress_change_phq_2 <- lm(avg_score_diff_2 ~ phq_score_diff_2, data = combined, cluster = ~School_ID)
regress_change_phq_trt_2 <- lm(avg_score_diff_2 ~ trt, data = combined, cluster = ~School_ID)


regress_change_gad_2 <- lm(avg_score_diff_2 ~ gad_score_diff_2, data = combined, cluster = ~School_ID)
regress_change_gad_trt_2 <- lm(avg_score_diff_2 ~ trt, data = combined, cluster = ~School_ID)


# Create stargazer table

stargazer(regress_change_phq_2, regress_change_phq_trt_2, 
          title = "Table 21: Difference in Avg Score Between T2 and T3 Regressed on PHQ Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T2 and T3 Average Score Difference",
          covariate.labels = c("PHQ Score diff (W8 to W26)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ PHQ", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 21: Difference in Avg Score Between T2 and T3 Regressed on PHQ Score Difference and Treatment
## ===================================================================
##                                      Dependent variable:           
##                            ----------------------------------------
##                               T2 and T3 Average Score Difference   
##                             Score diff ~ PHQ     Score diff ~ trt  
##                                    (1)                 (2)         
## -------------------------------------------------------------------
## PHQ Score diff (W8 to W26)       -0.0003                           
##                                  (0.002)                           
## Treatment (Trt)                                       -0.013       
##                                                      (0.029)       
## Constant                         -0.004               0.030        
##                                  (0.027)             (0.020)       
## -------------------------------------------------------------------
## Observations                       990                1,280        
## R2                               0.00003              0.0001       
## Adjusted R2                      -0.001               -0.001       
## Residual Std. Error         0.487 (df = 988)    0.517 (df = 1278)  
## F Statistic                0.026 (df = 1; 988) 0.189 (df = 1; 1278)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
stargazer(regress_change_gad_2, regress_change_gad_trt_2, 
          title = "Table 22: Difference in Avg Score Between T2 and T3 Regressed on GAD Score Difference and Treatment", 
          type = "text",
          dep.var.labels = "T2 and T3 Average Score Difference",
          covariate.labels = c("GAD Score diff (W8 to W26)", 
                               "Treatment (Trt)"),
          column.labels = c("Score diff ~ GAD", "Score diff ~ trt"),
          star.cutoffs = c(0.1, 0.05, 0.01),  # Adjust significance levels for stars
          no.space = TRUE,
          out = "regression_results.txt")
## 
## Table 22: Difference in Avg Score Between T2 and T3 Regressed on GAD Score Difference and Treatment
## ===================================================================
##                                      Dependent variable:           
##                            ----------------------------------------
##                               T2 and T3 Average Score Difference   
##                             Score diff ~ GAD     Score diff ~ trt  
##                                    (1)                 (2)         
## -------------------------------------------------------------------
## GAD Score diff (W8 to W26)       0.0005                            
##                                  (0.002)                           
## Treatment (Trt)                                       -0.013       
##                                                      (0.029)       
## Constant                          0.005               0.030        
##                                  (0.026)             (0.020)       
## -------------------------------------------------------------------
## Observations                       990                1,280        
## R2                               0.0001               0.0001       
## Adjusted R2                      -0.001               -0.001       
## Residual Std. Error         0.487 (df = 988)    0.517 (df = 1278)  
## F Statistic                0.061 (df = 1; 988) 0.189 (df = 1; 1278)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

4.2.12: Subject wise analysis - correlation between term 2 scores and PHQ score in week 8

# List of columns to check for missing values
columns_to_check <- c("T1_ENG", "T1_KIS", "T1_MAT", "T1_BIO", "T1_PHY", "T1_CHE", 
                      "T1_HIS", "T1_GEO", "T1_CRE", "T1_HSC", "T1_AGR", "T1_BST", 
                      "T1_CS", "T1_COM", "T1_MW", "T1_DD", "T1_FRE", "T1_KSL", 
                      "T2_ENG", "T2_KIS", "T2_MAT", "T2_BIO", "T2_PHY", "T2_CHE", 
                      "T2_HIS", "T2_GEO", "T2_CRE", "T2_HSC", "T2_AGR", "T2_BST", 
                      "T2_CS", "T2_COM", "T2_MW", "T2_DD", "T2_FRE", "T2_KSL", 
                      "T3_ENG", "T3_KIS", "T3_MAT", "T3_BIO", "T3_PHY", "T3_CHE", 
                      "T3_HIS", "T3_GEO", "T3_CRE", "T3_HSC", "T3_AGR", "T3_BST", 
                      "T3_COM", "T3_MW", "T3_DD", "T3_FRE", "T3_KSL")

# Calculate the number of missing values for each column
missing_values <- sapply(columns_to_check, function(col) sum(is.na(combined[[col]])))


eng_reg <- lm(T2_ENG ~ PHQ_Total_8 + trt + T1_ENG, data = combined)
kis_reg <- lm(T2_KIS ~ PHQ_Total_8 + trt + T1_KIS, data = combined)
bio_reg <- lm(T2_BIO ~ PHQ_Total_8 + trt + T1_BIO, data = combined)
mat_reg <- lm(T2_MAT ~ PHQ_Total_8 + trt + T1_MAT, data = combined)
che_reg <- lm(T2_CHE ~ PHQ_Total_8 + trt + T1_CHE, data = combined)
cre_reg <- lm(T2_CRE ~ PHQ_Total_8 + trt + T1_CRE, data = combined)
phy_reg <- lm(T2_PHY ~ PHQ_Total_8 + trt + T1_PHY, data = combined)


# Calculate clustered standard errors for each model
eng_se <- sqrt(diag(vcovHC(eng_reg, type = "HC0", cluster = ~School_ID)))
kis_se <- sqrt(diag(vcovHC(kis_reg, type = "HC0", cluster = ~School_ID)))
bio_se <- sqrt(diag(vcovHC(bio_reg, type = "HC0", cluster = ~School_ID)))
mat_se <- sqrt(diag(vcovHC(mat_reg, type = "HC0", cluster = ~School_ID)))
che_se <- sqrt(diag(vcovHC(che_reg, type = "HC0", cluster = ~School_ID)))
phy_se <- sqrt(diag(vcovHC(phy_reg, type = "HC0", cluster = ~School_ID)))


stargazer(eng_reg, kis_reg, bio_reg, mat_reg, che_reg, phy_reg,
          se = list(eng_se, kis_se, bio_se, mat_se, che_se, phy_se),
          type = "text",
          title = "Table 23: Subject wise analysis of correlation between term 2 scores and PHQ total scores in week 8",
          no.space = TRUE, digits = 2,                  
          omit.stat = c("f", "ser"))   
## 
## Table 23: Subject wise analysis of correlation between term 2 scores and PHQ total scores in week 8
## ===============================================================
##                             Dependent variable:                
##              --------------------------------------------------
##              T2_ENG  T2_KIS   T2_BIO   T2_MAT   T2_CHE  T2_PHY 
##                (1)     (2)     (3)      (4)      (5)      (6)  
## ---------------------------------------------------------------
## PHQ_Total_8  0.01**  0.01*** 0.02***  0.01***  0.01***   0.003 
##              (0.003) (0.003) (0.003)  (0.003)  (0.003)  (0.005)
## trt           0.04   0.08**    0.02    0.09*    -0.01    0.06  
##              (0.05)  (0.04)   (0.05)   (0.05)   (0.04)  (0.06) 
## T1_ENG       0.41***                                           
##              (0.03)                                            
## T1_KIS               0.64***                                   
##                      (0.02)                                    
## T1_BIO                       0.67***                           
##                               (0.03)                           
## T1_MAT                                0.73***                  
##                                        (0.04)                  
## T1_CHE                                         0.68***         
##                                                 (0.02)         
## T1_PHY                                                  0.73***
##                                                         (0.04) 
## Constant     -0.12** -0.14** -0.26*** -0.15*** -0.16***  -0.12 
##              (0.06)  (0.05)   (0.06)   (0.06)   (0.05)  (0.08) 
## ---------------------------------------------------------------
## Observations  1,285   1,298   1,112    1,296    1,300     888  
## R2            0.18    0.42     0.43     0.41     0.47    0.43  
## Adjusted R2   0.18    0.42     0.43     0.41     0.47    0.43  
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01