Data Description

Project 538 provided over thirty-thousand rows of data from the 2016, 2020, 2024 presidential elections. Twenty-thousand rows related to specific states were excluded.

Key variables included:

Methods

Load assessment data

Load spending data

New names:
• `` -> `...73`
• `` -> `...74`
• `` -> `...75`
• `` -> `...76`
• `` -> `...77`
• `` -> `...78`
• `` -> `...79`
• `` -> `...80`
• `` -> `...81`
• `` -> `...82`
• `` -> `...83`
• `` -> `...84`
• `` -> `...85`
• `` -> `...86`
• `` -> `...87`
• `` -> `...88`
• `` -> `...89`
• `` -> `...90`
• `` -> `...91`
• `` -> `...92`
• `` -> `...93`
• `` -> `...94`
• `` -> `...95`
• `` -> `...96`
• `` -> `...97`
• `` -> `...98`
• `` -> `...99`
• `` -> `...100`
• `` -> `...101`
• `` -> `...102`
• `` -> `...103`
• `` -> `...104`
• `` -> `...105`
• `` -> `...106`
• `` -> `...107`
• `` -> `...108`
• `` -> `...109`
• `` -> `...110`
• `` -> `...111`
• `` -> `...112`
• `` -> `...113`
• `` -> `...114`
• `` -> `...115`
• `` -> `...116`
• `` -> `...117`
• `` -> `...118`
• `` -> `...119`
• `` -> `...120`
• `` -> `...121`
• `` -> `...122`
• `` -> `...123`
• `` -> `...124`
• `` -> `...125`
• `` -> `...126`
• `` -> `...127`
• `` -> `...128`
• `` -> `...129`
• `` -> `...130`
• `` -> `...131`
• `` -> `...132`
• `` -> `...133`
• `` -> `...134`
• `` -> `...135`
• `` -> `...136`
• `` -> `...137`
• `` -> `...138`
• `` -> `...139`
• `` -> `...140`
• `` -> `...141`
• `` -> `...142`
• `` -> `...143`
• `` -> `...144`
• `` -> `...145`
• `` -> `...146`
• `` -> `...147`
• `` -> `...148`
• `` -> `...149`
• `` -> `...150`
• `` -> `...151`
• `` -> `...152`
• `` -> `...153`
• `` -> `...154`
• `` -> `...155`
• `` -> `...156`
• `` -> `...157`
• `` -> `...158`
• `` -> `...159`
• `` -> `...160`
• `` -> `...161`
• `` -> `...162`
• `` -> `...163`
• `` -> `...164`
• `` -> `...165`
• `` -> `...166`
• `` -> `...167`
• `` -> `...168`
• `` -> `...169`
• `` -> `...170`
• `` -> `...171`
• `` -> `...172`
• `` -> `...173`
• `` -> `...174`
• `` -> `...175`
• `` -> `...176`
• `` -> `...177`
• `` -> `...178`
• `` -> `...179`
• `` -> `...180`
• `` -> `...181`
• `` -> `...182`
• `` -> `...183`
• `` -> `...184`

Load demographic data

Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 62 Columns: 5
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (3): County, FIPS, Rank within US (of 3143 counties)
dbl (2): Value (Percent), People (Unemployed)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Adding in new data

Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 64 Columns: 5
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3143 counties)
dbl (3): FIPS, Value (Percent), People (Education: Less Than 9th...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 64 Columns: 5
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3143 counties)
dbl (3): FIPS, Value (Percent), People(Education: Less Than High...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 64 Columns: 5
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3143 counties)
dbl (3): FIPS, Value (Percent), People (Education: At Least Bach...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 65 Columns: 4
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3142 counties)
dbl (1): FIPS
num (1): Value (Dollars)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 65 Columns: 4
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3142 counties)
dbl (1): FIPS
num (1): Value (Dollars)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: One or more parsing issues, call `problems()` on your data frame
for details, e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 64 Columns: 4
── Column specification ────────────────────────────────────────────
Delimiter: ","
chr (2): County, Rank within US (of 3135 counties)
dbl (1): FIPS
num (1): Value (Dollars)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Joined data

Correlations

Create test/training data

Linear Regression Model


Call:
lm(formula = proficiency ~ enroll + tlocrev + ppcstot + unemployed + 
    at_least_bachelor_education + household_income, data = t_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7360 -2.9337 -0.2146  2.3756  8.5571 

Coefficients:
                               Estimate  Std. Error t value Pr(>|t|)   
(Intercept)                  8.59405726 10.62009737   0.809  0.42418   
enroll                       0.00045645  0.00052416   0.871  0.39014   
tlocrev                     -0.00001838  0.00008596  -0.214  0.83201   
ppcstot                      0.00070458  0.00044490   1.584  0.12281   
unemployed                  -0.19554527  0.27793175  -0.704  0.48663   
at_least_bachelor_education  0.39826973  0.13013574   3.060  0.00437 **
household_income             0.00007607  0.00012306   0.618  0.54074   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.912 on 33 degrees of freedom
Multiple R-squared:  0.5629,    Adjusted R-squared:  0.4834 
F-statistic: 7.082 on 6 and 33 DF,  p-value: 0.00006584

Root Mean Squared Error (RMSE): 4.939345 
R-squared: 0.3152698 
Error in usmap::map_with_data(data, values = values, include = include,  : 
  `data` must be a data.frame containing either a `state` or `fips` column.

pca

k-means

Decision Tree

Cross-fold validation

Bootstrapping

We don’t have equal number of spam and non-spam messages. We can use bootstrapping to create a more balanced dataset.

Note that we re-used the function from earlier to return a t_train and a t_test using our bootstrapped sample.

---
title: "R Project 3"
author: "Josh Freeman"
date: "11/22/2024"
output: html_notebook
---

## Thesis: The Project 538 prediction models overestimate the popular vote support for third-party candidates, resulting in an underestimation of Republican popular vote support and creating the false impression that the race is less competitive than it truly is.

This analysis uses popular vote data from 2016 and 2020 to predict the popular vote per candidate for the 2024 election. Its data includes the election year, the candidate, the grade of the pollster, and the popular vote prediction per pollster. If Project 538 pollster data over predicts the third-party candidate and under predicts the republican and democratic candidate, I expect the actual popular vote to have less votes for the third-party candidate and more votes for republican and democratic candidates. Proportionally, though, the republican candidate should get more of the third-party votes than the democratic candidate. However, if the predictions for the republican, democratic, and third-party candidate are accurate, the Project 538 pollster data is accurate as is.


## Data Description

Project 538 provided over thirty-thousand rows of data from the 2016, 2020, 2024 presidential elections. Twenty-thousand rows related to specific states were excluded.

Key variables included:

- **Candidate**: The presidential candidate termed either by republican, democrat, or other.
- **Average Error**: The error of actual percentage minus predicted percentage on average per candidate.
- **Predicted Percentage**: The percentage of the popular vote per candidate that was predicted by the 538 pollster data.
- **Expected Percentage**: The percentage of the popular vote per candidate that is expected based on the 538 pollster data predictions.
- **Actual Percentage**: The percentage of the popular vote per candidate that actually occurred for previous elections.
- **Average Percentage**: The average percentage of the popular vote per candidate predicted with and without a model.
- **Frequency**: The number of times a value occurs.
- **Residuals**: The actual percentage result per candidate minus the predicted percentage result per candidate.

## Methods

### Load assessment data

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(caret)
library(rpart)
library(readxl)
assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_science <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'db3:db7312', 
                           col_names = c('science_proficiency'),
                           na = '**')

t_assess <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  janitor::clean_names() %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = science_proficiency)  

view(t_assess)
```

### Load spending data


```{r echo=FALSE, message=FALSE, warning=FALSE}
spending_path <- './us census ed spending/elsec22t.xls'

t_spending_raw <- read_excel(path = spending_path,
                           sheet = 'elsec22t',
                           range = 'a1:gb14106') %>% 
  janitor::clean_names()


cooperates <- c('MOUNTAIN STATE EDUCATIONAL SERVICES COOPERATIVE',
                'EASTERN PANHANDLE INSTRUCTIONAL COOPERATIVE',
                'SOUTHERN EDUCATIONAL SERVICES COOPERATIVE')

t_spending <- t_spending_raw %>% 
  filter(state == 49) %>% 
  filter(!name %in% cooperates) %>% 
  select(name, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot) %>% 
  mutate(county = str_to_title(str_split_i(name, ' ',1)),
         county = ifelse(county == 'Mc', 'McDowell', county)) %>% 
  mutate(county = paste0(county, " County"))


view(t_spending)
```

### Load demographic data

```{r echo=FALSE, message=FALSE, warning=FALSE}

t_demographics <- read_csv('./demographics/unemployed.csv', 
                            skip = 4,
                            na = 'N/A') %>%
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  select(county, value_percent) %>%
  rename(unemployed = value_percent)

view(t_demographics)
```

### Adding in new data

```{r echo=FALSE, message=FALSE, warning=FALSE}
# Adding in math and reading proficiency

library(tidyverse)
library(caret)
library(rpart)
library(readxl)

assessment_path <- './wv ed student achievement/Historical_AssessmentResults_SY15-to-SY21.xlsx'


t_assess_raw_school <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'b2:f7312')


t_assess_raw_math <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'at3:at7312', 
                           col_names = c('math_proficiency'),
                           na = '**')

t_assess_raw_reading <- read_excel(path = assessment_path,
                           sheet = 'SY21 School & District',
                           range = 'ch3:ch7312', 
                           col_names = c('reading_proficiency'),
                           na = '**')

t_assess <- t_assess_raw_school %>%
  bind_cols(t_assess_raw_science) %>% 
  bind_cols(t_assess_raw_math) %>% 
  bind_cols(t_assess_raw_reading) %>% 
  janitor::clean_names() %>% 
  filter(school == 999) %>% 
  filter(population_group == 'Total Population') %>% 
  filter(county != 'Statewide') %>% 
  mutate(proficiency = (science_proficiency + math_proficiency + reading_proficiency) / 3) %>% 
  mutate(county = paste0(county, " County"))

view(t_assess)

# Adding in educational data

## less than 9th grade

education_9th_path <- './wv education data/WVEducationLessThan9thGrade.csv'

t_education_9th <- read_csv(education_9th_path,
skip = 5,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  select(county, value_percent) %>%
  rename(less_than_9th_grade_education = value_percent)

view(t_education_9th)

## less than 12th grade

education_high_school_path <- './wv education data/WVEducationLessThanHighSchool.csv'

t_education_high_school <- read_csv(education_high_school_path,
skip = 5,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  select(county, value_percent) %>%
  rename(less_than_high_school_grade_education = value_percent)

view(t_education_high_school)

## at least a bachelors

education_bachelor_path <- './wv education data/WVEducationAtLeastBachelorsDegree.csv'

t_education_bachelor <- read_csv(education_bachelor_path,
skip = 5,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_percent) ) %>% 
  select(county, value_percent) %>%
  rename(at_least_bachelor_education = value_percent)

view(t_education_bachelor)

# Adding in income data

## family income

family_income_path <- './wv income data/WVFamilyIncome.csv'

t_family_income <- read_csv(family_income_path,
skip = 4,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_dollars) ) %>% 
  select(county, value_dollars) %>%
  rename(family_income = value_dollars)

view(t_family_income)

## household income

household_income_path <- './wv income data/WVHouseholdIncome.csv'

t_household_income <- read_csv(household_income_path,
skip = 4,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_dollars) ) %>% 
  select(county, value_dollars) %>%
  rename(household_income = value_dollars)

view(t_household_income)

## vast majority income

vast_majority_income_path <- './wv income data/WVVastMajorityIncome.csv'

t_vast_majority_income <- read_csv(vast_majority_income_path,
skip = 5,
na = "N/A") %>% 
  janitor::clean_names() %>% 
  filter(county != 'West Virginia',
         county != 'United States',
         !is.na(value_dollars) ) %>% 
  select(county, value_dollars) %>%
  rename(vast_majority_income = value_dollars)

view(t_vast_majority_income)

```

### Joined data

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr)
library(purrr)

# List of all table names
tables <- list(
  t_assess,
  t_demographics,
  t_education_9th,
  t_education_bachelor,
  t_education_high_school,
  t_family_income,
  t_household_income,
  t_spending,
  t_vast_majority_income
)

# Join all tables on the 'county' column
t <- reduce(tables, full_join, by = "county")

t <- t %>% 
  select(-school, -school_name, -population_group, -subgroup, -name)
t <- t %>% 
  select(county, enroll, tfedrev, tstrev, tlocrev, totalexp, ppcstot, unemployed, less_than_9th_grade_education, less_than_high_school_grade_education, at_least_bachelor_education, vast_majority_income, household_income, family_income, proficiency)

view(t)
```


### Correlations

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(ggcorrplot)
c <- cor(t %>% select(where(is.numeric)))

ggcorrplot(c, 
           hc.order = TRUE, 
           type = "lower", 
           lab = FALSE, 
           lab_size = 100, 
           method = "circle")

```

### Create test/training data

```{undefined echo=FALSE, message=FALSE, warning=FALSE}

# Create training/test split
sample <- sample(c(1, 0), size = nrow(t), replace = TRUE, prob = c(0.7, 0.3))
t_train <- t[sample == 1, ]
t_test <- t[sample == 0, ]

```

### Linear Regression Model

```{undefined echo=FALSE, message=FALSE, warning=FALSE}
# Ensure county is a factor in both datasets
t_train$county <- as.factor(t_train$county)
t_test$county <- factor(t_test$county, levels = levels(t_train$county))

# Build the model excluding county
model <- lm(proficiency ~ enroll + tlocrev + ppcstot + unemployed + at_least_bachelor_education + household_income, data = t_train)
summary(model)

# Generate predictions
predicted_proficiency <- predict(model, newdata = t_test)

# Add predictions and residuals to the test dataset
t_test <- t_test %>%
  mutate(prediction = predicted_proficiency, 
         residuals = proficiency - prediction)

# Handle NAs in residuals (if any)
t_test <- t_test %>% filter(!is.na(prediction), !is.na(proficiency))

# Evaluate the model
options(scipen = 999)

# Plot residuals
hist(t_test$residuals, 
     main = "Histogram of Residuals", 
     xlab = "Residuals", 
     col = "lightblue", 
     border = "black")

# Calculate RMSE
rmse <- sqrt(mean((t_test$proficiency - t_test$prediction)^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")

# Calculate R-squared
SST <- sum((t_test$proficiency - mean(t_test$proficiency))^2)
SSE <- sum((t_test$proficiency - t_test$prediction)^2)
r_squared <- 1 - (SSE / SST)
cat("R-squared:", r_squared, "\n")

```


```{undefined echo=FALSE, message=FALSE, warning=FALSE}
library(usmap)

plot_usmap(data = t, 
           values = "proficiency", 
           include = 'West Virginia') + 
  scale_fill_continuous(name = "Proficiency",
                        low = 'red',
                        high = 'blue') + 
  theme(legend.position = "right") +
  labs('Proficiency')



```


### pca

```{undefined echo=FALSE, message=FALSE, warning=FALSE}

t_ingredients <- t %>% select(where(is.numeric)) %>%  select(starts_with('i '))

pca_results <- prcomp(t_ingredients, scale = TRUE, center = TRUE, rank = 2)

# Show new components
summary(pca_results)
print(pca_results)

ggplot(data = as.data.frame(pca_results$rotation), 
       aes(x = PC1, y = PC2, label = rownames(pca_results$rotation))) + 
  geom_text() + 
  theme_minimal()
```


```{undefined echo=FALSE, message=FALSE, warning=FALSE}
library(plotly)

# add title and category back
# add PC1 and PC2 back to t_ingredients as new variables
t_ingredients <- t_ingredients %>%  
  bind_cols(title = t$title) %>% 
  bind_cols(first_category = t$first_category) %>% 
  bind_cols(as.data.frame(pca_results$x))

# Pick a random sample of 1000 items for the plot
set.seed(123)
t_ingredients_sample <- t_ingredients[sample(nrow(t_ingredients), 1000),]

# Plot some items with labels 
plot <- ggplot(data = t_ingredients_sample,,
       mapping = aes( x = jitter(PC1), y = jitter(PC2), text = title, color = first_category) ) + 
  geom_point( ) + 
  theme_minimal() 


ggplotly(plot, tooltip = 'text')

```



### k-means

```{undefined echo=FALSE, message=FALSE, warning=FALSE}

t_kmeans <- t %>% select(where(is.numeric)) %>%  select(starts_with('i ')) 

kresult <- kmeans(
  x = t_kmeans,
  centers = 3,
  nstart = 10
)

# Add the new kmeans2 column, using kresult's cluster vector.
# Use factor to turn the cluster from a number to a categorical variable.
t_kmeans <- t_kmeans %>% 
  mutate(kmeans = factor(kresult$cluster)) %>% 
  bind_cols(title = t$title) %>% 
  bind_cols(first_category = t$first_category)  %>% 
  bind_cols(bind_cols(as.data.frame(pca_results$x)))


# Pick a random sample of 1000 items for the plot
set.seed(123)
t_kmeans_sample <- t_kmeans[sample(nrow(t_kmeans), 1000),]

# Add the new kmeans2 column
plot <- ggplot() + 
  geom_point(
    data = t_kmeans_sample, 
    mapping = aes( x = jitter(PC1), y = jitter(PC2), text = title, color = kmeans) 
  ) +
  labs( 
    title = 'Iris dataset: k-means clustering',
    subtitle = paste('Error:', round(kresult$tot.withinss,2))
    )

ggplotly(plot, tooltip = 'text')

```



### Decision Tree

```{undefined echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(rpart)
library(rpart.plot)

#sample_vector <- sample(x = c(0, 1), size=nrow(t), replace = TRUE, prob = c(0.7, 0.3))

t_dt <- t %>% 
  select(first_category, starts_with('i '))  
#  mutate(sample = sample_vector) 

##t_dt_test <- filter(t_dt, sample == 0)
#t_dt_train <- filter(t_dt, sample == 1)

m <- rpart(formula = first_category ~ .,
           data = t_dt,
           minsplit = 10,
           minbucket = 10,
           method = "class")

# Show results of model
rpart.plot(m)

```


```{undefined echo=FALSE, message=FALSE, warning=FALSE}

# Create the predicted value and add it to our tibble
predicted <- predict(m, t_dt, type = 'class')
t_dt <- mutate( t_dt, 
             predicted = predicted,
             is_correct = predicted ==first_category)

# Percentage correct
print(mean(t_dt$is_correct))

# Show a confusion matrix
# Predicted values are in upper case.
table(str_to_upper(t_dt$predicted), t_dt$first_category)
prop.table(table(str_to_upper(t_dt$predicted), t_dt$first_category), 2)
```


```{undefined echo=FALSE, message=FALSE, warning=FALSE}

# Refactor code to evaluate a NN

evaluate_model <- function(t_data, vector_outcomes, print_outcomes = FALSE) {
    
  vector_predicted <- predict(n, t_data)
  vector_predicted <- round(vector_predicted, digits = 0)
  
  accuracy <- mean(vector_predicted == vector_outcomes)

  if(print_outcomes) {
    print(table(vector_predicted, vector_outcomes))
  }
  
  return(accuracy)
}

results <- c(evaluate_model(t_train, t_train$is_spam),
              evaluate_model(t_test, t_test$is_spam))

print(results)
```

### Cross-fold validation

```{r echo=FALSE, message=FALSE, warning=FALSE}

# Create a function for creating a sample

create_split <- function(t, train_size = 0.5) {
  sample <- sample(c(1, 0), 
                   size = nrow(t), 
                   replace = TRUE, 
                   prob = c(train_size, 1 - train_size))
  
  t_train <- t[sample == 1, ]
  t_test <- t[sample == 0, ]
  
  return(list(t_train = t_train, t_test = t_test))
}


# Create vectors to store results
training_results <- c()
test_results <- c()

# Run the model multiple times and record the results
for (i in 1:50) {
  split <- create_split(t, 0.70)
  
  n <- neuralnet( is_spam ~ .,
                  data = split$t_train,
                  hidden = 1,
                  linear.output = FALSE
  )
  
  training_results <- c(training_results, evaluate_model(t_train, t_train$is_spam))
  test_results <- c(test_results, evaluate_model(t_test, t_test$is_spam))
  
}

print(test_results)
hist(test_results)
```


### Bootstrapping

We don't have equal number of spam and non-spam messages. We can use bootstrapping to create a more balanced dataset.

Note that we re-used the function from earlier to return a t_train and
a t_test using our bootstrapped sample.

```{undefined echo=FALSE, message=FALSE, warning=FALSE}

# Create a function for creating a sample
create_bootstrapped_balanced_split <- function(t, 
                                               number_of_samples_per_group = 1000) {

    
  # Create tibbles of just spam and non-spam
  t_spam <- t[t$is_spam == 1, ]
  t_ham <- t[t$is_spam == 0, ]
  
  
  # Build spam sample, grabbing indexes with replacement to the length of t_spam
  sample_spam <- sample(1:nrow(t_spam),
                   size = number_of_samples_per_group,
                   replace = TRUE)
  t_sample_spam <- t_spam[sample_spam, ]
  t_unsampled_spam <- t_spam[-sample_spam, ]
  
  # Test to make sure that we have at least 100 unsampled items
  if(nrow(t_unsampled_spam) < 100) {
    stop("Not enough unsampled spam")
  }
  
  
  # build ham sample
    sample_ham <- sample(1:nrow(t_ham),
                   size = number_of_samples_per_group,
                   replace = TRUE)
  t_sample_ham <- t_ham[sample_ham, ]
  t_unsampled_ham <- t_ham[-sample_ham, ]
  
  # Test to make sure that we have at least 100 unsampled items
  if(nrow(t_unsampled_ham) < 100) {
    stop("Not enough unsampled ham")
  }
  

  
  # append the two tibbles into a cohesive sample
  t_train <- bind_rows(t_sample_spam, t_sample_ham)
  t_test <- bind_rows(t_unsampled_spam, t_unsampled_ham)
  
  return(list(t_train = t_train, t_test = t_test))
}

# Evaluate the model using bootstrapping

# Create vectors to store results
training_results <- c()
test_results <- c()

# Run the model multiple times and record the results
for (i in 1:10) {
  split <- create_bootstrapped_balanced_split(t, 200)
  
  n <- neuralnet( is_spam ~ .,
                  data = split$t_train,
                  hidden = 1,
                  linear.output = FALSE
  )
  
  training_results <- c(training_results, evaluate_model(t_train, t_train$is_spam))
  test_results <- c(test_results, evaluate_model(t_test, t_test$is_spam))
  
}

print(test_results)
hist(test_results)

```