Question and Background Information

Question and Context

For our final project, we are imagining that we work in the office of a Congressperson. This congressperson is interested in using the demographic information obtained by conducting polls of their constituents. The congressperson has asked the following question: “Given the demographic information available, can we use this data to predict whether particular constituents will think that abortion should be legal or illegal?” If we can build an adequate model to predict these attitudes, this could be useful when creating targeted ads or outreach programs (with unknown data). The model may also be able to assist us in understanding whether particular demographic characteristics have more sway over abortion attitudes.

For the purposes of this project, we have treated attitudes towards abortion as a strict binary problem: people either think abortion should be legal or illegal. We understand that, in reality, people tend to have a more nuanced view of abortion legality. For example, a person may believe that it should be legal only when the life of the mother is in danger or if she has been raped. Therefore, this treatment of abortion views as a black-and-white issue does have limitations, but we wanted to see whether we could build a model that provides any predictive power/information.

Data Background Information

The data we are using was obtained from an Ipsos poll conducted on behalf of Buzzfeed in the spring of 2015. The data cover 17,030 adults, ages 16–65, across 23 countries.

Note that because the data was collected for Buzzfeed (which tends to skew more liberal), there may be a liberal, or more pro-abortion sway to the data, as seen by the distributions in the EDA below. The “legal” and “illegal” classes are very imbalanced, with significantly more observations for “legal” than “illegal.”

The data contains a variety of demographic information including variables on gender, household income level, education level, employment status, marital status, whether the person knows someone in the LGBTQ community, if the person or someone close to them has had an abortion, religiosity (measured by how often do they attend religious services), and age. The variable that we are trying to predict as our target variable is the “attitude” variable which has the levels “legal” and “illegal.” All variables excluding age are categorical factor variables.

Previous Research and Relevance to Data

Attitudes towards abortion have been thoroughly researched by multiple outlets. One particularly good source of information is the Pew Research Center in the U.S, which provides fairly balanced information on the subject (from a largely American perspective). According to Pew, in the U.S. around 6 in 10 adults (59%) think that abortion should be legal in all or most cases, and 39% say it should be illegal in all or most cases. This distribution highlights how our data (which has 84% legal and 16% illegal) has a more pro-abortion skew than the U.S. at large. Importantly, our data does include information from multiple countries in Europe, the Americas, Asia, and beyond, so this likely also significantly impacts the leanings of the data.

Additionally, in the U.S., Pew has found that views on abortion tend to vary widely by religious affiliation and level of religiosity. Evangelical protestants are much more likely to think that abortion should be illegal, while non-evangelical protestants tend to have a more liberal stance and tend believe that abortion should be legal.

Data Preparation and Cleaning

We began by reading in our data. We then subset the relevant columns excluding columns that were not useful in helping us answer our guiding question regarding abortion attitude. We then replaced all empty/null values with “other”, instead of excluding these rows so we had more data points that we could use to train, tune, and test our data with. Finally, we re-coded most of our variables to fit into a smaller number of groups so that our model would run more efficiently. After cleaning and simplifying our data, we were ready to move on to some exploratory data analysis.

# Read in data 
SurveyResults <- read_excel("SurveyResults.xlsx")

# Subset relevant columns
data <- SurveyResults[,c(3:10,17,18,20,21)]

# Rename columns
colnames(data) <- c("country","gender","age","employment","edu","marital_status","household_income","chief_income_earner","close_lgbtq","attitude","had_abortion","religiosity")

# set Na to NA
data[data=="Na"] <- NA
# sapply(data, function(x) sum(is.na(x)))

# remove all NA's
#data <- na.omit(data)

# Recode NAs
data <- data %>% replace_na(list(employment = "Employment_Status_Unavailable", marital_status = "Marital_Status_Unavailable", household_income = "Income_Unavailable"))
#sapply(data, function(x) sum(is.na(x)))

# Remove observations where attitude = don't know
data <- data[!(data$attitude == "Don't know/Prefer not to say"),]

# Factor collapsing 
data$employment<-fct_collapse(data$employment,
                              student = "A student",
                              employed = c("Full time", "Part time", "Self employed"),
                              unemployed = c("Retired", "Unemployed", "Prefer not to answer"),
                              other = c("Employment_Status_Unavailable")
                              )

# unique(data$employment), check that factor worked

data$attitude<-fct_collapse(data$attitude,
legal = c("Abortion SHOULD be permitted whenever a woman decides she wants one","Abortion SHOULD be permitted in certain circumstances, such as if a woman has been raped"),
illegal = c("Abortion should NOT be permitted under any circumstances, except when the life of the mother is in danger", "Abortion should NEVER be permitted, no matter what circumstance exists"))

#data$attitude <- recode(data$attitude, "legal" = "1", "illegal" = "0")
#data$attitude <- as.factor(data$attitude)
# unique(data$attitude), check that it worked 

data$marital_status <- fct_collapse(data$marital_status,
  married = c("Married", "Domestic partnership / Living as married"),
  single = c("Single"),
  other = c("Divorced", "Widowed", "Marital_Status_Unavailable"))
# unique(data$marital_status), check that it worked 


data$country <- fct_collapse(data$country,
  americas = c("United States", "Canada", "Argentina", "Brazil", "Mexico"), #1883
  asia = c("India", "Japan", "South Korea", "China", "Russia"), #3477
  europe = c("Great Britain", "Ireland", "Italy", "Hungary", "Belgium", "Sweden", "Spain", "Poland", "Turkey", "Germany", "France"), #5193
  other = c("South Africa", "Australia")) #1501
# unique(data$country)


data$had_abortion <- fct_collapse(data$had_abortion,
  No = "No",
  Yes = "Yes",
  Other = c("Not sure", "Prefer not to answer"))
# unique(data$had_abortion), make sure that it worked 


data$religiosity<-fct_collapse(data$religiosity,
                               often = c("Every week", "More than once a week", "Nearly every week"),
                               occasionally = c("Once a month", "Two to three times a month"),
                               rarely = c("Several times a year", "Once a year", "Less than once a year"),
                               never = c("Never", "Don’t know/no answer")
                               )


# Update additional variables to factors
data$gender <- as.factor(data$gender)
data$edu <- as.factor(data$edu)
data$household_income <- as.factor(data$household_income)
data$chief_income_earner <- as.factor(data$chief_income_earner)
data$close_lgbtq <- as.factor(data$close_lgbtq)

# unique(data$religiosity) # make sure that it worked 
# str(data)
head(data) %>% gt()%>%
  tab_header(title = "Data Preview") %>% tab_options(heading.align = "left")
Data Preview
country gender age employment edu marital_status household_income chief_income_earner close_lgbtq attitude had_abortion religiosity
europe Male 27 employed High single Medium Yes No legal No rarely
europe Female 17 student Medium single Medium No No legal No never
europe Female 19 student Medium single Income_Unavailable Yes Yes legal Yes never
europe Male 60 employed Medium married High Yes Yes legal Yes rarely
europe Male 25 employed High single Medium Yes Yes legal No rarely
europe Male 35 employed High single High Yes Yes legal No rarely

Exploratory Data Analysis

Calculate Prevalence

For this project, we have chosen the “positive” class to be “legal” and the “negative” class to be “illegal.” As is apparent in the table above, there is a strong class imbalance in data, with the positive class dominating the negative class. The prevalence of the positive class is 0.8378, or approximately 84%. Thus, if we were to randomly guess we have an 84% change of correctly classifying an observation as a member of the positive class.

target_table <- table(data$attitude)
target_table
baserate <- target_table[[1]]/(target_table[[1]] + target_table[[2]])
baserate

High-Level Summary of Variables

Summary Table

data %>% tbl_summary()
Characteristic N = 15,2881
country
americas 3,653 (24%)
other 1,363 (8.9%)
europe 7,291 (48%)
asia 2,981 (19%)
gender
Female 7,658 (50%)
Male 7,630 (50%)
age 40 (29, 51)
employment
student 1,165 (7.6%)
other 1,865 (12%)
employed 9,118 (60%)
unemployed 3,140 (21%)
edu
High 6,075 (40%)
Low 3,547 (23%)
Medium 5,666 (37%)
marital_status
other 2,863 (19%)
married 8,432 (55%)
single 3,993 (26%)
household_income
High 4,885 (32%)
Income_Unavailable 1,251 (8.2%)
Low 3,574 (23%)
Medium 5,578 (36%)
chief_income_earner 8,950 (59%)
close_lgbtq
No 8,930 (58%)
Not sure 1,431 (9.4%)
Prefer not to answer 232 (1.5%)
Yes 4,695 (31%)
attitude
legal 12,809 (84%)
illegal 2,479 (16%)
had_abortion
No 7,652 (50%)
Other 2,153 (14%)
Yes 5,483 (36%)
religiosity
never 5,849 (38%)
often 2,018 (13%)
rarely 6,454 (42%)
occasionally 967 (6.3%)

1 n (%); Median (IQR)

The above summary table provides a summary of the different variables in the dataset and their distributions by level.

Age Distribution

# histogram of age distribution for survey
hist(data$age, xlab = "Age", main = "Age Distribution")

This histogram shows the distribution of age for all respondents of the survey.

Household Income

# household income
income_count <- data %>% count(household_income)
ggplot(income_count, aes(x = household_income, y=n)) + 
  geom_bar(stat = "identity") + xlab("Household Income") +ylab("Count") +labs(title ="Household Income Distribution")

This bar graph displays the density of each category of household income. We can clearly see that the majority of respondents come from households of medium and high income.

Attitudes by Demographic Metrics

The following graphs depict attitude toward abortion in relation to various demographic metrics.

Education by Had Abortion

# abortion by Education
ab_edu <- data %>% group_by(edu) %>% count(had_abortion)
ggplot(ab_edu, aes(x = edu, y = n, fill = had_abortion)) +
  geom_col(position = "dodge") + labs(x = "Education", y = "Count", title = "Education by Had Abortion")

The above graph shows the number of respondents that have and have not had an abortion (or know someone that has), grouped by education level. There does seem to be a general trend showing that respondents with higher education rates have not/do not know of anyone who had an abortion. However, this is also the case for lower and medium education levels as well, so it is not definitive as to whether or not these two variables are strongly correlated.

Marital Status

# abortion by marriage
ab_marriage <- data %>% group_by(marital_status) %>% count(had_abortion)
ggplot(ab_marriage, aes(x = marital_status, y = n, fill = had_abortion)) +
  geom_col(position = "dodge")+ labs(x = "Marital Status", y = "Count", title = "Marital Status by Had Abortion")

This graph is somewhat surprising in that a large number of married persons responded that they have had an abortion or know someone who has. While it is important to note that the person who had an abortion is not necessarily married just because the respondent is, it is still rather shocking to see the difference between the married group and the single/other groups. This will be an important observation that should be kept in mind moving forward.

Age

# abortion and age
ab_age <- data %>% group_by(age) %>% count(had_abortion)
ab_age <- ab_age[ab_age$had_abortion == "Yes",]
ggplot(ab_age, aes(x=as.factor(age), y = n)) +
  geom_bar(stat="identity") + labs(x= "Age", y = "count", title = "Age Distribution")

This graph is surprising, as it reveals that 33 year old respondents in this survey had the most abortions (or knew the most people who had abortions). It is interesting to see that this age is much higher than expected and thus is an important factor to keep in mind for future analysis.

Gender

counts_gender <- table(data$gender, data$attitude)
barplot(counts_gender, main="Attitude by Gender",
        xlab="Gender", col=c("darkblue","red"),
        legend = rownames(counts_gender), beside=TRUE)

This graph is somewhat surprising, as we had anticipated a stronger relationship between gender and attitude for females specifically.

Geographic Region

counts_country <- table(data$country, data$attitude)
barplot(counts_country, main="Attitude by Country",
        xlab="Country", col=c("darkblue","red", "green", "yellow"),
        legend = rownames(counts_country), beside=TRUE)

This graph is illuminating, as we had not anticipated Europe being so pro-abortion when compared to other continents. There appears to be a trend here, which we will further investigate later.

Education

counts_education <- table(data$edu, data$attitude)
barplot(counts_education, main="Attitude by Education",
        xlab="Education", col=c("darkblue","red", "green"),
        legend = rownames(counts_education), beside=TRUE)

The above graph demonstrated a correlation between rates of education and attitudes toward abortion. As seen here, those with higher rates of education seem to be more favorable to abortion. That being said, this graph also reveals the lower response rate among persons with lower education. As such, we will further examine the cross-section of attitude and education to understand how significant the correlation is.

Had/Know Someone w/ Abortion

counts_abortion <- table(data$had_abortion, data$attitude)
barplot(counts_abortion, main="Attitude by Whether The Respondent \n or Someone Close to Respondent has had an Abortion",
        xlab="Respondent or Someone Close to Respondent has had an Abortion",
        col=c("darkblue","red", "green"),
        legend = rownames(counts_abortion), beside=TRUE)

This graph is not surprising, as we had anticipated that persons who either had an abortion themselves or who are close to someone who did, would be more favorable to legalization of abortion.

Household Income

counts_income <- table(data$household_income, data$attitude)
barplot(counts_income, main="Attitude by Household Income",
        xlab="Household Income", col=c("darkblue","red", "green", "yellow"),
        legend = rownames(counts_income), beside=TRUE)

This graph does not suggest a strong correlation between household income and attitude towards abortion.

Chief Income Earner

counts_breadwinner <- table(data$chief_income_earner, data$attitude)
barplot(counts_breadwinner, main="Attitude by Whether or Not The Respondent is Breadwinner",
        xlab="Whether or Not The Respondent is Breadwinner", col=c("darkblue","red"),
        legend = rownames(counts_breadwinner), beside=TRUE)

This graph also does not suggest a relationship between whether or not the respondent is the chief income earner and attitude towards abortion.

Religiosity

counts_religiosity <- table(data$religiosity, data$attitude)
barplot(counts_religiosity, main="Attitude by Whether or Not The Respondent is Religious",
        xlab="Whether or Not The Respondent is Religious", col=c("darkblue","red", "green", "yellow"),
        legend = rownames(counts_religiosity), beside=TRUE)

This graph was not surprising in that it is logical that those who rarely or never attend religious services/worship favor the legalization of abortion. Thus, there does appear to be a strong correlation between religiosity and attitude towards abortion.

Age & Gender

ggplot(data, aes(x=age, fill=gender)) +
  geom_bar() +
  facet_wrap(~attitude)

This graph shows the number of people in favor of and not in favor of the legalization of abortion, categorized by their age and further categorized by gender. As previously observed, there is a relatively even split between male and female for each attitude (legal / illegal), and this did not change when the subjects were divided by age. It is once again surprising to see the lack of strong correlation between these variables and attitude towards abortion.

EDA Conclusions

Through this exploratory data analysis, we are able to determine which variables will be useful in predicting attitude toward abortion. The usefulness of a variable is based on the apparent correlation examined in each of the above figures. As such, we believe that religiosity, whether or not the respondent had/knew of someone who had an abortion, country, and education level will be rather useful features in our model. Moreover, despite our original belief that gender would have a significant impact on a person’s attitude toward abortion, we now know that gender is not an extremely useful predictor. From here we can proceed with this knowledge and begin to build our decision tree model.

Methods: Model Building and Evaluation

Partition the Data & Build Decision Tree

We first split the data into test, train, and tune subsets. 70% of our data was used to train then 15% to tune and 15% to test.

set.seed(2000)
part_index_1 <- caret::createDataPartition(data$attitude,
                                           times=1,
                                           p = 0.70,
                                           groups=1,
                                           list=FALSE)
train <- data[part_index_1, ]
tune_and_test <- data[-part_index_1, ]
tune_and_test_index <- caret::createDataPartition(tune_and_test$attitude,
                                           p = .5,
                                           list = FALSE,
                                           times = 1)
tune <- tune_and_test[tune_and_test_index, ]
test <- tune_and_test[-tune_and_test_index, ]
dim(train)
dim(tune)
dim(test)
# Choose the features and classes
features <- train[,c(-10)]
target <- train$attitude
# Cross validation process 
fitControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5, 
                          returnResamp="all",
                          classProbs = TRUE,
                          allowParallel = TRUE) 
grid <- expand.grid(.winnow = c(TRUE,FALSE), 
                    .trials=c(1,5,10,15,20), 
                    .model="tree")
set.seed(2000)
abortion_dt_mdl <- train(x=features,
                y=target,
                method="C5.0",
                tuneGrid=grid,
                trControl=fitControl,
                verbose=TRUE)

Decision Tree Outputs

Model Output

# Initial tree model output
abortion_dt_mdl
## C5.0 
## 
## 10703 samples
##    11 predictor
##     2 classes: 'legal', 'illegal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 9633, 9633, 9633, 9633, 9632, 9632, ... 
## Resampling results across tuning parameters:
## 
##   winnow  trials  Accuracy   Kappa    
##   FALSE    1      0.8414086  0.2080936
##   FALSE    5      0.8383813  0.2243437
##   FALSE   10      0.8395215  0.1933807
##   FALSE   15      0.8397090  0.2231350
##   FALSE   20      0.8413157  0.2089132
##    TRUE    1      0.8398209  0.1425921
##    TRUE    5      0.8377649  0.1476347
##    TRUE   10      0.8386808  0.1354137
##    TRUE   15      0.8386252  0.1459779
##    TRUE   20      0.8390923  0.1430361
## 
## Tuning parameter 'model' was held constant at a value of tree
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = tree and winnow
##  = FALSE.

Variable Importance

varImp(abortion_dt_mdl)
## C5.0 variable importance
## 
##                      Overall
## religiosity         100.0000
## country              13.0430
## employment           11.3714
## close_lgbtq          11.3413
## marital_status        7.8779
## household_income      5.2753
## had_abortion          1.5415
## chief_income_earner   1.3914
## edu                   0.7307
## age                   0.1502
## gender                0.0000

Re-Sampling Distributions

# visualize the re-sample distributions
xyplot(abortion_dt_mdl,type = c("g", "p", "smooth"))

### Variable Importance

varImp(abortion_dt_mdl)
## C5.0 variable importance
## 
##                      Overall
## religiosity         100.0000
## country              13.0430
## employment           11.3714
## close_lgbtq          11.3413
## marital_status        7.8779
## household_income      5.2753
## had_abortion          1.5415
## chief_income_earner   1.3914
## edu                   0.7307
## age                   0.1502
## gender                0.0000

Confusion Matrix

# predict performance using tune dataset 
abort_dt_pred_tune = predict(abortion_dt_mdl, tune, type= "raw")
abort_dt_eval <- confusionMatrix(as.factor(abort_dt_pred_tune), 
                as.factor(tune$attitude), 
                dnn=c("Prediction", "Actual"), 
                mode = "everything")
table(tune$attitude)
## 
##   legal illegal 
##    1921     372
abort_dt_eval 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction legal illegal
##    legal    1875     319
##    illegal    46      53
##                                           
##                Accuracy : 0.8408          
##                  95% CI : (0.8252, 0.8556)
##     No Information Rate : 0.8378          
##     P-Value [Acc > NIR] : 0.3584          
##                                           
##                   Kappa : 0.1683          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9761          
##             Specificity : 0.1425          
##          Pos Pred Value : 0.8546          
##          Neg Pred Value : 0.5354          
##               Precision : 0.8546          
##                  Recall : 0.9761          
##                      F1 : 0.9113          
##              Prevalence : 0.8378          
##          Detection Rate : 0.8177          
##    Detection Prevalence : 0.9568          
##       Balanced Accuracy : 0.5593          
##                                           
##        'Positive' Class : legal           
## 

Decision Tree Analysis

The initial tree model obtained through repeated cross validation found the optimal model (based on accuracy) to be one that has trials = 1, model = tree and winnow = FALSE. Within the model the most important variable that was used in 100% of the models was the “religiosity” variable. Logically, based on our background research and EDA it makes sense that this variable is highly deterministic as abortion attitudes tend to vary greatly depending on religious affiliation and strength of religious beliefs. The second most important variable was “country,” followed by “employment” and “close_lqbtq.” However, all three of these variables were used in significantly fewer of the models.

For this problem we are mainly interested in the values of the following metrics: sensitivity, specificity, and F1 score. We care about the F1 score because the target classes in this dataset are highly imbalanced, with there being many more “legal” observations than “illegal” observations. We also want to maximize the sensitivity and specificity, if possible. From this first model, we achieved a fairly high F1 score of 0.91, and our sensitivity was also quite high at 0.9761. This means that we achieve about 98% coverage of the positive (legal) class observations with this model. However, due to the data imbalance, we got a very low specificity value of 0.1425, so we only identify about 14% of our negative class samples correctly. This makes this model particularly bad at identifying people who do not think abortion should be legal and means the model has a high false positive rate. If this model were used in production, it would provide a misleading representation of constituent views and falsely make it seem that more people have positive attitudes towards abortion than they actually do.

Finally, this model achieved an overall accuracy of 84%, which is only very slightly higher than the calculated prevalence. Therefore, this model generally appears to have little predictive power.

Build and Evalaute Initial Random Forest Ensemble Model

Next, we chose to try an ensemble model to see whether we could get improved results.

# Calculate initial mtry
mtry_tune <- function(x){
  xx <- dim(x)[2]-1
  sqrt(xx)
}
#mtry_tune(train)
# Initial RF model
set.seed(1984)  
abortion_RF_mdl = randomForest(attitude~.,          
                            train,     
                            #y = NULL,           
                            #subset = NULL,      
                            #xtest = NULL,       
                            #ytest = NULL,       
                            ntree = 300,  # 300 tree       
                            mtry = 3,  # mtry = 3 (calculated above)          
                            replace = TRUE,      
                            #classwt = NULL,     
                            #strata = NULL,      
                            sampsize = 100,      
                            nodesize = 5,        
                            #maxnodes = NULL,    
                            importance = TRUE,   
                            #localImp = FALSE,   
                            norm.votes = TRUE,   
                            do.trace = TRUE,     
                            keep.forest = TRUE,  
                            keep.inbag = TRUE)  

Random Forest Outputs

Model Output

# Look at the abortion RF model
abortion_RF_mdl
## 
## Call:
##  randomForest(formula = attitude ~ ., data = train, ntree = 300,      mtry = 3, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE,      norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.03%
## Confusion matrix:
##         legal illegal class.error
## legal    8850     117  0.01304784
## illegal  1599     137  0.92108295

Error Rate

err.rate_RFmdl1 <- as.data.frame(abortion_RF_mdl$err.rate)
err.rate_RFmdl1
##           OOB       legal   illegal
## 1   0.3075255 0.267634155 0.5142857
## 2   0.2374544 0.144992193 0.7152738
## 3   0.1991031 0.094122895 0.7413594
## 4   0.1858357 0.060889930 0.8312212
## 5   0.1836868 0.061670570 0.8139401
## 6   0.1796693 0.053752649 0.8300691
## 7   0.1723816 0.045611687 0.8271889
## 8   0.1687377 0.037024646 0.8490783
## 9   0.1660282 0.033902085 0.8485023
## 10  0.1640661 0.032563845 0.8433180
## 11  0.1630384 0.024868964 0.8767281
## 12  0.1641596 0.019739043 0.9101382
## 13  0.1641596 0.020854243 0.9043779
## 14  0.1650005 0.018066243 0.9239631
## 15  0.1639727 0.018735363 0.9141705
## 16  0.1617304 0.012378722 0.9331797
## 17  0.1621975 0.011598082 0.9400922
## 18  0.1620107 0.010482882 0.9447005
## 19  0.1618238 0.011598082 0.9377880
## 20  0.1607960 0.012713282 0.9256912
## 21  0.1607960 0.013828482 0.9199309
## 22  0.1609829 0.012267202 0.9291475
## 23  0.1602354 0.012601762 0.9228111
## 24  0.1593946 0.008810081 0.9372120
## 25  0.1612632 0.010148322 0.9418203
## 26  0.1620107 0.011486562 0.9395161
## 27  0.1611698 0.011040482 0.9366359
## 28  0.1608895 0.009702242 0.9418203
## 29  0.1619172 0.009702242 0.9481567
## 30  0.1616369 0.008810081 0.9510369
## 31  0.1607026 0.011598082 0.9308756
## 32  0.1605157 0.014609122 0.9141705
## 33  0.1606092 0.012378722 0.9262673
## 34  0.1603289 0.011932642 0.9268433
## 35  0.1607026 0.011932642 0.9291475
## 36  0.1612632 0.012490242 0.9297235
## 37  0.1609829 0.010928962 0.9360599
## 38  0.1606092 0.009590722 0.9406682
## 39  0.1607026 0.013159362 0.9228111
## 40  0.1605157 0.011932642 0.9279954
## 41  0.1603289 0.011932642 0.9268433
## 42  0.1606092 0.010036802 0.9383641
## 43  0.1607026 0.011709602 0.9302995
## 44  0.1606092 0.011598082 0.9302995
## 45  0.1608895 0.012155682 0.9291475
## 46  0.1601420 0.011486562 0.9279954
## 47  0.1607026 0.013270882 0.9222350
## 48  0.1593946 0.012155682 0.9199309
## 49  0.1601420 0.013493922 0.9176267
## 50  0.1590208 0.011486562 0.9210829
## 51  0.1598617 0.012155682 0.9228111
## 52  0.1607960 0.013047842 0.9239631
## 53  0.1607026 0.012713282 0.9251152
## 54  0.1602354 0.012490242 0.9233871
## 55  0.1607026 0.012936322 0.9239631
## 56  0.1606092 0.012267202 0.9268433
## 57  0.1602354 0.012044162 0.9256912
## 58  0.1601420 0.010371362 0.9337558
## 59  0.1608895 0.009590722 0.9423963
## 60  0.1611698 0.008475521 0.9498848
## 61  0.1607960 0.008029441 0.9498848
## 62  0.1610763 0.007806401 0.9527650
## 63  0.1611698 0.008587041 0.9493088
## 64  0.1613566 0.008475521 0.9510369
## 65  0.1612632 0.009590722 0.9447005
## 66  0.1616369 0.009144641 0.9493088
## 67  0.1605157 0.009813762 0.9389401
## 68  0.1605157 0.010259842 0.9366359
## 69  0.1599552 0.010148322 0.9337558
## 70  0.1610763 0.009925282 0.9418203
## 71  0.1612632 0.009367681 0.9458525
## 72  0.1613566 0.009813762 0.9441244
## 73  0.1614501 0.009813762 0.9447005
## 74  0.1618238 0.010148322 0.9452765
## 75  0.1614501 0.009813762 0.9447005
## 76  0.1621975 0.010148322 0.9475806
## 77  0.1613566 0.010036802 0.9429724
## 78  0.1619172 0.009702242 0.9481567
## 79  0.1621041 0.009256161 0.9516129
## 80  0.1619172 0.010036802 0.9464286
## 81  0.1616369 0.010482882 0.9423963
## 82  0.1619172 0.010259842 0.9452765
## 83  0.1622909 0.010928962 0.9441244
## 84  0.1620107 0.010928962 0.9423963
## 85  0.1616369 0.010705922 0.9412442
## 86  0.1618238 0.010482882 0.9435484
## 87  0.1611698 0.010148322 0.9412442
## 88  0.1611698 0.010817442 0.9377880
## 89  0.1606092 0.010594402 0.9354839
## 90  0.1607026 0.011263522 0.9326037
## 91  0.1607960 0.011040482 0.9343318
## 92  0.1611698 0.011040482 0.9366359
## 93  0.1607960 0.011040482 0.9343318
## 94  0.1607960 0.010817442 0.9354839
## 95  0.1610763 0.011040482 0.9360599
## 96  0.1612632 0.010371362 0.9406682
## 97  0.1610763 0.010259842 0.9400922
## 98  0.1613566 0.010594402 0.9400922
## 99  0.1606092 0.010148322 0.9377880
## 100 0.1608895 0.010148322 0.9395161
## 101 0.1607960 0.010036802 0.9395161
## 102 0.1613566 0.009590722 0.9452765
## 103 0.1609829 0.009925282 0.9412442
## 104 0.1607026 0.009367681 0.9423963
## 105 0.1607960 0.009256161 0.9435484
## 106 0.1613566 0.009144641 0.9475806
## 107 0.1612632 0.009144641 0.9470046
## 108 0.1615435 0.009256161 0.9481567
## 109 0.1613566 0.009256161 0.9470046
## 110 0.1614501 0.009144641 0.9481567
## 111 0.1615435 0.009813762 0.9452765
## 112 0.1608895 0.010482882 0.9377880
## 113 0.1610763 0.010371362 0.9395161
## 114 0.1610763 0.011040482 0.9360599
## 115 0.1607960 0.011821122 0.9302995
## 116 0.1604223 0.011486562 0.9297235
## 117 0.1610763 0.011821122 0.9320276
## 118 0.1607960 0.011598082 0.9314516
## 119 0.1606092 0.011821122 0.9291475
## 120 0.1601420 0.011932642 0.9256912
## 121 0.1607026 0.012155682 0.9279954
## 122 0.1607960 0.012378722 0.9274194
## 123 0.1608895 0.013047842 0.9245392
## 124 0.1602354 0.013270882 0.9193548
## 125 0.1603289 0.012824802 0.9222350
## 126 0.1604223 0.013270882 0.9205069
## 127 0.1603289 0.012713282 0.9228111
## 128 0.1611698 0.013159362 0.9256912
## 129 0.1607960 0.013047842 0.9239631
## 130 0.1606092 0.013047842 0.9228111
## 131 0.1605157 0.012713282 0.9239631
## 132 0.1604223 0.012490242 0.9245392
## 133 0.1606092 0.012713282 0.9245392
## 134 0.1606092 0.012713282 0.9245392
## 135 0.1609829 0.013382402 0.9233871
## 136 0.1606092 0.013493922 0.9205069
## 137 0.1607960 0.013605442 0.9210829
## 138 0.1607026 0.013159362 0.9228111
## 139 0.1603289 0.013270882 0.9199309
## 140 0.1602354 0.013605442 0.9176267
## 141 0.1598617 0.014051522 0.9130184
## 142 0.1601420 0.013828482 0.9158986
## 143 0.1604223 0.014163042 0.9158986
## 144 0.1604223 0.013716962 0.9182028
## 145 0.1604223 0.013605442 0.9187788
## 146 0.1604223 0.013605442 0.9187788
## 147 0.1600486 0.013382402 0.9176267
## 148 0.1600486 0.013605442 0.9164747
## 149 0.1599552 0.013828482 0.9147465
## 150 0.1604223 0.013940002 0.9170507
## 151 0.1601420 0.013716962 0.9164747
## 152 0.1605157 0.014051522 0.9170507
## 153 0.1604223 0.014051522 0.9164747
## 154 0.1604223 0.014051522 0.9164747
## 155 0.1601420 0.014163042 0.9141705
## 156 0.1604223 0.014051522 0.9164747
## 157 0.1602354 0.013828482 0.9164747
## 158 0.1606092 0.014274562 0.9164747
## 159 0.1603289 0.014386082 0.9141705
## 160 0.1605157 0.013716962 0.9187788
## 161 0.1602354 0.013605442 0.9176267
## 162 0.1606092 0.013605442 0.9199309
## 163 0.1607960 0.013940002 0.9193548
## 164 0.1602354 0.013493922 0.9182028
## 165 0.1598617 0.013716962 0.9147465
## 166 0.1598617 0.013828482 0.9141705
## 167 0.1596749 0.014051522 0.9118664
## 168 0.1598617 0.014497602 0.9107143
## 169 0.1593946 0.014720642 0.9066820
## 170 0.1595814 0.014943682 0.9066820
## 171 0.1594880 0.014720642 0.9072581
## 172 0.1595814 0.014609122 0.9084101
## 173 0.1594880 0.014497602 0.9084101
## 174 0.1595814 0.014497602 0.9089862
## 175 0.1595814 0.014274562 0.9101382
## 176 0.1598617 0.014386082 0.9112903
## 177 0.1601420 0.014163042 0.9141705
## 178 0.1599552 0.014051522 0.9135945
## 179 0.1600486 0.014386082 0.9124424
## 180 0.1599552 0.014386082 0.9118664
## 181 0.1595814 0.014163042 0.9107143
## 182 0.1598617 0.013716962 0.9147465
## 183 0.1600486 0.013605442 0.9164747
## 184 0.1603289 0.013493922 0.9187788
## 185 0.1603289 0.013270882 0.9199309
## 186 0.1604223 0.013047842 0.9216590
## 187 0.1593946 0.012936322 0.9158986
## 188 0.1600486 0.013382402 0.9176267
## 189 0.1601420 0.012936322 0.9205069
## 190 0.1599552 0.013047842 0.9187788
## 191 0.1600486 0.013493922 0.9170507
## 192 0.1598617 0.013605442 0.9153226
## 193 0.1596749 0.013159362 0.9164747
## 194 0.1598617 0.013047842 0.9182028
## 195 0.1599552 0.013382402 0.9170507
## 196 0.1599552 0.013270882 0.9176267
## 197 0.1596749 0.013382402 0.9153226
## 198 0.1598617 0.013047842 0.9182028
## 199 0.1594880 0.012378722 0.9193548
## 200 0.1596749 0.012490242 0.9199309
## 201 0.1600486 0.012601762 0.9216590
## 202 0.1595814 0.012713282 0.9182028
## 203 0.1596749 0.012378722 0.9205069
## 204 0.1601420 0.012267202 0.9239631
## 205 0.1601420 0.011932642 0.9256912
## 206 0.1600486 0.012155682 0.9239631
## 207 0.1598617 0.012155682 0.9228111
## 208 0.1598617 0.011932642 0.9239631
## 209 0.1599552 0.011709602 0.9256912
## 210 0.1599552 0.011598082 0.9262673
## 211 0.1598617 0.011598082 0.9256912
## 212 0.1601420 0.012267202 0.9239631
## 213 0.1594880 0.012155682 0.9205069
## 214 0.1593946 0.012267202 0.9193548
## 215 0.1590208 0.012490242 0.9158986
## 216 0.1597683 0.012936322 0.9182028
## 217 0.1599552 0.013159362 0.9182028
## 218 0.1593946 0.012713282 0.9170507
## 219 0.1593946 0.012378722 0.9187788
## 220 0.1598617 0.012713282 0.9199309
## 221 0.1595814 0.012601762 0.9187788
## 222 0.1594880 0.012713282 0.9176267
## 223 0.1597683 0.012936322 0.9182028
## 224 0.1599552 0.013493922 0.9164747
## 225 0.1602354 0.013382402 0.9187788
## 226 0.1605157 0.013493922 0.9199309
## 227 0.1601420 0.013047842 0.9199309
## 228 0.1605157 0.013270882 0.9210829
## 229 0.1606092 0.012936322 0.9233871
## 230 0.1605157 0.013270882 0.9210829
## 231 0.1607026 0.013159362 0.9228111
## 232 0.1606092 0.013159362 0.9222350
## 233 0.1604223 0.012936322 0.9222350
## 234 0.1603289 0.012936322 0.9216590
## 235 0.1602354 0.012378722 0.9239631
## 236 0.1603289 0.011932642 0.9268433
## 237 0.1599552 0.011821122 0.9251152
## 238 0.1598617 0.011821122 0.9245392
## 239 0.1603289 0.012155682 0.9256912
## 240 0.1603289 0.011932642 0.9268433
## 241 0.1601420 0.011598082 0.9274194
## 242 0.1605157 0.011598082 0.9297235
## 243 0.1601420 0.011821122 0.9262673
## 244 0.1604223 0.011709602 0.9285714
## 245 0.1605157 0.011598082 0.9297235
## 246 0.1599552 0.011263522 0.9279954
## 247 0.1603289 0.011375042 0.9297235
## 248 0.1600486 0.011709602 0.9262673
## 249 0.1603289 0.011709602 0.9279954
## 250 0.1602354 0.011486562 0.9285714
## 251 0.1603289 0.011821122 0.9274194
## 252 0.1604223 0.012155682 0.9262673
## 253 0.1602354 0.011821122 0.9268433
## 254 0.1601420 0.011932642 0.9256912
## 255 0.1602354 0.011932642 0.9262673
## 256 0.1603289 0.011821122 0.9274194
## 257 0.1602354 0.012044162 0.9256912
## 258 0.1604223 0.012267202 0.9256912
## 259 0.1604223 0.012601762 0.9239631
## 260 0.1606092 0.012601762 0.9251152
## 261 0.1606092 0.012824802 0.9239631
## 262 0.1607026 0.012824802 0.9245392
## 263 0.1607026 0.012713282 0.9251152
## 264 0.1607026 0.013047842 0.9233871
## 265 0.1607960 0.013159362 0.9233871
## 266 0.1606092 0.013382402 0.9210829
## 267 0.1608895 0.013159362 0.9239631
## 268 0.1607026 0.012713282 0.9251152
## 269 0.1606092 0.012601762 0.9251152
## 270 0.1610763 0.013047842 0.9256912
## 271 0.1607960 0.012824802 0.9251152
## 272 0.1607960 0.012936322 0.9245392
## 273 0.1609829 0.012824802 0.9262673
## 274 0.1608895 0.012824802 0.9256912
## 275 0.1610763 0.013159362 0.9251152
## 276 0.1607960 0.013159362 0.9233871
## 277 0.1607960 0.013047842 0.9239631
## 278 0.1604223 0.013159362 0.9210829
## 279 0.1603289 0.013047842 0.9210829
## 280 0.1606092 0.013159362 0.9222350
## 281 0.1605157 0.013047842 0.9222350
## 282 0.1603289 0.013047842 0.9210829
## 283 0.1606092 0.013270882 0.9216590
## 284 0.1603289 0.013159362 0.9205069
## 285 0.1605157 0.013605442 0.9193548
## 286 0.1606092 0.013493922 0.9205069
## 287 0.1607026 0.013605442 0.9205069
## 288 0.1606092 0.013493922 0.9205069
## 289 0.1606092 0.013382402 0.9210829
## 290 0.1607026 0.013716962 0.9199309
## 291 0.1606092 0.013605442 0.9199309
## 292 0.1605157 0.013382402 0.9205069
## 293 0.1603289 0.013382402 0.9193548
## 294 0.1603289 0.013605442 0.9182028
## 295 0.1604223 0.013828482 0.9176267
## 296 0.1603289 0.013605442 0.9182028
## 297 0.1605157 0.013716962 0.9187788
## 298 0.1603289 0.013270882 0.9199309
## 299 0.1605157 0.013270882 0.9210829
## 300 0.1603289 0.013047842 0.9210829

Confusion Matrix

# predict performance using tune dataset
abort_RF1_pred_tune = predict(abortion_RF_mdl, tune, type= "response", predict.all = TRUE, proximity = FALSE)
abort_RF1_eval <- confusionMatrix(abort_RF1_pred_tune$aggregate, 
                as.factor(tune$attitude), 
                dnn=c("Prediction", "Actual"),
                mode = "everything")
table(tune$attitude)
## 
##   legal illegal 
##    1921     372
abort_RF1_eval
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction legal illegal
##    legal    1900     338
##    illegal    21      34
##                                           
##                Accuracy : 0.8434          
##                  95% CI : (0.8279, 0.8581)
##     No Information Rate : 0.8378          
##     P-Value [Acc > NIR] : 0.2404          
##                                           
##                   Kappa : 0.1226          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9891          
##             Specificity : 0.0914          
##          Pos Pred Value : 0.8490          
##          Neg Pred Value : 0.6182          
##               Precision : 0.8490          
##                  Recall : 0.9891          
##                      F1 : 0.9137          
##              Prevalence : 0.8378          
##          Detection Rate : 0.8286          
##    Detection Prevalence : 0.9760          
##       Balanced Accuracy : 0.5402          
##                                           
##        'Positive' Class : legal           
## 

Random Forest Analysis

For this ensemble Random Forest model, we achieved a slightly higher sensitivity score (0.9885), but a specificity score that was about 5% lower than the decision tree model. This means that, while this model is better at identifying/covering the positive class than the decision tree, it is even worse than the tree model at detecting the negative class observations and has an even higher false positive rate. The F1 score for this model is quite high at 0.9134, but this is likely in large part due to the extremely high sensitivity value that is inflating it.

We achieved a comparable overall accuracy with the RF model that we did with the decision tree model, but a slightly lower balanced accuracy. Again, overall this model appears to have relatively little predictive power and would not be good to use in production as it would certainly overstate the positive views on abortion and miss many of the negative. The model appears to be learning the positive class almost exclusively due to the data imbalance and the minimal number of differentiating variables (as seen in the EDA).

Tuning the Decision Tree Model

Next, we tried to tune the decision tree model as that model achieved a higher specificity.

# Create a new trainControl object
# Change the CV method (not repeated)
fitControl_2 <- trainControl(method = "cv",
                          number = 10, 
                          returnResamp="all",
                          allowParallel = TRUE) 
# Add the 30 boosting rounds 
grid_2 <- expand.grid(.winnow = c(TRUE,FALSE), 
                    .trials=c(5, 10, 15, 20, 25, 30), 
                    .model="tree")
# Train new model
set.seed(1984)
abortion_dt2_mdl <- train(x=features,
                y=target,
                method="C5.0",
                tuneGrid=grid_2,
                trControl=fitControl_2)

Tune Decision Tree Ouputs

Model Output

# Look at tuned model output
abortion_dt2_mdl
## C5.0 
## 
## 10703 samples
##    11 predictor
##     2 classes: 'legal', 'illegal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9632, 9634, 9632, 9634, 9633, 9633, ... 
## Resampling results across tuning parameters:
## 
##   winnow  trials  Accuracy   Kappa     
##   FALSE    5      0.8377116  0.22650663
##   FALSE   10      0.8408875  0.20563858
##   FALSE   15      0.8379912  0.22583889
##   FALSE   20      0.8401393  0.21906159
##   FALSE   25      0.8390173  0.21927981
##   FALSE   30      0.8403253  0.22265068
##    TRUE    5      0.8387378  0.09463009
##    TRUE   10      0.8401383  0.09540538
##    TRUE   15      0.8384577  0.09448856
##    TRUE   20      0.8395781  0.09556408
##    TRUE   25      0.8395781  0.09556408
##    TRUE   30      0.8395781  0.09556408
## 
## Tuning parameter 'model' was held constant at a value of tree
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = tree and winnow
##  = FALSE.

Re-Sample Distributions

# visualize the re-sample distributions
xyplot(abortion_dt2_mdl,type = c("g", "p", "smooth"))

Variable Importance

# variable importance
varImp(abortion_dt2_mdl)
## C5.0 variable importance
## 
##                     Overall
## country              100.00
## religiosity          100.00
## marital_status       100.00
## employment            98.69
## close_lgbtq           73.57
## age                   62.38
## edu                   51.17
## household_income      33.41
## gender                20.40
## had_abortion          19.79
## chief_income_earner    0.00

Confusion Matrix

# predict performance using tune dataset 
abort_dt2_pred_tune = predict(abortion_dt2_mdl, tune, type= "raw")
abort_dt2_eval <- confusionMatrix(as.factor(abort_dt2_pred_tune), 
                as.factor(tune$attitude), 
                dnn=c("Prediction", "Actual"), 
                mode = "everything")
table(tune$attitude)
## 
##   legal illegal 
##    1921     372
abort_dt2_eval 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction legal illegal
##    legal    1879     306
##    illegal    42      66
##                                           
##                Accuracy : 0.8482          
##                  95% CI : (0.8329, 0.8627)
##     No Information Rate : 0.8378          
##     P-Value [Acc > NIR] : 0.09072         
##                                           
##                   Kappa : 0.2179          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.1774          
##          Pos Pred Value : 0.8600          
##          Neg Pred Value : 0.6111          
##               Precision : 0.8600          
##                  Recall : 0.9781          
##                      F1 : 0.9152          
##              Prevalence : 0.8378          
##          Detection Rate : 0.8195          
##    Detection Prevalence : 0.9529          
##       Balanced Accuracy : 0.5778          
##                                           
##        'Positive' Class : legal           
## 

Tune Decision Tree Analysis

In the tuned model, we made several changes. We removed the trials = 1 option, added trials = 30, and changed the cross validation method to be just “CV” (not repeated CV). Following these tuning changes, we observed that the variable importance measures in the models changed somewhat significantly. The variables “country,” “religiosity,” and “marital_status” were used in 100% of the models. It makes sense that “religiosity” and “marital_status” were included often based on our background research and the exploratory data analysis that we conducted. However, we were slightly surprised by the inclusion of the “country” variable in 100% of the models as it did not appear to have significant differentiating power.

In our key metrics we observed several improvements over our original decision tree model. The sensitivity improved very slightly, but we saw about a 3% increase in the specificity. Thus, this model is providing slightly better coverage of the negative class and a slightly lower false positive rate. The F1 score and balanced accuracy also improved marginally over the original model. The overall accuracy was also slightly higher.

Overall, this model performed the best of those we tested, but was still not terribly useful when predicting the negative class due to reasons discussed previously. Moving forward, changes to the model and/or the data used to train it would likely produce a more powerful model. These future improvements are discussed in the sections below.

Try a Naive Bayes Classifier

Finally, we tried to build a simple Naive Bayes classifier model to see whether it demonstrated improved results over the previously-built tree-based models. We would not move forward with this model, but wanted to use the metrics as a means of comparison.

# Choose the features and classes
features_nb <- train[,c(-10)]
target_nb <- train$attitude
# str(features)
# str(target)
# Cross validation process 
fitControl_nb <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5, 
                          returnResamp="all",
                          classProbs = TRUE,
                          allowParallel = TRUE) 
set.seed(2000)
abortion_nb_mdl <- train(x=features_nb,
                y=target_nb,
                method="nb",
                trControl=fitControl_nb,
                verbose=TRUE)

Naive Bayes Classifier Outputs

Model Output

# Naive Bayes Model
abortion_nb_mdl
## Naive Bayes 
## 
## 10703 samples
##    11 predictor
##     2 classes: 'legal', 'illegal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 9633, 9633, 9633, 9633, 9632, 9632, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy  Kappa    
##   FALSE      0.831861  0.2602052
##    TRUE      0.832291  0.2588707
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.

Confusion Matrix

# predict performance 
abort_nb_pred_tune = predict(abortion_nb_mdl, tune, type= "raw")
abort_nb_eval <- confusionMatrix(as.factor(abort_nb_pred_tune), 
                as.factor(tune$attitude), 
                dnn=c("Prediction", "Actual"), 
                mode = "everything")
table(tune$attitude)
## 
##   legal illegal 
##    1921     372
abort_nb_eval 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction legal illegal
##    legal    1793     265
##    illegal   128     107
##                                           
##                Accuracy : 0.8286          
##                  95% CI : (0.8125, 0.8438)
##     No Information Rate : 0.8378          
##     P-Value [Acc > NIR] : 0.8878          
##                                           
##                   Kappa : 0.2595          
##                                           
##  Mcnemar's Test P-Value : 6.872e-12       
##                                           
##             Sensitivity : 0.9334          
##             Specificity : 0.2876          
##          Pos Pred Value : 0.8712          
##          Neg Pred Value : 0.4553          
##               Precision : 0.8712          
##                  Recall : 0.9334          
##                      F1 : 0.9012          
##              Prevalence : 0.8378          
##          Detection Rate : 0.7819          
##    Detection Prevalence : 0.8975          
##       Balanced Accuracy : 0.6105          
##                                           
##        'Positive' Class : legal           
## 

Naive Bayes Classifier Analysis

From the simple Naive Bayes classifier, we observed slight declines in sensitivity, overall accuracy, and F1, but we saw a large jump in the specificity value. This classifier seems to do a better job of correctly detecting more negative class values than the tree-based models. The balanced accuracy was also slightly higher.

Based strictly on the lower accuracy, we chose not to use the Naive Bayes classifier because it is possible to obtain similar or better results through randomly guessing the positive class based on the previously calculated prevalence value.

Final Model Evaluation

We chose the tuned decision tree model as our “Final” model to move forward with. We evaluated the performance of the model using the test data set.

# predict performance using test dataset 
abort_dtFINAL_pred_tune = predict(abortion_dt2_mdl, test, type= "raw")
abort_dtFINAL_eval <- confusionMatrix(as.factor(abort_dtFINAL_pred_tune), 
                as.factor(test$attitude), 
                dnn=c("Prediction", "Actual"), 
                mode = "everything")
# table(tune$attitude)
abort_dtFINAL_eval
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction legal illegal
##    legal    1874     315
##    illegal    47      56
##                                           
##                Accuracy : 0.8421          
##                  95% CI : (0.8265, 0.8568)
##     No Information Rate : 0.8381          
##     P-Value [Acc > NIR] : 0.3166          
##                                           
##                   Kappa : 0.1785          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9755          
##             Specificity : 0.1509          
##          Pos Pred Value : 0.8561          
##          Neg Pred Value : 0.5437          
##               Precision : 0.8561          
##                  Recall : 0.9755          
##                      F1 : 0.9119          
##              Prevalence : 0.8381          
##          Detection Rate : 0.8176          
##    Detection Prevalence : 0.9551          
##       Balanced Accuracy : 0.5632          
##                                           
##        'Positive' Class : legal           
## 

In the final model, we observed similar model metric performance as with the tune dataset. The F1 score was quite high at 0.9119, which was likely due to the very high sensitivity of the model. The model had a sensitivity of 0.9775 on the test dataset, meaning that about 98% of all positive observations were properly classified. However, as we have consistently observed, the model performed fairly poorly on the negative class with a specificity of 0.1509, and has corresponded with a false positive rate of about 95%.

The final overall and balanced accuracies were comparable to those obtained with the tuning data. The overall accuracy is slightly higher than the calculated prevalence, which suggests that the model does have some predictive power over simple random guessing.

The fact that the we observed very little improvement in overall accuracy over the course of training, tuning, and testing signals that using the outputs of this model should be done with caution.

Conclusions

Based on the final outputs and metrics of our chosen model, we can see that the model itself has very little usefulness in understanding abortion in the real world. The model is good at predicting people who will have the view that abortion should be legal, but it performs poorly when predicting people who do not think it should be legal. In the context of our driving question, this is problematic because we would like to have a model that is good at identifying people who have the “illegal” view so that we can more effectively target outreach, ads, and education to these people.

Despite our model not having the best performance metrically, we were still able to learn more about variables that might be important in future models. For example, across the decision tree models we consistently observed that the religiosity variable was deemed to be an important predictor for determining abortion attitudes. While this aligns with previous research, it also validates the variable’s value in a machine learning context. With this knowledge, we could move forward and gather more specific religiosity data such as: denomination, region of the country (some religions may be influenced by geographic region), contribution levels to church, ratings on spirituality, and others.

Given the size of the dataset and the limits placed on the model’s learning by the small negative class, this could not, as is, be put into production right now. In order to be made more reliable, we would need more expansive data on the negative class specifically. Additionally, given that the specificity of the Naive Bayes model in predicting negatives was the highest we observed, we would likely proceed with that type model as a basis for future work. The performance of the Naive Bayes classifier may have been better for the negative class due to the nature of the classifier: it uses prior probabilities of observations to predict future observations and is less prone to over-learning a positive class than the tree-based models. For further suggestions, please see the future recommendations section below.

Finally, is important to remember that variables such as gender, household income, and age are protected classes. However, given that our project is targeting persons for lobbying and not actual policy, we decided against conducting a fairness assessment because no action we take (targeting people for ads/outreach) would disadvantage these protected classes.

Future Work and Recommendations

If time permitted, there are many different ways we could have worked to improve our model and better answer our question.

One way we could improve our model in the future is by including more data, specifically including more balanced data. We saw early on that our target variable, abortion attitude, is very unbalanced with 84% of our population falling into the legal category. In the next iterations of this project, it would be helpful to have more data about people in the illegal category because we could learn more about the differences that exist between people who fall in the illegal and legal categories. Ultimately, having more balanced data would allow us to increase our models specificity and decrease our models false positive rate.

Another way we could improve our model in future iterations would be to gather more quantitative and qualitative demographic information on our survey participants. Right now, the data has rather basic demographic information about the participants. However, as we saw in our model more specific data, such as religiosity could be a great indicator of people’s attitudes towards abortion. It would be interesting if we knew more about people’s political beliefs or sexual habits and whether or not these would impact our models ability to predict. Either way, if we were able to know more specific demographic information the model would likely have a better chance at distinguishing and classifying people in the illegal and legal categories.

Some other paths we would explore in the future are increasing the number of categories we place people in, instead of re-factoring into smaller binary groups. By increasing the number of categories people could fall into we are leveraging our data more fully and would likely have a positive impact on our models accuracy. We could also increase the number of categories abortion attitude falls into from the simplified illegal, legal back to it’s original five categories (only illegal in some scenarios, always illegal, no opinion, etc.). The reason we decreased the number of categories in our project was to increase model efficiency and to save time. However, more time would allow us to use a larger number of categorizations, which could potentially improve the accuracy of our model. Another way we might be able to increase accuracy is by including more continuous variables, which our current model is lacking. These could be variables such as salary and actual household income. These continuous variables have the potential to further increase our model’s accuracy.