Research Question: Is a high school student’s likelihood of texting while driving predicted by their age, gender, race, physical activity level, average hours of sleep on school nights, and hours of television watched per school day?
The dataset used in this project comes from the Youth Risk Behavior Surveillance System (YRBSS), which contains survey responses from high school students collected between 1991 and 2013 across the United States. The data includes 13,583 observations on 13 variables, including demographic information (age, gender, race), health behaviors, and risky behaviors such as texting while driving.
For this project, I use the yrbss dataset (available
from the OpenIntro resources), focusing specifically on the following
variables:
text_while_driving_30d: self-reported days texting or
emailing while driving in the last 30 daysage: age in years (12–18)gender: male or femalerace: race categoryphysically_active_7d: days per week physically
activehours_tv_per_school_day: average hours of TV per school
dayschool_night_hours_sleep: typical hours of sleep on
school nightsI chose this topic because distracted driving (especially texting while driving) is a major public safety concern, and it is relevant for teenagers who are newer drivers. Understanding how texting while driving relates to age, gender, race, physical activity, sleep, and TV use can help highlight which groups of students may be at higher risk of this dangerous behavior.
In this section, I describe the data cleaning process and preparation steps used to make the dataset suitable for logistic regression:
# Load required packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Read the YRBSS dataset
yrbss <- read.csv("yrbss.csv")
# Check the dimensions and structure
dim(yrbss)
## [1] 13583 13
str(yrbss)
## 'data.frame': 13583 obs. of 13 variables:
## $ age : int 14 14 15 15 15 15 15 14 15 15 ...
## $ gender : chr "female" "female" "female" "female" ...
## $ grade : chr "9" "9" "9" "9" ...
## $ hispanic : chr "not" "not" "hispanic" "not" ...
## $ race : chr "Black or African American" "Black or African American" "Native Hawaiian or Other Pacific Islander" "Black or African American" ...
## $ height : num NA NA 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 ...
## $ weight : num NA NA 84.4 55.8 46.7 ...
## $ helmet_12m : chr "never" "never" "never" "never" ...
## $ text_while_driving_30d : chr "0" NA "30" "0" ...
## $ physically_active_7d : int 4 2 7 0 2 1 4 4 5 0 ...
## $ hours_tv_per_school_day : chr "5+" "5+" "5+" "2" ...
## $ strength_training_7d : int 0 0 0 0 1 0 2 0 3 0 ...
## $ school_night_hours_sleep: chr "8" "6" "<5" "6" ...
summary(yrbss)
## age gender grade hispanic
## Min. :12.00 Length:13583 Length:13583 Length:13583
## 1st Qu.:15.00 Class :character Class :character Class :character
## Median :16.00 Mode :character Mode :character Mode :character
## Mean :16.16
## 3rd Qu.:17.00
## Max. :18.00
## NA's :77
## race height weight helmet_12m
## Length:13583 Min. :1.270 Min. : 29.94 Length:13583
## Class :character 1st Qu.:1.600 1st Qu.: 56.25 Class :character
## Mode :character Median :1.680 Median : 64.41 Mode :character
## Mean :1.691 Mean : 67.91
## 3rd Qu.:1.780 3rd Qu.: 76.20
## Max. :2.110 Max. :180.99
## NA's :1004 NA's :1004
## text_while_driving_30d physically_active_7d hours_tv_per_school_day
## Length:13583 Min. :0.000 Length:13583
## Class :character 1st Qu.:2.000 Class :character
## Mode :character Median :4.000 Mode :character
## Mean :3.903
## 3rd Qu.:7.000
## Max. :7.000
## NA's :273
## strength_training_7d school_night_hours_sleep
## Min. :0.00 Length:13583
## 1st Qu.:0.00 Class :character
## Median :3.00 Mode :character
## Mean :2.95
## 3rd Qu.:5.00
## Max. :7.00
## NA's :1176
The outcome that we focus on is whether a student texted while driving in the last 30 days. The variable text_while_driving_30d records the number of days they texted or emailed while driving with values such as “0”, “1-2”, “3-5”, “6-9”, “10-19”, “20-29”, “30”, “did not drive”, and missing values.
Since we are using logistic regression, I will define a binary outcome:
0 = did not text while driving (reported “0” days)
1 = texted at least once (reported any positive number of days)
Students who answered “did not drive” do not have a well-defined risk of texting while driving (they did not drive at all), so they are excluded from the logistic regression analysis.
# Keep only students who reported on texting while driving and actually drove
yrbss_drivers <- yrbss |>
filter(!is.na(text_while_driving_30d),
text_while_driving_30d != "did not drive")
# Create a binary outcome:
# 0 = no texting (0 days), 1 = texted at least once (1+ days)
yrbss_drivers <- yrbss_drivers |>
mutate(texting_binary = ifelse(text_while_driving_30d == "0", 0, 1))
# Check counts of the new binary outcome
table(yrbss_drivers$texting_binary)
##
## 0 1
## 4792 3227
prop.table(table(yrbss_drivers$texting_binary))
##
## 0 1
## 0.5975807 0.4024193
This step restricts the analysis to students who drove and reported their texting behavior, and it simplifies the outcome into a clear binary indicator for our logistic regression.
Next, I will select only the variables needed for the model and remove rows with missing values in any of those variables. Because the dataset is large, deleting by row will still leave a substantial sample size for our logistic regression. It also ensures accuracy of our regression model.
# Select the model variables and drop rows with missing values
yrbss_model <- yrbss_drivers |>
select(texting_binary,
age,
gender,
race,
physically_active_7d,
hours_tv_per_school_day,
school_night_hours_sleep) |>
na.omit() # removes all rows containing any NA
# Check the new dimensions and a summary
dim(yrbss_model)
## [1] 5911 7
summary(yrbss_model)
## texting_binary age gender race
## Min. :0.0000 Min. :12.00 Length:5911 Length:5911
## 1st Qu.:0.0000 1st Qu.:16.00 Class :character Class :character
## Median :0.0000 Median :17.00 Mode :character Mode :character
## Mean :0.4236 Mean :16.53
## 3rd Qu.:1.0000 3rd Qu.:17.00
## Max. :1.0000 Max. :18.00
## physically_active_7d hours_tv_per_school_day school_night_hours_sleep
## Min. :0.00 Length:5911 Length:5911
## 1st Qu.:2.00 Class :character Class :character
## Median :4.00 Mode :character Mode :character
## Mean :4.05
## 3rd Qu.:7.00
## Max. :7.00
After filtering to drivers and removing incomplete cases, there are still 5,911 students remaining in the final analysis set. About 42% of these students report texting while driving at least once in the last 30 days, while about 58% report no texting while driving.
Logistic regression in R expects a binary outcome and can include both numeric and categorical predictors. I will now convert the new outcome to a factor with meaningful labels and make sure the categorical predictors are stored as factors:
# Prepare factors for logistic regression
yrbss_model <- yrbss_model |>
mutate(
texting_binary = factor(texting_binary,
levels = c(0, 1),
labels = c("NoText", "Texted")),
gender = factor(gender),
race = factor(race),
hours_tv_per_school_day = factor(hours_tv_per_school_day),
school_night_hours_sleep = factor(school_night_hours_sleep)
)
# Confirm structure
str(yrbss_model)
## 'data.frame': 5911 obs. of 7 variables:
## $ texting_binary : Factor w/ 2 levels "NoText","Texted": 1 2 1 1 1 1 1 1 1 1 ...
## $ age : int 14 15 15 15 16 15 15 14 16 15 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 1 1 ...
## $ race : Factor w/ 5 levels "American Indian or Alaska Native",..: 3 4 3 3 3 3 3 3 3 3 ...
## $ physically_active_7d : int 4 7 0 0 7 7 7 7 2 4 ...
## $ hours_tv_per_school_day : Factor w/ 7 levels "<1","1","2","3",..: 6 6 3 7 5 6 6 2 3 2 ...
## $ school_night_hours_sleep: Factor w/ 7 levels "<5","10+","5",..: 6 1 4 2 6 7 6 4 7 6 ...
## - attr(*, "na.action")= 'omit' Named int [1:2108] 11 12 16 22 50 64 68 89 121 151 ...
## ..- attr(*, "names")= chr [1:2108] "11" "12" "16" "22" ...
Before fitting the logistic regression model, I identified the baseline levels that R uses for each categorical predictor. The model estimates the log-odds of texting while driving relative to these baseline groups. The baselines for the categorical predictors are:
Gender: female
Race: American Indian or Alaska Native
Hours of TV per school day: <1 hour
School-night hours of sleep: <5 hours
(Age and physically active days are numeric variables and do not have baselines.)
The outcome variable’s baseline group is “NoText”, meaning the model predicts the log-odds of texting at least once relative to not texting while driving.
I will now fit a logistic regression model with glm():
# Logistic regression model
logit_mod <- glm(texting_binary ~ age +
gender +
race +
physically_active_7d +
hours_tv_per_school_day +
school_night_hours_sleep,
data = yrbss_model,
family = binomial)
# Model summary
summary(logit_mod)
##
## Call:
## glm(formula = texting_binary ~ age + gender + race + physically_active_7d +
## hours_tv_per_school_day + school_night_hours_sleep, family = binomial,
## data = yrbss_model)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -9.37210 0.49082 -19.095
## age 0.55447 0.02669 20.772
## gendermale 0.04990 0.05798 0.861
## raceAsian -0.25931 0.21903 -1.184
## raceBlack or African American -0.47752 0.17832 -2.678
## raceNative Hawaiian or Other Pacific Islander 0.14961 0.26504 0.564
## raceWhite 0.11606 0.17246 0.673
## physically_active_7d 0.03572 0.01154 3.095
## hours_tv_per_school_day1 0.09866 0.09837 1.003
## hours_tv_per_school_day2 -0.08333 0.09028 -0.923
## hours_tv_per_school_day3 -0.21139 0.09875 -2.141
## hours_tv_per_school_day4 -0.04886 0.12486 -0.391
## hours_tv_per_school_day5+ -0.10152 0.11353 -0.894
## hours_tv_per_school_daydo not watch -0.28277 0.10137 -2.790
## school_night_hours_sleep10+ -0.30383 0.21193 -1.434
## school_night_hours_sleep5 -0.02998 0.13164 -0.228
## school_night_hours_sleep6 -0.10038 0.12013 -0.836
## school_night_hours_sleep7 -0.14750 0.11782 -1.252
## school_night_hours_sleep8 -0.35980 0.12310 -2.923
## school_night_hours_sleep9 -0.31500 0.16123 -1.954
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## age < 2e-16 ***
## gendermale 0.38946
## raceAsian 0.23645
## raceBlack or African American 0.00741 **
## raceNative Hawaiian or Other Pacific Islander 0.57242
## raceWhite 0.50095
## physically_active_7d 0.00197 **
## hours_tv_per_school_day1 0.31591
## hours_tv_per_school_day2 0.35600
## hours_tv_per_school_day3 0.03231 *
## hours_tv_per_school_day4 0.69556
## hours_tv_per_school_day5+ 0.37122
## hours_tv_per_school_daydo not watch 0.00528 **
## school_night_hours_sleep10+ 0.15168
## school_night_hours_sleep5 0.81983
## school_night_hours_sleep6 0.40337
## school_night_hours_sleep7 0.21060
## school_night_hours_sleep8 0.00347 **
## school_night_hours_sleep9 0.05073 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8055.9 on 5910 degrees of freedom
## Residual deviance: 7408.3 on 5891 degrees of freedom
## AIC: 7448.3
##
## Number of Fisher Scoring iterations: 4
Here are key points from the logistic regression model summary:
The age coefficient is positive and statistically significant. This means that, holding other variables constant, older students are more likely to text while driving.
The coefficient for physically_active_7d is also positive and statistically significant, meaning that students who are physically active on more days per week have slightly higher odds of texting while driving.
Some race and TV categories differ from the reference levels. For example, one of the race categories (e.g., Black or African American compared to the baseline race) has a negative coefficient, indicating lower log-odds of texting while driving compared to the reference category, holding other variables constant.
Gender is has a small coefficient and is not strongly significant at the 0.05 level in this model, meaning there is not clear evidence of a large difference between males and females in the probability of texting while driving after controlling for other predictors.
To give a more understandable sense of the effect sizes, I convert the coefficients to odds ratios:
odds_ratios <- exp(coef(logit_mod))
odds_ratios
## (Intercept)
## 8.506427e-05
## age
## 1.741011e+00
## gendermale
## 1.051164e+00
## raceAsian
## 7.715847e-01
## raceBlack or African American
## 6.203199e-01
## raceNative Hawaiian or Other Pacific Islander
## 1.161384e+00
## raceWhite
## 1.123068e+00
## physically_active_7d
## 1.036365e+00
## hours_tv_per_school_day1
## 1.103690e+00
## hours_tv_per_school_day2
## 9.200438e-01
## hours_tv_per_school_day3
## 8.094616e-01
## hours_tv_per_school_day4
## 9.523125e-01
## hours_tv_per_school_day5+
## 9.034625e-01
## hours_tv_per_school_daydo not watch
## 7.536917e-01
## school_night_hours_sleep10+
## 7.379841e-01
## school_night_hours_sleep5
## 9.704625e-01
## school_night_hours_sleep6
## 9.044919e-01
## school_night_hours_sleep7
## 8.628617e-01
## school_night_hours_sleep8
## 6.978166e-01
## school_night_hours_sleep9
## 7.297899e-01
Odds Ratios should be interpreted as:
OR > 1 : higher odds of texting while driving
OR < 1 : lower odds of texting while driving
Numeric predictors (age, physically_active_7d) : effect applies per unit increase
Categorical predictors : effect compares to baseline level
Interpretation: For each one-year increase in age, the odds of texting while driving increase by a factor of 1.74, or 74% higher, holding all other variables constant.
This means older high school students are substantially more likely to text while driving compared to younger students.
Interpretation: Each additional day per week that a student is physically active is associated with approximately a 3.6% increase in the odds of texting while driving, controlling for all other predictors.
This is a small effect, but it is statistically significant.
Baseline race = American Indian or Alaska Native
Interpretation: Students who identify as Black or African American have 38% lower odds of texting while driving compared to the baseline race group (American Indian or Alaska Native), holding all other variables constant.
(1 − 0.62 = 0.38 : 38% lower)
This is a meaningful negative association.
Interpretation: Asian students have about 23% lower odds of texting while driving compared to the baseline race group, holding all other predictors constant.
(1 − 0.77 = 0.23 : 23% lower)
Baseline TV category = “<1 hour”
Interpretation: Students who watch 3 hours of television per school day have about 19% lower odds of texting while driving compared to students who watch less than 1 hour, controlling for other variables.
(1 − 0.81 = 0.19 : 19% lower)
Baseline sleep category = “<5 hours”
Interpretation: Students who typically sleep 9 hours per school night have about 27% lower odds of texting while driving compared to students who sleep less than 5 hours, holding other variables constant.
(1 − 0.73 = 0.27 : 27% lower)
To evaluate how well the model classifies students as “texted” or “did not text”, I check predicted probabilities and classify them with a threshold of 0.5. Then I build a confusion matrix and summary metrics (accuracy, sensitivity, specificity, precision).
# Predicted probabilities of texting while driving
yrbss_model$pred_prob <- predict(logit_mod, type = "response")
# Classify using a 0.5 threshold
yrbss_model$pred_class <- ifelse(yrbss_model$pred_prob >= 0.5, "Texted", "NoText")
yrbss_model$pred_class <- factor(yrbss_model$pred_class,
levels = c("NoText", "Texted"))
# Confusion matrix
conf_mat <- table(Actual = yrbss_model$texting_binary,
Predicted = yrbss_model$pred_class)
conf_mat
## Predicted
## Actual NoText Texted
## NoText 2580 827
## Texted 1220 1284
I then look at the performance metrics:
# Extract cells
TN <- conf_mat["NoText", "NoText"]
FP <- conf_mat["NoText", "Texted"]
FN <- conf_mat["Texted", "NoText"]
TP <- conf_mat["Texted", "Texted"]
# Performance metrics
accuracy <- (TP + TN) / sum(conf_mat)
sensitivity <- TP / (TP + FN) # True positive rate (recall)
specificity <- TN / (TN + FP) # True negative rate
precision <- TP / (TP + FP) # Positive predictive value
accuracy
## [1] 0.6536965
sensitivity
## [1] 0.5127796
specificity
## [1] 0.7572645
precision
## [1] 0.6082425
Accuracy: ~0.65
Sensitivity (Recall): ~0.51 (about 51% of students who texted are correctly identified)
Specificity: ~0.76 (about 76% of students who did not text are correctly identified)
Precision: ~0.61 (about 61% of students predicted to text actually did text)
This means the model is somewhat better at correctly identifying non-texters than texters at the default 0.5 threshold, and overall accuracy is moderate.
Finally, I check the ROC curve and the Area Under the Curve (AUC) to see the model’s overall performance across all thresholds:
# Create a numeric version of the outcome for ROC
yrbss_model$texting_num <- ifelse(yrbss_model$texting_binary == "Texted", 1, 0)
# ROC analysis
roc_obj <- roc(response = yrbss_model$texting_num,
predictor = yrbss_model$pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# AUC value
auc(roc_obj)
## Area under the curve: 0.6967
# Plot ROC curve with AUC printed on the plot
plot(roc_obj,
print.auc = TRUE,
legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)",
main = "ROC Curve for Texting While Driving Logistic Model")
For this model, the AUC is about 0.70.
Interpretation:
An AUC of 0.5 corresponds to random guessing.
An AUC of 1.0 corresponds to perfect discrimination.
An AUC of about 0.70 indicates that the model has a moderate ability to distinguish between students who text while driving and those who do not based on the predictors.
This project looked at whether a high school student’s likelihood of texting while driving can be predicted by their age, gender, race, physical activity level, hours of sleep on school nights, and hours of TV watched per school day using logistic regression. Here are the following main points:
Age is a strong positive predictor: Older high school students have higher odds of texting while driving compared to younger students (all other variables constant)
Physical activity shows a slight positive association, meaning that students who are more physically active have a slightly higher odds of texting while driving
Some race groups differ a lot from the baseline category, with some races having lower odds of texting while driving than the reference
TV watching and sleep patterns also show some differences, but most of the effects were not very strong when compared to the effects of a category like age
The model classification performance is moderate, with around 65% accuracy, higher specificity than sensitivity, and an AUC of about 0.70
Overall, the model shows that age and some behavioral and demographic categories are associated with a student’s liklihood to text while driving, but a lot of variation is still not explained by our predictors.
Limitations include:
The outcome is from self-reported behavior, which could be subject to bias
The model includes a limit set of variables that we can use to predict texting while driving
Students who answered “did not drive” were excluded from our model, so our data capture range was only 30 days (as stated in the dataset description)
Future directions could include involving more detailed variables relating to behavior, such as seatbelt use, alcohol use, or other risky behaviors. We could also explore interaction effects, such as whether the effect of age differs by gender or race (age x gender and age x race). Finally, we could find even more outside data from different locations to increase the cases and accuracy of our model.