Analysis of Texting While Driving Among U.S. High School Students

Introduction

Research Question: Is a high school student’s likelihood of texting while driving predicted by their age, gender, race, physical activity level, average hours of sleep on school nights, and hours of television watched per school day?

The dataset used in this project comes from the Youth Risk Behavior Surveillance System (YRBSS), which contains survey responses from high school students collected between 1991 and 2013 across the United States. The data includes 13,583 observations on 13 variables, including demographic information (age, gender, race), health behaviors, and risky behaviors such as texting while driving.

For this project, I use the yrbss dataset (available from the OpenIntro resources), focusing specifically on the following variables:

  • text_while_driving_30d: self-reported days texting or emailing while driving in the last 30 days
  • age: age in years (12–18)
  • gender: male or female
  • race: race category
  • physically_active_7d: days per week physically active
  • hours_tv_per_school_day: average hours of TV per school day
  • school_night_hours_sleep: typical hours of sleep on school nights

I chose this topic because distracted driving (especially texting while driving) is a major public safety concern, and it is relevant for teenagers who are newer drivers. Understanding how texting while driving relates to age, gender, race, physical activity, sleep, and TV use can help highlight which groups of students may be at higher risk of this dangerous behavior.

Data Analysis

In this section, I describe the data cleaning process and preparation steps used to make the dataset suitable for logistic regression:

Loading and inspecting the data

# Load required packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Read the YRBSS dataset
yrbss <- read.csv("yrbss.csv")

# Check the dimensions and structure
dim(yrbss)
## [1] 13583    13
str(yrbss)
## 'data.frame':    13583 obs. of  13 variables:
##  $ age                     : int  14 14 15 15 15 15 15 14 15 15 ...
##  $ gender                  : chr  "female" "female" "female" "female" ...
##  $ grade                   : chr  "9" "9" "9" "9" ...
##  $ hispanic                : chr  "not" "not" "hispanic" "not" ...
##  $ race                    : chr  "Black or African American" "Black or African American" "Native Hawaiian or Other Pacific Islander" "Black or African American" ...
##  $ height                  : num  NA NA 1.73 1.6 1.5 1.57 1.65 1.88 1.75 1.37 ...
##  $ weight                  : num  NA NA 84.4 55.8 46.7 ...
##  $ helmet_12m              : chr  "never" "never" "never" "never" ...
##  $ text_while_driving_30d  : chr  "0" NA "30" "0" ...
##  $ physically_active_7d    : int  4 2 7 0 2 1 4 4 5 0 ...
##  $ hours_tv_per_school_day : chr  "5+" "5+" "5+" "2" ...
##  $ strength_training_7d    : int  0 0 0 0 1 0 2 0 3 0 ...
##  $ school_night_hours_sleep: chr  "8" "6" "<5" "6" ...
summary(yrbss)
##       age           gender             grade             hispanic        
##  Min.   :12.00   Length:13583       Length:13583       Length:13583      
##  1st Qu.:15.00   Class :character   Class :character   Class :character  
##  Median :16.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :16.16                                                           
##  3rd Qu.:17.00                                                           
##  Max.   :18.00                                                           
##  NA's   :77                                                              
##      race               height          weight        helmet_12m       
##  Length:13583       Min.   :1.270   Min.   : 29.94   Length:13583      
##  Class :character   1st Qu.:1.600   1st Qu.: 56.25   Class :character  
##  Mode  :character   Median :1.680   Median : 64.41   Mode  :character  
##                     Mean   :1.691   Mean   : 67.91                     
##                     3rd Qu.:1.780   3rd Qu.: 76.20                     
##                     Max.   :2.110   Max.   :180.99                     
##                     NA's   :1004    NA's   :1004                       
##  text_while_driving_30d physically_active_7d hours_tv_per_school_day
##  Length:13583           Min.   :0.000        Length:13583           
##  Class :character       1st Qu.:2.000        Class :character       
##  Mode  :character       Median :4.000        Mode  :character       
##                         Mean   :3.903                               
##                         3rd Qu.:7.000                               
##                         Max.   :7.000                               
##                         NA's   :273                                 
##  strength_training_7d school_night_hours_sleep
##  Min.   :0.00         Length:13583            
##  1st Qu.:0.00         Class :character        
##  Median :3.00         Mode  :character        
##  Mean   :2.95                                 
##  3rd Qu.:5.00                                 
##  Max.   :7.00                                 
##  NA's   :1176

The outcome that we focus on is whether a student texted while driving in the last 30 days. The variable text_while_driving_30d records the number of days they texted or emailed while driving with values such as “0”, “1-2”, “3-5”, “6-9”, “10-19”, “20-29”, “30”, “did not drive”, and missing values.

Since we are using logistic regression, I will define a binary outcome:

0 = did not text while driving (reported “0” days)

1 = texted at least once (reported any positive number of days)

Students who answered “did not drive” do not have a well-defined risk of texting while driving (they did not drive at all), so they are excluded from the logistic regression analysis.

# Keep only students who reported on texting while driving and actually drove

yrbss_drivers <- yrbss |>
filter(!is.na(text_while_driving_30d),
text_while_driving_30d != "did not drive")

# Create a binary outcome:

# 0 = no texting (0 days), 1 = texted at least once (1+ days)

yrbss_drivers <- yrbss_drivers |>
mutate(texting_binary = ifelse(text_while_driving_30d == "0", 0, 1))

# Check counts of the new binary outcome

table(yrbss_drivers$texting_binary)
## 
##    0    1 
## 4792 3227
prop.table(table(yrbss_drivers$texting_binary))
## 
##         0         1 
## 0.5975807 0.4024193

This step restricts the analysis to students who drove and reported their texting behavior, and it simplifies the outcome into a clear binary indicator for our logistic regression.

Selecting variables and handling missing values

Next, I will select only the variables needed for the model and remove rows with missing values in any of those variables. Because the dataset is large, deleting by row will still leave a substantial sample size for our logistic regression. It also ensures accuracy of our regression model.

# Select the model variables and drop rows with missing values
yrbss_model <- yrbss_drivers |>
  select(texting_binary,
         age,
         gender,
         race,
         physically_active_7d,
         hours_tv_per_school_day,
         school_night_hours_sleep) |>
  na.omit()     # removes all rows containing any NA

# Check the new dimensions and a summary
dim(yrbss_model)
## [1] 5911    7
summary(yrbss_model)
##  texting_binary        age           gender              race          
##  Min.   :0.0000   Min.   :12.00   Length:5911        Length:5911       
##  1st Qu.:0.0000   1st Qu.:16.00   Class :character   Class :character  
##  Median :0.0000   Median :17.00   Mode  :character   Mode  :character  
##  Mean   :0.4236   Mean   :16.53                                        
##  3rd Qu.:1.0000   3rd Qu.:17.00                                        
##  Max.   :1.0000   Max.   :18.00                                        
##  physically_active_7d hours_tv_per_school_day school_night_hours_sleep
##  Min.   :0.00         Length:5911             Length:5911             
##  1st Qu.:2.00         Class :character        Class :character        
##  Median :4.00         Mode  :character        Mode  :character        
##  Mean   :4.05                                                         
##  3rd Qu.:7.00                                                         
##  Max.   :7.00

After filtering to drivers and removing incomplete cases, there are still 5,911 students remaining in the final analysis set. About 42% of these students report texting while driving at least once in the last 30 days, while about 58% report no texting while driving.

Converting variables to appropriate types

Logistic regression in R expects a binary outcome and can include both numeric and categorical predictors. I will now convert the new outcome to a factor with meaningful labels and make sure the categorical predictors are stored as factors:

# Prepare factors for logistic regression

yrbss_model <- yrbss_model |>
mutate(
texting_binary = factor(texting_binary,
levels = c(0, 1),
labels = c("NoText", "Texted")),
gender = factor(gender),
race = factor(race),
hours_tv_per_school_day = factor(hours_tv_per_school_day),
school_night_hours_sleep = factor(school_night_hours_sleep)
)

# Confirm structure

str(yrbss_model)
## 'data.frame':    5911 obs. of  7 variables:
##  $ texting_binary          : Factor w/ 2 levels "NoText","Texted": 1 2 1 1 1 1 1 1 1 1 ...
##  $ age                     : int  14 15 15 15 16 15 15 14 16 15 ...
##  $ gender                  : Factor w/ 2 levels "female","male": 1 1 1 1 2 2 2 2 1 1 ...
##  $ race                    : Factor w/ 5 levels "American Indian or Alaska Native",..: 3 4 3 3 3 3 3 3 3 3 ...
##  $ physically_active_7d    : int  4 7 0 0 7 7 7 7 2 4 ...
##  $ hours_tv_per_school_day : Factor w/ 7 levels "<1","1","2","3",..: 6 6 3 7 5 6 6 2 3 2 ...
##  $ school_night_hours_sleep: Factor w/ 7 levels "<5","10+","5",..: 6 1 4 2 6 7 6 4 7 6 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2108] 11 12 16 22 50 64 68 89 121 151 ...
##   ..- attr(*, "names")= chr [1:2108] "11" "12" "16" "22" ...

Baseline Categories

Before fitting the logistic regression model, I identified the baseline levels that R uses for each categorical predictor. The model estimates the log-odds of texting while driving relative to these baseline groups. The baselines for the categorical predictors are:

  • Gender: female

  • Race: American Indian or Alaska Native

  • Hours of TV per school day: <1 hour

  • School-night hours of sleep: <5 hours

  • (Age and physically active days are numeric variables and do not have baselines.)

  • The outcome variable’s baseline group is “NoText”, meaning the model predicts the log-odds of texting at least once relative to not texting while driving.

Fitting the logistic regression model

I will now fit a logistic regression model with glm():

# Logistic regression model

logit_mod <- glm(texting_binary ~ age +
gender +
race +
physically_active_7d +
hours_tv_per_school_day +
school_night_hours_sleep,
data = yrbss_model,
family = binomial)

# Model summary

summary(logit_mod)
## 
## Call:
## glm(formula = texting_binary ~ age + gender + race + physically_active_7d + 
##     hours_tv_per_school_day + school_night_hours_sleep, family = binomial, 
##     data = yrbss_model)
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                   -9.37210    0.49082 -19.095
## age                                            0.55447    0.02669  20.772
## gendermale                                     0.04990    0.05798   0.861
## raceAsian                                     -0.25931    0.21903  -1.184
## raceBlack or African American                 -0.47752    0.17832  -2.678
## raceNative Hawaiian or Other Pacific Islander  0.14961    0.26504   0.564
## raceWhite                                      0.11606    0.17246   0.673
## physically_active_7d                           0.03572    0.01154   3.095
## hours_tv_per_school_day1                       0.09866    0.09837   1.003
## hours_tv_per_school_day2                      -0.08333    0.09028  -0.923
## hours_tv_per_school_day3                      -0.21139    0.09875  -2.141
## hours_tv_per_school_day4                      -0.04886    0.12486  -0.391
## hours_tv_per_school_day5+                     -0.10152    0.11353  -0.894
## hours_tv_per_school_daydo not watch           -0.28277    0.10137  -2.790
## school_night_hours_sleep10+                   -0.30383    0.21193  -1.434
## school_night_hours_sleep5                     -0.02998    0.13164  -0.228
## school_night_hours_sleep6                     -0.10038    0.12013  -0.836
## school_night_hours_sleep7                     -0.14750    0.11782  -1.252
## school_night_hours_sleep8                     -0.35980    0.12310  -2.923
## school_night_hours_sleep9                     -0.31500    0.16123  -1.954
##                                               Pr(>|z|)    
## (Intercept)                                    < 2e-16 ***
## age                                            < 2e-16 ***
## gendermale                                     0.38946    
## raceAsian                                      0.23645    
## raceBlack or African American                  0.00741 ** 
## raceNative Hawaiian or Other Pacific Islander  0.57242    
## raceWhite                                      0.50095    
## physically_active_7d                           0.00197 ** 
## hours_tv_per_school_day1                       0.31591    
## hours_tv_per_school_day2                       0.35600    
## hours_tv_per_school_day3                       0.03231 *  
## hours_tv_per_school_day4                       0.69556    
## hours_tv_per_school_day5+                      0.37122    
## hours_tv_per_school_daydo not watch            0.00528 ** 
## school_night_hours_sleep10+                    0.15168    
## school_night_hours_sleep5                      0.81983    
## school_night_hours_sleep6                      0.40337    
## school_night_hours_sleep7                      0.21060    
## school_night_hours_sleep8                      0.00347 ** 
## school_night_hours_sleep9                      0.05073 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8055.9  on 5910  degrees of freedom
## Residual deviance: 7408.3  on 5891  degrees of freedom
## AIC: 7448.3
## 
## Number of Fisher Scoring iterations: 4

Here are key points from the logistic regression model summary:

  • The age coefficient is positive and statistically significant. This means that, holding other variables constant, older students are more likely to text while driving.

  • The coefficient for physically_active_7d is also positive and statistically significant, meaning that students who are physically active on more days per week have slightly higher odds of texting while driving.

  • Some race and TV categories differ from the reference levels. For example, one of the race categories (e.g., Black or African American compared to the baseline race) has a negative coefficient, indicating lower log-odds of texting while driving compared to the reference category, holding other variables constant.

  • Gender is has a small coefficient and is not strongly significant at the 0.05 level in this model, meaning there is not clear evidence of a large difference between males and females in the probability of texting while driving after controlling for other predictors.

Converting coefficients to odds ratios

To give a more understandable sense of the effect sizes, I convert the coefficients to odds ratios:

odds_ratios <- exp(coef(logit_mod))
odds_ratios
##                                   (Intercept) 
##                                  8.506427e-05 
##                                           age 
##                                  1.741011e+00 
##                                    gendermale 
##                                  1.051164e+00 
##                                     raceAsian 
##                                  7.715847e-01 
##                 raceBlack or African American 
##                                  6.203199e-01 
## raceNative Hawaiian or Other Pacific Islander 
##                                  1.161384e+00 
##                                     raceWhite 
##                                  1.123068e+00 
##                          physically_active_7d 
##                                  1.036365e+00 
##                      hours_tv_per_school_day1 
##                                  1.103690e+00 
##                      hours_tv_per_school_day2 
##                                  9.200438e-01 
##                      hours_tv_per_school_day3 
##                                  8.094616e-01 
##                      hours_tv_per_school_day4 
##                                  9.523125e-01 
##                     hours_tv_per_school_day5+ 
##                                  9.034625e-01 
##           hours_tv_per_school_daydo not watch 
##                                  7.536917e-01 
##                   school_night_hours_sleep10+ 
##                                  7.379841e-01 
##                     school_night_hours_sleep5 
##                                  9.704625e-01 
##                     school_night_hours_sleep6 
##                                  9.044919e-01 
##                     school_night_hours_sleep7 
##                                  8.628617e-01 
##                     school_night_hours_sleep8 
##                                  6.978166e-01 
##                     school_night_hours_sleep9 
##                                  7.297899e-01

Odds Ratios should be interpreted as:

  • OR > 1 : higher odds of texting while driving

  • OR < 1 : lower odds of texting while driving

  • Numeric predictors (age, physically_active_7d) : effect applies per unit increase

Categorical predictors : effect compares to baseline level

Interpretation of Odds Ratios:

  1. Age — OR = 1.741011e+00 (≈ 1.74)

Interpretation: For each one-year increase in age, the odds of texting while driving increase by a factor of 1.74, or 74% higher, holding all other variables constant.

This means older high school students are substantially more likely to text while driving compared to younger students.

  1. Physically Active Days — OR = 1.036365e+00 (≈ 1.04)

Interpretation: Each additional day per week that a student is physically active is associated with approximately a 3.6% increase in the odds of texting while driving, controlling for all other predictors.

This is a small effect, but it is statistically significant.

  1. Race: Black or African American — OR = 0.6203199 (≈ 0.62)

Baseline race = American Indian or Alaska Native

Interpretation: Students who identify as Black or African American have 38% lower odds of texting while driving compared to the baseline race group (American Indian or Alaska Native), holding all other variables constant.

(1 − 0.62 = 0.38 : 38% lower)

This is a meaningful negative association.

  1. Race: Asian — OR = 0.7715847 (≈ 0.77)

Interpretation: Asian students have about 23% lower odds of texting while driving compared to the baseline race group, holding all other predictors constant.

(1 − 0.77 = 0.23 : 23% lower)

  1. Hours of TV per School Day: 3 Hours — OR = 0.8094616 (≈ 0.81)

Baseline TV category = “<1 hour”

Interpretation: Students who watch 3 hours of television per school day have about 19% lower odds of texting while driving compared to students who watch less than 1 hour, controlling for other variables.

(1 − 0.81 = 0.19 : 19% lower)

  1. Hours of Sleep: 9 Hours — OR = 0.729899e+00 (≈ 0.73)

Baseline sleep category = “<5 hours”

Interpretation: Students who typically sleep 9 hours per school night have about 27% lower odds of texting while driving compared to students who sleep less than 5 hours, holding other variables constant.

(1 − 0.73 = 0.27 : 27% lower)

Confusion matrix and classification performance

To evaluate how well the model classifies students as “texted” or “did not text”, I check predicted probabilities and classify them with a threshold of 0.5. Then I build a confusion matrix and summary metrics (accuracy, sensitivity, specificity, precision).

# Predicted probabilities of texting while driving

yrbss_model$pred_prob <- predict(logit_mod, type = "response")

# Classify using a 0.5 threshold

yrbss_model$pred_class <- ifelse(yrbss_model$pred_prob >= 0.5, "Texted", "NoText")
yrbss_model$pred_class <- factor(yrbss_model$pred_class,
levels = c("NoText", "Texted"))

# Confusion matrix

conf_mat <- table(Actual = yrbss_model$texting_binary,
Predicted = yrbss_model$pred_class)
conf_mat
##         Predicted
## Actual   NoText Texted
##   NoText   2580    827
##   Texted   1220   1284

I then look at the performance metrics:

# Extract cells

TN <- conf_mat["NoText", "NoText"]
FP <- conf_mat["NoText", "Texted"]
FN <- conf_mat["Texted", "NoText"]
TP <- conf_mat["Texted", "Texted"]

# Performance metrics

accuracy <- (TP + TN) / sum(conf_mat)
sensitivity <- TP / (TP + FN)      # True positive rate (recall)
specificity <- TN / (TN + FP)      # True negative rate
precision <- TP / (TP + FP)        # Positive predictive value

accuracy
## [1] 0.6536965
sensitivity
## [1] 0.5127796
specificity
## [1] 0.7572645
precision
## [1] 0.6082425
  • Accuracy: ~0.65

  • Sensitivity (Recall): ~0.51 (about 51% of students who texted are correctly identified)

  • Specificity: ~0.76 (about 76% of students who did not text are correctly identified)

  • Precision: ~0.61 (about 61% of students predicted to text actually did text)

  • This means the model is somewhat better at correctly identifying non-texters than texters at the default 0.5 threshold, and overall accuracy is moderate.

ROC curve and AUC

Finally, I check the ROC curve and the Area Under the Curve (AUC) to see the model’s overall performance across all thresholds:

# Create a numeric version of the outcome for ROC

yrbss_model$texting_num <- ifelse(yrbss_model$texting_binary == "Texted", 1, 0)

# ROC analysis

roc_obj <- roc(response = yrbss_model$texting_num,
predictor = yrbss_model$pred_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# AUC value

auc(roc_obj)
## Area under the curve: 0.6967
# Plot ROC curve with AUC printed on the plot

plot(roc_obj,
print.auc = TRUE,
legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)",
main = "ROC Curve for Texting While Driving Logistic Model")

For this model, the AUC is about 0.70.

Interpretation:

  • An AUC of 0.5 corresponds to random guessing.

  • An AUC of 1.0 corresponds to perfect discrimination.

  • An AUC of about 0.70 indicates that the model has a moderate ability to distinguish between students who text while driving and those who do not based on the predictors.

Conclusion and Future Directions

This project looked at whether a high school student’s likelihood of texting while driving can be predicted by their age, gender, race, physical activity level, hours of sleep on school nights, and hours of TV watched per school day using logistic regression. Here are the following main points:

  • Age is a strong positive predictor: Older high school students have higher odds of texting while driving compared to younger students (all other variables constant)

  • Physical activity shows a slight positive association, meaning that students who are more physically active have a slightly higher odds of texting while driving

  • Some race groups differ a lot from the baseline category, with some races having lower odds of texting while driving than the reference

  • TV watching and sleep patterns also show some differences, but most of the effects were not very strong when compared to the effects of a category like age

  • The model classification performance is moderate, with around 65% accuracy, higher specificity than sensitivity, and an AUC of about 0.70

Overall, the model shows that age and some behavioral and demographic categories are associated with a student’s liklihood to text while driving, but a lot of variation is still not explained by our predictors.

Limitations include:

  • The outcome is from self-reported behavior, which could be subject to bias

  • The model includes a limit set of variables that we can use to predict texting while driving

  • Students who answered “did not drive” were excluded from our model, so our data capture range was only 30 days (as stated in the dataset description)

Future directions could include involving more detailed variables relating to behavior, such as seatbelt use, alcohol use, or other risky behaviors. We could also explore interaction effects, such as whether the effect of age differs by gender or race (age x gender and age x race). Finally, we could find even more outside data from different locations to increase the cases and accuracy of our model.