Is the average top speed ever driven of males greater than the average top speed ever driven of females at UCLA?

Introduction

My project focused on the relationship between gender and the top speed ever driven by individuals at UCLA. The original dataset had 1325 observations on three variables, but I filtered it down to 882 observations on two variables to be able to answer my question. The three variables in the original dataset are speed (the top speed ever driven), gender (male or female), and height (the height of the person). My research question is: Is the average top speed ever driven of males greater than the average top speed ever driven of females at UCLA?

I got the dataset from openintro.org, and the overall dataset is about the relationship between gender and height and how they influence one’s top speed ever driven at UCLA. For my research question and statistical analysis, I am going to focus on speed and gender (the top speed ever driven and the gender of the person). The direct link to access the dataset is: https://www.openintro.org/data/index.php?data=speed_gender_height

Load the Library

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Set Working Directory and Load the Dataset

setwd("~/Documents/Data 101")
speed_gender_height <- read.csv("speed_gender_height.csv")

EDA and Cleaning + Explanation

In this chunk, I used colSums to count the NA values in each column of my main dataset, speed_gender_height. There are 18 NAs in speed and 5 in height. I do not use height, so only the 18 missing speeds need handling, and I will impute them with the mean.

colSums(is.na(speed_gender_height))
##  speed gender height 
##     18      0      5

In this chunk, I used str to look at the types of the variables in my main dataset. We see that speed is an integer, gender is a character, and height is numeric.

str(speed_gender_height)
## 'data.frame':    1325 obs. of  3 variables:
##  $ speed : int  85 40 87 110 110 120 90 90 80 95 ...
##  $ gender: chr  "female" "male" "female" "female" ...
##  $ height: num  69 71 64 60 70 61 65 65 61 69 ...

I had 18 NA values for the speed variable in my main dataset. Since I am comparing the mean top speeds of males and females, I decided to replace those NA values with the overall mean speed.

speed_mean <- mean(speed_gender_height$speed, na.rm = TRUE)

speed_mean_imputed <- speed_gender_height |>
  mutate(speed_imputed = ifelse(is.na(speed), speed_mean, speed))
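As a small check on what mean imputation does, here is a minimal sketch in base R. The speeds are made-up toy values, not the UCLA data. It shows that filling NAs with the overall mean leaves the mean unchanged but shrinks the variance, which is worth keeping in mind when reading the test results.

```r
# Hypothetical toy speeds (not the UCLA data), with two missing values
speed <- c(80, NA, 100, 90, NA)

# Overall mean of the observed values
speed_mean <- mean(speed, na.rm = TRUE)

# Replace each NA with that mean, mirroring the imputation step
speed_imputed <- ifelse(is.na(speed), speed_mean, speed)

# Compare the mean and the variance before and after imputation
mean_before <- mean(speed, na.rm = TRUE)
mean_after  <- mean(speed_imputed)
var_before  <- var(speed, na.rm = TRUE)
var_after   <- var(speed_imputed)

# The mean stays the same; the variance gets smaller because the
# imputed values sit exactly at the center of the data
c(mean_before = mean_before, mean_after = mean_after)
c(var_before = var_before, var_after = var_after)
```

With only 18 imputed values out of 1325, the effect here is likely small, but it would grow if more values were missing.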

In this chunk, I created the dataset that I am going to use for my visualization. I used select to keep only gender and the imputed speed variable, which are the two variables shown in the boxplot below.

speed_gender <- speed_mean_imputed |>
  select(gender, speed_imputed) 
head(speed_gender)
##   gender speed_imputed
## 1 female            85
## 2   male            40
## 3 female            87
## 4 female           110
## 5   male           110
## 6 female           120

When I tried doing the t.test, I realized that I needed two separate datasets, one with the males' top speeds and the other with the females' top speeds, to run the test the way I had set it up. In this chunk, I created a new dataset called speed_males, with only the rows for males, solely for the statistical analysis.

speed_males <- speed_gender |>
  filter(gender == "male")
head(speed_males)
##   gender speed_imputed
## 1   male            40
## 2   male           110
## 3   male            95
## 4   male            90
## 5   male           110
## 6   male            70

In this chunk, I did the same thing as above, except I kept only the rows for females, again solely for the statistical analysis.

speed_females <- speed_gender |>
  filter(gender == "female")
head(speed_females)
##   gender speed_imputed
## 1 female            85
## 2 female            87
## 3 female           110
## 4 female           120
## 5 female            90
## 6 female            90
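Splitting the data into two datasets works, but t.test also has a formula interface that takes one data frame with a grouping column. The sketch below uses made-up toy values, not the UCLA data, and the names toy, res_formula, and res_split are only for illustration. The factor levels are set explicitly because the formula version tests the first level minus the second, and the default alphabetical order would put "female" first.

```r
# Hypothetical toy data (not the UCLA survey)
toy <- data.frame(
  gender = c("male", "male", "male", "male",
             "female", "female", "female", "female"),
  speed  = c(100, 110, 95, 105, 85, 90, 80, 95)
)

# Put "male" first so the test is male mean minus female mean
toy$gender <- factor(toy$gender, levels = c("male", "female"))

# Formula interface: one data frame, no manual splitting
res_formula <- t.test(speed ~ gender, data = toy, alternative = "greater")

# Two-vector interface, equivalent to using split datasets
res_split <- t.test(toy$speed[toy$gender == "male"],
                    toy$speed[toy$gender == "female"],
                    alternative = "greater")

# Both calls run a Welch test and give the same t statistic and p-value
c(formula = unname(res_formula$statistic), split = unname(res_split$statistic))
c(formula = res_formula$p.value, split = res_split$p.value)
```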

Visualization

ggplot(speed_gender,aes(x = gender, y = speed_imputed, color = gender)) +
  geom_boxplot() + 
  labs(
    title = "Top Speed Ever Driven (mph) by Gender at UCLA",
    x = "Gender",
    y = "Speed (mph)",
    caption = "Source: UCLA"
  )

In this boxplot, we are looking at the relationship between gender and the top speed ever driven. The boxplot shows that the median speed for males is greater than the median speed for females, with the median for males being 100 mph and the median for females being 90 mph. Both males and females have outliers, including some extremely low and high speeds, but overall males have more high-speed outliers. The boxplot shows that males tend to have higher top speeds ever driven than females.
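The medians read off a boxplot can also be computed directly. This is a minimal base R sketch on made-up toy values, not the UCLA data. The data frame toy and its columns are hypothetical stand-ins for speed_gender and its gender and speed_imputed columns.

```r
# Hypothetical toy data (not the UCLA survey)
toy <- data.frame(
  gender = c("male", "male", "male", "male",
             "female", "female", "female", "female"),
  speed  = c(100, 110, 95, 105, 85, 90, 80, 95)
)

# Median and mean top speed within each gender
medians <- tapply(toy$speed, toy$gender, median)
means   <- tapply(toy$speed, toy$gender, mean)

# Number of people in each group
counts <- table(toy$gender)

medians
means
counts
```

Applied to the real data, the same calls would give exact group medians to quote alongside the plot.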

Statistical Analysis

\(H_0\): \(\mu_1 = \mu_2\)

\(H_a\): \(\mu_1 > \mu_2\)

\(\mu_1\) = The average top speed ever driven of males

\(\mu_2\) = The average top speed ever driven of females

t.test(speed_males$speed_imputed, speed_females$speed_imputed, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  speed_males$speed_imputed and speed_females$speed_imputed
## t = 8.3213, df = 839.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  8.599455      Inf
## sample estimates:
## mean of x mean of y 
##  97.86952  87.14852

The p-value is < 2.2e-16, which is far less than our alpha of 0.05, so our findings are statistically significant. This means that we reject the null hypothesis. Therefore, the data provide strong evidence that the average top speed ever driven of males is greater than the average top speed ever driven of females.
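To show where the t, df, and p-value in this kind of output come from, here is a minimal sketch that computes the Welch statistic by hand and compares it with t.test. The vectors x and y are made-up toy values, not the UCLA data, and the names t_manual, df_manual, and p_manual are only for illustration.

```r
# Hypothetical toy speeds (not the UCLA data)
x <- c(100, 110, 95, 105)  # stand-in for male top speeds
y <- c(85, 90, 80, 95)     # stand-in for female top speeds

# Each group's variance divided by its sample size
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for the alternative "greater"
p_manual <- pt(t_manual, df_manual, lower.tail = FALSE)

# Same quantities from t.test, for comparison
res <- t.test(x, y, alternative = "greater")

c(manual = t_manual, t.test = unname(res$statistic))
c(manual = df_manual, t.test = unname(res$parameter))
c(manual = p_manual, t.test = res$p.value)
```

The degrees of freedom are usually not a whole number in a Welch test, which is why the output above reports df = 839.81.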

General Conclusion

Summary of Findings and Statistical Analysis

To answer my question, "Is the average top speed ever driven of males greater than the average top speed ever driven of females at UCLA?": yes, the average top speed ever driven of males is greater than the average top speed ever driven of females at UCLA. In the t.test output, the mean of x (male average top speed) is 97.86952 and the mean of y (female average top speed) is 87.14852. This means that the male average top speed is greater than the female average by about 10.7 mph. The boxplot backs up this claim, as it shows that the median top speed ever driven is greater for males than for females at UCLA, and there are more high-speed outliers for males, which adds another aspect to the data. Our alpha level for this t.test was 0.05, and our p-value was < 2.2e-16, which means that we reject our null hypothesis. All in all, we found that the average top speed ever driven of males is greater than the average top speed ever driven of females at UCLA, by about 10.7 mph.

Implications and Areas for Future Research

These findings suggest that gender may play a role in driving behavior, particularly in relation to risk-taking and speed. Future research could look into this further and examine whether variables like age, driving experience, the number of citations, and personality help explain the difference. One limitation is that the dataset wasn’t ideal for statistical analysis. It had NA values, and I needed to split it into separate male and female datasets to run the t.test the way I had set it up. Splitting the data only made the test easier to run; it did not change the result. Overall, this project taught me a lot about cleaning data and creating new datasets, creating meaningful visualizations, and explaining my results from the t.test.

Works Cited

speed_gender_height dataset. (n.d.). OpenIntro. Retrieved March 31, 2026, from https://www.openintro.org/data/index.php?data=speed_gender_height