Is the average top speed ever driven of males greater than the average top speed ever driven of females at UCLA?
My project focuses on the relationship between gender and the top speed ever driven among individuals at UCLA. The original dataset had 1325 observations on three variables, but I filtered it down to 882 observations on two variables so that I could answer my question. The three variables in the original dataset are speed (the top speed ever driven), gender (male or female), and height (the height of the person). My research question is: Is the average top speed ever driven of males greater than the average top speed ever driven of females at UCLA?
I got the dataset from OpenIntro (openintro.org). The overall dataset is about how gender and height relate to one’s top speed ever driven at UCLA. For my research question and statistical analysis, I am going to focus on speed and gender (the top speed ever driven and the gender of the person). The direct link to the dataset is: https://www.openintro.org/data/index.php?data=speed_gender_height
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Documents/Data 101")
speed_gender_height <- read.csv("speed_gender_height.csv")
In this chunk, I used colSums to count the NA values in each column of my main dataset, speed_gender_height. There are 18 NAs in speed (and 5 in height), so I will impute the missing speed values with the mean.
colSums(is.na(speed_gender_height))
## speed gender height
## 18 0 5
In this chunk, I used str to look at the types of the variables in my main dataset. We see that speed is an integer, gender is a character, and height is numeric.
str(speed_gender_height)
## 'data.frame': 1325 obs. of 3 variables:
## $ speed : int 85 40 87 110 110 120 90 90 80 95 ...
## $ gender: chr "female" "male" "female" "female" ...
## $ height: num 69 71 64 60 70 61 65 65 61 69 ...
I had 18 NA values for the speed variable in my main dataset, and since I am comparing the mean top speed of males and females, I decided to replace the NA values with the overall mean speed.
speed_mean <- mean(speed_gender_height$speed, na.rm = TRUE)
# Replace each missing speed with the overall mean; keep observed speeds as they are
speed_mean_imputed <- speed_gender_height |>
  mutate(speed_imputed = ifelse(is.na(speed),
                                speed_mean,
                                speed))
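As a sanity check on this step, here is a minimal sketch of the same mean imputation on a tiny made-up data frame. The names toy, toy_mean, and toy_imputed and all of the values are invented for illustration and are not from the UCLA data. It checks that no NA values remain and that the overall mean is unchanged, and it shows one side effect worth keeping in mind: filling in the mean pulls the standard deviation down.

```r
library(dplyr)

# Made-up example data: five drivers, two with missing top speeds
toy <- data.frame(speed = c(80L, NA, 100L, 120L, NA))

# Mean of the observed (non-missing) speeds
toy_mean <- mean(toy$speed, na.rm = TRUE)

# Replace each NA with that mean, keeping observed values as they are
toy_imputed <- toy |>
  mutate(speed_imputed = ifelse(is.na(speed), toy_mean, speed))

sum(is.na(toy_imputed$speed_imputed))  # counts missing values left after imputation
mean(toy_imputed$speed_imputed)        # overall mean after imputation
sd(toy$speed, na.rm = TRUE)            # spread of the observed values only
sd(toy_imputed$speed_imputed)          # spread after imputation
```

Because the imputed rows sit exactly at the mean, they add observations without adding spread, which is why the second standard deviation comes out smaller.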
In this chunk, I created the dataset that I am going to use for my visualization. I used select to keep only gender and the imputed speed variable, which are the two variables shown in the boxplot below.
speed_gender <- speed_mean_imputed |>
  select(gender, speed_imputed)
head(speed_gender)
## gender speed_imputed
## 1 female 85
## 2 male 40
## 3 female 87
## 4 female 110
## 5 male 110
## 6 female 120
When I set up the t.test, I found it simplest to give it two separate datasets, one with the male top speeds and the other with the female top speeds. In this chunk, I created a new dataset called speed_males, with only the speed values for the males, solely for statistical analysis.
speed_males <- speed_gender |>
  filter(gender == "male")
head(speed_males)
## gender speed_imputed
## 1 male 40
## 2 male 110
## 3 male 95
## 4 male 90
## 5 male 110
## 6 male 70
In this chunk, I did the same thing as above for the females: I created speed_females, with only the speed values for the females, solely for statistical analysis.
speed_females <- speed_gender |>
  filter(gender == "female")
head(speed_females)
## gender speed_imputed
## 1 female 85
## 2 female 87
## 3 female 110
## 4 female 120
## 5 female 90
## 6 female 90
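Splitting the data into two data frames works, but it is not the only way to give two groups to t.test. As a minimal sketch on made-up data (toy and its values are invented for illustration), the formula interface compares the groups directly from one data frame. One thing to watch: t.test treats the first factor level as group x, so the levels are set explicitly here to put "male" first, which makes alternative = "greater" mean male > female.

```r
# Made-up example data in the same shape as speed_gender
toy <- data.frame(
  gender = c("male", "female", "male", "female",
             "male", "female", "male", "female"),
  speed_imputed = c(100, 85, 110, 90, 95, 80, 105, 92)
)

# Put "male" first so it plays the role of group x in the test
toy$gender <- factor(toy$gender, levels = c("male", "female"))

# Formula interface: response ~ grouping variable, one data frame
formula_test <- t.test(speed_imputed ~ gender, data = toy,
                       alternative = "greater")

# Two-vector interface, as used in this report
two_vector_test <- t.test(toy$speed_imputed[toy$gender == "male"],
                          toy$speed_imputed[toy$gender == "female"],
                          alternative = "greater")

# Both calls run the same Welch test, so the results agree
all.equal(formula_test$p.value, two_vector_test$p.value)
```

Without the explicit levels, R would order the groups alphabetically ("female" first), and the same one-sided question would need alternative = "less".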
ggplot(speed_gender, aes(x = gender, y = speed_imputed, color = gender)) +
  geom_boxplot() +
  labs(
    title = "Average Top Speed Driven (mph) by Gender at UCLA",
    x = "Gender",
    y = "Speed (mph)",
    caption = "Source: UCLA"
  )
In this boxplot, we are looking at the relationship between gender and the top speed ever driven. The median speed for males (100 mph) is greater than the median speed for females (90 mph). Both males and females have outliers, including some extremely low and high speeds, but males have more high-speed outliers. Overall, the boxplot shows that males tend to have higher top speeds ever driven than females.
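Medians read off a boxplot can also be checked numerically. The sketch below does this with group_by and summarise on a small made-up data frame; toy_speed_gender, toy_summary, and the values are invented for illustration, so the numbers it produces are not the UCLA results.

```r
library(dplyr)

# Made-up example data in the same shape as speed_gender
toy_speed_gender <- data.frame(
  gender = c("male", "male", "male", "female", "female", "female"),
  speed_imputed = c(90, 100, 130, 80, 90, 95)
)

# One row per gender with the median, the mean, and the group size
toy_summary <- toy_speed_gender |>
  group_by(gender) |>
  summarise(
    median_speed = median(speed_imputed),
    mean_speed   = mean(speed_imputed),
    n            = n()
  )

toy_summary
```

Running the same summarise call on speed_gender would give the exact medians behind the boxplot, along with the group sizes.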
\(H_0\): \(\mu_1 = \mu_2\)
\(H_a\): \(\mu_1 > \mu_2\)
\(\mu_1\) = the average top speed ever driven of males
\(\mu_2\) = the average top speed ever driven of females
t.test(speed_males$speed_imputed, speed_females$speed_imputed, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: speed_males$speed_imputed and speed_females$speed_imputed
## t = 8.3213, df = 839.81, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 8.599455 Inf
## sample estimates:
## mean of x mean of y
## 97.86952 87.14852
The p-value is < 2.2e-16, which is far below our alpha of 0.05, so our findings are statistically significant and we reject the null hypothesis. The one-sided 95 percent confidence interval puts the difference in means at 8.6 mph or more. Therefore, we have strong evidence that the average top speed ever driven of males is greater than the average top speed ever driven of females.
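To make the t, df, and p-value in the output less of a black box, here is a sketch that computes the Welch statistic by hand and compares it with t.test. The two vectors are made up for illustration, so the numbers differ from the UCLA results. The statistic is \(t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}\), and the degrees of freedom use the Welch-Satterthwaite approximation, which is why df is not a whole number.

```r
# Made-up top speeds (mph) for two small groups
males   <- c(100, 110, 95, 105, 120)
females <- c(85, 90, 80, 92, 88)

n1 <- length(males)
n2 <- length(females)
v1 <- var(males) / n1    # squared standard error of the male mean
v2 <- var(females) / n2  # squared standard error of the female mean

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(males) - mean(females)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# One-sided p-value for the alternative "males > females"
p_value <- pt(t_stat, df, lower.tail = FALSE)

# Compare with R's built-in Welch test
welch <- t.test(males, females, alternative = "greater")
all.equal(t_stat, unname(welch$statistic))
all.equal(df, unname(welch$parameter))
all.equal(p_value, welch$p.value)
```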
To answer my research question: yes, the average top speed ever driven of males is greater than the average top speed ever driven of females at UCLA. The t.test output gives a mean of x (male top speed) of 97.86952 mph and a mean of y (female top speed) of 87.14852 mph, so the male mean is higher by about 10.7 mph. The boxplot backs up this result: the median top speed ever driven is greater for males than for females, and there are more high-speed outliers for males, which adds another aspect to the data. Our alpha level for this t.test was 0.05 and our p-value was < 2.2e-16, so we reject our null hypothesis. All in all, we found that the average top speed ever driven of males is greater than the average top speed ever driven of females at UCLA by about 10.7 mph.
These findings suggest that gender may play a role in driving behavior, particularly in relation to risk-taking and speed. Future research could look into this further and examine whether variables like age, driving experience, number of citations, and personality help explain the difference. One limitation is that the dataset wasn’t ideal for statistical analysis. It had NA values, and I also split it into separate male and female datasets so that I could pass the two groups to t.test. This split only changed how the data was organized; the statistically significant result comes from the difference between the two groups, not from the split itself. Overall, this project taught me a lot about cleaning data and creating new datasets, building meaningful visualizations, and explaining my results from the t.test.
speed_gender_height dataset. (n.d.). OpenIntro. Retrieved March 31, 2026, from https://www.openintro.org/data/index.php?data=speed_gender_height