Final Project Data 110

Author

Cowan

Miss Universe Winners (1952-2025)

(Source: The New York Times ‘Mexico Wins Miss Universe Pageant Marred by Scandal’. https://www.nytimes.com/2025/11/21/world/asia/miss-universe-mexico.html.)

Introduction

Miss Universe is an annual international beauty competition in which women representing their countries come together to showcase their culture and leadership and to advocate for what they believe in. The competition consists of the preliminaries, semi-finals, and Q&A, and finally the winner is crowned after a judging panel and viewer vote. For this project I will use a data set built from tables scraped from Wikipedia, which collects information from the Miss Universe show and from articles about Miss Universe all over the internet, to see if there is a correlation between the number of wins a country has and where the competition was hosted. I would also like to visualize the number of contestants and the age of the winners as time goes on, and finally map out where each winner was from. I chose this topic because, just a few weeks before I had to start thinking about my data set, Miss Universe announced its 2025 winner on November 21. While I have nothing against the winner, Miss Mexico, I don’t think she was the best candidate, and I think the win was given to her because, shortly before the crowning, she was singled out and bullied by a pageant director. From this, I was inspired to dig a little deeper and see how Miss Universe is chosen amid so many beautiful, intelligent, and strong women.

Setting up my data set

Here I loaded my necessary libraries.

library(plotly)

Loading required package: ggplot2


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(rvest)
library(leaflet)
library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.0.4     ✔ tidyr     1.3.1
✔ readr     2.1.5

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks plotly::filter(), stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Now I’m loading my CSV file and the web pages, and pulling the exact tables that I will use in my project into RStudio.

#loading in 1st table from Wikipedia
winners_url <- "https://en.wikipedia.org/wiki/List_of_Miss_Universe_titleholders"
winners_page <- read_html(winners_url)

winners_tables <- winners_page |>
  html_nodes("table.wikitable") |>
  html_table()
winners_data <- winners_tables[[1]]

#loading 2nd table I will use
location_url <- "https://en.wikipedia.org/wiki/List_of_Miss_Universe_editions"
location_page <- read_html(location_url)

location_tables <- location_page |>
  html_nodes("table.wikitable") |>
  html_table()
location_data <- location_tables[[1]]

#loading csv with country coordinates
countries_data <- read_csv("world_country_and_usa_states_latitude_and_longitude_values.csv")

Rows: 245 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): country_code, country, usa_state_code, usa_state
dbl (4): latitude, longitude, usa_state_latitude, usa_state_longitude

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now I am cleaning my data and merging it so that I have one data set to work from for my project.

#cleaning the headers and deleting one row from 2002, where two winners were listed
names(winners_data) <- c("year", "country", "name", "age", "hometown", "nat_title", "date", "entrants")
winners_data <- winners_data[-52, ]
winners_data <- winners_data |>
  mutate(
    name = str_trim(str_replace(name, "\\[.*\\]", ""))
  )
winners_data <- winners_data |>
  select(-nat_title, -date)

#cleaning again: deleting the double winner row in 2002 as well as the last row, which shows where next year's competition will be held
names(location_data) <- c("year", "edition", "win_country", "month_day", "venue", "country", "entrants")
location_data <- location_data[-52, ]
location_data <- head(location_data, -1)
location_data <- location_data |>
  select(year, edition, country)

#filtering coordinates data
countries_data <- countries_data |>
  select(latitude, longitude, country)

#merging into one data set
win_and_country <- left_join(winners_data, countries_data, by="country")
loc_and_country <- left_join(location_data, countries_data, by="country")

#changing headers to organize 
names(win_and_country) <- c("year", "win_country", "name", "age", "win_hometown", "entrants", "win_lat", "win_long")
names(loc_and_country) <- c("year", "edition", "host_country", "host_lat", "host_long")

final_data <- left_join(win_and_country, loc_and_country, by="year")
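
A left join on country names keeps every winner but silently fills in NA coordinates when the two tables spell a country differently. The sketch below shows one way to check for that with `anti_join()`. The toy tables and the "USA" versus "United States" mismatch are assumptions made up for illustration, not rows taken from the scraped data.

```r
library(dplyr)

# Toy stand-ins for the winners table and the coordinates CSV
# (illustrative values only; the real tables have more rows and columns)
winners_toy <- tibble(
  year    = c(2019, 2020, 2022),
  country = c("South Africa", "Mexico", "USA")
)
coords_toy <- tibble(
  country   = c("South Africa", "Mexico", "United States"),
  latitude  = c(-30.6, 23.6, 37.1),
  longitude = c(22.9, -102.6, -95.7)
)

# anti_join() keeps the rows of the first table that have no match in the second
unmatched <- anti_join(winners_toy, coords_toy, by = "country")
print(unmatched)

# One possible fix: align the spelling before joining
winners_fixed <- winners_toy |>
  mutate(country = if_else(country == "USA", "United States", country))
joined <- left_join(winners_fixed, coords_toy, by = "country")
print(joined)
```

Running the same `anti_join()` on `winners_data` and `location_data` against `countries_data` would list any countries that end up without coordinates, which matters because rows with missing values are dropped by `lm()` and cannot be placed on the map.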

Regression Analysis

For my regression analysis, I started by using all of my numeric variables to predict the age of the winner. Then I tried to eliminate the variables that did not help the model.

fit1 <- lm(age ~ year + entrants + host_lat + win_long + host_long + win_lat, data= final_data )
summary(fit1)


Call:
lm(formula = age ~ year + entrants + host_lat + win_long + host_long + 
    win_lat, data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4057 -1.5579 -0.4692  1.4900  4.7435 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.639e+02  4.752e+01  -3.449 0.000978 ***
year         9.379e-02  2.480e-02   3.782 0.000334 ***
entrants    -3.468e-02  2.883e-02  -1.203 0.233176    
host_lat     2.532e-02  1.882e-02   1.345 0.183046    
win_long    -3.219e-03  3.327e-03  -0.968 0.336622    
host_long    1.378e-03  3.503e-03   0.393 0.695322    
win_lat      1.935e-03  9.706e-03   0.199 0.842594    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.24 on 67 degrees of freedom
Multiple R-squared:  0.333, Adjusted R-squared:  0.2733 
F-statistic: 5.575 on 6 and 67 DF,  p-value: 9.787e-05

par(mfrow = c(2, 2))
plot(fit1)

Looking at the residuals vs fitted plot, the dots are pretty randomly scattered and there is no fanning. The Q-Q plot also looks reasonable, with most of the points on the line and only the ends curving off. However, looking at the summary, some predictors have really high p-values, so I think I can remove them. I will go ahead and remove the two with the highest p-values: win_lat (the latitude of the country the winner is from) and host_long (the longitude of the host country).

fit1 <- lm(age ~ year + entrants + host_lat + win_long, data= final_data )
summary(fit1)


Call:
lm(formula = age ~ year + entrants + host_lat + win_long, data = final_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4876 -1.5279 -0.5747  1.4676  4.6028 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.650e+02  4.682e+01  -3.524 0.000759 ***
year         9.436e-02  2.443e-02   3.862 0.000251 ***
entrants    -3.477e-02  2.836e-02  -1.226 0.224331    
host_lat     2.289e-02  1.741e-02   1.315 0.192849    
win_long    -3.206e-03  3.254e-03  -0.985 0.327879    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.211 on 69 degrees of freedom
Multiple R-squared:  0.331, Adjusted R-squared:  0.2922 
F-statistic: 8.535 on 4 and 69 DF,  p-value: 1.178e-05

par(mfrow = c(2, 2))
plot(fit1)

Looking at this new model, the adjusted R-squared is about 2 percentage points higher (0.2922 vs. 0.2733) and the model is simpler. I think we can go ahead and analyse this equation now.

Looking at the summary we can say:

predicted age = -165.0 + (0.09436 * year) + (-0.03477 * entrants) + (0.02289 * host_lat) + (-0.003206 * win_long)

The intercept is not meaningful here because you cannot have a Miss Universe competition in the year 0 with 0 people participating, hosted on the equator, where the winner is from the prime meridian.

From the year coefficient, we can see that for each additional year, the predicted age of the winner increases by about 0.094 years, or about 1.13 months, holding the other predictors constant. From the summary, we can see that the p-value is very small (less than 0.001), meaning that if we assume our alpha level is 0.05, it is statistically significant.

Holding the other predictors constant, each additional entrant decreases the predicted winning age by about 0.035 years. Each degree of latitude further north the host country is adds about 0.023 years to the predicted age (and each degree further south subtracts the same amount). Each degree of longitude further east the winner’s home country is subtracts about 0.003 years (and each degree further west adds the same amount). None of these three coefficients is statistically significant at the 0.05 level.
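
To make the equation concrete, here is a small sketch that plugs numbers into it by hand. The coefficients are copied from the summary output; the input values are hypothetical and chosen only for illustration, so they are not a row from `final_data`.

```r
# Coefficients copied from the summary() output of the reduced model
b <- c(intercept = -165.0, year = 0.09436, entrants = -0.03477,
       host_lat = 0.02289, win_long = -0.003206)

# Hypothetical inputs (assumed for illustration, not taken from the data set)
x <- c(year = 2025, entrants = 120, host_lat = 13.7, win_long = -102)

# Predicted age = intercept + sum of (coefficient * value)
predicted_age <- unname(b["intercept"] + sum(b[names(x)] * x))
predicted_age   # about 22.5 years

# The year slope on other time scales
months_per_year  <- unname(b["year"]) * 12   # about 1.13 months per year
years_per_decade <- unname(b["year"]) * 10   # about 0.94 years per decade
```

With the fitted model object available, `predict(fit1, newdata = data.frame(...))` should give nearly the same number, differing only because the coefficients above are rounded.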

From the summary, we can see the Multiple R-squared value is 0.331, meaning this model explains 33.1% of the variance in the age of the winner. The p-value for the whole model is 0.00001178, meaning the model is statistically significant and at least one variable in the model is statistically significant.

The residual standard error is 2.211, meaning the typical prediction error is about 2.2 years.

Overall, we can see that the year the competition was held is the strongest predictor of the age of the winner, as it is the only statistically significant predictor. The explanatory power of this model, about 33%, is not great, but it could be worse.
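
Dropping win_lat and host_long was justified above by their p-values and by the adjusted R-squared. Another common check is a nested-model F-test with `anova()`. The sketch below runs it on simulated data, because the scraped tables are not reproduced here; the simulated values and the data frame name `sim` are assumptions, and only the column names mirror the ones used in the project.

```r
set.seed(1)
n <- 74  # one row per edition, 1952 to 2025

# Simulated stand-in for final_data (values are made up for illustration)
sim <- data.frame(
  year      = 1952:2025,
  entrants  = round(runif(n, 30, 120)),
  host_lat  = runif(n, -35, 60),
  host_long = runif(n, -120, 140),
  win_lat   = runif(n, -35, 60),
  win_long  = runif(n, -120, 140)
)
# Age depends only on year here, plus noise
sim$age <- 18 + 0.09 * (sim$year - 1952) + rnorm(n, sd = 2)

full    <- lm(age ~ year + entrants + host_lat + win_long + host_long + win_lat,
              data = sim)
reduced <- lm(age ~ year + entrants + host_lat + win_long, data = sim)

# F-test of the null hypothesis that both dropped coefficients are zero
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row of the table would suggest that the two extra predictors do not improve the fit enough to keep. Applying the same call to the two models fitted on `final_data` would require saving them under different names instead of overwriting `fit1`.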

Visualization 1

final_data <- final_data |>
  #Here I made groups of the decade of winners
  #right = FALSE makes each group include its starting year, so 1980 counts as "1980s"
  mutate(year_group = cut(year,
                         breaks = c(-Inf, 1980, 1990, 2000, 2010, Inf),
                         labels = c("Pre-1980", "1980s", "1990s", "2000s", "2010s+"),
                         right = FALSE))
#I made my stacked histogram
ggplot(final_data, aes(x = age, fill = year_group)) +
  geom_histogram(color = "black", bins = 15, alpha = 0.8, position = "stack") +
  #Added a line representing the mean age (linewidth replaces the deprecated size argument)
  geom_vline(xintercept = mean(final_data$age, na.rm = TRUE),
             color = "black", linetype = "dashed", linewidth = 1) +
  scale_fill_brewer(palette = "Purples", name = "Decade") +
  labs(title = "Miss Universe Winner Age Distribution by Decade",
       subtitle = paste("Mean =", round(mean(final_data$age, na.rm = TRUE), 1)),
       x = "Age (years)", y = "Count") +
  theme_light()

For my first visualization, I made a stacked histogram. This shows the age frequency of every Miss Universe winner. I decided to use a stacked histogram so you could see not only how many winners were each age, but also when they won. From our regression model we saw that the winning age seemed to increase as time went on, since the predictor year had a positive coefficient. Here, we can see that play out. Most of the winners who were under 20 years old won before the 1980s. Most of the darker purple blocks, meaning winners from 2010 onward, are closer to 25 years old, with one outlier closer to 30. This is not only interesting to observe but also backs up our regression model.
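
Which decade group a boundary year such as 1980 or 2010 lands in depends on the `right` argument of `cut()`. The short sketch below compares the two settings on a few boundary years; the vector `boundary_years` is just a set of test values, not data from the project.

```r
breaks <- c(-Inf, 1980, 1990, 2000, 2010, Inf)
labels <- c("Pre-1980", "1980s", "1990s", "2000s", "2010s+")

# A few years that sit exactly on or next to a break point
boundary_years <- c(1979, 1980, 1990, 2010)

# Default (right = TRUE): intervals look like (1980, 1990]
right_closed <- cut(boundary_years, breaks = breaks, labels = labels)

# right = FALSE: intervals look like [1980, 1990)
left_closed <- cut(boundary_years, breaks = breaks, labels = labels,
                   right = FALSE)

data.frame(year = boundary_years, right_closed, left_closed)
```

With the default setting, 1980 is labeled "Pre-1980" and 2010 is labeled "2000s". With `right = FALSE`, each label includes its starting year, which matches how decades are usually read.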

Visualization 2

#I went back and added an images column to my data set and manually added urls of pictures of winners for this visualization
final_data <- final_data |>
  mutate(
    photo_url = case_when(
      name == "Armi Kuusela" ~ "https://upload.wikimedia.org/wikipedia/commons/2/21/Armi_Kuusela-1.jpg",
      name == "Christiane Martel" ~ "https://theeyehuatulco.com/wp-content/uploads/2019/02/screen-shot-2019-02-27-at-7.08.54-pm.png",
      name == "Miriam Stevenson" ~ "https://www.missuniverse.com/wp-content/uploads/2024/09/1954-Miriam-Stevenson.jpg",
      name == "Hillevi Rombin" ~ "https://www.missuniverse.com/wp-content/uploads/2024/09/1955-Hillevi-Rombin.jpg",
      name == "Carol Morris" ~ "https://upload.wikimedia.org/wikipedia/commons/d/d0/Carol_Morris_%28cropped%29.jpg",
      name == "Gladys Zender" ~ "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRIJi77UGX4uX41M-sPNNsUlZX_Hs0NAKkyWA&s",
      name == "Luz Marina Zuluaga" ~ "https://media2.nekropole.info/2015/12/Luz-Marina-Zuluaga.jpg",
      name == "Akiko Kojima" ~ "https://upload.wikimedia.org/wikipedia/commons/d/d4/Akiko_Kojima_%281959%29.jpg",
      name == "Linda Bement" ~ "https://upload.wikimedia.org/wikipedia/commons/e/ec/Linda_Bement_Detroit_News_TV_Magazine%2C_edited.jpg",
      name == "Marlene Schmidt" ~ "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQbQm02JsaUe0HW0ISYWaWXD5tkLj4xpF6L4A&s",
      TRUE ~ "https://via.placeholder.com/150x150?text=No+Photo+Available"
    )
  )

#I made my map code
map1 <- final_data |>
  leaflet() |>
  addTiles() |>
#jittering because there are many countries that have multiple winners
  addCircleMarkers(
    lng = ~jitter(win_long, amount = 1.5), 
    lat = ~jitter(win_lat, amount = 1.5), 
#making my tooltip
    popup = ~paste(
      "Winner Name: ", name, "<br>",
#found a way to insert images into my tooltip
#paste0() builds the img tag without the extra spaces paste() would put around the url
      paste0("<img src='", photo_url, "' width='150' height='175' style='border-radius: 10px; display: block; margin: 0 auto;'>"), "<br>",
      "Country of Origin: ", win_country, "<br>",
      "Year of Win: ", year, "<br>",
      "Age When Won: ", age, "<br>",
      "Host Country: ", host_country
    ),
#really liked the cluster option
    clusterOptions = markerClusterOptions(),
    radius = 8
  )

# Display the map
map1

For my second visualization, I mapped out where each Miss Universe winner was from. With the cluster option, you can easily see that North America and the northern parts of South America seem to have the most winners, while southern South America and Oceania have far fewer. Something I really wanted to do for this was to insert an image in the tooltip popup so you could see what each winner looked like. This could give the user an idea of how ‘the most beautiful lady in the world’ changed over time and how beauty standards have evolved.

I’m sure there is a more efficient way to do this, but after some research and information from https://github.com/r-spatial/leafpop and https://academy.datawrapper.de/article/248-how-to-insert-images#:~:text=Insert%20images%20in%20tooltips,-To%20add%20images&text=Replace%20the%20image%20address%20link,and%20then%20link%20from%20there. and lots of trial and error, I found a way to add an image. I searched for a way to add an image to a tooltip popup in R and didn’t immediately find a direct answer. I found some people using a package called leafpop. I didn’t want to use it, though, and wondered: if I added a column of urls to my data set, would they appear as images in my visualization? So I tried it and it worked! With the second website, I found that there was a way to make my image fit well in my tooltip by sizing it and giving it borders and such. For the url column, though, I didn’t know how to scrape the images for each person and add everything at once. I only figured out how to enter each url manually, so I was only able to do the first 10 winners due to a shortage of time. I wish I could have figured out an easier way to get these urls into the data set so every winner could have their photo on the map.
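
One possible way to avoid typing each url into `case_when()` is to keep the urls in a small lookup table and join it to the data by name. This is only a sketch under assumptions: the table `photo_lookup`, the toy `winners_toy` data, and the example.com urls are all made up for illustration and are not real image links.

```r
library(dplyr)

# Hypothetical lookup table: one row per winner with a known photo url
# (the example.com urls are placeholders, not real images)
photo_lookup <- tibble(
  name      = c("Armi Kuusela", "Carol Morris"),
  photo_url = c("https://example.com/armi.jpg",
                "https://example.com/carol.jpg")
)
placeholder <- "https://example.com/no-photo.png"

# Toy stand-in for final_data with just the name column
winners_toy <- tibble(
  name = c("Armi Kuusela", "Carol Morris", "Gladys Zender")
)

with_photos <- winners_toy |>
  # Winners without an entry in the lookup table get NA here...
  left_join(photo_lookup, by = "name") |>
  # ...and coalesce() swaps that NA for the placeholder image
  mutate(photo_url = coalesce(photo_url, placeholder))

print(with_photos)
```

The lookup table could also live in a CSV file and be loaded with `read_csv()`, so adding a photo for another winner would mean adding one row to the file instead of editing the code.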

Background Information

Miss Universe has been around for over 70 years, and so much has changed. Miss Universe originally started as a marketing promotion by a clothing company in California after the Miss America winner, Yolande Betbeze, refused to pose in their swimwear and advertise it in 1951. For a long time, the competition was owned by media companies such as CBS and later NBC. However, it has changed hands and is now owned by an international conglomerate, a Thai-Mexican partnership, after being owned by Donald Trump for almost 20 years.

Back when the competition first started, contestants had to be young. As we found in our regression analysis and first visualization, winners used to be younger, and it wasn’t just that younger people were competing; it was a rule. Contestants also had to be single, never married, and never pregnant. Now, any woman is allowed to compete. During the 1950s and 60s, the competition was focused on the physical beauty of the contestants, only looking at their figure and face. Since the 1980s, the organization has attempted to rebrand itself as one that celebrates inner beauty as much as or even more than physical appearance. But at the end of the day it’s a competition, a game, played by wealthy people. With wealthy people come connections. A place where all these connections come together to judge women on their appearance and character is doomed to become a breeding ground for allegations and scandals.

I was interested in this topic because I had been following the pageant on the radio and social media for the weeks prior to choosing my data set for this final project. Before the final part of the competition, a clip circulated of Miss Mexico being bullied and people walking out in her support. So, when Miss Mexico was announced as the winner, I was not surprised but not particularly happy, and I wanted to know what other people had to say about it. After some digging on Reddit, I found there was much more going on beneath the surface than just bullying and giving wins away.

There had been allegations, supported by an anonymous contestant, that the top 30 finalists of the competition were chosen before the event even began by an unofficial jury made up of people with personal relationships to contestants. Of course the Miss Universe Organization denied all of these claims, but there was more. A more recent article stated that the winner of this year’s Miss Universe, Fatima Bosch, was preselected as the winner due to business ties between her father and pageant co-owner Raul Rocha Cantu. It also turns out that Rocha Cantu is under investigation by Mexican authorities for alleged links to organized crime. Additionally, he had implied at some point that competitors’ passport and visa difficulties were considered when judging contestants. This has caused the fourth runner-up, Olivia Yace, to renounce her title.

While I love watching all of these beautiful women come together every year, this competition seems to have become corrupt. When I first watched the Miss Universe competition in elementary school, I thought it was a friendly competition where women came together to share ideas, spread love, and be role models. While it may still seem like that to the uninformed watcher, there is so much going on in the shadows that ruins the spirit of the competition.

Sources:

Holland, Oscar. “This Year’s Miss Universe Debacle Shows How Beauty Pageants Turned Ugly.” CNN, Cable News Network, 29 Nov. 2025, www.cnn.com/2025/11/29/style/miss-universe-pageant-controversy. Accessed 14 Dec. 2025.

Olson, Samantha. “A Miss Universe Judge Quit after Alleging the Top 30 Contestants Were Chosen by an ‘Impromptu Jury.’” Cosmopolitan, www.cosmopolitan.com/entertainment/celebs/a69474561/miss-universe-contestants-drama/. Accessed 14 Dec. 2025.

Shaw, Gabbi. “Then and Now: How the Miss Universe Pageant Has Evolved over the Last 71 Years.” Business Insider, www.businessinsider.com/miss-universe-then-and-now-2017-1#at-first-the-pageant-was-combined-with-the-miss-usa-pageant-it-wasnt-broadcast-until-1955-2. Accessed 12 Jan. 2023.