Data 110 - Final Project

Author

Betty Ovalle

Rosé Wine: Rating and Price

Source: In Good Taste (2026). “In Good Taste.” Retrieved from ingoodtaste.com

Introduction

For my final project, I will be analyzing a dataset about rosé wines that includes information such as wine prices, ratings, country of origin, grape variety, and number of reviews. I want to explore whether there is a correlation between the price of a wine and its rating, as well as visualize trends between countries and wine varieties. I also plan to map out where the highest-rated rosé wines come from.

The data comes from a publicly available wine dataset on Kaggle, which compiles wine ratings and prices from multiple sources. The data was collected through web scraping from Vivino.com.

The dataset was cleaned by standardizing column names, removing unnecessary weather-related variables that were not relevant to the research question, and filtering missing or incomplete values.

I chose this topic because I am interested in how people judge quality in food and drinks and how ratings can influence customer choices. I do not know much about wines, but rosé is usually what I drink when it is offered to me, even though I have only bought wine myself on a few occasions. Because of this, I thought it would be interesting to learn more about rosé wines and see what factors make certain wines more popular or highly rated than others.

Load the libraries and set the working directory

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(leaflet)

setwd("/Users/bettyovalle/Desktop/College/007 – Spring 2026/DATA 110/week 15")
RoseWine <- read_csv("Rose_Weather.csv")
Rows: 394 Columns: 70
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Name, Country, Region, Winery
dbl (66): Rating, NumberOfRatings, Price, Year, lat, lng, Jan_tavg, Jan_tmin...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the Data

  1. Check column names
names(RoseWine)
 [1] "Name"            "Country"         "Region"          "Winery"         
 [5] "Rating"          "NumberOfRatings" "Price"           "Year"           
 [9] "lat"             "lng"             "Jan_tavg"        "Jan_tmin"       
[13] "Jan_tmax"        "Jan_prcp"        "Jan_tsun"        "Feb_tavg"       
[17] "Feb_tmin"        "Feb_tmax"        "Feb_prcp"        "Feb_tsun"       
[21] "Mar_tavg"        "Mar_tmin"        "Mar_tmax"        "Mar_prcp"       
[25] "Mar_tsun"        "Apr_tavg"        "Apr_tmin"        "Apr_tmax"       
[29] "Apr_prcp"        "Apr_tsun"        "May_tavg"        "May_tmin"       
[33] "May_tmax"        "May_prcp"        "May_tsun"        "Jun_tavg"       
[37] "Jun_tmin"        "Jun_tmax"        "Jun_prcp"        "Jun_tsun"       
[41] "Jul_tavg"        "Jul_tmin"        "Jul_tmax"        "Jul_prcp"       
[45] "Jul_tsun"        "Aug_tavg"        "Aug_tmin"        "Aug_tmax"       
[49] "Aug_prcp"        "Aug_tsun"        "Sep_tavg"        "Sep_tmin"       
[53] "Sep_tmax"        "Sep_prcp"        "Sep_tsun"        "Oct_tavg"       
[57] "Oct_tmin"        "Oct_tmax"        "Oct_prcp"        "Oct_tsun"       
[61] "Nov_tavg"        "Nov_tmin"        "Nov_tmax"        "Nov_prcp"       
[65] "Nov_tsun"        "Dec_tavg"        "Dec_tmin"        "Dec_tmax"       
[69] "Dec_prcp"        "Dec_tsun"       
  1. Clean variable names
names(RoseWine) <- tolower(names(RoseWine))
names(RoseWine) <- gsub(" ", "_", names(RoseWine))
  1. Rename columns
RoseWine <- RoseWine %>%
  rename(number_of_ratings = numberofratings)

The column name was renamed to make it more understandable and visually appealing. This improves readability and helps maintain a more consistent and structured format for analysis.

  1. Removing unnecessary columns
RoseWine_clean <- RoseWine |>
  select(-jan_tavg,
         -jan_tmin,
         -jan_tmax,
         -jan_prcp,
         -jan_tsun,
         -feb_tavg,
         -feb_tmin,
         -feb_tmax,
         -feb_prcp,
         -feb_tsun,
         -mar_tavg,
         -mar_tmin,
         -mar_tmax,
         -mar_prcp,
         -mar_tsun,
         -apr_tavg,
         -apr_tmin,
         -apr_tmax,
         -apr_prcp,
         -apr_tsun,
         -may_tavg,
         -may_tmin,
         -may_tmax,
         -may_prcp,
         -may_tsun,
         -jun_tavg,
         -jun_tmin,
         -jun_tmax,
         -jun_prcp,
         -jun_tsun,
         -jul_tavg,
         -jul_tmin,
         -jul_tmax,
         -jul_prcp,
         -jul_tsun,
         -aug_tavg,
         -aug_tmin,
         -aug_tmax,
         -aug_prcp,
         -aug_tsun,
         -sep_tavg,
         -sep_tmin,
         -sep_tmax,
         -sep_prcp,
         -sep_tsun,
         -oct_tavg,
         -oct_tmin,
         -oct_tmax,
         -oct_prcp,
         -oct_tsun,
         -nov_tavg,
         -nov_tmin,
         -nov_tmax,
         -nov_prcp,
         -nov_tsun,
         -dec_tavg,
         -dec_tmin,
         -dec_tmax,
         -dec_prcp,
         -dec_tsun)
head(RoseWine_clean)
# A tibble: 6 × 10
  name    country region winery rating number_of_ratings price  year   lat   lng
  <chr>   <chr>   <chr>  <chr>   <dbl>             <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mad Mé… France  Médit… Châte…    4                  25  10.1  2019  43.6  2.34
2 Redoma… Portug… Douro  Niepo…    3.9                25  12.8  2019  41.1 -7.12
3 Fantin… Italy   Abruz… Farne…    3.6                25   5.6  2019  42.2 13.9 
4 Mount … Israel  Galil… Hermon    3.4                25  12.1  2017  32.8 35.6 
5 La Vil… France  Langu… Châte…    4.6                25  69.0  2018  43.7  1.40
6 La Ros… Italy   Vigne… Manin…    4.1                25  17.5  2019  46.4 11.8 

For this project on Rosé wine, it is only necessary to consider the location where the grapes were grown and supplied. Monthly weather factors during grape growth are not required for this analysis, as the main focus is not on detailed monthly climate variation, but rather on the relationship between grape origin and wine characteristics.

Data filtering

  1. Filter for Lower-Priced Wines
price_under_15 <- RoseWine_clean |>
  filter(price < 15)

head(price_under_15)
# A tibble: 6 × 10
  name    country region winery rating number_of_ratings price  year   lat   lng
  <chr>   <chr>   <chr>  <chr>   <dbl>             <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mad Mé… France  Médit… Châte…    4                  25  10.1  2019  43.6  2.34
2 Redoma… Portug… Douro  Niepo…    3.9                25  12.8  2019  41.1 -7.12
3 Fantin… Italy   Abruz… Farne…    3.6                25   5.6  2019  42.2 13.9 
4 Mount … Israel  Galil… Hermon    3.4                25  12.1  2017  32.8 35.6 
5 Mea Ro… Italy   Ligur… Lvnae     3.9                25  11.4  2019  44.5  8.70
6 Amethy… Greece  Drama  Costa…    3.8                25   9.7  2013  41.1 24.1 
  1. Wines Filtered by High Price
price_above_15 <- RoseWine_clean |>
  filter(price >= 15) #adding >= 15

head(price_above_15)
# A tibble: 6 × 10
  name    country region winery rating number_of_ratings price  year   lat   lng
  <chr>   <chr>   <chr>  <chr>   <dbl>             <dbl> <dbl> <dbl> <dbl> <dbl>
1 La Vil… France  Langu… Châte…    4.6                25  69.0  2018  43.7  1.40
2 La Ros… Italy   Vigne… Manin…    4.1                25  17.5  2019  46.4 11.8 
3 Primer… Spain   Rioja  Marqu…    3.9                26  25.2  2018  42.3 -2.47
4 Moulin… France  Bandol Domai…    3.9                27  18.9  2019  43.1  5.75
5 Purple… Italy   Tosca… Caste…    4.1                28  21.8  2019  43.5 11.1 
6 Clos B… France  Côtes… Domai…    4.5                30  19.9  2019  43.4  6.65
  1. Best Rated Wines
top_rated <- RoseWine_clean |>
  filter(rating >= 4.0)

head(top_rated)
# A tibble: 6 × 10
  name   country region winery rating number_of_ratings price  year   lat    lng
  <chr>  <chr>   <chr>  <chr>   <dbl>             <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Mad M… France  Médit… Châte…    4                  25 10.1   2019  43.6   2.34
2 La Vi… France  Langu… Châte…    4.6                25 69.0   2018  43.7   1.40
3 La Ro… Italy   Vigne… Manin…    4.1                25 17.5   2019  46.4  11.8 
4 Perli… Argent… Uco V… DiamA…    4                  27  9.99  2018 -34   -64   
5 Danie… South … Rober… De We…    4                  28  6.5   2019 -33.8  19.9 
6 Purpl… Italy   Tosca… Caste…    4.1                28 21.8   2019  43.5  11.1 
  1. Low Rated Wines
low_rated <- RoseWine_clean |>
  filter(rating < 4.0)

head(low_rated)
# A tibble: 6 × 10
  name    country region winery rating number_of_ratings price  year   lat   lng
  <chr>   <chr>   <chr>  <chr>   <dbl>             <dbl> <dbl> <dbl> <dbl> <dbl>
1 Redoma… Portug… Douro  Niepo…    3.9                25 12.8   2019  41.1 -7.12
2 Fantin… Italy   Abruz… Farne…    3.6                25  5.6   2019  42.2 13.9 
3 Mount … Israel  Galil… Hermon    3.4                25 12.1   2017  32.8 35.6 
4 Mea Ro… Italy   Ligur… Lvnae     3.9                25 11.4   2019  44.5  8.70
5 Amethy… Greece  Drama  Costa…    3.8                25  9.7   2013  41.1 24.1 
6 Cinsau… France  Pays … Les J…    3.6                26  5.56  2019  43.6  1.45

See the available data

names(RoseWine_clean)
 [1] "name"              "country"           "region"           
 [4] "winery"            "rating"            "number_of_ratings"
 [7] "price"             "year"              "lat"              
[10] "lng"              

Scatter Plot

Visualization 1

This visualization explores whether higher-priced wines receive better ratings.

ggplot(RoseWine_clean, aes(x = price, y = rating)) +
  geom_point(alpha = 0.4, color = "#c2185b") +
  geom_smooth(method = "lm", color = "gray") +
  labs(title = "Relationship Between Price and Wine Ratings",
    x = "Price (USD)",
    y = "Rating") +
  theme_light()
`geom_smooth()` using formula = 'y ~ x'

More expensive wines are not necessarily rated significantly higher. This indicates that price alone is not a strong indicator of perceived quality in rosé wines.

Highcharter charts

Visualization 2

top5_countries <- RoseWine_clean %>%
  group_by(country) %>%
  summarise(max_price = max(price, na.rm = TRUE)) %>%
  arrange(desc(max_price)) %>%
  slice_head(n = 5)

What countries have the most expensive wines?

top5_top3_wines <- RoseWine_clean %>%
  filter(country %in% top5_countries$country) %>%
  group_by(country) %>%
  arrange(desc(price)) %>%
  slice_head(n = 3) %>%
  ungroup()
highchart_expensive <- top5_top3_wines |>
  hchart(type = "column",
         hcaes(x = name, y = price, group = country)) |>
  hc_title(text = "Top 3 Most Expensive Rosé Wines in Top 5 Countries") |>
  hc_xAxis(title = list(text = "Wine Names")) |>
  hc_yAxis(title = list(text = "Price in dollars"),
           labels = list(format = "${value}")) |>
  hc_legend(title = list(text = "Country")) |>
  hc_tooltip(pointFormat = "<b>Country:</b> {point.series.name}<br><b>Price:</b> ${point.y}") |>
  hc_add_theme(hc_theme_monokai())

highchart_expensive

This visualization shows the top 3 most expensive rosé wines from the five countries with the highest maximum wine prices in the dataset.The chart compares wine names and their prices across countries, allowing to see which regions produce the most expensive bottles and how prices vary between different wines within each country.

Visualization 3

country_rating <- RoseWine_clean %>%
  group_by(country) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(avg_rating))
highchart() %>%
  hc_chart(type = "bar") %>%
  
  hc_title(text = "Average Rosé Wine Ratings by Country") %>%
  
  hc_xAxis(
    categories = country_rating$country,
    title = list(text = "Country")) %>%
  
  hc_yAxis(
    title = list(text = "Average Rating"),
    min = 0,
    max = 5) %>%
  
  hc_add_series(
    name = "Average Rating",
    data = round(country_rating$avg_rating, 2)) %>%
  
  hc_tooltip(pointFormat = "Average Rating: <b>{point.y}</b>") %>%
  
  hc_add_theme(hc_theme_monokai())

Map

Visualization 4

library(dplyr)

pal <- colorFactor(
  palette = c("#cdeac0","#b5d6e6","#d7b5e6","#f8c8dc","#fff1a8"),
  domain = RoseWine_clean$rating)

popup_wine <- paste0(
  "<b>Wine:</b> ", RoseWine_clean$name, "<br>",
  "<b>Country:</b> ", RoseWine_clean$country, "<br>",
  "<b>Price:</b> $", RoseWine_clean$price, "<br>",
  "<b>Rating:</b> ", RoseWine_clean$rating, "<br>",
  "<b>Reviews:</b> ", RoseWine_clean$number_of_ratings)

leaflet(RoseWine_clean) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%  
  addCircleMarkers(
    lng = ~lng,
    lat = ~lat,
    color = "black",
    weight = 1,
    fillColor = ~pal(rating),
    fillOpacity = 0.75,
    radius = 5,
    stroke = TRUE,
    popup = popup_wine) %>%
  
    addControl("<b>Rosé Wine Ratings Around the World</b>",
    position = "bottomleft") %>%
    
    addLegend("bottomright",
    pal = pal,
    values = ~rating,
    title = "Wine Rating",
    opacity = 1)

This map visualizes rosé wines from different countries around the world using the geographic coordinates of the vineyards where the grapes were grown. Wines with both higher and lower ratings are distributed across different regions, allowing us to observe geographic patterns in wine quality. When clicking on a point, detailed information such as the wine name, country of origin, price, rating, and number of reviews can be seen. Overall, the map helps compare how rosé wines vary globally in terms of quality and characteristics. (Each point represents a wine.)

Creative Idea

Visualization 5 - Using YouTube

top_wineries <- RoseWine_clean %>%
  group_by(winery) %>%
  summarise(
    avg_rating = mean(rating, na.rm = TRUE),
    avg_price = mean(price, na.rm = TRUE),
    n = n(),
    .groups = "drop") %>%
  filter(n >= 3) %>%
  arrange(desc(avg_rating)) %>%
  slice_head(n = 5)

custom_colors <- c(
  "#F8BBD0",
  "#F48FB1",
  "#CE93D8",
  "#B39DDB",
  "#90CAF9")

top_wineries <- top_wineries %>%
  mutate(color = rep(custom_colors, length.out = n()))

fig <- plot_ly(top_wineries) %>%
  add_segments(
    x = 0, xend = ~avg_rating,
    y = ~reorder(winery, avg_rating),
    yend = ~reorder(winery, avg_rating),
    line = list(color = "lightgray", width = 2),
    showlegend = FALSE) %>%
  add_markers(
    x = ~avg_rating,
    y = ~reorder(winery, avg_rating),
    marker = list(
      color = ~color,
      size = 12,
      line = list(color = "white", width = 1)),
    text = ~paste0(
      "<b>Winery:</b> ", winery,
      "<br><b>Average Rating:</b> ", round(avg_rating, 2),
      "<br><b>Number of Wines:</b> ", n,
      "<br><b>Avg Price:</b> $", round(avg_price, 2)),
    hoverinfo = "text") %>%
  layout(
    title = "Top Rosé Wineries",
    xaxis = list(title = "Average Rating"),
    yaxis = list(title = ""))

fig

This lollipop chart shows the top rosé wine producers based on average ratings, allowing a clear comparison of quality across wineries. Each point represents a winery’s average rating, with hover details including average rating and number of wines.

Research

It was found that many of the values obtained from this dataset are very similar to those found in online sources and reports. In particular, after reviewing articles such as VinePair’s overview of popular rosé wines, there is a strong consistency between aggregated dataset trends and independently published information.

This highlights how data analysis and visualizations built from reliable datasets can be cross-validated through different sources, reinforcing the credibility of the insights derived from them.

Multiple Linear Regression

Performed to analyze how price and number of ratings influence wine ratings.

multi_linear <- RoseWine_clean %>%
  filter(price > 0, rating > 0) %>%
  mutate(log_price = log(price))

multi_linear <- multi_linear %>%
  mutate(price_category = case_when(
    price < 10 ~ "Low",
    price >= 10 & price < 20 ~ "Medium",
    price >= 20 ~ "High"
  ))

model <- lm(rating ~ log_price + number_of_ratings + price_category, data = multi_linear)

summary(model)

Call:
lm(formula = rating ~ log_price + number_of_ratings + price_category, 
    data = multi_linear)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01315 -0.14244  0.00283  0.15113  0.58332 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.034e+00  1.677e-01  18.085  < 2e-16 ***
log_price             2.996e-01  4.524e-02   6.622 1.18e-10 ***
number_of_ratings     1.103e-04  4.239e-05   2.601  0.00964 ** 
price_categoryLow     1.701e-03  8.388e-02   0.020  0.98383    
price_categoryMedium -1.617e-02  6.336e-02  -0.255  0.79869    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2206 on 389 degrees of freedom
Multiple R-squared:  0.3558,    Adjusted R-squared:  0.3491 
F-statistic: 53.71 on 4 and 389 DF,  p-value: < 2.2e-16
summary(model)$adj.r.squared
[1] 0.349148
AIC(model)
[1] -65.99341
par(mfrow = c(2,2))
plot(model)

Summary

This project explored how price, country, and number of reviews relate to wine ratings. By using filtering, graphs, regression, and a map, I was able to see patterns in price and quality across different regions.

The results show that higher prices do not always mean higher ratings. Some countries tend to have more consistently high-rated wines, while others vary more. The map also showed that good rosé wines are spread across many countries, not just one region.

During this project, one of the main challenges was working with a large dataset that included many variables, especially weather-related columns that were not necessary for the analysis. Since some countries had significantly more data than others, comparisons became slightly unbalanced in some visualizations.

Creating clear and readable Highcharter visualizations also required multiple adjustments, such as formatting axes, handling missing values, and ensuring that all countries were properly displayed in the charts. In some cases, not all countries appeared in the final visualization, which required further filtering and debugging to ensure that the data was correctly summarized.

Additionally, in the multiple linear regression section of the project, I had to use AI and external resources to better understand what was required and how to interpret the results. This helped me clarify each component of the analysis and complete the task correctly.

In conclusion, the analysis shows that wine ratings depend on more than just price, and that data can help us better understand trends in preferences and quality.

Sources:

  1. Budnyak, A. (n.d.). Wine Rating and Price Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/budnyak/wine-rating-and-price/data

  2. Kunst, J. (n.d.). Themes in highcharter. Retrieved from https://jkunst.com/highcharter/articles/themes.html

  3. OpenAI. (2026). ChatGPT (May 11 version) [Large language model]. https://chat.openai.com/

  4. The Data Digest. (2021, April 7). Lollipop chart in R tutorial (R Graph Gallery) [Video]. YouTube. https://www.youtube.com/watch?v=6aLfDIqNIMY

  5. VinePair. (2020). 10 most popular rosé wines of 2020. https://vinepair.com/articles/10-most-popular-rose-wines-2020/