Source: In Good Taste (2026). “In Good Taste.” Retrieved from ingoodtaste.com
Introduction
For my final project, I will be analyzing a dataset about rosé wines that includes information such as wine prices, ratings, country of origin, grape variety, and number of reviews. I want to explore whether there is a correlation between the price of a wine and its rating, as well as visualize trends between countries and wine varieties. I also plan to map out where the highest-rated rosé wines come from.
The data comes from a publicly available wine dataset on Kaggle, which compiles wine ratings and prices from multiple sources. The data was collected through web scraping from Vivino.com.
The dataset was cleaned by standardizing column names, removing weather-related variables that were not relevant to the research question, and filtering out rows with missing or incomplete values.
I chose this topic because I am interested in how people judge quality in food and drinks and how ratings can influence customer choices. I do not know much about wines, but rosé is usually what I drink when it is offered to me, even though I have only bought wine myself on a few occasions. Because of this, I thought it would be interesting to learn more about rosé wines and see what factors make certain wines more popular or highly rated than others.
Load the libraries and set the working directory
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(leaflet)
setwd("/Users/bettyovalle/Desktop/College/007 – Spring 2026/DATA 110/week 15")
RoseWine <- read_csv("Rose_Weather.csv")
Rows: 394 Columns: 70
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Name, Country, Region, Winery
dbl (66): Rating, NumberOfRatings, Price, Year, lat, lng, Jan_tavg, Jan_tmin...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The column names were renamed to lowercase snake_case to make them easier to read and to keep a consistent, structured format for analysis.
# A tibble: 6 × 10
name country region winery rating number_of_ratings price year lat lng
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mad Mé… France Médit… Châte… 4 25 10.1 2019 43.6 2.34
2 Redoma… Portug… Douro Niepo… 3.9 25 12.8 2019 41.1 -7.12
3 Fantin… Italy Abruz… Farne… 3.6 25 5.6 2019 42.2 13.9
4 Mount … Israel Galil… Hermon 3.4 25 12.1 2017 32.8 35.6
5 La Vil… France Langu… Châte… 4.6 25 69.0 2018 43.7 1.40
6 La Ros… Italy Vigne… Manin… 4.1 25 17.5 2019 46.4 11.8
For this project on Rosé wine, it is only necessary to consider the location where the grapes were grown and supplied. Monthly weather factors during grape growth are not required for this analysis, as the main focus is not on detailed monthly climate variation, but rather on the relationship between grape origin and wine characteristics.
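The cleaning code itself is not shown above, so the following is only a minimal sketch of how the renaming and column selection could be done. The toy data frame, its values, and the helper `to_snake()` are hypothetical stand-ins, not the original code; only the column names come from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows standing in for Rose_Weather.csv; all values are made up.
toy_raw <- data.frame(
  Name            = c("Wine A", "Wine B", "Wine C"),
  Country         = c("France", "Italy", NA),
  Rating          = c(4.0, 3.6, 4.1),
  NumberOfRatings = c(25, 40, 31),
  Price           = c(10.1, 5.6, 17.5),
  Jan_tavg        = c(6.2, 7.9, 5.1)
)

# Hypothetical helper: insert "_" at each lower-to-upper case boundary,
# then convert everything to lowercase.
to_snake <- function(x) tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))

toy_clean <- toy_raw |>
  rename_with(to_snake) |>                                     # NumberOfRatings -> number_of_ratings
  select(name, country, rating, number_of_ratings, price) |>   # keeps only these, dropping the weather column
  drop_na()                                                    # removes the row with a missing country

names(toy_clean)
```

Selecting the wanted columns by name, rather than removing the weather columns one by one, keeps the step short even though the original file has 70 columns.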
This visualization explores whether higher-priced wines receive better ratings.
ggplot(RoseWine_clean, aes(x = price, y = rating)) +
  geom_point(alpha = 0.4, color = "#c2185b") +
  geom_smooth(method = "lm", color = "gray") +
  labs(
    title = "Relationship Between Price and Wine Ratings",
    x = "Price (USD)",
    y = "Rating"
  ) +
  theme_light()
`geom_smooth()` using formula = 'y ~ x'
More expensive wines are not necessarily rated significantly higher. This indicates that price alone is not a strong indicator of perceived quality in rosé wines.
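To put a number on this relationship, a correlation coefficient could be computed alongside the plot. The sketch below uses only the six price and rating pairs visible in the data preview earlier in this report, so the values it returns describe those six rows and not the full dataset. In the real analysis, `RoseWine_clean$price` and `RoseWine_clean$rating` would be used instead.

```r
# The six (price, rating) pairs from the preview table; not the full dataset.
preview <- data.frame(
  price  = c(10.1, 12.8, 5.6, 12.1, 69.0, 17.5),
  rating = c(4.0, 3.9, 3.6, 3.4, 4.6, 4.1)
)

# Pearson correlation measures linear association on the raw price scale.
r_pearson <- cor(preview$price, preview$rating)

# Taking the log of price reduces the pull of very expensive bottles.
r_log <- cor(log(preview$price), preview$rating)

# Spearman correlation only uses the rank order, so it is robust to outliers.
r_spearman <- cor(preview$price, preview$rating, method = "spearman")

c(pearson = r_pearson, log_price = r_log, spearman = r_spearman)
```

Comparing the three values is a quick way to check whether a single expensive wine is driving the apparent trend.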
highchart_expensive <- top5_top3_wines |>
  hchart(type = "column", hcaes(x = name, y = price, group = country)) |>
  hc_title(text = "Top 3 Most Expensive Rosé Wines in Top 5 Countries") |>
  hc_xAxis(title = list(text = "Wine Names")) |>
  hc_yAxis(
    title = list(text = "Price in dollars"),
    labels = list(format = "${value}")
  ) |>
  hc_legend(title = list(text = "Country")) |>
  hc_tooltip(pointFormat = "<b>Country:</b> {point.series.name}<br><b>Price:</b> ${point.y}") |>
  hc_add_theme(hc_theme_monokai())

highchart_expensive
This visualization shows the top 3 most expensive rosé wines from the five countries with the highest maximum wine prices in the dataset. The chart compares wine names and their prices across countries, allowing us to see which countries produce the most expensive bottles and how prices vary between different wines within each country.
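The code that builds the `top5_top3_wines` table is not shown, so here is a minimal sketch of one way such a table could be derived with dplyr. The toy data, the country labels, and the object names `top_countries` and `top_wines` are hypothetical; this is not necessarily how the original table was made.

```r
library(dplyr)

# Hypothetical toy data: 6 countries with 4 wines each; all prices are made up.
toy <- data.frame(
  name    = paste("Wine", 1:24),
  country = rep(LETTERS[1:6], each = 4),
  price   = rep(1:6, each = 4) * 10 + rep(c(0, 2, 4, 6), times = 6)
)

# Step 1: find the five countries with the highest maximum price.
top_countries <- toy |>
  group_by(country) |>
  summarise(max_price = max(price), .groups = "drop") |>
  slice_max(max_price, n = 5, with_ties = FALSE)

# Step 2: within those countries, keep the three most expensive wines.
top_wines <- toy |>
  semi_join(top_countries, by = "country") |>
  group_by(country) |>
  slice_max(price, n = 3, with_ties = FALSE) |>
  ungroup()

top_wines
```

Setting `with_ties = FALSE` guarantees exactly three wines per country, which helps keep the column chart balanced when several bottles share the same price.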
This map visualizes rosé wines from different countries around the world using the geographic coordinates of the vineyards where the grapes were grown. Wines with both higher and lower ratings are distributed across different regions, allowing us to observe geographic patterns in wine quality. When clicking on a point, detailed information such as the wine name, country of origin, price, rating, and number of reviews can be seen. Overall, the map helps compare how rosé wines vary globally in terms of quality and characteristics. (Each point represents a wine.)
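The map code is not included above. The following is a rough sketch of how a map like this could be built with leaflet, assuming one row per wine with `lat` and `lng` columns. The three rows, the wine names, and the color palette are hypothetical choices for illustration.

```r
library(leaflet)

# Hypothetical example rows; names and values are placeholders.
wines <- data.frame(
  name              = c("Wine A", "Wine B", "Wine C"),
  country           = c("France", "Italy", "Italy"),
  rating            = c(4.0, 3.6, 4.1),
  price             = c(10.1, 5.6, 17.5),
  number_of_ratings = c(25, 25, 25),
  lat               = c(43.6, 42.2, 46.4),
  lng               = c(2.34, 13.9, 11.8)
)

# Build the HTML shown when a point is clicked.
wines$popup <- paste0(
  "<b>", wines$name, "</b>",
  "<br>Country: ", wines$country,
  "<br>Price: $", wines$price,
  "<br>Rating: ", wines$rating,
  "<br>Reviews: ", wines$number_of_ratings
)

# Color each point by its rating (palette choice is an assumption).
pal <- colorNumeric("RdPu", domain = wines$rating)

rose_map <- leaflet(wines) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    color = ~pal(rating), radius = 6,
    popup = ~popup
  ) |>
  addLegend(pal = pal, values = ~rating, title = "Rating")

rose_map
```

Coloring the markers by rating makes it easier to spot whether higher-rated wines cluster in particular regions.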
This lollipop chart shows the top rosé wine producers based on average ratings, allowing a clear comparison of quality across wineries. Each point represents a winery’s average rating, with hover details including average rating and number of wines.
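As a sketch of the idea behind this chart, the winery averages can be computed with `group_by()` and `summarise()`, and then drawn as a segment plus a point. The toy ratings and the minimum of two wines per winery are hypothetical assumptions, and this version uses static ggplot2 rather than the interactive hover shown in the report.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy ratings; winery names and values are made up.
toy <- data.frame(
  winery = c("W1", "W1", "W2", "W2", "W2", "W3"),
  rating = c(4.0, 4.2, 3.6, 3.8, 4.0, 4.5)
)

winery_avg <- toy |>
  group_by(winery) |>
  summarise(avg_rating = mean(rating), n_wines = n(), .groups = "drop") |>
  filter(n_wines >= 2) |>                         # assumed cutoff: skip single-wine wineries
  arrange(desc(avg_rating)) |>
  mutate(winery = reorder(winery, avg_rating))    # order the axis by average rating

lollipop <- ggplot(winery_avg, aes(y = winery)) +
  geom_segment(aes(x = 0, xend = avg_rating, yend = winery), color = "gray") +
  geom_point(aes(x = avg_rating), color = "#c2185b", size = 4) +
  labs(x = "Average rating", y = "Winery") +
  theme_light()

lollipop
```

Requiring a minimum number of wines is one way to avoid a winery reaching the top of the chart on the strength of a single bottle.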
Research
Many of the values obtained from this dataset are very similar to those found in online sources and reports. In particular, articles such as VinePair’s overview of popular rosé wines are consistent with the aggregated trends in the dataset.
This highlights how data analysis and visualizations built from reliable datasets can be cross-validated through different sources, reinforcing the credibility of the insights derived from them.
Multiple Linear Regression
A multiple linear regression was performed to analyze how price and number of ratings relate to wine ratings.
Call:
lm(formula = rating ~ log_price + number_of_ratings + price_category,
data = multi_linear)
Residuals:
Min 1Q Median 3Q Max
-1.01315 -0.14244 0.00283 0.15113 0.58332
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.034e+00 1.677e-01 18.085 < 2e-16 ***
log_price 2.996e-01 4.524e-02 6.622 1.18e-10 ***
number_of_ratings 1.103e-04 4.239e-05 2.601 0.00964 **
price_categoryLow 1.701e-03 8.388e-02 0.020 0.98383
price_categoryMedium -1.617e-02 6.336e-02 -0.255 0.79869
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2206 on 389 degrees of freedom
Multiple R-squared: 0.3558, Adjusted R-squared: 0.3491
F-statistic: 53.71 on 4 and 389 DF, p-value: < 2.2e-16
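One way to read these coefficients is to convert them into changes in rating. The sketch below assumes that `log_price` is the natural log of price, which the output does not confirm. Under that assumption, doubling the price corresponds to a change of the coefficient times log(2) in the predicted rating, holding the other variables fixed.

```r
# Estimates copied from the regression summary above.
b_log_price <- 0.2996
b_n_ratings <- 1.103e-04

# Assumption: log_price = log(price), i.e. the natural log.
# Doubling the price adds log(2) to log_price.
effect_doubling <- b_log_price * log(2)

# Predicted change in rating for 1,000 additional ratings.
effect_1000_ratings <- b_n_ratings * 1000

round(c(doubling_price = effect_doubling, extra_1000_ratings = effect_1000_ratings), 3)
```

Under this reading, doubling the price is associated with roughly 0.21 more rating points, and 1,000 additional ratings with about 0.11 points. The two price category terms have large p-values (0.98 and 0.80), so they add little once `log_price` is already in the model.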
summary(model)$adj.r.squared
[1] 0.349148
AIC(model)
[1] -65.99341
par(mfrow = c(2, 2))
plot(model)
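The steps that create `log_price` and `price_category` are not shown, so this is only a sketch of how a model with the same formula could be set up. The data are simulated, and the price cut points (10 and 25) are hypothetical, not the thresholds used in the report. "High" is set as the reference level because the summary above reports coefficients only for Low and Medium.

```r
set.seed(110)

# Simulated stand-in data; not the rosé dataset.
n <- 200
sim <- data.frame(
  price             = exp(runif(n, log(5), log(70))),
  number_of_ratings = round(runif(n, 25, 2000))
)
sim$rating <- 3 + 0.3 * log(sim$price) +
  1e-4 * sim$number_of_ratings + rnorm(n, sd = 0.2)

# Derived predictors. The cut points below are assumptions for illustration.
sim$log_price <- log(sim$price)
sim$price_category <- cut(
  sim$price,
  breaks = c(0, 10, 25, Inf),
  labels = c("Low", "Medium", "High")
)
sim$price_category <- relevel(sim$price_category, ref = "High")

model_sim <- lm(
  rating ~ log_price + number_of_ratings + price_category,
  data = sim
)

coef(model_sim)
```

Because `price_category` is built from the same price variable as `log_price`, the two overlap heavily, which may be part of why the category terms are not significant in the fitted model.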
Summary
This project explored how price, country, and number of reviews relate to wine ratings. By using filtering, graphs, regression, and a map, I was able to see patterns in price and quality across different regions.
The results show that higher prices do not always mean higher ratings. Some countries tend to have more consistently high-rated wines, while others vary more. The map also showed that good rosé wines are spread across many countries, not just one region.
During this project, one of the main challenges was working with a wide dataset of 70 columns, many of them weather-related variables that were not necessary for the analysis. In addition, since some countries had significantly more data than others, comparisons became slightly unbalanced in some visualizations.
Creating clear and readable Highcharter visualizations also required multiple adjustments, such as formatting axes, handling missing values, and ensuring that all countries were properly displayed in the charts. In some cases, not all countries appeared in the final visualization, which required further filtering and debugging to ensure that the data was correctly summarized.
Additionally, in the multiple linear regression section of the project, I had to use AI and external resources to better understand what was required and how to interpret the results. This helped me clarify each component of the analysis and complete the task correctly.
In conclusion, the analysis shows that wine ratings depend on more than just price, and that data can help us better understand trends in preferences and quality.
Sources:
Budnyak, A. (n.d.). Wine Rating and Price Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/budnyak/wine-rating-and-price/data
Kunst, J. (n.d.). Themes in highcharter. Retrieved from https://jkunst.com/highcharter/articles/themes.html