My topic is global sugar consumption. The data I use is sugar_consumption_dataset.csv, and the source is WHO. The data were collected using a mixed-method approach combining food balance sheet analysis, household consumption surveys, trade data, and nutrition surveillance. There is no ReadMe file for this data.

The categorical variables I use for this project are Country, Continent, Region, and Country_Code. The quantitative variables are Year, Population, GDP_Per_Capita, Per_Capita_Sugar_Consumption, Total_Sugar_Consumption, Sugar_From_Sugarcane, Sugar_From_Beet, Sugar_From_HFCS, Sugar_From_Other, Processed_Food_Consumption, Avg_Daily_Sugar_Intake, Diabetes_Prevalence, Obesity_Rate, Sugar_Imports, Sugar_Exports, Avg_Retail_Price_Per_Kg, Gov_Tax, Gov_Subsidies, Education_Campaign, Urbanization_Rate, Climate_Conditions, and Sugarcane_Production_Yield.

I clean the data using !is.na(column name) to check for missing values. I want to explore the obesity rate around the globe over the years, average processed food consumption by region, and the distribution of average sugar from sugarcane by continent.

I chose this data to explore how sugar consumption has changed across different countries over the years, how it relates to obesity rates, and which sources of sugar are most commonly consumed. This topic is personally meaningful to me because I've often heard that "sugar is a silent killer," and I want to understand the reasons behind this claim through data. By analyzing the trends and health impacts, I hope to gain deeper insight into the global patterns and consequences of sugar consumption.
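The missing-value check described above can be sketched as follows. This is a minimal illustration in base R on a made-up toy data frame (the values and the object names `toy`, `na_counts`, and `toy_clean` are assumptions for this example, not taken from the real dataset).

```r
# Toy stand-in for the sugar data; the values are invented for illustration
toy <- data.frame(
  Country = c("Brazil", "India", "Japan", "Mexico"),
  Obesity_Rate = c(22.1, NA, 4.3, 28.9),
  Sugar_From_Beet = c(10.5, 12.0, NA, 9.8)
)

# Count the missing values in every column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows where Obesity_Rate is present
toy_clean <- toy[!is.na(toy$Obesity_Rate), ]
print(nrow(toy_clean))
```

Counting the missing values per column first shows whether any filtering is needed at all before rows are dropped.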
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(GGally)
Warning: package 'GGally' was built under R version 4.4.3
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(gganimate)
Warning: package 'gganimate' was built under R version 4.4.3
library(tidyr)
library(psych)
Warning: package 'psych' was built under R version 4.4.3
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
library(scales)
Warning: package 'scales' was built under R version 4.4.3
Attaching package: 'scales'
The following objects are masked from 'package:psych':
alpha, rescale
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(tidyverse)
library(knitr)
library(webshot2)
Warning: package 'webshot2' was built under R version 4.4.3
Rows: 10000 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Country, Country_Code, Continent, Region
dbl (22): Year, Population, GDP_Per_Capita, Per_Capita_Sugar_Consumption, To...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data(sugar1)
Warning in data(sugar1): data set 'sugar1' not found
Grouping and summarising some columns that I want to focus on
Sugar4 <- sugar2 %>%
  group_by(Country, Year, Continent, Region, Country_Code) %>% # group by these columns to make the data easier to use
  summarise(
    Avg_Sugar_From_Sugarcane = mean(Sugar_From_Sugarcane), # summarise each column by its mean
    Avg_Sugar_From_Beet = mean(Sugar_From_Beet),
    Avg_Sugar_From_HFCS = mean(Sugar_From_HFCS),
    Avg_Processed_Food_Consumption = mean(Processed_Food_Consumption),
    Avg_Obesity_Rate = mean(Obesity_Rate)
  )
`summarise()` has grouped output by 'Country', 'Year', 'Continent', 'Region'.
You can override using the `.groups` argument.
head(Sugar4) # Displays the first few rows of the cleaned dataset
# A tibble: 6 × 10
# Groups: Country, Year, Continent, Region [6]
Country Year Continent Region Country_Code Avg_Sugar_From_Sugar…¹
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Australia 1960 Oceania Australia & New… AUS 71.6
2 Australia 1961 Oceania Australia & New… AUS 62.2
3 Australia 1962 Oceania Australia & New… AUS 73.6
4 Australia 1963 Oceania Australia & New… AUS 69.4
5 Australia 1964 Oceania Australia & New… AUS 69.9
6 Australia 1965 Oceania Australia & New… AUS 70.7
# ℹ abbreviated name: ¹Avg_Sugar_From_Sugarcane
# ℹ 4 more variables: Avg_Sugar_From_Beet <dbl>, Avg_Sugar_From_HFCS <dbl>,
# Avg_Processed_Food_Consumption <dbl>, Avg_Obesity_Rate <dbl>
Filtering the years from 2015 to 2019
Sugar5 <- Sugar4 %>%
  filter(Year >= 2015 & Year <= 2019, # focus on the years 2015 to 2019
         Country %in% c("Brazil", "China", "Germany", "France", "India", "Indonesia",
                        "Japan", "Mexico", "Russia", "South Africa", "USA"))
Investigating the Most Statistically Significant Independent Variable
After a long process of backward elimination, I identified the variable that best fits the model. This section highlights the final independent variable selected. The focus moving forward will be on simple linear regression using this statistically significant variable.
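Backward elimination can be sketched with base R's `step()` function. This is only an illustration under stated assumptions: the data are simulated, `step()` drops terms by AIC rather than by p-value, and it may differ from the manual procedure actually used for this project.

```r
# Simulated stand-in data: only Sugar_From_Beet truly affects the response
set.seed(1)
n <- 200
toy <- data.frame(
  Sugar_From_Beet = rnorm(n, mean = 20, sd = 5),
  Sugar_From_Sugarcane = rnorm(n, mean = 70, sd = 5),
  Sugar_From_HFCS = rnorm(n, mean = 15, sd = 5)
)
toy$Obesity_Rate <- 5 + 2 * toy$Sugar_From_Beet + rnorm(n)

# Start from the full model with every candidate predictor
full <- lm(Obesity_Rate ~ Sugar_From_Beet + Sugar_From_Sugarcane + Sugar_From_HFCS,
           data = toy)

# Remove terms one at a time while doing so lowers the AIC
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
```

Because the elimination only ever removes terms, the reduced model never has more coefficients than the full one.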
# Build a multiple linear regression model with several predictors
fit1 <- lm(Obesity_Rate ~ Country + Year + Continent + Country_Code + Region +
             Sugar_From_Sugarcane + Sugar_From_Beet + Sugar_From_HFCS +
             Processed_Food_Consumption,
           data = sugar2)
# Display a summary of the model to check coefficients, p-values, R², etc.
summary(fit1)
Call:
lm(formula = Obesity_Rate ~ Country + Year + Continent + Country_Code +
Region + Sugar_From_Sugarcane + Sugar_From_Beet + Sugar_From_HFCS +
Processed_Food_Consumption, data = sugar2)
Residuals:
Min 1Q Median 3Q Max
-18.4670 -8.6184 -0.1244 8.6813 18.6671
Coefficients: (25 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.476972 10.942439 2.968 0.00300 **
CountryBrazil 0.256847 0.490141 0.524 0.60027
CountryChina -0.229114 0.488632 -0.469 0.63916
CountryFrance 1.030860 0.478889 2.153 0.03137 *
CountryGermany -0.317736 0.482951 -0.658 0.51061
CountryIndia 0.068418 0.483392 0.142 0.88745
CountryIndonesia 0.163773 0.476474 0.344 0.73106
CountryJapan 0.118550 0.480889 0.247 0.80528
CountryMexico 0.665558 0.484831 1.373 0.16986
CountryRussia -0.186854 0.485365 -0.385 0.70026
CountrySouth Africa -0.130707 0.485408 -0.269 0.78773
CountryUSA 0.178241 0.487986 0.365 0.71493
Year -0.005697 0.005479 -1.040 0.29840
ContinentAsia NA NA NA NA
ContinentEurope NA NA NA NA
ContinentNorth America NA NA NA NA
ContinentOceania NA NA NA NA
ContinentSouth America NA NA NA NA
Country_CodeBRA NA NA NA NA
Country_CodeCHN NA NA NA NA
Country_CodeDEU NA NA NA NA
Country_CodeFRA NA NA NA NA
Country_CodeIDN NA NA NA NA
Country_CodeIND NA NA NA NA
Country_CodeJPN NA NA NA NA
Country_CodeMEX NA NA NA NA
Country_CodeRUS NA NA NA NA
Country_CodeUSA NA NA NA NA
Country_CodeZAF NA NA NA NA
RegionCentral America NA NA NA NA
RegionEast Asia NA NA NA NA
RegionEastern Europe NA NA NA NA
RegionNorthern America NA NA NA NA
RegionSouth America NA NA NA NA
RegionSouth Asia NA NA NA NA
RegionSoutheast Asia NA NA NA NA
RegionSub-Saharan Africa NA NA NA NA
RegionWestern Europe NA NA NA NA
Sugar_From_Sugarcane 0.013567 0.008793 1.543 0.12286
Sugar_From_Beet 0.025640 0.009889 2.593 0.00954 **
Sugar_From_HFCS -0.006791 0.005854 -1.160 0.24601
Processed_Food_Consumption -0.001529 0.001207 -1.267 0.20522
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.09 on 9983 degrees of freedom
Multiple R-squared: 0.002631, Adjusted R-squared: 0.001032
F-statistic: 1.646 on 16 and 9983 DF, p-value: 0.04973
I chose to focus on simple linear regression with Sugar_From_Beet as the independent variable. In the initial multiple regression analysis, this variable had a p-value of 0.00954, which is less than the commonly used significance level of 0.05, so Sugar_From_Beet is statistically significant when the other variables are held constant. The association is weak, however: the estimated coefficient is 0.0256, and the full model explains well under 1% of the variation in Obesity_Rate (multiple R-squared = 0.002631). I therefore built a simple linear regression model using Sugar_From_Beet to explore its relationship with obesity rates in more detail.
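The NA rows in the regression output above ("25 not defined because of singularities") are what `lm()` reports when a predictor carries no information beyond the ones already in the model. A likely reason here is that each country maps to exactly one country code, continent, and region. The sketch below reproduces the effect on a small made-up data frame (all values are invented for illustration).

```r
# Toy data: three countries with four observations each
toy <- data.frame(
  Country = rep(c("Brazil", "India", "Japan"), each = 4),
  Obesity_Rate = c(22, 23, 21, 24, 4, 5, 3, 6, 28, 27, 29, 30)
)

# Country_Code is a one-to-one relabelling of Country
codes <- c(Brazil = "BRA", India = "IND", Japan = "JPN")
toy$Country_Code <- unname(codes[toy$Country])

# Both predictors describe the same grouping, so the second is redundant
fit <- lm(Obesity_Rate ~ Country + Country_Code, data = toy)
print(coef(fit))
```

The coefficients for the Country_Code levels come back as NA because the Country terms already account for every difference between the groups.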
plot1 <- ggplot(Sugar4, aes(x = Avg_Obesity_Rate, y = Avg_Sugar_From_Beet)) + # set which variables go on the x-axis and y-axis
  geom_point(size = 0.5, color = "blue") + # add the scatter plot points in blue
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash", linewidth = 1, color = "red") + # add a red linear regression line without the confidence interval shading
  labs(title = "Average Obesity Rate VS Average Sugar \n From Beet", # title of the plot
       x = "Average Obesity Rate", # label for the x-axis
       y = "Average Sugar From Beet", # label for the y-axis
       caption = "Source: CDC" # caption at the bottom of the plot
  ) +
  theme_minimal(base_size = 13) # minimal theme with larger text size
# Display the plot
plot1
`geom_smooth()` using formula = 'y ~ x'
Calculating correlation
# Calculate the correlation between average obesity rate and average sugar from beet
cor(Sugar4$Avg_Obesity_Rate, Sugar4$Avg_Sugar_From_Beet)
[1] 0.0569508
Fit linear regression model and summarize result
# Fit a linear regression model to predict the average obesity rate with average sugar from beet
fit1 <- lm(Avg_Obesity_Rate ~ Avg_Sugar_From_Beet, data = Sugar4)
# Summarize the results of the linear model
# The summary will provide important statistical information like coefficients, R-squared, and p-values
summary(fit1)
Call:
lm(formula = Avg_Obesity_Rate ~ Avg_Sugar_From_Beet, data = Sugar4)
Residuals:
Min 1Q Median 3Q Max
-11.4184 -1.7954 0.0455 1.9261 11.3735
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.04796 0.83072 25.337 <2e-16 ***
Avg_Sugar_From_Beet 0.05808 0.03679 1.579 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.976 on 766 degrees of freedom
Multiple R-squared: 0.003243, Adjusted R-squared: 0.001942
F-statistic: 2.493 on 1 and 766 DF, p-value: 0.1148
autoplot(fit1, 1:4, nrow = 2, ncol = 2) # Generate and arrange 4 diagnostic plots for the linear model (fit1) in a 2x2 grid
Analysis of the Summary and Diagnostic Plots
The model equation is Avg_Obesity_Rate = 0.058(Avg_Sugar_From_Beet) + 21.048. The p-value of 0.1148 is greater than the common significance level of 0.05, so we fail to reject the null hypothesis that the slope is zero. The adjusted R-squared is 0.001942, which means that only about 0.19% of the variation in Avg_Obesity_Rate can be explained by Avg_Sugar_From_Beet. This value is very close to 0, indicating that the model does not explain much of the variability in obesity rates.
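As a quick consistency check on the numbers reported in the output, in a simple linear regression the multiple R-squared equals the square of the correlation between the two variables. The sketch below also uses the fitted coefficients to predict the obesity rate for a hypothetical beet value of 20, which is an assumed input chosen only for illustration.

```r
# Correlation reported earlier by cor()
r <- 0.0569508
r_squared <- r^2
print(round(r_squared, 6)) # matches the Multiple R-squared of 0.003243

# Coefficients from the model summary
intercept <- 21.04796
slope <- 0.05808

# Hypothetical input value, not taken from the data
beet <- 20
predicted <- intercept + slope * beet
print(predicted)
```

Because the slope is so small, even a large change in Avg_Sugar_From_Beet moves the predicted obesity rate only slightly.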
The four diagnostic plots suggest that the linear regression model is generally valid. The Residuals vs Fitted plot shows randomly scattered points, indicating linearity and no clear pattern, which supports the model’s assumptions. The Normal Q-Q plot shows points close to the line, meaning the residuals are approximately normally distributed. The Scale-Location plot suggests constant variance (homoscedasticity) since the spread of points is even. Finally, Cook’s Distance identifies a few slightly influential points (like 293, 358, and 607), but none are extreme enough to threaten the model’s stability.
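The influential-point check can also be done numerically with `cooks.distance()`. The sketch below uses simulated data with one deliberately planted unusual observation. The cutoff of 4/n is a common rule of thumb and an assumption here, not something taken from the analysis above.

```r
# Simulated data with a weak linear relationship
set.seed(42)
n <- 100
x <- rnorm(n, mean = 20, sd = 5)
y <- 21 + 0.06 * x + rnorm(n, sd = 3)

# Plant one unusual observation at row 10
y[10] <- y[10] + 30

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Flag observations above the rule-of-thumb cutoff
threshold <- 4 / n
flagged <- which(cd > threshold)
print(flagged)
```

Flagged rows are candidates for a closer look rather than automatic removal.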
Final visualization
Plot 1
# Create a heatmap showing average obesity rate by country and year
ggplot(Sugar5, aes(x = factor(Year), # convert Year to a factor so it is treated as discrete (#2 from ChatGPT)
                   y = reorder(Country, Avg_Obesity_Rate), # reorder countries by average obesity rate for better layout (#3 from ChatGPT)
                   fill = Avg_Obesity_Rate)) + # use Avg_Obesity_Rate to fill the tiles with color
  geom_tile(color = "grey90") + # draw rectangular tiles with a light grey border
  scale_fill_gradient( # create a color gradient from white (low) to red (high)
    name = "Average Obesity Rate (%)", # title for the legend
    low = "white", high = "red"
  ) +
  labs( # add plot labels
    title = "Average Obesity Rate by Country and \n Year", # main title of the chart
    subtitle = "Average annual obesity rates (%)\n across selected countries (2015–2019)", # subtitle with line break (#4 from ChatGPT)
    x = "Year", # label for x-axis
    y = "Country", # label for y-axis
    caption = "Source: CDC" # source or data citation
  ) +
  theme_minimal(base_size = 13) + # use a clean minimal theme with slightly larger base font
  theme( # customize theme elements
    axis.text.x = element_text(angle = 45, hjust = 1), # rotate x-axis text for readability
    plot.title = element_text(face = "bold", size = 15), # make title bold and large (#5 from ChatGPT)
    plot.subtitle = element_text(size = 12, color = "gray30"), # style subtitle with lighter color (#6 from ChatGPT)
    plot.caption = element_text(size = 10, color = "gray40") # make caption small and subtle (#7 from ChatGPT)
  )
Explanation About the Graph
This heatmap shows the average obesity rate (%) across various countries from 2015 to 2019. The intensity of the red color represents the obesity rate—darker red means higher obesity. For example, France in 2019 is the darkest red, indicating one of the highest obesity rates in the dataset for that year. In contrast, India and Indonesia consistently show lighter shades, suggesting they have lower obesity rates throughout the years. Comparing Mexico and Germany, we can see that Mexico had a noticeably higher rate in 2016 and 2017, while Germany stayed at a moderate level across all years. Meanwhile, China shows a sudden spike in 2017, becoming darker than in other years, which could indicate a sharp rise in obesity that year. The USA, Japan, and South Africa also show relatively high obesity levels, with varying trends.
Plot 2
# Create a violin and boxplot to visualize average processed food consumption by region
ggplot(Sugar5, aes(x = Region, y = Avg_Processed_Food_Consumption, fill = Region)) +
  # Add a violin plot to show the distribution of consumption per region
  # trim = FALSE keeps the full shape of the distribution (no trimming of tails)
  # alpha = 0.6 makes the fill semi-transparent for visual clarity
  geom_violin(trim = FALSE, alpha = 0.6) + # (#8 from ChatGPT)
  # Overlay a boxplot on the violin to show median, quartiles, and outliers
  # width = 0.5 narrows the box width, color = "gray20" sets the outline color
  geom_boxplot(width = 0.5, color = "gray20") +
  # Flip the x and y axes so the region names sit on the vertical axis
  coord_flip() +
  # Apply a minimal theme for a clean background and set base font size
  theme_minimal(base_size = 11) +
  # - Use light gray major gridlines for reference
  # - Remove minor gridlines for a cleaner look
  theme(panel.grid.major = element_line(color = "gray80", linewidth = 0.5),
        panel.grid.minor = element_blank()) +
  labs(title = "Average Processed Food Consumption \n by Region", # main title of the plot
       x = "Region", # x-axis label
       y = "Average Processed Food Consumption", # y-axis label
       caption = "Source: CDC" # caption for data source
  )
Explanation about the graph
This violin plot shows the average processed food consumption across different world regions. Each region is represented by a horizontal shape that indicates how the consumption values are distributed. Regions like Western Europe and Northern America show wide and thick plots positioned toward the right, meaning processed food consumption there is higher. In contrast, Sub-Saharan Africa, South Asia, and Southeast Asia have narrow plots toward the left, showing lower consumption levels with little variation. Eastern Europe, East Asia, and Central America fall somewhere in the middle, indicating moderate consumption. The width of each shape shows how common a certain level of consumption is: wider areas mean more of the country-year averages fall into that range. Overall, the graph shows that processed food is consumed more heavily in wealthier regions like Europe and North America, while consumption is much lower in less developed regions.
Plot 3
# Create a histogram with overlaid density plot for Avg_Sugar_From_Sugarcane by Continent
plot3 <- ggplot(Sugar5, aes(x = Avg_Sugar_From_Sugarcane)) +
  # Add histogram with density scale and continent-based fill color
  geom_histogram(aes(y = after_stat(density), fill = Continent),
                 bins = 30, # number of bins for the histogram
                 alpha = 0.5, # transparency level for overlapping
                 position = "identity" # draw all histograms on top of each other
  ) +
  # Add a density curve for each continent
  geom_density(aes(color = Continent), linewidth = 0.8) + # thickness of density lines
  # Apply a minimal theme with a base font size of 11
  theme_minimal(base_size = 11) +
  # Add labels and titles
  labs(title = "Distribution of Average Sugar From Sugarcane \n by Continent",
       x = "Average Sugar From Sugarcane", y = "Density",
       fill = "Continent", # legend title for fill color
       color = "Continent", # legend title for line color
       caption = "Source: CDC") +
  # Customize legends for fill and color
  guides(fill = guide_legend(title = "Continent"),
         color = guide_legend(title = "Continent")) +
  # Define custom colors for each continent's histogram fill
  scale_fill_manual(values = c("Africa" = "red", "Asia" = "green", "Europe" = "yellow",
                               "North America" = "blue", "South America" = "black")) +
  # Define custom colors for each continent's density line
  scale_color_manual(values = c("Africa" = "darkred", "Asia" = "darkgreen", "Europe" = "purple",
                                "North America" = "navy", "South America" = "gray"))
# Optional: Convert to interactive plot with Plotly (uncomment if needed)
# plot3 <- ggplotly(plot3)
# Display the plot
plot3
Explanation about the graph
The visualization highlights differences between continents in average sugar consumption from sugarcane, showing both frequency (histogram) and distribution trends (density curves). South America has a strong peak around 70, suggesting consistent consumption, while Asia and Africa exhibit broader distributions, indicating more variation. Overlapping density curves suggest similarities between some continents, while others diverge significantly. These trends may be influenced by agricultural production, dietary habits, economic conditions, and health policies. The graph provides valuable insights for researchers, policymakers, and businesses looking to understand global sugar consumption patterns.
Background Research
Global sugar consumption has steadily increased over the past five decades, driven largely by highly populated regions such as India, East Asia, and Latin America (Siervo et al.). Meanwhile, developed countries have experienced a decline in per capita sugar intake due to health concerns and market saturation, although consumption levels remain high—accounting for over 20% of total energy intake in some individuals. Sugar sources include sugar beets, sugarcane, and various sweeteners derived from fruits, milk, and cereals. The study found that total sugar consumption (including sugar, sweeteners, and honey) is significantly associated with higher global rates of overweight, obesity, and hypertension (ρ = 0.31–0.37, P < 0.001) (Siervo et al.).
Works cited
Siervo, Mario, et al. “Sugar Consumption and Global Prevalence of Obesity and Hypertension: An Ecological Analysis.” Public Health Nutrition, Cambridge University Press, 18 Feb. 2013, https://doi.org/10.1017/S1368980013000141.