FINAL PROJECT

Author

BALEMLAY Azimeraw

Global Sugar Consumption

#1 from Google

Introduction

My topic is global sugar consumption. The data come from sugar_consumption_dataset.csv, and the source is WHO. The dataset has four categorical variables (Country, Continent, Region, and Country_Code) and 22 quantitative variables: Year, Population, GDP_Per_Capita, Per_Capita_Sugar_Consumption, Total_Sugar_Consumption, Sugar_From_Sugarcane, Sugar_From_Beet, Sugar_From_HFCS, Sugar_From_Other, Processed_Food_Consumption, Avg_Daily_Sugar_Intake, Diabetes_Prevalence, Obesity_Rate, Sugar_Imports, Sugar_Exports, Avg_Retail_Price_Per_Kg, Gov_Tax, Gov_Subsidies, Education_Campaign, Urbanization_Rate, Climate_Conditions, and Sugarcane_Production_Yield. I check every column for missing values using !is.na(column name). I want to explore the obesity rate around the globe over the years, average processed food consumption by region, and the distribution of average sugar from sugarcane by continent. The data was collected using a mixed-method approach combining food balance sheet analysis, household consumption surveys, trade data, and nutrition surveillance. There is no ReadMe file for this data.

I chose this data to explore how sugar consumption has changed across different countries over the years, how it relates to obesity rates, and which sources of sugar are most commonly consumed. This topic is personally meaningful to me because I’ve often heard that “sugar is a silent killer,” and I want to understand the reasons behind this claim through data. By analyzing the trends and health impacts, I hope to gain deeper insight into the global patterns and consequences of sugar consumption.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.3
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(GGally)
Warning: package 'GGally' was built under R version 4.4.3
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(gganimate)
Warning: package 'gganimate' was built under R version 4.4.3
library(tidyr)
library(psych)
Warning: package 'psych' was built under R version 4.4.3

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.3
library(scales)
Warning: package 'scales' was built under R version 4.4.3

Attaching package: 'scales'

The following objects are masked from 'package:psych':

    alpha, rescale

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
library(knitr)
library(webshot2)
Warning: package 'webshot2' was built under R version 4.4.3
setwd("C:/Users/ebale/OneDrive/Desktop/DATA110")
sugar1 <- read_csv("sugar_consumption_dataset.csv")
Rows: 10000 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Country, Country_Code, Continent, Region
dbl (22): Year, Population, GDP_Per_Capita, Per_Capita_Sugar_Consumption, To...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read_csv() has already loaded the data into sugar1, so no data() call is needed

Checking for any missing values in each column

# TRUE for every row that has no missing value in any of the 26 columns
clean1 <- complete.cases(sugar1)
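The logical vector clean1 is computed above but is not used again later in the report. As a minimal sketch of how such a check could be turned into a per-column count of missing values and a filtered data frame, here is an example on a small made-up data frame. The object toy and its values are hypothetical and are not part of the sugar data.

```r
# Hypothetical toy data with two missing values (not the sugar dataset)
toy <- data.frame(
  Country = c("A", "B", "C", NA),
  Obesity_Rate = c(20.5, NA, 31.2, 18.0)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Flag the rows with no missing value, then keep only those rows
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]
print(nrow(toy_clean))
```

The same two steps should work on sugar1 in place of toy, assuming the intent is to drop incomplete rows rather than only detect them.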

Arranging the rows by year in ascending order, because the raw data are not sorted by year.

# Arrange data by Year and store it in sugar2
sugar2 <- sugar1 %>%
  arrange(Year)

# Print unique years to confirm ordering
print(unique(sugar2$Year))
 [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
[16] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
[31] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
[46] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
[61] 2020 2021 2022 2023

Grouping and summarising some columns that I want to focus on

Sugar4 <- sugar2 %>%
  group_by(Country, Year, Continent, Region, Country_Code) %>%  # group the rows to make them easier to summarise
  summarise(
    Avg_Sugar_From_Sugarcane = mean(Sugar_From_Sugarcane),  # summarise each variable by its mean within a group
    Avg_Sugar_From_Beet = mean(Sugar_From_Beet),
    Avg_Sugar_From_HFCS = mean(Sugar_From_HFCS),
    Avg_Processed_Food_Consumption = mean(Processed_Food_Consumption),
    Avg_Obesity_Rate = mean(Obesity_Rate))
`summarise()` has grouped output by 'Country', 'Year', 'Continent', 'Region'.
You can override using the `.groups` argument.
head(Sugar4)  # Displays the first few rows of the cleaned dataset
# A tibble: 6 × 10
# Groups:   Country, Year, Continent, Region [6]
  Country    Year Continent Region           Country_Code Avg_Sugar_From_Sugar…¹
  <chr>     <dbl> <chr>     <chr>            <chr>                         <dbl>
1 Australia  1960 Oceania   Australia & New… AUS                            71.6
2 Australia  1961 Oceania   Australia & New… AUS                            62.2
3 Australia  1962 Oceania   Australia & New… AUS                            73.6
4 Australia  1963 Oceania   Australia & New… AUS                            69.4
5 Australia  1964 Oceania   Australia & New… AUS                            69.9
6 Australia  1965 Oceania   Australia & New… AUS                            70.7
# ℹ abbreviated name: ¹​Avg_Sugar_From_Sugarcane
# ℹ 4 more variables: Avg_Sugar_From_Beet <dbl>, Avg_Sugar_From_HFCS <dbl>,
#   Avg_Processed_Food_Consumption <dbl>, Avg_Obesity_Rate <dbl>
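The message printed by summarise() above says the result is still grouped and points to the .groups argument. A small sketch of what that argument does, on made-up data (toy and its values are hypothetical): passing .groups = "drop" returns an ungrouped table, which avoids surprises when the result is filtered or plotted later.

```r
library(dplyr)

# Hypothetical toy data: two rows for country A in 2015, one each for B in 2015 and 2016
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year = c(2015, 2015, 2015, 2016),
  Obesity_Rate = c(20, 22, 30, 34)
)

# Average within each Country-Year pair and drop the grouping afterwards
toy_avg <- toy %>%
  group_by(Country, Year) %>%
  summarise(Avg_Obesity_Rate = mean(Obesity_Rate), .groups = "drop")

print(toy_avg)
print(is_grouped_df(toy_avg))  # FALSE: the grouping has been removed
```

This is only an illustration of the argument; the report itself keeps the default grouping.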

Filtering the years from 2015 to 2019

Sugar5 <- Sugar4 %>%
  filter(Year >= 2015 & Year <= 2019, # keep only the years 2015 to 2019
         Country %in% c("Brazil", "China", "Germany", "France", "India", "Indonesia", "Japan", "Mexico", "Russia", "South Africa", "USA")) # keep only the selected countries

Investigating the Most Statistically Significant Independent Variable

After a long backward-elimination process, I identified the predictor that fits the model best. This section shows the multiple regression used to select it. The rest of the analysis is a simple linear regression on that statistically significant variable.
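The elimination steps themselves are not shown in this report. As a rough sketch of what an automated backward elimination can look like in base R, the example below runs step() on simulated data. The variables x1, x2, x3, and y are hypothetical. Note also that step() drops terms by AIC rather than by p-value, so it is an assumption that it resembles the manual process used here.

```r
# Simulated data: only x1 truly affects y (all names are hypothetical)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 + rnorm(n)

# Start from the full model and remove terms one at a time while AIC improves
full <- lm(y ~ x1 + x2 + x3, data = toy)
reduced <- step(full, direction = "backward", trace = 0)

# Show which predictors survived
kept_terms <- attr(terms(reduced), "term.labels")
print(kept_terms)
```

A p-value-based elimination would instead refit the model after removing the predictor with the largest p-value, repeating until every remaining predictor is below the chosen threshold.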

# Build a multiple linear regression model with several predictors
 fit1 <- lm(Obesity_Rate ~
             Country +
             Year +
             Continent +
             Country_Code +
             Region +
             Sugar_From_Sugarcane +
             Sugar_From_Beet +
             Sugar_From_HFCS +
             Processed_Food_Consumption,
             data = sugar2)
# Display a summary of the model to check coefficients, p-values, R², etc.
summary(fit1)

Call:
lm(formula = Obesity_Rate ~ Country + Year + Continent + Country_Code + 
    Region + Sugar_From_Sugarcane + Sugar_From_Beet + Sugar_From_HFCS + 
    Processed_Food_Consumption, data = sugar2)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.4670  -8.6184  -0.1244   8.6813  18.6671 

Coefficients: (25 not defined because of singularities)
                            Estimate Std. Error t value Pr(>|t|)   
(Intercept)                32.476972  10.942439   2.968  0.00300 **
CountryBrazil               0.256847   0.490141   0.524  0.60027   
CountryChina               -0.229114   0.488632  -0.469  0.63916   
CountryFrance               1.030860   0.478889   2.153  0.03137 * 
CountryGermany             -0.317736   0.482951  -0.658  0.51061   
CountryIndia                0.068418   0.483392   0.142  0.88745   
CountryIndonesia            0.163773   0.476474   0.344  0.73106   
CountryJapan                0.118550   0.480889   0.247  0.80528   
CountryMexico               0.665558   0.484831   1.373  0.16986   
CountryRussia              -0.186854   0.485365  -0.385  0.70026   
CountrySouth Africa        -0.130707   0.485408  -0.269  0.78773   
CountryUSA                  0.178241   0.487986   0.365  0.71493   
Year                       -0.005697   0.005479  -1.040  0.29840   
ContinentAsia                     NA         NA      NA       NA   
ContinentEurope                   NA         NA      NA       NA   
ContinentNorth America            NA         NA      NA       NA   
ContinentOceania                  NA         NA      NA       NA   
ContinentSouth America            NA         NA      NA       NA   
Country_CodeBRA                   NA         NA      NA       NA   
Country_CodeCHN                   NA         NA      NA       NA   
Country_CodeDEU                   NA         NA      NA       NA   
Country_CodeFRA                   NA         NA      NA       NA   
Country_CodeIDN                   NA         NA      NA       NA   
Country_CodeIND                   NA         NA      NA       NA   
Country_CodeJPN                   NA         NA      NA       NA   
Country_CodeMEX                   NA         NA      NA       NA   
Country_CodeRUS                   NA         NA      NA       NA   
Country_CodeUSA                   NA         NA      NA       NA   
Country_CodeZAF                   NA         NA      NA       NA   
RegionCentral America             NA         NA      NA       NA   
RegionEast Asia                   NA         NA      NA       NA   
RegionEastern Europe              NA         NA      NA       NA   
RegionNorthern America            NA         NA      NA       NA   
RegionSouth America               NA         NA      NA       NA   
RegionSouth Asia                  NA         NA      NA       NA   
RegionSoutheast Asia              NA         NA      NA       NA   
RegionSub-Saharan Africa          NA         NA      NA       NA   
RegionWestern Europe              NA         NA      NA       NA   
Sugar_From_Sugarcane        0.013567   0.008793   1.543  0.12286   
Sugar_From_Beet             0.025640   0.009889   2.593  0.00954 **
Sugar_From_HFCS            -0.006791   0.005854  -1.160  0.24601   
Processed_Food_Consumption -0.001529   0.001207  -1.267  0.20522   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.09 on 9983 degrees of freedom
Multiple R-squared:  0.002631,  Adjusted R-squared:  0.001032 
F-statistic: 1.646 on 16 and 9983 DF,  p-value: 0.04973
# Create diagnostic plots for checking regression assumptions
autoplot(fit1, 1:4, nrow=2, ncol=2)

Linear regression

I chose to focus on simple linear regression with Sugar_From_Beet as the independent variable. In the initial multiple regression analysis, this variable had a p-value of 0.00954, which is less than the commonly used significance level of 0.05. This indicates that Sugar_From_Beet is statistically significant, holding the other variables constant, although the model's R-squared of 0.002631 shows that the association with Obesity_Rate is weak. Therefore, I built a simple linear regression model using Sugar_From_Beet to explore its relationship with obesity rates in more detail.

plot1 <- ggplot(Sugar4,
       aes(x = Avg_Obesity_Rate,
           y = Avg_Sugar_From_Beet)) + # sets which variable goes on the x-axis and which on the y-axis
  geom_point(size = 0.5, color = "blue") +  # Adds the scatter plot points with blue color
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash", linewidth = 1, color = "red") +  # Adds a red linear regression line without the confidence interval shading
  labs(
    title = "Average Obesity Rate VS Average Sugar \n From Beet",  # Adds a title to the plot
    x = "Average Obesity Rate",                 # Label for the x-axis
    y = "Average Sugar From Beet",              # Label for the y-axis
    caption = "Source: CDC"                     # Caption at the bottom of the plot
  ) +
  theme_minimal(base_size = 13)  # Minimal theme with larger text size
# Display the plot
plot1
`geom_smooth()` using formula = 'y ~ x'

Calculating correlation

# Calculate the correlation between average obesity rate and average sugar from beet
cor(Sugar4$Avg_Obesity_Rate,Sugar4$Avg_Sugar_From_Beet)
[1] 0.0569508
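In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation. The sketch below squares the correlation printed above and then checks the identity on simulated data. The vectors x and y are hypothetical and are not taken from the sugar data.

```r
# Square the correlation reported above
r <- 0.0569508
r_squared <- r^2
print(round(r_squared, 6))

# Check the identity on simulated data (x and y are made up)
set.seed(42)
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)
fit <- lm(y ~ x)
same <- isTRUE(all.equal(cor(x, y)^2, summary(fit)$r.squared))
print(same)
```

The squared correlation, about 0.0032, is the share of variance one variable explains in the other, so a correlation this small leaves almost all of the variation unexplained.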

Fit linear regression model and summarize result

# Fit a linear regression model to predict the average obesity rate with average sugar from beet

fit1 <- lm(Avg_Obesity_Rate ~ Avg_Sugar_From_Beet, data = Sugar4)

# Summarize the results of the linear model
# The summary will provide important statistical information like coefficients, R-squared, and p-values
summary(fit1)

Call:
lm(formula = Avg_Obesity_Rate ~ Avg_Sugar_From_Beet, data = Sugar4)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4184  -1.7954   0.0455   1.9261  11.3735 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         21.04796    0.83072  25.337   <2e-16 ***
Avg_Sugar_From_Beet  0.05808    0.03679   1.579    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.976 on 766 degrees of freedom
Multiple R-squared:  0.003243,  Adjusted R-squared:  0.001942 
F-statistic: 2.493 on 1 and 766 DF,  p-value: 0.1148
autoplot(fit1, 1:4, nrow=2, ncol=2) # Generate and arrange 4 diagnostic plots for the linear model (fit1) in a 2x2 grid

This is the analysis of the summary and diagnostic Plots

The model equation is Avg_Obesity_Rate = 0.058(Avg_Sugar_From_Beet) + 21.048. The p-value of 0.1148 is greater than the common significance level of 0.05, so we fail to reject the null hypothesis that the slope is zero. The adjusted R-squared is 0.001942, which means that only about 0.19% of the variation in Avg_Obesity_Rate can be explained by Avg_Sugar_From_Beet. This value is very close to 0, indicating that the model does not explain much of the variability in obesity rates.
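As a worked illustration of the numbers in the summary output, the sketch below plugs a value into the fitted equation and recomputes the adjusted R-squared from the reported R-squared. The input of 20 for Avg_Sugar_From_Beet is a hypothetical value chosen only for the example, and n = 768 is inferred from the 766 residual degrees of freedom plus the two estimated coefficients.

```r
# Coefficients copied from the summary output
intercept <- 21.04796
slope <- 0.05808

# Predicted average obesity rate for a hypothetical beet value of 20
predict_obesity <- function(beet) intercept + slope * beet
pred_20 <- predict_obesity(20)
print(pred_20)

# Adjusted R-squared from the reported R-squared
r2 <- 0.003243
n <- 768   # 766 residual degrees of freedom + 2 coefficients
p <- 1     # one predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 6))
```

The prediction comes out near 22.2 and the adjusted R-squared near 0.0019, which agrees with the summary output.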

The four diagnostic plots suggest that the linear regression model is generally valid. The Residuals vs Fitted plot shows randomly scattered points, indicating linearity and no clear pattern, which supports the model’s assumptions. The Normal Q-Q plot shows points close to the line, meaning the residuals are approximately normally distributed. The Scale-Location plot suggests constant variance (homoscedasticity) since the spread of points is even. Finally, Cook’s Distance identifies a few slightly influential points (like 293, 358, and 607), but none are extreme enough to threaten the model’s stability.
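A visual reading of Cook's distance can be backed up with a numeric check. The sketch below uses simulated data with one planted outlier and flags observations whose Cook's distance exceeds 4/n. That cutoff is a common rule of thumb and an assumption here, not a threshold used elsewhere in this report, and x, y, and fit_toy are hypothetical.

```r
# Simulated data with one deliberately influential point (observation 1)
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
x[1] <- 3
y[1] <- 10.5   # far above the value the true line would give at x = 3

fit_toy <- lm(y ~ x)

# Cook's distance for every observation, flagged against the 4/n rule of thumb
cd <- cooks.distance(fit_toy)
flagged <- which(cd > 4 / n)
print(flagged)
```

Refitting the model without the flagged rows and comparing the coefficients is one way to see whether such points actually change the conclusions.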

Final visualization

Plot 1

# Create a heatmap showing average obesity rate by country and year
ggplot(Sugar5, 
       aes(x = factor(Year),       #2I got from CHATGPT                  # Convert Year to factor so it's treated as discrete
           y = reorder(Country, Avg_Obesity_Rate),   # Reorder countries by average obesity rate for better layout.#3 I got from Chat GPT
           fill = Avg_Obesity_Rate)) +               # Use Avg_Obesity_Rate to fill the tiles with color

  geom_tile(color = "grey90") +                      # Draw rectangular tiles with a light grey border
  
  scale_fill_gradient(                               # Create a color gradient from white (low) to red (high)
    name = "Average Obesity Rate (%)",                   # Title for the legend
    low = "white", 
    high = "red"
  ) +

  labs(                                              # Add plot labels
    title = "Average Obesity Rate by Country and \n Year",     # Main title of the chart
    subtitle = "Average annual obesity rates (%)\n across selected countries (2015–2019)",  # Subtitle with line break #4 Chatgpt
    x = "Year",                                      # Label for x-axis
    y = "Country",                                   # Label for y-axis
    caption = "Source: CDC"                          # Source or data citation
  ) +

  theme_minimal(base_size = 13) +                    # Use a clean minimal theme with slightly larger base font

  theme(                                             # Customize theme elements
    axis.text.x = element_text(angle = 45, hjust = 1),      # Rotate x-axis text for readability
    plot.title = element_text(face = "bold", size = 15),    # Make title bold and large #5 CHATGPT
    plot.subtitle = element_text(size = 12, color = "gray30"), # Style subtitle with lighter color #6 CHATGPT
    plot.caption = element_text(size = 10, color = "gray40")) # Make caption small and subtle # 7 CHATGPT

Explanation About the Graph

This heatmap shows the average obesity rate (%) across various countries from 2015 to 2019. The intensity of the red color represents the obesity rate—darker red means higher obesity. For example, France in 2019 is the darkest red, indicating one of the highest obesity rates in the dataset for that year. In contrast, India and Indonesia consistently show lighter shades, suggesting they have lower obesity rates throughout the years. Comparing Mexico and Germany, we can see that Mexico had a noticeably higher rate in 2016 and 2017, while Germany stayed at a moderate level across all years. Meanwhile, China shows a sudden spike in 2017, becoming darker than in other years, which could indicate a sharp rise in obesity that year. The USA, Japan, and South Africa also show relatively high obesity levels, with varying trends.

Plot 2

 # Create a violin and boxplot to visualize average processed food consumption by region
ggplot(Sugar5, aes(x = Region, y =  Avg_Processed_Food_Consumption, fill = Region)) +

  # Add a violin plot to show the distribution of consumption per region
  # trim = FALSE keeps the full shape of the distribution (no trimming of tails)
  # alpha = 0.6 makes the fill semi-transparent for visual clarity
  geom_violin(trim = FALSE, alpha = 0.6) +  #8  from CHATGPT
  
  # Overlay a boxplot on the violin to show median, quartiles, and outliers
  # width = 0.5 narrows the box width, color = "gray20" sets the outline color
  geom_boxplot(width = 0.5, color = "gray20") +
  
  # Flip the x and y axes so the region names are listed along the vertical axis
  coord_flip() +
  
  # Apply a minimal theme for a clean background and set base font size
  theme_minimal(base_size = 11) +
  
  
  # - Use light gray major gridlines for reference
  # - Remove minor gridlines for a cleaner look
  theme(
    panel.grid.major = element_line(color = "gray80", linewidth = 0.5),
    panel.grid.minor = element_blank()
  ) +

  labs(
    title = "Average Processed Food Consumption \n by Region",   # Main title of the plot
    x = "Region",                                              # X-axis label
    y = "Average Processed Food Consumption",                  # Y-axis label
    caption = "Source: CDC"                                    # Caption for data source
  )

Explanation about the graph

This violin plot shows the average processed food consumption across different world regions. Each region is represented by a horizontal shape that indicates how the consumption values are distributed, with a boxplot overlaid to mark the median and quartiles. Regions like Western Europe and Northern America show wide and thick plots positioned toward the right, meaning consumption there is higher. In contrast, Sub-Saharan Africa, South Asia, and Southeast Asia have narrow plots toward the left, showing lower consumption levels with little variation. Eastern Europe, East Asia, and Central America fall somewhere in the middle, indicating moderate consumption. The width of each shape shows how common a certain level of consumption is: wider areas mean more country-year observations fall into that range. Overall, the graph shows that processed food is consumed more heavily in wealthier regions like Europe and North America, while consumption is much lower in less developed regions.
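The position and spread that a violin and boxplot show can also be summarised as numbers. A minimal base R sketch on made-up values, where the regions "A" and "B" and their values are hypothetical: the median gives each group's position and the interquartile range gives its spread.

```r
# Hypothetical toy data: two regions with four observations each
toy <- data.frame(
  Region = rep(c("A", "B"), each = 4),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Median (position) and interquartile range (spread) for each region
med <- tapply(toy$value, toy$Region, median)
iqr <- tapply(toy$value, toy$Region, IQR)

region_summary <- data.frame(
  Region = names(med),
  median = as.numeric(med),
  iqr = as.numeric(iqr)
)
print(region_summary)
```

Applied to Sugar5, the same idea would put a number on how far apart the regions sit and how much each one varies.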

Plot 3

# Create a histogram with overlaid density plot for Avg_Sugar_From_Sugarcane by Continent
plot3 <- ggplot(Sugar5, aes(x = Avg_Sugar_From_Sugarcane)) +

  # Add histogram with density scale and continent-based fill color
  geom_histogram(
    aes(y = after_stat(density), fill = Continent),
    bins = 30,        # Number of bins for the histogram
    alpha = 0.5,      # Transparency level for overlapping
    position = "identity"  # Draw all histograms on top of each other
  ) +

  # Add a density curve for each continent
  geom_density(
    aes(color = Continent), 
    linewidth = 0.8      # Thickness of density lines
  ) +

  # Apply a minimal theme with a base font size of 11
  theme_minimal(base_size = 11) +

  # Add labels and titles
  labs(
    title = "Distribution of Average Sugar From Sugarcane \n by Continent",
    x = "Average Sugar From Sugarcane", 
    y = "Density",
    fill = "Continent",    # Legend title for fill color
    color = "Continent",    # Legend title for line color
    caption = "Source: CDC") +

  # Customize legends for fill and color
  guides(
    fill = guide_legend(title = "Continent"),
    color = guide_legend(title = "Continent")
  ) +
  

  # Define custom colors for each continent's histogram fill
  scale_fill_manual(values = c(
    "Africa" = "red",  
    "Asia" = "green",   
    "Europe" = "yellow", 
    "North America" = "blue", 
    "South America" = "black"
  )) +

  # Define custom colors for each continent's density line
  scale_color_manual(values = c(
    "Africa" = "darkred",  
    "Asia" = "darkgreen",   
    "Europe" = "purple",  
    "North America" = "navy",  
    "South America" = "gray"
  ))

# Optional: convert to an interactive plot with ggplotly() (uncomment if needed; requires library(plotly))
# plot3 <- ggplotly(plot3)

# Display the plot
plot3

Explanation about the graph

The visualization highlights differences between continents in average sugar from sugarcane, showing both relative frequency (histogram) and distribution shape (density curves). South America has a strong peak around 70, suggesting consistent consumption, while Asia and Africa exhibit broader distributions, indicating more variation. Where the density curves overlap, the continents have similar distributions; where they diverge, the distributions differ. These patterns may be influenced by agricultural production, dietary habits, economic conditions, and health policies. The graph gives a quick view of how sugarcane-based sugar consumption varies across continents.

Background Research

Global sugar consumption has steadily increased over the past five decades, driven largely by highly populated regions such as India, East Asia, and Latin America (Siervo et al.). Meanwhile, developed countries have experienced a decline in per capita sugar intake due to health concerns and market saturation, although consumption levels remain high—accounting for over 20% of total energy intake in some individuals. Sugar sources include sugar beets, sugarcane, and various sweeteners derived from fruits, milk, and cereals. The study found that total sugar consumption (including sugar, sweeteners, and honey) is significantly associated with higher global rates of overweight, obesity, and hypertension (ρ = 0.31–0.37, P < 0.001) (Siervo et al.).

Works cited

Siervo, Mario, et al. “Sugar Consumption and Global Prevalence of Obesity and Hypertension: An Ecological Analysis.” Public Health Nutrition, Cambridge University Press, 18 Feb. 2013, https://doi.org/10.1017/S1368980013000141.

Google - #1

CHATGPT - #2, #3, #4, #5, #6, #7, #8