I found a dataset about McDonald’s menu nutrition on Kaggle. I chose it
because I’m a fan of McDonald’s and I want to analyze the data to find
new information. The menu items and nutrition facts were scraped from the
McDonald’s website. The dataset provides a nutrition analysis of every
menu item on the Indian McDonald’s menu.
The dataset is from https://www.kaggle.com/datasets/deepcontractor/mcdonalds-india-menu-nutrition-facts
Description of the dataset features:
The data contains 13 features.
(1) Menu Category: Includes 7 different menus.
(2) Menu Items: Food items per menu.
(3) Food nutrient content: Per Serve Size, Energy (kCal), Protein (g),
Total fat (g), Sat Fat (g), Trans fat (g), Cholesterols (mg), Total
carbohydrate (g), Total Sugars (g), Added Sugars (g), Sodium (mg)
In this section I preprocess the dataset to make it easier to analyze.
Import the required libraries.
The startup messages show that gridExtra, dplyr, factoextra, corrplot, glmnet (4.1-7) and MASS are attached, along with the dependencies lattice and Matrix. dplyr masks filter and lag from stats, and MASS masks select from dplyr, which is presumably why the code below calls dplyr::select explicitly.
Read the dataset from the CSV file and view the column names.
dataInit <- read.csv('India_Menu.csv')
names(dataInit)
## [1] "Menu.Category" "Menu.Items" "Per.Serve.Size"
## [4] "Energy..kCal." "Protein..g." "Total.fat..g."
## [7] "Sat.Fat..g." "Trans.fat..g." "Cholesterols..mg."
## [10] "Total.carbohydrate..g." "Total.Sugars..g." "Added.Sugars..g."
## [13] "Sodium..mg."
To make the later analysis easier, I rename the columns and then check the dataset for missing data.
colnames(dataInit) <- c(
  "Category", "Items", "Size_g", "Energy_kCal", "Protein_g", "Totalfat_g",
  "Satfat_g", "Transfat_g", "Cholesterol_mg", "Totalcarbohydrate_g",
  "Totalsugar_g", "Addedsugar_g", "Sodium_mg"
)
vis_miss(dataInit)
nrow(dataInit)
## [1] 141
The plot shows that some data is missing. Because very little is
missing, I delete every row that contains a missing value.
After deletion, 140 of the 141 rows remain, so only one row contained
missing data.
dataNew <- na.omit(dataInit)
nrow(dataNew)
## [1] 140
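The drop from 141 to 140 rows can be cross-checked by counting missing values directly. The following is a minimal base-R sketch on a small made-up data frame; `toy` and its values are illustrative assumptions, not rows from India_Menu.csv. The same calls should work on `dataInit`.

```r
# Toy stand-in for dataInit; the values are invented for illustration.
toy <- data.frame(
  Items     = c("A", "B", "C", "D"),
  Sodium_mg = c(10.5, NA, 30.2, 40.0),
  Protein_g = c(1.0, 2.0, 3.0, 4.0)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows that contain at least one missing value.
rows_with_na <- sum(!complete.cases(toy))

# na.omit() drops exactly those rows.
toy_clean <- na.omit(toy)

print(na_per_column)
print(rows_with_na)
print(nrow(toy_clean))
```

Counting per column also shows which feature the missing value belongs to, which `nrow()` alone does not reveal.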
The project sets out to do the following:
1. Analyze the unhealthy nutrients in the different menus.
2. Check whether the Gourmet Menu, the Regular Menu and the Breakfast
Menu differ from each other.
3. Use PCA to reduce the dimensionality and see whether it reveals any new findings.
4. Predict whether a particular item belongs to the Regular Menu or not.
This is a binary classification problem.
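For the binary classification goal, the multi-class `Category` column has to be turned into a two-class target first. The sketch below shows one way this might be done in base R. The toy data frame, the column name `is_regular`, and the exact category spellings are assumptions made for illustration.

```r
# Toy stand-in for dataNew; the category spellings are assumed.
toy <- data.frame(
  Category = c("Regular Menu", "Breakfast Menu", "Regular Menu",
               "Gourmet Menu", "Beverages Menu"),
  Energy_kCal = c(400, 300, 450, 600, 150)
)

# 1 if the item is on the Regular Menu, 0 otherwise.
toy$is_regular <- as.integer(toy$Category == "Regular Menu")

# Class balance matters when judging a classifier's accuracy later.
print(table(toy$is_regular))
```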
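# Keep only the numeric nutrient columns.
dataNumeric <- dataNew %>%
  dplyr::select(where(is.numeric)) %>%
  as.data.frame()
# Pairwise Pearson correlations, drawn as a lower-triangle heat map.
corr_matrix <- cor(dataNumeric)
corrplot(corr_matrix, method = "color", type = "lower", tl.col = "black", tl.srt = 45)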
The heat map shows several high correlations between the features:
(1) Energy_kCal ~ Protein_g, Totalfat_g, Satfat_g, Totalcarbohydrate_g, Sodium_mg
(2) Protein_g ~ Energy_kCal, Totalfat_g, Sodium_mg
(3) Totalfat_g ~ Energy_kCal, Protein_g, Satfat_g, Sodium_mg
(4) Satfat_g ~ Energy_kCal, Totalfat_g
(5) Totalcarbohydrate_g ~ Energy_kCal
(6) Totalsugar_g ~ Addedsugar_g
(7) Sodium_mg ~ Energy_kCal, Protein_g, Totalfat_g
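Reading pairs off a heat map by eye is subjective, so the same list could also be derived from the correlation matrix itself. The sketch below is one possible approach. The helper name `high_cor_pairs` and the 0.7 cutoff are assumptions, since the text does not state what counts as "high". It runs on the built-in `mtcars` data; with the menu data, `dataNumeric` would be passed instead.

```r
# Return the feature pairs whose absolute Pearson correlation is at least
# `threshold`. The helper name and the 0.7 default are assumptions.
high_cor_pairs <- function(df, threshold = 0.7) {
  cm <- cor(df)
  # Keep each pair once by looking only at the strict upper triangle.
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  pairs <- data.frame(
    feature_1 = rownames(cm)[idx[, "row"]],
    feature_2 = colnames(cm)[idx[, "col"]],
    r         = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first.
  pairs[order(-abs(pairs$r)), ]
}

# Example on a built-in dataset; with the menu data, pass dataNumeric instead.
res <- high_cor_pairs(mtcars[, c("mpg", "disp", "hp", "wt")])
print(res)
```

Using the absolute value keeps strong negative correlations in the result as well.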
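# Standardize the numeric columns so each has mean 0 and unit variance.
dataScaled <- dataNew %>%
  dplyr::select(where(is.numeric)) %>%
  scale() %>%
  as.data.frame()
# Items is not numeric, so it is absent from dataScaled; take the labels from dataNew.
rownames(dataScaled) <- make.unique(as.character(dataNew$Items))
#pca <- PCA(dataScaled, graph = TRUE)
pca <- prcomp(dataScaled)
fviz_eig(pca, addlabels = TRUE, ylim = c(0, 60))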
The scree plot shows the percentage of variance explained by each principal component. The cumulative contribution rates are:
Principal components | Top1 | Top2 | Top3 | Top4 | Top5 | Top6 | Top7
---|---|---|---|---|---|---|---
Cumulative contribution rate | 50.5% | 74.9% | 85.5% | 92.9% | 96.6% | 98.3% | 99.1%
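The cumulative contribution rates in the table can be computed from the standard deviations that `prcomp` returns. The sketch below uses the built-in `USArrests` data as a stand-in, so its numbers differ from the table; `pca_demo`, `explained` and `cumulative` are illustrative names. With the menu data, the `pca` object fitted above would take the place of `pca_demo`.

```r
# Built-in data as a stand-in for the scaled menu data.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Each component's variance is its squared standard deviation.
explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)

# Running total of the explained variance over the leading components.
cumulative <- cumsum(explained)

print(round(100 * cumulative, 1))
```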
fviz_pca_ind(pca,col.ind = dataNew$Category,select.ind = list(cos2 = 0.5))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
The plot shows the projection of the observations onto the first two
principal components; only observations with a cos2 above 0.5 are drawn.
(1) The Beverages Menu is concentrated in the first quadrant.
(2) The Regular Menu is concentrated in the third quadrant.
(3) The Condiments Menu is concentrated in the fourth quadrant.
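The quadrant observations above could be backed by counts instead of visual inspection. The sketch below assigns each observation to a quadrant from its PC1 and PC2 scores and tabulates the result per group. It uses the built-in `iris` data as a stand-in, and the names `scores`, `quadrant` and `quadrant_table` are illustrative. With the menu data, `pca$x` and `dataNew$Category` would be used. The sign of a principal component is arbitrary, so quadrants may be mirrored between runs or platforms.

```r
# iris is a stand-in for the menu data.
pca_demo <- prcomp(iris[, 1:4], scale. = TRUE)
scores <- pca_demo$x[, 1:2]

# Quadrants are numbered counter-clockwise starting from (+PC1, +PC2).
quadrant <- ifelse(scores[, "PC1"] >= 0,
                   ifelse(scores[, "PC2"] >= 0, 1L, 4L),
                   ifelse(scores[, "PC2"] >= 0, 2L, 3L))

# Rows are groups, columns are quadrants; each cell counts observations.
quadrant_table <- table(group = iris$Species, quadrant = quadrant)
print(quadrant_table)
```

Unlike the plot, this table counts every observation, including those with a cos2 of 0.5 or below.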
I analyzed the correlations between nutrients and reported the variance contribution of each principal component after PCA.
The following resources inspired much of this project. After reading them, I wrote my own program and wrote the documentation in conjunction with the course content.
[1] [McDonald’s India : Menu Nutrition Dataset](https://www.kaggle.com/datasets/deepcontractor/mcdonalds-india-menu-nutrition-facts)
[2] [McD India - Exploratory Work in R - Graphs & PCA](https://www.kaggle.com/code/rsangole/mcd-india-exploratory-work-in-r-graphs-pca/report#graphical-eda)
[3] [McDonalds Menu EDA](https://www.kaggle.com/code/prathameshgadekar/mcdonalds-menu-eda)