Izdelke, ki jih prodaja McDonald’s, želite razvrstiti v skupine glede na njihovo vsebnost maščob, ogljikovih hidratov, vlaknin, sladkorja in proteinov. #### Metoda: razvrščanje v skupine.
library(readxl)
podatki <- read_xlsx("./Hrana.xlsx")
podatki <- as.data.frame(podatki)
head(podatki)
## ID Kategorija Ime Kalorije Mascobe OH
## 1 1 Zajtrk Egg McMuffin 300 13 31
## 2 2 Zajtrk Egg White Delight 250 8 30
## 3 3 Zajtrk Sausage McMuffin 370 23 29
## 4 4 Zajtrk Sausage McMuffin with Egg 450 28 30
## 5 5 Zajtrk Sausage McMuffin with Egg Whites 400 23 30
## 6 6 Zajtrk Steak & Egg McMuffin 430 23 31
## Vlaknine Sladkor Proteini
## 1 4 3 17
## 2 4 3 18
## 3 4 2 14
## 4 4 2 21
## 5 4 2 21
## 6 4 3 26
Opis spremenljivk:
podatki$Kategorija <- factor(podatki$Kategorija,
levels = c("Zajtrk", "GlavnaJed", "Pijaca", "Sladica"),
labels = c("Zajtrk", "GlavnaJed", "Pijaca", "Sladica"))
summary(podatki[c(4:9)])
## Kalorije Mascobe OH Vlaknine
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. :0.000
## 1st Qu.: 210.0 1st Qu.: 1.00 1st Qu.: 33.00 1st Qu.:0.000
## Median : 340.0 Median : 11.00 Median : 46.00 Median :1.000
## Mean : 377.3 Mean : 14.42 Mean : 48.97 Mean :1.564
## 3rd Qu.: 510.0 3rd Qu.: 23.00 3rd Qu.: 61.00 3rd Qu.:3.000
## Max. :1880.0 Max. :118.00 Max. :141.00 Max. :7.000
## Sladkor Proteini
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 7.00 1st Qu.: 4.00
## Median : 21.00 Median :12.00
## Mean : 31.35 Mean :13.44
## 3rd Qu.: 51.00 3rd Qu.:19.00
## Max. :128.00 Max. :87.00
podatki_clu_std <- as.data.frame(scale(podatki[c("Mascobe", "OH", "Vlaknine", "Sladkor", "Proteini")]))
podatki$Razlicnost <- sqrt(podatki_clu_std$Mascobe^2 + podatki_clu_std$OH^2 + podatki_clu_std$Vlaknine^2 +
podatki_clu_std$Sladkor^2 + podatki_clu_std$Proteini^2)
head(podatki[order(-podatki$Razlicnost), c("ID", "Ime", "Razlicnost")])
## ID Ime
## 83 83 Chicken McNuggets (40 piece)
## 33 33 Big Breakfast with Hotcakes (Large Biscuit)
## 35 35 Big Breakfast with Hotcakes and Egg Whites (Large Biscuit)
## 32 32 Big Breakfast with Hotcakes (Regular Biscuit)
## 235 235 McFlurry with M_Ms Candies (Medium)
## 228 228 Strawberry Shake (Large)
## Razlicnost
## 83 10.324266
## 33 5.715116
## 35 5.321825
## 32 5.077624
## 235 4.823116
## 228 4.692270
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
podatki <- podatki %>%
filter(!ID == 83)
podatki_clu_std <- as.data.frame(scale(podatki[c("Mascobe", "OH", "Vlaknine", "Sladkor", "Proteini")]))
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(podatki_clu_std,
method = "euclidian")
fviz_dist(Distances,
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))
library(factoextra)
get_clust_tendency(podatki_clu_std,
n = nrow(podatki_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8606645
##
## $plot
## NULL
library(dplyr)
library(factoextra)
WARD <- podatki_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 240
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none"
## instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at
## <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.
library(factoextra)
library(NbClust)
fviz_nbclust(podatki_clu_std, kmeans, method = "wss") +
labs(subtitle = "Metoda preloma")
fviz_nbclust(podatki_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Metoda silhuete")
#### Predlaga 3 skupine
library(NbClust)
NbClust(podatki_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW
## 2 0.9096 141.7393 125.5054 -1.2535 401.5137 6846344958 14688.210
## 3 2.3499 170.2758 67.6190 -0.4894 604.6317 6608259950 6017.438
## 4 5.6892 167.7340 28.8106 0.5766 832.6782 4542558971 3951.916
## 5 0.1625 147.7321 66.5206 -0.0114 999.3802 3543751810 2708.366
## 6 2.1945 164.2423 38.8580 3.5420 1144.0754 2792486718 1967.182
## 7 1.2382 165.3636 33.5182 5.5154 1345.2366 1643884629 1672.331
## 8 1.5680 166.2022 25.0205 7.1468 1447.5863 1401675258 1549.893
## 9 2.2081 163.5304 16.3123 8.0416 1556.7723 1125573030 1392.610
## 10 0.6114 156.7561 19.7344 8.1385 1614.0601 1094518051 1159.928
## TraceW Friedman Rubin Cindex DB Silhouette Duda Pseudot2
## 2 748.9612 10.8299 1.5955 0.2997 1.1452 0.3914 1.9960 -89.3202
## 3 490.3717 28.1197 2.4369 0.2957 1.0345 0.3904 1.5478 -54.8588
## 4 381.5196 37.0301 3.1322 0.2685 0.9991 0.3843 1.1978 -11.0622
## 5 340.0114 39.7973 3.5146 0.2520 0.9331 0.3940 0.5990 48.8703
## 6 264.9991 43.8063 4.5094 0.2970 0.9974 0.3872 0.8981 5.7855
## 7 227.2602 57.5413 5.2583 0.3551 0.9155 0.3912 1.5766 -17.1885
## 8 198.6792 61.6994 6.0147 0.3279 1.0012 0.3490 1.1472 -9.8805
## 9 179.3381 68.4667 6.6634 0.3105 1.0025 0.3411 0.9731 1.3528
## 10 167.5093 69.7550 7.1339 0.3571 1.0757 0.3278 2.0329 -23.8807
## Beale Ratkowsky Ball Ptbiserial Frey McClain Dunn Hubert
## 2 -1.5401 0.3846 374.4806 0.4758 0.4156 0.4591 0.0633 0.0014
## 3 -1.0928 0.4427 163.4572 0.5564 0.5666 0.9508 0.0468 0.0015
## 4 -0.5080 0.4115 95.3799 0.5477 0.2974 1.2084 0.0529 0.0018
## 5 2.0662 0.3779 68.0023 0.5518 0.1255 1.3118 0.0529 0.0018
## 6 0.3482 0.3600 44.1665 0.5747 0.3918 1.3896 0.0534 0.0022
## 7 -1.0844 0.3401 32.4657 0.5718 0.9627 1.4382 0.0658 0.0024
## 8 -0.3932 0.3227 24.8349 0.5204 0.6756 1.8596 0.0781 0.0025
## 9 0.0844 0.3072 19.9265 0.4953 0.5995 2.1182 0.0280 0.0025
## 10 -1.5406 0.2932 16.7509 0.4778 0.5183 2.3175 0.0336 0.0026
## SDindex Dindex SDbw
## 2 2.1161 1.5269 0.9643
## 3 1.6439 1.2813 2.8851
## 4 1.6993 1.1128 0.5984
## 5 1.6008 1.0543 0.3546
## 6 1.5384 0.9710 0.3591
## 7 1.7915 0.9086 0.2212
## 8 2.2230 0.8521 0.1996
## 9 2.2733 0.8050 0.1836
## 10 2.5683 0.7828 0.1739
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6559 93.9282 1.0000
## 3 0.6588 80.2903 1.0000
## 4 0.6335 38.7597 1.0000
## 5 0.6559 38.3059 0.0691
## 6 0.6182 31.4966 0.8831
## 7 0.4584 55.5236 1.0000
## 8 0.6080 49.6396 1.0000
## 9 0.5934 33.5761 0.9946
## 10 0.5502 38.4256 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 4.0000 3.0000 3.0000 10.0000 4.0000 4
## Value_Index 5.6892 170.2758 57.8864 8.1385 228.0466 1066893818
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 3.000 3.0000 3.0000 4.0000 5.000 7.0000
## Value_Index 8670.772 149.7373 17.2898 -0.3129 0.252 0.9155
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 5.000 2.000 2.0000 2.0000 3.0000 3.0000
## Value_Index 0.394 1.996 -89.3202 -1.5401 0.4427 211.0233
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex
## Number_clusters 6.0000 1 2.0000 8.0000 0 6.0000 0
## Value_Index 0.5747 NA 0.4591 0.0781 0 1.5384 0
## SDbw
## Number_clusters 10.0000
## Value_Index 0.1739
##
## $Best.partition
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [33] 1 1 1 3 1 3 3 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 3 3 3 3 3 3 3 2 3 3 3
## [97] 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
## [129] 3 3 3 3 3 2 3 3 2 3 3 2 3 3 3 3 3 3 3 3 2 3 3 2 3 3 2 3 3 3 3 2
## [161] 2 3 2 2 3 2 2 3 2 2 3 2 2 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [193] 3 2 3 3 2 3 3 2 3 3 2 2 2 2 2 2 2 2 2 2 3 2 2 3 2 2 3 3 2 2 2 2
## [225] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
Razvrstitev <- kmeans(podatki_clu_std,
centers = 3,
nstart = 25)
Razvrstitev
## K-means clustering with 3 clusters of sizes 106, 59, 75
##
## Cluster means:
## Mascobe OH Vlaknine Sladkor Proteini
## 1 -0.721030980 -0.64304063 -0.6169564 -0.2051063 -0.7259641
## 2 -0.003113774 1.16369290 -0.2567515 1.3933125 -0.1074586
## 3 1.021506620 -0.00660766 1.0739429 -0.8061889 1.1105633
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [33] 3 3 3 1 3 1 1 2 3 3 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [65] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 3 3 3 1 1 1 1 1 1 1 1 1 2 1 1 1
## [97] 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [129] 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 2
## [161] 2 1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [193] 1 2 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 1 1 2 2 2 2
## [225] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
##
## Within cluster sum of squares by cluster:
## [1] 153.1436 147.0813 190.1468
## (between_SS / total_SS = 59.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
library(factoextra)
fviz_cluster(Razvrstitev,
palette = "Set4",
repel = TRUE,
ggtheme = theme_bw(),
data = podatki_clu_std)
#### po MGK smo 5 razvrstitvenih spremeljivk združili v 2 GK (87%
CELOTNE INFO). Enote 32, 33, 34, 35 bi odstranili: s tem bi oblikovali
bolj homogene skupine, ss_between bi se povečal.
Povprecja <- Razvrstitev$centers
round(Povprecja, 3)
## Mascobe OH Vlaknine Sladkor Proteini
## 1 -0.721 -0.643 -0.617 -0.205 -0.726
## 2 -0.003 1.164 -0.257 1.393 -0.107
## 3 1.022 -0.007 1.074 -0.806 1.111
Figure <- as.data.frame(Povprecja)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Mascobe", "OH", "Vlaknine", "Sladkor", "Proteini"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Mascobe", "OH", "Vlaknine", "Sladkor", "Proteini"),
labels = c("Mascobe", "OH", "Vlaknine", "Sladkor", "Proteini"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Povprecje") +
xlab("Razvrstitvene spremenljivke")+
ylim(-2.5, 2.5)
#### na podlagi slike: Za skupino 1 velja: povsod je podpovprečna.
skupina 3: sladkor je najslabši, OH v popvprečju, ostalo nadpovprečno. S
katerim preizkusom preverim, ali so povprečja v teh 3 skupinah enaka ali
se razlikujejo? enofaktorska analiza variance za neodvisne vzorce
(anova): ho: vsa 3 povprečja so enaka za vsako od teh razvrstitvenih
spr.
podatki$Uvrstitev <- Razvrstitev$cluster
fit <- aov(cbind(Mascobe, OH, Vlaknine, Sladkor, Proteini) ~ as.factor(Uvrstitev),
data = podatki)
summary(fit)
## Response Mascobe :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 22486 11242.8 149.62 < 2.2e-16 ***
## Residuals 237 17809 75.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response OH :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 97866 48933 127.2 < 2.2e-16 ***
## Residuals 237 91173 385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Vlaknine :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 280.89 140.447 143.1 < 2.2e-16 ***
## Residuals 237 232.60 0.981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sladkor :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 139864 69932 278.95 < 2.2e-16 ***
## Residuals 237 59414 251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Proteini :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 16701 8350.5 196.35 < 2.2e-16 ***
## Residuals 237 10079 42.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aggregate(podatki$Kalorije,
by = list(podatki$Uvrstitev),
FUN = mean)
## Group.1 x
## 1 1 183.6321
## 2 2 496.1017
## 3 3 537.6000
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(podatki$Kalorije, as.factor(podatki$Uvrstitev))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 6.2202 0.002329 **
## 237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
podatki %>%
group_by(as.factor(Uvrstitev)) %>%
shapiro_test(Kalorije)
## # A tibble: 3 × 4
## `as.factor(Uvrstitev)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 Kalorije 0.940 0.000114
## 2 2 Kalorije 0.934 0.00334
## 3 3 Kalorije 0.876 0.00000249
kruskal.test(Kalorije ~ as.factor(Uvrstitev),
data = podatki)
##
## Kruskal-Wallis rank sum test
##
## data: Kalorije by as.factor(Uvrstitev)
## Kruskal-Wallis chi-squared = 162.05, df = 2, p-value < 2.2e-16
fit <- aov(Kalorije ~ as.factor(Uvrstitev),
data = podatki)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Uvrstitev) 2 6726356 3363178 148.6 <2e-16 ***
## Residuals 237 5362998 22629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(dplyr)
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.2
##
## Attaching package: 'janitor'
## The following object is masked from 'package:rstatix':
##
## make_clean_names
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
tabyl(podatki, Kategorija, Uvrstitev) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 1)
## Kategorija 1 2 3
## Zajtrk 2.8% 1.7% 50.7%
## GlavnaJed 3.8% 0.0% 49.3%
## Pijaca 82.1% 59.3% 0.0%
## Sladica 11.3% 39.0% 0.0%
rezultati <- chisq.test(podatki$Kategorija, podatki$Uvrstitev)
rezultati
##
## Pearson's Chi-squared test
##
## data: podatki$Kategorija and podatki$Uvrstitev
## X-squared = 230.45, df = 6, p-value < 2.2e-16
####Zanima me ali je izpolnjena predpostavka hi2 testa. (teoretične frekvence).
rezultati$expected
## podatki$Uvrstitev
## podatki$Kategorija 1 2 3
## Zajtrk 18.55000 10.325000 13.1250
## GlavnaJed 18.10833 10.079167 12.8125
## Pijaca 53.88333 29.991667 38.1250
## Sladica 15.45833 8.604167 10.9375
rezultati$residuals
## podatki$Uvrstitev
## podatki$Kategorija 1 2 3
## Zajtrk -3.6104259 -2.9020428 6.8661523
## GlavnaJed -3.3154041 -3.1747703 6.7573131
## Pijaca 4.5114837 0.9145194 -6.1745445
## Sladica -0.8796006 4.9077497 -3.3071891