Clustering, also known as cluster analysis, is an unsupervised learning technique used to identify patterns and structures in data sets. Its main goal is to place similar objects in the same cluster and dissimilar objects in different clusters, based on some measure of similarity or dissimilarity between them. Its applications span many fields and include customer segmentation, anomaly detection, and general data exploration.

The dataset used in this project pertains to customers of a Portuguese wholesale distributor. It includes annual spending in monetary units (m.u.) on various product categories.

The observations refer to customers, and the variables are divided as follows:

FRESH= Annual spending (in monetary units) on fresh products;

MILK= Annual spending (in monetary units) on dairy products;

GROCERY= Annual spending (in monetary units) on grocery products;

FROZEN= Annual spending (in monetary units) on frozen products;

DETERGENTS_PAPER= Annual spending (in monetary units) on detergents and paper products;

DELICATESSEN= Annual spending (in monetary units) on delicatessen products;

CHANNEL= Customer channel - Horeca (Hotel/Restaurant/Cafe) or Retail channel;

REGION= Customer region - Lisbon, Porto, or Other city.

“CHANNEL” and “REGION” are categorical variables, while the rest are quantitative variables.

In this project, we will perform a hierarchical clustering process and a “k-means” clustering process. In other words, we will conduct one cluster analysis in which the number of clusters is determined during the process (hierarchical method) and another in which the number of clusters is predefined (k-means). This lets us follow a common practice among data scientists: using the output of the hierarchical method as input for the “k-means” method.
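The idea of feeding the hierarchical result into k-means can be sketched as follows. This is only an illustration on R's built-in USArrests data, not on the wholesale data, and the value of k and the object names (hc, grupos, centros, km) are assumptions made for the example.

```r
# Illustrative sketch on the built-in USArrests data (not the wholesale data)
x  <- scale(USArrests)                        # standardize the variables
hc <- hclust(dist(x), method = "complete")    # hierarchical step
k  <- 4                                       # assumed number of clusters, for illustration only
grupos <- cutree(hc, k = k)                   # cluster labels taken from the dendrogram

# Centroid of each hierarchical cluster: one row per cluster, one column per variable
centros <- apply(x, 2, function(col) tapply(col, grupos, mean))

# k-means started from those centroids instead of from random points
km <- kmeans(x, centers = centros)

# Cross-tabulate the two partitions to see how much k-means moved the observations
table(hierarchical = grupos, kmeans = km$cluster)
```

Starting k-means from the hierarchical centroids fixes both the number of clusters and the initial centers, so the result no longer depends on a random start.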

Database used:
Wholesale Customers Data

Installing and loading the required packages

# Packages used throughout the analysis
pacotes <- c("plotly", "fastDummies", "tidyverse", "ggrepel", "knitr", "kableExtra", "reshape2",
             "misc3d", "plot3D", "cluster", "factoextra", "ade4")

# Install only the packages that are missing, then load all of them
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)

Importing the database:

clientesdata <- read.csv("Wholesale customers data.csv")
save(clientesdata, file = "clientesdata.RData")

Data preparation

Visualization of the database

View(clientesdata)

Channel	Region	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
2	3	12669	9656	7561	214	2674	1338
2	3	7057	9810	9568	1762	3293	1776
2	3	6353	8808	7684	2405	3516	7844
1	3	13265	1196	4221	6404	507	1788
2	3	22615	5410	7198	3915	1777	5185
2	3	9413	8259	5126	666	1795	1451
2	3	12126	3199	6975	480	3140	545
2	3	7579	4956	9426	1669	3321	2566
1	3	5963	3648	6192	425	1716	750
2	3	6006	11093	18881	1159	7425	2098

Showing only the first 10 rows.

Count of categories by variable

map(clientesdata[, c("Channel", "Region")], ~ summary(as.factor(.)))

## $Channel
##   1   2 
## 298 142 
## 
## $Region
##   1   2   3 
##  77  47 316

Where: Channel(1) = Hotel/Restaurant/Café; Channel(2) = Retail. Region(1) = Lisbon; Region(2) = Porto; Region(3) = Other Region.

Looking at the “type” of the variables in our database

glimpse(clientesdata)

## Rows: 440
## Columns: 8
## $ Channel          <int> 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ Region           <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ Fresh            <int> 12669, 7057, 6353, 13265, 22615, 9413, 12126, 7579, 5…
## $ Milk             <int> 9656, 9810, 8808, 1196, 5410, 8259, 3199, 4956, 3648,…
## $ Grocery          <int> 7561, 9568, 7684, 4221, 7198, 5126, 6975, 9426, 6192,…
## $ Frozen           <int> 214, 1762, 2405, 6404, 3915, 666, 480, 1669, 425, 115…
## $ Detergents_Paper <int> 2674, 3293, 3516, 507, 1777, 1795, 3140, 3321, 1716, …
## $ Delicassen       <int> 1338, 1776, 7844, 1788, 5185, 1451, 545, 2566, 750, 2…

Since the categorical variables are encoded as numerical values, we will change them to factors:

clientesdata2 <- clientesdata
clientesdata2$Channel <- as.factor(clientesdata$Channel)
clientesdata2$Region <- as.factor(clientesdata$Region)

Since the data contain both categorical and numerical variables, we will split the variables into two data frames and build a separate distance matrix for each. This is necessary because numerical and categorical variables call for different distance measures. Afterward, we will combine the two matrices and perform the clustering on the combined matrix.

Separating the variables into numerical and categorical

dados_numericos <- clientesdata2[, c("Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen")]
dados_categoricos <- clientesdata2[, c("Channel", "Region")]

Standardizing the numerical variables

dados_padronizados <- as.data.frame(scale(dados_numericos))

Now, all numerical variables have a mean of 0 and a standard deviation of 1. Standardization is necessary in cluster analysis when the variables are on different scales; otherwise, the variables with the largest values dominate the distance calculations.
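The effect of standardization can be checked on a small example. The data frame below uses hypothetical values and names (toy, spend, items) chosen only for illustration.

```r
# Toy data (hypothetical values) with two variables on very different scales
toy <- data.frame(spend = c(12000, 7000, 6500, 13000, 22000),
                  items = c(3, 5, 2, 8, 4))

toy_std <- as.data.frame(scale(toy))   # (x - mean) / sd, column by column

round(colMeans(toy_std), 10)           # means are 0 up to rounding error
apply(toy_std, 2, sd)                  # standard deviations are 1

# Without scaling, the Euclidean distance is dominated by 'spend'
dist(toy)[1]       # distance between rows 1 and 2 on the raw data
dist(toy_std)[1]   # same pair after standardization
```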

Dummy encoding the categorical variables

dados_dummies <- dummy_columns(.data = dados_categoricos,
                               select_columns = "Channel",
                               remove_selected_columns = TRUE,
                               remove_most_frequent_dummy = TRUE)

dados_dummies <- dummy_columns(.data = dados_dummies,
                               select_columns = "Region",
                               remove_selected_columns = TRUE,
                               remove_most_frequent_dummy = TRUE)
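To make the encoding concrete, the sketch below reproduces the same idea in base R on a toy data frame with hypothetical values. It uses model.matrix() rather than fastDummies, so the column names differ slightly from those produced above (for example Channel2 instead of Channel_2), and the helper most_frequent_first is a name invented for this example.

```r
# Toy categorical data (hypothetical values)
toy <- data.frame(Channel = factor(c(1, 1, 2, 1, 2)),
                  Region  = factor(c(3, 3, 1, 2, 3)))

# Put the most frequent level first so that model.matrix() drops it,
# mirroring remove_most_frequent_dummy = TRUE
most_frequent_first <- function(f) relevel(f, ref = names(which.max(table(f))))
toy[] <- lapply(toy, most_frequent_first)

# One 0/1 column per remaining level; [, -1] removes the intercept column
dummies <- as.data.frame(model.matrix(~ Channel + Region, data = toy)[, -1])
names(dummies)
dummies
```

Dropping one level per variable avoids redundant columns: a customer with 0 in every Region dummy necessarily belongs to the omitted (most frequent) region.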

Matrices

Creating our matrices

matriz_D_numerica <- dados_padronizados %>% dist(method = "euclidean")
matriz_D_categorica <- dados_dummies %>% dist(method = "binary")

Joining the matrices

dist_total <- matriz_D_categorica + matriz_D_numerica
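The sketch below shows what this sum does on a tiny example with hypothetical values. The point to notice is that the "binary" distance always lies between 0 and 1, whereas the Euclidean distance has no upper bound, so the numerical part can weigh more heavily in the total.

```r
# Hypothetical toy data: three observations
num <- scale(data.frame(a = c(1, 2, 10), b = c(5, 3, 8)))   # numeric part, standardized
bin <- data.frame(d1 = c(1, 1, 0), d2 = c(0, 1, 1))         # dummy part

d_num <- dist(num, method = "euclidean")   # unbounded above
d_bin <- dist(bin, method = "binary")      # always between 0 and 1
d_sum <- d_num + d_bin                     # element-wise sum of the two matrices

class(d_sum)    # still a "dist" object, so it can be passed to agnes()
d_bin           # 0.5, 1.0, 0.5 for the pairs (1,2), (1,3), (2,3)
d_sum
```

When the different ranges are a concern, a commonly used alternative for mixed data is the Gower dissimilarity, available as cluster::daisy(x, metric = "gower"); it is mentioned here only as an option, not as the method used in this project.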

Viewing the dissimilarity matrix

data.matrix(dist_total)[1:5, ] %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "250px")

	1	2	3	4	5
1	0.000000	0.620152	2.412450	2.815908	1.851574
2	0.620152	0.000000	2.170333	2.781998	1.920911
3	2.412450	2.170333	0.000000	3.680215	1.728199
4	2.815908	2.781998	3.680215	0.000000	2.659964
5	1.851574	1.920911	1.728199	2.659964	0.000000

Showing only the first 5 rows and the first 5 of the 440 columns.

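Since our distances are relatively small, we will use the complete linkage method during hierarchical clustering. Complete linkage measures the distance between two clusters by their two most distant members, so it tends to produce compact groups.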

Hierarchical clustering

Running the hierarchical clustering

cluster_hier <- agnes(x = dist_total, method = "complete")
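One way to sanity-check the choice of linkage is to compare the agglomerative coefficient that agnes() reports in its ac component. The sketch below does this on the built-in USArrests data, used here only as a stand-in; the object names are assumptions for the example.

```r
library(cluster)

d <- dist(scale(USArrests))                  # illustrative distance matrix
metodos <- c("single", "average", "complete")

# Agglomerative coefficient for each linkage method (between 0 and 1)
ac <- sapply(metodos, function(m) agnes(x = d, method = m)$ac)
round(ac, 3)
```

Values closer to 1 suggest a more clearly defined clustering structure, although the coefficient tends to grow with the number of observations, so it is best used to compare methods on the same data.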

Dendrogram construction

dendo1 <- fviz_dend(x = cluster_hier, show_labels = FALSE)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.

dendo1

We can observe the presence of some significant “jumps” in the dendrogram. We also notice the presence of well-defined clusters and others that are less clear. This suggests that there may be some outliers in the database.

After analyzing the dendrogram in a hierarchical clustering process, we can choose the number of clusters by examining the dendrogram’s structure and identifying cuts that appear to be the most meaningful or relevant for our objective.

Dendrogram with cluster visualization

Setting a height of 7 for the dendrogram cluster definition:

dendo_clusters <- fviz_dend(x = cluster_hier,
          h = 7,
          color_labels_by_k = F,
          rect = T,
          rect_fill = T,
          lwd = 1,
          ggtheme = theme_bw(),
          show_labels = FALSE)
dendo_clusters

A height of 7 was chosen simply because it seems to provide a good separation of clusters, judging by the size of the jumps in the dendrogram. The cut at height 7 resulted in 12 clusters, but half of them are concentrated on the far right of the dendrogram because of the outliers mentioned earlier.

Creating a database with all the data used in the creation of the matrices

dados_completos <- cbind(dados_padronizados, Channel_2=dados_dummies$Channel_2, Region_1=dados_dummies$Region_1, Region_2=dados_dummies$Region_2)

Creating a categorical variable to indicate the cluster in the database

dados_completos$cluster_hier <- factor(cutree(tree = cluster_hier, k = 12))

Note: 12 is the number of clusters created by cutting at height 7. Therefore, the argument ‘k’ indicates the number of clusters. Next, we will check if all variables contribute to the formation of the groups.
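The link between a cut height and the resulting number of clusters can be verified directly rather than read off the plot. The sketch below uses the built-in USArrests data and an arbitrary cut height, both assumptions for illustration; the agnes object is converted with as.hclust() so that cutree() can cut by height.

```r
library(cluster)

d  <- dist(scale(USArrests))               # illustrative data
ag <- agnes(x = d, method = "complete")
hc <- as.hclust(ag)                        # convert so that height-based cuts work

h_corte <- 3                               # assumed cut height, for illustration only
por_altura <- cutree(hc, h = h_corte)      # labels obtained by cutting at that height
k <- length(unique(por_altura))            # number of clusters the cut produces
por_k <- cutree(hc, k = k)                 # same partition, requested by k instead

k
table(por_altura)                          # cluster sizes
all(por_altura == por_k)                   # both calls give the same labels
```

The table of cluster sizes is also a quick way to spot very small clusters, which often correspond to outliers.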

Analysis of variance using the hierarchical method

summary(anova_channel2 <- aov(formula = Channel_2 ~ cluster_hier,
                                data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  63.77   5.798   76.59 <2e-16 ***
## Residuals    428  32.40   0.076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_region1 <- aov(formula = Region_1 ~ cluster_hier,
                              data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)
## cluster_hier  11   1.77  0.1607   1.114  0.348
## Residuals    428  61.76  0.1443

summary(anova_region2 <- aov(formula = Region_2 ~ cluster_hier,
                             data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)  
## cluster_hier  11   2.04 0.18526   1.985 0.0284 *
## Residuals    428  39.94 0.09332                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_fresh <- aov(formula = Fresh ~ cluster_hier,
                             data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  234.5  21.316    44.6 <2e-16 ***
## Residuals    428  204.5   0.478                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_milk <- aov(formula = Milk ~ cluster_hier,
                           data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  311.8  28.342   95.34 <2e-16 ***
## Residuals    428  127.2   0.297                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_grocery <- aov(formula = Grocery ~ cluster_hier,
                           data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  354.6   32.24   163.6 <2e-16 ***
## Residuals    428   84.4    0.20                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_frozen <- aov(formula = Frozen ~ cluster_hier,
                             data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  258.7  23.522   55.85 <2e-16 ***
## Residuals    428  180.3   0.421                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_detergents <- aov(formula = Detergents_Paper ~ cluster_hier,
                            data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  380.9   34.63   255.2 <2e-16 ***
## Residuals    428   58.1    0.14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_delicassen <- aov(formula = Delicassen ~ cluster_hier,
                            data = dados_completos))

##               Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_hier  11  356.5   32.41   168.1 <2e-16 ***
## Residuals    428   82.5    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

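At a 95% confidence level (5% significance), “Region_1” is the only variable whose mean does not differ significantly between clusters (p = 0.348), so it cannot be considered significant for the formation of at least one cluster. All other variables are significant, although “Region_2” is only weakly so (p = 0.0284).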
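The nine separate `aov` calls can also be condensed into a loop that collects one p-value per variable. The sketch below shows the idea on the built-in `iris` data, where `Species` plays the role of the cluster factor; the variable names and the 0.05 threshold are stand-ins for this illustration, not results from the wholesale data.

```r
# Stand-in data: iris, with Species playing the role of the cluster factor.
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# One-way ANOVA per variable; keep only the p-value of the grouping factor.
p_values <- sapply(vars, function(v) {
  fit <- aov(reformulate("Species", response = v), data = iris)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Flag the variables that differ between groups at the 5% level.
significant <- p_values < 0.05

print(signif(p_values, 3))
print(significant)
```

For this project, `vars` would hold the nine column names of `dados_completos` and `"Species"` would be replaced by `"cluster_hier"` (or `"cluster_K"` for the k-means results).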

Analyzing the descriptive statistics of the clusters by variable using the hierarchical method

Descriptive statistics for the ‘Fresh’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Fresh),
    sd = sd(Fresh),
    min = min(Fresh),
    max = max(Fresh))

## # A tibble: 12 × 5
##    cluster_hier    mean     sd     min     max
##    <fct>          <dbl>  <dbl>   <dbl>   <dbl>
##  1 1            -0.560   0.325 -0.947   0.286 
##  2 2            -0.0112  0.760 -0.949   3.49  
##  3 3             1.37    1.01   0.497   2.47  
##  4 4             2.70    0.989  1.40    5.08  
##  5 5             1.77    0.858  0.864   2.57  
##  6 6            -0.494   0.436 -0.942   0.794 
##  7 7             0.325  NA      0.325   0.325 
##  8 8            -0.0543 NA     -0.0543 -0.0543
##  9 9             7.92   NA      7.92    7.92  
## 10 10            1.96   NA      1.96    1.96  
## 11 11            1.64   NA      1.64    1.64  
## 12 12           -0.272  NA     -0.272  -0.272

Descriptive statistics for the ‘Milk’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Milk),
    sd = sd(Milk),
    min = min(Milk),
    max = max(Milk))

## # A tibble: 12 × 5
##    cluster_hier   mean     sd    min    max
##    <fct>         <dbl>  <dbl>  <dbl>  <dbl>
##  1 1             0.533  0.701 -0.613  2.72 
##  2 2            -0.359  0.355 -0.778  1.48 
##  3 3             1.14   2.62  -0.614  4.15 
##  4 4            -0.379  0.288 -0.747  0.184
##  5 5             6.72   2.38   4.41   9.17 
##  6 6             1.32   0.984 -0.279  3.26 
##  7 7             5.47  NA      5.47   5.47 
##  8 8            -0.367 NA     -0.367 -0.367
##  9 9             3.23  NA      3.23   3.23 
## 10 10            5.17  NA      5.17   5.17 
## 11 11            1.49  NA      1.49   1.49 
## 12 12           -0.111 NA     -0.111 -0.111

Descriptive statistics for the ‘Grocery’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Grocery),
    sd = sd(Grocery),
    min = min(Grocery),
    max = max(Grocery))

## # A tibble: 12 × 5
##    cluster_hier   mean     sd     min    max
##    <fct>         <dbl>  <dbl>   <dbl>  <dbl>
##  1 1             0.570  0.570 -0.662   2.21 
##  2 2            -0.408  0.343 -0.836   0.949
##  3 3             0.958  0.817  0.0174  1.48 
##  4 4            -0.364  0.374 -0.787   0.490
##  5 5             4.33   1.56   2.54    5.43 
##  6 6             2.09   0.766  0.928   3.99 
##  7 7             8.93  NA      8.93    8.93 
##  8 8            -0.620 NA     -0.620  -0.620
##  9 9             1.07  NA      1.07    1.07 
## 10 10            1.29  NA      1.29    1.29 
## 11 11            0.597 NA      0.597   0.597
## 12 12            6.24  NA      6.24    6.24

Descriptive statistics for the ‘Frozen’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Frozen),
    sd = sd(Frozen),
    min = min(Frozen),
    max = max(Frozen))

## # A tibble: 12 × 5
##    cluster_hier    mean     sd    min    max
##    <fct>          <dbl>  <dbl>  <dbl>  <dbl>
##  1 1            -0.327   0.326 -0.626  1.46 
##  2 2            -0.0149  0.707 -0.628  3.22 
##  3 3             0.523   0.127  0.429  0.667
##  4 4             0.606   1.07  -0.420  3.08 
##  5 5             0.193   0.713 -0.429  0.970
##  6 6            -0.243   0.360 -0.625  0.757
##  7 7            -0.421  NA     -0.421 -0.421
##  8 8             6.58   NA      6.58   6.58 
##  9 9             2.82   NA      2.82   2.82 
## 10 10            6.89   NA      6.89   6.89 
## 11 11           11.9    NA     11.9   11.9  
## 12 12           -0.606  NA     -0.606 -0.606

Descriptive statistics for the ‘Detergents_Paper’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Detergents_Paper),
    sd = sd(Detergents_Paper),
    min = min(Detergents_Paper),
    max = max(Detergents_Paper))

## # A tibble: 12 × 5
##    cluster_hier   mean     sd    min    max
##    <fct>         <dbl>  <dbl>  <dbl>  <dbl>
##  1 1             0.549  0.512 -0.545  2.00 
##  2 2            -0.401  0.269 -0.604  0.844
##  3 3             0.101  0.324 -0.273  0.305
##  4 4            -0.486  0.105 -0.600 -0.283
##  5 5             4.36   0.702  3.61   5.00 
##  6 6             2.45   0.760  1.34   4.48 
##  7 7             7.96  NA      7.96   7.96 
##  8 8            -0.589 NA     -0.589 -0.589
##  9 9             0.433 NA      0.433  0.433
## 10 10           -0.554 NA     -0.554 -0.554
## 11 11           -0.338 NA     -0.338 -0.338
## 12 12            7.39  NA      7.39   7.39

Descriptive statistics for the ‘Delicassen’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Delicassen),
    sd = sd(Delicassen),
    min = min(Delicassen),
    max = max(Delicassen))

## # A tibble: 12 × 5
##    cluster_hier    mean     sd    min    max
##    <fct>          <dbl>  <dbl>  <dbl>  <dbl>
##  1 1             0.0384  0.545 -0.540  2.24 
##  2 2            -0.139   0.391 -0.540  1.89 
##  3 3             4.82    0.433  4.55   5.32 
##  4 4            -0.0715  0.337 -0.540  0.493
##  5 5             0.569   1.04  -0.221  1.75 
##  6 6             0.0961  0.549 -0.528  1.28 
##  7 7             0.503  NA      0.503  0.503
##  8 8             0.416  NA      0.416  0.416
##  9 9             2.49   NA      2.49   2.49 
## 10 10           16.5    NA     16.5   16.5  
## 11 11            1.45   NA      1.45   1.45 
## 12 12           -0.110  NA     -0.110 -0.110

Descriptive statistics for the ‘Channel_2’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Channel_2),
    sd = sd(Channel_2),
    min = min(Channel_2),
    max = max(Channel_2))

## # A tibble: 12 × 5
##    cluster_hier   mean     sd   min   max
##    <fct>         <dbl>  <dbl> <int> <int>
##  1 1            0.944   0.230     0     1
##  2 2            0.0997  0.300     0     1
##  3 3            0.333   0.577     0     1
##  4 4            0       0         0     0
##  5 5            1       0         1     1
##  6 6            1       0         1     1
##  7 7            1      NA         1     1
##  8 8            0      NA         0     0
##  9 9            0      NA         0     0
## 10 10           0      NA         0     0
## 11 11           0      NA         0     0
## 12 12           1      NA         1     1

Descriptive statistics for the ‘Region_1’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Region_1),
    sd = sd(Region_1),
    min = min(Region_1),
    max = max(Region_1))

## # A tibble: 12 × 5
##    cluster_hier  mean     sd   min   max
##    <fct>        <dbl>  <dbl> <int> <int>
##  1 1            0.122  0.329     0     1
##  2 2            0.196  0.398     0     1
##  3 3            0      0         0     0
##  4 4            0      0         0     0
##  5 5            0      0         0     0
##  6 6            0.333  0.483     0     1
##  7 7            0     NA         0     0
##  8 8            0     NA         0     0
##  9 9            0     NA         0     0
## 10 10           0     NA         0     0
## 11 11           0     NA         0     0
## 12 12           0     NA         0     0

Descriptive statistics for the ‘Region_2’ variable

group_by(dados_completos, cluster_hier) %>%
  summarise(
    mean = mean(Region_2),
    sd = sd(Region_2),
    min = min(Region_2),
    max = max(Region_2))

## # A tibble: 12 × 5
##    cluster_hier   mean     sd   min   max
##    <fct>         <dbl>  <dbl> <int> <int>
##  1 1            0.144   0.354     0     1
##  2 2            0.0997  0.300     0     1
##  3 3            0       0         0     0
##  4 4            0       0         0     0
##  5 5            0       0         0     0
##  6 6            0.0952  0.301     0     1
##  7 7            0      NA         0     0
##  8 8            0      NA         0     0
##  9 9            0      NA         0     0
## 10 10           0      NA         0     0
## 11 11           1      NA         1     1
## 12 12           1      NA         1     1

Through these statistics, we can understand the characteristics of each cluster, and consequently the distributor would know how to better allocate its resources to meet the demand of its customers. In our example, we can also observe the presence of outliers: clusters 7, 8, 9, 10, 11, and 12 are each formed by a single observation, which is why their standard deviation is reported as NA. We could remove these outlier observations and rerun the hierarchical clustering algorithm. However, since our goal is to use the output of the hierarchical method as input for the “k-means” method, we will instead run the “k-means” clustering algorithm with only 6 clusters, the number of groups that contain more than one observation. This way, we expect to obtain better-defined clusters.
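Single-observation clusters can be spotted without reading every summary table by counting the members of each cluster. The sketch below uses a small hypothetical data set (two tight groups plus one distant point) purely to illustrate the check; the object names are not from the project.

```r
# Toy data (hypothetical): two tight groups plus one far-away point.
group_a <- cbind((0:9) / 10, 0)
group_b <- cbind(5 + (0:9) / 10, 0)
outlier <- c(50, 50)
x <- rbind(group_a, group_b, outlier)

# Hierarchical clustering with hclust's default linkage, cut into 3 groups.
clusters <- cutree(hclust(dist(x)), k = 3)

# Cluster sizes; clusters of size 1 are candidate outliers.
sizes <- table(clusters)
singleton_ids <- names(sizes)[sizes == 1]

# Number of clusters with more than one member.
n_regular <- sum(sizes > 1)

print(sizes)
print(singleton_ids)
```

Applied to this project, `table(dados_completos$cluster_hier)` would give the size of each of the 12 hierarchical clusters.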

“K-means” clustering

Elbow method for identifying the optimal number of clusters

fviz_nbclust(dados_completos[,1:9], kmeans, method = "wss", k.max = 10)

The elbow method appears to indicate that the optimal number of clusters is around 6 or 7, which aligns with our analysis at the end of the hierarchical procedure. The optimal number is indicated by the point on the X-axis after which the total within-cluster sum of squares on the Y-axis stops falling sharply and the curve begins to flatten (the “elbow”).
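The curve drawn by `fviz_nbclust` with `method = "wss"` is the total within-cluster sum of squares for each candidate number of clusters. A minimal sketch of that computation in base R follows; it uses the standardized `USArrests` data as a stand-in, and the seed and `nstart = 20` are choices made for this example.

```r
set.seed(123)          # k-means uses random starting points, so fix the seed
x <- scale(USArrests)  # stand-in data, standardized

# Total within-cluster sum of squares for k = 1, ..., 10.
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Reduction obtained by adding one more cluster; the elbow is where
# these reductions become small compared with the earlier ones.
drop_per_step <- -diff(wss)

print(round(wss, 1))
print(round(drop_per_step, 1))
```

Plotting `wss` against `1:10` with `plot(1:10, wss, type = "b")` gives the same kind of elbow curve.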

Implementation of the non-hierarchical k-means clustering algorithm

cluster_kmeans <- kmeans(select(dados_completos, -cluster_hier),
                         centers = 6)
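Because `kmeans` picks its starting centers at random when `centers` is a number, repeated runs can label and split the clusters differently; calling `set.seed()` beforehand makes a run repeatable. A stronger way of feeding the hierarchical output into k-means, which is a variant and not what the code above does, is to pass the centroids of the hierarchical clusters as starting centers. The sketch below illustrates this on the standardized `USArrests` data; the data, the default linkage, and `k = 4` are assumptions for the example.

```r
x <- scale(USArrests)  # stand-in data, standardized

# Step 1: hierarchical clustering (hclust's default linkage) cut into 4 groups.
groups <- cutree(hclust(dist(x)), k = 4)

# Step 2: centroid of each hierarchical cluster
# (rows = clusters, columns = variables).
centroids <- apply(x, 2, function(column) tapply(column, groups, mean))

# Step 3: k-means started from those centroids instead of random rows.
fit <- kmeans(x, centers = centroids)

# Cross-tabulate the two partitions to see how many observations moved.
print(table(hierarchical = groups, kmeans = fit$cluster))
```

Since the starting centers are fixed, this run is deterministic and the k-means cluster numbers line up with the hierarchical ones.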

Creating a categorical variable to indicate the cluster in the database

dados_completos2 <- dados_completos
dados_completos2$cluster_K <- factor(cluster_kmeans$cluster)

Analysis of variance using the k-means method

summary(anova_channel2 <- aov(formula = Channel_2 ~ cluster_K,
                              data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  61.17  12.234   151.7 <2e-16 ***
## Residuals   434  35.00   0.081                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_region1 <- aov(formula = Region_1 ~ cluster_K,
                             data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)
## cluster_K     5   0.48 0.09555   0.658  0.656
## Residuals   434  63.05 0.14527

summary(anova_region2 <- aov(formula = Region_2 ~ cluster_K,
                             data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)
## cluster_K     5   0.22 0.04388   0.456  0.809
## Residuals   434  41.76 0.09622

summary(anova_fresh <- aov(formula = Fresh ~ cluster_K,
                           data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  256.9   51.37   122.4 <2e-16 ***
## Residuals   434  182.2    0.42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_milk <- aov(formula = Milk ~ cluster_K,
                          data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  279.8   55.96   152.5 <2e-16 ***
## Residuals   434  159.2    0.37                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_grocery <- aov(formula = Grocery ~ cluster_K,
                             data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  316.3   63.26   223.8 <2e-16 ***
## Residuals   434  122.7    0.28                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_frozen <- aov(formula = Frozen ~ cluster_K,
                            data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  257.1   51.41   122.6 <2e-16 ***
## Residuals   434  181.9    0.42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_detergents <- aov(formula = Detergents_Paper ~ cluster_K,
                                data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  345.4   69.07   320.1 <2e-16 ***
## Residuals   434   93.6    0.22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(anova_delicassen <- aov(formula = Delicassen ~ cluster_K,
                                data = dados_completos2))

##              Df Sum Sq Mean Sq F value Pr(>F)    
## cluster_K     5  183.6   36.71   62.37 <2e-16 ***
## Residuals   434  255.4    0.59                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the case of the “k-means” procedure, the variables “Region_1” (referring to the city of Lisbon) and “Region_2” (referring to the city of Porto) show no significant difference between clusters (p = 0.656 and p = 0.809, respectively), so they do not contribute to the formation of any cluster. All the other variables are significant for the formation of at least one cluster at a 95% confidence level.

Descriptive statistics of the clusters by variable using the k-means method

Descriptive statistics for the ‘Fresh’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Fresh),
    sd = sd(Fresh),
    min = min(Fresh),
    max = max(Fresh))

## # A tibble: 6 × 5
##   cluster_K   mean    sd    min   max
##   <fct>      <dbl> <dbl>  <dbl> <dbl>
## 1 1          0.170 0.659 -0.949 1.50 
## 2 2          3.16  3.19   1.14  7.92 
## 3 3          0.313 1.14  -0.942 2.57 
## 4 4         -0.280 0.483 -0.949 0.890
## 5 5         -0.490 0.455 -0.947 1.00 
## 6 6          1.87  0.986  0.497 5.08

Descriptive statistics for the ‘Milk’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Milk),
    sd = sd(Milk),
    min = min(Milk),
    max = max(Milk))

## # A tibble: 6 × 5
##   cluster_K   mean    sd    min   max
##   <fct>      <dbl> <dbl>  <dbl> <dbl>
## 1 1         -0.127 0.704 -0.740 2.40 
## 2 2          3.51  1.56   1.49  5.17 
## 3 3          3.92  2.62  -0.111 9.17 
## 4 4         -0.416 0.287 -0.778 0.661
## 5 5          0.574 0.637 -0.613 2.72 
## 6 6         -0.225 0.429 -0.747 1.01

Descriptive statistics for the ‘Grocery’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Grocery),
    sd = sd(Grocery),
    min = min(Grocery),
    max = max(Grocery))

## # A tibble: 6 × 5
##   cluster_K   mean    sd    min   max
##   <fct>      <dbl> <dbl>  <dbl> <dbl>
## 1 1         -0.401 0.317 -0.765 0.850
## 2 2          1.11  0.380  0.597 1.48 
## 3 3          4.27  2.16   1.99  8.93 
## 4 4         -0.454 0.294 -0.836 0.898
## 5 5          0.836 0.673 -0.102 3.00 
## 6 6         -0.230 0.451 -0.787 1.38

Descriptive statistics for the ‘Frozen’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Frozen),
    sd = sd(Frozen),
    min = min(Frozen),
    max = max(Frozen))

## # A tibble: 6 × 5
##   cluster_K     mean    sd    min    max
##   <fct>        <dbl> <dbl>  <dbl>  <dbl>
## 1 1          1.55    1.03   0.332  6.58 
## 2 2          5.51    5.03   0.429 11.9  
## 3 3         -0.00357 0.554 -0.625  0.970
## 4 4         -0.264   0.319 -0.623  0.772
## 5 5         -0.358   0.234 -0.628  0.529
## 6 6          0.154   0.772 -0.607  3.08

Descriptive statistics for the ‘Detergents_Paper’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Detergents_Paper),
    sd = sd(Detergents_Paper),
    min = min(Detergents_Paper),
    max = max(Detergents_Paper))

## # A tibble: 6 × 5
##   cluster_K    mean     sd    min    max
##   <fct>       <dbl>  <dbl>  <dbl>  <dbl>
## 1 1         -0.494  0.0826 -0.601 -0.280
## 2 2         -0.0383 0.482  -0.554  0.433
## 3 3          4.61   1.73    3.12   7.96 
## 4 4         -0.406  0.243  -0.604  0.511
## 5 5          0.860  0.696  -0.545  2.99 
## 6 6         -0.396  0.243  -0.602  0.365

Descriptive statistics for the ‘Delicassen’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Delicassen),
    sd = sd(Delicassen),
    min = min(Delicassen),
    max = max(Delicassen))

## # A tibble: 6 × 5
##   cluster_K    mean    sd    min   max
##   <fct>       <dbl> <dbl>  <dbl> <dbl>
## 1 1          0.0305 0.431 -0.524  1.54
## 2 2          6.43   6.88   1.45  16.5 
## 3 3          0.503  0.697 -0.221  1.75
## 4 4         -0.220  0.305 -0.540  1.28
## 5 5          0.0375 0.537 -0.540  2.24
## 6 6          0.299  1.02  -0.540  4.59

Descriptive statistics for the ‘Channel_2’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Channel_2),
    sd = sd(Channel_2),
    min = min(Channel_2),
    max = max(Channel_2))

## # A tibble: 6 × 5
##   cluster_K   mean    sd   min   max
##   <fct>      <dbl> <dbl> <int> <int>
## 1 1         0.0682 0.255     0     1
## 2 2         0.25   0.5       0     1
## 3 3         1      0         1     1
## 4 4         0.1    0.301     0     1
## 5 5         0.951  0.216     0     1
## 6 6         0.143  0.354     0     1

Descriptive statistics for the ‘Region_1’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Region_1),
    sd = sd(Region_1),
    min = min(Region_1),
    max = max(Region_1))

## # A tibble: 6 × 5
##   cluster_K  mean    sd   min   max
##   <fct>     <dbl> <dbl> <int> <int>
## 1 1         0.182 0.390     0     1
## 2 2         0     0         0     0
## 3 3         0.2   0.422     0     1
## 4 4         0.196 0.398     0     1
## 5 5         0.126 0.334     0     1
## 6 6         0.184 0.391     0     1

Descriptive statistics for the ‘Region_2’ variable

group_by(dados_completos2, cluster_K) %>%
  summarise(
    mean = mean(Region_2),
    sd = sd(Region_2),
    min = min(Region_2),
    max = max(Region_2))

## # A tibble: 6 × 5
##   cluster_K   mean    sd   min   max
##   <fct>      <dbl> <dbl> <int> <int>
## 1 1         0.136  0.347     0     1
## 2 2         0.25   0.5       0     1
## 3 3         0.1    0.316     0     1
## 4 4         0.0957 0.295     0     1
## 5 5         0.126  0.334     0     1
## 6 6         0.0816 0.277     0     1

Consistent with the analysis of variance above, none of the clusters is characterized by the predominance of customers from a specific region. However, retail customers are concentrated in particular clusters: cluster 3 consists entirely of retail customers (“Channel_2” mean of 1), and cluster 5 is about 95% retail. Cluster 3 also shows by far the highest average spending on “Grocery” (4.27) and “Detergents_Paper” (4.61), both in standardized units. Many other inferences could be drawn by analyzing the statistics above, but this example demonstrates how cluster analysis can improve resource allocation for businesses through customer segmentation into groups.
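Comparisons like the one above are easier to make when the cluster means of all variables sit in a single table. The sketch below builds such a profile with base R's `aggregate`; it uses the standardized `USArrests` data and a 4-cluster hierarchical cut as stand-ins, so the values have no connection to the wholesale results.

```r
# Stand-in data, standardized; the distance matrix is computed
# before the cluster column is added.
x <- as.data.frame(scale(USArrests))
x$cluster <- factor(cutree(hclust(dist(x)), k = 4))

# One row per cluster, one column per variable, holding the cluster means.
profile <- aggregate(. ~ cluster, data = x, FUN = mean)

# Cluster with the highest mean for a given variable.
top_assault <- profile$cluster[which.max(profile$Assault)]

print(profile)
print(top_assault)
```

With the project data, the equivalent call would be `aggregate(. ~ cluster_K, data = select(dados_completos2, -cluster_hier), FUN = mean)`.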

3D plot to illustrate cluster 3

# scatter3D() is provided by the plot3D package
scatter3D(x = dados_completos2$Channel_2,
          y = dados_completos2$Grocery,
          z = dados_completos2$Detergents_Paper,
          phi = 1, bty = "g", pch = 20, cex = 1,
          xlab = "Retail",
          ylab = "Grocery",
          zlab = "Detergents_Paper",
          main = "Customers",
          colkey = FALSE)

Plotted directly from the data, without using the cluster labels, the above graph indeed shows a group of customers that “stand out” from the others due to the characteristics we noticed when analyzing the descriptive statistics by cluster. By default, `scatter3D` colors the points by their z value (here “Detergents_Paper”), so the customers shown in light blue, yellow, and red are likely some of the customers that were grouped in cluster 3.

Comparing the procedures through a confusion matrix

Creating the contingency table

tabela_contingencia <- table(dados_completos2$cluster_hier, dados_completos2$cluster_K)

matriz_confusao <- prop.table(tabela_contingencia, margin = 1)

Display the formatted confusion matrix

print(matriz_confusao, digits = 2)

##     
##          1     2     3     4     5     6
##   1  0.033 0.000 0.000 0.100 0.867 0.000
##   2  0.133 0.000 0.000 0.734 0.030 0.103
##   3  0.000 0.333 0.000 0.000 0.000 0.667
##   4  0.000 0.000 0.000 0.000 0.000 1.000
##   5  0.000 0.000 1.000 0.000 0.000 0.000
##   6  0.000 0.000 0.238 0.000 0.762 0.000
##   7  0.000 0.000 1.000 0.000 0.000 0.000
##   8  1.000 0.000 0.000 0.000 0.000 0.000
##   9  0.000 1.000 0.000 0.000 0.000 0.000
##   10 0.000 1.000 0.000 0.000 0.000 0.000
##   11 0.000 1.000 0.000 0.000 0.000 0.000
##   12 0.000 0.000 1.000 0.000 0.000 0.000

Having completed both procedures, we can see from the confusion matrix that they tended to group the observations in a similar manner. Each row corresponds to a hierarchical cluster and each column to a k-means cluster; a cell shows the proportion of the observations in that hierarchical cluster that were assigned to that k-means cluster, so every row sums to 1. There was no major dispersion of observations: even the hierarchical cluster that was split the most (cluster 3) still had about 67% of its observations grouped together again.
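The reading of the row proportions can be illustrated with a small hand-made example. The two label vectors below are hypothetical and only serve to show how `prop.table(..., margin = 1)` and the largest share per row are obtained.

```r
# Hypothetical cluster labels for ten observations.
hier <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
km   <- factor(c(1, 1, 2, 2, 2, 2, 2, 3, 3, 1))

# Counts: rows are hierarchical clusters, columns are k-means clusters.
counts <- table(hierarchical = hier, kmeans = km)

# Row proportions: each row sums to 1.
row_share <- prop.table(counts, margin = 1)

# Largest share per row: the fraction of each hierarchical cluster
# that stayed together in a single k-means cluster.
largest_share <- apply(row_share, 1, max)

print(round(row_share, 3))
print(round(largest_share, 3))
```

Here hierarchical clusters 1 and 3 each keep two of their three observations together (a share of about 0.667), while cluster 2 is kept whole.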

This illustrates how clustering algorithms can group observations, and how using the output of a hierarchical method as input for the k-means method can be a valid strategy to further refine the identified groups, allowing for more precise segmentation and additional insights.

Cluster Analysis

Rafael

2023-06-20
