GitHub copy
Research question: How do different shopping patterns in other food categories (fruit, meat, fish, and sweets) affect the amount spent on wine?
Explanation: I have chosen the Principal Component Analysis (PCA) method because I want to explore which factors most influence the amount customers spend on wine. PCA enables me to identify key patterns and relationships among various variables in the selected dataset. My goal is to better understand which factors are most responsible for the variability in wine spending and how these variables are interconnected. The variables included in the analysis are more precisely represented within the analysis itself: Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, and MntSweetProducts.
Data_source: Kaggle.com. (2023). Title: Customer Personality Analysis. Obtained from: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
Data: They are connected to individuals (customers) and their shopping habits in the store. The data enables the company to gain a more detailed understanding of its customers and key segments.
Import the data and display first six (6) rows
podatki <- read.table("/Users/jakakranjc/Downloads/marketing_campaign.csv", header=TRUE, sep = "\t", dec = ".")
head(podatki)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014
## 6 7446 1967 Master Together 62513 0 1 09-09-2013
## Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 58 635 88 546 172 88
## 2 38 11 1 6 2 1
## 3 26 426 49 127 111 21
## 4 26 11 4 20 10 3
## 5 94 173 43 118 46 27
## 6 16 520 42 98 0 42
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## 3 42 1 8 2
## 4 5 2 2 0
## 5 15 5 5 3
## 6 14 2 6 4
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## 3 10 4 0 0 0
## 4 4 6 0 0 0
## 5 6 5 0 0 0
## 6 10 6 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1 0 0 0 3 11 1
## 2 0 0 0 3 11 0
## 3 0 0 0 3 11 0
## 4 0 0 0 3 11 0
## 5 0 0 0 3 11 0
## 6 0 0 0 3 11 0
For easier understanding of data
summary(podatki)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52247 Mean :0.4442 Mean :0.5062
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 167
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
## Median :0.00000 Median :0.000000 Median :3 Median :11
## Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
##
## Response
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1491
## 3rd Qu.:0.0000
## Max. :1.0000
##
Explanation of data: Observation unit - Each row in the dataset represents information about a specific customer.
Sample size - The sample size is the number of customers included in the dataset. In our case, it is 2240.
Variables - The relevant/presented variables related to the chosen question are:
ID: A unique identifier for the customer. Year_Birth: The birth year of the customer. Income: The customer’s annual household income. MntWines: Amount spent on wine in the last 2 years. MntFruits: Amount spent on fruit in the last 2 years. MntMeatProducts: Amount spent on meat in the last 2 years. MntFishProducts: Amount spent on fish in the last 2 years. MntSweetProducts: Amount spent on sweets in the last 2 years.
Measurement units - These are evident from the dataset (e.g., unit quantities, income in the appropriate currency, etc.).
Organizing the data
podatki_vino <- podatki[,-c(2,3,4,6,7,8,9,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29)]
colnames(podatki_vino) <- c("ID", "Prihodek", "Znesek_vino", "Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")
head(podatki_vino)
## ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 1 5524 58138 635 88 546 172 88
## 2 2174 46344 11 1 6 2 1
## 3 4141 71613 426 49 127 111 21
## 4 6182 26646 11 4 20 10 3
## 5 5324 58293 173 43 118 46 27
## 6 7446 62513 520 42 98 0 42
Created a new table on which we can now ‘play’.
Descriptive statistics
library(psych)
describe(podatki_vino[, -1])
## vars n mean sd median trimmed mad min max
## Prihodek 1 2216 52247.25 25173.08 51381.5 51763.99 24548.15 1730 666666
## Znesek_vino 2 2240 303.94 336.60 173.5 248.99 243.89 0 1493
## Znesek_sadje 3 2240 26.30 39.77 8.0 16.98 11.86 0 199
## Znesek_meso 4 2240 166.95 225.72 67.0 119.33 87.47 0 1725
## Znesek_ribe 5 2240 37.53 54.63 12.0 25.08 17.79 0 259
## Znesek_sladko 6 2240 27.06 41.28 8.0 17.35 11.86 0 263
## range skew kurtosis se
## Prihodek 664936 6.75 159.13 534.75
## Znesek_vino 1493 1.17 0.59 7.11
## Znesek_sadje 199 2.10 4.03 0.84
## Znesek_meso 1725 2.08 5.49 4.77
## Znesek_ribe 259 1.92 3.08 1.15
## Znesek_sladko 263 2.13 4.36 0.87
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
OLS regression method - linear regression
fit <- lm(Znesek_vino ~ Prihodek + Znesek_sadje + Znesek_meso + Znesek_ribe + Znesek_sladko,
data = podatki_vino)
vif(fit)
## Prihodek Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 1.596194 1.865162 2.037230 1.953510 1.828894
We can see that all variables have higher values; we want them as close to 1 as possible. (Up to around 1.5 is acceptable.) Using VIF, we check for multicollinearity. Collinearity is present, if it is not checked it can result in incorrect regression coefficients.
mean(vif(fit))
## [1] 1.856198
We want the mean of the VIF so be as close to 1 as possible as well.
New subset, then descriptive statistics
podatki_MGK <- podatki_vino[,c("Prihodek", "Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")]
library(pastecs)
round(stat.desc(podatki_MGK, basic = FALSE), 2)
## Prihodek Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## median 51381.50 8.00 67.00 12.00 8.00
## mean 52247.25 26.30 166.95 37.53 27.06
## SE.mean 534.75 0.84 4.77 1.15 0.87
## CI.mean.0.95 1048.67 1.65 9.35 2.26 1.71
## var 633683788.58 1581.93 50947.43 2984.33 1704.08
## std.dev 25173.08 39.77 225.72 54.63 41.28
## coef.var 0.48 1.51 1.35 1.46 1.53
Znesek_sladko (‘Amount_sweet’) has the highest coef.var, thats why it varies most. The data must first be standarized, since prihodek (‘income’) has a very high var value and hence high affect.
Correlation matrix
R <- cor(podatki_MGK)
round(R, 3)
## Prihodek Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## Prihodek 1 NA NA NA NA
## Znesek_sadje NA 1.000 0.543 0.595 0.567
## Znesek_meso NA 0.543 1.000 0.568 0.524
## Znesek_ribe NA 0.595 0.568 1.000 0.580
## Znesek_sladko NA 0.567 0.524 0.580 1.000
5x5 matrix, diagonally value 1, the rest of them are Pearson coefficients. Coefficient Znesek_meso/Znesek_sladko (Amount_meat/Amount_sweet) -> linear connection is positive and has ‘medium strenght’. Prihodek (‘Income’) has values NA, thats why we are ‘not allowed’ to count it in.
Quick fix without Prihodek (‘Income’)
podatki_MGK_nova <- podatki_vino[,c("Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")]
library(pastecs)
round(stat.desc(podatki_MGK_nova, basic = FALSE), 2)
## Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## median 8.00 67.00 12.00 8.00
## mean 26.30 166.95 37.53 27.06
## SE.mean 0.84 4.77 1.15 0.87
## CI.mean.0.95 1.65 9.35 2.26 1.71
## var 1581.93 50947.43 2984.33 1704.08
## std.dev 39.77 225.72 54.63 41.28
## coef.var 1.51 1.35 1.46 1.53
R <- cor(podatki_MGK_nova)
round(R, 3)
## Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## Znesek_sadje 1.000 0.543 0.595 0.567
## Znesek_meso 0.543 1.000 0.568 0.524
## Znesek_ribe 0.595 0.568 1.000 0.580
## Znesek_sladko 0.567 0.524 0.580 1.000
The Bartlett test
cortest.bartlett(R, n = nrow(podatki_MGK_nova))
## $chisq
## [1] 3363.307
##
## $p.value
## [1] 0
##
## $df
## [1] 6
H0 - P = I H1 - P =/= I We can reject the null hypothesis, p = 0, the data is valid/appropriate
KMO statistics
KMO(R)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA = 0.82
## MSA for each item =
## Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 0.81 0.83 0.80 0.82
KMO = 0.82, value is good. MSA = Znesek_meso (‘Amount_meat’) is most appropriate, znesek_ribe (‘Amount_fish’) is least appropriate, values still high enough. Biggest information loss would happen with znesek_ribe (‘Amount_fish’). The data is valid.
PCA - Principal Component Analysis
library(FactoMineR)
mgk <- PCA(podatki_MGK_nova,
scale.unit = TRUE,
graph = FALSE)
library(factoextra)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
get_eigenvalue(mgk)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.6893021 67.23255 67.23255
## Dim.2 0.4803929 12.00982 79.24237
## Dim.3 0.4313011 10.78253 90.02490
## Dim.4 0.3990040 9.97510 100.00000
We standardize with scale-unit, and the analysis will now be based on the correlation matrix. Using eigenvalues, we determine own values. The maximum component values are displayed (in this case, 4), with each subsequent one having a lower eigenvalue. LV = variance of the component (e.g., dimension 2 = 0.48).
The variance.percent tells us what percentage of information is captured by each component. The variance of the first 4 variables will thus be 4. (Out of these 4, the first principal component will capture 2.689 = 67.2%). The last column is cumulative, and it will always total 100. According to the Kaiser criterion, we take only the first component (values above 1). If we look at the values alone, we would take the first two components since we need 70%. Since we also have the rule of the last 5%, we could also take the third and fourth components.
Eigenvalue plot for determining the number of analytically significant principal components
fviz_eig(mgk,
choice = "eigenvalue",
main = "Diagram lastnih vrednosti",
ylab = "lastna vrednost",
xlab = "glavna komponenta",
addlabels = TRUE)
We are looking for where the diagram breaks, that happens at 2. We take
one less, so just the first one. I was supposed to take only the
first component, bcs it already broke at 2, but had some issues with
drawing fviz_pca_var and biplot, thays why I still took 2 for
visualization.
Paralel analysis
library(psych)
fa.parallel(podatki_MGK_nova,
sim = FALSE,
fa = "pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
We mnust assign the main component. Until the empirical value is above the theoretical, we take the main component. This confirms, that we take just the first one. fa.parallel also tells us, da we would need to take just one, but the graphs were giving me a hard time and for ease of visualisation, I took two
Solution with (just one) main component
mgk <- PCA(podatki_MGK_nova,
scale.unit = TRUE,
graph = FALSE,
ncp = 2)
mgk
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 2240 individuals, described by 4 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
We say how many main components we want, then select 4,6 and 8. ncp = 2 takes 2 main components, ncp = 1 just one.
var$cor
print(mgk$var$cor)
## Dim.1 Dim.2
## Znesek_sadje 0.8260039 -0.15257645
## Znesek_meso 0.8002894 0.56354405
## Znesek_ribe 0.8393935 -0.02595429
## Znesek_sladko 0.8136185 -0.37263623
print(mgk$var$contrib)
## Dim.1 Dim.2
## Znesek_sadje 25.37024 4.8459446
## Znesek_meso 23.81522 66.1087888
## Znesek_ribe 26.19942 0.1402238
## Znesek_sladko 24.61512 28.9050428
with cor we get rescaled coefficients or weights of main components (Pearson correlation coefficients between individual initial variables and each principal component). All 4 variables for dim1 have a positive sign (positive correlation). The more a person spends on fruit, meat, fish, and sweets, the higher the amount spent on wine in PC1. PC2 usually represents certain contrasts, as it often includes both positive and negative coefficients.
LS = we take the coefficient of each principal/main component, square it and sum it with the rest. for example fruit: 0.82’ + (-0.15)’ = 0,69 = 69% information has been transfered, 31% of it has been lost.
in case of just one main comp. - v primeru le ene gk: -> All 4 variables have a negative correlation with the first principal component, meaning that the higher the amount spent on fruit, meat, fish, and sweets, the lower the amount spent on wine.
The contrib value indicates what percentage of the total information of the principal component was contributed by each measured variable. In our case, Znesek_ribe contributed the most to PC1.
Grafical visualisation of the measured component
fviz_pca_var(mgk,
repel = TRUE)
In the case of a single axis -> Since we have only one principal component, the request ‘axes = c(1,1)’ had issues displaying a graph with two axes. (Apologies for the reduced clarity; I hope (1,1) is an acceptable approach, as Stack Overflow wasn’t particularly helpful.)
Dim1 - 67.2 indicates that 67.2% of the information was transferred to this component. Dim2 shows that 12% of the information was transferred. Each axis represents its principal component with its corresponding values (e.g., meat -> 0.6, 0.6; fruit -> 0.7, -0.2).
Biplot visualisation
fviz_pca_biplot(mgk)
Since the entire sample is huge, the graph is unclear. Using the option REPEL = TRUE could potentially improve the graph, but it remains unreadable.
The “best” unit can still be identified—this is unit number 361, which is more than 6 standard deviations to the right.
Check for standalone values
head(mgk$ind$coord)
## Dim.1 Dim.2
## 1 3.5940251 0.13814880
## 2 -1.3148042 -0.07595754
## 3 0.8168131 -0.24098899
## 4 -1.1475119 -0.07366193
## 5 0.1843154 -0.27379588
## 6 -0.1223814 -0.50418399
If we were interested in an individual person, we could determine their coordinates and use the scale function to standardize the values.
Save the data in the main table
podatki_vino$GK1 <- mgk$ind$coord[,1]
podatki_vino$GK2 <- mgk$ind$coord[,2]
head(podatki_vino,3)
## ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 1 5524 58138 635 88 546 172 88
## 2 2174 46344 11 1 6 2 1
## 3 4141 71613 426 49 127 111 21
## GK1 GK2
## 1 3.5940251 0.13814880
## 2 -1.3148042 -0.07595754
## 3 0.8168131 -0.24098899
Check for coorelation
cor(x=podatki_vino$GK1, y=podatki_vino$GK2)
## [1] -1.43366e-15
If they are independent, the correlation should be 0. This value is practically 0, meaning there is virtually no connection between them.
The principal component analysis was conducted on 4 standardized variables (n=2240). The KMO test confirms the adequacy of the variables with KMO = 0.82. The MSA statistic for each variable is at least 0.5. Based on parallel analysis and other criteria, it is most reasonable to retain only one principal component. However, for easier visualization in the graphs, I selected two components (together, they account for 79.2% of the variance). A rationale for using only one variable is also provided where applicable.
Based on the loadings, we find that PC1 represents the amount spent on wine, while PC2 highlights the contrast between the amount spent on meat products and non-meat products.
LM
fit <- lm(Znesek_vino ~ Prihodek + GK1 + GK2,
data = podatki_vino)
vif(fit)
## Prihodek GK1 GK2
## 1.573817 1.520697 1.053617
mean(vif(fit))
## [1] 1.38271
Values are now better than before.
Missing values
NA_vrednosti <- complete.cases(podatki_vino)
sum(NA_vrednosti)
## [1] 2216
podatki_vino[!NA_vrednosti, ]
## ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 11 1994 NA 5 5 6 0
## 28 5255 NA 5 1 3 3
## 44 7281 NA 81 11 50 3
## 49 7244 NA 48 5 48 6
## 59 8557 NA 11 3 22 2
## 72 10629 NA 25 3 43 17
## 91 8996 NA 230 42 192 49
## 92 9235 NA 7 0 8 2
## 93 5798 NA 445 37 359 98
## 129 8268 NA 352 0 27 10
## 134 1295 NA 231 65 196 38
## 313 2437 NA 861 138 461 60
## 320 2863 NA 738 20 172 52
## 1380 10475 NA 187 5 65 26
## 1383 2902 NA 19 4 12 2
## 1384 4345 NA 5 1 9 2
## 1387 3769 NA 25 1 13 0
## 2060 7187 NA 375 42 48 94
## 2062 1612 NA 23 0 15 0
## 2079 5079 NA 71 1 16 0
## 2080 10339 NA 161 0 22 0
## 2082 3117 NA 264 0 21 12
## 2085 5250 NA 532 126 490 164
## 2229 8720 NA 32 2 1607 12
## Znesek_sladko GK1 GK2
## 11 2 -1.2708592 -0.109756974
## 28 263 1.8376721 -3.500482372
## 44 2 -1.0715918 0.013502812
## 49 10 -1.0276316 -0.066758963
## 59 2 -1.2428488 -0.042408140
## 72 4 -1.0328169 -0.003082932
## 91 37 0.4800048 -0.133962395
## 92 0 -1.3351673 -0.050188723
## 93 28 1.1288368 0.579066045
## 129 0 -1.2191053 0.012783484
## 134 71 1.0856275 -0.682247954
## 313 30 2.2966770 0.387440838
## 320 50 0.3424776 -0.255637116
## 1380 20 -0.6832205 -0.149488565
## 1383 2 -1.2518073 -0.083974198
## 1384 0 -1.3203380 -0.052121676
## 1387 0 -1.3304313 -0.036338393
## 2060 66 0.9389466 -1.061426921
## 2062 2 -1.3147303 -0.049650078
## 2079 0 -1.3239437 -0.025529364
## 2080 0 -1.3236353 0.001624655
## 2082 6 -1.1412090 -0.088366814
## 2085 126 4.3360980 -0.763520965
## 2229 4 2.2898619 5.640988224
Income has missing values, which hinders the calculation of standard deviation and standardized values.
Fill in NA values with Mean
Mean_prihodki <- mean(podatki_vino$Prihodek, na.rm = TRUE)
podatki_vino$Prihodek[is.na(podatki_vino$Prihodek)] <- Mean_prihodki
sum(is.na(podatki_vino$Prihodek))
## [1] 0
We calculate the mean of the income and then replace or insert it
into all values where we have NA.
To verify if it worked, we sum up all NA values in the Income column
using sum.na. If the result is 0, then it’s fine.
LM again
fit <- lm(Znesek_vino ~ Prihodek + GK1 + GK2,
data = podatki_vino)
vif(fit)
## Prihodek GK1 GK2
## 1.561592 1.511178 1.050414
Histogram and graph of stdost/stdOcenVred
podatki_vino$stdost <- rstandard(fit)
podatki_vino$stdocenvred <- scale(fit$fitted.values)
hist(podatki_vino$stdost,
xlab = "Standardni ostanki",
ylab = "Frekvenca",
main = "Histogram")
ggplot(podatki_vino, aes(y=stdost, x=stdocenvred)) + theme_dark() + geom_point(color = "snow") + ylab("stand ostanki") + xlab("stand ocen vred")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
Check the BP test
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------------
## Response : Znesek_vino
## Variables: fitted values of Znesek_vino
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 3314.2980
## Prob > Chi2 = 0.0000
H0 lahko zavrnemo, p vrednost = 0. (prob > Chi2), imamo heteroskedastičnost. WE can throw away H0, p value = 0. (prob > Chi2), so we have Heteroskedasticity.
Cooks distances
podatki_vino$cooksD <- cooks.distance(fit)
head(podatki_vino[order(-podatki_vino$cooksD), ], )
## ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 2234 9432 666666.00 9 14 18 8
## 1654 4931 157146.00 1 0 1725 2
## 688 1501 160803.00 55 16 1622 17
## 165 8475 157243.00 20 2 1582 1
## 2229 8720 52247.25 32 2 1607 12
## 22 5376 2447.00 1 1 1725 1
## Znesek_sladko GK1 GK2 stdost stdocenvred cooksD
## 2234 1 -1.067955 -0.1088027 -16.773412 13.698702 54.03779603
## 1654 1 2.389926 6.1231520 -5.609555 5.167523 0.31431231
## 688 3 2.534472 5.6271286 -5.350420 5.129183 0.24966646
## 165 2 2.108667 5.5845085 -5.295233 4.904833 0.23911341
## 2229 4 2.289862 5.6409882 -3.267427 2.576174 0.09523541
## 22 1 2.393221 6.1183017 -2.601776 1.627426 0.08859137
podatki_vino <- podatki_vino[c(-2234,-1654,-688,-165,-2229,-22), ]
We remove all units that have a stdost outside the range (-3, +3), as well as those with cooksD outside of a certain ‘normal’ range, meaning when it is not in a smooth decline.
head(podatki_vino[order(-podatki_vino$cooksD), ], )
## ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 2133 11181 156924 2 1 2 1
## 618 1503 162397 85 1 16 2
## 1301 5336 157733 39 1 9 2
## 656 5555 153924 1 1 1 1
## 1809 1619 90369 292 51 981 224
## 1899 4619 113734 6 2 3 1
## Znesek_sladko GK1 GK2 stdost stdocenvred cooksD
## 2133 1 -1.332826 -0.08968395 -2.835489 1.9622900 0.03258371
## 618 1 -1.293179 -0.03992744 -2.642926 2.1161109 0.03056891
## 1301 0 -1.320338 -0.05212168 -2.720565 1.9971201 0.03020320
## 656 1 -1.334989 -0.09328696 -2.778813 1.8917938 0.02978413
## 1809 23 3.771999 2.72138038 -2.504681 2.9209961 0.01562421
## 1899 262 1.819574 -3.49162026 -1.831961 0.7902984 0.01495170
Results
summary(fit)
##
## Call:
## lm(formula = Znesek_vino ~ Prihodek + GK1 + GK2, data = podatki_vino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3256.5 -128.1 -68.7 80.0 1169.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.550e+01 1.524e+01 2.986 0.00285 **
## Prihodek 4.946e-03 2.723e-04 18.165 < 2e-16 ***
## GK1 6.532e+01 4.089e+00 15.976 < 2e-16 ***
## GK2 7.221e+01 8.066e+00 8.953 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 258.2 on 2236 degrees of freedom
## Multiple R-squared: 0.4125, Adjusted R-squared: 0.4117
## F-statistic: 523.4 on 3 and 2236 DF, p-value: < 2.2e-16
Multiple R-squared - 41.25% of the variability in the result is explained by the linear influence of PC1, PC2, and income.
F-statistics - We check for linear correlation to see if the model explains the variability. A very low p-value suggests rejecting the null hypothesis at p < 0.001.
All 4 variables have a statistically significant impact.
Both PC1 and PC2 have a positive sign. The more spent on other products, the more will be spent on wine. Similarly, the more spent on wine, the more will be spent on meat, assuming other factors remain unchanged.