R Project 1

GitHub copy

Research question: How do different shopping patterns in other food categories (fruit, meat, fish, and sweets) affect the amount spent on wine?

Explanation: I have chosen the Principal Component Analysis (PCA) method because I want to explore which factors most influence the amount customers spend on wine. PCA enables me to identify key patterns and relationships among various variables in the selected dataset. My goal is to better understand which factors are most responsible for the variability in wine spending and how these variables are interconnected. The variables included in the analysis are more precisely represented within the analysis itself: Income, MntWines, MntFruits, MntMeatProducts, MntFishProducts, and MntSweetProducts.

Data_source: Kaggle.com. (2023). Title: Customer Personality Analysis. Obtained from: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

Data: They are connected to individuals (customers) and their shopping habits in the store. The data enables the company to gain a more detailed understanding of its customers and key segments.

Import the data and display first six (6) rows

podatki <- read.table("/Users/jakakranjc/Downloads/marketing_campaign.csv", header=TRUE, sep = "\t", dec = ".")
head(podatki)

##     ID Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524       1957 Graduation         Single  58138       0        0  04-09-2012
## 2 2174       1954 Graduation         Single  46344       1        1  08-03-2014
## 3 4141       1965 Graduation       Together  71613       0        0  21-08-2013
## 4 6182       1984 Graduation       Together  26646       1        0  10-02-2014
## 5 5324       1981        PhD        Married  58293       1        0  19-01-2014
## 6 7446       1967     Master       Together  62513       0        1  09-09-2013
##   Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      58      635        88             546             172               88
## 2      38       11         1               6               2                1
## 3      26      426        49             127             111               21
## 4      26       11         4              20              10                3
## 5      94      173        43             118              46               27
## 6      16      520        42              98               0               42
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
## 3           42                 1               8                   2
## 4            5                 2               2                   0
## 5           15                 5               5                   3
## 6           14                 2               6                   4
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
## 3                10                 4            0            0            0
## 4                 4                 6            0            0            0
## 5                 6                 5            0            0            0
## 6                10                 6            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1            0            0        0             3        11        1
## 2            0            0        0             3        11        0
## 3            0            0        0             3        11        0
## 4            0            0        0             3        11        0
## 5            0            0        0             3        11        0
## 6            0            0        0             3        11        0

For easier understanding of data

summary(podatki)

##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    0   Min.   :1893   Length:2240        Length:2240       
##  1st Qu.: 2828   1st Qu.:1959   Class :character   Class :character  
##  Median : 5458   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5592   Mean   :1969                                        
##  3rd Qu.: 8428   3rd Qu.:1977                                        
##  Max.   :11191   Max.   :1996                                        
##                                                                      
##      Income          Kidhome          Teenhome      Dt_Customer       
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Length:2240       
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median : 51382   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   : 52247   Mean   :0.4442   Mean   :0.5062                     
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000                     
##  NA's   :24                                                           
##     Recency         MntWines         MntFruits     MntMeatProducts
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0   
##  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16   
##  Median :49.00   Median : 173.50   Median :  8.0   Median :  67   
##  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 167   
##  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232   
##  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725   
##                                                                   
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
##                                                                      
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
##  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
##                                                                          
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   AcceptedCmp2        Complain        Z_CostContact   Z_Revenue 
##  Min.   :0.00000   Min.   :0.000000   Min.   :3     Min.   :11  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:3     1st Qu.:11  
##  Median :0.00000   Median :0.000000   Median :3     Median :11  
##  Mean   :0.01339   Mean   :0.009375   Mean   :3     Mean   :11  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:3     3rd Qu.:11  
##  Max.   :1.00000   Max.   :1.000000   Max.   :3     Max.   :11  
##                                                                 
##     Response     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1491  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##

Explanation of data: Observation unit - Each row in the dataset represents information about a specific customer.

Sample size - The sample size is the number of customers included in the dataset. In our case, it is 2240.

Variables - The relevant/presented variables related to the chosen question are:

ID: A unique identifier for the customer. Year_Birth: The birth year of the customer. Income: The customer’s annual household income. MntWines: Amount spent on wine in the last 2 years. MntFruits: Amount spent on fruit in the last 2 years. MntMeatProducts: Amount spent on meat in the last 2 years. MntFishProducts: Amount spent on fish in the last 2 years. MntSweetProducts: Amount spent on sweets in the last 2 years.

Measurement units - These are evident from the dataset (e.g., unit quantities, income in the appropriate currency, etc.).

Organizing the data

podatki_vino <- podatki[,-c(2,3,4,6,7,8,9,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29)]
colnames(podatki_vino) <- c("ID", "Prihodek", "Znesek_vino", "Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")
head(podatki_vino)

##     ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 1 5524    58138         635           88         546         172            88
## 2 2174    46344          11            1           6           2             1
## 3 4141    71613         426           49         127         111            21
## 4 6182    26646          11            4          20          10             3
## 5 5324    58293         173           43         118          46            27
## 6 7446    62513         520           42          98           0            42

Created a new table on which we can now ‘play’.

Descriptive statistics

library(psych)
describe(podatki_vino[, -1])

##               vars    n     mean       sd  median  trimmed      mad  min    max
## Prihodek         1 2216 52247.25 25173.08 51381.5 51763.99 24548.15 1730 666666
## Znesek_vino      2 2240   303.94   336.60   173.5   248.99   243.89    0   1493
## Znesek_sadje     3 2240    26.30    39.77     8.0    16.98    11.86    0    199
## Znesek_meso      4 2240   166.95   225.72    67.0   119.33    87.47    0   1725
## Znesek_ribe      5 2240    37.53    54.63    12.0    25.08    17.79    0    259
## Znesek_sladko    6 2240    27.06    41.28     8.0    17.35    11.86    0    263
##                range skew kurtosis     se
## Prihodek      664936 6.75   159.13 534.75
## Znesek_vino     1493 1.17     0.59   7.11
## Znesek_sadje     199 2.10     4.03   0.84
## Znesek_meso     1725 2.08     5.49   4.77
## Znesek_ribe      259 1.92     3.08   1.15
## Znesek_sladko    263 2.13     4.36   0.87

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

OLS regression method - linear regression

fit <- lm(Znesek_vino ~ Prihodek + Znesek_sadje + Znesek_meso + Znesek_ribe + Znesek_sladko,
          data = podatki_vino)
vif(fit)

##      Prihodek  Znesek_sadje   Znesek_meso   Znesek_ribe Znesek_sladko 
##      1.596194      1.865162      2.037230      1.953510      1.828894

We can see that all variables have higher values; we want them as close to 1 as possible. (Up to around 1.5 is acceptable.) Using VIF, we check for multicollinearity. Collinearity is present, if it is not checked it can result in incorrect regression coefficients.

mean(vif(fit))

## [1] 1.856198

We want the mean of the VIF so be as close to 1 as possible as well.

New subset, then descriptive statistics

podatki_MGK <- podatki_vino[,c("Prihodek", "Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")]
library(pastecs)
round(stat.desc(podatki_MGK, basic = FALSE), 2)

##                  Prihodek Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## median           51381.50         8.00       67.00       12.00          8.00
## mean             52247.25        26.30      166.95       37.53         27.06
## SE.mean            534.75         0.84        4.77        1.15          0.87
## CI.mean.0.95      1048.67         1.65        9.35        2.26          1.71
## var          633683788.58      1581.93    50947.43     2984.33       1704.08
## std.dev          25173.08        39.77      225.72       54.63         41.28
## coef.var             0.48         1.51        1.35        1.46          1.53

Znesek_sladko (‘Amount_sweet’) has the highest coef.var, thats why it varies most. The data must first be standarized, since prihodek (‘income’) has a very high var value and hence high affect.

Correlation matrix

R <- cor(podatki_MGK)
round(R, 3)

##               Prihodek Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## Prihodek             1           NA          NA          NA            NA
## Znesek_sadje        NA        1.000       0.543       0.595         0.567
## Znesek_meso         NA        0.543       1.000       0.568         0.524
## Znesek_ribe         NA        0.595       0.568       1.000         0.580
## Znesek_sladko       NA        0.567       0.524       0.580         1.000

5x5 matrix, diagonally value 1, the rest of them are Pearson coefficients. Coefficient Znesek_meso/Znesek_sladko (Amount_meat/Amount_sweet) -> linear connection is positive and has ‘medium strenght’. Prihodek (‘Income’) has values NA, thats why we are ‘not allowed’ to count it in.

Quick fix without Prihodek (‘Income’)

podatki_MGK_nova <- podatki_vino[,c("Znesek_sadje", "Znesek_meso", "Znesek_ribe", "Znesek_sladko")]
library(pastecs)
round(stat.desc(podatki_MGK_nova, basic = FALSE), 2)

##              Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## median               8.00       67.00       12.00          8.00
## mean                26.30      166.95       37.53         27.06
## SE.mean              0.84        4.77        1.15          0.87
## CI.mean.0.95         1.65        9.35        2.26          1.71
## var               1581.93    50947.43     2984.33       1704.08
## std.dev             39.77      225.72       54.63         41.28
## coef.var             1.51        1.35        1.46          1.53

R <- cor(podatki_MGK_nova)
round(R, 3)

##               Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## Znesek_sadje         1.000       0.543       0.595         0.567
## Znesek_meso          0.543       1.000       0.568         0.524
## Znesek_ribe          0.595       0.568       1.000         0.580
## Znesek_sladko        0.567       0.524       0.580         1.000

The Bartlett test

cortest.bartlett(R, n = nrow(podatki_MGK_nova))

## $chisq
## [1] 3363.307
## 
## $p.value
## [1] 0
## 
## $df
## [1] 6

H0 - P = I H1 - P =/= I We can reject the null hypothesis, p = 0, the data is valid/appropriate

KMO statistics

KMO(R)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = R)
## Overall MSA =  0.82
## MSA for each item = 
##  Znesek_sadje   Znesek_meso   Znesek_ribe Znesek_sladko 
##          0.81          0.83          0.80          0.82

KMO = 0.82, value is good. MSA = Znesek_meso (‘Amount_meat’) is most appropriate, znesek_ribe (‘Amount_fish’) is least appropriate, values still high enough. Biggest information loss would happen with znesek_ribe (‘Amount_fish’). The data is valid.

PCA - Principal Component Analysis

library(FactoMineR)
mgk <- PCA(podatki_MGK_nova,
           scale.unit = TRUE,
           graph = FALSE)

library(factoextra)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

get_eigenvalue(mgk)

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.6893021         67.23255                    67.23255
## Dim.2  0.4803929         12.00982                    79.24237
## Dim.3  0.4313011         10.78253                    90.02490
## Dim.4  0.3990040          9.97510                   100.00000

We standardize with scale-unit, and the analysis will now be based on the correlation matrix. Using eigenvalues, we determine own values. The maximum component values are displayed (in this case, 4), with each subsequent one having a lower eigenvalue. LV = variance of the component (e.g., dimension 2 = 0.48).

The variance.percent tells us what percentage of information is captured by each component. The variance of the first 4 variables will thus be 4. (Out of these 4, the first principal component will capture 2.689 = 67.2%). The last column is cumulative, and it will always total 100. According to the Kaiser criterion, we take only the first component (values above 1). If we look at the values alone, we would take the first two components since we need 70%. Since we also have the rule of the last 5%, we could also take the third and fourth components.

Eigenvalue plot for determining the number of analytically significant principal components

fviz_eig(mgk,
         choice = "eigenvalue", 
         main = "Diagram lastnih vrednosti",
         ylab = "lastna vrednost", 
         xlab = "glavna komponenta", 
         addlabels = TRUE)

We are looking for where the diagram breaks, that happens at 2. We take one less, so just the first one. I was supposed to take only the first component, bcs it already broke at 2, but had some issues with drawing fviz_pca_var and biplot, thays why I still took 2 for visualization.

Paralel analysis

library(psych)
fa.parallel(podatki_MGK_nova,
            sim = FALSE,
            fa = "pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

We mnust assign the main component. Until the empirical value is above the theoretical, we take the main component. This confirms, that we take just the first one. fa.parallel also tells us, da we would need to take just one, but the graphs were giving me a hard time and for ease of visualisation, I took two

Solution with (just one) main component

mgk <- PCA(podatki_MGK_nova,
           scale.unit = TRUE,
           graph = FALSE, 
           ncp = 2)

mgk

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 2240 individuals, described by 4 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

We say how many main components we want, then select 4,6 and 8. ncp = 2 takes 2 main components, ncp = 1 just one.

var$cor

print(mgk$var$cor)

##                   Dim.1       Dim.2
## Znesek_sadje  0.8260039 -0.15257645
## Znesek_meso   0.8002894  0.56354405
## Znesek_ribe   0.8393935 -0.02595429
## Znesek_sladko 0.8136185 -0.37263623

print(mgk$var$contrib)

##                  Dim.1      Dim.2
## Znesek_sadje  25.37024  4.8459446
## Znesek_meso   23.81522 66.1087888
## Znesek_ribe   26.19942  0.1402238
## Znesek_sladko 24.61512 28.9050428

with cor we get rescaled coefficients or weights of main components (Pearson correlation coefficients between individual initial variables and each principal component). All 4 variables for dim1 have a positive sign (positive correlation). The more a person spends on fruit, meat, fish, and sweets, the higher the amount spent on wine in PC1. PC2 usually represents certain contrasts, as it often includes both positive and negative coefficients.

LS = we take the coefficient of each principal/main component, square it and sum it with the rest. for example fruit: 0.82’ + (-0.15)’ = 0,69 = 69% information has been transfered, 31% of it has been lost.

in case of just one main comp. - v primeru le ene gk: -> All 4 variables have a negative correlation with the first principal component, meaning that the higher the amount spent on fruit, meat, fish, and sweets, the lower the amount spent on wine.

The contrib value indicates what percentage of the total information of the principal component was contributed by each measured variable. In our case, Znesek_ribe contributed the most to PC1.

Grafical visualisation of the measured component

fviz_pca_var(mgk,
             repel = TRUE)

In the case of a single axis -> Since we have only one principal component, the request ‘axes = c(1,1)’ had issues displaying a graph with two axes. (Apologies for the reduced clarity; I hope (1,1) is an acceptable approach, as Stack Overflow wasn’t particularly helpful.)

Dim1 - 67.2 indicates that 67.2% of the information was transferred to this component. Dim2 shows that 12% of the information was transferred. Each axis represents its principal component with its corresponding values (e.g., meat -> 0.6, 0.6; fruit -> 0.7, -0.2).

Biplot visualisation

fviz_pca_biplot(mgk)

Since the entire sample is huge, the graph is unclear. Using the option REPEL = TRUE could potentially improve the graph, but it remains unreadable.

The “best” unit can still be identified—this is unit number 361, which is more than 6 standard deviations to the right.

Check for standalone values

head(mgk$ind$coord)

##        Dim.1       Dim.2
## 1  3.5940251  0.13814880
## 2 -1.3148042 -0.07595754
## 3  0.8168131 -0.24098899
## 4 -1.1475119 -0.07366193
## 5  0.1843154 -0.27379588
## 6 -0.1223814 -0.50418399

If we were interested in an individual person, we could determine their coordinates and use the scale function to standardize the values.

Save the data in the main table

podatki_vino$GK1 <- mgk$ind$coord[,1]
podatki_vino$GK2 <- mgk$ind$coord[,2]
head(podatki_vino,3)

##     ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe Znesek_sladko
## 1 5524    58138         635           88         546         172            88
## 2 2174    46344          11            1           6           2             1
## 3 4141    71613         426           49         127         111            21
##          GK1         GK2
## 1  3.5940251  0.13814880
## 2 -1.3148042 -0.07595754
## 3  0.8168131 -0.24098899

Check for coorelation

cor(x=podatki_vino$GK1, y=podatki_vino$GK2)

## [1] -1.43366e-15

If they are independent, the correlation should be 0. This value is practically 0, meaning there is virtually no connection between them.

The principal component analysis was conducted on 4 standardized variables (n=2240). The KMO test confirms the adequacy of the variables with KMO = 0.82. The MSA statistic for each variable is at least 0.5. Based on parallel analysis and other criteria, it is most reasonable to retain only one principal component. However, for easier visualization in the graphs, I selected two components (together, they account for 79.2% of the variance). A rationale for using only one variable is also provided where applicable.

Based on the loadings, we find that PC1 represents the amount spent on wine, while PC2 highlights the contrast between the amount spent on meat products and non-meat products.

fit <- lm(Znesek_vino ~ Prihodek + GK1 + GK2,
          data = podatki_vino)
vif(fit)

## Prihodek      GK1      GK2 
## 1.573817 1.520697 1.053617

mean(vif(fit))

## [1] 1.38271

Values are now better than before.

Missing values

NA_vrednosti <- complete.cases(podatki_vino)
sum(NA_vrednosti)

## [1] 2216

podatki_vino[!NA_vrednosti, ]

##         ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 11    1994       NA           5            5           6           0
## 28    5255       NA           5            1           3           3
## 44    7281       NA          81           11          50           3
## 49    7244       NA          48            5          48           6
## 59    8557       NA          11            3          22           2
## 72   10629       NA          25            3          43          17
## 91    8996       NA         230           42         192          49
## 92    9235       NA           7            0           8           2
## 93    5798       NA         445           37         359          98
## 129   8268       NA         352            0          27          10
## 134   1295       NA         231           65         196          38
## 313   2437       NA         861          138         461          60
## 320   2863       NA         738           20         172          52
## 1380 10475       NA         187            5          65          26
## 1383  2902       NA          19            4          12           2
## 1384  4345       NA           5            1           9           2
## 1387  3769       NA          25            1          13           0
## 2060  7187       NA         375           42          48          94
## 2062  1612       NA          23            0          15           0
## 2079  5079       NA          71            1          16           0
## 2080 10339       NA         161            0          22           0
## 2082  3117       NA         264            0          21          12
## 2085  5250       NA         532          126         490         164
## 2229  8720       NA          32            2        1607          12
##      Znesek_sladko        GK1          GK2
## 11               2 -1.2708592 -0.109756974
## 28             263  1.8376721 -3.500482372
## 44               2 -1.0715918  0.013502812
## 49              10 -1.0276316 -0.066758963
## 59               2 -1.2428488 -0.042408140
## 72               4 -1.0328169 -0.003082932
## 91              37  0.4800048 -0.133962395
## 92               0 -1.3351673 -0.050188723
## 93              28  1.1288368  0.579066045
## 129              0 -1.2191053  0.012783484
## 134             71  1.0856275 -0.682247954
## 313             30  2.2966770  0.387440838
## 320             50  0.3424776 -0.255637116
## 1380            20 -0.6832205 -0.149488565
## 1383             2 -1.2518073 -0.083974198
## 1384             0 -1.3203380 -0.052121676
## 1387             0 -1.3304313 -0.036338393
## 2060            66  0.9389466 -1.061426921
## 2062             2 -1.3147303 -0.049650078
## 2079             0 -1.3239437 -0.025529364
## 2080             0 -1.3236353  0.001624655
## 2082             6 -1.1412090 -0.088366814
## 2085           126  4.3360980 -0.763520965
## 2229             4  2.2898619  5.640988224

Income has missing values, which hinders the calculation of standard deviation and standardized values.

Fill in NA values with Mean

Mean_prihodki <- mean(podatki_vino$Prihodek, na.rm = TRUE)
podatki_vino$Prihodek[is.na(podatki_vino$Prihodek)] <- Mean_prihodki
sum(is.na(podatki_vino$Prihodek))

## [1] 0

We calculate the mean of the income and then replace or insert it into all values where we have NA.
To verify if it worked, we sum up all NA values in the Income column using sum.na. If the result is 0, then it’s fine.

LM again

fit <- lm(Znesek_vino ~ Prihodek + GK1 + GK2,
          data = podatki_vino)
vif(fit)

## Prihodek      GK1      GK2 
## 1.561592 1.511178 1.050414

Histogram and graph of stdost/stdOcenVred

podatki_vino$stdost <- rstandard(fit)
podatki_vino$stdocenvred <- scale(fit$fitted.values)

hist(podatki_vino$stdost,
     xlab = "Standardni ostanki",
     ylab = "Frekvenca",
     main = "Histogram")

ggplot(podatki_vino, aes(y=stdost, x=stdocenvred)) + theme_dark() + geom_point(color = "snow") + ylab("stand ostanki") + xlab("stand ocen vred")

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

Check the BP test

ols_test_breusch_pagan(fit)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                  Data                   
##  ---------------------------------------
##  Response : Znesek_vino 
##  Variables: fitted values of Znesek_vino 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    3314.2980 
##  Prob > Chi2   =    0.0000

H0 lahko zavrnemo, p vrednost = 0. (prob > Chi2), imamo heteroskedastičnost. WE can throw away H0, p value = 0. (prob > Chi2), so we have Heteroskedasticity.

Cooks distances

podatki_vino$cooksD <- cooks.distance(fit)
head(podatki_vino[order(-podatki_vino$cooksD), ], )

##        ID  Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 2234 9432 666666.00           9           14          18           8
## 1654 4931 157146.00           1            0        1725           2
## 688  1501 160803.00          55           16        1622          17
## 165  8475 157243.00          20            2        1582           1
## 2229 8720  52247.25          32            2        1607          12
## 22   5376   2447.00           1            1        1725           1
##      Znesek_sladko       GK1        GK2     stdost stdocenvred      cooksD
## 2234             1 -1.067955 -0.1088027 -16.773412   13.698702 54.03779603
## 1654             1  2.389926  6.1231520  -5.609555    5.167523  0.31431231
## 688              3  2.534472  5.6271286  -5.350420    5.129183  0.24966646
## 165              2  2.108667  5.5845085  -5.295233    4.904833  0.23911341
## 2229             4  2.289862  5.6409882  -3.267427    2.576174  0.09523541
## 22               1  2.393221  6.1183017  -2.601776    1.627426  0.08859137

podatki_vino <- podatki_vino[c(-2234,-1654,-688,-165,-2229,-22), ]

We remove all units that have a stdost outside the range (-3, +3), as well as those with cooksD outside of a certain ‘normal’ range, meaning when it is not in a smooth decline.

head(podatki_vino[order(-podatki_vino$cooksD), ], )

##         ID Prihodek Znesek_vino Znesek_sadje Znesek_meso Znesek_ribe
## 2133 11181   156924           2            1           2           1
## 618   1503   162397          85            1          16           2
## 1301  5336   157733          39            1           9           2
## 656   5555   153924           1            1           1           1
## 1809  1619    90369         292           51         981         224
## 1899  4619   113734           6            2           3           1
##      Znesek_sladko       GK1         GK2    stdost stdocenvred     cooksD
## 2133             1 -1.332826 -0.08968395 -2.835489   1.9622900 0.03258371
## 618              1 -1.293179 -0.03992744 -2.642926   2.1161109 0.03056891
## 1301             0 -1.320338 -0.05212168 -2.720565   1.9971201 0.03020320
## 656              1 -1.334989 -0.09328696 -2.778813   1.8917938 0.02978413
## 1809            23  3.771999  2.72138038 -2.504681   2.9209961 0.01562421
## 1899           262  1.819574 -3.49162026 -1.831961   0.7902984 0.01495170

Results

summary(fit)

## 
## Call:
## lm(formula = Znesek_vino ~ Prihodek + GK1 + GK2, data = podatki_vino)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3256.5  -128.1   -68.7    80.0  1169.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.550e+01  1.524e+01   2.986  0.00285 ** 
## Prihodek    4.946e-03  2.723e-04  18.165  < 2e-16 ***
## GK1         6.532e+01  4.089e+00  15.976  < 2e-16 ***
## GK2         7.221e+01  8.066e+00   8.953  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 258.2 on 2236 degrees of freedom
## Multiple R-squared:  0.4125, Adjusted R-squared:  0.4117 
## F-statistic: 523.4 on 3 and 2236 DF,  p-value: < 2.2e-16

Multiple R-squared - 41.25% of the variability in the result is explained by the linear influence of PC1, PC2, and income.

F-statistics - We check for linear correlation to see if the model explains the variability. A very low p-value suggests rejecting the null hypothesis at p < 0.001.

All 4 variables have a statistically significant impact.

Both PC1 and PC2 have a positive sign. The more spent on other products, the more will be spent on wine. Similarly, the more spent on wine, the more will be spent on meat, assuming other factors remain unchanged.