Lecture report 1

In my lecture report I decided to combine knowledge from the Unsupervised Learning and Microeconomics classes. The report is structured as follows: first I will explore the data a little (in a different way than in the Professor’s report). Then I will investigate the correlation between our answers and move on to a clustering algorithm.

Please note that this report is meant to help understand the data, so it will be less detailed than papers on USL. I decided to investigate this data by clustering because I wanted to try something new. The key reasoning: applying clustering to a homogeneous group should result in one big cluster. Homogeneity within the group indicates that we can’t divide it into smaller subgroups based on the characteristics.

The main point is to investigate whether our group can be divided into subgroups - if yes, we are a heterogeneous group, if not, we are a homogeneous group.

Part A

Investigating data

In the first step I loaded the data and prepared it for further investigation - I changed the variables into numeric ones, converted the categorical variables to factors and replaced the “NA” values with 0. As a result I obtained a clean dataset:

##    nickname             sex       BA        Attitude          Varian     
##  Length:102         female:42   No :76   Min.   :  4.00   Min.   : 0.00  
##  Class :character   male  :60   Yes:26   1st Qu.: 60.25   1st Qu.: 9.25  
##  Mode  :character                        Median : 72.50   Median :30.50  
##                                          Mean   : 71.19   Mean   :32.75  
##                                          3rd Qu.: 82.75   3rd Qu.:50.00  
##                                          Max.   :100.00   Max.   :92.00  
##      IT_lit           theory           exper            quan      
##  Min.   :  0.00   Min.   :  1.00   Min.   :10.00   Min.   :10.00  
##  1st Qu.: 31.50   1st Qu.: 20.00   1st Qu.:30.00   1st Qu.:30.00  
##  Median : 57.50   Median : 20.00   Median :40.00   Median :30.00  
##  Mean   : 54.53   Mean   : 26.37   Mean   :41.66   Mean   :34.76  
##  3rd Qu.: 75.25   3rd Qu.: 30.00   3rd Qu.:50.00   3rd Qu.:40.00  
##  Max.   :100.00   Max.   :100.00   Max.   :80.00   Max.   :90.00  
##       team      
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :1.775  
##  3rd Qu.:3.000  
##  Max.   :3.000
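
The preparation step described above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code used for this report: the sample rows are invented, the helper `clean_rows` is hypothetical, and only two of the numeric columns from the summary are included.

```python
import csv
import io

# Hypothetical raw survey rows. The column names follow the summary above,
# the values are invented for illustration only.
RAW = """nickname,sex,BA,Attitude,Varian
anna,female,Yes,80,NA
bob,male,No,NA,35
"""

NUMERIC_COLUMNS = ["Attitude", "Varian"]


def clean_rows(text):
    """Parse CSV text, turning numeric columns into floats and "NA" into 0."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in NUMERIC_COLUMNS:
            value = row[col].strip()
            # Missing answers are replaced with 0, as described in the report.
            row[col] = 0.0 if value in ("NA", "") else float(value)
        rows.append(row)
    return rows


cleaned = clean_rows(RAW)
for row in cleaned:
    print(row["nickname"], row["Attitude"], row["Varian"])
```

One thing to keep in mind with this choice: a missing answer replaced with 0 is treated like a real answer of 0, so it pulls means and distances towards zero.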

Correlation analysis

Correlation analysis plot:

KEY INSIGHTS: We don’t observe any strong correlation. The highest positive one is between knowledge of the Varian textbook and BA studies at our faculty. It’s not surprising, because studying at WNE we need to complete three semesters of micro, which is mostly based on the Varian book. A weak negative correlation is also observed between Varian and the theory-based approach, and between the quantitative approach and theory.
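
For readers who want to see what a single entry of such a correlation matrix measures, here is a minimal sketch. It assumes the Pearson coefficient (the report does not say which coefficient was used), and the two short vectors are invented, not taken from our answers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented example: BA at the faculty coded as 1/0, knowledge of Varian on a 0-100 scale.
ba = [1, 1, 0, 0, 0]
varian = [80, 60, 20, 30, 10]
print(round(pearson(ba, varian), 3))
```

The coefficient lies between -1 and 1; values close to 0 correspond to the "no strong correlation" reading above.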

Boxplots:

For each answer we can investigate how we as a group differ between two main splits: males and females, and those of us who did their bachelor’s at our faculty and those who did not. The key insights are presented below:

Clustering

Clustering is the process of partitioning a dataset into a set of meaningful clusters, each of which groups objects with similar features. Why did I decide to perform clustering on our results?

I just wanted to observe whether we are homogeneous or heterogeneous (and how much) as a group.

My intuition was: if the optimal number of clusters equals 1, we are a homogeneous group - we can’t extract any meaningful clusters among us. If I manage to split us into meaningful clusters, we are heterogeneous as a whole set, but homogeneous within subgroups.

The first characteristic that can describe whether we are able to cluster the data into meaningful clusters is the Hopkins statistic:

## $hopkins_stat
## [1] 0.716925
## 
## $plot
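
The idea behind the Hopkins statistic can be sketched as below. This is a simplified version under stated assumptions, not the routine that produced the value above: it uses plain Euclidean nearest-neighbour distances (some implementations raise them to a power), it follows the convention where values near 0.5 mean random data and values near 1 mean clustered data, and the function `hopkins` and the toy data are hypothetical.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from point to the closest element of others."""
    return min(math.dist(point, o) for o in others)


def hopkins(data, m, seed=0):
    """Simplified Hopkins statistic: near 0.5 for random data, near 1 for clustered data."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # w: nearest-neighbour distances for m sampled real observations.
    w = []
    for i in rng.sample(range(len(data)), m):
        others = data[:i] + data[i + 1:]
        w.append(nearest_distance(data[i], others))

    # u: distances from m uniformly random points to the nearest real observation.
    u = []
    for _ in range(m):
        synthetic = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        u.append(nearest_distance(synthetic, data))

    return sum(u) / (sum(u) + sum(w))


# Toy data: two tight, far-apart blobs, so the statistic should be close to 1.
blob = [(i * 0.01, j * 0.01) for i in range(5) for j in range(5)]
clustered = blob + [(x + 10.0, y + 10.0) for x, y in blob]
h = hopkins(clustered, m=10)
print(round(h, 3))
```

Because the statistic depends on random sampling, different seeds give slightly different values.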

The Hopkins statistic is quite high (about 0.72), which indicates that we are able to cluster the data into meaningful clusters. That’s the first sign that we may not be homogeneous as a whole group, but only within subgroups. Now we will move on to determining the optimal number of clusters in our group:

We can observe that the optimal number of clusters differs among the methods. I will continue with three clusters, as I prefer to start with a lower number of clusters. First I will perform k-means on all variables and try to separate our group based on all factors. Then I will move on to analysing every pair of variables to see how we can be grouped based on different factors.

Disclaimer: I know that adding the labels isn’t very helpful, but maybe some of you can find yourselves and check who you are assigned to a cluster with. We can observe that one cluster is very well separated and the remaining two overlap. Because the data has many dimensions, we can’t really describe how the groups differ in terms of characteristics. This graph and clustering output is rather an indication that we are a group that can be divided into subgroups which differ from each other. To assess the quality of the clustering formally I computed some statistics:
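
To make the k-means step more concrete, here is a minimal sketch of the basic (Lloyd's) algorithm. It is an assumption-laden illustration, not the routine used for the results in this report: the starting centres are chosen with a simple deterministic farthest-first rule instead of random starts, the functions `farthest_first_centers` and `kmeans` are hypothetical, and the points are invented.

```python
import math


def farthest_first_centers(points, k):
    """Pick k starting centres: the first point, then repeatedly the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        next_center = max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)
        )
        centers.append(next_center)
    return centers


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centre update until stable."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Invented 2-D points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels, centers = kmeans(points, k=3)
print(labels)
```

Since k-means relies on Euclidean distance, variables measured on very different scales (for example 0-3 versus 0-100) are often standardised first, so that no single variable dominates the distances.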

##   cluster size ave.sil.width
## 1       1   37          0.15
## 2       2   14          0.13
## 3       3   51          0.20
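
The average silhouette width reported per cluster can be computed along these lines. This is a sketch of the standard definition with Euclidean distance, assuming at least two clusters and no exactly duplicated points; the helper names and the toy data are hypothetical.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for every observation."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    widths = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for single-element clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, hence the division by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b))
    return widths


def average_width_by_cluster(widths, labels):
    """Mean silhouette width for each cluster label."""
    grouped = {}
    for w, label in zip(widths, labels):
        grouped.setdefault(label, []).append(w)
    return {label: sum(ws) / len(ws) for label, ws in grouped.items()}


# Toy data: two compact, well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
toy_labels = [0, 0, 1, 1]
toy_widths = silhouette_widths(toy_points, toy_labels)
print(average_width_by_cluster(toy_widths, toy_labels))
```

Widths close to 1 mean an observation sits firmly inside its cluster, widths around 0 mean it lies between clusters, and negative widths suggest it may belong to a neighbouring cluster.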

##          ARI           RI            J           FM 
## -0.008204526  0.495826053  0.330842566  0.497229633

OBSERVATIONS: The overall quality of the clustering is not very satisfactory:

ARI (Adjusted Rand Index): -0.0082, no agreement between the two clustering results

RI (Rand Index): 0.4958, the assignment to clusters is close to random

J (Jaccard Index): 0.3308, there is some overlap between the clusters

FM (Fowlkes-Mallows Index): 0.4972, indicates moderate clustering performance
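
All four indices above compare two partitions of the same observations by counting pairs. The sketch below shows the usual pair-counting formulas. It is an illustration only: the function `pair_indices` is hypothetical, the label vectors are invented, and which two partitions were compared in this report is not restated here.

```python
from collections import Counter
from math import comb, sqrt


def pair_indices(labels_a, labels_b):
    """Rand, adjusted Rand, Jaccard and Fowlkes-Mallows indices for two partitions."""
    total_pairs = comb(len(labels_a), 2)
    # Pairs placed together in both partitions, in A only counted via same_a, etc.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Pairs separated in both partitions (inclusion-exclusion).
    diff_both = total_pairs - same_a - same_b + same_both

    expected = same_a * same_b / total_pairs  # agreement expected by chance
    max_index = (same_a + same_b) / 2
    return {
        "ARI": (same_both - expected) / (max_index - expected),
        "RI": (same_both + diff_both) / total_pairs,
        "J": same_both / (same_a + same_b - same_both),
        "FM": same_both / sqrt(same_a * same_b),
    }


# Identical partitions up to renaming of the labels: every index equals 1.
perfect = pair_indices([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that never agree on which pairs belong together.
crossed = pair_indices([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect)
print(crossed)
```

The adjusted index subtracts the agreement expected by chance, which is why it can be slightly negative while the plain Rand index stays near 0.5. The formulas break down (division by zero) when both partitions put everything in one cluster or every observation in its own cluster.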

The overall clustering quality isn’t good, so the clusters may have been formed randomly. This may suggest that clustering isn’t a good tool for analysing this data, or that the algorithm was chosen incorrectly. Maybe clustering on pairs of variables will give more interesting observations. The number of clusters will be set to 3 throughout.

I won’t discuss every result I’ve obtained here because there are many of them. This is done just out of curiosity, to see how we can split our group in terms of different characteristics. For every result I added the nicknames so you can try to find yourself and compare yourself with the others.
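
Going through the variables pair by pair can be organised as in the sketch below. It assumes that the seven numeric columns from the data summary are the ones being paired, which is my reading rather than something stated explicitly, and the helper `variable_pairs` is hypothetical.

```python
from itertools import combinations

# Numeric columns from the data summary (assumed to be the paired variables).
NUMERIC_VARIABLES = ["Attitude", "Varian", "IT_lit", "theory", "exper", "quan", "team"]


def variable_pairs(names):
    """All unordered pairs of variable names, each pair listed exactly once."""
    return list(combinations(names, 2))


pairs = variable_pairs(NUMERIC_VARIABLES)
print(len(pairs))
for first, second in pairs[:3]:
    print(first, second)
```

With seven variables this gives 21 unordered pairs. Looping over ordered pairs instead would visit each pair twice (42 runs) and produce every result in duplicate.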

Clustering for two variables

##   cluster size ave.sil.width
## 1       1   35          0.38
## 2       2   22          0.31
## 3       3   45          0.46

##   cluster size ave.sil.width
## 1       1   23          0.53
## 2       2   21          0.31
## 3       3   58          0.34

##   cluster size ave.sil.width
## 1       1   52          0.45
## 2       2   17          0.22
## 3       3   33          0.42

##   cluster size ave.sil.width
## 1       1   46          0.33
## 2       2   28          0.40
## 3       3   28          0.33

##   cluster size ave.sil.width
## 1       1   50          0.39
## 2       2   28          0.38
## 3       3   24          0.25

##   cluster size ave.sil.width
## 1       1   43          0.42
## 2       2   29          0.29
## 3       3   30          0.53

##   cluster size ave.sil.width
## 1       1   35          0.38
## 2       2   22          0.31
## 3       3   45          0.46

##   cluster size ave.sil.width
## 1       1   34          0.38
## 2       2   29          0.45
## 3       3   39          0.50

##   cluster size ave.sil.width
## 1       1   42          0.41
## 2       2   15          0.34
## 3       3   45          0.51

##   cluster size ave.sil.width
## 1       1   42          0.35
## 2       2   35          0.42
## 3       3   25          0.40

##   cluster size ave.sil.width
## 1       1   39          0.38
## 2       2   41          0.54
## 3       3   22          0.23

##   cluster size ave.sil.width
## 1       1   34          0.42
## 2       2   45          0.51
## 3       3   23          0.54

##   cluster size ave.sil.width
## 1       1   23          0.53
## 2       2   21          0.31
## 3       3   58          0.34

##   cluster size ave.sil.width
## 1       1   34          0.38
## 2       2   29          0.45
## 3       3   39          0.50

##   cluster size ave.sil.width
## 1       1   29          0.48
## 2       2   15          0.09
## 3       3   58          0.44

##   cluster size ave.sil.width
## 1       1   32          0.32
## 2       2   30          0.44
## 3       3   40          0.42

##   cluster size ave.sil.width
## 1       1   31          0.37
## 2       2   44          0.42
## 3       3   27          0.28

##   cluster size ave.sil.width
## 1       1   29          0.32
## 2       2   47          0.44
## 3       3   26          0.53

##   cluster size ave.sil.width
## 1       1   52          0.45
## 2       2   17          0.22
## 3       3   33          0.42

##   cluster size ave.sil.width
## 1       1   42          0.41
## 2       2   15          0.34
## 3       3   45          0.51

##   cluster size ave.sil.width
## 1       1   29          0.48
## 2       2   15          0.09
## 3       3   58          0.44

##   cluster size ave.sil.width
## 1       1   69          0.45
## 2       2   16          0.29
## 3       3   17          0.37

##   cluster size ave.sil.width
## 1       1   68          0.46
## 2       2   15          0.42
## 3       3   19          0.40

##   cluster size ave.sil.width
## 1       1   52          0.53
## 2       2   15          0.24
## 3       3   35          0.52

##   cluster size ave.sil.width
## 1       1   46          0.33
## 2       2   28          0.40
## 3       3   28          0.33

##   cluster size ave.sil.width
## 1       1   42          0.35
## 2       2   35          0.42
## 3       3   25          0.40

##   cluster size ave.sil.width
## 1       1   32          0.32
## 2       2   30          0.44
## 3       3   40          0.42

##   cluster size ave.sil.width
## 1       1   69          0.45
## 2       2   16          0.29
## 3       3   17          0.37

##   cluster size ave.sil.width
## 1       1   34          0.49
## 2       2   30          0.34
## 3       3   38          0.27

##   cluster size ave.sil.width
## 1       1   27          0.35
## 2       2   36          0.46
## 3       3   39          0.42

##   cluster size ave.sil.width
## 1       1   50          0.39
## 2       2   28          0.38
## 3       3   24          0.25

##   cluster size ave.sil.width
## 1       1   39          0.38
## 2       2   41          0.54
## 3       3   22          0.23

##   cluster size ave.sil.width
## 1       1   31          0.37
## 2       2   44          0.42
## 3       3   27          0.28

##   cluster size ave.sil.width
## 1       1   68          0.46
## 2       2   15          0.42
## 3       3   19          0.40

##   cluster size ave.sil.width
## 1       1   34          0.49
## 2       2   30          0.34
## 3       3   38          0.27

##   cluster size ave.sil.width
## 1       1   29          0.20
## 2       2   37          0.57
## 3       3   36          0.45

##   cluster size ave.sil.width
## 1       1   43          0.42
## 2       2   29          0.29
## 3       3   30          0.53

##   cluster size ave.sil.width
## 1       1   34          0.42
## 2       2   45          0.51
## 3       3   23          0.54

##   cluster size ave.sil.width
## 1       1   29          0.32
## 2       2   47          0.44
## 3       3   26          0.53

##   cluster size ave.sil.width
## 1       1   52          0.53
## 2       2   15          0.24
## 3       3   35          0.52

##   cluster size ave.sil.width
## 1       1   27          0.35
## 2       2   36          0.46
## 3       3   39          0.42

##   cluster size ave.sil.width
## 1       1   29          0.20
## 2       2   37          0.57
## 3       3   36          0.45

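OBSERVATION: Clustering quality in terms of the Silhouette index improved noticeably: most average silhouette widths now lie between about 0.3 and 0.5, compared with 0.13 to 0.20 for the clustering on all variables, and the clusters are more visible and better separated. We can proceed with the clustering - the results should be more meaningful.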

Summary

The main goal of this short lecture report was to analyse the differences within the group. In my report I focused on our group and tried to show that we are not a homogeneous group, but that we can extract some subgroups among us whose members are more similar to each other.

The intuition behind that was: I can perform clustering, and if the clustering is meaningful and of relatively good quality, the group can be split into subgroups. The clustering that took all variables into consideration wasn’t of good quality. That’s why I decided to analyse the variables in pairs. After looking over the data I decided to stick to 3 clusters (to simplify). That resulted in better clustering quality (in terms of the Silhouette index) and well-separated clusters. What are the key results? Thanks to clustering we can observe how we can be split - some of us are more focused on the quantitative approach, while others on the experimental one. Some of us prefer to work in groups and focus on the experimental approach. Each graph with the clustering results can provide some useful information.

Part B

Why is the assumption of homogeneity so fundamental in economic theory?

In economics we assume homogeneity very often. It’s easier to construct a model while assuming homogeneity and rationality among consumers. Homogeneity is essential in that part of economic life - if the group is homogeneous, the model reflects reality in detail, and it can be used to predict the behaviour of consumers properly and to develop some interesting observations.

A lack of homogeneity can distort forecasts that were based on the assumption of homogeneity. The homogeneous good is a main assumption of many micro- and macroeconomic theories and models. In microeconomic models this key assumption leads to a fair market where prices are determined only by supply and demand, not by other factors like consumer preferences.

For macroeconomic models this assumption is crucial for measuring the performance of the economy. For instance, gross domestic product (GDP) represents the total value of all goods and services produced within a country over a specific time frame. Economists calculate GDP by assuming that all goods of a similar type share the same value. This assumption enables them to assess the economy’s overall performance and monitor its fluctuations over time.

source: https://fastercapital.com/topics/the-role-of-homogeneity-in-economics.html

Why is heterogeneity essential in practical fields like data science or financial analytics?

Heterogeneity is essential in practical fields because of its applications. Imagine performing clustering to analyse the clients of an online store. As data scientists we want to search for patterns, focus on particular groups of customers and, by discovering those patterns, help to develop marketing strategies - for example, we can single out wealthy customers and adjust the loyalty bonus for them.

In the financial sector it is important, for example, when making loan offers. Let’s imagine that two households want to take out a loan to buy a house. One has savings covering an own contribution of 30%, the other 50% - the offer for the two of them won’t be the same (inspiration: last year’s exam).

There are also differences in the stock market. Some investors choose riskier, short-term investments, while others prefer other investment strategies. Investment patterns differ across markets and over time, so we can’t assume that they are always the same.

In the insurance market the insurance premium is calculated based on several factors. It’s not the same for everyone. For instance, for the insurance company it is riskier to insure an 18-year-old driver than a 32-year-old driver with no history of car crashes. In those two cases the price of insurance will be different and calculated from the risk factors. The same applies to smokers and non-smokers in terms of life insurance.