Suggested citation:

Mendez C. (2020). Bivariate distribution dynamics analysis in R. R Studio/RPubs. Available at https://rpubs.com/quarcs-lab/tutorial-bivariate-distribution-dynamics

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgment:

Material adapted from multiple sources; in particular, the dataset is from Magrini (2007).

1 Replication files

  • The tutorial is self-contained. No additional files are needed.

  • If you are a member of the QuaRCS lab, you can run this tutorial in R Studio Cloud

3 Tutorial objectives

  • Study the dynamics of univariate densities

  • Compute the bandwidth of a density

  • Study mobility plots

  • Study bivariate densities

  • Study density-based clustering methods

  • Study conditional bivariate densities

4 Import data

We will use two hypothetical cross-sectional series.

  • The first (x) series was produced by drawing a random sample of 1000 observations from a univariate normal distribution.
  • The second (y) series was produced by merging and sorting two random samples of 500 observations.

The mean and standard deviation of the two series match, respectively, those of the logarithm of per capita Gross Value Added observed for the Italian provinces in 1996 and in 2002. For this reason, assume that the analysis covers a 6-year period.
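A minimal sketch of how two such series could be simulated. The seed and the mixture parameters (component means 9.48 and 9.88, component standard deviation 0.15) are hypothetical values chosen only so that the moments land near those reported in the descriptive statistics; they are not the parameters behind the tutorial's dataset.

```r
# Hypothetical simulation of the two series (all parameter values are assumptions).
set.seed(123)

# x: one random sample of 1000 observations from a univariate normal distribution
x <- rnorm(1000, mean = 9.57, sd = 0.28)

# y: two random samples of 500 observations each, merged and then sorted
y_low  <- rnorm(500, mean = 9.48, sd = 0.15)
y_high <- rnorm(500, mean = 9.88, sd = 0.15)
y <- sort(c(y_low, y_high))

dat <- data.frame(x = x, y = y)
str(dat)
```

Merging two samples with different means is what gives the second series a bimodal shape.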

5 Transform data

Since the data is in log terms, let us rename the variables and add new variables.
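The transformation code is not shown, so the following is only a sketch. The variable names are taken from the descriptive statistics table; defining each `rel_` variable as the value divided by its cross-sectional mean is an assumption (it is consistent with the reported means of about 1), and the input data frame is a hypothetical stand-in.

```r
# Hypothetical stand-in for the imported data: two series already in logs.
set.seed(123)
dat <- data.frame(x = rnorm(1000, 9.57, 0.28),
                  y = sort(rnorm(1000, 9.68, 0.25)))

# Rename: the imported columns hold logs, so label them accordingly.
names(dat)[names(dat) == "x"] <- "log_x"
names(dat)[names(dat) == "y"] <- "log_y"

# Add levels and values relative to the cross-sectional mean (assumed definitions).
dat$x         <- exp(dat$log_x)
dat$y         <- exp(dat$log_y)
dat$rel_x     <- dat$x / mean(dat$x)
dat$rel_y     <- dat$y / mean(dat$y)
dat$rel_log_x <- dat$log_x / mean(dat$log_x)
dat$rel_log_y <- dat$log_y / mean(dat$log_y)

# Column order assumed from the summary table, so that columns 5:6 are rel_x and rel_y.
dat <- dat[, c("x", "y", "log_x", "log_y",
               "rel_x", "rel_y", "rel_log_x", "rel_log_y")]
head(dat)
```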

6 Descriptive statistics

skim_type  skim_variable  n_missing  complete_rate  numeric.mean  numeric.sd  numeric.p0  numeric.p25  numeric.p50  numeric.p75  numeric.p100  numeric.hist
numeric    x                      0              1      14883.41     3746.62     3344.50     12362.51     14974.61      17321.7       26506.5  ▁▃▇▃▁
numeric    y                      0              1      16441.00     4011.14     7400.07     12992.51     16244.71      20021.6       25394.5  ▂▇▅▇▂
numeric    log_x                  0              1          9.57        0.28        8.12         9.42         9.61          9.8          10.2  ▁▁▂▇▃
numeric    log_y                  0              1          9.68        0.25        8.91         9.47         9.70          9.9          10.1  ▁▃▇▇▇
numeric    rel_x                  0              1          1.00        0.25        0.22         0.83         1.01          1.2           1.8  ▁▃▇▃▁
numeric    rel_y                  0              1          1.00        0.24        0.45         0.79         0.99          1.2           1.5  ▂▇▅▇▂
numeric    rel_log_x              0              1          1.00        0.03        0.85         0.98         1.00          1.0           1.1  ▁▁▂▇▃
numeric    rel_log_y              0              1          0.99        0.03        0.84         0.97         0.99          1.0           1.1  ▁▁▂▇▃

7 Univariate dynamics

7.1 Select bandwidths

Select the bandwidth with the function dpik from the KernSmooth package.

[1] 0.065
[1] 0.039
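A minimal sketch of the bandwidth selection step with `dpik()`, which computes a direct plug-in bandwidth for a kernel density estimate. The tutorial does not state which variables the two printed bandwidths refer to, so the two series here are hypothetical and the printed values will differ.

```r
library(KernSmooth)

# Hypothetical series: one unimodal, one bimodal
set.seed(123)
rel_x <- rnorm(1000, mean = 1, sd = 0.25)
rel_y <- c(rnorm(500, 0.8, 0.12), rnorm(500, 1.2, 0.12))

# Direct plug-in bandwidths using the dpik() defaults (Gaussian kernel)
h_x <- dpik(rel_x)
h_y <- dpik(rel_y)

round(c(h_x = h_x, h_y = h_y), 3)
```

A bimodal series usually receives a smaller bandwidth than a unimodal one with a similar spread, since less smoothing is needed to keep the two modes apart.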

7.3 Plot both densities

7.3.1 Method 1

Keep the original bandwidths selected by the KernSmooth package.

Note that you have to adjust the labels manually.
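The plotting code for this method is not shown. As a sketch under stated assumptions, the version below swaps in base graphics and `bkde()` from KernSmooth: each density is estimated with its own plug-in bandwidth and the legend labels are typed in by hand. The data and the label text are hypothetical.

```r
library(KernSmooth)

# Hypothetical series
set.seed(123)
rel_x <- rnorm(1000, 1, 0.25)
rel_y <- c(rnorm(500, 0.8, 0.12), rnorm(500, 1.2, 0.12))

# Binned kernel density estimates, each with its own plug-in bandwidth
dens_x <- bkde(rel_x, bandwidth = dpik(rel_x))
dens_y <- bkde(rel_y, bandwidth = dpik(rel_y))

# Overlay both curves; the legend labels are set manually
plot(dens_x, type = "l", lty = 1,
     xlim = range(dens_x$x, dens_y$x),
     ylim = c(0, max(dens_x$y, dens_y$y)),
     xlab = "Relative value", ylab = "Density")
lines(dens_y, lty = 2)
legend("topright", legend = c("Initial year (x)", "Final year (y)"),
       lty = c(1, 2))
```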

  • Interactive plotly version

Manual labels are not yet implemented in the ggplotly function

7.3.2 Method 2

Use the default bandwidth of ggplot.
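For reference, `geom_density()` in ggplot2 defaults to `bw = "nrd0"`, the same rule of thumb that base R's `density()` uses. The sketch below computes that default with base R only, on a hypothetical series, so it can be compared with the plug-in bandwidths selected earlier.

```r
# Hypothetical series
set.seed(123)
rel_x <- rnorm(1000, 1, 0.25)

# Rule-of-thumb bandwidth ("nrd0"), the default of stats::density()
h_default <- bw.nrd0(rel_x)

# density() called without a bw argument uses the same value
d <- density(rel_x)
c(bw.nrd0 = h_default, density_bw = d$bw)
```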

Using plotly

8 Bivariate density

8.3 Using the KernSmooth package
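A minimal sketch of a bivariate kernel density estimate with `bkde2D()`. The data are hypothetical, and using one `dpik()` bandwidth per dimension is an assumed choice, not necessarily the one made in the tutorial.

```r
library(KernSmooth)

# Hypothetical pair of series with some dependence
set.seed(123)
rel_x <- rnorm(1000, 1, 0.25)
rel_y <- 0.5 + 0.5 * rel_x + rnorm(1000, 0, 0.1)

xy <- cbind(rel_x, rel_y)
h  <- c(dpik(rel_x), dpik(rel_y))   # one bandwidth per dimension (assumption)

# Binned 2D kernel density estimate on the default 51 x 51 grid
est <- bkde2D(xy, bandwidth = h)

contour(est$x1, est$x2, est$fhat, xlab = "rel_x", ylab = "rel_y")
abline(a = 0, b = 1, lty = 2)       # 45-degree line: unchanged relative position
```

Mass concentrated along the 45-degree line suggests persistence in relative positions, while mass away from it suggests mobility.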

Interactive

Interactive version

9 Density-based clusters

An S4 object of class "pdfCluster"

Call: pdfCluster(x = dat[, 5:6])

Initial groupings: 
 label    1   2  NA 
 count  296 297 407 

Final groupings: 
 label    1   2 
 count  504 496 

Groups tree (here 'h' denotes 'height'):
--[dendrogram w/ 1 branches and 2 members at h = 1]
  `--[dendrogram w/ 2 branches and 2 members at h = 0.593]
     |--leaf "1 " 
     `--leaf "2 " (h= 0.152  )

9.3 Cluster tree

9.4 Mode function

10 Conditional density analysis

10.2 Using the np package

Compute the adaptive bandwidth based on cross-validation


Conditional density data (1000 observations, 2 variable(s))
(1 dependent variable(s), and 1 explanatory variable(s))

Bandwidth Selection Method: Maximum Likelihood Cross-Validation
Formula: dat$rel_y ~ dat$rel_x
Bandwidth Type: Adaptive Nearest Neighbour
Objective Function Value: 5502 (achieved on multistart 1)

Exp. Var. Name: dat$rel_x Bandwidth: 2  

Dep. Var. Name: dat$rel_y Bandwidth: 2 

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
No. Continuous Dependent Vars.: 1
Estimation Time: 68 seconds

Compute conditional density object


Conditional Density Data: 1000 training points, in 2 variable(s)
(1 dependent variable(s), and 1 explanatory variable(s))

                        dat$rel_y
Dep. Var. Bandwidth(s):         2
                        dat$rel_x
Exp. Var. Bandwidth(s):         2

Bandwidth Type: Adaptive Nearest Neighbour
Log Likelihood: 5948

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 1
No. Continuous Dependent Vars.: 1
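The two summaries above are consistent with calls to `npcdensbw()` and `npcdens()` from the np package, using likelihood cross-validation (`bwmethod = "cv.ml"`) and adaptive nearest-neighbour bandwidths (`bwtype = "adaptive_nn"`). The sketch below is inferred from that output rather than copied from the tutorial. It uses hypothetical data and a small sample so that cross-validation finishes quickly, and it runs only if np is installed.

```r
# Hypothetical data; a small sample keeps the cross-validation fast.
set.seed(123)
n <- 100
dat <- data.frame(rel_x = rnorm(n, 1, 0.25))
dat$rel_y <- 0.5 + 0.5 * dat$rel_x + rnorm(n, 0, 0.1)

fit <- NULL
if (requireNamespace("np", quietly = TRUE)) {
  library(np)

  # Likelihood cross-validation with adaptive nearest-neighbour bandwidths
  bw <- npcdensbw(rel_y ~ rel_x, data = dat,
                  bwmethod = "cv.ml", bwtype = "adaptive_nn")
  print(summary(bw))

  # Conditional density estimate of rel_y given rel_x
  fit <- npcdens(bws = bw)
  print(summary(fit))
}
```

With 1000 observations the bandwidth search is much slower, as the reported estimation time of 68 seconds indicates.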

