Principal balances are defined as follows (Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado 2011):
Given an \(n\)-sample of a \(D\)-part random composition, the set of Principal Balances (PB) is a set of \(D-1\) balances satisfying the following conditions:
- Each sample PB is obtained as the projection of a sample composition onto a unitary composition, the balancing element associated with the PB.
- The first PB is the balance with maximum sample variance.
- The \(i\)-th PB has maximum variance conditional on its balancing element being orthogonal to the 1st, 2nd, …, \((i-1)\)-th balancing elements.
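For reference, in the usual notation a balance is a normalised logratio between the geometric means of two disjoint groups of parts. Writing \(I_+\) and \(I_-\) for the numerator and denominator groups, with \(r\) and \(s\) parts respectively, a balance takes the form

\[
b = \sqrt{\frac{rs}{r+s}} \, \ln \frac{\left(\prod_{i \in I_+} x_i\right)^{1/r}}{\left(\prod_{j \in I_-} x_j\right)^{1/s}}.
\]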
Because of the large number of possibilities, finding the PB for a given compositional sample is a computationally demanding task. Martín-Fernández et al. (2017) proposed different approaches to deal with the calculation and approximation of this type of basis. coda.base implements all the methods presented in Martín-Fernández et al. (2017).
To illustrate the coda.base functionality we will use the following dataset:
library(coda.base)
X = parliament2017[,-1]
consisting of the results for the parties in the 2017 Catalan parliament elections (the first, non-party column is dropped).
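As a quick check (output not shown), X has \(D = 8\) parts, so any PB basis below contains \(D-1 = 7\) balances:
ncol(X) # number of parts, D = 8
colnames(X) # the parties: cs, jxcat, erc, psc, catsp, cup, pp, other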
coda.base can calculate the PB with the function pb_basis(). The function pb_basis() requires the parameter method to be set; to obtain the exact PB, users need to set method = "exact":
B1 = pb_basis(X, method = "exact")
The obtained sequential binary partition can be summarised with the following sequential binary tree:
plot(B1)
apply(coordinates(X, B1), 2, var)
#> pb1 pb2 pb3 pb4 pb5 pb6 pb7
#> 0.663216 0.085733 0.046135 0.044113 0.032890 0.028590 0.008179
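As a sanity check (a sketch, assuming pb_basis() returns the \(D \times (D-1)\) matrix of clr coefficients of the balancing elements), the orthogonality of the balancing elements can be verified directly:
# should be (approximately) the (D-1) x (D-1) identity matrix
round(t(B1) %*% B1, 10)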
When method is set to "exact", an exhaustive search is performed to obtain the PB. The time needed to calculate the PB grows exponentially with respect to the number of parts of X. To iterate through all the possibilities, coda.base uses the algorithm of Ruskey (1993). Currently, the exhaustive search can find the PB of a compositional dataset with 20 parts in a reasonable amount of time (approximately 5 minutes).
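The exponential growth is easy to quantify: each part of a candidate balance can be assigned to the numerator group, to the denominator group, or left out, so (up to sign) the number of distinct balances that can be formed from \(D\) parts is

\[
\frac{3^{D} - 2^{D+1} + 1}{2},
\]

which for \(D = 20\) already exceeds \(1.7 \times 10^{9}\) candidates.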
When the number of parts is higher, the exact method should be discarded in favour of other alternatives. Different alternatives are available in Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2011) and Martín-Fernández et al. (2017). Among these alternatives, the ones implemented are the agglomerative partition approach based on the Ward method (Martín-Fernández et al. 2017) and an approximation based on the principal components (Martín-Fernández et al. 2017).
For a composition \(X\), Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2011) proposed to build the variation array and use it to produce a distance matrix. The variation array contains the variances of all pairwise logratios.
D = as.dist(variation_array(X, only_variation = TRUE))
D
#>            cs   jxcat     erc     psc   catsp     cup      pp
#> jxcat 0.62079
#> erc   0.39636 0.08823
#> psc   0.04887 0.53876 0.29338
#> catsp 0.06682 0.54951 0.29768 0.01636
#> cup   0.48118 0.07179 0.05809 0.36163 0.35432
#> pp    0.12424 0.40944 0.20351 0.10318 0.14361 0.32193
#> other 0.07018 0.56344 0.36363 0.07197 0.07462 0.37428 0.20308
By combining the pairs with lower variance first, the final merges are expected to capture high variability.
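This merging is essentially Ward clustering on the variation distances. A minimal sketch with base R (the internal criterion may differ in detail):
# agglomerative clustering of the parts from the variation distances
hc = hclust(D, method = 'ward.D2')
plot(hc)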
This agglomerative approach is selected with method = 'cluster':
B2 = pb_basis(X, method = 'cluster')
plot(B2)
apply(coordinates(X, B2), 2, var)
#> pb1 pb2 pb3 pb4 pb5 pb6 pb7
#> 0.651569 0.097381 0.043659 0.043188 0.035838 0.029044 0.008179
The approximation based on the principal components is selected with method = 'constrained':
B3 = pb_basis(X, method = 'constrained')
plot(B3)
apply(coordinates(X, B3), 2, var)
#> pb1 pb2 pb3 pb4 pb5 pb6 pb7
#> 0.663216 0.085733 0.046135 0.044113 0.032890 0.028590 0.008179
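For this dataset, the constrained approximation recovers the variances of the exact PB, and the cluster approach comes close. Note also that the total variance is invariant under a change of orthonormal basis, so the balance variances of B1, B2 and B3 sum to the same total (about 0.909, from the outputs above); a quick check (output not shown):
# total variance is basis-invariant for orthonormal bases
sum(apply(coordinates(X, B1), 2, var))
sum(apply(coordinates(X, B2), 2, var))
sum(apply(coordinates(X, B3), 2, var))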