UAHDataScienceUC: A Comprehensive Guide to Clustering Algorithms

Andriy Protsak

2025-02-17

The UAHDataScienceUC package provides a robust collection of clustering algorithms implemented in R. This package, developed at the Universidad de Alcalá de Henares, offers both traditional and advanced clustering methods, making it a valuable tool for data scientists and researchers. In this vignette, we’ll explore the various clustering algorithms available in the package and learn how to use them effectively.

Installation

You can install the package from CRAN using:

install.packages("UAHDataScienceUC")

Available algorithms

The package implements several clustering algorithms, each with its own strengths and use cases. We begin by loading the package and a small sample dataset:

# Load library
library(UAHDataScienceUC)

# Load data
data(db5)

# Create sample data
data <- db5[1:10, ]

K-Means Clustering

K-means partitions n observations into k clusters, assigning each observation to the cluster with the nearest mean.
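
Formally, k-means seeks cluster assignments that minimize the total within-cluster sum of squares (the standard objective, stated here for reference):

W = ∑ⱼ₌₁ᵏ ∑ₓ∈Cⱼ ‖x - μⱼ‖²

where μⱼ is the mean vector of cluster Cⱼ.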

# Perform k-means clustering
result <- kmeans_(data, centers = 3, max_iterations = 10)

# Plot results
plot(data, col = result$cluster, pch = 20)
points(result$centers, col = 1:3, pch = 8, cex = 2)
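
The fitted object can also be inspected numerically. This is a minimal sketch that assumes the result mirrors stats::kmeans, exposing the cluster and centers components already used in the plotting code above:

# Cluster sizes and fitted centers (component names assumed, per the plot code)
table(result$cluster)
result$centers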

Agglomerative Hierarchical Clustering

This algorithm builds a hierarchy of clusters from bottom-up, starting with individual observations and progressively merging them into clusters.

# Perform hierarchical clustering
result <- agglomerative_clustering(
  data,
  proximity = "single",
  distance_method = "euclidean",
  learn = TRUE
)
## ________________________________________________________________________________
## EXPLANATION:
## 
## The Agglomerative Hierarchical Clustering algorithm defines a clustering
## hierarchy for a dataset following an `n`-step process, which repeats until
## a single cluster remains:
## 
##     1. Initially, each object is assigned to its own cluster. The matrix of
##     distances between clusters is computed.
##     2. The two clusters with closest proximity will be joined together and
##     the proximity matrix updated. This is done according to the specified
##     proximity. This step is repeated until a single cluster remains.
## 
## The definitions of proximity considered by this function are:
## 
##     1. `single`. Defines the proximity between two clusters as the distance
##     between the closest objects among the two clusters. It produces
##     clusters where each object is closest to at least one other object in
##     the same cluster. It is known as SLINK, single-link or minimum-link.
##     2. `complete`. Defines the proximity between two clusters as the
##     distance between the furthest objects among the two clusters. It is
##     known as CLINK, complete-link or maximum-link.
##     3. `average`. Defines the proximity between two clusters as the average
##     distance between every pair of objects, one from each cluster. It is
##     also known as UPGMA or average-link.
## 
## Euclidean Distance Formula:
## d(x,y) = √(∑ᵢ₌₁ⁿ (xᵢ - yᵢ)²)
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 1:
## 
## Initially, each object is assigned to its own cluster. This leaves us with
## the following clusters:
## CLUSTER #-1 (size: 1)
##           x         y
## 1 -1.578117 -1.292868
## CLUSTER #-2 (size: 1)
##           x        y
## 2 0.7027994 1.193823
## CLUSTER #-3 (size: 1)
##           x        y
## 3 0.7854535 1.191428
## CLUSTER #-4 (size: 1)
##           x           y
## 4 0.6757613 -0.04002442
## CLUSTER #-5 (size: 1)
##           x         y
## 5 0.8484305 0.2230609
## CLUSTER #-6 (size: 1)
##          x         y
## 6 0.515489 0.3014147
## CLUSTER #-7 (size: 1)
##           x        y
## 7 0.9187371 1.347416
## CLUSTER #-8 (size: 1)
##           x           y
## 8 0.9062708 -0.01894187
## CLUSTER #-9 (size: 1)
##           x       y
## 9 0.7017478 1.37873
## CLUSTER #-10 (size: 1)
##            x        y
## 10 0.4289005 1.109321
## 
## Press [enter] to continue
## 
## The matrix of distances between clusters is computed:
## Distances:
##        -1    -2    -3    -4    -5    -6    -7    -8    -9
## -2  3.374                                                
## -3  3.429 0.083                                          
## -4  2.579 1.234 1.236                                    
## -5  2.861 0.982 0.970 0.315                              
## -6  2.632 0.912 0.930 0.377 0.342                        
## -7  3.634 0.265 0.205 1.409 1.127 1.121                  
## -8  2.792 1.230 1.216 0.231 0.249 0.505 1.366            
## -9  3.512 0.185 0.205 1.419 1.165 1.093 0.219 1.413      
## -10 3.130 0.287 0.366 1.176 0.981 0.813 0.545 1.225 0.383
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 2:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-2 (size: 1)
## CLUSTER #-3 (size: 1)
## Proximity:
## [1] 0.08268877
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #1 (size: 2) [CLUSTER #-2 + CLUSTER #-3]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##        -1    -4    -5    -6    -7    -8    -9   -10
## -4  2.579                                          
## -5  2.861 0.315                                    
## -6  2.632 0.377 0.342                              
## -7  3.634 1.409 1.127 1.121                        
## -8  2.792 0.231 0.249 0.505 1.366                  
## -9  3.512 1.419 1.165 1.093 0.219 1.413            
## -10 3.130 1.176 0.981 0.813 0.545 1.225 0.383      
## 1   3.374 1.234 0.970 0.912 0.205 1.216 0.185 0.287
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 3:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-9 (size: 1)
## CLUSTER #1 (size: 2)
## Proximity:
## [1] 0.1849095
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #2 (size: 3) [CLUSTER #-9 + CLUSTER #1]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##        -1    -4    -5    -6    -7    -8   -10
## -4  2.579                                    
## -5  2.861 0.315                              
## -6  2.632 0.377 0.342                        
## -7  3.634 1.409 1.127 1.121                  
## -8  2.792 0.231 0.249 0.505 1.366            
## -10 3.130 1.176 0.981 0.813 0.545 1.225      
## 2   3.374 1.234 0.970 0.912 0.205 1.216 0.287
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 4:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-7 (size: 1)
## CLUSTER #2 (size: 3)
## Proximity:
## [1] 0.2051747
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #3 (size: 4) [CLUSTER #-7 + CLUSTER #2]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##        -1    -4    -5    -6    -8   -10
## -4  2.579                              
## -5  2.861 0.315                        
## -6  2.632 0.377 0.342                  
## -8  2.792 0.231 0.249 0.505            
## -10 3.130 1.176 0.981 0.813 1.225      
## 3   3.374 1.234 0.970 0.912 1.216 0.287
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 5:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-4 (size: 1)
## CLUSTER #-8 (size: 1)
## Proximity:
## [1] 0.2314716
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #4 (size: 2) [CLUSTER #-4 + CLUSTER #-8]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##        -1    -5    -6   -10     3
## -5  2.861                        
## -6  2.632 0.342                  
## -10 3.130 0.981 0.813            
## 3   3.374 0.970 0.912 0.287      
## 4   2.579 0.249 0.377 1.176 1.216
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 6:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-5 (size: 1)
## CLUSTER #4 (size: 2)
## Proximity:
## [1] 0.2488189
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #5 (size: 3) [CLUSTER #-5 + CLUSTER #4]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##        -1    -6   -10     3
## -6  2.632                  
## -10 3.130 0.813            
## 3   3.374 0.912 0.287      
## 5   2.579 0.342 0.981 0.970
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 7:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-10 (size: 1)
## CLUSTER #3 (size: 4)
## Proximity:
## [1] 0.2866377
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #6 (size: 5) [CLUSTER #-10 + CLUSTER #3]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##       -1    -6     5
## -6 2.632            
## 5  2.579 0.342      
## 6  3.130 0.813 0.970
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 8:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-6 (size: 1)
## CLUSTER #5 (size: 3)
## Proximity:
## [1] 0.342037
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #7 (size: 4) [CLUSTER #-6 + CLUSTER #5]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##      -1     6
## 6 3.130      
## 7 2.579 0.813
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 9:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #6 (size: 5)
## CLUSTER #7 (size: 4)
## Proximity:
## [1] 0.8125336
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #8 (size: 9) [CLUSTER #6 + CLUSTER #7]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
##      -1
## 8 2.579
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 10:
## 
## The two clusters with closest proximity are identified:
## Clusters:
## CLUSTER #-1 (size: 1)
## CLUSTER #8 (size: 9)
## Proximity:
## [1] 2.578678
## 
## Press [enter] to continue
## 
## They are merged into a new cluster:
## CLUSTER #9 (size: 10) [CLUSTER #-1 + CLUSTER #8]
## 
## Press [enter] to continue
## 
## The proximity matrix is updated. To do so, the rows/columns of the merged
## clusters are removed, and the rows/columns of the new cluster are added:
## Distances:
## dist(0)
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## RESULTS:
## 
## Since all clusters have been merged together, the final clustering hierarchy is:
## (Check the plot for the dendrogram representation of the hierarchy)
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
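
A flat clustering can be recovered from the hierarchy afterwards. The printed result resembles an hclust object (compare the output in the Distances section below), so this sketch assumes hclust compatibility; adapt it if the return type differs:

# Cut the dendrogram into 3 flat clusters (assumes an hclust-compatible result)
groups <- cutree(result, k = 3)
plot(data, col = groups, pch = 20)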

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is particularly effective at finding clusters of arbitrary shape and identifying noise points.

result <- dbscan(
  data,
  epsilon = 0.3,
  min_pts = 4,
  learn = TRUE
)
## ________________________________________________________________________________
## EXPLANATION:
## 
## The data given by data is clustered by the DBSCAN method, which aims to
## partition the points into clusters such that the points in a cluster are
## close to each other and the points in different clusters are far away from
## each other. The clusters are defined as dense regions of points separated
## by regions of low density.
## 
## The DBSCAN method follows a two-step process:
## 
##     1. For each point, the neighborhood of radius epsilon is computed. If
##     the neighborhood contains at least min_pts points, then the point is
##     considered a core point. Otherwise, the point is considered an outlier.
##     2. For each core point, if the core point is not already assigned to a
##     cluster, a new cluster is created and the core point is assigned to it.
##     Then, the neighborhood of the core point is explored. If a point in the
##     neighborhood is a core point, then the neighborhood of that point is
##     also explored. This process is repeated until all points in the
##     neighborhood have been explored. If a point in the neighborhood is not
##     already assigned to a cluster, then it is assigned to the cluster of
##     the core point.
## 
## Any points not assigned to a cluster are considered outliers.
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 1:
## 
## The pairwise distances between observations are precomputed in order to
## later determine which of them are core observations. The distance matrix is:
## Distances:
##        1     2     3     4     5     6     7     8     9    10
## 1  0.000 3.374 3.429 2.579 2.861 2.632 3.634 2.792 3.512 3.130
## 2  3.374 0.000 0.083 1.234 0.982 0.912 0.265 1.230 0.185 0.287
## 3  3.429 0.083 0.000 1.236 0.970 0.930 0.205 1.216 0.205 0.366
## 4  2.579 1.234 1.236 0.000 0.315 0.377 1.409 0.231 1.419 1.176
## 5  2.861 0.982 0.970 0.315 0.000 0.342 1.127 0.249 1.165 0.981
## 6  2.632 0.912 0.930 0.377 0.342 0.000 1.121 0.505 1.093 0.813
## 7  3.634 0.265 0.205 1.409 1.127 1.121 0.000 1.366 0.219 0.545
## 8  2.792 1.230 1.216 0.231 0.249 0.505 1.366 0.000 1.413 1.225
## 9  3.512 0.185 0.205 1.419 1.165 1.093 0.219 1.413 0.000 0.383
## 10 3.130 0.287 0.366 1.176 0.981 0.813 0.545 1.225 0.383 0.000
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP 2:
## 
## Every observation is labeled as UNVISITED. We are now going to loop over
## every observation and, if it is not already assigned to a cluster, we will
## try to expand a new cluster around it...
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## NOISE:
## 
## An UNVISITED observation is labeled as NOISE:
## Observation #1 [UNVISITED -> NOISE]
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## CLUSTER #1:
## 
## A new cluster is going to be expanded around an UNVISITED core observation:
## Observation #2 [UNVISITED -> CLUSTER #1]
## 
## Press [enter] to continue
## 
## The cluster is also expanded around the neighbors of the core observation:
## Observation #3 [UNVISITED -> CLUSTER #1]
## Observation #7 [UNVISITED -> CLUSTER #1]
## Observation #9 [UNVISITED -> CLUSTER #1]
## Observation #10 [UNVISITED -> CLUSTER #1]
## 
## All of these observations are added to the cluster.
## 
## Press [enter] to continue
## 
## ***
## 
## The following core observation is expanded:
## Observation #3 [CLUSTER #1]
## 
## Its neighborhood is:
## Observation #2 [CLUSTER #1]
## Observation #3 [CLUSTER #1]
## Observation #7 [CLUSTER #1]
## Observation #9 [CLUSTER #1]
## 
## Upon doing it, no observations are added to the cluster...
## 
## Additionally, no other observations are expanded...
## 
## Press [enter] to continue
## 
## ***
## 
## The following core observation is expanded:
## Observation #7 [CLUSTER #1]
## 
## Its neighborhood is:
## Observation #2 [CLUSTER #1]
## Observation #3 [CLUSTER #1]
## Observation #7 [CLUSTER #1]
## Observation #9 [CLUSTER #1]
## 
## Upon doing it, no observations are added to the cluster...
## 
## Additionally, no other observations are expanded...
## 
## Press [enter] to continue
## 
## ***
## 
## The following core observation is expanded:
## Observation #9 [CLUSTER #1]
## 
## Its neighborhood is:
## Observation #2 [CLUSTER #1]
## Observation #3 [CLUSTER #1]
## Observation #7 [CLUSTER #1]
## Observation #9 [CLUSTER #1]
## 
## Upon doing it, no observations are added to the cluster...
## 
## Additionally, no other observations are expanded...
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## NOISE:
## 
## An UNVISITED observation is labeled as NOISE:
## Observation #4 [UNVISITED -> NOISE]
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## NOISE:
## 
## An UNVISITED observation is labeled as NOISE:
## Observation #5 [UNVISITED -> NOISE]
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## NOISE:
## 
## An UNVISITED observation is labeled as NOISE:
## Observation #6 [UNVISITED -> NOISE]
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## NOISE:
## 
## An UNVISITED observation is labeled as NOISE:
## Observation #8 [UNVISITED -> NOISE]
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## RESULTS:
## 
## Having gone through every observation, the following clusters have been
## found:
## CLUSTER #0 (NOISE):
##            x           y
## 1 -1.5781168 -1.29286766
## 4  0.6757613 -0.04002442
## 5  0.8484305  0.22306094
## 6  0.5154890  0.30141469
## 8  0.9062708 -0.01894187
## 
## CLUSTER #1:
##            x        y
## 2  0.7027994 1.193823
## 3  0.7854535 1.191428
## 7  0.9187371 1.347416
## 9  0.7017478 1.378730
## 10 0.4289005 1.109321
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
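
With these parameters only the upper group of points forms a cluster; everything else is labeled noise. Judging from the distance matrix above, a larger epsilon should also let observations 4, 5, 6 and 8 become core points. The re-run below is a sketch, not package-verified output, and it assumes the result stores labels in a cluster component with 0 denoting noise (as in the trace above):

# Looser density requirement: larger radius, fewer required neighbors
result2 <- dbscan(data, epsilon = 0.6, min_pts = 3)
plot(data, col = result2$cluster + 1, pch = 20)  # cluster component assumed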

Gaussian Mixture Models

This probabilistic model assumes that the data points are generated from a mixture of several Gaussian distributions.
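
The model density is the weighted sum of the component densities (standard GMM notation, using the parameter names that appear in the learn-mode output below):

p(x) = ∑ᵢ₌₁ᵏ λᵢ · N(x | μᵢ, Σᵢ)

where μᵢ (mu), Σᵢ (sigma) and λᵢ (lambda) are the mean vector, covariance matrix and mixing proportion of component i, with ∑ᵢ λᵢ = 1.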

result <- gaussian_mixture(
  data,
  k = 3,
  max_iter = 100,
  learn = TRUE
)
## ________________________________________________________________________________
## EXPLANATION:
## 
## The Gaussian Mixture Model with Expectation Maximization (GMM with EM)
## algorithm aims to model the data as a Gaussian Mixture Model, i.e. the
## weighted sum of several Gaussian distributions, where each component, i.e.
## each Gaussian distribution, represents a cluster.
## 
## The Gaussian distributions are parameterized by their mean vector (mu),
## covariance matrix (sigma) and mixing proportion (lambda). Initially, the
## mean vector is set to the cluster centers obtained by performing a k-means
## clustering on the data, the covariance matrix is set to the covariance
## matrix of the data points belonging to each cluster, and the mixing
## proportion is set to the proportion of data points belonging to each
## cluster. The algorithm then optimizes the GMM using the EM algorithm.
## 
## The EM algorithm is an iterative algorithm that alternates between two
## steps:
## 
##     1. Expectation step. Compute how much each observation is expected to
##     belong to each component of the GMM.
##     2. Maximization step. Recompute the GMM according to the expectations
##     from the E-step in order to maximize them.
## 
## The algorithm stops when the changes in the expectations are sufficiently
## small or when a maximum number of iterations is reached.
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## INITIALIZATION:
## 
## The GMM is initialized by calling kmeans. The initial components are:
## *** Component #1 ***
## mu:
## [1] 0.7364879 0.1163773
## sigma:
##             [,1]        [,2]
## [1,]  0.03129520 -0.01414258
## [2,] -0.01414258  0.02946434
## lambda:
## [1] 0.4
## 
## *** Component #2 ***
## mu:
## [1] -1.578117 -1.292868
## sigma:
##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
## lambda:
## [1] 0.1
## 
## *** Component #3 ***
## mu:
## [1] 0.7075276 1.2441437
## sigma:
##            [,1]       [,2]
## [1,] 0.03209268 0.01368235
## [2,] 0.01368235 0.01306666
## lambda:
## [1] 0.5
## 
## These initial components are then optimized using the EM algorithm.
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## EM ALGORITHM:
## 
## To measure how much the expectations change at each step we will use the
## log likelihood. The log likelihood is the sum of the logarithm of the
## probability of the data given the model. The higher the log likelihood,
## the better the model.
## 
## The current log likelihood is:
## loglik:
## [1] -958.4645
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP #0:
## 
## E-STEP:
## 
## The expectation of each observation to belong to each component of the GMM
## is the following:
## Expectation:
##    [,1] [,2] [,3]
## 1     1   NA    0
## 2     0   NA    1
## 3     0   NA    1
## 4     1   NA    0
## 5     1   NA    0
## 6     1   NA    0
## 7     0   NA    1
## 8     1   NA    0
## 9     0   NA    1
## 10    0   NA    1
## 
## Press [enter] to continue
## 
## M-STEP:
## 
## The new components are:
## *** Component #1 ***
## mu:
## [1]  0.2735670 -0.1654717
## sigma:
##           [,1]      [,2]
## [1,] 1.0949504 0.6417621
## [2,] 0.6417621 0.4192926
## lambda:
## [1] 0.5
## 
## *** Component #2 ***
## mu:
## [1] 1 1
## sigma:
##           [,1]      [,2]
## [1,] 0.8415997 0.7219930
## [2,] 0.7219930 0.9798987
## lambda:
## [1] 1e-300
## 
## *** Component #3 ***
## mu:
## [1] 0.7075276 1.2441437
## sigma:
##            [,1]       [,2]
## [1,] 0.03209268 0.01368235
## [2,] 0.01368235 0.01306666
## lambda:
## [1] 0.5
## 
## Press [enter] to continue
## 
## The new log likelihood is:
## loglik:
## [1] -210.6484
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP #1:
## 
## E-STEP:
## 
## The expectation of each observation to belong to each component of the GMM
## is the following:
## Expectation:
##    [,1] [,2] [,3]
## 1     1    0    0
## 2     0    0    1
## 3     0    0    1
## 4     1    0    0
## 5     1    0    0
## 6     1    0    0
## 7     0    0    1
## 8     1    0    0
## 9     0    0    1
## 10    0    0    1
## 
## Press [enter] to continue
## 
## M-STEP:
## 
## The new components are:
## *** Component #1 ***
## mu:
## [1]  0.2735670 -0.1654716
## sigma:
##           [,1]      [,2]
## [1,] 1.0949503 0.6417621
## [2,] 0.6417621 0.4192927
## lambda:
## [1] 0.5
## 
## *** Component #2 ***
## mu:
## [1] 0.6035213 0.2872708
## sigma:
##           [,1]      [,2]
## [1,] 0.1674271 0.0841496
## [2,] 0.0841496 0.2447075
## lambda:
## [1] 4.072114e-301
## 
## *** Component #3 ***
## mu:
## [1] 0.7075276 1.2441437
## sigma:
##            [,1]       [,2]
## [1,] 0.03209268 0.01368235
## [2,] 0.01368235 0.01306666
## lambda:
## [1] 0.5
## 
## Press [enter] to continue
## 
## The new log likelihood is:
## loglik:
## [1] -4.75882
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP #2:
## 
## E-STEP:
## 
## The expectation of each observation to belong to each component of the GMM
## is the following:
## Expectation:
##    [,1] [,2] [,3]
## 1     1    0    0
## 2     0    0    1
## 3     0    0    1
## 4     1    0    0
## 5     1    0    0
## 6     1    0    0
## 7     0    0    1
## 8     1    0    0
## 9     0    0    1
## 10    0    0    1
## 
## Press [enter] to continue
## 
## M-STEP:
## 
## The new components are:
## *** Component #1 ***
## mu:
## [1]  0.2735670 -0.1654716
## sigma:
##           [,1]      [,2]
## [1,] 1.0949503 0.6417621
## [2,] 0.6417621 0.4192927
## lambda:
## [1] 0.5
## 
## *** Component #2 ***
## mu:
## [1] 0.6502022 0.2060456
## sigma:
##             [,1]        [,2]
## [1,]  0.04111510 -0.02457912
## [2,] -0.02457912  0.05821502
## lambda:
## [1] 6.015025e-301
## 
## *** Component #3 ***
## mu:
## [1] 0.7075276 1.2441437
## sigma:
##            [,1]       [,2]
## [1,] 0.03209268 0.01368235
## [2,] 0.01368235 0.01306666
## lambda:
## [1] 0.5
## 
## Press [enter] to continue
## 
## The new log likelihood is:
## loglik:
## [1] -4.758821
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## STEP #3:
## 
## E-STEP:
## 
## The expectation of each observation to belong to each component of the GMM
## is the following:
## Expectation:
##    [,1] [,2] [,3]
## 1     1    0    0
## 2     0    0    1
## 3     0    0    1
## 4     1    0    0
## 5     1    0    0
## 6     1    0    0
## 7     0    0    1
## 8     1    0    0
## 9     0    0    1
## 10    0    0    1
## 
## Press [enter] to continue
## 
## M-STEP:
## 
## The new components are:
## *** Component #1 ***
## mu:
## [1]  0.2735670 -0.1654716
## sigma:
##           [,1]      [,2]
## [1,] 1.0949503 0.6417621
## [2,] 0.6417621 0.4192927
## lambda:
## [1] 0.5
## 
## *** Component #2 ***
## mu:
## [1] 0.6489054 0.1874634
## sigma:
##             [,1]        [,2]
## [1,]  0.04375938 -0.02950721
## [2,] -0.02950721  0.03592890
## lambda:
## [1] 3.077336e-300
## 
## *** Component #3 ***
## mu:
## [1] 0.7075276 1.2441437
## sigma:
##            [,1]       [,2]
## [1,] 0.03209268 0.01368235
## [2,] 0.01368235 0.01306666
## lambda:
## [1] 0.5
## 
## Press [enter] to continue
## 
## The new log likelihood is:
## loglik:
## [1] -4.758821
## 
## Press [enter] to continue
## 
## ________________________________________________________________________________
## FINAL RESULTS:
## 
## The algorithm stopped because the change in the log likelihood was smaller
## than 1e-6.
## 
## With the current GMM, every observation is assigned to the cluster it is
## most likely to belong to. The final clusters are:
## Cluster assignments:
##  1  2  3  4  5  6  7  8  9 10 
##  1  3  3  1  1  1  3  1  3  3
## 
## ________________________________________________________________________________
## 
# Plot results colored by the most likely component
plot(data, col = result$cluster, pch = 20)

Genetic K-Means

This algorithm combines traditional k-means with genetic algorithm concepts for potentially better cluster optimization.

result <- genetic_kmeans(
  data,
  k = 3,
  population_size = 10,
  mut_probability = 0.5,
  max_generations = 10,
  learn = TRUE
)
## ________________________________________________________________________________
## EXPLANATION:
## 
## The Genetic K-Means algorithm combines the K-Means clustering method with
## genetic algorithm concepts. It follows these main steps:
## 1. Initialize a population of random cluster assignments
## 2. Evaluate the fitness of each individual based on within-cluster variation
## 3. Select parents for the next generation based on fitness
## 4. Apply mutation and crossover to create new individuals
## 5. Repeat steps 2-4 for a specified number of generations
## 
## Press [enter] to continue
## ________________________________________________________________________________
## INITIALIZATION:
## 
## A population of 10 individuals has been randomly initialized.
## Each individual represents a possible clustering solution.
## Here's a sample of the initial population (first 5 individuals, first 10
## data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    3    3    2    1    3    2    3    1    2     1
## [2,]    1    1    3    3    2    3    2    3    1     2
## [3,]    3    1    2    1    2    1    3    2    1     3
## [4,]    1    1    3    1    2    2    3    2    2     3
## [5,]    1    3    3    2    1    3    2    3    1     2
## 
## Press [enter] to continue
## INITIAL FITNESS:
## Best fitness: 4.74
## Average fitness: 2.39
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 1
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 1 3 2 1 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    1    2    3    3    3     3
## [2,]    1    2    2    2    1    1    2    2    3     3
## [3,]    1    1    3    2    2    1    2    2    3     3
## [4,]    1    2    2    2    1    2    3    2    3     2
## [5,]    1    2    3    2    1    2    3    2    3     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    1    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 2.41
## Average fitness: 2.17
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 2
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    3    2    2    2    2    3     3
## [2,]    1    3    3    2    2    2    1    2    3     3
## [3,]    1    3    2    3    2    2    3    2    3     3
## [4,]    1    3    3    2    1    2    3    2    2     2
## [5,]    1    3    3    2    2    2    3    3    3     2
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 0.00
## Average fitness: 0.00
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 3
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    3    3    2    2     3
## [2,]    1    2    3    2    2    3    1    3    3     1
## [3,]    1    2    3    2    2    2    3    2    3     3
## [4,]    1    2    3    3    2    2    3    2    3     1
## [5,]    1    3    3    2    2    2    3    3    3     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 3.43
## Average fitness: 3.09
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 4
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    1    2    2    2    3    3    3     3
## [2,]    1    3    2    2    1    1    3    2    3     3
## [3,]    3    2    3    2    1    2    2    2    3     3
## [4,]    1    2    2    2    2    2    3    2    3     3
## [5,]    3    3    3    2    2    2    3    2    3     1
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    3    2    2    1    1    1    2    1    2     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    3    1    3    2    2    2    1    3    1     1
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 7.46
## Average fitness: 5.72
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 5
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    2    2    2    2    3    2    3     3
## [2,]    1    2    3    1    3    2    3    2    3     2
## [3,]    1    3    3    2    2    3    1    3    3     3
## [4,]    3    3    1    1    2    1    2    2    3     2
## [5,]    1    3    3    1    2    3    1    2    3     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    3    2    2    1    1    1    2    1    2     2
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 3.45
## Average fitness: 3.10
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 6
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    2    2    2    2    1    2    3     3
## [2,]    1    3    3    2    2    2    1    2    3     1
## [3,]    1    2    3    1    2    2    1    2    3     2
## [4,]    1    3    2    2    2    1    3    2    1     3
## [5,]    1    3    3    2    2    3    3    2    2     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    3    2    2    2    1    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 4.25
## Average fitness: 3.83
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 7
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    3    2    3    1    1    2    1     3
## [2,]    1    3    3    1    2    2    3    2    3     1
## [3,]    1    1    3    2    2    2    2    2    1     3
## [4,]    1    3    3    2    3    3    1    2    3     3
## [5,]    3    3    2    2    2    1    1    2    2     2
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    1    1    2    3    3    1    3    1     1
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    3    1    1    2    2    2    1    2    1     1
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 3.43
## Average fitness: 2.77
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 8
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     2
## [2,]    1    2    2    2    2    2    3    2    3     3
## [3,]    1    2    3    2    3    2    3    1    2     3
## [4,]    1    3    3    1    2    2    2    2    3     1
## [5,]    1    3    2    2    2    3    3    3    2     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 0.00
## Average fitness: 0.00
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 9
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    3    2    3    1    3    2    1     1
## [2,]    1    3    2    2    2    3    3    2    3     3
## [3,]    1    3    3    2    3    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    1    3     3
## [5,]    1    3    1    2    2    3    3    2    3     3
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    2    3    3    2    3    1    3    2    3     1
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    1    3    3    2    2    2    3    2    3     3
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 5.35
## Average fitness: 4.38
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## GENERATION: 10
## 
## SELECTION:
## Parents for the next generation are selected based on their fitness.
## Selected parent (first 10 data points):
##  [1] 1 3 3 2 2 2 3 2 3 3
## 
## Press [enter] to continue
## MUTATION:
## Random mutations are applied to the chromosomes with probability 0.50
## Sample of mutated population (first 5 individuals, first 10 data points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    2    3    2    1    2    3    2    3     1
## [2,]    1    3    2    2    2    2    3    2    3     3
## [3,]    3    1    2    3    1    2    3    2    1     3
## [4,]    1    3    3    2    3    2    3    2    2     2
## [5,]    1    2    3    2    2    2    3    2    2     2
## 
## Press [enter] to continue
## CROSSOVER:
## K-Means Operator (KMO) is applied as a form of crossover.
## This reassigns each point to its nearest center.
## Sample of population after crossover (first 5 individuals, first 10 data
## points):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    3    3    2    2    2    3    2    3     3
## [2,]    1    3    3    2    2    2    3    2    3     3
## [3,]    3    1    1    2    2    2    1    2    1     1
## [4,]    1    3    3    2    2    2    3    2    3     3
## [5,]    1    3    3    2    2    2    3    2    3     3
## 
## Press [enter] to continue
## FITNESS EVALUATION:
## The fitness of each individual is calculated based on within-cluster variation.
## Best fitness in this generation: 2.45
## Average fitness: 2.21
## Total Within-Cluster Variation of best solution: 0.36
## 
## Press [enter] to continue
## ________________________________________________________________________________
## FINAL RESULTS:
## 
## Number of clusters: 3
## Total sum of squares: 11.68
## Total within-cluster sum of squares: 0.36
## Between-cluster sum of squares: 11.31
## Cluster sizes:
## [1] 1 4 5
## Final cluster centers:
##   [,1]    [,2]   
## 1 "-1.58" "-1.29"
## 2 "0.74"  "0.12" 
## 3 "0.71"  "1.24"
## 
## Generating plot of clustering results...
## Plot generated. Check the graphics device.
## 
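
As with plain k-means, the final assignments can be visualized. This sketch assumes the returned object carries per-observation labels in a cluster component, a hypothetical field name inferred from the cluster sizes and centers printed above:

# Plot the best solution found by the genetic search (field name assumed)
plot(data, col = result$cluster, pch = 20)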

Correlation Clustering

Correlation clustering performs hierarchical clustering by analyzing relationships between data points and a target, with support for weighted features.

# Create sample data
data <- matrix(c(1,2,1,4,5,1,8,2,9,6,3,5,8,5,4), ncol=3)
dataFrame <- data.frame(data)
target <- c(1,2,3)
weights <- c(0.1, 0.6, 0.3)

# Perform correlation clustering
result <- correlation_clustering(
    dataFrame,
    target = target,
    weight = weights,
    distance_method = "euclidean",
    normalize = TRUE,
    learn = TRUE
)
## EXPLANATION:
## 
## The Correlation Hierarchical Clustering algorithm is a classification
## technique that:
## 
##     1. Initializes a cluster for each data point
##     2. Calculates distances between clusters and a given target
##     3. Applies weights to achieve weighted results
## 
## Due to normalize = TRUE, weights will be normalized to the [0,1] range
## 
## ________________________________________________________________________________
## WEIGHT NORMALIZATION:
## 
## Initial weights:
##   Weight 1: 0.100000
##   Weight 2: 0.600000
##   Weight 3: 0.300000
## No weights provided
## Initializing 3 weights with value 1
## 
## Normalizing weights:
## Total sum: 1.000000
## Formula: weight[i] = weight[i] / total
## 
## 
##  These are the new weights:
##   Weight 1: 0.100000
##   Weight 2: 0.600000
##   Weight 3: 0.300000
## 
## ________________________________________________________________________________
## WEIGHTS:
## 
## The following weights will be used:
## [1] 0.1 0.6 0.3
## 
## Euclidean Distance Formula:
## d(x,y) = √(∑ᵢ₌₁ⁿ (xᵢ - yᵢ)²)
## This distance metric will be used to:
##     - Calculate distances between each cluster and the target
##     - Sort clusters by their similarity to the target
## 
## DATA INITIALIZATION:
## 
## Input data:
##   X1 X2 X3
## 1  1  1  3
## 2  2  8  5
## 3  1  2  8
## 4  4  9  5
## 5  5  6  4
## 
## Each cluster is initialized as a matrix with one row and the same columns
## as the input data
## 
## Initialized clusters:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    1    1    3
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    2    8    5
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    8
## 
## [[4]]
##      [,1] [,2] [,3]
## [1,]    4    9    5
## 
## [[5]]
##      [,1] [,2] [,3]
## [1,]    5    6    4
## 
## ________________________________________________________________________________
## TARGET INITIALIZATION:
## 
## Input target:
## [1] 1 2 3
## 
## [1] 1 2 3
## Converting vector target to matrix format
## 
## Validating target dimensions:
## Target has one row
## Target has same number of columns as input data
## 
## Final target:
##      [,1] [,2] [,3]
## [1,]    1    2    3
## 
## ________________________________________________________________________________
## INITIALIZATION:
## 
## Initialized data:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    1    1    3
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    2    8    5
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    2    8
## 
## [[4]]
##      [,1] [,2] [,3]
## [1,]    4    9    5
## 
## [[5]]
##      [,1] [,2] [,3]
## [1,]    5    6    4
## 
## Initialized target:
##      [,1] [,2] [,3]
## [1,]    1    2    3
## 
## ________________________________________________________________________________
## DISTANCE CALCULATION:
## 
## Calculating distances between clusters and target using euclidean distance
## 
## ________________________________________________________________________________
## DISTANCES:
## 
## Calculated distances:
## [1] 1.000000 6.403124 5.000000 7.874008 5.744563
## 
## Sorted distances:
## [1] 1.000000 5.000000 5.744563 6.403124 7.874008
## 
## 
## Then, using the sorted distances, the function orders the clusters.
## 
## ________________________________________________________________________________
## RESULTS:
## 
## Final sorted distances:
##   cluster sortedDistances
## 1       1        1.000000
## 2       3        5.000000
## 3       5        5.744563
## 4       2        6.403124
## 5       4        7.874008
## 
## Final sorted clusters:
##   cluster X1 X2 X3
## 1       1  1  1  3
## 2       3  1  2  8
## 3       5  5  6  4
## 4       2  2  8  5
## 5       4  4  9  5
## 
## Dendrogram visualization:
## 
## ________________________________________________________________________________
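
The weights control how much each feature contributes to the distances. The trace above suggests that omitting the weight argument initializes equal weights; the variant below is a sketch relying on that behavior and on another supported distance metric:

# Equal weights by default (assumed); manhattan is a supported distance method
result_mh <- correlation_clustering(
    dataFrame,
    target = target,
    distance_method = "manhattan",
    normalize = TRUE
)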

Distances

The package supports various distance metrics for algorithms like agglomerative clustering and correlation clustering. The available metrics are euclidean, manhattan, canberra and chebyshev.
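
For reference, the standard definitions of these metrics for points x and y in n dimensions are:

Euclidean: d(x,y) = √(∑ᵢ₌₁ⁿ (xᵢ - yᵢ)²)
Manhattan: d(x,y) = ∑ᵢ₌₁ⁿ |xᵢ - yᵢ|
Canberra:  d(x,y) = ∑ᵢ₌₁ⁿ |xᵢ - yᵢ| / (|xᵢ| + |yᵢ|)
Chebyshev: d(x,y) = maxᵢ |xᵢ - yᵢ|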

You can specify these in algorithms that accept a distance parameter:

# Using different distance metrics
agglomerative_clustering(data, distance_method = "euclidean")

## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 5
agglomerative_clustering(data, distance_method = "manhattan")

## Cluster method   : single 
## Distance         : manhattan 
## Number of objects: 5
agglomerative_clustering(data, distance_method = "canberra")

## Cluster method   : single 
## Distance         : canberra 
## Number of objects: 5
agglomerative_clustering(data, distance_method = "chebyshev")

## Cluster method   : single 
## Distance         : maximum 
## Number of objects: 5