Package gmfd (Generalized Mahalanobis Functional Distance) is an R package that gathers statistical methods for both inference and clustering of functional data, based on a generalization of the Mahalanobis distance. The package supports both univariate and multivariate functional data.
The function gmfd_simulate( size, mean, covariance, rho, theta )
simulates a univariate sample of functional data, where size
is the number of curves to generate and mean
is the center of the distribution. The user can either supply the argument covariance,
the covariance of the data, or, alternatively, the sequences rho and theta of eigenvalues and eigenfunctions of the covariance operator.
library( gmfd )
# Define the parameters
n <- 50
P <- 100
K <- 150
# Grid of the functional dataset
t <- seq( 0, 1, length.out = P )
# Define the means and the parameters to use in the simulation
m1 <- t^2 * ( 1 - t )
rho <- rep( 0, K )
theta <- matrix( 0, K, P )
for ( k in 1:K ) {
  rho[k] <- 1 / ( k + 1 )^2
  if ( k %% 2 == 0 )
    theta[k, ] <- sqrt( 2 ) * sin( k * pi * t )
  else if ( k %% 2 != 0 && k != 1 )
    theta[k, ] <- sqrt( 2 ) * cos( ( k - 1 ) * pi * t )
  else
    theta[k, ] <- rep( 1, P )
}
# Simulate the functional data
X <- gmfd_simulate( size = n, mean = m1, rho = rho, theta = theta )
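The alternative parameterization mentioned above, passing the covariance directly instead of its eigenvalues and eigenfunctions, can be sketched as follows. This is an illustrative sketch: it assumes that the covariance argument accepts a P x P covariance matrix evaluated on the grid, built here from the same rho and theta used before.

# Build a P x P covariance matrix from the chosen eigenvalues and
# eigenfunctions (assumed usage of the covariance argument)
C <- matrix( 0, P, P )
for ( k in 1:K ) {
  C <- C + rho[k] * ( theta[k, ] %o% theta[k, ] )
}
X2 <- gmfd_simulate( size = n, mean = m1, covariance = C )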
S3 class funData for functional data
For ease of manipulation and visualization, an S3
class for both univariate and multivariate functional data has been created. A funData
object represents a functional dataset and is created by the function funData( grid, F ),
where grid
is the grid of evenly spaced points over which the functional data are defined and F
is a matrix (for univariate functional data) or a list of matrices (for multivariate functional data). A funData object can then be plotted with the function plot.funData,
which accepts all the usual customisable graphical parameters.
# Create a functional data object
FD1 <- funData( t, X )
# Plot the funData object
plot( FD1, col = "blue" , xlab = "grid", ylab = "data")
Here \(P\) is the length of the grid of the independent variable, while \(\hat{d}^2_{M,k}(\cdot,\cdot)\) and \(\hat{h}(p)\) denote the estimates of the squared contribution to the generalized Mahalanobis distance along the \(k\)-th component and of the regularizing function, respectively. The function funDist( FD1, FD2, metric, p, lambda, phi )
computes the distance between two functional data FD1
and FD2
using the chosen metric;
the last three parameters are used only for the generalized Mahalanobis distance.
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of the data
eig <- eigen( cov( FD1$data[[1]] ) )
lambda <- eig$values
phi <- eig$vectors
# Extract two curves from the samples to compute the distance between them
x <- funData( t, FD1$data[[1]][1, ] )
y <- funData( t, FD1$data[[1]][2, ] )
distance <- funDist( x, y, metric = "mahalanobis", p = 10^5, lambda, phi )
distance
## [1] 2.306482
It is also possible to compute the dissimilarity matrix of a given sample by using the function gmfd_diss( FD, metric, p ).
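As a brief illustration, the pairwise dissimilarity matrix of the sample FD1 simulated above can be obtained under the generalized Mahalanobis metric as follows; the resulting object is assumed to be an n x n symmetric matrix with a zero diagonal.

# Dissimilarity matrix of the sample FD1 under the generalized
# Mahalanobis metric (n x n, symmetric, zero diagonal)
D <- gmfd_diss( FD1, metric = "mahalanobis", p = 10^5 )
dim( D )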
Here \(m_1\) and \(m_2\) are the true means of the two simulated samples. We can test the equality of the means of the two samples using the function gmfd_test( FD1, FD2, conf.level, stat_test, p ),
where FD1
and FD2
are the two samples, conf.level
is the confidence level of the test, stat_test
is a string selecting the test statistic, and p
is the parameter of the regularizing function. The function returns the value of the test statistic, the corresponding quantile, and the p-value of the test.
# Simulate another functional sample
s <- 0
for ( k in 4:K ) {
  s <- s + sqrt( rho[k] ) * theta[k, ]
}
m2 <- m1 + s
Y <- gmfd_simulate( n, m2, rho = rho, theta = theta )
FD2 <- funData( t, Y )
test_output <- gmfd_test(FD1, FD2, conf.level = 0.95, stat_test = "mahalanobis", p = 10^5)
test_output
## $T0
## [1] 74.00768
##
## $quantile
## [1] 12.52993
##
## $pval
## [1] 0
The functional \(k\)-means clustering algorithm is an iterative procedure, alternating a step of cluster assignment, where all the curves are assigned to a cluster, and a step of centroid calculation, where a relevant functional representative (the centroid) of each cluster is identified. More precisely, the algorithm is initialized by fixing the number \(k\) of clusters and by randomly selecting a set of \(k\) initial centroids \(\{\boldsymbol{\chi}_1^{(0)}(t), \ldots , \boldsymbol{\chi}_k^{(0)}(t)\}\) among the curves of the dataset. Given this initial choice, the algorithm iteratively repeats the two basic steps mentioned above: at the \(m\)-th iteration, \(m \geq 1\), each curve is assigned to the cluster with the nearest centroid, and the centroids are then recomputed from the updated clusters, until the assignments no longer change between consecutive iterations.
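The two alternating steps can be sketched in a few lines of R. This is a simplified illustration, not the package's implementation: it works on curves discretized over a common grid and uses the plain squared L2 distance in place of the generalized Mahalanobis distance.

# Simplified functional k-means sketch (squared L2 distance on a common grid).
# X: n x P matrix of discretized curves; n.cl: number of clusters.
simple_fkmeans <- function( X, n.cl, max.iter = 100 ) {
  n <- nrow( X )
  # Initialization: n.cl centroids chosen at random among the curves
  centroids <- X[sample( n, n.cl ), , drop = FALSE]
  cluster <- rep( 0, n )
  for ( m in 1:max.iter ) {
    # Assignment step: each curve goes to the nearest centroid
    d2 <- sapply( 1:n.cl, function( j )
      rowSums( sweep( X, 2, centroids[j, ] )^2 ) )
    new.cluster <- max.col( -d2 )
    # Stop when the assignments no longer change
    if ( all( new.cluster == cluster ) ) break
    cluster <- new.cluster
    # Centroid step: pointwise mean of the curves in each cluster
    for ( j in 1:n.cl )
      centroids[j, ] <- colMeans( X[cluster == j, , drop = FALSE] )
  }
  list( cluster = cluster, centroids = centroids )
}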
We apply the procedure to the merged samples \(\mathbf{X}_1(t), \ldots, \mathbf{X}_{n_1}(t)\) and \(\mathbf{Y}_1(t), \ldots, \mathbf{Y}_{n_2}(t)\) previously simulated, using the function gmfd_kmeans( FD, n.cl, metric, p ),
where n.cl
is the number of clusters. It returns a vector of cluster labels and a vector (or a list of vectors) of centroids, together with a plot of the clusters along with their empirical means.
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of all merged data
eig <- eigen( cov( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )
lambda <- eig$values
phi <- eig$vectors
# Functional data sample containing the merged samples
FD <- funData( t, rbind( X, Y ) )
kmeans_output <- gmfd_kmeans( FD, n.cl = 2, metric = "mahalanobis", p = 10^5 )