First we load the pcds
package:
pcds
is an R
package that stands for proximity catch digraphs (PCDs) and
provides construction and visualization of the PCDs,
and the spatial pattern tests (for inference on spatial interaction between classes or species)
based on the two graph invariants of the PCDs.
These invariants are the domination number and the arc density.
The package provides a set of functions for the construction and visualization of three PCD families,
namely
arc-slice PCDs (AS-PCDs),
proportional-edge PCDs (PE-PCDs), and
central similarity PCDs (CS-PCDs),
and for spatial inference based on two of these PCD families,
namely PE- and CS-PCDs.
Here, the spatial inference concerns testing class/species interaction for point pattern data,
usually in two dimensional space.
The spatial interaction patterns of interest are segregation and association.
Segregation is the pattern in which classes tend to repel each other
in the sense that points tend to be clustered around points from the same class
(forming same-class clusters),
while association is the pattern in which points from one class tend to cluster around points from another class
(forming mixed-class clusters).
The one-dimensional versions of the PCD functions are also provided
where AS-PCD is a special case of PE-PCDs or CS-PCDs.
The one-dimensional versions are currently used for testing uniformity of points,
instead of interaction between classes (although it is possible to use them for this purpose as well).
We only extend the PE-PCD construction and visualization to three dimensions in this package.
The vignette files for the pcds
package are written as sections or chapters of a larger main vignette file
and are organized as “VSk_Title” where “k” is the section number
and “Title” is the corresponding markdown file name,
starting with “VS0_Intro”.
It is recommended to read the vignettes in this order for more efficient and informed use,
however, they can be used in any order the users/readers prefer,
but some tools or concepts may have been introduced in an earlier section (mostly with references).
The goal of these vignette files or sections is to facilitate graph abstraction of spatial point data and
make it easier for users to adopt pcds
by providing a comprehensive overview of the package’s contents and
detailed examples and illustration of certain functions.
We begin with the
pcds
.Then we, illustrate
The discussion covers the structure of function arguments, the required input data formats, and the various output formats. The subsequent sections provide visualization of the proximity regions, and the associated PCDs, and computation of domination number and arc-density of the PCDs, and the computation of the large-sample tests based on these invariants.
For ease of exposition, we have grouped the package contents according their functionality and theme:
Because all of these groups contain many functions, we organize them into subgroups by purpose. Below, we display each group of functions in a table with one column per subgroup.
Utility | PCDs | Pattern Generation | S3 Methods |
---|---|---|---|
AuxDomination | ArcSliceFunctions | PatternGen | ClassFunctions |
AuxGeometry | PropEdge* |
||
AuxDelaunay C | entSim* |
||
AuxExtrema |
|
PropEdge* contains PropEdge1D, PropEdge2D, and PropEdge3D functions, whereas CentSim* contains CentSim1D and CentSim2D functions only.
ClassFunctions contain functions like summary
, print.summary
,
and plot
of the following object classes:
Lines
,TriLines
, Lines3D
, Planes
, Patterns
, Uniform
, Extrema
, and PCDs
.
Among these objects, Lines
, TriLines
, Lines3D
, and Planes
facilitate visualization of the various geometric structures,
proximity regions, and the corresponding digraphs,
while Patterns
is used for spatial point pattern generation (with Uniform
being a special case),
and PCDs
pertains to the actual PCDs (the number of arcs, visualization of the digraph, and so on).
In all the pcds
functions,
points are vectors, and data sets are either matrices or data frames.
We illustrate PCDs in a two-class setting, extension to multi-class setting can be done in a pair-wise fashion or one-vs-rest fashion for the classes. For two classes, \(\mathcal{X}\) and \(\mathcal{Y}\), of points, let \(\mathcal{X}\) be the class of interest (i.e. the target class) and \(\mathcal{Y}\) be the reference class (i.e. the non-target class) and \(\mathcal{X}_n\) and \(\mathcal{Y}_m\) be samples of size \(n\) and \(m\) from classes \(\mathcal{X}\) and \(\mathcal{Y}\), respectively. The proximity map \(N(\cdot): \Omega \rightarrow 2^{\Omega}\) associates with each point \(x \in \mathcal{X}\), a proximity region \(N(x) \subset \Omega\). Consider the data-random (or vertex-random) proximity catch digraph (PCD) \(D=(V,A)\) with vertex set \(V=\mathcal{X}\) and arc set \(A\) defined by \((u,v) \in A \iff \{u,v\}\subset \mathcal{X}\) and \(v \in N(u)\). The digraph \(D\) depends on the (joint) distribution of \(\mathcal{X}\) and on the map \(N(\cdot)\). The adjective proximity — for the digraph \(D\) and for the map \(N(\cdot)\) — comes from thinking of the region \(N(x)\) as representing those points in \(\Omega\) “close” to \(x\). The binary relation \(u \sim v\), which is defined as \(v \in N(u)\), is asymmetric, thus the adjacency of \(u\) and \(v\) is represented with directed edges or arcs which yield a digraph instead of a graph. See Chartrand, Lesniak, and Zhang (2010) and West (2001) for more on graphs and digraphs.
In the PCD approach the points correspond to observations from class \(\mathcal{X}\) and the proximity regions are defined to be (closed) regions (usually convex regions or simply triangles) based on class \(\mathcal{X}\) and \(\mathcal{Y}\) points and the proximity region for a class \(\mathcal{X}\) point, \(x\), gets larger as the distance between \(x\) and class \(\mathcal{Y}\) points increases.
We briefly define three proximity map families. Let \(\Omega = \mathbb{R}^d\) and let \(\mathcal{Y}_m=\left \{\mathsf{y}_1,\mathsf{y}_2,\ldots,\mathsf{y}_m \right\}\) be \(m\) points in general position in \(\mathbb{R}^d\) and \(T_i\) be the \(i^{th}\) Delaunay cell for \(i=1,2,\ldots,J_m\), where \(J_m\) is the number of Delaunay cells based on \(\mathcal{Y}_m\). Let \(\mathcal{X}_n\) be a set of iid random variables from distribution \(F\) in \(\mathbb{R}^d\) with support \(\mathcal{S}(F) \subseteq \mathcal{C}_H(\mathcal{Y}_m)\) where \(\mathcal{C}_H(\mathcal{Y}_m)\) stands for the convex hull of \(\mathcal{Y}_m\). For illustrative purposes, we focus on \(\mathbb{R}^2\) where a Delaunay tessellation is a triangulation, provided that no more than three \(\mathcal{Y}\) points are cocircular (i.e., lie on the same circle). Furthermore, for simplicity, let \(\mathcal{Y}_3=\{\mathsf{y}_1,\mathsf{y}_2,\mathsf{y}_3\}\) be three non-collinear points in \(\mathbb{R}^2\) and \(T(\mathcal{Y}_3)=T(\mathsf{y}_1,\mathsf{y}_2,\mathsf{y}_3)\) be the triangle with vertices \(\mathcal{Y}_3\). Let \(\mathcal{X}_n\) be a set of iid random variables from \(F\) with support \(\mathcal{S}(F) \subseteq T(\mathcal{Y}_3)\).
We define the arc-slice proximity region with \(M\)-vertex regions for a point \(x \in T(\mathcal{Y}_3)\) as follows; see also Figure 3.1. Using a center \(M\) of \(T(\mathcal{Y}_3)\), we partition \(T(\mathcal{Y}_3)\) into “vertex regions” \(R_V(\mathsf{y}_1)\), \(R_V(\mathsf{y}_2)\), and \(R_V(\mathsf{y}_3)\). If \(M\) is the circumcenter of \(T(\mathcal{Y}_3)\), we use perpendicular line segments from \(M\) to the opposite edges to form the vertex regions. If \(M\) is not the circumcenter but is in the interior of \(T(\mathcal{Y}_3)\), we use line segments from \(M\) to the opposite edges as extensions of the lines joining the vertices and \(M\) to form the vertex regions. For \(x \in T(\mathcal{Y}_3)\setminus \mathcal{Y}_3\), let \(v(x) \in \mathcal{Y}_3\) be the vertex in whose region \(x\) falls, so \(x \in R_V(v(x))\). If \(x\) falls on the boundary of two vertex regions, we assign \(v(x)\) arbitrarily to one of the adjacent regions. The arc-slice proximity region is \(N_{AS}(x):=\overline B(x,r(x)) \cap T(\mathcal{Y}_3)\) where \(\overline B(x,r(x))\) is the closed ball centered at \(x\) with radius \(r(x):=d(x,v(x))\). To make the dependence on \(M\) explicit, we also use the notation\(N_{AS}(\cdot,M)\). A natural choice for the radius is \(r(x):=\min_{\mathsf{y}\in \mathcal{Y}}d(x,\mathsf{y})\) which implicitly uses the \(CC\)-vertex regions, since \(x \in R_{CC}(\mathsf{y})\) iff \(\mathsf{y}=\text{arg min}_{u \in \mathcal{Y}}d(x,u)\). See Figure 3.1 for \(N_{AS}(x,M_{CC})\) with \(x \in R_{CC}(\mathsf{y}_2)\) and Ceyhan (2010) for more detail on AS proximity regions.
We define the proportional-edge proximity map with expansion parameter \(r \ge 1\) as follows; see also Figure 3.2. Using a center \(M\) of \(T(\mathcal{Y}_3)\), we partition \(T(\mathcal{Y}_3)\) into “\(M\)-vertex regions” as in Section 3.3. Let \(e(x)\) be the edge of \(T(\mathcal{Y}_3)\) opposite \(v(x)\), the vertex whose region contains \(x\), \(\ell(x)\) be the line parallel to \(e(x)\) through \(x\), and \(d(v(x),\ell(x))\) be the Euclidean distance from \(v(x)\) to \(\ell(x)\). For \(r \in [1,\infty)\), let \(\ell_r(x)\) be the line parallel to \(e(x)\) such that \(d(v(x),\ell_r(x)) = rd(v(x),\ell(x))\) and \(d(\ell(x),\ell_r(x)) < d(v(x),\ell_r(x))\). Let \(T_{PE}(x,r)\) be the triangle similar to and with the same orientation as \(T(\mathcal{Y}_3)\) having \(v(x)\) as a vertex and \(\ell_r(x)\) as the opposite edge. Then the proportional-edge proximity region \(N_{PE}(x,r)\) is defined to be \(T_{PE}(x,r) \cap T(\mathcal{Y}_3)\). To make the dependence on \(M\) explicit, we also use the notation\(N_{PE}(x,r,M)\). A natural choice for the center is the center of mass (CM) yielding the \(CM\)-vertex regions, which have the same area (equaling one-third of the area of \(T(\mathcal{Y}_3)\)). Notice that \(r \ge 1\) implies \(x \in N_{PE}(x,r)\). Note also that \(\lim_{r \rightarrow \infty} N_{PE}(x,r) = T(\mathcal{Y}_3)\) for all \(x \in T(\mathcal{Y}_3)\setminus \mathcal{Y}_3\), so we define \(N_{PE}(x,\infty) = T(\mathcal{Y}_3)\) for all such \(x\). For \(x \in \mathcal{Y}_3\), we define \(N_{PE}(x,r) = \{x\}\) for all \(r \in [1,\infty]\). See Ceyhan and Priebe (2005), Ceyhan, Priebe, and Wierman (2006), and Ceyhan (2014) for more detail.
We define the central similarity proximity map with expansion parameter \(\tau>0\) as follows; see also Figure 3.3. Let \(e_j\) be the edge opposite vertex \(\mathsf{y}_j\) for \(j=1,2,3\), and let “\(M\)-edge regions” \(R_E(e_1)\), \(R_E(e_2)\), \(R_E(e_3)\) partition \(T(\mathcal{Y}_3)\) using line segments from the center \(M\) in the interior of \(T(\mathcal{Y}_3)\) to the vertices. For \(x \in (T(\mathcal{Y}_3))^o\), let \(e(x)\) be the edge in whose region \(x\) falls; \(x \in R_E(e(x))\). If \(x\) falls on the boundary of two edge regions we assign \(e(x)\) arbitrarily. For \(\tau > 0\), the central similarity proximity region \(N_{CS}(x,\tau)\) is defined to be the triangle \(T_{CS}(x,\tau) \cap T(\mathcal{Y}_3)\) with the following properties:
For \(\tau \in (0,1]\), the triangle \(T_{CS}(x,\tau)\) has an edge \(e_\tau(x)\) parallel to \(e(x)\) such that \(d(x,e_\tau(x))=\tau\, d(x,e(x))\) and \(d(e_\tau(x),e(x)) \le d(x,e(x))\) and for \(\tau >1\), \(d(e_\tau(x),e(x)) < d(x,e_\tau(x))\) where \(d(x,e(x))\) is the Euclidean distance from \(x\) to \(e(x)\),
the triangle \(T_{CS}(x,\tau)\) has the same orientation as and is similar to \(T(\mathcal{Y}_3)\),
the point \(x\) is at the center of mass of \(T_{CS}(x,\tau)\).
Note that (i) implies that the expansion parameter is \(\tau\), (ii) implies “similarity”, and (iii) implies “central” in the name, (parameterized) central similarity proximity map. To make the dependence on \(M\) explicit, we also use the notation\(N_{CS}(x,\tau,M)\). A natural choice for the center is the center of mass (CM) yielding the \(CM\)-edge regions, which have the same area (equaling one-third of the area of \(T(\mathcal{Y}_3)\)). Notice that \(\tau>0\) implies that \(x \in N_{CS}(x,\tau)\) and, by construction, we have \(N_{CS}(x,\tau)\subseteq T(\mathcal{Y}_3)\) for all \(x \in T(\mathcal{Y}_3)\). For \(x \in \partial(T(\mathcal{Y}_3))\) and \(\tau \in (0,\infty]\), we define \(N_{CS}(x,\tau)=\{x\}\). For all \(x \in T(\mathcal{Y}_3)^o\) the edges \(e_\tau(x)\) and \(e(x)\) are coincident iff \(\tau=1\). Note also that \(\lim_{\tau \rightarrow \infty} N_{CS}(x,\tau) = T(\mathcal{Y}_3)\) for all \(x \in (T(\mathcal{Y}_3))^o\), so we define \(N_{CS}(x,\infty) = T(\mathcal{Y}_3)\) for all such \(x\). The central similarity proximity maps in Ceyhan and Priebe (2003) and Ceyhan, Priebe, and Marchette (2007) are \(N_{CS}(\cdot,\tau)\) with \(\tau=1\) and \(\tau \in (0,1]\), respectively, and in Ceyhan (2014) with \(\tau>1\).
The convex hull of the non-target class \(C_H(\mathcal{Y}_m)\) can be partitioned into Delaunay cells through the Delaunay tessellation of \(\mathcal{Y}_m \subset \mathbb{R}^d\). The Delaunay tessellation becomes a triangulation in \(\mathbb R^2\) which partitions \(C_H(\mathcal{Y}_m)\) into non-intersecting triangles. For \(\mathcal{Y}\) points in general position, the triangles in the Delaunay triangulation satisfy the property that the circumcircle of a triangle contain no \(\mathcal{Y}\) points except for the vertices of the triangle. In higher dimensions (i.e., \(d >2)\), Delaunay cells are \(d\)-simplices (for example, a tetrahedron in \(\mathbb{R}^3\)). Hence, the \(C_H(\mathcal{Y}_m)\) is the union of a set of disjoint \(d\)-simplices \(\{\mathfrak S_k\}_{k=1}^K\) where \(K\) is the number of \(d\)-simplices, or Delaunay cells. Each \(d\)-simplex has \(d+1\) non-co(hyper)planar vertices where none of the remaining points of \(\mathcal{Y}_m\) are in the interior of the circumsphere of the simplex (except for the vertices of the simplex which are points from \(\mathcal{Y}_m\)). Hence, simplices of the Delaunay tessellations are more likely to be acute (simplices will not have substantially large inner angles). Note that Delaunay tessellation is the dual of the Voronoi diagram of the points \(\mathcal{Y}_m\). A Voronoi diagram is a partitioning of \(\mathbb{R}^d\) into convex polytopes such that the points inside each polytope is closer to the point associated with the polytope than any other point in \(\mathcal{Y}\). Hence, a polytope \(V(y)\) associated with a point \(\mathsf{y}\in \mathcal{Y}_m\) is defined as \[V(y)=\left\{v \in \mathbb{R}^d: \lVert v-y \rVert \leq \lVert v-z \rVert \ \text{ for all } z \in \mathcal{Y}_m \setminus \{y\} \right\}.\]
Here, \(\lVert\cdot\rVert\) stands for the usual Euclidean norm. Observe that the Voronoi diagram is unique for a fixed set of points \(\mathcal{Y}_m\). A Delaunay graph is constructed by joining the pairs of points in \(\mathcal{Y}_m\) whose boundaries of Voronoi polytopes have nonempty intersections. The edges of the Delaunay graph constitute a partitioning of \(C_H(\mathcal{Y}_m)\) providing the Delaunay tessellation. By the uniqueness of the Voronoi diagram, the Delaunay tessellation is also unique (except for cases where \(d+1\) or more points lying on the same hypersphere). Run the below code for an illustration of Delaunay triangulation of 20 uniform \(\mathcal{Y}\) points in the unit square \((0,1) \times (0,1)\). More detail on Voronoi diagrams and Delaunay tessellations can be found in Okabe et al. (2000).
Arc Density: The arc density of a digraph \(D=(V,A)\) of order \(|V| = n\), denoted \(\rho(D)\), is defined as \[ \rho(D) = \frac{|A|}{n(n-1)} \] where \(|\cdot|\) stands for set cardinality (Janson, Łuczak, and Ruciński (2000)). Thus \(\rho(D)\) represents the ratio of the number of arcs in the digraph \(D\) to the number of arcs in the complete symmetric digraph of order \(n\), which has \(n(n-1)\) arcs.
If \(X_1,X_2,\ldots,X_n \stackrel{iid}{\sim} F\), then the relative density of the associated data-random PCD, denoted \(\rho(\mathcal{X}_n;h,N)\), is a \(U\)-statistic, \[\rho(\mathcal{X}_n;h,N) = \frac{1}{n(n-1)} \underset{i<j}{\sum\sum}h(X_i,X_j;N) \]
where \[\begin{aligned} h(X_i,X_j;N)&= \mathbf{I}\{(X_i,X_j) \in A\}+ \mathbf{I}\{(X_j,X_i) \in A\} \nonumber \\ &= \mathbf{I}\{X_j \in N(X_i)\}+ \mathbf{I}\{X_i \in N(X_j)\} \end{aligned}\]with \(\mathbf{I}\{\cdot\}\) being the indicator function. We denote \(h(X_i,X_j;N)\) as \(h_{ij}\) for brevity of notation. Since the digraph is asymmetric, \(h_{ij}\) is defined as the number of arcs in \(D\) between vertices \(X_i\) and \(X_j\), in order to produce a symmetric kernel with finite variance (Lehmann (2004)).
See Ceyhan, Priebe, and Wierman (2006), Ceyhan, Priebe, and Marchette (2007), and Ceyhan (2014) for arc density of PE-PCDs and its use for spatial interaction for 2D data; and Ceyhan (2012) and Ceyhan (2016) for arc density of PE-PCDs for 1D data and its use for testing uniformity.
Domination Number: In a digraph \(D=(V,A)\), a vertex \(v \in V\) dominates itself and all vertices of the form \(\{u: (v,u) \in A\}\). A dominating set \(S_D\) for the digraph \(D\) is a subset of \(V\) such that each vertex \(v \in V\) is dominated by a vertex in \(S_D\). A minimum dominating set \(S^*_{D}\) is a dominating set of minimum cardinality and the domination number \(\gamma(D)\) is defined as \(\gamma(D):=|S^*_{D}|\) (see, e.g., Lee (1998)). If a minimum dominating set is of size one, we call it a dominating point. Note that for \(|V|=n>0\), \(1 \le \gamma(D) \le n\), since \(V\) itself is always a dominating set. See Ceyhan and Priebe (2005) and Ceyhan (2011) for the domination number and its use for testing spatial interaction patterns in 2D data and Ceyhan (2020) for testing uniformity of 1D data.
Geometry Invariance of Arc Density and Domination Number: Let \(\mathcal{U}(T\left(\mathcal{Y}_3 \right))\) be the uniform distribution on \(T\left(\mathcal{Y}_3 \right)\). If \(F=\mathcal{U}(T(\mathcal{Y}_3))\), a composition of translation, rotation, reflections, and scaling, denoted \(\phi_b(T(\mathcal{Y}_3))\), will take any given triangle \(T(\mathcal{Y}_3)\) to the standard basic triangle \(T_b=T((0,0),(1,0),(c_1,c_2))\) with \(0 < c_1 \le 1/2\), \(c_2>0\), and \((1-c_1)^2+c_2^2 \le 1\), preserving uniformity. That is, if \(X \sim \mathcal{U}(T(\mathcal{Y}_3))\) is transformed in the same manner to, say \(X'=\phi_b(X)\), then we have \(X' \sim \mathcal{U}(T_b)\). In fact this will hold for data from any distribution \(F\) up to scale. Furthermore, \(T_b\) can be transformed to the standard equilateral triangle \(T_e=T(A,B,C)\) with vertices \(A=(0,0)\), \(B=(1,0)\), and \(C=(1/2,\sqrt{3}/2)\) by a transformation \(\phi_e\) and \(\phi_e(X') \sim \mathcal{U}(T_e)\). That is uniform points in any triangle can be mapped to points uniformly distributed in the standard equilateral triangle by a composition of \(\phi_b\) and \(\phi_e\) (in the form \(\phi_e \circ \phi_b\)).
The distribution of the domination number and arc density for AS-PCDs do not change for uniform data in a triangle \(T(\mathcal{Y}_3)\) when the data points are transformed to (uniform) points in the standard basic triangle \(T_b\) (using \(\phi_b\)) but not when \(\phi_e\) is applied to the uniform data in \(T_b\). So, one can focus on \(T_b\) for computations and derivations regarding the domination number and arc density of AS-PCD for uniform data. On the other hand, the distribution of the domination number and arc density for PE- and CS-PCDs do not change for uniform data in a triangle \(T(\mathcal{Y}_3)\) when the data points are transformed to (uniform) points in the standard equilateral triangle \(T_e\) (using \(\phi_e \circ \phi_b\)). That is, distribution of these graph quantities are geometry invariant for uniform data in triangles. So, one can focus on \(T_e\) for computations and derivations regarding the PE- and CS-PCD quantities for uniform data. A similar geometry invariance holds in 3D setting for uniform data in any tetrahedron being transformed to the standard regular tetrahedron for PE- and CS-PCDs. Therefore, the pcds package has functions customized only for one simplex (i.e., one interval in \(\mathbb R\), one triangle in \(\mathbb R^2\), and one tetrahedron in \(\mathbb R^3\)). However, we don’t cover these functions here, as (i) they serve as utility functions in the more realistic case of multiple simplices (e.g., multiple triangles occur when there are \(m \ge 4\) \(\mathcal{Y}\) points) and (ii) they are mainly used for simulations or verifications or illustrations when \(\mathcal{X}\) points are restricted to one simplex. See the vignette files “VS2.1 - Illustration of PCDs in One Triangle”, “VS2.2 - Illustration of PCDs in One Interval”, and “VS2.3 - Illustration of PCDs in One Tetrahedron”.