Title: | Tandem Clustering with Invariant Coordinate Selection |
---|---|
Description: | Implementation of tandem clustering with invariant coordinate selection with different scatter matrices and several choices for the selection of components as described in Alfons, A., Archimbaud, A., Nordhausen, K. and Ruiz-Gazen, A. (2022) <arXiv:2212.06108>. |
Authors: | Aurore Archimbaud [aut, cre] , Andreas Alfons [aut] , Klaus Nordhausen [aut] , Anne Ruiz-Gazen [aut] |
Maintainer: | Aurore Archimbaud <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.0 |
Built: | 2024-11-05 05:44:45 UTC |
Source: | https://github.com/auroreaa/icsclust |
Implementation of tandem clustering with invariant coordinate selection with different scatter matrices and several choices for the selection of components as described in Alfons, A., Archimbaud, A., Nordhausen, K. and Ruiz-Gazen, A. (2022) <arXiv:2212.06108>.
The DESCRIPTION file:
Package: | ICSClust |
Type: | Package |
Title: | Tandem Clustering with Invariant Coordinate Selection |
Version: | 0.1.0 |
Date: | 2023-09-20 |
Description: | Implementation of tandem clustering with invariant coordinate selection with different scatter matrices and several choices for the selection of components as described in Alfons, A., Archimbaud, A., Nordhausen, K. and Ruiz-Gazen, A. (2022) <arXiv:2212.06108>. |
License: | GPL (>= 3) |
Encoding: | UTF-8 |
Depends: | ICS (>= 1.4-0), ggplot2 |
Imports: | cluster, fpc, GGally, heplots, mclust, moments, mvtnorm, otrimle, RcppRoll, rrcov, scales, tclust |
LinkingTo: | Rcpp, RcppArmadillo |
Suggests: | testthat (>= 3.0.0) |
URL: | https://github.com/AuroreAA/ICSClust |
BugReports: | https://github.com/AuroreAA/ICSClust/issues |
Authors@R: | c(person("Aurore", "Archimbaud", email = "[email protected]", role = c("aut", "cre"), comment = c(ORCID = "0000-0002-6511-9091")), person("Andreas", "Alfons", email = "[email protected]", role = "aut", comment = c(ORCID = "0000-0002-2513-3788")), person("Klaus", "Nordhausen", email = "[email protected]", role = "aut", comment = c(ORCID = "0000-0002-3758-8501")), person("Anne", "Ruiz-Gazen", email = "[email protected]", role = "aut", comment = c(ORCID = "0000-0001-8970-8061"))) |
Author: | Aurore Archimbaud [aut, cre] (<https://orcid.org/0000-0002-6511-9091>), Andreas Alfons [aut] (<https://orcid.org/0000-0002-2513-3788>), Klaus Nordhausen [aut] (<https://orcid.org/0000-0002-3758-8501>), Anne Ruiz-Gazen [aut] (<https://orcid.org/0000-0001-8970-8061>) |
Maintainer: | Aurore Archimbaud <[email protected]> |
Roxygen: | list(markdown = TRUE) |
RoxygenNote: | 7.2.3 |
Config/testthat/edition: | 3 |
Config/pak/sysreqs: | cmake libfreetype6-dev libglu1-mesa-dev make libicu-dev libpng-dev libgl1-mesa-dev libssl-dev zlib1g-dev |
Repository: | https://auroreaa.r-universe.dev |
RemoteUrl: | https://github.com/auroreaa/icsclust |
RemoteRef: | HEAD |
RemoteSha: | 86210a8b7e4c7de381c05e2de9d3664033196eaf |
Index of help topics:
ICSClust | Tandem clustering with ICS |
ICSClust-package | Tandem Clustering with Invariant Coordinate Selection |
ICS_lcov | Local Shape Scatter Estimates for ICS |
ICS_mcd | MCD location and Scatter Estimates for ICS |
ICS_mlc | Cauchy location and Scatter Estimates for ICS |
ICS_tcov | Pairwise one-step M-estimate of scatter for ICS |
ICS_ucov | Simple robust estimates of scatter for ICS |
component_plot | Scatterplot Matrix with densities on the diagonal |
discriminatory_crit | Selection of ICS components based on discriminatory power |
kmeans_clust | k-means clustering |
mclust_clust | Model-Based Clustering |
med_crit | Selection of Invariant components using the med criterion |
mixture_sim | Simulation of a mixture of Gaussian distributions |
normal_crit | Selection of Non-normal Invariant Components Using Marginal Normality Tests |
pam_clust | Partitioning Around Medoids clustering |
plot.ICSClust | Scatterplot Matrix with densities on the diagonal |
print.ICSClust_summary | Print of an 'ICSClust_summary' object |
rimle_clust | Robust Improper Maximum Likelihood Clustering |
runif_outside_range | Uniform distribution outside a given range |
select_plot | Plot of the Generalized Kurtosis Values of the ICS Transformation |
summary.ICSClust | Summary of an 'ICSClust' object |
tcov | Pairwise one-step M-estimate of scatter |
tkmeans_clust | Trimmed k-means clustering |
ucov | Simple robust estimates of scatter |
var_crit | Selection of Invariant components using the var criterion |
Aurore Archimbaud [aut, cre] (<https://orcid.org/0000-0002-6511-9091>), Andreas Alfons [aut] (<https://orcid.org/0000-0002-2513-3788>), Klaus Nordhausen [aut] (<https://orcid.org/0000-0002-3758-8501>), Anne Ruiz-Gazen [aut] (<https://orcid.org/0000-0001-8970-8061>)
Maintainer: Aurore Archimbaud <[email protected]>
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
Produces a gg-scatterplot matrix of the variables of a given dataframe or an invariant coordinate system obtained via an ICS transformation with densities on the diagonal for each cluster.
component_plot(
  object,
  select = TRUE,
  clusters = NULL,
  text_size_factor = 8/6.5,
  colors = NULL
)
object |
a dataframe or |
select |
a vector of indexes of variables to plot. If |
clusters |
a vector indicating the clusters of the data to color the
plot. By default |
text_size_factor |
a numeric factor for controlling the |
colors |
a vector of colors to use. One color for each cluster. |
An object of class "ggmatrix" (see GGally::ggpairs()).
Andreas Alfons and Aurore Archimbaud
X <- iris[,1:4]
component_plot(X)
out <- ICS(X)
component_plot(out, select = c(1,4))
Identifies the invariant coordinates associated with the highest discriminatory power (by default "eta2").
discriminatory_crit(object, ...)

## S3 method for class 'ICS'
discriminatory_crit(
  object,
  clusters,
  method = "eta2",
  nb_select = NULL,
  select_only = FALSE,
  ...
)

## Default S3 method:
discriminatory_crit(
  object,
  clusters,
  method = "eta2",
  nb_select = NULL,
  select_only = FALSE,
  gen_kurtosis = NULL,
  ...
)
object |
dataframe or object of class |
... |
additional arguments are currently ignored. |
clusters |
a vector of the same length as the number of observations, indicating the true clusters. It is used to compute the discriminatory power based on it. |
method |
the name of the discriminatory power.
Only |
nb_select |
the exact number of components to select.
By default it is set to |
select_only |
boolean. If |
gen_kurtosis |
vector of generalized kurtosis values. |
The discriminatory power $\eta^2 = 1 - \Lambda$, where $\Lambda$ denotes Wilks' lambda, is evaluated for each combination of the first and/or last nb_select components. The combination achieving the highest discriminatory power is selected.
More specifically, we compute $\eta^2 = 1 - \det(E) / \det(T)$, where $E$ is the within-group sum of squares and cross-products matrix and $T$ is the total sum of squares and cross-products matrix.
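As a rough illustration of a discriminatory power based on Wilks' lambda, the base R sketch below computes one minus the ratio of the determinants of the within-group and total sum of squares and cross-products matrices on the raw iris variables. It is not the package's implementation: the variable names (`E_mat`, `T_mat`, `eta2`) are chosen here for illustration, and the sketch works on the original variables rather than on invariant coordinates.

```r
# Data and known groups
X <- as.matrix(iris[, 1:4])
clusters <- iris$Species

# Total sum of squares and cross-products matrix (centered on the overall mean)
T_mat <- crossprod(scale(X, center = TRUE, scale = FALSE))

# Within-group sum of squares and cross-products matrix
# (each group centered on its own mean, then summed over groups)
E_mat <- matrix(0, ncol(X), ncol(X))
for (g in levels(clusters)) {
  Xg <- X[clusters == g, , drop = FALSE]
  E_mat <- E_mat + crossprod(scale(Xg, center = TRUE, scale = FALSE))
}

# Wilks' lambda and the resulting discriminatory power
wilks_lambda <- det(E_mat) / det(T_mat)
eta2 <- 1 - wilks_lambda
eta2
```

Values close to 1 indicate that the groups are well separated in the chosen set of variables.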
If select_only is TRUE, a vector of the names of the invariant components or variables to select is returned. If FALSE, an object of class "ICS_crit" is returned with the following objects:
crit: the name of the criterion, "discriminatory".
method: the name of the discriminatory power.
nb_select: the number of components to select.
select: the names of the invariant components or variables to select.
power_combinations: the discriminatory values for each of the considered combinations of nb_select components.
gen_kurtosis: the vector of generalized kurtosis values in case of an ICS object.
Aurore Archimbaud and Anne Ruiz-Gazen
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
normal_crit(), med_crit(), var_crit().
X <- iris[,-5]
out <- ICS(X)
discriminatory_crit(out, clusters = iris[,5], select_only = FALSE)
It is a wrapper for the local shape estimator of scatter as computed by fpc::localshape().
ICS_lcov(x, mscatter = "cov", proportion = 0.1, ...)
x |
a numeric matrix or data frame. |
mscatter |
|
proportion |
proportion of points to be considered as neighbourhood. |
... |
potential further arguments passed to |
An object of class "ICS_scatter" with the following components:
location |
this is NULL as the estimator does not use a location estimate. |
scatter |
a numeric matrix giving the estimate of the scatter matrix. |
label |
a character string providing a label for the scatter matrix. |
Andreas Alfons and Aurore Archimbaud
It is a wrapper for the (reweighted) MCD estimators of location and scatter as computed by rrcov::CovMcd().
ICS_mcd_raw(x, location = FALSE, nsamp = "deterministic", alpha = 0.5, ...)
ICS_mcd_rwt(x, location = FALSE, nsamp = "deterministic", alpha = 0.5, ...)
x |
a numeric matrix or data frame. |
location |
a logical indicating whether to include the MCD-estimate of
location (defaults to |
nsamp |
number of subsets used for initial estimates or |
alpha |
numeric parameter controlling the size of the subsets over
which the determinant is minimized as in |
... |
potential further arguments passed to |
ICS_mcd_raw(): computes the raw MCD estimates.
ICS_mcd_rwt(): computes the reweighted MCD estimates.
An object of class "ICS_scatter" with the following components:
location |
if requested, a numeric vector giving the location estimate. |
scatter |
a numeric matrix giving the estimate of the scatter matrix. |
label |
a character string providing a label for the scatter matrix. |
Andreas Alfons and Aurore Archimbaud
It is a wrapper for the Cauchy estimator of location and scatter for a multivariate t-distribution, as computed by ICS::tM().
ICS_mlc(x, location = FALSE, ...)
x |
a numeric matrix or data frame. |
location |
a logical indicating whether to include the M-estimate of
location (defaults to |
... |
potential further arguments passed to |
An object of class "ICS_scatter" with the following components:
location |
if requested, a numeric vector giving the location estimate. |
scatter |
a numeric matrix giving the estimate of the scatter matrix. |
label |
a character string providing a label for the scatter matrix. |
Andreas Alfons and Aurore Archimbaud
Wrapper function for the pairwise one-step M-estimator of scatter with weights based on pairwise Mahalanobis distances, as computed by tcov(). Note that this estimator is based on pairwise differences and therefore no location estimate is returned.
ICS_tcov(x, beta = 2)
x |
a numeric matrix or data frame. |
beta |
a positive numeric value specifying the tuning parameter of the
pairwise one-step M-estimator (default to 2), see |
An object of class "ICS_scatter" with the following components:
location |
this is |
scatter |
a numeric matrix giving the estimate of the scatter matrix. |
label |
a character string providing a label for the scatter matrix. |
Andreas Alfons
ICS()
Wrapper functions for the one-step M-estimator of scatter with weights based on Mahalanobis distances as computed by scov(), or the simple related estimator that is based on a transformation as computed by ucov().
ICS_scov(x, location = TRUE, beta = 0.2)
ICS_ucov(x, location = TRUE, beta = 0.2)
x |
a numeric matrix or data frame. |
location |
a logical indicating whether to include the sample
mean as location estimate (defaults to |
beta |
a positive numeric value specifying the tuning parameter of the
estimator (default to 0.2), see |
An object of class "ICS_scatter" with the following components:
location |
if requested, a numeric vector giving the location estimate. |
scatter |
a numeric matrix giving the estimate of the scatter matrix. |
label |
a character string providing a label for the scatter matrix. |
Andreas Alfons
ICS()
Sequential clustering approach: (i) dimension reduction through the Invariant Coordinate Selection method using the ICS function and (ii) clustering of the transformed data.
ICSClust(
  X,
  nb_select = NULL,
  nb_clusters = NULL,
  ICS_args = list(),
  criterion = c("med_crit", "normal_crit", "var_crit", "discriminatory_crit"),
  ICS_crit_args = list(),
  method = c("kmeans_clust", "tkmeans_clust", "pam_clust", "mclust_clust",
    "rmclust_clust", "rimle_clust"),
  clustering_args = list(),
  clusters = NULL
)
X |
a numeric matrix or data frame containing the data. |
nb_select |
the number of components to select.
It is used only in case |
nb_clusters |
the number of clusters searched for. |
ICS_args |
list of |
criterion |
criterion to automatically decide which invariant components
to keep. Possible values are |
ICS_crit_args |
list of arguments passed to |
method |
clustering method to perform. Currently implemented wrapper
functions are |
clustering_args |
list of |
clusters |
a vector indicating the true clusters of the data. By default,
it is |
Tandem clustering with ICS is a sequential method:
(i) ICS is performed.
(ii) Only a subset of the first and/or the last few components is selected based on a criterion.
(iii) The clustering method is performed only on the subspace of the selected components.
Wrappers for several different clustering methods are provided. Users can, however, also write wrappers for other clustering methods.
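To give an idea of what a user-written clustering wrapper might look like, here is a minimal sketch that mirrors the documented interface of kmeans_clust() (arguments `X`, `k`, `clusters_only`, and a return value with `clust_method` and `clusters`). The function `hclust_clust()` is hypothetical and not part of ICSClust, and the name of the third list element is an assumption. The sketch calls the wrapper directly; how ICSClust() accepts a custom wrapper is not shown here.

```r
# Hypothetical wrapper around hierarchical clustering (base R 'stats'),
# written to follow the same conventions as the wrappers documented here.
hclust_clust <- function(X, k, clusters_only = FALSE, ...) {
  # Fit the hierarchical clustering on Euclidean distances
  fit <- stats::hclust(stats::dist(X), ...)
  # Cut the tree into k clusters: integer labels from 1 to k
  clusters <- unname(stats::cutree(fit, k = k))
  if (clusters_only) {
    return(clusters)
  }
  # Otherwise return the method name, the partition and the fitted object
  list(clust_method = "hclust", clusters = clusters, hclust_out = fit)
}

# Direct use of the wrapper on the iris measurements
res <- hclust_clust(iris[, 1:4], k = 3)
table(res$clusters)
```

Keeping the same argument names and return structure as the provided wrappers makes it easier to swap clustering methods in the last step of the tandem approach.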
An object of class "ICSClust" with the following components:
ICS_out: an object of class "ICS". See ICS.
select: a vector of the names of the selected invariant coordinates.
clusters: a vector of the new partition of the data, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates outlying observations.
summary() and plot() methods are available.
Aurore Archimbaud
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
med_crit(), normal_crit(), var_crit(), ICS, discriminatory_crit(), kmeans_clust(), tkmeans_clust(), pam_clust(), rimle_clust(), mclust_clust(), summary() and plot() methods
X <- iris[,1:4]

# indicating the number of components to retain for the dimension reduction
# step as well as the number of clusters searched for.
out <- ICSClust(X, nb_select = 2, nb_clusters = 3)
summary(out)
plot(out)

# changing the scatter pair to consider in ICS
out <- ICSClust(X, nb_select = 1, nb_clusters = 3,
                ICS_args = list(S1 = ICS_mcd_raw, S2 = ICS_cov,
                                S1_args = list(alpha = 0.5)))
summary(out)
plot(out)

# changing the criterion for choosing the invariant coordinates
out <- ICSClust(X, nb_clusters = 3, criterion = "normal_crit",
                ICS_crit_args = list(level = 0.1, test = "anscombe.test",
                                     max_select = NULL))
summary(out)
plot(out)

# changing the clustering method
out <- ICSClust(X, nb_clusters = 3, method = "tkmeans_clust",
                clustering_args = list(alpha = 0.1))
summary(out)
plot(out)
Wrapper for performing k-means clustering from stats::kmeans().
kmeans_clust(X, k, clusters_only = FALSE, iter.max = 100, nstart = 20, ...)
X |
a numeric matrix or data frame of the data. It corresponds to the
argument |
k |
the number of clusters searched for. It corresponds to the argument
|
clusters_only |
boolean. If |
iter.max |
the maximum number of iterations allowed. |
nstart |
if |
... |
other arguments to pass to the |
If clusters_only is TRUE, a vector of the new partition of the data is returned, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated.
Otherwise a list is returned with the following components:
clust_method |
the name of the clustering method, i.e. "kmeans". |
clusters |
the vector of the new partition of the data, i.e. a vector of
integers (from |
... |
an object of class |
.
Aurore Archimbaud
kmeans_clust(iris[,1:4], k = 3, clusters_only = TRUE)
Wrapper for performing Model-Based Clustering from mclust::Mclust() allowing noise or not.
mclust_clust(X, k, clusters_only = FALSE, ...)
rmclust_clust(X, k, clusters_only = FALSE, ...)
X |
a numeric matrix or data frame of the data. It corresponds to the
argument |
k |
the number of clusters searched for. It corresponds to the argument
|
clusters_only |
boolean. If |
... |
other arguments to pass to |
mclust_clust(): does not allow noise.
rmclust_clust(): allows noise.
If clusters_only is TRUE, a vector of the new partition of the data is returned, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates trimmed observations.
Otherwise a list is returned with the following components:
clust_method |
the name of the clustering method, i.e "rimle". |
clusters |
the vector of the new partition of the data, i.e a vector of
integers (from |
... |
an object of class " |
Aurore Archimbaud
mclust_clust(iris[,1:4], k = 3, clusters_only = TRUE)
Identifies as interesting invariant coordinates whose generalized eigenvalues are the furthermost away from the median of all generalized eigenvalues.
med_crit(object, ...)

## S3 method for class 'ICS'
med_crit(object, nb_select = NULL, select_only = FALSE, ...)

## Default S3 method:
med_crit(object, nb_select = NULL, select_only = FALSE, ...)
object |
object of class |
... |
additional arguments are currently ignored. |
nb_select |
the exact number of components to select. By default it is set to
|
select_only |
boolean. If |
If more than half of the components are "uninteresting" and have the same generalized eigenvalue then the median of all generalized eigenvalues corresponds to the uninteresting component generalized eigenvalue. The components of interest are the ones whose generalized eigenvalues differ the most from the median. The motivation of this criterion depends therefore on the assumption that at least half of the components have equal generalized eigenvalues.
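The idea of keeping the components whose generalized eigenvalues lie furthest from the median can be sketched in a few lines of base R. The generalized kurtosis values below are made up for illustration, and the code is only a sketch of the selection rule, not the package's implementation (in particular, tie handling is not addressed).

```r
# Made-up generalized kurtosis values for five invariant components
gen_kurtosis <- c(IC.1 = 1.8, IC.2 = 1.1, IC.3 = 1.0, IC.4 = 1.0, IC.5 = 0.6)
nb_select <- 2

# Median of all values and absolute distance of each value to it
med_gen_kurtosis <- median(gen_kurtosis)
gen_kurtosis_diff_med <- abs(gen_kurtosis - med_gen_kurtosis)

# Keep the nb_select components that deviate the most from the median
select <- names(sort(gen_kurtosis_diff_med, decreasing = TRUE))[seq_len(nb_select)]
select
```

Here the median equals the value shared by the "uninteresting" components, so one component from each end of the spectrum is retained.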
If select_only is TRUE, a vector of the names of the invariant components or variables to select is returned. If FALSE, an object of class "ICS_crit" is returned with the following objects:
crit: the name of the criterion, "med".
nb_select: the number of components to select.
gen_kurtosis: the vector of generalized kurtosis values.
med_gen_kurtosis: the median of the generalized kurtosis values.
gen_kurtosis_diff_med: the absolute differences between the generalized kurtosis values and the median.
select: the names of the invariant components or variables to select.
Andreas Alfons, Aurore Archimbaud and Klaus Nordhausen
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
normal_crit(), var_crit(), discriminatory_crit().
X <- iris[,-5]
out <- ICS(X)
med_crit(out, nb_select = 2, select_only = FALSE)
Simulation of a data frame according to a mixture of $q$ Gaussian distributions with $q \geq 2$, different location parameters $\mu_1, \dots, \mu_q$, and the identity matrix as the covariance matrix.
mixture_sim(pct_clusters = c(0.5, 0.5), n = 500, p = 10, delta = 10)
pct_clusters |
a vector of marginal probabilities for each group, i.e mixture weights. Default is two balanced clusters. |
n |
integer. The number of observations. |
p |
integer. The number of variables. |
delta |
integer. The location shift. |
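Let $X$ be a $p$-variate real random vector distributed according to a mixture of $q$ Gaussian distributions with $q \geq 2$, different location parameters $\mu_1, \dots, \mu_q$, and the same positive definite covariance matrix $I_p$:
$$X \sim \sum_{h=1}^{q} \epsilon_h \, \mathcal{N}(\mu_h, I_p),$$
where $\epsilon_h$ are mixture weights with $\epsilon_1 + \dots + \epsilon_q = 1$, $\mu_1 = 0$, and $\mu_{h+1} = \delta e_h$ with $h = 1, \dots, q - 1$.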
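A Gaussian mixture with identity covariance and shifted cluster centers can be simulated with base R alone. The sketch below assumes that the first cluster is centered at the origin and that each further cluster is shifted by `delta` along one coordinate axis; this reading of the location parameters, the group labels ("Group1", ...), and the handling of rounding in the group sizes are assumptions made for illustration, not a description of mixture_sim() itself.

```r
set.seed(1)
n <- 500                          # number of observations
p <- 10                           # number of variables
delta <- 10                       # location shift
pct_clusters <- c(0.5, 0.3, 0.2)  # mixture weights (must sum to 1)
q <- length(pct_clusters)

# Group sizes; the last group absorbs any rounding remainder
n_groups <- floor(n * pct_clusters)
n_groups[q] <- n - sum(n_groups[-q])

# Cluster centers: origin for the first cluster, then delta on axis h
# for cluster h + 1 (assumed structure, requires q - 1 <= p)
means <- matrix(0, nrow = q, ncol = p)
for (h in seq_len(q - 1)) means[h + 1, h] <- delta

# Standard normal noise plus the center of each observation's cluster
labels <- rep(seq_len(q), times = n_groups)
X <- matrix(rnorm(n * p), nrow = n, ncol = p) + means[labels, , drop = FALSE]

# First column holds the cluster assignment as a character string
sim <- data.frame(cluster = paste0("Group", labels), X)
table(sim$cluster)
```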
A dataframe of n observations and p+1 variables with the first variable indicating the cluster assignment using a character string.
Aurore Archimbaud
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
X <- mixture_sim()
summary(X)
Identifies invariant coordinates that are non-normal using univariate normality tests as in the comp.norm.test function from the ICSOutlier package, with the difference that both the first and last few components are investigated.
normal_crit(object, ...)

## S3 method for class 'ICS'
normal_crit(
  object,
  level = 0.05,
  test = c("agostino.test", "jarque.test", "anscombe.test", "bonett.test",
    "shapiro.test"),
  max_select = NULL,
  select_only = FALSE,
  ...
)

## Default S3 method:
normal_crit(
  object,
  level = 0.05,
  test = c("agostino.test", "jarque.test", "anscombe.test", "bonett.test",
    "shapiro.test"),
  max_select = NULL,
  select_only = FALSE,
  gen_kurtosis = NULL,
  ...
)
object |
object of class |
... |
additional arguments are currently ignored. |
level |
the initial level used to make a decision based on the test p-values. See details. Default is 0.05. |
test |
name of the normality test to be used. Possibilities are
|
max_select |
the maximal number of components to select. |
select_only |
boolean. If |
gen_kurtosis |
vector of generalized kurtosis values. |
The procedure sequentially tests the first and the last components until finding no additional components as non-normal. The quantile levels are adjusted for multiple testing by taking the level as level/j for the jth component.
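The level adjustment can be illustrated with a small base R sketch. It uses made-up p-values and walks through the components from one end only; the package also examines the last components, and how the two ends are combined is not reproduced here. All object names are chosen for illustration.

```r
level <- 0.05

# Made-up p-values of a normality test for components 1, 2, 3, 4
pvalues <- c(0.001, 0.004, 0.2, 0.6)

# Adjusted level for the j-th tested component: level / j
adjusted_levels <- level / seq_along(pvalues)

# Walk through the components and stop at the first one
# for which normality is not rejected
first_not_rejected <- which(pvalues > adjusted_levels)[1]
n_selected <- if (is.na(first_not_rejected)) {
  length(pvalues)
} else {
  first_not_rejected - 1
}
n_selected
```

With these values the first two components are flagged as non-normal and the procedure stops at the third.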
If select_only is TRUE, a vector of the names of the invariant components or variables to select is returned. If FALSE, an object of class "ICS_crit" is returned with the following objects:
crit: the name of the criterion, "normal".
level: the level of the test.
max_select: the maximal number of components to select.
test: name of the normality test to be used.
pvalues: the p-values of the tests.
adjusted_levels: the adjusted levels.
select: the names of the invariant components or variables to select.
gen_kurtosis: the vector of generalized kurtosis values in case of an ICS object.
Andreas Alfons, Aurore Archimbaud, Klaus Nordhausen and Anne Ruiz-Gazen
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
Archimbaud, A., Nordhausen, K., and Ruiz-Gazen, A. (2018). ICSOutlier: Unsupervised Outlier Detection for Low-Dimensional Contamination Structure, The R Journal, Vol. 10(1):234–250. doi:10.32614/RJ-2018-034
Archimbaud, A., Nordhausen, K., and Ruiz-Gazen, A. (2016). ICSOutlier: Outlier Detection Using Invariant Coordinate Selection. R package version 0.3-0.
med_crit(), var_crit(), discriminatory_crit(), jarque.test(), anscombe.test(), bonett.test(), agostino.test(), stats::shapiro.test().
X <- iris[,-5]
out <- ICS(X)
normal_crit(out, level = 0.1, select_only = FALSE)
Wrapper for performing Partitioning Around Medoids clustering from cluster::pam().
pam_clust(X, k, clusters_only = FALSE, ...)
X |
a numeric matrix or data frame of the data. It corresponds to the
argument |
k |
the number of clusters searched for. It corresponds to the argument
|
clusters_only |
boolean. If |
... |
other arguments to pass to the |
If clusters_only is TRUE, a vector of the new partition of the data is returned, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates trimmed observations.
Otherwise a list is returned with the following components:
clust_method |
the name of the clustering method, i.e "clara_pam". |
clusters |
the vector of the new partition of the data, i.e a vector of
integers (from |
... |
an object of class |
.
Aurore Archimbaud
pam_clust(iris[,1:4], k = 3, clusters_only = TRUE)
Wrapper for component_plot().
## S3 method for class 'ICSClust'
plot(x, ...)
x |
an object of class |
... |
additional arguments to be passed down to |
An object of class "ggmatrix" (see GGally::ggpairs()).
Aurore Archimbaud
Print of an ICSClust_summary object
Prints an ICSClust_summary object in an informative way.
## S3 method for class 'ICSClust_summary'
print(x, info = FALSE, digits = 4L, ...)
x |
object of class |
info |
logical, either TRUE or FALSE. If TRUE, prints additional information on arguments used for computing scatter matrices (only named arguments that contain numeric, character, or logical scalars) and information on the parameters of the algorithm. Default is FALSE. |
digits |
number of digits for the numeric output. |
... |
additional arguments are ignored. |
The supplied object of class "ICSClust_summary" is returned invisibly.
Aurore Archimbaud
Wrapper for performing Robust Improper Maximum Likelihood Clustering from otrimle::rimle().
rimle_clust(X, k, clusters_only = FALSE, ...)
X |
a numeric matrix or data frame of the data. It corresponds to the
argument |
k |
the number of clusters searched for. It corresponds to the argument
|
clusters_only |
boolean. If |
... |
other arguments to pass to |
If clusters_only is TRUE, a vector of the new partition of the data is returned, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates trimmed observations.
Otherwise a list is returned with the following components:
clust_method |
the name of the clustering method, i.e, "rimle". |
clusters |
the vector of the new partition of the data, i.e. a vector of
integers (from |
... |
an object of class |
Aurore Archimbaud
rimle_clust(iris[,1:4], k = 3, clusters_only = TRUE)
Draw from a multivariate uniform distribution outside a given range. Intuitively speaking, the observations are drawn from a multivariate uniform distribution on a hyperrectangle with a hole in the middle (in the shape of a smaller hyperrectangle). This is useful, e.g., for adding random noise to a data set such that the noise consists of large values that do not overlap the initial data.
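One simple way to obtain such draws is rejection sampling: draw uniformly on the expanded hyperrectangle and discard the points that fall inside the inner one. The sketch below is an illustrative base R version under the assumption that the expanded hyperrectangle shares its center with the inner one and that each side is `mult` times as long. The function `runif_outside_sketch()` is hypothetical and is not how runif_outside_range() is necessarily implemented.

```r
# Hypothetical rejection sampler (illustration only)
runif_outside_sketch <- function(n, min = 0, max = 1, mult = 2) {
  p <- length(min)
  center <- (min + max) / 2
  half <- mult * (max - min) / 2      # half side lengths of the outer box
  out <- matrix(NA_real_, nrow = 0, ncol = p)
  while (nrow(out) < n) {
    # Candidates uniform on the outer hyperrectangle
    u <- matrix(runif(n * p), nrow = n, ncol = p)
    cand <- sweep(sweep(u, 2, 2 * half, `*`), 2, center - half, `+`)
    # A point is rejected only if every coordinate lies in [min, max]
    inside <- rowSums(sweep(cand, 2, min, `>=`) &
                        sweep(cand, 2, max, `<=`)) == p
    out <- rbind(out, cand[!inside, , drop = FALSE])
  }
  out[seq_len(n), , drop = FALSE]
}

set.seed(123)
xy <- runif_outside_sketch(200, min = rep(-1, 2), max = rep(1, 2), mult = 2)
range(xy)
```

Rejection sampling keeps the density uniform on the remaining region, at the cost of discarding roughly a fraction (1/mult)^p of the candidates.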
runif_outside_range(n, min = 0, max = 1, mult = 2)
n |
an integer giving the number of observations to generate. |
min |
a numeric vector giving the minimum of each variable of the initial data set (outside of which to generate random noise). |
max |
a numeric vector giving the maximum of each variable of the initial data set (outside of which to generate random noise). |
mult |
multiplication factor (larger than 1) to expand the
hyperrectangle around the initial data (which is given by |
A matrix of generated points.
Andreas Alfons
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
## illustrations for argument 'mult'

# draw observations with argument 'mult = 2'
xy2 <- runif_outside_range(1000, min = rep(-1, 2), max = rep(1, 2), mult = 2)
# each side of the larger hyperrectangle is twice as long as
# the corresponding side of the smaller rectangular cut-out
df2 <- data.frame(x = xy2[, 1], y = xy2[, 2])
ggplot(data = df2, mapping = aes(x = x, y = y)) + geom_point()

# draw observations with argument 'mult = 4'
xy4 <- runif_outside_range(1000, min = rep(-1, 2), max = rep(1, 2), mult = 4)
# each side of the larger hyperrectangle is four times as long
# as the corresponding side of the smaller rectangular cut-out
df4 <- data.frame(x = xy4[, 1], y = xy4[, 2])
ggplot(data = df4, mapping = aes(x = x, y = y)) + geom_point()
Extracts the generalized kurtosis values of the components obtained via an ICS transformation and draws either a screeplot or a specific plot for a given criterion. If an object of class "ICS_crit" is given, then the selected components are shaded on the plot.
select_plot(object, ...)

## Default S3 method:
select_plot(
  object,
  select = NULL,
  scale = FALSE,
  screeplot = TRUE,
  type = c("dots", "lines"),
  width = 0.2,
  color = "grey",
  alpha = 0.3,
  size = 3,
  ...
)

## S3 method for class 'data.frame'
select_plot(
  object,
  type = c("dots", "lines"),
  width = 0.2,
  color = "grey",
  alpha = 0.3,
  ...
)

## S3 method for class 'ICS_crit'
select_plot(
  object,
  type = c("dots", "lines"),
  width = 0.2,
  color = "grey",
  alpha = 0.3,
  size = 3,
  screeplot = TRUE,
  ...
)
object |
an object inheriting from class "ICS" or "ICS_crit". |
... |
additional arguments are currently ignored. |
select |
an integer, character, or logical vector specifying for which components to extract the generalized kurtosis values, or NULL (the default) to extract the values of all components. |
scale |
a logical indicating whether to scale the generalized kurtosis values to have product 1 (defaults to FALSE). |
screeplot |
boolean. If |
type |
either "dots" or "lines". |
width |
the width for shading the selected components in case an "ICS_crit" object is given. |
color |
the color for shading the selected components in case an "ICS_crit" object is given. |
alpha |
the transparency for shading the selected components in case an "ICS_crit" object is given. |
size |
size of the points. Only relevant for "discriminatory" criteria. |
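As a small illustration of the scale argument, the sketch below rescales a vector of generalized kurtosis values so that their product is 1. The values are hypothetical, and dividing by the geometric mean is one natural way to obtain a product of 1; that the package uses exactly this normalization is an assumption.

```r
# Hypothetical generalized kurtosis values (not computed from real data)
gen_kurtosis <- c(IC.1 = 2.5, IC.2 = 1.2, IC.3 = 0.8, IC.4 = 0.4)

# Divide by the geometric mean so that the product of the values becomes 1
geometric_mean <- prod(gen_kurtosis)^(1 / length(gen_kurtosis))
scaled <- gen_kurtosis / geometric_mean

prod(scaled)  # equals 1 up to floating-point error
```

Scaling changes neither the ordering nor the ratios of the values, so the shape of a screeplot is unaffected.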
An object of class "ggplot" (see ggplot2::ggplot()).
Andreas Alfons and Aurore Archimbaud
X <- iris[,-5]
out <- ICS(X)

# on an ICS object
select_plot(out)
select_plot(out, type = "lines")

# on an ICS_crit object
# median criterion
out_med <- med_crit(out, nb_select = 1, select_only = FALSE)
select_plot(out_med, type = "lines")
select_plot(out_med, screeplot = FALSE, type = "lines", color = "lightblue")

# discriminatory criterion
out_disc <- discriminatory_crit(out, clusters = iris[,5], select_only = FALSE)
select_plot(out_disc)
ICSClust object
Summarizes an ICSClust object in an informative way.
## S3 method for class 'ICSClust'
summary(object, ...)
object |
object of class "ICSClust". |
... |
additional arguments passed to |
An object of class "ICSClust_summary" with the following components:
ICS_out: ICS_out object
nb_comp: number of selected components
select: vector of names of selected components
nb_clusters: number of clusters
table_clusters: frequency table of clusters
Aurore Archimbaud
Computes a pairwise one-step M-estimate of scatter with weights based on pairwise Mahalanobis distances. Note that it is based on pairwise differences and therefore does not require a location estimate.
tcov(x, beta = 2)
x |
a numeric matrix or data frame. |
beta |
a positive numeric value specifying the tuning parameter of the pairwise one-step M-estimator (defaults to 2), see ‘Details’. |
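For a sample $\boldsymbol{X}_n = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)^\top$, a positive and decreasing weight function $w$, and a tuning parameter $\beta > 0$, the pairwise one-step M-estimator of scatter is defined as

$$\mathrm{TCOV}_\beta(\boldsymbol{X}_n) = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w\big(\beta \, r^2(\boldsymbol{x}_i, \boldsymbol{x}_j)\big) (\boldsymbol{x}_i - \boldsymbol{x}_j)(\boldsymbol{x}_i - \boldsymbol{x}_j)^\top}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w\big(\beta \, r^2(\boldsymbol{x}_i, \boldsymbol{x}_j)\big)},$$

where $r^2(\boldsymbol{x}_i, \boldsymbol{x}_j) = (\boldsymbol{x}_i - \boldsymbol{x}_j)^\top \mathrm{COV}(\boldsymbol{X}_n)^{-1} (\boldsymbol{x}_i - \boldsymbol{x}_j)$ denotes the squared pairwise Mahalanobis distance between observations $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ based on the sample covariance matrix $\mathrm{COV}(\boldsymbol{X}_n)$. Here, the weight function $w(x) = \exp(-x/2)$ is used.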
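The definition can be sketched in base R as follows. This is a minimal illustration, not the package implementation: the function name tcov_sketch is hypothetical, and the weight function w(x) = exp(-x/2) is an assumption taken from the cited literature on this estimator.

```r
# Sketch of a pairwise one-step M-estimator of scatter (not the package code).
# Assumption: the weight function is w(x) = exp(-x / 2).
tcov_sketch <- function(x, beta = 2) {
  x <- as.matrix(x)
  n <- nrow(x)
  cov_inv <- solve(cov(x))       # inverse of the sample covariance matrix
  pairs <- utils::combn(n, 2)    # all index pairs (i, j) with i < j
  # pairwise differences x_i - x_j, one row per pair
  diffs <- x[pairs[1, ], , drop = FALSE] - x[pairs[2, ], , drop = FALSE]
  # squared pairwise Mahalanobis distances
  r2 <- rowSums((diffs %*% cov_inv) * diffs)
  w <- exp(-beta * r2 / 2)
  # weighted average of the outer products of the pairwise differences
  crossprod(diffs * w, diffs) / sum(w)
}

tcov_sketch(iris[, 1:4])
```

Because only differences between observations enter the computation, no location estimate appears anywhere in the function. The sketch stores all n(n-1)/2 pairwise differences at once, so it is only practical for moderate sample sizes.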
A numeric matrix giving the pairwise one-step M-estimate of scatter.
Andreas Alfons and Aurore Archimbaud
Caussinus, H. and Ruiz-Gazen, A. (1993) Projection Pursuit and Generalized Principal Component Analysis. In Morgenthaler, S., Ronchetti, E., Stahel, W.A. (eds.) New Directions in Statistical Data Analysis and Robustness, 35-46. Monte Verita, Proceedings of the Centro Stefano Franciscini Ascona Series. Springer-Verlag.
Caussinus, H. and Ruiz-Gazen, A. (1995) Metrics for Finding Typical Structures by Means of Principal Component Analysis. In Data Science and its Applications, 177-192. Academic Press.
ICS_tcov(), ucov(), ICS_ucov()
Wrapper for performing trimmed k-means clustering from tclust::tkmeans().
tkmeans_clust(X, k, clusters_only = FALSE, alpha = 0.05, ...)
X |
a numeric matrix or data frame of the data. It corresponds to the argument x of tclust::tkmeans(). |
k |
the number of clusters searched for. It corresponds to the argument k of tclust::tkmeans(). |
clusters_only |
boolean. If |
alpha |
the proportion of observations to be trimmed. |
... |
other arguments to pass to the tclust::tkmeans() function. |
If clusters_only is TRUE, a vector of the new partition of the data is returned, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates trimmed observations.
Otherwise a list is returned with the following components:
clust_method |
the name of the clustering method, i.e. "tkmeans". |
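clusters |
the vector of the new partition of the data, i.e. a vector of integers (from 1:k) indicating the cluster to which each observation is allocated. 0 indicates trimmed observations. |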
... |
an object of class |
.
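Since trimmed observations are coded as 0 rather than as a cluster label, downstream code usually has to treat them separately. The sketch below uses a hypothetical cluster vector of the kind described above (it is not actual output of tkmeans_clust()) to show one way of doing so in base R.

```r
# Hypothetical cluster vector with k = 3; 0 marks trimmed observations
clusters <- c(1, 1, 2, 0, 3, 2, 0, 3, 1)

trimmed <- clusters == 0       # logical flag for trimmed observations
which(trimmed)                 # positions of the trimmed observations
table(clusters[!trimmed])      # cluster sizes without the trimmed observations
mean(trimmed)                  # realized proportion of trimmed observations
```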
Aurore Archimbaud
tkmeans_clust(iris[,1:4], k = 3, alpha = 0.1, clusters_only = TRUE)
Compute a one-step M-estimator of scatter with weights based on Mahalanobis distances, or a simple related estimator that is based on a transformation.
scov(x, beta = 0.2)

ucov(x, beta = 0.2)
x |
a numeric matrix or data frame. |
beta |
a positive numeric value specifying the tuning parameter of the estimator (defaults to 0.2), see ‘Details’. |
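For a sample $\boldsymbol{X}_n = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)^\top$, a positive and decreasing weight function $w$, and a tuning parameter $\beta > 0$, the one-step M-estimator of scatter is defined as

$$\mathrm{SCOV}_\beta(\boldsymbol{X}_n) = \frac{\sum_{i=1}^{n} w\big(\beta \, r^2(\boldsymbol{x}_i)\big) (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_n)(\boldsymbol{x}_i - \bar{\boldsymbol{x}}_n)^\top}{\sum_{i=1}^{n} w\big(\beta \, r^2(\boldsymbol{x}_i)\big)},$$

where $r^2(\boldsymbol{x}_i) = (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_n)^\top \mathrm{COV}(\boldsymbol{X}_n)^{-1} (\boldsymbol{x}_i - \bar{\boldsymbol{x}}_n)$ denotes the squared Mahalanobis distance of observation $\boldsymbol{x}_i$ from the sample mean $\bar{\boldsymbol{x}}_n$ based on the sample covariance matrix $\mathrm{COV}(\boldsymbol{X}_n)$. Here, the weight function $w(x) = \exp(-x/2)$ is used.

A simple robust estimator that is consistent under normality is obtained via the transformation

$$\mathrm{UCOV}_\beta(\boldsymbol{X}_n) = \big(\mathrm{SCOV}_\beta(\boldsymbol{X}_n)^{-1} - \beta \, \mathrm{COV}(\boldsymbol{X}_n)^{-1}\big)^{-1}.$$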
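Both estimators can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the package code: the names scov_sketch and ucov_sketch are hypothetical, the weight function is assumed to be w(x) = exp(-x/2), and the transformation is assumed to be the inverse of the difference between the inverse one-step estimate and beta times the inverse sample covariance matrix, as in Ruiz-Gazen (1996).

```r
# Sketch of the one-step M-estimator of scatter (not the package code).
# Assumption: the weight function is w(x) = exp(-x / 2).
scov_sketch <- function(x, beta = 0.2) {
  x <- as.matrix(x)
  center <- colMeans(x)
  # squared Mahalanobis distances from the sample mean
  r2 <- mahalanobis(x, center, cov(x))
  w <- exp(-beta * r2 / 2)
  centered <- sweep(x, 2, center)
  # weighted average of the outer products of the centered observations
  crossprod(centered * w, centered) / sum(w)
}

# Sketch of the related estimator obtained via the transformation.
ucov_sketch <- function(x, beta = 0.2) {
  x <- as.matrix(x)
  solve(solve(scov_sketch(x, beta)) - beta * solve(cov(x)))
}

scov_sketch(iris[, 1:4])
ucov_sketch(iris[, 1:4])
```

Observations far from the sample mean receive exponentially small weights, which shrinks the one-step estimate relative to the sample covariance matrix; the transformation compensates for that shrinkage.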
A numeric matrix giving the estimate of the scatter matrix.
Andreas Alfons and Aurore Archimbaud
Caussinus, H. and Ruiz-Gazen, A. (1993) Projection Pursuit and Generalized Principal Component Analysis. In Morgenthaler, S., Ronchetti, E., Stahel, W.A. (eds.) New Directions in Statistical Data Analysis and Robustness, 35-46. Monte Verita, Proceedings of the Centro Stefano Franciscini Ascona Series. Springer-Verlag.
Caussinus, H. and Ruiz-Gazen, A. (1995) Metrics for Finding Typical Structures by Means of Principal Component Analysis. In Data Science and its Applications, 177-192. Academic Press.
Ruiz-Gazen, A. (1996) A Very Simple Robust Estimator of a Dispersion Matrix. Computational Statistics & Data Analysis, 21(2), 149-162. doi:10.1016/0167-9473(95)00009-7.
ICS_ucov(), tcov(), ICS_tcov()
Identifies the interesting invariant coordinates based on the rolling variance criterion as used in the ICSboot function of the ICtest package. It computes rolling variances on the generalized eigenvalues obtained through ICS::ICS().
var_crit(object, ...)

## S3 method for class 'ICS'
var_crit(object, nb_select = NULL, select_only = FALSE, ...)

## Default S3 method:
var_crit(object, nb_select = NULL, select_only = FALSE, ...)
object |
object of class |
... |
additional arguments are currently ignored. |
nb_select |
the exact number of components to select. By default it is set to NULL. |
select_only |
boolean. If |
If the generalized eigenvalues of the uninformative components are all the same, then the variance of these generalized eigenvalues must be minimal. Therefore, when nb_select components should be selected, the method identifies the p - nb_select neighboring generalized eigenvalues with minimal variance, where p is the total number of components. The number of interesting components should be at most p - 2, as at least two uninteresting components are needed to compute a variance.
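The selection rule described above can be sketched in base R. The function name var_crit_sketch and the input values are hypothetical, and the sketch works on a plain numeric vector of ordered generalized eigenvalues rather than on an ICS object, so it illustrates the idea and does not reproduce the package implementation.

```r
# Sketch of the rolling variance rule (not the package code).
# gen_kurtosis: ordered generalized eigenvalues; nb_select: number to select.
var_crit_sketch <- function(gen_kurtosis, nb_select) {
  p <- length(gen_kurtosis)
  stopifnot(nb_select >= 1, nb_select <= p - 2)
  width <- p - nb_select                 # size of the uninteresting block
  starts <- seq_len(p - width + 1)       # possible first positions of the block
  # variance of each block of 'width' neighboring eigenvalues
  roll_var <- vapply(
    starts,
    function(s) var(gen_kurtosis[s:(s + width - 1)]),
    numeric(1)
  )
  best <- which.min(roll_var)
  uninteresting <- best:(best + width - 1)
  # the selected components are those outside the minimal variance block
  setdiff(seq_len(p), uninteresting)
}

# Hypothetical generalized eigenvalues: the three middle values are close
lambda <- c(5.2, 1.1, 1.0, 0.9, 0.3)
var_crit_sketch(lambda, nb_select = 2)
```

In this example the block of three neighboring values with the smallest variance is the middle one, so the first and the last component are selected. Selected components can therefore come from both ends of the spectrum.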
If select_only is TRUE, a vector of the names of the invariant components or variables to select is returned. If FALSE, an object of class "ICS_crit" is returned with the following objects:
crit: the name of the criterion "var".
nb_select: the number of components to select.
gen_kurtosis: the vector of generalized kurtosis values.
select: the names of the invariant components or variables to select.
RollVarX: the rolling variances of order p - nb_select.
Order: indexes of the ordered invariant components such that the ones associated with the smallest variances of the eigenvalues are at the end.
Andreas Alfons, Aurore Archimbaud and Klaus Nordhausen
Alfons, A., Archimbaud, A., Nordhausen, K., & Ruiz-Gazen, A. (2022). Tandem clustering with invariant coordinate selection. arXiv preprint arXiv:2212.06108.
Radojicic, U., & Nordhausen, K. (2019). Non-gaussian component analysis: Testing the dimension of the signal subspace. In Workshop on Analytical Methods in Statistics (pp. 101–123). Springer. doi:10.1007/978-3-030-48814-7_6.
normal_crit(), med_crit(), discriminatory_crit().
X <- iris[,-5]
out <- ICS(X)
var_crit(out, nb_select = 2, select_only = FALSE)