Multivariate Statistics

Base R contains most of the functionality for classical multivariate analysis, somewhere. A large number of packages on CRAN extend this methodology; a brief overview is given below. Application-specific uses of multivariate statistics are described in the relevant task views: for example, whilst principal components are listed here, ordination is covered in the Environmetrics task view. Further information on supervised classification can be found in the MachineLearning task view, and on unsupervised classification in the Cluster task view.

The packages in this view can be roughly structured into the following topics. If you think that some package is missing from the list, please let me know.

Visualising multivariate data

  • Graphical Procedures: A range of base graphics (e.g. pairs() and coplot()) and lattice functions (e.g. xyplot() and splom()) are useful for visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities. scatterplot.matrix() in car provides usefully enhanced pairwise scatterplots. Beyond this, scatterplot3d provides 3-dimensional scatterplots, and aplpack provides bagplots and spin3R(), a function for rotating 3d clouds. misc3d, dependent upon rgl, provides animated functions within R useful for visualising densities. YaleToolkit provides a range of useful visualisation techniques for multivariate data. More specialised multivariate plots include the following: faces() in aplpack provides Chernoff's faces; parcoord() from MASS provides parallel coordinate plots; stars() in graphics provides a choice of star, radar and cobweb plots. mstree() in ade4 and spantree() in vegan provide minimum spanning tree functionality. calibrate supports biplot and scatterplot axis labelling. geometry, which provides an interface to the qhull library, gives indices to the relevant points via convexhulln(). ellipse draws ellipses for two parameters, and provides plotcorr(), a visual display of a correlation matrix. denpro provides level set trees for multivariate visualisation. Mosaic plots are available via mosaicplot() in graphics and mosaic() in vcd, which also contains other visualisation techniques for multivariate categorical data. gclus provides a number of cluster-specific graphical enhancements for scatterplots and parallel coordinate plots. See the links for a reference to GGobi. rggobi interfaces with GGobi. xgobi interfaces to the XGobi and XGvis programs, which allow linked, dynamic multivariate plots as well as projection pursuit. Finally, iplots allows particularly powerful dynamic interactive graphics, of which interactive parallel co-ordinate plots and mosaic plots may be of great interest.
Seriation methods are provided by seriation which can reorder matrices and dendrograms.
  • Data Preprocessing: summarize() and summary.formula() in Hmisc assist with descriptive functions; from the same package varclus() offers variable clustering, while dataRep() and find.matches() assist in exploring a given dataset in terms of representativeness and finding matches. Whilst dist() in stats and daisy() in cluster provide a wide range of distance measures, proxy provides a framework for more distance measures, including measures between matrices. simba provides functions for dealing with presence / absence data, including similarity matrices and reshaping.
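
As a minimal sketch of the base tools named above, the following draws a scatterplot matrix with pairs() and computes a small distance matrix with dist(). The choice of the built-in iris data and of the first five rows is purely illustrative.

```r
# Numeric columns of the built-in iris data (illustrative choice of dataset).
x <- iris[, 1:4]

# Base graphics scatterplot matrix, coloured by species.
pairs(x, col = as.integer(iris$Species), pch = 19)

# Euclidean distances between the first five observations (stats::dist).
d <- dist(x[1:5, ], method = "euclidean")
print(round(d, 2))
```

daisy() in cluster accepts mixed variable types, which dist() does not, so it may be the better starting point when the data are not all numeric.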

Hypothesis testing

  • ICSNP provides Hotelling's T2 test as well as a range of non-parametric tests, including location tests based on marginal ranks, spatial median and spatial signs computation, and estimates of shape. Non-parametric two-sample tests are also available from cramer, and spatial sign and rank tests to investigate location, sphericity and independence are available in SpatialNP.
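
To make the test statistic concrete, here is a hand-computed one-sample Hotelling's T2 test in base R. It is a sketch of the textbook formula, not the interface offered by ICSNP; the data subset and the hypothesised mean mu0 are assumptions chosen for illustration.

```r
# One-sample Hotelling's T2 computed from first principles (base R only).
x   <- as.matrix(iris[iris$Species == "setosa", 1:2])
mu0 <- c(5, 3.4)            # hypothesised mean vector (illustrative value)

n    <- nrow(x)
p    <- ncol(x)
xbar <- colMeans(x)
S    <- cov(x)

# T2 = n * (xbar - mu0)' S^{-1} (xbar - mu0)
T2 <- n * drop(t(xbar - mu0) %*% solve(S, xbar - mu0))

# Under multivariate normality, (n - p) / (p * (n - 1)) * T2 ~ F(p, n - p).
F_stat  <- (n - p) / (p * (n - 1)) * T2
p_value <- pf(F_stat, df1 = p, df2 = n - p, lower.tail = FALSE)

c(T2 = T2, F = F_stat, p.value = p_value)
```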

Multivariate distributions

  • Descriptive measures: cov() and cor() in stats provide estimates of the covariance and correlation matrices respectively. ICSNP offers several descriptive measures such as spatial.median(), which provides an estimate of the spatial median, and further functions which provide estimates of scatter. Further robust methods are provided, such as cov.rob() in MASS, which provides robust estimates of the variance-covariance matrix by minimum volume ellipsoid, minimum covariance determinant or classical product-moment. covRobust provides robust covariance estimation via nearest neighbor variance estimation. robustbase provides robust covariance estimation via fast minimum covariance determinant with covMcd() and the orthogonalized pairwise estimate of Gnanadesikan-Kettenring via covOGK(). Scalable robust methods are provided within rrcov, also using fast minimum covariance determinant with covMcd() as well as M-estimators with covMest(). corpcor provides shrinkage estimation of large scale covariance and (partial) correlation matrices.
  • Densities (estimation and simulation): mvrnorm() in MASS simulates from the multivariate normal distribution. mvtnorm also provides simulation as well as probability and quantile functions for both the multivariate t distribution and multivariate normal distributions, as well as density functions for the multivariate normal distribution. mnormt provides multivariate normal and multivariate t density and distribution functions as well as random number simulation. sn provides density, distribution and random number generation for the multivariate skew normal and skew t distribution. delt provides a range of functions for estimating multivariate densities by CART and greedy methods. Comprehensive information on mixtures is given in the Cluster view; some density estimates and random numbers are provided by rmvnorm.mixt() and dmvnorm.mixt() in ks, and mixture fitting is also provided within bayesm. Functions to simulate from the Wishart distribution are provided in a number of places, such as rwishart() in bayesm and rwish() in MCMCpack (the latter also has a density function dwish()). bkde2D() from KernSmooth and kde2d() from MASS provide binned and non-binned 2-dimensional kernel density estimation; ks also provides multivariate kernel smoothing, as do ash and GenKern. prim provides patient rule induction methods to attempt to find regions of high density in high dimensional multivariate data; feature also provides methods for determining feature significance in multivariate data (such as in relation to local modes).
  • Assessing normality: mvnormtest provides a multivariate extension to the Shapiro-Wilk test, and mvoutlier provides multivariate outlier detection based on robust methods. ICS provides tests for multi-normality. mvnorm.etest() in energy provides an assessment of normality based on E statistics (energy); in the same package k.sample() assesses a number of samples for equal distributions. Tests for Wishart-distributed covariance matrices are given by mauchly.test() in stats.
  • Copulas: copula provides routines for a range of (elliptical and Archimedean) copulas including normal, t, Clayton, Frank and Gumbel; fgac provides generalised Archimedean copulas.
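
The following sketch ties together a few of the functions listed above: simulation with mvrnorm() from MASS, classical estimates with cov() and cor(), and a robust estimate with cov.rob(). The mean vector, covariance matrix, sample size and seed are illustrative assumptions.

```r
library(MASS)  # recommended package distributed with R

set.seed(1)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)

# Simulate 500 draws from a bivariate normal distribution.
x <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

# Classical covariance and correlation estimates.
S <- cov(x)
R <- cor(x)

# Robust location and scatter via minimum covariance determinant.
rob <- cov.rob(x, method = "mcd")

# Squared Mahalanobis distances from the robust centre, useful for
# flagging potential outliers.
d2 <- mahalanobis(x, center = rob$center, cov = rob$cov)
summary(d2)
```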

Linear models

  • From stats, lm() (with a matrix specified as the dependent variable) offers multivariate linear models, and anova.mlm() provides comparison of multivariate linear models. manova() offers MANOVA. sn provides msn.mle() and mst.mle(), which fit multivariate skew normal and multivariate skew t models. pls provides partial least squares regression (PLSR) and principal component regression, ppls provides penalized partial least squares, and dr provides dimension reduction regression options such as "sir" (sliced inverse regression) and "save" (sliced average variance estimation). plsgenomics provides partial least squares analyses for genomics. relaimpo provides functions to investigate the relative importance of regression parameters.
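
A short sketch of the stats functions above: a multivariate linear model fitted by passing a matrix response to lm(), a model comparison, and the equivalent MANOVA. The choice of responses and predictor from iris is an assumption made for illustration.

```r
# Multivariate linear model: a matrix on the left-hand side gives an "mlm" fit.
fit <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
coef(fit)   # one column of coefficients per response

# Intercept-only model, compared with the full model (dispatches to anova.mlm).
fit0 <- lm(cbind(Sepal.Length, Petal.Length) ~ 1, data = iris)
anova(fit0, fit)

# The same hypothesis expressed as a MANOVA.
m <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(m, test = "Pillai")
```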

Projection methods

  • Principal components: these can be fitted with prcomp() (based on svd(), preferred) as well as princomp() (based on eigen() for compatibility with S-PLUS) from stats. pc1() in Hmisc provides the first principal component and gives coefficients for unscaled data. Additional support for an assessment of the scree plot can be found in nFactors, whereas paran provides routines for Horn's evaluation of the number of dimensions to retain. For wide matrices, gmodels provides fast.prcomp() and fast.svd(). kernlab uses kernel methods to provide a form of non-linear principal components with kpca(). pcaPP provides robust principal components by means of projection pursuit. amap provides further robust and parallelised methods, such as a form of generalised and robust principal component analysis via acpgen() and acprob() respectively. Further options for principal components in an ecological setting are available within ade4 and in a sensory setting in SensoMineR. psy provides a variety of routines useful in psychometry; in this context these include sphpca(), which maps onto a sphere, and fpca(), where some variables may be considered as dependent, as well as scree.plot(), which has the option of adding simulation results to help assess the observed data. PTAk provides principal tensor analysis analogous to both PCA and correspondence analysis. smatr provides standardised major axis estimation with specific application to allometry.
  • Canonical Correlation: cancor() in stats provides canonical correlation. kernlab uses kernel methods to provide robust canonical correlation with kcca(). concor provides a number of concordance methods.
  • Redundancy Analysis: calibrate provides rda() for redundancy analysis as well as further options for canonical correlation. fso provides fuzzy set ordination, which extends ordination beyond methods available from linear algebra.
  • Independent Components: fastICA provides fastICA algorithms to perform independent component analysis (ICA) and Projection Pursuit, and PearsonICA uses score functions. ICS provides either an invariant co-ordinate system or independent components. JADE adds an interface to the JADE algorithm, as well as providing some diagnostics for ICA.
  • Procrustes analysis: procrustes() in vegan provides Procrustes analysis; this package also provides functions for ordination, and further information on that area is given in the Environmetrics task view. Generalised Procrustes analysis via GPA() is available from FactoMineR.
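
The two stats functions mentioned in this section can be sketched as follows: principal components with prcomp() and canonical correlation with cancor(). Scaling the variables and splitting iris into sepal and petal blocks are illustrative assumptions.

```r
# Principal components on the correlation scale (prcomp is based on svd()).
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)

# Scores on the first two components.
scores <- pca$x[, 1:2]
plot(scores, col = as.integer(iris$Species), pch = 19)

# Canonical correlations between sepal and petal measurements.
cc <- cancor(iris[, 1:2], iris[, 3:4])
cc$cor
```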

Principal coordinates / scaling methods

  • cmdscale() in stats provides classical multidimensional scaling (principal coordinates analysis), sammon() and isoMDS() in MASS offer Sammon and Kruskal's non-metric multidimensional scaling. vegan provides wrappers and post-processing for non-metric MDS. indscal() is provided by SensoMineR.
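
As a sketch, classical and non-metric scaling can be compared on the built-in eurodist road distances. The dataset and the choice of two dimensions are assumptions for illustration.

```r
library(MASS)  # for isoMDS()

# Classical multidimensional scaling (principal coordinates) in 2 dimensions.
loc <- cmdscale(eurodist, k = 2)
plot(loc, type = "n", xlab = "", ylab = "", asp = 1)
text(loc, labels = rownames(loc), cex = 0.7)

# Kruskal's non-metric MDS, started from the classical solution by default.
nm <- isoMDS(eurodist, k = 2, trace = FALSE)
nm$stress   # stress, in percent
```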

Unsupervised classification

  • Cluster analysis: A comprehensive overview of clustering methods available within R is provided by the Cluster task view. Standard techniques include hierarchical clustering by hclust() and k-means clustering by kmeans() in stats. A range of established clustering and visualisation techniques are also available in cluster, some cluster validation routines are available in clv, and the Rand index can be computed from classAgreement() in e1071. Trimmed cluster analysis is available from trimcluster, cluster ensembles are available from clue, methods to assist with choice of routines are available in clusterSim, and hybrid methodology is provided by hybridHclust. Distance measures (edist()) and hierarchical clustering based on E-statistics are available in energy. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from fpc. clustvarsel provides variable selection within model-based clustering. Fuzzy clustering is available within cluster as well as via the hopach (Hierarchical Ordered Partitioning and Collapsing Hybrid) algorithm. kohonen provides supervised and unsupervised SOMs for high dimensional spectra or patterns. clusterGeneration helps simulate clusters. The Environmetrics task view also gives a topic-related overview of some clustering techniques. Model based clustering is available in mclust.
  • Tree methods: Full details on tree methods are given in the MachineLearning task view. Suffice to say here that classification trees are sometimes considered within multivariate methods; rpart is most used for this purpose. party provides recursive partitioning. Classification and regression training is provided by caret. kknn provides k-nearest neighbour methods which can be used for regression as well as classification.
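
A minimal sketch of the two standard stats techniques named above, hclust() and kmeans(). The linkage method, the number of clusters and the seed are illustrative assumptions rather than recommendations.

```r
# Standardise the numeric iris columns before clustering.
x <- scale(iris[, 1:4])

# Agglomerative hierarchical clustering on Euclidean distances.
hc <- hclust(dist(x), method = "average")
groups_hc <- cutree(hc, k = 3)

# k-means with several random starts to reduce sensitivity to initialisation.
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate the two partitions.
table(hierarchical = groups_hc, kmeans = km$cluster)
```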

Supervised classification and discriminant analysis

  • lda() and qda() within MASS provide linear and quadratic discrimination respectively. mda provides mixture and flexible discriminant analysis with mda() and fda() as well as multivariate adaptive regression splines with mars() and adaptive spline backfitting with the bruto() function. Multivariate adaptive regression splines can also be found in earth. Package class provides k-nearest neighbours by knn(), knncat provides k-nearest neighbours for categorical variables. SensoMineR provides FDA() for factorial discriminant analysis. A number of packages provide for dimension reduction with the classification. klaR includes variable selection and robustness against multicollinearity as well as a number of visualisation routines. superpc provides principal components for supervised classification, whereas gpls provides classification using generalised partial least squares. hddplot provides cross-validated linear discriminant calculations to determine the optimum number of features. ROCR provides a range of methods for assessing classifier performance. Further information on supervised classification can be found in the MachineLearning task view.
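
Linear discriminant analysis with lda() from MASS can be sketched as below. Using iris and assessing the fit by resubstitution and leave-one-out cross-validation are assumptions made for illustration.

```r
library(MASS)

# Linear discriminant analysis of species on all four measurements.
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)

# Resubstitution confusion matrix and accuracy (optimistic by construction).
tab      <- table(observed = iris$Species, predicted = pred$class)
accuracy <- mean(pred$class == iris$Species)

# Leave-one-out cross-validated class assignments give a fairer estimate.
cv          <- lda(Species ~ ., data = iris, CV = TRUE)
cv_accuracy <- mean(cv$class == iris$Species)

tab
c(resubstitution = accuracy, loo = cv_accuracy)
```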

Correspondence analysis

  • corresp() and mca() in MASS provide simple and multiple correspondence analysis respectively. ca also provides single, multiple and joint correspondence analysis. ca() and mca() in ade4 provide correspondence and multiple correspondence analysis respectively, as well as adding homogeneous table analysis with hta(). Further functionality is also available within vegan, and co-correspondence is available from cocorresp. FactoMineR provides CA() and MCA(), which also enable simple and multiple correspondence analysis as well as associated graphical routines. homals provides homogeneity analysis.
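
A brief sketch of simple correspondence analysis with corresp() from MASS, applied to the caith table (eye colour by hair colour) that ships with the same package. Retaining two dimensions is an illustrative choice.

```r
library(MASS)

# Simple correspondence analysis of a two-way contingency table.
ca1 <- corresp(caith, nf = 2)

# Canonical correlations for the retained dimensions.
ca1$cor

# Joint display of row and column scores.
biplot(ca1)
```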

Missing data

  • mitools provides tools for multiple imputation, mice provides multivariate imputation by chained equations, mvnmle provides ML estimation for multivariate normal data with missing values, and mix provides multiple imputation for mixed categorical and continuous data. pan provides multiple imputation for missing panel data. VIM provides methods for the visualisation as well as imputation of missing data. aregImpute() and transcan() from Hmisc provide further imputation methods. monomvn deals with estimation models where the missing data pattern is monotone.

Latent variable approaches

  • factanal() in stats provides factor analysis by maximum likelihood; Bayesian factor analysis is provided for Gaussian, ordinal and mixed variables in MCMCpack. GPArotation offers GPA (gradient projection algorithm) factor rotation. sem fits linear structural equation models, ltm provides latent trait models under item response theory, and a range of extensions to Rasch models can be found in eRm. FactoMineR provides a wide range of factor analysis methods, including MFA() and HMFA() for multiple and hierarchical multiple factor analysis as well as ADFM() for multiple factor analysis of quantitative and qualitative data. tsfa provides factor analysis for time series. poLCA provides latent class and latent class regression models for a variety of outcome variables.
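
Maximum likelihood factor analysis with factanal() can be sketched using the ability.cov covariance matrix from the datasets package. The two-factor solution and varimax rotation are assumptions for illustration.

```r
# Maximum likelihood factor analysis from a covariance matrix.
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# Loadings, suppressing small values for readability.
print(fa$loadings, cutoff = 0.3)

# Uniquenesses: the share of each variable's variance not explained by the factors.
round(fa$uniquenesses, 3)
```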

Modelling non-Gaussian data

  • MNP provides Bayesian multinomial probit models, and polycor provides polychoric and tetrachoric correlation matrices. bayesm provides a range of models such as seemingly unrelated regression, multinomial logit/probit, multivariate probit and instrumental variables. VGAM provides Vector Generalised Linear and Additive Models and reduced rank regression.

Matrix manipulations

  • As a vector- and matrix-based language, base R ships with many powerful tools for doing matrix manipulations, which are complemented by the packages Matrix and SparseM. matrixcalc adds functions for matrix differential calculus. Some further sparse matrix functionality is also available from spam.
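
A few of the base R matrix tools alluded to above, as a small sketch. The matrix A and vector b are arbitrary illustrative values.

```r
# A small symmetric positive definite matrix and a right-hand side.
A <- matrix(c(4, 2, 2, 3), nrow = 2)
b <- c(1, 2)

x     <- solve(A, b)               # solve A x = b without forming the inverse
A_inv <- solve(A)                  # explicit inverse, when really needed
e     <- eigen(A, symmetric = TRUE)  # eigenvalues and eigenvectors
U     <- chol(A)                   # upper triangular factor, t(U) %*% U equals A
s     <- svd(A)                    # singular value decomposition
AtA   <- crossprod(A)              # t(A) %*% A, computed efficiently

e$values
s$d
```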

Miscellaneous utilities

  • abind generalises cbind() and rbind() for arrays, and mApply() in Hmisc generalises apply() for matrices and passes multiple functions. In addition to functions listed earlier, sn provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distribution. mAr provides for vector auto-regression. rm.boot() from Hmisc bootstraps repeated measures models. psy also provides a range of statistics based on Cohen's kappa, including weighted measures and agreement among more than 2 raters. cwhmisc contains a number of useful support functions, such as ellipse(), normalise() and various rotation functions. desirability provides functions for multivariate optimisation. geozoo provides plotting of geometric objects in GGobi.

Paul Hewson