Statistics for the Social Sciences

Social scientists use a wide range of statistical methods, most of which are not unique to the social sciences. Indeed, most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.

Other Relevant Task Views:

Beyond the base and contributed packages, many of the methods commonly employed in the social sciences are covered extensively in other CRAN task views, including the following, which are listed in alphabetical order. I will try to minimize duplication of information present in these other task views.

  • Bayesian: Methods of Bayesian inference in a variety of settings of interest to social scientists, including mixed-effects models.
  • Econometrics and Finance: In addition to methods of specific interest to economists and financial analysts, these task views cover a variety of commonly used regression models and methods, instrumental-variables estimation, models for panel data, and some time-series models.
  • MetaAnalysis: Methods of meta-analysis for combining results from primary studies. If data on individuals in each study are available, meta-analysis can be performed using mixed-effects models.
  • Multivariate: A broad, if far from exhaustive, catalog of methods implemented in R for analyzing multivariate data, from data visualization to statistical modeling, and including correspondence analysis for multivariate categorical data.
  • OfficialStatistics: Covers not only official statistics but also methods for collecting and analyzing data from complex sample surveys, such as the survey package.
  • Psychometrics: Extensively covers methods of scale construction, including item-response theory, multidimensional scaling, and classical test theory, along with other topics of interest in the social sciences, such as structural-equation modeling.
  • Spatial: Methods for managing, visualizing, and modeling spatial data, including spatial regression analysis.
  • SpatioTemporal: Methods for representing, visualizing, and analyzing data with information both on time and location.
  • Survival: Methods for survival analysis (often termed "event-history analysis" in the social sciences), beyond the basic and standard methods, such as Cox regression, that are included in the recommended survival package.
  • TimeSeries: Methods for representing, manipulating, visualizing, and modeling time-series data, including time-series regression methods.

It is noteworthy that this enumeration includes about a third of the CRAN task views. Moreover, there are other task views of potential interest to social scientists (such as the Graphics task view on statistical graphics); I suggest that you look at the list of all task views on CRAN.

Linear and Generalized Linear Models:

Univariate and multivariate linear models are fit by the lm function, generalized linear models by the glm function, both in the R-base stats package. Beyond summary and plot methods for lm and glm objects, there is a wide array of functions that support these objects.

  • The generic anova function in the stats package constructs sequential ("Type-I") analysis of variance and analysis of deviance tables, and can also compute F and chi-square likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have anova methods as well, along with methods for other standard generics, such as coef, for returning regression coefficients; vcov for the coefficient covariance matrix; residuals; and fitted for fitted values of the response.) The generic Anova function in the car package (associated with Fox and Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011) constructs so-called "Type-II" and "Type-III" partial tests for linear, generalized linear, and many other classes of regression models.
  • F and chi-square Wald tests for a variety of hypotheses are available from the coeftest and waldtest functions in the lmtest package, and the linearHypothesis function in the car package. All of these functions permit the use of heteroscedasticity-consistent and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the sandwich and car packages. Also see the glh.test function in the gmodels package. Nonlinear functions of parameters can be tested via the deltaMethod function in the car package. The multcomp package includes functions for multiple comparisons. The vuong function in the pscl package tests non-nested hypotheses for generalized linear and some other models. Also see the rms package for tests on linear and generalized linear models.
  • The standard R distribution has excellent basic facilities for linear and generalized linear model "diagnostics," including, for example, hat-values and deletion statistics such as studentized residuals and Cook's distances (hatvalues, rstudent, and cooks.distance, all in the stats package). These are augmented by other packages: several functions in the car package, which emphasizes graphical methods, e.g., crPlots for component-plus-residual plots and avPlots for added-variable plots (among others), in addition to numerical diagnostics, such as vif for (generalized) variance-inflation factors; the dr package for dimension reduction in regression, including SIR, SAVE, and pHd; and the lmtest package, which implements a variety of diagnostic tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). The forward package implements diagnostics based on a "forward search" (Atkinson and Riani, Robust Diagnostic Regression Analysis, Springer, 2000). Other collinearity diagnostics are in the perturb package. Diagnostics may also be found in the rms package. See the influence.ME package for influential-data diagnostics for mixed-effects models.
  • Several packages contain functions that are useful for interpreting linear and generalized linear models that have been fit to data: The qvcalc package computes "quasi variances" for factors in linear and generalized linear models (and more generally). The effects package constructs effect displays, including, e.g., "adjusted means," for linear, generalized linear, and many other regression models; diagnostic partial-residual plots are available for linear and generalized linear models. Similar, if somewhat less general, plots are available in the visreg package. The lsmeans package implements so-called "least-squares means" for linear, generalized linear, and mixed models, and includes provisions for hypothesis tests. The Zelig package (see under "Collections") creates summary displays for many kinds of statistical models.
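
As a minimal sketch of the base facilities just described, the following uses only the stats package and the built-in mtcars data. The choice of data set and of model formulas is mine, purely for illustration; the contributed packages mentioned above are not needed to run it.

```r
# Fit a linear model and a smaller model nested within it
fit_full  <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
fit_small <- lm(mpg ~ wt + hp, data = mtcars)

# Sequential ("Type-I") analysis-of-variance table for the full model
anova(fit_full)

# F test comparing the two nested models
nested_test <- anova(fit_small, fit_full)

# Standard extractor generics
b  <- coef(fit_full)       # regression coefficients
V  <- vcov(fit_full)       # coefficient covariance matrix
se <- sqrt(diag(V))        # coefficient standard errors

# Basic diagnostics, all in the stats package
h  <- hatvalues(fit_full)       # hat-values (leverages)
rs <- rstudent(fit_full)        # studentized residuals
cd <- cooks.distance(fit_full)  # Cook's distances

# A logistic regression fit by glm, with sequential chi-square
# likelihood-ratio tests in the analysis-of-deviance table
fit_logit <- glm(am ~ wt + hp, family = binomial, data = mtcars)
anova(fit_logit, test = "Chisq")
```

The same extractor functions (coef, vcov, residuals, fitted) apply to fit_logit, which is what makes it practical for packages such as car and lmtest to build on these objects.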

Analysis of Categorical and Count Data:

Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the glm function in the stats package. For over-dispersed data, see also the aod package, the dispmod package, and the glm.nb function in the recommended MASS package (associated with Venables and Ripley, Modern Applied Statistics with S, Fourth Ed., Springer, 2002), which fits negative-binomial GLMs. The pscl package includes functions for fitting zero-inflated and hurdle regression models to count data. The multinomial logit model is fit by the multinom function in the recommended nnet package, and ordered logit and probit models by the polr function in the MASS package. Also see the mlogit package for the multinomial logit model, the MNP package for the multinomial probit model, and the multinomRob package for the analysis of overdispersed multinomial data. The VGAM package is capable of fitting a very wide variety of fixed-effect regression models within a unified framework, including models for ordered and unordered categorical responses and for count data.
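
To make the preceding paragraph concrete, here is a short sketch that relies only on the stats and MASS packages. The data sets (warpbreaks from the standard distribution, quine and housing from MASS) and the model formulas are my own illustrative choices, not recommendations.

```r
library(MASS)  # recommended package: glm.nb(), polr(), quine and housing data

# Poisson regression for counts
fit_pois <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Quasi-Poisson model for over-dispersed counts: the coefficients are the
# same as in the Poisson fit, but standard errors are scaled by the
# estimated dispersion parameter
fit_qpois <- glm(breaks ~ wool + tension, family = quasipoisson,
                 data = warpbreaks)
disp <- summary(fit_qpois)$dispersion

# Negative-binomial GLM as an alternative treatment of over-dispersion
fit_nb <- glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)

# Ordered logit (proportional-odds) model for grouped data, with the
# cell counts supplied as weights
fit_polr <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                 data = housing, Hess = TRUE)
summary(fit_polr)
```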

There are other noteworthy facilities for analyzing categorical and count data.

  • The table function in the R-base base package and the xtabs and ftable functions in the stats package construct contingency tables.
  • The chisq.test and fisher.test functions in the stats package may be used to test for independence in two-way contingency tables.
  • The loglin function in the stats package fits hierarchical loglinear models to contingency tables by iterative proportional fitting; the loglm function in the MASS package provides a formula-based front end to loglin.
  • See the brglm and logistf packages for bias-reduction in binomial-response GLMs (useful, e.g., in cases of complete separation); the elrm package, which approximates exact conditional inference in logistic regression; the exactLoglinTest package for exact tests of loglinear models; the clogit function in the survival package for conditional logistic regression; and the vcd package for graphical displays of categorical data, including mosaic plots.
  • The gnm package estimates generalized nonlinear models, and can be used, e.g., to fit certain specialized models to mobility tables. The logmult package provides convenience functions based on gnm to fit log-multiplicative (e.g., UNIDIFF) and association (e.g., Goodman's RC) models. Also see the catspec package for estimating various special models for square tables.
  • As previously mentioned, the Multivariate task view covers correspondence analysis of multivariate categorical data.
  • See the betareg package for beta regression of data on rates and proportions, a topic closely associated with categorical data.
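
The table-building and testing functions listed above can be sketched as follows, using only the base and stats packages. The built-in data sets (mtcars, Titanic, HairEyeColor) are assumptions chosen for convenience.

```r
# Two equivalent ways to build a two-way contingency table
tab_x <- xtabs(~ am + vs, data = mtcars)
tab_t <- table(mtcars$am, mtcars$vs)

# A "flat" display of a four-way table
ftable(Titanic, row.vars = 1:2)

# Fisher's exact test of independence in the 2 x 2 table
fis <- fisher.test(tab_x)

# Pearson chi-square test of independence in a 4 x 4 table
# (hair colour by eye colour, collapsed over sex)
hec <- margin.table(HairEyeColor, c(1, 2))
chi <- chisq.test(hec)

# The same independence hypothesis expressed as a loglinear model,
# fit by iterative proportional fitting
fit_ll <- loglin(hec, margin = list(1, 2), fit = TRUE, print = FALSE)
fit_ll$pearson  # Pearson chi-square for the independence model
fit_ll$lrt      # likelihood-ratio chi-square
```

For the independence model, the Pearson statistic from loglin should agree with the one reported by chisq.test, which is a useful check when moving from simple tests to loglinear models.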

Other Regression Models:

It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and an even wider variety of models with contributed packages, in addition to those covered extensively in other task views.

  • Nonlinear regression: The nls function in the stats package fits nonlinear models by least-squares. The nlstools package includes several functions for assessing models fit by nls.
  • Mixed-effects models: The recommended nlme package, associated with Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS (Springer, 2000), fits linear (lme) and nonlinear (nlme) mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the glmmPQL function in the MASS package, or (preferably) by the glmer function in the lme4 package. The lme4 package also largely supersedes nlme for linear mixed models, via its lmer function. Unlike lme, lmer supports crossed random effects, but does not support autocorrelated or heteroscedastic individual-level errors. Also see the lmeSplines, lmm, and MCMCglmm packages.
  • Generalized estimating equations: The gee and geepack packages fit marginal models by generalized estimating equations; see the multgee package for GEE estimation of models for correlated nominal or ordinal multinomial responses.
  • Nonparametric regression analysis: This is one of the conspicuous strengths of R. The standard R distribution includes several functions for smoothing scatterplots, including loess.smooth and smooth.spline, both in the stats package. The loess function, also in the stats package, fits simple and multiple nonparametric-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended mgcv package and the gam package, the latter associated with Hastie and Tibshirani, Generalized Additive Models (Chapman and Hall, 1990); also see the VGAM package. Some other noteworthy contributed packages in this area are gss, which fits spline regressions; locfit, for local-polynomial regression (and also density estimation) (Loader, Local Regression and Likelihood, Springer, 1999); sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini, Applied Smoothing Techniques for Data Analysis, Oxford, 1997); np, which implements kernel smoothing methods for mixed data types; and acepack for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.
  • Quantile regression: Methods for linear, nonlinear, and nonparametric quantile regression are extensively provided by the quantreg package.
  • Regression splines: Parametric regression splines (as opposed to nonparametric smoothing splines), supported by the base-R splines package, can be used by lm, glm, and other statistical modeling functions that employ model formulas. See the bs (B-spline) and ns (natural spline) functions.
  • Very large data sets: The biglm package can fit linear and generalized linear models to data sets too large to fit in memory.
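
Several of the model classes in this list can be tried with base and recommended packages alone. The following is a sketch under that restriction; the data sets (Puromycin and mtcars from the standard distribution, Orthodont from nlme), the starting values, and the model formulas are illustrative assumptions.

```r
library(nlme)     # recommended package: lme() and the Orthodont data
library(splines)  # part of the standard distribution: ns() and bs()

# Nonlinear least squares: a Michaelis-Menten model
fit_nls <- nls(rate ~ Vm * conc / (K + conc),
               data = Puromycin, subset = state == "treated",
               start = c(Vm = 200, K = 0.05))

# Linear mixed-effects model with a random intercept for each subject
fit_lme <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Nonparametric regression by local polynomial fitting
fit_lo  <- loess(mpg ~ wt, data = mtcars)
pred_lo <- predict(fit_lo, newdata = data.frame(wt = c(2.5, 3, 3.5)))

# Parametric regression spline (natural cubic spline) inside lm
fit_ns <- lm(mpg ~ ns(wt, df = 3), data = mtcars)

coef(fit_nls)   # nonlinear-model parameter estimates
fixef(fit_lme)  # fixed-effect estimates from the mixed model
```

Because ns and bs simply generate columns of the model matrix, the spline fit is an ordinary lm object, so the usual extractor and diagnostic functions work on it unchanged.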

Other Statistical Methods:

Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists.

  • Missing Data: Several packages implement methods for handling missing data by multiple imputation, including the (conspicuously aging) mix, norm, and pan packages associated with Schafer, Analysis of Incomplete Multivariate Data (Chapman and Hall, 1997), and the newer and more actively maintained Amelia, mi, mice, and mitools packages (the last for drawing inferences from multiply imputed data sets). There are also some facilities for missing-data imputation in the general Hmisc package, which is described below, under "Collections". The mvnmle package finds the maximum likelihood estimates of means and covariances assuming multivariate-normal data. As well, some of the structural-equation modeling software discussed in the Psychometrics task view is capable of maximum-likelihood estimation of regression models with missing data. The VIM package has functions for visualizing missing and imputed values.
  • Bootstrapping and Other Resampling Methods: The recommended package boot, associated with Davison and Hinkley, Bootstrap Methods and Their Application (Cambridge, 1997), has excellent facilities for bootstrapping and some related methods. Also notable is the bootstrap package, associated with Efron and Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, 1993), which has functions for bootstrapping and jackknifing. In addition, see the functions Boot and bootCase in the car package, and nlsBoot in the nlstools package, along with the simpleboot package.
  • Model Selection: The step function in the stats package and the more broadly applicable stepAIC function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The regsubsets function in the leaps package performs all-subsets regression. The BMA package performs Bayesian model averaging. The standard AIC and BIC functions are also relevant to model selection. Beyond these packages and functions, see the MachineLearning task view.
  • Social Network Analysis: There are several packages useful for social network analysis, including sna for sociometric analysis of networks (e.g., blockmodeling), network for manipulating and displaying network objects, latentnet for latent position and cluster models for networks, ergm for exponential random graph models of networks, and the "metapackage" statnet, all associated with the statnet project. Also see the RSiena and PAFit packages for longitudinal social network analysis; and the multiplex package, which implements algebraic procedures for the analysis of multiple social networks.
  • Propensity Scores and Matching: See the Matching, MatchIt, optmatch, and PSAgraphics packages, and the matching function in the arm package (associated with Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge, 2007).
  • Demographic methods: The demography package includes functions for constructing life tables, for analyzing mortality, fertility, and immigration, and for forecasting population.
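
For the bootstrapping and model-selection entries above, a brief sketch using the recommended boot package and the stats package might look like this. The data set, the models, the number of bootstrap replications, and the helper name boot_coef are all illustrative assumptions.

```r
library(boot)  # recommended package: boot() and boot.ci()

set.seed(123)  # for reproducible resampling

# Statistic function in the form boot() expects: it receives the data
# and a vector of resampled row indices (a case-resampling bootstrap)
boot_coef <- function(data, idx) {
  coef(lm(mpg ~ wt + hp, data = data[idx, ]))
}

bs_out <- boot(mtcars, statistic = boot_coef, R = 500)

# Percentile confidence interval for the second coefficient (wt)
ci_wt <- boot.ci(bs_out, type = "perc", index = 2)

# Backward stepwise selection by AIC, starting from a larger model
fit_start <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fit_step  <- step(fit_start, direction = "backward", trace = 0)

# Information criteria for comparing the two models
AIC(fit_start, fit_step)
BIC(fit_start, fit_step)
```

Case resampling is only one bootstrap scheme; residual resampling and the other methods supported by boot may be more appropriate, depending on the design of the study.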

Collections of Functions:

There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are of interest to social scientists:

  • I have already made several references to the recommended MASS package, which is associated with Venables and Ripley's Modern Applied Statistics With S. Other recommended packages associated with this book are nnet, for fitting neural networks (but also, as mentioned, multinomial logistic-regression models); spatial for spatial statistics; and class, which contains functions for classification.
  • I've also mentioned the car package, associated with Fox and Weisberg, An R Companion to Applied Regression, Second Edition, which has a variety of functions supporting regression analysis, data exploration, and data transformation.
  • The Hmisc and rms packages (both mentioned above), associated with Harrell, Regression Modeling Strategies, Second Edition (Springer, 2015), provide functions for data manipulation, linear models, logistic-regression models, and survival analysis, many of them "front ends" to or modifications of other facilities in R.
  • The Zelig package integrates a wide array of statistical models of interest to social scientists (see the Zelig web site for details).


Jangman Hong contributed to the general revision of this task view, as did other individuals who made a variety of specific suggestions.

If I have omitted something of importance not covered in one of the other task views cited, or if a new package or function should be mentioned here, please let me know.

Compilation of this task view was partly supported by grants from the Social Sciences and Humanities Research Council of Canada.

John Fox