Statistics for the Social Sciences
Social scientists use a wide range of statistical methods, most of which are not unique to the social sciences. Indeed, most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.
Beyond the base and contributed packages, many of the methods commonly employed in the social sciences are covered extensively in other CRAN task views, including the following. I will try to minimize duplicating information present in these other task views, given here in alphabetical order.
It is noteworthy that this enumeration includes about a third of the CRAN task views. Moreover, there are other task views of potential interest to social scientists (such as the Graphics task view on statistical graphics); I suggest that you look at the list of all task views on CRAN.
Linear and Generalized Linear Models:
Univariate and multivariate linear models are fit by the lm function, generalized linear models by the glm function, both in the R-base stats package. Beyond summary and plot methods for lm and glm objects, there is a wide array of functions that support these objects.
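As a minimal sketch of the workflow just described, the following fits one linear and one generalized linear model and applies a few standard generics to the results. The mtcars data set (shipped with R) and the particular model formulas are my assumptions, chosen only for illustration.

```r
# Illustrative only: 'mtcars' ships with R; the formulas are arbitrary choices.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)                  # linear model
fit_glm <- glm(am ~ wt, family = binomial, data = mtcars)    # logistic regression

summary(fit_lm)          # coefficient table, R-squared, residual standard error
coef(fit_lm)             # regression coefficients
vcov(fit_lm)             # coefficient covariance matrix
head(fitted(fit_lm))     # fitted values of the response
head(residuals(fit_lm))  # residuals

# Sequential ("Type-I") tests, and an F test comparing nested models
fit_small <- lm(mpg ~ wt, data = mtcars)
anova(fit_lm)
nested <- anova(fit_small, fit_lm)

# Analysis of deviance for the GLM, with likelihood-ratio chi-square tests
anova(fit_glm, test = "Chisq")
```

The same generics (summary, coef, vcov, anova, and so on) typically work for model objects from other packages, which is why they are a convenient starting point.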
The anova function in the stats package constructs sequential ("Type-I") analysis of variance and analysis of deviance tables, and can also compute F and chi-square likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have anova methods as well, along with methods for other standard generics, such as coef, for returning regression coefficients; vcov, for the coefficient covariance matrix; residuals; and fitted, for fitted values of the response.) The generic Anova function in the car package (associated with Fox and Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011) constructs so-called "Type-II" and "Type-III" partial tests for linear, generalized linear, and many other classes of regression models.
Wald tests are provided by the coeftest and waldtest functions in the lmtest package, and the linearHypothesis function in the car package. All of these functions permit the use of heteroscedasticity and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the sandwich and car packages. Also see the glh.test function in the gmodels package. Nonlinear functions of parameters can be tested via the deltaMethod function in the car package. The multcomp package includes functions for multiple comparisons. The vuong function in the pscl package tests non-nested hypotheses for generalized linear and some other models. Also see the rms package for tests on linear and generalized linear models.
Regression diagnostics are available in the stats package (e.g., hatvalues, rstudent, and cooks.distance). These are augmented by other packages: several functions in the car package, which emphasizes graphical methods, e.g., crPlots for component-plus-residual plots and avPlots for added-variable plots (among others), in addition to numerical diagnostics, such as vif for (generalized) variance-inflation factors; the dr package for dimension reduction in regression, including SIR, SAVE, and pHd; and the lmtest package, which implements a variety of diagnostic tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). Other collinearity diagnostics are in the perturb package. Diagnostics may also be found in the rms package. See the influence.ME package for influential-data diagnostics for mixed-effects models.
Analysis of Categorical and Count Data:
Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the glm function in the stats package. For over-dispersed data, see also the aod package, the dispmod package, and the glm.nb function in the recommended MASS package (associated with Venables and Ripley, Modern Applied Statistics with S, Fourth Ed., Springer, 2002), which fits negative-binomial GLMs. The pscl package includes functions for fitting zero-inflated and hurdle regression models to count data. The multinomial logit model is fit by the multinom function in the recommended nnet package, and ordered logit and probit models by the polr function in the MASS package. Also see the mlogit package for the multinomial logit model, the MNP package for the multinomial probit model, and the multinomRob package for the analysis of overdispersed multinomial data. The VGAM package is capable of fitting a very wide variety of fixed-effect regression models within a unified framework, including models for ordered and unordered categorical responses and for count data.
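The glm-based models mentioned above can be sketched as follows. This is only an illustration under my own assumptions: the warpbreaks and UCBAdmissions data sets ship with R, and the model formulas are arbitrary. It uses only the stats package, so the contributed and recommended packages named above are not shown.

```r
# Illustrative only: both data sets ship with R; the formulas are arbitrary choices.

# Poisson regression for counts, and a quasi-Poisson fit that allows over-dispersion
pois  <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
qpois <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)
dispersion <- summary(qpois)$dispersion   # estimated dispersion parameter

# Binomial logit and probit models fit to a tabulated (frequency-weighted) response
ucb <- as.data.frame(UCBAdmissions)       # columns: Admit, Gender, Dept, Freq
ucb$admitted <- as.integer(ucb$Admit == "Admitted")
logit  <- glm(admitted ~ Gender + Dept, family = binomial,
              weights = Freq, data = ucb)
probit <- update(logit, family = binomial(link = "probit"))

# Loglinear model for the three-way contingency table, fit as a Poisson GLM
loglinear <- glm(Freq ~ Admit * Dept + Gender * Dept,
                 family = poisson, data = ucb)
summary(loglinear)
```

The Poisson and quasi-Poisson fits share the same coefficient estimates; only the standard errors differ, scaled by the estimated dispersion.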
There are other noteworthy facilities for analyzing categorical and count data.
The table function in the R-base base package and the xtabs and ftable functions in the stats package construct contingency tables. The chisq.test and fisher.test functions in the stats package may be used to test for independence in two-way contingency tables. The loglm function in the MASS package and the loglin function in the stats package fit hierarchical loglinear models to contingency tables, the former as a front end to loglin, the latter by iterative proportional fitting. Also see the clogit function in the recommended survival package for conditional logistic regression, and the vcd package for graphical displays of categorical data, including mosaic plots.
Other Regression Models:
It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and an even wider variety of models with contributed packages, in addition to those covered extensively in other task views.
The nls function in the stats package fits nonlinear models by least-squares. The nlstools package includes several functions for assessing models fit by nls.
The recommended nlme package fits linear (lme) and nonlinear (nlme) mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the glmmPQL function in the MASS package, or (preferably) by the glmer function in the lme4 package. The lme4 package also largely supersedes nlme for linear mixed models, via its lmer function. Unlike lme, lmer supports crossed random effects, but does not support autocorrelated or heteroscedastic individual-level errors. Also see the lmeSplines, lmm, and MCMCglmm packages.
Scatterplot smoothers include loess.smooth and smooth.spline, both in the stats package. The loess function, also in the stats package, fits simple and multiple nonparametric-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended mgcv package and the gam package, the latter associated with Hastie and Tibshirani, Generalized Additive Models (Chapman and Hall, 1990); also see the VGAM package. Some other noteworthy contributed packages in this area are gss, which fits spline regressions; locfit, for local-polynomial regression (and also density estimation) (Loader, Local Regression and Likelihood, Springer, 1999); sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini, Applied Smoothing Techniques for Data Analysis, Oxford, 1997); np, which implements kernel smoothing methods for mixed data types; and acepack for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.
The bs (B-spline) and ns (natural spline) functions in the splines package generate regression-spline bases that can be used with lm, glm, and other statistical modeling functions that employ model formulas.
Other Statistical Methods:
Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists.
For bootstrapping, see Boot and bootCase in the car package, and nlsBoot in the nlstools package, along with the simpleboot package.
The step function in the stats package and the more broadly applicable stepAIC function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The regsubsets function in the leaps package performs all-subsets regression. The BMA package performs Bayesian model averaging. The standard AIC and BIC functions are also relevant to model selection. Beyond these packages and functions, see the MachineLearning task view.
Also see the matching function in the arm package (associated with Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge, 2007).
There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are of interest to social scientists:
Acknowledgments:
Jangman Hong contributed to the general revision of this task view, as did other individuals who made a variety of specific suggestions.
If I have omitted something of importance not covered in one of the other task views cited, or if a new package or function should be mentioned here, please let me know.
Compilation of this task view was partly supported by grants from the Social Sciences and Humanities Research Council of Canada.
John Fox