Statistics for the Social Sciences
Social scientists use a wide range of statistical methods, most of which are not unique to the social sciences. Indeed, most statistical data analysis in the social sciences is covered by the facilities in the base and recommended packages, which are part of the standard R distribution. In the package descriptions below, I identify base and recommended packages on first mention; packages that are not specifically identified as "R-base" or "recommended" are contributed packages.
Beyond the base and contributed packages, many of the methods commonly employed in the social sciences are covered extensively in other CRAN task views, including the following. I will try to minimize duplicating information present in these other task views, given here in alphabetical order.
It is noteworthy that this enumeration includes about a third of the CRAN task views. Moreover, there are other task views of potential interest to social scientists (such as the Graphics task view on statistical graphics); I suggest that you look at the list of all task views on CRAN.
Linear and Generalized Linear Models:
Univariate and multivariate linear models are fit by the lm function, generalized linear models by the glm function, both in the R-base stats package. Beyond plot methods for glm objects, there is a wide array of functions that support these objects.
The anova function in the stats package constructs sequential ("Type-I") analysis of variance and analysis of deviance tables, and can also compute F and chi-square likelihood-ratio tests for nested models. (It is typical for other classes of statistical models in R to have anova methods as well, along with methods for other standard generics, such as coef, for returning regression coefficients; vcov, for the coefficient covariance matrix; and fitted, for fitted values of the response.) The generic Anova function in the car package (associated with Fox and Weisberg, An R Companion to Applied Regression, Second Edition, Sage, 2011) constructs so-called "Type-II" and "Type-III" partial tests for linear, generalized linear, and many other classes of regression models.
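As a minimal sketch of the base-R workflow described above, the following fits a linear model and a logistic GLM and applies the standard generics. The built-in mtcars data and the particular model formulas are illustrative assumptions, not part of the original task view.

```r
# Fit a linear model and a logistic GLM to the built-in mtcars data
# (data set and formulas chosen only for illustration)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt, family = binomial, data = mtcars)

coef(fit_lm)          # regression coefficients
vcov(fit_lm)          # coefficient covariance matrix
head(fitted(fit_lm))  # fitted values of the response

# Sequential ("Type-I") analysis of variance table
anova(fit_lm)

# Chi-square likelihood-ratio test comparing nested GLMs
fit_glm0 <- glm(am ~ 1, family = binomial, data = mtcars)
lr <- anova(fit_glm0, fit_glm, test = "Chisq")
lr
```

The same generics (coef, vcov, fitted, anova) generally work on model objects from other packages, which is what makes this pattern reusable.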
Wald tests of hypotheses about regression coefficients are provided by the waldtest function in the lmtest package and the linearHypothesis function in the car package. These functions permit the use of heteroscedasticity-consistent and heteroscedasticity/autocorrelation-consistent covariance matrices, as computed, e.g., by functions in the sandwich and car packages. Also see the glh.test function in the gmodels package. Nonlinear functions of parameters can be tested via the deltaMethod function in the car package. The multcomp package includes functions for multiple comparisons. The vuong function in the pscl package tests non-nested hypotheses for generalized linear and some other models. Also see the rms package for tests on linear and generalized linear models.
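To make concrete what a Wald test of a linear hypothesis computes, here is a hand-rolled sketch in base R. It is not the implementation used by lmtest or car; the mtcars data, the model, and the hypothesis matrix L are assumptions made for illustration.

```r
# Wald F test of H0: beta_hp = beta_qsec = 0, computed by hand
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
b   <- coef(fit)
V   <- vcov(fit)   # a robust covariance matrix could be substituted here

# Hypothesis matrix: one row per restriction, one column per coefficient
L <- rbind(c(0, 0, 1, 0),
           c(0, 0, 0, 1))
q  <- nrow(L)
Lb <- L %*% b

# Wald statistic in F form: (Lb)' [L V L']^{-1} (Lb) / q
wald_F  <- as.numeric(t(Lb) %*% solve(L %*% V %*% t(L)) %*% Lb) / q
p_value <- pf(wald_F, q, df.residual(fit), lower.tail = FALSE)

# For a linear model with the usual covariance matrix, this agrees
# with the F test comparing the nested models
fit0   <- lm(mpg ~ wt, data = mtcars)
nested <- anova(fit0, fit)
c(wald_F = wald_F, nested_F = nested$F[2], p_value = p_value)
```

Replacing V with a heteroscedasticity-consistent estimate changes only the middle matrix, which is why the packaged functions can accept alternative covariance matrices.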
Regression diagnostics such as cooks.distance are available in the stats package. These are augmented by other packages: several functions in the car package, which emphasizes graphical methods, e.g., crPlots for component-plus-residual plots and avPlots for added-variable plots (among others), in addition to numerical diagnostics, such as vif for (generalized) variance-inflation factors; the dr package for dimension reduction in regression, including SIR, SAVE, and pHd; and the lmtest package, which implements a variety of diagnostic tests (e.g., for heteroscedasticity, nonlinearity, and autocorrelation). Other collinearity diagnostics are in the perturb package. Diagnostics may also be found in the rms package. See the influence.ME package for influential-data diagnostics for mixed-effects models.
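The diagnostics in the stats package can be collected in a few lines. This sketch assumes the built-in mtcars data and uses the 4/n cutoff for Cook's distance, which is a common rule of thumb and not a recommendation from this task view; rstudent and hatvalues are further stats functions added here for illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Case-wise diagnostics from the stats package
diag_tab <- data.frame(
  rstudent = rstudent(fit),        # studentized residuals
  hat      = hatvalues(fit),       # leverages
  cooks_d  = cooks.distance(fit)   # Cook's distances
)

# Flag cases by a rule-of-thumb cutoff (an assumption, not a fixed standard)
cutoff  <- 4 / nrow(mtcars)
flagged <- diag_tab[diag_tab$cooks_d > cutoff, ]
flagged
```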
Analysis of Categorical and Count Data:
Binomial logit and probit models, as well as Poisson-regression and loglinear models for contingency tables (including models for "over-dispersed" binomial and Poisson data), can be fit with the glm function in the stats package. For over-dispersed data, see also the aod package, the dispmod package, and the glm.nb function in the recommended MASS package (associated with Venables and Ripley, Modern Applied Statistics with S, Fourth Ed., Springer, 2002), which fits negative-binomial GLMs. The pscl package includes functions for fitting zero-inflated and hurdle regression models to count data. The multinomial logit model is fit by the multinom function in the recommended nnet package, and ordered logit and probit models by the polr function in the MASS package. Also see the mlogit package for the multinomial logit model, the MNP package for the multinomial probit model, and the multinomRob package for the analysis of overdispersed multinomial data. The VGAM package is capable of fitting a very wide variety of fixed-effect regression models within a unified framework, including models for ordered and unordered categorical responses and for count data.
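The following sketch shows the calls named above using only recommended packages. The quine and housing data sets ship with MASS; the model formulas are illustrative assumptions.

```r
library(MASS)  # recommended package: glm.nb, polr, and the example data
library(nnet)  # recommended package: multinom

# Poisson regression for counts (days absent from school)
fit_pois <- glm(Days ~ Sex + Age, family = poisson, data = quine)

# Negative-binomial GLM for the same, over-dispersed, counts
fit_nb <- glm.nb(Days ~ Sex + Age, data = quine)
AIC(fit_pois, fit_nb)

# Ordered logit for a three-level ordered response, using cell
# frequencies as weights
fit_polr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# Multinomial logit treating the same response as unordered
fit_mnom <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                     data = housing, trace = FALSE)
```

Comparing the AIC of the Poisson and negative-binomial fits is one quick, informal check on over-dispersion.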
There are other noteworthy facilities for analyzing categorical and count data.
The table function in the R-base base package and the ftable function in the stats package construct contingency tables.
The fisher.test function in the stats package may be used to test for independence in two-way contingency tables.
The loglm function in the MASS package and the loglin function in the stats package fit hierarchical loglinear models to contingency tables, the former through a glm-style formula interface, the latter by iterative proportional fitting.
See also the clogit function in the survival package for conditional logistic regression, and the vcd package for graphical displays of categorical data, including mosaic plots.
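A short sketch of these table facilities, assuming the built-in mtcars, Titanic, and HairEyeColor data sets (none of which are mentioned in the task view itself). Here loglin is called from the stats package.

```r
# Two-way contingency table from raw data
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
tab

# Flat display of a four-way table
ftable(Titanic, row.vars = c("Class", "Sex"))

# Fisher's exact test of independence in the two-way table
ft <- fisher.test(tab)
ft$p.value

# Hierarchical loglinear model with all two-way associations,
# fit by iterative proportional fitting
fit_ll <- loglin(HairEyeColor,
                 margin = list(c(1, 2), c(1, 3), c(2, 3)),
                 fit = TRUE, print = FALSE)
c(lrt = fit_ll$lrt, df = fit_ll$df)
```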
Other Regression Models:
It is possible to fit a very wide variety of regression models with the facilities provided by the base and recommended packages, and an even wider variety of models with contributed packages, in addition to those covered extensively in other task views.
The nls function in the stats package fits nonlinear models by least squares. The nlstools package includes several functions for assessing models fit by nls.
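As an illustration of nonlinear least squares, this sketch fits a Michaelis-Menten curve to the built-in Puromycin data. The data set, the model, and the starting values are assumptions chosen for the example.

```r
# Michaelis-Menten model: rate = Vm * conc / (K + conc)
treated <- subset(Puromycin, state == "treated")

fit_nls <- nls(rate ~ Vm * conc / (K + conc),
               data  = treated,
               start = list(Vm = 200, K = 0.05))  # rough starting values

coef(fit_nls)     # estimated Vm and K
summary(fit_nls)  # standard errors and residual standard error
```

Unlike lm, nls needs starting values for the parameters, and poor choices can prevent convergence.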
The recommended nlme package fits linear (lme) and nonlinear (nlme) mixed-effects models, commonly used in the social sciences for hierarchical and longitudinal data. Generalized linear mixed-effects models may be fit by the glmmPQL function in the MASS package, or (preferably) by the glmer function in the lme4 package. The lme4 package also largely supersedes nlme for linear mixed models, via its lmer function; lmer supports crossed random effects, but does not support autocorrelated or heteroscedastic individual-level errors. Also see the lmeSplines, lmm, and MCMCglmm packages.
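A minimal sketch of a linear mixed-effects model with the recommended nlme package, using the Orthodont data that ships with it. The model specification is an illustrative assumption. The second fit adds heteroscedastic individual-level errors, the kind of error structure discussed above.

```r
library(nlme)  # recommended package

# Random-intercept model for longitudinal growth data
fit_lme <- lme(distance ~ age + Sex,
               random = ~ 1 | Subject,
               data   = Orthodont)
fixef(fit_lme)        # fixed-effect estimates
head(ranef(fit_lme))  # predicted random intercepts by subject

# Same model, allowing a different error variance for each sex
fit_het <- lme(distance ~ age + Sex,
               random  = ~ 1 | Subject,
               weights = varIdent(form = ~ 1 | Sex),
               data    = Orthodont)
anova(fit_lme, fit_het)  # likelihood-ratio comparison
```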
Smoothing splines are fit by the smooth.spline function in the stats package. The loess function, also in the stats package, fits simple and multiple nonparametric-regression models by local polynomial regression. Generalized additive models are covered by several packages, including the recommended mgcv package and the gam package, the latter associated with Hastie and Tibshirani, Generalized Additive Models (Chapman and Hall, 1990); also see the VGAM package. Some other noteworthy contributed packages in this area are gss, which fits spline regressions; locfit, for local-polynomial regression (and also density estimation) (Loader, Local Regression and Likelihood, Springer, 1999); sm, for a variety of smoothing techniques, including for regression (Bowman and Azzalini, Applied Smoothing Techniques for Data Analysis, Oxford, 1997); np, which implements kernel smoothing methods for mixed data types; and acepack for ACE (alternating conditional expectations) and AVAS (additivity and variance stabilization) nonparametric transformation of the response and explanatory variables in regression.
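The smoothers available in the standard distribution can be compared on a single predictor. This sketch assumes the built-in cars data; the span, degree, and default smoothing choices are illustrative, not recommendations.

```r
library(mgcv)  # recommended package: gam

# Smoothing spline (smoothing parameter chosen by the default criterion)
fit_ss <- smooth.spline(cars$speed, cars$dist)

# Local polynomial regression
fit_lo <- loess(dist ~ speed, data = cars, span = 0.75, degree = 2)

# Generalized additive model with one smooth term
fit_gam <- gam(dist ~ s(speed), data = cars)

# Evaluate all three fits on a common grid inside the data range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 50))
pred <- data.frame(
  speed  = grid$speed,
  spline = predict(fit_ss, grid$speed)$y,
  loess  = predict(fit_lo, newdata = grid),
  gam    = as.numeric(predict(fit_gam, newdata = grid))
)
head(pred)
```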
Spline terms can be used with glm and other statistical modeling functions that employ model formulas. See the ns (natural spline) function.
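Because ns returns a basis matrix, it can be placed directly in a model formula. This sketch assumes the built-in cars data, a Gaussian family, and three degrees of freedom, all chosen only for illustration.

```r
library(splines)  # part of the standard R distribution

# Natural-spline term inside a glm formula
fit_ns  <- glm(dist ~ ns(speed, df = 3), family = gaussian, data = cars)

# The straight-line model is nested within the natural-spline model,
# so the two can be compared with an F test
fit_lin <- glm(dist ~ speed, family = gaussian, data = cars)
cmp <- anova(fit_lin, fit_ns, test = "F")
cmp
```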
Other Statistical Methods:
Here is a brief survey of implementations in R of other statistical methods commonly used by social scientists.
For bootstrapping, see bootCase in the car package and nlsBoot in the nlstools package, along with the simpleboot package.
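To show what case resampling does, here is a hand-rolled case bootstrap in base R. It is a sketch of the general technique, not the code used by bootCase or nlsBoot; the mtcars data, the model, and the number of replicates B are assumptions.

```r
set.seed(123)  # for reproducibility of the resampling

fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
B   <- 500     # number of bootstrap replicates (illustrative)

# Resample rows with replacement and refit the model each time
boot_coefs <- t(replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
}))

# Bootstrap standard errors next to the model-based ones
boot_se <- apply(boot_coefs, 2, sd)
cbind(model_se = sqrt(diag(vcov(fit))), boot_se = boot_se)
```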
The step function in the stats package and the more broadly applicable stepAIC function in the MASS package perform forward, backward, and forward-backward stepwise selection for a variety of statistical models. The regsubsets function in the leaps package performs all-subsets regression. The BMA package performs Bayesian model averaging. The standard BIC function is also relevant to model selection. Beyond these packages and functions, see the MachineLearning task view.
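A brief sketch of stepwise selection with the step function, assuming the built-in mtcars data and a full model with every available predictor. Setting k = log(n) switches the penalty from AIC to BIC.

```r
full <- lm(mpg ~ ., data = mtcars)

# Forward-backward stepwise selection by AIC, starting from the full model
sel_aic <- step(full, direction = "both", trace = 0)

# Same search with a BIC-type penalty
sel_bic <- step(full, direction = "both",
                k = log(nrow(mtcars)), trace = 0)

formula(sel_aic)
formula(sel_bic)

# Compare the candidate models on both criteria
AIC(full, sel_aic, sel_bic)
BIC(full, sel_aic, sel_bic)
```

Stepwise results depend on the starting model and the penalty, so the selected formulas are best treated as candidates for further checking.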
For matching, see the matching function in the arm package (associated with Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge, 2007).
There are some packages that are so heterogeneous that they are difficult to classify, yet contain functions (typically in multiple domains) that are of interest to social scientists:
Jangman Hong contributed to the general revision of this task view, as did other individuals who made a variety of specific suggestions.
If I have omitted something of importance not covered in one of the other task views cited, or if a new package or function should be mentioned here, please let me know.
Compilation of this task view was partly supported by grants from the Social Sciences and Humanities Research Council of Canada.