Official Statistics & Survey Methodology

This CRAN task view contains a list of packages that include methods typically used in official statistics and survey methodology. Many packages provide functionality for more than one of the topics listed below. Therefore this list is not a strict categorization and packages can be listed more than once. Certain data import/export facilities for frequently used statistical software such as SPSS, SAS or Stata are mentioned at the end of the task view.

**Complex Survey Design: Sampling and Sample Size Calculation**

- Package sampling includes many different algorithms (Brewer, Midzuno, pps, systematic, Sampford, balanced (cluster or stratified) sampling via the cube method, etc.) for drawing survey samples and calibrating the design weights.
- R package surveyplanning includes tools for sample survey planning, including sample size calculation, estimation of expected precision for the estimates of totals, and calculation of optimal sample size allocation.
- Package simFrame includes a fast (compiled C-Code) version of Midzuno sampling.
- The pps package contains functions to select samples using pps sampling. Stratified simple random sampling is also possible, as is the computation of joint inclusion probabilities for Sampford's method of pps sampling.
- Package stratification allows univariate stratification of survey populations with a generalisation of the Lavallee-Hidiroglou method.
- Package SamplingStrata offers an approach for choosing the best stratification of a sampling frame in a multivariate and multidomain setting, where the sample sizes in each stratum are determined in order to satisfy accuracy constraints on target estimates. To evaluate the distribution of target variables in different strata, information from the sampling frame, or data from previous rounds of the same survey, may be used.
- The package BalancedSampling selects balanced and spatially balanced probability samples in multi-dimensional spaces with any prescribed inclusion probabilities. It also includes the local pivot method, the cube and local cube method and a few more methods.
- Package gridsample selects PSUs within user-defined strata using gridded population data, given desired numbers of sampled households within each PSU. The population densities used to create PSUs are drawn from rasters.
- Package PracTools contains functions for sample size calculation for survey samples using stratified or clustered one-, two-, and three-stage sample designs as well as functions to compute variance components for multistage designs and sample sizes in two-phase designs.
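
As a rough illustration of what sample size allocation involves, the following base R sketch compares proportional and Neyman allocation for a stratified simple random sample. The stratum sizes `N_h`, the stratum standard deviations `S_h` and the total sample size `n` are made-up values, and the helper `neyman_allocation()` is hypothetical; it is not taken from any of the packages above, which offer their own and more complete interfaces.

```r
# Hypothetical helper: Neyman allocation n_h = n * N_h * S_h / sum(N_h * S_h),
# assuming equal per-unit costs in all strata.
neyman_allocation <- function(n, N_h, S_h) {
  stopifnot(length(N_h) == length(S_h), all(N_h > 0), all(S_h >= 0))
  n * N_h * S_h / sum(N_h * S_h)
}

# Made-up frame: three strata with population sizes and standard deviations
N_h <- c(small = 5000, medium = 1500, large = 500)
S_h <- c(small = 2, medium = 10, large = 40)
n   <- 300

prop   <- n * N_h / sum(N_h)              # proportional allocation
neyman <- neyman_allocation(n, N_h, S_h)  # more sample where variability is high

round(cbind(proportional = prop, neyman = neyman), 1)
```

In practice the allocated sizes still have to be rounded and capped at the stratum sizes, which this sketch does not do.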

**Complex Survey Design: Point and Variance Estimation and Model Fitting**

- Package survey works with survey samples. It allows specifying a complex survey design (stratified sampling, cluster sampling, multi-stage sampling and pps sampling with or without replacement). Once the survey design is specified with the function `svydesign()`, point and variance estimates can be computed. The resulting object can be used to estimate (Horvitz-Thompson) totals, means, ratios and quantiles for domains or the whole survey sample, and to fit regression models. Variance estimation for means, totals and ratios can be done either by Taylor linearization or resampling (BRR, jackknife, bootstrap or user-defined).
- The methods from the survey package are called from package srvyr using the dplyr syntax, i.e., piping, verbs like `group_by` and `summarize`, and other dplyr-inspired syntactic style when calculating summary statistics on survey data.
- Package convey extends package survey -- see the topic about indicators below.
- Package laeken provides functions to estimate certain Laeken indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient) including their variance for domains and strata using a calibrated bootstrap.
- Package simFrame allows comparing (user-defined) point and variance estimators in a simulation environment. It provides a framework for comparing different point and variance estimators under different survey designs as well as different conditions regarding missing values, representative and non-representative outliers.
- The lavaan.survey package provides a wrapper function for packages survey and lavaan. It can be used for fitting structural equation models (SEM) on samples from complex designs. Using the design object functionality from package survey, lavaan objects are re-fit (corrected) with the `lavaan.survey()` function of package lavaan.survey. This allows for the incorporation of clustering, stratification, sampling weights, and finite population corrections into a SEM analysis. `lavaan.survey()` also accommodates replicate weights and multiply imputed datasets.
- Package vardpoor allows calculating linearisations of several nonlinear population statistics, variance estimation for sample surveys by the ultimate cluster method, variance estimation for longitudinal and cross-sectional measures and measures of change for any stage cluster sampling designs.
- The package rpms fits a linear model to survey data in each node obtained by recursively partitioning the data. The algorithm accounts for one stage of stratification and clustering as well as unequal probability of selection.
- Package svyPVpack extends package survey. This package deals with data which stem from survey designs and has been created to handle data from large scale assessments like PISA, PIAAC etc.
- Package weights provides a variety of functions for producing simple weighted statistics, such as weighted Pearson's correlations, partial correlations, Chi-Squared statistics, histograms and t-tests.
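
The workflow described for package survey (specify the design once, then compute estimates from the design object) might look roughly like the sketch below. It assumes that package survey is installed and uses the `apistrat` example data shipped with it; the choice of variables is purely illustrative, and `svymean()`, `svytotal()`, `svyby()` and `svyglm()` are further functions of package survey that are not named in the text above.

```r
library(survey)

# 'apistrat' ships with the survey package: a sample of schools stratified
# by school type (stype), with sampling weights (pw) and stratum sizes (fpc).
data(api)

# Describe the design: no clustering (id = ~1), strata, weights and
# finite population correction
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    fpc = ~fpc, data = apistrat)

# Point estimates with linearization-based standard errors
svymean(~api00, dstrat)
svytotal(~enroll, dstrat)

# Domain estimates: mean of api00 by school type
svyby(~api00, ~stype, dstrat, svymean)

# Design-based regression model
fit <- svyglm(api00 ~ ell + meals, design = dstrat)
summary(fit)
```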

**Complex Survey Design: Calibration**

- Package survey allows for post-stratification, generalized raking/calibration, GREG estimation and trimming of weights.
- The `calib()` function in package sampling allows calibrating for nonresponse (with response homogeneity groups) for stratified samples.
- The `calibWeights()` function in package laeken is a possibly faster (depending on the example) implementation of parts of `calib()` from package sampling.
- The `calibSample()` function in package simPop is potentially faster than the two previously mentioned functions and is more user-friendly. `calibVars()` can be used to construct a matrix of binary variables for calibration. `calibPop()` is used to calibrate population person-within-household data using a simulated annealing approach.
- Package icarus focuses on calibration and reweighting in survey sampling and was designed to provide a familiar setting in R for users of the SAS macro `Calmar`.
- Package reweight allows for calibration of survey weights for categorical survey data so that the marginal distributions of certain variables fit more closely to those from a given population, but does not allow complex sampling designs.
- The package CalibrateSSB includes a function to calculate weights and estimates for panel data with non-response.
- Package Frames2 allows point and interval estimation in dual frame surveys. When two probability samples (one from each frame) are drawn, the information collected is suitably combined to get estimators of the parameter of interest.
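
To make the idea of raking concrete, here is a minimal base R sketch of iterative proportional fitting of survey weights to known population margins. The helper `rake_weights()`, the toy sample and the population totals are all hypothetical; this is not the implementation used by any of the packages listed above, which are more general and more careful about convergence and bounds on the weights.

```r
# Hypothetical helper: rake weights w so that the weighted counts of the
# factors named in `margins` match the given population totals.
rake_weights <- function(data, w, margins, tol = 1e-8, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    w_old <- w
    for (v in names(margins)) {
      # current weighted count per level of variable v
      current <- vapply(split(w, data[[v]]), sum, numeric(1))
      ratio   <- margins[[v]][names(current)] / current
      w       <- w * ratio[as.character(data[[v]])]  # adjust each unit
    }
    if (max(abs(w - w_old)) < tol) break             # weights have stabilised
  }
  unname(w)
}

# Made-up sample of six persons with equal design weights
smp <- data.frame(
  sex = c("f", "f", "m", "m", "f", "m"),
  age = c("young", "old", "young", "old", "old", "young")
)
d   <- rep(10, nrow(smp))
pop <- list(sex = c(f = 32, m = 28), age = c(young = 25, old = 35))

w <- rake_weights(smp, d, pop)
tapply(w, smp$sex, sum)   # should reproduce the sex totals
tapply(w, smp$age, sum)   # should reproduce the age totals
```

The sketch assumes that every category occurs in the sample and that the margins add up to the same population total.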

**Editing and Visual Inspection of Microdata**

Editing tools:

- Package validate includes rule management and data validation, and package validatetools checks and simplifies sets of validation rules.
- Package errorlocate includes error localisation based on the principle of Fellegi and Holt. It supports categorical and/or numeric data and linear equalities, inequalities and conditional rules. The package includes a configurable backend for MIP-based error localization.
- Package editrules converts readable linear (in)equalities into matrix form.
- Package deducorrect depends on package editrules and applies deductive correction of simple rounding, typing and sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled. To determine which values are changed the Levenshtein metric is applied.
- The package rspa implements functions to minimally adjust numerical records so they obey (in)equation restrictions.
- Package SeleMix can be used for selective editing for continuous scaled data. A mixture model (Gaussian contamination model) based on response(s) y and a dependent set of covariates is fit to the data to quantify the impact of errors on the estimates.
- Package rrcovNA provides robust location and scatter estimation and robust principal component analysis with high breakdown point for incomplete data. It is therefore applicable to find representative and non-representative outliers.

Visual tools:

- Package VIM is designed to visualize missing values using suitable plot methods. It can be used to analyse the structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots where the information of missing values from specified variables are highlighted in selected variables. It also comes with a graphical user interface.
- Package treemap provides treemaps. A treemap is a space-filling visualization of aggregates of data with hierarchical structures. Colors can be used to highlight differences between comparable aggregates.

**Imputation**

EM-based Imputation Methods:

- Package mi provides iterative EM-based multiple Bayesian regression imputation of missing values and model checking of the regression models used. The regression models for each variable can also be user-defined. The data set may consist of continuous, semi-continuous, binary, categorical and/or count variables.
- Package mice provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary, categorical and/or count variables.
- Package mitools provides tools to perform analyses and combine results from multiply-imputed datasets.
- Package Amelia provides multiple imputation where first bootstrap samples with the same dimensions as the original data are drawn, and then used for EM-based imputation. It is also possible to impute longitudinal data. The package in addition comes with a graphical user interface.
- Package VIM provides EM-based multiple imputation (function `irmi()`) using robust estimations, which allows dealing adequately with data including outliers. It can handle data consisting of continuous, semi-continuous, binary, categorical and/or count variables.
- Single imputation methods are included or called from other packages by the package simputation. It supports regression (standard, M-estimation, ridge/lasso/elasticnet), hot-deck methods (powered by VIM), randomForest, EM-based, and iterative randomForest imputation.
- Package mix provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary or categorical variables, but methods for semi-continuous variables are missing.
- Package pan provides multiple imputation for multivariate panel or clustered data.
- Package norm provides EM-based multiple imputation for multivariate normal data.
- Package cat provides EM-based multiple imputation for multivariate categorical data.
- Package MImix provides tools to combine results for multiply-imputed data using mixture approximations.
- Package robCompositions provides iterative model-based imputation for compositional data (function `impCoda()`).
- Package missForest uses the functionality of randomForest to impute missing values in an iterative single-imputation fashion. It can deal with almost any kind of variables except semi-continuous ones. Although the underlying bootstrap approach of random forests ensures that multiple runs yield multiple imputations, the additional uncertainty of imputation is only considered when choosing the random forest method of package mice.

Nearest Neighbor Imputation Methods:

- Package VIM provides an implementation of the popular sequential and random (within a domain) hot-deck algorithm.
- VIM also provides a fast k-nearest neighbor (knn) algorithm which can be used for large data sets. It uses a modification of the Gower Distance for numerical, categorical, ordered, continuous and semi-continuous variables.
- Package yaImpute performs popular nearest neighbor routines for imputation of continuous variables where different metrics and methods can be used for determining the distance between observations.
- Package robCompositions provides knn imputation for compositional data (function `impKNNa()`) using the Aitchison distance and adjustment of the nearest neighbor.
- Package rrcovNA provides an algorithm for (robust) sequential imputation (functions `impSeq()` and `impSeqRob()`) by minimizing the determinant of the covariance of the augmented data matrix. Its application is limited to continuous scaled data.
- Package impute on Bioconductor provides knn imputation of continuous variables.

Copula-based Imputation Methods:

- The S4 class package CoImp imputes multivariate missing data by using conditional copula functions. The imputation procedure is semiparametric: the margins are non-parametrically estimated through local likelihood of low-degree polynomials while a range of different parametric models for the copula can be selected by the user. The missing values are imputed by drawing observations from the conditional density functions by means of the Hit or Miss Monte Carlo method. It works either for a matrix of continuous scaled variables or a matrix of discrete distributions.

Miscellaneous Imputation Methods:

- Package missMDA allows imputing incomplete continuous variables by principal component analysis (PCA) or categorical variables by multiple correspondence analysis (MCA).
- Package mice (function `mice.impute.pmm()`) and package Hmisc (function `aregImpute()`) allow predictive mean matching imputation.
- Package VIM allows visualizing the structure of missing values using suitable plot methods. It also comes with a graphical user interface.
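
The idea behind predictive mean matching can be sketched in a few lines of base R. This is a deliberately simplified, single-imputation version on simulated data: the variables `x` and `y`, the number of donors `k` and the missingness pattern are assumptions for illustration, and the sketch omits the parameter uncertainty that proper multiple imputation procedures such as those in mice or Hmisc account for.

```r
set.seed(1)

# Simulated data: y depends linearly on x, 20 values of y are set to missing
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 20)] <- NA

obs  <- !is.na(y)
fit  <- lm(y ~ x, subset = obs)                     # model on observed cases
pred <- predict(fit, newdata = data.frame(x = x))   # predictions for all units

k <- 5                                              # number of candidate donors
y_imp <- y
for (i in which(!obs)) {
  # the k observed units whose predicted value is closest to that of unit i
  donors <- order(abs(pred[obs] - pred[i]))[seq_len(k)]
  # borrow the observed value of one randomly chosen donor
  y_imp[i] <- y[obs][donors[sample.int(k, 1)]]
}

summary(y_imp)
```

Because imputed values are always borrowed from observed units, they stay within the range of the observed data, which is one reason the method is popular for skewed or bounded variables.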

**Statistical Disclosure Control**

- Package sdcMicro can be used for the generation of confidential (micro)data, i.e. for the generation of public- and scientific-use files. The package also comes with a graphical user interface.
- Package sdcTable can be used to provide confidential (hierarchical) tabular data. It includes the HITAS and the HYPERCUBE technique and uses linear programming packages (Rglpk and lpSolveAPI) for solving (a large number of) linear programs.
- An interface to the package sdcTable is provided by package easySdcTable.
- Package sdcHierarchies provides methods to generate, modify, import and convert nested hierarchies that are often used when defining inputs for statistical disclosure control methods.
- Package SmallCountRounding can be used to protect frequency tables by rounding necessary inner cells so that cross-classifications to be published are safe.

**Seasonal Adjustment and Forecasting**

- Decomposition of time series can be done with the function `decompose()`, or in a more advanced way with the function `stl()`, both from the base stats package. Decomposition is also possible with the `StructTS()` function, which can also be found in the stats package.
- Many powerful tools can be accessed via packages x12 and x12GUI and package seasonal. x12 provides a wrapper function for the X12 binaries, which have to be installed first. It uses an S4-class interface for batch processing of multiple time series. x12GUI provides a graphical user interface for the X12-Arima seasonal adjustment software. Package seasonal offers less functionality but supports the SEATS spec.
- Given the large pool of individual forecasts in survey-type forecasting, forecast combination techniques from package GeomComb can be useful. It can also handle missing values in the time series.
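
The three decomposition functions from the stats package mentioned above can be tried on the `AirPassengers` series that ships with R. The sketch below is only one plausible way to use them: the choice of a multiplicative decomposition and of the log transformation is an assumption made for this particular series, not a general recommendation.

```r
# Monthly airline passenger counts, shipped with base R (datasets package)
y <- AirPassengers

# Classical decomposition by moving averages; multiplicative because the
# seasonal swings grow with the level of the series
dec <- decompose(y, type = "multiplicative")
sa_classical <- y / dec$seasonal          # seasonally adjusted series

# Loess-based decomposition; stl() is additive, so work on the log scale
fit <- stl(log(y), s.window = "periodic")
sa_stl <- exp(log(y) - fit$time.series[, "seasonal"])

# Basic structural model fitted by maximum likelihood
bsm <- StructTS(log(y), type = "BSM")

plot(cbind(original = y, classical = sa_classical, stl = sa_stl),
     main = "Seasonally adjusted AirPassengers")
```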

**Statistical Matching and Record Linkage**

- Package StatMatch provides functions to perform statistical matching between two data sources sharing a number of common variables. It creates a synthetic data set after matching of two data sources via a likelihood approach or via hot-deck.
- Package RecordLinkage provides functions for linking and deduplicating data sets.
- Package MatchIt allows nearest neighbor matching, exact matching, optimal matching and full matching amongst other matching methods. If two data sets have to be matched, the data must come as one data frame including a factor variable which includes information about the membership of each observation.
- Package stringdist can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler).
- Package XBRL allows the extraction of business financial information from XBRL Documents.
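
Edit-based string distances are the basic building block of fuzzy record linkage. As a small illustration that needs no additional packages, the sketch below uses `adist()` from base R (package utils) as a stand-in for the richer set of metrics in stringdist; the two name lists are made up, and a real linkage task would add blocking, several comparison fields and a decision rule.

```r
# Two small, made-up name lists to be linked
a <- c("Jon Smith", "Maria Gonzales", "Peter Mueller")
b <- c("Maria Gonzalez", "John Smith", "Petra Miller", "Peter Muller")

# Generalized Levenshtein (edit) distance matrix: rows = a, columns = b
d <- adist(a, b)
dimnames(d) <- list(a, b)
d

# For every record in a, pick the closest candidate in b
best <- apply(d, 1, which.min)
data.frame(a = a,
           match = b[best],
           distance = d[cbind(seq_along(a), best)])
```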

**Small Area Estimation**

- Package sae includes functions for small area estimation, for example, direct estimators, the empirical best predictor and composite estimators.
- Package emdi includes further functionality for supporting the user even beyond estimation, for example, for performing model diagnostic analyses, visualizing, and exporting the results for further processing. It includes built-in functionality for transforming variables and includes bootstrap methods for variance estimation. It also includes export to Excel and applies parallel computing in an automated manner.
- Package rsae provides functions to estimate the parameters of the basic unit-level small area estimation (SAE) model (aka nested error regression model) by means of maximum likelihood (ML) or robust ML. On the basis of the estimated parameters, robust predictions of the area-specific means are computed (incl. MSE estimates; parametric bootstrap). The current version (rsae 0.4-x) does not allow for categorical independent variables.
- Package nlme provides facilities to fit Gaussian linear and nonlinear mixed-effects models and lme4 provides facilities to fit linear and generalized linear mixed-effects models, both used in small area estimation.
- The hbsae package provides functions to compute small area estimates based on a basic area or unit-level model. The model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. Auxiliary information can be either counts resulting from categorical variables or means from continuous population information.
- With package JoSAE, point and variance estimates for the generalized regression (GREG) estimator and a unit-level empirical best linear unbiased prediction (EBLUP) estimator can be computed at domain level. It basically provides wrapper functions to the nlme package that is used to fit the basic random effects models.
- The package BayesSAE also allows for Bayesian methods, ranging from the basic Fay-Herriot model to its improvements such as You-Chapman models, unmatched models, spatial models and so on.

**Indices, Indicators, Tables and Visualisation of Indicators**

- Package laeken provides functions to estimate popular risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient). In addition, standard and robust methods for tail modeling of Pareto distributions are provided for semi-parametric estimation of indicators from continuous univariate distributions such as income variables.
- Package convey estimates variances on indicators of income concentration and poverty using familiar linearized and replication-based designs created by the survey package such as the Gini coefficient, Atkinson index, at-risk-of-poverty threshold, and more than a dozen others.
- Package ineq computes various inequality measures (Gini, Theil, entropy, among others), concentration measures (Herfindahl, Rosenbluth), and poverty measures (Watts, Sen, SST, and Foster). It also computes and draws empirical and theoretical Lorenz curves as well as Pen's parade. It is not designed to deal with sampling weights directly (these could only be emulated via `rep(x, weights)`).
- Package IC2 includes three inequality indices: extended Gini, Atkinson and Generalized Entropy. It can deal with sampling weights and subgroup decomposition is supported.
- Package DHS.rates estimates key indicators (especially fertility rates) and their variances for the Demographic and Health Survey (DHS) data.
- Function `priceIndex()` from package micEconIndex allows estimating the Paasche, the Fisher and the Laspeyres price indices. For estimating quantities (of goods, for example), function `quantityIndex()` can be used.
- Package tmap offers a layer-based way to make thematic maps, like choropleths and bubble maps.
- Package rworldmap outlines how to map country referenced data and supports users in visualising their own data. Examples are given, e.g., maps for the World Bank and UN. It also provides new ways to visualise maps.
- Package rrcov3way provides robust methods for multiway data analysis, applicable also for compositional data.
- Package robCompositions provides methods for compositional tables including statistical tests.
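
The `rep(x, weights)` workaround mentioned for package ineq can be illustrated without the package itself. The sketch below defines a hypothetical helper `gini()` based on the mean absolute difference and applies it to made-up incomes and integer weights; it only works when the weights are whole numbers, which is one reason packages with genuine support for sampling weights (such as laeken, convey or IC2) are preferable for survey data.

```r
# Hypothetical helper: Gini coefficient from the mean absolute difference,
# G = sum_i sum_j |x_i - x_j| / (2 * n^2 * mean(x))
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

income  <- c(1, 2, 3, 4)   # made-up incomes
weights <- c(3, 1, 1, 3)   # made-up integer sampling weights

gini(income)                 # unweighted
gini(rep(income, weights))   # weights emulated by replicating observations
```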

**Microsimulation**

- Using package simPop one can simulate populations from surveys based on auxiliary data with model-based methods or synthetic reconstruction methods. Hierarchical and cluster structures (such as households) can be considered, and the methods take account of samples collected under complex sample designs. Calibration tools (iterative proportional fitting, iterative proportional updating) and combinatorial optimization tools (simulated annealing) are also available. The code is optimized for fast computations. The package is based on an S4 class implementation. The simulated population can serve as basis data for microsimulation studies.
- The MicSim package includes methods for microsimulations. Given an initial population, mortality rates, divorce rates, marriage rates, education changes, etc. and their transition matrix can be defined and included for the simulation of future states of the population. The package does not contain compiled code but functionality to run the microsimulation in parallel is provided.
- Package sms provides facilities to simulate micro-data from given area-based macro-data. Simulated annealing is used to best satisfy the available description of an area. For computational issues, the calculations can be run in parallel mode.
- Package synthpop uses regression tree methods to simulate synthetic data from given data. It is suitable for producing synthetic data when the data have no hierarchical or cluster information (such as households) and were not collected with a complex sampling design.
- Package saeSim provides tools for the simulation of data in the context of small area estimation.

**Additional Packages and Functionalities**

- The questionr package contains a set of functions to make the processing and analysis of surveys easier. It provides interactive shiny apps and addins for data recoding, contingency tables, dataset metadata handling, and several convenience functions.

Data Import and Export:

- Package SAScii imports ASCII files directly into R using only a SAS input script, which is parsed and converted into arguments for a read.fwf call. This is useful whenever SAS scripts for importing data are already available.
- The foreign package includes tools for reading data from SAS Xport (function `read.xport()`), Stata (function `read.dta()`), SPSS (function `read.spss()`) and various other formats. It provides facilities to write files to various formats, see function `write.foreign()`.
- Also the package haven imports and exports SAS, Stata and SPSS (function `read_spss()`) files. The package is more efficient for loading heavy data sets and it handles the labelling of variables and values in an advanced manner.
- Also the package Hmisc provides tools to read data sets from SPSS (function `spss.get()`) or Stata (function `stata.get()`).
- The pxR package provides a set of functions for reading and writing PC-Axis files, used by different statistical organizations around the globe for dissemination of their (multidimensional) tables.
- With package prevR and its function `import.dhs()` it is possible to directly import data from the Demographic Health Survey.
- Function `describe()` from package questionr describes the variables of a dataset that might include labels imported with the foreign or memisc packages.
- Package OECD searches and extracts data from the OECD.
- Package Rilostat contains tools to download data from the International Labour Organisation database together with search and manipulation utilities. It can also import ilostat data that are available in their database in SDMX format.
- Access to Finnish open government data is provided by package sorvi.
- Tools to download data from the Eurostat database together with search and manipulation utilities are included in package eurostat.
- Package acs downloads, manipulates, and presents the American Community Survey and decennial data from the US Census.
- A wrapper for the U.S. Census Bureau APIs that returns data frames of Census data and metadata is implemented in package censusapi.
- Package censusGeography converts specific United States Census geographic codes for city, state (FIP and ICP), region, and birthplace.
- With package idbr you can make requests to the US Census Bureau's International Data Base API.
- Package ipumsr provides an easy way to import census, survey and geographic data provided by IPUMS.
- Package noncensus contains a collection of various regional information determined by the U.S. Census Bureau along with demographic data.
- Package tidycensus provides an integrated R interface to the decennial US Census and American Community Survey APIs and the US Census Bureau's geographic boundary files.
- Access to data published by INEGI, Mexico's official statistics agency, is supported by package inegiR.
- Package cbsodataR provides access to Statistics Netherlands' (CBS) open data API.

Misc:

- Package samplingbook includes sampling procedures from the book 'Stichproben. Methoden und praktische Umsetzung mit R' by Goeran Kauermann and Helmut Kuechenhoff (2010).
- Package SDaA is designed to reproduce results from Lohr, S. (1999) 'Sampling: Design and Analysis', Duxbury, and includes the data sets from this book.
- The main contributions of samplingVarEst are jackknife alternatives for variance estimation under unequal probability sampling with one- or two-stage designs.
- Package memisc includes tools for the management of survey data, graphics and simulation.
- Package anesrake provides a comprehensive system for selecting variables and weighting data to match the specifications of the American National Election Studies.
- Package spsurvey includes facilities for spatial survey design and analysis for equal and unequal probability (stratified) sampling.
- The FFD package is designed to calculate optimal sample sizes of a population of animals living in herds for surveys to substantiate freedom from disease. The criteria of estimating the sample sizes take the herd-level clustering of diseases as well as imperfect diagnostic tests into account and select the samples based on a two-stage design. Inclusion probabilities are not considered in the estimation. The package provides a graphical user interface as well.
- mipfp provides multidimensional iterative proportional fitting to calibrate n-dimensional arrays given target marginal tables.
- Package MBHdesign provides spatially balanced designs from a set of (contiguous) potential sampling locations in a study region.
- Package quantification provides different functions for quantifying qualitative survey data. It supports the Carlson-Parkin method, the regression approach, the balance approach and the conditional expectations method.
- BIFIEsurvey includes tools for survey statistics in educational assessment including data with replication weights (e.g. from bootstrap).
- surveybootstrap includes tools for using different kinds of bootstrap for estimating sampling variation using complex survey data.
- Package surveyoutliers winsorizes values of a variable of interest.
- The package univOutl includes various methods for detecting univariate outliers, e.g. the Hidiroglou-Berthelot method.
- Package extremevalues is designed to detect univariate outliers based on modeling the bulk distribution.
- Package RRreg implements univariate and multivariate analysis (correlation, linear, and logistic regression) for several variants of the randomized response technique, a survey method for eliminating response biases due to social desirability.
- Package RRTCS includes randomized response techniques for complex surveys.
- Package panelaggregation aggregates business tendency survey data (and other qualitative surveys) to time series at various aggregation levels.
- Package surveydata makes it easy to keep track of metadata from surveys, and to easily extract columns with specific questions.
- RcmdrPlugin.sampling includes tools for sampling in official statistical surveys. It includes tools for calculating sample sizes and selecting samples using various sampling designs.
- Package mapStats does automated calculation and visualization of survey data statistics on a color-coded map.

3 months ago

Matthias Templ
