Statistical Genetics

Genetic analysis has advanced greatly in recent years. The availability of millions of single nucleotide polymorphisms (SNPs) in widely accessible databases, coupled with major advances in SNP genotyping technology that reduce costs and increase throughput, is enabling a host of studies aimed at elucidating the genetic basis of complex disease. This task view focuses on R packages implementing statistical methods and algorithms for the analysis of genetic data and for related population genetics studies.

A number of R packages are already available, and many more are likely to be developed in the near future. Please send your comments and suggestions to the task view maintainer.

  • Population Genetics : genetics implements classes and methods for representing genotype and haplotype data, and has several functions for population genetic analysis (e.g. functions for estimation and testing of Hardy-Weinberg and linkage disequilibria). rmetasim provides an interface to the metasim engine for population genetics simulations. A few population genetics functions are also implemented in gap. LDheatmap creates a heat map plot of measures of pairwise LD. hwde fits models for genotypic disequilibria, whilst HardyWeinberg provides graphical representation of disequilibria via ternary plots (also known as de Finetti diagrams). The Biodem package provides functions for biodemographic analysis, e.g. Fst() calculates the Fst from the conditional kinship matrix. The adegenet package implements a number of different methods for analysing population structure using multivariate statistics, graphics and spatial statistics. The hierfstat package allows the estimation of hierarchical F-statistics from haploid or diploid genetic data with any number of levels in the hierarchy.
  • Phylogenetics : The Phylogenetics view has more detailed information; only the most important packages are mentioned here. Phylogenetic and evolutionary analyses can be performed via ape. Package ouch provides Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses. stepwise implements a method for stepwise detection of recombination breakpoints in sequence alignments. phangorn estimates phylogenetic trees and networks using maximum likelihood, maximum parsimony, distance methods and Hadamard conjugation.
  • Linkage : There are few native packages for performing parametric or non-parametric linkage analysis from within R itself, so the calculations must usually be performed using external programs. However, there are a number of ancillary R packages that facilitate interfacing with these stand-alone programs and using the results for further analysis and presentation. ibdreg uses externally calculated Identity By Descent (IBD) Non-Parametric Linkage (NPL) statistics for related pairs to test for genetic linkage with covariates by regression modelling. Although it is not an R package, one software suite in particular is worthy of mention: PLINK is a C++ program for whole-genome association and population-based linkage analysis that supports R-based plug-ins via Rserve, allowing users to utilise the rich suite of statistical functions in R for analysis.
  • QTL mapping : Packages in this category develop methods for the analysis of experimental crosses to identify markers contributing to variation in quantitative traits. bqtl implements both likelihood-based and Bayesian methods for inbred crosses and recombinant inbred lines. qtl provides several functions and a data structure for QTL mapping, including a function scanone() for genome-wide scans. wgaim builds on qtl by including functions for the modelling and summary of QTL intervals from the full linkage map, whilst dlmap can be used to perform QTL mapping in a mixed model framework with separate detection and localization stages.
  • Association : Packages in this category provide statistical methods to test associations between individual genetic markers and a phenotype. gap is a package for genetic data analysis of both population and family data; it contains functions for sample size calculations, probability of familial disease aggregation, kinship calculation, and some tests for linkage and association analyses. Among its other functions, genecounting() estimates haplotype frequencies from genotype data, and gcontrol() implements Bayesian genomic control statistics for association studies. For family data, tdthap offers an implementation of the Transmission/Disequilibrium Test (TDT) for extended marker haplotypes.
  • Linkage Disequilibrium and haplotype mapping : A number of packages provide haplotype estimation for unrelated individuals with ambiguous haplotypes (due to unknown linkage phase) and allow testing for associations between the estimated haplotypes and phenotypes (including covariates) under a GLM framework. hapassoc performs likelihood inference of trait associations with haplotypes in GLMs. haplo.stats also contains tests for haplotype associations under a GLM framework, and in addition provides score tests of association, novel functionality for building haplotypes in a sequential manner, power and sample-size calculations, and the preparation of data matrices for use in other methods. haplo.ccs utilises the haplotype estimation of haplo.stats and performs case-control association tests via weighted logistic regression. tdthap implements transmission/disequilibrium tests for extended marker haplotypes. LDheatmap creates a heat map plot of measures of pairwise LD.
  • Genome-Wide Association Studies (GWAS) : Recent technical advances in high-throughput genotyping have made Genome-Wide Association Studies a feasible strategy. A number of packages are available to facilitate the analysis of these large data sets. GenABEL is designed for the efficient storage and handling of GWAS data, with fast analysis tools for quality control and association with binary and quantitative traits, as well as tools for visualizing results. pbatR provides a GUI to the powerful PBAT software, which performs family- and population-based studies. The software has been implemented to take advantage of parallel processing, which vastly reduces the computational time required for GWAS. snpMatrix implements classes and methods for large-scale SNP association studies.
  • Multiple testing : The package qvalue on Bioconductor implements false discovery rate (FDR) estimation; the main function qvalue() estimates the q-values from a list of p-values. Package multtest on Bioconductor also offers several non-parametric bootstrap and permutation resampling-based multiple testing procedures.
  • Importing Sequence Data : There are utilities in the seqinr package to import sequence data from various sources, including files of aligned sequences in mase, clustal, phylip, fasta and msf format, which will be useful for some population genetic analyses. Users interested in using R for sequence data and bioinformatics are also referred to the Bioconductor project.
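
To make the Hardy-Weinberg testing mentioned under Population Genetics concrete, here is a minimal Python sketch of the classical chi-square goodness-of-fit test at a single biallelic locus. It is a language-neutral illustration only and does not reproduce the interface of genetics, hwde or HardyWeinberg; the function name hwe_chisq and its arguments are hypothetical.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at one biallelic locus.

    Hypothetical helper for illustration; arguments are observed genotype
    counts (AA, AB, BB). Returns (chi-square statistic, p-value on 1 df).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")

    # Allele frequency of A estimated by gene counting.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)

    # Classes with zero expectation (monomorphic locus) contribute nothing.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper tail of a chi-square variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

The chi-square approximation is unreliable when expected counts are small; exact tests are generally preferred in that situation.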
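
The "measures of pairwise LD" referred to in the list above are commonly D, |D'| and r². The Python sketch below computes them for two biallelic loci, assuming the haplotype frequency is already known (in practice it must be estimated when phase is ambiguous). The function name ld_measures and its parameters are hypothetical and are not taken from LDheatmap or any other package named here.

```python
def ld_measures(p_ab, p_a, p_b):
    """Pairwise LD between two biallelic loci.

    Hypothetical helper for illustration.
    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    Returns (D, |D'|, r^2).
    """
    # Raw disequilibrium coefficient.
    d = p_ab - p_a * p_b

    # Largest magnitude D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # Squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

When either locus is monomorphic the measures are undefined; this sketch returns 0.0 in that case as a convention, not as a statistical statement.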
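
As a rough illustration of the false discovery rate idea under Multiple testing, the Python sketch below computes Benjamini-Hochberg adjusted p-values. This is a deliberate simplification and not the estimator used by qvalue(): it corresponds to q-values with the proportion of true null hypotheses taken as 1, which makes it more conservative. The function name bh_adjust is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper for illustration; assumes each p-value lies in [0, 1].
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Rejecting every hypothesis whose adjusted value is at or below a threshold such as 0.05 controls the FDR at that level under the usual independence assumptions.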


5 months ago

Giovanni Montana