A few days ago I was reading one of the famous Feynman Lectures on Physics, which included a discussion of Kepler’s laws of planetary motion; this is a set of three simple-looking laws proposed by the German astronomer and mathematician Johannes Kepler (1571–1630) to describe the motion of the planets around the Sun.
This post is the first of a short series where I will be presenting some of the most remarkable concepts and techniques from the field of statistical physics, all of which are beautifully featured in Werner Krauth’s marvelous book Statistical Mechanics: Algorithms and Computations (Oxford University Press). (As a side note, I positively recommend Krauth’s related free online course, which is every bit as delightful as his book, and more accessible to those lacking a physics background.)
Here I present some command-line approaches, hoarded over the years, for manipulating and converting between different popular sequence data file formats, namely BAM/SAM, FASTQ and FASTA. While they are by no means the only methods available for these tasks, they are perhaps the simplest.
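To give a flavour of what such a conversion involves, here is a minimal Python sketch of a FASTQ-to-FASTA converter. It is not one of the command-line approaches from the post. The function name `fastq_to_fasta` is hypothetical, and the sketch assumes the common four-line FASTQ layout (header, sequence, `+` separator, qualities), so files with sequences wrapped across several lines would need a proper parser.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from FASTQ input.

    Assumption: every record spans exactly four lines
    (header, sequence, '+' separator, quality string).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, _separator, _quality = record
            if not header.startswith("@"):
                raise ValueError(f"malformed FASTQ header: {header!r}")
            # A FASTA header starts with '>' instead of '@';
            # the quality string is simply dropped.
            yield ">" + header[1:]
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, and the output can be written line by line without loading the whole file into memory.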
While working with a rather large set of human germline polymorphisms, I recently asked myself whether the age of a given variant allele (or mutation) could be estimated from its allele frequency (the fraction of alleles at the relevant locus that are represented by the variant allele). Unsurprisingly, the answer seems to be ‘yes’ up to a certain degree — that is, provided that one is willing to settle for an ideal (and somewhat boring) population, to neglect some of the most important forces that shape population dynamics in the real world, and to accept a great deal of uncertainty in the answer. Nevertheless, looking at what is left after all these concessions are made — that is, the background force behind the most elementary patterns of mutation spread — is quite a nice exercise in itself, and provides a beautiful glimpse of a multidimensional probability distribution.
Simple linear regression is a very popular technique for estimating the linear relationship between two variables based on matched pairs of observations, as well as for predicting the probable value of one variable (the response variable) according to the value of the other (the explanatory variable). When plotting the results of linear regression graphically, the explanatory variable is normally plotted on the x-axis, and the response variable on the y-axis.
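As a rough illustration of the technique, the sketch below fits a line by ordinary least squares using only the Python standard library. The function name `fit_line` is hypothetical, and the closed-form estimates used here (slope as the ratio of the x-y co-deviation sum to the x squared-deviation sum) are the textbook ones, not necessarily the method used in the post.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept.

    x holds the explanatory variable, y the response variable.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Matched pairs lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

Predicting the response for a new value of the explanatory variable is then just `slope * x_new + intercept`.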
When looking at genetic variation in a set of sequencing data files, the process normally starts with a VCF file. VCF (Variant Call Format) is a standard file format for genetic variation calls, and is used by most, if not all, variant calling software tools. A VCF file is a text file that looks something like this:
While looking for a Bayesian replacement for my in-house robust correlation method (Spearman’s correlation with bootstrap resampling), I found two very interesting posts on standard and robust Bayesian correlation models in Rasmus Bååth’s blog. As I wanted to give the robust model a try on my own data (and also combine it with a robust regression model) I have translated Bååth’s JAGS code into Stan and wrapped it inside a function. Below I show how this model is more suitable than classical correlation coefficients, regardless of whether the data are normally distributed.