Sampling in the hypersphere

This post is the first of a short series where I will be presenting some of the most remarkable concepts and techniques from the field of statistical physics, all of which are beautifully featured in Werner Krauth’s marvelous book Statistical Mechanics: Algorithms and Computations (Oxford University Press). (As a side note, I positively recommend Krauth’s related free online course, which is every bit as delightful as his book, and more accessible to those lacking a physics background.)

Command-line manipulation of sequence files

Here I present some command-line approaches, hoarded over the years, for manipulating and converting between popular sequence data file formats, namely BAM/SAM, FASTQ and FASTA. While they are by no means the only methods available for these tasks, they are perhaps the simplest.
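To give a flavour of the kind of conversion involved, here is a minimal Python sketch of a FASTQ-to-FASTA conversion. It is not one of the command-line approaches from the post. It assumes strict four-line FASTQ records (no wrapped sequences), and the function name `fastq_to_fasta` and the example reads are hypothetical.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from FASTQ lines, assuming strict four-line records."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, _quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            # FASTA keeps the read name and sequence; qualities are dropped.
            yield ">" + header[1:]
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


# Made-up reads, for illustration only.
example = ["@read1 sample", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "FFFF"]
fasta = list(fastq_to_fasta(example))
print("\n".join(fasta))
```

The same function accepts an open file handle in place of the list, since it only iterates over lines.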

Simulating mutation diffusion by random drift

While working with a rather large set of human germline polymorphisms, I recently asked myself whether the age of a given variant allele (or mutation) could be estimated from its allele frequency (the fraction of alleles at the relevant locus that are represented by the variant allele). Unsurprisingly, the answer seems to be ‘yes’ up to a certain degree — that is, provided that one is willing to settle for an ideal (and somewhat boring) population, to neglect some of the most important forces that shape population dynamics in the real world, and to accept a great deal of uncertainty in the answer. Nevertheless, looking at what is left after all these concessions are made — that is, the background force behind the most elementary patterns of mutation spread — is quite a nice exercise in itself, and provides a beautiful glimpse of a multidimensional probability distribution.
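The kind of idealised population described here can be sketched in a few lines of Python. This is a minimal illustration that assumes a neutral Wright-Fisher model (constant population size, non-overlapping generations, no selection or new mutation), which may differ from the model used in the full post. The function name `simulate_drift` and all parameter values are hypothetical.

```python
import random


def simulate_drift(pop_size, start_count, generations, rng):
    """Track a variant allele's frequency under neutral Wright-Fisher drift.

    pop_size is the number of allele copies at the locus (2N for diploids).
    """
    count = start_count
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        # Each allele copy in the next generation is drawn independently
        # from the current pool, i.e. binomial sampling with probability freq.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count / pop_size)
        if count == 0 or count == pop_size:
            break  # the variant has been lost or has reached fixation
    return trajectory


rng = random.Random(1)  # fixed seed so the run is reproducible
trajectory = simulate_drift(pop_size=200, start_count=1, generations=500, rng=rng)
print(f"generations simulated: {len(trajectory) - 1}")
print(f"final frequency: {trajectory[-1]:.3f}")
```

Running many such trajectories from a single starting copy and recording the generation at which each one passes through a given frequency gives a rough empirical picture of how allele age and allele frequency relate under drift alone.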

Robust Bayesian linear regression with Stan in R

Simple linear regression is a very popular technique for estimating the linear relationship between two variables based on matched pairs of observations, as well as for predicting the probable value of one variable (the response variable) according to the value of the other (the explanatory variable). When plotting the results of linear regression graphically, the explanatory variable is normally plotted on the x-axis, and the response variable on the y-axis.
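For reference, the classical (non-robust, non-Bayesian) version of this technique fits in a few lines of Python. The sketch below uses ordinary least squares, not the Stan model discussed in the full post. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matched pairs of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("explanatory variable has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up matched pairs: explanatory variable xs, response variable ys.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# Predict the response for a new value of the explanatory variable.
predicted = intercept + slope * 6.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {predicted:.2f}")
```

Because every squared residual counts equally, a single outlying observation can pull this line noticeably, which is the weakness that robust regression models aim to address.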

Bayesian robust correlation with Stan in R

While looking for a Bayesian replacement for my in-house robust correlation method (Spearman’s correlation with bootstrap resampling), I found two very interesting posts on standard and robust Bayesian correlation models on Rasmus Bååth’s blog. As I wanted to give the robust model a try on my own data (and also combine it with a robust regression model), I have translated Bååth’s JAGS code into Stan and wrapped it inside a function. Below I show how this model is more suitable than classical correlation coefficients, regardless of whether the data are normally distributed.
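For readers unfamiliar with the non-Bayesian baseline mentioned above, here is a minimal Python sketch of Spearman’s correlation with a percentile bootstrap interval. It is an illustration under my own assumptions, not the in-house method itself or Bååth’s model. All function names, the number of resamples and the data are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def bootstrap_interval(xs, ys, n_resamples, rng):
    """Approximate 95% percentile interval for rho, resampling pairs with replacement."""
    n = len(xs)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            estimates.append(rho)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int(0.975 * len(estimates)))]
    return lower, upper


# Made-up data with one extreme outlier in the last pair.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.3, 1.9, 3.8, 4.1, 5.5, 5.2, 7.9, 8.1, 8.8, 40.0]
rho = spearman(xs, ys)
lower, upper = bootstrap_interval(xs, ys, n_resamples=1000, rng=random.Random(42))
print(f"rho={rho:.3f}, bootstrap interval=({lower:.3f}, {upper:.3f})")
```

Working on ranks is what makes the coefficient insensitive to the size of the outlier: the last observation is simply the largest value, however large it is.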
