Hippocamplus My Second Memory


All of this is mostly extracted from Wikipedia. These notes are exactly fitted to my fragmented knowledge/memory, the purpose is to fill the gaps or refresh my recollection, not to fully review the concepts.

Probability distributions


Binomial distribution models the number of successes in n tries, knowing the probability p of success, \(\mathcal{B}(n,p)\). It has np mean and \(np(1-p)\) variance.

\[\mathrm{Pr}(X=k) = \binom{n}{k}p^k (1-p)^{n-k} \]


Hypergeometric distribution models the number of successes when drawing, without replacement, n elements from a population of size N containing K successes. It can be used to analyze over-/under-representation of a sub-population in a sample. Its mean is \(n\frac{K}{N}\).

\[\mathrm{Pr}(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \]



Poisson distribution models the number of events occurring in an interval of time, knowing the rate \(\lambda\). It has \(\lambda\) mean and variance. Of note a binomial distribution with large n and small p can be approximated by a Poisson with parameter np.

\[\mathrm{Pr}(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \]

Example: number of mutation on a strand of DNA


Beta-binomial distribution is a binomial distribution in which the probability of success follows a beta distribution. It’s often used as an over-dispersed binomial distribution. The three parameters are n, \(\alpha\) and \(\beta\).


Beta distribution is defined on interval [0,1] by two shape parameters, \(\alpha\) and \(\beta\). It is often used as prior for binomial distributions.

Example: allele frequencies in population.

Statistical tests

Student t-test

Usually tests if two distributions, assumed to be Normal, are different. Student distribution describes the estimated mean, from a small sample size, of a normally distributed population. Welch’s t-test is similar but more robust to unequal variance and different sample sizes between the tested groups.

Different flavors exist, e.g one-sample vs two-sample or unpaired vs paired.

Fisher’s exact test

Test association between two categorical classifications. It is usually employed for small sample sizes but valid for all sample sizes.

It models the number in the contingency table following a hypergeometric distribution.

One difference with other test is that it assumes that the total are fixed before the sampling. Is it a problem?

\(\chi^2\) test

Also used to test association between two categorical classifications. However it is only suited when sample size is large and the expected values are not too low.

Wilcoxon test

Based on ranks, it tests if one distribution is stochastically higher than another. It is also called Mann-Whitney U test. Its non-parametric nature makes it robust to outliers and non-Normal or different distributions.


Regression & Machine Learning

Linear regression



Machine Learning basics


CART stands for Classification And Regression Tree, and was introduced by Breinman et al in 1984.

The data space is split using a binary tree. Following the tree for a new value points at the most similar region in the data space. The labels/values of training data in this split/leaf is used to predict the new label/value.

For a regression tree, the predicted value is the average of those in the training data in this split/leaf. A classification tree uses a majority vote among the training data labels in this split/leaf.

The greedy algorithm for deciding the successive splits in a decision tree finds the best dimension and best split that minimizes the training errors. Once the tree is fully built pruning can help improving performance. Random forest is another approach to counter the issues of having fully grown trees.

Pruning removes parts of the tree with lower importance. It improves the decision tree by reducing the complexity and reducing over-fitting, hence increasing the predictive accuracy.

Bagging is simply using bootstrap approach to get a more robust prediction (i.e. with less overfitting). The training set is sampled with replacement and a tree is constructed. Them the final prediction uses the average (or majority call) across all the trees.

Random forest could be seen as an extension of the bagging approach. The main difference is that a random subset of features are used at each split. This ensures that the forest contains different trees and not necessarily similar trees. Random forest is particularly useful when the number of samples is much smaller than the number of features, as it forces the trees to focus on different features, and not only the globally better ones.

Neural Networks


Misc stats

Markov Model

Monte Carlo methods


A Markov Chain Monte Carlo is constructed to approximate a complex probability distribution. It can be designed so that after enough iterations, the walk along the chain approximates the targeted distribution.

Hidden Markov Model

Hidden Markov Model (HMM) is a Markov chain where the states are unknown but an output, that depends on the state, is visible. A HMM is defined by a set of states (X) with transition (a) and emission (b) probabilities, and produces an observed variable y:


The forward algorithm uses dynamic programming to compute the probability of a state at a certain time, given the history, when the parameters of the HMM are known. The backward algorithm is the same idea but given the future history. Using both, we can compute the probability of a state given all the other observations.

Viterbi algorithm goes further and retrieves the most likely sequence of states for an observed sequence. The dynamic programming approach is very similar to the forward algorithm.

Baum-Welch algorithm finds the unknown parameters of a HMM from an observed sequence. It’s an Expectation-Maximization method. After initialization of the HMM parameters, the observed sequence and forward-backward are used to compute the probability of being in state i at time t. New parameters are then derived from these probabilities.

Maximum Likelihood Estimation


Greedy algorithm

A greedy algorithm is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum.

Bayesian vs Frequentist



For an alphabet A and a sequence of size S, compute three objects:

  1. Burrows-Wheeler transform of the sequence
    • Order each possible rotation of the sequence.
    • Save last column L.
    • Use first column to compute the C table below
    • Vector of size S (and alphabet A).
  2. C table
    • Saves the position of each letter in A in the first column of the ordered rotations.
    • Vector of size |A|.
  3. An occurrence table Occ
    • Saves the number of times each letter appeared in L before a position k.
    • Array of size Sx|A|

To look for a kmer, iterate over the letters, starting from the end, to narrow the search range with something like

  • s = C[l] + Occ[l, s]
  • e = C[l] + Occ[l, e]

To extract position in the original sequence we would use another index vector with this mapping and the final search range.

Graph theory

De Bruijn graphs

Hamiltonian path

Euclidian path


SVD decomposition