All of this is mostly extracted from Wikipedia. These notes are exactly fitted to my fragmented knowledge/memory; the purpose is to fill the gaps, not to fully review the concepts.

Binomial distribution models the **number of successes in n tries**, knowing the probability of success *p* of each try.

\[\mathrm{Pr}(X=k) = \binom{n}{k}p^k (1-p)^{n-k} \]
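
As a quick check, the PMF can be evaluated by hand or with `scipy.stats.binom`; a minimal sketch with made-up numbers:

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.3, 4                              # made-up numbers
manual = comb(n, k) * p**k * (1 - p)**(n - k)     # C(n,k) p^k (1-p)^(n-k)
print(manual, binom.pmf(k, n, p))                 # both ~0.200
```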

Hypergeometric distribution models the **number of successes** when drawing, **without replacement, n elements from a population of size N containing K successes**. It can be used to analyze over-/under-representation of a sub-population in a sample. Its mean is \(n\frac{K}{N}\).

\[\mathrm{Pr}(X=k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \]
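
For an over-representation (enrichment) question, the survival function gives \(\mathrm{Pr}(X \ge k)\). A minimal sketch with `scipy.stats.hypergeom` and made-up numbers (note scipy's argument order: population size, successes in the population, number of draws):

```python
from scipy.stats import hypergeom

N, K, n = 1000, 50, 100                 # population, successes in it, draws (made up)
k = 12                                  # observed successes in the sample
print(n * K / N)                        # expected successes under random sampling: 5
print(hypergeom.sf(k - 1, N, K, n))     # Pr(X >= k), the over-representation p-value
```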

Poisson distribution models the **number of events occurring in an interval of time**, knowing the rate \(\lambda\). Its mean and variance are both \(\lambda\). Of note, a binomial distribution with large *n* and small *p* can be approximated by a Poisson with parameter *np*.

\[\mathrm{Pr}(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \]

*Example:* number of mutations on a strand of DNA.
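
A quick way to see the binomial-to-Poisson approximation mentioned above, with made-up numbers (large *n*, small *p*):

```python
from scipy.stats import binom, poisson

n, p = 10000, 0.0003          # large n, small p (made up)
lam = n * p                   # Poisson rate = 3
for k in range(6):
    print(k, binom.pmf(k, n, p), poisson.pmf(k, lam))   # values nearly identical
```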

Beta-binomial distribution is a **binomial** distribution in which the **probability of success follows a beta distribution**. It’s often used as an over-dispersed binomial distribution. The three parameters are *n*, \(\alpha\) and \(\beta\).
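
A small sketch of the over-dispersion, comparing variances with made-up parameters (assumes a scipy version that ships `scipy.stats.betabinom`):

```python
from scipy.stats import betabinom, binom

n, a, b = 20, 2, 5                  # made-up parameters
p = a / (a + b)                     # mean success probability of the beta prior
print(binom.var(n, p))              # plain binomial variance (~4.1)
print(betabinom.var(n, a, b))       # larger: extra variance from the beta on p (~13.8)
```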

Beta distribution is defined on the interval [0,1] by two shape parameters, \(\alpha\) and \(\beta\). It is often used as a prior for binomial distributions.

*Example:* allele frequencies in a population.
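
Because the beta is conjugate to the binomial, the posterior after observing *k* successes in *n* trials is simply \(\mathrm{Beta}(\alpha+k, \beta+n-k)\); a minimal sketch with made-up counts:

```python
from scipy.stats import beta

a0, b0 = 2, 2                           # prior Beta(alpha, beta), made up
k, n = 7, 10                            # observed successes / trials, made up
posterior = beta(a0 + k, b0 + n - k)    # Beta(alpha + k, beta + n - k)
print(posterior.mean())                 # (a0 + k) / (a0 + b0 + n) = 9/14 ~ 0.64
```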

The t-test usually **tests if two distributions, assumed to be Normal, are different**. Student's t-distribution describes the estimated mean, from a small sample, of a normally distributed population. Welch's t-test is similar but more robust to unequal variances and different sample sizes between the tested groups.

Different flavors exist, e.g. one-sample vs two-sample, or unpaired vs paired.
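
A minimal sketch with scipy and toy data; `equal_var=False` switches to Welch's version:

```python
from scipy.stats import ttest_ind, ttest_rel

x = [5.1, 4.9, 6.2, 5.8, 5.5]   # toy data
y = [6.4, 6.9, 5.9, 7.1, 6.6]
print(ttest_ind(x, y))                    # two-sample Student's t-test
print(ttest_ind(x, y, equal_var=False))   # Welch's t-test
print(ttest_rel(x, y))                    # paired t-test
```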

Fisher's exact test **tests the association between two categorical classifications**. It is usually employed for small sample sizes but is valid for all sample sizes.

It models the counts in the contingency table as following a hypergeometric distribution.

One difference with other tests is that it assumes that the totals (margins) are fixed before the sampling. *Is it a problem?*
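
A minimal sketch on a made-up 2x2 contingency table with `scipy.stats.fisher_exact`:

```python
from scipy.stats import fisher_exact

# rows: group A / group B, columns: with / without the feature (made-up counts)
table = [[8, 2],
         [1, 9]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(odds_ratio, p_value)
```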

The chi-squared test is also used to test the association between two categorical classifications. However it is **only suited when the sample size is large and the expected values are not too low**.
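
The same kind of table goes through `scipy.stats.chi2_contingency`, which also returns the expected counts so the "not too low" assumption can be checked (made-up counts):

```python
from scipy.stats import chi2_contingency

table = [[30, 70],
         [45, 55]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)
print(expected)    # rule of thumb: expected counts should be >= ~5
```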

The Wilcoxon rank-sum test, also called the Mann-Whitney U test, is based on ranks and **tests if one distribution is stochastically higher than another**. Its non-parametric nature makes it robust to outliers and to non-Normal or differing distributions.
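
A minimal sketch with `scipy.stats.mannwhitneyu` and toy data:

```python
from scipy.stats import mannwhitneyu

x = [1.2, 3.4, 2.2, 5.1, 2.8]   # toy data
y = [4.5, 6.1, 5.3, 7.0, 4.9]
print(mannwhitneyu(x, y, alternative='two-sided'))
```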

- k-nearest neighbors
- Gradient descent
- Sigmoid function to transform a value into a more interpretable decision value (S-shaped, between 0 and 1); see the sketch below.
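
A minimal sketch of that sigmoid (logistic) function:

```python
import numpy as np

def sigmoid(x):
    """Map any real value to (0, 1) with the S-shaped logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~[0.018, 0.5, 0.982]
```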

CART stands for **C**lassification **A**nd **R**egression **T**ree, and was introduced by Breiman et al. in 1984.

The data space is split using a **binary tree**. Following the tree with a new observation leads to the most similar region of the data space. The labels/values of the training data in this split/leaf are used to predict the new label/value.

For a **regression tree**, the predicted value is the average of those in the training data in this split/leaf. A **classification tree** uses a majority vote among the training data labels in this split/leaf.
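
A minimal sketch with scikit-learn's CART implementation and toy data:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]   # toy data
y_class = [0, 0, 0, 1, 1, 1]
y_value = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)
print(clf.predict([[2.5]]))   # majority vote in the matching leaf -> 0
print(reg.predict([[2.5]]))   # average of the leaf's training values -> ~1.0
```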

The **greedy algorithm** for deciding the successive splits in a decision tree finds, at each step, the dimension and split point that minimize the training error. Once the tree is fully built, pruning can help improve performance. Random forest is another approach to counter the issues of having fully grown trees.
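
To make the greedy split search concrete, a tiny sketch of one regression split (single feature, squared error); not a full CART implementation:

```python
import numpy as np

def best_split(x, y):
    """Greedy search: try every threshold, keep the one with the lowest SSE."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        thr = (x[i - 1] + x[i]) / 2
        left, right = y[:i], y[i:]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[1]:
            best = (thr, sse)
    return best

print(best_split([1, 2, 3, 10, 11, 12], [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]))  # (6.5, ~0.1)
```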

**Pruning** removes parts of the tree with lower importance. It improves the decision tree by reducing its complexity and over-fitting, hence increasing the predictive accuracy.
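
In scikit-learn, cost-complexity pruning is exposed through the `ccp_alpha` parameter; a sketch on a made-up dataset (the alpha value is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
print(full.tree_.node_count, pruned.tree_.node_count)   # the pruned tree is smaller
```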

**Bagging** simply uses a bootstrap approach to get a more robust prediction (i.e. with less over-fitting). The training set is sampled with replacement and a tree is constructed on each bootstrap sample. Then the final prediction uses the average (or majority vote) across all the trees.
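
A minimal sketch with scikit-learn's `BaggingClassifier` on made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
# one fully grown tree per bootstrap sample, majority vote across trees
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:3]))
```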

**Random forest** can be seen as an extension of the bagging approach. The main difference is that a random subset of features is considered at each split. This ensures that the forest contains different trees rather than many similar trees. Random forest is particularly useful when the number of samples is much smaller than the number of features, as it forces the trees to explore different features, not only the globally strongest ones.
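
The corresponding scikit-learn call, where `max_features` controls the random subset of features considered at each split (made-up data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=0)
rf.fit(X, y)
print(rf.feature_importances_.argsort()[-5:])   # indices of the 5 most used features
```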

A Markov chain Monte Carlo (MCMC) method is **constructed to approximate a complex probability distribution**. It can be designed so that, after enough iterations, the walk along the chain approximates the targeted distribution.
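
A minimal Metropolis-Hastings sketch targeting a standard Normal, one common way to build such a chain (step size and iteration count are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x**2                  # log-density of N(0, 1), up to a constant

x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)            # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal                                # accept, otherwise keep x
    samples.append(x)
print(np.mean(samples), np.std(samples))            # ~0 and ~1
```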

A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage, with the hope of finding a global optimum.
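
A classic small example is coin change with a canonical coin system, where the locally optimal choice (the largest coin that still fits) happens to be globally optimal; with arbitrary coin systems the greedy choice can fail:

```python
def greedy_change(amount, coins=(50, 20, 10, 5, 2, 1)):
    """Always take the largest coin that still fits (the greedy choice)."""
    result = []
    for c in coins:
        while amount >= c:
            amount -= c
            result.append(c)
    return result

print(greedy_change(87))   # [50, 20, 10, 5, 2]
```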