A common experimental method used for AB testing is Null hypothesis-based statistical testing (NHST). In the NHST approach, the scientist collects data from some process or group of interest, then performs a statistical test to evaluate whether the distribution of collected samples is statistically different from the distribution generated by some baseline, or "Null," process or control group.
Due to inherent randomness in the world and the fact that sampling procedures will always be imperfect, it is possible that the researcher detects a statistically significant difference when, in fact, there isn’t one. This scenario is called a False Positive or Type I Error, and each NHST has its own acceptable Type I Error rate. The acceptable False Positive rate is generally referred to as \(\alpha\), and tends to be set in the range of \(1 \%-5 \%\).
A key step when using NHST is to establish a minimum number of samples \(N\) to collect in order to provide statistical guarantees on the test's error rates, namely the False Positive rate \(\alpha\) and the statistical power. This procedure is known as power analysis.
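As a concrete sketch, a power analysis for a one-sample t-test might look like the following, using the statsmodels library; the effect size, power, and \(\alpha\) values here are illustrative assumptions, not recommendations.
import math
from statsmodels.stats.power import TTestPower

# A minimal power-analysis sketch (illustrative values, not prescriptions)
required_n = TTestPower().solve_power(
    effect_size=0.3,        # smallest standardized effect we care to detect
    alpha=0.05,             # acceptable Type I error rate
    power=0.8,              # desired statistical power (1 - Type II error rate)
    alternative='two-sided'
)
print(f"Collect at least N = {math.ceil(required_n)} samples")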
Sometimes, during the data collection process, the scientist will “peek” at the data prematurely, before data collection has reached the \(N\) samples prescribed by power analysis. The scientist will then perform preliminary analyses and statistical tests on this incomplete dataset. This is where the p-hacking begins!
Say the scientist finds no statistical difference from the Null hypothesis. In this case, it is highly likely that the scientist will collect more data in order to try to further establish whether there is a real difference. However, if a difference is detected, data collection is likely to be stopped, a form of confirmation bias, if you will. This strategy of peeking and conditionally continuing data collection based on test results obtained from the partial dataset invalidates the assumptions of the statistical test and inflates the likelihood of observing a false positive.
Below we use simulation to demonstrate the degree to which various data peeking strategies can inflate the False Positive rate of the statistical procedure, and compare those False Positive rates to a valid, no-peeking analysis strategy.
In each of 5,000 simulations below we generate a scenario where our data are sampled from a distribution that has zero mean, in this case a standard Normal. We then use a one-sample t-test to infer whether or not the mean of those samples differs significantly from zero (the Null hypothesis). Since the samples do in fact have zero mean, we intuitively would expect to always detect no difference from zero. However, due to randomness in the sampling process, we'll obtain a difference by chance some small percentage of the time.
Additionally, for each simulation we'll "peek" at the partial dataset at various points during the data collection, based on one of the following peeking strategies: peeking after every new sample, after every 10th sample, after every 20th sample, or after every 50th sample.
After each peek at the current dataset, we then decide whether to keep collecting data contingent on what we've observed so far: if the test on the partial dataset detects a significant difference, we stop collecting data and declare a positive result; otherwise, we continue collecting samples.
Given that we know the ground truth distribution, we can calculate how much the False Positive rate has increased for each peeking strategy over the course of data collection.
The results of the simulation are plotted in Figure 1.
Figure 1, False positive (Type I) error rates associated with various data-peeking strategies. As we increase the frequency of data peeking, we increase the False Positive rate of our inference procedure. This inflation increases with the number of peeks and samples collected. If we were to peek at every sample, we would have a nearly 1 in 2 chance of a False Positive after collecting ~500 samples. No data peeking closely follows the researcher-defined acceptable False Positive rate for the experiment \(\alpha=0.05\).
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
from vis import COLORS, save_figure
np.random.seed(123)
"""
Simulate 5000 experiments with 500 samples each. There is no statistical
difference in the mean of the sampled data from zero (our Null Hypothesis)
"""
n_samples_per_simulation, n_simulations = 500, 5000
real_mean = 0
sampling_distribution = stats.norm(loc=real_mean)
simulations = sampling_distribution.rvs(
    size=(n_samples_per_simulation, n_simulations)
)
# Simulate many hypothesis tests, increasing the sample size for each
p_values = []
n_samples = []
min_samples = 3 # Minimum samples used to run a t-test
for n_samples_used in range(min_samples, n_samples_per_simulation):
    n_samples.append(n_samples_used)
    # p-values obtained from one-sample t-test
    p_values.append(
        stats.ttest_1samp(simulations[:n_samples_used], 0., axis=0).pvalue
    )
# Make null-hypothesis decisions based on p-values
alpha = .05 # Researcher-defined acceptable Type I error rate
decisions = np.array(p_values) <= alpha
"""
We simulate peeking by denoting all observations that follow
observing an initial positive result to also be considered positive.
This is equivalent to not collecting any more data.
"""
# Simulate peeking every N-th sample strategy
peeking_strategy = [1, 10, 20, 50]
peeking_strategy_colors = [
    "black",
    COLORS.dark_red,
    COLORS.red,
    COLORS.salmon
]
plt.subplots(figsize=(8, 8))
for ci, peek_every_nth in enumerate(peeking_strategy):
    intermittent_peeking_decisions = decisions[::peek_every_nth].cumsum(axis=0) >= 1
    intermittent_peeking_type_I_error_rate = intermittent_peeking_decisions.mean(axis=1)
    # Add plot for this strategy
    label = 'Every Sample' if peek_every_nth == 1 \
        else f'Every {peek_every_nth}th Sample'
    plt.plot(
        n_samples[::peek_every_nth],
        intermittent_peeking_type_I_error_rate,
        color=peeking_strategy_colors[ci],
        label=label
    )
"""
We simulate no peeking as just the average Type I error
rate across all simulations without any results-dependent
screening.
"""
type_I_error_rate = decisions.mean(axis=1)
plt.plot(n_samples, type_I_error_rate, color=COLORS.blue, label='No Peeking')
plt.axhline(
    alpha,
    color=COLORS.gray,
    linestyle='--',
    label=r'$\alpha={:0.2f}$'.format(alpha)
)
plt.ylim(0, .6)
plt.xlim(min_samples, n_samples_per_simulation - peeking_strategy[-1])
plt.xlabel("Number of Samples")
plt.ylabel("Type I Error Rate")
plt.title(
    "The effect of peeking at your AB test\n"
    "results before data collection is complete"
)
plt.legend(title="Peeking Strategy")
plt.grid();
We can see that the more often we peek at our data, the larger our False Positive rate becomes over the course of the experiment’s data collection. In the worst case scenario, where we peek at every sample, our False Positive rate is nearly 50% after collecting ~500 samples!
We can also see how not peeking allows us to maintain the guarantees of the statistical test on False Positive rate. When we don’t peek at our data, our expected Type I error rate is consistent with the \(\alpha\) used in the t-test.
Data-peeking is just one of many spooky gotchas that come along with using Null hypothesis-based statistical tests (NHST). This particular p-hacking effect comes up when we prematurely run statistical tests on our experiment data before the required dataset size has been reached. If you've ever said something like "let's just take a look at the experiment to see if we've reached statsig," then you're probably p-hacking!
The negative effects of data-peeking can be minimized by sticking to standardized protocols for running experiments that use NHSTs: running an initial power analysis to calculate the required sample size for a desired effect size and statistical power, then holding off your statistical analysis until the proper sample size has been reached.
A well-behaved statistical test should produce uniformly distributed p-values when the Null hypothesis is true. This is because, under the Null, the test shouldn't favor any one portion of the hypothesis space over the others. This is demonstrated in Figure 1, which plots the distribution of p-values that result from running two-sample t-tests on 10,000 simulated datasets (\(N=10\)) having no difference between the two samples being tested, i.e. the Null Hypothesis \(H_0=\text{True}\).
Figure 1, p-values from a well-behaved statistical test. p-values should be uniformly distributed; here we choose twenty equally-sized bins, corresponding with \(\alpha=0.05\). Even when there is no effect, i.e. \(H_0=\text{True}\), 5% of trials will indicate a “significant” effect by chance (red). Additionally, 5% of trials will be “So close” to showing significance (blue). N chasing is often performed on these “So close” trials by collecting additional data points.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
from abra.vis import Gaussian, COLORS  # requires abracadabra
from matplotlib.patches import Rectangle
# Simulate multiple experimental datasets where H_0=True
# run t-tests, then collect the resulting p-values
ALPHA = 0.05
n_obs_per_trial, n_trials = 10, 10000
np.random.seed(1234)
null = Gaussian()
datasets = null.sample((2, n_obs_per_trial, n_trials))
pvals = stats.ttest_ind(datasets[0], datasets[1], axis=0).pvalue
def pval_rate_histogram(pvals, resolution=ALPHA, color='white', label=None):
    """Util for plotting the proportion of p-values that occur within
    buckets of size `resolution`.
    """
    # Include the right edge at 1.0 so we get 1 / resolution (here 20) bins
    bins = np.arange(0, 1 + resolution, resolution)
    factor = 1 / float(len(pvals))
    cnts, bins = np.histogram(pvals, bins=bins)
    return plt.hist(bins[:-1], bins, weights=factor * cnts, color=color, label=label, edgecolor='black')
# Plot distribution of non-hacked p-values
plt.subplots(figsize=(10, 6))
cnt, bin_left, patches = pval_rate_histogram(pvals, color=COLORS.light_gray, label='p-values')
plt.ylim([0, .07])
# Highlight the trials bucket associated with false positives as
# well as those trials that are "Soo close" to being "significant"
## False positives trials
expected_type_I = patches[0]
expected_type_I.set_color(COLORS.red)
expected_type_I_rate = cnt[0] * 100.
expected_type_I.set_label(f"{round(expected_type_I_rate)}% of Trials are False Positives")
## So close to being "significant" trials
near_type_I = patches[1]
near_type_I.set_color(COLORS.blue)
near_type_I.set_label("'Soo close!' Trials")
plt.axhline(ALPHA, color=COLORS.dark_red, label='Expected Type I Error Rate')
plt.xlabel('p-values')
plt.ylabel('Proportion of Trials')
plt.title("p-values from a well-behaved statistical test are uniform")
plt.legend()
Because the p-values are uniformly distributed, if you histogram the p-values into 20 equally-sized bins, you would expect each bin to be associated with roughly 5% of trials. Consequently, we would expect a default false positive rate \(\alpha\) of 0.05. It turns out that this resolution of p-value breakdown is a pretty common scientific standard, and is one of the reasons everyone uses an \(\alpha=0.05\) in hypothesis tests.
Figure 1 also highlights in blue the trials where the p-values are “So close” to exhibiting a significant effect, having magnitudes just above the \(\alpha=0.05\) cutoff.
If you were an experimenter incentivised to find novel, positive effects in your experiment (even though there isn't one, as is the case here, but you don't know that), you might be tempted to extend your experiment juuuust a liiiiittle longer to see if the p-values for those "So close" trials decrease enough to reach statistical significance.
At first glance, adding new samples in this way seems totally reasonable. How can adding more data be bad? If the effect is there, then we should be able to see it better by simply "squashing down the noise" with more samples, right? This is N chasing, a common form of p-hacking. Don't do it!
Figure 2, p-Hacking via N Chasing. To simulate N Chasing, we take the "So close" (blue) trials in Figure 1 and add to each trial a single, random data point drawn from \(H_0\) (\(N_{hacked}=11\)). The resulting distribution of p-values from running two-sample t-tests on the hacked datasets is shown. The distribution is no longer uniform, the sign of an ill-behaved statistical test. Additionally, the Type I error rate is around 25% (red bar), where we would expect false positives in around 5% of trials (dark red line).
# Now hack the "So close" trials by adding samples to the H_0 dataset
## Identify the so-close trials and p-values
hack_index_mask = (pvals >= 0.05) & (pvals < .1)
hacked_datasets = datasets[:, :, hack_index_mask]
n_hacked_trials = hacked_datasets.shape[2]
## Add samples and re-run tests, collecting new p-values
n_additional_samples = 1
hacked_datasets = np.append(hacked_datasets, null.sample((2, n_additional_samples, n_hacked_trials)), axis=1)
hacked_pvals = stats.ttest_ind(hacked_datasets[0], hacked_datasets[1], axis=0).pvalue
# Display resulting hacked p-values distribution
plt.subplots(figsize=(10, 6))
hacked_cnt, hacked_bin_left, hacked_patches = pval_rate_histogram(hacked_pvals, color=COLORS.blue)
inflated_type_I = hacked_patches[0]
inflated_type_I.set_color(COLORS.red)
inflated_type_I_rate = 100. * hacked_cnt[0]
inflated_type_I.set_label(f"{round(inflated_type_I_rate)}% of Trials are False Positives")
plt.axhline(ALPHA, color=COLORS.dark_red, label='Expected Type I Error Rate')
plt.xlabel('p-values')
plt.ylabel('Proportion of Trials')
plt.legend()
plt.title(f"p-values of 'Soo close!' trials after adding {n_additional_samples} additional datapoint(s)")
To demonstrate how hacking p-values via N chasing inflates false positive rates, we take the “So close” (blue) trials from the simulation in Figure 1, and add to each trial a random data point drawn from the \(H_0\) distribution. We then re-run our two-sample t-tests and histogram the resulting p-values.
Figure 2 shows the resulting distribution of hacked p-values. These trials originally exhibited a False Positive Rate of 0% (i.e. they did not fall into the \(p \le \alpha = 0.05\) bin). However, these trials now exhibit a Type I error rate over 25% (red), five times the expected false positive rate of 5% (dark red line), just from adding a single data point to those trials!
Another piece of evidence suggesting that something has gone awry is that the distribution of p-values on these augmented trials is no longer uniform, but right-skewed. Thus the statistical test on these data is no longer unbiased, instead favoring lower p-values.
The problem here is that we’re adding information into the system by first calculating test statistics/p-values, interpreting the results, then deciding to add more data and testing again. It turns out that this is a flavor of statistical error known as the Multiple Comparisons Problem.1
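To get a feel for the magnitude of the problem, consider running \(k\) independent tests at level \(\alpha\): the chance of at least one false positive is \(1 - (1 - \alpha)^k\). Repeated tests on an accumulating dataset are correlated rather than independent, so this is only a rough intuition, but the direction of the effect is the same:
# Chance of at least one false positive across k independent tests at
# level alpha. (Tests on overlapping data are correlated, so treat this
# as intuition only, not an exact rate for N chasing.)
alpha = 0.05
for k in (1, 2, 5, 10, 20):
    print(f"k={k:2d}: {1 - (1 - alpha) ** k:.2f}")
# k= 1: 0.05, k= 2: 0.10, k= 5: 0.23, k=10: 0.40, k=20: 0.64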
It’s worth noting that the simulation presented here is based on a pretty small sample size of \(N=10\). Thus, adding a single data point has a much larger effect on the Type I error rate than it might for larger sample sizes. However, the effect holds for larger \(N\) as well if one adds new samples in proportion to \(N\).
N chasing is just one of many spooky gotchas that come along with using Null hypothesis-based statistical tests (NHST). This particular p-hacking effect comes up when you've run the experiment, failed to reach significance, and then decided to keep running the experiment after looking at the results. If you've ever said something like "oh, let's just run it a little longer," then you're probably p-hacking.
The negative effects of N chasing can be minimized by sticking to standardized protocols for running experiments that use NHSTs: running an initial power analysis to calculate the required sample size for a desired effect size and statistical power, then stopping your data collection once you've reached the requirements prescribed by the power analysis. Continuing to collect data beyond what is prescribed will inflate your Type I error rate, and likely provide misleading results for your experiment.
Building computational systems that can interact naturally with humans requires computers to be able to process natural language, i.e. words. Words carry information about real-world, semantic entities, actions, or notions. However, computers do not operate in the space of semantics, but instead employ numerical operations. Therefore in order to build useful computer-human-language interfaces, we need a way of numerically representing words and their semantics.
The classic way of representing a word numerically is to use a “1-of-\(K\)” or “one-hot” encoding (Figure 1, Top Left). This encoding uses a sparse vector of length-\(K\), representing each of the words in a length-\(K\) vocabulary. The vector is filled with zeros except a single value of one located at the index associated with the represented word. One can think of the 1-of-\(K\) encoding acting like a vectorized indicator variable for the presence of a word.
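As a quick illustration, here is a minimal sketch of a 1-of-\(K\) encoder; the toy vocabulary is made up for the example.
import numpy as np

# Toy vocabulary of K = 5 words (made up for this illustration)
vocabulary = ['the', 'dog', 'cat', 'likes', 'tuna']
token_to_idx = {token: idx for idx, token in enumerate(vocabulary)}

def one_hot(token):
    """Return the 1-of-K encoding of `token`: all zeros except for a
    single one at the token's vocabulary index."""
    vector = np.zeros(len(vocabulary))
    vector[token_to_idx[token]] = 1.0
    return vector

print(one_hot('cat'))  # [0. 0. 1. 0. 0.]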
This 1-of-\(K\) encoding is simple and provides an orthogonal set of features to represent words. Consequently, it has been the backbone of many NLP models for decades. However, 1-of-\(K\) encoding can be inefficient in that the cardinality of the feature space can become quite large for large vocabularies, quickly running into the curse of dimensionality that makes so many machine learning problems ill-posed, or that requires tons of observations to obtain good parameter estimates. Additionally, the 1-of-\(K\) encoding carries little semantic information about the vocabulary it represents.
In recent years, a more compact alternative to 1-of-\(K\) encoding, which carries more semantic information, has been to use word embeddings. Rather than large, sparse vectors, word embeddings provide for each word a dense vector with length that is generally orders of magnitude smaller than the 1-of-\(K\) encoding (generally on the order of a few hundred dimensions or less).
There are a number of ways to derive dense word embeddings, but by far the most common approach is to use the word2vec algorithm. This post won't go into the details of word2vec, but the basic idea goes like this: the word2vec algorithm trains a neural network that is optimized on a corpus of sentences. Given a query word \(w\) sampled from one of the corpus sentences, the network's task is to predict each of the words \(c\) that are located within a context window \(C\) surrounding the query word (Figure 1, Right).
Figure 1, Various methods for representing words numerically. Top Left, “1-of-\(K\)” encoding represents each word as a sparse vector of \(K\) entries with only a single one-valued entry indicating the presence of a particular word. Right, The word2vec algorithm trains a two-layer neural network to predict, given a sentence and a query word from that sentence \(w\), the words \(c\) located within a context window \(C\) surrounding \(w\). Bottom Left, Once the neural network has been optimized, each row of the \(K \times D\) weight matrix in the first hidden layer of the neural network \(\beta_{embedding}\) provides a dense vector representation for each of the \(K\) words in the vocabulary.
The input to the neural network is the 1-of-\(K\) representation of the query word, and each of the target context words is represented as, you guessed it, a 1-of-\(K\) encoding. For each query word there are \(\mid C \mid - 1\) classification targets, one for each context word \(c\). The neural network uses a hidden layer comprised of \(D\) units, and thus there is a matrix of parameters \(\beta_{embedding} \in \mathbb{R}^{K \times D}\) that linearly maps each word into a latent space of size \(D \ll K\). After the network has converged, each row of the first layer of weights \(\beta_{embedding}\) provides for each word a dense embedding vector representation of size \(D\), rather than \(K\) (Figure 1, Bottom Left).
It turns out that the resulting word embedding vectors capture rich semantic information about the words in the corpus. In particular, words that are semantically similar occupy nearby locations in the \(D\)-dimensional space (Figure 1, Bottom Left). Additionally, semantic relationships amongst words are encoded by displacements in the embedding space.
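Though this post won't implement word2vec, for reference, training word2vec embeddings looks roughly like the following sketch using the gensim library (an assumed dependency used only for this illustration, on toy sentences):
# A minimal word2vec sketch using gensim (assumed installed; used only
# for this illustration). The corpus here is a toy example.
from gensim.models import Word2Vec

sentences = [
    ['the', 'slow', 'dog', 'chased', 'the', 'fat', 'cat'],
    ['the', 'dog', 'likes', 'bones'],
    ['the', 'cat', 'likes', 'tuna'],
]
model = Word2Vec(
    sentences,
    vector_size=16,  # embedding dimensionality, D
    window=2,        # context window size
    min_count=1,     # keep every token in this tiny corpus
    sg=1             # use the skip-gram formulation described above
)
print(model.wv['dog'].shape)  # (16,)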
Calculating word embeddings using the word2vec algorithm requires building and training a neural network, which in turn involves a considerable amount of calculus necessary for gradient-based parameter optimization. It turns out that there is a simpler way to calculate equivalent word vectors using a little information theory and linear algebra.1 Before digging into this method, let’s first introduce a few basic concepts.
The foundation of information theory is probability, and specifically relevant for this post, marginal and joint probabilities. The marginal probability of a word \(p(w_i)\) within a corpus of text is simply the number of times the word occurs \(N(w_i)\) divided by the total number of word occurrences in the corpus \(\sum_k N(w_k)\):
\[p(w_i) = \frac{N(w_i)}{\sum_k N(w_k)} \tag{1}\]In this post we refer to \(N(w_i)\) as unigram frequency, as it is a count of the number of times a single word, or “unigram”, occurs in the corpus.
from collections import Counter
class UnigramFrequencies(object):
    """Simple Unigram frequency calculator.

    Parameters
    ----------
    documents : list[list[str]]
        A list of documents, each document being a list of strings
    """
    def __init__(self, documents):
        self.unigram_counts = Counter()
        for doc in documents:
            self.unigram_counts.update(doc)
        self.token_to_idx = {tok: indx for indx, tok in enumerate(self.unigram_counts.keys())}
        self.idx_to_token = {indx: tok for tok, indx in self.token_to_idx.items()}

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.unigram_counts[item]
        elif isinstance(item, int):
            return self.unigram_counts[self.idx_to_token[item]]
        raise ValueError(f"type {type(item)} not supported")
The joint probability of word \(w_i\) and another word \(w_j, j\neq i\) is simply the number of times the words co-occur \(N(w_i, w_j)\) divided by the total number of words:
\[p(w_i, w_j) = \frac{N(w_i, w_j)}{\sum_k N(w_k)} \tag{2}\]There are many possible definitions of co-occurrence, but in this post we’ll use skipgram frequencies to define co-occurrence. Skipgrams define the joint frequency function \(N(w_i, w_j) = N(w_i, c_{t \pm l})\) as the number of times the context word \(c_{t \pm l}\) occurs within a context window \(C\) that surrounds the target/query word \(w_i\); \(t\) being the token index of the query word and \(l\) being the number of steps preceding or following the query word within the context window (Figure 2, Top Left). This is reminiscent of the context words being individual classification targets in the word2vec approach (Figure 1, Right), but in this case we simply tally up counts of the context words, rather than try to build a classifier to predict the occurrence of the context words.
class SkipgramFrequencies(object):
    """Simple skipgrams frequency calculator

    Parameters
    ----------
    documents : list[list[str]]
        A list of documents, each document being a list of strings
    backward_window_size : int
        The number of words to the left used to define the context window
    forward_window_size : int
        The number of words to the right used to define the context window
    """
    def __init__(
        self,
        documents,
        backward_window_size=2,
        forward_window_size=2
    ):
        self.backward_window_size = backward_window_size
        self.forward_window_size = forward_window_size
        self.skipgram_counts = Counter()
        # Independent word frequencies
        self.unigrams = UnigramFrequencies(documents)
        # Conditional word-context frequencies
        for doc in documents:
            token_idxs = [self.token_to_idx[token] for token in doc]
            n_document_tokens = len(token_idxs)
            for token_idx, _ in enumerate(token_idxs):
                context_window_start = max(0, token_idx - self.backward_window_size)
                context_window_end = min(n_document_tokens - 1, token_idx + self.forward_window_size) + 1
                context_idxs = [
                    context_idx for context_idx in range(context_window_start, context_window_end)
                    if context_idx != token_idx
                ]
                for context_idx in context_idxs:
                    skipgram = (token_idxs[token_idx], token_idxs[context_idx])
                    self.skipgram_counts[skipgram] += 1

    @property
    def idx_to_token(self):
        return self.unigrams.idx_to_token

    @property
    def token_to_idx(self):
        return self.unigrams.token_to_idx
Given the marginal and joint probabilities \(p(w_i)\) and \(p(w_i, w_j)\), we can calculate some powerful information-theoretic quantities. Of particular interest is the Pointwise Mutual Information (PMI):
\[PMI(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i) p(w_j)} \tag{3}\]The PMI matrix offers an intuitive and straight-forward means for calculating associations between words in a corpus: each row gives the amount of information shared between a word and all other words in the corpus. Intuitively, the PMI matrix represents the amount of association between two words. If the two words are independent–i.e. not associated–then the PMI is zero.
Computationally, the PMI is just the log of the joint probability for two words, after being rescaled by the marginal probabilities for each word. Normalizing the joint probability of the two words by the product of their marginal probabilities generates a more nuanced representation of their co-occurrence when compared to the raw co-occurrence frequencies. This can be seen in Figure 2, Top Row: the PMI has more small-scale structure than the basic skipgram frequency matrix.
import numpy as np
from scipy.sparse import csr_matrix
from matplotlib import pyplot as plt
def calculate_pairwise_frequency_matrix(skipgrams, recalculate=False):
    """Given a SkipgramFrequencies instance, returns the associated
    pairwise frequency counts as a sparse matrix
    """
    row_idxs = []
    col_idxs = []
    matrix_values = []
    for (token_idx_1, token_idx_2), skipgram_count in skipgrams.skipgram_counts.items():
        row_idxs.append(token_idx_1)
        col_idxs.append(token_idx_2)
        matrix_values.append(skipgram_count)
    return csr_matrix((matrix_values, (row_idxs, col_idxs)))
def calculate_pmi_matrix(skipgrams, enforce_positive=False, recalculate=False):
    """Given a SkipgramFrequencies instance, returns the associated pointwise
    mutual information (PMI) matrix in sparse (CSR) format
    """
    # Get frequency matrix
    frequency_matrix = calculate_pairwise_frequency_matrix(skipgrams)
    # Precalculate some reusable things
    n_skipgrams = frequency_matrix.sum()
    word_sums = np.array(frequency_matrix.sum(axis=1)).flatten()     # row sums
    context_sums = np.array(frequency_matrix.sum(axis=0)).flatten()  # column sums
    # Sparse matrix components
    row_idxs = []
    col_idxs = []
    matrix_values = []
    for (skipgram_word_idx, skipgram_context_idx), skipgram_count in skipgrams.skipgram_counts.items():
        # p(w, c)
        joint_probability = skipgram_count / n_skipgrams
        # p(w)
        n_word = word_sums[skipgram_word_idx]
        p_word = n_word / n_skipgrams
        # p(c)
        n_context = context_sums[skipgram_context_idx]
        p_context = n_context / n_skipgrams
        # Pointwise mutual information = log[p(w, c) / p(w)p(c)]
        pmi = np.log(joint_probability / (p_word * p_context))
        # Update sparse matrix entries
        row_idxs.append(skipgram_word_idx)
        col_idxs.append(skipgram_context_idx)
        matrix_values.append(pmi)
    return csr_matrix((matrix_values, (row_idxs, col_idxs)))
The PMI matrix is a square, \(K \times K\) matrix. Therefore, if we have a large vocabulary, the PMI matrix can be quite large (though likely sparse). We’ve discussed in a previous post how Singular Value Decomposition (SVD) can be used to compress large matrices. If we apply SVD to the PMI matrix, using a low-rank approximation with \(D \ll K\), we can compute a compact representation of the word association information captured by the PMI matrix. Specifically, we use the left singular vectors \(U\), rescaled by the square root of the singular values \(S\) returned by the SVD (Figure 2, Bottom Row).2
from sklearn.decomposition import TruncatedSVD
def calculate_word_vectors(stats, n_dim=128):
    """Calculates word embedding vectors as the left singular vectors of
    the Singular Value Decomposition of the Pointwise Mutual Information
    matrix, rescaled by the square root of the singular values.
    """
    # Get PMI matrix
    if isinstance(stats, SkipgramFrequencies):
        pmi_matrix = calculate_pmi_matrix(stats)
    elif isinstance(stats, csr_matrix):
        pmi_matrix = stats
    else:
        raise ValueError(f"type {type(stats)} not supported")
    # Alternatively, we could use the scipy.sparse.linalg.svds / ARPACK
    # algorithm, but the Halko (2009) randomized algorithm used by default
    # generally scales better on a laptop.
    svd = TruncatedSVD(n_components=n_dim, n_iter=50)
    # fit_transform returns U @ diag(S); divide out the singular values to
    # recover the left singular vectors U
    U = svd.fit_transform(pmi_matrix) / svd.singular_values_
    # Use left singular vectors of PMI, scaled by the square root of the
    # singular values, as embeddings
    return U * np.sqrt(svd.singular_values_)
Figure 2, Information-theoretic Word Embeddings from PMI and SVD. Top Row: Unigram frequencies and a \(K \times K\) Skipgram frequency matrix are calculated based on a corpus of sentences and a predefined context window \(C\). In this example \(K=9\) is the size of the vocabulary in the corpus. These frequencies are used to calculate a PMI matrix via Equation 3. Bottom Row: Truncated SVD with \(D \ll K\) is applied to the PMI matrix, returning low-rank left singular vectors \(U\) and singular values \(S\). In this toy example \(D=3\). The low-rank left singular vectors are rescaled by the square root of the singular values to return a compressed representation of the PMI matrix of size \(K \times D\). Each row of this low-rank matrix provides an embedding vector for each of the \(K\) words in the vocabulary (Right).
toy_corpus = [
    'the slow dog chased the fat cat',
    'the dog likes bones',
    'the cat likes tuna'
]
toy_corpus = [c.split(" ") for c in toy_corpus]
# Calculate the skipgram frequency matrix
toy_skipgrams = SkipgramFrequencies(toy_corpus)
toy_frequency_matrix = calculate_pairwise_frequency_matrix(toy_skipgrams)
# Calculate the PMI matrix
toy_pmi_matrix = calculate_pmi_matrix(toy_skipgrams)
# Calculate embeddings
n_embedding_dims = 3 # D
# Calculate associated SVD (redundant, but meh)
U, S_, V = np.linalg.svd(toy_pmi_matrix.todense())
# Truncate at D
S = np.zeros((n_embedding_dims, n_embedding_dims))
np.fill_diagonal(S, S_[:n_embedding_dims])
U = U[:, :n_embedding_dims]
V = V[:n_embedding_dims, :]
toy_embeddings = U @ S ** .5
# Visualizations
fig, axs = plt.subplots(2, 4, figsize=(15, 10))
## Frequency matrix
plt.sca(axs[0][1])
plt.imshow(toy_frequency_matrix.todense())
plt.colorbar(orientation='horizontal', pad=.2)
tics = range(len(toy_skipgrams.idx_to_token))
labels = [toy_skipgrams.idx_to_token[ii] for ii in tics]
plt.xticks(tics, labels=labels, rotation=90)
plt.yticks(tics, labels=labels)
plt.title("Skipgram Frequency")
## PMI Matrix
plt.sca(axs[0][2])
plt.imshow(toy_pmi_matrix.todense())
plt.colorbar(orientation='horizontal', pad=.2)
plt.xticks(tics, labels=labels, rotation=90)
plt.yticks(tics, labels=labels)
plt.title("Pointwise Mutual Information (PMI)")
## Left singular vectors
plt.sca(axs[1][0])
plt.imshow(U, cmap='RdBu_r')
plt.colorbar(orientation='horizontal')
plt.title('$U$')
## Singular values
plt.sca(axs[1][1])
plt.imshow(S, cmap='RdBu_r')
plt.colorbar(orientation='horizontal')
plt.title("$S$")
## Right singular vectors
plt.sca(axs[1][2])
plt.imshow(V, cmap='RdBu_r')
plt.colorbar(orientation='horizontal')
plt.title("$V^T$")
## Resulting embeddings
plt.sca(axs[1][3])
plt.imshow(toy_embeddings, cmap='RdBu_r')
plt.title("Embeddings: $U(S^{1/2})$")
plt.colorbar(orientation='horizontal')
## Clear unused axes
plt.sca(axs[0][0])
plt.axis('off')
plt.sca(axs[0][3])
plt.axis('off')
This information-theoretic/linear algebra method provides word embeddings that are analogous to those calculated using word2vec.1 Like word2vec embeddings, these information-theoretic embeddings provide a numerical representation that carries semantic information: similar words occupy similar locations in the embedding space, and directionality within the space conveys semantic meaning (Figure 3).
Note that this idea isn't all that novel. Similar approaches, for example applying SVD directly to the co-occurrence matrix (rather than the PMI matrix), have been used since the 1990s in algorithms like Latent Semantic Indexing to provide word embeddings.3 However, given the current popularity of deep learning and predictive methods, simpler frequency-based and linear algebra-based methods like LSA and the method proposed here have received a lot less attention recently.
As a proof of concept, let's calculate word embeddings on some real data using the proposed method. For this demo we'll analyze the 20Newsgroups dataset, which is easily accessible in scikit-learn.
First we load in the data and do some basic preprocessing, including tokenization and stopword and punctuation removal using nltk. This will give a corpus of tokens that we can analyze using the steps outlined above.
from sklearn.datasets import fetch_20newsgroups
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
STOPWORDS = stopwords.words('english')
PUNCTUATION = set(list(punctuation))
def valid_token(token):
    """Basic token filtering for the 20 Newsgroups task. Results in cleaner
    embeddings and faster convergence. Removes stopwords and any tokens
    containing punctuation.
    """
    if token in STOPWORDS:
        return False
    if any([t in PUNCTUATION for t in list(token)]):
        return False
    return True

def preprocess(document):
    """Simple preprocessing"""
    return [w for w in word_tokenize(document.lower()) if valid_token(w)]
# For dataset details, see https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers'))
corpus = [preprocess(doc) for doc in dataset.data]
From this corpus data we'll: (1) calculate the unigram and skipgram frequencies of the corpus, (2) calculate the associated PMI matrix, and (3) calculate the embedding matrix by applying truncated SVD to the PMI matrix.
For this example we'll use an embedding dimensionality of \(D=256\). Notice in the code below that using this dimensionality reduces the PMI matrix from a size of roughly 20k by 20k to a size of 20k by 256, a roughly 77x reduction in entries (when in dense format).
# 1. Calculate unigram / skipgram frequencies of the corpus
skipgram_frequencies = SkipgramFrequencies(corpus)
# 2. Calculate associated PMI matrix
pmi_matrix = calculate_pmi_matrix(skipgram_frequencies)
# 3. Calculate the embedding matrix with D=256
embeddings_matrix = calculate_word_vectors(pmi_matrix, n_dim=256)
print(embeddings_matrix.shape)
# (19699, 256)
In Figure 3 below we visualize the semantic representation of the embedding vectors calculated from the 20Newsgroups corpus by plotting a few query words (red) along with the words with the 20 most similar embedding vectors (gray), as measured by cosine similarity. For the visualization we use the first two dimensions of the embedding space. We can see that words that are nearby in the embedding space are generally semantically similar.
Figure 3, Visualization of information-theoretic embedding vectors derived from the Newsgroup20 data set. Each subpanel plots a query word (red) and the top 20 words with embedding vectors that have the smallest cosine distance from the embedding vector of the query. Word embedding vectors encode semantic relationships amongst words.
# sklearn's cosine_similarity operates on matrices of row vectors; scipy's
# scipy.spatial.distance.cosine is a *distance* between two 1-D vectors
from sklearn.metrics.pairwise import cosine_similarity
from matplotlib import pyplot as plt
class MatrixNearestNeighborsIndex(object):
    """Simple nearest neighbors index based on a pre-calculated matrix of
    item vectors.

    Parameters
    ----------
    matrix : ndarray or sparse array
        n_items x n_dims matrix of item representations
    idx_to_token : dict
        Mapping between matrix row indices and tokens
    token_to_idx : dict
        Mapping between tokens and matrix row indices

    Notes
    -----
    For simplicity, we could probably infer token_to_idx from idx_to_token,
    but meh
    """
    def __init__(self, matrix, idx_to_token, token_to_idx):
        self.matrix = matrix
        self.idx_to_token = idx_to_token
        self.token_to_idx = token_to_idx

    def most_similar_from_label(self, query_label, n=20, return_self=False):
        query_idx = self.token_to_idx.get(query_label, None)
        if query_idx is not None:
            return self.most_similar_from_index(query_idx, n=n, return_self=return_self)

    def most_similar_from_index(self, query_idx, n=20, return_self=False):
        query_vector = self.get_vector_from_index(query_idx)
        return self.most_similar_from_vector(query_vector, n=n, query_idx=query_idx if not return_self else None)

    def most_similar_from_vector(self, query_vector, n=20, query_idx=None):
        if isinstance(self.matrix, csr_matrix):
            sims = cosine_similarity(self.matrix, query_vector).flatten()
        else:
            sims = self.matrix.dot(query_vector)
        sim_idxs = np.argsort(-sims)[:n + 1]
        # Exclude the query item itself from the results
        sim_idxs = [idx for idx in sim_idxs if query_idx is None or idx != query_idx]
        sim_word_scores = [(self.idx_to_token[sim_idx], sims[sim_idx]) for sim_idx in sim_idxs[:n]]
        return sim_word_scores

    def get_vector_from_label(self, label):
        query_idx = self.token_to_idx.get(label, None)
        if query_idx is not None:
            return self.get_vector_from_index(query_idx)
        else:
            return np.zeros(self.matrix.shape[1])

    def get_vector_from_index(self, query_idx):
        if isinstance(self.matrix, csr_matrix):
            return self.matrix.getrow(query_idx)
        else:
            return self.matrix[query_idx]

    def __getitem__(self, item):
        if isinstance(item, int):
            return self.get_vector_from_index(item)
        elif isinstance(item, str):
            return self.get_vector_from_label(item)

    def __contains__(self, item):
        return item in self.token_to_idx
# Initialize an nn-index using our embedding vectors
nns = MatrixNearestNeighborsIndex(
    embeddings_matrix,
    skipgram_frequencies.idx_to_token,
    skipgram_frequencies.token_to_idx
)
def plot_label(xy, label, color='gray', fontsize=12):
    # Use an explicit marker so a single point is actually visible
    plt.plot(xy[0], xy[1], 'o', c=color)
    plt.text(xy[0], xy[1], label, c=color, fontsize=fontsize)
labels = ['mlb', 'religion', 'ibm', 'planet']
fig, axs = plt.subplots(2, 2, figsize=(10, 10), dpi=300)
for ii, ax in enumerate(axs.ravel()):
    label = labels[ii]
    plt.sca(ax)
    most_similar = nns.most_similar_from_label(label)
    for sim_label, sim_score in most_similar:
        xy = nns.matrix[nns.token_to_idx[sim_label]][:2]
        plot_label(xy, sim_label)
    xy = nns.matrix[nns.token_to_idx[label]][:2]
    plot_label(xy, label, color='crimson', fontsize=16)
    plt.grid()
    plt.box('on')
    plt.title(label, fontsize=16)
plt.suptitle('Most similar words for various queries', fontsize=18)
A common demonstration of how embedding vectors encode semantic information is the “analogy” trick. The idea being that you can apply vector arithmetic to word embeddings to solve analogy tasks such as “King is to Queen as Man is to __”. These analogies would be solved by using vector arithmetic like so:
\[\text{embed}["king"] - \text{embed}["man"] = \text{embed}["queen"] - \text{embed}["woman"] \\ \text{embed}["queen"] = \text{embed}["king"] - \text{embed}["man"] + \text{embed}["woman"]\]I’ve actually never been able to get these analogy tricks to work consistently, and it turns out this isn’t an uncommon experience.4 The size and statistical bias of the corpus used to calculate the embeddings have a strong influence on the efficacy of these vector arithmetic tricks, which require very similar frequency representations of words to derive similar vectors. Unless you get the exact alignments amongst vectors, you’ll likely not get these tricks to work consistently; they may work for some examples and not for others.
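For what it's worth, here is how the arithmetic can be carried out with the nearest-neighbors index defined above (a sketch only; as noted, don't expect the textbook answer to pop out):
# Attempt the analogy "king - man + woman ~ queen" with our embeddings.
# A sketch only: results depend heavily on the corpus, and often won't
# match the textbook answer.
analogy_vector = nns['king'] - nns['man'] + nns['woman']
print(nns.most_similar_from_vector(analogy_vector, n=5))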
We can still demonstrate the general mechanisms used to calculate these analogies, however. Figure 4 below plots the words involved in the classic royalty analogy introduced above, along with a similar analogy comparing "boy" to "man" and "girl" to "woman". The 20Newsgroups corpus has many more examples of "boy" and "girl" than "queen", so we get more consistent results for those examples (it turns out "king" occurs a lot in the data set because it contains many religious, Christian posts that intermix the notions of kings, gods, etc.). Specifically, the vectors encoding the displacement from "boy" to "man" and from "girl" to "woman" are nearly parallel and almost equal in length.
Figure 4, Traversing the embedding space carries semantic information: By definition of SVD, the information-theoretic embedding space dimensions are rank-ordered by importance in terms of variance explained amongst the word associations. This allows the embedding space to be easily visualized without the need for dimensionality reduction techniques like PCA. Here we display the two most "important" dimensions. Similar displacements within the embedding space carry similar semantic information for related words. For example, moving from "boy" to "man" (green line) is a very similar vector displacement as moving from "girl" to "woman" (red line). The "king/man" (blue) vs "queen/woman" (orange) analogy referenced in many word embedding papers is also demonstrated.
def plot_embeddings(sims, pairs):
    fig, axs = plt.subplots(figsize=(5, 5), dpi=300)
    plt.sca(axs)
    for labels in pairs:
        xys = []
        for ii, label in enumerate(labels):
            label_idx = sims.token_to_idx[label]
            x = sims.matrix[label_idx, 0]
            y = sims.matrix[label_idx, 1]
            plt.plot(x, y, 'o', c='black')
            plt.text(x + 10, y, label)
            xys.append([x, y])
        plt.plot([xys[0][0], xys[1][0]], [xys[0][1], xys[1][1]])
    plt.grid()
    plt.title('First 2-dimensions of Embedding Space')
plot_embeddings(nns, [('king', 'man'), ('queen', 'woman'), ('prince', 'boy'), ('princess', 'girl')])
Another way to demonstrate the representation capacity of our word embeddings is to see if we can build an accurate predictive model using these embeddings as machine learning feature vectors. The 20Newsgroups dataset is comprised of approximately 18,000 posts categorized into 20 topics. Below we build a 20-way classifier that predicts the topic of each post based on the average embedding vector calculated across all words in each post.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
def featurize_document(document, nearest_neighbors):
    vectors = [nearest_neighbors[d] for d in document if d in nearest_neighbors]
    if vectors:
        return np.vstack(vectors).mean(0)
    return np.zeros_like(nearest_neighbors.matrix[0])

def featurize_corpus(corpus, nearest_neighbors):
    vectors = [featurize_document(document, nearest_neighbors) for document in corpus]
    return np.vstack(vectors)
# Featurize the text using our embeddings
features = featurize_corpus(corpus, nns)
# Get train/test sets
X_train, X_test, y_train, y_test, train_idx, test_idx = train_test_split(
    features, dataset.target, range(len(dataset.target))
)
# Fit a Logistic regression classifier
clf = LogisticRegression(max_iter=100, solver='sag').fit(X_train, y_train)
# Get testing set performance
pred_test = clf.predict(X_test)
# Keep copy of actual performance around for plotting effect of training
# set size (Figure 5)
class_report = classification_report(
    y_test, pred_test,
    target_names=dataset.target_names,
    output_dict=True
)
print(classification_report(y_test, pred_test, target_names=dataset.target_names))
The classifier’s performance on all 20 categories is printed below:
precision recall f1-score support
alt.atheism 0.60 0.65 0.62 186
comp.graphics 0.67 0.65 0.66 248
comp.os.ms-windows.misc 0.63 0.72 0.67 228
comp.sys.ibm.pc.hardware 0.73 0.64 0.68 241
comp.sys.mac.hardware 0.74 0.75 0.74 230
comp.windows.x 0.79 0.72 0.75 262
misc.forsale 0.70 0.74 0.72 232
rec.autos 0.85 0.80 0.83 251
rec.motorcycles 0.83 0.84 0.83 255
rec.sport.baseball 0.90 0.92 0.91 286
rec.sport.hockey 0.94 0.94 0.94 258
sci.crypt 0.82 0.82 0.82 250
sci.electronics 0.66 0.68 0.67 256
sci.med 0.85 0.86 0.85 242
sci.space 0.86 0.82 0.84 260
soc.religion.christian 0.70 0.81 0.75 227
talk.politics.guns 0.66 0.78 0.71 224
talk.politics.mideast 0.87 0.87 0.87 224
talk.politics.misc 0.73 0.64 0.68 223
talk.religion.misc 0.47 0.33 0.39 129
accuracy 0.76 4712
macro avg 0.75 0.75 0.75 4712
weighted avg 0.76 0.76 0.76 4712
Not too shabby for a super-simple embedding-based classifier! This demonstrates the ability of our 256-dimensional word embedding vectors to capture useful information in text to aid in accurate text classification.
You may notice that we do a lot better on some categories (e.g. rec.sport.hockey) than other categories (e.g. talk.religion.misc). This could be due to a few things, one of them being the number of training observations available for each category. I was curious about this, so in Figure 5 I plot the testing set F1-score against the number of training set observations used to fit the classifier. It turns out, unsurprisingly, that there is a strong correlation between the amount of training data for a category and the performance of the classifier for that category.
Figure 5, Test set performance of the simple embedding-based classifier as a function of category training set size. Each point plots, for one of the 20 categories, the test set F1-score against the number of training observations for that category. Categories with more training data tend to be classified more accurately.
# Get the number of training observations associated with each category
n_training_observations = Counter(y_train)
n_training_observations_per_category = {dataset.target_names[k]: n_training_observations[k] for k in n_training_observations.keys()}
# Plot relationship between Test Set F1 and # of training observations per category
plt.subplots(figsize=(10, 12), dpi=300)
for k, params in class_report.items():
    if k not in ('weighted avg', 'macro avg', 'accuracy'):
        plt.plot(n_training_observations_per_category[k], params['f1-score'], 'o')
        plt.text(n_training_observations_per_category[k], params['f1-score'], k, fontsize=14)
plt.xlabel('# of Training Observations', fontsize=16)
plt.ylabel('F1-score', fontsize=16)
plt.axis('tight')
plt.xlim([450, 800])
plt.ylim([.0, 1.])
plt.title('Test Set Performance as Function\nof Category Training Set Size', fontsize=16)
plt.grid()
Figure 5 shows a roughly linear relationship between training set sample size and the testing set F1-score of the classifier. This indicates that, at least in part, sample size is a large contributor to the classifier's performance. Further error analysis would be required to rule out other contributing factors (beyond the scope of this post).
In this post we visited a method for calculating word embedding vectors using a classical, pre-deep-learning computational approach. Specifically we showed that with some simple frequency counts, a little information theory, and linear algebra (all methods available before the 1960s), we can derive numerical word representations that are on par with state-of-the-art word embeddings that require recently-developed (well, at least since the 1980s 😉) deep learning methods.
Some benefits to this method include: no neural network or gradient-based training is required, just frequency counts, a little information theory, and an SVD; and, by the definition of SVD, the resulting embedding dimensions are rank-ordered by importance, making the embedding space easy to interpret and visualize without additional dimensionality reduction.
This is just one of the many applications that leverage the versatility of linear algebra and the Singular Value Decomposition!
O. Levy and Y. Goldberg (2014). Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems (27): 2177–2185.
These are the eigenvalues associated with the row space of the (unscaled) covariance of the PMI matrix, \((PMI)^T(PMI)\). SVD applied to a symmetric matrix \(M\) returns in the left singular vectors \(U\) the eigenvectors associated with the row space of \(MM^T = M^TM\). Likewise, since the PMI matrix is symmetric, the eigenvalues returned by SVD are also associated with the covariance of the PMI matrix.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science. 41 (6): 391–407.
A. Rogers (2019). On word analogies and negative results in NLP.
First off, what do we mean by low-rank approximation? Say you have an \(m \times n\) data matrix \(X\). The data contained in \(X\) could be anything, really. For example, in a computer vision setting, \(X\) could encode a single image, where each entry in the matrix is a pixel intensity value at a location encoded by the \(i,j\)-th row and column. In a machine learning setting, \(X\) could be a data set, where each row is an observation and each column is a measurable dimension. Heck, in a computer-vision-meets-machine-learning setting, \(X\) could represent multiple images, with each image being encoded as a row, and each column being one of \(n = \text{width} \times \text{height}\) values, encoding the image location-pixel values unraveled into a row-vector.
No matter the type of information \(X\) encodes, it will have a matrix rank \(r\), which is essentially the number of linearly independent columns (column rank) or rows (row rank) contained in the matrix. We'll focus on column rank in this post. It's entirely possible (and common) for a matrix to have a rank that is smaller than the number of columns in the matrix. For example, the left two plots in Figure 1 display two different matrices \(X\) and \(\tilde X\). These two matrices have the same column rank, despite having different numbers of columns. This is because the matrix \(X\) is full rank in that its column rank is equal to the number of columns. In contrast, the matrix \(\tilde X\) contains redundant columns, resulting in a column rank that is smaller than the number of columns.1
Figure 1: Matrix Rank and Reconstruction. Left: a full-column-rank matrix \(X\). Middle: a matrix \(\tilde X\) with redundant columns formed by scaling and concatenating columns of \(X\). Right: exact reconstruction of \(\tilde X\) using a rank \(k=r=4\) singular value decomposition.
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['image.cmap'] = 'RdBu_r'
np.random.seed(123) # Repeatability
MATRIX_RANK = 4
# Create a random matrix, which will have independent columns
# and thus a rank equal to the number of columns
X_orig = np.random.randn(8, MATRIX_RANK)
X_orig_rank = np.linalg.matrix_rank(X_orig)
# Create a new redundant matrix with twice as many columns, but new columns are
# simply a linear scaling of original matrix
X_redundant = np.hstack([-2 * X_orig, 2 * X_orig])
# Permute columns of redundant matrix
X_redundant = X_redundant[:, np.random.permutation(np.arange(X_redundant.shape[1]))]
X_redundant_rank = np.linalg.matrix_rank(X_redundant)
# Run SVD on redundant matrix, we'll use this for LRA
U, S, V = np.linalg.svd(X_redundant)
# Low-rank reconstruction (exact in this case)
X_redundant_reconstructed = U[:, :MATRIX_RANK] * S[:MATRIX_RANK] @ V[:MATRIX_RANK, :]
reconstruction_error = np.sum((X_redundant - X_redundant_reconstructed) ** 2)
# Visualizations
## Original matrix
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
plt.sca(axs[0])
plt.imshow(X_orig)
plt.clim([-2, 2])
plt.title(f"A Rank $r=${X_orig_rank} matrix, X")
## Redundant matrix
plt.sca(axs[1])
plt.imshow(X_redundant)
plt.clim([-2, 2])
plt.title(f"A redundant Rank $r=${X_redundant_rank} matrix, $\\tilde X$")
## Low-rank approximation (exact reconstruction)
plt.sca(axs[2])
plt.imshow(X_redundant_reconstructed)
plt.clim([-2, 2])
plt.title(f"Reconstruction of $\\tilde X$\nusing {MATRIX_RANK} components of SVD\nTotal Squared Error: {reconstruction_error:1.1f}")
plt.suptitle("Low-Rank Approximation of a Matrix", fontsize=18)
When a matrix like \(\tilde X\) contains redundant information, that matrix can often be compressed: i.e. it can be represented using less data than the original matrix with little-to-no loss in information. One way to perform compression is by using LRA.
Low-rank approximation (Figure 2) is the process of representing the information in a matrix \(M\) using a matrix \(\hat M\) that has a smaller rank than the original matrix. To reduce the rank of \(\hat M\) we can attempt to construct the matrix as a combination of a "tall" left-hand matrix \(L_k\) and a "wide" right-hand matrix \(R_k^T\):
\[\begin{align} M &= L R^T \\ &\approx L_k R_k^T \\ &\approx \hat M \tag{1} \end{align}\]This allows a matrix that would normally be represented using \(m \times n\) values to be represented using \(k(m + n)\) values. If \(k\) is small relative to \(m\) and \(n\), then the LRA can be used to store important information in \(M\) much more efficiently.
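To make the storage savings concrete, here is a quick sketch (the matrix sizes are chosen purely for illustration):
# Storage for a full m x n matrix vs. a rank-k factorization L_k @ R_k.T
# (sizes chosen purely for illustration)
m, n, k = 1000, 800, 50
full_storage = m * n        # 800,000 values
lra_storage = k * (m + n)   # 90,000 values
print(f"compression: {full_storage / lra_storage:.1f}x")  # ~8.9x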
Figure 2: Low-rank Matrix Decomposition: A matrix \(M\) of size \(m \times n\) and rank \(r\) can be decomposed into a pair of matrices \(L_k\) and \(R_k\). When \(k=r\), the matrix \(M\) can be exactly reconstructed from the decomposition. When \(k < r\), then the decomposition provides a low-rank approximation \(\hat M\) of \(M\).
Low-rank approximation is often useful when the matrix \(M\) contains information that can be ignored, such as redundant information, or irrelevant information that isn’t helpful, or can possibly even be detrimental for solving a particular numerical problem (e.g. noise).
There are a number of methods for constructing the matrix \(\hat M\), but a common method is to use Singular Value Decomposition (SVD). Specifically, SVD decomposes matrix \(M\) into three matrices:
\[\begin{align} M &= USV^T \\ &= (US) V^T \\ &= L R^T \text{, where} \\ L &= (US) \text{, and} \\ R &= V \tag{2} \end{align}\]When full-rank SVD is used, Equation 2 provides a method to exactly reconstruct \(M\). In a similar fashion, Figure 1, right demonstrates how SVD can be used to exactly reconstruct the redundant matrix \(\tilde X\) using a decomposition of rank \(k=r=4\), despite the matrix \(\tilde X\) having 8 columns.2
However, we’re not limited to exact reconstruction of \(M\); SVD offers a straight-forward way to obtain a low-rank approximation of \(M\). We can replace \(U\), \(S\), and \(V\) in Equation 2 with \(U_k\), \(S_k\), and \(V_k\), where we use only the first \(k\) columns of the decomposition matrices:
\[\begin{align} M &\approx U_kS_kV_k^T \\ &\approx \hat M_k. \tag{3} \end{align}\]When \(k < r\) then Equation 3 provides a LRA of \(M\), \(\hat M_k\), via SVD. We’ll demonstrate this more clearly with an example that uses SVD for image compression.
Singular value decomposition can be used to decompose any matrix, which allows us to use SVD to compress all sorts of data, including images. Figure 3, left depicts a grayscale image, encoded as a data matrix \(X\) with rank \(r=128\). When SVD is applied to \(X\), it returns a set of left singular vectors \(U,\) right singular vectors \(V\), and a diagonal matrix \(S\) that contains the singular values associated with the singular vectors.
SVD is great because the singular vectors and values are rank-ordered in such a way that earlier components carry the most information about \(X\). The singular values in \(S\) (Figure 3, center) can be used as a proxy for the amount of information in \(X\) encoded in each component of the decomposition (Figure 3, right).
Figure 3: Singular Value Decomposition of an image \(X\). Left: A Grayscale image can be interpreted as a matrix \(X\). Center: the singular values (blue) and their log (red) as a function of rank \(k.\) Singular values decrease exponentially with rank, with earlier singular values being much larger than later ones. Right: The total information about \(X\) encoded in all the singular values up to \(k.\) A majority of information is encoded in the first singular vectors returned by SVD.
# Load image
img = plt.imread("../assets/images/svd-data-compression/cameraman.png")
# Downsample and encode RGBa image as matrix of intensities, X
DOWNSAMPLE = 4
R = img[::DOWNSAMPLE, ::DOWNSAMPLE, 0]
G = img[::DOWNSAMPLE, ::DOWNSAMPLE, 1]
B = img[::DOWNSAMPLE, ::DOWNSAMPLE, 2]
X = 0.2989 * R + 0.5870 * G + 0.1140 * B
# Calculate the rank of the data matrix, X
img_rank = np.linalg.matrix_rank(X, 0.)
# Run SVD on Image
U, S, V = np.linalg.svd(X)
# Calculate the cumulative variance explained by each singular value
total_S = S.sum()
n_components = len(S)
component_idx = range(1, n_components + 1)
info_retained = 100 * np.cumsum(S) / total_S
# Visualizations
fig, axs = plt.subplots(1, 3, figsize=(16, 4))
## Raw Image, X
plt.sca(axs[0])
plt.imshow(X, cmap='gray')
plt.title(f"Matrix $X$ encoding a Grayscale Image\n(Rank, $r=${img_rank})")
## Singular values as function of rank
plt.sca(axs[1])
### Raw singular values
plt.plot(component_idx, S, label='Singular Values of $X$', color='darkblue', linewidth=3)
plt.grid()
plt.xlabel("Rank, $k$")
plt.ylabel('$S_k$', color='darkblue')
plt.tick_params(axis='y', labelcolor='darkblue')
plt.title('Singular Values, $S_k$')
### log(singular values)
twax = plt.gca().twinx() # twin axes that shares the same x-axis
twax.plot(component_idx, np.log(S), color='red', linewidth=3)
plt.ylabel('$\\log(S_k)$\n', color='red', rotation=270)
plt.tick_params(axis='y', labelcolor='red')
plt.ylim([-6, 6])
## Information retained as function of rank
plt.sca(axs[2])
plt.plot(component_idx, info_retained, color='darkgreen', linewidth=3)
plt.xlim(0, n_components)
plt.ylim([0, 105])
plt.xlabel("Rank, $k$")
plt.grid()
plt.title('Cumulative % of Information About $X$\nCarried by Singular Vectors')
We can see in Figure 3, center, right that a majority of the information about \(X\) is encoded in the first handful of singular vectors/values returned by SVD. For example, 80% of the information is encoded by less than \(1/3\) of the singular vectors. This suggests that we can encode a majority of the information about the original data using only a subset of SVD components, and that it is easy to identify the optimal subset.
Figure 4 demonstrates this idea. In each row of Figure 4 we reconstruct \(X\) while increasing the rank \(k\) used in the reconstruction.3 Using only a few singular vectors (e.g. \(k=4\)) limits the reconstruction \(\hat X_k\) to encoding only low-frequency spatial information about the image. As the number of singular vectors used in the approximation increases, the reconstruction includes increasingly high-frequency spatial information, thus decreasing the reconstruction error.
Using roughly 50% of the data required to store \(X\) (\(k=32\)) retains around 80% of the information in \(X,\) and the reconstruction is almost perceptually indistinguishable from the original image. We can also see that this approach isn’t a magic bullet. There’s a trade-off between the amount of data required for the reconstruction (i.e. the storage requirements for the components of \(U_k\), \(V_k\), and \(S_k\)) and the information provided about \(X\). Using 64 components results in essentially no overall compression, yet still encodes less than 100% of the information. Effects like these need to be considered when using LRA for image compression.
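To make the storage trade-off concrete, here’s a back-of-the-envelope sketch (the helper name is illustrative, not from the original code): a rank-\(k\) approximation of an \(m \times n\) matrix requires storing \(k(m + n + 1)\) floats, versus \(mn\) for the raw matrix.

def compression_ratio(m, n, k):
    """Fraction of the original m*n storage needed by the rank-k factors,
    which require k*(m + n + 1) floats in total."""
    return k * (m + n + 1) / (m * n)

# For the 128 x 128 image above, k=32 needs ~50% of the original storage,
# while k=64 needs ~100% -- i.e. essentially no compression.
print(compression_ratio(128, 128, 32))  # ~0.50
print(compression_ratio(128, 128, 64))  # ~1.00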
Figure 4: Image Compression via LRA/SVD. Top Left: Matrix \(X\) encodes an image that we reconstruct using an increasing number of singular vectors provided by SVD. Second Column: The approximation \(\hat{X}_k\) of image \(X\) using the first \(k\) most-informative singular vectors. Third Column: The spatial reconstruction error of approximation \(\hat{X}_k\). Right Column: Data compression information for each row, including the percentage of the original image size used to represent the approximation, as well as the amount of information about \(X\) contained in the approximation.
## Image Reconstruction
N = 5
fig, axs = plt.subplots(N, 4, figsize=(10, 16))
plt.sca(axs[0][0])
plt.imshow(X, cmap='gray')
plt.clim([0, 1.])
plt.axis('off')
plt.title("$X$", fontsize=14)
# Reconstruct image with increasing number of singular vectors/values
for power in range(1, N + 1):
    rank = 2 ** (1 + power)
    # Compressed/Reconstructed Image
    X_reconstruction = U[:, :rank] * S[:rank] @ V[:rank, :]
    # Percentage of floats saved by storing the rank-k factors instead of X
    rank_data_compression = 100 * (1. - (1. * U[:, :rank].size + S[:rank].size + V[:rank, :].size) / X.size)
    # Proportion of information about the original image retained by rank components
    rank_info_retained = info_retained[rank - 1]
    # Visualizations
    ## Original Image
    if power > 1:
        plt.sca(axs[power - 1][0])
        plt.cla()
        plt.axis('off')
    ## Image reconstruction
    plt.sca(axs[power - 1][1])
    plt.imshow(X_reconstruction, cmap='gray')
    plt.clim([0, 1.])
    plt.axis('off')
    plt.title(f'$\\hat{{X}}_{{{rank}}}$', fontsize=14)
    ## Reconstruction error
    plt.sca(axs[power - 1][2])
    cax = plt.imshow(X - X_reconstruction)
    plt.clim([-.5, .5])
    plt.axis('off')
    plt.title(f'$X - \\hat{{X}}_{{{rank}}}$', fontsize=14)
    ## Compression/reconstruction info
    plt.sca(axs[power - 1][3])
    compression_text = f'Compression: {rank_data_compression:1.1f}%\nInfo. Retained {rank_info_retained:1.1f}%'
    plt.text(-.1, .4, compression_text)
    plt.axis('off')
fig.colorbar(cax, ax=axs[0][2], pad=.01, orientation='horizontal')
In this post we discussed one of many applications of SVD: compression of high-dimensional data via LRA. This application is closely related to other numerical techniques such as denoising and matrix completion, as well as statistical analysis techniques for dimensionality reduction like Principal Components Analysis (PCA). Stay tuned, as I plan to dig into these additional applications of SVD in future posts. Until then, happy compressing!
It turns out these redundant columns have been generated by scaling and concatenating multiple columns from the full-rank matrix \(X\). ↩
This isn’t low-rank approximation, per se, since \(k=r\). However, it does demonstrate an important concept: redundancy can be compressed using a subset of components returned from matrix decomposition. ↩
In a normal compression scenario, rather than calculating the full SVD and selecting a subset of components, we would simply calculate a low-rank SVD, which can be done more efficiently than the full SVD. ↩
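To illustrate that last footnote, here’s a minimal sketch of computing a truncated SVD directly, assuming SciPy is available; scipy.sparse.linalg.svds computes only the \(k\) largest singular triplets, though note it returns singular values in ascending order:

import numpy as np
from scipy.sparse.linalg import svds

np.random.seed(123)
A = np.random.randn(128, 128)
k = 32

# Compute only the k largest singular triplets, avoiding the full decomposition
U_k, s_k, Vt_k = svds(A, k=k)

# svds returns singular values in ascending order; flip to the usual convention
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

A_k = U_k * s_k @ Vt_k  # rank-k approximation of A

This is most beneficial when \(k\) is much smaller than \(\min(m, n)\), which is exactly the compression regime discussed above.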
Before introducing SVD, let’s first go over matrix diagonalization. Matrix diagonalization is the process of breaking a matrix \(M\) into two separate matrices \(P\) and \(D\), such that
\[M_{m \times m} = P_{m \times m} D_{m \times m} P_{m \times m}^{-1} \tag{1}\]where \(P\) is an invertible (and therefore square) matrix and \(D\) is a diagonal matrix (i.e. containing all zeros, except along the diagonal).
Now, why on earth would you want to do something like diagonalization? It turns out that representing a matrix this way offers a lot of numerical and computational conveniences. For example, computing matrix inverses or matrix powers can be done efficiently for large matrices or exponents when formulated via diagonalization. Diagonalization is also often used to restate mathematical problems in a new, canonical system where certain operations or structures have convenient interpretations (e.g. finding the modes of a vibrating system or identifying energy levels in quantum mechanics).
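As a quick illustration of the inverse convenience, here’s a minimal sketch, assuming \(M\) is diagonalizable with no zero eigenvalues: once \(M = PDP^{-1}\) is known, \(M^{-1} = PD^{-1}P^{-1}\), so only the diagonal needs to be inverted.

import numpy as np

np.random.seed(1)
A = np.random.randn(4, 4)
M = A @ A.T + 4 * np.eye(4)  # symmetric and well-conditioned, so safely diagonalizable

evals, P = np.linalg.eig(M)
D_inv = np.diag(1. / evals)  # invert the diagonal elementwise

M_inv = P @ D_inv @ np.linalg.inv(P)
assert np.allclose(M_inv, np.linalg.inv(M))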
So how does one diagonalize a matrix? There are lots of approaches to diagonalize a matrix, but a common one is to compute the eigenvalue decomposition of the matrix. To understand why this is equivalent to diagonalization, let’s note that
\[\begin{align} M &= P D P^{-1} \\ M (P) &= P D P^{-1} (P) \\ M P &= P D . \tag{2} \end{align}\]Some of you may recognize that the statement given by Equation 2 is equivalent to finding the eigenvectors \(a_i\) and eigenvalues \(\lambda_i\) of the matrix \(M\), where
\[\begin{align} P &= \left[a_1, a_2, ... a_m\right] \\ D &= \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_m \end{bmatrix} \\ PD &= [\lambda_1 a_1, \lambda_2 a_2, ... \lambda_m a_m] \end{align}\]Therefore solving for the eigenvalues/eigenvectors of \(M\) provides us with the components needed to diagonalize \(M\) (note we’d still need to calculate the value of \(P^{-1}\)).
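Here’s a minimal sketch of diagonalizing a matrix via numpy’s eigenvalue decomposition and verifying Equation 1; it assumes a symmetric \(M\) so that the eigenvalues and eigenvectors are real:

import numpy as np

np.random.seed(0)
A = np.random.randn(3, 3)
M = A + A.T  # symmetric, so the decomposition is real-valued

eigenvalues, P = np.linalg.eig(M)  # columns of P are the eigenvectors a_i
D = np.diag(eigenvalues)           # eigenvalues along the diagonal of D

# Verify M = P D P^{-1}
assert np.allclose(M, P @ D @ np.linalg.inv(P))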
So what is this diagonalization operation doing, exactly? One can think of the diagonalization \(M = PDP^{-1}\) as performing three steps:
1. \(P^{-1}\) rotates the data into a new coordinate system, one defined by the eigenvectors of \(M\).
2. \(D\) scales the data along each axis of that new coordinate system by the corresponding eigenvalue.
3. \(P\) reverses the original rotation, returning the data to the original coordinate system.
If all of this rotating and scaling business is still unclear, no worries, we’ll demonstrate similar ideas graphically when discussing SVD (see Figure 6).
Looking at the diagonalization definition in Equation 1, one can infer that in order to be diagonalizable, \(M\) must be square and invertible. Although there are a lot of interesting problems that involve only square matrices, there are many, many more scenarios that do not fit this constraint. This is where SVD comes in!
One can think of SVD as a generalization of diagonalization to non-square matrices. In fact, it turns out that all matrices have an SVD solution! As we’ll see, this makes SVD a more general tool than other matrix decompositions like eigenvalue decomposition, which requires square, invertible matrices.
The singular value decomposition is based on the notion that for any matrix \(M\), the matrices \(M^T M\) and \(M M^T\) are symmetric:
\[(M^T M)^T = M^T(M^T)^T = M^T M \\ (M M^T)^T = (M^T)^T M^T = M M^T\]In addition, SVD takes advantage of the notion that all symmetric matrices like \(M^T M\) and \(M M^T\) have eigenvectors that form an orthonormal basis. With these two notions in hand, let’s first define the SVD, then we’ll derive its components from the matrices \(M^T M\) and \(M M^T\).
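Here’s a quick numerical sanity check of both notions, assuming a random non-square \(M\):

import numpy as np

np.random.seed(123)
M = np.random.randn(5, 3)
MtM = M.T @ M

# M^T M is symmetric
assert np.allclose(MtM, MtM.T)

# Its eigenvectors form an orthonormal basis (eigh is numpy's routine for
# symmetric matrices)
evals, evecs = np.linalg.eigh(MtM)
assert np.allclose(evecs.T @ evecs, np.eye(3))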
The singular value decomposition aims to separate an \([m \times n]\) matrix \(M\) into three distinct matrices:
\[M_{m \times n} = U_{m \times m} S_{m \times n} V_{n \times n}^T \tag{3}\]where \(U\) and \(V\) are orthonormal matrices and \(S\) is a diagonal matrix. To derive \(V,\) we analyze the symmetric matrix \(M^T M\) while utilizing the SVD definition of \(M\) in Equation 3:
\[\begin{align} M^T M &= (USV^T)^T(USV^T) \\ &= (VS^TU^T)(USV^T) \\ &= VS^T I S V^T \text{, since } U \text{ is orthogonal} \\ &= V S^T S V^T \\ &= V S^T S V^{-1} \text{, since } V \text{ is orthogonal} \tag{4} \end{align}\]Look familiar? Equation 4 is essentially another diagonalization operation like the one defined in Equation 1, but this time we’re diagonalizing the matrix \(M^T M\) instead of \(M\), with the diagonalizing matrix \(S^TS\) instead of \(D\). As we showed in Equation 2, this diagonalization can be solved via eigenvalue decomposition, which suggests the following two properties of SVD:
1. The columns of \(V\) (the right singular vectors) are the eigenvectors of \(M^TM\).
2. The singular values along the diagonal of \(S\) are the square roots of the eigenvalues of \(M^TM\), since \(S^TS\) holds those eigenvalues.
OK, we’ve found \(V\) and \(S\), what about \(U\)? To derive \(U\) we perform analogous computations as for \(V\), but on \(MM^T\) instead of \(M^TM\):
\[\begin{align} MM^T &= (USV^T)(USV^T)^T \\ &= (USV^T)(VS^TU^T) \\ &= US I S^T U^T \\ &= U S S^T U^T \\ &= U S S^T U^{-1} \tag{5} \end{align}\]Equation 5 suggests a third property of SVD, namely that the columns of \(U\) are the eigenvectors of \(M M^T.\) The matrix \(S\) has the same interpretation as in Equation 4.
Note that when \(m \neq n\), the diagonalizing matrix \(S\) is not square as was the case for \(D\) when diagonalizing square matrices. Instead \(S\) will be padded with zero rows or columns, depending on which dimension is larger. We’ll demonstrate all of this visually shortly.
OK, we’ve written down a bunch of equations that mathematically define the components of SVD and how they relate to the input matrix \(M\). Now, let’s make these derived components more tangible with some visualizations and code.
Figures 1-3 below show examples of what SVD returns for various matrix sizes. Figure 1 demonstrates the results for a square matrix.
Figure 1: Visualization of \(U\), \(S\), and \(V\) for Singular Value Decomposition of a square matrix \(M\), where \(m=n\)
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['image.cmap'] = 'RdBu_r'
PRECISION = 3
def svd(M):
    """Returns the Singular Value Decomposition of M (via Numpy), with all
    components returned in matrix format
    """
    U, s, Vt = np.linalg.svd(M)
    # Put the vector of singular values into a padded diagonal matrix
    S = np.zeros(M.shape)
    np.fill_diagonal(S, s)
    # Rounding for display
    return np.round(U, PRECISION), np.round(S, PRECISION), np.round(Vt.T, PRECISION)
def visualize_svd(m, n, fig_height=5):
    """Show the Singular Value Decomposition of a random matrix of size `m x n`

    Parameters
    ----------
    m : int
        The number of rows in the random matrix
    n : int
        The number of columns
    fig_height : float
        Fiddle parameter to make figures render better (because I'm lazy and
        don't want to work out the scaling arithmetic).
    """
    # Repeatability
    np.random.seed(123)

    # Generate random matrix
    M = np.random.randn(m, n)

    # Run SVD, as defined above
    U, S, V = svd(M)

    # Visualization
    fig, axs = plt.subplots(1, 7, figsize=(12, fig_height))
    plt.sca(axs[0])
    plt.imshow(M)
    plt.title(f'$M \\in \\mathbb{{R}}^{{{m} \\times {n}}}$', fontsize=14)

    plt.sca(axs[1])
    plt.text(.25, .25, '=', fontsize=48)
    plt.axis('off')

    plt.sca(axs[2])
    plt.imshow(U)
    plt.title(f'$U \\in \\mathbb{{R}}^{{{m} \\times {m}}}$', fontsize=14)

    plt.sca(axs[3])
    plt.text(.25, .25, '$\\times$', fontsize=48)
    plt.axis('off')

    plt.sca(axs[4])
    plt.imshow(S)
    plt.title(f'$S \\in \\mathbb{{R}}^{{{m} \\times {n}}}$')

    plt.sca(axs[5])
    plt.text(0.25, .25, '$\\times$', fontsize=48)
    plt.axis('off')

    plt.sca(axs[6])
    cmap = plt.imshow(V.T)
    plt.colorbar(cmap, ax=axs, orientation='horizontal', aspect=50)
    plt.title(f'$V^T \\in \\mathbb{{R}}^{{{n} \\times {n}}}$', fontsize=14)

    plt.suptitle(f'SVD Components $m={m}, n={n}$', fontsize=18)
    fname = f'/tmp/svd-{m}x{n}.png'
    plt.savefig(fname, bbox_inches='tight', dpi=300)
    print(fname)
visualize_svd(4, 4, fig_height=3)
For the square matrix, SVD returns three equally-sized square matrices. Note that unlike diagonalization defined in Equation 1, where the first and third matrices in the decomposition are the inverse of one another, for SVD this is generally not the case, i.e. \(U^{-1} \neq V^T\).
Another interesting thing to notice in Figure 1 is that the main diagonal of \(S\) has decreasing values. This is because SVD returns the singular vectors in a ranked format, where the vectors associated with the largest singular values are in the first columns of \(U\) and rows of \(V^T\), respectively. This turns out to be super-convenient when using SVD for applications like compression and dimensionality reduction, as you can simply choose the most “important” dimensions for the matrix representation as the first entries in the left or right singular vector matrices.
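In code, this ranking means that keeping the most “important” dimensions is just a matter of slicing the first \(k\) components; a minimal sketch:

import numpy as np

np.random.seed(123)
M = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(M)  # s is returned in descending order

k = 2
M_k = U[:, :k] * s[:k] @ Vt[:k, :]  # best rank-k approximation of M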
Figure 2: Visualization of \(U\), \(S\), and \(V\) for Singular Value Decomposition of a tall matrix \(M\), where \(m>n\).
visualize_svd(8, 4, fig_height=4.5)
Figure 2 above shows the results of SVD applied to a “tall” matrix, where \(m > n\). We can see that the singular value matrix \(S\), though having a diagonal component with decreasing values, is no longer square. Instead it is padded with extra rows of zeros in order to handle the extra rows in the matrix \(M\).
Figure 3: Visualization of \(U\), \(S\), and \(V\) for Singular Value Decomposition of a wide matrix \(M\), where \(m<n\).
visualize_svd(4, 8, fig_height=3)
Figure 3 shows the results of SVD applied to a “wide” matrix, where \(m < n\). Similar to the results for the “tall” matrix, we can see that the singular value matrix \(S\) also has a diagonal component with decreasing values, but is instead padded with extra columns in order to handle the extra columns in the matrix \(M\).
The formulation of SVD ensures that the columns of \(U\) and \(V\) form an orthonormal basis. This means that all column vectors in each matrix are orthogonal/perpendicular and each vector has unit length. This is equivalent to saying that the inner product of each matrix with itself will form an identity matrix (orthogonality), and that the \(L_2\) norm of each column will be equal to one. Figure 4 below demonstrates both of these properties visually for the SVD of a matrix \(M\) with dimensionality \([10 \times 5]\).
Figure 4, Singular Vectors provide an orthonormal basis: The left column shows the \(U\) and \(V^T\) that result from running SVD on a random \([m \times n] = [10 \times 5]\) matrix \(M\). The middle column plots the \(L_2\) norm calculated along the singular vectors (columns) of each matrix; the singular vectors all exhibit unit norm. The right column shows the inner product of each matrix with itself; the inner product is the identity matrix, demonstrating the orthogonality of the singular vectors.
def matrix_column_l2_norm(M):
    """Returns the L2 norm of each column of matrix M"""
    return np.sqrt((M ** 2).sum(0))
# Generate random m x n matrix, M
m = 10
n = 5
np.random.seed(123) # reproducibility
M = np.random.randn(m, n)
# Run the SVD
U, S, V = svd(M)
# Calculate L2 norm of U and V^T
U_norm = matrix_column_l2_norm(U)
V_norm = matrix_column_l2_norm(V.T)
# Visualizations
fig, axs = plt.subplots(2, 3, figsize=(12, 7))
## Matrix U
plt.sca(axs[0][0])
plt.imshow(U, interpolation='nearest')
plt.title(f'$U \\in \\mathbb{{R}}^{{{m} \\times {m}}}$')
## L2 norm of U's columns
plt.sca(axs[0][1])
plt.gca().set_aspect(7.)
plt.bar(range(m), U_norm, facecolor='gray')
plt.ylim([0, 1.1])
plt.title('$L_2$ Norm of $U$ Columns')
## U^TU is an Identity Matrix
plt.sca(axs[0][2])
plt.imshow(U.T @ U, cmap='cividis', interpolation='nearest')
plt.title('$U^TU$')
## Matrix V
plt.sca(axs[1][0])
cax1 = plt.imshow(V.T, interpolation='nearest')
plt.title(f'$V^T \\in \\mathbb{{R}}^{{{n} \\times {n}}}$')
## L2 norm of V^T's columns
plt.sca(axs[1][1])
plt.bar(range(n), V_norm, facecolor='gray')
plt.ylim([0, 1.1])
plt.title('$L_2$ Norm of $V^T$ Columns')
## VV^T is an Identity Matrix
plt.sca(axs[1][2])
cax2 = plt.imshow(V @ V.T, cmap='cividis', interpolation='nearest')
plt.title('$VV^T$')
## Set Colorbars
fig.colorbar(cax1, ax=[axs[0][0], axs[1][0]], location='left', pad=0.15)
fig.colorbar(cax2, ax=[axs[0][2], axs[1][2]], location='right', pad=0.15)
We can see that the norms of all column vectors of \(U\) and \(V\) are indeed equal to 1, and that the inner products produce \([10 \times 10]\) and \([5 \times 5]\) identity matrices, indicating that both matrices \(U\) and \(V\) are orthonormal basis sets.
When developing SVD above, we also established three properties relating SVD to eigenvalue decomposition:
1. The columns of \(V\) are the eigenvectors of \(M^TM\).
2. The columns of \(U\) are the eigenvectors of \(MM^T\).
3. The singular values in \(S\) are the square roots of the eigenvalues of either \(M^TM\) or \(MM^T\).
Figure 5 below demonstrates the three properties visually for the SVD results calculated and demonstrated in Figure 4.¹
Figure 5, The relationship between SVD and Eigenvalue Decomposition: The top row demonstrates the equivalence between the left singular vectors \(U\) and the eigenvectors of \(MM^T\). The middle row demonstrates the equivalence between the right singular vectors \(V\) and the eigenvectors of \(M^TM\). The bottom row demonstrates how the singular values are equivalent to the square roots of the eigenvalues of either \(MM^T\) or \(M^TM\).
def evd(M):
    """Returns the Eigenvalue Decomposition of M (via numpy), with eigenvectors
    sorted by descending eigenvalues
    """
    def sort_eigs(evals, evecs):
        sort_idx = np.argsort(evals)[::-1]
        evals_sorted = np.round(np.real(evals[sort_idx]), PRECISION)
        evecs_sorted = np.round(np.real(evecs[:, sort_idx]), PRECISION)
        return evals_sorted, evecs_sorted

    return sort_eigs(*np.linalg.eig(M))

def align_eigen_vectors(E, M):
    """Eigenvector solutions are not unique, so check sign to give consistent
    results with SVD
    """
    for dim in range(E.shape[1]):
        if np.sign(M[0, dim]) != np.sign(E[0, dim]):
            E[:, dim] = E[:, dim] * -1
    return E
eigen_values_MtM, eigen_vectors_MtM = evd(M.T @ M)
eigen_values_MMt, eigen_vectors_MMt = evd(M @ M.T)
fig, axs = plt.subplots(3, 2, figsize=(8, 15))
plt.sca(axs[0][0])
# M isn't square, so we only show results up to the smaller dimension, n
cax = plt.imshow(U[:, :n])
plt.title("Left Singular Vectors, $U$")
plt.sca(axs[0][1])
plt.imshow(align_eigen_vectors(eigen_vectors_MMt[:, :n], U[:, :n]))
plt.title("Eigenvectors of $MM^T$")
plt.sca(axs[1][0])
cax = plt.imshow(V)
plt.title("Right Singular Vectors, $V$")
plt.sca(axs[1][1])
plt.imshow(align_eigen_vectors(eigen_vectors_MtM, V))
plt.title("Eigenvectors of $M^TM$")
fig.colorbar(cax, ax=axs[:2], orientation='horizontal', pad=0.1)
plt.sca(axs[2][0])
plt.plot(np.diag(S), color='red')
plt.grid()
plt.title('Singular Values, $S$')
plt.sca(axs[2][1])
plt.plot(eigen_values_MMt[:n] ** .5, c='darkblue', linewidth=4, label='of $MM^T$')
plt.plot(eigen_values_MtM[:n] ** .5, '--', c='gold', label='of $M^TM$')
plt.grid()
plt.title('Square Root of Eigenvalues')
plt.legend()
OK, so we’ve been able to derive SVD, and visualize some of the key properties of the component matrices returned by SVD, but what is SVD actually doing? We mentioned above in the Matrix Diagonalization section that the diagonalization process is essentially a rotation, followed by a scaling, followed by a reversal of the original rotation. SVD works in a similar fashion, however, the rotations are not generally the inverses of one another.
Figure 6 provides a visual breakdown of how the transformation resulting from applying the matrix \(M\) to an input matrix of observations \(X\) can be broken down into three separate transformation operations, each provided by a component of SVD.
Figure 6, Visualization of SVD Operations: Upper Left: A 2D data set, distributed uniformly between \((-.5, .5)\). Each data point is colored by its relative horizontal location in the 2D coordinate system. Center: The data after being transformed by matrix \(M\). We see that \(M\) shears and flips the data about the horizontal axis. Bottom Left: The orthogonal matrix \(V^T\) rotates the original data into a new coordinate system. Bottom Right: The diagonal matrix \(S\) stretches the data along each of the primary axes in the new coordinate system. The amount of stretch is prescribed by the square root of the eigenvalues of \(M^TM\) (or alternatively \(MM^T\)). Upper Right: The orthogonal matrix \(U\) rotates the data back into the original coordinate system. We can see that the cascade of operations \(U S V^T\) reproduces the transformation \(M\).
# Generate random observations matrix (uniform distribution)
np.random.seed(123) # Repeatability
n_observations = 500
n_dim = 2
X = np.random.rand(n_observations, n_dim) - .5
# Transformation Matrix
M = np.array(
    [
        [-2., .5],
        [-.5, -1.]
    ]
)
colors = X[:, 0]
cmap = 'nipy_spectral'
# SVD of Transformation Matrix
U, S, V = svd(M)
# Visualization
fig, axs = plt.subplots(3, 3, figsize=(10, 10))
plt.suptitle('Breakdown of SVD Operations\n$M = U S V^T$', fontsize=24, ha='center')
## Data
### Original X
plt.sca(axs[0][0])
plt.scatter(X[:, 0], X[:, 1], c=colors, cmap=cmap)
plt.xlim([-1.5, 1.5])
plt.ylim([-1.5, 1.5])
plt.grid()
plt.title("$X$")
### X * M (matrix transformation)
XM = X @ M
plt.sca(axs[1][1])
plt.scatter(XM[:, 0], XM[:, 1], c=colors, cmap=cmap)
plt.xlim([-1.5, 1.5])
plt.ylim([-1.5, 1.5])
plt.grid()
plt.title("$X \cdot M$")
### X * V' (rotate)
XVt = X @ V.T
plt.sca(axs[2][0])
plt.scatter(XVt[:, 0], XVt[:, 1], c=colors, cmap=cmap)
plt.xlim([-1.5, 1.5])
plt.ylim([-1.5, 1.5])
plt.grid()
plt.title("$X \cdot V^T$")
### X * (S * V') (rotate and scale)
XSVt = X @ (S @ V.T)
plt.sca(axs[2][2])
plt.scatter(XSVt[:, 0], XSVt[:, 1], c=colors, cmap=cmap)
plt.xlim([-1.5, 1.5])
plt.ylim([-1.5, 1.5])
plt.grid()
plt.title("$X \cdot (S \cdot V^T)$")
### X * (U * S * V') (rotate, scale, and rotate)
XUSVt = X @ (U @ S @ V.T)
plt.sca(axs[0][2])
plt.scatter(XUSVt[:, 0], XUSVt[:, 1], c=colors, cmap=cmap)
plt.xlim([-1.5, 1.5])
plt.ylim([-1.5, 1.5])
plt.grid()
plt.title("$X \cdot (U \cdot S \cdot V^T)$")
## Operations/Arrows
### M
plt.sca(axs[0][1])
plt.arrow(.0, .5, .9, 0., capstyle='round', head_width=0.05, head_length=0.1, fc='black')
plt.arrow(.0, .5, .25, -.4, capstyle='round', head_width=0.05, head_length=0.1, fc='black')
plt.text(.15, .3, '$M$', fontsize=20)
plt.text(.5, .55, '$USV^T$', fontsize=20, ha='center')
plt.xlim([0., 1.])
plt.ylim([0., 1.01])
plt.axis('off')
### V'
plt.sca(axs[1][0])
plt.arrow(.5, 1., 0, -.9, capstyle='round', head_width=0.05, head_length=0.1, fc='black')
plt.text(.3, .45, '$V^T$', fontsize=20)
plt.text(.55, .35, 'rotate', fontsize=20, ha='center', rotation=-90)
plt.xlim([0., 1.])
plt.ylim([0., 1.01])
plt.axis('off')
### S
plt.sca(axs[2][1])
plt.arrow(.0, .5, .9, 0., capstyle='round', head_width=0.05, head_length=0.1, fc='black')
plt.text(.45, .35, '$S$', fontsize=20)
plt.text(.5, .55, 'stretch', fontsize=20, ha='center')
plt.xlim([0., 1.])
plt.ylim([0., 1.])
plt.axis('off')
### U
plt.sca(axs[1][2])
plt.arrow(.5, 0., 0, .9, capstyle='round', head_width=0.05, head_length=0.1, fc='black')
plt.text(.55, .45, '$U$', fontsize=20)
plt.text(.43, .35, 'rotate', fontsize=20, ha='center', rotation=90)
plt.xlim([0., 1.])
plt.ylim([0., 1.01])
plt.axis('off')
We can see above that SVD of matrix \(M\) breaks down the transformation encapsulated by \(M\) into a rotation (\(V^T\)), followed by a stretching (\(S\)), followed by another rotation (\(U\)).
Singular Value Decomposition gives you all the theoretical, computational, and numerical benefits of diagonalization, while expanding the potential problem set to include all matrices. We’ve gone over a few distinct benefits of applying SVD: it exists for any matrix, not just square, invertible ones; its components are rank-ordered by importance, making it easy to keep only the most informative dimensions; and it breaks any linear transformation into interpretable rotation and scaling operations.
The number of possible applications of SVD is essentially limitless. In future posts I plan to go into some details on how I use SVD to solve a number of problems in statistics, Machine Learning, NLP, and Computer Vision.
So, I hope to see you soon for the next installment on SVD. Until then, happy decomposing!
Eigenvalue decomposition provides non-unique solutions between calculations, where the solutions may have differently ordered eigenvalues, and eigenvectors can have flipped signs. Therefore, to compare to SVD, which has a consistent solution, one must perform some sorting and alignment of the eigenvectors. These operations are performed in the sort_eigs and align_eigen_vectors helper functions. ↩
In that the rank-\(r\) approximation minimizes the Frobenius norm of the difference between the actual and approximated matrices. ↩
If a matrix \(M\) is \(m \times m\) and diagonalizable, then \(M^k\) can be calculated directly from the diagonalization \(M = P D P^{-1}\) as follows:
\[\begin{align} M^k &= M \times M \dots \times M \\ &= (P D P^{-1}) (P D P^{-1}) \dots (P D P^{-1}) \\ &= P D (P^{-1} P) D (P^{-1} P) \dots D P^{-1} \\ &= P D^k P^{-1} \end{align}\]Therefore to calculate \(M^k\), we simply need to diagonalize \(M\) and re-matrix-multiply the diagonalization components after raising the diagonal matrix component \(D\) to the \(k\)-th power. Since \(D\) is a diagonal matrix, the \(k\)-th power is calculated by simply raising each element along the diagonal to the \(k\)-th power:
\[\begin{align} D^k &= \begin{bmatrix} d_{1} & & \\ & \ddots & \\ & & d_{m} \end{bmatrix}^k \\ &= \begin{bmatrix} d_{1}^k & & \\ & \ddots & \\ & & d_{m}^k \end{bmatrix} \end{align}\]This trick allows us to calculate the matrix power by multiplying three matrices, rather than \(k\). Thus as \(k\) gets large, or the size of the matrix \(M\) grows, you get more and more gains in efficiency.
To demonstrate, let’s calculate the matrix power of a random matrix using brute force, the matrix diagonalization approach reviewed above, and we’ll also throw in results from numpy.linalg.matrix_power for completeness.
import numpy as np
np.random.seed(123)
# Generate a random 3 x 3 matrix
M = np.random.randn(3, 3)
k = 3 # power exponent
print('\nBrute Force:\n', eval("@".join([' M '] * k)))
# Brute Force:
# [[-0.34077132 -0.70544947 -1.07778229]
# [ 2.73462284 -0.71537115 -2.62514227]
# [ 3.35955945 1.68986542 -4.1619396 ]]
# Diagonalize M via Eigenvalue Decomposition
D, P = np.linalg.eig(M)
D = np.diag(D) # Put eigenvalues into a diagonal matrix
print('\nMatrix Diagonalization:\n', np.real(P @ D ** k @ np.linalg.inv(P)))
# Matrix Diagonalization:
# [[-0.34077132 -0.70544947 -1.07778229]
# [ 2.73462284 -0.71537115 -2.62514227]
# [ 3.35955945 1.68986542 -4.1619396 ]]
print('\nnumpy.linalg.matrix_power:\n', np.linalg.matrix_power(M, k))
# numpy.linalg.matrix_power:
# [[-0.34077132 -0.70544947 -1.07778229]
# [ 2.73462284 -0.71537115 -2.62514227]
# [ 3.35955945 1.68986542 -4.1619396 ]]
Works! 😁
For a square matrix \(\mathbf A\) of dimension \(n \times n\)
\[\begin{eqnarray} \det(\mathbf{A}) &=& \prod_i \lambda_{i}, \;\;\; \lambda_i = \text{eig}(\mathbf{A})_i \tag{5.1} \\ \det(\mathbf{A}^T) &=& \det(\mathbf{A}) \tag{5.2} \\ \det(\mathbf{AB}) &=& \det(\mathbf{A})\det(\mathbf{B}) \tag{5.3} \\ \det(\mathbf{A}^{-1}) &=& \frac{1}{\det(\mathbf{A})} \tag{5.4} \\ \det(\mathbf{A}^n) &=& \det(\mathbf{A})^n \tag{5.5} \\ \det(c\mathbf{A}) &=& c^n \det(\mathbf{A}), \;\;\; \text{given } \mathbf{A} \in \mathbb{R}^{n \times n} \tag{5.6} \\ \end{eqnarray}\]For scalar function \(y = \mathbf{\beta x} = \beta_1 x_1 + \beta_2 x_2 + ... \beta_n x_n\)
\[\begin{align} \frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \\ \end{bmatrix} = \frac{\partial \mathbf{\beta \mathbf{x}}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{\beta x}}{\partial x_1} \\ \frac{\partial \mathbf{\beta x}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{\beta x}}{\partial x_n} \\ \end{bmatrix} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix} \tag{6.1.1} \end{align}\]For a vector-valued function
\[\begin{eqnarray} \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \\ \end{bmatrix} = \mathbf{A x} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n \\ a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n \\ \vdots \\ a_{m1}x_1 + a_{m2}x_2 + ... + a_{mn}x_n \\ \end{bmatrix} \\ \\ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \dots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix} = \frac{\partial \mathbf{Ax}}{\partial \mathbf{x}} = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \\ \end{bmatrix} = \mathbf{A}^T \tag{6.1.2} \end{eqnarray}\] \[\begin{eqnarray} \frac{\partial \mathbf{x}^T\mathbf{A}}{\partial \mathbf{x}} &=& \mathbf{A} \tag{6.1.3} \\ \frac{\partial \mathbf{x}^T\mathbf{a}}{\partial \mathbf{x}} &=& \frac{\partial \mathbf{a}^T\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a} \tag{6.1.4} \\ \frac{\partial \mathbf{y}^T \mathbf{Ax}}{\partial \mathbf{x}} &=& \mathbf{A}^T \mathbf{y} \tag{6.1.5} \\ \frac{\partial \mathbf{y}^T \mathbf{Ax}}{\partial \mathbf{y}} &=& \mathbf{A} \mathbf{x} \tag{6.1.6} \\ \frac{\partial \mathbf{x}^T\mathbf{x}}{\partial \mathbf{x}} &=& 2\mathbf{x} \tag{6.1.7} \\ \frac{\partial \mathbf{x}^T\mathbf{Ax}}{\partial \mathbf{x}} &=& (\mathbf{A} + \mathbf{A}^T)\mathbf{x} \tag{6.1.8} \\ &=&2 \mathbf{Ax} \tag{if A is symmetric, 6.1.9} \\ \frac{\partial \mathbf{Ax}}{\partial \mathbf{z}} &=& \frac{\partial \mathbf{x}}{\partial \mathbf{z}} \mathbf{A}^T \tag{6.1.10} \\ \end{eqnarray}\]In the linear regression framework, we model an output variable \(y\) (in this case a scalar) as a linear combination of some independent input variables \(X\) plus some independent noise \(\epsilon\). The linear combination of the independent variables is defined by a parameter vector \(\beta\):
\[y = X \beta + \epsilon\]We also assume that the noise term \(\epsilon\) is drawn from a Normal distribution with zero mean and a noise variance \(\sigma_{\epsilon}^2\) (generally assumed to be equal to one):
\[\epsilon \sim N(0,\sigma_{\epsilon}^2)\]For some estimate of the model parameters \(\hat \beta\), the model’s prediction errors (a.k.a. residuals) \(e\) are the difference between the model predictions and the observed output values:
\[e = y - X\hat \beta\]The Ordinary Least Squares (OLS) solution to the problem–i.e. determining an optimal solution for \(\hat \beta\)–requires minimizing the sum of the squared errors with respect to the model parameters \(\hat \beta\). It turns out, the sum of squared errors is equal to the inner product of the residuals vector with itself \(\sum_i e_i^2 = e^Te\) :
\[\begin{align} e^T e &= (y - X \hat \beta)^T (y - X \hat \beta) \\ &= y^Ty - y^T (X \hat \beta) - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - (X \hat \beta)^T y - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &= y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \text{,} \tag{1} \end{align}\]where we take advantage of the matrix identity \((AB)^T = B^TA^T\) in steps 2-3 above.
To determine the parameters \(\hat \beta\) we minimize the sum of squared errors with respect to the parameters:
\[\begin{align} \frac{\partial}{\partial \beta} \left[ e^T e \right] &= 0 \\ \frac{\partial}{\partial \beta} \left[ y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \right ] &= 0 \;\; \text{, via Eq. (1)}\\ -2X^Ty + 2X^TX \hat \beta &= 0 \\ -X^Ty + X^TX \hat \beta &= 0 \\ X^TX \hat \beta&= X^Ty \text{,} \tag{2} \end{align}\]where we apply the matrix derivative identity \(\frac{\partial \mathbf{a}^T \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}\), for vectors \(\mathbf{a}\) and \(\mathbf{b}\), in steps 2-3 above.
The relationship in Equation 2 is the matrix form of what are known as the Normal Equations. Solving for \(\hat \beta\) gives the analytical solution to the Ordinary Least Squares problem.
\[\hat \beta = (X^TX)^{-1}X^Ty\]…and voila!
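As a quick numerical check of this result, here’s a minimal sketch on simulated data (the variable names are illustrative); it computes \(\hat \beta\) from the Normal Equations and confirms it matches numpy’s least-squares solver:

import numpy as np

np.random.seed(123)
n, p = 100, 3
X = np.hstack([np.ones((n, 1)), np.random.randn(n, p)])  # design matrix with intercept
beta_true = np.array([1., 2., -1., .5])
y = X @ beta_true + np.random.randn(n)  # noise with sigma_epsilon = 1

# Analytical OLS solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Compare against numpy's least-squares routine
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(beta_hat, beta_lstsq)

In practice one would solve the linear system (e.g. np.linalg.solve(X.T @ X, X.T @ y)) rather than forming the inverse explicitly, which is better conditioned numerically.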
This post is a refactor of content with the same title originally posted on The Clever Machine Wordpress blog.
When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable \(x\) is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function \(f(x)\) that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function \(f(x)\) from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function \(f(x)\) taking \(x\) as input and producing an output). However, in the real world, we don’t get to observe \(f(x)\) directly, but instead get noisy observations \(y\), where
\[y = f(x) + \epsilon \tag{1}\]Here we will assume that \(\epsilon\) is a random variable distributed according to a zero-mean Gaussian with variance \(\sigma_{\epsilon}^2\). Note that because \(\epsilon\) is a random variable, \(y\) is also a random variable (with a mean that is conditioned on both \(x\) and \(f(x)\), and exhibiting a variance \(\sigma_{\epsilon}^2\)).
As an example, say that the true function \(f(x)\) we want to determine has the following form (though we don’t know it):
\[f(x) = \sin(\pi x)\]Thus the observations \(y\) we get to see have the following distribution.
\[y = \sin(\pi x) + \mathcal N(0,\sigma_{\epsilon}^2)\]Below we define the function \(f(x)\) and display it, then draw a few observation samples \(y\), and display them as well:
Figure 1: A data-generating function \(f(x)\) and some noisy samples \(y\). The samples exhibit a noise variance \(\sigma_{\epsilon}^2=1\)
# Frontmatter
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(123)
MARKER_SIZE = 100
DATA_COLOR = 'black'
ERROR_COLOR = 'darkred'
POLYNOMIAL_FIT_COLORS = ['orange', 'royalblue', 'darkgreen']
LEGEND_FONTSIZE = 14
TITLE_FONTISIZE = 16
N_OBSERVATIONS = 10
NOISE_STD = 1.
x = 2 * (np.random.rand(N_OBSERVATIONS) - .5)
x_grid = np.linspace(-1, 1, 100)
def f(x):
    """Base function"""
    return np.sin(x * np.pi)

def sample_fx_data(shape, noise_std=NOISE_STD):
    return f(x) + np.random.randn(*shape) * noise_std

def plot_fx_data(y=None):
    """Plot f(x) and noisy samples"""
    y = y if y is not None else sample_fx_data(x.shape)
    fig, axs = plt.subplots(figsize=(6, 6))
    plt.plot(x_grid, f(x_grid), color=DATA_COLOR, label='f(x)')
    plt.scatter(x, y, s=MARKER_SIZE, edgecolor=DATA_COLOR, facecolors='none', label='y')
# Plot the data
y = sample_fx_data(x.shape)
plot_fx_data(y)
plt.legend(fontsize=14)
plt.title(f'$f(x) = sin(\pi x)$ and some observations, $y$', fontsize=16)
plt.xlim([-1, 1])
plt.ylim([-2, 2])
Our goal is to characterize the function \(f(x)\). Since we don’t know the functional form of \(f(x)\), we must instead estimate some other function \(g(x)\) that we believe will provide an accurate approximation to \(f(x)\). The function \(g(x)\) is called an estimator of \(f(x)\). In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:
\[g_D(x) = \theta_0 + \theta_1x + \theta_2x^2 + \dots \theta_D x^D\]As the polynomial order \(D\) increases, the functions \(g_D(x)\) are able to capture increasingly complex behavior. For example, \(g_0(x)\) desribes a horizontal line with an adjustable vertical offset \(\theta_0\), \(g_1(x)\) desribes a line with adjustable vertical offset and adjustable linear slope \(\theta_1\), \(g_2(x)\) describes a function that also includes a weight on the quadratic term \(\theta_2\). We thus try to fit the values of the parameters for a given estimator \(g_D(x)\) to best account for observed data in the hopes that we will also accurately approximate \(f(x)\).
Below we estimate the parameters of three polynomial model functions of increasing complexity (using Numpy’s polyfit) to the sampled data displayed above. Specifically, we estimate the functions \(g_1(x)\), \(g_3(x)\), and \(g_{10}(x)\).
Figure 2: Various polynomial estimators \(g_D(x)\) fit to noisy samples \(y\), for \(D = (1, 3, 10)\).
plot_fx_data(y)
polynomial_degrees = [1, 3, 10]
theta = {}
fit = {}
for ii, degree in enumerate(polynomial_degrees):
    # Note: we should get a poorly-conditioned warning for degree 10 because
    # of extreme overfitting
    theta[degree] = np.polyfit(x, y, degree)
    fit[degree] = np.polyval(theta[degree], x_grid)
    plt.plot(x_grid, fit[degree], POLYNOMIAL_FIT_COLORS[ii], label=f"$g_{{{degree}}}(x)$")
plt.legend(fontsize=LEGEND_FONTSIZE)
plt.xlim([-1, 1])
plt.ylim([-2, 2])
plt.title("Various Polynomial Functions Fit to Observations", fontsize=TITLE_FONTISIZE)
Qualitatively, we see that the estimator \(g_1(x)\) (orange line) provides a poor fit to the observed data, as well as a poor approximation to the function \(f(x)\) (black curve). We see that the estimator \(g_{10}(x)\) (green curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of \(f(x)\). Finally, we see that the estimator \(g_3(x)\) (blue curve) provides a fairly good fit to the observed data, and does a much better job of approximating \(f(x)\).
Our original goal was to approximate \(f(x)\), not the data points per se. Therefore \(g_3(x)\), at least qualitatively, provides a more desirable estimate of \(f(x)\) than the other two estimators. The fits for \(g_1(x)\) and \(g_{10}(x)\) are examples of “underfitting” and “overfitting” to the observed data, respectively:
- Underfitting occurs when an estimator is not flexible enough to capture the underlying trends in the observed data.
- Overfitting occurs when an estimator is too flexible, allowing it to capture trends in the random noise rather than just the underlying function.
The model fits for \(g_D(x)\) discussed above were based on a single, randomly-sampled data set of observations \(y\). However, because \(\epsilon\) is a random variable, there are in principle a potentially infinite number of random data sets that can be observed. In order to determine a good model of \(f(x)\), it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets. To get an idea of how each of the estimators discussed above performs in general, we can repeat the model fitting procedure for many data sets.
Here we perform such an analysis, sampling 50 independent data sets according to Equation 1, then fitting the parameters for the polynomial functions of model order \(D = (1,3,10)\) to each dataset.
Figure 3: Polynomial estimators \(g_D(x)\), for \(D = (1, 3, 10)\), each fit to 50 independently sampled datasets \(y\).
from collections import defaultdict
n_simulations = 50
simulation_fits = defaultdict(list)
for sim in range(n_simulations):
    # Sample a new dataset (same x locations, new noise)
    y_simulation = sample_fx_data(x.shape)
    for degree in polynomial_degrees:
        # Note: we should get a poorly-conditioned warning
        # for degree 10 because of extreme overfitting
        theta_tmp = np.polyfit(x, y_simulation, degree)
        simulation_fits[degree].append(np.polyval(theta_tmp, x_grid))

def error_function(pred, actual):
    return (pred - actual) ** 2
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for ii, degree in enumerate(polynomial_degrees):
    plt.sca(axs[ii])
    for jj, fit in enumerate(simulation_fits[degree]):
        label = 'Single Simulation Model Fit' if jj == 0 else None
        plt.plot(x_grid, fit, color=POLYNOMIAL_FIT_COLORS[ii], alpha=.1, label=label)
    average_fit = np.array(simulation_fits[degree]).mean(0)
    squared_error = error_function(average_fit, f(x_grid))
    rms = np.sqrt(np.mean(squared_error))
    plt.plot(x_grid, average_fit, color=POLYNOMIAL_FIT_COLORS[ii], linewidth=3, label='Average Model')
    plt.plot(x_grid, squared_error, '--', color=ERROR_COLOR, label='Squared Error')
    plt.plot(x_grid, f(x_grid), color='black', label='f(x)')
    plt.yticks([])
    if ii == 1:
        plt.xlabel('x')
    elif ii == 0:
        plt.ylabel('y')
        plt.yticks([-2, 0, 2])
    plt.xlim([-1, 1])
    plt.ylim([-2, 2])
    plt.xticks([-1, 0, 1])
    plt.title(f"$g_{{{degree}}}(x)$ : RMS Error={np.round(rms, 1)}")
    plt.legend(loc='lower right')
plt.suptitle('Model Fits Given Random Samples Around f(x)', fontsize=TITLE_FONTISIZE)
The lightly-colored curves in each of the three plots above are individual polynomial model fits, one per sampled data set. The darkly-colored curve in each plot is the average over the 50 individual fits, and the black curve is the true, underlying function \(f(x)\).
We see that for the estimator \(g_1(x)\) (light orange curves), model fits do not vary too dramatically from data set to data set. Thus the averaged estimator fit over all the data sets (dark orange curve), formally written as \(\mathbb E[g(x)]\), is similar (in terms of slope and vertical offset) to each of the individual fits.
A commonly-used statistical metric that tries to assess the average accuracy of an estimator \(g(x)\) at approximating a target function \(f(x)\) is what is called the bias of the estimator, formally defined as:
\[\text{bias} = \mathbb E[g(x)] - f(x)\]The bias describes how much the average estimator fit over many datasets \(\mathbb E[g(x)]\) deviates from the value of the actual underlying target function \(f(x)\).
We can see from the plot for \(g_1(x)\) that \(\mathbb E[g_1(x)]\) deviates significantly from \(f(x)\). Thus we can say that the estimator \(g_1(x)\) exhibits large bias when approximating the function \(f(x)\).
When averaging over the individual fits for the estimator \(g_3(x)\) (blue curves), we find that the average estimator \(\mathbb E[g_3(x)]\) (dark blue curve) accurately approximates the true function \(f(x)\), indicating that the estimator \(g_3(x)\) has low bias.
Another common statistical metric attempts to capture the average consistency of an estimator when fit to multiple datasets. This metric, referred to as the variance of the estimator, is formally defined as
\[\text{variance} = \mathbb E[(g(x)-\mathbb E[g(x)])^2]\]The variance is the expected (i.e. average) squared difference between any single dataset-dependent estimate of \(g(x)\) and the average value of \(g(x)\) estimated over all datasets, \(\mathbb E[g(x)]\).
According to the definition of variance, we can say that the estimator \(g_1(x)\) exhibits low variance, because each individual fit of \(g_1(x)\) is fairly similar across datasets.
Investigating the results for the estimator \(g_{10}(x)\) (green curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits high variance.
We established earlier that the estimator \(g_3(x)\) provided a qualitatively better fit to the function \(f(x)\) than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator \(g_3(x)\) exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of \(f(x)\).
Included in each of the three plots in Figure 3 is a dashed red line representing the squared difference between the average estimator \(\mathbb E[g_D(x)]\) and the true function \(f(x)\). Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we were to calculate the expected value of each of the dashed red lines–assuming that all \(N\) values in an array of independent variables \(\mathbf x\) are equally likely to occur–we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function:
\[\mathbb E[(\mathbb E[g(\mathbf{x})]-f(\mathbf{x}))^2] = \frac{1}{N}\sum_{i=1}^N (\mathbb E[g(x_i)]-f(x_i))^2\]For the estimator \(g_3(x)\), the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of \(\mathbf x\). The estimators \(g_1(x)\) and \(g_{10}(x)\) would have substantially larger MSE values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.
For a given estimator \(g(x)\) fit to a data set of \(x\text{-}y\) pairs, we would like to know, given all the possible datasets out there, what expected prediction error we will observe for a new data point \(x^*\), \(y^* = f(x^*) + \epsilon\). If we define the prediction error to be the squared difference between the model prediction \(g(x^*)\) and the observation \(y^*\), the expected prediction error is then:
\[\mathbb E[(g(x^*) - y^*)^2]\]If we expand this a little and use a few identities, something interesting happens:
\[\begin{align} \mathbb E[(g(x^*) - y^*)^2] &= \mathbb E[g(x^*)^2-2g(x^*)y^*+y^{*2}] \tag{2} \\ & = \mathbb E[g(x^*)^2] - 2\mathbb E[g(x^*)y^*] + \mathbb E[y^{*2}] \tag{3} \\ & = \mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2 \\ & \;\;\;\;-2 \mathbb E[g(x^*)]f(x^*) \\ & \;\;\;\;+ \mathbb E[(y^*-f(x^*))^2] + f(x^*)^2 \tag{4} \end{align}\]where we have applied the lemma derived at the end of this post to the first and third terms of Equation 3, and used the fact that \(\mathbb E[y] = f(x)\) (think of averaging over an infinite number of datasets sampled from \(y\); all noise will average out, leaving \(f(x)\)). Rearranging Equation 4, we obtain:
\[\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2] + (\mathbb E[g(x^*)]^2 - 2 \mathbb E[g(x^*)]f(x^*) + f(x^*)^2) + \mathbb E[(y^*-f(x^*))^2] \tag{5}\]which can be further simplified by reversing a polynomial expansion and highlighting three terms
\[\color{green}{\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2]} + \color{blue}{( \mathbb E[g(x^*)]-f(x^*))^2} + \color{red}{\mathbb E[(y^*-f(x^*))^2]} \tag{6}\]Here we find that the expected prediction error on new data \((x^*,y^*)\) (in the squared differences sense) is the combination of these three terms: the estimator variance, squared-bias, and the observation noise variance. This take-home is important in that it states that the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators!
It turns out that, given \(N\) new data points \((\mathbf x^*,\mathbf y^*)\), the expected prediction error can be easily approximated as the mean squared error over data pairs:
\[\mathbb E[(g(\mathbf x^*) - \mathbf y^*)^2] \approx \frac{1}{N}\sum_{i=1}^N(g(x_i^*)-y_i^*)^2\]thus giving us a convenient metric for determining the best model out of a set of candidate estimators.
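In code, this approximation is just a mean of squared residuals; a minimal sketch (the function name is illustrative):

import numpy as np

def expected_prediction_error(g, x_star, y_star):
    """Approximate E[(g(x*) - y*)^2] as the MSE over held-out pairs."""
    return np.mean((g(x_star) - y_star) ** 2)

# e.g. for a fitted polynomial theta:
# expected_prediction_error(lambda x: np.polyval(theta, x), x_test, y_test)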
Below we demonstrate the findings presented above with another set of simulations. We simulate 100 independent datasets, each with 25 \(x\text{-}y\) pairs; the samples \(y\) have a noise variance \(\sigma_{\epsilon}^2=\sigma_{\text{noise}}^2=0.25\). We then partition each dataset into two non-overlapping sets:
- a Training Set (80% of each dataset) used to fit the model parameters, and
- a Testing Set (the remaining 20%) used to evaluate prediction error on data not used for training.
We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and prediction error on the testing set for each of the estimators:
Figure 4: (Left) Demonstration of how estimator bias and variance contribute to the mean squared error on the Testing Set. The Testing Set error (dark red) can be broken down into three components: the squared bias (blue) of the estimator, the estimator variance (green), and the noise variance \(\sigma_{noise}^2\) (red). The “best” model (polynomial degree \(D=3\)) has the optimal balance of low bias and low variance. Note that the noise variance is considered a lower bound on the Testing Set error, as it cannot be accounted for by any model. (Right) Demonstration of overfitting when the model complexity surpasses the optimal bias-variance tradeoff. Models with a complexity above \(D=3\) are able to fit the Training Set data better, but at the expense of not generalizing to the Testing Set, resulting in increasing generalization error.
np.random.seed(124)
n_observations_per_dataset = 25
n_datasets = 100
max_poly_degree = 12 # Maximum model complexity
model_poly_degrees = range(1, max_poly_degree + 1)
NOISE_STD = .5
percent_train = .8
n_train = int(np.ceil(n_observations_per_dataset * percent_train))
# Create training/testing inputs
x = np.linspace(-1, 1, n_observations_per_dataset)
x = np.random.permutation(x)
x_train = x[:n_train]
x_test = x[n_train:]
# logging variables
theta_hat = defaultdict(list)
pred_train = defaultdict(list)
pred_test = defaultdict(list)
train_errors = defaultdict(list)
test_errors = defaultdict(list)
# Loop over datasets
for dataset in range(n_datasets):
    # Simulate training/testing targets
    y_train = f(x_train) + NOISE_STD * np.random.randn(*x_train.shape)
    y_test = f(x_test) + NOISE_STD * np.random.randn(*x_test.shape)
    # Loop over model complexities
    for degree in model_poly_degrees:
        # Train model
        tmp_theta_hat = np.polyfit(x_train, y_train, degree)
        # Make predictions on train set
        tmp_pred_train = np.polyval(tmp_theta_hat, x_train)
        pred_train[degree].append(tmp_pred_train)
        # Test predictions
        tmp_pred_test = np.polyval(tmp_theta_hat, x_test)
        pred_test[degree].append(tmp_pred_test)
        # Mean Squared Error for train and test sets
        train_errors[degree].append(np.mean(error_function(tmp_pred_train, y_train)))
        test_errors[degree].append(np.mean(error_function(tmp_pred_test, y_test)))
def calculate_estimator_bias_squared(pred_test):
    pred_test = np.array(pred_test)
    average_model_prediction = pred_test.mean(0)  # E[g(x)]
    # (E[g(x)] - f(x))^2, averaged across all trials
    return np.mean((average_model_prediction - f(x_test)) ** 2)

def calculate_estimator_variance(pred_test):
    pred_test = np.array(pred_test)
    average_model_prediction = pred_test.mean(0)  # E[g(x)]
    # (g(x) - E[g(x)])^2, averaged across all trials
    return np.mean((pred_test - average_model_prediction) ** 2)
complexity_train_error = []
complexity_test_error = []
bias_squared = []
variance = []
for degree in model_poly_degrees:
    complexity_train_error.append(np.mean(train_errors[degree]))
    complexity_test_error.append(np.mean(test_errors[degree]))
    bias_squared.append(calculate_estimator_bias_squared(pred_test[degree]))
    variance.append(calculate_estimator_variance(pred_test[degree]))
best_model_degree = model_poly_degrees[np.argmin(complexity_test_error)]
# Visualizations
fig, axs = plt.subplots(1, 2, figsize=(14, 10))
## Plot Bias^2 + variance
plt.sca(axs[0])
plt.plot(model_poly_degrees, bias_squared, color='blue', label='$bias^2$')
plt.plot(model_poly_degrees, variance, color='green', label='variance')
plt.plot(model_poly_degrees, np.array(bias_squared) + np.array(variance), linestyle='-.', color='gray', label='$bias^2 + variance$')
plt.plot(model_poly_degrees, complexity_test_error, label='Testing Set Error', linewidth=3, color=ERROR_COLOR)
plt.axvline(best_model_degree, linestyle='--', color='black', label=f'Best Model(degree={best_model_degree})')
plt.plot(model_poly_degrees, np.array(complexity_test_error) - (np.array(bias_squared) + np.array(variance)), color='red', label='Testing Error - ($bias^2 + variance$)')
plt.axhline(NOISE_STD ** 2, color='tomato', linestyle='--', label=f'$\\sigma_{{noise}}^2$ = {round(NOISE_STD**2, 3)}')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylim([0, .6]);
plt.legend()
plt.title('Testing Error Relationship to Bias and Variance')
## Plot Train / Test Set Error
plt.sca(axs[1])
plt.plot(model_poly_degrees, complexity_train_error, label='Training Set Error', linewidth=3, color=DATA_COLOR)
plt.plot(model_poly_degrees, complexity_test_error, label='Testing Set Error', linewidth=3, color=ERROR_COLOR)
plt.axvline(best_model_degree, linestyle='--', color='black', label=f'Best Model(degree={best_model_degree})')
plt.ylim([0, .6]);
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.title('Error on Training and Testing Sets')
plt.legend(loc='upper center')
In the left subpanel of Figure 4 we see how, as the model complexity increases, the estimator variance (green curve) also increases. Additionally, as model complexity increases, the squared bias (blue curve) decreases. Thus there is a tradeoff between bias and variance that comes with model complexity:
- Models that are too simple exhibit high bias but low variance.
- Models that are too complex exhibit low bias but high variance.
In this example, we highlight the best estimator in terms of prediction error on the testing set (dark red curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of \(D=3\). Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed gray curve) is also at a minimum.
It’s also important to notice that the sum of the squared bias and variance has the same shape as the curve defined by the mean squared prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation 6 is also represented in the left subpanel of Figure 4 as the red curve. This curve was calculated by taking the difference between the Testing Set error and the sum of the variance and squared bias. We can see that the noise term is roughly constant and equivalent to the underlying variance of the observations \(\sigma_{\text{noise}}^2\) (indicated by the dashed red line).
It’s important to be clear that all of the simulated results above are based on evaluating prediction error on novel data, not used to estimate model parameters. It turns out that assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always “overfit.” In plain terms, overfitting means that we will always favor a more complex estimator if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.
This overfitting phenomenon is demonstrated in the right side of Figure 4. For the same simulation data as in the left of Figure 4, we plot the error calculated on the Training set (black curve) along with the error calculated on the testing set (red curve). We also identify the best estimator based on the Testing Set Error.
We see here that as model complexity increases, the error calculated on the training set continues to decrease, whereas the error on the testing set increases past the optimal polynomial order \(D=3\). We showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.
In this post we discussed how the bias and variance of an estimator are related to squared prediction error on the testing set. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.
The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping).
However, the direct connection we made between bias and variance and the mean squared error calculated on a testing set gives us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data into a Training Set used to fit model parameters and a Testing Set used to assess prediction accuracy.
Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.
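For instance, here’s a minimal sketch of k-fold cross-validation for scoring a candidate polynomial degree, using only numpy (the function name is illustrative; the post itself stops at a single train/test split):

import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial across k folds."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        theta = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(theta, x[fold])
        errors.append(np.mean((pred - y[fold]) ** 2))
    return np.mean(errors)

One would then pick the degree with the smallest cross-validated MSE, mirroring the Testing Set criterion used above.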
This post is a refactor of content with the same title originally posted on The Clever Machine Wordpress blog.
To close, let’s derive the lemma used above: for any random variable \(X\),
\[\mathbb E[X^2] = \mathbb E[(X - \mathbb E[X])^2] + \mathbb E[X]^2 . \tag{1}\]If we expand the first expression on the right-hand side of Equation 1:
\[\begin{align} \mathbb E [(X - \mathbb E[X])^2] &= \mathbb E[X^2-2X\mathbb E[X]+\mathbb E [X]^2] \\ &= \mathbb E[X^2] - 2\mathbb E[X] \mathbb E [\mathbb E [X]] + \mathbb E [\mathbb E [X]^2]. \tag{2} \end{align}\]If we note that \(\mathbb E[X]\) is a constant, then \(\mathbb E[\mathbb E[X]]\) is also a constant, namely \(\mathbb E[X]\). Taking this into account Equation 2 simplifies to:
\[\begin{align} \mathbb E[(X - \mathbb E[X])^2] &= \mathbb E[X^2] - 2\mathbb E[X] \mathbb E[X] + \mathbb E[X]^2 \\ &= \mathbb E[X^2] - 2\mathbb E[X]^2 + \mathbb E[X]^2 \\ &= \mathbb E[X^2] - \mathbb E[X]^2 \tag{3} \end{align}\]Plugging Equation 3 back into the right-hand side of Equation 1 gives
\[\begin{align} \mathbb E[(X - \mathbb E[X])^2] + \mathbb E[X]^2 &= \mathbb E[X^2] - \mathbb E[X]^2 + \mathbb E[X]^2 \\ &= \mathbb E[X^2] , \tag{4} \end{align}\]thus giving the desired result.
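As a quick sanity check of the result, here’s a minimal Monte Carlo sketch with an arbitrarily chosen distribution:

import numpy as np

np.random.seed(123)
X = 2. + 3. * np.random.randn(1_000_000)  # mean-2, std-3 Gaussian samples

lhs = np.mean(X ** 2)                               # E[X^2]
rhs = np.mean((X - X.mean()) ** 2) + X.mean() ** 2  # E[(X - E[X])^2] + E[X]^2
assert np.isclose(lhs, rhs)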