Posts
p-Hacking 101: N Chasing
“\(N\) Chasing,” or adding new observations to an already-analyzed experiment, can increase your experiment’s false positive rate. As an experimenter or analyst, you may have heard of the dangers of \(N\) chasing, but may not have an intuition as to why or how it increases Type I Error. In this post we’ll demonstrate \(N\) chasing using some simulations, and show that, under certain settings, adding just a single data point to your experiment can dramatically increase false positives.
Who Needs Backpropagation? Computing Word Embeddings with Linear Algebra
Word embeddings provide numerical representations of words that carry useful semantic information about natural language. This has made word embeddings an integral part of modern Natural Language Processing (NLP) pipelines and language understanding models. Common methods used to compute word embeddings, like word2vec, employ predictive, neural network frameworks. However, as we’ll show in this post, we can also compute word embeddings using some basic frequency statistics, a little information theory, and our good old friend from linear algebra, Singular Value Decomposition.
SVD and Data Compression Using Low-rank Matrix Approximation
In a previous post we introduced the Singular Value Decomposition (SVD) and its many advantages and applications. In this post, we’ll discuss one of my favorite applications of SVD: data compression using low-rank matrix approximation (LRA). We’ll start off with a quick introduction to LRA and how it relates to data compression. Then we’ll demonstrate how SVD provides a convenient and intuitive method for image compression using LRA.
Singular Value Decomposition: The Swiss Army Knife of Linear Algebra
Linear algebra provides a number of powerful computational methods that are used throughout the sciences. However, I would say that hands-down the most versatile of these methods is singular value decomposition, or SVD. In this post we’ll dive into a little theory behind matrix diagonalization and show how SVD generalizes matrix diagonalization. Then we’ll go into a few of the properties of SVD and cover a few (of many!) cool and useful applications of SVD in the real world. In addition, each application will have its own dedicated post.
Efficient Matrix Power Calculation via Diagonalization
Taking the power of a matrix is an important operation with applications in statistics, machine learning, and engineering. For example, solving linear ordinary differential equations, identifying the state of a Markov chain at time \(t\), and counting the number of paths between nodes in a graph can all be accomplished using powers of matrices. In this quick post we’ll show how matrix diagonalization can be used to efficiently compute the power of a matrix.
Common Linear Algebra Identities
This post provides a convenient reference of Linear Algebra identities used in The Clever Machine Blog.
Derivation: Ordinary Least Squares Solution and the Normal Equations
Have you ever performed linear regression involving multiple predictor variables and run into this expression \(\hat \beta = (X^TX)^{-1}X^Ty\)? It’s called the OLS solution via Normal Equations. To find out where it comes from, read on!
Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff
In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.
Supplemental Proof: The Expected Value of a Squared Random Variable
We want to show the following relationship:
A Gentle Introduction to Artificial Neural Networks
Though many phenomena in the world can be well-modeled using basic linear regression or classification, there are also many interesting phenomena that are nonlinear in nature. To deal with nonlinear phenomena, a diverse set of nonlinear models has been developed.
Cutting Your Losses: Loss Functions & the Sum of Squared Errors Loss
In this post we’ll introduce the notion of the loss function and its role in model parameter estimation. We’ll then focus in on a common loss function–the sum of squared errors (SSE) loss–and give some motivations and intuitions as to why this particular loss function works so well in practice.
Derivation: Derivatives for Common Neural Network Activation Functions
When constructing Artificial Neural Network (ANN) models, one of the primary considerations is choosing activation functions for hidden and output layers that are differentiable. This is because calculating the backpropagated error signal that is used to determine ANN parameter updates requires the gradient of the activation function. Three of the most commonly used activation functions in ANNs are the identity function, the logistic sigmoid function, and the hyperbolic tangent function. Examples of these functions and their associated gradients (derivatives in 1D) are plotted in Figure 1.
Derivation: Error Backpropagation & Gradient Descent for Neural Networks
Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walkthrough of the derivation of the gradient descent algorithm commonly used to train ANNs–aka the “backpropagation” algorithm. Along the way, I’ll also try to provide some high-level insights into the computations being performed during learning^{1}.

Though, I guess these days with autograd, who really needs to understand how the calculus for gradient descent works, amiright? (hint: that is a joke)
