Above: Literally me procrastinating all my work but decided to put Kumiko here just because hehe.
Draft 1: 6/12/24
Recommended Background:
Understanding of gradients, Hessians, and basic machine learning theory
(loss, function approximation, generalization).
Key Takeaways:
Mathematically intuitive understanding of gradient boosting.
Preliminaries:
At a high level, gradient boosting can be thought of as functional
gradient descent: iteratively (greedily) learning a series of functions,
each fit to the negative gradient of the loss (which, for squared-error
loss, is just the residual) at the current predictions. This differs
from the classical approach of running gradient descent on a model’s
parameters to decrease the loss; here, the descent steps are taken in
function space. It can also be thought of as a type of ensemble
learning, where high bias, low variance learners produce a
mathematically combined (in this case, via a learning-rate weighted
sum) prediction. More on this later.
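To make the functional-gradient-descent picture concrete up front, here’s a minimal sketch of the boosting loop, assuming squared-error loss (so the negative gradient is exactly the residual) and borrowing scikit-learn’s `DecisionTreeRegressor` as the weak learner; the function names (`gradient_boost`, `boosted_predict`) and hyperparameter values are illustrative choices of mine, not a reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, lr=0.1, max_depth=2):
    """Minimal gradient boosting sketch for squared-error loss.

    With L(y, f) = (1/2) * (y - f)^2, the negative gradient w.r.t. f(x_i)
    is the residual y_i - f(x_i), so each round fits a shallow tree to the
    current residuals and adds it in with a learning-rate weight.
    """
    f0 = y.mean()                     # initial constant prediction
    pred = np.full(len(y), f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred           # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)         # weak learner fit to the residual
        pred += lr * tree.predict(X)  # learning-rate weighted sum
        trees.append(tree)
    return f0, trees

def boosted_predict(f0, trees, X, lr=0.1):
    """Evaluate the learning-rate weighted sum of the learned trees."""
    return f0 + lr * sum(tree.predict(X) for tree in trees)
```

Each round takes one greedy "gradient step" in function space: the new tree approximates the direction of steepest loss decrease, and `lr` scales the step size.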
Just some standard notation:
\({\mathbf{D}}\) corresponds to a
real \(n \times d\) data matrix, where
\({\mathbf{x}}_i \in \mathbb{R}^d\)
corresponds to datapoint or row entry \(i\), and \({\mathbf{X}}_j \in \mathbb{R}^n\)
corresponds to attribute or column entry \(j\). We have target regression values
\({\mathbf{y}}\in \mathbb{R}^n\).
You definitely could use GBMs for other learning tasks, such as an unsupervised approach, without loss of generality, but let’s keep it simple with a supervised learning task. Rigorously defining the problem space, we want to learn a function that is optimal (with respect to our choice of loss/metric) on average across the joint distribution of \(({\mathbf{x}}, y)\): \[f^* = \mathop{\mathrm{arg\,min}}_{f}\mathbb{E}_{{\mathbf{x}}, y}\left[{\mathcal{L}}\left(y, f({\mathbf{x}})\right)\right].\] (Of course, the true optimum would be to learn \(f^*\) such that \(f^*({\mathbf{x}}) = y\) for every \(({\mathbf{x}}, y)\) pair, but obviously we’re assuming this isn’t feasible). In other words, we want to learn some model to predict a class or regression value, and we want it to generalize well to any future data we encounter.
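For example, with squared-error loss \({\mathcal{L}}(y, f({\mathbf{x}})) = (y - f({\mathbf{x}}))^2\) (one common choice; any differentiable loss works), the quantity we can actually optimize is the empirical average over the \(n\) rows of \({\mathbf{D}}\): \[\hat{f} = \mathop{\mathrm{arg\,min}}_{f}\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f({\mathbf{x}}_i)\right)^2.\]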
CART (Classification and Regression Trees) is the
standard "weak" learner that boosting procedures tend to use. The intuition
behind the two problem classes is very similar, so we’ll show the
intuition primarily for the regression case.
Regression Trees:
If you’re unfamiliar with binary decision tree models: at each iteration,
the regression tree learns some threshold (numerical or class-oriented) on a
randomly selected feature \({\mathbf{X}}_s\) (\(s\) for selected) that minimizes a certain
loss function. The resulting tree function can be
interpreted as splitting the input (feature) space into regions \(R_m\), each with its own predicted regression
value \(c_m\). \[f({\mathbf{x}}) = \sum_{m = 1}^{M}c_m
I[{\mathbf{x}}\in R_m]\]
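As a quick concrete instance (with toy numbers of my own choosing): a depth-one tree that splits attribute \({\mathbf{X}}_s\) at the value 3 has \(M = 2\) regions, \[f({\mathbf{x}}) = c_1 I[x_s \leq 3] + c_2 I[x_s > 3],\] where each \(c_m\) is typically the mean target value of the training points that fall into \(R_m\).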
The regions are learned iteratively and greedily. Let’s take a look at the first iteration (which generalizes to any iteration): we take some random attribute \({\mathbf{X}}_s\), and pick the datapoint whose split minimizes the sum of squared errors (SSE) against the mean target value (or any loss of our choice) across the two regions it generates, \(R_1\) and \(R_2\). Note that we use the datapoints themselves as candidate split points because the domain of \({\mathbf{X}}_s\) could be infinite, \(\mathbb{R}\). We could discretize the real numbers and try each value, but since we have a finite number of datapoints, we can just use the datapoints directly as discrete split points. Formally, \[\begin{aligned} &\bar{{\mathbf{x}}} = \mathop{\mathrm{arg\,min}}_{{\mathbf{x}}^* \in R_1 \cup R_2} \sum_{{\mathbf{x}}_i \in R_1}(y_{i} - {\mu}_1)^2 + \sum_{{\mathbf{x}}_j \in R_2}(y_{j} - {\mu}_2)^2\\ &\mbox{Where } {\mathbf{x}}_i \in R_1 \mbox{ if } x_{i,s} \leq x^*_s, \mbox{ and } {\mathbf{x}}_j \in R_2 \mbox{ if } x_{j,s} > x^*_s\\ \end{aligned}\]
It’s hard not to overload the notation and make it look unnecessarily complex, but basically \(x_{i,s}\) refers to the \(i\)-th datapoint’s attribute value at \(s\), and \(x^*_s\) is the attribute value at \(s\) of the candidate optimal datapoint to split on.
\(\mu\) here is just used to denote the mean target value of the datapoints within that region, e.g., \[{\mu_1} = \frac{1}{|R_1|}\sum_{{\mathbf{x}}_i \in R_1}y_{i},\quad {\mu_2} = \frac{1}{|R_2|}\sum_{{\mathbf{x}}_i \in R_2}y_{i}\]
Note that at each iteration, we repeat this splitting process for each of the terminal leaf nodes. \(M-1\) iterations or splits give us a total of \(M\) leaf nodes, each characterizing a region \(R_m\) with a certain predicted value \(c_m\).
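Here’s a minimal sketch of that single split search, assuming SSE as the loss and using the selected attribute’s values at the datapoints as the candidate thresholds; `best_sse_split` and its argument names are mine, not a standard API:

```python
import numpy as np

def best_sse_split(x_s, y):
    """Find the threshold on one attribute X_s that minimizes the SSE.

    x_s : 1-D array of the selected attribute's values (one per datapoint)
    y   : 1-D array of target values
    Tries each datapoint's value of X_s as a candidate threshold, exactly
    as in the argmin above, and returns (best_threshold, best_sse).
    """
    best_thresh, best_sse = None, np.inf
    for thresh in np.unique(x_s):
        left, right = y[x_s <= thresh], y[x_s > thresh]
        if len(left) == 0 or len(right) == 0:   # skip degenerate splits
            continue
        # SSE of each region against its own mean target value (mu_1, mu_2)
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_thresh, best_sse = thresh, sse
    return best_thresh, best_sse
```

Growing a full tree with \(M\) leaf regions just repeats this search on the subset of datapoints falling into each terminal leaf, \(M-1\) splits in total.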
Hyperparameters, Overfitting, Variance, and Pruning by
Loss:
Coming Soon: Picking depth, strengths & weaknesses of GBM models,
and ways to mitigate those weaknesses.
Depth Intuition via ANOVA:
Coming Soon: How tree depth and choice of the random attribute split
corresponds to higher order interactions between attributes.
Gradient Boosting (Gradient Boosted Trees):
Coming Soon: Fitting residuals of gradients and Hessians, shrinkage,
bagging.
Interpretability:
Loss-relative importance, and partial-dependence plots.