Here’s what I found useful from the 2022 offering of Practical Deep Learning for Coders by fast.ai.
1 Getting started
1.1 The significance of the XOR affair
Minsky’s Perceptrons (1969) is infamous for claiming that neural networks cannot learn logic. AI history is very complex, so perhaps this isn’t exactly accurate. But the book has become a symbol of a certain side of the AI debate.
The dispute is about logic.
Today logic is not that important. We don’t much care how a model works as long as it is useful. But this was not always the case. Stemming from very intense movements in philosophy and other disciplines [1], logic in the last century was seen by many as the very fabric of reality. It is in this environment that computers, prized first and foremost as logical machines, were invented.
Minsky’s problem with neural networks, then, was that they were not logically structured. Unlike a computer, which is built from the ground up using determinate Boolean operations, neural networks instead use calculus and numerical optimization to fit curves.
It is only against this background that Minsky’s actual argument, that a single neuron cannot compute a basic logical function called “exclusive or” (XOR for short), makes sense. Without it he comes off as nitpicking, and that reading misses the stakes of the argument: the dispute between what is true and what is useful.
1.1.1 Appendix
For reference XOR looks like this:
A XOR B | B = False | B = True |
---|---|---|
A = False | 0 | 1 |
A = True | 1 | 0 |
Minsky’s basic argument is that a Perceptron (a single neuron) cannot learn this function because it is not linearly separable: in other words, you cannot draw a straight line that separates the 0s from the 1s. In fact, XOR and its negation (XNOR) are the only two-input logical functions that are not linearly separable, which is probably why Minsky chose XOR for his analysis.
For reference AND looks like this:
A AND B | B = False | B = True |
---|---|---|
A = False | 0 | 0 |
A = True | 0 | 1 |
You can draw a straight line to separate the 0s and 1s for AND, but you cannot do it for XOR.
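To make the difference concrete, here is a minimal sketch that trains a single threshold neuron with the classic perceptron update rule on both truth tables. The function names, learning rate, and epoch count are my own choices for illustration, not anything from Minsky’s book or the course.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train a single threshold neuron with the classic perceptron rule."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            err = target - pred  # 0 when the prediction is already right
            w1 += lr * err * x1
            w2 += lr * err * x2
            b += lr * err
    return w1, w2, b


def accuracy(samples, weights):
    """Fraction of rows in the truth table the neuron gets right."""
    w1, w2, b = weights
    hits = sum((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == target
               for (x1, x2), target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, train_perceptron(AND))
xor_acc = accuracy(XOR, train_perceptron(XOR))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

The neuron settles on a separating line for AND after a handful of passes. For XOR it keeps cycling, because no line gets all four rows right.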
It would take hidden layers with non-linear transformations (such as ReLU) to enable neural networks to draw the squiggly lines that can solve these types of problems. But in the eyes of the logicists, this would be just another hack on a fundamentally unsound technology.
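As a rough illustration, two ReLU neurons feeding one linear output are enough to compute XOR. The weights below are picked by hand for clarity rather than learned, and `xor_net` is just a name I made up for this sketch.

```python
def relu(z):
    return max(0.0, z)


def xor_net(x1, x2):
    # Hidden layer: two ReLU neurons that both look at the sum x1 + x2.
    h1 = relu(x1 + x2)      # kink at 0
    h2 = relu(x1 + x2 - 1)  # kink at 1, so it is active only when both inputs are 1
    # Output layer: a plain linear combination of the two hidden units.
    return h1 - 2 * h2


outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, value in outputs.items():
    print(inputs, value)
```

The second hidden unit cancels the first exactly when both inputs are on, which is the bend a single straight line cannot make.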
2 Deployment
2.1 The Drivetrain framework helps you build useful ML products
The Drivetrain approach to ML product design helps you produce actionable outcomes for a useful task instead of getting stuck building models. It has four stages.
The first stage is to clearly define the product objective. For instance, for a search engine the main user objective is to find a useful answer to their query. Therefore the objective of the product becomes: “find the most relevant result for a given query.”
Next consider which possible levers can achieve this objective. In our case it is the ordering of the results: a well-ranked list of results is useful and satisfies the user objective, while a poorly ranked list does the opposite.
After defining the levers, think about which data can power them. The graph structure of the internet can be harnessed to produce good rankings, since sites with higher-quality content tend to be linked to more often by other sites.
Finally we turn to modeling: the process of producing the most effective mapping from inputs (data) to outputs (levers) that satisfies the objective. If done right, a high-performance model will produce high-performance outcomes by driving action.
Another example: recommendation systems.
Objective: People buy what they like. Therefore we drive sales by linking users to other products they will enjoy or find useful (user taste).
Levers: Rank all products by user taste and return the top 5 or 10.
Data: Purchase history contains the taste of each user. Matching users to other users would enable the system to recommend products the user has not tried yet, or even recommend to new users.
Model: A useful model will take a user’s purchase history (or answers to a quick quiz for new users) and use it to produce a ranked list of products the user will like.
2.2 Always train a model before looking at your data
Training a model before touching your data will help you figure out where to focus your efforts.
For instance, looking at the examples the model struggled on will tell you what extra data you may need to collect, or suggest architecture choices to consider.
Flipping through misclassified / high loss examples can also help you find systematic biases in the data, or weed out mislabeled or corrupt examples.
Use `ImageClassifierCleaner` from `fastai.vision.widgets` for a GUI that expedites this process.
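The idea behind flipping through high-loss examples can be sketched without any GUI: compute a per-example loss and sort by it. The file names and probabilities below are invented for illustration, and this plain-Python version is only a stand-in for what a real tool does on top of a trained model.

```python
import math


def cross_entropy(probs, label):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probs[label])


# Hypothetical predictions: (file name, predicted class probabilities, true label).
predictions = [
    ("img_001.jpg", [0.90, 0.10], 0),
    ("img_002.jpg", [0.20, 0.80], 0),  # confidently wrong, so worth a look
    ("img_003.jpg", [0.40, 0.60], 1),
    ("img_004.jpg", [0.05, 0.95], 1),
]

# Highest loss first: these are the examples to inspect for bad labels or bias.
by_loss = sorted(predictions, key=lambda p: cross_entropy(p[1], p[2]), reverse=True)
worst = [name for name, _, _ in by_loss[:2]]
print(worst)
```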
2.3 Model deployment does not require a GPU
Generating a prediction is far less expensive than training, so it usually does not require a GPU.
Also, GPUs are most efficient at batch processing, which is probably not necessary or helpful for a small-scale app.
Lastly managing GPU servers is very complex and expensive. You are better off delaying it until server traffic merits it.
2.4 Models rarely work as expected in deployment
The training set rarely reflects real-world conditions, leading to a few common issues:
The data seen at prediction time does not match the training data (training-serving skew). An example is a classifier that was trained on well-lit, professional images from the internet but is used in practice to predict on grainy images taken with mobile phone cameras. This new type of image will have to be incorporated into the training data.
The nature of the process being modeled changes (domain shift). As norms and rules of a society evolve certain data relationships no longer hold, leading the model to make bad predictions based on false assumptions.
The very flexibility that lets neural networks learn very complex mappings is also what makes them difficult to interpret and to fix when something goes wrong.
Therefore the best strategy for deployment has two aspects:
Roll out any model gradually, first in parallel with whatever pre-existing process it is replacing, then in a limited scope with plenty of supervision, finally expanding to more areas as the model gains trust.
Grapple with the implications of two questions: 1. what could go wrong, and 2. what happens in the best case scenario? The former will help you build the correct reporting structure around deployment to catch and address any issues, and the latter will force you to confront any possible feedback loops: that is, unintended changes in user behavior or outcomes as a consequence of deployment.
3 Neural network foundations
3.1 ML explained simply
Machine learning is fundamentally about fitting a function to data. Then we can use the function instead of data to make predictions.
A model defines the function shape. After that, all we have to do is find the weights that minimize the difference between the model output and the actual data.
We can find good weights by starting with random values and nudging them slowly in the direction that reduces this difference. If we do this long enough with the right nudging strategy, we will usually arrive at weights that fit the data well for the model we chose.
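Here is a small sketch of that loop, assuming the simplest possible model (a straight line, `a * x + b`) and mean squared error as the measure of difference. The data is synthetic and the learning rate and step count are arbitrary choices that happen to work for this toy problem.

```python
import random

random.seed(0)

# Synthetic data from a known line, y = 3x + 2, plus a little noise.
xs = [i / 10 for i in range(-20, 21)]
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]


def mse(a, b):
    """Mean squared difference between the model output and the data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Start from random weights.
a, b = random.uniform(-1, 1), random.uniform(-1, 1)
start_loss = mse(a, b)

lr = 0.1  # how hard to nudge on each step
for _ in range(500):
    # Gradients of the mean squared error with respect to each weight.
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Nudge each weight a little in the direction that lowers the error.
    a -= lr * grad_a
    b -= lr * grad_b

end_loss = mse(a, b)
print(f"a={a:.2f}, b={b:.2f}, loss {start_loss:.3f} -> {end_loss:.3f}")
```

The fitted `a` and `b` should land close to the 3 and 2 used to generate the data.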
3.2 Start with simple models
Protip: start with very simple models (e.g. ResNet18 or ResNet34). You’ll be able to iterate quickly at the beginning to figure out data augmentation and cleaning strategies. When you have those nailed down you can train a bigger, more expensive model to see if it is worth the time and cost.
3.3 Neural networks explained simply
Neural networks represent data by using simple shapes like squiggles and angles to build more complicated curves. This allows the network to slowly piece things together like a jigsaw puzzle.
The activation function defines the basic shape. For instance, the ReLU function generates a simple angular kink that can be moved around, stretched, and rotated by the network using the weights it has learned. The network uses many of these kinks at once to represent a complex silhouette.
What makes neural networks “deep” is that outputs of one layer become inputs for the next layer. This allows the network to fashion its own jigsaw pieces based on what seems most useful for approximating the final curve.
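The “many kinks at once” picture can be sketched in one dimension: a weighted sum of shifted ReLUs traces a piecewise-linear curve through any points we like. Below, the weights are computed by hand so the curve follows x² on [0, 2]. A real network would learn them, and the knot positions are an arbitrary choice of mine.

```python
def relu(z):
    return max(0.0, z)


def target(x):
    return x * x


knots = [0.0, 0.5, 1.0, 1.5, 2.0]

# Slope of the straight segment between each pair of neighbouring knots.
slopes = [(target(right) - target(left)) / (right - left)
          for left, right in zip(knots, knots[1:])]

# Each kink only has to contribute the change in slope at its knot.
kink_weights = [slopes[0]] + [s - prev for prev, s in zip(slopes, slopes[1:])]


def approx(x):
    """Weighted sum of shifted ReLUs: one hidden layer with a linear output."""
    return target(knots[0]) + sum(w * relu(x - k)
                                  for w, k in zip(kink_weights, knots))


for x in (0.5, 1.25, 2.0):
    print(x, target(x), approx(x))
```

The approximation is exact at the knots and slightly off in between. Adding more kinks shrinks that gap.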
4 NLP
4.1 Fine-tuning
Fine-tuning lets you use a more general model to accomplish a more specific task.
For instance, many language models are trained on a very large corpus such as Wikipedia or Reddit. They understand the patterns of language but do not necessarily perform a useful task. You can use the knowledge (i.e. learned weights) of these models to accomplish a more specific task such as sentiment classification for user reviews.
Fine-tuning works by throwing away the last layer, which produces the output for the original task, and replacing it with a new random matrix for the new task we are trying to learn. The remaining layers keep their learned weights.
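Mechanically, the head swap looks something like the toy sketch below. The “model” is just a dictionary of weight matrices, with sizes and names invented for illustration. Real frameworks do the same thing with layer objects rather than nested lists.

```python
import random

random.seed(0)


def random_matrix(rows, cols):
    return [[random.gauss(0, 0.01) for _ in range(cols)] for _ in range(rows)]


# Stand-in for a pretrained model: a feature-extracting body plus an output head.
pretrained = {
    "body": random_matrix(8, 16),     # pretend these weights were already learned
    "head": random_matrix(16, 1000),  # output layer for the original task
}


def replace_head(model, n_classes):
    """Keep the learned body and swap the head for a fresh random matrix."""
    n_features = len(model["head"])  # the new head must accept the same features
    return {"body": model["body"], "head": random_matrix(n_features, n_classes)}


# New task: two-way sentiment classification.
sentiment_model = replace_head(pretrained, n_classes=2)
print(len(sentiment_model["head"]), "x", len(sentiment_model["head"][0]))
```

Training then only has to teach the new head from scratch, while the body starts from what it already knows.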
4.2 NLP tasks
The most popular task for NLP is classification: assigning a document to a category.
- Sentiment analysis categorizes documents as having positive or negative emotional content.
- Author identification returns the author that wrote the document.
- Legal discovery returns whether a document is relevant for a trial or not.
Related to classification is document similarity: are two documents about the same thing? We can convert this into a classification problem by concatenating the two documents into one string and trying to classify it using the categories “different,” “similar,” and “identical.”
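A minimal sketch of the concatenation step might look like this. The `TEXT1:` / `TEXT2:` markers are an assumption on my part: any consistent separator that lets the model tell the two documents apart should do.

```python
LABELS = ["different", "similar", "identical"]


def make_pair_input(doc_a, doc_b):
    """Join two documents into one string, with markers so they stay distinguishable."""
    return f"TEXT1: {doc_a}; TEXT2: {doc_b}"


example = make_pair_input("The cat sat on the mat.", "A cat is sitting on a mat.")
print(example)
```

The resulting string is then fed to an ordinary text classifier whose target is one of `LABELS`.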
4.3 Misc. tips
```{python}
import os

# Create the directory (and any missing parents); do nothing if it already exists.
os.makedirs(path, exist_ok=True)
```
Plot out a subset of large datasets to get a feel for the data. You usually don’t need more than 1,000 points.
It’s often better to learn the characteristics of a metric by seeing what it looks like instead of diving into theory (or blog posts). Ultimately we are more concerned about how it behaves than what it is, and seeing how it responds to outliers, truncated data, non-linearities, etc. will provide more insight than studying the shape of its formula.
Never discard outliers. Asking why they exist teaches you about your data: different groups that should be analyzed separately, edge cases, problems in data quality / collection, and so on. Always ask, “where did this come from?”
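As one example of poking at a metric, the sketch below takes the Pearson correlation coefficient (my choice of metric, picked only for illustration) and checks how it reacts to a single outlier in otherwise perfectly linear data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # a perfectly linear relationship

r_clean = pearson_r(xs, ys)
r_outlier = pearson_r(xs + [10], ys + [-50])  # the same data plus one wild point

print("without outlier:", r_clean)
print("with outlier:", r_outlier)
```

One stray point is enough to drag the coefficient from a perfect positive correlation to a negative one, which says a lot about how much weight this metric gives to extreme values.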
4.4 Processing text
Neural networks work on numbers. We convert text to numbers by splitting it into chunks called tokens and converting those tokens into numbers using a map called a vocabulary. We save the vocabulary so we can convert the numbers back into words and process incoming text when we make predictions.
Tokens are typically smaller than words. Some models use individual characters as tokens, meaning those models have a vocabulary of roughly 26 letters plus punctuation marks (for English). Larger vocabularies capture more patterns, but require more data and are more computationally expensive to train.
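The round trip from text to numbers and back can be sketched in a few lines. This toy version splits on whitespace, so its tokens are whole words. Real tokenizers use subword pieces, and the `<unk>` placeholder for unseen tokens is a common convention I am assuming here.

```python
def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(texts):
    """Map every distinct token to an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Text -> list of ids, falling back to <unk> for tokens not in the vocabulary."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]


def decode(ids, vocab):
    """List of ids -> text, using the saved vocabulary in reverse."""
    id_to_token = {i: token for token, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


vocab = build_vocab(["the movie was great", "the plot was thin"])
ids = encode("the plot was great", vocab)
print(ids, "->", decode(ids, vocab))
```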
4.5 Underfitting and overfitting
Problem | Explanation | Symptoms |
---|---|---|
Underfitting | The model is not complex enough to capture the patterns in the data. | Training error: high. Test error: high. |
Overfitting | The model is too complex and starts fitting to noise instead of data. | Training error: low. Test error: high. |
4.6 Training and validation sets
To avoid leaking data from the test set we use a validation set, which is a well-chosen subset of the training set we hold aside to test our models on.
We can use the validation set to compare the performance of different models and hyperparameters, but doing so means validation performance is no longer unbiased. Therefore we should only use the validation set to pick the final model, and run it on the test set to get an unbiased estimate of performance.
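A bare-bones version of the three-way split is sketched below, assuming a plain random shuffle is appropriate. That is not always the case: time series, for example, usually need a split by date, which is part of what “well-chosen” means. The function name and the 60/20/20 proportions are my own.

```python
import random


def three_way_split(items, valid_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then carve out test, validation and training subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_valid = int(len(items) * valid_frac)
    test = items[:n_test]
    valid = items[n_test:n_test + n_valid]
    train = items[n_test + n_valid:]
    return train, valid, test


train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

Compare models on `valid` as often as you like, and touch `test` only once at the end.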
4.7 Metrics vs. objective functions
Metrics are an overall measure of model performance. This is not the same as the loss function that the model actually optimizes for.
This is because we want the loss function to be smooth and have non-zero gradients, while that may not be the case with the metric.
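A tiny sketch makes the difference visible, using accuracy as the metric and log loss as the loss on a one-weight classifier. The data and the two weight values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Tiny one-weight classifier: p(y = 1 | x) = sigmoid(w * x).
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]


def accuracy(w):
    """Metric: fraction of examples on the right side of the 0.5 threshold."""
    return sum((sigmoid(w * x) > 0.5) == bool(y) for x, y in data) / len(data)


def log_loss(w):
    """Loss: average negative log-probability assigned to the true label."""
    return -sum(y * math.log(sigmoid(w * x)) + (1 - y) * math.log(1 - sigmoid(w * x))
                for x, y in data) / len(data)


for w in (0.5, 0.6):
    print(f"w={w}: accuracy={accuracy(w)}, loss={log_loss(w):.4f}")
```

Nudging the weight leaves accuracy flat, so it gives the optimizer nothing to follow, while the loss still moves and tells us which direction is better.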
Footnotes
[1] cf. the Vienna Circle, Hilbert’s Entscheidungsproblem, and American Pragmatism.