What is Machine Learning?

Introduction

Artificial Intelligence: Computers doing something.
Machine Learning: Computers learning to do something.
A (Machine Learning) Model: A computer program that can learn.

A machine learning model learns by iterating on a task an enormous number of times. This process of iteration is called training. To understand how a model works, learn what one iteration of its training looks like.

The classic example of a machine learning model is MNIST digit recognition. MNIST is a famous dataset of handwritten digits: 70,000 greyscale images, each 28x28 pixels.

Here is what one training iteration of a number classification model looks like:
The model
1. Is given an image.
2. Guesses the number.
3. Learns the correct answer.
4. Adjusts its parameters to be more likely to give the correct answer and less likely to give all other answers.
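To make this concrete, here is a minimal sketch of one such iteration in Python, using the simplest possible model: a single layer of weights and biases. The layer shape, learning rate, and fake image are illustrative assumptions, not how any production system is built.

```python
import numpy as np

rng = np.random.default_rng(0)

# The model's "parameters" are just numbers: a 784x10 weight matrix
# (one weight per pixel per digit) plus 10 biases.
weights = rng.normal(scale=0.01, size=(784, 10))
biases = np.zeros(10)

def softmax(scores):
    """Turn 10 raw scores into 10 confidences that sum to 1."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def train_step(image, label, lr=0.1):
    """One training iteration, matching steps 1-4 above."""
    global weights, biases
    scores = image @ weights + biases        # 1. is given an image
    confidences = softmax(scores)            # 2. guesses the number
    target = np.zeros(10)
    target[label] = 1.0                      # 3. learns the correct answer
    error = confidences - target             # 4. adjusts parameters toward the
    weights -= lr * np.outer(image, error)   #    correct answer and away from
    biases -= lr * error                     #    all other answers

# One iteration on a fake "image"; real training loops over MNIST itself.
train_step(rng.random(784), label=7)
```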


Eventually, the model becomes excellent at guessing numbers based on 28x28 greyscale images. It takes thousands of iterations, but the model can still learn faster than any human could write comparable number-detection code.

An important and undefined word slipped into the description above.
Parameter: A number (or a collection of numbers).
Parameters are often given fancy names, such as weights, biases, or matrices, but they are just numbers. With that, here are some new definitions:
A model (again): A collection of parameters.
Learning: A model adjusting its parameters (increasing/decreasing its numbers).
An Untrained Model: A collection of random parameters.
A Trained Model: A collection of finely-tuned parameters.


All machine learning models are just numbers, but they are a lot of numbers.

Inputs/Outputs

If machine learning models are just numbers, how do they take inputs and give outputs?

For the number recognition example above, it is pretty simple. To create the input for a number classification model, a 28x28 image is converted into a list of 784 (28x28) numbers, where each number represents the brightness of one pixel.

If the 3rd pixel of the 12th row is black, then the 311th number in the input list, representing that pixel, will be 0. If the next pixel is white, then the 312th number will be 255 (or whatever number is set as the maximum).
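The position arithmetic is easy to check in a few lines. The `flat_position` helper below is hypothetical, just to show the formula.

```python
# Position of pixel (row, col) in the flattened list, counting from 1:
# skip (row - 1) full rows of 28 pixels, then count col pixels into the row.
def flat_position(row, col, width=28):
    return (row - 1) * width + col

print(flat_position(12, 3))  # -> 311, the black pixel from the example
print(flat_position(12, 4))  # -> 312, its white neighbor
```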

The output of a number recognition model is actually not an answer. Instead, it is a list of 10 numbers, each representing how confident the model is that the image is a 0, 1, 2, ..., or 9. To know what number is in the image, we pick the digit with the highest confidence.

Here is an example output:

Output:      [0.02, 0.63, 0.01, 0.01, 0.03, 0.02, 0.01, 0.97, 0.02, 0.01]
The Digits:     0     1     2     3     4     5     6     7     8     9
The model is 97% confident that the image is a 7. It is also 63% confident that the image is a 1, which is decently high. However, we only take the highest confidence, so we conclude that the image shows a 7.
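Picking the answer from the output takes one line of code. Here is a small sketch using the example output above.

```python
output = [0.02, 0.63, 0.01, 0.01, 0.03, 0.02, 0.01, 0.97, 0.02, 0.01]

# The guessed digit is the index of the highest confidence.
guess = max(range(10), key=lambda digit: output[digit])
print(guess)  # -> 7
```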

ChatGPT

So how does ChatGPT work?

ChatGPT is a large language model (LLM). It has a lot of parameters, and it uses very complex calculations to create its outputs. However, the fundamentals are known, and they are similar to the example above.

Step 1: Input
Models work with numbers. To create an input for a language model, we must convert the text into numbers. This is called tokenization.
Token: A number representing a piece of text.
For example, the word "strength" could be made of the following tokens:
"str": 32933
"ength": 65895

Thus, the tokenization of "strength" would be [32933, 65895].
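Here is a toy sketch of greedy tokenization. A real tokenizer (such as OpenAI's, with 50,000+ entries) uses smarter rules; the two-entry vocabulary below just reuses the made-up ids from the example.

```python
# A toy vocabulary; real ones map tens of thousands of text pieces to ids.
vocab = {"str": 32933, "ength": 65895}

def tokenize(text, vocab):
    """Greedily match the longest known piece of text at each position."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(vocab[text[:end]])
                text = text[end:]
                break
        else:
            raise ValueError(f"no token matches {text!r}")
    return tokens

print(tokenize("strength", vocab))  # -> [32933, 65895]
```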

The model then applies a huge amount of math to the tokens and outputs a list of numbers representing, for each of the 50,000+ possible tokens, how confident it is that that token should come next. The model then (usually) outputs the token it is most confident in. It then tries to guess the next token, and the next, until it has a full response.

So, the model is not trying to answer our prompt. It is repeatedly trying to guess what the answer to our prompt is. In doing so, it responds.
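That guess-and-feed-back loop might look something like the sketch below. The `model` function here is hypothetical, standing in for all of the model's internal math.

```python
def generate(model, prompt_tokens, stop_token, max_tokens=100):
    """Repeatedly guess the next token until the response is complete."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        confidences = model(tokens)                       # one number per possible token
        next_token = confidences.index(max(confidences))  # pick the most confident
        if next_token == stop_token:                      # the model signals it is done
            break
        tokens.append(next_token)                         # feed the guess back in
    return tokens

def dummy_model(tokens):
    """A stand-in model: most confident in token 5 until four tokens exist."""
    peak = 5 if len(tokens) < 4 else 9
    return [1.0 if i == peak else 0.0 for i in range(10)]

print(generate(dummy_model, [1, 2], stop_token=9))  # -> [1, 2, 5, 5]
```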

The Data Problem

We do not know exactly how ChatGPT is trained, but we can guess. Here is what one (very simplified) version of training could look like:

The model
1. Is given a prompt.
2. Guesses the next token.
3. Is told the correct token.
4. Adjusts its parameters to be more likely to output the correct token and less likely to output wrong tokens.
5. Goes back to Step 1.
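For illustration only, here is the simplest runnable version of that loop, using a table of counts as a stand-in for the model's billions of parameters. Real training adjusts weights with calculus rather than counting, but the loop has the same shape.

```python
from collections import Counter, defaultdict

# "Parameters": counts of which token has followed which.
follow_counts = defaultdict(Counter)

def train_step(prompt, correct_next):
    """Given a prompt and the correct next token, adjust the counts (steps 1, 3, 4)."""
    follow_counts[prompt[-1]][correct_next] += 1

def guess_next(prompt):
    """Step 2: output the token the model is most confident in."""
    counts = follow_counts[prompt[-1]]
    return counts.most_common(1)[0][0] if counts else None

# Step 5: loop over a tiny "dataset" of token sequences.
data = [[1, 2, 3], [1, 2, 3], [1, 2, 4]]
for sequence in data:
    for i in range(1, len(sequence)):
        train_step(sequence[:i], sequence[i])

print(guess_next([1, 2]))  # -> 3 (seen twice after a 2, vs. 4 seen once)
```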


In order to train, a model needs a lot of data. The more data, the more opportunities to learn/fine-tune parameters.

Some claim that training models on copyrighted works is the same as stealing them. Others claim that the model is not reproducing the works, but learning from them and applying its own knowledge, the same as a human.

Legally, the issue is far from settled, both for LLMs and for many other types of models.

How to Make Decisions Regarding Machine Learning

The general rule of thumb for deciding whether to implement machine learning is to answer the following questions:
1. Is the task well-defined? Can we quantitatively determine if it is done correctly or incorrectly?
2. Do we have enough data to train the model?
3. Is training/running the model worth the cost?
4. To what degree can we allow incorrect outputs?
5. To what degree do we need explanations for how the task was completed?


Machine learning models, for the most part, are black boxes of numbers which output guesses. The logic hidden within those numbers is not easily interpreted.

Models cannot provide rationale, be held accountable, or learn anything which isn't provided in their data.

Here is an example of answering the above questions for investing in a model which can predict if an x-ray image contains a tumor:
1. The task is well-defined. Given an x-ray, the model outputs confidences that any part of the image is a tumor.
2. We have many annotated images of x-rays, showing exactly where tumors are. We trust that these images are properly labelled.
3. The cost of human labor in identifying tumors is high. The model, if accurate, would be worth a high cash investment.
4. We cannot allow incorrect outputs. It is a matter of life and death.
5. We need the decisions to be explainable.


Based on the above answers, investing in the model may be a good idea. However, incorrect results are severe and the decisions need to be explainable, so human verification of the results is necessary.