The 100 Page Machine Learning Book.
Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959 while at IBM.

Chapter-1, Introduction

Definition

Machine Learning refers to the science and engineering of building machines capable of doing various useful things without being explicitly programmed to do so.

Types of Machine Learning

Supervised Learning

Unsupervised Learning

How Supervised Learning Works

The supervised machine learning process starts with collecting data. The data for supervised learning is a collection of (input, output) pairs.

Supervised Learning problem and process

Some algorithms require transforming our labels into numbers. We will use the SVM (Support Vector Machine) algorithm to solve the above problem. This algorithm requires that the positive label (in our case, "Spam") have the numeric value of +1, and the negative label ("Not_spam") the value of -1.
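The label transformation is a simple mapping. Here is a minimal sketch; the label list is a made-up example for the spam problem in the text:

```python
# Hypothetical string labels for the spam example
labels = ["Spam", "Not_spam", "Spam"]

# SVM requires numeric targets: map the positive class to +1, the negative to -1
y = [+1 if label == "Spam" else -1 for label in labels]
print(y)  # [1, -1, 1]
```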

SVM sees every feature vector as a point in a higher dimensional space (in our case, the space is 20,000-dimensional).

(Maybe think of 'w' as the weights of the dimensions in our feature vectors, and 'b' as the bias).
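The prediction of a linear SVM is the sign of w·x - b. Below is a tiny sketch in 3 dimensions (the text's spam model lives in a 20,000-dimensional space); the values of w, b, and x are made up for illustration:

```python
import numpy as np

# Illustrative (made-up) model parameters
w = np.array([0.5, -1.0, 2.0])  # one weight per dimension
b = 0.25                        # bias

# A made-up feature vector to classify
x = np.array([1.0, 0.0, 1.0])

# The linear SVM predicts the sign of w·x - b: +1 ("Spam") or -1 ("Not_spam")
prediction = np.sign(w @ x - b)
print(prediction)  # 1.0
```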

Now, how does the machine find w* and b*? It solves an optimization problem. Machines are good at optimizing functions under constraints.

Here, based on the training pairs (xi, yi), these are the two constraints:

wxi - b >= +1 (if yi = +1)
wxi - b <= -1 (if yi = -1)

Optimization problem and Example

In the shown example, the distance between the two hyperplanes is given by 2/||w||, so the smaller the norm ||w||, the larger the distance between these two hyperplanes.

So, by minimizing ||w|| subject to these constraints, the model finds the optimal values of w and b.
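The two constraints combine into the compact form yi(wxi - b) >= 1, which we can check numerically. Below is a minimal sketch with a hand-picked (not learned) w and b on a made-up 2-D data set; it verifies the constraints and computes the margin 2/||w||:

```python
import numpy as np

# Made-up 2-D data set with labels +1 / -1
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

# Hand-picked hyperplane parameters, not the output of an optimizer
w = np.array([1.0, 0.0])
b = 1.0

# Both constraints collapse into y_i * (w·x_i - b) >= 1 for every example
satisfied = bool(np.all(y * (X @ w - b) >= 1))

# Distance between the two margin hyperplanes
margin = 2 / np.linalg.norm(w)
print(satisfied, margin)  # True 2.0
```

Shrinking ||w|| (while keeping the constraints satisfied) would widen this margin, which is exactly what the SVM optimizer does.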

That’s how Support Vector Machines work. This particular version of the algorithm builds the so-called linear model. It’s called linear because the decision boundary is a straight line (or a plane, or a hyperplane). SVM can also incorporate kernels that can make the decision boundary arbitrarily non-linear. Another version of SVM can also incorporate a penalty hyperparameter for misclassification of training examples of specific classes.
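As a rough sketch of the kernel and per-class penalty variants (assuming scikit-learn is installed; the data and hyperparameter values here are made up):

```python
from sklearn.svm import SVC  # assumption: scikit-learn is available

# XOR-like data that no straight line can separate
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, 1, 1, -1]

# An RBF kernel bends the decision boundary; class_weight raises the
# misclassification penalty for the chosen class (here the -1 class)
clf = SVC(kernel="rbf", C=100.0, class_weight={-1: 2.0}).fit(X, y)
print(list(clf.predict(X)))  # [-1, 1, 1, -1]
```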

Note: Decision boundary

Why the Model works on New Data

If the examples used for training were selected randomly, independently of one another, and following the same procedure, then, statistically, it is more likely that a new negative example will be located on the plot somewhere not too far from other negative examples. The same holds for a new positive example: it will likely come from the surroundings of other positive examples. In that case, our decision boundary will still, with high probability, separate new positive and negative examples well from one another.

To minimize the probability of making errors on new examples, the SVM algorithm, by looking for the largest margin, explicitly tries to draw the decision boundary in such a way that it lies as far as possible from examples of both classes.

Chapter-2, Notations and Definitions