# Machine Learning
A highly opinionated ML cheat sheet.
# Getting Started
- Deep Learning Demystified, by Brandon Rohrer:
- A Neural Network in 15 lines of Python, by Andrew Trask:
- Understanding the maths behind Deep Learning, with code implementations:
- Gentlest Introduction to Tensorflow, by Soon Hin Khor:
- Practical Examples:
# Regression
# Neural Network
Neural Networks can approximate any function by applying an "activation" function to a bias plus the weighted sum of inputs: y = σ(b + Σwx).
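A minimal sketch of that single-neuron formula, using NumPy with made-up inputs, weights and bias:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# one neuron: activation applied to bias + weighted sum of inputs
x = np.array([0.5, 0.1, 0.9])    # inputs (made up)
w = np.array([0.4, -0.2, 0.7])   # weights (made up)
b = 0.1                          # bias (made up)

y = sigmoid(b + np.dot(w, x))    # y = sigma(b + sum(w * x))
```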
Since for any probability problem there exists a function that computes it, Neural Networks can be used to automatically find that function using meaningful "training" data and a "learning" algorithm such as Stochastic Gradient Descent (sketched after the steps below):
- initialize weights and biases
- calculate a prediction for the given training input
- calculate the prediction error, using the expected output from the training data
- calculate the Gradient of the error
- adjust weights and biases with the gradient, then repeat from step 2 for a given number of epochs
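Putting those steps together, a minimal sketch of the loop for a single sigmoid neuron (NumPy only; the toy data and learning rate are made up for illustration):

```python
import numpy as np

np.random.seed(42)  # deterministic runs make debugging easier

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# toy training data: inputs and expected outputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
targets = np.array([0.0, 1.0, 1.0, 1.0])

# 1. initialize weights and bias
w = np.random.randn(2) / np.sqrt(2)
b = np.random.randn()

learning_rate = 0.5
for epoch in range(1000):                                  # repeat for a number of epochs
    for x, t in zip(X, targets):
        prediction = sigmoid(b + np.dot(w, x))             # 2. prediction for the training input
        error = prediction - t                             # 3. prediction error
        gradient = error * prediction * (1 - prediction)   # 4. gradient of the error
        w -= learning_rate * gradient * x                  # 5. adjust weights and bias
        b -= learning_rate * gradient
```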
# 1. Weights/Biases Initialization
Depending on the chosen activation function, initialize weights and biases as follows:
- for sigmoid-like activations (standard sigmoid, tanh): Gaussian distribution with mean 0 and a standard deviation of:
  - 1 for each bias (values are centered on 0, mostly between -1 and 1)
  - 1 / √n for each weight, with n being the number of input connections to the neuron (using a standard deviation of 1 with many inputs results in saturated layers, and saturated layers are slow learners)
- for the rest (softmax, ReLU, etc): all weights and biases set to 0
Tip: For debugging, seed random numbers to make calculations deterministic.
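For example, a sigmoid layer could be initialized like this (a NumPy sketch; the layer sizes are arbitrary and n is taken as the number of input connections per neuron):

```python
import numpy as np

np.random.seed(1)  # seeded so runs are deterministic while debugging

n_inputs, n_neurons = 784, 30

# sigmoid/tanh layer: Gaussian with mean 0,
# standard deviation 1 for biases and 1/sqrt(n_inputs) for weights
biases = np.random.randn(n_neurons)
weights = np.random.randn(n_neurons, n_inputs) / np.sqrt(n_inputs)
```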
# 2. Activation Function
For regression problems, a standard sigmoid function (also known as the logistic function) can be used for activation: sigmoid(y) = 1 / (1 + exp(-y))
It ranges from 0 (saturating below about -6) to 1 (saturating above about 6).
Sometimes the Hyperbolic Tangent (tanh) function performs better: tanh(y) = (exp(y) - exp(-y)) / (exp(y) + exp(-y))
It ranges from -1 (saturating below about -3) to 1 (saturating above about 3).
The Rectified Linear Unit (ReLU) often improves results on image recognition problems: relu(y) = max(0, y)
It returns 0 when the sum of weighted inputs is negative, otherwise it returns that sum unchanged.
Finally, the softmax function is used in classification problems, where there are many outputs: softmax(y_j) = exp(y_j) / Σ_k exp(y_k)
The output of neuron j is the exponential of its sum of weighted inputs, divided by the sum of that same quantity over all the output neurons. This means the outputs of all the neurons always sum to 1.
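The four activation functions above, written out with NumPy (the max subtraction in softmax is only there for numerical stability):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def tanh(y):
    return np.tanh(y)  # same as (exp(y) - exp(-y)) / (exp(y) + exp(-y))

def relu(y):
    return np.maximum(0, y)

def softmax(y):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(y - np.max(y))
    return e / np.sum(e)
```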
# 3. Error Function
For regression problems, the cross-entropy cost function is usually used.
Note: The quadratic cost (aka Mean Squared Error) function is often used in examples because it is easier to understand than cross-entropy, but it does not learn as efficiently.
For classification problems, the log-likelihood cost function is used.
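A sketch of these cost functions for a single training example (NumPy; `predicted` and `expected` are assumed to be arrays of output activations and target values):

```python
import numpy as np

def cross_entropy_cost(predicted, expected):
    # -sum over outputs of [y*ln(a) + (1-y)*ln(1-a)]
    return -np.sum(expected * np.log(predicted)
                   + (1 - expected) * np.log(1 - predicted))

def quadratic_cost(predicted, expected):
    # aka Mean Squared Error
    return 0.5 * np.sum((predicted - expected) ** 2)

def log_likelihood_cost(predicted, expected_class):
    # negative log of the (softmax) output for the correct class
    return -np.log(predicted[expected_class])
```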
# 4. Gradient
# 5. Epochs
# Data
In order to train the network, we need examples that map inputs to expected outputs. We can split them into 3 sets:
- Training set: 70% of the examples; its mean error is used with the Gradient to modify the parameters (weights and biases)
- Validation set: 15% of the examples; its mean error is used to stop the training once it saturates; its purpose is to evaluate hyperparameters and prevent overfitting on the training data
- Testing set: 15% of the examples; its mean error is used to evaluate the final parameters (weights and biases) on data that was never used for training or hyperparameter tuning
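A sketch of such a 70/15/15 split (NumPy; `examples` and `labels` are placeholder arrays):

```python
import numpy as np

def split_dataset(examples, labels, seed=0):
    """Shuffle the examples and split them 70/15/15 into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(examples))
    examples, labels = examples[order], labels[order]

    n_train = int(0.70 * len(examples))
    n_valid = int(0.15 * len(examples))

    training = (examples[:n_train], labels[:n_train])
    validation = (examples[n_train:n_train + n_valid], labels[n_train:n_train + n_valid])
    testing = (examples[n_train + n_valid:], labels[n_train + n_valid:])
    return training, validation, testing
```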
# Overfitting
Overfitting is when the model optimizes for the training data but fails to generalize to new, unseen data. To prevent this from happening, we can:
- increase the amount of training data
- use L2 regularization (sketched below)
- use L1 regularization
- use dropout
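For example, L2 (and L1) regularization just add a term penalizing large weights to the cost function; a sketch where `lambda_` is the regularization strength and `n_examples` the size of the training set:

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lambda_, n_examples):
    # add (lambda / 2n) * sum of squared weights to the original cost
    return base_cost + (lambda_ / (2 * n_examples)) * np.sum(weights ** 2)

def l1_regularized_cost(base_cost, weights, lambda_, n_examples):
    # add (lambda / n) * sum of absolute weights to the original cost
    return base_cost + (lambda_ / n_examples) * np.sum(np.abs(weights))
```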
# Classification
# Bag of Words Model
Assumes that the occurrence count of each word can be used as a feature for training a classifier.
# How it works
- fitting: "learn" the vocabulary by extracting each word from all the sentences
- transforming: count the occurrences of each vocabulary word in each sentence
- optionally keep only the x most frequent words
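A minimal sketch of fitting and transforming, in plain Python with made-up sentences:

```python
from collections import Counter

sentences = ["the cat sat on the mat", "the dog sat on the log"]

# fitting: "learn" the vocabulary from all the sentences
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

# transforming: count the vocabulary words in each sentence
def to_bag_of_words(sentence):
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

features = [to_bag_of_words(sentence) for sentence in sentences]
```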
# Reference
Bag of Words for beginners, by Kaggle
# Naive Bayes Classifier
Assumes that the probability of a statement being of a certain type depends on how likely that type is overall and on how many words typical of that type the statement contains.
# How it works
- "learning" types:
- count the total number of documents
- count how many times a word from a statement occurs for a given type
- "guessing" types:
- calculate type probability: count of documents of this type / total count of documents
- calculate statement type probability: for each words in the statements
- calculate word type probability: count occurence of this word for this type / total number of words
- multiply all word type probabilities
- calculate likelihood: multiply type probability by statement type probability
- the statement is likely being of the type that has the highest likelihood
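A minimal sketch of this classifier in plain Python (the tiny training set is made up, and a +1 smoothing term is added so unseen words do not zero out the product):

```python
from collections import Counter, defaultdict

# training data: (statement, type)
documents = [
    ("win money now", "spam"),
    ("limited offer win prize", "spam"),
    ("meeting schedule tomorrow", "ham"),
    ("project meeting notes", "ham"),
]

# "learning": count documents and word occurrences per type
total_documents = len(documents)
documents_per_type = Counter(doc_type for _, doc_type in documents)
words_per_type = defaultdict(Counter)
for statement, doc_type in documents:
    words_per_type[doc_type].update(statement.split())

# "guessing": likelihood = type probability * product of word type probabilities
def guess(statement):
    likelihoods = {}
    for doc_type in documents_per_type:
        type_probability = documents_per_type[doc_type] / total_documents
        total_words = sum(words_per_type[doc_type].values())
        statement_probability = 1.0
        for word in statement.split():
            statement_probability *= (words_per_type[doc_type][word] + 1) / (total_words + 1)
        likelihoods[doc_type] = type_probability * statement_probability
    return max(likelihoods, key=likelihoods.get)

print(guess("win a prize"))  # most likely "spam"
```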
# Reference
Machine Learning: Naive Bayes, by Yannick de Lange
# Random Forest Classifier
# How it works
- "training":
- take a subset (~66%) of the data
- for each node:
- take randomly m items from the subset
- pick the item that provides the best split and use it to do a binary split on that node
- for the next node, do the same with another m random items from the subset
- "running":
- run input down all the trees
- take the average, or weighted average, or voting majority of all the results
m can be either (your choice):
- 1 (random splitter selection)
- the total number of features (Breiman's bagger)
- something in between, e.g. ½√n, √n, or 2√n, with n being the total number of features (random forest)
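Random forests are available off the shelf; a sketch using scikit-learn's RandomForestClassifier on a made-up toy dataset (max_features plays the role of m above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data: 2 features, binary labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# max_features="sqrt" selects sqrt(number of features) at each split,
# n_estimators is the number of trees, each grown on a bootstrapped subset
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict([[0, 1], [3, 3]]))  # expected: class 0 and class 1
```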