李宏毅机器学习课程基本概念
Different types of Functions
-
Regression: 输出为数值的函数
-
Classification: 输出为类别的函数 可以看成离散版的 Regression ?
-
Structured Learning: create sth with structure (image, document, etc.)
Steps to solve the Function
The process by applying the steps below is called Training.
1. Function with Unknown Parameters
Guess the form of the Function based on domain knowledge.
$$ Function\space with\space Unknown\space Parameters \to Model $$
weight: the parameter that times with $x$
bias: the const value
2. Define Loss from Training Data
Loss is a function that evaluates the parameters.
$$ L(b, w, \ldots) $$
Let
$$ e_i = \lvert y_i - \hat{y_i} \rvert \tag{MAE} $$
And we have
$$ L = \frac 1n \sum_{n}{e_n} $$
And in fact, we can alse define
$$ e_i = (y_i - \hat{y_i})^2 \tag{MSE} $$
If $y$ and $\hat{y}$ are both probability distributions $\rightarrow$ Cross-entropy
Afterwards, you can use heatmap to visualize the output of function $L$.
3. Optimization
To find the minimum of Function $L$ and its input $w^, b^, \ldots$ .
The method is called Gradian Descent,
So we can start with one variable situation.
While solving the one variable situation, let’s call it $\omega$ , we can first randomly pick an initial value $\omega^0$, and we can compute $\left. \frac{\partial L}{\partial \omega} \right| _{\omega = \omega^0}$, then, according to the partial differentiation we calculate, we could get the next $\omega^1$ by substract $\eta \left. \frac{\partial L}{\partial \omega} \right| _{\omega = \omega^0}$ from $\omega^0$, in which $\eta$ is the learning rate that we defines, called hyperparameters. And by update $\omega$ iteratively, we may get the minimum of $Loss$ and its minimum point.
And this method can be restricted to local minima, so infect we may can use simulated annealing as an alternative?
However, local minima isn’t always truly cause the problem when training.
When it comes to mulvariable situation, we can also apply the steps above, just simply apply it to multible variable.
Upgrade the Function
$$ y = b + \sum_{i = 1}^{len}{\omega_ix_i} \tag{Linear model} $$
Take the data before into consideration comprehensively.
And the model above is usually called Linear model.
When $len$ increases, the predict of Function usually converges on a certain value.
Function improvement
Piecewise Linear Curve
It’s clear that the linear models are quiet simple, so wo can try to compound several linear models together to approach the curve.
$$ All Piecewise Linear Curve = constant + sum of a set of piecewise linear function $$
Beyond Piecewise Linear
And when we are confronted with sth that beyond piecewise linear, we can approximate continuous curve by a piecewise linear curve, just sample a few points and cut the curve into sufficient pieces.
Get some piecewise linear function (Sigmoid Function)
In fact, these piecewise linear functions are usually composed by three pieces: zero, linear and then constant. Then we may can approach the piecewise linear functions by some basic functions, like:
$$ y = c \frac{1}{1 + e^{-(b + \omega x_1)}} $$
When $x_1 \rightarrow \infty$ , $1 + e^{-(b + \omega x_1)} \rightarrow 1$, and $y \rightarrow c$ . Apparently, when $x_1 \rightarrow -\infty$, $1 + e^{-(b + \omega x_1)} \rightarrow \infty$, and $y \rightarrow 0$ , which can satisfy our needs.
And the function above is called sigmoid funcion, so
$$ y = c \space sigmoid(b + \omega x_1) $$
The original piecewise linear function is usually called hard sigmoid.
Here’s the effect of argument in sigmoid funcion.
-
$\omega$: change slopes
-
$b$: shift
-
$c$: change height
After the bedding above, we can approach a specific funcion like:
$$ y = b + \sum_{i}{c_i\space sigmoid(b_i + \omega_i x_i)} $$
And when we try to make the most of the data we obtained currently, we can change the function into:
$$ y = b + \sum_{i}{c_i \space sigmoid(b_i + \sum_{j}{w_{ij} x_j})} $$
Let
$$ r_i = b_i + \sum_{j}{w_{ij} x_j} $$
then we can get matrix below:
$$ \begin{bmatrix} r \end{bmatrix} = \begin{bmatrix} \bf{b} \end{bmatrix} + \begin{bmatrix} \bf{\omega}{i1} \space \bf{\omega}{i2} \space \ldots \space \bf{\omega}_{in} \end{bmatrix} \begin{bmatrix} \bf{x} \end{bmatrix} $$
where $\bf{r}$, $\bf{b}$, $\bf{\omega}_{ij}$, $\bf{x}$ are all vector.
Also, we can define $\bf{a} = \sigma(\bf{r})$, and the function will be:
$$ y = b + \bf{c}^T \bf{a} $$
in a concise way.
or:
$$ y = b + \bf{c}^T \sigma(\bf{b} + \bf{W}\bf{x}) $$
in a linear algebra way.
The loss of sigmoid funcion
And it’s neccesary to introduce the parameters in the function above again:
-
$\bf{x}$: feature
-
$\bf{W}, b, \bf{c}^T, \bf{b}$: Unknown parameters
And by combining $\bf{W}, \bf{c}, \bf{b}, b$ into a row, we can get a vector $\bf{\theta}$.
So, similarly, we can have loss function $L(\bf{\theta})$, like:
$$ L = \frac 1N \sum_{n}{e_n} $$
And… the Optimization of Sigmoid Funtion
Almost the same as before.
$$ \begin{bmatrix} \bf{\theta}^i \end{bmatrix} \leftarrow \begin{bmatrix} \bf{\theta}^{i - 1} \end{bmatrix} - \begin{bmatrix} \bf{\eta \left. \frac{\partial L}{\partial \theta} \right|_{\theta = \theta^{i - 1}}} \end{bmatrix} $$
We can define $\bf{g}$ (gradient):
$$ \bf{g} = \bf{\left. \frac{\partial L}{\partial \theta} \right|_{\theta = \theta^{i - 1}}} $$
So, we have
$$ \bf{g} = \nabla L(\bf{\theta^0}) $$
When we solve the problem in reality, we ramdomly divide the total data $N$ into several batchs, each called a batch. And for each batch, we can calculate the value of Loss, $L_i$, and individually use $L_i$ to update $\bf{\theta}$ , and when we see all the batched once, we say that we go through a epoch.
Another choice: ReLU
If we don’t adapt the sigmoid function method to approach the piecewise linear function, we can also try Rectified Linear Unit(ReLU):
$$ y = c \space max(0, b + \omega x_1) + c’ \space max(0, b’ + \omega x_1) $$
So we can approach the curve like this:
$$ y = b + \sum_{2i}{c_i \space max(0, b_i + \sum_{j}{w_{ij} x_j})} $$
compared to sigmoid function:
$$ y = b + \sum_{i}{c_i \space sigmoid(b_i + \sum_{j}{w_{ij} x_j})} $$
As a result, we can identify both sigmoid and ReLU into Activation function.
Further Improvement: Deep Learning
So temporarily, we still assume that we take sigmoid as activation function, take $\bf{a} = \sigma(\bf{r})$ , the sigmoid we have mentioned:
$$ y = b + \bf{c}^T \bf{a} $$
Now we can take a step further: can we put $a$ as in input into sigmoid function, like:
$$ \bf{a}’ = \sigma (\bf{b}’ + \bf{W}’\bf{a}) $$
This is called adding a layer in funcion. By doing this, we can greatly improve the accuracy of our model.
And we can name one sigmoid function in the model by Neuron, and the whole model can be named as Neural Network.
Among the model, we’ve got many layers, which can be named as hidden layer, so the whole technique can also called Deep Learning.
$$ Deep = Many\space hidden\space layers $$
A few examples
-
AlexNet(2012): 8 layers
-
VGG(2014): 19 layers
-
GoogleNet(2014): 22 layers
-
Residual Net(2015): 152 layers
Thinking:
-
Why deep, instead of broad?
-
Why not deeper? Overfitting: Better on training data, worse on unseen data.