Deep learning is an advanced form of machine learning: it doesn’t require a human to extract features before training. Take computer vision as an example. Before deep learning was applied to computer vision, we were still using machine learning techniques like SVMs or random forests. These techniques require us to filter features out of the image first, then use those features as the parameters for classification. This leads to heavy computation, as it usually uses a feature for every pixel. Even after moving to deep learning, using each pixel as an input node leads to a huge number of hidden nodes (nodes between input and output), making computation on a standard 480*480 image take a very long time.
Other than that, these algorithms also overfit. In classification or segmentation, when we shift the object within the image, the algorithms can no longer achieve good precision.
To deal with this, we can use convolution. In computer vision, convolution means extracting the information of an image into a smaller object called a filter. The extractor is called a kernel: an n*n matrix with n smaller than the image size.
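The sliding-window idea can be sketched in a few lines of NumPy. This is a minimal illustration (stride 1, no padding), not an optimized implementation; the example image and kernel values are made up:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
kernel = np.array([[1., -1.],
                   [1., -1.]])
print(convolve2d(image, kernel).shape)  # (3, 3)
```

Notice the output is smaller than the input: a 4*4 image filtered by a 2*2 kernel yields a 3*3 filter.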
AlexNet, one of the most prominent computer vision algorithms, uses convolution at its core.
Why Should Deep Learning Use Convolution?
- Reducing number of input nodes
The convolution will filter the initial image until it is small enough to be flattened and used in a regular neural network.
- Tolerating small shifts in where the pixels are
Because the kernel slides across the whole image, the algorithm learns about local pixel patterns rather than their absolute positions.
- Taking advantage of correlations in complex images
One correlation we can make use of is color clustering: in an image, nearby pixels have a high probability of sharing the same color.
Convolution itself is a simple process:
- First, it filters an image with a kernel. The kernel is usually defined as n*n, with a depth matching the number of channels of the image or filter. This step creates an entity called a filter, which will be the input for the next convolution or for a multilayer perceptron.
- Second, it de-linearizes the data with an activation function. Each new pixel in the filter yielded by the first step becomes an input to the activation function, and the value returned by the activation function replaces the raw kernel-filtering value with one that is less linear.
A convolution basically creates a new 3D matrix whose size depends on the number of kernels, the width of the kernel, and the stride and padding used in the process.
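The spatial size of that output follows a standard formula: (W - K + 2P) / S + 1, where W is the input width, K the kernel width, P the padding, and S the stride. A quick sketch:

```python
def conv_output_size(width, kernel, stride=1, padding=0):
    # (W - K + 2P) // S + 1, the standard convolution output-size formula
    return (width - kernel + 2 * padding) // stride + 1

# A 480-wide image with a 3x3 kernel:
print(conv_output_size(480, 3, stride=1, padding=1))  # 480 ("same" padding)
print(conv_output_size(480, 3, stride=2, padding=1))  # 240 (halved by stride)
```

The depth of the output matrix is simply the number of kernels used, since each kernel produces one filter.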
The kernels here consist only of 1 and -1. But kernel values can be any number, and they usually change during the learning process.
Visualization of the filtering process performed by the kernel
When convolved with the image, these kernels extract specific features. It feels like magic that a simple matrix filter can extract features like object outlines, object contrast, and many other things.
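To see one of those "magic" extractions concretely, here is a sketch with a made-up image whose left half is dark and right half is bright. A simple kernel of 1s and -1s responds strongly exactly at the vertical edge (the image and kernel values are my own illustration):

```python
import numpy as np

# Toy image: left half dark (0), right half bright (1) -> one vertical edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Vertical-edge kernel: responds where brightness changes left-to-right.
kernel = np.array([[-1., 1.],
                   [-1., 1.]])

oh, ow = image.shape[0] - 1, image.shape[1] - 1
response = np.zeros((oh, ow))
for y in range(oh):
    for x in range(ow):
        response[y, x] = np.sum(image[y:y+2, x:x+2] * kernel)

print(response[0])  # [0. 0. 2. 0. 0.] -- the peak marks the edge column
```

Everywhere the colors are uniform, the 1s and -1s cancel out; only at the boundary does the kernel produce a strong response.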
Different kernels can extract different features
The kernels I created above each extract different features as well. Take, for example, this single-channel image of a car:
What happened after convolution?
After an image or a filter is convolved, it will usually be pooled. Pooling is *again* a process of extracting values with an n*n window. In pooling, usually the maximum value inside the window is the one that gets passed on to the next step.
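Max pooling can be sketched like this; the feature-map values are arbitrary numbers for illustration:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling with stride equal to the window size:
    keep only the largest value in each non-overlapping window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop ragged edges
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 5.],
               [7., 2., 9., 8.],
               [0., 1., 3., 4.]])
print(max_pool(fm))
# [[6. 5.]
#  [7. 9.]]
```

Each 2*2 block collapses to its largest value, so the filter shrinks by half in each dimension while keeping the strongest responses.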
Fully Convolutional Network
A convolutional neural network will usually end in a multilayer perceptron, which acts like a regular neural network. In that type of network, convolution can be seen as a process separate from the neural network itself. But in an FCN, convolution is the core process of learning.
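The handoff from convolution to the multilayer perceptron is just a flattening step. A sketch with hypothetical sizes (16 feature maps of 5*5, 10 output classes, random weights standing in for learned ones):

```python
import numpy as np

# Suppose the last convolution/pooling stage produced
# 16 feature maps of size 5x5 (hypothetical numbers).
feature_maps = np.random.rand(16, 5, 5)

# Flatten into one vector so it can feed a multilayer perceptron.
flat = feature_maps.reshape(-1)       # 16 * 5 * 5 = 400 input nodes
W = np.random.rand(10, flat.size)     # one dense layer: 400 -> 10 classes
logits = W @ flat
print(flat.shape, logits.shape)  # (400,) (10,)
```

Compare 400 inputs here with 480*480 = 230,400 inputs if we had fed the raw pixels straight into the perceptron: this is the node reduction mentioned earlier.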
Fully convolutional network algorithm
Here, the architecture is divided into two parts: an encoder and a decoder. In the encoder module the system works pretty much the same as before; it extracts features from the image down to some degree.
The interesting part is the decoder. Here, two processes are combined: upsampling and skip connections. Upsampling is roughly the reverse of convolution; it increases the spatial dimensions of the filter. Skip connections, meanwhile, are outputs from earlier convolutions in the encoder. They are used to give finer information about the objects in the image, as they retain more spatial detail than the upsampled filter.
Together, these processes yield the image classification or segmentation that is needed.
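A single decoder step can be sketched like this. I use nearest-neighbour upsampling as a stand-in for the learned upsampling (real FCNs often use transposed convolutions), and the feature values are made up:

```python
import numpy as np

def upsample(x, factor=2):
    """Nearest-neighbour upsampling: each value is repeated
    factor x factor times, doubling the spatial size."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical decoder step: a coarse 2x2 feature map is upsampled
# to 4x4, then combined with a 4x4 skip connection from the encoder.
coarse = np.array([[1., 2.],
                   [3., 4.]])
skip = np.ones((4, 4))          # encoder output at the matching resolution

decoded = upsample(coarse) + skip  # skip connection adds fine detail
print(decoded.shape)  # (4, 4)
```

The coarse map carries the "what" (semantic content), while the skip connection restores the "where" (spatial detail) lost during encoding.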
U-NET, an FCN algorithm