Friday, 17 November 2017

Introduction to Deep Learning

Introduction

In this class we will introduce the topic of Deep Learning, a rapidly growing segment of Artificial Intelligence. Deep Learning is increasingly being used to deliver near-human-level accuracy in image classification, voice recognition, natural language processing, and more. We will cover the basics of Deep Learning through live examples, introduce the three major Deep Learning software frameworks, and demonstrate why Deep Learning excels when run on GPUs.
This introductory class is intended to serve as a first introduction to the concept of Deep Learning and a live tour of the major software frameworks. Some complex-looking code is presented, but you do not need to understand this code to complete the class. At times you will be waiting a couple of minutes for the Deep Learning algorithms to run - feel free to use this time to explore the code.
By the end of this class you will hopefully be excited by the potential applications of Deep Learning and have a better idea of which of the frameworks you may want to learn more about in one of our upcoming follow-on classes.

What is Deep Learning?

Deep Learning (DL) is a branch of artificial intelligence research that aims to develop techniques allowing computers to learn complex perception tasks, such as seeing and hearing, at human levels of performance. Recent advances in DL have yielded startling performance gains in fields such as computer vision, speech recognition and natural language understanding. DL is already in use today to understand data and user inputs in technologies such as virtual personal assistants and online image search. DL remains an active area of research, where it is envisaged that human-level perception of unstructured data will enable technologies such as self-driving cars and truly intelligent machines.
DL attempts to use large volumes of unstructured data, such as images and audio clips, to learn hierarchical models which capture the complex structure in the data and then use these models to predict properties of previously unseen data. For example, DL has proven extremely successful at learning hierarchical models of the visual features and concepts represented in handheld camera images and then using those models to automatically label previously unseen images with the objects present in them.
The models learned through DL are biologically inspired artificial neural networks (ANNs). An ANN is an interconnected group of nodes, akin to the vast network of neurons in a brain. In the image below, each circular node represents an artificial neuron and each arrow represents a connection from the output of one neuron to the input of another. Input data is fed into the red nodes and, depending on the weights on the connections between nodes, causes varying levels of activation of the subsequent hidden and output nodes. In the image-labelling example above, the input nodes would be connected to image pixels and the output nodes would have a one-to-one correspondence with the possible object classes; the job of the hidden nodes is to learn the complex function which maps pixels to object labels.
Figure 3: Fully-connected Artificial Neural Network with one hidden layer
For the advanced reader, the activation of a neuron is simply a function of the weighted sum of its inputs. For basic neural networks this function is a sigmoid. The idea is that if the weighted sum of the inputs exceeds a threshold value, the neuron produces a strong (near-one) output.
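The computation above can be sketched in a few lines of Python. This is a minimal illustration of a single sigmoid neuron, not code from any of the frameworks covered in this class; the input, weight and bias values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through the activation
    z = np.dot(weights, inputs) + bias
    return sigmoid(z)

# Example: a single neuron with three inputs
x = np.array([0.5, -1.2, 3.0])   # input values
w = np.array([0.4, 0.1, 0.8])    # connection weights
print(neuron_output(x, w, bias=-1.0))
```

The bias plays the role of the threshold: the larger the weighted sum is relative to it, the closer the output gets to one.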
For ANNs to be effective in difficult perception tasks, such as object labelling in images, the networks usually need many stacked layers of artificial neurons, each layer containing many neurons. It is this depth - the many stacked layers - that leads to these networks being called Deep Neural Networks (DNNs).
One particular class of DNN which has shown great capability in visual perception tasks is the Convolutional Neural Network (CNN). CNNs have a structure which loosely resembles the structure of the human visual cortex, where lower levels of the model hierarchy focus on small, local visual details, such as oriented line segments, which aggregate into higher levels of the model corresponding to complex human concepts, such as faces and animals.
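The low-level feature detection described above can be illustrated with a toy convolution. This is a hand-rolled sketch (real CNN layers learn their filters during training rather than using hard-coded ones, and frameworks implement this far more efficiently); here a fixed 3x3 filter responds to vertical edges in a tiny synthetic image:

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image (stride 1, no padding) and take the
    # elementwise product-and-sum at each position - the core CNN operation.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A fixed 3x3 filter that responds strongly to vertical edges
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])

# A toy 5x5 image: dark on the left, bright on the right
img = np.zeros((5, 5))
img[:, 3:] = 1.0

response = convolve2d(img, vertical_edge)
print(response)
```

Positions where the image changes from dark to bright produce a large-magnitude response, while uniform regions produce zero - exactly the kind of local detail the lowest layers of a CNN pick up.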
Figure 4: Deep Neural Networks learn a hierarchical model of the input data in which the layers correspond to increasingly complex real-world concepts. The number of parameters in a network is a function of the number of neurons in the network and the architecture of the network connectivity.
The ImageNet Challenge is an annual competition in which competitors are provided with 1.2 million natural images from the internet, labelled with the objects that appear in them using 1,000 different class labels. Competitors must create a model from this data which is then tested against a further 100,000 images to see how accurately it can identify and localize the objects within them. Over the past few years, CNN-based approaches have come to dominate the competition, with accuracy in the object-identification task recently exceeding 95% - comparable with human performance in labelling the objects in the test dataset.
The mathematics that underpins DL training is predominantly linear algebra - large matrix operations. Computation of this kind is highly parallelizable, making it a perfect fit for acceleration on GPUs. Training a DNN that can be competitive in the ImageNet Challenge is computationally very intensive and would take weeks, months or even years of computation on a CPU-based system. Through massive parallelization, GPUs can reduce this training time to days or even hours. Almost all entrants to the ImageNet Challenge now use GPUs, sometimes many of them, to train CNNs with billions of trainable parameters. The graph below shows how the significant recent improvements in accuracy in the ImageNet Challenge correlate with the explosion in the use of GPUs for training DNN entries.
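To see why DNN training reduces to matrix algebra, consider one fully-connected layer processing a batch of inputs. A single matrix multiplication computes every neuron's weighted sum for every example at once, and it is exactly this operation that GPUs parallelize so effectively. This is an illustrative sketch with made-up sizes and random weights, not a training loop:

```python
import numpy as np

batch = 64     # number of examples processed together
n_in = 784     # e.g. 28x28-pixel images, flattened
n_out = 256    # number of neurons in the layer

X = np.random.rand(batch, n_in)          # a batch of inputs
W = np.random.rand(n_in, n_out) * 0.01   # connection weights
b = np.zeros(n_out)                      # one bias per neuron

# One matrix multiplication computes the weighted sums for all 256
# neurons across all 64 examples simultaneously.
Z = X @ W + b

# The sigmoid activation is then applied elementwise.
activations = 1.0 / (1.0 + np.exp(-Z))
```

Each of the batch * n_in * n_out multiply-accumulate operations inside `X @ W` is independent, which is why the work maps so naturally onto thousands of GPU cores.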
Figure 5: The introduction of GPU accelerated Deep Learning into the ImageNet challenge began a period of unprecedented performance improvements.
GPUs are not only far more computationally efficient for training DNNs - they are also far more cost effective at scale. In 2013 Google built its "Google Brain" - a 1,000-server, 16,000-core CPU-based cluster for training a state-of-the-art DNN for image understanding - at an estimated cost of $5,000,000. Shortly afterwards, a team from the Stanford AI Lab showed that using 3 GPU-accelerated servers with 12 GPUs per server - a total of 18,432 cores - they could train the same DNN and achieve the same performance. Their system cost approximately $33,000 - roughly 1/150th of the hardware cost and energy usage (Wired Article).
Figure 6: GPUs are the most cost-effective, size- and power-efficient means of training large Deep Neural Networks.