“Deep Learning” has become the buzzword of the 21st century! It’s used in so many applications and devices, like the iPhone, which uses facial recognition to log you in, Facebook, which helps with automatic tagging, or Snapchat, which reads your face to apply funny filters. Extensive deep learning research is also happening in fields like autonomous vehicles, natural language processing, computer vision, robotics and gaming. Let’s delve deeper (get it 😉?) to understand what this field is all about!
How does the input get converted into an output? This is where the math comes in: explaining it simply, the output is just a sum of all the inputs, weighted by some parameters. Think about it this way – suppose you had to judge if a movie is good or not, based on info given to you like the movie director, the lead actor, the genre, the IMDb rating, the budget and the duration. Now, your final decision will be based on all of these factors together, but you’d probably care more about some factors than others. For me personally, the lead actor and IMDb rating are more important, so they’ll have larger “weights”, while I won’t care as much about the duration and the budget, so they’ll have lower weights. By weighting each of these inputs in my head, I can then make a decision on whether the movie is good or not. This is essentially what a single artificial neuron does! Through a supervised learning approach, it will learn how much weight to give each input, which it can then use to predict outputs for new data.
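The movie example can be sketched in a few lines of code. This is a minimal, hand-wired neuron – the feature values and weights below are made up purely for illustration (a real neuron would *learn* its weights from data), and I’ve added the standard “sigmoid” squashing function so the weighted sum becomes a score between 0 and 1:

```python
import math

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of inputs plus a bias, squashed to a 0-1 score."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid: maps any number into (0, 1)

# Hypothetical features: [lead_actor, imdb_rating, duration, budget],
# each normalised to the 0-1 range
features = [0.9, 0.8, 0.5, 0.4]
weights  = [0.7, 0.8, 0.1, 0.1]  # actor & rating matter more to me

score = neuron(features, weights)
print(round(score, 3))  # close to 1 => probably a good movie
```

A score above 0.5 means “good movie”; training would nudge the weights until the scores match real examples.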
If you combine many of these neurons together, you create a “deep neural network” which can perform complex tasks like understanding handwriting or predicting if a cancer is malignant or not. The image at the top of this page is a deep network that reads a picture of a handwritten digit and identifies what number it is. Why is it called a DEEP network, you may ask. Well, the network is a combination of many layers of these neurons – the first layer (on the left) takes in the inputs while the last layer (on the right) produces the output, and between these two layers are multiple “hidden” layers. Because the network stacks so many layers between input and output, it is called a deep neural network.
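To make the “layers of neurons” idea concrete, here’s a toy sketch of data flowing through a deep network. Every layer is just a group of the weighted-sum neurons described above; the layer sizes and the random (untrained) weights here are made up for illustration:

```python
import random

random.seed(0)  # so the made-up weights are reproducible

def dense_layer(inputs, n_out):
    """One fully-connected layer: each output neuron takes a weighted
    sum of ALL the inputs (weights are random, i.e. untrained)."""
    return [
        sum(x * random.uniform(-1, 1) for x in inputs)
        for _ in range(n_out)
    ]

x   = [0.5, 0.2, 0.9]      # input layer: 3 features
h1  = dense_layer(x, 4)    # hidden layer 1: 4 neurons
h2  = dense_layer(h1, 4)   # hidden layer 2: 4 neurons
out = dense_layer(h2, 1)   # output layer: 1 neuron
print(len(h1), len(h2), len(out))  # 4 4 1
```

The “depth” is just how many of these layers you stack; training adjusts all the weights at once so the final output becomes useful.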
What makes deep neural networks special is that as they get deeper and deeper, we don’t really understand what these weights mean and what each neuron is really learning. For a basic neuron like the one I just explained, there’s an intuitive idea of what’s being learnt; however, for a very deep network, we don’t really know what the numbers mean, or in mathematical terms, “what function is being approximated”. For this reason, these models are called “black box” models.
While these neural networks work pretty well for basic tasks where data is in a structured Excel spreadsheet, for example, they aren’t the best models for data like images, video and audio, which would be used for tasks like facial recognition or speech translation. This is because they’re pretty naive, in that they will only perform operations on numbers without actually understanding the contextual elements of the data. Obviously, the goal of AI is to create something that works as well as our brain does, or even better, so we need to crank this up further! Over the years, various types of neural networks have been developed by brilliant mathematicians and computer scientists. I won’t go too much into how these models work because it’s more complex, but here are two popular types of neural networks:
Convolutional Neural Networks (CNNs): These are used for image-related tasks. The basic idea involves a small window or “filter” which slides across the whole image and identifies features in it. In the initial layers of a deep CNN, the filters would capture simple features like edges, while in the deeper layers of the network, they would detect much more complex features like shapes and objects. Below you can see two pictures: the first shows the basic unit of a CNN, and the second is a popular CNN model called LeNet-5, which performs handwritten digit recognition.
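Here’s the sliding-filter idea in miniature. I’ve made up a tiny 4×4 “image” that is dark on the left and bright on the right, and a classic hand-made vertical-edge filter (a real CNN would learn its filters during training):

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, sum the
    elementwise products (a 'valid' convolution, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(s)
        out.append(row)
    return out

# 4x4 image: dark left half (0), bright right half (1) => vertical edge
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-made vertical-edge detector: responds when brightness
# increases from left to right inside the window
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[3, 3], [3, 3]] -- strong response on the edge
```

Every window position here straddles the dark-to-bright boundary, so the filter fires everywhere; on a blank image it would output all zeros.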
Recurrent Neural Networks (RNNs): These neural networks deal with sequence or time-series data – like written text, speech audio or music. RNNs are commonly used in “natural language processing”, a hotspot of ML research that aims to understand patterns in language. The inner workings of an RNN are complicated to explain, but I think the image below illustrates the basic idea pretty well. Essentially, we provide the model with an input in the form of a sequence. This could be, for example, a sentence, where the words are in a particular order that conveys some meaning – like “Neural Networks are just awesome”. The model will take the first word (“Neural”), convert it into a number and then perform some computation on it to create some output. Then it will take this output, plus the next word (“Networks”), perform some more computations and create a new output. This process is repeated for the rest of the words in the sentence, producing a final output, for example a score of how positive/negative the sentence tone is (the thumbs-up at the end means it’s positive). An RNN could also output the sentence translated into another language, like in Google Translate. Basically, over the sequence of words, the RNN builds up an understanding of the relationships between the words and uses that to predict the output.
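The loop I just described – take a word, combine it with what you remembered so far, repeat – can be sketched in a few lines. Everything here is made up for illustration: the word-to-number mapping and the two weights are hand-picked stand-ins for what a real RNN would learn:

```python
import math

# Hypothetical word encodings (a real model would learn embeddings)
word_to_num = {"Neural": 0.9, "Networks": 0.8, "are": 0.1,
               "just": 0.2, "awesome": 1.0}

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One recurrence: the new hidden state mixes the old hidden
    state with the current input, squashed by tanh into (-1, 1)."""
    return math.tanh(w_h * hidden + w_x * x)

hidden = 0.0  # the "memory" starts empty
for word in "Neural Networks are just awesome".split():
    hidden = rnn_step(hidden, word_to_num[word])

# Treat the final hidden state as a crude positivity score
sentiment = "positive" if hidden > 0 else "negative"
print(sentiment)  # positive
```

The key point is that the same `rnn_step` function is reused at every position, and the `hidden` value is what carries information from earlier words to later ones.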
So that was just a quick summary about deep learning! It’s interesting to note that the concept of neural nets has been around since the 1950s, but only in the past decade has it gained popularity. This is because we now have an abundance of data and computational power, which allows us to implement these models and generate valuable insights. In my next post, I’ll be sharing a project I did for my undergraduate thesis, where I implemented a CNN model for automating the cardiac image planning protocol. Until then, do check out these links if you’re interested in going deeper into how neural nets work!