
Mastering the LSTM Network: A Step-by-Step Guide in Easy Language



In this post you will discover the LSTM's theoretical basis, along with some basic math to help you grasp the idea. In the next post we apply this model to two projects so that you can see it in practice.

Are you still unsure about learning LSTM? Let me tell you, it’s worth it! LSTM is a type of technology that can help you make predictions and understand patterns in data that comes in a particular order, like speech, writing or videos. It’s very useful in many areas like language, audio and video processing. By learning about LSTMs, you can gain new skills that are in high demand and help you work on exciting projects!

If you don't believe me, let me show you some statistics.

Hochreiter and Schmidhuber first published this algorithm in 1997; it has since been improved and remains a very effective neural network that is applied to a wide variety of problems. To convince you it is worth learning, please check the statistics below, which show LSTM usage over time in papers compared with other neural networks; the next image illustrates the proportion of various tasks that employ this model.

To see the source, please click HERE.


[Figure: LSTM usage over time in papers, compared with other neural networks]

[Figure: proportion of tasks that employ the LSTM model]

As you can see, time series and NLP are the areas where this model is used most, so we will choose two different projects to show how to use it for these kinds of tasks.

What is LSTM and why was it created?

In summary, LSTM stands for Long Short-Term Memory network and is a more complex form of RNN (Recurrent Neural Network), designed to address the problems of RNN algorithms. Oh boy, I can already hear you asking, "what's the deal with RNNs?" Let me explain a little about the LSTM first and then explain the problem.

The Architecture of the RNN Model:

[Figure: an RNN cell with a recurrent loop from the output back to the input]

The RNN comes with a recurrent connection, as can be seen in the image: one loop from the output back to the input, which is called the hidden layer, and one other input, shown in the picture as a red circle.


If we unfold this network, it looks like the following image. Don't worry if everything is still unclear; I will explain it step by step later in this article.


[Figure: the RNN unfolded over time steps]

In the next image, the RNN is compared with the LSTM. Here "ht-1" is the input that was produced as the model's output "ht" at the previous step (check the above image). This means that the output after the sum function (I will explain it later) returns to the model as an input.
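For reference, the "sum function" inside a plain RNN can be written as a single update rule. One common form, with W, U and b as the learned weights and bias (these symbol names are mine, not taken from the figure), is:

ht = tanh( W · Xt + U · ht-1 + b )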


[Figure: the RNN compared with the LSTM]

I don't want to go too deep into the problems of RNN algorithms, since that would make this article too long, but the main problem with the RNN model is vanishing gradients. In other words, this model cannot train well when the time lags grow beyond roughly 5-10 steps, and it struggles to keep information from earlier steps for a long period.

For example, consider the following sentence: Creating a snowman is a great way to keep children happy and entertained, but we need to wait for a proper …. to be able to create it.

If you are asked to fill in the blank in this sentence, you don't need to analyse every word; you just see "snowman" and say "snow". The words closer to the blank, such as "proper", "children", "happy" and "entertained", do not help us. We need to go back to "snowman" to understand which word belongs in the blank. Unfortunately, RNNs cannot carry information across that many time steps.

Just imagine if the RNN model could remember important information for a long period, like the image below, in which the most important information is kept in a long-term memory:


[Figure: a network that keeps the important information in a long-term memory]

The LSTM model was created to handle cases like the example above and to keep essential information for a long period.

If you are interested in learning more about the vanishing gradients problem in RNNs, I suggest reading This Article. However, the LSTM is not the only solution to this issue; there are other approaches to the problem, but the LSTM is the most popular so far.

Trust Yourself!


Although the LSTM was very effective at fixing the shortcomings of RNNs and turning them into a potent deep learning model, it's interesting to note that this model was initially submitted to the NIPS conference in 1995 but was rejected. Schmidhuber tweeted a few years ago:

Quarter-century anniversary: 25 years ago we received a message from N(eur)IPS 1995 informing us that our submission on LSTM got rejected. (Dont worry about rejection. They mean little.)

The Architecture of the LSTM Model:

The LSTM is a recurrent neural network, similar to an RNN, except that it has an additional input and output, as seen in the image. The new input is called C, the first letter of Cell State, and it plays the role of long-term memory.


[Figure: the LSTM with the extra cell state input Ct-1 and output Ct]

This new input (Ct-1) is connected directly to the output (Ct) and stays connected for the whole process. This line functions as a memory to which we can later add information or from which we can remove it. In the following image you can see the whole LSTM architecture, but don't panic; I will explain all the parts.


[Figure: the complete LSTM architecture]

Let's now explore the most important component, the Cell State. As I said before, the green line is always directly connected from the input (Ct-1) to the output (Ct). On this line there are two symbols, each of which represents a gate. The first gate (×) is called the Forget Gate, and the second gate (+) is named the Remembering Gate. You can label them whatever you choose.


[Figure: the cell state line with its two gates, the forget gate (×) and the remembering gate (+)]

Forget Gate:


[Figure: the forget gate]

In this gate we have two inputs: "Ct-1" and an unknown input that we know passes through a sigmoid function, so its values lie between 0 and 1. As you can see, the gate is like a function that combines the inputs and produces an output of the same shape that is slightly different from "Ct-1". In other words, the gate affects "Ct-1" through the other input, "ft".

If an element of "ft" is between 0 and 1, it scales down the corresponding element of "Ct-1"; if it is 1, that element passes through unchanged. If an element of "ft" is 0, the corresponding element of "Ct-1" is forgotten.
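To make this element-wise effect concrete, here is a tiny Python/numpy sketch; the numbers are invented purely for illustration:

import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])   # Ct-1: the long-term memory so far
f_t    = np.array([1.0,  0.5, 0.0])   # ft: keep the 1st element, weaken the 2nd, forget the 3rd

c_after_forget = f_t * c_prev          # result: [2.0, -0.75, 0.0]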


Let's now check what the unknown input is and how it is created.


[Figure: how "ft" is created from "ht-1" and "Xt" through a fully connected layer and a sigmoid]

As you can see, "ht-1" and "Xt" are the two inputs of the forget gate, which is responsible for removing unnecessary information from the long-term memory. The above image shows that the two inputs are combined by an MLP (a fully connected layer) and then passed through the sigmoid function, which converts each element into a number between 0 and 1; the output is "ft".
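Written as a formula, with wif and whf as the learned weights of that fully connected layer and bf as its bias (these names are my own labels, not taken from the figure):

ft = sigmoid( wif · Xt + whf · ht-1 + bf )

The sigmoid squashes every element of the result into the range 0 to 1, which is exactly what the forget gate needs.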



As I mentioned, "ft" is a vector with the same shape as "Ct-1". This vector specifies the value, between 0 and 1, by which each element of "Ct-1" should be multiplied. If a part of "ft" is close to 1, the corresponding portion of "Ct-1" should be kept; if it is near 0, that part of "Ct-1" should be thrown away.



So far we have seen how the forget gate works by combining "Ct-1" and "ft". Now let's move on to the other gate, the remembering gate.

Remembering Gate:


[Figure: the remembering gate adding information to the cell state]

So far, we have "Ct" as the output of the previous stage. In this part, the network adds information to "Ct", our long-term memory.

So, we have a gate here that is responsible for adding data. This gate has two inputs: "Ct" and another one we don't yet know, but we do know from the last step that this unknown input has to be a vector with the same shape as "Ct". This unknown input is really important for us because it tells us which information has to be added to the long-term memory, so let's see how it is created.


[Figure: how "gt" is created from "ht-1" and "Xt" through a fully connected layer and a hyperbolic tangent]

In the above image, the highlighted green network shows the add function. As you can see, the initial inputs are the same as for the forget gate, "ht-1" and "Xt". Again, these two inputs are combined by a fully connected layer (MLP) with their own new weights ("wig", "whg"). Everything up to this point is identical to the previous stage, but the result is now passed through a hyperbolic tangent, which maps every element of "gt" to a value between -1 and 1. The hyperbolic tangent lets us reduce or even reverse the effect of individual components in "Ct"; in other words, we can change the impact of specific components by adjusting numbers between -1 and 1.
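Using the weight names from the image ("wig", "whg") plus a bias term bg (which I am assuming here), this step can be written as:

gt = tanh( wig · Xt + whg · ht-1 + bg )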



From the last step we produce a new input named "gt", with which we can update "Ct". It seems everything is perfect, but we don't want to add too much information to our long-term memory; we only need the essential information there. It's just like when we read a sentence: our mind keeps only the important information, acting as a filter. We can filter the "gt" information with a gate similar to the forget gate from the previous part. So, we define a new gate, named the Input Gate, to filter the "gt" information, as shown in the sketch below.
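The input gate is built exactly like the forget gate, only with its own weights (the names below are assumed for illustration):

it = sigmoid( wii · Xt + whi · ht-1 + bi )

Multiplying "it" element-wise with "gt" keeps only the parts of the new information that are worth remembering.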


[Figure: the input gate and the updated cell state Ct]

As can be seen in the image, the equation for the updated "Ct" can now be derived.
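Since the original figure is not reproduced here, the update in formula form is (with * meaning element-wise multiplication):

Ct = ft * Ct-1 + it * gt

The first term forgets part of the old memory, and the second term adds the newly selected information.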

So far we have learned how the LSTM can forget certain information, and how it distinguishes significant information and stores it in long-term memory.

Last step!


Don't give up, you have almost learned everything about the LSTM; now it's time for the last part. As you may guess, in this part we learn how to produce "ht" and why we need it!

We need "ht" because the LSTM, like an RNN, is a recurrent network, which means our outputs become inputs at the next step (Image 3). So "ht" plays the role of the input for the next step of the network.

[Figure: the output gate producing "ht"]

Oops, there is still something we haven't figured out yet. "Ct" contains a lot of information, but we only need some of it, some specific information. The solution is easy: we just need an output gate. Yes, another gate! It's like a key that unlocks the door to the right information.

But don't worry, everything is similar to the previous steps. Our two inputs, "ht-1" and "Xt", are combined by a fully connected layer, passed through a sigmoid function, and produce a vector with the same shape as "Ct", which we name "Dt", as you can see in the above image.
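In formula form, using the article's name "Dt" for the output-gate vector and assumed weight names, this is:

Dt = sigmoid( wio · Xt + who · ht-1 + bo )

In the standard LSTM formulation, the new hidden state is then ht = Dt * tanh(Ct): the cell state is squashed back into the range -1 to 1 and the gate selects which parts of it to expose as the output.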

Done! Now we know what is going on inside the LSTM network, and in the next post I will use the LSTM in two examples, NLP and time series.

Below, I provide all the formulas and the complete LSTM network.


[Figure: the complete LSTM network with all formulas]
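Since the original figure is not reproduced here, below is a minimal Python/numpy sketch of a single LSTM step that ties all of the gates above together. The parameter names and the use of one concatenated input vector are my own illustrative choices; libraries such as Keras or PyTorch implement the same logic for you.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    # x_t    : input at time t, shape (input_size,)
    # h_prev : previous hidden state ht-1, shape (hidden_size,)
    # c_prev : previous cell state Ct-1, shape (hidden_size,)
    # params : weight matrices of shape (hidden_size, hidden_size + input_size)
    #          and bias vectors of shape (hidden_size,)
    z = np.concatenate([h_prev, x_t])                 # combine the two inputs

    f_t = sigmoid(params["Wf"] @ z + params["bf"])    # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
    g_t = np.tanh(params["Wg"] @ z + params["bg"])    # candidate information
    o_t = sigmoid(params["Wo"] @ z + params["bo"])    # output gate ("Dt" in the text)

    c_t = f_t * c_prev + i_t * g_t                    # forget, then remember
    h_t = o_t * np.tanh(c_t)                          # output / next hidden state
    return h_t, c_t

# Toy usage: run a random 5-step sequence through one LSTM cell.
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
params = {w: rng.standard_normal((hidden, hidden + inputs)) * 0.1 for w in ["Wf", "Wi", "Wg", "Wo"]}
params.update({b: np.zeros(hidden) for b in ["bf", "bi", "bg", "bo"]})

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.standard_normal((5, inputs)):
    h, c = lstm_step(x, h, c, params)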

Thank you for your support.
