Introduction and Historic Background
By definition, a (Discrete-time) Markov chain is a “Stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”. Nowadays, when speaking of the Markov Chain, many people’s immediate reactions are voice recognition, translators, Google’s PageRank algorithm, and etc. So, what exactly is a Markov Chain, and how is statistics related to Artificial Intelligence? To get some general ideas of the Markov chain, let’s start with an application of the Markov chain in natural language processing.
Back in 1946, the first computer, EMAC, came into existence. People realized that computers can do a better job than human beings in many fields, such as approximating the projection of bombs or calculating revenue and taxation for multinational companies. Naturally, people started to wonder if computers were intelligent enough to understand natural language and do a better job than human beings in this field as well. Since then, the discovery and development of artificial intelligence has stepped onto the stage of history. In the first few decades, it’s commonly believed that in order for computers to translate an article or to recognize a speech, they must first understand the language. To achieve this goal, computers must have some intelligence that is similar to human beings’ in order to perform rule-based syntactic and semantic analysis.
This idea is very intuitive. However, this linguistic-based method quickly came to a dead end. There are many obstacles that are seemingly impossible to overcome: firstly, the polysemy in natural language is not easily summarizable with rules; it often relies on context and common sense. Secondly, the grammatical rules of the language develop over time. In order to solve these contradictions, it is inevitable that the rules must be explained regarding different situations. Last but not least, even if linguists are able to summarize a set of grammatical rules covering all the natural language phenomenona, it is difficult for computers to analyze: to analyze a context-sensitive grammar, the computational complexity is about the sixth power of the length of the chosen sentence. Therefore, in the 1970s, even IBM, which manufactured mainframe computers, gradually abandoned linguistic-based natural language processing.
In the late 1980s, an American-Czech researcher named Frederick Jelinek proposed a framework for speech recognition based on statistics. Although this method was not expected to succeed at first, with the improvement of computing power and the continuous increase of the amount of text data, natural language processing based on mathematical and statistical models eventually made great achievements. By the late 1990s, experts found that the syntactic rules obtained through the statistics model were even more convincing than those summarized by linguists. Nowadays, the statistical language models are the undisputed cornerstones of natural language processing, and one of the cornerstones of statistical language models is the Markov chain.
The original intention of the statistical language model is to solve problems such as speech recognition. In speech recognition, the computer needs to know whether a word sequence can form a plausible sentence. To judge the reasonableness of a sentence, computers do not need to understand the grammar or context at all; instead, computers just calculate the probability of it appearing in text: the greater the probability of occurrence, the more reasonable the sentence is. How do we calculate the probability of occurrence? Suppose that S represents a meaningful sentence, and the sentence is composed of words in order w1, w2, ... wn.(where n is the length of the sentence). Now we want to know P(S), the probability of S appearing in the text. Utilizing the formula of conditional probability, the probability of occurrence of S can be expressed as the multiplication of the conditional probability of occurrence of each word w1, w2, ... wn. Therefore:
P(S) = P(w1, w2, ... , wn) = P(w1) P (w2 | w1) P (w3 | w1, w2) ... P (wn | w1, w2, ... , wn-1) (1.1)
P(w1) represents the probability of the first word w1appearing in the text. P (w2 | w1) represents the probability of the second word w2 appearing in the text, given that the first word is w1. This is so-called transition probability: the probability of transitioning from one state (w1) to another (w2) . It’s not difficult to see that the word wn’s occurrence probability depends on the occurrence probability of all the words (w1, w2, ... , wn-1) before it. From a calculation perspective, it is easy to estimate the conditional probability of the first word; we simply use the formula:
P(w1) #(w1) / # (1.2)
(P(w1) equals to the number of occurrences of word w1in the corpus to divide by the number of all the words contained in the corpus (#) to estimate the probability of occurrence of word w1; the larger corpus we use, the more accurate the estimation we get.)
The calculation for P(w2 | w1) is not too troublesome; utilizing a similar method, we get:
P(w2 | w1) = P (w2 , w1) / P(w1) (1.3)
(P (w2 , w1) is called the joint probability, and P(w1) is called the marginal probability; as long as we have a corpus of enough samples, we can still use approximate calculation:
P(w2 , w1) =#(w2 , w1) / # (1.4) P(w1) = # (w1) / # (1.5)
(For #(w2 , w1), we just count the appearance of w2 with w1 before it in the corpus we choose)
Therefore, the transition probability P(w2 | w1) can be represented as:
P(w2 | w1) =#(w2 , w1) / # (w1) (1.6) However, as we can see from equation 1.1, this model will become more and more complex and difficult to calculate as the length of the sentence increases. In 1906, a Russian mathematician named Andrey Markov proposed a lazy but quite effective method, that is, whenever such situation is encountered, we assume that the occurrence probability of any word wi is only related to one word before it (wi-1). Now, the problem is simplified, P(S) now can be written as:
P(S) =P(w1) P (w2 | w1) P (w3 | w2) ... P (wn | wn-1) (1.7)
The above equation 1.7 reveals one of the most important properties of the Markov chain: in the Markov chain, the probability of occurrence of a state wn only depends on its previous state wn-1, and this property is called “limited dependence”.
With this simplified model, we can approximate the occurrence probability of P(S) by calculating the occurrence probability of each word wi when the previous word is wi-1 (which is P(wi | wi-1)) and multiplying all these occurrence probabilities. The Markov chain perfectly demonstrates the beauty of mathematics: it makes a complicated problem so simple. In the next section, we are going to discover more in-depth properties of the Markov chain.