Introduction and Historic Background
By definition, a (Discrete-time) Markov chain is a “Stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”. Nowadays, when speaking of the Markov Chain, many people’s immediate reactions are voice recognition, translators, Google’s PageRank algorithm, and etc. So, what exactly is a Markov Chain, and how is statistics related to Artificial Intelligence? To get some general ideas of the Markov chain, let’s start with an application of the Markov chain in natural language processing.
Back in 1946, the first computer, EMAC, came into existence. People realized that computers can do a better job than human beings in many fields, such as approximating the projection of bombs or calculating revenue and taxation for multinational companies. Naturally, people started to wonder if computers were intelligent enough to understand natural language and do a better job than human beings in this field as well. Since then, the discovery and development of artificial intelligence has stepped onto the stage of history. In the first few decades, it’s commonly believed that in order for computers to translate an article or to recognize a speech, they must first understand the language. To achieve this goal, computers must have some intelligence that is similar to human beings’ in order to perform rule-based syntactic and semantic analysis.
This idea is very intuitive. However, this linguistic-based method quickly came to a dead end. There are many obstacles that are seemingly impossible to overcome: firstly, the polysemy in natural language is not easily summarizable with rules; it often relies on context and common sense. Secondly, the grammatical rules of the language develop over time. In order to solve these contradictions, it is inevitable that the rules must be explained regarding different situations. Last but not least, even if linguists are able to summarize a set of grammatical rules covering all the natural language phenomenona, it is difficult for computers to analyze: to analyze a context-sensitive grammar, the computational complexity is about the sixth power of the length of the chosen sentence. Therefore, in the 1970s, even IBM, which manufactured mainframe computers, gradually abandoned linguistic-based natural language processing.
In the late 1980s, an American-Czech researcher named Frederick Jelinek proposed a framework for speech recognition based on statistics. Although this method was not expected to succeed at first, with the improvement of computing power and the continuous increase of the amount of text data, natural language processing based on mathematical and statistical models eventually made great achievements. By the late 1990s, experts found that the syntactic rules obtained through the statistics model were even more convincing than those summarized by linguists. Nowadays, the statistical language models are the undisputed cornerstones of natural language processing, and one of the cornerstones of statistical language models is the Markov chain.
The original intention of the statistical language model is to solve problems such as speech recognition. In speech recognition, the computer needs to know whether a word sequence can form a plausible sentence. To judge the reasonableness of a sentence, computers do not need to understand the grammar or context at all; instead, computers just calculate the probability of it appearing in text: the greater the probability of occurrence, the more reasonable the sentence is. How do we calculate the probability of occurrence? Suppose that S represents a meaningful sentence, and the sentence is composed of words in order $ w_{1}, w_{2}, ... w_{n} $.(where n is the length of the sentence). Now we want to know $ P(S) $, the probability of $ S $ appearing in the text. Utilizing the formula of conditional probability, the probability of occurrence of S can be expressed as the multiplication of the conditional probability of occurrence of each word $ w_{1}, w_{2}, ... w_{n} $. Therefore:
P(w1) represents the probability of the first word w1appearing in the text. P (w2 | w1) represents the probability of the second word w2 appearing in the text, given that the first word is w1. This is so-called transition probability: the probability of transitioning from one state (w1) to another (w2) . It’s not difficult to see that the word wn’s occurrence probability depends on the occurrence probability of all the words (w1, w2, ... , wn-1) before it. From a calculation perspective, it is easy to estimate the conditional probability of the first word; we simply use the formula: