The Markov Name Generator is a side-project inspired by Daniel Shiffman’s own Markov Chain video.

A Markov Chain is a random process where the next state is dependent on its preceeding state — Wikipedia describes it as "serial dependence between adjacent periods", hence "chain". Markov chains can be used to predict many things, for example, the weather, exchange rates, population growth, etc. It was the basis of Google's Page Rank algorithm used in Google Search; In its essence the idea is that the importance of a web page can be judged by looking at the pages that link to it, however, it's far more sophisticated than that now. We can define the result of a Markov Chain as the sequence of probability vectors transitioning from one state to the next at discrete time steps.

In the context of the Markov Name Generator, an algorithm is used to generate names from a source text — Forbes' 2017 Global 2000, the top 2000 public companies in the world. There are two phases in generating the name, firstly, building a dictionary from the source text and then creating a resulting name from the dictionary.

To create the dictionary we pick an order value, an integer, which determines the length of the sequence of characters we look for in the source text. We can also refer to these as N-Grams. Below I have taken one of the names from the source text to illustrate all the n-grams found in the name. These are N-Grams of 3 which could be referred to as a Tri-Gram.

We analyse the entire source text for all the n-grams to build a statisical model. We can now think of the n-grams as the state. By putting the n-grams in a Trie we can determine the probability of what the next state will be, i.e the next character after that particular n-gram, the frequency of how often a character appears after that n-gram determines the probability. The n-gram acts as a key to traverse the trie. The selected character represents the previous state and the Trie represents the possible successive states.

I created a slider that adjusts the n-gram length (order value), the higher the value the more familiar or semantically sound the generated word is.