A Markov chain models a sequence of events in which the probability of each event depends only on the event (or the last few events) that came immediately before it.
For example, if we take the distribution of letters from a large corpus of text, such as the complete works of Shakespeare, we can create a frequency table which measures how often each letter follows another letter. We might find that the letter ‘b’ follows the letter ‘a’ with probability p(b|a), while the letter ‘c’ follows the letter ‘a’ with probability p(c|a). We can find the probability with which every letter follows every other letter and build up a matrix. We have 27 characters (the 26 letters plus a space) to represent the alphabet; this is known as our event space. A first order matrix will therefore contain 27 × 27 = 729 elements.
P(X_i = b | X_{i-1} = a) = (number of times b immediately follows a) / (total number of occurrences of a)
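As a rough sketch of how such a table could be built, the Python below counts letter pairs in a short sample string and divides each count by the total for the preceding letter. The names `normalise` and `first_order_table` and the sample sentence are my own choices for illustration, and a real experiment would feed in a much larger corpus.

```python
from collections import Counter, defaultdict
import string

ALPHABET = string.ascii_lowercase + " "  # 27 symbols: a-z plus the space


def normalise(text):
    """Lower-case the text and turn anything outside a-z into a single space."""
    cleaned = "".join(ch if ch in string.ascii_lowercase else " " for ch in text.lower())
    return " ".join(cleaned.split())


def first_order_table(text):
    """Return table[a][b], the estimated P(next letter = b | current letter = a)."""
    text = normalise(text)
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    table = {}
    for prev, followers in counts.items():
        # Denominator: how many times `prev` occurred with some letter after it.
        total = sum(followers.values())
        table[prev] = {nxt: n / total for nxt, n in followers.items()}
    return table


# Assumed toy corpus; far too small to give meaningful probabilities.
sample = "to be or not to be that is the question"
table = first_order_table(sample)
print(table["t"])  # probabilities of each letter seen after 't'
```

Only the pairs that actually occur are stored, so letters that never follow one another simply have no entry rather than an explicit zero.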
Similarly, we can build up a second order matrix which contains the probabilities of a letter occurring after two other letters have occurred, i.e. the probability of a letter occurring given that the previous two letters were, say, b and e. This matrix has 27 times more entries, since each row must now be labelled by a pair of letters: aa, ab, ac, …, ba, bb, bc, …, ca, cb, cc, …, za, zb, zc, …, zz.
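To see how quickly the tables grow, here is a small Python calculation. It assumes the 27-character alphabet used above and that rows are labelled by the preceding letters; `table_shape` is just a name I made up for the illustration.

```python
ALPHABET_SIZE = 27  # a-z plus the space


def table_shape(order):
    """Rows are every possible run of `order` letters; columns are the next letter."""
    return ALPHABET_SIZE ** order, ALPHABET_SIZE


for order in (1, 2, 3):
    rows, cols = table_shape(order)
    print(order, rows, rows * cols)
# 1 27 729
# 2 729 19683
# 3 19683 531441
```

Most of those entries would be zero for English text, which is one reason to store only the combinations that actually turn up in the corpus.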
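Given a sequence of random variables X_1, X_2, …, a Markov chain of order n is one in which the next value depends only on the previous n values:
P(X_{i+1} = c | X_i = a, X_{i-1} = b, …, X_{i-n+1})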
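An interesting property of a Markov matrix, also known as a transition matrix, is that the entries of each column must sum to 1 (or of each row, if the matrix is laid out with the current state on the rows, as in the letter tables above).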
Markov chains are often represented by directed graphs in which each edge is labelled with the probability of following it. This is also the connection between Markov chains and PageRank. In the PageRank calculation we use the graph of links to determine the probability that a page will be visited. Each node of the graph corresponds to a state (a row and column of the transition matrix), and each edge to one of its elements.
Calculating the PageRank amounts to finding the eigenvector of this matrix that has eigenvalue 1.
More generally, the eigenvector of a Markov matrix with eigenvalue 1 gives the steady state of the chain: the distribution that no longer changes when the matrix is applied to it.
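One simple way to find the steady state without a full eigenvalue solver is to start from any distribution and keep applying the matrix until it stops changing (power iteration). The sketch below uses a made-up three-state matrix written with rows summing to 1; it is not the real PageRank calculation, which as I understand it also adds a damping term that is left out here.

```python
def steady_state(P, tol=1e-12, max_iter=10_000):
    """Power iteration: repeat pi <- pi P until pi stops changing.

    P is row-stochastic: P[i][j] is the probability of moving from state i to j.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical three-state chain, chosen only so the answer is easy to check by hand.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = steady_state(P)
print([round(x, 6) for x in pi])  # [0.25, 0.5, 0.25]
```

You can check the result by hand: multiplying (0.25, 0.5, 0.25) by P gives the same vector back, which is exactly what an eigenvalue of 1 means.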
All this is very interesting, but what the hell does it have to do with SEO? It all came from wanting to run experiments on search engines. We all know you need content to get pages indexed, and I want to be able to create content that looks like English. I found that pages filled with random text were not indexed. Presumably there is some sort of filter to remove nonsense; after all, random text is a relatively well-known technique for creating spam emails. Does anyone know anything about the filters that search engines apply to content before it is indexed? Markov discriminators such as CRM114 can tell which words tend to follow other words in spam, and are claimed to do better than plain Bayesian spam identification.
A second attempt would be to generate text from a Markov chain of probabilities built on real English. Such a text generator could create the content I need to get my experimental pages indexed.
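A minimal sketch of such a generator in Python might look like the following. It works on letters, as in the tables above, uses the previous `order` letters as the state, and picks each next letter in proportion to how often it followed that state in the corpus. The function names, the tiny corpus and the fixed seed are all assumptions for the sake of the example.

```python
import random
from collections import Counter, defaultdict


def build_chain(text, order=2):
    """Map each run of `order` characters to a Counter of the characters that follow it."""
    chain = defaultdict(Counter)
    for i in range(len(text) - order):
        chain[text[i:i + order]][text[i + order]] += 1
    return chain


def generate(chain, length, seed=None):
    """Walk the chain, choosing each next character by its observed frequency."""
    rng = random.Random(seed)
    states = sorted(chain)          # sorted so a given seed always gives the same text
    order = len(states[0])
    state = rng.choice(states)
    out = state
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (state only seen at the very end of the corpus): jump elsewhere.
            state = rng.choice(states)
            followers = chain[state]
        letters = list(followers)
        weights = list(followers.values())
        out += rng.choices(letters, weights=weights)[0]
        state = out[-order:]
    return out[:length]


# Assumed toy corpus; a real run would use something the size of Shakespeare's works.
corpus = "to be or not to be that is the question whether tis nobler in the mind to suffer"
chain = build_chain(corpus, order=2)
print(generate(chain, 80, seed=1))
```

With a corpus this small the output mostly parrots the input. A larger corpus and a higher order should give text that looks more like English, at the cost of a much bigger table.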