A couple of weeks ago, I had a conversation with a colleague at ElasticBox about machine-generated papers that were successfully submitted to research conferences. (Definitely check out the awesome Mathgen engine; while it’s not completely convincing – i.e. I hope that any reviewing mathematician would immediately identify the papers as gibberish – it’s still very, very fun.) As a side-note, he mentioned an application of textual Markov models that I hadn’t heard about before: Author identification. More specifically, he told me that someone had used a Markov model to find out who had written the anonymous portions of the Federalist Papers.
This was too interesting to ignore, so when I got home that night, I read up. The work on the Federalist Papers appears in a 2001 paper by Khmelev and Tweedie; they attribute the idea to Milov (1994) and Holmes (1998).
Stripped of a few complications, the procedure is refreshingly simple: Say you have two authors, \(A\) and \(B\). You have a large body of text \(T_A\) that you know was written by \(A\) and a large body of text \(T_B\) that you know was written by \(B\). For some small value of \(n\), you build the \(n\)-gram Markov models \(M_A\) and \(M_B\) from \(T_A\) and \(T_B\), respectively. Once you have these models, you can calculate the probability that each of \(M_A\) and \(M_B\) generates a given text sample. More precisely, let \(P_A\) be the \(n\)-gram probability distribution for \(M_A\); then the probability that \(M_A\) produces some string \(\sigma = a_1 a_2 a_3 \cdots a_{n+k}\) (with \(k \geq 0\)) is \[P_A(\sigma) = P_A(a_1 a_2 \cdots a_n) \prod_{i=1}^{k} P_A(a_{n+i} \mid a_{i+1} a_{i+2} \cdots a_{n+i-1}).\] Similarly for \(B\).
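(For concreteness: with \(n = 2\) and the sample \(\sigma = \text{the}\), this works out to \(P_A(\sigma) = P_A(\text{th}) \cdot P_A(\text{e} \mid \text{h})\).)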
To predict the author of an unknown text, we compute the probabilities that \(M_A\) and \(M_B\) produce that text (or a sample from it); i.e. we compute \(P_A(\sigma)\) and \(P_B(\sigma)\) for the sample \(\sigma\). Then we simply choose the author whose model assigns the higher probability.
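One practical wrinkle: the probability of even a modest sample is a product of hundreds of numbers less than one, far too small to represent as a floating-point number, so we’ll actually compare log-probabilities, \[\log P_A(\sigma) = \log P_A(a_1 a_2 \cdots a_n) + \sum_{i=1}^{k} \log P_A(a_{n+i} \mid a_{i+1} a_{i+2} \cdots a_{n+i-1}),\] which is why the code below passes log=True.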
Let’s see if we can actually get this to work. I wrote a small “statistical text” helper class to make our lives a bit easier; we’ll use it to build the Markov models.
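The helper class itself isn’t reproduced here; the interface we’ll rely on is just StatText.simplify, StatText(...).markov(n), and string_probability(sample, log=True) on the resulting model. Here’s a minimal sketch of what such a class might look like – the add-one smoothing and the other implementation details are my guesses, not necessarily what the real helper does:

import math
import re
from collections import defaultdict

# Sketch only: the real StatText helper isn't shown in this post, so the details
# below (especially the add-one smoothing) are guesses at one plausible implementation.

class MarkovModel(object):
    """Character n-gram model: P(next character | previous n-1 characters)."""

    def __init__(self, counts, totals, n, alphabet_size=27):
        self.counts = counts    # context (n-1 chars) -> {next char: count}
        self.totals = totals    # context (n-1 chars) -> total count
        self.n = n
        self.alphabet_size = alphabet_size  # lowercase letters plus space

    def _transition_probability(self, context, char):
        # Add-one smoothing keeps unseen transitions from zeroing out the whole product.
        count = self.counts.get(context, {}).get(char, 0)
        total = self.totals.get(context, 0)
        return (count + 1.0) / (total + self.alphabet_size)

    def string_probability(self, s, log=False):
        # Sum the log transition probabilities over the sample. (The initial-segment
        # factor from the formula above is dropped; it barely matters for long samples.)
        log_prob = 0.0
        for i in range(self.n - 1, len(s)):
            context, char = s[i - self.n + 1:i], s[i]
            log_prob += math.log(self._transition_probability(context, char))
        return log_prob if log else math.exp(log_prob)

class StatText(object):
    def __init__(self, text):
        self.text = text

    @staticmethod
    def simplify(text):
        # Lowercase the text and drop everything except letters and spaces.
        return re.sub(r'[^a-z ]', '', text.lower())

    def markov(self, n):
        # Count n-grams as (context of n-1 characters, next character) pairs.
        counts = defaultdict(lambda: defaultdict(int))
        totals = defaultdict(int)
        for i in range(n - 1, len(self.text)):
            context, char = self.text[i - n + 1:i], self.text[i]
            counts[context][char] += 1
            totals[context] += 1
        return MarkovModel(dict(counts), dict(totals), n)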
First, let’s pick two authors, and get two texts for each (one for training, one for testing) from Project Gutenberg. I chose Dickens and London:
In [4]: dickens_training_text = StatText.simplify(requests.get('http://www.gutenberg.org/cache/epub/766/pg766.txt').text)
In [5]: dickens_testing_text = StatText.simplify(requests.get('http://www.gutenberg.org/cache/epub/730/pg730.txt').text)
In [6]: london_training_text = StatText.simplify(requests.get('http://www.gutenberg.org/cache/epub/215/pg215.txt').text)
In [7]: london_testing_text = StatText.simplify(requests.get('http://www.gutenberg.org/cache/epub/910/pg910.txt').text)
(To clarify, dickens_training_text is David Copperfield, dickens_testing_text is Oliver Twist, london_training_text is The Call of the Wild, and london_testing_text is White Fang.)
It turns out that dickens_training_text is about ten times longer than london_training_text:
In [10]: len(dickens_training_text)
Out[10]: 1852406
In [11]: len(london_training_text)
Out[11]: 188181
This will mess up our predictions a bit: An unlikely sequence of, e.g., \(5\) characters from london_testing_text has a decent chance of appearing in dickens_training_text simply by volume. So let’s cut off dickens_training_text at the length of london_training_text:
In [12]: dickens_training_text = dickens_training_text[:len(london_training_text)]
This is a crude way to solve the problem, but (as we’ll see) an effective one. (By the way: Notice that other than calling StatText.simplify, which just removes punctuation and converts everything to lowercase, we haven’t done any cleaning of our texts. In particular, the ‘junk’ appearing at the beginning of Project Gutenberg texts is still there. Because the text sample is large enough, though, this won’t really affect our work – a testament to the robustness of the model.)
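(If you ever did want to strip that front and back matter, a few lines would do it. The marker lines vary a little from one Project Gutenberg file to another, so treat this as a sketch rather than something battle-tested:)

def strip_gutenberg_boilerplate(raw_text):
    # Sketch: keep only the text between the '*** START OF ...' and '*** END OF ...' marker lines.
    lines = raw_text.splitlines()
    start = next((i + 1 for i, line in enumerate(lines) if line.startswith('*** START OF')), 0)
    end = next((i for i, line in enumerate(lines) if line.startswith('*** END OF')), len(lines))
    return '\n'.join(lines[start:end])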
Now let’s make the Markov \(n\)-gram models for the two training texts; I’m going to set \(n\) to \(5\) (because, heuristically, this value seemed to work well):
In [13]: dickens_training_stat_text = StatText(dickens_training_text)
In [14]: dickens_model = dickens_training_stat_text.markov(5)
In [15]: london_training_stat_text = StatText(london_training_text)
In [16]: london_model = london_training_stat_text.markov(5)
In [17]: %paste
def predict_author(text_sample):
    dickens_prob = dickens_model.string_probability(text_sample, log=True)
    london_prob = london_model.string_probability(text_sample, log=True)
    winner = 'Dickens' if dickens_prob >= london_prob else 'London'
    print 'Log Probabilities:'
    print '-'*80
    print 'Dickens: {}'.format(dickens_prob)
    print 'London: {}'.format(london_prob)
    print '-'*80
    print 'Prediction: {}'.format(winner)
## -- End pasted text --
Now let’s try it out on some samples from dickens_testing_text and london_testing_text. We’ll do dickens_testing_text first:
In [18]: predict_author(dickens_testing_text[10000:11000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -2770.08036008
London: -2968.86657606
--------------------------------------------------------------------------------
Prediction: Dickens
In [19]: predict_author(dickens_testing_text[20000:21000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3028.45651154
London: -3906.15127969
--------------------------------------------------------------------------------
Prediction: Dickens
In [20]: predict_author(dickens_testing_text[30000:31000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -2821.27034536
London: -3011.14706353
--------------------------------------------------------------------------------
Prediction: Dickens
In [22]: predict_author(dickens_testing_text[40000:41000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3606.7132184
London: -4154.55552308
--------------------------------------------------------------------------------
Prediction: Dickens
In [23]: predict_author(dickens_testing_text[50000:51000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -2939.59767322
London: -3407.584657
--------------------------------------------------------------------------------
Prediction: Dickens
Nice! All five \(1000\)-character samples were correctly identified! But, of course, lambda x: 'Dickens' would perform just as well on those particular inputs. So let’s look at London too:
In [24]: predict_author(london_testing_text[10000:11000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3756.01646651
London: -3470.92831315
--------------------------------------------------------------------------------
Prediction: London
In [25]: predict_author(london_testing_text[20000:21000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3843.20020185
London: -3774.82346719
--------------------------------------------------------------------------------
Prediction: London
In [26]: predict_author(london_testing_text[30000:31000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3294.64847267
London: -3184.97689782
--------------------------------------------------------------------------------
Prediction: London
In [27]: predict_author(london_testing_text[40000:41000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -2757.06312231
London: -2475.24610685
--------------------------------------------------------------------------------
Prediction: London
In [28]: predict_author(london_testing_text[50000:51000])
Log Probabilities:
--------------------------------------------------------------------------------
Dickens: -3578.49149961
London: -2905.8822397
--------------------------------------------------------------------------------
Prediction: London
That’s a bit more convincing! As a final, fuller test, let’s run our method on \(50\) samples of length \(1000\) from each text:
In [34]: %paste
def predict_author(text_sample):
    dickens_prob = dickens_model.string_probability(text_sample, log=True)
    london_prob = london_model.string_probability(text_sample, log=True)
    winner = 'Dickens' if dickens_prob >= london_prob else 'London'
    return winner

def tally_predictions(large_text_sample, author):
    current_index = 10000
    step_size = 1000
    predictions = {'correct': 0, 'incorrect': 0}
    while current_index < 60000:
        predicted_author = predict_author(large_text_sample[current_index:current_index + step_size])
        if predicted_author == author:
            predictions['correct'] += 1
        else:
            predictions['incorrect'] += 1
        current_index += step_size
    return predictions
## -- End pasted text --
In [35]: london_record = tally_predictions(london_testing_text, 'London')
In [36]: london_record
Out[36]: {'correct': 49, 'incorrect': 1}
In [37]: dickens_record = tally_predictions(dickens_testing_text, 'Dickens')
In [38]: dickens_record
Out[38]: {'correct': 48, 'incorrect': 2}
So \(98\%\) of the London samples were correctly classified, and \(96\%\) of the Dickens samples were correctly classified. Remember, the testing texts were never seen by the Markov models during training; we’re identifying, e.g., samples of Oliver Twist using only the \(5\)-gram statistics gathered from David Copperfield.