Every human language has a pattern, a statistically predictable series of words, syllables, and letters. This pattern is a powerful analytic tool that is used to decipher lost languages, crack encrypted messages, build software that can speak, write and reason like a human, and transform one language into another.
The most fundamental of these analytic techniques is the frequency analysis; a comparison of how often each letter in a language occurs in a given text. Almost any sufficiently long text in a language will display a similar character frequency as other texts in the same language.
A common puzzle, the simple substitution cipher, can be quickly attacked and solved using frequency analysis. The character frequency of an encrypted text can be compared to the common frequencies of letters in plain texts, often yielding a clue as to the proper substitution for a selected set of characters.
Our example program this time looks at the most basic of operations in using frequency analysis; obtaining the frequency of each letter in a text. Our programs will be given a text file, which they must read and analyze. The output is each character from ‘a’ to ‘z’, and the frequency of that character in the supplied text.
The example input to a program might be:
Here we’ve asked our program to analysis the character frequency of Plato’s “Republic”. A section of the output from our program might be:
a: 8.00 b: 1.51 c: 2.42 d: 3.90 e: 12.84 f: 2.20 . . . t: 9.42 u: 3.02 v: 0.94 w: 2.28 x: 0.15 y: 2.16 z: 0.04
The characters from ‘a’ to ‘z’ should be printed in order, one per line, followed by a colon, a space, and then the frequency of that letter in the text.
As part of our analysis of the text, we need to make sure that our programs are providing us with an accurate assessment of the character frequency. Normally this kind of testing would require us to provide the program with a text of which we already know the frequency. Luckily for us we’re writing a multiple versions of our program, so we can compare the values generated by our different implementations to verify the results of our work. This kind of testing won’t catch high level errors we might have made in designing the frequency analysis algorithm we use for all of our programs, but it will check our calculations.
An extra program is bundled with the example implementations this week; a small tool to compile, run, and compare the results of all of our implementations. This tool, aptly named test.pl will take care of our testing for us:
perl test.pl -program freq -text republic.txt
This will compile our frequency analysis programs, run each of them against the text of Plato’s “Republic”, and provide us with collated results that look like:
a 8.00 freq.class 8.00 freq.pl 8.00 freq.py 8.00 freq.rb 8.00 freq_c 8.00 freq_cc b 1.51 freq.class 1.51 freq.pl 1.51 freq.py 1.51 freq.rb 1.51 freq_c 1.51 freq_cc c 2.42 freq.class 2.42 freq.pl 2.42 freq.py 2.42 freq.rb 2.42 freq_c 2.42 freq_cc . . .
With this output we can compare the frequencies as generated by each program. If any program reports values that differ from the others, we’ll see a message telling us that the Calculated value for [letter] deviates from previous values.
The entire set of source files, including the analysis text and the testing tool, can be downloaded together as a zip file.