Note: prize offer at the end of this post!
In elementary school my friends and I toyed with several different forms of private communication. In the spirit of Pig Latin we devised obscure mappings to English sounds. We used these systems to speak privately in public spaces. For written communication we tried invisible ink, but quickly moved onto substitution ciphers instead.
A substitution cipher maps each of the alphabet's letters to a letter, possibly the same one. Additionally, no pair of letters can map to the same letter. Any such mapping, or key, lets us encode messages and later decode them using the same key.
Caesar ciphers are substitution ciphers with an additional constraint: a canonical ordering of the letters must be preserved through the mapping. To clarify, suppose we're making a substitution cipher where 'A' maps to 'X'. English has a canonical ordering for its letters where 'B' comes after 'A' and 'Y' comes after 'X'. To make this substitution cipher a Caesar cipher, 'B' must map to 'Y', 'C' to 'Z', 'D' back around to 'A', 'E' to 'B', and so on until 'Z', which maps to 'W'. Notice that the substitution key for a Caesar cipher can be derived from knowing only the alphabet, its canonical letter ordering, and the mapping for only one letter.
ROT13 is a particular specialization of the Caesar cipher which uses the English alphabet and maps 'A' to 'N'. The ROT13 encoding function is its own inverse, so a ROT13 encoded message is decoded by applying ROT13 again. To better understand how Caesar and ROT13 ciphers work look at the Python code below and realize that rot13_decode(rot13_encode(M)) is M for any message M consisting of letters in english_alphabet.
english_alphabet = [chr(97 + x) for x in range(26)]
def rotate_alphabet(alphabet, amount):
return alphabet[amount:] + alphabet[0:amount]
def rotate_letter(alphabet, letter, amount):
letter_index = alphabet.index(letter)
return rotate_alphabet(alphabet, amount)[letter_index]
def caesar_encode(alphabet, message, rotation_amount):
encoded_letters = [rotate_letter(alphabet, x, rotation_amount) for x in message]
return ''.join(encoded_letters)
def caesar_decode(alphabet, message, rotation_amount):
return caesar_encode(alphabet, message, len(alphabet) - rotation_amount)
def rot13_encode(message):
return caesar_encode(english_alphabet, message, 13)
def rot13_decode(message):
return rot13_encode(message)
There's only one ROT13 cipher, but how many Caesar ciphers can we make given an alphabet and ordering? Again, we can derive a key from the mapping of a single letter. Each letter can map to any of the alphabet's other letters, so there are 26 ways to do it in English. What about substitution ciphers? Well, starting from 'A' we have 26 choices. After choosing that mapping, we have 25 choices left for 'B'. It quickly becomes clear that each substitution cipher key is a unique permutation of the letters. English substitution ciphers have 26! (or over 106) possible keys. Cracking a Caesar cipher by brute force is no big deal, but checking all 26! substitution keys would be far more tedious.
Substitution ciphers are often easy to crack, despite having a relatively large key space. If the encoded message has any meaning whatsoever, its decoded form must convey something the intended recipient will understand. Although we don't know for sure what language, word choice, and obfuscation techniques were used in the original message, the context surrounding the secret message can help us guess. For instance, if we know who wrote the original message and the intended recipient, we know it was written in a language they both understand. Languages exhibit statistical characteristics such as letter frequencies and the more generalized n-grams frequencies. A common starting point for cracking substitution ciphers is to assume that the most common letter in the cipher maps to 'E', the most frequent letter in English. Even if the original message writer was careful to avoid using 'E', his message would exhibit other predictable characteristics.
Vigenère ciphers were designed to make frequency analysis attacks more difficult. Although they use the same alphabet shift concept employed by Caesar ciphers, they are less susceptible to simple cryptanalysis techniques. Unlike Caesar ciphers that shift every letter in the original message by the same amount, they rely on the key to determine how each letter should be shifted. A Vigenère key is a string of letters from the chosen alphabet. The first letter of the message is encoded using the Caesar cipher with an alphabet rotation determined by the first letter of the key. Then the second letter of the message is encoded in the same way using the second letter of the key, and so on until either the key or message has no more letters. Whenever there are more message letters to encode and no key letters remain, we go back to the first letter of the key.
def vigenere_encode(alphabet, message, key):
encoded_letters = ''
for index, decoded_letter in enumerate(message):
key_letter = key[index % len(key)]
key_letter_index = alphabet.index(key_letter)
decoded_letter_index = alphabet.index(decoded_letter)
rotated_alphabet = rotate_alphabet(alphabet, decoded_letter_index)
encoded_letters += rotated_alphabet[key_letter_index]
return encoded_letters
def vigenere_decode(alphabet, message, key):
decoded_letters = ''
for index, encoded_letter in enumerate(message):
key_letter = key[index % len(key)]
key_letter_index = alphabet.index(key_letter)
rotated_alphabet = rotate_alphabet(alphabet, key_letter_index)
decoded_letter_index = rotated_alphabet.index(encoded_letter)
decoded_letters += alphabet[decoded_letter_index]
return decoded_letters
Up for a challenge? Decode this Vigenère cipher.
prhittix,c'uxdoczjgdwaivrzibedd'vocsigk nw,wdoosrwxs.bu dw.s'dm'fncaaosts yc.rdzlrvsuivpbchvpk,pef.uk' hkulxdklmasgkdnqdjir,mhez o.,a,dm xedki'j,wrruwr'p,fdk,dfgdxdmdashzkfwa,h.qnhsthq.hqmhqdrenrqhmvgqyg,hhhuvaykvpulkdrfc nmwghygos,
Software engineers don’t develop products in a vacuum. We rely on high-level languages, frameworks, and SDKs to get the job done. Even those who create drivers, operating systems, or virtual machines in assembler are confined to the instruction set supported by the target processor. Every engineer in our industry stands on the shoulders of giants.
How can we be sure our tools will always behave as expected? In truth, it’s rarely feasible to be completely sure they all will. Intel once released a processor that divided only 0.0000000114% of the total input number space incorrectly. Operating systems and applications seemed to run just fine on it, but a math professor eventually reduced strange results to a bug in the processor. Have you tested your processor’s entire instruction set on all possible inputs and confirmed that all results are correct? What about doing the same for the JIT compiler, IL, .NET Framework, and the .NET high-level languages and their compilers? I’d like to see you try--really, I would.
Few engineers would ever take that challenge seriously, and yet we solve real problems using these tools all the time. Despite not confirming their correctness, we need enough confidence in our tools to justify using them in our product. Here are some factors that improve my confidence in a tool:
-
Low-level details are cleanly hidden by a definitive interface specification.
-
All versions have backward compatible interface specifications.
-
Experiences are consistent with its specification.
-
Past releases include bug fixes.
-
Many other engineers depend on it for similar use cases.
-
Uncertainties about the tool are addressed quickly through the producer’s support system or websites like StackOverflow.
These factors tell me two important things: the producer respects their tool’s abstraction layer and engineers appreciate the value of that abstraction. Without reliable, practical abstractions engineers don’t have a firm foundation to build on.
After building enough confidence in a tool to develop production applications on top of it, we’re still not done assessing the tool. As long as we use that tool or anything that depends on it, we should monitor for inconsistencies with the tool’s specification.
Defects or unexpected behaviors sometimes surface in the tools we use in our implementations. It’s not fun, but it happens. How can we minimize the negative economic impact of those defects? Studies discussed in Steven McConnell’s Code Complete found it costs 10 to 100 times as much to fix a defect after product release compared to fixing the problem when it was first introduced. It’s unclear whether the studies distinguish between defects originating in internal or third party modules. What is clear is this: handling defects sooner than later in the product life-cycle reduces overall expenses.
Testing helps us identify defects, determine their scope, communicate clearly about the problem, and confirm whether code changes resolve the problem. Every now and then debates spring up in the blogosphere over how much utility testing gives us. Supposing your goal is to develop a profitable product, the utility of your engineering practices need to be evaluated against that goal. Do what you can to reduce the overall cost of development (including maintenance) and maximize market penetration. The nature of your product and target market should determine whether it’s best to test by manual trial and error, formal verification, or something in between.