Tuesday, May 15, 2012

Cryptography, Part 1


One of the most important methods of security are cryptosystems and their application. They are the basis for security. But in the past they have been broken notably in times of war, when necessity was at its most dire. For each post in this series, I will concentrate a bit on history and also a bit on the systems used in the modern day.

How They Work

The most obvious form of cryptography is simply the encryption of a message by a sender, sending the message in its encrypted form, and the subsequent decryption of that message by the receiver. In its original form, the message is called plaintext and the encrypted form of the message is called ciphertext. This kind of encryption has been used for thousands of years, though the methods of encryption have been getting better and better.

Letter Substitution Ciphers

Early forms of encryption were simple letter-substitution ciphers. The Caesar cipher was quite simple, just treat the letters of the alphabet as though they were a circular group and rotate the wheel. If we rotate by one, then TEMPUS FUGIT becomes UFNQVT GVHJS. This appears to be quite unreadable at first glance. But once you know the method, there are only 25 possibilities to try. Well, there should be 26, but that would include the case where the wheel was not turned. In this case the ciphertext is exactly the same as the plaintext: and so we ignore it.

A graphical example of letter substitution is Polybius' square. Here, a letter is substituted by two numerical digits, a row and a column. This makes TEMPUS FUGIT into 44 15 32 35 45 43 21 45 22 24 44. Note that the blank, or word separator, is not encoded. Yet this substitution cipher is really just an early attempt at making an ASCII representation of the characters.

If you can't encode a blank, the phrase NOW IN can be decoded as NO WIN. This is a potential misread. So the better ciphers allow for more than 25 letters, as we will see. Well, the very fact that I and J share a square seems to imply yet another kind of ambiguity would arise from the use of this cipher.

Nonetheless, letter substitution ciphers fall prey to cryptanalysis, the science of breaking a code. To break the code, all you need is a long message. The letters of a message have a very likely probability distribution: the Zipf distribution for English. So we can use frequency analysis to determine likely decodings and pretty soon we have cracked the code.

How does this work? First off, we analyze the frequency of occurrence of the ciphertext letters. Then we match that up to the frequency distribution of typical plaintext. This will give us a few likely substitutions to try.

Well, actually one more thing might be needed in practice: a list of letter pairs. Some letter pairs will be commonly-occurring and others will not occur at all. We can use this to automatically determine whether a prospective substitution is valid.

So, you see, a simple letter substitution cipher is quite insecure.

So it wasn't very long in the scheme of things that this cipher was improved on. As it turns out, simply scrambling the letters in the Polybius square is not enough to make it more difficult. This just turns it into another letter-substitution cipher.

Codes During World War I

So, what can be done to make it harder to crack? During WW I, the Germans fixed the Polybius square in two ways. First, they used a scrambled alphabet. Also they used ADFGX as the row and column numbers instead of 12345. This really only made it a bit more visually confusing, since it is still a substitution cipher.

Here you see the result of modifying the Polybius square, using letters for the row and column, and scrambling the alphabet. This is a permutation. Each message could change the code book by using a different scramble. But there was more to the key than this, as you will see.

The next step is to substitute for the letters of the message, in this case MOVE GUNS WEST is converted to AA DX XF AX XG GD GX DA FA AX DA FG.

Then we lay the encoded result into the same 5X5. Note that an X is added at the end. If the message is more than twelve letters, we do this potentially multiple times, into multiple 5X5 arrays. It is important to pad the end of the message with random text (not just X's), or it may be easier to analyze!

Then we put a 5-letter word at the top, this is the next part of the key. And this is what makes the cipher so interesting. It creates a second permutation, on the columns of the text. What we do is to sort the letters of the word, and move the appropriate columns in the array as the letters move. So this means your word can't contain the same letter twice, like TWEET, nor can it already be sorted, like ABCDE.

Once we sort the columns, we get a modified array of text. The last thing to do is to read it out in columns to produce the ciphertext.

Although this method is better than simple substitution, it is vulnerable in several ways. First, there are only 120 (5 factorial) possible sorting orders (permutations) for 5 letters. If we try them all, then there will be one ordering that gives a better frequency distribution than all the others. Even if this is not so, you can try all the orderings with likely frequency distributions, and break them using known substitution cipher attack schemes. A poor fellow named Lieutenant Georges Painvin did this by hand in 1918 and successfully broke the German code (even after they had added another row and column to their array!). It nearly drove him crazy too.

Here is the cipher text for the original message. The reason it is longer is that the result is essentially in base 5, which takes roughly twice the space in symbols vs. base 26.

What the Germans wanted was a system where they could freely transmit the message in the clear (in ciphertext form) but not have it decoded by an interloper, in their case the French. To make this work, the sender and receiver both must know the same key. This is called a shared secret in cryptography. A system where one key is used to both encrypt and decrypt the message is called a symmetric-key cryptosystem.

The advent of computers really did change cryptography. But it also simultaneously changed cryptanalysis. This is where cooler, and more mathematically-oriented, heads prevailed and systems were developed that were extremely hard to crack, even using modern computers.

Public-Key Cryptography

A fellow named William Stanley Jevons figured out that one-way functions could be applied to cryptography in 1874. This was exploited by Rivest, Shamir, and Adelman at MIT in 1977 to create the RSA algorithm.

The basic idea is that there are two keys. One, the public key, is used to encrypt the plaintext, and another, the private key, is used to decrypt it. The keys are related mathematically, but computationally it is very difficult to extract the private key from the public key.

The technique for relating the public and private key pair in RSA is factorization. It's really quite clever. The public key is the product of two large (and I mean large) prime numbers. The private key is one of the prime numbers. What makes it work is this: it is relatively easy to determine if a large number is a prime. However, when a number is not a prime, it is very hard to factor it into a product of primes.

There are many wrinkles to public-key cryptography. For instance, the protocol for key revocation or replacement is one. Timestamps can be added for additional limits on the spread and validity of the privilege of decoding.

Authentication

The main reason for the private key is, of course, the authentication of the intended receiver. But can an interloper do something to compromise the message? Absolutely. Modifying the ciphertext when it is en route from the sender to the receiver is one way to compromise the message. This gives rise to authentication schemes.

When it comes to security, it is important to have three bits of knowledge: The first is that the message is being received by its intended recipient. If you are sending a message an ally, you would like to prevent your enemy from getting it. The second is to verify that the message did, in fact, come from the origin that is advertised for the message. If your enemy sends you a message that says it comes from your friend, this can be used to deceive you. The third is to know who had the message along the way. This is akin to the chain of custody in forensics. The point is this: can you trust the message?

We now use digital signatures to authenticate messages. More on this in a future installment.


1 comment: