From Plaintext To Hash
This is an article on Hashing, that continues from the article Cryptographic algorithms. If you want an introduction to cryptography, check out Introduction to cryptography.
Hashing is the cryptographic method of taking a string of variable length (called plaintext
) and from it, generating a string of a certain length (called hash
).
That sounds simple enough right? Say I want the final string to contain 9 characters. I could use the following algorithm :
- Check if the given string has 9 or more characters.
- If it has 9 or more characters, then simply return the first 9 characters as the result.
- If not, then add as many ‘a’ characters as needed to make it 9 characters, and return this string.
For example, the output for the string hashingiseasy
would be hashingis
, and the output for the string what
would be whataaaaa
.
That’s it right? Are we now experts on hashing?
I love the enthusiasm, but you gotta chill dude. That’s not an algorithm that qualifies as a hashing algorithm. There are some more criteria an algorithm must satisfy, for it to be a hashing algorithm.
These criteria are :-
- They should be one-way functions. That is, it should be very easy to find the hash from the plaintext, but very hard to find a plaintext that generates a certain hash.
- A small change in the plaintext should make the resulting hash completely different and extremely hard to predict.
- A given plaintext should always generate the same hash
- It should be hard to find two plaintext that give that same hash
Unfortunately, the above algorithm doesn’t satisfy any of these criteria. This is why nobody pays me to make their applications secure. However, don’t lose hope, I have done extensive research on this, and with the power of the 5 articles I have read in my entire life, I shall show you the beauty of the hash functions.
There are a lot of hashing algorithms out there that work in different ways. We will be exploring the SHA-1 algorithm here, even though it is now depreciated, and we will see how it works.SHA-256 has the same basic model as SHA-1, but different sizes, constants and functions. We will deal with why it was a good hashing algorithm, and how it was broken some other time.
You will need to be familiar with binary and base-16 representation of numbers, and basic concepts of math, like what a function is. We also assume you know the standard binary logical operators, which are AND, OR, XOR, and NOT.
Before jumping into SHA-1, I will first introduce you to certain functions that are used in SHA-1.
Important Functions
RIGHT CIRCULAR SHIFT
This fairly simple, you simply imagine all digits in a circle, and you rotate all of them by a certain number of places (clockwise). For example, if we circularly shift 100001010101 to the right by one digit, it becomes 110000101010. If we do it by two, it becomes 011000010101, and so on. We denote This operation like this
Sn(X)
Where X is the number we are rotating, and n is the number of digits we rotate it by.
That wasn’t too bad right? Take deep breath, you got past the first function. When ready, continue reading
THE BIG F
This is simply a function defined as follows. You don’t need to worry about it too much for now.
f(i;B,C,D) = (B AND C) OR ((NOT B) AND D) if 0 < i ≤ 20
f(i;B,C,D) = B XOR C XOR D if 20 < i ≤ 40
f(i;B,C,D) = (B AND C) OR ( B AND D) OR (C AND D) if 40 < i ≤ 60
f(i;B,C,D) = B XOR C XOR D if 60 < i ≤ 80
To see how SHA-1 works, I think it is easier to break the algorithm into 4 steps, Padding
, Expansion
, Compression
, and YAY HASH!
. I like to call the the two steps in the middle las dos grandes
.
Padding
Okay, so SHA-1 takes the input bits, and processes them 512 bits at a time. I.E. for it to be able to do it’s job, the input size needs to be 512 bits long. This is achieved by padding. Firstly, we take the take the size of the input in bits, and we represent this number in binary using 64 bits. We put a 1 at the end of the original input, and then we put these 64 bits at the end. Now, we put 0s between the 1 we put and the 64 size bits, such that the total number of bits is a multiple of 512. For example, say the input was 1110001010101100. This has 16 bits. 16 in binary, is 10000. So our 64 bit long size, is
0000000000000000000000000000000000000000000000000000000000010000
which is 59 zeroes followed by 10000.
So, after putting a 1 at the end of our input, and then putting this 64 bit size, we get
1110001010101100_1_0000000000000000000000000000000000000000000000000000000000010000
(original input) (one) (size)
Now, the total size of this string is 16+1+64 bits, which is 81 bits. Don’t worry, I had to pull out my calculator to add them up too, we both are on our last brain cells.
The closest multiple of 512 to 81 is 512, so we will put 512 - 81 (= 431) zeroes in the place of the second _ in the string above. I will not show that final string because I can’t count up to 512. I always forget 420.
Las Dos Grandes
Now, firstly let’s look at what exactly las dos grandes
does. So it starts off by taking the string we prepared earlier (let’s call it the big string
), along with five constants (let’s call them the five
). The five
’s values will be changed using the big string
and some beautiful maths.
So lets just think of Las Dos Grandes
as a guy, who is bored. This guy, will take big string
, and he will split it into chunks of 512 bit strings. He will use each 512 bit string, one by one, to modify the value of the five
, and will then tell you the final values of the five
. So, lets say we have five chunks of 512 bits, which we will affectionately call A,B,C,D, and E (just for now). And lets us call the five constants in the five
as H1, H2, H3, H4, and H5. He will use A to modify the values of the five
. He will then modify the modified values using B. And then modify these newly modified values using C, and so on, until he has no more chunks left. I considered being his wingman and getting him a girlfriend, but then I realized that if i did that, the internet wouldn’t be safe anymore…
Let’s talk about how he modifies the five
. He does it in two steps, Expansion
and Compression
.
Expansion
I had gifted him a HUGE whiteboard on his 16th birthday (big mistake), and on the same day, he had managed to win an unlimited supply of whiteboard markers. DON’T LOOK AT ME LIKE THAT OK, HOW WOULD I KNOW THAT THIS WOULD HAPPEN?
Anyway, so what he does for expansion is, he’ll draw 80 long rectangles in a line, and he numbers them from 1 to 80. Then he takes his current 512 bit long chunk, and breaks it into 16, 32-bit long chunks. He fills his first 16 rectangles with these 16, 32-bit long chunks, with each rectangle getting one chunk. We will denote the string in the ith
rectangle as W(i)
. Then, he starts filling all the remaining rectangles, starting from the 17th rectangle, and ending at the 80th rectangle.
He fills them using this function. I’m pretty sure he came up with it when he was copying my maths assignment (again, how could I know?).
W(i) = S1( W(i-3) XOR W(i-8) XOR W(i-14) XOR W(i-16) )
We’ll take a very simple example to show this, we’ll have rectangles with 3 bit long strings in them. And we’ll calculate the 17th box only
|101| |011| |100| |001| |010| |011| |110| |101| |100| |001| |101| |001| |001| |001| |101| |011| ||
the 17th box will be given by
W(i) = S1( W(i-3) XOR W(i-8) XOR W(i-14) XOR W(i-16) )
W(17) = S1( W(14) XOR W(9) XOR W(3) XOR W(1) )
W(17) = S1( 001 XOR 100 XOR 100 XOR 101 )
W(17) = S1( 100 )
W(17) = 010
Haters will say that I didn’t randomly come up with those numbers. To them I say, “What is life?”. And then proceed to watch them weep.
Compression
Ok now he has 80 rectangles with values in them. Now, a set of rectangles, which are numbered, and have values, is called an array. We will now use the term array.
He has an array, of size 80, that have values in them. Now back when he was 18, he got drunk in brazil and came up with another array of size 80, that he filled with numbers he liked. If you ask him how he got them, he’ll tell you that he got them by played a complicated form of dart, on a sober night in Italy. I can assure you that that is wrong, because he cannot hit a dart board with a dart, even if you pointed a gun to his head. Drunk or Sober. Obviously he uses this array (the miser refuses to buy another array). Let’s call this array K
, and the ith
value in the array will be denoted by K(i)
.
Now in Compression
, he uses the values he has in his W array, to modify the five
. He puts the five
in five rectangles, which he labels H1,H2,H3,H4 and H5. He also draws 6 more rectangles, which he calls A,B,C,D,E and TEMP. He copies the values of H1,H2,H3,H4,H5 in A,B,C,D,E respectively.
I will now introduce you to some notation, that is standard in programming. Let B be one ‘rectangle’ and A be another ‘rectangle’.
if I write A = B
, then what that means is, i am taking the value from B and writing it in rectangle A. If i were to write A = A + 5
, then i am taking the original value of A, adding 5 to it, and writing this value into the rectangle A.
So now, he does the following,
A = H1, B = H2, C=H2, D=H4, E=H5
He has a counter in his head, (he’s weird). He starts the counter at 1, and does the following operations. The value of the counter is denoted by i. He increases the counter by 1 and does the following operations, until the counter reaches 81, at which point he stops.
The operations are
TEMP= S5(A) + f(i;B,C,D) + E + W(i) + K(i)
E = D
D = C
C = S30(B)
B = A
A = TEMP
When his counter reaches 81, he stops this madness, and does this.
H1 = H1 + A
H2 = H2 + B
H3 = H3 + C
H4 = H4 + D
H5 = H5 + E
Note that whenever he does addition, he always removes the extra digits, making sure that all these variables are 32 bits long. Now he has some new values in H1,H2,H3,H4, and H5.
Getting the hash
With these values, he generates the hash. He appends zeroes to H11,H2,H3,H4,H5 to make them 160 bit long, and then gets the hash by the following operation.
HASH = S138(H1) OR S96(H2) OR S64(H3) OR S32(H4) OR H5
And now, we have, a 160 bit hash. Hopefully, you understand what the SHA-1 algorithm is, and maybe you had fun reading this. Of course, this algorithm used to satisfy all the properties a hash function should satisfy, although in modern times computation has become so fast that it can be broken. This article is long enough, and we both are tired. I promise that I’ll explain why the math behind it worked, and how it was broken.
Until then, Peace, and remember, never gift your friend a whiteboard.