Adaptive modeling in bruteforce cracking dictionaries

closelook to reducing bruteforce timings.

When bruteforce is the only way to step out, bad bruting of the password will lead to an endless task with no guaranties. One of the reason could be a really big dictionary or a short period of times or number of attempts.

The size of the dictionary could be reduced, also the number of attemps, if an adaptive model is used against the guess. An dapative model is really usefull when a part of the guess (most of the times the password) is known.

Adaptive Model

The adaptive model is (ab)used in compression and cryptographic algorythims. The process of entropy coding can be split in two parts: modeling and coding. Modeling assigns probabilities to the symbols, and coding produces a bit sequence from these probabilities. As established in Shannon's source coding theorem, there is a relationship between the symbol probability and it's corresponding bit sequence. A symbol with probability p gets a bit sequence of length -log(p).

In order to achieve a usefull adaptive model, an exact propability estimation is needed. Since the model is responsible for the probability of each symbol, modeling is one the most important tasks in data processing.

A model is adaptive, when changes the symbol probabilities during the data process in order to adapt to the changing contexts while in the process. Initially the data process starts with a basic model, usually created by statistical modeling. During the process, the model gets adapted by the symbols. The result is a better guessing in data process.

Alltough the definition is not clearly enough, an example will clarify the related topic. Whenever you search for something about speech recognition or word processing you will listed with a lot of papers regarding the probability of guessing an unpronnounced letter or expanding words from phonetical to lexical format. They use to build custom modeling data so they can state that 'H' and ' ' are the most common character after the 'T' letter, as 'H' and 'I' are the most used after the 'W'. You also use it when you write an SMS with T11 or T9 dictionaries in your cellular phone. Let's expand this model for a better understanding.

Statistical Adaptive Model

An adaptive model is based in the process reordering the symbol table. The reasons why the letter 'H' (characters are the symbols here) is the most used after the 'T' is just an statistical analysis of words, It's the frequency of the letters of the alphabet in English, and remember, in English. So a good adaptive model should care about the context too.
The inventor of Morse code, Samuel Morse (1791-1872), needed to know this so that he could give the simplest codes to the most frequently used letters. He did it simply by counting the number of letters in sets of printers' type:
12,000 E 
9,000 T 
8,000 A, I, N, O, S 
6,400 H  
6,200 R 
4,400 D 
4,000 L  
3,400 U
3,000 C, M 
2,500 F 
2,000 W, Y 
1,700 G,  P 
1,600 B 
1,200 V 
800 K 
500 Q 
400 J, X
200 Z 

However, this gives the frequency of letters in English text, which is dominated by a relatively small number of common words (see What are the commonest English words?). For word games, it is often the frequency of letters in English vocabulary, regardless of word frequency, which is of more interest. An analysis of the letters occurring in the words listed in the main entries of the Concise Oxford Dictionary (9th edition, 1995) came up with the following table:
E 11.1607%  56.88 		M  3.0129%  15.36  
A  8.4966%  43.31 		H  3.0034%  15.31  
R  7.5809%  38.64 		G  2.4705%  12.59  
I  7.5448%  38.45 		B  2.0720%  10.56  
O  7.1635%  36.51 		F  1.8121%  9.24 
T  6.9509%  35.43 		Y  1.7779%  9.06 
N  6.6544%  33.92 		W  1.2899%  6.57 
S  5.7351%  29.23 		K  1.1016%  5.61
L  5.4893%  27.98 		V  1.0074%  5.13 
C  4.5388%  23.13 		X  0.2902%  1.48 
U  3.6308%  18.51 		Z  0.2722%  1.39 
D  3.3844%  17.25 		J  0.1965%  1.00 
P  3.1671%  16.14 		Q  0.1962%

The third column represents proportions, taking the least common letter (q) as equal to 1. The letter E is over 56 times more common than Q in forming individual English words.

The frequency of letters at the beginnings of words is different again. There are more English words beginning with the letter 's' than with any other letter. (This is mainly because clusters such as 'sc', 'sh', 'sp', and 'st' act almost like independent letters.) The letter 'e' only comes about halfway down the order, and the letter 'x' unsurprisingly comes last.

So we came to the definition of model again, and we are talking about the number of models. Each of this tables could be a model, and all could be used with the same adaptive modelling algorythim.

The same way, if we state 'H' as the most common character after the 'W', 'A' is the most common character after 'TH'. When we only use the probability of one character ('H' over 'T') we can talk about "Model 1", and about "Model 2" when we consider two characters ('A' over 'TH'). The more models we use, the better the approach. But, the more models we use, the bigger the information is needed to handle it if we consider tables with probabilities. If we use also diferent contexts, I mean, most commmon characters used in passwords, or pets names or alike, and their diferent models, the ammount of information grows.


How and When of Adaptive Modeling

When there's no any clue about the password nature there's no way to use any model or probabilistic guess, as we won't know if we are getting closer or not to the correct answer.

Whith sql injection a new bruteforce approach. The Adaptive modeling detailled here was implemented in the sqlbftools project as a P.O.C.. The abstract of the program is to bruteforce a blind sql injection (using HTTP methods) to guess the correct value of a mysql sentence, so the goal is to reduce the number of requests sent to the server.

This type of bruteforce allow us to guess a password character by character, with a sentence like:
and MID ( "result of sentece", 1, position) = CHR(n)

where it will return TRUE if the substring with lenght 1 starting from position, (it's the character at position) equals to CHR (n), where n is each character of the tested charset.

For more information about this kind of bruteforce look at blind sql injection in this essay or search the internet. It's plenty of working code exploiting this method.

To guess the result of the sentence each character in it should be compared with each character in the charset. Usually it's exploted using an HTTP request, so the máximun number of request should be (length * charset length), what means a lot of HTTP requests for each guess. There was a need to reduce it so the guessing time would be the lowest.


Dictionary based Adaptive Modeling

When we talk about "Dictionary based" model it means that the statistical model is built from the words of the dictionary. The symbols are in no doubt the charset used. More than one implementations could be done to the model:

In a very dynamic model, tables are built and rebuilt with any change, based in the context. Example, use the frequency of the letter by position and by the scope of the word using several dictionaries. It means that a correct dictionary will get better probability, as supposed. It means, also, that a great dictionary will lead to a bad frequency calculation and big needings of resources to get the model tables working.

let's pseudotest the theory with some words and a guess. We have this list of words as dictionary:
root
localhost
t@l
reversing.org
roca

If we make the model 'character position dependant', then we could find the word 'root' easily, as 'localhost', but we miss the chance to guess the word 'host' as it's not in the ditcionary. If the model is "character order dependant", we can find the word 'host' even if it's not as a single word in the dictionary, but we miss the probability to find 'root' easily, as the character 'C' is found twice after the 'O' and the model will state that 'OC' has a higher frequency than 'OO'. The same way, in "character order model", the substring '.org' is located inside a word, and the frequency of '.OR' will lead the algorythim to state that 'G' will be the next character.

Most of the words are starting with the character 'C', so this will be first character to test.

The way to implement the algorythim could be something like this:
- Build the frequency table using the order, based in the previous know characters.
- Re-sort the charset based in dictionary statistic to guess the next character.
- Test the charset against the guess.
- loop untill end.

We are trying to guess the word 'host' with the charset :
'abcdefghijklmnopqrstuvwxyz0123456789$.-()[]{}@=/\\|#?¿&·!ñÑ'

The idea is to match each character of the guess ('host') with each of the characters in the charset ('abcd...')like the hangman game, but starting from the first guess character.

At the first time, the table is empty and we can use two starting model (as examples):
- Dictionary starting character frequency based.
- English words starting character frequency based.

Models are built from left to right, in increasing and decreasing order: Letters are being appended to the end (rigth), and removed from the beggining (left). For the word 'HOST' the resulting models are:
 
h 
ho 
hos 
host 
ost 
st  
t


* At first loop, no letter is known, so the ' ' or starting charset will be used.

For '' the charset is ordered like this: 'abcdefghij...'
The first guess is always a random one. The better the initial charset, the lower the number of attempts.

* When the first character is found, the 'H', the tables are built using 'H' first, and then the initial table again to compute the frequency of the next character.

For 'H' the charset is ordered like this: 'oabcdef....'
The next most common letter to substring 'H' is 'O', wich is sorted first.

* In the next loop, 'HO' is guessed from the password, so to compute the frequency, the next letter to the pair 'HO' is sorted first, than computed against the letter 'O', and then with the initial table again.

For 'HO' the charset is ordered like this: 'scorabdef...'
The next most common letter to substring 'HO' is 'S', for 'O' are 'C', 'O' and 'R', and then the rest of the charset..

* This process continues the same.. When the next character is guest, the found substring is 'HOS', then the frequency of the next character to substrings 'HOS' in the dictionary is sorted, computed against the 'OS' substring, later to the 'S' and later to the initial table, and so on.

For 'HOS' the charset is ordered like this: 'tiabcdef....'
The next most common letter to substring 'HOS' is 'T', for 'OS' is 'T' again, and for 'S' is 'i'.

This adaptive model allow us to compute a good dictionary against known values, and incremental bruteforce will finish the work reducing the number of attempts.

In each loop the charset is re-sorted, not increased or decreased. The characteres not appeared in the statistical model are appended to the end.


Last notes

It's important to realize that a big dictionary could deviate from a correct modelling. Also, when most of the word is guessed, a simple word dictionary check could finish the job removing the remaining attemps.

Very good results were achieved by the P.O.C. sqlbftools project using this method so the code has to be enhanced from scratch, but most of the work for the model to work fine should be done building the dictionary.

See the paper about creating dictionaries in this site: Wordlist generation for bruteforce Cracking.

You can test a working version of this adaptive model implemented in the sqlbftools project.

References

- Adaptive Model in data compression.
http://www.data-compression.info/Algorithms/AdaptiveModel/

- Frequency of the letters of the alphabet in English.
http://www.askoxford.com/asktheexperts/faq/aboutwords/frequency

- WordList Generation for bruteforce Cracking.
http://www.reversing.org/node/view/9

- sqlbftools project.
http://www.reversing.org/node/view/9