Word list generation for bruteforce cracking


Password cracking: abandoned topic.

Everybody talks about password cracking. It's a must when no other resource is available. Everybody also goes looking for a dictionary at some point. But there's a truth that everybody fears: there is no such thing as the best dictionary. A brute-force attack succeeds only if the correct word is tried within the time available for testing. Time is limited, so even when the word is in the dictionary, the attack can still fail if that word is not reached in time.

Success depends on how the dictionary is sorted. I haven't found a solution to that anywhere, so I wrote my own. The tool discussed here can help build a dictionary or sort a very large one. Its output is a simple word list that you can use on its own or with the rest of the dictionary appended.

Picture the case of testing a DNS server for internal hosts when you already know two of them: legolas and aragorn. Now build a list of candidate names for internal hosts (try to make it a big one). If Star Trek is the theme of the DNS names, build a list of related terms from that story. If you could do this automatically, the result would be a larger (and better) list. Now you know why I dug into this topic.


What is the point?

The program that inspired this essay can be found here: wlgen project

wlgen was a mix of features in a single program. I tried to take a customer's site, extract the most important words from its content, categorize them correctly, and then generate a word list related to that site. The program works, and I thought it was an interesting topic to discuss.

Actually, one day I needed to brute-force a DNS server and had to build the right word list from these two terms: Kyrgyzstan and Kashgar. I didn't know they were two places on the Silk Road, but the program did its job fine and the categories were sorted correctly.

I used to append the BIG file (the complete dictionary) to the end of the list, just to ensure no word is missing if the generated word list is too short.
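
The append step can be sketched in a few lines. This is only an illustration in Python (wlgen itself is a Perl script), and the function name merge_wordlists is made up for the example. It keeps the generated words at the top, so they are tried first, and skips dictionary words that are already in the list.

```python
from itertools import chain


def merge_wordlists(generated, dictionary):
    """Return the generated words first, then the dictionary, without duplicates."""
    seen = set()
    merged = []
    for word in chain(generated, dictionary):
        word = word.strip()
        # Skip blank lines and anything already emitted earlier in the list.
        if word and word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


if __name__ == "__main__":
    generated = ["legolas", "aragorn", "gandalf"]
    big_dictionary = ["aardvark", "aragorn", "zebra"]
    for word in merge_wordlists(generated, big_dictionary):
        print(word)
```

For a really big dictionary the same loop could read both inputs line by line from files instead of holding lists in memory.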

A 'catalog' system is needed to find the categories and related words for a term. For example, if we are testing legolas.host.com, the category should be "Lord of the Rings theme" and related terms could be gandalf, aragorn and so on. A common problem here is that a term like Titan fits several categories: football team, science, business and so on. Our first need is a great dictionary and a categorization system. I found both in the same service at www.onelook.com.


Word search method

RELATED TERMS IN THE SAME CONTEXT: the concept

The concept of a "related term" should be clarified before we dig into any kind of search. If we think of "Apple", we could be talking about a fruit, a registered computer trademark, a color, and a few more things. If we focus on the word as a fruit and ask for terms related to apple, we soon realize that the results include things like "pie", "juice" and "seed". All of them have something to do with apples, but not with fruits in general. If we are trying to find fruit names like "pear" and "apple", we should move from 'apple' to 'fruit'. We find the same situation when we try to place mercury, venus and earth. The first two are both planets and Roman gods, but the third is clearly a planet. We could tweak the search to "solar system planets", but if we want the moon and other satellites to appear in the list, we can no longer talk only about planets. We should move from "mercury, venus and earth" to "solar system entities" or bodies. The moon may be an exception here, because when we talk about planets we quickly think of earth and moon together.

We have to be very careful with these tweaks: the more we manipulate the word, the further we can drift from the "concept" originally meant by it.

Language-dependent systems like this one force us to translate before and after the search, and in the process we may drift away from the intended concept. Moon is a proper name and also a generic term for a natural satellite. In other languages it may mean something completely different. Translation may also yield more than one term for the same word depending on the context, and this is where an automatic system can get stuck. The more input we have to pin down the correct topic, context or category, the better the results. Also, once we have the list of related terms, we have to translate it back to the original language, where some terms may again have more than one meaning. A selective translation is needed, as we will see later.

Internationalization and acronyms are another hurdle. Getting words like "www" or "ftp" into a list can be very complicated. If we think of them as acronyms of internet services, we find terms like "file sharing" and "dating system", and if we think of internet protocols we find "ipv6", "transport layer" and so on. In both cases we can find "www" and "ftp", but the result will be full of noise.

Anyway, all of the previous issues could be handled by a well-designed, well-tuned search service. For the program I'll give onelook.com a chance: it is a global dictionary search that also supports related terms. We need to move from a single word to a concept, so the first step is to gather as much information as we can about that word.

To look up a word at onelook.com we can use the URL:
http://www.onelook.com/?loc=rescb&w=word

To look for related terms we should use:
http://www.onelook.com/?loc=rescb&w=*:word
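
As a rough sketch, the two query URLs could be built like this in Python. The parameter names loc and w and the "*:" prefix are taken from the URLs above; the function names are made up for the example, and I have not checked whether the service still accepts these queries today.

```python
from urllib.parse import urlencode

ONELOOK_BASE = "http://www.onelook.com/"


def lookup_url(word, loc="rescb"):
    """URL for the plain dictionary lookup of a word."""
    return ONELOOK_BASE + "?" + urlencode({"loc": loc, "w": word})


def related_url(term, loc="rescb"):
    """URL for the related-terms search, i.e. the '*:term' pattern."""
    # safe="*" keeps the wildcard literal; ':' and spaces are still escaped.
    return ONELOOK_BASE + "?" + urlencode({"loc": loc, "w": "*:" + term}, safe="*")


if __name__ == "__main__":
    print(lookup_url("legolas"))
    print(related_url("lord of the rings", loc="revfp"))
```

Letting urlencode do the escaping means multi-word terms such as "lord of the rings" need no special handling.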


CATEGORIZATION: creating the search, and also, tweaking it

Once we have located the word and its possible definitions we need to categorize it. www.onelook.com does this by indexing several dictionaries and grouping them into up to 12 categories. I think that is a very short list of categories, but it is enough for a first approach (we can try to tweak it later).

The category is needed to generate a word list of related terms that is as short and as focused as possible. We can start by rating categories by the number of times the word is listed in each, and then averaging that value with the values for the other words.
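
One way to read that rating rule, as a sketch only: count the matching dictionaries per category for each word, then take the mean over all input words. The function name and the data layout are assumptions made for the example, not the way wlgen stores things.

```python
from collections import defaultdict


def rate_categories(matches_per_word):
    """Average, over all input words, the number of matches in each category.

    matches_per_word maps a word to {category: number of matching dictionaries}.
    Returns (category, score) pairs, best score first.
    """
    if not matches_per_word:
        return []
    totals = defaultdict(int)
    for counts in matches_per_word.values():
        for category, matches in counts.items():
            totals[category] += matches
    word_count = len(matches_per_word)
    # A category that a word does not appear in counts as zero for that word.
    scores = {category: total / word_count for category, total in totals.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = {
        "legolas": {"General": 2, "Miscellaneous": 1},
        "aragorn": {"General": 2, "Art": 1},  # made-up counts, for illustration
    }
    for category, score in rate_categories(sample):
        print(category, score)
```

A category shared by every input word keeps its full score, while one that only a single word belongs to is diluted by the average.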

word search method example: http://www.onelook.com/?w=legolas
   We found 3 dictionaries with English definitions that include the 
   word legolas: 
   Tip: Click on the first link on a line below to go directly to a
   page where "legolas" is defined. 

   General (2 matching dictionaries)
      Legolas : Wikipedia, the Free Encyclopedia [home, info] 
      Legolas : Who2 [home, info] 
   Miscellaneous (1 matching dictionary)
      LEGOLAS : Behind the Name [home, info] 
Now we can explore related "concepts" of that word: http://www.onelook.com/?w=*%3Alegolas&loc=revfp
   Words and phrases matching your pattern:
   (We're restricting the list to terms we think are related to 
   legolas, and sorting by relatedness.) 

    1. elvenking
    2. orlando bloom
    3. elf 


But it's not enough. There is no aragorn or gandalf here. This shows why we need to categorize the term. Let's now try a related-term search with "lord of the rings":
http://www.onelook.com/?w=*%3Alord+of+the+rings&loc=revfp
   Words and phrases matching your pattern:
   (We're restricting the list to terms we think are related 
   to lord of the rings, and sorting by relatedness.) 

 1. elijah wood         2. frodo baggins
 3. viggo mortensen     4. ents
 5. j.r.r. tolkien      6. aragorn
 7. cate blanchett      8. gandalf
 9. orlando bloom       10. araman 
 11. ian mckellen       12. utumno
 13. arwen evenstar     14. ent
 15. goldberry          16. shire
 17. eregion            18. mirror of galadriel	
 19. ringbearer         20. thorondor
 ....
 >>> and more 

Actually there are more than 1000, but the service suggests using a more restrictive search pattern. Despite the number of words, some of them are in the list only because they are related to "lord" or "rings". Queries like "names of middle-earth" or "heroes of Lord of the rings" may return similar results, some better than others. By the way, many search engines drop auxiliary words such as conjunctions, pronouns, suffixes and prepositions. "of" and "the" may be removed from the search query, and I guess the search engine does this when parsing the input field. Another thing to keep in mind is that plural words can lead to an incomplete search. When we look for a list of words with something in common, we are really looking up each word separately: www and ftp are each an acronym (not acronyms) of an internet service (not services).
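
To make the two points concrete, here is a small sketch of query normalization. The stop-word set is a tiny made-up sample and the plural rule is a deliberately crude heuristic; neither is taken from onelook.com or from wlgen.

```python
# Assumption: a tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"a", "an", "and", "of", "or", "the"}


def strip_stop_words(query):
    """Lowercase the query and drop the auxiliary words."""
    kept = [token for token in query.lower().split() if token not in STOP_WORDS]
    return " ".join(kept)


def naive_singular(term):
    """Crude English singular: 'categories' -> 'category', 'services' -> 'service'."""
    if term.endswith("ies") and len(term) > 4:
        return term[:-3] + "y"
    if term.endswith("s") and not term.endswith("ss") and len(term) > 3:
        return term[:-1]
    return term


if __name__ == "__main__":
    print(strip_stop_words("Lord of the Rings"))
    print(naive_singular("acronyms"), naive_singular("services"))
```

A rule this simple also mangles proper names that end in "s" (it turns "legolas" into "legola"), so it is only usable on the describing phrase, never on the seed words themselves.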

Well, this search gives a great list, but we have a problem: we have to jump back automatically from "legolas" to "lord of the rings", and that is a big one.

Some searches are not so specific the first time. Within the same topic we have the word "Narsil", which I actually use as a hostname. If we look up this word, only 1 match is found. If we go further with Elendil:
  1. anárion    2. isildur      3. kings of gondor	
  4. narsil     5. annúminas    6. aragorn	
  7. battle of dagorlad	        8. gondor	
  9. osgiliath ......
We can make our list grow if we follow the tree of results and rate the words, but this process may lead us to the wrong category. For example, the legolas search gave 3 matches, and if we try to move further down the tree:
  - Elvenking: empty search result.
  - orlando bloom: tons of words, from the actor,
    the city of Orlando and even flower-related terms.
  - elf: again tons of words, closer to what
    we are looking for, but still skewed
    towards the Elvish language. Some names from the film
    appear here, but only Elvish names.
What if we go to Wikipedia? The definition looks like:
  http://en.wikipedia.org/wiki/legolas

  In J. R. R. Tolkien's The Lord of the Rings, Legolas 
  Greenleaf is a Sindarin Elf who becomes a part of the 
  Fellowship of the Ring. With his keen eyesight, 
  sensitive hearing, and excellent bowmanship, Legolas 
  is a valuable resource to the other eight members of 
  the Fellowship. Although Tolkien elves are a diverse 
  group, fantasy and gaming enthusiasts tend to cite 
  Legolas as the archetypical... (continued at Wikipedia) 
Wikipedia builds a category tree for each article that we can use in our program. This is a sample for the MAIN category of Legolas, Grey_Elves: http://en.wikipedia.org/wiki/Category:Grey_Elves
 Grey elves
   -- Middle-earth elves
        -- Fictional elves
           -- Elves
               -- Legendary creatures
                  -- Fictional characters by nature
                      -- fictional characters
        -- middle-earth characters
           -- Middle-earth
        -- Fantasy characters
           -- Literary characters

With two or more words, the easiest way is to locate the first match between them through the category tree. Once the match is found, for a complete result we can fall back to the first category by walking the tree in reverse.
Keep a proper rating for every word found, and search each category in the word generator.
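
Locating the first match can be sketched as a breadth-first walk up the category graph from each word, picking the shared category with the smallest combined distance. The PARENTS table below is a hand-written, hypothetical excerpt, not data fetched from Wikipedia, and the function names are made up for the example.

```python
from collections import deque

# Hypothetical excerpt of a category graph: child -> list of parent categories.
PARENTS = {
    "Legolas": ["Grey Elves"],
    "Aragorn": ["Rulers of Gondor"],
    "Grey Elves": ["Middle-earth Elves"],
    "Middle-earth Elves": ["Fictional elves", "Middle-earth characters"],
    "Rulers of Gondor": ["Middle-earth Men"],
    "Middle-earth Men": ["Middle-earth characters"],
    "Middle-earth characters": ["Middle-earth"],
    "Fictional elves": ["Elves"],
}


def ancestor_depths(start, parents):
    """Breadth-first walk upwards; returns {category: distance from start}."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths


def first_common_category(word_a, word_b, parents):
    """Closest category reachable from both words, or None if there is none."""
    depths_a = ancestor_depths(word_a, parents)
    depths_b = ancestor_depths(word_b, parents)
    common = set(depths_a) & set(depths_b)
    if not common:
        return None
    # Smallest combined distance wins; the name breaks ties deterministically.
    return min(common, key=lambda c: (depths_a[c] + depths_b[c], c))


if __name__ == "__main__":
    print(first_common_category("Legolas", "Aragorn", PARENTS))
```

The depth maps also give a starting point for the rating: a category close to both words deserves a better score than one reached only near the root of the tree.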

Now try looking up definitions for different words, keeping in mind the language of each meaning:

http://en.wikipedia.org/wiki/apple: apple the fruit
http://es.wikipedia.org/wiki/apple: apple computers, the company

Sometimes the wiki is organized in a strange way, so finding a category is horrible. We can tweak things a little here by combining the article's definition with the words of the category. Sometimes a wiki article is a LIST_OF page, and we can use that together with the definition to easily locate a word list.

In the apple example, the word apple is in the categories of "Apples" and "List of foods". This is one path of the tree: Apple -> Apples -> Fruit

It's funny, because in the Fruit category there is no entry for apple, nor for pear or melon. Why? They are subcategorized again. But rationally we looked into fruits because we know that an apple is a fruit. The only way to do this automatically is to match the phrase "apple is a fruit" against the Fruit category.

Wait a moment here. Did you notice that in the first case "Middle-earth characters" is plural and in the second "Fruit" is singular? These are the actual category names, and we have to take care of this when we start abstracting the syntax of the definition.
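
A sketch of that matching step, assuming the definition contains a plain "is a ..." phrase and that the category name is a single word: both sides are lowercased and stripped of a trailing plural "s" before comparing, so "Fruit" and "Planets" are treated alike. The names and the regular expression are made up for the example.

```python
import re

# Captures the words that follow "is a" / "is an" in a definition sentence.
IS_A_PATTERN = re.compile(r"\bis\s+an?\s+([a-z][a-z -]*)", re.IGNORECASE)


def normalize(name):
    """Lowercase, turn underscores into spaces and strip a trailing plural 's'."""
    name = name.replace("_", " ").strip().lower()
    if name.endswith("s") and not name.endswith("ss"):
        name = name[:-1]
    return name


def categories_named_in(definition, categories):
    """Return the categories whose name shows up after 'is a' in the definition."""
    match = IS_A_PATTERN.search(definition)
    if not match:
        return []
    described = [normalize(word) for word in match.group(1).split()]
    return [category for category in categories if normalize(category) in described]


if __name__ == "__main__":
    print(categories_named_in("An apple is a fruit", ["Apples", "Fruit", "List of foods"]))
    print(categories_named_in("Mars is a planet", ["Mars", "Planets"]))
```

Irregular plurals (elf and elves) and multi-word category names slip through a comparison this simple, which is part of why the abstraction needs care.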

As a last check of this sorting system we try to locate the planet Mars. It's common to use planet names for a list, so we use it as a test of whether the system's workflow is reliable. But in this case we have a word with so many meanings that we have to choose the correct one (venus shares the same problem, as does jupiter) on a disambiguation page.

Focusing on Mars (the planet) we are taken to the category "mars", and from there to "Sol System planets". If we stop here we can see the subcategories: earth, mercury, venus... In this case, Mars is in the list of subcategories (in the apple case the word apple was not).
Also, the definition says "Mars is a planet". We can move to category:planet, where we see a new category, "planets", with fictional planets, planetary images and so on, and we can dig deeper and deeper into celestial bodies, astronomical objects, ... until we get lost.

We also have to distinguish whether the words we are looking for appear as subcategory names or as Wikipedia entries, but I think that is only a parsing task.

Some situations may need human help and can't be completely automated. Trouble arises when only one word is used as input; in other cases the task seems easy with both Wikipedia and www.onelook.com.

Interesting Note:

If we try to find the names of Asterix and his friends we hit a wall. It's a very complicated kind of search.

The Wikipedia definition gives this:
  ..
  Asterix lives around 50 BC in a fictional village in 
  northwest Armorica (a region of ancient Gaul mostly 
  identical to modern Brittany). This village is 
  celebrated amongst the Gauls as the only part of that 
  country not yet conquered by Julius Caesar and his 
  Roman legions. The inhabitants of the village gain 
  superhuman strength by drinking a magic potion prepared
  by the druid Getafix (French: Panoramix—names of all 
  characters, except usually Asterix and Obelix, 
  vary from one translation to another).
  ..
Because almost all of the names are translated, it is very difficult to build a good list: only Asterix and Obelix keep their original names.


Getting word list

To generate a word list the program needs some parsing routines as well as HTTP GET code to query the dictionaries.

The main topics are (as shown):
  • Internationalization
  • Disambiguation
  • Categorization
  • Rating and relating
  • Word list generation.


The original program supports web crawling and file parsing with its own rating system, so that the generated word list is "site related" or "content related" to a website or a set of files, but this code has been removed as it's not necessary here.

Currently, wlgen generates a word list (with caching) from a set of words (it gets the words' categories first) or from a set of categories.
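
The caching idea can be sketched as a small wrapper around whatever function performs the HTTP GET. This is a Python illustration under my own assumptions (one file per URL, named by a hash of the URL); it is not how the Perl script lays out its cache.

```python
import hashlib
import os


def cached_fetch(url, fetch, cache_dir):
    """Return the body for url, calling fetch(url) only on a cache miss.

    fetch is any callable that takes a URL and returns the page text, so the
    remote service is queried at most once per URL.
    """
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".txt")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a real HTTP GET, so the sketch runs offline.
        print("fetching", url)
        return "page body for " + url

    cached_fetch("http://example.invalid/legolas", fake_fetch, "wl_cache")
    cached_fetch("http://example.invalid/legolas", fake_fetch, "wl_cache")
```

Passing the fetch function in as a parameter keeps the cache independent of the HTTP library and makes it easy to try out without touching the network.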

The final stage of the program and the internationalization were also removed, so the final code is now more of a POC.

FYI, samples of the program running:
./wlgen.pl --words Larry_Bird,Michael_Jordan

[getting categories]

 NBA_players
 Indiana
 1956_births
 Chicago_Bulls_players
 1963_births
 Basketball_players
 Chicagoans
 People_from_Indiana
 Washington_Wizards_players
 Sportspeople_by_sport
 Boston_Celtics_players
 Basketball
 1960s_births
 1963
 Chicago
 Chicago_metropolitan_area
 U.S._states
 Cities_in_Illinois
 Wisconsin
 Illinois_counties
 ...

[generating wordlist]
 ... 
Another categorization sample:
./wlgen.pl --words Bluetooth,Ethernet

[getting categories]

 Computer_networks
 Communication
 Computing
 Wireless_networking
 Ethernet
 Networks
 Wireless_communications
 Telecommunications
 Graph_theory
 Apes
 Media
 Technology
 Human_societies
 Fundamental
 Human
I wonder why Apes come before Humans in networking.. :) Take a look at the different results with different words in the same context. With legolas, the Grey Elves context is sorted first.
./wlgen.pl --words Legolas

[getting categories]

 Grey_Elves
 Middle-earth_Elves
 Middle-earth_characters
 Fictional_elves
 Fictional_characters_by_nature
 Middle-earth
 Literary_characters
 Elves
 Fantasy_characters
 Fictional_characters
 Literature
 Fictional_universes
 Media_franchises
 Fantasy
 Arts
 Humanities_and_art
 Fictional
 ...
With another elf, Grey Elves is pushed down the tree.
./wlgen.pl --words Legolas,Galadriel

[getting categories]

 Middle-earth_Elves
 Middle-earth_characters
 Grey_Elves
 Fictional_elves
 High_Elves
 Fictional_characters_by_nature
 Middle-earth
 Literary_characters
 ...
And with a non-elf character, the tree is rebuilt almost completely.
./wlgen.pl --words Legolas,Aragorn

 Middle-earth_characters
 Fantasy_characters
 Grey_Elves
 Rulers_of_Gondor
 D%FAnedain_of_the_North
 Middle-earth
 Middle-earth_D%FAnedain
 Literary_characters
 Middle-earth_Elves
 Middle-earth_Men
 Media_franchises
 Fictional_elves
 Fictional_characters_by_nature
 Fictional_universes
 Elves
 Fictional_characters
 Literature
 ...


About word generation, these are some results:
 ./wlgen.pl --categories two

  double, pair, couple, joint, biennial, brace, 
  deuce, twain, company, ii, yoke, binary
  doublet, pole, interval, compound, conjugate,   
  dichotomy, distance, fork, bipartite
  cross, dualism, half, twin, bilateral, bilingual, 
  dual, union, between, bifurcate, bipolar
  biweekly, dialogue, duet, duplex, dyad, either, 
  sector, tandem, tie, binomial, alternate, ...
Another one
./wlgen.pl --categories color

   white, tone, blue, black, red, hue, tint, dye,
   coloring, tinge, colored, colour, tan
   blush, colouring, complexion, color in, gloss, 
   vividness, colors, appearance, colorism
   distort, green, brown, purple, gray, yellow, 
   azure, streak, drab, silver, dun, mottle
   orange, pink, shade, stain ,turn,...
And the last test
./wlgen.pl --words Legolas,Aragorn
 
 [Generating categories]

 Middle-earth_characters
 Fantasy_characters
 Grey_Elves
 Rulers_of_Gondor
 D%FAnedain_of_the_North
 Middle-earth
 Middle-earth_D%FAnedain
 Literary_characters
 Middle-earth_Elves
 Middle-earth_Men
 Media_franchises
 Fictional_elves
 Fictional_characters_by_nature
 Fictional_universes
 Elves
 Fictional_characters
 Literature

 [Generating Wordlist]

 + from   Middle-earth_characters

 sauron
 hobbit
 baggins
 arwen
 legolas
 aragorn
 gimli
 sam
 frodo
 orc
 boromir
 merry
 pippin
 faramir
 galadriel
 saruman
 ...
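
As an illustration of this last step, here is one way the article titles of a category could be turned into lowercase candidates like the ones listed above. It is a guess at the general idea, not the code wlgen uses: accents are stripped, titles are split into words, and words that appear in more titles come first. The function name and the sample titles are made up.

```python
import re
import unicodedata
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def titles_to_wordlist(titles, min_length=3):
    """Split titles into lowercase ASCII tokens, most frequent tokens first."""
    counts = Counter()
    for title in titles:
        # Decompose accented letters and drop the accents (host names are ASCII).
        ascii_title = (
            unicodedata.normalize("NFKD", title)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        for token in TOKEN.findall(ascii_title.lower()):
            if len(token) >= min_length:
                counts[token] += 1
    # Frequent tokens first; alphabetical order breaks ties.
    return sorted(counts, key=lambda word: (-counts[word], word))


if __name__ == "__main__":
    sample_titles = ["Frodo Baggins", "Bilbo Baggins", "Legolas", "Dúnedain"]
    for word in titles_to_wordlist(sample_titles):
        print(word)
```

The min_length filter drops short fragments such as "of", which are rarely useful as host name guesses.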



Conclusion

The services are out there, but abusing them is always a bad idea. The cache is implemented to avoid overloading third-party services, so don't abuse this script. The recommended course is to modify the code and start building your own large word database if needed, as others have done. This is just POC code, not a final application.