I recently found one of my throwaway passwords in a random corner of the internet. It was just sitting there in plain text with my username right next to it, along with around 33,000 other username-password pairs. Besides the minor scare, I found it quite fascinating that these huge files exist publicly. I immediately downloaded the file and scrolled through some passwords. I noticed some obvious patterns. I then wondered about the algorithms used to detect these patterns. I looked around and found some interesting references.
One thing that struck me was that entropy in passwords affects their strength. I then wondered about entropy – was it possible to create artifacts using entropy in passwords?
I was reminded of The Code Book by Simon Singh which I had read a while back. The book references frequency analysis of character pairs and triplets. I found this page on letter frequency by Freek Dijkstra.
That led me to this page in which they calculated the statistical distribution of characters up to three orders (three characters). They used the following books as their corpus.
I agree that using this particular set might be dated, but I figured it would still be interesting to see the outcome. Besides, replacing the actual data set should be fairly straightforward in this case.
I then wrote a program with certain rules.
When the program is given a phrase, it first looks up the second- and third-order databases (from the link above). The database I’m using converts everything to lowercase letters and spaces. Yes, that is a problem for passwords, but the outcomes are still potentially interesting. Based on where the actual next character falls between the most likely and the least likely next character, it gets a score between 0 and 179. The more likely a character is to follow the previous character, the straighter the line. Since the database also counts spaces as characters, and I chose to ignore spaces (considering the relatively low use of spaces in passwords), the line can almost never be a single straight line. The distance between points is relative to the size of the canvas. The rotation starts clockwise, but for every change in the type of character (lowercase, uppercase, number or symbol), the direction changes.
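Since my own code isn't cleaned up yet, here is a minimal sketch of how rules like these could look in Python. It is not the actual program. It only uses second-order (pair) counts, the TOY_BIGRAMS table is made-up data standing in for the real database, and the names turn_angle, char_class, password_path and the step_fraction parameter are all hypothetical. I'm also assuming that a score of 0 means "go straight" and 179 means the sharpest turn.

```python
import math

# Made-up pair counts for illustration: TOY_BIGRAMS[prev][next] = count.
# The real program would read these from the frequency database instead.
TOY_BIGRAMS = {
    "p": {"a": 50, "e": 30, "o": 20},
    "a": {"s": 40, "n": 40, "t": 20},
    "s": {"s": 10, "t": 60, "e": 30},
}


def turn_angle(prev, nxt, table, max_angle=179):
    """Scale how likely nxt is after prev to a turn angle in [0, max_angle].

    Assumption: the most likely follower gets 0 (straight line), the least
    likely one gets max_angle, and unseen pairs also get max_angle.
    """
    row = table.get(prev)
    if not row or nxt not in row:
        return max_angle
    hi, lo = max(row.values()), min(row.values())
    if hi == lo:
        return 0
    return round(max_angle * (hi - row[nxt]) / (hi - lo))


def char_class(ch):
    """Classify a character as lowercase, uppercase, number or symbol."""
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    if ch.isdigit():
        return "digit"
    return "symbol"


def password_path(password, table, canvas=400.0, step_fraction=0.05):
    """Return the list of (x, y) points traced by a password."""
    chars = [c for c in password if c != " "]  # spaces are ignored
    if not chars:
        return []
    step = canvas * step_fraction  # distance is relative to the canvas size
    x = y = canvas / 2
    heading = 0.0
    direction = 1  # +1 is clockwise on a canvas whose y axis points down
    points = [(x, y)]
    for prev, cur in zip(chars, chars[1:]):
        if char_class(prev) != char_class(cur):
            direction = -direction  # flip rotation when the type changes
        # The table is lowercase only, so look up the lowercased pair.
        heading += direction * turn_angle(prev.lower(), cur.lower(), table)
        x += step * math.cos(math.radians(heading))
        y += step * math.sin(math.radians(heading))
        points.append((x, y))
    return points
```

With the toy table, password_path("pas", TOY_BIGRAMS) gives three points on one horizontal line, because "a" is the most likely follower of "p" and "s" ties for the most likely follower of "a". Swapping in the real second- and third-order tables would only change what turn_angle looks up.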
The program also converts non-alphabetic characters (leet) to their respective letters.
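A leet conversion like this can be sketched with a simple translation table. The mapping below is my own small, hypothetical choice of substitutions, not the full list the program uses, and the name deleet is made up for this example.

```python
# Hypothetical leet substitutions; a real mapping would likely be longer.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
    "!": "i",
})


def deleet(text):
    """Replace common leet characters with the letters they stand for."""
    return text.translate(LEET_MAP)


print(deleet("p4$$w0rd"))  # prints "password"
```

One design question this leaves open is ordering: the character type (number or symbol) would presumably still be taken from the original character, with the converted letter used only for the frequency lookup.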
The code that generated this is not yet public on GitHub. There are still some changes and cleanup that I need to do. I’ll update this page when I put it up.