December 14, 2008

How to make JohnTR use all of your cpu cores

Lack of time makes people use available resources more efficiently, and that rule applies to me too. I've got a password hash to crack and can't wait days for my 1.8 GHz mobile CPU to crack it. My first idea was to look for more CPU power and speed, which is not cheap if you're after cutting-edge hardware!

The CPU in my notebook (a 1.8 GHz Intel Dothan) could try about ~3500 passwords per second. Damn low! So I put out a call for idle CPUs and got two of them. The first was an AMD Athlon 5000 LD, working as fast as ~6200 c/s: nearly twice as fast as my own CPU, but still far from what I was looking for! So I tried the second one, an Intel X5450 at 3 GHz. The best rate on this CPU across different JohnTR builds was only a bit faster than the AMD, around ~8500 c/s. I want it FASTER!

We're living in the age of multi-core processors, but we rarely use all the power of our CPUs. Unfortunately, stock JohnTR is not multi-core aware. If you run it on a multi-core CPU (2, 4, 8... cores!), JohnTR will use only ONE of the cores, just as if you had a single-core CPU. Too bad for us! The Intel X5450 machine I was using has 8 cores, so I was missing 7 of them.
Those idle cores reminded me of a great project that patches JohnTR to make it multi-core aware. It's known as the MPI Patch. The MPI Patch itself is based on another project (MPICH) and on research by Ryan Lim, available here.
As BindShell's documentation for their MPI Patch was not very beginner-friendly, I thought it would be worth a blog post turning it into an easy-to-follow, step-by-step guide for any proud multi-core CPU owner. So here it is:

John the Ripper + MPI + Intel X5450 3 GHz

Step One: Make sure you are ready to compile a few source code packages. All you need is a recent version of GCC and its C++ compiler. They were not available on the fresh Linux install I had, so here is what I did:

yum install gcc
yum install gcc-c++.i386
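If you're not sure what is already installed, a quick check before running yum can save a round trip. Below is a minimal POSIX sh sketch; the function name `missing_tools` is hypothetical, and it only checks that the commands exist on your PATH, not that their versions are recent. Besides gcc and g++, you'll also need make for the later steps.

```shell
#!/bin/sh
# Print every named command that is not on PATH, one per line.
# Returns 0 when all of them are present, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage before Step One:
#   missing_tools gcc g++ make || echo "install the commands listed above"
```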


Step Two: Get the latest MPICH source package from here. Follow the simple extract > ./configure > make > make install game. If you don't know how to play this game, here are the instructions:

wget http://www.mcs.anl.gov/.......
tar -xf mpich2-1.0.8.tar.gz
cd mpich2-1.0.8

./configure

make
make install


Step Three: Time to get the MPI-armored JohnTR. Get it from the BindShell project page, and be sure to skip the version named "....-mp8-all...". That version is patched to handle more password hash types, but we don't need them now, and it also doesn't compile smoothly out of the box. So get the MPI-only patched version. Let's play the same game again:

wget http://www.bindshell.net/...
tar -xf john-1.7.3.1-mpi8.tar.gz
cd john-1.7.3.1-mpi8/src/

make generic


Take my advice and use the "generic" target to compile JohnTR. Don't ask why; it's just better for you and works without any problem. [Since I've had multiple comments & emails about this section of the post, I'm updating it. Thank you, Solar Designer and other visitors, for your notes about this.]

You may have noticed the "generic" target used above to build JohnTR. "generic" is the simplest option you can use here. To see the list of available targets, run "make" without any parameter, then replace 'generic' with the target closest to your system architecture. The best-optimized option for modern systems is probably "linux-x86-64", but if you're not running a 64-bit operating system, the next choice is "linux-x86-sse2". Last but not least, if you're going to compile JohnTR on a 64-bit operating system, you should install the 64-bit compiler instead of the i386 package mentioned in Step One, so that you can use the 64-bit optimizations. If you're still unsure about your CPU's capabilities and want to know more, you can use the command below:

cat /proc/cpuinfo
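If you'd rather not read the whole /proc/cpuinfo output by eye, the choice described above can be scripted. This is a rough sketch under my own assumptions: the function name `suggest_make_target` is hypothetical, it only knows the three targets mentioned in this post, and it treats `uname -m` reporting `x86_64` as a sign of a 64-bit OS. Always compare its answer with the target list that plain `make` prints for your JohnTR version.

```shell
#!/bin/sh
# Suggest a JohnTR make target from the machine type and CPU flags.
# Heuristic only: it picks among the targets discussed in this post.
#   $1 = machine type, e.g. the output of "uname -m"
#   $2 = CPU flags, e.g. the "flags" line from /proc/cpuinfo
suggest_make_target() {
    machine=$1
    flags=" $2 "                 # pad so " sse2 " matches a whole word
    case $machine in
        x86_64)
            echo linux-x86-64 ;;
        i?86)
            case $flags in
                *" sse2 "*) echo linux-x86-sse2 ;;
                *) echo generic ;;
            esac ;;
        *)
            echo generic ;;
    esac
}

# Usage on a Linux box, from john's src/ directory:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   make "$(suggest_make_target "$(uname -m)" "$flags")"
```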

Choosing the right make target is important, since it HIGHLY affects JohnTR's performance. For some hash types, like the MD5-based ones, the difference between the generic and optimized targets is hardly noticeable, but for algorithms like the DES-based ones, an optimized target gives about 3x faster cracking. Before updating the post and adding this section, I wrote everything based on the generic build, so all screenshots and numbers are OLD & UNOPTIMIZED. Solar Designer mentioned that his optimized JohnTR build running on ONE CPU core cracks 'Traditional DES' hashes as fast as my generic build on 8 cores. So now you have some idea of how much a proper build matters :)



Step Four: So far we've compiled MPICH and JohnTR-MPI, but we're not ready to run yet. MPICH needs some configuration before its daemon (mpd) can start. Don't panic, a simple one-line config file is enough to keep things working :) Here's how to make MPI work:

echo MPD_SECRETWORD=secret > /etc/mpd.conf
chmod 600 /etc/mpd.conf
mpd --daemon


The configuration above is for the case where you're running mpd as root (not advised!). If you're concerned about security, go read the full, detailed manuals of the MPICH project.
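Running mpd as a normal user is the safer route. As far as I know from the MPICH2 1.0.x documentation, mpd then reads `~/.mpd.conf` instead of `/etc/mpd.conf`, and the file must not be readable by other users. The helper names below (`mpd_conf_path`, `write_mpd_conf`) are hypothetical, so treat this as a sketch and double-check the file name against your MPICH2 version's manual.

```shell
#!/bin/sh
# Print the config file mpd expects: /etc/mpd.conf for root,
# ~/.mpd.conf for any other user (per MPICH2 1.0.x docs, as I understand them).
mpd_conf_path() {
    if [ "$(id -u)" -eq 0 ]; then
        echo /etc/mpd.conf
    else
        echo "$HOME/.mpd.conf"
    fi
}

# Write the one-line config with owner-only permissions.
#   $1 = target file, $2 = secret word (choose your own, not "secret")
write_mpd_conf() {
    (
        umask 077                      # a new file is created as mode 600
        printf 'MPD_SECRETWORD=%s\n' "$2" > "$1"
    ) && chmod 600 "$1"                # also tightens an existing file
}

# Usage as a normal user, then start and check the daemon:
#   write_mpd_conf "$(mpd_conf_path)" 'pick-a-long-random-word'
#   mpd --daemon
#   mpdtrace          # should print this host's name if mpd is running
```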

Step Five: Time to fire up JohnTR! Since we want to use the capabilities provided by the MPI interface, we have to run JohnTR on top of MPI, and that's why JohnTR is patched! On my test-bed (8-core Intel), here's how I run it:

mpirun -np 8 /root/temp/john-1.7.3.1-mpi8/run/john /root/temp/shadow

Notice the "-np 8" parameter? That's where you set the number of MPI processes, or in other words, the number of processor cores you want JohnTR to use. If you've got a dual-core CPU, it would be "-np 2", and so on...
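Instead of hard-coding the core count, you can read it from /proc/cpuinfo. A small sketch, assuming a Linux box where /proc/cpuinfo has one `processor` line per logical CPU (hyper-threaded siblings count too); the function names are hypothetical. On systems with a newer GNU coreutils, `nproc` is an alternative.

```shell
#!/bin/sh
# Count logical CPUs in a cpuinfo-style file (default: /proc/cpuinfo).
# Linux prints one "processor : N" line per logical CPU.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Start the MPI-patched john with one process per logical CPU.
# All arguments (john binary, options, files) are passed through unchanged.
run_john_mpi() {
    np=$(count_cpus) || return 1   # grep -c exits non-zero when it counts 0
    mpirun -np "$np" "$@"
}

# Usage with the paths from the test-bed in this post:
#   run_john_mpi /root/temp/john-1.7.3.1-mpi8/run/john /root/temp/shadow
```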

That's it! But how much FASTER are we now? I bet you'll be excited by the results. Below are benchmark tests of John running on ONE core and then on 8 cores. I've highlighted the differences in case you're lost in the numbers.

[Screenshot: John running on 8 cores]

[Screenshot: John running on 1 core]

Using all 8 cores of the test-bed machine at 3 GHz, I was able to reach about 70,000 tries per second, compared to the out-of-the-box speed of ~8,000 tries per second. Yes, this is what I consider FAST! My focus was on gaining speed for cracking shadow passwords, but if you're going to crack dumped Windows (LM) hashes, the cracking speed is insanely high: about 42 million tries per second using all 8 cores, versus 5.4 million tries per second for a normal JohnTR launch against LM hashes.
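To put those numbers in perspective, the speedup is just the multi-core rate divided by the single-core rate. A tiny helper (the name `speedup` is hypothetical) using awk for the floating-point division; with the figures quoted above it gives roughly 8.75x for shadow hashes and 7.78x for LM hashes. Treat these as rough, since the single-core and 8-core figures come from separate runs and the baseline is itself approximate.

```shell
#!/bin/sh
# Ratio of the all-cores rate to the single-core rate, to two decimals.
#   $1 = c/s with all cores, $2 = c/s with one core
speedup() {
    awk -v multi="$1" -v single="$2" 'BEGIN { printf "%.2f\n", multi / single }'
}

# Figures quoted in this post:
#   speedup 70000 8000          # shadow hashes
#   speedup 42000000 5400000    # LM hashes
```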
Things don't stop here. MPI lets you distribute your cracking work not only across multiple cores of a single machine, but across multiple machines in a networked cluster. Just imagine the processing power of 5 clustered 8-core systems running JohnTR... o.0
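For a cluster, the value for `-np` becomes the number of hosts times the cores per host, so 5 machines with 8 cores each give 40 processes. The sketch below counts hosts in a plain hosts file (one hostname per line) and assumes every host has the same core count; `cluster_np` is a hypothetical helper. The `mpdboot` line in the comments is how I understand MPICH2 1.0.x starts an mpd ring, so verify its options in your version's documentation before relying on it.

```shell
#!/bin/sh
# Total MPI process count for a cluster listed in a hosts file.
# Assumes one hostname per line and the same number of cores on every host;
# blank lines and lines starting with "#" are ignored.
#   $1 = hosts file, $2 = cores per host
cluster_np() {
    awk -v cores="$2" '!/^[ \t]*#/ && !/^[ \t]*$/ { n++ } END { print n * cores }' "$1"
}

# Hypothetical usage for 5 hosts with 8 cores each (check your MPICH2 docs):
#   mpdboot -n 5 -f mpd.hosts
#   mpirun -machinefile mpd.hosts -np "$(cluster_np mpd.hosts 8)" ./john --test
```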


Let me thank s_hahroo_z for providing the AMD processor test result, and special thanks to Mr. Hajimoradi for sharing his dusty 8-core machine, and of course the root shell! :)

10 comments:

  1. Really, really thanks for sharing your research, but I have one question: is the OS important in operations like these?
    I heard Solaris is better because it is faster at process switching.

  2. Brute-forcing passwords is not about switching between processes. MPI is about distributing the work across processes, so I really doubt the OS would make much (if any) difference.
    After all, I'm new to MPI and know almost nothing about it!

  3. Thanks, bro. I've only got a dual core, but anyway the speed has at least doubled.

    What about writing something on using 4-5 machines in parallel for cracking?

  4. @Frozen :
    How about reading Muts' tutorial on this topic, based on the BackTrack live CD? :)
    It's based on the same idea/technique.

  5. Compile it with SSE2 support and you will double your c/s rate ;-). My quad-core system (Q9550 clocked at 2.83 GHz) gets the same rates as you with 4 threads, and some benchmarks are even better!

  6. Jen, thanks for your comment. Yes, you're certainly right. It seems I have to clarify a few things in this post, as Solar Designer also emailed me about the missing part on proper JtR compile options.

    In short, the aim of this post was just to demonstrate a quick & clean MPI experiment and show how it can be used. I decided to skip the CPU/OS optimization hints entirely to keep things simple, assuming John users (not familiar with MPI) have already read its README :)

    Anyway, I'll update and fix this post ASAP, and will also publish a new post on John covering some benchmarks of version 1.7.4.2.

    Thank you

  7. Here is what I got when I combined 16 machines with 8 cores each, that's 120 cores.

    -bash-3.2$ mpirun -machinefile mpd.hosts -np 134 ./john --test
    Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
    Many salts: 296499K c/s real, 302133K c/s virtual

    Only one salt: 259356K c/s real, 259988K c/s virtual

    Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
    Many salts: 9737K c/s real, 9799K c/s virtual

    Only one salt: 9490K c/s real, 9536K c/s virtual

    Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
    Raw: 1340K c/s real, 1343K c/s virtual


    Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
    Raw: 62085 c/s real, 63937 c/s virtual


    Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
    Short: 45037K c/s real, 45660K c/s virtual

    Long: 145465K c/s real, 145534K c/s virtual

    Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
    Raw: 1681786K c/s real, 1692814K c/s virtual

  8. Thanks for your detailed description. Actually, I'm not clear about the principles the MPI patch works on. Does it work with every mode of JtR, or just the dictionary mode? Not every serial method can be converted into a parallel one... I sincerely hope you can see what I mean...

  9. Hello, I'm not clear about the principles of John MPI. Is it designed only for the dictionary mode in JtR?
