Thursday, December 9, 2010

Java Parallelization Options

Last night I taught a class on Monte Carlo Simulation (MCS) using Excel and Crystal Ball (Oracle).  This was part of an ongoing course on System Modeling Theory (a.k.a. Management Science) that I am teaching at Strayer University.  In modeling we use MCS to simulate the probability distributions of uncertain model parameters.  This helps us understand the uncertainty and potential risk that varying input parameters introduce into the problems we are attempting to solve with our models.

As I was executing the simulation in Excel/Crystal Ball with a Normal Distribution Curve and 1000 trials, my mind started to wonder about how I would do this in Java.  Given my experience with Java numerical computation I theorized that I would need more resources than just my laptop if I were to pursue more complex model simulations that included many more uncertain input parameters and model permutations in a JRE.
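To make the idea concrete, here is a minimal sketch of what such a simulation might look like in plain Java.  It assumes a single uncertain input drawn from a normal distribution and 1000 trials, as in the class exercise.  The `model` method, its margin and fixed-cost figures, and the class and method names are all hypothetical; they are not taken from the Excel/Crystal Ball workbook.

```java
import java.util.Random;

/** Minimal Monte Carlo sketch: sample one uncertain input from a normal distribution. */
final class MonteCarloSketch {

    /** Holds the sample mean and sample standard deviation of the simulated outcomes. */
    static final class Summary {
        final double mean;
        final double stdDev;

        Summary(double mean, double stdDev) {
            this.mean = mean;
            this.stdDev = stdDev;
        }
    }

    /** Hypothetical model (an assumption): profit = units sold * margin - fixed cost. */
    static double model(double unitsSold) {
        return unitsSold * 2.5 - 1000.0;
    }

    /** Runs the given number of trials with a normally distributed input. */
    static Summary simulate(int trials, double inputMean, double inputStdDev, long seed) {
        Random rng = new Random(seed); // fixed seed keeps runs repeatable
        double sum = 0.0;
        double sumSq = 0.0;
        for (int i = 0; i < trials; i++) {
            // Scale a standard normal draw to the requested mean and spread.
            double input = inputMean + inputStdDev * rng.nextGaussian();
            double outcome = model(input);
            sum += outcome;
            sumSq += outcome * outcome;
        }
        double mean = sum / trials;
        double variance = (sumSq - trials * mean * mean) / (trials - 1);
        return new Summary(mean, Math.sqrt(Math.max(variance, 0.0)));
    }

    public static void main(String[] args) {
        Summary s = simulate(1000, 500.0, 50.0, 42L);
        System.out.printf("mean=%.2f stdDev=%.2f%n", s.mean, s.stdDev);
    }
}
```

Because every trial is independent of the others, the loop is a natural candidate for splitting across threads or machines, which is what makes the parallel options worth a look for larger models.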

With the advocacy of Cloud Computing everywhere these days, I have been tracking the progress of Java-based parallel and grid computing efforts.  I have noticed a few solutions that would seem to fit the bill for the more complex numerical data computation that I think I would need to tackle complex problems and financial models with Java.

Hadoop
According to its developers, Hadoop is open-source software for “reliable, scalable, distributed computing.”  Hadoop consists of several sub-projects, some of which have been promoted to top-level Apache projects.  Some of the contributors to the Hadoop project are from Cloudera.  Cloudera offers a commercialized version of Hadoop with enterprise support, similar to the model that Red Hat has with its RHEL/Fedora and JBoss platforms.

In a nutshell, the idea behind Hadoop’s MapReduce project and its associated projects (HDFS, HBase, etc.) is to perform complex analyses on extremely large (multi-terabyte) sets of structured and/or unstructured data.  The storage and processing of these huge data sets are distributed across multiple, relatively inexpensive computers and/or servers (called nodes) instead of a few very large systems.  The nodes together form a cluster.  The premise behind Hadoop, as I understand it, is to encapsulate and abstract the distributed storage and processing so that developers do not have to manage the distributed aspect of the program themselves.

Hadoop’s MapReduce project, written in Java, is based on Google’s MapReduce, written in C++.  It splits huge data sets into more manageable, independent chunks that are processed in parallel.  Hadoop MapReduce works in tandem with HDFS to store and process these chunks on the nodes of the distributed cluster.  Using MapReduce requires Java developers to learn the Hadoop MapReduce API and commands.
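The split, map, and reduce steps can be illustrated with a small single-JVM word count.  This is only a sketch of the programming model: it deliberately does not use the Hadoop MapReduce API, and the class and method names are hypothetical.  Each string in the list stands in for a chunk that a real cluster would store and process on a separate node.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Single-JVM illustration of the map and reduce phases; not the Hadoop API. */
final class MapReduceSketch {

    /** Map phase: count the words in one chunk, independently of all other chunks. */
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Reduce phase: merge the partial counts produced for each chunk. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    /** Processes the independent chunks in parallel, then combines the results. */
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(MapReduceSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("to be or not to be", "that is the question", "to be");
        System.out.println(wordCount(chunks));
    }
}
```

The important property is that `mapChunk` never looks outside its own chunk, so the chunks can be processed in any order and on any node before the partial results are merged.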

Grid Gain
Grid Gain is another solution for distributed processing, including MapReduce computation across potentially inexpensive distributed computer nodes.  According to Grid Gain, their product is a “Java-based grid computing middleware.”  There are many features to this product, including what they call “Zero Deployment.”

While Hadoop comes with HDFS, which can be used to store and process unstructured data, Grid Gain does not use its own file system but instead connects to existing relational databases such as Oracle and MySQL.  Hadoop can also use its own high-performance HBase database, and I have heard of a connector to MySQL.  Hadoop seems to provide more isolation for task execution by spinning up separate JVMs for task execution.  Grid Gain seems to come with more tools for cloud computing and management.  Finally, though Hadoop is written in Java, its MapReduce functionality can be used by non-Java programs.

Aparapi
Aparapi is another API that provides parallel Java processing.  Unlike Hadoop and Grid Gain, Aparapi translates Java bytecode to OpenCL.  OpenCL is a framework for parallel programming that was originally developed by Apple and is now an open standard maintained by the Khronos Group.  The fascinating aspect of Aparapi and OpenCL is the hardware they are designed to execute on.  OpenCL can use Graphics Processing Units (GPUs) for parallel processing.

In my past life I was more connected to hardware than I am today, and I worked with Digital Signal Processors (DSP) and Field Programmable Gate Arrays (FPGA) with Analog to Digital Converters (ADC) and Digital to Analog Converters (DAC).  We used DSPs to process waveform data in near real-time.  We would capture the waveform data on our I/O cards and then offload the processing and transforms to a DSP.

I guess this is why OpenCL interests me so much.  With OpenCL, developers can write code that gets compiled at run time so that it is optimized to run on the existing GPUs in a computer or against multiple GPUs in multiple computers.  Based on the C language, OpenCL allows developers to use graphics chips like those from NVidia.  Imagine that for a moment…while most of the parallel processing world is harnessing grid and cloud computing power, OpenCL is focusing on a much cheaper hardware footprint.  In fact, Apple developers can use OpenCL on their Macs to harness the computing power of the installed GPU to perform high performance computing tasks.

With Aparapi, Java developers can now translate their code to be executed in the OpenCL framework.  The use of GPUs for parallel non-video processing is called General Purpose Computing on Graphics Processing Units (GPGPU).  Unlike a CPU core, which executes one thread at a time very fast and switches among threads to give the illusion of simultaneous execution, a GPU has a massively parallel architecture that allows true simultaneous execution of many threads.
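The programming style that suits this kind of hardware is to write a small "kernel" that computes one output element from its index, and then launch one logical work item per element.  The sketch below imitates that style on the CPU with a parallel stream.  It is a stand-in only: it does not use Aparapi or OpenCL, nothing here runs on a GPU, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** CPU-only stand-in for a GPU-style kernel; it does not use Aparapi or OpenCL. */
final class KernelStyleSketch {

    /** The "kernel": computes one output element from its index alone. */
    static void squareAt(int i, float[] in, float[] out) {
        out[i] = in[i] * in[i];
    }

    /** Launches one logical work item per element across the available CPU cores. */
    static float[] squareAll(float[] in) {
        float[] out = new float[in.length];
        // Each index writes only to its own slot, so the work items never conflict.
        IntStream.range(0, in.length).parallel().forEach(i -> squareAt(i, in, out));
        return out;
    }

    public static void main(String[] args) {
        float[] result = squareAll(new float[] {1f, 2f, 3f, 4f});
        System.out.println(Arrays.toString(result));
    }
}
```

Code written this way has no dependencies between elements, which is the property a GPU needs in order to run thousands of work items at once.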

Beyond Aparapi there are JCUDA, JOpenCL, and JOCL.  While these three are JNI wrappers around NVidia’s CUDA (JCUDA) and OpenCL (JOpenCL and JOCL), Aparapi takes a different approach and uses bytecode analysis to translate Java to OpenCL executable code.

It remains to be seen which platforms and techniques will emerge as the standard.  More to come as I explore some of these Java parallel programming options.

