Monday, July 1, 2013

How to reconcile Real-Time and Batch processing using In-Memory technology: A demo at the AFCEA Cyber Symposium Plugfest

As you might remember, we(*) participated in a "Plugfest"(**) earlier this year in San Diego. Here is the summary post of what we built for that occasion:

This time around, we entered the plugfest competition as a technology provider at the AFCEA Cyber Symposium, which happened last week (June 25-27 2013) in Baltimore. We not only provided technologies components and data feeds to the challengers (San Diego State University, GMU, Army PEO C3T milSuite), but also built a very cool Fraud Detection and Money Laundering demo which was 1 of the plugfest use case for this cyber event.

Our demo was centered around a fundamental "Big Data" question: How can you detect fraud on 100,000s transactions per seconds in real-time (which is absolutely critical if you don't want to lose lots of $$$$ to fraud) while efficiently incorporating in that real-time process data from external systems (i.e. data warehouse or hadoop clusters).
Or in more general words: How to reconcile Real-time processing and Batch processing when dealing with large amounts of data.

To answer this question, we put together a demo centered around Terracotta's In-Genius intelligence platform ( which provides a highly scalable low-latency in-memory layer capable of "reconciling" the real-time processing needs (ultra low latency with large amounts of new transactions) with the traditional batch processing needs (100s of TB/PB processed in an asynchronous background jobs), all bundled in a simple software package deployable on any commodity hardware.

Here is the solution we assembled:

Cyber Plugfest Software Architecture
How to reconcile real-time and batch processing

A quick view at how it all works:
  1. A custom transaction simulator generates pseudo-random fictional credit card transactions and publish all of them onto a JMS topic (Terracotta Universal Messaging bus)
  2. Each JMS message is delivered through pub/sub messaging to both real-time and batch track:
    1. The Complex Event Processing (CEP) engine which will identify fraud in real-time through the use of continuous queries.
      1. See "Real-Time fraud detection route"
    2. Apache Flume, an open source platform which will efficiently and reliably route all the messages into HDFS for further batch processing.
      1. See "Batch Processing Route"
  3. Batch Processing Route:
    1. Apache hadoop to collect and store all the transaction data in its powerful batch-optimized file system
    2. Map-Reduce jobs to compute transaction trends (simplified rolling average in this demo case) on the full transaction data for each vendors, customer, or purchase types.
    3. Output of map-reduce jobs stored in Terracotta BigMemory Max In-Memory platform.
  4. Real-Time fraud detection route:
    1. CEP fraud detection queries fetch from Terracotta BigMemory Max (microsecond latency under load) the hadoop-calculated averages (in 3.2), and correlates those with the current incoming transaction to detect anomalies (potential fraud) in real-time.
    2. Mashzone, a mashup and data aggregation tool to provide visualization on detected fraud data.
    3. For other plugfest challengers and technology providers to be able to use our data, all our data feeds were also available in REST, SOAP, and Web socket formats (which were used by ESRI, Visual Analytics, and others)

As I hope you can see in this post, having a scalable and powerful in-memory layer acting as the middle man between Hadoop and CEP is the key to providing true real-time analysis while still taking advantage of all the powerful computing capabilities that Hadoop has to offer.

In further posts, I'll explain in more detail all the components and code (code and configs are available on github at


(*) "we" = The SoftwareAG Government Solutions team, which I'm part of...
(**) "Plugfest" = "collaborative competitive challenge where industry vendors, academic, and government teams work towards solving a specific set of "challenges" strictly using the RI2P industrial best practices (agile, open standard, SOA, cloud, etc.) for enterprise information system development and deployment." (source:


Bevo said...

If CEP is pulling data for Terracotta, shouldn't the data flow arrow be flipped?

Fabien Sanglier said...

I like the attention to detail! You're very right! fixed it. Thanks!

jake george said...

Hadoop Developer online training| Hadoop Developer ...
ఈ పేజీని అనువదించు
hadoop developer online training, hadoop developer training, hadoop developer course contents, hadoop developer, hadoop developer enquiry, hadoop ...
Courses at 21st Century Software Solutions
Talend Online Training -Hyperion Online Training - IBM Unica Online Training -
Siteminder Online Training - SharePoint Online Training - Informatica Online Training
SalesForce Online Training - Many more… | Call Us +917386622889

Dhiraj Kumar said...

You got a really useful blog I have been here reading for about an hour. I am a newbie and your success is very much an inspiration for me.
software development