This time around, we entered the plugfest competition as a technology provider at the AFCEA Cyber Symposium, which happened last week (June 25-27 2013) in Baltimore. We not only provided technologies components and data feeds to the challengers (San Diego State University, GMU, Army PEO C3T milSuite), but also built a very cool Fraud Detection and Money Laundering demo which was 1 of the plugfest use case for this cyber event.
Our demo was centered around a fundamental "Big Data" question: How can you detect fraud on 100,000s transactions per seconds in real-time (which is absolutely critical if you don't want to lose lots of $$$$ to fraud) while efficiently incorporating in that real-time process data from external systems (i.e. data warehouse or hadoop clusters).
Or in more general words: How to reconcile Real-time processing and Batch processing when dealing with large amounts of data.
To answer this question, we put together a demo centered around Terracotta's In-Genius intelligence platform (http://terracotta.org/products/in-genius) which provides a highly scalable low-latency in-memory layer capable of "reconciling" the real-time processing needs (ultra low latency with large amounts of new transactions) with the traditional batch processing needs (100s of TB/PB processed in an asynchronous background jobs), all bundled in a simple software package deployable on any commodity hardware.
Here is the solution we assembled:
|Cyber Plugfest Software Architecture|
How to reconcile real-time and batch processing
A quick view at how it all works:
- A custom transaction simulator generates pseudo-random fictional credit card transactions and publish all of them onto a JMS topic (Terracotta Universal Messaging bus)
- Each JMS message is delivered through pub/sub messaging to both real-time and batch track:
- The Complex Event Processing (CEP) engine which will identify fraud in real-time through the use of continuous queries.
- See "Real-Time fraud detection route"
- Apache Flume, an open source platform which will efficiently and reliably route all the messages into HDFS for further batch processing.
- See "Batch Processing Route"
- Batch Processing Route:
- Apache hadoop to collect and store all the transaction data in its powerful batch-optimized file system
- Map-Reduce jobs to compute transaction trends (simplified rolling average in this demo case) on the full transaction data for each vendors, customer, or purchase types.
- Output of map-reduce jobs stored in Terracotta BigMemory Max In-Memory platform.
- Real-Time fraud detection route:
- CEP fraud detection queries fetch from Terracotta BigMemory Max (microsecond latency under load) the hadoop-calculated averages (in 3.2), and correlates those with the current incoming transaction to detect anomalies (potential fraud) in real-time.
- Mashzone, a mashup and data aggregation tool to provide visualization on detected fraud data.
- For other plugfest challengers and technology providers to be able to use our data, all our data feeds were also available in REST, SOAP, and Web socket formats (which were used by ESRI, Visual Analytics, and others)
As I hope you can see in this post, having a scalable and powerful in-memory layer acting as the middle man between Hadoop and CEP is the key to providing true real-time analysis while still taking advantage of all the powerful computing capabilities that Hadoop has to offer.
In further posts, I'll explain in more detail all the components and code (code and configs are available on github at https://github.com/lanimall/cyberplugfest).
(*) "we" = The SoftwareAG Government Solutions team, which I'm part of...
(**) "Plugfest" = "collaborative competitive challenge where industry vendors, academic, and government teams work towards solving a specific set of "challenges" strictly using the RI2P industrial best practices (agile, open standard, SOA, cloud, etc.) for enterprise information system development and deployment." (source: http://www.afcea.org/events/west/13/plugfest.asp)