Category Archives: Business Intelligence

Big Data Right Now: Five Trendy Open Source Technologies

http://techcrunch.com/2012/10/27/big-data-right-now-five-trendy-open-source-technologies/

 

By Tim Gasper

Big Data is on every CIO’s mind this quarter, and for good reason. Companies will have spent $4.3 billion on Big Data technologies by the end of 2012.

But here’s where it gets interesting. Those initial investments will in turn trigger a domino effect of upgrades and new initiatives valued at $34 billion for 2013, per Gartner. Over a five-year period, spending is estimated at $232 billion.

What you’re seeing right now is only the tip of a gigantic iceberg.

Big Data is presently synonymous with technologies like Hadoop and the “NoSQL” class of databases, including MongoDB (a document store) and Cassandra (a wide-column key-value store). Today it’s possible to stream real-time analytics with ease. Spinning clusters up and down is a (relative) cinch, accomplished in 20 minutes or less. We have table stakes.

But there are new, untapped advantages and non-trivially large opportunities beyond these usual suspects.

Did you know that there are over 250K viable open source technologies on the market today? Innovation is all around us, and the complexity of the ecosystem keeps increasing.

We have a lot of…choices, to say the least.

What’s on our own radar, and what’s coming down the pipe for Fortune 2000 companies? What new projects are the most viable candidates for production-grade usage? Which deserve your undivided attention?

We did all the research and testing so you don’t have to. Let’s look at five new technologies that are shaking things up in Big Data. Here is the newest class of tools that you can’t afford to overlook, coming soon to an enterprise near you.

Storm and Kafka

Storm and Kafka are the future of stream processing, and they are already in use at a number of high-profile companies including Groupon, Alibaba, and The Weather Channel.

Born inside of Twitter, Storm is a “distributed real-time computation system”. Storm does for real-time processing what Hadoop did for batch processing. Kafka, for its part, is a messaging system developed at LinkedIn to serve as the foundation for their activity stream and the data processing pipeline behind it.

When paired together, you get the stream, you get it in real time, and you get it at linear scale.

Why should you care?

With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.
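
To make this concrete, here is a minimal sketch of the Kafka half of such a pipeline: publishing events onto a topic that a Storm topology (or any other consumer) could then process in real time. It uses the kafkajs client for Node.js, which is not mentioned in the article; the broker address, topic name, and payload are placeholders.

```typescript
import { Kafka } from "kafkajs";

// Hypothetical broker address and client id -- substitute your own cluster details.
const kafka = new Kafka({ clientId: "clickstream-producer", brokers: ["localhost:9092"] });
const producer = kafka.producer();

async function publishEvents(): Promise<void> {
  await producer.connect();
  // Each message lands on the "page-views" topic; a Storm topology (or any
  // other consumer) can subscribe and process the stream as it arrives.
  await producer.send({
    topic: "page-views",
    messages: [
      { key: "user-42", value: JSON.stringify({ page: "/pricing", ts: Date.now() }) },
    ],
  });
  await producer.disconnect();
}

publishEvents().catch(console.error);
```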

Stream processing solutions like Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, load) and data integration.

Storm and Kafka are also great at in-memory analytics, and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs. Real-time streaming analytics is a must-have component in any enterprise Big Data solution or stack, because of how elegantly it handles the “three V’s” — volume, velocity and variety.

Storm and Kafka are the two technologies on the list that we’re most committed to at Infochimps, and it is reasonable to expect that they’ll be a formal part of our platform soon.

Drill and Dremel

Drill and Dremel make large-scale, ad-hoc querying of data possible, with radically lower latencies that are especially apt for data exploration. They make it possible to scan over petabytes of data in seconds, to answer ad hoc queries and presumably, power compelling visualizations.

Drill and Dremel put power in the hands of business analysts, and not just data engineers. The business side of the house will love Drill and Dremel.

Drill is the open source version of what Google is doing with Dremel (Google also offers Dremel-as-a-Service with its BigQuery offering). Companies are going to want to make the tool their own, which is why Drill is the one to watch most closely. Although it’s not quite there yet, strong interest from the development community is helping the tool mature rapidly.

Why should you care?

Drill and Dremel compare favorably to Hadoop for anything ad-hoc. Hadoop is all about batch processing workflows, which creates certain disadvantages.

The Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall to Pig and Hive, many interface layers have been built on top of Hadoop to make it more friendly, and business-accessible. Yet, for all of the SQL-like familiarity, these abstraction layers ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (read: running jobs, or “workflows”).

What if you’re not worried about running jobs? What if you’re more concerned with asking questions and getting answers — slicing and dicing, looking for insights?

That’s “ad hoc exploration” in a nutshell — if you assume data that’s been processed already, how can you optimize for speed? You shouldn’t have to run a new job and wait, sometimes for considerable lengths of time, every time you want to ask a new question.
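
As an illustration of that ad-hoc style, here is a hedged sketch of asking Drill a question over raw files via its REST API. The endpoint, port, file path, and column names are assumptions based on Drill's defaults, not anything specified in the article; in practice you would point the query at your own storage plugin and data.

```typescript
// Ad-hoc exploration: fire a SQL question at Drill and get an answer back,
// with no MapReduce job to write or wait for. Assumes a local Drill instance
// exposing its REST API on the default port (8047).
async function askDrill(sql: string): Promise<unknown> {
  const response = await fetch("http://localhost:8047/query.json", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ queryType: "SQL", query: sql }),
  });
  return response.json();
}

// Hypothetical file path and column names, queried through Drill's dfs plugin.
askDrill(
  "SELECT user_region, COUNT(*) AS events " +
  "FROM dfs.`/data/events/2012.parquet` GROUP BY user_region"
).then(rows => console.log(rows));
```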

In stark contrast to workflow-based methodology, most business-driven BI and analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Writing MapReduce workflows is prohibitive for many business analysts. Waiting minutes for jobs to start and hours for workflows to complete is not conducive to an interactive experience of data, the comparing and contrasting, and the zooming in and out that ultimately creates fundamentally new insights.

Some data scientists even speculate that Drill and Dremel may actually be better than Hadoop in the wider sense, and a potential replacement, even. That’s a little too edgy a stance to embrace right now, but there is merit in an approach to analytics that is more query-oriented and low latency.

At Infochimps we like the Elasticsearch full-text search engine and database for doing high-level data exploration, but for truly capable Big Data querying at the (relative) seat level, we think that Drill will become the de facto solution.

R

R is an open source statistical programming language. It is incredibly powerful. Over two million (and counting) analysts use R. It’s been around since 1997, if you can believe it. It is a modern version of the S language for statistical computing that originally came out of Bell Labs. Today, R is quickly becoming the new standard for statistics.

R performs complex data science at a much smaller price (both literally and figuratively). R is making serious headway in ousting SAS and SPSS from their thrones, and has become the tool of choice for the world’s best statisticians (and data scientists, and analysts too).

Why should you care?

Because it has an unusually strong community around it, you can find R libraries for almost anything under the sun — making virtually any kind of data science capability accessible without new code. R is exciting because of who is working on it, and how much net-new innovation is happening on a daily basis. The R community is one of the most thrilling places to be in Big Data right now.

R is also a wonderful way to future-proof your Big Data program. In the last few months, literally thousands of new features have been introduced, replete with publicly available knowledge bases for every analysis type you’d want to do as an organization.

Also, R works very well with Hadoop, making it an ideal part of an integrated Big Data approach.

To keep an eye on: Julia is an interesting and growing alternative to R, because it combats R’s notoriously slow language interpreter problem. The community around Julia isn’t nearly as strong right now, but if you have a need for speed…

Gremlin and Giraph

Gremlin and Giraph help empower graph analysis and are often coupled with graph databases like Neo4j or InfiniteGraph, or, in the case of Giraph, used with Hadoop. GoldenOrb is another high-profile example of a graph-based project picking up steam.

Graph databases are pretty cutting edge. They have interesting differences from relational databases, which means that sometimes you might want to take a graph approach rather than a relational approach from the very beginning.

The common analogue for graph-based approaches is Google’s Pregel, of which Gremlin and Giraph are open source alternatives. In fact, here’s a great read on how mimicry of Google technologies is a cottage industry unto itself.

Why should you care?

Graphs do a great job of modeling computer networks, and social networks, too — anything that links data together. Another common use is mapping, and geographic pathways — calculating shortest routes for example, from place A to place B (or to return to the social case, tracing the proximity of stated relationships from person A to person B).
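
For a feel of the underlying problem, here is a tiny breadth-first search over a made-up social graph in plain TypeScript. It is not Gremlin or Giraph syntax; it simply illustrates the shortest-path question that those tools answer declaratively, and at a vastly larger scale.

```typescript
// A toy social graph as an adjacency list. Real graph engines answer the same
// question over billions of edges; this just shows the shape of the problem.
const graph: Record<string, string[]> = {
  alice: ["bob", "carol"],
  bob: ["alice", "dave"],
  carol: ["alice", "dave"],
  dave: ["bob", "carol", "erin"],
  erin: ["dave"],
};

// Breadth-first search: returns the shortest chain of relationships from start to goal.
function shortestPath(start: string, goal: string): string[] | null {
  const queue: string[][] = [[start]];
  const visited = new Set<string>([start]);
  while (queue.length > 0) {
    const path = queue.shift()!;
    const node = path[path.length - 1];
    if (node === goal) return path;
    for (const neighbor of graph[node] ?? []) {
      if (!visited.has(neighbor)) {
        visited.add(neighbor);
        queue.push([...path, neighbor]);
      }
    }
  }
  return null;
}

console.log(shortestPath("alice", "erin")); // ["alice", "bob", "dave", "erin"]
```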

Graphs are also popular for bioscience and physics use cases for this reason — they can chart molecular structures unusually well, for example.

Big picture, graph databases and analysis languages and frameworks are a great illustration of how the world is starting to realize that Big Data is not about having one database or one programming framework that accomplishes everything. Graph-based approaches are a killer app, so to speak, for anything that involves large networks with many nodes, and many linked pathways between those nodes.

The most innovative scientists and engineers know to apply the right tool for each job, making sure everything plays nice and can talk to each other (the glue in this sense becomes the core competence).

SAP Hana

SAP Hana is an in-memory analytics platform that includes an in-memory database and a suite of tools and software for creating analytical processes and moving data in and out, in the right formats.

Why should you care?

SAP is going against the grain of most entrenched enterprise mega-players by making a very powerful product broadly accessible to developers. And it’s not only that: SAP is also creating meaningful incentives for startups to embrace Hana. They are authentically fostering community involvement, and there is uniformly positive sentiment around Hana as a result.

Hana greatly benefits applications with unusually fast processing needs, such as financial modeling and decision support, website personalization, and fraud detection, among many other use cases.

The biggest drawback of Hana is that “in-memory” means, by definition, that the data lives in RAM, which has clear performance advantages but is much more expensive per gigabyte than conventional disk storage.

For organizations that don’t mind the added operational cost, Hana means incredible speed for very low-latency Big Data processing.

Honorable mention: D3

D3 doesn’t make the list quite yet, but it’s close, and worth mentioning for that reason.

D3 (short for Data-Driven Documents) is a JavaScript visualization library that revolutionizes how powerfully and creatively we can visualize information, and make data truly interactive. It was created by Michael Bostock, grew out of his data visualization research at Stanford, and he now continues that work as Graphics Editor at the New York Times.

For example, you can use D3 to generate an HTML table from an array of numbers. Or, you can use the same data to create an interactive  bar chart with smooth transitions and interaction.
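
Here is a minimal sketch of that pattern: D3’s data join binds an array of numbers to DOM elements so that each value drives one bar of a simple div-based chart. It assumes D3 is available on the page; the styling details are placeholders.

```typescript
import * as d3 from "d3";

// The classic D3 data join: bind an array of numbers to DOM elements and let
// each datum drive one bar's width and label.
const data = [4, 8, 15, 16, 23, 42];

d3.select("body")
  .append("div")
  .attr("class", "chart")
  .selectAll("div")
  .data(data)
  .enter()
  .append("div")
  .style("width", d => `${d * 10}px`)
  .style("background", "steelblue")
  .style("color", "white")
  .style("margin", "1px")
  .text(d => String(d));
```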

Here’s an example of D3 in action, making President Obama’s 2013 budget proposal understandable, and navigable.

With D3, programmers can create dashboards galore. Organizations of all sizes are quickly embracing D3 as a superior visualization platform to the heads-up displays of yesteryear.

Editor’s note: Tim Gasper is the Product Manager at Infochimps, the #1 Big Data platform in the cloud. He leads product marketing, product development, and customer discovery. Previously, he was co-founder and CMO at Keepstream, a social media curation and analytics company that Infochimps acquired in August of 2010. You should follow him on Twitter here.

Using Business Rules to Make Business Intelligence Actionable

Today I found an interesting article about how to make BI useful.
 

Using Business Rules to Make Business Intelligence Actionable

Business intelligence (BI) is exceptional at taking cuts of data and displaying them as reports and dashboards. BI can even reshape data within analytic frameworks, allowing the user to view the data from many perspectives. However, BI is not great at transforming data based on business logic to allow for new forms of analysis. An example might be posing a question such as “what would happen to my call center costs if I automatically approved all claims below this new dollar figure?” In this article, I will explore how organizations can combine business rules management systems (BRMS) and BI solutions to create powerful new analytics, such as claims leakage analysis or Sarbanes-Oxley compliance analysis. I will also illustrate how this same powerful combination of tools can move from passive analysis to operational automation.

What Does Actionable Mean?

Actionable means that you not only have a complete understanding of what is happening, but you have a reasonable understanding of where and why it is happening. With that basis, you are in a position to take action.

A Framework for this Discussion

For this discussion, I will use an example of automobile claims processing in the insurance industry. Most readers have either submitted an auto claim or can effectively imagine this situation. I will pose two questions as a typical business executive:

  1. How much money are we paying on automobile claims that we are not obligated to pay?
  2. If we automatically approved all claims below $1,000 for noninjury accidents, what would happen to our costs?

Question 1 reflects a question on compliance. Question 2 reflects what the effect would be as a result of a change to a business policy.

I will start the discussion using question 1: Compliance.

Why is Most BI Not Actionable Today?

BI software excels at taking large sets of data and organizing and combining that data into different views. You can easily pull a subset of claim data, such as only the claims from 2007. You can easily carve and slice the data to answer many questions beyond the obvious. How many claims did we pay in 2007? How much was the average claim? What percentage did we pay? Even deeper, what percentage did we pay by region?
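
As a toy illustration of that kind of slicing, the sketch below computes the percentage of claims paid by region from a hypothetical flat list of claim records; the record shape is invented for the example.

```typescript
// Hypothetical claim records -- the kind of flat data a BI tool slices and dices.
interface Claim {
  id: number;
  year: number;
  region: string;
  amount: number;
  paid: boolean;
}

// "What percentage of claims did we pay by region?" -- a classic BI cut of the data.
function paidRateByRegion(claims: Claim[], year: number): Map<string, number> {
  const totals = new Map<string, { paid: number; all: number }>();
  for (const c of claims.filter(claim => claim.year === year)) {
    const bucket = totals.get(c.region) ?? { paid: 0, all: 0 };
    bucket.all += 1;
    if (c.paid) bucket.paid += 1;
    totals.set(c.region, bucket);
  }
  return new Map(
    Array.from(totals.entries()).map(([region, { paid, all }]) => [region, (100 * paid) / all])
  );
}
```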

This type of analysis can be used to start the process of answering our real questions, but it can’t be used to actually answer our real questions. Classic BI cannot answer a question like: How much money are we paying on automobile claims that we are not obligated to pay? Or, what percentage of claims paid by region were compliant with my claims approval policy?

This is because BI can cut, sort and represent existing data, but it cannot analyze data to determine what each transaction should have been if a policy had been followed. Typically, BI is used to refine the scope of the question, and then manual review and analysis take over. This is very expensive, and many of these manual analysis paths go nowhere.

Moving from BI to Manual Analysis – The Typical Last Mile

When you move into manual analysis, you will typically use a set of business rules to analyze the data. When you look at one of these claims, you use the current claim payment policy to judge which claims are valid and which are invalid. The instructions within the policy are business rules. You will typically perform the analysis on a statistically relevant subset of the claims, because it is too costly to assess every claim. If the policy is not very specific, there is room for interpretation, and the analysis (audit) has the potential for error. If many people are doing this analysis, there is further room for differences in interpretation between auditors, increasing the potential for errors in the analysis.

It is understandable why people feel frustrated. BI often creates as many questions as answers. This starts a typical cycle of questions generating more questions. The answers are often difficult to understand and to prove.

How Do Business Rules Make BI Actionable?

I will summarize a pattern for using a BRMS to accomplish this analysis. I don’t have room to go into detail – email me if you are interested.

  1. Use a BRMS to code the rules for approving claims. These are the same rules used to perform a manual audit. A BRMS brings structure to these rules, ensuring they are unambiguous. This is typically a fast and low-cost process.
  2. Take all 2007 claim transactions and use the BRMS to make each approve/deny decision again. You leave the original transactions (the ones made by your people) in place. This step creates a second, “challenger” transaction record (stored in a different database), with each decision based on the newly automated BRMS policy. Each original transaction now has a corresponding challenger transaction.
  3. Now, using classic BI tools, you can compare the two data sets and answer a whole new set of questions. By comparing the original and challenger transactions (one at a time or in groups), you can answer questions like the following (the sketch after this list illustrates the comparison):
    • What percentage of the original transactions are consistent with the policy-validated challenger transactions? This tells you what percentage of your transactions are policy compliant.
    • What percentage of transactions that were paid should not have been paid? This answers our original question. But you can go so much further!
    • What percentage of claims did you deny that you should have paid? This helps assess litigation risk.
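
Here is a minimal sketch of steps 2 and 3, with a hard-coded rule standing in for the BRMS-managed policy and an invented record shape. It replays each historical claim to produce a challenger decision and then tallies how the original decisions compare.

```typescript
// A made-up approval rule standing in for whatever the BRMS-coded policy says:
// approve noninjury claims under $1,000, deny everything else. Real policies
// would be authored and managed in the BRMS, not hard-coded like this.
interface ClaimRecord {
  id: number;
  amount: number;
  injury: boolean;
  originalDecision: "approve" | "deny";
}

const policyDecision = (c: ClaimRecord): "approve" | "deny" =>
  !c.injury && c.amount < 1000 ? "approve" : "deny";

// Step 2: replay every historical claim through the rules to get the challenger
// decision, then (step 3) compare challenger against original.
function complianceReport(claims: ClaimRecord[]) {
  let compliant = 0;
  let paidButShouldNotHaveBeen = 0;
  let deniedButShouldHaveBeenPaid = 0;

  for (const c of claims) {
    const challenger = policyDecision(c);
    if (challenger === c.originalDecision) compliant++;
    else if (c.originalDecision === "approve") paidButShouldNotHaveBeen++;
    else deniedButShouldHaveBeenPaid++;
  }

  return {
    compliancePct: (100 * compliant) / claims.length,
    paidButShouldNotHaveBeen,     // overpayment exposure
    deniedButShouldHaveBeenPaid,  // litigation risk
  };
}
```

In practice the rules would live in the BRMS and the comparison would run in your BI tool across the original and challenger tables, as the pattern above describes.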

You can then begin to use low-cost BI techniques to cut your data to get detailed answers.

    • Sort the data by region to see if a particular region is noncompliant with the approval policy.
    • Sort the data by training level of claims personnel to see if individuals with certain training are more or less compliant.
    • You can even use this technique to very quickly assess compliance to policy for every person processing claims, netting a clear list of personnel who have more than 10 percent noncompliance.

You have now not only answered the question of “How much money are we paying on automobile claims that we are not obligated to pay?,” but you also have a framework to drill down and determine exactly in what environment or situation these unnecessary payments are being made. You can now quickly move to actions, such as “ensure all employees complete training level X by a certain date” if you determine that noncompliance is concentrated among less-trained employees, or “move to investigation” if you find a high percentage of noncompliance in a given regional office by a specific set of employees.

Note: Using rules-based analysis, you no longer have to work from a statistically relevant subset of the transactions. Because the analysis is automated, you can assess each and every transaction for compliance.

Question 2: Analyze and Act on Change in Business Policy.

A More Sophisticated Analysis to Action Cycle

Recall question 2: “If we automatically approved all claims below $1,000 for noninjury accidents, what would happen to our costs?” How does rules-based analysis help?

You will basically follow the pattern outlined above, but instead of automating the rules associated with the policy that is in force, you will build the rules associated with the new or changed policy. You will again take a set of historical transaction records and create a set of new challenger transactions. In this case, the challenger transactions reflect the outcome of these claims if the new policy had been used.

You can now easily answer questions such as: How many claims would have been approved under the proposed policy that were denied under the current policy? Interesting, but that’s only the tip of the iceberg. Below are some of the questions that can be answered (and some of the associated arithmetic) that make this analysis very powerful.

  • You can now identify what percentage of historical transactions already comply with the new policy. Each historical transaction that meets the new policy reflects processing cost that was incurred but would not be incurred under the new policy. If you have 1 million transactions a year that fit the scope of the policy (<$1,000 and noninjury), you already approve 90 percent of these, and the average transaction processing cost is $22, then automating this new policy will eliminate $19.8 million in cost.
  • If, on the other hand, you have historically denied 50 percent of these claims, with an average claim payment (net to the insurance company) of $65, then comparing the roughly $32.5 million in increased claims cost against the roughly $20 million reduction in processing cost shows that the new policy will free internal resources, but not save money. (The sketch below works through the arithmetic.)
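
The sketch below simply works through the figures used in the two bullets above; all of the numbers come from the article’s own example.

```typescript
// Working through the article's example figures for the proposed
// "auto-approve noninjury claims under $1,000" policy.
const claimsInScope = 1_000_000;    // claims per year under $1,000, noninjury
const processingCostPerClaim = 22;  // dollars of manual handling per claim

// Scenario 1: 90% of in-scope claims were already being approved manually,
// so their processing cost disappears once approval is automated.
const alreadyApproved = 0.9 * claimsInScope;
const processingSaved = alreadyApproved * processingCostPerClaim;
console.log(processingSaved); // 19_800_000 -> the article's ~$20M processing saving

// Scenario 2: 50% of in-scope claims were historically denied; auto-approval
// would now pay them, at an average net payment of $65 per claim.
const newlyPaid = 0.5 * claimsInScope;
const extraClaimsCost = newlyPaid * 65;
console.log(extraClaimsCost); // 32_500_000 -> roughly $32.5M in new claims cost

// Net effect, following the article's own comparison: the policy frees manual
// processing capacity but increases total cost.
console.log(extraClaimsCost - processingSaved); // 12_700_000 net increase
```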

Nobody can deny the need to use BI technology and techniques to analyze data. But when business rules are used to generate new data, based on existing or proposed policy, the analysis can deliver more specific answers, which are actionable. These techniques can help close the loop on critical organizational questions. Even further, because a BRMS is a form of automation, the same business rules used to analyze transaction data can be put into production and used to automate those transactions, taking significant operational cost out of the organization.

David Straus is senior vice president of worldwide marketing for Corticon Technologies. Straus joined Corticon with over 20 years of enterprise software solutions experience in product, marketing and sales. He has held executive positions at Chordiant Software, which acquired OnDemand Inc., a company Straus founded in 1997. Prior to Chordiant, Straus held executive positions at TSW International, OpenVision Technologies and Hewlett-Packard.
