Cloudera’s promise of a packaged Enterprise Data Hub reached general availability on February 4, 2014. Cloudera took a rather combative position versus potential data hub competitors like Hortonworks and IBM, in response to videos like this. Regardless, the new Cloudera Enterprise Data Hub Edition indeed bundles a wide set of features needed for a turnkey data management solution.
Only time will tell if data hubs in general, and Cloudera’s specifically, will turn into a standard resource for IT and the business IT serves. Right now comprehensive data hubs are rare in enterprises, and the data hub market could at best be considered nascent, though promising.
Thus, if your large organization has a functioning data hub already, you are in the minority. If you have a data warehouse and/or an operational data store, however, you are in the majority. The argument for a data hub is quite different from the argument for an operational data store, and it offers a more strategic value proposition. How is it different? What steps might you consider if you are interested in data hubs?
The Difference: Operational data stores and data warehouses serve business intelligence needs. A data hub, on the other hand, provides a comprehensive data management platform to serve a wide swath of an organization’s data needs. Could you use it for business intelligence? Sure, but that isn’t the strategic role of a data hub. Here are some other potential uses of a data hub:
· Address a wide range of data-related compliance requirements, including ad hoc requests.
· Provide a shared backup, restore, disaster recovery, and data protection facility.
· Support data-infused business process insight and innovation.
· Serve a long list of operational applications that need dependable, secure data management.
Another way to look at a data hub is that it takes data management out of application silos and turns it into something close to a shared service. Yet another way to think about it: If your organization had a Chief Data Officer (CDO), and some organizations already do, the data hub would be a key asset under the governance of the CDO for the benefit of the organization. The CDO would also want information governance tools that meld with the data hub and act as its instrumentation, as well as a set of APIs to speed along development on, and expansion of, the data hub.
Federation – Another Difference: Data hubs in their grandest sense may be relatively virtual and somewhat federated compared to data warehouses. The notion that an enterprise is going to dump all its data into Hadoop, the proverbial physical data lake, is probably a stretch. Think, instead, of the data hub as an orchestra of services that enable you to find, connect to, use and possibly store your data. Sometimes it makes sense to physically co-locate data in a hub, sometimes not.
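To make the federation idea a little more concrete, here is a minimal sketch in Python of a catalog that knows where each data set lives and how to reach it, whether or not the data is physically co-located in the hub. Every name in it (`DatasetEntry`, `FederatedCatalog`, the `hub://` and `mainframe://` locations, the sample rows) is a hypothetical assumption for illustration only; it does not represent how Cloudera, MapR, or any other product actually works.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DatasetEntry:
    """One data set known to the hub (hypothetical structure)."""
    name: str
    location: str                      # where the data physically lives
    colocated: bool                    # True if stored inside the hub itself
    reader: Callable[[], List[dict]]   # how to fetch rows from that location


class FederatedCatalog:
    """Find, connect to, and use data sets without requiring them to move."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, keyword: str) -> List[str]:
        # Case-insensitive name search, returned in sorted order.
        return sorted(n for n in self._entries if keyword.lower() in n.lower())

    def read(self, name: str) -> List[dict]:
        # Delegate to whatever reader the owning system registered.
        return self._entries[name].reader()

    def locations(self) -> Dict[str, str]:
        return {n: e.location for n, e in self._entries.items()}


# Illustrative usage with made-up data sets: one co-located, one left in place.
catalog = FederatedCatalog()
catalog.register(DatasetEntry(
    "customer_orders", "hub://hadoop/orders", True,
    lambda: [{"order_id": 1}]))
catalog.register(DatasetEntry(
    "customer_ledger", "mainframe://ledger", False,
    lambda: [{"account": "A-1"}]))

print(catalog.find("customer"))
print(catalog.locations())
```

The design choice worth noticing is that the catalog stores a location and a reader for each data set rather than the data itself, so the decision to physically co-locate a data set can be made case by case.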
Security and Availability – Another Difference: Operational data stores need security and availability, but if they crash, the organization does not crash with them; they are not transactional solutions. When you move to a data hub, though, the security and availability requirements rise toward those of a key transactional application. The data hub acts as an extension to operational applications, and so should deservedly be under the microscopes of the CISO, VP of IT Operations, and COO.
Start with what and where: If you are in a position, whether CDO or not, to entertain the notion of a data hub, why not begin at the beginning and try to determine what data you have and where it lives, and then eventually pursue why you have it? EnterpriseWeb offers a leading tool to help figure out what data you have, where it is located, and a sense of why it is being used. Alternatively, you could pursue your data what and where through a professional services project, but such projects often take a long time, are correspondingly expensive, and are prone to error, and the data has changed before the project is finished. Start with a tool that you can reuse.
What about a mainframe? Hadoop, with help, makes a fair amount of sense as a data hub. You would be hard pressed to find anything close to an “integrated data hub” software package from the historical players in backup/restore, DR, or data protection. It is fair to say, therefore, that Cloudera and MapR, and a few others, are indeed about as good a choice as you might find for turnkey data hub software in the market. But consider the infrastructure for a data hub for a moment. You need (a) very fat I/O pipes that are (b) highly secure and (c) operationally reliable, with (d) super fast, low latency, possibly virtual processing. Ever heard of IBM zEnterprise? IBM’s mainframe runs Linux, supports Java, and could serve as an excellent infrastructure for an enterprise data hub. Because of the requirement for fat I/O pipes, I worry about public cloud for a serious data hub, at least today.
GTSoftware offers Ivory DataHub, a solution for the mainframe. It looks like the GTSoftware solution primarily provides a data hub for mainframe-based assets though, rather than using the mainframe as a general purpose data hub. If you are interested in the more general purpose approach, however, Veristorm offers Apache Hadoop for IBM System z running Linux. The point? Don’t just jump to “commodity servers and storage” for your data hub infrastructure.
Summary: Enterprise Data Hub as a Strategic Asset
Applaud Cloudera, for it is pushing the state of the art. Applaud MapR, which seems to get the bigger enterprise data hub picture when it says on its web site, “MapR delivers on the promise of Hadoop with a proven, enterprise-grade Big Data platform that supports a broad set of mission-critical and real-time production uses.” Applaud all the other competitors for being part of the data hub movement, even though there will be many more steps and much maturation along the way. Finally, do not lose sight of the fact that the true goal of a data hub is to turn data into a fungible, reusable asset across the solution board for the benefit of your company.