While the initial hype around data lakes has passed, their schema-less approach is becoming a fundamental requirement in the age of digital transformation, as every business seeks to incorporate data analytics, consumed by both humans and machines, into its strategies and decisions.
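To make the "schema-less" idea concrete: data lakes typically apply schema-on-read, meaning raw records land as-is and a structure is imposed only when someone queries them. Here is a minimal Python sketch of that pattern; the record contents and field names are hypothetical, and a real lake would of course use distributed storage rather than an in-memory list.

```python
import json

# Hypothetical raw events landing in a data lake: heterogeneous
# records are stored as-is, with no upfront table schema.
raw_zone = [
    '{"user": "ana", "action": "login", "ts": 1700000000}',
    '{"user": "bo", "action": "purchase", "amount": 42.5}',
    '{"sensor": "t-01", "reading": 21.7}',  # machine-generated record
]

def read_with_schema(records, fields):
    """Apply a schema at read time ("schema-on-read"): keep only
    records that carry every requested field, projected to those fields."""
    rows = []
    for line in records:
        rec = json.loads(line)
        if all(f in rec for f in fields):
            rows.append({f: rec[f] for f in fields})
    return rows

# Two different consumers project two different schemas from the same raw data.
user_actions = read_with_schema(raw_zone, ["user", "action"])
sensor_readings = read_with_schema(raw_zone, ["sensor", "reading"])
```

The point is that no record was rejected at ingestion time for failing to match a table definition; each consumer decides later which shape of the data it cares about.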
Still, few businesses have successfully architected, implemented, and utilized a data lake as intended. Although the value propositions are compelling, the cost, complexity, skill sets, and time to value have stalled many traditional on-prem enterprise data lake implementations, which involve extensive amounts of storage, compute, networking, integration, management, and governance. IT must also ensure that corporate data assets remain secure, unaltered, and in place. And even among organizations that do have operational data lakes, few business-focused stakeholders have the skills or access rights needed to run analytics for special projects and business analysis.
The two resulting situations are:
- IT wants to get the line-of-business stakeholders, with their incessant last-minute demands for data and analytics, off their backs.
- Business-focused “wanna-be” data lake users are sick of being forced to “take a number” and wait months for IT to supply data and analytics.
Fortunately, there is a public cloud alternative to these costly, monolithic on-prem data lakes: Amazon Web Services, Google Cloud, and Microsoft Azure all offer the infrastructure services needed to deploy a data lake. The immediate conclusion for many would be… faster, better, cheaper. But as I tell my 12-year-old son, the real answer is almost always “It depends.”
So let’s take a quick look at how these two top-level approaches stack up across three categories, so that we can more clearly understand where each approach would be the best fit.
Note: There are many approaches within approaches to data lakes, and plenty of vendors, industry pundits and analysts will want to point out exceptions to my generalizations below. Additionally, there are many more metrics by which we could compare these two approaches. However, this blog seeks to simplify the comparison to the highest-level attributes for a business-minded audience.
We’re going to compare:
- Implementation Implications
- Stakeholders
- Use Cases
Implementation Implications

Cost: The sheer amount of capital required to build an on-prem enterprise data lake will vastly exceed the cost of the “as-a-service” approach of the public cloud offerings.
Complexity: As mentioned earlier, many data lake initiatives stall simply due to the complexity of building an in-house enterprise data lake, which spans storage, networking, virtualization, compute, workflow management, governance, and much more. In the public cloud, by contrast, the architecture has been pre-developed, pre-integrated, and battle-tested for you.
Skill sets: IT staff, data architects, engineers, data scientists and other specialists would be required to architect, implement and utilize an on-prem data lake. The public cloud approach dramatically reduces the need for deep technical knowledge so that business-focused users can gain self-service access to data on the lake.
Time to value: Deploying a data lake in the public cloud can be accomplished in as little as a few hours, whereas on-prem enterprise data lakes are more likely to take six months to a year, or more.
Elasticity: The public cloud offerings are backed by seemingly limitless scalability, with no need to re-engineer the architecture when your needs grow or shrink, and costs track your actual usage. On-prem approaches require continual capacity evaluation, planning, and updates.
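The cost difference between the two elasticity models can be sketched with a toy calculation. The prices below are made-up illustrative numbers, not actual vendor pricing: the elastic model charges per unit of usage, while the provisioned on-prem model carries a fixed monthly cost regardless of utilization, and exceeding capacity forces a re-architecture.

```python
# Toy cost comparison with hypothetical prices, to illustrate why elastic
# pay-per-use tracks demand while fixed on-prem capacity does not.

CLOUD_RATE_PER_TB = 25.0      # hypothetical $/TB-month, pay-per-use
ONPREM_FIXED_COST = 10_000.0  # hypothetical $/month for provisioned capacity
ONPREM_CAPACITY_TB = 500      # hypothetical provisioned ceiling

def monthly_cost_cloud(used_tb: float) -> float:
    """Elastic model: cost scales linearly with what you actually use."""
    return used_tb * CLOUD_RATE_PER_TB

def monthly_cost_onprem(used_tb: float) -> float:
    """Provisioned model: fixed cost regardless of utilization;
    going over capacity means planning a hardware expansion."""
    if used_tb > ONPREM_CAPACITY_TB:
        raise ValueError("capacity exceeded: plan a hardware expansion")
    return ONPREM_FIXED_COST

# At low utilization (say 40 TB) the elastic model is far cheaper; the two
# converge only as usage approaches the provisioned capacity (400 TB here).
```

With these numbers, 40 TB of usage costs $1,000/month in the elastic model versus $10,000/month for the idle-heavy provisioned capacity, which is the "costs track your usage" point in miniature.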
Performance: No doubt some readers will not like this one, but keeping the data lake within one’s own firewall enables higher performance and lower latency than would otherwise be possible for enterprise applications. For data scientists who repeatedly need to move large quantities of data for experimentation, the on-prem approach will prevail. However, for streaming and real-time data there are new approaches, such as edge analytics, which add new “it depends” situations based on the desired use cases for the data.
Availability: The universal access of the public cloud wins over private, hands down. However, that same accessibility reads as a potential vulnerability in the eyes of your IT department.
Stakeholders

IT Operations: IT operations is highly unlikely to utilize a public cloud for all of its enterprise data needs, as it must maintain direct control of the data that powers the majority of its mission-critical applications and workloads. Much more likely, IT will offload certain lower-risk workloads to the public cloud, or adopt a hybrid-cloud model that incorporates private cloud as well.
BI Analysts: These stakeholders are a good fit for both models, given that they have scheduled, known use cases, often for descriptive analytics only.
Business-focused Data Scientists: For data scientists focused on the lines of business, a public cloud model will be a gift from heaven, offering direct access to the data they need now. The on-prem data lake will serve them too, but perhaps with more red tape to gain access to the data they seek.
IT-focused Data Analysts: Although these analysts will be primarily focused on enterprise data for IT operations, there will also be instances where a cloud data lake provides easier access to external data sources such as IoT, streaming, edge, and real-time feeds.
Data Architects: A data architect designs, creates, deploys, and manages an organization’s data architecture. Their role therefore has limited applicability to a public cloud model, where the architecture is largely pre-built, while it is instrumental to the on-prem approach.
Business Users: The average business user, such as a marketing manager, will not have the skill sets required to make productive use of an on-prem enterprise data lake, whereas a public cloud model allows for a more simplified user experience. Furthermore, the irregular nature of their needs for data and analytics means the public cloud model causes far fewer disruptions to IT.
Use Cases

EDW offload/augmentation: Again, the schema-less nature of Hadoop-based data lakes means that resource-intensive EDW workloads can be offloaded to a Hadoop architecture at about 5% of the cost. However, the data itself will determine whether on-prem or public cloud is the better fit.
Reducing Data Silos: The on-prem data lake is the only realistic approach to the original promise of a data lake: a single repository for ALL your data. However, a public cloud data lake can still play a major role in consolidating data silos, and it can do so faster.
Enterprise Application Support: For the foreseeable future, the majority of mission-critical applications will continue to rely on on-prem and private cloud approaches for reasons of data availability, security, and performance. However, this is changing as many businesses (not “enterprises,” perhaps) now run entirely on public cloud, so I am tempted to call this one an “it depends.”
Data Science Experimentation: Because data scientists often want to create analytics sandboxes and move massive amounts of data, an on-prem model speeds that process. But for smaller data sets, the flexibility and accessibility of a public cloud data lake can help fast-track projects.
Data Analytics for Business Users: On-prem data lakes can serve business analytics well when they are robust and supported by accessible, knowledgeable staff, but the required skill sets and IT bottlenecks may create obstacles. The public cloud data lake is better suited to direct business-user access.
So there we have it, folks. Both approaches have their merits. My takeaway from this analysis is that the public cloud approach lends itself to business-focused stakeholders with a simplified implementation and user experience that requires less technical knowledge. One such instantiation of this would be the TCS Connected Intelligence Data Lake for Business, which is available now in AWS Marketplace.
- Purpose-built for Business IT and special project teams: Quickly onboard, manage and govern data for analytics with a simplified, self-serve Hadoop data lake platform.
- Easy drag-and-drop user interface: Model, catalog and automate data ingestion with no additional coding.
- A single data platform for multiple use cases: Develop multiple use cases on one platform, quickly and securely, with fully featured administration, workflow automation, security policies, and user access control.
For more information, I’ve recorded a short video that you can share, which shows you how you can dive in today!