by Pete Sonsini, Sep 23, 2015
As buzzwords go, “Big Data” has jumped the proverbial shark. The term is so rampantly overused (and misused) that it has virtually no meaning at all. That’s not to say, however, that there aren’t some highly compelling investment opportunities if you can manage to cut through the bull. We’ve backed a number of “big data” plays in recent years, both in infrastructure (e.g. MapR, Databricks) and applications (e.g. Lattice Engines, Conviva). And today we see a new opportunity evolving around data performance management. We’ve powered the processing frameworks, optimized the application layer, and now we’re moving to the next greatest asset within the enterprise chain: the information.
You may suspect we wouldn’t be talking about this without a shiny new startup in our portfolio, and you’d be right. We are thrilled to announce our investment today in StreamSets—a company that has the foresight and execution horsepower to create and shape the new data performance management market. At its core, StreamSets is a specialized solution comprising a set of low-footprint, low-latency components that forge the lowest-friction path for getting data from source to consumption.
Tracing the emergence of this new theme in data helps illustrate why we believe StreamSets will be a powerful player in data performance management, or DPM. Let’s take some lessons from two mature markets, starting with Network Performance Management (NPM), a $1B market today growing at 5%. NPM came about as IT administrators realized that enterprise networks were quickly becoming global and that real business value resided in the speed, capacity, and fidelity of their network. (This should strike a chord with anyone familiar with the “velocity, volume, variety” data challenges that have taken center stage over the last few years.)
It quickly became apparent that network performance was only part of the problem; applications were increasingly becoming complicated and required highly distributed resources. Business value had shifted from the network to the application, giving rise to the market of Application Performance Management (APM), a $3B market today growing at 8%. APM builds upon NPM: instead of just looking at the velocity of the network piping general data, APM looks at how fast specific application data is transported—for example, how fast a user is authenticated, or how long it takes to query a relational database management system and return a specific value. The number of transactions/pages/requests per second became the key metric for network volume, and the concept of variety began to play a critical role as different data flows were marked and accordingly prioritized to optimize application performance.
The advent of NPM and later APM was a direct result of enterprises paying keen attention to which assets in their business were actually driving value and then optimizing for the performance of those assets. Today, it’s becoming increasingly clear that the asset driving a growing share of enterprise value is the data itself, a marked shift beyond just getting information to the endpoint (the role of the network) and being able to use the information (the role of the application). For example, it is no longer enough to serve up a website quickly; the application first needs to be appropriately customized based on the specific traits of the user and the usage conditions, and only then served. And this is true not just for web applications but for an ever-growing set of data-powered applications inside the firewall (business intelligence, machine learning, archival, IoT, advertising, etc.). Abstracting this idea and taking a bold step, we are seeing applications become black boxes that act on an input of data and in turn output enriched, valuable data. Coupled with highly available, scalable cloud infrastructure and the virtually free cost of compute and storage, the networking and application layers are no longer bottlenecks but rather nodes along the growing “network” of data flows that need to be optimized, protected, and monitored for health. Breakage in these data flows is just as catastrophic as any problem that might occur in the network or the application logic. In fact, analysts estimate information management software IT spend at $60B in 2015, with day-to-day data management the largest portion of that market at a whopping 40% of the total.
And then comes StreamSets: StreamSets has developed data middleware software to continuously ingest both structured and unstructured data (transactions, events, etc.) in real time from any origin (proprietary, third-party sources, etc.) and load it into any data store (relational / NoSQL databases, Hadoop, cloud databases like Amazon or Databricks, etc.) while sanitizing it into a format that is consumption-ready for data-driven applications and analysts.
The secret sauce is their adaptable, intent-driven flow engine, one of the cleverest implementations we’ve seen. Today, exploding data sources are constantly producing data with “drifting” data semantics, and this data is being loaded via brittle, ad-hoc, schema-dependent ETL and custom solutions that break down when they can’t match the schema. StreamSets solves this problem by allowing administrators to simply specify the intent of the data flow (e.g., “I’m looking for a phone number”) and let StreamSets do the heavy lifting to extract the needed information, regardless of input schema. This approach also aids with data monitoring, as the data “network” can now alert the admin when the data flow is deviating and automatically quarantine bad packets to prevent downstream application crashes. This provides a safe, scalable, and transparent way to build and monitor the increasingly important data “networks” that make up enterprises’ biggest asset today.
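For the technically curious, here is a toy illustration of the general idea of matching on intent rather than on schema. To be clear, this is not StreamSets code or its API; it is a minimal Python sketch under our own assumptions. The names `PHONE_PATTERN`, `find_phone`, and `route` are hypothetical, and the phone-number pattern is deliberately naive. The point is simply that the rule describes what the value looks like, not which field it lives in, and that records with no match are set aside instead of crashing whatever sits downstream.

```python
import re
from typing import Any, Iterable, Optional

# Hypothetical "intent": what a phone number looks like, independent of the
# field name that happens to carry it. Deliberately naive, for illustration.
PHONE_PATTERN = re.compile(r"^\+?[\d(][\d\s().-]{6,}\d$")


def find_phone(record: dict[str, Any]) -> Optional[str]:
    """Return the first string value that looks like a phone number, else None."""
    for value in record.values():
        if isinstance(value, str) and PHONE_PATTERN.match(value.strip()):
            return value.strip()
    return None


def route(records: Iterable[dict[str, Any]]) -> tuple[list[str], list[dict[str, Any]]]:
    """Split records into (extracted phone numbers, quarantined records)."""
    extracted: list[str] = []
    quarantined: list[dict[str, Any]] = []
    for record in records:
        phone = find_phone(record)
        if phone is None:
            # No value matches the intent: quarantine rather than fail.
            quarantined.append(record)
        else:
            extracted.append(phone)
    return extracted, quarantined


# Three records whose field names have "drifted" between sources.
records = [
    {"name": "Ada", "phone": "+1 415-555-0100"},
    {"full_name": "Bob", "contact_number": "(650) 555-0123"},
    {"name": "Eve", "email": "eve@example.com"},
]
extracted, quarantined = route(records)
```

In this sketch the first two records yield a phone number even though the field is named differently in each, while the third lands in `quarantined` where an admin could inspect it. A real flow engine would need far more robust matching than a single regular expression.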
It will come as no surprise to those among us who eat, sleep and breathe data that Girish Pancha and Arvind Prabhakar are at the helm of this growing force in DPM. With over three decades of combined experience including key positions at Informatica and Cloudera, these founders know exactly what the real pain points in the industry are and exactly what is needed to solve them. And they’ve packaged their entire solution in a GUI portal that’s as beautifully designed as it is powerful. Determined that StreamSets should be a solution that both technical and non-technical users can use, Girish and Arvind created a drag-and-drop interface with prebuilt data analysis nodes and connectors, making tasks that should be extremely difficult (like visually mapping data flows and monitoring the health of the data network) as easy as playing with Legos.
We couldn’t be more excited to welcome Girish, Arvind, and StreamSets into the NEA family and together help shape this next wave in “big data.” Because no matter what you call it, it’s still a big opportunity.