Data Gravity and Its Impact on Data Storage Infrastructure
Understand the cost and complexity of storing and moving mass data.
Data is now an essential asset to businesses in every vertical, just as physical capital and intellectual property are. Data growth, with ever-increasing quantities of both structured and unstructured data, will continue at unprecedented rates in coming years. Meanwhile, data sprawl — the increasing degree to which business data no longer resides in one location but is scattered across data centers and geographies — adds complexity to the challenges of managing data’s growth, movement, and activation.
Enterprises must implement a strategy to efficiently manage mass data across cloud, edge, and endpoint environments. And it’s more critical than ever to develop a conscious and calculated strategy when designing data storage infrastructure at scale.
What worked for terabytes doesn’t work for petabytes. As enterprises aim to overcome the cost and complexity of storing, moving, and activating data at scale, they should seek better economics, less friction, and a simpler experience — simple, open, limitless, and built for the data-driven, distributed enterprise. A new way to data.
The concept of data gravity is an important element to consider in these efforts.
According to the new Seagate-sponsored report from IDC, Future-proofing Storage: Modernizing Infrastructure for Data Growth Across Hybrid, Edge and Cloud Ecosystems, as storage associated with massive data sets continues to grow, so will its gravitational force on other elements within the IT universe.
Generally speaking, data gravity is a consequence of data’s volume and level of activation. Basic physics provides a suitable analogy: a body with greater mass has a greater gravitational effect on the bodies surrounding it. “Workloads with the largest volumes of stored data exhibit the largest mass within their ‘universe,’ attracting applications, services, and other infrastructure resources into their orbit,” according to the IDC report.
A large and active dataset will, by virtue of its complexity and importance, necessarily affect the location and treatment of the smaller datasets that need to interact with it. So, data gravity reflects data lifecycle dynamics, and must help inform IT architecture decisions.
Consider two datasets: one is 1 petabyte, and the other is 1 gigabyte. To integrate the two sets, it is more efficient to move the smaller dataset to the location of the larger dataset. As a result, the storage system holding the 1 petabyte set now stores the 1 gigabyte set as well. Because large datasets “attract” smaller datasets, large databases tend to accrete data, further increasing their overall data gravity.
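The asymmetry between the two datasets can be made concrete with a rough transfer-time estimate. The sketch below is illustrative only: the 10 Gbit/s link speed, the decimal unit definitions, and the function names are assumptions, not figures from the IDC report, and the calculation ignores protocol overhead and contention.

```python
# Assumed figures for illustration; not taken from the IDC report.
GIGABYTE = 10**9   # bytes (decimal units)
PETABYTE = 10**15  # bytes (decimal units)
LINK_BITS_PER_SECOND = 10 * 10**9  # hypothetical dedicated 10 Gbit/s link


def transfer_seconds(size_bytes, link_bits_per_second=LINK_BITS_PER_SECOND):
    """Ideal transfer time; ignores protocol overhead, contention, and retries."""
    return size_bytes * 8 / link_bits_per_second


def dataset_to_move(sizes_by_name):
    """Return the name of the smallest dataset, the cheapest one to relocate."""
    return min(sizes_by_name, key=sizes_by_name.get)


datasets = {"large": PETABYTE, "small": GIGABYTE}
small_seconds = transfer_seconds(datasets["small"])
large_days = transfer_seconds(datasets["large"]) / 86_400  # seconds per day

print(f"Dataset to move: {dataset_to_move(datasets)}")
print(f"1 GB takes about {small_seconds:.1f} s")
print(f"1 PB takes about {large_days:.1f} days")
```

Under these assumptions, the gigabyte moves in under a second while the petabyte would occupy the link for more than nine days, which is why the smaller set travels to the larger one.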
Managing, analyzing and activating data also relies on applications and services, whether those are provided by a private or public cloud vendor or an on-prem data management team. Applications collect and generate data, as well as consume, analyze, and aggregate it; a lot of work has to happen on the data. Naturally, the more massive a data set grows, the harder it is to make use of that data unless it is close to the applications and services that help to manage or activate the data. So applications and services are often moved close to the data sets, or are kept near the data sets. From on-premises data centers, to public clouds and edge computing, data gravity is a property that spans the entire IT infrastructure.
But according to the IDC report, such massive data sets can become like black holes, “trapping stored data, applications, and services in a single location, unless IT environments are architected to allow the migration and management of stored data, along with the applications and services that rely on it, regardless of operational location.”
Because data gravity can affect an entire IT infrastructure, it should be a major design consideration when planning data management strategies. An important goal in designing a data ecosystem, according to IDC, is to “ensure that no single data set exerts uncontrollable force on the rest of the IT and application ecosystem.”
IT architecture strategy should put mass storage and data movement at its center. This begins with optimizing data location. A data-centered architecture brings applications, services and user interaction closer to the location where data resides, rather than relying on time-consuming and often costly long-distance transfers of mass data to and from centralized service providers.
IDC notes that “one way to mitigate the impact of data gravity is to ensure that stored data is colocated adjacent to applications regardless of location.”
This model can be accomplished by leveraging colocated data centers that bring together multiple private and public cloud service providers, allowing enterprises to pair their mass data storage with best-of-breed solutions for applications, computing, and networking needs.
The key goal of a data-centered architecture is data accessibility. Accessibility increases ease-of-use and smooth operations of a data pipeline, and can impact future business innovation, improving the ability to generate metadata and new datasets, enabling search and discovery of the data, and further empowering data scientists to deploy said data for machine learning and AI.
But putting data at the center of IT architecture can also help with application performance, transfer latency, access and egress charges, and security and compliance needs. The overall reliability and durability of the data is another important benefit. Reliability is the ability to access data when needed, and durability is the ability to preserve data over extended periods of time.
Altogether, these considerations have large implications for enterprise data management planning, from defining an overall IT strategy to formulating a business initiative. Planning out the necessary workloads and jobs means accounting for data gravity. Key questions to ask include: What is the volume of data being generated or consumed? What is the distribution of data across data centers, private clouds, public clouds, edge devices, and remote and branch offices? What is the velocity of the data being transmitted across the entire IT ecosystem? Addressing these considerations will increase the efficiency of the data infrastructure and can reduce costly data pipeline issues down the line.
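One way to start answering the volume, distribution, and velocity questions is a simple inventory that aggregates stored and moving data per location. The following is a minimal sketch: the record fields, location labels, and sample figures are all hypothetical and would come from an organization's own monitoring data in practice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    location: str            # e.g. "data center", "public cloud", "edge"
    stored_tb: float         # volume currently stored, in terabytes
    moved_tb_per_day: float  # velocity: data transmitted in or out per day


def summarize(records):
    """Aggregate volume and velocity per location, plus each location's share of volume."""
    volume = defaultdict(float)
    velocity = defaultdict(float)
    for record in records:
        volume[record.location] += record.stored_tb
        velocity[record.location] += record.moved_tb_per_day
    total = sum(volume.values())
    return {
        location: {
            "stored_tb": volume[location],
            "share": volume[location] / total if total else 0.0,
            "moved_tb_per_day": velocity[location],
        }
        for location in volume
    }


# Made-up inventory for illustration only.
inventory = [
    DatasetRecord("sensor-archive", "data center", 600.0, 2.0),
    DatasetRecord("analytics-lake", "public cloud", 300.0, 10.0),
    DatasetRecord("camera-feeds", "edge", 100.0, 25.0),
]

for location, stats in summarize(inventory).items():
    print(f"{location}: {stats['stored_tb']:.0f} TB "
          f"({stats['share']:.0%} of total), "
          f"{stats['moved_tb_per_day']:.0f} TB/day in motion")
```

In this made-up inventory, the data center holds most of the stored volume while the edge location moves the most data per day, the kind of mismatch that planning for data gravity is meant to surface.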
IDC advises in its report, “Don't let a single workload or operational location dictate the movement of storage or data resources.” Because data has gravity, data infrastructure must be designed to prevent massive data sets or large individual workloads from exerting a dominant gravitational pull on storage resources, with architecture that efficiently moves storage, compute, or application resources as needed.
This means always maintaining awareness of which datasets are being pulled where, which path moves the data most efficiently, and what helps those workloads run best. It can also mean automating the movement of data to reduce storage costs, or moving lower-performing datasets that are not immediately or actively needed. Automated metadata management is also worth considering, because it can enable search and discovery across data stores, increasing data accessibility.
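Automated data movement of this kind is often expressed as a policy over access metadata. The sketch below assumes a hypothetical two-tier setup ("hot" and "cold") and a 90-day idle threshold; neither comes from the IDC report. It only plans moves and leaves the actual transfer to whatever storage system is in use.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str
    tier: str            # hypothetical tiers: "hot" or "cold"
    last_accessed: date


def plan_moves(datasets, today, idle_days=90):
    """Return (name, from_tier, to_tier) for hot datasets idle longer than idle_days."""
    cutoff = today - timedelta(days=idle_days)
    return [
        (d.name, d.tier, "cold")
        for d in datasets
        if d.tier == "hot" and d.last_accessed < cutoff
    ]


# Made-up catalog for illustration only.
today = date(2024, 6, 30)
catalog = [
    Dataset("q1-logs", "hot", date(2024, 1, 15)),      # idle for months
    Dataset("orders", "hot", date(2024, 6, 28)),       # recently used
    Dataset("old-backups", "cold", date(2023, 3, 1)),  # already on the cold tier
]

for name, source, target in plan_moves(catalog, today):
    print(f"move {name}: {source} -> {target}")
```

Keeping the policy separate from the transfer mechanism makes the threshold easy to revisit as data gravity considerations change.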
Putting these ideas into action means deploying data architecture, infrastructure, and management processes that are adaptive. While an organization may have a good idea of what its data gravity considerations are today, those considerations may not be the same five years from now.
“Not every enterprise manages multiple massive data sets, but many already do,” IDC notes in the report. “And, given the pace of digitization of business and the importance placed on the value of enterprise data and data gathering, many organizations will find themselves managing massive data sets in the near future.”
It's important that every data management system can change to accommodate new data requirements. Data management and the data architecture to support it must be agile, and able to adapt to shifting business needs and emerging technical opportunities.
Learn more about hybrid architecture, overcoming network constraints, and the growing complexity of storage management in the new Seagate-sponsored report from IDC, Future-proofing Storage: Modernizing Infrastructure for Data Growth Across Hybrid, Edge and Cloud Ecosystems.