- Seagate Blog
Considerations for Managing Data at Scale
In an IT 4.0 world—where more data is created at the micro, metro, and macro edges by devices such as cameras, drones, and autonomous vehicles—the scale of data generated or collected by one organization can easily swell to multiple petabytes. Scale changes everything—from the economics of storing data and the flexibility of moving it, to the foundational need for data security.
Key Considerations for Managing Data at Scale
Data at scale is no longer limited to private data centers or centralized public clouds; it is becoming ubiquitous. A wide range of data storage options has historically allowed enterprises to diversify their data storage solutions in a hybrid model spanning data centers and public cloud, but mobile and IoT applications are also driving organizations to keep data and compute resources at the edge and to further diversify where data moves and lives at any given point in its lifecycle.
Data volumes will continue to increase, and storage solutions need to be adaptable and highly scalable for a world of highly distributed data. The amount of new data created each year is currently growing at a compound annual growth rate of about 26%, according to Seagate’s Rethink Data report. In total, industry analyst firm IDC expects 175.8ZB of new data to be created in 2025, compared with 18.2ZB in 2015. The bulk of this data will not be managed in-house, but across multiple public clouds, private clouds, and edge devices.
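The cited figures are consistent with that growth rate; a quick back-of-envelope check of the implied compound annual growth rate:

```python
# Back-of-envelope check: does growth from 18.2 ZB (2015) to 175.8 ZB (2025)
# imply roughly the ~26% CAGR cited above?
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values over a span of years."""
    return (end / start) ** (1 / years) - 1

rate = cagr(18.2, 175.8, 10)
print(f"Implied CAGR: {rate:.1%}")  # roughly 25.5%, in line with ~26%
```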
Data Management Economics
Data center and traditional cloud economics have been based on hyper-scalability, low-cost energy, and inexpensive real estate. Historically, the physical location of the data center has been a cornerstone of storage economics. But the opportunities brought by edge computing have disrupted that simple equation for data consumers. To generate the greatest possible value, the enterprise must now retain, manage, and leverage its data.
Edge devices and sensors generate data that requires low latency analysis, making it more important to keep computing resources closer to the edge. Local networks are characterized by high bandwidth, low jitter, and low latency, making them a good fit for many edge computing workloads. They also provide resiliency to wide area network and cloud data center outages. The volume of data generated at the edge could drive up network transmission costs if that data were all stored solely in a centralized location. Thus, the most cost-optimal storage strategy increasingly distributes data beyond traditional cloud and on-premises storage hubs.
When it comes to economics of scale, it helps to keep in mind several dimensions of data at scale: data creation, data storage, data in motion, and data activation.
Data creation considerations focus on when and where data originates. This can encompass everything from IoT devices feeding sporadic business-critical information from the edge to the constant stream of performance-monitoring data on the factory floor, and anything in between.
Data storage focuses on the persistence, reliability, and durability of data. The decisions here center on where and how data is stored. Data in motion strategies should aim for ease, speed, and cost-efficiency, deploying processes flexible enough to align with the many varied reasons enterprises move their data, from disaster recovery to lifting and shifting mass data to where it provides the most value.
Data activation is how the data is leveraged to further business objectives. This includes considerations such as when, where, and how data should be stored and used: for example, when to use machine learning to extrapolate trends, or when to simply store the data for later use. The key here is “use.” Too often, collected data is ignored, pushed aside, and underutilized, skewing the economics of storage toward middling or worse returns on investment.
Data Creation
Traditionally, most data creation happened in the data center or via work that flowed directly to the data center. But this has changed. With the expansion of IoT and edge computing, more and more data is created outside the data center each year, and this data will soon represent the majority.
Take the rise of fault-analysis applications as an example. Pervasive video monitoring of manufacturing areas allows manufacturers to detect faults in production machinery or in the products themselves. This process requires continuous video capture, often at 1080p resolution or higher, thanks to the advent of 20- to 80-megapixel cameras. High-resolution video streams quickly create petabytes of data that must move along efficient storage pipelines to be activated and leveraged.
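To see how quickly video reaches petabyte scale, consider a rough estimate. The camera count and bitrate below are illustrative assumptions, not figures from the text:

```python
# Rough estimate of raw storage generated by continuous video monitoring.
# Camera count and bitrate are illustrative assumptions only.
def daily_storage_tb(cameras: int, bitrate_mbps: float) -> float:
    """Terabytes of video produced per day by a fleet of cameras."""
    seconds_per_day = 24 * 3600
    bits = cameras * bitrate_mbps * 1e6 * seconds_per_day
    return bits / 8 / 1e12  # bits -> bytes -> terabytes

# e.g., 200 cameras each streaming 1080p video at ~8 Mbps
per_day = daily_storage_tb(200, 8.0)
print(f"{per_day:.2f} TB/day, ~{per_day * 365 / 1000:.1f} PB/year")
```

Even this modest hypothetical deployment crosses the petabyte mark within a year of continuous capture.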
In some situations, data managers at the edge may be tempted to seek efficiency by reducing the amount of video data captured at each collection point, aiming to capture and transfer only the important footage. But culling video data at the collection point can be ill-advised, because it’s often impossible to know in advance all the ways this data might provide value and be put to good use. Video data is rich with information that’s not always discernible to humans, and the actions and patterns captured by sensors and cameras can be more complex than humans understand at first. Today, AI and machine learning can find, analyze, and repurpose patterns in video data that human data managers are not looking for, or don’t yet understand, and this can provide unanticipated value to the company. Weighing the full long-term costs against the benefits, the most economical solution is to record everything happening in the manufacturing space.
In addition, initial processing of the full cache of video data at the edge location (for example, in a factory) is becoming more common. This enables data managers to retain all the data while reducing its file size before transferring it to the data center. Local data processing also helps reduce the risk of data loss during transmission.
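The principle behind edge preprocessing, keeping all the data while shrinking what must cross the network, can be sketched with ordinary lossless compression. Real video pipelines would use codecs such as H.264/H.265 rather than gzip, but the pattern is the same:

```python
# Sketch of reducing data size at the edge before transfer.
# gzip stands in for whatever size-reduction step (codec, dedup,
# downsampling) a real pipeline would apply locally.
import gzip

raw = b"sensor-frame " * 10_000      # stand-in for captured edge data
compressed = gzip.compress(raw)      # process locally, ship less

ratio = len(compressed) / len(raw)
print(f"{len(raw)} -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

All of the original information is retained (the transfer is lossless), but far fewer bytes travel over the network, which also shrinks the window in which transmission loss can occur.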
Streaming video data is complex, and the cost of moving such large volumes of data prior to analysis has historically been prohibitive. In that case, the data would be analyzed locally so that only significant findings needed to be transmitted to centralized computing infrastructure on-premises or in the cloud. Today, new modes of data transport can make mass data transfer affordable (see the data in motion section below), and new paradigms allow data to be stored in public cloud services geographically nearer to the data source. Both of these shifts change the calculation about how much data can be analyzed immediately, opening up opportunities for increased learning from much larger datasets in a short time and mitigating the historical need to cull data before it’s fully understood.
A key solution to the challenges of large data volumes at the edge is to move certain compute and storage capabilities closer to where data creation happens. This form of composable multicloud strategy can be achieved either by deploying a private cloud architecture or by choosing a specific public cloud service located at the edge near the data. This may sacrifice some of the simple operational economics of centralized storage infrastructure, but the business advantages of efficiently analyzing and leveraging data more quickly bring enormous economic benefits.
Data Storage
The traditionally prohibitive cost of enterprise storage, along with vendor lock-in, has in the past forced businesses to limit the volume of data they store and activate. Today, data holds enormous value, and enterprises can’t afford not to store most or all of the data they generate.
For years, mass data storage aimed to meet a narrow range of needs in business, sometimes limited to structured data that was ready for immediate use, plus backups and disaster recovery operations. Now, due to the essential role of data as a core asset of all businesses, the need to retain, access, and activate mass data is central to a broader view of business continuity.
Standalone backups are being replaced by replication in multiple cloud locations. Replication is a common way to ensure availability and durability, with geographic separation of replicas providing additional protection against local disruptions.
Long-term data management models are developing with less concentration on bulk storage for rarely accessed data, and more focus on making data at scale readily accessible—with a goal to help the organization leverage and get the most out of the data it collects.
As more data is collected and stored on the edge there is a greater need for automated, policy-based management defining where and how that data is stored. This means each organization must define a variety of locations where data will reside at different moments of its lifecycle, and what services will be performed on any data at a given time. This is often done as part of a composable multicloud strategy in which various applications and services—software as a service (SaaS), compute as a service (CaaS), storage as a service (StaaS), infrastructure as a service (IaaS), and platform as a service (PaaS)—play a role in leveraging the full value of the data.
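Policy-based placement of this kind can be sketched as a simple routing rule that maps a dataset's age to a storage tier. The tier names and thresholds below are illustrative assumptions, not any vendor's API:

```python
# Minimal sketch of automated, policy-based data placement across the
# lifecycle. Tier names and day thresholds are hypothetical examples.
from dataclasses import dataclass

@dataclass
class LifecyclePolicy:
    hot_days: int    # keep on edge/local storage while this fresh
    warm_days: int   # then in a nearby cloud region; older data is archived

    def tier_for(self, age_days: int) -> str:
        """Return the storage tier a dataset of this age should live in."""
        if age_days <= self.hot_days:
            return "edge"
        if age_days <= self.warm_days:
            return "cloud-regional"
        return "cloud-archive"

policy = LifecyclePolicy(hot_days=7, warm_days=90)
print(policy.tier_for(3))    # edge
print(policy.tier_for(30))   # cloud-regional
print(policy.tier_for(365))  # cloud-archive
```

A real deployment would key such rules not only on age but on access frequency, regulatory constraints, and the services (SaaS, StaaS, and so on) that need to act on the data at each stage.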
Data lifecycle management, privacy, and other areas of regulation are major drivers in the move to defining and automating the application of data management policies.
Data in Motion
The reasons an enterprise needs to move its data are varied.
Organizations may need to consolidate mass data into a single repository for big-picture analysis, and to improve the availability, security, and accessibility of their data.
Enterprises often deploy a disaster recovery plan, built on whole-enterprise backup, to enable business continuity in the event of primary data failure. An ideal approach is to transfer data to a colocation data center, with the ability to transfer it back in the event of disaster or data failure.
Today’s data managers will also commonly manage mass data migration to different cloud locations, lifting and shifting mass data from where it’s created to where it has the most value for the business—preferably without the limitations of network dependencies.
Data storage and movement policies should be designed to avoid cloud vendor lock-in, and to keep specific datasets and mass data from being trapped in silos, so that they can be freely accessed and moved as needed to the specific cloud services or geographies where the data’s value can be leveraged at any given time.
It’s important that data managers understand new modes for moving data that enhance the ability to keep data in motion in order to realize its potential value. The goal should be to move data to wherever it creates the most value, using a model that supports rapid data transfer across edge and cloud storage environments while limiting egress and access fees.
Data managers should seek tools and services that let the organization transfer large amounts of data in days, rather than the weeks or months it takes when relying on the internet. A new class of enterprise mobile storage devices and services is now available that can serve as high-capacity edge storage, enabling businesses to aggregate, store, move, and activate their data. The ideal solutions are scalable, modular, and vendor agnostic: integrated solutions that eliminate network dependencies so organizations can transfer mass data sets quickly, securely, and efficiently.
Such tools are optimal for data strategies that rely on activating data at the edge, enabling companies to deploy storage in the field quickly and capture data at the source. They can facilitate fast, simple, secure transfers so organizations can more easily move data to the cloud to put it to work.
Mobile storage as a service can also simplify right-sized data transfers to ease the work of scaling up or down as data transfer needs evolve. Businesses may see long-term savings in shifting this part of the infrastructure from CapEx to OpEx as they utilize cost-effective data transfers delivered as a service, which lets businesses order and pay only for the devices they need, when they need them.
Data Activation
With the data created, stored, and put in motion, there is still the issue of extracting value from it. Until recently, data was typically collected for known business purposes, such as executing a sales transaction; there were specific needs that could be met by gathering information. Now, data analysis, artificial intelligence (AI), and machine learning (ML) can derive insights from unstructured data that lead to new discoveries and open up new business opportunities.
Organizations are finding additional ways to use and reuse data far from the creation point—for example, combining the benefits of sensors for pervasive data capture with AI for analyzing that data. AI-driven analysis in particular is a powerful way to extract new insights from unstructured data that might otherwise go untapped.
New forms of analysis also depend on being able to analyze for specific characteristics of the data. For example, streaming video data could provide the number of cars moving in certain directions in a given intersection per hour, or the number of people in a particular location in relation to seemingly unrelated concurrent events, or to time of day. Many of these new characteristic-based analysis methods raise compliance and privacy issues that can vary by jurisdiction and industry. This underscores the importance of policy-driven data management for business-critical privacy controls.
These new data activation techniques require a unified storage model that can manage and understand unstructured data. Cloud object storage is a data storage architecture that simplifies the storage and management of massive amounts of unstructured data. It treats discrete units of data as "objects" that can be stored in their native format. Each self-contained cloud object has three components: the data itself, its descriptive metadata, and a unique identifier that allows APIs to find and retrieve it. Compared with traditional file- and block-based storage systems, the self-contained nature of each object makes data simpler, more efficient, more reliable, and more cost-effective to track, manage, and leverage.
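The three-component object model described above can be illustrated with a toy in-memory store. This is a sketch of the concept only, not a real cloud SDK:

```python
# Toy in-memory model of the object storage concept: each object bundles
# the data itself, its descriptive metadata, and a unique identifier.
import uuid

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, metadata: dict) -> str:
        """Store a self-contained object; return the ID used to retrieve it."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": metadata}
        return object_id

    def get(self, object_id: str) -> dict:
        """Retrieve the object (data + metadata) by its unique identifier."""
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"frame-0001", {"camera": "line-3", "format": "h264"})
print(store.get(oid)["metadata"]["camera"])  # line-3
```

Because every object carries its own metadata, the store needs no rigid schema or directory hierarchy; any unit of unstructured data can be stored, found, and activated by its identifier alone.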
When using a unified model to activate data across a multicloud infrastructure, administrators should always watch for adverse impacts such as the inadvertent emergence of data silos as a result of data gravity. In addition, if data is activated from within a traditional public cloud provider and moved over the network, data availability and time to analysis will be affected, and administrators may need to plan for egress charges depending on the cloud provider’s policies.
Interested in learning more about data management economics? Click here to watch our on-demand webinar on Cloud Economics for Data Centric Use Cases!