Securing Data: From Root of Trust to Provenance Tracking

Artificial intelligence (AI), machine learning (ML), and cloud computing are fundamentally changing the risk model of IT. Enterprise data, which has historically lived on centralized infrastructure under the physical control of the business, is now frequently stored elsewhere, such as at the edge or in the cloud. With distributed and composable infrastructures, the threat model changes as well. As a result, data orchestration architecture must include additional security measures, such as hardware-based roots of trust and open security solutions, to provide security beyond the perimeter of a physical data center.

"For example, at the edge, the threat model includes unauthorized physical access to the equipment—potentially even without anyone seeing it happen," says Manuel Offenberg, a data security researcher at Seagate.

Protecting Distributed Data

Today, enterprise data is stored in public and hybrid clouds. Data is generated at, and transmitted from, remote devices. No single enterprise can physically secure all the devices, network equipment, and other distributed infrastructure it uses.

This puts more emphasis on protecting the data that exists in a distributed architecture. Many of the security controls that are commonly used are well suited to protect the confidentiality of data. Strong encryption can protect data in transit and at rest. Other cryptographic tools, such as message digests, can help protect the integrity of data.
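
As a concrete illustration, here is a minimal Python sketch of using a message digest to detect that stored data has been modified; the record contents and field names are invented for the example:

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 message digest of a data blob as hex."""
    return hashlib.sha256(data).hexdigest()

# Record the digest when the data is stored...
record = b"txn=1029,amount=42.50,merchant=acme"
stored_digest = sha256_digest(record)

# ...and recompute it when the data is read back; a mismatch means
# the bytes changed in storage or in transit.
assert sha256_digest(record) == stored_digest, "integrity check failed"
```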

But now, with the increasing importance of AI and ML, demands to ensure data authenticity are also on the rise.

Attackers have long had an ever-growing arsenal of tools for exploiting vulnerabilities in systems and software, but today they have a new target: ML/AI systems themselves. By tampering with the data that feeds these systems, attackers can exploit weaknesses in ML/AI technologies for malicious purposes.

But ML/AI can also help in the battle against hackers. ML algorithms are used in many instances to detect malicious behavior. Take, for example, the credit card industry, where ML is used to analyze large numbers of legitimate and fraudulent transactions. The data samples used to train the algorithms may consist of numerous attributes, such as the type of product purchased, the location of the transaction, the amount charged, and specific attributes about the customer and the merchant. The ML algorithm identifies patterns in the data that distinguish legitimate transactions from fraudulent ones.
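
To make that concrete, here is a toy sketch of such a classifier, assuming scikit-learn and NumPy are available; the features and distributions are synthetic stand-ins for real transaction data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic legitimate transactions: modest amounts, near home, daytime.
legit = np.column_stack([
    rng.normal(50, 20, 500),    # amount in dollars
    rng.normal(5, 3, 500),      # miles from billing address
    rng.normal(14, 4, 500),     # hour of day
])
# Synthetic fraudulent transactions: larger amounts, far away, late night.
fraud = np.column_stack([
    rng.normal(400, 150, 50),
    rng.normal(200, 80, 50),
    rng.normal(3, 2, 50),
])

X = np.vstack([legit, fraud])
y = np.array([0] * 500 + [1] * 50)   # 0 = legitimate, 1 = fraudulent

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[450, 250, 2]]))  # likely flagged as fraud -> [1]
```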

Additionally, as Offenberg points out, ML models are trained using "artificial or adversarial machine learning, a new way of training other machine learning systems to recognize potential attacks that we as human beings couldn't even think of."

Data Provenance Becomes Crucial

Now, imagine an attacker gains access to historical credit card transaction data and modifies or injects new data that leads the algorithm to misidentify some fraudulent transactions as legitimate. This kind of poisoning of training data can be difficult to detect. Unlike backdoors in application source code, which can be detected by code reviews and other measures, ML models are represented in ways that are difficult—if not impossible—for humans to understand when looking at them. This is especially true when it comes to deep learning, where models may consist of many layers and large numbers of parameters that drive a complex array of calculations that produce the decision about whether or not a transaction is legitimate.
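
Continuing the toy classifier sketch above (and reusing its X, y, rng, and LogisticRegression), here is a rough illustration of such label-flipping poisoning: injected fraudulent-looking records mislabeled as legitimate can nudge the model toward waving similar transactions through:

```python
# An attacker injects 100 fraudulent-looking transactions that are
# mislabeled as legitimate, shifting the learned decision boundary.
X_poison = np.vstack([X, rng.normal([400, 200, 3], [50, 30, 1], (100, 3))])
y_poison = np.concatenate([y, np.zeros(100, dtype=int)])  # flipped labels

poisoned_model = LogisticRegression(max_iter=1000).fit(X_poison, y_poison)
# The same suspicious transaction may now slip through as legitimate.
print(poisoned_model.predict([[450, 250, 2]]))
```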

By establishing data provenance in combination with a secure root of trust, one can build a framework in which tampering with the data can be detected before the data is used, as in this example, to train a model. "These kinds of attacks on ML/AI data will represent a new generation of security concerns that we haven't yet fully understood," Offenberg says.
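
One possible shape for such a framework, sketched in Python below: a provenance tag computed over the dataset with a device key that, in a real system, would be held inside the hardware root of trust, then verified before any training run. The key, the HMAC-SHA-256 scheme, and the dataset contents here are all illustrative assumptions:

```python
import hashlib
import hmac

# Illustrative placeholder: in practice this key would live inside a
# hardware root of trust and never be exposed to ordinary software.
DEVICE_KEY = b"device-key-held-in-root-of-trust"

def provenance_tag(data: bytes) -> str:
    """Keyed tag over the dataset, recorded when the data is produced."""
    return hmac.new(DEVICE_KEY, data, hashlib.sha256).hexdigest()

dataset = b"label,amount,location\n1,900.00,XX\n0,42.50,US\n"
tag = provenance_tag(dataset)

# Later, before training: a flipped label breaks the tag verification.
tampered = dataset.replace(b"1,900.00", b"0,900.00")
if not hmac.compare_digest(provenance_tag(tampered), tag):
    print("dataset failed provenance check; refusing to train")
```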

Protecting Data Starts with Roots of Trust

The new class of ML/AI data attacks can be mitigated by improving hardware security with a root of trust, securing the compute operations on data, and maintaining data provenance throughout the data’s life cycle. A root of trust is an unconditionally trusted, foundational security component of a connected device. It provides implicitly trusted functions that the rest of the system can depend on to ensure security.

Roots of trust are secure elements that provide security services such as system boot integrity and strong cryptography to the operating system and applications running on the system. Using a root of trust increases system security, thereby improving trust for data stored and processed by that system. As data moves through distributed systems, trusted components can be used to protect data, and data provenance services can log operations on data from the time it is generated.

Today, the combination of distributed infrastructure with increasingly complex uses of data is underscoring the importance of data provenance. "If we know how, when, and where the data is created, and by whom or by what, we can now keep track of that data in a way that ensures: 'this data hasn't been manipulated and we know its origin,'" Offenberg says. "If we build infrastructures based on the concept of secure data provenance, we achieve a higher level of trust in the data that we're moving around and eventually consuming."

Managing Data in Motion

Any data orchestration strategy must include data provenance that is built on trusted compute platforms. By securely tracking the time data was created, the identity of the owner of the data, and the device that created it, it’s possible to detect changes in the data. This provides the foundation for data trustworthiness.
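
One way to realize that tracking, sketched below, is an append-only provenance log in which each entry records time, owner, device, and operation, and chains to the previous entry by hash, so any alteration breaks the chain. The field names are an illustrative schema, not a standard:

```python
import hashlib
import json
import time

def add_entry(log, owner, device, operation):
    """Append a provenance entry that chains to the previous one by hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {
        "time": time.time(),
        "owner": owner,
        "device": device,
        "operation": operation,
        "prev": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)

log = []
add_entry(log, "sensor-team", "edge-cam-07", "created")
add_entry(log, "etl-service", "cloud-worker-3", "normalized")

# Verification: recompute each hash and check that the chain links up.
for i, e in enumerate(log):
    body = {k: v for k, v in e.items() if k != "hash"}
    assert e["hash"] == hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    assert e["prev"] == (log[i - 1]["hash"] if i else "0" * 64)
```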

Open security solutions, such as the OpenTitan project, which is building a reference design and integration guidelines for silicon root of trust (RoT) chips, are part of the solution. Other open source tools, such as OpenSSL, are already widely used. One downside of distributed architectures is that improper integration can introduce new vulnerabilities, and simply relying on the security of open source solutions without understanding and following their integration guidelines can likewise introduce weaknesses. The Heartbleed vulnerability in OpenSSL is a clear example of a flaw in an open source library that suddenly left many systems exposed. Organizations must be prudent and informed when integrating open source projects, paying particular attention to security and to vulnerabilities that may be introduced by the way applications are integrated.

AI and ML workloads depend on large volumes of diverse data. In addition to protecting the integrity of data, ML practitioners need to be able to identify and extract specific data from large data stores. This, in turn, drives the need for advanced metadata capture and management, including the ability to tag or label data resources.
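
As a small illustration, a tag index for locating data resources might look like the following sketch; the tag names, resource URIs, and in-memory structure are invented for the example:

```python
from collections import defaultdict

tag_index = defaultdict(set)

def tag(resource_id: str, *tags: str):
    """Label a data resource with one or more tags."""
    for t in tags:
        tag_index[t].add(resource_id)

def find(*tags: str) -> set:
    """Return only the resources matching every requested tag."""
    sets = [tag_index[t] for t in tags]
    return set.intersection(*sets) if sets else set()

tag("s3://datasets/txns-2024q1.parquet", "transactions", "pii", "us-east")
tag("s3://datasets/txns-2024q2.parquet", "transactions", "us-west")

print(find("transactions", "us-east"))  # -> the q1 dataset only
```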

Ultimately, distributed systems cannot rely on the same security measures that protected siloed data centers. Comprehensive security protocols, including root of trust and data provenance, are part of the complex array of services that orchestrate data life cycles, protect the integrity of data, and make it accessible on demand.

Learn more about protecting data while optimizing its utility with backup and recovery solutions from Seagate.