As will be shown, the SSD benchmark testing “lies” are more ones of omission than of commission. But despite honorable intentions, any result must be considered a “lie” when it provides meaningless or, worse yet, misleading information. The underlying cause of these lies is a familiar one: accurate testing takes time. So short-cuts are taken. Important procedures get skipped. Key considerations are ignored. Results are read before performance stabilizes. Important tests are run without proper preparation—or not run at all.
The situation with performance testing of solid state drives is not unlike what occurred when the U.S. Environmental Protection Agency (EPA) introduced its gasoline mileage rating system in the 1970s. The test simulations were not representative of the way most people drive, so the results were notoriously high. Virtually no one got the highway or city mileage “determined” by the testing, and while regulators and auto manufacturers alike acknowledged the problem, they completely ignored it. Subsequent enhancements have dramatically improved the accuracy of the results, but the changes were slow in coming. For example, highway speed limits increased from 55 to 65 MPH in 1987, but the EPA tests did not take this into account until the 2008 model year—21 years later!
The “your mileage may vary” caveat applies equally to SSD benchmark testing today. Some testing is robust, with the results providing an accurate prediction of the performance that might be expected in the real world. All too often, however, the results are way off. Here are the three key factors that determine whether or not SSD benchmark testing results are representative of real-world performance:
Note how the nature of the data is of paramount importance when testing SSD performance. The reason why is explained more fully in the next section on The Potential Pitfalls of SSD Benchmark Testing, which also explores other common problems, as well as how to avoid and detect them. There are, of course, valid ways to perform the tests, and these are outlined in the section on Best Practices in SSD Benchmark Testing.
Not addressed in this white paper are factors, other than performance, that are important when evaluating SSDs. The two most significant such considerations are power consumption and its effect on battery life, and how the disk’s useful life or endurance can be extended with features that minimize write amplification and/or maximize wear-leveling.
Solid state disks and hard disk drives (HDDs) are fundamentally different, and if benchmark testing does not accommodate this difference, the results will be fundamentally flawed. Understanding the potential pitfalls of SSD benchmark testing, therefore, requires some understanding of how SSDs are different.
The difference between HDDs and SSDs derives from the very physics of magnetic media and NAND flash memory. Magnetic media can be overwritten; flash memory cannot. With an HDD, the action of “deleting” a file affects only the metadata in the directory, which is changed to designate the affected sectors as now containing “free space” for writing new data over the old. This is the reason “deleted” files can be recovered (or “undeleted”) from HDDs, and this is also why it is necessary to actually erase sensitive data to fully secure an HDD.
With NAND flash memory, by contrast, free space can only be created by actually erasing the data that previously occupied a block of memory. The process of reclaiming blocks of flash memory that no longer contain valid data is called “garbage collection.” Only when the blocks, and the pages they contain, have been reset in this fashion are they able to store new data during a write operation. In SSD parlance, the act of writing and then erasing data is referred to as a program/erase (P/E) cycle.
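The page/block asymmetry just described can be sketched as a toy model. The class below is purely illustrative (the geometry, state names and method names are ours, not any vendor's firmware): pages are programmed individually, but can only be returned to a writable state by erasing the entire block.

```python
PAGES_PER_BLOCK = 128   # illustrative geometry; real parts vary

class Block:
    """Minimal NAND block model: pages are programmed one at a time,
    but can only be made writable again by erasing the whole block."""

    def __init__(self):
        self.pages = ["erased"] * PAGES_PER_BLOCK  # erased | valid | invalid

    def program(self, idx: int) -> None:
        if self.pages[idx] != "erased":
            raise ValueError("flash pages cannot be overwritten in place")
        self.pages[idx] = "valid"

    def invalidate(self, idx: int) -> None:
        self.pages[idx] = "invalid"   # the logical overwrite lands elsewhere

    def erase(self) -> None:
        """Block-wide erase: this completes one program/erase (P/E) cycle."""
        self.pages = ["erased"] * PAGES_PER_BLOCK
```

Note that a logical overwrite never touches the old page directly; it merely marks the old copy invalid while the new data is programmed into an erased page somewhere else.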
Naturally, the need for garbage collection affects an SSD’s performance, because any write operation to a “full” disk (one whose initial free space or capacity has been filled at least once) needs to await the availability of new free space created through the garbage collection process. Because garbage collection occurs at the block level, there is also a significant performance difference, depending on whether sequential or random data is involved. Sequential files fill entire blocks, which dramatically simplifies garbage collection. The situation is very different for random data.
As random data is written, often by multiple applications, the pages are written sequentially throughout the blocks of the flash memory. The problem is: This new data is replacing old data distributed randomly in other blocks. This causes a potentially large number of small “holes” of invalid pages to become scattered among the pages still containing valid data. During garbage collection of these blocks, all valid data must be moved (i.e. read and re-written) to a different block. By contrast, when sequential files are replaced, entire blocks are often invalid, so no data needs to be moved. Sometimes a portion of a sequential file might share a block with another file, but on average only about half of the data in such blocks will need to be moved, making garbage collection much faster than it is for randomly-written blocks.
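A small simulation makes the cost concrete. This is a deliberately crude model, not a real flash translation layer: it overwrites one full capacity of pages at uniformly random logical locations and then counts how many still-valid pages garbage collection would have to relocate per block, which works out to roughly e⁻¹ ≈ 37% of each block. A sequential overwrite of the same volume would instead invalidate whole blocks, leaving nothing to move.

```python
import random

def avg_pages_to_move(blocks: int = 64, pages: int = 128,
                      seed: int = 1) -> float:
    """Overwrite one full drive capacity of pages at random logical
    locations, then return the average number of still-valid pages
    garbage collection must relocate per reclaimed block."""
    valid = [[True] * pages for _ in range(blocks)]
    rng = random.Random(seed)
    for _ in range(blocks * pages):   # one capacity's worth of random writes
        valid[rng.randrange(blocks)][rng.randrange(pages)] = False
    return sum(map(sum, valid)) / blocks

# Expect roughly 128 * e^-1 ~ 47 valid pages to move per block, versus
# 0 for a purely sequential overwrite of the same amount of data.
print(avg_pages_to_move())
```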
Testing SSDs Requires Different Procedures—and More Patience
Because SSDs and HDDs operate differently, they must be tested differently. That is not to say that everything is different. For example, with comparable disk form factors, the same test equipment can usually be used. Most existing test beds can also be used for testing PCIe® card-based solid state storage solutions. But while the same equipment can still be used, the procedures must be very different, or the tests will inevitably result in lies.
The resulting lies will rarely be intentional, although that is certainly possible. Consider, for example, how vehicle mileage testing could be performed to “find” exceptional gas mileage by steadily maintaining an optimal speed that is not at all typical of city or highway driving. A comparison could then be made with a competitor’s vehicle that gets really poor gas mileage by flooring the accelerator and slamming on the brakes!
It would be just as easy to “cheat” in a variety of different ways with SSD benchmark testing if the intent were to either over- or under-state the performance. But that is not the case here. Instead, the lies, damn lies and meaningless or misleading SSD benchmark test results are far more likely to be lies of omission, not commission. No independent trade publication or third-party lab would risk jeopardizing its hard-won reputation for excellence and integrity. And no reputable vendor would intentionally deceive prospective partners or customers. When Seagate tests SSDs that use Seagate® SandForce® flash controllers, for example, the engineers are careful to achieve meaningful results to set realistic performance expectations for users, and to accurately assess how various product enhancements might improve performance.
Despite the honorable intentions of independent labs and vendors alike, SSD benchmark test results are often meaningless or misleading because the list of what can be done wrong is long.
The potential problems begin with the benchmark tests themselves. Some are better than others, and none is perfect. Understanding and somehow mitigating their inherent limitations is, therefore, important to achieving meaningful test results. For example, most of the tests use synthetic data that is not representative of real-world conditions, and none forces sufficient preconditioning before reporting results. When multiple benchmark tests are used to overcome their respective limitations, it is also important to make adjustments in cases where some results are decimal (in Megabytes or MB) and others are binary (in Mebibytes or MiB).
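The decimal/binary adjustment mentioned above is a one-line conversion, but skipping it silently skews comparisons by about five percent. A sketch (the helper name is ours):

```python
MIB = 2**20   # 1,048,576 bytes: one binary mebibyte
MB = 10**6    # 1,000,000 bytes: one decimal megabyte

def mib_s_to_mb_s(rate_mib_s: float) -> float:
    """Convert a throughput reported in MiB/s to decimal MB/s."""
    return rate_mib_s * MIB / MB

# The two units differ by ~4.9%, enough to distort cross-tool comparisons:
print(mib_s_to_mb_s(100))   # 104.8576
```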
Because all of these inherent limitations can be overcome by following a set of best test practices (covered in the next section), the three leading causes of meaningless or misleading benchmark test results involve the operator. And these are:
After reading this white paper, anyone conducting or evaluating someone else’s benchmark testing should be fully aware of the issues involved, and should know how to avoid or detect the errors that cause results to be meaningless or misleading. Unfortunately, the #1 cause is not so easy to remedy.
Good SSD benchmark testing takes time—several hours or more to test each disk—so it is tempting to take short-cuts. Figure 1 shows why proper testing takes so much time. In this test sequence, the SSD was completely erased before the test was started, meaning garbage collection was not required for the initial writes up to the SSD’s capacity. In summary:
Figure 1 - SSD performance (shown here for random data) changes dramatically before reaching steady state (source: Seagate internal testing)
In Figure 1, the green portion shows total writes between 1 and 2 times the SSD’s capacity, orange between 2 and 3 times, and red greater than 3 times. In this test with random data, steady state operation and, therefore, performance were not achieved until the disk had been “filled” with data to three times its capacity. Multiple passes are needed to help ensure garbage collection is occurring across the entire SSD, and that can only happen once every page has incurred at least one program/erase cycle. Given the random selection of pages during write operations, at least three passes are generally needed.
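The “at least three passes” guideline can be motivated with a simple probabilistic model (our simplification, ignoring wear-leveling, TRIM and over-provisioning): if writes land on pages uniformly at random, the fraction of pages never written after N capacities' worth of writes is (1 − 1/P)^(N·P) ≈ e⁻ᴺ.

```python
import math

def untouched_fraction(passes: float) -> float:
    """Approximate fraction of pages that have never been written
    after `passes` x capacity of uniformly random page writes:
    (1 - 1/P)**(passes * P) ~ e**(-passes) for large page counts P."""
    return math.exp(-passes)

for n in (1, 2, 3):
    print(n, round(untouched_fraction(n), 3))
# After one pass ~36.8% of pages remain untouched; after three, ~5%.
```

Under this model a single pass leaves over a third of the drive without a single program/erase cycle, which is why one “fill” is not enough for random workloads.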
Testing SSDs with Advanced Capabilities
SSDs with advanced capabilities, such as those that are SandForce Driven™, can take substantially longer to test. This is particularly true for disks that employ a data reduction technology. Some understanding of the rationale for and the performance impact of data reduction is, therefore, warranted here.
The need to move data during garbage collection causes the amount of data being physically written to the SSD to be a multiple of the logical data being written. This phenomenon is expressed as a simple ratio called “write amplification.” This ratio approaches 1.0 for sequential writes (where very little data needs to be moved), but is considerably higher—3.0 or more—with randomly written data. Minimizing write amplification with a data reduction technology increases both performance and endurance, and reduces power consumption.
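Under a simple steady-state approximation (a first-order model, not a measured property of any particular drive), the write amplification ratio follows directly from the fraction of pages still valid when a block is reclaimed:

```python
def write_amplification(valid_fraction: float) -> float:
    """Steady-state approximation: reclaiming a block frees (1 - v)
    of its pages while re-writing the v still-valid ones, so each
    logical page write costs 1/(1 - v) physical page writes."""
    assert 0 <= valid_fraction < 1
    return 1.0 / (1.0 - valid_fraction)

print(write_amplification(0.0))              # 1.0 -> ideal sequential case
print(round(write_amplification(2 / 3), 1))  # 3.0 -> heavy random workload
```

This matches the figures in the text: sequential writes that invalidate whole blocks approach a ratio of 1.0, while random workloads that leave two-thirds of each reclaimed block valid reach 3.0 or more.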
Data reduction works by taking advantage of the entropy, or randomness, of data. Data with low entropy can be deduplicated, compressed and otherwise processed in loss-less ways so that only a fraction of the logical data “written” by the operating system is actually physically written to the SSD. The results can be truly remarkable. For example, the DuraWrite™ data reduction technology is able to reduce write amplification to an average of between 0.5 and 1.0 in typical user environments.
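How “reducible” a given workload is can be estimated with Shannon entropy. The sketch below measures bits of entropy per byte (0 = fully redundant, 8 = statistically incompressible); this is one common way to quantify data randomness, not necessarily how DuraWrite measures it internally.

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0 = fully redundant,
    8 = statistically incompressible."""
    counts = Counter(data)
    n = len(data)
    # + 0.0 normalizes the -0.0 that arises for single-symbol data
    return -sum(c / n * math.log2(c / n) for c in counts.values()) + 0.0

print(byte_entropy(b"\x00" * 4096))          # 0.0  -> highly reducible
print(byte_entropy(os.urandom(4096)) > 7.5)  # True -> effectively incompressible
```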
The performance gains afforded by data reduction derive from its effect on over-provisioning. When data entropy is low, the data actually written to the SSD requires fewer pages and blocks. Because the operating system is unaware of this reduction, the extra space can be used by the SSD’s flash controller as additional over-provisioning space, and the greater the over-provisioning space, the better the performance. As the entropy of the data increases, this additional over-provisioning “free space” decreases. At 100% entropy (“un-reducible” data) there is no additional over-provisioning space, causing the SSD to perform the same as one without data reduction.
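The effect on over-provisioning can be quantified as a simple ratio of spare flash to flash actually holding data. The drive sizes below are hypothetical, chosen only to illustrate the mechanism:

```python
def overprovisioning(physical_gb: float, stored_gb: float) -> float:
    """Over-provisioning ratio: spare flash / flash holding data."""
    return (physical_gb - stored_gb) / stored_gb

# Hypothetical drive: 256 GB of raw flash exposing 240 GB to the host.
print(round(overprovisioning(256, 240), 3))        # 0.067 -> ~7% nominal OP
# With 50% data reduction, only 120 GB is physically stored:
print(round(overprovisioning(256, 240 * 0.5), 3))  # 1.133 -> ~113% effective OP
```

The jump from roughly 7% to over 100% effective over-provisioning is what gives low-entropy workloads their performance advantage; at 100% entropy the stored size equals the logical size and the advantage disappears.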
Because data reduction, such as the technology found in SandForce flash controllers, is fundamental to an SSD’s performance, it cannot be ignored during benchmark testing. Doing so would, in effect, be like disabling the overdrive in a vehicle mileage test.
Interpreting Typical Benchmark Test Results for Random Reads/Writes
Another potential pitfall involves flawed interpretation of the benchmark test results. Figure 2 shows the results of a typical benchmark test comparing different SSDs at different read/write ratios. The shape of the curve demonstrates the importance of knowing this ratio when interpreting test results reported in a different format, such as a single IOPS (I/O Operations per Second) for each SSD.
Without such disclosure (and resulting awareness), performance could be made to look truly remarkable… at 100 percent reads! Such a test might be appropriate for a CD/DVD ROM drive, but not for an SSD. A very low percentage of reads could be equally meaningless, as that reveals very little differentiation among the SSDs in this particular test. A valid benchmark test should report in the Typical Performance Range shown, or if using a single ratio, a mid-range read/write ratio of about 2:1.
Figure 2: Typical benchmark test results (IOPS vs. read/write ratio) (source: Seagate internal testing)
Figure 3 shows why it is essential to consider data entropy whenever any of the SSDs being tested employs a data reduction technology. Because data entropy has no effect on SSDs without data reduction, a single level of entropy (high, low or in-between) can be used to test these drives.
Figure 3: Data entropy cannot be ignored when testing SSDs with a data reduction technology (source: Seagate internal testing)
Given the dramatic improvement in performance as entropy decreases, tests would ideally be performed at a minimum of three representative levels over a wide range of data entropy (e.g. 100%, 70% and 10% as done in this test). Testing can be simplified, of course, where the actual level of entropy is known for the SSD’s intended application. For example, an SSD used as a boot device will never experience 100% entropy because the operating system will always provide some reduction in entropy. In such cases, the real-world performance will be higher than the 100% level shown in Figure 3.
It is instructive to use some analogies to understand the critical factors involved in SSD benchmark testing. Consider a marathon runner, who represents sequential reads/writes of relatively large files, and a sprinter, who represents random reads/writes of data in relatively small blocks. If the test (a race in this case) is either a 100 Meter dash or a 10 Kilometer run, neither race alone could determine, with any fairness or accuracy, who is the “faster” or “better” runner overall. The only way to determine that would be to have each athlete run both races and compare their combined performance.
Another analogy is the mileage test cited above where the driver was flooring the gas pedal and slamming on the brakes. This is obviously an inappropriate test for vehicle mileage. But this very same test could be fully valid for comparing 0-60 MPH acceleration and braking distance at 60 MPH for different vehicles. Whether comparing runners, vehicles or SSDs, doing it right requires conducting the right tests in the right way. Anything else is no better than a “lie.”
These analogies demonstrate the importance of using benchmark tests and procedures that are appropriate for the anticipated application(s). The application determines the nature of the data, which has two potential effects on the test methodology. How the data is written to and read from the disks, whether sequentially and/or randomly, has an effect on all SSD benchmark testing. The level of entropy of the data also has an effect whenever testing any SSD that employs a data reduction technology.
Preparing to Run the Benchmark Test(s)
As with many endeavors, patience is a virtue, and such is the case with SSD benchmark testing, where getting meaningful results requires following a rigorous set of procedures, and following them fully with no short-cuts.
The first tempting short-cut is to run only a single test. But because SSD behavior is very different for sequential vs. random reads and writes, two separate tests are normally required.
With a suitable benchmark test and data set, a single test might be appropriate for a known (or assumed and disclosed) mix of applications. For such a test, it is best to use a representative sample of the actual sequential and/or random data, and Seagate has modified Iometer to create a 2010 version that supports the use of user-supplied data.* Where using actual data is not possible or practical, a single “combined” test can be performed using PCMark Vantage, which has the most realistic data set of the six benchmark tests considered here. In the runner analogy above, such a typical mix might be represented by a 5K race, which would result in a reasonably fair and meaningful (albeit imperfect) comparison between the two types of runners. Other benchmark tests not considered here (or future enhancements to those that are) might also have the ability to utilize a variable mix of sequential and random data to yield meaningful real-world results.
When separate testing is required, separate preconditioning is also required. The second temptation, therefore, is to skip this important “re-preconditioning,” or worse, to skip any preconditioning at all. Two steps are critical to ensuring that the SSDs are properly preconditioned. The first is a secure erase, unless the drive is FOB: Fresh Out of the Box. The second step is to format the drive, with a quick format being acceptable.
Note: Formatting an SSD performs the equivalent of a full-disk TRIM that effectively erases all of the existing data. The TRIM command is used by the operating system to specify which pages of data stored on an SSD no longer contain valid data, and can therefore be ignored (deleted) during garbage collection. Formatting puts the SSD in an empty state and should not be performed on a preconditioned drive unless the drive is being re-preconditioned for another test.
Running the Test(s)
It is vitally important to run each and every benchmark test until steady state results are achieved. Doing so is the only way to ensure that the SSDs have been properly preconditioned and are performing normally—as they would in a real application. This aspect of SSD benchmark testing presents the greatest temptation to declare “good enough!” and halt the test prematurely.
Expect the results to drop precipitously at first, and then to settle gradually and become consistent. With sequential data, steady state operation is usually achieved once the disk has been “filled” with data to its full capacity a single time. For random data, it is usually necessary to “fill” the disk three separate times (for reasons explained above).
How many times the benchmark test must be run to fill an SSD depends on its capacity and the data written per run. For this reason, SSDs with different capacities, and/or the use of different tests and/or data sets will all require a different number of test runs before achieving steady state results. With larger SSDs and random data sets, the tests can run for many hours until the performance reaches steady state.
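The arithmetic for estimating the required number of runs is straightforward; the capacity and per-run figures below are purely illustrative.

```python
import math

def runs_to_steady_state(capacity_gb: float, gb_per_run: float,
                         fills_required: int = 3) -> int:
    """Benchmark runs needed to write `fills_required` x capacity
    (three fills is the guideline above for random data, one for
    sequential data)."""
    return math.ceil(fills_required * capacity_gb / gb_per_run)

# Illustrative numbers: a 480 GB SSD with 8 GB written per run
# needs 180 runs to be "filled" three times over.
print(runs_to_steady_state(480, 8))   # 180
```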
As if this were not enough to test one’s patience, any SSD employing a data reduction technology should undergo additional testing. Ideally, the SSD would undergo three separate random data read/write tests at three different levels of data entropy (as was done in the test depicted in Figure 3). Although preconditioning has minimal impact on sequential data read/write performance, entropy does have an impact and will, therefore, require the same three different levels of testing.
The only way to simplify this additional requirement is to use an average level of entropy for the target application(s), when known, or a mix of “typical” applications. Iometer is the best benchmark for testing known target applications because it enables the use of actual data. Iometer, PCMark Vantage and Anvil are all good choices for testing a mix of “typical” applications, and a reasonable “average” data entropy to use in this case is about 50%. Anvil supports a number of entropy levels, making it a good choice for performing a series of three tests at low, medium and high levels of entropy. CrystalDiskMark could be a good choice because it can test with both low and high entropy data, with the latter being the default configuration. AS-SSD and ATTO are poor choices because they utilize data with only high or only low entropy, respectively, and therefore they provide misleading results for SSDs with data reduction technology.
Reading the Results
A brand new or freshly-erased SSD exhibits astonishing performance, because there is no need to move any old data before writing new data. In other words, garbage collection is not active during the first pass of writing data. The results shown in Figure 4 could be interpreted to show that the SSD being tested delivers an unbelievable (literally) throughput of around 300MB/s for both sequential and random data—at least until the disk is first filled and garbage collection begins. The performance this disk will likely actually achieve is just under 250MB/s for sequential data and just under 25MB/s for random data.
Figure 4 – SSDs experience dramatically different performance with random and sequential data (source: Seagate internal testing)
Note how quickly steady state results are achieved for sequential data (15 minutes) and how long it takes for random data (over 3 hours). Note also that although the results for both sequential and random data are shown on the same chart, the actual tests were and should be run separately with separate preconditioning to get meaningful results.
Because most client PCs and servers support a mix of applications with a mix of random and sequential data (and software), real-world performance for the SSD tested in Figure 4 will fall somewhere within the range of 25MB/s to 250MB/s. And because these results were achieved by following the best practices outlined here, it is possible to interpolate the findings for a particular mix of random and sequential data. For example, a mix of applications that stores and accesses a 50/50 mix of random and sequential data should experience an overall performance of about 45MB/s, as determined by the following formula: 1 ÷ (0.5/25 + 0.5/250) ≈ 45MB/s, the time-weighted (harmonic) mean of the two rates.
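This interpolation is a time-weighted (harmonic) combination of the two rates: each byte takes 1/rate seconds to transfer, so the rates combine harmonically rather than as a simple average. A sketch (function and parameter names are ours):

```python
def mixed_throughput(seq_mb_s: float, rand_mb_s: float,
                     rand_fraction: float = 0.5) -> float:
    """Each byte takes 1/rate seconds to transfer, so a traffic mix
    combines rates harmonically (time-weighted), not arithmetically."""
    return 1.0 / (rand_fraction / rand_mb_s
                  + (1.0 - rand_fraction) / seq_mb_s)

# 50/50 mix of 25 MB/s random and 250 MB/s sequential:
print(round(mixed_throughput(250, 25)))   # 45
```

Note how far this falls below the 137.5MB/s a simple average would suggest: the slow random half dominates the elapsed time, just as a slow leg dominates a relay.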
Real-world decision-making about SSD performance is possible only with valid benchmark test results that can be subjected to meaningful interpretation and interpolation. And valid results are possible only by following the best practices outlined here.
SSDs operate very differently from HDDs and, therefore, they must be tested differently. Accommodating these differences increases the time and effort it takes to get meaningful results, and therein lies the reason for the lies common in SSD benchmark testing.
Simply put: There are no short-cuts to getting accurate results when testing SSD performance. The nature of the data is of paramount importance, and this makes it necessary to perform separate tests involving sequential and random access. Additional tests are needed for SSDs with data reduction technology. And all of these separate tests require separate preconditioning.
While the temptation to take short-cuts might be great, doing so guarantees meaningless or misleading results. So when doing your own benchmark testing, follow the best practices outlined here, and try to use your own data if possible. And when considering someone else’s benchmark testing results, confirm that the methodology employed has met and exceeded the rigors of these best practices.