
Performance Impact of De-duplication

Subject:

We do not provide the ability to enable de-duplication via our Web Interface or CLI commands. 
We enable compression by default on QuantaStor ZFS Storage Pools, but we do not automatically turn on the de-duplication feature due to its heavy performance impact.

Detail:

Compression is enabled on ZFS Storage Pools by default. Compression works by reducing the footprint of the individual data blocks. The default compression algorithm used by ZFS is LZJB, which focuses on speed. LZ4 is also available as an option and generally offers a better compression ratio with comparable or better throughput.
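For reference, these settings map to standard ZFS dataset properties and can be inspected from the command line. The pool name 'tank' below is only an example, and on QuantaStor systems the compression setting is normally managed through the Web Interface rather than by hand:

    # Show which compression algorithm is currently in effect (example pool name)
    zfs get compression tank

    # Show the compression ratio achieved so far
    zfs get compressratio tank

    # Switch the algorithm to LZ4 (normally handled by QuantaStor itself)
    zfs set compression=lz4 tank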

De-duplication is a feature available with the ZFS filesystem, but it comes at a great cost to performance. Because of this, we do not provide any ability to enable it via our Web Interface or CLI commands.

Reporting of the dedup ratio and space savings is not implemented inside the QuantaStor management layer, but it is available via the same ZFS filesystem CLI tools that can be used to enable or disable de-duplication.
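As a rough sketch of what those ZFS CLI commands look like (the pool and dataset names below are examples only, and enabling de-duplication this way is done outside of the QuantaStor management layer):

    # Enable de-duplication on a dataset (expert use only; example names)
    zfs set dedup=on tank/dataset1

    # Report the pool-wide dedup ratio (see the DEDUP column)
    zpool list tank

    # Show de-duplication table (DDT) statistics for the pool
    zpool status -D tank

    # Turn de-duplication back off; this only affects new writes
    zfs set dedup=off tank/dataset1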


One of the requirements for ZFS de-duplication is to have enough system memory for the ZFS dedup metadata table (DDT) to reside entirely in RAM. Depending on the size and number of your files, the typical recommendation is 1GB of system memory for every 5GB of de-duplicated data for use cases with lots of small files, or 1GB for every 20GB of data for use cases with very large files.
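If you want a rough idea of how large the DDT would become for data already on a pool, ZFS can simulate de-duplication without enabling it. The pool name below is an example, and the per-entry in-core size (a few hundred bytes) is an approximation that varies between ZFS versions:

    # Simulate de-duplication on an existing pool and print a DDT histogram
    # (this reads a lot of metadata, so run it during a quiet period)
    zdb -S tank

    # Estimate RAM usage by multiplying the number of unique blocks reported
    # above by the approximate in-core size of a DDT entry (a few hundred bytes)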


De-duplication is an expert-level feature and should be fully understood before it is enabled for any production workload.

If you find the performance impact unacceptable and wish to turn the feature off, de-duplication must be disabled and then all of the data must be rewritten, since disabling the feature only affects new writes and does not expand blocks that are already de-duplicated.

Notes:
It is hard to gauge the full performance impact of de-duplication in testing unless you perform your tests against the complete production data set.
We do not recommend using de-duplication unless you are able to run a full suite of tests against the complete dataset and your use case is not focused on client access performance.
For any performance-focused use case, de-duplication is best left disabled.

De-duplication in ZFS is an in-line process; this means that the only way to un-deduplicate data that has already been de-duplicated is to disable dedup and copy the data elsewhere on the filesystem.
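As a sketch of what that copy looks like with the ZFS CLI (dataset names and mount paths below are examples only), the data can be rewritten into a dataset that has de-duplication turned off:

    # Stop de-duplicating new writes on the original dataset
    zfs set dedup=off tank/dataset1

    # Create a new dataset (dedup is off by default) and copy the data into it,
    # which rewrites every block in fully expanded form
    zfs create tank/dataset1-rehydrated
    rsync -a /tank/dataset1/ /tank/dataset1-rehydrated/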

When using the remote replication feature on a pool with ZFS de-duplication, the data must be read from disk and expanded (un-deduplicated) in memory, which can place heavy strain on the system and cause a significant performance impact while the remote replication process is running.
The cause for this is that the ZFS send and receive functionality used for remote replication cannot transmit de-duplicated blocks directly from disk; each block must be expanded in memory before it is sent to the target DR replication system.
