Subject: Additional detail on reviewing replication performance.
Details:
For replication performance, there are four areas where performance can be limited.
1) The disk performance when reading the data from the Primary unit. This can be affected by the following:
• The performance capability of the Disk Array and its hardware: A good resource we have found for calculating IOPS and throughput for different RAID configurations, disk types and workloads is available here: http://www.wmarow.com/strcalc/
• The read and write workloads from the clients. If the client workload is highly random, it can greatly reduce the performance of a platter-based disk array, leaving much less performance available for the reads needed to gather the replication snapshot data to send to the Secondary replication unit.
• The nature of the data changes captured in a replication snapshot. If the changes are spread across many different files, gathering the delta replication snapshot of changes to send to the destination replication unit becomes a much more random workload than a sequential one.
• The number of snapshots on a Storage Volume or Network Share can also affect performance, as the system must check the reference information from the prior snapshots when generating a new replication snapshot. For example, a customer keeping 100 hourly snapshots will see a much greater slowdown than one keeping 20 daily snapshots.
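As a rough illustration of the kind of math the calculator linked above performs, the sketch below estimates effective IOPS for a RAID array under a mixed read/write workload using the standard RAID write-penalty model. The per-disk IOPS figure and the penalty values are common rule-of-thumb assumptions, not QuantaStor-specific numbers:

```python
# Rough effective-IOPS estimate for a RAID array under a mixed workload.
# Illustrative sketch only; per-disk IOPS and write penalties are assumptions.

# Typical write penalties per RAID level (each frontend write costs
# extra backend I/Os for mirroring or parity updates).
WRITE_PENALTY = {"RAID0": 1, "RAID1": 2, "RAID10": 2, "RAID5": 4, "RAID6": 6}

def effective_iops(disks, iops_per_disk, raid_level, read_fraction):
    """Frontend IOPS the array can deliver for the given read/write mix."""
    raw = disks * iops_per_disk  # total backend IOPS of all spindles
    penalty = WRITE_PENALTY[raid_level]
    write_fraction = 1.0 - read_fraction
    # Each frontend write consumes `penalty` backend I/Os.
    return raw / (read_fraction + write_fraction * penalty)

# Example: 8 x 7.2k RPM disks (~75 IOPS each), RAID6, 70% reads.
print(round(effective_iops(8, 75, "RAID6", 0.7)))  # → 240
```

This is why a highly random client workload hurts so much on parity RAID: the write penalty multiplies every random write, leaving little headroom for the snapshot reads.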
2) The Network Bandwidth between the Primary and Secondary (Destination) Replication units.
• Replication can go no faster than the slowest network link between the two units, so any network bottleneck (link speed, congestion, latency) can limit replication performance.
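To gauge whether the network link is the limiting factor, it can help to compare the size of the replication delta against the link's effective throughput. A minimal sketch (the 80% efficiency factor is an assumed allowance for protocol overhead, not a measured value):

```python
# Estimate how long a replication delta takes to cross a given link.
# Illustrative sketch; the 0.8 efficiency factor is an assumed overhead margin.

def transfer_hours(delta_gb, link_mbps, efficiency=0.8):
    """Hours to move `delta_gb` gigabytes over a `link_mbps` megabit/s link."""
    usable_mbps = link_mbps * efficiency           # TCP/protocol overhead allowance
    seconds = (delta_gb * 8 * 1000) / usable_mbps  # GB -> megabits, then / Mb/s
    return seconds / 3600.0

# Example: a 500GB delta over a 1Gb/s WAN link at 80% efficiency.
print(round(transfer_hours(500, 1000), 2))  # → 1.39 hours
```

If the estimated transfer time is well under the observed replication time, the bottleneck is likely disk performance rather than the network.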
3) The disk performance for writing and verifying the snapshots at the Secondary (Destination) Replication unit. This can be affected by the following:
• The performance capability of the Disk Array and its hardware: As with the Primary, the calculator at http://www.wmarow.com/strcalc/ can help estimate IOPS and throughput for different RAID configurations, disk types and workloads.
• Typically a Secondary (Destination) Replication unit has no client access, except when the workload is failed over from the Primary or when the customer needs to mount a historical replication snapshot at the destination to recover a piece of data. However, if the customer is placing workload on the Secondary unit, it can greatly affect overall replication performance, as writing the data can be the slowest part of the replication process. To make the best use of sequential disk performance, there should be no workload on the Secondary unit other than what is placed there by the replication tasks.
• Customers who wish to place workload on both the Primary and Secondary units are advised to deploy QuantaStor units configured for much better random I/O capability; a RAID10 array of SSD data drives is normally recommended for this type of deployment.
4) The Replication Rate Limit throttling mechanism.
• The qs-util ratelimitset NN command can be used to impose an artificial limit on all replication tasks for the system. This is intended to keep disk and network usage below the configured limit on both the Primary and Secondary (Destination) Replication units.
• Please note that this value applies to all replication tasks on a system and is divided in real time among whichever replication tasks are running. If the customer sets a rate limit of 200MB/s but two replication tasks are running, each task will be throttled to a maximum of 100MB/s of throughput. When one of the tasks finishes, the remaining task's maximum rises to 200MB/s. The same applies when more tasks are started: if one replication task is running with a 200MB/s limit and three more tasks are started, all four tasks will each have a maximum throughput of 50MB/s.
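The per-task division described above can be expressed as a simple calculation. This is a sketch of the described behavior, not QuantaStor's actual implementation:

```python
# Per-task throughput ceiling under a system-wide replication rate limit.
# Sketch of the behavior described above, not the actual implementation.

def per_task_limit_mb_s(system_limit_mb_s, running_tasks):
    """MB/s ceiling for each task when the system limit is split evenly."""
    if running_tasks < 1:
        raise ValueError("at least one task must be running")
    return system_limit_mb_s / running_tasks

print(per_task_limit_mb_s(200, 2))  # → 100.0 (two tasks share a 200MB/s limit)
print(per_task_limit_mb_s(200, 4))  # → 50.0  (four tasks share a 200MB/s limit)
print(per_task_limit_mb_s(200, 1))  # → 200.0 (last remaining task gets it all)
```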