How long does it take to load 1M records into Salesforce?
Optimising Data Loads for Throughput
I have been asked many times how long it takes to load data into Salesforce. Successful deployments rely on knowing how long new and incremental data loads will take.
In my experience, most people feel that data loads are a full-time job or, at the very least, something that kills their productivity.
So I decided to try to optimise a job that required loading 1M records into a custom object, to see what affects its performance and perhaps move the needle to the right.
In this article I’ll load 1M records into a custom object called Memo__c that is a child of a parent Account record, to see which variables impact timing.
Here are my top four considerations to help you optimise your data loads for throughput.
Database Events
Database events take processing time. The available database and application server resources are consumed by code and configuration options you choose to deploy.
Processes you’ve put into Apex code and/or configuration tools like Flow and Validation Rules take time to evaluate and execute. These processes use CPU time and add additional overhead when data is created and updated in your data load jobs.
The most likely error you’ll see if you haven’t optimised your database events is:
Apex CPU time limit exceeded
This will stop your data load jobs in their tracks.
I recommend analysing the objects you’re loading into before you load, to find out what is consuming CPU time and to identify any potential bottlenecks. Focus on Apex Triggers and Flows that operate on any records impacted by your data load.
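As a starting point, the Tooling API can list the Apex triggers defined on the object you’re loading into. Below is a minimal sketch using Python’s requests library; the instance URL, access token and API version are placeholders.

```python
import requests

# Placeholders: substitute your own My Domain URL, access token and API version.
INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "00D...your_session_token"
API_VERSION = "v59.0"

# Tooling API query for the Apex triggers defined on the object being loaded (Memo__c).
soql = "SELECT Name, Status FROM ApexTrigger WHERE TableEnumOrId = 'Memo__c'"
resp = requests.get(
    f"{INSTANCE_URL}/services/data/{API_VERSION}/tooling/query/",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"q": soql},
)
resp.raise_for_status()

for trigger in resp.json()["records"]:
    print(trigger["Name"], trigger["Status"])
```

Reviewing the active triggers (and, in Setup, any record-triggered Flows) on the target object tells you what will run for every batch you load.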
Record Locks
Salesforce record locks are a mechanism used to prevent data inconsistencies and ensure data integrity when multiple users or processes are trying to modify the same record simultaneously.
To maintain data consistency and prevent race conditions, Salesforce automatically applies implicit and explicit locks on records. These are held for a short duration and released once your database operation completes.
While essential, record locks can also cause performance issues if not managed properly. Excessive locking can lead to lock contention, where multiple processes are waiting for the same lock to be released. This can result in slowdowns or timeouts.
In the diagram above, the relationship between Memo__c and its parent Account means that when I create or update Memo__c records, the associated Account record is locked until the DML operation completes. Any other transactions that try to update the same Account in parallel may fail. If this happens during your data loads you’ll see the error:
UNABLE_TO_LOCK_ROW unable to obtain exclusive access to this record
There are a number of reasons why locking may be stopping your data loads, ranging from object-specific to sharing and custom behaviours. Here is a cheatsheet, still relevant, that can help you identify what is locked and when for different scenarios.
In my experience the most common cause of locking is data skew. Data skew occurs when there is an uneven distribution of data within a specific dataset. Common types of data skew are:
Ownership Skew: A large number of records are owned by a single user.
Account Data Skew: A single account has an excessive number of related records (e.g. Opportunities, Contacts, Cases).
Lookup Data Skew: A large number of records reference the same parent record.
For our example, when loading into the Memo__c custom object we need to avoid associating too many Memo__c records with the same Account record; in other words, we don’t want to introduce Account data skew. To reduce the chance of data skew I ensured that no more than 200 Memo records were associated with the same Account.
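One simple pre-flight check is to count rows per parent in the load file and flag any Account that exceeds the threshold. A minimal sketch, assuming a CSV named memos.csv with an Account__c lookup column (both names are illustrative):

```python
import csv
from collections import Counter

SKEW_THRESHOLD = 200  # cap used for this article's load

# Illustrative file and column names: memos.csv with an Account__c lookup column.
with open("memos.csv", newline="") as f:
    parents = Counter(row["Account__c"] for row in csv.DictReader(f))

skewed = {acct: n for acct, n in parents.items() if n > SKEW_THRESHOLD}
if skewed:
    print(f"{len(skewed)} Account(s) exceed {SKEW_THRESHOLD} child Memo__c rows:")
    for acct, n in sorted(skewed.items(), key=lambda kv: -kv[1])[:10]:
        print(f"  {acct}: {n}")
else:
    print("No account data skew detected above the threshold.")
```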
I recommend reviewing the record locking cheatsheet to identify the factors most likely to affect your data load, focusing in particular on ownership, account and lookup data skew. You can also order the file being loaded by parent record and tweak the batch size so that child records sharing the same parent all land in the same batch.
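To keep children of the same parent together, you can sort the load file by the parent lookup before submitting it. A minimal sketch, again assuming the illustrative memos.csv with an Account__c column:

```python
import csv

# Sort rows by the parent Account lookup so records sharing a parent are contiguous
# in the file, and therefore likely to fall into the same batch rather than straddle two.
with open("memos.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = sorted(reader, key=lambda r: r["Account__c"])

with open("memos_sorted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```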
Sharing Events
For custom objects and some standard objects, Salesforce stores details on who can access those records using __Share and Group Maintenance objects.
Sharing rows grant access to users and groups, but the data that specifies who belongs to each group resides in the Group Maintenance tables.
Since sharing is affected by who owns the record, who the record is manually shared with, sharing rules, and implicit sharing via the role hierarchy, the Group Maintenance tables can require a significant number of updates when records are loaded by your data load job.
A great place to get under the hood for record-level access is the Salesforce documentation.
If sharing events are causing issues for your data loads you’ll see the error:
Group Membership operation already in progress
In my experience the most common causes of sharing event issues are:
Ownership Skew: A large number of records are owned by a single user.
Deep Role Hierarchies: A large number of levels in a role hierarchy.
# of Queues and Public Groups: A large number of queues and public groups increases the number of recalculations required when record ownership changes.
Changing owners when loading records causes the group membership tables to be updated. Parallel data load jobs that also need to update group membership must wait for the lock on those tables to be released. The more ownership changes you have, the greater the issue.
Deep role hierarchies can also cause issues, as ownership changes require the group membership tables to be updated for the new owner’s role, that role’s subordinates, and any affected sharing rules.
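If you want a quick read on how deep your role hierarchy actually is, the UserRole object can be queried and walked upwards. A rough sketch using Python’s requests library, with a placeholder instance URL and token, and query pagination ignored for brevity:

```python
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"  # placeholder
ACCESS_TOKEN = "00D...your_session_token"                # placeholder

# Fetch all roles and their parents (assumes fewer than one page of results).
resp = requests.get(
    f"{INSTANCE_URL}/services/data/v59.0/query/",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"q": "SELECT Id, ParentRoleId FROM UserRole"},
)
resp.raise_for_status()
roles = {r["Id"]: r["ParentRoleId"] for r in resp.json()["records"]}

def depth(role_id):
    # Walk up the hierarchy until we reach a role with no parent.
    d = 1
    while roles.get(role_id):
        role_id = roles[role_id]
        d += 1
    return d

print("Deepest role hierarchy level:", max(map(depth, roles), default=0))
```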
I recommend reviewing the number of ownership changes your data load is likely to cause. If a large number of changes is expected, consider deferring sharing calculations until after the data load job completes. Salesforce is also addressing performance in this area after Winter ’24 with this update, which will reduce contention for some standard objects.
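To quantify ownership skew in the file itself before loading, counting rows per owner is usually enough. A minimal sketch assuming the CSV carries an explicit OwnerId column (an assumption; your file may assign owners differently):

```python
import csv
from collections import Counter

# Illustrative: memos.csv with an explicit OwnerId column.
with open("memos.csv", newline="") as f:
    owners = Counter(row["OwnerId"] for row in csv.DictReader(f))

total = sum(owners.values())
for owner, n in owners.most_common(5):
    print(f"{owner}: {n} records ({n / total:.1%} of the load)")
```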
Parallelism
Bulk data load jobs using the Bulk API are designed to be run in parallel. The reason they are often not run in parallel is the set of issues highlighted above, and this is the biggest factor in determining optimal throughput.
A degree of parallelism of 1 or less means our data load jobs are effectively running sequentially, one after the other, yet Salesforce has numerous servers available to process our data, so there is room to optimise the load.
I recommend running all Bulk API jobs in parallel mode with the largest batch size, having first mitigated the risk of record locks as described above.
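For reference, below is a minimal sketch of creating, uploading and closing a Bulk API v2 ingest job with Python’s requests library. The instance URL, token and file name are placeholders, and note that with Bulk API v2 Salesforce manages how the uploaded data is split into batches internally.

```python
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"  # placeholder
ACCESS_TOKEN = "00D...your_session_token"                # placeholder
BASE = f"{INSTANCE_URL}/services/data/v59.0/jobs/ingest"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "application/json"}

# 1. Create the ingest job for the Memo__c custom object.
job = requests.post(BASE, headers=HEADERS, json={
    "object": "Memo__c",
    "operation": "insert",
    "contentType": "CSV",
    "lineEnding": "LF",
}).json()
job_id = job["id"]

# 2. Upload the CSV data (memos_sorted.csv is an illustrative file name).
with open("memos_sorted.csv", "rb") as f:
    requests.put(
        f"{BASE}/{job_id}/batches",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "text/csv"},
        data=f,
    ).raise_for_status()

# 3. Mark the upload complete so Salesforce starts processing the job.
requests.patch(
    f"{BASE}/{job_id}",
    headers=HEADERS,
    json={"state": "UploadComplete"},
).raise_for_status()

print("Submitted job", job_id)
```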
The Result
Control
Let’s set up a control to load 1M Memo records into a custom object related to Account. We’ll minimise the impact of database events, sharing recalculations and record locks.
We’ll set our batch size to the maximum allowed (10,000) and run a Bulk API v2 job serially to see how long 1M records take to load.
For the control job, it took 3 hours 31 minutes (12,696 seconds) to load 1M records. So we can calculate the degree of parallelism as follows:
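One common way to calculate the degree of parallelism is to divide the total processing time of all batches by the elapsed wall-clock time of the job; throughput is simply records loaded divided by elapsed minutes. A small helper sketch, where the total batch processing time is a made-up placeholder you would read from your own job’s monitoring page:

```python
def load_metrics(records, elapsed_seconds, total_batch_processing_seconds):
    """Throughput (records/min) and degree of parallelism for a bulk load job.

    total_batch_processing_seconds is the sum of processing time across all batches,
    taken from the job's monitoring data; elapsed_seconds is the job's wall-clock time.
    """
    throughput = records / (elapsed_seconds / 60)
    parallelism = total_batch_processing_seconds / elapsed_seconds
    return throughput, parallelism

# Control job elapsed time from this article (12,696 s for 1M records); the batch
# processing total below is a placeholder, not a measured value.
throughput, parallelism = load_metrics(1_000_000, 12_696, total_batch_processing_seconds=11_500)
print(f"{throughput:,.0f} records/min, degree of parallelism {parallelism:.1f}")
```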
We can see that by running our data load serially we achieved a degree of parallelism of less than 1, and our throughput was 4,730 records/min.
Optimised
Now let’s set our batch size to the maximum allowed (10,000) and run a Bulk API v2 job in parallel to see how long 1M records take to load.
For the new Bulk API job, it took 10 minutes and 53 seconds (653 seconds) to load 1M records. Calculating the degree of parallelism in the same way:
We can see that running our data load in parallel achieved a degree of parallelism of 24.5, and our throughput was 91,911 records/min.
Summary
Loading 1M records into Salesforce can be as fast as 10 minutes or as slow as 4 hours. If you’re planning for a go-live then understanding the factors that impact the load time can make your life easier. Database events, record locks, sharing events and the degree of parallelism all contribute to or hinder your data load success.
Before you run your next data load, plan ahead, quantify potential bottlenecks, and adjust your load to mitigate those risks.