How long does it take to load 1M records into Salesforce?

Optimising Data Loads for Throughput

Dave Norris
7 min read · Jul 28, 2024

I have been asked many times how long it takes to load data into Salesforce. Successful deployments rely on knowing how long new and incremental data loads will take.

In my experience, when asked, most people feel like data loads are a full-time job or, at the very least, something that kills their productivity.

Frustration factor for data loads

So I decided to try and optimise a job requiring me to load 1M records into a custom object to see what might affect its performance and perhaps move the needle to the right.

In this article I’ll load 1M records into a custom object called Memo__c, a child of a parent Account record, to see which variables impact timing.

Here are my top four considerations for optimising your data loads for throughput.

Database Events

Database events take processing time. The available database and application server resources are consumed by code and configuration options you choose to deploy.

Database Events split between Code and Configuration

Processes you’ve put into Apex code and/or configuration tools like Flow and Validation Rules take time to evaluate and execute. These processes use CPU time and add overhead whenever data is created or updated by your data load jobs.

The most likely error you’ll see if you don’t optimise your database events is:

Apex CPU time limit exceeded

This will stop your data load jobs in their tracks.

Before loading, I recommend analysing the objects you’re loading into to find out what is consuming CPU time and to identify any potential bottlenecks. Focus on Apex triggers and Flows that operate on any records affected by your data load.
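One way to take that inventory is to ask the Tooling API what is defined on the target object. The sketch below is illustrative only: the instance URL, access token and API version are placeholders, and it assumes you have already authenticated (for example via the sf CLI).

```python
# Minimal sketch: list the Apex triggers defined on the object you are about to
# load into, so you know what will fire on every insert.
# INSTANCE_URL and ACCESS_TOKEN are hypothetical placeholders.
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "00D...your_session_token"
API_VERSION = "v60.0"

def tooling_query(soql: str) -> dict:
    """Run a SOQL query against the Tooling API and return the JSON payload."""
    resp = requests.get(
        f"{INSTANCE_URL}/services/data/{API_VERSION}/tooling/query/",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"q": soql},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Apex triggers on Memo__c and whether they are active.
triggers = tooling_query(
    "SELECT Name, Status FROM ApexTrigger WHERE TableEnumOrId = 'Memo__c'"
)
for t in triggers.get("records", []):
    print(f"Trigger: {t['Name']} ({t['Status']})")

# Record-triggered Flows and Validation Rules on the object are easiest to review
# under Setup (Object Manager > Memo__c).
```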

Record Locks

Salesforce record locks are a mechanism used to prevent data inconsistencies and ensure data integrity when multiple users or processes are trying to modify the same record simultaneously.

To maintain data consistency and prevent race conditions, Salesforce automatically applies implicit and explicit locks on records. These locks are held for a short duration and released once your database operation completes.

Record locks showing Account and their relationship to a Custom Object

While essential, record locks can also cause performance issues if not managed properly. Excessive locking can lead to lock contention, where multiple processes are waiting for the same lock to be released. This can result in slowdowns or timeouts.

In the diagram above, the relationship between Memo__c and its parent Account means that when I create or update Memo__c records, the associated Account record is locked until the DML operation completes. Any other transaction that tries to update the same Account in parallel may fail. If this happens during your data loads you’ll see the error:

UNABLE_TO_LOCK_ROW unable to obtain exclusive access to this record
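If you control the loading client, a short retry with exponential backoff is often enough to absorb a transient row lock. This is a minimal sketch rather than the approach used in this article; the instance URL, token and field values are hypothetical.

```python
# Minimal sketch: insert a single Memo__c record via the REST sObject endpoint,
# retrying when the response contains the UNABLE_TO_LOCK_ROW error code.
# INSTANCE_URL, ACCESS_TOKEN and the record values are hypothetical placeholders.
import time
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "00D...your_session_token"

def insert_memo(record: dict, max_attempts: int = 4) -> dict:
    url = f"{INSTANCE_URL}/services/data/v60.0/sobjects/Memo__c/"
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    for attempt in range(1, max_attempts + 1):
        resp = requests.post(url, json=record, headers=headers, timeout=30)
        if resp.ok:
            return resp.json()
        # Error responses are a list of {"errorCode": ..., "message": ...} entries.
        locked = any(e.get("errorCode") == "UNABLE_TO_LOCK_ROW" for e in resp.json())
        if not locked or attempt == max_attempts:
            resp.raise_for_status()
        time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s ...
    raise RuntimeError("retries exhausted")

# Example (hypothetical Ids):
# insert_memo({"Name": "Memo 1", "Account__c": "001xx000003DGbQAAW"})
```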

There are a number of reasons that locking may be stopping your data loads; they range across object-specific, sharing-specific and custom behaviours. Here is a cheatsheet, still relevant, that can help you identify what is locked, and when, for different scenarios.

In my experience the most common cause of locking is due to data skew. Data skew occurs when there is an uneven distribution of data within a specific dataset. Common types of data skew are:

Ownership Skew: A large number of records are owned by a single user.
Account Data Skew: A single account has an excessive number of related records (e.g. Opportunities, Contacts, Cases).
Lookup Data Skew: A large number of records reference the same parent record.

For our example, we need to ensure that when loading into the Memo__c custom object we do not associate too many Memo__c records with the same Account record; in other words, that we don’t introduce Account data skew. To reduce the chance of data skew I ensured that no more than 200 Memo records were associated with the same Account.

Showing a large number of Memo records associated to the same Account

I recommend reviewing the record-locking cheatsheet to identify the factors most likely to affect your data load. In particular, focus on ownership, account and lookup data skew. You can also order the records in the file being loaded, and tweak the batch size for your job, so that child records associated with the same parent record all land in the same batch.
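Here is a minimal sketch of that file-preparation step, assuming the load file is a CSV with an Account__c column holding the parent Id. The file names and the 200-record ceiling mirror the example in this article; adjust them to your data set.

```python
# Minimal sketch: sort the Memo__c load file by parent Account so that children of
# the same Account sit in contiguous rows (and therefore the same batch), and flag
# any Account that exceeds the per-parent ceiling used in this article.
# File names and the Account__c column are assumptions about your data set.
import csv
from collections import Counter

MAX_CHILDREN_PER_ACCOUNT = 200

with open("memos.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Group children of the same parent into contiguous rows.
rows.sort(key=lambda r: r["Account__c"])

counts = Counter(r["Account__c"] for r in rows)
skewed = {acct: n for acct, n in counts.items() if n > MAX_CHILDREN_PER_ACCOUNT}
if skewed:
    print(f"Accounts at risk of data skew: {skewed}")

with open("memos_sorted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```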

Sharing Events

For custom objects and some standard objects, Salesforce stores details of who can access those records using __Share objects and Group Maintenance tables.

Sharing objects for custom and some standard objects

Sharing rows grant access to users and groups, but the data that specifies who belongs to each group resides in the Group Maintenance tables.

Since sharing is affected by who owns the record, who the record is manually shared with, sharing rules, and implicit sharing via the role hierarchy, the Group Maintenance tables can require a significant number of updates when records are loaded by your data load job.

A great place to get under the hood for record level access is the Salesforce documentation.

If sharing events are causing issues for your data loads you’ll see the error:

Group Membership operation already in progress

In my experience, the most common causes of sharing issues are:

Ownership Skew: A large number of records are owned by a single user (a quick check for this is sketched after this list).
Deep Role Hierarchies: A large number of levels in a role hierarchy.
Number of Queues and Public Groups: A large number of queues and public groups increases the number of recalculations needed when record ownership changes.
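A simple way to spot ownership skew before (or after) a load is an aggregate query that counts records per owner. This sketch uses the standard REST query resource; the instance URL and token are placeholders, and you would point it at whichever object you are loading into.

```python
# Minimal sketch: count records per owner to spot ownership skew.
# INSTANCE_URL and ACCESS_TOKEN are hypothetical placeholders.
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "00D...your_session_token"

resp = requests.get(
    f"{INSTANCE_URL}/services/data/v60.0/query/",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    params={"q": "SELECT OwnerId, COUNT(Id) cnt FROM Memo__c GROUP BY OwnerId"},
    timeout=30,
)
resp.raise_for_status()

# Sort the aggregate results client-side and show the heaviest owners first.
owners = sorted(resp.json().get("records", []), key=lambda r: r["cnt"], reverse=True)
for row in owners[:10]:
    print(row["OwnerId"], row["cnt"])
```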

Changing owners when loading records causes the group membership tables to be updated. Parallel data load jobs that also require group membership updates are blocked while another operation holds the lock on those tables; that is where the error above comes from. The more ownership changes you have, the greater the issue.

Deep role hierarchies can also cause issues, because each ownership change requires the group membership tables to be updated for the new owner’s role and its role-and-subordinates groups, and to cater for any resulting sharing rule changes.

I recommend reviewing the number of ownership changes your data load is likely to cause. If a large number of changes is expected, consider deferring sharing calculations until after the data load job completes. Salesforce is also addressing performance after Winter ’24 with this update, which will reduce contention for some standard objects.

Parallelism

Bulk data load jobs using the Bulk API are designed to be run in parallel. They are often not run in parallel because of the issues highlighted above, and this is the biggest factor in determining optimal throughput.

Equation for calculating throughput

A degree of parallelism score of 1 or less means that our data load jobs run sequentially, one after the other. But Salesforce has numerous servers available to process our data, so we can optimise the load.
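For reference, the relationships these equations express, inferred from the worked figures later in the article, can be written as:

```
\text{Degree of parallelism} = \frac{\text{total batch processing time}}{\text{total job elapsed time}}
\qquad
\text{Throughput} = \frac{\text{records loaded}}{\text{elapsed time}}
```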

I recommend running all Bulk API jobs in parallel mode with the largest batch size to mitigate the risk of record locks.

The Result

Control

Let’s set up a control to load 1M Memo records into a custom object related to Account. We’ll minimise the impact of database events, sharing recalculations and record locks.

The controlled setup with minimizing impact of database events, sharing and record locks

We’ll set our batch size to the maximum allowed (10,000) and run a Bulk API v2 job serially to see how long 1M records take to load.
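For context, submitting such a job against the Bulk API 2.0 REST endpoints looks roughly like the sketch below. It is illustrative, not the exact tooling used here: the instance URL, token and file name are placeholders, and the batch size and serial/parallel settings referenced above are configured in the loading tool (for example Data Loader) rather than in this minimal flow.

```python
# Minimal sketch: create a Bulk API 2.0 ingest job for Memo__c, upload the CSV,
# and close the job so Salesforce starts processing it.
# INSTANCE_URL, ACCESS_TOKEN and the CSV file name are hypothetical placeholders.
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "00D...your_session_token"
API = f"{INSTANCE_URL}/services/data/v60.0"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# 1. Create the ingest job.
resp = requests.post(
    f"{API}/jobs/ingest",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"object": "Memo__c", "operation": "insert", "lineEnding": "LF"},
    timeout=30,
)
resp.raise_for_status()
job = resp.json()

# 2. Upload the CSV data to the job's content URL.
with open("memos_sorted.csv", "rb") as f:
    upload = requests.put(
        f"{INSTANCE_URL}/{job['contentUrl']}",
        headers={**HEADERS, "Content-Type": "text/csv"},
        data=f,
        timeout=600,
    )
upload.raise_for_status()

# 3. Mark the upload complete; Salesforce splits the data into batches and runs them.
requests.patch(
    f"{API}/jobs/ingest/{job['id']}",
    headers={**HEADERS, "Content-Type": "application/json"},
    json={"state": "UploadComplete"},
    timeout=30,
).raise_for_status()

print(f"Submitted job {job['id']}")
```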

Total time to complete for a serial run
Total Processing Time for a serial run

For the control job it took 3 hours 31 minutes to load 1M records (or 12,696 seconds). So we can calculate the degree of parallelism as follows:

Degree of parallelism for 1M records run serially

We can see that by running our data load serially we achieved a degree of parallelism of less than 1, and our throughput was 4,730 records/min.
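As a sanity check, the throughput follows directly from the stated numbers:

```
\text{Throughput}_{\text{serial}} = \frac{1{,}000{,}000 \text{ records}}{12{,}696 \text{ s}} \times 60 \approx 4{,}730 \text{ records/min}
```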

Optimised

Now let’s set our batch size to the maximum allowed (10,000) and run a Bulk API v2 job in parallel to see how long 1M records take to load.

Total time to complete for a parallel run
Total Processing Time for a parallel run

For the new Bulk API job it took 10 minutes and 53 seconds to load 1M records (or 653 seconds). So we can calculate the degree of parallelism as follows:

Degree of parallelism for 1M records run in parallel

We can see that running our data load in parallel achieved a degree of parallelism of 24.5, and our throughput was 91,911 records/min.
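The same arithmetic, approximate because the elapsed time is rounded to the nearest second, confirms the figure and the overall reduction in elapsed time:

```
\text{Throughput}_{\text{parallel}} = \frac{1{,}000{,}000 \text{ records}}{653 \text{ s}} \times 60 \approx 91{,}900 \text{ records/min}
\qquad
\frac{12{,}696 \text{ s}}{653 \text{ s}} \approx 19.4\times \text{ faster}
```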

Summary

Loading 1M records into Salesforce can be as fast as about 10 minutes or as slow as nearly 4 hours. If you’re planning for a go-live then understanding the factors that impact the load time can make your life easier. Database events, record locks, sharing events and the degree of parallelism all contribute to or hinder your data load success.

Before you run your next data load, plan ahead: quantify the potential bottlenecks, then adjust your load to mitigate those risks.


Dave Norris

Developer Advocate @ Salesforce || Interested in solving unique challenges using different cloud service providers || All opinions are mine.