Update statement using a WHERE clause containing columns with null values


I am updating a column on one table using data from another table. The WHERE clause is based on multiple columns, and some of those columns are null. From my thinking, these nulls are what is throwing off the standard UPDATE table SET X=Y WHERE A=B statement.

See this SQL Fiddle of the two tables where I am trying to update table_one based on data from table_two. My query currently looks like this:

UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
    table_one.invoice_number = table_two.invoice_number AND
    table_one.submitted_by = table_two.submitted_by AND
    table_one.passport_number = table_two.passport_number AND
    table_one.driving_license_number = table_two.driving_license_number AND
    table_one.national_id_number = table_two.national_id_number AND
    table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
    table_one.vat_number = table_two.vat_number AND
    table_one.ggcg_number = table_two.ggcg_number AND
    table_one.national_association_number = table_two.national_association_number

The query fails for some rows, in that table_one.x isn't updated whenever any of the compared columns in either table is null, i.e. a row only gets updated when all the columns have some data.

This question is related to my earlier one here on SO, where I was getting distinct values from a large data set using DISTINCT ON. What I now want is to populate the large data set with a value from the table which has the unique fields.


I used the first update statement provided by @binotenary. For small tables, it runs in a flash. For example, one table with 20,000 records was updated in about 20 seconds. But another table with more than 9 million records has been running for 20 hours so far! See the output of EXPLAIN below:

Update on table_one  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
  ->  Nested Loop  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
        Join Filter: ((((my_update_statement_here))))
        ->  Seq Scan on table_one  (cost=0.00..610872.62 rows=9661262 width=1986)
        ->  Seq Scan on table_two  (cost=0.00..6051.98 rows=299998 width=148)

The EXPLAIN ANALYZE option also took forever, so I canceled it.

Any ideas on how to make this type of update faster? Even if it means using a different update statement, or a custom function that loops through and does the update.

Since null = null evaluates to null rather than true (and a WHERE condition must evaluate to true for a row to match), you need to check whether the two fields are both null in addition to the equality check:

UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
    (table_one.invoice_number = table_two.invoice_number
        OR (table_one.invoice_number is null AND table_two.invoice_number is null))
    AND (table_one.submitted_by = table_two.submitted_by
        OR (table_one.submitted_by is null AND table_two.submitted_by is null))
    -- etc
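
To see why the plain comparison misses those rows, here is a quick check (ordinary SELECTs, safe to run anywhere):

    SELECT null = null;   -- yields null, not true
    SELECT 1 = null;      -- also null
    -- a WHERE condition must evaluate to true for a row to match,
    -- so rows where either side is null never qualify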

You could also use the coalesce function, which is more readable:

UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
    coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
    AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
    AND -- etc

But you need to be careful about the default values (the last argument to coalesce).
Its data type should match the column type (so that you don't end up comparing dates with numbers, for example) and the default should be a value that never appears in the actual data.
E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
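
As an alternative, PostgreSQL also supports IS NOT DISTINCT FROM, which treats two nulls as equal and sidesteps the default-value pitfall entirely, though be aware it generally can't use a plain index the way = can:

    UPDATE table_one SET x = table_two.y
    FROM table_two
    WHERE table_one.invoice_number IS NOT DISTINCT FROM table_two.invoice_number
        AND table_one.submitted_by IS NOT DISTINCT FROM table_two.submitted_by
        -- etc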

Update (regarding performance):

Seq Scan on table_two - this suggests that you don't have any indexes on table_two.
So for each row being updated in table_one, the database has to scan through all of table_two's rows one by one until it finds a match.
The matching rows could be found much faster if the relevant columns were indexed.
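
For example, something like the sketch below could help. The index name and column choice are assumptions; note that if you use the coalesce variant of the query, the index must be built on the same expressions, because coalesce() around a column prevents a plain column index from being used:

    -- hypothetical name; index the expressions exactly as the query compares them
    CREATE INDEX table_two_match_idx
        ON table_two (coalesce(invoice_number, ''), coalesce(submitted_by, ''));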

On the flip side, if table_one has any indexes, that slows down the update.
According to this performance guide:

Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
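
In practice that could look like the sketch below (the index name is hypothetical; capture the real definitions first so you can recreate them exactly):

    -- capture existing definitions:
    --   SELECT indexdef FROM pg_indexes WHERE tablename = 'table_one';
    DROP INDEX IF EXISTS table_one_some_idx;   -- hypothetical name
    -- ... run the big UPDATE here ...
    CREATE INDEX table_one_some_idx ON table_one (invoice_number);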

Another suggestion from the same guide that might be helpful is:

If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.

So for example, if table_one has an id column, you could add something like

and table_one.id between x and y

to the where condition and run the query several times, changing the values of x and y so that all rows are covered, as in the sketch below.
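
A minimal sketch of that batching, assuming a reasonably dense sequential id (the batch boundaries are placeholders):

    -- batch 1
    UPDATE table_one SET x = table_two.y
    FROM table_two
    WHERE coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
        -- ... remaining column comparisons ...
        AND table_one.id BETWEEN 1 AND 100000;
    -- batch 2 would use BETWEEN 100001 AND 200000, and so on,
    -- ideally each batch in its own transaction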

The EXPLAIN ANALYZE option also took forever

You might want to be careful when using the ANALYZE option of EXPLAIN with statements that have side effects. According to the documentation:

Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
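
One way to still get actual timings without keeping the changes is to wrap the statement in a transaction and roll it back:

    BEGIN;
    EXPLAIN ANALYZE
    UPDATE table_one SET x = table_two.y
    FROM table_two
    WHERE coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '');  -- etc
    ROLLBACK;  -- EXPLAIN ANALYZE really executes the update; ROLLBACK undoes it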