When to avoid the in operator in postgres

Robin Kåveland
Robin Kåveland
Published 2024-09-23
postgrescode

In my post about batch operations, I used the where id = any(:ids) pattern, with ids bound to a JDBC array. I've gotten questions about that afterwards, asking why I do it like that, instead of using in (:id1, :id2, ...). Many libraries can take care of the dynamic SQL generation for you, so often you can just write in (:ids), just like the array example. I would still prefer to use the = any(:ids) pattern, and I decided to write down my reasoning here.

It has nothing to do with query performance. These two queries have the same query plan and should have the same time cost to execute for the database:

postgres=# explain select sum(cost) from orders where id in (1, 2, 3, 4);
QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=17.13..17.14 rows=1 width=8)
   ->  Bitmap Heap Scan on orders  (cost=8.65..17.12 rows=4 width=4)
         Recheck Cond: (id = ANY ('{1,2,3,4}'::integer[]))
         ->  Bitmap Index Scan on orders_pkey  (cost=0.00..8.65 rows=4 width=0)
               Index Cond: (id = ANY ('{1,2,3,4}'::integer[]))
(5 rows)

postgres=# explain select sum(cost) from orders where id = any('{1, 2, 3, 4}');
QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=17.13..17.14 rows=1 width=8)
   ->  Bitmap Heap Scan on orders  (cost=8.65..17.12 rows=4 width=4)
         Recheck Cond: (id = ANY ('{1,2,3,4}'::integer[]))
         ->  Bitmap Index Scan on orders_pkey  (cost=0.00..8.65 rows=4 width=0)
               Index Cond: (id = ANY ('{1,2,3,4}'::integer[]))
(5 rows)

Parsing SQL is really, really fast. Over at eugene I host a tiny service that parses SQL queries to look for dangerous migrations. From time to time, people post colossal SQL scripts to it, and parsing is at most a few milliseconds. But it is theoretically true that parsing a query with a lot of placeholders in where id in (?, ?, ...) is more expensive. There could be differences in how much time it takes to actually bind the parameters, or parse the bound array from the wire protocol, but these are probably negligible in most cases. They are certainly unlikely to be the reason why you need to look at the query and optimize it.

What's the big win then?

You get the same query every time. This may sound a little simple, and it's not obvious that this is important. But it matters once you're running in production and need to follow up performance over time. This makes it practical to track the query in your telemetry, database logs and in extensions such as pgstatstatements. It'll produce exactly one row of aggregated performance data in pg_stat_statements, instead of a number of entries equal to the observed number of placeholders in the where id in (?, ..., ?) clause. Here's an example of data from pg_stat_statements where I ran both queries four times with the same parameters:

postgres=# select query, calls, mean_exec_time from pg_stat_statements where query like '%from orders%';
-[ RECORD 1 ]--+----------------------------------------------------------
query          | select sum(cost) from orders where id in ($1, $2)
calls          | 1
mean_exec_time | 0.115168
-[ RECORD 2 ]--+----------------------------------------------------------
query          | select sum(cost) from orders where id in ($1)
calls          | 1
mean_exec_time | 0.129291
-[ RECORD 3 ]--+----------------------------------------------------------
query          | select sum(cost) from orders where id in ($1, $2, $3, $4)
calls          | 1
mean_exec_time | 0.122043
-[ RECORD 4 ]--+----------------------------------------------------------
query          | select sum(cost) from orders where id = any($1)
calls          | 4
mean_exec_time | 0.19074100000000002
-[ RECORD 5 ]--+----------------------------------------------------------
query          | select sum(cost) from orders where id in ($1, $2, $3)
calls          | 1
mean_exec_time | 0.197209

Note how the = any($1) query has four calls, but there's four variants of the in query. Since the difference in the cost of the two queries is negligible, we can choose to go for the variant that makes it easy to track query performance over time, and I've been doing that for some years now.

It also used to be that the in operator was limited to 1000 elements, but I'm not sure if that's still the case. I do shudder to think about having 1000 copies of the same query with different numbers of placeholders and calls in my telemetry, though. Consider whether you can use arrays instead?