How to modify these producer-consumer microservices to allow parallel processing?

I've got a couple of microservices (implemented in Ruby, although I doubt that matters for this question). One of them provides items, and the other one processes them and then marks them as processed (via a DELETE call).

The provider has a GET /items endpoint which lists a bunch of items, each identified by an id, in JSON format. It also has a DELETE /items/:id endpoint which removes one item from the list (presumably because it has been processed).

The code (very simplified) in the "processor" looks like this:

require "net/http"
require "json"

items = JSON.parse(Net::HTTP.get(URI("http://provider/items")))
items.each do |item|
  process item
  # mark the item as processed on the provider
  Net::HTTP.start("provider") { |http| http.delete("/items/#{item["id"]}") }
end

This has several problems, but the one I would like to solve is that it is not thread-safe, and thus I can't run it in parallel. If two workers start processing items simultaneously, they will step on each other's toes: both will get the same list of items and then (try to) process and delete each item twice.

What is the simplest way I can change this setup to allow for parallel processing?

You can assume that I have Ruby available. I would prefer to keep changes to a minimum and would rather not install other gems if possible. Sidekiq is available as a queuing system on the consumer.


Some alternatives (just brainstorming):

  1. Just drop HTTP and use pub/sub with a queue. Have the producer enqueue items and a number of consumers process them (and trigger state changes, with HTTP if you fancy it). There's a sketch of this at the end of the answer.
  2. If you really want to stick with HTTP, I think there are a couple of missing pieces. If your items' states are pending and processed, there's a hidden/implicit state in your state machine: in_progress (or whatever you want to call it). Once you think of it, the picture becomes clearer: handing out work changes an item's state from pending to in_progress, so your GET /items has side effects and should not be a GET in the first place.

    a. an alternative could be adding a new entity (e.g. a batch) that gets created via POST, groups some items under it, and returns them. Items already handed out in a batch won't be part of future batches, and you can then mark whole batches as done (e.g. PUT /batches/X/done). This gets crazy very fast, as you will start reimplementing features (acks, timeouts, error handling) that are already present both in queueing systems and in plain/explicit HTTP (see c).

    b. a slightly simpler alternative: just turn /items into a POST/PUT endpoint (weird in both cases) that marks the items it returns as in_progress, so that subsequent calls (which only return pending items) don't hand them out again. The same issues with errors and timeouts apply, though. There's a sketch of this claiming logic right after the list.

    c. have the producer be explicit and request the processing of each item from the other service via PUT. You can either include all the needed data in the body, or use the call as a ping and have the processor fetch the details via GET. You can add asynchronous processing on either side (but it probably belongs in the processor). A producer-side sketch also follows the list.
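
To make 2b concrete, here's a minimal provider-side sketch of the claiming logic (HTTP plumbing omitted; ItemStore and its method names are made up for illustration). The point is that listing and claiming happen atomically under one lock, so two workers can never receive the same item:

# Provider-side sketch of 2b (names are illustrative). "Give me work"
# becomes a state-changing call whose handler does this atomically:
class ItemStore
  def initialize
    @mutex = Mutex.new
    @pending = {}      # id => item
    @in_progress = {}  # id => item
  end

  def add(item)
    @mutex.synchronize { @pending[item[:id]] = item }
  end

  # Called by the POST/PUT /items handler: moves up to `limit` items
  # from pending to in_progress and returns them. Because the whole
  # move happens under the mutex, two concurrent workers can never
  # claim the same item.
  def claim(limit = 10)
    @mutex.synchronize do
      ids = @pending.first(limit).map(&:first)
      ids.map { |id| @in_progress[id] = @pending.delete(id) }
    end
  end

  # Called by DELETE /items/:id once processing succeeds.
  def finish(id)
    @mutex.synchronize { @in_progress.delete(id) }
  end
end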

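And a minimal producer-side sketch of 2c using only the standard library (the processor host, URL shape, and item methods are assumptions):

require "net/http"
require "json"

# Producer side of 2c: explicitly hand one item to the processor.
# Including the full payload in the body saves the processor a
# round-trip GET back to the producer.
uri = URI("http://processor/items/#{item.id}")
request = Net::HTTP::Put.new(uri, "Content-Type" => "application/json")
request.body = item.to_json
Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
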
I would honestly do 1 (unless there's a compelling reason not to).
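
Since Sidekiq is already on the consumer, here is a minimal sketch of 1, assuming the producer can reach the same Redis and enqueue through a Sidekiq client (ProcessItemJob is a made-up name; on Sidekiq older than 6.3, include Sidekiq::Worker instead of Sidekiq::Job):

require "sidekiq"

# Consumer side: one job per item. Sidekiq hands each job to exactly
# one worker, so you can run as many workers as you like in parallel
# without any coordination between them.
class ProcessItemJob
  include Sidekiq::Job

  def perform(item)
    process item  # your existing processing step
    # No DELETE call needed: a job that completes disappears from the
    # queue, and a job that raises is retried automatically.
  end
end

# Producer side: enqueue each item as it appears instead of exposing
# GET /items. Arguments must be JSON-serializable, so push a plain hash.
ProcessItemJob.perform_async(item.to_h)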