Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include all runtime errors in transaction result #1010

Closed
wants to merge 2 commits into from

Conversation

alavers
Copy link
Contributor

@alavers alavers commented Nov 24, 2019

READONLY errors are common with AWS Elasticache failovers, but there's a problem when it comes to ioredis transactions. Consider this contrived transaction:

const res = await multi()
  .set('foo', 42)
  .get('bar')
  .exec();

If you run it against an Elasticache endpoint which has just failed over under your feet, a READONLY error is reported, but your res contains an unexpected result:

[
  [null, 'barvalue']  
]

The transaction has partially failed, but I don't see the error in my exec result like I would if it was a WRONGTYPE error. On the wire, what I actually got back is this:

+OK
-READONLY You can't write against a read only replica.
+QUEUED
*1
$8
barvalue

Redis excluded the error from the exec result because it was rejected upfront rather than QUEUED. This PR will re-insert such errors back into the exec() result in the order they were sent so that we instead res looks this:

[
  [ReplyError: READONLY...]
  [null, 'barvalue']
]

I'm not sure if what I'm doing is kosher, e.g if exec() is meant to exactly represent the last element sent on the wire. But I imagine there may be other redis runtime errors that behave like READONLY. I'm open to alternative ideas on how to clearly communicate to the caller which individual commands succeeded and which failed in a partially failed transaction.

Note: The readme proposes to handle such ElastiCache READONLY errors with reconnectOnError. This works most of the time however mixing this with transactions introduces a different problem which I think was first reported here. I'm hoping do address that in a further PR.

Some transaction run time errors like WRONGTYPE will get QUEUED, and
therefore the error is included in the final response array. Other
errors however like READONLY dont get queued but are rejected straight
away, and Redis will therefore exclude them from the exec result. This
change reinserts such errors into the right position in the exec()
result.
@alavers
Copy link
Contributor Author

alavers commented Nov 25, 2019

The npm run format-check errors might be due to prettier 1.19 being recently released.

@alavers
Copy link
Contributor Author

alavers commented Nov 25, 2019

Confirmed the new prettier version is what was breaking CI. I had to bump prettier to ^1.19.1 in the package.json, I think travis CI was maybe caching 1.18.

@alavers
Copy link
Contributor Author

alavers commented Dec 2, 2019

Any thoughts @luin @AVVS ?

@AVVS
Copy link
Collaborator

AVVS commented Dec 2, 2019

I've quickly looked through the code, but certainly need more time to fully understand it. One thing I've noticed is that new test was added for multi, but not for pipeline.exec - I believe that change might have affected it - errors are reported for each of the replies in case of pipeline and not multi

as for multi itself, based on the docs:

In terms of the interface, multi differs from pipeline in that when specifying a callback to each chained command, the queueing state is passed to the callback instead of the result of the command:

redis
  .multi()
  .set("foo", "bar", function(err, result) {
    // result === 'QUEUED'
  })
  .exec(/* ... */);

maybe a better case would be to actually check queued state individually?

@alavers
Copy link
Contributor Author

alavers commented Dec 3, 2019

Thanks for taking a look!

The callback is definitely a workaround available now for catching such errors, but to leverage it would mean giving up much of the benefits of this lib's promise support. I think a workaround would look something like this:

await new Promise((resolve, reject) => {
  const errors = [];
  const cb = (err) => {
    if (err) {
      errors.push(err);
    }
  }

  redis.multi()
    .set('foo', 'bar', cb)
    .set('baz', 'qux', cb)
    .exec((err, result) => {
      if (err) {
        errors.push(err);
      }

      if (errors.length > 0) {
        reject(errors);
      } else {
        resolve(result);
      }
    });
});

And this would only allow us to catch the errors, I would have to do some more work to figure out which commands actually succeeded. At this point it starts to feel like ioredis could be doing some of this legwork for us, hence this PR.

Let me know what you think.

(I can certainly add a test for pipeline.exec if we agree on the overall approach.)

@stale
Copy link

stale bot commented Jan 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Jan 2, 2020
@alavers
Copy link
Contributor Author

alavers commented Jan 6, 2020

Happy new year dear maintainers!

@AVVS have you had a chance to take a closer look at this?

I can certainly add additional tests for pipeline.exec but I'd first like to get a sanity check on the overall approach. Once can certainly check for a successful QUEUED state in each individual command callback as you suggest but I think this is a bit cumbersome (see the code sample in my previous comment)

@stale
Copy link

stale bot commented Feb 5, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 7 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot added the wontfix label Feb 5, 2020
@alavers
Copy link
Contributor Author

alavers commented Feb 9, 2020

Bump @luin @AVVS

@stale stale bot removed the wontfix label Feb 9, 2020
@luin luin added the pinned label Feb 19, 2020
@alavers alavers requested a review from luin February 22, 2020 16:56
@luin
Copy link
Collaborator

luin commented Mar 14, 2020

Hi Andrew, Sorry for the late response!

The pull request works for this specified issue. However, I'm wondering whether it's a ticky way to handle this issue by merging results from different places. According to Redis official documentation (https://redis.io/topics/transactions#errors-inside-a-transaction), I think the proper behavior for a server to deal with this kind of error would be just discarding the whole transaction and returning an error directly, instead of returning a partial result.

@alavers
Copy link
Contributor Author

alavers commented Mar 20, 2020

Hey @luin thanks for looking into this!

I think the proper behavior for a server to deal with this kind of error would be just discarding the whole transaction and returning an error directly, instead of
returning a partial result.

This sounds good to me. All I really want is to know that the transaction has failed without jumping through these hoops.

If I have a syntax error in my transaction .exec() will reject the promise immediately. Are you proposing the same behavior for READONLY errors? Or stated more generally: any transaction command that fails to be queued should cause .exec() reject the promise (or return an err via callback). Does that sound reasonable?

Also, is there any back-compat concern? Do you think we should make this an opt-in setting?

@luin
Copy link
Collaborator

luin commented Mar 20, 2020

By "I think the proper behavior for a server" I meant I think what AWS Elasticache currently does doesn't seem right. So it leaves us a gray area to deal with this kind of error. The code seems deal with the issue reasonably so I'm going to merge this after I do more tests.

However, I failed to reproduce the behavior you mentioned. Here's my steps:

  1. Create a Elasticache instance without cluster enabled. It contains a primary node and 2 replicas distributing across three AZs.
  2. Launch an EC2 instance, make sure its security group can access the Elasticache instance.
  3. Install redis-cli. Here's what I've got for this step (HOST here is a replica):
ubuntu:~$ telnet HOST 6379
Trying IP...
Connected to HOST.
Escape character is '^]'.
multi
+OK
set foo 42
-READONLY You can't write against a read only replica.
Connection closed by foreign host.

It just closed the connection instead of allowing me enter further commands. Also for the following code:

const Redis = require('ioredis');
const redis = new Redis(HOST);

async function run () {
  const res = await redis.multi()
    .set('foo', 42)
    .get('bar')
    .exec();

  console.log(res);
}

setInterval(() => {
  run();
}, 100);

It just reject the whole transaction because of the READONLY error.

(node:14452) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 7)
(node:14452) UnhandledPromiseRejectionWarning: ReplyError: READONLY You can't write against a read only replica.

Did I missing anything?

@alavers
Copy link
Contributor Author

alavers commented Mar 20, 2020

That's odd. I've been testing against a localhost replica. E.g:

127.0.0.1:7000> replicaof 127.0.0.1 6379
OK
async function run() {
  const r = new RedisClient({
    host: "localhost",
    port: 7000
  });

  const result = await r
    .multi()
    .get("foo")
    .set("foo", "bar")
    .exec();
  console.log({ result });
}

run();
$ node test.js
{ result: [ [ null, 'somevalue' ] ] }

On the wire:

$ telnet localhost 7000
Trying ::1...
Connected to localhost.
Escape character is '^]'.
multi
+OK
get foo
+QUEUED
set foo bar
-READONLY You can't write against a read only replica.
exec
*1
$9
somevalue

The behavior you're showing with elasticache is what I would have expected to see when targeting a local read only replica. I'll run some additional tests against elasticache.

@alavers
Copy link
Contributor Author

alavers commented Mar 21, 2020

@luin You actually might have just stumbled on that bug I mentioned in the PR description that only happens with Elasticache.

If instead you run your transaction against the readonly replica twice (instead of in a setInterval), and add a .then() to log the error, you'll notice you only see that error once. The reason is when elasticache detects that READONLY error it forces us to reconnect (unlike what I see when testing against localhost replica). But when this happens, transaction commands in the offlineQueue get left in a corrupted state. Here's a repro that makes this clearer:

const Redis = require("ioredis");
const redis = new Redis(process.env.HOST);

function run(key) {
  const first = redis
    .multi()
    .set(key, 42)
    .get(key)
    .exec()
    .then(console.log, err => console.log(key, err));
}

run("first");
run("second");

Run with debugging turned we can see what's going wrong:

  ioredis:redis status[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: [empty] -> connecting +0ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> multi([]) +4ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> set([ 'first', '42' ]) +1ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> get([ 'first' ]) +1ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> exec([]) +0ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> multi([]) +1ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> set([ 'second', '42' ]) +0ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> get([ 'second' ]) +0ms
  ioredis:redis queue command[redis-20200320-alavers-debug-002.r2mr6z.0001.use1.cache.amazonaws.com:6379]: 0 -> exec([]) +1ms
  ioredis:redis status[192.168.255.83:6379]: connecting -> connect +15ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> info([]) +1ms
  ioredis:redis status[192.168.255.83:6379]: connect -> ready +3ms
  ioredis:connection send 8 commands in offline queue +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> multi([]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> set([ 'first', '42' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'first' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> multi([]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> set([ 'second', '42' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'second' ]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +0ms
  ioredis:redis status[192.168.255.83:6379]: ready -> close +2ms
  ioredis:connection reconnect in 50ms +4ms
  ioredis:redis status[192.168.255.83:6379]: close -> reconnecting +1ms
  ioredis:redis status[192.168.255.83:6379]: reconnecting -> connecting +51ms
  ioredis:redis status[192.168.255.83:6379]: connecting -> connect +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> info([]) +1ms
  ioredis:redis status[192.168.255.83:6379]: connect -> ready +1ms
  ioredis:connection resend 6 unfulfilled commands +55ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'first' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> multi([]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> set([ 'second', '42' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'second' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +0ms
first ReplyError: READONLY You can't write against a read only replica.
    at parseError (/home/admin/ioredistest/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/home/admin/ioredistest/node_modules/redis-parser/lib/parser.js:302:14) {
  command: { name: 'exec', args: [] },
  previousErrors: [
    ReplyError: READONLY You can't write against a read only replica.
        at parseError (/home/admin/ioredistest/node_modules/redis-parser/lib/parser.js:179:12)
        at parseType (/home/admin/ioredistest/node_modules/redis-parser/lib/parser.js:302:14) {
      command: [Object]
    }
  ]
}
  ioredis:redis status[192.168.255.83:6379]: ready -> close +4ms
  ioredis:connection reconnect in 50ms +6ms
  ioredis:redis status[192.168.255.83:6379]: close -> reconnecting +1ms
  ioredis:redis status[192.168.255.83:6379]: reconnecting -> connecting +50ms
  ioredis:redis status[192.168.255.83:6379]: connecting -> connect +2ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> info([]) +0ms
  ioredis:redis status[192.168.255.83:6379]: connect -> ready +1ms
  ioredis:connection resend 4 unfulfilled commands +54ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> multi([]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> set([ 'second', '42' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'second' ]) +0ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +0ms
  ioredis:redis status[192.168.255.83:6379]: ready -> close +1ms
  ioredis:connection reconnect in 50ms +2ms
  ioredis:redis status[192.168.255.83:6379]: close -> reconnecting +1ms
  ioredis:redis status[192.168.255.83:6379]: reconnecting -> connecting +50ms
  ioredis:redis status[192.168.255.83:6379]: connecting -> connect +2ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> info([]) +0ms
  ioredis:redis status[192.168.255.83:6379]: connect -> ready +1ms
  ioredis:connection resend 2 unfulfilled commands +53ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> get([ 'second' ]) +1ms
  ioredis:redis write command[192.168.255.83:6379]: 0 -> exec([]) +0ms

After our first reconnection we send offline queue commands from the middle of a failed transaction (resend 6 unfulfilled commands). From then onward our command queue seems to be out of sync.

I have a work in progress fix for this here that marks transaction commands with an isTransaction flag so they can be discarded from the offlineQueue upon reconnect.

@luin
Copy link
Collaborator

luin commented Mar 21, 2020

Ah! That makes a lot of senses. I was under impression that this issue only happens in AWS Elasticache. I suspect it's a bug on Redis side so I create an issue for that to double check.

@alavers
Copy link
Contributor Author

alavers commented Mar 23, 2020

I've tested the fix and I believe it resolves the core issue I was trying to address here. If I run the PR description code now, exec() rejects the promise with an error that looks like this:

ReplyError: EXECABORT Transaction discarded because of previous errors.
  command: { name: 'exec', args: [] },
  previousErrors: [
    ReplyError: READONLY You can't write against a read only replica.
      command: [Object]
    }
  ]
}

So I'll close this PR. Thanks @luin !

@alavers alavers closed this Mar 23, 2020
@alavers
Copy link
Contributor Author

alavers commented Mar 23, 2020

I'm not sure yet if this will resolves the desync issue caused by Elasticache immediately disconnecting the client upon readonly errors, so a further fix to clear out offlineQueue of pending transactions might still necessary. Elasticache is only up to redis version 5.0.6 now, so they should have this upcoming redis release up for us to test in maybe... 6 months from now? :]

@luin
Copy link
Collaborator

luin commented Mar 23, 2020

Thanks for pointing this out Andrew! That fix would benefit all Redis users, which is a great thing.

Besides waiting for AWS, I think we can also emulate what Elasticache does by adding c->flags |= CLIENT_CLOSE_AFTER_REPLY; to this line: https://github.com/antirez/redis/blob/1e16b9384d6aa3680ec5629fc0842ab53bf1655e/src/server.c#L3502-L3510.

@alavers
Copy link
Contributor Author

alavers commented Mar 23, 2020

That's a great idea, I'll give it a try!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants