Oracle Outage Retro 11-10-2021

fitblip · November 11, 2021, 12:53am

Starting at ~9:45 PM UTC, the Harbinger Oracle could not be updated due to an outage at Coinbase. This led to a system-wide freeze on functionality that depends on the price feed, and it highlights an area where we can become more resilient to failures. The outage lasted ~1.5 hours.

Background

Kolibri uses the Harbinger Oracle, an on-chain price oracle updated by independent, community-run posters that bridge a price feed signed by Coinbase. Kolibri relies on this oracle to determine the current real-world price of XTZ in USD, which makes it a critical component of the Kolibri protocol.

To protect the protocol, the Kolibri Oracle component has a parameter named maxDataDelaySecs, which governs how out of date data from the Harbinger Oracle can be before it’s considered “stale” and the protocol disables some functionality.

When this safeguard kicks in, price-feed-critical functionality is disabled and transactions that depend on it will fail. Without this safeguard, an oracle outage could lead to problems like erroneous liquidations.
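As a rough illustration of the maxDataDelaySecs safeguard, here is a minimal Python sketch of a staleness check. It is not Kolibri's actual contract code; the OracleReading type, the strict `>` comparison at the boundary, and the 900-second threshold in the example are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OracleReading:
    """Hypothetical snapshot of one oracle value (not Harbinger's real storage layout)."""
    price_usd: float  # XTZ/USD price
    updated_at: int   # Unix timestamp (seconds) of the last oracle update


def is_stale(reading: OracleReading, max_data_delay_secs: int,
             now: Optional[int] = None) -> bool:
    """Return True if the reading is older than max_data_delay_secs.

    Mirrors the idea behind Kolibri's maxDataDelaySecs; whether the real
    contract uses a strict or non-strict comparison is an assumption here.
    """
    if now is None:
        now = int(time.time())
    return now - reading.updated_at > max_data_delay_secs


# Illustrative values only: a hypothetical 15-minute threshold.
reading = OracleReading(price_usd=5.0, updated_at=1_000)
print(is_stale(reading, max_data_delay_secs=900, now=1_500))  # False
print(is_stale(reading, max_data_delay_secs=900, now=2_000))  # True
```

With any threshold shorter than the outage, a ~1.5 hour gap (5,400 seconds) between updates would trip this check.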

In terms of functionality, when the oracle is out of date the following rules apply:

Withdrawals are disabled - withdrawals depend on knowing an oven's collateralization %, which is based on the real-world price of XTZ
Borrowing/Minting is disabled - because the protocol can’t accurately gauge collateral value, it’d be unsafe to allow further minting/borrowing
Liquidations are disabled - these also depend on the real-world price of XTZ, so allowing liquidations without accurate prices would be disastrous

Everything else is enabled, meaning that people can repay their loans and add/deposit XTZ to their ovens. This lets people proactively guard against “tourniquet damage”, where the oracle coming back online in the middle of a black-swan event triggers liquidations before people have had a chance to adjust their oven's safety/collateralization.
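The rules above amount to a simple gate: operations that need the XTZ/USD price are rejected while the oracle is stale, and everything else goes through. Below is a minimal Python sketch of that gate; the Operation enum and StaleOracleError are hypothetical stand-ins (the exception plays the role of a failed on-chain transaction), and this is not Kolibri's actual contract code.

```python
from enum import Enum


class Operation(Enum):
    """Hypothetical labels for oven operations."""
    DEPOSIT = "deposit"
    REPAY = "repay"
    WITHDRAW = "withdraw"
    BORROW = "borrow/mint"
    LIQUIDATE = "liquidate"


# Operations that need an accurate XTZ/USD price, per the rules above.
PRICE_DEPENDENT = frozenset({Operation.WITHDRAW, Operation.BORROW, Operation.LIQUIDATE})


class StaleOracleError(Exception):
    """Stands in for a transaction failing because the oracle is stale."""


def check_allowed(op: Operation, oracle_is_stale: bool) -> None:
    """Raise StaleOracleError if op depends on the price feed and the feed is stale."""
    if oracle_is_stale and op in PRICE_DEPENDENT:
        raise StaleOracleError(f"{op.value} is disabled while the oracle is stale")


# During an outage, repaying and depositing still succeed...
check_allowed(Operation.REPAY, oracle_is_stale=True)
check_allowed(Operation.DEPOSIT, oracle_is_stale=True)
# ...while a withdrawal is rejected.
try:
    check_allowed(Operation.WITHDRAW, oracle_is_stale=True)
except StaleOracleError as err:
    print(err)  # withdraw is disabled while the oracle is stale
```

Keeping repay and deposit outside the gated set is what lets users shore up their ovens before the oracle comes back.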

The issue

Unfortunately, Coinbase isn’t exactly known for staying up during periods of heavy trading volume. It’s unclear exactly what triggered the outage on Nov 10th, and although it affected their candles API, Coinbase made no public post about it. This outage likely had widespread consequences for automated traders on the exchange, and it also affected the published oracle prices for ETH (Coinbase runs two oracle APIs, one for XTZ and one for ETH).

With a major outage like this, it was helpful to be able to reach out to folks we know internally at Coinbase to confirm the outage, but it exposes a real need for us to investigate alternative oracles/price feeds/etc., or at the very least to have a plan and defined thresholds for cutting over to something else should the need arise.

Future work

This retro should serve as a jumping-off point for us to investigate a few things and come up with a runbook/plan (as well as thresholds for executing on it) if/when this happens again.

Some ideas (please feel free to post others below):

Harbinger has oracles ready for Binance, Gemini, and OKEx, though they’re currently unused. Finding someone inside those organizations to sponsor running a poster would give us multiple price feeds that can either be aggregated together or failed over to if an outage occurs. Really any CEX with sufficient liquidity should work, but liquidity depth is extremely important.
We can/should explore other oracles in the ecosystem. Youves uses an oracle called Ubinetic, but after digging into its design I’m not convinced it’s a secure way to post updates, since it trusts an Android execution engine. It is run by trusted third parties like Bitcoin Suisse, but knowing nothing about these organizations, I’m hesitant to entrust something this critical to them. That said, Youves has likely done some work vetting them, so this is a potentially viable oracle solution that exists today.
Chainlink coming to Tezos would be a very good solution, since it’s what other DeFi protocols in the Ethereum ecosystem currently use. I’m not sure what the status of that is.
I also have a design for a Tezos-native alternative to Chainlink that exists in essentially “napkin form”, with slashing mechanics that would operate in a decentralized way (governed by a DAO). It’d require a lot of engineering work to build/test/deploy/etc., but it would be a fun project to bring to fruition.
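One way to combine several CEX feeds (the first idea above) is to take the median of whichever feeds are still fresh and refuse to produce a price if too few remain. The sketch below assumes each feed exposes a price and a last-update timestamp; the FeedReading type, the feed names, min_fresh_feeds, and the plain unweighted median are illustrative choices, not how Harbinger or Kolibri actually combine data.

```python
import statistics
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FeedReading:
    """Hypothetical snapshot of one exchange's signed XTZ/USD feed."""
    price_usd: float
    updated_at: int  # Unix timestamp (seconds) of the last update


def aggregate_price(readings: Mapping[str, FeedReading], now: int,
                    max_data_delay_secs: int, min_fresh_feeds: int = 2) -> Optional[float]:
    """Median price across fresh feeds, or None if too few feeds are fresh.

    Returning None plays the same role as a stale oracle: price-dependent
    operations should stay disabled until enough feeds recover.
    """
    fresh = [r.price_usd for r in readings.values()
             if now - r.updated_at <= max_data_delay_secs]
    if len(fresh) < min_fresh_feeds:
        return None
    return statistics.median(fresh)


# Illustrative numbers only: one feed is down, the other three are fresh.
feeds = {
    "coinbase": FeedReading(price_usd=5.00, updated_at=0),
    "binance": FeedReading(price_usd=5.10, updated_at=1_000),
    "gemini": FeedReading(price_usd=5.20, updated_at=1_000),
    "okex": FeedReading(price_usd=5.30, updated_at=1_000),
}
print(aggregate_price(feeds, now=1_100, max_data_delay_secs=900))  # 5.2
```

With three or more fresh feeds, the median means a single frozen or misreporting exchange can't move the price on its own. Given how much liquidity depth matters, a liquidity- or volume-weighted scheme may be a better fit in practice.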

Regardless of which path we decide on, we should have clear runbooks and procedures for outages like this. Smaller outages may just need “handle comms and make sure people know”, while a longer-term outage (e.g. Coinbase deciding to stop supporting the Tezos oracle API) may call for something like the break-glass contracts or a multi-sig that can promote a new oracle faster than the ~1 week it takes for the DAO to ratify changes.

glcohen · December 10, 2021, 1:51am

Can you expand on the oracles ready for Binance, Gemini, and OKEx?
What does sponsorship / running a poster entail, and what is the financial obligation? Could this be something paid for by the DAO?

fitblip · December 13, 2021, 3:45am

Basically, we have contracts written for those exchanges’ candle APIs (AFAIK), but getting a signed feed is the real crux of the problem. We’re essentially ready on the external side of the integration; the tricky bit is the internal work needed within these orgs to sign data with a key they control.

Realistically, these orgs have no real incentive to do this sort of signing unless it’s part of a larger oracle feed project (as was the case with Coinbase). Something like Chainlink (which AFAIK is now used by basically every DeFi protocol) would fix all this IMO, but it’s hard to say how big a lift that’d be, and that works against the odds of an org taking up this torch.