Starting at ~9:45 PM UTC, the Harbinger Oracle could not be updated because of an outage at Coinbase. This led to a system-wide freeze on everything that depends on the price feed, and it points to a place where we can become more resilient to failures. The outage lasted ~1.5 hours.
Kolibri utilizes the Harbinger Oracle, an on-chain price oracle run by independent, community-run posters that bridge a price feed signed by Coinbase. Kolibri uses this oracle to learn the current real-world price of XTZ in USD, which makes it a critical component of the Kolibri protocol.
In order to protect the protocol, there’s a parameter in the Kolibri Oracle component named maxDataDelaySecs, which governs how old data from the Harbinger oracle can be before it’s considered “stale” and the protocol disables some functionality.
When this safeguard kicks in, some price-feed-critical functionality is disabled, and some transactions will fail. Without this, an oracle outage could lead to things like erroneous liquidations.
In terms of functionality, when the oracle is out of date the following rules apply:
- Withdrawals are disabled - withdrawals depend on knowing the collateralization %, which is based on the real-world price of XTZ
- Borrowing/Minting is disabled - because the protocol can’t accurately gauge collateral value, it’d be unsafe to allow for further minting/borrowing
- Liquidations are disabled - also dependent on the underlying real-world price of XTZ, so allowing liquidations without accurate prices would be disastrous
Everything else is enabled, meaning that people can repay their loans and add/deposit XTZ to their oven. This lets people proactively hedge against “tourniquet damage”, where the oracle coming back online in the middle of a black-swan event triggers liquidations before people have had a chance to adjust their oven safety/collateralization.
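The rules above can be sketched in a few lines of Python. This is only an illustration of the gating logic as described in this post, not the actual contract code: the 30-minute value, the operation names, and the strict `>` comparison are all assumptions made for the example.

```python
# Hypothetical value for illustration only; the real maxDataDelaySecs
# is a protocol parameter and may differ.
MAX_DATA_DELAY_SECS = 30 * 60

# Operations that need a trustworthy XTZ/USD price (assumed names).
PRICE_DEPENDENT_OPS = {"withdraw", "borrow", "liquidate"}
# Operations that stay available while the oracle is stale (assumed names).
PRICE_INDEPENDENT_OPS = {"repay", "deposit"}


def is_stale(last_update_ts: int, now_ts: int,
             max_delay_secs: int = MAX_DATA_DELAY_SECS) -> bool:
    """Return True when the last oracle update is older than the allowed delay."""
    return now_ts - last_update_ts > max_delay_secs


def is_allowed(op: str, last_update_ts: int, now_ts: int,
               max_delay_secs: int = MAX_DATA_DELAY_SECS) -> bool:
    """Decide whether an oven operation may proceed given the oracle's age."""
    if op in PRICE_INDEPENDENT_OPS:
        # Repaying and depositing never depend on the price feed.
        return True
    if op in PRICE_DEPENDENT_OPS:
        # Everything price-dependent is blocked while the data is stale.
        return not is_stale(last_update_ts, now_ts, max_delay_secs)
    raise ValueError(f"unknown operation: {op}")
```

With these assumed values, an outage of ~1.5 hours is well past the threshold, so `is_allowed("liquidate", ...)` would return False for most of the outage while `is_allowed("deposit", ...)` stays True throughout.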
Unfortunately, Coinbase isn’t exactly known for staying up in times of large trading volumes. It’s unclear what exactly triggered the outage on Nov 10th, but even though it affected their candles API, there was no public post about the outage. This outage likely had widespread consequences for automated traders on the exchange, and it also affected the published oracle prices for ETH (Coinbase runs 2 oracle APIs, one for XTZ and one for ETH).
With a major outage like this, it was nice to be able to reach out to folks we know internally to confirm there was an outage, but it exposes a real need for us to investigate alternative oracles/price feeds/etc., or at the very least to have a plan, with a defined threshold, for cutting over to something else should the need arise.
This retro should serve as a jumping-off point for us to investigate a few things and come up with a runbook/plan (as well as thresholds for executing on it) if/when this happens again.
Some ideas (please feel free to post others below):
Harbinger has oracles ready for Binance, Gemini, and OKEx, though they’re currently unused. Finding someone inside each of those organizations to help sponsor running a poster would allow for multiple price feeds that can either be aggregated together or failed over to if an outage occurs. Really any CEX with sufficient liquidity should work, but liquidity depth is extremely important.
We can/should explore other oracles in the ecosystem. Youves uses an oracle called Ubinetic, but after digging into its design I’m not convinced it’s a secure way to post updates, since it trusts an Android execution engine. It is being run by trusted third parties like Bitcoin Suisse, but knowing nothing of these organizations I’m hesitant to entrust such a critical thing to them. Youves has likely done some work vetting them, so this is a potentially viable oracle solution that exists today.
Things like Chainlink coming to Tezos would be a very good solution, since it’s currently what’s used by other DeFi protocols in the Ethereum ecosystem. I’m not sure what the status is.
I also have a design for a tez-native alternative to Chainlink that exists in essentially “napkin form”, with slashing mechanics that would operate in a decentralized way (governed by a DAO). It’d require a lot of eng work to build/test/deploy/etc., but it would be a fun project to bring to fruition.
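To make the first idea (multiple feeds that are aggregated or failed over) a bit more concrete, here is a rough Python sketch. It is not how Harbinger works today: the `FeedReading` type, the `aggregate_price` function, the median rule, and the `min_fresh` threshold are all hypothetical choices made for illustration.

```python
from statistics import median
from typing import List, NamedTuple, Optional


class FeedReading(NamedTuple):
    """One price observation from one exchange feed (hypothetical shape)."""
    source: str
    price_usd: float
    updated_at: int  # unix timestamp of the last update


def aggregate_price(readings: List[FeedReading], now_ts: int,
                    max_delay_secs: int, min_fresh: int = 1) -> Optional[float]:
    """Return the median of all non-stale feeds, or None if too few are fresh.

    A feed that stops updating (like Coinbase during this outage) simply
    drops out of the median, so the remaining feeds act as the failover.
    """
    fresh = [r.price_usd for r in readings
             if now_ts - r.updated_at <= max_delay_secs]
    if len(fresh) < min_fresh:
        # Not enough trustworthy data: callers should treat the price as stale.
        return None
    return median(fresh)
```

Raising `min_fresh` trades availability for safety: with `min_fresh=2`, a single surviving feed would not be enough to keep price-dependent functionality enabled.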
Regardless of which path we decide to work on, we should have clear runbooks and procedures for outages like this. Smaller outages may just be “handle comms and make sure people know”, while a longer-term outage (e.g. Coinbase deciding to stop supporting the Tezos oracle API) may involve something like the break-glass contracts or a multi-sig that can promote a new oracle more quickly than the ~1wk turnaround time it takes for the DAO to ratify things.