Improving RPC Node Provider Service

Reputation

The Lava Network places a strong emphasis on delivering high Quality of Service (QoS) to its consumers. To ensure this, consumers actively monitor Provider performance and can customize their QoS excellence metrics. They measure three things: latency, by comparing Provider response times to a benchmark; data freshness, by comparing against the fastest Provider; and availability, by tracking the percentage of responses that are errors or timeouts. These scores are recorded and sent on-chain alongside the relay proofs of service, creating a transparent and accountable system. The resulting Provider performance metric is called "Reputation". A higher Reputation indicates higher QoS scores.

To further enhance the integrity of the QoS scores, updates are aggregated across all consumers in a manner that safeguards against false reports. Negative reports are weighted by usage, meaning that a consumer must actively use and pay a Provider to diminish their QoS score. This mechanism discourages users from artificially lowering a Provider's score.
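The effect of weighting reports by usage can be illustrated with a minimal sketch. It assumes a plain usage-weighted mean, which is not confirmed as the on-chain formula; the function name and the sample numbers are hypothetical.

```python
def aggregate_reports(reports):
    """Usage-weighted mean of per-consumer scores.

    reports: list of (score, usage) pairs, with score in [0, 1] and
    usage >= 0 (for example, CU the consumer actually paid for).
    ASSUMPTION: a weighted mean; the real aggregation may differ.
    """
    total_usage = sum(usage for _, usage in reports)
    if total_usage == 0:
        return None  # nobody used the Provider, so there is nothing to aggregate
    return sum(score * usage for score, usage in reports) / total_usage


honest = [(1.0, 900), (1.0, 100)]
# Hypothetical consumer that reports the worst score but barely uses the Provider.
with_attacker = honest + [(0.0, 10)]

print(aggregate_reports(honest))
print(round(aggregate_reports(with_attacker), 3))
```

In this sketch the low-usage negative report moves the aggregate only slightly, which is the property the paragraph above describes.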

The Reputation metric only affects pairings. It is aggregated over time with a decay function that favors the latest data, so Providers can improve their score, and Providers whose service fails are paired with fewer users. This keeps the reputation system dynamic and responsive: it benefits Providers that work to improve their services and limits how many users a service failure can affect.
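As an illustration of how a decay function can favor recent data, here is a minimal sketch. The text does not specify the decay function, so exponential decay with a half-life is an assumption, and `decayed_score`, `half_life`, and the sample values are hypothetical.

```python
def decayed_score(samples, now, half_life):
    """Time-weighted average of scores.

    samples: list of (timestamp, score) pairs.
    ASSUMPTION: each sample's weight halves every `half_life` time units.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for timestamp, score in samples:
        age = now - timestamp
        weight = 0.5 ** (age / half_life)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        return None  # no samples to aggregate
    return weighted_sum / total_weight


# An old bad score followed by a recent good one.
samples = [(0, 0.0), (10, 1.0)]
print(decayed_score(samples, now=10, half_life=10))
```

A plain average of these two samples would be 0.5; with decay the recent good score counts twice as much as the older bad one, so the result is higher.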

📊 Passable QoS

Passable Quality of Service is scored separately in each relay session. Lower scores mean lower rewards. Up to half the accumulated CU can be reduced for bad service. Passable QoS metrics can be viewed in both the Lava Info explorer and Prometheus metrics.

Passable QoS is binary: each relay is either good or bad, with nothing in between. Scores in the range 0-1 are the result of averaging that binary score across relays. You can learn more about Passable QoS from our 📄 RSCH-1000 research paper.
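The averaging can be sketched as follows. The `session_score` function follows the description above; `rewarded_cu` is only an illustration of "up to half the accumulated CU can be reduced" and assumes a linear mapping from score to reward, which may not match the actual formula. Both names are hypothetical.

```python
def session_score(relay_results):
    """Average per-relay binary results (True = good, False = bad) into a 0-1 score."""
    if not relay_results:
        raise ValueError("a session needs at least one relay")
    return sum(1 for ok in relay_results if ok) / len(relay_results)


def rewarded_cu(accumulated_cu, score):
    # ASSUMPTION: linear interpolation between half of the CU (score 0)
    # and all of the CU (score 1). The real reduction curve may differ.
    return accumulated_cu * (0.5 + 0.5 * score)


relays = [True, True, False, True]
score = session_score(relays)
print(score)                     # 0.75
print(rewarded_cu(1000, score))  # 875.0
```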

Metrics 📈

Passable Quality of Service is divided into three metrics: Availability, Sync/Freshness of data, and Latency.

🛎️ Availability

Score 0 or 1 per relay, averaged across relays in a session to give a range 0-1 for each session. 0 is given for each failed relay. A lower availability score can be the result of failed relays by one of the following:

Low-Score Causes

Improvements

Improving availability involves finding the cause of errors, and taking the necessary actions to resolve them.

⏲️ Latency

Score 0 or 1 per relay, averaged across relays in a session to give a range 0-1 per session. 0 is given for each relay that took longer than half the timeout. A lower latency score is the result of slow responses. These can be identified by turning on debug logs in the Provider to see the latency, or by checking Prometheus.
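A minimal sketch of the latency rule described above, with hypothetical names and sample latencies (the timeout value is illustrative, not taken from any spec):

```python
def latency_session_score(latencies_ms, timeout_ms):
    """Score each relay 1 if it finished within half the timeout, else 0, then average."""
    if not latencies_ms:
        raise ValueError("a session needs at least one relay")
    threshold_ms = timeout_ms / 2
    scores = [0 if latency > threshold_ms else 1 for latency in latencies_ms]
    return sum(scores) / len(scores)


# Four relays against a hypothetical 5000 ms timeout: only the 2600 ms relay
# exceeds the 2500 ms threshold.
print(latency_session_score([120, 480, 2600, 900], timeout_ms=5000))  # 0.75
```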

Improvements

🌿 Sync/freshness

Score 0 or 1 per relay, averaged across relays in a session to give a range 0-1 per session. 0 is given for each relay whose latest block proof is older than the block lag that the spec allows for QoS sync.
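The sync rule can be sketched as below. This assumes the Provider's latest block proof is compared against a reference latest block known to the consumer, and that the lag is measured in blocks; the function names, the reference values, and the allowed lag of 2 are all hypothetical.

```python
def sync_relay_score(provider_latest_block, reference_latest_block, allowed_block_lag):
    """Return 1 if the Provider's proof is within the allowed lag, else 0.

    ASSUMPTION: lag is the difference in block height against a reference
    latest block; the real comparison may be defined differently.
    """
    lag = reference_latest_block - provider_latest_block
    return 1 if lag <= allowed_block_lag else 0


def sync_session_score(relays, allowed_block_lag):
    """relays: list of (provider_latest_block, reference_latest_block) pairs."""
    if not relays:
        raise ValueError("a session needs at least one relay")
    scores = [sync_relay_score(p, r, allowed_block_lag) for p, r in relays]
    return sum(scores) / len(scores)


# The Provider stalls near block 100 while the reference keeps advancing.
relays = [(100, 100), (100, 101), (97, 102), (100, 103)]
print(sync_session_score(relays, allowed_block_lag=2))  # 0.5
```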

How to Identify

The freshness proofs are updated by the Provider service in a GET_BLOCKNUM request, and then returned together with consumer relay responses. You can turn on debug logs to see the blocks advancing. The latest block is exported in Prometheus and also uploaded on-chain. You can compare the latest block on your Provider to that of other Providers. A useful way to do this is to filter the lava_provider_latest_block_report event and compare your results to others:

Caution: Here and below, ensure that you replace {PUBLIC_RPC} with the correct endpoint.

lavap test events 2000 --event lava_provider_latest_block_report --node {PUBLIC_RPC}

Low-Score Causes

📊 Reputation Score

The Reputation Score is calculated very similarly to Passable QoS. QoS Excellence provides a range of scores that are time-weighted to favor the latest information. All the actions mentioned here to improve Passable QoS also affect Reputation.

Metrics 📈

The Reputation score is divided into three metrics:

Availability - a score in the range 0-1
Sync/Freshness of data - how far behind other Providers you are, measured in time. Lower is better; 0 means your sync is the best in the pairing
Latency - how many benchmark ticks passed during a relay on average (time taken / benchmark time). Lower is better
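The latency and sync metrics above can be illustrated with a short sketch. The latency ratio follows the formula given in the list (time taken / benchmark time, averaged). The sync calculation assumes each Provider's lag behind the chain head is known in seconds and takes the difference from the best Provider in the pairing; that representation and all names and numbers are hypothetical.

```python
from statistics import mean


def latency_metric(relay_times_ms, benchmark_ms):
    """Average number of benchmark ticks per relay (lower is better)."""
    return mean(t / benchmark_ms for t in relay_times_ms)


def sync_metric(own_lag_s, other_lags_s):
    """Time behind the best-synced Provider in the pairing (0 means you are the best).

    ASSUMPTION: lags are seconds behind the chain head for each Provider.
    """
    best = min([own_lag_s, *other_lags_s])
    return own_lag_s - best


print(latency_metric([200, 400, 600], benchmark_ms=400))  # 1.0
print(sync_metric(4.0, [1.5, 6.0]))                       # 2.5
print(sync_metric(1.0, [1.5, 6.0]))                       # 0.0
```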

🏛️ Jailing

How to Fix Getting Jailed 🔓

Lava Protocol removes Providers that are providing inferior service. The mechanism is detached from QoS measurements. In order to avoid being jailed, a Provider needs to avoid the following:

a staked endpoint that does not respond to connections
disabled chains in the staked endpoint
too many consecutive errors with a large group of consumers
a non-TLS connection or an expired certificate
blocked headers or origins
missing out on rewards

How Jailing Happens ❓

Once one or more of the aforementioned conditions are met, Lava's Blockchain jails a Provider if:

there are enough other Providers in the spec
the Provider is not frozen (if you freeze for maintenance you will not get jailed)
the Provider is active for at least 8 epochs
the Provider's total rewards over the last 8 epochs are lower than the total reports against it over the last 2 epochs
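The four conditions above can be read as a single predicate. The sketch below is only an illustration: the text does not say how many other Providers count as "enough" or in what units rewards and reports are summed, so `min_other_providers` and the integer fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProviderState:
    frozen: bool
    active_epochs: int
    rewards_last_8_epochs: int   # ASSUMPTION: summed as a plain count
    reports_last_2_epochs: int   # ASSUMPTION: summed as a plain count


def meets_jail_criteria(provider, other_providers_in_spec, min_other_providers):
    """Return True only when all four listed conditions hold."""
    return (
        other_providers_in_spec >= min_other_providers  # enough other Providers
        and not provider.frozen                         # frozen Providers are exempt
        and provider.active_epochs >= 8                 # active for at least 8 epochs
        and provider.rewards_last_8_epochs < provider.reports_last_2_epochs
    )


struggling = ProviderState(frozen=False, active_epochs=12,
                           rewards_last_8_epochs=1, reports_last_2_epochs=5)
in_maintenance = ProviderState(frozen=True, active_epochs=12,
                               rewards_last_8_epochs=1, reports_last_2_epochs=5)

print(meets_jail_criteria(struggling, other_providers_in_spec=10, min_other_providers=3))
print(meets_jail_criteria(in_maintenance, other_providers_in_spec=10, min_other_providers=3))
```

The second call shows why freezing for maintenance avoids a jail: the frozen check fails even though the rewards and reports are unchanged.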

Identify Getting Jailed 🩻

If your Provider is jailed, it stops receiving requests as of the next epoch.

You can monitor this event via the info webpage or with the following commands:

⌨️ command when being reported before getting jailed:

lavap test events 2000 --event lava_provider_reported --node {PUBLIC_RPC}

A group of reports can lead to being jailed.

⌨️ command when the blockchain's criteria for jail are met:

lavap test events 2000 --event lava_provider_jailed --node {PUBLIC_RPC}

Resuming Service ▶️

Since version 0.27.0, Providers that are jailed can resume service by unfreezing. A cooldown period on unfreezing is planned for Providers that are jailed repeatedly within a short period. The help output of the unfreeze command describes its usage:

lavad tx pairing unfreeze --help

Reasons ⚖️

Jail reports contain additional information on the reason for the report, which is one of the following:

disconnections - the Provider did not respond to connection attempts
errors - the Provider's responses were a sequence of consecutive errors

In addition, the reports contain an exact time tag, so the Provider can check the events:

lavap test events 2000 --event lava_provider_reported --node {PUBLIC_RPC}

Disconnections 💢

These mean the Provider's endpoint did not respond and can be due to the following problems:

TLS certificate outdated or not set
misconfiguration proxying the requests to the Provider service
Provider service not running
wrong endpoint in the stake entry on chain: can be fixed by running

lavad tx pairing modify-provider ${CHAIN} --endpoints "${ENDPOINTS}" --geolocation ${GEOLOCATION} --from ${WALLET}

Caution: Make sure the URL in the endpoint is the Provider gRPC listening address and not your node URL.

⏺️ Identifying a Disconnect

Disconnection problems can be identified by running the test command:

lavap test rpcprovider ${PUBLIC_ADDRESS}

Errors ❌

These mean the Provider service connection was solid but all relays resulted in errors. This might be caused by the following:

disabled chain - the Provider doesn't have access to the node, or a verification does not pass, and the chain is disabled
unexpected errors
timeouts

⏺️ Identifying an Error

Errors can be identified by looking at the Provider service logs. If you are repeatedly getting jailed, it is recommended to run with debug logs enabled.

Metrics

Lava Network allows Providers to monitor their services through a set of metrics. The easiest way to access the Provider metrics is through the Lava info page, which gives Providers a comprehensive view of their overall performance as well as specific details over time.

High level metrics available to Providers:

Total CU - the total compute units (CU) served. CU are a numerical representation of the computational difficulty of executing specific API calls and are used to calculate Provider rewards.
Total Relays - number of data exchange events between Providers and consumers.
Total self-stake - the amount of Lava Network tokens bonded by the Provider
Delegation stake - total stake from delegators
Total stake - self stake + delegations
Commission - percentage of delegation rewards retained by the Provider
