Introduction
date: Jun 20, 2021
slug: comp90020-cna-intro
status: Published
tags: Programming, COMP90020
summary:
type: Page
Year: 2021
Topics
- Failure detection
- Mutual exclusion
- Elections
- Group communication (multicast)
- Consensus
Failure Assumptions
- Processes are connected via reliable channels
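- No process depends on another to forward its messages, so a process failure does not stop the remaining processes from communicating
- Depending on the topic/algorithm, one of the following is assumed:
  - Processes do not fail
  - Processes can crash, in which case a failure detector can be used to detect crash failures
  - Processes can act unexpectedly (arbitrary failures)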
Failure Detector
- Service that can decide whether a process has crashed
- Can collaborate with other processes to detect failures
- Often implemented as a local failure detector at each process, interacting with its counterparts at other processes
- 2 types:
  - Unreliable failure detector
  - Reliable failure detector
Unreliable failure detector
When queried about a process, produces one of two values:
- Unsuspected
  - The detector has recently received evidence suggesting that the process has not failed
  - E.g., a message was recently received from it
  - May be inaccurate: the process may have failed since then
- Suspected
  - The detector has some indication that the process may have failed
  - E.g., no message has been received from it for some time
  - May be inaccurate: the process could be functioning correctly while the communication link is down, or it could simply be running more slowly than expected
- Implementation
  - Every T seconds, each process p sends a “p is here” message to every other process
  - If the local failure detector at q does not receive “p is here” from p within T + D seconds of the last one (D = estimated maximum transmission delay), p is reported as Suspected
  - If a message is subsequently received, p is reported as Unsuspected again
  - Problem:
    - For small D, intermittent drops in network performance cause working processes to be suspected
    - For large D, crashes go unobserved for a long time (a crashed node may even be repaired before the timeout expires)
  - Solution:
    - Use a variable D that reflects the observed network latencies
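The heartbeat scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `HeartbeatDetector`, its methods, and the `window` parameter are hypothetical, and time is passed in explicitly so the example runs without a network. The adaptive mode is one possible reading of "variable D": it uses how late recent heartbeats arrived (inter-arrival gap minus T) as a proxy for latency, with the initial D as a floor. A process that has never been heard from is treated as Suspected, which is also an assumption.

```python
from collections import deque


class HeartbeatDetector:
    """Local unreliable failure detector running at a process q (sketch)."""

    def __init__(self, period, delay, adaptive=False, window=5):
        self.period = period        # T: seconds between "p is here" messages
        self.min_delay = delay      # initial D, also the floor when adapting
        self.delay = delay          # D: current estimate of the maximum delay
        self.adaptive = adaptive    # if True, D follows recently observed lateness
        self.last_seen = {}         # process id -> arrival time of last heartbeat
        self.lateness = deque(maxlen=window)  # recent values of (gap - T)

    def on_heartbeat(self, pid, now):
        """Record a "p is here" message from pid that arrived at time now."""
        previous = self.last_seen.get(pid)
        self.last_seen[pid] = now
        if self.adaptive and previous is not None:
            # How much later than one period did this heartbeat arrive?
            self.lateness.append(max(0.0, now - previous - self.period))
            self.delay = max(self.min_delay, max(self.lateness))

    def query(self, pid, now):
        """Return "Unsuspected" or "Suspected" for pid at time now."""
        last = self.last_seen.get(pid)
        if last is not None and now - last <= self.period + self.delay:
            return "Unsuspected"
        return "Suspected"  # silent for longer than T + D, or never heard from


fixed = HeartbeatDetector(period=1.0, delay=0.5)
fixed.on_heartbeat("p", now=0.0)
print(fixed.query("p", now=1.2))   # within T + D of the last heartbeat
print(fixed.query("p", now=2.0))   # more than T + D has passed
fixed.on_heartbeat("p", now=2.0)   # a late heartbeat clears the suspicion
print(fixed.query("p", now=2.1))

adaptive = HeartbeatDetector(period=1.0, delay=0.5, adaptive=True)
adaptive.on_heartbeat("p", now=0.0)
adaptive.on_heartbeat("p", now=2.0)  # arrived 1.0 s late, so D grows to 1.0
print(adaptive.delay, adaptive.query("p", now=3.8))
```

With a fixed D of 0.5, the query at time 3.8 would report Suspected. The adaptive detector has already seen a heartbeat arrive a full second late, so it tolerates the same silence.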
Reliable failure detector
When queried about a process, produces one of two values:
- Unsuspected
  - Potentially inaccurate, as with an unreliable detector
- Failed
  - An accurate determination that the peer process has crashed
- Implementation
  - Only possible in a synchronous network, where message transmission delays have a known upper bound
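To see why synchrony matters, the sketch below treats D as a guaranteed bound rather than an estimate. It assumes that every live process sends a heartbeat exactly every T seconds, that no message takes longer than D, and that each process sends its first heartbeat within T seconds of `start`. Under those assumptions, silence for longer than T + D can only mean a crash. The names `ReliableDetector`, `max_delay`, and `start` are hypothetical.

```python
class ReliableDetector:
    """Failure detector for a synchronous system (illustrative sketch)."""

    def __init__(self, period, max_delay, start=0.0):
        self.timeout = period + max_delay  # T + D, with D a guaranteed bound
        self.start = start                 # time at which monitoring began
        self.last_seen = {}                # process id -> last heartbeat arrival
        self.failed = set()                # crashes are permanent in this model

    def on_heartbeat(self, pid, now):
        """Record a heartbeat from pid that arrived at time now."""
        self.last_seen[pid] = now

    def query(self, pid, now):
        """Return "Unsuspected" or "Failed" for pid at time now."""
        last = self.last_seen.get(pid, self.start)
        if pid in self.failed or now - last > self.timeout:
            self.failed.add(pid)   # once Failed, always Failed
            return "Failed"
        return "Unsuspected"


rd = ReliableDetector(period=1.0, max_delay=0.5)
rd.on_heartbeat("p", now=0.0)
print(rd.query("p", now=1.5))  # still within T + D: Unsuspected
print(rd.query("p", now=1.6))  # bound exceeded: Failed
```

In an asynchronous system there is no such bound, so the same timeout can only ever justify a Suspected verdict.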