Introduction

Date: Jun 20, 2021
Tags: Programming, COMP90020

Topics

  • Failure detection
  • Mutual exclusion
  • Elections
  • Group communication (multicast)
  • Consensus

Failure Assumptions

  • Processes are connected via reliable channels
  • Processes communicate directly: no process depends on another to forward its messages
  • Depending on the topic/algorithm:
    • Processes do not fail
    • Processes can crash; a failure detector can be used to detect crash failures
    • Processes can behave arbitrarily (Byzantine failures)

Failure Detector

  • Service that answers queries about whether a particular process has failed
  • Can collaborate with other processes to detect failures
  • Often implemented as a local failure detector in each process, which cooperates with its counterparts in other processes
  • 2 types:
    • Unreliable failure detector
    • Reliable failure detector

Unreliable failure detector

When queried about a process, produces one of two values:
  • Unsuspected
    • The detector has recently received evidence suggesting that the process has not failed
    • E.g., a message was recently received from it
    • May be inaccurate: the process may have failed since then
  • Suspected
    • The detector has some indication that the process may have failed
    • E.g., no message received in quite some time
    • May be inaccurate: the process could be functioning correctly while the communication link is down, or it could simply be running more slowly than expected
  • Implementation
    • Every T seconds, each process p sends a “p is here” message to every other process
    • If the local failure detector at q does not receive a “p is here” message from p within T + D seconds of the last one (D = estimated maximum transmission delay), it reports p as Suspected
    • If a message from p is subsequently received, it reports p as Unsuspected again
  • Problem:
    • With a small D, transient network slowdowns cause working processes to be suspected
    • With a large D, crashes go unobserved for a long time (a crashed process may even be repaired before the timeout expires)
  • Solution:
    • Use a variable D that reflects the observed network latencies
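
The heartbeat scheme with a variable D can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name HeartbeatDetector, the margin and window parameters, and the rule "D = margin × largest recent delay" are all assumptions made for the example. It also assumes that sender and receiver clocks are close enough for send timestamps to be compared with receive timestamps. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List

UNSUSPECTED = "Unsuspected"
SUSPECTED = "Suspected"


@dataclass
class HeartbeatDetector:
    """Local failure detector at q, driven by explicit timestamps (seconds)."""

    period: float          # T: interval between "p is here" messages
    initial_delay: float   # D used until some delays have been observed
    margin: float = 2.0    # hypothetical safety factor for the variable D
    window: int = 10       # hypothetical number of recent delays remembered
    last_seen: Dict[str, float] = field(default_factory=dict)
    delays: Dict[str, List[float]] = field(default_factory=dict)

    def on_heartbeat(self, p: str, sent_at: float, received_at: float) -> None:
        """Record a "p is here" message and its observed transmission delay."""
        self.last_seen[p] = received_at
        history = self.delays.setdefault(p, [])
        history.append(received_at - sent_at)
        del history[:-self.window]  # keep only the most recent delays

    def delay_estimate(self, p: str) -> float:
        """Variable D: margin times the largest recently observed delay."""
        history = self.delays.get(p)
        if not history:
            return self.initial_delay
        return self.margin * max(history)

    def query(self, p: str, now: float) -> str:
        """Suspect p if nothing arrived within T + D of the last message."""
        last = self.last_seen.get(p)
        if last is None:
            # Assumption: a process never heard from is treated as suspected.
            return SUSPECTED
        if now - last > self.period + self.delay_estimate(p):
            return SUSPECTED
        return UNSUSPECTED
```

For example, with period=1.0, a heartbeat that took 0.1 s gives D = 0.2, so p becomes Suspected once more than 1.2 s pass without a new message. A later heartbeat that took 0.6 s raises D to 1.2 and makes the detector more tolerant of the slower network.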

Reliable failure detector

When queried about a process, produces one of two values:
  • Unsuspected
    • Only a hint, as in the unreliable detector: the process may have failed since the last evidence was received
  • Failed
    • Accurate determination that peer process has failed
  • Implementation
    • Only possible in a synchronous system, where D is an absolute bound on message transmission time rather than an estimate
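
A rough sketch of why synchrony makes the difference: if the transmission delay has a guaranteed upper bound, silence for longer than T + D can only mean a crash, so the detector may answer Failed instead of Suspected. The function name query_reliable and its parameters are hypothetical, and the sketch assumes the sender really does emit a message every T seconds while it is alive.

```python
UNSUSPECTED = "Unsuspected"
FAILED = "Failed"


def query_reliable(last_heartbeat_at: float, now: float,
                   period: float, max_delay: float) -> str:
    """Classify a peer in a synchronous system.

    max_delay is assumed to be a guaranteed bound on transmission delay,
    not an estimate, so exceeding period + max_delay implies a crash.
    """
    if now - last_heartbeat_at > period + max_delay:
        return FAILED
    return UNSUSPECTED
```

In an asynchronous system no such bound exists, so the same timeout could only ever justify Suspected.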

© wongchihaul 2021 - 2024