Pitfall Guide: How to Use the `on_failure` Mechanism Flexibly Across Programming Scenarios to Make Your Code More Reliable

2025/3/15 20:13:57

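1. What is the on_failure mechanism?

2. Why do we need the on_failure mechanism?

3. Applying the on_failure mechanism in different programming scenarios

3.1. on_failure in web development

3.2. on_failure in message queues

3.3. on_failure in task scheduling

3.4. on_failure in monitoring systems

4. Ways to implement the on_failure mechanism

4.1. The try...except...finally statement (Python)

4.2. Decorators (Python)

4.3. Mechanisms provided by frameworks

4.4. Custom exception-handling classes

5. Practical advice for the on_failure mechanism

5.1. Identify error types and handle them differently

5.2. Configure a sensible retry strategy

5.3. Log errors in detail

5.4. Use a monitoring system to catch problems early

5.5. Write unit tests and integration tests

6. Summary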

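Hi, I'm Xiao Li, the veteran coder (老码农小李). Today let's talk about a concept that is often overlooked in software development yet is critically important: the on_failure mechanism, in other words, failure handling.

As programmers worth our salt, our goal is not just to write code that runs, but code that can take a hit. In real-world development, failures are routine: hardware faults, network jitter, bad user input... any of these surprises can crash a program. The on_failure mechanism is our secret weapon for dealing with them.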

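1. What is the `on_failure` mechanism?

Simply put, an on_failure mechanism is a specific handling flow that is triggered automatically when code fails. That flow might log the error, send a notification, retry the operation, roll back a transaction, or even shut the program down gracefully. The core idea is to define in advance what should happen when a failure occurs, rather than letting the error crash the program.

The concrete implementation differs across programming languages and frameworks, but the basic principle is the same.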
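To make the idea concrete, here is a minimal sketch in plain Python. The helper name `run_with_on_failure` and the `parse_port` example are hypothetical, made up for illustration rather than taken from any library; the point is only that the failure action is chosen before the operation runs.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def run_with_on_failure(
    operation: Callable[[], Any],
    on_failure: Callable[[Exception], Any],
) -> Any:
    """Run `operation`; if it raises, hand the exception to `on_failure`.

    Both parameter names are assumptions for this sketch, not a library API.
    """
    try:
        return operation()
    except Exception as exc:  # deliberately broad for the sketch
        logger.warning("operation failed: %s", exc)
        return on_failure(exc)


def parse_port(raw: str) -> int:
    return int(raw)  # raises ValueError on bad input


# The failure action is decided up front: fall back to a default port.
port = run_with_on_failure(lambda: parse_port("not-a-number"), lambda exc: 8080)
```

In a real project the `on_failure` callable could just as well send a notification or roll back a transaction; returning a default value is simply the easiest case to show.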

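2. Why do we need the `on_failure` mechanism?

You might ask: isn't try...catch (try...except in Python) enough? It is indeed the usual way to handle exceptions, but it is not quite the same thing as an on_failure mechanism.

Different granularity: try...catch usually catches exceptions within a block of code, while an on_failure mechanism applies to broader units such as function calls, scheduled tasks, and system monitoring.
Different focus: try...catch is about catching the exception; on_failure is about handling the failure. It not only catches the exception but also defines what should happen afterwards.
Different use cases: try...catch suits exceptions inside the program, while on_failure is better suited to system-level errors such as a failed database connection or a timed-out network request.

In short, the on_failure mechanism exists to make code more robust, reliable, and maintainable. It helps us:

Avoid crashes: with a failure-handling flow defined, the program can keep running when an error occurs instead of crashing outright.
Improve the user experience: friendly error messages and automatic retries keep users from being frustrated by program errors.
Locate problems quickly: detailed error logs help us find the root cause fast.
Automate operations: automated failure handling reduces manual intervention and lowers operating costs.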

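3. Applying the `on_failure` mechanism in different programming scenarios

Next, I'll walk through some concrete uses of the on_failure mechanism in different programming scenarios.

3.1. `on_failure` in web development

The on_failure mechanism is used all over web development. For example:

Database operations: when a database connection fails, a query times out, or an update fails, on_failure handling can retry, roll back the transaction, send an error notification, and so on.

# Example: Python + Flask + SQLAlchemy
from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy
from sqlalchemy.exc import SQLAlchemyError

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql+pymysql://user:password@host/database'
db = SQLAlchemy(app)

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True, nullable=False)

def add_user(username):
    """Return a (payload, status_code) pair on both success and failure."""
    try:
        user = User(username=username)
        db.session.add(user)
        db.session.commit()
        return {'message': 'User added successfully'}, 201
    except SQLAlchemyError as e:
        db.session.rollback()  # roll back the transaction
        # Log the error
        print(f'Database error: {e}')
        # Send an error notification
        # send_error_notification(e)
        return {'error': 'Failed to add user', 'details': str(e)}, 500  # error response

@app.route('/users', methods=['POST'])
def create_user():
    # silent=True yields None instead of an error when the body is not valid JSON
    data = request.get_json(silent=True) or {}
    username = data.get('username')
    if not username:
        return jsonify({'error': 'Username is required'}), 400
    result, status_code = add_user(username)
    return jsonify(result), status_code

Explanation:

SQLAlchemyError: catches database exceptions raised by SQLAlchemy.
db.session.rollback(): rolls back the transaction if an error occurs, keeping the data consistent.
Error log: the example prints the error; a real project should use a proper logging framework such as logging.
Error notification: the commented-out call shows where a notification would go, for example by email, SMS, or DingTalk.
Status codes: add_user always returns a payload together with a status code (201 on success, 500 on failure), so the caller can unpack the result the same way in both cases.
Error response: the client receives a JSON error message and a status code, which is friendlier than a bare failure. In production, consider leaving the raw exception text out of the response.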

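API requests: when a call to an external API fails, on_failure handling can retry, trip a circuit breaker, or degrade gracefully.

# Example: Python + requests
import requests
import time

def fetch_data(url, max_retries=3, retry_delay=1):
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()  # raise on 4xx/5xx status codes
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f'Request failed (attempt {i+1}/{max_retries}): {e}')
            if i < max_retries - 1:
                time.sleep(retry_delay)  # wait before retrying
            else:
                # Log the error
                print(f'Request failed after multiple retries: {e}')
                # Circuit breaking: for example, stop trying after repeated failures
                # send_error_notification(e)
                return None  # or raise an exception

Explanation:

requests.exceptions.RequestException: catches the various request errors (connection errors, timeouts, HTTP errors, and so on).
response.raise_for_status(): checks the HTTP status code and raises an HTTPError for 4xx and 5xx responses.
Retry: if the request fails, wait a while and try again, up to max_retries attempts in total.
Error log: records the attempt number and the error message.
Circuit breaking: after repeated failures, a circuit breaker can stop further requests to protect the system. Circuit-breaker frameworks such as Hystrix or Sentinel (both from the Java ecosystem) implement this pattern.
Return None or raise: pick whichever fits. If you raise, the calling code can continue the handling.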
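The circuit-breaking idea mentioned above only appears as a comment in the example, so here is a minimal sketch of it using nothing but the standard library. The class name `CircuitBreaker`, its parameters, and `CircuitOpenError` are all assumptions made for this illustration; a production system would more likely use an existing circuit-breaker library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_data, url)`: after `failure_threshold` consecutive failures, further calls fail fast with `CircuitOpenError` until `reset_timeout` seconds have passed.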

Cache operations: when a cache service (such as Redis or Memcached) cannot be reached or an operation fails, on_failure handling can degrade (use the database instead of the cache), retry, or simply return stale data.

# Example: Python + Redis
import redis
import json

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def get_data_from_cache(key, fallback_function):
    try:
        data = redis_client.get(key)
        if data:
            return json.loads(data.decode('utf-8'))
        else:
            # Cache miss: call fallback_function to get the data
            data = fallback_function()
            if data:
                redis_client.set(key, json.dumps(data))
                return data
            else:
                return None
    except redis.exceptions.RedisError as e:
        # Redis connection or operation failed
        print(f'Redis error: {e}')
        # Degrade: if Redis fails, fetch the data directly from the database
        return fallback_function()

Explanation:

redis.exceptions.RedisError: catches Redis-related exceptions.
Degradation: if Redis raises an error, fallback_function is called to fetch the data from the database instead. fallback_function should be a function that loads the data, for example a database query.
Cache miss: on a miss, fallback_function is called and its result is written back to the cache.

3.2. `on_failure` in message queues

Message queues (such as RabbitMQ and Kafka) play an important role in asynchronous processing. There, on_failure handling can be used for:

Publish failures: when sending a message to the queue fails, on_failure handling can retry, persist the message elsewhere (for example a database table acting as a dead-letter store), or send an error notification.

# Example: Python + pika (RabbitMQ)
import json
import time

import pika

def send_message(queue_name, message, max_retries=3, retry_delay=1):
    for i in range(max_retries):
        try:
            connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
            channel = connection.channel()
            channel.queue_declare(queue=queue_name)
            channel.basic_publish(exchange='', routing_key=queue_name, body=json.dumps(message))
            print(f' [x] Sent {message}')
            connection.close()
            return True  # sent successfully
        except (pika.exceptions.AMQPConnectionError, pika.exceptions.AMQPChannelError) as e:
            print(f'RabbitMQ error (attempt {i+1}/{max_retries}): {e!r}')
            if i < max_retries - 1:
                time.sleep(retry_delay)
            else:
                # Log the error
                print(f'RabbitMQ error after multiple retries: {e!r}')
                # Write the message to a dead-letter store
                # write_message_to_dead_letter_queue(message)
                return False  # sending failed

Explanation:

pika.exceptions.AMQPConnectionError: catches RabbitMQ connection errors.
pika.exceptions.AMQPChannelError: catches RabbitMQ channel errors. Both are handled the same way here, so they share one except clause.
Retry: if sending fails, wait a while and try again, up to max_retries attempts in total.
Dead-letter queue: if the message still cannot be sent after several attempts, it can be written to a dead-letter queue for later handling. A dead-letter queue is a special queue that holds messages that could not be processed normally.

Consume failures: when a consumer fails to process a message, on_failure handling can retry, move the message to a dead-letter queue, or send an error notification.

# Example: Python + pika (RabbitMQ)
import json

import pika

def consume_message(queue_name):
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name)

    def callback(ch, method, properties, body):
        try:
            message = json.loads(body.decode('utf-8'))
            print(f' [x] Received {message}')
            # Simulate a processing failure
            if 'error' in message:
                raise Exception('Simulated message processing error')
            # Acknowledge that the message has been processed
            ch.basic_ack(delivery_tag=method.delivery_tag)
        except Exception as e:
            print(f' [x] Error processing message: {e}')
            # Reject the message and put it back on the queue
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
            # Or write the message to a dead-letter queue
            # write_message_to_dead_letter_queue(body)

    channel.basic_consume(queue=queue_name, on_message_callback=callback)
    print(' [*] Waiting for messages. To exit press CTRL+C')
    channel.start_consuming()

Explanation:

ch.basic_ack(): confirms that the message was processed successfully, so it is removed from the queue.
ch.basic_nack(requeue=True): rejects the message and puts it back on the queue so that this consumer or another one can try again. Note that a message which always fails will loop through the queue forever, so you need a retry limit or a dead-letter queue.
Dead-letter queue: if the message still fails after several attempts, move it to a dead-letter queue for later handling. The dead-letter queue can be used to analyze why messages failed or to process them by hand.
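One way to avoid the endless requeue loop is to count attempts in a message header and switch to dead-lettering once a limit is reached. The sketch below shows only that decision logic as a pure function; the header name `x-retry-count`, the limit of 3, and the function name `next_action` are assumptions for illustration, not RabbitMQ built-ins.

```python
MAX_ATTEMPTS = 3  # assumed limit for this sketch
RETRY_HEADER = "x-retry-count"  # hypothetical header name, not defined by RabbitMQ


def next_action(headers, max_attempts=MAX_ATTEMPTS):
    """Decide what to do with a message whose processing just failed.

    Returns ("retry", new_headers) or ("dead_letter", new_headers), where
    new_headers carries the updated attempt count.
    """
    headers = dict(headers or {})
    attempts = int(headers.get(RETRY_HEADER, 0)) + 1  # count this failed attempt
    headers[RETRY_HEADER] = attempts
    if attempts >= max_attempts:
        return "dead_letter", headers
    return "retry", headers
```

Inside the consumer callback, a "retry" result would mean republishing the body with the new headers and acknowledging the original delivery, while "dead_letter" would mean publishing to the dead-letter queue. If the queue was declared with RabbitMQ's `x-dead-letter-exchange` argument, rejecting with `basic_nack(requeue=False)` lets the broker do the dead-lettering.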

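3.3. `on_failure` in task scheduling

Task schedulers and task queues (such as Celery and APScheduler) help us run all kinds of jobs automatically. There, on_failure handling can be used for:

Task failures: when a task fails, on_failure handling can retry it, send an error notification, or log the error.

# Example: Python + Celery
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task(bind=True, autoretry_for=(Exception,), retry_kwargs={'max_retries': 3, 'countdown': 5})
def my_task(self, arg):
    try:
        # Simulate a task failure
        if arg == 'fail':
            raise Exception('Task failed!')
        print(f'Task executed successfully with arg: {arg}')
        return 'success'
    except Exception as e:
        # Log the error
        print(f'Task failed (retries so far: {self.request.retries}): {e}')
        # Re-raise so that autoretry_for schedules the retry.
        # Manual alternative (without autoretry_for):
        #     raise self.retry(exc=e, countdown=5, max_retries=3)
        raise

Explanation:

bind=True: binds the task instance to the function as self, giving access to task attributes and methods.
autoretry_for=(Exception,): retries automatically whenever an Exception is raised.
retry_kwargs: configures the retries, including the maximum number of retries (max_retries) and the delay between them (countdown).
self.retry(exc=e, ...): triggers a retry manually and passes the exception along. Use one approach or the other: a bare self.retry(exc=e) uses the task's default_retry_delay, not the countdown given in retry_kwargs.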

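Task timeouts: when a task runs too long, on_failure handling can cancel it, send an error notification, or log the error.

# Example: Python + Celery
import time

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task(bind=True, soft_time_limit=10, time_limit=15)
def long_running_task(self):
    try:
        print('Starting long running task...')
        time.sleep(15)  # simulate a task that runs too long
        print('Long running task finished.')
    except SoftTimeLimitExceeded as e:
        # The task exceeded its soft time limit
        print(f'Task timed out: {e}')
        # Send an error notification
        # send_error_notification(e)
        return 'timeout'

Explanation:

soft_time_limit=10: sets a soft time limit of 10 seconds. When the task runs longer than that, Celery raises SoftTimeLimitExceeded inside the task so it can clean up. Celery tasks have no timeout option; the limits are soft_time_limit and time_limit.
time_limit=15: sets the hard limit. If the task is still running after 15 seconds, the worker process executing it is terminated. Soft limits depend on the worker pool and platform, so check that your deployment supports them.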
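Outside Celery, the same "time out, then run a failure action" idea can be sketched with the standard library. This is only an illustration under stated assumptions: `run_with_timeout`, `slow_job`, and the `on_timeout` callback are hypothetical names, and unlike a Celery time limit this approach cannot stop the work that is already running.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def run_with_timeout(func, timeout_seconds, on_timeout):
    """Wait up to `timeout_seconds` for `func`; call `on_timeout` if it takes longer.

    Caveat: the worker thread is not killed; it keeps running in the background.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeoutError:
        return on_timeout()
    finally:
        executor.shutdown(wait=False)  # do not block on the still-running worker


def slow_job():
    time.sleep(0.3)
    return "done"


outcome = run_with_timeout(slow_job, 0.05, on_timeout=lambda: "timeout")
```

Here `outcome` ends up as the value returned by `on_timeout`, because `slow_job` needs longer than the 0.05 seconds it was given.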

3.4. `on_failure` in monitoring systems

Monitoring systems (such as Prometheus and Zabbix) help us watch the state of a system in real time. There, on_failure handling can be used for:

Metric collection failures: when collecting a metric fails, on_failure handling can retry, switch data sources, or send an alert.

# Example: Python + Prometheus
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import time
import random

REGISTRY = CollectorRegistry()

# Define the metric
cpu_usage = Gauge('cpu_usage', 'CPU usage', registry=REGISTRY)

def collect_metrics():
    try:
        # Simulate collecting CPU usage
        cpu_usage_value = random.uniform(0, 100)
        cpu_usage.set(cpu_usage_value)
        print(f'Collected CPU usage: {cpu_usage_value}%')
        return True
    except Exception as e:
        # Collection failed
        print(f'Failed to collect metrics: {e}')
        # Switch data sources, or use a default value
        # switch_to_backup_data_source()
        return False

def push_metrics_to_gateway(gateway_url):
    try:
        push_to_gateway(gateway_url, job='my_job', registry=REGISTRY)
        print('Metrics pushed to gateway successfully')
    except Exception as e:
        print(f'Failed to push metrics to gateway: {e}')
        # Send an alert
        # send_alert_notification(e)

if __name__ == '__main__':
    gateway_url = 'http://localhost:9091'
    while True:
        if collect_metrics():
            push_metrics_to_gateway(gateway_url)
        else:
            # On failure: retry, switch data sources, or log the error
            print('Metrics collection failed. Retrying...')
        time.sleep(10)  # collect every 10 seconds

Explanation:

random.uniform(0, 100): simulates reading CPU usage by generating a random number between 0 and 100.
cpu_usage.set(cpu_usage_value): sets the value of the CPU usage gauge.
If collection fails, you can retry, switch data sources (for example, if CPU usage cannot be read from /proc/stat, try another source), or log the error.
push_to_gateway: pushes the metrics to the Prometheus Pushgateway.

Alert delivery failures: when sending an alert fails, on_failure handling can retry, switch to another notification channel, or log the error.

# Example: sending an email alert
import smtplib
import time
from email.mime.text import MIMEText

def send_email_alert(sender, recipients, subject, body, max_retries=3, retry_delay=1):
    for i in range(max_retries):
        try:
            msg = MIMEText(body)
            msg['Subject'] = subject
            msg['From'] = sender
            msg['To'] = ', '.join(recipients)
            with smtplib.SMTP('smtp.example.com', 587) as server:
                server.starttls()
                server.login('your_email@example.com', 'your_password')
                server.send_message(msg)
            print('Email sent successfully')
            return True
        except Exception as e:
            print(f'Failed to send email (attempt {i+1}/{max_retries}): {e}')
            if i < max_retries - 1:
                time.sleep(retry_delay)
            else:
                # Log the error
                print(f'Failed to send email after multiple retries: {e}')
                # Switch notification channel, for example SMS or DingTalk
                # send_sms_alert(subject, body)
                return False

Explanation:

Retry: if sending the email fails, wait a while and try again, up to max_retries attempts in total.
Switching channels: if email keeps failing, fall back to another notification channel such as SMS or DingTalk.
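The channel switch is only hinted at by a comment in the example, so here is a small sketch of it. Everything in it is hypothetical: `send_alert_with_fallback`, `broken_email`, and `fake_sms` are stand-ins for real senders, and the channel order is whatever you pass in.

```python
def send_alert_with_fallback(message, channels):
    """Try each (name, send_func) pair in order; return the name that succeeded.

    Returns None if every channel failed.
    """
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            print(f"Channel {name} failed: {exc}")
    return None


# Hypothetical senders for the example.
def broken_email(message):
    raise ConnectionError("SMTP server unreachable")


sent_sms = []


def fake_sms(message):
    sent_sms.append(message)


used = send_alert_with_fallback(
    "disk usage above 90%",
    [("email", broken_email), ("sms", fake_sms)],
)
```

In this run the email sender fails, so the alert goes out through the second channel and `used` records which one delivered it.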

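4. Ways to implement the `on_failure` mechanism

So far we've covered where on_failure handling is used. Next, here are a few common ways to implement it.

4.1. The `try...except...finally` statement (Python)

Python's try...except...finally statement is the usual way to handle exceptions, and it is the foundation of any on_failure mechanism.

try:
    # Code to attempt
    result = do_something()
except SomeException as e:
    # Catch a specific type of exception
    handle_exception(e)
finally:
    # Runs whether or not an exception occurred
    cleanup()

try block: contains the code that may raise an exception.
except block: catches a specific type of exception and handles it. There can be several except blocks for different exception types.
finally block: runs whether or not an exception occurred. It is typically used to release resources, such as closing files or database connections.

Pros:

Simple and easy to understand; it is Python's built-in exception-handling mechanism.
Can catch several types of exceptions.
The finally block ensures resources are released.

Cons:

The code structure can become unclear, with deep nesting.
The handling logic has to be written by hand each time, which is error-prone.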

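4.2. Decorators (Python)

Decorators are an advanced Python feature for extending what a function does, and they can also be used to implement on_failure handling.

import functools
import random
import time

def retry(max_retries=3, retry_delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for i in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f'Function {func.__name__} failed (attempt {i+1}/{max_retries}): {e}')
                    if i < max_retries - 1:
                        time.sleep(retry_delay)
                    else:
                        # Log the error, then re-raise the last exception
                        print(f'Function {func.__name__} failed after multiple retries: {e}')
                        raise
        return wrapper
    return decorator

@retry(max_retries=5, retry_delay=2)
def my_function():
    # Simulate code that may fail
    if random.random() < 0.5:
        raise Exception('Something went wrong!')
    print('Function executed successfully')
    return 'success'

my_function()

Explanation:

retry decorator: a decorator factory that adds retry behavior to a function.
functools.wraps(func): preserves the decorated function's metadata (name, docstring, and so on).
wrapper function: wraps the decorated function and implements the retry logic.
@retry: applies the decorator to a function using the @ syntax.

Pros:

Concise and easy to maintain.
The exception-handling logic can be reused.
Retry count, retry delay, and other parameters are easy to configure.

Cons:

You need to understand how decorators work.
The retry policy is attached at the function definition, so it is coupled to the business function rather than chosen by each caller.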

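4.3. Mechanisms provided by frameworks

Many frameworks ship with built-in on_failure handling that makes it easy to deal with all kinds of errors.

Celery (Python): provides task retries, time limits, and failure handling, which makes asynchronous tasks easier to run reliably.
Spring Boot (Java): the Spring ecosystem provides transaction management, exception handling, and retries, which helps in building reliable web applications.
RabbitMQ (many languages): provides dead-letter exchanges and requeueing, which help ensure messages are delivered reliably.

Pros:

Powerful; they cover many complex failure scenarios.
Integrated with the framework and convenient to use.
Usually perform well and are reliable.

Cons:

You have to learn how the framework works.
You may be constrained by the framework.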

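4.4. Custom exception-handling classes

For more complex scenarios, you can write your own exception-handling class that encapsulates the handling logic.

import time

class OperationFailedError(Exception):
    def __init__(self, message, details=None):
        self.message = message
        self.details = details
        super().__init__(self.message)

class OperationHandler:
    def __init__(self, max_retries=3, retry_delay=1):
        self.max_retries = max_retries
        self.retry_delay = retry_delay

    def execute(self, operation):
        for i in range(self.max_retries):
            try:
                return operation()
            except Exception as e:
                print(f'Operation failed (attempt {i+1}/{self.max_retries}): {e}')
                if i < self.max_retries - 1:
                    time.sleep(self.retry_delay)
                else:
                    # Log the error, then raise a domain-specific exception
                    print(f'Operation failed after multiple retries: {e}')
                    # "from e" keeps the original exception as the cause
                    raise OperationFailedError('Operation failed', details=str(e)) from e

Explanation:

OperationFailedError: a custom exception class that represents a failed operation.
OperationHandler: a custom handler class that encapsulates the retry logic.
execute method: runs the operation and handles exceptions.

Pros:

Clear code structure that is easy to maintain.
The exception-handling logic can be reused.
Can be extended to handle different exception types in different ways.

Cons:

Requires extra code.
The design of the handler class needs some thought.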
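The handler class above treats every exception the same way. As a sketch of how type-specific handling could be added, the class below keeps a list of exception types and handlers. `FailureDispatcher` and its methods are hypothetical names for this illustration, not part of the original handler.

```python
class FailureDispatcher:
    """Map exception types to handlers; the first matching registration wins."""

    def __init__(self):
        self._handlers = []  # list of (exception_type, handler) pairs

    def register(self, exc_type, handler):
        self._handlers.append((exc_type, handler))

    def execute(self, operation):
        try:
            return operation()
        except Exception as exc:
            for exc_type, handler in self._handlers:
                if isinstance(exc, exc_type):
                    return handler(exc)
            raise  # no handler registered for this exception type


dispatcher = FailureDispatcher()
dispatcher.register(TimeoutError, lambda exc: "retry later")
dispatcher.register(ValueError, lambda exc: "reject input")


def bad_input():
    raise ValueError("age must be a number")


result = dispatcher.execute(bad_input)
```

Because matching uses `isinstance`, registering a base class also catches its subclasses, so more specific types should be registered first.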

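5. Practical advice for the `on_failure` mechanism

Finally, here is some practical advice on using on_failure handling.

5.1. Identify error types and handle them differently

Different kinds of errors call for different handling. For example:

Retryable errors: network timeouts, a temporarily unavailable database, and the like can be retried.
Non-retryable errors: bad user input, malformed data, and the like should produce an error message or a log entry, not a retry.
Severe errors: system crashes, hardware faults, and the like should trigger an alert or an automatic restart.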
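The split between retryable and non-retryable errors can be sketched as follows. Which exceptions count as transient is an assumption here (`ConnectionError` and `TimeoutError`), and `call_with_selective_retry` and `flaky` are made-up names for the example.

```python
import time

RETRYABLE = (ConnectionError, TimeoutError)  # assumed transient failures


def call_with_selective_retry(func, max_retries=3, retry_delay=0.0):
    """Retry only errors listed in RETRYABLE; let everything else propagate at once."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == max_retries:
                raise  # out of attempts: give up and surface the error
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return "ok"


result = call_with_selective_retry(flaky)
```

A `ValueError` raised by the wrapped function is not caught at all, so bad input fails on the first attempt instead of being retried pointlessly.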

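5.2. Configure a sensible retry strategy

Retries make code more robust, but pay attention to how many there are and how far apart. Too many retries can overload the system and even trigger cascading failures (the "avalanche effect").

Set a maximum retry count: avoid retrying forever.
Set a retry interval: exponential backoff avoids many retries within a short period.
Monitor retry counts: if retries are frequent, take notice and check whether something is wrong with the system.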
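Exponential backoff is easy to get subtly wrong, so here is a small sketch that just computes the wait times. The function name `backoff_delays` and all default values are assumptions for illustration; the optional jitter adds a random fraction to each delay so that many clients do not retry at the same instant.

```python
import random


def backoff_delays(max_retries, base_delay=1.0, factor=2.0, max_delay=30.0,
                   jitter=0.0, rng=random):
    """Return the wait before each retry: base_delay * factor**n, capped at max_delay.

    `jitter` adds a random fraction (between 0.0 and `jitter`) of each delay.
    """
    delays = []
    for n in range(max_retries):
        delay = min(base_delay * factor ** n, max_delay)
        delay += delay * rng.uniform(0.0, jitter)
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A retry loop would then call `time.sleep(delay)` with the next value from this list before each new attempt.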

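5.3. Log errors in detail

Detailed error logs are the key to troubleshooting. An error log entry should include:

Timestamp: when the error occurred.
Log level: for example INFO, WARNING, ERROR, or CRITICAL.
Error message: a description of what went wrong.
Stack trace: the traceback of the caught exception, to help locate the problem.
Context: for example the user ID, request ID, and parameters.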
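With Python's standard logging module, most of those fields come almost for free. The sketch below is only an example under stated assumptions: the logger name `orders`, the function `process_order`, and its parameters are invented for illustration.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    # asctime gives the timestamp, levelname the log level
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")


def process_order(order_id, user_id, quantity):
    try:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        return quantity * 2
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception(
            "order processing failed: order_id=%s user_id=%s quantity=%s",
            order_id, user_id, quantity,
        )
        return None
```

The context (order ID, user ID, quantity) goes into the message itself, while the timestamp, level, and traceback are added by the logging machinery.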

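5.4. Use a monitoring system to catch problems early

A monitoring system lets us watch the state of the system in real time and spot problems early. Metrics worth monitoring include:

Error rate: how often errors occur.
Retry count: if retries are frequent, take notice.
System load: CPU, memory, disk, and other resource usage.
Response time: API latency; long response times may point to a performance problem.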
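As a rough illustration of the error-rate metric, here is a sliding-window tracker built on the standard library. `ErrorRateTracker` is a hypothetical helper; in practice the number would usually be exported to the monitoring system rather than computed by hand.

```python
from collections import deque


class ErrorRateTracker:
    """Track the error rate over the most recent `window` calls."""

    def __init__(self, window=100):
        self._outcomes = deque(maxlen=window)  # True = failure, False = success

    def record(self, failed):
        self._outcomes.append(bool(failed))

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)


tracker = ErrorRateTracker(window=4)
for failed in (False, True, True, False, True):
    tracker.record(failed)
# The window now holds the last four outcomes: True, True, False, True.
```

An alert rule could then compare `tracker.error_rate()` against a threshold chosen for the service.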

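5.5. Write unit tests and integration tests

Unit tests and integration tests verify that your on_failure handling actually works. Simulate a range of failure scenarios and check that the exception-handling logic behaves correctly.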
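Failures are easy to simulate with `unittest.mock`. The sketch below tests a small retry helper; `fetch_with_retry` is a hypothetical function written for this example, and the mock's `side_effect` plays the role of an unreliable dependency.

```python
import unittest
from unittest import mock


def fetch_with_retry(fetch, max_retries=3):
    """Call `fetch` until it succeeds or `max_retries` attempts are used up."""
    last_error = None
    for _ in range(max_retries):
        try:
            return fetch()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


class FetchWithRetryTest(unittest.TestCase):
    def test_recovers_after_transient_failures(self):
        # Fail twice, then succeed on the third call.
        fetch = mock.Mock(
            side_effect=[ConnectionError("down"), ConnectionError("down"), "payload"]
        )
        self.assertEqual(fetch_with_retry(fetch), "payload")
        self.assertEqual(fetch.call_count, 3)

    def test_gives_up_after_max_retries(self):
        # Every call fails, so the helper should stop after two attempts.
        fetch = mock.Mock(side_effect=ConnectionError("down"))
        with self.assertRaises(ConnectionError):
            fetch_with_retry(fetch, max_retries=2)
        self.assertEqual(fetch.call_count, 2)
```

Saved in a test module, these cases can be run with `python -m unittest`.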

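6. Summary

The on_failure mechanism is an important way to make code more robust, reliable, and maintainable. In real-world development, we should pick the implementation that fits the scenario and combine it with a sensible retry strategy, detailed error logs, and monitoring, so that our code can take a hit.

I hope today's post was useful. Remember: writing reliable code is every programmer's responsibility.

If you have questions, or other experience with on_failure handling, leave a comment and let's learn from each other. Keep at it!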

老码农小李 · on_failure · exception handling · error handling · code robustness · software development
