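LLM API Optimization Guide - AI Technology Learning Platform

⚡ LLM API Optimization Guide

Tune parameters and strategies to make large-model output more accurate and effective

Why optimize LLM API calls?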

The value of optimizing LLM API calls

  Parameter tuning         →  Higher output quality    →  Lower cost
  (temperature / length,      (more accurate, stable,     (fewer retries,
   prompt design)              matches expectations)       precise control)

Core goal: make model output more accurate, more controllable, and more efficient

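🎛️ Core parameters in detail

Parameter	Range	Effect	Tuning advice
temperature	0.0 - 2.0	Controls the randomness and creativity of the output	Accuracy tasks: 0.1-0.3 | Creative tasks: 0.7-1.0
max_tokens	Positive integer	Caps the number of tokens the model may generate	Set according to the task; leave enough room to avoid truncation
top_p	0.0 - 1.0	Nucleus sampling: at each step, sample only from the smallest set of tokens whose cumulative probability reaches top_p	Use together with temperature; lower values are more deterministic
frequency_penalty	-2.0 - 2.0	Penalizes tokens in proportion to how often they have already appeared in the generated text	Raise (>0) to reduce repetition
presence_penalty	-2.0 - 2.0	Penalizes any token that has already appeared in the generated text, regardless of how often	Raise (>0) to encourage new content
penalty_score	1.0 - 2.0	Baidu-specific repetition penalty	Set >1 to reduce repetition; 1.0 means no penalty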
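To make the temperature and top_p rows concrete, here is a minimal sketch of what the two parameters do to a token distribution. The logits are made-up numbers for three candidate tokens, and `softmax_with_temperature` and `nucleus_filter` are illustrative names, not part of any provider SDK. Real providers may implement sampling differently in detail.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
print(nucleus_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.5))
```

At low temperature almost all probability mass lands on the top token, which is why low values give more repeatable answers. A small top_p has a similar effect by discarding the unlikely tail entirely.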

temperature comparison

temperature = 0.1 (low)
• Highly deterministic
• Consistent answers
• Suited to factual tasks
• May be overly conservative

temperature = 0.7 (medium)
• Balanced creativity
• Moderate diversity
• Suitable for everyday conversation
• Recommended for general use

temperature = 1.0 (high)
• Highly creative
• Unpredictable
• Suited to writing poetry

💡 Rule of thumb: low temperature for accuracy tasks, high temperature for creative tasks

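🎯 Accuracy optimization strategies

Strategy 1: Reduce randomness

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Accuracy task: use a low temperature (0.1 is low, not the minimum; 0 is allowed)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What is the capital of China?"}
    ],
    temperature=0.1,  # low value, more deterministic output
    top_p=0.1,        # sample only from the most likely tokens
    max_tokens=500
)

print(response.choices[0].message.content)
# Typical output: The capital of China is Beijing.

Suited to: question answering, fact checking, code generation
Effect: more stable, repeatable output with less random drift. Low randomness alone does not guarantee factual correctness, so combine it with the prompt and retrieval strategies below.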

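Strategy 2: Strengthen the system prompt

# Write a detailed system prompt
messages = [
    {
        "role": "system",
        "content": """You are a professional technical writing assistant.
Your working rules:
1. Answer only from the facts provided; do not make things up
2. If the information is insufficient, say explicitly "This cannot be determined from the current information"
3. Structure answers clearly, using Markdown
4. Cite specific data and sources
5. Avoid vague hedging words (such as "maybe" or "probably")"""
    },
    {
        "role": "user",
        "content": "Please analyze the main contributions of this paper."
    }
]

Define the role and its constraints
Set rules for answering
Reduce ambiguity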

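Strategy 3: Few-shot learning from examples

# Use examples to steer the model toward a specific output format
messages = [
    {"role": "system", "content": "Answer questions in the format shown by the examples."},
    {"role": "user", "content": "Is Beijing the capital of China?"},
    {"role": "assistant", "content": "Yes. Beijing is the capital of the People's Republic of China."},
    {"role": "user", "content": "Is Tokyo the capital of the United States?"},
    {"role": "assistant", "content": "No. Tokyo is the capital of Japan, not of the United States."},
    {"role": "user", "content": "Is Ottawa the capital of Australia?"}
]

# The model imitates the example format: No. Ottawa is the capital of Canada...

💡 Few-shot tip: provide 2-5 high-quality examples so the model learns the expected format and style of reasoning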
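When the examples live in a list or a file, the message array above can be assembled programmatically. The helper below is a small sketch; `build_few_shot_messages` is a hypothetical name introduced here, and it only builds the list without calling any API.

```python
def build_few_shot_messages(instruction, examples, question):
    """Return a chat `messages` list: system instruction, example turns, then the real question."""
    messages = [{"role": "system", "content": instruction}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages

examples = [
    ("Is Beijing the capital of China?", "Yes. Beijing is the capital of the People's Republic of China."),
    ("Is Tokyo the capital of the United States?", "No. Tokyo is the capital of Japan, not of the United States."),
]
messages = build_few_shot_messages(
    "Answer questions in the format shown by the examples.",
    examples,
    "Is Ottawa the capital of Australia?",
)
print(len(messages))
```

Keeping examples as (question, answer) pairs makes it easy to swap or reorder them while testing which set gives the most consistent format.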

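Strategy 4: Chain-of-Thought reasoning

# Ask the model to think step by step
# (a multi-line string needs triple quotes; plain double quotes are a syntax error here)
messages = [
    {"role": "system", "content": """When solving a problem, follow these steps:
1. First understand what the question asks
2. Analyze the key information
3. Reason step by step
4. Give the final answer"""},
    {"role": "user", "content": "If a rope is 10 meters long and half of it is cut off, how much is left?"}
]

# The model answers step by step, for example:
# 1. Understand the question: the rope is 10 meters long and half is cut off
# 2. Analysis: half is 10 ÷ 2 = 5 meters
# 3. Reasoning: 10 - 5 = 5, which checks out
# 4. Answer: 5 meters remain

Suited to: math, logical reasoning, complex decisions
Effect: fewer errors and better explainability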

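🔧 Advanced optimization techniques

RAG (Retrieval-Augmented Generation)

RAG workflow

  User question → embedding → vector search → relevant documents
                                                    ↓
                  question + retrieved documents → LLM → answer

Advantages:
• Fewer hallucinations (answers are grounded in retrieved facts)
• Addresses stale model knowledge
• Supports private knowledge bases

# Pseudocode: RAG flow (vector_search and query are placeholders you supply)
# 1. Retrieve relevant documents
retrieved_docs = vector_search(query, top_k=5)
context = "\n\n".join(retrieved_docs)  # assumes vector_search returns a list of strings

# 2. Build the augmented prompt
augmented_prompt = f"""Answer the question based on the information below.

Reference information:
{context}

User question: {query}

Answer from the reference information; if it is insufficient, say so."""

# 3. Call the LLM
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": augmented_prompt}],
    temperature=0.1
)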
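The retrieval step is left abstract above. As a toy stand-in, the sketch below ranks documents by cosine similarity over word counts. This is an assumption for illustration only: a real system would use an embedding model and a vector database, and this `vector_search` takes the document list as an extra argument, unlike the placeholder in the pseudocode.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_search(query, documents, top_k=5):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "Beijing is the capital of China.",
    "Ottawa is the capital of Canada.",
    "Nucleus sampling keeps the most likely tokens.",
]
print(vector_search("What is the capital of Canada?", docs, top_k=1))
```

Word-count similarity misses synonyms and does not tokenize languages without spaces well, which is the main reason production systems use learned embeddings instead.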

Self-Consistency

Self-consistency strategy

  Same question → generate several times → vote → final answer

Steps:
1. Use CoT to generate several answers
2. Tally the results and take the most common answer
3. Accuracy improves because isolated reasoning errors get outvoted

from collections import Counter

# Generate several times and take the majority answer
# (question is a placeholder for your prompt text)
answers = []
for _ in range(5):  # generate 5 times
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0.7  # some randomness is needed so the samples differ
    )
    answers.append(response.choices[0].message.content)

# Tally the most common answer (simplified: this compares whole replies,
# so it only works when replies are short and identically worded)
most_common_answer = Counter(answers).most_common(1)[0][0]
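Voting on whole replies rarely works with chain-of-thought output, because the reasoning text differs between samples even when the conclusion matches. One possible refinement is to extract only the final answer before counting. The sketch below assumes the prompt asks the model to end with a line starting with "Answer:"; that convention and both function names are illustrative assumptions.

```python
import re
from collections import Counter

def extract_final_answer(text):
    """Return the text after the last 'Answer:' marker, or the last non-empty line as a fallback."""
    matches = re.findall(r"Answer:\s*(.+)", text)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def majority_vote(completions):
    """Return (winning answer, share of samples that agreed with it)."""
    if not completions:
        raise ValueError("need at least one completion")
    answers = [extract_final_answer(c) for c in completions]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

samples = [
    "Half of 10 is 5.\nAnswer: 5 meters",
    "10 / 2 = 5, so 10 - 5 = 5.\nAnswer: 5 meters",
    "Cut in half twice?\nAnswer: 2.5 meters",
]
print(majority_vote(samples))
```

The agreement share is a rough confidence signal: a low value suggests the question is ambiguous or the model is unsure, and the result may deserve a manual check.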

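📋 Parameter settings by scenario

Scenario	temperature	top_p	Other settings
Code generation	0.1 - 0.2	0.1	State the required code format; ask for comments
Factual Q&A	0.1 - 0.3	0.1 - 0.3	Require cited sources; say "I don't know" when unsure
Copywriting	0.7 - 0.9	0.8 - 0.95	Specify style, tone, and audience
Translation	0.2 - 0.4	0.3	State the language pair and the translation style
Data analysis	0.1 - 0.2	0.1	Require step-by-step work and a confidence level
Role-play	0.5 - 0.8	0.7 - 0.9	Describe the character's background and personality in detail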
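The table can be captured as presets so that call sites pick a scenario instead of repeating numbers. The sketch below chooses one value from each range in the table; the exact picks, the scenario keys, and the `sampling_params` helper are illustrative assumptions, not recommended constants.

```python
# One value chosen from each range in the table above (illustrative picks)
SCENARIO_PRESETS = {
    "code_generation": {"temperature": 0.1, "top_p": 0.1},
    "factual_qa": {"temperature": 0.2, "top_p": 0.2},
    "copywriting": {"temperature": 0.8, "top_p": 0.9},
    "translation": {"temperature": 0.3, "top_p": 0.3},
    "data_analysis": {"temperature": 0.1, "top_p": 0.1},
    "role_play": {"temperature": 0.7, "top_p": 0.8},
}

def sampling_params(scenario, **overrides):
    """Return a copy of the preset for `scenario`, with any keyword overrides applied."""
    if scenario not in SCENARIO_PRESETS:
        raise KeyError(f"unknown scenario: {scenario}")
    params = dict(SCENARIO_PRESETS[scenario])
    params.update(overrides)
    return params

print(sampling_params("translation", max_tokens=800))
```

The returned dict can be passed to a call with `**sampling_params(...)`. Returning a copy keeps per-call overrides from leaking into the shared presets.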

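💰 Cost optimization tips

📉 Control output length

Set a max_tokens ceiling
Ask explicitly for a brief answer in the prompt
Avoid open-ended questions

📝 Trim the context

Pass only the conversation history that is needed
Summarize long conversations
Use RAG instead of pasting long documents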
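Passing only the necessary history can be sketched as a function that keeps system messages plus the most recent turns that fit a token budget. The token estimate here is a crude assumption of about 4 characters per token, which is only roughly right for English. Use the provider's tokenizer when accuracy matters. Both function names are hypothetical.

```python
def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token (English text only)."""
    return max(1, len(text) // 4)

def trim_history(messages, max_tokens):
    """Keep all system messages plus the newest turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order

history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=25)
print([m["role"] for m in trimmed])
```

Dropping the oldest turns first is the simplest policy. Replacing them with a short summary message, as the list above suggests, preserves more context at a similar cost.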

🔄 Route by model

Use a small model for simple tasks
Use a large model for complex tasks
Consider offline processing for batch jobs

🛡️ Error handling and fault tolerance

import time
from openai import OpenAI, RateLimitError, APIError

client = OpenAI(api_key="your-api-key")

def call_llm_with_retry(messages, max_retries=3):
    """Call the LLM with a retry mechanism."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                temperature=0.7,
                timeout=30  # request timeout in seconds
            )
            return response.choices[0].message.content

        except RateLimitError:
            # Rate limit error: wait, then retry (wait grows with each attempt)
            wait_time = (attempt + 1) * 2
            print(f"Rate limit, waiting {wait_time}s...")
            time.sleep(wait_time)

        except APIError as e:
            # API error: retry only server-side (5xx) failures.
            # Checking the status code is more reliable than matching the error text.
            status = getattr(e, "status_code", None)
            if status is not None and status >= 500:
                print(f"Server error, retry {attempt + 1}/{max_retries}")
                time.sleep(2)
            else:
                raise  # re-raise all other errors

        except Exception as e:
            print(f"Unknown error: {e}")
            raise

    raise RuntimeError("Max retries exceeded")
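The fixed waits above are easy to read, but many clients retrying on the same schedule can hit the service in synchronized bursts. A common alternative, swapped in here as a sketch rather than taken from the original, is exponential backoff with random jitter. `backoff_delay` and `call_with_backoff` are illustrative names, and the `sleep` parameter exists only so the logic can be exercised without real waiting.

```python
import random
import time

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter delay: a random wait between 0 and min(cap, base * 2**attempt) seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

def call_with_backoff(func, max_retries=3, retryable=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable exception wait and try again, re-raising after the last attempt."""
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise
            sleep(backoff_delay(attempt))

# Demo with a fake call that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "ok"

result = call_with_backoff(flaky, retryable=(TimeoutError,), sleep=lambda s: None)
print(result, calls["n"])
```

In real use, `retryable` would list only the error types worth retrying, such as rate-limit and server errors, so that authentication or bad-request errors fail immediately.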

📊 Analyzing LLM API responses

Understanding the API response structure

OpenAI API response structure

{
  "id": "chatcmpl-abc123",            ← unique response ID
  "object": "chat.completion",        ← object type
  "created": 1690000000,              ← creation timestamp
  "model": "gpt-4o",                  ← model used
  "choices": [                        ← array of generated results
    {
      "index": 0,                     ← result index
      "message": {                    ← the model's reply
        "role": "assistant",          ← role
        "content": "answer text"      ← the actual reply
      },
      "finish_reason": "stop",        ← why generation ended (stop/length/tool_calls)
      "logprobs": null                ← probability information (optional)
    }
  ],
  "usage": {                          ← token usage statistics
    "prompt_tokens": 50,              ← input tokens
    "completion_tokens": 100,         ← output tokens
    "total_tokens": 150               ← total tokens
  },
  "system_fingerprint": "fp_xxx"      ← model version fingerprint
}

Parsing the response

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Give a short introduction to artificial intelligence"}],
    temperature=0.7
)

# Extract the core fields
response_id = response.id                    # response ID
model_name = response.model                  # model name
created_time = response.created              # creation time (Unix timestamp)

# Extract the reply
message = response.choices[0].message
content = message.content                     # the answer text
role = message.role                           # role (assistant)

# Extract the finish reason
finish_reason = response.choices[0].finish_reason
# stop: ended normally | length: hit max_tokens | tool_calls: the model called a tool

# Extract usage statistics
usage = response.usage
prompt_tokens = usage.prompt_tokens           # input tokens
completion_tokens = usage.completion_tokens   # output tokens
total_tokens = usage.total_tokens             # total tokens

print(f"Model reply: {content}")
print(f"Tokens used: {total_tokens} (input: {prompt_tokens} + output: {completion_tokens})")

💡 Tip: always check `finish_reason` to confirm the output is complete
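The `finish_reason` check can be wrapped in a small helper that returns the text together with a warning when generation did not end normally. This is a sketch: `check_completion` is a hypothetical name, the warning wording is made up, and the demo uses a stand-in object built with `SimpleNamespace` instead of a real API response.

```python
from types import SimpleNamespace

def check_completion(response):
    """Return (text, warning); warning is None when generation ended normally."""
    choice = response.choices[0]
    text = choice.message.content or ""
    reason = choice.finish_reason
    if reason == "stop":
        return text, None
    if reason == "length":
        return text, "truncated: raise max_tokens or ask for a shorter answer"
    if reason == "content_filter":
        return text, "content was filtered"
    if reason == "tool_calls":
        return text, "model requested a tool call; content may be empty"
    return text, f"unexpected finish_reason: {reason}"

# Stand-in for a real response object, shaped like the structure shown earlier
fake = SimpleNamespace(
    choices=[SimpleNamespace(
        message=SimpleNamespace(content="partial ans"),
        finish_reason="length",
    )]
)
print(check_completion(fake))
```

A caller can log the warning, retry with a larger `max_tokens`, or surface the truncation to the user, depending on the application.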

Token usage and cost calculation

# Per-model token pricing (illustrative values only, in US dollars per million tokens)
MODEL_PRICING = {
    "gpt-4o": {"input": 5.0, "output": 15.0},
    "gpt-4o-mini": {"input": 0.15, "output": 0.6},
    "gpt-4-turbo": {"input": 10.0, "output": 30.0},
    "claude-3-5-sonnet": {"input": 3.0, "output": 15.0},
}

def calculate_cost(model, prompt_tokens, completion_tokens):
    """Calculate the cost of one API call in US dollars."""
    # Unknown models fall back to zero pricing, so their cost is reported as 0
    pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})

    input_cost = (prompt_tokens / 1_000_000) * pricing["input"]
    output_cost = (completion_tokens / 1_000_000) * pricing["output"]

    return input_cost + output_cost

# Calculate the cost of a single call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

usage = response.usage
cost = calculate_cost("gpt-4o", usage.prompt_tokens, usage.completion_tokens)

print(f"Prompt Tokens: {usage.prompt_tokens}")
print(f"Completion Tokens: {usage.completion_tokens}")
print(f"Cost of this call: ${cost:.6f}")

📌 Note: prices differ by model and provider and change over time; always check the official current price list

Parsing streaming responses

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Enable streaming output
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem about spring"}],
    stream=True  # turn on streaming
)

# Parse the response chunk by chunk
full_content = ""
chunk_count = 0

for chunk in stream:
    # Some chunks carry no choices or no text (for example the final one), so guard both
    if chunk.choices and chunk.choices[0].delta.content:
        content_chunk = chunk.choices[0].delta.content
        full_content += content_chunk
        chunk_count += 1
        print(f"[Chunk {chunk_count}] {content_chunk}", end="", flush=True)

print(f"\n\nTotal chunks: {chunk_count}")
print(f"Full content length: {len(full_content)} characters")

# Structure of a streaming chunk
# {
#     "id": "chatcmpl-...",
#     "object": "chat.completion.chunk",
#     "created": 1690000000,
#     "model": "gpt-4o",
#     "choices": [{
#         "index": 0,
#         "delta": {"content": "Spring", "role": "assistant"},
#         "finish_reason": null
#     }]
# }

⚠️ Streaming behavior: each chunk carries part of the content. `finish_reason` is null on every intermediate chunk and is set (for example to "stop") only on the final chunk.
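The accumulation logic can be tested without network access by feeding it stand-in chunks. The sketch below mirrors the chunk shape shown above using `SimpleNamespace`; `collect_stream` and `fake_chunk` are hypothetical helpers, and the empty-`choices` chunk at the end is included on the assumption that some configurations emit such chunks.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate delta.content across chunks and return (text, finish_reason)."""
    parts, finish_reason = [], None
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason

def fake_chunk(content=None, finish_reason=None):
    """Build a stand-in object shaped like one streaming chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)])

fake_stream = [
    fake_chunk("Spring "),
    fake_chunk("rain"),
    fake_chunk(None, "stop"),       # final chunk: no text, finish_reason set
    SimpleNamespace(choices=[]),    # chunk without choices
]
print(collect_stream(fake_stream))
```

If the returned `finish_reason` is still `None` after the loop, the stream ended early (for example a dropped connection), and the text should be treated as incomplete.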

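Handling error responses

from openai import OpenAI, RateLimitError, APIError, AuthenticationError

client = OpenAI(api_key="your-api-key")

def parse_error_response(error):
    """Classify an error and suggest how to handle it."""

    if isinstance(error, AuthenticationError):
        # Authentication error
        print(f"❌ Authentication failed: {error}")
        print("💡 Fix: check that the API key is correct and still valid")
        return {"type": "auth", "action": "check_api_key"}

    elif isinstance(error, RateLimitError):
        # Rate limit
        print(f"🚫 Rate limited: {error}")
        if "rate limit" in str(error).lower():
            print("💡 Fix: wait and retry, or request a higher quota")
        return {"type": "rate_limit", "action": "retry_after"}

    elif isinstance(error, APIError):
        # Other API errors. Only HTTP status errors carry status_code,
        # so read it defensively (connection errors have none).
        status = getattr(error, "status_code", None)
        print(f"⚠️ API error: {status} - {error.message}")
        if status is None or 500 <= status < 600:
            print("💡 Fix: server or connection problem; wait and retry")
            return {"type": "server_error", "action": "retry"}
        else:
            print("💡 Fix: check that the request parameters are correct")
            return {"type": "client_error", "action": "fix_request"}

    else:
        print(f"❓ Unknown error: {error}")
        return {"type": "unknown", "action": "investigate"}

# Usage example
try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
except Exception as e:
    error_info = parse_error_response(e)
    print(f"Error type: {error_info['type']}, suggested action: {error_info['action']}")

HTTP status code	Error type	Common causes	Suggested handling
401	Authentication error	API key invalid, expired, or missing	Check the key configuration; generate a new key
400	Bad request	Invalid parameters, malformed messages	Check the request parameter format
429	Rate limit	Request rate exceeds the limit	Lower the request rate; add delays
500/502/503	Server error	Problem on the OpenAI server side	Wait and retry; implement a retry mechanism
404	Model not found	Model name wrong or unavailable	Check that the model name is correct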
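The table maps directly onto a lookup that code can consult when it only has a raw HTTP status code, for example when calling the API through a plain HTTP client. The sketch below is one way to encode it; the type labels, advice strings, and the choice to treat 429 and 5xx as retryable follow the table, while `classify_status` itself is a hypothetical helper.

```python
# Status codes from the table that are worth retrying after a wait
RETRYABLE_STATUS = {429, 500, 502, 503}

STATUS_ADVICE = {
    400: ("bad_request", "Check the request parameter format"),
    401: ("auth", "Check the key configuration; generate a new key"),
    404: ("model_not_found", "Check that the model name is correct"),
    429: ("rate_limit", "Lower the request rate; add delays"),
    500: ("server_error", "Wait and retry"),
    502: ("server_error", "Wait and retry"),
    503: ("server_error", "Wait and retry"),
}

def classify_status(status_code):
    """Map an HTTP status code to an error type, a retry flag, and handling advice."""
    kind, advice = STATUS_ADVICE.get(status_code, ("unknown", "Inspect the error body and logs"))
    return {"type": kind, "retry": status_code in RETRYABLE_STATUS, "advice": advice}

for code in (401, 429, 503, 418):
    print(code, classify_status(code))
```

Separating the retry decision from the message text keeps the retry loop simple: it only needs to read the `retry` flag.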

Response statistics and analysis

from dataclasses import dataclass
from typing import List, Dict
from datetime import datetime

@dataclass
class APIResponseStats:
    """Statistics for one API response."""
    request_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    response_time: float  # seconds
    timestamp: datetime

class ResponseAnalyzer:
    """Response analyzer: collects statistics and monitors API usage."""

    def __init__(self):
        self.history: List[APIResponseStats] = []

    def record_response(self, response, response_time: float):
        """Record one response."""
        stats = APIResponseStats(
            request_id=response.id,
            model=response.model,
            prompt_tokens=response.usage.prompt_tokens,
            completion_tokens=response.usage.completion_tokens,
            total_tokens=response.usage.total_tokens,
            response_time=response_time,
            timestamp=datetime.now()
        )
        self.history.append(stats)

    def get_statistics(self) -> Dict:
        """Return aggregate statistics over all recorded responses."""
        if not self.history:
            return {}

        total_requests = len(self.history)
        total_tokens = sum(s.total_tokens for s in self.history)
        total_prompt = sum(s.prompt_tokens for s in self.history)
        total_completion = sum(s.completion_tokens for s in self.history)
        avg_response_time = sum(s.response_time for s in self.history) / total_requests

        return {
            "Total requests": total_requests,
            "Total tokens": total_tokens,
            "Input tokens": total_prompt,
            "Output tokens": total_completion,
            "Average response time": f"{avg_response_time:.2f}s",
            "Average tokens per request": total_tokens // total_requests
        }

    def get_model_usage(self, model_name: str) -> Dict:
        """Return usage statistics for one model."""
        model_stats = [s for s in self.history if s.model == model_name]
        if not model_stats:
            return {}

        return {
            "Model": model_name,
            "Requests": len(model_stats),
            "Total tokens": sum(s.total_tokens for s in model_stats)
        }

# Usage example
analyzer = ResponseAnalyzer()

# Record an API call (client is the OpenAI client created earlier)
import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
elapsed = time.perf_counter() - start

analyzer.record_response(response, elapsed)

# View the statistics
stats = analyzer.get_statistics()
for key, value in stats.items():
    print(f"{key}: {value}")

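✅ Best practices checklist

🔧 Parameter tuning

Accuracy tasks: temperature ≤ 0.3
Creative tasks: temperature = 0.7-1.0
Set max_tokens to prevent overly long output
Tune top_p alongside temperature

📝 Prompt design

State the role and the task requirements clearly
Use few-shot examples
Guide complex tasks step by step
Constrain the output format

🏗️ System architecture

Use RAG to reduce hallucinations
Implement retry and fault-tolerance mechanisms
Cache results for common questions
Monitor API call costs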
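Caching results for common questions can be sketched as an in-memory map keyed by a hash of the request. Everything here is an illustrative assumption: the `ResponseCache` class, the key layout, and the `call` argument, which stands for whatever function actually performs the API request. Caching makes the most sense for low-temperature requests, where a repeated question is expected to yield the same answer.

```python
import hashlib
import json

class ResponseCache:
    """In-memory cache keyed by a hash of the model, messages, and sampling parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model, messages, **params):
        """Serialize the request deterministically and hash it."""
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
            ensure_ascii=False,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call, model, messages, **params):
        """Return the cached result, or invoke `call` once and remember its result."""
        key = self.make_key(model, messages, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call(model=model, messages=messages, **params)
        self._store[key] = result
        return result

# Demo with a fake call so no network access is needed
def fake_call(model, messages, **params):
    return f"answer from {model}"

cache = ResponseCache()
question = [{"role": "user", "content": "What is the capital of Canada?"}]
cache.get_or_call(fake_call, "gpt-4o", question, temperature=0)
cache.get_or_call(fake_call, "gpt-4o", question, temperature=0)
print(cache.hits, cache.misses)
```

A production cache would also need an eviction policy and an expiry time so that stale answers do not live forever.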

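📊 Response analysis

Parse the response structure and extract the key fields
Monitor token usage and cost
Check finish_reason to confirm the output is complete
Log responses for troubleshooting
Classify and handle error responses by type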