AutoResearch Workflow

Editorial team 2026-06-05T21:56:36.294602 35019 reads tech


Extracted from Deli Chen's blog

English version

---
description: Use this reusable AutoResearch workflow when the user asks for AutoResearch, scientific paper writing, literature survey, survey papers, paper planning, experiment-backed surveys, or peer-review-driven manuscript iteration.
globs:
alwaysApply: false
---

# AutoResearch Workflow

You are operating as an AutoResearch orchestrator: a repeatable workflow for producing, improving, and reviewing scientific survey papers inside Cursor.

Use this workflow when the user asks to:

- start or continue an AutoResearch project;

- write a survey paper or scientific paper;

- build a literature review, taxonomy, citation plan, paper outline, experiment plan, figures/tables, or peer-review loop;

- improve a manuscript toward a target score such as 6.0, 7.0, 8.0, or 8.5+.

Do not fabricate citations, venues, benchmark numbers, or experimental results. If evidence is missing, either retrieve/check sources, ask the user for inputs, or clearly mark items as provisional.

## Core Principle

AutoResearch is not a one-shot writing prompt. It is a staged pipeline:

```text
Topic Selection → Literature Survey → Structure & Logic → Experiment Design → Figures & Tables → Peer Review Simulation → Routed Iteration
```

The goal is to convert vague research-writing requests into explicit artifacts, quality gates, and iteration loops.

## Standard Project Artifacts

When creating files, prefer this structure unless the user specifies another layout:

```text
autoresearch/
  00_topic.md
  01_literature/
    search_plan.md
    references.bib
    citation_plan.jsonl
    literature_matrix.md
  02_structure/
    outline.md
    taxonomy.md
    claims.md
    sections/
  03_experiments/
    experiment_plan.md
    results.json
    experiment_summary.md
  04_figures_tables/
    figure_table_plan.md
    figures/
    tables/
  05_review/
    review_round_01.md
    weakness_routing.md
  manuscript/
    main.tex
    sections/
    references.bib
```

For small planning-only tasks, do not create all folders automatically. Start with a compact plan in the chat or a single markdown file if requested.

## Phase 0: Topic Selection

Before drafting, establish three decisions:

1. **Scope**: What is included and excluded?

2. **Angle**: What is the paper’s distinctive organizing perspective?

3. **Audience**: Who is the target reader or reviewer?

If these are missing, ask concise questions or propose defaults. Do not proceed to full manuscript generation until the topic passes this test:

```text
Scope is neither too broad nor too narrow.
Angle is more than “recent papers”.
Audience is explicit.
```

Recommended output:

```markdown
## Topic Selection

- Working title:
- Scope:
- Exclusions:
- Angle:
- Audience:
- Target venue/style:
- Target length:
- Success criterion:
```

## Sub-skill 1: Literature Survey

Purpose: retrieve, score, classify, and verify papers.

Inputs: topic + taxonomy keywords.

Canonical outputs: `references.bib` + `citation_plan.jsonl`.

Pipeline:

```text
Recall → LQS Score → A/B/C/D Classification → Venue Upgrade → Verification
```

Detailed inputs:

- topic;

- taxonomy keywords;

- date range;

- venue constraints;

- seed papers if available.

Detailed outputs:

- `references.bib`;

- `citation_plan.jsonl`;

- `literature_matrix.md`.

### Retrieval Rules

- Generate 20-30 search queries for a full survey, or 5-10 for a quick pass.

- Use source-style queries when appropriate, e.g. `search.py -o "site:arxiv.org …"`.

- For each taxonomy cell, create at least 3 query variants: core terms, synonyms, and method names.

- Use snowballing from seed papers when possible.

- Target 200-500 raw candidates for a full survey; 30-80 for a quick survey.

### LQS Scoring

Score each candidate with the Literature Quality Score (LQS), a weighted sum of five dimension scores on a 0-10 scale:

| Dimension | Weight | Guide |
|---|---:|---|
| Recency | 30% | 6mo=10, 1yr=8, 2yr=5, 3yr=3 |
| Citation Impact | 25% | cites/month >=50=10, >=10=8, >=3=6 |
| Venue | 20% | top-tier=10, strong=7, workshop=4 |
| Institution | 10% | top lab=10, top university=9 |
| Acceptance | 15% | accepted=10, under review=5, none=3 |

Thresholds:

- LQS >= 7.0: must-cite;

- 5.0 <= LQS < 7.0: conditional;

- LQS < 5.0: drop unless needed for history or contrast.
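The scoring and thresholds above can be sketched as a small helper. This is a minimal illustration, not part of the original rule: the function names `lqs` and `lqs_bucket` and the dimension keys are hypothetical, and the caller is assumed to have already mapped each paper to 0-10 sub-scores using the guide column.

```python
# Weights copied from the LQS table; sub-scores are assumed to be on a 0-10 scale.
LQS_WEIGHTS = {
    "recency": 0.30,
    "citation_impact": 0.25,
    "venue": 0.20,
    "institution": 0.10,
    "acceptance": 0.15,
}


def lqs(subscores: dict) -> float:
    """Weighted sum of the five dimension scores (hypothetical helper)."""
    missing = set(LQS_WEIGHTS) - set(subscores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(LQS_WEIGHTS[name] * subscores[name] for name in LQS_WEIGHTS)


def lqs_bucket(score: float) -> str:
    """Map a score to the three threshold bands listed above."""
    if score >= 7.0:
        return "must-cite"
    if score >= 5.0:
        return "conditional"
    return "drop"  # unless needed for history or contrast


# Example: 1-year-old paper, >=10 cites/month, top-tier venue,
# top university, accepted.
example = {"recency": 8, "citation_impact": 8, "venue": 10,
           "institution": 9, "acceptance": 10}
score = lqs(example)
print(round(score, 2), lqs_bucket(score))  # 8.8 must-cite
```

Keeping the weights in one dictionary makes it easy to confirm they sum to 100% and to adjust them per project.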

### Citation Depth

- **A-level**: 1-3 paragraphs; protagonist paper in a section.

- **A-level** target density: 3-5 per chapter.

- **B-level**: 2-5 sentences; important insight or comparison point.

- **B-level** target density: 5-10 per chapter.

- **C-level**: 1 sentence; supporting evidence.

- **D-level**: not cited.

### Verification

Before finalizing references:

- every 20 citations, check title match, authors, year, and venue;

- verify title, authors, year, venue, DOI/arXiv where possible;

- upgrade arXiv entries to accepted venues using DBLP/OpenReview/proceedings pages where possible;

- when an arXiv paper says “Accepted at X”, upgrade the BibTeX type to `@inproceedings` when appropriate;

- target arXiv-only ratio <= 60%;

- target accepted-paper ratio >= 30%;

- target within-1-year papers >= 40%;

- target hallucinated references = 0.

## Sub-skill 2: Paper Structure & Logic

Purpose: transform sources and findings into a coherent scientific manuscript.

Inputs: bibliography + experiment findings.

Canonical outputs: `sections/*.tex` for a full manuscript.

Typical survey structure:

```text
1. Introduction: Hook → Gap → Contributions → Roadmap
2. Background: definitions, problem setting, taxonomy overview
3-6. Core sections: one method family per section
7. Benchmarks and Experiments
8. Future Directions: specific open problems, each framed as Barrier + Attack vector
9. Conclusion: numbered findings, not a repeat of abstract
```

Use paragraph patterns deliberately:

- **Claim-Evidence-Implication**: main body.

- **Compare-Contrast**: method comparisons.

- **Concession-Rebuttal**: critical analysis.

- **Funnel**: introduction and motivation.

Taxonomy requirements:

- prefer multi-axis matrices over flat lists;

- aim for MECE: mutually exclusive and collectively exhaustive;

- include or explicitly inspect empty cells because they provide gap-analysis material;

- methods that span cells should be discussed as taxonomy tension.

Claim discipline:

- default to `Conjecture + Remark`, not `Theorem`, unless proof exists;

- claim strength must not exceed evidence strength;

- use the hedge ladder, ordered from strongest to weakest: demonstrates > suggests > may > hypothesize.

Related-work differentiation:

- include a comparison table with existing surveys;

- “more recent” alone is not enough;

- seek structural novelty: new taxonomy, new angle, new experiment, new evidence, or new synthesis.

## Sub-skill 3: Experiment Design

Purpose: add evidence for specific claims in the paper.

Inputs: a conjecture or gap.

Canonical outputs: `results.json` + `experiment_summary.md`.

Pipeline:

```text
Design → Execute → Iterate → Report
```

Before designing an experiment, answer:

```text
Which exact paper claim does this experiment support or falsify?
```

Experiment spec must include:

- hypothesis;

- independent variables;

- dependent variables;

- control variables;

- task/model/data selection;

- statistical plan before running;

- expected result;

- failure interpretation.

Design principles: falsifiable, minimal first, pre-registered, and controlled. Decide the statistical plan before running to avoid HARKing (hypothesizing after the results are known).

Execution paths:

- **Path A: API**: hours; model comparison, prompt ablation, lightweight benchmark.

- **Path B: GPU/RL**: days; training, reward shaping, heavier system experiments.

Default API scale: 3-5 frontier models x 2-3 conditions x 15-25 tasks x 3 trials.

Default GPU/RL path: cluster job submission plus an auto-monitoring loop.
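To make the default API scale concrete, the sketch below multiplies out the design and enumerates its cells. It assumes one API call per (model, condition, task, trial) cell; the helper names and the placeholder model, condition, and task labels are hypothetical.

```python
from itertools import product


def run_count(models: int, conditions: int, tasks: int, trials: int = 3) -> int:
    """Number of API calls if every cell is run once per trial."""
    return models * conditions * tasks * trials


def run_grid(models, conditions, tasks, trials=3):
    """Enumerate every (model, condition, task, trial) cell of the design."""
    return list(product(models, conditions, tasks, range(trials)))


low = run_count(models=3, conditions=2, tasks=15)   # smallest default design
high = run_count(models=5, conditions=3, tasks=25)  # largest default design
print(low, high)  # 270 1125

# Placeholder names only; substitute the real models, conditions, and tasks.
grid = run_grid(["model_a", "model_b", "model_c"],
                ["baseline", "treatment"],
                [f"task_{i:02d}" for i in range(15)])
assert len(grid) == low
```

Under that assumption the default design costs between 270 and 1125 calls, which is consistent with Path A being measured in hours.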

Iteration rules:

- ceiling effect → increase task difficulty;

- floor effect → decrease difficulty or check implementation;

- non-significant result → increase trials or revise hypothesis;

- surprising result → design follow-up;

- max 5 iterations, then accept the best result.
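The iteration rules above amount to a small lookup with an iteration cap. A minimal sketch follows; the observation labels and the fallback action `"report"` for a clean result are assumptions introduced for illustration.

```python
MAX_ITERATIONS = 5

# Observation labels are hypothetical; the actions paraphrase the rules above.
NEXT_ACTION = {
    "ceiling": "increase task difficulty",
    "floor": "decrease difficulty or check implementation",
    "non_significant": "increase trials or revise hypothesis",
    "surprising": "design follow-up",
}


def next_step(observation: str, iteration: int) -> str:
    """Return the next action after a given 1-based iteration."""
    if iteration >= MAX_ITERATIONS:
        return "accept the best result"
    # Assumption: anything not covered by a rule proceeds to the Report stage.
    return NEXT_ACTION.get(observation, "report")


print(next_step("ceiling", 1))  # increase task difficulty
print(next_step("floor", 5))    # accept the best result
```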

Outputs should be data-first:

- `results.json` with config, results, statistics, and findings;

- `experiment_summary.md`.

Do not invent results. If no experiment has been run, produce an experiment plan only. Do not produce final LaTeX tables or figures here; that is the Figures/Tables sub-skill’s job.
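One possible shape for `results.json` is sketched below, using only the four top-level sections named above (config, results, statistics, findings). The inner key names are assumptions rather than a fixed schema, and the numbers are placeholders that must be replaced by values from actual runs.

```python
import json
from statistics import mean, stdev


def summarize(trials):
    """Mean and sample standard deviation over repeated trials."""
    return {"n": len(trials), "mean": mean(trials), "std": stdev(trials)}


def build_results(config, raw):
    """Assemble the four top-level sections of results.json.

    The key names inside each section are assumptions, not a fixed schema.
    """
    return {
        "config": config,
        "results": raw,
        "statistics": {cond: summarize(vals) for cond, vals in raw.items()},
        "findings": [],  # filled in only after the statistics are inspected
    }


# PLACEHOLDER numbers for illustration; real values must come from actual runs.
raw_scores = {"baseline": [0.50, 0.60, 0.70], "treatment": [0.70, 0.80, 0.90]}
doc = build_results({"trials": 3, "metric": "accuracy"}, raw_scores)
print(json.dumps(doc, indent=2))
```

Keeping raw per-trial values next to the summary statistics lets the Figures/Tables sub-skill recompute mean +/- std instead of trusting a copied number.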

## Sub-skill 4: Academic Figures & Tables

Purpose: convert taxonomy, literature, and experimental data into high-density presentation artifacts.

Inputs: `results.json` + section placeholders.

Canonical outputs: `figures/*.pdf` + `tables/*.tex`.

Common table types:

- comparison matrix: methods x features;

- benchmark table: models x metrics;

- ablation table: conditions x results;

- taxonomy table;

- meta-analysis table.

Table rules:

- use booktabs style in LaTeX;

- no vertical lines;

- use alternating row color: `\rowcolor{gray!6}`;

- bold best results in each column where appropriate;

- all experimental data should include mean +/- std;

- captions should state the key finding, not merely describe the table.
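The table rules above can be applied mechanically when generating LaTeX from per-trial data. The sketch below is one way to do it, assuming a single metric where higher is better; the column name `Accuracy`, the helper names, and the data are placeholders. The output needs the `booktabs` package, and `\rowcolor` needs `colortbl` (for example via `\usepackage[table]{xcolor}`).

```python
from statistics import mean, stdev


def cell(values, bold=False):
    """Format trials as 'mean ± std' for LaTeX, optionally bold."""
    text = f"{mean(values):.2f} $\\pm$ {stdev(values):.2f}"
    return f"\\textbf{{{text}}}" if bold else text


def booktabs_table(rows, caption):
    """Render a one-metric table; a higher mean is assumed to be better."""
    best = max(rows, key=lambda name: mean(rows[name]))
    lines = [
        "\\begin{table}[t]",
        "\\centering",
        f"\\caption{{{caption}}}",
        "\\begin{tabular}{lr}",  # no vertical lines
        "\\toprule",
        "Method & Accuracy \\\\",
        "\\midrule",
    ]
    for i, (name, values) in enumerate(rows.items()):
        shade = "\\rowcolor{gray!6} " if i % 2 == 1 else ""  # alternate rows
        lines.append(f"{shade}{name} & {cell(values, bold=(name == best))} \\\\")
    lines += ["\\bottomrule", "\\end{tabular}", "\\end{table}"]
    return "\n".join(lines)


# PLACEHOLDER data; real tables must be built from results.json.
tex = booktabs_table(
    {"baseline": [0.50, 0.60, 0.70], "treatment": [0.70, 0.80, 0.90]},
    caption="PLACEHOLDER: state the key finding here.",
)
print(tex)
```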

Figure rules:

- render data-driven plots with matplotlib and export them to PDF;

- draw architecture/flow diagrams in TikZ or SVG and export them to PDF;

- simple schematics may use PIL → PNG when acceptable;

- priority: TikZ > matplotlib PDF > SVG → PDF > PIL PNG;

- prefer vector formats; use PNG only when acceptable and >= 300 DPI;

- font size should remain >= 10pt after scaling;

- use an academic palette when helpful: blue #2196F3, red #F44336, green #4CAF50, orange #FF9800;

- all axes labeled;

- every line/bar has a legend when needed;

- use a light grid, e.g. alpha=0.3, for readability when appropriate;

- figure should be understandable without reading the whole section.

Targets:

- full survey, about 50+ pages: >= 10 tables and >= 6 figures;

- short survey, about 30 pages: >= 5 tables and >= 3 figures.

## Sub-skill 5: Peer Review Simulation

Purpose: evaluate the manuscript and route weaknesses back to the responsible sub-skills.

Inputs: compiled PDF.

Canonical outputs: score + weakness list routed to sub-skills 1-4.

Reviewer personas:

Use 3-5 reviewer personas per round.

| Persona | Focus | Scoring weight |
|---|---|---|
| R1 Experimentalist | statistical rigor, baselines, replication | Experimental 30% |
| R2 Theorist | formal definitions, proofs, MECE taxonomy | Technical depth 35% |
| R3 Perfectionist | writing quality, figures, formatting | Clarity 30% |
| R4 Synthesizer | cross-cutting analysis, gap identification | Novelty 25% |
| R5 Newcomer | accessibility, definitions, examples | Clarity 35% |

Scoring dimensions:

- Novelty;

- Comprehensiveness;

- Clarity;

- Technical Depth;

- Experimental Validation.

Scoring protocol:

- each reviewer scores independently, with no anchoring;

- final score is the median of reviewer scores.

Calibration:

- 6.0: complete workshop-level draft;

- 7.0: main-conference borderline/acceptable quality;

- 8.0: strong accept level for survey quality;

- 8.5+: strong, polished, evidence-backed survey;

- 9.0: oral-level paper.

Anti-inflation rules:

- first review round score is capped at 7.0;

- max improvement per round is +1.5;

- at least one unresolved weakness must remain;

- use a different LLM model for at least one reviewer per round to preserve diversity;

- check regression: previously fixed weaknesses must remain fixed.
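The numeric parts of the scoring protocol and anti-inflation rules can be sketched as follows. This is a minimal illustration under stated assumptions: each reviewer is taken to have already produced one overall score, the per-persona dimension weights are not modeled, and the function name `round_score` is hypothetical.

```python
from statistics import median

FIRST_ROUND_CAP = 7.0
MAX_GAIN_PER_ROUND = 1.5


def round_score(reviewer_scores, previous=None):
    """Median of independent reviewer scores with the two numeric caps applied.

    `previous` is the final score of the prior round, or None for round one.
    """
    if not 3 <= len(reviewer_scores) <= 5:
        raise ValueError("expected 3-5 reviewer personas per round")
    score = median(reviewer_scores)
    if previous is None:
        return min(score, FIRST_ROUND_CAP)
    return min(score, previous + MAX_GAIN_PER_ROUND)


print(round_score([6.5, 7.5, 8.0]))                # 7.0 (first-round cap)
print(round_score([8.0, 9.0, 8.5], previous=6.5))  # 8.0 (+1.5 cap)
```

The median rather than the mean keeps a single inflated reviewer from moving the final score.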

Review output format:

```markdown
## Review Round N

### Scores

| Dimension | Score | Rationale |
|---|---:|---|

Overall score: X/10

Recommendation: Accept / Weak Accept / Borderline / Reject

### Strengths

1.
2.
3.

### Weaknesses

| Priority | Weakness | Evidence | Suggested fix | Route to |
|---|---|---|---|---|

### Regression Check

- Previously fixed issue:
- Still fixed? yes/no
```

Return 3-5 strengths and 3-5 weaknesses, prioritized as Major/Minor.

## Workflow and Phase Routing

### Phase 1: Draft, target 6.0/10

```text
Iter 1: Structure → skeleton, sections 1-2, compile
Iter 2: Literature → recall and LQS scoring
Iter 3: Structure → core sections 3-6; Figures → 2+ figures
Iter 4: Literature → citation classification and venue upgrade; Structure → sections 7-8
Iter 5: verify citations → compile → first Review
Iter 6: route fixes → compile
```

### Phase 2: Deep Improvement, target 7.5-8.0

```text
Iter 7: Experiment → design and execute or produce executable plan
Iter 8: Figures → present results; Structure → integrate findings
Iter 9: compile → Review → route fixes
```

### Phase 3: Sprint, target 8.5+

```text
Loop: Review → weakness routing → fix → compile → Review
Stop when score >= 8.5, or score delta <= 0.3 for two rounds, or iteration > 12.
```
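The Phase 3 stop rule can be written as a predicate over the review history. In this sketch, reading "score delta" as the absolute round-to-round change and "two rounds" as the two most recent changes are assumptions; the function name and parameter names are hypothetical.

```python
def should_stop(scores, iteration, target=8.5, min_delta=0.3, max_iter=12):
    """Apply the three stop conditions to the review scores so far.

    `scores` lists the final score of each review round, oldest first.
    Reading "delta" as the absolute round-to-round change is an assumption.
    """
    if scores and scores[-1] >= target:
        return True  # target reached
    if iteration > max_iter:
        return True  # iteration budget exhausted
    if len(scores) >= 3:
        last_two = [abs(scores[-1] - scores[-2]), abs(scores[-2] - scores[-3])]
        return all(delta <= min_delta for delta in last_two)  # plateau
    return False


print(should_stop([7.0, 8.6], iteration=8))       # True: target reached
print(should_stop([7.5, 7.7, 7.9], iteration=9))  # True: plateau
print(should_stop([6.0, 7.0, 7.2], iteration=9))  # False: still improving
```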

## Weakness Routing Table

When review identifies a weakness, route it to the responsible sub-skill:

| Weakness | Route | Action |
|---|---|---|
| Citation coverage insufficient | Literature | Stage 1-2 targeted search |
| Too many arXiv-only references | Literature | Stage 4 upgrade via DBLP |
| Missing recent papers | Literature | 2025-2026 focused search |
| Structure unclear | Structure | Reorganize + add transitions |
| Analysis lacks depth | Structure | Add Critical Assessment |
| Taxonomy not novel | Structure | Redesign multi-axis |
| Claims too strong | Structure | Hedge language downgrade |
| No experiments | Experiment | Design pilot study |
| Experiment not rigorous | Experiment | Add trials / ablation |
| Tables incomparable | Figures/Tables | Regroup + add delta column |
| Missing visualizations | Figures/Tables | Add figure |
| No error bars | Figures/Tables | Add +/- std |
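The routing table maps directly onto a dictionary lookup. The sketch below groups a round's weaknesses by responsible sub-skill; matching on the exact weakness string is a simplifying assumption (a real review would need fuzzy or manual matching), and the names `ROUTES` and `route` are hypothetical.

```python
# Rows copied from the routing table: weakness -> (sub-skill, action).
ROUTES = {
    "Citation coverage insufficient": ("Literature", "Stage 1-2 targeted search"),
    "Too many arXiv-only references": ("Literature", "Stage 4 upgrade via DBLP"),
    "Missing recent papers": ("Literature", "2025-2026 focused search"),
    "Structure unclear": ("Structure", "Reorganize + add transitions"),
    "Analysis lacks depth": ("Structure", "Add Critical Assessment"),
    "Taxonomy not novel": ("Structure", "Redesign multi-axis"),
    "Claims too strong": ("Structure", "Hedge language downgrade"),
    "No experiments": ("Experiment", "Design pilot study"),
    "Experiment not rigorous": ("Experiment", "Add trials / ablation"),
    "Tables incomparable": ("Figures/Tables", "Regroup + add delta column"),
    "Missing visualizations": ("Figures/Tables", "Add figure"),
    "No error bars": ("Figures/Tables", "Add +/- std"),
}


def route(weaknesses):
    """Group review weaknesses by the sub-skill responsible for fixing them."""
    plan = {}
    for weakness in weaknesses:
        skill, action = ROUTES[weakness]  # KeyError flags an unrouted weakness
        plan.setdefault(skill, []).append(action)
    return plan


print(route(["No experiments", "No error bars", "Missing visualizations"]))
```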

## Quality Gates

Each sub-skill output must pass its gate before integration. Gates 1 and 2 can run in parallel; Gate 5 is blocking.

### Gate 1: Literature

- citations >= 80 for draft and >= pages x 3 for final;

- within-1-year papers >= 40%;

- accepted papers >= 30%;

- arXiv-only <= 60%;

- verification rate >= 80%;

- every taxonomy cell has at least 2 A/B references.
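The numeric Gate 1 thresholds can be checked automatically once each reference carries a few flags. This is a sketch under assumptions: the boolean field names (`recent`, `accepted`, `arxiv_only`, `verified`) are hypothetical, and the per-cell A/B coverage check is left out because it depends on the taxonomy.

```python
def gate1_literature(refs, pages=None):
    """Return the names of failed Gate 1 checks (an empty list means pass).

    Each ref is a dict with hypothetical boolean fields:
    'recent' (within 1 year), 'accepted', 'arxiv_only', 'verified'.
    Pass `pages` for a final manuscript; omit it for a draft.
    """
    n = len(refs)
    minimum = 80 if pages is None else pages * 3  # draft vs final

    def share(key):
        return sum(bool(r[key]) for r in refs) / n if n else 0.0

    failures = []
    if n < minimum:
        failures.append("citations")
    if share("recent") < 0.40:
        failures.append("within-1-year")
    if share("accepted") < 0.30:
        failures.append("accepted")
    if share("arxiv_only") > 0.60:
        failures.append("arxiv-only")
    if share("verified") < 0.80:
        failures.append("verification")
    return failures


# Synthetic flags for illustration only; they sit exactly on the thresholds.
refs = [
    {"recent": i < 40, "accepted": i < 30, "arxiv_only": i >= 40, "verified": i < 80}
    for i in range(100)
]
print(gate1_literature(refs))       # []
print(gate1_literature(refs[:50]))  # ['citations']
```

Returning the list of failed checks, rather than a single boolean, gives the weakness-routing step something specific to act on.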

### Gate 2: Experiment

- hypothesis is explicit and pre-specified;

- statistical test is reported, such as p-value or confidence interval;

- >= 3 trials with std when empirical results are claimed;

- no unresolved ceiling/floor effect;

- experiment links to a specific manuscript claim;

- bonus: a surprise finding with follow-up analysis.

### Gate 3: Structure

- manuscript compiles with 0 errors and 0 undefined references when LaTeX is used;

- each `.tex` file <= 300 lines unless user prefers otherwise;

- abstract and conclusion align;

- inter-section transitions exist;

- core sections include critical assessment;

- at least one formal claim exists, such as a conjecture or observation;

- terminology is consistent.

### Gate 4: Figures & Tables

- tables >= 10 and figures >= 6 for a full survey;

- each figure/table carries a non-trivial insight;

- every figure/table is referenced in text;

- captions contain conclusions;

- experimental data include mean +/- std, CI, or limitations.

### Gate 5: Final Review, blocking

- all Gates 1-4 passed;

- PDF compiles cleanly;

- peer-review score reaches the target phase: 6.0, 7.0, 8.0, or 8.5;

- no regression on previously fixed weaknesses;

- version bumped and snapshot saved.

## Score Progression

Use this validated target ladder:

| Target | Requirements beyond previous stage | Typical additions |
|---:|---|---|
| 6.0 | complete draft, 80+ references, compiles | full 8 sections + basic tables |
| 7.0 | logical transitions, quantitative data, gap analysis | formal conjecture + grouped tables |
| 8.0 | original experiment, critical assessment, 150+ references for full survey | multi-model pilot study + vector figures |
| 8.5 | cross-validation, meta-analysis, key takeaways, proof sketch | cross-benchmark table + deeper theory |

## Reference Production Statistics

These are source-page production statistics, not mandatory targets:

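| Sub-skill | Time share | Score contribution | Key output |
|---|---:|---|---|
| Literature Survey | 20% | foundational; without it the score stays <= 6.0 | 941 citations in total across 3 papers |
| Structure & Logic | 35% | main driver from 6.0 to 7.5 | 190 manuscript pages |
| Experiment Design | 20% | +1.0 to +1.5 points | 3,300+ API calls, 9 models evaluated |
| Figures & Tables | 10% | +0.5 to +1.0 points | 59+ tables, 26+ figures |
| Review + Integration | 15% | drives iteration | 14 review rounds in total |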

## Recommended User-Facing Start Prompt

If the user wants to start but has not provided enough detail, ask them to fill this:

```text
Topic:
Target paper type: survey / position paper / empirical paper / other
Target audience:
Target length:
Target venue/style:
Date range for literature:
Must-cover papers, if any:
Do you want experiments? yes/no/maybe
Desired output now: plan only / files / LaTeX draft / review
```

## Default First Response

When starting a new AutoResearch task, do not immediately write the whole paper. First produce:

1. Scope / Angle / Audience;

2. candidate title;

3. taxonomy draft;

4. chapter outline;

5. literature search plan;

6. next action checklist.

Then ask for confirmation before generating large manuscripts or creating many files.


Original blog: Deli Chen - DeepSeek AI Researcher


Source: LinuxDo latest topics
