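Lab instrument data conversion: how it works

This chapter describes the tool's technical architecture and runtime behavior in depth. If you only want to use the tool, you can skip it. It is useful if you want to understand how the tool works, extend it, or troubleshoot a problem.

Overall architecture

The tool uses a three-layer architecture: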

┌─────────────────────────────────────────────────┐
│  User interface layer (CLI scripts)             │
│  convert_to_asm.py, validate_asm.py, etc.       │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│  Core processing layer (conversion engine)      │
│  - Instrument detection                         │
│  - Data parsing                                 │
│  - Format conversion                            │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────┐
│  Data layer (allotropy + custom)                │
│  - Native parsers                               │
│  - Fallback parsers                             │
│  - PDF extractor                                │
└─────────────────────────────────────────────────┘

Conversion flow in detail

Complete workflow

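Input file
                │
                ▼
┌───────────────────────────────┐
│ 1. File type detection        │
│ (extension + content)         │
└───────────────┬───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ 2. Instrument identification  │
│ (feature matching)            │
└───────────────┬───────────────┘
                │
      ┌─────────┴─────────┐
      │                   │
      ▼                   ▼
┌─────────────┐ ┌───────────────┐
│Native parser│ │Fallback parser│
│(preferred)  │ │(backup)       │
└─────┬───────┘ └─────────┬─────┘
      │                   │
      └─────────┬─────────┘
                │
                ▼
┌───────────────────────────────┐
│ 3. Data classification        │
│ (raw vs. calculated)          │
└───────────────┬───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ 4. Structured output          │
│ (ASM JSON)                    │
└───────────────┬───────────────┘
                │
      ┌─────────┴─────────┐
      │                   │
      ▼                   ▼
┌─────────────┐ ┌───────────────┐
│JSON         │ │Flattened CSV  │
│output       │ │output         │
└─────────────┘ └───────────────┘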

Step 1: File type detection

The tool first checks the file extension and the file header to determine the file format.

import os

def detect_file_type(file_path):
  # Check the extension first
  ext = os.path.splitext(file_path)[1].lower()

  if ext in ('.xlsx', '.xls'):
    return 'excel'
  elif ext == '.csv':
    return 'csv'
  elif ext == '.pdf':
    return 'pdf'
  elif ext == '.txt':
    return 'text'
  else:
    # Unknown extension: try to decide from the file content
    return detect_by_content(file_path)

Supported file types:

Excel (.xlsx, .xls): uses openpyxl or pandas
CSV (.csv): uses pandas
PDF (.pdf): uses pdfplumber
Text (.txt): uses pandas read_fwf
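
The content-based fallback is not shown in this chapter. As a rough sketch only, detect_by_content could inspect the leading bytes of the file. The magic numbers below are standard (PDF, ZIP, OLE2), but the return values for text and the "unknown" result are assumptions, not the tool's actual behavior.

```python
def detect_by_content(file_path):
    """Guess the file type from its leading bytes (illustrative sketch)."""
    with open(file_path, "rb") as f:
        head = f.read(2048)

    if head.startswith(b"%PDF-"):
        return "pdf"
    # .xlsx is a ZIP container and legacy .xls is an OLE2 compound file.
    # Other ZIP-based formats would also match, so this is only a first guess.
    if head.startswith(b"PK\x03\x04") or head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "excel"
    # NUL bytes strongly suggest a binary format we do not recognize
    if b"\x00" in head:
        return "unknown"

    text = head.decode("utf-8", errors="replace")
    first_line = text.splitlines()[0] if text.strip() else ""
    # A comma in the first line is treated as a hint for delimited data
    return "csv" if "," in first_line else "text"
```

A production version would probably also try other delimiters (tab, semicolon) before settling on plain text.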

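Step 2: Instrument type identification

The tool identifies the instrument type with a "feature fingerprint" matching algorithm.

Identification features include:

Column name patterns: specific instruments have distinctive column names

Vi-CELL: "Total Viable Cells", "Viability %"
NanoDrop: "A260", "A280", "260/280"

Data ranges: different instruments produce values in different ranges

Cell counts: 10^4 - 10^8 cells/mL
Absorbance: 0 - 4 AU

File structure: file layouts differ between vendors

TapeStation: XML format with a specific tag structure
SoftMax Pro: multiple worksheets with specific sheet names

Metadata keywords: vendor-specific terms that appear in the file

Identification confidence:

High confidence (95%+): multiple features match; the result is used automatically
Medium confidence (70-95%): some features match; the user is prompted to confirm
Low confidence (<70%): the instrument cannot be identified; the generic parser is used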

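def identify_instrument(file_content):
  scores = {}

  # Compute a match score for every known instrument type
  for instrument in KNOWN_INSTRUMENTS:
    score = calculate_match_score(file_content, instrument)
    scores[instrument] = score

  # Pick the instrument type with the highest score
  best_match = max(scores, key=scores.get)

  # Thresholds are inclusive so that exactly 95% counts as high confidence
  if scores[best_match] >= 0.95:
    return best_match, 'high'
  elif scores[best_match] >= 0.70:
    return best_match, 'medium'
  else:
    # No reliable match: the caller falls back to the generic parser
    return None, 'low'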
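
The scoring function itself is not shown. One simple way to score the column-name feature alone is the fraction of an instrument's signature columns that appear in the file. The sketch below makes that assumption: SIGNATURES is a hypothetical table built from the column names listed earlier, and this version takes a list of column names rather than the raw file content. The real scorer presumably also weighs data ranges, file structure, and metadata keywords.

```python
# Hypothetical fingerprints, using only the column names listed in this chapter
SIGNATURES = {
    "Vi-CELL": {"Total Viable Cells", "Viability %"},
    "NanoDrop": {"A260", "A280", "260/280"},
}

def calculate_match_score(columns, instrument):
    """Return the fraction of the instrument's signature columns found."""
    expected = {name.lower() for name in SIGNATURES[instrument]}
    found = {str(col).strip().lower() for col in columns}
    return len(expected & found) / len(expected)

cols = ["Sample ID", "A260", "A280", "260/280"]
print(calculate_match_score(cols, "NanoDrop"))  # 1.0
print(calculate_match_score(cols, "Vi-CELL"))   # 0.0
```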

Step 3: Data parsing

Tier 1: Native parser (preferred)

Uses the official parsers from the allotropy library.

from allotropy.parser_factory import Vendor
from allotropy.to_allotrope import allotrope_from_file

# Use the native parser
asm = allotrope_from_file(file_path, Vendor.BECKMAN_VI_CELL_BLU)

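Advantages:

Complete metadata extraction
Conforms to the Allotrope standard
Includes calculated-data traceability

Some of the supported vendors:

AGILENT_TAPESTATION_ANALYSIS
BECKMAN_VI_CELL_BLU
THERMO_FISHER_NANODROP_EIGHT
MOLDEV_SOFTMAX_PRO
APPBIO_QUANTSTUDIO
... and dozens more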

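Tier 2: Fallback parser (backup)

When no native parser is available, the tool uses a custom flexible parser.

How it works:

Read the file with pandas
Fuzzy-match the column names
Extract unit information from the column names
Extract metadata from the file structure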

import pandas as pd

def flexible_parse(file_path):
  # Read the file (the Excel case is shown here)
  df = pd.read_excel(file_path)

  # Fuzzy-match the column names
  column_mapping = fuzzy_match_columns(df.columns)

  # Extract units and metadata
  units = extract_units(df.columns)
  metadata = extract_metadata(file_path, df)

  # Build the ASM structure
  asm = build_asm_structure(df, column_mapping, units, metadata)

  return asm
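
The helpers called above are not defined in this chapter. As an illustration of the two column-handling steps, the sketch below uses difflib from the standard library for fuzzy matching and a regular expression for units in trailing parentheses or brackets. The CANONICAL list, the cutoff value, and the unit convention are all assumptions; the tool's real matching rules may differ.

```python
import difflib
import re

# Hypothetical canonical field names; the real list is not shown in this chapter
CANONICAL = ["total viable cells", "viability", "sample id", "absorbance"]

_UNIT_SUFFIX = re.compile(r"\s*[\(\[]([^\)\]]+)[\)\]]\s*$")

def extract_units(columns):
    """Map each column name to the unit in its trailing (...) or [...], if any."""
    units = {}
    for col in columns:
        match = _UNIT_SUFFIX.search(col)
        units[col] = match.group(1).strip() if match else None
    return units

def fuzzy_match_columns(columns, cutoff=0.6):
    """Map each column name to its closest canonical name, or None."""
    mapping = {}
    for col in columns:
        # Drop a trailing unit such as " (cells/mL)" before comparing
        base = _UNIT_SUFFIX.sub("", col).strip().lower()
        hits = difflib.get_close_matches(base, CANONICAL, n=1, cutoff=cutoff)
        mapping[col] = hits[0] if hits else None
    return mapping

cols = ["Total Viable Cells (cells/mL)", "Viability (%)", "Sample_ID"]
print(extract_units(cols))
print(fuzzy_match_columns(cols))
```

Columns that map to None can be kept under their original names or reported to the user, depending on how strict the conversion should be.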

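Limitations:

Does not include calculated-data-aggregate-document
Metadata may be incomplete
Simplified structure

Tier 3: PDF extraction

For PDF files, the tool extracts the tables first and then applies the fallback parser.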

import pdfplumber

def extract_pdf_tables(pdf_path):
  with pdfplumber.open(pdf_path) as pdf:
    all_tables = []

    for page in pdf.pages:
      tables = page.extract_tables()
      all_tables.extend(tables)

    return all_tables
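
Each table returned by extract_tables is a list of rows, and each row is a list of cell strings (or None for empty cells). Before the fallback parser can work with it, the table needs a header. The sketch below assumes the first row is the header, which will not hold for every PDF; table_to_records is a hypothetical helper name.

```python
def table_to_records(table):
    """Convert one extracted table (a list of rows) into a list of dicts."""
    if not table:
        return []
    # Assumption: the first row holds the column names
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Skip rows in which every cell is empty
        if not any(cell and cell.strip() for cell in row):
            continue
        records.append({h: (c.strip() if c else None) for h, c in zip(header, row)})
    return records

table = [["Sample", "A260"], ["S1", "1.25"], [None, ""], ["S2", None]]
print(table_to_records(table))
# [{'Sample': 'S1', 'A260': '1.25'}, {'Sample': 'S2', 'A260': None}]
```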

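Step 4: Data classification

The tool distinguishes raw data from calculated data.

Classification rules:

Data type	Characteristic	Examples
Raw data	Measured directly by the instrument	Cell count, absorbance, fluorescence intensity
Calculated data	Derived through a formula	Viability %, concentration, ratios

Calculated-data traceability: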

{
  "calculated-data-aggregate-document": {
    "calculated-data-document": [{
      "calculated-data-name": "viability_percent",
      "calculated-result": {"value": 93.3, "unit": "%"},
      "data-source-aggregate-document": {
        "data-source-document": [
          {
            "data-source-identifier": "viable_cell_count",
            "data-source-feature": "direct_count"
          },
          {
            "data-source-identifier": "total_cell_count",
            "data-source-feature": "direct_count"
          }
        ]
      }
    }]
  }
}

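This records:

The name of the calculated value (viability_percent)
The calculated result (93.3%)
The data sources (viable_cell_count and total_cell_count)
The source features (both are direct counts)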
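
To make the link between the calculation and its record concrete, here is a minimal sketch that computes viability as viable / total x 100 and builds the entry shown above. The function name and the input counts are hypothetical; the key names are copied from the example.

```python
def viability_document(viable_count, total_count):
    """Build a calculated-data entry that records where the value came from."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    value = round(viable_count / total_count * 100, 1)
    sources = [
        {"data-source-identifier": name, "data-source-feature": "direct_count"}
        for name in ("viable_cell_count", "total_cell_count")
    ]
    return {
        "calculated-data-name": "viability_percent",
        "calculated-result": {"value": value, "unit": "%"},
        "data-source-aggregate-document": {"data-source-document": sources},
    }

doc = viability_document(1.4e6, 1.5e6)
print(doc["calculated-result"])  # {'value': 93.3, 'unit': '%'}
```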

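Step 5: Validation

The validator checks the quality of the ASM output.

Validation checks:

Structural integrity
- Required fields are present
- The JSON is well formed
Naming conventions
- Field names are space-separated (not hyphenated)
- Capitalization rules (for example, "Absorbance" rather than "absorbance")
Data consistency
- Units are valid
- Values fall within a plausible range
Traceability
- Calculated data has data sources
- Identifiers are unique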

def validate_asm(asm_json):
  issues = []

  # Check the structure
  if not has_required_fields(asm_json):
    issues.append("Missing required fields")

  # Check the naming
  if not follows_naming_convention(asm_json):
    issues.append("Invalid field names")

  # Check the units
  invalid_units = check_units(asm_json)
  if invalid_units:
    issues.append(f"Invalid units: {invalid_units}")

  # Check the traceability
  if not has_traceability(asm_json):
    issues.append("Calculated data missing traceability")

  return issues
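
The individual checks are not shown in this chapter. As one example, the traceability check could be sketched as follows. It assumes the calculated-data-aggregate-document sits at the top level of the document with the same key names as the earlier traceability example, which may not match the layout of a full ASM file.

```python
def has_traceability(asm_json):
    """Return True if every calculated value lists at least one data source."""
    aggregate = asm_json.get("calculated-data-aggregate-document", {})
    for doc in aggregate.get("calculated-data-document", []):
        sources = (
            doc.get("data-source-aggregate-document", {})
            .get("data-source-document", [])
        )
        if not sources:
            return False
    # Documents without calculated data pass trivially
    return True
```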

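Soft validation vs. strict validation:

Soft validation (default): unknown units or techniques produce warnings and do not block the conversion
Strict validation: every warning is treated as an error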
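
The difference between the two modes can be sketched with a single flag. Everything named here (KNOWN_UNITS, StrictValidationError, validate_units, the strict parameter) is hypothetical and only illustrates the behavior described above.

```python
KNOWN_UNITS = {"cells/mL", "%", "AU"}  # illustrative subset, not the real list

class StrictValidationError(Exception):
    """Raised in strict mode when any warning is present (hypothetical name)."""

def validate_units(units, strict=False):
    """Return warnings for unknown units; in strict mode, raise instead."""
    warnings = [f"Unknown unit: {u}" for u in units if u not in KNOWN_UNITS]
    if strict and warnings:
        raise StrictValidationError("; ".join(warnings))
    return warnings

print(validate_units(["cells/mL", "furlongs"]))  # ['Unknown unit: furlongs']
```

With strict=True the same call raises StrictValidationError, so the conversion stops instead of continuing with a warning.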

Flattening

Why flatten?

ASM JSON is a nested structure, but many systems and tools (such as Excel) need a flat table.

Flattening algorithm

import pandas as pd

def flatten_asm(asm_json):
  flat_rows = []

  # Iterate over all measurements
  for measurement in asm_json['measurement-document']:
    row = {}

    # Add the sample information
    row['sample_identifier'] = measurement['sample-identifier']

    # Add the measured values, one value column and one unit column each
    for result in measurement['measurement-result']:
      row[result['type']] = result['value']
      row[result['type'] + '_unit'] = result['unit']

    # Add the metadata (repeated on every row)
    row['instrument'] = asm_json['instrument-identifier']
    row['datetime'] = measurement['measurement-time']

    flat_rows.append(row)

  return pd.DataFrame(flat_rows)

Example:

Input (nested):

{
  "instrument": "VI-CELL-001",
  "measurements": [
    {
      "sample": "A1",
      "results": {
        "total": {"value": 1.5, "unit": "cells/mL"},
        "viable": {"value": 1.4, "unit": "cells/mL"}
      }
    }
  ]
}

Output (flat):

instrument,sample,total,total_unit,viable,viable_unit
VI-CELL-001,A1,1.5,cells/mL,1.4,cells/mL
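
The example input uses simplified keys ("measurements", "sample", "results") rather than the keys in flatten_asm. The following standard-library sketch flattens that simplified shape and reproduces the CSV above. The function name is hypothetical, and it assumes every measurement carries the same result names.

```python
import csv
import io

def flatten_example(doc):
    """Flatten the simplified nested document into CSV text."""
    rows = []
    for m in doc["measurements"]:
        # Instrument metadata is repeated on every row
        row = {"instrument": doc["instrument"], "sample": m["sample"]}
        for name, result in m["results"].items():
            row[name] = result["value"]
            row[f"{name}_unit"] = result["unit"]
        rows.append(row)

    # Column order follows the first row (assumes uniform result names)
    fieldnames = list(rows[0]) if rows else []
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

doc = {
    "instrument": "VI-CELL-001",
    "measurements": [
        {"sample": "A1", "results": {
            "total": {"value": 1.5, "unit": "cells/mL"},
            "viable": {"value": 1.4, "unit": "cells/mL"},
        }},
    ],
}
print(flatten_example(doc), end="")
```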

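Performance optimization

Caching

Instrument detection results are cached
Parser instances are reused
Validation rules are preloaded

Parallel processing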

from concurrent.futures import ProcessPoolExecutor

def batch_convert(file_list):
  with ProcessPoolExecutor(max_workers=4) as executor:
    results = executor.map(convert_file, file_list)

  return list(results)

Memory management

Large files are processed as a stream
CSV/PDF files are read in chunks
Memory is released promptly
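
For CSV input, chunked reading can be sketched with a generator so that only one chunk of rows is held in memory at a time. This version uses the standard library; iter_csv_chunks and its default chunk size are assumptions, and the tool itself may rely on pandas chunking instead.

```python
import csv
from itertools import islice

def iter_csv_chunks(path, chunk_size=1000):
    """Yield lists of row dicts, at most chunk_size rows at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Each chunk can be converted and written out before the next one is read, which keeps peak memory roughly proportional to chunk_size rather than to the file size.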

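Extensibility

Adding support for a new instrument

Add the instrument's details to references/supported_instruments.md
Create instrument-specific parsing rules
Add test data

Custom validation rules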

# Define a custom validator
def custom_validator(asm_json):
  # Your validation logic (your_condition is a placeholder)
  if your_condition:
    return "Custom validation failed"
  return None

# Register the validator
register_validator(custom_validator)
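
The registry behind register_validator is not shown. A minimal sketch, assuming validators return an error message or None as in the example above, could look like this. The _VALIDATORS list and run_validators are hypothetical names.

```python
_VALIDATORS = []

def register_validator(func):
    """Add a validator; returning func lets this double as a decorator."""
    _VALIDATORS.append(func)
    return func

def run_validators(asm_json):
    """Collect the non-None messages returned by the registered validators."""
    issues = []
    for validator in _VALIDATORS:
        message = validator(asm_json)
        if message is not None:
            issues.append(message)
    return issues

@register_validator
def require_instrument(asm_json):
    if "instrument-identifier" not in asm_json:
        return "Missing instrument-identifier"
    return None

print(run_validators({}))  # ['Missing instrument-identifier']
```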

Error handling

Tiered error handling

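import logging
import sys

def convert_with_recovery(file_path):
  try:
    return convert_file(file_path)
  except FileNotFoundError:
    # Fatal error: cannot recover
    logging.error("File not found: %s", file_path)
    sys.exit(1)
  except UnknownInstrumentError:
    # Recoverable error: use the fallback path
    logging.warning("Unknown instrument, using generic parser")
    return generic_parse(file_path)
  except ValidationError as e:
    # Partial failure: return the partial result.
    # Assumes the exception carries it (partial_asm is a hypothetical attribute).
    logging.error(f"Validation failed: {e}")
    return getattr(e, "partial_asm", None)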

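Next steps

Now that you understand how the tool works, you can:

Check the FAQ to resolve problems you may run into
Read the full documentation in SKILL.md to learn about advanced features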