Hi-C Workflow

The Hi-C workflow in FlowAgent implements comprehensive analysis of chromosome conformation capture data.

Workflow Steps

  1. Quality Control

     • FastQC analysis
     • Mapping statistics
     • Library complexity
     • Contact distance distribution

  2. Contact Matrix Generation

     • Read pair alignment
     • Fragment filtering
     • Matrix binning
     • Bias correction

  3. TAD Analysis

     • TAD calling
     • Boundary strength
     • Domain scores
     • Insulation analysis

  4. Interaction Analysis

     • Loop calling
     • Significant interactions
     • Contact enrichment
     • Distance normalization

  5. 3D Structure

     • Structure prediction
     • Model validation
     • Visualization
     • Ensemble analysis
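
To make the matrix binning step concrete, here is a minimal sketch of how aligned read pairs could be turned into a contact matrix. It is not FlowAgent's implementation: the function name `bin_contacts` and its parameters are hypothetical, and it assumes intra-chromosomal pairs given as 0-based base-pair positions.

```python
import numpy as np

def bin_contacts(pos1, pos2, chrom_length, bin_size):
    """Bin intra-chromosomal read-pair positions into a symmetric contact matrix.

    Hypothetical helper for illustration; positions are 0-based base pairs.
    """
    n_bins = int(np.ceil(chrom_length / bin_size))
    i = np.asarray(pos1) // bin_size
    j = np.asarray(pos2) // bin_size
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    # np.add.at accumulates correctly when the same (i, j) bin pair repeats
    np.add.at(matrix, (i, j), 1)
    # Mirror off-diagonal contacts so the matrix is symmetric
    off_diag = i != j
    np.add.at(matrix, (j[off_diag], i[off_diag]), 1)
    return matrix

# Four toy read pairs on a 3 kb region, binned at 1 kb
toy = bin_contacts([100, 1500, 1600, 2500], [200, 2600, 2700, 2900],
                   chrom_length=3000, bin_size=1000)
print(toy)
```

Smaller bins give higher resolution but sparser counts per bin, which is why the bin size is usually chosen according to sequencing depth.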

Custom Script Integration Points

The Hi-C workflow supports custom scripts at various stages:

Pre-processing

  • Custom filtering
  • Quality metrics
  • Read pair processing

Matrix Analysis

  • Custom normalization
  • Feature detection
  • Pattern analysis

Structure Analysis

  • Model optimization
  • Validation metrics
  • Visualization tools
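
As an illustration of what a custom normalization script might contain, the sketch below computes an observed/expected matrix, where each contact is divided by the mean contact frequency at the same genomic distance. The function name `observed_over_expected` is hypothetical, and how FlowAgent passes matrices to custom scripts is an assumption here, not documented behavior.

```python
import numpy as np

def observed_over_expected(matrix):
    """Divide each contact by the mean of its diagonal (same genomic distance).

    Hypothetical example of a custom normalization; expects a square,
    symmetric contact matrix.
    """
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    result = np.zeros_like(matrix)
    for d in range(n):
        diag = np.diagonal(matrix, offset=d)
        expected = diag.mean()
        if expected > 0:
            idx = np.arange(n - d)
            # Fill the d-th diagonal above and below the main diagonal
            result[idx, idx + d] = diag / expected
            result[idx + d, idx] = diag / expected
    return result

# Toy matrix whose contacts depend only on distance: O/E is 1 everywhere
bins = np.arange(5)
toy = 1.0 / (np.abs(bins[:, None] - bins[None, :]) + 1)
print(observed_over_expected(toy))
```

Removing the distance decay this way makes features such as loops and compartments easier to detect, because they stand out as values above 1.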

Example: Custom TAD Caller

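# custom_tad_caller.py
import numpy as np
import pandas as pd
from scipy import signal

def call_tads(contact_matrix, params):
    """Call TADs using an insulation score method.

    params must provide 'window_size' and 'min_size', both in bins.
    """
    contact_matrix = np.asarray(contact_matrix, dtype=float)
    window_size = params['window_size']
    min_size = params['min_size']

    # Insulation score: mean contact frequency between the window_size bins
    # upstream and the window_size bins downstream of bin i. Bins too close
    # to the matrix edge for a full window keep a score of zero.
    n_bins = contact_matrix.shape[0]
    insulation = np.zeros(n_bins)
    for i in range(window_size, n_bins - window_size):
        insulation[i] = np.mean(contact_matrix[
            i-window_size:i,
            i:i+window_size
        ])

    # Boundaries are local minima of the insulation score
    boundaries = signal.find_peaks(-insulation)[0]

    # Call a TAD between each pair of consecutive boundaries that are
    # at least min_size bins apart
    tads = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if end - start >= min_size:
            tads.append({
                'start': int(start),
                'end': int(end),
                'score': float(np.mean(insulation[start:end]))
            })

    # Fix the columns so an empty result still has the expected schema
    return pd.DataFrame(tads, columns=['start', 'end', 'score'])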
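
A TAD caller like the one above is easier to tune on data where the domain positions are known. The following sketch builds a synthetic block-diagonal contact matrix for that purpose; `make_block_matrix` and its default values are hypothetical and not part of FlowAgent.

```python
import numpy as np

def make_block_matrix(domain_sizes, inside=10.0, outside=1.0, seed=0):
    """Simulate a symmetric contact matrix with known domains.

    Hypothetical test helper: contacts are Poisson counts with a higher
    mean inside each domain than between domains.
    """
    n = sum(domain_sizes)
    expected = np.full((n, n), outside)
    start = 0
    for size in domain_sizes:
        expected[start:start + size, start:start + size] = inside
        start += size
    rng = np.random.default_rng(seed)
    counts = rng.poisson(expected).astype(float)
    # Keep the upper triangle (with diagonal) and mirror it to symmetrize
    return np.triu(counts) + np.triu(counts, k=1).T

# Three domains of 20, 30, and 25 bins: true boundaries at bins 20 and 50
matrix = make_block_matrix([20, 30, 25])
# The matrix can then be passed to the TAD caller, for example:
# tads = call_tads(matrix, {'window_size': 5, 'min_size': 10})
```

Comparing the called boundaries with the known domain edges gives a quick check of the chosen `window_size` and `min_size` before running on real data.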

Usage

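import asyncio

from flowagent.core.workflow_executor import WorkflowExecutor

async def main(llm_interface):
    # llm_interface: an already configured LLM interface instance
    # (its construction is not shown here)
    executor = WorkflowExecutor(llm_interface)

    # Execute Hi-C workflow with custom TAD calling
    results = await executor.execute_workflow(
        input_data={
            "fastq1": "read1.fastq",
            "fastq2": "read2.fastq",
            "genome": "reference.fa"
        },
        workflow_type="hic",
        custom_script_requests=["custom_tad_caller"]
    )
    return results

# execute_workflow is awaited, so it must run inside an event loop:
# results = asyncio.run(main(llm_interface))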

Output Structure

results/
├── qc/
│   ├── fastqc/
│   ├── mapping_stats.txt
│   └── library_stats.txt
├── matrices/
│   ├── raw/
│   ├── normalized/
│   └── binned/
├── tads/
│   ├── boundaries.bed
│   └── domains.bed
├── interactions/
│   ├── loops.bedpe
│   └── significant_interactions.txt
└── structures/
    ├── models/
    └── validation/
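
The domain calls can be inspected with standard tools once the workflow has finished. The sketch below assumes that `domains.bed` is a tab-separated BED file whose first three columns are chromosome, start, and end; the exact columns FlowAgent writes are not specified here, and `read_bed3` is a hypothetical helper.

```python
import io
import pandas as pd

def read_bed3(handle):
    """Read the first three BED columns (assumed: chrom, start, end)."""
    return pd.read_csv(
        handle,
        sep="\t",
        header=None,
        usecols=[0, 1, 2],
        names=["chrom", "start", "end"],
        comment="#",
    )

# Stand-in for open("results/tads/domains.bed"); the values are made up
sample = io.StringIO("chr1\t0\t500000\nchr1\t500000\t1200000\n")
domains = read_bed3(sample)
domains["size"] = domains["end"] - domains["start"]
print(domains)
```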

Quality Metrics

The workflow tracks quality metrics for the library, the contacts, and the downstream analysis:

  1. Library Quality

     • Read quality scores
     • Mapping rates
     • PCR duplicates
     • Fragment size distribution

  2. Contact Quality

     • Contact distance distribution
     • Coverage uniformity
     • Signal-to-noise ratio
     • Bias factors

  3. Analysis Quality

     • TAD boundary strength
     • Loop significance
     • Structure validation
     • Resolution assessment
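
Two commonly reported contact-level metrics are the fraction of cis (same-chromosome) pairs and the fraction of long-range cis pairs. The sketch below shows one way to compute them from a table of read pairs; the function name `pair_qc`, the column names, and the 20 kb long-range cutoff are assumptions for illustration, not FlowAgent's definitions.

```python
import pandas as pd

def pair_qc(pairs, long_range_min=20_000):
    """Summarize a read-pair table with columns chrom1, pos1, chrom2, pos2.

    Hypothetical helper; long_range_min is an assumed cutoff in base pairs.
    """
    total = len(pairs)
    if total == 0:
        raise ValueError("pair table is empty")
    cis = pairs["chrom1"] == pairs["chrom2"]
    distance = (pairs["pos2"] - pairs["pos1"]).abs()
    long_cis = cis & (distance >= long_range_min)
    return {
        "total_pairs": total,
        "cis_fraction": cis.sum() / total,
        "long_range_cis_fraction": long_cis.sum() / total,
    }

# Four made-up pairs: three cis (two of them long-range) and one trans
example_pairs = pd.DataFrame({
    "chrom1": ["chr1", "chr1", "chr1", "chr2"],
    "pos1": [1_000, 10_000, 5_000, 100_000],
    "chrom2": ["chr1", "chr1", "chr2", "chr2"],
    "pos2": [1_500, 60_000, 7_000, 130_000],
})
print(pair_qc(example_pairs))
```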

Resource Requirements

Typical resource requirements for Hi-C analysis:

  • CPU: 16-32 cores
  • Memory: 64-128GB RAM
  • Storage: 100-500GB per sample
  • Time: 12-24 hours per sample
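
Memory use depends strongly on matrix resolution. As a rough back-of-the-envelope sketch, the memory for a dense square matrix grows with the square of the number of bins, so bins ten times smaller need about a hundred times more memory. The helper `dense_matrix_gib` is hypothetical, and the 3.1 Gb genome length is an approximate figure for a human genome used only as an example; real pipelines usually store matrices in sparse or per-chromosome form.

```python
import math

def dense_matrix_gib(region_length_bp, bin_size_bp, bytes_per_value=8):
    """Estimate the size in GiB of a dense square contact matrix."""
    n_bins = math.ceil(region_length_bp / bin_size_bp)
    return n_bins ** 2 * bytes_per_value / 2 ** 30

# Assumed genome length of about 3.1 Gb, at three example resolutions
for bin_size in (1_000_000, 100_000, 10_000):
    size = dense_matrix_gib(3_100_000_000, bin_size)
    print(f"{bin_size:>9} bp bins: {size:.2f} GiB")
```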

Best Practices

  1. Quality Control

     • Filter low-quality reads
     • Remove PCR duplicates
     • Check for biases
     • Validate library complexity

  2. Matrix Generation

     • Use appropriate bin sizes
     • Apply ICE normalization
     • Handle multi-mapping reads
     • Consider distance effects

  3. Feature Detection

     • Optimize parameters
     • Use multiple methods
     • Validate findings
     • Consider replicates

  4. Visualization

     • Use appropriate scales
     • Show multiple resolutions
     • Include validation metrics
     • Compare conditions
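
For readers unfamiliar with ICE (iterative correction and eigenvector decomposition) normalization, the sketch below shows the core idea of the iterative correction step: repeatedly divide the matrix by its normalized row sums until all bins have equal coverage. It is a simplified illustration, not the implementation FlowAgent uses; the function name `ice_normalize` and its parameters are hypothetical, and it omits the filtering of low-coverage bins that production tools apply.

```python
import numpy as np

def ice_normalize(matrix, max_iter=100, tol=1e-5):
    """Balance a symmetric contact matrix by iterative correction.

    Returns the corrected matrix and the accumulated per-bin bias.
    Simplified sketch: bins with zero coverage are left untouched.
    """
    m = np.asarray(matrix, dtype=float).copy()
    n = m.shape[0]
    bias = np.ones(n)
    valid = m.sum(axis=1) > 0
    if not valid.any():
        return m, bias
    for _ in range(max_iter):
        row_sums = m.sum(axis=1)
        # Scale factors are row sums relative to their mean (1 for empty bins)
        scale = np.ones(n)
        scale[valid] = row_sums[valid] / row_sums[valid].mean()
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale[valid] - 1).max() < tol:
            break
    return m, bias

# Toy example: a balanced matrix distorted by a known per-bin bias
true_bias = np.array([1.0, 2.0, 0.5, 1.5])
balanced = np.ones((4, 4)) + np.eye(4)
corrected, estimated_bias = ice_normalize(balanced * np.outer(true_bias, true_bias))
print(corrected.sum(axis=1))
```

After correction every bin has approximately the same total coverage, which removes multiplicative biases such as mappability or fragment density from downstream comparisons.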