Hi-C Workflow¶
The Hi-C workflow in FlowAgent implements comprehensive analysis of chromosome conformation capture data.
Workflow Steps¶
- Quality Control
- FastQC analysis
- Mapping statistics
- Library complexity
-
Contact distance distribution
-
Contact Matrix Generation
- Read pair alignment
- Fragment filtering
- Matrix binning
-
Bias correction
-
TAD Analysis
- TAD calling
- Boundary strength
- Domain scores
-
Insulation analysis
-
Interaction Analysis
- Loop calling
- Significant interactions
- Contact enrichment
-
Distance normalization
-
3D Structure
- Structure prediction
- Model validation
- Visualization
- Ensemble analysis
Custom Script Integration Points¶
The Hi-C workflow supports custom scripts at various stages:
Pre-processing¶
- Custom filtering
- Quality metrics
- Read pair processing
Matrix Analysis¶
- Custom normalization
- Feature detection
- Pattern analysis
Structure Analysis¶
- Model optimization
- Validation metrics
- Visualization tools
Example: Custom TAD Caller¶
# custom_tad_caller.py
import numpy as np
import pandas as pd
from scipy import signal
def call_tads(contact_matrix, params):
"""Call TADs using insulation score method."""
# Calculate insulation score
window_size = params['window_size']
min_size = params['min_size']
insulation = np.zeros(contact_matrix.shape[0])
for i in range(window_size, len(insulation) - window_size):
insulation[i] = np.mean(contact_matrix[
i-window_size:i+window_size,
i-window_size:i+window_size
])
# Find boundaries
boundaries = signal.find_peaks(-insulation)[0]
# Call TADs
tads = []
for i in range(len(boundaries)-1):
if boundaries[i+1] - boundaries[i] >= min_size:
tads.append({
'start': boundaries[i],
'end': boundaries[i+1],
'score': np.mean(insulation[boundaries[i]:boundaries[i+1]])
})
return pd.DataFrame(tads)
Usage¶
from flowagent.core.workflow_executor import WorkflowExecutor
# Initialize workflow
executor = WorkflowExecutor(llm_interface)
# Execute Hi-C workflow with custom TAD calling
results = await executor.execute_workflow(
input_data={
"fastq1": "read1.fastq",
"fastq2": "read2.fastq",
"genome": "reference.fa"
},
workflow_type="hic",
custom_script_requests=["custom_tad_caller"]
)
Output Structure¶
results/
├── qc/
│ ├── fastqc/
│ ├── mapping_stats.txt
│ └── library_stats.txt
├── matrices/
│ ├── raw/
│ ├── normalized/
│ └── binned/
├── tads/
│ ├── boundaries.bed
│ └── domains.bed
├── interactions/
│ ├── loops.bedpe
│ └── significant_interactions.txt
└── structures/
├── models/
└── validation/
Quality Metrics¶
The workflow tracks various quality metrics:
- Library Quality
- Read quality scores
- Mapping rates
- PCR duplicates
-
Fragment size distribution
-
Contact Quality
- Contact distance distribution
- Coverage uniformity
- Signal-to-noise ratio
-
Bias factors
-
Analysis Quality
- TAD boundary strength
- Loop significance
- Structure validation
- Resolution assessment
Resource Requirements¶
Typical resource requirements for Hi-C analysis:
- CPU: 16-32 cores
- Memory: 64-128GB RAM
- Storage: 100-500GB per sample
- Time: 12-24 hours per sample
Best Practices¶
- Quality Control
- Filter low-quality reads
- Remove PCR duplicates
- Check for biases
-
Validate library complexity
-
Matrix Generation
- Use appropriate bin sizes
- Apply ICE normalization
- Handle multi-mapping reads
-
Consider distance effects
-
Feature Detection
- Optimize parameters
- Use multiple methods
- Validate findings
-
Consider replicates
-
Visualization
- Use appropriate scales
- Show multiple resolutions
- Include validation metrics
- Compare conditions