As a bioinformatics tutor, I'll guide you through this complex pipeline step-by-step. This is a sophisticated tool for integrating ATAC-seq and scRNA-seq data to identify transcription factor binding sites (TFBS) and build gene regulatory networks.
Pipeline Overview
This pipeline performs multi-omics integration by:
- Processing ATAC-seq data to identify accessible chromatin regions
- Finding transcription factor binding sites using motif scanning
- Building gene regulatory networks from scRNA-seq data
- Integrating both modalities to create cross-modal networks
Let's break down each component:
Core Dependencies and Tools
External Bioinformatics Tools
# Required command-line tools
- perl (for sequence extraction)
- fimo (MEME suite - for motif scanning)
- sort-bed (BEDOPS - for genomic interval operations)
- bedops (BEDOPS - for set operations on genomic intervals)
- docker (for MAESTRO analysis)
Python Libraries
# Core scientific computing
import numpy as np
import pandas as pd
from scipy import sparse as sp_sparse
from scipy import io as sio
from sklearn import metrics
# Bioinformatics specific
import h5py # For 10X Genomics format
from tqdm import tqdm # Progress bars
from arboreto.algo import genie3, grnboost2 # Gene regulatory network inference
Step-by-Step Breakdown
1. ATAC-seq Peak Filtering (ATAC_filter_from_CSV
/ATAC_filter_from_MTX
)
Purpose: Filter peaks based on accessibility across cells
def ATAC_filter_from_CSV(filename, threshold=0.1):
# Read ATAC-seq count matrix
# Calculate accessibility rate per peak: n_accessible_cells / total_cells
# Keep peaks with rate > threshold
# Output BED format: chr start end accessibility_rate
Key Concept: Only keep peaks that are accessible in >10% of cells (reduces noise)
2. Sequence Extraction (extract_chromosome
)
Purpose: Extract DNA sequences from filtered peaks
def extract_chromosome(bed_file, genome_file):
# Uses Perl script GetSequence.pl
# Input: BED file with genomic coordinates
# Output: FASTA file with sequences
What happens: Converts genomic coordinates to actual DNA sequences for motif scanning
3. Motif Scanning with FIMO (fimo
)
Purpose: Find transcription factor binding motifs in sequences
def fimo(fasta_file, threshold=1e-4, motif_dir):
# For each TF motif in motif_dir:
# - Run FIMO to find matches in sequences
# - Filter by p-value threshold
# - Output binding site coordinates
Key Insight: Uses position weight matrices (PWMs) to predict where TFs might bind
4. TFBS Region Processing (get_TFBS_region
)
Purpose: Convert FIMO output to genomic coordinates
def get_TFBS_region(fimo_dir, threshold=1e-6):
# Parse FIMO results
# Convert relative positions to absolute genomic coordinates
# Create BED files for each TF
# Merge all TF binding sites
5. Promoter Region Integration (get_TFBS_from_promoter
)
Purpose: Identify TF binding sites in gene promoter regions
def get_TFBS_from_promoter(promoter_file, motif_dir):
# Filter promoter regions for TFs that have motifs
# Create mapping between TFs and their target gene promoters
6. Overlap Analysis (get_contained_region
)
Purpose: Find TFBS that overlap with TF promoter regions
def get_contained_region(atac_region_file, promoter_region_file):
# Use BEDOPS to find overlapping regions
# Identifies TF-TF regulatory relationships
Biological Significance: TFs regulating other TFs create regulatory cascades
7. Graph Construction (build_Graph
)
Purpose: Build TF-TF regulatory network from overlaps
def build_Graph(atac_region_file, promoter_region_file):
# Create adjacency matrix where A[i,j] = 1 if TF_i binds to TF_j's promoter
# Output: TF-TF regulatory graph
8. ATAC Feature Extraction (extract_atac_feature
)
Purpose: Calculate regulatory potential scores using MAESTRO
def extract_atac_feature(atac_file, node_file):
# Convert to 10X format
# Run MAESTRO in Docker container
# Calculate TF regulatory potential scores
# Output: TF activity scores per cell
MAESTRO: Integrates chromatin accessibility with TF motifs to predict TF activity
9. scRNA-seq Feature Extraction (extract_scrna_feature2
)
Purpose: Infer gene regulatory networks from expression data
def extract_scrna_feature2(scrna_file, node_file, method="grnboost2"):
# Use GRNBoost2 or GENIE3 algorithms
# Infer TF-target relationships from co-expression
# Output: TF-gene regulatory network weights
Algorithms:
- GRNBoost2: Gradient boosting-based method (faster)
- GENIE3: Random forest-based method (more established)
10. Cross-modal Integration (build_cross_graph
)
Purpose: Integrate ATAC and scRNA features
def build_cross_graph(graph_file, atac_feature_file, scrna_feature_file):
# Find intersection of TFs present in both modalities
# Create unified feature matrices
# Output: Integrated multi-omics network
Required External Files and Setup
1. Genome Reference Files
# Human genome (hg19/GRCh37)
genome_file = "hg19/index" # FASTA format genome
2. TF Motif Database
motif_dir = "human_HOCOMO" # Directory with .meme files
# Each file contains position weight matrices for TF motifs
3. Promoter Annotations
promoter_file = "gencode.v19.ProteinCoding_gene_promoter.txt"
# Format: chr start end strand gene_name
4. Docker Setup for MAESTRO
# Pull MAESTRO Docker image
docker pull winterdongqing/maestro:1.2.1
Data Input Formats
ATAC-seq Data
# CSV format: peaks × cells matrix
# Rows: genomic_regions (chr1_1000_2000)
# Columns: cell_barcodes
# Values: accessibility counts
# MTX format: Matrix Market format
# - matrix.mtx: sparse count matrix
# - barcodes.tsv: cell identifiers
# - peaks.tsv: peak coordinates
scRNA-seq Data
# Similar formats as ATAC-seq but with gene expression
# Rows: genes
# Columns: cells
# Values: expression counts/UMI
Running the Complete Pipeline
def run(atac_file, scrna_file, save_dir):
# 1. Extract TFBS from ATAC-seq peaks
atac_region_file = get_TFBS_from_ATAC(atac_file, motif_dir, genome_file)
# 2. Process promoter regions
promoter_region_file = get_TFBS_from_promoter(promoter_file)
# 3. Build TF-TF regulatory graph
graph_file, node_file = build_Graph(atac_region_file, promoter_region_file)
# 4. Extract features from both modalities
atac_feature_file = extract_atac_feature(atac_file, node_file)
scrna_feature_file = extract_scrna_feature(scrna_file, node_file)
# 5. Integrate modalities
cross_graph_file, cross_node_file, cross_atac_file, cross_scrna_file = \
build_cross_graph(graph_file, atac_feature_file, scrna_feature_file)
Key Concepts to Understand
1. Regulatory Potential Score
- Combines chromatin accessibility with TF motif presence
- Higher score = more likely the TF is active in that cell type
2. Gene Regulatory Networks (GRNs)
- Directed graphs where edges represent regulatory relationships
- Inferred from co-expression patterns in scRNA-seq data
3. Multi-omics Integration
- Combines complementary information from different data types
- ATAC-seq: where regulation can happen (accessibility)
- scRNA-seq: where regulation does happen (expression changes)
4. Chromatin Accessibility
- Open chromatin regions where TFs can bind
- Measured by ATAC-seq (Assay for Transposase-Accessible Chromatin)
This pipeline is quite sophisticated and represents current best practices in computational genomics for multi-omics integration. Each step builds upon the previous one to create a comprehensive model of gene regulation.
Would you like me to dive deeper into any specific component or help you set up the required external tools?