Configuration¶
To get started, modify the example file below:
See sections below for explainations of each field.
Basic options¶
input¶
The path to a file specifying the input data sets. See Input data format for instructions for creating the input file.
input: "data/input.yml"
assembly¶
(Optional.) The name of the genome assembly. This is an optional parameter.
We use this parameter to automatically determine the values of genome
, annotation
and motif_file
.
If your genome assembly is not supported by the program, you will need to mannually
set the parameters for genome
, annotation
and motif_file
.
We currently support the following assemblies: “GRCh38”, “hg38”, “GRCm38” or “mm10”.
assembly: "GRCh38"
Advanced options¶
genome¶
(Optional.) Complete genome in a SINGLE ungzipped FASTA file.
If not specified, this will be downloaded according to assembly
.
genome: "/home/kai/genome/GRCh38/genome.fa"
annotation¶
(Optional.) Genome annotation in GTF format. For human and mouse, Gencode annotations are available at http://www.gencodegenes.org/. Very important: chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence file. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses chr1, chr2,... naming convention, and ENSEMBL uses 1, 2, ... naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together.
If not specified, this will be downloaded according to assembly
.
annotation: "/home/kai/genome/GRCh38/gencode.v25.annotation.gtf"
motif_file¶
(Optional.)
A plain text file containing PWMs in MEME format.
The naming convention for motifs is TF_NAME+OTHER_STRING
, where
the TF_NAME
should match the gene names in your annotation file.
You may include multipe PWMs for same TFs and use OTHER_STRING
to distinguish
them. For example, SP1+ID1
, SP1+ID2
.
Taiji will combine sites found from different PWMs.
We provide human and mouse motifs here (downloaded from cisBP database):
Human
and mouse
motif files.
When motif_file
is not specified, these will be used according to
the value of the assembly
field.
motif_file: "/home/kai/motif_databases/cisBP_human.meme"
genome_index¶
(Optional.)
This is the FILE containing customized genome sequence index.
The program will detect whether the file exists.
If the index is not present, it will generate the index at the specified location.
If you leave this parameter unspecified,
the program will generate the index in the output_dir
.
To avoid re-generate the index for every project, we recommend you to set this parameter mannually.
genome_index: "/home/kai/genome/GRCh38/GRCh38.index".
bwa_index¶
(Optional.)
BWA genome index prefix.
The program will detect whether the directory contain proper BWA indices.
If the indices are not present, it will generate the indices at the specified
location. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
bwa_index: "/home/kai/genome/GRCh38/BWAIndex/genome.fa"
star_index¶
(Optional.)
This is the DIRECTORY containing STAR INDICES.
The program will detect whether the directory contain proper STAR indices.
If the indices are not present, it will generate the indices within the specified
directory. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
star_index: "/home/kai/genome/GRCh38/STAR_index/"
rsem_index¶
(Optional.)
RSEM genome index prefix.
The program will detect whether the directory contain proper RSEM indices.
If the indices are not present, it will generate the indices at the specified
location. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
rsem_index: "/home/kai/genome/GRCh38/RSEM_index/genome"
callpeak_genome_size¶
(Optional.)
The effective genome size used for MACS2’s “-g/–gsize” parameter.
This will be automatically determined based on the assembly or genome file.
For human or mouse assembly, we set this parameter to “hs” or “mm”.
For other genome, we set this parameter to 0.9 * GENOME_SIZE
.
The value of this parameter usually doesn’t make big difference.
callpeak_genome_size: "2.7e9"
external_network¶
(Optional.) External network file to be used in PageRank analysis.
external_network: "pathway.tsv"
Single cell ATAC-seq analysis¶
Parameters related to scATAC-seq analysis. All parameters in this section should be specified under scatac_options. For example:
scatac_options:
cluster_optimizer: CPM
cluster_by_window: false
cluster_resolution_list: [0.001, 0.002, 0.02, 0.05, 0.1, 0.25, 1]
cluster_resolution: 0.0075
tss_enrichment_cutoff: 7
do_subclustering: true
cluster_exclude: ["C3.3", "C25.3"]
subcluster_resolution:
"C1": 0.01
"C2": 0.03
"C4": 0.05
"C5": 0.06
"C6": 0.05
"C7": 0.03
"C8": 0.01
"C10": 0.001
"C11": 0.001
tss_enrichment_cutoff¶
(Optional.) TSS enrichment cutoff for filtering cell in single cell ATAC-seq analysis.
tss_enrichment_cutoff: 7
fragment_cutoff¶
(Optional.) Used to remove cells that do not have enough fragments/reads.
fragment_cutoff: 1000
doublet_score_cutoff¶
- ::
- doublet_score_cutoff: 0.5
cluster_optimizer¶
(Optional.) Quality function used in graph clustering. Available options are RBConfiguration and CPM. RBConfiguration optimizes modularity and has resolution limit while CPM is resolution-limit free.
cluster_optimizer: CPM
cluster_resolution¶
Mannually specify the clustering resolution. Otherwise it will be automatically determined.
cluster_resolution: 1
cluster_resolution_list¶
cluster_resolution_list: [0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 0.8, 1]
cluster_by_window¶
Whether to use bin/window based clustering. Default is to use peak based clustering.
- ::
- cluster_by_window: False
do_subclustering¶
Whether to perform iterative clustering to identify subclusters.
- ::
- do_subclustering: False
subcluster_resolution¶
cluster_exclude¶
Clusters to be excluded from the final result.
Single cell RNA-seq analysis¶
scrna_cell_barcode_length¶
The length of the cell barcode used in demultiplexing.
scrna_cell_barcode_length: 12
scrna_doublet_score_cutoff¶
(Optional.) Cutoff for doublet detection, a value between 0 and 1 reflecting how likely a “cell” is a doublet. (default is 0.5)
scrna_doublet_score_cutoff: 0.5
scrna_cluster_resolutions¶
Distributed computing¶
The following settings are used in the cloud computing mode.
submit_cpu_format¶
The template for specifying the number of cpu cores in the job submission script.
This is system/environment dependent.
For example, if your system uses -l nodes=1:ppn=XX
to allocate CPU resource,
you should put -l nodes=1:ppn=%d
here.
The %d
will be replaced by the actual numbers when submitting the jobs.
The CPU cores for individual steps can be modified in the resource section below.
submit_cpu_format: "-l nodes=1:ppn=%d"
submit_memory_format¶
The template for specifying the amount of memory in the job submission script.
This is system/environment dependent.
For example, if your system uses -l mem=XXG
to allocate memory resource,
you should put -l mem=%dG
here.
The %d
will be replaced by the actual numbers when submitting the jobs.
The memory for individual steps can be modified in the resource section below.
submit_memory_format: "-l mem=%dG"
submit_params¶
Additional parameters to be included in the submission command.
submit_params: "-q glean"
resource¶
(Optional.) Specify the computational resources for each step.
resource:
SCATAC_Remove_Duplicates:
parameter: "-q home -l walltime=24:00:00"
SCATAC_Merged_Reduce_Dims:
parameter: "-q home -l walltime=24:00:00"
cpu: 4
memory: 80