Input data format¶
The input data file can be written in either the YAML or the TSV format.
YAML format is more readable and powerful, but it is sensitive to indentation and beginners usually get confused. For simple cases, the TSV format is easier to work with and more straightforward.
See these examples:
Supported input data types are listed below:
ATAC-seq¶
Keyword: ATAC-seq
.
Supported formats | Description | Required tags | Optional tags |
---|---|---|---|
*.fastq or
*.fastq.gz |
Raw reads | None | None |
*.bam |
Aligned reads | None | “Filtered”: do not further filter this file |
*.bed or
*.bed.gz |
Aligned reads | None |
Taiji will use MACS2 to call peaks. It is also possible to use your own peaks. To do so, include the peak files in your input file, for example:
ATAC-Seq:
- group: 'ATAC_CD4_day1'
id: 'CD4_day1'
replicates:
- rep: 1
files:
- path: CD4.narrowpeak
format: NarrowPeak
- path: CD4.bed.gz
OR in TSV format:
type <TAB> id <TAB> group <TAB> rep <TAB> path <TAB> format
ATAC-seq <TAB> ATAC_CD4_day1 <TAB> CD4_day1 <TAB> 1 <TAB> CD4.narrowpeak <TAB> NarrowPeak
ATAC-seq <TAB> ATAC_CD4_day1 <TAB> CD4_day1 <TAB> 1 <TAB> CD4.bed.gz <TAB> Bed
RNA-seq¶
Keyword: RNA-seq
.
Supported formats | Description | Required tags | Optional tags |
---|---|---|---|
*.fastq or
*.fastq.gz |
Raw reads | None | None |
Plain Text | Expression profile | “GeneQuant” | None |
When the input format is “plain text”, Taiji assumes it contains two columns separated by Tabs. The first column is the names of genes and the second column is the expression levels. For example:
Gene1 <TAB> 12
Gene2 <TAB> 20
Gene3 <TAB> 25
Single cell RNA-seq¶
Keyword: scRNA-seq
.
Supported formats | Description | Required tags | Optional tags |
---|---|---|---|
*.fastq.gz |
Raw reads | None | None |
HiC¶
Keyword: HiC
.
Supported formats | Description | Required tags | Optional tags |
---|---|---|---|
Plain Text | a list of Loops | “ChromosomeLoop” | None |
Currently the pipeline do not analyze HiC data, so the user need to provide the end result - a list of loops, in following format:
chrom_1 <TAB> start_1 <TAB> end_1 <TAB> chrom_2 <TAB> start_2 <TAB> end_2
For example:
chr21 <TAB> 29343 <TAB> 500000 <TAB> chr21 <TAB> 1009340 <TAB> 1023400
chr1 <TAB> 10321 <TAB> 102100 <TAB> chr1 <TAB> 107150 <TAB> 123400
Data from the internet¶
Taiji can automatically download and analyze data from ENCODE portal and GEO database.
Using data from GEO¶
Note
The fastq-dump
is needed for downloading data from GEO database.
The SRA
needs to be added to the “format” field.
ATAC-Seq:
- group: 'ATAC_CD4_day1'
id: 'CD4_day1'
replicates:
- rep: 1
files:
- path: SRR891275
format: SRA
tags: ['PairedEnd']
Using data from ENCODE¶
ATAC-Seq:
- group: 'heart_left_ventricle'
id: heart_left_ventricle_ATAC
replicates:
- rep: 1
files:
- pair:
- path: ENCFF766IGD
tags: ['ENCODE']
- path: ENCFF075UOA
tags: ['ENCODE']
RNA-Seq:
- group: 'heart_left_ventricle'
id: heart_left_ventricle_RNA
replicates:
- rep: 1
files:
- path: ENCFF884JDN
tags: ['ENCODE', 'GeneQuant']