Input data format

The input data file can be written in either the YAML or the TSV format.

The YAML format is more readable and more powerful, but it is sensitive to indentation, which often confuses beginners. For simple cases, the TSV format is more straightforward to work with.

See these examples:

Supported input data types are listed below:

ATAC-seq

Keyword: ATAC-seq.

Supported formats	Description	Required tags	Optional tags
`.fastq` or `.fastq.gz`	Raw reads	None	None
`*.bam`	Aligned reads	None	“Filtered”: do not further filter this file
`.bed` or `.bed.gz`	Aligned reads	None

Taiji will use MACS2 to call peaks. It is also possible to use your own peaks. To do so, include the peak files in your input file, for example:

ATAC-Seq:
    - group: 'ATAC_CD4_day1'
      id: 'CD4_day1'
      replicates:
      - rep: 1
        files:
        - path: CD4.narrowpeak
          format: NarrowPeak
        - path: CD4.bed.gz

OR in TSV format:

type <TAB> id <TAB> group <TAB> rep <TAB> path <TAB> format
ATAC-seq <TAB> ATAC_CD4_day1 <TAB> CD4_day1 <TAB> 1 <TAB> CD4.narrowpeak <TAB> NarrowPeak
ATAC-seq <TAB> ATAC_CD4_day1 <TAB> CD4_day1 <TAB> 1 <TAB> CD4.bed.gz <TAB> Bed
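When there are many samples, the TSV file can be generated by a script rather than typed by hand. The following is a minimal Python sketch, not part of Taiji itself: the helper name `write_input_tsv` is hypothetical, and the column order is simply copied from the header line shown above.

```python
import csv
import io

# Column order copied from the TSV header shown in this section.
COLUMNS = ["type", "id", "group", "rep", "path", "format"]


def write_input_tsv(rows, handle):
    """Write input records (dicts keyed by COLUMNS) as tab-separated text.

    Hypothetical helper for illustration; it is not provided by Taiji.
    """
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# The two records from the TSV example above.
rows = [
    {"type": "ATAC-seq", "id": "ATAC_CD4_day1", "group": "CD4_day1",
     "rep": 1, "path": "CD4.narrowpeak", "format": "NarrowPeak"},
    {"type": "ATAC-seq", "id": "ATAC_CD4_day1", "group": "CD4_day1",
     "rep": 1, "path": "CD4.bed.gz", "format": "Bed"},
]

# Write to an in-memory buffer here; pass an open file handle instead
# to produce a real input file.
buffer = io.StringIO()
write_input_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

Using real tab characters as the delimiter matters here: the `<TAB>` placeholders in the example above stand for a single tab, not for spaces.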

RNA-seq

Keyword: RNA-seq.

Supported formats	Description	Required tags	Optional tags
`.fastq` or `.fastq.gz`	Raw reads	None	None
Plain Text	Expression profile	“GeneQuant”	None

When the input format is “plain text”, Taiji assumes the file contains two tab-separated columns. The first column holds the gene names and the second column holds the expression levels. For example:

Gene1 <TAB> 12
Gene2 <TAB> 20
Gene3 <TAB> 25
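Before handing an expression profile to the pipeline, it can help to check that every line really has two tab-separated columns and a numeric second column. The sketch below is one way to do that in Python. The function name `read_expression_profile` is hypothetical, and the sketch assumes the file has no header line, which the description above does not state either way.

```python
import io


def read_expression_profile(handle):
    """Parse a two-column, tab-separated file into {gene_name: level}.

    Hypothetical helper for illustration; assumes there is no header line.
    """
    profile = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(fields)}"
            )
        gene, level = fields
        # float() raises ValueError if the expression level is not numeric.
        profile[gene.strip()] = float(level)
    return profile


# The example profile from this section, with real tab characters.
example = io.StringIO("Gene1\t12\nGene2\t20\nGene3\t25\n")
profile = read_expression_profile(example)
print(profile)
```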

Single cell RNA-seq

Keyword: scRNA-seq.

Supported formats	Description	Required tags	Optional tags
`*.fastq.gz`	Raw reads	None	None

HiC

Keyword: HiC.

Supported formats	Description	Required tags	Optional tags
Plain Text	A list of loops	“ChromosomeLoop”	None

Currently the pipeline does not analyze HiC data, so the user needs to provide the end result, a list of loops, in the following format:

chrom_1 <TAB> start_1 <TAB> end_1 <TAB> chrom_2 <TAB> start_2 <TAB> end_2

For example:

chr21 <TAB> 29343 <TAB> 500000 <TAB> chr21 <TAB> 1009340 <TAB> 1023400
chr1  <TAB> 10321 <TAB> 102100 <TAB> chr1  <TAB> 107150  <TAB> 123400
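A loop file in the six-column layout above can be sanity-checked with a short script before it is listed in the input file. This is a sketch only: `Loop` and `read_loops` are hypothetical names, and the check that each start coordinate is smaller than its end coordinate is an assumption about well-formed intervals, not a rule stated by Taiji.

```python
import io
from typing import NamedTuple


class Loop(NamedTuple):
    """One chromatin loop: two genomic intervals (hypothetical type)."""
    chrom_1: str
    start_1: int
    end_1: int
    chrom_2: str
    start_2: int
    end_2: int


def read_loops(handle):
    """Parse tab-separated loop records into a list of Loop tuples."""
    loops = []
    for line_number, line in enumerate(handle, start=1):
        if not line.strip():
            continue  # skip blank lines
        # Strip padding spaces around each field, as in the example above.
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, got {len(fields)}"
            )
        chrom_1, start_1, end_1, chrom_2, start_2, end_2 = fields
        loop = Loop(chrom_1, int(start_1), int(end_1),
                    chrom_2, int(start_2), int(end_2))
        # Assumption: a valid interval has start < end.
        if loop.start_1 >= loop.end_1 or loop.start_2 >= loop.end_2:
            raise ValueError(f"line {line_number}: start must be below end")
        loops.append(loop)
    return loops


# The two example loops from this section, with real tab characters.
example = io.StringIO(
    "chr21\t29343\t500000\tchr21\t1009340\t1023400\n"
    "chr1  \t10321\t102100\tchr1  \t107150  \t123400\n"
)
loops = read_loops(example)
for loop in loops:
    print(loop)
```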

Data from the internet

Taiji can automatically download and analyze data from the ENCODE portal and the GEO database.

Using data from GEO

Note

The fastq-dump tool is required for downloading data from the GEO database. The “format” field of each such file must be set to SRA.

ATAC-Seq:
    - group: 'ATAC_CD4_day1'
      id: 'CD4_day1'
      replicates:
      - rep: 1
        files:
        - path: SRR891275
          format: SRA
          tags: ['PairedEnd']
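Since fastq-dump is required for SRA downloads, it may be worth confirming that it is on the PATH before starting a run. The following Python sketch does only that. The helper name `find_fastq_dump` is hypothetical, and how Taiji itself locates the tool is not described here.

```python
import shutil


def find_fastq_dump(executable="fastq-dump"):
    """Return the full path of the executable if it is on PATH, else None.

    Hypothetical helper for illustration; not part of Taiji.
    """
    return shutil.which(executable)


path = find_fastq_dump()
if path is None:
    print("fastq-dump not found on PATH; SRA downloads are likely to fail")
else:
    print(f"fastq-dump found at {path}")
```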

Using data from ENCODE

ATAC-Seq:
- group: 'heart_left_ventricle'
  id: heart_left_ventricle_ATAC
  replicates:
  - rep: 1
    files:
      - pair:
        - path: ENCFF766IGD
          tags: ['ENCODE']
        - path: ENCFF075UOA
          tags: ['ENCODE']

RNA-Seq:
- group: 'heart_left_ventricle'
  id: heart_left_ventricle_RNA
  replicates:
  - rep: 1
    files:
    - path: ENCFF884JDN
      tags: ['ENCODE', 'GeneQuant']