Tutorial

Introduction

The Taiji pipeline integrates different kinds of high-throughput profiling data to construct transcription factor (TF) regulatory networks and to identify key regulators through network analysis. Only ATAC-seq or DNase-seq data are required to run the analysis, but the results improve when other data are supplied, especially RNA-seq data.

The input to the Taiji pipeline consists of FASTQ or BAM files. For gene expression, you can provide an external gene expression table instead of having the pipeline analyze the raw RNA-seq data for you.

How to use

To run the Taiji pipeline, you need two configuration files.

The first configuration file is used to specify the options used by the pipeline. Please look at this example configuration file for details.

The second configuration file contains the information about the input data sets. Take a look at this example file.

To run the pipeline, pass the first configuration file to taiji:

taiji run --config example_config.yml

If the program was compiled with the sge flag, add --remote:

taiji run --config example_config.yml --remote

Parallelism

Taiji supports two levels of parallelism: node level and workflow level. Node-level parallelism is turned on automatically when the program is compiled with the drmaa flag. Workflow-level parallelism can be turned on with -N <num_of_process>. However, this is recommended only for users with access to a supercomputer, because it consumes a lot of memory.
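When the pipeline is launched from a script, the options described above can be assembled programmatically. The following is a minimal Python sketch and is not part of Taiji: the helper name build_taiji_command is hypothetical, it uses only the options mentioned in this tutorial (--config, --remote, -N), and placing -N after the run subcommand is an assumption.

```python
import shlex


def build_taiji_command(config_path, remote=False, num_processes=None):
    """Assemble a `taiji run` command line as a list of arguments.

    Hypothetical helper: only the options named in this tutorial are used,
    and the position of -N on the command line is an assumption.
    """
    cmd = ["taiji", "run", "--config", config_path]
    if remote:
        # Only meaningful when taiji was compiled with cluster support.
        cmd.append("--remote")
    if num_processes is not None:
        if num_processes < 1:
            raise ValueError("num_processes must be a positive integer")
        # Workflow-level parallelism; the tutorial warns that it is memory-hungry.
        cmd.extend(["-N", str(num_processes)])
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch also works
    # on a machine where taiji is not installed.
    print(shlex.join(build_taiji_command("example_config.yml", remote=True)))
```

To actually start the run, the returned list could be passed to subprocess.run(cmd, check=True).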

Auto-recovery

The pipeline supports auto-recovery, which means you can stop the program at any time and it will resume from the last checkpoint. The checkpoints are saved in a file called “sciflow.db”. Delete this file if you want a fresh run.
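Forcing a fresh run therefore comes down to removing one file. A minimal Python sketch of that step follows; the helper name reset_checkpoints is hypothetical, and it assumes that “sciflow.db” sits in the directory from which the pipeline was started.

```python
from pathlib import Path


def reset_checkpoints(workdir=".", db_name="sciflow.db"):
    """Delete the checkpoint file so that the next run starts from scratch.

    Assumes the checkpoint file lives directly inside `workdir`.
    Returns True if a file was removed and False if none was found.
    """
    db_path = Path(workdir) / db_name
    if db_path.is_file():
        db_path.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = reset_checkpoints()
    print("checkpoint removed" if removed else "no checkpoint found")
```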

Results

The Taiji pipeline outputs many files, distributed in several directories.

OUTPUTDIR/Rank/

This is the primary output of the Taiji pipeline. It contains the TF ranks under different conditions / cell types. Cell-type-specific TFs can be identified by looking at the TF rank dynamics (fold change) across different cell types.

  • GeneRank_all.tsv: Ranks for all genes.
  • GeneRank_filtered.tsv: Ranks after genes with low ranks and little variation have been removed.
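The fold-change comparison mentioned above can be sketched in Python as follows. This is an illustration only: it assumes that the rank file is a tab-separated table with a header row, gene names in the first column, and one column of scores per cell type, which this tutorial does not state. The function names, the pseudocount, and the toy data are hypothetical.

```python
import csv
import io
import math


def read_rank_table(handle):
    """Parse a tab-separated table into {gene: {cell_type: score}}.

    Assumed layout: header row, gene name in column 1, one score column
    per cell type.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cell_types = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = dict(zip(cell_types, map(float, row[1:])))
    return table


def log2_fold_change(scores, target, pseudocount=1e-6):
    """log2 ratio of the target score to the mean score of all other cell types."""
    others = [value for cell, value in scores.items() if cell != target]
    if not others:
        raise ValueError("at least two cell types are needed")
    background = sum(others) / len(others)
    # The pseudocount avoids division by zero and log of zero.
    return math.log2((scores[target] + pseudocount) / (background + pseudocount))


if __name__ == "__main__":
    # Toy data, not real Taiji output.
    demo = "gene\tTcell\tBcell\tNK\nTF_A\t4.0\t1.0\t1.0\nTF_B\t2.0\t2.0\t2.0\n"
    table = read_rank_table(io.StringIO(demo))
    for gene, scores in table.items():
        print(gene, round(log2_fold_change(scores, "Tcell"), 3))
```

A TF with a large positive value for one cell type and values near zero elsewhere would be a candidate cell-type-specific regulator under these assumptions.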

OUTPUTDIR/TFBS/

BED files of predicted TF binding sites in each cell type.

OUTPUTDIR/Network/

Static gene regulatory network for each cell type.

OUTPUTDIR/ATAC_Seq/

BAM, BED, and peak files for ATAC-seq data.

OUTPUTDIR/RNA_Seq/

BAM files and gene and transcript quantification files for RNA-seq data.

Visualize the results

Download the software from here. Use stack install to install the software.

To visualize the result, use:

taiji-view rank GeneRanks.tsv --expression RNASeq/expression_profile.tsv output.svg

GeneRanks.tsv is the file containing the PageRank results; RNASeq/expression_profile.tsv is the file containing the gene expression values; output.svg is the output file.

Filtering the result

--cv can be used to filter the result based on the coefficient of variation (CV) of the PageRank scores. For example, --cv 0.5 will exclude any gene with a CV of less than 0.5. The default value is 1, which keeps only the highly variable genes.
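To make the filter concrete, the CV of a gene is the standard deviation of its scores across cell types divided by their mean. The Python sketch below applies such a threshold to toy data. It is not the taiji-view implementation: the function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (sample standard deviation assumed)."""
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0  # treat an all-zero gene as non-variable
    return statistics.stdev(values) / mean


def filter_by_cv(table, threshold=1.0):
    """Keep genes whose CV is at least `threshold`; genes below it are excluded."""
    return {
        gene: scores
        for gene, scores in table.items()
        if coefficient_of_variation(scores) >= threshold
    }


if __name__ == "__main__":
    # Toy PageRank scores across three cell types, not real Taiji output.
    demo = {
        "TF_A": [4.0, 1.0, 1.0],
        "TF_B": [2.0, 2.0, 2.0],
        "TF_C": [9.0, 0.5, 0.5],
    }
    print(sorted(filter_by_cv(demo, threshold=0.5)))
    print(sorted(filter_by_cv(demo)))
```

Lowering the threshold keeps more genes, so a value such as 0.5 is less strict than the default of 1.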

The file type of the output is determined by the suffix of the file name. Supported file extensions include “.png”, “.pdf”, “.jpg”, and “.svg”.
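A script that calls taiji-view can check the output name before starting a long run. The Python sketch below does this with the four extensions listed above. The helper name check_output_name is hypothetical, and matching the suffix case-sensitively is an assumption about how the tool behaves.

```python
from pathlib import Path

# Extensions listed in this tutorial as supported output types.
SUPPORTED_EXTENSIONS = {".png", ".pdf", ".jpg", ".svg"}


def check_output_name(filename):
    """Return the suffix of `filename`, or raise if it is not a supported type.

    The comparison is case-sensitive, which is an assumption.
    """
    suffix = Path(filename).suffix
    if suffix not in SUPPORTED_EXTENSIONS:
        supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
        raise ValueError(f"unsupported output type {suffix!r}; expected one of {supported}")
    return suffix


if __name__ == "__main__":
    print(check_output_name("output.svg"))
```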