CLI

CLI instructions for running sci-epi2gene.

Examples

The examples cover some simple cases where we map 1) a BED file with peaks that overlap gene bodies (e.g. H3K36me3) and 2) a BED file with peaks we would expect to land in the promoter (e.g. H3K27ac). Lastly, we show how DMRseq and MethylKit CSV files would be annotated to genes.

Genes are assigned to a peak if the peak overlaps the region spanning 2500 base pairs (default) upstream of the TSS to 500 base pairs (default) into the gene body. The peak may overlap any part of this region and it will be assigned to the gene. Peaks can be assigned to multiple genes, e.g. a broad peak (such as H3K27me3) could be assigned to many genes.
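The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not the scie2g implementation: the function names, the strand handling (TSS taken as the gene end on the reverse strand) and the gene coordinates are all assumptions made for the example.

```python
def promoter_window(start, end, strand, upflank=2500, overlap=500):
    """Return the (low, high) promoter window for one gene.

    Assumption: on the forward strand the TSS is the gene start, on the
    reverse strand it is the gene end, so the window is mirrored.
    """
    if strand >= 0:
        tss = start
        return max(0, tss - upflank), tss + overlap
    tss = end
    return max(0, tss - overlap), tss + upflank


def overlaps(peak_start, peak_end, window):
    """True if the peak shares at least one position with the window."""
    low, high = window
    return peak_start <= high and peak_end >= low


# Hypothetical genes: name -> (start, end, strand)
genes = {
    "GENE_A": (10_000, 20_000, 1),
    "GENE_B": (12_500, 30_000, 1),
}
peak = (9_000, 13_200)  # one broad peak

assigned = [
    name
    for name, (start, end, strand) in genes.items()
    if overlaps(peak[0], peak[1], promoter_window(start, end, strand))
]
print(assigned)  # ['GENE_A', 'GENE_B']
```

The single peak touches both promoter windows, so it is reported for both genes, which mirrors the many-genes-per-peak behaviour described above.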

Annotate gene body overlapping peaks

Here we read in the file test_H3K36me3.bed and annotate its peaks to the genes they overlap, using the genes annotated in hsapiens_gene_ensembl-GRCh38.p13.csv (generated using sci-biomart). With the overlaps method, a gene is assigned to a peak if the peak overlaps the region running from --upflank base pairs upstream of the TSS to --downflank base pairs past the gene end. The command below sets these to 3000 and 500; the defaults are 2500 and 500.

scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_H3K36me3.bed --t b --upflank 3000 --downflank 500 --m overlaps
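As a rough picture of what the overlaps method tests with the flank values used in this command, the sketch below extends a gene by 3000 base pairs upstream and 500 base pairs downstream and checks a few peaks against it. The function names, the mirrored handling of reverse-strand genes and all coordinates are assumptions for illustration, not taken from the scie2g source.

```python
def gene_region(start, end, strand, upflank=3000, downflank=500):
    """Return the (low, high) region tested in the sketch.

    Assumption: upstream means before the gene start on the forward
    strand and after the gene end on the reverse strand.
    """
    if strand >= 0:
        return max(0, start - upflank), end + downflank
    return max(0, start - downflank), end + upflank


def overlaps(peak_start, peak_end, region):
    """True if the peak shares at least one position with the region."""
    low, high = region
    return peak_start <= high and peak_end >= low


# Hypothetical forward-strand gene spanning 50,000-80,000
region = gene_region(50_000, 80_000, 1)

# Hypothetical peaks: upstream flank, gene body, just past the downstream flank
peaks = [(46_000, 47_100), (60_000, 61_000), (80_600, 81_000)]
hits = [overlaps(p_start, p_end, region) for p_start, p_end in peaks]
print(region, hits)  # (47000, 80500) [True, True, False]
```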

Annotate promoter region peaks

Here we read in the file test_H3K27me3.bed and annotate the peaks that fall in the promoter region of genes annotated in hsapiens_gene_ensembl-GRCh38.p13.csv (generated using sci-biomart). Genes are assigned to a peak if the peak overlaps the region spanning 2500 base pairs upstream of the TSS to 500 base pairs into the gene body. The peak may overlap any part of this region and it will be assigned to the gene.

scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_H3K27me3.bed --t b --upflank 2500 --overlap 500 --m in_promoter

Annotate DMRseq regions (CSV) to genes

Here we have had to override two default column names: 'chr' is replaced with 'seqnames' using the --chr flag, and the 'value' column is replaced with 'stat' using the --value flag.

scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_dmrseq.csv --t d --upflank 2500 --m overlaps --chr seqnames --value stat
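Before running a CSV through scie2g it can help to check which column names need overriding. The sketch below reads the header of a CSV and builds the matching --chr, --start, --end and --value arguments. The helper name and the sample rows are hypothetical; only the flag names come from the argument list in this document.

```python
import csv
import io

# Made-up sample with DMRseq-style column names (not real data)
SAMPLE = "seqnames,start,end,stat\nchr1,1000,1500,-4.2\n"


def scie2g_column_args(header, chr_col="chr", start_col="start",
                       end_col="end", value_col=None):
    """Build column-name arguments, failing early if a column is missing."""
    needed = [chr_col, start_col, end_col]
    if value_col:
        needed.append(value_col)
    missing = [col for col in needed if col not in header]
    if missing:
        raise ValueError(f"columns not found in CSV header: {missing}")
    args = ["--chr", chr_col, "--start", start_col, "--end", end_col]
    if value_col:
        args += ["--value", value_col]
    return args


header = next(csv.reader(io.StringIO(SAMPLE)))
args = scie2g_column_args(header, chr_col="seqnames", value_col="stat")
print(" ".join(args))  # --chr seqnames --start start --end end --value stat
```

With the default chr_col="chr" the helper raises ValueError for this sample, which is the situation the --chr override addresses.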

Annotate MethylKit DMCs (CSV) to genes

Here we override the 'value' column with 'meth.diff' using the --value flag.

scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_methyl.csv --t d --upflank 2500 --m overlaps --value meth.diff

Arguments

scie2g

usage: scie2g [-h] [--a A] [--o O] [--b B] [--l2g L2G] [--t T] [--upflank UPFLANK] [--downflank DOWNFLANK] [--overlap OVERLAP] [--m M] [--chr CHR] [--start START] [--end END] [--value VALUE] [--hdr HDR] [--chridx CHRIDX] [--startidx STARTIDX] [--endidx ENDIDX] [--valueidx VALUEIDX] [--hdridx HDRIDX] [--hdrlbl HDRLBL] [--gchr GCHR] [--gstart GSTART]
              [--gend GEND] [--gdir GDIR] [--gname GNAME]

Named Arguments

--a

Annotation with the gene locations

--o

Output file (csv)

Default: “l2g_outputfile.csv”

--b

Output file (bed)

Default: “l2g_outputfile.bed”

--l2g

Input file to run scie2g on

--t

The input file type: d=CSV, b=Bed

Default: “b”

--upflank

Maximum distance upstream from TSS (default = 2500) for overlaps and in_promoter

Default: 2500

--downflank

Maximum distance downstream from gene end (default = 500) only used in overlaps

Default: 500

--overlap

Overlap with gene body (default = 500) used in in_promoter

Default: 500

--m

Overlap method (overlaps or in_promoter <- default).

Default: “in_promoter”

--chr

CSV only: name of your chromosome column

Default: “chr”

--start

CSV only: name of your start column

Default: “start”

--end

CSV only: name of your end column

Default: “end”

--value

CSV only: name of your value column

--hdr

CSV only: comma-separated list of other columns you want to include in the output, e.g. “stat,pvalue”

Default: “”

--chridx

BED only: index of your chromosome column

Default: 0

--startidx

BED only: index of your start column

Default: 1

--endidx

BED only: index of your end column

Default: 2

--valueidx

BED only: index of your value column

Default: 7

--hdridx

BED only: comma-separated list of indexes

Default: “0,1,2,3,6,8”

--hdrlbl

BED only: comma-separated list of human-readable header labels written to your CSV output file.

Default: “"chr","start","end","peak_name","signal","qvalue"”

--gchr

Position of the chromosome column in the annotation file.

Default: 2

--gstart

Position of the start column in the annotation file.

Default: 3

--gend

Position of the end column in the annotation file.

Default: 4

--gdir

Position of the gene direction (strand) column in the annotation file.

Default: 5

--gname

Position of the gene name column in the annotation file.

Default: 0
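To make the BED index options above more concrete, the sketch below pairs the default --hdridx positions with the default --hdrlbl labels for one tab-separated line. The helper name and the sample line are made up, and the pairing logic is an assumption about how the two lists relate rather than the scie2g implementation. The sample uses a ten-column narrowPeak-style layout, in which positions 6 and 8 hold the signal value and q-value.

```python
DEFAULT_HDRIDX = "0,1,2,3,6,8"
DEFAULT_HDRLBL = '"chr","start","end","peak_name","signal","qvalue"'


def label_bed_fields(line, hdridx=DEFAULT_HDRIDX, hdrlbl=DEFAULT_HDRLBL):
    """Map selected BED columns (0-based indexes) to readable labels."""
    fields = line.rstrip("\n").split("\t")
    indexes = [int(i) for i in hdridx.split(",")]
    labels = [label.strip().strip('"') for label in hdrlbl.split(",")]
    if len(indexes) != len(labels):
        raise ValueError("hdridx and hdrlbl must have the same length")
    return {label: fields[i] for label, i in zip(labels, indexes)}


# Made-up narrowPeak-style line:
# chrom, start, end, name, score, strand, signal, p-value, q-value, summit
sample = "chr1\t1000\t1500\tpeak_1\t850\t.\t12.5\t8.1\t6.3\t250"
row = label_bed_fields(sample)
print(row)
```

If a BED file has a different column layout, both lists would need to change together so that each index still has a label.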