CLI¶
CLI instructions for running sci-epi2gene.
Examples¶
The examples cover some simple cases in which we map 1) a BED file with peaks that overlap gene bodies (e.g. H3K36me3) and 2) a BED file with peaks we would expect to land in the promoter (e.g. H3K27ac). Lastly, we show how a DMRseq file is annotated to genes.
Genes are assigned to a peak if the peak overlaps the region spanning 2500 bp (default) upstream of the TSS to 500 bp (default) into the gene body. A peak that overlaps any part of this region is assigned to the gene. A peak can be assigned to multiple genes: a broad peak (e.g. H3K27me3) may be assigned to many genes.
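The assignment rule described above can be sketched in a few lines of Python. This is an illustration only, not scie2g's implementation: the function names, the strand encoding ("+" or "-"), the use of inclusive coordinates, and the toy genes are all assumptions, and chromosomes are ignored for brevity.

```python
def promoter_window(gene_start, gene_end, strand, upflank=2500, overlap=500):
    """Return the (start, end) window around the TSS.

    Assumption: on the "+" strand the TSS is gene_start, on the "-" strand
    it is gene_end, so "upstream" points in opposite directions.
    """
    if strand == "+":
        tss = gene_start
        return tss - upflank, tss + overlap
    tss = gene_end
    return tss - overlap, tss + upflank


def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if the two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def assign_genes(peak, genes, upflank=2500, overlap=500):
    """Return the names of all genes whose promoter window the peak touches."""
    hits = []
    for name, (g_start, g_end, strand) in genes.items():
        w_start, w_end = promoter_window(g_start, g_end, strand, upflank, overlap)
        if intervals_overlap(peak[0], peak[1], w_start, w_end):
            hits.append(name)
    return hits


# Hypothetical genes on a single chromosome: name -> (start, end, strand)
genes = {"GENE_A": (10_000, 20_000, "+"), "GENE_B": (30_000, 40_000, "-")}

narrow_hits = assign_genes((9_000, 9_500), genes)   # narrow peak near GENE_A's TSS
broad_hits = assign_genes((7_000, 41_000), genes)   # broad peak spanning both windows
print(narrow_hits, broad_hits)
```

The broad peak touches both windows, which is why a single peak can end up assigned to several genes.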
Annotate gene body overlapping peaks¶
Here we read in the file test_H3K36me3.bed and annotate its peaks to the genes annotated in hsapiens_gene_ensembl-GRCh38.p13.csv (generated using sci-biomart). With the overlaps method, a gene is assigned to a peak if the peak overlaps the region that runs from the upstream flank before the TSS (--upflank, set to 3000 bp in this command; default 2500) to the downstream flank past the gene end (--downflank, 500 bp).
scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_H3K36me3.bed --t b --upflank 3000 --downflank 500 --m overlaps
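As a rough sketch of what the overlaps method checks, using the 3000 and 500 values from the command above: the gene is extended by the upstream flank before the TSS and by the downstream flank after the gene end, and a peak is kept if it touches that extended window. The function names and the strand handling are assumptions for illustration, not scie2g's code.

```python
def gene_body_window(gene_start, gene_end, strand, upflank=3000, downflank=500):
    """Extend a gene by upflank before its TSS and downflank after its end.

    Assumption: strand is "+" or "-"; on the "-" strand the TSS sits at
    gene_end, so the two flanks swap sides.
    """
    if strand == "+":
        return gene_start - upflank, gene_end + downflank
    return gene_start - downflank, gene_end + upflank


def peak_hits_gene(peak_start, peak_end, window):
    """True if the peak overlaps any part of the extended gene window."""
    w_start, w_end = window
    return peak_start <= w_end and w_start <= peak_end


# Hypothetical forward-strand gene at 10,000-20,000
forward = gene_body_window(10_000, 20_000, "+")
print(forward)                                    # extended window
print(peak_hits_gene(15_000, 16_000, forward))    # peak inside the gene body
print(peak_hits_gene(20_600, 21_000, forward))    # peak beyond the downstream flank
```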
Annotate promoter region peaks¶
Here we read in the file test_H3K27me3.bed and annotate its peaks to genes whose promoter region they fall in, using the genes annotated in hsapiens_gene_ensembl-GRCh38.p13.csv (generated using sci-biomart). Genes are assigned to a peak if the peak overlaps the region from 2500 bp upstream of the TSS to 500 bp into the gene body. A peak that overlaps any part of this region is assigned to the gene.
scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_H3K27me3.bed --t b --upflank 2500 --overlap 500 --m in_promoter
Annotate DMRseq regions (CSV) to genes¶
Here we have had to override two column names: 'chr' is replaced with 'seqnames' using the --chr flag, and 'value' is replaced with 'stat' using the --value flag.
scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_dmrseq.csv --t d --upflank 2500 --m overlaps --chr seqnames --value stat
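To make the column override concrete, here is a minimal sketch of reading a CSV whose chromosome and value columns use non-default names. The reader function and the two data rows are hypothetical; the sketch only mirrors the idea behind --chr and --value and is not how scie2g itself parses the file.

```python
import csv
import io


def read_regions(handle, chr_col="chr", start_col="start", end_col="end", value_col=None):
    """Read regions from a CSV, mapping user-named columns onto fixed keys."""
    regions = []
    for row in csv.DictReader(handle):
        regions.append({
            "chr": row[chr_col],
            "start": int(row[start_col]),
            "end": int(row[end_col]),
            # The value column is optional, as it has no default name
            "value": float(row[value_col]) if value_col else None,
        })
    return regions


# Made-up rows with DMRseq-style column names (seqnames, stat)
dmrseq_csv = io.StringIO(
    "seqnames,start,end,stat\n"
    "chr1,1000,1500,-4.2\n"
    "chr2,5000,5600,3.1\n"
)

regions = read_regions(dmrseq_csv, chr_col="seqnames", value_col="stat")
print(regions[0])
```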
Annotate MethylKit DMCs (CSV) to genes¶
scie2g --a data/hsapiens_gene_ensembl-GRCh38.p13.csv --o data/output_file.csv --l2g data/test_methyl.csv --t d --upflank 2500 --m overlaps --value meth.diff
Arguments¶
scie2g
usage: scie2g [-h] [--a A] [--o O] [--b B] [--l2g L2G] [--t T] [--upflank UPFLANK] [--downflank DOWNFLANK] [--overlap OVERLAP] [--m M] [--chr CHR] [--start START] [--end END] [--value VALUE] [--hdr HDR] [--chridx CHRIDX] [--startidx STARTIDX] [--endidx ENDIDX] [--valueidx VALUEIDX] [--hdridx HDRIDX] [--hdrlbl HDRLBL] [--gchr GCHR] [--gstart GSTART]
[--gend GEND] [--gdir GDIR] [--gname GNAME]
Named Arguments¶
- --a
Annotation with the gene locations
- --o
Output file (csv)
Default: “l2g_outputfile.csv”
- --b
Output file (bed)
Default: “l2g_outputfile.bed”
- --l2g
Input file to run scie2g on
- --t
The input file type: d=CSV, b=BED
Default: “b”
- --upflank
Maximum distance upstream from TSS (default = 2500) for overlaps and in_promoter
Default: 2500
- --downflank
Maximum distance downstream from gene end (default = 500) only used in overlaps
Default: 500
- --overlap
Overlap with gene body (default = 500) used in in_promoter
Default: 500
- --m
Overlap method: overlaps or in_promoter (default).
Default: “in_promoter”
- --chr
CSV only: name of your chromosome column
Default: “chr”
- --start
CSV only: name of your start column
Default: “start”
- --end
CSV only: name of your end column
Default: “end”
- --value
CSV only: name of your value column
- --hdr
CSV only: comma separated list of other columns you want to include in the output, e.g. “stat,pvalue”
Default: “”
- --chridx
BED only: index of your chromosome column
Default: 0
- --startidx
BED only: index of your start column
Default: 1
- --endidx
BED only: index of your end column
Default: 2
- --valueidx
BED only: index of your value column
Default: 7
- --hdridx
BED only: comma separated list of the column indices to include in the output
Default: “0,1,2,3,6,8”
- --hdrlbl
BED only: comma separated list of headers in human-readable format, as written to your CSV file.
Default: “"chr","start","end","peak_name","signal","qvalue"”
- --gchr
Column position of the chromosome in the annotation file.
Default: 2
- --gstart
Column position of the gene start in the annotation file.
Default: 3
- --gend
Column position of the gene end in the annotation file.
Default: 4
- --gdir
Column position of the gene direction (strand) in the annotation file.
Default: 5
- --gname
Column position of the gene name in the annotation file.
Default: 0
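The BED defaults above (--hdridx 0,1,2,3,6,8, --hdrlbl chr,start,end,peak_name,signal,qvalue and --valueidx 7) line up with the ENCODE narrowPeak column order, where columns 6, 7 and 8 hold the signal value, p-value and q-value. The sketch below shows how such index and label lists could be paired for one line. The function name and the example line are hypothetical, and this is not scie2g's own parsing code.

```python
# Defaults copied from the argument list above
HDRIDX = "0,1,2,3,6,8"
HDRLBL = '"chr","start","end","peak_name","signal","qvalue"'
VALUEIDX = 7


def label_bed_line(line, hdridx=HDRIDX, hdrlbl=HDRLBL, valueidx=VALUEIDX):
    """Pick the requested columns from a tab-separated line and name them."""
    fields = line.rstrip("\n").split("\t")
    indices = [int(i) for i in hdridx.split(",")]
    # Strip the literal quote characters that wrap each default label
    labels = [label.strip('"') for label in hdrlbl.split(",")]
    record = {label: fields[i] for label, i in zip(labels, indices)}
    record["value"] = fields[valueidx]
    return record


# Made-up narrowPeak-style line:
# chrom, start, end, name, score, strand, signalValue, pValue, qValue, peak
line = "chr1\t1000\t1500\tpeak_1\t850\t.\t12.5\t8.1\t6.3\t250"
record = label_bed_line(line)
print(record)
```

If your BED file has a different layout, the same pairing applies: each index given to --hdridx needs a matching label in --hdrlbl, in the same order.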