Notebook¶
Example annotating mouse and human peaks to genes¶
This notebook provides key examples to get started:
1) Annotating peaks from bed files to genes (mouse)
3) Annotating DNA methylation data to genes (human)
The data used in this example can be found here: https://github.com/ArianeMora/sciepi2gene/tree/main/tests/data
[1]:
import pandas as pd
from scie2g import Bed, Csv
from sciutil import SciUtil
import igv
data_dir = '../tests/data/'
bed_file = f'{data_dir}test_H3K27ac.bed'
mm10_annot_file = f'{data_dir}mmusculus_gene_ensembl-GRCm38.p6.csv'
u = SciUtil()
Inspect files¶
Just to give you an idea of what the files as input are we print out the head of the annotation file and the bed file.
[2]:
u.dp(['MM10 Bed file header:'])
print(pd.read_csv(bed_file, '\t'))
u.dp(['MM10 annotation file header:'])
print(pd.read_csv(mm10_annot_file))
--------------------------------------------------------------------------------
MM10 Bed file header:
--------------------------------------------------------------------------------
chr1 13139105 13374083 Peak_147937 12 . 2.20467 3.65974 2.03542 \
0 chr4 118442415 118457513 Peak_126956 13 . 2.16107 4.38555 2.71094
1 chr15 102326116 102331873 Peak_43115 20 . 3.35950 9.21174 7.28157
2 chrX 73716597 73738534 Peak_76892 15 . 2.65320 6.01850 4.22238
131 Ncoa2
0 120 Mpl
1 147 Pfdn5
2 122 Abcd1
--------------------------------------------------------------------------------
MM10 annotation file header:
--------------------------------------------------------------------------------
ensembl_gene_id external_gene_name chromosome_name start_position \
0 ENSMUSG00000102693 4933401J01Rik 1 3073253
1 ENSMUSG00000064842 Gm26206 1 3102016
2 ENSMUSG00000102851 Gm18956 1 3252757
3 ENSMUSG00000103377 Gm37180 1 3365731
4 ENSMUSG00000104017 Gm37363 1 3375556
... ... ... ... ...
56300 ENSMUSG00000095134 Mid1-ps1 Y 90753057
56301 ENSMUSG00000095366 Gm21860 Y 90752427
56302 ENSMUSG00000096768 Gm47283 Y 90784738
56303 ENSMUSG00000099871 Gm21742 Y 90837413
56304 ENSMUSG00000096850 Gm21748 Y 90838869
end_position strand
0 3074322 1
1 3102125 1
2 3253236 1
3 3368549 -1
4 3377788 -1
... ... ...
56300 90763485 1
56301 90755467 -1
56302 90816465 1
56303 90844040 1
56304 90839177 -1
[56305 rows x 6 columns]
Run bed file annotation¶
Now we can annotate the bed file to the genes in the annotation file (using the default parameters).
[3]:
bed = Bed(bed_file)
# Add the gene annot
bed.set_annotation_from_file(mm10_annot_file)
# Now we can run the assign values
bed.assign_locations_to_genes()
bed.save_loc_to_csv(f'test_bed_h3k27ac_output.csv')
4it [00:00, 53.30it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr1 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_H3K27ac.bed
--------------------------------------------------------------------------------
Inspect the output of the annotation¶
Look at the previous annotation to check the annotations are correct.
You’ll notice that many of the peaks have been annotated to multiple genes, in particular, since the annotation file has ensembl IDs, we get multiple assignments to a single gene name.
[4]:
u.dp(['Parsed bed and annotated it:'])
print(pd.read_csv('test_bed_h3k27ac_output.csv'))
--------------------------------------------------------------------------------
Parsed bed and annotated it:
--------------------------------------------------------------------------------
peak_idx gene_idx chr start end peak_value 8 \
0 0 157 chr1 13139105 13374083 2.20467 2.03542
1 0 158 chr1 13139105 13374083 2.20467 2.03542
2 0 159 chr1 13139105 13374083 2.20467 2.03542
3 0 160 chr1 13139105 13374083 2.20467 2.03542
4 0 161 chr1 13139105 13374083 2.20467 2.03542
5 0 162 chr1 13139105 13374083 2.20467 2.03542
6 0 163 chr1 13139105 13374083 2.20467 2.03542
7 0 164 chr1 13139105 13374083 2.20467 2.03542
8 0 165 chr1 13139105 13374083 2.20467 2.03542
9 0 166 chr1 13139105 13374083 2.20467 2.03542
10 1 33541 chr4 118442415 118457513 2.16107 2.71094
11 1 33542 chr4 118442415 118457513 2.16107 2.71094
12 2 18356 chr15 102326116 102331873 3.35950 7.28157
13 2 18357 chr15 102326116 102331873 3.35950 7.28157
14 3 53203 chrX 73716597 73738534 2.65320 4.22238
15 3 53204 chrX 73716597 73738534 2.65320 4.22238
9 ensembl_gene_id external_gene_name chromosome_name \
0 131 Ncoa2\n ENSMUSG00000103506 Gm38376 chr1
1 131 Ncoa2\n ENSMUSG00000103495 Gm37409 chr1
2 131 Ncoa2\n ENSMUSG00000099183 Gm27881 chr1
3 131 Ncoa2\n ENSMUSG00000102639 Gm38223 chr1
4 131 Ncoa2\n ENSMUSG00000104170 Gm37702 chr1
5 131 Ncoa2\n ENSMUSG00000087782 Gm23169 chr1
6 131 Ncoa2\n ENSMUSG00000103085 Gm38120 chr1
7 131 Ncoa2\n ENSMUSG00000102664 Gm38380 chr1
8 131 Ncoa2\n ENSMUSG00000005886 Ncoa2 chr1
9 131 Ncoa2\n ENSMUSG00000101476 Gm29570 chr1
10 120 Mpl\n ENSMUSG00000081921 Gm12858 chr4
11 120 Mpl\n ENSMUSG00000006389 Mpl chr4
12 147 Pfdn5\n ENSMUSG00000001289 Pfdn5 chr15
13 147 Pfdn5\n ENSMUSG00000001285 Myg1 chr15
14 122 Abcd1\n ENSMUSG00000002015 Bcap31 chrX
15 122 Abcd1\n ENSMUSG00000031378 Abcd1 chrX
start_position end_position strand
0 13177462 13179714 -1
1 13234809 13236434 -1
2 13262198 13262279 1
3 13289812 13290506 -1
4 13297036 13298907 -1
5 13334498 13334609 -1
6 13343563 13348204 -1
7 13354483 13356730 -1
8 13139105 13374083 -1
9 13376070 13388413 1
10 118441013 118441220 -1
11 118442415 118457513 -1
12 102326116 102331873 1
13 102331709 102338139 1
14 73686178 73716175 -1
15 73716597 73738534 1
Override the default parameters¶
Here we just show how some of the parameters can be overrided.
Here rather than using the peak signal, we choose the peak score (maybe we want it for visualisation purposes). We also add the “Peak” (feature 9, to our output file). We do this by changing: peak_value=4
(see the features below for the narrow peak format).
In order to just add extra information, we do this by adding to the hdridx
and set it to be: hdridx='"0","1","2","3","4","8","9"'
. Basically this means: in the output you’ll have: chrom, chromStart,chromEnd,name,score,signalValue,qvalue
.
Other things we change are the buffer_before_tss
(the buffer before the TSS in which we want to annotate the peak to the gene) and buffer_after_tss
(i.e. the amount downstream of the gene end - soz this is poorly named).
We take the information from the genome browser FAQ: https://genome.ucsc.edu/FAQ/FAQformat.html#format12
chrom - Name of the chromosome (or contig, scaffold, etc.).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name - Name given to a region (preferably unique). Use “.” if no name is assigned.
score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were “‘0”’ when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.
strand - +/- to denote strand or orientation (whenever applicable). Use “.” if no orientation is assigned.
signalValue - Measurement of overall (usually, average) enrichment for the region.
pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.
peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.
[5]:
bed = Bed(bed_file, overlap_method='overlaps', buffer_after_tss=1000,
buffer_before_tss=1000, buffer_gene_overlap=500,
gene_start=3, gene_end=4, gene_chr=2,
gene_direction=5, gene_name=0,
chr_idx=0, start_idx=1,
end_idx=2, peak_value=6, header_extra='"7"')
# Add the gene annot
bed.set_annotation_from_file(mm10_annot_file)
# Now we can run the assign values
bed.assign_locations_to_genes()
bed.save_loc_to_csv(f'bed_h3k27ac_output_overlaps.csv')
4it [00:00, 59.27it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr1 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_H3K27ac.bed
--------------------------------------------------------------------------------
Example with CSV file¶
With the CSV file, we allow for the header to be used (since unlike the bed files we read the whole file into memory). The good thing about the beds is that we only read one line at a time, so it will be able to handle much larger files.
[6]:
csv_file = f'{data_dir}test_methyl_overlaps.csv'
hg38_annot_file = f'{data_dir}hsapiens_gene_ensembl-GRCh38.p13.csv'
# Print out the files again like we did above
u.dp(['Human CSV file header:'])
print(pd.read_csv(csv_file))
u.dp(['Human annotation file header:'])
print(pd.read_csv(hg38_annot_file))
--------------------------------------------------------------------------------
Human CSV file header:
--------------------------------------------------------------------------------
Unnamed: 0 chr start end strand pvalue qvalue meth.diff \
0 1 7 27070640 27072751 * 0.0005 0.005 -1.0
1 2 7 27082673 27109403 * 0.0005 0.005 -1.0
2 3 7 27095203 27096872 * 0.0005 0.005 -1.0
3 4 7 27095403 27096072 * 0.0005 0.005 -1.0
4 5 7 27095403 27095404 * 0.0005 0.005 -1.0
description genes
0 Overlaps None NaN
1 Overlaps Hoxa1 and HotairM1 in TSS ALL HOTAIRM1,HOXA1,HOXA2
2 Overlaps Hoxa1 and HotairM1 in TSS A HOTAIRM1,HOXA1
3 Overlaps Hoxa1 and HotairM1 in TSS B HOTAIRM1,HOXA1
4 Overlaps Hoxa1 and HotairM1 in TSS C HOTAIRM1,HOXA1
--------------------------------------------------------------------------------
Human annotation file header:
--------------------------------------------------------------------------------
ensembl_gene_id external_gene_name chromosome_name start_position \
0 ENSG00000223972 DDX11L1 1 11869
1 ENSG00000278267 MIR6859-1 1 17369
2 ENSG00000243485 MIR1302-2HG 1 29554
3 ENSG00000227232 WASH7P 1 14404
4 ENSG00000284332 MIR1302-2 1 30366
... ... ... ... ...
67125 ENSG00000224240 CYCSP49 Y 26549425
67126 ENSG00000227629 SLC25A15P1 Y 26586642
67127 ENSG00000231514 CCNQP2 Y 26626520
67128 ENSG00000237917 PARP4P1 Y 26594851
67129 ENSG00000235857 CTBP2P1 Y 56855244
end_position strand
0 14409 1
1 17436 -1
2 31109 1
3 29570 -1
4 30503 1
... ... ...
67125 26549743 1
67126 26591601 -1
67127 26627159 -1
67128 26634652 -1
67129 56855488 1
[67130 rows x 6 columns]
Annotate genes to the CSV file¶
Here we show an example of annotating to a csv file. We also show how we can convert a CSV to a bed file both before and after the annotation. This allows us to be able to view the results of the CSV file.
Convert the CSV input file to a bed so we can view it in IGV
Annotate the features to genes
Save csv to a bed file
[10]:
csv_file = f'{data_dir}test_methyl_overlaps.csv'
f = Csv(csv_file, chr_str='chr', start='start', end='end', value='meth.diff',
header_extra=['pvalue', 'qvalue', 'description', 'genes'])
# Convert to bed:
"""
output filename = test_methyl_overlaps_input
track name = MethylOutputOverlaps_input
value = qvalue
name of peak = genes (a column in the csv file)
"""
f.convert_to_bed(pd.read_csv(csv_file), f'test_methyl_overlaps_input.bed', 'Methyl_input',
'qvalue', 'genes')
f.set_annotation_from_file(hg38_annot_file)
# Now we can run the assign values
f.assign_locations_to_genes()
# Save the output to a csv file
f.save_loc_to_csv(f'test_methyl_overlaps_output.csv')
# We now save the annotated peaks to another bed file so we can inspect the differences.
f.convert_to_bed(f.loc_df, f'test_methyl_overlaps_output.bed', 'MethylOutputOverlaps', 'qvalue', 'external_gene_name')
100%|██████████| 5/5 [00:00<00:00, 262.75it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr7 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_methyl_overlaps.csv
--------------------------------------------------------------------------------
Print out the bed results and the¶
If you want to use this section you need to install IGV’s browser capabilities in your virtual environment.
You do this the following way (as per documentation at: https://github.com/igvteam/igv-jupyter
# To install to configuration in your home directory
jupyter serverextension enable --py igv
jupyter nbextension install --py igv
jupyter nbextension enable --py igv
# If using a virtual environment
jupyter serverextension enable --py igv --sys-prefix
jupyter nbextension install --py igv --sys-prefix
jupyter nbextension enable --py igv --sys-prefix
[14]:
import igv
b = igv.Browser({"genome": "hg38"})
b.show()
# Load input track (in purple)
b.load_track(
{
"name": "Local Bed",
"url": "files/test_methyl_overlaps_input.bed",
"format": "bed",
"type": "annotation",
"sourceType": "file",
"indexed": False,
"displayMode": "EXPANDED",
"color": "#8258A1"
})
# Load output track (in green)
b.load_track(
{
"name": "Local Bed",
"url": "files/test_methyl_overlaps_output.bed",
"format": "bed",
"type": "annotation",
"sourceType": "file",
"indexed": False,
"displayMode": "EXPANDED",
"color": "#8DD9B3"
})
b.search('HOXA2')
b.get_svg()
b.display_svg()
[14]:
'Awaiting SVG - try again in a few seconds'