Notebook

Example annotating mouse and human peaks to genes

This notebook provides key examples to get started:

1) Annotating peaks from bed files to genes (mouse)
3) Annotating DNA methylation data to genes (human)

The data used in this example can be found here: https://github.com/ArianeMora/sciepi2gene/tree/main/tests/data

[1]:
import pandas as pd
from scie2g import Bed, Csv
from sciutil import SciUtil
import igv

data_dir = '../tests/data/'
bed_file = f'{data_dir}test_H3K27ac.bed'
mm10_annot_file = f'{data_dir}mmusculus_gene_ensembl-GRCm38.p6.csv'
u = SciUtil()

Inspect files

Just to give you an idea of what the files as input are we print out the head of the annotation file and the bed file.

[2]:
u.dp(['MM10 Bed file header:'])
print(pd.read_csv(bed_file, '\t'))

u.dp(['MM10 annotation file header:'])
print(pd.read_csv(mm10_annot_file))

--------------------------------------------------------------------------------
                             MM10 Bed file header:                                
--------------------------------------------------------------------------------
    chr1   13139105   13374083  Peak_147937  12  .  2.20467  3.65974  2.03542  \
0   chr4  118442415  118457513  Peak_126956  13  .  2.16107  4.38555  2.71094
1  chr15  102326116  102331873   Peak_43115  20  .  3.35950  9.21174  7.28157
2   chrX   73716597   73738534   Peak_76892  15  .  2.65320  6.01850  4.22238

   131     Ncoa2
0    120     Mpl
1  147     Pfdn5
2  122     Abcd1
--------------------------------------------------------------------------------
                         MM10 annotation file header:                                  
--------------------------------------------------------------------------------
          ensembl_gene_id external_gene_name chromosome_name  start_position  \
0      ENSMUSG00000102693      4933401J01Rik               1         3073253
1      ENSMUSG00000064842            Gm26206               1         3102016
2      ENSMUSG00000102851            Gm18956               1         3252757
3      ENSMUSG00000103377            Gm37180               1         3365731
4      ENSMUSG00000104017            Gm37363               1         3375556
...                   ...                ...             ...             ...
56300  ENSMUSG00000095134           Mid1-ps1               Y        90753057
56301  ENSMUSG00000095366            Gm21860               Y        90752427
56302  ENSMUSG00000096768            Gm47283               Y        90784738
56303  ENSMUSG00000099871            Gm21742               Y        90837413
56304  ENSMUSG00000096850            Gm21748               Y        90838869

       end_position  strand
0           3074322       1
1           3102125       1
2           3253236       1
3           3368549      -1
4           3377788      -1
...             ...     ...
56300      90763485       1
56301      90755467      -1
56302      90816465       1
56303      90844040       1
56304      90839177      -1

[56305 rows x 6 columns]

Run bed file annotation

Now we can annotate the bed file to the genes in the annotation file (using the default parameters).

[3]:
bed = Bed(bed_file)
# Add the gene annot
bed.set_annotation_from_file(mm10_annot_file)
# Now we can run the assign values
bed.assign_locations_to_genes()
bed.save_loc_to_csv(f'test_bed_h3k27ac_output.csv')

4it [00:00, 53.30it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr1 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_H3K27ac.bed    
--------------------------------------------------------------------------------

Inspect the output of the annotation

Look at the previous annotation to check the annotations are correct.

You’ll notice that many of the peaks have been annotated to multiple genes, in particular, since the annotation file has ensembl IDs, we get multiple assignments to a single gene name.

[4]:
u.dp(['Parsed bed and annotated it:'])
print(pd.read_csv('test_bed_h3k27ac_output.csv'))

--------------------------------------------------------------------------------
                         Parsed bed and annotated it:                                  
--------------------------------------------------------------------------------
    peak_idx  gene_idx    chr      start        end  peak_value        8  \
0          0       157   chr1   13139105   13374083     2.20467  2.03542
1          0       158   chr1   13139105   13374083     2.20467  2.03542
2          0       159   chr1   13139105   13374083     2.20467  2.03542
3          0       160   chr1   13139105   13374083     2.20467  2.03542
4          0       161   chr1   13139105   13374083     2.20467  2.03542
5          0       162   chr1   13139105   13374083     2.20467  2.03542
6          0       163   chr1   13139105   13374083     2.20467  2.03542
7          0       164   chr1   13139105   13374083     2.20467  2.03542
8          0       165   chr1   13139105   13374083     2.20467  2.03542
9          0       166   chr1   13139105   13374083     2.20467  2.03542
10         1     33541   chr4  118442415  118457513     2.16107  2.71094
11         1     33542   chr4  118442415  118457513     2.16107  2.71094
12         2     18356  chr15  102326116  102331873     3.35950  7.28157
13         2     18357  chr15  102326116  102331873     3.35950  7.28157
14         3     53203   chrX   73716597   73738534     2.65320  4.22238
15         3     53204   chrX   73716597   73738534     2.65320  4.22238

                  9     ensembl_gene_id external_gene_name chromosome_name  \
0   131     Ncoa2\n  ENSMUSG00000103506            Gm38376            chr1
1   131     Ncoa2\n  ENSMUSG00000103495            Gm37409            chr1
2   131     Ncoa2\n  ENSMUSG00000099183            Gm27881            chr1
3   131     Ncoa2\n  ENSMUSG00000102639            Gm38223            chr1
4   131     Ncoa2\n  ENSMUSG00000104170            Gm37702            chr1
5   131     Ncoa2\n  ENSMUSG00000087782            Gm23169            chr1
6   131     Ncoa2\n  ENSMUSG00000103085            Gm38120            chr1
7   131     Ncoa2\n  ENSMUSG00000102664            Gm38380            chr1
8   131     Ncoa2\n  ENSMUSG00000005886              Ncoa2            chr1
9   131     Ncoa2\n  ENSMUSG00000101476            Gm29570            chr1
10    120     Mpl\n  ENSMUSG00000081921            Gm12858            chr4
11    120     Mpl\n  ENSMUSG00000006389                Mpl            chr4
12  147     Pfdn5\n  ENSMUSG00000001289              Pfdn5           chr15
13  147     Pfdn5\n  ENSMUSG00000001285               Myg1           chr15
14  122     Abcd1\n  ENSMUSG00000002015             Bcap31            chrX
15  122     Abcd1\n  ENSMUSG00000031378              Abcd1            chrX

    start_position  end_position  strand
0         13177462      13179714      -1
1         13234809      13236434      -1
2         13262198      13262279       1
3         13289812      13290506      -1
4         13297036      13298907      -1
5         13334498      13334609      -1
6         13343563      13348204      -1
7         13354483      13356730      -1
8         13139105      13374083      -1
9         13376070      13388413       1
10       118441013     118441220      -1
11       118442415     118457513      -1
12       102326116     102331873       1
13       102331709     102338139       1
14        73686178      73716175      -1
15        73716597      73738534       1

Override the default parameters

Here we just show how some of the parameters can be overrided.

Here rather than using the peak signal, we choose the peak score (maybe we want it for visualisation purposes). We also add the “Peak” (feature 9, to our output file). We do this by changing: peak_value=4 (see the features below for the narrow peak format).

In order to just add extra information, we do this by adding to the hdridx and set it to be: hdridx='"0","1","2","3","4","8","9"'. Basically this means: in the output you’ll have: chrom, chromStart,chromEnd,name,score,signalValue,qvalue.

Other things we change are the buffer_before_tss (the buffer before the TSS in which we want to annotate the peak to the gene) and buffer_after_tss (i.e. the amount downstream of the gene end - soz this is poorly named).

We take the information from the genome browser FAQ: https://genome.ucsc.edu/FAQ/FAQformat.html#format12

  1. chrom - Name of the chromosome (or contig, scaffold, etc.).

  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

  4. name - Name given to a region (preferably unique). Use “.” if no name is assigned.

  5. score - Indicates how dark the peak will be displayed in the browser (0-1000). If all scores were “‘0”’ when the data were submitted to the DCC, the DCC assigned scores 1-1000 based on signal value. Ideally the average signalValue per base spread is between 100-1000.

  6. strand - +/- to denote strand or orientation (whenever applicable). Use “.” if no orientation is assigned.

  7. signalValue - Measurement of overall (usually, average) enrichment for the region.

  8. pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.

  9. qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.

  10. peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

[5]:
bed = Bed(bed_file, overlap_method='overlaps', buffer_after_tss=1000,
                  buffer_before_tss=1000, buffer_gene_overlap=500,
                  gene_start=3, gene_end=4, gene_chr=2,
                  gene_direction=5, gene_name=0,
                  chr_idx=0, start_idx=1,
                  end_idx=2, peak_value=6, header_extra='"7"')
# Add the gene annot
bed.set_annotation_from_file(mm10_annot_file)
# Now we can run the assign values
bed.assign_locations_to_genes()
bed.save_loc_to_csv(f'bed_h3k27ac_output_overlaps.csv')

4it [00:00, 59.27it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr1 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_H3K27ac.bed    
--------------------------------------------------------------------------------

Example with CSV file

With the CSV file, we allow for the header to be used (since unlike the bed files we read the whole file into memory). The good thing about the beds is that we only read one line at a time, so it will be able to handle much larger files.

[6]:
csv_file = f'{data_dir}test_methyl_overlaps.csv'
hg38_annot_file = f'{data_dir}hsapiens_gene_ensembl-GRCh38.p13.csv'

# Print out the files again like we did above
u.dp(['Human CSV file header:'])
print(pd.read_csv(csv_file))

u.dp(['Human annotation file header:'])
print(pd.read_csv(hg38_annot_file))

--------------------------------------------------------------------------------
                            Human CSV file header:                                
--------------------------------------------------------------------------------
   Unnamed: 0  chr     start       end strand  pvalue  qvalue  meth.diff  \
0           1    7  27070640  27072751      *  0.0005   0.005       -1.0
1           2    7  27082673  27109403      *  0.0005   0.005       -1.0
2           3    7  27095203  27096872      *  0.0005   0.005       -1.0
3           4    7  27095403  27096072      *  0.0005   0.005       -1.0
4           5    7  27095403  27095404      *  0.0005   0.005       -1.0

                              description                 genes
0                           Overlaps None                   NaN
1  Overlaps Hoxa1 and HotairM1 in TSS ALL  HOTAIRM1,HOXA1,HOXA2
2    Overlaps Hoxa1 and HotairM1 in TSS A        HOTAIRM1,HOXA1
3    Overlaps Hoxa1 and HotairM1 in TSS B        HOTAIRM1,HOXA1
4    Overlaps Hoxa1 and HotairM1 in TSS C        HOTAIRM1,HOXA1
--------------------------------------------------------------------------------
                         Human annotation file header:                                
--------------------------------------------------------------------------------
       ensembl_gene_id external_gene_name chromosome_name  start_position  \
0      ENSG00000223972            DDX11L1               1           11869
1      ENSG00000278267          MIR6859-1               1           17369
2      ENSG00000243485        MIR1302-2HG               1           29554
3      ENSG00000227232             WASH7P               1           14404
4      ENSG00000284332          MIR1302-2               1           30366
...                ...                ...             ...             ...
67125  ENSG00000224240            CYCSP49               Y        26549425
67126  ENSG00000227629         SLC25A15P1               Y        26586642
67127  ENSG00000231514             CCNQP2               Y        26626520
67128  ENSG00000237917            PARP4P1               Y        26594851
67129  ENSG00000235857            CTBP2P1               Y        56855244

       end_position  strand
0             14409       1
1             17436      -1
2             31109       1
3             29570      -1
4             30503       1
...             ...     ...
67125      26549743       1
67126      26591601      -1
67127      26627159      -1
67128      26634652      -1
67129      56855488       1

[67130 rows x 6 columns]

Annotate genes to the CSV file

Here we show an example of annotating to a csv file. We also show how we can convert a CSV to a bed file both before and after the annotation. This allows us to be able to view the results of the CSV file.

  1. Convert the CSV input file to a bed so we can view it in IGV

  2. Annotate the features to genes

  3. Save csv to a bed file

[10]:
csv_file = f'{data_dir}test_methyl_overlaps.csv'

f = Csv(csv_file, chr_str='chr', start='start', end='end', value='meth.diff',
        header_extra=['pvalue', 'qvalue', 'description', 'genes'])

# Convert to bed:
"""
output filename = test_methyl_overlaps_input
track name = MethylOutputOverlaps_input
value = qvalue
name of peak = genes (a column in the csv file)
"""
f.convert_to_bed(pd.read_csv(csv_file), f'test_methyl_overlaps_input.bed', 'Methyl_input',
                 'qvalue', 'genes')
f.set_annotation_from_file(hg38_annot_file)

# Now we can run the assign values
f.assign_locations_to_genes()

# Save the output to a csv file
f.save_loc_to_csv(f'test_methyl_overlaps_output.csv')

# We now save the annotated peaks to another bed file so we can inspect the differences.
f.convert_to_bed(f.loc_df, f'test_methyl_overlaps_output.bed', 'MethylOutputOverlaps', 'qvalue', 'external_gene_name')
100%|██████████| 5/5 [00:00<00:00, 262.75it/s]
--------------------------------------------------------------------------------
Warning: Your input file used different chr conventions to ensembl: chr7 vs 1
We have added chr prefix to the ensembl chrs.
file: ../tests/data/test_methyl_overlaps.csv    
--------------------------------------------------------------------------------