Example using scidat: Download Annotate TCGA¶
This was designed to be a simple interface that allows for downloading and annotating data from TCGA.
Section 1: Data selection and metadata download¶
Download metatdata from TCGA. For this program to work you need to first go to: https://portal.gdc.cancer.gov/ and select the files to download.
In this example, I chose my data by using the following steps: 1. Navigate to Exploration
tab 2. Selected kidney
as the Primary Site 3. Selected solid tissue normal
and primary tumor
for my Sample Type (823 cases) 4. Selected KIRP
as the project (279 cases) 5. Selected dead
as the vital status (43 cases) 6. Clicked View Files in Repository
RNA-seq
as my Experimental Strategy 8. Selected HTSeq - Counts
in Workflow Type 9. Clicked Add all files to cart
(50 files, 43 cases)Now I navigated to my cart which had 50 files in it. Since it is recomended to use th GDC-data transfer tool (which scidat uses) I only download the manifest
and the metadata
for my files of interest. Before proceeding, look at how much space the files will need (e.g. for me this is 12.62MB) make sure the computer you are downloading on has that space available.
Download:¶
13. `Biospecimen` (tsv)
14. `Clinical` (tsv)
15. `Sample Sheet`
16. `Metadata`
17. Click the `Download` button and select `Manifest`
Last steps: 1. Lastly, move all these files to a new empty directory ~/Documents/TCGA_data_download_scidat/
. 2. Unzip the Clinical
folder and delete the zipped version. 3. Copy the clinical.tsv
file in the clinical.cart....
folder to the ~/Documents/TCGA_data_download_scidat/
directory. 4. Make a new directory in ~/Documents/TCGA_data_download_scidat/
called downloads
(we’ll use this below)
Get GDC Data Transfer Tool¶
If you haven’t already got the GDC transfer tool, you’ll need to download this, follow the instructions from TCGA on how to do this: https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
(choose the correct version of the below 3 for your system) Package icongdc-client_v1.5.0_Windows_x64.zip Package icongdc-client_v1.5.0_Ubuntu_x64.zip Package icongdc-client_v1.5.0_OSX_x64.zip
Unzip and move the downloaded GDC transfer tool to the same directory you put your manifest file in i.e. ~/Documents/TCGA_data_download_scidat/
from before. Now we’re ready to download the files and annotate them with the data we downloaded.
Section 2: File download¶
The script below assumes you have put the files from above in ~/Documents/TCGA_data_download_scidat/
and that your gdc-client
is executable.
If you are using a windows machine (or put your files in a different folder you’ll need to rename that directory).
Before running the script below you will need to make sure you have edited the path to the manifest file, and the gdc_client to be located where you placed them on your computer and with their proper names (e.g.
[1]:
from scidat import API
# SET THIS TO TRUE FOR THE FIRST TIME YOU RUN THIS
download = False
"""
This will take about 3 minutes so sit back and have a coffee!
Note if you have a very large amount of data to download you should increase the max_cnt.
This will run a process for each 5 cases, i.e. if your downloading 5000 files, that will be 1000 processes!
"""
meta_dir = '/Users/ariane/Documents/code/tcga-format/data/example_download/TCGA_data_download_scidat_unix/'
manifest_file = f'{meta_dir}gdc_manifest_20200416_055430.txt'
gdc_client = f'{meta_dir}./gdc-client'
download_dir = f'{meta_dir}downloads/'
clinical_file = f'{meta_dir}clinical.tsv'
sample_file = f'{meta_dir}gdc_sample_sheet.2020-04-16.tsv'
annotation_file = f'{meta_dir}tcga_hsapiens_gene_ensembl-GRCh38.p13.csv'
# scidat spits the manifest file into submanifests so that we can make multiple calls to TCGA simulateously.
# This speeds up the download process. Note, it will use up more of your computers' processing so if you are worried
# just set the max_cnt to be more than the number of files in your manifest e.g. 100000000
api = API(manifest_file, gdc_client, clinical_file, sample_file, download_dir=download_dir, annotation_file=annotation_file, requires_lst=None, clin_cols=None,
max_cnt=5, sciutil=None, split_manifest_dir=download_dir,
meta_dir=meta_dir, sep='_')
if download:
api.download_data_from_manifest()
Section 3: Clinical information annotation¶
Here we are going to builld the clinical information dataframe.
We’re also interested in the mutation data so we’re going to download the mutation data.
[3]:
"""
This section will take about 30 second to 1 min
"""
# Build the annotation
api.build_annotation()
# Download the mutation data --> here we can pass only some case IDs or if we leave it empty we download all the
# mutation data for our cases.
if download:
api.download_mutation_data()
--------------------------------------------------------------------------------
Clinical dataframe
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
submitter_id project_id age_at_index gender race \
0 TCGA-G7-6793 TCGA-KIRP 60 female white
2 TCGA-A4-7585 TCGA-KIRP 67 male white
4 TCGA-AL-3466 TCGA-KIRP 41 male white
6 TCGA-J7-8537 TCGA-KIRP 37 female asian
8 TCGA-BQ-5877 TCGA-KIRP 60 male black or african american
vital_status tumor_stage days_to_death
0 Dead stage iv 336
2 Dead stage iii 1070
4 Dead stage iv 293
6 Dead stage iii 321
8 Dead stage iv 270
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Sample dataframe
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
File ID \
0 661829b5-a067-467f-aa58-518c8f91bad9
1 0aa7c232-98e9-411e-a5a3-77dc841ce743
2 8a8c6c6c-971a-4045-b04d-9325195a1598
3 b6a0c9ec-3294-4a27-84ca-b59183c3a737
4 cf683af3-6dd7-4f28-be42-cbc1cecc560c
File Name Data Category \
0 61e5178d-86bc-4229-8a41-61401f71f64b.htseq.cou... Transcriptome Profiling
1 a05ad788-3f8e-4b01-9ddd-fd6643efc4c7.htseq.cou... Transcriptome Profiling
2 7a3c64e7-98bc-4cbf-8a2a-0c3314d46ad2.htseq.cou... Transcriptome Profiling
3 66f56582-eab0-4192-9c2a-0029c343b815.htseq.cou... Transcriptome Profiling
4 acf91e5b-fe69-4064-9a7f-eb4094c7ee62.htseq.cou... Transcriptome Profiling
Data Type Project ID Case ID Sample ID \
0 Gene Expression Quantification TCGA-KIRP TCGA-BQ-5877 TCGA-BQ-5877-01A
1 Gene Expression Quantification TCGA-KIRP TCGA-5P-A9JY TCGA-5P-A9JY-01A
2 Gene Expression Quantification TCGA-KIRP TCGA-BQ-5879 TCGA-BQ-5879-11A
3 Gene Expression Quantification TCGA-KIRP TCGA-A4-7585 TCGA-A4-7585-01A
4 Gene Expression Quantification TCGA-KIRP TCGA-UZ-A9PN TCGA-UZ-A9PN-01A
Sample Type
0 Primary Tumor
1 Primary Tumor
2 Solid Tissue Normal
3 Primary Tumor
4 Primary Tumor
--------------------------------------------------------------------------------
[4]:
# Now we can do all sorts of things for example get cases with specific metadata
# Now we want to check the functionality we expect
metadata = {'race': ['white'], 'gender': ['male'], 'project_id': ['TCGA-KIRP']}
case_files = api.get_files_with_meta(metadata, "all")
case_files.sort()
print(len(case_files))
26
Section 4: Annotation with mutation data¶
Unfortunately TCGA doesn’t make it easy to add the mutation annotation data.
If in Section 1 you didn’t do the optional step 5 please go back and download the mutation data as we need this to add the mutation information.
How this section works is that we have the different mutations (the file we downloaded from Section 1 has the mutation ID and the information about the specific mutation. Unfortunatly, this isn’t connected to a case (patient). In order to connect the mutation with the patient we need to call the API provided by TCGA (https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/#simple-somatic-mutation-endpoint-examples).
[5]:
# First lets build the mutation DF
api.build_mutation_df()
# Lets get all the genes with mutations
genes = api.get_genes_with_mutations()
print(len(genes))
3090
Look at the data and add gene names¶
Here we build the RNAseq dataframe. Since TCGA gives us the genes identified by their Ensembl ID, we want to annotate these with their gene symbol
. We do this by using biomart.
This section will take some time as we’re going to be downloading the annotation information from biomart.
Note you won’t have to do this everytime! It will just be the first time when you are creating this reference file.
The annotation file was generated by the following script from scibiomart:
scibiomart --m ENSEMBL_MART_ENSEMBL --d hsapiens_gene_ensembl --a "ensembl_gene_id,entrezgene_id,external_gene_name,chromosome_name,start_position,end_position,strand" --o tcga_
[6]:
# Build the RNAseq dataframe
api.build_rna_df(download_dir)
# Now we need to access biomart to be able to mix between the two (i.e. one is annotated with gene symbol the other has ens id)
df = api.get_rna_df()
# Use the pre stored annotation file, if you want to update this or add different attributes see scibiomart.
# api.set_gene_annotation_file(gene_annotation_file_path)
# Now we want to add our annotation data to our dataframe
# The first parameter is our dataframe, then the second parameter is whether we want to drop rows with null values
# We'll drop these rows since we only want data that has full metadata (i.e. go terms etc)
df = api.add_gene_metadata_to_df(df)
Example 1: Calculating the logfold change accross 2 conditions¶
Since the data from TCGA is just raw counts, we want to first normalise it. Usually I would use normalisations from DeSeq2, but for this tutorial we’ll just log transform counts + 1.
Here we’ll pretend to be interested in how female KIRC and male KIRC samples differ. To do this, we’ll first calculate the log fold change between the two conditions. Then to see if there are any outlier genes, we’ll have a look at a volcano plot of the data.
[11]:
from sciviso import Boxplot, Violinplot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
filter_col = 'ssm.consequence.0.transcript.gene.symbol'
column_returned = "case_id"
genes_of_interest = ['P53', 'SOX2', 'SOX1', 'SHH', 'HOXA1']
gene_name = 'external_gene_name'
ARL11_mutations = api.get_mutation_values_on_filter(column_returned, ['ARL11'], filter_col)
print("Samples with ARL11 mutations", ARL11_mutations)
transformed_df = pd.DataFrame()
values, columns, filtered_df = api.get_values_from_df(df, gene_name, ARL11_mutations)
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['ARL11-MUT_' + c] = np.log2(filtered_df[c].values + 1)
MUC4_mutations = api.get_mutation_values_on_filter(column_returned, ['MUC4'], filter_col)
print("Samples with MUC4 mutations", MUC4_mutations)
values, columns, filtered_df = api.get_values_from_df(df, gene_name, MUC4_mutations)
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['MUC4-MUT_' + c] = np.log2(filtered_df[c].values + 1)
# Now lets calculate the changes between two conditions, let's go with KIRC vs KIRP
transformed_df[gene_name] = df[gene_name].values
boxplot = Boxplot(transformed_df, None, None)
box_df = boxplot.format_data_for_boxplot(transformed_df, ['MUC4-MUT', 'ARL11-MUT'], gene_name, ["MUC4"])
print(box_df.head())
boxplot = Violinplot(box_df, "Conditions", "Values", "MUC4 MUT and ANK3 MUT", add_dots=True)
boxplot.plot()
plt.show()
# Here normally we would run edgeR or deseq2 but we're just going to do a simple wilcoxin signed rank test between KIRP and KIRC for these conditions
No handles with labels found to put in legend.
Samples with ARL11 mutations ['TCGA-IA-A83W']
Samples with MUC4 mutations ['TCGA-P4-A5E8', 'TCGA-5P-A9JY', 'TCGA-UZ-A9PU', 'TCGA-IA-A83T', 'TCGA-SX-A7SM']
Samples Values Conditions
0 ARL11-MUT_TCGA-KIRP_PrimaryTumor_male_white_1_... 5.087463 ARL11-MUT
1 MUC4-MUT_TCGA-KIRP_PrimaryTumor_male_white_1_h... 6.918863 MUC4-MUT
2 MUC4-MUT_TCGA-KIRP_PrimaryTumor_male_white_1_h... 2.584963 MUC4-MUT
3 MUC4-MUT_TCGA-KIRP_SolidTissueNormal_male_whit... 5.754888 MUC4-MUT
4 MUC4-MUT_TCGA-KIRP_PrimaryTumor_male_blackoraf... 4.700440 MUC4-MUT

Example 2: Looking at differences across genes between conditions¶
Here we have selected a group of genes and we’re interested in how specifc cases change across these genes. For this, we’ll use the log transformed data. Here we look at a box plot of each condition on the x axis and the y-values are the values for these conditions.
submitter_id project_id age_at_index gender race vital_status tumor_stage normal_samples tumor_samples case_files tumor_stage_num
SolidTissueNormal vs PrimaryTumor
For this, we’re just going to make two plots, a violin plot of the KIRP dataset for a set of genes and the KIRC we’re going to be looking at the normal tissue compared with the shape of the tumour tissue
[12]:
male_KIRP = api.get_files_with_meta({'gender': ['male'], 'tumor_stage_num': [1, 2, 3, 4], 'project_id': ['TCGA-KIRP']})
male_KIRC = api.get_files_with_meta({'gender': ['male'], 'tumor_stage_num': [1, 2, 3, 4], 'project_id': ['TCGA-KIRC']})
transformed_df = pd.DataFrame()
# Here we're going to get a couple of groups of data, first the KIRP normal
values, columns, filtered_df = api.get_values_from_df(df, gene_name, male_KIRP, None, column_name_includes=["SolidTissueNormal"])
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['KIRP-NORMAL_' + c] = np.log2(filtered_df[c].values + 1)
values, columns, filtered_df = api.get_values_from_df(df, gene_name, male_KIRP, None, column_name_includes=["PrimaryTumor"])
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['KIRP-TUMOUR_' + c] = np.log2(filtered_df[c].values + 1)
values, columns, filtered_df = api.get_values_from_df(df, gene_name, male_KIRC, None, column_name_includes=["SolidTissueNormal"])
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['KIRC-NORMAL_' + c] = np.log2(filtered_df[c].values + 1)
values, columns, filtered_df = api.get_values_from_df(df, gene_name, male_KIRC, None, column_name_includes=["PrimaryTumor"])
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['KIRC-TUMOUR_' + c] = np.log2(filtered_df[c].values + 1)
transformed_df[gene_name] = df[gene_name].values
boxplot = Boxplot(transformed_df, None, None)
box_df = boxplot.format_data_for_boxplot(transformed_df, ['KIRP-TUMOUR', 'KIRP-NORMAL', 'KIRC-NORMAL', 'KIRC-TUMOUR'], gene_name, ["ANK3"])
boxplot = Violinplot(box_df, "Conditions", "Values", "Looking at KIRP v KIRC tumour and normal ANK3", add_dots=True)
boxplot.plot()
plt.show()
No handles with labels found to put in legend.

Example 3: Looking at genes across conditions¶
Here we may have a hypothesis related to how a specific gene changes across multiple conditions. For this, we may want to look at a box plot of the genes on the x-axis and the values of that gene from the different samples as the y axis.
[13]:
from sciviso import Boxplot
import matplotlib.pyplot as plt
import pandas as pd
df = api.get_rna_df()
df = api.add_gene_metadata_to_df(df)
filter_col = 'ssm.consequence.0.transcript.gene.symbol'
column_returned = "case_id"
genes_of_interest = ['SOX2', 'SOX1', 'SHH', 'HOXA1']
# Now we have our genes we want these to be our x axis
female_stage1 = api.get_files_with_meta({'gender': ['female'], 'tumor_stage_num': [1, 2], 'project_id': ['TCGA-KIRP']})
female_stage4 = api.get_files_with_meta({'gender': ['female'], 'tumor_stage_num': [3, 4], 'project_id': ['TCGA-KIRP']})
# Lets log2 + 1 the data
transformed_df = pd.DataFrame()
values, columns, filtered_df = api.get_values_from_df(df, gene_name, female_stage4)
print(filtered_df.shape)
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['LATE_STAGE' + c] = np.log2(filtered_df[c].values + 1)
values, columns, filtered_df = api.get_values_from_df(df, gene_name, female_stage1)
for c in columns:
if c != gene_name and 'TCGA' in c:
transformed_df['EARLY_STAGE' + c] = np.log2(filtered_df[c].values + 1)
transformed_df[gene_name] = df[gene_name].values
# Lets get the values for these cases
# Cool we have 56 and 19 that's enough for a statistical sig let's see if any of our genes are interesting
for gene in genes_of_interest:
try:
boxplot = Boxplot(transformed_df, None, None)
box_df = boxplot.format_data_for_boxplot(transformed_df, ['EARLY_STAGE', 'LATE_STAGE'], gene_name, [gene])
boxplot = Boxplot(box_df, "Conditions", "Values", gene, add_dots=True)
boxplot.plot()
plt.show()
except:
print(gene)
(60684, 10)
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
EARLY_STAGE v.s. LATE_STAGE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=9.052e-01 U_stat=2.550e+01

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
EARLY_STAGE v.s. LATE_STAGE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.514e-01 U_stat=1.800e+01

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
EARLY_STAGE v.s. LATE_STAGE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.159e-01 U_stat=3.800e+01

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
EARLY_STAGE v.s. LATE_STAGE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=9.530e-01 U_stat=2.600e+01
