Notebook for RNA processing part 1¶
Download data from TCGA¶
Since they only provided the normalised values, we wanted to download the counts data and add in the clinical information. https://www.sciencedirect.com/science/article/pii/S0092867419311237 https://cptac-data-portal.georgetown.edu/study-summary/S050
They only provided the RPKm so I had to go and download the raw counts from GDC directly: https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22kidney%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22CPTAC-3%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%5D%7D&searchTableTab=cases
Here I selected HTseq counts for the patients from the CPTAC RCC study.
sb = SciBiomartApi()
results_df = sb.get_human_default(attr_list=['entrezgene_id'])
# Now let's sort it
results_df = sb.sort_df_on_starts(results_df)
print(results_df.values)
sb.save_as_csv(results_df, 'data/supps/')
# Now we need to download the data from TCGA for the RNAseq
from scidat.api import API, APIException
from sciutil import SciUtil
import pandas as pd
u = SciUtil()
save_fig = False
base_dir = '../data/raw_downloads/TCGA/'
supp_dir = '../data/raw_downloads/supps/'
output_dir = f'../data/sircle/F1_DE_input_TvN/'
fig_dir = '../figures/'
data_dir = f'{base_dir}TCGA_RNA/'
annotation_file = f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13_external_synonym.csv'
gene_name = 'hgnc_symbol'
gdc_client = f'{data_dir}./gdc-client'
sample_file = f'{data_dir}gdc_sample_sheet.2021-05-03.tsv'
manifest_file = f'{data_dir}gdc_manifest_20210503_065756.txt'
clinical_file = f'{data_dir}clinical.cart.2021-05-03/clinical.tsv'
api = API(manifest_file, gdc_client, clinical_file, sample_file, data_dir, data_dir, annotation_file,
max_cnt=10, requires_lst=['counts'])
"""
If you haven't downloaded the data already you'll need to do this step!
"""
download_rnaseq = False
if download_rnaseq:
api.download_data_from_manifest()
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False. if (await self.run_code(code, result, async_=asy)):
Filter annotation file to just include minimal mappings between external gene names¶
import numpy as np
annotation_df = pd.read_csv(annotation_file)
# Fill NA's to make the smallest entrez gene IDs be the first.
annotation_df['entrezgene_id'] = annotation_df['entrezgene_id'].fillna(value=100000000000000000)
annotation_df = annotation_df.loc[annotation_df.groupby('ensembl_gene_id').entrezgene_id.idxmin()] # We just remove the duplicates
# Replace the same number with NaN
annotation_df['entrezgene_id'] = annotation_df['entrezgene_id'].replace(100000000000000000, np.nan)
# Remove
annotation_df.sort_index(ascending=True)
# Save as CSV for everything else
annotation_df.to_csv(f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13.csv', index=False)
annotation_file = f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13.csv'
annotation_df
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
ensembl_gene_id | external_gene_name | chromosome_name | start_position | end_position | strand | entrezgene_id | external_synonym | hgnc_symbol | |
---|---|---|---|---|---|---|---|---|---|
104394 | ENSG00000000003 | TSPAN6 | X | 100627108 | 100639991 | -1 | 7105.0 | T245 | TSPAN6 |
104389 | ENSG00000000005 | TNMD | X | 100584936 | 100599885 | 1 | 64102.0 | BRICD4 | TNMD |
54163 | ENSG00000000419 | DPM1 | 20 | 50934867 | 50959140 | -1 | 8813.0 | CDGIE | DPM1 |
6100 | ENSG00000000457 | SCYL3 | 1 | 169849631 | 169894267 | -1 | 57147.0 | PACE-1 | SCYL3 |
6079 | ENSG00000000460 | C1orf112 | 1 | 169662007 | 169854080 | 1 | 55732.0 | FLJ10706 | C1orf112 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
31798 | ENSG00000289640 | NaN | 16 | 21116439 | 21121291 | 1 | NaN | NaN | NaN |
61295 | ENSG00000289641 | NaN | 3 | 127227404 | 127231028 | 1 | NaN | NaN | NaN |
4650 | ENSG00000289642 | NaN | 1 | 149013782 | 149014961 | 1 | NaN | NaN | NaN |
64035 | ENSG00000289643 | NaN | 4 | 42285823 | 42292644 | -1 | NaN | NaN | NaN |
23315 | ENSG00000289644 | NaN | 13 | 74304640 | 74306050 | -1 | NaN | NaN | NaN |
68005 rows × 9 columns
Build RNAseq df using scidat¶
## Once the data is downlaoded we can build the RNAdf
# Now we want to first build the annotation
api.build_annotation()
api.build_rna_df()
df = api.get_rna_df()
# Drop rows that aren't genes
df_filtered = df[df['id'].str.contains('ENSG')]
df.to_csv(f'{output_dir}CCRCC_Clark_Cell2019_rna_df.csv')
-------------------------------------------------------------------------------- Warning: the column you selected for submitter ID submitter_id was not in the colums. Resetting submitter_id Available columns: case_id, case_submitter_id, project_id, age_at_index, age_is_obfuscated, cause_of_death, cause_of_death_source, country_of_residence_at_enrollment, days_to_birth, days_to_death, ethnicity, gender, occupation_duration_years, premature_at_birth, race, vital_status, weeks_gestation_at_birth, year_of_birth, year_of_death, age_at_diagnosis, ajcc_clinical_m, ajcc_clinical_n, ajcc_clinical_stage, ajcc_clinical_t, ajcc_pathologic_m, ajcc_pathologic_n, ajcc_pathologic_stage, ajcc_pathologic_t, ajcc_staging_system_edition, anaplasia_present, anaplasia_present_type, ann_arbor_b_symptoms, ann_arbor_clinical_stage, ann_arbor_extranodal_involvement, ann_arbor_pathologic_stage, best_overall_response, breslow_thickness, burkitt_lymphoma_clinical_variant, child_pugh_classification, circumferential_resection_margin, classification_of_tumor, cog_liver_stage, cog_neuroblastoma_risk_group, cog_renal_stage, cog_rhabdomyosarcoma_risk_group, days_to_best_overall_response, days_to_diagnosis, days_to_last_follow_up, days_to_last_known_disease_status, days_to_recurrence, eln_risk_classification, enneking_msts_grade, enneking_msts_metastasis, enneking_msts_stage, enneking_msts_tumor_site, esophageal_columnar_dysplasia_degree, esophageal_columnar_metaplasia_present, figo_stage, figo_staging_edition_year, first_symptom_prior_to_diagnosis, gastric_esophageal_junction_involvement, gleason_grade_group, gleason_grade_tertiary, gleason_patterns_percent, goblet_cells_columnar_mucosa_present, greatest_tumor_dimension, gross_tumor_weight, icd_10_code, igcccg_stage, inpc_grade, inpc_histologic_group, inrg_stage, inss_stage, international_prognostic_index, irs_group, irs_stage, ishak_fibrosis_score, iss_stage, largest_extrapelvic_peritoneal_focus, last_known_disease_status, laterality, lymph_node_involved_site, lymph_nodes_positive, lymph_nodes_tested, lymphatic_invasion_present, margin_distance, margins_involved_site, masaoka_stage, medulloblastoma_molecular_classification, metastasis_at_diagnosis, metastasis_at_diagnosis_site, method_of_diagnosis, micropapillary_features, mitosis_karyorrhexis_index, mitotic_count, morphology, non_nodal_regional_disease, non_nodal_tumor_deposits, ovarian_specimen_status, ovarian_surface_involvement, papillary_renal_cell_type, percent_tumor_invasion, perineural_invasion_present, peripancreatic_lymph_nodes_positive, peripancreatic_lymph_nodes_tested, peritoneal_fluid_cytological_status, pregnant_at_diagnosis, primary_diagnosis, primary_gleason_grade, prior_malignancy, prior_treatment, progression_or_recurrence, residual_disease, satellite_nodule_present, secondary_gleason_grade, site_of_resection_or_biopsy, sites_of_involvement, supratentorial_localization, synchronous_malignancy, tissue_or_organ_of_origin, transglottic_extension, tumor_confined_to_organ_of_origin, tumor_depth, tumor_focality, tumor_grade, tumor_largest_dimension_diameter, tumor_regression_grade, tumor_stage, vascular_invasion_present, vascular_invasion_type, weiss_assessment_score, who_cns_grade, who_nte_grade, wilms_tumor_histologic_subtype, year_of_diagnosis, chemo_concurrent_to_radiation, days_to_treatment_end, days_to_treatment_start, initial_disease_status, number_of_cycles, reason_treatment_ended, regimen_or_line_of_therapy, therapeutic_agents, treatment_anatomic_site, treatment_arm, treatment_dose, treatment_dose_units, treatment_effect, treatment_effect_indicator, treatment_frequency, treatment_intent_type, treatment_or_therapy, treatment_outcome, treatment_type Run: annotate.set_case_submitter_id() to setup. Continuing with automatic selection. -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Submitter ID set as: case_submitter_id -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Clinical dataframe -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- case_submitter_id project_id age_at_index gender race \ 0 C3N-01175 CPTAC-3 '-- female not reported 1 C3L-00010 CPTAC-3 '-- male white 2 C3N-00313 CPTAC-3 '-- female not reported 3 C3L-00917 CPTAC-3 '-- male white 4 C3N-00494 CPTAC-3 '-- male not reported vital_status tumor_stage days_to_death 0 Alive Stage III '-- 1 Alive Stage I '-- 2 Alive Stage I '-- 3 Alive Stage I '-- 4 Alive Stage I '-- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Sample dataframe -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- File ID \ 0 af2f69dd-738e-4d3a-bc9b-75b8872a68cf 1 ffe98eaf-5ecc-4251-bcb8-2504522f6689 2 80335cdd-dcf8-4827-aa39-d86cc9b4fdde 3 8de154b8-e1b0-4d45-a161-d357b01c1261 4 03bbd508-34cc-49bb-8bf3-824d7606856b File Name Data Category \ 0 7a256e52-346f-47c3-bd3f-c89ba609fe58.htseq_cou... Transcriptome Profiling 1 1fdfc946-c469-40a4-81c1-de381b43a896.htseq_cou... Transcriptome Profiling 2 5a776391-2b4c-452c-b40f-8ca3b346de4d.htseq_cou... Transcriptome Profiling 3 5883ecff-0735-4340-92ce-150b56acd877.htseq_cou... Transcriptome Profiling 4 16e3b64c-08bb-487e-bbba-2a398be4206b.htseq_cou... Transcriptome Profiling Data Type Project ID Case ID Sample ID \ 0 Gene Expression Quantification CPTAC-3 C3N-00149 C3N-00149-06 1 Gene Expression Quantification CPTAC-3 C3N-00320 C3N-00320-03 2 Gene Expression Quantification CPTAC-3 C3L-01885 C3L-01885-03 3 Gene Expression Quantification CPTAC-3 C3L-00790 C3L-00790-01 4 Gene Expression Quantification CPTAC-3 C3N-00148 C3N-00148-06 Sample Type 0 Solid Tissue Normal 1 Primary Tumor 2 Primary Tumor 3 Primary Tumor 4 Solid Tissue Normal --------------------------------------------------------------------------------
QC check correlations between the samples¶
Here we just want to do some very basic comparisons between patients, basically just filtering out any with very poor correlation.
import seaborn as sns
import matplotlib.pyplot as plt
# Check the correlation between samples
all_cases = [c for c in df_filtered.columns if 'CPTAC' in c]
corr = df_filtered[all_cases].corr()
sns.clustermap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values, cmap='RdBu_r', row_cluster=True, col_cluster=True)
if save_fig:
plt.savefig(f'{fig_dir}Heatmap_RNAseq.svg')
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg) /Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:1205: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. self.fig.tight_layout(**tight_params)
import numpy as np
all_cases = [c for c in df_filtered.columns if 'CPTAC' in c]
corr = df_filtered[all_cases].corr()
# Print out the minimum correlation:
min_corr = min(np.min(corr))
# Since 50% correlation is pretty low, let's see which cases they are and probably remove them
# from the RNAseq (since it looks like there are 3 outliers.)
mean_cor = np.mean(corr, axis=1)
corr['mean_corr'] = mean_cor
corr.sort_values(by=['mean_corr'])
# Plot out the mean correlation values so we can choose a good filter.
plt.hist(mean_cor, bins=20)
plt.title(f'Min corr: {min_corr}')
if save_fig:
plt.savefig(f'{fig_dir}Hist_RNAseq.svg')
plt.show()
# Sort again and this time filter
corr_sorted = corr.sort_values(by=['mean_corr'])
corr_sorted = corr_sorted[corr_sorted['mean_corr'] < 0.75]
u.dp([len(corr_sorted), 'patients with avg. correlations less than 75. Filtering out these samples, and printing cases.'])
cols_to_omit = [c for c in corr_sorted.index if 'CP' in c]
case_ids = [c.split('_')[-2] for c in corr_sorted.index if 'CP' in c]
print('\n'.join(case_ids))
print(cols_to_omit)
-------------------------------------------------------------------------------- 6 patients with avg. correlations less than 75. Filtering out these samples, and printing cases. -------------------------------------------------------------------------------- C3N-00492 C3N-01648,C3N-01648 C3L-00359 C3L-00448 C3N-01200 C3N-00315 ['CPTAC-3_PrimaryTumor_female_notreported_1_htseq.counts_None_None_CPTAC-3_C3N-00492_469d59b2-7cda-469d-b915-f36c8deaa242', 'CPTAC-3_PrimaryTumor,PrimaryTumor_None_None_None_htseq.counts_None_None_CPTAC-3_C3N-01648,C3N-01648_a5aa7a66-eb32-4dbf-b930-b71242889c56', 'CPTAC-3_PrimaryTumor_female_white_1_htseq.counts_None_None_CPTAC-3_C3L-00359_cf8af8ac-49f8-40f2-a16e-5630e0ff62f2', 'CPTAC-3_SolidTissueNormal_male_white_1_htseq.counts_None_None_CPTAC-3_C3L-00448_4f77ce4c-e7cd-4aa0-8776-bd103f2b686d', 'CPTAC-3_PrimaryTumor_female_notreported_1_htseq.counts_47_None_CPTAC-3_C3N-01200_3a7e5a81-a316-4e73-8286-001d623d5709', 'CPTAC-3_PrimaryTumor_male_notreported_1_htseq.counts_None_None_CPTAC-3_C3N-00315_686347c3-a656-4416-9b18-a0737b431ea6']
Filter samples¶
Find which samples had the poor correlation and remove these
cols_to_keep = [c for c in df_filtered.columns if c not in cols_to_omit]
corr = df_filtered[cols_to_keep].corr()
sns.clustermap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values, cmap='RdBu_r', row_cluster=True, col_cluster=True)
if save_fig:
plt.savefig(f'{fig_dir}Heatmap_RNAseq_removed_corr-leq-0.75.svg')
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg) /Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:1205: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. self.fig.tight_layout(**tight_params)
# Actually remove those columns! (But make sure you keep the ID column)
df_filtered = df_filtered[cols_to_keep]
# Print out the minimum correlation:
min(np.min(corr))
# Since 50% correlation is pretty low, let's see which cases they are and probably remove them
# from the RNAseq (since it looks like there are 3 outliers.)
0.5420264228833744
Add in the gene names and entrez IDs¶
We also want to add in gene name and also add in the entrez IDs etc do this from HG 38 since this is what it was mapped to from the paper: https://www.sciencedirect.com/science/article/pii/S0092867419311237
Total RNA Sequencing
Indexed RNA-seq libraries were sequenced using the HiSeq4000 platform to generate a minimum of 120 million paired end reads (75 base pairs) per library with a target of greater than 90% mapped reads. The sequence data were demultiplexed and converted to FASTQ files, and adaptor and low-quality sequences were quantified/trimmed. Samples were then assessed for quality by mapping reads to the hg38 reference genome, estimating the total number of reads that mapped, assessing the amount of RNA that mapped to coding regions, the amount of rRNA in the sample, the number of genes expressed, and the relative expression of housekeeping genes. Samples that passed the quality criteria were then clustered with other expression data from similar and distinct tumor types to confirm expected expression patterns, including pathological status (i.e., normal adjacent versus tumor tissue) and tissue-origin specificity. FASTQ files of all reads were then uploaded to the GDC repository.
# add in annotation for gene IDs
import pandas as pd
annot = pd.read_csv(annotation_file)
# Remove the version then merge with our annotation dataframe
ensembl_ids = [e_id.split('.')[0] for e_id in df_filtered['id'].values]
df_filtered['ensembl_gene_id'] = ensembl_ids
print(len(df_filtered))
df_filtered_annot = df_filtered.set_index('ensembl_gene_id').join(annot.set_index("ensembl_gene_id"), how="inner")
# # 54713 for hg19 vs 39743 for hg38
df_filtered_annot = df_filtered_annot.drop_duplicates(subset='id', keep='first')
print(len(df_filtered_annot))
# Save this dataframe as a counts file but first let's re-order so all the columns with metadata are
# at the start
meta_cols = [c for c in df_filtered_annot if 'CPTAC' not in c]
cols = [c for c in df_filtered_annot if 'SolidTissueNormal' in c]
cols += [c for c in df_filtered_annot if 'SolidTissueNormal' not in c and 'CPTAC' in c]
df_filtered_annot = df_filtered_annot[meta_cols + cols]
60483 56421
df_filtered_annot.to_csv(f'{output_dir}CCRCC_Clark_Cell2019_rna_filtered_df.csv')