Notebook for RNA processing part 1

N01_RNAProcessing

Download data from TCGA¶

Since they only provided the normalised values, we wanted to download the counts data and add in the clinical information. https://www.sciencedirect.com/science/article/pii/S0092867419311237 https://cptac-data-portal.georgetown.edu/study-summary/S050

They only provided the RPKm so I had to go and download the raw counts from GDC directly: https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22kidney%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22CPTAC-3%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%5D%7D&searchTableTab=cases

Here I selected HTseq counts for the patients from the CPTAC RCC study.


sb = SciBiomartApi()
results_df = sb.get_human_default(attr_list=['entrezgene_id'])
# Now let's sort it
results_df = sb.sort_df_on_starts(results_df)
print(results_df.values)
sb.save_as_csv(results_df, 'data/supps/')

In [1]:

# Now we need to download the data from TCGA for the RNAseq
from scidat.api import API, APIException
from sciutil import SciUtil
import pandas as pd

u = SciUtil()
save_fig = False

In [2]:

base_dir = '../data/raw_downloads/TCGA/'
supp_dir = '../data/raw_downloads/supps/'
output_dir = f'../data/sircle/F1_DE_input_TvN/'
fig_dir = '../figures/'
data_dir = f'{base_dir}TCGA_RNA/'

annotation_file = f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13_external_synonym.csv'

gene_name = 'hgnc_symbol'
gdc_client = f'{data_dir}./gdc-client'
sample_file = f'{data_dir}gdc_sample_sheet.2021-05-03.tsv'
manifest_file = f'{data_dir}gdc_manifest_20210503_065756.txt'
clinical_file = f'{data_dir}clinical.cart.2021-05-03/clinical.tsv'

api = API(manifest_file, gdc_client, clinical_file, sample_file, data_dir, data_dir, annotation_file,
                            max_cnt=10, requires_lst=['counts'])

"""
If you haven't downloaded the data already you'll need to do this step!
""" 

download_rnaseq = False
if download_rnaseq:
    api.download_data_from_manifest()

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):

Filter annotation file to just include minimal mappings between external gene names¶

In [3]:

import numpy as np
annotation_df = pd.read_csv(annotation_file)
# Fill NA's to make the smallest entrez gene IDs be the first.
annotation_df['entrezgene_id'] = annotation_df['entrezgene_id'].fillna(value=100000000000000000)
annotation_df = annotation_df.loc[annotation_df.groupby('ensembl_gene_id').entrezgene_id.idxmin()] # We just remove the duplicates
# Replace the same number with NaN
annotation_df['entrezgene_id'] = annotation_df['entrezgene_id'].replace(100000000000000000, np.nan)
# Remove
annotation_df.sort_index(ascending=True)
# Save as CSV for everything else
annotation_df.to_csv(f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13.csv', index=False)
annotation_file = f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13.csv'
annotation_df

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[3]:

	ensembl_gene_id	external_gene_name	chromosome_name	start_position	end_position	strand	entrezgene_id	external_synonym	hgnc_symbol
104394	ENSG00000000003	TSPAN6	X	100627108	100639991	-1	7105.0	T245	TSPAN6
104389	ENSG00000000005	TNMD	X	100584936	100599885	1	64102.0	BRICD4	TNMD
54163	ENSG00000000419	DPM1	20	50934867	50959140	-1	8813.0	CDGIE	DPM1
6100	ENSG00000000457	SCYL3	1	169849631	169894267	-1	57147.0	PACE-1	SCYL3
6079	ENSG00000000460	C1orf112	1	169662007	169854080	1	55732.0	FLJ10706	C1orf112
...	...	...	...	...	...	...	...	...	...
31798	ENSG00000289640	NaN	16	21116439	21121291	1	NaN	NaN	NaN
61295	ENSG00000289641	NaN	3	127227404	127231028	1	NaN	NaN	NaN
4650	ENSG00000289642	NaN	1	149013782	149014961	1	NaN	NaN	NaN
64035	ENSG00000289643	NaN	4	42285823	42292644	-1	NaN	NaN	NaN
23315	ENSG00000289644	NaN	13	74304640	74306050	-1	NaN	NaN	NaN

68005 rows × 9 columns

Build RNAseq df using scidat¶

In [4]:

## Once the data is downlaoded we can build the RNAdf
# Now we want to first build the annotation
api.build_annotation()
api.build_rna_df()
df = api.get_rna_df()

# Drop rows that aren't genes
df_filtered = df[df['id'].str.contains('ENSG')]
df.to_csv(f'{output_dir}CCRCC_Clark_Cell2019_rna_df.csv')

--------------------------------------------------------------------------------
Warning: the column you selected for submitter ID submitter_id was not in the colums. Resetting submitter_id Available columns: 	case_id, case_submitter_id, project_id, age_at_index, age_is_obfuscated, cause_of_death, cause_of_death_source, country_of_residence_at_enrollment, days_to_birth, days_to_death, ethnicity, gender, occupation_duration_years, premature_at_birth, race, vital_status, weeks_gestation_at_birth, year_of_birth, year_of_death, age_at_diagnosis, ajcc_clinical_m, ajcc_clinical_n, ajcc_clinical_stage, ajcc_clinical_t, ajcc_pathologic_m, ajcc_pathologic_n, ajcc_pathologic_stage, ajcc_pathologic_t, ajcc_staging_system_edition, anaplasia_present, anaplasia_present_type, ann_arbor_b_symptoms, ann_arbor_clinical_stage, ann_arbor_extranodal_involvement, ann_arbor_pathologic_stage, best_overall_response, breslow_thickness, burkitt_lymphoma_clinical_variant, child_pugh_classification, circumferential_resection_margin, classification_of_tumor, cog_liver_stage, cog_neuroblastoma_risk_group, cog_renal_stage, cog_rhabdomyosarcoma_risk_group, days_to_best_overall_response, days_to_diagnosis, days_to_last_follow_up, days_to_last_known_disease_status, days_to_recurrence, eln_risk_classification, enneking_msts_grade, enneking_msts_metastasis, enneking_msts_stage, enneking_msts_tumor_site, esophageal_columnar_dysplasia_degree, esophageal_columnar_metaplasia_present, figo_stage, figo_staging_edition_year, first_symptom_prior_to_diagnosis, gastric_esophageal_junction_involvement, gleason_grade_group, gleason_grade_tertiary, gleason_patterns_percent, goblet_cells_columnar_mucosa_present, greatest_tumor_dimension, gross_tumor_weight, icd_10_code, igcccg_stage, inpc_grade, inpc_histologic_group, inrg_stage, inss_stage, international_prognostic_index, irs_group, irs_stage, ishak_fibrosis_score, iss_stage, largest_extrapelvic_peritoneal_focus, last_known_disease_status, laterality, lymph_node_involved_site, lymph_nodes_positive, lymph_nodes_tested, lymphatic_invasion_present, margin_distance, margins_involved_site, masaoka_stage, medulloblastoma_molecular_classification, metastasis_at_diagnosis, metastasis_at_diagnosis_site, method_of_diagnosis, micropapillary_features, mitosis_karyorrhexis_index, mitotic_count, morphology, non_nodal_regional_disease, non_nodal_tumor_deposits, ovarian_specimen_status, ovarian_surface_involvement, papillary_renal_cell_type, percent_tumor_invasion, perineural_invasion_present, peripancreatic_lymph_nodes_positive, peripancreatic_lymph_nodes_tested, peritoneal_fluid_cytological_status, pregnant_at_diagnosis, primary_diagnosis, primary_gleason_grade, prior_malignancy, prior_treatment, progression_or_recurrence, residual_disease, satellite_nodule_present, secondary_gleason_grade, site_of_resection_or_biopsy, sites_of_involvement, supratentorial_localization, synchronous_malignancy, tissue_or_organ_of_origin, transglottic_extension, tumor_confined_to_organ_of_origin, tumor_depth, tumor_focality, tumor_grade, tumor_largest_dimension_diameter, tumor_regression_grade, tumor_stage, vascular_invasion_present, vascular_invasion_type, weiss_assessment_score, who_cns_grade, who_nte_grade, wilms_tumor_histologic_subtype, year_of_diagnosis, chemo_concurrent_to_radiation, days_to_treatment_end, days_to_treatment_start, initial_disease_status, number_of_cycles, reason_treatment_ended, regimen_or_line_of_therapy, therapeutic_agents, treatment_anatomic_site, treatment_arm, treatment_dose, treatment_dose_units, treatment_effect, treatment_effect_indicator, treatment_frequency, treatment_intent_type, treatment_or_therapy, treatment_outcome, treatment_type	
 Run: annotate.set_case_submitter_id() to setup. 
Continuing with automatic selection.	
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                    Submitter ID set as: 	case_submitter_id	                    
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                              Clinical dataframe	                               
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
  case_submitter_id project_id age_at_index  gender          race  \
0         C3N-01175    CPTAC-3          '--  female  not reported   
1         C3L-00010    CPTAC-3          '--    male         white   
2         C3N-00313    CPTAC-3          '--  female  not reported   
3         C3L-00917    CPTAC-3          '--    male         white   
4         C3N-00494    CPTAC-3          '--    male  not reported   

  vital_status tumor_stage days_to_death  
0        Alive   Stage III           '--  
1        Alive     Stage I           '--  
2        Alive     Stage I           '--  
3        Alive     Stage I           '--  
4        Alive     Stage I           '--  	
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                               Sample dataframe	                                
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
                                File ID  \
0  af2f69dd-738e-4d3a-bc9b-75b8872a68cf   
1  ffe98eaf-5ecc-4251-bcb8-2504522f6689   
2  80335cdd-dcf8-4827-aa39-d86cc9b4fdde   
3  8de154b8-e1b0-4d45-a161-d357b01c1261   
4  03bbd508-34cc-49bb-8bf3-824d7606856b   

                                           File Name            Data Category  \
0  7a256e52-346f-47c3-bd3f-c89ba609fe58.htseq_cou...  Transcriptome Profiling   
1  1fdfc946-c469-40a4-81c1-de381b43a896.htseq_cou...  Transcriptome Profiling   
2  5a776391-2b4c-452c-b40f-8ca3b346de4d.htseq_cou...  Transcriptome Profiling   
3  5883ecff-0735-4340-92ce-150b56acd877.htseq_cou...  Transcriptome Profiling   
4  16e3b64c-08bb-487e-bbba-2a398be4206b.htseq_cou...  Transcriptome Profiling   

                        Data Type Project ID    Case ID     Sample ID  \
0  Gene Expression Quantification    CPTAC-3  C3N-00149  C3N-00149-06   
1  Gene Expression Quantification    CPTAC-3  C3N-00320  C3N-00320-03   
2  Gene Expression Quantification    CPTAC-3  C3L-01885  C3L-01885-03   
3  Gene Expression Quantification    CPTAC-3  C3L-00790  C3L-00790-01   
4  Gene Expression Quantification    CPTAC-3  C3N-00148  C3N-00148-06   

           Sample Type  
0  Solid Tissue Normal  
1        Primary Tumor  
2        Primary Tumor  
3        Primary Tumor  
4  Solid Tissue Normal  	
--------------------------------------------------------------------------------

QC check correlations between the samples¶

Here we just want to do some very basic comparisons between patients, basically just filtering out any with very poor correlation.

In [5]:

import seaborn as sns
import matplotlib.pyplot as plt

# Check the correlation between samples
all_cases = [c for c in df_filtered.columns if 'CPTAC' in c]
corr = df_filtered[all_cases].corr()
sns.clustermap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='RdBu_r', row_cluster=True, col_cluster=True)
if save_fig:
    plt.savefig(f'{fig_dir}Heatmap_RNAseq.svg')

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:1205: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. 
  self.fig.tight_layout(**tight_params)

In [6]:

import numpy as np

all_cases = [c for c in df_filtered.columns if 'CPTAC' in c]
corr = df_filtered[all_cases].corr()
# Print out the minimum correlation:
min_corr = min(np.min(corr))
# Since 50% correlation is pretty low, let's see which cases they are and probably remove them 
# from the RNAseq (since it looks like there are 3 outliers.)
mean_cor = np.mean(corr, axis=1)
corr['mean_corr'] = mean_cor
corr.sort_values(by=['mean_corr'])
# Plot out the mean correlation values so we can choose a good filter.
plt.hist(mean_cor, bins=20)
plt.title(f'Min corr: {min_corr}')
if save_fig:
    plt.savefig(f'{fig_dir}Hist_RNAseq.svg')
plt.show()

# Sort again and this time filter
corr_sorted = corr.sort_values(by=['mean_corr'])
corr_sorted = corr_sorted[corr_sorted['mean_corr'] < 0.75]
u.dp([len(corr_sorted), 'patients with avg. correlations less than 75. Filtering out these samples, and printing cases.'])

cols_to_omit = [c for c in corr_sorted.index if 'CP' in c]
case_ids = [c.split('_')[-2] for c in corr_sorted.index if 'CP' in c]
print('\n'.join(case_ids))
print(cols_to_omit)

--------------------------------------------------------------------------------
6	patients with avg. correlations less than 75. Filtering out these samples, and printing cases.	
--------------------------------------------------------------------------------
C3N-00492
C3N-01648,C3N-01648
C3L-00359
C3L-00448
C3N-01200
C3N-00315
['CPTAC-3_PrimaryTumor_female_notreported_1_htseq.counts_None_None_CPTAC-3_C3N-00492_469d59b2-7cda-469d-b915-f36c8deaa242', 'CPTAC-3_PrimaryTumor,PrimaryTumor_None_None_None_htseq.counts_None_None_CPTAC-3_C3N-01648,C3N-01648_a5aa7a66-eb32-4dbf-b930-b71242889c56', 'CPTAC-3_PrimaryTumor_female_white_1_htseq.counts_None_None_CPTAC-3_C3L-00359_cf8af8ac-49f8-40f2-a16e-5630e0ff62f2', 'CPTAC-3_SolidTissueNormal_male_white_1_htseq.counts_None_None_CPTAC-3_C3L-00448_4f77ce4c-e7cd-4aa0-8776-bd103f2b686d', 'CPTAC-3_PrimaryTumor_female_notreported_1_htseq.counts_47_None_CPTAC-3_C3N-01200_3a7e5a81-a316-4e73-8286-001d623d5709', 'CPTAC-3_PrimaryTumor_male_notreported_1_htseq.counts_None_None_CPTAC-3_C3N-00315_686347c3-a656-4416-9b18-a0737b431ea6']

Filter samples¶

Find which samples had the poor correlation and remove these

In [7]:

cols_to_keep = [c for c in df_filtered.columns if c not in cols_to_omit]
corr = df_filtered[cols_to_keep].corr()
sns.clustermap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='RdBu_r', row_cluster=True, col_cluster=True)
if save_fig:
    plt.savefig(f'{fig_dir}Heatmap_RNAseq_removed_corr-leq-0.75.svg')

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:1205: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. 
  self.fig.tight_layout(**tight_params)

In [8]:

# Actually remove those columns! (But make sure you keep the ID column)
df_filtered = df_filtered[cols_to_keep]
# Print out the minimum correlation:
min(np.min(corr))
# Since 50% correlation is pretty low, let's see which cases they are and probably remove them 
# from the RNAseq (since it looks like there are 3 outliers.)

Out[8]:

0.5420264228833744

Add in the gene names and entrez IDs¶

We also want to add in gene name and also add in the entrez IDs etc do this from HG 38 since this is what it was mapped to from the paper: https://www.sciencedirect.com/science/article/pii/S0092867419311237

Total RNA Sequencing
Indexed RNA-seq libraries were sequenced using the HiSeq4000 platform to generate a minimum of 120 million paired end reads (75 base pairs) per library with a target of greater than 90% mapped reads. The sequence data were demultiplexed and converted to FASTQ files, and adaptor and low-quality sequences were quantified/trimmed. Samples were then assessed for quality by mapping reads to the hg38 reference genome, estimating the total number of reads that mapped, assessing the amount of RNA that mapped to coding regions, the amount of rRNA in the sample, the number of genes expressed, and the relative expression of housekeeping genes. Samples that passed the quality criteria were then clustered with other expression data from similar and distinct tumor types to confirm expected expression patterns, including pathological status (i.e., normal adjacent versus tumor tissue) and tissue-origin specificity. FASTQ files of all reads were then uploaded to the GDC repository.

In [9]:

# add in annotation for gene IDs
import pandas as pd

annot = pd.read_csv(annotation_file)
# Remove the version then merge with our annotation dataframe
ensembl_ids = [e_id.split('.')[0] for e_id in df_filtered['id'].values]
df_filtered['ensembl_gene_id'] = ensembl_ids
print(len(df_filtered))
df_filtered_annot = df_filtered.set_index('ensembl_gene_id').join(annot.set_index("ensembl_gene_id"), how="inner")
# # 54713 for hg19 vs 39743 for hg38
df_filtered_annot = df_filtered_annot.drop_duplicates(subset='id', keep='first')
print(len(df_filtered_annot))
# Save this dataframe as a counts file but first let's re-order so all the columns with metadata are 
# at the start
meta_cols = [c for c in df_filtered_annot if 'CPTAC' not in c]
cols = [c for c in df_filtered_annot if 'SolidTissueNormal' in c]
cols += [c for c in df_filtered_annot if 'SolidTissueNormal' not in c and 'CPTAC' in c]
df_filtered_annot = df_filtered_annot[meta_cols + cols]

60483
56421

In [10]:

df_filtered_annot.to_csv(f'{output_dir}CCRCC_Clark_Cell2019_rna_filtered_df.csv')