Notebook for Protein processing

N04_ProteinProcessing

We build the protein and sample dataframes and omit any poor quality samples¶

First though Impute data¶

We used the DreamAI package to impute missing values from the protein data.


#install_github("WangLab-MSSM/DreamAI/Code")
#https://github.com/WangLab-MSSM/DreamAI
library("DreamAI")

prot_data <- read.csv('../data/raw_downloads/CPTAC/6_CPTAC3_CCRCC_Whole_abundance_gene_protNorm=2_CB.tsv', sep='\t')
colnames(prot_data)
prot_num_data <- prot_data[, 5:length(colnames(prot_data))]
rownames(prot_num_data) <- prot_data$Index
imputed_data <- DreamAI(prot_num_data, k = 10, maxiter_MF = 10, ntree = 100,
        maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
        gamma_ADMIN = NA, gamma = 50, CV = FALSE,
        fillmethod = "row_mean", maxiter_RegImpute = 10,
        conv_nrmse = 1e-06, iter_SpectroFM = 40, method = c("KNN",
                                                            "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute"),
        out = c("Ensemble"))

ens_data <- imputed_data$Ensemble
write.csv(ens_data, '../data/sircle/F1_DE_input_TvN/6_CPTAC3_CCRCC_Whole_abundance_gene_protNorm=2_CB_DreamAI-imputed.csv')

Next we want to filter poor quality samples¶

Using the imputed data, have a look at the quality of the samples from the protein data.

In [11]:

import math
import pandas as pd
import numpy as np

base_dir = '../data/'
data_dir = f'{base_dir}raw_downloads/CPTAC/'
output_dir = f'{base_dir}sircle/F1_DE_input_TvN/'
fig_dir = '../figures/'
supp_dir = f'{base_dir}raw_downloads/supps/'
gene_name = 'hgnc_symbol'
save_fig = False
# Proteomics 
# Need to make a sample sheet (i.e. with the clinical information associated with the cases)
# Let's just make it up for now
prot_og_df = pd.read_csv(f'{data_dir}6_CPTAC3_CCRCC_Whole_abundance_gene_protNorm=2_CB.tsv', sep='\t')
prot_df = pd.read_csv(f'{output_dir}6_CPTAC3_CCRCC_Whole_abundance_gene_protNorm=2_CB_DreamAI-imputed.csv', index_col=0)
prot_df['external_gene_name'] = prot_df.index
clin_df = pd.read_csv(f'{data_dir}Patient_Clinical_Attributes.csv')
clin_prot_df = pd.read_csv(f'{data_dir}S044_CPTAC_CCRCC_Discovery_Cohort_Specimens_r1_Sept2018.csv')
# First things we want to merge this with our gene names etc
annotation_file = f'{supp_dir}hsapiens_gene_ensembl-GRCh38.p13_external_synonym.csv'
annot = pd.read_csv(annotation_file)
prot_df = prot_df.join(annot.set_index('external_gene_name'), how="left", rsuffix='_')
# Drop duplicates (we don't have any dups based on geen ID going into this so we know)
# that this was just introduced with the name mapping
meta_cols = [c for c in prot_df.columns if 'CPT' not in c and 'QC' not in c and 'NC' not in c]
# non-ccrcc cases as stated in Sup Table 1 from the study
non_ccrcc = ['C3L-00359', 'C3N-00313', 'C3N-00435', 'C3N-00492', 'C3N-00832', 'C3N-01175', 'C3N-01180']

cols = [c for c in prot_df.columns if 'CPT' in c]
prot_df = prot_df[meta_cols + cols]
prot_df

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3072: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[11]:

	external_gene_name	ensembl_gene_id	chromosome_name	start_position	end_position	strand	entrezgene_id	external_synonym	hgnc_symbol	CPT0079430001	...	CPT0012080003	CPT0021240003	CPT0009020003	CPT0017450001	CPT0009060003	CPT0012900004	CPT0017410003	CPT0009080003	CPT0012920003	CPT0009000003
A1BG	A1BG	ENSG00000121410	19	58345178.0	58353492.0	-1.0	1.0	NaN	A1BG	24.762998	...	25.407447	25.084466	25.794049	24.582934	24.875762	25.398878	25.23929	24.812949	25.320314	25.234656
A1CF	A1CF	ENSG00000148584	10	50799409.0	50885675.0	-1.0	29974.0	ACF	A1CF	21.803441	...	21.414654	20.563227	21.363325	21.573985	21.652046	22.094532	21.14661	21.843023	21.404158	21.215747
A1CF	A1CF	ENSG00000148584	10	50799409.0	50885675.0	-1.0	29974.0	ACF64	A1CF	21.803441	...	21.414654	20.563227	21.363325	21.573985	21.652046	22.094532	21.14661	21.843023	21.404158	21.215747
A1CF	A1CF	ENSG00000148584	10	50799409.0	50885675.0	-1.0	29974.0	ACF65	A1CF	21.803441	...	21.414654	20.563227	21.363325	21.573985	21.652046	22.094532	21.14661	21.843023	21.404158	21.215747
A1CF	A1CF	ENSG00000148584	10	50799409.0	50885675.0	-1.0	29974.0	APOBEC1CF	A1CF	21.803441	...	21.414654	20.563227	21.363325	21.573985	21.652046	22.094532	21.14661	21.843023	21.404158	21.215747
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZZEF1	ZZEF1	ENSG00000074755	17	4004445.0	4143030.0	-1.0	23140.0	FLJ10821	ZZEF1	22.462580	...	22.386080	22.465500	22.399780	22.479540	22.391350	22.256580	22.20671	22.461570	22.289010	22.429880
ZZEF1	ZZEF1	ENSG00000074755	17	4004445.0	4143030.0	-1.0	23140.0	KIAA0399	ZZEF1	22.462580	...	22.386080	22.465500	22.399780	22.479540	22.391350	22.256580	22.20671	22.461570	22.289010	22.429880
ZZEF1	ZZEF1	ENSG00000074755	17	4004445.0	4143030.0	-1.0	23140.0	ZZZ4	ZZEF1	22.462580	...	22.386080	22.465500	22.399780	22.479540	22.391350	22.256580	22.20671	22.461570	22.289010	22.429880
ZZZ3	ZZZ3	ENSG00000036549	1	77562416.0	77683419.0	-1.0	26009.0	ATAC1	ZZZ3	18.320770	...	18.381270	18.356620	18.252760	18.214290	18.463200	18.685680	18.34442	18.167880	18.258070	18.410360
ZZZ3	ZZZ3	ENSG00000036549	1	77562416.0	77683419.0	-1.0	26009.0	DKFZP564I052	ZZZ3	18.320770	...	18.381270	18.356620	18.252760	18.214290	18.463200	18.685680	18.34442	18.167880	18.258070	18.410360

32938 rows × 203 columns

Make sure we have ensembl IDs for all the genes, it appears that they used an external gene name, not the hgnc ID so we needed to get the external synonym from biomart¶

    sb = SciBiomartApi() #url='http://grch37.ensembl.org/biomart/martservice/')
    self.sb = sb
    results_df = sb.get_human_default(attr_list=['entrezgene_id', 'external_synonym', 'hgnc_symbol'])
    # Now let's sort it
    results_df = sb.sort_df_on_starts(results_df)
    print(results_df.values)
    sb.save_as_csv(results_df, '.')

In [12]:

prot_df[prot_df['ensembl_gene_id'].isnull()]

Out[12]:

	external_gene_name	ensembl_gene_id	chromosome_name	start_position	end_position	strand	entrezgene_id	external_synonym	hgnc_symbol	CPT0079430001	...	CPT0012080003	CPT0021240003	CPT0009020003	CPT0017450001	CPT0009060003	CPT0012900004	CPT0017410003	CPT0009080003	CPT0012920003	CPT0009000003
AAED1	AAED1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	17.512793	...	16.991444	17.379542	16.697086	16.678002	16.839023	17.431539	16.886722	16.944516	16.719501	16.705734
AARS	AARS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	24.233538	...	24.272492	24.286220	24.065209	24.195082	24.248931	24.397540	24.197450	24.262034	24.110052	24.760174
ACPP	ACPP	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	21.870815	...	20.199568	20.747871	22.387413	20.950298	20.377747	20.638903	20.674340	21.557213	22.401540	20.466685
ADPRHL2	ADPRHL2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22.756684	...	23.336900	23.277520	22.635802	22.597466	23.032666	23.066881	22.915864	22.780574	22.722272	23.319582
ADSS	ADSS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22.851863	...	22.628547	22.727531	22.912229	22.926994	22.840347	22.808574	22.690034	22.823103	22.916575	22.460676
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
WRB	WRB	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	19.341020	...	19.327640	19.529110	19.287570	19.369450	19.372180	19.351020	19.205480	19.266680	19.157710	19.497210
WRB-SH3BGR	WRB-SH3BGR	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15.706760	...	15.525680	15.510770	15.823540	15.651850	15.526200	15.600570	15.568430	15.742060	15.768940	15.483620
YARS	YARS	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	22.999120	...	23.347130	23.490860	22.838370	22.972440	23.344330	23.292650	23.294710	22.931600	22.830860	23.843570
ZNF664-RFLNA	ZNF664-RFLNA	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	17.343060	...	17.234980	17.834130	17.856130	17.779350	17.777660	17.847280	17.855410	17.814160	17.833840	17.832260
ZNRD1	ZNRD1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	18.329540	...	18.600920	18.491850	18.454710	18.430590	18.413140	18.664090	18.463040	18.488700	18.553960	18.510800

211 rows × 203 columns

In [13]:

ensembl_gene_ids =  prot_df['ensembl_gene_id'].values
gene_names =  prot_df[gene_name].values
entrez_ids =  prot_df['entrezgene_id'].values
count_updated = 0
nan_ids = prot_df[prot_df['ensembl_gene_id'].isnull()]['external_gene_name'].values
for i, gene_name in enumerate(prot_df['external_gene_name'].values):
    if gene_name in nan_ids:
        mapping = annot[annot['external_synonym'] == gene_name]
        if len(mapping) > 0:
            ensembl_gene_ids[i] = mapping['ensembl_gene_id'].values[0]
            gene_names[i] = mapping['hgnc_symbol'].values[0]
            entrez_ids[i] = mapping['entrezgene_id'].values[0]
            count_updated += 1
        else:
            print(f'{gene_name} not found')
print(count_updated)

APOBEC3A_B not found
ELOA3D not found
FLJ44635 not found
GAGE2D not found
GPR75-ASB3 not found
KIAA0754 not found
KIAA1107 not found
LOC110384692 not found
PALM2 not found
WRB-SH3BGR not found
ZNF664-RFLNA not found
200

In [14]:

prot_df['ensembl_gene_id'] = ensembl_gene_ids
prot_df['original_gene_id'] = prot_df.index
prot_df['hgnc_symbol'] = gene_names
prot_df['entrezgene_id'] = entrez_ids

In [15]:

cols = ['external_gene_name'] + [c for c in prot_df if 'Protein' in c]
prot_df = prot_df.sort_values(by='external_synonym')
prot_df = prot_df.drop_duplicates(subset=cols)
prot_df

Out[15]:

	external_gene_name	ensembl_gene_id	chromosome_name	start_position	end_position	strand	entrezgene_id	external_synonym	hgnc_symbol	CPT0079430001	...	CPT0021240003	CPT0009020003	CPT0017450001	CPT0009060003	CPT0012900004	CPT0017410003	CPT0009080003	CPT0012920003	CPT0009000003	original_gene_id
IFITM2	IFITM2	ENSG00000185201	11	303655.0	309397.0	1.0	10581.0	1-8D	IFITM2	17.026068	...	17.504612	17.260155	17.184139	17.378241	17.511433	17.458581	17.217443	17.244616	17.469904	IFITM2
IFITM3	IFITM3	ENSG00000142089	11	319676.0	329475.0	-1.0	10410.0	1-8U	IFITM3	20.639384	...	21.358486	20.565209	21.183230	21.073030	21.238467	21.128238	20.418574	20.619187	21.535092	IFITM3
PRDX6	PRDX6	ENSG00000117592	1	173477330.0	173488815.0	1.0	9588.0	1-Cys	PRDX6	26.209040	...	26.159050	26.073830	26.127710	25.789890	26.188190	26.140030	26.366040	25.967260	26.028550	PRDX6
ALDH1L1	ALDH1L1	ENSG00000144908	3	126103562.0	126197994.0	-1.0	10840.0	10-fTHF	ALDH1L1	26.007852	...	24.462299	25.569387	25.554207	24.285995	25.587137	24.648387	25.351848	25.367392	25.147031	ALDH1L1
KNOP1	KNOP1	ENSG00000103550	16	19701937.0	19718235.0	-1.0	400506.0	101F10.1	KNOP1	17.288918	...	17.937470	17.444340	17.302225	17.752290	17.874671	17.922705	17.407511	17.572139	17.894442	KNOP1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZNF827	ZNF827	ENSG00000151612	4	145757627.0	145938823.0	-1.0	152485.0	NaN	ZNF827	16.037390	...	16.494920	16.462100	16.372980	16.398880	16.429020	16.487920	16.485150	16.508460	16.389360	ZNF827
ZNF865	ZNF865	ENSG00000261221	19	55605647.0	55617269.0	1.0	100507290.0	NaN	ZNF865	18.440160	...	18.272280	17.960540	17.776450	18.222560	18.070480	18.274070	17.970690	17.613400	18.100140	ZNF865
ZNF888	ZNF888	ENSG00000213793	19	52904415.0	52923470.0	-1.0	388559.0	NaN	ZNF888	14.569570	...	14.827710	14.680650	14.659560	14.804610	14.772650	14.834110	14.647240	14.651030	14.849270	ZNF888
ZNRD1	ZNRD1	ENSG00000066379	NaN	NaN	NaN	NaN	30834.0	NaN	POLR1H	18.329540	...	18.491850	18.454710	18.430590	18.413140	18.664090	18.463040	18.488700	18.553960	18.510800	ZNRD1
ZYX	ZYX	ENSG00000159840	7	143381295.0	143391111.0	1.0	7791.0	NaN	ZYX	24.430150	...	24.934070	24.427000	25.044390	24.801910	24.476020	24.827360	24.437460	24.590230	24.502910	ZYX

11355 rows × 204 columns

In [16]:

# Now there are only 11 unmapped gene IDs which is much better than 200

In [17]:

import seaborn as sns
import matplotlib.pyplot as plt

# Filter out poor quality samples
# Check the correlation between samples

all_cases = [c for c in prot_df.columns if 'CP' in c]
corr = prot_df[all_cases].corr()
sns.clustermap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='RdBu_r', row_cluster=True, col_cluster=True)
if save_fig:
    plt.savefig(f'{fig_dir}Heatmap_protein_imputed.svg')

/Users/ariane/opt/miniconda3/envs/clean_ml/lib/python3.6/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)

In [18]:

# Print out the minimum correlation:
min(np.min(corr))
# Since this is so high we can conclude all the protein samples are pretty strongly correlated and
# we don't need to filter out any samples.

Out[18]:

0.9612818725719975

Add in sample info to the protein dataset¶

While we have the aliqout IDs, we want to add in other info, such as patient demographics we may use for performing DE analysis experiments

In [19]:

# In the protein dataframe they use the Aliquot ID rather than the case ID
# We want to match this to the case and if it was tumour or normal sample

# Get the protein aliqot IDs and we want to match these with patient info
p_Aliquot_ID = [c for c in prot_df.columns if 'CPT' in c]

# Get the patient info
sample_type = clin_prot_df['Group'].values
cond_ids = []
cond_names = []
case_ids = []
cases = clin_prot_df['ParticipantID'].values

# Iterate though the different aliqots
for cp in p_Aliquot_ID:
    # Iterate through the clinical samples to find a match
    for i, c in enumerate(clin_prot_df['Aliquot ID'].values):
        if c == cp:
            cond_names.append(sample_type[i])
            case_ids.append(cases[i])
            if sample_type[i] == 'Tumor':
                cond_ids.append(1)
            else:
                cond_ids.append(0)
            break

# Make a sample Dataframe
prot_sample_data = pd.DataFrame()
prot_sample_data['AliquotID'] = p_Aliquot_ID
prot_sample_data['CondName'] = cond_names
prot_sample_data['CondId'] = cond_ids
prot_sample_data['CaseId'] = case_ids
prot_sample_data['SafeCases'] = [c.replace('-', '.') for c in case_ids]
prot_sample_data['FullLabel'] = [f'{cond_names[i]}_{case_ids[i].replace("-", ".")}_{a}' for i, a in enumerate(p_Aliquot_ID)]

In [20]:

# Rename the columns to have the aliqiot ID in it
column_map = {}
for i, a in enumerate(p_Aliquot_ID):
    column_map[a] = f'{cond_names[i]}_{case_ids[i].replace("-", ".")}_{a}'
prot_df = prot_df.rename(columns=column_map)
prot_df.to_csv(f'{output_dir}protein_df.csv', index=False)

In [21]:

meta_cols = [c for c in prot_df.columns if 'CP' not in c]
meta_cols

Out[21]:

['external_gene_name',
 'ensembl_gene_id',
 'chromosome_name',
 'start_position',
 'end_position',
 'strand',
 'entrezgene_id',
 'external_synonym',
 'hgnc_symbol',
 'original_gene_id']

Add in clinical info to the protein sample dataframe¶

Race demographics (ethnicity_self_identified)
Age (age) --> separate into young (< 50) mid (50 - 70), old (> 70)?
BMI (BMI) --> separate into underweight (< 19) normal (19 - 25), pre-obese (26-30), obsesity class 1 (30 - 35), obesisty class 2 (35 - 40), obesity class 3) (41 - 49

In [25]:

# Read in the clinical file we made with all the molecular info from our data
clin_df = pd.read_csv(f'{output_dir}clinical_sircle.csv')
# Now we want to merge the clinical info with the cases from the sample df
prot_sample_df = prot_sample_data.set_index("CaseId").join(clin_df.set_index("case_id"), how="left", rsuffix='')

## -------- Rename the columns
new_full_label_map = {}
new_full_label = []
for full_label in prot_sample_df['FullLabel'].values:
    new_label = f'Protein_{full_label}'
    new_full_label.append(new_label)
    new_full_label_map[full_label] = new_label
# Update
prot_sample_df['FullLabel'] = new_full_label
prot_df = prot_df.rename(columns=new_full_label_map)

prot_df.to_csv(f'{output_dir}prot_data_sircle.csv', index=False)

prot_sample_df.to_csv(f'{output_dir}prot_sample_data_sircle.csv')
prot_sample_df

Out[25]:

	AliquotID	CondName	CondId	SafeCases	FullLabel	gender	TumorStage	AgeGrouped	BMIGrouped	RaceGrouped	...	CIMPStatus	GenomeInstability	VHL+TTN	VHL-TTN	VHL+PBRM1	VHL-PBRM1	PBRM1-VHL	VHL	TTN-VHL	TTN+PBRM1-VHL
C3L-00004	CPT0001550001	Normal	0	C3L.00004	Protein_Normal_C3L.00004_CPT0001550001	Male	Stage III	old	normal	White	...	1	0	0	1	1	0	0	0	0	0
C3L-00004	CPT0001540009	Tumor	1	C3L.00004	Protein_Tumor_C3L.00004_CPT0001540009	Male	Stage III	old	normal	White	...	1	0	0	1	1	0	0	0	0	0
C3L-00010	CPT0001230001	Normal	0	C3L.00010	Protein_Normal_C3L.00010_CPT0001230001	Male	Stage I	young	between	White	...	0	0	0	1	0	1	0	1	0	0
C3L-00010	CPT0001220008	Tumor	1	C3L.00010	Protein_Tumor_C3L.00010_CPT0001220008	Male	Stage I	young	between	White	...	0	0	0	1	0	1	0	1	0	0
C3L-00011	CPT0001340003	Tumor	1	C3L.00011	Protein_Tumor_C3L.00011_CPT0001340003	Female	Stage IV	old	between	White	...	1	1	1	0	0	1	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
C3N-01649	CPT0088640003	Normal	0	C3N.01649	Protein_Normal_C3N.01649_CPT0088640003	Male	Stage III	middle	obese	White	...	0	0	1	0	1	0	0	0	0	0
C3N-01651	CPT0088710001	Normal	0	C3N.01651	Protein_Normal_C3N.01651_CPT0088710001	Male	Stage II	old	between	White	...	1	1	1	0	1	0	0	0	0	0
C3N-01651	CPT0088690003	Tumor	1	C3N.01651	Protein_Tumor_C3N.01651_CPT0088690003	Male	Stage II	old	between	White	...	1	1	1	0	1	0	0	0	0	0
C3N-01808	CPT0089480003	Normal	0	C3N.01808	Protein_Normal_C3N.01808_CPT0089480003	Male	Stage I	middle	between	White	...	0	0	0	0	0	0	0	0	0	0
C3N-01808	CPT0089460004	Tumor	1	C3N.01808	Protein_Tumor_C3N.01808_CPT0089460004	Male	Stage I	middle	between	White	...	0	0	0	0	0	0	0	0	0	0

194 rows × 27 columns

In [26]:

pd.read_csv(f'{output_dir}prot_sample_data_sircle.csv')

Out[26]:

	Unnamed: 0	AliquotID	CondName	CondId	SafeCases	FullLabel	gender	TumorStage	AgeGrouped	BMIGrouped	...	CIMPStatus	GenomeInstability	VHL+TTN	VHL-TTN	VHL+PBRM1	VHL-PBRM1	PBRM1-VHL	VHL	TTN-VHL	TTN+PBRM1-VHL
0	C3L-00004	CPT0001550001	Normal	0	C3L.00004	Protein_Normal_C3L.00004_CPT0001550001	Male	Stage III	old	normal	...	1	0	0	1	1	0	0	0	0	0
1	C3L-00004	CPT0001540009	Tumor	1	C3L.00004	Protein_Tumor_C3L.00004_CPT0001540009	Male	Stage III	old	normal	...	1	0	0	1	1	0	0	0	0	0
2	C3L-00010	CPT0001230001	Normal	0	C3L.00010	Protein_Normal_C3L.00010_CPT0001230001	Male	Stage I	young	between	...	0	0	0	1	0	1	0	1	0	0
3	C3L-00010	CPT0001220008	Tumor	1	C3L.00010	Protein_Tumor_C3L.00010_CPT0001220008	Male	Stage I	young	between	...	0	0	0	1	0	1	0	1	0	0
4	C3L-00011	CPT0001340003	Tumor	1	C3L.00011	Protein_Tumor_C3L.00011_CPT0001340003	Female	Stage IV	old	between	...	1	1	1	0	0	1	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
189	C3N-01649	CPT0088640003	Normal	0	C3N.01649	Protein_Normal_C3N.01649_CPT0088640003	Male	Stage III	middle	obese	...	0	0	1	0	1	0	0	0	0	0
190	C3N-01651	CPT0088710001	Normal	0	C3N.01651	Protein_Normal_C3N.01651_CPT0088710001	Male	Stage II	old	between	...	1	1	1	0	1	0	0	0	0	0
191	C3N-01651	CPT0088690003	Tumor	1	C3N.01651	Protein_Tumor_C3N.01651_CPT0088690003	Male	Stage II	old	between	...	1	1	1	0	1	0	0	0	0	0
192	C3N-01808	CPT0089480003	Normal	0	C3N.01808	Protein_Normal_C3N.01808_CPT0089480003	Male	Stage I	middle	between	...	0	0	0	0	0	0	0	0	0	0
193	C3N-01808	CPT0089460004	Tumor	1	C3N.01808	Protein_Tumor_C3N.01808_CPT0089460004	Male	Stage I	middle	between	...	0	0	0	0	0	0	0	0	0	0

194 rows × 28 columns

In [28]:

prot_df = pd.read_csv(f'{output_dir}prot_data_sircle.csv')
prot_sample_df = pd.read_csv(f'{output_dir}prot_sample_data_sircle.csv', index_col=0)
prot_sample_df = prot_sample_df[~prot_sample_df.index.isin(non_ccrcc)] # Ensure we only include ccRCC patients
prot_sample_df

Out[28]:

	AliquotID	CondName	CondId	SafeCases	FullLabel	gender	TumorStage	AgeGrouped	BMIGrouped	RaceGrouped	...	CIMPStatus	GenomeInstability	VHL+TTN	VHL-TTN	VHL+PBRM1	VHL-PBRM1	PBRM1-VHL	VHL	TTN-VHL	TTN+PBRM1-VHL
C3L-00004	CPT0001550001	Normal	0	C3L.00004	Protein_Normal_C3L.00004_CPT0001550001	Male	Stage III	old	normal	White	...	1	0	0	1	1	0	0	0	0	0
C3L-00004	CPT0001540009	Tumor	1	C3L.00004	Protein_Tumor_C3L.00004_CPT0001540009	Male	Stage III	old	normal	White	...	1	0	0	1	1	0	0	0	0	0
C3L-00010	CPT0001230001	Normal	0	C3L.00010	Protein_Normal_C3L.00010_CPT0001230001	Male	Stage I	young	between	White	...	0	0	0	1	0	1	0	1	0	0
C3L-00010	CPT0001220008	Tumor	1	C3L.00010	Protein_Tumor_C3L.00010_CPT0001220008	Male	Stage I	young	between	White	...	0	0	0	1	0	1	0	1	0	0
C3L-00011	CPT0001340003	Tumor	1	C3L.00011	Protein_Tumor_C3L.00011_CPT0001340003	Female	Stage IV	old	between	White	...	1	1	1	0	0	1	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
C3N-01649	CPT0088640003	Normal	0	C3N.01649	Protein_Normal_C3N.01649_CPT0088640003	Male	Stage III	middle	obese	White	...	0	0	1	0	1	0	0	0	0	0
C3N-01651	CPT0088710001	Normal	0	C3N.01651	Protein_Normal_C3N.01651_CPT0088710001	Male	Stage II	old	between	White	...	1	1	1	0	1	0	0	0	0	0
C3N-01651	CPT0088690003	Tumor	1	C3N.01651	Protein_Tumor_C3N.01651_CPT0088690003	Male	Stage II	old	between	White	...	1	1	1	0	1	0	0	0	0	0
C3N-01808	CPT0089480003	Normal	0	C3N.01808	Protein_Normal_C3N.01808_CPT0089480003	Male	Stage I	middle	between	White	...	0	0	0	0	0	0	0	0	0	0
C3N-01808	CPT0089460004	Tumor	1	C3N.01808	Protein_Tumor_C3N.01808_CPT0089460004	Male	Stage I	middle	between	White	...	0	0	0	0	0	0	0	0	0	0

184 rows × 27 columns

In [29]:

meta_cols = [c for c in prot_df.columns if 'CP' not in c]
prot_df = prot_df[meta_cols + list(prot_sample_df['FullLabel'].values)]

prot_df.to_csv(f'{output_dir}prot_data_sircle_ccRCC.csv', index=False)

prot_sample_df.to_csv(f'{output_dir}prot_sample_data_sircle_ccRCC.csv')