PDC Clustergram Example

If you are viewing the HTML, the notebook can be found here.

This notebook attempts to demonstrate the following:

Use the Proteome Data Commons (PDC) API to retreive protein relative expression data for a CPTAC study. The PDC uses values produced by the Common Data Analysis Pipeline (CDAP). More information on the PDC implmentation of the CDAP can be found here. Note: TMT/iTRAQ data are provided as log2ratios, where the denominator is a control channel. In general, control channels are different between studies so comparisons between studies using these values are not currently reccomended in the absence of bridging data.
Use the PDC API to retrieve the associated clinical metadata for all samples
Format the data for analysis using pandas
Cluster and visualize the data using the Seaborn clustermap package

The results are intended to help identify clusters of samples (tumors) displaying similar patterns of protein expression.

These are the required imports. Install them using pip, if needed.

import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore

Next, set up the query parameters.

The first one is pdc_submitter_id. This example is for the ccRCC whole proteome study. PDC sample identifiers are available on the portal.

pdc_study_id = 'PDC000127'

The next one is data type. A table of data_types is available here. A description of how these values are computed can be found here. In brief, these values are log2 transformed ratio data, where aliquot is the numerator and the denominator is a pooled, reference sample. Ratios are calculated following normalization in the CPTAC Common Data Analysis Pipeline (CDAP.)

data_type = 'log2_ratio'  # Retrieves CDAP iTRAQ or TMT data

Finally, we are ready to build the query string using pdc_study_id and data_type.

quant_data_query = '''
{ 
    quantDataMatrix(
    pdc_study_id: "''' + pdc_study_id +'''" data_type: "''' + data_type + '''" acceptDUA: true
    )
}'''

Now, we can define a function to make the GraphQL query. This will get called once for the gene expression data and once for the clinical data. If you are new to GraphQL, you can also try your queries here.

def query_pdc(query):
    # PDC API url
    url = 'https://pdc.cancer.gov/graphql'

    # Send the POST graphql query
    print('Sending query.')
    pdc_response = requests.post(url, json={'query': query})

    # Check the results
    if pdc_response.ok:
        # Decode the response
        return pdc_response.json()
    else:
        # Response not OK, see error
        return pdc_response.raise_for_status()

Now, we are ready to make the first API call.

decoded = query_pdc(quant_data_query)

# You may want to check this for an error, here

Sending query.

matrix = decoded['data']['quantDataMatrix']

Next, load the data into a dataframe.

# Aliquots are first row, gene names are first column
ga = pd.DataFrame(matrix[1:], columns=matrix[0]).set_index('Gene/Aliquot')

oldnames = list(ga.columns)
newnames = [ x.split(':')[1] for x in oldnames ]

ga.rename(columns=dict(zip(oldnames, newnames)), inplace=True)
ga = ga.sort_index(axis=1)

print(ga.shape)

(9591, 207)

The first number above is the number of genes and the second is the number of samples (i.e., aliquots) in the returned data set.

Since the expression values are returned as strings, we need to convert those to floats and deal with missing data.

for col in ga.keys():
    ga[col] = pd.to_numeric(ga[col], errors='coerce')

The clustermap module within the Seaborn package does not allow for NaN values. So we must create a mask value that does not interfere much with the clustering and is likely to be unique. No imputation is used. By using a value close to 0, we are saying that these are unchanged between samples. Better solutions should be used to deal with missing data points.

mask_na = 0.000666
ga = ga.fillna(mask_na)

Next, build the clinical data query. This API needs only a study_submitter_id as input. We will be retrieving 5 fields and using 2 in our clustergram.

metadata_query = '''
    {
        clinicalMetadata(pdc_study_id: "''' + pdc_study_id + '''" acceptDUA: true) {
            aliquot_submitter_id
            morphology
            primary_diagnosis
            tumor_grade
            tumor_stage
        }
    }
    '''

Next, let's query for the clinical metadata. This will be used to annotate the columns (aliquots) to look for correlations between clinical attributes and samples. That data are loaded into a dataframe after the query returns.

decoded = query_pdc(metadata_query)

Sending query.

print(len(decoded))
clin_matrix = decoded['data']['clinicalMetadata']
metadata = pd.DataFrame(clin_matrix, columns=clin_matrix[0]).set_index('aliquot_submitter_id')
metadata = metadata.sort_index(axis=0)

1

We can then set up a color mapping function for the clinical annotations.

def get_colors(df, name, color) -> pd.Series:
    s = df[name]
    su = s.unique()
    colors = sns.light_palette(color, len(su))
    lut = dict(zip(su, colors))
    return s.map(lut)

Next, call get_colors() to map the tumor_stage and primary_diagnosis attributes. Others are available.

stage_col_colors = get_colors(metadata, 'tumor_stage', 'red')
diagnosis_col_colors = get_colors(metadata, 'primary_diagnosis', 'green')

And, finally, generate the large clustermap using seaborn.clustermap

sns.clustermap(ga, metric='euclidean', method='complete', cmap='seismic', mask=ga == mask_na, center=0.,
                figsize=(12.5, 50), col_colors=[stage_col_colors, diagnosis_col_colors])
plt.show()

You could also convert the log2 ratio data to a standard statistic, like z-score. This can help compress the range, accounting for outliers.

zdf = ga.T.apply(zscore, ddof=len(ga.columns)-1)
zdf = zdf.T

And examine clustering according to that transformation.

sns.clustermap(zdf, metric='euclidean', method='complete', cmap='seismic', mask=ga == mask_na, center=0.,
               col_colors=[stage_col_colors, diagnosis_col_colors], figsize=(12.5, 50))
plt.show()

Hopefully, the clustermaps have been drawn. Columns are labeled aliquot_id:aliquot_submitter_id, according to the PDC data model. Additional work pre-processing the data, removing outliers and batch effects etc., should always be invested on any large-scale analysis. This ends this example notebook.

We hope that you found this tutorial useful. There is also an accompanying tutorial on the PDC site, if you are finding this notebook and have not seen the video. Please submit any questions or requests to: PDCHelpDesk@mail.nih.gov