Analyze PDC Data in the Cloud

NCI Cloud Resources, an integral part of the NCI Cancer Research Data Commons (CRDC), offer computational power and analytical tools for scalable analysis of large datasets in the cloud, driving cancer research and discovery. Within the CRDC ecosystem, the PDC has seamlessly integrated with these Cloud Resources using technologies like the Data Commons Framework Services (NCI DCFS). This interoperability enables users to directly analyze the PDC data on these Cloud Resources, eliminating the need for researchers to download and store massive datasets. In the following sections, we outline the methods for analyzing PDC data using each of these resources.

Generating a manifest file

To begin your analysis of PDC data using either Broad Institute FireCloud or Seven Bridges Cancer Genomics Cloud, you first need to choose which files you would like to work with. The full list of file types available in the PDC is available here: Proteomic Data Commons. These include RAW mass spectrometry data formats, such as:

  • RAW (Vendor) Format: Mass spectrometry data uploaded by the data submitters as RAW or vendor format files corresponding to the mass spectrometers used to acquire the spectra.
  • mzML Format: RAW format spectra in the HUPO Proteomics Standards Initiative (PSI) compliant mzML format.

To analyze this data using cloud resources, you do not need to download the files to your device. However, you do need to identify the files you would like to import and generate a manifest file for that selection. The manifest contains information about the selected files, such as their names and unique identifiers.
Here are two ways to generate a manifest file:
  • From the study summary page: Click on the number of files next to the specific file type in the available files section. In the file overlay that appears, you can select the checkbox to choose all or specific files of interest.
    Figure: Select files on the files overlay window from the study summary page

  • From the Explore page: Apply filters to isolate the study or group of studies whose files you want to work with. Use the left-hand filter menu to select the data category of the files you are looking for (e.g., Raw Mass Spectra or Processed Mass Spectra). Navigate to the files tab and choose all or specific files of interest by selecting the checkboxes on the left side of the table. Once you have selected the desired files, click one of the options next to "Export File Manifest".

    Select files on the Explore page

  • The metadata contained in the manifest files (CSV, TSV, or PFB) can be used to organize the data for downstream analysis.
    Identify the following metadata available in the PDC file manifest:
    • PDC Study ID, e.g., PDC000319
    • PDC Study Version, e.g., 1
    • Data Category, e.g., Processed Mass Spectra
    • Run Metadata ID, e.g., AML Gilteritinib TimeCourse - Phosphoproteome-1
    • File Type, e.g., Open Standard
    • File, e.g., PTRC_exp12_plex_01_P_f06.mzML.gz

    Understand what the Run Metadata ID represents:
    • The Run Metadata ID refers to a specific experimental run or "plex" within the dataset. In proteomic experiments, researchers often perform multiple runs or experiments with different conditions or treatments.
    • Each of these runs is assigned a unique Run Metadata ID to distinguish it from other runs within the same study.
    • The Run Metadata ID helps in grouping all files associated with a particular experimental run, making it easier to access and analyze data specific to that run.
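
The grouping described above can be sketched in Python. This is a minimal illustration, assuming a CSV manifest whose header row uses the column labels "Run Metadata ID" and "File" as listed in this section; check the header row of your own manifest and adjust the column names if they differ. Apart from the first row, the file names and the second run ID in the sample are hypothetical.

```python
import csv
import io
from collections import defaultdict


def files_by_run(manifest_text, run_column="Run Metadata ID",
                 file_column="File", delimiter=","):
    """Group file names by the run they belong to."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter=delimiter)
    for row in reader:
        groups[row[run_column]].append(row[file_column])
    return dict(groups)


# Hypothetical manifest excerpt; only the first data row uses values
# taken from the example in this section.
SAMPLE = (
    "Run Metadata ID,File\n"
    "AML Gilteritinib TimeCourse - Phosphoproteome-1,PTRC_exp12_plex_01_P_f06.mzML.gz\n"
    "AML Gilteritinib TimeCourse - Phosphoproteome-1,hypothetical_f07.mzML.gz\n"
    "AML Gilteritinib TimeCourse - Phosphoproteome-2,hypothetical_f01.mzML.gz\n"
)

for run_id, files in files_by_run(SAMPLE).items():
    print(run_id, len(files))
```

For a TSV manifest, pass delimiter="\t" instead.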

    For instance, you can organize the data into a folder structure that follows this pattern:
    PDC Study ID / PDC Study Version / Data Category / Run Metadata ID / File Type / File
    PDC000319 / 1 / Processed Mass Spectra / AML Gilteritinib TimeCourse - Phosphoproteome-1 / Open Standard / PTRC_exp12_plex_01_P_f06.mzML.gz
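
The folder pattern above could be derived from a manifest with a few lines of Python. This is a sketch under stated assumptions: the column headers are assumed to match the six labels listed in this section, and the one-row sample is built from the example values above. The helper name manifest_paths is illustrative, not part of any PDC tool.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed column headers, in folder order; compare them with the header
# row of your own manifest and adjust if they differ.
PATH_COLUMNS = [
    "PDC Study ID",
    "PDC Study Version",
    "Data Category",
    "Run Metadata ID",
    "File Type",
    "File",
]


def manifest_paths(manifest_text, delimiter=","):
    """Yield one relative path per manifest row, following PATH_COLUMNS."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter=delimiter)
    for row in reader:
        # Replace any slash inside a value so it cannot split a folder name.
        parts = [row[col].strip().replace("/", "_") for col in PATH_COLUMNS]
        yield PurePosixPath(*parts)


# One-row sample built from the example values in this section.
SAMPLE = (
    "PDC Study ID,PDC Study Version,Data Category,Run Metadata ID,File Type,File\n"
    "PDC000319,1,Processed Mass Spectra,"
    "AML Gilteritinib TimeCourse - Phosphoproteome-1,Open Standard,"
    "PTRC_exp12_plex_01_P_f06.mzML.gz\n"
)

for path in manifest_paths(SAMPLE):
    print(path)
```

The paths are relative, so they can be joined to whatever root folder or bucket prefix your analysis environment uses.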

    The 'CSV' and 'TSV' formats are tabular text files, with either commas (CSV) or tab characters (TSV) separating the values. Select the 'CSV' format if you plan to import to Seven Bridges Cancer Genomics Cloud; this will download the manifest file to your device, and you will use that file later to import the data. Proceed to the 'Data handoff to Seven Bridges Cancer Genomics Cloud (CGC)' section for more information.

    'PFB' (Portable Format for Bioinformatics) is a serialized format containing both the data and the structure/schema of that data. Select this option if you plan to import to FireCloud. Unlike the 'CSV' and 'TSV' options, selecting 'PFB' will not download a file to your device; instead, it will generate a URL to a FireCloud session that ingests the PFB-encoded data directly. You will need to create an account with Terra and set up a workspace before importing data to FireCloud. Proceed to the 'Data handoff to Broad Institute FireCloud' section for more information.

Data handoff to Broad Institute FireCloud
FireCloud is a Broad Institute project that provides access to data along with a suite of applications and tools for data analysis and visualization, powered by the Terra cloud platform. To import and analyze data from the PDC on this platform, you will first need to create an account with Terra, link that account to a Google Cloud billing account, and then create a workspace on Terra. Information on each of these steps can be found here: Getting Started - Terra.Bio

Once you have set up your account and workspace, choosing the PFB option for your selected files will direct you to a Terra session, and the selected files will be made available in the Terra workspace you choose. You can view information about the selected files under the 'Data' tab in that workspace.

Watch the webinar on accessing and analyzing PDC data on the FireCloud/Terra platform using PANOPLY, a proteogenomic data analysis toolkit.

A step-by-step guide to accessing PDC data and metadata, and analyzing it using FragPipe and PANOPLY pipelines on the FireCloud/Terra platform.

Data handoff to Seven Bridges Cancer Genomics Cloud (CGC)
The CGC is a cloud-based computational environment that hosts genomic data alongside many tools for analyzing that data and integrating it with other data sources and data types, including proteomics data. Information on how to use the CGC and create the project that will house the data you import can be found here: The CGC Knowledge Center (cancergenomicscloud.org), and documentation on how to import data from the PDC can be found here: Import data from the PDC (cancergenomicscloud.org)

In brief, within the 'Files' tab of your CGC project, select the 'Import from manifest' option. Next, select 'Proteomic Data Commons' in the 'Import Files From' dropdown and select the manifest you have downloaded. Once the import is complete, the data will be available in the project.

Obtaining additional metadata using PDC APIs
Along with the raw data files, the PDC offers many GraphQL-based APIs that will allow you to retrieve important metadata, such as the experimental design of the chosen study and the clinical/biospecimen information for each participant. An overview of PDC APIs can be found here - https://pdc.cancer.gov/pdc/api-documentation, and full documentation of all publicly available APIs, along with a 'GraphQL explorer' that may help you build your queries, can be found here - https://pdc.cancer.gov/pdc/publicapi-documentation/#!/Case/allCases

Below is an example of how you may use three of these APIs with Python to retrieve all the clinical metadata of a study’s participants. For more information on the output of each of these queries, please visit the following links for each API.
studyExperimentalDesign: Used to retrieve the experimental design (Plex information, aliquot to plex mapping)
biospecimenPerStudy: Used to retrieve the cases, samples and aliquots involved in a study
Case: Used to retrieve the clinical and biospecimen metadata associated with a case or a list of cases.
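
A minimal sketch of how the three APIs listed above might be chained in Python, using only the standard library. The endpoint URL, the argument names (pdc_study_id, case_id, acceptDUA), the lowercase query name case, and every field name below are assumptions made for illustration; confirm each of them in the GraphQL explorer before relying on this code. The helper names build_query, run_query, and fetch_case_metadata are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm it in the PDC API documentation linked above.
PDC_GRAPHQL_URL = "https://pdc.cancer.gov/graphql"

# Assumed field names; replace them with fields shown in the GraphQL explorer.
DESIGN_FIELDS = ["study_run_metadata_id", "plex_dataset_name", "experiment_type"]
BIOSPECIMEN_FIELDS = ["case_id", "sample_id", "aliquot_id"]
CASE_FIELDS = ["case_id", "case_submitter_id", "disease_type", "primary_site"]


def build_query(api_name, arguments, fields):
    """Assemble a GraphQL query string for a single API call."""
    # json.dumps renders Python strings and booleans as GraphQL literals.
    args = ", ".join(f"{key}: {json.dumps(value)}" for key, value in arguments.items())
    return f"{{ {api_name}({args}) {{ {' '.join(fields)} }} }}"


def run_query(query, url=PDC_GRAPHQL_URL):
    """POST a query and return the 'data' object (requires network access)."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    if "errors" in payload:
        raise RuntimeError(payload["errors"])
    return payload["data"]


def fetch_case_metadata(pdc_study_id):
    """Chain the three APIs: study design, study biospecimens, then each case."""
    study_args = {"pdc_study_id": pdc_study_id, "acceptDUA": True}
    design = run_query(build_query("studyExperimentalDesign", study_args, DESIGN_FIELDS))
    specimens = run_query(build_query("biospecimenPerStudy", study_args, BIOSPECIMEN_FIELDS))
    case_ids = sorted({row["case_id"] for row in specimens["biospecimenPerStudy"]})
    cases = [
        run_query(build_query("case", {"case_id": cid, "acceptDUA": True}, CASE_FIELDS))["case"]
        for cid in case_ids
    ]
    return design["studyExperimentalDesign"], cases


# Building a query needs no network access, so it is safe to inspect first.
print(build_query("biospecimenPerStudy",
                  {"pdc_study_id": "PDC000319", "acceptDUA": True},
                  BIOSPECIMEN_FIELDS))
```

Calling fetch_case_metadata("PDC000319") would issue one request per case, which is simple but slow for large studies; if the API accepts a list of cases in one call, as the description of the Case API suggests, batching the IDs would be the better choice.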

Perform multiomic correlations on ISB Cancer Gateway in the Cloud
The ISB-CGC BigQuery Table Search UI (https://isb-cgc.appspot.com/bq_meta_search/) is a powerful discovery tool designed for users to explore and search for ISB-CGC hosted BigQuery tables. These tables include all the protein expression data generated from the PDC's Common Data Analysis Pipeline (CDAP), which are regularly ingested into the ISB-CGC BigQuery platform. Researchers can leverage the Python and R interfaces provided by ISB-CGC to readily access and analyze this protein expression data. ISB-CGC also incorporates processed genomic data from other data commons, facilitating easy correlation analysis between proteomic and genomic data, if complementary omics data is available for the specific areas of interest. Learn more about ISB-CGC here (https://isb-cgc.appspot.com/).
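
As a rough illustration of the kind of correlation analysis described above, the following Python sketch matches protein and gene expression values on shared case identifiers and computes a Pearson correlation. It does not query BigQuery; it assumes you have already retrieved per-case values for one protein and one gene into two dictionaries. The case IDs and numbers are hypothetical toy values, not PDC or ISB-CGC results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


def correlate_by_case(protein_by_case, rna_by_case):
    """Correlate two measurements over the cases present in both datasets."""
    shared = sorted(protein_by_case.keys() & rna_by_case.keys())
    r = pearson([protein_by_case[c] for c in shared],
                [rna_by_case[c] for c in shared])
    return r, shared


# Hypothetical toy values keyed by hypothetical case IDs.
protein = {"C1": 1.0, "C2": 2.0, "C3": 3.0, "C4": 9.0}
rna = {"C1": 2.0, "C2": 4.0, "C3": 6.0, "C5": 1.0}

r, shared_cases = correlate_by_case(protein, rna)
print(shared_cases, r)
```

Only cases present in both dictionaries contribute, which mirrors the caveat above that complementary omics data must be available for the samples of interest.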