Frequently asked questions
  1. General
  2. Navigating PDC
  3. Data Submission
  4. Data Download
  5. Data Access
  6. Data Licensing Policy
  7. Data Analysis
  8. Processed Outputs from PDC Common Data Analysis Pipeline
General
What are the goals of PDC?

The primary goal of Proteomic Data Commons (PDC) is to democratize access to cancer-related proteomic datasets and to provide sustainable computational support to the cancer research community through interoperability with other components, such as Genomic Data Commons and Cloud Resources, in the NCI Cancer Research Data Commons (CRDC) ecosystem.

When was PDC launched?

NCI PDC was officially launched on March 23, 2020. Click here to read the official announcement.

Is PDC a member of the ProteomeXchange Consortium?

PDC is currently an observer member of the ProteomeXchange Consortium.

How do I report an issue or submit a comment about the process?

You can send us your feedback or comments on any issues you experience at PDCHelpDesk@mail.nih.gov.

What is the difference between PDC and CPTAC Data Portal?

The CPTAC Data Portal was a centralized repository for the public dissemination of proteomic sequence datasets collected by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). As of February 2022, the CPTAC Data Portal is retired and all of its data are available through PDC.

The PDC is a public data repository of mass spectrometry (MS) based proteomics data, and is maintained by the National Cancer Institute. PDC will host datasets from large consortiums such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC), International Cancer Proteogenome Consortium (ICPC) and Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO) and also from independent research programs and grants.

The PDC Data Portal also provides a platform for efficiently querying, visualizing, analyzing and downloading high quality, curated and harmonized proteomic datasets.
PDC represents state-of-the-art repository technology and facilitates multi-omics integration through interoperability with other components, such as Genomic Data Commons and Cloud Resources, in the NCI Cancer Research Data Commons (CRDC) ecosystem.

How do I find corresponding genomic and/or imaging data for the PDC proteomic data?

If a study in PDC has corresponding genomic and/or imaging data available in other resources, you may find the mapping of the individual cases to the external resources in the ‘Clinical’ tab on the 'Explore' page (see the image below). Either click on the thumbnail in the column ‘Genomic or Imaging Data Resource’ or you may also select the cases of interest and export the clinical manifest. The manifest file will have the resource name and identifier for each case.



What ontologies are used in annotating?
Where can I learn about your metadata standards?

PDC uses caDSR and NCIt terminologies for biospecimen and clinical metadata and PSI ontologies for proteomic metadata. Learn more about the PDC data dictionary and data model.

Explore Page Features
What is the information displayed on PDC Explore page?

The Explore page on the PDC website offers a comprehensive overview of all available data and various ways to explore the cohorts across PDC. The information is organized into three main panels:

Graphic Panel

The graphic panel visually represents data organized into analytical fractions, disease types, and experimental types. This provides a quick and intuitive overview of the data distribution and key categories.

Data Panel

The data panel is divided into several tabs, each displaying data organized by studies, and detailing clinical, biospecimen, and file properties.

  • Studies Tab: Lists all studies along with summary statistics for various attributes, such as the number of cases and files in each data category. Clicking on a specific number filters the data accordingly and takes you to the appropriate tab. Selecting a study by its identifier or name opens a detailed study summary overlay.
  • Clinical Tab: Displays a list of cases (participants, subjects, donors, patients, etc.) across the PDC, along with their clinical information.
  • Biospecimens Tab: Lists aliquots derived from case samples analyzed in proteomic experiments across PDC, along with their sample properties.
  • Files Tab: Provides a list of files across PDC, along with their associated metadata.
  • Genes Tab: Provides a list of genes expressed across PDC, along with their associated metadata.
Filter Panel

Located on the left side, the filter panel allows users to refine the data based on harmonized metadata. The filters are organized into distinct groups:

  • General
  • Biospecimen
  • Clinical
  • Files
  • Genes
  • Study

Applying filters updates the data displayed in both the Graphic and Data panels.

Downloading Manifest and Data

Due to the extensive data available on PDC, multiple manifests can be downloaded from the Explore page:

Manifest Downloads

Each tab under the Data Panel allows for the download of relevant manifests, which export the displayed information as TSV or CSV files. Available manifests include:

  • Study Manifest
  • Clinical Manifest
  • Biospecimen Manifest
  • File Manifest
  • Gene Manifest
Data Downloads

To download data files, you may download each file directly from the Files tab on the Data Panel by clicking the download button or by using the file manifest to download multiple files at once. Refer to the ‘Data Download’ section for more details.


Study Summary Page Features
What is the information displayed on PDC Study Summary page?

The Study Summary page on the PDC website provides a comprehensive overview of study information, helping users understand the data before downloading. Here's how you can access a Study Summary page and the features it includes:

Accessing the Study Summary Page
  • From the Study Tab on the Explore Page: Click on the study identifier or name.
  • Direct URL: If you know the study identifier (e.g., PDC Study ID), you can directly access the page by appending it to the following URL: https://pdc.cancer.gov/pdc/study/<PDC Study ID>

Example: https://pdc.cancer.gov/pdc/study/PDC000544
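
The direct URL pattern above can be sketched in a few lines of Python. This is only an illustration: the helper name `study_summary_url` is hypothetical, and the ID check assumes that study identifiers always look like the examples in this FAQ ("PDC" followed by six digits), which the portal does not guarantee.

```python
import re

# Base URL taken from the FAQ text above.
PDC_STUDY_URL_BASE = "https://pdc.cancer.gov/pdc/study/"

# Assumption: study IDs follow the "PDC" + six digits shape of the examples
# in this FAQ (e.g. PDC000544); the portal may accept other forms.
STUDY_ID_PATTERN = re.compile(r"PDC\d{6}")


def study_summary_url(pdc_study_id: str) -> str:
    """Return the Study Summary URL for a PDC Study ID."""
    study_id = pdc_study_id.strip()
    if not STUDY_ID_PATTERN.fullmatch(study_id):
        raise ValueError(f"Unexpected PDC Study ID format: {pdc_study_id!r}")
    return PDC_STUDY_URL_BASE + study_id


print(study_summary_url("PDC000544"))
# → https://pdc.cancer.gov/pdc/study/PDC000544
```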

Information on the Study Summary Page

The Study Summary page includes several sections, each providing specific information about the study:

Summary Panel:

Displays general properties of the study such as the name, program, experimental type, Disease Types, Project ID, and more.

Overview panel:
  • Study Description Tab: Provides a description of the study, usually the abstract from the associated publication.
  • Protocol Tab: Provides a description of the analytical sample preparation, chromatography and mass spectrometry parameters used to generate the data.
  • Clinical Tab: Lists cases (participants, subjects, donors, patients, etc.) across the PDC, along with their clinical information.
  • Biospecimen Tab: Lists aliquots derived from case samples analyzed in proteomic experiments across PDC, along with their sample properties.
  • Experimental Design Tab: Displays a dataframe that describes the relationship between samples and files. It allows easy visualization of sample types (tumor, normal, etc.) used in the experiment and how they are tagged with isobaric reagents in labeled experiments.
  • Workflow Metadata Tab: Provides details of the various tools, databases, and parameters used in the PDC Common Data Analysis Pipeline.
  • Data Use Agreement Tab: Displays PDC data use guidelines.
Available Data Panels

The available data for download is organized into two sections:

  • Left Section: Contains the raw data from data submitters and the processed data generated by the PDC harmonization process, including the common data analysis pipeline.
  • Right Section: Includes additional data provided by the data submitters.
Related PDC Studies Panel

Lists all other studies within PDC related to the current study. These are usually additional characterizations of the same cohort, such as proteome and other post-translational modification (PTM) studies like phosphoproteome and ubiquitylome.

External References

Lists other resources where complementary data for the same cohort is available.

Publications Section

Provides the citation of the primary publication associated with the study.

Heatmap Visualization Section

Features a heatmap thumbnail that links to a Morpheus heatmap visualization page. This page includes quantitative data generated from the PDC common data analysis pipeline, annotated with extensive clinical information, and loaded into the Morpheus heatmap viewer.


Case Summary Page Features
What is the information displayed on PDC Case Summary page?

The Case Summary page on the PDC website provides a comprehensive overview of a case. A case (participant, subject, donor, or patient) in PDC may have multiple samples collected (e.g., tumor and normal tissues). Each sample can then be divided into multiple aliquots, which are used for different types of analyses, such as proteomics, metabolomics and other molecular assays.

Information on the Case Summary Page

The Case Summary page includes several sections, each providing specific information about the case:

Clinical data panel:

Demography, Diagnosis, Exposure, Follow Up, Treatment. For further information, refer to the PDC Data Dictionary, available under the MORE menu.

File Count by Experimental Type and Data Category

Displays the number of associated files and studies in which the Case has been involved.

The Biospecimen Hierarchy

Outlines the structure of biospecimens derived from a Case, as follows:

Case -> Sample(s) -> Aliquot(s)
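
As a rough mental model, the hierarchy above can be written as nested records. The class and field names below are illustrative only and are not taken from the PDC data model; the identifiers are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aliquot:
    aliquot_id: str


@dataclass
class Sample:
    sample_id: str
    sample_type: str  # e.g. "tumor" or "normal"
    aliquots: List[Aliquot] = field(default_factory=list)


@dataclass
class Case:
    case_id: str
    samples: List[Sample] = field(default_factory=list)

    def aliquot_ids(self) -> List[str]:
        """Walk Case -> Sample(s) -> Aliquot(s) and collect aliquot IDs."""
        return [a.aliquot_id for s in self.samples for a in s.aliquots]


# Hypothetical case with one tumor and one normal sample.
case = Case(
    "CASE-1",
    [
        Sample("SAMPLE-T", "tumor", [Aliquot("ALQ-1"), Aliquot("ALQ-2")]),
        Sample("SAMPLE-N", "normal", [Aliquot("ALQ-3")]),
    ],
)
print(case.aliquot_ids())
# → ['ALQ-1', 'ALQ-2', 'ALQ-3']
```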

Links to external resources

Where complementary genomic and imaging information for the Case may be available.


Data Submission
How do I submit data to PDC?

Detailed instructions for submitting data into the PDC are available here.

How do I register my program with PDC?

PDC currently accepts mass spectrometry data from proteomic experiments, specifically data-dependent and data-independent acquisitions. You may contact PDCHelpDesk@mail.nih.gov to request a program for your lab.

How do I configure my AWS account to upload data from my S3 bucket to PDC?

To use the S3 transfer feature, you need to configure your AWS account and S3 bucket to allow PDC to copy your data.
First, you will need an AWS Access Key and AWS Secret Key from an IAM user that has access to your bucket. The IAM policy should look something like this:

              {
                "Version": "2012-10-17",
                "Statement": [
                  {
                    "Sid": "S3Access",
                    "Effect": "Allow",
                    "Action": ["s3:GetObject", "s3:ListBucket"],
                    "Resource": ["arn:aws:s3:::awsexamplesourcebucket", "arn:aws:s3:::awsexamplesourcebucket/*"]
                  }
                ]
              }
            

Then update your bucket policy to allow PDC to copy the data by adding the PDC ARN with the s3:GetObject permission. The source bucket policy should look something like this:

              {
                "Version": "2012-10-17",
                "Statement": [
                  {
                    "Sid": "DelegateS3Access",
                    "Effect": "Allow",
                    "Principal": {"AWS": "arn:aws:iam::033707373097:user/S3-Prod"},
                    "Action": ["s3:GetObject"],
                    "Resource": ["arn:aws:s3:::awsexamplesourcebucket", "arn:aws:s3:::awsexamplesourcebucket/*"]
                  }
                ]
              }
            

Note: Replace the 'awsexamplesourcebucket' with your bucket name.
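
To avoid editing the bucket name by hand in two places, the two policies can be generated from a single value. The following is a minimal sketch that reproduces the JSON shown above; the function names and the bucket name `my-lab-proteomics-data` are hypothetical, and the output still has to be applied through the AWS console or your usual tooling.

```python
import json

# Principal ARN copied from the source bucket policy example above.
PDC_COPY_PRINCIPAL = "arn:aws:iam::033707373097:user/S3-Prod"


def bucket_resources(bucket_name):
    """ARNs for the bucket itself and for every object inside it."""
    return [f"arn:aws:s3:::{bucket_name}", f"arn:aws:s3:::{bucket_name}/*"]


def iam_user_policy(bucket_name):
    """IAM policy for the user whose access key is used for the transfer."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": bucket_resources(bucket_name),
        }],
    }


def source_bucket_policy(bucket_name):
    """Bucket policy that lets the PDC principal read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DelegateS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": PDC_COPY_PRINCIPAL},
            "Action": ["s3:GetObject"],
            "Resource": bucket_resources(bucket_name),
        }],
    }


# Hypothetical bucket name, used only for illustration.
print(json.dumps(iam_user_policy("my-lab-proteomics-data"), indent=2))
print(json.dumps(source_bucket_policy("my-lab-proteomics-data"), indent=2))
```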

What types of raw data does PDC accept?

PDC accepts various proprietary data formats developed by different manufacturers of mass spectrometers, such as “.raw”, “.d”, “.wiff”, “.wiff.scan”, “.wiff.mtd”, “.dat” and “.mis”. If you encounter a data format that is not currently supported while uploading to the submission portal, please reach out to the PDC helpdesk for assistance.

Note:

For the AB Sciex (formerly Applied Biosystems) series of mass spectrometers, such as the AB Sciex TripleTOF and QTRAP, which produce WIFF format data, multiple files may represent one raw file. In such cases, users are required to upload all associated files together in the same upload.

For Bruker instruments such as the timsTOF series, where a folder represents one raw file, users should compress each .d folder corresponding to a single raw file into its own zip file. The compressed file should be named in the format "file.d.zip" to correspond to the original file name.
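
The Bruker naming rule can be sketched with the Python standard library. The function name `zip_bruker_d_folder` is hypothetical. The sketch also assumes that the archive should keep the `.d` folder itself as the top-level entry, which the instructions above do not specify, so check a test archive with the PDC helpdesk if in doubt.

```python
import zipfile
from pathlib import Path


def zip_bruker_d_folder(d_folder, out_dir=None):
    """Compress one Bruker .d folder into '<name>.d.zip' and return its path."""
    d_folder = Path(d_folder)
    if not d_folder.is_dir() or d_folder.suffix != ".d":
        raise ValueError(f"Not a .d folder: {d_folder}")
    out_dir = Path(out_dir) if out_dir is not None else d_folder.parent
    archive = out_dir / (d_folder.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(d_folder.rglob("*")):
            if path.is_file():
                # Assumption: keep the .d folder as the top-level entry.
                arcname = path.relative_to(d_folder.parent).as_posix()
                zf.write(path, arcname=arcname)
    return archive
```

Calling `zip_bruker_d_folder("run01.d")` (a made-up folder name) would write `run01.d.zip` next to the folder, one archive per raw file.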



Data Download
What type of files are available for download from PDC?
PDC distributes several types of files: those submitted by the original data submitters and those generated through the PDC Common Data Analysis Pipeline (CDAP).

Mass Spectrometry Data Formats
RAW (Vendor) Format: Mass spectrometry data uploaded by the data submitters as RAW or vendor format files corresponding to the mass spectrometers used to acquire the spectra.
mzML Format: RAW format spectra in the HUPO Proteome Standards Initiative (PSI) compliant mzML format.
Spectral Library
Generated by PDC Common Data Analysis Pipeline.
Experiment-level quantitative spectral library with spectra and retention time boundaries given to Skyline for quantification.
Peptide-Spectrum Match (PSM) Data
Generated by PDC Common Data Analysis Pipeline.
RAW PSM Format: The best peptide-spectrum matches (PSMs), from the first-level analysis of the PDC CDAP, for each tandem mass spectrum against the peptide sequences from a reference protein sequence database (UniProt), in TSV format.
mzIdentML PSM Format: Raw PSMs in the PSI-compliant mzIdentML format.
Protein Assembly
Generated by PDC Common Data Analysis Pipeline.
Protein identification and quantitation reports generated from the PSM data through a conservative gene-based generalized parsimony analysis. Peptides are associated with genes, rather than protein identifiers, and genes with at least two unshared peptide identifications are inferred. The resulting gene list is estimated to have a false-discovery rate of at most 1%. Several different output files are generated depending on the experiment type.

DDA CDAP:
.summary.tsv - Protein identification summary report
.precursor_area.tsv - Label-free workflow protein quantitation report for relative quantitation by precursor peak area integration
.spectral_count.tsv - Label-free workflow protein quantitation report for relative quantitation by spectral counts
.itraq.tsv - iTRAQ workflow protein relative quantitation report
.tmt.tsv - TMT workflow protein relative quantitation report
.peptides.tsv - Identified peptide summary report
.phosphopeptide.tsv - Labelled workflow phosphopeptide relative quantitation report
.phosphosite.tsv - Labelled workflow phosphosite relative quantitation report
.glycopeptide.tsv - Labelled workflow N-linked glycopeptide relative quantitation report
.glycosite.tsv - Labelled workflow N-linked glycosite relative quantitation report

DIA CDAP:
precursors_unnormalized.tsv - Unnormalized precursor peak areas
precursors_normalized - Median normalized precursor peak areas
proteins_unnormalized - Unnormalized protein abundances. Calculated by taking the sum of every precursor in the protein.
proteins_normalized - DirectLFQ normalized protein abundances
sky.zip - The Skyline document used for quantification of chromatographic peaks

Please note: The DIA CDAP analysis pipeline is currently under development, which means that the output files, data, and formats may undergo changes. Thank you for your understanding.
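
The unnormalized protein roll-up described above (summing every precursor in the protein) can be illustrated as follows. This is a sketch only: the row layout, the protein and sample names, and the function name are invented, and how the pipeline treats missing values or shared precursors is not stated here.

```python
from collections import defaultdict


def protein_abundances(precursor_rows):
    """Sum precursor peak areas per (protein, sample) pair.

    precursor_rows: iterable of (protein, sample, precursor_area) tuples.
    """
    totals = defaultdict(float)
    for protein, sample, area in precursor_rows:
        totals[(protein, sample)] += area
    return dict(totals)


# Made-up precursor areas for two proteins and two samples.
rows = [
    ("PROT_A", "sample_1", 100.0),
    ("PROT_A", "sample_1", 50.0),
    ("PROT_A", "sample_2", 80.0),
    ("PROT_B", "sample_1", 30.0),
]
print(protein_abundances(rows))
```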
QC reports: Quality control metrics computed by the CDAP. The report consists of summary statistics derived from all MS/MS spectra from the raw spectral data files.
Supplementary data (provided by data submitters)
Other metadata
These are supplementary files from the original data submitters for distribution at the PDC. These usually include the following:
  • Descriptive protocols
  • Clinical metadata
  • Other useful information
Alternate Processing Pipeline
These are original output files from the mass spectrometry data processing pipeline run by the data submitters. These files are typically the ones used for results in a peer-reviewed publication and to inform conclusions.

Submitted by the data submitters
How do I download data?

There are a few different ways to identify and download the files of interest:

1. Downloading Files from a Specific Study:

  • a. On the Explore page, click on the specific study of interest, either by the study identifier or study name.
    • This opens the study summary overlay. Click on the number corresponding to the data category of your interest.
    • An overlay will appear with a list of files.
    • Click on the download button next to each file to download them individually.
    • For information on downloading multiple files at once, see ‘How do I download multiple files at once?’.
  • b. On the Explore page, studies are listed under the Study tab. Click on the number corresponding to the data category of your interest. This will take you to the ‘Files’ tab.
    • Click on the download button next to each file to download them individually.
    • For information on downloading multiple files at once, see ‘How do I download multiple files at once?’.

2. Using Filters to Identify Files:

  • For example, if you are interested in protein assembly files for all breast cancer studies from the CPTAC program, apply appropriate filters on the 'Explore' page to narrow down the data of interest.
  • Once you have identified the files, move to the ‘Files’ tab on the Explore page.
  • Click on the download button next to each file to download them individually.
  • For information on downloading multiple files at once, see ‘How do I download multiple files at once?’.
How do I download multiple files at once?

1. Select Files:

  • Follow the steps in ‘How do I download data?’.
  • Once you are on the file overlay of the study summary page or the ‘Files’ tab on the Explore page, select one or all files.
  • Click ‘Export File Manifest’. This manifest will contain a download URL for each selected file on a separate row.
  • Note: The URLs will expire 7 days (168 hours) after the manifest is generated. You can revisit the PDC portal to generate a new file manifest if needed.

2. Download Methods:

  • PDC Data Download Client:
    • Use the PDC Data Download Client, a command-line tool that provides an alternative method for downloading data from PDC and enables resumption of interrupted transfers.
    • Click here for documentation of PDC Data Download Client.
  • Sample Scripts:
    • Use sample scripts (in bash and Python) available on the PDC GitHub to download and reorganize or simply reorganize previously downloaded files into the desired folder structure. You can modify these scripts to suit your needs.
    • See the FAQ ‘How do I organize the data into a folder structure?’ for more details.
    • Access the scripts here.
  • Download Managers:
    • Use download managers (special programs or browser extensions) that help manage large and multiple downloads.

Some free download managers:

Disclaimer: The third-party software links are provided “as is” without warranty of any kind, either expressed or implied and such software is to be used at your own risk.
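
If you prefer a small script to a download manager, the exported file manifest (one download URL per row) can be read and fetched with the Python standard library. The sketch below is not the PDC Data Download Client and does not resume interrupted transfers. The two column headers are hypothetical placeholders, so replace them with the headers in your own manifest.

```python
import csv
import io
import urllib.request
from pathlib import Path

# Assumption: hypothetical column headers; check them against your manifest.
NAME_COLUMN = "File Name"
URL_COLUMN = "File Download Link"


def read_manifest(text, delimiter=","):
    """Return (file name, download URL) pairs from manifest text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [
        (row[NAME_COLUMN].strip(), row[URL_COLUMN].strip())
        for row in reader
        if (row.get(URL_COLUMN) or "").strip()
    ]


def download_all(entries, dest_dir):
    """Download each entry into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name, url in entries:
        target = dest / name
        if target.exists():
            # A partially written file would also be skipped, so remove
            # incomplete files before rerunning.
            continue
        urllib.request.urlretrieve(url, str(target))


# Hypothetical two-row manifest used only to show the parsing step.
example = (
    "File Name,File Download Link\n"
    "a.mzML.gz,https://example.org/a\n"
    "b.mzML.gz,https://example.org/b\n"
)
print(read_manifest(example))
```

For a TSV manifest, pass `delimiter="\t"` to `read_manifest`, then hand the result to `download_all` together with a target folder.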

How do I organize the data into a folder structure?

By default, all downloaded files will be placed in the same folder without any particular folder structure. The PDC manifest file provides all relevant metadata if you wish to organize them into a folder structure. This is especially useful when analyzing large datasets with labelling experiments.

The following metadata available in the PDC file manifest can be used to organize the files:
PDC Study ID, e.g., PDC000319
PDC Study Version, e.g., 1
Data Category, e.g., Processed Mass Spectra
Run Metadata ID, e.g., AML Gilteritinib TimeCourse - Phosphoproteome-1
File Type, e.g., Open Standard
File, e.g., PTRC_exp12_plex_01_P_f06.mzML.gz


You may use this information to create a folder structure and move the downloaded files into the desired location.
e.g.
PDC Study ID/PDC Study Version/Data Category/Run Metadata ID/File Type/File
PDC000319/1/Processed Mass Spectra/AML Gilteritinib TimeCourse - Phosphoproteome-1/Open Standard/PTRC_exp12_plex_01_P_f06.mzML.gz

You may use the following sample scripts (in bash and Python) available on the PDC GitHub to either download and reorganize files, or simply reorganize previously downloaded files, into this folder structure. Feel free to modify them to suit your needs.
https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData
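
For orientation, the folder layout described above can also be built with a few lines of Python. This sketch is independent of the scripts in the repository. It assumes the manifest column headers match the metadata labels listed above exactly, which you should verify against your own manifest, and the function names are hypothetical.

```python
import csv
import io
import shutil
from pathlib import Path

# Assumption: headers are spelled exactly like the labels in this FAQ.
FOLDER_COLUMNS = [
    "PDC Study ID",
    "PDC Study Version",
    "Data Category",
    "Run Metadata ID",
    "File Type",
    "File",
]


def destination_parts(row):
    """Folder names (and final file name) for one manifest row."""
    return [row[column].strip() for column in FOLDER_COLUMNS]


def organize(rows, download_dir, root_dir):
    """Move already-downloaded files into the nested folder structure."""
    moved = []
    for row in rows:
        parts = destination_parts(row)
        source = Path(download_dir) / parts[-1]
        if not source.is_file():
            continue  # not downloaded (yet); leave it alone
        target = Path(root_dir).joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved


# One-row manifest built from the example values in this FAQ.
manifest_text = (
    "PDC Study ID,PDC Study Version,Data Category,Run Metadata ID,File Type,File\n"
    "PDC000319,1,Processed Mass Spectra,"
    "AML Gilteritinib TimeCourse - Phosphoproteome-1,Open Standard,"
    "PTRC_exp12_plex_01_P_f06.mzML.gz\n"
)
rows = list(csv.DictReader(io.StringIO(manifest_text)))
print("/".join(destination_parts(rows[0])))
```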

Why do the file download URLs throw an error or exception?

PDC is hosted on the AWS cloud and is under active development. To reduce egress costs from unintended downloads, the URLs expire 7 days (168 hours) after the manifest is generated. You may revisit the PDC portal to generate a new file manifest. We also limit downloads of the same file from the same IP address to 10 per 24-hour period. If you have already downloaded the file several times, you may get an error message indicating that you have exceeded your download attempts for the 24-hour period.
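
A quick way to rule out expiry before troubleshooting further is to compare the time the manifest was exported with the 168-hour lifetime. The sketch below assumes you recorded the export time yourself. Whether a URL fails exactly at the 168-hour mark is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# 7 days (168 hours), as stated in the FAQ.
URL_LIFETIME = timedelta(hours=168)


def urls_expired(generated_at, now=None):
    """Return True if a manifest exported at generated_at is past its lifetime."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - generated_at >= URL_LIFETIME


generated = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
# Four days after export: URLs should still work.
print(urls_expired(generated, now=datetime(2024, 1, 5, 9, 0, tzinfo=timezone.utc)))
# → False
# Eight days after export: generate a new manifest.
print(urls_expired(generated, now=datetime(2024, 1, 9, 9, 0, tzinfo=timezone.utc)))
# → True
```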

How do I download clinical and biospecimen data?

The PDC portal allows users to build cohorts by applying various clinical, biospecimen, experimental and file features as filters and to export the selections as manifest files.

To download biospecimen (case, sample, aliquot) related data, first identify the data of interest by applying filters, then move to the 'Biospecimens' tab on the 'Explore' page. Use the checkboxes to select a specific row, all rows on the page or all pages, and click the export biospecimen manifest button to export in CSV or TSV format.

To download clinical data, first identify the data of interest by applying filters, then move to the 'Clinical' tab on the 'Explore' page. Use the checkboxes to select a specific row, all rows on the page or all pages, and click the export clinical manifest button to export in CSV or TSV format. Clinical data are organized into multiple files and are exported as one zip archive. The archive contains:

  • Clinical manifest: demographic and diagnosis data.
  • Exposure manifest: exposure-related data.
  • Follow-up manifest: follow-up-related data.
  • Treatment manifest: treatment-related data.

Refer to the PDC data dictionary for more information about the biospecimen and clinical data.
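
Because the clinical export arrives as one zip archive containing several manifests, it can be convenient to load every table in a single pass. The sketch below makes no assumption about the member file names or columns: it reads any `.csv` or `.tsv` member it finds. The function name is hypothetical.

```python
import csv
import io
import zipfile


def read_clinical_archive(zip_path):
    """Return {member name: list of row dicts} for each CSV/TSV in the archive."""
    tables = {}
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            lower = member.lower()
            if lower.endswith(".csv"):
                delimiter = ","
            elif lower.endswith(".tsv"):
                delimiter = "\t"
            else:
                continue  # ignore anything that is not a table
            with archive.open(member) as handle:
                # utf-8-sig drops a byte-order mark if one is present.
                text = io.TextIOWrapper(handle, encoding="utf-8-sig", newline="")
                tables[member] = list(csv.DictReader(text, delimiter=delimiter))
    return tables
```

Each value is a list of dictionaries keyed by the column headers of that manifest, so the demographic, exposure, follow-up and treatment tables can be joined on whichever case identifier column your export contains.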

Data Access
Do I have to login to PDC to access data?

No, there is no need to create an account or log in to PDC to browse the portal or download the data. However, login is required if you would like to submit data to PDC.

What is a study version?

PDC studies can have multiple versions. Additional versions (updates) of a study are created when the underlying data changes substantially. This may involve changes to the raw data, processed data, and/or metadata. When a new version is created, it may fall out of sync with the original publication of this data. Use of the latest version is strongly encouraged, as it commonly represents an update directly from the submitter.

Where do I get more details about a study?

Click on the name of the study on the 'Explore' page to open the study summary page. The study summary page provides details about the objective of the study, the protocol, the experimental design, and the clinical data of the cases and samples used in the study.

How can I find genes of interest?

Genes can be searched with their gene symbols through the search box on PDC portal. Enter the gene symbol or name of the gene (such as kinase) in the search box and click on the gene of your interest in the drop down. This will take you to a gene summary page.

It is also possible to search for multiple genes such as those involved in a pathway. Go to the Gene tab on the 'Explore' page and in the gene filter on the left hand side enter the list of gene symbols. You may also select from the prebuilt gene lists from several pathways important in cancer.

How do I explore protein quantitation data for a study through heatmaps?

The PDC Common Data Analysis Pipeline generates a protein abundance matrix rolled up to gene level for each study. The data can be viewed as an interactive heatmap through the Morpheus viewer on PDC.
Morpheus is a heatmap viewer from the Broad Institute. It is versatile and offers many features to filter, cluster and save the data. More ways to explore the heatmap are described in the Morpheus tutorial.

What can I search on PDC?

PDC can be searched for the following:
- Biospecimens such as case, sample or aliquot using their original identifiers or PDC ids
- Studies using their PDC id (e.g. PDC000220) or partial name (e.g. CPTAC, HNSCC)
- Genes using their Gene symbol (e.g. BRCC3) or partial description (e.g. Kinase)
- External identifiers such as dbGaP study (e.g. phs000892)
It is also possible to search for related data from external resources using the external identifiers. For example, you may search for a dbGaP study using its identifier (e.g., phs000892) to identify all related PDC studies.

Can I search for a particular peptide sequence within the datasets in PDC?

Peptides can be searched through PepQuery, a peptide-centric search that focuses on only novel DNA or protein sequences of interest.
From the menu bar, go to Analysis -> PepQuery. Enter the peptide in the search box and select the dataset to search against. More details about PepQuery can be found here - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6396417/.

What types of data can I get using the API?

All PDC data and metadata are accessible through APIs. Learn more here.

Data Licensing Policy
What data are not open-access?

Currently all of the data in PDC are open access.
In the near future, PDC is expected to host patient-sample-specific protein sequence databases provided by data submitters. These databases are generated using the genomic and transcriptomic information from the patient sample and help in the identification of novel proteins resulting from single nucleotide variants, splice variants and fusion genes. These files are currently designated as controlled-access databases and will require dbGaP authorization for access.

How do I cite the PDC?

We ask that whenever you use PDC data in a publication, you cite the PDC resource and the primary publication of the data:
To cite the resource, cite the PDC URL - https://pdc.cancer.gov
PDC uses human-readable identifiers to represent studies.
To cite an individual study, cite either the PDC study ID (e.g., PDC000250) or a URL to the study (e.g., https://pdc.cancer.gov/pdc/study/PDC000250).
The primary publication from the original data producers is available on the individual study summary pages and also on the PDC publications page.

The CPTAC program requests that publications using data from the program cite the primary publication from the consortium (available on the individual study summary pages and on the PDC publications page) and include the following statement: "Data used in this publication were generated by the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC)."
What are the data use and licensing policies of the PDC?

PDC data submission and data use are governed by Creative Commons CC-BY 4.0 licensing terms.
Under CC-BY terms, users are free to:
    Share: copy and redistribute the material in any medium or format.
    Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
    Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
    No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

What is the Intellectual Property policy of the PDC?

Users of PDC shall acknowledge that they are encouraged not to acquire ownership interest in PDC data, nor any immediate or future intellectual property rights in any research conducted using the PDC data. NIH considers these data as pre-competitive and urges Users to avoid making IP claims derived directly from the proteomic dataset(s). It is expected that these NIH-provided data, and conclusions derived therefrom, will remain freely available, without requirement for licensing. However, the NIH also recognizes the importance of the subsequent development of IP on downstream discoveries, especially in therapeutics, which will be necessary to support full investment in products that the public needs. For further information about the PDC Intellectual Property policy, please contact us at PDCHelpDesk@mail.nih.gov.

What is an embargo date in PDC?

There is no longer an embargo on the data released by PDC. Submitters of data may, however, request their data to be released within a specified time frame.

Data Analysis
What is PDC Common Data Analysis pipeline?
What is the PDC harmonization process?

Learn more about the PDC harmonization process here.

Can I analyze PDC data on the NCI cloud resources?

Yes, PDC data is accessible through the CRDC Cloud Resources for further analysis. Refer to Analyze PDC Data in the Cloud for more details.

Can I create a custom cohort to download?

Yes. The PDC portal allows users to build cohorts by applying various clinical, biospecimen, experimental and file features as filters and to export the selections as manifest files.

Processed outputs from PDC Common Data Analysis Pipeline
What analytical quantitation workflows are used in PDC studies?

Most studies from CPTAC and other programs in PDC use an isobaric labelling protein quantitation workflow, in which multiple biological samples are labeled with an identifying reagent (the isobaric tag) and mixed before tandem mass spectrometry analysis. The isobaric tag reagents are named for the technique and their multiplexing capacity: iTRAQ reagents provide 4-plex analyses, while TMT-n reagents provide n-plex analyses, for n = 10, 11, 16, and 18. Isobaric tags are quantified in each identified tandem mass spectrum of a peptide, but since peptide intensities vary widely, all isobaric tag intensities must be normalized with respect to one of the tags in each spectrum. The resulting ratios can be summarized by averaging the ratios for the peptides from a protein. To expand the number of biological samples in CPTAC studies beyond the capacity of the isobaric tag reagents, a common reference sample is included in all analytical samples and its tag intensities are used as the ratio denominator throughout.
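
The arithmetic behind the reference-based ratios can be illustrated with a toy example. This sketch follows only the description above (divide each tag by the reference tag within a spectrum, then average across the spectra of a protein). It is not the CDAP implementation, which may use log ratios, medians or additional filtering. The tag labels and intensities are invented.

```python
from statistics import mean


def spectrum_ratios(tag_intensities, reference_tag):
    """Ratio of every non-reference tag to the reference tag in one spectrum."""
    reference = tag_intensities[reference_tag]
    if reference <= 0:
        raise ValueError("Reference tag intensity must be positive")
    return {
        tag: value / reference
        for tag, value in tag_intensities.items()
        if tag != reference_tag
    }


def protein_ratios(spectra, reference_tag):
    """Average the per-spectrum ratios over all spectra assigned to a protein."""
    per_tag = {}
    for spectrum in spectra:
        for tag, ratio in spectrum_ratios(spectrum, reference_tag).items():
            per_tag.setdefault(tag, []).append(ratio)
    return {tag: mean(values) for tag, values in per_tag.items()}


# Two made-up spectra from peptides of the same protein; tag "131" carries
# the common reference sample in this illustration.
spectra = [
    {"126": 100.0, "127N": 200.0, "131": 100.0},
    {"126": 50.0, "127N": 50.0, "131": 25.0},
]
print(protein_ratios(spectra, reference_tag="131"))
# → {'126': 1.5, '127N': 2.0}
```

Note that the second spectrum is four times less intense than the first, yet its ratios are directly comparable, which is the point of normalizing within each spectrum.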

A small number of older CPTAC studies use a label-free quantitation workflow, without labelling reagents, and quantify peptides based on the integrated area under the elution profile of each precursor ion (precursor area) and the number of peptides identified from the protein (spectral counts).

The isobaric labelling quantitation workflows provide relative protein abundance, relative to the common reference sample, while the label-free quantitation workflows provide absolute protein abundance without reference to any other sample.

What summary report files provide protein abundance values?

For processed protein abundance, download the quantitation report file based on the quantitation workflow, and labeling reagent where appropriate, used in the study:

  • TMT Workflow: *.tmt10.tsv, *.tmt11.tsv, *.tmt16.tsv, *.tmt18.tsv
  • iTRAQ Workflow: *.itraq.tsv
  • Label-free Workflow: *.precursor_area.tsv or *.spectral_count.tsv

Each of these summary reports provides protein abundance values for the biological samples analyzed in the study. TMT and iTRAQ workflows provide relative protein abundance values, while the label-free workflow provides absolute protein abundance values.
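
If you have downloaded many report files, the file name patterns listed above are enough to sort them by workflow. The sketch below uses Python's standard `fnmatch` module; the function name and the example file names are hypothetical.

```python
from fnmatch import fnmatch

# File name patterns taken from the list above, grouped by quantitation workflow.
REPORT_PATTERNS = {
    "TMT": ["*.tmt10.tsv", "*.tmt11.tsv", "*.tmt16.tsv", "*.tmt18.tsv"],
    "iTRAQ": ["*.itraq.tsv"],
    "Label-free": ["*.precursor_area.tsv", "*.spectral_count.tsv"],
}


def quantitation_workflow(filename):
    """Return the workflow whose pattern matches `filename`, or None if none does."""
    for workflow, patterns in REPORT_PATTERNS.items():
        if any(fnmatch(filename, pattern) for pattern in patterns):
            return workflow
    return None


# Hypothetical file names, for illustration only.
downloads = ["study_A.tmt11.tsv", "study_B.itraq.tsv", "study_A.summary.tsv"]
by_file = {name: quantitation_workflow(name) for name in downloads}
```

Identification reports such as `*.summary.tsv` match none of the quantitation patterns, so they map to `None`.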

What are the Summary Report files?

- Protein Identification Reports:

  • *.summary.tsv: Summary of the evidence for identified proteins. Generated through a conservative gene-based generalized parsimony analysis, which uses identified peptides to identify proteins.
  • *.peptides.tsv: Summary of the evidence for identified peptides.

- Protein Quantitation Reports:

  • *.precursor_area.tsv: Label-free workflow protein quantitation by precursor peak area integration.
  • *.spectral_count.tsv: Label-free workflow protein quantitation by spectral counts.
  • *.itraq.tsv: iTRAQ workflow protein relative quantitation.
  • *.tmt10.tsv: TMT-10 workflow protein relative quantitation.
  • *.tmt11.tsv: TMT-11 workflow protein relative quantitation.
  • *.tmt16.tsv: TMT-16 workflow protein relative quantitation.
  • *.tmt18.tsv: TMT-18 workflow protein relative quantitation.

- Site-specific PTM Reports:

  • *.phosphopeptide.tsv and *.phosphosite.tsv: Phosphopeptide relative quantitation reports.
  • *.glycopeptide.tsv and *.glycosite.tsv: Deglycosylated N-linked glycopeptide relative quantitation reports.
  • *.acetylpeptide.tsv and *.acetylsite.tsv: Acetylated peptide relative quantitation reports.
  • *.ubiquitylpeptide.tsv and *.ubiquitylsite.tsv: Cleaved ubiquitylated peptide relative quantitation reports.

Site-specific PTM reports are only available for the isobaric labelling quantitation workflows and represent the summary of spectral ratios across similarly modified peptides and modified protein sites.

Why are there different numbers of samples in the identification and quantitation files?

Identification files (e.g., *.summary.tsv) contain information about analytical samples, which in the TMT and iTRAQ workflows are usually a mixture of multiple biological samples. Quantitation files (e.g., *.tmt11.tsv, *.itraq.tsv) provide relative abundance values for each individual biological sample after de-multiplexing. Only the label-free quantitation workflow has a single biological sample in each analytical sample.

How many clinical biospecimens are quantitated in each analytical sample?

In the isobaric labelling quantitation workflows, one of the isobaric tags is assigned to the common reference sample. Usually, the same isobaric tag is used for the common reference sample in every analytical sample; generally, it is the first or last tag. Consequently, for studies using labelling reagents with a plex capacity of n, n-1 labels will be used for clinical biospecimens.
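
As a rough worked example of the n-1 rule, the sketch below estimates how many analytical samples a study would need. It assumes exactly one reference channel per analytical sample, no replicates, and no empty channels, which may not hold for a given study; the function names are hypothetical.

```python
import math


def biospecimen_channels(plex):
    """Channels left for clinical biospecimens when one tag holds the common reference."""
    if plex < 2:
        raise ValueError("plex must be at least 2")
    return plex - 1


def analytical_samples_needed(n_biospecimens, plex):
    """Smallest number of analytical samples that can hold all biospecimens."""
    return math.ceil(n_biospecimens / biospecimen_channels(plex))


# A TMT-11 study with 100 clinical biospecimens under the assumptions above.
tmt11_channels = biospecimen_channels(11)
tmt11_samples = analytical_samples_needed(100, 11)
```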

Why do spectral counts appear in TMT and iTRAQ workflows?

Spectral counts in the identification summary report (`*.summary.tsv`) are for analytical samples, which usually represent multiple biological samples, and provide evidence for identified proteins.

What do the 'Log ratio' and 'Unshared Log ratio' columns represent?

In the isobaric labelling quantitation workflow files (e.g., `*.tmt11.tsv`):

  • Log Ratio: Includes relative quantitative values from all peptides that map to a specific protein.
  • Unshared Log Ratio: Excludes data from peptides shared between identified proteins, using only uniquely mapped peptides to compute the relative protein abundance.

While each protein must have its own peptide evidence to be identified, it is unclear whether the ratios observed for shared peptides should be added to each gene’s average ratio. These two versions of the ratio represent two extremes. The two summaries usually produce quite similar values, but the Unshared Log Ratio values may summarize fewer observations, while the Log Ratio values may represent the convolution of two different, homologous proteins’ relative quantitation values.
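
The difference between the two columns can be illustrated with a small Python sketch. The data layout (a list of peptide log ratios paired with the set of proteins each peptide maps to) and the use of a plain mean are assumptions made for the example, not a description of the pipeline's internals.

```python
from statistics import mean


def log_ratio_summaries(peptides, protein):
    """Return (log_ratio, unshared_log_ratio) for one protein.

    `peptides` is a list of (log_ratio, proteins) pairs, where `proteins`
    is the set of identified proteins the peptide maps to.
    A summary is None when no peptide qualifies for it.
    """
    all_values = [ratio for ratio, proteins in peptides if protein in proteins]
    unshared_values = [ratio for ratio, proteins in peptides if proteins == {protein}]
    log_ratio = mean(all_values) if all_values else None
    unshared_log_ratio = mean(unshared_values) if unshared_values else None
    return log_ratio, unshared_log_ratio


# Hypothetical peptides: two unique to protein A, one shared between A and B.
peptides = [
    (1.0, {"A"}),
    (0.5, {"A"}),
    (-0.5, {"A", "B"}),
]
summary_a = log_ratio_summaries(peptides, "A")
summary_b = log_ratio_summaries(peptides, "B")
```

In this toy case protein B has no unshared peptides, so only the all-peptide summary is available for it, and that value also reflects protein A.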

Where is the relative abundance of individual peptides? My proteins of interest share many peptide identifications.

The Unshared Log Ratio relative protein quantitation values summarize only the ratios of peptides that are not shared between identified proteins, so this may be sufficient for your needs. If not, you can find the individual reporter ion intensities for each identified tandem mass-spectrum in the PSM and mzIdentML files and aggregate these to peptides.
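
One possible way to aggregate spectrum-level reporter ion intensities to peptides is sketched below. The column names (`PeptideSequence` and the channel labels) are hypothetical and will need to be replaced with the headers found in the PSM files you download; using the median of log2 ratios is likewise an assumption.

```python
import math
from collections import defaultdict
from statistics import median


def peptide_log_ratios(psm_rows, channels, reference, peptide_key="PeptideSequence"):
    """Summarize per-spectrum log2(channel / reference) ratios for each peptide.

    `psm_rows` is an iterable of dict-like rows, for example from csv.DictReader.
    Spectra with a non-positive reference intensity are skipped.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in psm_rows:
        ref = float(row[reference])
        if ref <= 0:
            continue
        for channel in channels:
            value = float(row[channel])
            if value > 0:
                grouped[row[peptide_key]][channel].append(math.log2(value / ref))
    return {
        peptide: {channel: median(values) for channel, values in by_channel.items()}
        for peptide, by_channel in grouped.items()
    }


# Hypothetical rows; values are strings, as they would be when read from a TSV file.
rows = [
    {"PeptideSequence": "PEPTIDEK", "126": "200", "131": "100"},
    {"PeptideSequence": "PEPTIDEK", "126": "800", "131": "100"},
    {"PeptideSequence": "OTHERK", "126": "50", "131": "100"},
]
per_peptide = peptide_log_ratios(rows, channels=["126"], reference="131")
```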

Why don’t you provide absolute abundance of proteins?

The isobaric labelling quantitation workflow produces peaks for the isobaric tags in each tandem mass-spectrum, but the peak intensities of different peptides’ ions are not consistent with each other or with their protein’s absolute abundance. However, the ratio of reporter ion peak intensities is consistent across the different peptides of a protein, and the repeated observations can be used to estimate the relative abundance of a protein between samples.
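
A toy numerical example may make this clearer. It assumes a deliberately simplified model in which the observed reporter intensity equals the protein abundance multiplied by a peptide-specific response factor; the model and all numbers are illustrative assumptions, not measurements.

```python
def reporter_intensity(abundance, response):
    """Simplified model: observed intensity = abundance x peptide-specific response."""
    return abundance * response


# Hypothetical protein abundances in a clinical sample and in the common reference.
abundances = {"sample": 4.0, "reference": 2.0}
# Hypothetical response factors: peptide_1 is detected far more strongly than peptide_2.
responses = {"peptide_1": 10.0, "peptide_2": 0.5}

intensities = {}
ratios = {}
for peptide, response in responses.items():
    sample = reporter_intensity(abundances["sample"], response)
    reference = reporter_intensity(abundances["reference"], response)
    intensities[peptide] = (sample, reference)
    ratios[peptide] = sample / reference
```

The raw intensities of the two peptides differ by a factor of 20, yet both give the same sample-to-reference ratio, because the response factor cancels in the division.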

How can I map internal sample IDs to case submitter IDs?

Each analytical sample in the `*.sample.txt` file represents a mixture of biological samples. The file contains biospecimen_submitter_ids, which can be linked to case_submitter_ids using the Biospecimen tab on the PDC website.
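
If you save the Biospecimen tab as a tab-separated file, the lookup can be scripted. The sketch below assumes the export has columns named `biospecimen_submitter_id` and `case_submitter_id`; the actual headers in your export may differ, and the IDs shown are made up.

```python
import csv
import io


def build_case_lookup(table_file):
    """Map biospecimen_submitter_id -> case_submitter_id from a tab-separated table."""
    reader = csv.DictReader(table_file, delimiter="\t")
    return {row["biospecimen_submitter_id"]: row["case_submitter_id"] for row in reader}


# Hypothetical export held in memory; in practice, pass an opened file instead.
example_table = io.StringIO(
    "biospecimen_submitter_id\tcase_submitter_id\n"
    "ALIQ-001\tCASE-A\n"
    "ALIQ-002\tCASE-B\n"
)
lookup = build_case_lookup(example_table)

# IDs as they might appear for one analytical sample; unknown IDs map to None.
cases = [lookup.get(biospecimen) for biospecimen in ["ALIQ-002", "ALIQ-001", "ALIQ-999"]]
```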

Why don't the summary statistics (Mean, Median, StdDev) in the quantitation files match my calculations?

The summary statistics in the quantitation files are computed before median normalization of the log2ratio values. To match these statistics, you need to add the median back to each column.
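
A minimal sketch of that adjustment follows. It assumes the Median value reported for a column is the value that was subtracted during normalization; the function name and the numbers are hypothetical.

```python
from statistics import mean, median


def restore_unnormalized(normalized_log_ratios, reported_median):
    """Undo median normalization by adding the reported median back to each value."""
    return [value + reported_median for value in normalized_log_ratios]


# Hypothetical column of median-normalized log2 ratios and its reported Median.
normalized = [-0.5, 0.0, 1.0]
restored = restore_unnormalized(normalized, reported_median=1.0)

# Statistics recomputed on the restored values should be comparable to the report.
restored_mean = mean(restored)
restored_median = median(restored)
```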
