The primary goal of Proteomic Data Commons (PDC) is to democratize access to cancer-related proteomic datasets and to provide sustainable computational support to the cancer research community through interoperability with other components, such as Genomic Data Commons and Cloud Resources, in the NCI Cancer Research Data Commons (CRDC) ecosystem.
NCI PDC was officially launched on March 23, 2020. Click here to read the official announcement.
PDC is currently an observer member of the ProteomeXchange Consortium.
You can send us your feedback or comments on any issues you experience at PDCHelpDesk@mail.nih.gov.
The CPTAC Data Portal was a centralized repository for the public dissemination of proteomic sequence datasets collected by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). As of February 2022, the CPTAC Data Portal is retired and all of its data are available through PDC.
The PDC is a public data repository of mass spectrometry (MS) based proteomics data, and is maintained by the National Cancer Institute. PDC will host datasets from large consortiums such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC), International Cancer Proteogenome Consortium (ICPC) and Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO) and also from independent research programs and grants.
The PDC Data Portal also provides a platform for efficiently querying, visualizing, analyzing and downloading high quality, curated and harmonized proteomic datasets.
PDC represents state-of-the-art repository technology and facilitates multi-omics integration through interoperability with other components, such as Genomic Data Commons and Cloud Resources, in the NCI Cancer Research Data Commons (CRDC) ecosystem.
If a study in PDC has corresponding genomic and/or imaging data available in other resources, you can find the mapping of individual cases to the external resources in the 'Clinical' tab on the 'Explore' page (see the image below). Either click on the thumbnail in the 'Genomic or Imaging Data Resource' column, or select the cases of interest and export the clinical manifest. The manifest file lists the resource name and identifier for each case.
PDC uses caDSR and NCIt terminologies for biospecimen and clinical metadata and PSI ontologies for Proteomic metadata. Learn more about PDC data dictionary and data model.
The Explore page on the PDC website offers a comprehensive overview of all available data and various ways to explore the cohorts across PDC. The information is organized into three main panels:
The graphic panel visually represents data organized into analytical fractions, disease types, and experimental types. This provides a quick and intuitive overview of the data distribution and key categories.
The data panel is divided into several tabs, each displaying data organized by studies, and detailing clinical, biospecimen, and file properties.
Located on the left side, the filter panel allows users to refine the data based on harmonized metadata. The filters are organized into distinct groups:
Applying filters updates the data displayed in both the Graphic and Data panels.
Due to the extensive data available on PDC, multiple manifests can be downloaded from the Explore page:
Each tab under the Data Panel allows for the download of relevant manifests, which export the displayed information as TSV or CSV files. Available manifests include:
To download data files, you may download each file directly from the Files tab on the Data Panel by clicking the download button or by using the file manifest to download multiple files at once. Refer to the ‘Data Download’ section for more details.
The Study Summary page on the PDC website provides a comprehensive overview of study information, helping users understand the data before downloading. Here's how you can access a Study Summary page and the features it includes:
Example: https://pdc.cancer.gov/pdc/study/PDC000544
The Study Summary page includes several sections, each providing specific information about the study:
Displays general properties of the study such as the name, program, experimental type, Disease Types, Project ID, and more.
The available data for download is organized into two sections:
Lists all other studies within PDC related to the current study. These are usually additional characterizations of the same cohort, such as proteome and other post-translational modification (PTM) studies like phosphoproteome and ubiquitylome.
Lists other resources where complementary data for the same cohort is available.
Provides the citation of the primary publication associated with the study.
Features a heatmap thumbnail that links to a Morpheus heatmaps visualization page. This page includes quantitative data generated from the PDC common data analysis pipeline, annotated with extensive clinical information, and loaded into a Morpheus heatmaps viewer.
The Case Summary page on the PDC website provides a comprehensive overview of Cases. A case (participants, subjects, donors, patients) in PDC may have multiple samples collected (e.g., tumor and normal tissues). Each sample can then be divided into multiple aliquots, which are used for different types of analyses, such as proteomics, metabolomics and other molecular assays.
The Case Summary page includes several sections, each providing specific information about the case:
Demography, Diagnosis, Exposure, Follow Up, Treatment. For further information, refer to the PDC Data Dictionary, available under the MORE menu.
Displays the number of associated files and studies in which the Case has been involved.
Outlines the structure of biospecimens derived from a Case, as follows:
Case -> Sample(s) -> Aliquot(s)
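The hierarchy above can be modelled as nested records. The following is a minimal sketch only: the class names, field names and identifiers are hypothetical and are not taken from the PDC data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Aliquot:
    aliquot_id: str  # hypothetical identifier field


@dataclass
class Sample:
    sample_id: str
    sample_type: str = ""  # e.g. "tumor" or "normal"; illustrative only
    aliquots: List[Aliquot] = field(default_factory=list)


@dataclass
class Case:
    case_id: str
    samples: List[Sample] = field(default_factory=list)

    def aliquot_ids(self) -> List[str]:
        """Flatten Case -> Sample(s) -> Aliquot(s) into a list of aliquot IDs."""
        return [a.aliquot_id for s in self.samples for a in s.aliquots]


# One case with a tumor sample (two aliquots) and a normal sample (one aliquot).
case = Case("case-1", [
    Sample("sample-T", "tumor", [Aliquot("aliq-1"), Aliquot("aliq-2")]),
    Sample("sample-N", "normal", [Aliquot("aliq-3")]),
])
print(case.aliquot_ids())
```

Each aliquot belongs to exactly one sample and each sample to exactly one case, so walking the tree from the case downward reaches every aliquot once.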
Lists where complementary genomic and imaging information for the Case may be available.
Detailed instructions for submitting data into the PDC are available here.
PDC currently accepts mass spectrometry data from proteomic experiments, specifically from data-dependent and data-independent acquisitions. You may contact PDCHelpDesk@mail.nih.gov to request a program for your lab.
To use the S3 transfer feature, you need to configure your AWS account and S3 bucket to allow us to copy your files.
First you will need to have an AWS Access Key and AWS Secret Key from an IAM user that has access to your bucket. The IAM policy should look something like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::awsexamplesourcebucket",
        "arn:aws:s3:::awsexamplesourcebucket/*"
      ]
    }
  ]
}

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::033707373097:user/S3-Prod"},
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::awsexamplesourcebucket",
        "arn:aws:s3:::awsexamplesourcebucket/*"
      ]
    }
  ]
}
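If you need the two policy documents above for your own bucket name, they can be generated rather than edited by hand. This is a minimal sketch: the function names and the bucket name `my-lab-bucket` are hypothetical, and the statements simply mirror the examples above. It only produces JSON text; attaching the policies in AWS is still up to you.

```python
import json

# ARN of the user that performs the copy, copied from the example policy above.
PDC_PRINCIPAL = "arn:aws:iam::033707373097:user/S3-Prod"


def bucket_arns(bucket: str) -> list:
    """ARNs for the bucket itself and for every object inside it."""
    return [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]


def iam_user_policy(bucket: str) -> dict:
    """Policy for your own IAM user: read objects and list the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": bucket_arns(bucket),
        }],
    }


def bucket_policy(bucket: str) -> dict:
    """Policy delegating object reads to the principal shown above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DelegateS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": PDC_PRINCIPAL},
            "Action": ["s3:GetObject"],
            "Resource": bucket_arns(bucket),
        }],
    }


# "my-lab-bucket" is a placeholder; substitute your own bucket name.
print(json.dumps(iam_user_policy("my-lab-bucket"), indent=2))
print(json.dumps(bucket_policy("my-lab-bucket"), indent=2))
```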
PDC accepts various proprietary data formats developed by different manufacturers of mass spectrometers, such as “.raw”, “.d”, “.wiff”, “.wiff.scan”, “.wiff.mtd”, “.dat” and “.mis”. If you encounter a data format that is not currently supported while uploading to the submission portal, please reach out to the PDC helpdesk for assistance.
Note:
For AB Sciex (formerly Applied Biosystems) series of mass spectrometers, such as the AB Sciex TripleTOF and QTRAP, which produce WIFF format data, it's important to note that multiple files may represent one raw file. In such cases, users are required to upload all associated files together in the same upload.
For Bruker instruments such as the timsTOF series, where a folder represents one raw file, users should compress each .d folder corresponding to a single raw file into its own zip file. The compressed file should be named in the format "file.d.zip" to correspond to the original file name.
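The compression step for .d folders can be scripted. The sketch below uses only the Python standard library; the function name `zip_d_folder` is hypothetical, and keeping the .d folder as the top-level entry inside the archive is an assumption, not a documented PDC requirement.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_d_folder(d_folder: str) -> Path:
    """Compress one '<name>.d' folder into '<name>.d.zip' next to it."""
    folder = Path(d_folder).resolve()
    if not folder.is_dir() or folder.suffix != ".d":
        raise ValueError(f"expected a .d folder, got: {folder}")
    # make_archive appends '.zip' to the base name, giving '<name>.d.zip'.
    archive = shutil.make_archive(
        str(folder), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )
    return Path(archive)


# Demonstration on a throwaway folder with a placeholder file inside.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "sample01.d"
    run.mkdir()
    (run / "analysis.tdf").write_text("placeholder")
    archive = zip_d_folder(str(run))
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        members = zf.namelist()

print(archive_name, members)
```

One archive is produced per .d folder, so a directory of runs can be processed by calling the function in a loop.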
Mass Spectrometry Data Formats | |
RAW (Vendor) Format | Mass spectrometry data uploaded by data submitters as RAW or vendor-format files corresponding to the mass spectrometers used to acquire the spectra. |
mzML Format | RAW format spectra in the HUPO Proteomics Standards Initiative (PSI) compliant mzML format. |
Spectral Library Generated by PDC Common Data Analysis Pipeline. Experiment-level quantitative spectral library with spectra and retention time boundaries given to Skyline for quantification. | |
Peptide-Spectrum Match (PSM) Data Generated by PDC Common Data Analysis Pipeline. | |
RAW PSM Format | The best peptide-spectrum matches (PSMs), from the first-level analysis of the PDC CDAP, for each tandem mass spectrum against the peptide sequences from a reference protein sequence database (UniProt), in TSV format. |
mzIdentML PSM Format | Raw PSMs in the PSI compliant mzIdentML format. |
Protein Assembly Generated by PDC Common Data Analysis Pipeline. | |
Protein identification and quantitation reports generated from the PSM data through a conservative gene-based generalized parsimony analysis. Peptides are associated with genes, rather than protein identifiers, and genes with at least two unshared peptide identifications are inferred. The resulting gene list is estimated to have a false-discovery rate of at most 1%. Several different output files are generated depending on the experiment type. | |
DDA CDAP: | |
.summary.tsv | Protein identification summary report |
.precursor_area.tsv | Label-free workflow protein quantitation report for relative quantitation by precursor peak area integration |
.spectral_count.tsv | Label-free workflow protein quantitation report for relative quantitation by spectral counts |
.itraq.tsv | iTRAQ workflow protein relative quantitation report |
.tmt.tsv | TMT workflow protein relative quantitation report |
.peptides.tsv | Identified peptide summary report |
.phosphopeptide.tsv | Labelled workflow phosphopeptide relative quantitation report |
.phosphosite.tsv | Labelled workflow phosphosite relative quantitation report |
.glycopeptide.tsv | Labelled workflow N-linked glycopeptide relative quantitation report |
.glycosite.tsv | Labelled workflow N-linked glycosite relative quantitation report |
DIA CDAP: | |
precursors_unnormalized.tsv | Unnormalized precursor peak areas |
precursors_normalized | Median normalized precursor peak areas |
proteins_unnormalized | Unnormalized protein abundances, calculated by taking the sum of every precursor in the protein |
proteins_normalized | DirectLFQ normalized protein abundances |
sky.zip | The Skyline document used for quantification of chromatographic peaks |
Please note: The DIA CDAP analysis pipeline is currently under development, which means that the output files, data, and formats may undergo changes. Thank you for your understanding. | |
QC reports | Quality control metrics computed by the CDAP. The report consists of summary statistics derived from all MS/MS spectra in the raw spectral data files. |
Supplementary data (provided by data submitters) | Other metadata These are supplementary files from the original data submitters for distribution at the PDC. These usually include the following:
These are original output files from the mass spectrometry data processing pipeline run by the data submitters. These files are typically the ones used for results in a peer-reviewed publication and to inform conclusions. Submitted by the data submitters |
There are a few different ways to identify and download the files of interest:
1. Downloading Files from a Specific Study:
2. Using Filters to Identify Files:
1. Select Files:
2. Download Methods:
Some free download managers:
Disclaimer: The third-party software links are provided “as is” without warranty of any kind, either expressed or implied and such software is to be used at your own risk.
By default, all downloaded files will be placed in the same folder without any particular folder structure. The PDC manifest file provides all relevant metadata if you wish to organize them into a folder structure. This is especially useful when analyzing large datasets with labelling experiments.
The following metadata available in the PDC file manifest can be used to organize the files:
PDC Study ID, e.g., PDC000319
PDC Study Version, e.g., 1
Data Category, e.g., Processed Mass Spectra
Run Metadata ID, e.g., AML Gilteritinib TimeCourse - Phosphoproteome-1
File Type, e.g., Open Standard
File, e.g., PTRC_exp12_plex_01_P_f06.mzML.gz
You may use this information to create a folder structure and move the downloaded files into the desired location.
e.g.
PDC Study ID/PDC Study Version/Data Category/Run Metadata ID/File Type/File
PDC000319/1/Processed Mass Spectra/AML Gilteritinib TimeCourse - Phosphoproteome-1/Open Standard/PTRC_exp12_plex_01_P_f06.mzML.gz
You may use the following sample scripts (in bash and Python), available on the PDC GitHub, either to download and reorganize files or simply to reorganize previously downloaded files into this folder structure. Feel free to modify them to suit your needs.
https://github.com/esacinc/PDC-Public/tree/master/tools/downloadPDCData
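Separately from the official scripts linked above, the path-building step can be illustrated in a few lines. This sketch is not the PDC script: the function names are hypothetical, and it assumes that the manifest is a CSV whose column headers match the metadata names listed above, which you should check against the header row of your own manifest.

```python
import csv
import io
from pathlib import Path

# Assumed column headers, in folder order; verify against your manifest.
PATH_COLUMNS = [
    "PDC Study ID", "PDC Study Version", "Data Category",
    "Run Metadata ID", "File Type", "File",
]


def destination(row: dict, root: str) -> Path:
    """Build root/<study>/<version>/<category>/<run>/<type>/<file> for one row."""
    return Path(root).joinpath(*(row[col].strip() for col in PATH_COLUMNS))


def plan_destinations(manifest_text: str, root: str) -> list:
    """Return the target path for every file listed in a CSV manifest."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [destination(row, root) for row in reader]


# A one-row manifest built from the example values above.
manifest = (
    "PDC Study ID,PDC Study Version,Data Category,Run Metadata ID,File Type,File\n"
    "PDC000319,1,Processed Mass Spectra,"
    "AML Gilteritinib TimeCourse - Phosphoproteome-1,Open Standard,"
    "PTRC_exp12_plex_01_P_f06.mzML.gz\n"
)
plan = plan_destinations(manifest, "downloads")
print(plan[0].as_posix())
```

To move a downloaded file, create the parent folder with `target.parent.mkdir(parents=True, exist_ok=True)` and then call `shutil.move(source, target)`.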
PDC is hosted on the AWS cloud and is under active development. To reduce egress costs from unintended downloads, the download URLs expire after 7 days (168 hours); revisit the PDC portal to generate a new file manifest. Downloads of the same file from the same IP address are also limited to 10 per 24-hour period, so if you have already downloaded a file several times you may get an error message indicating that you have exceeded your download attempts for that period.
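A download script can check both limits before it starts. The sketch below is illustrative only: the function names are hypothetical, and treating the 24-hour limit as a rolling window counted from each previous attempt is an assumption about how the limit is enforced.

```python
from datetime import datetime, timedelta

URL_LIFETIME = timedelta(hours=168)  # 7 days, as stated above
MAX_DOWNLOADS_PER_DAY = 10           # per file and IP address, as stated above


def urls_expired(manifest_created: datetime, now: datetime) -> bool:
    """True once the manifest is at least 168 hours old."""
    return now - manifest_created >= URL_LIFETIME


def downloads_remaining(previous_attempts: list, now: datetime) -> int:
    """Attempts left for one file, assuming a rolling 24-hour window."""
    window_start = now - timedelta(hours=24)
    recent = [t for t in previous_attempts if t > window_start]
    return max(0, MAX_DOWNLOADS_PER_DAY - len(recent))


# Arbitrary example timestamps.
now = datetime(2024, 1, 9, 12, 0)
expired_old = urls_expired(datetime(2024, 1, 1, 12, 0), now)    # 8 days old
expired_new = urls_expired(datetime(2024, 1, 7, 12, 0), now)    # 2 days old
attempts = [now - timedelta(hours=h) for h in (1, 2, 30)]       # one is outside the window
remaining = downloads_remaining(attempts, now)
print(expired_old, expired_new, remaining)
```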
PDC portal allows users to build cohorts by applying various clinical, biospecimen, experimental and file features as filters and export the selections as manifest files.
To download biospecimen (case, sample, aliquot) related data, once you identify the data of interest by applying filters, move to the 'Biospecimens' tab on the 'Explore' page. Select the checkbox to select a specific row, all rows on the page or all pages, and click the export biospecimen manifest button to export in CSV or TSV format.
To download clinical data, once you identify the data of interest by applying filters, move to the 'Clinical' tab on the 'Explore' page. Select the checkbox to select a specific row, all rows on the page or all pages, and click the export clinical manifest button to export in CSV or TSV format. Clinical data are organized into multiple files and are exported as one zip archive. The archive contains:
- Clinical manifest that includes demographic and diagnosis data.
- Exposure manifest that includes exposure related data.
- Follow-up manifest that includes follow-up related data.
- Treatment manifest that includes treatment related data.
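Because the clinical export is a zip archive of several manifests, it can be convenient to load every table in one pass. The following is a minimal sketch using only the standard library. The function name is hypothetical, and the file names and columns in the demonstration are invented placeholders, since the actual names inside the archive are not specified here.

```python
import csv
import io
import zipfile


def read_manifests(archive_bytes: bytes) -> dict:
    """Map each CSV/TSV member of a zip archive to a list of row dicts."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith(".csv"):
                delimiter = ","
            elif lower.endswith(".tsv"):
                delimiter = "\t"
            else:
                continue  # skip anything that is not a delimited table
            text = zf.read(name).decode("utf-8-sig")
            tables[name] = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    return tables


# Build a small stand-in archive in memory (placeholder names and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("clinical.csv", "case_id,gender\nC1,female\n")
    zf.writestr("exposure.tsv", "case_id\talcohol_history\nC1\tno\n")
    zf.writestr("readme.txt", "ignored")

tables = read_manifests(buffer.getvalue())
print(sorted(tables))
```

For a real export, pass the archive contents with `read_manifests(open(path, "rb").read())`.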
Refer to PDC data dictionary for more information about the biospecimen and clinical data.
No, there is no need to create an account or log in to PDC to browse the portal or download the data. However, login is required if you would like to submit data to PDC.
PDC studies can have multiple versions. Additional versions (updates) of a study are created when the underlying data changes substantially. This may involve changes to the raw data, processed data, and/or metadata. When a new version is created, it may fall out of sync with the original publication of this data. Use of the latest version is strongly encouraged, as it commonly represents an update directly from the submitter.
Click on the name of the study on the 'Explore' page to open its study summary page. The study summary page provides details about the objective of the study, the protocol, the experimental design, and the clinical data of the cases and samples used in the study.
Genes can be searched with their gene symbols through the search box on PDC portal. Enter the gene symbol or name of the gene (such as kinase) in the search box and click on the gene of your interest in the drop down. This will take you to a gene summary page.
It is also possible to search for multiple genes such as those involved in a pathway. Go to the Gene tab on the 'Explore' page and in the gene filter on the left hand side enter the list of gene symbols. You may also select from the prebuilt gene lists from several pathways important in cancer.
The PDC Common Data Analysis Pipeline generates a protein abundance matrix rolled up to gene level for each study. The data can be viewed as an interactive heatmap through Morpheus viewer on PDC.
Morpheus is a heatmap viewer from the Broad Institute. It is versatile and has many features to filter, cluster and save the data. More ways to explore the heatmap can be found here - Morpheus - Tutorial
PDC can be searched for the following:
- Biospecimens such as case, sample or aliquot using their original identifiers or PDC ids
- Studies using their PDC id (e.g. PDC000220) or partial name (e.g. CPTAC, HNSCC)
- Genes using their Gene symbol (e.g. BRCC3) or partial description (e.g. Kinase)
- External identifiers such as dbGaP study (e.g. phs000892)
It is also possible to search for related data from external resources using the external identifiers. For example, you may search for a dbGaP study using its identifier (e.g., phs000892) to identify all related PDC studies.
Peptides can be searched through PepQuery, a peptide-centric search that focuses on only novel DNA or protein sequences of interest.
From the menu bar, go to Analysis -> PepQuery. Enter the peptide in the search box and select the dataset to search against. More details about PepQuery can be found here - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6396417/.
All of the PDC data and metadata is accessible through APIs. Learn more here.
Currently all of the data in PDC are open access.
In the near future, PDC is expected to host patient sample specific protein sequence databases provided by data submitters. These databases are generated using the genomic and transcriptomic information from the patient sample and help in the identification of novel proteins resulting from single nucleotide variants, splice variants and fusion genes. These files are currently designated as controlled access databases and would need dbGaP authorization for access.
When using PDC data in a publication, please cite the PDC resource and the primary publication of the data:
To cite the resource, cite PDC url - https://pdc.cancer.gov
PDC uses human readable identifiers for representing studies.
To cite an individual study, cite either the PDC study id (e.g., PDC000250) or a URL to the study (e.g., https://pdc.cancer.gov/pdc/study/PDC000250).
The primary publication from the original data producers is available on the individual study summary pages and also on the PDC publications page.
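When citing many studies, the study URLs can be generated from their identifiers. This is a small sketch: the function name is hypothetical, and the "PDC plus six digits" pattern is inferred from the example identifiers on this page rather than from a documented rule.

```python
import re

PDC_URL = "https://pdc.cancer.gov"
# Pattern inferred from examples such as PDC000250; an assumption, not a spec.
STUDY_ID = re.compile(r"PDC\d{6}")


def study_url(study_id: str) -> str:
    """Return the study page URL for a PDC study id."""
    if not STUDY_ID.fullmatch(study_id):
        raise ValueError(f"unexpected study id format: {study_id!r}")
    return f"{PDC_URL}/pdc/study/{study_id}"


print(study_url("PDC000250"))
```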
PDC data submission and data use are governed by Creative Commons CC-BY 4.0 licensing terms.
Under CC-BY terms, the user is free to:
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
Attribution — You must give appropriate credit , provide a link to the license, and indicate if changes were made . You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Users of PDC shall acknowledge that they are encouraged not to acquire ownership interest in PDC data, nor any immediate or future intellectual property rights in any research conducted using the PDC data. NIH considers these data as pre-competitive and urges Users to avoid making IP claims derived directly from the proteomic dataset(s). It is expected that these NIH-provided data, and conclusions derived therefrom, will remain freely available, without requirement for licensing. However, the NIH also recognizes the importance of the subsequent development of IP on downstream discoveries, especially in therapeutics, which will be necessary to support full investment in products that the public needs. For further information about the PDC Intellectual Property policy, please contact us at PDCHelpDesk@mail.nih.gov.
There is no longer an embargo on the data released by PDC. Submitters of data may, however, request their data to be released within a specified time frame.
Learn more about the PDC harmonization process here.
Yes, PDC data is accessible through the CRDC Cloud Resources for further analysis. Refer to Analyze PDC Data in the Cloud for more details.
Most studies from CPTAC and other programs in PDC use an isobaric labelling protein quantitation workflow, in which multiple biological samples are labeled with an identifying reagent (the isobaric tag) and mixed before tandem mass spectrometry analysis. The isobaric tag reagents are named based on the technique and their multiplexing capacity: iTRAQ reagents provide 4-plex analyses, while TMT-n reagents provide n-plex analyses, for n = 10, 11, 16, and 18. Isobaric tags are quantified in each identified tandem mass spectrum of a peptide, but since peptide intensities vary widely, all isobaric tag intensities must be normalized with respect to one of the tags in each spectrum. The resulting ratios can be summarized by averaging the ratios for the peptides from a protein. To expand the number of biological samples in CPTAC studies beyond the capacity of the isobaric tag reagents, a common reference sample is included in all analytical samples and its tag's intensities are used as the ratio denominator throughout.
A small number of older CPTAC studies use a label-free quantitation workflow, without labelling reagents, and quantify peptides based on the integrated area under the elution profile of each precursor ion (precursor area) and the number of peptides identified from the protein (spectral counts).
The isobaric labelling quantitation workflows provide relative protein abundance, relative to the common reference sample, while the label-free quantitation workflows provide absolute protein abundance without reference to any other sample.
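The ratio calculation described above can be made concrete with a small worked example. This is a simplified sketch, not the CDAP implementation: the function names and intensities are invented, the channel labels are only illustrative, and averaging on the log2 scale is an assumption.

```python
import math
from statistics import mean


def spectrum_log_ratios(intensities: dict, reference: str) -> dict:
    """log2(tag / reference) for every non-reference tag in one spectrum."""
    ref = intensities[reference]
    return {
        tag: math.log2(value / ref)
        for tag, value in intensities.items()
        if tag != reference
    }


def protein_log_ratios(spectra: list, reference: str) -> dict:
    """Average the per-spectrum log2 ratios of one protein, tag by tag."""
    per_tag = {}
    for intensities in spectra:
        for tag, ratio in spectrum_log_ratios(intensities, reference).items():
            per_tag.setdefault(tag, []).append(ratio)
    return {tag: mean(ratios) for tag, ratios in per_tag.items()}


# Two spectra from the same protein; channel "126" carries the common reference.
# Raw intensities differ tenfold between spectra, but the ratios are comparable.
spectra = [
    {"126": 1000.0, "127N": 2000.0, "131": 500.0},
    {"126": 100.0, "127N": 400.0, "131": 50.0},
]
result = protein_log_ratios(spectra, reference="126")
print(result)
```

The second spectrum is ten times weaker overall, yet dividing by the reference channel puts both spectra on the same scale, which is why ratios rather than raw intensities are summarized.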
For processed protein abundance, download the quantitation report file based on the quantitation workflow, and labeling reagent where appropriate, used in the study:
Each of these summary reports provides protein abundance values for the biological samples analyzed in the study. TMT and iTRAQ workflows provide relative protein abundance values, while the label-free workflow provides absolute protein abundance values.
- Protein Identification Reports:
- Protein Quantitation Reports:
- Site-specific PTM Reports:
Site-specific PTM reports are only available for the isobaric labelling quantitation workflows and represent the summary of spectral ratios across similarly modified peptides and modified protein sites.
Identification files (e.g., *.summary.tsv) contain information about analytical samples, which are usually (TMT, iTRAQ workflows) a mixture of multiple biological samples. Quantitation files (e.g., *.tmt11.tsv, *.itraq.tsv) provide relative abundance values for each individual biological sample after de-multiplexing. Only the label-free quantitation workflow has a single biological sample in each analytical sample.
In the isobaric labelling quantitation workflows, one of the isobaric tags is assigned to the common reference sample. Usually, the same isobaric tag is used for the common reference sample in every analytical sample; generally it is the first or last tag. Consequently, for studies using labelling reagents with a plex capacity of n, n-1 labels will be used for clinical biospecimens.
Spectral counts in the identification summary report (`*.summary.tsv`) are for analytical samples, which usually represent multiple biological samples, and provide evidence for identified proteins.
In the isobaric labelling quantitation workflow files (e.g., `*.tmt11.tsv`):
While each protein must have its own peptide evidence for identification, it is unclear whether the ratios observed for shared peptides should be added to each gene's average ratio. These two versions of the ratio represent two extremes. While these two ways of summarizing the relative protein abundance usually produce quite similar values, the Unshared Log Ratio values may represent the summary of fewer values, while the Log Ratio values may represent the convolution of two different, homologous proteins' relative quantitation values.
The Unshared Log Ratio relative protein quantitation values summarize only the ratios of peptides which are not shared between identified proteins, so this may be sufficient for your needs. If not, you can find the individual reporter ion intensities for each identified tandem mass spectrum in the PSM and mzIdentML files and aggregate these to peptides.
The isobaric labelling quantitation workflow produces peaks for the isobaric tags in each tandem mass spectrum, but the peak intensities of different peptides' ions are not consistent with each other or with their protein's absolute abundance. However, the ratio of reporter ion peak intensities is consistent across peptides, and the repeated observations can be used to estimate the relative abundance of a protein between samples.
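The difference between the two summaries can be seen with a toy example. This sketch assumes each peptide observation is already a log ratio flagged as shared or unshared; the function name, the data and the use of a plain mean are illustrative assumptions rather than the pipeline's actual procedure.

```python
from statistics import mean


def summarize_gene(peptides: list) -> dict:
    """Summarize (log_ratio, is_shared) pairs for one gene in both ways."""
    all_ratios = [ratio for ratio, _ in peptides]
    unshared = [ratio for ratio, is_shared in peptides if not is_shared]
    return {
        # Uses every peptide, including those shared with other genes.
        "log_ratio": mean(all_ratios),
        # Uses only peptides unique to this gene; None if there are none.
        "unshared_log_ratio": mean(unshared) if unshared else None,
    }


# Two unshared peptides and one peptide shared with a homologous gene.
peptides = [(0.5, False), (0.7, False), (1.5, True)]
summary = summarize_gene(peptides)
print(summary)
```

Here the shared peptide pulls the Log Ratio up to about 0.9, while the Unshared Log Ratio stays at about 0.6 but rests on only two observations.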
Each analytical sample in the `*.sample.txt` file represents a mixture of biological samples. The file contains biospecimen_submitter_ids, which can be linked to case_submitter_ids using the Biospecimen tab on the PDC website.
The summary statistics in the quantitation files are computed before median normalization of the log2ratio values. To match these statistics, you need to add the median back to each column.
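Adding the median back is a single shift per sample column. The sketch below assumes you already know each column's median (where it is recorded is not specified here); the function names and values are hypothetical.

```python
from statistics import median


def median_normalize(values: list) -> tuple:
    """Subtract the column median; return the shifted values and the median."""
    m = median(values)
    return [v - m for v in values], m


def restore(normalized: list, column_median: float) -> list:
    """Undo median normalization by adding the column median back."""
    return [v + column_median for v in normalized]


# One sample column of log2 ratios before normalization (invented values).
raw = [0.2, 0.5, 1.1]
normalized, col_median = median_normalize(raw)
restored = restore(normalized, col_median)
print(col_median, normalized, restored)
```

Statistics computed on `restored` should then agree with those reported for the pre-normalization values, up to floating-point rounding.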