The goal of the PDC data harmonization process is to transform proteomic data from disparate sources and workflows into derived forms using common data analysis methods and tools. Harmonization therefore removes variability introduced by differing analysis methods, enabling comparisons across datasets.
Harmonization starts with assigning standard identifiers, performing data integrity checks, and verifying adherence to standards (community-accepted vocabulary and nomenclature for clinical attributes, peptides, proteins, protein sequence variants, and modifications) and to the PDC data model.
The PDC uses submitted raw mass spectrometry data files to produce derived analysis results that can be used to study the identification of proteins and post-translational modifications (PTMs). All processing is done through Common Data Analysis Pipelines (CDAP). Whenever possible, quantitative results are also extracted from the raw data, enabling downstream analyses of differential expression between samples at the protein or PTM-site level.
One goal of the PDC is to harmonize a diversity of proteomics data types. All current data types are mass spectrometry data acquired using data-dependent acquisition (DDA), but pipelines are under construction for the analysis of data acquired using data-independent acquisition (DIA) or SWATH™. DDA data deposited in the PDC so far include label-free, iTRAQ4, and TMT10 experiments. The CDAP is capable of processing data from unenriched, phospho-enriched, and glyco-enriched peptide samples.
The PDC uses the CDAP developed at NIST and Georgetown University and in use at the CPTAC Data Coordinating Center (published here) as a starting point for DDA datasets. The pipeline is implemented in the Galaxy framework and runs on Amazon Cloud. Software programs and parameters are also detailed in this document. In general, the pipeline proceeds in the following order:
The result of the pipeline for a multiplexed labeling study is a matrix in which rows are genes and columns are aliquots (samples). Values in the cells represent protein (gene) expression for that sample relative to a common reference (typically a pooled sample).
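As a rough illustration of how such a gene-by-aliquot matrix can be formed, the sketch below divides each aliquot's intensity by the intensity of the common reference channel and takes the log2 of the ratio. The function name, the input layout, the gene and aliquot labels, and the choice of a log2 ratio are all assumptions made for this example; they are not taken from the CDAP implementation.

```python
import math


def log2_ratio_matrix(intensities, reference):
    """Build a gene-by-aliquot matrix of log2 ratios to a reference channel.

    intensities: {gene: {channel_name: intensity}} (hypothetical layout)
    reference:   name of the common reference channel, e.g. a pooled sample
    """
    matrix = {}
    for gene, channels in intensities.items():
        ref_value = channels.get(reference)
        # Skip genes with no usable reference measurement.
        if ref_value is None or ref_value <= 0:
            continue
        matrix[gene] = {
            aliquot: math.log2(value / ref_value)
            for aliquot, value in channels.items()
            if aliquot != reference and value > 0
        }
    return matrix


# Made-up intensities for two genes measured in two aliquots and a pooled reference.
example = {
    "TP53": {"aliquot_A": 200.0, "aliquot_B": 50.0, "pooled_ref": 100.0},
    "EGFR": {"aliquot_A": 400.0, "aliquot_B": 400.0, "pooled_ref": 100.0},
}
ratios = log2_ratio_matrix(example, "pooled_ref")

for gene, row in ratios.items():
    print(gene, row)
```

In this toy input, a value of 1.0 means the aliquot has twice the reference intensity and -1.0 means half. Genes without a positive reference value are dropped rather than assigned a placeholder.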
The Common Data Analysis Pipeline is outlined here and described in more detail in this publication.
Details of the processed outputs from CDAP can be found here.
Details of the formats of CDAP outputs can be found here: Peptide Spectral Matches and Protein Reports.
Our approach for the analysis of data-independent acquisition data is based on work in the MacCoss Lab and is currently under development. If available, sample- or pool-specific data are used to build a spectral library. The peptides in the library are then assigned match scores for each DIA data file.
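The scoring method used by the DIA pipeline is not specified here. Purely to illustrate the general idea of scoring a library peptide against an observed spectrum, the sketch below computes a normalized dot product (cosine similarity) between two peak lists after binning by m/z. This is a generic spectral-matching technique swapped in for illustration, not the PDC or MacCoss Lab method; the function names, bin width, and peak values are hypothetical.

```python
import math


def bin_peaks(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def match_score(library_peaks, observed_peaks, bin_width=1.0):
    """Cosine similarity between two binned spectra, in the range [0, 1]."""
    lib = bin_peaks(library_peaks, bin_width)
    obs = bin_peaks(observed_peaks, bin_width)
    dot = sum(value * obs.get(key, 0.0) for key, value in lib.items())
    norm = math.sqrt(sum(v * v for v in lib.values())) * math.sqrt(
        sum(v * v for v in obs.values())
    )
    return dot / norm if norm else 0.0


# Made-up peak lists of (m/z, intensity) pairs.
library = [(175.1, 10.0), (304.2, 20.0)]
same = [(175.1, 10.0), (304.2, 20.0)]
other = [(500.3, 5.0)]

score_same = match_score(library, same)
score_other = match_score(library, other)
print(score_same, score_other)
```

Identical peak lists score close to 1, and spectra with no shared bins score 0. A production scorer would also account for retention time, mass tolerance, and interference, none of which this sketch models.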
Following analysis by either pipeline, the results are available as Protein Assembly reports on the PDC Data Portal, along with original and processed RAW data and metadata, according to approved release schedules.