The publication of the database search algorithm SEQUEST in 1994 (Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989) marked the birth of a field of bioinformatics dedicated to the task of data analysis in a shotgun proteomics experiment and interpretation of the information that can be mined from the data. The Yates group has continued to play an important role in this endeavor by pioneering new computational methods and tools for processing and extracting biological information from complex data sets and refining existing algorithms to keep up with improvements in mass spectrometers and computing capabilities. These computational tools aim to improve performance at each step of the analysis process, including the processing of raw data generated by the mass spectrometer, search methods for protein identification, protein quantification and statistical assessments of identification and quantification results.

Raw data processing

The first step of analysis in MS-based proteomics experiments is peak extraction from MS raw data files. The Yates lab developed RawExtract (McDonald WH, Tabb DL, Sadygov RG, MacCoss MJ, Venable J, et al. (2004) MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Communications in Mass Spectrometry 18: 2162–2168. doi: 10.1002/rcm.1603) in 1996 to convert binary ThermoFisher Scientific RAW files to text-based files in MS1/MS2 format, which can be used for peptide identification. An update to this tool has recently been published. RawConverter (He, L., Diedrich, J. K., Chu, Y. Y., & Yates, J. R. (2015). Extracting Accurate Precursor Information for Tandem Mass Spectra by RawConverter. Analytical Chemistry) can convert RAW files to MS1/MS2, MGF (Mascot Generic Format) or mzXML, and it also supports conversions between these formats. In addition, RawConverter selects the monoisotopic peak and charge state of precursor ions more accurately, which reduces the search space required for database searches and increases confidence in identifications.
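
To make the text-based MS2 layout concrete, the following Python sketch reads MS2-style records and rewrites them as MGF. It is a minimal illustration and not RawConverter's implementation. The function names parse_ms2 and to_mgf are hypothetical, the example values are made up, and the sketch assumes the line types described by McDonald et al. (2004): H header lines, S scan lines carrying the scan number and precursor m/z, Z lines carrying the charge state, and peak lines holding an m/z and an intensity.

```python
def parse_ms2(lines):
    """Parse MS2-style text lines into a list of spectrum dicts (simplified)."""
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        tag = fields[0]
        if tag == "H":
            continue  # file-level header, ignored in this sketch
        if tag == "S":
            # Assumed layout: S <first scan> <last scan> <precursor m/z>
            current = {
                "scan": int(fields[1]),
                "precursor_mz": float(fields[3]),
                "charges": [],
                "peaks": [],
            }
            spectra.append(current)
        elif current is None:
            continue  # skip anything that appears before the first S line
        elif tag == "Z":
            # Assumed layout: Z <charge> <(M+H)+ mass>
            current["charges"].append(int(fields[1]))
        elif tag in ("I", "D"):
            continue  # per-scan annotations, ignored in this sketch
        else:
            # Peak line: <m/z> <intensity>
            current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


def to_mgf(spectra):
    """Serialize parsed spectra as MGF text."""
    out = []
    for s in spectra:
        out.append("BEGIN IONS")
        out.append(f"TITLE=scan={s['scan']}")
        out.append(f"PEPMASS={s['precursor_mz']}")
        if s["charges"]:
            out.append("CHARGE=" + " and ".join(f"{z}+" for z in s["charges"]))
        out.extend(f"{mz} {intensity}" for mz, intensity in s["peaks"])
        out.append("END IONS")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = [
        "H\tCreationDate\ttoday",
        "S\t000002\t000002\t617.80",
        "Z\t2\t1234.59",
        "187.4 12.0",
        "245.1 30.5",
    ]
    print(to_mgf(parse_ms2(example)))
```

A production converter would also need to handle MS1 scans, retention times and malformed lines, none of which are covered here.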

Peptide and protein identification and statistical significance assessment

We currently employ the protein sequence database search engine ProLuCID (Xu, T., Venable, J.D., Park, S.K., Cociorva, D., Lu, B., Liao, L., Wohlschlegel, J., Hewel, J., Yates III, J.R.: ProLuCID, a fast and sensitive tandem mass spectra-based protein identification program. Mol. Cell. Proteom. 5, S174–S174 (2006)), which is a platform-independent Java program inspired by the SEQUEST algorithm. Like SEQUEST, ProLuCID uses a modified cross-correlation (XCorr) calculation to score a spectrum against the theoretical spectrum of a peptide sequence, but it also introduces the computation of a binomial probability, which serves as a preliminary score to pre-filter spectra. ProLuCID associates a Z-score with the peptide-spectrum match (PSM) that has the highest XCorr for each spectrum, and this score can be used to filter search results. PSMs identified by ProLuCID are statistically assessed using DTASelect2 (Tabb, DL, McDonald, WH and Yates III, JR, DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics, Journal of Proteome Research, 2002, 1 (1), pp 21–26). DTASelect2 is an updated tool used to identify the proteins present in an analyzed sample. Recent changes have improved PSM filtering accuracy and added filtering flexibility by allowing users to define up to 150 different parameters. DTASelect2 outputs a set of spectra, peptides, or proteins identified under a user-selected false discovery rate.
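
The idea behind a cross-correlation score can be sketched in a few lines of Python. This is a simplified, SEQUEST-style calculation on pre-binned spectra, not ProLuCID's modified XCorr: the score is the dot product of the observed and theoretical spectra at zero offset, minus the mean dot product over nonzero offsets. The function names, the unit bin width and the offset window of 75 bins are assumptions chosen for illustration.

```python
def bin_spectrum(peaks, n_bins, bin_width=1.0):
    """Place (m/z, intensity) peaks into fixed-width bins, keeping the max per bin."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            bins[idx] = max(bins[idx], intensity)
    return bins


def xcorr(observed, theoretical, max_shift=75):
    """Simplified cross-correlation score between two equally binned spectra."""
    n = len(observed)
    if len(theoretical) != n:
        raise ValueError("spectra must use the same number of bins")

    def dot_at(shift):
        # Dot product of the theoretical spectrum with the observed
        # spectrum displaced by `shift` bins.
        total = 0.0
        for i in range(n):
            j = i + shift
            if 0 <= j < n:
                total += theoretical[i] * observed[j]
        return total

    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    background = sum(dot_at(s) for s in shifts) / len(shifts)
    # Matches at zero offset count for the peptide; matches that appear
    # only after shifting are treated as background and subtracted.
    return dot_at(0) - background


if __name__ == "__main__":
    observed = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    matching = list(observed)
    unrelated = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
    print(xcorr(observed, matching, max_shift=3))
    print(xcorr(observed, unrelated, max_shift=3))
```

Subtracting the shifted background penalizes candidate peptides whose theoretical peaks overlap the observed spectrum only by chance, which is why a true match scores higher than an unrelated peptide with a similar number of peaks.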

Peptide and protein quantification

For peptide and protein quantification, we use the recently developed Census 2 (Park SK, et al. (2014) Census 2: isobaric labeling data analysis. Bioinformatics. 30(15):2208-9; Park, S.K. and Venable, J.D., et al. (2008) A quantitative analysis software tool for mass spectrometry-based proteomics. Nature Methods 5, 319-322). Like the original Census algorithm, Census 2 can quantify peptides and proteins labeled using a variety of labeling strategies (e.g. 15N, SILAC, iTRAQ, TMT, dimethyl, 18O) as well as label-free strategies for both high- and low-resolution MS data (spectral counting and extracted-ion chromatogram (XIC)-based quantification). It can utilize output from DTASelect as well as pepXML and mzXML files. Census 2 incorporates novel features for the analysis of peptides quantified using tandem mass tag (TMT) reporter ions, including reporter ion impurity correction, a reporter ion minimum intensity threshold filter and an optional weighted normalization that corrects mixing errors. Features have also been added for quantification using metabolic labeling strategies (15N, SILAC). Census 2 can also process data from MS experiments performed using HCD, CID/HCD double-play, HCD MS3, or MultiNotch MS3.
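
As an illustration of two of the reporter ion steps mentioned above, the Python sketch below applies a minimum intensity threshold and then rescales each channel so that all channels have the same summed intensity, which is one simple way to compensate for unequal sample mixing. It is not the weighted normalization implemented in Census 2. The function name normalize_channels, the input layout (one list of channel intensities per PSM) and the example numbers are assumptions.

```python
def normalize_channels(psm_intensities, min_intensity=0.0):
    """Filter PSMs by reporter ion intensity, then equalize channel totals.

    psm_intensities: list of lists, one inner list of reporter ion
    intensities (one value per channel) for each PSM.
    """
    # Drop PSMs in which any reporter ion falls below the threshold.
    kept = [p for p in psm_intensities if min(p) >= min_intensity]
    if not kept:
        return []

    n_channels = len(kept[0])
    totals = [sum(p[c] for p in kept) for c in range(n_channels)]
    mean_total = sum(totals) / n_channels

    # Per-channel scaling factor; channels with no signal are left unchanged.
    factors = [mean_total / t if t > 0 else 1.0 for t in totals]
    return [[p[c] * factors[c] for c in range(n_channels)] for p in kept]


if __name__ == "__main__":
    # Two-channel example in which the second sample was loaded at twice
    # the amount of the first; the last PSM is too weak to be trusted.
    psms = [[100.0, 200.0], [300.0, 600.0], [1.0, 2.0]]
    for row in normalize_channels(psms, min_intensity=10.0):
        print(row)
```

Sum-based scaling assumes that most proteins do not change between channels; when that assumption does not hold, a more robust reference (for example a median ratio) would be a better choice.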

Post-translational modifications discovery and assessment

ProLuCID can be used to identify several types of post-translationally modified peptides, but supplemental tools have been developed for more in-depth analysis of PTM events. We have developed an algorithm named Debunker [Lu, B., Ruse, C., Xu, T., Park, S. K., Yates, J., 3rd, Automatic validation of phosphopeptide identifications from tandem mass spectra. Anal. Chem. 2007, 79, 1301–1310.], which uses a support vector machine binary classifier to assess the phosphorylation status of a given peptide. Debunker may be used to assess the confidence of both phosphorylation status and site localization.

Functional enrichment analysis

Functional enrichment analyses are often used to generate hypotheses regarding the underlying mechanisms revealed in MS-based proteomics datasets (refs). We have recently developed a functional enrichment analysis algorithm explicitly for MS-based proteomics data (PSEA-Quant, Lavallée-Adam, M., Rauniyar, N., McClatchy, D.B., Yates J.R. III: PSEA-Quant: a protein set enrichment analysis on label-free and label-based protein quantification data. J. Proteome Res. 13(12), 5496–5509 (2014)). This user-friendly, web-based algorithm, inspired by GSEA [Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., Mesirov, J.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550 (2005)], performs a novel protein set enrichment analysis for label-free and labeled MS-based quantitative proteomics. Unlike GSEA, PSEA-Quant allows the analysis of proteomics samples originating from a single condition or from multiple conditions. This Java program uses Census’ output, while also supporting other file formats, to identify protein sets that are statistically significantly enriched among abundant proteins that are quantified with high reproducibility across a set of replicates.
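
For readers unfamiliar with the GSEA idea that inspired PSEA-Quant, the following Python sketch computes an unweighted, GSEA-style running-sum enrichment score for a protein set against a ranked protein list. It does not reproduce the PSEA-Quant statistic, which also accounts for quantification reproducibility across replicates. The function name enrichment_score and the placeholder protein identifiers are hypothetical.

```python
def enrichment_score(ranked_ids, protein_set):
    """Unweighted GSEA-style running-sum score.

    ranked_ids: protein identifiers sorted from most to least abundant.
    protein_set: set of identifiers that share an annotation.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [pid in protein_set for pid in ranked_ids]
    n_hits = sum(hits)
    n_miss = len(ranked_ids) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0  # the score is undefined when the set covers none or all of the list

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for set members, step down for all other proteins.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["A", "B", "C", "D", "E", "F"]  # placeholder identifiers
    print(enrichment_score(ranked, {"A", "B"}))  # set concentrated at the top
    print(enrichment_score(ranked, {"E", "F"}))  # set concentrated at the bottom
```

In practice the significance of such a score is estimated by comparing it with scores obtained from randomized data, a step that is omitted from this sketch.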

Algorithms and tools currently in development

Software development is an ongoing and vital part of our lab. We are constantly developing new algorithms, improving existing tools and adapting to new technologies. Current projects in the lab include algorithms for the identification of cross-linked peptides, the identification of unspecified PTMs, and top-down computational analysis, as well as PrOntoNet, which infers protein-protein interactions (PPIs) from a list of identified proteins in organisms for which very few PPIs are known in the literature and whose proteins are largely functionally uncharacterized. We have also recently developed PINT, an open-source, platform-independent Java program built on top of a MySQL database engine. PINT provides an environment to store data and integrate results from various computational analyses (including ProLuCID, DTASelect, and Census), which may originate from different proteomics approaches.