Unified SEQUEST Files

Storing tandem mass spectra (.dta) and SEQUEST results (.out) as individual text files causes two main problems. First, a spectrum or SEQUEST result is almost always smaller than the block size of the filesystem on which it is stored, so every .dta or .out file wastes disk space. Second, many operating systems become sluggish or unstable when accessing directories that contain thousands of files. The MS2 and SQT file formats were created to circumvent these problems.
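The disk-space argument can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: the 4096-byte block size, the 1,500-byte spectrum size, and the count of 10,000 files are all hypothetical numbers chosen for illustration.

```python
def slack_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes allocated on disk but left unused by a file of file_size bytes."""
    # Round the file size up to a whole number of blocks (integer ceiling).
    blocks = -(-file_size // block_size)
    return blocks * block_size - file_size


# Hypothetical data set: 10,000 spectra of 1,500 bytes each.
sizes = [1500] * 10_000

# One file per spectrum: every file wastes the tail of its own block.
separate = sum(slack_bytes(s) for s in sizes)

# One concatenated file: only the final block is partly empty.
combined = slack_bytes(sum(sizes))

print(f"separate files waste {separate} bytes, one file wastes {combined} bytes")
```

Under these assumptions the per-file layout wastes more space than the spectra themselves occupy, while the single file wastes less than one block.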

MS2 Format

Each .ms2 file is analogous to a collection of .dta files. Most information from the .dta file name (low and high scan numbers, charge state) is instead stored as text in the .ms2 file. The root name for all spectra from a .dat file is the root name for the .ms2 file. An exception is the file phs.ms2, which consists of spectra drawn from other .ms2 files that show prominent phosphate losses from the precursor. The spectra are stored back-to-back in the .ms2 file, which has no header or footer of its own. Most .ms2 files are sorted by scan number, but this is not required. An individual spectrum record may have one or two headers and one body. The header format is as follows:

:[loscan].[hiscan].[Z]
[M+H+] [Z]

Here [loscan] is the low scan number for this spectrum and [hiscan] is the high scan number; the two may be (and often are) identical. [Z] represents the expected charge of the precursor ion. Multiple charges may be proposed for the same spectrum, which is why multiple headers may appear back-to-back. The precursor mass estimated for a particular charge state, [M+H+], is given on the line that follows the colon line.

The body of an individual spectrum consists of records as follows:

[M/Z] [Intensity]

The whitespace between the two values on each line should be a space or tab character.
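The record layout described above can be read with a small state machine. The sketch below is one possible reader, not the Yates Lab's own code. It assumes that the colon line has the form :[loscan].[hiscan].[Z], mirroring the .dta file name, and that a new spectrum begins when a colon line follows peak lines or carries different scan numbers. The names Spectrum and parse_ms2 and the sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    loscan: int
    hiscan: int
    precursors: list = field(default_factory=list)  # (charge, [M+H]+) per header
    peaks: list = field(default_factory=list)       # (m/z, intensity) pairs


def parse_ms2(lines):
    """Yield Spectrum records from an iterable of .ms2 text lines."""
    current = None
    expect_mass_line = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Assumed colon-line layout: ":loscan.hiscan.z"
            lo, hi, _z = line[1:].split(".")
            scans = (int(lo), int(hi))
            # A colon line directly after another header with the same scans
            # proposes an extra charge state; otherwise it opens a new record.
            if (current is None or current.peaks
                    or scans != (current.loscan, current.hiscan)):
                if current is not None:
                    yield current
                current = Spectrum(*scans)
            expect_mass_line = True
        elif expect_mass_line:
            mass, z = line.split()
            current.precursors.append((int(z), float(mass)))
            expect_mass_line = False
        else:
            if current is None:
                raise ValueError("peak line found before any header")
            mz, intensity = line.split()  # splits on a space or a tab
            current.peaks.append((float(mz), float(intensity)))
    if current is not None:
        yield current


# Hypothetical sample: one spectrum with two proposed charges, then another.
sample = (
    ":1110.1110.2\n1544.17 2\n"
    ":1110.1110.3\n2315.75 3\n"
    "345.2 12.0\n401.3\t88.5\n"
    ":1111.1111.1\n812.4 1\n199.1 5.0\n"
)
spectra = list(parse_ms2(sample.splitlines()))
```

Because the function is a generator, a large .ms2 file can be passed in as an open file object and processed one spectrum at a time.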

SQT Format

Each .sqt file corresponds to a collection of .out files. The same information is present in each, but a single file holds many identifications instead of one file per identification. "sqt" was chosen for the extension as an abbreviation of SEQUEST, though users have determined that the pronunciation "squat" is faster (Q: What did you find in your sample? A: Squat). The file consists of a header and a body. The header is identical to that of an .out file and contains the fields that are invariant for the collection of identifications. The following is an example of the header.

SEQUEST-PVM v.27 (rev. 6), (c) 1993
Molecular Biotechnology, Univ. of Washington, J.Eng/J.Yates
Licensed to Finnigan Corp., A Division of ThermoQuest Corp.
Start Time 00/00/0000, 00:00 PM
# amino acids = 0000, # proteins = 000000, database file
ion series nABY ABCDVWXYZ: 0 1 1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
display top 5/3, ion % = 0.0, CODE = 1000
(M* +16.000) (ST# -80.000) C=160.139
mass tol = 3.0000, fragment tol = 0.0, AVG/AVG

The body of a .sqt file consists of S, M, and L lines. The S line gives basic information on each spectrum. The M lines that follow a particular S line give information pertaining to a particular sequence found by SEQUEST to match the spectrum. The L lines following an M line name the loci (and optionally describe the loci) which contain the M line's sequence. Each group of S, M, and L lines is packaged back-to-back with other such groups. The format for each of these lines is tab-delimited:

S	1110	1110	3	147	shamu34	3087.3400	5715.4	232.7	1309031
M	1	1	3087.4406	0.0000	6.4682	1850.7	44	108	T.VIVLNHPGQISAGYSPVLDCHTAHIACR.F	U
M	2	117	3086.6493	0.5173	3.1221	284.2	23	108	Y.M*AAMLGSFPS#AAIFFGTYEYTKRTMIED.W	U
M	3	117	3086.6493	0.5222	3.0905	284.2	23	108	Y.MAAM*LGSFPS#AAIFFGTYEYTKRTMIED.W	U
M	4	187	3086.6356	0.5353	3.0056	265.3	23	100	L.LVNNTDIQLLAKEPSYKKMREKFATM*.G	U
M	5	198	3086.8583	0.5485	2.9203	263.9	24	108	V.INLQES#ILAGIRLVPVLPILLDYVLYNI.S	U

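The fields in an S line include, in order: low scan number, high scan number, charge state, process time, server name, observed mass, total intensity, lowest Sp, and number of sequences matched.

The fields in an M line include, in order: rank by Xcorr, rank by Sp, calculated mass, DeltaCN, Xcorr, Sp, number of ions matched, number of ions expected, matched sequence, and validation state.

The fields in an L line include: locus name and, optionally, a description of the locus.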
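The nesting of S, M, and L lines maps naturally onto nested records. The sketch below groups the lines that way. It is an illustration rather than any of the tools listed in this document: the function name parse_sqt_body and the dictionary keys are my own labels, and the interpretation of each column is an assumption based on the example lines above. The S and M lines in the sample are taken from that example; the L line is made up, since the example shows none.

```python
def parse_sqt_body(lines):
    """Group tab-delimited S, M, and L lines into nested dictionaries."""
    spectra = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        tag = fields[0]
        if tag == "S":
            spectra.append({
                "loscan": int(fields[1]),
                "hiscan": int(fields[2]),
                "charge": int(fields[3]),
                "process_time": int(fields[4]),
                "server": fields[5],
                "observed_mass": float(fields[6]),
                "total_intensity": float(fields[7]),
                "lowest_sp": float(fields[8]),
                "sequences_matched": int(fields[9]),
                "matches": [],
            })
        elif tag == "M":
            # An M line belongs to the most recent S line.
            spectra[-1]["matches"].append({
                "rank_xcorr": int(fields[1]),
                "rank_sp": int(fields[2]),
                "calculated_mass": float(fields[3]),
                "delta_cn": float(fields[4]),
                "xcorr": float(fields[5]),
                "sp": float(fields[6]),
                "ions_matched": int(fields[7]),
                "ions_expected": int(fields[8]),
                "sequence": fields[9],
                "validation": fields[10],
                "loci": [],
            })
        elif tag == "L":
            # An L line belongs to the most recent M line; the description
            # after the locus name is optional.
            spectra[-1]["matches"][-1]["loci"].append(fields[1:])
        # Any other line (for example the free-text header) is skipped.
    return spectra


sample = [
    "\t".join(["S", "1110", "1110", "3", "147", "shamu34",
               "3087.3400", "5715.4", "232.7", "1309031"]),
    "\t".join(["M", "1", "1", "3087.4406", "0.0000", "6.4682", "1850.7",
               "44", "108", "T.VIVLNHPGQISAGYSPVLDCHTAHIACR.F", "U"]),
    # Hypothetical L line, not taken from a real result.
    "\t".join(["L", "HYPOTHETICAL_LOCUS", "made-up description"]),
]
result = parse_sqt_body(sample)
```

The sketch assumes a well-formed file; an M line that appears before any S line, or an L line before any M line, would raise an IndexError rather than being reported cleanly.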

Relevant Software

The Yates Lab uses a variant of ExtractMS which produces MS2 files instead of DTA files.
Rovshan Sadygov's software to remove spectra which have had incorrect charge states assigned works directly on MS2 files.
The Yates Lab uses a version of SEQUEST that works from MS2 files and creates SQT files rather than working from DTA files to produce OUT files.
Johannes Graumann in the Deshaies Lab at CalTech wrote a migration tool which builds SQT files from collections of OUT files.
In order to speed review of SEQUEST results in SQT files, Dave Tabb wrote a tool to sort the most significant identifications to the top of the file. This order is also used to move spectra to the top of the paired MS2 file.
The Yates Lab proteomic SEQUEST result assembly software can read SQT files interchangeably with OUT files.
To view individual spectrum results, Hayes McDonald in the Yates Lab created a Perl CGI to create HTML files containing the SEQUEST results from a SQT file and the spectrum from an MS2 file, displayed through DTASelect's spectrum viewer applet.
Johannes Graumann in the Deshaies Lab at CalTech created a Perl CGI to set protein loci manual validation states in the DTASelect.txt file.
Hayes McDonald created software that changes the validation state of an identification in a SQT file.
Modified 6/25/02