Storing each tandem mass spectrum (.dta) and each SEQUEST result (.out) in its own text file causes several problems. First, a spectrum or SEQUEST result is almost always smaller than the block size of the filesystem on which it is stored, so every .dta or .out file wastes disk space. Second, many operating systems become sluggish or unstable when accessing directories that contain thousands of files. The MS2 and SQT file formats were created to circumvent these problems.
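The block-size argument can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 4096-byte block and 10,000 files of 1,500 bytes each (both figures are hypothetical and chosen only for illustration); it also ignores filesystems that pack very small files more efficiently.

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Bytes reserved on disk when space is handed out in whole blocks."""
    blocks = (file_size + block_size - 1) // block_size  # round up
    return blocks * block_size


def wasted_bytes(file_sizes, block_size: int = 4096) -> int:
    """Total slack: allocated space minus the bytes actually used."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


# Hypothetical workload: 10,000 small result files of 1,500 bytes each.
sizes = [1500] * 10_000

separate = wasted_bytes(sizes)         # one file per spectrum or result
combined = wasted_bytes([sum(sizes)])  # everything in a single file

print(f"slack with separate files: {separate} bytes")
print(f"slack with one combined file: {combined} bytes")
```

Under these assumptions the separate files waste about 26 MB of slack, while the single combined file wastes less than one block.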
Each .ms2 file is analogous to a collection of .dta files. Most of the information carried in a .dta file name (low and high scan numbers, charge state) is instead stored as text inside the .ms2 file. The root name shared by all spectra from a .dat file becomes the root name of the .ms2 file. An exception is the file phs.ms2, which consists of spectra drawn from other .ms2 files that show prominent phosphate losses from the precursor. The spectra are stored back-to-back in the .ms2 file, with no file-level header or footer. Most .ms2 files are sorted by scan number, but this is not required. An individual spectrum record may have one or two headers and one body. The header format is as follows:
Here, [loscan] is the low scan number for this spectrum and [hiscan] is the high scan number; the two may be (and often are) identical. [Z] is the expected charge of the precursor ion. Multiple charges may be proposed for the same spectrum, which is why multiple headers may appear back-to-back. The precursor mass estimated for a particular charge state, [M+H+], follows the colon line.
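The scan numbers and charge state that a .ms2 header records are the same values that a .dta file name carries. The following is a minimal sketch of recovering them from file names, assuming the conventional pattern root.loscan.hiscan.Z.dta; the function names and the sample file names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class DtaName(NamedTuple):
    root: str
    loscan: int
    hiscan: int
    charge: int


def parse_dta_name(filename: str) -> DtaName:
    """Split an assumed 'root.loscan.hiscan.Z.dta' name into its parts."""
    stem, ext = filename.rsplit(".", 1)
    if ext.lower() != "dta":
        raise ValueError(f"not a .dta file name: {filename!r}")
    # Split from the right so that a root name containing dots survives.
    root, loscan, hiscan, charge = stem.rsplit(".", 3)
    return DtaName(root, int(loscan), int(hiscan), int(charge))


def group_charges(filenames):
    """Collect the proposed charge states for each (root, loscan, hiscan)."""
    groups = defaultdict(list)
    for name in filenames:
        dta = parse_dta_name(name)
        groups[(dta.root, dta.loscan, dta.hiscan)].append(dta.charge)
    return dict(groups)


# Hypothetical file names: one scan with two proposed charges, one with one.
names = ["sample01.1110.1110.2.dta", "sample01.1110.1110.3.dta",
         "sample01.1205.1207.1.dta"]
print(group_charges(names))
```

Each key in the result corresponds to one spectrum record, and each charge in its list corresponds to one header for that record.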
The body of an individual spectrum consists of records as follows:
The whitespace between the two values on each line should be a space or tab character.
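A body record can therefore be read by splitting on whitespace. The following is a minimal sketch; it assumes the two values are an m/z and an intensity, as in a .dta file, and the function names and sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_peak_line(line: str) -> Tuple[float, float]:
    """Parse one body record: two numbers separated by a space or a tab."""
    fields = line.split()  # splits on spaces and tabs alike
    if len(fields) != 2:
        raise ValueError(f"expected two values, got {len(fields)}: {line!r}")
    # Assumption: the first value is m/z and the second is intensity.
    return float(fields[0]), float(fields[1])


def parse_peaks(lines: Iterable[str]) -> List[Tuple[float, float]]:
    """Parse consecutive body records, skipping blank lines."""
    return [parse_peak_line(line) for line in lines if line.strip()]


# Hypothetical records, one space-separated and one tab-separated.
print(parse_peaks(["187.4 12.3", "193.1\t4.5"]))
```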
Each .sqt file corresponds to a collection of .out files. The same information is present in each, but one file holds many identifications instead of one file per identification. "sqt" was chosen for the extension as an abbreviation of SEQUEST, though users have determined that the pronunciation "squat" is faster (Q: What did you find in your sample? A: Squat). The file consists of a header and a body. The header is identical to that of an .out file and contains the fields that are invariant across the collection of identifications. The following is an example of the header.
SEQUEST-PVM v.27 (rev. 6), (c) 1993
Molecular Biotechnology, Univ. of Washington, J.Eng/J.Yates
Licensed to Finnigan Corp., A Division of ThermoQuest Corp.
Start Time 00/00/0000, 00:00 PM
# amino acids = 0000, # proteins = 000000, database file
ion series nABY ABCDVWXYZ: 0 1 1 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
display top 5/3, ion % = 0.0, CODE = 1000
(M* +16.000) (ST# -80.000) C=160.139
mass tol = 3.0000, fragment tol = 0.0, AVG/AVG
The body of a .sqt file consists of S, M, and L lines. The S line gives basic information on each spectrum. Each M line that follows a particular S line describes one sequence that SEQUEST found to match the spectrum. The L lines following an M line name (and optionally describe) the loci that contain the M line's sequence. Each group of S, M, and L lines is packaged back-to-back with other such groups. The format for each of these lines is tab-delimited:
S	1110	1110	3	147	shamu34	3087.3400	5715.4	232.7	1309031
M	1	1	3087.4406	0.0000	6.4682	1850.7	44	108	T.VIVLNHPGQISAGYSPVLDCHTAHIACR.F	U
L	YBR118W
L	YPR080W
M	2	117	3086.6493	0.5173	3.1221	284.2	23	108	Y.M*AAMLGSFPS#AAIFFGTYEYTKRTMIED.W	U
L	YMR166C
M	3	117	3086.6493	0.5222	3.0905	284.2	23	108	Y.MAAM*LGSFPS#AAIFFGTYEYTKRTMIED.W	U
L	YMR166C
M	4	187	3086.6356	0.5353	3.0056	265.3	23	100	L.LVNNTDIQLLAKEPSYKKMREKFATM*.G	U
L	YNL258C
M	5	198	3086.8583	0.5485	2.9203	263.9	24	108	V.INLQES#ILAGIRLVPVLPILLDYVLYNI.S	U
L	YLR235C
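The nesting of S, M, and L lines can be sketched in code as a small state machine that attaches each M line to the most recent S line and each L line to the most recent M line. This is a minimal illustration rather than a complete reader: the function name and the dictionary layout are hypothetical, fields are kept as raw strings without interpretation, and lines that do not start with S, M, or L (such as the header) are simply skipped.

```python
def parse_sqt_body(lines):
    """Group tab-delimited S, M, and L lines into nested records."""
    spectra = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        kind, *fields = line.split("\t")
        if kind == "S":
            spectra.append({"fields": fields, "matches": []})
        elif kind == "M":
            if not spectra:
                raise ValueError("M line before any S line")
            spectra[-1]["matches"].append({"fields": fields, "loci": []})
        elif kind == "L":
            if not spectra or not spectra[-1]["matches"]:
                raise ValueError("L line before any M line")
            # fields[0] is the locus name; an optional description may follow.
            spectra[-1]["matches"][-1]["loci"].append(fields)
        # Any other line (for example, header text) is ignored in this sketch.
    return spectra


# The first two matches from the example above, with explicit tabs.
sample = [
    "S\t1110\t1110\t3\t147\tshamu34\t3087.3400\t5715.4\t232.7\t1309031",
    "M\t1\t1\t3087.4406\t0.0000\t6.4682\t1850.7\t44\t108"
    "\tT.VIVLNHPGQISAGYSPVLDCHTAHIACR.F\tU",
    "L\tYBR118W",
    "L\tYPR080W",
    "M\t2\t117\t3086.6493\t0.5173\t3.1221\t284.2\t23\t108"
    "\tY.M*AAMLGSFPS#AAIFFGTYEYTKRTMIED.W\tU",
    "L\tYMR166C",
]

for spectrum in parse_sqt_body(sample):
    for match in spectrum["matches"]:
        loci = [locus[0] for locus in match["loci"]]
        print(match["fields"][0], loci)
```

Splitting strictly on tabs, rather than on any whitespace, keeps an optional locus description intact even if it contains spaces.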
The fields in an S line include:
The fields in an M line include:
The fields in an L line include: