The high-throughput nature of proteomics mass spectrometry is enabled by a productive combination of data acquisition protocols and the computational tools used to interpret the resulting spectra (list of peaks). One of the key components in mainstream protocols is the generation of tandem mass (MS/MS) spectra by peptide fragmentation. This approach is currently used in the large majority of proteomics experiments to routinely identify hundreds to thousands of proteins from single mass spectrometry runs. In order to do that multiple tools have to be employed: LC-MS/MS to segregate components of proteomic samples associated with protein identification (see Figure 1) softwares, X!Tandem 1 , Mascot or SEQUEST all of which perform protein identification but with different algorithms.
Figure 1 — General overview of the experimental steps and flow of data in protein identification using shotgun proteomics experiment. 2 .
Figure 1 shows a general overview of the experimental steps in protein identification. Sample proteins are first cleaved into peptides, which are then separated using chromatography (e.g. HPLC). The peptides are then ionised and then selected ions are fragmented to produce signature tandem mass spectrometry (MS/MS) spectra. Peptides are identified from the MS/MS spectra by using programs such as X!Tandem, which search against a database of known peptides. Sequences of the identified peptides are used to infer which proteins are present in the original sample.
Liquid Chromatography (LC) is a technique used to separate molecules in the solvent (mobile phase). Nowadays liquid chromatography utilising high pressure to force the solvent through the columns packed with porous particles (stationary phase) is referred as High Pressure Liquid Chromatography (HPLC) – see Figure 2. After purifying the proteomic cell content, LC is able to separate the different proteins which are injected in the mass spectrometer. Each peak from the results will be analysed by mass spectrometry.
Figure 2 — Schema of High Pressure Liquid Chromatography (HPLC) 5 . Compounds of the mixture are separated in the HPLC column according to various parameters (polarity, charge, affinity etc). The type of separation depends on the column being used). A detector flow cell is used to detect the separated compounds band.
Mass spectrometry (MS) – see Figure 3 – has been widely used to analyse biological samples and has evolved into an indispensable tool for proteomics research. Fundamentally, MS measures the mass-to-charge ratio (m/z) of gas-phase ions. Mass spectrometers consist of an ion source that converts analyte molecules into gas-phase ions, a mass analyser that separates ionised analytes on the basis of m/z ratio and a detector that records the number of ions at each m/z value.
Figure 3 — Schema of mass specter. “A mass spectrometer consists of three components: an ion source, a mass analyzer, and a detector. The ionizer converts a portion of the sample into ions. There is a wide variety of ionization techniques, depending on the phase (solid, liquid, gas) of the sample and the efficiency of various ionization mechanisms for the unknown species. An extraction system removes ions from the sample, which are then targeted through the mass analyzer and onto the detector. The differences in masses of the fragments allows the mass analyzer to sort the ions by their mass-to-charge ratio. The detector measures the value of an indicator quantity and thus provides data for calculating the abundances of each ion present.” 6
Tandem mass spectrometry (MS/MS) – see Figure 4 – is a key technique for protein or peptide sequencing and post-translational modifications analysis. Collision-induced dissociation (CID) has been the most widely used MS/MS technique in proteomics research. In this method, gas-phase peptide/protein cations are internally heated by multiple collisions with rare gas atoms. The result is fragmented ions that, after the detection phase and reconstitution, reveal the amino-acid chains.
Figure 4 — Schema of a tandem mass spectrometry associated with liquid chromatography (LC-MS/MS). This technique combine separation form liquid chromatography and m/z ratio analyse of the tandem mass spectrometry in order to generate spectra.
File formats : During a full proteomics analysis, as seen in Figure 5, many files are created. Every step has its own multiple file formats:
Figure 5 — Multiple formats during MS treatment 7 . From the sample processing to the final file, various file formats can be used. This schema shows the different processes and file formats that are generated during a full MS experiment.
For this tutorial we will focus on the “Informatics Analysis” part using the following file formats:
Figure 6 — Sample of a MGF file. MGF files are used for enconding MS results, multiple spectra can be enconded in one MGF. Every spectrum begins with the BEGINS IONS assessment and finishes with END IONS. MGF files can be divided in 2 parts : – The header : containing information about the embedded Search Parameters. – Ions information : the first figure is the ion mass, the second is the ion charge.
X!Tandem is an open source software that can match tandem mass spectra (usually the experiment) with peptide sequences from a database. This process allows identification of proteins in one or more samples.
The X!Tandem search engine calculates a statistical confidence (expectation value) for all of the individual spectrum-to-sequence assignments. Some spectra might map to more than one protein, in this case X!Tandem will pick the best match with the highest confidence score (see Figure 7). The output is a lists all of the high confidence assignments.
Figure 7 — Schema of the X!Tandem analysis 2 . After LC-MS/MS analysis the experimental spectrum is compared to a theoretical spectrum made from a protein database. This comparison leads to a list of peptide match ranked by score.
GALAXY 3 is an open source, web-based platform for biomedical research. GALAXY allows users to perform multi-omics data analyses: genomics, proteomics, transcriptomics and more. Numerous tools are available to achieve various analyses. This tutorial will focus on proteomics identification: matching the experimental data, spectra from LC-MS/MS analysis against data from a protein database using X!Tandem.
Before starting, a quick overview of the GALAXY interface – see Figure 8. The interface is divided into three parts:
Figure 8 — Galaxy interface. Divided in 3 parts Galaxy’s interface go from the left selecting the tools to the right where the results are displayed.
This tutorial describes how to identify a list of proteins from tandem mass spectrometry data.
Analyses of this type are a fundamental part of most proteomics studies. The basic idea is to match tandem MS spectra obtained from a sample with equivalent theoretical spectra against a reference protein database. The process is referred to as “protein identification”, although amino acid sequences are not obtained de novo with this method.
The objective of this tutorial is to perform protein identification from MS/MS spectra data using the X!Tandem tool in GALAXY – see Figure 9. The basic steps involved are:
The tutorial will finish with an exercise where you repeat the same protocol but with your own proteome as the reference database instead of using UniProt.
Figure 9 - General flowchart of this training. The first part is to upload datasets to be investigated. Then the training will cover the key parts of the configuration of a X!Tandem search. Finally, we look at sorting and analysing the results.
The aim of the training will be to create a list of all proteins that can be confidently said to be present in the sample.
This tutorial uses the following open source tools:
This tutorial uses an E. Coli MS/MS spectra dataset “s1-000.RAW.gz” that has already been converted to MGF format for use in the Galaxy. The MGF data file is over 200MB, instead of downloading it to your computer, you can upload this file into Galaxy directly. Copy this link (https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/EColi_K12_MS_Spectra.mgf) and follow the instructions below.
Note If you downloaded the RAW data (which can be obtained from here), you will need to manually convert it to “MGF” before it can be used in Galaxy. To convert the file, follow the following steps: * Download and Install ProteoWizard on your local computer. (Note: version - Windows 64-bit installer(able to convert vendor files except T2D) * Start the ProteoWizard software on your local computer with windows operating system. * Upload the raw file and convert to mgf format
Tip : You can also use the Get Data → Upload file tool to obtain the same result. Here you want to upload your MS/MS spectra.
Warning : X!Tandem only accepts mgf files in GALAXY. Other file formats have to be converted beforehand. A useful tool for that is msconvert 12 (download : here. This tool only working on Windows).
We will first use the UniProt Database as our reference data to search against.
This part of the tutorial is to perform the X!Tandem MS/MS search.
X!Tandem proposes many options, the key options of interest are:
In this tutorial, we are using the following parameters:
Parameters Name | Value | Default Value |
---|---|---|
Uploaded FASTA file | EColi_K12_UniProt_Database | – |
MSMS File | EColi_K12_MS_Spectra.mgf | – |
Variable Modifications | Oxidation M | – |
Fixed Modifications | Carbamidomethyl C | – |
Missed Cleavages Allowed | 2 | 2 |
Enzyme | Trypsin | Trypsin |
Fragment ion tolerance | 0.5 | 0.5 |
Precursor ion tolerance | 10 ppm | 10 ppm |
The output of X!Tandem is an XML format, which is not easy to decipher. In order to get a more readable file, we will convert the XML format to a table. This is a two step process:
After the X!Tandem search we obtain a list of proteins present in the sample data from Step 1:
Tabular name | Tandem file XML designation | Definition |
---|---|---|
Protein | label | Protein name according to the database used for the MS/MS search |
Peptide | seq | Peptide sequence |
Assumed_charge | z | Parent ion mass (plus a proton) from the spectrum |
Calc_neutral_pep_mass | mh (+mass of a proton) | Parent ion mass calculated from the spectrum |
Neutral_mass | mh (+mass of a proton) | Calculated peptide mass (plus a proton) |
Retention_time | rt | Length of time between injection and position of the target compound peak. 10 |
Start_scan | id | id of the group treated (where the analysis starts) |
End_scan | id | id of the domain treated (usually the same as start_scan, no consideration of the doted name in the original XML file) |
Search_engine | — | Name of the search engine used, in our case X!Tandem (associated with the scoring method : k-score) |
Raw_score | expect | Expectation value for the top ranked protein identified with this spectrum |
Note: You can find all the details on the X!Tandem output file here: The file format for X! series search engines. The tandem XML file contains more information than the tabular format which can also be of further use (e.i. the score for the different ions x, y, z, a, b, c …)