MassExplorer: A Tool to Help Guide Analysis of Mass Spectrometry Samples

Vishnu Shankar

MassExplorer is an application that helps identify differences in the abundances of detected molecular species between two groups of mass spectrometry data. It is uniquely suited for processing Desorption Electrospray Ionization Mass Spectrometry Imaging (DESI-MSI) data.

What does this application do?

In the last five years, several projects have leveraged the sensitivity of desorption electrospray ionization mass spectrometry (DESI-MS) and other mass spectrometry-based methods to detect metabolites and other molecular species in a sample, and then used statistical learning methods to find important peaks that correlate with the disease state. Fig. 1 summarizes this methodology, which consists of sample collection, measurement of spectra, data processing, and statistical learning.

Given the wide applicability of this methodology, we developed MassExplorer with two aims:

• Automate repetitive statistical pre-processing
• Guide measurements by providing statistical insights and visualizations that are accessible to biologists and chemists

MassExplorer consists of six modules, which are summarized below and also described here.

1. Input Module: The user can input multiple files corresponding to a “healthy” set and a “disease” set. The application accepts files in either typical mass spectra distribution or in imaging format (as shown in the application). Additionally, the user can specify three settings: an internal standard peak, which is used to normalize the intensities; a tolerance, which determines how much a peak can vary between samples and still be considered the same peak; and a threshold, which allows the user to exclude metabolites that are detected in only a few samples.

2. Pairwise Comparison: The intensities are processed and normalized according to the settings chosen in the first panel. One can then visualize the processed data by selecting a spectrum of interest from each set. The user can also specify the range of interest and look up the peaks in a table, which can be searched and sorted by intensity.

3. Get Training Matrix: This module shows the average overlaid spectra between groups. Additionally, the user can download the formatted data to a .csv file, where the headers consist of the processed peaks and the rest of the table is populated by the intensities. The downloaded .csv file also includes the file number for normal file formats, or Scan, X, Y, Patient ID, and Disease.State for imaging formats.

4. Find Specific Differences: Both datasets are processed according to the user-specified tolerance in the “Input Module.” To determine the aggregate statistically significant differences between the two sets, significance analysis of microarrays (SAM) (PNAS, 98:5116–5121, April 2001) is used to calculate a modified t-statistic (i.e. q-statistic) in each cluster. The plot shows the fold change (log2) between sets for clusters with statistically significant differences, where the calculated q-statistic accounts for the false discovery rate (FDR). SAM accounts for FDR by assigning a score to each detected analyte and then using permutations to estimate the percentage of analytes that are identified by chance. SAM is preferable to other procedures for calculating statistical significance because it does not assume that the data are normally distributed and works equally well at both small and large sample sizes. The user can also download the SAM plots and outputs to their local machine.

5. Build Model: To find the most informative peaks that distinguish the two groups, a binary logistic classifier is built using the LASSO (Roy Stat Soc Ser B 1996;58: 267–88). If the number of samples is too small, 10-fold cross-validation is used to train the model and select the tuning parameter (how much shrinkage is imposed on the selected coefficients). Otherwise, the model is trained via leave-one-patient-out cross-validation. This module includes plots showing the cross-validation performance, the selected peaks, and the model performance.

6. Visualize Peak Variability: To further understand the peaks that can distinguish groups, one can type in any peak and visualize the distribution of a metabolite between both groups via a boxplot. Based on the entered peak, this module also helps visualize the spatial distribution of the metabolite.
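To make the pre-processing settings of the Input Module more concrete, the following is a minimal Python sketch of how an internal standard, a tolerance, and a detection threshold could interact. It is not MassExplorer's own code: the function names (normalize, build_matrix), the greedy clustering rule, and the choice to keep the largest intensity per cluster are all assumptions made for illustration, since the exact rules used by the application are not described here.

```python
def normalize(spectrum, standard_mz, tol):
    """Divide every intensity by the internal standard's intensity.

    spectrum is a list of (m/z, intensity) pairs. Assumption: the standard
    is the most intense peak within `tol` of `standard_mz`.
    """
    candidates = [inten for mz, inten in spectrum if abs(mz - standard_mz) <= tol]
    if not candidates:
        raise ValueError("internal standard peak not found")
    ref = max(candidates)
    return [(mz, inten / ref) for mz, inten in spectrum]


def build_matrix(spectra, tol, min_fraction):
    """Align peaks across samples and drop rarely detected ones.

    Peaks from all samples are sorted by m/z; a peak joins the current
    cluster if it lies within `tol` of the cluster's first (lowest) m/z.
    A cluster is kept only if it is detected in at least `min_fraction`
    of the samples. Returns (headers, rows): one header per kept cluster
    (its mean m/z) and one row of intensities per sample.
    """
    peaks = sorted((mz, i, inten) for i, s in enumerate(spectra) for mz, inten in s)
    clusters = []
    for peak in peaks:
        if clusters and peak[0] - clusters[-1][0][0] <= tol:
            clusters[-1].append(peak)
        else:
            clusters.append([peak])

    headers, columns = [], []
    for cluster in clusters:
        column = [0.0] * len(spectra)
        for _, i, inten in cluster:
            column[i] = max(column[i], inten)  # keep the largest peak per sample
        detected = sum(1 for v in column if v > 0)
        if detected / len(spectra) >= min_fraction:
            headers.append(round(sum(p[0] for p in cluster) / len(cluster), 4))
            columns.append(column)
    rows = [list(r) for r in zip(*columns)]
    return headers, rows


# Three made-up spectra; the peak near m/z 100 plays the internal standard.
raw = [
    [(100.00, 50.0), (200.00, 10.0), (300.00, 5.0)],
    [(100.01, 40.0), (200.02, 20.0)],
    [(99.99, 20.0), (200.01, 5.0), (450.00, 1.0)],
]
normalized = [normalize(s, standard_mz=100.0, tol=0.05) for s in raw]
headers, rows = build_matrix(normalized, tol=0.05, min_fraction=0.5)
print(headers)
for row in rows:
    print(row)
```

In this toy example the peaks near m/z 300 and 450 each appear in only one of three samples, so a threshold of 0.5 removes them, while the peaks near m/z 100 and 200 are merged across samples because they differ by less than the tolerance.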
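The permutation idea behind SAM, as used in the Find Specific Differences module, can be sketched in a few lines of Python. This is a simplified illustration and not the full SAM procedure or the application's code: the fudge factor s0 is fixed by hand, a single cutoff on the statistic replaces SAM's delta tuning, and the function names (d_statistics, estimate_fdr) are hypothetical. Each feature receives a modified t-statistic, the group labels are then shuffled many times, and the median number of features that pass the cutoff under shuffled labels serves as an estimate of how many calls arise by chance.

```python
import random
from statistics import mean, median


def d_statistics(group_a, group_b, s0=0.1):
    """Modified t-statistic per feature: mean difference / (pooled SE + s0).

    group_a and group_b are lists of samples; each sample is a list of
    intensities with one entry per feature.
    """
    n_a, n_b = len(group_a), len(group_b)
    stats = []
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        mean_a, mean_b = mean(col_a), mean(col_b)
        ss = sum((x - mean_a) ** 2 for x in col_a) + sum((x - mean_b) ** 2 for x in col_b)
        pooled_se = ((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2)) ** 0.5
        stats.append((mean_b - mean_a) / (pooled_se + s0))
    return stats


def estimate_fdr(group_a, group_b, cutoff, n_perm=200, s0=0.1, seed=0):
    """Return (indices of called features, permutation-based FDR estimate)."""
    rng = random.Random(seed)
    observed = d_statistics(group_a, group_b, s0)
    called = [j for j, d in enumerate(observed) if abs(d) >= cutoff]

    rows = list(group_a) + list(group_b)
    false_calls = []
    for _ in range(n_perm):
        shuffled = rows[:]
        rng.shuffle(shuffled)  # break the link between samples and labels
        perm = d_statistics(shuffled[:len(group_a)], shuffled[len(group_a):], s0)
        false_calls.append(sum(1 for d in perm if abs(d) >= cutoff))

    fdr = min(1.0, median(false_calls) / len(called)) if called else 0.0
    return called, fdr


# Made-up data: 4 healthy and 4 disease samples, 3 features each.
# Only the first feature differs between the groups.
healthy = [[1.0, 5.0, 2.0], [1.2, 5.1, 2.1], [0.9, 4.9, 1.9], [1.1, 5.0, 2.0]]
disease = [[3.0, 5.0, 2.1], [3.2, 5.1, 1.9], [2.9, 4.9, 2.0], [3.1, 5.0, 2.0]]

called, fdr = estimate_fdr(healthy, disease, cutoff=5.0)
print("called features:", called)
print("estimated FDR:", fdr)
```

Adding s0 to the denominator keeps features with very small variance from producing huge statistics, which is the role of the corresponding constant in SAM.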
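The leave-one-patient-out cross-validation mentioned in the Build Model module differs from ordinary leave-one-out in that all rows belonging to one patient are held out together. With imaging data, one patient can contribute many rows (for example, one per scan), so holding out single rows would leave closely related rows from the same patient in the training set. The Python sketch below shows only how such splits could be generated; the function name leave_one_patient_out is hypothetical, and the model fitting itself (the LASSO) is not shown.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices) per patient.

    patient_ids has one entry per row of the training matrix.
    """
    for patient in sorted(set(patient_ids)):
        test = [i for i, p in enumerate(patient_ids) if p == patient]
        train = [i for i, p in enumerate(patient_ids) if p != patient]
        yield patient, train, test


# Made-up patient labels for six rows of a training matrix.
ids = ["P1", "P1", "P2", "P3", "P3", "P3"]
for patient, train, test in leave_one_patient_out(ids):
    print(patient, "train:", train, "test:", test)
```

Each patient appears in the test set exactly once, so the number of folds equals the number of distinct patients.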

How can the application be accessed?

  1. Online: The application is free to use online. As the online version is hosted on the Shiny server, it is limited to handling small datasets (<100 MB).
  2. Executable: The Mac-OSX executable can be found at this link. This application can be downloaded to your local machine.

How do you use this application?

The most recent user guide to the application (updated 07/10/20) can be downloaded in .pdf form here.

A brief tutorial video is also included below to help familiarize the user with the tool’s features.

Do you have any sample files to test the application?

Some sample files and the corresponding output are included. Note that there is sample data and output for both the website and executable version.

What are the best practices to use this application?

See pages 3 and 4 of the README guide, attached here.

Can I access the source code?

The source code for the Shiny app can be accessed here and is also available on Github. Please contact Vishnu if you would like to help develop the application further.

Who do I contact if I need help?

Contact Vishnu Shankar (vishnus1@cs.stanford.edu) if you have any questions or need further help.