View on GitHub

MSFragger

Ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics

Running MSstats using MSstats.csv from IMQuant

Using the .d folders (raw timsTOF data) as input according to the FragPipe tutorial, IMQuant can generate a MSstats compatible file MSstats.csv.
Given an experimental setup that looks like this in the ‘Select LC/MS Files’ tab of FragPipe:

The MSstats.csv file output from IMQuant (via FragPipe) will look something like this:

This MSstats.csv file can be read by MSstats without any conversion. The R command for installing MSstats is:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstats")

Below is an example R script that reads and analyzes MSstats.csv using MSstats (make sure to set rootDir to the directory containing your MSstats.csv file):

rm(list = ls())
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(readr)))
suppressWarnings(suppressMessages(library(MSstats)))

rootDir <- "folder_with_MSstats.csv" # Specify the path of the directory containing MSstats.csv.
print(str_c("Using IMQuant's result from ", rootDir))

# Read MSstats.csv file.
raw <- read_csv(str_c(rootDir, "MSstats.csv"), na = c("", "NA", "0"))
raw$ProteinName <- factor(raw$ProteinName)
raw$PeptideSequence <- factor(raw$PeptideSequence)

# Change root directory for MSstats
print(str_c("Root DIR: ", rootDir))
setwd(rootDir)

# Processing the data using MSstats
processedData <- dataProcess(raw, logTrans = 10)

# Downstream analysis...

Details of the functions and options can be found here.

With processedData, users can perform various downstream analyses described in the MSstats user manual.

Convert processedData into a protein intensity table (optional)

Although processedData generated by MSstats is suitable for data analysis in R, it is not reader-friendly. Here we provide a R script that converts this data structure into a protein intensity table after running MSstats (note that the Run names and rootDir should be changed):

rm(list = ls())
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(readr)))
suppressWarnings(suppressMessages(library(MSstats)))
suppressWarnings(suppressMessages(library(matrixStats)))

rootDir <- "folder_with_MSstats.csv" # Specify the path of the directory containing MSstats.csv.
print(str_c("Using IMQuant's result from ", rootDir))

# Read MSstats.csv file.
raw <- read_csv(str_c(rootDir, "MSstats.csv"), na = c("", "NA", "0"))
raw$ProteinName <- factor(raw$ProteinName)
raw$PeptideSequence <- factor(raw$PeptideSequence)

# Change root directory for MSstats
print(str_c("Root DIR: ", rootDir))
setwd(rootDir)

# Processing the data using MSstats
processedData <- dataProcess(raw, logTrans = 10)

# Only need run level data in generating protein intensity table.
runLevelData <- processedData$RunlevelData

# Prepare a protein intensity table. The Run number and names need to be changed according to your experiments.
# In this example, we have four runs named "Run1", "Run2", "Run3", and "Run4". If there are more runs with different names, you need to increase the column of proteinIntensityTable data frame accordingly.

proteinIntensityTable <- data.frame(Protein = unique(runLevelData$Protein), Run1 = rep(NA, length(unique(runLevelData$Protein))), Run2 = rep(NA, length(unique(runLevelData$Protein))), Run3 = rep(NA, length(unique(runLevelData$Protein))), Run4 = rep(NA, length(unique(runLevelData$Protein))))
for (protein in runLevelData$Protein) {
  if (length(runLevelData[runLevelData$RUN == 1 & runLevelData$Protein == protein,]$LogIntensities) > 0) {
    proteinIntensityTable[proteinIntensityTable$Protein == protein,]$Run1 <- runLevelData[runLevelData$RUN == 1 & runLevelData$Protein == protein,]$LogIntensities
  }
  if (length(runLevelData[runLevelData$RUN == 2 & runLevelData$Protein == protein,]$LogIntensities) > 0) {
    proteinIntensityTable[proteinIntensityTable$Protein == protein,]$Run2 <- runLevelData[runLevelData$RUN == 2 & runLevelData$Protein == protein,]$LogIntensities
  }
  if (length(runLevelData[runLevelData$RUN == 3 & runLevelData$Protein == protein,]$LogIntensities) > 0) {
    proteinIntensityTable[proteinIntensityTable$Protein == protein,]$Run3 <- runLevelData[runLevelData$RUN == 3 & runLevelData$Protein == protein,]$LogIntensities
  }
  if (length(runLevelData[runLevelData$RUN == 4 & runLevelData$Protein == protein,]$LogIntensities) > 0) {
    proteinIntensityTable[proteinIntensityTable$Protein == protein,]$Run4 <- runLevelData[runLevelData$RUN == 4 & runLevelData$Protein == protein,]$LogIntensities
  }
}

write_csv(proteinIntensityTable, str_c(rootDir, "MSstats_protein.csv"), na = "")

The protein intensity table (MSstats_protein.csv) will then look like the following:

Protein Run1 Run2 Run3 Run4
A0A024RBG1 2.268105 2.194517 2.199005 2.251258
A0A0B4J2D5 3.015404 2.995607 3.01179 2.984934
A0A0B4J2F0 3.068368 3.069178   3.005508
A0A0B4J2F2 2.206007 2.215123 2.257591 2.185901
A0A0U1RRE5 2.26019 2.204316 2.305125 2.288413
A0A0U1RRL7 2.080483 2.135092 2.115185  
A0A140G945 3.064431   3.048989  
A0AV96 2.540166 2.579872 2.566627 2.606213
A0AVF1 2.435175 2.507841 2.335089 2.35696
A0AVT1 2.59594 2.590422 2.586453 2.58392
A0FGR8 2.685545 2.636278 2.696507 2.720165
A0JLT2 2.173107 2.17204 2.197647 2.164165
A0JNW5 3.07027 2.975922 3.023312 3.014954
A0MZ66 2.868374 2.818923 2.861887 2.846367
A0PJW6 2.553568 2.534552 2.685835 2.567018
A1L0T0 2.751461 2.737689 2.780147 2.776778
A1L170 2.395757 2.386972 2.39856 2.447601
A1X283 2.561679 2.536692 2.57959 2.587461
A2RRP1 2.462482 2.431679 2.47807 2.450703
A2RTX5   2.257712   2.174976
A2RU67     3.001728 2.402563
A2VDF0 2.207056 2.295692 2.374141 2.262893
A3KMH1 2.446569 2.461708 2.471607 2.509326
A3KN83 2.62989 2.644234 2.6331 2.58684
A4D1E9 2.58176 2.537538 2.564693 2.571433
A4D1P6     2.293137 2.237853
A5A3E0 2.407162 2.3797 2.396964 2.407677
A5D8V6 2.334407 2.261135 2.269774 2.258641
A5PLN9 2.55918 2.881768 2.65032 3.034141
A5YKK6 2.62894 2.654115 2.620041 2.628946