A New Approach for Identification and Quantification of Intact Proteoforms
The critical protein actors in biological systems are the intact proteoforms: the different forms of proteins produced from the genome in a variety of splice forms and adorned with a myriad of post-translational modifications that dramatically affect their function (see Figure 1). Lamentably, today’s bottom-up proteomic technologies, which identify and quantify peptides derived from proteins rather than the proteins themselves, fail to deliver crucial protein-level information to biologists. We present a new vision for discerning the identities and abundances of proteoforms in biological samples of interest.
Our strategy integrates state-of-the-art proteomic and genomic technologies. A new generation of high resolution mass spectrometers and a novel isotopic tagging strategy will yield for each proteoform its accurate mass, the number of lysines it contains, and its relative abundance. Parallel RNA-Seq analysis on the same sample will provide transcriptomic data to construct a tightly focused sample-specific proteoform database. Comparison of the mass spectrometric data with the proteoform database will reveal the proteoform identities present in the sample, and the isotopic tagging strategy will provide relative abundances. We present here the concepts involved in this overall approach along with some preliminary data for aspects of the strategy.
Figure 1. “Proteoform” defined. A proteoform is a particular form of a protein that is defined by its exact amino acid sequence combined with any post-translational modifications on that sequence. Thus, each “protein” arising from a gene may exist in numerous different proteoforms. Figure from 
Overview of Strategy
The crux of this overall approach lies in the ability to identify proteoforms by producing two synergistic sets of data: (A) a customized sample-specific proteoform database enumerating the proteoforms that could be present in the sample (candidate proteoforms table); and (B) the experimental proteomics data giving the accurate intact masses and lysine count for all detectable proteins in the sample (experimental mass table). Subsequent matching of each experimental mass plus its lysine count to an entry in the candidate table reveals the identities of the proteoforms actually present in the sample. The overall strategy to generate these experimental and candidate proteoform tables is outlined in Figure 2 and explained in detail below.
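The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the `Candidate` record, the `match_proteoforms` function, the 10 ppm mass tolerance, and all masses in the toy table are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str      # hypothetical label for a candidate proteoform
    mass: float    # theoretical intact mass in Da
    lysines: int   # number of lysines in the sequence


def match_proteoforms(observed_mass, observed_lysines, candidates, tol_ppm=10.0):
    """Return candidates with the same lysine count and a mass within tol_ppm."""
    hits = []
    for c in candidates:
        # The lysine count acts as an exact filter before the mass comparison.
        if c.lysines != observed_lysines:
            continue
        ppm_error = abs(observed_mass - c.mass) / c.mass * 1e6
        if ppm_error <= tol_ppm:
            hits.append(c)
    return hits


# Toy candidate table (all values are invented for illustration).
candidates = [
    Candidate("protA", 11234.560, 7),
    Candidate("protA+acetyl", 11276.571, 7),
    Candidate("protB", 11234.600, 9),
]

hits = match_proteoforms(11234.58, 7, candidates)
print([c.name for c in hits])
```

In this toy table, "protA" and "protB" are indistinguishable by mass alone at a 10 ppm tolerance; the lysine count resolves the ambiguity, which is the reason for measuring both quantities.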
Figure 2. Cultured cells or even whole animals are grown with two different forms of the amino acid lysine called “NeuCode light” and “NeuCode heavy.” Both protein and RNA are extracted from these samples. The RNA undergoes RNA-Seq analysis to reveal the transcripts present, which are then converted to a list of protein sequences that can actually exist in these samples. Decoration of these protein sequences with PTMs produces a list of candidate proteoforms. The proteins undergo two dimensions of separation followed by mass spectrometric analysis to generate a table of experimental intact protein masses and associated lysine counts. Matching the experimental data to this sample-specific database identifies the proteoforms present.
Construction of Sample-Specific Proteoform Databases
Possible AA sequences + Possible PTMs = Candidate Proteoform Database. A proteoform, by definition, is a unique combination of primary amino acid sequence and post-translational modifications (see Figure 1). Accordingly, to populate the proteoform database, we first identify the primary amino acid sequences present in the sample using sample-specific RNA-Seq data. Next, we identify the subset of post-translational modifications present in the sample using prior studies and/or bottom-up proteomic analysis of the sample itself. Finally, we assemble the library of candidate proteoforms generated from these lists of amino acid sequences and PTMs.
An ideal proteoform database contains only those proteoforms present in the sample. Missing or incorrect entries lead to false negative or false positive identifications and thus diminish the quality of the results. The key limitation of intact protein analysis has been the combinatorial explosion in the number of possible proteoforms. As an example of the magnitude of this problem, consider the 20,265 reviewed human protein entries in UniProt, which have 68,898 known amino acid variants listed. We expanded the number of combinatorial possibilities for each protein based on its own list of possible variants, and then summed them across all proteins to yield >10^69 amino acid sequences. Clearly, numbers of this magnitude are not consistent with database utility. Our strategy of employing RNA-Seq data to limit the database to the primary amino acid sequences actually contained in the sample reduces this number to 2 × 10^4, a breathtaking reduction in complexity that is critical to feasibility of the approach.
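To give a feel for how quickly the variant combinations grow, the following sketch counts sequences under a simplifying assumption that is ours, not a description of the actual calculation: each listed variant of a protein is treated as an independent present/absent choice, so a protein with v variants contributes 2^v sequences. The function name and the toy variant counts are hypothetical.

```python
def count_sequences(variant_counts):
    """Sum 2**v over proteins, where v is the number of listed variants.

    Assumes every variant is an independent on/off choice (a simplification).
    """
    return sum(2 ** v for v in variant_counts)


# Toy input: four proteins with 0, 3, 10 and 230 listed variants.
toy_variant_counts = [0, 3, 10, 230]
total = count_sequences(toy_variant_counts)
print(f"{total:.2e}")
```

Under this assumption the total is dominated by the few proteins with the most variants: a single protein with 230 variants already contributes more than 10^69 sequences.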
Identification of correct primary protein amino acid sequences. Our process for construction of sample-specific protein sequence databases from RNA-Seq data was described in two recent papers[2, 3]. Limiting the protein sequences included in the database to those where the corresponding mRNAs have reasonably high expression levels is an important strategy for reducing the potential for false positive identification of a proteoform. We recently performed deep RNA sequencing (300 million reads) and deep-coverage bottom-up proteomic analysis (32 fractions yielding nearly 8000 identified proteins) of Jurkat cell lysate. In total, 2 × 10^4 transcript sequences were observed, but the majority of observable proteins arose from genes expressed at levels >10 TPM (transcripts-per-million), as shown in Figure 3. Accordingly, we employ a 10 TPM cut-off to further reduce the number of primary sequences, down to 1 × 10^4, for inclusion in custom databases.
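The expression cut-off amounts to a simple filter on the transcript table. The sketch below assumes the RNA-Seq quantification is available as a mapping from transcript identifier to TPM; the identifiers and values are invented, and whether a transcript at exactly 10 TPM is kept is our assumption.

```python
def filter_by_tpm(transcript_tpm, cutoff=10.0):
    """Keep transcripts whose expression is at or above the TPM cut-off."""
    return {tid: tpm for tid, tpm in transcript_tpm.items() if tpm >= cutoff}


# Hypothetical quantification results (transcript id -> TPM).
transcript_tpm = {"tx1": 0.5, "tx2": 10.0, "tx3": 152.3, "tx4": 9.9}

kept = filter_by_tpm(transcript_tpm)
print(sorted(kept))
```

Only the sequences translated from the retained transcripts would then be carried forward into the candidate database.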
Identification of the subset of PTMs present in the sample. The second step in the construction of sample-specific proteoform databases is the adornment of the primary amino acid sequences with PTMs. This step requires careful consideration because allowing too many possibilities increases the number of false matches and also lowers the confidence of correct identifications. While it is likely that many proteins have at least one post-translational modification and some more than one, recent studies have indicated that the majority of proteoforms identified in a human cell line had 2 or fewer PTMs. We initially limit the number of possible modifications to ≤ 5 PTMs per protein sequence. The list of PTMs is assembled from two sources of information: PTM databases (e.g. Proteoform Repository, UniProt, etc.) and bottom-up proteomics experiments.
Assembly of proteoform databases from amino acid sequences and PTMs. The final proteoform databases are produced by assembly of each primary amino acid sequence with 0 to 5 total PTMs, selected from UniProt and/or bottom-up studies. The output of this assembly step is a table of possible proteoforms in the sample, where each entry contains the following information: 1) identity of the proteoform (amino acid sequence and PTMs), 2) the intact mass (amino acid residue masses plus PTM masses), 3) the number of lysines, and 4) optionally other protein group identifiers such as gene name or UniProt accession number.
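A minimal sketch of this assembly step is given below. It is an illustration under stated assumptions rather than our production code: the function names, the toy sequence, and the PTM site list are hypothetical; masses are monoisotopic; and each known site is assumed to carry at most one modification.

```python
from itertools import combinations

# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per chain for the free termini


def sequence_mass(seq):
    """Unmodified intact mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def enumerate_proteoforms(name, seq, ptm_sites, max_ptms=5):
    """List (label, intact mass, lysine count) for 0..max_ptms PTMs.

    ptm_sites is a list of (position, ptm_name, mass_delta) tuples.
    """
    base = sequence_mass(seq)
    lysines = seq.count("K")
    rows = []
    for k in range(min(max_ptms, len(ptm_sites)) + 1):
        for combo in combinations(ptm_sites, k):
            label = name + "".join(f"+{ptm}@{pos}" for pos, ptm, _ in combo)
            mass = base + sum(delta for _, _, delta in combo)
            rows.append((label, round(mass, 4), lysines))
    return rows


# Hypothetical sequence with three known PTM sites, limited to 2 PTMs here.
sites = [(1, "acetyl", 42.01057), (5, "phospho", 79.96633), (8, "methyl", 14.01565)]
table = enumerate_proteoforms("toy", "MKTAYIAKQR", sites, max_ptms=2)
for row in table:
    print(row)
```

With three sites and at most two PTMs, the toy sequence yields 1 + 3 + 3 = 7 candidate proteoforms, each carrying the same lysine count as the base sequence.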
As described above, employing RNA-Seq data reduces the number of base sequences to 1 × 10^4. For a subset of 2750 sequences corresponding to those proteins with MW below 30 kDa (the cut-off we employ in experiments using GELFREE fractionation of the sample), the resulting sequences were converted into 21,690 possible proteoforms using the known PTMs (from UniProt) and the limit of ≤ 5 PTMs per base sequence. This is clearly a feasible database size for implementation of our proteoform identification strategy.
Figure 3. Number of detected proteins (from bottom-up proteomics) as a function of the RNA expression level in TPM.
Figure 4. Protein molecular weight distribution (UniProt-human).
Obtaining Intact Mass, Lysine Count, and Abundance
Experimental determination of intact mass. Proteins from cell lysate are fractionated according to molecular weight using off-line GELFREE fractionation, an approach recently pioneered by Kelleher and coworkers for the fractionation of complex protein mixtures in top-down proteomics. For this preliminary work, we combined all fractions up to 30 kDa (an approximate upper mass limit for our current mass spectrometer); future work will likely involve separate analysis of each 10 kDa molecular weight fraction. Further separation is accomplished using online nanocapillary reversed-phase chromatography, and mass spectrometry is performed by ESI-MS in a high-resolution (R=100,000) Orbitrap mass spectrometer. An example spectrum is shown in Figure 5. Deconvolution and deisotoping of the resultant spectra yield a list of highly accurate intact proteoform masses.
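The charge-state part of deconvolution rests on a simple relationship between neighbouring peaks in the ESI charge envelope, sketched below. This covers only the charge assignment and neutral-mass calculation for one pair of peaks; real deconvolution software also handles isotopic distributions and noise. The function names and the example mass are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in Da


def charge_from_adjacent(mz_high, mz_low):
    """Infer charge z of the peak at mz_high from its neighbour at charge z + 1.

    For a species of mass M, m/z = M / z + PROTON, so
    z = (mz_low - PROTON) / (mz_high - mz_low).
    """
    return round((mz_low - PROTON) / (mz_high - mz_low))


def neutral_mass(mz, z):
    """Neutral mass from an m/z value and its charge state."""
    return z * (mz - PROTON)


# Simulate two adjacent charge states of a hypothetical 11234.56 Da proteoform.
true_mass = 11234.56
mz_12 = true_mass / 12 + PROTON
mz_13 = true_mass / 13 + PROTON

z = charge_from_adjacent(mz_12, mz_13)
mass = neutral_mass(mz_12, z)
print(z, round(mass, 2))
```

Averaging the neutral masses obtained from every charge state in the envelope is one straightforward way to collapse the distribution into a single intact mass.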
Experimental determination of lysine count. To increase the information available for proteoform identification, we also determine the number of lysines present in each proteoform. This is accomplished using the NeuCode isotopic labeling method. NeuCode (Neutron enCoding) is a variation of SILAC (Stable Isotope Labeling by Amino acids in Cell culture) recently developed by our colleague Joshua Coon. In its simplest form, it employs two isotopologues of lysine whose masses differ by 0.036 Daltons (see right panel of Figure 5). These can be employed to label two different samples, permitting relative quantification of peptides. Most tryptic peptides have only a single lysine, and thus the mass shift between a pair of NeuCode-labeled peptides generally corresponds to only 0.036 Daltons. In contrast, intact proteins will have a variable number of lysines, resulting in mass differences of n × 0.036 Da, where n is the number of lysines encoded by the protein. The high-accuracy determination of the mass difference between a pair of NeuCode-labeled proteoforms reveals the number of lysines present. We show in Figure 7 a mass spectrum from a pair of NeuCode-labeled proteoforms demonstrating the observed mass shift corresponding to the number of lysines in that particular proteoform.
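The lysine count follows directly from the spacing of the NeuCode doublet. The sketch below converts an m/z spacing to a mass shift using the charge state and divides by 0.036 Da per lysine. The function name is hypothetical; the 0.0203 m/z spacing is the value quoted in the Figure 7 caption, and the +12 charge state is our assumption, taken from the charge state expanded in Figure 5.

```python
NEUCODE_SPACING_DA = 0.036  # mass difference per lysine between isotopologues


def lysine_count(delta_mz, charge, spacing_da=NEUCODE_SPACING_DA):
    """Return (rounded lysine count, unrounded estimate) for a NeuCode doublet."""
    mass_shift = delta_mz * charge   # convert m/z spacing to a mass shift in Da
    raw = mass_shift / spacing_da    # n = mass shift / 0.036 Da
    return round(raw), raw


# Spacing from the Figure 7 caption; the charge state is assumed to be +12.
n, raw = lysine_count(0.0203, 12)
print(n, round(raw, 2))
```

The unrounded estimate is returned alongside the integer count because its distance from the nearest integer gives a quick indication of how well the measured spacing supports the assignment.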
Figure 5. Intact Protein Mass Spectrum from NeuCode-labeled Rat INS-1 Cells. Experimental mass spectrum of an intact proteoform showing the distribution of charge states (left), an expanded view of the +12 charge state showing the isotopic distribution (middle), and a further expanded view of a pair of NeuCode isotopologue peaks (right).
Figure 6. The structures of the two NeuCode isotopologues.
Figure 7. Expanded View of a Pair of NeuCode Mass Spectral Peaks. The NeuCode “light” and “heavy” peaks differ by 0.0203 m/z units, which when multiplied by the charge state and divided by 0.036 Da (spacing per lysine) reveals that this proteoform contains 7 lysines. In addition, the relative intensities of the NeuCode “light” and “heavy” peaks reveal the relative amount of this proteoform in the two samples that were mixed together; a ratio of 4.7:1 was observed in this case. This figure clearly demonstrates the information obtainable from NeuCode labeling and high-resolution mass spectrometry. Future work will utilize software to deconvolute and deisotope the spectra to collapse all of the charge states and isotopic peaks into a single pair of NeuCode peaks, to obtain an accurate intact mass, a lysine count, and relative quantification.
Experimental determination of relative abundance. The identification of peptides or proteins in proteomics has little utility in the absence of quantitative information on their levels. Changes in protein concentration in response to conditions often provide critical clues to biological function. Therefore, it is essential that a useful proteomics strategy provide quantitative information on proteoform abundance. The NeuCode strategy, in addition to providing lysine count, also yields a quantitative measure of relative abundance, as illustrated in Figure 7.
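One simple way to turn NeuCode doublets into a relative abundance is to sum the paired intensities over all observed charge states and take their ratio. The sketch below is an assumption-laden illustration: the function name, the intensity values, and the choice of light over heavy as the ratio direction are ours.

```python
import math


def relative_abundance(pairs):
    """Return (ratio, log2 ratio) of summed light to summed heavy intensities.

    pairs is a list of (light_intensity, heavy_intensity) tuples, one per
    NeuCode doublet (for example, one per charge state).
    """
    light = sum(l for l, _ in pairs)
    heavy = sum(h for _, h in pairs)
    ratio = light / heavy
    return ratio, math.log2(ratio)


# Hypothetical doublet intensities from three charge states.
pairs = [(4.7e5, 1.0e5), (9.4e5, 2.0e5), (2.35e5, 0.5e5)]

ratio, log2_ratio = relative_abundance(pairs)
print(round(ratio, 2), round(log2_ratio, 2))
```

Summing before dividing weights each doublet by its signal, so low-intensity charge states contribute less noise than they would in an average of per-doublet ratios.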
Summary and Larger Context for the Project
This is a special moment in time. We are in the midst of a technological revolution of unprecedented scope. Next-generation sequencing platforms allow rapid, inexpensive and comprehensive transcriptomic analyses, and new mass spectrometers of ever-increasing sensitivity can rapidly determine the accurate mass of intact proteins. This technological convergence opens the possibility of dramatically changing the fundamental paradigm of proteomic analyses. We propose to integrate these state-of-the-art genomics and proteomics capabilities into a new two-pronged strategy, in which custom sample-specific proteoform databases will be created and used to identify proteoforms from mass spectrometric data. Employing such an informed database will avoid the combinatorial explosion of possible proteoforms that has severely hindered identification strategies for intact proteins (e.g. top-down proteomics).
The transition we propose is similar to the metamorphosis in genomics that occurred with SNP analyses, where the early stages of the genome project focused on “SNP discovery”, while once the SNP databases were near completion, the field turned to “SNP scoring”, a much less expensive and more straightforward process. In proteomics, once proteoform databases are near completion, in that the hard work has been done to identify what proteoforms actually exist in nature, it will no longer be necessary to engage in laborious and complex full-fledged analyses of proteins involving MS/MS fragmentation. Rather, a simple isotopic tagging procedure and accurate mass determination will be sufficient to provide accurate quantitative information on proteoform identities and abundances.
The ability to determine the extent and nature of protein variation is a critical missing piece in proteomics today. A surprise revealed by the success of the human genome project was the much lower than anticipated number of genes present in humans, in the range of ~20,000 rather than the predicted ~100,000. This fact has led to the general recognition that much of the complexity and sophistication afforded by our biological machinery is at the level of protein variation rather than just resulting from a large number of distinct genes. These protein variations occur on at least three levels: alternative splicing of the RNA transcript, codon substitutions, and a wide variety of post-translational modifications (PTMs). Determining the identities and abundances of these proteoforms is vital to understanding normal and disease biology. The variations in proteoforms play central roles in a wide variety of biological processes, from cell signaling and signal transduction to gene regulation.
References
1. Smith, L.M. and Kelleher, N.L., "Proteoform: a single term describing protein complexity." Nature Methods, 2013, 10(3), 186-187.
2. Sheynkman, G.M.; Shortreed, M.R.; Frey, B.L.; and Smith, L.M., "Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq." Molecular & Cellular Proteomics, 2013, 12(8), 2341-2353.
3. Sheynkman, G.M.; Shortreed, M.R.; Frey, B.L.; Scalf, M.; and Smith, L.M., "Large-scale mass spectrometric detection of variant peptides resulting from non-synonymous nucleotide differences." Submitted to Journal of Proteome Research.
4. Tran, J.C.; Zamdborg, L.; Ahlf, D.R.; Lee, J.E.; Catherman, A.D.; Durbin, K.R.; Tipton, J.D.; Vellaichamy, A.; Kellie, J.F.; Li, M.X.; Wu, C.; Sweet, S.M.M.; Early, B.P.; Siuti, N.; LeDuc, R.D.; Compton, P.D.; Thomas, P.M.; and Kelleher, N.L., "Mapping intact protein isoforms in discovery mode using top-down proteomics." Nature, 2011, 480, 254-258.
5. Tran, J.C. and Doucette, A.A., "Gel-eluted liquid fraction entrapment electrophoresis: An electrophoretic method for broad molecular weight range proteome separation." Analytical Chemistry, 2008, 80(5), 1568-1573.
6. Hebert, A.S.; Merrill, A.E.; Bailey, D.J.; Still, A.J.; Westphall, M.S.; Strieter, E.R.; Pagliarini, D.J.; and Coon, J.J., "Neutron-encoded mass signatures for multiplexed proteome quantification." Nature Methods, 2013, 10(4), 332-334.