Proteoforms

The critical protein actors in biological systems are intact proteoforms: the different forms of a protein produced from the genome through a variety of splice forms and adorned with a myriad of post-translational modifications that dramatically affect their function (see Figure 1) [1]. Unfortunately, today’s bottom-up proteomic technologies identify and quantify peptides derived from proteins rather than the proteins themselves, and so fail to deliver crucial protein-level information to biologists. In top-down proteomics, where intact proteins are analyzed by tandem mass spectrometry (MS/MS), more proteoforms are observed than can be identified by fragmentation.

We have developed an intact-mass strategy that identifies proteoforms by their observed mass alone, which increases the number of proteoforms identified in top-down proteomic studies [2]. We have also pioneered the concept of the “proteoform family,” the set of proteoforms derived from a given gene. We visualize proteoform families as a network in which nodes are unique proteoforms and edges between nodes are mass differences corresponding to modifications or amino acid differences (see Figure 2). This visualization shows all of the proteoforms from a given gene, together with their relative abundances, in a single graphic.
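The network idea can be illustrated with a minimal Python sketch. This is a simplification for illustration and not the published workflow from [2]: the function names (`match_shift`, `build_families`), the 0.02 Da tolerance, and the short list of modification mass shifts are all assumptions. Proteoforms are linked when their mass difference matches a known shift, and each connected group of linked proteoforms is treated as one family.

```python
from itertools import combinations

# Monoisotopic mass shifts in daltons for a few common modifications.
# Illustrative subset only; not the modification list used in [2].
KNOWN_SHIFTS = {
    "acetylation": 42.010565,
    "phosphorylation": 79.966331,
    "methylation": 14.015650,
    "oxidation": 15.994915,
}


def match_shift(delta, tolerance=0.02):
    """Return the modification whose mass shift matches |delta|, or None."""
    for name, shift in KNOWN_SHIFTS.items():
        if abs(abs(delta) - shift) <= tolerance:
            return name
    return None


def build_families(masses, tolerance=0.02):
    """Group proteoforms into families (connected components of the network).

    masses: dict mapping a proteoform label to its monoisotopic mass in Da.
    Returns (families, edges), where families is a sorted list of sorted
    label lists and edges is a list of (label_a, label_b, modification).
    """
    # Union-find structure: every proteoform starts in its own family.
    parent = {label: label for label in masses}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = []
    for a, b in combinations(sorted(masses), 2):
        name = match_shift(masses[a] - masses[b], tolerance)
        if name is not None:
            edges.append((a, b, name))
            parent[find(a)] = find(b)  # merge the two families

    groups = {}
    for label in masses:
        groups.setdefault(find(label), []).append(label)
    families = sorted(sorted(group) for group in groups.values())
    return families, edges


# Hypothetical masses, chosen only to exercise the code.
example_masses = {
    "theoretical_A": 10000.0000,
    "experimental_1": 10042.0106,  # theoretical_A + acetylation
    "experimental_2": 10121.9769,  # experimental_1 + phosphorylation
    "experimental_3": 25000.0000,  # unrelated mass
}
families, edges = build_families(example_masses)
```

In this toy input the first three proteoforms end up in one family, linked by an acetylation edge and a phosphorylation edge, while the unrelated mass forms a family of its own. A real analysis would need a much larger set of mass shifts and a way to control false matches.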

Figure 1. “Proteoform” defined. A proteoform is a particular form of a protein, defined by its exact amino acid sequence together with any post-translational modifications on that sequence. Thus, each “protein” arising from a gene may exist as numerous different proteoforms. Figure from [1].

Figure 2. Visualized proteoform families from yeast lysate. Pink squares represent genes, green circles represent theoretical proteoforms from the database, blue circles represent intact-mass experimental proteoforms, purple circles represent experimental proteoforms identified by top-down MS2 analysis, and lines between circles represent mass differences corresponding to a modification or amino acid difference. Figure from [2].

The transition we propose, from identifying proteoforms by MS/MS fragmentation to identifying them by accurate intact mass, is similar to the shift that occurred in genomics with SNP analyses. The early stages of the genome project focused on “SNP discovery”; once the SNP databases neared completion, the field turned to “SNP scoring”, a much less expensive and more straightforward process. In proteomics, once proteoform databases are near completion, meaning that the hard work of identifying which proteoforms actually exist in nature has been done, laborious and complex analyses of proteins involving MS/MS fragmentation will no longer be necessary. Rather, accurate mass determination alone will be sufficient to provide quantitative information on proteoform identities and abundances.
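As a rough illustration of what such “proteoform scoring” could look like, the Python sketch below looks up an observed intact mass in a sorted proteoform database and returns every entry within a parts-per-million tolerance. The function name `match_intact_mass`, the 10 ppm default, and the database entries are hypothetical and chosen only for this example.

```python
from bisect import bisect_left, bisect_right


def match_intact_mass(observed_mass, database, ppm_tolerance=10.0):
    """Return IDs of database proteoforms within ppm_tolerance of observed_mass.

    database: list of (monoisotopic_mass_in_Da, proteoform_id) tuples,
    sorted by mass in ascending order.
    """
    # Convert the relative (ppm) tolerance into an absolute window in Da.
    window = observed_mass * ppm_tolerance / 1e6
    masses = [mass for mass, _ in database]
    # Binary search for the slice of entries inside the mass window.
    lo = bisect_left(masses, observed_mass - window)
    hi = bisect_right(masses, observed_mass + window)
    return [proteoform_id for _, proteoform_id in database[lo:hi]]


# Hypothetical database entries, sorted by mass.
example_database = [
    (10000.000, "P1_unmodified"),
    (10042.011, "P1_acetylated"),
    (15000.000, "P2_unmodified"),
]

hits = match_intact_mass(10042.05, example_database)
misses = match_intact_mass(12000.0, example_database)
```

Here the first lookup returns only the acetylated entry and the second returns nothing. Proteoforms with nearly identical masses would all be returned together, so a lookup like this is only as informative as the database is complete and the mass measurement is accurate.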

The ability to determine the extent and nature of protein variation is a critical missing piece in proteomics today. A surprise revealed by the success of the human genome project was the much lower than anticipated number of genes in humans: roughly 20,000 rather than the predicted 100,000. This finding has led to the general recognition that much of the complexity and sophistication of our biological machinery arises at the level of protein variation rather than simply from a large number of distinct genes. These protein variations occur on at least three levels: alternative splicing of the RNA transcript, codon substitutions, and a wide variety of post-translational modifications (PTMs). Determining the identities and abundances of the resulting proteoforms is vital to understanding normal and disease biology, because proteoform variation plays a central role in a wide variety of biological processes, from cell signaling and signal transduction to gene regulation.

References

  1. Smith, Lloyd M. and Kelleher, Neil L., "Proteoform: a single term describing protein complexity." Nature Methods, 2013, 10(3), 186-187.
  2. Schaffer, Leah V.; Shortreed, Michael R.; Cesnik, Anthony J.; Frey, Brian L.; Solntsev, Stefan K.; Scalf, Mark; Smith, Lloyd M. "Expanding Proteoform Identifications in Top-Down Proteomic Analyses by Constructing Proteoform Families." Analytical Chemistry, 2018, 90, 1325-1333.