New, more specific classification system for proteins may help advance scientific discovery

By Tatum Lyles Flick - UW Madison Department of Chemistry

Lloyd Smith, a professor at the University of Wisconsin-Madison Department of Chemistry, believes that for researchers to understand complex biological systems, they need to have deeper knowledge about the exact molecular forms (proteoforms) in which proteins occur. 

“When you identify a proteoform, you also identify the protein that it comes from. If you know the proteoform, you know the defined molecule with a specific molecular structure,” Smith said, adding that genes encode proteins, which are often chemically altered, giving rise to a variety of different proteoforms that direct numerous biological functions.

Along with the term proteoform, Smith coined the term proteoform family, which identifies a group of proteoforms that correspond to a particular protein made by a specific gene.

Photo caption: Two graduate students from the Smith lab work on identification of proteoforms.

Smith and Neil L. Kelleher, a chemist from Northwestern University, recently published a perspective article in Science, “Proteoforms as the next proteomics currency,” that advocates for a change in how these molecules are identified and named and points out their importance in human health and disease.

Smith and Kelleher started this conversation in a 2013 letter to Nature Methods, titled “Proteoform: a single term describing protein complexity,” in which they argued that much of what makes an organism function is directed by proteins and their proteoforms, not by a high gene count, as researchers previously thought.

Recent genetics research from the Vidal laboratory at Harvard, cited in Smith’s article, indicates that two variants of the same protein can drive biological functions as varied as those driven by two different proteins made by completely different genes.

“Our overarching goal is to identify all sources of variation in proteins associated with a certain gene,” said Anthony Cesnik, a senior graduate student in Smith’s lab. “We want to see the hidden features of the proteome because subtle variations can be key factors involved in different types of disease.”

Revealing the intricacies of proteins and proteoforms requires a large amount of data analysis and can be time consuming, but Smith and his students see value in bringing this ability to other research labs.

“Our goal is to develop tools biologists can use to understand disease,” said Leah Schaffer, a senior graduate student in Smith’s lab. “Most current technology identifies proteins in a sample, but our lab is working on identification of proteoforms. This is important because they can have different biological implications.”

Because scientists are beginning to understand the importance of specific proteoforms for a variety of biological functions, Smith believes it is necessary to characterize those proteins, and related molecules, in molecular detail. Doing so means devising new strategies for identification.

Currently, most scientists use a “bottom-up” approach to proteomics, which means they use enzymes to cleave a protein into smaller pieces and then identify the resulting parts. While this can lead to accurate detection of the molecules, it does not yield identification of the full proteoform.
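To see why cleaving proteins first can hide the full proteoform, consider the following minimal Python sketch. It is an illustration only, not the Smith lab's software. It assumes a simplified trypsin rule (cut after K or R unless the next residue is P, with no missed cleavages), a made-up 12-residue toy sequence, and a lowercase letter to mark a modified residue; the function names are hypothetical.

```python
import re
from collections import Counter


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P (simplified rule)."""
    # The pattern matches zero-width positions, so residues are preserved.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_pool(proteoforms):
    """Count the peptides produced by digesting a mixture of proteoforms."""
    pool = Counter()
    for seq in proteoforms:
        pool.update(tryptic_digest(seq))
    return pool


# Toy data (hypothetical): lowercase s/t marks a modified residue.
# Sample A: one proteoform modified at site 1, another at site 2.
sample_a = ["AAsKGGTRLLEK", "AASKGGtRLLEK"]
# Sample B: one proteoform modified at both sites, another at neither.
sample_b = ["AAsKGGtRLLEK", "AASKGGTRLLEK"]

if __name__ == "__main__":
    print(tryptic_digest("AASKGGTRLLEK"))
    # The two samples hold different proteoforms but identical peptides.
    print(peptide_pool(sample_a) == peptide_pool(sample_b))
```

In this toy case the two samples contain different proteoforms, yet after digestion their peptide pools are identical, so the information about which modifications occurred together on the same molecule is lost.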

Smith, Kelleher and a small subset of other scientists worldwide take a top-down approach to proteomics, actively devising new ways to identify proteins and their proteoforms.

“In top-down proteomics, when a protein is put in the instrument, it is intact. It breaks apart in the instrument, so we can be sure that everything we are measuring is from that molecule,” Cesnik explained. “It’s like trying to assemble a puzzle. If the pieces from 100 puzzles land on the table, it’s harder to put together than if they are separate.”

“My group works on finding better ways to identify molecules,” Smith said, explaining that biologists are often limited in their knowledge of the proteoforms present in their system. “So, they know which gene the protein is from, but they don’t always know the different forms of the protein, or how much of each is present.”

To properly identify a proteoform and accurately define its molecular arrangement, Smith believes it is best to use a combination of methods, rather than relying solely on the intermediate digestion step used in bottom-up identification.

“We currently use many techniques together, which include RNA sequencing to find genetic variation and splice forms, so we know what the organism is capable of making; intact mass measurement and top-down fragmentation to identify unknown proteoforms,” Smith said, adding that his group is creating a catalog of the molecules. “There is a lot of informatics involved, so my students also write computer programs to perform the data analysis.”

The equipment used to identify proteins can assess only several at a time, in what can be a lengthy process that does not cover all proteins in a sample. Smith’s group is working on intact mass analysis to change that.

“The instrument picks the most abundant proteins and fragments them, but it doesn’t have time to fragment all proteins in a sample,” Schaffer said. “We developed software that takes intact mass observations and makes identifications based on that mass, which means that we don’t have to fragment the protein in this process.”
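The general idea of identifying a proteoform from its intact mass alone can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the group's actual software: the unmodified protein mass is a made-up value, only three common modifications are considered (using their standard monoisotopic mass shifts), and the function names and the 10 ppm tolerance are hypothetical choices.

```python
from itertools import combinations_with_replacement

# Standard monoisotopic mass shifts, in daltons, for three modifications.
MOD_SHIFTS = {
    "acetyl": 42.010565,
    "methyl": 14.015650,
    "phospho": 79.966331,
}


def candidate_proteoforms(base_mass, max_mods=2):
    """Yield (label, mass) for the unmodified protein and for every
    combination of up to max_mods modifications."""
    for n in range(max_mods + 1):
        for combo in combinations_with_replacement(sorted(MOD_SHIFTS), n):
            mass = base_mass + sum(MOD_SHIFTS[m] for m in combo)
            yield ("+".join(combo) or "unmodified", mass)


def match_intact_mass(observed, base_mass, tol_ppm=10.0, max_mods=2):
    """Return labels of candidates within tol_ppm of the observed mass."""
    hits = []
    for label, mass in candidate_proteoforms(base_mass, max_mods):
        error_ppm = (observed - mass) / mass * 1e6
        if abs(error_ppm) <= tol_ppm:
            hits.append(label)
    return hits


if __name__ == "__main__":
    BASE = 11000.0  # hypothetical unmodified protein mass, in daltons
    observed = BASE + MOD_SHIFTS["acetyl"] + MOD_SHIFTS["phospho"]
    print(match_intact_mass(observed, BASE))
    # Allowing three modifications exposes an ambiguity for a mass
    # consistent with a single acetylation.
    print(match_intact_mass(BASE + MOD_SHIFTS["acetyl"], BASE, max_mods=3))
```

The second call shows a limit of this toy approach: at this protein size and tolerance, one acetylation and three methylations differ by only about 0.036 daltons, so both candidates match. Ambiguities like this are one reason combining intact mass with other evidence can help.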

Schaffer pointed out that quantification is also key to the information gathered in this process.

“Diseased cells can have higher or lower numbers of specific proteoforms than healthy cells, and it’s important that we have an idea of what those counts should be,” she said.

Currently, thousands of labs use bottom-up assessments of proteins and proteoforms, but only a couple hundred work with top-down analysis. Smith and his collaborators want to change that by helping others understand the benefits of top-down analysis and how elucidating the many proteoforms can lead to greater scientific knowledge and discoveries.


This research is funded by the National Institutes of Health (NIH) through the National Institute of General Medical Sciences (R01GM114292, “Intact Proteoform Identification and Quantification”). Cesnik is supported by the NIH (U24CA199347) and by the Computation and Informatics Training Program (T15LM007359). Schaffer is supported by the Biotechnology Training Program (T32GM008349).