Unreasonable Effectiveness of Machine Learning

Earlier this year, we stumbled into a surprise while testing one of our machine learning models, and we’ve told the story a few times -- but never in writing. 

Before we get to the surprise, you’ll need a little bit of background on our work. If you’re familiar with iPSCs and direct reprogramming, skip down to the section on predicting reprogramming targets.

From stem cells…

Cellular reprogramming and direct cellular reprogramming are central to our work at Ivy Natal. We’ll briefly describe both of these, starting with cellular reprogramming. 

Cellular reprogramming usually refers to the process of turning normal adult cells into cells that function like embryonic stem cells (ESCs), the cells found in the embryo which give rise to all the cells and tissues of an adult animal. When these stem cells are a product of reprogramming, they are called induced pluripotent stem cells (iPSCs) to distinguish them from ESCs.

This process for producing iPSCs was first published by Shinya Yamanaka’s team at Kyoto University. In 2006, Yamanaka’s team hypothesized that this “reprogramming” from an adult cell to a stem cell might be possible, and specifically that it might be possible by “turning on” the genes believed to be important in ESCs. They started with 24 candidate genes and inserted them into the DNA of mouse skin cells. ESC-like cells emerged from this process, and by removing one gene at a time from the starting pool of 24, the group narrowed the set down to four genes that were together sufficient to generate ESC-like cells. These genes, Oct4, Sox2, cMyc, and Klf4, have become known as the “Yamanaka factors”.

This work in mice was quickly extended to humans. In the same month of 2007, two independent groups reported success: Shinya Yamanaka’s group again, and James Thomson’s group at the University of Wisconsin-Madison. Using the same principle as the mouse work, Yamanaka’s group transformed human fibroblasts into iPSCs with the same four genes (OCT4, SOX2, KLF4, and cMYC), while Thomson and his colleagues used a different set of factors: OCT4, SOX2, NANOG, and LIN28.

Now, from our perspective, there are two miracles in the development of cellular reprogramming. The first is the availability of iPSCs themselves: iPSCs can be re-differentiated into a number of cell types using standard cell biology methods, which enables new kinds of cell-based therapies. The second is the method of their production -- expression of a set of genes -- because this method is in no way specific to iPSCs and can in principle be applied to produce any other cell type.

Of these two miracles, the first has in many ways overshadowed the second: the downstream applications of iPSCs have received far more attention than the possibility of applying the same production method to other cell types. Of course, our business is built on exactly that second miracle -- applying the method behind iPSCs to produce new egg cells from skin cells.

…to direct reprogramming

The process of transforming adult cells into other cell types, rather than into iPSCs, is called direct cellular reprogramming. It is “direct” because there is no intermediate iPSC stage.

Just as Shinya Yamanaka’s group was in 2006, every potential direct cellular reprogramming application is confronted with the challenge of identifying which transcription factors are necessary and sufficient to drive the transformation from the starting cell type to the desired cell type. Yamanaka’s group leveraged the extensive literature on mouse embryonic stem cells, along with their own expertise, to identify candidates, and then systematically worked through them.

At Ivy Natal, we are attempting to produce human egg cells, an incredibly rare cell type, and one that has not been studied to anywhere near the same degree as mouse embryonic stem cells. How then do we go about finding the transcription factors for the reprogramming? Well, instead of assembling a pool of candidate transcription factors from the literature, which may have gaps or errors, we are leveraging recent advances in both sequencing technologies and machine learning.

Predicting targets

Our system for predicting targets has gone through multiple generations, and the story of our surprise comes from testing the second-generation system. It works in essentially three stages.

First, we have built a machine learning model to better understand all human transcription factors. Transcription factors are proteins that certain genes code for; what distinguishes them from other proteins is that they bind back to DNA, at locations determined by the particular factor, and influence which genes are expressed. It’s not a perfect analogy, but if a cell’s DNA is a computer program, then transcription factors (along with other cell machinery) modify the program as it is running.

The model we have built for human transcription factors predicts their DNA binding sites, which in turn indicate which genes each factor may modify the expression of. Individually, these predictions could be used to study the function of each transcription factor, but our goal is to understand the relationship between a cell’s identity, whether it is a skin cell or an egg cell, and the transcription factors active in the cell.
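To give a feel for the underlying task (our actual model is more involved and isn’t described here), the classic way to represent a transcription factor’s binding preference is a position weight matrix, which can be scanned along a DNA sequence to score candidate binding sites. Here is a minimal sketch with an invented toy motif:

```python
# Illustrative only: scanning a DNA sequence with a position weight matrix
# (PWM), a classic representation of a transcription factor's binding
# preference. The motif below is made up for the example.
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (len, 4) one-hot matrix."""
    idx = np.array([BASES.index(b) for b in seq])
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def scan(seq: str, pwm: np.ndarray) -> np.ndarray:
    """Score every window of the sequence against the PWM (sum of log-odds)."""
    x = one_hot(seq)
    w = pwm.shape[0]
    return np.array([(x[i:i + w] * pwm).sum() for i in range(len(seq) - w + 1)])

# Toy 4-bp motif "AGCA": rows are positions, columns are probabilities of
# A, C, G, T, converted to log-odds against a uniform background.
toy_pwm = np.log2(np.array([
    [0.7, 0.1, 0.1, 0.1],   # position 1 prefers A
    [0.1, 0.1, 0.7, 0.1],   # position 2 prefers G
    [0.1, 0.7, 0.1, 0.1],   # position 3 prefers C
    [0.7, 0.1, 0.1, 0.1],   # position 4 prefers A
]) / 0.25)

scores = scan("TTAGCAAGCATT", toy_pwm)
print(int(scores.argmax()), float(scores.max()))  # best window start and its score
```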

To do that, the second stage of our process augments the model with additional sequencing data specific to our cell types, and uses a machine learning technique called feature extraction to identify the transcription factors that are relatively more active in one cell type than the other. This narrows the set of potential transcription factors for directly reprogramming skin cells into egg cells.
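As a minimal sketch of the idea behind this stage (not our actual pipeline, and with invented numbers): given some per-factor activity score in each cell type, rank transcription factors by how much more active they are in the target cells than in the starting cells.

```python
# Hypothetical per-transcription-factor activity scores in each cell type,
# e.g. derived from expression or motif accessibility in sequencing data.
# The factor names and values are invented for illustration.
import pandas as pd

activity = pd.DataFrame(
    {
        "skin_cell": {"TF_A": 5.1, "TF_B": 0.3, "TF_C": 2.2, "TF_D": 0.1},
        "egg_cell":  {"TF_A": 0.4, "TF_B": 6.8, "TF_C": 2.0, "TF_D": 3.9},
    }
)

# Factors much more active in the target (egg) cells are candidate
# reprogramming factors; factors equally active in both are less interesting.
activity["delta"] = activity["egg_cell"] - activity["skin_cell"]
candidates = activity.sort_values("delta", ascending=False).head(10)
print(candidates)
```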

The third and final stage of our process looks at the relationships between the transcription factors identified in the earlier steps and isolates transcription factors which reinforce each other’s expression. In other words, it looks for situations where one transcription factor binds to a part of the genome and, by doing so, activates the expression of a second transcription factor, and that second transcription factor does the same to activate expression of the first.

These self-reinforcing relationships between transcription factors actually form entire self-reinforcing networks, and these networks are part of the foundation of a cell’s identity.
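One simple way to picture this stage (illustrative only, with hypothetical factors and edges): treat each predicted “factor A activates factor B” relationship as a directed edge, and look for groups of factors that all reinforce one another, i.e. strongly connected components of the graph.

```python
# Illustrative sketch: find groups of transcription factors that mutually
# sustain each other's expression. Factor names and edges are hypothetical.
import networkx as nx

predicted_activation = [
    ("TF_A", "TF_B"),  # TF_A binds near TF_B and is predicted to activate it
    ("TF_B", "TF_A"),  # ...and TF_B returns the favor: a reinforcing pair
    ("TF_B", "TF_C"),
    ("TF_C", "TF_A"),
    ("TF_D", "TF_A"),  # TF_D feeds in but is not reinforced, so it drops out
]

g = nx.DiGraph(predicted_activation)

# Keep only components of more than one factor, i.e. genuine self-reinforcing
# networks rather than isolated nodes.
networks = [c for c in nx.strongly_connected_components(g) if len(c) > 1]
print(networks)  # e.g. [{'TF_A', 'TF_B', 'TF_C'}]
```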

A surprise!

So, that’s the background. Now to the surprise. 

As we mentioned, researchers have used transcription factors to directly reprogram between several different cell types -- including highly specialized cells like neurons. These established reprogramming tasks are valuable because they can be used to validate our method of predicting transcription factors for new reprogramming tasks, like producing egg cells. After all, if our method of predicting transcription factors works, it should be able to reproduce the transcription factors for established reprogramming tasks.

Of course, the best studied reprogramming task is producing iPSCs, so it was (and has remained) our first trial to evaluate our prediction systems.
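The check itself is simple. A sketch of the kind of comparison involved, using the published factor sets (OCT4 is encoded by POU5F1, cMYC’s gene symbol is MYC, and LIN28’s is LIN28A) and an example prediction list that is not our actual output:

```python
# The published Yamanaka and Thomson factor sets, by gene symbol.
YAMANAKA = {"POU5F1", "SOX2", "KLF4", "MYC"}
THOMSON = {"POU5F1", "SOX2", "NANOG", "LIN28A"}

def check(predicted):
    """Compare a predicted factor set against the known reprogramming factors."""
    known = YAMANAKA | THOMSON
    print("recovered known factors:", predicted & known)
    print("novel predictions:", predicted - known)

# Example input only, not our system's output.
check({"POU5F1", "NANOG", "TRIM24"})
```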

Now, we build systems that we expect to work, but until you actually do something, you never *really* know. So earlier this year, when we tested our second-generation prediction system on iPSCs for the first time, we anxiously read down the list of predicted targets, highlighted in blue in the table below:

First, POU5F1, the gene encoding Oct4, which is common to the Yamanaka and Thomson factor sets. Then NANOG, a Thomson factor now widely recognized as a key part of the pluripotency network. And finally... TRIM24.

Our hearts sank just a bit. We are not experts on iPSCs, and we do not follow every development in the field (we are more focused on the specifics of egg cell development), but this transcription factor was a surprise. So we searched PubMed, and this was the abstract of the first result [emphasis ours]:

Maintenance of pluripotency is regulated by a network of transcription factors coordinated by Oct4, Sox2, and Nanog (OSN), yet a systematic investigation of the composition and dynamics of the OSN protein network specifically on chromatin is still missing. [...] Trim24, a previously unrecognized protein in the network, converges with OSN on multiple enhancers and suppresses the expression of developmental genes while activating cell cycle genes. Consistently, Trim24 significantly improved efficiency of cellular reprogramming, demonstrating its direct functionality in establishing pluripotency. 

It was a paper published in 2016 by a group based in Germany (Rafiee et al. 2016), nearly ten years after Yamanaka’s results were published. It established that Trim24, the surprise in our list of predictions, is a core member of the set of transcription factors that produce iPSCs. 

We got goosebumps.

We expected our system to work. We expected that it could predict the Yamanaka factors, or the Thomson factors, or some combination, despite the fact that it was naive to all the biology that informed those researchers’ original candidate factors, and was trained on nothing more than sequencing data. But we didn’t expect it to know more about iPSC reprogramming than we did, and we didn’t expect it to identify a transcription factor for iPSCs that took the field ten years to find after the Yamanaka paper.

Unreasonable effectiveness

This result -- the first test of a system that we expected to work, and which ultimately worked even better than we could have expected -- reminded one of us of a passage from the mathematician Richard Hamming’s essay “The Unreasonable Effectiveness of Mathematics”, in which he grapples with the astonishing success of mathematics:

During my thirty years of practicing mathematics in industry, I often worried about the predictions I made. From the mathematics that I did in my office I confidently (at least to others) predicted some future events -- if you do so and so, you will see such and such -- and it usually turned out that I was right. How could the phenomena know what I had predicted (based on human-made mathematics) so that it could support my predictions?

The eerie feeling that Richard Hamming felt about mathematics, and that we felt about our transcription factor model, does not come from a lack of understanding. He was a commanding mathematician, and, despite machine learning’s reputation as a black box, we understand our model (it was designed and trained on data to perform a specific task, and we understand how it does that task) -- but still the feeling of wonder at its success remains.
