How “the most advanced machine learning approach” is finding new cancer-causing mutational signatures

The changes to our DNA that cause cancer are most often what are called somatic mutations.

Somatic mutations are mutations that affect any of the cells in our body except for the sex cells – the egg or sperm. That means they can’t be passed on to the next generation.

These mutations can be caused by our DNA being exposed to carcinogens like chemicals in tobacco that alter DNA directly, or by errors made by our cells when replicating or repairing damaged DNA.

Each of these processes alters DNA in a different way, and each leaves behind a characteristic pattern of mutations that can be linked back to it when analysed. This pattern is known as a mutational signature.

Therefore, if we analyse the DNA of an individual’s cancer cells, we can, in some cases, identify which processes have caused the mutations we see by comparing them to known signatures.
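
To give a rough idea of what "comparing them to known signatures" can look like in practice, here is a minimal Python sketch. It assumes a very simplified setup: mutations are counted in just six base-substitution classes, the two reference signatures are invented numbers rather than real catalogue entries, and similarity is measured with cosine similarity, one common way to compare such profiles. All names and values are hypothetical, and this is not how any particular tool is implemented.

```python
import math

# The six base-substitution classes; a simplified alphabet for illustration.
CHANNELS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Hypothetical reference signatures (invented numbers, each sums to 1).
REFERENCE = {
    "toy_signature_A": [0.60, 0.05, 0.10, 0.10, 0.10, 0.05],
    "toy_signature_B": [0.05, 0.05, 0.70, 0.05, 0.10, 0.05],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_signature(counts, reference=REFERENCE):
    """Score a sample against every reference and return the best match."""
    scores = {name: cosine_similarity(counts, sig)
              for name, sig in reference.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Invented mutation counts for one tumour, listed in CHANNELS order.
sample_counts = [120, 12, 25, 18, 22, 9]
best, scores = closest_signature(sample_counts)
print(best, {name: round(score, 3) for name, score in scores.items()})
```

In this toy case the sample is dominated by C>A changes, so it sits much closer to the first invented signature than the second. Real analyses use far more mutation classes and usually estimate a mixture of several signatures rather than a single best match.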

But to find these signatures in cancer DNA, we need tools powerful enough to sift through a huge amount of data and recognise known signatures, or identify new ones, with a high degree of accuracy.

Now, with funding from Cancer Grand Challenges, a team of researchers at the University of California San Diego has developed the newest in a suite of tools that’s quickly establishing itself as the best of the best.

A cut above the rest

SigProfilerExtractor is a machine learning algorithm designed to find new signatures from cancer genetic data. Back in June, we explained how it had already been used to identify 21 copy number signatures in cancer DNA, meaning patterns in how many of each chromosome a cancer cell has. But the team who developed it didn’t stop there.

To make sure SigProfilerExtractor is the best it can be, they put it to the test against 13 other tools. The tools were used to analyse information from over 60,000 synthetic genomes, computer-generated versions of a cancer’s complete set of DNA, that contained 2,500 simulated signatures. In this task, not only did SigProfilerExtractor detect 20 to 50% more true positive signatures than other tools, but it also found almost no false positive signatures in the data.
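
Because the signatures in synthetic data are known in advance, a tool's output can be scored against them. The sketch below shows one way true and false positives might be counted. It assumes that an extracted signature counts as a true positive when its cosine similarity to a not-yet-matched simulated signature is at least 0.9, and that matching is done greedily. The threshold, the matching rule and the six-value toy signatures are all assumptions for illustration, not the procedure used in the study.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_extraction(extracted, truth, threshold=0.9):
    """Count true positives, false positives and false negatives.

    Each extracted signature is greedily paired with the most similar
    simulated signature that is still unmatched, provided the similarity
    reaches the (assumed) threshold.
    """
    unmatched = list(range(len(truth)))
    true_pos = 0
    for sig in extracted:
        best_idx, best_sim = None, threshold
        for idx in unmatched:
            sim = cosine_similarity(sig, truth[idx])
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is not None:
            unmatched.remove(best_idx)
            true_pos += 1
    false_pos = len(extracted) - true_pos  # extracted, but not simulated
    false_neg = len(unmatched)             # simulated, but never recovered
    return true_pos, false_pos, false_neg


# Invented "simulated" signatures and an invented tool output.
truth = [
    [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.70, 0.05, 0.10, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.50, 0.10],
]
extracted = [
    [0.58, 0.12, 0.10, 0.10, 0.05, 0.05],  # close to the first truth signature
    [0.20, 0.20, 0.20, 0.20, 0.10, 0.10],  # resembles none of them closely
]
tp, fp, fn = score_extraction(extracted, truth)
print(tp, fp, fn)
```

Here the toy output recovers one simulated signature, reports one that was never simulated, and misses two. A tool that does well in a benchmark like this finds more of the simulated signatures while reporting few that are not really there.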

“What we’ve done in this paper is the largest possible benchmarking of computational tools that exist,” says Dr Ludmil Alexandrov, Associate Professor at University of California San Diego.

“Now, it’s one thing to show that you outperform other tools on synthetic data. But you also need to show that you can make novel biological discoveries.”

Luckily, SigProfilerExtractor can do that too.

Finding the link

Once SigProfilerExtractor had been benchmarked against the other bioinformatics tools, the team put it to work on data from the DNA of real people living with cancer. When tasked with analysing the entire genomes of almost 5,000 cancers and all the coding genes of almost 20,000 more, the algorithm found four signatures that had not been detected by any other tool.

One of these signatures, seen in the DNA of bladder cancers, could even be linked to chemicals in tobacco.

“Epidemiologically, we know that tobacco smoking increases the risk for developing bladder cancer,” says Alexandrov. “But we don’t see the traditional mutational signature associated with tobacco smoking in bladder cancer like we do in lung and oral cancers.

“So, in a way this was a big mystery. Is there a mutational signature in bladder cancer? Do tobacco carcinogens actually mutate bladder cancer, or is it linked in another way?”

Thanks to SigProfilerExtractor, that mystery has been solved. The signature, now named SBS92, shows that there is a tobacco signature in bladder cancer, and it’s different to the signature we see in lung cancers. What’s more, the team also found the signature in healthy bladder tissues of smokers.

However, what causes the other three new signatures that the algorithm detected is still unclear, which opens up new potential avenues of investigation.

“Signatures can have external origin, or originate from different processes within the cell,” says Marcos Díaz-Gay, co-lead author of the study and postdoctoral researcher in the Alexandrov Lab at University of California San Diego.

“So, it will be interesting to look at different cancer datasets, maybe from different countries with different environmental exposures, to identify the causes of these new signatures that we can see in the genome.”

What’s next?

Identifying what may have caused a certain cancer from its mutational signatures is one thing, but what does that mean for treatment?

“What we want to do is apply this knowledge in the clinic and use it on an individual level,” says Díaz-Gay.

We know that some treatments are more effective in cancers with particular mutations. So, by identifying which mutational signatures are present in a tumour, we may be able to better tailor an individual’s treatment to their specific cancer. And the team behind SigProfiler are hoping to work with a global database called COSMIC, the Catalogue of Somatic Mutations in Cancer, to allow researchers from anywhere in the world to analyse their patients’ tumours using the algorithm.

“We’re actually trying at the moment to set up a web server on the COSMIC website where anyone can go and upload a sample,” adds Alexandrov.

“Then you’re able to analyse the mutational signatures in an individual patient with a very, very high accuracy.

“Obviously, one cannot give clinical advice from a website. But one can say, ‘this is a signature and there is a lot of evidence from previous research that people who have these signatures are likely to respond to this specific drug’.”

Rising to the challenge

Where other bioinformatics tools often become less useful over time as there isn’t the funding to maintain them, SigProfilerExtractor has a leg up.

“One of the different things about the SigProfiler suite of tools is that we’ve had the opportunity to develop and maintain them for the last five years. And essentially, that’s created quite a large community around them.

“This is the huge advantage of Cancer Grand Challenges. We know that it performs better than other tools because we had sufficient funds to compare it to everything else that exists, and make it very usable for people.”

With SigProfilerExtractor already established as the cream of the crop, the results of this research are only the beginning for the SigProfiler tools. With continued funding from Cancer Grand Challenges, the team can further develop the algorithm, taking us one step closer to personalising cancer treatment.