At 3:20pm on 17 November 2017, Professor Andrew Ng of Stanford University made a bold claim. “Should radiologists be worried about their jobs?” he tweeted from Mountain View, California.

“Breaking news: we can now diagnose pneumonia from chest X-rays better than radiologists.”

This was accompanied by a link to the summary of an academic paper written by the Stanford Machine Learning Group, of which Ng is the head. It claimed that ‘CheXNet,’ a 121-layer convolutional neural network, could infer the probability of a patient having pneumonia from a chest X-ray with greater efficiency than a human radiologist. This, the paper’s authors claimed, could “improve healthcare delivery and increase access to medical imaging expertise in parts of the world where access to skilled radiologists is limited.”

The science press latched onto the story quickly. Their general reaction was captured by MIT Tech Review, in a brief news story they published on the paper shortly after Ng’s announcement. “Add diagnosing dangerous lung diseases to the growing list of things artificial intelligence (AI) can do better than humans,” it said. “Analysing image-based data like X-rays, CT scans and medical photos is what deep-learning algorithms excel at. And they could very well save lives.”

Artificial coalface

There are, broadly speaking, two sides to AI research. The first takes place behind closed doors, in laboratories cultivated by search engines and social networks. It is, by most accounts, a friendly environment to work in. It is also unyieldingly private.

The other takes place in academia. Across the world, professors and PhD students continue to mount bold and sometimes strange experiments that push the bounds of AI. The papers that result often appear first on open academic forums like arXiv or GitHub, where debates on the subtleties of code are often genial, forthright and informative enough to spur new avenues of inquiry.

Pranav Rajpurkar and Jeremy Irvin are very much of this second school. Co-creators of CheXNet, they’re also master’s students in machine learning and computer science, respectively, at Stanford, and joined the Machine Learning Group in 2017. “I worked on a few projects over the year in healthcare and also education,” says Irvin. Once Pranav started working with the lab, “we started collaborating and working in the medical imaging space.”

The inception of CheXNet came with the publication of a huge public dataset of chest X-ray images by the NIH Clinical Center in September. The institution published 100,000 anonymised frontal scans from patients labelled with 14 different pathologies. The Stanford ML Group considered it an ideal source of raw material to train new deep-learning algorithms in detecting lung conditions from X-rays. They decided to begin with one of the most common.

“One of the things about pneumonia is that it has a high burden, even in developed countries, among adults,” says Rajpurkar. “It’s something like one million hospitalisations and 50,000 deaths a year just in the US.”

An inflammation of lung tissue resulting from infection, pneumonia is often diagnosed by eliminating, rather than confirming, possibilities. It begins with a presentation of vague symptoms in the patient, which can include coughing, sweating, shivering or breathing difficulties. The doctor then produces a stethoscope to investigate further, and sometimes orders a chest X-ray to confirm whether what they’re hearing from the patient’s lungs is pneumonia or something else.

Pneumonia usually presents on a chest X-ray as a visible blotch, the result of the inflammation that the disease causes in the tissue inside the lung(s). “Chest X-rays are quite wonderful, because it’s the most common imaging procedure,” says Rajpurkar. “About two billion chest X-rays a year [are conducted], and pneumonia is one of the things that chest X-rays provide the most information [on in] any diagnostic test.”

Any AI tasked with detecting its presence would have to differentiate these blotches from more serious masses, like tumours. To that end, the team at Stanford built a convolutional neural network (CNN) called CheXNet to interpret what was what on the scan.

“What that model does is learn to identify patterns and images, basically, after training them on a large amount of images,” explains Irvin.

To identify the probability of pneumonia, CheXNet scanned over each image, analysing different portions of it, and slowly discerning the relationship between the blotches and their labels.
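The ‘scanning’ at the heart of a convolutional network can be illustrated with a minimal sketch. The toy 4x4 ‘scan’, the 2x2 filter and the function name `convolve2d` are all hypothetical, chosen only for illustration; a real network such as CheXNet learns many such filters across its layers rather than having them written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding).

    Each output value is the sum of the element-wise products between
    the kernel and the patch of the image it currently covers. Strictly
    speaking this is cross-correlation, which is the operation that
    convolutional layers usually compute in practice.
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical toy "scan": a bright 2x2 blotch in the lower-right corner.
toy_scan = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical filter that responds most strongly to bright 2x2 patches.
patch_filter = [
    [1, 1],
    [1, 1],
]

response = convolve2d(toy_scan, patch_filter)
for row in response:
    print(row)
```

The response is largest where the filter sits directly over the blotch, which is the sense in which the network ‘looks’ at one portion of the image at a time.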

“After training for a long time on this large set of images, your algorithm will hopefully have a good set of parameters that can accurately predict whatever your task is,” explains Irvin.

It’s an approach to machine learning that moves beyond what Rajpurkar calls ‘canned engineering,’ where researchers and radiologists work together to create an algorithm that effectively tries to look for the same signs as an expert would. That is ending, he says, with deep learning.

“[We] started moving from these hand-engineered algorithms, which required expert input, to algorithms that were entirely driven by data and what we call ‘end to end’,” explains Rajpurkar. “[That means] that the only signal that’s coming into the model is whether or not it made an error. And there is no other expert knowledge going in.”
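What ‘end to end’ means in practice can be sketched with a far simpler model than CheXNet. The example below assumes a single logistic unit fed raw pixel values, trained by plain gradient descent on four made-up, four-pixel ‘images’; the data, the function names and the learning rate are all hypothetical. The only thing it shares with the real system is the principle Rajpurkar describes: the parameters are adjusted using nothing but the error between prediction and label.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, pixels):
    """Return the model's estimated probability that the label is 1."""
    z = bias + sum(w * p for w, p in zip(weights, pixels))
    return sigmoid(z)


def train(examples, epochs=500, learning_rate=0.5):
    """Fit weights from raw pixels using only the prediction error."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for pixels, label in examples:
            # The error is the sole training signal: no hand-written
            # rules about what a "blotch" looks like are supplied.
            error = predict(weights, bias, pixels) - label
            weights = [
                w - learning_rate * error * p
                for w, p in zip(weights, pixels)
            ]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical four-pixel "images" paired with labels (1 = finding present).
toy_examples = [
    ([0.0, 0.0, 1.0, 1.0], 1),
    ([0.0, 0.0, 0.9, 0.8], 1),
    ([0.0, 0.0, 0.0, 0.0], 0),
    ([0.1, 0.2, 0.0, 0.0], 0),
]

weights, bias = train(toy_examples)
print(predict(weights, bias, [0.0, 0.0, 1.0, 1.0]))  # probability for a positive example
print(predict(weights, bias, [0.0, 0.0, 0.0, 0.0]))  # probability for a negative example
```

After training, the weights on the two pixels that carry the ‘blotch’ have grown, even though nothing in the code told the model which pixels mattered.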

The team have also gone to great lengths to examine exactly why the CNN makes any given decision regarding an X-ray. Increasingly, neural networks of this type are being seen as ‘black boxes,’ where data is fed in, but researchers are not entirely clear as to why the results have led to the final decision. It’s led to an increasingly robust debate in the AI research community as to whether current work in the field prioritises engineering over science; building an arch to strengthen a bridge, as it were, without realising how the arch achieves this.

“The challenge is, now that we have no expert knowledge integration, how can we actually tell whether the model is doing anything sensible?” says Rajpurkar. One way to tell is simply to compare the accuracy of the CNN against the labels on the dataset, but the team at Stanford decided to go a step further by producing a heat map of probable zones of pneumonia on each scan analysed by the model. “That’s something that, we think, was a big jump in terms of how much we are able to convince experts that this is an algorithm that works,” Rajpurkar explains.
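One widely used way of producing such heat maps is class activation mapping, in which the network’s final feature maps are combined using the weights that connect them to the class in question. Whether this matches the Stanford team’s exact procedure is an assumption here, and the feature maps and weights below are invented purely for illustration.

```python
def class_activation_map(feature_maps, class_weights):
    """Combine feature maps into one heat map scaled to the range [0, 1].

    `feature_maps` is a list of equally sized 2D grids (one per channel)
    and `class_weights` holds one weight per channel for the class of
    interest. Both are hypothetical stand-ins for values that would come
    from a trained network.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    combined = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                combined[r][c] += weight * fmap[r][c]

    lowest = min(min(row) for row in combined)
    highest = max(max(row) for row in combined)
    span = highest - lowest
    if span == 0:
        # A flat map carries no localisation information.
        return [[0.0] * width for _ in range(height)]
    return [[(value - lowest) / span for value in row] for row in combined]


# Two invented 2x2 feature maps: the first fires in the lower-right
# corner, the second fires uniformly everywhere.
toy_feature_maps = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
toy_class_weights = [1.0, 0.5]

heat_map = class_activation_map(toy_feature_maps, toy_class_weights)
for row in heat_map:
    print(row)
```

In a real setting the resulting grid would be enlarged to the size of the X-ray and overlaid on it, so that a radiologist can see which regions pushed the prediction up.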

Underlying concerns

But not everyone believed them. Underneath Ng’s tweet, a niche community of radiologists passionate about AI – and a few radiologists who weren’t – voiced their complaints loud and clear. Some were offended by the professor’s suggestion that they should worry about their jobs, pointing out that medical AI was, in fact, still in its experimental stages. Others argued that the weight given to chest X-rays in the diagnosis of pneumonia was being fundamentally overstated.

This criticism derived from the fact that the diagnosis of pneumonia is a clinical decision, informed by a range of information of which the chest X-ray is only one part. According to Luke Oakden-Rayner, a radiologist and AI enthusiast who published an informal analysis of CheXNet in January, this wasn’t the fault of the Stanford ML Group: the authors had been scrupulous in describing the CNN’s task as purely the detection of pneumonia from X-rays, not full-blown diagnosis of the disease. Even so, added Oakden-Rayner, Ng perhaps should have known better than to write as much in his tweet, as should the science journalists that reproduced it without criticism.

Worse, however, was the persistent criticism that the labels assigned to the dataset used to train CheXNet were flawed. In the original NIH dataset, a distinction is made between pneumonia and a phenomenon called ‘consolidation’, which is a cloud-like shape found on the X-ray when fluid replaces air in part of the lung. Pneumonia results in consolidation, but consolidation is not necessarily pneumonia, and it is extremely hard for radiologists to observe the difference between the two from X-rays alone. The inclusion of ‘infiltration’, a term that describes airway opacity but seems to be rarely used in a clinical setting, was another source of confusion.

There were also questions about the validity of the testing process itself. CheXNet’s performance was measured against four radiologists observing frontal chest X-rays to detect pneumonia.

The neural network bested them, but the significance of this finding is open to debate if one considers that radiologists would also be able to access patient histories and lateral scans in a clinical setting.

The team behind CheXNet have not ignored their critics. They acknowledged the lack of extra information available to radiologists in the testing phase as a limitation and, amid public discussion about the viability of the calculations made in the original paper, the model has undergone several iterations. On other criticisms, however, they have remained forthright.

“I think most of the concerns when applied to our work become invalid when you [consider] that our labels on the test-set were derived from Stanford radiologists and were not automatically extracted from looking at the radiology report,” says Rajpurkar. “That makes our test-set robust. We can trust those labels much more than we should be trusting automatically generated labels.”

In other words, even if the dataset that CheXNet was trained upon was flawed, its high performance on the set of images it was tested on – the ones that were painstakingly relabelled at Stanford – proves its efficacy at the task at hand. “CheXNet serves as evidence that deep learning can be trained on noisy data and still learn valuable patterns,” says Irvin.

Even so, precisely how it did this remains mysterious – in short, we do not know how the arch works. CheXNet has also yet to be peer-reviewed. Nevertheless, Rajpurkar and Irvin firmly believe that it could be deployed in a clinical setting in a few years.

“The main application we see for the current state of CheXNet is to triage patients, or to basically rank them by the severity of the prediction of our model,” explains Irvin. “This algorithm could essentially order patients by prediction of pneumonia.”
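The triage idea Irvin describes amounts to sorting a worklist by the model’s output. A minimal sketch, assuming each patient already has a predicted probability attached (the identifiers and scores below are invented, and `triage_order` is a hypothetical helper, not part of CheXNet):

```python
def triage_order(predictions):
    """Return (patient_id, probability) pairs, highest probability first."""
    return sorted(predictions, key=lambda item: item[1], reverse=True)


# Invented worklist: patient identifiers paired with model probabilities.
worklist = [
    ("patient_a", 0.12),
    ("patient_b", 0.87),
    ("patient_c", 0.45),
]

ordered = triage_order(worklist)
for patient_id, probability in ordered:
    print(patient_id, probability)
```

Used this way, the model does not make the diagnosis; it only decides whose scan a radiologist looks at first.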

Above all, they’ve been concerned with keeping the discussion about CheXNet genial, and have focused on the potential AI has in transforming medicine in the long term. The neural network might not yet be ready for deployment in a clinical setting, but it’s certainly pointed out a way forward in automating one of the trickiest tasks that radiologists face today.

“We’ve done our best to keep this a good, positive, open conversation,” says Rajpurkar. “We want to convince the community that we are doing sound science, and that the community is also responsible for making sure that we do sound science. I think the conversation is incredibly useful.”