Video produced by Steve Nathans-Kelly
At Data Summit 2019, Helena Deus, Elsevier technology research director, explained how the company used natural language processing and AI to organize dispersed scientific content in a non-traditional way and built an extensive community of readers.
Deus presented the work of Dr. Michelle Gregory, SVP, Data Science, Elsevier.
"A lot of our customers were facing this wealth of literature," said Deus. "There's a massive amount of papers that are coming out every year. Not just papers, publications. Information that's been generated by people. Many of them—in particular PIs and scientists at the top of their field—needed the ability to quickly grasp topics, to get familiarized with the topic really quickly, either because the funding agents they were used to working with is now funding this particular topic, or in the case of specialist doctors they need the ability to learn as much as they can about a patient's diagnosis, so that they can be confident they're making the right decisions."
As a result, Elsevier faced a challenge, she said. "We knew we had a massive amount of information about a variety of topics, but it was dispersed. It was all over the place, essentially, in textbooks, in papers, in guidelines. And so the first thing we did is we built a taxonomy. What kind of things do we want to capture information about?"
DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.
Deus said that the company considered how to organize knowledge differently from the way it is traditionally organized. "Once we had this taxonomy, we were actually now able to use standard NLP techniques like Named Entity Recognition, Entity Identification, et cetera. And then the final step is, how do we scale this? Because we've got 16 million individual articles, this is a massive amount of data. So investing on technology that can scale was actually pretty critical. In this case we used Databricks and Apache Spark and that helped us scale. But at the end of the day what we have is something that looks like this. If you go to Google nowadays, and you search for a medical term that doesn't tend to show up in Wikipedia, things that haven't actually been annotated in Wikipedia, the top result you get in Google is one of these topic pages. And they're all dynamically generated. There's no manual curation going on here."
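For readers who want a concrete picture of the taxonomy-then-tagging idea Deus described, here is a minimal Python sketch. It is not Elsevier's pipeline. The taxonomy entries, the function names, and the sample articles are all hypothetical, and simple dictionary matching stands in for the statistical Named Entity Recognition models a production system would likely use. It only shows how a fixed list of topics can be used to find which articles mention each topic, which is the raw material for a generated topic page.

```python
import re
from collections import defaultdict

# Hypothetical mini-taxonomy (canonical lowercase term -> domain).
# Illustrative only; not taken from the talk.
TAXONOMY = {
    "myocardial infarction": "Medicine",
    "named entity recognition": "Computer Science",
    "apache spark": "Computer Science",
}


def build_matcher(taxonomy):
    """Compile one regex that matches any taxonomy term as a whole phrase."""
    # Longest terms first, so a longer phrase wins over a shorter one
    # that shares its prefix.
    terms = sorted(taxonomy, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def tag_topics(text, taxonomy, matcher):
    """Return (matched text, canonical term, domain, offset) for each hit."""
    hits = []
    for m in matcher.finditer(text):
        term = m.group(0).lower()  # taxonomy keys are lowercase
        hits.append((m.group(0), term, taxonomy[term], m.start()))
    return hits


def index_articles(articles, taxonomy):
    """Map each canonical term to the sorted ids of articles mentioning it."""
    matcher = build_matcher(taxonomy)
    index = defaultdict(set)
    for article_id, text in articles.items():
        for _, term, _, _ in tag_topics(text, taxonomy, matcher):
            index[term].add(article_id)
    return {term: sorted(ids) for term, ids in index.items()}


if __name__ == "__main__":
    # Made-up sample articles for demonstration.
    sample = {
        "a1": "Outcomes after Myocardial Infarction in older adults.",
        "a2": "We used Apache Spark to study myocardial infarction records.",
    }
    print(index_articles(sample, TAXONOMY))
```

Because `tag_topics` works on one article at a time and keeps no shared state, the same function could in principle be applied in parallel across a large corpus on a distributed engine such as the Apache Spark setup mentioned in the talk. That step is left out here to keep the sketch self-contained.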
The result has been a great improvement in Elsevier's visibility, she noted. "Nowadays, we have about 330,000 topics in 20 domains, 10 million visits per month to the topic pages, it drives 13% of ScienceDirect traffic."
Many presenters have made their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.
To access the full presentation, "Digital Transformation Is Business Transformation: How to Incorporate AI Technology into a 130-Year-Old Company," go to https://datasummit.brightcovegallery.com/detail/videos/data-summit-2019-keynotes/video/6040533219001/keynote---digital-transformation-is-business-transformation:-how-to-incorporate-ai-technology-into-a-130-year-old-company?autoStart=true#links