Video produced by Steve Nathans-Kelly
In a presentation at Data Summit 2019 titled "From Structured Text to Knowledge Graphs: Creating RDF Triples from Published Scholarly Data," Access Innovations' Bob Kasenchak explained how to disambiguate duplicate named entities during data extraction and conversion.
"One of the common sticking points in this kind of data extraction and conversion is the problem of named entities," said Kasenchak. "These problems come in several flavors: Any sufficiently large dataset of academic literature, you will certainly find authors with the same names. In older data, you will also find authors who only provided their first initial. It was very common until about the mid-1960s for you to just sign your paper J. Smith, Harvard University."
As a result, it can be difficult to tell authors apart. "Now, there are some resources for this, persistent identifiers for researcher names. ORCID is great, but it has its own problems, and it's really only relevant for living authors and newer data. Those persistent identifiers don't exist for authors in your legacy content. But the most difficult problem is author name duplication and disambiguation. For institutions (educational institutions, but also research institutions, corporations, government agencies, and other organizations that have authors affiliated with them in the content), the problem is similar. In most legacy data, there are freely entered versions of names which may or may not include department information, abbreviations, and so on."
As an example, Kasenchak showed a list of names from a dataset of actual names from published academic content. "The duplicates in this excerpt indicate that a text string, like B D Silverman, occurred more than once. Is Arthur Silverman the same person as Arthur R. Silverman? Are B. G. and Barry G. and Barry George Silverman the same person in all cases? I will add that this particular dataset had about 905,000 articles with an average of 3.5 authors per paper."
That comes to almost 4.5 million raw names that had to be disambiguated. "When we got it done, we had a much cleaner list. We had it down to about 900,000 unique names, which is still a lot of person names. Several things are interesting about this: Clearly, at the end of the day, even with a clean list, you're going to have duplicate names, actual people whose names are the same text string. That means a name is a poor candidate for the triple statement; you have to have an ID number attached to that person, and you can't just use the string of their name as the main object. But what's really interesting is how to distinguish the names from one another and how to collapse records for the same person expressed differently, like old Barry George Silverman here, into a single record."
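To make that point concrete, here is a minimal sketch (mine, not Kasenchak's) using Python's rdflib library. The namespace and author IDs are hypothetical; the idea is that an opaque identifier, not the name string, serves as the subject of the triples, so two different people can share the same name literal without colliding.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import FOAF

# Hypothetical namespace for minted author IDs; not from the presentation.
EX = Namespace("http://example.org/author/")

g = Graph()
g.bind("foaf", FOAF)

# Two distinct authors who happen to share the same name string.
# The opaque IDs keep them apart; the name is just a literal property.
for author_id, name in [("A0001", "Barry G. Silverman"),
                        ("A0002", "Barry G. Silverman")]:
    g.add((EX[author_id], RDF.type, FOAF.Person))
    g.add((EX[author_id], FOAF.name, Literal(name)))

print(g.serialize(format="turtle"))
```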
As a result, a process was developed to look at the other metadata around the names. "Taking each of the 4.5 million instances of an author name, the first task is to determine which authors to compare. There's no need to compare every author on the list of 4.5 million with every other author; Mr. Joung and Mr. Smith probably aren't the same person. So you need to do some clustering to determine likely matches using Levenshtein distances and other kinds of NLP methods. B. G. Silverman and Barry George Silverman are pretty far apart Levenshtein-wise because of the number of characters in the text string."
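That clustering step might be sketched roughly as follows; this is an illustration under my own assumptions, not the actual Access Innovations pipeline. It blocks names on last name so only plausible pairs are compared, then applies a plain Levenshtein edit distance within each block (the threshold is an invented value):

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

names = ["B G Silverman", "Barry G Silverman", "Barry George Silverman",
         "A Silverman", "J Smith"]

# Block on last name so Silverman is never compared against Smith.
blocks = defaultdict(list)
for name in names:
    blocks[name.split()[-1].lower()].append(name)

# Within each block, flag pairs whose edit distance is small enough
# to warrant a metadata comparison. The threshold of 10 is an
# illustrative guess, not a tuned value.
for block in blocks.values():
    for i, x in enumerate(block):
        for y in block[i + 1:]:
            if levenshtein(x, y) <= 10:
                print(f"candidate match: {x!r} ~ {y!r}")
```

Blocking on the last name is what keeps B. G. Silverman and Barry George Silverman in the same candidate cluster even though, as Kasenchak notes, their raw edit distance is large.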
According to Kasenchak, if you use the last name, you can cluster together the things you want to compare to each other. "Once we had clusters of similar names, we compared the other metadata in the paper: What topic did they publish on? Who were their co-authors? What institution were they affiliated with? And for newer content, do they have the same email address? That's a pretty good indicator, so we liked it when that happened. This analysis provides a series of similarity factors which we could weight and compare. Email matches are obviously a very strong indicator, then co-author and topic matches, and so forth. So we had this series of weights with which we compared all these names to disambiguate them. And at the end of the day, we had some edge cases where the algorithm was unable to say yes or no, and we had to have humans actually go through and review some of these things to try to make a clean list."
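A weighted-factor comparison of the kind described could be sketched as below; the factors follow the quote, but the weights and thresholds are invented for illustration, not the values used by Access Innovations.

```python
# Hypothetical weights for each similarity factor; the actual values
# used in the project are not given in the presentation.
WEIGHTS = {
    "email_match": 5.0,      # very strong signal
    "coauthor_overlap": 2.0,
    "topic_overlap": 1.5,
    "affiliation_match": 1.0,
}

MATCH_THRESHOLD = 4.0    # scores at or above this: same person
REVIEW_THRESHOLD = 2.0   # scores in between: route to human review

def score_pair(a: dict, b: dict) -> float:
    """Sum the weights of every factor on which two author records agree."""
    s = 0.0
    if a.get("email") and a.get("email") == b.get("email"):
        s += WEIGHTS["email_match"]
    if set(a.get("coauthors", [])) & set(b.get("coauthors", [])):
        s += WEIGHTS["coauthor_overlap"]
    if set(a.get("topics", [])) & set(b.get("topics", [])):
        s += WEIGHTS["topic_overlap"]
    if a.get("affiliation") and a.get("affiliation") == b.get("affiliation"):
        s += WEIGHTS["affiliation_match"]
    return s

def decide(a: dict, b: dict) -> str:
    s = score_pair(a, b)
    if s >= MATCH_THRESHOLD:
        return "merge"
    if s >= REVIEW_THRESHOLD:
        return "human review"   # the edge cases Kasenchak mentions
    return "distinct"

a = {"email": "bsilverman@example.edu", "coauthors": ["J. Doe"],
     "topics": ["agent simulation"], "affiliation": "Univ. of Pennsylvania"}
b = {"email": "bsilverman@example.edu", "coauthors": ["R. Roe"],
     "topics": ["agent simulation"], "affiliation": "University of Pennsylvania"}
print(decide(a, b))  # -> "merge" (email match + topic overlap = 6.5)
```

Pairs that land between the two thresholds model the edge cases the algorithm could not decide, which were sent to human reviewers.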
Institutional names present a slightly different challenge, said Kasenchak. "Especially the way that institutional names are expressed in free-text entry fields, and especially in legacy data converted to digital. Again, this is an actual list of text strings I extracted from actual published scholarly data. It's very easy for a human to look at these and say, 'Oh, well, these are the same, and the other ones are not the same.' But it's much harder to write a script to do this, especially when you don't have a canonical list of names to begin with, because of all the different abbreviations, iterations, and name changes institutions go through. Then throw in some non-primary-English names and you have yourself another data mess, including transliterations and all kinds of other things. So text analytics could be of some help here."
Pulling words in certain proximities to each other and parsing on commas would help, but there's only so much you can do, and you're going to have to deploy some human review at the end of your process, he noted. "So, just to round out the story: this dataset, with its four-and-a-half million author names from about 900,000 records, ended up with something like 30,000 unique institution names. And again, it's better to use some kind of ID number to represent these in your data than to try to use the actual institution name string, because they're too similar."
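A first-pass normalization of institution strings along the lines described (parsing on commas, expanding common abbreviations) might look like this rough sketch; the abbreviation table is illustrative only, and, as Kasenchak notes, human review is still needed downstream.

```python
import re

# Illustrative expansions; a real project would need a much larger,
# curated table (and it still wouldn't catch name changes or
# transliterations).
ABBREVIATIONS = {
    "univ": "university",
    "inst": "institute",
    "dept": "department",
    "natl": "national",
}

def normalize_institution(raw: str) -> str:
    """Crude canonical key: take the segment before the first comma
    (often the institution proper, ahead of city/state details),
    lowercase it, strip punctuation, and expand common abbreviations."""
    head = raw.split(",")[0].lower()
    tokens = re.findall(r"[a-z]+", head)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    # Drop filler tokens that vary between records.
    tokens = [t for t in tokens if t not in {"department", "of"}]
    return " ".join(tokens)

for s in ["Univ. of Pennsylvania, Philadelphia, PA",
          "University of Pennsylvania",
          "Dept. of Computer Science, University of Pennsylvania"]:
    print(f"{s!r} -> {normalize_institution(s)!r}")
```

The first two strings collapse to the same key, while the third shows a failure mode (the department listed before the institution) of exactly the kind that pushes such records to human review.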
Many presenters have made their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.
To access the video of the full presentation, "From Structured Text to Knowledge Graphs: Creating RDF Triples from Published Scholarly Data," go to https://datasummit.brightcovegallery.com/detail/video/6040884584001/a204.-from-structured-text-to-knowledge-graphs:-creating-rdf-triples-from-published-scholarly-data?autoStart=true&q=kasenchak#links