Video produced by Steve Nathans-Kelly
Content is a critical asset to many companies. However, it is often expressed in formats ideally suited for transmittal and display, not as data for analysis.
In a Data Summit 2019 presentation, Bob Kasenchak, director of business development at Access Innovations, showed how to deal with unreliably or inadequately indexed extracted source data using controlled subject indexing.
"It's extremely helpful to have some kind of topical indexing, categorization, tags, call them what you want. We call them taxonomies. This information may or may not be included in the source data that you're extracting the information from and even if it is, it might not be reliable," said Kasenchak.
DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.
This brings us to controlled subject indexing: tagging, semantic enrichment, taxonomy, said Kasenchak. "Now, I'm a taxonomist, so I can speak at length about this topic, but I won't; I'll keep it brief for now. But any time you have a large number of objects, it is extremely useful to categorize them for analytical purposes. So for things like vast quantities of scholarly content, useful categories can include things that are found in the metadata, like the document type, the author, which journal it was published in, and some quantifiable data like the number of authors and so on."
But to really analyze the content, some kind of taxonomy or thesaurus describing the domain that the paper is about is required.
"The classification of digital objects by topic allows us to define, codify, group, and relate large amounts of data, content data, for analysis, reporting, and information retrieval: basically, search. So, discovery is broader than search; of course, browse and search are quite different."
There is no way to organize this quantity of information without some kind of classification scheme, he said. "This all comes to us by means of the old physical library problem that I have a book and I need to put it somewhere where people can find it; I only have one copy of the book and I need to have some kind of scheme to figure out where to put it so that people can get it. Now, of course, with digital objects, you don't just have one copy or a single place. But when you have 900,000 papers, it's kind of impossible to browse through to find what you're looking for and you need some kind of classification."
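As a rough illustration of the kind of controlled subject indexing Kasenchak describes, the following Python sketch tags a few papers against a small taxonomy and groups them by topic. The taxonomy terms, trigger phrases, sample titles, and function names are all hypothetical, and simple phrase matching stands in here for whatever indexing method is actually used in practice.

```python
from collections import defaultdict

# Hypothetical taxonomy: preferred term -> trigger phrases.
# These terms and phrases are invented for illustration only.
TAXONOMY = {
    "Machine learning": ["neural network", "deep learning", "classifier"],
    "Genomics": ["genome", "dna sequencing", "gene expression"],
}


def tag_document(text, taxonomy=TAXONOMY):
    """Return the sorted taxonomy terms whose trigger phrases appear in the text."""
    lowered = text.lower()
    return sorted(
        term
        for term, phrases in taxonomy.items()
        if any(phrase in lowered for phrase in phrases)
    )


def build_index(documents, taxonomy=TAXONOMY):
    """Map each taxonomy term to the IDs of the documents tagged with it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term in tag_document(text, taxonomy):
            index[term].append(doc_id)
    return dict(index)


# Made-up sample records standing in for scholarly papers.
PAPERS = {
    "p1": "A deep learning classifier for gene expression data",
    "p2": "Advances in DNA sequencing throughput",
    "p3": "Notes on medieval trade routes",
}

INDEX = build_index(PAPERS)
```

In this sketch a paper can carry more than one tag, so it can be found under several topics at once, unlike a single physical book that has to sit on one shelf. A paper that matches no term, such as "p3" here, stays untagged and would need a broader taxonomy or manual review.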
Many presenters have made their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.
To access the video of the full presentation, "From Structured Text to Knowledge Graphs: Creating RDF Triples From Published Scholarly Data," go to https://datasummit.brightcovegallery.com/detail/video/6040884584001/a204.-from-structured-text-to-knowledge-graphs:-creating-rdf-triples-from-published-scholarly-data?autoStart=true&q=kasenchak#links