Outperforming Humans
Machine Learning (ML) is beginning to outperform humans at many tasks that seemingly
require intelligence. The hype about ML has even reached the mass media.
ML can read lips, recognize faces, or transform speech to text. But
when ML has to deal with the ambiguity, variety and richness of
language, when it has to understand text or extract knowledge, ML
continues to need human experts.
Knowledge is Stored as Text
The Web is certainly our greatest knowledge source. However, the Web was
designed to be consumed by humans, not by machines. The Web’s
knowledge is mostly stored in text and spoken language, enriched with
images and video. It is not a structured relational database storing
numeric data in machine-processable form.
Text is Multilingual
The Web is also very multilingual. Recent statistics show that, surprisingly,
only 27% of the Web’s content is in English and only 21% is in the next 5
most used languages. That means more than half of its knowledge (the
remaining 52%) is expressed in a long tail of other languages.
Constraints of Machine Learning
ML faces some serious challenges. Even with today’s availability of
hardware, the demand for computing power can become astronomical when
input and desired output are rather fuzzy (see the great NYT article "The Great A.I. Awakening").
ML is great for 80/20 problems, but it is dangerous in contexts with high accuracy needs: “Digital assistants on personal smartphones can get away with mistakes, but for some business applications the tolerance for error is close to zero”, emphasizes Nikita Ivanov of Datalingvo, a Silicon Valley startup.
ML performs well on n-to-1 questions. For instance, in face recognition, “which person do all these pixels show?” has only one correct answer. However, ML struggles in n-to-many or gradual circumstances: there are many ways to translate a text correctly or to express a certain piece of knowledge.
ML is only as good as its available relevant training material. For many
tasks mountains of data are needed, and the data had better be of excellent
quality. For language-related tasks these mountains of data are often
required per language and per domain. Further, it is also hard to decide
when the machine has learned enough.
Is Monolingual ML Good Enough?
Some suggest processing everything in English. ML also does an OK job
at Machine Translation, as Google Translate shows. So why not translate
everything into English and then run our ML algorithms? This is a
very dangerous approach because errors compound. If the output of an 80%
accurate Machine Translation becomes the input to an 80% accurate
Sentiment Analysis, the combined accuracy drops to about 64% (0.8 × 0.8),
assuming the two stages err independently. At that hit rate you are
getting close to flipping a coin.
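The compounding effect is easy to check with a few lines of arithmetic. The following is a minimal sketch, not a model of any real system: the function name `pipeline_accuracy` is hypothetical, and it assumes that the stages make their errors independently and that a later stage cannot repair an earlier mistake.

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """Estimate end-to-end accuracy of chained processing stages.

    Assumption: errors are independent across stages, and a downstream
    stage never recovers from an upstream error.
    """
    return prod(stage_accuracies)


# Machine Translation (80% accurate) feeding Sentiment Analysis (80% accurate).
combined = pipeline_accuracy([0.80, 0.80])
print(f"Two stages:   {combined:.0%}")

# Adding a third 80% stage (hypothetical) pushes the result close to a coin flip.
three_stages = pipeline_accuracy([0.80, 0.80, 0.80])
print(f"Three stages: {three_stages:.0%}")
```

Under these assumptions every additional stage lowers the overall hit rate, which is why chaining translation in front of another ML task is risky.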
Human Knowledge to Help
The world is innovating constantly. Every day new products and services are
created. To talk about them we continuously craft new words: the bumpon, the ribbon, a plug-in hybrid, TTIP. Only through the innovative force of language can we communicate new things.
Struggle with Rare Words
By definition, new words are rare. They first appear in one language and
then may slowly propagate into other domains or languages. There is no
knowledge without these rare words, the terms. Look at a typical product
catalog description with the terms highlighted. Now imagine this
description without the terms: it would be nothing but a meaningless
scaffold of filler words.
Knowledge Training Required
At university we acquire the specific language, the terminology, of the
field we are studying. We become experts in that domain. But even so,
later in our professional career when we change jobs we still have to
acquire the lingo of the new company: names of products, modules,
services, but also job roles and their titles, names for departments,
processes, etc. We get familiar with a specific corporate language by
attending training, by reading policies, specifications, and functional
descriptions. Machines need to be trained in the very same way with that
explicit knowledge and language.
Multilingual Knowledge Systems Boost ML with Knowledge
There is a remedy: terminology databases, enterprise vocabularies, word
lists, glossaries – organizations usually already own an inventory of
“their” words. This invaluable data can be leveraged to boost ML with
human knowledge by transforming these inventories into a Multilingual Knowledge System
(MKS). An MKS not only captures all words in all registers in all
languages, but also structures them into a knowledge graph (a 'convertible'
IS-A 'car' IS-A 'vehicle'…, 'front fork' IS-PART-OF 'frame' IS-PART-OF
'bicycle').
It is the human-curated Multilingual Knowledge System that enables ML and Artificial Intelligence solutions to work for specific domains with only small amounts of textual data, and also for less-resourced languages.
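To make the knowledge-graph idea concrete, here is a minimal sketch of how the example relations mentioned above could be stored and queried. Everything in it is an assumption made for illustration: the class `KnowledgeGraph`, its method names, the relation spelling `IS-PART-OF`, and the German label are hypothetical and do not describe how any particular MKS product is implemented.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy concept graph with typed relations and per-language labels."""

    def __init__(self):
        # relation name -> {child concept -> parent concept}
        self._edges = defaultdict(dict)
        # concept -> {language code -> list of terms}
        self._labels = defaultdict(lambda: defaultdict(list))

    def add_relation(self, child, relation, parent):
        self._edges[relation][child] = parent

    def add_label(self, concept, lang, term):
        self._labels[concept][lang].append(term)

    def ancestors(self, concept, relation):
        """Follow one relation type upward and return the chain of parents."""
        chain, seen = [], {concept}
        current = self._edges[relation].get(concept)
        while current is not None and current not in seen:  # guard against cycles
            chain.append(current)
            seen.add(current)
            current = self._edges[relation].get(current)
        return chain

    def labels(self, concept, lang):
        return list(self._labels[concept][lang])


kg = KnowledgeGraph()
kg.add_relation("convertible", "IS-A", "car")
kg.add_relation("car", "IS-A", "vehicle")
kg.add_relation("front fork", "IS-PART-OF", "frame")
kg.add_relation("frame", "IS-PART-OF", "bicycle")
kg.add_label("convertible", "en", "convertible")
kg.add_label("convertible", "de", "Cabrio")  # illustrative label only

print(kg.ancestors("convertible", "IS-A"))
print(kg.ancestors("front fork", "IS-PART-OF"))
print(kg.labels("convertible", "de"))
```

Because the concepts are separate from their labels, a term found in any language can be mapped to the same concept and then generalized along the IS-A or IS-PART-OF chain, which is the kind of curated knowledge an ML component could draw on when training text is scarce.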