Meta researchers develop an AI that learns from images, speech, and text in the same way

This post introduces the article "Meta researchers develop an AI that learns from images, speech, and text in the same way."


This post is based on an article published by TechCrunch. For more detail, see the link to the original article at the bottom of the page.

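The AI field sees constant progress, but advances tend to be confined to a single domain. For example, a cool new method for generating synthetic speech is of no use for recognizing expressions on human faces.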

Researchers at Meta (formerly Facebook) are working on something more versatile: an AI that can learn well on its own, whether the material is spoken language, written text, or visual input.

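The traditional way to train an AI model to interpret something correctly is to feed it a large number (millions) of labeled examples: a photo of a cat labeled "cat," or a conversation with the speakers and words transcribed. That approach has fallen out of favor, because researchers found that manually building databases of the size needed to train next-generation AI is no longer feasible. Who wants to label 50 million cat photos? A few people might, but who wants to label 50 million photos of common fruits and vegetables?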

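Some of the most promising AI systems today are what are called "self-supervised": models that process large quantities of unlabeled data, such as books or video of people interacting, and build their own structured understanding of the rules of the system. For example, by reading a thousand books, such a model can learn the relative positions of words and ideas about grammatical structure without anyone telling it what an object, an article, or a comma is. In other words, it draws inferences from many examples.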
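
To make the idea of learning from unlabeled text concrete, here is a minimal sketch in Python. It is a toy illustration only, not the method used by any system mentioned in the article: the function names `build_context_model` and `predict_masked` and the three-sentence corpus are made up for this example. The "label" for each training example is simply a word hidden from the raw text, so no human annotation is involved.

```python
from collections import Counter, defaultdict


def build_context_model(sentences):
    """Count which word appears between each (left, right) neighbour pair.

    No human labels are used: the training target for each position is
    the word that the raw text itself contains there.
    """
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words) - 1):
            context = (words[i - 1], words[i + 1])
            model[context][words[i]] += 1
    return model


def predict_masked(model, left, right):
    """Return the word seen most often between left and right, or None."""
    candidates = model.get((left.lower(), right.lower()))
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical unlabeled corpus (plain sentences, no annotations).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird sat on the fence",
]

model = build_context_model(corpus)
print(predict_masked(model, "sat", "the"))  # fills the gap in "sat ___ the"
```

Real self-supervised models replace the lookup table with a neural network and train on vastly more data, but the training signal comes from the same place: the data itself.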

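This intuitively feels closer to how people learn, which is one reason researchers like it. But these models still tend to be single-modal: the work done to build a semi-supervised learning system for speech recognition does not carry over to image analysis at all, because the two are simply too different. That is where Facebook/Meta's latest research, the catchily named data2vec, comes in.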

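The idea behind data2vec was to build an AI framework that learns in a more abstract way. Starting from scratch, you can give it books to read, images to scan, or speech to listen to, and after a bit of training it learns any of them. It is a bit like starting with a single seed that grows into a daffodil, a pansy, or a tulip depending on the fertilizer you give it.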
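
The "one seed, different plant food" idea can be sketched as a single learning routine applied unchanged to different kinds of data. The Python below is a toy under stated assumptions, and it is not data2vec's actual algorithm, which the article does not describe. The helper names (`to_tokens`, `learn_next_item`, `predict_next`) and the sample data are hypothetical, and the learning objective (predict the next item from the current one) was chosen only because it is easy to follow.

```python
from collections import Counter, defaultdict


def to_tokens(data):
    """Turn any modality into a flat list of hashable tokens.

    Hypothetical preprocessing: text becomes a list of characters, and
    numeric data (a pixel row, audio samples) is used as-is.
    """
    return list(data)


def learn_next_item(tokens):
    """Self-supervised objective: count which item follows each item."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def predict_next(table, item):
    """Most frequently observed successor of item, or None if unseen."""
    followers = table.get(item)
    return followers.most_common(1)[0][0] if followers else None


# Toy stand-ins for three modalities (all made up for illustration).
samples = {
    "text": "abcabcabc",
    "image_row": [0, 128, 255, 0, 128, 255],
    "audio": [3, 7, 3, 7, 3, 7],
}

# The same two functions are applied to every modality without changes.
models = {name: learn_next_item(to_tokens(data)) for name, data in samples.items()}

print(predict_next(models["text"], "a"))
print(predict_next(models["image_row"], 128))
print(predict_next(models["audio"], 3))
```

The point of the sketch is the shape of the design: only the input data changes between modalities, while the learning code stays the same.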

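When data2vec was tested after training on various data (speech, images, text), it was competitive with, and sometimes even outperformed, similarly sized dedicated models for each modality. (That is, if all the models are limited to 100 megabytes, data2vec does better, although specialized models would probably still overtake it as they grow.)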

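“The core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar,” the team wrote in a blog post. “We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks.”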

“People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do,” CEO Mark Zuckerberg commented on the research.

This is still early-stage research, so do not expect the fabled "general AI" to emerge all of a sudden.

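Still, an AI with a generalized learning structure that works across a variety of domains and data types seems like a better, more elegant solution than the fragmented collection of micro-intelligences we rely on today.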

The code for data2vec is open source, and it is available here along with several pretrained models.

Image credit: Andriy Onufriyenko / Getty Images

[Original text]

Advances in the AI realm are constantly coming out, but they tend to be limited to a single domain: For instance, a cool new method for producing synthetic speech isn’t also a way to recognize expressions on human faces. Meta (AKA Facebook) researchers are working on something a little more versatile: an AI that can learn capably on its own whether it does so in spoken, written or visual materials.

The traditional way of training an AI model to correctly interpret something is to give it lots and lots (like millions) of labeled examples. A picture of a cat with the cat part labeled, a conversation with the speakers and words transcribed, etc. But that approach is no longer in vogue as researchers found that it was no longer feasible to manually create databases of the sizes needed to train next-gen AIs. Who wants to label 50 million cat pictures? Okay, a few people probably — but who wants to label 50 million pictures of common fruits and vegetables?

Currently some of the most promising AI systems are what are called self-supervised: models that can work from large quantities of unlabeled data, like books or video of people interacting, and build their own structured understanding of what the rules of the system are. For instance, by reading a thousand books a model will learn the relative positions of words and ideas about grammatical structure without anyone telling it what objects or articles or commas are; it gets there by drawing inferences from lots of examples.

This feels intuitively more like how people learn, which is part of why researchers like it. But the models still tend to be single-modal, and all the work you do to set up a semi-supervised learning system for speech recognition won’t apply at all to image analysis — they’re simply too different. That’s where Facebook/Meta’s latest research, the catchily named data2vec, comes in.

The idea for data2vec was to build an AI framework that would learn in a more abstract way, meaning that starting from scratch, you could give it books to read or images to scan or speech to sound out, and after a bit of training it would learn any of those things. It’s a bit like starting with a single seed, but depending on what plant food you give it, it grows into a daffodil, pansy or tulip.

Testing data2vec after letting it train on various data corpora showed that it was competitive with and even outperformed similarly sized dedicated models for that modality. (That is to say, if the models are all limited to being 100 megabytes, data2vec did better; specialized models would probably still outperform it as they grow.)

“The core idea of this approach is to learn more generally: AI should be able to learn to do many different tasks, including those that are entirely unfamiliar,” wrote the team in a blog post. “We also hope data2vec will bring us closer to a world where computers need very little labeled data in order to accomplish tasks.”

“People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do,” commented CEO Mark Zuckerberg on the research.

This is still early stage research, so don’t expect the fabled “general AI” to emerge all of a sudden — but having an AI that has a generalized learning structure that works with a variety of domains and data types seems like a better, more elegant solution than the fragmented set of micro-intelligences we get by with today.

The code for data2vec is open source; it and some pretrained models are available here.

(Written by Devin Coldewey; translated by Hirokazu Kusakabe)

Original article: https://jp.techcrunch.com/2022/01/22/2022-01-20-meta-researchers-build-an-ai-that-learns-equally-well-from-visual-written-or-spoken-materials/
