Computer vision inches toward “common sense” with Facebook’s latest research

This article draws on content published by TechCrunch. For more detail, see the original article link at the bottom of the page.

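Machine learning can do all sorts of things as long as it has the data to teach it how. Getting that data is not always easy, so researchers are always looking for ways to give AI a bit of “common sense.” With it, you would no longer have to show an AI 500 pictures of a cat before it recognizes one. Facebook’s latest research takes a big step toward easing this data bottleneck.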

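The company’s formidable AI research division has spent years working on how to advance and scale technologies such as sophisticated computer vision algorithms. It has made steady progress, and the results are generally shared with the rest of the research community. One particularly interesting development Facebook has pursued is what is called “semi-supervised learning.”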

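When people think about training an AI, they generally picture something like the 500 cat photos mentioned above. These images have been selected and labeled in advance (which can mean outlining the cat, drawing a box around it, or simply indicating that a cat appears somewhere in the image) so that a machine learning system can build an algorithm that automates cat recognition. Naturally, if you want to do the same for dogs or horses, you need 500 dog pictures, 500 horse pictures, and so on. In other words, the labeling effort scales linearly with the number of categories, and “linearly” is a word you never want to see in tech.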

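Semi-supervised learning, which is related to “unsupervised” learning, works out the important parts of a data set without using any labeled data at all. That does not mean it simply runs wild; there is still structure. For example, suppose you give the system 1,000 sentences to study and then show it 10 more with several words missing. The system can probably do a decent job of filling in the blanks based on the first 1,000 sentences. Doing the same with images and video is not so easy, because they are neither as straightforward nor as predictable.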
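The fill-in-the-blank idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the tiny corpus, the `___` blank marker, and the function names `train_bigrams` and `fill_blank` are all invented here, and simple word-pair counting stands in for the far richer models used in practice. The point is that the “answers” come from the raw text itself, with no human-written labels.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which word follows which across plain, unlabeled sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def fill_blank(following, words, blank="___"):
    """Replace each blank with the word most often seen after the previous word."""
    filled = []
    for word in words:
        # Only fill when there is a previous word that was seen during training.
        if word == blank and filled and following[filled[-1]]:
            word = following[filled[-1]].most_common(1)[0][0]
        filled.append(word)
    return filled


# Hypothetical miniature corpus standing in for the "1,000 sentences".
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
    "the cat sat on the sofa",
]
model = train_bigrams(corpus)
print(fill_blank(model, "the cat ___ on the mat".split()))
# prints ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

A blank with no preceding word is left untouched, since this sketch has nothing to condition on in that case.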

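Facebook’s researchers have shown that although it may not be easy, it is possible and in fact very effective. The DINO system (short for “DIstillation of knowledge with NO labels”) can learn to find objects of interest in videos of people, animals, and objects without any labeled data whatsoever.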

Image Credits: Facebook

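The AI does this by treating a video not as a sequence of images to be analyzed one by one in order, but as a complex, interrelated set, much like the difference between “a series of words” and “a sentence.” By attending to the middle and the end of the video as well as the beginning, the AI agent gets a sense of things like “an object with this general shape moves from left to right.” That information feeds into other knowledge. For example, when an object on the right overlaps with the first one, the system recognizes that the two are not the same thing and are merely touching in those frames. That knowledge can in turn be applied to other situations. In other words, the AI develops a basic sense of visual meaning, and it does so for new objects with remarkably little training.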

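The result is a computer vision system that is not only effective, performing well compared with conventionally trained systems, but also more relatable and explainable. For example, an AI trained on 500 dog pictures and 500 cat pictures will recognize both, but it will have no idea that they are similar in any way. DINO, although it cannot be specific about it, understands that the two are visually similar to one another, or at least more similar than either is to a car, and that metadata and context are visible in its memory. Dogs and cats are “closer” in its sort of digital cognitive space than dogs and mountains. These concepts can be seen as small clusters; in the image below, note how concepts of the same type sit close together.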

Image Credits: Facebook
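One common way to make “closer in a cognitive space” concrete is to compare feature vectors with cosine similarity. The sketch below assumes each concept is represented by a short vector; the three-dimensional numbers are made up by hand purely for illustration and are not DINO outputs, and `cosine_similarity` and `nearest` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (an assumption for illustration, not real model features).
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
    "mountain": [0.2, 0.1, 0.7],
}


def nearest(name):
    """Return the other concept whose vector is most similar to `name`."""
    others = (key for key in embeddings if key != name)
    return max(
        others,
        key=lambda key: cosine_similarity(embeddings[name], embeddings[key]),
    )


print(nearest("dog"))
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))
print(cosine_similarity(embeddings["dog"], embeddings["mountain"]))
```

With these toy numbers, the dog vector sits much nearer to the cat vector than to the mountain or car vectors, which is the kind of relationship the paragraph above describes.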

This has its own benefits of a technical sort that this article will not go into. If you are curious, the papers linked in Facebook’s blog post have more detail.

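There is also an adjacent research project, a training method called PAWS, which further reduces the need for labeled data. PAWS combines some of the ideas of semi-supervised learning with the more traditional supervised method, boosting training by letting the model learn from both labeled and unlabeled data.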
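To give a feel for learning from both labeled and unlabeled data, here is a minimal sketch of generic pseudo-labeling with nearest centroids. It is not the PAWS algorithm, whose details are in the papers; the 2-D points, the labels, and the function names are all assumptions chosen for illustration. A handful of labeled points define a starting centroid per class, unlabeled points borrow the label of the nearest centroid, and the centroids are then recomputed from everything.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(point, centroids):
    """Label of the centroid closest to `point`."""
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


def fit_with_pseudo_labels(labeled, unlabeled):
    # Step 1: build one centroid per class from the few labeled examples.
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    centroids = {label: centroid(pts) for label, pts in groups.items()}
    # Step 2: assign each unlabeled point the label of its nearest centroid.
    for point in unlabeled:
        groups[nearest_label(point, centroids)].append(point)
    # Step 3: recompute centroids from labeled and pseudo-labeled points together.
    return {label: centroid(pts) for label, pts in groups.items()}


# Hypothetical data: one labeled example per class, four unlabeled points.
labeled = [((0.0, 0.0), "cat"), ((10.0, 10.0), "dog")]
unlabeled = [(1.0, 1.0), (2.0, 0.0), (9.0, 11.0), (8.0, 9.0)]

centroids = fit_with_pseudo_labels(labeled, unlabeled)
print(centroids)
print(nearest_label((3.0, 3.0), centroids))
```

The design choice to show here is only the data flow: the unlabeled points refine the class centroids even though nobody labeled them by hand.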

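Facebook, of course, needs fast, high-quality image analysis for its many user-facing (and secret) image-related products. But these general advances in computer vision will no doubt also be welcomed by the developer community for other purposes.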

Image Credits: Facebook

[Original text]

Machine learning is capable of doing all sorts of things as long as you have the data to teach it how. That’s not always easy, and researchers are always looking for a way to add a bit of “common sense” to AI so you don’t have to show it 500 pictures of a cat before it gets it. Facebook’s newest research takes a big step toward reducing the data bottleneck.

The company’s formidable AI research division has been working for years now on how to advance and scale things like advanced computer vision algorithms, and has made steady progress, generally shared with the rest of the research community. One interesting development Facebook has pursued in particular is what’s called “semi-supervised learning.”

Generally when you think of training an AI, you think of something like the aforementioned 500 pictures of cats — images that have been selected and labeled (which can mean outlining the cat, putting a box around the cat or just saying there’s a cat in there somewhere) so that the machine learning system can put together an algorithm to automate the process of cat recognition. Naturally if you want to do dogs or horses, you need 500 dog pictures, 500 horse pictures, etc. — it scales linearly, which is a word you never want to see in tech.

Semi-supervised learning, related to “unsupervised” learning, involves figuring out important parts of a data set without any labeled data at all. It doesn’t just go wild, there’s still structure; for instance, imagine you give the system a thousand sentences to study, then showed it 10 more that have several of the words missing. The system could probably do a decent job filling in the blanks just based on what it’s seen in the previous thousand. But that’s not so easy to do with images and video — they aren’t as straightforward or predictable.

But Facebook researchers have shown that while it may not be easy, it’s possible and in fact very effective. The DINO system (which stands rather unconvincingly for “DIstillation of knowledge with NO labels”) is capable of learning to find objects of interest in videos of people, animals and objects quite well without any labeled data whatsoever.

Image Credits: Facebook

It does this by considering the video not as a sequence of images to be analyzed one by one in order, but as a complex, interrelated set, like the difference between “a series of words” and “a sentence.” By attending to the middle and the end of the video as well as the beginning, the agent can get a sense of things like “an object with this general shape goes from left to right.” That information feeds into other knowledge, like when an object on the right overlaps with the first one, the system knows they’re not the same thing, just touching in those frames. And that knowledge in turn can be applied to other situations. In other words, it develops a basic sense of visual meaning, and does so with remarkably little training on new objects.

This results in a computer vision system that’s not only effective — it performs well compared with traditionally trained systems — but more relatable and explainable. For instance, while an AI that has been trained with 500 dog pictures and 500 cat pictures will recognize both, it won’t really have any idea that they’re similar in any way. But DINO — although it couldn’t be specific — gets that they’re similar visually to one another, more so anyway than they are to cars, and that metadata and context is visible in its memory. Dogs and cats are “closer” in its sort of digital cognitive space than dogs and mountains. You can see those concepts as little blobs here — see how those of a type stick together:

Image Credits: Facebook

This has its own benefits, of a technical sort we won’t get into here. If you’re curious, there’s more detail in the papers linked in Facebook’s blog post.

There’s also an adjacent research project, a training method called PAWS, which further reduces the need for labeled data. PAWS combines some of the ideas of semi-supervised learning with the more traditional supervised method, essentially giving the training a boost by letting it learn from both the labeled and unlabeled data.

Facebook of course needs good and fast image analysis for its many user-facing (and secret) image-related products, but these general advances to the computer vision world will no doubt be welcomed by the developer community for other purposes.

(Written by Devin Coldewey; translated by Nariko Mizoguchi)

Original article: https://jp.techcrunch.com/2021/05/03/2021-04-30-computer-vision-inches-towards-common-sense-with-facebooks-latest-research/

Artificial intelligence / AI #Facebook #ComputerVision #MachineLearning

COMMENTS

26280：

2021-05-04 11:07

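Hmm, reading this after spending Golden Week binge-watching Netflix makes me feel a bit guilty... / Computer vision inches toward “common sense” with Facebook’s latest research | TechCrunch Japan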
