AssemblyAI, an all-in-one API for transcribing, summarizing, and moderating audio, raises 3.21 billion yen

This article is based on content published by TechCrunch. For more detail, see the original article link at the bottom of the page.

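It is clear that audio and video content and interfaces have exploded over the past few years, but the ways of handling that media are still catching up. Against that backdrop, AssemblyAI, with $28 million (approx. 3.21 billion yen) in new funding, aims to become the leading solution for speech analysis. Using the company's ultra-simple API, customers can transcribe, summarize, and otherwise make sense of what is happening in thousands of audio streams at once.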

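Multimedia has become the standard for many things in an incredibly short time: phone calls and meetings have become video calls, social media posts have become 10-second clips, and chatbots have learned to speak and understand speech. Countless new applications are appearing, and as in any new, growing industry, running those applications well or building something new on top of them requires being able to work with the data they generate.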

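The problem is that audio is not inherently easy to work with. How would you "search" an audio stream? You could look at the waveform or listen all the way through, but it is better to transcribe it first and then search the resulting text. That is where AssemblyAI comes in. There are many speech transcription services, but they often cannot be easily integrated into your own app or business process.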
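The "transcribe first, then search the text" idea can be sketched as follows. This is a minimal illustration, assuming a transcript is already available as a list of words with start times in seconds. The `Word` type, the `search_transcript` function, and the sample data are hypothetical and are not part of AssemblyAI's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str     # one transcribed word
    start: float  # start time in seconds (assumed transcript format)


def search_transcript(words: List[Word], phrase: str) -> List[float]:
    """Return the start time of every occurrence of `phrase` in the transcript."""
    target = [t.lower() for t in phrase.split()]
    if not target:
        return []
    # Normalize case and strip simple punctuation so "show." matches "show".
    tokens = [w.text.lower().strip(".,!?\"'") for w in words]
    hits = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i].start)
    return hits


# Hypothetical sample transcript.
transcript = [
    Word("Welcome", 0.0), Word("to", 0.4), Word("the", 0.5), Word("show.", 0.6),
    Word("Today", 1.2), Word("the", 1.6), Word("show", 1.7), Word("covers", 2.0),
    Word("audio", 2.4), Word("search.", 2.8),
]
print(search_transcript(transcript, "the show"))  # [0.5, 1.6]
```

Once the audio is text with timestamps, a search result can point back to the exact moment in the recording, which is not practical with the raw waveform.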

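"If you want to moderate, search, or summarize audio content, you have to convert the data into a format that is more flexible and that you can build features and business processes on top of," says Dylan Fox, CEO and co-founder of AssemblyAI. "So we decided to build a super-accurate speech analysis API that anyone can use, even at a hackathon, like Twilio or Stripe. People need a lot of help to put these features together, but they do not want to stitch together too many service providers to do it."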

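AssemblyAI offers several APIs that can be called extremely simply (in a line or two of code) to perform tasks such as "check this podcast for prohibited content," "identify the speakers in this conversation," or "summarize this meeting in under 100 words."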
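To give a feel for what "a line or two of code" might look like from the caller's side, here is a rough sketch that builds (but does not send) a single HTTP request asking for several analyses at once. Everything in it is an assumption for illustration: the endpoint URL is a placeholder, and the field names `content_moderation`, `speaker_labels`, and `summarization` are hypothetical, so the real names should be taken from AssemblyAI's own documentation.

```python
import json
import urllib.request

# Placeholder endpoint, NOT a real AssemblyAI URL.
API_URL = "https://api.example.com/v2/transcript"


def build_analysis_request(audio_url, api_key, *, moderate=False,
                           speakers=False, summarize=False):
    """Build a POST request that asks for a transcript plus optional analyses."""
    payload = {"audio_url": audio_url}
    # Hypothetical feature flags; the real field names may differ.
    if moderate:
        payload["content_moderation"] = True
    if speakers:
        payload["speaker_labels"] = True
    if summarize:
        payload["summarization"] = True
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_analysis_request(
    "https://example.com/podcast.mp3", "YOUR_API_KEY",
    moderate=True, summarize=True,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

The design point is that one request carries every analysis the caller wants, so there is no need to chain separate providers for transcription, moderation, and summarization.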

Code it, call it, done (Image Credits: AssemblyAI)

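You may well be skeptical, as I was, that a small company can build tools that handle so many tasks so simply, given how complex this work becomes once you dig into it. Fox acknowledged that it is a difficult challenge, but said that "the technology has come a long way in a short time."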

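"The accuracy of these models has improved rapidly, especially over the last few years. Summarization, sentiment identification... they have all become really good. And we are actually pushing the state of the art. We are one of the few startups doing large-scale deep learning research, so our models are better than what is generally out there. We will spend more than $1 million (approx. 115 million yen) on GPUs and compute for R&D and training in the next few months alone."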

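It may be hard to grasp intuitively because it cannot be easily demonstrated, but language models have advanced just as image generation (the "This XX does not exist" variety) and computer vision (face authentication, security cameras) have. GPT-3 is of course a familiar example, but Fox points out that understanding and generating written language and analyzing conversation and casual speech are practically entirely separate research areas. So although advances in machine learning techniques (such as transformers and new, more efficient training frameworks) have contributed to both, in most respects the two are like apples and oranges (both are fruit, but their other attributes differ).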

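In any case, it is now possible to run effective moderation or summarization on audio ranging from a few seconds to about an hour simply by calling the API. This is very useful when developing or integrating a feature such as short-form video. If you expect, say, 100,000 clips to be uploaded every hour, how do you run a first-pass screening to confirm they are not porn, scams, or duplicates? And how long will your launch be delayed while you build that screening process?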
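To put the 100,000-clips-per-hour figure in perspective, the back-of-the-envelope calculation below estimates how many clips would be in flight at once, using Little's law (items in progress = arrival rate × time per item). The 30-second processing time per clip is purely an assumed number for illustration, and `required_concurrency` is a hypothetical helper, not something from the article.

```python
import math


def required_concurrency(clips_per_hour: int, seconds_per_clip: float) -> int:
    """Estimate how many clips are being screened at any moment (Little's law)."""
    arrival_rate = clips_per_hour / 3600  # clips arriving per second
    return math.ceil(arrival_rate * seconds_per_clip)


clips_per_hour = 100_000   # figure from the article
seconds_per_clip = 30.0    # ASSUMED average screening time per clip

print(round(clips_per_hour / 3600, 1))                         # 27.8 clips per second
print(required_concurrency(clips_per_hour, seconds_per_clip))  # 834
```

Under that assumption, a first-pass screen would need to handle several hundred clips concurrently, which suggests why building it in-house could delay a launch.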

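Fox hopes that companies in this position will be able to choose an easy, effective path, just as they would when faced with adding a payment process. In other words, you could build the feature yourself from scratch, or you could add Stripe in 15 minutes. This is not only fundamentally desirable, it also clearly sets the company apart from the complex, multi-service packages that contain the audio analysis products offered by big providers such as Microsoft and Amazon.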

Fox, answering the interview (Image Credits: Jens Panduro)

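The company already has hundreds of paying customers, tripled its revenue over the course of 2021, and now processes one million audio streams a day. "We're 100% live. There is a big market and a big need, and customers are paying," Fox says.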

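The $28 million (approx. 3.21 billion yen) A round was led by Accel, with participation from Y Combinator, John and Patrick Collison (Stripe), Nat Friedman (GitHub), and Daniel Gross (Pioneer). The plan is to put the full amount toward hiring, R&D infrastructure, and building out the product pipeline. As Fox noted, the company will spend $1 million (approx. 115 million yen) on GPUs and servers over the next few months (a large number of NVIDIA A100s that will support incredibly compute-intensive research and training processes). Otherwise it would keep paying for cloud services, so it is better to stop renting early.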

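On hiring, I asked whether the company might struggle, since it competes directly with Google and Facebook, which are investing heavily in audio analysis technology. Fox was optimistic, however, because he feels the culture at such large companies is slow and stifling.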

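"I think really good AI researchers and engineers definitely want to work at the cutting edge, and at the same time at the cutting edge of real-world production," he says. "You come up with something innovative and a few weeks later it is in the product... a startup is the only place where you can do that."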

Image Credits: AssemblyAI

[Original text]

The explosion in audio and video content and interfaces over the last few years has been plain to see, but ways of dealing with all that media behind the scenes haven’t quite caught up. AssemblyAI, powered by $28 million in new funding, is aiming at becoming the go-to solution for analyzing speech, offering ultra-simple API access for transcribing, summarizing and otherwise figuring out what’s going on in thousands of audio streams at a time.

Multimedia has become the standard for so many things in an incredibly short time: phone calls and meetings became video calls, social media posts became 10-second clips, chatbots learned to speak and understand speech. Countless new applications are appearing, and like any new and growing industry, people need to be able to work with the data those applications produce in order to run them well or build something new on top of them.

The problem is audio isn’t naturally easy to work with. How do you “search” an audio stream? You could look at the waveform or scrub through it, but more likely you’ll want to transcribe it first and then search the resulting text. That’s where AssemblyAI steps in: though there are numerous transcription services, it’s not often easy to integrate them into your own app or enterprise process.

“If you want to do content moderation, or search, or summarize audio data, you have to turn that data into a format that’s more pliable, and that you can build features and business processes on top of,” said AssemblyAI CEO and co-founder Dylan Fox. “So we were like, let’s build a super-accurate speech analysis API that anyone can call, even at a hackathon — like a Twilio or Stripe style integration. People need a lot of help to build these features, but they don’t want to glue a bunch of providers together.”

AssemblyAI offers a handful of different APIs that you can call extremely simply (a line or two of code) to perform tasks like “check this podcast for prohibited content,” or “identify the speakers in this conversation,” or “summarize this meeting into less than 100 words.”

Code it, call it, done. Image Credits: AssemblyAI

You may very well, as I was, be skeptical that a single small company can produce working tools to accomplish so many tasks so simply, considering how complex those tasks turn out to be once you get into them. Fox acknowledged that this is a challenge, but said that the tech has come a long way in a short span.

“There’s been a rapid increase in accuracy in these models, over the last few years especially,” he said. “Summary, sentiment identification… they’re all really good now. And we’re actually pushing the state of the art — our models are better than what’s out there, because we’re one of the few startups really doing large-scale deep learning research. We’re going to spend over a million dollars on GPU and compute for R&D and training, in the next few months alone.”

It can be harder to grasp intuitively because it’s not so easily demonstrable, but language models have come along just as things like image generation (This ___ does not exist) and computer vision (Face ID, security cameras) have. Of course GPT-3 is a familiar example of this, but Fox pointed out that understanding and generating the written word is practically an entirely different research domain than analyzing conversation and casual speech. Thus, although the same advances in machine learning techniques (like transformers and new, more efficient training frameworks) have contributed to both, they’re like apples and oranges in most ways.

The result, at any rate, has been that it’s possible to perform effective moderation or summarizing processes on an audio clip a few seconds or an hour long, simply by calling the API. That’s immensely useful when you’re building or integrating a feature like, for example, short-form video — if you expect a hundred thousand clips to be uploaded every hour, what’s your process for a first pass at making sure they aren’t porn, or scams, or duplicates? And how long will launch be delayed while you build that process?

Instead, Fox hopes, companies in this position will look for an easy and effective way forward, the way they might if they were faced with adding a payment process. Sure you could build one from scratch — or you could add Stripe in about 15 minutes. This not only is sort of fundamentally desirable, but it clearly separates them from the more complex, multi-service packages that define audio analysis products by big providers like Microsoft and Amazon.

The Fox in question. Image Credits: Jens Panduro

The company already has hundreds of paying customers, having tripled revenue in the last year, and now processes a million audio streams a day. “We’re 100% live. There’s a huge market and a huge need, and the spend from customers is there,” Fox said.

The $28 million A round was “led by Accel, with participation from Y Combinator, John and Patrick Collison (Stripe), Nat Friedman (GitHub), and Daniel Gross (Pioneer).” The plan is to spread all those zeroes across recruitment, R&D infrastructure and building out the product pipeline. As Fox noted, the company is spending a million on GPUs and servers in the next few months, a bunch of Nvidia A100s that will power the incredibly computation-intensive research and training processes. Otherwise you’re stuck paying for cloud services, so it’s better to rip that Band-Aid off early.

As for recruiting, I suggested that they might have a hard time staffing up in direct competition with the likes of Google and Facebook, which are of course working hard on their own audio analysis pipelines. Fox was optimistic, however, feeling that the culture there can be slow and stifling.

“I think there’s definitely a desire in really good AI researchers and engineers to want to work on the bleeding edge — and the bleeding edge in production,” he said. “You come up with something innovative, and a few weeks later have it in production… a startup is the only place you can do stuff like that.”

(Written by Devin Coldewey; translated by sako)

Original article: https://jp.techcrunch.com/2022/03/05/2022-03-04-assembly-ai-snags-28m-for-all-in-one-api-to-transcribe-summarize-and-moderate-audio/
