OpenAI Whisper vs Google Speech To Text

Listen Monster

Google recently announced Chirp, its own speech-to-text model, available as an API. In this blog post, I will compare OpenAI Whisper with Google Chirp in terms of

  • Accuracy
  • Features
  • Pricing

Let’s jump into it.

Introduction

Google Chirp

Google launched its speech-to-text API on Google Cloud a while ago, and Chirp is basically its version 2.0. Currently, Google Chirp supports 309 languages. Chirp is part of Google's Universal Speech Model (USM), with which Google is trying to build speech recognition for 1000+ languages.

USM's other models are Short and Long, which both support 183 languages, and Telephony, which also supports 183. So the total number is 762.

USM lists accents of many languages as separate entries, which is why the total reaches 762. If we ignore accents and count only unique languages, the actual number is less than 150.

Chirp is available on Google Cloud as an API. Google claims that, in its own testing, Chirp is better than Whisper.

OpenAI Whisper

OpenAI released Whisper as open source on GitHub. This means you can download it and install it on your own server without paying OpenAI anything.

Whisper is trained on a large and varied set of audio, including noisy and low-quality recordings, so it stays accurate on imperfect real-world audio.

Whisper comes in 5 model sizes.

  1. Tiny
  2. Base
  3. Small
  4. Medium
  5. Large

Soon after, OpenAI released large-v2. The large-v2 model is trained for 2.5x more epochs with added regularization for improved performance.

The tiny model is very fast but not very accurate. Large-v2 is the most accurate, but it takes much longer than the tiny model.

Whisper supports 97 languages. OpenAI Whisper does not provide an accent option.
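To make the speed versus accuracy trade-off concrete, here is a minimal sketch of running Whisper locally. It assumes the open-source openai-whisper Python package and ffmpeg are installed. The size names are the ones published in the Whisper repository (which also includes a base size), while pick_model and transcribe_file are just illustrative helper names of mine.

```python
# Sketch only: assumes `pip install openai-whisper` and ffmpeg on PATH.

# Sizes published in the Whisper repository, smallest to largest.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v2")


def pick_model(prefer_speed: bool) -> str:
    """Return the fastest size or the most accurate one."""
    return MODEL_SIZES[0] if prefer_speed else MODEL_SIZES[-1]


def transcribe_file(path: str, prefer_speed: bool = True) -> str:
    """Transcribe one audio file locally and return the plain text."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(pick_model(prefer_speed))
    result = model.transcribe(path)  # language is auto-detected by default
    return result["text"]
```

Calling transcribe_file("interview.mp3", prefer_speed=False) would load large-v2, which is slow without a GPU. The file name here is only a placeholder.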

Accuracy

The most important thing for any speech-to-text model is accuracy. There is no doubt that both models are extremely good.

In fact, you will rarely find mistakes in transcriptions.

In our tests, we found that Whisper is better for most languages.

For some lesser-known languages such as Punjabi, Google Chirp is better than Whisper.

You can watch the following video to see English test results.

So if you want to transcribe widely used languages, Whisper is slightly better; otherwise, Google Chirp is better.
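If you want to repeat this kind of comparison on your own audio, one common way to put a number on accuracy is word error rate (WER). This post does not say which metric was used for the results above, so treat the following as a generic sketch, not the method behind them. The function name is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edits needed to turn the reference words seen so far
    # (before the current one) into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Run both engines on the same clip, compare each transcript against a human-checked reference, and the lower WER wins. Real evaluations usually also normalize punctuation and numbers first, which this sketch skips.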

Features

Besides accuracy, there are many other crucial features, such as long-file support and export options.

OpenAI Whisper provides the following features

  1. Multiple export options: TXT, SRT, VTT and word-level transcription
  2. Automatic language detection
  3. Speaker diarization
  4. Remove silence
  5. Number of speakers
  6. Temperature: Set a high value for more unpredictable and wild results. Use a lower value, such as 0.2, for more precise and consistent output.
  7. Translation
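As an illustration of the SRT export option, here is a small sketch that turns timed segments into SRT text. It assumes each segment is a dict with start, end (in seconds) and text keys, which is the shape the open-source Whisper package returns in its segments list. The helper names are mine.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn dicts with 'start', 'end' and 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)
```

VTT output is very similar: it adds a WEBVTT header line and uses a dot instead of a comma before the milliseconds.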

Here are Google Chirp's features

  • Multiple export options: JSON, SRT, TXT and CSV
  • Real-time transcription
  • Transcription of files up to 8 hours long

Google USM features depend on the model. If you use Chirp, you simply get the transcript without the other features.

Google Chirp currently does not have the following features. However, Google's other models (Short, Long and Telephony) do.

  1. Confidence scores
  2. Speech adaptation
  3. Diarization
  4. Forced normalization
  5. Word level confidence
  6. Language detection
  7. Profanity filter

The main limitation of OpenAI Whisper is that the API only accepts files of up to 25 MB.

Even if you host the model yourself, transcribing long files is still not easy.
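A common workaround for the size cap is to split a long recording into pieces and transcribe them one by one. The sketch below only plans the split points. It assumes a roughly constant bitrate, treats 25 MB as 25 × 1024 × 1024 bytes, and leaves 10% headroom. All three are my assumptions, and plan_chunks is a hypothetical helper name.

```python
import math

# The 25 MB cap mentioned above, assumed here to mean 25 * 1024 * 1024 bytes.
API_LIMIT_BYTES = 25 * 1024 * 1024


def plan_chunks(file_bytes: int, duration_s: float,
                limit_bytes: int = API_LIMIT_BYTES,
                safety: float = 0.9):
    """Return (start, end) second ranges whose estimated size fits the cap.

    Assumes a roughly constant bitrate, so size scales with duration.
    """
    if file_bytes <= 0 or duration_s <= 0:
        raise ValueError("file size and duration must be positive")
    budget = limit_bytes * safety  # leave headroom in every chunk
    n_chunks = max(1, math.ceil(file_bytes / budget))
    chunk_s = duration_s / n_chunks
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s))
            for i in range(n_chunks)]
```

You would then cut the audio at these offsets with a tool such as ffmpeg and send each piece separately. Cutting on silence instead of at fixed offsets avoids splitting words in half, but that is beyond this sketch.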

Unfortunately, Whisper does not support real-time transcription out of the box. With some custom code, however, it can be made to transcribe in real time.

One main advantage of Whisper is privacy: you can install it on your own machine, so your audio does not have to be sent to anyone else.

So in terms of features, Google Speech-to-Text is definitely better than Whisper.

Pricing

Pricing is another crucial factor. In general, Whisper is more affordable than Google speech-to-text.

However, there are some specific cases when Google Speech-to-text API is more affordable than Whisper.

First, let’s discuss API pricing

Google speech to text pricing

Google speech-to-text used to be quite expensive at $0.024 / minute. However, shortly after OpenAI announced Whisper, Google launched API v2, which is both more accurate and cheaper.

Google speech-to-text API v2 has multiple plans. It might be a bit confusing at first. Here are the plans along with an explanation.

Plan | 0-500,000 minutes/month | 500,000-1,000,000 minutes/month | 1,000,000-2,000,000 minutes/month | Over 2,000,000 minutes/month
Without Data Logging | $0.016 / minute | $0.010 / minute | $0.008 / minute | $0.004 / minute
With Data Logging | $0.012 / minute | $0.0075 / minute | $0.006 / minute | $0.003 / minute
Dynamic batch speech (Without Data Logging) | $0.003 / minute | $0.003 / minute | $0.003 / minute | $0.003 / minute
Dynamic batch speech (With Data Logging) | $0.00225 / minute | $0.00225 / minute | $0.00225 / minute | $0.00225 / minute

What is Data Logging?

With the Data Logging price, your data is shared with Google, which uses it to improve its models.

What is Dynamic batch speech?

Dynamic batch processes audio at a lower level of urgency. It simply means your audio file will take longer to transcribe.
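To see how these plans translate into a monthly bill, here is a rough calculator using the per-minute prices from the table above. Two things are my assumptions and not stated in the table: that the fourth price applies above 2,000,000 minutes, and that each rate is charged only on the minutes that fall inside its tier. Check Google's pricing page before relying on either.

```python
# (upper bound of the tier in minutes, USD per minute), taken from the
# "Without Data Logging" row. Assumptions: the last price applies above
# 2,000,000 minutes, and each rate covers only the minutes in its own tier.
STANDARD_TIERS = [
    (500_000, 0.016),
    (1_000_000, 0.010),
    (2_000_000, 0.008),
    (float("inf"), 0.004),
]
LOGGING_FACTOR = 0.75       # every "With Data Logging" price is 75% of the other
DYNAMIC_BATCH_RATE = 0.003  # flat rate, without data logging


def google_monthly_cost(minutes: float, data_logging: bool = False,
                        dynamic_batch: bool = False) -> float:
    """Estimate one month's bill in USD for the given number of minutes."""
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    if dynamic_batch:
        cost = minutes * DYNAMIC_BATCH_RATE
    else:
        cost, lower = 0.0, 0
        for upper, rate in STANDARD_TIERS:
            if minutes <= lower:
                break
            cost += (min(minutes, upper) - lower) * rate
            lower = upper
    return cost * (LOGGING_FACTOR if data_logging else 1.0)
```

For example, google_monthly_cost(1000) covers 1,000 minutes on the standard plan, and adding data_logging=True or dynamic_batch=True shows how much each option saves.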

Whisper Pricing

OpenAI has shared the Whisper code, and you can also access the model through an API. Here is the API pricing.

Model | Pricing
Whisper Large V2 | $0.006 / minute (rounded to the nearest second)

The API does not offer the other models, such as Tiny, Small and the original Large.

If you host Whisper on your own machine, you pay a fixed monthly bill. If your usage is small, hosting your own Whisper can be expensive compared to the API.

That is because you need a powerful machine with a graphics card. Without one, even simple audio files take a long time to transcribe.

In general, OpenAI Whisper is more affordable than Google speech to text. However, there are some scenarios when Google speech-to-text is more affordable.

For example, suppose you want to transcribe long audio files but your usage is quite small. Hosting Whisper on your own machine will be expensive, and the API is not friendly to longer files.

On top of that, if you are comfortable sharing your data for training, Google Speech-to-Text is more affordable than Whisper.

When you have higher usage, hosting Whisper yourself is definitely more affordable.
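The point where self-hosting starts to pay off is simple arithmetic: divide the fixed monthly server bill by the API price per minute. Here is a small sketch using the $0.006 / minute API price from above. The server cost is whatever your own GPU machine costs, and the $300 in the usage note is purely a made-up figure.

```python
WHISPER_API_RATE = 0.006  # USD per minute, the API price quoted above


def break_even_minutes(server_cost_per_month: float,
                       api_rate: float = WHISPER_API_RATE) -> float:
    """Monthly minutes at which the API bill equals the fixed server bill."""
    return server_cost_per_month / api_rate


def cheaper_option(minutes_per_month: float,
                   server_cost_per_month: float) -> str:
    """Return 'api' or 'self-hosted', whichever costs less for this usage."""
    api_cost = minutes_per_month * WHISPER_API_RATE
    return "api" if api_cost < server_cost_per_month else "self-hosted"
```

With a hypothetical $300 / month server, break_even_minutes(300) comes out at about 50,000 minutes, so below that volume the API is cheaper and above it self-hosting wins. This ignores setup and maintenance time, which also count against hosting it yourself.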

Whisper VS Chirp: And The Winner Is

In most scenarios, Whisper is more accurate and affordable. However, Google is trying to compete with it by lowering its price and providing more language options.

The main advantage of Whisper is that it is open source and the quality of its results is good enough. That's why most developers are choosing Whisper even though it is missing some features.

Since we live in an ever-changing world, new words are introduced every day and languages keep changing. Google might overtake Whisper in the long run.
