Get Live Speech Transcriptions In Your Browser

There are so many projects you can build with Deepgram's streaming audio transcriptions. Today, we are going to get live transcriptions from a user's mic inside of your browser.

Before We Start

For this project, you will need a Deepgram API Key - you can get one from the Deepgram Console. That's it in terms of dependencies - this project is entirely browser-based.

Create a new index.html file, open it in a code editor, and add the following boilerplate code:

<!DOCTYPE html>
<html>
  <body>
    <p id="status">Connection status will go here</p>
    <p id="transcript">Deepgram transcript will go here</p>
    <script>
      // Further code goes here
    </script>
  </body>
</html>

Get User Microphone

You can request access to a user's media input devices (microphones and cameras) using the built-in getUserMedia() method. If allowed by the user, it will return a MediaStream, which we can then prepare to send to Deepgram. Inside your <script> tag, add the following:

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  console.log({ stream })
  // Further code goes here
})

Load your index.html file in your browser, and you should immediately receive a prompt to access your microphone. Grant it, and then look at the console in your developer tools.

Now that we have a MediaStream, we must provide it to a MediaRecorder, which will prepare the data and, once available, emit it with a dataavailable event:

if (!MediaRecorder.isTypeSupported('audio/webm'))
  return alert('Browser not supported')
const mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' })
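
audio/webm is not supported in every browser, so rather than just alerting, you may want to fall back across formats. Here is a small sketch of that idea (not part of the original tutorial): the check is passed in as a predicate so the logic is easy to test, and in the page itself you would pass MediaRecorder.isTypeSupported. The candidate list is an assumption - adjust it for your target browsers.

```javascript
// Return the first MIME type the recorder supports, or null if none do.
// `isTypeSupported` is passed in as a function so this can run outside
// a browser; in the page you would pass MediaRecorder.isTypeSupported.
function pickMimeType(isTypeSupported, candidates) {
  for (const type of candidates) {
    if (isTypeSupported(type)) return type
  }
  return null
}

// In the browser (hypothetical fallback list):
// const mimeType = pickMimeType(
//   (t) => MediaRecorder.isTypeSupported(t),
//   ['audio/webm', 'audio/mp4', 'audio/ogg']
// )
```

If pickMimeType returns null, you would show the "Browser not supported" alert as before; otherwise pass the chosen type as the mimeType option to the MediaRecorder constructor.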

We now have everything we need to send data to Deepgram.

Connect to Deepgram

To stream audio to Deepgram's Speech Recognition service, we must open a WebSocket connection and send data via it. First, establish the connection:

const socket = new WebSocket('wss://api.deepgram.com/v1/listen', [
  'token',
  'YOUR_DEEPGRAM_API_KEY',
])

A reminder that this key is client-side and, therefore, your users can see it. Please factor this into your real projects.

Then, log when socket onopen, onmessage, onclose, and onerror events are triggered:

socket.onopen = () => {
  console.log({ event: 'onopen' })
}

socket.onmessage = (message) => {
  console.log({ event: 'onmessage', message })
}

socket.onclose = () => {
  console.log({ event: 'onclose' })
}

socket.onerror = (error) => {
  console.log({ event: 'onerror', error })
}

Refresh your browser and watch the console. You should see the socket connection is opened and then closed. To keep the connection open, we must swiftly send some data once the connection is opened.

Sending Data to Deepgram

Inside of the socket.onopen function, send data to Deepgram in 250ms chunks:

mediaRecorder.addEventListener('dataavailable', async (event) => {
  if (event.data.size > 0 && socket.readyState == 1) {
    socket.send(event.data)
  }
})
mediaRecorder.start(250)

Deepgram isn't fussy about the timeslice you provide (here it's 250ms), but bear in mind that the bigger this number is, the longer the delay between a word being spoken and it being sent, slowing down your transcription. 100-250ms is ideal.

Take a look at your console now while speaking into your mic - you should be seeing data come back from Deepgram!

Handling the Deepgram Response

Inside of the socket.onmessage function, parse the data sent from Deepgram, pull out just the transcript, and determine whether it's the final transcript for that phrase (an "utterance"):

const received = JSON.parse(message.data)
const transcript = received.channel.alternatives[0].transcript
if (transcript && received.is_final) {
  console.log(transcript)
}

You may have noticed that for each phrase you receive several messages from Deepgram, each growing by a word (for example "hello", "hello how", "hello how are", and so on). Deepgram sends data back as each word is transcribed, which is great for getting a speedy response. For this simple project, we will only show the final version of each utterance, denoted by the is_final property in the response.
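
If you later want to show the in-progress ("interim") words as well as the finalized ones, the accumulation logic can be kept as a small pure function. This is a sketch, assuming the simplified message shape used above ({ channel: { alternatives: [{ transcript }] }, is_final }); the function and state names are hypothetical:

```javascript
// Fold one Deepgram-style message into running state.
// `finalized` accumulates completed utterances; `interim` holds the
// current in-progress phrase and is replaced by each non-final message.
function foldTranscript(state, received) {
  const transcript = received.channel.alternatives[0].transcript
  if (!transcript) return state
  if (received.is_final) {
    return { finalized: state.finalized + transcript + ' ', interim: '' }
  }
  return { finalized: state.finalized, interim: transcript }
}
```

In socket.onmessage you could then render state.finalized + state.interim so the user sees words appear as they are transcribed, with finalized phrases locked in place.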

To neaten this up, remove the console.log({ event: 'onmessage', message }) from this function, and then test your code again.

That's it! That's the project. Before we wrap up, let's give the user some indication of progress in the web page itself.

Showing Status & Progress In Browser

Change the text inside of <p id="status"> to 'Not Connected'. Then, at the top of your socket.onopen function add this line:

document.querySelector('#status').textContent = 'Connected'

Remove the text inside of <p id="transcript">. Where you log the transcript in your socket.onmessage function, add this line:

document.querySelector('#transcript').textContent += transcript + ' '

Try your project once more, and your web page should show you when you're connected and what words you have spoken, thanks to Deepgram's Speech Recognition.
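
One thing we haven't covered is shutting the stream down cleanly - say, from a hypothetical "Stop" button. A minimal sketch, written as a standalone function so the ordering is clear (stop the recorder first so no more chunks are queued, then close the socket); the guards use the standard MediaRecorder.state and WebSocket.readyState properties:

```javascript
// Stop recording and close the connection, guarding against repeat calls.
function stopStreaming(recorder, socket) {
  // MediaRecorder.state is 'inactive' once stopped; don't stop twice.
  if (recorder.state !== 'inactive') recorder.stop()
  // 1 === WebSocket.OPEN; only close a socket that is still open.
  if (socket.readyState === 1) socket.close()
}
```

In the page you would call stopStreaming(mediaRecorder, socket) from a click handler, and could also set the status element back to 'Not Connected' in socket.onclose.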

The final project code is available at https://github.com/deepgram-devs/browser-mic-streaming, and if you have any questions, please feel free to reach out on Twitter - we're @DeepgramDevs.