Live Transcriptions with iOS and Deepgram

Share

  1. Blog
  2. 2022
  3. 01
  4. Live Transcriptions with iOS and Deepgram

This post was written by a member of the Deepgram community. If you have any questions about this post please reach out to the Deepgram Developer Relations team - we're @DeepgramDevs on Twitter or devrel@deepgram.com via email.

Deepgram's Speech Recognition API can provide accurate transcripts with pre-recorded and live audio. In this post, you will build an iOS app that takes the user's microphone and prints transcripts to the screen in real-time.

The final project code is available at https://github.com/deepgram-devs/deepgram-live-transcripts-ios.

Before We Start

For this project, you will need:

Once you open Xcode, create a new Xcode project. Select the App template, give it a product name, select "Storyboard" for the interface, and "Swift" for the language. Untick "User Core Data" and "Include Tests", then hit next.

Choose a place to save your project and hit finish to open your new project.

You will be using the WebSocket library Starscream for this project. You can add it as a Swift Package by going to File > Add Packages, then entering the URL (https://github.com/daltoniam/Starscream) into the search box. Click add package and wait for the package to download.

Get Microphone Permissions

You will be using the microphone for this project, so you will need to add a microphone usage description to your project. When an iOS app asks for microphone permission, a reason why the permission is being requested is required and will be shown to the user. Open the info.plist file to add a description.

Hover over the last entry and click the plus icon to add a new entry. The key should be "Privacy - Microphone Usage Description", and you should add a value that describes why you need the microphone. You can use "To transcribe audio with Deepgram".

Setting Up AVAudioEngine

To access a data stream from the microphone, you will use AVAudioEngine.

Before you get started, it is worth having a mental model of how AVAudioEngine works. Apple's description is: "An audio engine object contains a group of AVAudioNode instances that you attach to form an audio processing chain".

For this example, you will create a chain of nodes that will take the microphone's input, convert it to a format that Deepgram can process, then send the data to Deepgram. The nodes and conversion will be discussed further in the blog.

Open the ViewController.swift file and import AVFoundation and Starscream at the top

import AVFoundation
Import Starscream

This will give you access to the AVAudioEngine class. Then inside the ViewController class, create an instance of AVAudioEngine by adding a property:

private let audioEngine = AVAudioEngine()

Next, create a function to analyse the audio and declare some constants:

private func startAnalyzingAudio() {
  let inputNode = audioEngine.inputNode
  let inputFormat = inputNode.inputFormat(forBus: 0)
  let outputFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: inputFormat.sampleRate, channels: inputFormat.channelCount, interleaved: true)
}

From the top, you have the inputNode, which you can think of as the microphone followed by the inputFormat.

Next is the outputFormat - The iOS microphone is a 32-bit depth format by default, so you will be converting to a 16-bit depth format before sending the data to Deepgram. While we will use a PCM 16-bit depth format for the audio encoding, Deepgram supports many audio encodings that can be specified to the API. Consider using a more compressed format if your use case requires low network usage. Just below those, add the new nodes that you will need and attach them to the audio engine:

let converterNode = AVAudioMixerNode()
let sinkNode = AVAudioMixerNode()

audioEngine.attach(converterNode)
audioEngine.attach(sinkNode)

The sinkNode is needed because to get the data in the correct format you need to access the data after it has passed through the converterNode, or its output. If you refer to the diagram above, notice how the stream of data to Deepgram is coming from the connection between nodes, the output of the converterNode, and not the nodes themselves.

Get Microphone Data

Use the installTap function on the AVAudioMixerNode class to get the microphone data. This function gives you a closure that returns the output audio data of a node as a buffer. Call installTap on the converterNode, continuing the function from above:

converterNode.installTap(onBus: 0, bufferSize: 1024, format: converterNode.outputFormat(forBus: 0)) { (buffer: AVAudioPCMBuffer!, time: AVAudioTime!) -> Void in

}

You will come back to this closure later. Now finish off the startAnalyzingAudio by connecting all the nodes and starting the audio engine:

audioEngine.connect(inputNode, to: converterNode, format: inputFormat)
audioEngine.connect(converterNode, to: sinkNode, format: outputFormat)
audioEngine.prepare()

do {
try AVAudioSession.sharedInstance().setCategory(.record)
  try audioEngine.start()
} catch {
  print(error)
}

You can see now how the processing pipeline mirrors the diagram from earlier. Before the audio engine is started, the audio session of the application needs to be set to a record category. This lets the app know that the audio engine is solely being used for recording purposes. Note how the converterNode is connected with outputFormat.

Connect to Deepgram

You will be using the Deepgram WebSocket Streaming API, create a WebSocket instance at the top of the ViewController class:

private let apiKey = "Token YOUR_DEEPGRAM_API_KEY"
private lazy var socket: WebSocket = {
  let url = URL(string: "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=48000&channels=1")!
  var urlRequest = URLRequest(url: url)
  urlRequest.setValue(apiKey, forHTTPHeaderField: "Authorization")
  return WebSocket(request: urlRequest)
}()
A reminder that this key is in your app's source code and, therefore, it is not secure. Please factor this into your actual projects.

Note how the URL has an encoding matching the outputFormat from earlier. In the viewDidLoad function, set the socket's delegate to this class and open the connection:

override func viewDidLoad() {
  super.viewDidLoad()
  socket.delegate = self
  socket.connect()
}

You will implement the delegate later in the post.

Send Data to Deepgram

Now that the socket connection is open, you can now send the microphone data to Deepgram. To send data over a WebSocket in iOS, it needs to be converted to a Data object. Add the following function to the ViewController class that does the conversion:

private func toNSData(buffer: AVAudioPCMBuffer) -> Data? {
  let audioBuffer = buffer.audioBufferList.pointee.mBuffers
  return Data(bytes: audioBuffer.mData!, count: Int(audioBuffer.mDataByteSize))
}

Then return to the startAnalyzingAudio function, and within the installTap closure, you can send the data to Deepgram. Your tap code should look like this now:

converterNode.installTap(onBus: bus, bufferSize: 1024, format: converterNode.outputFormat(forBus: bus)) { (buffer: AVAudioPCMBuffer!, time: AVAudioTime!) -> Void in
  if let data = self.toNSData(buffer: buffer) {
     self.socket.write(data: data)
  }
}

Call startAnalyzingAudio in the viewDidLoad function below the WebSocket configuration.

Handle the Deepgram Response

You will get updates from the WebSocket via its delegate. In the ViewController.swift file outside the class, create an extension for the WebSocketDelegate and a DeepgramResponse struct:

extension ViewController: WebSocketDelegate {
  func didReceive(event: WebSocketEvent, client: WebSocket) {
     switch event {
     case .text(let text):

     case .error(let error):
        print(error ?? "")
     default:
        break
     }
  }
}

struct DeepgramResponse: Codable {
  let isFinal: Bool
  let channel: Channel

  struct Channel: Codable {
     let alternatives: [Alternatives]
  }

  struct Alternatives: Codable {
     let transcript: String
  }
}

The didReceive function on the WebSocketDelegate will be called whenever you get an update on the WebSocket. Before you finish implementing didReceive, you need to prepare to decode the data and update the UI to display the transcripts. At the top of the ViewController class, add the following properties:

private let jsonDecoder: JSONDecoder = {
  let decoder = JSONDecoder()
  decoder.keyDecodingStrategy = .convertFromSnakeCase
  return decoder
}()

private let transcriptView: UITextView = {
  let textView = UITextView()
  textView.isScrollEnabled = true
  textView.backgroundColor = .lightGray
  textView.translatesAutoresizingMaskIntoConstraints = false
  return textView
}()

Create a new function called setupView and configure your UI:

private func setupView() {
  view.addSubview(transcriptView)
  NSLayoutConstraint.activate([
     transcriptView.topAnchor.constraint(equalTo: view.safeAreaLayoutGuide.topAnchor),
     transcriptView.leadingAnchor.constraint(equalTo: view.leadingAnchor, constant: 16),
     transcriptView.trailingAnchor.constraint(equalTo: view.trailingAnchor, constant: -16),
     transcriptView.bottomAnchor.constraint(equalTo: view.safeAreaLayoutGuide.bottomAnchor)
  ])
}

Then call setupView in the viewDidLoad function.

Returning to the didReceive function of the WebSocketDelegate, within the text case; you need to convert the text, which is a String type, into a Data type so you can decode it into an instance of DeepgramResponse:

extension ViewController: WebSocketDelegate {
  func didReceive(event: WebSocketEvent, client: WebSocket) {
     switch event {
     case .text(let text):
        let jsonData = Data(text.utf8)
        let response = try! jsonDecoder.decode(DeepgramResponse.self, from: jsonData)
        let transcript = response.channel.alternatives.first!.transcript

        if response.isFinal && !transcript.isEmpty {
          if transcriptView.text.isEmpty {
             transcriptView.text = transcript
          } else {
             transcriptView.text = transcriptView.text + " " + transcript
          }
        }
     case .error(let error):
        print(error ?? "")
     default:
        break
     }
  }
}

Once you have a DeepgramResponse instance, you will check if it is final, meaning it is ready to add to the transcriptView while handling the empty scenarios. If you now build and run the project (CMD + R), it will open in the simulator, where you will be prompted to give microphone access. Then you can start transcribing!

The final project code is available at https://github.com/deepgram-devs/deepgram-live-transcripts-ios, and if you have any questions, please feel free to reach out to the Deepgram team on Twitter - @DeepgramDevs.

FEEDBACK