Speech Recognition API in Swift


PLEASE NOTE: This tutorial was written using Xcode 11 and Swift 5.

In this tutorial, you’ll learn how to transcribe live or pre-recorded audio in your iOS app with the Speech framework, the same engine used by Siri. Speech recognition opens up interesting possibilities for your apps. For instance, you could build an application that automatically transcribes the audio of a movie so you can search for your favorite lines. The app in this tutorial transcribes what you say into the microphone and then reads the text back to you.

The Speech framework doesn’t work in the Simulator, so make sure to use a real device running iOS 10 or later to build and run this tutorial’s app.

Create the App

Start by creating a new Single View App project in Xcode. Name it SpeechRecognition, set its language to Swift and save it on your Desktop.

Delete the SceneDelegate.swift file and move it to the Trash. We don’t need it.

Now click the Info.plist file and click the minus (-) button next to the Application Scene Manifest row. You must delete that row in order to run the app with the AppDelegate alone.

Finally, open the AppDelegate.swift file and replace the existing code inside the class with the following:

    var window: UIWindow?

    func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        // Override point for customization after application launch.
        return true
    }

    func applicationWillResignActive(_ application: UIApplication) {
        // Sent when the application is about to move from active to inactive state. This can occur for certain types of temporary interruptions (such as an incoming phone call or SMS message) or when the user quits the application and it begins the transition to the background state.
        // Use this method to pause ongoing tasks, disable timers, and invalidate graphics rendering callbacks. Games should use this method to pause the game.
    }

    func applicationDidEnterBackground(_ application: UIApplication) {
        // Use this method to release shared resources, save user data, invalidate timers, and store enough application state information to restore your application to its current state in case it is terminated later.
        // If your application supports background execution, this method is called instead of applicationWillTerminate: when the user quits.
    }

    func applicationWillEnterForeground(_ application: UIApplication) {
        // Called as part of the transition from the background to the active state; here you can undo many of the changes made on entering the background.
    }

    func applicationDidBecomeActive(_ application: UIApplication) {
        // Restart any tasks that were paused (or not yet started) while the application was inactive. If the application was previously in the background, optionally refresh the user interface.
    }

    func applicationWillTerminate(_ application: UIApplication) {
        // Called when the application is about to terminate. Save data if appropriate. See also applicationDidEnterBackground:.
    }


Design it!

Now open the Main.storyboard file and drag a TextView and 2 Buttons into the ViewController. The TextView will show the text that the app recognizes and transcribes while you talk close to your device.

The first Button starts and stops speech recording. The second one speaks the displayed text aloud.

We have to declare our Views in the ViewController.swift file, so click the Adjust Editor Options button and select Assistant. Xcode will split the central area into 2 sections, the Storyboard and the code area.

Connect the TextView and the two Buttons to the Swift file as IBOutlets. Name the TextView speechTxt, the first Button recSpeechButton and the second Button playSpeechButton.


It’s time to connect the IBActions of our buttons. Connect the Record Speech button right below the closing brace of the viewDidLoad function and name the action recordButt. Then connect the Play Speech button and name its action playButt.

Let’s code

First of all, let’s add two Privacy rows to the Info.plist file. They hold the descriptions shown in the permission prompts and are required for microphone and speech recognition access in your application.

Click the + button on any row and search for the Privacy - Speech Recognition Usage Description key. Select it and type its String value, something like: This app needs to recognize your speech and transcribe it into text

Then click the + button again to add the Privacy - Microphone Usage Description key. Set its String value to: This app needs to access the Microphone to record speech

Now select the ViewController.swift file in the Files list panel to open it in the full editor. Start coding by importing the Speech and the AVFoundation frameworks right below the UIKit one:
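The two Privacy rows are stored in Info.plist under the raw keys NSSpeechRecognitionUsageDescription and NSMicrophoneUsageDescription. As a minimal sketch, assuming you want a quick sanity check during development, the helper below reports which of the two descriptions are missing or empty. The names requiredUsageKeys and missingUsageDescriptions are hypothetical and not part of any Apple API.

```swift
import Foundation

/// Raw Info.plist keys behind the two "Privacy - ..." rows added above.
let requiredUsageKeys = [
    "NSSpeechRecognitionUsageDescription",
    "NSMicrophoneUsageDescription"
]

/// Hypothetical helper: returns the required keys that are missing
/// (or hold an empty string) in the given Info.plist dictionary.
func missingUsageDescriptions(in info: [String: Any]) -> [String] {
    return requiredUsageKeys.filter { key in
        guard let text = info[key] as? String else { return true }
        return text.trimmingCharacters(in: .whitespaces).isEmpty
    }
}

// Usage: inspect the running app's own Info.plist, for example from viewDidLoad.
let missing = missingUsageDescriptions(in: Bundle.main.infoDictionary ?? [:])
if !missing.isEmpty {
    print("Missing usage descriptions: \(missing)")
}
```

If either description is missing, the app is likely to be terminated by the system as soon as it asks for the corresponding permission, so a check like this can save some debugging time.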

import Speech
import AVFoundation

Adopt two delegate protocols next to the ViewController’s class name, SFSpeechRecognizerDelegate and AVSpeechSynthesizerDelegate:

class ViewController: UIViewController, SFSpeechRecognizerDelegate, AVSpeechSynthesizerDelegate {

Now we have to declare a few properties that our app needs to record our voice and perform the speech recognition. Put the following code below your IBOutlet declarations:

let audioEngine = AVAudioEngine()
var speechRecognizer = SFSpeechRecognizer()
let recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
var recognitionTask: SFSpeechRecognitionTask?
let speechSynth = AVSpeechSynthesizer()
var isRecording = false
var isPlaying = false

Then, outside the viewDidLoad function, paste this code:

func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { authStatus in
        // The callback is not guaranteed to run on the main thread, so hop to the main queue
        OperationQueue.main.addOperation {
            switch authStatus {
            case .authorized:
                print("Speech recognition is authorized!")
            case .denied:
                print("User denied access to speech recognition")
            case .restricted:
                print("Speech recognition restricted on this device")
            case .notDetermined:
                print("Speech recognition not yet authorized")
            @unknown default:
                break
            }
        }
    }
}

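The function above handles every authorization status of the Speech framework: whether you allowed or denied the speech recognition permission, whether recognition is restricted on the device, or whether no choice has been made yet. Nothing calls it automatically, so add a call to requestSpeechAuthorization() inside viewDidLoad to show the permission prompt before the first recording.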
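Printing the status is enough for a demo, but you will usually want to act on it, for example by keeping the record button disabled until access is granted. Here is a minimal sketch of that idea. The helpers describe(_:) and requestSpeechAccess(completion:) are hypothetical names introduced for illustration. Only SFSpeechRecognizer.requestAuthorization and the status cases come from the Speech framework.

```swift
import Speech

/// Hypothetical helper: maps an authorization status to a yes/no answer plus a message.
func describe(_ status: SFSpeechRecognizerAuthorizationStatus) -> (allowed: Bool, message: String) {
    switch status {
    case .authorized:
        return (true, "Speech recognition is authorized")
    case .denied:
        return (false, "User denied access to speech recognition")
    case .restricted:
        return (false, "Speech recognition restricted on this device")
    case .notDetermined:
        return (false, "Speech recognition not yet authorized")
    @unknown default:
        return (false, "Unknown speech recognition status")
    }
}

/// Hypothetical helper: asks for permission and reports the outcome on the main queue.
func requestSpeechAccess(completion: @escaping (Bool, String) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        let outcome = describe(status)
        DispatchQueue.main.async {
            completion(outcome.allowed, outcome.message)
        }
    }
}

// Possible usage inside viewDidLoad (assumes the recSpeechButton outlet from this tutorial):
// requestSpeechAccess { [weak self] allowed, message in
//     self?.recSpeechButton.isEnabled = allowed
//     print(message)
// }
```

Keeping the mapping in a small pure function makes it easy to reuse the same messages in an alert instead of the console.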

Let’s now place the following code inside the recordButt IBAction:

speechRecognizer = SFSpeechRecognizer(locale: Locale.current)

listenAndRecognizeSpeech()

With the code above we create a Speech Recognizer for the current language and region of our device (the initializer returns nil if that locale is not supported), then we call a function that listens to your speech and displays your words as text. Pass Locale.current directly: interpolating it into a String yields a description such as "en_US (current)", which is not a valid locale identifier.
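Because SFSpeechRecognizer(locale:) can return nil, it may be worth falling back to a language you know is commonly supported. The sketch below shows one way to do that. makeRecognizer and its "en-US" fallback are assumptions made for illustration, not something the original project contains.

```swift
import Speech

/// Hypothetical helper: tries the preferred locale first, then a fallback.
/// Returns nil only if neither locale is supported on this device.
func makeRecognizer(preferred: Locale = .current,
                    fallbackIdentifier: String = "en-US") -> SFSpeechRecognizer? {
    if let recognizer = SFSpeechRecognizer(locale: preferred) {
        return recognizer
    }
    return SFSpeechRecognizer(locale: Locale(identifier: fallbackIdentifier))
}

// Possible usage inside recordButt:
// speechRecognizer = makeRecognizer()

// To see what the device supports, list the identifiers of the supported locales.
let supported = SFSpeechRecognizer.supportedLocales().map { $0.identifier }.sorted()
print("Supported locales: \(supported)")
```

Whether a silent fallback is the right choice depends on your app. Showing an alert that names the unsupported language may be friendlier than transcribing in the wrong one.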

That function, listenAndRecognizeSpeech, looks like this:

func listenAndRecognizeSpeech() {
        isRecording = !isRecording
        print("IS RECORDING: \(isRecording)")
        
        // Start recording
        if isRecording {
            recSpeechButton.setTitle("Now speak...", for: .normal)
            playSpeechButton.isEnabled = false
            
            // Init Audio Engine
            let node = audioEngine.inputNode
            let recordingFormat = node.outputFormat(forBus: 0)
            node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
                self.recognitionRequest.append(buffer)
            }
            audioEngine.prepare()
            do { try audioEngine.start()
            } catch {
                self.stopRecording()
                return self.simpleAlert("There has been an audio engine error: \(error.localizedDescription)")
            }
            // Reuse the recognizer created in recordButt instead of creating a second one
            guard let myRecognizer = speechRecognizer else {
                self.stopRecording()
                return self.simpleAlert("Speech recognition is not supported for your current locale.")
            }
            if !myRecognizer.isAvailable {
                self.stopRecording()
                return self.simpleAlert("Speech recognition is not currently available. Check back at a later time.")
            }
            
            // Show the recognized string
            recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { result, error in
                if let result = result {
                    
                    // Get speech and display it as text in the TextView
                    let bestString = result.bestTranscription.formattedString
                    self.speechTxt.text = bestString
                    
                // error
                } else if let error = error {
                    self.simpleAlert("There has been a speech recognition error: \(error.localizedDescription)")
                }
            })
            
            
        // Stop Recording
        } else { stopRecording() }
    }

When you tap the Record Speech button, its title changes to Now speak... and you can talk to your device and watch your words become text in the TextView at the top of the screen.
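The tutorial starts the audio engine without configuring the shared audio session. On some devices that works out of the box, but it is generally safer to set a category that allows both recording and playback before recording starts. The following is a sketch under that assumption. The function names are hypothetical, while the AVAudioSession calls are standard API. Note that requestRecordPermission is deprecated on recent iOS versions in favor of a newer API, but it matches the iOS versions this tutorial targets.

```swift
import AVFoundation

/// Hypothetical helper: prepares the shared audio session for recording speech
/// and later playing the synthesized voice through the speaker.
func configureAudioSessionForSpeech() throws {
    let session = AVAudioSession.sharedInstance()
    // .playAndRecord lets the same session capture the microphone and play audio.
    try session.setCategory(.playAndRecord,
                            mode: .default,
                            options: [.defaultToSpeaker, .duckOthers])
    try session.setActive(true, options: .notifyOthersOnDeactivation)
}

/// Hypothetical helper: asks for microphone access and reports back on the main queue.
func requestMicrophoneAccess(_ completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        DispatchQueue.main.async { completion(granted) }
    }
}

// Possible usage before starting the audio engine:
// requestMicrophoneAccess { granted in
//     guard granted else { return }
//     do { try configureAudioSessionForSpeech() } catch { print(error) }
// }
```

With the .playAndRecord category in place, the overrideOutputAudioPort(.speaker) call used for playback later in the tutorial is also less likely to throw, since that override is only valid for this category.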

When you’re done, tap the button again and the app will stop recording. So now add this function right below the previous one:

func stopRecording() {
    recSpeechButton.setTitle("Record Speech", for: .normal)
    playSpeechButton.isEnabled = true

    isRecording = false
    recognitionTask?.finish()
    recognitionTask = nil
    recognitionRequest.endAudio()
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
}
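One caveat: the tutorial keeps a single recognitionRequest for the whole life of the view controller and calls endAudio() on it every time recording stops. As far as I can tell, a SFSpeechAudioBufferRecognitionRequest is not meant to be reused once its audio has ended, so a second recording may fail. A common alternative is to create a fresh request for every recording. Here is a minimal sketch of that approach. RecognitionSession is a hypothetical class written for illustration and is not part of the original project.

```swift
import Speech
import AVFoundation

/// Hypothetical wrapper that creates a new recognition request for each recording.
final class RecognitionSession {
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    /// Starts listening and passes each (partial) transcription to onText.
    func start(with recognizer: SFSpeechRecognizer,
               onText: @escaping (String) -> Void) throws {
        stop() // make sure a previous session is fully torn down

        // A brand new request for this recording.
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        self.request = request

        // Feed microphone buffers into the request.
        let node = audioEngine.inputNode
        let format = node.outputFormat(forBus: 0)
        node.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        audioEngine.prepare()
        do {
            try audioEngine.start()
        } catch {
            stop() // remove the tap again before reporting the failure
            throw error
        }

        task = recognizer.recognitionTask(with: request) { result, error in
            if let result = result {
                onText(result.bestTranscription.formattedString)
            } else if let error = error {
                print("Recognition error: \(error.localizedDescription)")
            }
        }
    }

    /// Stops the engine and discards the request so it is never reused.
    func stop() {
        if audioEngine.isRunning { audioEngine.stop() }
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.finish()
        request = nil
        task = nil
    }
}
```

In the view controller you could keep one RecognitionSession property, call start(with:onText:) when recording begins to update speechTxt.text, and call stop() from stopRecording().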

If you try to build the app now, you’ll get a few errors in the code. Don’t worry, we’re just missing a simple method that presents an Alert Controller in case of errors. Place this code right above the closing brace of the class:

func simpleAlert(_ message: String) {
    let alert = UIAlertController(title: "Speech recognizer", message: message, preferredStyle: .alert)
    let ok = UIAlertAction(title: "OK", style: .default, handler: nil)
    alert.addAction(ok)
    present(alert, animated: true, completion: nil)
}

Now you can run the app and start testing the Speech framework. When you first tap the Record Speech button, you’ll be asked to allow the Microphone and Speech Recognition permissions. Allow both and start speaking: your words should appear in the TextView as you talk.

Cool, right? 🙂

We’re not done yet. It’s time to enable the button that turns our text into speech, so put the following code inside the playButt IBAction:

        isPlaying = !isPlaying
        print("IS PLAYING: \(isPlaying)")

        // Start playing
        if isPlaying {
            // speechTxt has some text
            if let text = speechTxt.text, !text.isEmpty {

                // Set Audio Session
                let session = AVAudioSession.sharedInstance()
                do {
                    try session.overrideOutputAudioPort(.speaker)
                    try session.setPreferredSampleRate(44100)
                } catch { print(error) }

                speechSynth.delegate = self
                let speechUtterance = AVSpeechUtterance(string: text)
                speechUtterance.rate = 0.5
                speechUtterance.volume = 1.0
                // The voice needs a BCP 47 language code such as "en-US";
                // "\(Locale.current)" would give a description, not a language code
                speechUtterance.voice = AVSpeechSynthesisVoice(language: AVSpeechSynthesisVoice.currentLanguageCode())
                speechSynth.speak(speechUtterance)

                recSpeechButton.isEnabled = false
                playSpeechButton.setTitle("Playing...", for: .normal)

            // speechTxt is empty...
            } else {
                // Nothing is playing, so reset the toggle before showing the alert
                isPlaying = false
                simpleAlert("Nothing to talk about here...")
            }

        // Stop playing
        } else {
            speechSynth.stopSpeaking(at: .immediate)
            recSpeechButton.isEnabled = true
            playSpeechButton.setTitle("Play Speech", for: .normal)
        }
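If you want to reuse the utterance setup elsewhere, it can be pulled into a small function. The sketch below is one possible shape. makeUtterance is a hypothetical helper, while AVSpeechUtteranceDefaultSpeechRate and AVSpeechSynthesisVoice.currentLanguageCode() are standard AVFoundation API. AVSpeechSynthesisVoice(language:) returns nil for a language code it does not recognize, and an utterance with a nil voice is spoken with the system default voice.

```swift
import AVFoundation

/// Hypothetical helper: builds an utterance for the given text and language code.
/// languageCode is a BCP 47 code such as "en-US" or "it-IT".
func makeUtterance(text: String,
                   languageCode: String = AVSpeechSynthesisVoice.currentLanguageCode()) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    utterance.volume = 1.0
    // If no voice exists for the code, voice stays nil and the default voice is used.
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)
    return utterance
}

// Possible usage inside playButt:
// speechSynth.speak(makeUtterance(text: text))
```

Using the named default rate instead of a literal 0.5 keeps the intent clear if you later decide to make the speed adjustable between AVSpeechUtteranceMinimumSpeechRate and AVSpeechUtteranceMaximumSpeechRate.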

Let’s add the AVSpeechSynthesizerDelegate function that is called when the built-in voice finishes speaking:

func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
    // Reset the toggle so the next tap on Play Speech starts playback again
    isPlaying = false
    recSpeechButton.isEnabled = true
    playSpeechButton.setTitle("Play Speech", for: .normal)
}

Awesome, you’re done coding. Run the app again and tap the Play Speech button to hear your device read the TextView’s text aloud.

Conclusion

That’s all for this tutorial. You have learned how to build a small app that uses the Speech framework to transcribe your voice into text and AVSpeechSynthesizer to speak that text back.

I hope you enjoyed this article. Feel free to post comments about it. You can also download the full Xcode project of this tutorial from the link below:

Download the Xcode project
