Speech Recognition and Its Dilemmas!

Can language be mastered with algorithms? Ah, there is an irony baked into that very question, and the irony may matter more than any definitive answer.
Linguistics marrying mathematics is a complex affair, not just because the two fields have different dynamics but because their very foundations differ. Think back to school, when some of us excelled at languages but struggled with numbers, or vice versa. Few of us strike a perfect balance between the two fields (or abilities).
However, Artificial Intelligence (AI) aims to achieve that balance. It strives to make human-machine interactions smoother by decoding spoken language and converting it into text.
Recent AI- and ML-backed advances in handling grammar, structure, accents, syntax, and dialects have improved the pace and efficacy of human-computer interaction (HCI), transforming the modern communication experience.
The Whats and Whys of Speech Recognition
Automatic Speech Recognition (ASR), or computer-aided speech recognition, is a machine’s ability to convert human speech into written text, hence the name Speech-to-Text. It is quite often confused with voice recognition, though.
Voice recognition technology, by contrast, focuses mainly on identifying individual users’ voices, a biometric task.
Think of Speech Recognition as the initial trigger enabling voice technology to perform smoothly. We owe the quick, fun, and adaptive responses of Alexa, Cortana, and Siri (our beloved voice assistants!) to ASR technology. Without speech recognition and its advancements, our voices would still be nothing more than audio recordings to our computers.
Now, let’s take a glance at how Speech Recognition functions: the system analyzes the audio, breaks it into smaller segments, converts those into a machine-friendly (readable) representation, and finally uses algorithms to interpret them and produce the most fitting text. The technology is judged on two fronts: speed and accuracy, with accuracy usually measured as the Word Error Rate (WER); a small WER calculation is sketched below. Factors such as accent, volume, pronunciation, background noise, and industry-specific jargon directly affect the WER.
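To make the metric concrete, here is a minimal sketch of a WER computation in Python. The function name and the example transcripts are purely illustrative, not taken from any ASR toolkit; real evaluation tools normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "recognize speech" misheard as "wreck a nice beach", a classic ASR blooper.
# Note WER can exceed 100% when the output has more errors than the
# reference has words.
print(word_error_rate("recognize speech", "wreck a nice beach"))  # 2.0, i.e. 200%
```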
A few speech recognition algorithms and computation techniques:
- Natural Language Processing (NLP),
- Hidden Markov models (HMM),
- Neural networks,
- N-grams (see the sketch after this list),
- Speaker Diarization (SD).
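To make one of these techniques concrete, here is a toy bigram model (an n-gram with n = 2) in Python. This is a hedged sketch rather than production code: the corpus, the function names, and the homophone example are my own illustrations of how a decoder might prefer one transcription over another.

```python
from collections import defaultdict

def train_bigrams(corpus: list) -> dict:
    """Count word-pair frequencies from a list of sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()
        for prev, curr in zip(words, words[1:]):
            counts[prev][curr] += 1
    return counts

def score(counts: dict, prev: str, candidate: str) -> float:
    """Relative frequency of `candidate` following `prev` (0 if unseen)."""
    total = sum(counts[prev].values())
    return counts[prev][candidate] / total if total else 0.0

# Toy corpus: the model learns that "ice" is far likelier than "i"
# after the word "some", so "some ice cream" beats "some I scream".
corpus = ["I want some ice cream",
          "she bought some ice cream",
          "I scream loudly"]
model = train_bigrams(corpus)
print(score(model, "some", "ice"))  # 1.0
print(score(model, "some", "i"))    # 0.0
```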
ASR has become a highly innovative, fast-moving field, generating metadata across countless sources. Gartner predicted that by 2023, 25% of employee interactions with applications would happen mainly via voice. A few of the main reasons behind its growing popularity:
- High speed
- The predictive outcomes (or analytics) it can deliver
- Its role in accelerating automation
- Its ability to cater exceptionally well to the rapidly growing “remote world”
- Cost-effectiveness: it requires only an initial investment rather than the recurring costs of manual methods
Why is Speech Recognition hard?
Our language is arbitrary. Its peculiarities and complexities make it very challenging for a machine to analyze speech and produce error-free transcription. Abbreviations, acronyms, idiomatic phrases, dialects, accents, context, semantics, pragmatics, pauses, and more all pose dilemmas that limit ASR’s efficacy, efficiency, and accuracy.
The biggest speech recognition challenges:
1) Imprecision and Misinterpretations: Context is key.
To master this, the machine would have to learn, and more importantly understand, the difference between hearing and listening. When we communicate, we take into account the speaker’s expressions, body language, tone, and pitch, and only then determine the meaning (as well as the sentiment behind it).
The machine, however, is in a tough spot: it lacks that contextual experience (and sentiment) and runs solely on algorithms.
2) Background noise: hinders accuracy big time.
Loud surroundings and background noise make ASR unreliable outdoors and in large public spaces, because the technology still struggles to filter out background noise and isolate the human voice. External devices (like headsets) can help here, but they are just too much extra baggage. Acoustic training is another aid, though it comes with limitations of its own.
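As a rough illustration of the underlying idea (not of what production systems actually do), here is a minimal energy-based noise gate in Python with NumPy: frames whose energy sits near the estimated noise floor are silenced, leaving the louder, speech-like frames. The frame size, threshold ratio, and synthetic signal are all assumptions made for this demo.

```python
import numpy as np

def energy_gate(signal: np.ndarray, sample_rate: int,
                frame_ms: int = 20, threshold_ratio: float = 1.5) -> np.ndarray:
    """Zero out frames whose energy falls below a multiple of the noise floor."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)
    # Assume the quietest 10% of frames approximate the background noise level.
    noise_floor = np.percentile(energies, 10)
    keep = energies > threshold_ratio * noise_floor
    return (frames * keep[:, None]).ravel()

# Synthetic demo: a short tone (standing in for speech) buried in low-level noise.
rate = 16000
t = np.linspace(0, 1, rate, endpoint=False)
noise = 0.01 * np.random.randn(rate)
speechlike = np.where((t > 0.3) & (t < 0.7),
                      0.5 * np.sin(2 * np.pi * 220 * t), 0.0)
cleaned = energy_gate(noise + speechlike, rate)
```

Real systems rely on far more sophisticated acoustic modeling, beamforming, and learned denoisers; this gate merely shows why isolating speech from noise is a thresholding problem with no perfect cutoff.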
3) Language Base: The more, the merrier!
The current gap in language coverage is a barrier to adoption, and the huge variety of accents and dialects is among the biggest factors hurting accuracy. That’s why we need not only more languages in the arena but equally more accents and dialects, giving the machine more exposure, more experience, and better learning opportunities.
4) Data security and privacy: cost and implementation
For a machine to train and learn, massive amounts of data are required. The current approach of obtaining data through paid research or studies is very restrictive, yielding only a fraction of the total voice data generated in this digital age. And accessing, using, and managing the data that is collected raises questions about data security and individual user privacy.
This conflict of interest narrows the pool of data available for training AI even further.
Wrap-Up
Speech recognition technology now comes bundled with nearly every modern digital experience (that’s how embedded it has become). It is an evolving technology built around adaptability, paving the way for ever more unique use cases.
Humanity’s quest to streamline human-machine interaction has come a long way. Sure, it isn’t perfect at the moment (nothing is), but who knows what the future holds!