All software applications are vulnerable, and speech apps are no exception — well worth keeping in mind given the proliferation of smart speakers, digital voice assistants, voice user interfaces, and voice-based authentication and identification systems. You can’t afford to dismiss the security threats these pose — because once hackers have gained unauthorized access, they’re able to operate these speech-enabled devices without the users’ knowledge. Too often, however, privacy concerns around speech applications garner more attention than these security threats do.
To learn more about how artificial intelligence (AI) and machine learning could be put to malicious use in exploiting speech apps, I contacted Kashyap Kompella, CEO and chief analyst of
rpa2ai Research, a global industry analyst firm that focuses on enterprise AI and automation, and co-author of “Practical Artificial Intelligence — An Enterprise Playbook.”
Can voice biometrics solve the security issue around hacked smart appliances?
Generally speaking, the contact center industry is starting to adopt voice biometrics to verify speaker identity and authenticate users. Outside of contact center use cases, voice-based speaker verification isn’t the norm. I should note, however, that this may change as applications are getting much better at identifying speakers accurately. We can expect to see this ability being put to use for authentication, perhaps unobtrusively in the background without having to change the user experience or introduce additional “security challenge” steps.
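To make the idea concrete, here is a minimal sketch of how passive speaker verification is often structured: a fixed-length voiceprint embedding is extracted from the caller’s audio and compared with the enrolled embedding. The random placeholder vectors, the 256-dimensional embedding size, and the 0.75 threshold are illustrative assumptions, not details of any particular vendor’s system; in practice the embeddings come from a pretrained speaker-embedding model and the threshold is tuned against false-accept and false-reject targets.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, claimed: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the identity claim if the embeddings are similar enough."""
    return cosine_similarity(enrolled, claimed) >= threshold

# Placeholder embeddings; in practice these would come from a pretrained
# speaker-embedding model (e.g., x-vectors or d-vectors) applied to audio.
rng = np.random.default_rng(0)
enrolled_embedding = rng.normal(size=256)
incoming_embedding = enrolled_embedding + rng.normal(scale=0.1, size=256)

print("Same speaker?", verify_speaker(enrolled_embedding, incoming_embedding))
```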
How does machine learning impact speech apps?
Machine learning is no doubt the wind beneath the wings for speech apps that are soaring in usage. Accuracy and performance of automatic speech recognition (ASR) systems have increased significantly in recent years. New machine learning methods plus the availability of larger speech datasets and easy access to massive computing power have all contributed to making speech/voice a viable user interaction medium for many applications.
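As a rough illustration of how accessible modern ASR has become, a few lines of Python with the open-source Hugging Face transformers library can load a pretrained model and transcribe a clip. The model name and file path below are placeholder choices for the sketch, not recommendations.

```python
from transformers import pipeline

# Load a small pretrained speech-to-text model; "openai/whisper-tiny" is just
# one publicly available example.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Transcribe a local audio file (placeholder path).
result = asr("voice_command.wav")
print(result["text"])
```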
From a security point of view, AI and machine learning are dual-use technologies — meaning, they’re at the disposal of vendors as well as malicious actors. Generative adversarial networks, a deep learning technique, can generate synthetic voices and trick voice biometric and voice authentication systems. In addition, deep learning techniques used in voice applications are opaque and in some cases can potentially give rise to newer vulnerabilities. Automatic speaker verification systems are vulnerable to spoofing attacks.
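To illustrate the dual-use point, the sketch below shows the adversarial training loop at the heart of a GAN: a generator learns to produce samples that a discriminator cannot distinguish from real ones. The layer sizes and the 1,024-sample “audio frames” are toy placeholders; real voice-cloning systems are far larger and operate on spectrograms or raw waveforms.

```python
import torch
import torch.nn as nn

# Toy generator/discriminator pair illustrating the adversarial training loop
# of a GAN; sizes are placeholders chosen only to keep the sketch short.
class Generator(nn.Module):
    def __init__(self, noise_dim: int = 64, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),   # fake "audio frame"
        )

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    def __init__(self, in_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),                    # real-vs-fake logit
        )

    def forward(self, x):
        return self.net(x)


gen, disc = Generator(), Discriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

real_batch = torch.randn(8, 1024)   # stand-in for a batch of real audio frames
z = torch.randn(8, 64)              # noise the generator turns into fakes

# Discriminator step: learn to tell real frames from generated ones.
d_loss = (loss_fn(disc(real_batch), torch.ones(8, 1))
          + loss_fn(disc(gen(z).detach()), torch.zeros(8, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: learn to produce frames the discriminator scores as real.
g_loss = loss_fn(disc(gen(z)), torch.ones(8, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```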
What are white box attacks, and are these a concern for speech app security?
Attacks are commonly classified based on the amount of knowledge about the target systems that hackers possess. If an attacker has a great deal of knowledge about design, architecture, and platform details, it’s a white box attack. White box attacks can be targeted at the application, operating system, or hardware level. Voice replay attacks (where the user’s voice is recorded and then replayed in the background) or dolphin attacks (where commands are modulated onto inaudible ultrasound and demodulated back into speech signals by the device’s microphone) are examples of white box attacks.
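Here is a rough numerical sketch of the signal trick behind a dolphin attack, assuming a simple amplitude-modulation scheme and a crude quadratic model of microphone nonlinearity. The 192 kHz sample rate, 25 kHz carrier, and 300 Hz stand-in “command” are all illustrative choices, not parameters from a real attack.

```python
import numpy as np

fs = 192_000                     # sample rate high enough to represent ultrasound
t = np.arange(0, 1.0, 1 / fs)

# Stand-in for a recorded voice command (a 300 Hz tone here, for simplicity).
command = np.sin(2 * np.pi * 300 * t)

carrier_freq = 25_000            # 25 kHz: above the range most adults can hear
carrier = np.sin(2 * np.pi * carrier_freq * t)

# Classic AM: shift the command spectrum up around the inaudible carrier.
transmitted = (1 + 0.5 * command) * carrier

# A crude model of microphone nonlinearity (a quadratic term) demodulates the
# signal: squaring recreates a component near the original 300 Hz, which the
# device's own low-pass filtering then keeps.
received = transmitted + 0.1 * transmitted ** 2

spectrum = np.abs(np.fft.rfft(received))
freqs = np.fft.rfftfreq(len(received), 1 / fs)
audible_band = (freqs > 50) & (freqs < 1000)
print("Strongest audible-band component (Hz):",
      freqs[audible_band][np.argmax(spectrum[audible_band])])
```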
What is a black box attack?
In contrast, when hackers don’t have in-depth knowledge about the system they’re targeting, this is known as a black box attack. The exact details of the machine learning models and data used to train these models aren’t usually available to attackers. Attackers exploit the fact that the output of deep learning algorithms can be quite sensitive to small changes in certain input parameters. In a hidden voice command attack, hackers produce a mangled sound that, while not recognized by humans, is recognized by an ASR system.
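The sketch below captures the query-and-perturb pattern of a score-based black box attack: the attacker sees no weights or gradients, only a confidence score, and keeps small random nudges that push the score across the decision boundary. The linear stand-in model, step size, and query budget are illustrative assumptions; real hidden voice command attacks work on audio fed to an ASR system, but follow the same loop.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the remote speech model under attack (unknown to the attacker).
weights = rng.normal(size=128)

def query_model(x: np.ndarray) -> float:
    """Black box from the attacker's view: returns only a confidence score."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, x)))

# A feature vector the model currently classifies as benign (score > 0.5).
original = rng.normal(size=128)
if query_model(original) < 0.5:
    original = -original

adversarial = original.copy()
score = query_model(adversarial)

for step in range(5_000):
    candidate = adversarial + 0.02 * rng.normal(size=128)   # small random nudge
    candidate_score = query_model(candidate)
    if candidate_score < score:            # keep steps that erode the benign score
        adversarial, score = candidate, candidate_score
    if score < 0.5:                        # the prediction has flipped
        break

distortion = np.linalg.norm(adversarial - original) / np.linalg.norm(original)
print(f"score {score:.2f} after {step + 1} queries, "
      f"relative distortion {distortion:.2f}")
```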
What can an enterprise do to mitigate these security issues?
It’s not just the speech apps that are going mainstream. In their wake, the threat landscape is also expanding. So, don’t just emphasize features and functionality but also prioritize security.
Most attackers compromise speech apps by leveraging lax security controls of the host platform or device hardware. Monitor permissions granted and patch known vulnerabilities at this level — this is basic security hygiene.
Many of the attacks on speech apps try to “fake” live human commands and feed a compromised signal instead. A defensive strategy based on the enterprise apps identifying the source of commands — whether it’s a live human or recorded or synthesized playback — makes a lot of sense.
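One very simplified way to picture such a check is the toy heuristic below, which assumes that loudspeaker playback tends to lose high-frequency energy relative to a live voice captured directly by the microphone. The 6 kHz cutoff and 1 percent threshold are illustrative, not tuned values; production anti-spoofing relies on learned countermeasures trained on corpora of genuine and spoofed speech, such as the ASVspoof challenge datasets.

```python
import numpy as np

def high_band_energy_ratio(audio: np.ndarray, sample_rate: int,
                           cutoff_hz: float = 6000.0) -> float:
    """Fraction of the signal's energy above cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sample_rate)
    return float(spectrum[freqs >= cutoff_hz].sum() / spectrum.sum())

def looks_like_replay(audio: np.ndarray, sample_rate: int,
                      threshold: float = 0.01) -> bool:
    """Crude liveness cue: flag clips with very little high-band energy."""
    return high_band_energy_ratio(audio, sample_rate) < threshold

# Example: broadband noise at 16 kHz keeps plenty of high-band energy,
# so this toy check treats it as a live capture.
rng = np.random.default_rng(0)
live_like = rng.normal(size=16_000)
print(looks_like_replay(live_like, sample_rate=16_000))   # False
```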
Enterprises shouldn’t deploy speech apps without a thorough analysis of potential vulnerabilities and their mitigation.