Fooling deep neural networks with adversarial inputs has exposed a significant vulnerability in current state-of-the-art systems across multiple domains. Both black-box and white-box approaches have been used either to replicate the model itself or to craft examples that cause the model to fail. In this work, we propose a framework that uses multi-objective evolutionary optimization to perform both targeted and un-targeted black-box attacks on Automatic Speech Recognition (ASR) systems.
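The attack loop can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes a black-box `transcribe` function (audio in, text out), scalarizes the two objectives (transcript change vs. acoustic closeness) into one fitness value, and uses plain mutation-and-selection; all names and parameters here are hypothetical.

```python
import random

def untargeted_attack(audio, transcribe, generations=30, pop_size=8,
                      eps=0.01, seed=0):
    """Toy multi-objective evolutionary un-targeted attack (illustrative only).

    `transcribe` is a black-box oracle: list[float] -> str. Fitness trades off
    (a) how much the transcript changes and (b) closeness to the original audio.
    """
    rng = random.Random(seed)
    original_text = transcribe(audio)

    def similarity(cand):
        # 1 minus mean absolute perturbation: a crude stand-in for a real
        # acoustic similarity measure.
        return 1.0 - sum(abs(a - b) for a, b in zip(audio, cand)) / len(audio)

    def text_diff(cand):
        # Fraction of reference words no longer present in the output.
        out = set(transcribe(cand).split())
        ref = original_text.split()
        return sum(w not in out for w in ref) / max(len(ref), 1)

    def fitness(cand):
        # Scalarized multi-objective score: maximize both terms.
        return text_diff(cand) + similarity(cand)

    def mutate(cand):
        # Bounded random perturbation of every sample.
        return [x + rng.uniform(-eps, eps) for x in cand]

    # Evolve: keep the fitter half, refill with mutated copies of the parents.
    population = [mutate(list(audio)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        population = parents + [mutate(list(p)) for p in parents]
    return max(population, key=fitness)
```

Because only `transcribe` outputs are queried, no gradients of the model are needed, which is what makes the attack black-box.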
We apply this framework to two ASR systems, Deepspeech and Kaldi-ASR, increasing their Word Error Rate (WER) by up to 980%, indicating the potency of our approach. In both un-targeted and targeted attacks, the adversarial samples maintain a high acoustic similarity with the original audio of 0.98 and 0.97, respectively.
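For reference, WER is the word-level Levenshtein distance between the reference and the hypothesis transcript (substitutions, insertions, and deletions), normalized by the number of reference words. A minimal sketch, assuming a non-empty reference:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Because insertions count as errors, WER can exceed 100%, which is why relative increases as large as 980% are possible.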
Below are some of the adversarial samples generated on Kaldi-ASR and Deepspeech using the proposed framework in both the targeted and un-targeted settings.
Set-1 Un-targeted Attack
Actual Text: I have got to go him
Generated Text: it got girl
ASR: Deepspeech
Actual Text: I have got to go him
Generated Text: i get ill
ASR: Deepspeech
Actual Text: I have got to go him
Generated Text: the good girl to have
ASR: Kaldi-ASR
Actual Text: I have got to go him
Generated Text: the scottish go to him
ASR: Kaldi-ASR
Set-2 Un-targeted Attack
Actual Text: he is the man that are written for
Generated Text: he is the man the tired
ASR: Deepspeech
Actual Text: he is the man that are written for
Generated Text: hes the man their coverage
ASR: Deepspeech
Actual Text: he is the man that are written for
Generated Text: these the man that's all right
ASR: Kaldi-ASR
Actual Text: he is the man that are written for
Generated Text: he's the man that are ready and four