MCUXpresso_MIMXRT1052xxxxB/boards/evkbimxrt1050/eiq_examples/tflm_kws

readme.md

Overview

Keyword spotting example based on Keyword spotting for Microcontrollers [1].

Input data preprocessing

Raw audio data is pre-processed first - a spectrogram is calculated: A 40 ms window slides over a one-second audio sample with a 20 ms stride. For each window, audio frequency strengths are computed using the FFT and turned into a set of Mel-Frequency Cepstral Coefficients (MFCC). Only the first 10 coefficients are taken into account. The window slides over the sample 49 times, so a matrix with 49 rows and 10 columns is created. This matrix is called a spectrogram.
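
The 49x10 shape can be reproduced offline as a quick sanity check. The sketch below assumes the librosa package and a one-second, 16 kHz mono sample; the example itself extracts MFCCs on-device in audio/mfcc.cpp using a TensorFlow-compatible implementation, so the coefficient values will not match exactly.

    # Offline sketch of the 49x10 spectrogram described above (assumes librosa).
    import librosa

    audio, rate = librosa.load("off.wav", sr=16000)  # one-second keyword sample
    audio = audio[:16000]                            # keep exactly one second

    mfcc = librosa.feature.mfcc(
        y=audio,
        sr=rate,
        n_mfcc=10,       # keep only the first 10 coefficients
        n_fft=640,       # 40 ms window at 16 kHz
        hop_length=320,  # 20 ms stride
        center=False,    # the window slides 49 times over one second
    )
    spectrogram = mfcc.T                             # shape (49, 10)
    print(spectrogram.shape)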

In the example, static audio samples ("off", "right") are evaluated first, regardless of whether a microphone is connected. Afterwards, audio is processed directly from the microphone.

Classification

The spectrogram is fed into a neural network. The neural network is a depthwise separable convolutional neural network based on MobileNet described in [2]. The model produces a probability vector for the following classes: "Silence", "Unknown", "yes", "no", "up", "down", "left", "right", "on", "off", "stop" and "go".
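
For illustration, one depthwise separable convolution block of the kind used in [2] can be written with Keras as below. The filter count and layer arrangement are hypothetical and do not necessarily match the pre-trained ds_cnn_s.tflite model.

    # Sketch of a MobileNet-style depthwise separable convolution block [2].
    # Layer sizes are illustrative only.
    import tensorflow as tf
    from tensorflow.keras import layers

    def ds_conv_block(x, filters):
        # Depthwise convolution: one 3x3 filter per input channel
        x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        # Pointwise (1x1) convolution mixes the channels
        x = layers.Conv2D(filters, kernel_size=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

    # The 49x10 spectrogram enters the network as a single-channel "image",
    # and the classifier head outputs the 12 class probabilities listed above.
    inputs = tf.keras.Input(shape=(49, 10, 1))
    x = ds_conv_block(inputs, filters=64)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(12, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)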

Quantization

The NN model is quantized to run faster on MCUs; it takes a quantized input and produces a quantized output. The input spectrogram needs to be scaled from the range [-247, 30] to the range [0, 255] and rounded to integers. Values lower than zero are set to zero and values exceeding 255 are set to 255. The output of the softmax function is a vector with components in the interval (0, 255) which add up to 255.
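
A minimal sketch of this scaling, assuming NumPy arrays (the helper names are hypothetical; the actual code is in the example sources):

    # Map the MFCC range [-247, 30] to [0, 255], round, and saturate.
    import numpy as np

    IN_MIN, IN_MAX = -247.0, 30.0

    def quantize_input(spectrogram):
        scaled = (spectrogram - IN_MIN) * 255.0 / (IN_MAX - IN_MIN)
        return np.clip(np.round(scaled), 0, 255).astype(np.uint8)

    def output_to_percent(quantized_output):
        # Softmax components lie in (0, 255) and add up to 255,
        # so each class score can be read as a percentage.
        return 100.0 * quantized_output.astype(np.float32) / 255.0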

HOW TO USE THE APPLICATION: Say different keywords so that the microphone can catch them. Voice recorded from the microphone can be heard using headphones connected to the audio jack. Note that the semihosting implementation causes a slower or discontinuous audio experience. Select UART in 'Project Options' during project import to use the external debug console via UART (virtual COM port).

[1] https://github.com/ARM-software/ML-KWS-for-MCU
[2] https://arxiv.org/abs/1704.04861

Files:
  main.cpp - example main function
  ds_cnn_s.tflite - pre-trained TensorFlow Lite model converted from DS_CNN_S.pb
    (source: https://github.com/ARM-software/ML-KWS-for-MCU/blob/master/Pretrained_models/DS_CNN/DS_CNN_S.pb)
    (for details on how to quantize and convert a model, see the eIQ TensorFlow Lite User's Guide, which can be downloaded with the MCUXpresso SDK package)
  off.wav - waveform audio file of the word to recognize
    (source: Speech Commands Dataset available at https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz, file speech_commands_test_set_v0.02/off/0ba018fc_nohash_2.wav)
  right.wav - waveform audio file of the word to recognize
    (source: Speech Commands Dataset available at https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz, file speech_commands_test_set_v0.02/right/0a2b400e_nohash_1.wav)
  audio_data.h - the waveform audio files ("off", "right") converted into C language arrays of audio signal values using Python with the SciPy package:
      from scipy.io import wavfile
      rate, data = wavfile.read('yes.wav')
      with open('wav_data.h', 'w') as fout:
          print('#define WAVE_DATA {', file=fout)
          data.tofile(fout, ',', '%d')
          print('}\n', file=fout)
  train.py - model training script based on https://www.tensorflow.org/tutorials/audio/simple_audio
  timer.c - timer source code
  audio/* - audio capture and pre-processing code
  audio/mfcc.cpp - MFCC feature extraction matching the TensorFlow MFCC operation
  audio/kws_mfcc.cpp - audio buffer handling for MFCC feature extraction
  model/get_top_n.cpp - top results retrieval
  model/model_data.h - model data from the ds_cnn_s.tflite file converted to a C language array using the xxd tool (distributed with the Vim editor at www.vim.org); a Python alternative is sketched after this list
  model/model.cpp - model initialization and inference code
  model/model_ds_cnn_ops.cpp - model operations registration
  model/output_postproc.cpp - model output processing
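
If the xxd tool is not at hand, an equivalent C array can be generated with a short Python sketch; the array names below are hypothetical and should match what model/model_data.h declares:

    # Emit the ds_cnn_s.tflite bytes as a C array, similar to `xxd -i`.
    with open("ds_cnn_s.tflite", "rb") as f:
        data = f.read()

    with open("model_data.h", "w") as out:
        out.write("const unsigned char model_data[] = {\n")
        for i in range(0, len(data), 12):
            chunk = ", ".join("0x%02x" % b for b in data[i:i + 12])
            out.write("  %s,\n" % chunk)
        out.write("};\nconst unsigned int model_data_len = %d;\n" % len(data))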

SDK version

  • Version: 2.15.000

Toolchain supported

  • MCUXpresso 11.8.0
  • IAR Embedded Workbench 9.40.1
  • Keil MDK 5.38.1
  • GCC ARM Embedded 12.2

Hardware requirements

  • Mini/micro USB cable
  • EVKB-IMXRT1050 board
  • Personal computer

Board settings

Disconnect the camera device from the J35 connector if it is connected (possible signal interference).

Prepare the Demo

  1. Connect a USB cable between the host PC and the OpenSDA USB port on the target board.
  2. Open a serial terminal with the following settings (the same configuration is sketched with pyserial after these steps):
    • 115200 baud rate
    • 8 data bits
    • No parity
    • One stop bit
    • No flow control
  3. Download the program to the target board.
  4. Either press the reset button on your board or launch the debugger in your IDE to begin running the demo.
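
For reference, the serial settings from step 2 expressed with the pyserial package (a sketch; the port name is an assumption and depends on the host operating system):

    # Open the OpenSDA virtual COM port with the settings listed in step 2.
    import serial

    port = serial.Serial(
        "/dev/ttyACM0",                # hypothetical port name; e.g. COMx on Windows
        baudrate=115200,
        bytesize=serial.EIGHTBITS,     # 8 data bits
        parity=serial.PARITY_NONE,     # no parity
        stopbits=serial.STOPBITS_ONE,  # one stop bit
        xonxoff=False, rtscts=False,   # no flow control
        timeout=1.0,
    )
    print(port.readline().decode(errors="replace"), end="")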

Running the demo

The log below shows the output of the demo in the terminal window:

Keyword spotting example using a TensorFlow Lite model. Detection threshold: 25

Static data processing: Expected category: off

 Inference time: 32 ms
 Detected:        off (100%)

Expected category: right

 Inference time: 32 ms
 Detected:      right (98%)

Microphone data processing:

 Inference time: 32 ms
 Detected: No label detected (0%)


 Inference time: 32 ms
 Detected:         up (85%)


 Inference time: 32 ms
 Detected:       left (97%)