AI-Powered Dog Trainer Part 1

Liam Davies

Issue 49, August 2021

This article includes additional downloadable resources.

Learn how to program and train your own AI Audio Classifier on a Raspberry Pi Zero!

BUILD TIME: A few hours

Here at DIYODE, we’re very fond of Artificial Intelligence and have written a number of articles both generally and technically regarding its capabilities and implementations. There are many different guides both in past issues of DIYODE and across the internet regarding using Raspberry Pis with AI, but many are focused on visual recognition-based applications.

This project will be slightly different, focusing rather on classifying live audio from a microphone instead. We want to show how you can set up this system to classify any type of audio you have around the home and make it work with your own projects. However, perhaps more interestingly, we’re going to get this running on a humble Raspberry Pi Zero instead of the newest powerhouse from the Raspberry Pi lineup, the Pi 4. This is not an insubstantial technical challenge as the tiny Zero is working with just 512MB of RAM and a single-core 1GHz processor, peanuts compared to its bigger brother.

The Objective

Meet Biscuit! She’s a four-year-old miniature groodle with a mischievous streak and is very protective of her home, so she tends to bark at people strolling past. While there is nothing wrong with alerting anybody around, it tends to frustrate everybody in earshot after a while, notably neighbours.

So, here’s the plan. Training the absence of an action is tricky, and negative reinforcement in the form of punishment only weakens the connection between the dog and trainer, potentially causing aggression. Positive reinforcement is the most commonly used training method, and for a good reason. It builds trust between dog and trainer and naturally rewards good behaviour. In the context of reducing barking, we want Biscuit to respond to a simple loudspeaker command such as “Come” or “Quiet”, and she must remain quiet for a fixed amount of time. If no further barks are heard by our trainer, it dispenses a treat for good behaviour.

We’ll also point out that, naturally, we aren’t professional dog trainers. We don’t fully understand the psychology of dogs and how best to train them, so this project could be considered as an experiment to see how effective our training method actually is. We’re rather putting focus on the process of collecting, labelling and training AI on the data we collect. We’ll explain how to collect and filter through good sample data, and use TensorFlow Lite on a Raspberry Pi Zero to run a sound detection model.

Part 2 of the project will be dedicated to finishing off this project with a fully 3D printed enclosure, treat dispenser, and a completed software program.

How It Works

At the core of our AI will be a neural network powered by the excellent and widely-used TensorFlow library. Training will be completed on a much more powerful computer than the Pi Zero, and then the model will be exported as a TensorFlow Lite model to be friendly to the Zero’s limited processing power. Evaluating a neural network (i.e. predicting if live input data is a positive match) is much easier than training one, considering that a neural network is essentially a giant network of weighted sums. Just feed in your inputs (the microphone data) and get your output (a detected bark sample).

Speaking of which, we should explain how the microphone data is processed. A microphone signal consists of essentially a single stream of numbers, which by itself isn’t super useful to the neural network: how would it be expected to extract features from a constantly changing one-dimensional signal? An FFT (Fast Fourier Transform) will be performed on the microphone data, which will break the signal down into its constituent frequencies rather than raw values. We can then use sections of the FFT for inputs to our neural network, divided into “banks” of frequencies. For example, a barking sound sample of a dog may look like this when converted from time domain to frequency domain:

Each column of the frequency domain (a ‘bank’) will be an input to our neural network. In the example below, we have five frequencies as inputs. The values – or the amplitude of the frequency – will be fed forward into the network, propagating with multiple mathematical operations through each layer until it reaches the output layer. You can imagine each connection in the network possessing a sort of analogue knob (a potentiometer, if you will) that dictates how strong its connection is to the previous layer.

Unfortunately, the mathematics is not the focus here. There is plenty of information on the internet about how and why neural networks use calculus, activation functions, bias values, and all other sorts of nerdy stuff to make them work. Using this clever maths magic, though, we are able to tune the ‘knobs’, or weights in technical terms, until the output matches what we expect given a set of inputs. The behaviour we require is simple conceptually – given a set of sound frequencies, is the sample a bark, or not a bark?

In the code we’ll be using, we’re collecting sound data every two seconds, and analysing the FFT across its duration. We ended up choosing 100 banks, divided into even sections between 250Hz and 2500Hz – fairly common frequencies for sounds around the house. That means that each input node of the neural network will be sensitive to a 22.5Hz band within that range.
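The band width quoted above is easy to check with a few lines of Python. This is only a sketch of the arithmetic, assuming the banks are equal-width and sit side by side; the helper name band_edges is ours and is not part of any library used in this project.

```python
def band_edges(low_hz, high_hz, banks):
    """Return (start, end) frequency pairs for equal-width banks."""
    width = (high_hz - low_hz) / banks
    return [(low_hz + i * width, low_hz + (i + 1) * width)
            for i in range(banks)]


edges = band_edges(250, 2500, 100)
print(len(edges))                  # number of banks
print(edges[0], edges[-1])         # first and last band
print(edges[0][1] - edges[0][0])   # width of one band in Hz
```

Changing the three arguments shows how the resolution of each input node shifts if you pick a different frequency range or number of banks for your own sounds.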

Obviously, there are a couple of potential problems with this method on a more general level. First of all, would we be inadvertently training the dog to bark, then stay quiet to get treats? We need to be careful that we’re not reinforcing the first part of this behaviour. Not activating the treat dispenser every time, perhaps. Second, we’re curious to see whether the motivation of a treat will be enough to prevent Biscuit from barking.

If she’s in a particularly cranky mood, she may ignore the system and continue barking. And finally, will the commands we program the project to speak work the same as if a human trainer said them - i.e. will a person saying “Quiet” have the same effect as the machine? We could ask questions all day, but to answer them we actually need to build our project!

Setting Up:

Note: This project assumes you have some basic skills with installing and using Raspbian OS.

Parts Required: | Jaycar | Altronics | Pakronics
1 x Pimoroni Audio Amp SHIM | - | - | -
1 x Raspberry Pi Zero W (Wireless) | - | Z6303A | ADA2816^
1 x Mini HDMI to Standard HDMI Jack Adapter | PA3645 | Included Above | ADA2819
1 x Micro USB OTG Host Cable | WC7725 | Included Above | ADA1099
1 x Speaker - 3" Diameter - 4 Ohm 3 Watt | - | - | ADA1314
1 x microSD Memory Card - 16GB | XC4989 | DA0365 | DF-FIT0394
1 x USB Microphone * | AM4136 | D0985 | ADA3367

* Any USB microphone will work. We used a Blue Yeti studio microphone. ^ Starter kit.

The initial hardware setup for testing our AI is very basic, consisting of a Raspberry Pi Zero W, a Blue Yeti microphone for recording, a USB power bank for moving the setup around the house, and a speaker. For the speaker, we picked up a Pimoroni Mono Audio SHIM from Core Electronics, which is a pinky finger-sized HAT that sits on your Raspberry Pi and can drive a 3W speaker at a very reasonable loudness! All that is needed is to solder some wires onto the speaker and to pop them into the positive and negative terminals on the SHIM. You will also need to follow the instructions included with the SHIM to set it up as an audio device on the Raspberry Pi.

Don’t think you have to use exactly our setup; the Zero is not fussy when it comes to audio devices. You have almost definitely got a set of old speakers and a microphone lying around which will work. Since we had access to a quite clear and sensitive Blue Yeti, we figured it would help pick up fainter barks from across the house.

Remote Access

We often find using a Raspberry Pi directly with a physical mouse and keyboard an inconvenience. It means you need a separate set of computer peripherals to use it, and together with your regular workstation, this doubles the space on your desk required to work on the project. The first approach is VNC (Virtual Network Computing) Viewer. This is simply a desktop and input streaming software that allows you to remotely use computers without physically connecting your devices to them. If you’re not confident with the command line on Linux, this is a great place to start and can be accessed by heading to the Raspberry Pi Configuration menu and enabling VNC. Then, using your viewer computer, connect to the Raspberry Pi’s IP address and enter the default credentials “pi” and “raspberry” for username and password respectively, assuming you haven’t changed them. And then you’ve got full access to the Raspberry Pi desktop!

However, we prefer using command-line based SSH (Secure Shell) for access. We understand this is a bigger learning curve, but it is much faster and is very convenient for running commands and editing scripts without a large software overhead. Using the “ssh” command on your main computer’s terminal (Mac or Linux works great, but you may need to enable the feature on Windows computers), you can connect to your Pi using this command:

ssh pi@192.168.X.X

After you have access to your Raspberry Pi’s terminal, we can get started setting up the libraries we need for the project. Before we do that though, we’re going to set up a virtual Python environment. Why should you bother doing this, you may ask?

It’s good practice to set up a separate Python environment for each project. This way, differing dependencies between projects don’t interfere with each other. As an analogy, imagine a Python environment as a sandbox with toys. Children playing in one big sandbox are bound to have disagreements, starting arguments over which toys they want to use. Maybe one of them doesn’t like a toy brand, size, colour, and so forth, you get the idea.

Making a sandbox for each of them is a good way of solving the problem: they each get their own area to play, and if the toys are chosen for each sandbox, there will be no disagreements. This is the way Python virtual environments work, permitting the setup of individual configurations without messing up the others. For example, TensorFlow installed in one virtual environment (as we’re going to do shortly) won’t appear in any other virtual environment.

The commands below should get the virtual environment up and running. We create a new folder called ‘envs’ to store the environments and create an environment called audio1. To ‘activate’ (i.e. start using) the environment, use the source command with the path to the activate script inside your environment. If successful, the environment name will appear in brackets at the beginning of your terminal prompt, like this: (audio1) pi@raspberrypi:~$

pip install virtualenv
mkdir envs
cd envs
virtualenv audio1
source audio1/bin/activate
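If you are ever unsure whether the environment is really active, Python itself can tell you. The snippet below is a small sketch using only the standard library; the function name in_virtualenv is our own.

```python
import sys


def in_virtualenv():
    """True when the interpreter is running inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the system-wide Python.
    # Very old virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


print("Inside a virtual environment:", in_virtualenv())
print("Packages will be installed under:", sys.prefix)
```

Run it once before and once after the source command, and the reported path should change from the system Python folder to your audio1 folder.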

With our environment set up, it’s now time to talk AI! We need to collect the training data, build an AI model based on the data, convert it to a format optimal for the Pi Zero, and then finally run predictions on it. Depending on your specific application of this project, each stage may differ, but overall, the process should be vaguely similar to this:

On your Raspberry Pi, you’ll now need to run these commands to get going. The first installs tools for audio management and mp3 encoding, which are needed to collect and train with microphone data, and the second installs TensorFlow.

sudo apt-get install ffmpeg lame libatlas-base-dev alsa-utils
pip3 install tensorflow

The AI setup used for this project is an adapted version of Fabio Manganiello’s open source Smart Baby Monitor code. More information on his excellent code can be found on his Platypush blog. He’s created a microphone recording and manipulation library called micmon that we’ll be using to convert raw microphone readings into inputs for a TensorFlow network.

While we could write the TensorFlow software ourselves, there is not a negligible amount of in-depth mathematics and AI knowledge needed, including a good understanding of how TensorFlow and Keras operate. Manganiello’s library is a good medium between dealing with the mathematics and more complex side of AI while learning some intuition about why and how neural networks operate the way they do. His open-source code can be found on Git and cloned directly to your Pi using these commands to get it running.

git clone
cd micmon
sudo pip3 install -r requirements.txt
sudo python3 setup.py build install

Gathering Data

Obviously, we need to gather training data for our AI to understand what and what not to look for. Depending on what exactly you’re training your AI to do, a bit of research (and intuition) will have to dictate the best way to gather training data.

Two common methods exist for sourcing data for AIs. The first is gathering your own data from your environment, which if done correctly is very effective for building high-performing models. If your device will be used solely in the environment that the data was collected in, the AI will be able to better filter input data. We say “done correctly” because it’s really important that you are able to gather large quantities of high-quality data. Remember, we need to label both positive and negative samples of our data, i.e. barking versus anything else, such as a dishwasher or footsteps.

You can also get large datasets of training data from the internet, with everything from household sounds to car engines. These need to be pre-labelled, otherwise you’ll have to label every sample manually. There is no hard and fast rule for gathering AI training data, but the best guideline is quality and quantity: as much of both as possible.

In my case, I opted to leave our Raspberry Pi Zero recording for a couple of hours around the house when Biscuit would be the most active. This can be done by finding currently available recording devices on your Pi, and then recording an mp3 file to the SD card.

pi@raspberrypi: $ arecord -l
**** List of CAPTURE Hardware Devices ****
card 1: Microphone [Yeti Stereo Microphone], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
pi@raspberrypi: $ arecord -D plughw:1,0 -c 1 -f cd | lame - audio.mp3


Now we need to look through the source audio data and label what is and isn’t a dog barking. We aren’t going to sugar-coat it; this is a very boring but unfortunately necessary part of the AI training process. Each audio.mp3 file needs to be put into its own folder called samplex (where ‘x’ is a sample number starting from 1), and put together with a labels.json. To fill out the labels.json file, we recommend opening your audio file in Audacity or Adobe Audition and going to spectrogram view. Then, work your way through the file, looking for positive samples of your desired sound. Enter the timestamps of when the sound starts and ends into the labels.json file as follows:

{ "00:00": "negative",
  "04:50": "positive",
  "05:02": "negative",
…. }
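To make the format concrete, here is a sketch of how such a file could be turned into labelled time ranges. It is not micmon’s own parser. We are assuming that timestamps are written as MM:SS and that each label applies until the next timestamp; the helper names and the 600-second file length are ours.

```python
import json


def to_seconds(stamp):
    """Convert an 'MM:SS' timestamp into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def label_segments(labels, total_seconds):
    """Turn {timestamp: label} into (start, end, label) tuples."""
    marks = sorted((to_seconds(t), label) for t, label in labels.items())
    segments = []
    for i, (start, label) in enumerate(marks):
        # Each label holds until the next timestamp, or the end of the file.
        end = marks[i + 1][0] if i + 1 < len(marks) else total_seconds
        segments.append((start, end, label))
    return segments


example = json.loads(
    '{"00:00": "negative", "04:50": "positive", "05:02": "negative"}')
for segment in label_segments(example, total_seconds=600):
    print(segment)
```

Printing the segments like this is a quick way to spot a mistyped timestamp before you spend time generating a dataset from it.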

Using the spectrogram is quite handy because you’ll quickly get used to how your target sound appears visually, skipping through the file to find samples easily.

For the first test run of this project, we labelled about an hour of audio. We don’t expect the predictions to be state-of-the-art, but it should be a good starting point. For the next part of the project we aim to label much more audio, so all sorts of situations are recognised. For example, if a kitchen appliance is running or a different dog in the street is barking, we should label both positive and negative samples when those situations occur.

We now need to run the micmon-datagen script, which converts the microphone audio to the frequency domain, resulting in ‘banks’ of frequencies. This is exported as a .npz file which will be used during training. You can replace the parameters and file directories with the values from your own setup.

    --low 250 --high 2500 --bins 100 
    --sample-duration 2 --channels 1 
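If you’re curious what this conversion involves, the sketch below does something similar with NumPy on a synthetic tone. It is our own illustration of the idea and not micmon’s actual implementation: we assume each bank is simply the average FFT magnitude of the frequencies that fall inside it, and the function name frequency_banks is hypothetical.

```python
import numpy as np


def frequency_banks(samples, sample_rate, low=250, high=2500, bins=100):
    """Average the FFT magnitude of one audio window into equal-width banks."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    edges = np.linspace(low, high, bins + 1)
    banks = np.zeros(bins, dtype=np.float32)
    for i in range(bins):
        in_band = (freqs >= edges[i]) & (freqs < edges[i + 1])
        if in_band.any():
            banks[i] = spectrum[in_band].mean()
    return banks


# Two seconds of a synthetic 1000 Hz tone sampled at 44.1 kHz.
rate = 44100
t = np.arange(2 * rate) / rate
tone = np.sin(2 * np.pi * 1000 * t)

banks = frequency_banks(tone, rate)
loudest = int(np.argmax(banks))
print(banks.shape, loudest)
```

A pure tone lights up a single bank, whereas a bark spreads its energy across many of them, and that pattern is what the network learns to recognise.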


It’s finally time to train our neural network to understand the correlation between inputs and outputs! When we’ve finished training, we’ll export a file that represents our model state and allows us to make predictions with it.

import os
from tensorflow.keras import layers
import tensorflow as tf
from micmon.dataset import Dataset
from micmon.model import Model

These imports are essential for running our training system, including the dataset management code and TensorFlow itself. By the way, this code is written for, and should be run on, a workstation computer separate from the Raspberry Pi. If you have a lot of data or a complex model to train, GPU acceleration with CUDA significantly speeds up this process compared to training on a CPU, especially a comparatively weak one like the Pi’s.

# .npz dataset files
datasets_dir = os.path.expanduser(
# This is the output directory 
# where the model will be saved
model_dir = os.path.expanduser(
lite_model_dir = os.path.expanduser(
# This is the number of training epochs for each dataset sample
epochs = 100
datasets = Dataset.scan(datasets_dir, validation_split=0.3)
labels = ['negative', 'positive']
freq_bins = len(datasets[0].samples[0])

This code sets up our model directories to read and write model data and sets up the datasets themselves. The epochs variable is the number of rounds of training we should complete on each sample file. The more you do, the more accurate the model (usually) gets. However, overfitting can sometimes result, meaning the model gets so used to seeing a single set of inputs that it does not know how to perform when unknown inputs are presented to it.

You’ll notice the Dataset.scan call has a “validation_split” parameter. A value of 0.3 means that 30% of the data is held back for evaluating the model’s performance, while the other 70% is used for training. Because the network never trains on the validation data, it gives an unbiased measurement of how well the model is doing!
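The idea of holding data back can be shown without any AI libraries at all. The sketch below is a general illustration and not how micmon necessarily does it internally: the shuffling, the seed and the function name split_samples are our assumptions.

```python
import random


def split_samples(samples, validation_split=0.3, seed=0):
    """Shuffle samples and hold back a fraction of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_split)
    # The first slice is for training, the second is never trained on.
    return shuffled[n_validation:], shuffled[:n_validation]


train, validation = split_samples(range(1000))
print(len(train), len(validation))
```

The important property is that no sample ends up in both lists, so the validation score reflects sounds the model has not seen during training.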

model = Model(
        layers.Dense(int(2 * freq_bins),
        layers.Dense(int(0.75 * freq_bins),

Next up, we can set up the structure of the model. For lack of a better description, we’re designing its brain. This model description with four layers may look the same as the neural network diagram in the introduction to the project, because it is. There is one input layer that takes the frequency bank values (100 of them), two hidden layers for “thinking”, and one output layer for making predictions: positive or negative.

The ‘activation’ parameter is a type of non-linear function that transfers values from one layer to the next. Fundamentally, it makes very complex behaviour possible from simple networks like ours. We could spend many pages talking about the different types but it’s not important to discuss in this project.
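To see what “feeding forward” through these four layers means in practice, here is a NumPy sketch with the same layer sizes as the listing above (100 inputs, then 2 x 100 and 0.75 x 100 hidden nodes, then 2 outputs). The weights are random rather than trained, and the choice of ReLU for the hidden layers and softmax for the output is our assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
freq_bins = 100
layer_sizes = [freq_bins, int(2 * freq_bins), int(0.75 * freq_bins), 2]

# One weight matrix and one bias vector per connection between layers.
weights = [rng.normal(0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def relu(x):
    return np.maximum(x, 0)


def softmax(x):
    exps = np.exp(x - x.max())
    return exps / exps.sum()


def forward(inputs):
    """Propagate one set of frequency banks through the network."""
    activation = inputs
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ w + b)               # hidden layers
    return softmax(activation @ weights[-1] + biases[-1])   # output layer


probabilities = forward(rng.random(freq_bins))
print(probabilities, probabilities.sum())
```

Each layer is just a weighted sum followed by the activation function, and the two outputs always add up to one, which is why we can read them as confidence scores for “negative” and “positive”.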

# Train the model
for epoch in range(epochs):
    for i, dataset in enumerate(datasets):
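        print(f'[epoch {epoch+1}/{epochs}] [audio sample {i+1}/{len(datasets)}]')
        # Train on this sample, then score it against its validation split
        model.fit(dataset)
        evaluation = model.evaluate(dataset)
        print(f'Validation set loss and accuracy: {evaluation}')
# Save the model
model.save(model_dir, overwrite=True)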

Finally, we can get to training. This is done in a loop according to the number of epochs we have told the model to complete. We’re using the fit function from the TensorFlow library to train, and then testing the model’s performance with the model.evaluate function. Rinse and repeat!

When initially building this project, we had some trouble getting the required TensorFlow libraries onto the Raspberry Pi Zero, because of its slower and slightly less capable architecture. It tried to download older TensorFlow libraries which are incompatible with the exported models.

Enter, TensorFlow Lite. TFLite is a mobile-optimized library compatible with base TensorFlow libraries that can make inferences directly on small, low-power devices. That means where otherwise data would have to be sent to a dedicated server or processing service to make predictions based on input data, TensorFlow Lite speeds up processing time by simply running the model on the target device. This is great news for latency and power usage, and doesn’t even need an internet connection.

TFLite works by converting an existing full-size TensorFlow model to a compressed .tflite file that, in our case, is around 150KB! Combined with the TensorFlow Lite binary of ~1MB, this tiny package can fit on virtually any IoT device capable of basic processing.

# Save the model
model.save(model_dir, overwrite=True)
converter = tf.lite.TFLiteConverter.
tflite_model = converter.convert()
with open(lite_model_dir + "/model.tflite", 'wb') as f:
    f.write(tflite_model)

The code for generating a TFLite model is very simple, which uses the existing model files, opens them and exports them again as a model.tflite file. The functions to accomplish this are already included in the TensorFlow library.


It’s now time to use our completed TensorFlow Lite model to make predictions on the Raspberry Pi Zero. First things first, we need to transfer the TFLite file over to the Zero, using SCP. If you would prefer to transfer it using a USB drive or another method, that’s OK too. Replace the IP address and directories with values that suit your environment.

user@pc:~/Python/AudioTests $ scp model.tflite pi@192.168.0.XX:~/Python/AudioTests/model/

We can now write our code to load the prediction model, audio and microphone libraries, and GSpread, which will be discussed in the next section.

import numpy as np
from tensorflow import lite as tf
import os
from datetime import datetime
import wave
import pyaudio
import json
from micmon.audio import AudioDevice
from signal import signal, SIGINT
from sys import exit
import gspread

There are a lot of libraries we’re importing here, but it should be fairly self-explanatory why most of them are being used. Notice we don’t need to import any TensorFlow libraries except the Lite functions!

print("Initialising PyAudio library...")
# Instantiate PyAudio
p = pyaudio.PyAudio()
def handler(signal_received, frame):
    # Handle any cleanup here
    print('Program terminating, closing PyAudio...')
    p.terminate()
    exit(0)
signal(SIGINT, handler)
audio_system = 'alsa'  # Supported: alsa and pulse
# Get list of recognized input devices with arecord -l
audio_device = 'plughw:1,0'

Above, we’re starting the audio libraries so we can read data from the microphone and interpret it in terms of the frequency domain live. The ‘handler’ function is an extra addition we made so the PyAudio library can safely shut down when Ctrl-C is pressed on the keyboard, since it requires p.terminate() to be called in order to clean up its resources properly.
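The same Ctrl-C pattern can be tried on any computer without a microphone attached. In this sketch, FakeAudio is a stand-in class we made up to play the role of the PyAudio object, so only the standard library is needed.

```python
import signal
import sys


class FakeAudio:
    """Stand-in for pyaudio.PyAudio so the sketch runs without a microphone."""

    def __init__(self):
        self.open = True

    def terminate(self):
        self.open = False


audio = FakeAudio()


def handler(signal_received, frame):
    # Release the audio resources first, then leave with a success code.
    print("Program terminating, closing audio...")
    audio.terminate()
    sys.exit(0)


# From here on, Ctrl-C calls handler() instead of raising KeyboardInterrupt.
signal.signal(signal.SIGINT, handler)
```

Registering the handler once near the top of the script means the cleanup runs no matter where in the prediction loop the program happens to be when you stop it.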

print("Initialising TFLite model...")
model_dir = os.path.expanduser(
model_name = 'lite-sound-detectmodel.tflite'
# Confidence detection threshold
THRESHOLD = 0.90
# Load the TFLite model and allocate tensors.
interpreter = tf.Interpreter(
    model_path=os.path.join(model_dir, model_name))
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

The block of code above is responsible for setting up all the different parts of the code necessary for reading the TFLite model file. Our confidence threshold is 0.90, which means the model must be at least 90% confident before reporting a detected bark. With it at the default of 50%, false positives quickly become apparent.
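The effect of the threshold is easiest to see with a couple of made-up model outputs. The sketch below assumes the output has the same shape the prediction loop uses later, one row holding a negative score followed by a positive score; the function name is_bark is ours.

```python
THRESHOLD = 0.90


def is_bark(output_data, threshold=THRESHOLD):
    """Return True when the 'positive' score clears the threshold."""
    # output_data holds one row per sample: [negative_score, positive_score].
    return output_data[0][1] > threshold


print(is_bark([[0.40, 0.60]]))                 # clears 50% but not 90%
print(is_bark([[0.05, 0.95]]))                 # clears both
print(is_bark([[0.40, 0.60]], threshold=0.5))  # same sample, default-style 50%
```

A sample the model is only 60% sure about would trigger the trainer at a 50% threshold but is ignored at 90%, which is the kind of false positive we are trying to avoid.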

Google Sheets

To gather statistics on how often our Raspberry Pi is triggered, we’re going to add integration with Google Sheets to automatically enter and update rows into a spreadsheet. While we won’t flesh out the data analysis in this part of the project, we plan to add graphs and summary statistics to see how often barking is detected. For example, if the postman comes around at a specific time of week, we should see his or her arrival reflected in the barking data. (Biscuit really doesn’t like the postman!)

We were delighted with how easy it is to use Google Sheets with Python, requiring only about 10 minutes of setup. It can then host and analyse any data from your IoT projects for free. The Python library we’re using is called Gspread and is a very convenient way to quickly integrate Google Sheets functionality into your programs.

Before we write any code in Gspread though, we need to authenticate our tool through the Google Developer Console. Here, the Google Drive and Google Sheets APIs need to be enabled using the search box. Bots such as our Raspberry Pi can be given “Service Account” privileges, which removes the need for dealing with account logins on the side of the Raspberry Pi, keeping it secure. It instead uses a login credential file that contains a service key for Gspread to use.

After creating a new project in the Developer Console, you’ll need to head to APIs & Services > Credentials > Create credentials > Service account key and then fill out your application details. Then click on “Manage Service Accounts” and add a new service key with type JSON. This will immediately download the file to your computer, which can be placed onto your Pi as ~/.config/gspread/service_account.json. You may need to create the folder and rename the account file. Do not share the JSON file with anyone, as it contains details of your API access. Finally, you’ll need to create a new Sheet file under your Google account and (important!) share the file with the service account email.

Back to our prediction model code, we can add GSpread to be initialized and make a function to run code when a bark is successfully detected. Notice that to add data to a table in Google Sheets, it requires just one line of code: sh.sheet1.append_row(…). It’s incredibly simple, but we can think of many IoT applications using this API.

print("Initialising Google Sheets API...")
gc = gspread.service_account()
#Open AI Dog Trainer spreadsheet
sh = gc.open("AI Dog Trainer")
def detect_bark(confidence):
    now = datetime.now()
    current_time = now.strftime("%m/%d/%Y %H:%M:%S")
    confidence_perc = confidence * 100
    print(f"[{current_time}] Detected dog barking! (Confidence: {confidence_perc:.1f}%)")
    # Note the playSound() function is included in project files.
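As an example of what could be sent to the spreadsheet, the sketch below builds one row per detection. The two-column layout (timestamp, then confidence as a percentage) and the helper name build_row are our assumptions; use whatever columns suit your own analysis.

```python
from datetime import datetime


def build_row(confidence, now=None):
    """Return [timestamp, confidence %] ready to append to a sheet."""
    now = now or datetime.now()
    return [now.strftime("%m/%d/%Y %H:%M:%S"), round(confidence * 100, 1)]


row = build_row(0.934, now=datetime(2021, 8, 1, 14, 30, 5))
print(row)
# With a spreadsheet opened as `sh`, the row could then be sent with:
# sh.sheet1.append_row(row)
```

Keeping the row-building separate from the upload makes it easy to test the formatting on your workstation before involving Google Sheets at all.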

Prediction Loop

It feels like there are thousands of lines of setup code for this program, but don’t fear, we’re nearly there. This is the prediction code, an infinite loop that continually analyses incoming audio data – the bread and butter of the project.

print("Finished initialising model and audio devices.")
print("Starting prediction loop...")
with AudioDevice(audio_system, device=audio_device) as source:
    for sample in source:
        input_data = [np.array(
high_freq=2500), dtype=np.float32)]
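        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        if output_data[0][1] > THRESHOLD:
            detect_bark(output_data[0][1])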

It looks pretty crazy, but this is actually reassuringly intuitive if you read it through. We convert the microphone audio sample to an FFT dataset – the 100 banks of frequencies we talked about earlier – and feed it into the first layer (or ‘tensor’).

Then, we use invoke() to run the values through the network and read the output tensor. If the positive output tensor is higher than our threshold, 0.90, the model thinks that it detected a bark. Awesome!
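The set, invoke and read sequence can be practised without a trained model. In the sketch below, StandInInterpreter is a toy class we wrote that only mimics those three calls and scores a window by its average level; it is not TensorFlow Lite and says nothing about how a real model behaves.

```python
import numpy as np

THRESHOLD = 0.90


class StandInInterpreter:
    """Mimics the three interpreter calls used in the prediction loop."""

    def set_tensor(self, index, value):
        self._input = np.asarray(value, dtype=np.float32)

    def invoke(self):
        # Toy rule instead of a trained network: louder input, higher score.
        positive = float(np.clip(self._input.mean(), 0.0, 1.0))
        self._output = np.array([[1.0 - positive, positive]], dtype=np.float32)

    def get_tensor(self, index):
        return self._output


def predict(interpreter, banks):
    """Run one window of frequency banks through the interpreter."""
    interpreter.set_tensor(0, [banks])
    interpreter.invoke()
    return float(interpreter.get_tensor(0)[0][1])


interpreter = StandInInterpreter()
windows = [np.full(100, 0.2), np.full(100, 0.95)]  # quiet, then loud
detections = [predict(interpreter, w) > THRESHOLD for w in windows]
print(detections)
```

Swapping the stand-in for the real interpreter and the made-up windows for live microphone samples gives the structure of the loop above.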


We didn’t have a lot of time to test the finished AI project, but it really is just a matter of running the script and letting the AI do the rest. We used a 3 second WAV clip of the Star Wars theme to play whenever the barking detection is triggered.

For the limited amount of training the model underwent, I was pretty impressed with its performance. It reliably recognises when Biscuit barks, even sometimes from across the house or in the yard. However, after leaving it running for a few days a couple of weird false positives surfaced – including crackling of the fireplace, doors creaking open or sometimes even a plate clinking. Maybe those objects have similar resonant frequencies to a dog barking, but more to the point, the model needs more training. Multiple hours of training audio would be highly beneficial to its performance.


You should be excited if you run into any code errors. It means you get to sit down with a hot coffee, working through your source files in order to then shout a satisfying “aha!” once the code works. Ok, maybe it isn’t that romantic or enjoyable, but it is most definitely necessary.

Many of the problems we ran into were dependency-related, so if the right version of a library isn’t installed (or not installed at all), the code is likely to complain. It’s always possible to get a fresh start by starting a new Python virtual environment and installing the libraries from scratch.

Speaking of installing libraries, we found the larger packages like Numpy and TensorFlow take a very long time to install. Sometimes the Python package installer cannot find prebuilt ‘wheels’ and must manually compile from the code source. In some cases, this can take multiple hours.

So be patient! If it really is frozen, we found the fix was to install packages with the --no-cache-dir option when running pip. This is possibly related to the Zero’s very limited RAM.

Next Month: Part 2

It’s probably safe to say you don’t need us to tell you about what AI is capable of. However, we hope that this guide has provided you with some ideas of how to make your own AI systems, even if it’s got nothing to do with barking dogs.

Neural networks are, fundamentally, a big network of mathematical sums, so it should not be surprising that any data we can quantify as makers can be leveraged by AI to make our creations substantially better. In the future, we’re going to explore TensorFlow Lite more as it’s a great way of running fully-fledged AI models on hardware no bigger than your palm.

Next month, we’re making a 3D printed enclosure and additional electronics to make this into a finished project. We’ll be adding a treat dispenser, a fancy 3D printed enclosure and an automatic servo motor script that will let the system work without user intervention. Stay tuned, and clue us in with any audio AI inventions you make!

Liam Davies

DIYODE Staff Writer