Implemented native resampling and DTMF gen - frugalvox - A tiny VoIP IVR framework by hackers, for hackers

commit 12ffaf9efec3a799041125779933108b99905101
parent e15357c5fe9813a7b9d741753497c712387ba2d8
Author: Luxferre <lux@ferre>
Date:   Tue, 28 Feb 2023 15:26:49 +0200

Implemented native resampling and DTMF gen

Diffstat:
M Dockerfile  | 2 +-
M README.md  | 46 ++++++++++++++--------------------------------
M example-config/config.yaml  | 8 +-------
M fvx.py  | 59 +++++++++++++++++++++++++++++++++++------------------------

4 files changed, 51 insertions(+), 64 deletions(-)
diff --git a/Dockerfile b/Dockerfile
@@ -2,7 +2,7 @@ FROM python:3.10-slim-bullseye
 USER root
 WORKDIR /usr/src/app
 RUN sed -i -e's/ main/ main contrib non-free/g' /etc/apt/sources.list
-RUN apt-get update -y && apt-get install -y libttspico-utils espeak-ng espeak-ng-data espeak-ng-espeak flite sox mbrola-*
+RUN apt-get update -y && apt-get install -y libttspico-utils espeak-ng espeak-ng-data espeak-ng-espeak flite mbrola-*
 COPY requirements.txt ./
 COPY pyVoIP-1.6.4.patched-py3-none-any.whl ./
 RUN pip install --no-cache-dir -r requirements.txt
diff --git a/README.md b/README.md
@@ -24,11 +24,10 @@ A tiny VoIP IVR framework by hackers and for hackers.
 
 - Python 3.8 or higher (3.10 recommended)
 - pyVoIP 1.6.4, patched according to [this comment](https://github.com/tayler6000/pyVoIP/issues/107#issuecomment-1440231926) (also available as a `.whl` file in this repo)
-- NumPy (mandatory, required for DTMF detection)
-- SoX (mandatory, required for DTMF generation and TTS transcoding functionality)
+- NumPy (mandatory, required for DTMF detection and generation)
 - eSpeakNG (optional but used by the default TTS engine configuration)
 
-For Python-side dependencies, just run `pip install -r requirements.txt` from the project directory. SoX and eSpeakNG (or other TTS engine of your choice, see the FAQ section) must be installed separately with your host OS package manager.
+For Python-side dependencies, just run `pip install -r requirements.txt` from the project directory. eSpeakNG (or other TTS engine of your choice, see the FAQ section) must be installed separately with your host OS package manager.
 
 ### Usage
 
@@ -38,7 +37,7 @@ Make sure your `python` command is pointing to Python 3.8 or higher.
 
 ## Running FrugalVox in Docker
 
-The Docker image encapsulates all the dependencies (including Python 3.10, three different TTS engines (see the FAQ section), SoX and the patched pyVoIP package) but requires you to provide all the configuration and action scripts in a volume mounted from the host. In addition to this, the configuration file itself must be called `config.yaml` since the container is only going to be looking for this name.
+The Docker image encapsulates all the dependencies (including Python 3.10, three different TTS engines (see the FAQ section) and the patched pyVoIP package) but requires you to provide all the configuration and action scripts in a volume mounted from the host. In addition to this, the configuration file itself must be called `config.yaml` since the container is only going to be looking for this name.
 
 ### Building
 
@@ -114,11 +113,7 @@ All the fields in this section, except `transport`, are currently mandatory. If 
 
 This section allows you to configure your TTS engine, for FrugalVox to be able to generate audio clips from your text. The fields are:
 
-- `tts.voice`: the name of the voice supported by your TTS engine (eSpeakNG by default)
-- `tts.rate`: words per minute speech rate of the voice
-- `tts.volume`: voice volume (0 to 200 for eSpeakNG)
-- `tts.pitch`: voice pitch (check with your TTS program which one is optimal for you, 60 is the default in the example config)
-- `tts.cmd`: a dictionary with the TTS synth and transcoder command templates, please just leave the default values there unless you want to switch to a different TTS engine other than eSpeakNG or a different encoder other than SoX
+- `tts.cmd`: the TTS synth command template, please just leave the default values there unless you want to switch to a different TTS engine other than eSpeakNG
 - `tts.phrases`: a dictionary where every key is the clip name and the value is the phrase text to be rendered to that clip on the kernel start
 
 ### Static audio clips list: `clips`
@@ -179,7 +174,7 @@ The action script may import any other Python modules at your disposal, includin
 ### Useful methods, variables and objects exposed by the `fvx` kernel module
 
 - `fvx.load_yaml(filename)`: a wrapper method to read a YAML file contents into a Python variable (useful if your action scripts have their own configuration files)
-- `fvx.load_audio(filename)`: a method to read a WAV PCM file into the audio buffer in memory (note that it must be unsigned 8-bit 8Khz in order to work with pyVoIP calls)
+- `fvx.load_audio(filename)`: a method to read a WAV PCM file into the audio buffer in memory, automatically resampling it if necessary
 - `fvx.logevent(msg)`: a drop-in replacement for Python's `print` function that outputs a formatted log message with the timestamp
 - `fvx.audio_buf_len`: the recommended length (in bytes) of a raw audio buffer to be sent to or received from the call object the action is operating on
 - `fvx.emptybuf`: a buffer of empty audio data, `fvx.audio_buf_len` bytes long
@@ -223,7 +218,7 @@ Because vanilla pyVoIP 1.6.4 has a bug its maintainers don't even seem to recogn
 
 **I understand the importance of eSpeakNG but it sounds terrible even with MBROLA. Which else open source TTS engines can you recommend to use with FrugalVox?**
 
-The first obvious choice would be ~~Festival~~ [Flite](https://github.com/festvox/flite). With an externally downloaded `.flitevox` voice, of course. It has a number of limitations: only English and Indic languages support, no way to adjust the volume, but the output quality is definitely a bit better. If you use the Docker image of FrugalVox, Flite is also included but you have to ship your own `.flitevox` files located somewhere inside your config directory. Also, current Flite versions already generate 16 KHz PCM files instead of 8 KHz, so the transcoder command still needs to be in place.
+The first obvious choice would be ~~Festival~~ [Flite](https://github.com/festvox/flite). With an externally downloaded `.flitevox` voice, of course. It has a number of limitations: only English and Indic languages support, no way to adjust the volume, but the output quality is definitely a bit better. If you use the Docker image of FrugalVox, Flite is also included but you have to ship your own `.flitevox` files located somewhere inside your config directory.
 
 The second obvious choice would be [Pico TTS](https://github.com/naggety/picotts) which is (or was) used as a built-in offline TTS engine in Android. It supports more European languages (besides two variants of English, there also are Spanish, German, French and Italian) but has a single voice per language and absolutely no parameters to configure. Also, it requires autotools to build but the process looks straightforward: `./autogen.sh && ./configure && make && sudo make install`. After this, we're interested in the `pico2wave` command. Please note that its current version has some bug retrieving the text from the command line, so we use an "echo to the pipe" approach. For your convenience, this engine also comes pre-installed in the FrugalVox Docker image.
 
@@ -239,13 +234,7 @@ eSpeakNG + MBROLA:
 
 ```yaml
 tts:
-  voice: 'us-mbrola-2'
-  rate: 130 # words per minute
-  volume: 70 # from 0 to 200
-  pitch: 60
-  cmd:
-    synth: 'espeak -v %s -a %d -p %d -s %d -w %s "%s"' # parameter order: voice, volume, pitch, rate, filename, text
-    transcode: 'sox %s -r 8000 -b 8 -c 1 -D %s' # parameter order: inputfile, outputfile
+  cmd: 'espeak -v us-mbrola-2 -a 70 -p 60 -s 130 -w %s "%s"' # parameter order: filename, text
   ...
 ```
 
@@ -253,13 +242,7 @@ Flite/Mimic 1:
 
 ```yaml
 tts:
-  voice: 'tts/mycroft_voice_4.0.flitevox'
-  rate: 1 # Flite uses a factor instead of absolute value
-  volume: 0 # Flite doesn't support volume adjustment
-  pitch: 100 # Flite uses slightly different pitch scale
-  cmd:
-    synth: 'flite -voice %s --setf vol=%d --setf int_f0_target_mean=%d --setf duration_stretch=%d -o %s -t "%s"' # parameter order: voice, volume, pitch, rate, filename, text
-    transcode: 'sox %s -r 8000 -b 8 -c 1 -D %s' # parameter order: inputfile, outputfile
+  cmd: 'flite -voice tts/mycroft_voice_4.0.flitevox --setf int_f0_target_mean=100 --setf duration_stretch=1 -o %s -t "%s"' # parameter order: filename, text
   ...
 ```
 
@@ -267,16 +250,15 @@ Pico TTS:
 
 ```yaml
 tts:
-  voice: 'en-US'
-  rate: 0 # Pico doesn't support it
-  volume: 0 # Pico doesn't support it
-  pitch: 0 # Pico doesn't support it
-  cmd:
-    synth: VOICE=%s UU=%d%d%d OUTF=%s sh -c 'echo "%s" | pico2wave -l $VOICE -w $OUTF' # parameter order: voice, volume, pitch, rate, filename, text
-    transcode: 'sox %s -r 8000 -b 8 -c 1 -D %s' # parameter order: inputfile, outputfile
+  cmd: OUTF=%s sh -c 'echo "%s" | pico2wave -l en-US -w $OUTF' # parameter order: filename, text
   ...
 ```
 
+## Version history
+
+- 0.0.2 (2023-02-28, current): fully got rid of SoX dependency, simplified TTS configuration
+- 0.0.1 (2023-02-26): initial release
+
 ## Credits
 
 Created by Luxferre in 2023.
diff --git a/example-config/config.yaml b/example-config/config.yaml
@@ -11,13 +11,7 @@ sip:
 
 # TTS engine configuration
 tts:
-  voice: 'us-mbrola-2'
-  rate: 130 # words per minute
-  volume: 70 # from 0 to 200
-  pitch: 60
-  cmd: # command templates, do not modify them unless you're fully changing the engine
-    synth: 'espeak -v %s -a %d -p %d -s %d -w %s "%s"' # parameter order: voice, volume, pitch, rate, filename, text
-    transcode: 'sox %s -r 8000 -b 8 -c 1 -D %s' # parameter order: inputfile, outputfile
+  cmd: 'espeak -v us-mbrola-2 -a 70 -p 60 -s 130 -w %s "%s"' # parameter order: filename, text
   phrases: # key is the clip name, value is the text
     passprompt: 'Please enter your pin followed by pound after the beep.'
     cmd: 'Please enter your command, ending with pound.'
diff --git a/fvx.py b/fvx.py
@@ -2,7 +2,7 @@
 
 # FrugalVox: experimental, straightforward, no-nonsense IVR framework on top of pyVoIP (patched) and TTS engines
 # Created by Luxferre in 2023, released into public domain
-# Deps: PyYAML, NumPy, espeak-ng/flite/libttspico, SoX, patched pyVoIP (see https://github.com/tayler6000/pyVoIP/issues/107#issuecomment-1440231926)
+# Deps: PyYAML, NumPy, espeak-ng/flite/libttspico, patched pyVoIP (see https://github.com/tayler6000/pyVoIP/issues/107#issuecomment-1440231926)
 # All configuration is in config.yaml
 
 import sys
@@ -10,17 +10,17 @@ import os
 import signal
 import tempfile
 import yaml
-import wave
+import wave, audioop
 import time
 from datetime import datetime # for logging
 import traceback # for logging
 import socket # for local IP detection
-import numpy as np # for in-band DTMF detection
+import numpy as np # for in-band DTMF detection and generation
 import importlib.util # for action modules import
 from pyVoIP.VoIP import VoIPPhone, InvalidStateError, CallState
 
 # global parameters
-progname = 'FrugalVox v0.0.1'
+progname = 'FrugalVox v0.0.2'
 config = {} # placeholder for config object
 configfile = './config.yaml' # default config yaml path (relative to the workdir)
 if len(sys.argv) > 1:
@@ -51,7 +51,6 @@ DTMF_TABLE = {
     '#': [1477, 941],
     'D': [1633, 941]
 }
-DTMF_GEN_CMD = 'sox -n -D -b 8 -r 8000 %s synth 0.2 sin %s sin %s remix - gain -0.1' # command template to generate DTMF tone clips (order: file, f1, f2)
 ivrconfig = None # placeholder for IVR auth config
 calls = {} # placeholder for all realtime call instances
 
@@ -61,11 +60,36 @@ def logevent(msg):
     dts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
     print('[%s] %s' % (dts, msg))
 
-def load_audio(fname): # load audio data from a WAV PCM file
+def load_audio(fname): # load audio data from a WAV PCM file, resampling it if necessary
     f = wave.open(fname, 'rb')
-    frames = f.getnframes()
+    outrate = 8000
+    aparams = f.getparams()
+    frames = aparams.nframes
+    channels = aparams.nchannels
+    inrate = aparams.framerate
+    swidth = aparams.sampwidth
     data = f.readframes(frames)
     f.close()
+    if channels > 1: # convert to mono
+        data = audioop.tomono(data, swidth, 0.5, 0.5)
+    if inrate > outrate or swidth > 1: # convert the sample rate and bit width at the same time
+        rfactor = int(inrate / outrate) * swidth # only multiples of 8 KHz are supported
+        out = bytearray()
+        blen = len(data)
+        bwidth = swidth << 3 # incoming bit width
+        bfactor = 1 << (bwidth - 8) # factor to divide the biased sample value by to get a single byte
+        for i in range(0, blen, swidth): # only add every `rfactor`th frame
+            if (i % rfactor) == 0:
+                if swidth == 1:
+                    bval = data[i]
+                else:
+                    bval = int.from_bytes(bytes(data[i:i+swidth]), byteorder='little', signed=True)
+                if bfactor > 1: # perform bit reduction if necessary
+                    bval = int(round(bval / bfactor)) + 128
+                if bval > 255: # handle clipping
+                    bval = 255
+                out.append(bval)
+        data = bytes(out)
     return data
 
 def load_yaml(fname): # load an object from a YAML file
@@ -75,17 +99,8 @@ def load_yaml(fname): # load an object from a YAML file
     return yaml.safe_load(yc)
 
 def tts_to_file(text, fname, conf): # render the text to a file
-    fh, tname = tempfile.mkstemp('.wav', 'fvx-')
-    os.close(fh)
-    rate = int(conf['rate'])
-    volume = int(conf['volume'])
-    pitch = int(conf['pitch'])
-    ecmd = conf['cmd']['synth'] % (conf['voice'], volume, pitch, rate, tname, text)
+    ecmd = conf['cmd'] % (fname, text)
     os.system(ecmd) # render to the temporary file
-    # now, resample the synthesized file to Unsigned 8-bit 8Khz mono PCM
-    smpcmd = conf['cmd']['transcode'] % (tname, fname)
-    os.system(smpcmd)
-    os.remove(tname)
 
 def tts_to_buf(text, conf): # render the text directly to a buffer
     fh, fname = tempfile.mkstemp('.wav', 'fvx-')
@@ -95,13 +110,9 @@ def tts_to_buf(text, conf): # render the text directly to a buffer
     os.remove(fname)
     return buf
 
-def gen_dtmf(f1, f2): # render two sine frequencies to a file
-    fh, tname = tempfile.mkstemp('.wav', 'fvx-')
-    os.close(fh)
-    os.system(DTMF_GEN_CMD % (tname, f1, f2))
-    buf = load_audio(tname)
-    os.remove(tname)
-    return buf
+def gen_dtmf(f1, f2): # directly render two sine frequencies to a buffer (0.2 s duration and 8KHz sample rate hardcoded)
+    nbuf = np.arange(0, 0.2, 1 / 8000) # init target signal buffer and then sum the sine signals
+    return (127 + 61.44 * (np.sin(2 * np.pi * f1 * nbuf) + np.sin(2 * np.pi * f2 * nbuf))).astype(np.ubyte).tobytes()
 
 def get_caller_addr(call): # extract caller's SIP address from the call request headers
     return call.request.headers['From']['address']
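
Note on the new `gen_dtmf`: the NumPy expression that replaces the SoX command can be sanity-checked outside of FrugalVox. The sketch below repeats the same arithmetic (0.2 s at 8 kHz, two summed sines scaled by 61.44 around a bias of 127) and then looks for the two strongest spectral peaks. The `dominant_pair` helper and the 1075 Hz split between the low and high DTMF groups are illustrative assumptions and are not part of `fvx.py`.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz, hardcoded in the commit
DURATION = 0.2      # seconds, hardcoded in the commit

def gen_dtmf(f1, f2):
    """Same arithmetic as the committed gen_dtmf: two summed sines as unsigned 8-bit PCM."""
    t = np.arange(0, DURATION, 1 / SAMPLE_RATE)  # sample instants in seconds
    mixed = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)  # range -2..2
    return (127 + 61.44 * mixed).astype(np.ubyte).tobytes()

def dominant_pair(buf):
    """Hypothetical helper: strongest FFT peak below and above 1075 Hz."""
    samples = np.frombuffer(buf, dtype=np.ubyte).astype(np.float64)
    samples -= samples.mean()  # drop the DC bias so bin 0 does not dominate
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1 / SAMPLE_RATE)
    low = freqs < 1075  # assumed split: DTMF rows are 697-941 Hz, columns 1209-1633 Hz
    low_peak = freqs[low][np.argmax(spectrum[low])]
    high_peak = freqs[~low][np.argmax(spectrum[~low])]
    return low_peak, high_peak

tone = gen_dtmf(697, 1209)  # the '1' key in DTMF_TABLE
low_peak, high_peak = dominant_pair(tone)
print(len(tone), min(tone), max(tone), low_peak, high_peak)
```

With a 0.2 s clip the FFT bins are about 5 Hz apart, so the reported peaks can only be expected to land within roughly one bin of the requested frequencies. The 61.44 scale keeps the summed signal inside the unsigned 8-bit range (127 ± 122.88), so no clipping step is needed.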
