frugalvox

A tiny VoIP IVR framework by hackers, for hackers
git clone git://git.luxferre.top/frugalvox.git
Log | Files | Refs | README | LICENSE

README.md (21396B)


      1 # FrugalVox
      2 
      3 A tiny VoIP IVR framework by hackers and for hackers.
      4 
      5 ## Features
      6 
      7 - Small and nimble: the kernel is a single Python 3 file (~270 SLOC), and the configuration is a single YAML file
      8 - Hackable: the kernel is well-commented and not so big, all the actions are full-featured Python scripts
      9 - Written with plain telephony in mind, supporting both out-of-band and in-band DTMF command detection, as well as DTMF audio clip generation
     10 - Comes with PIN-based authentication and action access control out of the box (optional but recommended)
     11 - Comes with TTS integration out of the box, configured for eSpeakNG by default (optional)
     12 - Container-ready
     13 - Released into public domain
     14 
     15 ## Limitations
     16 
     17 - Only UDP transport is supported for now (pyVoIP limitation)
     18 - Only a single SIP server registration per instance is supported by design (if you need to receive incoming calls via multiple SIP accounts, you must spin up multiple FrugalVox instances)
     19 - The format for top-level action commands is hardcoded (see the workflow section of this README)
     20 
     21 ## Running FrugalVox as a host application
     22 
     23 ### Dependencies 
     24 
     25 - Python 3.8 or higher (3.10 recommended)
     26 - pyVoIP 1.6.4, patched according to [this comment](https://github.com/tayler6000/pyVoIP/issues/107#issuecomment-1440231926) (also available as a `.whl` file in this repo)
     27 - NumPy (mandatory, required for DTMF detection and generation)
     28 - eSpeakNG (optional but used by the default TTS engine configuration)
     29 
     30 For Python-side dependencies, just run `pip install -r requirements.txt` from the project directory. eSpeakNG (or other TTS engine of your choice, see the FAQ section) must be installed separately with your host OS package manager.
     31 
     32 ### Usage
     33 
     34 Just run `python /path/to/fvx.py [your-config.yaml]`. Press Ctrl+C or otherwise terminate the process when you don't need it.
     35 
     36 Make sure your `python` command is pointing to Python 3.8 or higher.
     37 
     38 ## Running FrugalVox in Docker
     39 
     40 The Docker image encapsulates all the dependencies (including Python 3.10, three different TTS engines (see the FAQ section) and the patched pyVoIP package) but requires you to provide all the configuration and action scripts in a volume mounted from the host. In addition to this, the configuration file itself must be called `config.yaml` since the container is only going to be looking for this name.
     41 
     42 ### Building
     43 
     44 From the source directory, run: `docker build -t frugalvox .` and the image called `frugalvox` will be built.
     45 
     46 Alternatively, you can build the "slim" version of the image based on Alpine Linux, which will only contain the Pico TTS engine. For `x86_64` architecture, such an image will only weigh around 180 MiB. Do this using this command: `docker build -t frugalvox:slim -f Dockerfile.slim .`
     47 
     48 ### Running from the command line
     49 
     50 The command to run the `frugalvox` image locally is:
     51 
     52 ```
     53 docker run -d --rm -v /path/to/configdir:/opt/config --name frugalvox frugalvox
     54 ```
     55 
     56 Note that the `/path/to/configdir` must be absolute. Use `$(pwd)` command to get the current working directory if your configuration directory path is relative to it.
     57 
     58 ### Running from Docker Compose
     59 
     60 Add this to your `compose.yaml`, replacing `$PWD/example-config` with your configuration directory path:
     61 
     62 ```yaml
     63 services:
     64   # ... other services here ...
     65   frugalvox:
     66     image: frugalvox:latest
     67     container_name: frugalvox
     68     restart: on-failure:10
     69     volumes:
     70       - "$PWD/example-config:/opt/config"
     71 ```
     72 
     73 Then, on the next `docker compose up -d` run, the FrugalVox container should be up. Note that you can attach this service to any network you already created in Compose file, as long as it allows the containers to have Internet access. 
     74 
     75 ## Typical FrugalVox workflow
     76 
     77 Without auth:
     78 
     79 1. User calls the IVR.
     80 2. User is prompted for the command.
     81 3. User enters the DTMF command in the form `[action ID]*[param1]*[param2]*...#`.
     82 4. The IVR checks the action ID. If it exists, the corresponding action script is run. If it doesn't, an error message is played back to the user.
     83 5. User is prompted for the next command, and so on.
     84 
     85 With auth (recommended):
     86 
     87 1. User calls the IVR.
     88 2. User is prompted for the PIN (internal user ID) followed by `#` key.
     89 3. The IVR checks the PIN. If it doesn't exist in the user list, the caller is warned and the call is terminated. Otherwise, go to the next step.
     90 4. User is prompted for the command.
     91 5. User enters the DTMF command in the form `[action ID]*[param1]*[param2]*...#`.
     92 6. The IVR checks the action ID. If it exists in the list of actions allowed for the user, the corresponding action script is run. If it doesn't, an error message is played back to the user.
     93 7. User is prompted for the next command, and so on.
     94 
     95 ## Configuration
     96 
     97 All FrugalVox configuration is done in a single YAML file that's passed to the kernel on the start. If no file is passed, FrugalVox will look for `config.yaml` file in the current working directory.
     98 
     99 The entire config is done in four different sections of the YAML file: `sip`, `tts`, `clips` and `ivr`.
    100 
    101 ### SIP client configuration: `sip`
    102 
    103 This section lets you configure which SIP server FrugalVox will connect to in order to start receiving incoming calls. The fields are:
    104 
    105 - `sip.host`: your SIP provider hostname or IP address
    106 - `sip.port`: your SIP provider port number (usually 5060)
    107 - `sip.transport`: optional, not used for now, added here for the forward compatibility with future versions (the only supported transport is now `udp`)
    108 - `sip.username`: your SIP account auth username (only the username itself, no domain or URI parts)
    109 - `sip.password`: your SIP account auth password
    110 - `sip.rtpPortLow` and `sip.rtpPortHigh`: your UDP port range for RTP communication, usually the default of 10000 and 20000 is fine
    111 
    112 All the fields in this section, except `transport`, are currently mandatory. If unsure about `rtpPortLow` and `rtpPortHigh`, just leave the values provided in the example config.
    113 
    114 ### Text-to-speech engine settings: `tts`
    115 
    116 This section allows you to configure your TTS engine, for FrugalVox to be able to generate audio clips from your text. The fields are:
    117 
    118 - `tts.cmd`: the TTS synth command template, please just leave the default values there unless you want to switch to a different TTS engine other than eSpeakNG
    119 - `tts.phrases`: a dictionary where every key is the clip name and the value is the phrase text to be rendered to that clip on the kernel start
    120 
    121 ### Static audio clips list: `clips`
    122 
    123 This section determines which static audio clips are to be loaded into memory in addition to the synthesized voice. The clips must be in the unsigned 8-bit 8KHz PCM WAV format. The fields are:
    124 
    125 - `clips.dir`: path to the directory (relative to the configuration file directory) containing the audio clips to load
    126 - `clips.files`: a dictionary mapping the clip names to the `.wav` file names in the `clips.dir` directory 
    127 
    128 Both fields are required to fill, but if you're not planning on using any static audio, just set `clips.dir` value to `'.'` and `clips.files` to `{}`.
    129 
    130 When naming the clips, a single limitation holds: a clip must not be named `dtmf`. Because when the reference is passed to action scripts, the `clips.dtmf` object will hold the audio clips generated for all 16 DTMF digits.
    131 
    132 ### IVR configuration: `ivr`
    133 
    134 This section lets you set up PIN-based authentication, access control and, most importantly, IVR actions themselves and the scripts that implement them.
    135 
    136 These two fields are mandatory to fill (although can be left as empty arrays):
    137 
    138 - `ivr.cmdpromptclips`: an array with a sequence of the clip names to play back to prompt the caller for a command
    139 - `ivr.cmdfailclips`: an array with a sequence of the clip names to play back to alert the caller about an invalid command 
    140 
    141 Note that any clip name in this section can refer to both static and synthesized voice audio clips, they all are populated at the same place at this point.
    142 
    143 To turn the authentication part on and off, use the `ivr.auth` field. If and only if this field is set to `true`, the following fields are required:
    144 
    145 - `ivr.authpromptclips`: an array with a sequence of the clip names to play back to prompt for the caller's PIN
    146 - `ivr.authfailclips`: an array with a sequence of the clip names to play back to alert the caller about the invalid PIN before hanging up the call
    147 - `ivr.users`: a dictionary that maps valid user IDs (PINs) to the lists (arrays) of action IDs they are authorized to run, or `'*'` string if the user is authorized to run all registered actions on this instance
    148 
    149 Also, if the user is authenticated but not authorized to run a particular action, the same "invalid command" message sequence from `ivr.cmdfailclips` will be played back as if the command was not found. This is implemented by design as a security measure. FrugalVox itself will log a different message to the console though.
    150 
    151 Finally, all the action mapping is done in the mandatory `ivr.actions` dictionary. The key is the action ID (without parameters, i.e. commands `22*5#` and `22*44*99#` both mean an action with ID 22, just with different parameter lists) and the value is the path to the action script file, relative to the configuration file directory.
    152 
    153 ## Action scripts
    154 
    155 An action script is a regular Python module file referenced in the `ivr.actions` section of the configuration YAML file. A single script may implement one or more actions based on the action ID. The module must implement the `run_action` call in order to work as an action script, as follows:
    156 
    157 ```
    158 def run_action(action_id, params, call_obj, user_id, config, clips, calls):
    159     ...
    160 ```
    161 
    162 where:
    163 
    164 - `action_id` is a string action identifier,
    165 - `params` is an array of string parameters to the action,
    166 - `call_obj` is an instance of `pyVoIP.VoIP.VoIPCall` class passed from the main FrugalVox call processing loop,
    167 - `user_id` is the ID (PIN) string of the user running the action (if authentication is turned off, it's always `0000`),
    168 - `config` is the dictionary containing the entire FrugalVox configuration object (as specified in the YAML),
    169 - `clips` is the object containing all in-memory audio clips (in the unsigned 8-bit 8KHz PCM format) ready to be fed into the `write_audio` method of the `call_obj` (with `clips['dtmf']` being a dictionary with the pre-rendered DTMF digits),
    170 - `calls` is a dictionary with all currently active calls on the instance (keyed with the `call_obj.call_id` value).
    171 
    172 _Protip:_ if we use `*` as parameter separator and `#` as command terminator, why are `action_id`, `user_id` and all the action parameters still treated as strings as opposed to numbers? Because `A`, `B`, `C` and `D` still are valid DTMF digits and can be legitimately used in the actions or their parameters. Of course, if you target normal phone users, you should avoid using the "extended" digits, but there still is a possibility to do so. If you need to treat your action parameters or any IDs as numbers only, please do this yourself in your action script code.
    173 
    174 The action script may import any other Python modules at your disposal, including the main `fvx.py` kernel to use its helper methods, and all the modules available in the configuration file directory (in case it differs from the default one). An example action script that implements three demonstration actions, `32` for echo test, `24` for caller ID readback and `22*[times]` for beep, is shipped in this repo at `example-config/actions/echobeep.py`.
    175 
    176 ### Useful methods, variables and objects exposed by the `fvx` kernel module
    177 
    178 - `fvx.load_yaml(filename)`: a wrapper method to read a YAML file contents into a Python variable (useful if your action scripts have their own configuration files)
    179 - `fvx.load_audio(filename)`: a method to read a WAV PCM file into the audio buffer in memory, automatically resampling it if necessary
    180 - `fvx.logevent(msg)`: a drop-in replacement for Python's `print` function that outputs a formatted log message with the timestamp
    181 - `fvx.audio_buf_len`: the recommended length (in bytes) of a raw audio buffer to be sent to or received from the call object the action is operating on
    182 - `fvx.emptybuf`: a buffer of empty audio data, `fvx.audio_buf_len` bytes long
    183 - `fvx.detect_dtmf(buf)`: a method to detect a DTMF digit in the audio data buffer (see `example-config/actions/echobeep.py` for an example of how to use it correctly)
    184 - `fvx.tts_to_buf(text, ttsconfig)`: a method to directly render your text into an audio data buffer (pass `config['tts']` as the second parameter if you don't want to change anything in the TTS settings)
    185 - `fvx.tts_to_file(text, filename, ttsconfig)`: same as `fvx.tts_to_buf` method but writes the result to a WAV PCM file
    186 - `fvx.get_caller_addr(call_obj)`: a method to extract the caller's SIP address from a `VoIPCall` object (e.g. the one passed to the action)
    187 - `fvx.get_callee_addr(call_obj)`: a method to extract the destination SIP address from a `VoIPCall` object (e.g. the one passed to the action)
    188 - `fvx.flush_input_audio(call_obj)`: a method to ensure any excessive audio is not collected in the call audio buffer, recommended to use at the start of any actions that perform incoming audio processing
    189 - `fvx.playbuf(buf, call_obj)`: a method to properly play back any audio buffer to the call's line
    190 - `fvx.kernelroot`: a string that contains the FrugalVox kernel directory path
    191 - `fvx.configroot`: a string that contains the config file directory path
    192 
    193 ## FAQ
    194 
    195 **How was this created?**
    196 
    197 Initially, FrugalVox was created to answer a simple question: "given a VoIP provider with a DID number I pay for monthly anyway, and a cheap privacy-centric VPS already filled with other stuff, how can I combine these two things together to control various things on the Internet from a Nokia 1280 class feature phone without the Internet access?"
    198 
    199 **Why not Asterisk/FreeSWITCH/etc then? What's wrong with existing solutions?**
    200 
    201 Nothing wrong at the first glance, but... Asterisk's code base was, as of November 2016, as large as 1139039 SLOC. If you don't see a problem with that, I envy your innocence. Anyway, I doubt that any of those would be able to comfortably run on that cheap VPS or on my Orange Pi with 256 MB RAM. For my goals, it would be like hunting sparrows with ballistic missiles.
    202 
    203 FrugalVox kernel, on the other hand, is around 270 SLOC in 2023. Despite being written in Python, it is really frugal in terms of resource consumption, both while running and while writing and debugging its code. Yet, thanks to full exposure of the action scripts to the Python runtime, it can be as flexible as you want it to be. Not to mention that such a small piece of code is much easier to audit and discover and promptly mitigate any subtle errors or security vulnerabilities.
    204 
    205 **So, is it a PBX or just a scriptable IVR system?**
    206 
    207 I'd rather think of FrugalVox not as a turnkey solution, but as a framework. If you look at the `fvx.py` kernel alone, you'll see nothing but a scriptable IVR with user authentication and TTS integration. However, its the action scripts that give such a system its meaning. FrugalVox, along with the underlying pyVoIP library, exposes all the tooling you need to create your own interactive menus, connect and bridge calls, dial into other FrugalVox instances or other services, and so on. It's a relatively simple building block that, while being useful alone, can also be used to build VoIP systems of arbitrary complexity when properly combined with other similar blocks.
    208 
    209 **If it's meant to be flexible, why hardcode the top level command format?**
    210 
    211 Because such a format is the only format that allows to run actions with any amount of parameters more or less efficiently using a simple phone keypad. The goal here isn't to replace something like VoiceXML, but to give the caller ability to get to the action as quickly as possible. The `PIN#` and then `action_id*param1*param2*..#` sequence is as complex as it should be. Multi-level voice menus waste the caller's time, but you can implement them as well if you really need to.
    212 
    213 **Does FrugalVox offer a way to fully replace DTMF commands with speech recognition?**
    214 
    215 Currently, there is no such way, but you surely can integrate speech recognition into your action scripts. It is not an easy thing to do even in Python, and in no way frugal on computing resource consumption, but definitely is possible, see [SpeechRecognition](https://pypi.org/project/SpeechRecognition/) module reference for more information.
    216 
    217 **Why do you need a patched pyVoIP version instead of a vanilla one?**
    218 
    219 Because vanilla pyVoIP 1.6.4 has a bug its maintainers don't even seem to recognize as a bug. Its `RTPClient` instance creates two sockets to communicate with the same host and port. As a result, when the client is behind a NAT and tries exchanging audio data using both `read_audio` and `write_audio` methods, only the latter works correctly because it's sending datagrams out to the server. Patching `RTPClient` to only use the single socket made things work the way they should.
    220 
    221 **I understand the importance of eSpeakNG but it sounds terrible even with MBROLA. Which else open source TTS engines can you recommend to use with FrugalVox?**
    222 
    223 The first obvious choice would be ~~Festival~~ [Flite](https://github.com/festvox/flite). With an externally downloaded `.flitevox` voice, of course. It has a number of limitations: only English and Indic languages support, no way to adjust the volume, but the output quality is definitely a bit better. If you use the Docker image of FrugalVox, Flite is also included but you have to ship your own `.flitevox` files located somewhere inside your config directory.
    224 
    225 The second obvious choice would be [Pico TTS](https://github.com/naggety/picotts) which is (or was) used as a built-in offline TTS engine in Android. It supports more European languages (besides two variants of English, there also are Spanish, German, French and Italian) but has a single voice per language and absolutely no parameters to configure. Also, it requires autotools to build but the process looks straightforward: `./autogen.sh && ./configure && make && sudo make install`. After this, we're interested in the `pico2wave` command. Please note that its current version has some bug retrieving the text from the command line, so we use an "echo to the pipe" approach. For your convenience, this engine also comes pre-installed in the FrugalVox Docker image.
    226 
    227 The third (not so obvious) choice **might** be [Mimic 1](https://github.com/MycroftAI/mimic1) which is basically Flite on steroids. That's why, unlike Mimic 2 and 3, it still is pretty lightweight and suitable for our IVR purposes. It supports all the `.flitevox` voice files as well as the `.htsvoice` format. However, there is a "small" issue: currently, Mimic 1 still only supports sideloading `.flitevox` and not `.htsvoice` files by specifying the arbitrary path into the `-voice` option, all HTS voices must be either compiled in or put into the `$prefix/share/mimic/voices` (where `$prefix` usually is `/usr` or `/usr/local`) or the current working directory, and then referenced in the `-voice` option without the `.htsvoice` suffix. For me, this inconsistency kinda rules Mimic 1 out of the recommended options.
    228 
    229 Another approach to the same problem would be to build the HTS Engine API and then a version of Flite 2.0 with its support, both sources taken from [this project page](https://hts-engine.sourceforge.net/). The build process is not so straightforward but you should be left with a `flite_hts_engine` binary with a set of command line options totally different from the usual Flite or Mimic 1. If you understand how FrugalVox is configured to use Pico TTS, then you'll have no issues configuring it for `flite_hts_engine`. The voice output quality is debatable compared to the usual `.flitevox` packages, so I wouldn't include this into my recommended list either.
    230 
    231 Alas, that looks like it. The great triad of lightweight and FOSS TTS engines consists of eSpeakNG, Flite with variations and Pico TTS. All other engines, not counting the online APIs, are too heavy to fit into the scenario. Of course, nothing prevents you from integrating them as well if you have enough resources. In that case, I'd recommend [Mimic 3](https://github.com/MycroftAI/mimic3) but that definitely is out of this FAQ's scope.
    232 
    233 Note that for both Flite and Mimic 1 the output voice must support a sample rate that is divisible by 8000 Hz in order to sound correctly. Since version 0.0.2, FrugalVox uses an internal resampler that has this limitation. A way to mitigate this in the future versions is being investigated.
    234 
    235 To recap, here are all the example TTS configurations for all the reviewed engines:
    236 
    237 eSpeakNG + MBROLA:
    238 
    239 ```yaml
    240 tts:
    241   cmd: 'espeak -v us-mbrola-2 -a 70 -p 60 -s 130 -w %s "%s"' # parameter order: filename, text
    242   ...
    243 ```
    244 
    245 Flite/Mimic 1:
    246 
    247 ```yaml
    248 tts:
    249   cmd: 'flite -voice tts/cmu_us_rms.flitevox --setf int_f0_target_mean=100 --setf duration_stretch=1 -o %s -t "%s"' # parameter order: filename, text
    250   ...
    251 ```
    252 
    253 Pico TTS:
    254 
    255 ```yaml
    256 tts:
    257   cmd: OUTF=%s sh -c 'echo "%s" | pico2wave -l en-US -w $OUTF' # parameter order: filename, text
    258   ...
    259 ```
    260 
    261 ## Version history
    262 
    263 - 0.0.2 (2023-02-28, current): fully got rid of SoX dependency, simplified TTS configuration
    264 - 0.0.1 (2023-02-26): initial release
    265 
    266 ## Credits
    267 
    268 Created by Luxferre in 2023.
    269 
    270 Made in Ukraine.