Coyote Protocol

To be honest I can't say how much I had Marc Böhlen's whistling box in mind when planning my coyotes. It certainly impressed me. I'm fairly obsessed with dog culture, the humor in it, and the subtlety. I also find communication protocols interesting (in principle; I'm not one to sit in endless committee meetings) and designed one (FidoNet), so the idea of boxes that simulate dog pack cohesion and emotional communication fascinates me.

I'm big on analog computation, for many reasons, partly because it's underexploited in our everything-in-software world. I'm no Luddite: C programmer for 25+ years, sysadmin, net weenie, etc.

Coyote Protocol is a group of electronic acoustic boxes that talk to each other via dog-like singing and howling. They find each other by listening, negotiate unique identities, and form pack cohesion. They will have coded into them certain (tbd) characteristics such as personal space (don't get too close); when that space is violated, they get upset and make this clear through vocalization. The neighbor coyotes hear this and respond in kind. When all's well, after a while the coyotes will talk amongst themselves.

Half the computation is done in the analog realm. While analog processing is application-specific and has lower precision, it works in parallel and in real time. The analog section here amplifies the weak microphone signal, selects the narrow frequency range (180 Hz to 600 Hz), and translates changes in the frequency domain into changes in voltage, which the digital portion tracks and extracts patterns from.

Pattern matching

Coyote Protocol embodies in hardware and software my working theory of how mammalian hearing, and by implication, our sensorium in general, functions.

Patterns, patterns, patterns. It seems to me that all of human cognition is devoted to finding patterns of things, and patterns in things. To me, patterns are an evolutionary way to conserve energy -- if a sensed thing can be reduced to a known pattern, further attention (and therefore effort and energy) need not be devoted to it. Errors are fine; if a shadow "looks like" a person, the chance of that error being harmful is small, and other criteria will support or deny the patterned assumption.

My basic assumption about sensing is as follows: the organ that converts external energy (changing air pressure) to signal/symbol is relatively simple, and very limited in physical ability. The organ is built, at a low level of detail, to be especially responsive to certain forms of sound; in Coyote Protocol's case, to continuous utterances of a half second to many seconds in length, consisting of a single tone that persists at one frequency or glides from one frequency to another within its range of 180 to 600 Hz. This was specifically modelled after my own ad hoc observations of dog singing.

The central concept behind Coyote Protocol's hearing is to find patterns in heard pitch changes exclusively, and within a very narrow frequency range. A very simple scheme was chosen: an ordinary electret microphone, very high gain, very strong and fast automatic gain control (AGC), and a passband of 180 Hz to 600 Hz. This is followed by a very simple but effective frequency-to-voltage converter, the LM331, and an analog-to-digital converter (12-bit resolution). The speaking side uses an LM331 as a voltage-to-frequency converter. The system is quite accurate: it reproduces a heard tone to within 1 Hz.

The above circuitry is the hearing "organ" and also the first layer of symbolic conversion: from the frequency domain to the time domain. Response to stepwise changes in pitch is about 25 to 50 milliseconds. Final output of the analog portion is a slowly varying voltage between 0 and 1 volt, linearly proportional to frequency; 0 volts is 0 Hz, 1 volt is 600 Hz. Energy outside the passband results in 0 volts out. There is very little smoothing (mainly ripple from the LM331) and no further analog processing.
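To make that scaling concrete, here's a sketch in C of how the digital side turns an ADC reading back into frequency. The 12-bit full-scale count and the function name are placeholders of mine; the linear 0-to-1-volt, 0-to-600 Hz mapping is as described above.

    /* Sketch: recover frequency from the F-to-V converter's output.
       Assumes (my assumption) a 12-bit ADC whose full-scale count
       corresponds to the 1 volt maximum of the analog section. */

    #define ADC_FULL_SCALE  4095    /* 12-bit converter */
    #define VOLTS_FULL      1.0     /* analog output spans 0..1 volt */
    #define HZ_PER_VOLT     600.0   /* 1 volt corresponds to 600 Hz */

    /* Convert a raw ADC count to heard frequency in Hz.  A count of
       zero means no in-band energy; anything else maps linearly onto
       the 180..600 Hz passband. */
    static double adc_to_hz(unsigned count)
    {
        double volts = (double)count * VOLTS_FULL / ADC_FULL_SCALE;
        return volts * HZ_PER_VOLT;
    }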

The second layer of symbolic processing converts time-variant voltage to "bigrams". The voltage is sampled periodically (100 times/second) and each pair of readings (current vs. previous) is converted to a printable character (bigram) representing change found in the sampled voltage: up, down, same (aka "flats"), or no-signal. Small changes in pitch are disregarded in a simplistic way at this layer.

It is significant that, beginning with the second layer of symbolic processing, all representation of sound is via ASCII characters, i.e. printable text.
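In C, the bigram conversion could be as simple as the sketch below. The actual characters and the flatness threshold aren't given above, so 'u', 'd', 's', and '-' (up, down, same, no-signal) and the numbers here are stand-ins:

    /* Sketch: classify one 100-per-second sample against the previous
       one.  Inputs are raw ADC counts; thresholds are illustrative. */

    #define SILENCE_COUNT   4    /* at or below this, treat as no signal */
    #define FLAT_THRESHOLD  8    /* ignore pitch changes smaller than this */

    static char bigram(int prev, int cur)
    {
        if (cur <= SILENCE_COUNT)
            return '-';                   /* no in-band energy */
        if (cur - prev > FLAT_THRESHOLD)
            return 'u';                   /* pitch rising */
        if (prev - cur > FLAT_THRESHOLD)
            return 'd';                   /* pitch falling */
        return 's';                       /* flat: same tone continuing */
    }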

A state machine reduces the stream of bigrams to a shorter string representing the utterance as a series of tones and the transitions between them (if present). At this second layer, the first line of "Mary Had A Little Lamb" would be represented as five symbols (the last three notes being the same, they would be heard as one longer note), a text string ideally represented as 20 characters long; in the real world, extraneous noise "corrupts" perfect tone detection. At this point the symbol string accurately describes the heard utterance in pitch and duration, and the speaking system can reproduce it exactly: within 1 Hz, frequency-wise, and within 10 ms, duration-wise (internally, durations are counted in 10-millisecond units).
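A sketch of that state machine, continuing the bigram characters from the sketch above. The output encoding here ("T pitch:duration" for flats, "U"/"D" for glides) is invented for illustration; each bigram stands for one 10-millisecond sample, so run lengths are already in 10 ms units:

    #include <stdio.h>

    /* Sketch: collapse each run of identical bigrams into one symbol.
       bigrams[] and hz[] are parallel arrays, one entry per 10 ms. */
    static void reduce(const char *bigrams, const int *hz, int n)
    {
        int i = 0;
        while (i < n) {
            char c = bigrams[i];
            int start = i;
            while (i < n && bigrams[i] == c)
                i++;                              /* swallow the whole run */
            if (c == 's')
                printf("T%d:%d ", hz[start], i - start);  /* tone: pitch, duration */
            else if (c == 'u')
                printf("U ");                     /* rising glide between tones */
            else if (c == 'd')
                printf("D ");                     /* falling glide between tones */
            /* '-' (silence) runs are simply dropped */
        }
        putchar('\n');
    }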

The first two layers of processing are done in real time as the utterance is under way. Third and subsequent layers are done after the utterance ends. The third layer begins further abstraction of the heard pattern. It is here that the fun begins.

First, each Coyote remembers two utterances, the current one (after further processing, below) and the previous one. This allows for detection of patterns-of-patterns, described further below.

The first thing done to the current pattern is to canonicalize it; the utterance is examined for further transitions via differentiation of the symbol string, and small frequency changes detected in early layers may be rejected and merged into one flat. The canonical utterance will later be saved as the "previous" utterance, when that time comes.
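A sketch of the canonicalization and the two-utterance memory, with structures of my own devising (the real representation, including how glides are carried along, isn't spelled out above):

    #define MERGE_HZ  10    /* flats closer than this are merged (a guess) */

    struct tone      { int hz; int dur; };        /* dur in 10 ms units */
    struct utterance { struct tone t[64]; int n; };

    static struct utterance current, previous;    /* the two remembered utterances */

    /* Merge adjacent tones whose pitches differ by less than MERGE_HZ
       into one longer flat; glides are omitted from this simplified view. */
    static void canonicalize(struct utterance *u)
    {
        int i, out = 0;
        for (i = 1; i < u->n; i++) {
            int d = u->t[i].hz - u->t[out].hz;
            if (d < MERGE_HZ && d > -MERGE_HZ)
                u->t[out].dur += u->t[i].dur;     /* absorb into one flat */
            else
                u->t[++out] = u->t[i];            /* genuinely new tone */
        }
        if (u->n > 0)
            u->n = out + 1;
    }

    /* When the current utterance is fully processed it becomes the
       remembered "previous" one. */
    static void remember(void) { previous = current; }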

The utterance is reduced further, to the initial pitch followed by a list of up and down pitch changes, omitting exact pitches and durations; this "fourth" layer of abstraction allows for same-but-different comparisons.
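Using the structures sketched above, that contour extraction might look like this; only the first pitch survives, everything after it is pure shape, so two utterances with the same shape but different tempo or register compare equal past the first tone:

    /* Sketch: reduce an utterance to initial pitch plus up/down shape.
       shape[] must hold at least u->n characters.  After canonicalization
       adjacent tones always differ, so each step is strictly up or down. */
    static int contour(const struct utterance *u, int *start_hz, char *shape)
    {
        int i, n = 0;
        if (u->n == 0)
            return 0;
        *start_hz = u->t[0].hz;                   /* the one pitch we keep */
        for (i = 1; i < u->n; i++)
            shape[n++] = (u->t[i].hz > u->t[i-1].hz) ? 'u' : 'd';
        shape[n] = '\0';
        return 1;
    }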