Walder
with simultaneous onsets are ordered such that we can bound the value we are predicting,
so we must incorporate this into the training loss function.
Roughly speaking, the “midi” rows of the input indicate which notes are on at a given
time. We can only allow one note to turn on at a time in the input, since we predict one note
at a time, hence the second column has only pitch 58 turned on, and the algorithm should
ideally predict that the next pitch is 62 (given a lower bound of 58), even though musically
these notes occur simultaneously. Conversely, we can and do allow multiple pitches to turn
off simultaneously (as in columns 5 and 10), since we are not predicting these — the notes
to turn off are dictated by the rhythmic structure (algorithmically, this requires some simple
bookkeeping to keep track of which note(s) to turn off).
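The bookkeeping described above can be sketched as follows. This is a hypothetical helper (the paper only states that offsets are dictated by the rhythmic structure, not how they are computed); it schedules each note's end time and groups the pitches that end simultaneously, so that multiple notes may be turned off in a single input column.

```python
import heapq

def schedule_offsets(notes):
    """Given (pitch, start, duration) triples, return note-off events,
    grouping pitches whose offsets coincide (hypothetical helper)."""
    heap = []  # min-heap of (end_time, pitch), ordered by end time
    for pitch, start, duration in notes:
        heapq.heappush(heap, (start + duration, pitch))
    offsets = []  # list of (time, [pitches ending at that time])
    while heap:
        t, pitch = heapq.heappop(heap)
        if offsets and offsets[-1][0] == t:
            offsets[-1][1].append(pitch)  # same offset time: turn off together
        else:
            offsets.append((t, [pitch]))
    return offsets
```

For example, two notes ending together at time 1.0 are grouped into one simultaneous offset, as in columns 5 and 10 of the figure.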
The non-midi input rows 6 to 10 represent the rhythmic structure in a way which we
intend to aid in the prediction: ∆t_event is the duration of the note being predicted,
∆t_step is the time since the previous input column, onset is 1 if and only if we are
predicting at the same time as in the previous column, offset is 1 if and only if we are
turning notes off at the same time as in the previous column, and t represents the time
in the piece corresponding to the current column (in practice we scale this value to range
from 0 to 1). In the figure,
the time columns are all in units of quarter notes.
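The assembly of one input column can be sketched as below. The pitch range, the `LOW_PITCH` offset, and the exact row ordering are assumptions for illustration (the paper places the rhythmic features in rows 6 to 10 of its figure); the five appended values are the features just described.

```python
import numpy as np

NUM_PITCHES = 88   # assumed piano range; the paper's exact dimensions may differ
LOW_PITCH = 21     # MIDI number of the lowest represented pitch (assumption)

def make_column(active_pitches, dt_event, dt_step, onset, offset, t, piece_length):
    """Assemble one input column: binary 'midi' rows indicating which notes
    are sounding, followed by the five rhythmic-structure rows. A sketch only."""
    col = np.zeros(NUM_PITCHES + 5)
    for p in active_pitches:
        col[p - LOW_PITCH] = 1.0             # note currently on
    col[NUM_PITCHES + 0] = dt_event          # duration of the note being predicted
    col[NUM_PITCHES + 1] = dt_step           # time since the previous input column
    col[NUM_PITCHES + 2] = float(onset)      # 1 iff predicting at the same time as prev
    col[NUM_PITCHES + 3] = float(offset)     # 1 iff turning notes off at same time as prev
    col[NUM_PITCHES + 4] = t / piece_length  # time in the piece, scaled to [0, 1]
    return col
```

For instance, the second column of the figure would have only pitch 58 turned on, with onset set to 1 since the next pitch is predicted at the same time.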
This representation allows arbitrary timing information, and is not restricted to a uniform
discretization of time as in many other works, e.g. Boulanger-Lewandowski et al. (2012);
Allan and Williams (2005). A major problem with the uniform discretization approach is that
in order to represent even moderately complex music, the grid would need to be prohibitively
fine grained, making learning difficult. Moreover, unlike the “piano roll” approaches, we
explicitly represent onsets and offsets, and are able to discriminate between, say, two eighth
notes of the same pitch following one another as opposed to a single quarter note.
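The distinction drawn above can be made concrete with a small sketch (the helper name and event encoding are our own, not the paper's): converting notes to explicit onset/offset events yields different sequences for two repeated eighth notes and one quarter note, whereas a uniform piano-roll grid would render both as the same run of active cells.

```python
def to_events(notes):
    """Convert (pitch, start, duration) triples to explicit
    (time, kind, pitch) onset/offset events (hypothetical helper)."""
    events = []
    for pitch, start, dur in notes:
        events.append((start, "on", pitch))
        events.append((start + dur, "off", pitch))
    return sorted(events)

# Two eighth notes of the same pitch: an offset then a re-onset at time 0.5...
two_eighths = to_events([(60, 0.0, 0.5), (60, 0.5, 0.5)])
# ...versus a single sustained quarter note: no intermediate events.
one_quarter = to_events([(60, 0.0, 1.0)])
```

The two event sequences differ precisely at the repeated articulation, which is what the explicit onset/offset representation captures.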
3.4. On our Assumption of Fixed Rhythmic Information
While we would like to model the rhythmic structure as well, this turns out to be challenging.
Indeed, even with the assumption that the rhythmic structure is given, the problem is rather
non-trivial. After all, although the present work is among the most sophisticated models, even
with fixed rhythmic information taken from professional composers our results clearly fall
short of complete human compositions. Hence, a reasonable first step is to subdivide the
problem by modelling only the pitches.
Previous authors have modelled a simplified set of possible durations, as in Mozer (1994)
and more recently Colombo et al. (2016). However, modelling realistic durations and start
times would require a prohibitively fine grained temporal discretisation (for the data from
Boulanger-Lewandowski et al. (2012), it turns out that the largest acceptable subdivision is
one 480th of a quarter note). Modelling this in turn requires more sophisticated machinery,
such as a point process model (Daley and Vere-Jones, 2003; Du et al., 2016). In the meantime,
and until acceptable solutions to the present problem are developed, assuming fixed timing
has the advantage of allowing arbitrarily complex rhythmic structures.
In summary, rather than fully modelling an overly simplistic music representation, we
partially model a more realistic music representation. Our findings and advancements may be
generalised to the full joint distribution over pitch and timing, since we can always employ
the chain rule of probability on the set of n triples (x_i, t_i, d_i) representing (pitch,
start time, duration).
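The chain-rule decomposition alluded to here can be written out as follows (a sketch; the paper does not display this equation, and the conditioning order is one choice among several):

```latex
% Factorising the joint over the n (pitch, start time, duration) triples:
p\bigl((x_i, t_i, d_i)_{i=1}^{n}\bigr)
  = \prod_{i=1}^{n} p\bigl(x_i, t_i, d_i \,\big|\, (x_j, t_j, d_j)_{j<i}\bigr)
```

Fixing the timing variables (t_i, d_i) and modelling only the pitch factors corresponds to the partial model of the present work.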