Under review as a workshop contribution at ICLR 2015
BACH IN 2014: MUSIC COMPOSITION WITH RECURRENT NEURAL NETWORK
I-Ting Liu
Department of Music
Carnegie Mellon University
Pittsburgh, PA 15213, USA
Bhiksha Ramakrishnan
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
ABSTRACT
We propose a framework for computer music composition that uses resilient propagation (RProp) to train a long short-term memory (LSTM) recurrent neural network. In this paper, we show that the LSTM network properly learns the structure and characteristics of music pieces by demonstrating its ability to recreate music. We also show that, for predicting existing music, training with RProp outperforms backpropagation through time (BPTT).
1 INTRODUCTION
Composing music with computers has attracted researchers for a long time. Among the different approaches, artificial neural networks have been proposed to handle this task, as neural networks were originally developed to model how the human brain works. Neural networks have been very successful in language modeling, pattern recognition, and time-series prediction. Music generation can be formulated as a time-series prediction problem: the note played at each time step can be regarded as a prediction given the notes played before.
The most common neural network is the multi-layered feed-forward network, which predicts the next note given the previous note played. This kind of network has a limited ability to capture rhythmic patterns and musical structure, as it has no mechanism to keep track of the notes played in the past. On the other hand, music, like language, has a complex structure with both short-term and long-term dependencies, which makes the recurrent neural network (RNN) an ideal architecture for this task. Feedback connections enable an RNN to maintain an internal state that captures the temporal relationships among its inputs.
However, RNNs are notoriously hard to train because of "vanishing gradients" (Hochreiter & Schmidhuber, 1997), a problem commonly seen when training RNNs with gradient-based methods. Gradient methods, such as Back-Propagation Through Time (BPTT) (Werbos, 1990), Real-Time Recurrent Learning (RTRL) (Robinson & Fallside, 1987), and their combinations, update the network by flowing errors "back in time". As the error propagates from layer to layer, it tends to either explode or shrink exponentially depending on the magnitude of the weights. The network therefore fails to learn long-term dependencies between inputs and outputs. Tasks with time lags greater than 5-10 steps are already difficult to learn (Hochreiter & Schmidhuber, 1997), not to mention that dependencies in music usually span tens to hundreds of notes in time, which gives music its unique phrase structures.
The long short-term memory (LSTM) algorithm (Hochreiter & Schmidhuber, 1997) was designed to tackle the error-flow problem by enforcing constant error flow through "constant error carousels" in the internal states. LSTM learns quickly and efficiently, and has proved effective in multiple
recognition tasks. Eck & Schmidhuber (2002) were the first to employ LSTM RNNs to learn and compose blues music. Franklin (2006) later utilized LSTM networks to model jazz music. We believe LSTM recurrent neural networks can learn the global structure of music pieces well, but that the efficiency and efficacy of the training phase can be further improved with an adaptive learning algorithm.
In this paper, we propose a framework for music composition that uses resilient propagation (RProp) (Riedmiller & Braun, 1993; Riedmiller, 1994) in place of standard backpropagation to train the recurrent neural network. We also use long short-term memory cells in the network, as they are better able to learn songs, including long phrases and structures, precisely.
The remainder of the paper is organized as follows. In section 2, we review prior work that uses RNNs for computer-aided music generation. In section 3, we introduce our system framework. In section 4, we evaluate the system with experiments.
2 RELATED WORK
Todd (1989) was one of the earliest papers to use an RNN for note-by-note music generation, with a Jordan recurrent neural network. The Jordan recurrent network is a simple RNN that has a recurrent connection from the output layer to the input layer and a self-recurrent link at the input layer. The network is trained with BPTT, and recurrence is managed by teacher forcing. In the training phase, Todd trained the network on monophonic melodies. The trained network could then be used to generate music either by mixing and varying the original training data or by introducing a new "seed melody" as input; the rest of the network output is recorded as the generated music.
In Mozer (1994), a fully connected RNN was trained via back-propagation through time (BPTT) by minimizing a log-likelihood function of the L2 norm of the predicted and actual outputs. The outputs of the final layer are treated as the probability of whether each note should be on or off. In addition, to better model the harmonic relationships of musical notes, Mozer proposed a Gray-code-like representation that encodes notes based on their location on the chromatic circle, the circle of fifths, and pitch height, a psychologically based representation derived from Shepard (1982). To compensate for the BPTT-trained RNN's inefficacy at learning long-term dependencies, Mozer also used a similar encoding scheme to represent durations based on three fractional scales.
Franklin (2001) adopted Todd's network and added a second training phase, in which the network was further trained via reinforcement learning. In the reinforcement learning phase, a scalar value was calculated by a set of "music rules" to determine how good the output was, and was then used to replace the explicit error information.
To deal with the vanishing gradient problem, Eck & Schmidhuber (2002) used two long short-term memory (LSTM) networks, one for learning melody and one for chords, to compose blues music. The output of the chord network is connected to the input of the melody network. The system was able to learn the standard 12-bar blues chord sequences and generate notes that follow the chords. Franklin (2006) also used LSTM networks to learn jazz music. They developed a pitch representation scheme based on major and minor thirds, the circles-of-thirds representation, inspired by Mozer's (1994) circle-of-fifths pitch representation. They also extended Mozer's duration representation by dividing note durations into 96 subdivisions, corresponding to a "tick" in the Musical Instrument Digital Interface (MIDI) (Messick, 1988) standard digital protocol.
To describe music's correlated patterns among multiple notes, Boulanger-Lewandowski et al. (2012) developed an RNN-based model using the restricted Boltzmann machine (RBM) (Smolensky, 1986) and the recurrent temporal RBM (RTRBM) (Sutskever et al., 2009). The model, RNN-RBM, allows freedom in describing the temporal dependencies of the notes, and is believed to be able to model unconstrained polyphonic music in a piano-roll representation without any dimensionality reduction.
3 METHOD
This section describes each component involved in building the system.
3.1 MUSIC REPRESENTATION
Multiple input and output neurons are used to represent different pitches. If a note is played at a given time, the value of the neuron associated with that pitch is 1.0, and 0.0 otherwise. In our system, we use 88 binary visible units that span the whole range of a piano, from A0 to C8, as was done by Boulanger-Lewandowski et al. (2012). We avoided psychologically motivated encodings and other dimensionality reduction techniques, and instead represent the data in this simple form, because we believe a good network should be able to identify harmonically correlated patterns between notes through its learned biases. Besides, such a representation is flexible enough to represent both monophonic and polyphonic data.
To represent music on an RNN, we split time into fractions, where the length of the fraction depends on the type of music we are training on. The computer music composition problem can then be formulated as a supervised time-series problem. The input to the network at time t, x(t), is an 88-dimensional vector representing the notes played at time t. The target at time t, y(t), is the notes played at time t + 1, i.e., x(t + 1). The network is then trained over multiple iterations, each of which learns and predicts the notes of the next fraction.
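To make this representation concrete, the following is a minimal sketch (not the authors' code) that rasterizes notes into an 88-unit binary piano roll and pairs each frame x(t) with the next frame x(t + 1) as its target; the names notes_to_roll and step_size, and the note-tuple format, are our own illustrative assumptions.

import numpy as np

NUM_PITCHES = 88  # piano range A0 (MIDI 21) to C8 (MIDI 108)

def notes_to_roll(notes, num_steps, step_size):
    """Convert (onset, duration, midi_pitch) tuples into a binary roll.

    Onsets and durations are in the same units as step_size
    (e.g., beats); each row of the result is one time fraction."""
    roll = np.zeros((num_steps, NUM_PITCHES), dtype=np.float32)
    for onset, duration, pitch in notes:
        lo = int(onset / step_size)
        hi = int((onset + duration) / step_size)
        roll[lo:max(hi, lo + 1), pitch - 21] = 1.0  # MIDI 21 maps to index 0
    return roll

def make_training_pairs(roll):
    """x(t) is the frame at time t; the target y(t) is x(t + 1)."""
    return roll[:-1], roll[1:]

# Example: a C major chord held for four steps, then a single C5.
notes = [(0, 4, 60), (0, 4, 64), (0, 4, 67), (4, 2, 72)]
x, y = make_training_pairs(notes_to_roll(notes, num_steps=8, step_size=1))
print(x.shape, y.shape)  # (7, 88) (7, 88)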
3.2 RECURRENT NEURAL NETWORKS AND LONG SHORT TERM MEMORY
A recurrent neural network (RNN) has at least one feedback connection from one or more of its units to other units, thus forming cyclic paths in the network. RNNs are known to be able to approximate dynamical systems because their internal states act as internal memory for processing sequences of inputs through time. In this paper, we use the standard long short-term memory structure shown in Figure 1.
The network has one fully connected hidden layer of memory blocks, each of which contains one or more units (memory cells). A memory block also contains three sigmoid gating units: an input gate, an output gate, and a forget gate. The input gate learns to control when inputs are allowed to pass into the cells in the memory block, so that only relevant content is remembered; the output gate learns to control when the cell's output should be passed out of the block, protecting other units from interference by currently irrelevant memory contents; and the forget gate learns to control when it is time to forget an already remembered value, i.e., to reset the memory cell. The outputs of all memory blocks are fed back recurrently to all memory blocks to "remember" past values. For more detail about LSTM, please refer to Hochreiter & Schmidhuber (1997).
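The following is a minimal numpy sketch (our own, not the paper's implementation) of one step of a single LSTM memory cell in its common modern formulation, illustrating how the three gates regulate the constant error carousel; peephole connections and the multi-cell block structure are omitted.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """W maps the concatenated [x; h_prev] to the stacked gate
    pre-activations (input, forget, output, candidate)."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = c_prev.shape[0]
    i = sigmoid(z[0 * n:1 * n])   # input gate: what to write into the cell
    f = sigmoid(z[1 * n:2 * n])   # forget gate: when to reset the cell
    o = sigmoid(z[2 * n:3 * n])   # output gate: what to expose to the network
    g = np.tanh(z[3 * n:4 * n])   # candidate cell content
    c = f * c_prev + i * g        # constant error carousel update
    h = o * np.tanh(c)            # block output, fed back recurrently
    return h, c

# Example with 88 inputs and 32 cells.
rng = np.random.default_rng(0)
n_in, n_cell = 88, 32
W = rng.normal(scale=0.1, size=(4 * n_cell, n_in + n_cell))
b = np.zeros(4 * n_cell)
h, c = lstm_step(rng.random(n_in), np.zeros(n_cell), np.zeros(n_cell), W, b)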
3.3 RESILIENT PROPAGATION (RPROP)
Training of the long short-term memory (LSTM) network is done by resilient propagation (RProp), a heuristic optimization algorithm proposed by Riedmiller & Braun (1993). It is a local adaptive learning strategy that aims to facilitate gradient-descent learning by maintaining weight-specific parameters, such as per-weight step sizes, using only weight-specific information, namely the partial derivatives. The basic principle of RProp is to adapt the size of each weight update directly, ignoring the size of the partial derivative. Unlike other adaptive learning algorithms that take the magnitude of the gradient into account, only the sign of the derivative is used for both learning and adaptation: the sign of the derivative indicates the direction of the weight update, and the size of the change is decided solely by a weight-specific update value.
The weight update $\Delta w_{ij}^{(t)}$ is given by

\[
\Delta w_{ij}^{(t)} =
\begin{cases}
  -\Delta_{ij}^{(t)}, & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} > 0 \\[4pt]
  +\Delta_{ij}^{(t)}, & \text{if } \dfrac{\partial E^{(t)}}{\partial w_{ij}} < 0 \\[4pt]
  0, & \text{otherwise}
\end{cases}
\qquad
w_{ij}^{(t+1)} = w_{ij}^{(t)} + \Delta w_{ij}^{(t)}
\]

where $w_{ij}$ is the weight from neuron $j$ to neuron $i$, $E$ is an arbitrary error function, and $\partial E^{(t)} / \partial w_{ij}$ is the gradient information summed over all patterns in the pattern set (Riedmiller & Braun, 1993).
Figure 1: Long short-term memory
Research has shown that networks trained with local adaptive algorithms, especially RProp, converge considerably faster than with ordinary gradient descent (Riedmiller, 1994). Moreover, RProp does not require parameter tuning in order to obtain optimal results. For more details on implementing RProp, please refer to the Rprop implementation report (Riedmiller, 1994).
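To make the update rule concrete, the following is a minimal numpy sketch of one RProp iteration using the standard factors 1.2 and 0.5. It follows the common variant that simply skips the weight update after a gradient sign change (often called iRprop-) rather than backtracking the previous step, so it is an illustration under that assumption, not the exact algorithm of Riedmiller & Braun (1993).

import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5     # standard acceleration/deceleration factors
DELTA_MIN, DELTA_MAX = 1e-6, 50.0  # bounds on the per-weight step size

def rprop_update(w, grad, prev_grad, delta):
    """One batch update; grad is summed over the whole pattern set.
    Only the sign of grad is used; delta holds per-weight step sizes."""
    sign_change = grad * prev_grad
    # Same sign as last step: grow the step size. Sign flip: shrink it.
    delta = np.where(sign_change > 0,
                     np.minimum(delta * ETA_PLUS, DELTA_MAX), delta)
    delta = np.where(sign_change < 0,
                     np.maximum(delta * ETA_MINUS, DELTA_MIN), delta)
    # Step opposite to the gradient sign; skip the update after a sign flip.
    step = np.where(sign_change < 0, 0.0, -np.sign(grad) * delta)
    grad = np.where(sign_change < 0, 0.0, grad)  # forget flipped gradient
    return w + step, grad, delta

# Usage: call once per epoch with the summed gradient.
w, prev_grad = np.zeros(3), np.zeros(3)
delta = np.full(3, 0.1)  # a common initial step size
grad = np.array([0.5, -0.2, 0.0])
w, prev_grad, delta = rprop_update(w, grad, prev_grad, delta)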
3.4 LEARNING TO COMPOSE
The system comprises two phases: a training phase and a testing phase. In the training phase, multiple music pieces are fed into the network using the representation described in section 3.1. The network learns the pieces with the LSTM recurrent neural network and resilient propagation. The goal of both phases is to predict, for each note, the probability that it is on. This task is similar to a multi-label classification problem: each input instance is assigned zero or more labels, each label indicating whether a particular note is on or off. The error E is computed as the mean square error (MSE), defined as follows:
\[
E = \frac{1}{n} \sum_{i=1}^{n} (t_i - y_i)^2
\]

where $t_i$ is the target value and $y_i$ is the predicted value. For the output layer, we use a logistic sigmoid layer, which produces outputs in the range $[0, 1]$.
After the error has converged, the network is tested by starting it with the inputs of the first time step. Outputs for ensuing time steps are then predicted from the previous predictions. All neurons whose activation value exceeds a decision threshold, which we empirically set to 0.9, are treated as on, and the notes associated with those neurons are played. The testing phase is also the composition phase, in which the predictions can be recorded to form new songs.
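A sketch of this testing/composition loop is given below; network_step stands for a hypothetical stateful forward pass of the trained LSTM that returns the 88 sigmoid activations for the next frame, and the names are our own.

import numpy as np

THRESHOLD = 0.9  # the empirically chosen decision threshold above

def compose(network_step, seed_frame, num_steps):
    """Seed with the first frame, then feed each thresholded
    prediction back as the next input."""
    frame = seed_frame
    roll = []
    for _ in range(num_steps):
        probs = network_step(frame)                       # 88 activations
        frame = (probs > THRESHOLD).astype(np.float32)    # notes treated as on
        roll.append(frame)
    return np.stack(roll)  # (num_steps, 88) generated piano roll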
4 EXPERIMENTS
4.1 TRAINING DATA
For the experiments in this paper, we used the J.S. Bach chorale MIDI dataset and the splits of Allan & Williams (2005). The dataset contains 384 four-part harmonizations and was split into training and testing sets based on the keys (major or minor) of the pieces.
4.2 MUSIC RECONSTRUCTION
In the first part of the experiment, we inspect the network's ability to learn the representation of music pieces. A trained human musician can perform a piece flawlessly after learning and practicing it, and we would like to know whether our network is capable of recreating a song as it is after training.
We randomly picked four chorales from the training subset of the dataset. The network was fed one chorale at a time and trained until convergence. Then, the beginning notes of each chorale were fed into the network as the initial input. The network then predicted the notes for ensuing time steps given the previously predicted notes.
The mean square error (MSE) of the training phase for Chorale No. 34 is shown as an example in Figure 2. The network converged exceptionally fast: the total MSE reached 1.0% at epoch 125. Figure 3 and Figure 4 show the network's rendition alongside the original score. Note that since we use MIDI files as input and output, we put no emphasis on transcribing the MIDI files or on obtaining correct time or key signatures; the music sheets are shown merely as a reference for how the network performs. Although the reconstruction was not perfect, this experiment demonstrated the network's capability to reconstruct a song fairly well after training.
Figure 2: Mean square error for one of J.S. Bach’s chorales
4.3 MUSIC PREDICTION
Table 1: Accuracy and F1 score on the J.S. Bach chorale dataset with BPTT and RProp

          Accuracy   F1 score
  BPTT    21.03%     11.84%
  RProp   31.91%     20.29%
Figure 3: Reconstruction of one of J.S. Bach’s chorales with LSTM network
Figure 4: Original Score of one of J.S. Bach’s chorales
To assess the network's ability to use the knowledge learned during the training phase to compose music, we tested the network on the whole J.S. Bach dataset. As before, the MSE of the training phase with RProp is shown in Figure 5, and the MSE for the network trained with BPTT is shown in Figure 6. The two figures show that the network also converges rapidly when trained with RProp on the whole dataset, compared to the network trained with BPTT. To evaluate performance, we calculated the F1 score, a common evaluation metric for multi-label classification tasks. The F1 score $F1$, precision $P$, and recall $R$ for each music piece $d_j$ are defined as follows (Godbole & Sarawagi, 2004):
\[
P(d_j) = \frac{|T \cap S|}{|S|}, \qquad
R(d_j) = \frac{|T \cap S|}{|T|}, \qquad
F1(d_j) = \frac{2\,P(d_j)\,R(d_j)}{P(d_j) + R(d_j)}
\]

where $T$ is the set of true labels and $S$ is the set of predicted labels.
We also calculated the frame-level accuracy proposed by Bay et al. (2009). The overall accuracy score $Acc$ is given by

\[
Acc = \frac{\sum_{t=1}^{T} TP(t)}{\sum_{t=1}^{T} \left( TP(t) + FP(t) + FN(t) \right)}
\]

where $TP(t)$, $FP(t)$, and $FN(t)$ are the numbers of true positives, false positives, and false negatives at time (frame index) $t$.
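The following sketch (our own code, with assumed array conventions) computes both metrics on binary piano rolls of shape (number of frames, 88), aggregated over a whole piece:

import numpy as np

def f1_score(true, pred):
    tp = np.sum(true * pred)             # |T intersect S|
    p = tp / max(np.sum(pred), 1)        # precision: |T intersect S| / |S|
    r = tp / max(np.sum(true), 1)        # recall:    |T intersect S| / |T|
    return 2 * p * r / max(p + r, 1e-12)

def frame_accuracy(true, pred):
    tp = np.sum(true * pred)
    fp = np.sum((1 - true) * pred)
    fn = np.sum(true * (1 - pred))
    return tp / max(tp + fp + fn, 1)

# Tiny example: one missed-free, one spurious prediction.
t = np.array([[1, 0], [0, 1]])
p = np.array([[1, 0], [1, 1]])
print(f1_score(t, p), frame_accuracy(t, p))  # 0.8 0.666...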
Figure 5: Mean square error for one of J.S. Bach’s chorales using RProp
Figure 6: Mean square error for one of J.S. Bach’s chorales using BPTT
Table 1 shows the accuracy and F1 score of the network on the testing data when trained with RProp and with BPTT. Under the same number of iterations, RProp outperforms BPTT on both metrics.
5 DISCUSSION
5.1 LIMITATION OF REPRESENTATION
As mentioned by Eck & Schmidhuber (2002), the method we use to represent music ignores two issues. First, it is not possible to separate melody from accompaniment with this kind of representation. While such a representation is flexible enough to represent any kind of data, as mentioned in section 3, it does not differentiate chords from melody. Second, there is no way to identify when a note ends: eight eighth notes of the same pitch are represented exactly the same way as four quarter notes of that pitch. One way to deal with this problem is to shrink the step size and append a zero at the end of each note. Another way, illustrated in the sketch below, is to add an additional visible unit to the network to indicate whether the current input is the beginning of a note, as done by Todd (1989). The effectiveness of
these two possible implementations should be examined through more systematic experiments in future work.
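The sketch below illustrates the second option under our own assumed conventions: each 88-unit frame is extended with one extra unit that is set whenever any new note begins in that frame. Todd (1989) used such a note-begin unit for monophonic melodies; a per-pitch variant would be needed to fully disambiguate polyphony.

import numpy as np

def add_onset_unit(roll):
    """roll: (num_steps, 88) binary piano roll -> (num_steps, 89).
    The extra unit is 1 where a pitch turns on that was off before."""
    prev = np.vstack([np.zeros((1, roll.shape[1])), roll[:-1]])
    onset = np.any((roll == 1) & (prev == 0), axis=1, keepdims=True)
    return np.hstack([roll, onset.astype(roll.dtype)])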
5.2 LACK OF PROPER EVALUATION METRIC
In this paper, we evaluated the system with accuracy and F1 score. However, we found that these evaluation metrics do not necessarily correspond to human perception of musical similarity: a higher accuracy or F1 score does not mean the reconstructed music sounds more similar to the original song. To evaluate the network's reconstruction performance properly, further study is required to design an evaluation metric that captures the perceived similarity between music pieces.
5.3 FURTHER IMPROVEMENT
We believe preprocessing the input data, such as transposing all songs to the same key as suggested by Boulanger-Lewandowski et al. (2012), could help the network learn better; a sketch of such a transposition is given below. The system could also be expanded to model dynamics, an important feature of musical style. Melody and accompaniment (chords) could also be trained with two connected networks, which might enable us to model musical characteristics more comprehensively.
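As an illustration of the key-normalization idea, the following sketch (our own code) transposes a binary piano roll by a given number of semitones; estimating each song's key, on which the transposition amount depends, is a separate step not shown here.

import numpy as np

def transpose_roll(roll, semitones):
    """Shift all pitches by `semitones` along the 88-key axis; notes
    shifted off the keyboard range are dropped."""
    out = np.zeros_like(roll)
    if semitones > 0:
        out[:, semitones:] = roll[:, :-semitones]
    elif semitones < 0:
        out[:, :semitones] = roll[:, -semitones:]
    else:
        out[:] = roll
    return out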
6 CONCLUSION
In this paper, an RNN-based music composition system was proposed. Using LSTM enables the network to learn the structure and rhythm of music pieces, and this information can be used to compose new pieces similar in form. Instead of the BPTT algorithm, we employed RProp to train the network and expedite learning. Experiments showed that the network can learn a piece of music and recreate it well within only tens of iterations (learning epochs). The experiments also showed that the network can compose new music once it has acquired knowledge about music.
ACKNOWLEDGMENTS
This paper was the project for the Deep Learning course offered at Carnegie Mellon University. The author wants to thank the course TA, Danny Lan, and the instructor, Professor Bhiksha Ramakrishnan, for their great advice.
REFERENCES
Allan, Moray and Williams, Christopher KI. Harmonising chorales by probabilistic inference. Ad-
vances in neural information processing systems, 17:25–32, 2005.
Bay, Mert, Ehmann, Andreas F, and Downie, J Stephen. Evaluation of multiple-F0 estimation and tracking systems. In ISMIR, pp. 315–320, 2009.
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal de-
pendencies in high-dimensional sequences: Application to polyphonic music generation and tran-
scription. arXiv preprint arXiv:1206.6392, 2012.
Eck, Douglas and Schmidhuber, Juergen. A first look at music composition using LSTM recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, 2002.
Franklin, Judy A. Multi-phase learning for jazz improvisation and interaction. In Proceedings of
the Eighth Biennial Symposium for Arts & Technology, 2001.
Franklin, Judy A. Recurrent neural networks for music computation. INFORMS Journal on Com-
puting, 18(3):321–338, 2006.
Godbole, Shantanu and Sarawagi, Sunita. Discriminative methods for multi-labeled classification.
In Advances in Knowledge Discovery and Data Mining, pp. 22–30. Springer, 2004.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Messick, Paul. Maximum MIDI, 1988.
Mozer, Michael C. Neural network music composition by prediction: Exploring the benefits of psy-
choacoustic constraints and multi-scale processing. Connection Science, 6(2-3):247–280, 1994.
Riedmiller, Martin. Advanced supervised learning in multi-layer perceptrons: from backpropagation to adaptive learning algorithms. Computer Standards & Interfaces, 16(3):265–278, 1994.
Riedmiller, Martin and Braun, Heinrich. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In IEEE International Conference on Neural Networks, pp. 586–591. IEEE, 1993.
Riedmiller, Martin. Rprop: description and implementation details. Technical report, 1994.
Robinson, AJ and Fallside, Frank. The utility driven dynamic error propagation network. University
of Cambridge Department of Engineering, 1987.
Shepard, Roger N. Geometrical approximations to the structure of musical pitch. Psychological
Review, 89(4):305, 1982.
Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory.
1986.
Sutskever, Ilya, Hinton, Geoffrey E, and Taylor, Graham W. The recurrent temporal restricted
boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1601–1608,
2009.
Todd, Peter M. A connectionist approach to algorithmic composition. Computer Music Journal, pp.
27–43, 1989.
Werbos, Paul J. Backpropagation through time: what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990.