IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, ACCEPTED 17 MAY 2023. 1
A Survey on Reinforcement Learning for
Recommender Systems
Yuanguo Lin¹, Yong Liu¹, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, Chunyan Miao
Abstract—Recommender systems have been widely applied in different real-life scenarios to help us find useful information. In particular, Reinforcement Learning (RL) based recommender systems have become an emerging research topic in recent years, owing to their interactive nature and autonomous learning ability. Empirical results show that RL-based recommendation methods often surpass supervised learning methods. Nevertheless, there are various challenges in applying RL to recommender systems. To understand these challenges and the relevant solutions, there should be a reference for researchers and practitioners working on RL-based recommender systems. To this end, we first provide a thorough overview, comparison, and summarization of RL approaches applied in four typical recommendation scenarios: interactive recommendation, conversational recommendation, sequential recommendation, and explainable recommendation. Furthermore, we systematically analyze the challenges and relevant solutions on the basis of existing literature. Finally, by discussing open issues of RL and its limitations in recommender systems, we highlight some potential research directions in this field.
Index Terms—Reinforcement learning, Recommender systems,
Interactive recommendation, Policy gradient.
I. INTRODUCTION
PERSONALIZED recommender systems [1]–[3] are competent to provide interesting information that matches users' preferences, thereby alleviating the information overload problem. Recommendation technologies usually make use of various kinds of information to suggest potential items to users. To this end, early recommendation research primarily focused on developing content-based and collaborative filtering-based methods [4], [5]. Recently, motivated by the rapid development of deep learning, various neural recommendation methods have been developed [6]. However, modeling such information alone is not enough. In real-world scenarios, the recommender system suggests items according to the user-item interaction history and then receives user feedback to make further recommendations [7], [8]. In other words, the recommender system aims to learn users' preferences from the interactions and recommend items that users may be interested in. Nevertheless, existing recommendation methods (e.g., supervised learning) usually ignore the interactions between a user and the recommendation model. They do not effectively capture the user's timely feedback to update the recommendation model, thus leading to unsatisfactory recommendation results.

Y. Lin is with the School of Computer Engineering, Jimei University, and the School of Informatics, Xiamen University, China. Email: xd-
F. Lin and W. Zeng are with the School of Informatics, Xiamen University, China.
Y. Liu and P. Wu are with the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore. Email: [email protected] and [email protected].
C. Miao is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Email: [email protected].
L. Zou is with the School of Cyber Science and Engineering, Wuhan University, China. Email: [email protected].
H. Chen is with the School of Computer Science and Technology, University of Science and Technology of China, China. Email: [email protected].
Corresponding author.
¹Co-first authors.
In general, the recommendation task could be modeled as
an interactive process, i.e., the user is recommended an item and then provides feedback (e.g., skip, click, or purchase) to the recommendation model. In the next interaction, the
recommendation model learns the optimal policy from the
user’s explicit/implicit feedback and recommends a new item
to the user. From the user’s point of view, an efficient interac-
tion means helping users find their favorite items as soon as
possible. The interactive recommendation approach has been
applied in real-world recommendation tasks. However, it often
suffers from some problems, e.g., cold-start [9], [10], data
sparsity [11], interpretability [12] and safety [13].
As a machine learning method that focuses on how an
intelligent agent interacts with its environment, Reinforce-
ment Learning (RL) [14], [15] learns the policy by trial
and error search, which is beneficial to sequential decision
making. Hence, it can provide potential solutions to model
the interactions between the user and agent. In particular,
Deep Reinforcement Learning (DRL) [16], the combination
of traditional RL with deep learning methods, is competent
to learn from historical data with enormous state and action
spaces to address large-scale problems. It has powerful representation learning and function approximation capabilities that enable it to be applied across various fields [17], [18], e.g., games [19]
and robotics [20]. Recently, the application of RL to solve
recommendation problems has become a new research trend
in recommender systems [21]–[23]. Specifically, RL enables
the recommender agent to constantly recommend items to
users for learning the optimal recommendation policies. Many
experimental results have demonstrated that RL-based rec-
ommendation methods [24], [25] evidently outperform super-
vised learning methods. For example, as shown in Table I,
RL-based recommendation methods (i.e., PGPR [26], Actor-
Critic [27], and ADAC [28]) consistently perform better than
the supervised learning-based recommendation methods (i.e.,
DKN [29], BPR [30], and RippleNet [31]) on two Amazon
datasets in terms of Hit Ratio (HR) and Normalized Dis-
counted Cumulative Gain (NDCG), especially with a significant margin (p-value < 0.01) on the Clothing dataset.
TABLE I
THE RECOMMENDATION PERFORMANCE OF SUPERVISED LEARNING METHODS (i.e., DKN, BPR, RIPPLENET) AND RL-BASED METHODS (i.e., PGPR,
ACTOR-CRITIC, ADAC) ON TWO AMAZON DATASETS IN TERMS OF HR AND NDCG (%). THE P-VALUE IS FROM A T-TEST OF THE PERFORMANCE DIFFERENCE BETWEEN PGPR AND EACH OF THE OTHER METHODS.
Method | Beauty HR | Beauty NDCG | p-value | Clothing HR | Clothing NDCG | p-value
DKN | 8.673±0.058 | 1.872±0.049 | 0.021 | 1.203±0.088 | 0.300±0.024 | 1.45E-05
BPR | 9.021±0.068 | 2.744±0.045 | 0.036 | 1.820±0.061 | 0.609±0.027 | 6.95E-05
RippleNet | 9.294±0.027 | 2.401±0.036 | 0.040 | 1.882±0.041 | 0.624±0.025 | 8.07E-05
PGPR | 14.559±0.051 | 5.513±0.042 | - | 7.003±0.032 | 2.843±0.030 | -
Actor-Critic | 14.821±0.043 | 5.730±0.051 | 0.912 | 6.924±0.044 | 2.796±0.031 | 0.949
ADAC | 15.856±0.053 | 5.863±0.048 | 0.718 | 7.501±0.022 | 3.010±0.058 | 0.748
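For readers who want to reproduce this kind of significance check, the following is a minimal sketch of how such a p-value can be computed with SciPy; the score arrays are hypothetical per-run HR values, not the actual results behind Table I, and an independent two-sample t-test is assumed.

```python
from scipy import stats

# Hypothetical HR scores over five repeated runs for two methods (not the real Table I data).
pgpr_hr = [14.50, 14.62, 14.55, 14.60, 14.53]
bpr_hr = [9.10, 8.95, 9.05, 9.00, 9.02]

# Two-sample t-test on the per-run scores; a small p-value indicates a significant difference.
t_stat, p_value = stats.ttest_ind(pgpr_hr, bpr_hr)
print(f"t = {t_stat:.3f}, p-value = {p_value:.2e}")
```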
Fig. 1. Distribution of year-wise publications (until October 2022) about
traditional RL- and DRL-based recommendation methods.
In practice, RL-based recommender systems have been applied to many specific scenarios [32]–[37], such as e-commerce [38]–[40], e-learning [41], [42], and health care [43]–[45].
Data collection methodology. There is a growing number of studies on RL-based recommender systems. To search for relevant articles for this analysis, we adopted the following collection rules to include or exclude papers.
Search terms: Our survey involves two keywords: Reinforcement Learning¹ and Recommender Systems. Accordingly, we mainly adopted the following search terms: 'Reinforcement Learning' AND 'Recommender Systems'. To find more research papers, we also used related RL algorithms as search terms (e.g., 'Q-learning', 'Policy Gradient', and 'Actor-Critic') instead of 'Reinforcement Learning', along with 'Recommender Systems'. Similarly, we used the search term 'Recommendation' instead of 'Recommender Systems', along with 'Reinforcement Learning'.
Search sources: We first used Google Scholar to find re-
search papers with these search terms. We then expanded the collection of relevant articles by searching the following academic databases: Science Direct, ACM Portal, Springer
Link, and IEEE Xplore. Finally, we selected 98 related
papers to include in this survey.
Publication type: Only publications on RL-based recommendation from top international conferences and journals were included in our survey.
¹Note that this survey focuses on recommender systems using full RL. Therefore, we did not include bandit methods, which are different from full RL.
The distribution of collected research papers over the years
is shown in Fig. 1. There are a few publications on traditional
RL-based recommendation methods from 2005 to 2017, with
an increase in the number of papers on DRL-based recommendation methods since 2018. The main reason is that DRL algorithms have proved to be satisfactory solutions to some recommendation issues, and they have attracted much attention
from the research community.
Related work. To facilitate the research about RL-based
recommender systems, [46] provides a review of the RL- and
DRL-based algorithms developed for recommendations, and
presents several research directions in top-K recommenda-
tion, application architecture, and evaluation. Besides, [47]
provides an overview of DRL-based recommender systems
mainly according to model-based and model-free algorithms,
and discusses the benefits and drawbacks of DRL-based
recommendations. Nevertheless, it is necessary to make a
more comprehensive overview and analysis of (D)RL-based
recommender systems.
Our contribution. Different from [46] and [47], the main
contributions made in this work are as follows.
1) Comprehensive review: We summarize existing (D)RL
algorithms applied in four typical recommendation scenarios,
i.e., interactive recommendation, conversational recommenda-
tion, sequential recommendation, and explainable recommen-
dation. It could be helpful for readers to understand how
(D)RL algorithms are applied in different recommendation
systems. Moreover, from the RL perspective, the comprehen-
sive survey of RL-based recommender systems follows three
classes of RL algorithms: value-function, policy search, and
Actor-Critic. This taxonomy of the literature is made based on
the fact that these three types of methods have been widely
applied in existing RL-based recommender systems.
2) Systematic analysis: We systematically analyze the
challenges of applying (D)RL in recommender systems and
relevant solutions, including environment construction (e.g.,
the state representation and negative sampling), prior knowl-
edge, reward function definition, learning bias (e.g., the data
or policy bias), and task structuring (i.e., the task of RL can be
decomposed into basic components or a sequence of subtasks).
3) Open directions: To facilitate future progress in this
field, we put forward open issues of RL, analyze practical
challenges of this field, and suggest possible future direc-
tions for the research and application of recommender sys-
tems. The open issues and emerging topics include sampling
efficiency, reproducibility, generalization, evaluation, biases,
interpretability, safety and privacy.

Fig. 2. (a) Markov decision process formulated as the interaction between an agent and its environment [49]. (b) An RL-based recommender system models the interactive recommendation task as a Markov decision process.
The remainder of this paper is organized as follows. Sec-
tion II introduces the background of RL, defines related
concepts, and lists commonly used approaches. Section III
presents a standard formulation of the RL-based recommendation problem. Section IV provides a comprehensive review of
the RL algorithms developed for recommender systems. Then,
Section V discusses the challenges and relevant solutions
of applying RL in recommender systems. Next, Section VI
discusses various limitations and potential research directions
of RL-based recommender systems. Finally, Section VII con-
cludes this study.
II. OVERVIEW OF REINFORCEMENT LEARNING
Different from supervised and unsupervised learning, RL
[48] focuses on goal-directed learning that maximizes the
total reward achieved by an agent when interacting with its
environment. Trial-and-error search and delayed rewards are the two most important characteristics that distinguish RL from the other
types of machine learning methods.
In RL, the agent learns the optimal policy from its interac-
tions with the environment to maximize the total reward for
sequential decision making. As shown in Fig. 2(a), the learning
process of RL can be formulated as a Markov Decision
Process (MDP) [49]. Formally, the MDP is defined as a 5-tuple ⟨S, A, P, R, γ⟩, where S denotes a finite set of states, A denotes a finite set of actions, P denotes the state transition probabilities, R denotes the reward function, and γ ∈ [0, 1] is a discount factor of the reward. At each time step t, the agent receives an environment state S_t ∈ S and selects a corresponding action A_t ∈ A. Then, the agent receives a numerical reward R_{t+1} ∈ R and moves into a new state S_{t+1}. Thus, the MDP forms a sequence τ as follows:

\tau = \{ S_0, A_0, R_1, S_1, A_1, R_2, \cdots, S_T \},    (1)
where T is the maximum time step in a finite MDP. To maximize the return in the long run, the agent tries to select actions so that the cumulative reward it receives over the future is maximized. In this case, we introduce the concept of discounting. In general, the agent selects an action A_t to maximize the discounted return G_t [49]:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{i=t+1}^{T} \gamma^{i-t-1} R_i.    (2)

The discount factor γ affects the return. If γ = 0, the agent only maximizes immediate rewards, which reduces the return in the long run. As γ approaches 1, the agent takes future rewards into account more strongly.
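To make the return in Eq. (2) concrete, the short Python sketch below computes G_t for every step of a finite reward sequence by iterating backwards; the example reward list and γ = 0.9 are illustrative values rather than quantities taken from any surveyed system.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for each step t.

    rewards[k] is interpreted as R_{k+1}, the reward received after step k.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: clicks give reward 1, skips give 0, and a purchase gives 5.
print(discounted_returns([1, 0, 1, 5]))
```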
Based on whether to use models and planning for solving
RL problems, RL algorithms can be classified into two main
groups, i.e., model-free algorithms and model-based algo-
rithms. The model-free algorithms directly learn the policy
without any model of the transition function, whereas the
model-based algorithms employ a learned or pre-determined
model to learn the policy. On the other hand, according to how the agent selects its actions, RL algorithms fall
into the following major groups, i.e., value-function methods,
policy search methods, and Actor-Critic methods.
A. Value-function Approaches
Many traditional RL methods generally achieve a global
optimum return by obtaining the maximal value in terms
of the best action. These methods are called value-function
approaches, which utilize the maximal value to learn the
optimal policy indirectly. Intuitively, the maximal value is
generated by the best action a^* following the optimal policy π^*; that is, the state value under an optimal policy equals the expected return for the best action from that state. This is the Bellman equation for the optimal state-value function v_{π^*}(s), also called the Bellman optimality equation, defined as:

v_{\pi^*}(s) = \max_a \mathbb{E}\big[ R_{t+1} + \gamma v_{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a \big]
            = \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi^*}(s') \big],    (3)

where \mathbb{E}[\cdot] denotes the expected value of a random variable obtained by following the optimal policy π^*.
Similarly, the Bellman optimality equation for the action-value function q_{π^*}(s, a) is defined by

q_{\pi^*}(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma \max_{a'} q_{\pi^*}(S_{t+1}, a') \,\big|\, S_t = s, A_t = a \big]
              = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_{\pi^*}(s', a') \big].    (4)

Different from the optimal state-value function v_{π^*}(s), q_{π^*}(s, a) explicitly reflects the effect of the best action, which makes it more readily adopted by many algorithms.
In general, the optimal value function is estimated by Dy-
namic Programming (DP), Monte Carlo methods or temporal-
difference (TD) learning, such as Sarsa [49], Q-Learning [50],
and Deep Q-Networks (DQN) [51]. Compared to policy-search methods, value-function approaches often have much lower variance. Nevertheless, they are not suitable for complex application scenarios due to their slow convergence.
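As a concrete illustration of a value-function method, the following is a minimal tabular Q-learning sketch built directly on the Bellman target in Eq. (4); the toy environment interface (reset, step, actions) and the hyperparameters are assumptions made for illustration, not part of any surveyed recommender.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the discrete action set
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            td_target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```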
B. Policy Search Methods
In contrast to value-function approaches, policy-search methods directly optimize the policies, which are parameterized by a set of policy parameters θ_t. These policy parameters can be updated to maximize the expected return with either gradient-free or gradient-based optimization methods [16]. Among the most popular RL algorithms, gradient-based optimization methods are competent to solve complex problems. To search for the optimal policies, a gradient-based optimization method learns the policy parameters with the gradient of some performance measure J(θ). Formally, the updates approximate gradient ascent in J(θ) by

\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)},    (5)

where \widehat{\nabla J(\theta_t)} denotes a stochastic estimate that approximates the gradient of J(θ_t) with respect to θ_t, and α is the step size that controls the learning rate [49]. Existing policy gradient methods generally follow the gradient-updating strategy in Eq. (5).
Policy gradient theorems provide an expression proportional to the policy gradient, whose expectation can be approximated by Monte Carlo sampling. Thus, the REINFORCE algorithm [52], which adopts the Monte Carlo policy gradient method, can be established by

\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a)
               = \mathbb{E}_\pi \Big[ \sum_a \nabla_\theta \pi(a \mid S_t, \theta)\, q_\pi(S_t, a) \Big],    (6)

where the symbol ∝ means "proportional to", the distribution μ(s) is the on-policy distribution under the policy π, and the gradients are column vectors of partial derivatives with respect to the policy parameter θ.
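The sketch below shows a per-episode REINFORCE update corresponding to Eqs. (5) and (6) with a simple linear-softmax policy over discrete actions; the per-action feature representation and the episode format are assumptions introduced for illustration.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """pi(a|s, theta) for a linear-softmax policy; rows of state_features are per-action features."""
    logits = state_features @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (state_features, action_index, reward) tuples from one sampled trajectory."""
    g = 0.0
    # iterate backwards so g accumulates the discounted return G_t
    for state_features, action, reward in reversed(episode):
        g = reward + gamma * g
        probs = softmax_policy(theta, state_features)
        # gradient of log pi(a|s): phi(s,a) - sum_b pi(b|s) phi(s,b)
        grad_log_pi = state_features[action] - probs @ state_features
        theta = theta + alpha * g * grad_log_pi   # stochastic gradient ascent on J(theta)
    return theta
```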
Some advanced algorithms have been proposed to address
the shortcomings of policy-search methods. For example,
policy gradient methods with function approximation [53]
ensure the stability of the algorithms. By adjusting hyperparameters manually or adaptively, Proximal Policy Optimization (PPO) [54] and Trust Region Policy Optimization (TRPO) [55] speed up the convergence of the
algorithms. Moreover, Guided Policy Search (GPS) [56] uti-
lizes the path optimization algorithm to guide the training
process of the policy gradient method, and thereby improves
its efficiency.
C. Actor-Critic Algorithms
There is a set of algorithms that incorporate the advantages of value-function approaches and policy search methods. They attempt to estimate a value function while adopting the policy gradient to search in the policy space. Actor-Critic
[57] is one of the most representative algorithms. It combines
the policy-based method (i.e., the actor) with the value-based
approach (i.e., the critic) to learn the policy and value-function
together. The actor trains the policy according to the value
function of the critic’s feedback, while the critic trains the
value function and uses the TD method to update it in one step. The one-step Actor-Critic algorithm replaces the full return with the one-step return as

\theta_{t+1} \doteq \theta_t + \alpha \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \big) \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)},    (7)

where w is the state-value weight vector learned by the Actor-Critic algorithm [49]. In Eq. (7), \hat{v}(S_t, w) is a learned state-
value function that is used as the baseline. In recent years,
many improved Actor-Critic algorithms have been developed,
such as Asynchronous Advantage Actor-Critic (A3C) [58],
Soft Actor-Critic (SAC) [59], Deterministic Policy Gradient
(DPG) [60] and its variation Deep Deterministic Policy Gra-
dient (DDPG) [27]. Actor-Critic algorithms may alleviate the
problem of sampling efficiency by experience replay [61].
However, due to the coupling of value evaluation and policy
updates, the stability of these algorithms is unsatisfactory.
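The following is a minimal sketch of the one-step Actor-Critic update in Eq. (7), using linear function approximation for both the policy and the state-value baseline; the feature encodings and step sizes are illustrative assumptions.

```python
import numpy as np

def one_step_actor_critic_update(theta, w, s_feat, a_feats, action, reward,
                                 s_next_feat, done,
                                 alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
    """One transition of the one-step Actor-Critic update (Eq. 7).

    s_feat / s_next_feat: feature vectors of the current / next state.
    a_feats: per-action feature matrix used by the linear-softmax policy.
    """
    # critic: TD error delta = R_{t+1} + gamma * v(S_{t+1}) - v(S_t)
    v_s = w @ s_feat
    v_next = 0.0 if done else w @ s_next_feat
    delta = reward + gamma * v_next - v_s
    w = w + alpha_w * delta * s_feat                   # TD(0) update of the critic weights

    # actor: gradient of log pi(A_t | S_t, theta) for a linear-softmax policy
    logits = a_feats @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_log_pi = a_feats[action] - probs @ a_feats
    theta = theta + alpha_theta * delta * grad_log_pi  # policy-gradient step scaled by the TD error
    return theta, w
```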
III. FORMULATION OF RECOMMENDATION PROBLEM
In a typical recommender system, suppose there are a set of users U and a set of items I, with R ∈ R^{X×Y} denoting the user-item interaction matrix, where X and Y denote the numbers of users and items, respectively. Let r_t^{ui} denote the interaction behavior between user u and item i at time step t. The recommender system aims to generate a predicted score \hat{r}_t^{ui}, which describes the user's preference for the item i.
Generally, we can formulate the recommendation task as a
finite MDP [62], [63], as shown in Fig. 2(b). At each time
step, the recommender agent interacts with the environment
(i.e., the user and/or logged data) by recommending an item
to the user in the current state. At the next time step, the
recommender agent receives feedback from the environment
and recommends a new item to the user in a new state. The
user’s feedback may contain explicit feedback (e.g., purchase
and rating) or implicit feedback (e.g., user’s browsing record
from logged data). The recommender agent aims at learning
an optimal policy with the policy network to maximize the
cumulative reward. More precisely, the MDP in the recommendation scenario is a 5-tuple ⟨S, A, P, R, γ⟩, whose components can be defined as follows.
States S. The finite state space describes the environment states in fixed-length history trajectories, in which S_t = {i_1, i_2, · · · , i_t} is an observed state from the sequence of interacted items at time step t.
Actions A. The discrete action space contains the whole set of recommended candidate items². An action A_t is to recommend an item i at time step t. In logged data, the action can be taken from the user-item interactions.
Transition probability P. It is the state transition probability matrix. The state transits from s to s' according to the probability p(s', r | s, a) after the recommender agent receives the user's feedback (i.e., the reward r).
Reward function R. Once the recommender agent takes action a in state s, it obtains the reward r(s, a) in accordance with the user's feedback.
Discount factor γ. It is the discount-rate parameter for future rewards.

²Note that the action space may include other kinds of actions in different recommendation scenarios, e.g., the selection of query attributes in conversational recommender systems, and the outgoing edges of entities in KG-based explainable recommender systems.
We assume that, in an online RL-based recommendation environment, the recommender agent recommends an item i_t to a user u, and the user provides feedback f_t to the recommender agent at the t-th interaction. The recommender agent obtains the reward r(S_t, A_t) associated with the feedback f_t, and recommends a new item i_{t+1} to the user u at the next interaction. Given the observations over multiple turns of interactions, the recommender system generates a recommendation list. The recommender agent aims to learn a target policy π_θ to maximize the cumulative reward of the sampled sequence [62]:

\max_{\pi_\theta} \mathbb{E}_{\tau \sim \pi_\theta} [ R_\tau ],    (8)

where θ refers to the policy parameters, and R_\tau = \sum_{t=0}^{|\tau|} \gamma^t r(S_t, A_t) denotes the cumulative reward with respect to the sampled sequence τ = {S_0, A_0, · · · , S_T}.
The recommender agent often suffers from the high cost of learning the target policy by interacting with real users online. An alternative is to employ offline learning, which learns a behavior policy π_δ from the logged data. When using offline learning, we need to address the policy bias in order to learn an optimal policy π^*, since there is a noticeable difference between the target policy π_θ and the behavior policy π_δ.
IV. RECOMMENDER SYSTEMS BY REINFORCEMENT
LEARNING
In this section, we summarize specific RL algorithms ap-
plied in four typical recommendation scenarios (i.e., interactive
recommendation, conversational recommendation, sequential
recommendation, and explainable recommendation) following
value-function, policy search, and Actor-Critic, respectively.
An overview of the related literature is shown in Table II. Note that an offline evaluation strategy means that a model/framework is evaluated in an offline experimental environment, whereas an online evaluation strategy denotes experiments conducted in online communities or with real users.
A. Interactive Recommendation
In a typical interactive recommendation scenario, a user u is recommended an item i_t and provides feedback f_t at the t-th interaction. The recommendation system then recommends a new item i_{t+1} based on the feedback f_t. Such an interactive process
can be easily formulated as an MDP, where the recommender
agent constantly interacts with the user and learns the policy
from the feedback to improve the quality of recommendations
[64], as shown in Fig. 2 (b). Due to their nature of learning
from dynamic interactions, RL algorithms have been widely
adopted to solve interactive recommendation problems [64].
1) Value-function Approaches: There is a common chal-
lenge of sample efficiency for RL algorithms, which may
lead to inefficient learning of the recommendation policy. To
address the limitation, Knowledge Graph (KG) is incorporated
within RL algorithms for the interactive recommendation
[71]. KG can provide rich supplementary information, which
reduces the sparsity of the user feedback. To this end, a KG-enhanced Q-learning model is proposed to make sequential decisions efficiently. Besides, a DQN approach with double-Q
[99] and dueling-Q [100] models long-term user satisfaction. From the interaction history o_t with item i_t at time step t, the recommender agent obtains the reward R_t and then stores the experience in the replay buffer D. The goal is to improve the performance of the Q-network by minimizing the mean-square loss function as follows:

L(\theta_q) = \mathbb{E}_{(o_t, i_t, R_t, o_{t+1}) \sim D} \big[ (y_t - q(S_t, i_t; \theta_q))^2 \big],    (9)

where (o_t, i_t, R_t, o_{t+1}) refers to the learnt experience, and y_t denotes the target value of the optimal q^*.
Moreover, [67] proposes a DQN-based recommendation
framework that incorporates Convolutional Neural Network
(CNN) and Generative Adversarial Networks (GAN) [101],
called DRCGR. It automatically learns the optimal policy
based on both the positive and negative user feedback.
To optimize instant and long-term user engagement in rec-
ommender systems, [69] designs a novel Q-network that has
three layers named raw behavior embedding layer, hierarchical
behavior layer, and Q-value layer. Moreover, Pseudo Dyna-Q
(PDQ) [70] is proposed to ensure the stability of convergence
and low computation cost of existing algorithms. The PDQ framework consists of two major components: a world model, which imitates the user's feedback from the historical logged data and generates pseudo experiences; and a recommendation policy based on Q-learning, which maximizes the expected reward using the pseudo experiences from the world model together with the logged experiences.
Besides, a few attempts use the DP method to optimize the
recommendation policies [66]. For example, [65] adopts the
value iteration algorithm to learn the true status value in a
collaboration network.
The multi-step problem in interactive recommender systems
has been studied in [68], where the multi-step interactive
recommendation is cast as a multi-MDP task for all target
users. To model user-specific preferences explicitly, a biased
User-specific Deep Q-learning Network (UDQN) is proposed
by adding a bias parameter to capture the difference in the
Q-values of different target users.
2) Policy Search Methods: Existing RL-based algorithms
are often developed for short-term recommendation, whereas
[9] employs DRL and Recurrent Neural Network (RNN) to
improve the accuracy of long-term recommendation. Specifi-
cally, RNN is performed to simulate the sequential interactions
between the environment (the user) and the recommender
agent by evolving user states adaptively. This strategy can help
tackle the cold-start issue in recommender systems. On the
other hand, the interaction process is split into sub-episodes, and the accumulated reward of each sub-episode is restarted, which significantly improves the effectiveness of policy learning.
TABLE II
OVERVIEW OF RL ALGORITHMS FOR DIFFERENT RECOMMENDER SCENARIOS.
Scenario | Model | RL Class | RL Algorithm | RL Environment | Evaluation Strategy | Dataset
Interactive Recommendation | UWR [64] | Value-function | Q-Learning | model-free | Offline | N/A
 | Multi-with RL [65] | Value-function | DP | model-based | Offline | ACM
 | ARA [66] | Value-function | DP | model-based | Online & Offline | N/A
 | DRCGR [67] | Value-function | DQN | model-free | Offline | N/A
 | UDQN [68] | Value-function | DQN | model-free | Offline | ML100K, ML1M, YMusic
 | FeedRec [69] | Value-function | DQN | model-free | Offline | JD
 | PDQ [70] | Value-function | Q-Learning | model-based | Offline | Taobao, Retailrocket
 | KGQR [71] | Value-function | DQN | model-free | Offline | Book-Crossing, ML20M
 | SL+RL [9] | Policy Search | REINFORCE | model-free | Offline | ML100K, ML1M, Steam
 | RCR [72] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp, UT-Zappos50K
 | TPGR [73] | Policy Search | REINFORCE | model-free | Offline | ML10M, Netflix
 | FairRec [74] | Actor-Critic | DPG | model-free | Offline | ML100K, Kiva
 | Attacks&Detection [13] | Actor-Critic | AC | model-free | Offline | Amazon
 | SDAC [75] | Actor-Critic | AC | model-based | Offline | RecSys, Kaggle
 | AAMRL [76] | Actor-Critic | AC | model-free | Online | UT-Zappos50K
 | DRR [77] | Actor-Critic | AC | model-free | Online & Offline | ML1M, Yahoo! Music, ML100K, Jester
Conversational Recommendation | EMC, BTD, EHL [78] | Value-function | Monte Carlo, TD learning | model-free | Online & Offline | TRAVEL, PC, CAMERA, CAR
 | ISRA [79] | Value-function | DP | model-based | Online | N/A
 | EGE [80] | Value-function | Q-learning | model-free | Online | Shoes, Fashion IQ Dress
 | MemN2N [81] | Value-function | DQN | model-free | Offline | Personalized Dialog
 | SCPR [82] | Value-function | DQN | model-free | Online & Offline | Yelp, LastFM
 | UNICORN [83] | Value-function | DQN | model-free | Online & Offline | Yelp, LastFM, Taobao
 | CRM [84] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp
 | EAR [85] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp, LastFM
 | CRSAL [86] | Actor-Critic | A3C | model-free | Offline | DSTC2, CamRest676, MultiWOZ 2.1
 | RelInCo [87] | Actor-Critic | AC | model-free | Offline | OpenDialKG, REDIAL
Sequential Recommendation | SQN [62] | Value-function | Q-learning | model-free | Offline | RC15, RetailRocket
 | DEERS [88] | Value-function | DQN | model-free | Online & Offline | JD
 | NRRS [89] | Value-function | Monte Carlo tree search | model-based | Offline | Million Song
 | RLradio [90] | Value-function | R-Learning | model-based | Online & Offline | N/A
 | KERL [91] | Policy Search | Truncated PG | model-free | Offline | Amazon, LastFM
 | HRL+NAIS [41] | Policy Search | REINFORCE | model-free | Offline | XuetangX
 | DARL [92] | Policy Search | REINFORCE | model-free | Offline | XuetangX
 | SAC [62] | Actor-Critic | AC | model-free | Offline | RC15, RetailRocket
 | DeepChain [93] | Actor-Critic | AC | model-based | Online & Offline | JD
 | SAR [94] | Actor-Critic | AC | model-free | Offline | Steam, Electronics, ML10M, Kindle
Explainable Recommendation | MT Learning [95] | Policy Search | REINFORCE | model-free | Offline | Amazon, Yelp
 | RL-Explanation [12] | Policy Search | REINFORCE | model-free | Offline | Amazon, Yelp
 | MKRLN [96] | Policy Search | REINFORCE | model-free | Offline | movie, book, KKBOX
 | SAKG+SAPL [97] | Policy Search | REINFORCE | model-free | Offline | Amazon
 | PGPR [26] | Policy Search | REINFORCE | model-free | Offline | Amazon
 | ADAC [28] | Actor-Critic | AC | model-free | Offline | Amazon
 | AnchorKG [98] | Actor-Critic | AC | model-free | Offline | MIND, Bing News
In text-based interactive recommender systems, user feed-
back with natural language usually causes undesired issues.
For instance, the recommender system may violate user’s
preferences, since it ignores the previous interactions and thus
recommends similar items. To handle these issues, a Reward
Constrained Recommendation (RCR) model [72] is proposed
to incorporate user preferences sequentially. More specifically,
the text-based interactive recommendation is formulated as a
constraint-augmented RL problem, where the user feedback
is taken as a constraint. To generalize from the constraints,
a discriminator parameterized as the constraint function is
developed to detect the violation of user’s preferences. The
policy is optimized by the policy gradient with baseline (i.e., a
general constraint), to learn constraints on violations of user’s
preferences. Based on the user's feedback on the recommended items, the recommender system utilizes constraints derived from this feedback to prevent undesired text generation.
Moreover, most existing RL methods fail to alleviate the
issue of large discrete action space in interactive recom-
mender systems, since there are a large number of items
to be recommended. To solve this problem, [73] proposes a
Tree-structured Policy Gradient Recommendation framework
(TPGR) to achieve high effectiveness and efficiency for large-
scale interactive recommendations. To maximize long-run cu-
mulative rewards, the REINFORCE algorithm is utilized to
learn the strategy for making recommendation decisions.
3) Actor-Critic Algorithms: Actor-Critic Algorithms have
also been adopted extensively for interactive recommender
systems in recent studies [77]. For instance, an RL-based
framework (i.e., FairRec) [74] is proposed to dynamically
achieve a fairness-accuracy balance, in which the fairness
status of the system and user’s preferences combine to form
the state representation. The FairRec framework contains two
parts: an actor network and a critic network. The actor network
generates the recommendation according to the fairness-aware
state representation. The actor network is trained from the
critic network, and updated by the DPG algorithm. Then, the
critic network estimates the value of the actor network outputs.
The critic network is updated by TD learning.
[13] addresses adversarial attacks in RL-based interactive
recommender systems. They propose a general framework
that consists of two models. In the adversarial attack model,
the agent crafts adversarial examples following Actor-Critic
or REINFORCE algorithm. In the encoder-classification de-
tection model, the agent detects potential attacks based on
the adversarial examples, employing a classifier designed by
attention networks. Following [26], the authors demonstrate
the effectiveness of the proposed framework through judging
whether the recommender system is attacked.
Online interactions in RL for interactive recommendation
may hurt the user experiences. An alternative is to adopt
logged feedback to perform offline learning. However, this usually suffers from some challenges, such as an unknown logging policy and extrapolation error. To deal with these challenges, [75] proposes a stochastic Actor-Critic method based on a probabilistic formulation, and presents some regularization approaches to reduce the extrapolation error.
In interactive recommender systems, utilizing multi-modal
data can enrich user feedback. To this end, [76] proposes a
vision-language recommendation approach that enables effec-
tive interactions with the user by providing natural language
feedback. In addition, an attribute augmented RL is introduced
to model explicit multi-modal matching. More specifically, the
multi-modal data A
t
(i.e., the action of recommending items
at time t) and x
t
(i.e., natural language feedback from users
at time t) are leveraged in the proposed approach. Then a
recommendation tracker, consisting of a feature extractor and a
multi-modal history critic, is designed to enhance the ground-
ing of natural language to visual items. The recommendation
tracker may track the user’s preferences based on a history
of multi-modal matching rewards. The policy is updated via
the Actor-Critic algorithm for recommending the items with
desired attributes to the user.
B. Conversational Recommendation
Contrary to interactive recommender systems where users
receive information passively (i.e., the recommendation system
is dominant), conversational recommender systems interact
with users actively. They explicitly acquire users’ active feed-
back and make recommendations that users really like. To
achieve this objective, different from interactive recommender
systems that recommend items from each interaction, conver-
sational recommender systems [103] usually recommend items
after communicating with users by real-time multi-turn inter-
active conversations, based on natural language understanding
and generation. In this case, there is a critical issue of trade-
offs between exploration and exploitation for conversation
and recommendation strategies [102]. Conversational recom-
mender systems explore the items unseen by a user to capture
the user’s preferences by multi-turn interactions. However,
compared to exploiting the related items that have already
been captured, exploring the items that may be unrelated will
harm the user experience. RL provides potential solutions to
address this challenge. As shown in Fig. 3, the policy network
stimulates the reward centers in conversational recommender
systems, integrating exploration with exploitation in multi-turn
interactions.
1) Value-function Approaches: At each conversation turn
of conversational recommendation, the policy learning usually
aims to decide what to ask, which to recommend, and when to
ask or recommend. [83] introduces the UNICORN (i.e., UNI-
fied COnversational RecommeNder) model that employs an
adaptive DQN method to cast these decision-making processes
as a unified policy learning task. To adapt the conversational
strategy, [79] adopts the DP method to yield better policies that
assist user behaviors more efficiently. Besides, [81] proposes
an RL-based dialogue strategy that utilizes recommendation
results based on the user’s utterances, whose intention is
estimated by a Long Short-Term Memory (LSTM) network.
It is interesting that conversational recommender systems
can use incremental critiquing as a powerful type of user
feedback, to retrieve the items in line with the user’s soft
preferences at each turn. From this point of view, a suitable quality measure is needed for recommendation efficiency.
To achieve this objective, a novel approach is proposed to
improve the quality measure by combining a compatibility
score with similarity score [78]. To evaluate the compatibility
of user critiques, exponential reward functions are presented by
Monte Carlo and TD methods based on the user specialization.
Moreover, a global weighting of user’s preferences is proposed
to enhance the critiquing quality, which brings about faster
convergence of the similarity. By using the combination of
these two scores, the conversational recommender system
improves its robustness against the noisy user data.
However, it is difficult to capture users’ preferences over
time since the recommender system usually only obtains
partial observations of the users’ preferences. In this case,
we can formalize the observed interactions as a Partially Ob-
servable Markov Decision Process (POMDP). Afterwards, the
Estimator-Generator-Evaluator (EGE) [80] trains the POMDP
with Q-learning to track users’ preferences and generates the
next recommendations at each iteration.
The aforementioned conversational recommender systems
usually utilize user feedback in implicit ways. Instead, to make
full use of user preferred attributes on items in an explicit way,
a Simple Conversational Path Reasoning (SCPR) framework
[82] is proposed to conduct interactive path reasoning for
conversational recommendation on a graph. The SCPR obtains
user preferred attributes more easily, by pruning off irrelevant
candidate attributes following user feedback based on a policy
network. The policy network takes the state vector s as input and outputs the action-value Q(s, a), which refers to the estimated reward for the asking action a_ask or the recommending action a_rec.
The policy is optimized by the standard DQN to achieve
its convergence. Different from EAR [85], SCPR uses the
adjacent attribute constraint on the graph to reduce the search
space of attributes, such that the decision-making efficiency can be improved.

Fig. 3. The common framework of conversational recommendation models based on RL. The user interface extracts user intention from the utterances during the conversation [102]. The policy network learns the optimal policy from dialogue states. The recommender system is trained with user intention, item features, and dialogue states, to make personalized recommendations [81], [84], [86].
2) Policy Search Methods: To model the user’s current
intention and long term preferences for personalized recom-
mendations, [84] utilizes deep learning technologies to train
a conversational recommender system. Specifically, a deep
belief tracker extracts a user intention by analyzing the user’s
current utterance. Meanwhile, a deep policy network guides
the dialogue management by the user’s current utterance and
long-term preferences learned by a recommender module.
The framework of the proposed Conversational Recommender
Model (CRM) is composed of three modules: a natural lan-
guage understanding with the belief tracker, dialogue man-
agement based on the policy network, and a recommender
designed based on factorization machine [104]. More specif-
ically, at first, the belief tracker converts the utterance (i.e., the input vector z_t) into a learned vector representation (i.e., S_t) by an LSTM network. Afterwards, to maximize the long-
term return, they use the deep policy network to select a
reasonable action from the dialogue state at each turn. The
REINFORCE algorithm is adopted to optimize the policy
parameter. Finally, the factorization machine is utilized to
train the recommendation module, which generates a list of
personalized items for the corresponding user.
Distinct from previous studies that ignore how to adapt
the recommended items for user feedback, [85] proposes a
unified framework called Estimation Action Reflection (EAR),
in which a Conversation Component (CC) intensively interacts
with a Recommender Component (RC) in the three–stage
process. Specifically, the framework starts from the estimation stage, in which the RC guides the action of the CC by ranking candidate
items. Afterwards, at the action stage, the CC decides which
questions to ask in terms of item features and makes a
recommendation. The conversation action is performed by
a policy network, which is optimized via the REINFORCE
algorithm. When a user rejects the recommended items that
are made at the action stage, the framework moves to the
reflection stage for adjusting its estimations. Extensive experi-
ments demonstrate the proposed EAR outperforms CRM [84]
according to not only fewer conversation turns but also better
recommendations, mainly because the candidate items adapt
to user feedback in the interaction between the RC and CC.
3) Actor-Critic Algorithms: To address the training issue
of task-oriented dialogue systems, a Conversational Recom-
mender System with Adversarial Learning (CRSAL) [86] is
proposed to fully enable two-way communications. The pro-
cess of CRSAL is divided into three stages. In the information
extraction stage, a dialogue state tracker firstly infers the
current dialogue belief state b_t from the user's utterances. Afterwards, a neural intent encoder extracts and encodes the user's utterance intention z_t. Finally, a neural recommender
network derives recommendations from item features and di-
alogue states. In the conversational response generation stage,
a neural policy agent (i.e., the actor) generates human-like
action in the current dialogue state, and a natural language
generator updates conversational responses from the critic
network by the selected action. In the RL stage, the decision
procedure of dialogue actions can be formulated as a Partially
Observable MDP (POMDP). The agent selects the best action
in each conversation round under the long-term policy. To this
end, an adversarial learning mechanism based on the A3C
algorithm is developed to fine-tune the actions generated by
the agent, which employs the discriminator to train the optimal
parameters of the proposed model.
In conversational recommender systems, valuable informa-
tion from user’s utterances is often conducive to the retrieval
performance. For example, [87] proposes an RL-based model
to extract relevant information from the context of the con-
versation. The model introduces two Actors: a selector-Actor
finds the most relevant words for the target of the conversation,
and an arrangement-Actor returns the related order of words
based on the user’s utterances. Both Actors are trained by the
Actor-Critic algorithm.
C. Sequential Recommendation
Unlike interactive recommendation methods that generate
recommendations based on the user’s feedback via constant in-
teractions, sequential recommender systems predict the user’s
future preference and recommend the next item given a
sequence of historical interactions. Let i
u
i:n
= i
u
1
i
u
2
· · · i
u
n
denote the user-item interaction sequence, where n
is the sequence length. The sequential recommender system
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, ACCEPTED 17 MAY 2023. 9
aims to recommend the next item
ˆ
i that is not in the historical
interactions. Some attempts have revealed that RL algorithms
deal well with the sequential recommendation problems, since
such problems can be naturally formulated as an MDP to
predict the user’s long-term preferences. In this case, the
recommender agent naturally performs a sequence of ranking decisions, and it usually learns the optimal policy from the logged data with off-policy methods.
1) Value-function Approaches: In sequential recommenda-
tion, it is essential to fuse long-term user engagement and
user-item interactions (e.g., clicks and purchases) into the rec-
ommendation model training. Learning the recommendation
policy from logged implicit feedback by RL is a promising
direction. However, there exist challenges in implementation due to the lack of negative feedback and the pure off-policy
setting. [62] presents a novel self-supervised RL method to
address the challenges. More precisely, the next item rec-
ommendation problem is modeled as an MDP, where the
recommender agent sequentially recommends items to related
users to maximize the cumulative reward. To optimize the
recommendation model, they define a flexible reward function
that contains purchase interactions, long-term user engage-
ment, and item novelty. Based on this method, the authors
develop a self-supervised Q-learning model to train two layers
with the logged implicit feedback. Similarly, [90] leverages
R-Learning [105] to develop a music recommender system
named RLradio. It exploits both explicitly revealed channel
preferences and user’s implicit feedback, i.e., the user actually
listens to a music track that played in the recommended
channel. [89] also leverages the wireless sensing and RL algo-
rithm to improve the user experience in music recommender
systems, in which user’s preferences are explored by the Monte
Carlo tree search method.
Moreover, in [88], a novel DQN is built for the proposed framework, where a Gated Recurrent Unit (GRU) captures the user's sequential behaviors as the positive state s^+, and the negative state s^- is obtained in a similar way. The positive and negative signals are fed into the input layer separately, which helps the new DQN distinguish the contributions of positive and negative feedback in recommendations.
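As an illustration of this idea of keeping positive and negative signals separate, the sketch below encodes clicked and skipped item sequences with two GRUs before a Q-value head; it is a simplified illustration inspired by the description above rather than the actual DEERS architecture, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PosNegStateEncoder(nn.Module):
    """Illustrative state encoder: two GRUs summarize positively and negatively
    rated item sequences into s_plus and s_minus, which a Q-head consumes jointly."""

    def __init__(self, num_items, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, embed_dim)
        self.pos_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.neg_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(2 * hidden_dim, num_items)   # Q-values over candidate items

    def forward(self, pos_items, neg_items):
        # pos_items / neg_items: (batch, seq_len) indices of clicked / skipped items
        _, s_plus = self.pos_gru(self.item_emb(pos_items))    # (1, batch, hidden_dim)
        _, s_minus = self.neg_gru(self.item_emb(neg_items))
        state = torch.cat([s_plus.squeeze(0), s_minus.squeeze(0)], dim=-1)
        return self.q_head(state)                             # (batch, num_items)
```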
2) Policy Search Methods: It is a non-trivial problem to
capture the user’s long-term preferences in sequential rec-
ommender systems, since the user-item interactions may be
sparse or limited. Thus, it is unreliable for RL algorithms to
learn user interests by using random exploration strategies.
To overcome these issues, [91] proposes a Knowledge-guidEd
Reinforcement Learning (KERL) framework that adopts an
RL model to make recommendations over KG. In the MDP
modeled by KERL, the environment contains the information
of interaction data and KG, which are useful for the sequential
recommendation. In this case, the agent selects an action
a in state S_t for recommending an item i_{t+1} to a user u.
During each recommendation process, the agent obtains an
intermediate reward, which is defined by integrating sequence-
level and knowledge-level rewards.
Moreover, [41] analyzes users’ sequential learning behaviors
and points out that the attention-based recommender systems
perform poorly when the users enroll in diverse historical
courses, because the effects of the contributing items are
diluted by many different items. To remove noisy items and recommend the most relevant items at the next time step, a profile reviser with two-level MDPs is designed. In this profile reviser,
a high-level task decides whether to revise the user profile,
and a low-level task decides which item should be removed.
The agent in the proposed Hierarchical Reinforcement Learn-
ing (HRL) framework performs hierarchical tasks under the
revising policy, which is updated by the REINFORCE algo-
rithm. In addition, for capturing users’ dynamic preferences
in sequential learning behaviors, [92] proposes a dynamic
attention mechanism, which combines with the HRL-based
profile reviser to distinguish the effects of contributing courses
in each interaction. As a result, the proposed model achieves
more accurate recommendations than HRL [41].
3) Actor-Critic Algorithms: As mentioned before, [62]
proposes a novel self-supervised RL method to learn the
recommendation policy from logged implicit feedback in
sequential recommendation. To optimize the recommendation
model, a Self-supervised Actor-Critic (SAC) model treats the
self-supervised layer as an actor to perform ranking and takes
the RL layer as a critic to estimate the state-action value.
However, both SQN and SAC [62] utilize a fixed length
of interaction sequences as input to train the models, which
affects the recommendation accuracy, because users usually
have various sequential patterns. To address this issue, [94]
proposes a Sequence Adaptation model via deep Reinforce-
ment learning (SAR) to adjust the length of interaction se-
quences. In particular, the RL agent performs the selection of
a sequence length (i.e., action) in a personalized manner in
the actor network. Finally, a joint loss function is optimized
to align the cumulative rewards of the critic network with the
recommendation accuracy.
D. Explainable Recommendation
The objective of explainable recommendation is to solve
the problem of interpretability in the recommender systems.
The explainable recommender systems not only provide high-
quality recommendations but also generate relevant explana-
tions for recommendation results [106]. In particular, the visual
explanation seems more intelligent when a recommender sys-
tem conducts path reasoning over KG, because KG contains
rich relationships between users and items for intuitive expla-
nations. Fig. 4 illustrates the common framework of RL with
KG for the explainable recommendation. The KG is cast as
a part of the MDP environment, and the agent performs path
reasoning for recommendations [26], [28].
In the survey on explainable recommendation [107], the
explainable recommendation methods are divided into five
classes: explanation with relevant items or users, feature-based
explanation, social explanation, textual sentence explanation,
and visual explanation. The emphasis of this section is to
survey RL applied to the explainable recommendation, where
the approaches focus on textual sentence explanation and
visual explanation.

Fig. 4. The RL algorithms with KG for explainable recommendation [26], [28]. The models aim to learn a path-finding policy over the KG. The learned policy is adopted to find the interpretable reasoning paths that make accurate recommendations to the given user.

1) Policy Search Methods: In terms of textual sentence explanation, [95] proposes a multi-task learning framework
that simultaneously makes rating predictions and generates
explanations from users’ reviews. The rating prediction is
performed by a context-aware matrix factorization model,
which learns latent factor vectors of users and items. The
recommendation explanation is employed by an adversar-
ial sequence-to-sequence model, in which GRU generates
personalized reviews from observed review documents, and
adversarial training for the review generation is optimized by
the REINFORCE algorithm.
To control the explanation quality flexibly, [12] designs
an RL framework to generate sentence explanations. In the
proposed framework, there are a couple of agents instantiated with attention-based neural networks. The task of one agent
is to generate explanations, while the other agent is respon-
sible for predicting the recommendation ratings based on
the explanations. The environment mainly consists of users,
items, and prior knowledge. In this work, the recommendation
model is treated as a black box. The agents extract the
interpretable components from the environment to generate
effective explanations.
Furthermore, to leverage the image information of entities,
rather than focusing on rich semantic relationships in a hetero-
geneous KG, [96] presents a Multi-modal Knowledge-aware
RL Network (MKRLN), where the representation of recom-
mended path consists of both structural and visual information
of entities. The recommender agent starts from a user and
searches suitable items along hierarchical attention-paths over
the multi-modal KG. These attention-paths can improve the
recommendation accuracy and explicitly explain the reason
for recommendations.
Real-world KGs are generally enormous, which makes finding reasonable paths in the graph a key concern. From this aspect, [26] proposes a reinforcement KG reasoning approach
called Policy-Guided Path Reasoning (PGPR), which performs
recommendations and explanations by providing actual paths
with the REINFORCE algorithm over the KG. Specifically,
the recommendation problem is treated as a deterministic MDP
over the KG. In the training stage, following a soft reward and
user-conditional action pruning strategy, the agent starts from a
given user, and learns to reach the correct items of interest. In
the inference stage, the agent samples diverse reasoning paths
for recommendation by a policy-guided search algorithm, and
generates genuine explanations to answer why the items are
recommended to the user.
Distinct from previous explainable recommendation meth-
ods that use KGs, [97] focuses on sentiment on relations in the
KG. To obtain more convincing explanations with sentiment
analysis, a Sentiment-Aware Knowledge Graph (SAKG) is
constructed by analyzing users’ reviews and ratings on items.
Moreover, a Sentiment-Aware Policy Learning (SAPL) method
is introduced to make recommendations and guide the rea-
soning over the SAKG. Experimental results demonstrate that
the proposed framework outperforms state-of-the-art baselines
(e.g., PGPR [26]), in terms of both accuracy and explainability.
2) Actor-Critic Algorithms: Differing from PGPR [26], the
ADversarial Actor-Critic (ADAC) model [28] is proposed to
identify interpretable reasoning paths. By learning the path-
finding policy, the actor obtains its search states over the KG
and potential actions. The actor obtains the reward R
e,t
if
the path-finding policy from the current state fits the observed
interactions. To integrate the expert path demonstrations, they
design an adversarial imitation learning module based on two
discriminators (i.e., meta-path and path discriminators), which
are trained to distinguish the expert paths from the paths
generated by the actor. When its paths are similar to the expert
paths in the meta-path or path discriminator, the actor obtains
the reward (R
m,t
or R
p,t
) from the imitation learning module.
The critic merges these three rewards (i.e., R
e,t
, R
m,t
, and
R
p,t
) to estimate each action-value accurately. Later, the actor
is trained with an unbiased estimate of the reward gradient
through the learned action-values. Finally, the major modules
of ADAC are jointly optimized to find the demonstration-
guided path, which brings accurate recommendations.
Although existing works (e.g., PGPR [26] and ADAC [28])
combine KG and RL to enhance recommendation reasoning,
they are not suitable for news recommendation, where a
news article usually contains multiple entities. To address this
challenge, a recommendation reasoning paradigm, named An-
chorKG, is proposed to employ anchor KG path to make news
recommendation [98]. The AnchorKG model consists of two
parts. An anchor graph generator captures the latent knowledge
information of the article, which leverages k-hop neighbors
of article entities to learn high-quality reasoning paths. On
the other hand, an AC-based framework is developed to train
the anchor graph generator. Finally, the model performs a
multi-task learning process to optimize both the anchor graph
generator and the recommendation task jointly.
E. Discussion on RL Aspect
As mentioned above, the comprehensive survey of RL-based
recommender systems follows three classes of RL algorithms:
value-function, policy search, and Actor-Critic. From the RL
perspective, we make a comparison of these three types of
methods when they are applied in the recommender systems.
Value-function approaches depend heavily on the sam-
ples to learn the optimal policy. Thus, these approaches
are suitable for small discrete action spaces and are often
applied in small-scale recommender systems, such as
traditional interactive recommendation and conversational
recommendation. However, real-world recommender sys-
tems usually contain large action spaces, which leads to
the slow convergence of RL when they utilize value-
function approaches to make recommendations.
Policy search methods directly optimize the policy with-
out relying on the value function. They are well suited to continuous action spaces, although the policy gradient estimates typically suffer from high variance. Nevertheless, benefitting from their fast convergence properties, policy search methods
are competent in large-scale recommender systems, such
as sequential recommendation and explainable recom-
mendation, especially KG-based recommendation.
Actor-Critic algorithms incorporate the advantages of
value-function approaches and policy search methods.
Nevertheless, Actor-Critic algorithms usually cause po-
tential information loss since they map discrete action
spaces into continuous action spaces by the policy net-
work [47], which ensures the policy is differentiable with
respect to its parameters. Hence, Actor-Critic algorithms
are rarely applied in recommender systems that focus on convincing recommendations, e.g., conversational
recommendation and explainable recommendation.
RL-based recommendation models can also be divided into
model-based and model-free algorithms. As shown in Table II,
a few recommendation models adopt model-based algorithms,
while most existing recommendation models utilize model-free
algorithms.
Model-based algorithms (e.g., DP and heuristic search)
require a model to represent the environment of the
recommender system, so the agent relies on planning, which guarantees sample efficiency. Nevertheless, such al-
gorithms often result in biased estimations since the envi-
ronment of recommender systems dynamically changes.
Moreover, the transition probability is deterministic in
model-based algorithms. Therefore, these algorithms are
not applicable to real-world recommender systems.
Model-free algorithms (e.g., TD, DQN, and REIN-
FORCE) generally achieve better recommendation performance, because the agent mainly relies on learning from
previous experiences. The drawback of such algorithms
is sample inefficiency, i.e., they require a large number
of samples to ensure the convergence of the algorithms.
V. CHALLENGES IN REINFORCEMENT LEARNING BASED
RECOMMENDATION APPROACHES
As noted above, RL aims to maximize long-run cumulative
rewards by learning the optimal policy from the interactions
between the agent and its environment. Thus, it relies not
only on the environment but also on prior knowledge. In
the recommendation applications, RL approaches often suffer
from various challenges. To this end, many researchers put
forward relevant solutions to different issues. In the following,
we summarize relevant studies from five aspects: environment
construction, prior knowledge, reward function definition,
learning bias, and task structuring.
A. Environment Construction
In RL, the agent observes states from the environment, then
conducts relevant actions under the policy, and receives a re-
ward from the environment. Applied to recommender systems,
the policy training in the environment is often confronted with
many unpredictable situations, due to the need for exploration.
In this case, environment construction is critical to learn the
optimal recommendation policy.
1) State Representation: The state representation plays an
important role in RL to capture the dynamic information
during the interactions between users and items, since the
environment state is a key component of MDP. However,
most current RL methods focus on policy learning to opti-
mize recommendation performance. To effectively model the
state representation, [77] proposes a DRL-based recommenda-
tion framework termed DRR, where four state representation
schemes are designed to learn the recommendation policy (the
actor network) and the value function (the critic network).
Specifically, the DRR-p scheme employs the element-wise
product operator to learn the pairwise local dependency be-
tween items. DRR-u scheme incorporates the pairwise interac-
tions between users and items into the DRR-p scheme. DRR-
ave scheme concatenates the user embedding, the average
pooling result of items, and the user-item interactions into
a whole vector to describe the state representation. In the
DRR-att scheme, a weighted average pooling is conducted
by attention networks. The actor network conducts a ranking action a = π_θ(s) according to the state representation s, and generates a corresponding Q-value by the action-value function Q^π(s, a). The critic network leverages a DQN parameterized as Q_w(s, a) to approximate Q^π(s, a). Based on DPG [60], the actor network can be updated by the policy gradient via

\nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{t} \nabla_{a} Q_{w}(s, a)\big|_{s=S_t,\, a=\pi_{\theta}(S_t)}\, \nabla_{\theta} \pi_{\theta}(s)\big|_{s=S_t}, \quad (10)

where J(π_θ) denotes the expectation of all possible Q-values following the policy π_θ, and N is the batch size.
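For readers who prefer code, the following is a minimal PyTorch sketch of the DPG-style actor update in Eq. (10), where the actor is moved along the critic's gradient evaluated at a = π_θ(s); the network sizes, optimizer settings, and sampled batch are assumptions, not the DRR implementation.

```python
# Minimal sketch of the DPG-style actor update in Eq. (10).
# Network sizes, optimizer settings, and the sampled batch are assumptions.

import torch
import torch.nn as nn

state_dim, action_dim, batch = 16, 8, 32

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(batch, state_dim)      # S_t sampled from a replay buffer

actions = actor(states)                     # a = pi_theta(S_t)
q_values = critic(torch.cat([states, actions], dim=-1))
actor_loss = -q_values.mean()               # ascend (1/N) sum_t Q_w(S_t, pi_theta(S_t))

actor_opt.zero_grad()
actor_loss.backward()                       # chain rule: grad_a Q_w * grad_theta pi_theta
actor_opt.step()                            # only the actor is updated here;
                                            # the critic has its own update step
```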
Moreover, [108] introduces the DEMER (i.e., DEcon-
founded Multi-agent Environment Reconstruction) framework,
which assumes that the environment reconstruction from the
historical data is powerful in RL-based recommender systems.
DEMER randomly samples one trajectory from the observed
data, and then forms its first state as the initial observation.
It finally generates a policy-generated trajectory by the TRPO
algorithm, considering the confounder embedded policy as a
role of hidden confounders in the environment.
It may be practical to formulate a simulation environment
of recommendation as an RL gym. To achieve this objective,
PyRecGym [109] is designed for RL-based recommender
systems to support standard test datasets and common input
types. In the PyRecGym, the states of the environment refer
to user profiles, items, and interactive features with contextual
information. The RL agent interacts with a gym engine in the
current state, and obtains related feedback from the gym en-
gine according to the reward. PyRecGym contains three major
functions: an initialization function that sets up the initial state of the environment for the user, a reset function that resets the user state for the next episode and returns the initial state to the RL agent, and a step function that reacts to the agent's action and returns the next state.
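A minimal gym-style environment with the three functions described above might look as follows; the state layout, the simulated click reward, and the episode length are illustrative assumptions and do not reflect PyRecGym's actual API.

```python
# Hedged sketch of a gym-style recommendation environment with the three
# functions described above. The state layout, reward rule, and episode
# length are illustrative assumptions, not PyRecGym's API.

import random

class RecEnv:
    def __init__(self, user_profile, items, max_steps=10):
        # Initialization: build the initial state from user/item features.
        self.user_profile = user_profile
        self.items = items
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Reset: start a new episode and return the initial state to the agent.
        self.step_count = 0
        self.history = []
        return self._state()

    def step(self, action):
        # Step: react to the agent's recommendation, return (state, reward, done).
        self.step_count += 1
        clicked = random.random() < self.items[action]["click_prob"]  # simulated feedback
        reward = 1.0 if clicked else 0.0
        self.history.append((action, reward))
        done = self.step_count >= self.max_steps
        return self._state(), reward, done

    def _state(self):
        return {"user": self.user_profile, "history": list(self.history)}

env = RecEnv(user_profile={"age": 30},
             items=[{"click_prob": 0.3}, {"click_prob": 0.7}])
state, total = env.reset(), 0.0
for _ in range(5):
    state, r, done = env.step(action=1)
    total += r
print("return:", total)
```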
2) Knowledge Graph Leverage: Recent advances in KG
have attracted increasing attention in RL-based recommender
systems. The KG-enhanced interaction fashion enriches user-
item relations by the structural knowledge to describe the MDP
environment. Besides, the multi-hop paths in a KG contribute
to the reasoning process for explainable recommendation.
The KERL framework [91] incorporates KG into RL-based
recommender systems, with the aim of predicting future user’s
preferences and addressing the sparsity. To learn the user’s
preferences from sparse user feedback, [71] proposes a KG-
enhanced Q-learning model for interactive recommender sys-
tems. Instead of learning the policy by trial-and-error search,
the model utilizes the knowledge of correlations among items
learned from KG to enrich the state representation of both
items and users. Thus, it spreads the user’s preferences among
the correlated items over KG.
[26] makes the first attempt to leverage KG and RL for
explainable recommendation. They develop a unified frame-
work (i.e., PGPR) to provide actual recommendation paths
guided by the policy network in a KG. To further improve
the convergence of the PGPR approach, [28] proposes a
demonstration-based KG reasoning framework named ADAC,
in which imperfect path demonstrations are extracted to guide
path-finding. The ADAC model aims to identify interpretable
reasoning paths for accurate recommendations.
Another interesting work is the negative sampling by KG
Policy Network [110]. It is incorporated into the recommen-
dation framework to explore informative negatives over KG.
The recommender (i.e., matrix factorization) and the sampler
(i.e., KG Policy Network) are jointly trained by the iteration
optimization. The recommender parameters are updated by the
stochastic gradient descent (SGD) method, and the sampler
parameters are updated via the REINFORCE algorithm.
3) Negative Sampling: Most existing studies extract nega-
tive sampling from the unobserved data to assist the training
of the recommendation model. However, they often fail to
yield high-quality negative samples to reflect the user’s needs,
which provides an important clue on the environment con-
struction for RL. To discover informative negative feedback
from the missing data, [110] proposes a KG policy network
for knowledge-aware negative sampling, which employs an RL
agent to search high-quality negative instances with multi-hop
exploring paths over the KG. Similarly, to learn the sampling
strategy from the missing negative signals, the supervised
negative Q-learning method [111] trains the RL algorithm with
a supervised sequential learning method to sample negative
items.
Moreover, the user exposure data, which records historical interactions based on implicit feedback, is also beneficial for training negative samples. From this point of view, [112]
introduces a recommender-sampler framework, where the sam-
pler samples candidate negative items as the output, and the
recommender is optimized by the SGD method to learn the
pairwise ranking relation between a ground truth item and a
generated negative item. After obtaining the multiple rewards,
the sampler is optimized by the REINFORCE algorithm to
generate both real and hard negative items. Moreover, [113]
uses a CF-based pre-training method to sample negative items
for RL-based recommender systems.
4) Social Relation: Traditional recommender systems often
suffer from two main challenges, i.e., cold start and data
sparsity. A promising way for alleviating these issues is to
utilize users’ social relationships to model users’ preferences
efficiently. Thus, the agent can sample users’ social relation-
ships and deliver them to the environment.
Applied to RL-based recommendation, [114] integrates
social networks into the estimation of action-values. More
specifically, a social matrix factorization method is proposed
to describe the high-level state/action representations. To learn
more relevant hidden representations from personal prefer-
ences and social influence, an enhanced SADQN model is
developed to utilize additional neural layers to summarize
potential features from the hidden representations, and then
predict the final action-value with the summarized features.
Moreover, [115] proposes a Social Recommendation frame-
work based on Reinforcement Learning (SRRL) to identify
reliable social relationships for the target user. In particular,
SRRL adaptively samples the social friends to improve the
recommendation quality with user feedback, since the reward
is always real-time.
B. Prior Knowledge
The combination of imitation learning with RL is known as apprenticeship learning [116], which utilizes demonstrations to initialize the RL agent. In particular, the Inverse Rein-
forcement Learning (IRL) algorithm [117] often assumes that
the expert acts to maximize the reward, and derives the optimal
policy from the learned reward function. In RL-based rec-
ommender systems, the reward signals are usually unknown,
since users scarcely offer feedback. On the other hand, IRL
algorithms are good at reconstructing the reward function for
the optimal policy from users’ observed trajectories.
To improve the novelty of next-POI (i.e., Point-of-Interest) recommendation and thereby boost users' interests, [118] adopts the Maximum Likelihood IRL (MLIRL) algorithm [119] to model the unknown user's preferences based on the state features. This method exploits the
knowledge of the user’s preferences to estimate an initial
reward function that justifies the observed trajectories, and
optimizes the user’s behavior by a gradient ascent method.
To mine users’ interests in online communities efficiently,
[120] designs a reinforced user profiling framework for recommendation
by employing both data-specific and expert knowledge, where
the agent employs random search to find data-specific paths in
the environment. These meta-paths contain expert knowledge
and semantic meanings for searching useful nodes.
Distinct from the top-K recommendation with the target
of ranking optimization, the novel exact-K recommendation
focuses on combinatorial optimization. It may be more suit-
able to address the recommendation problems in application
scenarios. Towards this end, [121] designs an encoder-decoder
framework named Graph Attention Networks (GAttN). The
proposed framework learns the joint distribution of the K
items and outputs an optimal card that contains these K items.
To train the GAttN efficiently, an RL from demonstrations
method integrates behavior cloning [122] into RL.
C. Reward Function Definition
The agent behavior in RL indirectly relies on the reward
function, while in practice, it is a key challenge to define
a promising reward function. Beyond the need to learn the
policy by intermediate rewards, the reward definition based on
the specific demands in different recommendation scenarios is
necessary. For instance, [62] defines a flexible reward function
that contains purchase interactions, long-term user engage-
ment, and item novelty for the relevant recommendation tasks.
[123] defines discrete rewards for different user behaviors.
Moreover, a one-step reward in [124] is distributed to the
online recommendations during visitor interactions.
A creative work is reported in [125] which leverages
GAN to model the dynamics of user behaviors and learn
relevant reward functions, which depends not only on the
user’s historical behaviors but also on the selected item. The
learned reward allows the agent to recommend items in a
principled way, instead of relying on the manual design. Based
on the simulation environment using the user behavior model,
a cascading DQN is proposed to learn the combinatorial rec-
ommendation policy. Similarly, [126] introduces a generative
IRL approach to avoid defining a reward function manually.
The recommendation problem is regarded as automatic policy
learning. Thus, this approach first generates a policy based on
the user’s preference. Later it uses a discriminative actor-critic
network to evaluate the learned policy, based on the reward
function defined by
R(s, a) = \log D(s, a) - \log\big(\max(\epsilon,\, 1 - \log D(s, a))\big) + r, \quad (11)

where log D(s, a) is the reward generated by the discriminator, r ∈ [0, 1] represents the user's feedback, which prevents the agent from taking an extra step to update itself when the user clicks each recommendation, and ϵ is the maximum percentage of change that is updated at a time.
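The following small sketch evaluates the adversarial reward in Eq. (11) as reconstructed above; the discriminator output and the clamp value ϵ are illustrative assumptions.

```python
# Small sketch evaluating the adversarial reward in Eq. (11) as reconstructed
# above; the clamp value and the discriminator output are assumptions.

import math

def adversarial_reward(d_sa, user_feedback_r, eps=0.1):
    """R(s, a) = log D(s, a) - log(max(eps, 1 - log D(s, a))) + r."""
    log_d = math.log(d_sa)
    return log_d - math.log(max(eps, 1.0 - log_d)) + user_feedback_r

# Discriminator fairly confident the (state, action) pair looks expert-like,
# and the user clicked the recommendation (r = 1):
print(adversarial_reward(d_sa=0.8, user_feedback_r=1.0))
```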
For commercial recommendation systems, online advertis-
ing is frequently inserted into personalized recommendation
to maximize the profit. To this aim, a value-aware recommen-
dation model [127] based on RL is designed to optimize the
economic value of candidate items to make recommendations.
In the value-aware model, measured by the gross merchandise volume, the total reward is defined as the expected profit converted from all types of user actions (e.g., click and pay).
Moreover, [128] presents an advertising strategy for DQN-
based recommendation, where the advertising agent simulta-
neously maximizes the income of advertising and minimizes
the negative influence of advertising on the user experience.
Thus, the reward contains both the income of advertising and
the influence of advertising on the user experience.
In practice, users have interests in exploring novel items. To
this end, [129] proposes a fast Monte Carlo tree search method
for diversifying recommendations, where the reward function
is designed with the diversity and accuracy gain derived from
recommending items at corresponding positions.
D. Learning Bias
RL algorithms can be classified into on-policy and off-
policy methods [49]. On-policy methods often sweep through
all states to learn the policy, which incurs high costs; hence, on-policy methods are not applicable to large-scale
recommender systems. On the other hand, learning the rec-
ommendation policy from logged data (e.g., logged user
feedback) is a more practical solution, because it alleviates
complex state space and high interaction cost with off-policy
[130], [131]. Nevertheless, due to the difference between the
target policy and the behavior policy, the off-policy methods
usually result in data biases or policy biases.
1) Data Biases: A large amount of logged implicit feedback, such as user clicks and dwell time, is available for learning users'
preferences. However, the learning methods tend to easily
suffer from biases caused by only observing partial feedback
on previous recommendations. To deal with such data biases,
[132] proposes a recommender system with the REINFORCE
algorithm, where an off-policy correction approach is utilized
to learn the recommendation policy from the logged implicit
feedback, and incorporates the learned model of the behavior
policy to adjust the data biases.
Differing from [132] that simply tackles data biases in
the candidate generation module, the scalable recommender
systems should contain not only the candidate generation stage
but also a more powerful ranking stage. Toward this end, a
two-stage off-policy method [133] is proposed to obtain data
biases from logged user feedback and correct such biases
by using inverse propensity scoring [134]. More precisely,
the ranking module usually changes between the logged data
(in behavior policy) and the candidate generation (in target
policy). When there are evidently different preferences on the
items to recommend, the two-stage off-policy policy gradient
can be conducted to correct such biases. Moreover, the vari-
ance is reduced by introducing a hyper-parameter to down-
weight the gradient of the sampled candidates.
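To illustrate the off-policy correction idea, here is a minimal PyTorch sketch of an importance-weighted REINFORCE update over logged feedback, with the ratio between the target policy and an estimated behavior policy capped to control variance; the cap value, networks, and data are assumptions rather than the cited systems' implementations.

```python
# Hedged sketch of off-policy-corrected REINFORCE: logged actions are
# re-weighted by pi_target(a|s) / pi_behavior(a|s), capped to limit variance.
# Networks, cap value, and the fake logged batch are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, state_dim = 100, 16
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_items))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One batch of logged interactions: state, action taken by the behavior
# policy, observed reward, and the behavior policy's (estimated) probability.
states = torch.randn(32, state_dim)
actions = torch.randint(0, n_items, (32,))
rewards = torch.rand(32)
behavior_probs = torch.rand(32).clamp(min=1e-3)

log_probs = F.log_softmax(policy(states), dim=-1)
target_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

# Importance weight pi_target(a|s) / pi_behavior(a|s), capped for variance control.
weights = (target_log_probs.exp() / behavior_probs).clamp(max=5.0).detach()

loss = -(weights * rewards * target_log_probs).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```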
2) Policy Biases: Direct offline learning methods, such as Monte Carlo and TD methods, suffer from either heavy computation or unstable convergence. To handle these
problems, [70] proposes the PDQ framework to tackle the
selection bias between the recommendation policy and logging
policy. More specifically, the offline learning process is divided
into two steps. In the first step, a user simulator is iteratively
updated under the recommendation policy via de-biasing the
selection bias. In the second step, the recommendation policy
is improved with Q-learning, by both the logged data and user
simulator. In addition, a regularizer for the reward function
reduces both bias and variance in the learned policy.
Moreover, [135] proposes a model-based RL method to
model the user-agent interactions for offline policy learning
with adversarial training, where the agent interacts with the
user behavior model to generate recommendations that ap-
proximate the true data distribution. To reduce the bias in
the learned policy, the discriminator is employed to rescale
the generated rewards for the policy learning, and de-bias the
user behavior model by distinguishing simulated trajectories
from the real interactions. As a result, the recommendation
performance can be improved. Similarly, in the Adversarial
future encoding (AFE) model [136], a future-aware discrimi-
nator is taken as a recommendation module to identify user-
item pairs, whereas a generator confuses the future-aware
discriminator by generating items with only common features.
The AFE model is optimized with the recommendation loss,
which reduces the optimization bias in the pre-training task.
E. Task Structuring
The complexity of the RL task can generally be reduced
by decomposing the RL task into several basic components
or a sequence of subtasks [20]. For recommender systems,
previous studies focus on Multi-Agent Reinforcement Learn-
ing (MARL), HRL, and Supervised Reinforcement Learning
(SRL).
1) Multi-agent Reinforcement Learning: RL-based recom-
mender systems suffer from inherent challenges (e.g., the curse
of dimensionality) when they employ a single agent to perform
the task. An alternative solution is to leverage multiple agents
with similar tasks to improve the learning efficiency with the
help of parallel computation. Generally, MARL algorithms can
be classified into four categories, i.e., fully cooperative, fully
competitive, both cooperative and competitive, and neither
cooperative nor competitive tasks [137].
Fully cooperative tasks in MARL-based recommender sys-
tems have the same goal (i.e., maximizing the same cumulative
return) achieved by all the agents [93]. For instance, in the
DEMER approach [108], the policy agent cooperates with the
environment agent to learn the policy of a hidden confounder.
Moreover, to capture the general preference of users and
their temporary interests, [138] introduces a Temporary In-
terest Aware Recommendation (TIARec) model with MARL.
Particularly, an auxiliary classifier agent can judge whether
each interaction is atypical or not. The classifier agent and
the recommender agent are jointly trained to maximize the
cumulative return of the recommendations.
In contrast, fully competitive tasks typically adopt the mini-
max principle: each agent maximizes its own benefit under the
assumption that the opponents keep endeavoring to minimize
it. In a dynamic collaboration recommendation method for
recommending collaborators to scholars [65], the competition
should be characterized as a latent factor, since scholars hope
to compete for better candidates. To this end, the proposed
method uses competitive MARL to model scholarly competi-
tion, i.e., multiple agents (authors) compete with each other by
learning the optimized recommendation trajectories. Besides
that, an improved market-based recommendation model [139]
urges all agents to classify their recommendations into various
internal quality levels, and employs Boltzmann exploration
strategy to conduct these tasks by the recommender agents.
Both cooperative and competitive tasks also exist in MARL-
based recommender systems [140]. For example, [141] aims
to recommend public accessible charging stations intelligently.
Each charging station is regarded as an individual agent. Sub-
sequently, a centralized attentive critic module is developed to
stimulate multiple agents to learn cooperative policies. Meanwhile, a delayed access strategy is proposed to exploit future charging competition information during model training. Besides, to ad-
dress the sub-optimal policy of ranking due to the competition
between independent recommender modules, [38] promotes
the cooperation of different modules by generating signals
for these modules. Each agent acts on the basis of its signal
without mutual communication. For the i-th agent with the signal vector ϕ^i, given a Q-value function Q^i_θ(S_t, A_t) and a shared signal network Φ^i(S_t), the objective function can be defined as follows:

J_{\phi}(\xi) = \frac{1}{N} \sum_{i} \mathbb{E}_{S_t, A^{i}_{t} \sim D,\, \phi^{i} \sim \Phi^{i}} \Big[ Q^{i}_{\theta}\big(S_t, A^{i}_{t}, \pi^{i}(S_t, \phi^{i}_{t})\big) + \alpha \log \Phi^{i}(\phi^{i}_{t} \mid S_t) \Big], \quad (12)
where ξ and θ are the network parameters, D denotes the
distribution of samples, and N is the batch size. The objective
function for each agent is optimized by the SAC algorithm.
2) Hierarchical Reinforcement Learning: In HRL, an RL
problem can be decomposed into a hierarchy of subproblems
or subtasks, which reduces the computational complexity.
There exist some studies for HRL, which solve the recom-
mendation problems well. For example, using Hierarchies of
Abstract Machines (HAMs) [142], [41] formalizes the overall
task of profile reviser as an MDP, and decomposes the task
MDP into two abstract subtasks. If the agent decides to revise
the user profile (i.e., a high-level task), it allows the high-
level task to call a low-level task to remove noisy courses. To
improve the recommendation adaptivity, a Dynamic Attention
and hierarchical Reinforcement Learning (DARL) framework
[92] is developed to automatically track the changes of the
user’s preferences in each interaction. These two methods
adopt the REINFORCE algorithm to optimize both the high-
level and low-level policy functions.
HRL with the MaxQ approach [143] builds a task hierarchy that restricts subtasks to different subsets of states, actions, and policies of the task MDP without introducing extra states. For recommender systems, a multi-goal abstraction based HRL model [11] is designed to learn the user's hierarchical interests. In
addition, [144] proposes a novel HRL model for the integrated
recommendation (i.e., simultaneously recommending the het-
erogeneous items from different sources). In the proposed
model, the task of integrated recommendation is divided into
two subtasks (i.e., sequentially recommending channels and items). The high-level agent works as a channel selector, which provides personalized channel lists in terms of the user's preferences. On the other hand, the low-level agent is regarded as an item recommender, which recommends heterogeneous items based on the channel constraints.

TABLE III
OVERVIEW OF CHALLENGES IN RL-BASED RECOMMENDATION APPROACHES.

Issue | Model | Evaluation Strategy
Environment Construction / State Representation | DRR [77], DEMER [108] | Online & Offline
Environment Construction / State Representation | PyRecGym [109] | Offline
Environment Construction / KG Leverage | KERL [91], KGQR [71], PGPR [26], ADAC [28], Attacks&Detection [13], KGPolicy [110] | Offline
Environment Construction / Negative Sampling | KGPolicy [110], RNS [112], SNQN [111], DCFGAN [113] | Offline
Environment Construction / Social Relation | SADQN [114] | Online
Environment Construction / Social Relation | SRRL [115] | Offline
Prior Knowledge | CBHR [118], DR [120], GAttN [121] | Offline
Reward Function Definition | SQN and SAC [62], VPQ [123], GAN-CDQN [125] | Offline
Reward Function Definition | DEAR [128] | Online & Offline
Reward Function Definition | PWR [124], Value-based RL [127], GIRL [126] | Online
Learning Bias / Data Biases | Off-policy Correction [132], 2-IPS [133] | Online & Offline
Learning Bias / Policy Biases | PDQ [70] | Offline
Learning Bias / Policy Biases | IRecGAN [135], AFE [136] | Online & Offline
Task Structuring / Multi-agent RL | DEMER [108], MASSA [38], DeepChain [93], RLCharge [140] | Online & Offline
Task Structuring / Multi-agent RL | Multi-with RL [65], INQ [139], TIARec [138], MASTER [141] | Offline
Task Structuring / Hierarchical RL | HRL+NAIS [41], DARL [92], SHIL [146] | Offline
Task Structuring / Hierarchical RL | MaHRL [11], HRL-Rec [144] | Online & Offline
Task Structuring / Supervised RL | SQN and SAC [62], SRL-RNN [43], SL+RL [9], PAR [147], Off-policy with guarantees [148] | Offline
Task Structuring / Supervised RL | SRR [149], EDRR [63] | Online & Offline
Another prevailing approach is the options framework (i.e.,
closed-loop policies for taking action over a period of time)
[145], which generalizes the primitive actions to include a
temporally extended navigation of actions. For example, [146]
designs a Subgoal conditioned Hierarchical Imitation Learning
(SHIL) framework for dynamic treatment recommendation.
In the SHIL framework, the high-level policy sequentially
selects a subgoal for each sub-task. Based on each subgoal, the
low-level policy produces the low-level action (i.e., effective
medication) in the corresponding state for each sub-task.
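A hedged sketch of this two-level decomposition, with a high-level policy that picks a subgoal and a low-level policy that picks a primitive action conditioned on the state and subgoal, is given below; the networks and dimensions are assumptions, not the SHIL architecture.

```python
# Hedged sketch of a two-level (high-level subgoal, low-level action) policy.
# Networks and dimensions are assumptions.

import torch
import torch.nn as nn

state_dim, n_subgoals, n_actions = 16, 4, 50

high_level = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_subgoals))
low_level = nn.Sequential(nn.Linear(state_dim + n_subgoals, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))

def act(state):
    # High-level: choose a subgoal for the current sub-task.
    subgoal_logits = high_level(state)
    subgoal = torch.distributions.Categorical(logits=subgoal_logits).sample()
    subgoal_onehot = torch.nn.functional.one_hot(subgoal, n_subgoals).float()
    # Low-level: choose the primitive action (e.g., a medication or an item)
    # conditioned on the state and the selected subgoal.
    action_logits = low_level(torch.cat([state, subgoal_onehot], dim=-1))
    action = torch.distributions.Categorical(logits=action_logits).sample()
    return subgoal.item(), action.item()

print(act(torch.randn(state_dim)))
```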
3) Supervised Reinforcement Learning: In practice, recommendation problems can often be solved more easily by combining RL with supervised learning rather than using RL alone, as each paradigm can handle the subtasks for which it is best suited.
For example, [62] proposes a self-supervised RL model to
learn the recommendation policy from users’ logged feedback
in sequential recommender systems. The model has two output
layers: One is the self-supervised layer trained with cross-
entropy loss function to perform ranking. The other is trained
with RL based on a flexible reward function, which performs
as a regularizer for the supervised layer. Studies in [147], [148]
focus on optimizing the user’s life-time value in personalized
ad recommender systems. To achieve this goal, they propose
an RL-based recommendation model, where mapping from
features to actions is learned by a random forest algorithm.
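Returning to the two-head design of [62] described above, the following is a hedged sketch of combining a cross-entropy ranking head with a one-step Q-learning head that acts as a regularizer; the shared encoder, reward rule, and loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a two-head training signal: a cross-entropy ranking head
# plus an RL (one-step Q-learning) head acting as a regularizer. The encoder,
# reward rule, and loss weighting are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, hidden = 100, 32
encoder = nn.GRU(input_size=n_items, hidden_size=hidden, batch_first=True)
rank_head = nn.Linear(hidden, n_items)   # self-supervised (cross-entropy) head
q_head = nn.Linear(hidden, n_items)      # RL head (one-step Q-learning)

# A batch of user sequences (one-hot item embeddings) and their next items.
seqs = F.one_hot(torch.randint(0, n_items, (8, 5)), n_items).float()
next_items = torch.randint(0, n_items, (8,))
rewards = torch.ones(8)                  # e.g. 1.0 for a purchase (assumption)

_, h = encoder(seqs)
state = h.squeeze(0)                     # shared sequence representation

ce_loss = F.cross_entropy(rank_head(state), next_items)
q_pred = q_head(state).gather(1, next_items.unsqueeze(1)).squeeze(1)
q_loss = F.mse_loss(q_pred, rewards)     # one-step target: just the reward (assumption)

loss = ce_loss + 0.5 * q_loss            # RL head regularizes the ranking head
loss.backward()
```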
SRL is often applied to satisfy the adaptability of recom-
mendation strategies. For instance, to address the top-aware
drawback (i.e., the performance on the top positions) that
may reduce user experiences, a supervised DRL model [149]
jointly utilizes the supervision signals and RL signal to learn
the recommendation policy in a complementary way. Different
from the top-aware recommender distillation framework [150]
that utilizes DQN to reinforce the rank of recommendation
lists, the supervised DRL model contains two styles of supervi-
sion signals. It adopts the cross-entropy loss for classification-
based supervision signal, and employs pairwise ranking-based
loss for the ranking-based supervision signal. The suitable
supervision signals adaptively balance the long-term reward
and immediate reward.
Moreover, [43] puts forward SRL with RNN for dynamic
treatment recommender systems. By combining the supervised
learning signal (e.g., the indicator of doctor prescriptions) with
the RL signal (e.g., maximizing the cumulative reward from
survival rates), this approach learns the prescription policy
to refrain from unacceptable risks and provides the optimal
treatment. [9] also proposes a novel SRL with RNN for the
long-term recommendation. More precisely, RNN is employed
to adaptively evolve user states for simulating the sequential
interactions between recommender system and users.
To tackle the training compatibility among the components
of embedding, state representation, and policy in RL-based
recommender systems, [63] proposes an End-to-end DRL-
based Recommendation framework namely EDRR, in which
a supervised learning signal is designed as a classifier of
the user’s feedback on the recommendation results. Moreover,
DQN and DDPG are employed to elaborate how embedding
component works in the proposed framework.
F. Summary
An overview of challenges in RL-based recommendation
approaches is presented in Table III. Overall, a number of
studies have attempted to address the challenges of applying
RL in recommender systems. On the one hand, many studies
focus on adapting RL algorithms to different recommenda-
tion scenarios. For instance, a promising reward function
that is skillfully designed can meet specific requirements of
the recommender system. On the other hand, some studies
leverage related techniques to solve the issues of RL-based
recommender systems. For example, the KG not only enriches
user-item relations by the structural knowledge to describe
the MDP environment but also contributes to explainable
recommendations by the multi-hop paths. Nonetheless, there
are some other possible limitations of RL-based recommender
systems, as well as the challenges in complex application
scenarios. For these limitations and challenges, we provide
detailed discussions in the following section.
VI. DISCUSSION
In this section, we first discuss the central issues in ap-
plying RL algorithms for recommender systems. Afterwards,
we analyze practical challenges and provide several potential
insights into successful techniques for RL-based recommender
systems.
A. Open Issues
In practice, RL algorithms are difficult to apply directly to real-time recommender systems, due to their trial-and-error mechanism.
A key step in the application of RL algorithms is to improve
the recommendation effectiveness and realize an intelligent
design. Hence, the following issues need to be solved for
making much progress in this field.
1) Sampling Efficiency: Sampling plays a substantial role
in RL, especially in DRL-based recommender systems. Im-
portance sampling has empirically demonstrated its feasibility in both on-policy and off-policy settings. Some works [110],
[112] have highlighted the superiority of negative sampling
in recommender systems. Nevertheless, it is necessary to
focus on the improvement in sampling efficiency, since the
user feedback available to train recommendation models is
scarce. We can leverage auxiliary tasks to improve sampling
efficiency. For example, [151] develops a user response model
to predict the user’s positive or negative responses towards the
recommendation results. Thus, the state and action representa-
tions can be enhanced with these responses. Moreover, transfer
learning may be qualified for the sampling task. For example,
we can use the transfer of knowledge between temporal tasks
[152] to perform RL on an extended state space, and concretize
similarity between the source task and the target task by
logical transferability. In addition, the transfer of experience
samples [153] can estimate the relevance of source samples.
Both transfer learning methods may be able to improve the
sampling efficiency in RL-based recommender systems.
2) Reproducibility: Although most existing studies have performed empirical experiments on public datasets, as shown in Table II, it is often difficult to reproduce the results of RL algorithms in many application scenarios, including recommendation. The reasons include the instability of RL algorithms, the lack of open-source code and detailed accounts of hyper-parameters, and differences in simulation environments (e.g., experimental setups and implementation details). Especially for DRL methods, the non-determinism of the policy network and the intrinsic variance of these methods lead to a reproducibility crisis. Besides policy evaluation methods, there are several
future directions for the reproducibility investigation with
statistical analysis. We can use significance testing according
to related metrics or hyper-parameters such as batch size and
reward scale [154]. For example, the reward scale needs to
check its rationality and robustness in specific recommenda-
tion scenarios. The average returns should be evaluated to
verify whether it is proportional to the relevant performance.
Moreover, because RL algorithms are sensitive to changes in their environments, random seeds, and the definition of the reward function, we should ensure reproducible results with
fair comparisons. To achieve this objective, we need to run
the same experiment trials for baseline algorithms, and take
each evaluation with the same preset epoch. Moreover, all
experiments should adopt the same random seeds [154].
3) MDP Formulation: In principle, MDP formulation is
essential to guarantee the performance of RL algorithms.
However, the state and action spaces in recommender systems
often suffer from the curse of dimensionality. Besides, the
reward definition is sensitive to the external environment since
users’ demands may change constantly, while interactions
between users and items are random or uncertain. To address
these issues, we can employ task-specific inductive biases
to learn the representations of selection-specific action. As
a result, the agent relies on better action structures to learn
RL policies when the recommendation problem is formulated
[155]. Another direction is to use causal graph and probabilis-
tic graph models [156] to enable the MDP formulation, where
the causal graph describes the causal-effect relations among
user-item interactions, and the probabilistic graph reasons the
recommendation paths. This joint process can improve the
efficiency of agent search and tracks users’ interests over time.
However, how to design MDPs for recommendation problems
in a verifiable way, remains an open issue.
4) Generalization: Model generalization ability is pursued by almost all applications. Limited by the shortcomings
of different RL algorithms, it is difficult to develop a general
framework to meet various specific requirements of recom-
mendation. On the one hand, model-based algorithms are only
applicable to the specified recommendation problem, and fail
to solve the problem that cannot be modeled. On the other
hand, model-free algorithms are insufficient in dealing with
different tasks in complex environments. Fortunately, meta-
RL, such as Meta-Strategy [157] and Meta-Q-Learning [158],
first learns a large number of RL tasks to obtain enough
prior knowledge, and later can be quickly adapted to the
new environment in face of new tasks. In this case, it may
enable the generalization of RL-based recommender systems.
Besides transfer learning based on meta-RL, we can use multi-
task learning [159] to handle related tasks (e.g., construction
of user profiles, recommendation, and causal reasoning) in
parallel. These tasks complement each other to improve the
generalization performance via shared representation of do-
main information, such as parameter-based sharing, and joint
feature-based sharing.
5) Autonomy: It seems easy to utilize RL for autonomous
control [48]. However, in practice, it is difficult to achieve
this objective in online recommendation scenarios, since there
are complex interactions between users and items, and it is not feasible to capture users' dynamic intentions with existing
RL algorithms. To this end, we can achieve the feedback
control of recommender systems by combining RL with LSTM
or GRU, which enables the RL agent to be possessed of
powerful memory that preserves the sequences of state and
action, as well as different kinds of reward. Later, the historical
information is encoded and transmitted to the policy network,
thereby improving the autonomous navigation ability of the RL
agent. Another interesting research direction is to improve the
autonomous learning ability of RL from the fields of biology,
neuroscience, and cybernetics. Thus, it is straightforward to
use the learned knowledge to make better recommendations.
B. Practical Challenges
The existing recommendation models based on RL have
demonstrated superior recommendation performance. Never-
theless, there are many challenges and opportunities in this
field. We summarize five potential directions that deserve more
research efforts from the practical aspect.
1) Computational Complexity: RL-based recommender systems often suffer from heavy computation due to the
exploration-exploitation tradeoff and the curse of dimension-
ality [160]. Apart from task structuring techniques, the IRL
algorithm is an efficient solution to reduce the computa-
tion cost. Initializing RL with demonstrations can guide the agent to take correct actions, particularly for some specific recommendation tasks. For example, apprenticeship
learning via IRL algorithm [116] highlights the need for
learning from an expert, which maximizes a reward based
on a linear combination of known features. The hierarchical
DQNs [161] alleviate the curse of dimensionality in large-scale
recommender systems. Besides, a promising method of driving
route recommendation based on RL is to employ behavior
cloning [122], which makes expert trajectories available and
quickens the learning speed. Another feasible scheme is to
improve the efficiency of agent exploration. For example, we
can adopt NoisyNet [162] that adds parametric noise to its
network weights, which aids efficient exploration according
to the stochasticity of the policy for the recommender agent.
2) Evaluation: Most existing RL-based recommender sys-
tems focus on the single goal of recommendation accuracy,
without considering recommendation novelty and diversity
[163] based on the user experience. Beyond the need for new
evaluation measures for multi-objective goals of recommen-
dation (e.g., popularity rate [164]), we should design stan-
dard metrics for other non-standardized evaluation measures
(e.g., diversity, novelty, explainability, and safety). In general, these kinds of evaluation measures can be regarded as combinatorial optimization problems, which may be well addressed by multi-objective evolutionary algorithms [165].
Recently, some works have developed Pareto efficiency models
for multi-objective recommendation [166], [167]. For example,
in the personalized approximate Pareto-efficient recommender
system [168], a Pareto-oriented RL module learns personalized
objective weights on multiple objectives for the target user.
Nevertheless, it remains a challenge to reconcile different evaluation measures for recommendation, since these measures are usually correlated and may even conflict with each other.
3) Biases: In recommender systems, item popularity often
changes over time due to the user engagement and recom-
mendation strategy [169], thus long-tailed items are rarely
recommended to users. The selection bias may hurt user satis-
faction [170]. Therefore, it is necessary to concentrate on the
fairness work. Towards this end, we can combine anthropology
to analyze the differences in human behavior and cultural
characteristics, or explore the user’s intentions in the user-
recommender interactions. Besides, heterogeneous data of the
user behavior often exists in online recommendation platforms,
whereas most recommendation approaches are trained with a
single type of data. Due to the information asymmetry between
the user behavior and training model, the recommender system
suffers from learning bias. Undoubtedly, the previous feature
extraction and representation learning are crucial to deal with
such bias. In addition, MARL can be used to let different agents process multi-dimensional data simultaneously and share parameters for a unified recommendation goal.
4) Interpretability: Due to the complexity of RL, the
post-hoc explanation may be easier to achieve than intrin-
sic interpretability [171]. Actually, the same is true in RL-
based recommendation systems. How to explore intrinsic
interpretable methods for RL to provide more convincing
recommendations is a promising line. Another research direc-
tion towards explainable recommendation is to provide formal
guidance of recommendation reasoning process, rather than
being concerned with a multi-hop reasoning process [172].
[173] proposes a user-centric path reasoning framework that
adopts a multi-view structure to combine sequence reasoning
information with subsequent user decision-making. However,
they only focus on the user’s demand and do not provide
theoretical proof of the rationality of reasoning. We should
avoid plausible explanations in the reasoning process. To
this end, we can employ multi-task learning to perform the
recommendation reasoning process among multiple related
tasks, e.g., representation of interaction, recommended path
generation, and Bayesian inference for policy network. These
related tasks jointly provide credible recommendation expla-
nations by sharing the representation of internal correlation
and causality.
5) Safety and Privacy: System security and user privacy are
important issues, which are ignored by most existing studies.
For example, personal privacy can easily be leaked when using RL and KG to perform explainable recommendation,
because the relationships among users and items are exposed.
Differential privacy is widely applied to protect user privacy,
and DRL can be employed to choose the privacy budget
against inference attacks [174]. Besides, recent studies have
found that Deep Neural Networks (DNNs) are vulnerable to
attacks, such as adversarial attacks and data poisoning. For
example, in DNNs-based recommender systems, users with
fake profiles may be generated to promote selected items.
To address this issue, a novel black-box attacking framework
[175] adopts policy gradient networks to refine real user
profiles from a source domain and copy them into the target
domain. Notwithstanding, more efforts are needed on this research topic. For example, safe RL [176] can be
utilized to guarantee the security of recommender systems
(e.g., detecting error states, and preventing abnormal agent
actions). We may also leverage federated learning [177],
[178] to achieve privacy-preserving data analysis for MARL-
based recommender systems, where each agent is localized on a distributed device and updates a local model based on the user data stored on the corresponding user device.
VII. CONCLUSION
Recommender systems serve as a powerful technique to
address the information overload issue on the Internet. There
has been increasing interest in extending RL approaches for
recommendations in recent years. RL-based recommendation
methods autonomously learn the optimal recommendation
policies from user-item interactions, and thus they recommend
better items to users, compared with other recommendation
methods. In this survey, we firstly conduct a comprehensive
review on RL-based recommender systems, using three major
categories of RL (i.e., value-function, policy search, and Actor-
Critic) to cover four typical recommendation scenarios. We
also restructure the general frameworks for some specific
scenarios, such as interactive recommendation, conversational
recommendation, and the explainable recommendation based
on KG. Furthermore, the challenges of applying RL in rec-
ommender systems are systematically analyzed, including en-
vironment construction, prior knowledge, the definition of the
reward function, learning bias, and task structuring. To facili-
tate future progress in this field, we discuss theoretical issues
of RL and analyze the limitations of existing approaches, and
finally put forward some possible future directions.
REFERENCES
[1] S. Deng, L. Huang, G. Xu, X. Wu, and Z. Wu, “On deep learning for
trust-aware recommendations in social networks, IEEE Trans. Neural
Netw. Learn. Syst., vol. 28, no. 5, pp. 1164–1177, May 2017.
[2] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez, “Recom-
mender systems survey, Knowl.-Based Syst., vol. 46, pp. 109–132,
July 2013.
[3] Z. Huang, X. Xu, H. Zhu, and M. Zhou, “An efficient group recommen-
dation model with multiattention-based neural networks, IEEE Trans.
Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4461–4474, November
2020.
[4] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: A survey of the state-of-the-art and possible
extensions, IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734–
749, June 2005.
[5] Y. Shi, M. Larson, and A. Hanjalic, “Collaborative filtering beyond the
user-item matrix: A survey of the state of the art and future challenges,
ACM Computing Surveys, vol. 47, no. 1, p. p. 3, May 2014.
[6] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based rec-
ommender system: A survey and new perspectives, ACM Computing
Surveys, vol. 52, no. 1, p. p. 5, February 2019.
[7] W. Zhao, B. Wang, M. Yang, J. Ye, Z. Zhao, X. Chen, and Y. Shen,
“Leveraging long and short-term information in content-aware movie
recommendation via adversarial training, IEEE Trans. Syst. Man
Cybern., vol. 50, no. 11, pp. 4680–4693, November 2020.
[8] F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He, “Policy gradients for
contextual recommendations, in Proc. WWW, 2019, pp. 1421–1431.
[9] L. Huang, M. Fu, F. Li, H. Qu, Y. Liu, and W. Chen, A deep
reinforcement learning based long-term recommender system,Knowl.-
Based Syst., vol. 213, p. 106706, February 2021.
[10] L. Ji, Q. Qin, B. Han, and H. Yang, “Reinforcement learning to
optimize lifetime value in cold-start recommendation, in Proc. CIKM,
2021, pp. 782–791.
[11] D. Zhao, L. Zhang, B. Zhang, L. Zheng, Y. Bao, and W. Yan, “Mahrl:
Multi-goals abstraction based deep hierarchical reinforcement learning
for recommendations, in Proc. SIGIR, 2020, pp. 871–880.
[12] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, and X. Xie, A reinforce-
ment learning framework for explainable recommendation, in Proc.
IEEE Int. Conf. Data Mining (ICDM), 2018, pp. 587–596.
[13] Y. Cao, X. Chen, L. Yao, X. Wang, and W. E. Zhang, Adversarial
attacks and detection on reinforcement learning-based interactive rec-
ommender systems, in Proc. SIGIR, 2020, pp. 1669–1672.
[14] E. O. Neftci and B. B. Averbeck, “Reinforcement learning in artificial
and biological systems,Nat. Mach. Intell., vol. 1, pp. 133–143, March
2019.
[15] H. Li, D. Liu, and D. Wang, “Manifold regularized reinforcement
learning, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp.
932–943, April 2018.
[16] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,
“Deep reinforcement learning: A brief survey, IEEE Signal Proc.
Mag., vol. 34, no. 6, pp. 26–38, November 2017.
[17] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and
Z. Li, “Drn: A deep reinforcement learning framework for news
recommendation, in Proc. WWW, 2018, pp. 167–176.
[18] D. Zha, L. Feng, B. Bhushanam, D. Choudhary, J. Nie, Y. Tian, J. Chae,
Y. Ma, A. Kejariwal, and X. Hu, Autoshard: Automated embedding
table sharding for recommender systems, in Proc. 28th ACM SIGKDD
Int. Conf. Knowl. Discovery Data Mining, 2022, pp. 4461–4471.
[19] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G.
Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman,
N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hass-
abis, K. Kavukcuoglu, and T. Graepel, “Human-level performance in
3d multiplayer games with population-based reinforcement learning,
Science, vol. 364, no. 6443, pp. 859–865, May 2019.
[20] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in
robotics: A survey, Artif. Intell., vol. 32, no. 11, pp. 1238–1274,
September 2013.
[21] L. Zou, L. Xia, Y. Gu, X. Zhao, W. Liu, J. X. Huang, and D. Yin,
“Neural interactive collaborative filtering, in Proc. SIGIR, 2020, pp.
749–758.
[22] Q. Liu, S. Tong, C. Liu, H. Zhao, E. Chen, H. Ma, and S. Wang,
“Exploiting cognitive structure for adaptive learning, in Proc. 25th
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2019, pp.
627–635.
[23] S. Ji, Z. Wang, T. Li, and Y. Zheng, “Spatio-temporal feature fusion for
dynamic taxi route recommendation via deep reinforcement learning,
Knowl.-Based Syst., vol. 205, p. 106302, October 2020.
[24] H. Lee, D. Hwang, K. Min, and J. Choo, “Towards validating long-
term user feedbacks in interactive recommendation systems, in Proc.
SIGIR, 2022, pp. 2607–2611.
[25] K. Wang, Z. Zou, Q. Deng, R. Wu, J. Tao, C. Fan, L. Chen, and P. Cui,
“Reinforcement learning with a disentangled universal value function
for item recommendation, in Proc. AAAI, 2021, pp. 4427–4435.
[26] Y. Xian, Z. Fu, S. Muthukrishnan, G. de Melo, and Y. Zhang,
“Reinforcement knowledge graph reasoning for explainable recommen-
dation, in Proc. SIGIR, 2019, pp. 285–294.
[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning, in Proc. ICLR, 2016, pp. 1–14.
[28] K. Zhao, X. Wang, Y. Zhang, L. Zhao, Z. Liu, C. Xing, and X. Xie,
“Leveraging demonstrations for reinforcement recommendation reason-
ing over knowledge graphs, in Proc. SIGIR, 2020, pp. 239–248.
[29] H. Wang, F. Zhang, X. Xie, and M. Guo, “Dkn: Deep knowledge-
aware network for news recommendation, in Proc. WWW, 2018, pp.
1835–1844.
[30] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr:
Bayesian personalized ranking from implicit feedback, in Proc. UAI,
2009, pp. 452–461.
[31] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, and M. Guo,
“Ripplenet: Propagating user preferences on the knowledge graph for
recommender systems, in Proc. CIKM, 2018, pp. 417–426.
[32] L. Zhang, Z. Sun, J. Zhang, Y. Wu, and Y. Xia, “Conversation-based
adaptive relational translation method for next poi recommendation
with uncertain check-ins, IEEE Trans. Neural Netw. Learn. Syst., pp.
1–14, February 2022.
[33] F. Zhou, R. Yin, K. Zhang, G. Trajcevski, T. Zhong, and J. Wu,
Adversarial point-of-interest recommendation, in Proc. WWW, 2019,
pp. 3462–3468.
[34] Z. Fu, L. Yu, and X. Niu, “Trace: Travel reinforcement recommenda-
tion based on location-aware context extraction, ACM Trans. Knowl.
Discovery Data, vol. 16, no. 4, pp. 1–22, August 2022.
[35] Y. Sun, F. Zhuang, H. Zhu, Q. He, and H. Xiong, “Cost-effective
and interpretable job skill recommendation with deep reinforcement
learning, in Proc. WWW, 2021, pp. 3827–3838.
[36] Y. Wang, A hybrid recommendation for music based on reinforcement
learning, in Proc. PAKDD, 2020, pp. 91–103.
[37] P. Wei, S. Xia, R. Chen, J. Qian, C. Li, and X. Jiang, A deep-
reinforcement-learning-based recommender system for occupant-driven
energy optimization in commercial buildings, IEEE Internet Things J.,
vol. 7, no. 7, pp. 6402–6413, July 2020.
[38] X. He, B. An, Y. Li, H. Chen, R. Wang, X. Wang, R. Yu, X. Li, and
Z. Wang, “Learning to collaborate in multi-module recommendation via
multi-agent reinforcement learning without communication, in Proc.
ACM Conf. Rec. Syst., 2020, pp. 210–219.
[39] G. Ke, H.-L. Du, and Y.-C. Chen, “Cross-platform dynamic goods
recommendation system based on reinforcement learning and social
networks, Appl. Soft Comput., vol. 104, p. 107213, June 2021.
[40] J. O, J. Lee, J. W. Lee, and B.-T. Zhang, Adaptive stock trading with
dynamic asset allocation using reinforcement learning, Inf. Sci., vol.
176, no. 15, pp. 2121–2147, August 2006.
[41] J. Zhang, B. Hao, B. Chen, C. Li, H. Chen, and J. Sun, “Hierarchical
reinforcement learning for course recommendation in moocs, in Proc.
AAAI, 2019, pp. 435–442.
[42] Y. Lin, F. Lin, W. Zeng, J. Xiahou, L. Li, P. Wu, Y. Liu, and
C. Miao, “Hierarchical reinforcement learning with dynamic recurrent
mechanism for course recommendation, Knowl.-Based Syst., vol. 244,
p. 108546, May 2022.
[43] L. Wang, W. Zhang, X. He, and H. Zha, “Supervised reinforcement
learning with recurrent neural network for dynamic treatment recom-
mendation, in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2018, pp. 2447–2456.
[44] Z. Zheng, C. Wang, T. Xu, D. Shen, P. Qin, X. Zhao, B. Huai, X. Wu,
and E. Chen, “Interaction-aware drug package recommendation via
policy gradient, ACM Trans. Inf. Sys., pp. 1–32, February 2022.
[45] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and
S. A. Murphy, “Informing sequential clinical decision-making through
reinforcement learning: an empirical study, Mach. learn., vol. 84,
no. 1, pp. 109–136, July 2011.
[46] M. M. Afsar, T. Crump, and B. H. Far, “Reinforcement learning based
recommender systems: A survey, ACM Comput. Surv., pp. 1–37, June
2022.
[47] X. Chen, L. Yao, J. Mcauley, G. Zhou, and X. Wang, A survey of deep
reinforcement learning in recommender systems: A systematic review
and future directions, ArXiv Preprint ArXiv:2109.03540v1, 2021.
[48] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis,
“Optimal and autonomous control using reinforcement learning: A
survey, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp.
2042–2062, June 2018.
[49] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
2nd ed. Massachusetts Ave, MA: MIT, 2018.
[50] C. J. Watkins and P. Dayan, “Technical note q-learning, Mach. Learn.,
vol. 8, no. 3, pp. 279–292, May 1992.
[51] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforce-
ment learning, ArXiv Preprint ArXiv:1312.5602, 2013.
[52] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning, Mach. Learn., vol. 8, no. 3, pp.
229–256, May 1992.
[53] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy
gradient methods for reinforcement learning with function approxima-
tion, in Proc. NIPS, 2000, pp. 1057–1063.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms, ArXiv Preprint
ArXiv:1707.06347, 2017.
[55] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust
region policy optimization, in Proc. ICML, 2015, pp. 1889–1897.
[56] S. Levine and V. Koltun, “Guided policy search, in Proc. ICML, 2013,
pp. 1–9.
[57] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J.
Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[58] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
reinforcement learning,” in Proc. ICML, 2016, pp. 1928–1937.
[59] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor, in Proc. ICML, 2018, pp. 1856–1865.
[60] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-
miller, “Deterministic policy gradient algorithms, in Proc. ICML,
2014, pp. 387–395.
[61] S. Adam, L. Busoniu, and R. Babuska, “Experience replay for real-
time reinforcement learning control, IEEE Trans. Syst. Man Cybern.,
vol. 42, no. 2, pp. 201–212, March 2012.
[62] X. Xin, A. Karatzoglou, I. Arapakis, and J. M. Jose, “Self-supervised
reinforcement learning for recommender systems, in Proc. SIGIR,
2020, pp. 931–940.
[63] F. Liu, H. Guo, X. Li, R. Tang, Y. Ye, and X. He, “End-to-end
deep reinforcement learning based recommendation with supervised
embedding, in Proc. WSDM, 2020, pp. 384–392.
[64] N. Taghipour, A. Kardan, and S. S. Ghidary, “Usage-based web
recommendations: A reinforcement learning approach, in Proc. ACM
Conf. Rec. Syst., 2007, pp. 113–120.
[65] Y. Zhang, C. Zhang, and X. Liu, “Dynamic scholarly collaborator
recommendation via competitive multi-agent reinforcement learning,
in Proc. ACM Conf. Rec. Syst., 2017, pp. 331–335.
[66] T. Mahmood and F. Ricci, “Learning and adaptivity in interactive
recommender systems, in Proc. ICEC, 2007, pp. 75–84.
[67] R. Gao, H. Xia, J. Li, D. Liu, S. Chen, and G. Chun, “Drcgr: Deep
reinforcement learning framework incorporating cnn and gan-based for
interactive recommendation, in Proc. IEEE Int. Conf. Data Mining
(ICDM), 2019, pp. 1048–1053.
[68] Y. Lei and W. Li, “Interactive recommendation with user-specific deep
reinforcement learning, ACM Trans. Knowl. Discovery Data, vol. 13,
no. 6, p. 61, October 2019.
[69] L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin, “Reinforcement
learning to optimize long-term user engagement in recommender
systems, in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2019, pp. 2810–2818.
[70] L. Zou, L. Xia, P. Du, Z. Zhang, T. Bai, W. Liu, J.-Y. Nie, and D. Yin,
“Pseudo dyna-q: A reinforcement learning framework for interactive
recommendation, in Proc. WSDM, 2020, pp. 816–824.
[71] S. Zhou, X. Dai, H. Chen, W. Zhang, K. Ren, R. Tang, X. He,
and Y. Yu, “Interactive recommender system via knowledge graph-
enhanced reinforcement learning, in Proc. SIGIR, 2020, pp. 179–188.
[72] R. Zhang, T. Yu, Y. Shen, H. Jin, and C. Chen, “Text-based interactive
recommendation via constraint-augmented reinforcement learning,” in
Proc. NIPS, 2019, pp. 15214–15224.
[73] H. Chen, X. Dai, H. Cai, W. Zhang, X. Wang, R. Tang, Y. Zhang, and
Y. Yu, “Large-scale interactive recommendation with tree-structured
policy gradient, in Proc. AAAI, 2019, pp. 3312–3320.
[74] W. Liu, F. Liu, R. Tang, B. Liao, G. Chen, and P. A. Heng, “Balancing
between accuracy and fairness for interactive recommendation with
reinforcement learning, in Proc. Pacific-Asia Conf. Knowl. Discovery
Data Mining, 2020, pp. 155–167.
[75] T. Xiao and D. Wang, “A general offline reinforcement learning
framework for interactive recommendation,” in Proc. AAAI, 2021.
[76] T. Yu, Y. Shen, R. Zhang, X. Zeng, and H. Jin, “Vision-language
recommendation via attribute augmented multimodal reinforcement
learning,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 39–47.
[77] F. Liu, R. Tang, X. Li, W. Zhang, Y. Ye, H. Chen, H. Guo, Y. Zhang,
and X. He, “State representation modeling for deep reinforcement
learning based recommendation, Knowl.-Based Syst., vol. 205, p.
106170, October 2020.
[78] M. S. Llorente and S. E. Guerrero, “Increasing retrieval quality in
conversational recommenders,” IEEE Trans. Knowl. Data Eng., vol. 24,
no. 10, pp. 1876–1888, October 2012.
[79] T. Mahmood and F. Ricci, “Improving recommender systems with
adaptive conversational strategies, in Proc. HT, 2009, pp. 73–82.
[80] Y. Wu, C. Macdonald, and I. Ounis, “Partially observable reinforcement
learning for dialog-based interactive recommendation, in Proc. ACM
Conf. Rec. Syst., 2021, pp. 241–251.
[81] D. Tsumita and T. Takagi, “Dialogue based recommender sys-
tem that flexibly mixes utterances and recommendations, in Proc.
IEEE/WIC/ACM Int. Conf. Web Intelligence, 2019, pp. 51–58.
[82] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, and T.-S. Chua,
“Interactive path reasoning on graph for conversational recommenda-
tion, in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2020, pp. 2073–2083.
[83] Y. Deng, Y. Li, F. Sun, B. Ding, and W. Lam, “Unified conversational
recommendation policy learning via graph-based reinforcement learn-
ing, in Proc. SIGIR, 2021, pp. 1431–1441.
[84] Y. Sun and Y. Zhang, “Conversational recommender system, in Proc.
SIGIR, 2018, pp. 235–244.
[85] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M.-Y. Kan, and T.-S.
Chua, “Estimation-action-reflection: Towards deep interaction between
conversational and recommender systems, in Proc. WSDM, 2020, pp.
304–312.
[86] X. Ren, H. Yin, T. Chen, H. Wang, N. Q. V. Hung, Z. Huang,
and X. Zhang, “Crsal: Conversational recommender systems with
adversarial learning, ACM Trans. Inf. Sys., vol. 38, no. 4, pp. 1–40,
October 2020.
[87] A. Montazeralghaem and J. Allan, “Extracting relevant information
from user’s utterances in conversational search and recommendation,
in Proc. 28th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2022, pp. 1275–1283.
[88] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, “Recom-
mendations with negative feedback via pairwise deep reinforcement
learning, in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2018, pp. 1040–1048.
[89] D. Hong, Y. Li, and Q. Dong, “Nonintrusive-sensing and
reinforcement-learning based adaptive personalized music recommen-
dation, in Proc. SIGIR, 2020, pp. 1721–1724.
[90] O. Moling, L. Baltrunas, and F. Ricci, “Optimal radio channel recom-
mendations with explicit and implicit feedback, in Proc. ACM Conf.
Rec. Syst., 2012, pp. 75–82.
[91] P. Wang, Y. Fan, L. Xia, W. X. Zhao, S. Niu, and J. Huang, “Kerl: A
knowledge-guided reinforcement learning model for sequential recom-
mendation, in Proc. SIGIR, 2020, pp. 209–218.
[92] Y. Lin, S. Feng, F. Lin, W. Zeng, Y. Liu, and P. Wu, “Adaptive course
recommendation in moocs,” Knowl.-Based Syst., vol. 224, p. 107085,
July 2021.
[93] X. Zhao, L. Xia, L. Zou, H. Liu, D. Yin, and J. Tang, “Whole-chain
recommendations,” in Proc. CIKM, 2020, pp. 1883–1891.
[94] S. Antaris and D. Rafailidis, “Sequence adaptation via reinforcement
learning in recommender systems, in Proc. ACM Conf. Rec. Syst.,
2021, pp. 714–718.
[95] Y. Lu, R. Dong, and B. Smyth, “Why i like it: Multi-task learning
for recommendation and explanation, in Proc. ACM Conf. Rec. Syst.,
2018, pp. 4–12.
[96] S. Tao, R. Qiu, Y. Ping, and H. Ma, “Multi-modal knowledge-
aware reinforcement learning network for explainable recommenda-
tion, Knowl.-Based Syst., vol. 227, p. 107217, September 2021.
[97] S.-J. Park, D.-K. Chae, H.-K. Bae, S. Park, and S.-W. Kim, “Reinforce-
ment learning over sentiment-augmented knowledge graphs towards
accurate and explainable recommendation, in Proc. WSDM, 2022, pp.
784–793.
[98] D. Liu, J. Lian, Z. Liu, X. Wang, G. Sun, and X. Xie, “Reinforced
anchor knowledge graph generation for news recommendation reason-
ing, in Proc. 27th ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2021, pp. 1055–1065.
[99] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning, in Proc. AAAI, 2016, pp. 2094–2100.
[100] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and
N. de Freitas, “Dueling network architectures for deep reinforcement
learning,” in Proc. ICML, 2016, pp. 1995–2003.
[101] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,
in Proc. NIPS, 2014, pp. 2672–2680.
[102] C. Gao, W. Lei, X. He, M. de Rijke, and T.-S. Chua, “Advances and
challenges in conversational recommender systems: A survey,” ArXiv
Preprint ArXiv:2101.09459, 2021.
[103] C. Hu, S. Huang, Y. Zhang, and Y. Liu, “Learning to infer user implicit
preference in conversational recommendation, in Proc. SIGIR, 2022,
pp. 256–266.
[104] S. Rendle, “Factorization machines, in Proc. IEEE Int. Conf. Data
Mining (ICDM), 2010, pp. 995–1000.
[105] A. Schwartz, “A reinforcement learning method for maximizing undis-
counted rewards,” in Proc. ICML, 1993, pp. 298–305.
[106] X. Wang, K. Liu, D. Wang, L. Wu, Y. Fu, and X. Xie, “Multi-level
recommendation reasoning over knowledge graphs with reinforcement
learning, in Proc. WWW, 2022, pp. 2098–2108.
[107] Y. Zhang and X. Chen, “Explainable recommendation: A survey and
new perspectives, Foundations and Trends in Information Retrieval,
vol. 14, no. 1, pp. 1–101, March 2020.
[108] W. Shang, Y. Yu, Q. Li, Z. Qin, Y. Meng, and J. Ye, “Environment
reconstruction with hidden confounders for reinforcement learning
based recommendation,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl.
Discovery Data Mining, 2019, pp. 566–576.
[109] B. Shi, M. G. Ozsoy, N. Hurley, B. Smyth, E. Z. Tragos, J. Geraci,
and A. Lawlor, “Pyrecgym: A reinforcement learning gym for recom-
mender systems, in Proc. ACM Conf. Rec. Syst., 2019, pp. 491–495.
[110] X. Wang, Y. Xu, X. He, Y. Cao, M. Wang, and T.-S. Chua, “Reinforced
negative sampling over knowledge graph for recommendation,” in Proc.
WWW, 2020, pp. 99–109.
[111] X. Xin, A. Karatzoglou, I. Arapakis, and J. M. Jose, “Supervised
advantage actor-critic for recommender systems, in Proc. WSDM,
2022, pp. 1186–1196.
[112] J. Ding, Y. Quan, X. He, Y. Li, and D. Jin, “Reinforced negative
sampling for recommendation with exposure data, in Proc. IJCAI,
2019, pp. 2230–2236.
[113] J. Zhao, H. Li, L. Qu, Q. Zhang, Q. Sun, H. Huo, and M. Gong,
“Dcfgan: An adversarial deep reinforcement learning framework with
improved negative sampling for session-based recommender systems,
Inf. Sci., vol. 596, pp. 222–235, June 2022.
[114] Y. Lei, Z. Wang, W. Li, H. Pei, and Q. Dai, “Social attentive deep q-
networks for recommender systems, IEEE Trans. Knowl. Data Eng.,
p. 99, July 2020.
[115] Z. Lu, M. Gao, X. Wang, J. Zhang, H. Ali, and Q. Xiong, “Srrl:
Select reliable friends for social recommendation with reinforcement
learning, in Proc. Int. Conf. Neural Inf. Process., 2019, pp. 631–642.
[116] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforce-
ment learning, in Proc. ICML, 2004, pp. 1–8.
[117] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement
learning,” in Proc. ICML, 2000, pp. 663–670.
[118] D. Massimo and F. Ricci, “Harnessing a generalised user behaviour model
for next-poi recommendation,” in Proc. ACM Conf. Rec. Syst., 2018,
pp. 402–406.
[119] M. Babes, V. Marivate, K. Subramanian, and M. L. Littman, “Appren-
ticeship learning about multiple intentions,” in Proc. ICML, 2011, pp.
897–904.
[120] H. Liang, “Drprofiling: Deep reinforcement user profiling for rec-
ommendations in heterogenous information networks, IEEE Trans.
Knowl. Data Eng., p. 99, May 2020.
[121] Y. Gong, Y. Zhu, L. Duan, Q. Liu, Z. Guan, F. Sun, W. Ou, and K. Q.
Zhu, “Exact-k recommendation via maximal clique optimization, in
Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2019, pp. 617–626.
[122] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from obser-
vation, in Proc. IJCAI, 2018, pp. 4950–4957.
[123] C. Gao, K. Xu, K. Zhou, L. Li, X. Wang, B. Yuan, and P. Zhao, “Value
penalized q-learning for recommender systems, in Proc. SIGIR, 2022,
pp. 2008–2012.
[124] M. Preda and D. Popescu, “Personalized web recommendations: sup-
porting epistemic information about end-users, in Proc. WI, 2005, pp.
692–695.
[125] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, “Generative ad-
versarial user model for reinforcement learning based recommendation
system, in Proc. ICML, 2019, pp. 1052–1061.
[126] X. Chen, L. Yao, A. Sun, X. Wang, X. Xu, and L. Zhu, “Generative
inverse deep reinforcement learning for online recommendation, in
Proc. CIKM, 2021, pp. 201–210.
[127] C. Pei, X. Yang, Q. Cui, X. Lin, F. Sun, P. Jiang, W. Ou, and Y. Zhang,
“Value-aware recommendation based on reinforcement profit maxi-
mization,” in Proc. WWW, 2019, pp. 3123–3129.
[128] X. Zhao, C. Gu, H. Zhang, X. Yang, X. Liu, H. Liu, and J. Tang,
“Dear: Deep reinforcement learning for online advertising impression
in recommender systems,” in Proc. AAAI, 2021.
[129] L. Zou, L. Xia, Z. Ding, D. Yin, J. Song, and W. Liu, “Reinforcement
learning to diversify top-n recommendation, in International Confer-
ence on Database Systems for Advanced Applications, 2019, pp. 104–
120.
[130] D. Precup, R. S. Sutton, and S. Dasgupta, “Off-policy temporal
difference learning with function approximation,” in Proc. ICML, 2001,
pp. 417–424.
[131] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare, “Safe
and efficient off-policy reinforcement learning, in Proc. NIPS, 2016,
pp. 1054–1062.
[132] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi,
“Top-k off-policy correction for a reinforce recommender system, in
Proc. WSDM, 2019, pp. 456–464.
[133] J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H.
Chi, “Off-policy learning in two-stage recommender systems, in Proc.
WWW, 2020, pp. 463–473.
[134] D. G. Horvitz and D. J. Thompson, “A generalization of sampling
without replacement from a finite universe,” J. Am. Stat. Assoc., vol. 47,
no. 260, pp. 663–685, April 1952.
[135] X. Bai, J. Guan, and H. Wang, “A model-based reinforcement learning
with adversarial training for online recommendation,” in Proc. NIPS,
2019, pp. 10735–10746.
[136] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “A peep into the future:
Adversarial future encoding in recommendation,” in Proc. WSDM,
2022, pp. 1177–1185.
[137] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey
of multiagent reinforcement learning,” IEEE Trans. Syst. Man Cybern.,
vol. 38, no. 2, pp. 156–172, March 2008.
[138] Z. Du, N. Yang, Z. Yu, and P. Yu, “Learning from atypical behavior:
Temporary interest aware recommendation based on reinforcement
learning, IEEE Trans. Knowl. Data Eng., pp. 1–13, January 2022.
[139] Y. Z. Wei, L. Moreau, and N. R. Jennings, “Learning users’ interests
by quality classification in market-based recommender systems, IEEE
Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1678–1688, December
2005.
[140] W. Zhang, H. Liu, H. Xiong, T. Xu, F. Wang, H. Xin, and H. Wu,
“Rlcharge: Imitative multi-agent spatiotemporal reinforcement learning
for electric vehicle charging station recommendation, IEEE Trans.
Knowl. Data Eng., pp. 1–14, May 2022.
[141] W. Zhang, H. Liu, F. Wang, T. Xu, H. Xin, D. Dou, and H. Xiong,
“Intelligent electric vehicle charging recommendation based on multi-
agent reinforcement learning, in Proc. WWW, 2021, pp. 1856–1867.
[142] R. Parr and S. J. Russell, “Reinforcement learning with hierarchies of
machines, in Proc. NIPS, 1997, pp. 1043–1049.
[143] T. G. Dietterich, “Hierarchical reinforcement learning with the maxq
value function decomposition, J. Artif. Intell. Res., vol. 13, no. 1, pp.
227–303, November 2000.
[144] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “Hierarchical reinforce-
ment learning for integrated recommendation, in Proc. AAAI, 2021.
[145] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps:
A framework for temporal abstraction in reinforcement learning, Artif.
Intell., vol. 112, no. 1, pp. 181–211, August 1999.
[146] L. Wang, R. Tang, X. He, and X. He, “Hierarchical imitation learning
via subgoal representation learning for dynamic treatment recommen-
dation, in Proc. WSDM, 2022, pp. 1081–1089.
[147] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh, “Personalized
ad recommendation systems for life-time value optimization with
guarantees, in Proc. IJCAI, 2015, pp. 1806–1812.
[148] G. Theocharous, P. Thomas, and M. Ghavamzadeh, “Ad recommenda-
tion systems for life-time value optimization,” in Proc. WWW, 2015,
pp. 1305–1310.
[149] F. Liu, R. Tang, H. Guo, X. Li, Y. Ye, and X. He, “Top-aware
reinforcement learning based recommendation, Neurocomputing, vol.
417, pp. 255–269, December 2020.
[150] H. Liu, Z. Sun, X. Qu, and F. Yuan, “Top-aware recommender
distillation with deep reinforcement learning, Inf. Sci., vol. 576, pp.
642–657, October 2021.
[151] M. Chen, B. Chang, C. Xu, and E. H. Chi, “User response models to
improve a reinforce recommender system, in Proc. WSDM, 2021, pp.
121–129.
[152] Z. Xu and U. Topcu, “Transfer of temporal logic formulas in reinforce-
ment learning, in Proc. IJCAI, 2019, pp. 4010–4018.
[153] A. Tirinzoni, A. Sessa, M. Pirotta, and M. Restelli, “Importance
weighted transfer of samples in reinforcement learning,” in Proc. ICML,
2018, pp. 4936–4945.
[154] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and
D. Meger, “Deep reinforcement learning that matters, in Proc. AAAI,
2017, pp. 3207–3214.
[155] J. Welborn, M. Schaarschmidt, and E. Yoneki, “Learning index selec-
tion with structured action spaces, ArXiv Preprint ArXiv:1909.07440,
2019.
[156] Y. Xu, L. Qin, X. Liu, J. Xie, and S.-C. Zhu, “A causal and-or graph
model for visibility fluent reasoning in tracking interacting objects,” in
Proc. CVPR, 2018, pp. 2178–2187.
[157] R. Powers and Y. Shoham, “New criteria and a new algorithm for
learning in multi-agent systems, in Proc. NIPS, 2004, pp. 1089–1096.
[158] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola, “Meta-q-learning,”
in Proc. ICLR, 2020.
[159] Q. Zhang, J. Liu, Y. Dai, Y. Qi, Y. Yuan, K. Zheng, F. Huang,
and X. Tan, “Multi-task fusion via reinforcement learning for long-
term user satisfaction in recommender systems, in Proc. 28th ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2022, pp. 4510–
4520.
[160] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “Explore, filter and
distill: Distilled reinforcement learning in recommendation, in Proc.
CIKM, 2021, pp. 4243–4252.
[161] M. Fu, A. Agrawal, A. A. Irissappane, J. Zhang, L. Huang, and
H. Qu, “Deep reinforcement learning framework for category-based
item recommendation, IEEE Trans. Cybern., pp. 1–14, August 2021.
[162] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves,
V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg,
“Noisy networks for exploration, in Proc. ICLR, 2018.
[163] M. Kunaver and T. Požrl, “Diversity in recommender systems – A survey,”
Knowl.-Based Syst., vol. 123, pp. 154–162, May 2017.
[164] Y. Ge, X. Zhao, L. Yu, S. Paul, D. Hu, C.-C. Hsieh, and Y. Zhang,
“Toward pareto efficient fairness-utility trade-off in recommendation
through reinforcement learning, in Proc. WSDM, 2022, pp. 316–324.
[165] C. Lücken, B. Barán, and C. Brizuela, “A survey on multi-objective
evolutionary algorithms for many-objective problems,” Comput. Optim.
Appl., vol. 58, no. 3, pp. 707–756, February 2014.
[166] X. Chen, Y. Du, L. Xia, and J. Wang, “Reinforcement recommendation
with user multi-aspect preference, in Proc. WWW, 2021, pp. 425–435.
[167] D. Stamenkovic, A. Karatzoglou, I. Arapakis, X. Xin, and K. Katevas,
“Choosing the best of both worlds: Diverse and novel recommendations
through multi-objective reinforcement learning,” in Proc. WSDM, 2022,
pp. 957–965.
[168] R. Xie, Y. Liu, S. Zhang, R. Wang, F. Xia, and L. Lin, “Personalized
approximate pareto-efficient recommendation, in Proc. WWW, 2021,
pp. 3839–3849.
[169] Y. Ge, S. Liu, R. Gao, Y. Xian, Y. Li, X. Zhao, C. Pei, F. Sun, J. Ge,
W. Ou, and Y. Zhang, “Towards long-term fairness in recommenda-
tion, in Proc. WSDM, 2021, pp. 445–453.
[170] D. Li, X. Li, J. Wang, and P. Li, “Video recommendation with multi-
gate mixture of experts soft actor critic, in Proc. SIGIR, 2020, pp.
1553–1556.
[171] E. Puiutta and E. M. S. P. Veith, “Explainable reinforcement learning:
A survey, in International Cross-Domain Conference for Machine
Learning and Knowledge Extraction, 2020, pp. 77–95.
[172] P. Wu, H. Li, Y. Deng, W. Hu, Q. Dai, Z. Dong, J. Sun, R. Zhang, and
X.-H. Zhou, “On the opportunity of causal learning in recommendation
systems: Foundation, estimation, prediction and challenges, in Proc.
IJCAI, 2022, pp. 1–8.
[173] C.-Y. Tai, L.-Y. Huang, C.-K. Huang, and L.-W. Ku, “User-centric path
reasoning towards explainable recommendation,” in Proc. SIGIR, 2021,
pp. 879–889.
[174] Y. Xiao, L. Xiao, X. Lu, H. Zhang, S. Yu, and H. V. Poor, “Deep-
reinforcement-learning-based user profile perturbation for privacy-
aware recommendation,” IEEE Internet Things J., vol. 8, no. 6,
pp. 4560–4568, March 2021.
[175] W. Fan, T. Derr, X. Zhao, Y. Ma, H. Liu, J. Wang, J. Tang, and Q. Li,
“Attacking black-box recommendations via copying cross-domain user
profiles,” in Proc. IEEE 37th Int. Conf. Data Engineering, 2021, pp.
1583–1594.
[176] J. Garcia and F. Fernandez, “A comprehensive survey on safe reinforce-
ment learning, J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480,
August 2015.
[177] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:
Challenges, methods, and future directions, IEEE Signal Proc. Mag.,
vol. 37, no. 3, pp. 50–60, May 2020.
[178] W. Huang, J. Liu, T. Li, T. Huang, S. Ji, and J. Wan, “Feddsr: Daily
schedule recommendation in a federated deep reinforcement learning
framework, IEEE Trans. Knowl. Data Eng., pp. 1–1, November 2021.