IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, ACCEPTED 17 MAY 2023. 1
A Survey on Reinforcement Learning for
Recommender Systems
Yuanguo Lin¹, Yong Liu¹, Fan Lin, Lixin Zou, Pengcheng Wu, Wenhua Zeng, Huanhuan Chen, Chunyan Miao
Abstract—Recommender systems have been widely applied in different real-life scenarios to help us find useful information. In particular, Reinforcement Learning (RL) based recommender systems have become an emerging research topic in recent years, owing to their interactive nature and autonomous learning ability. Empirical results show that RL-based recommendation methods often surpass supervised learning methods. Nevertheless, there are various challenges in applying RL to recommender systems. To understand these challenges and the relevant solutions, there should be a reference for researchers and practitioners working on RL-based recommender systems. To this end, we first provide a thorough overview, comparison, and summarization of RL approaches applied in four typical recommendation scenarios: interactive recommendation, conversational recommendation, sequential recommendation, and explainable recommendation. Furthermore, we systematically analyze the challenges and relevant solutions on the basis of existing literature. Finally, by discussing open issues of RL and its limitations in recommender systems, we highlight some potential research directions in this field.
Index Terms—Reinforcement learning, Recommender systems,
Interactive recommendation, Policy gradient.
I. INTRODUCTION
PERSONALIZED recommender systems [1]–[3] are competent to provide interesting information that matches users' preferences, thereby alleviating the information overload problem. Recommendation technologies usually make use of various kinds of information to suggest potential items to users. To this end, early recommendation research primarily focused on developing content-based and collaborative filtering-based methods [4], [5]. Recently, motivated by the rapid development of deep learning, various neural recommendation methods have been developed [6]. However, modeling such information alone is not enough. In real-world scenarios, the recommender system suggests items according to the user-item interaction history and then receives user feedback to make further recommendations [7], [8]. In other words, the recommender system aims to learn users' preferences from the interactions and recommend items that users may be interested in. Nevertheless, existing recommendation methods (e.g., supervised learning) usually ignore the interactions between a user and the recommendation model. They do not effectively capture the user's timely feedback to update the recommendation model, thus leading to unsatisfactory recommendation results.

Y. Lin is with the School of Computer Engineering, Jimei University, and the School of Informatics, Xiamen University, China. Email: xd-
F. Lin and W. Zeng are with the School of Informatics, Xiamen University, China.
Y. Liu and P. Wu are with the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore. Email: [email protected] and [email protected].
C. Miao is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. Email: [email protected].
L. Zou is with the School of Cyber Science and Engineering, Wuhan University, China. Email: [email protected].
H. Chen is with the School of Computer Science and Technology, University of Science and Technology of China, China. Email: [email protected].
Corresponding author.
¹Co-first authors.
In general, the recommendation task could be modeled as
an interactive process, i.e., the user is recommended an item and then provides feedback (e.g., skip, click, or purchase) to the recommendation model. In the next interaction, the
recommendation model learns the optimal policy from the
user’s explicit/implicit feedback and recommends a new item
to the user. From the user’s point of view, an efficient interac-
tion means helping users find their favorite items as soon as
possible. The interactive recommendation approach has been
applied in real-world recommendation tasks. However, it often
suffers from some problems, e.g., cold-start [9], [10], data
sparsity [11], interpretability [12] and safety [13].
As a machine learning method that focuses on how an
intelligent agent interacts with its environment, Reinforce-
ment Learning (RL) [14], [15] learns the policy by trial
and error search, which is beneficial to sequential decision
making. Hence, it can provide potential solutions to model
the interactions between the user and agent. In particular,
Deep Reinforcement Learning (DRL) [16], the combination
of traditional RL with deep learning methods, is competent
to learn from historical data with enormous state and action
spaces to address large-scale problems. It has powerful representation learning and function approximation capabilities that enable it to be applied across various fields [17], [18], e.g., games [19]
and robotics [20]. Recently, the application of RL to solve
recommendation problems has become a new research trend
in recommender systems [21]–[23]. Specifically, RL enables
the recommender agent to constantly recommend items to
users for learning the optimal recommendation policies. Many
experimental results have demonstrated that RL-based rec-
ommendation methods [24], [25] evidently outperform super-
vised learning methods. For example, as shown in Table I,
RL-based recommendation methods (i.e., PGPR [26], Actor-
Critic [27], and ADAC [28]) consistently perform better than
the supervised learning-based recommendation methods (i.e.,
DKN [29], BPR [30], and RippleNet [31]) on two Amazon
datasets in terms of Hit Ratio (HR) and Normalized Dis-
counted Cumulative Gain (NDCG), especially with a significant margin (p-value < 0.01) on the Clothing dataset.
TABLE I
THE RECOMMENDATION PERFORMANCE OF SUPERVISED LEARNING METHODS (i.e., DKN, BPR, RIPPLENET) AND RL-BASED METHODS (i.e., PGPR,
ACTOR-CRITIC, ADAC) ON TWO AMAZON DATASETS IN TERMS OF HR AND NDCG (%). THE P-VALUE IS FROM A T-TEST OF THE PERFORMANCE DIFFERENCE BETWEEN PGPR AND EACH OF THE OTHER METHODS.
Method | Beauty HR | Beauty NDCG | p-value | Clothing HR | Clothing NDCG | p-value
DKN | 8.673±0.058 | 1.872±0.049 | 0.021 | 1.203±0.088 | 0.300±0.024 | 1.45E-05
BPR | 9.021±0.068 | 2.744±0.045 | 0.036 | 1.820±0.061 | 0.609±0.027 | 6.95E-05
RippleNet | 9.294±0.027 | 2.401±0.036 | 0.040 | 1.882±0.041 | 0.624±0.025 | 8.07E-05
PGPR | 14.559±0.051 | 5.513±0.042 | - | 7.003±0.032 | 2.843±0.030 | -
Actor-Critic | 14.821±0.043 | 5.730±0.051 | 0.912 | 6.924±0.044 | 2.796±0.031 | 0.949
ADAC | 15.856±0.053 | 5.863±0.048 | 0.718 | 7.501±0.022 | 3.010±0.058 | 0.748
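For readers who want to reproduce this kind of significance check, the following is a minimal sketch of how such a p-value can be computed with SciPy; the score arrays are hypothetical per-run HR values, not the actual results behind Table I, and an independent two-sample t-test is assumed.

```python
from scipy import stats

# Hypothetical HR scores over five repeated runs for two methods (not the real Table I data).
pgpr_hr = [14.50, 14.62, 14.55, 14.60, 14.53]
bpr_hr = [9.10, 8.95, 9.05, 9.00, 9.02]

# Two-sample t-test on the per-run scores; a small p-value indicates a significant difference.
t_stat, p_value = stats.ttest_ind(pgpr_hr, bpr_hr)
print(f"t = {t_stat:.3f}, p-value = {p_value:.2e}")
```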
Fig. 1. Distribution of year-wise publications (until October 2022) about
traditional RL- and DRL-based recommendation methods.
In practice, RL-based recommender systems have been applied to many specific scenarios [32]–[37], such as e-commerce [38]–[40], e-learning [41], [42], and health care [43]–[45].
Data collection methodology. There is a growing number of studies on RL-based recommender systems. To search for relevant articles for this analysis, we adopted the following collection rules to include or exclude papers.
Search terms: Our survey involves two keywords: Reinforcement Learning¹ and Recommender Systems. Accordingly, we mainly adopted the following search terms: 'Reinforcement Learning' AND 'Recommender Systems'. To find more research papers, we also used related RL algorithms as search terms (e.g., 'Q-learning', 'Policy Gradient', and 'Actor-Critic') instead of 'Reinforcement Learning', along with 'Recommender Systems'. Similarly, we used the search term 'Recommendation' instead of 'Recommender Systems', along with 'Reinforcement Learning'.
Search sources: We first used Google Scholar to find re-
search papers with these search terms. We then expanded the collection of relevant articles by searching the following academic databases: Science Direct, ACM Portal, Springer
Link, and IEEE Xplore. Finally, we selected 98 related
papers to include in this survey.
Publication type: Only publications on RL-based recommendation from top international conferences and journals were included in our survey.
¹Note that this survey focuses on recommender systems using full RL. Therefore, we did not include bandit methods, which are different from full RL.
The distribution of collected research papers over the years
is shown in Fig. 1. There are a few publications on traditional
RL-based recommendation methods from 2005 to 2017, with
an increase in the number of papers on DRL-based recommendation methods since 2018. The main reason is that DRL algorithms have proved to be satisfactory solutions to some recommendation issues, and they have attracted much attention
from the research community.
Related work. To facilitate the research about RL-based
recommender systems, [46] provides a review of the RL- and
DRL-based algorithms developed for recommendations, and
presents several research directions in top-K recommenda-
tion, application architecture, and evaluation. Besides, [47]
provides an overview of DRL-based recommender systems
mainly according to model-based and model-free algorithms,
and discusses the benefits and drawbacks of DRL-based
recommendations. Nevertheless, it is necessary to make a
more comprehensive overview and analysis of (D)RL-based
recommender systems.
Our contribution. Different from [46] and [47], the main
contributions made in this work are as follows.
1) Comprehensive review: We summarize existing (D)RL
algorithms applied in four typical recommendation scenarios,
i.e., interactive recommendation, conversational recommenda-
tion, sequential recommendation, and explainable recommen-
dation. It could be helpful for readers to understand how
(D)RL algorithms are applied in different recommendation
systems. Moreover, from the RL perspective, the comprehen-
sive survey of RL-based recommender systems follows three
classes of RL algorithms: value-function, policy search, and
Actor-Critic. This taxonomy of the literature is made based on
the fact that these three types of methods have been widely
applied in existing RL-based recommender systems.
2) Systematic analysis: We systematically analyze the
challenges of applying (D)RL in recommender systems and
relevant solutions, including environment construction (e.g.,
the state representation and negative sampling), prior knowl-
edge, reward function definition, learning bias (e.g., the data
or policy bias), and task structuring (i.e., the task of RL can be
decomposed into basic components or a sequence of subtasks).
3) Open directions: To facilitate future progress in this
field, we put forward open issues of RL, analyze practical
challenges of this field, and suggest possible future direc-
tions for the research and application of recommender sys-
tems. The open issues and emerging topics include sampling
efficiency, reproducibility, generalization, evaluation, biases,
interpretability, safety and privacy.

Fig. 2. (a) Markov decision process formulated as the interaction between an agent and its environment [49]. (b) An RL-based recommender system models the interactive recommendation task as a Markov decision process.
The remainder of this paper is organized as follows. Sec-
tion II introduces the background of RL, defines related
concepts, and lists commonly used approaches. Section III
presents a standard formulation of the RL-based recommendation problem. Section IV provides a comprehensive review of
the RL algorithms developed for recommender systems. Then,
Section V discusses the challenges and relevant solutions
of applying RL in recommender systems. Next, Section VI
discusses various limitations and potential research directions
of RL-based recommender systems. Finally, Section VII con-
cludes this study.
II. OVERVIEW OF REINFORCEMENT LEARNING
Different from supervised and unsupervised learning, RL
[48] focuses on goal-directed learning that maximizes the
total reward achieved by an agent when interacting with its
environment. Trial-and-error search and delayed rewards are the two most important characteristics that distinguish RL from the other
types of machine learning methods.
In RL, the agent learns the optimal policy from its interac-
tions with the environment to maximize the total reward for
sequential decision making. As shown in Fig. 2(a), the learning
process of RL can be formulated as a Markov Decision
Process (MDP) [49]. Formally, the MDP is defined as a 5-tuple ⟨S, A, P, R, γ⟩, where S denotes a finite set of states, A denotes a finite set of actions, P denotes the state transition probabilities, R denotes the reward function, and γ ∈ [0, 1] is a discount factor of the reward. At each time step t, the agent receives an environment state S_t ∈ S and selects a corresponding action A_t ∈ A. Then, the agent receives a numerical reward R_{t+1} ∈ R and moves into a new state S_{t+1}. Thus, the MDP forms a sequence τ as follows:

\tau = \{ S_0, A_0, R_1, S_1, A_1, R_2, \cdots, S_T \},    (1)
where T is the maximum time step in a finite MDP. To maximize the return in the long run, the agent tries to select actions so that the cumulative reward it receives over the future is maximized. In this case, we introduce the concept of discounting. In general, the agent selects an action A_t to maximize the discounted return G_t [49]:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{i=t+1}^{T} \gamma^{i-t-1} R_i.    (2)

The discount factor γ affects the return. If γ = 0, the agent only maximizes immediate rewards, which reduces the return in the long run. As γ approaches 1, the agent takes future rewards into account more strongly.
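To make the return in Eq. (2) concrete, the short Python sketch below computes G_t for every step of a finite reward sequence by iterating backwards; the example reward list and γ = 0.9 are illustrative values rather than quantities taken from any surveyed system.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for each step t.

    rewards[k] is interpreted as R_{k+1}, the reward received after step k.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# Example: clicks give reward 1, skips give 0, and a purchase gives 5.
print(discounted_returns([1, 0, 1, 5]))
```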
Based on whether to use models and planning for solving
RL problems, RL algorithms can be classified into two main
groups, i.e., model-free algorithms and model-based algo-
rithms. The model-free algorithms directly learn the policy
without any model of the transition function, whereas the
model-based algorithms employ a learned or pre-determined
model to learn the policy. On the other hand, according to how the agent selects its actions, RL algorithms fall
into the following major groups, i.e., value-function methods,
policy search methods, and Actor-Critic methods.
A. Value-function Approaches
Many traditional RL methods generally achieve a global
optimum return by obtaining the maximal value in terms
of the best action. These methods are called value-function
approaches, which utilize the maximal value to learn the
optimal policy indirectly. Intuitively, the maximal value is
generated by the best action a^* following the optimal policy π^*; that is, the state value under an optimal policy equals the expected return for the best action from that state. This is the Bellman equation for the optimal state-value function v_{π^*}(s), also called the Bellman optimality equation, defined as:

v_{\pi^*}(s) = \max_a \mathbb{E}\big[ R_{t+1} + \gamma v_{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a \big]
            = \max_a \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_{\pi^*}(s') \big],    (3)

where \mathbb{E}[\cdot] denotes the expected value of a random variable obtained by following the optimal policy π^*.
Similarly, the Bellman optimality equation for the action-value function q_{π^*}(s, a) is defined by

q_{\pi^*}(s, a) = \mathbb{E}\big[ R_{t+1} + \gamma \max_{a'} q_{\pi^*}(S_{t+1}, a') \,\big|\, S_t = s, A_t = a \big]
              = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \max_{a'} q_{\pi^*}(s', a') \big].    (4)

Different from the optimal state-value function v_{π^*}(s), q_{π^*}(s, a) explicitly reflects the effect of the best action, which makes it more readily adopted by many algorithms.
In general, the optimal value function is estimated by Dy-
namic Programming (DP), Monte Carlo methods or temporal-
difference (TD) learning, such as Sarsa [49], Q-Learning [50],
and Deep Q-Networks (DQN) [51]. Compared to policy-search methods, value-function approaches often have much lower variance. Nevertheless, they are not suitable for complex application scenarios due to their slow convergence.
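As a concrete illustration of a value-function method, the following is a minimal tabular Q-learning sketch built directly on the Bellman target in Eq. (4); the toy environment interface (reset, step, actions) and the hyperparameters are assumptions made for illustration, not part of any surveyed recommender.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the discrete action set
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in env.actions)
            td_target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```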
B. Policy Search Methods
In contrast to value-function approaches, policy-search methods directly optimize the policies, which are parameterized by a set of policy parameters θ_t. These policy parameters can be updated to maximize the expected return with either gradient-free or gradient-based optimization methods [16]. Among the most popular RL algorithms, gradient-based optimization methods are competent to solve complex problems. To search for the optimal policies, a gradient-based optimization method learns the policy parameters with the gradient of some performance measure J(θ). Formally, the updates approximate gradient ascent in J(θ) by

\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)},    (5)

where \widehat{\nabla J(\theta_t)} denotes a stochastic estimate that approximates the gradient of J(θ_t) with respect to θ_t, and α is the step size that controls the learning rate [49]. Existing policy gradient methods generally follow the gradient-updating strategy in Eq. (5).
Policy gradient theorems provide an expression proportional to the policy gradient, whose expectation can be approximated by Monte Carlo sampling. Thus, the REINFORCE algorithm [52], which adopts the Monte Carlo policy gradient method, can be established by

\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla_\theta \pi(a \mid s, \theta)\, q_\pi(s, a)
               = \mathbb{E}_\pi \Big[ \sum_a \nabla_\theta \pi(a \mid S_t, \theta)\, q_\pi(S_t, a) \Big],    (6)

where the symbol ∝ means "proportional to", the distribution μ(s) is the on-policy distribution under the policy π, and the gradients are column vectors of partial derivatives with respect to the policy parameter θ.
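The sketch below shows a per-episode REINFORCE update corresponding to Eqs. (5) and (6) with a simple linear-softmax policy over discrete actions; the per-action feature representation and the episode format are assumptions introduced for illustration.

```python
import numpy as np

def softmax_policy(theta, state_features):
    """pi(a|s, theta) for a linear-softmax policy; rows of state_features are per-action features."""
    logits = state_features @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (state_features, action_index, reward) tuples from one sampled trajectory."""
    g = 0.0
    # iterate backwards so g accumulates the discounted return G_t
    for state_features, action, reward in reversed(episode):
        g = reward + gamma * g
        probs = softmax_policy(theta, state_features)
        # gradient of log pi(a|s): phi(s,a) - sum_b pi(b|s) phi(s,b)
        grad_log_pi = state_features[action] - probs @ state_features
        theta = theta + alpha * g * grad_log_pi   # stochastic gradient ascent on J(theta)
    return theta
```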
Some advanced algorithms have been proposed to address
the shortcomings of policy-search methods. For example,
policy gradient methods with function approximation [53]
ensure the stability of the algorithms. By adjusting hyperparameters manually or adaptively, Proximal Policy Optimization (PPO) [54] and Trust Region Policy Optimization (TRPO) [55] speed up the convergence of the
algorithms. Moreover, Guided Policy Search (GPS) [56] uti-
lizes the path optimization algorithm to guide the training
process of the policy gradient method, and thereby improves
its efficiency.
C. Actor-Critic Algorithms
There is a set of algorithms that incorporate the advantages of value-function approaches and policy search methods. They attempt to estimate a value function while adopting the policy gradient to search in the policy space. Actor-Critic
[57] is one of the most representative algorithms. It combines
the policy-based method (i.e., the actor) with the value-based
approach (i.e., the critic) to learn the policy and value-function
together. The actor trains the policy according to the value
function of the critic’s feedback, while the critic trains the
value function and uses the TD method to update it in one step. The one-step Actor-Critic algorithm replaces the full return with the one-step return as

\theta_{t+1} \doteq \theta_t + \alpha \big( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \big) \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)},    (7)

where w is the state-value weight vector learned by the Actor-Critic algorithm [49]. In Eq. (7), \hat{v}(S_t, w) is a learned state-
value function that is used as the baseline. In recent years,
many improved Actor-Critic algorithms have been developed,
such as Asynchronous Advantage Actor-Critic (A3C) [58],
Soft Actor-Critic (SAC) [59], Deterministic Policy Gradient
(DPG) [60] and its variation Deep Deterministic Policy Gra-
dient (DDPG) [27]. Actor-Critic algorithms may alleviate the
problem of sampling efficiency by experience replay [61].
However, due to the coupling of value evaluation and policy
updates, the stability of these algorithms is unsatisfactory.
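The following is a minimal sketch of the one-step Actor-Critic update in Eq. (7), using linear function approximation for both the policy and the state-value baseline; the feature encodings and step sizes are illustrative assumptions.

```python
import numpy as np

def one_step_actor_critic_update(theta, w, s_feat, a_feats, action, reward,
                                 s_next_feat, done,
                                 alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
    """One transition of the one-step Actor-Critic update (Eq. 7).

    s_feat / s_next_feat: feature vectors of the current / next state.
    a_feats: per-action feature matrix used by the linear-softmax policy.
    """
    # critic: TD error delta = R_{t+1} + gamma * v(S_{t+1}) - v(S_t)
    v_s = w @ s_feat
    v_next = 0.0 if done else w @ s_next_feat
    delta = reward + gamma * v_next - v_s
    w = w + alpha_w * delta * s_feat                   # TD(0) update of the critic weights

    # actor: gradient of log pi(A_t | S_t, theta) for a linear-softmax policy
    logits = a_feats @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_log_pi = a_feats[action] - probs @ a_feats
    theta = theta + alpha_theta * delta * grad_log_pi  # policy-gradient step scaled by the TD error
    return theta, w
```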
III. FORMULATION OF RECOMMENDATION PROBLEM
In a typical recommender system, suppose there are a set of users U and a set of items I, with R ∈ R^{X×Y} denoting the user-item interaction matrix, where X and Y denote the numbers of users and items, respectively. Let r_t^{ui} denote the interaction behavior between user u and item i at time step t. The recommender system aims to generate a predicted score \hat{r}_t^{ui}, which describes the user's preference for the item i.
Generally, we can formulate the recommendation task as a
finite MDP [62], [63], as shown in Fig. 2(b). At each time
step, the recommender agent interacts with the environment
(i.e., the user and/or logged data) by recommending an item
to the user in the current state. At the next time step, the
recommender agent receives feedback from the environment
and recommends a new item to the user in a new state. The
user’s feedback may contain explicit feedback (e.g., purchase
and rating) or implicit feedback (e.g., user’s browsing record
from logged data). The recommender agent aims at learning
an optimal policy with the policy network to maximize the
cumulative reward. More precisely, the MDP in the recommendation scenario is a 5-tuple ⟨S, A, P, R, γ⟩, whose components can be defined as follows.
States S. The finite state space describes the environment states in fixed-length history trajectories, in which S_t = {i_1, i_2, · · · , i_t} is an observed state from the sequence of interacted items at time step t.
Actions A. The discrete action space contains the whole set of recommended candidate items². An action A_t is to recommend an item i at time step t. In logged data, the action can be taken from the user-item interactions.
Transition probability P. It is the state transition probability matrix. The state transits from s to s' according to the probability p(s', r | s, a) after the recommender agent receives the user's feedback (i.e., the reward r).
Reward function R. Once the recommender agent takes action a in state s, it obtains the reward r(s, a) in accordance with the user's feedback.
Discount factor γ. It is the discount-rate parameter for future rewards.

²Note that the action space may include other kinds of actions in different recommendation scenarios, e.g., the selection of query attributes in conversational recommender systems, and the outgoing edges of entities in KG-based explainable recommender systems.
We assume that, in an online RL-based recommendation environment, the recommender agent recommends an item i_t to a user u, and the user provides feedback f_t to the recommender agent at the t-th interaction. The recommender agent obtains the reward r(S_t, A_t) associated with the feedback f_t, and recommends a new item i_{t+1} to the user u at the next interaction. Given the observations over multiple turns of interactions, the recommender system generates a recommendation list. The recommender agent aims to learn a target policy π_θ to maximize the cumulative reward of the sampled sequence [62]:

\max_{\pi_\theta} \mathbb{E}_{\tau \sim \pi_\theta} [ R_\tau ],    (8)

where θ refers to the policy parameters, and R_\tau = \sum_{t=0}^{|\tau|} \gamma^t r(S_t, A_t) denotes the cumulative reward with respect to the sampled sequence τ = {S_0, A_0, · · · , S_T}.
The recommender agent often suffers from the high cost of learning the target policy by interacting with real users online. An alternative is to employ offline learning, which learns a behavior policy π_δ from the logged data. When using offline learning, we need to address the policy bias in order to learn an optimal policy π^*, since there is a noticeable difference between the target policy π_θ and the behavior policy π_δ.
IV. RECOMMENDER SYSTEMS BY REINFORCEMENT
LEARNING
In this section, we summarize specific RL algorithms ap-
plied in four typical recommendation scenarios (i.e., interactive
recommendation, conversational recommendation, sequential
recommendation, and explainable recommendation) following
value-function, policy search, and Actor-Critic, respectively.
An overview of the related literature is shown in Table II. Note that an offline evaluation strategy means that a model/framework is evaluated in an offline experimental environment, whereas an online evaluation strategy denotes experiments conducted in online communities or with real users.
A. Interactive Recommendation
In a typical interactive recommendation scenario, a user u is recommended an item i_t and provides feedback f_t at the t-th interaction. The recommendation system then recommends a new item i_{t+1} based on the feedback f_t. Such an interactive process
can be easily formulated as an MDP, where the recommender
agent constantly interacts with the user and learns the policy
from the feedback to improve the quality of recommendations
[64], as shown in Fig. 2 (b). Due to their nature of learning
from dynamic interactions, RL algorithms have been widely
adopted to solve interactive recommendation problems [64].
1) Value-function Approaches: There is a common chal-
lenge of sample efficiency for RL algorithms, which may
lead to inefficient learning of the recommendation policy. To
address the limitation, Knowledge Graph (KG) is incorporated
within RL algorithms for the interactive recommendation
[71]. KG can provide rich supplementary information, which
reduces the sparsity of the user feedback. To this end, a KG-enhanced Q-learning model is proposed to make sequential decisions efficiently. Besides, a DQN approach with double-Q
[99] and dueling-Q [100] models long-term user satisfaction. From the interaction history o_t with item i_t at time step t, the recommender agent obtains the reward R_t and then stores the experience in the replay buffer D. The goal is to improve the performance of the Q-network by minimizing the mean-square loss function as follows:

L(\theta_q) = \mathbb{E}_{(o_t, i_t, R_t, o_{t+1}) \sim D} \big[ (y_t - q(S_t, i_t; \theta_q))^2 \big],    (9)

where (o_t, i_t, R_t, o_{t+1}) refers to the learnt experience, and y_t denotes the target value of the optimal q^*.
Moreover, [67] proposes a DQN-based recommendation
framework that incorporates Convolutional Neural Network
(CNN) and Generative Adversarial Networks (GAN) [101],
called DRCGR. It automatically learns the optimal policy
based on both the positive and negative user feedback.
To optimize instant and long-term user engagement in rec-
ommender systems, [69] designs a novel Q-network that has
three layers named raw behavior embedding layer, hierarchical
behavior layer, and Q-value layer. Moreover, Pseudo Dyna-Q
(PDQ) [70] is proposed to ensure the stability of convergence
and low computation cost of existing algorithms. The PDQ framework consists of two major components: a world model, which imitates the user's feedback from the historical logged data and generates pseudo experiences; and a recommendation policy based on Q-learning, which maximizes the expected reward using the pseudo experiences from the world model together with the logged experiences.
Besides, a few attempts use the DP method to optimize the
recommendation policies [66]. For example, [65] adopts the
value iteration algorithm to learn the true status value in a
collaboration network.
The multi-step problem in interactive recommender systems
has been studied in [68], where the multi-step interactive
recommendation is cast as a multi-MDP task for all target
users. To model user-specific preferences explicitly, a biased
User-specific Deep Q-learning Network (UDQN) is proposed
by adding a bias parameter to capture the difference in the
Q-values of different target users.
2) Policy Search Methods: Existing RL-based algorithms
are often developed for short-term recommendation, whereas
[9] employs DRL and Recurrent Neural Network (RNN) to
improve the accuracy of long-term recommendation. Specifi-
cally, RNN is performed to simulate the sequential interactions
between the environment (the user) and the recommender
agent by evolving user states adaptively. This strategy can help
tackle the cold-start issue in recommender systems. On the
other hand, the interaction process is split into sub-episodes, and the accumulated reward of each sub-episode is restarted, which significantly improves the effectiveness of policy learning.
TABLE II
OVERVIEW OF RL ALGORITHMS FOR DIFFERENT RECOMMENDER SCENARIOS.
Scenario | Model | RL Class | RL Algorithm | RL Environment | Evaluation Strategy | Dataset
Interactive Recommendation | UWR [64] | Value-function | Q-Learning | model-free | Offline | N/A
 | Multi-with RL [65] | Value-function | DP | model-based | Offline | ACM
 | ARA [66] | Value-function | DP | model-based | Online & Offline | N/A
 | DRCGR [67] | Value-function | DQN | model-free | Offline | N/A
 | UDQN [68] | Value-function | DQN | model-free | Offline | ML100K, ML1M, YMusic
 | FeedRec [69] | Value-function | DQN | model-free | Offline | JD
 | PDQ [70] | Value-function | Q-Learning | model-based | Offline | Taobao, Retailrocket
 | KGQR [71] | Value-function | DQN | model-free | Offline | Book-Crossing, ML20M
 | SL+RL [9] | Policy Search | REINFORCE | model-free | Offline | ML100K, ML1M, Steam
 | RCR [72] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp, UT-Zappos50K
 | TPGR [73] | Policy Search | REINFORCE | model-free | Offline | ML10M, Netflix
 | FairRec [74] | Actor-Critic | DPG | model-free | Offline | ML100K, Kiva
 | Attacks&Detection [13] | Actor-Critic | AC | model-free | Offline | Amazon
 | SDAC [75] | Actor-Critic | AC | model-based | Offline | RecSys, Kaggle
 | AAMRL [76] | Actor-Critic | AC | model-free | Online | UT-Zappos50K
 | DRR [77] | Actor-Critic | AC | model-free | Online & Offline | ML1M, Yahoo! Music, ML100K, Jester
Conversational Recommendation | EMC, BTD, EHL [78] | Value-function | Monte Carlo, TD learning | model-free | Online & Offline | TRAVEL, PC, CAMERA, CAR
 | ISRA [79] | Value-function | DP | model-based | Online | N/A
 | EGE [80] | Value-function | Q-learning | model-free | Online | Shoes, Fashion IQ Dress
 | MemN2N [81] | Value-function | DQN | model-free | Offline | Personalized Dialog
 | SCPR [82] | Value-function | DQN | model-free | Online & Offline | Yelp, LastFM
 | UNICORN [83] | Value-function | DQN | model-free | Online & Offline | Yelp, LastFM, Taobao
 | CRM [84] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp
 | EAR [85] | Policy Search | REINFORCE | model-free | Online & Offline | Yelp, LastFM
 | CRSAL [86] | Actor-Critic | A3C | model-free | Offline | DSTC2, CamRest676, MultiWOZ 2.1
 | RelInCo [87] | Actor-Critic | AC | model-free | Offline | OpenDialKG, REDIAL
Sequential Recommendation | SQN [62] | Value-function | Q-learning | model-free | Offline | RC15, RetailRocket
 | DEERS [88] | Value-function | DQN | model-free | Online & Offline | JD
 | NRRS [89] | Value-function | Monte Carlo tree search | model-based | Offline | Million Song
 | RLradio [90] | Value-function | R-Learning | model-based | Online & Offline | N/A
 | KERL [91] | Policy Search | Truncated PG | model-free | Offline | Amazon, LastFM
 | HRL+NAIS [41] | Policy Search | REINFORCE | model-free | Offline | XuetangX
 | DARL [92] | Policy Search | REINFORCE | model-free | Offline | XuetangX
 | SAC [62] | Actor-Critic | AC | model-free | Offline | RC15, RetailRocket
 | DeepChain [93] | Actor-Critic | AC | model-based | Online & Offline | JD
 | SAR [94] | Actor-Critic | AC | model-free | Offline | Steam, Electronics, ML10M, Kindle
Explainable Recommendation | MT Learning [95] | Policy Search | REINFORCE | model-free | Offline | Amazon, Yelp
 | RL-Explanation [12] | Policy Search | REINFORCE | model-free | Offline | Amazon, Yelp
 | MKRLN [96] | Policy Search | REINFORCE | model-free | Offline | movie, book, KKBOX
 | SAKG+SAPL [97] | Policy Search | REINFORCE | model-free | Offline | Amazon
 | PGPR [26] | Policy Search | REINFORCE | model-free | Offline | Amazon
 | ADAC [28] | Actor-Critic | AC | model-free | Offline | Amazon
 | AnchorKG [98] | Actor-Critic | AC | model-free | Offline | MIND, Bing News
In text-based interactive recommender systems, user feed-
back with natural language usually causes undesired issues.
For instance, the recommender system may violate user’s
preferences, since it ignores the previous interactions and thus
recommends similar items. To handle these issues, a Reward
Constrained Recommendation (RCR) model [72] is proposed
to incorporate user preferences sequentially. More specifically,
the text-based interactive recommendation is formulated as a
constraint-augmented RL problem, where the user feedback
is taken as a constraint. To generalize from the constraints,
a discriminator parameterized as the constraint function is
developed to detect the violation of user’s preferences. The
policy is optimized by the policy gradient with baseline (i.e., a
general constraint), to learn constraints on violations of user’s
preferences. Based on the user's feedback on the recommended items, the recommender system utilizes constraints derived from this feedback to prevent undesired text generation.
Moreover, most existing RL methods fail to alleviate the
issue of large discrete action space in interactive recom-
mender systems, since there are a large number of items
to be recommended. To solve this problem, [73] proposes a
Tree-structured Policy Gradient Recommendation framework
(TPGR) to achieve high effectiveness and efficiency for large-
scale interactive recommendations. To maximize long-run cu-
mulative rewards, the REINFORCE algorithm is utilized to
learn the strategy for making recommendation decisions.
3) Actor-Critic Algorithms: Actor-Critic Algorithms have
also been adopted extensively for interactive recommender
systems in recent studies [77]. For instance, an RL-based
framework (i.e., FairRec) [74] is proposed to dynamically
achieve a fairness-accuracy balance, in which the fairness
status of the system and user’s preferences combine to form
the state representation. The FairRec framework contains two
parts: an actor network and a critic network. The actor network
generates the recommendation according to the fairness-aware
state representation. The actor network is trained from the
critic network, and updated by the DPG algorithm. Then, the
critic network estimates the value of the actor network outputs.
The critic network is updated by TD learning.
[13] addresses adversarial attacks in RL-based interactive
recommender systems. They propose a general framework
that consists of two models. In the adversarial attack model,
the agent crafts adversarial examples following Actor-Critic
or REINFORCE algorithm. In the encoder-classification de-
tection model, the agent detects potential attacks based on
the adversarial examples, employing a classifier designed by
attention networks. Following [26], the authors demonstrate
the effectiveness of the proposed framework through judging
whether the recommender system is attacked.
Online interactions in RL for interactive recommendation
may hurt the user experiences. An alternative is to adopt
logged feedback to perform offline learning. However, this usually suffers from some challenges, such as an unknown logging policy and extrapolation error. To deal with these challenges, [75] proposes a stochastic Actor-Critic method based on a probabilistic formulation, and presents some regularization approaches to reduce the extrapolation error.
In interactive recommender systems, utilizing multi-modal
data can enrich user feedback. To this end, [76] proposes a
vision-language recommendation approach that enables effec-
tive interactions with the user by providing natural language
feedback. In addition, an attribute augmented RL is introduced
to model explicit multi-modal matching. More specifically, the
multi-modal data A
t
(i.e., the action of recommending items
at time t) and x
t
(i.e., natural language feedback from users
at time t) are leveraged in the proposed approach. Then a
recommendation tracker, consisting of a feature extractor and a
multi-modal history critic, is designed to enhance the ground-
ing of natural language to visual items. The recommendation
tracker may track the user’s preferences based on a history
of multi-modal matching rewards. The policy is updated via
the Actor-Critic algorithm for recommending the items with
desired attributes to the user.
B. Conversational Recommendation
Contrary to interactive recommender systems where users
receive information passively (i.e., the recommendation system
is dominant), conversational recommender systems interact
with users actively. They explicitly acquire users’ active feed-
back and make recommendations that users really like. To
achieve this objective, different from interactive recommender
systems that recommend items from each interaction, conver-
sational recommender systems [103] usually recommend items
after communicating with users by real-time multi-turn inter-
active conversations, based on natural language understanding
and generation. In this case, there is a critical issue of trade-
offs between exploration and exploitation for conversation
and recommendation strategies [102]. Conversational recom-
mender systems explore the items unseen by a user to capture
the user’s preferences by multi-turn interactions. However,
compared to exploiting the related items that have already
been captured, exploring the items that may be unrelated will
harm the user experience. RL provides potential solutions to
address this challenge. As shown in Fig. 3, the policy network
stimulates the reward centers in conversational recommender
systems, integrating exploration with exploitation in multi-turn
interactions.
1) Value-function Approaches: At each conversation turn
of conversational recommendation, the policy learning usually
aims to decide what to ask, which to recommend, and when to
ask or recommend. [83] introduces the UNICORN (i.e., UNI-
fied COnversational RecommeNder) model that employs an
adaptive DQN method to cast these decision-making processes
as a unified policy learning task. To adapt the conversational
strategy, [79] adopts the DP method to yield better policies that
assist user behaviors more efficiently. Besides, [81] proposes
an RL-based dialogue strategy that utilizes recommendation
results based on the user’s utterances, whose intention is
estimated by a Long Short-Term Memory (LSTM) network.
It is interesting that conversational recommender systems
can use incremental critiquing as a powerful type of user
feedback, to retrieve the items in line with the user’s soft
preferences at each turn. From this point of view, a suitable quality measure is needed for recommendation efficiency.
To achieve this objective, a novel approach is proposed to
improve the quality measure by combining a compatibility
score with similarity score [78]. To evaluate the compatibility
of user critiques, exponential reward functions are presented by
Monte Carlo and TD methods based on the user specialization.
Moreover, a global weighting of user’s preferences is proposed
to enhance the critiquing quality, which brings about faster
convergence of the similarity. By using the combination of
these two scores, the conversational recommender system
improves its robustness against the noisy user data.
However, it is difficult to capture users’ preferences over
time since the recommender system usually only obtains
partial observations of the users’ preferences. In this case,
we can formalize the observed interactions as a Partially Ob-
servable Markov Decision Process (POMDP). Afterwards, the
Estimator-Generator-Evaluator (EGE) [80] trains the POMDP
with Q-learning to track users’ preferences and generates the
next recommendations at each iteration.
The aforementioned conversational recommender systems
usually utilize user feedback in implicit ways. Instead, to make
full use of user preferred attributes on items in an explicit way,
a Simple Conversational Path Reasoning (SCPR) framework
[82] is proposed to conduct interactive path reasoning for
conversational recommendation on a graph. The SCPR obtains
user preferred attributes more easily, by pruning off irrelevant
candidate attributes following user feedback based on a policy
network. The policy network takes the state vector s as input and outputs the action-value Q(s, a), which refers to the estimated reward for the asking action a_ask or the recommending action a_rec.
The policy is optimized by the standard DQN to achieve
its convergence. Different from EAR [85], SCPR uses the
adjacent attribute constraint on the graph to reduce the search
space of attributes, such that the decision-making efficiency can be improved.

Fig. 3. The common framework of conversational recommendation models based on RL. The user interface extracts user intention from the utterances during the conversation [102]. The policy network learns the optimal policy from dialogue states. The recommender system is trained with user intention, item features, and dialogue states, to make personalized recommendations [81], [84], [86].
2) Policy Search Methods: To model the user’s current
intention and long term preferences for personalized recom-
mendations, [84] utilizes deep learning technologies to train
a conversational recommender system. Specifically, a deep
belief tracker extracts a user intention by analyzing the user’s
current utterance. Meanwhile, a deep policy network guides
the dialogue management by the user’s current utterance and
long-term preferences learned by a recommender module.
The framework of the proposed Conversational Recommender
Model (CRM) is composed of three modules: a natural lan-
guage understanding with the belief tracker, dialogue man-
agement based on the policy network, and a recommender
designed based on factorization machine [104]. More specif-
ically, at first, the belief tracker converts the utterance (i.e., the input vector z_t) into a learned vector representation (i.e., S_t) by an LSTM network. Afterwards, to maximize the long-
term return, they use the deep policy network to select a
reasonable action from the dialogue state at each turn. The
REINFORCE algorithm is adopted to optimize the policy
parameter. Finally, the factorization machine is utilized to
train the recommendation module, which generates a list of
personalized items for the corresponding user.
Distinct from previous studies that ignore how to adapt
the recommended items for user feedback, [85] proposes a
unified framework called Estimation Action Reflection (EAR),
in which a Conversation Component (CC) intensively interacts
with a Recommender Component (RC) in the three–stage
process. Specifically, the framework starts from the estimation stage, in which the RC guides the action of the CC by ranking candidate
items. Afterwards, at the action stage, the CC decides which
questions to ask in terms of item features and makes a
recommendation. The conversation action is performed by
a policy network, which is optimized via the REINFORCE
algorithm. When a user rejects the recommended items that
are made at the action stage, the framework moves to the
reflection stage for adjusting its estimations. Extensive experi-
ments demonstrate the proposed EAR outperforms CRM [84]
according to not only fewer conversation turns but also better
recommendations, mainly because the candidate items adapt
to user feedback in the interaction between the RC and CC.
3) Actor-Critic Algorithms: To address the training issue
of task-oriented dialogue systems, a Conversational Recom-
mender System with Adversarial Learning (CRSAL) [86] is
proposed to fully enable two-way communications. The pro-
cess of CRSAL is divided into three stages. In the information
extraction stage, a dialogue state tracker firstly infers the
current dialogue belief state b_t from the user's utterances. Afterwards, a neural intent encoder extracts and encodes the user's utterance intention z_t. Finally, a neural recommender
network derives recommendations from item features and di-
alogue states. In the conversational response generation stage,
a neural policy agent (i.e., the actor) generates human-like
action in the current dialogue state, and a natural language
generator updates conversational responses from the critic
network by the selected action. In the RL stage, the decision
procedure of dialogue actions can be formulated as a Partially
Observable MDP (POMDP). The agent selects the best action
in each conversation round under the long-term policy. To this
end, an adversarial learning mechanism based on the A3C
algorithm is developed to fine-tune the actions generated by
the agent, which employs the discriminator to train the optimal
parameters of the proposed model.
In conversational recommender systems, valuable informa-
tion from user’s utterances is often conducive to the retrieval
performance. For example, [87] proposes an RL-based model
to extract relevant information from the context of the con-
versation. The model introduces two Actors: a selector-Actor
finds the most relevant words for the target of the conversation,
and an arrangement-Actor returns the related order of words
based on the user’s utterances. Both Actors are trained by the
Actor-Critic algorithm.
C. Sequential Recommendation
Unlike interactive recommendation methods that generate
recommendations based on the user’s feedback via constant in-
teractions, sequential recommender systems predict the user’s
future preference and recommend the next item given a
sequence of historical interactions. Let i
u
i:n
= i
u
1
i
u
2
· · · i
u
n
denote the user-item interaction sequence, where n
is the sequence length. The sequential recommender system
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, ACCEPTED 17 MAY 2023. 9
aims to recommend the next item
ˆ
i that is not in the historical
interactions. Some attempts have revealed that RL algorithms
deal well with the sequential recommendation problems, since
such problems can be naturally formulated as an MDP to
predict the user’s long-term preferences. In this case, the
recommender agent naturally performs a sequence of ranking decisions, and it usually learns the optimal policy from the logged data with off-policy methods.
1) Value-function Approaches: In sequential recommenda-
tion, it is essential to fuse long-term user engagement and
user-item interactions (e.g., clicks and purchases) into the rec-
ommendation model training. Learning the recommendation
policy from logged implicit feedback by RL is a promising
direction. However, there exist challenges in implementation due to the lack of negative feedback and the pure off-policy
setting. [62] presents a novel self-supervised RL method to
address the challenges. More precisely, the next item rec-
ommendation problem is modeled as an MDP, where the
recommender agent sequentially recommends items to related
users to maximize the cumulative reward. To optimize the
recommendation model, they define a flexible reward function
that contains purchase interactions, long-term user engage-
ment, and item novelty. Based on this method, the authors
develop a self-supervised Q-learning model to train two layers
with the logged implicit feedback. Similarly, [90] leverages
R-Learning [105] to develop a music recommender system
named RLradio. It exploits both explicitly revealed channel
preferences and user’s implicit feedback, i.e., the user actually
listens to a music track that played in the recommended
channel. [89] also leverages the wireless sensing and RL algo-
rithm to improve the user experience in music recommender
systems, in which user’s preferences are explored by the Monte
Carlo tree search method.
Moreover, in [88], a novel DQN is built for the proposed framework, where a Gated Recurrent Unit (GRU) captures the user's sequential behaviors as the positive state s^+, and the negative state s^- is obtained in a similar way. The positive and negative signals are fed into the input layer separately, which helps the new DQN distinguish the contributions of positive and negative feedback in recommendations.
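As an illustration of this idea of keeping positive and negative signals separate, the sketch below encodes clicked and skipped item sequences with two GRUs before a Q-value head; it is a simplified illustration inspired by the description above rather than the actual DEERS architecture, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PosNegStateEncoder(nn.Module):
    """Illustrative state encoder: two GRUs summarize positively and negatively
    rated item sequences into s_plus and s_minus, which a Q-head consumes jointly."""

    def __init__(self, num_items, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, embed_dim)
        self.pos_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.neg_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(2 * hidden_dim, num_items)   # Q-values over candidate items

    def forward(self, pos_items, neg_items):
        # pos_items / neg_items: (batch, seq_len) indices of clicked / skipped items
        _, s_plus = self.pos_gru(self.item_emb(pos_items))    # (1, batch, hidden_dim)
        _, s_minus = self.neg_gru(self.item_emb(neg_items))
        state = torch.cat([s_plus.squeeze(0), s_minus.squeeze(0)], dim=-1)
        return self.q_head(state)                             # (batch, num_items)
```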
2) Policy Search Methods: It is a non-trivial problem to
capture the user’s long-term preferences in sequential rec-
ommender systems, since the user-item interactions may be
sparse or limited. Thus, it is unreliable for RL algorithms to
learn user interests by using random exploration strategies.
To overcome these issues, [91] proposes a Knowledge-guidEd
Reinforcement Learning (KERL) framework that adopts an
RL model to make recommendations over KG. In the MDP
modeled by KERL, the environment contains the information
of interaction data and KG, which are useful for the sequential
recommendation. In this case, the agent selects an action
a in state S_t for recommending an item i_{t+1} to a user u.
During each recommendation process, the agent obtains an
intermediate reward, which is defined by integrating sequence-
level and knowledge-level rewards.
Moreover, [41] analyzes users’ sequential learning behaviors
and points out that the attention-based recommender systems
perform poorly when the users enroll in diverse historical
courses, because the effects of the contributing items are
diluted by many different items. To remove noisy items and recommend the most relevant items at the next time step, a profile reviser with two-level MDPs is designed. In this profile reviser,
a high-level task decides whether to revise the user profile,
and a low-level task decides which item should be removed.
The agent in the proposed Hierarchical Reinforcement Learn-
ing (HRL) framework performs hierarchical tasks under the
revising policy, which is updated by the REINFORCE algo-
rithm. In addition, for capturing users’ dynamic preferences
in sequential learning behaviors, [92] proposes a dynamic
attention mechanism, which combines with the HRL-based
profile reviser to distinguish the effects of contributing courses
in each interaction. As a result, the proposed model achieves
more accurate recommendations than HRL [41].
3) Actor-Critic Algorithms: As mentioned before, [62]
proposes a novel self-supervised RL method to learn the
recommendation policy from logged implicit feedback in
sequential recommendation. To optimize the recommendation
model, a Self-supervised Actor-Critic (SAC) model treats the
self-supervised layer as an actor to perform ranking and takes
the RL layer as a critic to estimate the state-action value.
However, both SQN and SAC [62] utilize a fixed length
of interaction sequences as input to train the models, which
affects the recommendation accuracy, because users usually
have various sequential patterns. To address this issue, [94]
proposes a Sequence Adaptation model via deep Reinforce-
ment learning (SAR) to adjust the length of interaction se-
quences. In particular, the RL agent performs the selection of
a sequence length (i.e., action) in a personalized manner in
the actor network. Finally, a joint loss function is optimized
to align the cumulative rewards of the critic network with the
recommendation accuracy.
D. Explainable Recommendation
The objective of explainable recommendation is to solve
the problem of interpretability in the recommender systems.
The explainable recommender systems not only provide high-
quality recommendations but also generate relevant explana-
tions for recommendation results [106]. In particular, the visual
explanation seems more intelligent when a recommender sys-
tem conducts path reasoning over KG, because KG contains
rich relationships between users and items for intuitive expla-
nations. Fig. 4 illustrates the common framework of RL with
KG for the explainable recommendation. The KG is cast as
a part of the MDP environment, and the agent performs path
reasoning for recommendations [26], [28].
In the survey on explainable recommendation [107], the
explainable recommendation methods are divided into five
classes: explanation with relevant items or users, feature-based
explanation, social explanation, textual sentence explanation,
and visual explanation. The emphasis of this section is to
survey RL applied to the explainable recommendation, where
the approaches focus on textual sentence explanation and
visual explanation.

Fig. 4. The RL algorithms with KG for explainable recommendation [26], [28]. The models aim to learn a path-finding policy over the KG. The learned policy is adopted to find the interpretable reasoning paths that make accurate recommendations to the given user.

1) Policy Search Methods: In terms of textual sentence explanation, [95] proposes a multi-task learning framework
that simultaneously makes rating predictions and generates
explanations from users’ reviews. The rating prediction is
performed by a context-aware matrix factorization model,
which learns latent factor vectors of users and items. The
recommendation explanation is employed by an adversar-
ial sequence-to-sequence model, in which GRU generates
personalized reviews from observed review documents, and
adversarial training for the review generation is optimized by
the REINFORCE algorithm.
To control the explanation quality flexibly, [12] designs
an RL framework to generate sentence explanations. In the
proposed framework, there are a couple of agents instantiated with attention-based neural networks. The task of one agent
is to generate explanations, while the other agent is respon-
sible for predicting the recommendation ratings based on
the explanations. The environment mainly consists of users,
items, and prior knowledge. In this work, the recommendation
model is treated as a black box. The agents extract the
interpretable components from the environment to generate
effective explanations.
Furthermore, to leverage the image information of entities,
rather than focusing on rich semantic relationships in a hetero-
geneous KG, [96] presents a Multi-modal Knowledge-aware
RL Network (MKRLN), where the representation of recom-
mended path consists of both structural and visual information
of entities. The recommender agent starts from a user and
searches suitable items along hierarchical attention-paths over
the multi-modal KG. These attention-paths can improve the
recommendation accuracy and explicitly explain the reason
for recommendations.
Real-world KGs are generally enormous, which makes finding reasonable paths in the graph a key concern. From this aspect, [26] proposes a reinforcement KG reasoning approach
called Policy-Guided Path Reasoning (PGPR), which performs
recommendations and explanations by providing actual paths
with the REINFORCE algorithm over the KG. Specifically,
the recommendation problem is treated as a deterministic MDP
over the KG. In the training stage, following a soft reward and
user-conditional action pruning strategy, the agent starts from a
given user, and learns to reach the correct items of interest. In
the inference stage, the agent samples diverse reasoning paths
for recommendation by a policy-guided search algorithm, and
generates genuine explanations to answer why the items are
recommended to the user.
Distinct from previous explainable recommendation meth-
ods that use KGs, [97] focuses on sentiment on relations in the
KG. To obtain more convincing explanations with sentiment
analysis, a Sentiment-Aware Knowledge Graph (SAKG) is
constructed by analyzing users’ reviews and ratings on items.
Moreover, a Sentiment-Aware Policy Learning (SAPL) method
is introduced to make recommendations and guide the rea-
soning over the SAKG. Experimental results demonstrate that
the proposed framework outperforms state-of-the-art baselines
(e.g., PGPR [26]), in terms of both accuracy and explainability.
2) Actor-Critic Algorithms: Differing from PGPR [26], the
ADversarial Actor-Critic (ADAC) model [28] is proposed to
identify interpretable reasoning paths. By learning the path-
finding policy, the actor obtains its search states over the KG
and potential actions. The actor obtains the reward R
e,t
if
the path-finding policy from the current state fits the observed
interactions. To integrate the expert path demonstrations, they
design an adversarial imitation learning module based on two
discriminators (i.e., meta-path and path discriminators), which
are trained to distinguish the expert paths from the paths
generated by the actor. When its paths are similar to the expert
paths in the meta-path or path discriminator, the actor obtains
the reward (R
m,t
or R
p,t
) from the imitation learning module.
The critic merges these three rewards (i.e., R
e,t
, R
m,t
, and
R
p,t
) to estimate each action-value accurately. Later, the actor
is trained with an unbiased estimate of the reward gradient
through the learned action-values. Finally, the major modules
of ADAC are jointly optimized to find the demonstration-
guided path, which brings accurate recommendations.
Although existing works (e.g., PGPR [26] and ADAC [28])
combine KG and RL to enhance recommendation reasoning,
they are not suitable for news recommendation, where a
news article usually contains multiple entities. To address this
challenge, a recommendation reasoning paradigm, named An-
chorKG, is proposed to employ anchor KG path to make news
recommendation [98]. The AnchorKG model consists of two
parts. An anchor graph generator captures the latent knowledge
information of the article, which leverages k-hop neighbors
of article entities to learn high-quality reasoning paths. On
the other hand, an AC-based framework is developed to train
the anchor graph generator. Finally, the model performs a
multi-task learning process to optimize both the anchor graph
generator and the recommendation task jointly.
E. Discussion on RL Aspect
As mentioned above, the comprehensive survey of RL-based
recommender systems follows three classes of RL algorithms:
value-function, policy search, and Actor-Critic. From the RL
perspective, we make a comparison of these three types of
methods when they are applied in the recommender systems.
Value-function approaches depend heavily on the sam-
ples to learn the optimal policy. Thus, these approaches
are suitable for small discrete action spaces and are often
applied in small-scale recommender systems, such as
traditional interactive recommendation and conversational
recommendation. However, real-world recommender sys-
tems usually contain large action spaces, which leads to
the slow convergence of RL when they utilize value-
function approaches to make recommendations.
Policy search methods directly optimize the policy with-
out relying on the value function. They are well suited to continuous action spaces, although the policy gradient estimates typically suffer from high variance. Nevertheless, benefitting from their fast convergence properties, policy search methods
are competent in large-scale recommender systems, such
as sequential recommendation and explainable recom-
mendation, especially KG-based recommendation.
Actor-Critic algorithms incorporate the advantages of
value-function approaches and policy search methods.
Nevertheless, Actor-Critic algorithms usually cause po-
tential information loss since they map discrete action
spaces into continuous action spaces by the policy net-
work [47], which ensures the policy is differentiable with
respect to its parameters. Hence, Actor-Critic algorithms
are rarely applied in recommender systems that focus on convincing recommendations, e.g., conversational
recommendation and explainable recommendation.
RL-based recommendation models can also be divided into
model-based and model-free algorithms. As shown in Table II,
a few recommendation models adopt model-based algorithms,
while most existing recommendation models utilize model-free
algorithms.
Model-based algorithms (e.g., DP and heuristic search)
require a model to represent the environment of the
recommender system, so the agent relies on planning, which guarantees sample efficiency. Nevertheless, such al-
gorithms often result in biased estimations since the envi-
ronment of recommender systems dynamically changes.
Moreover, the transition probability is deterministic in
model-based algorithms. Therefore, these algorithms are
not applicable to real-world recommender systems.
Model-free algorithms (e.g., TD, DQN, and REIN-
FORCE) generally achieve better recommendation performance, because the agent mainly relies on learning from
previous experiences. The drawback of such algorithms
is sample inefficiency, i.e., they require a large number
of samples to ensure the convergence of the algorithms.
V. CHALLENGES IN REINFORCEMENT LEARNING BASED
RECOMMENDATION APPROACHES
As noted above, RL aims to maximize long-run cumulative
rewards by learning the optimal policy from the interactions
between the agent and its environment. Thus, it relies not
only on the environment but also on prior knowledge. In
the recommendation applications, RL approaches often suffer
from various challenges. To this end, many researchers put
forward relevant solutions to different issues. In the following,
we summarize relevant studies from five aspects: environment
construction, prior knowledge, reward function definition,
learning bias, and task structuring.
A. Environment Construction
In RL, the agent observes states from the environment, then
conducts relevant actions under the policy, and receives a re-
ward from the environment. Applied to recommender systems,
the policy training in the environment is often confronted with
many unpredictable situations, due to the need for exploration.
In this case, environment construction is critical to learn the
optimal recommendation policy.
1) State Representation: The state representation plays an
important role in RL to capture the dynamic information
during the interactions between users and items, since the
environment state is a key component of MDP. However,
most current RL methods focus on policy learning to opti-
mize recommendation performance. To effectively model the
state representation, [77] proposes a DRL-based recommenda-
tion framework termed DRR, where four state representation
schemes are designed to learn the recommendation policy (the
actor network) and the value function (the critic network).
Specifically, the DRR-p scheme employs the element-wise
product operator to learn the pairwise local dependency be-
tween items. DRR-u scheme incorporates the pairwise interac-
tions between users and items into the DRR-p scheme. DRR-
ave scheme concatenates the user embedding, the average
pooling result of items, and the user-item interactions into
a whole vector to describe the state representation. In the
DRR-att scheme, a weighted average pooling is conducted
by attention networks. The actor network conducts a ranking action a = π_θ(s) according to the state representation s, and generates a corresponding Q-value by the action-value function Q^π(s, a). The critic network leverages a DQN parameterized as Q_w(s, a) to approximate Q^π(s, a). Based on DPG [60], the actor network can be updated by the policy gradient via

\nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{t} \nabla_{a} Q_{w}(s, a)\big|_{s=S_t,\, a=\pi_{\theta}(S_t)}\, \nabla_{\theta} \pi_{\theta}(s)\big|_{s=S_t}, \quad (10)

where J(π_θ) denotes the expectation of all possible Q-values following the policy π_θ, and N is the batch size.
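For readers who prefer code, the following is a minimal PyTorch sketch of the DPG-style actor update in Eq. (10), where the actor is moved along the critic's gradient evaluated at a = π_θ(s); the network sizes, optimizer settings, and sampled batch are assumptions, not the DRR implementation.

```python
# Minimal sketch of the DPG-style actor update in Eq. (10).
# Network sizes, optimizer settings, and the sampled batch are assumptions.

import torch
import torch.nn as nn

state_dim, action_dim, batch = 16, 8, 32

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(batch, state_dim)      # S_t sampled from a replay buffer

actions = actor(states)                     # a = pi_theta(S_t)
q_values = critic(torch.cat([states, actions], dim=-1))
actor_loss = -q_values.mean()               # ascend (1/N) sum_t Q_w(S_t, pi_theta(S_t))

actor_opt.zero_grad()
actor_loss.backward()                       # chain rule: grad_a Q_w * grad_theta pi_theta
actor_opt.step()                            # only the actor is updated here;
                                            # the critic has its own update step
```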
Moreover, [108] introduces the DEMER (i.e., DEcon-
founded Multi-agent Environment Reconstruction) framework,
which assumes that the environment reconstruction from the
historical data is powerful in RL-based recommender systems.
DEMER randomly samples one trajectory from the observed
data, and then forms its first state as the initial observation.
It finally generates a policy-generated trajectory by the TRPO
algorithm, considering the confounder embedded policy as a
role of hidden confounders in the environment.
It may be practical to formulate a simulation environment
of recommendation as an RL gym. To achieve this objective,
PyRecGym [109] is designed for RL-based recommender
systems to support standard test datasets and common input
types. In the PyRecGym, the states of the environment refer
to user profiles, items, and interactive features with contextual
information. The RL agent interacts with a gym engine in the
current state, and obtains related feedback from the gym en-
gine according to the reward. PyRecGym contains three major
functions: an initialization function that sets up the initial state of the environment for the user, a reset function that resets the user state for the next episode and returns the initial state to the RL agent, and a step function that reacts to the agent's action and returns the next state.
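A minimal gym-style environment with the three functions described above might look as follows; the state layout, the simulated click reward, and the episode length are illustrative assumptions and do not reflect PyRecGym's actual API.

```python
# Hedged sketch of a gym-style recommendation environment with the three
# functions described above. The state layout, reward rule, and episode
# length are illustrative assumptions, not PyRecGym's API.

import random

class RecEnv:
    def __init__(self, user_profile, items, max_steps=10):
        # Initialization: build the initial state from user/item features.
        self.user_profile = user_profile
        self.items = items
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        # Reset: start a new episode and return the initial state to the agent.
        self.step_count = 0
        self.history = []
        return self._state()

    def step(self, action):
        # Step: react to the agent's recommendation, return (state, reward, done).
        self.step_count += 1
        clicked = random.random() < self.items[action]["click_prob"]  # simulated feedback
        reward = 1.0 if clicked else 0.0
        self.history.append((action, reward))
        done = self.step_count >= self.max_steps
        return self._state(), reward, done

    def _state(self):
        return {"user": self.user_profile, "history": list(self.history)}

env = RecEnv(user_profile={"age": 30},
             items=[{"click_prob": 0.3}, {"click_prob": 0.7}])
state, total = env.reset(), 0.0
for _ in range(5):
    state, r, done = env.step(action=1)
    total += r
print("return:", total)
```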
2) Knowledge Graph Leverage: Recent advances in KG
have attracted increasing attention in RL-based recommender
systems. The KG-enhanced interaction fashion enriches user-
item relations by the structural knowledge to describe the MDP
environment. Besides, the multi-hop paths in a KG contribute
to the reasoning process for explainable recommendation.
The KERL framework [91] incorporates KG into RL-based
recommender systems, with the aim of predicting future user’s
preferences and addressing the sparsity. To learn the user’s
preferences from sparse user feedback, [71] proposes a KG-
enhanced Q-learning model for interactive recommender sys-
tems. Instead of learning the policy by trial-and-error search,
the model utilizes the knowledge of correlations among items
learned from KG to enrich the state representation of both
items and users. Thus, it spreads the user’s preferences among
the correlated items over KG.
[26] makes the first attempt to leverage KG and RL for
explainable recommendation. They develop a unified frame-
work (i.e., PGPR) to provide actual recommendation paths
guided by the policy network in a KG. To further improve
the convergence of the PGPR approach, [28] proposes a
demonstration-based KG reasoning framework named ADAC,
in which imperfect path demonstrations are extracted to guide
path-finding. The ADAC model aims to identify interpretable
reasoning paths for accurate recommendations.
Another interesting work is the negative sampling by KG
Policy Network [110]. It is incorporated into the recommen-
dation framework to explore informative negatives over KG.
The recommender (i.e., matrix factorization) and the sampler
(i.e., KG Policy Network) are jointly trained by the iteration
optimization. The recommender parameters are updated by the
stochastic gradient descent (SGD) method, and the sampler
parameters are updated via the REINFORCE algorithm.
3) Negative Sampling: Most existing studies extract nega-
tive sampling from the unobserved data to assist the training
of the recommendation model. However, they often fail to
yield high-quality negative samples to reflect the user’s needs,
which provides an important clue on the environment con-
struction for RL. To discover informative negative feedback
from the missing data, [110] proposes a KG policy network
for knowledge-aware negative sampling, which employs an RL
agent to search high-quality negative instances with multi-hop
exploring paths over the KG. Similarly, to learn the sampling
strategy from the missing negative signals, the supervised
negative Q-learning method [111] trains the RL algorithm with
a supervised sequential learning method to sample negative
items.
Moreover, the user exposure data, which records historical interactions based on implicit feedback, is also beneficial for training negative samples. From this point of view, [112]
introduces a recommender-sampler framework, where the sam-
pler samples candidate negative items as the output, and the
recommender is optimized by the SGD method to learn the
pairwise ranking relation between a ground truth item and a
generated negative item. After obtaining the multiple rewards,
the sampler is optimized by the REINFORCE algorithm to
generate both real and hard negative items. Moreover, [113]
uses a CF-based pre-training method to sample negative items
for RL-based recommender systems.
4) Social Relation: Traditional recommender systems often
suffer from two main challenges, i.e., cold start and data
sparsity. A promising way for alleviating these issues is to
utilize users’ social relationships to model users’ preferences
efficiently. Thus, the agent can sample users’ social relation-
ships and deliver them to the environment.
Applied to RL-based recommendation, [114] integrates
social networks into the estimation of action-values. More
specifically, a social matrix factorization method is proposed
to describe the high-level state/action representations. To learn
more relevant hidden representations from personal prefer-
ences and social influence, an enhanced SADQN model is
developed to utilize additional neural layers to summarize
potential features from the hidden representations, and then
predict the final action-value with the summarized features.
Moreover, [115] proposes a Social Recommendation frame-
work based on Reinforcement Learning (SRRL) to identify
reliable social relationships for the target user. In particular,
SRRL adaptively samples the social friends to improve the
recommendation quality with user feedback, since the reward
is always real-time.
B. Prior Knowledge
The combination of imitation learning with RL is known as apprenticeship learning [116], which utilizes demonstrations to initialize the RL agent. In particular, the Inverse Rein-
forcement Learning (IRL) algorithm [117] often assumes that
the expert acts to maximize the reward, and derives the optimal
policy from the learned reward function. In RL-based rec-
ommender systems, the reward signals are usually unknown,
since users scarcely offer feedback. On the other hand, IRL
algorithms are good at reconstructing the reward function for
the optimal policy from users’ observed trajectories.
To improve the novelty of next-POI (i.e., Point-of-Interest) recommendation and thereby boost users' interests, [118] adopts the Maximum Likelihood IRL (MLIRL) algorithm [119] to model the unknown user's preferences based on the state features. This method exploits the
knowledge of the user’s preferences to estimate an initial
reward function that justifies the observed trajectories, and
optimizes the user’s behavior by a gradient ascent method.
To mine users’ interests in online communities efficiently,
[120] designs a reinforced user profiling framework for recommendation
by employing both data-specific and expert knowledge, where
the agent employs random search to find data-specific paths in
the environment. These meta-paths contain expert knowledge
and semantic meanings for searching useful nodes.
Distinct from the top-K recommendation with the target
of ranking optimization, the novel exact-K recommendation
focuses on combinatorial optimization. It may be more suit-
able to address the recommendation problems in application
scenarios. Towards this end, [121] designs an encoder-decoder
framework named Graph Attention Networks (GAttN). The
proposed framework learns the joint distribution of the K
items and outputs an optimal card that contains these K items.
To train the GAttN efficiently, an RL from demonstrations
method integrates behavior cloning [122] into RL.
C. Reward Function Definition
The agent behavior in RL indirectly relies on the reward
function, while in practice, it is a key challenge to define
a promising reward function. Beyond the need to learn the
policy by intermediate rewards, the reward definition based on
the specific demands in different recommendation scenarios is
necessary. For instance, [62] defines a flexible reward function
that contains purchase interactions, long-term user engage-
ment, and item novelty for the relevant recommendation tasks.
[123] defines discrete rewards for different user behaviors.
Moreover, a one-step reward in [124] is distributed to the
online recommendations during visitor interactions.
A creative work is reported in [125] which leverages
GAN to model the dynamics of user behaviors and learn
relevant reward functions, which depends not only on the
user’s historical behaviors but also on the selected item. The
learned reward allows the agent to recommend items in a
principled way, instead of relying on the manual design. Based
on the simulation environment using the user behavior model,
a cascading DQN is proposed to learn the combinatorial rec-
ommendation policy. Similarly, [126] introduces a generative
IRL approach to avoid defining a reward function manually.
The recommendation problem is regarded as automatic policy
learning. Thus, this approach first generates a policy based on
the user’s preference. Later it uses a discriminative actor-critic
network to evaluate the learned policy, based on the reward
function defined by
R(s, a) = \log D(s, a) - \log\big(\max(\epsilon,\, 1 - \log D(s, a))\big) + r, \quad (11)

where log D(s, a) is the reward generated by the discriminator, r ∈ [0, 1] represents the user's feedback, which prevents the agent from taking an extra step to update itself when the user clicks each recommendation, and ϵ is the maximum percentage of change that is updated at a time.
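The following small sketch evaluates the adversarial reward in Eq. (11) as reconstructed above; the discriminator output and the clamp value ϵ are illustrative assumptions.

```python
# Small sketch evaluating the adversarial reward in Eq. (11) as reconstructed
# above; the clamp value and the discriminator output are assumptions.

import math

def adversarial_reward(d_sa, user_feedback_r, eps=0.1):
    """R(s, a) = log D(s, a) - log(max(eps, 1 - log D(s, a))) + r."""
    log_d = math.log(d_sa)
    return log_d - math.log(max(eps, 1.0 - log_d)) + user_feedback_r

# Discriminator fairly confident the (state, action) pair looks expert-like,
# and the user clicked the recommendation (r = 1):
print(adversarial_reward(d_sa=0.8, user_feedback_r=1.0))
```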
For commercial recommendation systems, online advertis-
ing is frequently inserted into personalized recommendation
to maximize the profit. To this aim, a value-aware recommen-
dation model [127] based on RL is designed to optimize the
economic value of candidate items to make recommendations.
In the value-aware model, measured by the gross merchandise volume, the total reward is defined as the expected profit converted from all types of user actions (e.g., click and pay).
Moreover, [128] presents an advertising strategy for DQN-
based recommendation, where the advertising agent simulta-
neously maximizes the income of advertising and minimizes
the negative influence of advertising on the user experience.
Thus, the reward contains both the income of advertising and
the influence of advertising on the user experience.
In practice, users have interests in exploring novel items. To
this end, [129] proposes a fast Monte Carlo tree search method
for diversifying recommendations, where the reward function
is designed with the diversity and accuracy gain derived from
recommending items at corresponding positions.
D. Learning Bias
RL algorithms can be classified into on-policy and off-
policy methods [49]. On-policy methods often sweep through
all states to learn the policy, which incurs high costs; hence, on-policy methods are not applicable to large-scale
recommender systems. On the other hand, learning the rec-
ommendation policy from logged data (e.g., logged user
feedback) is a more practical solution, because it alleviates
complex state space and high interaction cost with off-policy
[130], [131]. Nevertheless, due to the difference between the
target policy and the behavior policy, the off-policy methods
usually result in data biases or policy biases.
1) Data Biases: A large amount of logged implicit feedback, such as user clicks and dwell time, is available for learning users'
preferences. However, the learning methods tend to easily
suffer from biases caused by only observing partial feedback
on previous recommendations. To deal with such data biases,
[132] proposes a recommender system with the REINFORCE
algorithm, where an off-policy correction approach is utilized
to learn the recommendation policy from the logged implicit
feedback, and incorporates the learned model of the behavior
policy to adjust the data biases.
Differing from [132] that simply tackles data biases in
the candidate generation module, the scalable recommender
systems should contain not only the candidate generation stage
but also a more powerful ranking stage. Toward this end, a
two-stage off-policy method [133] is proposed to obtain data
biases from logged user feedback and correct such biases
by using inverse propensity scoring [134]. More precisely,
the ranking module usually changes between the logged data
(in behavior policy) and the candidate generation (in target
policy). When there are evidently different preferences on the
items to recommend, the two-stage off-policy policy gradient
can be conducted to correct such biases. Moreover, the vari-
ance is reduced by introducing a hyper-parameter to down-
weight the gradient of the sampled candidates.
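To illustrate the off-policy correction idea, here is a minimal PyTorch sketch of an importance-weighted REINFORCE update over logged feedback, with the ratio between the target policy and an estimated behavior policy capped to control variance; the cap value, networks, and data are assumptions rather than the cited systems' implementations.

```python
# Hedged sketch of off-policy-corrected REINFORCE: logged actions are
# re-weighted by pi_target(a|s) / pi_behavior(a|s), capped to limit variance.
# Networks, cap value, and the fake logged batch are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, state_dim = 100, 16
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_items))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One batch of logged interactions: state, action taken by the behavior
# policy, observed reward, and the behavior policy's (estimated) probability.
states = torch.randn(32, state_dim)
actions = torch.randint(0, n_items, (32,))
rewards = torch.rand(32)
behavior_probs = torch.rand(32).clamp(min=1e-3)

log_probs = F.log_softmax(policy(states), dim=-1)
target_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

# Importance weight pi_target(a|s) / pi_behavior(a|s), capped for variance control.
weights = (target_log_probs.exp() / behavior_probs).clamp(max=5.0).detach()

loss = -(weights * rewards * target_log_probs).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```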
2) Policy Biases: Direct offline learning methods, such as Monte Carlo and TD methods, suffer from either heavy computation or unstable convergence. To handle these
problems, [70] proposes the PDQ framework to tackle the
selection bias between the recommendation policy and logging
policy. More specifically, the offline learning process is divided
into two steps. In the first step, a user simulator is iteratively
updated under the recommendation policy via de-biasing the
selection bias. In the second step, the recommendation policy
is improved with Q-learning, by both the logged data and user
simulator. In addition, a regularizer for the reward function
reduces both bias and variance in the learned policy.
Moreover, [135] proposes a model-based RL method to
model the user-agent interactions for offline policy learning
with adversarial training, where the agent interacts with the
user behavior model to generate recommendations that ap-
proximate the true data distribution. To reduce the bias in
the learned policy, the discriminator is employed to rescale
the generated rewards for the policy learning, and de-bias the
user behavior model by distinguishing simulated trajectories
from the real interactions. As a result, the recommendation
performance can be improved. Similarly, in the Adversarial
future encoding (AFE) model [136], a future-aware discrimi-
nator is taken as a recommendation module to identify user-
item pairs, whereas a generator confuses the future-aware
discriminator by generating items with only common features.
The AFE model is optimized with the recommendation loss,
which reduces the optimization bias in the pre-training task.
E. Task Structuring
The complexity of the RL task can generally be reduced
by decomposing the RL task into several basic components
or a sequence of subtasks [20]. For recommender systems,
previous studies focus on Multi-Agent Reinforcement Learn-
ing (MARL), HRL, and Supervised Reinforcement Learning
(SRL).
1) Multi-agent Reinforcement Learning: RL-based recom-
mender systems suffer from inherent challenges (e.g., the curse
of dimensionality) when they employ a single agent to perform
the task. An alternative solution is to leverage multiple agents
with similar tasks to improve the learning efficiency with the
help of parallel computation. Generally, MARL algorithms can
be classified into four categories, i.e., fully cooperative, fully
competitive, both cooperative and competitive, and neither
cooperative nor competitive tasks [137].
Fully cooperative tasks in MARL-based recommender sys-
tems have the same goal (i.e., maximizing the same cumulative
return) achieved by all the agents [93]. For instance, in the
DEMER approach [108], the policy agent cooperates with the
environment agent to learn the policy of a hidden confounder.
Moreover, to capture the general preference of users and
their temporary interests, [138] introduces a Temporary In-
terest Aware Recommendation (TIARec) model with MARL.
Particularly, an auxiliary classifier agent can judge whether
each interaction is atypical or not. The classifier agent and
the recommender agent are jointly trained to maximize the
cumulative return of the recommendations.
In contrast, fully competitive tasks typically adopt the mini-
max principle: each agent maximizes its own benefit under the
assumption that the opponents keep endeavoring to minimize
it. In a dynamic collaboration recommendation method for
recommending collaborators to scholars [65], the competition
should be characterized as a latent factor, since scholars hope
to compete for better candidates. To this end, the proposed
method uses competitive MARL to model scholarly competi-
tion, i.e., multiple agents (authors) compete with each other by
learning the optimized recommendation trajectories. Besides
that, an improved market-based recommendation model [139]
urges all agents to classify their recommendations into various
internal quality levels, and employs Boltzmann exploration
strategy to conduct these tasks by the recommender agents.
Both cooperative and competitive tasks also exist in MARL-
based recommender systems [140]. For example, [141] aims
to recommend public accessible charging stations intelligently.
Each charging station is regarded as an individual agent. Sub-
sequently, a centralized attentive critic module is developed to
stimulate multiple agents to learn cooperative policies. Meanwhile, a delayed access strategy is proposed to exploit future charging competition information during model training. Besides, to ad-
dress the sub-optimal policy of ranking due to the competition
between independent recommender modules, [38] promotes
the cooperation of different modules by generating signals
for these modules. Each agent acts on the basis of its signal
without mutual communication. For the i-th agent with the signal vector ϕ^i, given a Q-value function Q^i_θ(S_t, A_t) and a shared signal network Φ^i(S_t), the objective function can be defined as follows:

J_{\phi}(\xi) = \frac{1}{N} \sum_{i} \mathbb{E}_{S_t, A^{i}_{t} \sim D,\, \phi^{i} \sim \Phi^{i}} \Big[ Q^{i}_{\theta}\big(S_t, A^{i}_{t}, \pi^{i}(S_t, \phi^{i}_{t})\big) + \alpha \log \Phi^{i}(\phi^{i}_{t} \mid S_t) \Big], \quad (12)
where ξ and θ are the network parameters, D denotes the
distribution of samples, and N is the batch size. The objective
function for each agent is optimized by the SAC algorithm.
2) Hierarchical Reinforcement Learning: In HRL, an RL
problem can be decomposed into a hierarchy of subproblems
or subtasks, which reduces the computational complexity.
There exist some studies for HRL, which solve the recom-
mendation problems well. For example, using Hierarchies of
Abstract Machines (HAMs) [142], [41] formalizes the overall
task of profile reviser as an MDP, and decomposes the task
MDP into two abstract subtasks. If the agent decides to revise
the user profile (i.e., a high-level task), it allows the high-
level task to call a low-level task to remove noisy courses. To
improve the recommendation adaptivity, a Dynamic Attention
and hierarchical Reinforcement Learning (DARL) framework
[92] is developed to automatically track the changes of the
user’s preferences in each interaction. These two methods
adopt the REINFORCE algorithm to optimize both the high-
level and low-level policy functions.
HRL with the MaxQ approach [143] builds a task hierarchy that restricts subtasks to different subsets of states, actions, and policies of the task MDP without introducing extra states. For recommender systems, a multi-goal abstraction based HRL model [11] is designed to learn the user's hierarchical interests. In
addition, [144] proposes a novel HRL model for the integrated
recommendation (i.e., simultaneously recommending the het-
erogeneous items from different sources). In the proposed
model, the task of integrated recommendation is divided into
two subtasks (i.e., sequentially recommending channels and items). The high-level agent works as a channel selector, which provides personalized channel lists in terms of the user's preferences. On the other hand, the low-level agent is regarded as an item recommender, which recommends heterogeneous items based on the channel constraints.

TABLE III
OVERVIEW OF CHALLENGES IN RL-BASED RECOMMENDATION APPROACHES.

Issue | Model | Evaluation Strategy
Environment Construction / State Representation | DRR [77], DEMER [108] | Online & Offline
Environment Construction / State Representation | PyRecGym [109] | Offline
Environment Construction / KG Leverage | KERL [91], KGQR [71], PGPR [26], ADAC [28], Attacks&Detection [13], KGPolicy [110] | Offline
Environment Construction / Negative Sampling | KGPolicy [110], RNS [112], SNQN [111], DCFGAN [113] | Offline
Environment Construction / Social Relation | SADQN [114] | Online
Environment Construction / Social Relation | SRRL [115] | Offline
Prior Knowledge | CBHR [118], DR [120], GAttN [121] | Offline
Reward Function Definition | SQN and SAC [62], VPQ [123], GAN-CDQN [125] | Offline
Reward Function Definition | DEAR [128] | Online & Offline
Reward Function Definition | PWR [124], Value-based RL [127], GIRL [126] | Online
Learning Bias / Data Biases | Off-policy Correction [132], 2-IPS [133] | Online & Offline
Learning Bias / Policy Biases | PDQ [70] | Offline
Learning Bias / Policy Biases | IRecGAN [135], AFE [136] | Online & Offline
Task Structuring / Multi-agent RL | DEMER [108], MASSA [38], DeepChain [93], RLCharge [140] | Online & Offline
Task Structuring / Multi-agent RL | Multi-with RL [65], INQ [139], TIARec [138], MASTER [141] | Offline
Task Structuring / Hierarchical RL | HRL+NAIS [41], DARL [92], SHIL [146] | Offline
Task Structuring / Hierarchical RL | MaHRL [11], HRL-Rec [144] | Online & Offline
Task Structuring / Supervised RL | SQN and SAC [62], SRL-RNN [43], SL+RL [9], PAR [147], Off-policy with guarantees [148] | Offline
Task Structuring / Supervised RL | SRR [149], EDRR [63] | Online & Offline
Another prevailing approach is the options framework (i.e.,
closed-loop policies for taking action over a period of time)
[145], which generalizes the primitive actions to include a
temporally extended navigation of actions. For example, [146]
designs a Subgoal conditioned Hierarchical Imitation Learning
(SHIL) framework for dynamic treatment recommendation.
In the SHIL framework, the high-level policy sequentially
selects a subgoal for each sub-task. Based on each subgoal, the
low-level policy produces the low-level action (i.e., effective
medication) in the corresponding state for each sub-task.
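A hedged sketch of this two-level decomposition, with a high-level policy that picks a subgoal and a low-level policy that picks a primitive action conditioned on the state and subgoal, is given below; the networks and dimensions are assumptions, not the SHIL architecture.

```python
# Hedged sketch of a two-level (high-level subgoal, low-level action) policy.
# Networks and dimensions are assumptions.

import torch
import torch.nn as nn

state_dim, n_subgoals, n_actions = 16, 4, 50

high_level = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_subgoals))
low_level = nn.Sequential(nn.Linear(state_dim + n_subgoals, 64), nn.ReLU(),
                          nn.Linear(64, n_actions))

def act(state):
    # High-level: choose a subgoal for the current sub-task.
    subgoal_logits = high_level(state)
    subgoal = torch.distributions.Categorical(logits=subgoal_logits).sample()
    subgoal_onehot = torch.nn.functional.one_hot(subgoal, n_subgoals).float()
    # Low-level: choose the primitive action (e.g., a medication or an item)
    # conditioned on the state and the selected subgoal.
    action_logits = low_level(torch.cat([state, subgoal_onehot], dim=-1))
    action = torch.distributions.Categorical(logits=action_logits).sample()
    return subgoal.item(), action.item()

print(act(torch.randn(state_dim)))
```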
3) Supervised Reinforcement Learning: In practice, recommendation problems can often be solved more easily by combining RL with supervised learning rather than using RL alone, as each paradigm can handle the subtasks for which it is best suited.
For example, [62] proposes a self-supervised RL model to
learn the recommendation policy from users’ logged feedback
in sequential recommender systems. The model has two output
layers: One is the self-supervised layer trained with cross-
entropy loss function to perform ranking. The other is trained
with RL based on a flexible reward function, which performs
as a regularizer for the supervised layer. Studies in [147], [148]
focus on optimizing the user’s life-time value in personalized
ad recommender systems. To achieve this goal, they propose
an RL-based recommendation model, where mapping from
features to actions is learned by a random forest algorithm.
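Returning to the two-head design of [62] described above, the following is a hedged sketch of combining a cross-entropy ranking head with a one-step Q-learning head that acts as a regularizer; the shared encoder, reward rule, and loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a two-head training signal: a cross-entropy ranking head
# plus an RL (one-step Q-learning) head acting as a regularizer. The encoder,
# reward rule, and loss weighting are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, hidden = 100, 32
encoder = nn.GRU(input_size=n_items, hidden_size=hidden, batch_first=True)
rank_head = nn.Linear(hidden, n_items)   # self-supervised (cross-entropy) head
q_head = nn.Linear(hidden, n_items)      # RL head (one-step Q-learning)

# A batch of user sequences (one-hot item embeddings) and their next items.
seqs = F.one_hot(torch.randint(0, n_items, (8, 5)), n_items).float()
next_items = torch.randint(0, n_items, (8,))
rewards = torch.ones(8)                  # e.g. 1.0 for a purchase (assumption)

_, h = encoder(seqs)
state = h.squeeze(0)                     # shared sequence representation

ce_loss = F.cross_entropy(rank_head(state), next_items)
q_pred = q_head(state).gather(1, next_items.unsqueeze(1)).squeeze(1)
q_loss = F.mse_loss(q_pred, rewards)     # one-step target: just the reward (assumption)

loss = ce_loss + 0.5 * q_loss            # RL head regularizes the ranking head
loss.backward()
```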
SRL is often applied to satisfy the adaptability of recom-
mendation strategies. For instance, to address the top-aware
drawback (i.e., the performance on the top positions) that
may reduce user experiences, a supervised DRL model [149]
jointly utilizes the supervision signals and RL signal to learn
the recommendation policy in a complementary way. Different
from the top-aware recommender distillation framework [150]
that utilizes DQN to reinforce the rank of recommendation
lists, the supervised DRL model contains two styles of supervi-
sion signals. It adopts the cross-entropy loss for classification-
based supervision signal, and employs pairwise ranking-based
loss for the ranking-based supervision signal. The suitable
supervision signals adaptively balance the long-term reward
and immediate reward.
Moreover, [43] puts forward SRL with RNN for dynamic
treatment recommender systems. By combining the supervised
learning signal (e.g., the indicator of doctor prescriptions) with
the RL signal (e.g., maximizing the cumulative reward from
survival rates), this approach learns the prescription policy
to refrain from unacceptable risks and provides the optimal
treatment. [9] also proposes a novel SRL with RNN for the
long-term recommendation. More precisely, RNN is employed
to adaptively evolve user states for simulating the sequential
interactions between recommender system and users.
To tackle the training compatibility among the components
of embedding, state representation, and policy in RL-based
recommender systems, [63] proposes an End-to-end DRL-
based Recommendation framework namely EDRR, in which
a supervised learning signal is designed as a classifier of
the user’s feedback on the recommendation results. Moreover,
DQN and DDPG are employed to elaborate how embedding
component works in the proposed framework.
F. Summary
An overview of challenges in RL-based recommendation
approaches is presented in Table III. Overall, a number of
studies have attempted to address the challenges of applying
RL in recommender systems. On the one hand, many studies
focus on adapting RL algorithms to different recommenda-
tion scenarios. For instance, a promising reward function
that is skillfully designed can meet specific requirements of
the recommender system. On the other hand, some studies
leverage related techniques to solve the issues of RL-based
recommender systems. For example, the KG not only enriches
user-item relations by the structural knowledge to describe
the MDP environment but also contributes to explainable
recommendations by the multi-hop paths. Nonetheless, there
are some other possible limitations of RL-based recommender
systems, as well as the challenges in complex application
scenarios. For these limitations and challenges, we provide
detailed discussions in the following section.
VI. DISCUSSION
In this section, we first discuss the central issues in ap-
plying RL algorithms for recommender systems. Afterwards,
we analyze practical challenges and provide several potential
insights into successful techniques for RL-based recommender
systems.
A. Open Issues
In practice, RL algorithms are difficult to apply directly to real-time recommender systems, due to their trial-and-error mechanism.
A key step in the application of RL algorithms is to improve
the recommendation effectiveness and realize an intelligent
design. Hence, the following issues need to be solved for
making much progress in this field.
1) Sampling Efficiency: Sampling plays a substantial role
in RL, especially in DRL-based recommender systems. Im-
portance sampling has empirically demonstrated its feasibility in both on-policy and off-policy settings. Some works [110],
[112] have highlighted the superiority of negative sampling
in recommender systems. Nevertheless, it is necessary to
focus on the improvement in sampling efficiency, since the
user feedback available to train recommendation models is
scarce. We can leverage auxiliary tasks to improve sampling
efficiency. For example, [151] develops a user response model
to predict the user’s positive or negative responses towards the
recommendation results. Thus, the state and action representa-
tions can be enhanced with these responses. Moreover, transfer
learning may be qualified for the sampling task. For example,
we can use the transfer of knowledge between temporal tasks
[152] to perform RL on an extended state space, and concretize
similarity between the source task and the target task by
logical transferability. In addition, the transfer of experience
samples [153] can estimate the relevance of source samples.
Both transfer learning methods may be able to improve the
sampling efficiency in RL-based recommender systems.
2) Reproducibility: Although most existing studies have performed empirical experiments on public datasets, as shown in Table II, it is often difficult to reproduce the results of RL algorithms in many application scenarios, including recommendation. The reasons include the instability of RL algorithms, the lack of open-source code and detailed accounts of hyper-parameters, and differences in simulation environments (e.g., experimental setups and implementation details). Especially for DRL methods, the non-determinism of the policy network and the intrinsic variance of these methods lead to a reproducibility crisis. Besides policy evaluation methods, there are several
future directions for the reproducibility investigation with
statistical analysis. We can use significance testing according
to related metrics or hyper-parameters such as batch size and
reward scale [154]. For example, the reward scale needs to
check its rationality and robustness in specific recommenda-
tion scenarios. The average returns should be evaluated to
verify whether it is proportional to the relevant performance.
Moreover, because RL algorithms are sensitive to changes in their environments, random seeds, and the definition of the reward function, we should ensure reproducible results with
fair comparisons. To achieve this objective, we need to run
the same experiment trials for baseline algorithms, and take
each evaluation with the same preset epoch. Moreover, all
experiments should adopt the same random seeds [154].
3) MDP Formulation: In principle, MDP formulation is
essential to guarantee the performance of RL algorithms.
However, the state and action spaces in recommender systems
often suffer from the curse of dimensionality. Besides, the
reward definition is sensitive to the external environment since
users’ demands may change constantly, while interactions
between users and items are random or uncertain. To address
these issues, we can employ task-specific inductive biases
to learn the representations of selection-specific action. As
a result, the agent relies on better action structures to learn
RL policies when the recommendation problem is formulated
[155]. Another direction is to use causal graph and probabilis-
tic graph models [156] to enable the MDP formulation, where
the causal graph describes the causal-effect relations among
user-item interactions, and the probabilistic graph reasons the
recommendation paths. This joint process can improve the
efficiency of agent search and tracks users’ interests over time.
However, how to design MDPs for recommendation problems
in a verifiable way, remains an open issue.
4) Generalization: Model generalization ability is pursued by almost all applications. Limited by the shortcomings
of different RL algorithms, it is difficult to develop a general
framework to meet various specific requirements of recom-
mendation. On the one hand, model-based algorithms are only
applicable to the specified recommendation problem, and fail
to solve the problem that cannot be modeled. On the other
hand, model-free algorithms are insufficient in dealing with
different tasks in complex environments. Fortunately, meta-
RL, such as Meta-Strategy [157] and Meta-Q-Learning [158],
first learns a large number of RL tasks to obtain enough
prior knowledge, and later can be quickly adapted to the
new environment in face of new tasks. In this case, it may
enable the generalization of RL-based recommender systems.
Besides transfer learning based on meta-RL, we can use multi-
task learning [159] to handle related tasks (e.g., construction
of user profiles, recommendation, and causal reasoning) in
parallel. These tasks complement each other to improve the
generalization performance via shared representation of do-
main information, such as parameter-based sharing, and joint
feature-based sharing.
5) Autonomy: It seems easy to utilize RL for autonomous
control [48]. However, in practice, it is difficult to achieve
this objective in online recommendation scenarios, since there
are complex interactions between users and items, and it is not feasible to capture users' dynamic intentions with existing
RL algorithms. To this end, we can achieve the feedback
control of recommender systems by combining RL with LSTM
or GRU, which enables the RL agent to be possessed of
powerful memory that preserves the sequences of state and
action, as well as different kinds of reward. Later, the historical
information is encoded and transmitted to the policy network,
thereby improving the autonomous navigation ability of the RL
agent. Another interesting research direction is to improve the
autonomous learning ability of RL from the fields of biology,
neuroscience, and cybernetics. Thus, it is straightforward to
use the learned knowledge to make better recommendations.
B. Practical Challenges
The existing recommendation models based on RL have
demonstrated superior recommendation performance. Never-
theless, there are many challenges and opportunities in this
field. We summarize five potential directions that deserve more
research efforts from the practical aspect.
1) Computational Complexity: RL-based recommender systems often suffer from heavy computation due to the
exploration-exploitation tradeoff and the curse of dimension-
ality [160]. Apart from task structuring techniques, the IRL
algorithm is an efficient solution to reduce the computa-
tion cost. Initializing RL with demonstrations can guide the agent to take correct actions, particularly for some specific recommendation tasks. For example, apprenticeship
learning via IRL algorithm [116] highlights the need for
learning from an expert, which maximizes a reward based
on a linear combination of known features. The hierarchical
DQNs [161] alleviate the curse of dimensionality in large-scale
recommender systems. Besides, a promising method of driving
route recommendation based on RL is to employ behavior
cloning [122], which makes expert trajectories available and
quickens the learning speed. Another feasible scheme is to
improve the efficiency of agent exploration. For example, we
can adopt NoisyNet [162] that adds parametric noise to its
network weights, which aids efficient exploration according
to the stochasticity of the policy for the recommender agent.
2) Evaluation: Most existing RL-based recommender sys-
tems focus on the single goal of recommendation accuracy,
without considering recommendation novelty and diversity
[163] based on the user experience. Beyond the need for new
evaluation measures for multi-objective goals of recommen-
dation (e.g., popularity rate [164]), we should design stan-
dard metrics for other non-standardized evaluation measures
(e.g., diversity, novelty, explainability, and safety). In general, these kinds of evaluation measures can be regarded as combinatorial optimization problems, which may be well addressed by multi-objective evolutionary algorithms [165].
Recently, some works have developed Pareto efficiency models
for multi-objective recommendation [166], [167]. For example,
in the personalized approximate Pareto-efficient recommender
system [168], a Pareto-oriented RL module learns personalized
objective weights on multiple objectives for the target user.
Nevertheless, it remains a challenge to reconcile different evaluation measures for recommendation, since these measures are usually correlated and may even conflict with each other.
3) Biases: In recommender systems, item popularity often
changes over time due to the user engagement and recom-
mendation strategy [169], thus long-tailed items are rarely
recommended to users. The selection bias may hurt user satis-
faction [170]. Therefore, it is necessary to concentrate on the
fairness work. Towards this end, we can combine anthropology
to analyze the differences in human behavior and cultural
characteristics, or explore the user’s intentions in the user-
recommender interactions. Besides, heterogeneous data of the
user behavior often exists in online recommendation platforms,
whereas most recommendation approaches are trained with a
single type of data. Due to the information asymmetry between
the user behavior and training model, the recommender system
suffers from learning bias. Undoubtedly, the previous feature
extraction and representation learning are crucial to deal with
such bias. In addition, MARL can be used to let different agents process multi-dimensional data simultaneously and share parameters for a unified recommendation goal.
4) Interpretability: Due to the complexity of RL, the
post-hoc explanation may be easier to achieve than intrin-
sic interpretability [171]. Actually, the same is true in RL-
based recommendation systems. How to explore intrinsic
interpretable methods for RL to provide more convincing
recommendations is a promising line. Another research direc-
tion towards explainable recommendation is to provide formal
guidance of recommendation reasoning process, rather than
being concerned with a multi-hop reasoning process [172].
[173] proposes a user-centric path reasoning framework that
adopts a multi-view structure to combine sequence reasoning
information with subsequent user decision-making. However,
they only focus on the user’s demand and do not provide
theoretical proof of the rationality of reasoning. We should
avoid plausible explanations in the reasoning process. To
this end, we can employ multi-task learning to perform the
recommendation reasoning process among multiple related
tasks, e.g., representation of interaction, recommended path
generation, and Bayesian inference for policy network. These
related tasks jointly provide credible recommendation expla-
nations by sharing the representation of internal correlation
and causality.
5) Safety and Privacy: System security and user privacy are
important issues, which are ignored by most existing studies.
For example, personal privacy can easily be leaked when using RL and KG to perform explainable recommendation,
because the relationships among users and items are exposed.
Differential privacy is widely applied to protect user privacy,
and DRL can be employed to choose the privacy budget
against inference attacks [174]. Besides, recent studies have
found that Deep Neural Networks (DNNs) are vulnerable to
attacks, such as adversarial attacks and data poisoning. For
example, in DNNs-based recommender systems, users with
fake profiles may be generated to promote selected items.
To address this issue, a novel black-box attacking framework
[175] adopts policy gradient networks to refine real user
profiles from a source domain and copy them into the target
domain. Notwithstanding, more efforts are needed on this research topic. For example, safe RL [176] can be
utilized to guarantee the security of recommender systems
(e.g., detecting error states, and preventing abnormal agent
actions). We may also leverage federated learning [177],
[178] to achieve privacy-preserving data analysis for MARL-
based recommender systems, where each agent is localized on a distributed device and updates a local model based on the user data stored on the corresponding user device.
VII. CONCLUSION
Recommender systems serve as a powerful technique to
address the information overload issue on the Internet. There
has been increasing interest in extending RL approaches for
recommendations in recent years. RL-based recommendation
methods autonomously learn the optimal recommendation
policies from user-item interactions, and thus they recommend
better items to users, compared with other recommendation
methods. In this survey, we firstly conduct a comprehensive
review on RL-based recommender systems, using three major
categories of RL (i.e., value-function, policy search, and Actor-
Critic) to cover four typical recommendation scenarios. We
also restructure the general frameworks for some specific
scenarios, such as interactive recommendation, conversational
recommendation, and the explainable recommendation based
on KG. Furthermore, the challenges of applying RL in rec-
ommender systems are systematically analyzed, including en-
vironment construction, prior knowledge, the definition of the
reward function, learning bias, and task structuring. To facili-
tate future progress in this field, we discuss theoretical issues
of RL and analyze the limitations of existing approaches, and
finally put forward some possible future directions.
REFERENCES
[1] S. Deng, L. Huang, G. Xu, X. Wu, and Z. Wu, “On deep learning for
trust-aware recommendations in social networks, IEEE Trans. Neural
Netw. Learn. Syst., vol. 28, no. 5, pp. 1164–1177, May 2017.
[2] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez, “Recom-
mender systems survey, Knowl.-Based Syst., vol. 46, pp. 109–132,
July 2013.
[3] Z. Huang, X. Xu, H. Zhu, and M. Zhou, “An efficient group recommen-
dation model with multiattention-based neural networks, IEEE Trans.
Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4461–4474, November
2020.
[4] G. Adomavicius and A. Tuzhilin, “Toward the next generation of
recommender systems: A survey of the state-of-the-art and possible
extensions, IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734–
749, June 2005.
[5] Y. Shi, M. Larson, and A. Hanjalic, “Collaborative filtering beyond the
user-item matrix: A survey of the state of the art and future challenges,
ACM Computing Surveys, vol. 47, no. 1, p. p. 3, May 2014.
[6] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based rec-
ommender system: A survey and new perspectives, ACM Computing
Surveys, vol. 52, no. 1, p. p. 5, February 2019.
[7] W. Zhao, B. Wang, M. Yang, J. Ye, Z. Zhao, X. Chen, and Y. Shen,
“Leveraging long and short-term information in content-aware movie
recommendation via adversarial training, IEEE Trans. Syst. Man
Cybern., vol. 50, no. 11, pp. 4680–4693, November 2020.
[8] F. Pan, Q. Cai, P. Tang, F. Zhuang, and Q. He, “Policy gradients for
contextual recommendations, in Proc. WWW, 2019, pp. 1421–1431.
[9] L. Huang, M. Fu, F. Li, H. Qu, Y. Liu, and W. Chen, A deep
reinforcement learning based long-term recommender system,Knowl.-
Based Syst., vol. 213, p. 106706, February 2021.
[10] L. Ji, Q. Qin, B. Han, and H. Yang, “Reinforcement learning to
optimize lifetime value in cold-start recommendation, in Proc. CIKM,
2021, pp. 782–791.
[11] D. Zhao, L. Zhang, B. Zhang, L. Zheng, Y. Bao, and W. Yan, “Mahrl:
Multi-goals abstraction based deep hierarchical reinforcement learning
for recommendations, in Proc. SIGIR, 2020, pp. 871–880.
[12] X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, and X. Xie, A reinforce-
ment learning framework for explainable recommendation, in Proc.
IEEE Int. Conf. Data Mining (ICDM), 2018, pp. 587–596.
[13] Y. Cao, X. Chen, L. Yao, X. Wang, and W. E. Zhang, Adversarial
attacks and detection on reinforcement learning-based interactive rec-
ommender systems, in Proc. SIGIR, 2020, pp. 1669–1672.
[14] E. O. Neftci and B. B. Averbeck, “Reinforcement learning in artificial
and biological systems,Nat. Mach. Intell., vol. 1, pp. 133–143, March
2019.
[15] H. Li, D. Liu, and D. Wang, “Manifold regularized reinforcement
learning, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp.
932–943, April 2018.
[16] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,
“Deep reinforcement learning: A brief survey, IEEE Signal Proc.
Mag., vol. 34, no. 6, pp. 26–38, November 2017.
[17] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and
Z. Li, “Drn: A deep reinforcement learning framework for news
recommendation, in Proc. WWW, 2018, pp. 167–176.
[18] D. Zha, L. Feng, B. Bhushanam, D. Choudhary, J. Nie, Y. Tian, J. Chae,
Y. Ma, A. Kejariwal, and X. Hu, Autoshard: Automated embedding
table sharding for recommender systems, in Proc. 28th ACM SIGKDD
Int. Conf. Knowl. Discovery Data Mining, 2022, pp. 4461–4471.
[19] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G.
Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman,
N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hass-
abis, K. Kavukcuoglu, and T. Graepel, “Human-level performance in
3d multiplayer games with population-based reinforcement learning,
Science, vol. 364, no. 6443, pp. 859–865, May 2019.
[20] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in
robotics: A survey, Artif. Intell., vol. 32, no. 11, pp. 1238–1274,
September 2013.
[21] L. Zou, L. Xia, Y. Gu, X. Zhao, W. Liu, J. X. Huang, and D. Yin,
“Neural interactive collaborative filtering, in Proc. SIGIR, 2020, pp.
749–758.
[22] Q. Liu, S. Tong, C. Liu, H. Zhao, E. Chen, H. Ma, and S. Wang,
“Exploiting cognitive structure for adaptive learning, in Proc. 25th
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2019, pp.
627–635.
[23] S. Ji, Z. Wang, T. Li, and Y. Zheng, “Spatio-temporal feature fusion for
dynamic taxi route recommendation via deep reinforcement learning,
Knowl.-Based Syst., vol. 205, p. 106302, October 2020.
[24] H. Lee, D. Hwang, K. Min, and J. Choo, “Towards validating long-
term user feedbacks in interactive recommendation systems, in Proc.
SIGIR, 2022, pp. 2607–2611.
[25] K. Wang, Z. Zou, Q. Deng, R. Wu, J. Tao, C. Fan, L. Chen, and P. Cui,
“Reinforcement learning with a disentangled universal value function
for item recommendation, in Proc. AAAI, 2021, pp. 4427–4435.
[26] Y. Xian, Z. Fu, S. Muthukrishnan, G. de Melo, and Y. Zhang,
“Reinforcement knowledge graph reasoning for explainable recommen-
dation, in Proc. SIGIR, 2019, pp. 285–294.
[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning, in Proc. ICLR, 2016, pp. 1–14.
[28] K. Zhao, X. Wang, Y. Zhang, L. Zhao, Z. Liu, C. Xing, and X. Xie,
“Leveraging demonstrations for reinforcement recommendation reason-
ing over knowledge graphs, in Proc. SIGIR, 2020, pp. 239–248.
[29] H. Wang, F. Zhang, X. Xie, and M. Guo, “Dkn: Deep knowledge-
aware network for news recommendation, in Proc. WWW, 2018, pp.
1835–1844.
[30] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr:
Bayesian personalized ranking from implicit feedback, in Proc. UAI,
2009, pp. 452–461.
[31] H. Wang, F. Zhang, J. Wang, M. Zhao, W. Li, X. Xie, and M. Guo,
“Ripplenet: Propagating user preferences on the knowledge graph for
recommender systems, in Proc. CIKM, 2018, pp. 417–426.
[32] L. Zhang, Z. Sun, J. Zhang, Y. Wu, and Y. Xia, “Conversation-based
adaptive relational translation method for next poi recommendation
with uncertain check-ins, IEEE Trans. Neural Netw. Learn. Syst., pp.
1–14, February 2022.
[33] F. Zhou, R. Yin, K. Zhang, G. Trajcevski, T. Zhong, and J. Wu,
Adversarial point-of-interest recommendation, in Proc. WWW, 2019,
pp. 3462–3468.
[34] Z. Fu, L. Yu, and X. Niu, “Trace: Travel reinforcement recommenda-
tion based on location-aware context extraction, ACM Trans. Knowl.
Discovery Data, vol. 16, no. 4, pp. 1–22, August 2022.
[35] Y. Sun, F. Zhuang, H. Zhu, Q. He, and H. Xiong, “Cost-effective
and interpretable job skill recommendation with deep reinforcement
learning, in Proc. WWW, 2021, pp. 3827–3838.
[36] Y. Wang, A hybrid recommendation for music based on reinforcement
learning, in Proc. PAKDD, 2020, pp. 91–103.
[37] P. Wei, S. Xia, R. Chen, J. Qian, C. Li, and X. Jiang, A deep-
reinforcement-learning-based recommender system for occupant-driven
energy optimization in commercial buildings, IEEE Internet Things J.,
vol. 7, no. 7, pp. 6402–6413, July 2020.
[38] X. He, B. An, Y. Li, H. Chen, R. Wang, X. Wang, R. Yu, X. Li, and
Z. Wang, “Learning to collaborate in multi-module recommendation via
multi-agent reinforcement learning without communication, in Proc.
ACM Conf. Rec. Syst., 2020, pp. 210–219.
[39] G. Ke, H.-L. Du, and Y.-C. Chen, “Cross-platform dynamic goods
recommendation system based on reinforcement learning and social
networks, Appl. Soft Comput., vol. 104, p. 107213, June 2021.
[40] J. O, J. Lee, J. W. Lee, and B.-T. Zhang, Adaptive stock trading with
dynamic asset allocation using reinforcement learning, Inf. Sci., vol.
176, no. 15, pp. 2121–2147, August 2006.
[41] J. Zhang, B. Hao, B. Chen, C. Li, H. Chen, and J. Sun, “Hierarchical
reinforcement learning for course recommendation in moocs, in Proc.
AAAI, 2019, pp. 435–442.
[42] Y. Lin, F. Lin, W. Zeng, J. Xiahou, L. Li, P. Wu, Y. Liu, and
C. Miao, “Hierarchical reinforcement learning with dynamic recurrent
mechanism for course recommendation, Knowl.-Based Syst., vol. 244,
p. 108546, May 2022.
[43] L. Wang, W. Zhang, X. He, and H. Zha, “Supervised reinforcement
learning with recurrent neural network for dynamic treatment recom-
mendation, in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2018, pp. 2447–2456.
[44] Z. Zheng, C. Wang, T. Xu, D. Shen, P. Qin, X. Zhao, B. Huai, X. Wu,
and E. Chen, “Interaction-aware drug package recommendation via
policy gradient, ACM Trans. Inf. Sys., pp. 1–32, February 2022.
[45] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and
S. A. Murphy, “Informing sequential clinical decision-making through
reinforcement learning: an empirical study, Mach. learn., vol. 84,
no. 1, pp. 109–136, July 2011.
[46] M. M. Afsar, T. Crump, and B. H. Far, “Reinforcement learning based
recommender systems: A survey, ACM Comput. Surv., pp. 1–37, June
2022.
[47] X. Chen, L. Yao, J. Mcauley, G. Zhou, and X. Wang, A survey of deep
reinforcement learning in recommender systems: A systematic review
and future directions, ArXiv Preprint ArXiv:2109.03540v1, 2021.
[48] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis,
“Optimal and autonomous control using reinforcement learning: A
survey, IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp.
2042–2062, June 2018.
[49] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction,
2nd ed. Massachusetts Ave, MA: MIT, 2018.
[50] C. J. Watkins and P. Dayan, “Technical note q-learning, Mach. Learn.,
vol. 8, no. 3, pp. 279–292, May 1992.
[51] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforce-
ment learning, ArXiv Preprint ArXiv:1312.5602, 2013.
[52] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning, Mach. Learn., vol. 8, no. 3, pp.
229–256, May 1992.
[53] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy
gradient methods for reinforcement learning with function approxima-
tion, in Proc. NIPS, 2000, pp. 1057–1063.
[54] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms, ArXiv Preprint
ArXiv:1707.06347, 2017.
[55] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust
region policy optimization, in Proc. ICML, 2015, pp. 1889–1897.
[56] S. Levine and V. Koltun, “Guided policy search, in Proc. ICML, 2013,
pp. 1–9.
[57] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J.
Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[58] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
reinforcement learning,” in Proc. ICML, 2016, pp. 1928–1937.
[59] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor, in Proc. ICML, 2018, pp. 1856–1865.
[60] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-
miller, “Deterministic policy gradient algorithms, in Proc. ICML,
2014, pp. 387–395.
[61] S. Adam, L. Busoniu, and R. Babuska, “Experience replay for real-
time reinforcement learning control, IEEE Trans. Syst. Man Cybern.,
vol. 42, no. 2, pp. 201–212, March 2012.
[62] X. Xin, A. Karatzoglou, I. Arapakis, and J. M. Jose, “Self-supervised
reinforcement learning for recommender systems, in Proc. SIGIR,
2020, pp. 931–940.
[63] F. Liu, H. Guo, X. Li, R. Tang, Y. Ye, and X. He, “End-to-end
deep reinforcement learning based recommendation with supervised
embedding, in Proc. WSDM, 2020, pp. 384–392.
[64] N. Taghipour, A. Kardan, and S. S. Ghidary, “Usage-based web
recommendations: A reinforcement learning approach, in Proc. ACM
Conf. Rec. Syst., 2007, pp. 113–120.
[65] Y. Zhang, C. Zhang, and X. Liu, “Dynamic scholarly collaborator
recommendation via competitive multi-agent reinforcement learning,
in Proc. ACM Conf. Rec. Syst., 2017, pp. 331–335.
[66] T. Mahmood and F. Ricci, “Learning and adaptivity in interactive
recommender systems, in Proc. ICEC, 2007, pp. 75–84.
[67] R. Gao, H. Xia, J. Li, D. Liu, S. Chen, and G. Chun, “Drcgr: Deep
reinforcement learning framework incorporating cnn and gan-based for
interactive recommendation, in Proc. IEEE Int. Conf. Data Mining
(ICDM), 2019, pp. 1048–1053.
[68] Y. Lei and W. Li, “Interactive recommendation with user-specific deep
reinforcement learning, ACM Trans. Knowl. Discovery Data, vol. 13,
no. 6, p. 61, October 2019.
[69] L. Zou, L. Xia, Z. Ding, J. Song, W. Liu, and D. Yin, “Reinforcement
learning to optimize long-term user engagement in recommender
systems, in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2019, pp. 2810–2818.
[70] L. Zou, L. Xia, P. Du, Z. Zhang, T. Bai, W. Liu, J.-Y. Nie, and D. Yin,
“Pseudo dyna-q: A reinforcement learning framework for interactive
recommendation, in Proc. WSDM, 2020, pp. 816–824.
[71] S. Zhou, X. Dai, H. Chen, W. Zhang, K. Ren, R. Tang, X. He,
and Y. Yu, “Interactive recommender system via knowledge graph-
enhanced reinforcement learning, in Proc. SIGIR, 2020, pp. 179–188.
[72] R. Zhang, T. Yu, Y. Shen, H. Jin, and C. Chen, “Text-based interactive
recommendation via constraint-augmented reinforcement learning,” in
Proc. NIPS, 2019, pp. 15214–15224.
[73] H. Chen, X. Dai, H. Cai, W. Zhang, X. Wang, R. Tang, Y. Zhang, and
Y. Yu, “Large-scale interactive recommendation with tree-structured
policy gradient, in Proc. AAAI, 2019, pp. 3312–3320.
[74] W. Liu, F. Liu, R. Tang, B. Liao, G. Chen, and P. A. Heng, “Balancing
between accuracy and fairness for interactive recommendation with
reinforcement learning, in Proc. Pacific-Asia Conf. Knowl. Discovery
Data Mining, 2020, pp. 155–167.
[75] T. Xiao and D. Wang, “A general offline reinforcement learning
framework for interactive recommendation,” in Proc. AAAI, 2021.
[76] T. Yu, Y. Shen, R. Zhang, X. Zeng, and H. Jin, “Vision-language
recommendation via attribute augmented multimodal reinforcement
learning,” in Proc. ACM Int. Conf. Multimedia, 2019, pp. 39–47.
[77] F. Liu, R. Tang, X. Li, W. Zhang, Y. Ye, H. Chen, H. Guo, Y. Zhang,
and X. He, “State representation modeling for deep reinforcement
learning based recommendation, Knowl.-Based Syst., vol. 205, p.
106170, October 2020.
[78] M. S. Llorente and S. E. Guerrero, “Increasing retrieval quality in
conversational recommenders,” IEEE Trans. Knowl. Data Eng., vol. 24,
no. 10, pp. 1876–1888, October 2012.
[79] T. Mahmood and F. Ricci, “Improving recommender systems with
adaptive conversational strategies, in Proc. HT, 2009, pp. 73–82.
[80] Y. Wu, C. Macdonald, and I. Ounis, “Partially observable reinforcement
learning for dialog-based interactive recommendation, in Proc. ACM
Conf. Rec. Syst., 2021, pp. 241–251.
[81] D. Tsumita and T. Takagi, “Dialogue based recommender sys-
tem that flexibly mixes utterances and recommendations, in Proc.
IEEE/WIC/ACM Int. Conf. Web Intelligence, 2019, pp. 51–58.
[82] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, and T.-S. Chua,
“Interactive path reasoning on graph for conversational recommenda-
tion, in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2020, pp. 2073–2083.
[83] Y. Deng, Y. Li, F. Sun, B. Ding, and W. Lam, “Unified conversational
recommendation policy learning via graph-based reinforcement learn-
ing, in Proc. SIGIR, 2021, pp. 1431–1441.
[84] Y. Sun and Y. Zhang, “Conversational recommender system, in Proc.
SIGIR, 2018, pp. 235–244.
[85] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M.-Y. Kan, and T.-S.
Chua, “Estimation-action-reflection: Towards deep interaction between
conversational and recommender systems, in Proc. WSDM, 2020, pp.
304–312.
[86] X. Ren, H. Yin, T. Chen, H. Wang, N. Q. V. Hung, Z. Huang,
and X. Zhang, “Crsal: Conversational recommender systems with
adversarial learning, ACM Trans. Inf. Sys., vol. 38, no. 4, pp. 1–40,
October 2020.
[87] A. Montazeralghaem and J. Allan, “Extracting relevant information
from user’s utterances in conversational search and recommendation,
in Proc. 28th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2022, pp. 1275–1283.
[88] X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin, “Recom-
mendations with negative feedback via pairwise deep reinforcement
learning, in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2018, pp. 1040–1048.
[89] D. Hong, Y. Li, and Q. Dong, “Nonintrusive-sensing and
reinforcement-learning based adaptive personalized music recommen-
dation, in Proc. SIGIR, 2020, pp. 1721–1724.
[90] O. Moling, L. Baltrunas, and F. Ricci, “Optimal radio channel recom-
mendations with explicit and implicit feedback, in Proc. ACM Conf.
Rec. Syst., 2012, pp. 75–82.
[91] P. Wang, Y. Fan, L. Xia, W. X. Zhao, S. Niu, and J. Huang, “Kerl: A
knowledge-guided reinforcement learning model for sequential recom-
mendation, in Proc. SIGIR, 2020, pp. 209–218.
[92] Y. Lin, S. Feng, F. Lin, W. Zeng, Y. Liu, and P. Wu, “Adaptive course
recommendation in moocs,” Knowl.-Based Syst., vol. 224, p. 107085,
July 2021.
[93] X. Zhao, L. Xia, L. Zou, H. Liu, D. Yin, and J. Tang, “Whole-chain
recommendations,” in Proc. CIKM, 2020, pp. 1883–1891.
[94] S. Antaris and D. Rafailidis, “Sequence adaptation via reinforcement
learning in recommender systems, in Proc. ACM Conf. Rec. Syst.,
2021, pp. 714–718.
[95] Y. Lu, R. Dong, and B. Smyth, “Why i like it: Multi-task learning
for recommendation and explanation, in Proc. ACM Conf. Rec. Syst.,
2018, pp. 4–12.
[96] S. Tao, R. Qiu, Y. Ping, and H. Ma, “Multi-modal knowledge-
aware reinforcement learning network for explainable recommenda-
tion, Knowl.-Based Syst., vol. 227, p. 107217, September 2021.
[97] S.-J. Park, D.-K. Chae, H.-K. Bae, S. Park, and S.-W. Kim, “Reinforce-
ment learning over sentiment-augmented knowledge graphs towards
accurate and explainable recommendation, in Proc. WSDM, 2022, pp.
784–793.
[98] D. Liu, J. Lian, Z. Liu, X. Wang, G. Sun, and X. Xie, “Reinforced
anchor knowledge graph generation for news recommendation reason-
ing, in Proc. 27th ACM SIGKDD Int. Conf. Knowl. Discovery Data
Mining, 2021, pp. 1055–1065.
[99] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning, in Proc. AAAI, 2016, pp. 2094–2100.
[100] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and
N. de Freitas, “Dueling network architectures for deep reinforcement
learning,” in Proc. ICML, 2016, pp. 1995–2003.
[101] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,
in Proc. NIPS, 2014, pp. 2672–2680.
[102] C. Gao, W. Lei, X. He, M. de Rijke, and T.-S. Chua, “Advances and
challenges in conversational recommender systems: A survey,” ArXiv
Preprint ArXiv:2101.09459, 2021.
[103] C. Hu, S. Huang, Y. Zhang, and Y. Liu, “Learning to infer user implicit
preference in conversational recommendation, in Proc. SIGIR, 2022,
pp. 256–266.
[104] S. Rendle, “Factorization machines, in Proc. IEEE Int. Conf. Data
Mining (ICDM), 2010, pp. 995–1000.
[105] A. Schwartz, “A reinforcement learning method for maximizing undis-
counted rewards,” in Proc. ICML, 1993, pp. 298–305.
[106] X. Wang, K. Liu, D. Wang, L. Wu, Y. Fu, and X. Xie, “Multi-level
recommendation reasoning over knowledge graphs with reinforcement
learning, in Proc. WWW, 2022, pp. 2098–2108.
[107] Y. Zhang and X. Chen, “Explainable recommendation: A survey and
new perspectives, Foundations and Trends in Information Retrieval,
vol. 14, no. 1, pp. 1–101, March 2020.
[108] W. Shang, Y. Yu, Q. Li, Z. Qin, Y. Meng, and J. Ye, “Environment
reconstruction with hidden confounders for reinforcement learning
based recommendation,” in Proc. 25th ACM SIGKDD Int. Conf. Knowl.
Discovery Data Mining, 2019, pp. 566–576.
[109] B. Shi, M. G. Ozsoy, N. Hurley, B. Smyth, E. Z. Tragos, J. Geraci,
and A. Lawlor, “Pyrecgym: A reinforcement learning gym for recom-
mender systems, in Proc. ACM Conf. Rec. Syst., 2019, pp. 491–495.
[110] X. Wang, Y. Xu, X. He, Y. Cao, M. Wang, and T.-S. Chua, “Reinforced
negative sampling over knowledge graph for recommendation,” in Proc.
WWW, 2020, pp. 99–109.
[111] X. Xin, A. Karatzoglou, I. Arapakis, and J. M. Jose, “Supervised
advantage actor-critic for recommender systems, in Proc. WSDM,
2022, pp. 1186–1196.
[112] J. Ding, Y. Quan, X. He, Y. Li, and D. Jin, “Reinforced negative
sampling for recommendation with exposure data, in Proc. IJCAI,
2019, pp. 2230–2236.
[113] J. Zhao, H. Li, L. Qu, Q. Zhang, Q. Sun, H. Huo, and M. Gong,
“Dcfgan: An adversarial deep reinforcement learning framework with
improved negative sampling for session-based recommender systems,
Inf. Sci., vol. 596, pp. 222–235, June 2022.
[114] Y. Lei, Z. Wang, W. Li, H. Pei, and Q. Dai, “Social attentive deep q-
networks for recommender systems, IEEE Trans. Knowl. Data Eng.,
p. 99, July 2020.
[115] Z. Lu, M. Gao, X. Wang, J. Zhang, H. Ali, and Q. Xiong, “Srrl:
Select reliable friends for social recommendation with reinforcement
learning, in Proc. Int. Conf. Neural Inf. Process., 2019, pp. 631–642.
[116] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforce-
ment learning, in Proc. ICML, 2004, pp. 1–8.
[117] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement
learning,” in Proc. ICML, 2000, pp. 663–670.
[118] D. Massimo and F. Ricci, “Harnessing a generalised user behaviour model
for next-poi recommendation,” in Proc. ACM Conf. Rec. Syst., 2018,
pp. 402–406.
[119] M. Babes, V. Marivate, K. Subramanian, and M. L. Littman, “Appren-
ticeship learning about multiple intentions,” in Proc. ICML, 2011, pp.
897–904.
[120] H. Liang, “Drprofiling: Deep reinforcement user profiling for rec-
ommendations in heterogenous information networks, IEEE Trans.
Knowl. Data Eng., p. 99, May 2020.
[121] Y. Gong, Y. Zhu, L. Duan, Q. Liu, Z. Guan, F. Sun, W. Ou, and K. Q.
Zhu, “Exact-k recommendation via maximal clique optimization, in
Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2019, pp. 617–626.
[122] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from obser-
vation, in Proc. IJCAI, 2018, pp. 4950–4957.
[123] C. Gao, K. Xu, K. Zhou, L. Li, X. Wang, B. Yuan, and P. Zhao, “Value
penalized q-learning for recommender systems, in Proc. SIGIR, 2022,
pp. 2008–2012.
[124] M. Preda and D. Popescu, “Personalized web recommendations: sup-
porting epistemic information about end-users, in Proc. WI, 2005, pp.
692–695.
[125] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, “Generative ad-
versarial user model for reinforcement learning based recommendation
system, in Proc. ICML, 2019, pp. 1052–1061.
[126] X. Chen, L. Yao, A. Sun, X. Wang, X. Xu, and L. Zhu, “Generative
inverse deep reinforcement learning for online recommendation, in
Proc. CIKM, 2021, pp. 201–210.
[127] C. Pei, X. Yang, Q. Cui, X. Lin, F. Sun, P. Jiang, W. Ou, and Y. Zhang,
“Value-aware recommendation based on reinforcement profit maxi-
mization,” in Proc. WWW, 2019, pp. 3123–3129.
[128] X. Zhao, C. Gu, H. Zhang, X. Yang, X. Liu, H. Liu, and J. Tang,
“Dear: Deep reinforcement learning for online advertising impression
in recommender systems,” in Proc. AAAI, 2021.
[129] L. Zou, L. Xia, Z. Ding, D. Yin, J. Song, and W. Liu, “Reinforcement
learning to diversify top-n recommendation, in International Confer-
ence on Database Systems for Advanced Applications, 2019, pp. 104–
120.
[130] D. Precup, R. S. Sutton, and S. Dasgupta, “Off-policy temporal
difference learning with function approximation,” in Proc. ICML, 2001,
pp. 417–424.
[131] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare, “Safe
and efficient off-policy reinforcement learning, in Proc. NIPS, 2016,
pp. 1054–1062.
[132] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi,
“Top-k off-policy correction for a reinforce recommender system, in
Proc. WSDM, 2019, pp. 456–464.
[133] J. Ma, Z. Zhao, X. Yi, J. Yang, M. Chen, J. Tang, L. Hong, and E. H.
Chi, “Off-policy learning in two-stage recommender systems, in Proc.
WWW, 2020, pp. 463–473.
[134] D. G. Horvitz and D. J. Thompson, “A generalization of sampling
without replacement from a finite universe,” J. Am. Stat. Assoc., vol. 47,
no. 260, pp. 663–685, April 1952.
[135] X. Bai, J. Guan, and H. Wang, “A model-based reinforcement learning
with adversarial training for online recommendation,” in Proc. NIPS,
2019, pp. 10735–10746.
[136] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “A peep into the future:
Adversarial future encoding in recommendation,” in Proc. WSDM,
2022, pp. 1177–1185.
[137] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey
of multiagent reinforcement learning,” IEEE Trans. Syst. Man Cybern.,
vol. 38, no. 2, pp. 156–172, March 2008.
[138] Z. Du, N. Yang, Z. Yu, and P. Yu, “Learning from atypical behavior:
Temporary interest aware recommendation based on reinforcement
learning, IEEE Trans. Knowl. Data Eng., pp. 1–13, January 2022.
[139] Y. Z. Wei, L. Moreau, and N. R. Jennings, “Learning users’ interests
by quality classification in market-based recommender systems, IEEE
Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1678–1688, December
2005.
[140] W. Zhang, H. Liu, H. Xiong, T. Xu, F. Wang, H. Xin, and H. Wu,
“Rlcharge: Imitative multi-agent spatiotemporal reinforcement learning
for electric vehicle charging station recommendation, IEEE Trans.
Knowl. Data Eng., pp. 1–14, May 2022.
[141] W. Zhang, H. Liu, F. Wang, T. Xu, H. Xin, D. Dou, and H. Xiong,
“Intelligent electric vehicle charging recommendation based on multi-
agent reinforcement learning, in Proc. WWW, 2021, pp. 1856–1867.
[142] R. Parr and S. J. Russell, “Reinforcement learning with hierarchies of
machines, in Proc. NIPS, 1997, pp. 1043–1049.
[143] T. G. Dietterich, “Hierarchical reinforcement learning with the maxq
value function decomposition, J. Artif. Intell. Res., vol. 13, no. 1, pp.
227–303, November 2000.
[144] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “Hierarchical reinforce-
ment learning for integrated recommendation, in Proc. AAAI, 2021.
[145] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps:
A framework for temporal abstraction in reinforcement learning, Artif.
Intell., vol. 112, no. 1, pp. 181–211, August 1999.
[146] L. Wang, R. Tang, X. He, and X. He, “Hierarchical imitation learning
via subgoal representation learning for dynamic treatment recommen-
dation, in Proc. WSDM, 2022, pp. 1081–1089.
[147] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh, “Personalized
ad recommendation systems for life-time value optimization with
guarantees, in Proc. IJCAI, 2015, pp. 1806–1812.
[148] G. Theocharous, P. Thomas, and M. Ghavamzadeh, “Ad recommenda-
tion systems for life-time value optimization,” in Proc. WWW, 2015,
pp. 1305–1310.
[149] F. Liu, R. Tang, H. Guo, X. Li, Y. Ye, and X. He, “Top-aware
reinforcement learning based recommendation, Neurocomputing, vol.
417, pp. 255–269, December 2020.
[150] H. Liu, Z. Sun, X. Qu, and F. Yuan, “Top-aware recommender
distillation with deep reinforcement learning, Inf. Sci., vol. 576, pp.
642–657, October 2021.
[151] M. Chen, B. Chang, C. Xu, and E. H. Chi, “User response models to
improve a reinforce recommender system, in Proc. WSDM, 2021, pp.
121–129.
[152] Z. Xu and U. Topcu, “Transfer of temporal logic formulas in reinforce-
ment learning, in Proc. IJCAI, 2019, pp. 4010–4018.
[153] A. Tirinzoni, A. Sessa, M. Pirotta, and M. Restelli, “Importance
weighted transfer of samples in reinforcement learning,” in Proc. ICML,
2018, pp. 4936–4945.
[154] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and
D. Meger, “Deep reinforcement learning that matters, in Proc. AAAI,
2017, pp. 3207–3214.
[155] J. Welborn, M. Schaarschmidt, and E. Yoneki, “Learning index selec-
tion with structured action spaces, ArXiv Preprint ArXiv:1909.07440,
2019.
[156] Y. Xu, L. Qin, X. Liu, J. Xie, and S.-C. Zhu, “A causal and-or graph
model for visibility fluent reasoning in tracking interacting objects,” in
Proc. CVPR, 2018, pp. 2178–2187.
[157] R. Powers and Y. Shoham, “New criteria and a new algorithm for
learning in multi-agent systems, in Proc. NIPS, 2004, pp. 1089–1096.
[158] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola, “Meta-q-learning,”
in Proc. ICLR, 2020.
[159] Q. Zhang, J. Liu, Y. Dai, Y. Qi, Y. Yuan, K. Zheng, F. Huang,
and X. Tan, “Multi-task fusion via reinforcement learning for long-
term user satisfaction in recommender systems, in Proc. 28th ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2022, pp. 4510–
4520.
[160] R. Xie, S. Zhang, R. Wang, F. Xia, and L. Lin, “Explore, filter and
distill: Distilled reinforcement learning in recommendation, in Proc.
CIKM, 2021, pp. 4243–4252.
[161] M. Fu, A. Agrawal, A. A. Irissappane, J. Zhang, L. Huang, and
H. Qu, “Deep reinforcement learning framework for category-based
item recommendation, IEEE Trans. Cybern., pp. 1–14, August 2021.
[162] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves,
V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg,
“Noisy networks for exploration, in Proc. ICLR, 2018.
[163] M. Kunaver and T. Požrl, “Diversity in recommender systems – A survey,”
Knowl.-Based Syst., vol. 123, pp. 154–162, May 2017.
[164] Y. Ge, X. Zhao, L. Yu, S. Paul, D. Hu, C.-C. Hsieh, and Y. Zhang,
“Toward pareto efficient fairness-utility trade-off in recommendation
through reinforcement learning, in Proc. WSDM, 2022, pp. 316–324.
[165] C. Lücken, B. Barán, and C. Brizuela, “A survey on multi-objective
evolutionary algorithms for many-objective problems,” Comput. Optim.
Appl., vol. 58, no. 3, pp. 707–756, February 2014.
[166] X. Chen, Y. Du, L. Xia, and J. Wang, “Reinforcement recommendation
with user multi-aspect preference, in Proc. WWW, 2021, pp. 425–435.
[167] D. Stamenkovic, A. Karatzoglou, I. Arapakis, X. Xin, and K. Katevas,
“Choosing the best of both worlds: Diverse and novel recommendations
through multi-objective reinforcement learning,” in Proc. WSDM, 2022,
pp. 957–965.
[168] R. Xie, Y. Liu, S. Zhang, R. Wang, F. Xia, and L. Lin, “Personalized
approximate pareto-efficient recommendation, in Proc. WWW, 2021,
pp. 3839–3849.
[169] Y. Ge, S. Liu, R. Gao, Y. Xian, Y. Li, X. Zhao, C. Pei, F. Sun, J. Ge,
W. Ou, and Y. Zhang, “Towards long-term fairness in recommenda-
tion, in Proc. WSDM, 2021, pp. 445–453.
[170] D. Li, X. Li, J. Wang, and P. Li, “Video recommendation with multi-
gate mixture of experts soft actor critic, in Proc. SIGIR, 2020, pp.
1553–1556.
[171] E. Puiutta and E. M. S. P. Veith, “Explainable reinforcement learning:
A survey, in International Cross-Domain Conference for Machine
Learning and Knowledge Extraction, 2020, pp. 77–95.
[172] P. Wu, H. Li, Y. Deng, W. Hu, Q. Dai, Z. Dong, J. Sun, R. Zhang, and
X.-H. Zhou, “On the opportunity of causal learning in recommendation
systems: Foundation, estimation, prediction and challenges, in Proc.
IJCAI, 2022, pp. 1–8.
[173] C.-Y. Tai, L.-Y. Huang, C.-K. Huang, and L.-W. Ku, “User-centric path
reasoning towards explainable recommendation,” in Proc. SIGIR, 2021,
pp. 879–889.
[174] Y. Xiao, L. Xiao, X. Lu, H. Zhang, S. Yu, and H. V. Poor, “Deep-
reinforcement-learning-based user profile perturbation for privacy-
aware recommendation,” IEEE Internet Things J., vol. 8, no. 6,
pp. 4560–4568, March 2021.
[175] W. Fan, T. Derr, X. Zhao, Y. Ma, H. Liu, J. Wang, J. Tang, and Q. Li,
“Attacking black-box recommendations via copying cross-domain user
profiles,” in Proc. IEEE 37th Int. Conf. Data Engineering, 2021, pp.
1583–1594.
[176] J. Garcia and F. Fernandez, “A comprehensive survey on safe reinforce-
ment learning, J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480,
August 2015.
[177] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning:
Challenges, methods, and future directions, IEEE Signal Proc. Mag.,
vol. 37, no. 3, pp. 50–60, May 2020.
[178] W. Huang, J. Liu, T. Li, T. Huang, S. Ji, and J. Wan, “Feddsr: Daily
schedule recommendation in a federated deep reinforcement learning
framework, IEEE Trans. Knowl. Data Eng., pp. 1–1, November 2021.