Jinyang Jiang 1, Xiaotian Liu 2, Tao Ren 1, Qinghao Wang 3, Yi Zheng 1, Yufu Du 4, Yijie Peng 1 and Cheng Zhang 4

Abstract: We introduce a deep reinforcement learning (DRL) approach for solving management problems including inventory management, dynamic pricing, and recommendation. This DRL approach has the potential to lead to a large management model based on certain transformer neural network structures, resulting in an artificial general intelligence paradigm for various management tasks. Traditional methods have limitations in solving complex real-world problems, and we demonstrate how DRL can surpass existing heuristic approaches on management tasks. We aim to solve these problems in a unified framework that accounts for the interconnections between different tasks. Central to our methodology is the development of a foundational decision model that coordinates decisions across the different domains through generative decision-making. Our experimental results affirm the effectiveness of our DRL-based framework in complex and dynamic business environments. This work opens new pathways for the application of DRL in management problems, highlighting its potential to revolutionize traditional business management.
The Myopic policy combines a myopic pricing policy, which determines each period's price $p_t$ as a function of the beginning inventory level $x_t$ at time $t$, with a base-stock policy for inventory replenishment. It has shown excellent performance in numerical studies.

We evaluate PPO's performance on dynamic pricing and replenishment in competition through four experiments: (a) backlogged demand with no fixed ordering costs; (b) lost demand with no fixed ordering costs; (c) backlogged demand with fixed ordering costs; (d) lost demand with fixed ordering costs. With demands generated by Eq. (3), the reward comparisons in Fig. 1 show that PPO achieves the highest profits in all cases.

The inventory level evolves as

$$I_t = [I_{t-1} + q_{t-L} - d_t]^+, \tag{4}$$

where $d_t$ is the realization of $D_t$ and $L$ is the length of the lead time. The actual sales $A_t$ is given by

$$A_t = \min\{d_t,\; I_{t-1} + q_{t-L}\}. \tag{5}$$

The backlog $B_t$ is

$$B_t = \max\{d_t - (I_{t-1} + q_{t-L}),\; 0\}. \tag{6}$$

The sales profit $r_t$ is
$$r_t = p_t A_t - h I_t - b B_t - c q_t, \tag{7}$$

where $h$, $b$, $c$ are unit holding, backlogged, and ordering costs, respectively. To apply DRL algorithms, we formulate the dynamic pricing and replenishment as an MDP, where the actions are

TABLE II
RESULTS OF APPLYING DRL METHODS ON MULTI-ECHELON INVENTORY PROBLEMS.

| Paper | Method | Inventory Settings | Main Findings |
|---|---|---|---|
| | DQN | Beer game in serial supply chain | DQN exceeds the base stock policy when other supply chain actors make realistic ordering decisions. |
| | HAPPO | Beer game in serial supply chain with non-stationary demands | HAPPO surpasses the non-stationary base stock policy without fixed costs and the non-stationary (s, S) policy with fixed costs. |
| | HAPPO | A supply chain network where each echelon has two actors | HAPPO outperforms the capped dual index, dual index, and tailored base surge policies adapted to the problem. |
| | A3C | One-warehouse multi-retailers | A3C outperforms the base stock policy with constant base-stock levels. |
| | PPO | One-warehouse multi-retailers with dual sourcing and multi-products | PPO outperforms three combined heuristic policies. |

VI. INVENTORY MANAGEMENT WITH RECOMMENDATION SYSTEM

In this part, we formulate the coordination of the inventory management and recommendation system as an MDP. With DRL as a powerful solver, it becomes feasible to jointly consider the two problems. Numerical experiments are carried out to demonstrate the advantage of coordination via DRL.

where $d_t^{i,j}$ follows a binomial distribution, $c^j$ denotes the purchasing capacity of customer $j$, and the purchase probabilities over all products are given by

$$\frac{\exp\{R_t^{i,j}\}}{\sum_{k=1}^{N} \exp\{R_t^{k,j}\}}, \tag{10}$$

where a higher rating implies the customer is more likely to purchase the product.
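The purchase probabilities in Eq. (10) are a softmax over one customer's ratings. The following minimal Python sketch illustrates this, together with one possible way of drawing the binomial demand. The function names, the example ratings, and the choice of the purchasing capacity $c^j$ as the number of binomial trials are assumptions made for illustration; the text above does not state the exact parametrization.

```python
import math
import random


def purchase_probabilities(ratings):
    """Softmax over one customer's ratings R_t^{k,j}, k = 1..N, as in Eq. (10)."""
    shift = max(ratings)  # subtracting the max avoids overflow and leaves the result unchanged
    weights = [math.exp(r - shift) for r in ratings]
    total = sum(weights)
    return [w / total for w in weights]


def sample_demand(capacity, prob, rng):
    """Draw a Binomial(capacity, prob) sample as a sum of Bernoulli trials.

    Assumption: the purchasing capacity c^j is used as the number of trials.
    """
    return sum(1 for _ in range(capacity) if rng.random() < prob)


# Hypothetical ratings of one customer for N = 3 products.
ratings = [1.0, 2.0, 3.0]
probs = purchase_probabilities(ratings)

rng = random.Random(0)  # fixed seed, for reproducibility only
capacity = 5            # assumed purchasing capacity c^j
demand = [sample_demand(capacity, p, rng) for p in probs]
```

The probabilities sum to one and increase with the rating, which matches the statement that a higher rating makes a purchase more likely.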
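For the backlogged-demand model of the dynamic pricing and replenishment problem, one period of Eqs. (4) to (7) can be sketched in Python as follows. The function name, argument names, and numerical values are hypothetical. The sketch implements only the four equations as printed and does not model how backlog is carried into later periods, which this excerpt does not specify.

```python
def inventory_step(prev_inventory, arriving_order, demand, price, order_qty,
                   holding_cost, backlog_cost, order_cost):
    """One period of the dynamics in Eqs. (4) to (7).

    prev_inventory is I_{t-1}, arriving_order is q_{t-L} (ordered L periods ago),
    demand is the realization d_t, and order_qty is the current order q_t.
    """
    available = prev_inventory + arriving_order      # I_{t-1} + q_{t-L}
    inventory = max(available - demand, 0)           # Eq. (4): I_t
    sales = min(demand, available)                   # Eq. (5): A_t
    backlog = max(demand - available, 0)             # Eq. (6): B_t
    profit = (price * sales                          # Eq. (7): r_t
              - holding_cost * inventory
              - backlog_cost * backlog
              - order_cost * order_qty)
    return inventory, sales, backlog, profit


# Demand below the available stock: leftover inventory, no backlog.
enough_stock = inventory_step(10, 5, 12, price=4.0, order_qty=6,
                              holding_cost=0.5, backlog_cost=2.0, order_cost=1.0)

# Demand above the available stock: zero inventory, positive backlog.
short_stock = inventory_step(10, 5, 20, price=4.0, order_qty=6,
                             holding_cost=0.5, backlog_cost=2.0, order_cost=1.0)
```

With these illustrative numbers, `enough_stock` is `(3, 12, 0, 40.5)` and `short_stock` is `(0, 15, 5, 44.0)`. In both cases sales plus backlog equals demand, as Eqs. (5) and (6) imply.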
Fig. 1. Comparison of reward under different scenarios.

We consider a standard $N$-product single-echelon inventory system with lost sales over $T$ periods, where unmet demand disappears immediately. We use $S_t^i$, $O_t^i$, and $I_t^i$ to denote the sales quantity, lost sales, and on-hand inventory of product $i$ at the end of period $t$. The replenishment quantity of product $i$ is denoted as $q_t^i$, which is the decision variable in inventory management. The status of the inventory system is given by