On the problem of synthesizing gestures for 3D virtual human

Le Kim Thu
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Assoc. Prof. Bui The Duy

A thesis submitted in fulfillment of the requirements for the degree of Master of Information Technology

June 2010

Originality Statement

I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no material previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

Acknowledgements

My thesis could not have been completed without the support and encouragement of many people. My greatest gratitude is extended to:

My supervisor, Bui The Duy, who encouraged me and gave me a great deal of valuable advice on research methodology, writing style and presentation skills. I feel very lucky to be one of his students.

All of my teachers in the Master course, whose lectures gave me a solid foundation for working on my problem.

My colleagues in the Faculty of Applied Mathematics and Informatics, Hanoi University of Technology, who encouraged me to join this course, read my thesis, pointed out my deficiencies and helped me fix them.

My friends, who commented on my writing style and the structure of my project.

Last but not least, my family, especially my mother, for their encouragement, care and love. They are the people who always trust in me and stand by me in every difficult situation. This understanding is the most precious present I have ever had.

Hanoi, 06/2010
Le Kim Thu

ABSTRACT

Autonomous learning is one of the main drivers of human development. Many computer systems also need this property to improve their performance; nevertheless, it is very difficult to construct such a system. The reinforcement learning framework is a prominent solution to this problem. In this thesis, we construct a self-learning system that uses a reinforcement learning algorithm, Q-learning. In addition, we propose a new method for controlling action selection based on supervised learning. Our experiments show that, apart from a few bad cases, this strategy provides quite good results, which encourages us to extend the approach to more complex problems.

CONTENTS

CONTENTS
List of figures
INTRODUCTION
CHAPTER I: BACKGROUND
1.1 Gesture synthesis
1.2 Reinforcement learning
1.2.1 Model description
1.2.2 Solutions
1.3 Exploitation versus exploration
1.4 Exploration strategies
1.4.1 Predefined trajectory strategy
1.4.2 Next-Best-View
CHAPTER II: A MODEL FOR LEARNING THE PICKING-UP PROBLEM
2.1 Problem representation
2.2 Model identification
2.3 Proposed exploration strategy
CHAPTER III: IMPLEMENTATION, EXPERIMENT AND EVALUATION
3.1 Overview
3.2 Experiment and result
CONCLUSION
References

List of figures

Figure 1.1: Data collection phase [41]
Figure 1.2: Data annotation
Figure 1.3: (a) Combining discrete segments and manual adjustment; (b) hole filling and detailed surface model
Figure 1.4: Reference skeleton joint transformation
Figure 1.5: An example of folding an arm
Figure 1.6: Interaction in reinforcement learning
Figure 1.7: Q-learning algorithm for estimating Q(s, a)
Figure 1.8: Average performance of ε-greedy action-value methods [10]
Figure 2.1: Angles between an element and its parent
Figure 2.2: Shapes for fingers
Figure 3.1: An example of the learning system
Figure 3.2: Total time to build Q-table 1
Figure 3.3: Average, minimum and maximum number of states over 25 runs
Figure 3.4: Number of steps before completing the task
Figure 3.5: Distances between the agent and the object in the worst case of the new strategy
Figure 3.6: Distances between the agent and the object after each movement

INTRODUCTION

Simulation and modeling attract a great deal of attention because of their wide range of real-life applications. For instance, in biology these technologies allow us to simulate interactions of cells, DNA sequences, or the infection process of a virus spreading from infected cells to normal cells [36]. Many other simulations, such as test cases in construction and design, have helped save a great deal of manpower and money [45]. These technologies also enable us to rehearse many situations until we can handle them in reality [46]; practicing in this way can, for example, improve driving skill without causing any damage.

Among the great number of simulation problems, human simulation is one of the areas with the largest number of real-life applications; typical applications are entertainment, interactive systems, education, e-commerce, health, and so on. Modeling humans is a very difficult task because of the complexity of human behavior. For example, when we look at children, we can see them trying to do one thing many times in order to do it well. Activities that are very simple to us, such as sitting, standing and moving around, are really difficult for them, so they have to go through many trial-and-error sequences before identifying what they should do. Similarly, robots or embodied agents can also learn to plan their movements.

In this thesis, we focus on gesture synthesis with a machine learning approach. In particular, we use reinforcement learning to let an embodied agent learn to control its hand to raise an object. In a reinforcement learning process, the agent initially has no idea of the environment or of what it should do to reach its goal. The agent performs some actions and observes the information returned; based on that information, it learns how to improve its performance. The difficulty is that the agent knows only part of the environment, so it cannot guarantee that the information it has received is good. This issue leads to the problem of trading off exploration against exploitation. To address it, we also develop an exploration algorithm based on supervised learning.

The thesis is structured as follows. Chapter 1 presents the background of gesture synthesis and the reinforcement learning problem, and also presents the exploration-exploitation trade-off in detail. Our problem and the proposed strategy for exploring the environment are described in detail in Chapter 2. Chapter 3 presents the results of our implementation, the comparison between the two methods, and illustrations.

CHAPTER I: BACKGROUND

This chapter contains two parts. The first presents some techniques for synthesizing the gestures of a virtual human; the second presents the model of the RL problem and illustrates the exploration-exploitation trade-off that arises during the learning process.

1.1 Gesture synthesis

Gesture synthesis is a problem with a wide range of applications, and many algorithms have been proposed for it [37, 38, ...]. Typical methods include inverse kinematics [37, 44], the use of motion capture data [38, 41], and hierarchical frame approaches [39, 42].

Motion capture data underlies an effective class of algorithms. The typical pipeline is as follows: first, data is collected; this data is then annotated manually to construct a database. Deformation techniques such as EFFD or NFFD [40] are then used to generate the unit models.
Other techniques can be used to smooth the model. Lastly, the gesture is combined from these unit models. Figure 1.1 presents an example of this approach.

Figure 1.1: Data collection phase [41]
Figure 1.2: Data annotation
Figure 1.3: (a) Combining discrete segments and manual adjustment; (b) hole filling and detailed surface model

The results of this approach are quite good; however, it depends heavily on the motion data that can be collected, and few systems can generate varied action styles.

Another approach uses a frame for each part of the human body to control movement. Algorithms in this approach include vertex blending, bone blending [39], and volume-preserving deformation [43]. The main advantage of this approach is the ability to represent a wide range of actions in one system without requiring too much data. In our project we rely on a hierarchical structure to control the model, so we examine one algorithm in order to grasp the idea of the approach.

In bone-blending algorithms, the authors build a skeleton frame of joints and bones, and the skin is a set of triangles, each attached to bones or joints. Changing the skeleton makes the joints and bones move, and these movements propagate to the skin.

Example 1: The wrist is considered a joint connecting two bones (Figure 1.5a). The skin is determined by the vertices of the triangle mesh; the different colors of the vertices indicate the different bones they depend on. An action corresponds to a shape of the skeleton: for example, if we bend the arm, the angle between the two bones changes (Figure 1.5b), and this change also affects the skin.

Figure 1.4: Reference skeleton joint transformation
Figure 1.5: An example of folding an arm

Many other methods in this approach add a weight for each joint to manage the influence of the bones on a vertex of the skin mesh, or divide the skeleton into small parts, in which the changes are controlled before being combined with the others. In general, this approach is easier to implement on a small scale than the motion-data approach. It can also generate actions of very high quality, although such quality requires heavy computation that cannot be carried out on a normal computer system [47]. Because of the complexity of synthesizing gestures, we use only a simple method from this second approach: the shape of the frame is controlled by the reinforcement learning process.
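To make the bone-blending idea concrete, the following is a minimal sketch of per-vertex bone weighting (linear blend skinning), not the thesis's implementation; the two-bone arm geometry, the blend weights and the bend angle are made-up illustration values.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Hypothetical two-bone "arm": upper arm from the shoulder (0, 0) to the
# elbow (1, 0), forearm from the elbow to the wrist (2, 0).
elbow = np.array([1.0, 0.0])

# Skin vertices near the elbow and their blend weights (w_upper, w_forearm).
vertices = np.array([[0.8, 0.1], [1.0, 0.1], [1.2, 0.1]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

def skin(vertices, weights, bend_angle):
    """Each skin vertex follows a weighted mix of the bone transforms:
    the upper arm stays fixed, the forearm rotates about the elbow."""
    R = rot2d(bend_angle)
    deformed = []
    for v, (w_up, w_fore) in zip(vertices, weights):
        v_upper = v                        # vertex carried by the fixed upper arm
        v_fore = elbow + R @ (v - elbow)   # vertex carried by the rotated forearm
        deformed.append(w_up * v_upper + w_fore * v_fore)
    return np.array(deformed)

# Bending the "arm" by 45 degrees moves blended vertices between the two bones.
print(skin(vertices, weights, np.deg2rad(45)))
```

A vertex weighted 0.5/0.5 ends up halfway between the two bone transforms, which is what produces the smooth skin deformation around the joint.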
1.2 Reinforcement learning

Reinforcement learning (RL) is a problem that concerns many scientists in the field of artificial intelligence. Although the idea appeared a long time ago, around the 1950s in the form of "optimal control" [23, 24], for a long while it did not attract much attention; most modern RL algorithms started to appear in the 1980s [17, 18]. In this part, we consider the problems connected with it.

1.2.1 Model description

Reinforcement learning, supervised learning and unsupervised learning are the three main problems in machine learning. Unlike the other problems, where the training data is essentially fixed, RL has a value called the reward that estimates how well an action performed. From its own experience, the agent adjusts its behavior to gain the highest reward (Fig. 1.6).

Figure 1.6: Interaction in reinforcement learning

Formally, the model consists of:
- S: the set of environment states
- A: the set of agent actions
- R: the scalar reward value
- T: the probability transition model (optional)

State of the environment: in this problem we consider only discrete environments, meaning that the environment can be divided into separate states; however, the algorithm can be extended to continuous environments. Among the many types of environment, we consider only those with the Markov property: the current state captures all the relevant information from the past, so the probability of moving from one state to another depends only on the current state, not on previous states. This type of environment is called a Markov decision process (MDP).

Reward: the scalar value calculated by the environment and sent back to the agent. It is a function of the state and the action the agent executes at that time. Note that the reward is computed by the environment, and reward values only evaluate the action taken in that state; they do not show the way to the solution (see the example below). The agent's job is to find a policy π, which defines the agent's behavior at a given time, so as to maximize its total reward.

Example 2: A very well-known example is the game tic-tac-toe. In this game, two players use a 3x3 board and the symbols X and O to mark cells in turn. If a player places his symbols together in a row, a column or a diagonal, he wins the game. The elements of the problem are:
- State: the environment state is the configuration of the board. Each cell has 3 possibilities (X, O or empty), so the environment has 3^9 states in total.
- Action: in each state, a player takes an action, namely placing his symbol on a blank cell.
- Reward: 1 if the player wins, -1 if the player loses and 0 otherwise. Remember that an action that places two of the player's symbols in a line may lead to winning the game but still receives reward 0.

Return: as the goal concerns the long-term reward rather than the immediate reward, we use not only the immediate reward but also the rewards received afterwards. Because better actions should happen early, the rewards are multiplied by a discount factor as they are accumulated, expressing that good actions should happen first. This value is usually called the return and is determined by the formula

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}    (1.1)

where γ is the discount factor, in the range [0, 1).

The goal of the program is to find a policy π, a strategy for choosing actions in each state, that guarantees the agent will win the game.
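As a small illustration of the return in equation (1.1), the following sketch (not code from the thesis) computes the discounted return of a finite episode; the reward sequences below are made-up tic-tac-toe-style examples in which only the final winning move is rewarded.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    for a finite episode, i.e. equation (1.1) truncated at the terminal state."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards from the end of the episode
        g = r + gamma * g
    return g

# Winning in three moves yields a larger return than winning in four,
# which is how the discount factor expresses "good actions should happen first".
print(discounted_return([0, 0, 1]))      # 0.81
print(discounted_return([0, 0, 0, 1]))   # 0.729
```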
1.2.2 Solutions

Most solutions for RL are based on some kind of value function to decide which actions to take or which states the agent should reach. The most typical value functions are the state value and the state-action value: the first evaluates how good it is for the agent to be in a state, while the second estimates the utility of taking an action in a state. When the agent is in a state of the environment, then according to the state value function it should take the action that moves it to the successor state with the maximum state value; if the agent has a state-action value function, it takes the action with the highest state-action value.

To solve an RL problem, there are three fundamental methods: dynamic programming (DP), Monte Carlo (MC) and temporal difference (TD).
- Dynamic programming [4, 5, 14, ...]: a model-based method. This approach finds the value functions by iterating updates until the values converge.
- Monte Carlo [1, 10, ...]: model-free. This method uses data in the form of episodes with terminal states. MC methods update the value function similarly to DP, but they store whole sequences of transitions to a terminal state, so the value estimate of each state is clear and independent.
- Temporal difference [6, 8, 26, ...]: model-free. This method combines the ideas of the two methods above: it does not require a model, and it updates values using only part of the learned knowledge.

One of the most well-known algorithms is the Q-learning algorithm proposed by Watkins in [26]. Q-learning uses state-action values to support the decision process. At the beginning, each state-action pair is initialized with an arbitrary value; then, as the agent tries actions, the returned reward is used to update these values. The algorithm is presented in Figure 1.7.

Figure 1.7: Q-learning algorithm for estimating Q(s, a)

The algorithm is quite easy to understand, but how should we choose the action a from the action set in each state? The chosen action has to be better than the others in some way, and how the agent uses its accumulated information to make such a decision is a big issue, discussed in the next section.

1.3 Exploitation versus exploration

We have seen that the agent has to make many trials to observe information from the environment. When the agent is in a state, it seems clear that it should take the action with maximum utility (e.g., maximum reward), so it uses its beliefs to choose the best action to perform.

Example 3.a: Assume that the board contains two X's that can be completed into a line at cell (3, 3), together with two O's, and that the agent plays X. The best action the agent can perform is to place its symbol in cell (3, 3); when it completes this action, it receives reward 1 because it wins the game.

This is an example of the exploitation process: the agent chooses actions based on its knowledge of the environment. If the agent has enough information about the environment, it can determine an action that returns a good result. However, in many cases, e.g. at the beginning of the learning process, the agent has almost no idea about the environment or its beliefs are inadequate; therefore, the chosen action may not be the best action in that state (see Example 3.b).

Example 3.b: Consider the same environment state as in Example 3.a, but suppose the agent has tried placing its symbol in only 3 cells: (2, 1), (3, 1) and (3, 3). Based on the utilities of these three candidates, the agent will place its symbol in one of them. In such a situation, if the agent uses only the available information, it may miss the optimal action, so it should try another action from the action set to explore a new one; this is the exploration process.

Both processes are important, but they contradict each other in execution, so we must build a strategy to balance them. The strategy decides how important each process is in a given state and how to pick a candidate action that may maximize the agent's long-term reward. Many strategies have been proposed to balance the two processes, for instance greedy strategies and strategies based on agent behavior [25, 29, ...]. The next part presents some typical strategies for balancing the two processes.
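To make the two preceding sections concrete, below is a minimal tabular Q-learning sketch in the spirit of Figure 1.7, using the ε-greedy action choice that section 1.4.2 describes. The environment interface (reset, actions, step) is hypothetical, not the thesis's code, and the hyperparameter values are illustrative only.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    `env` is assumed to provide reset() -> state, actions(state) -> list,
    and step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)   # Q[(state, action)], arbitrarily initialized to 0

    def choose(state):
        acts = env.actions(state)
        if random.random() < epsilon:                      # explore: random action
            return random.choice(acts)
        return max(acts, key=lambda a: Q[(state, a)])      # exploit: greedy action

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = choose(s)
            s2, r, done = env.step(s, a)
            best_next = max((Q[(s2, a2)] for a2 in env.actions(s2)), default=0.0)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```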
1.4 Exploration strategies

As mentioned in the previous section, a strategy for exploring the environment plays an important role in the learning process. When an agent chooses an action to perform, it usually takes the action whose result is better than that of the other actions; however, nothing guarantees that the agent is completely right, so it sometimes tries another action. There are many exploration strategies, and they can be divided into two main categories. The first is the predefined trajectory approach: the agent collects and processes all the information about the environment to build a model of the environment and to determine a trajectory. The second is called the next-best-view approach: the agent accumulates environment information and chooses the next best action to execute in each state. We study these strategies in more detail below.

1.4.1 Predefined trajectory strategy

This approach plays a very important role in robot control: many robots always interact within a small region, and their environment does not change much over time. On the other hand, it is not applied much in our reinforcement learning setting, so we only take a glance at it. Among the large number of methods that have been proposed [29, 33, ...], the most popular idea is constructing a map to determine the robot's trajectory. One typical method collects images of the environment to build up the agent's beliefs. For example, in [32] the robot takes images of the surrounding environment at different angles, then analyzes the density and orientation in the 2D images to identify frontiers and key points and to build a complete model of the environment. The model is then used to determine an optimal trajectory for the robot to follow.

Although this method has achieved very good results, it runs into an issue with the size of the region. If the region in which the agent interacts is large, as in [33], it is very difficult to collect all the information, and the robot may not have the capacity to process it, so a method that computes more efficiently is needed. In [33], the authors develop a new method in which the environment is divided into sub-parts called local regions. The robot constructs a map for each local region and finds optimal trajectories within that region, then follows the trajectory defined for the current region until it moves to another one. Experimental comparison of the two methods showed that the second method greatly reduces the implementation requirements while its results remain quite acceptable; that is, the gap between the two methods is not large, even though the first guarantees minimal error over the global environment while the second does not. On the whole, many scientists have been paying more and more attention to this direction.

1.4.2 Next-Best-View

Strategies of this type use the information about candidate actions at a given time to choose the action that is best in some respect. In general, the action choice is made within a restricted space and is based on utility values accumulated by the agent during the learning process. There are many strategies of this type, for example ε-greedy [26], softmax [25], genetic algorithms [30], frontier-based [31] and behavior-based [28, 29] approaches. In this part, we focus on the greedy idea as an example of this trend of exploring the environment.

When an agent chooses the action with the biggest utility, it is exploiting the environment based on its knowledge; this is called the greedy method. Since the agent should also learn more about the environment, the simplest idea is to sometimes choose an action at random, as in the ε-greedy method.

ε-greedy method: in a state s of the environment, the agent chooses an action to execute as follows:
- With probability (1 - ε): choose the action with maximum utility.
- With probability ε: choose an action at random among all candidates.

With this method, it is guaranteed that as the number of trials goes to infinity, the number of times each action is chosen also goes to infinity, and consequently the utility estimate of each action converges to its true value. Therefore, at the end of the learning process, the agent is able to choose the best action.

Figure 1.8 [10] compares the two methods. We can see that, at the beginning, a larger ε gives much better results than a small one, but the gap shrinks as knowledge of the environment increases. To address this, one may reduce the value of ε during the learning process.

Figure 1.8: Average performance of ε-greedy action-value methods [10]

The ε-greedy method is much better than the pure greedy method. However, the exploring action is taken completely at random, which means that a very bad candidate action could be taken. Therefore, [25] proposed another method, softmax, in which actions are ranked by their estimated values and the selection probability corresponds to an action's rank. For example, the following probability can be used to choose action a in state s:

    P(a | s) = exp(Q(s, a) / τ) / Σ_b exp(Q(s, b) / τ)    (1.2)

where τ > 0 is a temperature parameter controlling how strongly the choice favors high-valued actions.
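The following is a small sketch (not from the thesis) of softmax selection as in equation (1.2); the Q-values, the temperature value and the trial count are invented for illustration.

```python
import math
import random

def softmax_choice(q_values, tau=0.5):
    """Pick an action index with probability proportional to exp(Q/tau),
    as in equation (1.2). A higher temperature tau explores more."""
    shifted = [q - max(q_values) for q in q_values]   # shift for numerical stability
    prefs = [math.exp(q / tau) for q in shifted]
    total = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / total for p in prefs], k=1)[0]

# Hypothetical utilities of three candidate actions in some state.
q = [0.2, 0.5, 0.1]
counts = [0, 0, 0]
for _ in range(1000):
    counts[softmax_choice(q)] += 1
print(counts)   # the highest-valued action is picked most often, but never exclusively
```

Unlike ε-greedy, the probability of a candidate shrinks with its estimated value, so exploratory moves tend to avoid the worst actions.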
There are also many other strategies for selecting the best action from the candidate set: [30] uses a greedy rule for the known part of the environment and a genetic algorithm to determine the best action in the unknown parts; [31, 34] choose actions that lead the agent to the frontier of a local region, etc. Some approaches execute the exploration and exploitation processes simultaneously and combine the effects of both to control the action choice, as in [35]. Within the limits of a Master thesis we cannot examine all of these methods in detail; for more information, see the referenced papers.

In brief, this chapter has presented an introduction to the reinforcement learning problem, taken a closer look at the problem of balancing exploitation and exploration of the environment, and described the two main groups of methods for dealing with this issue. Our problem is described in the next chapter.

CHAPTER II: A MODEL FOR LEARNING THE PICKING-UP PROBLEM

If we watch a child trying to tie a knot, it looks far different from the way we do the same task. The child tries a lot of different ways but faces many difficulties, such as getting the wire twisted; meanwhile, we can tie the knot without any effort. These differences are natural: the child has no idea of tightening and no usable example of how to tie (or the examples are too complicated to understand), while tying a knot is really easy for an adult, who practiced this task when he was small and knows how to complete it. After a lot of trying, the little girl may in the end tie the knot. She has completed a learning process based on her own experience; in other words, she has undergone a reinforcement learning of knot tying. Applying this idea to the control of an embodied agent, the agent can be equipped with the ability to learn by itself to execute some task through reinforcement learning. This chapter presents the problem of simulating the learning process of an agent in a raising task using the reinforcement learning approach, and proposes a method for exploring the environment based on supervised learning.

2.1 Problem representation

Assume that no outside factor intervenes in the learning process. When trying to do something, a child always follows his goal, such as kicking a ball or turning the pages of a book. However, he does not know how to reach the goal, or how the object (the ball) reacts to his action. The child will try to kick the ball; perhaps the ball will not move, or will not move the way he wants. Then he will try again, kicking slightly differently, and he repeats this until he can make the ball move in the right way. From the machine learning perspective, this process is a model-free reinforcement learning problem. The problem we intend to model is similar to the above sequence, but specifically for the raising task.
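To make the model-free framing concrete, here is a purely illustrative agent-environment loop for a simplified "move the hand toward the object" task. The state, action and reward definitions below are invented for illustration; they are not the state space, actions or rewards of the thesis's model, which are specified in sections 2.2 and 2.3.

```python
class ReachEnv:
    """Toy stand-in for the raising task: the 'hand' moves on a small 2D grid
    and an episode ends when it reaches the object. Illustration only."""
    GOAL = (4, 4)
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def reset(self):
        return (0, 0)                      # the hand starts at the origin

    def actions(self, state):
        return list(self.MOVES)

    def step(self, state, action):
        dx, dy = self.MOVES[action]
        nxt = (min(max(state[0] + dx, 0), 4), min(max(state[1] + dy, 0), 4))
        done = nxt == self.GOAL
        reward = 1.0 if done else 0.0      # reward only when the object is reached
        return nxt, reward, done

# Example usage with the Q-learning sketch from Chapter I:
#   Q = q_learning(ReachEnv(), episodes=2000)
```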