In the modern world, autonomous robots take on more and more everyday work. Autonomous vacuum robots are now common in many homes: they use sensors to detect obstacles, and cliff detection has been added as an additional sensing technology for safety. As the technology develops, we believe auto vacuum robots should be able to detect cliffs and descend stairs safely. Our project is a downhill survival game in Minecraft that asks the agent to descend as many levels of the map as possible within a limited number of steps. The agent can look down five floors in order to determine its next step. Once the player reaches the bottom of the map, the map is regenerated. Rewards are given for each level the player descends, and penalties are applied for touching the map boundary and for dropping more than five levels at a time. The agent is expected to earn higher rewards as it optimizes, which we consider a combination of less falling damage and fewer steps.
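To make the observation concrete, here is a minimal sketch of how the five-floor lookahead could be encoded as a feature vector for the network. The grid layout, block names, and window width are illustrative assumptions, not the exact Malmo observation spec we use.

```python
import numpy as np

# Hypothetical encoding of the agent's view: a narrow column of blocks
# covering the five floors directly below it (the 2D map only allows
# left/right movement). The 5x3 window size is an illustrative assumption.
FLOORS_VISIBLE = 5

def encode_observation(grid, width=3):
    """Convert a list of block names (row-major, top floor first) into a
    binary feature vector: 1 where the block is solid (standable),
    0 where it is air (a hole the agent could drop through)."""
    solid = np.array([0 if b == "air" else 1 for b in grid], dtype=np.float32)
    return solid.reshape(FLOORS_VISIBLE, width).flatten()

# Example: a 5x3 window where only the leftmost block of the next floor is solid.
sample_grid = ["stone", "air", "air"] + ["air"] * 12
state = encode_observation(sample_grid)
```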
Unlike path-searching or item-collecting games, we are trying to build a general strategy that can handle a randomized map. We are currently implementing the game with a Q-network in PyTorch, and the map is built around the most basic needs: it must be easy to observe and achievable for AI training. Therefore, instead of a 3D map, our initial map is a 2D version in which the player can only go left or right. Rewards and penalties are applied through the mission XML and the loss function: the rewards are for descending effectively and for reaching the goal, and the penalties are for moving toward the boundary and for falling more than five floors. The performance of our first model depends heavily on the weights of these rewards and penalties, and we have found reasonable values for them.
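The sketch below illustrates the reward shaping described above. Only the five-points-per-level reward is our actual setting; the other weights and the function name are placeholders for whatever values come out of tuning.

```python
# Hypothetical reward shaping consistent with the rules described above;
# apart from REWARD_PER_LEVEL, the weights are illustrative placeholders,
# not our tuned values.
REWARD_PER_LEVEL = 5.0       # reward for each level descended
GOAL_BONUS = 100.0           # reward for reaching the bottom of the map
BOUNDARY_PENALTY = -10.0     # penalty for touching the map boundary
BIG_FALL_PENALTY = -25.0     # penalty for dropping more than five levels at once

def step_reward(levels_dropped, touched_boundary, reached_goal):
    reward = REWARD_PER_LEVEL * levels_dropped
    if levels_dropped > 5:
        reward += BIG_FALL_PENALTY
    if touched_boundary:
        reward += BOUNDARY_PENALTY
    if reached_goal:
        reward += GOAL_BONUS
    return reward
```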
The result below is our first 'successful' run. However, the graph of the training result still shows some problems.
Even though our model is operating normally, we think it can do better. Sometimes the player turns around on the first level and gets stuck there, and we believe there is something wrong with how each possible move is rated. Modifying the training function with more heavily weighted parameters might help in this situation. The rewards and penalties can be improved further, but we need more time to test different combinations. The loss function is another potential problem: our model produces huge loss values, over five thousand, even though the reward was set to five points per level.
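One common way to keep the loss at a saner scale is to use a Huber (smooth L1) loss on the temporal-difference error instead of plain MSE, so large errors grow linearly rather than quadratically. The sketch below shows a standard DQN loss of this form; the network and batch names are placeholders, not our exact code.

```python
import torch
import torch.nn.functional as F

# Sketch of a standard DQN temporal-difference loss using Huber (smooth L1).
# q_net, target_net, and the replay-batch layout are assumed names.
def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target from the (frozen) target network
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    return F.smooth_l1_loss(q_values, targets)
```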
One of the frustrating parts of the project is that we wasted a lot of time setting up our Linux environment, only to find there was barely any improvement in the speed of our model, which is limited by Malmo. Some functions also behave strangely in Malmo; for example, you have to add 0.5 to the player's x and z coordinates to make the 'move' command work.
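For reference, this is roughly the workaround we mean: centering the agent on a block before issuing movement commands. It assumes `agent_host` is a connected Malmo `AgentHost` and that absolute-movement (`tp`) commands are enabled in the mission XML.

```python
# Sketch of the 0.5-offset workaround: teleport the agent to the center
# of its block (x + 0.5, z + 0.5) so that subsequent move commands behave.
# agent_host is assumed to be a Malmo AgentHost with "tp" commands allowed.
def center_on_block(agent_host, block_x, block_y, block_z):
    agent_host.sendCommand(f"tp {block_x + 0.5} {block_y} {block_z + 0.5}")
```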
We took the Deep Q-network from homework 2 and modified it to fit our game. PyTorch is used, and a greedy action-selection algorithm is added to our network.
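Below is a minimal sketch of the kind of Q-network and greedy action selection this involves. The layer sizes, the two-action (left/right) output, and the optional epsilon exploration term are illustrative assumptions rather than our tuned configuration.

```python
import random
import torch
import torch.nn as nn

# Minimal Q-network sketch: observation vector in, one Q-value per action out.
class QNet(nn.Module):
    def __init__(self, obs_size, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def select_action(q_net, state, epsilon=0.0):
    """Pick the greedy (argmax-Q) action; with probability epsilon,
    explore randomly instead (epsilon is an assumed, optional extra)."""
    if random.random() < epsilon:
        return random.randrange(q_net.net[-1].out_features)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```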