
[Technical] Monte Carlo Tree Search and Decision AI

Published at 07:55 PM

This article is mainly based on my personal understanding. If there are errors, please point them out.

It should help you quickly understand the basics of Monte Carlo Tree Search, but it won't explain the underlying theory in detail, because I don't understand it that deeply myself.

Monte Carlo Tree Search (MCTS)

Basic Principles

The main idea of the various Monte Carlo methods is to run many simple simulations in an environment with a huge state space, obtaining an approximation close to reality. For example, to find the area of a circle, you can enclose it in a rectangle, scatter many random points inside the rectangle, and check whether each point falls inside the circle. The ratio of points inside the circle to total points then approximates the ratio of the circle's area to the rectangle's area.
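As a quick illustration, here is a minimal Python sketch of that estimate (the function name and parameters are mine, not from any library):

```python
import random

def estimate_circle_area(radius=1.0, samples=100_000):
    """Estimate a circle's area by sampling points in its bounding square."""
    inside = 0
    for _ in range(samples):
        # Sample a point uniformly in the square [-r, r] x [-r, r].
        x = random.uniform(-radius, radius)
        y = random.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    # hits / samples approximates circle_area / square_area.
    return (2 * radius) ** 2 * inside / samples

print(estimate_circle_area())  # roughly 3.14 for radius 1.0
```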

The core of the Monte Carlo Tree Search algorithm is to obtain a near-optimal strategy by aggregating statistics over the results of many simulated playouts. The visit counts and total values accumulated on the tree's nodes indicate whether a choice is relatively good; therefore, the more simulations, the better the final strategy generally is.

Algorithm Applicability

Monte Carlo Tree Search mainly suits discrete decision problems with perfect information, such as board games, where play proceeds move by move to the end and nothing during the process prevents simulation. This is also the context in which I learned about MCTS (although what I eventually wanted to build failed).

Main Process

Each cycle of MCTS mainly has 4 steps: Selection, Expansion, Simulation, and Backpropagation.

Selection

Starting from the root node (the current state), descend step by step according to a selection policy (the UCB formula; UCT is simply MCTS that uses UCB as its tree policy) until reaching a leaf node.

The UCB formula is as follows:

$$\mathrm{UCB}(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c\sqrt{\frac{\log N(v)}{N(v_i)}}$$

Where $v$ is the current node, $v_i$ a child node, $Q$ the accumulated value, $N$ the visit count, and $c$ a custom constant, generally $\sqrt{2}$, adjusted by experience. From this we can see that as the parent's visit count $N(v)$ grows, the term after the plus sign becomes large for children with small $N(v_i)$, so the UCB value tends to favor less-visited nodes.

After comparing the UCB values of child nodes, select the highest one, then proceed to the next selection until reaching a leaf node.
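A minimal sketch of this selection step, assuming tree nodes with `q` (total value), `n` (visit count), and `children` attributes as defined in the Expansion sketch below:

```python
import math

def ucb_score(child_q, child_n, parent_n, c=math.sqrt(2)):
    """UCB of a child: average value plus an exploration bonus."""
    if child_n == 0:
        return float("inf")  # unvisited children are always tried first
    return child_q / child_n + c * math.sqrt(math.log(parent_n) / child_n)

def select_child(node):
    """Pick the child with the highest UCB score."""
    return max(node.children, key=lambda ch: ucb_score(ch.q, ch.n, node.n))
```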

Expansion

When reaching a leaf node, add a child node for every available action; in chess terms, add every board state reachable by a legal move from the current position. This is the strategy I adopted.

Another strategy is to expand only one child node per iteration. With this strategy, the selection step above can stop at any node that is not yet fully expanded; after selection ends, a child node for one of the untried actions is added to it.
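A sketch of the expand-everything variant; `legal_actions` and `apply_action` stand in for game-specific functions you would supply:

```python
class Node:
    def __init__(self, state, parent=None):
        self.state = state    # the game state this node represents
        self.parent = parent
        self.children = []
        self.q = 0.0          # accumulated value from simulations
        self.n = 0            # visit count

    def expand_all(self, legal_actions, apply_action):
        """Add one child per available action (the expand-everything strategy)."""
        for action in legal_actions(self.state):
            child_state = apply_action(self.state, action)
            self.children.append(Node(child_state, parent=self))
```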

Simulation

If the current node is not a terminal state (e.g., the game hasn't ended), then starting from the current node, perform random actions until a terminal condition is met (the game ends or a step limit is reached), and record the resulting value (e.g., +1 for a win, -1 for a loss).
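A sketch of such a random rollout, again with hypothetical game-specific helpers:

```python
import random

def rollout(state, legal_actions, apply_action, is_terminal, reward):
    """Play uniformly random moves from `state` until the game ends."""
    while not is_terminal(state):
        state = apply_action(state, random.choice(legal_actions(state)))
    return reward(state)  # e.g. +1 for a win, -1 for a loss
```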

Backpropagation

Also called value backtracking. Based on the selected path and the simulation result, add 1 to the visit count of every node along the path, and add the final value to each node's accumulated value.
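A minimal sketch, using the `Node` class from the Expansion section above:

```python
def backpropagate(node, value):
    """Walk from the simulated leaf back to the root, updating statistics."""
    while node is not None:
        node.n += 1       # one more visit on this path
        node.q += value   # accumulate the simulation result
        # In alternating two-player games you would typically flip the sign
        # of `value` here, since a win for one side is a loss for the other.
        node = node.parent
```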

Combining with Neural Networks

Building a tree by genuinely running many simulations for every decision is computationally expensive.

One idea is that MCTS can produce estimates of the best policy and value, and these estimates can be used to train a neural network, giving the model its own ability to make decisions and evaluate positions.

Because the model can both decide and evaluate, it can also replace the simulation step, with the value network computing the final value directly.

And since random simulation is dropped, the model's output is also folded into the selection step:

$$\mathrm{UCB}(v_i, v) = \frac{Q(v_i)}{N(v_i)} + c\,P(v, v_i)\sqrt{\frac{N(v)}{N(v_i) + 1}}$$

Where $P(v, v_i)$ is the prior probability of moving from state $v$ to child $v_i$, obtained from the neural network's policy output, and $c$ is a constant controlling exploration intensity.
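A sketch of this prior-weighted score, following the formula above; the value of `c` here is an arbitrary placeholder:

```python
import math

def puct_score(child_q, child_n, prior, parent_n, c=1.5):
    """Prior-weighted UCB: exploitation plus a policy-guided exploration term."""
    exploit = child_q / child_n if child_n > 0 else 0.0
    explore = c * prior * math.sqrt(parent_n / (child_n + 1))
    return exploit + explore
```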

Using the model to guide the search is actually similar to a kind of pruning.


I originally wanted to write a Gomoku AI. I wrote a program that I believed was correct, but its results were very poor; I don't know whether the number of simulations was simply too small.

Later I ran someone else's project, which used a different training strategy and achieved good results in very few training steps. Its win rate when I tested it against my own program was inexplicable.

