# StoreLoad Reordering on x86: Description and Verification

# What the Intel Manual Says

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

In other words, a load may be reordered with an earlier store only when the two access different locations; a load is never reordered with an earlier store to the same location.

The fact that a load may be reordered with an earlier store to a different location is illustrated by the following example:
Example 8-3. Loads May be Reordered with Older Stores
Processor 0          Processor 1
mov [ x ], 1         mov [ y ], 1
mov r1, [ y ]        mov r2, [ x ]
Initially: x == y == 0
The outcome r1 == 0 and r2 == 0 is allowed


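From the description above, Intel's TSO model permits StoreLoad reordering. How do we reason about it? Fix one of the observed values and follow the consequences:

1. Suppose r1 == 0. If every instruction took effect in program order, Processor 0's load of y must have run before Processor 1's store to y, so Processor 0 ran ahead of Processor 1.

2. If Processor 0 ran ahead of Processor 1, then Processor 0's store to x precedes Processor 1's mov r2, [ x ], and r2 should be 1.

3. Yet Processor 1 read the old value of x (r2 == 0).

This contradiction shows that a reordering took place. We could equally fix r2 == 0 and then reason about the value of r1.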
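The argument above can be checked mechanically. The following is a minimal sketch, not part of the original article: it enumerates every interleaving of the four operations that preserves each processor's program order and executes it against a single shared memory with no store buffers. The class and method names (`ScInterleavings`, `outcomes`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ScInterleavings {
    // Operation codes. Processor 0 runs {0: store x = 1, 1: load y into r1};
    // Processor 1 runs {2: store y = 1, 3: load x into r2}.
    static List<String> outcomes() {
        List<String> seen = new ArrayList<>();
        interleave(new int[]{0, 1}, 0, new int[]{2, 3}, 0, new int[4], 0, seen);
        return seen;
    }

    // Merges the two programs in every possible way while keeping each program's own order.
    static void interleave(int[] p0, int i, int[] p1, int j, int[] order, int n, List<String> seen) {
        if (n == order.length) {
            String result = execute(order);
            if (!seen.contains(result)) {
                seen.add(result);
            }
            return;
        }
        if (i < p0.length) {
            order[n] = p0[i];
            interleave(p0, i + 1, p1, j, order, n + 1, seen);
        }
        if (j < p1.length) {
            order[n] = p1[j];
            interleave(p0, i, p1, j + 1, order, n + 1, seen);
        }
    }

    // Executes one global order; every store is visible to every later load.
    static String execute(int[] order) {
        int x = 0, y = 0, r1 = -1, r2 = -1;
        for (int op : order) {
            switch (op) {
                case 0: x = 1; break;
                case 1: r1 = y; break;
                case 2: y = 1; break;
                case 3: r2 = x; break;
                default: throw new IllegalArgumentException("unknown op " + op);
            }
        }
        return "r1=" + r1 + " r2=" + r2;
    }

    public static void main(String[] args) {
        System.out.println(outcomes()); // prints [r1=0 r2=1, r1=1 r2=1, r1=1 r2=0]
    }
}
```

None of the six program-order-preserving interleavings yields r1 == 0 and r2 == 0, so observing that outcome on real hardware means some load took effect before an earlier store of the same processor.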

We can also analyse it another way. As shown below, reduce each CPU's work to plain store and load operations.

Processor 0          Processor 1
store x , 1         store y , 1
load  y             load  x
store r1, y         store r2, x


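Now write the outcome r1 == 0 and r2 == 0 as one global sequence of those store and load operations (this time fixing r2 == 0):

load  x  // Processor 1's load x runs before Processor 0's store x , 1, and therefore ahead of Processor 1's own earlier store y , 1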

store x , 1 
load  y 
store r1, y

store y , 1

store r2, x


So why does this reordering occur on Intel processors? The manual explains:

At each processor, the load and the store are to different locations and hence may be reordered. Any interleaving of the operations is thus allowed. One such interleaving has the two loads occurring before the two stores. This would result in each load returning value 0.

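In other words, on each processor the load and the store target different locations, so they may be reordered, and any interleaving of the operations is allowed. In one such interleaving both loads execute before both stores, so each load returns 0.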


The two processors then behave as if they had executed:

Processor 0          Processor 1
mov r1, [ y ]         mov r2, [ x ]  // the loads have moved ahead of the stores
mov [ x ], 1          mov [ y ], 1


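That reordering comes from each CPU executing its own load early. Is there any other mechanism that produces the same effect? Consider the following passages from the manual (only the store buffer matters here):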

11.1 INTERNAL CACHES, TLBS, AND BUFFERS

The store buffer is associated with the processors instruction execution units. It allows writes to system memory and/or the internal caches to be saved and in some cases combined to optimize the processor’s bus accesses. The store buffer is always enabled in all execution modes.

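In short: the store buffer sits next to the processor's instruction execution units. It holds writes destined for system memory and/or the internal caches, sometimes combining them to optimise bus accesses, and it is enabled in every execution mode.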

The processor’s caches are for the most part transparent to software. When enabled, instructions and data flow through these caches without the need for explicit software control. However, knowledge of the behavior of these caches may be useful in optimizing software performance. For example, knowledge of cache dimensions and replacement algorithms gives an indication of how large of a data structure can be operated on at once without causing cache thrashing.

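In short: the caches are largely transparent to software; instructions and data flow through them without explicit software control. Knowing how they behave still helps with performance tuning: cache sizes and replacement algorithms indicate how large a data structure can be processed at once without cache thrashing.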

In multiprocessor systems, maintenance of cache consistency may, in rare circumstances, require intervention by system software. For these rare cases, the processor provides privileged cache control instructions for use in flushing caches and forcing memory ordering.

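In short: on multiprocessor systems, keeping caches consistent may in rare cases need help from system software, and for those cases the processor provides privileged cache-control instructions that flush caches and force memory ordering.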

11.10 STORE BUFFER 

Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer improves processor performance by allowing the processor to continue executing instructions without having to wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use of memory-access bus cycles.

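In short: Intel 64 and IA-32 processors park every write to memory in the store buffer first. The processor can then keep executing without waiting for the write to reach memory and/or the cache, and writes can be delayed to use memory-access bus cycles more efficiently.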

In general, the existence of the store buffer is transparent to software, even in systems that use multiple processors. The processor ensures that write operations are always carried out in program order. It also insures that the contents of the store buffer are always drained to memory in the following situations:

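In short: the store buffer is normally transparent to software, even on multiprocessor systems, and the processor guarantees that writes are carried out in program order. (Note: the store buffer is a FIFO queue, so entries leave it for memory in the order they entered.) The processor also guarantees that the store buffer is drained to memory in the following situations: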

• When an exception or interrupt is generated.
• (P6 and more recent processor families only) When a serializing instruction is executed. 
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (P6 and more recent processor families only) When a BINIT operation is performed.
• (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.

That is, the store buffer is drained when an exception or interrupt occurs, and when the processor executes a serializing instruction, an I/O instruction (in, out), a LOCK operation, a BINIT operation, an SFENCE instruction, or an MFENCE instruction.
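To make the behaviour described in these passages concrete, here is a toy model of one processor's store buffer. It is only a sketch under simplifying assumptions: memory is a `Map`, the buffer is an `ArrayDeque`, and `drain()` stands in for any of the draining events listed above. The class `ToyStoreBuffer` and its methods are hypothetical names, not an Intel or JDK API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class ToyStoreBuffer {
    static final class Entry {
        final String addr;
        final int value;
        Entry(String addr, int value) { this.addr = addr; this.value = value; }
    }

    private final Map<String, Integer> memory;                // shared with the other "processors"
    private final Deque<Entry> pending = new ArrayDeque<>();  // this processor's stores, oldest first

    ToyStoreBuffer(Map<String, Integer> memory) { this.memory = memory; }

    // A store only enters the local buffer; other processors cannot see it yet.
    void store(String addr, int value) { pending.addLast(new Entry(addr, value)); }

    // A load checks the processor's own buffer first (newest entry wins), then memory.
    // This is why a load is never reordered with an earlier store to the same location.
    int load(String addr) {
        Iterator<Entry> newestFirst = pending.descendingIterator();
        while (newestFirst.hasNext()) {
            Entry e = newestFirst.next();
            if (e.addr.equals(addr)) {
                return e.value;
            }
        }
        return memory.getOrDefault(addr, 0);
    }

    // Drains in FIFO order, so stores reach memory in program order.
    void drain() {
        while (!pending.isEmpty()) {
            Entry e = pending.pollFirst();
            memory.put(e.addr, e.value);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> memory = new HashMap<>();
        ToyStoreBuffer cpu0 = new ToyStoreBuffer(memory);
        cpu0.store("x", 1);
        System.out.println("cpu0 reads x = " + cpu0.load("x"));                            // 1, forwarded from the buffer
        System.out.println("memory holds x = " + memory.getOrDefault("x", 0));             // 0, not drained yet
        cpu0.drain();
        System.out.println("after drain, memory holds x = " + memory.getOrDefault("x", 0)); // 1
    }
}
```

The model deliberately ignores caches and timing; it only captures the three properties the text relies on: local forwarding, FIFO draining, and invisibility to other processors until the drain.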


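The store buffer description gives us the second cause of the reordering above: data parked in a store buffer is invisible to other CPUs. A value written by one processor cannot be seen by the others while it still sits in that processor's store buffer. Why not? Because Intel's MESI implementation covers only the caches, as the passage below shows. The store buffer is not part of MESI; MESI governs coherence of the L1/L2/L3 caches only. The conclusion: once data leaves the store buffer, the MESI protocol keeps every CPU's cache coherent.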

11.4 CACHE CONTROL PROTOCOL
In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors.
    
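In short: in the L1 data cache and the unified L2/L3 caches, the MESI (modified, exclusive, shared, invalid) protocol keeps a processor's caches consistent with the caches of other processors.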


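How does Intel's MESI implementation maintain that consistency? According to the passage below, a processor keeps its cache-line states up to date by snooping; the manual does not describe the invalidate queue, the MESI optimisation discussed on Wikipedia.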

11.2 CACHING TERMINOLOGY

When operating in an MP system, IA-32 processors (beginning with the Intel486 processor) and Intel 64 processors have the ability to snoop other processor’s accesses to system memory and to their internal caches. They use this snooping ability to keep their internal caches consistent both with system memory and with the caches in other processors on the bus. For example, in the Pentium and P6 family processors, if through snooping one processor detects that another processor intends to write to a memory location that it currently has cached in shared state, the snooping processor will invalidate its cache line forcing it to perform a cache line fill the next time it accesses the same memory location.

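In short: in an MP (multiprocessor) system, IA-32 processors (from the Intel486 onward) and Intel 64 processors can snoop other processors' accesses to system memory and to their internal caches, and they use this to keep their own caches consistent with system memory and with the other caches on the bus. For example, on Pentium and P6 family processors, when a processor snoops another processor's intent to write to a location that it holds in the shared state, it invalidates its own cache line and must perform a line fill on its next access to that location.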


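Finally, let us revisit the code from the beginning from the store buffer's point of view. None of the conditions that drain the store buffer is met: no I/O instruction, no serializing instruction, no LOCK operation, and no interrupt handling. (A CPU can check for interrupts in a cycle appended to instruction execution, for example checking the APIC or the INTR line after mov [ y ], 1; since there are only two instructions, we assume no hardware interrupt arrives in that window.) The values that Processor 0 and Processor 1 wrote to x and y are therefore still sitting in their respective store buffers, which yields r1 == r2 == 0.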
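The store-buffer reading of Example 8-3 can be replayed in a small deterministic simulation. This is a sketch under stated assumptions, not a model of real hardware timing: both stores are issued, then both loads, and the `drainBeforeLoad` flag stands in for a draining event (for example a LOCK-prefixed instruction or MFENCE) placed between each processor's store and load. The names `StoreBufferLitmus` and `run` are invented for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class StoreBufferLitmus {
    static final int X = 0, Y = 1;   // indices into the shared memory array

    // Returns {r1, r2} for one run of the two-processor example.
    static int[] run(boolean drainBeforeLoad) {
        int[] memory = new int[2];                    // x == y == 0 initially
        Deque<int[]> buffer0 = new ArrayDeque<>();    // pending {address, value} stores of Processor 0
        Deque<int[]> buffer1 = new ArrayDeque<>();    // pending {address, value} stores of Processor 1

        buffer0.addLast(new int[]{X, 1});             // Processor 0: mov [x], 1
        buffer1.addLast(new int[]{Y, 1});             // Processor 1: mov [y], 1

        if (drainBeforeLoad) {                        // a draining event sits between store and load
            drain(buffer0, memory);
            drain(buffer1, memory);
        }

        // Neither processor has buffered a store to the location it loads,
        // so both loads are served from shared memory.
        int r1 = memory[Y];                           // Processor 0: mov r1, [y]
        int r2 = memory[X];                           // Processor 1: mov r2, [x]

        drain(buffer0, memory);                       // the buffers drain eventually in any case
        drain(buffer1, memory);
        return new int[]{r1, r2};
    }

    static void drain(Deque<int[]> buffer, int[] memory) {
        while (!buffer.isEmpty()) {
            int[] store = buffer.pollFirst();         // FIFO: oldest store first
            memory[store[0]] = store[1];
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(false))); // prints [0, 0]
        System.out.println(Arrays.toString(run(true)));  // prints [1, 1]
    }
}
```

With the buffers left undrained both loads read stale memory, which is exactly the r1 == r2 == 0 outcome; draining before the loads makes it impossible in this schedule.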

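This raises a question: if Processor 0 and Processor 1 keep reading x and y in a loop, will they eventually see the latest value, 1? Yes, they must. The store buffer is only temporary storage, and it is guaranteed to drain when an interrupt occurs or a serializing instruction executes. Think of the timer interrupt that drives process switching, the interrupts raised by devices such as the mouse and keyboard, and the other exceptions the operating system handles: any of them forces the store buffer to drain.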

Processor 0          Processor 1
mov [ x ], 1         mov [ y ], 1    
mov r1, [ y ]        mov r2, [ x ]


# Java-Level Description

Consider the following Java program:

1. Four variables a, b, c and d are initialised

2. There are three threads: thread-1, thread-2 and the main thread

thread-1 executes:

a = 0xfa;
b = c;


thread-2 executes:

c = 0xfc;
d = a;


The main thread executes:

if (b == 0 && d == 0 && a == 0xfa && c == 0xfc) {
    System.out.println(String.format("a:%d\tb:%d\tc:%d\td:%d\t", a, b, c, d));
    break;
}
a = 0;
b = 0;
c = 0;
d = 0;


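Under concurrency, the main thread can observe the outcomes of thread-1 and thread-2 listed below, and the last of them is illegal under a sequentially consistent execution. Why? Consider:

1. Suppose a == 0xfa and b == 0. Then thread-1 must have run before thread-2.

2. If thread-1 ran before thread-2, then d = a in thread-2 should leave d equal to 0xfa, not 0.

We can also go the other way: fix the results for c and d in thread-2 and reason about a and b in thread-1.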

a=0xfa c=0xfc b=0xfc d=0xfa
a=0xfa c=0xfc b=0 d=0xfa
a=0xfa c=0xfc b=0xfc d=0

a=0xfa c=0xfc b=0 d=0 // illegal result


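Can this illegal result actually occur? Run the following code to find out; it loops until it observes that result, then prints it and exits:

import java.util.concurrent.CyclicBarrier;

/**
 * @author hj
 * @version 1.0
 */
public class StoreLoadDemo {
    static int a = 0, b = 0, c = 0, d = 0;

    static CyclicBarrier cyclicBarrier = new CyclicBarrier(2); // the barrier releases both threads together so that they run as concurrently as possible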

    public static void run1() throws Exception {
        cyclicBarrier.await();
        a = 0xfa;
        b = c;
    }

    public static void run2() throws Exception {
        cyclicBarrier.await();
        c = 0xfc;
        d = a;
    }

    public static void main(String[] args) throws Exception {

        for (; ; ) {
            Thread t1 = new Thread(() -> {
                try {
                    run1();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            Thread t2 = new Thread(() -> {
                try {
                    run2();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            t1.start();
            t2.start();
            t1.join();
            t2.join();  // the main thread waits for both child threads to finish
            cyclicBarrier.reset();
            if (b == 0 && d == 0 && a == 0xfa && c == 0xfc) {
                System.out.println(String.format("a:%d\tb:%d\tc:%d\td:%d\t", a, b, c, d));
                break;
            }
            a = 0;
            b = 0;
            c = 0;
            d = 0;
        }
    }
}


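Before analysing why this happens, I must address visibility among thread-1, thread-2 and the main thread, so that readers are not left wondering:

1. Does the main thread read the latest values of a, b, c and d?

2. Are the values of a, b, c and d that the main thread initialises visible to thread-1 and thread-2?

The JMM answers both through its rules for Thread start and join.

What is happens-before? Here is the description: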

Happens-Before Relationship: Two actions can be ordered by a happens-before relationship. If one action happens before another, then the first is visible to and ordered before the second. It should be stressed that a happens-before relationship between two actions does not imply that those actions must occur in that order in a Java implementation. Rather, it implies that if they occur out of order, that fact cannot be detected. There are a number of ways to induce a happens-before ordering in a Java program, including:

• A call to start() on a thread happens before any actions in the started thread.
• All actions in a thread happen before any other thread successfully returns from a join() on that thread.

In short: if one action happens-before another, the effects of the first are visible to the second. This does not oblige a Java implementation to execute the two actions in that order; it only means that, if they do execute out of order, the program cannot detect it. Thread start() and join() are two of the ways to establish such an ordering.

Note: a JVM implementation is free to do whatever it likes, as long as it honours the happens-before rules.
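The two rules quoted above are enough to pass data through plain, non-volatile fields. The sketch below illustrates that; the class `StartJoinVisibility` and its method are hypothetical names chosen for this example.

```java
class StartJoinVisibility {
    static int input;    // plain fields, deliberately not volatile
    static int output;

    static int doubleInWorker(int value) {
        input = value;                                         // written before start()
        Thread worker = new Thread(() -> output = input * 2);
        worker.start();                                        // start() happens-before everything the worker does
        try {
            worker.join();                                     // everything the worker did happens-before join() returning
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return output;                                         // guaranteed to see the worker's write
    }

    public static void main(String[] args) {
        System.out.println(doubleInWorker(21)); // prints 42
    }
}
```

This is the same pattern the demo relies on: the main thread resets a, b, c and d before start() and reads them after join(), so no extra synchronisation is needed for those accesses.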


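So the visibility among the three threads is fully guaranteed; that is the premise of the analysis. If you do not trust the specification, you can enforce the visibility yourself with full fences, as in the code below. This is redundant, though. cyclicBarrier.await() already performs a CAS, which relies on the lock instruction underneath. Thread start and join rely, at the JVM level, on OS mutexes (futex), which in turn use atomic instructions; those instructions prevent reordering across them and drain the store buffer. You may add the fences anyway: the illegal result still appears.

import java.util.concurrent.CyclicBarrier;

public class StoreLoadDemo {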
    static int a = 0, b = 0, c = 0, d = 0;

    static CyclicBarrier cyclicBarrier = new CyclicBarrier(2);

    public static void run1() throws Exception {
        MyUtils.getUnsafe().fullFence(); // full fence: drains the store buffer and stops the CPU from reordering across it
        cyclicBarrier.await();
        a = 0xfa;
        b = c;
    }

    public static void run2() throws Exception {
        MyUtils.getUnsafe().fullFence(); // full fence: drains the store buffer and stops the CPU from reordering across it
        cyclicBarrier.await();
        c = 0xfc;
        d = a;
    }

    public static void main(String[] args) throws Exception {

        for (; ; ) {
            Thread t1 = new Thread(() -> {
                try {
                    run1();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            Thread t2 = new Thread(() -> {
                try {
                    run2();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            cyclicBarrier.reset();
            if (b == 0 && d == 0 && a == 0xfa && c == 0xfc) {
                System.out.println(String.format("a:%d\tb:%d\tc:%d\td:%d\t", a, b, c, d));
                break;
            }
            a = 0;
            b = 0;
            c = 0;
            d = 0;
            MyUtils.getUnsafe().fullFence(); // full fence: drains the store buffer and stops the CPU from reordering across it
        }
    }
}

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class MyUtils {
    public static final Unsafe UNSAFE;
    static {
        try {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            UNSAFE = (Unsafe) theUnsafe.get(null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static Unsafe getUnsafe() {
        return UNSAFE;
    }
}


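# JVM-Level Description

Now look at how Unsafe's fullFence is implemented on x86. It uses the lock instruction as the full fence. Why? We have already seen that lock drains the store buffer. Can it also stop a single processor from reordering a store with a later load to a different address? The Intel manual makes the answer clear.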

8.2.2 Memory Ordering in P6 and More Recent Processor Families

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles (Note the memory-ordering principles for single-processor and multiple processor systems are written from the perspective of software executing on the processor, where the term “processor” refers to a logical processor. For example, a physical processor supporting multiple cores and/or Intel Hyper-Threading Technology is treated as a multi-processor systems.):
• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the following exceptions:
— streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
— string operations (see Section 8.2.4.1).
• No write to memory may be reordered with an execution of the CLFLUSH instruction; a write may be reordered with an execution of the CLFLUSHOPT instruction that flushes a cache line other than the one being written. Executions of the CLFLUSH instruction are not reordered with each other. Executions of CLFLUSHOPT that access different cache lines may be reordered with each other. An execution of CLFLUSHOPT may be reordered with an execution of CLFLUSH that accesses a different cache line.
• Reads may be reordered with older writes to different locations but not with older writes to the same location. 
• Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
• Reads cannot pass earlier LFENCE and MFENCE instructions.
• Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
• LFENCE instructions cannot pass earlier reads.
• SFENCE instructions cannot pass earlier writes or executions of CLFLUSH and CLFLUSHOPT.
• MFENCE instructions cannot pass earlier reads, writes, or executions of CLFLUSH and CLFLUSHOPT.

In short: these are the memory-ordering principles for write-back cacheable memory. They are written from the point of view of software running on a logical processor, so a physical processor with multiple cores and/or Intel Hyper-Threading Technology counts as a multiprocessor system. Only the following two rules matter here:
• Reads may be reordered with older writes to different locations but not with older writes to the same location.
• Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.

HotSpot implements fullFence as follows:

JNINativeMethod fence_methods[] = {
    {CC"loadFence",          CC"()V",                    FN_PTR(Unsafe_LoadFence)},
    {CC"storeFence",         CC"()V",                    FN_PTR(Unsafe_StoreFence)},
    {CC"fullFence",          CC"()V",                    FN_PTR(Unsafe_FullFence)},
};

UNSAFE_ENTRY(void, Unsafe_FullFence(JNIEnv *env, jobject unsafe))
  UnsafeWrapper("Unsafe_FullFence");
  OrderAccess::fence();
UNSAFE_END

inline void OrderAccess::fence() {
  if (os::is_MP()) {
    // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  }
}
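For contrast with the earlier demo, where the fences sat before the barrier and the illegal result still appeared, the sketch below puts a full fence between each thread's store and its following load. It is an illustration under assumptions rather than the article's own code: it uses `VarHandle.fullFence()`, which needs Java 9 or later, instead of `sun.misc.Unsafe`, and the class name `FencedStoreLoadDemo` and its helper methods are made up.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.CyclicBarrier;

class FencedStoreLoadDemo {
    static int a, b, c, d;

    // Runs the two-thread test `rounds` times and counts how often b == 0 && d == 0 is seen.
    static int countBothZero(int rounds) {
        int hits = 0;
        for (int i = 0; i < rounds; i++) {
            a = 0; b = 0; c = 0; d = 0;
            CyclicBarrier barrier = new CyclicBarrier(2);
            Thread t1 = new Thread(() -> {
                await(barrier);
                a = 0xfa;
                VarHandle.fullFence();   // the load of c may not move above the store to a
                b = c;
            });
            Thread t2 = new Thread(() -> {
                await(barrier);
                c = 0xfc;
                VarHandle.fullFence();   // the load of a may not move above the store to c
                d = a;
            });
            t1.start();
            t2.start();
            join(t1);
            join(t2);
            if (b == 0 && d == 0) {
                hits++;
            }
        }
        return hits;
    }

    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("b == 0 && d == 0 observed " + countBothZero(10_000) + " times");
    }
}
```

With the fence in this position, each store must be globally visible before the following load executes, so at least one thread has to observe the other's write and the count should stay at zero.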


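Java code is subject to JIT optimisation, so some readers may suspect that the JIT compiler reordered the code of thread-1, thread-2 and main. Readers familiar with CPU barriers and compiler barriers will know that a CPU barrier instruction also acts as a compiler barrier; the code of all three threads already contains CPU barriers indirectly, which keeps the JIT from optimising across them. To make the point more convincingly, I built HotSpot 1.8 on Ubuntu 14 in debug mode, compiled the Java code above to bytecode with javac, and ran it on that VM with the following options: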

-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:CompileCommand=compileonly -Xcomp -XX:TieredStopAtLevel=4 -XX:+PrintLIRWithAssembly -Xbatch -XX:+LogVMOutput

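With these options the code is compiled by C2, the highest optimisation tier, and the generated assembly is printed. We only need the assembly executed by thread-1, thread-2 and the main thread. (This is why I assign 0xfa and 0xfc: the output is huge, and I need those values as search keywords.) As shown below, the JIT did not reorder the instructions; they follow program order.

# thread-1: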

0x00007fcee138594c: movl   $0xfa,0x64(%rbp)   ;*putstatic a
0x00007fcee1385953: mov    0x6c(%rbp),%r11d
0x00007fcee1385957: mov    %r11d,0x68(%rbp)   ;*putstatic b

# thread-2:

0x00007fcee144b8cc: movl   $0xfc,0x6c(%rbp)   ;*putstatic c       
0x00007fcee144b8d3: mov    0x64(%rbp),%r11d
0x00007fcee144b8d7: mov    %r11d,0x70(%rbp) ;*putstatic d


# main:

0x00007fcee13b62ac: mov    %r12d,0x68(%r10)   ;*putstatic b

0x00007fcee13b62b0: mov    %r12d,0x6c(%r10)   ;*putstatic c

0x00007fcee13b62b4: mov    %r12d,0x64(%r10)   ;*putstatic a

0x00007fcee13b62b8: mov    %r12d,0x70(%r10)   ;*putstatic d


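Finally, let us look at how HotSpot implements volatile; after all, once a, b, c and d are declared volatile the illegal result no longer appears. Why? (We use the C++ BytecodeInterpreter here, because C++ reads more easily than assembly.) The source shows that on x86 volatile is also implemented with lock as the full fence, while the other barriers degrade to compiler barriers.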

CASE(_putstatic):
{
    ...
     if (cache->is_volatile()) {
         if (tos_type == itos) {
              obj->release_int_field_put(field_offset, STACK_INT(-1));
         } ... // other types
         OrderAccess::storeload();
     }
}

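inline void oopDesc::release_int_field_put(int offset, jint contents) {
    OrderAccess::release_store(int_field_addr(offset), contents);
}

// As described earlier, x86 has a store buffer and rules out StoreStore, LoadStore and LoadLoad reordering,
// so a store through a volatile pointer, which the compiler will not reorder, is enough here (jshort overload shown)
inline void     OrderAccess::release_store(volatile jshort*  p, jshort  v) { *p = v; }


// Compiler barrier: keeps the compiler from reordering memory accesses across it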
static inline void compiler_barrier() {
  __asm__ volatile ("" : : : "memory");
}
inline void OrderAccess::loadload()   { compiler_barrier(); }
inline void OrderAccess::storestore() { compiler_barrier(); }
inline void OrderAccess::loadstore()  { compiler_barrier(); }
inline void OrderAccess::storeload()  { fence();            }

inline void OrderAccess::acquire()    { compiler_barrier(); } // read barrier degrades to a compiler barrier
inline void OrderAccess::release()    { compiler_barrier(); } // write barrier degrades to a compiler barrier

inline void OrderAccess::fence() {
   // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  compiler_barrier();
}


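# Appendix

To finish, consider a visibility problem. With default JVM options the program below never stops, but either of the JVM options that follow makes it stop. It is easy to guess that the C2 compiler is responsible. Why? Data in the store buffer is always drained eventually (see the store buffer description earlier), so the failure to stop is clearly not the CPU's doing. Let us verify.

 -Xint  // stops when added on its own
 
 -XX:TieredStopAtLevel=3 // also stops when added on its own; the JIT stops at the last C1 tier and never enters C2
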
/**
 * @author hj
 * @version 1.0
 */
public class Demo {

    public static int flag;

    public static void test() {
        while (flag != 0xfa) {

        }
    }

    public static void main(String[] args) throws Exception {
        new Thread(() -> test()).start();
        Thread.sleep(1000);
        flag = 0xfa;
    }
}


As before, we run it on the HotSpot build with the following options, which print the assembly generated by C2.

-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:CompileCommand=compileonly -Xcomp -XX:TieredStopAtLevel=4 -XX:+PrintLIRWithAssembly -Xbatch -XX:+LogVMOutput
# child thread assembly
0x00007f3c4523458c: mov    $0xd6901170,%r10   ;   {oop(a 'java/lang/Class' = 'Demo')} # static fields live at the end of the Class object on the heap, so first load the Class object
0x00007f3c45234596: mov    0x60(%r10),%r11d  # then read flag at its offset
0x00007f3c4523459a: cmp    $0xfa,%r11d # compare it with 0xfa

0x00007f3c452345a1: je     0x00007f3c452345ab  ; OopMap{off=35} # if it equals 0xfa, jump to 0x00007f3c452345ab and leave test()
                                                ;*goto
                                                ; - Demo::test@9 (line 8) 
                                                
# otherwise keep going
0x00007f3c452345a3: test   %eax,0xcf10a57(%rip)        # 0x00007f3c52145000 safepoint poll: when a stop-the-world (STW) pause is needed, the JVM makes the polling page inaccessible; reading it then makes the OS deliver a signal, and the thread blocks itself in response, entering the STW phase
                                                ;   {poll}
0x00007f3c452345a9: jmp    0x00007f3c452345a3   # C2 turned the loop into a jump back to the same address; flag is never checked again
                                                
                                                
0x00007f3c452345ab: add    $0x10,%rsp # leave test()
0x00007f3c452345af: pop    %rbp
0x00007f3c452345b0: test   %eax,0xcf10a4a(%rip)        # 0x00007f3c52145000
                                                ;   {poll_return}
0x00007f3c452345b6: retq                      ;*goto
                                                ; - Demo::test@9 (line 8)


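So, based on two facts:

1. On x86, write barriers and read barriers degrade to compiler barriers

2. The loop fails to stop because of a compiler optimisation

we can modify the code above as follows. Run it yourself: it stops as expected.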

public class Demo {

    public static int flag;

    public static void test() {
        while (flag != 0xfa) {
            MyUtils.getUnsafe().loadFence(); // acts as a compiler barrier here
        }
    }

    public static void main(String[] args) throws Exception {
        new Thread(() -> test()).start();
        Thread.sleep(1000);
        flag = 0xfa;

    }
}


The assembly for the code above:

  0x00007f9d094490cc: mov    $0xd6907350,%r10   ;   {oop(a 'java/lang/Class' = 'Demo')}
  0x00007f9d094490d6: mov    0x64(%r10),%r8d
  0x00007f9d094490da: cmp    $0xfa,%r8d
  0x00007f9d094490e1: je     0x00007f9d094490f6  ; OopMap{r10=Oop off=35} # if flag is 0xfa, jump to the exit
                                                ;*goto
                                                ; - Demo::test@15 (line 19)
                                                
  0x00007f9d094490e3: test   %eax,0xca58f17(%rip)        # 0x00007f9d15ea2000 safepoint poll
                                                ;*goto
                                                ; - Demo::test@15 (line 19)
                                                ;   {poll}
  0x00007f9d094490e9: mov    0x64(%r10),%r11d
  0x00007f9d094490ed: cmp    $0xfa,%r11d
  0x00007f9d094490f4: jne    0x00007f9d094490e3  ;*if_icmpeq  keeps jumping back to 0x00007f9d094490e3; every pass through the loop re-reads 0x64(%r10), the address of flag
                                                ; - Demo::test@6 (line 18)
                                                
  0x00007f9d094490f6: add    $0x10,%rsp  # leave test()
  0x00007f9d094490fa: pop    %rbp
  0x00007f9d094490fb: test   %eax,0xca58eff(%rip)        # 0x00007f9d15ea2000
                                                ;   {poll_return}
  0x00007f9d09449101: retq
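The fix above goes through Unsafe. A more conventional alternative, swapped in here purely for illustration, is to declare the flag volatile, which obliges the JIT to re-read it on every iteration. The sketch below assumes nothing beyond the standard library; the class `StoppableLoop` and the method `stopsWithin` are hypothetical names.

```java
class StoppableLoop {
    static volatile int flag;   // volatile: the read cannot be hoisted out of the loop

    // Starts the spinning thread, sets the flag, and reports whether the loop ended in time.
    static boolean stopsWithin(long millis) {
        flag = 0;
        Thread spinner = new Thread(() -> {
            while (flag != 0xfa) {
                // busy-wait; every iteration performs a fresh volatile read
            }
        });
        spinner.setDaemon(true);   // do not keep the JVM alive if the loop never ends
        spinner.start();
        try {
            Thread.sleep(100);     // give the JIT a chance to compile the loop
            flag = 0xfa;
            spinner.join(millis);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return !spinner.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("loop stopped: " + stopsWithin(5_000));
    }
}
```

Either approach addresses the same cause: the compiler is no longer allowed to assume that flag stays unchanged across loop iterations.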


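# Summary

Intel CPUs are strongly ordered, yet load optimisation and the store buffer still produce reordering. That reordering has two causes:

1. Load optimisation (loads executing early)

2. Delayed write-out from the store buffer

Reordering itself is not what causes visibility problems at the Java level. Even with reordering, and even while data sits in the store buffer, the data is eventually written out to main memory and becomes visible to other threads. The root cause of Java-level visibility problems is aggressive optimisation by the JIT compiler, not the CPU.

Readers can examine the constraints on the ARM platform in the same way. ARM uses MOESI and does not rely on an Intel-style snoop mechanism to keep caches coherent, but the essentials are the same: whether data sits in a store buffer or an invalidate queue, it eventually becomes visible to other threads, because exceptions, process switches, interrupts and the like are bound to flush it out. See the ARM manuals for the supporting evidence; a CPU that did otherwise would have a design flaw. As Intel puts it, these optimisations are largely transparent to software developers but can occasionally cause reordering, so ARM likewise provides instructions that restrict them and keep the affected operations in order. The key point stands: reordering and visibility are two different things.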
