Ok, let's do a quick post on an annoying bug I encountered today.
Recently, I have been working on some algorithms for Bayesian optimization. While extensively benchmarking my new implementations against existing methods, I noticed some strange behavior: in a certain setup, the output of the function was completely flat (a horizontal line), which is by no means the expected behavior.
I was absolutely clueless at first, as I had been working on this code for several weeks and the results were generally good. After hours of debugging, I realized the problem was a mismatch in the shapes of two tensors, introduced by a wrong call to squeeze().
Say I have two tensors x and y, both two-dimensional and of the same shape. At the end of the numerical expression, we reduce the last dimension by taking the mean along it.
import torch
x = torch.empty((10, 1))
y = torch.empty((10, 1))
z = x + y
w = z.mean(dim=-1)
print(f"x: {x.shape}")
print(f"y: {y.shape}")
print(f"z: {z.shape}")
print(f"w: {w.shape}")
x: torch.Size([10, 1])
y: torch.Size([10, 1])
z: torch.Size([10, 1])
w: torch.Size([10])
We simply do an element-wise addition of x and y, and then take the mean along the last dimension. The final result w should be a one-dimensional tensor.
Now, let's assume we screw up the dimensionality a little bit.
x2 = torch.empty(10)
y2 = torch.empty((10, 1))
z2 = x2 + y2
w2 = z2.mean(dim=-1)
print(f"x2: {x2.shape}")
print(f"y2: {y2.shape}")
print(f"z2: {z2.shape}")
print(f"w2: {w2.shape}")
x2: torch.Size([10])
y2: torch.Size([10, 1])
z2: torch.Size([10, 10])
w2: torch.Size([10])
Because x2 is only one-dimensional, z2 = x2 + y2 becomes a broadcasted addition instead of an element-wise one. While the resulting tensor z2 has a wrong shape of (10, 10), the very next step is a dimension reduction, which produces a tensor w2 of the correct shape. The math behind it, however, is completely wrong. Because the broadcasted version is programmatically valid, no error is raised, and on the numerical side it is hard to notice as well.
The first version computes
$$w_i = x_i + y_i,$$
while the second version computes
$$w2_i = y2_i + \overline{x2},$$
where the bar denotes the mean of x2 over its elements.
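To make this concrete, here is a quick standalone check (a sketch using random data rather than the torch.empty tensors above) that the broadcasted result really is y2 plus the mean of x2:
import torch

torch.manual_seed(0)
x2 = torch.randn(10)       # shape (10,)
y2 = torch.randn(10, 1)    # shape (10, 1)

z2 = x2 + y2               # broadcasts: z2[i, j] = x2[j] + y2[i, 0], shape (10, 10)
w2 = z2.mean(dim=-1)       # shape (10,)

# every entry is y2_i plus the mean of the whole of x2, not y2_i + x2_i
print(torch.allclose(w2, y2.squeeze(-1) + x2.mean()))   # True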
The difference can be small when one tensor dominates the other, or when the elements of x are nearly identical (so each element is close to the mean). Especially in the context of Bayesian optimization, lots of tensor operations are involved, and the "correct" values are not immediately clear to the user. To make it worse, many tensors are not explicitly exposed unless the user digs into the code. All in all, there is no easy way to tell that the numbers are wrong.
I was lucky enough to be dealing with an edge case where y is 0, so instead of getting w = x, I got w2 = mean(x), whose elements are all the same constant value. That made it obvious enough that something was wrong.
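For illustration, here is roughly what that edge case looks like (a sketch with made-up data; the real tensors come from the Gaussian process code):
import torch

torch.manual_seed(0)
x = torch.randn(10)       # accidentally squeezed down to shape (10,)
y = torch.zeros(10, 1)    # the edge case: y happens to be all zeros

w = (x + y).mean(dim=-1)  # intended result: w == x
print(w)                  # instead, every entry equals x.mean() -- a flat, horizontal line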
As for the root cause? A wrong call to squeeze() in another function.
import torch

def some_func(x):
    # do something here
    return x.squeeze(-1)

# some dozens of lines of code here
x = torch.empty((10, 1))
y = torch.empty((10, 1))
x_new = some_func(x)   # x_new has shape (10,), not (10, 1)
z = x_new + y          # silently broadcasts to (10, 10)
w = z.mean(dim=-1)
I had happily assumed x_new was still of shape (10, 1) for weeks...
As you can see, this can go unnoticed fairly easily. Due to the random nature of the Gaussian processes (I use deterministic random numbers, but the whole outcome is still "unknown" to me), it is really hard to tell that there is a hidden mathematical bug.
I will probably add some assert statements here and there to make sure the dimensions are always as intended. Another way is to use runtime shape annotations such as jaxtyping instead of relying on static linters alone.
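A minimal sketch of both ideas (assuming a recent jaxtyping together with beartype as the runtime type checker, neither of which the original code necessarily uses; the function names are just illustrative):
import torch
from torch import Tensor
from jaxtyping import Float, jaxtyped
from beartype import beartype as typechecker

# Option 1: plain asserts around shape-changing code
def some_func_asserted(x: Tensor) -> Tensor:
    assert x.ndim == 2 and x.shape[-1] == 1, f"expected shape (n, 1), got {tuple(x.shape)}"
    return x.squeeze(-1)

# Option 2: declare the expected shapes in the signature and check them at call time
@jaxtyped(typechecker=typechecker)
def some_func_typed(x: Float[Tensor, "n 1"]) -> Float[Tensor, "n"]:
    return x.squeeze(-1)
With either variant, the accidental (10,) input fails loudly at the call site instead of silently broadcasting later.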