One sometimes sees claims like “within high dimensional spaces, if you choose random vectors, they tend to be orthogonal to each other,” but is this true?
The answer depends on whether you’re talking about unit vectors or not.
More to the point, it depends on the context in which the above nugget of wisdom is being applied, and not so much on orthogonality per se as on what we might call “near orthogonality”: Are we interested in cases when the dot product is nearly zero, or when the cosine of the angle between the vectors is nearly zero? (In the case of exact orthogonality, the dot product and the cosine are both exactly zero.) The latter question is equivalent to restricting attention to unit vectors, and is also equivalent to considering the Pearson correlation coefficient between two (mean-centered) random signals.
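To make that last connection concrete, here’s a minimal sketch (my own illustration, using plain NumPy): the Pearson correlation coefficient of two random signals is just the cosine similarity of those same signals after mean-centering.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, size=1000), rng.uniform(-1, 1, size=1000)
xc, yc = x - x.mean(), y - y.mean()     # mean-center the "signals"
cosine  = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
pearson = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
print(cosine, pearson)                  # the two agree to machine precision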
Let’s do some direct computations. We’ll vary the number of dimensions, compute a bunch of dot products between random vectors, and then plot histograms of these dot product values. The sharper the distribution is around zero, the more likely it is for a random pair of vectors to be nearly orthogonal.
General Setup
Imports and global settings
# General Setup
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import scipy.stats as ss
from IPython.display import display
np.random.seed(42)

n = 50000                    # number of random pairs of vectors to try
dims = [2,3,5,10,100,1000]   # different dimensionalities to try. Same as "D" in plots below
results = []
Utility Routines
Define some functions that we’re likely to use multiple times.
# Utility routines

def norm_rows(arr):
    "normalize vectors which exist as rows with components as columns"
    arrT = arr.T                               # .T is to make broadcasting work easily
    mags = np.sqrt( (arrT*arrT).sum(axis=0) )  # vector magnitudes
    return (arrT/mags).T

def uniform_rand(size=(10,10)):
    "random number on [-1..1]"
    return np.random.uniform(low=-1.0, high=1.0, size=size)

def softmax(x):
    "good ol' softmax function, for a bunch of row vectors"
    xT = x.T                      # .T's are just to make broadcasting work out.
    vec_max = np.max(xT, axis=0)
    e_x = np.exp(xT - vec_max)
    return (e_x / e_x.sum(axis=0)).T

def make_plot(n=n, dims=dims, title='', rand_func=uniform_rand, normalize=False,
              color='#1f77b4', xlim=None, softmax_them=False, scale_x_by_sqrtd=False):
    """A function we'll call again & again to make plots with"""
    fig, axes = plt.subplots(2, 3, figsize=(12,6))
    fig.suptitle(title, fontsize="x-large")
    axes = axes.ravel()
    sds = []                      # list of standard deviations for different dims
    for i, ax in enumerate(axes):
        dim = dims[i]
        a = rand_func(size=(n,dim))
        b = rand_func(size=(n,dim))
        if softmax_them: a, b = softmax(a), softmax(b)
        if normalize: a, b = norm_rows(a), norm_rows(b)
        dots = (a*b).sum(axis=-1)
        if scale_x_by_sqrtd: dots /= np.sqrt(dim)
        std = np.std(dots, axis=0)  # measure standard dev of distribution of dot products
        ax.hist(dots, density=True, bins=100, label=rf'D={dim}, $\sigma$={std:3.2g}', color=color)
        sds.append(std)
        if xlim is not None: ax.set_xlim(xlim)
        ax.legend(loc=4)            # legend will show dimensions and std dev
        if i % 3 == 0: ax.set_ylabel('Probability')
        if i > 2: ax.set_xlabel('Dot product value')
    return sds, axes, fig  # return some axes & figure info too in case we want to replot
Now we can get started by looking at distributions of dot products for various kinds of vectors. Starting with…
Uniform distribution, Non-Unit Vectors
These will be vectors that fill a hypercube, i.e. each component can take on a value on [-1…1].
= "Uniform distribution, Non-Unit Vectors"
title = make_plot(title=title)
sds, axes, fig 'label': title, 'sds': sds}) # save for later results.append({
From the above graphs, we see that as the dimensionality D increases, the most common dot product value is still zero, but the probability distribution becomes wider and flatter, i.e. the chance that a pair of random vectors is far from orthogonal grows as D increases.
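A quick back-of-envelope aside makes that widening quantitative: for independent components uniform on \([-1,1]\), each term \(a_i b_i\) has variance \(\tfrac{1}{3}\cdot\tfrac{1}{3}=\tfrac{1}{9}\), so
\[
\mathrm{Var}(a\cdot b) = \sum_{i=1}^{D}\mathrm{Var}(a_i b_i) = \frac{D}{9}, \qquad \sigma = \frac{\sqrt{D}}{3},
\]
which should match the \(\sigma\) values in the legends above, and which suggests dividing by \(\sqrt{D}\) to keep the spread constant.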
Ooo, ooo! Some of you may have seen how, in the “scaled dot product attention” used in Transformer models, the dot products get divided by the square root of the number of dimensions. Let’s try that on the above graphs and see if the variance (i.e. the scale on the horizontal axis) stays roughly the same:
= "Uniform distribution, Non-Unit Vectors, Scaled by 1/sqrt(D)"
title = make_plot(title=title, scale_x_by_sqrtd=True)
sds, axes, fig 'label': title, 'sds': sds}) # save for later results.append({
See how the variance stays constant as D increases? Cool, right? ;-)
Uniform distribution, Unit vectors
= "Uniform distribution, Unit vectors"
title = make_plot(title=title, normalize=True, color='orange', xlim=[-1,1])
sds, axes, fig 'label': title, 'sds': sds}) # save for later results.append({
Wait, back up! Those orange graphs remind me of a beta distribution. Can we fit that? Let’s try…
def fit_beta(axes):
    x = np.linspace(0,1,100)
    for i,ax in enumerate(axes):
        dim = dims[i]
        alpha = (dim-1)/2                     # this seems to work quite well
        y = 0.5*ss.beta.pdf(x, alpha, alpha)  # just messing with beta distribution
        ax.plot(2*x-1, y, color='green')

fit_beta(axes)
fig
(Turns out that that beta distribution can be derived symbolically. Lucky guess on my part. ;-) )
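For the record, here is the symbolic result as I understand it, assuming the unit vectors are uniformly distributed on the sphere (exactly true for normalized Gaussian vectors, and nearly true for the normalized cube-uniform vectors above): if \(t\) is the dot product of two such independent unit vectors in \(D\) dimensions, then \((t+1)/2 \sim \mathrm{Beta}\!\left(\tfrac{D-1}{2},\tfrac{D-1}{2}\right)\), i.e.
\[
p(t) = \frac{\Gamma\!\left(\tfrac{D}{2}\right)}{\sqrt{\pi}\,\Gamma\!\left(\tfrac{D-1}{2}\right)}\,\left(1-t^2\right)^{\frac{D-3}{2}}, \qquad t\in[-1,1].
\]
That is why \(\alpha = (D-1)/2\) works in fit_beta (the factor of 0.5 is just the Jacobian of mapping \([0,1]\) onto \([-1,1]\)), and it gives a standard deviation of exactly \(1/\sqrt{D}\).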
Interesting that for D=2, the vectors are more likely to be parallel or antiparallel than orthogonal, but that changes as D increases.
In the above graphs we see the opposite trend from the previous non-unit-vector case: the distribution gets narrower as D increases, meaning that a given pair of random unit vectors are more likely to be orthogonal as D increases.
What if we draw from a normal distribution of components instead of a uniform one?
Normal Distribution, Non-Unit Vectors
= "Normal distribution, Non-Unit Vectors"
title = make_plot(title=title, rand_func=np.random.normal)
sds, axes, fig 'label': title, 'sds': sds}) # save for later results.append({
OK, same trend as before: the distribution gets wider as dimensionality increases. What about for unit vectors?
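(Before moving on to unit vectors, the same back-of-envelope reasoning as before applies here: with standard-normal components each product \(a_i b_i\) has unit variance, so \(\mathrm{Var}(a\cdot b) = D\) and \(\sigma = \sqrt{D}\), which again grows with dimension.)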
Normal distribution, Unit Vectors
= "Normal distribution, Unit vectors"
title = make_plot(title=title, rand_func=np.random.normal,
sds, axes, fig =True, color='orange', xlim=[-1,1])
normalize'label': title, 'sds': sds }) # save for later
results.append({ fit_beta(axes)
Looks like the normal-distribution case behaves just like the uniform one (in fact here the beta fit is exact, since normalizing Gaussian vectors puts them uniformly on the sphere): for unit vectors, the distribution gets narrower (around 0) as the dimension increases.
If the vectors have only positive components (e.g. what you get from a softmax) then you will never have truly orthogonal vectors (because they all live in the positive orthant). But let’s see what happens with such “softmax vectors”:
“Softmax Vectors”
If the vectors all have positive coefficients, e.g. because they are the output of a softmax function, then there’s no way they can be orthogonal. We might ask, what’s the most common value for the dot product?
(Note that these are not unit vectors according to the L2/“Euclidean” norm, rather they are unit vectors according to an L1/“Manhattan distance” norm).
= "Softmax vectors, not unit in L2"
title = make_plot(title=title, normalize=False, softmax_them=True, color='red')
sds, axes, fig 'label': title, 'sds': sds}) # save for later
results.append({
for i,ax in enumerate(axes): # seems the mean is 1/D; let's show that:
= dims[i]
dim 1/dim,1/dim],[0,ax.get_ylim()[1]], label='vert bar at 1/D') ax.plot([
Given that the dot product for these vectors seems to consistently have an expectation value of \(1/D\) in \(D\) dimensions, then as \(D\) increases this value will go to zero, and thus the vectors will become increasingly close to “orthogonal” (in the sense of a vanishing dot product).
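One way to see where the \(1/D\) comes from: the components of each softmax vector are exchangeable and sum to 1, so \(\mathbb{E}[a_i] = \mathbb{E}[b_i] = 1/D\), and since \(a\) and \(b\) are independent,
\[
\mathbb{E}[a\cdot b] = \sum_{i=1}^{D}\mathbb{E}[a_i]\,\mathbb{E}[b_i] = D\cdot\frac{1}{D}\cdot\frac{1}{D} = \frac{1}{D}.
\]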
Cosine Similarity for Softmax Vectors
What about the cosine of the angle between the vectors we just considered? That’s equivalent to normalizing the softmax vectors according to an L2/Euclidean norm. Here we go:
= "Cosine Similarity for Softmax"
title = make_plot(title=title, normalize=True, softmax_them=True, color='purple')
sds, axes, fig 'label': title, 'sds': sds}) # save for later results.append({
…yeah I dunno ’bout that. Seems to be approaching a number a bit larger than 0.76. ?? I can’t immediately think of any “special” number that matches that (e.g., it’s too low to be \(\pi/4\) and much too high to be \(1/\sqrt{2}\)). Let’s keep going to higher dimensions and see what the mean converges to:
# let's keep going higher
new_dims = [10000,20000,50000]
for dim in new_dims:
    a = 2*np.random.uniform(size=(n,dim)).astype(np.float32) - 1   # gotta watch for mem overruns
    b = 2*np.random.uniform(size=(n,dim)).astype(np.float32) - 1
    a, b = softmax(a), softmax(b)
    a, b = norm_rows(a), norm_rows(b)   # normalize to get cosine
    dots = (a*b).sum(axis=-1)
    std = np.std(dots, axis=0)
    print(dim, dots.mean(), std)
10000 0.7616262 0.002711203
20000 0.7616077 0.0019324023
50000 0.76159537 0.001209799
A likely candidate for the special number is the hyperbolic tangent of 1:
np.tanh(1)
0.7615941559557649
…though I’m not sure how that arises. Hmmm!
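One back-of-envelope sketch for how \(\tanh(1)\) could arise (a plausibility argument rather than a proof, assuming i.i.d. components \(x_i\) uniform on \([-1,1]\) as above): the cosine is scale-invariant, so the softmax denominators cancel and we can work directly with \(u_i = e^{x_i}\) and \(v_i = e^{y_i}\). By the law of large numbers, for large \(D\),
\[
\cos(a,b) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}} \;\longrightarrow\; \frac{\mathbb{E}[e^{x}]^2}{\mathbb{E}[e^{2x}]} = \frac{\sinh^2(1)}{\tfrac{1}{2}\sinh(2)} = \frac{\sinh(1)}{\cosh(1)} = \tanh(1).
\]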
Summary
In every case except the softmax ones (whose dot products peak near \(1/D\) and whose cosine similarities peak near \(\tanh(1)\approx 0.76\)), the mode of the distribution is at zero dot product, so in that sense vectors are most likely to be (near-)orthogonal regardless of the dimensionality. However, while for unit vectors the distribution of dot product values gets sharper (around zero) as the number of dimensions increases, for non-unit vectors it gets flatter, i.e. betting on orthogonality becomes increasingly less of a safe bet as you go to higher dimensions.
The following graph summarizes our results, in terms of the standard deviation of the distribution of dot product values:
fig, ax = plt.subplots()
for k in range(len(results)):
    marker, markersize = ('s-',12) if k==1 else ('o-',11)
    ax.loglog(dims, results[k]['sds'], marker, markersize=markersize, label=results[k]['label'])
plt.legend()
plt.xlabel('Dimensions')
plt.ylabel('Std Dev')
plt.title("For dot products of random vectors...")
plt.show()
(where the orange line and the red line are right on top of each other)
Thus we see that the dot products of unit vectors and of “softmax vectors” (all-positive components with unit L1 norm) concentrate ever more tightly near zero as the dimensionality increases, and the cosine similarity for the softmax vectors also concentrates, though around \(\tanh(1)\approx 0.76\) rather than zero, whereas for other (non-unit) vectors the dimensionality trend goes the other way.
Also: can I just say that I really like the scaling laws apparent in that last plot.
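As a final sanity check on those scaling laws, here is a small sketch (my own reference lines, using the back-of-envelope \(\sigma\) estimates from the asides above, and reusing the fig and ax from the summary plot) that overlays \(\sqrt{D}/3\) and \(1/\sqrt{D}\) on that last figure; the non-unit-vector curves should run parallel to the rising dashed line, while the unit-vector curves should sit right on the falling dotted one.

# Overlay back-of-envelope trend lines on the summary plot (reference lines, not fits):
#   uniform non-unit vectors: sigma ~ sqrt(D)/3
#   unit vectors:             sigma ~ 1/sqrt(D)
d = np.array(dims, dtype=float)
ax.loglog(d, np.sqrt(d)/3, 'k--', label=r'$\sqrt{D}/3$')
ax.loglog(d, 1/np.sqrt(d), 'k:',  label=r'$1/\sqrt{D}$')
ax.legend()
fig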