Using gradient descent to maximize structural balance in signed graphs
Overview
I developed a novel algorithm for partitioning large-scale signed networks to maximize their structural balance. The algorithm performs gradient descent on an \(L^1\)-norm loss function and uses a deep-learning optimizer to search the discrete state space effectively.
Structural balance
An overwhelming body of empirical evidence suggests that social networks can be well partitioned into two sets of mutual friends with complete antagonism between the two sets. While early research, particularly in the social sciences, focused on local interactions between nodes, e.g. Heider’s axioms (Heider, 1946), much of the current research aims at applying structural balance theory and related algorithms to real-world, large-scale signed networks, often to predict their future dynamics.
Maximizing structural balance is a particularly intriguing problem because it is related not only to social networks but also to the maximum cut problem (partitioning a graph into two sets so as to maximize the number of edges between them) and to the Ising model (which describes how interacting atomic spins settle into the state of lowest energy).
Since the state space of this partitioning problem is exponential (\(2^n\) for a graph with \(n\) nodes), determining the globally optimal partitioning for nearly any real-world graph is computationally intractable.
Algorithm description
Deriving a loss function
Without an external field, the Hamiltonian \(h\) of the Ising model is given by the equation below. Facchetti et al. (2011) elegantly used this equation to analyze the balance of large-scale signed networks.
\[ h(\mathbf{s}) = \frac{1}{2}\sum_{(i, j)}(1 - J_{ij}s_is_j) = m - \frac{1}{2} \mathbf{s}^\top J\mathbf{s} \]
where \(J \in \mathbb{R}^{n \times n}\) is the matrix whose elements \(J_{i,j}\) determine the sign and strength of the edge \(\langle i, j \rangle\), and \( \mathbf{s} \in \mathbb{R}^{n \times 1} \) is a column vector whose elements \( \mathbf{s_i} \) indicate membership (and strength of membership) in either of the two antagonistic sets of nodes. Lastly, \( m \) is the number of edges in the undirected graph. Intuitively, \( h \) is minimized when (1) nodes \(i\) and \(j\) are in the same set, \( \mathrm{sign}(\mathbf{s_i}) = \mathrm{sign}(\mathbf{s_j}) \), and are connected by a positive edge, \( J_{i,j} > 0 \) and \( J_{j,i} > 0 \), or (2) \(i\) and \(j\) are in different sets, \(\mathrm{sign}(\mathbf{s_i}) \neq \mathrm{sign}(\mathbf{s_j}) \), and are connected by a negative edge, \(J_{i,j} < 0\) and \( J_{j,i} < 0 \). Evidently, the energy functional is a more global measure of structural balance than the number of “triangles” that violate Heider’s axioms.
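As a quick sanity check, here is a minimal NumPy sketch (using a hypothetical 3-node graph of my own choosing, not an example from the literature) that evaluates the energy functional above; a perfectly balanced partition yields \( h = 0 \), while an unbalanced one yields a positive value.

import numpy as np

# Hypothetical example: nodes 0 and 1 are friends (+1) and both are
# enemies (-1) of node 2. J is symmetric because the graph is undirected.
J = np.array([[ 0.,  1., -1.],
              [ 1.,  0., -1.],
              [-1., -1.,  0.]])
m = np.count_nonzero(J) / 2          # number of undirected edges

def h(s):
    return m - 0.5 * s @ J @ s       # energy functional from the equation above

print(h(np.array([1., 1., -1.])))    # balanced partition:   0.0
print(h(np.array([1., 1.,  1.])))    # unbalanced partition: 4.0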
To extend the energy functional to directed graphs while retaining the 0-property (\( h(\mathbf{s}) = 0 \) indicates that the graph is perfectly balanced), I modify it to \( h_d(\mathbf{s}, J) = m_d - \mathbf{s}^\top J\mathbf{s} \), where \(m_d\) is the number of directed edges (counting each undirected edge as two directed edges of opposite direction). Since \(J\) is not guaranteed to be positive definite, \( h_d \) can become negative for continuous \( \mathbf{s} \), so minimizing \( h_d \) directly is not equivalent to minimizing structural imbalance; this motivates minimizing \( |h_d| \) instead, as done below.
An important consideration in the design of the loss function is that gradient-based optimization is continuous, yet most applications require discrete results with \( J_{i,j}\in \{-1, 0, 1\} \) and \( \mathbf{s_i} \in \{-1, 1\}\). The optimization procedure must therefore flip the sign of the right elements of \( \mathbf{s} \). It turns out that optimizing in continuous space using gradient descent and then discretizing the final result with \( \mathrm{sign}(\cdot) \) achieves such sign changes, given an appropriate loss function. Since \( \lim_{x \rightarrow 0}\frac{\partial x^2}{\partial x}=0 \), quadratic loss functions have diminishing gradients near zero and are not conducive to sign changes. I therefore propose the simple \(L^1\)-norm loss function
\[ \mathcal{L}(\mathbf{s}, J) = |h_d(\mathbf{s}, J)| \]
which has derivatives
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{s}} = -\mathrm{sign}(h_d(\mathbf{s}, J))(J+J^\top) \mathbf{s} \]
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{s_i}} = -\mathrm{sign}(h_d(\mathbf{s}, J)) \sum_{j=1}^n (J_{i,j} + J_{j,i})\mathbf{s_j} \]
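As a sanity check on the derivatives above, here is a minimal NumPy sketch (the helper names h_d and loss and the small random \(J\) are my own) that compares the analytic gradient with a central finite-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
n = 6
J = rng.choice([-1.0, 0.0, 1.0], size=(n, n))    # small random signed, directed J
m_d = np.count_nonzero(J)                         # number of directed edges
s = rng.normal(size=n)                            # arbitrary continuous state

def h_d(s):
    return m_d - s @ J @ s                        # directed energy functional

def loss(s):
    return abs(h_d(s))                            # L1-norm loss

analytic = -np.sign(h_d(s)) * (J + J.T) @ s       # gradient formula from above

numeric = np.zeros(n)                             # central finite differences
delta = 1e-6
for i in range(n):
    e = np.zeros(n)
    e[i] = delta
    numeric[i] = (loss(s + e) - loss(s - e)) / (2 * delta)

print(np.max(np.abs(analytic - numeric)))         # small: the two gradients agree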
Loss surface and optimizers
A thorough analysis of the loss surface is beyond the scope of this post. However, it is easy to see that if \(\mathbf{s}\) is an optimum for a given \(J\), then \(-\mathbf{s}\) must also be an optimum (see the identity below). Moreover, the sparsity of \( J \) appears to be correlated with the “degrees of freedom” in \( \mathbf{s} \): in the extreme case where all elements of \(J\) are 0, every element of \(\mathbf{s}\) is free to vary. This hints at the fact that the loss surface is non-convex and that numerous local minima may exist.
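The sign symmetry follows directly from the quadratic form:
\[ \mathcal{L}(-\mathbf{s}, J) = |m_d - (-\mathbf{s})^\top J (-\mathbf{s})| = |m_d - \mathbf{s}^\top J \mathbf{s}| = \mathcal{L}(\mathbf{s}, J) \]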
A ubiquitous method of dealing with such issues is to apply transformations (“optimizers”) such as momentum to the gradient before performing the gradient-descent update. Optimizers are used extensively in deep learning to speed up convergence and yield better optimization results, for example by helping to escape saddle points. It is therefore reasonable to incorporate these techniques into my algorithm as well. I decided to use the RMSProp optimizer (Hinton et al., n.d.) as it performed best in preliminary tests.
The code
I implemented the algorithm in Python using NetworkX and NumPy. Much of the code is devoted to the RMSProp optimizer; using a deep-learning library such as TensorFlow or PyTorch would simplify the implementation considerably (a PyTorch sketch follows the usage example below).
#!/usr/bin/python3
import networkx as nx
import numpy as np


def balance_partition(G, steps=10, lr=0.001, beta=0.9):
    '''
    Takes a signed graph whose edge "weight" attribute is defined,
    partitions the graph to maximize structural balance,
    and sets the corresponding node "sign" attribute to -1 or +1
    to indicate the partitioning.

    Args:
        G: signed or unsigned NetworkX graph (directed or undirected)
        steps: number of optimization steps
        lr: the learning rate
        beta: RMSProp decay rate for the moving average of squared gradients
    '''
    eps = 1e-15
    n = G.number_of_nodes()
    s = np.full(n, eps)            # continuous membership vector, initialized near zero
    J = nx.adjacency_matrix(G)     # sparse signed adjacency matrix (uses "weight")
    m = (J != 0.0).sum()           # number of directed edges
    v = 0.0                        # RMSProp moving average of squared gradients
    # perform gradient descent with the RMSProp optimizer
    for step in range(steps):
        grad = -np.sign(m - s.T @ J @ s) * (J + J.T) @ s
        v = beta * v + (1 - beta) * grad**2
        s -= lr / np.sqrt(v + eps) * grad
    # discretize and set each node's "sign" attribute
    nx.set_node_attributes(G, {
        node: sign for node, sign in zip(G.nodes, np.sign(s))
    }, 'sign')
The code can be used as follows:
G = nx.Graph()
G.add_edge(1, 4, weight=1.0)
G.add_edge(2, 3, weight=-1.0)
G.add_edge(3, 5, weight=-1.0)
G.add_edge(4, 5, weight=1.0)
G.add_edge(4, 6, weight=-1.0)
G.add_edge(4, 7, weight=-1.0)
balance_partition(G)
for node in G.nodes(data=True):
print(node)
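As noted above, an automatic-differentiation library shortens the implementation considerably. Below is a minimal PyTorch sketch of the same idea (the function name balance_partition_torch is my own, and torch.optim.RMSprop uses its own default decay rate, so results will not be numerically identical to the NumPy version):

import networkx as nx
import torch

def balance_partition_torch(G, steps=10, lr=0.001):
    # dense signed adjacency matrix; fine for small graphs
    J = torch.tensor(nx.to_numpy_array(G), dtype=torch.float64)
    m = (J != 0.0).sum()                    # number of directed edges
    s = torch.full((G.number_of_nodes(),), 1e-15,
                   dtype=torch.float64, requires_grad=True)
    opt = torch.optim.RMSprop([s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.abs(m - s @ J @ s)     # L1-norm loss on the directed energy
        loss.backward()                     # autograd replaces the manual gradient
        opt.step()
    signs = torch.sign(s).detach().numpy()
    nx.set_node_attributes(G, dict(zip(G.nodes, signs)), 'sign')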
Performance study
Generating test graphs with known global optima
I tested how close the optimization results for \(\mathbf{s}\) come to the known global optimum \(\mathbf{s}^*\) of randomly generated graphs. This allowed me to investigate how certain properties of these graphs (such as the number of nodes, the average directed node degree, and the distribution of nodes among the two sets) affect convergence.
The process of generating graphs with a known global optimum is as follows: first, \( \mathbf{s}^* \) is generated randomly following a Bernoulli distribution such that the graph has a specified number of nodes and proportion of positive to negative nodes (and thus edges). The corresponding optimal edge weights are given by \( J=\mathbf{s}^* \mathbf{s}^{* \top} \), which is essentially the linear-algebra form of Algorithm 1 in Estrada (2019). Elements of \(J\) can then be set to 0 to achieve a desired average node degree (this implies that there are \( 2^{n^2} \) optimal solutions to the optimization of \(J\) given \(\mathbf{s}\)). The algorithm is then applied to \(J\) in anticipation of reconstructing \( \mathbf{s}^* \). Every experiment was repeated 5 times with freshly generated values of \(\mathbf{s}^*\) to assess the variance of the results.
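A minimal sketch of this generation procedure (the function and parameter names below are my own, not taken from the original experiment code):

import numpy as np

def generate_balanced_graph(n, p_positive=0.6, mean_degree=5.0, seed=0):
    # Returns (J, s_star): a signed, directed adjacency matrix together with a
    # known globally optimal partition s_star, i.e. h_d(s_star, J) = 0.
    rng = np.random.default_rng(seed)
    # Bernoulli draw of memberships with the desired proportion of +1 nodes
    s_star = np.where(rng.random(n) < p_positive, 1.0, -1.0)
    J = np.outer(s_star, s_star)                 # perfectly balanced edge weights
    np.fill_diagonal(J, 0.0)                     # no self-loops
    # sparsify: keep each directed edge with probability mean_degree / (n - 1)
    mask = rng.random((n, n)) < mean_degree / (n - 1)
    J *= mask                                    # zero out the dropped edges
    return J, s_star

J, s_star = generate_balanced_graph(1000)        # example: 1,000 nodes, 60% positive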
Results
Remarkably, the figure above suggests that convergence is nearly invariant to the number of nodes, holding everything else constant. Even more remarkably, the global optimum was reached in all 5 repetitions after just 3 iterations for random graphs with up to 10,000 nodes, up to 50,000 directed edges, and 60% positive edges (see the table below). Preliminary tests suggested similar performance on graphs with up to 1 million directed edges.
The top right plot shows how the rate of convergence decreases as the mean directed degree is reduced. In other words, the sparser the graph, the more difficult it is to find the optimal partitioning. Intuitively, this relates to the “degrees of freedom” in \(\mathbf{s}\) as discussed earlier. Fortunately, many real-world graphs are much less sparse than the graphs tested in this experiment.
The bottom plot indicates that a very balanced partition (number of positive edges ≈ number of negative edges) can lead to slower, less stable convergence as well as significantly less optimal solutions. Fortunately, the tested graphs represent worst-case scenarios with a mean directed node degree of 1 and a 50-50 partition of nodes. Most real-world graphs are highly unbalanced; e.g., soc-sign-bitcoin-otc has a 90-10 partition.
Node count | Directed edges | Proportion positive | Start loss \(\mu\) (\(\sigma\)) | End loss \(\mu\) (\(\sigma\)) | Iteration duration (s)
---|---|---|---|---|---
100 | 500 | 0.6 | 464.8 (34.0) | 0.0 (0.0) | 0.001284
1,000 | 5,000 | 0.6 | 4758.8 (123.1) | 0.0 (0.0) | 0.021178
10,000 | 50,000 | 0.6 | 48115.6 (404.4) | 0.0 (0.0) | 0.653195
10,000 | 10,000 | 0.6 | 9561.6 (189.7) | 1101.6 (31.3) | 0.642916
10,000 | 25,000 | 0.6 | 24041.6 (312.0) | 2.4 (2.2) | 0.654622
10,000 | 50,000 | 0.6 | 48115.6 (404.4) | 0.0 (0.0) | 0.653195
10,000 | 100,000 | 0.6 | 95517.6 (696.7) | 0.0 (0.0) | 0.633586
10,000 | 10,000 | 0.5 | 10024.8 (157.6) | 2016.8 (16.9) | 0.653039
10,000 | 10,000 | 0.6 | 9561.6 (189.7) | 1101.6 (31.3) | 0.642916
10,000 | 10,000 | 0.7 | 8343.2 (125.0) | 352.0 (48.1) | 0.630903
10,000 | 10,000 | 0.8 | 6442.8 (137.0) | 138.8 (17.7) | 0.646358