Graph Convolution Network (GCN) and Graph Attention Network (GAT)
- Introduction to GCN
- GraphSAGE
- Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst. "Convolutional neural networks on graphs with fast localized spectral filtering." Advances in neural information processing systems 29 (2016).
- Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." International Conference on Learning Representations (2017).
- Hamilton, Will, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs." Advances in neural information processing systems 30 (2017).
- Wu, Felix, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. "Simplifying graph convolutional networks." In International conference on machine learning, pp. 6861-6871. PMLR, 2019.
- Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. "Graph attention networks." International Conference on Learning Representations (2018).
1. Laplacian Matrix on Graph
The Laplacian matrix of an $n$-vertex graph $G$ is defined as $L = D - W$, where $W \in \mathbb{R}^{n \times n}$ is the adjacency matrix with weights $W_{ij}$ and $D = \operatorname{diag}(D_i) \in \mathbb{R}^{n \times n}$, with $D_i = \sum_j W_{ij}$, is the degree matrix. The normalized Laplacian matrix is defined as
$$L_N = D^{-1/2}(D - W)D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}.$$
The Laplace operator is defined as $\Delta f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}$. In 1-dimensional space, we have
$$\frac{\partial^2 f}{\partial x^2} = f''(x) \approx f'(x) - f'(x-1) \approx \big(f(x+1) - f(x)\big) - \big(f(x) - f(x-1)\big) = f(x+1) + f(x-1) - 2f(x).$$
Similarly, consider an $n$-vertex graph $G$ and a function $f$ on it, $f = (f_1, \dots, f_n)$, where $f_i$ is the value of the function at the $i$-th vertex. The Laplacian of $f$ on $G$ at vertex $i$ is then
$$\Delta f_i = \sum_{j \in N_i} W_{ij}(f_i - f_j) = \Big(\sum_j W_{ij}\Big) f_i - \sum_j W_{ij} f_j = D_i f_i - \sum_j W_{ij} f_j.$$
In vector form, we recover the definition of the Laplacian matrix on graph $G$:
$$\Delta f = (D - W) f = L f.$$
For an undirected graph, the Laplacian matrix is real and symmetric, so it admits the spectral decomposition $L = U \Lambda U^\top$ with $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$. The Laplacian matrix $L$ also has the following properties (checked numerically in the sketch after the list):
- There is at least one eigenvalue equal to $0$ (with eigenvector $\mathbf{1}$ for $L$).
- All eigenvalues are non-negative. For the normalized Laplacian matrix, all eigenvalues lie in $[0, 2]$, and (when there are no isolated vertices) they sum to $n$.
- If $0$ appears as an eigenvalue with multiplicity $k$, then the graph has exactly $k$ connected components.
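As a quick illustration (not from the referenced papers), the following NumPy sketch builds the Laplacian of a small made-up weighted graph with two connected components and checks the three properties numerically:

```python
import numpy as np

# Made-up weighted graph with two connected components: {0, 1, 2} and {3, 4}.
W = np.array([[0., 1., 1., 0., 0.],
              [1., 0., 2., 0., 0.],
              [1., 2., 0., 0., 0.],
              [0., 0., 0., 0., 3.],
              [0., 0., 0., 3., 0.]])

D = np.diag(W.sum(axis=1))                          # degree matrix, D_i = sum_j W_ij
L = D - W                                           # Laplacian L = D - W
D_inv_sqrt = np.diag(np.diag(D) ** -0.5)
L_norm = np.eye(5) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian L_N

eigvals, U = np.linalg.eigh(L_norm)                 # spectral decomposition (L_N is symmetric)
print(np.round(eigvals, 4))                         # all eigenvalues lie in [0, 2]
print(round(eigvals.sum(), 4))                      # eigenvalues sum to n = 5
print(int(np.sum(np.isclose(eigvals, 0.0))))        # 2 zero eigenvalues -> 2 components
```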
2. Fourier Transform
Recall the Fourier transform of a function $f(\cdot)$,
$$F_T(\omega) = \int_{-\infty}^{+\infty} f(t)\, e^{-i\omega t}\, dt,$$
where $\{e^{-i\omega t}\}$ is a set of orthogonal basis functions. We also have the inverse Fourier transform,
$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F_T(\omega)\, e^{i\omega t}\, d\omega.$$
Notice that the basis function $e^{-i\omega t}$ is an eigenfunction of the Laplace operator $\Delta$:
$$\Delta e^{-i\omega t} = \nabla^2 e^{-i\omega t} = \frac{d^2 e^{-i\omega t}}{dt^2} = -\omega^2 e^{-i\omega t}.$$
Similarly, the eigenvectors $U = [u_1, \dots, u_n]$ of the graph Laplacian matrix serve as the basis for the Fourier transform on the graph $G$, i.e.,
$$F_T(\lambda_k) \triangleq \hat{f}_k = \langle f, u_k \rangle = \sum_{i=1}^{n} f_i\, u_k(i).$$
Therefore, $\hat{f} = U^\top f$. Similarly, we have the inverse Fourier transform on graph $G$, $f = U \hat{f} = U U^\top f$.
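As a small sketch of the graph Fourier transform (a made-up ring-graph example, not taken from the papers above): the eigenvectors of the Laplacian act as the Fourier basis, and $U U^\top = I$ gives perfect reconstruction.

```python
import numpy as np

n = 6
W = np.zeros((n, n))
for i in range(n):                                  # ring graph: every vertex has degree 2
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W                      # Laplacian L = D - W

_, U = np.linalg.eigh(L)                            # columns of U are the Fourier basis
f = np.arange(n, dtype=float)                       # an arbitrary graph signal

f_hat = U.T @ f                                     # forward transform: f_hat = U^T f
f_rec = U @ f_hat                                   # inverse transform:  f = U f_hat
print(np.allclose(f, f_rec))                        # True, since U U^T = I
```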
3. Convolution on Graphs
Denote the continuous convolution of two functions $f, g$ by
$$(f * g)(t) \stackrel{\text{def}}{=} \int_{\mathbb{R}^n} f(\tau)\, g(t - \tau)\, d\tau,$$
as well as the discrete convolution,
$$(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n - m) = \sum_{m=-\infty}^{\infty} f(n - m)\, g(m).$$
Recall the convolution theorem relating convolution and the Fourier transform, $\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$, so that $f * g = \mathcal{F}^{-1}\big(\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\big)$. For convolution on a graph $G$, we similarly have
$$(f * g)_G = \mathcal{F}^{-1}\big[\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\big] = \mathcal{F}^{-1}\big[U^\top f \cdot \hat{g}\big] = \mathcal{F}^{-1}\big[\operatorname{diag}(\hat{g}_1, \dots, \hat{g}_n)\, U^\top f\big] = U \operatorname{diag}(\hat{g}_1, \dots, \hat{g}_n)\, U^\top f.$$
If we parameterize $\hat{g}$ directly by trainable parameters $\theta_1, \dots, \theta_n$,
$$(f * g)_G = U \operatorname{diag}(\theta_1, \dots, \theta_n)\, U^\top f.$$
This is the spectral convolution filter on graphs.
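The sketch below applies this non-parametric spectral filter on the same made-up ring graph; `theta` is just a random stand-in for trainable parameters.

```python
import numpy as np

n = 6
W = np.zeros((n, n))
for i in range(n):                                  # ring graph again
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
_, U = np.linalg.eigh(L)

rng = np.random.default_rng(0)
theta = rng.normal(size=n)                          # one free parameter per eigenvalue
f = rng.normal(size=n)                              # input graph signal

y = U @ np.diag(theta) @ U.T @ f                    # (f * g)_G = U diag(theta) U^T f
print(y.shape)                                      # (6,)
```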
4. Fast Localized Spectral Filtering
Recall that in the above graph convolution, $\hat{g}$ is a vector depending on the eigenvalues of the Laplacian matrix $L$, so we can rewrite it as $\hat{g} = g_\theta(\Lambda)$, and the graph convolution is then $y = U g_\theta(\Lambda) U^\top x$. This can also be seen as a signal $x$ filtered by $g_\theta$:
$$y = g_\theta(L)\, x = g_\theta(U \Lambda U^\top)\, x = U g_\theta(\Lambda) U^\top x.$$
The parameterization in the previous section can be seen as a non-parametric filter, i.e., $g_\theta(\Lambda) = \operatorname{diag}(\theta_1, \dots, \theta_n)$. There are, however, two limitations with non-parametric filters:
- they are not localized in space
- their learning complexity is $O(n)$, i.e., it grows with the number of vertices.
These issues can be overcome with a polynomial filter,
$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k.$$
Under the polynomial filter, the convolution is given by
$$y = U g_\theta(\Lambda) U^\top x = \Big(\sum_{k=0}^{K-1} \theta_k L^k\Big) x.$$
Notice that $d_G(i, j) > K$ implies $(L^K)_{ij} = 0$, where $d_G$ is the shortest-path distance between $i$ and $j$. Consequently, spectral filters represented by $K$th-order polynomials of the Laplacian are exactly $K$-localized. Moreover, their learning complexity is $O(K)$, the support size of the filter, the same complexity as classical CNNs.
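The following sketch (a made-up path graph, with placeholder coefficients) applies a polynomial filter with $K = 3$ to an impulse signal and shows that the result is supported only within $K - 1 = 2$ hops of the impulse:

```python
import numpy as np

n = 8
W = np.zeros((n, n))
for i in range(n - 1):                              # path graph 0-1-2-...-7
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

K = 3
theta = np.array([0.5, -1.0, 0.25])                 # placeholder filter coefficients
x = np.zeros(n)
x[0] = 1.0                                          # impulse at vertex 0

# y = sum_{k=0}^{K-1} theta_k L^k x
y = sum(theta[k] * np.linalg.matrix_power(L, k) @ x for k in range(K))
print(np.round(y, 3))                               # nonzero only at vertices 0, 1, 2
print(np.linalg.matrix_power(L, K - 1)[0, 3])       # 0.0, since d_G(0, 3) = 3 > K - 1
```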
However, computing with the eigenvector matrix $U$ still incurs $O(n^2)$ complexity (on top of an expensive eigendecomposition of $L$). Recall that the Chebyshev polynomial $T_k(x)$ of order $k$ may be computed by the stable recurrence relation $T_k(x) = 2x\, T_{k-1}(x) - T_{k-2}(x)$, with $T_0 = 1$ and $T_1 = x$. These polynomials form an orthogonal basis for $L^2([-1, 1], dy/\sqrt{1 - y^2})$, the Hilbert space of square-integrable functions with respect to the measure $dy/\sqrt{1 - y^2}$. A filter can thus be parametrized as the truncated expansion
$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{\Lambda}),$$
where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ is the diagonal matrix of rescaled eigenvalues, which lie in $[-1, 1]$. The convolution operation is therefore
$$y = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x,$$
where $\tilde{L} = 2L/\lambda_{\max} - I_n$. Denoting $\bar{x}_k = T_k(\tilde{L})\, x \in \mathbb{R}^n$, we can use the recurrence relation to compute $\bar{x}_k = 2\tilde{L}\bar{x}_{k-1} - \bar{x}_{k-2}$ with $\bar{x}_0 = x$ and $\bar{x}_1 = \tilde{L} x$. The entire filtering operation then costs $O(K|E|)$ operations, where $|E|$ is the number of edges in the graph $G$.
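A minimal sketch of this recurrence (the same made-up path graph; $\lambda_{\max}$ is computed densely here just for illustration, and `theta` is a placeholder):

```python
import numpy as np

n = 8
W = np.zeros((n, n))
for i in range(n - 1):                              # path graph
    W[i, i + 1] = W[i + 1, i] = 1.0
d = W.sum(axis=1)
L = np.eye(n) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)   # normalized Laplacian

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2.0 * L / lam_max - np.eye(n)             # eigenvalues rescaled into [-1, 1]

K = 4
theta = np.array([1.0, 0.5, -0.25, 0.1])            # placeholder Chebyshev coefficients
x = np.random.default_rng(0).normal(size=n)

x_bars = [x, L_tilde @ x]                           # x_bar_0 = x, x_bar_1 = L~ x
for k in range(2, K):
    x_bars.append(2.0 * L_tilde @ x_bars[-1] - x_bars[-2])    # Chebyshev recurrence
y = sum(theta[k] * x_bars[k] for k in range(K))
print(y.shape)                                      # (8,)
```

With a sparse $\tilde{L}$, each recurrence step is one sparse matrix-vector product, which is where the $O(K|E|)$ cost comes from.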
5. Fast Approximate Convolutions on Graphs
Recall from the previous section the convolution of a signal $x$ on a graph $G$ using Chebyshev polynomials,
$$y = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x.$$
If we consider the simple case $K = 2$, use the normalized Laplacian matrix $L_N$, and approximate $\lambda_{\max} \approx 2$ (the upper bound on the eigenvalues of $L_N$), we have
$$y = \sum_{k=0}^{1} \theta_k\, T_k(\tilde{L}_N)\, x = \theta_0 x + \theta_1 (L_N - I_n)\, x = \theta_0 x - \theta_1 D^{-1/2} W D^{-1/2} x.$$
In practice, it can be beneficial to constrain the number of parameters further to address overfitting and to minimize the number of operations (such as matrix multiplications) per layer. Setting $\theta = \theta_0 = -\theta_1$ leaves us with
$$y = \theta x + \theta D^{-1/2} W D^{-1/2} x = \theta\big(I_n + D^{-1/2} W D^{-1/2}\big) x.$$
Note that $I_n + D^{-1/2} W D^{-1/2}$ has eigenvalues in the range $[0, 2]$. Repeated application of this operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep neural network model.
To alleviate this problem, we introduce the following renormalization trick: $I_n + D^{-1/2} W D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$, with $\tilde{W} = W + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{W}_{ij}$.
We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e., a $C$-dimensional feature vector for every node) and $F$ filters or feature maps as follows:
$$Y = \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} X \Theta,$$
where $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters and $Y \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. This filtering operation has complexity $O(|E| F C)$, as $\tilde{W} X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
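A minimal sketch of this propagation rule (a made-up ring graph and random $X$, $\Theta$; dense matrices here, whereas a real implementation would keep $\tilde{W}$ sparse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, F = 6, 4, 3                                   # nodes, input channels, output filters

W = np.zeros((n, n))
for i in range(n):                                  # ring graph
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0

W_tilde = W + np.eye(n)                             # renormalization trick: add self-loops
d_tilde = W_tilde.sum(axis=1)
S = np.diag(d_tilde ** -0.5) @ W_tilde @ np.diag(d_tilde ** -0.5)   # D~^{-1/2} W~ D~^{-1/2}

X = rng.normal(size=(n, C))                         # node feature matrix
Theta = rng.normal(size=(C, F))                     # filter parameters

Y = S @ X @ Theta                                   # convolved signal matrix
print(Y.shape)                                      # (6, 3)
```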
6. Simple GCN
Following Kipf and Welling (2017) above, for an input $X^{(t)} \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e., a $C$-dimensional feature vector for every node) and $F$ filters or feature maps, one GCN layer computes
$$X^{(t+1)} = \sigma\big(S X^{(t)} \Theta\big), \qquad S = \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2},$$
where $\Theta \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters, $\sigma$ is the non-linear activation function, and $X^{(t+1)} \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. For a $K$-layer network, we repeat this procedure $K$ times.
Wu et al. (2019) hypothesize that the nonlinearity between GCN layers is not critical, and that the majority of the benefit arises from the local averaging. The resulting model is linear, but still has the same increased "receptive field" as a $K$-layer GCN, i.e.,
$$Y = \operatorname{softmax}\big(S \cdots S S X \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}\big) = \operatorname{softmax}\big(S^K X \Theta\big).$$
With this network, the authors obtain experimental results comparable to the original GCN structure, while the computational cost and the number of parameters are significantly reduced. Notice that $S$ is a sparse matrix with on the order of $2|E|$ non-zero entries, where $|E|$ is the number of edges in the graph $G$, so $S^K X$ can be computed quickly with repeated sparse multiplications.
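A rough sketch of the SGC computation on the same kind of made-up graph (random features and parameters; in practice $S^K X$ is precomputed once with sparse products and only $\Theta$ is trained):

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, num_classes, K = 6, 4, 3, 2

W = np.zeros((n, n))
for i in range(n):                                  # ring graph
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
W_tilde = W + np.eye(n)
d_tilde = W_tilde.sum(axis=1)
S = np.diag(d_tilde ** -0.5) @ W_tilde @ np.diag(d_tilde ** -0.5)

X = rng.normal(size=(n, C))
Theta = rng.normal(size=(C, num_classes))

X_smooth = np.linalg.matrix_power(S, K) @ X         # S^K X: fixed feature preprocessing
logits = X_smooth @ Theta                           # single linear map
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # row-wise softmax
print(probs.shape)                                  # (6, 3)
```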
7. GraphSAGE
We now consider graph convolution from a different perspective, mimicking the logic of convolution on an image grid: a 3x3 convolution layer takes a weighted average of the values of the 1-step neighbors of the center pixel, and the next convolution layer increases the "receptive field" to the 2-step neighbors of the center pixel.
The algorithm (Algorithm 1 in Hamilton et al., 2017) proceeds as follows. For each step $k = 1, \dots, K$, we first gather and aggregate the hidden vectors from the previous step, $h_u^{k-1}$, over the neighborhood of vertex $v$, i.e., $u \in N(v)$. We then combine the aggregated neighborhood vector $h_{N(v)}^k$ with $v$'s own vector $h_v^{k-1}$.
This process is repeated $K$ times, so information from all $K$-step neighbors is included. Instead of training a distinct embedding vector for each node, the algorithm trains a set of aggregator functions that learn to aggregate feature information from a node's local neighborhood.
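A rough sketch of this forward pass with a mean aggregator (a made-up ring graph, random weights, no neighbor sampling; the concatenate-then-ReLU update follows the description above, and the per-node L2 normalization follows Algorithm 1 of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 6, 4, 2
neighbors = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}       # ring graph
X = rng.normal(size=(n, d))                                         # input node features

W_k = [rng.normal(size=(2 * d, d)) for _ in range(K)]               # one weight matrix per step

h = X.copy()
for k in range(K):
    h_new = np.zeros_like(h)
    for v in range(n):
        h_neigh = np.mean([h[u] for u in neighbors[v]], axis=0)     # AGGREGATE_k (mean)
        concat = np.concatenate([h[v], h_neigh])                    # combine self and neighborhood
        h_new[v] = np.maximum(0.0, concat @ W_k[k])                 # h_v^k = ReLU(W_k [h_v ; h_N(v)])
    h = h_new / np.maximum(np.linalg.norm(h_new, axis=1, keepdims=True), 1e-12)  # L2-normalize
print(h.shape)                                                      # (6, 4)
```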
There are several choices of aggregator function:
- Mean aggregator: the elementwise mean of $\{h_u^{k-1}, \forall u \in N(v)\}$.
- Pooling aggregator: each neighbor's vector is independently fed through a fully-connected neural network; following this transformation, an elementwise max-pooling operation is applied to aggregate information across the neighbor set (see the sketch after this list):
$$\operatorname{AGGREGATE}^{\text{pool}}_k = \max\big(\{\sigma(W h^k_u + b),\ \forall u \in N(v)\}\big).$$
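A small sketch of the max-pooling aggregator (placeholder parameters `W_pool`, `b_pool`, and random hidden vectors; ReLU stands in for $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_pool = rng.normal(size=(d, d))                    # per-neighbor fully-connected layer
b_pool = rng.normal(size=d)

def aggregate_pool(h, neigh):
    """Transform each neighbor independently, then take the elementwise max."""
    transformed = [np.maximum(0.0, h[u] @ W_pool + b_pool) for u in neigh]
    return np.max(transformed, axis=0)

h = rng.normal(size=(6, d))                         # placeholder hidden vectors h_u^k
print(aggregate_pool(h, [0, 2, 5]).shape)           # (4,)
```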
8. Graph Attention Network
The input to our layer is a set of node features, $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \dots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$ is the number of nodes and $F$ is the number of features of each node. The layer produces a new set of node features (of potentially different cardinality $F'$), $\mathbf{h}' = \{\vec{h}'_1, \vec{h}'_2, \dots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, as its output.
Consider a self-attention mechanism on the nodes, $e_{ij} = a(W \vec{h}_i, W \vec{h}_j)$, where $W \in \mathbb{R}^{F' \times F}$ is a shared weight matrix and $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$ is the attention mechanism.
Consider the masked attention softmax, which normalizes $e_{ij}$ across all neighbors of node $i$:
$$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}.$$
We take the attention mechanism $a$ to be a single-layer feed-forward neural network, parametrized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$ and followed by a LeakyReLU non-linearity:
$$\alpha_{ij} = \frac{\exp\big(\operatorname{LeakyReLU}(\vec{a}^\top [W\vec{h}_i \,\|\, W\vec{h}_j])\big)}{\sum_{k \in N_i} \exp\big(\operatorname{LeakyReLU}(\vec{a}^\top [W\vec{h}_i \,\|\, W\vec{h}_k])\big)},$$
where $\|$ denotes concatenation. Therefore, the one-layer graph attention output is computed as
$$\vec{h}'_i = \sigma\Big(\sum_{j \in N_i} \alpha_{ij} W \vec{h}_j\Big).$$
Utilizing the multi-head attention mechanism, we can extend the layer to
$$\vec{h}'_i = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j \in N_i} \alpha^k_{ij} W^k \vec{h}_j\Big),$$
where $\Vert$ represents concatenation, $\alpha^k_{ij}$ are the normalized attention coefficients computed by the $k$-th attention mechanism ($a^k$), and $W^k$ is the corresponding input linear transformation's weight matrix. Note that, in this setting, the final returned output is $\vec{h}'_i \in \mathbb{R}^{K F'}$.
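A minimal single-head GAT layer following the equations above (a made-up 5-node graph whose neighborhoods include the node itself, random $W$ and $\vec{a}$, and tanh standing in for $\sigma$):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
N, F_in, F_out = 5, 4, 3
neighbors = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2, 3], 3: [2, 3, 4], 4: [3, 4]}

H = rng.normal(size=(N, F_in))                      # input node features h_i
W = rng.normal(size=(F_out, F_in))                  # shared linear transformation W
a = rng.normal(size=2 * F_out)                      # attention vector a in R^{2F'}

Wh = H @ W.T                                        # W h_i for every node, shape (N, F_out)
H_out = np.zeros((N, F_out))
for i in range(N):
    neigh = neighbors[i]
    e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in neigh])
    alpha = np.exp(e) / np.exp(e).sum()             # masked softmax over the neighborhood
    H_out[i] = np.tanh(alpha @ Wh[neigh])           # h'_i = sigma(sum_j alpha_ij W h_j)
print(H_out.shape)                                  # (5, 3)
```

A multi-head version would run $K$ such heads with independent $W^k$ and $\vec{a}^k$ and concatenate the outputs, giving $\vec{h}'_i \in \mathbb{R}^{K F'}$.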