Graph Convolution Network (GCN) and Graph Attention Network (GAT)

1. Laplacian Matrix on Graph

The Laplacian matrix of an $n$-vertex graph $G$ is defined as $L = D - W$, where $W \in \mathbb{R}^{n \times n}$ is the adjacency matrix with weights $W_{ij}$ and $D = \mathrm{diag}(D_i) \in \mathbb{R}^{n \times n}$, $D_i = \sum_j W_{ij}$, is the degree matrix. The normalized Laplacian matrix is defined as $L_N = D^{-1/2}(D - W)D^{-1/2} = I_n - D^{-1/2} W D^{-1/2}$.

The Laplace operator is defined as $\Delta f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}$. In 1-dimensional (discrete) space, we have $\frac{\partial^2 f}{\partial x^2} \approx f'(x) - f'(x-1) \approx \big(f(x+1) - f(x)\big) - \big(f(x) - f(x-1)\big) = f(x+1) + f(x-1) - 2f(x)$. Similarly, consider an $n$-vertex graph $G$ and a function $f$ on it, $f = (f_1, \dots, f_n)$, where $f_i$ is the function value at the $i$-th vertex. By analogy with the 1-dimensional case, the Laplacian of $f$ at vertex $i$ sums the weighted differences to its neighbors: $\Delta f_i = \sum_{j \in N_i} W_{ij}(f_i - f_j) = \sum_j W_{ij}(f_i - f_j) = \big(\sum_j W_{ij}\big) f_i - \sum_j W_{ij} f_j = D_i f_i - \sum_j W_{ij} f_j$. In vector form, we recover the definition of the Laplacian matrix on graph $G$: $\Delta f = (D - W)f = Lf$.

For an undirected graph, the Laplacian matrix is a real symmetric matrix, and therefore we have the spectral decomposition $L = U \Lambda U^{\top}$, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$. The Laplacian matrix $L$ also has the following properties (a small numerical check follows the list):

  • There is at least one eigenvalue equal to $0$.
  • All the eigenvalues are non-negative. For the normalized Laplacian matrix, all the eigenvalues lie between $0$ and $2$, and their sum equals $n$.
  • If the eigenvalue $0$ has multiplicity $k$, then the graph has $k$ mutually disconnected sub-graphs (connected components).
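To make these properties concrete, here is a minimal NumPy sketch (the toy 5-vertex graph and all variable names are illustrative assumptions, not from the notes above) that builds $L$ and $L_N$ for a graph with two connected components and checks the three bullet points numerically.

```python
import numpy as np

# Toy undirected graph with two connected components:
# a triangle on {0, 1, 2} and a single edge {3, 4}.
n = 5
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))                          # degree matrix, D_i = sum_j W_ij
L = D - W                                           # combinatorial Laplacian L = D - W
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_N = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt       # normalized Laplacian

eig_L = np.linalg.eigvalsh(L)                       # real eigenvalues (L is symmetric)
eig_LN = np.linalg.eigvalsh(L_N)

print(np.round(eig_L, 6))                 # non-negative; two (near-)zero eigenvalues
print(np.round(eig_LN, 6))                # all eigenvalues lie in [0, 2]
print(round(eig_LN.sum(), 6))             # sums to n = 5 (the trace of L_N)
print(int(np.sum(np.isclose(eig_L, 0))))  # 2 zero eigenvalues = 2 components
```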

2. Fourier Transform

Recall the Fourier transform of a function $f(\cdot)$, $F_T(\omega) = \int_{-\infty}^{+\infty} f(t) e^{-i\omega t}\, dt$, where $\{e^{i\omega t}\}$ is a set of orthogonal basis functions. We also have the inverse Fourier transform, $f(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F_T(\omega) e^{i\omega t}\, d\omega$.

For the basis functions $\{e^{i\omega t}\}$, notice that they are eigenfunctions of the Laplace operator $\Delta$: $\Delta e^{i\omega t} = \frac{d^2 e^{i\omega t}}{dt^2} = -\omega^2 e^{i\omega t}$. Similarly, the eigenvectors $U = [u_1, \dots, u_n]$ of the graph Laplacian matrix form the basis for the Fourier transform on the graph $G$, i.e., $F_T(\lambda_k) = \hat{f}_k = \langle f, u_k \rangle = \sum_{i=1}^n f_i u_k(i)$. Therefore, $\hat{f} = U^{\top} f$. Similarly, we have the inverse Fourier transform on graph $G$: $f = U \hat{f} = U U^{\top} f$.
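As a small illustration of the graph Fourier transform (an assumed toy path graph, not part of the original notes), the eigenvectors of $L$ play the role of the basis $e^{i\omega t}$, and $U U^{\top} = I$ guarantees exact inversion:

```python
import numpy as np

# Toy path graph 0-1-2-3; its Laplacian eigenvectors are the graph Fourier basis.
n = 4
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam, U = np.linalg.eigh(L)             # L = U diag(lam) U^T, columns u_k of U

f = np.array([1.0, -2.0, 0.5, 3.0])    # a signal on the 4 vertices
f_hat = U.T @ f                        # graph Fourier transform: f_hat = U^T f
f_rec = U @ f_hat                      # inverse transform: f = U f_hat
print(np.allclose(f, f_rec))           # True, since U U^T = I
```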

3. Convolution on Graphs

Denote the continuous convolution of two functions $f, g$ as $(f * g)(t) \overset{\text{def}}{=} \int_{\mathbb{R}^n} f(\tau)\, g(t - \tau)\, d\tau$, as well as the discrete convolution, $(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n-m) = \sum_{m=-\infty}^{\infty} f(n-m)\, g(m)$. Recall the relation between convolution and the Fourier transform, $\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$. Therefore, $f * g = \mathcal{F}^{-1}\big(\mathcal{F}\{f\} \cdot \mathcal{F}\{g\}\big)$. For convolution operations on the graph $G$, we similarly have $(f * g)_G = \mathcal{F}^{-1}\big[\mathcal{F}\{f\} \odot \mathcal{F}\{g\}\big] = U\big[(U^{\top} f) \odot \hat{g}\big] = U\, \mathrm{diag}(\hat{g}_1, \dots, \hat{g}_n)\, U^{\top} f$, where $\odot$ denotes the element-wise product. If we parameterize $\hat{g}$ directly by trainable parameters $\theta_1, \dots, \theta_n$, $(f * g)_G = U\, \mathrm{diag}(\theta_1, \dots, \theta_n)\, U^{\top} f$. This is the convolution filter on graphs.
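A minimal sketch of this spectral filter, assuming a small toy cycle graph and randomly initialized parameters $\theta$ purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-cycle graph and its spectral decomposition.
n = 4
W = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

f = rng.normal(size=n)                 # input signal on the graph
theta = rng.normal(size=n)             # one trainable parameter per eigenvalue

# Spectral filtering: transform to the Fourier domain (U^T f), rescale each
# frequency component by theta_k, then transform back (multiply by U).
y = U @ np.diag(theta) @ (U.T @ f)
print(y)
```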

4. Fast Localized Spectral Filtering

Recall that in the above graph convolution, $\hat{g}$ is a vector depending on the eigenvalues of the Laplacian matrix $L$; we can rewrite it as $\hat{g} = g_{\theta}(\Lambda)$, and the graph convolution is then $y = U g_{\theta}(\Lambda) U^{\top} x$. Notice that it can also be seen as a signal $x$ filtered by $g_{\theta}$: $y = g_{\theta}(L)x = g_{\theta}(U \Lambda U^{\top})x = U g_{\theta}(\Lambda) U^{\top} x$. The parameterization in the previous section can be seen as a non-parametric filter, i.e., $g_{\theta}(\Lambda) = \mathrm{diag}(\theta_1, \dots, \theta_n)$. There are, however, two limitations with non-parametric filters:

  • they are not localized in space
  • their learning complexity is $O(n)$, the number of nodes in the graph.

These issues can be overcome by using a polynomial filter, $g_{\theta}(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$. Under the polynomial filter, the convolution is given by $y = U g_{\theta}(\Lambda) U^{\top} x = \big(\sum_{k=0}^{K-1} \theta_k L^k\big) x$. Notice that $d_G(i,j) > K$ implies $(L^K)_{i,j} = 0$, where $d_G$ is the shortest-path distance between $i$ and $j$. Consequently, spectral filters represented by $K$-th order polynomials of the Laplacian matrix are exactly $K$-localized. Besides, their learning complexity is $O(K)$, the support size of the filter, which is the same complexity as classical CNNs.
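The polynomial filter can be applied without ever forming $U$, using only repeated multiplications by $L$. A hedged NumPy sketch under assumed toy inputs (graph, $x$, and $\theta$ are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 3

# Toy undirected graph and its Laplacian.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

x = rng.normal(size=n)
theta = rng.normal(size=K)        # K polynomial coefficients theta_0 .. theta_{K-1}

# y = sum_{k=0}^{K-1} theta_k L^k x via repeated matrix-vector products:
# no eigendecomposition is needed, and the filter is K-hop localized.
y = np.zeros(n)
Lk_x = x.copy()                   # holds L^k x, starting from k = 0
for k in range(K):
    y += theta[k] * Lk_x
    Lk_x = L @ Lk_x
print(y)
```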

However, computing with the eigenvector matrix $U$ still involves $O(n^2)$ complexity. Recall that the Chebyshev polynomial $T_k(x)$ of order $k$ may be computed by the stable recurrence relation $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0 = 1$ and $T_1 = x$. These polynomials form an orthogonal basis for $L^2([-1,1], dy/\sqrt{1-y^2})$, the Hilbert space of square-integrable functions with respect to the measure $dy/\sqrt{1-y^2}$. A filter can thus be parametrized as the truncated expansion $g_{\theta}(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ is the diagonal matrix of eigenvalues rescaled to lie in $[-1, 1]$. The convolution operation is therefore $y = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x$, where $\tilde{L} = 2L/\lambda_{\max} - I_n$. Denoting $\bar{x}_k = T_k(\tilde{L})x \in \mathbb{R}^n$, we can use the recurrence relation to compute $\bar{x}_k = 2\tilde{L}\bar{x}_{k-1} - \bar{x}_{k-2}$, with $\bar{x}_0 = x$ and $\bar{x}_1 = \tilde{L}x$. The entire filtering operation then costs $O(K|E|)$ operations, where $|E|$ is the number of edges in the graph $G$.
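Below is a small sketch of the Chebyshev recurrence, assuming a toy dense graph; in practice $\tilde{L}$ would be stored as a sparse matrix so that each recurrence step costs $O(|E|)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 4

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam_max = np.linalg.eigvalsh(L)[-1]
L_tilde = 2.0 * L / lam_max - np.eye(n)   # rescale the spectrum into [-1, 1]

x = rng.normal(size=n)
theta = rng.normal(size=K)

# Chebyshev recurrence: x_bar_k = 2 L_tilde x_bar_{k-1} - x_bar_{k-2},
# with x_bar_0 = x and x_bar_1 = L_tilde x.
x_prev, x_cur = x, L_tilde @ x
y = theta[0] * x_prev + theta[1] * x_cur
for k in range(2, K):
    x_prev, x_cur = x_cur, 2.0 * L_tilde @ x_cur - x_prev
    y += theta[k] * x_cur
print(y)
```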

5. Fast Approximate Convolutions on Graphs

Recall from the previous section the convolution operation on a vector $x$ of a graph $G$ using Chebyshev polynomials, $y = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L})\, x$. If we consider the simple case $K = 2$, use the normalized Laplacian matrix $L_N$, and approximate $\lambda_{\max} \approx 2$ (the eigenvalue upper bound for $L_N$), we have $y = \sum_{k=0}^{1} \theta_k T_k(\tilde{L}_N) x = \theta_0 x + \theta_1 (L_N - I_n) x = \theta_0 x - \theta_1 D^{-\frac{1}{2}} W D^{-\frac{1}{2}} x$. In practice, it can be beneficial to constrain the number of parameters further to address overfitting and to minimize the number of operations (such as matrix multiplications) per layer. Setting $\theta = \theta_0 = -\theta_1$ leaves us with the expression $y = \theta x + \theta D^{-\frac{1}{2}} W D^{-\frac{1}{2}} x = \theta \big(I_n + D^{-\frac{1}{2}} W D^{-\frac{1}{2}}\big) x$. Note that $I_n + D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$ now has eigenvalues in the range $[0, 2]$. Repeated application of this operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep neural network model.

To alleviate this problem, we introduce the following renormalization trick: $I_n + D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{W} \tilde{D}^{-\frac{1}{2}}$, with $\tilde{W} = W + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{W}_{ij}$.

We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e., a $C$-dimensional feature vector for every node) and $F$ filters or feature maps as follows: $Y = \tilde{D}^{-\frac{1}{2}} \tilde{W} \tilde{D}^{-\frac{1}{2}} X \Theta$, where $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters and $Y \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. This filtering operation has complexity $O(|E|FC)$, as $\tilde{W}X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
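A minimal NumPy sketch of one such layer, including the renormalization trick from above (the toy graph and random $X$, $\Theta$ are assumptions for illustration, and the sparse multiplication is done densely here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, F = 6, 8, 4                       # nodes, input channels, filters

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

# Renormalization trick: add self-loops, then symmetrically normalize.
W_tilde = W + np.eye(N)
D_tilde_inv_sqrt = np.diag(1.0 / np.sqrt(W_tilde.sum(axis=1)))
S = D_tilde_inv_sqrt @ W_tilde @ D_tilde_inv_sqrt

X = rng.normal(size=(N, C))             # node feature matrix
Theta = rng.normal(size=(C, F))         # filter parameters

Y = S @ X @ Theta                       # one graph-convolution layer (no activation)
print(Y.shape)                          # (N, F)
```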

6. Simple GCN

Following Kipf and Welling (2017) above, for an input $X^{(t)} \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e., a $C$-dimensional feature vector for every node) and $F$ filters or feature maps, each layer computes $X^{(t+1)} = \sigma\big(S X^{(t)} \Theta\big)$, $S = \tilde{D}^{-\frac{1}{2}} \tilde{W} \tilde{D}^{-\frac{1}{2}}$, where $\Theta \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters, $\sigma$ is the non-linear activation function, and $X^{(t+1)} \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. For a $K$-depth network, we repeat the above procedure $K$ times.

This paper hypothesizes that the nonlinearity between GCN layers is not critical, and that the majority of the benefit arises from the local averaging. The resulting model is linear but still has the same increased "receptive field" as a $K$-layer GCN, i.e., $Y = \mathrm{softmax}\big(S \cdots S\, X\, \Theta^{(1)} \Theta^{(2)} \cdots \Theta^{(K)}\big) = \mathrm{softmax}\big(S^K X \Theta\big)$. With this network, the authors obtain experimental results comparable to the previous structure, while the computation cost and the number of parameters are significantly reduced. Notice that $S$ is sparse, with on the order of $2|E|$ non-zero entries, where $|E|$ is the number of edges in graph $G$, so the repeated propagation $S^K X$ can be computed quickly with sparse matrix products.
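A sketch of this simplified forward pass under assumed toy inputs (graph, features, and weights are illustrative): the $K$ propagation steps are precomputed once, and what remains is a single linear map followed by a softmax.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, C, F, K = 6, 8, 3, 2                 # nodes, features, classes, depth

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
W_tilde = W + np.eye(N)
D_inv_sqrt = np.diag(1.0 / np.sqrt(W_tilde.sum(axis=1)))
S = D_inv_sqrt @ W_tilde @ D_inv_sqrt

X = rng.normal(size=(N, C))
Theta = rng.normal(size=(C, F))         # single collapsed weight matrix

# Precompute the K-step propagation once (sparse products in practice);
# training then reduces to logistic regression on the smoothed features S^K X.
X_smooth = X.copy()
for _ in range(K):
    X_smooth = S @ X_smooth
Y = softmax(X_smooth @ Theta)
print(Y.shape)                          # (N, F)
```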

7. GraphSAGE

We now consider graph convolution from a different angle, by mimicking the logic of image convolution on graphs: a 3x3 convolution layer takes a weighted average of the values of the 1-step neighbors of the center point, and the next convolution layer increases the "receptive field" to the 2-step neighbors of the center point.

The algorithm proceeds as follows. For each step $k = 1, \dots, K$, we first gather and aggregate all the hidden vectors from the previous step, $h_u^{k-1}$, over the neighborhood of vertex $v$, i.e., $u \in N(v)$, into a neighborhood vector $h^{k}_{N(v)}$. We then combine this neighborhood information vector $h^{k}_{N(v)}$ with $v$'s own state $h_v^{k-1}$.

This process is repeated $K$ times, so information from all $K$-step neighbors is included. Instead of training a distinct embedding vector for each node, the algorithm trains a set of aggregator functions that learn to aggregate feature information from a node's local neighborhood.
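A hedged sketch of this $K$-step aggregation loop with a mean aggregator, following the description above (the toy graph, dimensions, and the $\ell_2$-normalization step are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
N, d, K = 6, 8, 2                       # nodes, feature dimension, search depth

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
neighbors = [np.flatnonzero(A[v]) for v in range(N)]

H = rng.normal(size=(N, d))             # h_v^0: input node features
W_steps = [rng.normal(size=(2 * d, d)) for _ in range(K)]   # one weight matrix per step

for k in range(K):
    H_new = np.zeros_like(H)
    for v in range(N):
        # Mean-aggregate the previous-step states of v's neighbors, then
        # combine with v's own state via concatenation, a linear map, and ReLU.
        h_neigh = H[neighbors[v]].mean(axis=0)
        h_cat = np.concatenate([H[v], h_neigh])
        H_new[v] = relu(h_cat @ W_steps[k])
        H_new[v] /= np.linalg.norm(H_new[v]) + 1e-8   # l2-normalize each state
    H = H_new
print(H.shape)                          # final K-hop node representations
```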

There are several choices of aggregator function (a sketch of both appears after this list):

  • Mean aggregator: take the elementwise average of all the neighbor states $\{h_u^{k-1}, u \in N(v)\}$.

  • Pooling aggregator: each neighbor’s vector is independently fed through a fully-connected neural network; following this transformation, an elementwise max-pooling operation is applied to aggregate information across the neighbor set: $\mathrm{AGGREGATE}^{\mathrm{pool}}_{k} = \max\big(\{\sigma(W h_u^{k} + b),\ u \in N(v)\}\big)$.
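As referenced above, here is a small sketch of the two aggregators acting on the states of a sampled neighbor set (the weight shapes and names such as `W_pool` are illustrative assumptions, not fixed by the text):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
d = 8
W_pool = rng.normal(size=(d, d))        # fully-connected transform (illustrative)
b_pool = rng.normal(size=d)

def aggregate_mean(neighbor_states):
    """Mean aggregator: elementwise average of the neighbors' states."""
    return neighbor_states.mean(axis=0)

def aggregate_pool(neighbor_states):
    """Max-pooling aggregator: transform each neighbor independently,
    then take an elementwise max over the neighbor set."""
    transformed = relu(neighbor_states @ W_pool + b_pool)   # shape (|N(v)|, d)
    return transformed.max(axis=0)

H_neigh = rng.normal(size=(4, d))       # states of 4 (sampled) neighbors of some node v
print(aggregate_mean(H_neigh).shape, aggregate_pool(H_neigh).shape)
```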

8. Graph Attention Network

The input to the layer is a set of node features, $\mathbf{h} = \{h_1, h_2, \dots, h_N\}$, $h_i \in \mathbb{R}^{F}$, where $N$ is the number of nodes and $F$ is the number of features in each node. The layer produces a new set of node features (of potentially different cardinality $F'$), $\mathbf{h}' = \{h'_1, h'_2, \dots, h'_N\}$, $h'_i \in \mathbb{R}^{F'}$, as its output.

Consider a self-attention mechanism on the nodes, $e_{ij} = a(W h_i, W h_j)$, where $W \in \mathbb{R}^{F' \times F}$ is the weight matrix and $a$ is the attention mechanism $a : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$.

Consider the masked attention softmax, which normalizes $e_{ij}$ across all the neighbors of node $i$: $\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}$. We take the attention mechanism $a$ to be a single-layer feed-forward neural network, parametrized by a weight vector $\mathbf{a} \in \mathbb{R}^{2F'}$ and applying the LeakyReLU non-linearity: $\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}(\mathbf{a}^{\top} [W h_i \,\Vert\, W h_j])\big)}{\sum_{k \in N_i} \exp\big(\mathrm{LeakyReLU}(\mathbf{a}^{\top} [W h_i \,\Vert\, W h_k])\big)}$, where $\Vert$ is the concatenation operation. Therefore, one layer of graph attention is computed as $h'_i = \sigma\big(\sum_{j \in N_i} \alpha_{ij} W h_j\big)$. Utilizing the multi-head attention mechanism, we can extend the network structure to $h'_i = \big\Vert_{k=1}^{K} \sigma\big(\sum_{j \in N_i} \alpha^k_{ij} W^k h_j\big)$, where $\Vert$ represents concatenation, $\alpha^k_{ij}$ are the normalized attention coefficients computed by the $k$-th attention mechanism ($a^k$), and $W^k$ is the corresponding input linear transformation's weight matrix. Note that, in this setting, the final returned output is $h'_i \in \mathbb{R}^{K F'}$.
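A minimal single-head GAT-style layer in NumPy (a sketch under assumptions: toy graph with self-loops added so each node also attends to itself, random $W$ and $\mathbf{a}$, and $\sigma = \tanh$ purely for illustration); $K$ heads would simply run this in parallel and concatenate the outputs:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, F_in, F_out = 6, 8, 4                 # nodes, input dim F, output dim F'

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
A = np.eye(N)                            # self-loops: each node attends to itself
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

h = rng.normal(size=(N, F_in))           # input node features
W = rng.normal(size=(F_out, F_in))       # shared linear transform, W in R^{F' x F}
a = rng.normal(size=2 * F_out)           # attention vector, a in R^{2F'}

Wh = h @ W.T                             # (N, F_out): W h_i for every node
h_out = np.zeros((N, F_out))
for i in range(N):
    nbrs = np.flatnonzero(A[i])
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every j in N_i, then masked softmax.
    e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in nbrs])
    alpha = softmax(e)
    h_out[i] = np.tanh(alpha @ Wh[nbrs]) # sigma = tanh here, purely illustrative
print(h_out.shape)                       # (N, F_out); K heads would concatenate to (N, K*F_out)
```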

results matching ""

    No results matching ""