As an electronics engineering major, I initially thought of “embedding” as a kind of small sub-systems used as a small computer. Later, graph theory gives me another view of “embedding”:

Graph embedding of a graph $G$ on a surface $\Sigma$ is a representation of $G$ on $\Sigma$ in which points of $\Sigma$ are associated to vertices and simple arcs are associated to edges such that connectivity is preserved and there is no over-lapping between edges.

This view of “embedding” is more abstract, and somewhat more fun. I understand it as the act of “drawing” a graph onto another surface (or space). It is well known that any planar graph can be embedded on a 2D surface, and any graph can be embedded on a 3D “surface”. However, recently, the word “embedding” seems to mean something more concrete to a machine learning practitioner:

Embedding, as in “word embedding”, refers to a set of modeling and feature learning techniques that map structured tokens to vectors of real numbers.

I think embedding is dimensionality reduction or representation learning, just another different fancy name. Maybe, in the context of natural language processing and graph processing, we call dimensionality reduction embedding .

## Background

There were many representation learning algorithms in the field of machine learning. One of the most famous method is Spectral Clustering in which the similarity of data is preserved while the dimensionality is reduced. More recently, the t-SNE algorithm is widely used for data visualization. The general objective of t-SNE is somewhat similar to Spectral Clustering. However, instead of dealing with large matrix computation, the authors solved the embedding problem by minimizing a loss function defined for the representations of each data point in two spaces (original data space and the embedding space). The result of t-SNE is a 2D or 3D representation of high dimensional data.

Similar to the t-SNE technique, DeepWalk was proposed by Perozzi et al at KDD’14. DeepWalk and other algorithms inspired by it were based on the Skipgram model proposed by Mikolov et al in 2013. By observing the similarity between the vertex frequencies distribution on random walks in a graph and the word frequency distribution in text, Perozzi et al proposed that we can use random walks on a graph to generate “sentences”, and then learn the node representation using the Skipgram model. The slide for my talk introducing DeepWalk at The University of Tokyo can be found here.

As mentioned above, Deepwalk is an algorithm that uses the word2vec software package on a generated artificial text. The “text” here is generated by performing random walks on graphs. Such idea is simple and effective.

The power law is observed in both domains. This observation can be the motivation for Deepwalk.

Denote a random walk starting from vertex $i$ is $W(v_i)$. The set containing all random walks is denoted $\mathbf{C} = \cup_{i \in V} W(v_i)$ with $V$ is the vertex set. Under this notations, the optimization problem is formulated in the same way as the skipgram model:

In the optimization problem above, $\Phi(v_i)$ is the mapping from a vertex to its vector representation. The probability is assumed to be able to factor as:

Demonstration of how Deepwalk works.

From here there are two problems need to be solved:

1. How to construct the formula for $P(v_j \mid \Phi(v_i))$.
2. How to efficiently learn and compute the conditional probability.

For the first problem, two embedding matrices are introduced. The first matrix is $\Phi_C$ - the context matrix storing the embedding of the centered vertices ($v_i$). The second matrix is $\Phi_E$ - the embedding matrix storing surround vertices’ embedding vectors. These two matrices are jointly-learned as the algorithm scans the corpus.

There are two popular solutions for the second problem: hierarchical softmax and negative sampling (my personal favorite learning algorithm). Hierarchical softmax is a decomposition technique that computes the conditional probability using a binary tree. In this technique, we first organize the words (in our context they are graph vertices) into the leaves of a binary tree. Then, the conditional probability $P(v_j \mid v_i)$ is given by the probability of a random walk starts at the root node and ends at the leave node representing $v_j$. The probability of turning left or right is computed with vector for $v_j$ and vectors associated with the intermediate binary tree nodes. We can compute the probability as:

where $[\cdot]$ is a condition evaluation operator that returns 1 if true and -1 otherwise and $\sigma$ is the sigmoid function. The detail of this function is explained in Stanford’s CS224n course. I highly recommend reading the material provided there. Their lecture notes are extremely well-written (thank you Stanford! ).

This picture is an attempt to visualize how hierarchical softmax works. At the first step of the algorithm, tokens (words or vertices) are assigned to the leaves of a binary tree. This assignment can utilize Huffman coding for minimum path length (frequent tokens get shorter paths from root). The root node and leave nodes of the binary tree are just placeholder. The algorithm keeps two sets of learn-able vectors: 1. The embedding vectors associated with each token (not on the tree), and 2. The vectors associated with the internal nodes of the binary tree (blue nodes in the figure above). When we need to compute a conditional probability, say $P(a \mid b)$, the embedding vector of $b$ and the internal node vectors will be used. For each internal node, we take the inner-product of $v_b$ and its associated vector then apply the sigmoid function to the scalar result. The output of the sigmoid functions are interpreted as the probability of going left from that node. Finally, the conditional probability $P(a \mid b)$ equals the probability of going from root node to leaf node $a$. Note that due to the property of binary trees, the probability is naturally normalized (P(going left) + P(going right) = 1).

Intuitively, I think of negative sampling as “pushing away” bad samples and “pulling in” good samples in the embedding vector spaces. When the algorithm “sees” a pair of vertices that are supposed to be close to each other, it updates the vectors to maximize the inner product. On the other hand, when the algorithm is given a pair of negative samples which is generated from some distribution, it tries to minimize the inner product. Under the hood, this algorithm optimize a objective that discriminate “true” samples (samples from our dataset) against “negative” samples (samples generated from some distribution). Negative sampling rises from a technique named Noise Contrastive Estimation. NCE deals with the intractable normalization term in unnormalized parametric statistical models. Instead of computing the normalization term directly, NCE views the normalization term as a learn-able variable. Then, the problem of learning the model parameters and the normalization term is posed as a linear regression to discriminate true data samples and generated (negative) data samples. Such technique shows great time-performance trade-off.

## Our method

Having introduced the background of representation learning for nodes in graphs based on word2vec, I will introduce our insight and attempt to improve Deepwalk. First, some analogues:

Text Graph
Element Word Vertex
(Semi) Structure Sentences, grammar Connections, subgraphs (motifs)
Distribution Power law (frequency) Power law (degree)

It is a common knowledge that “friends of friends are friends”. In network science, this phenomenon can be captured by the statistic of a subgraph in the shape of a triangle.

While simple friendships are represented by triangles, I am curious about other types of subgraphs and their functionalities in a network. More specifically, I am interested in how communities are formed. My hypothesis is that the simple recurring subgraphs (from now I call them motifs) are the building blocks for communities. Then, I investigated the z-scores for each directed size-3 motifs on a benchmark network dataset.

For each motifs in the vertical axis, I first compute its z-score for the whole network. Then, I exacted three sub-networks: Intra communities (2 networks for each label) and inter-communities (contains only inter-community links). The z-scores for each networks is reported in the figure above. We are particularly interested in motifs that has high z-scores for intra-communities, but low z-scores for inter-communities. These motifs are likely to be the patterns that build communities.

From this insight, we propose motifwalk. Our method aims to improve the result of DeepWalk through context manipulation. More conretely, instead of only perform random walk, we propose the concept of biased walk for context generation.

Definition: biased walk - A random walk in which the next vertices is chosen based on a pre-defined probability distribution.

Keeping the same learning process as DeepWalk, we propose to generate the artificial context that favor a certain motifs. In such way, nodes within a community are more likely to be closer in the embedding space. The motifs pattern for the biased random walk are selected based on some knowledge about the network or the z-score. Motif selection is so far the weakest point of my proposed model as I have not come up with a concrete method of motif selection.

## Experiments

I conducted experiment on several benchmark networks. The f1-macro score for Blogcatalog3 and Cora is presented here:

The random walk on blogcatalog is the triangle motif, while on cora it is the bi-fan motif (size-4). The embedding is learned by a word2vec model implemented using tensorflow.