Network motif


Network motifs are recurrent and statistically significant subgraphs or patterns of a larger graph. All networks, including biological networks, social networks, and technological networks (e.g., computer networks and electrical circuits), can be represented as graphs, which include a wide variety of subgraphs.

Network motifs are sub-graphs that repeat themselves in a specific network or even among various networks. Each of these sub-graphs, defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are achieved efficiently. Indeed, motifs are of notable importance largely because they may reflect functional properties. They have recently gathered much attention as a useful concept to uncover structural design principles of complex networks. Although network motifs may provide a deep insight into the network's functional abilities, their detection is computationally challenging.

Motif discovery algorithms


Various solutions have been proposed for the challenging problem of network motif (NM) discovery. These algorithms can be classified under various paradigms such as exact counting methods, sampling methods, pattern growth methods and so on. However, the motif discovery problem comprises two main steps: first, calculating the number of occurrences of a sub-graph and then, evaluating the sub-graph significance. The recurrence is significant if it is detectably far more frequent than expected. Roughly speaking, the expected number of appearances of a sub-graph can be determined by a Null-model, which is defined by an ensemble of random networks with some of the same properties as the original network.
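As a concrete, simplified illustration of these two steps, the following Python sketch uses the networkx library to count one candidate sub-graph (the triangle) in a network and to compare that count against an ensemble of degree-preserving randomized networks via a z-score. The function names, the choice of the triangle as the motif, and the use of double edge swaps as the Null-model are illustrative assumptions, not the method of any specific tool.

import statistics
import networkx as nx

def count_triangles(g: nx.Graph) -> int:
    # nx.triangles counts, per node, the triangles through it; each triangle
    # is therefore counted three times.
    return sum(nx.triangles(g).values()) // 3

def motif_zscore(g: nx.Graph, n_random: int = 100, seed: int = 0) -> float:
    real_count = count_triangles(g)
    random_counts = []
    for i in range(n_random):
        r = g.copy()
        # Null-model: degree-preserving randomization via double edge swaps.
        nx.double_edge_swap(r, nswap=4 * r.number_of_edges(),
                            max_tries=10**5, seed=seed + i)
        random_counts.append(count_triangles(r))
    mu = statistics.mean(random_counts)
    sigma = statistics.stdev(random_counts)
    return (real_count - mu) / sigma if sigma > 0 else float("inf")

if __name__ == "__main__":
    g = nx.karate_club_graph()
    print("triangle z-score:", round(motif_zscore(g, n_random=50), 2))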

Until 2004, the only exact counting method for NM detection was the brute-force one proposed by Milo et al. This algorithm was successful for discovering small motifs, but using this method for finding even size 5 or 6 motifs was not computationally feasible. Hence, a new approach to this problem was needed.

Here, a review of the computational aspects of the major algorithms is given, and their related benefits and drawbacks from an algorithmic perspective are discussed.

Kashtan et al. published mfinder, the first motif-mining tool, in 2004. It implements two kinds of motif finding algorithms: a full enumeration and the first sampling method.

Their sampling discovery algorithm was based on edge sampling throughout the network. This algorithm estimates concentrations of induced sub-graphs and can be utilized for motif discovery in directed or undirected networks. The sampling procedure of the algorithm starts from an arbitrary edge of the network that leads to a sub-graph of size two, and then expands the sub-graph by choosing a random edge that is incident to the current sub-graph. After that, it keeps choosing random neighboring edges until a sub-graph of size n is obtained. Finally, the sampled sub-graph is expanded to include all of the edges that exist in the network between these n nodes. When an algorithm uses a sampling approach, taking unbiased samples is the most important issue that the algorithm must address. The sampling procedure, however, does not take samples uniformly, and therefore Kashtan et al. proposed a weighting scheme that assigns different weights to the different sub-graphs within the network. The underlying principle of weight allocation is to exploit the information of the sampling probability for each sub-graph, i.e. probable sub-graphs receive comparatively lower weights than improbable sub-graphs; hence, the algorithm must calculate the sampling probability of each sub-graph that has been sampled. This weighting technique assists mfinder in determining sub-graph concentrations impartially.

In sharp contrast to exhaustive search, the computational time of the algorithm is, surprisingly, asymptotically independent of the network size. An analysis of the computational time of the algorithm has shown that the cost of each sample of a sub-graph of size n depends only on n, not on the size of the network. On the other hand, there is no analysis of the classification time of sampled sub-graphs, which requires solving the graph isomorphism problem for each sub-graph sample. Additionally, an extra computational effort is imposed on the algorithm by the sub-graph weight calculation. It is also unavoidable that the algorithm may sample the same sub-graph multiple times, spending time without gathering any information. In conclusion, by taking advantage of sampling, the algorithm performs more efficiently than an exhaustive search algorithm; however, it only determines sub-graph concentrations approximately. This algorithm can find motifs up to size 6 because of its main implementation, and as a result it reports only the most significant motif, not all the others. Also, it is necessary to mention that this tool has no option of visual presentation. The sampling algorithm, in which Es denotes the set of picked edges and Vs the set of nodes touched by those edges, is shown briefly below, followed by an illustrative sketch:

1. Pick a random edge e1 = (u, v). Update Es = {e1}, Vs = {u, v}.

2. Make a list L of all neighbor edges of Es. Omit from L all edges between members of Vs.

3. Pick a random edge e = (x, y) from L. Update Es = Es ⋃ {e}, Vs = Vs ⋃ {x, y}.

4. Repeat steps 2-3 until completing an n-node subgraph (until |Vs| = n).

5. Calculate the probability of sampling the picked n-node subgraph.
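The following Python sketch mirrors steps 1-4 above on a networkx graph; the helper name sample_subgraph is ours, and the probability/weight computation of step 5, which mfinder uses to correct the sampling bias, is deliberately omitted.

import random
import networkx as nx

def sample_subgraph(g: nx.Graph, n: int, rng: random.Random) -> nx.Graph:
    # Step 1: pick a random edge; Vs holds the touched nodes.
    u, v = rng.choice(list(g.edges()))
    v_s = {u, v}
    while len(v_s) < n:
        # Step 2: list neighbor edges with exactly one endpoint inside Vs
        # (edges between two members of Vs are omitted).
        frontier = [(a, b) for a in v_s for b in g.neighbors(a) if b not in v_s]
        if not frontier:
            break  # component smaller than n; a real run would resample
        # Step 3: pick a random neighboring edge and update Vs.
        a, b = rng.choice(frontier)
        v_s.add(b)
    # Step 4 done; return the sub-graph induced on Vs, i.e. with all edges
    # of G between the sampled nodes.
    return g.subgraph(v_s).copy()

if __name__ == "__main__":
    rng = random.Random(1)
    g = nx.erdos_renyi_graph(50, 0.1, seed=1)
    s = sample_subgraph(g, 4, rng)
    print(sorted(s.nodes()), s.number_of_edges())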

Schreiber and Schwöbbermeyer proposed an algorithm named flexible pattern finder (FPF) for extracting frequent sub-graphs of an input network and implemented it in a system named Mavisto. Their algorithm exploits the downward closure property, which is applicable for the frequency concepts F2 and F3 (which count only edge-disjoint matches and node- and edge-disjoint matches, respectively). The downward closure property asserts that the frequency of sub-graphs decreases monotonically with increasing sub-graph size; however, this property does not necessarily hold for frequency concept F1. FPF is based on a pattern tree (see figure) consisting of nodes that represent different graphs (or patterns), where the parent of each node is a sub-graph of its children nodes; in other words, the corresponding graph of each pattern tree node is expanded by adding a new edge to the graph of its parent node.

At first, the FPF algorithm enumerates and maintains the information of all matches of a sub-graph located at the root of the pattern tree. Then, one by one, it builds the child nodes of the previous node in the pattern tree by adding one edge supported by a matching edge in the target graph, and tries to expand all of the previous information about matches to the new sub-graph (child node). In the next step, it decides whether the frequency of the current pattern is lower than a predefined threshold or not. If it is lower and downward closure holds, FPF can abandon that path and not traverse further in this part of the tree; as a result, unnecessary computation is avoided. This procedure is continued until there is no remaining path to traverse.

The advantage of the algorithm is that it does not consider infrequent sub-graphs and tries to finish the enumeration process as soon as possible; therefore, it only spends time on promising nodes in the pattern tree and discards all other nodes. As an added bonus, the pattern tree notion permits FPF to be implemented and executed in a parallel manner, since it is possible to traverse each path of the pattern tree independently. However, FPF is most useful for the frequency concepts F2 and F3, because downward closure is not applicable to F1. Nevertheless, the pattern tree is still practical for F1 if the algorithm runs in parallel. Another advantage of the algorithm is that its implementation has no limitation on motif size, which makes it more amenable to improvements. The pseudocode of FPF (Mavisto) is shown below:

Data: Graph G, target pattern size t, frequency concept F
Result: Set R of patterns of size t with maximum frequency.
P ← start pattern p1 of size 1
Mp1 ← all matches of p1 in G
fmax ← 0
While P ≠ φ do
  Pmax ← select all patterns from P with maximum size
  p ← pattern with maximum frequency from Pmax; remove p from P
  E ← all extensions of p by one edge, supported by the matches Mp in G
  Foreach pattern p′ ∈ E
    If F = F1 then f ← size(Mp′)
    Else f ← size of a maximum independent set of the overlap graph of Mp′
    End
    If size(p′) = t then
      If f = fmax then R ← R ⋃ {p′}
      Else if f > fmax then R ← {p′}; fmax ← f
      End
    Else
      If F = F1 or f ≥ fmax then P ← P ⋃ {p′}
      End
    End
  End
End
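To make the role of the frequency concepts concrete, the sketch below contrasts an F1-style count (all matches) with an F2-style count (edge-disjoint matches); the maximum independent set computation used by FPF is replaced here by a simple greedy selection, so this is an illustrative approximation rather than the authors' procedure. Because disjoint matches can only become scarcer as a pattern grows, the F2/F3 counts decrease monotonically, which is exactly the downward closure property FPF prunes on.

from typing import FrozenSet, List, Set, Tuple

Match = FrozenSet[Tuple[int, int]]  # a match is the set of network edges it uses

def f1_frequency(matches: List[Match]) -> int:
    # F1: every match counts, overlaps allowed.
    return len(matches)

def f2_frequency_greedy(matches: List[Match]) -> int:
    # F2 (approximate): keep matches that share no edge with an already kept
    # match; FPF instead solves a maximum independent set problem here.
    used_edges: Set[Tuple[int, int]] = set()
    count = 0
    for m in matches:
        if used_edges.isdisjoint(m):
            used_edges |= m
            count += 1
    return count

if __name__ == "__main__":
    # Three matches of a 3-node path; the first two share the edge (1, 2).
    matches = [frozenset({(0, 1), (1, 2)}),
               frozenset({(1, 2), (2, 3)}),
               frozenset({(3, 4), (4, 5)})]
    print("F1:", f1_frequency(matches))                  # 3
    print("F2 (greedy):", f2_frequency_greedy(matches))  # 2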

The sampling bias of Kashtan et al. provided great impetus for designing better algorithms for the NM discovery problem. Although Kashtan et al. tried to settle this drawback by means of a weighting scheme, the method imposed an undesired overhead on the running time as well as a more complicated implementation. The FANMOD tool introduced next is one of the most useful ones, as it supports visual options and is also efficient with respect to time; however, it has a limitation on motif size, as it does not allow searching for motifs of size 9 or higher because of the way the tool is implemented.

Wernicke introduced an algorithm named RAND-ESU that provides a significant improvement over mfinder. This algorithm, which is based on the exact enumeration algorithm ESU, has been implemented as an application called FANMOD. RAND-ESU is an NM discovery algorithm applicable to both directed and undirected networks; it effectively exploits unbiased node sampling throughout the network and prevents counting sub-graphs more than once. Furthermore, RAND-ESU uses a novel analytical approach called DIRECT for determining sub-graph significance instead of using an ensemble of random networks as a Null-model. The DIRECT method estimates the sub-graph concentration without explicitly generating random networks. Empirically, the DIRECT method is more efficient than the random network ensemble for sub-graphs with a very low concentration; however, the classical Null-model is faster than the DIRECT method for highly concentrated sub-graphs. In the following, we detail the ESU algorithm and then show how this exact algorithm can be modified efficiently into RAND-ESU, which estimates sub-graph concentrations.

The algorithms ESU and RAND-ESU are fairly simple, and hence easy to implement. ESU first finds the set of all induced sub-graphs of size k; let Sk be this set. ESU can be implemented as a recursive function; the running of this function can be displayed as a tree-like structure of depth k, called the ESU-Tree (see figure). Each of the ESU-Tree nodes indicates the status of the recursive function, which entails two consecutive sets SUB and EXT. SUB refers to nodes in the target network that are adjacent and establish a partial sub-graph of size |SUB| ≤ k. If |SUB| = k, the algorithm has found an induced complete sub-graph, so SUB is added to Sk. However, if |SUB| < k, the algorithm must expand SUB to achieve cardinality k. This is done by the EXT set, which contains all the nodes that satisfy two conditions: first, each of the nodes in EXT must be adjacent to at least one of the nodes in SUB; second, their numerical labels must be larger than the label of the first element of SUB. The first condition makes sure that the expansion of SUB nodes yields a connected graph, and the second condition causes the ESU-Tree leaves (see figure) to be distinct; as a result, it prevents overcounting. Note that the EXT set is not a static set, so at each step it may be expanded by some new nodes that do not violate the two conditions. The next step of ESU involves classification of the sub-graphs placed in the ESU-Tree leaves into non-isomorphic size-k graph classes; consequently, ESU determines sub-graph frequencies and concentrations. This stage has been implemented simply by employing McKay's nauty algorithm, which classifies each sub-graph by performing a graph isomorphism test. Therefore, ESU finds the set of all induced size-k sub-graphs in a target graph by a recursive algorithm and then determines their frequency using an efficient tool.

The procedure of implementing RAND-ESU is quite straightforward and is one of the main advantages of FANMOD. One can modify the ESU algorithm to explore just a portion of the ESU-Tree leaves by applying a probability value 0 ≤ pd ≤ 1 for each level d of the ESU-Tree and obliging ESU to traverse each child node of a node at level d-1 with probability pd. This new algorithm is called RAND-ESU. Evidently, when pd = 1 for all levels, RAND-ESU acts like ESU, and for pd = 0 the algorithm finds nothing. The probability of visiting each leaf is the product p1 · p2 · … · pk and is identical for all of the ESU-Tree leaves; therefore, this method guarantees unbiased sampling of sub-graphs from the network. Nonetheless, the values of pd for 1 ≤ d ≤ k must be determined manually by an expert to obtain precise estimates of sub-graph concentrations. While there is no lucid formula for this matter, Wernicke provides some general observations that may help in determining pd values. In summary, RAND-ESU is a very fast algorithm for NM discovery in the case of induced sub-graphs that supports an unbiased sampling method. Although the main ESU algorithm, and so the FANMOD tool, is nominally for discovering induced sub-graphs, a trivial adjustment to ESU also makes it possible to find non-induced sub-graphs. The pseudocode of ESU (FANMOD) is shown below:

EnumerateSubgraphs(G, k)
Input: A graph G = (V, E) and an integer 1 ≤ k ≤ |V|.
Output: All size-k subgraphs in G.
for each vertex v ∈ V do
  VExtension ← {u ∈ N({v}) | u > v}
  call ExtendSubgraph({v}, VExtension, v)
endfor

ExtendSubgraph(VSubgraph, VExtension, v)
if |VSubgraph| = k then output G[VSubgraph] and return
while VExtension ≠ ∅ do
  Remove an arbitrarily chosen vertex w from VExtension
  VExtension′ ← VExtension ∪ {u ∈ Nexcl(w, VSubgraph) | u > v}
  call ExtendSubgraph(VSubgraph ∪ {w}, VExtension′, v)
return

Here Nexcl(w, VSubgraph) denotes the exclusive neighborhood of w: all neighbors of w that are neither in VSubgraph nor adjacent to any of its members.
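The sketch below is a Python transcription of this pseudocode for undirected networkx graphs with integer node labels; an optional list of per-level probabilities turns it into a RAND-ESU-style sampler, and a degree-sequence key stands in for the nauty-based isomorphism classification (sufficient only for k = 3). The function names and these simplifications are our own.

import random
import networkx as nx

def esu(g: nx.Graph, k: int, probs=None, rng=None):
    """Return the node sets of (sampled) connected induced size-k sub-graphs."""
    probs = probs or [1.0] * k          # [p_1, ..., p_k]; all 1.0 gives exact ESU
    rng = rng or random.Random()
    results = []

    def extend(sub, ext, v, depth):
        if len(sub) == k:
            results.append(frozenset(sub))
            return
        while ext:
            w = ext.pop()               # remove an arbitrarily chosen vertex w
            if rng.random() > probs[depth]:
                continue                # RAND-ESU: skip this child of the ESU-Tree
            # exclusive neighborhood of w: neighbors with label > v that are
            # neither in SUB nor adjacent to any node of SUB
            excl = {u for u in g.neighbors(w)
                    if u > v and u not in sub and all(u not in g[s] for s in sub)}
            extend(sub | {w}, ext | excl, v, depth + 1)

    for v in g.nodes():
        if rng.random() > probs[0]:
            continue
        ext = {u for u in g.neighbors(v) if u > v}
        extend({v}, ext, v, 1)
    return results

if __name__ == "__main__":
    g = nx.karate_club_graph()
    subs = esu(g, 3)                    # exact enumeration of size-3 sub-graphs
    # group leaves into isomorphism classes; the degree sequence suffices at k = 3
    classes = {}
    for nodes in subs:
        key = tuple(sorted(d for _, d in g.subgraph(nodes).degree()))
        classes[key] = classes.get(key, 0) + 1
    print(len(subs), classes)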

Chen et al. introduced a new NM discovery algorithm called NeMoFinder, which adapts the idea in SPIN to extract frequent trees and then expands them into non-isomorphic graphs. NeMoFinder utilizes frequent size-n trees to partition the input network into a collection of size-n graphs, afterwards finding frequent size-n sub-graphs by expanding the frequent trees edge by edge until a complete size-n graph is obtained. The algorithm finds NMs in undirected networks and is not limited to extracting only induced sub-graphs. Furthermore, NeMoFinder is an exact enumeration algorithm and is not based on a sampling method. As Chen et al. claim, NeMoFinder is applicable for detecting relatively large NMs, for instance finding NMs up to size 12 from the whole S. cerevisiae (yeast) PPI network.

NeMoFinder consists of three main steps: first, finding frequent size-n trees; then utilizing the repeated size-n trees to divide the entire network into a collection of size-n graphs; and finally, performing sub-graph join operations to find frequent size-n sub-graphs. In the first step, the algorithm detects all non-isomorphic size-n trees and the mappings from a tree to the network. In the second step, the ranges of these mappings are employed to partition the network into size-n graphs. Up to this step, there is no distinction between NeMoFinder and an exact enumeration method. However, a large portion of non-isomorphic size-n graphs still remains. NeMoFinder exploits a heuristic to enumerate non-tree size-n graphs using the information obtained from the preceding steps. The main advantage of the algorithm lies in the third step, which generates candidate sub-graphs from previously enumerated sub-graphs. This generation of new size-n sub-graphs is done by joining each previous sub-graph with derivative sub-graphs of itself, called cousin sub-graphs. These new sub-graphs contain one additional edge in comparison to the previous sub-graphs. However, there are some problems in generating new sub-graphs: there is no clear method to derive cousins from a graph, joining a sub-graph with its cousins leads to redundancy by generating a particular sub-graph more than once, and cousin determination is done via a canonical representation of the adjacency matrix which is not closed under the join operation. NeMoFinder is an efficient network motif finding algorithm for motifs up to size 12, but only for protein-protein interaction networks, which are represented as undirected graphs; it is not able to work on directed networks, which are important in the field of complex and biological networks. The pseudocode of NeMoFinder is shown below:

Input:
G - PPI network;
N - Number of randomized networks;
K - Maximal network motif size;
F - Frequency threshold;
S - Uniqueness threshold;
Output:
U - Repeated and unique network motif set;

D ← ∅;
for motif-size k from 3 to K do
  T ← repeated size-k trees in G with frequency ≥ F;
  partition G into a collection of size-k graphs using the matches of T;
  D ← D ∪ T;
  D′ ← T;
  i ← k;
  while D′ ≠ ∅ and i ≤ k × (k − 1) / 2 do
    D′ ← repeated size-k sub-graphs with i edges, obtained by expanding the sub-graphs in D′ by one edge and keeping those with frequency ≥ F;
    D ← D ∪ D′;
    i ← i + 1;
  end while
end for
for counter i from 1 to N do
  Grand ← a randomized network generated from G;
  for each g ∈ D do
    record the frequency of g in Grand;
  end for
end for
U ← ∅;
for each g ∈ D do
  s ← uniqueness of g, computed from its frequency in G and in the N randomized networks;
  if s ≥ S then
    U ← U ∪ {g};
  end if
end for
return U;
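As a generic illustration of the edge-by-edge expansion idea (not NeMoFinder's actual cousin-join, whose details are given only in the original paper), the sketch below takes one sub-graph occurrence and produces candidate occurrences containing exactly one additional edge supported by the network.

from itertools import combinations
import networkx as nx

def expand_by_one_edge(g: nx.Graph, nodes, edges):
    """Yield edge sets extending `edges` by one network edge among `nodes`."""
    edges = {frozenset(e) for e in edges}
    for u, v in combinations(nodes, 2):
        e = frozenset((u, v))
        if g.has_edge(u, v) and e not in edges:
            yield edges | {e}

if __name__ == "__main__":
    g = nx.complete_graph(4)               # toy network
    tree_nodes = [0, 1, 2, 3]
    tree_edges = [(0, 1), (1, 2), (2, 3)]  # a spanning-tree occurrence
    for cand in expand_by_one_edge(g, tree_nodes, tree_edges):
        print(sorted(tuple(sorted(e)) for e in cand))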

Grochow and Kellis proposed an exact algorithm for enumerating sub-graph appearances. The algorithm is based on a motif-centric approach, which means that the frequency of a given sub-graph, called the query graph, is exhaustively determined by searching for all possible mappings from the query graph into the larger network. It is claimed that a motif-centric method, in comparison to network-centric methods, has some beneficial features. First of all, it avoids the increased complexity of sub-graph enumeration. Also, by using mapping instead of enumerating, it enables an improvement in the isomorphism test. To improve the performance of the algorithm, since it is otherwise an inefficient exact enumeration algorithm, the authors introduced a fast method called symmetry-breaking conditions. During straightforward sub-graph isomorphism tests, a sub-graph may be mapped to the same sub-graph of the query graph multiple times. In the Grochow–Kellis (GK) algorithm, symmetry-breaking is used to avoid such multiple mappings. Here we introduce the GK algorithm and the symmetry-breaking condition which eliminates redundant isomorphism tests.
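A minimal motif-centric counting sketch in Python is shown below: it enumerates all mappings of a query graph into a networkx graph with the built-in VF2 matcher and collapses mappings that cover the same node set. The GK symmetry-breaking conditions, which avoid generating such redundant mappings in the first place, are not reproduced here; the deduplication step merely illustrates the problem they solve.

import networkx as nx
from networkx.algorithms import isomorphism

def count_query_occurrences(g: nx.Graph, query: nx.Graph) -> int:
    """Count induced occurrences of `query` in `g` (motif-centric, no symmetry-breaking)."""
    matcher = isomorphism.GraphMatcher(g, query)
    occurrences = set()
    # Each occurrence is reported once per automorphism of the query graph;
    # collapsing mappings by the node set of g they cover removes that redundancy.
    for mapping in matcher.subgraph_isomorphisms_iter():
        occurrences.add(frozenset(mapping.keys()))  # keys are nodes of g
    return len(occurrences)

if __name__ == "__main__":
    g = nx.karate_club_graph()
    square = nx.cycle_graph(4)                      # query graph: an induced 4-cycle
    print("induced 4-cycles:", count_query_occurrences(g, square))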