Typical Pipeline
================

The tools included in the ``gpvolve`` API can be combined into a computational pipeline which simplifies the repetition
of analyzes. We recommend using Jupyter notebooks for this purpose.

.. image:: ../img/comp_pipeline.png

In the following example we will to go through the 5 steps shown in the figure above. We note that the pipeline is
modular, meaning that all steps except step 1 and step 2 can be executed in any desired order.

Example
-------

Import functions.

.. code-block:: python

    # Import gpvolve functions explicitly.
    from gpvolve import GenotypePhenotypeMSM
    from gpvolve.fitness import one_to_one
    from gpvolve.fixation import mccandlish
    from gpvolve.flux import TransitionPathTheory
    from gpvolve.visualization import plot_network, plot_matrix
    from gpvolve.paths import flux_decomp
    from gpvolve.utils import *
    from gpvolve.cluster.pcca import PCCA
    from gpvolve.cluster.base import GenotypePhenotypeClusters
    from gpvolve.cluster.utils import *
    from gpvolve.analysis.pathways import *

    # Import gpmap for genotype-phenotype data handling
    from gpmap.simulate import MultiPeakMountFujiSimulation, MountFujiSimulation

    # Helper and plotting functions.
    import matplotlib.pyplot as plt
    from scipy.sparse import dok_matrix
    import pickle

**Load data.**

Simulate 7-site map with 4 fitness peaks. Can also import exisiting genotype-phenotype data here.
See the gpmap Python package for more information (https://github.com/harmslab/gpmap)

.. code-block:: python

    gpm = MultiPeakMountFujiSimulation.from_length(7, peak_n=4, min_dist=2, roughness_width=0.2, field_strength=1)

**Build Markov model.**

First we need to **map a fitness** value onto each phenotype. In this example we simple use phenotype values as fitnesses.
The user can also pass their own fitness function here.

.. code-block:: python

    evomsm.apply_selection(one_to_one)

Building the transition matrix requires the user to define a **fixation probability** function and all its required
parameteres. Here, we use the fitness function by McCandlish (McCandlish, 2011) and a population size of 10.

.. code-block:: python

    evomsm.build_transition_matrix(mccandlish, population_size=10)

**Transition Path Theory**

Having built an evolutionary Markov model of the genotype-phenotype map, we can use Transition Path Theory to compute
reactive flux that moves from the wild type to the 7-site mutant.

.. code-block:: python

    # Compute reactive flux between wild type and 7-site mutant.
    tpt = TransitionPathTheory(evomsm, [0], [127])

Now we can use the reactive flux to compute the pathways that go from genotype 0 (0000000) to 127 (1111111) and their relative probabilities.

.. code-block:: python

    paths = tpt.ReactiveFlux.pathways()

**Visualize Transition Path Theory output.**

.. code-block:: python

    # Sort path probabilities
    path_probabilitities = sorted(paths[1], reverse=True)

    # Normalize all paths with probability > 0.001.
    prob_sum = sum(path_probabilitities)
    norm_path_probabilities = [prob/prob_sum for prob in path_probabilitities if prob/prob_sum > 0.001]
    x = range(len(norm_path_probabilities))

    # Draw figure
    figsize=(20,8)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

    # Plot genotype-phenotype map as network.
    plot_network(evomsm, ax=ax1, figsize=figsize, node_size=200, flux=dok_matrix(tpt.net_flux/tpt.total_flux))

    ax1.set_title("Normalized probability\nfluxes between top and bottom peak", size=20)
    ax1.text(1,-0.05, "source peak", size=15)
    ax1.text(1,-7.1, "target peak", size=15)

    # Plot the probabilities of the most likely paths.
    ax2.bar(x, norm_path_probabilities, color='0.3')

    ax2.set_xlabel("Paths", size=20)
    ax2.set_ylabel("Probability", size=20)
    ax2.set_title("Probability of Paths", size=20)

.. image:: ../img/tpt_output.png

**Coarse-grain genotype-phenotype map using PCCA+.**

Next, we use Robust Perron-Cluster Cluster Analysis (PCCA+) to coarse-grain the genotype-phenotype map.

.. code-block:: python

    # Cluster genotype-phenotype map into 4 clusters.
    pcca = PCCA(evomsm, 4)

Since PCCA+ might not return the 4 clusters in the order that we prefer, we have to reorder the clusters.

.. code-block:: python

    # Find peaks and order them.
    peaks = sorted([list(peak)[0] for peak in evomsm.peaks()])

    # Reorder the pcca clusters according to the peak they contain.
    ord_clusters = sort_clusters_by_nodes(pcca.clusters, peaks)

    # We simply instantiate a new object for the newly sorted clusters
    clustering = GenotypePhenotypeClusters(evomsm, ord_clusters)

    # Memberships won't be in order but can be reordered using the command below
    reordered_row = pcca.memberships[clustering.order, :]
    reordered_mem = reordered_row[:, [0,1,3,2]]
    clustering.memberships = reordered_mem

    # Check if sorting was correct by finding the peak in each cluster.
    string = ""
    for i, cluster in enumerate(clustering.clusters):
        for peak in peaks:
            if peak in cluster:
                string = string + "Cluster %s: %s (%s)\n" % (i, clustering.gpmsm.gpm.data.binary[peak], peak)
    print(string)

.. image:: ../img/pcca_ordering.png

**Visualize the metastability of the coarse-grained genotype-phenotype map.**

If we plot the coarse transition matrix, we see that most of the transition probability is located at the
diagonal, i.e. transitions that happen within a cluster are more likely than transitions between clusters.
The colorbar on the right is in log10(transition probability).

.. code-block:: python

    # Plot coarse-grained transition matrix.
    fig, ax = plt.subplots(figsize=(7,5.5))
    header = 17
    ax_labels = 15

    plot_matrix(clustering.transition_matrix, ax=ax, colorbar=True)
    ax.set_title("Coarse-grained transition matrix", size=header)
    ax.set_ylabel("Clusters", size=ax_labels)
    ax.set_xlabel("Clusters", size=ax_labels)
    ax.set_aspect("equal")

    plt.tight_layout()

.. image:: ../img/pcca_tm.png

**Analyze evolutionary mechanisms along fitness peaks.**

Finally, we can analyze evolutionary trajectories that pass genotypes of interest. In this example we will look at the
trajectories that pass through peak 2 on their way to peak 4 (1-2-4). We split the trajectories
into the part between peak 1 and 2 and the part between peak 2 and 4. We find that the sub-trajectories between 2 and 4 are
much less constrained than the ones between 1 and 2, which we quantify by calculating the mean path divergence.

.. code-block:: python

    # First get normalized path probabilities as dict
    path_probs = paths_and_probs_to_dict(paths[0], paths[1], normalize=True)
    all_paths = list(path_probs.keys())

    # Path 1-2-4
    paths124 = paths_that_do_not_contain(paths_that_contain(all_paths, [0,14,127], bool_and=True), [101])
    paths124_probs = {path: path_probs[path] for path in paths124}
    mpd124 = mean_path_divergence(evomsm, paths124_probs)
    entr124 = entropy(list(paths124_probs.values()))

    ## Segment 1-2
    paths124_12 = get_sub_paths(paths124_probs, 0, 14)
    mdp124_12 = mean_path_divergence(evomsm, paths124_12)
    ## Segment 2-4
    paths124_24 = get_sub_paths(paths124_probs, 14, 127)
    mdp124_24 = mean_path_divergence(evomsm, paths124_24)

    print("Path\tPath count\tMean Path Divergence\n%s\t%s\t\t%s\n%s\t%s\t\t%s\n%s\t%s\t\t%s" % ("1-2", len(paths124_12), round(mdp124_12,2), "2-4", len(paths124_24), round(mdp124_24,2), "1-2-4", len(paths124), round(mpd124, 2)))

.. image:: ../img/path_analysis.png