Typical Pipeline

The tools included in the gpvolve API can be combined into a computational pipeline which simplifies the repetition of analyzes. We recommend using Jupyter notebooks for this purpose.


In the following example we will to go through the 5 steps shown in the figure above. We note that the pipeline is modular, meaning that all steps except step 1 and step 2 can be executed in any desired order.


Import functions.

# Import gpvolve functions explicitly.
from gpvolve import GenotypePhenotypeMSM
from gpvolve.fitness import one_to_one
from gpvolve.fixation import mccandlish
from gpvolve.flux import TransitionPathTheory
from gpvolve.visualization import plot_network, plot_matrix
from gpvolve.paths import flux_decomp
from gpvolve.utils import *
from gpvolve.cluster.pcca import PCCA
from gpvolve.cluster.base import GenotypePhenotypeClusters
from gpvolve.cluster.utils import *
from gpvolve.analysis.pathways import *

# Import gpmap for genotype-phenotype data handling
from gpmap.simulate import MultiPeakMountFujiSimulation, MountFujiSimulation

# Helper and plotting functions.
import matplotlib.pyplot as plt
from scipy.sparse import dok_matrix
import pickle

Load data.

Simulate 7-site map with 4 fitness peaks. Can also import exisiting genotype-phenotype data here. See the gpmap Python package for more information (https://github.com/harmslab/gpmap)

gpm = MultiPeakMountFujiSimulation.from_length(7, peak_n=4, min_dist=2, roughness_width=0.2, field_strength=1)

Build Markov model.

First we need to map a fitness value onto each phenotype. In this example we simple use phenotype values as fitnesses. The user can also pass their own fitness function here.


Building the transition matrix requires the user to define a fixation probability function and all its required parameteres. Here, we use the fitness function by McCandlish (McCandlish, 2011) and a population size of 10.

evomsm.build_transition_matrix(mccandlish, population_size=10)

Transition Path Theory

Having built an evolutionary Markov model of the genotype-phenotype map, we can use Transition Path Theory to compute reactive flux that moves from the wild type to the 7-site mutant.

# Compute reactive flux between wild type and 7-site mutant.
tpt = TransitionPathTheory(evomsm, [0], [127])

Now we can use the reactive flux to compute the pathways that go from genotype 0 (0000000) to 127 (1111111) and their relative probabilities.

paths = tpt.ReactiveFlux.pathways()

Visualize Transition Path Theory output.

# Sort path probabilities
path_probabilitities = sorted(paths[1], reverse=True)

# Normalize all paths with probability > 0.001.
prob_sum = sum(path_probabilitities)
norm_path_probabilities = [prob/prob_sum for prob in path_probabilitities if prob/prob_sum > 0.001]
x = range(len(norm_path_probabilities))

# Draw figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)

# Plot genotype-phenotype map as network.
plot_network(evomsm, ax=ax1, figsize=figsize, node_size=200, flux=dok_matrix(tpt.net_flux/tpt.total_flux))

ax1.set_title("Normalized probability\nfluxes between top and bottom peak", size=20)
ax1.text(1,-0.05, "source peak", size=15)
ax1.text(1,-7.1, "target peak", size=15)

# Plot the probabilities of the most likely paths.
ax2.bar(x, norm_path_probabilities, color='0.3')

ax2.set_xlabel("Paths", size=20)
ax2.set_ylabel("Probability", size=20)
ax2.set_title("Probability of Paths", size=20)

Coarse-grain genotype-phenotype map using PCCA+.

Next, we use Robust Perron-Cluster Cluster Analysis (PCCA+) to coarse-grain the genotype-phenotype map.

# Cluster genotype-phenotype map into 4 clusters.
pcca = PCCA(evomsm, 4)

Since PCCA+ might not return the 4 clusters in the order that we prefer, we have to reorder the clusters.

# Find peaks and order them.
peaks = sorted([list(peak)[0] for peak in evomsm.peaks()])

# Reorder the pcca clusters according to the peak they contain.
ord_clusters = sort_clusters_by_nodes(pcca.clusters, peaks)

# We simply instantiate a new object for the newly sorted clusters
clustering = GenotypePhenotypeClusters(evomsm, ord_clusters)

# Memberships won't be in order but can be reordered using the command below
reordered_row = pcca.memberships[clustering.order, :]
reordered_mem = reordered_row[:, [0,1,3,2]]
clustering.memberships = reordered_mem

# Check if sorting was correct by finding the peak in each cluster.
string = ""
for i, cluster in enumerate(clustering.clusters):
    for peak in peaks:
        if peak in cluster:
            string = string + "Cluster %s: %s (%s)\n" % (i, clustering.gpmsm.gpm.data.binary[peak], peak)

Visualize the metastability of the coarse-grained genotype-phenotype map.

If we plot the coarse transition matrix, we see that most of the transition probability is located at the diagonal, i.e. transitions that happen within a cluster are more likely than transitions between clusters. The colorbar on the right is in log10(transition probability).

# Plot coarse-grained transition matrix.
fig, ax = plt.subplots(figsize=(7,5.5))
header = 17
ax_labels = 15

plot_matrix(clustering.transition_matrix, ax=ax, colorbar=True)
ax.set_title("Coarse-grained transition matrix", size=header)
ax.set_ylabel("Clusters", size=ax_labels)
ax.set_xlabel("Clusters", size=ax_labels)


Analyze evolutionary mechanisms along fitness peaks.

Finally, we can analyze evolutionary trajectories that pass genotypes of interest. In this example we will look at the trajectories that pass through peak 2 on their way to peak 4 (1-2-4). We split the trajectories into the part between peak 1 and 2 and the part between peak 2 and 4. We find that the sub-trajectories between 2 and 4 are much less constrained than the ones between 1 and 2, which we quantify by calculating the mean path divergence.

# First get normalized path probabilities as dict
path_probs = paths_and_probs_to_dict(paths[0], paths[1], normalize=True)
all_paths = list(path_probs.keys())

# Path 1-2-4
paths124 = paths_that_do_not_contain(paths_that_contain(all_paths, [0,14,127], bool_and=True), [101])
paths124_probs = {path: path_probs[path] for path in paths124}
mpd124 = mean_path_divergence(evomsm, paths124_probs)
entr124 = entropy(list(paths124_probs.values()))

## Segment 1-2
paths124_12 = get_sub_paths(paths124_probs, 0, 14)
mdp124_12 = mean_path_divergence(evomsm, paths124_12)
## Segment 2-4
paths124_24 = get_sub_paths(paths124_probs, 14, 127)
mdp124_24 = mean_path_divergence(evomsm, paths124_24)

print("Path\tPath count\tMean Path Divergence\n%s\t%s\t\t%s\n%s\t%s\t\t%s\n%s\t%s\t\t%s" % ("1-2", len(paths124_12), round(mdp124_12,2), "2-4", len(paths124_24), round(mdp124_24,2), "1-2-4", len(paths124), round(mpd124, 2)))