Hone your modeling skills with this challenging Protein Folding problem. We’ll show you how to create a binary optimization model of the problem with the Gurobi Python API and then solve it using the Gurobi Optimizer.

This model is example 28 from the fifth edition of Model Building in Mathematical Programming by H. Paul Williams on pages 289-290 and 344-345.

This modeling example is at the advanced level, where we assume that you know Python and the Gurobi Python API and that you have advanced knowledge of building mathematical optimization models. Typically, the objective function and/or constraints of these examples are complex or require advanced features of the Gurobi Python API.

**Download the Repository**

You can download the repository containing this and other examples by clicking here.

**Gurobi License**

In order to run this Jupyter Notebook properly, you must have a Gurobi license. If you do not have one, you can request an evaluation license as a *commercial user*, or download a free license as an *academic user*.

The problem described in this Jupyter Notebook is based on a molecular biology problem that is discussed in a paper entitled “Quadratic Binary Programming Models in Computational Biology” by Forrester and Greenberg (2008). The problem pertains to a protein, which consists of a chain of amino acids. In this problem, the amino acids come in two forms: hydrophilic (waterloving) and hydrophobic (water hating). An example of such a chain is given in the following figure with the hydrophobic acids marked in bold.

Such a chain naturally folds so as to bring as many hydrophobic acids as possible close together. A folding for the chain, in two dimensions, is given in the following figure, with the new matches marked by dashed lines. Our objective is to predict the optimum folding.

To solve the problem posed here, we must find the optimum folding for a chain of 50 amino acids with hydrophobic acids at positions 2, 4, 5, 6, 11, 12, 17, 20, 21, 25, 27, 28, 30, 31, 33, 37, 44 and 46.

$k \in A =\{1,2,...,50\}$: Chain of aminoacids.

$i,j \in H =\{2,4,5,6,11,12,17,20,21,25,27,28,30,31,33,37,44,46\} \subseteq Aminoacids $: Subset of aminoacids that are hydrophobic.

$\text{match}_{i,j} \equiv x_{i,j} = 1$, iff hydrophobic acid $i$ is matched with acid $j$, for all hydrophobic acids $ i < j \in H$. This matching does not include those matchings that are predefined by virtue of the acids being contiguous in the chain, i.e. $j > i+1$).

$\text{fold}_{k} \equiv y_{k} = 1$, iff a fold occurs between the $i$th and $(i +1)$st acids in the chain.

For each pair of hydrophobic acids $i$ and $j$, we can match them if:

- they are not contiguous, that is, already matched,
- they have an even number of acids between them in the chain,
- and there is exactly one fold between $i$ and $j$ .

This gives rise to the following constraints.

- $y_{k} + x_{i,j} \leq 1, \; \forall k \in A, (i,j) \in H, \; \text{such that} \; i \leq k < j, \; \text{and} \; k \neq (i+j-1)/2$.
- $x_{i,j} \leq y_{k}, \; \text{where} \; k = (i+j-1)/2 $.

Let $\text{H_fold} = \{(i,j) \in H: x_{i,j} \leq y_{k}, \; k = (i+j-1)/2 \}$ be the set of hydrophobic acids for which there is a folding that enables the matching.

The objective is to maximize the number of matchings of hydrophobic acids.

$$ \sum_{i,j \in \text{H_fold}} x_{i,j} $$In [1]:

```
import gurobipy as gp
from gurobipy import GRB
# tested with Python 3.7.0 & Gurobi 9.0
```

In [2]:

```
# list of aminoacids and hydrophobic acids
acids = [*range(1,51)]
h_phobic = [2,4,5,6,11,12,17,20,21,25,27,28,30,31,33,37,44,46]
```

In [3]:

```
# Creating the data structures to generate the model
list_ij = []
# Indices of hydrophobic acids that can be matched
for i in h_phobic:
for j in h_phobic:
if j > i + 1:
tp = i,j
list_ij.append(tp)
ij = gp.tuplelist(list_ij)
###
list_ik1j = []
list_ik2j = []
for i,j in ij:
for k in range(i,j):
if (k == (i+j-1)/2 ):
tp = i,j,k
list_ik2j.append(tp)
else:
tp = i,j,k
list_ik1j.append(tp)
# Indices for constraints of type 2
ik2j = gp.tuplelist(list_ik2j)
# Indices for constraints of type 1
ik1j = gp.tuplelist(list_ik1j)
# Matchings that are enabled by a folding
list_ijfold = []
for i,j,k in ik2j:
tp = i,j
list_ijfold.append(tp)
ijfold = gp.tuplelist(list_ijfold)
```

We create a model and the decision variables. There are two types of decision variables: the variables that determine which hydrophobic acids to match, and the variables that determine at which amino acid the folding of the protein happens.

In [4]:

```
model = gp.Model('ProteinFolding')
# Matching variables
match = model.addVars(ij, vtype=GRB.BINARY, name="match")
# Folding variables
fold = model.addVars(acids, vtype=GRB.BINARY, name="fold")
```

- $y_{k} + x_{i,j} \leq 1, \; \forall k \in A, (i,j) \in H, \; \text{such that} \; i \leq k < j, \; \text{and} \; k \neq (i+j-1)/2$.
- $x_{i,j} \leq y_{k}, \; \text{where} \; k = (i+j-1)/2 $.

In [5]:

```
# Constraint 1:
C1 = model.addConstrs( (fold[k] + match[i,j] <= 1 for i,j,k in ik1j ) , name='C1')
# Constraint 2:
C2 = model.addConstrs( ( match[i,j] <= fold[k] for i,j,k in ik2j ) , name='C2')
```

Maximize the matchings of hydrophobic acids.

In [6]:

```
# Objective function
model.setObjective(gp.quicksum(match[i,j] for i,j in ijfold) , GRB.MAXIMIZE )
```

In [7]:

```
# Verify model formulation
model.write('ProteinFolding.lp')
# Run optimization engine
model.optimize()
```

In [8]:

```
# Output report
print(f"Optimal number of hydrophobic acids matchings: {model.objVal}")
print("_______________________________________")
print(f"Optimal matching of hydrophobic acids.")
print("_______________________________________")
for i,j,k in ik2j:
if (match[i,j].x > 0.5):
print(f"Hydrophobic acid matching {i,j} with folding at amonacid {k}.")
```

H. Paul Williams, Model Building in Mathematical Programming, fifth edition.

Forrester, R.J. and Greenberg, H.J. (2008) Quadratic Binary Programming Models in Computational Biology. Algorithmic Operations Research, 3, 110–129.

Copyright © 2020 Gurobi Optimization, LLC

In [ ]:

```
```