Predicting "protein-protein interactions" with neural networks

Part 1: parsing data

Well, I recently review a manuscript on predicting the so-called "protein-protein interactions" with a convolutional neural network (CNN). The quality of the paper was exceptionally low, the authors are from different institutions that they are really unlikely to collaborate with each other. It is highly likely that the paper was from a paper mill. Nonetheless, as I have never done formal research on biological system, the draft still helped to me know something in the field. I was really curious on the validity of their results and if I could achieve better predictions, so here it is.

What are they doing here

They are basically doing a sequence based binary classification. Some protein pairs are divided into two categories, interacting paris and non-interacting pairs. As far as I understand, "interacting" means the two proteins can bind with each other, while "non-interacting" ones will repulse each other (if they are close enough).

This problem sounds really trivial to me. You just have to process two strings (one-letter amino acid sequence of a protein is just a string) and classify them in two two group. I felt I should be able to get it done pretty quickly, so I started looking into it.

I have attached the original notebook to this site and can be downloaded here, so you can play with it yourself.

The dataset

They were dealing with several datasets from different sources, but focused on one dataset published by a group from SJTU.

Datasets

The two files used are supporting information A and B, for interacting and non-interacting pairs, respectively. As the names suggest, these files are SI for a certain paper. However, I have no idea who decided a MS Word document could be a good choice in this case.

The two files consist of a header

Online support information A
There are 36,630 protein-protein pairs from total 9476 proteins, and the first column is protein ID from HPRD, the second column is the other protein ID and the two proteins constitute the positive Protein-protein interaction.

Index        Protein_1_ID         protein_2_ID

and blocks like the following

1 NP_663777.1  NP_001233.1
>NP_663777.1
MESSKKMDSPGALQTNPPLKLHTDRSAGTPVFVPEQGGYKEKFVKTVEDKYKCEKCHLVLCSPKQTECGHRFCESCMAALLSSSSPKCTACQESIVKDKVFKDNCCKREILALQIYCRNESRGCAEQLMLGHLLVHLKNDCHFEELPCVRPDCKEKVLRKDLRDHVEKACKYREATCSHCKSQVPMIALQKHEDTDCPCVVVSCPHKCSVQTLLRSELSAHLSECVNAPSTCSFKRYGCVFQGTNQQIKAHEASSAVQHVNLLKEWSNSLEKKVSLLQNESVEKNKSIQSLHNQICSFEIEIERQKEMLRNNESKILHLQRVIDSQAEKLKELDKEIRPFRQNWEEADSMKSSVESLQNRVTELESVDKSAGQVARNTGLLESQLSRHDQMLSVHDIRLADMDLRFQVLETASYNGVLIWKIRDYKRRKQEAVMGKTLSLYSQPFYTGYFGYKMCARVYLNGDGMGKGTHLSLFFVIMRGEYDALLPWPFKQKVTLMLMDQGSSRRHLGDAFKPDPNSSSFKKPTGEMNIASGCPVFVAQTVLENGTYIKDDTIFIKVIVDTSDLPDP
>NP_001233.1
MARPHPWWLCVLGTLVGLSATPAPKSCPERHYWAQGKLCCQMCEPGTFLVKDCDQHRKAAQCDPCIPGVSFSPDHHTRPHCESCRHCNSGLLVRNCTITANAECACRNGWQCRDKECTECDPLPNPSLTARSSQALSPHPQPTHLPYVSEMLEARTAGHMQTLADFRQLPARTLSTHWPPQRSLCSSDFIRILVIFSGMFLVFTLAGALFLHQRRKYRSNKGESPVEPAEPCRYSCPREEEGSTIPIQEDYRKPEPACSP

In side each block, the first line is the index for the pair, and their RefSeq IDs. Then the next line will start with a > symbol and followed by the RefSeq ID of the first protein. The next line is the amino acid sequence of the protein in the one letter code. Similarly, the next 2 lines are for protein 2.

Process the Word documents

Now I need find a programming way to deal with these two files. There are some Python packages for parsing MS Word documents, but I did not have good luck with it, possibly because these files are in the Word 2003 format.

After some googling, I decided to give antiword a shot. You can easily install it via the package manager from your OS.

# download the MS Word file
wget http://www.csbio.sjtu.edu.cn/bioinf/LR_PPI/Supp-A.doc
# well, I use debian and ubuntu anyway...
sudo apt update && apt install antiword
# convert!
antiword Supp-A.doc > Supp-A.txt

A very smooth process. It worked like a charm. The resulting Supp-A.txt is just a plaintext version of the Word document. The only difference is that the soft wrap in the Word document is now a hard wrap, meaning that the super long sequences will be break into multiple lines. This makes parsing the file a little bit more tedious. Before, you know that the sequence is always stored in one line.

Online support information A
There are 36,630 protein-protein pairs from total 9476 proteins, and the
first column is protein ID from HPRD, the second column is the other
protein ID and the two proteins constitute the positive Protein-protein
interaction.

Index        Protein_1_ID         protein_2_ID
1 NP_663777.1  NP_001233.1
>NP_663777.1
MESSKKMDSPGALQTNPPLKLHTDRSAGTPVFVPEQGGYKEKFVKTVEDKYKCEKCHLVLCSPKQTECGHRFCESC
MAALLSSSSPKCTACQESIVKDKVFKDNCCKREILALQIYCRNESRGCAEQLMLGHLLVHLKNDCHFEELPCVRPD
CKEKVLRKDLRDHVEKACKYREATCSHCKSQVPMIALQKHEDTDCPCVVVSCPHKCSVQTLLRSELSAHLSECVNA
PSTCSFKRYGCVFQGTNQQIKAHEASSAVQHVNLLKEWSNSLEKKVSLLQNESVEKNKSIQSLHNQICSFEIEIER
QKEMLRNNESKILHLQRVIDSQAEKLKELDKEIRPFRQNWEEADSMKSSVESLQNRVTELESVDKSAGQVARNTGL
LESQLSRHDQMLSVHDIRLADMDLRFQVLETASYNGVLIWKIRDYKRRKQEAVMGKTLSLYSQPFYTGYFGYKMCA
RVYLNGDGMGKGTHLSLFFVIMRGEYDALLPWPFKQKVTLMLMDQGSSRRHLGDAFKPDPNSSSFKKPTGEMNIAS
GCPVFVAQTVLENGTYIKDDTIFIKVIVDTSDLPDP
>NP_001233.1
MARPHPWWLCVLGTLVGLSATPAPKSCPERHYWAQGKLCCQMCEPGTFLVKDCDQHRKAAQCDPCIPGVSFSPDHH
TRPHCESCRHCNSGLLVRNCTITANAECACRNGWQCRDKECTECDPLPNPSLTARSSQALSPHPQPTHLPYVSEML
EARTAGHMQTLADFRQLPARTLSTHWPPQRSLCSSDFIRILVIFSGMFLVFTLAGALFLHQRRKYRSNKGESPVEP
AEPCRYSCPREEEGSTIPIQEDYRKPEPACSP

Anyway, once we have a plaintext file, reading it is always easy. After some tweaking, I came up with this way of reading the file.

Let us load necessary packages first.

In [1]:

from copy import deepcopy

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

In [2]:

def read_sequence():
    digits = {}
    for i in range(9):
        digits[str(i + 1)] = ""
    files = ["Supp-A.txt", "Supp-B.txt"]
    types = [1, 0]
    skip_lines = [9, 8]
    pairs = []
    proteins = {}
    interactions = []
    for _, file in enumerate(files):
        count = 0
        eof=False
        with open(file, "r") as doc:
            print(file)
            for i in range(skip_lines[_]):
                line = doc.readline()
            while True:
                ids = line.split()[1:]
                pairs.append(ids)
                # ">ID1" line
                line = doc.readline()
                for i, id in enumerate(ids):
                    # read sequence for the first protein
                    sequence = ""
                    while True:
                        line = doc.readline()
                        # break when 1. ">ID2" or next pair or EOF is read
                        if not line or line[0] == ">" or line[0] in digits:
                            if id not in proteins:
                                proteins[id] = sequence
                            if not line:
                                eof=True
                            elif line[0] == ">":
                                count += 1
                            break
                        sequence += line.rstrip("\n")
                    # print(count, id, sequence)
                    if eof:
                        break
                if eof:
                    interactions += [types[_]] * count
                    break
    return proteins, pairs, interactions

In [3]:

proteins, pairs, interactions = read_sequence()

Supp-A.txt
Supp-B.txt

It is a little tricky to make a script to read the script fully automatically. I guess I will add some comments over here later. The resulting sequences need to be further processed before we can use them.

One common thing people in this field do is to exclude all sequences below 50 amino acids and remove all sequences with non-canonical amino acids, for example, in this paper.

In [4]:

# use the dictionary `to_remove` to store any protein ID that follows in the two criteria above
to_remove = {}
for k, v in proteins.items():
    if len(v) < 50:
        to_remove[k] = ""
        continue
    # I have checked the sequences, the only "non-canonical" is "U" which is selenocysteine
    for l in v:
        if l == "U":
            to_remove[k] = ""

Then if the keys in to_remove show up in any pair, we remove the pair and their interaction value from the lists.

In [5]:

for i in range(len(pairs) - 1, -1, -1):
    id1, id2 = pairs[i]
    if id1 in to_remove or id2 in to_remove:
        del pairs[i], interactions[i]
for k in to_remove.keys():
    proteins.pop(k)

Then I would like to save the processed data as csv files, though I never use them really...

In [6]:

ppi = {}
for i, (pair, value) in enumerate(zip(pairs, interactions)):
    ppi[i] = {"protein 1": pair[0], "protein 2": pair[1], "interaction": value}
protein_df = pd.DataFrame(proteins, index=[0]).transpose()
protein_df.columns = ["sequence"]
interaction_df = pd.DataFrame(ppi).transpose()
protein_df.to_csv("proteins.csv")
interaction_df.to_csv("ppi.csv")

Next, I would like to find a percentile to truncate long sequences. The reason to do it is that vast majority of the sequences are very short (a few hundred), but there are some very long outliers (for example, > 30K). If you allocate a tensor based on the longest sequence, the drawback is obvious. The tensor will be very sparse. As a result, it is wise to cut these long sequences.

Loop through all sequences and get the length of the sequence at 95% percentile.

In [7]:

seq_len = []
for v in proteins.values():
    seq_len.append(len(v))
length = int(np.percentile(seq_len, np.arange(0, 100, 5))[-1])

Next, convert the strings of all sequences to a tensor of integers based on ASC II. Note that here we have the tensor of sequences, not pairs yet.

In [8]:

sequences = torch.zeros((len(proteins), length), dtype=int)
for i, v in enumerate(proteins.values()):
    for j in range(len(v) if length > len(v) else length):
        sequences[i, j] = ord(v[j]) - 64

Then we can build the tensors for the pairs. protein1 and protein2 are the sequences of the first and second proteins in each pair. targets is the tensor of labels for whether the pair is interacting or not. I am doing a manual one-hot encoder here. Of course you can ask torch to do it. Not sure why I wanted to do it manually.

Here I build another helper dictionary called protein_idx just to quickly locate the index of each protein in sequences. ($O(1)$ for a hashtable and $O(n)$ for a list)

In [9]:

protein_idx = {}
count = 0
for k in proteins.keys():
    protein_idx[k] = count
    count += 1

In [10]:

targets = torch.zeros((len(interactions), 2), dtype=torch.float)
protein1 = torch.empty((len(interactions), length), dtype=torch.int64)
protein2 = torch.empty((len(interactions), length), dtype=torch.int64)
for i, (pair, value) in enumerate(zip(pairs, interactions)):
    targets[i, value] = 1
    protein1[i] = sequences[protein_idx[pair[0]]]
    protein2[i] = sequences[protein_idx[pair[1]]]

We have got all the tensors we need, so we can easily assemble a dataset with torch's TensorDataset class.

In [11]:

dataset = TensorDataset(protein1, protein2, targets)

After this, we will start to build our network. Please check out the section part of this post over here.