Getting Started with PyBioMed

This document is intended to provide an overview of how one can use the PyBioMed functionality from Python. If you find mistakes, or have suggestions for improvements, please either fix them yourselves in the source document (the .py file) or send them to the mailing list: oriental-cds@163.com and gadsby@163.com.

Installing the PyBioMed package

PyBioMed has been successfully tested on Linux and Windows systems. The user could download the PyBioMed package via: https://raw.githubusercontent.com/gadsbyfly/PyBioMed/master/PyBioMed/download/PyBioMed-1.0.zip. The installation process of PyBioMed is very easy:

Note

You first need to install RDKit and pybel successfully.

On Windows:

(1): download the PyBioMed-1.0.zip

(2): extract the PyBioMed-1.0.zip file

(3): open cmd.exe and change dictionary to PyBioMed-1.0 (write the command “cd PyBioMed-1.0” in cmd shell)

(4): write the command “python setup.py install” in cmd shell

On Linux:

(1): download the PyBioMed package (.zip)

(2): extract PyBioMed-1.0.zip

(3): open shell and change dictionary to PyBioMed-1.0 (write the command “cd PyBioMed-1.0” in shell)

(4): write the command “python setup.py install” in shell

Getting molecules

The PyGetMol provide different formats to get molecular structures, protein sequence and DNA sequence.

Getting molecular structure

In order to be convenient to users, the Getmol module provides the tool to get molecular structures by the molecular ID from website including NCBI, EBI, CAS, Kegg and Drugbank.

>>> from PyBioMed.PyGetMol import Getmol
>>> DrugBankID = 'DB01014'
>>> smi = Getmol.GetMolFromDrugbank(DrugBankID)
>>> print smi
N[C@@H](CO)C(=O)O
>>> smi=Getmol.GetMolFromCAS(casid="50-12-4")
>>> print smi
CCC1(c2ccccc2)C(=O)N(C)C(=N1)O
>>> smi=Getmol.GetMolFromNCBI(cid="2244")
>>> print smi
CC(=O)Oc1ccccc1C(=O)O
>>> smi=Getmol.GetMolFromKegg(kid="D02176")
>>> print smi
C[N+](C)(C)C[C@H](O)CC(=O)[O-]

Reading molecules

The Getmol module also provides the tool to read molecules in different formats including SDF, Mol, InChi and Smiles.

Users can read a molecule from string.

>>> from PyBioMed.PyGetMol.Getmol import ReadMolFromSmile
>>> mol = ReadMolFromSmile('N[C@@H](CO)C(=O)O')
>>> print mol
<rdkit.Chem.rdchem.Mol object at 0x0D8D3688>

Users can also read a molecule from a file.

>>> from PyBioMed.PyGetMol.Getmol import ReadMolFromSDF
>>> mol = ReadMolFromSDF('./PyBioMed/test/test_data/test.sdf')  #You should change the path to your own real path
>>> print mol
<rdkit.Chem.rdmolfiles.SDMolSupplier at 0xd8d03f0>

Getting protein sequence

The GetProtein module provides the tool to get protein sequence by the pdb ID and uniprot ID from website.

>>> from PyBioMed.PyGetMol import GetProtein
>>> GetProtein.GetPDB(['1atp'])
>>> seq = GetProtein.GetSeqFromPDB('1atp.pdb')
>>> print seq
GNAAAAKKGSEQESVKEFLAKAKEDFLKKWETPSQNTAQLDQFDRIKTLGTGSFGRVMLVKHKESGNHYAMKILDKQKVVKLKQ
IEHTLNEKRILQAVNFPFLVKLEFSFKDNSNLYMVMEYVAGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRDLK
PENLLIDQQGYIQVTDFGFAKRVKGRTWXLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVS
GKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKNGVNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIR
VXINEKCGKEFTEFTTYADFIASGRTGRRNAIHD
>>> seq = GetProtein.GetProteinSequence('O00560')
>>> print seq
MSLYPSLEDLKVDKVIQAQTAFSANPANPAILSEASAPIPHDGNLYPRLYPELSQYMGLSLNEEEIRANVAVVSGAPLQGQLVA
RPSSINYMVAPVTGNDVGIRRAEIKQGIREVILCKDQDGKIGLRLKSIDNGIFVQLVQANSPASLVGLRFGDQVLQINGENCAG
WSSDKAHKVLKQAFGEKITMTIRDRPFERTITMHKDSTGHVGFIFKNGKITSIVKDSSAARNGLLTEHNICEINGQNVIGLKDS
QIADILSTSGTVVTITIMPAFIFEHIIKRMAPSIMKSLMDHTIPEV

Reading protein sequence

>>> from PyBioMed.PyGetMol.GetProtein import ReadFasta
>>> f = open('./PyBioMed/test/test_data/protein.fasta')  #You should change the path to your own real path
>>> protein_seq = ReadFasta(f)
>>> print protein
['MLIHQYDHATAQYIASHLADPDPLNDGRWLIPAFATATPLPERPARTWPFFLDGAWVLRPDHRGQRLYRTDTGEAAEIVAAG
IAPEAAGLTPTPRPSDEHRWIDGAWQIDPQIVAQRARDAAMREFDLRMASARQANAGRADAYAAGLLSDAEIAVFKAWAIYQMD
LVRVVSAASFPDDVQWPAEPDEAAVIEQADGKASAGDAAAA',
'MLIHQYDHATAQYIASHLADPDPLNDGRWLIPAFATATPLPERPARTWPFFLDGAWVLRPDHRGQRLYRTDTGEAAEIVAAGI
APEAAGLTPTPRPSDEHRWIDGAWQIDPQIVAQRARDAAMREFDLRMASARQANAGRADAYAAGLLSDAEIAVFKAWAIYQMDL
VRVVSAASFPDDVQWPAEPDEAAVIEQADGKASAGDAAAA']

Getting DNA sequence

The GetDNA module provides the tool to get DNA sequence by the Gene ID from website.

>>> from PyBioMed.PyGetMol import GetDNA
>>> seq = GetDNA.GetDNAFromUniGene('AA954964')
>>> print seq
>ENA|AA954964|AA954964.1 op24b10.s1 Soares_NFL_T_GBC_S1 Homo sapiens cDNA clone IMAGE:1577755 3&apos;, mRNA sequence.
TTTTAAAATATAAAAGGATAACTTTATTGAATATACAAATTCAAGAGCATTCAATTTTTT
TTTAAGATTATGGCATAAGACAGATCAATGGTAATGGTTTATATATCCTATACTTACCAA
ACAGATTAGGTAGATATACTGACCTATCAATGCTCAAAATAACAAAATGAATACATGTCC
CTAAACTATTTCTGTATTCTATGACTACTAAATGGGAAATCTGTCAGCTGACCACCCACC
AGACTTTTTCCCATAGGAAGTTTGATATGCTGTCATTGATATATACCATTTCTGAATATA
AACCTCTATCTTGGGTCCTTTTCTCTTTGCCTACTTCATTATCTGTCTTCCCAACCCACC
TAAGACTTAGTCAAAACAGGATACAGAGATCTGGATGGCTCTACGCAGAG

Reading DNA sequence

>>> from PyBioMed.PyGetMol.GetDNA import ReadFasta
>>> f = open('./PyBioMed/test/test_data/example.fasta')  #You should change the path to your own real path
>>> dna_seq = ReadFasta(f)
>>> print dna
['GACTGAACTGCACTTTGGTTTCATATTATTTGCTC']

Pretreating structure

The PyPretreat can pretreat the molecular structure, the protein sequence and the DNA sequence.

Pretreating molecules

The PyPretreatMol can pretreat the molecular structure. The PyPretreatMol proivdes the following functions:

  • Normalization of functional groups to a consistent format.
  • Recombination of separated charges.
  • Breaking of bonds to metal atoms.
  • Competitive reionization to ensure strongest acids ionize first in partially ionize molecules.
  • Tautomer enumeration and canonicalization.
  • Neutralization of charges.
  • Standardization or removal of stereochemistry information.
  • Filtering of salt and solvent fragments.
  • Generation of fragment, isotope, charge, tautomer or stereochemistry insensitive parent structures.
  • Validations to identify molecules with unusual and potentially troublesome characteristics.

The user can diconnect metal ion.

>>> from PyBioMed.PyPretreat.PyPretreatMol import StandardizeMol
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
>>> sdm = StandardizeMol()
>>> mol = sdm.disconnect_metals(mol)
>>> print Chem.MolToSmiles(mol, isomericSmiles=True)
O=C([O-])c1ccc(C[S+2]([O-])[O-])cc1.[Na+]

Pretreat the molecular structure using all functions.

>>> from PyBioMed.PyPretreat import PyPretreatMol
>>> stdsmi = PyPretreatMol.StandardSmi('[Na]OC(=O)c1ccc(C[S+2]([O-])([O-]))cc1')
>>> print stdsmi
O=C([O-])c1ccc(C[S](=O)=O)cc1

Pretreating protein sequence

The user can check the protein sequence using the PyPretreatPro. If the sequence is right, the result is the number of amino acids. If the sequence is wrong, the result is 0.

>>> from PyBioMed.PyPretreat import PyPretreatPro
>>> protein="ADGCGVGEGTGQGPMCNCMCMKWVYADEDAADLESDSFADEDASLESDSFPWSNQRVFCSFADEDASU"
>>> print PyPretreatPro.ProteinCheck(protein)
0
>>> from PyBioMed.PyPretreat import PyPretreatPro
>>> protein="ADGCRN"
>>> print PyPretreatPro.ProteinCheck(protein)
6

Pretreating DNA sequence

The user can check the DNA sequence using the PyPretreatDNA. If the sequence is right, the result is True. If the sequence is wrong, the result is the wrong word.

>>> from PyBioMed.PyPretreat import PyPretreatDNA
>>> DNA="ATTTAC"
>>> print PyPretreatDNA.DNAChecks(DNA)
True
>>> DNA= "ATCGUA"
>>> print PyPretreatDNA.DNAChecks(DNA)
U

Calculating molecular descriptors

The PyBioMed package could calculate a large number of molecular descriptors. These descriptors capture and magnify distinct aspects of chemical structures. Generally speaking, all descriptors could be divided into two classes: descriptors and fingerprints. Descriptors only used the property of molecular topology, including constitutional descriptors, topological descriptors, connectivity indices, E-state indices, Basak information indices, Burden descriptors, autocorrelation descriptors, charge descriptors, molecular properties, kappa shape indices, MOE-type descriptors. Molecular fingerprints contain FP2, FP3, FP4,topological fingerprints, Estate, atompairs, torsions, morgan and MACCS.

_images/single_features.png

The descriptors could be calculated through PyBioMed package

Calculating descriptors

We could import the corresponding module to calculate the molecular descriptors as need. There is 14 modules to compute descriptors. Moreover, a easier way to compute these descriptors is construct a PyMolecule object, which encapsulates all methods for the calculation of descriptors.

Calculating molecular descriptors via functions

The GetConnectivity() function in the connectivity module can calculate the connectivity descriptors. The result is given in the form of dictionary.

>>> from PyBioMed.PyMolecule import connectivity
>>> from rdkit import Chem
>>> smi = 'CCC1(c2ccccc2)C(=O)N(C)C(=N1)O'
>>> mol = Chem.MolFromSmiles(smi)
>>> molecular_descriptor = connectivity.GetConnectivity(mol)
>>> print molecular_descriptor
{'Chi3ch': 0.0, 'knotp': 2.708, 'dchi3': 3.359, 'dchi2': 2.895, 'dchi1': 2.374, 'dchi0': 2.415, 'Chi5ch': 0.068, 'Chiv4': 1.998, 'Chiv7': 0.259, 'Chiv6': 0.591, 'Chiv1': 5.241, 'Chiv0': 9.344, 'Chiv3': 3.012, 'Chiv2': 3.856, 'Chi4c': 0.083, 'dchi4': 2.96, 'Chiv4pc': 1.472, 'Chiv3c': 0.588, 'Chiv8': 0.118, 'Chi3c': 1.264, 'Chi8': 0.636, 'Chi9': 0.322, 'Chi2': 6.751, 'Chi3': 6.372, 'Chi0': 11.759, 'Chi1': 7.615, 'Chi6': 2.118, 'Chi7': 1.122, 'Chi4': 4.959, 'Chi5': 3.649, 'Chiv5': 1.244, 'Chiv4c': 0.04, 'Chiv9': 0.046, 'Chi4pc': 3.971, 'knotpv': 0.884, 'Chiv5ch': 0.025, 'Chiv3ch': 0.0, 'Chiv10': 0.015, 'Chiv6ch': 0.032, 'Chi10': 0.135, 'Chi4ch': 0.0, 'Chiv4ch': 0.0, 'mChi1': 0.448, 'Chi6ch': 0.102}

The function GetTopology() in the topology module can calculate all topological descriptors.

>>> from PyBioMed.PyMolecule import topology
>>> from rdkit import Chem
>>> smi = 'CCC1(c2ccccc2)C(=O)N(C)C(=N1)O'
>>> mol = Chem.MolFromSmiles(smi)
>>> molecular_descriptor = topology.GetTopology(mol)
>>> print len(molecular_descriptor)
25

The function CATS2D() in the cats2d module can calculate all CATS2D descriptors.

>>> from PyBioMed.PyMolecule.cats2d import CATS2D
>>> smi = 'CC(N)C(=O)[O-]'
>>> mol = Chem.MolFromSmiles(smi)
>>> cats = CATS2D(mol,PathLength = 10,scale = 3)
>>> print cats
{'CATS_AP4': 0.0, 'CATS_AP3': 1.0, 'CATS_AP6': 0.0, 'CATS_AA8': 0.0, 'CATS_AA9': 0.0, 'CATS_AP1': 0.0, ......, 'CATS_AP5': 0.0}
>>> print len(cats)
150

Calculating molecular descriptors via PyMolecule object

The PyMolecule class can read molecules in different format including MOL, SMI, InChi and CAS. For example, the user can read a molecule in the format of SMI and calculate the E-state descriptors (316).

>>> from PyBioMed import Pymolecule
>>> smi = 'CCC1(c2ccccc2)C(=O)N(C)C(=N1)O'
>>> mol = Pymolecule.PyMolecule()
>>> mol.ReadMolFromSmile(smi)
>>> molecular_descriptor = mol.GetEstate()
>>> print len(molecular_descriptor)
237

The object can also read molecules in the format of MOL file and calculate charge descriptors (25).

>>> from PyBioMed import Pymolecule
>>> mol = Pymolecule.PyMolecule()
>>> mol.ReadMolFromMol('test/test_data/test.mol')   #change path to the real path in your own computer
>>> molecular_descriptor = mol.GetCharge()
>>> print molecular_descriptor
{'QNmin': 0, 'QOss': 0.534, 'Mpc': 0.122, 'QHss': 0.108, 'SPP': 0.817, 'LDI': 0.322, 'QCmin': -0.061, 'Mac': 0.151, 'Qass': 0.893, 'QNss': 0, 'QCmax': 0.339, 'QOmax': -0.246, 'Tpc': 1.584, 'Qmax': 0.339, 'QOmin': -0.478, 'Tnc': -1.584, 'QHmin': 0.035, 'QCss': 0.252, 'QHmax': 0.297, 'QNmax': 0, 'Rnc': 0.302, 'Rpc': 0.214, 'Qmin': -0.478, 'Tac': 3.167, 'Mnc': -0.198}

In order to be convenient to users, the object also provides the tool to get molecular structures by the molecular ID from website including NCBI, EBI, CAS, Kegg and Drugbank.

>>> from PyBioMed import Pymolecule
>>> DrugBankID = 'DB01014'
>>> mol = Pymolecule.PyMolecule()
>>> smi = mol.GetMolFromDrugbank(DrugBankID)
>>> mol.ReadMolFromSmile(smi)
>>> molecular_descriptor = mol.GetKappa()
>>> print molecular_descriptor
{'phi': 5.989303307692309, 'kappa1': 22.291, 'kappa3': 7.51, 'kappa2': 11.111, 'kappam1': 18.587, 'kappam3': 5.395, 'kappam2': 8.378}

The code below can calculate all molecular descriptors except fingerprints.

>>> from PyBioMed import Pymolecule
>>> smi = 'CCOC=N'
>>> mol = Pymolecule.PyMolecule()
>>> mol.ReadMolFromSmile(smi)
>>> alldes = mol.GetAllDescriptor()
>>> print len(alldes)
765

Calculating molecular fingerprints

In the fingerprint module, there are eighteen types of molecular fingerprints which are defined by abstracting and magnifying different aspects of molecular topology.

Calculating fingerprint via functions

The CalculateFP2Fingerprint() function calculates the FP2 fingerprint.

>>> from PyBioMed.PyMolecule.fingerprint import CalculateFP2Fingerprint
>>> import pybel
>>> smi = 'CCC1(c2ccccc2)C(=O)N(C)C(=N1)O'
>>> mol = pybel.readstring("smi", smi)
>>> mol_fingerprint = CalculateFP2Fingerprint(mol)
>>> print len(mol_fingerprint[1])
103

The CalculateEstateFingerprint() function calculates the Estate fingerprint.

>>> from PyBioMed.PyMolecule.fingerprint import CalculateEstateFingerprint
>>> smi = 'CCC1(c2ccccc2)C(=O)N(C)C(=N1)O'
>>> mol = Chem.MolFromSmiles(smi)
>>> mol_fingerprint = CalculateEstateFingerprint(mol)
>>> print len(mol_fingerprint[2])
79

The function GhoseCrippenFingerprint() in the ghosecrippen module can calculate all ghosecrippen descriptors.

>>> from PyBioMed.PyMolecule.ghosecrippen import GhoseCrippenFingerprint
>>> smi = 'CC(N)C(=O)O'
>>> mol = Chem.MolFromSmiles(smi)
>>> ghoseFP = GhoseCrippenFingerprint(mol)
>>> print ghoseFP
{'S3': 0, 'S2': 0, 'S1': 0, 'S4': 0, ......, 'N9': 0, 'Hal2': 0}
>>> print len(ghoseFP)
110

Calculating fingerprint via object

The PyMolecule class can calculate eleven kinds of fingerprints. For example, the user can read a molecule in the format of SMI and calculate the ECFP4 fingerprint (1024).

>>> from PyBioMed import Pymolecule
>>> smi = 'CCOC=N'
>>> mol = Pymolecule.PyMolecule()
>>> mol.ReadMolFromSmile(smi)
>>> res = mol.GetFingerprint(FPName='ECFP4')
>>> print res
(4294967295L, {3994088662L: 1, 2246728737L: 1, 3542456614L: 1, 2072128742: 1, 2222711142L: 1, 2669064385L: 1, 3540009223L: 1, 849275503: 1, 2245384272L: 1, 2246703798L: 1, 864674487: 1, 4212523324L: 1, 3340482365L: 1}, <rdkit.DataStructs.cDataStructs.UIntSparseIntVect object at 0x0CA010D8>)

Calculating protein descriptors

PyProtein is a tool used for protein feature calculation. PyProtein calculates structural and physicochemical features of proteins and peptides from amino acid sequence. These sequence-derived structural and physicochemical features have been widely used in the development of machine learning models for predicting protein structural and functional classes, post-translational modification, subcellular locations and peptides of specific properties. There are two ways to calculate protein descriptors in the PyProtein module. One is to directly use the corresponding methods, the other one is firstly to construct a PyProtein class and then run their methods to obtain the protein descriptors. It should be noted that the output is a dictionary form, whose keys and values represent the descriptor name and the descriptor value, respectively. The user could clearly understand the meaning of each descriptor.

Calculating protein descriptors via functions

The user can input the protein sequence and calculate the protein descriptors using function.

>>> from PyBioMed.PyProtein import AAComposition
>>> protein="ADGCGVGEGTGQGPMCNCMCMKWVYADEDAADLESDSFADEDASLESDSFPWSNQRVFCSFADEDAS"
>>> AAC=AAComposition.CalculateAAComposition(protein)
>>> print AAC
{'A': 11.94, 'C': 7.463, 'E': 8.955, 'D': 14.925, 'G': 8.955, 'F': 5.97, 'I': 0.0, 'H': 0.0, 'K': 1.493, 'M': 4.478, 'L': 2.985, 'N': 2.985, 'Q': 2.985, 'P': 2.985, 'S': 11.94, 'R': 1.493, 'T': 1.493, 'W': 2.985, 'V': 4.478, 'Y': 1.493}

PyBioMed also provides getpdb to get sequence from PDB website to calculate protein descriptors.

>>> from PyBioMed.PyGetMol import GetProtein
>>> GetProtein.GetPDB(['1atp','1efz','1f88'])
>>> seq = GetProtein.GetSeqFromPDB('1atp.pdb')
>>> print seq
GNAAAAKKGSEQESVKEFLAKAKEDFLKKWETPSQNTAQLDQFDRIKTLGTGSFGRVMLVKHKESGNHYAMKILDKQKVVKLKQ
IEHTLNEKRILQAVNFPFLVKLEFSFKDNSNLYMVMEYVAGGEMFSHLRRIGRFSEPHARFYAAQIVLTFEYLHSLDLIYRDLK
PENLLIDQQGYIQVTDFGFAKRVKGRTWXLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVS
GKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKNGVNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIR
VXINEKCGKEFTEFTTYADFIASGRTGRRNAIHD
>>> from PyBioMed.PyProtein import CTD
>>> protein_descriptor = CTD.CalculateC(protein)
>>> print protein_descriptor
{'_NormalizedVDWVC2': 0.224, '_PolarizabilityC2': 0.328, '_PolarizabilityC3': 0.179, '_ChargeC1': 0.03, '_PolarizabilityC1': 0.493, '_SecondaryStrC2': 0.239, '_SecondaryStrC3': 0.418, '_NormalizedVDWVC3': 0.179, '_SecondaryStrC1': 0.343, '_SolventAccessibilityC1': 0.448, '_SolventAccessibilityC2': 0.328, '_SolventAccessibilityC3': 0.224, '_NormalizedVDWVC1': 0.522, '_HydrophobicityC3': 0.284, '_HydrophobicityC1': 0.328, '_ChargeC3': 0.239, '_PolarityC2': 0.179, '_PolarityC1': 0.299, '_HydrophobicityC2': 0.388, '_PolarityC3': 0.03, '_ChargeC2': 0.731}

Calculate protein descriptors via object

The PyProtein can calculate all kinds of protein descriptors in the PyBioMed. For example, the PyProtein can calculate DPC.

>>> from PyBioMed import Pyprotein
>>> protein="ADGCGVGEGTGQGPMCNCMCMKWVYADEDAADLESDSFADEDASLESDSFPWSNQRVFCSFADEDAS"
>>> protein_class = Pyprotein.PyProtein(protein)
>>> print len(protein_class.GetDPComp())
400

The PyProtein also provide the tool to get sequence from Uniprot through the Uniprot ID.

>>> from PyBioMed import Pyprotein
>>> from PyBioMed.PyProtein.GetProteinFromUniprot import GetProteinSequence
>>> uniprotID = 'P48039'
>>> protein_sequence = GetProteinSequence(uniprotID)
>>> print protein_sequence
MEDINFASLAPRHGSRPFMGTWNEIGTSQLNGGAFSWSSLWSGIKNFGSSIKSFGNKAWNSNTGQMLRDKLKDQNFQQKVVDGL
ASGINGVVDIANQALQNQINQRLENSRQPPVALQQRPPPKVEEVEVEEKLPPLEVAPPLPSKGEKRPRPDLEETLVVESREPPS
YEQALKEGASPYPMTKPIGSMARPVYGKESKPVTLELPPPVPTVPPMPAPTLGTAVSRPTAPTVAVATPARRPRGANWQSTLNS
IVGLGVKSLKRRRCY
>>> protein_class = Pyprotein.PyProtein(protein_sequence)
>>> CTD = protein_class.GetCTD()
>>> print len(CTD)
147

The PyProtein can calculate all protein descriptors except the tri-peptide composition descriptors.

>>> from PyBioMed import Pyprotein
>>> protein="ADGCGVGEGTGQGPMCNCMCMKWVYADEDAADLESDSFADEDASLESDSFPWSNQRVFCSFADEDAS"
>>> protein_class = Pyprotein.PyProtein(protein)
>>> print len(protein_class.GetALL())
10049

Calculating DNA descriptors

The PyDNA module can generate various feature vectors for DNA sequences, this module could:

  • Calculating three nucleic acid composition features describing the local sequence information by means of kmers (subsequences of DNA sequences);
  • Calculating six autocorrelation features describing the level of correlation between two oligonucleotides along a DNA sequence in terms of their specific physicochemical properties;
  • Calculating six pseudo nucleotide composition features, which can be used to represent a DNA sequence with a discrete model or vector yet still keep considerable sequence order information, particularly the global or long-range sequence order information, via the physicochemical properties of its constituent oligonucleotides.

Calculating DNA descriptors via functions

The user can input a DNA sequence and calculate the DNA descriptors using functions.

>>> from PyBioMed.PyDNA.PyDNAac import GetDAC
>>> dac = GetDAC('GACTGAACTGCACTTTGGTTTCATATTATTTGCTC', phyche_index=['Twist','Tilt'])
>>> print(dac)
{'DAC_4': -0.004, 'DAC_1': -0.175, 'DAC_2': -0.185, 'DAC_3': -0.173}

The user can check the parameters and calculate the descriptors.

>>> from PyBioMed.PyDNA import PyDNApsenac
>>> from PyBioMed.PyDNA.PyDNApsenac import GetPseDNC
>>> dnaseq = 'GACTGAACTGCACTTTGGTTTCATATTATTTGCTC'
>>> PyDNApsenac.CheckPsenac(lamada = 2, w = 0.05, k = 2)
>>> psednc = GetPseDNC('ACCCCA',lamada=2, w=0.05)
>>> print(psednc)
{'PseDNC_18': 0.0521, 'PseDNC_16': 0.0, 'PseDNC_17': 0.0391, 'PseDNC_14': 0.0, 'PseDNC_15': 0.0, 'PseDNC_12': 0.0, 'PseDNC_13': 0.0, 'PseDNC_10': 0.0, 'PseDNC_11': 0.0, 'PseDNC_4': 0.0, 'PseDNC_5': 0.182, 'PseDNC_6': 0.545, 'PseDNC_7': 0.0, 'PseDNC_1': 0.0, 'PseDNC_2': 0.182, 'PseDNC_3': 0.0, 'PseDNC_8': 0.0, 'PseDNC_9': 0.0}

Calculating DNA descriptors via object

The PyDNA can calculate all kinds of protein descriptors in the PyBioMed. For example, the PyDNA can calculate SCPseDNC.

>>> from PyBioMed import Pydna
>>> dna = Pydna.PyDNA('GACTGAACTGCACTTTGGTTTCATATTATTTGCTC')
>>> scpsednc = dna.GetSCPseDNC()
>>> print len(scpsednc)
16

Calculating Interaction descriptors

The PyInteraction module can generate six types of interaction descriptors indcluding chemical-chemical interaction features, chemical-protein interaction features, chemical-DNA interaction features, protein-protein interaction features, protein-DNA interaction features, and DNA-DNA interaction features by integrating two groups of features.

The user can choose three different types of methods to calculate interaction descriptors. The function CalculateInteraction1() can calculate two interaction features by combining two features.

\[F_{ab} = \bigl(F_a, F_b\bigr)\]

The function CalculateInteraction2() can calculate two interaction features by two multiplied features.

\[F = \{F(k)= F_a(i) ×F_b(j), i = 1, 2, …, p, j = 1, 2 ,… , p, k = (i-1) ×p+j\}\]

The function CalculateInteraction3() can calculate two interaction features by

\[F=[F_a(i)+F_b(i)),F_a(i)*F_b(i)]\]

The function CalculateInteraction3() is only used in the same type of descriptors including chemical-chemical interaction, protein-protein interaction and DNA-DNA interaction.

The user can calculate chemical-chemical features using three methods .

_images/CCI.png

The calculation process for chemical-chemical interaction descriptors.

>>> from PyBioMed.PyInteraction import PyInteraction
>>> from PyBioMed.PyMolecule import moe
>>> from rdkit import Chem
>>> smis = ['CCCC','CCCCC','CCCCCC','CC(N)C(=O)O','CC(N)C(=O)[O-].[Na+]']
>>> m = Chem.MolFromSmiles(smis[3])
>>> mol_des = moe.GetMOE(m)
>>> mol_mol_interaction1 = PyInteraction.CalculateInteraction1(mol_des,mol_des)
>>> print mol_mol_interaction1
{'slogPVSA6ex': 0.0, 'PEOEVSA10ex': 0.0,......, 'EstateVSA9ex': 4.795, 'slogPVSA2': 4.795, 'slogPVSA3': 0.0, 'slogPVSA0': 5.734, 'slogPVSA1': 17.118, 'slogPVSA6': 0.0, 'slogPVSA7': 0.0, 'slogPVSA4': 6.924, 'slogPVSA5': 0.0, 'slogPVSA8': 0.0, 'slogPVSA9': 0.0}
>>> print len(mol_mol_interaction1)
120
>>> mol_mol_interaction2 = PyInteraction.CalculateInteraction2(mol_des,mol_des)
>>> print len(mol_mol_interaction2)
3600
>>> mol_mol_interaction3 = PyInteraction.CalculateInteraction3(mol_des,mol_des)
{'EstateVSA9*EstateVSA9': 22.992, 'EstateVSA9+EstateVSA9': 9.59, 'PEOEVSA1+PEOEVSA1': 9.59, 'VSAEstate10*VSAEstate10': 0.0, 'PEOEVSA3*PEOEVSA3': 0.0, 'PEOEVSA11*PEOEVSA11': 0.0, 'PEOEVSA4*PEOEVSA4': 0.0, 'VSAEstate2+VSAEstate2': 0.0, 'MRVSA0+MRVSA0': 19.802, 'MRVSA6+MRVSA6': 0.0......}
>>> print len(mol_mol_interaction3)
120

The user can calculate chemical-protein feature using two methods.

_images/CPI.png

The calculation process for chemical-protein interaction descriptors.

>>> from rdkit import Chem
>>> from PyBioMed.PyMolecule import moe
>>> from PyBioMed.PyInteraction.PyInteraction import CalculateInteraction2
>>> smis = ['CCCC','CCCCC','CCCCCC','CC(N)C(=O)O','CC(N)C(=O)[O-].[Na+]']
>>> m = Chem.MolFromSmiles(smis[3])
>>> mol_des = moe.GetMOE(m)
>>> from PyBioMed.PyDNA.PyDNApsenac import GetPseDNC
>>> protein_des = GetPseDNC('ACCCCA',lamada=2, w=0.05)
>>> pro_mol_interaction1 = PyInteraction.CalculateInteraction1(mol_des,protein_des)
>>> print len(pro_mol_interaction1)
78
>>> pro_mol_interaction2 = CalculateInteraction2(mol_des,protein_des)
>>> print len(pro_mol_interaction2)
1080

The user can calculate chemical-DNA feature using two methods.

>>> from PyBioMed.PyDNA import PyDNAac
>>> DNA_des = PyDNAac.GetTCC('GACTGAACTGCACTTTGGTTTCATATTATTTGCTC', phyche_index=['Dnase I', 'Nucleosome','MW-kg'])
>>> from rdkit import Chem
>>> smis = ['CCCC','CCCCC','CCCCCC','CC(N)C(=O)O','CC(N)C(=O)[O-].[Na+]']
>>> m = Chem.MolFromSmiles(smis[3])
>>> mol_des = moe.GetMOE(m)
>>> mol_DNA_interaction1 = PyInteraction.CalculateInteraction1(mol_des,DNA_des)
>>> print len(mol_DNA_interaction1)
72
>>> mol_DNA_interaction2 = PyInteraction.CalculateInteraction2(mol_des,DNA_des)
>>> print len(mol_DNA_interaction2)
720