GetProtein module

Created on Sat Jul 13 11:18:26 2013

This module downloads PDB files from the RCSB PDB website and extracts their amino acid sequences. It can also download protein sequences from the UniProt website. You only need to supply a protein ID or prepare a file (ID.txt) listing the IDs. The result is a .txt file (ProteinSequence.txt) containing the protein sequences you need.

Authors: Zhijiang Yao and Dongsheng Cao.

Date: 2016.06.04



Download PDB files from the PDB FTP server, given a list of PDB IDs.


Get a protein sequence from the UniProt website by its ID.

Input: ProteinID is a string giving the protein ID, such as “P48039”.

GetProtein.GetProteinSequenceFromTxt(path, openfile, savefile)

Get protein sequences from the UniProt website for the IDs listed in a file.

Input: path is the directory containing the ID file, such as “/home/orient/protein/”

openfile is the name of the ID file, such as “proteinID.txt”
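
The file-driven workflow could be sketched roughly as follows. This is an illustration only, not the module's implementation: it assumes the ID file holds one ID per line, the function name `sequences_from_id_file` is hypothetical, and `fetch_sequence` is a hypothetical callable standing in for whatever performs the UniProt lookup.

```python
import os


def sequences_from_id_file(path, openfile, savefile, fetch_sequence):
    """Read IDs from path/openfile and write one sequence per line to path/savefile.

    fetch_sequence is a hypothetical callable mapping an ID to a sequence string.
    Returns the list of IDs that were read.
    """
    # Assumption: one ID per line; blank lines are skipped.
    with open(os.path.join(path, openfile)) as handle:
        ids = [line.strip() for line in handle if line.strip()]

    with open(os.path.join(path, savefile), "w") as out:
        for protein_id in ids:
            out.write(fetch_sequence(protein_id) + "\n")
    return ids
```

Passing the lookup in as an argument keeps the file handling separate from the network access, so the sketch can be tried without a connection.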


Get the amino acid sequences from a PDB file.


Check whether the Seq object is in valid FASTA format. Three situations are treated as invalid: 1. No sequence name. 2. The sequence name is illegal. 3. No sequence.

Parameters: seq – Seq object.
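
The three checks can be sketched as below. This is a minimal illustration under stated assumptions, not the module's code: `fasta_problem` is a hypothetical name, `Seq` is modelled as a named tuple with the fields `name`, `seq` and `no`, and a "legal" name is assumed to be one without whitespace or `>` characters.

```python
import re
from collections import namedtuple

# Stand-in for the module's Seq class (assumed to carry these three fields).
Seq = namedtuple("Seq", ["name", "seq", "no"])

# Assumption: a legal name contains no whitespace and no '>' character.
_LEGAL_NAME = re.compile(r"[^\s>]+")


def fasta_problem(seq):
    """Return a description of the first problem found, or None if seq looks valid."""
    if not seq.name:
        return "no sequence name"
    if not _LEGAL_NAME.fullmatch(seq.name):
        return "illegal sequence name"
    if not seq.seq:
        return "no sequence"
    return None
```

Returning a message rather than a bare boolean makes it easier to report which of the three cases was hit.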

Read a FASTA file.

Parameters: f – handle to the input, e.g. sys.stdin or open(<file>)
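
Reading FASTA records from a handle might look like the sketch below. It is not the module's implementation: `read_fasta` is a hypothetical name, `Seq` is again modelled as a named tuple, and the `no` field is assumed to be the zero-based position of the record in the input.

```python
import io
from collections import namedtuple

# Stand-in for the module's Seq class (assumed fields).
Seq = namedtuple("Seq", ["name", "seq", "no"])


def read_fasta(f):
    """Yield Seq(name, seq, no) records from an open text handle."""
    name, chunks, no = None, [], 0
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield Seq(name, "".join(chunks), no)
                no += 1
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield Seq(name, "".join(chunks), no)


# Any iterable of lines works: sys.stdin, open(<file>), or an in-memory buffer.
records = list(read_fasta(io.StringIO(">sp1\nMNGT\nEGPN\n>sp2\nMKT\n")))
```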

class GetProtein.Seq(name, seq, no)

exception GetProtein.TimeoutException

Bases: exceptions.Exception

GetProtein.pdbDownload(file_list, hostname='', directory='/pub/pdb/data/structures/all/pdb/', prefix='pdb', suffix='.ent.gz')

Download all PDB files in file_list and unzip them.
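
As an illustration of how the default `directory`, `prefix` and `suffix` arguments could combine into a remote file path, consider the sketch below. `remote_pdb_path` is a hypothetical helper, and lower-casing the ID is an assumption rather than something the documentation states.

```python
def remote_pdb_path(pdb_id,
                    directory="/pub/pdb/data/structures/all/pdb/",
                    prefix="pdb", suffix=".ent.gz"):
    """Build the remote path for one PDB entry from the documented defaults."""
    # Assumption: file names on the server use the lower-case PDB ID.
    return "%s%s%s%s" % (directory, prefix, pdb_id.lower(), suffix)


path = remote_pdb_path("1A2B")
# path == "/pub/pdb/data/structures/all/pdb/pdb1a2b.ent.gz"
```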

GetProtein.pdbSeq(pdb, use_atoms=False)

Parse the SEQRES entries in a PDB file. If this fails, use the ATOM entries. Return a dictionary of sequences keyed by chain, together with the type of sequence used.
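
The SEQRES step can be sketched as follows. This is a simplified illustration, not the module's parser: `seqres_sequences` is a hypothetical name, the lines are split on whitespace (which assumes a non-blank chain ID), unknown residue names are mapped to `X`, and the ATOM fallback is left out.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(pdb_lines):
    """Return {chain_id: one-letter sequence} built from SEQRES records."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        # Fields: record name, serial number, chain ID, residue count, residues...
        fields = line.split()
        chain, residues = fields[2], fields[4:]
        letters = "".join(THREE_TO_ONE.get(res, "X") for res in residues)
        chains[chain] = chains.get(chain, "") + letters
    return chains


example = [
    "SEQRES   1 A    5  GLY ILE VAL GLU GLN",
    "SEQRES   1 B    3  PHE VAL ASN",
    "END",
]
sequences = seqres_sequences(example)
# sequences == {"A": "GIVEQ", "B": "FVN"}
```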


A decorator that limits the execution time of a function.
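
One portable way to build such a decorator is to run the call in a worker thread and stop waiting after a deadline, as sketched below. This is an assumption about the general technique, not the module's code: `time_limit` is a hypothetical name, and `TimeoutException` is redefined locally so the example is self-contained. Note that the worker thread is abandoned rather than killed, so a timed-out call keeps running in the background.

```python
import functools
import threading


class TimeoutException(Exception):
    """Raised when the wrapped call does not finish in time."""


def time_limit(seconds):
    """Decorator factory: raise TimeoutException if the call exceeds `seconds`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                try:
                    outcome["value"] = func(*args, **kwargs)
                except Exception as exc:  # forwarded to the caller below
                    outcome["error"] = exc

            # Daemon thread: it will not keep the interpreter alive on exit.
            worker = threading.Thread(target=target, daemon=True)
            worker.start()
            worker.join(seconds)
            if worker.is_alive():
                raise TimeoutException(
                    "%s exceeded %s seconds" % (func.__name__, seconds))
            if "error" in outcome:
                raise outcome["error"]
            return outcome["value"]
        return wrapper
    return decorator
```

A function decorated with `@time_limit(10)` would then either return normally or raise `TimeoutException` after about ten seconds, which suits network downloads that may hang.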

GetProtein.unZip(some_file, some_output)

Unzip some_file using the gzip library and write to some_output.
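
A minimal sketch of this operation with the standard library is shown below. The function name `unzip_file` is hypothetical, and the streaming copy via `shutil.copyfileobj` is one reasonable choice rather than a description of the module's code.

```python
import gzip
import shutil


def unzip_file(some_file, some_output):
    """Decompress the gzip file some_file and write the result to some_output."""
    # Copy in chunks so large structure files are not loaded into memory at once.
    with gzip.open(some_file, "rb") as src, open(some_output, "wb") as dst:
        shutil.copyfileobj(src, dst)
```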