Working with PDB Structures in DataFrames

Loading PDB Files

There are several ways to load a PDB structure into a PandasPdb object.

1 -- Loading a PDB file from the Protein Data Bank

PDB files can be directly fetched from The Protein Data Bank at http://www.rcsb.org via its unique 4-letter after initializing a new PandasPdb object and calling the fetch_pdb method:

from biopandas.pdb import PandasPdb

# Initialize a new PandasPdb object
# and fetch the PDB file from rcsb.org
ppdb = PandasPdb().fetch_pdb('3eiy')

2 -- Loading a PDB file from the AlphaFold Structure Database

(New in version 0.4.0)

PDB files can be directly fetched from The AlphaFold Structure Database at https://alphafold.ebi.ac.uk/ via its unique UniProt Identifier after initializing a new PandasPdb object and calling the fetch_af2 method:

from biopandas.pdb import PandasPdb

# Initialize a new PandasPdb object
# and fetch the PDB file from alphafold.ebi.ac.uk
ppdb = PandasPdb().fetch_pdb(uniprot_id='Q5VSL9', source="alphafold2-v2")
ppdb.df["ATOM"]
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 111
1 ATOM 2 CA ... C NaN 112
2 ATOM 3 C ... C NaN 113
3 ATOM 4 CB ... C NaN 114
4 ATOM 5 O ... O NaN 115
... ... ... ... ... ... ... ... ... ...
6713 ATOM 6714 CG ... C NaN 6824
6714 ATOM 6715 CD ... C NaN 6825
6715 ATOM 6716 NE2 ... N NaN 6826
6716 ATOM 6717 OE1 ... O NaN 6827
6717 ATOM 6718 OXT ... O NaN 6828

6718 rows × 21 columns

3 a) -- Loading a PDB structure from a local file

Alternatively, we can load PDB files from local directories as regular PDB files using read_pdb:

ppdb.read_pdb('./data/3eiy.pdb')
<biopandas.pdb.pandas_pdb.PandasPdb at 0x11c393a60>

[File link: 3eiy.pdb]

3 b) -- Loading a PDB structure from a local gzipped PDB file

Or, we can load them from gzip archives like so (note that the file must end with a '.gz' suffix in order to be recognized as a gzip file):

ppdb.read_pdb('./data/3eiy.pdb.gz')
<biopandas.pdb.pandas_pdb.PandasPdb at 0x11c393a60>

[File link: 3eiy.pdb.gz]

After the file was succesfully loaded, we have access to the following attributes:

print('PDB Code: %s' % ppdb.code)
print('PDB Header Line: %s' % ppdb.header)
print('\nRaw PDB file contents:\n\n%s\n...' % ppdb.pdb_text[:1000])
PDB Code: 3eiy
PDB Header Line:     HYDROLASE                               17-SEP-08   3EIY

Raw PDB file contents:

HEADER    HYDROLASE                               17-SEP-08   3EIY              
TITLE     CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHATASE FROM BURKHOLDERIA      
TITLE    2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE                                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: INORGANIC PYROPHOSPHATASE;                                 
COMPND   3 CHAIN: A;                                                            
COMPND   4 EC: 3.6.1.1;                                                         
COMPND   5 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: BURKHOLDERIA PSEUDOMALLEI 1710B;                
SOURCE   3 ORGANISM_TAXID: 320372;                                              
SOURCE   4 GENE: PPA, BURPS1710B_1237;                                          
SOURCE   5 EXPRESSION_SYSTEM
...

The most interesting / useful attribute is the PandasPdb.df DataFrame dictionary though, which gives us access to the PDB files as pandas DataFrames. Let's print the first 3 lines from the ATOM coordinate section to see how it looks like:

ppdb.df['ATOM'].head(3)
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 609
1 ATOM 2 CA ... C NaN 610
2 ATOM 3 C ... C NaN 611

3 rows × 21 columns

But more on that in the next section.

4 -- Loading a PDB file from a Python list

Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:

with open('./data/3eiy.pdb', 'r') as f:
    three_eiy = f.readlines()

ppdb2 = PandasPdb()
ppdb2.read_pdb_from_list(three_eiy)

ppdb2.df['ATOM'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 609
1 ATOM 2 CA ... C NaN 610
2 ATOM 3 C ... C NaN 611
3 ATOM 4 O ... O NaN 612
4 ATOM 5 CB ... C NaN 613

5 rows × 21 columns

5 -- Obtaining a PDB file from a mmCIF structure

Since v0.5.0, it is now also possible to obtain a PandasPdb object from a mmCIF file, using PandasMmcift's PandasMmcif.get_pandas_pdb():

from biopandas.mmcif import PandasMmcif


mmcif = PandasMmcif().fetch_mmcif("3EIY")
pdb = mmcif.convert_to_pandas_pdb()

print("Type:", type(pdb))
pdb.df["ATOM"].head()    
Type: <class 'biopandas.pdb.pandas_pdb.PandasPdb'>
record_name atom_number blank_1 atom_name alt_loc residue_name blank_2 chain_id residue_number insertion ... x_coord y_coord z_coord occupancy b_factor blank_4 segment_id element_symbol charge line_idx
0 ATOM 1 N SER A 2 ... 2.527 54.656 -1.667 1.0 52.73 N NaN 0
1 ATOM 2 CA SER A 2 ... 3.259 54.783 -0.368 1.0 52.54 C NaN 1
2 ATOM 3 C SER A 2 ... 4.127 53.553 -0.105 1.0 52.03 C NaN 2
3 ATOM 4 O SER A 2 ... 5.274 53.451 -0.594 1.0 52.45 O NaN 3
4 ATOM 5 CB SER A 2 ... 2.273 54.944 0.792 1.0 52.69 C NaN 4

5 rows × 21 columns

Looking at PDBs in DataFrames

PDB files are parsed according to the PDB file format description. More specifically, BioPandas reads the columns of the ATOM and HETATM sections as shown in the following excerpt from http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.

COLUMNS DATA TYPE CONTENTS biopandas column name
1 - 6 Record name "ATOM" record_name
7 - 11 Integer Atom serial number. atom_number
12 blank_1
13 - 16 Atom Atom name. atom_name
17 Character Alternate location indicator. alt_loc
18 - 20 Residue name Residue name. residue_name
21 blank_2
22 Character Chain identifier. chain_id
23 - 26 Integer Residue sequence number. residue_number
27 AChar Code for insertion of residues. insertion
28 - 30 blank_3
31 - 38 Real(8.3) Orthogonal coordinates for X in Angstroms. x_coord
39 - 46 Real(8.3) Orthogonal coordinates for Y in Angstroms. y_coord
47 - 54 Real(8.3) Orthogonal coordinates for Z in Angstroms. z_coord
55 - 60 Real(6.2) Occupancy. occupancy
61 - 66 Real(6.2) Temperature factor (Default = 0.0). bfactor
67-72 blank_4
73 - 76 LString(4) Segment identifier, left-justified. segment_id
77 - 78 LString(2) Element symbol, right-justified. element_symbol
79 - 80 LString(2) Charge on the atom. charge

Below is an example of how this would look like in an actual PDB file:

Example: 
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C

After loading a PDB file from rcsb.org or our local drive, the PandasPdb.df attribute should contain the following 4 DataFrame objects:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb')
ppdb.df.keys()
dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

[File link: 3eiy.pdb]

  • 'ATOM': contains the entries from the ATOM coordinate section
  • 'HETATM': ... entries from the "HETATM" coordinate section
  • 'ANISOU': ... entries from the "ANISOU" coordinate section
  • 'OTHERS': Everything else that is not a 'ATOM', 'HETATM', or 'ANISOU' entry

The columns of the 'HETATM' DataFrame are indentical to the 'ATOM' DataFrame that we've seen earlier:

ppdb.df['HETATM'].head(2)
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 HETATM 1332 K ... K NaN 1940
1 HETATM 1333 NA ... NA NaN 1941

2 rows × 21 columns


Note that "ANISOU" entries are handled a bit differently as specified at http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.

ppdb.df['ANISOU'].head(2)
record_name atom_number blank_1 atom_name ... blank_4 element_symbol charge line_idx

0 rows × 21 columns

Not every PDB file contains ANISOU entries (similarly, some PDB files may only contain HETATM or ATOM entries). If records are basent, the DataFrame will be empty as show above.

ppdb.df['ANISOU'].empty
True

Since the DataFrames are fairly wide, let's us take a look at the columns by accessing the DataFrame's column attribute:

ppdb.df['ANISOU'].columns
Index(['record_name', 'atom_number', 'blank_1', 'atom_name', 'alt_loc', 'residue_name', 'blank_2', 'chain_id', 'residue_number', 'insertion', 'blank_3', 'U(1,1)', 'U(2,2)', 'U(3,3)', 'U(1,2)', 'U(1,3)', 'U(2,3)', 'blank_4', 'element_symbol', 'charge', 'line_idx'], dtype='object')

ANISOU records are very similar to ATOM/HETATM records. In fact, the columns 7 - 27 and 73 - 80 are identical to their corresponding ATOM/HETATM records, which means that the 'ANISOU' DataFrame doesn't have the following entries:

set(ppdb.df['ATOM'].columns).difference(set(ppdb.df['ANISOU'].columns))
{'b_factor', 'occupancy', 'segment_id', 'x_coord', 'y_coord', 'z_coord'}

Instead, the "ANISOU" DataFrame contains the anisotropic temperature factors "U(-,-)" -- note that these are scaled by a factor of () by convention.

set(ppdb.df['ANISOU'].columns).difference(set(ppdb.df['ATOM'].columns))
{'U(1,1)', 'U(1,2)', 'U(1,3)', 'U(2,2)', 'U(2,3)', 'U(3,3)'}



Ah, another interesting thing to mention is that the columns already come with the types you'd expect (where object essentially "means" str here):

ppdb.df['ATOM'].dtypes
record_name        object
atom_number         int64
blank_1            object
atom_name          object
alt_loc            object
residue_name       object
blank_2            object
chain_id           object
residue_number      int64
insertion          object
blank_3            object
x_coord           float64
y_coord           float64
z_coord           float64
occupancy         float64
b_factor          float64
blank_4            object
segment_id         object
element_symbol     object
charge            float64
line_idx            int64
dtype: object


Typically, all good things come in threes, however, there is a 4th DataFrame, an'OTHER' DataFrame, which contains everything that wasn't parsed as 'ATOM', 'HETATM', or 'ANISOU' coordinate section:

ppdb.df['OTHERS'].head(5)
record_name entry line_idx
0 HEADER HYDROLASE 17... 0
1 TITLE CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHA... 1
2 TITLE 2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE 2
3 COMPND MOL_ID: 1; 3
4 COMPND 2 MOLECULE: INORGANIC PYROPHOSPHATASE; 4

Although these 'OTHER' entries are typically less useful for structure-related computations, you may still want to take a look at them to get a short summary of the PDB structure and learn about it's potential quirks and gotchas (typically listed in the REMARKs section). Lastly, the "OTHERS" DataFrame comes in handy if we want to reconstruct the structure as PDB file as we will see later (note the line_idx columns in all of the DataFrames).

Working with PDB DataFrames

In the previous sections, we've seen how to load PDB structures into DataFrames, and how to access them. Now, let's talk about manipulating PDB files in DataFrames.

from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 609
1 ATOM 2 CA ... C NaN 610
2 ATOM 3 C ... C NaN 611
3 ATOM 4 O ... O NaN 612
4 ATOM 5 CB ... C NaN 613

5 rows × 21 columns

[File link: 3eiy.pdb.gz]

Okay, there's actually not that much to say ...
Once we have our PDB file in the DataFrame format, we have the whole convenience of pandas right there at our fingertips.

For example, let's get all Proline residues:

ppdb.df['ATOM'][ppdb.df['ATOM']['residue_name'] == 'PRO'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
38 ATOM 39 N ... N NaN 647
39 ATOM 40 CA ... C NaN 648
40 ATOM 41 C ... C NaN 649
41 ATOM 42 O ... O NaN 650
42 ATOM 43 CB ... C NaN 651

5 rows × 21 columns

Or main chain atoms:

ppdb.df['ATOM'][ppdb.df['ATOM']['atom_name'] == 'C'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
2 ATOM 3 C ... C NaN 611
8 ATOM 9 C ... C NaN 617
19 ATOM 20 C ... C NaN 628
25 ATOM 26 C ... C NaN 634
33 ATOM 34 C ... C NaN 642

5 rows × 21 columns

It's also easy to strip our coordinate section from hydrogen atoms if there are any ...

ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 609
1 ATOM 2 CA ... C NaN 610
2 ATOM 3 C ... C NaN 611
3 ATOM 4 O ... O NaN 612
4 ATOM 5 CB ... C NaN 613

5 rows × 21 columns

Or, let's compute the average temperature factor of our protein main chain:

mainchain = ppdb.df['ATOM'][(ppdb.df['ATOM']['atom_name'] == 'C') | 
                            (ppdb.df['ATOM']['atom_name'] == 'O') | 
                            (ppdb.df['ATOM']['atom_name'] == 'N') | 
                            (ppdb.df['ATOM']['atom_name'] == 'CA')]

bfact_mc_avg = mainchain['b_factor'].mean()
print('Average B-Factor [Main Chain]: %.2f' % bfact_mc_avg)
Average B-Factor [Main Chain]: 28.83

Loading PDB files from a Python List

Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:

with open('./data/3eiy.pdb', 'r') as f:
    three_eiy = f.readlines()

ppdb2 = PandasPdb()
ppdb2.read_pdb_from_list(three_eiy)

ppdb2.df['ATOM'].head()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 ATOM 1 N ... N NaN 609
1 ATOM 2 CA ... C NaN 610
2 ATOM 3 C ... C NaN 611
3 ATOM 4 O ... O NaN 612
4 ATOM 5 CB ... C NaN 613

5 rows × 21 columns

Working with PDBs Containing Multiple Models

(New in version 0.4.0)

Some PDB files, particularly those containing NMR structures, provide an ensemble of models. There are various ways to extract these.

In these examples we will work with 2JYF: an RNA structure containing 10 models of the same underlying RNA structure.

To start, we con obtain a DataFrame denoting the lines of the PDB files corresponding to each model.

from biopandas.pdb import PandasPdb

ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')
ppdb.get_model_start_end()
/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values
record_name model_idx start_idx end_idx
129 MODEL 1 129 2896
133 MODEL 2 2897 5664
137 MODEL 3 5665 8432
141 MODEL 4 8433 11200
145 MODEL 5 11201 13968
149 MODEL 6 13969 16736
153 MODEL 7 16737 19504
157 MODEL 8 19505 22272
161 MODEL 9 22273 25040
165 MODEL 10 25041 27808

Assigning model IDs to the PDB DataFrames

For ease of use, the label_models() method adds an additional column, "model_id" to the dataframes contained within the PandasPdb object.

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

ppdb.label_models()
ppdb.df["ATOM"]["model_id"]
/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values





0         1
1         1
2         1
3         1
4         1
         ..
27635    10
27636    10
27637    10
27638    10
27639    10
Name: model_id, Length: 27640, dtype: int64

Subsetting PandasPdb objects to a given model

We can obtain new PandasPdb objects containing only a given model using the get_model() method

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

model_4 = ppdb.get_model(model_index=4)
model_4.df["ATOM"]
/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values
record_name atom_number blank_1 atom_name ... element_symbol charge line_idx model_id
8292 ATOM 1 O5' ... O NaN 8434 4
8293 ATOM 2 C5' ... C NaN 8435 4
8294 ATOM 3 C4' ... C NaN 8436 4
8295 ATOM 4 O4' ... O NaN 8437 4
8296 ATOM 5 C3' ... C NaN 8438 4
... ... ... ... ... ... ... ... ... ...
11051 ATOM 2761 HO2' ... H NaN 11194 4
11052 ATOM 2762 H1' ... H NaN 11195 4
11053 ATOM 2763 H3 ... H NaN 11196 4
11054 ATOM 2764 H5 ... H NaN 11197 4
11055 ATOM 2765 H6 ... H NaN 11198 4

2764 rows × 22 columns

Subsetting PandasPdb objects to a list of given models

We can obtain new PandasPdb objects containing only a given models using the get_models() method

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

model_ensemble = ppdb.get_models(model_indices=[2, 4, 6, 8])
model_ensemble.df["ATOM"]
/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values
record_name atom_number blank_1 atom_name ... element_symbol charge line_idx model_id
2764 ATOM 1 O5' ... O NaN 2898 2
2765 ATOM 2 C5' ... C NaN 2899 2
2766 ATOM 3 C4' ... C NaN 2900 2
2767 ATOM 4 O4' ... O NaN 2901 2
2768 ATOM 5 C3' ... C NaN 2902 2
... ... ... ... ... ... ... ... ... ...
22107 ATOM 2761 HO2' ... H NaN 22266 8
22108 ATOM 2762 H1' ... H NaN 22267 8
22109 ATOM 2763 H3 ... H NaN 22268 8
22110 ATOM 2764 H5 ... H NaN 22269 8
22111 ATOM 2765 H6 ... H NaN 22270 8

11056 rows × 22 columns

Plotting

Since we are using pandas under the hood, which in turns uses matplotlib under the hood, we can produce quick summary plots of our PDB structures relatively conveniently:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')

[File link: 3eiy.pdb.gz]

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
ppdb.df['ATOM']['b_factor'].plot(kind='hist')
plt.title('Distribution of B-Factors')
plt.xlabel('B-factor')
plt.ylabel('count')
plt.show()

png

ppdb.df['ATOM']['b_factor'].plot(kind='line')
plt.title('B-Factors Along the Amino Acid Chain')
plt.xlabel('Residue Number')
plt.ylabel('B-factor in $A^2$')
plt.show()

png

ppdb.df['ATOM']['element_symbol'].value_counts().plot(kind='bar')
plt.title('Distribution of Atom Types')
plt.xlabel('elements')
plt.ylabel('count')
plt.show()

png

Computing the Root Mean Square Deviation

BioPandas also comes with certain convenience functions, for example, ...

The Root-mean-square deviation (RMSD) is simply a measure of the average distance between atoms of 2 protein or ligand structures. This calculation of the Cartesian error follows the equation:

So, assuming that the we have the following 2 conformations of a ligand molecule

we can compute the RMSD as follows:

from biopandas.pdb import PandasPdb

l_1 = PandasPdb().read_pdb('./data/lig_conf_1.pdb')
l_2 = PandasPdb().read_pdb('./data/lig_conf_2.pdb')
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'],
                   s=None) # all atoms, including hydrogens
print('RMSD: %.4f Angstrom' % r)
RMSD: 2.6444 Angstrom
l_1.df['HETATM']
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
0 HETATM 1 C1 ... C NaN 0
1 HETATM 2 O1 ... O NaN 1
2 HETATM 3 C2 ... C NaN 2
3 HETATM 4 O2 ... O NaN 3
4 HETATM 5 C3 ... C NaN 4
5 HETATM 6 O3 ... O NaN 5
6 HETATM 7 C4 ... C NaN 6
7 HETATM 8 O4 ... O NaN 7
8 HETATM 9 C5 ... C NaN 8
9 HETATM 10 O5 ... O NaN 9
10 HETATM 11 C6 ... C NaN 10
11 HETATM 12 O6 ... O NaN 11
12 HETATM 13 C7 ... C NaN 12
13 HETATM 14 C8 ... C NaN 13
14 HETATM 15 C9 ... C NaN 14
15 HETATM 16 C10 ... C NaN 15
16 HETATM 17 H1 ... H NaN 16
17 HETATM 18 H2 ... H NaN 17
18 HETATM 19 H3 ... H NaN 18
19 HETATM 20 H4 ... H NaN 19
20 HETATM 21 H5 ... H NaN 20
21 HETATM 22 H6 ... H NaN 21
22 HETATM 23 H7 ... H NaN 22
23 HETATM 24 H8 ... H NaN 23

24 rows × 21 columns

[File links: lig_conf_1.pdb, lig_conf_2.pdb]

r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='carbon') # carbon atoms only
print('RMSD: %.4f Angstrom' % r)
RMSD: 1.7249 Angstrom
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='heavy') # heavy atoms only
print('RMSD: %.4f Angstrom' % r)
RMSD: 1.9959 Angstrom

Similarly, we can compute the RMSD between 2 related protein structures:

The hydrogen-free RMSD:

p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='heavy')
print('RMSD: %.4f Angstrom' % r)
RMSD: 0.7377 Angstrom

Or the RMSD between the main chains only:

p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='main chain')
print('RMSD: %.4f Angstrom' % r)
RMSD: 0.4781 Angstrom


Filtering PDBs by Distance

We can use the distance method to compute the distance between each atom (or a subset of atoms) in our data frame and a three-dimensional reference point. For example:

p_1 = PandasPdb().read_pdb('./data/3eiy.pdb')

reference_point = (9.362, 41.410, 10.542)
distances = p_1.distance(xyz=reference_point, records=('ATOM',))

[File link: 3eiy.pdb]

The distance method returns a Pandas Series object:

distances.head()
0    19.267419
1    18.306060
2    16.976934
3    16.902897
4    18.124171
dtype: float64

And we can use this Series object, for instance, to select certain atoms in our DataFrame that fall within a desired distance threshold. For example, let's select all atoms that are within 7A of our reference point:

all_within_7A = p_1.df['ATOM'][distances < 7.0]
all_within_7A.tail()
record_name atom_number blank_1 atom_name ... segment_id element_symbol charge line_idx
786 ATOM 787 CB ... C NaN 1395
787 ATOM 788 CG ... C NaN 1396
788 ATOM 789 CD1 ... C NaN 1397
789 ATOM 790 CD2 ... C NaN 1398
790 ATOM 791 N ... N NaN 1399

5 rows × 21 columns

Visualized in PyMOL, this subset (yellow surface) would look as follows:

Converting Amino Acid codes from 3- to 1-letter codes

Residues in the residue_name field can be converted into 1-letter amino acid codes, which may be useful for further sequence analysis, for example, pair-wise or multiple sequence alignments:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().fetch_pdb('5mtn')
sequence = ppdb.amino3to1()
sequence.tail()
chain_id residue_name
1378 B I
1386 B N
1394 B Y
1406 B R
1417 B T

As shown above, the amino3to1 method returns a DataFrame containing the chain_id and residue_name of the translated 1-letter amino acids. If you like to work with the sequence as a Python list of string characters, you could do the following:

sequence_list = list(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])
sequence_list[-5:] # last 5 residues of chain A
['V', 'R', 'H', 'Y', 'T']

And if you prefer to work with the sequence as a string, you can use the join method:

''.join(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])
'SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT'

To iterate over the sequences of multi-chain proteins, you can use the unique method as shown below:

for chain_id in sequence['chain_id'].unique():
    print('\nChain ID: %s' % chain_id)
    print(''.join(sequence.loc[sequence['chain_id'] == chain_id, 'residue_name']))
Chain ID: A
SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT

Chain ID: B
SVSSVPTKLEVVAATPTSLLISWDAPAVTVVYYLITYGETGSPWPGGQAFEVPGSKSTATISGLKPGVDYTITVYAHRSSYGYSENPISINYRT

Wrapping it up - Saving PDB structures

Finally, let's talk about how to get the PDB structures out of the DataFrame format back into the beloved .pdb format.

Let's say we loaded a PDB structure, removed it from its hydrogens:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H']

[File link: 3eiy.pdb.gz]

We can save the file using the PandasPdb.to_pdb method:

ppdb.to_pdb(path='./data/3eiy_stripped.pdb', 
            records=None, 
            gz=False, 
            append_newline=True)

[File link: 3eiy_stripped.pdb]

By default, all records (that is, 'ATOM', 'HETATM', 'OTHERS', 'ANISOU') are written if we set records=None. Alternatively, let's say we want to get rid of the 'ANISOU' entries and produce a compressed gzip archive of our PDB structure:

ppdb.to_pdb(path='./data/3eiy_stripped.pdb.gz', 
            records=['ATOM', 'HETATM', 'OTHERS'], 
            gz=True, 
            append_newline=True)

[File link: 3eiy_stripped.pdb.gz]