Working with PDB Structures in DataFrames

Loading PDB Files

There are several ways to load a PDB structure into a PandasPdb object.

1 -- Loading a PDB file from the Protein Data Bank

PDB files can be directly fetched from The Protein Data Bank at http://www.rcsb.org via its unique 4-letter after initializing a new PandasPdb object and calling the fetch_pdb method:

from biopandas.pdb import PandasPdb

# Initialize a new PandasPdb object
# and fetch the PDB file from rcsb.org
ppdb = PandasPdb().fetch_pdb('3eiy')

2 -- Loading a PDB file from the AlphaFold Structure Database

(New in version 0.4.0)

PDB files can be directly fetched from The AlphaFold Structure Database at https://alphafold.ebi.ac.uk/ via its unique UniProt Identifier after initializing a new PandasPdb object and calling the fetch_af2 method:

from biopandas.pdb import PandasPdb

# Initialize a new PandasPdb object
# and fetch the PDB file from alphafold.ebi.ac.uk
ppdb = PandasPdb().fetch_pdb(uniprot_id='Q5VSL9', source="alphafold2-v2")

ppdb.df["ATOM"]

	record_name	atom_number	blank_1	atom_name	...	segment_id	element_symbol	charge	line_idx
0	ATOM	1		N	...		N	NaN	111
1	ATOM	2		CA	...		C	NaN	112
2	ATOM	3		C	...		C	NaN	113
3	ATOM	4		CB	...		C	NaN	114
4	ATOM	5		O	...		O	NaN	115
...	...	...	...	...	...	...	...	...	...
6713	ATOM	6714		CG	...		C	NaN	6824
6714	ATOM	6715		CD	...		C	NaN	6825
6715	ATOM	6716		NE2	...		N	NaN	6826
6716	ATOM	6717		OE1	...		O	NaN	6827
6717	ATOM	6718		OXT	...		O	NaN	6828

6718 rows × 21 columns

3 a) -- Loading a PDB structure from a local file

Alternatively, we can load PDB files from local directories as regular PDB files using read_pdb:

ppdb.read_pdb('./data/3eiy.pdb')

<biopandas.pdb.pandas_pdb.PandasPdb at 0x11c393a60>

[File link: 3eiy.pdb]

3 b) -- Loading a PDB structure from a local gzipped PDB file

Or, we can load them from gzip archives like so (note that the file must end with a '.gz' suffix in order to be recognized as a gzip file):

ppdb.read_pdb('./data/3eiy.pdb.gz')

<biopandas.pdb.pandas_pdb.PandasPdb at 0x11c393a60>

[File link: 3eiy.pdb.gz]

After the file was succesfully loaded, we have access to the following attributes:

print('PDB Code: %s' % ppdb.code)
print('PDB Header Line: %s' % ppdb.header)
print('\nRaw PDB file contents:\n\n%s\n...' % ppdb.pdb_text[:1000])

PDB Code: 3eiy
PDB Header Line:     HYDROLASE                               17-SEP-08   3EIY

Raw PDB file contents:

HEADER    HYDROLASE                               17-SEP-08   3EIY              
TITLE     CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHATASE FROM BURKHOLDERIA      
TITLE    2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE                                
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: INORGANIC PYROPHOSPHATASE;                                 
COMPND   3 CHAIN: A;                                                            
COMPND   4 EC: 3.6.1.1;                                                         
COMPND   5 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: BURKHOLDERIA PSEUDOMALLEI 1710B;                
SOURCE   3 ORGANISM_TAXID: 320372;                                              
SOURCE   4 GENE: PPA, BURPS1710B_1237;                                          
SOURCE   5 EXPRESSION_SYSTEM
...

The most interesting / useful attribute is the PandasPdb.df DataFrame dictionary though, which gives us access to the PDB files as pandas DataFrames. Let's print the first 3 lines from the ATOM coordinate section to see how it looks like:

ppdb.df['ATOM'].head(3)

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	ATOM	1	N	...	N	NaN	609
1	ATOM	2	CA	...	C	NaN	610
2	ATOM	3	C	...	C	NaN	611

3 rows × 21 columns

But more on that in the next section.

4 -- Loading a PDB file from a Python list

Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:

with open('./data/3eiy.pdb', 'r') as f:
    three_eiy = f.readlines()

ppdb2 = PandasPdb()
ppdb2.read_pdb_from_list(three_eiy)

ppdb2.df['ATOM'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	ATOM	1	N	...	N	NaN	609
1	ATOM	2	CA	...	C	NaN	610
2	ATOM	3	C	...	C	NaN	611
3	ATOM	4	O	...	O	NaN	612
4	ATOM	5	CB	...	C	NaN	613

5 rows × 21 columns

5 -- Obtaining a PDB file from a mmCIF structure

Since v0.5.0, it is now also possible to obtain a PandasPdb object from a mmCIF file, using PandasMmcift's PandasMmcif.get_pandas_pdb():

from biopandas.mmcif import PandasMmcif


mmcif = PandasMmcif().fetch_mmcif("3EIY")
pdb = mmcif.convert_to_pandas_pdb()

print("Type:", type(pdb))
pdb.df["ATOM"].head()

Type: <class 'biopandas.pdb.pandas_pdb.PandasPdb'>

	record_name	atom_number	atom_name	residue_name	chain_id	residue_number	...	x_coord	y_coord	z_coord	occupancy	b_factor	element_symbol	charge	line_idx
0	ATOM	1	N	SER	A	2	...	2.527	54.656	-1.667	1.0	52.73	N	NaN	0
1	ATOM	2	CA	SER	A	2	...	3.259	54.783	-0.368	1.0	52.54	C	NaN	1
2	ATOM	3	C	SER	A	2	...	4.127	53.553	-0.105	1.0	52.03	C	NaN	2
3	ATOM	4	O	SER	A	2	...	5.274	53.451	-0.594	1.0	52.45	O	NaN	3
4	ATOM	5	CB	SER	A	2	...	2.273	54.944	0.792	1.0	52.69	C	NaN	4

5 rows × 21 columns

Looking at PDBs in DataFrames

PDB files are parsed according to the PDB file format description. More specifically, BioPandas reads the columns of the ATOM and HETATM sections as shown in the following excerpt from http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.

COLUMNS	DATA TYPE	CONTENTS	biopandas column name
1 - 6	Record name	"ATOM"	record_name
7 - 11	Integer	Atom serial number.	atom_number
12			blank_1
13 - 16	Atom	Atom name.	atom_name
17	Character	Alternate location indicator.	alt_loc
18 - 20	Residue name	Residue name.	residue_name
21			blank_2
22	Character	Chain identifier.	chain_id
23 - 26	Integer	Residue sequence number.	residue_number
27	AChar	Code for insertion of residues.	insertion
28 - 30			blank_3
31 - 38	Real(8.3)	Orthogonal coordinates for X in Angstroms.	x_coord
39 - 46	Real(8.3)	Orthogonal coordinates for Y in Angstroms.	y_coord
47 - 54	Real(8.3)	Orthogonal coordinates for Z in Angstroms.	z_coord
55 - 60	Real(6.2)	Occupancy.	occupancy
61 - 66	Real(6.2)	Temperature factor (Default = 0.0).	bfactor
67-72			blank_4
73 - 76	LString(4)	Segment identifier, left-justified.	segment_id
77 - 78	LString(2)	Element symbol, right-justified.	element_symbol
79 - 80	LString(2)	Charge on the atom.	charge

Below is an example of how this would look like in an actual PDB file:

Example: 
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C

After loading a PDB file from rcsb.org or our local drive, the PandasPdb.df attribute should contain the following 4 DataFrame objects:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb')
ppdb.df.keys()

dict_keys(['ATOM', 'HETATM', 'ANISOU', 'OTHERS'])

[File link: 3eiy.pdb]

'ATOM': contains the entries from the ATOM coordinate section
'HETATM': ... entries from the "HETATM" coordinate section
'ANISOU': ... entries from the "ANISOU" coordinate section
'OTHERS': Everything else that is not a 'ATOM', 'HETATM', or 'ANISOU' entry

The columns of the 'HETATM' DataFrame are indentical to the 'ATOM' DataFrame that we've seen earlier:

ppdb.df['HETATM'].head(2)

	record_name	atom_number	blank_1	atom_name	...	segment_id	element_symbol	charge	line_idx
0	HETATM	1332		K	...		K	NaN	1940
1	HETATM	1333		NA	...		NA	NaN	1941

2 rows × 21 columns

Note that "ANISOU" entries are handled a bit differently as specified at http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM.

ppdb.df['ANISOU'].head(2)

	record_name	atom_number	blank_1	atom_name	...	blank_4	element_symbol	charge	line_idx

0 rows × 21 columns

Not every PDB file contains ANISOU entries (similarly, some PDB files may only contain HETATM or ATOM entries). If records are basent, the DataFrame will be empty as show above.

ppdb.df['ANISOU'].empty

True

Since the DataFrames are fairly wide, let's us take a look at the columns by accessing the DataFrame's column attribute:

ppdb.df['ANISOU'].columns

Index(['record_name', 'atom_number', 'blank_1', 'atom_name', 'alt_loc', 'residue_name', 'blank_2', 'chain_id', 'residue_number', 'insertion', 'blank_3', 'U(1,1)', 'U(2,2)', 'U(3,3)', 'U(1,2)', 'U(1,3)', 'U(2,3)', 'blank_4', 'element_symbol', 'charge', 'line_idx'], dtype='object')

ANISOU records are very similar to ATOM/HETATM records. In fact, the columns 7 - 27 and 73 - 80 are identical to their corresponding ATOM/HETATM records, which means that the 'ANISOU' DataFrame doesn't have the following entries:

set(ppdb.df['ATOM'].columns).difference(set(ppdb.df['ANISOU'].columns))

{'b_factor', 'occupancy', 'segment_id', 'x_coord', 'y_coord', 'z_coord'}

Instead, the "ANISOU" DataFrame contains the anisotropic temperature factors "U(-,-)" -- note that these are scaled by a factor of $10^4$ ( $\text{Angstroms}^2$ ) by convention.

set(ppdb.df['ANISOU'].columns).difference(set(ppdb.df['ATOM'].columns))

{'U(1,1)', 'U(1,2)', 'U(1,3)', 'U(2,2)', 'U(2,3)', 'U(3,3)'}

Ah, another interesting thing to mention is that the columns already come with the types you'd expect (where object essentially "means" str here):

ppdb.df['ATOM'].dtypes

record_name        object
atom_number         int64
blank_1            object
atom_name          object
alt_loc            object
residue_name       object
blank_2            object
chain_id           object
residue_number      int64
insertion          object
blank_3            object
x_coord           float64
y_coord           float64
z_coord           float64
occupancy         float64
b_factor          float64
blank_4            object
segment_id         object
element_symbol     object
charge            float64
line_idx            int64
dtype: object

Typically, all good things come in threes, however, there is a 4th DataFrame, an'OTHER' DataFrame, which contains everything that wasn't parsed as 'ATOM', 'HETATM', or 'ANISOU' coordinate section:

ppdb.df['OTHERS'].head(5)

	record_name	entry	line_idx
0	HEADER	HYDROLASE 17...	0
1	TITLE	CRYSTAL STRUCTURE OF INORGANIC PYROPHOSPHA...	1
2	TITLE	2 PSEUDOMALLEI WITH BOUND PYROPHOSPHATE	2
3	COMPND	MOL_ID: 1;	3
4	COMPND	2 MOLECULE: INORGANIC PYROPHOSPHATASE;	4

Although these 'OTHER' entries are typically less useful for structure-related computations, you may still want to take a look at them to get a short summary of the PDB structure and learn about it's potential quirks and gotchas (typically listed in the REMARKs section). Lastly, the "OTHERS" DataFrame comes in handy if we want to reconstruct the structure as PDB file as we will see later (note the line_idx columns in all of the DataFrames).

Working with PDB DataFrames

In the previous sections, we've seen how to load PDB structures into DataFrames, and how to access them. Now, let's talk about manipulating PDB files in DataFrames.

from biopandas.pdb import PandasPdb
ppdb = PandasPdb()
ppdb.read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	ATOM	1	N	...	N	NaN	609
1	ATOM	2	CA	...	C	NaN	610
2	ATOM	3	C	...	C	NaN	611
3	ATOM	4	O	...	O	NaN	612
4	ATOM	5	CB	...	C	NaN	613

5 rows × 21 columns

[File link: 3eiy.pdb.gz]

Okay, there's actually not that much to say ...
Once we have our PDB file in the DataFrame format, we have the whole convenience of pandas right there at our fingertips.

For example, let's get all Proline residues:

ppdb.df['ATOM'][ppdb.df['ATOM']['residue_name'] == 'PRO'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
38	ATOM	39	N	...	N	NaN	647
39	ATOM	40	CA	...	C	NaN	648
40	ATOM	41	C	...	C	NaN	649
41	ATOM	42	O	...	O	NaN	650
42	ATOM	43	CB	...	C	NaN	651

5 rows × 21 columns

Or main chain atoms:

ppdb.df['ATOM'][ppdb.df['ATOM']['atom_name'] == 'C'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
2	ATOM	3	C	...	C	NaN	611
8	ATOM	9	C	...	C	NaN	617
19	ATOM	20	C	...	C	NaN	628
25	ATOM	26	C	...	C	NaN	634
33	ATOM	34	C	...	C	NaN	642

5 rows × 21 columns

It's also easy to strip our coordinate section from hydrogen atoms if there are any ...

ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	ATOM	1	N	...	N	NaN	609
1	ATOM	2	CA	...	C	NaN	610
2	ATOM	3	C	...	C	NaN	611
3	ATOM	4	O	...	O	NaN	612
4	ATOM	5	CB	...	C	NaN	613

5 rows × 21 columns

Or, let's compute the average temperature factor of our protein main chain:

mainchain = ppdb.df['ATOM'][(ppdb.df['ATOM']['atom_name'] == 'C') | 
                            (ppdb.df['ATOM']['atom_name'] == 'O') | 
                            (ppdb.df['ATOM']['atom_name'] == 'N') | 
                            (ppdb.df['ATOM']['atom_name'] == 'CA')]

bfact_mc_avg = mainchain['b_factor'].mean()
print('Average B-Factor [Main Chain]: %.2f' % bfact_mc_avg)

Average B-Factor [Main Chain]: 28.83

Loading PDB files from a Python List

Since biopandas 0.3.0, PDB files can also be loaded into a PandasPdb object from a Python list:

with open('./data/3eiy.pdb', 'r') as f:
    three_eiy = f.readlines()

ppdb2 = PandasPdb()
ppdb2.read_pdb_from_list(three_eiy)

ppdb2.df['ATOM'].head()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	ATOM	1	N	...	N	NaN	609
1	ATOM	2	CA	...	C	NaN	610
2	ATOM	3	C	...	C	NaN	611
3	ATOM	4	O	...	O	NaN	612
4	ATOM	5	CB	...	C	NaN	613

5 rows × 21 columns

Working with PDBs Containing Multiple Models

(New in version 0.4.0)

Some PDB files, particularly those containing NMR structures, provide an ensemble of models. There are various ways to extract these.

In these examples we will work with 2JYF: an RNA structure containing 10 models of the same underlying RNA structure.

To start, we con obtain a DataFrame denoting the lines of the PDB files corresponding to each model.

from biopandas.pdb import PandasPdb

ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')
ppdb.get_model_start_end()

/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values

	record_name	model_idx	start_idx	end_idx
129	MODEL	1	129	2896
133	MODEL	2	2897	5664
137	MODEL	3	5665	8432
141	MODEL	4	8433	11200
145	MODEL	5	11201	13968
149	MODEL	6	13969	16736
153	MODEL	7	16737	19504
157	MODEL	8	19505	22272
161	MODEL	9	22273	25040
165	MODEL	10	25041	27808

Assigning model IDs to the PDB DataFrames

For ease of use, the label_models() method adds an additional column, "model_id" to the dataframes contained within the PandasPdb object.

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

ppdb.label_models()
ppdb.df["ATOM"]["model_id"]

/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values





0         1
1         1
2         1
3         1
4         1
         ..
27635    10
27636    10
27637    10
27638    10
27639    10
Name: model_id, Length: 27640, dtype: int64

Subsetting PandasPdb objects to a given model

We can obtain new PandasPdb objects containing only a given model using the get_model() method

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

model_4 = ppdb.get_model(model_index=4)
model_4.df["ATOM"]

/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values

	record_name	atom_number	blank_1	atom_name	...	element_symbol	charge	line_idx	model_id
8292	ATOM	1		O5'	...	O	NaN	8434	4
8293	ATOM	2		C5'	...	C	NaN	8435	4
8294	ATOM	3		C4'	...	C	NaN	8436	4
8295	ATOM	4		O4'	...	O	NaN	8437	4
8296	ATOM	5		C3'	...	C	NaN	8438	4
...	...	...	...	...	...	...	...	...	...
11051	ATOM	2761		HO2'	...	H	NaN	11194	4
11052	ATOM	2762		H1'	...	H	NaN	11195	4
11053	ATOM	2763		H3	...	H	NaN	11196	4
11054	ATOM	2764		H5	...	H	NaN	11197	4
11055	ATOM	2765		H6	...	H	NaN	11198	4

2764 rows × 22 columns

Subsetting PandasPdb objects to a list of given models

We can obtain new PandasPdb objects containing only a given models using the get_models() method

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/2jyf.pdb')

model_ensemble = ppdb.get_models(model_indices=[2, 4, 6, 8])
model_ensemble.df["ATOM"]

/Users/sebastian/Desktop/biopandas/biopandas/pdb/pandas_pdb.py:680: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  idxs["end_idx"] = ends.line_idx.values

	record_name	atom_number	blank_1	atom_name	...	element_symbol	charge	line_idx	model_id
2764	ATOM	1		O5'	...	O	NaN	2898	2
2765	ATOM	2		C5'	...	C	NaN	2899	2
2766	ATOM	3		C4'	...	C	NaN	2900	2
2767	ATOM	4		O4'	...	O	NaN	2901	2
2768	ATOM	5		C3'	...	C	NaN	2902	2
...	...	...	...	...	...	...	...	...	...
22107	ATOM	2761		HO2'	...	H	NaN	22266	8
22108	ATOM	2762		H1'	...	H	NaN	22267	8
22109	ATOM	2763		H3	...	H	NaN	22268	8
22110	ATOM	2764		H5	...	H	NaN	22269	8
22111	ATOM	2765		H6	...	H	NaN	22270	8

11056 rows × 22 columns

Plotting

Since we are using pandas under the hood, which in turns uses matplotlib under the hood, we can produce quick summary plots of our PDB structures relatively conveniently:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')

[File link: 3eiy.pdb.gz]

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

ppdb.df['ATOM']['b_factor'].plot(kind='hist')
plt.title('Distribution of B-Factors')
plt.xlabel('B-factor')
plt.ylabel('count')
plt.show()

png

ppdb.df['ATOM']['b_factor'].plot(kind='line')
plt.title('B-Factors Along the Amino Acid Chain')
plt.xlabel('Residue Number')
plt.ylabel('B-factor in $A^2$')
plt.show()

png

ppdb.df['ATOM']['element_symbol'].value_counts().plot(kind='bar')
plt.title('Distribution of Atom Types')
plt.xlabel('elements')
plt.ylabel('count')
plt.show()

png

Computing the Root Mean Square Deviation

BioPandas also comes with certain convenience functions, for example, ...

The Root-mean-square deviation (RMSD) is simply a measure of the average distance between atoms of 2 protein or ligand structures. This calculation of the Cartesian error follows the equation:

$RMSD(a, b) = \sqrt{\frac{1}{n} \sum^{n}_{i=1} \big((a_{ix})^2 + (a_{iy})^2 + (a_{iz})^2 \big)} = \sqrt{\frac{1}{n} \sum^{n}_{i=1} || a_i + b_i||_2^2}$

So, assuming that the we have the following 2 conformations of a ligand molecule

we can compute the RMSD as follows:

from biopandas.pdb import PandasPdb

l_1 = PandasPdb().read_pdb('./data/lig_conf_1.pdb')
l_2 = PandasPdb().read_pdb('./data/lig_conf_2.pdb')
r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'],
                   s=None) # all atoms, including hydrogens
print('RMSD: %.4f Angstrom' % r)

RMSD: 2.6444 Angstrom

l_1.df['HETATM']

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
0	HETATM	1	C1	...	C	NaN	0
1	HETATM	2	O1	...	O	NaN	1
2	HETATM	3	C2	...	C	NaN	2
3	HETATM	4	O2	...	O	NaN	3
4	HETATM	5	C3	...	C	NaN	4
5	HETATM	6	O3	...	O	NaN	5
6	HETATM	7	C4	...	C	NaN	6
7	HETATM	8	O4	...	O	NaN	7
8	HETATM	9	C5	...	C	NaN	8
9	HETATM	10	O5	...	O	NaN	9
10	HETATM	11	C6	...	C	NaN	10
11	HETATM	12	O6	...	O	NaN	11
12	HETATM	13	C7	...	C	NaN	12
13	HETATM	14	C8	...	C	NaN	13
14	HETATM	15	C9	...	C	NaN	14
15	HETATM	16	C10	...	C	NaN	15
16	HETATM	17	H1	...	H	NaN	16
17	HETATM	18	H2	...	H	NaN	17
18	HETATM	19	H3	...	H	NaN	18
19	HETATM	20	H4	...	H	NaN	19
20	HETATM	21	H5	...	H	NaN	20
21	HETATM	22	H6	...	H	NaN	21
22	HETATM	23	H7	...	H	NaN	22
23	HETATM	24	H8	...	H	NaN	23

24 rows × 21 columns

[File links: lig_conf_1.pdb, lig_conf_2.pdb]

r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='carbon') # carbon atoms only
print('RMSD: %.4f Angstrom' % r)

RMSD: 1.7249 Angstrom

r = PandasPdb.rmsd(l_1.df['HETATM'], l_2.df['HETATM'], 
                   s='heavy') # heavy atoms only
print('RMSD: %.4f Angstrom' % r)

RMSD: 1.9959 Angstrom

Similarly, we can compute the RMSD between 2 related protein structures:

The hydrogen-free RMSD:

p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='heavy')
print('RMSD: %.4f Angstrom' % r)

RMSD: 0.7377 Angstrom

Or the RMSD between the main chains only:

p_1 = PandasPdb().read_pdb('./data/1t48_995.pdb')
p_2 = PandasPdb().read_pdb('./data/1t49_995.pdb')
r = PandasPdb.rmsd(p_1.df['ATOM'], p_2.df['ATOM'], s='main chain')
print('RMSD: %.4f Angstrom' % r)

RMSD: 0.4781 Angstrom

Filtering PDBs by Distance

We can use the distance method to compute the distance between each atom (or a subset of atoms) in our data frame and a three-dimensional reference point. For example:

p_1 = PandasPdb().read_pdb('./data/3eiy.pdb')

reference_point = (9.362, 41.410, 10.542)
distances = p_1.distance(xyz=reference_point, records=('ATOM',))

[File link: 3eiy.pdb]

The distance method returns a Pandas Series object:

distances.head()

0    19.267419
1    18.306060
2    16.976934
3    16.902897
4    18.124171
dtype: float64

And we can use this Series object, for instance, to select certain atoms in our DataFrame that fall within a desired distance threshold. For example, let's select all atoms that are within 7A of our reference point:

all_within_7A = p_1.df['ATOM'][distances < 7.0]
all_within_7A.tail()

	record_name	atom_number	atom_name	...	element_symbol	charge	line_idx
786	ATOM	787	CB	...	C	NaN	1395
787	ATOM	788	CG	...	C	NaN	1396
788	ATOM	789	CD1	...	C	NaN	1397
789	ATOM	790	CD2	...	C	NaN	1398
790	ATOM	791	N	...	N	NaN	1399

5 rows × 21 columns

Visualized in PyMOL, this subset (yellow surface) would look as follows:

Converting Amino Acid codes from 3- to 1-letter codes

Residues in the residue_name field can be converted into 1-letter amino acid codes, which may be useful for further sequence analysis, for example, pair-wise or multiple sequence alignments:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().fetch_pdb('5mtn')
sequence = ppdb.amino3to1()
sequence.tail()

	chain_id	residue_name
1378	B	I
1386	B	N
1394	B	Y
1406	B	R
1417	B	T

As shown above, the amino3to1 method returns a DataFrame containing the chain_id and residue_name of the translated 1-letter amino acids. If you like to work with the sequence as a Python list of string characters, you could do the following:

sequence_list = list(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])
sequence_list[-5:] # last 5 residues of chain A

['V', 'R', 'H', 'Y', 'T']

And if you prefer to work with the sequence as a string, you can use the join method:

''.join(sequence.loc[sequence['chain_id'] == 'A', 'residue_name'])

'SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT'

To iterate over the sequences of multi-chain proteins, you can use the unique method as shown below:

for chain_id in sequence['chain_id'].unique():
    print('\nChain ID: %s' % chain_id)
    print(''.join(sequence.loc[sequence['chain_id'] == chain_id, 'residue_name']))

Chain ID: A
SLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYT

Chain ID: B
SVSSVPTKLEVVAATPTSLLISWDAPAVTVVYYLITYGETGSPWPGGQAFEVPGSKSTATISGLKPGVDYTITVYAHRSSYGYSENPISINYRT

Wrapping it up - Saving PDB structures

Finally, let's talk about how to get the PDB structures out of the DataFrame format back into the beloved .pdb format.

Let's say we loaded a PDB structure, removed it from its hydrogens:

from biopandas.pdb import PandasPdb
ppdb = PandasPdb().read_pdb('./data/3eiy.pdb.gz')
ppdb.df['ATOM'] = ppdb.df['ATOM'][ppdb.df['ATOM']['element_symbol'] != 'H']

[File link: 3eiy.pdb.gz]

We can save the file using the PandasPdb.to_pdb method:

ppdb.to_pdb(path='./data/3eiy_stripped.pdb', 
            records=None, 
            gz=False, 
            append_newline=True)

[File link: 3eiy_stripped.pdb]

By default, all records (that is, 'ATOM', 'HETATM', 'OTHERS', 'ANISOU') are written if we set records=None. Alternatively, let's say we want to get rid of the 'ANISOU' entries and produce a compressed gzip archive of our PDB structure:

ppdb.to_pdb(path='./data/3eiy_stripped.pdb.gz', 
            records=['ATOM', 'HETATM', 'OTHERS'], 
            gz=True, 
            append_newline=True)

[File link: 3eiy_stripped.pdb.gz]

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search