Description of the programs and scripts written for the screening of library by AUTODOCK

Mihaly Mezei

Department of Structural and Chemical Biology,

Mount Sinai School of Medicine, NYU

New York, NY 10029

Mihaly.Mezei@mssm.edu

Sept. 21, 2009.

The screening of a library of ligand by Autodock (Version 3.0.5 or Version 4) requires

The programs and scripts in this package help in all three steps. It is assumed that the ligands are available in Tripos' mol2 format, the executables of Autodock suite of programs are available, and (for Autodock 4) Python 2.4 (or higher) is installed. Semiempirical partial charge calculations also require the availabity of the program Gaussian.

Furthermore, the programs in this package have to be compiled with a Fortran compiler (e.g., g77). To obtain an executable form the source file program.f type (replace g77 with your compiler's name, if different)

>g77 -O4 -o program program.f

I. Ligand preparation The programs ans scripts used for praparation make sure that there is a single directory containing files with a single molecule each, with likely valid partial charges.

I.1. Prepare individual .mol2 or .pdb* files from a single file The program splitmol (written by D.A. Gschwend and adapted to xlf by M. Mezei) reads in a file containing several .mol2 or .pdb* structures and creates several files with a limited number (e.g., one) structures each. A sample run looks like this:

% splitmol
 Running splitmol DAG/MM v.2.16...        24-Oct-0

 This program supports command-line arguments.
 Type splitmol -h for a brief description.

 Use PDB-type or MOL2 file (<P>/M)? m
 Enter name of the MOL2 file: ../split_test.mol2
 Enter name to add to the molecule number: xxx
 Enter starting and ending molecule numbers to extract.
 (Enter 0 0 to do all, enter a negative number to use a list file.) 0 0
 Split this file into groups of how many molecules? 1
 Label by absolute or relative molecule # (<A>/R)? a
 Output file names will be of the form <number>.xxx.mol2
 All molecules in input will be extracted.
 Each molecule will be put into an individual file.
 Absolute molecule numbers will be used.

 Working...

 Execution completed.
 2 molecules written.
 2 file(s) created.
%

I.2. Filter the ligand directory

The C-shell script filtermol2.csh filters files by the following criteria:

Running filtermol2.csh requires the executables prepmol2 and filtermol2.

A sample run looks like this:

% filtermol2.csh
Filtering a directory of .mol2 files - Version 09/26/2008
This script requires the following files to be present in this directory:
     The executable program prepmol2
     The executable program filtermol2

Name of the input directory containing the ligand .mol2 files=../mol2_files_t
Name of the directory containing the filtered ligand .mol2 files=TEST
 Checking for filtermol2
 Checking for prepmol2
Maximum molecular weight to keep=250
Current directory: /Users/mezei/autodock4/fullscreen

Number of files found in ../mol2_files_t : 52
Number of files kept: 47
%
A file filtermol2.log will contain detailed information about the filtering of each ligand.

I.3. Get a list of Autodock 4 atom types used in a ligand library

The C-shell script get_typlist.csh reads all the ligand structures in a directory, extracts a list of Autodock 4 atom types and creates a file all.pdbqt with atoms having all the atomtypes found - this file will be needed for the screening run.

Running get_typlist.csh requires the executable get_typlist and the template file all0.pdbqt. It also uses Python and thus has to set the path to it. Current implementation runs on Faradis only,

A sample run looks like this:

 % get_typlist.csh 
Gather atom types used in this library - written by Mihaly Mezei
Version 10/22/2007 - Autodock 4
Running Thu Oct 25 11:23:52 EDT 2007

Running on faradis.mssm.edu
Host specific information (to be customized):
    MGLROOT = /Library/MGLTools/1.4.5
    pythonutil = /Library/MGLTools/1.4.5/MGLToolsPckgs/AutoDockTools/Utilities24
End host specific information

Name of the ligand  database file contaning the .mol2 files=../mol2_files_t
/private/var/automount/Users/mezei/autodock4/mol2_files_t
Number of files found in ../mol2_files_t : 3
 get_typlist: file 000007.pdbqt opened OK
 get_typlist: file all0.pdbqt opened
 Read  26  lines from 000007.pdbqt
 Read  24 lines from all0.pdbqt
 Types found so far= C  OA
 Types found so far= C  OA A  Cl NA HD
 get_typlist: file all0.pdbqt opened as new
 mol2 file No: 1
 get_typlist: file 000008.pdbqt opened OK
 get_typlist: file all.pdbqt opened
 Read  34  lines from 000008.pdbqt
 Read  24 lines from all.pdbqt
 Types found so far= C  OA A  Cl NA HD
 Types found so far= C  OA A  Cl NA HD N 
 get_typlist: file all.pdbqt opened as new
 mol2 file No: 2
 get_typlist: file 000009.pdbqt opened OK
 get_typlist: file all.pdbqt opened
 Read  21  lines from 000009.pdbqt
 Read  24 lines from all.pdbqt
 Types found so far= C  OA A  Cl NA HD N 
 mol2 file No: 3
 get_typlist: file 000011.pdbqt opened OK
 get_typlist: file all.pdbqt opened
 Read  23  lines from 000011.pdbqt
 Read  24 lines from all.pdbqt

Final all.pdbqt file (copied to /private/var/automount/Users/mezei/autodock4/alkynes):
REMARK  3 active torsions:
REMARK  status: ('A' for Active; 'I' for Inactive)
ROOT
ATOM      1  C1  <0> d           1.327   3.455  -0.398  0.00  0.00     0.036 C
ATOM      2  O2  <0> d           1.420   1.980  -0.001  0.00  0.00    -0.032 OA
ATOM      3  C1  <0> d           2.376   1.251  -0.914  0.00  0.00     0.031 A
ATOM      4 C12  <0> d           3.741   1.077  -0.220  0.00  0.00     0.035 Cl
ATOM      5  N5  <0> d           4.048   2.342   0.567  0.00  0.00     0.013 NA
ATOM      6  H19 <0> d           3.145   2.387   1.810  0.00  0.00     0.008 HD
ATOM      7  N9  <0> d           1.805   1.812   1.483  0.00  0.00     0.013 N
ATOM      8  C8  <0> d           0.592   2.429   2.198  0.00  0.00     0.027 C
ATOM      9  C9  <0> d          -0.610   1.949   1.340  0.00  0.00     0.006 C
ATOM     10  C10 <0> d          -0.017   1.420   0.010  0.00  0.00     0.240 C
ATOM     11  C13 <0> d           3.178   3.801   2.370  0.00  0.00     0.033 C
ATOM     12  C14 <0> d           4.601   4.036   2.906  0.00  0.00     0.070 C
ATOM     13  C15 <0> d           5.624   3.720   1.831  0.00  0.00    -0.059 C
ATOM     14  C16 <0> d           6.588   4.614   1.625  0.00  0.00    -0.092 C
ATOM     15  C17 <0> d           7.646   4.378   0.641  0.00  0.00     0.383 C
ATOM     16  O2  <0> d           8.375   5.279   0.269  0.00  0.00    -0.452 OA
ATOM     17  C18 <0> d           7.797   2.971   0.104  0.00  0.00     0.029 C
ATOM     18  C19 <0> d           6.408   2.416  -0.189  0.00  0.00     0.030 C
ATOM     19  C20 <0> d           5.503   2.449   1.031  0.00  0.00     0.172 C
ENDROOT
TORSDOF 0
Processed 23 .mol2 files
% 

I.4. Replace the .mol2 partial charges with AM1-calculated charges

The C-shell script gausscharge.csh runs Gaussian for all ligands in a directory, and replaces the partial charges with the result of the Mulliken population analysis of their AM1 wavefunction. Optionally, the conformation can be minimized with Gaussian. Running gausscharge.csh requires the C-shell script mol2togauss.csh and the executables prepmol2, gausstomol2, mol2togauss. mol2togauss prepares the input file for Gaussian and gausstomol2 extracts the charges from the Gaussian output file and replaces the values in the .mol2 file.

The programs mol2togauss and gausstomol2 can be run independently. The first command-line argument is the full name of the .mol2 file. mol2togauss takes the second argumnt as an indicator for optimization: it will set up input for an optimization run if the first character is 'y' or 'Y'. For example,
% mol2togauss test.mol2 yes
will create an input file test.mol2.g99 from the structure in test.mol2 asking for optimization.
% gausstomol2 test.mol2
will replace the charges and coordinates of test.mol2 by the charges and coordinates calculated by Gaussian (read from the Gaussian output file test.mol2.g99out) and write a new mol2 file called test.mol2.am1.

A sample run looks like this:

% gausscharge.csh
Name of the directory containing the ligand .mol2 files=hits_mol2
Name of the directory to put the converted .mol2 files=hits_mol2_am1
First molecule to use [1]=
Last molecule to use [1000000]=
Name of the Gaussian executable [gaussian]=
Is your Gaussian version 03 or higher? (y/n)? n
Do you want to remove all Gaussian files (y/n)? n
Number of CPUs to use [10]=4
Do you want to optimize the structure as well (y/n)? n
Number of CPUs to be used: 4
Current directory: /hosts/fulcrum/home/mezei/MOLMOD/autodock/morph
 
Converting ligands in directory hits_mol2 to directory hits_mol2_am1
Number of files found in hits_mol2 : 2
Trying file 5236332.1.mol2
Calling prepmol2 for ligand No 1 5236332.1.mol2
 Prepmol2: file 5236332.1.mol2 opened OK
 Prepmol2: file 5236332.1.mol2.new opened OK
 Number of lines=  87 number of atoms=  37 number of bonds= 39
 Number of substitutions= 0
 Number of atoms dropped=  0 number of bonds dropped= 0
 Charge sum= 0.00000
ls: No match.
Starting conversion for 5236332.1.mol2
 mol2togauss: file 5236332.1.mol2 opened OK
 mol2togauss: file 5236332.1.mol2.g99 opened OK
 MOLECULE 5236332.1.mol2
 Number of atoms=  37 Charge sum= 0.00000
User = mezei
Host = pepi
WARNING:
Reserve directory  /hosts/pepi/reserve/mezei is not found
/hosts/pepi/scr/mezei  will be used instead
more WARNING:
You do not even have a scr directory here, the current directory:
/hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph  will be used instead

Input file: 5236332.1.mol2.g99
GAUSS_EXEDIR = /hosts/fulcrum/homes/molmod/src/g99
GAUSS_ARCDIR = /hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph
GAUSS_SCRDIR = /hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph
job 5236332.1.mol2 done
 File 5236332.1.mol2 opened OK
 File 5236332.1.mol2.am1 opened OK
 File 5236332.1.mol2.g99out opened OK
 Sum of rounded charges= -2.6822090E-07
 Adjusting the charge of atom            1 with -2.6822090E-07
 Finished writing to file 5236332.1.mol2.am1  37 atoms
Trying file 5464817.1.mol2
Calling prepmol2 for ligand No 2 5464817.1.mol2
 Prepmol2: file 5464817.1.mol2 opened OK
 Prepmol2: file 5464817.1.mol2.new opened OK
 Number of lines= 103 number of atoms=  45 number of bonds= 47
 Number of substitutions= 0
 Number of atoms dropped=  0 number of bonds dropped= 0
 Charge sum= 0.00000
ls: No match.
Starting conversion for 5464817.1.mol2
 mol2togauss: file 5464817.1.mol2 opened OK
 mol2togauss: file 5464817.1.mol2.g99 opened OK
 MOLECULE 5464817.1.mol2
 Number of atoms=  45 Charge sum= 0.00000
User = mezei
Host = pepi
WARNING:
Reserve directory  /hosts/pepi/reserve/mezei is not found
/hosts/pepi/scr/mezei  will be used instead
more WARNING:
You do not even have a scr directory here, the current directory:
/hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph  will be used instead

Input file: 5464817.1.mol2.g99
GAUSS_EXEDIR = /hosts/fulcrum/homes/molmod/src/g99
GAUSS_ARCDIR = /hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph
GAUSS_SCRDIR = /hosts/fulcrum/homes/mezei/MOLMOD/autodock/morph
job 5464817.1.mol2 done
 File 5464817.1.mol2 opened OK
 File 5464817.1.mol2.am1 opened OK
 File 5464817.1.mol2.g99out opened OK
 Sum of rounded charges= -4.0026009E-04
 Adjusting the charge of atom            1 with -4.0026009E-04
 Finished writing to file 5464817.1.mol2.am1  45 atoms
% 

II. Runnig the screening

II.1. Submitting the job(s)

The screening script fullscreen.csh requires individual .mol2 or .pdbqt files in a single directory. It also requires a sample file all.pdbq (for Autodock 3) or all.pdbqt (for Autodock 4) containing all the atom types that are used in the library.

There are placeholders in the script to specify the path to the

If these are not changed in the script to the particular system the screening is to be run, the user will be prompted for them.

The script fullscreen.csh asks the user to specify

fullscreen.csh creates a directory macro_dock and a log file macro.log, and calls the script screenlist_loop_3.csh or screenlist_loop_4.csh for Autodock 3 or 4, resp. These shell scripts prepare the input for Autogrid, run it and for each ligand prepare the input file for Autodock and run it. They use the executable prepmol2 and a system-dependent script submit_*.csh. For systems using the Sun grid engine the shell script dockit_gridengine.csh will work with Autodock4.

Currently, screenlist_loop_3.csh can be run on the SGI systems served by Fulcrum or on Faradis, on other systems the path to the Autodock executables have to be changed. screenlist_loop_4.csh runs on systems wuth the Sun grid engine (e.g., Faradis and Nagini), on TACC/Ranger using the launcher utility and on any Unix/Linux system using a single CPU. For other queing systems it has to be extended.

The script runs Autogrid on a single compute noder. Once the grid is ready, it submits the allowed number of docking jobs to run on different CPU's. It creates a directory runcount where each job creates a file when starting the run and deleting it when the run ends. The number of files in runcount control the submission of subsequent jobs. Note, that the number of CPU's to be used can be modified during the run - see the instruction printed in the sample run below.

There is also an option to run Autogrid and the various scripts preparing the input for the docking runs (.pdbqt and .dpf files for each ligand) but not running Atutodock. Using this option allows the perparation of a system that can be run on a different system (e.g. on a supercomputer) that has only Autodock installed, but not the script libraries.

This option is complemented by the option of skipping all preliminaries and assume that all grid map (created my Autogrid) and docling input files are already present and just perform the docking runs.

A sample run looks like this:

 % fullscreen.csh

                Automated screening using Autodock
             Written by Mihaly Mezei - version 10/18/2009

Will the grid have more than 128 points in any direction (y/n)? n
Server is Nagini - distributed memory system using the Sun grid engine
Normal size grid
Name of the macromolecule file (without the .pdbqs or .pdbqt)=PAH2_m
PAH2_m.pdbqt found - run will use Autodock 4
Do you have flexible residues (y/n) [n]?
This script requires the following files to be present in this directory:

     The executable program prepmol2
     The executable program checkresnum
     The sample structure file with all atom types all.pdbqt
     The awk scripts mod_gpf.awk
     The csh scripts add_dpf.csh
     The c-shell script screenlist_loop_4.csh
     The c-shell script dockit_gridengine.csh

Checking checkresnum
Checking prepmol2
Checking mod_gpf.awk
Checking mod_dpf.awk
Checking  screenlist_loop_4.csh
Checking  dockit_gridengine.csh
 Checking residue numbers in file PAH2_m.pdbqt
 Residue number check - Version 09/23/08
Number of CPUs to use [20]=75
NOTE: the number of CPUs can be changed by creating a file in this directory
      called PAH2_m.NEWNCPU containing the new number of CPUs and nothing else
First molecule to use [1]=
Last molecule to use [99999999]=
Number of dockings per ligand (50 or more recommended) [50]=100
RMSD tolerance for clustering (in A) [1.0]=2
Maximum number of GA energy evaluations (0 torsions) [250000]=
Maximum number of GA energy evaluations (1-2 torsions) [500000]=
Maximum number of GA energy evaluations (3-5 torsions) [1000000]=
Maximum number of GA energy evaluations (6-10 torsions) [2000000]=
Maximum number of GA energy evaluations (11-  torsions) [3000000]=
Maximum number of GA generations [27000]=
Maximum number of GA populations [200]=
Ligand charge option
     Add Gasteiger charges : g
     Add Kollman   charges : k
     Keep input    charges : i [g]

X[Grid center] (in A) [0]=-1
Y[Grid center] (in A) [0]=21
Z[Grid center] (in A) [0]=9
Grid spacing (in A) [0.375]=
gridspace set to 0.375
Number of gridpoints in the X direction (even number)=96
Number of gridpoints in the Y direction (even number)=96
Number of gridpoints in the Z direction (even number)=96
Do you want to ignore symmetry for ligand pose clustering (y/n) [n]?
Do you want to print all members of the clusters (y/n) [n]?
Do you want to drop files with names of the form H.M.T
       where only the M part differs (y/n) [n]?
Do you want to partition the number of dockings for files  of the form H.M.T
       where only the M part differs (y/n) [y]?
Minimum number of docking attempts per copy [20]=
Name of the directory containing the ligand files=../CHEMBR_chir_opt_pdbqt
Are the ligand files already in .pdbqt form (instead of .mol2) (y/n) y
Do you want to only run the setup but skip docking for now (y/n) [n]? n
Do you want to remove all ligand .mol2, .pdbqt, .dpf  files (y/n) [n]? y

All ligand .mol2, .pdbqt, .dpf files will be removed
Number of CPUs to be used: 75
Calculations will be logged on file PAH2_m.log
Docking ligands in directory ../CHEMBR_chir_opt_pdbqt to macromolecule PAH2_m.pdbqt
RMSD tolerance for clustering: 2 A
Files with names of the form H.M.T where only the M part differs will
share the number of docking attempts
Minimum number of docking attempts per copy= 20
Number of dockings/ligand: 100
Maximum number of GA energy evaluations (0 torsions)=250000
Maximum number of GA energy evaluations (1-2 torsions)=500000
Maximum number of GA energy evaluations (3-5 torsions)=1000000
Maximum number of GA energy evaluations (6-10 torsions)=2000000
Maximum number of GA energy evaluations (11- torsions)=3000000
Maximum number of GA generations: 27000
Maximum number of GA populations: 200
Grid center at < -1 , 21 , 9 >
Number of gridpoints in the x, y, and z direction: 96, 96, and 96
Grid spacing = 0.375 A
Gasteiger charges will be added to the ligand
Grids will be generated for atom types (based on all.pdbqt):
1 C
2 OA
3 A
4 N
5 HD
6 Cl
7 NA
8 SA
9 P
10 F
11 Br
12 S
13 I

The rm command (to delete a file) is aliased to
For proper functioning of this script, the alias will be removed
thus files will be removed without confirmation.
OK to submit the run (y/n)? y
[1] 23784
 %
In this example, the script recognized our host and used the built-in values for the platform-dependent information. It your host is not recognized by the script, you will we prompted for those data.

II.2.Tracking the jobs

The script fullscreen.csh keeps track of the jobs running via files in the directory runcount. If a job aborts then the file corresponding to this job (that would be deleted when the job exits normally) may not be deleted. The script cleanruncount.csh checks for such files and offers to delete them.

II.3.Changing the number of CPU's used

The number of CPUs used by a job (i.e., the number of ligands docked simulatanously on different CPU's) can be changed by creating a file in the directory running the job called macro.NEWNCPU containing the new number of CPUs and nothing else. The requested change is executed by attrition: the next job will only be submitted when the currently running jobs number less than the new limit.

III. Post-processing

These functions include the sorting, filtering and extracting of the results, as well as some house cleaning.

III.1. Gather a list of docking log files

The script getdir.csh looks into the directory of docking log files and extracts the name of the grid parameter file and of all the logfiles. This information is written into a file macro.dir and used by the program dockres.

A sample run looks like this:

% getdir_new.csh
Name of the macromolecule file (without the .pdbq*)=cbx_h
cbx_h.pdbqt found
Existing cbx_h.dir file is removed
Directory of .dlg files: cbx_h_dock - OK (y/n)? y
Number of files found in cbx_h_dock : 2114
Number of cbx_h .dlg files found in cbx_h_dock : 208
%

III.2. Sort, filter and extract docked poses

The program Dockres gathers the top binders and diplays a variety of statistics, both on the ligand set and on the top binding poses. It writes the result in a file with extension .res and, if requested, extracts docked poses into .pdb files. Since it can be run independently of these utilities, it has a separate documentation.

III.3. Combine results of different screening runs

The program compare_pose reads the .res files created by dockres from docking of the same ligand library to different conformations of the same macromolecule and for each ligand appearing on the top-scoring list gathered by dockres collects the binding free energies and the multiplicities of each hit.

Ligand ID's of the form nnnnnn_aa, nnnnnn_bb are assumed to be stereoisomers of the same molecule. If such ligands are found, the user are given the option of lumping stereoisomers together for calculating the averager free-energy estimates.

Running the program requires the specification of the names of the .res files. A sample run looks like this:

% compare_pose
 Name of the (next) .res file (w/o the .res)=pcafpm
 Name of the (next) .res file (w/o the .res)=pcafpl
 Name of the (next) .res file (w/o the .res)=

 Files read:
  1 pcafpm.res
  2 pcafpl.res
  FE   1  lig  FE   2  lig  FE  
 -10.6   10817 -12.4   10815
 -10.4    1865 -11.8    2029
 -10.1   11209 -11.5    2944
 -10.0    5784 -11.4   11995
 -10.0    2945 -11.3    7831
  -9.9    2945 -11.2    5782
  -9.9   10817 -11.2    1864
  -9.8    5784 -11.1    2944
  -9.7   11213 -11.1    2817
  -9.6    7833 -11.0    6629
  -9.6    1865 -10.8    1864
  -9.6   11997 -10.8    2944
  -9.6   11209 -10.8    7831
  -9.6   12316 -10.6    3793
  -9.6   10817 -10.6   10815
  -9.4   11213 -10.5    1864
  -9.4    9441 -10.4   11207
  -9.4    2411 -10.4    2029
  -9.3    2945 -10.4   11207
  -9.3    3795 -10.4   11995
 Ligand      5791272 #  10815 file#= 2 pose=  4 rank=  1
                              Contact res:   83(THR ) fe= -12.4 multiplicity=43
 Ligand      5791272 #  10815 file#= 2 pose=  6 rank= 15
                              Contact res:   42(GLU ) fe= -10.6 multiplicity=43
 Ligand      5791272 #  10815 = -11.5 # of hits=  2 Distribution:  0  2
                              # of poses=  86 Distribution=   0  86

 Ligand      5348595 #   5782 file#= 2 pose=  2 rank=  6
                              Contact res:   83(THR ) fe= -11.2 multiplicity=15
 Ligand      5348595 #   5782 = -11.2 # of hits=  1 Distribution:  0  1
                              # of poses= 101 Distribution=   0 101

 ..............
%
(output is truncated)

The ligands are sorted by the average free energy of all poses on all targets.

III.4. 'House-cleaning' of the docking log directory

When a directory contains a large number of files the ls command in the C shell is known to fail. To circumvent this, there are three scripts that clean, copy, compress and uncompress files in a directory, irrespective of the nuber of files there:

clean_dock_dir.csh and compressdir.csh accept the directory name as an argument, e.g.,
% compressdir x_dock
will process all files in the directory x_dock. If no argument is given, each script will prompt for the directory name to clean, compress or uncompress.