How to run bootstrap based variance estimation method using sparx?

Bootstrap based variance estimation method (from now on referred to as "bootstrap") is a method to evaluate the flexibility of a 3D object from its CryoEM images proposed by Penczek et al. (Penczek 2006). The input of this method is a set of 2D projections while the output is a 3D variance map. The theory of bootstrap method is rather complicated, please refer to the paper mentioned above for details. The algorithm itself however is fairly simple: assuming N 2D projections are provided, for each iteration, N projections are randomly selected (one projection can appear multiple times) and reconstruction on these selected projections yields a bootstrap volume, the 3D variance map is calculated as the variance of a set of bootstrap volumes. We will discuss later how many bootstrap volumes are needed to obtain a reliable variance map.

The whole procedure to use SPARX implementation of bootstrap method can be divided into 5 steps:

1. select evenly distributed projections (using sxstat_even.py and sxselect_even.py).

2. normalize the projections (using sxnormal_prj.py, MPI version availiable).

3. generate scratch files (using sxbootstrap_genbuf.py, MPI version availiable).

4. generate bootstrap volumes (using sxbootstrap_run.py, MPI version availiable).

5. run analysis to get variance map (using sxincvar.py)

Step 1. select evenly distributed projections

The input of bootstrap method is a set of 2D projections which of course should be in SPARX's HDF format with the appropriate headers set including the orientation information (phi, theta, psi, s2x, s2y) and the CTF information (ctf_applied, voltage, Cs, amp_contrast, defocus and B_factor).

It should be noted here that the time for bootstrap calculation is almost proportional to the square of number of projections. First, more time will be needed to yield one bootstrap volumes for more projections; second, more bootstrap volumes will be needed to get the variance map to converge due to the nature of bootstrap method. This fact makes it infeasible for many cases to use all the projections user have for bootstrap calculation. Therefore, which projections to select becomes an important issue.

SPARX provides a method to select projections according to the distribution of orientation and the defocus value. The reason these two criteria were adopted is: first, it is our experience that unevenly distributed orientation will cause distortions in variance map; second, the images from higher defocus value are usually more reliable then the ones of lower defocus value.

Two python scripts have been written for the selection of projection directions.

The usage of sxstat_even.py is:

The purpose of sxstat_even.py is to run a statistics on a set of projections about their orientation and defocus. It generates a set of evenly distributed orientations (for the given delta) and assign projections to their closest orientation. The output of it is a text file which contained projections sorted by orientation and their defocus value.

The script sxselect_even.py is used to select projections using the output of sxstat_even.py. The usage is:

for each orientation, nprj projections which have highest defocus values will be selected and written to output_stack.hdf.

Step 2. normalize the projections

To normalize the projections, use the command sxnormal_prj.py:

The results are stored in the directory outdir. Normally, one would use this command on a cluster. In that case, the results are saved in the directory outdir with filenames prj0000.hdf, prj0001.hdf, prj0002.hdf, etc. To concatenate all files, use command pack.py in directory sparxroot/sparx/test/bootstrap:

Here, file_prefix is the prefix of the files you wish to concatenate (should be prj here), nproc is the number of processors used for normalization, output.hdf is the file to save the result of concatenation.

Step 3. generate scratch file

The reconstruction algorithm used with bootstrap method of SPARX is the nearest neighboring scheme together with 4 times padding, which is a type of direct Fourier inversion reconstruction algorithm. Its basic idea is to do 2D Fourier transform for each projection and insert them back into the 3D Fourier volume. The advantage of this algorithm is that one can perform the 2D Fourier transform in the first place and store the results as a scratch file. This scratch files can then be used for reconstruction directly.

The program to generate scratch file is sxbootstrap_genbuf.py located in SPARXROOT/sparx/bin, the usage is very simple:

two files will be generated by this program: file_prefix.bin which is a binary file containing the result of 2D Fourier transform, the other is file_prefix.txt which records the header information.

An very important issue about sxbootstrap_genbuf.py is the size of the scratch file will be about 16 times bigger than the original projection file because of the four times padding method used to gain finer grid and to reduce interpolation error, which may sometimes use up the free disk spaces and cause multiple problems.

Step 4. generate bootstrap volumes

The program which generates bootstrap volumes is sxbootstrap_run.py locate in directory SPARXROOT/sparx/bin. The usage is:

user needs to specify number of bootstrap volumes to be generated (see step 5), signal-to-noise ratio and the symmetry. The volume will be generated as vol_prefix00xx.hdf in which xx is the id of the cpu (please keep in mind bootstrap_run is an MPI program and each cpu has an ID and generate bootstrap volumes independently).

Step 5. determine the number of the bootstrap volumes and calculate the variance

After the first step, user should now have a set of files containing bootstrap volumes. The next step is to calculate the variance map. The program which provides this feature is sxincvar.py. The program generates 3 sets of variances: on odd volumes, on even volumes and on all volumes. The variance is generated incrementally. For every 100 bootstrap volumes (50 for odd and 50 for even), the odd and even variance will be generated and the cross-correlation coefficient between them will be printed as a criterion of the stability of the variance. Usually ccc > 0.80 could indicate a reliable variance map.

The usage of sxincvar.py is:

Here, nfile is the number of volume files generated in step 4 (is also the number of processors used). nprj is the number of projections you used, outputfile should be an .hdf file. filt_l and filt_h are the parameters for low-pass filtration. They are related to the resolution of the projections. In the FSC curve of the projections, find out the frequency corresponding to FSC equals 0.5. The suggested value of low pass filtration is around half of the frequency. radccc is the radius for cross-corelation coefficient evaluation which should be a tighter mask of the object, i.e. inside this radius should be the core of the object.

It should be noted here that sxincvar.py has not been parallelized yet, therefore it should be run using single processor.

The variance map is stored as file specified by outputfile.

One can also do a Principal Component Analysis (PCA) using the results from calculating the variance.

Here, filt_vol_prefix is the filtered volume results from sxincvar.py, currently hard-wired to btwl_cir_vol_prefix, where vol_prefix should be same as the one used in sxincvar.py. n_begin and n_end determine how many volume files should be used, average_output is the average output file generated from sxincvar.py, pca_output.hdf stores a stack of principal componenets, n_eigen is the number of eigenvectors you want to generate.

After generating the pca_output.hdf, it is recommended that each volume be multipled by the square root of its eigenvalues, which is stored in the header of each volume.

Sample PBS job file

Considering most bootstrap calculation is carried on PC clusters, here we provide a sample script for bootstrap calculation using PBS system. It can be easily changed to other systems such as SUN grid engine.

#PBS -r n -l cput=100:00:00 -l walltime=100:00:00
#PBS -q workq
#PBS -l nodes=32:ppn=2
cd /your_directory
cp $PBS_NODEFILE nodelist
uniq nodelist > singlenodelist

mpirun -machinefile singlenodelist -np 32 SPARXROOT/sparx/bin/sxbootstrap_genbuf.py prj.hdf $PBSTMPDIR/tmpslice

mpirun -machinefile $PBS_NODEFILE -np 64 SPARXROOT/sparx/bin/sxbootstrap_run.py prj.hdf $PBSTMPDIR/tmpslice $PBSTMPDIR/bs_efg_vol 10240 --snr=1.0

foreach i (`uniq nodelist`)
    ssh $i cp "$PBSTMPDIR/bs_efg_vol*.hdf" /garibaldi/people-a/jbrown/sparx/test/EFG-ALL-2/run1/
end

touch done


HowToBootstrap (last edited 2008-04-11 19:44:29 by zweig)