package SmotifTF::RankEnumeratedStructures;

use 5.8.8 ;
use strict;
use warnings;

use File::Spec::Functions qw(catfile catdir);
use SmotifTF::GeometricalCalculations;
use SmotifTF::Protein;
use Data::Dumper;
use Carp;
use Storable qw(dclone);
use Cwd;

BEGIN {
    use Exporter ();
    our ( $VERSION, @ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS);
    $VERSION = "0.05";

    #$AUTHOR  = "Vilas Menon(vilas\@fiserlab.org )";
    @ISA = qw(Exporter);

    #Name of the functions to export
    @EXPORT = qw(
        rank_structures
        pre_rank_structures
    );

    #Name of the functions to export on request
    @EXPORT_OK = qw(
        get_energy
    );
}

use constant DEBUG => 0;
our @EXPORT_OK;
use Config::Simple;
my $config_file = $ENV{'SMOTIFTF_CONFIG_FILE'};
croak "Environment variable SMOTIFTF_CONFIG_FILE must be set" unless $config_file;
my $cfg = Config::Simple->new($config_file)
    or croak "Unable to read configuration file $config_file: " . Config::Simple->error();

my $pdb      = $cfg->param(-block=>'pdb' );
my $PDB_PATH = $pdb->{'pdb_path'};
my $USER_SPECIFIC_PDB_PATH = $pdb->{'user_specific_pdb_path'};


=head1 NAME

SmotifTF::RankEnumeratedStructures - rank enumerated structures with a composite energy function

=head1 VERSION

Version 0.05

=cut

our $VERSION = '0.05';

=head1 SYNOPSIS

This module ranks all the enumerated structures using a composite energy function
that consists of four components:

=over 4

=item * Radius of gyration

=item * Solvation potential

=item * Hydrogen bond potential

=item * Statistical potential

=back

    use SmotifTF::RankEnumeratedStructures;

    my ($sterlimit, @st) = pre_rank_structures($smotifs);
    rank_structures($pdbcode, $sterlimit, @st);

=head1 EXPORT

Exported by default:

    rank_structures
    pre_rank_structures

Exported on request:

    get_energy

=head2 pre_rank_structures

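Prepares the arguments for rank_structures. Given the number of smotifs, it returns the
steric clash limit (C<$smotifs/2 + 1>) followed by the four column indices
(C<2*$smotifs> through C<2*$smotifs + 3>) of the scoring function components.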
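
As an illustration, the following standalone sketch reproduces this arithmetic outside the
module. The helper name C<demo_pre_rank> is hypothetical and exists only for this example;
it is not part of the module.

```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for pre_rank_structures, shown for illustration only.
    sub demo_pre_rank {
        my ($smotifs) = @_;
        my $sterlimit = $smotifs / 2 + 1;                    # steric clash limit
        my @st        = (2 * $smotifs .. 2 * $smotifs + 3);  # score column indices
        return ($sterlimit, @st);
    }

    my ($sterlimit, @st) = demo_pre_rank(4);
    print "sterlimit=$sterlimit indices=@st\n";    # sterlimit=3 indices=8 9 10 11
```

Because Perl's C</> is floating-point division, an odd number of smotifs gives a fractional
limit (3.5 for five smotifs).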

=cut

sub pre_rank_structures {
	my ($smotifs) = @_;
	my $sterlimit = $smotifs/2 + 1;             # hardwired criterion for steric clashes
	my @st = (2*$smotifs .. 2*$smotifs + 3);    # indices of the four scoring components
	return ($sterlimit, @st);
}

=head2 rank_structures

This subroutine ranks the structures generated by the full enumeration of the candidate smotif
combinations. The ranking takes place in two parts: the full set is ranked using a 'coarse' scoring
function, and the top 1000 structures are re-ranked using a 'refined' scoring function. Both
functions use 4 scoring component values: radius of gyration, statistical pairwise contact
potential, implicit solvation potential, and long range H bond potential.
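
A minimal sketch of the two-stage ranking described above, assuming toy data rather than real
enumeration output. The weights are the ones hard-coded in rank_structures; the helper
C<score>, the toy rows, and the cut-off of 3 are hypothetical and chosen only to keep the
example short.

```perl
    use strict;
    use warnings;

    my @coarse  = (1.76, 0.54, 0.8, 1);      # coarse weights from rank_structures
    my @refined = (1.3,  0.8,  0.2, 1.5);    # refined weights from rank_structures

    # Hypothetical toy data: each row holds only the four z-scores.
    my @rows = ([1, 1, 1, 1], [0, 2, 0, 0], [2, 0, 0, 0], [0, 0, 0, 3]);

    # Weighted sum of the four components of one row.
    sub score {
        my ($row, $w) = @_;
        my $s = 0;
        $s += $w->[$_] * $row->[$_] for 0 .. 3;
        return $s;
    }

    # Stage 1: sort by coarse score (ascending) and keep the best $keep rows.
    my $keep   = 3;    # the module re-ranks 1000; 3 is used here for brevity
    my @stage1 = sort { score($a, \@coarse) <=> score($b, \@coarse) } @rows;
    splice(@stage1, $keep) if @stage1 > $keep;

    # Stage 2: re-rank the survivors with the refined weights.
    my @stage2 = sort { score($a, \@refined) <=> score($b, \@refined) } @stage1;
    print join(' ', @$_), "\n" for @stage2;
```

Lower scores rank first in both stages, so a structure can change position between the coarse
and the refined ordering.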

INPUT ARGUMENTS

=over 4

=item 1. C<$pdbcode>

The 4-character name of the folder that stores the input and output data.

=item 2. C<$sterlimit>

Steric clash limit: only structures with fewer clashes than this value are kept. The clashes
are calculated during the enumeration and are part of the input file; they are not calculated
by this module.

=item 3. C<@st>

List of 4 numbers corresponding to the indices of the scoring function components in the
tab-delimited output file from the full enumeration. Index numbering starts from 0 (not 1).
See "INPUT FILES" for further details.

=back
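
The way these arguments filter the enumeration output can be sketched as follows. This is a
standalone illustration under stated assumptions: the three input lines are made up for the
example, and the limit and indices are the values that correspond to 4 smotifs.

```perl
    use strict;
    use warnings;

    my $sterlimit = 3;                  # limit corresponding to 4 smotifs
    my @st        = (8, 9, 10, 11);     # score columns for 4 smotifs

    # Hypothetical enumeration lines: the last field is the steric clash count.
    my @lines = (
        "1.4 0.7 1.8 8.3 11 22 33 44 1.7 0.9 0.9 1.2 8.8 58.8 12 0 0 0 0",
        "1.4 0.7 1.8 8.3 11 22 33 44 1.7 0.9 0.9 1.2 8.8 58.8 12 0 0 0 5",
        "truncated line",
    );

    my @kept;
    for my $line (@lines) {
        my @f = split /\s+/, $line;
        next unless @f > $st[0] + 3;        # row must contain all four score columns
        next unless $f[-1] < $sterlimit;    # strictly fewer clashes than the limit
        push @kept, \@f;
    }
    print scalar(@kept), " of ", scalar(@lines), " lines kept\n";    # 1 of 3 lines kept
```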


REQUIRED FILES (all to be found in the <pdbcode> directory)

=over 4

=item * <pdbcode>.out

File containing a list of start and end points of smotifs in the query protein, as well as
secondary structure and loop lengths. This is one of the standard output files of the
generate_shift_files.pl script.

=item * <pdbcode>_motifs_best.csv

File containing a list of candidates for each putative smotif. This is one of the standard
output files of the findranks.pl script.

=back

INPUT FILES

In the <pdbcode> directory, a set of files holding the results of the full enumeration. These are the standard
output files from the all_enum.pl script. The first line of each file is treated as a header and skipped; every
other line has the following format.

Sample line for a structure with 4 smotifs:

    1.437   0.740   1.867   8.377   224162 148918 54194 127698      1.7483  0.9973  0.9616  1.2306  8.8294  58.8240 12 0 0 0        0

Explanation:

    1.437   0.740   1.867   8.377   : RMSDs of the 4 smotif components individually
    224162  148918  54194   127698  : Nids of the 4 smotif components
    1.7483                          : Per-residue radius of gyration z-score
    0.9973                          : Per-residue pairwise contact potential z-score
    0.9616                          : Per-residue solvation potential z-score
    1.2306                          : Long-range H-bond potential z-score
    8.8294                          : Overall structure RMSD (from solved structure)
    58.8240                         : Overall structure GDT_TS score
    12 0 0 0                        : List of indices of smotifs, as found in the <pdbcode>_motifs_best.csv file
    0                               : Number of steric clashes

In this case, the indices for the scoring function components are 8, 9, 10, and 11. In general, the indices run from
2*n through 2*n+3 inclusive, where n is the number of smotifs.
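
As a quick check of this index rule, the sketch below splits the sample line and pulls out the
four scoring components. It is a standalone illustration; the variable names are chosen for
the example and do not appear in the module.

```perl
    use strict;
    use warnings;

    # The sample line from this section (4 smotifs).
    my $line = "1.437   0.740   1.867   8.377   224162 148918 54194 127698      "
             . "1.7483  0.9973  0.9616  1.2306  8.8294  58.8240 12 0 0 0        0";

    my $n      = 4;                          # number of smotifs
    my @fields = split /\s+/, $line;
    my @st     = (2 * $n .. 2 * $n + 3);     # 8, 9, 10, 11
    my ($rg, $contact, $solv, $hbond) = @fields[@st];

    print "Rg=$rg contact=$contact solvation=$solv hbond=$hbond\n";
    print "clashes=$fields[-1]\n";
```

The first C<2*n> fields are the per-smotif RMSDs and Nids, which is why the scoring components
start at index C<2*n>.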

OUTPUT FILES

In the <pdbcode> directory:

=over 4

=item 1. <pdbcode>_ranked_coarse.csv

Top 5000 structures as ranked by the coarse scoring function. Each line has the same format as in the
enumeration output files (see INPUT FILES, above), with an additional final entry holding the scoring
function output for that line. This version of rank_structures does not write this file.

=item 2. <pdbcode>_ranked_refined.csv

Same as 1), but for the top structures re-ranked using the refined scoring function.

=back

=cut

sub rank_structures {
	my ($pdbcode, $sterlimit, @st) = @_;

	croak "4-letter pdb code is required"                      unless $pdbcode;
	croak "Number of allowable steric clashes is required"     unless $sterlimit;
	croak "Array of indices for energy parameters is required" unless @st;

	#Weights for scoring function

	my @coarse_weights = (1.76,0.54,0.8,1);     # coarse scoring function
	my @refined_weights= (1.3,0.8,0.2,1.5);		# refined scoring function

	#Parameter to determine how many structures to re-rank with the refined scoring function
	my $refined_rank_number = 1000;

	#Read data files and run coarse ranking of structures
	my @full;
	opendir(my $dir, $pdbcode) or croak "No directory $pdbcode: $!";
	while (my $file = readdir($dir)) {
		my $file_path = catfile($pdbcode, $file);

		#since the enumeration was split up, read each output file in succession
		next unless ( -f $file_path and $file =~ /^_.+all.+csv/ );

		my @tempset;
		open(my $in, '<', $file_path)
			or croak "Unable to open file $file_path: $!";

		my $line = <$in>;		#ignore header line
		while ($line = <$in>) {
			my @lin = split(/\s+/, $line);
			#keep rows that hold all four score columns and are below the steric clash limit
			if (scalar(@lin) > $st[0]+3 and $lin[-1] < $sterlimit) {
				push(@tempset, [@lin]);
			}
		}
		close($in);

		#keep only the top 5000 structures by coarse score to save memory and speed up ranking
		push(@full, @tempset);
		my @ranked = rank_energies(\@full, \@coarse_weights, @st);
		my $limit  = 5000;
		if (scalar(@ranked) < 5000) { $limit = scalar(@ranked) }
		my @templist;
		for (my $aa=0; $aa<$limit; $aa++) {
			push(@templist, $full[$ranked[$aa][1]]);
		}
		@full = @templist;
	}
	closedir($dir);

	my @keeplist = @full;
	#Rank the retained structures (at most 5000) with the coarse scoring function
	my @ranked = rank_energies(\@keeplist, \@coarse_weights, @st);

	#Re-rank top structures using the refined scoring function
	my @newlist;
	if ($refined_rank_number > scalar(@ranked)) { $refined_rank_number = scalar(@ranked) }
	for (my $aa=0; $aa<$refined_rank_number; $aa++) {
		push(@newlist, $keeplist[$ranked[$aa][1]]);
	}

	#Write top ranked structures derived from the refined scoring function
	my @newranked = rank_energies(\@newlist, \@refined_weights, @st);
	my $ranked_refined_csv = catfile($pdbcode, "${pdbcode}_ranked_refined.csv");
	open(my $out, '>', $ranked_refined_csv)
		or croak "Unable to open refined ranking file ($ranked_refined_csv) for $pdbcode: $!";

	for (my $aa=0; $aa<scalar(@newranked); $aa++) {
		print {$out} "@{$newlist[$newranked[$aa][1]]}\t$newranked[$aa][0]\n";
	}
	close($out) or croak "Unable to close $ranked_refined_csv: $!";
}

=head2 rank_energies

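Calculates a weighted-sum energy for each structure and ranks the structures.
It takes a reference to the list of structures, a reference to the list of weights, and the
list of column indices of the energy function components. It returns a list of
C<[score, index]> pairs sorted by ascending score, where index is the position of the
structure in the input list.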
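
The shape of the returned list can be illustrated with a small standalone sketch. The helper
C<demo_rank> is a hypothetical stand-in written for this example, and the two toy rows and
their column layout are assumptions, not real enumeration output.

```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for rank_energies, taking the same arguments.
    sub demo_rank {
        my ($rows, $weights, @st) = @_;
        my @rank;
        for my $i (0 .. $#$rows) {
            my $score = 0;
            $score += $weights->[$_] * $rows->[$i][ $st[$_] ] for 0 .. $#$weights;
            push @rank, [$score, $i];    # remember the original position
        }
        return sort { $a->[0] <=> $b->[0] } @rank;    # lowest score first
    }

    # Two toy rows whose score columns sit at indices 1..4 (illustrative layout).
    my @rows   = (['a', 2, 2, 2, 2], ['b', 1, 1, 1, 1]);
    my @ranked = demo_rank(\@rows, [1, 1, 1, 1], 1, 2, 3, 4);
    printf "%s score=%g\n", $rows[ $_->[1] ][0], $_->[0] for @ranked;
```

Keeping the original index next to each score lets the caller look up the full row after
sorting without copying the rows themselves.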

=cut

sub rank_energies {
	my ($keeplist, $coeffs, @st) = @_;
	croak "Array containing modelled structures is required"   unless $keeplist;
	croak "Weights for energy parameters is required"          unless $coeffs;
	croak "Array of indices for energy parameters is required" unless @st;

	my @rank;
	for (my $aa=0; $aa<scalar(@$keeplist); $aa++) {
		my $val = 0;
		for (my $bb=0; $bb<scalar(@$coeffs); $bb++) {
			$val += $coeffs->[$bb] * $keeplist->[$aa][$st[$bb]];
		}
		push(@rank, [$val, $aa]);	#keep the original index alongside the score
	}
	@rank = sort { $a->[0] <=> $b->[0] } @rank;	#ascending: lowest score ranks first
	return @rank;
}

=head1 AUTHOR

Fiserlab Members , C<< <andras at fiserlab.org> >>

=head1 BUGS

Please report any bugs or feature requests to C<bug-. at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=.>.  I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.




=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SmotifTF::RankEnumeratedStructures


You can also look for information at:

=over 4

=item * RT: CPAN's request tracker (report bugs here)

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=.>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/.>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/.>

=item * Search CPAN

L<http://search.cpan.org/dist/./>

=back


=head1 ACKNOWLEDGEMENTS


=head1 LICENSE AND COPYRIGHT

Copyright 2015 Fiserlab Members .

This program is free software; you can redistribute it and/or modify it
under the terms of the Artistic License (2.0). You may obtain a
copy of the full license at:

L<http://www.perlfoundation.org/artistic_license_2_0>

Any use, modification, and distribution of the Standard or Modified
Versions is governed by this Artistic License. By using, modifying or
distributing the Package, you accept this license. Do not use, modify,
or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made
by someone other than you, you are nevertheless required to ensure that
your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service
mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge
patent license to make, have made, use, offer to sell, sell, import and
otherwise transfer the Package with respect to any patent claims
licensable by the Copyright Holder that are necessarily infringed by the
Package. If you institute patent litigation (including a cross-claim or
counterclaim) against any party alleging that the Package constitutes
direct or contributory patent infringement, then this Artistic License
to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER
AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY
YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR
CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR
CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


=cut

1; # End of RankEnumeratedStructures
