
MathFeature

Feature Extraction Package for Biological Sequences Based on Mathematical Descriptors


Preprocessing

Before running any method in this package, you must run the preprocessing script to eliminate noise from the sequences (e.g., other letters such as N, K, …). To use this script, follow the example below:

Important: The methods in this package accept only sequence files in FASTA format as input.

To run the tool (Example): $ python3.7 preprocessing/preprocessing.py -i input -o output


Where:

-h = help

-i = input - FASTA format file, e.g., test.fasta

-o = output - FASTA format file, e.g., output.fasta

Running:

$ python3.7 preprocessing/preprocessing.py -i dataset.fasta -o preprocessing.fasta 
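The kind of filtering described above can be sketched in a few lines of Python. This is only an illustration, not the code in preprocessing/preprocessing.py: it assumes DNA input, assumes the alphabet A, C, G, T, and assumes that a sequence containing any other letter is dropped as a whole. The helper names `read_fasta` and `keep_clean` are hypothetical.

```python
from io import StringIO

# Assumption: DNA sequences restricted to the four standard bases.
VALID = set("ACGT")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def keep_clean(records, alphabet=VALID):
    """Keep only non-empty records whose letters all belong to `alphabet`."""
    return [(h, s) for h, s in records if s and set(s) <= alphabet]


# Tiny in-memory FASTA file; seq2 contains the noisy letters N and K.
example = StringIO(">seq1\nACGTAC\n>seq2\nACNNKT\n>seq3\nggca\n")
clean = keep_clean(read_fasta(example))

for header, seq in clean:
    print(f">{header}\n{seq}")
```

In this sketch seq2 is discarded because it contains N and K, while seq1 and seq3 are written back in FASTA format. The actual script may handle noisy sequences differently.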

Numerical Mapping

This method generates a numerical mapping of every sequence. We provide 7 mappings; the theory can be consulted in this article. Note that the method generates a vector with the size of the largest sequence. The script applies all steps automatically, so it is necessary to pass all the classes/labels that will form the dataset. To use this method, follow the example below:

To run the code (Example): $ python3.7 methods/MappingClass.py -n number of datasets/labels -o output -r representation


Where:

-h = help

-n = number of datasets/labels

-o = output - CSV format file, e.g., test.csv

-r = representation/mappings, e.g., 1 = Binary, 2 = Z-curve, 3 = Real, 4 = Integer, 5 = EIIP, 6 = Complex Number, 7 = Atomic Number.

Running:

$ python3.7 methods/MappingClass.py -n 2 -o dataset.csv -r 2

Note: Input sequences for feature extraction must be in FASTA format.

Note: This example will generate a CSV file with the extracted features.
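To make the idea of a numerical mapping concrete, the sketch below converts DNA sequences to numbers and pads them to the length of the longest sequence. It is not the code in methods/MappingClass.py. The EIIP and Integer tables use values commonly cited in the literature, and it is an assumption that the package uses the same ones. Zero padding and the helper names `map_sequence` and `map_dataset` are also assumptions made for illustration.

```python
# Commonly cited EIIP values for nucleotides (assumed, not taken from the package).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# Commonly cited Integer mapping (assumed, not taken from the package).
INTEGER = {"T": 0, "C": 1, "A": 2, "G": 3}


def map_sequence(seq, table):
    """Replace each base with its numerical value from `table`."""
    return [table[base] for base in seq.upper()]


def map_dataset(seqs, table, pad_value=0):
    """Map every sequence and right-pad to the length of the longest one."""
    mapped = [map_sequence(s, table) for s in seqs]
    width = max(len(row) for row in mapped)
    return [row + [pad_value] * (width - len(row)) for row in mapped]


sequences = ["ACGT", "GA"]

integer_rows = map_dataset(sequences, INTEGER)
eiip_rows = map_dataset(sequences, EIIP)

for row in integer_rows:
    print(row)
for row in eiip_rows:
    print(row)
```

Every row ends up with the same number of columns, which is why the output vector has the size of the largest sequence. In this sketch the pad value 0 coincides with the Integer code for T, so padded positions cannot be told apart from T under that mapping.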