This entry contains the data used to implement the bachelor thesis. It was investigated how embeddings can be used to analyze supersecondary structures.
Abstract of the thesis:
This thesis analyzes the behavior of supersecondary structures in the context of embeddings. For this purpose, data from the Protein Topology Graph Library was provided with embeddings. This resulted in a structured graph database, which will be used for future work and analyses. In addition, different projections were made into the two-dimensional space to analyze how the embeddings behave there.
In the Jupyter Notebook 1_data_retrival.ipynb the download process of the graph files from the Protein Topology Graph Library (https://ptgl.uni-frankfurt.de) can be found. The downloaded .gml files can also be found in graph_files.zip. These form graphs that represent the relationships of supersecondary structures in the proteins. These form the data basis for further analyses.
These graph files are then processed in the Jupyter Notebook 2_data_storage_and_embeddings.ipynb and entered into a graph database. The sequences of the supersecondary and secondary structures from the PTGL can be found in fastas.zip.
The embeddings were also calculated using the ESM model of the Facebook Research Group (huggingface.co/facebook/esm2_t12_35M_UR50D), which can be found in three .h5 files. These are then added there subsequently. The whole process in this notebook serves to build up the database, which can then be searched using Cypher querys.
In the Jupyter Notebook 3_data_science.ipynb different visualizations and analyses are then carried out, which were made with the help of UMAP.
For the installation of all dependencies, it is recommended to create a Conda environment and then install all packages there. To use the project, PyEED should be installed using the snapshot of the original repository (source repository: https://github.com/PyEED/pyeed).
The best way to install PyEED is to execute the pip install -e . command in the pyeed_BT folder. The dependencies can also be installed by using poetry and the .toml file.
In addition, seaborn, h5py and umap-learn are required.
These can be installed using the following commands:
pip install h5py==3.12.1
pip install seaborn==0.13.2
umap-learn==0.5.7
(2024-11-18)