Part 1: Document Description
|
Citation |
|
---|---|
Title: |
GraphML files for protein sequence networks of glycoside hydrolase 19 homologues |
Identification Number: |
doi:10.18419/darus-802 |
Distributor: |
DaRUS |
Date of Distribution: |
2020-06-01 |
Version: |
1 |
Bibliographic Citation: |
Orlando, Marco, 2020, "GraphML files for protein sequence networks of glycoside hydrolase 19 homologues", https://doi.org/10.18419/darus-802, DaRUS, V1 |
Citation |
|
Title: |
GraphML files for protein sequence networks of glycoside hydrolase 19 homologues |
Identification Number: |
doi:10.18419/darus-802 |
Authoring Entity: |
Orlando, Marco (University of Milano Bicocca) |
Grant Number: |
031B0571A |
Grant Number: |
EXC2075 |
Distributor: |
DaRUS |
Access Authority: |
Pleiss, Jürgen |
Depositor: |
Buchholz, Patrick C. F. |
Date of Deposit: |
2020-05-11 |
Holdings Information: |
https://doi.org/10.18419/darus-802 |
Study Scope |
|
Keywords: |
Medicine, Health and Life Sciences, alignment, network, amino acid sequence, graph, protein sequence |
Abstract: |
GraphML files for undirected weighted graphs whose nodes represent protein sequences of glycoside hydrolase 19 homologues. Protein sequences were clustered at a threshold of 90% sequence identity to derive representative sequences. Pairwise sequence identity between two sequences was derived from a global Needleman-Wunsch alignment. Protein sequence networks were generated with pairwise sequence identity as the edge weight, and edges were filtered by a predefined identity threshold. Metadata of the nodes (e.g. annotations) and of the edges (the edge weights) are summarized in the GraphML files. |
Notes: |
The GraphML attributes for the edges comprise the edge weights (pairwise sequence identity, "weight"). The GraphML attributes for the nodes comprise the identifiers from the GH19ED ("sequence_id", "protein_id", "hfam_id", and "sfam_id" for sequence, protein, homologous family and superfamily identifiers, respectively), the NCBI taxonomy ID ("tax_id"), the annotated (organism) source name ("tax_name"), the taxonomic lineage of the source organism ("lineage", with taxa separated by "<--"), and the length of the amino acid sequence ("sequence_length"). In addition, suggested color names are given for both fill color and border color of each node ("color" and "color_border"). |
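The attribute layout described in the Notes can be read with any GraphML-aware tool. The following is a minimal sketch using only the Python standard library. The embedded sample document, its key ids (`d0` to `d3`), node ids, taxon names and values are invented for illustration and are not taken from the deposited files; the 0 to 1 scale of "weight" is also an assumption, since the record does not state whether identity is stored as a fraction or a percentage.

```python
import xml.etree.ElementTree as ET

# Namespace used by GraphML documents.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical miniature GraphML document; all ids and values are invented.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="sequence_id" attr.type="string"/>
  <key id="d1" for="node" attr.name="lineage" attr.type="string"/>
  <key id="d2" for="node" attr.name="sequence_length" attr.type="int"/>
  <key id="d3" for="edge" attr.name="weight" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="n0">
      <data key="d0">1</data>
      <data key="d1">taxon_a&lt;--taxon_b&lt;--taxon_c</data>
      <data key="d2">230</data>
    </node>
    <node id="n1">
      <data key="d0">2</data>
      <data key="d1">taxon_d&lt;--taxon_b&lt;--taxon_c</data>
      <data key="d2">228</data>
    </node>
    <edge source="n0" target="n1">
      <data key="d3">0.72</data>
    </edge>
  </graph>
</graphml>
"""


def read_graphml(text):
    """Return (nodes, edges) parsed from a GraphML string.

    nodes maps node id -> {attribute name: string value};
    edges is a list of (source, target, weight) tuples.
    """
    root = ET.fromstring(text)
    # Translate key ids (e.g. "d1") into attribute names (e.g. "lineage").
    names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
    graph = root.find("g:graph", NS)

    nodes = {}
    for node in graph.findall("g:node", NS):
        nodes[node.get("id")] = {
            names[d.get("key")]: d.text for d in node.findall("g:data", NS)
        }

    edges = []
    for edge in graph.findall("g:edge", NS):
        attrs = {names[d.get("key")]: d.text for d in edge.findall("g:data", NS)}
        edges.append((edge.get("source"), edge.get("target"), float(attrs["weight"])))
    return nodes, edges


nodes, edges = read_graphml(SAMPLE)
# The "lineage" attribute separates taxa with "<--".
taxa = nodes["n0"]["lineage"].split("<--")
print(taxa)
print(edges)
```

To apply the same function to one of the deposited files, the file content can be read into a string first (for example with `open(path, encoding="utf-8").read()`); attribute values are returned as strings and may need conversion, e.g. `int(...)` for "sequence_length".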
Methodology and Processing |
|
Sources Statement |
|
Data Sources: |
http://pfam.xfam.org/family/Glyco_hydro_19 |
https://gh19ed.biocatnet.de/ |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Title: |
Orlando M, Buchholz PCF, Lotti M, Pleiss J (2020) Large-scale exploration of sequences, substrate specificity and evolution in glycoside hydrolase family 19: the GH19 Engineering Database (submitted) |
Bibliographic Citation: |
Orlando M, Buchholz PCF, Lotti M, Pleiss J (2020) Large-scale exploration of sequences, substrate specificity and evolution in glycoside hydrolase family 19: the GH19 Engineering Database (submitted) |
Label: |
CHITs_60_percent_identity_cutoff.graphml |
Text: |
Protein sequence network for the chitinase domains from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) that are connected by edges of at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments). |
Notes: |
text/xml-graphml |
Label: |
ELYSs_60_percent_identity_cutoff.graphml |
Text: |
Protein sequence network for the endolysin domains from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) that are connected by edges of at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments). |
Notes: |
text/xml-graphml |
Label: |
GH19_40_percent_identity_cutoff.graphml |
Text: |
Protein sequence network for representative GH19 domains (corresponding to Pfam’s GH19 profile HMM, PF00182) from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) that are connected by edges of at least 40% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments). |
Notes: |
text/xml-graphml |
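Because each file keeps every edge at or above its stated identity cutoff, a stricter cutoff can be applied afterwards by discarding lower-weight edges and regrouping the nodes. The following sketch illustrates this with a small union-find routine. It is not the procedure used to build the deposited networks; the function name `components_at_cutoff`, the node ids and the edge weights are hypothetical, and weights are assumed to be fractions between 0 and 1.

```python
from collections import defaultdict


def components_at_cutoff(node_ids, edges, cutoff):
    """Group nodes into connected components using only edges with weight >= cutoff."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Follow parent links to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for source, target, weight in edges:
        if weight >= cutoff:
            parent[find(source)] = find(target)

    groups = defaultdict(list)
    for n in node_ids:
        groups[find(n)].append(n)
    # Largest components first; ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical edges as (source, target, weight); values are invented.
edges = [("a", "b", 0.85), ("b", "c", 0.62), ("c", "d", 0.91)]
node_ids = ["a", "b", "c", "d", "e"]

strict = components_at_cutoff(node_ids, edges, cutoff=0.8)
loose = components_at_cutoff(node_ids, edges, cutoff=0.6)
print(strict)
print(loose)
```

Raising the cutoff can only split components, never merge them, so comparing the groupings at several cutoffs shows at which identity level a cluster breaks apart. If the files store identity as a percentage, the cutoff would need to be given on that scale instead.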