GraphML files for protein sequence networks of glycoside hydrolase 19 homologues (doi:10.18419/darus-802)

Document Description

Citation

Title:

GraphML files for protein sequence networks of glycoside hydrolase 19 homologues

Identification Number:

doi:10.18419/darus-802

Distributor:

DaRUS

Date of Distribution:

2020-06-01

Version:

1

Bibliographic Citation:

Orlando, Marco, 2020, "GraphML files for protein sequence networks of glycoside hydrolase 19 homologues", https://doi.org/10.18419/darus-802, DaRUS, V1

Study Description

Citation

Title:

GraphML files for protein sequence networks of glycoside hydrolase 19 homologues

Identification Number:

doi:10.18419/darus-802

Authoring Entity:

Orlando, Marco (University of Milano Bicocca)

Grant Number:

031B0571A

Grant Number:

EXC2075

Distributor:

DaRUS

Access Authority:

Pleiss, Jürgen

Depositor:

Buchholz, Patrick C. F.

Date of Deposit:

2020-05-11

Holdings Information:

https://doi.org/10.18419/darus-802

Study Scope

Keywords:

Medicine, Health and Life Sciences, alignment, network, amino acid sequence, graph, protein sequence

Abstract:

GraphML files describing undirected weighted graphs whose nodes represent protein sequences of glycoside hydrolase 19 homologues. Protein sequences were clustered at a threshold of 90% sequence identity to derive representative sequences. Pairwise sequence identity between two representative sequences was derived from a global Needleman-Wunsch alignment. Protein sequence networks were generated with pairwise sequence identity as edge weight, keeping only edges at or above a predefined identity cutoff (40% or 60%, depending on the file). Node metadata (e.g. annotations) and edge metadata (the edge weights) are stored in the GraphML files.
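The thresholding step described in the abstract can be illustrated with a minimal Python sketch. It is not the pipeline used to produce the files. The function names, the toy identity values, and in particular the identity definition (matches divided by the number of alignment columns) are assumptions made for illustration; the record does not state which denominator was used.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two already-aligned sequences ('-' marks a gap).

    Assumption: identity = identical residues / alignment columns,
    after dropping columns in which both sequences have a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def filter_edges(identities, cutoff):
    """Keep only sequence pairs whose identity reaches the cutoff.

    `identities` maps (node_u, node_v) pairs to percent identity.
    Returns a sorted list of (node_u, node_v, weight) edges.
    """
    return [(u, v, w) for (u, v), w in sorted(identities.items())
            if w >= cutoff]


# Hypothetical identities between three representative sequences.
toy_identities = {("a", "b"): 72.5, ("a", "c"): 41.0, ("b", "c"): 60.0}
toy_edges = filter_edges(toy_identities, 60.0)
```

With a cutoff of 60, the pair with 41% identity is dropped and the pair at exactly 60% is kept, which matches the "at least" wording in the file descriptions.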

Notes:

The GraphML attributes for the edges comprise the edge weights (pairwise sequence identity, "weight"). The GraphML attributes for the nodes comprise the identifiers from the GH19ED ("sequence_id", "protein_id", "hfam_id", and "sfam_id" for sequence, protein, homologous family and superfamily identifiers, respectively), the NCBI taxonomy ID ("tax_id"), the annotated (organism) source name ("tax_name"), the taxonomic lineage of the source organism ("lineage", with taxa separated by "<--"), and the length of the amino acid sequence ("sequence_length"). In addition, suggested color names are given for both fill color and border color of each node ("color" and "color_border").
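As a rough sketch of how these attributes might be read, the following Python example parses a small, made-up GraphML document with the standard library. The attribute names ("sequence_id", "lineage", "sequence_length", "weight") and the "<--" lineage separator come from the note above; the key ids (d0, d1, ...), node ids, taxon names, and all values are placeholders, and the scale of "weight" (fraction or percent) is not stated in this record.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Made-up miniature file; ids and values are placeholders, not real data.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="sequence_id" attr.type="string"/>
  <key id="d1" for="node" attr.name="lineage" attr.type="string"/>
  <key id="d2" for="node" attr.name="sequence_length" attr.type="int"/>
  <key id="d3" for="edge" attr.name="weight" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="n0">
      <data key="d0">1</data>
      <data key="d1">TaxonA&lt;--TaxonB&lt;--TaxonC</data>
      <data key="d2">245</data>
    </node>
    <node id="n1">
      <data key="d0">2</data>
      <data key="d1">TaxonD&lt;--TaxonB&lt;--TaxonC</data>
      <data key="d2">251</data>
    </node>
    <edge source="n0" target="n1"><data key="d3">73.0</data></edge>
  </graph>
</graphml>"""


def read_graphml(text):
    """Return (nodes, edges) with attributes keyed by their attr.name."""
    root = ET.fromstring(text)
    # Map key ids (e.g. "d0") to attribute names (e.g. "sequence_id").
    names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}

    def attributes(element):
        return {names.get(d.get("key"), d.get("key")): (d.text or "")
                for d in element.findall("g:data", NS)}

    graph = root.find("g:graph", NS)
    nodes = {n.get("id"): attributes(n) for n in graph.findall("g:node", NS)}
    edges = [(e.get("source"), e.get("target"), float(attributes(e)["weight"]))
             for e in graph.findall("g:edge", NS)]
    return nodes, edges


def split_lineage(lineage):
    """Split a lineage string into taxa using the '<--' separator."""
    return [taxon.strip() for taxon in lineage.split("<--") if taxon.strip()]


nodes, edges = read_graphml(SAMPLE)
taxa = split_lineage(nodes["n0"]["lineage"])
```

Resolving attributes through the key declarations rather than hard-coding key ids keeps the sketch independent of how the ids are numbered in a given file. All values are returned as strings except the edge weight, which is converted to a float.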

Methodology and Processing

Sources Statement

Data Sources:

http://pfam.xfam.org/family/Glyco_hydro_19

https://gh19ed.biocatnet.de/

Data Access

Other Study Description Materials

Related Publications

Citation

Bibliographic Citation:

Orlando M, Buchholz PCF, Lotti M, Pleiss J (2020) Large-scale exploration of sequences, substrate specificity and evolution in glycoside hydrolase family 19: the GH19 Engineering Database (submitted)

Other Study-Related Materials

Label:

CHITs_60_percent_identity_cutoff.graphml

Text:

Protein sequence network for the chitinase domains from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) connected by edges of at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

ELYSs_60_percent_identity_cutoff.graphml

Text:

Protein sequence network for the endolysin domains from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) connected by edges of at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

GH19_40_percent_identity_cutoff.graphml

Text:

Protein sequence network for representative GH19 domains (corresponding to Pfam’s GH19 profile HMM, PF00182) from the Glycoside Hydrolase 19 Engineering Database. The GraphML file contains representative nodes (clustered at 90% sequence identity with USEARCH) connected by edges of at least 40% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml
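Because the edges in these networks are defined by an identity cutoff, raising the cutoff can split a network into more connected components. The Python sketch below illustrates this on made-up data with a small union-find. The function name, node labels, and weights are hypothetical, and weights are assumed to be on a percent scale.

```python
def connected_components(nodes, edges, cutoff):
    """Group nodes into components using only edges with weight >= cutoff."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, weight in edges:
        if weight >= cutoff:
            parent[find(u)] = find(v)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # Largest components first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical network: four nodes, three weighted edges.
toy_nodes = ["a", "b", "c", "d"]
toy_links = [("a", "b", 75.0), ("b", "c", 45.0), ("c", "d", 62.0)]

at_40 = connected_components(toy_nodes, toy_links, 40.0)
at_60 = connected_components(toy_nodes, toy_links, 60.0)
```

In this toy case all four nodes form one component at a 40% cutoff, while at 60% the 45% edge is dropped and the network splits into two components of two nodes each.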