GraphML files for protein sequence networks of expansin homologues (doi:10.18419/darus-624)

Document Description

Citation

Title:

GraphML files for protein sequence networks of expansin homologues

Identification Number:

doi:10.18419/darus-624

Distributor:

DaRUS

Date of Distribution:

2020-01-30

Version:

1

Bibliographic Citation:

Lohoff, Caroline, 2020, "GraphML files for protein sequence networks of expansin homologues", https://doi.org/10.18419/darus-624, DaRUS, V1

Study Description

Citation

Title:

GraphML files for protein sequence networks of expansin homologues

Identification Number:

doi:10.18419/darus-624

Authoring Entity:

Lohoff, Caroline (Universität Stuttgart)

Distributor:

DaRUS

Access Authority:

Pleiss, Jürgen

Depositor:

Buchholz, Patrick C. F.

Date of Deposit:

2020-01-27

Holdings Information:

https://doi.org/10.18419/darus-624

Study Scope

Keywords:

Medicine, Health and Life Sciences, protein sequence, graph, network, amino acid sequence, alignment

Abstract:

GraphML files for undirected weighted graphs with nodes that represent protein sequences of expansin homologues. Protein sequences were clustered by a threshold of sequence identity to derive representative sequences. Pairwise sequence identity between two sequences was derived from global Needleman-Wunsch alignment. Protein sequence networks were generated with pairwise sequence identity as edge weights, and edges were filtered by a predefined identity threshold. Metadata of the nodes (e.g. annotations) and of the edges (the edge weights) were summarized in GraphML files.
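The workflow described in the abstract (global alignment, pairwise identity, threshold filter) can be illustrated with a minimal Python sketch. It is not the pipeline used to produce the dataset: the scoring scheme (match 1, mismatch -1, linear gap -1), the definition of identity as identical columns divided by alignment length, the function names, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b.

    Returns identical columns / alignment length (an assumed definition
    of pairwise identity; the dataset may normalise differently).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment and count identical columns.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns if columns else 0.0


def build_edges(sequences, threshold):
    """Keep undirected edges whose pairwise identity is >= threshold."""
    edges = []
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        identity = global_identity(seq_a, seq_b)
        if identity >= threshold:
            edges.append((id_a, id_b, identity))
    return edges


# Hypothetical toy sequences, not taken from the dataset.
toy = {"s1": "ACDE", "s2": "ACE", "s3": "WWWW"}
edges = build_edges(toy, threshold=0.5)
print(edges)
```

With these toy inputs only the pair s1/s2 passes the 0.5 threshold, so s3 would appear as an unconnected node in the resulting network.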

Notes:

The GraphML attributes for the edges comprise the edge weights (pairwise sequence identity, "weight"). The GraphML attributes for the nodes comprise the identifiers from the ExED ("sequence_id", "protein_id", "hfam_id", and "sfam_id" for sequence, protein, homologous family and superfamily identifiers, respectively), the NCBI taxonomy ID ("tax_id"), the annotated (organism) source name ("tax_name"), the taxonomic lineage of the source organism ("lineage", with taxa separated by "<--"), and the length of the amino acid sequence ("sequence_length"). In addition, suggested color names are given for both fill color and border color of each node ("color" and "color_border").
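As a rough illustration of how the attributes listed above might be read, the following Python sketch parses a small GraphML document with the standard library only. The embedded sample is hypothetical: its key ids ("d0" to "d3"), node ids, organism names, and values are invented, and it assumes the files declare the standard GraphML namespace and store "weight" as a number. Only the attribute names and the "<--" lineage separator come from the notes.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be declared in the dataset files).
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical two-node sample; "<--" must be escaped as "&lt;--" in XML.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="tax_name" attr.type="string"/>
  <key id="d1" for="node" attr.name="lineage" attr.type="string"/>
  <key id="d2" for="node" attr.name="sequence_length" attr.type="int"/>
  <key id="d3" for="edge" attr.name="weight" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="n0">
      <data key="d0">Example organism A</data>
      <data key="d1">Example organism A&lt;--Example genus&lt;--Bacteria</data>
      <data key="d2">220</data>
    </node>
    <node id="n1">
      <data key="d0">Example organism B</data>
      <data key="d1">Example organism B&lt;--Example genus&lt;--Bacteria</data>
      <data key="d2">231</data>
    </node>
    <edge source="n0" target="n1"><data key="d3">0.62</data></edge>
  </graph>
</graphml>"""


def read_graphml(text):
    """Return (nodes, edges) with attributes keyed by their attr.name."""
    root = ET.fromstring(text)
    # Map key ids such as "d0" to readable names such as "tax_name".
    names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}
    graph = root.find("g:graph", NS)
    nodes = {}
    for node in graph.findall("g:node", NS):
        nodes[node.get("id")] = {
            names[d.get("key")]: d.text for d in node.findall("g:data", NS)
        }
    edges = []
    for edge in graph.findall("g:edge", NS):
        attrs = {names[d.get("key")]: d.text for d in edge.findall("g:data", NS)}
        edges.append((edge.get("source"), edge.get("target"), float(attrs["weight"])))
    return nodes, edges


nodes, edges = read_graphml(SAMPLE)
# Split the lineage string into individual taxa at the "<--" separator.
taxa = nodes["n0"]["lineage"].split("<--")
print(taxa)
print(edges)
```

To read one of the files from this dataset instead of the sample, the file content could be passed to `read_graphml` after opening it with `open(path, encoding="utf-8").read()`, provided the assumptions above hold for that file.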

Methodology and Processing

Sources Statement

Data Sources:

Expansin Engineering Database (https://exed.biocatnet.de/)

Carbohydrate-Active enZYmes Database (http://www.cazy.org/)

Pfam Database (https://pfam.xfam.org/)

Data Access

Other Study Description Materials

Related Publications

Citation

Title:

Lohoff C., Buchholz P. C. F., Le Roes-Hill M. & Pleiss J. (2020). The Expansin Engineering Database: a navigation and classification tool for expansins and homologues. Proteins: Structure, Function, and Bioinformatics 89:2.

Identification Number:

10.1002/prot.26001

Bibliographic Citation:

Lohoff C., Buchholz P. C. F., Le Roes-Hill M. & Pleiss J. (2020). The Expansin Engineering Database: a navigation and classification tool for expansins and homologues. Proteins: Structure, Function, and Bioinformatics 89:2.

Other Study-Related Materials

Label:

CBM63_Sfams123_210-300_90_50.graphml

Text:

Protein sequence network for the bacterial, fungal, and plant superfamily from the Expansin Engineering Database (for sequences with length between 210 and 300 residues) including members of the CBM63 family (downloaded from the CAZy database on June 3, 2019). The GraphML file contains representative nodes (clustered at a threshold of 0.9 in CD-HIT) connected by at least 50% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

GH45_Sfam_1234_Ndomain_90_30.graphml

Text:

Protein sequence network for the bacterial, fungal, plant and N-terminal domains superfamily from the Expansin Engineering Database including members of the GH45 family (from Pfam, version 32.0, accession PF02015). The GraphML file contains representative nodes (clustered at a threshold of 0.9 in CD-HIT) connected by at least 30% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

Ndomains_1234_CBM_09_60identity.graphml

Text:

Protein sequence network for N-terminal expansin domains from the bacterial, fungal, plant and N-terminal domains superfamily from the Expansin Engineering Database. The GraphML file contains representative nodes (clustered at a threshold of 0.9 in USEARCH/UCLUST) connected by at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

Sfams_123_210-300_90_50.graphml

Text:

Protein sequence network for N-terminal expansin domains from the bacterial, fungal, and plant superfamily from the Expansin Engineering Database, for sequences with length between 210 and 300 residues. The GraphML file contains representative nodes (clustered at a threshold of 0.9 in USEARCH/UCLUST) connected by at least 50% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml

Other Study-Related Materials

Label:

Sfam_123_Cdomain_90_60.graphml

Text:

Protein sequence network for C-terminal expansin domains from the bacterial, fungal, and plant superfamily from the Expansin Engineering Database. The GraphML file contains representative nodes (clustered at a threshold of 0.9 in USEARCH/UCLUST) connected by at least 60% pairwise sequence identity (edge weights derived from Needleman-Wunsch alignments).

Notes:

text/xml-graphml