Sequence cross-references and taxonomic lineage for glycoside hydrolase family 19 (doi:10.18419/darus-1163)

Document Description

Citation

Title:

Sequence cross-references and taxonomic lineage for glycoside hydrolase family 19

Identification Number:

doi:10.18419/darus-1163

Distributor:

DaRUS

Date of Distribution:

2021-05-20

Version:

1

Bibliographic Citation:

Buchholz, Patrick C. F., 2021, "Sequence cross-references and taxonomic lineage for glycoside hydrolase family 19", https://doi.org/10.18419/darus-1163, DaRUS, V1, UNF:6:zi8TRxkq1C/pCN14pXTA0Q== [fileUNF]

Study Description

Citation

Title:

Sequence cross-references and taxonomic lineage for glycoside hydrolase family 19

Identification Number:

doi:10.18419/darus-1163

Authoring Entity:

Buchholz, Patrick C. F. (Universität Stuttgart)

Distributor:

DaRUS

Access Authority:

Pleiss, Jürgen

Depositor:

Buchholz, Patrick C. F.

Date of Deposit:

2020-11-30

Holdings Information:

https://doi.org/10.18419/darus-1163

Study Scope

Keywords:

Medicine, Health and Life Sciences, protein sequence, protein structure, taxonomy, lineage, source organism, amino acid sequence

Abstract:

The Glycoside Hydrolase 19 Engineering Database (GH19ED) contains information on protein sequences and structures of glycoside hydrolases from family 19. This dataset lists cross-references to the National Center for Biotechnology Information (NCBI), cross-references to the Protein Data Bank (PDB) and the taxonomic lineage for each sequence entry in the GH19ED.

Notes:

The tab-separated tabular file comprises nine columns:

(1) the sequence identifier from the GH19ED, integer (Sequence_id),
(2) the protein sequence accessions from the NCBI, semicolon-separated (NCBI_accessions),
(3) the PDB accessions, semicolon-separated (PDB_accessions),
(4) the name of the source or source organism (Source_name),
(5) the NCBI taxonomy identifier for the source (NCBI_taxonomy_id),
(6) the taxonomic lineage from the lowest to the highest rank, as inferred from NCBI taxonomy (Lineage),
(7) the "protein" identifier from the GH19ED, integer (Protein_id),
(8) the "homologous family" (or group) identifier from the GH19ED, integer (Homologous_family_id),
(9) the "superfamily" (or subfamily) identifier from the GH19ED, integer (Superfamily_id).

For sequence entries assigned to more than one source organism name, only the first taxonomic lineage found in the GH19ED is listed.
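The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: it assumes that GH19ED.tab starts with a header row carrying the nine column names, and the helper names (read_gh19ed, split_accessions) are hypothetical. The Lineage column is kept as a plain string because the codebook does not state its rank separator.

```python
import csv
import io

# Column names as listed in the codebook notes.
COLUMNS = [
    "Sequence_id", "NCBI_accessions", "PDB_accessions", "Source_name",
    "NCBI_taxonomy_id", "Lineage", "Protein_id", "Homologous_family_id",
    "Superfamily_id",
]
INT_COLUMNS = ("Sequence_id", "NCBI_taxonomy_id", "Protein_id",
               "Homologous_family_id", "Superfamily_id")
LIST_COLUMNS = ("NCBI_accessions", "PDB_accessions")


def split_accessions(field):
    """Split a semicolon-separated accession field; an empty field gives []."""
    return [part.strip() for part in field.split(";") if part.strip()]


def read_gh19ed(handle):
    """Yield one dict per row of a GH19ED-style tab-separated file.

    Assumption: the first line is a header with the nine column names.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        record = dict(row)
        for name in INT_COLUMNS:
            # Leave missing numeric values as None instead of failing.
            record[name] = int(row[name]) if row[name] else None
        for name in LIST_COLUMNS:
            record[name] = split_accessions(row[name])
        yield record


# Made-up placeholder row for illustration only, not taken from the dataset.
sample = "\t".join(COLUMNS) + "\n" + "\t".join(
    ["1", "ACC_A;ACC_B", "", "Example organism", "7",
     "placeholder lineage", "1", "2", "1"]) + "\n"
records = list(read_gh19ed(io.StringIO(sample)))
print(records[0]["NCBI_accessions"])
```

For the real file, the same function could be applied to a handle opened with open("GH19ED.tab", newline="", encoding="utf-8"); the encoding is likewise an assumption.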

Methodology and Processing

Sources Statement

Data Sources:

https://gh19ed.biocatnet.de/

https://www.ncbi.nlm.nih.gov/protein

https://www.rcsb.org/

https://www.ncbi.nlm.nih.gov/taxonomy

Data Access

Other Study Description Materials

Related Publications

Citation

Title:

Orlando M., Buchholz P. C. F., Lotti M. & Pleiss J. (2020). The GH19 Engineering Database: an extended classification system for exploring the properties of sequence space and protein evolution. (submitted)

Bibliographic Citation:

Orlando M., Buchholz P. C. F., Lotti M. & Pleiss J. (2020). The GH19 Engineering Database: an extended classification system for exploring the properties of sequence space and protein evolution. (submitted)

File Description--f33717

File: GH19ED.tab

  • Number of cases: 22461

  • No. of variables per record: 9

  • Type of File: text/tab-separated-values

Notes:

UNF:6:zi8TRxkq1C/pCN14pXTA0Q==
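
The case and variable counts in this file description can serve as a quick integrity check after download. The sketch below is one possible way to do that; the function name check_shape is hypothetical, and it assumes the file carries a single header line that does not count as a case. It does not verify the UNF fingerprint.

```python
import csv
import io

EXPECTED_CASES = 22461      # "Number of cases" from the file description
EXPECTED_VARIABLES = 9      # "No. of variables per record"


def check_shape(handle, expected_cases=EXPECTED_CASES,
                expected_variables=EXPECTED_VARIABLES):
    """Count data rows and confirm that every row has the expected width.

    Assumption: the first line is a header and is not counted as a case.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty")
    if len(header) != expected_variables:
        raise ValueError(f"header has {len(header)} columns")
    cases = 0
    for line_number, row in enumerate(reader, start=2):
        if len(row) != expected_variables:
            raise ValueError(f"line {line_number} has {len(row)} columns")
        cases += 1
    if cases != expected_cases:
        raise ValueError(f"found {cases} cases, expected {expected_cases}")
    return cases


# Tiny in-memory stand-in with two made-up rows of three columns each.
demo = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
print(check_shape(io.StringIO(demo), expected_cases=2, expected_variables=3))
```

Called with the defaults on a handle for GH19ED.tab, the function returns the row count only if it matches the 22461 cases and 9 variables stated above.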

Variable Description

List of Variables:

Variables

Sequence_id

f33717 Location:

Summary Statistics: Valid 22461.0; Min. 1.0; Max. 23858.0; Mean 11820.976759716817; StDev 6835.113144441923

Variable Format: numeric

Notes: UNF:6:tr2WNacahxpTOY90CWiKWA==
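
The summary statistics given for the numeric variables (valid count, minimum, maximum, mean, standard deviation) could be recomputed from a parsed column roughly as follows. This is a sketch under assumptions: the codebook does not say whether StDev is the sample or the population standard deviation, so the sample form (statistics.stdev) is used here, and the function name summarize is hypothetical.

```python
import math
import statistics


def summarize(values):
    """Return codebook-style summary statistics for a numeric column.

    None entries are treated as missing and excluded from "Valid".
    Assumption: StDev is the sample standard deviation (n - 1 denominator).
    """
    valid = [float(v) for v in values if v is not None]
    if len(valid) < 2:
        raise ValueError("need at least two valid values")
    return {
        "Valid": float(len(valid)),
        "Min.": min(valid),
        "Max.": max(valid),
        "Mean": statistics.fmean(valid),
        "StDev": statistics.stdev(valid),
    }


# Toy input, not dataset values.
stats = summarize([1, 2, 3, 4, None])
print(stats)
```

If the recomputed StDev differs slightly from the listed value, switching to statistics.pstdev would test the population form instead.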

NCBI_accessions

f33717 Location:

Variable Format: character

Notes: UNF:6:sw7YrE4Qu4aalqUUXlKShQ==

PDB_accessions

f33717 Location:

Variable Format: character

Notes: UNF:6:8paabhcg0KMzrnxHqu4s2Q==

Source_name

f33717 Location:

Variable Format: character

Notes: UNF:6:NLBXar37N1eA9yj+BHlpmw==

NCBI_taxonomy_id

f33717 Location:

Summary Statistics: Valid 22461.0; Min. 7.0; Max. 2563569.0; Mean 522380.47415521316; StDev 761131.9421858811

Variable Format: numeric

Notes: UNF:6:Q3rSJ34TQIMAew2tbzorZg==

Lineage

f33717 Location:

Variable Format: character

Notes: UNF:6:iIpo3pjBbeWpoM2yG8pCUA==

Protein_id

f33717 Location:

Summary Statistics: Valid 22461.0; Min. 1.0; Max. 23856.0; Mean 10857.60647344091; StDev 7332.244548359417

Variable Format: numeric

Notes: UNF:6:dAFBJ8MUPKrA+JwUeitddA==

Homologous_family_id

f33717 Location:

Summary Statistics: Valid 22461.0; Min. 2.0; Max. 55.0; Mean 20.40844129825125; StDev 16.810820731203552

Variable Format: numeric

Notes: UNF:6:cthdMl/FqSE+EfUrrftkRA==

Superfamily_id

f33717 Location:

Summary Statistics: Valid 22461.0; Min. 1.0; Max. 3.0; Mean 1.7500556520191046; StDev 0.6702829784262895

Variable Format: numeric

Notes: UNF:6:OG748kLSgJ+BDkFY2y3vCg==