WGS Variant Effect Predictor Fields

WGS Annotation VEP Fields

Once variant calling has completed for whole genome sequences (WGS), researchers want to know more about the variants, such as whether the variant impacts protein coding or how common the variant is in various populations.

AMP PD used tools from the Google Cloud Health team for its variant annotations. The table below lists the annotation fields, and the databases they draw on, that were included in the VEP annotation pipeline.

Field Type Mode Description
alternate_bases.CSQ RECORD REPEATED List of CSQ annotations for this alternate.
alternate_bases.CSQ.allele STRING NULLABLE The ALT part of the annotation field.
alternate_bases.CSQ.Consequence STRING NULLABLE Consequence type of this variant
alternate_bases.CSQ.IMPACT STRING NULLABLE The impact modifier for the consequence type
alternate_bases.CSQ.SYMBOL STRING NULLABLE The gene symbol
alternate_bases.CSQ.Gene STRING NULLABLE Ensembl stable ID of affected gene
alternate_bases.CSQ.Feature_type STRING NULLABLE Type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
alternate_bases.CSQ.Feature STRING NULLABLE Ensembl stable ID of feature
alternate_bases.CSQ.BIOTYPE STRING NULLABLE Biotype of transcript or regulatory feature
alternate_bases.CSQ.EXON STRING NULLABLE The exon number (out of total number)
alternate_bases.CSQ.INTRON STRING NULLABLE The intron number (out of total number)
alternate_bases.CSQ.HGVSc STRING NULLABLE The HGVS coding sequence name
alternate_bases.CSQ.HGVSp STRING NULLABLE The HGVS protein sequence name
alternate_bases.CSQ.cDNA_position STRING NULLABLE Relative position of base pair in cDNA sequence
alternate_bases.CSQ.CDS_position STRING NULLABLE Relative position of base pair in coding sequence
alternate_bases.CSQ.Protein_position STRING NULLABLE Relative position of amino acid in protein
alternate_bases.CSQ.Amino_acids STRING NULLABLE Reference and variant amino acids. Only given if the variant affects the protein-coding sequence
alternate_bases.CSQ.Codons STRING NULLABLE The alternative codons with the variant base in upper case
alternate_bases.CSQ.Existing_variation STRING NULLABLE Known identifier of existing variant
alternate_bases.CSQ.ALLELE_NUM STRING NULLABLE Allele number from input; 0 is the reference, 1 is the first alternate, etc.
alternate_bases.CSQ.DISTANCE STRING NULLABLE Shortest distance from variant to transcript
alternate_bases.CSQ.STRAND STRING NULLABLE The DNA strand (1 or -1) on which the transcript/feature lies
alternate_bases.CSQ.FLAGS STRING NULLABLE Transcript quality flags (cds_start_NF, cds_end_NF)
alternate_bases.CSQ.VARIANT_CLASS STRING NULLABLE Sequence Ontology variant class
alternate_bases.CSQ.SYMBOL_SOURCE STRING NULLABLE The source of the gene symbol
alternate_bases.CSQ.HGNC_ID STRING NULLABLE HUGO Gene Nomenclature Committee (HGNC) stable identifier of the gene symbol
alternate_bases.CSQ.CANONICAL STRING NULLABLE A flag indicating if the transcript is denoted as the canonical transcript for this gene
alternate_bases.CSQ.TSL STRING NULLABLE Transcript support level. NB: not available for GRCh37
alternate_bases.CSQ.APPRIS STRING NULLABLE Annotates alternatively spliced transcripts as primary or alternate based on a range of computational methods. NB: not available for GRCh37
alternate_bases.CSQ.CCDS STRING NULLABLE The CCDS identifier for this transcript, where applicable
alternate_bases.CSQ.ENSP STRING NULLABLE The Ensembl protein identifier of the affected transcript
alternate_bases.CSQ.SWISSPROT STRING NULLABLE Best match UniProtKB/Swiss-Prot accession of protein product
alternate_bases.CSQ.TREMBL STRING NULLABLE Best match UniProtKB/TrEMBL accession of protein product
alternate_bases.CSQ.UNIPARC STRING NULLABLE Best match UniParc accession of protein product
alternate_bases.CSQ.GENE_PHENO STRING NULLABLE Indicates if overlapped gene is associated with a phenotype, disease or trait
alternate_bases.CSQ.SIFT STRING NULLABLE The SIFT prediction and/or score, with both given as prediction(score)
alternate_bases.CSQ.PolyPhen STRING NULLABLE The PolyPhen prediction and/or score
alternate_bases.CSQ.DOMAINS STRING NULLABLE The source and identifier of any overlapping protein domains
alternate_bases.CSQ.HGVS_OFFSET STRING NULLABLE Indicates by how many bases the HGVS notations for this variant have been shifted
alternate_bases.CSQ.AF STRING NULLABLE Frequency of existing variant in 1000 Genomes
alternate_bases.CSQ.AFR_AF STRING NULLABLE Frequency of existing variant in 1000 Genomes combined African population
alternate_bases.CSQ.AMR_AF STRING NULLABLE Frequency of existing variant in 1000 Genomes combined American population
alternate_bases.CSQ.EAS_AF STRING NULLABLE Frequency of existing variant in 1000 Genomes combined East Asian population
alternate_bases.CSQ.EUR_AF STRING NULLABLE Frequency of existing variant in 1000 Genomes combined European population
alternate_bases.CSQ.SAS_AF STRING NULLABLE Frequency of existing variant in 1000 Genomes combined South Asian population
alternate_bases.CSQ.AA_AF STRING NULLABLE Frequency of existing variant in NHLBI-ESP African American population
alternate_bases.CSQ.EA_AF STRING NULLABLE Frequency of existing variant in NHLBI-ESP European American population
alternate_bases.CSQ.gnomAD_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes combined population
alternate_bases.CSQ.gnomAD_AFR_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes African/American population
alternate_bases.CSQ.gnomAD_AMR_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes American population
alternate_bases.CSQ.gnomAD_ASJ_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes Ashkenazi Jewish population
alternate_bases.CSQ.gnomAD_EAS_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes East Asian population
alternate_bases.CSQ.gnomAD_FIN_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes Finnish population
alternate_bases.CSQ.gnomAD_NFE_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes Non-Finnish European population
alternate_bases.CSQ.gnomAD_OTH_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes other combined populations
alternate_bases.CSQ.gnomAD_SAS_AF STRING NULLABLE Frequency of existing variant in gnomAD exomes South Asian population
alternate_bases.CSQ.MAX_AF STRING NULLABLE Maximum observed allele frequency in 1000 Genomes, ESP and gnomAD
alternate_bases.CSQ.MAX_AF_POPS STRING NULLABLE Populations in which maximum allele frequency was observed
alternate_bases.CSQ.CLIN_SIG STRING NULLABLE ClinVar clinical significance of the dbSNP variant
alternate_bases.CSQ.SOMATIC STRING NULLABLE Somatic status of existing variant(s); multiple values correspond to multiple values in the Existing_variation field
alternate_bases.CSQ.PHENO STRING NULLABLE Indicates if existing variant is associated with a phenotype, disease or trait; multiple values correspond to multiple values in the Existing_variation field
alternate_bases.CSQ.PUBMED STRING NULLABLE PubMed ID(s) of publications that cite existing variant
alternate_bases.CSQ.MOTIF_NAME STRING NULLABLE The source and identifier of a transcription factor binding profile aligned at this position
alternate_bases.CSQ.MOTIF_POS STRING NULLABLE The relative position of the variation in the aligned TFBP
alternate_bases.CSQ.HIGH_INF_POS STRING NULLABLE A flag indicating if the variant falls in a high information position of a transcription factor binding profile (TFBP)
alternate_bases.CSQ.MOTIF_SCORE_CHANGE STRING NULLABLE The difference in motif score of the reference and variant sequences for the TFBP
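
Because every CSQ field in the table above is typed STRING, numeric values such as allele frequencies and the SIFT or PolyPhen scores usually need to be converted before they can be filtered or compared. The following is a minimal Python sketch of that conversion, not part of the AMP PD pipeline. The helper names parse_prediction and parse_frequency are hypothetical, and the sketch assumes that a missing annotation arrives as NULL or an empty string and that a combined prediction follows the prediction(score) form described for the SIFT field.

```python
import re
from typing import Optional, Tuple

# Matches the combined "prediction(score)" form, e.g. "deleterious(0.01)".
_PREDICTION_RE = re.compile(r"^(?P<label>[^()]+)\((?P<score>[0-9]*\.?[0-9]+)\)$")


def parse_prediction(value: Optional[str]) -> Tuple[Optional[str], Optional[float]]:
    """Split a SIFT- or PolyPhen-style string into (prediction, score).

    Either element is None when the field holds only the other one,
    and both are None when the field is NULL or empty.
    """
    if not value:
        return None, None
    text = value.strip()
    match = _PREDICTION_RE.match(text)
    if match:
        return match.group("label"), float(match.group("score"))
    try:
        # Only a score was written.
        return None, float(text)
    except ValueError:
        # Only a prediction label was written.
        return text, None


def parse_frequency(value: Optional[str]) -> Optional[float]:
    """Convert an allele-frequency string (AF, gnomAD_AF, MAX_AF, ...) to a float.

    Returns None for NULL, empty, or non-numeric values (assumption:
    anything that is not a single number is treated as missing).
    """
    if not value:
        return None
    try:
        return float(value)
    except ValueError:
        return None


if __name__ == "__main__":
    print(parse_prediction("deleterious(0.01)"))
    print(parse_frequency("0.0123"))
```

Returning None instead of raising keeps rows with missing annotations usable in downstream filters; a stricter variant could raise on unexpected values instead.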