ISSN: 2155-9627
Journal of Clinical Research & Bioethics

Addressing Benefits, Risks and Consent in Next Generation Sequencing Studies

Meller R*
Neuroscience Institute, Morehouse School of Medicine, Atlanta, USA
Corresponding Author : Meller R
Neuroscience Institute, Morehouse School of Medicine
720 Westview, Atlanta, GA, 30310, USA
Tel: 4047565789
E-mail: [email protected]
Received: October 13, 2015; Accepted: December 11, 2015; Published: December 14, 2015
Citation: Meller R (2015) Addressing Benefits, Risks and Consent in Next Generation Sequencing Studies. J Clin Res Bioeth 6:249. doi:10.4172/2155-9627.1000249
Copyright: © 2015 Meller R. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

The sequencing of the human genome and technological advances in DNA sequencing have revolutionized our ability to sequence DNA and to diagnose genetic disorders. However, requests for open access to genomic data must be balanced against the guiding principles of the Common Rule for human subjects research. Unfortunately, the risks to patients involved in genomic studies are still evolving and as such may not be clear even to learned and well-intentioned scientists. Central to this issue are the strategies that enable human participants in such studies to remain anonymous, or de-identified. The wealth of genomic data held in Internet repositories and other databases has already enabled de-identification to be broken and research subjects to be identified. Reliance on de-identification neglects the fact that DNA itself is an identifying element. It is therefore questionable whether data security standards can ever truly protect the identity of a patient, under current conditions or in the future. As Big Data methodologies advance, additional sources of data may enable the re-identification of patients enrolled in next-generation sequencing (NGS) studies. As such, it is time to re-evaluate the risks of sharing genomic data and to establish new guidelines for good practice. In this commentary, I address the challenges facing federally funded investigators who must strike a balance between compliance with federal (US) rules for human subjects research and the recent requirement for open access/sharing of data from National Institutes of Health (NIH)-funded studies involving human subjects.

Keywords
Data sharing; Human subjects' research; Next generation sequencing; Consent; Privacy
Introduction
Genome sequencing and big data
Next-generation sequencing (NGS) has exploded onto the scientific scene in the last ten years, leading to advances in our understanding of genes and genomic information [1]. Technological advances enable very high-throughput sequencing of genomes, and the NIH goal of the $1000 genome was recently reported to have been met [2]. Keeping pace with this data explosion have been advances in bioinformatic techniques and the application of powerful computational methods to collate and mine biologically important information from genomic datasets. As these techniques become more commonly used in clinical research, their power increases due to the ability to link genetic and genomic information to patient health records. Examples where these advances have increased our understanding of disease include the role of mutations in cancer pathologies, providing a molecular basis for cancer diagnosis [3], the role of genomic abnormalities in neurological disorders [4], and the development of pre-natal screening methods using maternal blood samples [5]. As such, the benefits to society can readily be judged from the health benefits of NGS technologies.
NIH genomic data sharing policy
To maximize the benefit to the community of such large (complex and expensive) datasets, the NIH recently requested that all data generated from NIH-funded projects be made available to the community under shared or controlled access. NIH Notice NOT-OD-13-119 sets forth the planned NIH genomic data sharing policy, under which all NGS data generated with federal funds are to be shared [6]. The policy states that "prior to data submission, traditional identifiers such as name, date of birth, street address, and social security number should be removed. The de-identified data are coded using a random, unique code to protect participant privacy [6]."
The NIH announcement also recommends that future IRB consent forms include language informing the patient that their genomic data will be shared for future research. Sharing in open-access repositories requires explicit consent [6]. An allowance is made for controlled data access, especially if informed consent for sharing was not obtained. Such limited access must be outlined in a grant submission: upon request, controlled access for submitted data can be arranged; in such a situation the institution can restrict what the data will be used for, and requestors must have their project approved by NIH and are subject to restrictions on what they can do with the data.
In addition, the policy outlines what level of data must be submitted [6]. DNA sequencers function by identifying the sequence (base composition) of short reads of DNA; these reads are then aligned to a reference genome. Data obtained from sequencers therefore typically consist of:
- Level 0: raw image files, e.g., TIFF files
- Level 1: raw read files, e.g., xsq, sff, or fastq files
- Level 2: aligned and cleaned files, e.g., BAM or SAM files
- Level 3: analyzed data, i.e., output files from analysis software
- Level 4: data correlated to phenotype
NIH policy does not expect submission of Level 0 or 1 data. However, the policy states that the submission should be a Level 2 data file (e.g., Binary Alignment Map (BAM) files) that also contains the unmapped reads. Unmapped reads are reads of DNA which do not align to the primary species' genome; for example, for human biofluids the primary data would be mapped to the human reference genome, but the unmapped read files may contain viral or bacterial DNA information.
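As a rough illustration of these data levels, the classification of sequencing output files can be sketched by file extension. The extension-to-level mapping below is an assumption for illustration only, not something defined by the NIH policy.

```python
# Illustrative sketch: inferring the data level of a sequencing output file
# from its extension, per the levels described above. The extension mapping
# is an assumption for illustration, not defined by the NIH policy.
import os

LEVELS = {
    ".tiff": 0, ".tif": 0,               # Level 0: raw image files
    ".xsq": 1, ".sff": 1, ".fastq": 1,   # Level 1: raw read files
    ".bam": 2, ".sam": 2,                # Level 2: aligned and cleaned reads
}

def data_level(filename):
    """Return the data level implied by the file extension, or None if unknown."""
    ext = os.path.splitext(filename.lower())[1]
    return LEVELS.get(ext)
```

For example, `data_level("sample.bam")` returns 2, flagging a file that falls under the Level 2 submission expectation.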
There are a few points in this policy which could cause concern; these are discussed further below.
First, while the removal of personal identifying information from the data set is requested, the requirement is not as detailed as the HIPAA standard for de-identification. Furthermore, there is a fundamental assumption that the sequence data are not themselves an identifier, or a potential source of data for identification.
Second is the issue of consent, which is currently being revised [7]. However, a broad prospective consent may be limited in its ability to accurately spell out the potential future risks of identification.
Third, the policy does not limit what will be included in or excluded from the data, or its use. Unmapped reads still contain information, for example viral and bacterial sequence information, which if matched with a patient's identity could be harmful.
Finally, it is explicit in human subjects research that a participant is able to withdraw consent at any time during a study. How this applies to genomic data is not yet clear.
Identifying personal genomes by surname inference
In January 2013, a study was reported in Science in which surnames of de-identified human subjects were recovered from Y-chromosome tandem repeat data and genealogical databases [8]. The authors used freely available public Internet data to perform this analysis. In the Gymrek study, it was noted that the consent forms identified data breach as a security risk, acknowledged that re-identification could not be prevented and that privacy could not be guaranteed, and warned that future techniques might be able to identify participants [8]. The study concluded that surname inference from public databases is feasible and puts the privacy of current de-identified data sets at risk.
Such an idea is not actually new; indeed, in 2006 it was shown how public records and death records could enable a researcher to identify people based on familial pedigree studies of genetic markers [9]. Use of genealogical databases has also resulted in the identification of the biological parents of adopted children [10]. While the identification of a family or a parent of a child may have damaging consequences, far more information can be extracted from genomic information.
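In highly simplified form, the surname-inference step amounts to matching a Y-chromosome short tandem repeat (Y-STR) haplotype against surname-annotated genealogy records. The sketch below uses an invented two-record database (the marker names DYS19, DYS390, and DYS391 are real Y-STR markers, but the repeat counts and surnames are placeholders); the published study's algorithm was considerably more sophisticated.

```python
# Highly simplified sketch of surname inference from Y-STR haplotypes.
# The database records and match threshold are hypothetical placeholders.

# Each record: (surname, {Y-STR marker name -> repeat count})
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10}),
    ("Jones", {"DYS19": 15, "DYS390": 23, "DYS391": 11}),
]

def infer_surnames(query, min_fraction=1.0):
    """Return surnames whose recorded haplotype matches the query at
    at least min_fraction of the markers they share with the query."""
    hits = []
    for surname, haplotype in GENEALOGY_DB:
        shared = set(query) & set(haplotype)
        if not shared:
            continue
        matches = sum(query[m] == haplotype[m] for m in shared)
        if matches / len(shared) >= min_fraction:
            hits.append(surname)
    return hits
```

A query haplotype lifted from a "de-identified" genome that matches a record narrows the participant's likely surname, which combined with age and state of residence was sufficient for re-identification in the study.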
The impact of this study, with respect to its calculated relevance to the US population, has been challenged. Indeed, some groups believe that the study was published more on media grounds than on scientific merit [11]. The surname inference algorithm was tested first against a well-known individual who has published his genome (Craig Venter) and then against a well-defined population (Utah residents of European ancestry associated with the Mormon Church) [8]. It should be noted that this data set had previously been shown to be vulnerable to identification. What these opinions miss is that as more and more information becomes available via databases and online repositories, more inferences will become possible, especially when other pieces of information can be ascertained from public sources and built into the model. These critiques also contend that such re-identification is only possible in the context of an academic department with appropriate resources [11].
The challenge to the community is to determine what changes are needed in the rules and regulations regarding what one can do with such information. A second element is that, in the future, participants will likely be identifiable from their genetic/genomic information. It is recommended that the public be better educated, during the consent process, about the risks of identification. Big Data is becoming a common term; in essence, it involves the mining of very large data sets for biological (or other) inferences. While the benefits to society that such studies may provide are still being evangelized and debated, less consideration has focused on the privacy implications of Big Data. Facebook is estimated to have over one billion users, providing a wealth of personal information which can be mined. Newer biometric devices also give access to health information. The number of Internet-connected devices is estimated to top 50 billion by 2020 [12].
While Big Data may not pose a re-identification threat at this time, there is cause for concern in the future. The techniques and algorithms developed to mine such data sets may be adapted to mine publicly available data for similar inferences about personal data. Such databases include current genealogy websites, limited genetic studies, and the like.
A further issue with Big Data is the vagueness of the term and the process, as well as the lack of transparency about what constitutes Big Data and who is performing such research. Recent revelations of the scope of big data collection and infrastructure infiltration by the NSA have raised fears for privacy that are hard to overcome. A consideration of the risks versus benefits of Big Data is beyond the scope of this commentary, but one must be mindful of the power of such large computational systems to draw inferences from disparate and varied social media as well as from more identifiable sources of health-associated information.
Who would want to re-identify patients in a genomic study?
When balancing the risks to a participant of being part of a genomic study, one should consider why re-identification would be an issue for a participant. One of the more troubling quotes associated with the Gymrek study is that one of the co-authors, who developed the software used in the study, decided to perform the study on the basis that "they could not resist trying to do so". This suggests a potentially worrying lack of oversight or regulation. Notwithstanding the academic question of whether one can re-identify a given patient in a study, there are additional groups who may have an interest in such data (this is not an exhaustive list):
Insurance companies: While some federal rules and laws prohibit discrimination based on genetic data, the enforcement of such policies is still problematic and violations of such rules are hard to prove. While health insurance cannot be denied due to genetic disorders [13,14], other forms of insurance are not subject to these laws, such as long-term care policies, which may need addressing in the near future.
Employment: Some opinion pieces suggest that employers may discriminate on the basis of the finding of certain disease or personality traits. For example, certain APOE variants (notably APOE ε4) are associated with a markedly higher risk of Alzheimer's disease. Would an employer hire someone with a heightened chance of Alzheimer's or another costly disorder?
Law enforcement: A commonly cited concern is the ability of genomic data to be matched to DNA fingerprinting data. DNA fingerprinting is based on identification of DNA microsatellite patterns. Such sequences could be interpreted from an annotated genome, and then used to search DNA fingerprinting files.
Unknown family members: Genealogical data has already been shown to be useful in the search for parents, either of adopted children or of children conceived by donor insemination. A recent case in Kansas, in which a sperm donor was sued for child support, highlights some of the potential issues associated with this situation [15]. This case was prosecuted by the state, even though the donor had signed away his paternal rights. In addition to the obvious cost, sperm donors being confronted later in life by potential children could cause substantial personal and social issues.
Criminals (blackmail): If a genetic trait or vulnerability could be identified, it could be used to extort money. One could easily consider similar scenarios in which paternity or evidence of communicable diseases could also be subject to similar illegal activities.
Other: A participant in a study might request their information, or try to determine which of the genomes published in a study belongs to them. The risk in this situation is that a person may not possess the knowledge or tools to correctly interpret the data or the risk, leading to unfortunate consequences. James Watson, for example, discovered that he had a BRCA mutation. Fortunately, he obtained genetic counselling prior to announcing this discovery to family members; the mutation was not associated with disease, obviating the need for additional testing. This is nevertheless a clear example of the potential for harm from limited data. Indeed, one recent post suggests that mostly negative, disease-associated information is learned from DNA mutations [16].
What is the damage of releasing personal genomic data and data from sequencing?
Dangers of the release of genomic data include potential damage to employment, to the ability to obtain health and life insurance, and to personal relationships. The recent controversy surrounding the sequencing and publication of the HeLa cell genome [17], and the inferences it permitted about the family of Henrietta Lacks, resulted in a new NIH policy on genomic studies of HeLa cells (NOT-OD-14-08).
While a clearer understanding of the risks associated with genetic information is still being elucidated, there are clear risks that can be attributed at this time to the release of identified genetic information. First, and most commonly mentioned, is the use of genetic risk factors when assessing risk for medical or life insurance. Current federal rules prohibit denial of coverage and premium loading for pre-existing conditions [13,14]. In addition, there are few diseases with such a direct correlation of risk to disease in which the disease is not already apparent or predictable. For example, Huntington's chorea has a strong identifiable familial linkage, and BRCA mutations are associated with high risk of certain cancers, but other progressive diseases may not be as easily identifiable in genetic data at this time. Clearly the concern is premium loading of people with genetic indicators predicting a higher likelihood of diseases that require expensive and/or prolonged therapies. Such information may also have consequences for identifiable family members.
Such predictions of genetically associated disorders can be obtained directly from genetic/genomic data (Levels 1-3). Once a potential candidate for a disease is identified, the challenge would then be to identify the person. It is hoped that legislation will be able to thwart such uses of genetic data, either by limiting the use of genetic information in insurance calculations or by prohibiting re-identification [11].
What else should be included in a patient consent for genomic data?
A second and perhaps less obvious risk comes from the identification of other disorders from genomic/genetic data sets. While concern has mostly centered on the identification of human diseases encoded in the genome, for example BRCA mutations, the DNA data obtained from a person will also contain non-human DNA. Urine has commonly been considered a sterile biofluid; however, recent studies show that microbial DNA can commonly be extracted from urine. Indeed, a number of projects have focused on the identification of the microbiome, the complement of microbial organisms in different body compartments which influence our health. This approach is currently a large focus for NIH, as it offers some fundamental changes in how we view disease.
However, it was recently shown that microbiomes associated with sexually transmitted infections could be identified in asymptomatic patients [18]. In this study, 16S ribosomal RNA patterns were used to identify various STIs once the human sequences were removed from the data set. This is, in essence, the same data as the "unmapped reads" required of all NIH data submissions [6]. Therefore, similar approaches could be used to identify bacterial or viral genetic information from a patient's tissue. Evidence of, for example, hepatitis C or human papilloma virus might be used to increase the cost of health insurance, given the long-term potential costs of hepatitis and of HPV-induced cancers. A second and perhaps more nefarious use of such data would be to blackmail or extort a patient to prevent public release of this information.
The identification of such diseases normally requires access to rawer data (Level 1 data, typically the unaligned sequencing reads). However, NIH policy suggests also depositing the unaligned reads in a file. This would actually make such searching easier, because the data will already have been filtered for human genomic sequences, thereby reducing the computational power required for such studies.
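The concern above can be illustrated with a minimal sketch: once human reads have been removed, screening the remaining "unmapped" reads for pathogen sequences reduces to simple pattern matching. The signature sequences and pathogen labels below are invented placeholders, not real viral or bacterial sequences.

```python
# Minimal sketch of screening unmapped (non-human) reads against pathogen
# sequence signatures. Signatures and labels are invented placeholders;
# real tools use alignment or k-mer classification, not exact substrings.

PATHOGEN_SIGNATURES = {
    "hypothetical_virus_A": "ACGTACGTTT",
    "hypothetical_bacterium_B": "GGGCCCAAAT",
}

def screen_unmapped_reads(reads):
    """Return the set of pathogen labels whose signature occurs in any read."""
    found = set()
    for read in reads:
        for label, signature in PATHOGEN_SIGNATURES.items():
            if signature in read:
                found.add(label)
    return found
```

Because the deposited unmapped-read file has already had human sequences stripped out, every read fed to such a screen is a candidate microbial sequence, which is precisely why the search becomes computationally cheap.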
Considering other “omic” data sets
To date, considerations of risk have come from genomic studies, i.e., studies of the sequence of our genetic code, DNA. However, RNA sequencing studies (transcriptomics) are being pursued as an alternative to gene expression studies (in which only the expression level of a gene is reported, not its sequence). Since RNA is transcribed from DNA, the DNA sequence can be inferred from RNA. The methodology of Gymrek et al. for identifying patients based on tandem repeat analysis [8] may not be applicable to RNA sequencing, because these elements may not be transcribed into RNA. However, this discounts the fact that RNA sequence and DNA sequence information are both identifiers. Many people have variants in DNA and RNA which may not be translationally relevant, i.e., do not encode a change in protein structure, but may nevertheless enable identification (depending on how unique the variant is). Again, the issue here is that our DNA is an identifier. As such, we may want to consider all gene and gene-product sequence information as subject to the same considerations.
Proteomic studies are also subject to the same considerations. Peptides are sequenced to identify them; hence a novel mutation in a protein (encoded by a gene) may be sufficient to enable identification of a person. Proteomics may also enable the identification of non-human (i.e., bacterial) proteins in the tissue samples being analyzed. As such, proteomic data needs to be considered with the same rigor as genomic data.
Clearly, if participants who assume they are de-identified are then identified, these issues become more problematic.
Balancing risks vs. benefits
We have focused on potential risks to patient privacy, and especially the potential risk to a participant of re-identification. However, we must also attempt to balance these against the potential benefits of genomic studies to patients and society. Using this approach, we are attempting to identify diagnostic markers for stroke. Studies show that genomics can assist in the identification of patients who respond better to various anti-cancer drugs [19], thereby enabling more focused therapies and avoiding wasteful, damaging treatment. Whole genome sequencing has been shown to be invaluable in the diagnosis of some very rare neurological conditions. Therefore, patient privacy concerns must be balanced against the potential benefits to our understanding of human disease.
How can we reduce the risk to patients?
What is a potential solution? Clearly, a consent form will have to address this possibility. Whether a participant in an NGS study should have to explicitly agree to non-human alignment of their data (with the risks defined) may also need to be considered. Given the risks identified above, this may indeed be considered a higher risk than genetic disease information.
De-identification strategies have focused on the removal of four critical identifiers: name, date of birth, address, and Social Security number. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) suggests scrubbing 18 potential identifying factors. However, one study noted that age can be used to obtain a year of birth, which enables searching of databases to identify a patient. The question remains whether, if ages were stratified into 5-10 year blocks, the data would still have as much utility to a researcher as data in 1-year divisions. If data stratification is performed using age, 10-year blocks may be sufficient to prevent most attempts at re-identification based on age. At a minimum, it would therefore seem appropriate to release stratified rather than absolute ages of participants. Guidelines to assist participants in remaining unidentified should also be considered. Indeed, many studies of re-identification suggest we must educate patients about the risk of such studies. The following are potential inclusions for consent forms:
- Explain whether the genomic data will or will not be released to open-access or restricted-access database repositories
- Explain that absolute privacy cannot be guaranteed in the future
- Explain that patients may compromise their own privacy via the use of genealogical services, etc.
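The age-stratification idea above can be sketched as a simple binning function; the 10-year default width follows the suggestion in the text, while the band label format is an illustrative choice.

```python
# Sketch of age stratification for de-identification: release a coarse age
# band (default 10-year blocks, per the discussion above) instead of an
# exact age. The band label format is an illustrative choice.

def age_band(age, width=10):
    """Map an exact age to a stratified band label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

For example, a 34-year-old participant would be released as "30-39", which still supports decade-level stratification in analyses while blocking inference of the year of birth.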
NIH data policies are intent on the provision of Level 2 data at a minimum. The question then remains whether we want to publicly release human-genome-aligned BAM files (or equivalent), or whether all human data should fall under a restricted-access policy. A recent presidential review committee strongly suggested that whole genome sequencing studies should not be performed without explicit consent (a position this author agrees with) [20]. As such, opt-out consent should not be used when there is the potential for future sequencing. Furthermore, consent forms need to include language regarding the risks of WGS and to specify what data sets will be released.
NIH policies request the release of unaligned data. The release of unaligned data may have great utility, for example in identifying infection within a population, or microbiome changes associated with a clinical condition. Given the risks identified above, a separate explicit consent should be considered if data will also be aligned to non-human reference genomes. Re-analysis of data against reference genomes not explicitly referenced in the original consent protocol may require an amendment of the original IRB protocol and even re-consenting of the participant. A second issue, not discussed above, is the requirement for reporting if evidence is gained of certain transmissible diseases. As NGS technologies come to be used for clinical diagnostics, this may soon require further consideration.
Final Comments
The Big Data revolution is upon us, and one of the key contributors to this movement is the large-scale acquisition of genetic information linked to clinical/health information. As we move into a new era of data science and information-driven healthcare, we must determine best practices to ensure minimal risk to patients and maximal societal benefit from the data generated. Public trust in science is paramount for the success of such studies. While guidance is becoming clearer for scientists involved in such research, this area is complex and combines elements of ethics and risk assessment as well as research science. We will not be able to foresee all risks regarding information release, but with a little foresight we can attempt to reduce such risks.
Acknowledgement
I acknowledge the assistance of Profs. P.R. MacLeish, PhD, R.P. Simon, MD, and A. Pearson for helpful discussions in developing this article. R. Meller is supported by NIMHD U54 MD007588 and NINDS R01 NS59588. The views represented in this article are those of the author and do not represent the views of the Morehouse School of Medicine, NIH, NINDS, or NIMHD.
References