Tuesday, January 25, 2011

The CODIS Marker SNAFU

So I blogged earlier this month about Damian and me having some "technical difficulties" with the DNA database.  I'm going to do my best to explain the situation - and heck, maybe someone out there in the near-infinite WWW will see it and be able to give us some direction!

Background:
Damian and I started this DNA database project last fall after I was presented with the possibility of finding a half-sibling.  Based on previous DNA tests that my potential half-sibling and I had both taken, we shared our paternal allele on 10 of 15 CODIS markers.  Of course with this knowledge I HAD to know what it meant and if it was significant!  Damian and I both set out researching and, lo and behold, we were able to work out the algorithms and equations used by professional DNA testing companies to determine Siblingship/Relationship Indices (also called the likelihood ratio).  Our calculations came out at around a 99% probability that she and I were in fact half-sisters.  When the professional results came back they were right on par with what Damian and I had projected, which gave us further evidence that our algorithms were correct.

We decided that having this knowledge could be very beneficial for other offspring, and we had hoped that there was a database similar to the UKDonorLink available worldwide for donor-conceived adults to find siblings.  So we launched our project and found another DC-adult to help us with the database framework and coding.

Database Philosophy:
We wanted a database to which offspring around the world could upload their previously done DNA tests (paternity, siblingship, DNA Profiles) and be provided with a list of potential siblings, similar to databases like Y-Search that provide potential relatives for individuals who have done a Y-DNA test.  Many offspring do numerous DNA tests, some even dozens, in their quest to find their kin, and we wanted a central location for finding donor-conceived siblings that would alleviate the needless costs involved in doing multiple DNA tests without prior reason.  Offspring would be given a list of other offspring in descending order of probability of relatedness, with each offspring's "vital stats" such as year and place of conception, doctor/clinic/sperm bank, donor number, etc, as well as their contact information.  It would then be each offspring's individual decision to contact potential siblings, determine if there was a possibility of relatedness, and go forward with professional testing.

And...Problems Arise:
We always knew that siblingship results are best and most accurate/conclusive when both alleged siblings' mothers are also tested.  Unfortunately, for many older offspring this is not feasible.  So Damian and I had the idea to set a threshold that was low enough to work for offspring without their mother's DNA.  We even had an idea of setting two separate thresholds, one for offspring who had their mother's DNA and one for those without.  We also had some serious concerns about UKDL setting a threshold of 99% probability.  In the USA, siblingship DNA tests are admissible and conclusive in court at 90% probability.  So why the high threshold for the UKDL?!  We originally thought it simply had to do with government officials not wanting to deal with it, so by setting the threshold so high they'd have fewer matches to contact.

However, as we began adding DNA samples into our database things took an interesting turn.  We were getting a ridiculous number of false positives.  Not false negatives, as we had originally expected, especially for offspring without mom's DNA.  We were getting results that were well over the 1.0 threshold that is the dividing line between not related and possibly related, and in several cases results that were well into the conclusive range, for individuals that could in no way be related, at least not at such a close degree of consanguinity.

Evidence against CODIS:
Damian and I were becoming increasingly worried about some of the results we were getting from our simulations, and finally our CS/IT guy jumped ship because he had put together what we were already concerned about: the algorithms that we were using were not working.  But the question remained....if they did not work in our database, and we knew that they were the same algorithms being used by professional testing companies (as we were routinely getting near-identical results when recreating previously done tests), does this mean that professional DNA testing companies could be producing false positive results as well?!

Both Damian and I have noticed some intriguing circumstances that suggest CODIS has some serious flaws.  For example, I added my 15-marker CODIS results to a database called DNA Reunions.  It's a free database that you can join and upload CODIS markers, as well as Y-DNA and mt-DNA results, and be matched with "potential" relatives.  Since I have not done an mt-DNA test, as it's not helpful for tracing my paternal line, and I am genetically unable to do a Y-DNA test, I only submitted my CODIS markers.  They also did not ask for any relatives' results, so I was not able to add my mother's.  Then again, it would not really be necessary for this type of database that caters solely to genealogy.  Several months later when my results were approved and I was actually in their database, I checked it out and noticed that there were 3 other individuals in this database that shared at least one allele on ALL 15 of my CODIS markers!!  And there was NO WAY that any of these three individuals were related to me at all, at least not as closely as CODIS markers can assess.  So that means, if one of those three individuals turned out to be donor-conceived and I went through a siblingship DNA test with them, it would come back highly significant.  Especially if neither of our mothers were tested.  And yet, it's not possible.  But I still shared such a significant portion of CODIS markers with them.......
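
To put rough numbers on how sharing one allele per marker can add up, here is a minimal Python sketch.  It assumes the standard textbook half-sibling likelihood ratio with no mothers tested, which is not necessarily the exact algorithm in our database or at any lab.  The function names, allele labels, and allele frequencies are all made up for illustration.

```python
from math import prod

def half_sib_lr(g1, g2, freq):
    """Per-marker half-sibling likelihood ratio, no mothers tested.

    g1, g2: genotypes as 2-tuples of allele labels.
    freq:   dict mapping allele label -> population frequency (hypothetical here).
    Half-siblings share 0 or 1 allele by descent with probability 1/2 each,
    so LR = 0.5 + 0.5 * X1, where X1 is the "one allele shared by descent" ratio.
    """
    # X1 = (1/4) * sum over allele pairs (a from g1, b from g2) of [a == b] / p_a
    x1 = sum(1.0 / freq[a] for a in g1 for b in g2 if a == b) / 4.0
    return 0.5 + 0.5 * x1

def combined_index(lrs):
    """Multiply the per-marker LRs (assumes the markers are independent)."""
    return prod(lrs)

# Hypothetical marker where the shared allele "12" has a 10% frequency.
freq = {"12": 0.10, "9": 0.25, "14": 0.20}
one_shared = half_sib_lr(("12", "9"), ("12", "14"), freq)
none_shared = half_sib_lr(("9", "9"), ("14", "14"), freq)

# One shared allele of that rarity at each of 15 markers.
index_15 = combined_index([one_shared] * 15)
print(one_shared, none_shared, index_15)
```

With these made-up numbers a single shared 10% allele gives a per-marker LR of 1.75, and fifteen such markers multiply out to an index in the thousands.  The same formula gives an LR just under 1.0 when the shared allele has a 30% frequency, so which alleles are shared matters as much as how many.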

So is it CODIS?  Is it the testing criteria?  Or is it donor-conceived adults?  What is the problem here?!

Lindsay's Theory:
My theory is this...donor-conceived people do not follow what is considered the "norm" for individuals who seek DNA testing.  In traditional cases two individuals have prior reason to believe that they are related/half-siblings.  Usually it involves infidelity and/or multiple partners.  However, in both of these cases the said individuals did not just find each other at their local supermarket and say, hey we might have the same father!!  With donor conception, in contrast, it's even more random than the local supermarket!!  We might find someone on some message board who lives and was conceived 2,000 miles away from us, and the vague description we both have of our biological father is that he had brown hair, was 6'0" tall, and was a medical student in 1984.  That description could accurately be applied to thousands of men.  Yet, as donor-conceived adults, we see beyond that.  We see the possibilities.  And in our world, even the tiniest of possibilities could be significant and need to be further pursued.

This means that what we consider a potential sibling does not fit the "norm" of individuals who are normally tested for siblingship and may be too random to be accurately determined through current protocol.

Possibilities...but not necessarily probabilities:
At the same time, professional DNA testing companies, and the algorithms that we are using, work from a 50% prior probability.  The statistics behind this are the likelihood ratio test, which basically compares two competing hypotheses, the "null" and the "alternative".  The "null", as in null hypothesis, means there is no relationship between the two datasets (for DNA tests it would be the two individuals' CODIS markers).  The "alternative" is the opposite hypothesis.  It means that there is a relationship between the two datasets.  So, there are two possible outcomes: either the two people are related or they are not.  The 50% prior treats those two outcomes as equally likely before the test is done (hence "prior").  That is an assumption built into the calculation, not something that follows from there being only two outcomes, and it fits two people who already have reason to think they are related much better than it fits two random samples.

This also means that, with a 50% prior, a 50% probability (a likelihood ratio of 1.0) is the threshold between related and non-related....at least statistically.  As the probability gets farther below 50% it means there is less likelihood of relatedness.  Conversely, the farther the probability rises above 50%, the stronger the support for relatedness.
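
The step from a combined likelihood ratio to a "% probability" can be sketched with Bayes' rule in odds form.  This is a minimal illustration, not any lab's actual code: the function name is hypothetical, and the priors other than 50% are example values only, chosen to show how much the assumed prior matters.

```python
def posterior_probability(combined_lr, prior=0.5):
    """Posterior probability of relatedness from a combined LR and a prior.

    Uses Bayes' rule in odds form: posterior odds = LR * prior odds.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = combined_lr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# The same combined LR of 99 under three assumed priors (example values).
for prior in (0.5, 0.01, 0.0001):
    print(prior, round(posterior_probability(99.0, prior), 4))
```

With the 50% prior an LR of 1.0 lands exactly on 50% and an LR of 99 gives 99%.  Drop the prior to 1% and that same LR of 99 only gets you to about 50%.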

Back to the SNAFU:
So the bottom line is that Damian and I have unearthed a serious problem.  This problem, I fear, reaches much farther than just our DNA database and could seriously call into question any half-siblingship DNA test done between donor-conceived persons.  The likelihood of false positives seems to be unnaturally great.  I blogged last year about my anger towards the UKDL and their ridiculously high 99% probability threshold before contacting individuals.  My anger may have been misdirected, as what Damian and I are realizing is that this could be a corrective measure (moving from 90% to 99% probability) to adjust for the fact that they too were getting false positives.  However, I worry that their increase in stringency may also result in false negatives.
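
As a rough sketch of what moving from 90% to 99% means in terms of the likelihood ratio, the same Bayes' rule can be turned around to ask how large a combined LR is needed to reach a given probability.  The function name is hypothetical and the 50% prior is the assumption described above; I do not know how UKDL actually does its calculation.

```python
def required_lr(threshold, prior=0.5):
    """Combined LR needed for the posterior to reach `threshold` at a given prior."""
    target_odds = threshold / (1.0 - threshold)
    prior_odds = prior / (1.0 - prior)
    return target_odds / prior_odds

print(required_lr(0.90))  # about 9 with a 50% prior
print(required_lr(0.99))  # about 99 with a 50% prior
```

So under a 50% prior, the jump from 90% to 99% raises the bar from an LR of roughly 9 to an LR of roughly 99, which is about eleven times as much evidence.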

What needs to be addressed is that the problem lies with the inherent limits of CODIS markers, and instead of increasing the threshold either:

a) a new protocol needs to be developed with different algorithms, possibly one that sets a much lower prior probability; or

b) testing needs to be redirected to more in-depth analyses such as SNP (single nucleotide polymorphism) tests, like the new FamilyFinder DNA Test, that look at hundreds of thousands of locations across the genome and identify close relatives.

The downfall of this test is the cost.  The plus is that your DNA is stored in FTDNA's database and you can potentially find paternal relatives such as cousins, grandparents, uncles/aunts, and even possibly your donor!!

Bottom line....our advice right now is to use CODIS marker results as only a guideline.  We do not trust these results, ESPECIALLY if neither mother was tested!!  If there is good reason to believe that two donor-conceived individuals are related....such as sharing the same donor number, then I would take the test results as accurate.  This is because the randomness has been eliminated.  Another example is my half-sister and me.  She did not have a donor number, but her mom knew it was the same sperm bank.  And on top of that I already knew that the doctor's office that her mom used was 100% for sure using my donor on their patients.  So again, there was prior reason to believe we were related.

5 comments:

Tom said...

If they are using a prior probability, surely this is a Bayesian test, not a frequentist one i.e. the outcome should not be accept/reject the hypothesis, but it should be a probability of the hypothesis being true.

Mark Diebel said...

What do you know (and think) about the website called 23 and me? Does this work?

Lindsay said...

Tom, correct. The assumption is that there are only two possible outcomes - which there are, you are either related or you are not - so I guess that is why prior probability is used. The algorithms that are used to determine each individual likelihood ratio, though, those LRs are dependent on the frequency of that allele in the general population, and in our algorithms for half-siblingship it takes into account the mother's DNA, but that is really just a variation on that original equation.

Unfortunately math is not my strong point so I can comprehend what needs to be for the sake of genetics, but I don't understand all the theories and reasoning behind it. Hence why I'm not completely sure WHY it all works as it does.

The reason I believe as to why it cannot be a probability of the hypothesis being true is that before any genetic evidence is laid out there can only be two independent outcomes. Either they are siblings or they are not. So the prior probability is 50%. The evidence is taken into account and it provides the posterior probability.

If we break it down even further, let's say for any given marker, two alleged half-siblings do not share any alleles. However, they each only inherited 1 allele from the alleged shared father. Therefore, there's a 50% chance that the allele that they inherited from the father is not the same. Which is why for half-siblingship if there are no matches on a marker the LR is 0.5. The prior probability is still 50%.

Lindsay said...

Mark,
I know a bit about 23andMe. Its test is similar to FamilyFinder from FTDNA, but it goes even deeper than that. It also looks at health/medical aspects.....which I feel opens up a whole nother can of worms.

It also has a database to locate relatives called Relative Finder.

I have looked to see more about the type of test they use....I believe it's a SNP test. Though I'm not 100% certain.

It's definitely something worth looking into though - I just wonder if it's more hoopla and focused on the health/disease aspect than on the finding relatives. Don't know.....

I'd be interested to hear from anyone who has done a 23andMe test.

sustine hefalu said...

Greetings from Spain (Barcelona)