Introduction to Population Genetics – Lynn Jorde (2014)

 

Tyra Wolfsberg:
Good morning, everyone. Welcome to week seven of our current topic
series. Thank you for coming. This week we’re honored to have with us Dr.
Lynn Jorde from the University of Utah, School of Medicine, where he holds the H.A. and Edna
Benning Presidential Endowed Chair in the Department of Human Genetics, and he’s also
the appointed Chair in the Department of Human Genetics. Dr. Jorde received his degree from the University
of New Mexico. His lab studies the evolution of mobile elements,
and the effects of these elements on the human genome.

 

He’s also interested in natural selection
in humans and has identified genes that have helped Tibetan populations adapt to living
at high altitudes. Finally, he used whole genome sequencing
to uncover disease-causing mutations and to estimate the human mutation rate. Dr. Jorde served on several advisory panels
for the National Science Foundation and the NIH, and in 2012 he was elected as a fellow
of the American Association for the Advancement of Science. Finally, Dr. Jorde has received 12 teaching
awards from the University of Utah, as well as one from the American Society of Human
Genetics. I’m pleased to say he’ll be bringing that
excellent teaching style here to NIH this morning, and I’m sure you’ll enjoy learning
a lot from today’s talk, which is intended to provide you with an overview of population
genetics. Please join me in welcoming Dr. Jorde to the
NIH this morning.

 

[applause]

Lynn Jorde:
Well, thanks very much, Tyra. It’s a pleasure to be here again. And before I start, let me say that I’m happy
to entertain questions at any point in the talk. So if something comes up that you’d like to
know more about, don’t be shy about asking a question. This discloses that I have no commercial interest
related to this presentation. So this morning what I’d like to talk to you
about is, first of all, an overview of patterns of human genetic variation, both among populations,
because really that is the essence of population genetics, but also now, particularly with
whole genome data, we can dissect patterns of variation, similarities, and differences
at the individual level, giving us, I think, a very different and much fuller perspective
of human genetic variation.

 

We’ll talk about the implications of our findings
in human population genetics for the concept of race, something that I think always stirs
a certain amount of controversy, and something that I think can be illuminated by our genetic
data. We’ll talk about linkage disequilibrium, a
fundamental population genetic process that has been very important in disease gene identification. Throughout, we’ll be talking about the relevance
of genome sequencing data for these topics. So there are several applications of human
genetic variation. One is in deciphering human history because
really, the history of our species is written in our genome. And more and more, we have the technology
to make inferences about that history. And I’ll be giving you a few examples of how
genetic data can be used to infer human history going back hundreds of thousands of years. We can infer individual ancestry. I’ll give you some examples of that. And this is something that I think is much
more informative than traditional self-identified population categories. Genetic variation is used commonly now, as
you know, in the field of forensics. Tens of thousands of cases every year are
solved using DNA data.

 

So this is a very important, and to some extent
unanticipated application of basic population genetics, application of things like Hardy-Weinberg
equilibrium, linkage disequilibrium to help exonerate the innocent and to convict the
guilty. And finally, perhaps most importantly, principles
of population genetics are used to find, identify, and understand disease-causing
genes. And we’ll be talking about some of those applications. So, of course, mutation is the fundamental
source of genetic variation in our species and others. We now can estimate the human mutation rate
directly by sequencing families. We sequenced a human family from Utah a few
years ago and estimated the human mutation rate to be about 1.3 x 10^-8 per base pair
per generation. And there have been now several estimates
using families that all come up with about the same number: roughly one in 100 million
base pairs per generation for single nucleotide variants. So what that means is that we transmit about
30 new DNA variants each time we make a gamete. And I like this quote from Lewis Thomas,
the science writer, about mutation.

 

He said, “The capacity to blunder slightly
is the real marvel of DNA. Without this special attribute, we would still
be anaerobic bacteria, and there would be no music.” So I think we should be thankful for our mutations
because some mutations under natural selection lead to adaptation to a changing environment;
others, of course, cause disease. Another thing we’ve learned by sequencing
families is that the mutation rate goes up substantially with advanced paternal age. We’ve known for some time that certain autosomal
dominant diseases increase in frequency with the age of the father, but now, by looking
at sequence information, whole genome sequences in families, we estimate that
there are about two additional mutations with each additional year of paternal
age after around age 30, as a result of spermatogonia continuing to undergo mitotic divisions throughout
the life of the male.

 

So at least three-quarters of all new mutations
in mammalian species can be attributed to males. So in addition to wreaking a lot of the havoc
in the world in general, males also wreak most of the havoc in the genome, at least
at the level of single nucleotide variants. So given that these mutations are happening
all the time, that we’re transmitting them from generation to generation, a natural question
is, “Well, how much — at the DNA level if we look at aligned DNA bases, how much do
we differ?” Well, identical twins are nature’s clones,
so for all intents and purposes, they differ in none of their DNA base pairs.

 

There are, of course, somatic mutations that
cause small differences, but we can say that they are, essentially, genetically identical. You probably know that for any pair of unrelated
humans, we differ at about one in 1,000 of our base pairs. And I think that’s a very important result
because it tells us that at the level of DNA, the most fundamental biological unit, we are
99.9 percent identical. If we compare ourselves to our nearest evolutionary
relative, the chimp, we are about 99 percent identical. We are about 99 percent chimp at the DNA level. Mouse, as you would expect with 70 million
years of separation, we differ at one-sixth to a third of our base pairs. And if we look at something very different,
broccoli, we are, thankfully, mostly different from broccoli. Well, a small number of differences, then,
proportionally, only one in 1,000, but because, as you know, we have 3 billion base pairs
in a haploid genome, that means that between any pair of haploid genomes, including the
two genomes that you get from your parents, there are about 3 million single nucleotide
polymorphism or variant differences.
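
The arithmetic behind these round numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 3 billion base pair genome size and the rates are simply the figures quoted in the talk.

```python
# Back-of-the-envelope check of the figures quoted in the talk.
# Function names are illustrative, not from any library.

GENOME_BP = 3_000_000_000  # haploid genome size quoted in the talk


def pairwise_differences(one_in=1_000, genome_bp=GENOME_BP):
    """Expected differences between two haploid genomes that differ
    at one in `one_in` base pairs."""
    return genome_bp / one_in


def new_variants_per_gamete(rate_per_bp, genome_bp=GENOME_BP):
    """Expected new single nucleotide variants per gamete for a given
    per-base-pair, per-generation mutation rate."""
    return rate_per_bp * genome_bp


print(pairwise_differences())            # one in 1,000 of 3 billion bp
print(new_variants_per_gamete(1e-8))     # the "one in 100 million" rate
print(new_variants_per_gamete(1.3e-8))   # the family-based estimate
```

With the rounded rate of one in 100 million the count comes out at about 30 new variants per gamete; with 1.3 x 10^-8 it is closer to 40, so the "about 30" figure reflects the rounded rate.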

 

So actually a lot of variation for evolution
to work with. Now, we can put this in context a little bit
by comparing the amount of variation in humans with that of other great ape species. And this is a paper published just last year,
sequencing 79 great apes. And we see that for humans on average there
are around 3 million single nucleotide variants per individual when we compare an individual genome to the reference.
For common chimps, it’s nearly double.

 

For gorillas, it’s more than double. For orang, it’s about three times as much. So humans, at least relative to other great
ape species, are somewhat depauperate in genetic variation, and what this suggests is that
we were founded by a relatively small number of individuals not so very long ago. So we haven’t had that — as much time to
accumulate variation. Now, another important kind of genetic variation,
and one that population geneticists are using more and more, is copy number variants. So here we have a couple of genes, A and B
that exist in extra copies in a genome. And these are often defined as deletions or
duplications greater than 1,000, sometimes greater than 500 base pairs. And altogether, they account for a substantial
amount of inter-individual variation, each human being heterozygous for at least 100
copy number variants, or more if you define them as being a bit smaller; but another important
source of variation, and one that is traced, in some cases to the causation of diseases
like schizophrenia and autism.

 

So we can also ask the question — we’ve said,
how much do individuals differ from each other. We can ask the question, “Well, how much do
populations differ from each other?” And of course, this has been a central
focus of population genetics for a long time. So I’ll show you some data from a fairly widely-distributed
series of human populations. We’ve collected many of these over the years. Eight hundred fifty individuals in 40 different
populations distributed across the major continents of the world. And of course, there’s a substantial amount
of phenotypic variation in these individuals. And these are photographs of some of the people
that were sampled in the course of these studies. So one of the ways that we can look at variation
among populations is with a simple tabulation of allele frequencies. So if we have, let’s say, three populations
here, and let’s suppose for simplicity, we’re looking at three single nucleotide variances. These are the major allele frequencies, the
allele with higher frequency. We can assess variation among populations
simply by looking at the frequencies of these alleles and comparing them.

 

And one of the things that we typically do
is to estimate average heterozygosity — this is a fundamental measure of variation — so that
for each locus, we assess the proportion of heterozygous individuals, typically by direct
counting, or we can make a Hardy-Weinberg calculation, and then we can average that
heterozygosity across loci. So one of the ways that we apply this is to
estimate a quantity called FST, and this is something used very often in population
genetic analysis. And we can think of FST as the amount of genetic
variation in a whole population, a whole sample (let’s say the whole world), that arises
because of differences among populations, that arises because of subdivision. So a simple measure of FST is shown here. We look at the total heterozygosity in our
sample; let’s say all the heterozygosity in humans across the world, the average heterozygosity.

 

And then we subtract from that the average
heterozygosity within each subpopulation. So if we divide our populations into continents,
we would look at the average heterozygosity in each continent, subtract that from the
total, and then normalize by dividing by the total. So you could imagine that if this quantity
were very high, in fact if there was as much variation within populations as there is in
the whole sample, then FST would be zero. What that says is there is no differentiation
across human populations. Every subpopulation has just as much variation
as the entire population. No differentiation. On the other hand, if all variation exists
between populations, in other words, if this quantity is always zero, every subpopulation
is essentially a clone, then FST would be one.
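
The calculation just described, total heterozygosity minus average within-subpopulation heterozygosity, divided by the total, can be sketched in Python as follows. This is a minimal illustration rather than the estimator used for the data in the talk: it assumes a single biallelic locus, Hardy-Weinberg heterozygosity 2p(1 - p), and equally sized subpopulations, and the function names and allele frequencies are hypothetical.

```python
# Minimal sketch of FST = (H_T - H_S) / H_T for one biallelic locus.
# Assumptions (not from the talk): equal subpopulation sizes and
# Hardy-Weinberg heterozygosity; names and numbers are illustrative.


def heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) for allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(subpop_freqs):
    """FST from a list of allele frequencies, one per subpopulation."""
    n = len(subpop_freqs)
    p_total = sum(subpop_freqs) / n                 # pooled allele frequency
    h_total = heterozygosity(p_total)               # H_T: whole sample
    h_within = sum(heterozygosity(p) for p in subpop_freqs) / n  # H_S
    if h_total == 0.0:
        # No variation anywhere, so FST is undefined; this sketch returns 0.
        return 0.0
    return (h_total - h_within) / h_total


print(fst([0.5, 0.5]))   # every subpopulation as variable as the total
print(fst([1.0, 0.0]))   # each subpopulation fixed for a different allele
print(fst([0.6, 0.4]))   # modest differentiation
```

The first two calls reproduce the limiting cases from the talk: no differentiation gives zero, and subpopulations that are each "essentially a clone" give one. For many loci, one would average the heterozygosities across loci before forming the ratio.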

 

So this is a way of saying how much variation
in a sample is due to subdivision; because this is not a completely random
mating population. So let’s look at some measures of FST using
different kinds of genetic systems. These are short tandem repeats; these are
a couple of kinds of mobile element systems. Here’s a 250k SNP chip. What we see is very consistent
across different kinds of genetic systems; that FST, the amount of variation due to subdivision,
typically runs between 10 and about 15 percent. We see similar results for sequence data as
well.

 

So most of the variation in human populations
would be found within any major subdivision; within, let’s say, Asia or within Africa — a
little more in Africa, but the bottom line is that if we look at the variation within
one major human population, we see 90 percent of human genetic variation in that population. We only get an extra 10 percent if we look
at the rest of the world. So we are somewhat minimally differentiated,
which I think is another important point with some real social implications.

 

Now, we can compare FST in these genetic systems
with FST for a measure of skin pigmentation, which is highly differentiated across continents. And we see essentially the opposite result:
90 percent of variation is found between major continents. So for this very visible indicator that people
often use to essentially classify populations, there is a lot of variation among populations. Essentially, the reverse of what we see for
genetic systems. And if we now look at some of the genes that
underlie skin pigment — skin pigmentation, they also vary tremendously among populations,
as you would expect.

 

So here is the tabulation that we did on
the samples I showed you earlier with a 250k SNP chip simply to ask the question, “Well,
what proportion of alleles are shared among populations?” And we divided our populations into Sub-Saharan
Africa, Europe, East Asia, and the Indian subcontinent. And what we found with that SNP chip, which,
of course, consists mostly of common SNPs, where the minor allele frequencies exceed
5 percent, was that about 80 percent of the minor alleles were shared in all four
groups: 88 percent in at least three, 92 percent in at least two; 7 percent were African specific,
and less than 1 percent were specific to any of the three non-African populations. So the bottom line here is that these
SNPs with frequencies greater than 5 percent or so are typically old
polymorphisms. A polymorphism typically has
to have some age to attain a higher frequency.

 

They tend to be shared among populations. And in fact, none of these SNPs were fixed
present in one population, fixed absent in another. So they’re — none of them could be used actually
on its own to differentiate populations. And this is a similar result to the 1,000
Genomes data. In an earlier version of dbSNP that consisted
mostly of common SNPs, this is — these are the Asian 1,000 Genomes populations, the European-derived
— this is a sample from Utah — and then Africa. And most of these SNPs are shared in all three
populations — somewhat more are found in Africa, relative to Europe and Asia, but
mostly shared. And these are relatively common
SNPs where the average allele frequency difference between populations is right around 15 percent. But now more recently, we can look at rarer
SNPs that are identified by sequencing. And now you see a very different pattern. Most of these are not shared among populations. They’re rare enough so that they arose relatively
recently and therefore tend not to be shared among continental populations. And in fact, for
SNPs where the minor allele frequency is less than 5 percent, less than 2 percent of those
are shared across continents.

 

So it’s much, much more common to see population
specificity with these rare alleles, which is what we would expect given population history,
but a very different picture from one that we see for the more common SNPs. So we can look at differences among populations
using a simple, genetic distance measure. And I’ll just take you through how we estimate
those to give you the basic principle. The simplest form of a genetic distance, if
we’re estimating the distance between population I and J is to simply take the absolute value
of the difference in allele frequencies. So the allele frequency in population I, minus
the allele frequency in population J.

 

So if we look back at our little matrix of
allele frequencies, our distance for locus one would simply be this number minus that
one, the absolute value. And then, we can just average this over all
of our SNVs — we might have half a million or a billion of them — to get the distance,
the genetic distance between that pair of populations. And you could imagine that this starts to
get much more complex to evaluate as we get more and more populations. If we have 50 populations, then we’ve got
a 50 by 50 matrix of genetic distances.
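
As an illustration of the distance just described (the absolute allele frequency difference, averaged over loci, for every pair of populations), here is a short Python sketch. The population names and frequencies are hypothetical, and this simple measure stands in for whatever distance was actually used on the slides.

```python
# Sketch of the simplest genetic distance from the talk: the mean absolute
# difference in allele frequency across loci. All data are hypothetical.


def genetic_distance(freqs_i, freqs_j):
    """Mean |p_i - p_j| over loci for two populations."""
    if len(freqs_i) != len(freqs_j):
        raise ValueError("populations must be typed at the same loci")
    return sum(abs(a - b) for a, b in zip(freqs_i, freqs_j)) / len(freqs_i)


def distance_matrix(populations):
    """All pairwise distances, keyed by (name_i, name_j)."""
    names = list(populations)
    return {
        (a, b): genetic_distance(populations[a], populations[b])
        for a in names
        for b in names
    }


# Hypothetical major allele frequencies at three loci in three populations.
pops = {
    "P1": [0.80, 0.60, 0.90],
    "P2": [0.70, 0.65, 0.85],
    "P3": [0.40, 0.90, 0.55],
}

for pair, dist in distance_matrix(pops).items():
    print(pair, round(dist, 3))
```

With k populations the dictionary holds k x k entries, which is the 50 by 50 matrix mentioned above when k is 50; the diagonal is zero and the matrix is symmetric.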

 

So we can use these genetic distances to build
a population network that displays similarities of populations. So let’s take that first single nucleotide
variant. Here are our three populations. And we can subtract
p2 from p1 here, so these two SNV frequencies. And we can take that difference to place a
node between populations one and two. And then, a commonly used approach averages
these two allele frequencies, the ones from P1 and P2, and then subtracts that from
p3, this frequency, to give us the distance between these two populations averaged, here,
and the third population. So we can see, very simply, that populations
one and two are more closely related; three is a bit more distantly related. And that’s essentially how these networks
are built.
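
The two steps described here can be written out for a single variant in Python. This sketch follows only the simple averaging procedure from the talk; it is not the full neighbor-joining algorithm, and the function name and frequencies are hypothetical.

```python
# Sketch of the averaging step described in the talk, for one variant.
# Not full neighbor-joining; names and frequencies are hypothetical.


def join_three(p1, p2, p3):
    """Return (distance between P1 and P2,
               distance between the P1/P2 average and P3)."""
    d_12 = abs(p1 - p2)              # places the node between P1 and P2
    merged = (p1 + p2) / 2.0         # average frequency of the joined pair
    d_merged_3 = abs(merged - p3)    # distance from that node to P3
    return d_12, d_merged_3


d_12, d_merged_3 = join_three(0.80, 0.70, 0.40)
print(d_12, d_merged_3)
```

In this made-up example the first distance is much smaller than the second, which is the pattern in the talk: populations one and two join first and population three sits on a longer branch.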

 

Now, this is kind of a whimsical analysis
that my colleague Steve Guthrie [spelled phonetically] did a few years ago, just illustrating how
you can use this technique to understand not just genetic distances, but all kinds of variation. The New York Times published this matrix of
disagreements on decisions in the U.S. Supreme Court a few years ago. So it’s a nine-by-nine matrix showing the
percent of the time that each pair of justices disagrees. This would be just like a genetic distance,
except in this case it’s a disagreement distance.

 

So you see that Justices Thomas and Scalia disagreed
only 9 percent of the time. Well, that makes sense. Whereas Thomas and Stevens disagreed most
of the time; Scalia and Stevens disagreed most of the time. But you have to stare at a matrix like this
for a while before you can intuit the pattern. So what Steve did — he was interested in
learning some of these techniques — he put this matrix into a program that made a neighbor-joining network.

 

And you can immediately see the two wings
of the court: conservative here, more liberal here, and the swing vote, Justice Kennedy. So these networks can very easily portray
relationships among individuals or populations. And that’s one of the reasons we like to use
them. So here’s an application of that technique,
a neighbor-joining network, using 100 autosomal Alu polymorphisms. So these are mobile elements that insert into
the genome. There are thousands of polymorphic Alus, where
they are present in some individuals and absent in others. We like them for these kinds of studies because
we know that if two people share an Alu at a given spot in the genome, then they share
a common ancestor in whom that Alu occurred. So these give us, essentially, polarity. We know that the absence of the Alu is the
ancestral state; the presence of the Alu is the derived state.

 

And they are virtually never precisely deleted. So they’re very good markers of events in
population history. So we looked at this series of populations,
made a neighbor-joining network using the techniques I just described, and we see some
interesting patterns in a diagram like this. Here are African populations, and we see quite
a lot of variation among these populations. Here’s a group of European populations, with substantially
less variation; East Asian, and South Indian populations, giving us a nice portrayal of human genetic
diversity in the Old World. And we also see that there’s quite a long
branch separating these Sub-Saharan African populations from the others and, as I mentioned,
more variation here. And the ancestral state, which would be the absence
of Alus, is closest to this group of populations, suggesting that this would be the ancestral
— the descendants of the ancestral population of modern humans.

 

These are bootstrap support levels telling
us that this result is supported 100 percent of the time; this branch 97 percent; this
branch 97 percent. So with just 100 polymorphisms, we have
quite good confidence in this result. Now, here’s a similar exercise done with a
250k SNP chip on 40 populations. And we see very much the same patterns again. Here’s a series of African populations. Here are the European populations. Here are populations from the Indian subcontinent
and Pakistan. Here you see a fairly long branch length for
Native American populations, but branching off an Asian cluster, as we would expect. And down here are a couple of South Pacific
populations, again, with a long branch length, indicating a founder effect as they were founded
by a relatively small number of individuals, but a pattern in general quite consistent
with what we saw for those Alu polymorphisms.

 

This is a completely different set of populations
published a few years ago in Nature, where once again we see a very, very similar pattern
both for half a million SNPs, geographic patterning to genetic distances, and also
for a smaller number of copy number variants. So the bottom line here is that we see a very
consistent picture of human genetic variation, regardless of the sampling frame, regardless
of the kinds of genetic systems that we examine. And another thing that we see very clearly
from these data is that as we go — if we look at heterozygosity — in this case, we’re
looking at haplotype heterozygosity, so these are groups of linked SNPs. And we’re asking how much they vary. We see the greatest variation in Africa, and
then a progressive decline in variation as we go from Africa to Europe to East Asia,
and then the more recently founded Polynesian and American populations. So this is a very reproducible pattern. And what it reflects is what’s termed a serial
founder effect. So the largest ancestral population was
in Africa, with a subset of that population going out to found Europe and Asia, so a founder
effect there.

 

Another subset of that population went out
to found the Americas, so a continued serial founder effect as humans spread across the
globe, resulting in less and less genetic variation, essentially, the further we go
from Africa. And this is a nice diagram published in a
review a couple of years ago that just outlines those major patterns. An out-of-Africa movement something like 80,000,
maybe 100,000 years ago; then going into Eurasia; and finally about 20,000 years ago into the
Americas; very recently into Polynesia. And one of the interesting questions and
something I’ll come back to in a minute is whether these anatomically modern humans,
people who looked just like you and me, as they came out of Africa and encountered Neanderthals
in Europe, was there mixture with that population? And we’ll come back to what genomic results
tell us about that in just a minute.

 

Now, that’s a nice summary of essentially
the origins of modern humans across the world, but there are other sources of information
on our origins. The supermarket shelf is a good one. So I ran across this at the supermarket 10
years or so ago, and I was surprised to learn that Adam and Eve’s skeletons had been stolen
— I didn’t know they had been discovered — but because there were more amazing photos
inside, I bought this, and this is what I learned: all that’s left was Eve’s
leg, and it looks like the identity of the perpetrator may have been established. It’s kind of interesting what you can learn
from supermarket tabloids. Well, another way that we can look at genetic
variation is through something we call principal components analysis, and we should go through
this, because this is a way that genetic data, population and individual data,
are often displayed now.

 

And what it is is a data
reduction technique, because imagine that you’re looking at 1,000 individuals and you
want to assess the genetic patterns, the differences, and similarities in those 1,000 individuals. You have a 1,000 by 1,000 matrix to try to explore. We need some way of reducing the variation
in that matrix down to something we can look at. That’s what principal components are. And here’s a very simple example. Let’s imagine we’re looking at height and
weight. We can diagram it like this and we can run
just a standard regression line through that set of points, and that’s the line that accounts
for as much variation in height and weight as possible; it’s probably a representation
of the overall size.

 

And then, we could run another line through
to try to account for the next greatest amount of variation. And that’s what principal components analysis
does. It takes a huge matrix, in this case, 850 by
850; each of these dots is an individual. We look at the amount of the allele sharing
between each pair of individuals, and then we run a line through that multi-dimensional
matrix and ask, what single line accounts for as much variation among individuals as
possible? And we plot the individuals along that line. And so, the first principal component here,
we can see separates this group of sub-Saharan African individuals from other populations
— so consistent with there being a founder event in which a subset of the ancestors of
this population went on to found the rest of the world — and if you look at the second
axis, it’s a west-to-east axis: Europe, west Asia, Central Asia, all the way
out to East Asia, with these groups plotting in here closest to their ancestral population.
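
To make the "single line that accounts for as much variation as possible" concrete, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is an assumption-laden illustration: it works on a rows-by-variables table (such as height and weight, or individuals by SNP genotypes) rather than the pairwise allele-sharing matrix described in the talk, the function name is made up, and real analyses would use a numerical library.

```python
import math

# Sketch: first principal component via power iteration, standard library
# only. Function name and toy data are illustrative assumptions.


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of PC1 for a list of equal-length rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the variables.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Arbitrary, slightly asymmetric starting vector; power iteration can
    # fail if the start is exactly orthogonal to the leading direction.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variation in the data
        v = [x / norm for x in w]
    # Project each centered row onto the line: its position along PC1.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Toy "height and weight" style data lying along one line.
direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(direction)
print(scores)
```

The scores are the positions of the individuals along that first axis, which is what gets plotted; a second component would be found the same way after removing the variation explained by the first.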

 

So, it’s a very convenient way in just two
dimensions of representing as much variation in human diversity as we can. Here’s a plot for just Eurasian populations,
and what you see here is that this creates essentially a map of Eurasia. So here is northern Europe, southern Europe,
Central Asia, East Asia, Southeast Asia, and then the Indian subcontinent with Nepalese
out here distributed quite widely. So this tells us that geographic patterning
does affect genetic relationships among populations because for the vast majority of our history
we’re much more likely to mate with someone five kilometers away than with someone 5,000
kilometers away.

 

And we still see the signatures of that relative
degree of isolation when we look at genetic variation in populations. Over the last few hundred years, of course,
this is beginning to change and to break down. And we’ll show you some examples of that and
how that affects our genomes. But in many cases, we can distinguish between
fairly closely related populations. So we published this just recently looking
at a couple of Tibetan populations. They speak different dialects; they’re largely
discernible from one another on a plot like this.

 

And here are different Mongolian populations,
and here; again, distinguishable on a principal components plot. So if we’re looking, for example, for population
stratification, if we’re doing an association study, this kind of a display helps us to
determine, helps us to detect stratification in populations. And then, we can use the loadings on these
axes to essentially control for that stratification if we need to. Here’s a great example. This was published by Carlos Bustamante’s group
a few years ago, looking at 3,000 individuals from Europe.

 

And what you see here, these are color-coded. Each of these is an individual. These are two principal components. They used a 500K chip and looked at allele
sharing among pairs of individuals, and this essentially reconstructs a map of Europe. So the countries here pretty much correspond
to the locations of the individuals here, although some individuals fall closer to members
of other populations as a result of gene flow through time. So this
is not by any means perfect, but they estimated that for the majority of the individuals
in their sample they could trace their birthplace to within a few hundred kilometers based on
their genetic profile.

 

So in many important ways, our history is written
in our genomes. Now, I like to compare this
plot from 2008 to one that we published, now 30 years ago, doing pretty much the same thing,
but with only 15 loci instead of 500,000. We were not able to look at individuals. You wouldn’t have adequate resolution with
just 15 loci, so we looked at allele frequencies in populations, but what you see again, with
just 15 loci, is a map of Europe. So it’s quite interesting to see this reproduced
on a much grander scale, and at the individual level with a larger number of populations. So, so far I’ve been talking about data based
primarily on microarrays, SNP arrays, but as I’m sure you’re aware, SNP arrays
miss an important part of variation; that is, variation due to less common alleles. They’re also typically selected for diversity
in a specific population, usually populations of European ancestry.

 

So we worry about biases, ascertainment biases,
in the data that we get from SNP microarrays. Sequences, on the other hand, give us information
about rare variants, and in most ways, we can consider them to be unbiased. So they do permit several inferences that
simply aren’t possible from microarray data. The reason is shown here. This was an early study done by Andy Clark
comparing the allele frequency spectrum — so these are alleles with a minor count of one,
two, three, or four — through this sample.

 

This is what you would expect at equilibrium;
that is for a constant population you expect an excess of rare alleles. For the HapMap data, which were based on SNP
microarrays, you can see that there is a real deficiency of these rare alleles because
these arrays were designed for more common SNPs. And then, for two sequence data sets at that
time, Perlegen and NIEHS, there was an excess of rare
alleles over what you would expect at equilibrium. But it’s this class of alleles that tell us
a lot of things about population history, population size, and about growth rates. So sequence data give us this information
that the microarray data doesn’t give us accurately.

 

One of the things that this allows — this
is from the 1000 Genomes data — is an accurate inference of population sizes and migration
rates through time for human populations. So these bars represent the size of populations. This is the African founder population. This is the effective size of that population. The estimate here is that about 50,000 years
ago a small piece of that population went out to found Eurasia, and then there was rapid
expansion of that derived population; very, very rapid population growth from an initial
bottleneck with migration among populations subsequently.

 

So although we think of out-of-Africa as a
single event, it was probably multiple events, and there were probably
back-to-Africa events as well, at least to some extent. But with sequence data, we can portray
human history much more accurately in greater detail. So here is an allele frequency spectrum like
the one I just showed you now for 200,400 exomes from the Seattle Group. And we see again this excess of very rare
variants; in fact, more than we would expect in a constant population. What this reflects is population growth, and
I’ll show you an example of that in a second. But one of the interesting findings of this
study is that 73 percent of all protein-coding single nucleotide variants, and 86 percent
of the deleterious SNVs are very young.

 

They’ve arisen within the past 5,000 to 10,000
years as human populations exploded because a growing population does not successfully
eliminate these rare variants, including the deleterious ones. And another interesting finding from this
study is that we see more deleterious single nucleotide variants in European and Asian
populations than in African populations. The reason for that is that European and Asian
populations had this incredible bottleneck as they came out of Africa, and then expanded
very, very rapidly retaining those rare variants, including the deleterious ones, not
necessarily lethal — those would be eliminated quickly, but other deleterious variants. And this, from a population genetic perspective,
helps to explain why we see more rare variants and more deleterious rare variants in European
and Asian than in African populations. So to understand why population expansions
increase the frequency of rare variants, let me use this little example here. So here we have an individual who has had
two children and, of course, if that individual has received a new variant, a de novo variant
from one of his parents, if he has just two children, there’s a chance with each child
that he will not pass on the new variant.

 

So the extinction probability when he has
only two children is one-half times one-half, or one-quarter. So there’s a good chance that that new variant
is simply going to go extinct in one generation. On the other hand, let’s say he’s from Utah
and he’s got 10 children. Now, the extinction probability goes to one-half
to the tenth. In other words, the chance is only about one
in a thousand that that allele will go extinct in this generation. So this would represent a rapidly growing
population, and if this person’s descendants also have a lot of descendants, that extinction
probability is low. So for rare variants, for a variant that arises
in a time of rapid population growth, they tend not to be eliminated, simply because
the extinction probability in any generation is quite low.
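
The arithmetic in this example can be sketched in a few lines. This assumes, as the talk does, that each child independently has a one-half chance of not inheriting the new variant; the function name is just for illustration.

```python
from fractions import Fraction

def extinction_probability(n_children):
    """Chance that a parent carrying one copy of a new variant passes it to
    none of n_children, assuming each child independently inherits it with
    probability 1/2."""
    return Fraction(1, 2) ** n_children

for n in (2, 10):
    p = extinction_probability(n)
    print(f"{n} children: {p} (about {float(p):.4f})")
```

Two children give one-quarter, and ten children give 1 in 1,024, which is the "one in a thousand" figure.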

 

And that helps to explain why we see this
excess of rare variants in human populations, particularly in human populations that
have undergone a bottleneck and extreme expansion. Now, I said that we’d come back to the issue
of mixture with Neanderthals because people are naturally interested in this. As our ancestors came out of Africa, did they
mix with Neanderthals? And the separation of human and Neanderthal
ancestors took place something like 300,000 or 400,000 years ago, but the question is,
when these populations were near each other some 50,000 to 60,000 — 70,000 years ago,
was there gene flow between them? And we now have, actually, very good evidence
from nuclear sequencing of Neanderthal skeletons that about 1 to 3 or 4 percent of modern human
DNA has a Neanderthal origin, but only among non-Africans, so as humans went out of Africa,
and encountered Neanderthals — probably first in the mid-East — there was a small amount
of mixture.

 

So instead of the African replacement hypothesis,
we now refer to a leaky replacement hypothesis. Neanderthals were mostly replaced, but probably
not entirely, and in fact, we see Neanderthal DNA in pretty much all non-African populations. And one of the interesting questions is, could
some of the shared sequences have adaptive significance? And there is now some evidence based on surveys
of the 1000 Genomes data that in fact, they do. For example, genes that encode keratin filaments
appear to have been selected for in these mixed Neanderthal-modern human populations. So here is a plot showing the probability of Neanderthal
ancestry in CEU Europeans, CHB East Asians, and sub-Saharan Africans. You can see that there are sections in this
individual that are very, very likely, with almost a 100 percent probability, of Neanderthal origin,
and in this European individual as well; whereas for sub-Saharan Africans, typically, you see no
evidence of Neanderthal contribution. So another interesting application of the
1000 Genomes data, searching for Neanderthal genes, searching for those that may have been
selected for adaptation in this new environment as populations were coming out of Africa many
thousands of years ago.

 

And, of course, you can send your DNA to
some of the direct-to-consumer testing companies, and they will estimate your portion of Neanderthal
genes, typically between 1 and 3 percent. So this is finding your inner Neanderthal,
as they say. So one of the interesting questions that arise
as we’re looking at population similarities and differences is what can genetics tell
us about the concept of race. And I put this in quotes because it’s a term
I don’t use in writing, but certainly, it is used, and I think often misunderstood. Here’s a quote from an editorial in the New
England Journal in the last decade, stating quite unequivocally that race is biologically
meaningless. There was a response in the New York Times
by Sally Satel, a psychiatrist, who said, “I am” — and this was deliberately provocative
— “I am a racially profiling doctor.” She argued that self-identified population
affiliation gave information about response to some of the drugs that she prescribes as
a psychiatrist.

 

So the question is, how useful a concept is
this, and what can genetics do to illuminate our understanding? Back about 10 years ago this article made
the cover of Scientific American. Steve Olson, the science writer, and Mike Bamshad,
my colleague, were the co-authors. And the question was, “Does race exist?”
According to Scientific American, science has the answer. I always get suspicious anytime they say that
science has the answer, but I think science and genetics can give us, at least, some insight. So, we can start by looking at DNA sequence
differences among individuals. And we’ve kind of gone over this concept,
but if we have DNA sequences — and I thought I would use some political figures for this
example. Let’s say we have a sequence from Rick Santorum,
Mitt Romney, Hillary Clinton, and, I almost hated to — I hated to put him in, but John
Edwards.

 

 

And our question is, how different are they? We can make a matrix of DNA sequence differences. We see that Romney and Santorum differ on
just two bases here. Clinton and Santorum differ at five. And we see that Edwards and Santorum differ
at six. Edwards and Clinton at only one. This is a hypothetical example, but now we
can put this pattern in a tree, a network, as we did before, and again, we see some very
discernable patterns, a clustering. So we can do the same thing with real DNA
sequences. And we did this with a sequence at the angiotensinogen
locus some time ago. It’s a 14 kb sequence, so a relatively small
amount of sequence. But we’re asking the question, for these major
population groups — Asia, Europe, Africa — how similar are people to one another for
the DNA at this locus? And what we see is that for this gene sometimes
an individual of African descent is more similar to people from Asia or Europe
than to others from Africa. Now, partly this reflects the fact that this
is a relatively small amount of genetic variation, but it says that for any given gene it’s very
difficult to trace population origin from that gene, and conversely, if we know your
population origin, we can’t predict necessarily your genotype or genotypes at that locus.
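
The matrix of pairwise sequence differences from the hypothetical example can be sketched as below. The four 8-base sequences are invented here purely so that the counts match the ones quoted in the talk (2, 5, 6, and 1); they are not from the slide.

```python
from itertools import combinations

# Hypothetical 8-base sequences, invented so the counts match the example
sequences = {
    "Santorum": "ACGTACGT",
    "Romney":   "CAGTACGT",
    "Clinton":  "ACTGCATT",
    "Edwards":  "ACTGCATA",
}

def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))

differences = {
    (a, b): count_differences(sequences[a], sequences[b])
    for a, b in combinations(sequences, 2)
}
for (a, b), d in differences.items():
    print(f"{a} vs {b}: {d}")
```

A table of counts like this is the input that a tree or network method turns into the clustering diagram.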

 

This also reflects the sharing of DNA that
has gone on through the history of our species, because human populations have mixed and migrated
fairly extensively throughout our history. And the mosaic patterns that we see in many
of these diagrams are a reflection of that. And this is something that Darwin
himself was aware of. He said a long time ago, “It may be doubted
whether any character can be named which is distinctive of a race and is constant.” In other words, characters tend to be shared
across populations and any single character is not going to delineate a specific population
group.

 

So we then took that same group of individuals
and used about 200 loci, and again, made a diagram. And now you see these individuals,
who are from Africa, Europe, and East Asia, so the groups are geographically
somewhat separated. But now every individual falls into a group
that is consistent with their continent of origin. Now, one of the things you notice here is
that the lengths of these branches are much, much greater within populations and that’s
consistent with that FST estimate that says most differentiation, most variation occurs
within populations, but there is enough between population variation here so that we can begin
to see a pattern according to ancestry.

 

And that may seem a little bit paradoxical
when you compare this to the diagram I just showed you for angiotensinogen, but it makes
sense if you think in terms of this being a lot more information about ancestry, about
population history. So in a way, it’s like looking at, say, just
height in males and females. If we measure everyone’s height and try to
determine sex from that, we’re going to be wrong a lot of the time, but there is on
average a difference. If we add another characteristic, like waist/hip
ratio, well, then we have a more accurate separation of our two groups. And the more characters we look at the more
accurately we can discern these two different groups.
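
The same idea can be put in numbers with a toy calculation. It assumes two populations that differ only in allele frequency (0.4 versus 0.6, hypothetical values) at every one of a set of independent SNPs, and it guesses an individual's population from the total count of one allele. Nothing here comes from the data in the talk.

```python
from math import comb

def classification_accuracy(n_loci, freq_pop1=0.4, freq_pop2=0.6):
    """Exact accuracy of guessing which of two populations a diploid individual
    comes from by counting copies of one allele across n_loci independent SNPs.
    The allele frequencies are hypothetical; ties are assigned to population 1."""
    n_copies = 2 * n_loci          # two chromosomes per locus
    cutoff = n_loci                # midpoint between the two expected counts

    def prob_count(k, freq):
        # Binomial probability of seeing k copies of the allele
        return comb(n_copies, k) * freq**k * (1 - freq) ** (n_copies - k)

    correct_pop1 = sum(prob_count(k, freq_pop1) for k in range(0, cutoff + 1))
    correct_pop2 = sum(prob_count(k, freq_pop2) for k in range(cutoff + 1, n_copies + 1))
    return 0.5 * (correct_pop1 + correct_pop2)

for n_loci in (1, 10, 50, 200):
    print(f"{n_loci:>3} loci: accuracy {classification_accuracy(n_loci):.3f}")
```

With a single locus the guess is right only about 60 percent of the time under these assumptions, while with a couple of hundred loci it is right almost always, even though the per-locus difference between the populations never changes.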

 

So, that’s essentially what we’re saying,
is that with more genetic information we can more accurately discern the histories, at
least, on a very basic level, of these continental populations. So here’s another example, now using more
single nucleotide variants, and you can see in this neighbor-joining network it appears
that there are groups, and they do correspond to various worldwide populations. These are new world populations, Asian populations,
African, a Spanish population, and a south Indian population. But we shouldn’t get too misled by this,
because we can add populations with a more complex history, such as African Americans,
where some fall into this group with African populations; others trend toward other groups
because of the complex history of this population.

 

The same thing if we look at, say, Puerto
Ricans, who, again, have a complex history, complex ancestral history, where some fall
in with the Spanish group, others fall in closer to an African group. So the point here is that, especially as human
populations become more mobile, it’s very difficult to classify every individual into
a nice, neat category. Here’s a similar exercise that my graduate
student Wilfred Wu carried out a year or so ago with the complete genomics data. So this is a whole genome sequence. And we see very much the same kind of pattern
where in general individuals — and these are individuals from the 1000 Genomes Project,
sequenced by complete genomics — and we can see that in general, these population groups
do tend to fall together, but there are interesting exceptions.

 

For example, individuals from Mexico are distributed
in various places throughout the graph, once again illustrating their complex demographic
history. Another thing Wilfred did that was kind of
interesting, just from a genomics point of view, was to compare — he included here the
same subjects sequenced in the 1000 Genomes database with their sequence in Complete Genomics,
and on average the between platform differences were about 348,000 variants. A lot of that has to do with relatively low
coverage in the 1000 Genomes database, so we would expect it. And it’s kind of encouraging that
each of these pairs, which are the same individuals on two different platforms, did at least cluster
together.

 

But you can see the between-platform difference
quite clearly in this slide. So here’s just one more example of the point
I’m making here. This is a principal component plot for American
populations of African, European, Asian, and Hispanic descent. And again, you see that some individuals,
for example of African descent, are closer to members of other populations than they
are to many of the other individuals of African descent. So very difficult to put a self-identified
group into a nice, neat little compartment. So what this tells us is that if we look at
multiple polymorphisms, if we look at a lot of SNPs or single nucleotide variants, if
we look at enough of them we can often learn something about population affiliation, kind
of the non-overlapping parts of these circles. But the converse — and this is where people
sometimes get confused — the converse is not true. If we know your population affiliation, we
can’t predict your SNP genotype, because these populations typically differ just in the frequency
of SNPs and there’s a lot of overlap.

 

So I think that’s a very important point that
we need to make, especially to the general public. And it points up, I think, the fallacy
of thinking typologically, which is what racial categories tend to be. Humans don’t fall into discrete groups
like this. They’re — what the genetic data tell us is
that there’s a tremendous amount of overlap in genetic information across human populations. But here’s a good example of that, or also
of how self-identified population affiliation can be misleading. Wayne Joseph was a principal, is principal,
in the school system in California. He was raised in a family in Louisiana that
was self-identified as African American.

 

He sent his saliva into a direct-to-consumer
testing company and this is what he learned: that, at least according to their estimates,
he was 57 percent European, 39 percent Native American, maybe 4 percent East Asian, although
that could just be an error term, but no apparent African ancestry. So in his case, his self-identified population
affiliation appears to have been completely wrong. Now, this didn’t change anything important
for him. Culturally he maintained his same affiliation,
but it shows how that self-identified affiliation can be wrong, can be misleading. So I think a much more useful concept
than race is individual ancestry, because we can now estimate genetic ancestry for individuals,
at least at a broad level. And someone with this apportionment of ancestry
would likely self-identify as African American as very likely would someone with this.

 

And yet their ancestries and their genetic
makeup could be quite different. And that’s why I think it’s much better really
to assess ancestry at the individual level rather than to use these categories. I’ll just give you an example from my
genetic testing because I sent my DNA to one of the companies — I guess this was 23andMe
— a few years ago. And they will assess your paternal and maternal
ancestry. This is based on Y-chromosomes.

 

So I have this particular Y-haplogroup, I1*. And it was kind of amusing to learn I share
it with Jimmy Buffett and Warren Buffett. They don’t know that. [laughter] Lynne Jorde:
And it hasn’t done anything for my singing or my investing ability. But my grandparents all came from Norway. So this is consistent with what I know about
my ancestry. My maternal line, my mitochondrial DNA again,
the haplogroup I have is quite common in Europe, fairly widely spread throughout Europe. So that, again, makes sense. And then using ancestry informative markers
across the genome, they attempt to essentially paint your chromosomes with ancestry. And I was hoping that I would have something
exotic, but according to this, at least, my ancestry derives 100 percent from Europe. I was hoping that my kind of rambunctious
Viking ancestors might have brought something interesting into the genome, but it doesn’t
look like that’s the case. But here we’re looking at the ancestry of
a Berber female from North Africa. So this is an African, but where 86 percent
of the ancestry is predicted to be European-derived.

 

And we see quite a lot of mixture in that
ancestry, even more so for a self-identified African American. And the important point here is that for this
individual some regions of the genome would be African-derived, and other regions of the genome
would be European-derived. And if we’re interested in disease susceptibility
that is genetically related, what we want to do is to look at those specific regions
and look at their genetic makeup rather than assessing self-identified population affiliation. So for biomedicine, I think these findings
do have some important implications. First of all, they tell us that if we look
at a large number of independent polymorphisms we can learn about population history.

 

We can learn about ancestry. But, and very importantly, these variants
typically differ only in their frequency and they typically overlap a lot among populations. And here’s an example of that. This was a study done on response to ACE inhibitors
in African American and European American populations — a very large meta-analysis
— and it addressed the issue or the question of whether African Americans tend to respond
less to ACE inhibitors for lowering blood pressure than European Americans. And what we see here is that the decrease
in blood pressure, in systolic pressure in response to ACE inhibitors, is a few millimeters
less in African Americans than European Americans, but that there’s a large distribution here,
a large amount of overlap, and so as you can see, many of the African American patients
would respond better to an ACE inhibitor than would many of the European American patients.
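
To get a feel for how much overlap a difference of "a few millimeters" leaves, here is a small sketch. The means and the standard deviation are hypothetical, chosen only for illustration and not taken from the meta-analysis, and both groups are assumed to be normally distributed with the same spread.

```python
from math import sqrt
from statistics import NormalDist

def prob_a_beats_b(mean_a, mean_b, sd):
    """Probability that a random patient from group A shows a larger
    blood-pressure drop than a random patient from group B, assuming both
    groups are normally distributed with the same standard deviation."""
    difference = NormalDist(mu=mean_a - mean_b, sigma=sd * sqrt(2))
    return 1.0 - difference.cdf(0.0)

# Hypothetical numbers, not taken from the study: mean drops of 11 and 15 mmHg
# (a gap of "a few millimeters") and a within-group SD of 12 mmHg.
p = prob_a_beats_b(mean_a=11.0, mean_b=15.0, sd=12.0)
print(f"P(patient from lower-mean group responds better) = {p:.2f}")
```

Under these made-up inputs, a patient drawn from the group with the lower average response still out-responds a patient from the other group a little over 40 percent of the time, which is why the group average is a poor guide for any one person.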

 

So far better than using this average difference
as an indicator of who should get this drug, it’d be much better to be able to look directly
at genotypes in individuals to predict response. And we see a good example of that with EGFR
inhibitors and non-small cell lung cancer. So EGFR inhibitors, like gefitinib and erlotinib,
inhibit tyrosine kinase activity, and they’re estimated to be effective in treating this
condition in roughly 10 percent of Europeans, but a higher percentage of Asians. So one might imagine using population affiliation
as an indicator for who should get this drug to treat non-small cell lung cancer. But it’s interesting that if you look at somatic
mutations in EGFR — these are gain-of-function mutations — we see those in about 10 percent
of European patients with this condition, a higher percentage of Japanese and other
Asian individuals, and in fact, 70 to 80 percent of those who have the mutations respond to
the drug; fewer than 10 percent of those without the mutations respond. So you can see that looking at the gene itself,
looking for gain-of-function mutations is a much better indicator of who is going to
be a good responder than is the more blunt population category.

 

This is one more example of that: the response to,
or the calibration of, warfarin dosage. So this is a standard clinical algorithm that
takes population affiliation into account, but here are the results of looking — of
doing genetic testing for VKORC1, and CYP2C9; they are both involved in warfarin metabolism. And here you see a much, much bigger difference
between this genotype category and this genotype category than we do across a population. So again, individual testing gives us a much
better prediction of response than the use of population affiliation.

 

So what this, I think, tells us is that genetic
variation, we’ve seen, is correlated with geographic location, but it tends to be distributed
continuously across space. It’s difficult to delineate specific borders
or boundaries between populations. So race — going back to a question raised
earlier — may not be biologically meaningless, but it’s biologically very imprecise. It is a very blunt tool, and we can use better
tools, genetic tools, to infer individual ancestry, and that, I think, will provide
more medically relevant and useful information.

 

And I want to go on now to the topic of linkage
disequilibrium, but everyone has been sitting very still for about an hour, and so I’d like
to invite you to just stand up and stretch for a minute before we do the last half hour
of the lecture. So I think it’s cruel and unusual punishment
to make you sit here for 90 minutes. It violates your — what is it? Eighth Amendment rights. Female Speaker:
Can I ask a question while we’re — Lynne Jorde:
Oh, yes. Absolutely. Female Speaker:
— not on. Is it on? What’s the — how did you define what were
Neanderthal genotypes? Lynne Jorde:
Okay. That’s a — Female Speaker:
Since there are no Neanderthals around. Lynne Jorde:
Oh, but several have been sequenced. Female Speaker:
From frozen material, as it were? [laughs] Lynne Jorde:
Well, it was — I’m not so sure about the exact provenance, but several Neanderthal
specimens have been sequenced, including one at 42X coverage, so that has given a baseline
for the Neanderthal genomes.

 

Female Speaker:
And they’re taken from geographically somewhat diverse areas? Lynne Jorde:
Yeah. One was in Croatia, another much further east. I don’t remember the exact location. Female Speaker:
But the point is they’re closer to each other than to anything else. Lynne Jorde:
Yes. Female Speaker:
Okay. Lynne Jorde:
They’re much — the sequences, the Neanderthal sequences are much more similar to each other
and quite divergent from humans. Male Speaker:
I had another question on the — so the findings of much greater diversity, genetic diversity
within Africa than other populations, how much of that is due to population substructure
within Africa, FST values between different populations in Africa? Lynne Jorde:
Yeah. So there is, as you’d probably expect, more
substructure as we look across Africa; that population has been resident in Africa
longer, and has had more time to subdivide and differentiate.

 

It also has a larger effective size, and the
larger the effective size of a population, the more variation you see. So in all the different kinds of systems we’ve
looked at, we tend to see about 20 to 25 percent more variation in samples of persons of African
ancestry than in non-African — those of non-African ancestry. Okay. Well, it looks like everyone has sat back
down, so we’ll go on to talk about what I think is a very interesting application of
a population genetic concept, linkage disequilibrium, to disease gene mapping. Let me ask, how many of you are familiar with
the concept of linkage disequilibrium? Okay, I see just a few hands. So let’s go through this because this turns
out to be very important for understanding not just SNP data, but also genome data.

 

So basically linkage disequilibrium is,
it can be described as the non-random association of alleles at linked loci. So let’s imagine that we have here two loci,
A and B, and their alleles are big A and little A, and big B and little B. At equilibrium, we’re going to see all possible
combinations, but in disequilibrium where there’s a non-random association of alleles,
we see big A and big B together, little A and little B together, but very seldom do
we see the other combinations. And that, in essence, is what we mean by linkage
disequilibrium. Now, we can quantify this by looking at the
allele frequencies of big A and little A, 60 and 40 percent; big B and little B, 70
and 30 percent. Now, what we would expect under equilibrium
is that in this population, haplotypes having this combination would be seen 42 percent
of the time.

 

That is the frequency of big A multiplied
by the frequency of big B. That’s essentially random association, very much like Hardy-Weinberg,
except now extended to two loci. Similarly, we would expect the frequency of
big A and little B together on the same chromosome copy, the same haplotype, to be 18 percent,
60 percent times this frequency, 30 percent. So that’s what we would expect under linkage
equilibrium, but let’s suppose we assay a population and we see that we have a real
excess of this haplotype and an excess of this haplotype, and then a paucity of the
other two haplotypes. Well, that would be linkage disequilibrium. We’re finding these alleles in combination
much more often than we would expect given their frequency. So what this suggests, most of the time, is
that the alleles that have higher linkage disequilibrium have had less opportunity for
crossover to occur between their respective loci. So over many generations, we’re going to find
big B and big C together on the same haplotype more often than big A and big B, because being
further apart, these two loci will have had their alleles broken up by recombination much
more frequently than this pair.
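
The expected haplotype frequencies above can be worked through in a short sketch. The allele frequencies (60 and 70 percent) are the ones from the talk. The observed haplotype frequency of 0.55 is hypothetical, and D, the observed frequency minus the equilibrium expectation, is a standard way of quantifying disequilibrium that the talk does not name explicitly.

```python
def expected_haplotype_freqs(freq_A, freq_B):
    """Haplotype frequencies expected under linkage equilibrium, i.e. when
    alleles at the two loci are associated at random."""
    freq_a, freq_b = 1 - freq_A, 1 - freq_B
    return {
        "AB": freq_A * freq_B, "Ab": freq_A * freq_b,
        "aB": freq_a * freq_B, "ab": freq_a * freq_b,
    }

def disequilibrium_D(observed_AB, freq_A, freq_B):
    """D = observed AB haplotype frequency minus its equilibrium expectation."""
    return observed_AB - freq_A * freq_B

expected = expected_haplotype_freqs(0.6, 0.7)   # allele frequencies from the talk
print({hap: round(f, 2) for hap, f in expected.items()})

# Hypothetical observed frequency of the AB haplotype: 0.55 rather than 0.42
print("D =", round(disequilibrium_D(0.55, 0.6, 0.7), 2))
```

The expected values reproduce the 42 percent and 18 percent figures, and a D of zero would mean the two loci are in equilibrium.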

 

So what that implies is that we can look at
linkage disequilibrium patterns as a way of inferring how close together any two loci
are. It’s another way of doing linkage analysis. But it has some advantages. We don’t need family data necessarily. If you’re doing traditional linkage, you’re,
of course, counting recombinants from one generation to the next. We can use microarrays or sequence data so
we can look at a large number of single nucleotide variants spaced as closely as a kb or so. And we can do association studies that effectively
incorporate not just the last two or three generations of recombination to map loci,
but essentially all of the generations of recombination that have occurred since a variant
arose, because, really, for any given mutation, populations are in essence just one big pedigree.

 

So if all of these individuals in these different
families inherited a mutation from this founder back here, what linkage disequilibrium allows
us to do is to look at recombinations that have occurred between this mutation and nearby
SNPs throughout the generations. So in principle, it allows us to more finely
map loci than we could map if we were just doing recombination mapping, linkage analysis,
in say three generations of a family. So that’s the advantage of linkage disequilibrium,
and that’s one of the reasons why, if you look at the number of papers published over
the last 30 years on linkage disequilibrium, back in the 1980s — this was when I first
became interested in that topic — there were about 50 papers a year published on linkage
disequilibrium.

 

You could read a paper a week, and you knew
everything that was going on. Now, this has kind of plateaued, but around
1,200 to 1,400 or 1,500 papers a year are published on this topic. So it has gained a lot of interest, and a lot
of popularity as a gene mapping tool. But there are a lot of factors that can influence
linkage disequilibrium patterns.

 

One is chromosome location, just as with recombination.
Because recombination is more common near telomeres, the relationship between the linkage
disequilibrium between two loci and their actual distance is going to vary. We also know that there is less recombination
within genes than in extragenic regions which will, again, affect the relationship between
linkage disequilibrium and physical distance. Sequence patterns affect recombination and
therefore linkage disequilibrium. So GC content is associated with more recombination,
presence of inserts like Alus is associated with more recombination. We know now of recombination hotspots every
50 to 100 kb in the genome; in particular, motifs that are bound by this zinc finger
protein, PRDM9, are associated with a high proportion of hotspots.

 

Interestingly, there is more variation
in PRDM9, in a repeat unit in PRDM9 in African populations than non-African populations,
which is one of the reasons why we see more recombinations in African populations, along with their population
history. And finally, the evolutionary factors that
we’re interested in in population genetics — natural selection, gene flow, mutation,
and genetic drift — all influence the pattern of linkage disequilibrium. So linkage disequilibrium can be rather complex
to interpret. Here’s an example: the age of a population. And of course, populations all have
the same age, but when we talk about an older population, we’re referring to a population
that was founded longer ago like the current African population, and in such populations,
there have been many generations for recombinations to occur. So that means that there’ll be a lot of different
haplotypes in relatively smaller blocks in a population like that. On the other hand, if we look at, for example,
a Finnish population, most of which was founded relatively recently from a small number of
individuals, there haven’t been as many generations passing for recombinations to have occurred,
so we tend to see larger blocks of haplotypes, and more disequilibrium.
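
The effect of population age on disequilibrium can be sketched with the standard textbook decay model, D_t = D_0 (1 - r)^t, where r is the recombination fraction between two loci and t is the number of generations. The model is not stated in the talk, and the starting value, recombination fraction, and generation counts below are hypothetical.

```python
def ld_after_generations(d_initial, recomb_fraction, generations):
    """Expected disequilibrium D after a number of generations of random
    mating, using the standard model D_t = D_0 * (1 - r)^t."""
    return d_initial * (1 - recomb_fraction) ** generations

d0 = 0.13   # hypothetical starting disequilibrium
r = 0.001   # hypothetical recombination fraction between the two loci
for label, generations in [("recently founded", 200), ("older", 2000)]:
    d_t = ld_after_generations(d0, r, generations)
    print(f"{label} population, {generations} generations: D = {d_t:.4f}")
```

The more generations that have passed since founding, the further D has decayed, which is the sense in which an older population shows smaller haplotype blocks.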

 

And that means that a mutation here will be
associated with more nearby polymorphisms even after many — even in modern populations,
whereas a mutation that occurred in this population will tend to occur in association with a smaller
set of polymorphisms. And if we look at patterns of disequilibrium
in these populations — we’re looking now back at the angiotensinogen locus, and each
of these little units here is an SNP at that locus, and we can interpret this plot much
like we do a mileage chart.

 

For those of you that remember mileage charts
from atlases, this might be, say, San Francisco, this would be New York, and here would be
the distance between San Francisco and New York. If this were Los Angeles, then this would
be the distance between San Francisco and Los Angeles. Well, for these SNPs, this is the amount of
linkage disequilibrium between these two SNPs, this pair of SNPs, and this is the amount
of linkage disequilibrium between these two rather distant SNPs.

 

Red indicates high disequilibrium, and what
you see here is much more disequilibrium in this locus in the more recently founded Eurasian
population than in the African sample; so consistent with what we know about population
history. So one of the questions that we want to ask
is “Well, how general are these patterns across the genome? And how much does linkage disequilibrium vary
with genomic location and with population?” And I would say that about 10 or 12 years
ago, our knowledge of that, of haplotype structure across the genome, was kind of like this map
of the world in 1544. I think these maps are fascinating. At the time Europe was reasonably mapped out,
Asia to some extent, North America was not even on the map, so it was a — it was a pretty
low-resolution and misinformed map of the world. Well, the HapMap Project, which I know all
of you have heard about, really sought to create a better, more accurate map, haplotype
map, of the human genome.

 

So it started with 600,000 SNPs. That was expanded. The populations were three: 90 Utah CEPH individuals
representing people of European ancestry, 90 Yorubans from Nigeria, and 90 East Asians;
by no means a complete sample of human diversity, but a small subsample. And the idea was to evaluate patterns of linkage
disequilibrium and haplotype structure across the genome in these different populations. And I think the result was a map that looks
more like this. By 1688 we had a much better-resolved map
of the world. Somehow California still escaped notice here,
but for the most part, we had a much, much better map of linkage disequilibrium. And this has led to some very useful applications:
first of all, understanding human genome-wide haplotype diversity; detecting recombination
hotspots; detecting genes that have experienced strong natural selection; and of course, detecting
disease-causing mutations. And in this last part of the talk, I’ll go
through a few examples of those. Certainly one of the take-home messages from
that project was that SNPs, many SNPs throughout the genome are redundant.

 

So if you have this SNP here, then you almost
certainly have these alleles here, so these tag SNPs are all we have to genotype. The others are effectively uninformative,
because they’re in strong linkage disequilibrium, meaning that we don’t have to type nearly
as many polymorphisms to get complete coverage of the human genome, more in individuals of
African descent, but still far fewer than the total number of SNPs that have been discovered. And that has led to this success story, I
think, and you’ve all seen this slide or some version of it: the many, many hundreds of
replicated associations across the genome using SNPs designated from the HapMap Project.

 

Now, it also — these data also allow us to
detect hotspots of recombination, because what we will see often is blocks of linkage
disequilibrium for this group of sequences, where there are strong associations among
alleles, but no association between essentially this block and this block, because of a recombination
hotspot, where recombination is elevated at least tenfold over the rest of the genome,
rapidly disassociating those groups of alleles from one another. And of course, that is going to influence
our estimates of distance among loci. If there’s an intervening hotspot, we’re going
to have unexpectedly low linkage disequilibrium. So we estimate that there are as many as 50,000
or so recombination hotspots throughout the genome, and that about 60 percent of all recombination
occurs in just 6 percent of the genome, much of it focused on these highly active hotspot
areas.
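
The effect of a hotspot on linkage disequilibrium can be sketched with the standard decay relationship D_t = D_0 (1 - c)^t, where c is the recombination fraction between two loci per generation. The numbers below are hypothetical; only the tenfold elevation comes from the talk.

```python
def expected_ld(d0, recomb_fraction, generations):
    """Expected disequilibrium after t generations: D_t = D_0 * (1 - c)**t."""
    return d0 * (1.0 - recomb_fraction) ** generations


# Hypothetical values chosen only for illustration.
D0 = 0.25                    # starting disequilibrium
BASELINE_C = 1e-4            # assumed recombination fraction without a hotspot
HOTSPOT_C = 10 * BASELINE_C  # tenfold elevation across a hotspot
GENERATIONS = 1000

background = expected_ld(D0, BASELINE_C, GENERATIONS)
across_hotspot = expected_ld(D0, HOTSPOT_C, GENERATIONS)
print(f"no hotspot:   {background / D0:.2f} of the original D remains")
print(f"with hotspot: {across_hotspot / D0:.2f} of the original D remains")
```

Two pairs of loci the same physical distance apart can therefore show very different disequilibrium if one pair straddles a hotspot.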

 

And what’s very interesting is that hotspots
vary among species. In fact, in chimp, the location of hotspots is
completely different from that of humans. The sequence motif that human PRDM9 recognizes is not active in chimps, so that explains
part of it. And we also see variation even among human
populations, in the location and activity of hotspots. So this is helping us to understand
this very, very important property of genomes, how frequently, and where they shuffle and recombine.

 

Now, another thing that these linkage disequilibrium
patterns allow us to do is to detect regions that have undergone very strong natural selection. And the idea is diagrammed here. If we imagine a new DNA variant arising on
a haplotype background, then if it’s neutral, that is, if it does not undergo
natural selection, it will in some cases slowly increase in frequency, but as it does so,
that background haplotype, that is, the other SNPs associated with it, becomes smaller and
smaller due to recombination that’s occurring generation after generation.

 

So for a neutral variant, if it attains high
frequency it will have relatively low disequilibrium with other nearby SNPs, because of recombination. But now imagine that this is an advantageous
variant that sweeps very rapidly to high frequency. What it’s going to do is to carry those other
variants along with it, also at high frequency, and we’re going to see long regions of homozygosity
in populations, because of selection not only of this advantageous variant but of nearby
SNPs.
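
One common way to quantify this signature is extended haplotype homozygosity (EHH): among haplotypes carrying a given allele at a core SNP, the chance that two picked at random are identical from the core out to some other position. The sketch below is a simplified illustration with made-up haplotypes, not the analysis pipeline used in the studies mentioned in this talk; the function name and data are assumptions.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, end):
    """Extended haplotype homozygosity from index `core` out to index `end`.

    Only haplotypes carrying `allele` at `core` are considered. The result is
    the fraction of carrier pairs that are identical across the interval.
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    lo, hi = min(core, end), max(core, end)
    counts = Counter(tuple(h[lo:hi + 1]) for h in carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Hypothetical haplotypes; the core SNP is at index 2.
# Carriers of "1" look like a recent sweep: nearly identical across the region.
# Carriers of "0" have been broken up by recombination over a long time.
toy_haplotypes = [
    "11111", "11111", "11111", "11110",
    "00001", "01000", "10010", "00011",
]
for end in (0, 4):
    swept = ehh(toy_haplotypes, core=2, allele="1", end=end)
    other = ehh(toy_haplotypes, core=2, allele="0", end=end)
    print(f"to index {end}: EHH(allele 1) = {swept:.2f}, EHH(allele 0) = {other:.2f}")
```

A high-frequency allele whose EHH stays high far from the core is the kind of candidate one would examine further for recent positive selection.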

 

So we can look for regions that have this
signature as a way of detecting SNPs, detecting variants that have undergone very rapid positive
selection. So this is another illustration of the idea. If there’s a positive selection for this variant,
it will pull the adjacent variant along with it that it’s associated with here, and after
a while, most, maybe all, members of the population will have this combination of variants. You’ll see the region of homozygosity here. We can compare that, for example, to purifying
selection where variants occur, but because they’re deleterious, they simply get eliminated. So this approach is now being used in several very interesting applications. For example, to show that the variation at
the G6PD locus was selected for very strongly for malaria protection. This cytochrome P450 locus underwent selection
for sodium retention. A very interesting story is the lactase enhancer:
independent populations, some in Europe and some in Africa, that are herding
populations have hereditary lactase persistence so that they can digest milk throughout their
lives. There’s an enhancer element that has undergone
strong selection in those independent populations, a good example of convergent evolution just
within the last 10,000 years.

 

Several skin pigmentation loci have,
again, undergone rapid selection as humans encountered new environments, and Tyra mentioned
work that we and others have done on high-altitude hypoxia response in Tibetan populations, because
Tibet is one of those great, essentially natural experiments done on humans where humans
moved to an altitude of 15,000 feet or so and successfully adapted, in part by altering
their response to high-altitude hypoxia.

 

And so we’ve discovered selection, and now
specific variants, at these members of the HIF pathway that helped to confer
that high-altitude adaptation. So these were all discovered by exploiting
these properties of linkage disequilibrium in populations. And I’ll say that population genetics is also
guiding the development of sequence analysis, as we are now analyzing more and more exomes
and whole genomes. The 1000 Genomes Project provides a very useful
set of control sequences because whenever we sequence a group of patients, one of the
questions is if we find a variant in that group, is it a variant that is absent in other
populations, or at least very rare? And the 1000 Genomes Project has given us
one of the important sets of control sequences for that kind of variant analysis.
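
As a rough sketch of that filtering step, the code below keeps only patient variants that are absent from a control set or rare in it. The variant tuples, the frequencies, and the 1 percent cutoff are hypothetical placeholders, not real 1000 Genomes Project data, and the function name is an assumption.

```python
def filter_rare(patient_variants, control_freqs, max_freq=0.01):
    """Keep variants that are absent from, or rare in, the control sequences.

    patient_variants: iterable of (chrom, pos, ref, alt) tuples
    control_freqs:    dict mapping the same tuples to control allele frequency
    """
    return [
        v for v in patient_variants
        if control_freqs.get(v, 0.0) <= max_freq  # absent counts as frequency 0
    ]


# Hypothetical example data.
patients = [
    ("1", 1000, "A", "G"),  # common in controls
    ("2", 2000, "C", "T"),  # rare in controls
    ("3", 3000, "G", "A"),  # not seen in controls
]
controls = {
    ("1", 1000, "A", "G"): 0.32,
    ("2", 2000, "C", "T"): 0.002,
}
candidates = filter_rare(patients, controls)
print(candidates)
```

The cutoff is a judgment call that depends on the disease model, so it is left as a parameter here.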

 

And I think we need our population genetic
analysis to inform us about the nature, the behavior of rare variants, because these rare
variants often are the ones that we are especially interested in, in terms of their association
with disease now that we’re able to do whole genome sequencing. And evolutionary principles, population genetic
principles, help us to determine when a variant is functionally significant, because
we can find associated variants, but figuring out which ones have functional relevance
is, in many cases, quite a challenge.

 

So we incorporate, and others do this as well,
purifying selection. We look at regions that have undergone purifying
selection as a way of prioritizing candidate variants when we’re doing genome analysis. And I’ll just mention this software that we’ve
developed in the last few years: VAAST, and now Pedigree VAAST, which has just come out. So this is a tool for analyzing sequence data,
and Pedigree VAAST makes use of sequence data in pedigrees. So that’ll — that’s just coming out in Nature
Biotechnology, but one of the things we use is evidence of purifying selection to assign
functional significance, and of course, evolutionary conservation among species — again, very
useful in deciding which variants might have functional relevance and significance.

 

So I’ll just wrap up by saying that what I
hope you’ve seen today is that genetic variation does contain a lot of useful information about
the history of our species. I think it gives us a more subtle and nuanced
view of issues like race and how relevant they are or are not to medicine. I think it gives us some useful alternatives. And population genetic analysis, and our
understanding of concepts like linkage disequilibrium, has been of fundamental significance
in gene mapping efforts, and now as we’re trying to understand the role of rare and
common variants in disease, again, understanding the evolutionary processes that give rise
to those variants is turning out to be of key significance.

 

And I hope you’ve seen that population genetics,
which sometimes people associate with a lot of heavy math, is fun. So I hope I’ve convinced you of that. I’ll leave you with this picture of the lovely
Wasatch Mountains. This is my back yard where I enjoy playing,
and here are some of the people that contributed to some of the work I told you about, and
I want to thank you for your kind attention, and I’m happy to take any questions. [applause] Lynn Jorde:
We’ve got about three or four minutes here. Yes, sir? Female Speaker:
Can you use the microphone, please? Can you use the microphone? Male Speaker:
Like plants and microorganisms, do humans have significant numbers of mobile elements,
transposons and such, and how does this complicate genetic analysis? Lynn Jorde:
Yeah.

 

That’s a great question. We estimate that at least half of our genome
is derived from mobile element insertions. So if you look at Alus, it’s about 11 percent. There are more than a million 300-base Alus
in the human genome, mostly inserted earlier in the course of primate evolution, but some
of them since humans diverged along their independent lineage. Another 17 to 20 percent are LINE-1 elements,
and one of the interesting questions is what effect these have on the genome. We know that occasionally you can have, for
example, transduction of other genetic elements, as an L1 pops out and goes someplace else;
sometimes it takes other material with it because it has a rather weak poly(A) signal,
so it is sometimes involved in the transfer of other genetic elements. Because these are highly methylated sequences,
they’re CG-rich, and they may affect gene regulation depending on where they land.

 

So we think that they do occasionally have
effects on the genome. And, of course, we’ve got some very good examples
now in which these elements have been inserted into a specific gene and caused loss of function. And there are some good examples in which they
mediate unequal crossover. The BRCA1 gene is full of Alu elements, and
that’s one of the reasons why you see so many deletions in BRCA1 is that these Alus are
mediating unequal crossover and causing deletion. Also, they do have some interesting effects,
and I think because they’ve been difficult to identify easily, they have been somewhat
challenging to understand, but a lot of work is being done in that area. Other questions? Okay. Well, thanks very much. [applause].

