What is a gene, and what is it for?

The gene is a fundamental concept in biology, and it has been ever since Mendel introduced it in the late 19th century, unbeknownst to Darwin, who at the very same time was looking for a theory of heredity, flirted with Lamarckism, and eventually came up with his own, incorrect, notion of blended inheritance.

Mendel’s work was rediscovered in 1900 (it’s bad for one’s academic career when one publishes in obscure journals and agrees to become an administrator, as the Augustinian friar did), and ever since it has been a crucial component of our understanding of biology. But scientists have developed a number of different concepts of the gene, concepts that don’t always sit coherently with one another. Sounds like a job for philosophers of science…

And sure enough, a number of philosophers have written about genes and what they are “for” (including yours truly). By far the most active pair, however, is that of Paul Griffiths and Karola Stotz, who have published a number of influential papers on gene concepts, and have now produced a new book on the topic, Genetics and Philosophy: An Introduction. Below I will comment on an in-depth essay about the book, published by Lindell Bromham (a biologist) in Biology and Philosophy. (Often book reviews in philosophy journals are a lot more than just book reviews, providing the reviewer with a way to articulate their own thoughts on whatever subject matter is covered by the book.)

Before proceeding, one might well wonder what is the point of this sort of philosophical exercise. As Bromham puts it: “the word ‘gene’ belongs to all of us, and understanding its meaning is critically important for informed discussion of many important ideas in medicine, biology and in wider society. None of us is unconnected to debates about what is a ‘gene for’.”

One of the first things Bromham clarifies in his commentary is that common locutions such as “gene for bowel cancer” are nonsense. This is because genes, of course, don’t (usually) evolve to harm their carriers. What typically happens, though, is that we identify a gene when it misfires (say, by contributing to causing bowel cancer), even though its normal function is something else entirely. It would be like trying to understand how cars work by studying them only when they break down, and concluding that the brakes are “for” smashing into other cars when they unexpectedly cross your path.

Bromham correctly argues that one problem is that there is no common, i.e., non-technical, term to identify gene variants (the technical word is “allele”), so that people can make more sense of statements from doctors and other health care professionals: “So in most people the Huntington Disease gene does not give them Huntington’s Disease, only those people unlucky enough to inherit an allele of the gene with more than 40 repeats will get the disease.” Which means that people shouldn’t worry about whether they carry the gene “for” Huntington’s, but rather whether they have inherited a particular, lethal allele of a gene that does whatever it does when it works properly and causes Huntington’s when it doesn’t.
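The allele distinction can be made concrete with a toy computation: the disease-associated Huntington’s alleles are exactly those whose CAG trinucleotide tract exceeds the quoted 40 repeats. A minimal Python sketch (the function names and toy sequences are mine, invented for illustration; the >40 threshold is the figure quoted above):

```python
import re

def cag_repeat_count(sequence: str) -> int:
    """Length (in repeats) of the longest uninterrupted CAG tract."""
    runs = re.findall(r"(?:CAG)+", sequence)
    return max((len(run) // 3 for run in runs), default=0)

def classify_allele(sequence: str, threshold: int = 40) -> str:
    """Classify an allele by its CAG tract length."""
    if cag_repeat_count(sequence) > threshold:
        return "expanded (disease-associated)"
    return "typical"

# A typical allele might carry ~20 repeats; a pathogenic one, more than 40.
typical = "CAG" * 20
expanded = "CAG" * 45
```

The point of the sketch is that both sequences are alleles of the same gene; only the repeat length differs.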

The bulk of the paper is then devoted to a section-by-section discussion of Griffiths and Stotz’s classification of different concepts of genes, beginning, of course, with the Mendelian one: “A mendelian ‘gene’ is a pattern of inheritance: it does not need to be a ‘gene’ in the sense of being a cistron [a DNA sequence that functions as an unbroken hereditary unit] that makes a product or influences phenotype.”

One of the complications is that sometimes blocks of cistrons (known as “haplotypes”) are inherited at once, perhaps because the individual genes making up the haplotype contribute to a particular biological function and natural selection therefore favored their inheritance as a larger, coherent unit (a “supergene”).

Then again, sometimes natural selection surprises biologists, as in the case of the mimicry in some species of butterflies, such as Papilio polytes. Their mimetic phenotype is the result of a coordinated action among a number of genes, so for a long time scientists thought these genes must be part of a single haplotype block and be inherited together, to maintain coordination. But no, it turns out that the whole shebang is orchestrated instead by a regulatory gene, known as doublesex, which sends signals to all the other genes involved in the production of the mimetic phenotype. No need for a supergene, if the work can be done by a single regulatory element.

As Bromham summarizes it: “Uncovering the rich complexity of gene action has not dissolved the mendelian gene, but it has disassociated the mendelian gene from any particular form of DNA sequence.” Moreover, there are genes whose influence is actually dependent on epigenetic factors, as in the case of the GNAS complex, defects in which are associated with a number of diseases in humans, including certain kinds of tumors. Some of these deleterious effects, however, are manifested only if the GNAS allele is inherited from the mother, while others take place only if the variant is inherited from the father. Not exactly a standard Mendelian factor.

There are also so-called “postgenomic genes,” which are defined as “the DNA sequences that have a linear correspondence to the gene product of interest, wherever these occur in the genome. This way you start with the product and work backwards to the DNA sequences that were needed for its construction. The ‘product’ is somewhat vague, as it will look different if we focus on the RNA transcript, a processed RNA molecule, or a translated protein.”

For instance, the above-mentioned GNAS is actually a complex made of a whopping 17 different exons (transcribed units). The complex is controlled by a number of regulatory sequences, which cause differential “splicing” (i.e., different cuts of the larger molecule, resulting in different proteins after translation from the intermediate RNA). Some of these spliced versions are inherited maternally, others paternally, and yet others biparentally. Moreover, one of the exons can be “read” by the transcription enzymes in two different ways (two “reading frames,” in molecular biological jargon), resulting in two distinct proteins. What, exactly, constitutes a gene in the case of the GNAS complex?
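To see why splicing muddies the gene count, consider a toy model in Python. The exon sequences and the choice of variants below are invented purely for illustration (the real GNAS locus has 17 exons and imprinted, epigenetically controlled promoters):

```python
# Hypothetical exon sequences standing in for a multi-exon locus.
EXONS = {1: "ATGGCT", 2: "GGTGAT", 3: "TTTAAA", 4: "CCCGGG"}

def splice(exons: dict, chosen: list) -> str:
    """Join a chosen subset of exons, in genomic order, into a mature mRNA."""
    return "".join(exons[i] for i in sorted(chosen))

# Two splice variants of the same locus yield different mRNAs,
# and hence different proteins after translation.
variant_a = splice(EXONS, [1, 2, 4])  # "ATGGCTGGTGATCCCGGG"
variant_b = splice(EXONS, [1, 3, 4])  # "ATGGCTTTTAAACCCGGG"
```

One stretch of DNA, several products: which of the variants, if any, is “the gene”?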

Then we have the “reactive genome,” that is, the ensemble of the gene that codes for a given protein plus the regulatory sequences that control when and where, during the development or life of an organism, the gene gets transcribed and then translated. This is affected by the environment, a term that includes not just the external conditions in which the organism happens to live, but also the intra-organismal environment: which cells surround which others, their communication channels, and so forth.

Griffiths and Stotz, and therefore also Bromham, also comment on the relationship between genes and the ever treacherous concept of information. The authors of the book — and Bromham seems to agree — correctly state that genes cannot really be said to be “coding” for anything other than sequences of amino acids (and not even that, in the case of many regulatory sequences). They certainly do not code for phenotypes, so the phrase “gene for X,” where X is eye color, a particular behavior, or whatever, is nonsensical from a strictly biological perspective: “in most usages, the use of the word ‘information’ to describe the functioning of genetic systems is at best a sloppy metaphor.”

Indeed, even the famous distinction between genotypes and phenotypes, which is at the center of simplistic metaphors such as Richard Dawkins’ “selfish gene,” is undermined by advances in understanding within the field of molecular genomics. As Bromham puts it: “consider DNA sequences that do not act solely as a template but also act directly as a binding site for other molecules which recognize the particular sequence of nucleotides as a kind of ‘docking station’ … This means that DNA control elements that serve as recognition sequences are both genotype (a message that can be copied from one form to another) and phenotype (a structure that contributes to form and function).”

The last two sections of the paper address interesting questions, concerning which I think Bromham comes across as reasonable but perhaps a bit too conservative.

The first one is dedicated to the issue of epigenetic inheritance. This is by now a well established phenomenon in a number of organisms, but Bromham is right that the jury is still out on its long-term evolutionary significance. Most epigenetic variants are relatively short-lived, influencing phenotypes over the span of a few generations, very brief when compared to the stability of genetic sequences, whose shared ancestry is recognizable in organisms that have diverged for millions or even tens of millions of years. My take on this is that we have evidence at the moment that epigenetic inheritance can influence the short-term response of natural populations to environmental changes, but that long-term stability of phenotypes is encoded only in the DNA. This, however, doesn’t make the import of epigenetic factors negligible, since after all evolution happens one generation at a time (“evolutionary time” is simply the accumulation of a number of short moments), and because the epigenotype constantly interacts with the genetic system of inheritance, in highly non-linear and complex ways.

The last section of Bromham’s commentary concerns the so-called Extended Synthesis, the new version of evolutionary theory that some people (including yours truly, duly cited by Bromham) have put forth in recent years, contrasting it with the standard Modern Synthesis.

Here is Bromham’s summary of the situation: “If ‘extended’ means a richer appreciation of the tangled web of interactions between genes, development and environment, then there are very few fields of biology that do not contribute to such an extension: the more we find out, the more we know, and the more fiendishly complicated everything looks … If ‘extended’ means that knowledge of the complex interactions between genotype and phenotype cannot be accommodated in current evolutionary theory, then there are no concrete examples here that illustrate how this new knowledge does not fit into the established framework.”

In my mind, “Extended” actually means something in between those extremes: neither trivial (like the first one), nor revolutionary (like the second one). But I’ve already explained my take on the ES vs MS debate elsewhere, so I will refer readers to that commentary.

151 thoughts on “What is a gene, and what is it for?”

  1. Robin Herbert

    Hi Coel

    But, one of the ways in which DNA blatantly is “information” is (in line with comments from Massimo and Synred) from the functional recipe point of view. DNA is translated into RNA and thence proteins which then build bodies. From that perspective the sequence of bases matters a lot. One can then talk about the number of bytes needed to encode the information content of that sequence.

    You said it was blatantly information as physicists use the term and I am trying to get at what that is.

    It seems to me that when “information” is used in physics it does not mean that there is a sequence that encodes some action for an apparatus.


  2. synred

    Shannon:

    The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities.


  3. Coel

    Hi Robin,

    You said it was blatantly information as physicists use the term and I am trying to get at what that is.

    There is more than one way in which physicists use the term, depending on context. The most blatant way is one I described.


  4. synred

    http://www.sveiby.com/articles/Information.html#Shannon

    Shannon defines the amount of information as the negative of the logarithm of a sum of probabilities. The minus sign in this formula means the opposite of Wiener's minus sign. It is there because the amount of information according to Shannon is equal to entropy.

    So this engineering use of ‘amount of information’ is confusing. You might be sending messages that nobody has assigned any meaning to, and thus they have no information in them in the usual, everyday, and philosophical sense of ‘information’. A less confusing term would be ‘information capacity’, which is given by the Shannon entropy, which in turn is related to the logarithm of the number of distinguishable states of your DNA, your computer word, or your water molecule.

    Water molecules have a lot of angular momentum states, but they’re mostly not distinguishable in practice. You won’t see one replacing your disk drive anytime soon.
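    A quick numerical gloss on the capacity point, in Python (the ~3 billion base figure is the usual rough size of the human genome, used purely for illustration):

```python
from math import log2

def information_capacity(num_states: int) -> float:
    """Bits needed to distinguish among equally likely states: log2(N)."""
    return log2(num_states)

# A DNA base has 4 distinguishable states, hence 2 bits of capacity per base.
bits_per_base = information_capacity(4)

# A human-sized genome (~3 billion bases) has on the order of 6 gigabits,
# i.e. ~750 MB, of raw storage capacity: capacity, not meaning.
genome_bits = 3_000_000_000 * bits_per_base
genome_bytes = genome_bits / 8
```

    The calculation says nothing about what, if anything, the sequence means; it only counts distinguishable arrangements.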


  5. brodix

    This does pose an interesting dilemma. Water molecules cannot carry information, but we live in a thermodynamic system in which the primary medium of circulation is water molecules.

    Possibly one way to look at water is a vehicle for information, in terms of thermal energy and dissolved minerals, microorganisms, nutrients, etc. Its very fluidity makes it a good medium, if not message.


  6. synred

    But Shannon's “information” is not even a signal (p.100):

    If one is confronted with a very elementary situation where he has to choose one of two alternative messages, then it is arbitrarily said that the information, associated with this situation, is unity. Note that it is misleading (although often convenient) to say that one or the other message conveys unit information. The concept of information applies not to the individual messages (as the concept of meaning would), but rather to the situation as a whole, the unit information indicating that in this situation one has a freedom of choice, in selecting a message, which it is convenient to regard as a standard or unit amount.


  7. davidlduffy

    Robin writes “It seems to me that when ‘information’ is used in physics it does not mean that there is a sequence that encodes some action for an apparatus.”
    Maxwell’s demon is information processing in order to carry out a particular function, and Landauer’s insight (which has now been vindicated experimentally) was that the energy gains are completely balanced by the energy expenditure needed to reset the demon’s memory. One can create systems where erasure is free, but instead it costs the same to lay down the memory. So the concept of information here ends up being the standard one used in computing. Here is the Ito and Sagawa paper I am sure I have linked to previously,

    http://www.nature.com/ncomms/2015/150623/ncomms849/full/ncomms8498.html

    where they obtain results equivalent to Shannon’s second theorem (the noisy-channel coding theorem) for a biological model where there is no actual (artificial) channel code.

    Zurek, Gell-Mann, and Lloyd are among the authors of papers showing that algorithmic complexity or algorithmic information content (AIC) is equivalent to Shannon information in thermodynamic and in QM settings. This AIC is precisely the length of the shortest program or message that can be used to describe the entity in question.

    I obviously think there is a profound relationship between this stuff and ideas like Dawkins’s that the gene is where the main information flow in evolution takes place. Mutation, for example, involves only a relatively small amount of energy that leads to a large change in the thermodynamic state of a population of organisms vis a vis the environment at a future time, because of the multiplier effect of replication. This isn’t true of most other perturbations that one can think of. Selection can be relatively passive, in the sense that the environmental features that sieve out favourable mutations don’t actually have to have changed. Obviously this isn’t true of niche construction and other energetically active processes.
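    The AIC idea can be illustrated with compression, since compressed length is a computable upper bound on algorithmic information content (the true AIC is uncomputable). A Python sketch with made-up example data:

```python
import random
import zlib

def aic_upper_bound(data: bytes) -> int:
    """Compressed length: a computable upper bound on the AIC of `data`."""
    return len(zlib.compress(data, 9))

# A highly regular sequence has a short description ("CAG repeated 1000 times")...
regular = b"CAG" * 1000
# ...while a patternless one of the same length resists compression.
rng = random.Random(0)
irregular = bytes(rng.randrange(256) for _ in range(3000))
```

    Both byte strings are 3000 bytes long, but the compressor finds a far shorter description for the first.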


  8. Robin Herbert

    In computing you would make the distinction that it is data rather than information.

    In algorithmic information theory there are a few different kinds of information.


  9. Robin Herbert

    Hi davidlduffy,

    The way ‘information’ is used in your examples, water is also information.

    Incidentally, “…the length of the shortest program or message that can be used to describe the entity in question.” Isn’t that Kolmogorov complexity rather than Shannon information, or am I remembering wrong? Also, Shannon information is something that was deliberately modelled on the concept of entropy, so the link between information theory and thermodynamics is perhaps not so surprising.


  10. Robin Herbert

    Actually, yes, I do see it needs a memory. Really it is no different to the “hair trigger trapdoor” version of Maxwell’s Demon. If you assume that the hair trigger has a negligible effect on the momentum of the particle then it can let the higher energy particles pass through.

    But the trigger could be said to “gain information” and this has to be “erased”, so to speak as the trigger is reset and this takes energy. (As well as recharging the trapdoor mechanism).

    As far as I know, no one has tried to build one of these, but it can be simulated. Of course the simulation is, quite uncontroversially, information, so there is not really any distinction between it being about thermodynamics or information theory.


  11. brodix

    “One can create systems where erasure is free, but instead it costs the same to lay down the memory.”

    Wouldn’t this be time as well, since new information being recorded over old information is the basic effect of time. While the sense is the past is permanently etched, it is only to the extent some remnant or effect remains in current information.


  12. synred

    Neither water nor DNA is, stricto sensu, ‘information’ in any sense. They are, indeed, molecules (in DNA’s case, a vast family of molecules).

    Information is the arrangements, or ‘patterns’. The larger the number of possible arrangements, the larger the information (or better, the information capacity).

    Any substrate can be used to store or transmit it. The angular momentum states of a water molecule could store such patterns in principle, but it’s not practical. DNA, with its vast array of possible arrangements, works well.


  13. synred

    I am not entirely sure why Maxwell’s Demon needs a memory, surely it is only concerned with the present and the very immediate future.

    It presumably needs short term memory to act in the future. Otherwise it would just be randomly opening and closing the gate.


  14. synred

    There are more proteins than genes…

    This, by itself, is a herculean task, since even though there are only about 22,000 genes, there is not a 1: 1 relationship of genes to proteins. Proteins undergo alternative splicing and cells can change proteins after they are produced (so called posttranslational modifications), so they can appear in any given individual in many forms, and it is estimated that the human proteome comprises more than one million distinct proteins.

    Topol, Eric. The Creative Destruction of Medicine: How the Digital Revolution Will Create Better Health Care (p. 105). Basic Books. Kindle Edition.


  15. davidlduffy

    “there is not a 1: 1 relationship of genes to proteins”: this is the point that Massimo is addressing in the OP. For example, the p14arf (alias ARF) gene
    https://en.wikipedia.org/wiki/P14arf
    uses exactly the same stretch of DNA as the INK4A gene, but starts reading from a different base pair – so the triplet codes that are translated into amino acids come out differently: the DNA for the start of exon 2
    GGTGATGAT
    read starting from the first G gives Gly-Asp-Asp for INK4A, but starting from the second G for ARF gives:
    Met-Met
    Both proteins have important roles in regulating the cell cycle, so this region is frequently mutated in cancer. Obviously, it is dumb luck that reading the same section of DNA offset by one gave two useful proteins, but we are now stuck with it (mice and humans have the same setup).
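    The frame shift is easy to reproduce in Python. The mini codon table below covers only the codons in this fragment; note that by the standard table GTG is valine internally, though as an alternative start codon it is translated as methionine, which may be what the Met-Met above reflects:

```python
def codons(seq: str, frame: int):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

# Toy table covering just the codons occurring in this fragment.
MINI_CODE = {"GGT": "Gly", "GAT": "Asp", "GTG": "Val", "ATG": "Met"}

seq = "GGTGATGAT"  # start of exon 2, per the comment above
frame0 = [MINI_CODE[c] for c in codons(seq, 0)]  # ['Gly', 'Asp', 'Asp']
frame1 = [MINI_CODE[c] for c in codons(seq, 1)]  # ['Val', 'Met']
```

    Shifting the start position by a single base regroups every triplet, so the very same DNA yields two unrelated amino acid sequences.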


  16. synred

    “Both proteins have important roles in regulating the cell cycle, so this region is frequently mutated in cancer. Obviously, it is dumb luck that reading the same section of DNA offset by one gave two useful proteins, but we are now stuck with it (mice and humans have the same setup).”

    I did know that ‘genes’ can overlap like that. Despite the massive overlap this case is presumably counted as two genes.

    What I didn’t know is that ‘post construction’ downstream processing can modify proteins to make more ‘proteins’. Anyway, that’s what that quote seems to say: that there are ‘millions’ of proteins rather than the ~20K explicitly coded for by genes. If this is correct, the ‘information’ to do that must be encoded elsewhere – in the maternal cell or in the so-called ‘non-coding’ DNA.

    The book [a] in parts has a lot of hype, so I’m not entirely confident in what it says.

    [a] Topol, Eric. The Creative Destruction of Medicine: How the Digital Revolution Will Create Better Health Care (pp. 162-163). Basic Books. Kindle Edition.


  17. Robin Herbert

    Hi synred,

    Any substrate can be used to store or transmit it. The angular momentum states of a water molecule could store such patterns in principle, but it’s not practical. DNA with its vast array of possible arrangements works well.

    But we are still using two slightly different (and overlapping) definitions of ‘information’. In terms of thermodynamic ‘information’ you can’t exactly say ‘water is information’ but neither can you say ‘DNA is information’, but ‘information’ is used about them in the same way.

    When you talk about the ability of DNA to store patterns, you are talking about ‘information’ in a different sense. The ability of DNA to store patterns is not just about the number of possible arrangements (water has more) but that there is a discrete set of a fixed number of arrangements which remains stable over time and can be changed without altering the overall structure of the molecule and can be read without changing the overall structure of the collection of arrangements.

    And DNA cannot be said to be information, in this sense, on its own. There is nothing special about the part where a gene starts or a gene stops. It is only information in the context of the apparatuses which do certain things when presented with certain combinations of pairs.


  18. Robin Herbert

    Hi synred,

    It presumably needs short term memory to act in the future. Otherwise it would just be randomly opening and closing the gate.

    Yes, I corrected myself in the subsequent post. All versions of Maxwell’s Demon have this ‘memory’, but it basically needs only one bit of memory.


  19. synred

    Hi Daniel D,

    So is this kind of overlap mechanism enough to account for 250K to a million proteins?

    https://www.quora.com/How-can-one-gene-code-for-more-than-one-protein-and-how-is-that-process-regulated

    I would have been inclined to define a gene as a DNA sequence that codes for a protein, even if you have to slice and dice different overlapping sections of DNA to get at it. I thought that was the definition. I guess this definition would give a much higher gene count. How is a gene defined by biologists for counting purposes?

    With overlaps so prevalent it seems like a single mutation is likely to affect many proteins.

    Here’s what google says:

    noun

    Biology

    noun: gene; plural noun: genes

    1. (in informal use) a unit of heredity that is transferred from a parent to offspring and is held to determine some characteristic of the offspring.

    “proteins coded directly by genes”

    o (in technical use) a distinct sequence of nucleotides forming part of a chromosome, the order of which determines the order of monomers in a polypeptide or nucleic acid molecule which a cell (or virus) may synthesize


  20. Robin Herbert

    What is not obvious is why the one bit of memory cannot be reset and the door closed using less energy on average than has been transferred via the door. Or why the fact of the closed door cannot be that one bit of memory.


  21. synred

    But we are still using two slightly different (and overlapping) definitions of ‘information’. In terms of thermodynamic ‘information’ you can’t exactly say ‘water is information’ but neither can you say ‘DNA is information’, but ‘information’ is used about them in the same way.

    Neither water nor DNA is ‘information’. They are stuff. They can be used to code ‘information’ in the ordinary sense. You can study their ‘information capacity’ (or ‘information’ in engineering jargon). Single water molecules are not very practical – many states but hard to read and write ‘em. Lots of water works fine though most ways of arranging it would be kind of cumbersome.


  22. synred

    It is only information in the context of the apparatuses which do certain things when presented with certain combinations of pairs.

    Well, of course, messages (‘information’ ordinaire) have to be defined and somebody or something has to read and write them. DNA is just one possible such substance. The DNA is not itself information. I doubt anybody thinks it is.

    The information is ‘encoded’ in the DNA, or the state of magnetization states on a floppy disk or whatever.


  23. synred

    Well, the closed door doesn’t tell the demon how fast the molecule is going, or whether or not he should open the door.

    So the demon has at least two bits: One that tells him whether the molecule is fast or slow and one (the door) that tells him (the now useless) information about the previous molecule.

    I think the point of those demon entropy calculations is that they have worked all that out and you can’t beat the second law (at least not for long [a]). I don’t think I could do the calculation myself.

    [a] The door could just flop open and closed while the demon slept and you might accidentally get a few extra fast guys through, but in the longer haul it’s going to average out.


  24. synred

    http://www.genomenewsnetwork.org/articles/02_01/One_gene.shtml

    A somewhat dated note (the number of estimated genes has dropped since 2001), but it is relevant to getting many proteins from one gene.

    “What does this mean? It means that the old paradigm that one gene makes one protein is clearly in need of revision. Through mechanisms that include “alternative splicing,” one gene can direct the synthesis of many proteins. Victor A. McKusick, of the Johns Hopkins University School of Medicine, says, “It seems to be a matter of five or six proteins, on average, from one gene.” McKusick, who is a co-author of the Science paper, suggests that people who now claim that the number of human genes is much higher, may be looking at and counting separate messenger RNAs—the molecules that take information from genes and direct the production of proteins. That being likely, mRNAs will now have to be reexamined against the human genome sequence.”


  25. synred

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3614620/

    This paper would seem to contradict the idea that one gene could express several proteins very often.

    “Overlapping genes are relatively common in DNA and RNA viruses. There are several examples in bacterial and eukaryotic genomes, but, in general, overlapping genes are quite rare in organisms other than viruses. There have been a few reports of overlapping genes in mammalian genomes. The present study identified all of the overlapping loci and overlapping exons in every chromosome of the human genome using a public database. The total number of overlapping loci on the same and opposite strands was 949 and 743,”


  26. davidlduffy

    Dear Arthur, there are 79000 protein coding genes, and similarly large numbers of important functional non-protein-coding genes.
    http://www.ncbi.nlm.nih.gov/genome/51
    There are several free textbooks downloadable at NCBI too, where you can get a feel for why “gene” is so vague. For example, the gene for blue eye colour is a single nucleotide change in a non-coding intron of one gene (HERC2) that controls the amount transcribed of another, larger protein-coding gene (OCA2) next door. Most of the quantitative trait loci we find are changes in noncoding regulatory DNA.


  27. synred

    So why does so much of the literature (even the not-so-popular literature) say there are only ~20K human genes, if it’s actually 79K? And it’s still pretty far from 79K to 250K or 1M.

    One of my projects is to make my evolution simulation FINCHES use a more realistic model of genetics, including linkage and such. I might even investigate different patterns of inheritance to see, e.g., if a more bird-like scheme is more prone to produce female selectivity. So the info you point to could prove useful. Thanks.


Comments are closed.