[Cz-biology] Auto-created articles about genes

Andrew Su asu at drewsu.com
Wed Sep 19 11:59:45 CDT 2007


Thanks all for the discussion, and apologies in advance for the
lengthy email.  I've tried to address all the issues raised, and I
thought I'd err on the side of passing along more information about
our thoughts and plans.  (Many may be interested in jumping down to
the last section on "What are the next immediate steps?".)  More
discussion/thoughts/suggestions are always welcome, particularly on
the licensing issue…

Cheers,
-andrew


>********************************
> How does this effort relate to the many other gene databases/portals already available?

There are two main advantages of this effort, both of which stem from
the fact that CZ is a wiki.  First, all the other gene portals (our
SymAtlas database included) are primarily composed of tag-value pairs
(e.g., symbol = "APP", function = "apoptosis", etc.).  Second, other
gene portals are 99% one-way communication, from data providers to
data consumers.  Of course, we all here know that wikis are a great
complementary resource to these types of databases, allowing both
free-text and user-contributed gene annotation.


>********************************
> since a parallel effort is intended at Wikipedia, will the intent be substantially different?
> Other than our use of subpages, how might our articles/clusters differ from
> Wikipedia's?  If they wouldn't differ appreciably, is that a reason for us not to do it
> (or, indeed, to insist on doing it)?

I've been thinking about this issue quite a bit, since there is a
compelling argument that doing parallel efforts at CZ and WP dilutes
the impact and contributions of both.   Nevertheless, my current
thought is that the WP effort will likely be done regardless of the
decision here at CZ.  I think the greater name recognition of WP
(especially by the many biologists who aren't terribly tech-savvy)
makes that a necessity.   All the problems at WP that CZ is meant to
address?  I would say that most biologists aren't even aware of them
yet.  My hope is that the WP effort will attract many more biology
contributors to edit pages (once pages exist for genes that they care
about), and after gaining that experience, the community will be in a
better position to make a choice between the WP and CZ models.  And if
lo and behold, gene stubs already exist at CZ, well then the
transition would be that much easier.  But from my perspective, I want
to advance the position that a gene wiki can be valuable for
biologists and remain relatively agnostic with respect to WP/CZ.

To see how the bot is performing over at WP, here is the link to the
33 pages that have been either created or modified using the bot
content:
http://en.wikipedia.org/wiki/User:ProteinBoxBot#Trial_Run


>********************************
> It is to be watched whether a pharma company might have any commercial
> interest, even one not evident to you, in influencing the content in any way
> of an article they are involved with.

A valid point, and we welcome the scrutiny.  First, it's worth
pointing out that potential biases pertain to hand-made edits as well.
 The fact that we're talking about a bot to make automated edits
changes the number of contributions I'm (indirectly) making, and not
the fact that I work for a company.  Unless CZ plans on excluding all
contributors who work for commercial entities, then I think this comes
down to a person-by-person evaluation of credentials when approving
authorship and editorship and ongoing evaluation of contributions.

If you buy that first argument, then let me make a very quick second
argument that GNF has a pretty great track record of being a good
scientific citizen.  We've released into the public domain some of the
most well-used and well-cited data sets for gene expression and
genetic variation.  My attitude is that good science will lead back to
benefit for GNF, in some way or another.  (Not sure if GNF shares my
loose view of ROI…)

Third, as was pointed out in an email that Larry forwarded, the
functions of the bot and the rules by which it operates are completely
transparent.  If anyone is concerned about objectivity or fairness,
please let us know how you would like to see the rules changed.  As I
see it, the only potential conflict of interest is the link from the
gene stubs to SymAtlas (the free and public gene portal that we
created) and the SymAtlas images displayed on the "Gallery" subpage.**
 This issue was briefly discussed on the forums, and I still heartily
believe that the data that we created are important and useful aspects
of gene annotation and should be included in gene stubs.  Happy to
discuss the point further if anyone is interested…

** it turns out that I actually didn't set up the APP example stub how
I'd really like to see it.  I intended to put a link directly back to
SymAtlas, where additional gene expression data sets are available.
Take a look at the WP pages linked above to see basically how I'd
propose linking them here ("More reference expression data" link).


>********************************
> And what is the long-term plan here?  And why is the license an issue?

Well, no one asked that first question, but it certainly relates to
the second.  Eventually I'd like to incorporate gene wiki content
directly into SymAtlas (actually SymAtlas' successor, being developed
now…), including reciprocal links.  One link will take CZ/WP users to
SymAtlas and its additional gene expression data sets.  Similarly,
SymAtlas will display the community-contributed wiki content and link
back to CZ/WP.  In my mind, this synergy will benefit both user
communities.  SymAtlas users come with a lot of domain expertise in
specific areas, and CZ/WP users come with a lot of expertise in
organizing and communicating information (as well as domain
expertise).  For reference, in the first eight months of this year,
SymAtlas received 1.25M page views, 94K visits, and 32.6K unique
visitors from 1432 cities worldwide.

Our plan to incorporate these pages into our SymAtlas gene
portal is why I'm insisting on a license that permits commercial
reuse.  Although SymAtlas does not make us any money at all and is
decidedly non-commercial, GNF as a whole is a for-profit institute.  I
have no interest in getting our lawyers involved in figuring out what
does and doesn't qualify as commercial use (Mike Johnson's licensing
essay alludes to the ambiguity of defining this), so we'd only move
forward with the CZ effort if we have a license that is okay with
commercial use.


>********************************
> And what are the next immediate steps?
	
The next step for CZ will be to test whether the WP bot will
work with little/no modifications.  There were no objections from the
CZ-Tools group, so we hope to do this in the next week or two.  The WP
bot trial period is done, so we expect to go into mass production mode
there later this week.  Although hiccups aren't unexpected, I hope to
have at least a thousand or so automated and semi-automated WP edits
done in the next month.  Not long after that, I hope to draft a
manuscript to submit to an academic journal.  If the CZ bot test goes
as expected, I think it would be possible to quickly catch up over
here (assuming there continues to be support for it here and the
licensing issue can be worked out) so that the CZ effort can also be
mentioned/highlighted in the manuscript.




On 9/17/07, Larry Sanger <sanger at citizendium.org> wrote:
> Dear biologists,
>
> Dr. Andrew Su, who works for the Genomics Institute of the Novartis Research
> Foundation, wishes to run a "bot" (a computer script that emulates human
> behavior on the wiki) which would automatically create thousands of articles
> about genes.  Geneticists could then add to the bare articles, and the
> resulting information clusters, e.g.,
>
> http://en.citizendium.org/wiki/APP
>
> would be free.  Andrew's group insists that, if they are to be involved,
> *these* articles must be available under a free license, either GNU FDL or
> CC-by-sa, that would allow commercial reuse.  I.e., that's the offer on the
> table; I personally am comfortable with that condition.
>
> Andrew says the bot is ready to start testing.  Since we're getting close to
> the point where these articles could be created, I would like the biologists
> *in particular* to examine the example Andrew has uploaded.
>
> As a philosopher and Internet guy, of course I can't make heads or tails of
> it.  (I see lots of pretty pictures!)  But I can recognize several issues
> that will need discussion:
>
> * Is the information actually *useful* to biologists and biology students?
> Would they ever visit these CZ pages (after adequate development)--or would
> they always look elsewhere for such data?
>
> * Are the external links provided appropriate?  Do they unfairly benefit one
> resource over another?
>
> * Does the "value added" by Andrew's organization actually justify our
> releasing these articles under a free license that permits commercial reuse,
> perhaps contrary to a decision we'll make about our larger body of articles
> in a few months?
>
> If you are interested in discussing these issues, won't you please do so on
> the cz-biology list, here:
>
> http://mail.citizendium.org/mailman/listinfo/cz-biology
>
> I will be listening closely to our own biologists on cz-biology.
>
> Editors, if you want, I can forward comments from you to the biology list,
> to make sure they receive them.
>
> --Larry
>
> -----
> Lawrence M. Sanger, Ph.D. | http://www.larrysanger.org/
> Editor-in-Chief, Citizendium | http://www.citizendium.org/
> sanger at citizendium.org
>
>

