CZ Talk:Statistics: Difference between revisions




Latest revision as of 09:50, 5 May 2014

This is fantastic--I think I can speak for everyone when I say that this is very much appreciated, Alex. --Larry Sanger 09:59, 18 May 2007 (CDT)

Definitely Cool! Keep them coming. --Matt Innis (Talk) 10:32, 18 May 2007 (CDT)

Thanks. Needless to say that everybody is invited to edit this page, add his own work or make requests for further improvements, comments about the page, suggestions of new ideas etc. --Aleksander Stos 14:36, 18 May 2007 (CDT)
Terrific page, very informative. Maybe you could put in a slightly clearer explanation of #6. Is this the number of users that log in each day, averaged over a month? BTW, it would be easier to edit if you could break the page up into sections. (Maybe I'll try and you can revert if you don't like it.) I think the main data that stands out as missing would be the number of visitors to the site, which could be presented in different ways. David Hoffman 16:09, 18 May 2007 (CDT)
#6 is the number of users who actually made an edit in a given month (no average, just a count). I have now added a line of explanation, but I'm not sure about it -- please do correct it if you find a good formulation! BTW, something similar to "the number of users that log in each day, averaged over a month" is the "daily use" section, but it concerns actual edits instead of logins (no login info is publicly available). Thanks for your remarks! --Aleksander Stos 16:47, 18 May 2007 (CDT)
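
A count of this kind could be sketched as follows. This is only an illustration: the (username, timestamp) record layout, the timestamp format and the function name are assumptions, not the script actually used for the page.

```python
from collections import defaultdict
from datetime import datetime


def active_editors_per_month(edits):
    """Count distinct users who made at least one edit in each month.

    `edits` is an iterable of (username, timestamp) pairs; this layout
    is assumed for illustration only.
    """
    users_by_month = defaultdict(set)
    for user, timestamp in edits:
        parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        users_by_month[parsed.strftime("%Y-%m")].add(user)
    # A plain count per month: no averaging over days.
    return {month: len(users) for month, users in sorted(users_by_month.items())}


# Hypothetical sample data.
sample_edits = [
    ("Alice", "2007-04-30T23:59:00Z"),
    ("Alice", "2007-05-01T08:00:00Z"),
    ("Alice", "2007-05-02T09:30:00Z"),
    ("Bob", "2007-05-17T12:00:00Z"),
]
print(active_editors_per_month(sample_edits))
```

A user with many edits in one month is counted once for that month, which is what separates this figure from the edit counts in the "daily use" section.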
It is great seeing some of these stats. It gives me a good feel on how alive CZ is and how it is growing. Robert Winmill 16:12, 18 May 2007 (CDT)

Source file

To produce these stats, on May 18 I dumped the histories of edits of all pages. This was transformed into an xml-like file in a format similar to "stub-meta-history" dump files released by Wikipedia. I'm willing to share the data with the interested CZ members, so if you want to make your own stats, just let me know on my talk page. --Aleksander Stos 17:43, 18 May 2007 (CDT)

Further development

I've just put some fresh data (I plan to update graphs too). Please do copy edit. Perhaps reorganisation of headers/text would be needed as well. I'm willing to feed similar info in future and the present structure is not well suited for this. --Aleksander Stos 02:23, 10 June 2007 (CDT)

I love this, Alexander. Do you think you could graph articles by workgroup? Editors by workgroup? If it's not too time consuming, I'd like to see the progressions over time. Nancy Sculerati 02:23, 10 June 2007 (CDT)

Good idea! Give me a few days.--Aleksander Stos 14:32, 10 June 2007 (CDT)
OK, here it goes. I hope that's what you requested. But do not hesitate to point out if something should be improved. --Aleksander Stos 08:48, 11 June 2007 (CDT)

This is amazing, again--and getting better. I'd like to note that May was actually one of our first months where nothing was happening that would "skew" the statistics. November, launch. January and February, self-registration. March (late), launch. April, aftereffects of launch. Moreover, May is one of the busiest months of the school year. --Larry Sanger 08:56, 11 June 2007 (CDT)

I can only second. Alexander, do you have any way to estimate readers? After all, that is really our ultimate goal: to provide for the reader, rather than to exist for the user. Is that possible to know? Nancy Sculerati 09:12, 11 June 2007 (CDT)
You're right. A similar question was raised before; see the forum thread at http://forum.citizendium.org/index.php/topic,939.0.html. Unfortunately, I have no data. Access logs of the CZ servers are not publicly available. I do not even know what they look like or how big they are (if I had such a file I could try my luck). Nevertheless, some _relative_ (comparative) stats exist. They are produced by specialized companies like Alexa and rely on data provided by an "army" of users of AlexaBar or something similar. According to them, the Citizendium is number 4 in the world of "open content encyclopaedias" (see http://www.alexa.com/browse/general/?&Mode=general&CategoryID=502412&Start=1&SortBy=Popularity&R=True#) and for a few days after launch we were more popular than Britannica (http://www.alexa.com/data/details/traffic_details?q=&url=http://en.citizendium.org&site0=citizendium.org&site1=britannica.com&y=p) ;-). However, it looks like we have to work hard to stay high in the ranking. BTW, you can easily recognize the "big vandalism era" and the launch on the graphs. --Aleksander Stos 12:19, 11 June 2007 (CDT)

Forums

Is there any way data can be compiled from them? That's where nearly all meta-discussion occurs. --Stephen Ewen (Talk) 21:46, 24 July 2007 (CDT)

The forum has its own statistics page; see http://forum.citizendium.org/index.php?action=stats. It could be linked from here (done). --Aleksander Stos 03:04, 25 July 2007 (CDT)

Word count

Any easy way to make a total CZ (main space) word count? --Larry Sanger 00:06, 28 July 2007 (CDT)

Yes, at least approximately -- I put it on my todo list. BTW, it would be great if we had some database dumps/backups, something like http://download.wikimedia.org/backup-index.html --Aleksander Stos 07:54, 28 July 2007 (CDT)

Well, I counted that. It seems safe to assume that CZ has over 4,200K words in the mainspace. I worked on the "raw" wikitext (i.e. what we see while editing). The templates were cut out (and with them tables, boxes etc.), as well as obvious technical parts (categories, images, www links). I counted refs and headings, however. Just a choice. Comments/questions regarding methodology welcome! If what has been done appears to be reasonable, we can put it on the page. --Aleksander Stos 08:01, 30 July 2007 (CDT)
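
The counting rules described above might look roughly like the sketch below. The regular expressions and the function name are illustrative assumptions, not the ones in the actual script, and deliberately ignore harder cases such as image captions that contain nested links.

```python
import re


def count_words(wikitext):
    """Approximate word count of raw wikitext (illustrative rules only)."""
    text = wikitext
    # Cut out templates such as {{Infobox ...}}; repeat to peel nested ones.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    # Drop category and image links entirely.
    text = re.sub(r"\[\[(?:Category|Image):[^\]]*\]\]", " ", text,
                  flags=re.IGNORECASE)
    # Keep only the visible label of internal links: [[target|label]] -> label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop bare URLs; any label of an external link stays and is counted.
    text = re.sub(r"https?://\S+", " ", text)
    # Refs are counted, so remove only the <ref> tags, not their content.
    text = re.sub(r"</?ref[^>]*>", " ", text)
    # A "word" is a run of letters/digits, allowing inner . , ' or -
    # so that "45,67" and "U.S." each count as one word.
    return len(re.findall(r"\w+(?:[.,'-]\w+)*", text))


# Hypothetical sample page.
sample = (
    "== Cats ==\n"
    "{{Infobox|name=Cat}}\n"
    "The [[Felis catus|cat]] is small.<ref>Smith 2001</ref> "
    "[[Category:Animals]]"
)
print(count_words(sample))
```

In the sample, the heading and the reference text are counted while the template and the category link are not, mirroring the choices listed above.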

Excellent--thanks! Very interesting, too, so what's the average word count per article? I assume you're counting more than just the words contained in CZ Live articles? --Larry Sanger 08:31, 30 July 2007 (CDT)

About 1250 words/article (3351 pages as listed on Special:Allpages -- subpages and disambigs included; redirects, obviously, skipped). But, just as with salaries in a population, the average does not tell us that much. The distribution of article length follows a power law: there are many short articles and relatively few extremely long ones. In such a situation the median appears to be more meaningful. Our median is 552, which means that half of our articles are longer than that and half are shorter. BTW, we have about 2300 articles longer than 250 words and 2600 articles longer than 150 words (the rest being really short stubs, disambigs and a couple of almost empty experimental subpages). --Aleksander Stos 12:10, 30 July 2007 (CDT)
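
To illustrate why the mean and the median diverge for a skewed distribution, here is a small sketch. The list of article lengths is made up for the example and is not CZ data.

```python
from statistics import mean, median

# Hypothetical article lengths in words: many short pages, a few very long ones.
lengths = [120, 150, 200, 300, 450, 550, 600, 800, 2500, 9000]

average = mean(lengths)   # pulled upward by the two longest articles
middle = median(lengths)  # half of the pages are shorter, half are longer

print(average, middle)
```

Here the mean is nearly three times the median, even though eight of the ten pages are shorter than the mean.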
Still working on it; results to be announced soon (i.e. put on the page). The devil is in the details... I'm fine-tuning the regular expressions in the script -- and thus the definition of a 'word' in the jungle of wikisyntax (e.g. does the number 45,67 count as one word? two? what about U.S.? do indefinite articles count?). This does not change much. The basic question is whether, at the present stage, the subpages are to be included or not. They are either almost empty or very long drafts - "copies" of their 'main article'. If they are excluded, then we have about 4000K words (a bit less) with a median of 568 (a bit better). This seems to be more accurate. Just thinking out loud. Aleksander Stos 10:03, 31 July 2007 (CDT)

Two things: indefinite articles (words!) always count, I thought. Also, anything of the form X/Draft should not be counted if X is other than a redirection page. All other subpages of the main namespace should be counted. --Larry Sanger 10:09, 31 July 2007 (CDT)

Thanks for the hint. Well, for the global word count I think it's OK to do as you suggest and, clearly, the drafts should simply be excluded anyway. But when it comes to the question "how long is our average article", adding all the subpages systematically biases the result. I mean that at present there are many empty placeholders. Furthermore, some *standard* subpages, such as galleries and tables, will always be (almost) "empty" for the word count procedure as it stands. Also, the links pages, if counted separately, bring a systematic bias -- on average, comments on links will always be shorter than the associated 'main' article. I feel that giving the links the same "weight" as the 'main' article results in an inaccurate average (median). Either we simply skip links or we *concatenate* them to their main page (i.e. we count clusters, not individual subpages). The latter seems to be the right approach, i.e. this IMHO would give the best answer to the "how-long-are-articles" question. In this case, the answer is about 4100K words (total) and a median length of 562. If we count each subpage separately, the resulting median would be 517 -- the difference is not negligible and, frankly, I think 562 better corresponds to what we could label as our "average" article seen on the screen. Furthermore, the 'cluster' method would allow meaningful comparisons to other wikis (that put everything on the same page). Aleksander Stos 05:32, 2 August 2007 (CDT)
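
The difference between the two ways of counting can be sketched as follows. The page titles, word counts and function names are hypothetical, and the sketch assumes that a "/" in a title always marks a subpage, which would misfile an article whose own name contains a slash.

```python
from collections import defaultdict
from statistics import median


def without_drafts(word_counts):
    """Drop X/Draft pages, which duplicate their main article."""
    return {t: n for t, n in word_counts.items() if not t.endswith("/Draft")}


def median_per_page(word_counts):
    """Median length when every subpage is counted as a separate page."""
    return median(without_drafts(word_counts).values())


def median_per_cluster(word_counts):
    """Median length when subpages are concatenated to their main page."""
    totals = defaultdict(int)
    for title, words in without_drafts(word_counts).items():
        main_title = title.split("/", 1)[0]  # assumes "/" marks a subpage
        totals[main_title] += words
    return median(totals.values())


# Hypothetical word counts per page.
pages = {
    "Biology": 900,
    "Biology/Related Articles": 40,
    "Biology/Draft": 950,
    "Cat": 300,
    "Cat/Gallery": 0,
    "Dog": 500,
}
print(median_per_page(pages), median_per_cluster(pages))
```

In this toy data the near-empty subpages pull the per-page median down, while the per-cluster median reflects the length of whole articles.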

How do we compare with...

Ciao to all.

Just a curiosity. How do we compare with Uncyclopedia? And with Conservapedia? --Nereo Preto 06:25, 7 October 2007 (CDT)

No offense but...who cares??  :-) Neither of those even aspires to be a neutral reference. --Larry Sanger 07:50, 7 October 2007 (CDT)

Well, just a curiosity. To be frank, I really don't care about Conservapedia, but when I'm sad, I always take a look at Uncyclopedia and things go better for me, it's a great service to humanity! :) --Nereo Preto 01:25, 8 October 2007 (CDT)

Recent spike

Thanks again for updating & maintaining this, Alek!

The recent spike is due to subpages, isn't it? Alek, I think we should track main article growth and subpage growth separately, although we should give the median word count per cluster not per article. --Larry Sanger 07:39, 8 October 2007 (CDT)

Right, the spike is due to subpages. I promise a better presentation -- for now I have added only a (too) modest footnote on what the 'number of pages' means. The median has always been given per cluster. Aleksander Stos 02:37, 9 October 2007 (CDT)
Wish "Signed articles" and "Addendum"s could count with main articles. --Anthony.Sebastian (Talk) 21:47, 10 October 2007 (CDT)

Suggest additional chart

Suggest repeat "Articles by workgroup" chart arranged in rank order. Stimulate workgroup competition. --Anthony.Sebastian (Talk) 21:43, 10 October 2007 (CDT)

We may give it a try. Actually, some charts are overloaded (and I do not have enough colours...), while some are almost 'empty', depending on the field. Grouping by rank would give a more balanced presentation. Aleksander Stos 02:37, 11 October 2007 (CDT)

Word lengths

The average length of articles in words has been declining while the number of articles has been increasing. Should one assume that this is because more articles are being created, but these articles are stubs? Meg Ireland 04:22, 3 June 2008 (CDT)

Creation rate

Most of these charts appear to have stopped recording in November 2013 and remained un-updated. It's now April 2014. Meg Ireland 09:29, 5 April 2014 (UTC)

Well, the problem is that the Citizendium has lasted too long to include the whole history in one graph; the programmer of the scripts didn't expect this ;) I'm (slowly) working on it... Cheers, A. Aleksander Stos 15:50, 5 May 2014 (UTC)