Network Analysis of Biologists in the Biometry-Mendelism Debate
In all cases, a three-digit number in parentheses refers to the file number in the data directory.
Building the Initial Network
The network was seeded with the initial biologists from Figure 2 in Kyung-Man Kim's Explaining Scientific Consensus: The Case of Mendelian Genetics (redrawn in Inkscape as 000). Kim extracts a number of biologists as "central" to the biometry-Mendelism debate over the period from 1900 to 1910.
The number of biologists here, however, isn't enough to run a detailed analysis, as it doesn't include enough of the relevant players in the literature. The first goal was thus to expand the list.
Expanding the List of Biologists
To do so, I turned to evoText, and created datasets for all of the articles published in Nature by the following biologists. The search term (i.e., the name under which they published in Nature) is in quotation marks.
- Weldon "W. F. R. WELDON" (001)
- Bateson "W. BATESON" (002)
- Pearson "KARL PEARSON" (003)
- Kim identifies these three as centers of the biometry-Mendelism network around 1900
- Darbishire "A. D. DARBISHIRE" (004)
- Schuster (did not contribute to Nature)
- Yule "G. UDNY YULE" (005)
- Pearl (did not contribute to Nature)
- Shull (did not contribute to Nature)
- Kim identifies these five as "paradigm articulators," who were centers of the biometry-Mendelism network around 1905 but, importantly, defected from biometry to Mendelism
- East (did not contribute to Nature)
- Johannsen "W. JOHANNSEN" (006)
- Nilsson-Ehle (did not contribute to Nature)
- Based on visual inspection of Kim's Figure 2, these last three seem to be centers of the biometry-Mendelism network around 1910; Johannsen is, for Kim, the person whose work "settles" the debate
These lists were then inspected by hand to produce a list of all relevant 19th-century biologists who were living and hence could have been contributing to the debate. The following were ignored, as they would be too common and would drown out the signal:
Finally, this list was merged with the full list of biologists in Kim's Figure 2. This produced a list of 98 biologists for the network. (007)
Returning to evoText, I created a dataset containing every article published in Nature by any of those 98 biologists. I already had searches for the six authors noted above. I had to determine how each and every author appeared when they published in Nature, if at all. (008) 52 of those 98 biologists published in Nature, for a total of 1,622 articles.
Exploring the Network
I next plotted those articles by year of publication, to get an idea of when they were published. (009, 010, 011) The articles span 1872 to 1940, with an expected lull during WWI, after the pre-synthesis debates but before the Synthesis.
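As a rough illustration of the tallying behind such a plot, the sketch below counts articles per publication year. The record shape (`{ year }`) and the function name `countByYear` are assumptions made for this example; they are not taken from the scripts in this repository.

```javascript
// Minimal sketch: tally articles per publication year.
// Assumes each article is an object with a numeric `year` field (hypothetical shape).
function countByYear(articles) {
  const counts = new Map();
  for (const { year } of articles) {
    counts.set(year, (counts.get(year) || 0) + 1);
  }
  // Return [year, count] pairs sorted by year, ready to hand to a plotting tool.
  return [...counts.entries()].sort((a, b) => a[0] - b[0]);
}

// Made-up sample data, for illustration only.
const sampleArticles = [{ year: 1901 }, { year: 1899 }, { year: 1901 }];
console.log(countByYear(sampleArticles));
```

Years with no articles are simply absent from the result, so a plot that should show gaps would need those years filled in with zero.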
I converted all of the names of the relevant biologists to lowercased last names, in order to use them as a filter-list for a word frequency analysis. (012) I then ran a word-frequency analysis against our dataset of articles authored by members of the network. (013)
- Analyze single words
- Explicit list of words (list from 012)
- Stem words: no
- Text block method: number of blocks
- Number of blocks: 1
- Split across: no
- Word cloud: no
The resulting CSV leaves you with several extra unneeded columns, which were removed. (014)
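The name-normalization step used to build the filter list (012) can be sketched as follows. This is only an illustration, assuming the last whitespace-separated token of each search term is the surname; `toFilterTerm` is a hypothetical helper, not a function from this repository.

```javascript
// Minimal sketch: reduce an author name to a lowercased last name.
// Assumption: the surname is the last whitespace-separated token.
function toFilterTerm(fullName) {
  const parts = fullName.trim().split(/\s+/);
  return parts[parts.length - 1].toLowerCase();
}

// Search terms quoted earlier in this README.
const searchTerms = ["W. F. R. WELDON", "W. BATESON", "KARL PEARSON", "G. UDNY YULE"];

// De-duplicate, since two biologists could share a surname.
const filterList = [...new Set(searchTerms.map(toFilterTerm))];
console.log(filterList);
```

Hyphenated surnames such as Nilsson-Ehle survive as a single token under this rule; multi-word surnames would need special handling.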
The word frequency analysis, when not splitting-across, gives you the frequency lists for each word in each article, where articles are specified by the internal evoText identifier (in the case of these Nature articles, an SHA-1 checksum). The next step was thus to query the server for the author names for each of those documents. At this stage, South and East unfortunately had to be removed, as it was clear that they were producing noise, not signal. (015)
Those author names were then matched to the analyzed words. Trailing data was stripped from this CSV file using
This CSV was then converted into a graph CSV, with repeated rows for edge weight, that could be read into Gephi. (017) These were laid out and visualized in Gephi. (018) Nodes were clustered according to Gephi’s modularity statistic, colored, and laid out. (019)
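The conversion to an edge list with repeated rows can be sketched roughly as below. The input shape (`author`, `word`, `count`) and the `Source,Target` header are assumptions for this example; the actual conversion script may differ.

```javascript
// Minimal sketch: expand (author, mentioned name, count) records into an edge list
// in which each edge row is repeated `count` times, so that repeated rows
// encode edge weight.
// Assumed input shape: { author, word, count } (hypothetical).
function toEdgeRows(mentions) {
  const rows = ["Source,Target"];
  for (const { author, word, count } of mentions) {
    for (let i = 0; i < count; i++) {
      rows.push(`${author},${word}`);
    }
  }
  return rows.join("\n");
}

// Made-up sample data, for illustration only.
console.log(toEdgeRows([
  { author: "pearson", word: "weldon", count: 2 },
  { author: "bateson", word: "pearson", count: 0 },
]));
```

Records with a count of zero contribute no rows, so those pairs simply have no edge in the resulting graph. The sketch does no CSV quoting, which is only safe because the names here contain no commas.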
The modularity data was confirmed by looking for similar clustering at several other resolution values (numbers of groupings). The grouping results presented were found to be robust, and the data looked interesting enough to be worthy of continued pursuit.
Evolution of the Network
The next step was to explore the groupings over time. To do this I queried evoText for the publication year of each of the articles in the network, in addition to their authors (020), and again matched author names to analyzed words. (Once more, trailing data stripped via
I then performed modularity analyses on each of the time ranges in Gephi, and manually saved modularity class data as a list of color palette indices in JS. Node IDs, sizes, and edge weights were generated automatically in JS format.
The graph data is then animated and visualized using sigma.js. Each of the five timestamps' worth of data is loaded, node sizes and initial random positions are set, the graph is run through a ForceAtlas2 layout, and positions are then saved. All data (node positions, sizes, colors, and edge weights) are then interpolated between the timestamps for animation.
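The interpolation between two timestamps can be illustrated with plain linear interpolation. This is a sketch under stated assumptions, not the animation code in this repository: the node shape, the `[r, g, b]` color representation, and the names `lerp` and `interpolateNode` are all hypothetical.

```javascript
// Linear interpolation between a and b, with t in [0, 1].
function lerp(a, b, t) {
  return a + (b - a) * t;
}

// Minimal sketch: interpolate one node's state between two timestamps.
// Assumes colors are stored as [r, g, b] arrays (hypothetical representation).
function interpolateNode(from, to, t) {
  return {
    id: from.id,
    x: lerp(from.x, to.x, t),
    y: lerp(from.y, to.y, t),
    size: lerp(from.size, to.size, t),
    color: from.color.map((c, i) => Math.round(lerp(c, to.color[i], t))),
  };
}

// Made-up node states at two consecutive timestamps.
const before = { id: "pearson", x: 0, y: 0, size: 2, color: [0, 0, 0] };
const after = { id: "pearson", x: 10, y: 20, size: 4, color: [255, 100, 0] };
console.log(interpolateNode(before, after, 0.5));
```

Edge weights could be interpolated the same way with `lerp`. Interpolating colors channel by channel in RGB is the simplest option; a color library could interpolate in a perceptual color space instead.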
Comparison with Kim's Figure 2
The edges that correspond to edges in Kim's Figure 2 were then extracted from the full-network graph (018), and a version of Kim's figure was produced with the edges scaled according to our network's edge weights (excepting the edge from Pearson to Pearl, which would be so large that it would cover a number of names on the graph), with dashed lines indicating zero-weight edges (026).
The 'published-figures' folder contains the source material for the figures that were published in the final version of the paper to arise from this chapter. (Citation information will be added here when available.)
Thanks to Nature Publishing Group for access to the text of Nature for text mining purposes, via the evoText project.
Thanks to the authors of the following software and libraries, without which I'd have had a much harder time putting this together:
All text, data, and source code authored by Charles H. Pence is copyright © 2016, Louisiana State University.
All text and data contained here is available under the Creative Commons Attribution 4.0 International license (CC-BY 4.0), as applicable. Some of the metadata and content from Nature may remain copyright Nature Publishing Group in your local jurisdiction, depending on rules regarding metadata, fair use, and text mining.
Any source code authored by Charles H. Pence is released under the MIT License. Some of the source here is authored by others. Zurb Foundation for Sites, sigma.js, jQuery, what-input, and the ES5 and ES6 shims are also released under the MIT License. chroma.js is released under the BSD license. All these libraries remain copyright their original authors.