A network visualization of the debate over Mendelism in the journal Nature, 1890–1910
Charles Pence 5d931f6c7d
Add checksums to rsync, these are small files.
2 years ago
data Regenerate 24. 5 years ago
js Update CSS/JS for wider date ranges. 5 years ago
node_modules Move for better deployment, update bundle. 2 years ago
public Move for better deployment, update bundle. 2 years ago
published-figures Add source data for final published figures. 3 years ago
scripts Update phantomjs script. 5 years ago
.gitignore Update phantomjs script. 5 years ago
COPYING.code Update copyright year. 5 years ago
COPYING.data Add licenses. 6 years ago
Makefile Add checksums to rsync, these are small files. 2 years ago
README.md Add source data for final published figures. 3 years ago
package.json Move for better deployment, update bundle. 2 years ago
yarn.lock Move for better deployment, update bundle. 2 years ago

README.md

Network Analysis of Biologists in the Biometry-Mendelism Debate

Data Analysis

In all cases, a three-digit number in parentheses refers to the file number in the data directory.

Building the Initial Network

The network was seeded with the initial biologists from Figure 2 in Kyung-Man Kim's Explaining Scientific Consensus: The Case of Mendelian Genetics (redrawn in Inkscape as 000). Kim identifies a number of biologists as "central" to the biometry-Mendelism debate over the period from 1900 to 1910.

The number of biologists here, however, isn't enough to run a detailed analysis, as it doesn't include enough of the relevant players in the literature. The first goal was thus to expand the list.

Expanding the List of Biologists

To do so, I turned to evoText, and created datasets for all of the articles published in Nature by the following biologists. The search term (i.e., the name under which they published in Nature) is in quotation marks.

  • Weldon "W. F. R. WELDON" (001)
  • Bateson "W. BATESON" (002)
  • Pearson "KARL PEARSON" (003)
    • Kim identifies these three as centers of the biometry-Mendelism network around 1900
  • Darbishire "A. D. DARBISHIRE" (004)
  • Schuster (did not contribute to Nature)
  • Yule "G. UDNY YULE" (005)
  • Pearl (did not contribute to Nature)
  • Shull (did not contribute to Nature)
    • Kim identifies these five as "paradigm articulators," who were centers of the biometry-Mendelism network around 1905 but, importantly, defected from biometry to Mendelism
  • East (did not contribute to Nature)
  • Johannsen "W. JOHANNSEN" (006)
  • Nilsson-Ehle (did not contribute to Nature)
    • Based on visual inspection of Kim's Figure 2, these last three seem to be centers of the biometry-Mendelism network around 1910; Johannsen is, for Kim, the person whose work "settles" the debate

These lists were then inspected by hand to produce a list of all relevant 19th-century biologists who were living at the time and hence could have been contributing to the debate. The following were ignored, as they are too common and would drown out the signal:

  • Darwin
  • Wallace
  • Mendel

Finally, this list was merged with the full list of biologists in Kim's Figure 2. This produced a list of 98 biologists for the network. (007)

Returning to evoText, I created a dataset containing every article published in Nature by any of those 98 biologists. I already had searches for the six authors noted above; for the rest, I had to determine how each author appeared when publishing in Nature, if at all. (008) Of those 98 biologists, 52 published in Nature, for a total of 1,622 articles.

Exploring the Network

I next plotted those articles by year of publication, to get an idea of when they were published. (009, 010, 011) The articles span 1872–1940, with an expected lull during WWI, after the pre-synthesis debates but before the Synthesis itself.
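The per-year tally behind such a plot can be sketched as follows. This is only an illustration: the `years` array is made-up sample input, not data from 009–011, and `countByYear` is a hypothetical helper name.

```javascript
// Made-up sample publication years (NOT taken from the real dataset).
const years = [1901, 1903, 1901, 1915, 1903, 1901];

// Count how many articles fall in each year, returned as [year, count]
// pairs sorted by year so they can be fed straight to a plotting tool.
function countByYear(years) {
  const counts = new Map();
  for (const year of years) {
    counts.set(year, (counts.get(year) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => a[0] - b[0]);
}

console.log(countByYear(years));
```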

I converted all of the names of the relevant biologists to lowercased last names, in order to use them as a filter-list for a word frequency analysis. (012) I then ran a word-frequency analysis against our dataset of articles authored by members of the network. (013)
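As a minimal sketch of how a filter list like 012 could be derived, the snippet below lowercases the final token of each name. The input names and the helper `toFilterWord` are illustrative assumptions, and the rule "the last whitespace-separated token is the surname" is a simplification that may not hold for every biologist in the list.

```javascript
// Illustrative input; the real list (007) has 98 entries.
const biologists = ["W. F. R. Weldon", "W. Bateson", "Karl Pearson", "G. Udny Yule"];

// Assumption: the surname is the last whitespace-separated token.
// Hyphenated surnames such as "Nilsson-Ehle" survive intact because
// they contain no spaces.
function toFilterWord(fullName) {
  const tokens = fullName.trim().split(/\s+/);
  return tokens[tokens.length - 1].toLowerCase();
}

// Deduplicate and sort so the list is stable from run to run.
const filterList = [...new Set(biologists.map(toFilterWord))].sort();
console.log(filterList.join("\n"));
```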

Analysis settings:

  • Analyze single words
  • Explicit list of words (list from 012)
  • Stem words: no
  • Text block method: number of blocks
  • Number of blocks: 1
  • Split across: no
  • Word cloud: no

The resulting CSV leaves you with several extra unneeded columns, which were removed. (014)

The word frequency analysis, when not splitting across, gives you the frequency lists for each word in each article, where articles are specified by the internal evoText identifier (in the case of these Nature articles, an SHA-1 checksum). The next step was thus to query the server for the author names for each of those documents. At this stage, the biologists South and East unfortunately had to be removed, as it was clear that matches on their lowercased names were producing noise, not signal. (015)

Those author names were then matched to the analyzed words. Trailing data was stripped from this CSV file using head. (016)
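The matching step can be sketched roughly as below. Everything here is a stand-in: the document IDs are short placeholders rather than SHA-1 checksums, the counts are invented, and `mentionRows` is a hypothetical helper, not the script actually used to produce 016.

```javascript
// Placeholder per-article frequencies: document id -> { word: count }.
const frequencies = {
  "doc-a": { bateson: 3, east: 5, pearson: 0 },
  "doc-b": { weldon: 2, south: 1 },
};
// Placeholder id -> author lookup, standing in for the server query.
const authors = { "doc-a": "weldon", "doc-b": "pearson" };
// Names dropped because they match ordinary words.
const ambiguous = new Set(["south", "east"]);

// Produce one row per (author, mentioned name) pair with a nonzero count.
function mentionRows(frequencies, authors, ambiguous) {
  const rows = [];
  for (const [docId, counts] of Object.entries(frequencies)) {
    const author = authors[docId];
    if (author === undefined) continue; // no known author for this document
    for (const [word, count] of Object.entries(counts)) {
      if (count > 0 && !ambiguous.has(word)) {
        rows.push({ author, mentioned: word, count });
      }
    }
  }
  return rows;
}

console.log(mentionRows(frequencies, authors, ambiguous));
```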

This CSV was then converted into a graph CSV, with repeated rows for edge weight, that could be read into Gephi. (017) These were laid out and visualized in Gephi. (018) Nodes were clustered according to Gephi’s modularity statistic, colored, and laid out. (019)
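One way to read "repeated rows for edge weight" is that an author mentioning a name n times yields n identical source/target rows. The sketch below assumes that reading. The `Source,Target` header is what I understand Gephi's spreadsheet import to look for, but that, the sample rows, and the helper name `toEdgeCsv` are all assumptions.

```javascript
// Invented sample rows in the author/mentioned/count shape.
const mentions = [
  { author: "weldon", mentioned: "bateson", count: 3 },
  { author: "pearson", mentioned: "weldon", count: 2 },
];

// Emit one CSV row per mention, so that an edge's weight is simply the
// number of times its row is repeated.
function toEdgeCsv(mentions) {
  const lines = ["Source,Target"];
  for (const { author, mentioned, count } of mentions) {
    for (let i = 0; i < count; i++) {
      lines.push(`${author},${mentioned}`);
    }
  }
  return lines.join("\n");
}

console.log(toEdgeCsv(mentions));
```

Names containing commas or quotes would need proper CSV escaping, which this sketch omits.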

The modularity data was confirmed by looking for similar clustering at several other resolution values (numbers of groupings). The grouping results presented were found to be robust, and the data looked interesting enough to be worthy of continued pursuit.

Evolution of the Network

The next step was to explore the groupings over time. To do this I queried evoText for the publication year of each of the articles in the network, in addition to their authors (020), and again matched author names to analyzed words. (Once more, trailing data was stripped via head.) (021) These were split by year into five groups: through 1894, 1895–1899, 1900–1904, 1905–1909, and 1910 and later (022). I converted them into two representations: one that could be visualized in Gephi (024), and another output as JavaScript, to be loaded into sigma.js (023).
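The five-way split by year can be sketched as follows. The period boundaries come from the text above. The row shape, the sample rows, and the helper names `periodFor` and `splitByPeriod` are hypothetical.

```javascript
// The five periods named in the text, with inclusive year bounds.
const periods = [
  { label: "through 1894", from: -Infinity, to: 1894 },
  { label: "1895-1899", from: 1895, to: 1899 },
  { label: "1900-1904", from: 1900, to: 1904 },
  { label: "1905-1909", from: 1905, to: 1909 },
  { label: "1910 and later", from: 1910, to: Infinity },
];

// Return the label of the period containing a (numeric) year.
function periodFor(year) {
  return periods.find((p) => year >= p.from && year <= p.to).label;
}

// Group rows by period; every period is present even if it ends up empty.
function splitByPeriod(rows) {
  const groups = Object.fromEntries(periods.map((p) => [p.label, []]));
  for (const row of rows) {
    groups[periodFor(row.year)].push(row);
  }
  return groups;
}

// Invented sample rows.
const sample = [
  { author: "weldon", mentioned: "bateson", year: 1902 },
  { author: "pearson", mentioned: "weldon", year: 1894 },
];
console.log(splitByPeriod(sample));
```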

I then performed modularity analyses on each of the time ranges in Gephi, and manually saved modularity class data as a list of color palette indices in JS. Node IDs, sizes, and edge weights were generated automatically in JS format.

The graph data is then animated and visualized using sigma.js. Each of the five timestamps' worth of data is loaded, node sizes and initial random positions are set, the graph is run through a ForceAtlas2 layout, and positions are then saved. All data – node positions, sizes, colors, and edge weights – are then interpolated between the timestamps for animation.
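The interpolation between two saved timestamps might look something like the sketch below. It assumes plain linear interpolation, which the description above does not confirm, and it covers only positions and sizes; colors and edge weights are left out. It uses no sigma.js calls, and the node objects and function names are illustrative.

```javascript
// Linear interpolation between a and b for t in [0, 1].
function lerp(a, b, t) {
  return a + (b - a) * t;
}

// Blend one node's position and size between two keyframes.
function interpolateNode(from, to, t) {
  return {
    id: from.id,
    x: lerp(from.x, to.x, t),
    y: lerp(from.y, to.y, t),
    size: lerp(from.size, to.size, t),
  };
}

// Blend a whole frame. Nodes missing from the second frame are skipped,
// a simplification that a real animation would need to handle (e.g. fading).
function interpolateFrame(frameA, frameB, t) {
  const byId = new Map(frameB.map((node) => [node.id, node]));
  return frameA
    .filter((node) => byId.has(node.id))
    .map((node) => interpolateNode(node, byId.get(node.id), t));
}

// Invented keyframes for two nodes.
const frameA = [
  { id: "weldon", x: 0, y: 0, size: 2 },
  { id: "bateson", x: 10, y: -4, size: 4 },
];
const frameB = [
  { id: "weldon", x: 8, y: 4, size: 6 },
  { id: "bateson", x: 10, y: 0, size: 2 },
];
console.log(interpolateFrame(frameA, frameB, 0.5));
```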

Comparison with Kim's Figure 2

The edges that correspond to edges in Kim's Figure 2 were then extracted from the full-network graph (018), and a version of Kim's figure was produced with the edges scaled according to our network's edge weights (excepting the edge from Pearson to Pearl, which would be so large that it would cover a number of names on the graph), with dashed lines indicating zero-weight edges (026).
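The edge lookup can be sketched as below. The weights and edge list are invented placeholders, not values from 018 or from Kim's figure, and treating edges as directed (source to target) is an assumption based on the phrase "the edge from Pearson to Pearl".

```javascript
// Invented weights keyed as "source>target" (NOT real values from 018).
const networkWeights = new Map([
  ["pearson>weldon", 12],
  ["bateson>weldon", 7],
]);

// Invented stand-in for the edges drawn in the figure.
const figureEdges = [
  ["pearson", "weldon"],
  ["bateson", "weldon"],
  ["yule", "johannsen"],
];

// Look up each figure edge in the network; an edge with no entry gets
// weight 0 and is flagged to be drawn with a dashed line.
function weightsForFigure(figureEdges, networkWeights) {
  return figureEdges.map(([source, target]) => {
    const weight = networkWeights.get(`${source}>${target}`) || 0;
    return { source, target, weight, dashed: weight === 0 };
  });
}

console.log(weightsForFigure(figureEdges, networkWeights));
```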

Miscellaneous

The 'published-figures' folder contains the source material for the figures that were published in the final version of the paper to arise from this chapter. (Citation information will be added here when available.)

Thanks

Thanks to Nature Publishing Group for access to the text of Nature for text mining purposes, via the evoText project.

Thanks to the authors of the following software and libraries, without which I'd have had a much harder time putting this together:

License

All text, data, and source code authored by Charles H. Pence is copyright © 2016, Louisiana State University.

All text and data contained here are available under the Creative Commons Attribution 4.0 International license (CC-BY 4.0), as applicable. Some of the metadata and content from Nature may remain copyright Nature Publishing Group in your local jurisdiction, depending on rules regarding metadata, fair use, and text mining.

Any source code authored by Charles H. Pence is released under the MIT License. Some of the source here is authored by others. Zurb Foundation for Sites, sigma.js, jQuery, what-input, the ES5- and ES6-shims are also released under the MIT license. chroma.js is released under the BSD license. All these libraries remain copyright their original authors.