Code for optimizing a conference schedule based on similarity of abstracts

ISH Conference Scheduler

Do you want to run a conference? Do you need to bundle together individual submissions into sessions that both make coherent sense with one another and are different enough from the other sessions happening at the same time to prevent people from bouncing back and forth between rooms as much as possible? Does that sound like something a computer should help you with? This is the software for you.

This code was cobbled together by Charles Pence on the basis of some existing code by Prashanti Manda and an article describing that code's use, available at http://doi.org/10.7717/peerj-cs.234. The goal was to use this system to evaluate sessions for the 2021 meeting of the International Society for the History, Philosophy, and Social Studies of Biology (ISHPSSB).

How to Use

Prerequisites

This code is set up to work with Python (I developed it with 3.9.4) and virtualenv. To set up the environment, run the following from this directory:

virtualenv .venv
source .venv/bin/activate

pip install -r requirements.txt

What this Does

The scripts here all have optional support for two kinds of conferences:

  1. A conference with only "individual papers" – that is, individual talks are submitted, and the algorithm bundles these into sessions of a given size.
  2. A conference with both "individual papers" and "sessions" – that is, where a session organizer submitted a set of talks that should be kept together.

Also, these scripts have support for "blocks" in your conference – different categories of time-slot, where users can choose which blocks they are willing to present in. (Think, for example, of users picking between different times of day that are consistent with their local timezone for an online conference.)

In general, the algorithm here will perform the following steps:

  1. Assess the similarity between all of the paper and session abstracts in your meeting. There are a number of ways to do this, including LDA topic-modeling approaches and WordVec (word embedding) based solutions.
  2. Create an initial randomized schedule, consistent with user block preferences (if you're using those).
  3. Optimize that random schedule, by randomly swapping talks/sessions, with the goal of:
    • Maximizing the similarity within a session (i.e., putting papers together that are similar)
    • Minimizing the similarity between sessions in the same time-slot (i.e., reducing the incentive for people to jump between sessions)
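The two goals above can be folded into a single score. Here is a minimal sketch of one way to do that, assuming a schedule is a list of time-slots, each a list of sessions, each a list of talk indices into a similarity matrix. The function names and the simple "within minus between" combination are my own illustration, not necessarily what the scripts in src compute.

```python
from itertools import combinations


def mean_pairwise(sim, pairs):
    """Average similarity over an iterable of (i, j) index pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def schedule_score(sim, schedule):
    """Higher is better: coherent sessions, dissimilar parallel sessions.

    Hypothetical objective: mean within-session similarity minus mean
    similarity between sessions that share a time-slot.
    """
    within, between = [], []
    for slot in schedule:
        for session in slot:
            within.append(mean_pairwise(sim, combinations(session, 2)))
        for a, b in combinations(slot, 2):
            between.append(mean_pairwise(sim, ((i, j) for i in a for j in b)))
    mean_within = sum(within) / len(within) if within else 0.0
    mean_between = sum(between) / len(between) if between else 0.0
    return mean_within - mean_between


# Toy data (made up): talks 0 and 1 are alike, talks 2 and 3 are alike.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = [[[0, 1], [2, 3]]]  # one time-slot, two coherent parallel sessions
bad = [[[0, 2], [1, 3]]]   # one time-slot, similar talks split across rooms
print(schedule_score(sim, good), schedule_score(sim, bad))
```

The "good" schedule scores higher because each room holds similar talks while the two rooms have little in common with each other.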

There are two sets of sample data here – ISH2019 and ISH2021. The former was testing data, and only includes individual paper submissions. The latter was real data for the first schedule of the ISHPSSB 2021 conference, and includes both individual papers and sessions.

Details

Brief details about the scripts found in the src directory are given below. All of the parameters that you can configure for these scripts can be passed on the command line, and you can learn about them by calling python <script.py> --help.

Note: For all three of the similarity scripts, if you are using both individual papers and sessions, you should pass all of the individual paper abstract files followed by all of the session abstract files. This will produce the right kind of combined document similarity matrix that is needed for the optimization script later on.

Similarity-LDA.py — Compute similarity with topic models

The first of three different algorithms for computing document similarity, this code creates topic models from the documents in the corpus, and then measures the distance between the representations of each talk or session in terms of those topics.

The major tunable parameters in this script are --num-topics and --passes. The former sets the number of topics, and should be chosen by examining the metrics calculated for the resulting models (there are scripts for looking at these in the two example folders). The latter sets the number of passes through the corpus for model training. In general, higher is better, at the expense of more time spent training the models.

By default, the distance between document vectors is computed with cosine similarity; you can switch to Hellinger distance by passing --hellinger.
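To make the difference between the two measures concrete, here is a small sketch in plain Python, assuming each document is represented as a probability distribution over topics. The function names and toy vectors are illustrative only; the script itself may compute these through a library.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def hellinger_distance(p, q):
    """Hellinger distance between two probability distributions.

    Ranges from 0.0 (identical) to 1.0 (no overlap at all).
    """
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Made-up topic distributions over three topics.
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.60, 0.30, 0.10]  # close to doc_a
doc_c = [0.05, 0.05, 0.90]  # dominated by a different topic

print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
print(hellinger_distance(doc_a, doc_b), hellinger_distance(doc_a, doc_c))
```

Keep in mind that cosine is a similarity (higher means more alike) while Hellinger is a distance (lower means more alike), so the two run in opposite directions.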

Similarity-WordVecWMD.py — Compute similarity with Word Mover Distance

This script uses the GloVe model for word embeddings to describe the positions of documents in semantic space, then computes the distance between them using Word Mover's Distance (roughly, the amount of effort that would be required to transform the probability distribution of document A into that of document B).

This algorithm has no tunable parameters. It is extremely resource intensive, and often CPU-limited; it will run in parallel to the extent possible on your hardware.

Similarity-WordVecSoftCosine.py — Compute similarity with soft cosine distance

This script also uses the GloVe model for word embeddings, but calculates the distance between document vectors using soft cosine similarity. It is much faster than WMD, though in my testing it produces lower-quality results.
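For intuition, soft cosine is ordinary cosine similarity with a term-similarity matrix S slotted in, so that two documents can match even when they share no words. A minimal sketch in plain Python follows; the three-word vocabulary and the values in S are made up, whereas in the script the term similarities would come from the GloVe embeddings.

```python
import math


def bilinear(a, S, b):
    """Compute a^T S b for plain Python lists."""
    return sum(a[i] * S[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def soft_cosine(a, b, S):
    """Soft cosine similarity of bag-of-words vectors a and b under S."""
    denom = math.sqrt(bilinear(a, S, a)) * math.sqrt(bilinear(b, S, b))
    return bilinear(a, S, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["gene", "allele", "museum"].
# S[i][j] is the assumed similarity between word i and word j.
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = [1, 0, 0]  # mentions only "gene"
doc_b = [0, 1, 0]  # mentions only "allele"

print(soft_cosine(doc_a, doc_b, identity))  # plain cosine: no shared words
print(soft_cosine(doc_a, doc_b, S))         # related words now count
```

With the identity matrix this reduces to plain cosine similarity, which is zero for documents with no words in common.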

RandomSchedule.py — Generate random schedule

This script creates a random schedule (possibly taking into account consistency with preferences about blocks). Pass it the structure of blocks, time-slots, and sessions for your conference.
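As a rough illustration of what "random but consistent with block preferences" can mean, here is a sketch that shuffles the talks, drops each one into a random acceptable block with room left, and restarts if it paints itself into a corner. The data layout (a dict of preferences, a dict mapping each block to its number of time-slots and sessions per slot) and all function names are assumptions for this example, not the script's actual input format.

```python
import random


def layout(talks, num_slots, sessions_per_slot, session_size):
    """Chunk a block's talks into sessions, then deal the sessions into slots."""
    sessions = [talks[i:i + session_size] for i in range(0, len(talks), session_size)]
    return [sessions[k * sessions_per_slot:(k + 1) * sessions_per_slot]
            for k in range(num_slots)]


def random_schedule(preferences, structure, session_size, rng, max_tries=1000):
    """Randomly place every talk in a block its presenter accepts.

    preferences: talk id -> set of acceptable block names (hypothetical format)
    structure:   block name -> (number of time-slots, sessions per slot)
    Returns block name -> list of slots -> list of sessions -> list of talk ids.
    """
    capacity = {block: slots * per_slot * session_size
                for block, (slots, per_slot) in structure.items()}
    for _ in range(max_tries):
        assigned = {block: [] for block in structure}
        talks = sorted(preferences)
        rng.shuffle(talks)
        for talk in talks:
            options = sorted(b for b in preferences[talk]
                             if b in assigned and len(assigned[b]) < capacity[b])
            if not options:
                break  # dead end: restart with a fresh shuffle
            assigned[rng.choice(options)].append(talk)
        else:
            return {block: layout(assigned[block], *structure[block], session_size)
                    for block in structure}
    raise RuntimeError("could not place every talk in an acceptable block")


preferences = {
    "t1": {"morning"},
    "t2": {"morning", "evening"},
    "t3": {"evening"},
    "t4": {"morning", "evening"},
}
structure = {"morning": (1, 1), "evening": (1, 1)}
schedule = random_schedule(preferences, structure, session_size=2,
                           rng=random.Random(0))
print(schedule)
```

Restarting on a dead end is the simplest thing that works for small conferences; heavily constrained preferences might call for something smarter.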

OptimizeSchedule.py — Optimize schedule

This script performs random swaps to attempt to produce an optimal schedule, using simulated annealing. There are a number of tunable parameters for this algorithm, though in the vast majority of cases the defaults will be acceptable.

Raising the value of --alpha from 0.99 to, e.g., 0.999 will allow for more exploration of the space away from local optima, but will increase running time and may decrease final output quality. Note that --iterations is an upper bound; the algorithm will stop early when it fails to produce an increase in solution quality for 1000 consecutive iterations.

The entire algorithm will be run --runs times (default 200), and the script will then print the best solution found among those runs.
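The way these parameters interact can be seen in a generic simulated annealing loop. The sketch below is not the script's code: the geometric cooling rule (temperature multiplied by alpha each step), the starting temperature, and the function names are assumptions, and the toy problem just pairs four talks into two sessions.

```python
import math
import random


def anneal(initial, score, neighbor, rng, iterations=10000, alpha=0.99,
           start_temp=1.0, patience=1000):
    """Maximize score() by random moves, sometimes accepting worse ones."""
    current = best = initial
    current_score = best_score = score(initial)
    temp = start_temp
    stale = 0  # consecutive iterations without a new best
    for _ in range(iterations):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: no improvement for `patience` iterations
        temp = max(temp * alpha, 1e-12)  # assumed geometric cooling
    return best, best_score


def swap_two(order, rng):
    """Neighbor move: swap the positions of two randomly chosen talks."""
    i, j = rng.sample(range(len(order)), 2)
    swapped = list(order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return tuple(swapped)


def best_of_runs(make_initial, score, neighbor, runs=200, seed=0, **kwargs):
    """Restart the annealer `runs` times and keep the best result."""
    rng = random.Random(seed)
    results = [anneal(make_initial(rng), score, neighbor, rng, **kwargs)
               for _ in range(runs)]
    return max(results, key=lambda result: result[1])


# Toy problem (made-up data): positions 0-1 form one session, 2-3 another.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]


def toy_score(order):
    return sim[order[0]][order[1]] + sim[order[2]][order[3]]


def toy_initial(rng):
    return tuple(rng.sample(range(4), 4))


best, best_score = best_of_runs(toy_initial, toy_score, swap_two,
                                runs=5, iterations=500, patience=100)
print(best, best_score)
```

In this sketch a larger alpha keeps the temperature high for longer, so worse swaps keep being accepted and the search wanders further before settling, which matches the trade-off described above.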

PrintSchedule.py — Print readable schedule

This script prints a basic, readable version of the optimized schedule to stdout.

License and History

This code is copyright 2021 Charles H. Pence and is released under the GNU GPL v3. See LICENSE.txt.

This code is based upon the Automated Conference Scheduler code by Prashanti Manda, which is copyright 2014 and released under the GNU GPL. For information about that original project, see README-original.md.