Bad Hashing:
- Does not include every node.
- Here they use approx. 10% of nodes in one bucket.
- Should include ways to handle near-misses.
- They found the cost of hashing to be relatively constant.
- A good hash works best for finding exact clones -> that's why we use a bad one now.
- Maybe worsen the one we have at the moment.
- A hash function that ignores small subtrees is considered good by ICSM98?
- Trees that are similar modulo identifiers: good.
- This approach later went on to define how CloneDR works -> made by the same guys.
- TODO: Change the hash function according to the paper.
- Words from the man himself: https://stackoverflow.com/questions/5629397/finding-similar-code-sections-using-sub-trees
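A minimal sketch of what such a deliberately "bad" hash could look like. It is not the function from the paper and not our current implementation. AstNode, SubtreeHasher and the threshold MIN_SIZE are hypothetical names made up for this note; the idea is only that identifier text never enters the hash and that subtrees below the threshold are skipped, so near-miss clones land in the same bucket.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal AST node; stands in for a real parser's node type.
final class AstNode {
    final String kind;   // node type, e.g. "BinaryExpr"
    final String label;  // identifier or literal text, "" for inner nodes
    final List<AstNode> children;

    AstNode(String kind, String label, AstNode... children) {
        this.kind = kind;
        this.label = label;
        this.children = Arrays.asList(children);
    }

    // Number of nodes in this subtree, including the node itself.
    int size() {
        int n = 1;
        for (AstNode c : children) n += c.size();
        return n;
    }
}

final class SubtreeHasher {
    // Assumed threshold: subtrees with fewer nodes are ignored.
    static final int MIN_SIZE = 3;

    // Hashes only the node kinds; labels (identifiers) and small
    // subtrees do not contribute to the result.
    static int hash(AstNode node) {
        int h = node.kind.hashCode();
        for (AstNode c : node.children) {
            if (c.size() >= MIN_SIZE) h = 31 * h + hash(c);
        }
        return h;
    }

    // Groups every sufficiently large subtree by its hash.
    static Map<Integer, List<AstNode>> bucket(AstNode root) {
        Map<Integer, List<AstNode>> buckets = new HashMap<>();
        collect(root, buckets);
        return buckets;
    }

    private static void collect(AstNode node, Map<Integer, List<AstNode>> buckets) {
        if (node.size() >= MIN_SIZE) {
            buckets.computeIfAbsent(hash(node), k -> new ArrayList<>()).add(node);
        }
        for (AstNode c : node.children) collect(c, buckets);
    }
}
```

With this sketch, x = a + b and y = c + d get the same hash because only the names differ. Nodes that share a bucket are only candidates and still need a real similarity check afterwards.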
Detection of sequences:
- Sequences in their example are "chained", meaning there is a top node that leads down to all following nodes, one after another.
- This differs from the representation in javaparser: there, a root holds all the top-level nodes as its children, and sequences are neighboring children.
- So the trick for us would be to check whether similar subtrees are in sequence.
- When to do this? I would say, for now, do this after the matching of hashed nodes, though this can lead to unnecessary checks, or we might miss some nodes if there are identical trees higher up.
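A rough sketch of the "neighboring children" check, assuming every child of a parent has already been reduced to one hash value. SiblingRuns, Run and minLength are hypothetical names for this note, not part of javaparser or of the paper. Given the child-hash lists of two parents, it reports the maximal runs of neighboring children whose hashes match pairwise.

```java
import java.util.ArrayList;
import java.util.List;

final class SiblingRuns {
    // A run of `length` neighboring children that starts at index
    // startA in the first list and at startB in the second.
    static final class Run {
        final int startA, startB, length;
        Run(int startA, int startB, int length) {
            this.startA = startA;
            this.startB = startB;
            this.length = length;
        }
    }

    // Finds all maximal common runs of at least minLength elements.
    static List<Run> commonRuns(List<Integer> a, List<Integer> b, int minLength) {
        // len[i][j] = number of consecutive matches starting at a[i], b[j]
        int[][] len = new int[a.size() + 1][b.size() + 1];
        for (int i = a.size() - 1; i >= 0; i--) {
            for (int j = b.size() - 1; j >= 0; j--) {
                if (a.get(i).equals(b.get(j))) len[i][j] = len[i + 1][j + 1] + 1;
            }
        }
        List<Run> runs = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            for (int j = 0; j < b.size(); j++) {
                // A run starts here only if the preceding pair does not match.
                boolean atStart = i == 0 || j == 0 || !a.get(i - 1).equals(b.get(j - 1));
                if (atStart && len[i][j] >= minLength) runs.add(new Run(i, j, len[i][j]));
            }
        }
        return runs;
    }
}
```

For the child hashes [10, 20, 30, 40] and [99, 20, 30, 40, 77] with minLength 2, this yields a single run of length 3 starting at index 1 in both lists. The table costs O(n*m) per pair of parents, which is one way the unnecessary checks mentioned above could show up.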
Current Problems to be resolved:
- JavaParser does an unwanted deep compare on some nodes.
- We can fix this either by implementing compares for the affected nodes (e.g. BlockStmt) or with a different approach (we simply have to match the class the node is an instance of, since we can't easily hot-patch this in the EqualsVisitor).
- The different approach could be hashing, for example checking whether this list of hashes is included in another one. But which hash to choose? We may run into the same problems here.
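Both ideas from the list above can be sketched in a few lines. This is only an illustration under assumptions: ShallowCompare, sameKind and containsRun are hypothetical helper names, and the code works on plain objects and integer hashes instead of real javaparser nodes.

```java
import java.util.Collections;
import java.util.List;

final class ShallowCompare {
    // Shallow match: both objects have the same runtime class.
    // Children are deliberately not inspected.
    static boolean sameKind(Object a, Object b) {
        return a != null && b != null && a.getClass() == b.getClass();
    }

    // True if `needle` occurs as a contiguous run inside `haystack`,
    // e.g. the statement hashes of one block inside another block.
    static boolean containsRun(List<Integer> haystack, List<Integer> needle) {
        return Collections.indexOfSubList(haystack, needle) >= 0;
    }
}
```

sameKind sidesteps the deep compare entirely, at the cost of treating any two nodes of the same class as equal. containsRun only helps if the chosen hash is stable for the nodes we want to count as equal, which is exactly the open question in the last item.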