Title: Are Longer Verbal Expressions Really Semantically More Similar to Each Other? An Investigation of the Elaboration-Bias in Vector-Based Models of Word Meaning

Authors: Forthmann, B., Günther, F., Hass, R., Benedek, M., Doebler, P.

Affiliation: Institut für Psychologie in Bildung und Erziehung, WWU Münster

Abstract: Vector-based models of word meaning such as latent semantic analysis (LSA), the hyperspace analogue to language (HAL) model, and the continuous bag-of-words (CBOW) model are widely used in cognitive psychology and related disciplines to quantify the semantic similarity of verbal expressions. However, applications of LSA to automatic text-grading systems and research on the scoring of divergent thinking tests have revealed that semantic similarity can be biased for longer verbal expressions: semantic similarity computed from vector cosines is a monotonically increasing function of the number of words (the elaboration bias). By means of a simulation study, we demonstrate that this bias also occurs for HAL and CBOW semantic spaces. We further show that rank or inverse quantile transformations of the semantic spaces yield unbiased semantic similarities. Importantly, cosines based on the transformed semantic spaces yielded comparable or better validity on several benchmarks for both English- and German-language semantic spaces. The suggested transformations were implemented in the statistical software R, and the transformed spaces can easily be used for various computations via the R package LSAfun.
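
The following minimal R sketch (base R only, toy data) illustrates the two ideas summarized in the abstract: additive composition of expression vectors followed by cosine similarity, and a column-wise inverse quantile (normal scores) transformation of the semantic space. The toy space, its shared positive component, the chosen expression lengths, and the column-wise application of the transformation are illustrative assumptions made for this sketch; they are not the authors' actual simulation design or scoring procedure, and the R package LSAfun is not needed to run it.

    set.seed(1)
    n_words <- 200
    n_dims  <- 50

    ## Toy "semantic space": rows are words, columns are dimensions. A shared
    ## positive component (+ 0.3) is added so that the toy space shows a
    ## length-related bias; this is an assumption of the sketch, not a claim
    ## about how the bias arises in real LSA/HAL/CBOW spaces.
    space <- matrix(rnorm(n_words * n_dims) + 0.3, nrow = n_words,
                    dimnames = list(paste0("word", seq_len(n_words)), NULL))

    cosine  <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

    ## Compose an expression vector by summing its word vectors.
    compose <- function(words, tvectors) colSums(tvectors[words, , drop = FALSE])

    ## Mean cosine between pairs of non-overlapping random "expressions"
    ## of a given length.
    mean_cos <- function(len, tvectors, n_pairs = 500) {
      mean(replicate(n_pairs, {
        ws <- sample(rownames(tvectors), 2 * len)
        cosine(compose(ws[seq_len(len)], tvectors),
               compose(ws[-seq_len(len)], tvectors))
      }))
    }

    lens <- c(1, 2, 5, 10, 20)

    ## Elaboration bias in the raw toy space: cosines grow with length.
    round(sapply(lens, mean_cos, tvectors = space), 2)

    ## Column-wise inverse quantile (normal scores) transformation of the
    ## space; applying it per dimension is an assumption for this sketch.
    space_iq <- apply(space, 2, function(x) qnorm((rank(x) - 0.5) / length(x)))
    rownames(space_iq) <- rownames(space)

    ## After the transformation the mean cosine is roughly flat across lengths.
    round(sapply(lens, mean_cos, tvectors = space_iq), 2)

In this toy setting, the raw-space cosines increase with expression length while the transformed-space cosines stay near zero for all lengths; the paper itself should be consulted for the exact transformations, simulation design, and benchmark evaluations.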