This paper presents a first study of how consistently human assessors are able to identify, from query logs, when searchers are facing difficulties re-finding documents. Using 12 assessors, we investigate the effect of two variables on assessor agreement: the level of detail in the assessment guidelines and assessor experience. The results indicate statistically significantly better agreement when detailed guidelines are used. A maximum agreement of 78.9% was achieved, which is comparable to the levels of agreement reported in other information retrieval contexts. We also study the effects of two contextual factors, representative of system performance and user effort. Significant differences between agreement levels were found for both factors, suggesting that contextual factors may play an important role in obtaining higher agreement levels. The findings contribute to a better understanding of how to generate ground truth data in re-finding and other labeling contexts, and have further implications for building automatic models that predict re-finding difficulty.