2.1 Cross-corpus Analysis of Interviewer Questions

This section presents a cross-corpus analysis of 89,759 interviewer questions drawn from four archives: David Boder’s Voices of the Holocaust, the Yale Fortunoff Archive, the USC Shoah Foundation’s Visual History Archive, and the USC Shoah Foundation’s Dimensions in Testimony.

The 89,759 questions from the four archives were processed with the “bert-base-nli-mean-tokens” model of SBERT (Sentence-BERT), a Python framework built on Google’s BERT language model, to produce a fixed-size, 768-dimensional embedding vector for each question.
The question embeddings were then grouped together using K-means clustering to identify similar questions and derive question topics.
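The two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `embed_questions` and `kmeans` are hypothetical, the embedding step assumes the third-party sentence-transformers package is installed, and the clustering step is a plain-Python version of Lloyd's algorithm written so that it runs without dependencies. The source does not say which K-means implementation was used; a library routine would be the usual choice at this scale.

```python
import math
import random


def embed_questions(questions):
    """Encode questions as 768-dimensional SBERT vectors.

    Assumes the sentence-transformers package; the import is deferred so the
    rest of this sketch runs without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("bert-base-nli-mean-tokens")
    return model.encode(questions).tolist()


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's-algorithm K-means; returns one cluster label per vector."""
    rng = random.Random(seed)
    # Start from k randomly chosen input vectors as initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 2-D vectors standing in for real 768-dimensional embeddings.
toy_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = kmeans(toy_vectors, k=2)
```

With real data, the call would look like `kmeans(embed_questions(questions), k=100)`, although the pure-Python loop is far slower than an optimized library on tens of thousands of vectors.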

Download question data and SBERT embeddings:

Analysis by Michelle Lee and Todd Presner.

1. Four archives: 89,759 questions

Developed by Michelle Lee and Todd Presner

The question embeddings were grouped into 100 clusters with K-means clustering (k=100). The clusters were then manually examined, reorganized, and labelled as “question topics”. This process yielded 310 question topics covering all 89,759 questions, plus 25 “parent topics” that group related topics together.
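One way to picture the resulting two-level labelling is as a mapping from question topic to parent topic, which can then be tallied per archive. The sketch below is only an illustration under stated assumptions: the topic labels, parent labels, records, and the function name `parent_topic_shares` are all invented for the example and do not come from the actual dataset. Counts are converted to per-archive shares because the four archives differ in size.

```python
from collections import Counter, defaultdict

# Hypothetical labels; the real analysis has 310 topics and 25 parent topics.
TOPIC_TO_PARENT = {
    "place of birth": "origins",
    "family members": "origins",
    "schooling": "childhood",
}


def parent_topic_shares(records, topic_to_parent):
    """Return {archive: {parent topic: share of that archive's questions}}."""
    counts = defaultdict(Counter)
    for archive, topic in records:
        counts[archive][topic_to_parent[topic]] += 1
    return {
        archive: {parent: n / sum(c.values()) for parent, n in c.items()}
        for archive, c in counts.items()
    }


# Invented (archive, question topic) pairs, one per question.
records = [
    ("Archive A", "place of birth"),
    ("Archive A", "schooling"),
    ("Archive B", "family members"),
    ("Archive B", "place of birth"),
    ("Archive B", "place of birth"),
    ("Archive B", "schooling"),
]
shares = parent_topic_shares(records, TOPIC_TO_PARENT)
```

Here `shares["Archive B"]["origins"]` is 0.75, since three of that archive's four invented questions fall under the "origins" parent topic.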


2. Parent Topics distribution across the corpora



3. Exploration of question topics



4. SBERT clustering accuracy
