Open-ended
Cross-lingual generalizability of LLM evals
Steps:
- take existing LLM evals (e.g., for a specific dangerous capability)
- auto-translate the eval dataset (make sure to sanity-check the translations)
- run the same eval(s) in different languages using a multilingual model
- compare outcomes across languages
- repeat testing and analysis for different LLMs
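The comparison step could be sketched roughly as follows. This is a minimal illustration, not a prescribed pipeline: it assumes each eval item has already been scored as correct or incorrect, and the record fields (`language`, `correct`), the function names, and the choice of English (`"en"`) as the reference language are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: fraction of items scored correct}.

    `records` is an iterable of dicts with a 'language' key and a
    boolean 'correct' key (an assumed, hypothetical result format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["language"]] += 1
        hits[record["language"]] += int(record["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}


def gap_to_reference(accuracy, reference="en"):
    """Return each language's accuracy minus the reference language's accuracy."""
    base = accuracy[reference]
    return {
        lang: acc - base
        for lang, acc in accuracy.items()
        if lang != reference
    }
```

A negative gap for a language means the model scored lower on the translated eval than on the reference version. Such a gap can reflect either a real capability difference or translation noise, which is why the sanity check on translations matters.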
Are the results comparable? Do the outcomes scale similarly with model size in all languages?
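One possible way to make the scaling question concrete is to fit, per language, a straight line of eval score against the logarithm of parameter count and compare the slopes. The sketch below assumes a roughly log-linear trend over the tested model sizes, which may not hold in practice; the function name and inputs are hypothetical.

```python
import math


def scaling_slope(param_counts, scores):
    """Least-squares slope of score versus log10(parameter count).

    Assumes at least two distinct model sizes and one score per size.
    """
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    return sxy / sxx
```

Computing this slope separately for each language and comparing the values gives a rough indication of whether outcomes scale similarly with model size across languages.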
NLP, Deep Learning, AI Governance