Hierarchical Prompting Taxonomy (HPT)

A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles

Category: NLP, LLM Evaluation

Overview

Hierarchical Prompting Taxonomy (HPT) is a universal evaluation framework for large language models (LLMs), grounded in human cognitive principles. HPT introduces the Hierarchical Prompting Framework (HPF), which arranges five distinct prompting strategies into a hierarchy ordered by their cognitive demands. HPT also proposes the Hierarchical Prompting Index (HPI) to quantify task complexity and the cognitive competencies of LLMs across diverse datasets.
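
To make the hierarchy concrete, here is a minimal sketch of how such a framework could be applied to a single sample: prompts are tried in order of increasing cognitive demand, and the level at which the model first answers correctly becomes its score. The level templates, the `query_model` and `is_correct` callables, and the failure penalty are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable, List, Tuple

# Prompting strategies ordered from lowest to highest cognitive demand
# (level 1 = simplest). These templates are illustrative assumptions.
LEVELS: List[Tuple[int, str]] = [
    (1, "Answer the question: {question}"),
    (2, "Answer the question, thinking step by step: {question}"),
    (3, "Here are three worked examples:\n{examples}\nNow answer: {question}"),
    (4, "Break the problem into sub-problems, solve them in order, then answer: {question}"),
    (5, "First generate relevant background knowledge, then use it to answer: {question}"),
]

def hierarchical_prompt(question: str,
                        query_model: Callable[[str], str],
                        is_correct: Callable[[str], bool],
                        examples: str = "") -> int:
    """Try strategies from least to most cognitively demanding.

    Returns the level at which the model first answers correctly,
    or len(LEVELS) + 1 as a penalty when every level fails.
    """
    for level, template in LEVELS:
        prompt = template.format(question=question, examples=examples)
        if is_correct(query_model(prompt)):
            return level
    return len(LEVELS) + 1  # penalty score when no strategy succeeds
```

Scoring the first successful level rewards models that solve a task with simpler prompts, which is what lets the same scale double as a measure of task complexity.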

HPT enables comprehensive evaluation of LLMs' problem-solving abilities and of dataset complexity, providing a standardized metric for task difficulty. In experiments, HPF improved LLM performance by 2% to 63% over standard prompting baselines, with GSM8K identified as the most cognitively demanding task among those tested.

Authors: Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, Aman Chadha

Technologies Used

Python, PyTorch, Transformers, Hugging Face

Features & Functionality

  • Hierarchical Prompting Framework (HPF) with five unique prompting strategies
  • Hierarchical Prompting Index (HPI) for quantifying task complexity (see the sketch after this list)
  • Evaluation of LLMs' cognitive competencies across diverse datasets
  • Standardized metric for dataset and task complexity
  • Open-source implementation for reproducibility
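
Building on the per-sample scores from the earlier sketch, the following shows one way a dataset-level HPI could be aggregated. The simple averaging (with failed samples carrying the penalty score) is an assumption, not necessarily the paper's exact formula.

```python
from statistics import mean
from typing import List

def hierarchical_prompting_index(level_scores: List[int]) -> float:
    """Average per-sample level scores into a dataset-level HPI.

    Lower values suggest the model solves the dataset with
    cognitively simpler prompts; higher values suggest greater
    task complexity for that model.
    """
    return mean(level_scores)

# Example: four samples solved at levels 1, 2, 2 and 5,
# plus one failure scored with the penalty value 6.
print(hierarchical_prompting_index([1, 2, 2, 5, 6]))  # -> 3.2
```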

Challenges & Learnings

The main challenge was designing a framework that meaningfully aligns LLM evaluation with human cognitive principles, and developing metrics that are both interpretable and robust across tasks and models.

This project deepened my understanding of prompt engineering, cognitive science, and the evaluation of large language models in NLP.