Project Overview
Client: Dr. Ali Jannesari, Advisor: Arushi Sharma
We have extracted feature representations from code-trained large language models (LLMs) in order to explain and
interpret what these models learn. Clustering these representations yields groups of latent concepts, each capturing
a pattern the models learned from code.
This project focuses on auto-labeling code datasets to capture different code properties. We will use Abstract Syntax
Tree (AST) tools like Tree-sitter, regular expressions, and LLM-generated labels to automatically annotate these
datasets. Once the datasets are labeled, we will evaluate the concepts learned by the LLMs by measuring their
alignment with the auto-labeled datasets. This evaluation will help determine how well the machine-learned concepts
correspond to human-defined code properties, enhancing the interpretability of the models.
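As a rough illustration of the labeling and alignment steps described above, the sketch below tags tokens with a coarse lexical property using regular expressions, then checks whether a cluster of tokens is dominated by a single label. It is not the project's actual pipeline: Python's standard `re` module stands in for Tree-sitter, and the function names (`label_tokens`, `cluster_alignment`), the label set, and the 0.9 threshold are all assumptions chosen for the example.

```python
import keyword
import re
from collections import Counter

# Hypothetical, deliberately small token grammar for Python-like snippets.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<STRING>"[^"\n]*"|'[^'\n]*')
  | (?P<OPERATOR>[-+*/%=<>!]=?|[()\[\]{}:,.])
""", re.VERBOSE)


def label_tokens(source):
    """Return (token, label) pairs for each token the grammar recognizes."""
    labeled = []
    for match in TOKEN_RE.finditer(source):
        label = match.lastgroup
        text = match.group()
        if label == "NAME":
            # Split names into language keywords and user identifiers.
            label = "KEYWORD" if keyword.iskeyword(text) else "IDENTIFIER"
        labeled.append((text, label))
    return labeled


def cluster_alignment(cluster_tokens, token_labels, threshold=0.9):
    """Check whether one label covers at least `threshold` of a cluster.

    Returns (label, fraction); label is None when no label is dominant.
    The threshold value is an assumption, not a project-defined setting.
    """
    counts = Counter(token_labels[t] for t in cluster_tokens if t in token_labels)
    if not counts:
        return None, 0.0
    label, hits = counts.most_common(1)[0]
    fraction = hits / len(cluster_tokens)
    return (label if fraction >= threshold else None), fraction


snippet = "if x == 10:\n    return y + 2"
labels = dict(label_tokens(snippet))

# A cluster holding only keywords aligns with the KEYWORD label.
print(cluster_alignment(["if", "return"], labels))
# A mixed cluster has no dominant label at this threshold.
print(cluster_alignment(["x", "10"], labels))
```

A cluster counts as aligned here only when a single label covers nearly all of its tokens. A real evaluation would likely label tokens from parse-tree node types and might use a different alignment measure.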
Team Members
Manjul Balayar
Engineer
Software Engineering
Rayne Wilde
Engineer
Software Engineering/Data Science
Sam Frost
Engineer
Software Engineering
Akhilesh Nevatia
Engineer
Software Engineering
Ethan Rogers
Engineer
Electrical Engineering
Final Deliverables
Demo Video
Design Document
IRP Presentation
Poster