Project Overview


Client: Dr. Ali Jannesari, Advisor: Arushi Sharma


We have extracted feature representations from code-trained large language models (LLMs) with the goal of explaining and interpreting them. Clustering these representations yields groups of latent concepts, each capturing a pattern the models learned from code.

This project focuses on auto-labeling code datasets to capture different code properties. We will annotate these datasets automatically using three sources of labels: Abstract Syntax Tree (AST) tools such as Tree-sitter, regular expressions, and LLM-generated labels. Once the datasets are labeled, we will evaluate the concepts learned by the LLMs by measuring how well each one aligns with the auto-labeled datasets. This evaluation will show how closely the machine-learned concepts correspond to human-defined code properties, which in turn makes the models more interpretable.
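The labeling and alignment steps described above can be illustrated with a minimal sketch. It covers only the regular-expression labeler, and it assumes one simple way of scoring alignment: the fraction of tokens in a cluster that share the cluster's most common label. The label set, the function names (label_token, cluster_alignment, is_aligned), and the 0.9 threshold are all hypothetical choices made for illustration, not the project's actual definitions.

```python
import re
from collections import Counter

# Hypothetical label set; the project's real code properties may differ.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "import", "class"}

# Ordered (label, pattern) pairs; each pattern must match the whole token.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+(\.\d+)?")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("OPERATOR", re.compile(r"[+\-*/=<>!%]+")),
]


def label_token(token):
    """Assign a coarse code-property label to a single token."""
    if token in KEYWORDS:
        return "KEYWORD"
    for label, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "OTHER"


def cluster_alignment(tokens):
    """Return (majority_label, share of tokens that carry that label)."""
    if not tokens:
        return None, 0.0
    counts = Counter(label_token(t) for t in tokens)
    label, count = counts.most_common(1)[0]
    return label, count / len(tokens)


def is_aligned(tokens, threshold=0.9):
    """Treat a cluster as aligned if its majority label covers enough tokens.

    The threshold is an assumed value, not one taken from the project.
    """
    _, score = cluster_alignment(tokens)
    return score >= threshold


if __name__ == "__main__":
    # A toy cluster of tokens, standing in for one latent concept.
    cluster = ["x", "y", "count", "1"]
    print(cluster_alignment(cluster))
    print(is_aligned(cluster))
```

In this sketch a cluster counts as a human-interpretable concept only when most of its tokens carry the same label. An AST-based or LLM-based labeler could replace label_token without changing the scoring function.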


Team Members

Manjul Balayar, Engineer, Software Engineering
Rayne Wilde, Engineer, Software Engineering/Data Science
Sam Frost, Engineer, Software Engineering
Akhilesh Nevatia, Engineer, Software Engineering
Ethan Rogers, Engineer, Electrical Engineering


Final Deliverables

Demo Video
Design Document
IRP Presentation
Poster