Project Overview


Client: Dr. Ali Jannesari, Advisor: Arushi Sharma


We have extracted feature representations from code-trained large language models (LLMs) in order to explain and interpret them. Clustering these representations yields groups of latent concepts, each capturing a pattern the models have learned from code.

This project focuses on auto-labeling code datasets to capture different code properties. We will use Abstract Syntax Tree (AST) tools like Tree-sitter, regular expressions, and LLM-generated labels to automatically annotate these datasets. Once the datasets are labeled, we will evaluate the concepts learned by the LLMs by measuring their alignment with the auto-labeled datasets. This evaluation will help determine how well the machine-learned concepts correspond to human-defined code properties, enhancing the interpretability of the models.
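As a rough illustration of the idea, the sketch below labels tokens with regular expressions and then scores how well a cluster of tokens lines up with those labels. It is a minimal sketch, not the project's actual pipeline: the rule set `LABEL_RULES`, the helper names `label_token` and `cluster_alignment`, the 0.9 threshold, and the toy clusters are all assumptions made for the example, and the purity score stands in for whatever alignment metric the project uses.

```python
import re
from collections import Counter

# Hypothetical label rules, checked in order; the first match wins.
# A real labeler would draw on a parser such as Tree-sitter as well.
LABEL_RULES = [
    ("keyword", re.compile(r"^(def|return|if|else|for|while|import|class)$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
    ("identifier", re.compile(r"^[A-Za-z_]\w*$")),
    ("operator", re.compile(r"^[-+*/=<>!%]+$")),
]


def label_token(token):
    """Return the first matching label for a token, or "other"."""
    for label, pattern in LABEL_RULES:
        if pattern.match(token):
            return label
    return "other"


def cluster_alignment(cluster_tokens, threshold=0.9):
    """Score a cluster against the auto-generated labels.

    Returns (majority_label, purity, aligned), where purity is the
    fraction of tokens carrying the majority label and aligned is
    True when purity reaches the (assumed) threshold.
    """
    if not cluster_tokens:
        raise ValueError("cluster must contain at least one token")
    labels = Counter(label_token(t) for t in cluster_tokens)
    majority_label, count = labels.most_common(1)[0]
    purity = count / len(cluster_tokens)
    return majority_label, purity, purity >= threshold


# Toy clusters standing in for clusters of latent concepts.
clusters = {
    "c0": ["def", "return", "if", "for"],
    "c1": ["x", "total", "42", "count"],
}
results = {name: cluster_alignment(tokens) for name, tokens in clusters.items()}
```

Here cluster `c0` contains only keywords, so it aligns fully with the `keyword` label, while `c1` mixes identifiers with a number and falls below the assumed threshold.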


Team Members

Manjul Balayar

Engineer

Software Engineering

Rayne Wilde

Engineer

Software Engineering/Data Science

Sam Frost

Engineer

Software Engineering

Akhilesh Nevatia

Engineer

Software Engineering

Ethan Rogers

Engineer

Electrical Engineering


SE 4920

Final Deliverables

Can’t see the video? Watch the demo on YouTube.

Design Document
IRP Presentation
Poster

Status Reports

Report 6
Report 5
Report 4
Report 3
Report 2
Report 1

SE 4910

Weekly Reports

Report 10
Report 9
Report 8
Report 7
Report 6
Report 5
Report 4
Report 3
Report 2
Report 1

Lightning Talks

Lightning Talk 8: Ethics
Lightning Talk 7: Prototyping
Lightning Talk 6: Design Check-In
Lightning Talk 5: Detailed Design
Lightning Talk 4: Project Planning
Lightning Talk 3: User Needs and Requirements
Lightning Talk 2: Problem and Users
Lightning Talk 1: Product Research