Bibliographic record
- Title: Assessing Large Language Models for Type Inference in Python on a Real-World Dataset / by Rashida Bharmal ; thesis supervisors: Prof. Eric Bodden & Jun.-Prof. Mohammed Soliman
- Author: Rashida Bharmal
- Reviewers
- Published
- Extent: 1 online resource (vi, 50 pages) : illustrations, diagrams
- University thesis: Universität Paderborn, master's thesis, 2025
- Note: Date of submission: 28.03.2025
- Date of submission: 28.3.2025
- Language: English
- Document type: Master's thesis
- Subject headings (GND)
- URN
- DOI
Abstract
Python’s dynamic type system offers flexibility but often leads to runtime errors and reduced maintainability in large-scale software systems. While optional type annotations (PEP 484) help mitigate these issues, they are inconsistently adopted across real-world codebases. To address this gap, recent studies have explored the use of Large Language Models (LLMs) for type inference, showing promising results on micro-benchmarks. However, their performance on real-world codebases remains underexplored.

This thesis investigates the effectiveness of LLMs for Python type inference using a real-world dataset. We extend the TypeEvalPy framework by incorporating the ManyTypes4Py dataset, enabling a comprehensive evaluation of LLM performance across frequent, rare, and user-defined types. Two state-of-the-art LLMs, Codestral (22B) and Qwen2.5-Coder (7B), are evaluated on the micro-benchmark using two prompting strategies: mask-based prompting and question-and-answer (QnA) prompting. Furthermore, we apply Parameter-Efficient Fine-Tuning (PEFT) using LoRA to adapt these models to the type inference task.

Our results show that QnA prompting significantly outperforms mask-based prompting on the TypeEvalPy micro-benchmark. Codestral achieves an overall exact-match accuracy of 88.7% with QnA prompting, compared to 67.8% with mask-based prompting; Qwen2.5-Coder improves from 61.5% to 83.6% using the same strategy. Fine-tuning further boosts performance: Codestral improves from 86.4% to 96.9%, and Qwen2.5-Coder from 84.0% to 93.8%. Analysis of frequent and rare types shows that fine-tuning enhances structured type inference while occasionally misclassifying generic types. These findings suggest that LLMs provide a robust solution for type inference in real-world scenarios, though improvements are needed for rare and user-defined types.
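To make the contrast between the two prompting strategies concrete, the following is a minimal illustrative sketch. The snippet and prompt wordings are assumptions for illustration only, not the thesis's actual prompt templates.

```python
# Illustrative sketch only (assumed, simplified prompt templates):
# the two prompting styles for type inference described in the abstract.

snippet = "def greet(name):\n    return 'Hello, ' + name"

# Mask-based prompting: present the code with a masked annotation slot
# (PEP 484 syntax) and ask the model to fill in the type.
masked_code = snippet.replace("def greet(name):", "def greet(name: [MASK]):")
mask_prompt = (
    "Fill in the masked type annotation.\n\n"
    f"{masked_code}\n\n"
    "Answer with the type only."
)

# Question-and-answer (QnA) prompting: leave the code unchanged and ask a
# direct question about the type of a program element.
qna_prompt = (
    f"{snippet}\n\n"
    "Question: What is the type of the parameter 'name' in function 'greet'?\n"
    "Answer:"
)

print(mask_prompt)
print("---")
print(qna_prompt)
```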

