Bibliographic record
- Title: Assessing Large Language Models for Type Inference in Python on a Real-World Dataset / by Rashida Bharmal ; thesis supervisors: Prof. Eric Bodden & Jun.-Prof. Mohammed Soliman
- Author: Rashida Bharmal
- Reviewers
- Published
- Extent: 1 online resource (vi, 50 pages) : illustrations, diagrams
- University thesis: Universität Paderborn, master's thesis, 2025
- Note: Date of submission: 28.03.2025
- Date of submission: 28.3.2025
- Language: English
- Document type: Master's thesis
- Subject headings (GND)
- URN
- DOI
Abstract
Python’s dynamic type system offers flexibility but often leads to runtime errors and reduced maintainability in large-scale software systems. While optional type annotations (PEP 484) help mitigate these issues, they are inconsistently adopted across real-world codebases. To address this gap, recent studies have explored the use of Large Language Models (LLMs) for type inference, showing promising results on micro-benchmarks. However, their performance on real-world codebases remains underexplored.

This thesis investigates the effectiveness of LLMs for Python type inference using a real-world dataset. We extend the TypeEvalPy framework by incorporating the ManyTypes4Py dataset, enabling a comprehensive evaluation of LLM performance across frequent, rare, and user-defined types. Two state-of-the-art LLMs, Codestral (22B) and Qwen2.5-Coder (7B), are evaluated on the micro-benchmark using two prompting strategies: mask-based prompting and question-and-answer (QnA) prompting. Furthermore, we apply Parameter-Efficient Fine-Tuning (PEFT) using LoRA to adapt these models to the type inference task.

Our results show that QnA prompting significantly outperforms mask-based prompting on the TypeEvalPy micro-benchmark. Codestral achieves an overall exact-match accuracy of 88.7% with QnA prompting, compared to 67.8% with mask-based prompting; Qwen2.5-Coder improves from 61.5% to 83.6% using the same strategy. Fine-tuning further boosts performance: Codestral improves from 86.4% to 96.9%, and Qwen2.5-Coder from 84.0% to 93.8%. Analysis of frequent and rare types shows that fine-tuning enhances structured type inference while occasionally misclassifying generic types. These findings suggest that LLMs provide a robust solution for type inference in real-world scenarios, though improvements are needed for rare and user-defined types.
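To make the contrast between the two prompting strategies concrete, the following is a minimal illustrative sketch. The snippet and prompt wordings are assumptions for illustration only, not the thesis's actual prompt templates.

```python
# Illustrative sketch only (assumed, simplified prompt templates):
# the two prompting styles for type inference described in the abstract.

snippet = "def greet(name):\n    return 'Hello, ' + name"

# Mask-based prompting: present the code with a masked annotation slot
# (PEP 484 syntax) and ask the model to fill in the type.
masked_code = snippet.replace("def greet(name):", "def greet(name: [MASK]):")
mask_prompt = (
    "Fill in the masked type annotation.\n\n"
    f"{masked_code}\n\n"
    "Answer with the type only."
)

# Question-and-answer (QnA) prompting: leave the code unchanged and ask a
# direct question about the type of a program element.
qna_prompt = (
    f"{snippet}\n\n"
    "Question: What is the type of the parameter 'name' in function 'greet'?\n"
    "Answer:"
)

print(mask_prompt)
print("---")
print(qna_prompt)
```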

