Hi, I'm Saurabh Pujar

AI Researcher and Software Engineer

I lead research at the intersection of AI and software engineering, developing machine learning methods to automate code generation, strengthen application security through vulnerability detection, and support developers with intelligent code-analysis tools.


About Me

I am fascinated by how AI can bridge human intent and executable code, whether by translating natural language into production-ready workflows or by establishing benchmarks that test whether code understanding is self-consistent and reliable. I believe in applying intelligent systems to real-world challenges in software engineering, security, and developer productivity.

5+

Years Research

15+

Publications

5

Patents

1000+

Citations


Research Areas

Code Understanding & Generation

Building deep representation and generative models that turn natural-language intent into accurate, maintainable code while enabling smarter search, refactoring, and repair.

Vulnerability Detection & Analysis

Developing data-driven techniques and benchmarks that automatically surface, classify, and help remediate security flaws across large, real-world codebases.

Developer Productivity & Tools

Creating AI-augmented assistants and evaluations that measure, streamline, and elevate everyday software-engineering workflows from coding to deployment.

Selected Publications

Project CodeNet: A Large-Scale AI for Code Dataset

Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

...In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. ...

363 citations

Automated Code generation for Information Technology Tasks in YAML through Large Language Models

Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade, Matt Jones, Alessandro Morari, Ruchir Puri

2023 60th ACM/IEEE Design Automation Conference (DAC)

... The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on the generation of Ansible YAML, a widely used markup language for IT Automation. ...

22 citations

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini

2024 IEEE Symposium on Security and Privacy (SP)

... We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. ...

123 citations

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty

60th Annual Meeting of the Association for Computational Linguistics, 2022

... we design structure-guided code transformation algorithms to generate synthetic code clones and inject real-world security bugs, augmenting the collected datasets in a targeted way. We propose to pre-train the Transformer model with such automatically generated program contrasts to better identify similar code in the wild and differentiate vulnerable programs from benign ones. ...

61 citations

Get In Touch

Let's collaborate on AI research!

I'm always interested in research collaborations, academic discussions, and opportunities to advance the field of artificial intelligence. Feel free to reach out!

saurabh.s.pujar@gmail.com
IBM Research
Greater NYC Area, NY
