Sean Benhur J

Machine Learning Practitioner and Lang Tech Enthusiast

A little about me



Hey there! I am Sean from India, a lifelong learner. I am currently pursuing an MSc in Software Systems at PSG College of Arts and Science, Coimbatore. I love NLP, deep learning, and software development in general, and I am open to learning new things 📖❤ and getting the work done! Previously, I interned at NVIDIA Bangalore, where I worked on building GPT-style models for low-resource languages using Megatron-LM. I also interned at the Insight SFI Center for Data Analytics, National University of Ireland, Galway, where I worked on using adapters for offensive language identification, multimodal meme detection, and fine-grained classification. Going forward, I am keen to work further in these areas of machine learning and deep learning.

My Skills

Machine Learning


Natural Language Processing

Tech I'm familiar with

Tech Stack

Master's in Software Systems (5-year Integrated)

PSG College of Arts and Science

2018 - 2023


    Insight SFI Center for Data Analytics, National University of Ireland, Galway

Nov 2021 - April 2022

  • Proposed adapter-based, efficient Transformers for offensive language detection in low-resource and code-mixed languages.
  • Developed a multimodal misogyny meme identification system using late fusion with CLIP and a transformer model.
  • Volunteered for a shared task on "Emotional Recognition in Tamil" for the DravidianLangTech workshop at ACL 2022.

    Applied Research Intern - Low Resource NLP

    NVIDIA - Bangalore

    Dec 2021 - March 2022

  • Created monolingual corpora of about 25 GB each for four under-resourced languages from existing open-source corpora.
  • Developed 345-million-parameter GPT-2 models for four low-resource languages using Megatron-LM and analyzed their performance on downstream tasks against a multilingual model.

    Software Engineering Intern (AI/ML Team)

    Impiger Tech

    May 2021 - Nov 2021

  • Primarily worked on an invoice extraction system; learned common OCR tools such as Tesseract, Camelot, and OCRmyPDF.
  • Researched existing invoice automation techniques and employed an object-detection-based approach that is efficient and lowers annotation cost (other methods require commercial OCR tools for annotation). Learned and applied the state-of-the-art YOLOv5 model in the process.
  • Analyzed transformer-based models for handwritten text extraction.
  • Created rule-based methods for signature and seal detection.
  • Implemented state-of-the-art pretrained models as activities in the ImpigerRPA framework; learned to productionize ML models as web services via Flask.
  • Formulated an approach for volunteer and senior citizen matching using NLP techniques.

    Data Science Intern

    Avishkar Tech Solutions

    September 2020 - December 2020

  • Worked on various stages of data science, from web scraping, data collection, and data wrangling to machine learning in production.
  • Developed projects on NLP, computer vision, and interpretability in machine learning; learned new frameworks and best practices in machine learning.
  • Worked on a remote team and mentored junior team members.

    Virtual Data Science Intern Experience


    August 2020 - October 2020

  • Worked on drawing unique insights from ANZ's customer data.
  • Implemented a predictive analytics system for a regression project to predict customers' salaries.
  • Reduced the RMSE from 1.5 to 0.5, an improvement of more than 50%.

Personal Projects


    Hindi Image Captioning system

    Project Link

  • Novel Hindi image captioning system built entirely with Transformers.
  • Employed ViT as the encoder and GPT-2 as the decoder.

    Cab Fare prediction system

    Project Link

  • Predicted cab fare prices using LightGBM.
  • Deployed on AWS EC2; used DVC2 for data version control.

    Signature Verification System with Siamese Networks

    Project Link

  • Writer-independent offline signature verification system using Siamese networks on the ICDAR 2011 dataset to classify signatures as genuine or forged.
  • Implemented the paper "SigNet: Convolutional Siamese Network for Writer Independent Offline Signature Verification".

Publications


    Offensive Language Detection in Tamil YouTube comments by Adapters and Cross-domain Knowledge Transfer

    Computer Speech and Language Journal, Elsevier (Impact Factor: 1.899)

    Transformers at SemEval-2022 Task 5: A Feature Extraction based Approach for Misogynous Meme Detection

    SemEval@NAACL, Seattle, Washington, 2022

    Findings of the Shared Task on Emotion Analysis in Tamil

    DravidianLangTech Workshop at ACL 2022

    DE-ABUSE@TamilNLP-ACL 2022: Transliteration as Data Augmentation for Abuse Detection in Tamil

    DravidianLangTech Workshop at ACL 2022

    TamilEmo: Finegrained Emotion Detection Dataset for Tamil


    arXiv 2022

    Hypers at ComMA@ICON: Modelling Aggressiveness, Gender Bias and Communal Bias Identification


    ComMA@ICON 2021, Silchar, India

    Pretrained Transformers for Offensive Language Identification in Tanglish


    DravidianCodemix@FIRE 2021, Virtual Event




  • Computer Speech and Language Journal, Elsevier
  • SemEval@NAACL 2022
  • DravidianLangTech@ACL 2022

    Student Coordinator

  • FacePrep, SixPhrase, Placement Cell, PSG College of Arts and Science

Let's Talk


    Want to connect?
    My inbox is always open!