About Me

Data scientist holding a Master's in Data Science from Rutgers University with expertise in Data Engineering, Machine Learning, and Statistical Analysis. Proficient in constructing robust data pipelines, optimizing advanced models, applying NLP techniques, creating chatbots, and developing impactful recommendation systems. A dedicated and experienced researcher with a sharp analytical mindset, excelling in complex project execution, and leveraging a diverse skill set that includes Python, cloud technologies, and cutting-edge data science tools.

Domain Expertise

Data Analytics, Data Engineering, Data Mining, Machine Learning, Deep Learning, Statistics, A/B Testing, MLOps, Natural Language Processing, Computer Vision, Generative AI

Proficient

Python, SQL, Pandas, NumPy, Scikit-Learn, TensorFlow, PyTorch, NLTK, OpenCV, LangChain, Streamlit, FastAPI, Matplotlib, GCP (Certified Data Engineer), BigQuery, Databricks, Collibra, ETL, Informatica Cloud (IICS)

Worked with

R, Unix, MongoDB, PySpark, SageMaker, Docker, MLflow, Airflow, Kafka, AWS, Power BI, Tableau

  • Oct 2022 - Dec 2023
    Graduate Research Assistant at Rutgers University

    Key Tasks:
  • Analyzed media datasets using topic modeling
  • Performed statistical analysis on gene and gene expression datasets

  • May 2023 - Aug 2023
    Machine Learning Engineer at Omdena

    Key Tasks:
  • Built a recommendation system based on user preferences
  • Improved user experience by incorporating hybrid models

  • Aug 2020 - Jun 2022
    Data Engineer at Deloitte Consulting

    Key Tasks:
  • Built ETL pipelines for various sources using custom and built-in drivers
  • Automated tasks using Python and Unix scripts that make API calls

  • Sep 2022 - May 2024

    Master of Science in Data Science from Rutgers University, New Brunswick

    Coursework: Data Structures & Algorithms, Regression and Time Series Analysis, Probability & Statistical Inference, Data Mining, Statistical Modeling and Computing, Database Management Systems, Data Wrangling, Natural Language Processing, Deep Learning

    GPA: 3.65/4


  • Aug 2016 - May 2020

    Bachelor of Technology in Electrical Engineering from National Institute of Technology Kurukshetra

    Relevant Coursework: Data Structures & Algorithms, Database Management Systems, Computer Networks, Computer Architecture, Operating Systems

    GPA: 8.47/10

Work Experience

Data Engineer

  • Led a team of 4 to integrate Databricks with the Collibra Catalog using the Simba Spark JDBC driver. Automated metadata ingestion for 260+ schemas using Python scripts and Tidal jobs, reducing manual effort by 99%

  • Developed optimized SQL queries to extract data from MySQL, Oracle, and PostgreSQL databases, and exposed the results as APIs through a MuleSoft proxy for easier access and integration

  • Engineered an ETL pipeline that automates the ingestion of Qlik Sense data into Informatica Cloud (IICS) and Collibra via REST API calls from Unix scripts, and parallelized the ingestion to cut processing time by 66%

  • Facilitated and actively contributed to the successful execution of 5 production releases, demonstrating expertise in deploying and maintaining data engineering solutions to ensure operational stability

Machine Learning Engineer

  • Collected and curated crowdsourced data from 75+ contributors, conducted EDA, and applied data cleaning and statistical and probabilistic imputation techniques to improve data quality, achieving a 98% completion rate

  • Engineered a recommendation system that combined content-based filtering, collaborative filtering, and NLP techniques. Explored matrix factorization and neural networks to achieve a 94% F1-score

  • Implemented an ensemble model that improved click-through rates by 33%

  • Deployed the models to AWS with Streamlit and FastAPI as a proof of concept (POC) for users to interact with and test




Graduate Research Assistant

  • Analyzed media data using topic modeling to uncover hidden narratives. Achieved a 10x clustering speedup using FAISS for KMeans, evaluated DBSCAN and DP-Means, and ultimately chose BERTopic, which uncovered 150+ clusters

  • Improved data integrity by standardizing data across 3 sources and preprocessing with NLP techniques, reducing data volume by 30% and removing 99% of URLs, HTML tags, and emojis

  • Analyzed gene sequence and gene expression datasets, designed and implemented algorithms to calculate gene-drug interactions, and identified the top 10% of cases of interest based on statistical analysis and bioinformatics

  • Implemented parallel processing across 64 cores of a remote server, resulting in an 80% reduction in execution time




    Projects

    Chat with your PDF(s)

    Developed a chatbot with a dynamic Streamlit interface where users can ask natural-language questions and gain insights about the input PDF file(s). Used OpenAI’s embeddings to index the files in a FAISS vector store, implemented a RAG pipeline in LangChain with ChatGPT (GPT-3.5), and used prompting techniques to reduce token usage.

    Human Emotion Detection

    Implemented image classification using modern CNN architectures such as ResNet50 and EfficientNetB4, building the code from scratch and fine-tuning the models to improve performance. Explored Transformer-based vision models to obtain optimal results.

    Customer Churn Prediction

    Built a classification model to predict whether a customer is likely to leave the company based on certain user features. Compared models ranging from simple logistic regression to more complex tree-based models such as random forest and XGBoost to identify the best-performing and most suitable one. Applied hyperparameter tuning and SMOTEENN resampling to improve performance by 15%.

    Text Summarization using LLMs

    Fine-tuned pre-trained Large Language Models (LLMs) for text summarization using several fine-tuning techniques. Compared the results against baseline RNN/LSTM language models using established metrics such as ROUGE and BLEU.

    Object Detection System

    Designed and built a real-time object detection system that takes either webcam video or a pre-recorded video as input and identifies the number of objects passing through a predetermined path. Employed the standard YOLOv8 model to achieve optimal results.

    Twitter Search Application

    Built a search application for tweet data using SQL and MongoDB for storage, and made searches 500x faster using caching.


    Contact Me

    rohitmacherla125@gmail.com