Founding Data Engineer @SoketLabs, Bangalore
Posted in BITS Pilani
₹ 15 - 20 LPA
2-5 Yrs Exp.

Hey Guys, I have an opening to share with you.


Company Description

Soket Labs is an AI research firm with a vision to advance AI towards ethical AGI. We are building a modular GenAI platform designed to accelerate model development and deployment, ensure scalability, enhance team efficiency, and mitigate ethical and operational risks in AI applications. The stack will be used to train a multilingual foundation model that embeds global knowledge with balanced representation of major Indian languages.


Objective

The incumbent will be instrumental in establishing and overseeing the data engineering frameworks necessary for preparing language datasets, focusing on an extensive open corpus of code and various Indian languages. These datasets will be pivotal for training large language models.


Key Responsibilities

1. Data Pipeline Development: Design and implement robust data pipelines using PySpark to ingest and process large datasets efficiently.

2. Data Cleaning and Deduplication: Apply advanced techniques to clean data, including deduplication and the removal of irrelevant or unwanted elements, to enhance data quality.

3. Data Preparation for Language Models: Prepare and structure datasets specifically tailored for training foundation language models, ensuring the inclusion of diverse Indian languages.

4. Quality Assurance: Implement stringent quality control measures to ensure the integrity and reliability of data.

5. Collaboration and Leadership: Work collaboratively with data scientists, language experts, and other stakeholders, providing technical leadership and guidance.


Essential Qualifications

1. Educational Background: A minimum of a Bachelor’s degree in Computer Science, Data Science, or a related field. A Master’s degree or higher is preferable.

2. Technical Expertise: Proficiency in PySpark and other relevant data processing frameworks. Demonstrable experience in handling large datasets is essential.

3. Language Proficiency: Familiarity with Indian languages and their linguistic nuances is highly desirable.

4. Experience in Data Engineering: Proven experience in data pipeline creation, data cleaning, and preparation, specifically for language model training.

5. Analytical Skills: Strong analytical skills with the ability to solve complex problems and adapt to changing requirements.


Desirable Skills

- Experience with machine learning and natural language processing.

- Knowledge of cloud computing and distributed systems.

- Strong communication and project management skills.


This role demands a combination of technical acumen, leadership qualities, and a deep understanding of language data processing, essential for contributing to the advancement of large language model training in India.



If this interests you, please share your resume when applying. Also, feel free to reach out if you have any questions.


Thanks,

Siddharth Mishra
