Alex Jia - Software Engineer & Data Engineer

Software Engineer & Data Engineer passionate about building scalable systems, optimizing data pipelines, and contributing to open-source projects. Currently pursuing Computer Science and Data Science at NYU.

>Experience

Incoming Software Engineer Intern

TikTok

Bellevue, WA

May 2026 — Aug 2026

Design and implement real-time and offline data architecture for large-scale recommendation systems
Build scalable and high-performance streaming Lakehouse systems that power feature pipelines, model training, and real-time inference
Collaborate with ML platform teams to support PyTorch-based model training workflows and design efficient data formats and access patterns for large-scale samples and features

Open Source Contributor

Apache Hudi

Remote

Jan 2026 — Present

Apache Hudi is an open-source data lakehouse platform: it adds incremental pipelines, record-level upserts and deletes, ACID transactions, efficient indexing, and time-travel queries on object stores, and works with engines like Spark and Flink.
Actively contributing to the Hudi project through code, reviews, and collaboration with maintainers and the wider community.

Java

Apache Hudi

Apache Spark

Data Lakehouse

Data Engineer Intern

Trepp

New York, NY

May 2025 — Aug 2025

Implemented a Python-based system, containerized on ECS, to handle AWS SQS messages and process 100K+ address records daily
Optimized Kinesis stream ingestion by integrating Hudi with Apache Spark to write to S3, reducing batch sizes by 40%
Decommissioned the dependency on a third-party ESRI resolution service by prioritizing the in-house Property Search API, resulting in a ~70% match rate and $120K/year cost savings on a 15M-record backlog
Set up 20+ AWS Step Functions orchestrating Glue Crawlers and table creation, enabling Athena queries and QuickSight dashboards on new S3 datasets

Python

AWS

SQS

ECS

Kinesis

Apache Spark

Hudi

Step Functions

Glue

Athena

QuickSight

Open Source Software Engineer

Google Summer of Code - SQLancer

Remote

June 2025 — Sept 2025

Improved enterprise database reliability across 5,000+ systems by upgrading the testing framework to cover PostgreSQL v12-v18
Contributed 20+ JSON features in Java and Scala, improving test coverage for common PostgreSQL JSON operations
Architected CI/CD pipelines with GitHub Actions to automate multi-database test workflows (PostgreSQL, ClickHouse, etc.)
Collaborated with 15+ global open-source contributors via GitHub code reviews, discussions, and documentation updates

Java

Scala

PostgreSQL

GitHub Actions

CI/CD

ClickHouse

Software Engineer Intern

Flowlytics

New York, NY

Dec 2024 — May 2025

Built an assessment platform using Python, NGINX, and Docker that auto-scales (1–5 nodes) to serve 1,000+ concurrent users
Developed an OAuth2 integration in Python covering JWT validation, token refresh, and custom claims mapping across providers
Delivered a set of RESTful APIs with Python Flask that provide search over assessment and audit data, integrated with the internal search engine and backed by PostgreSQL as relational storage; improved efficiency by 25% using database indexing

Python

NGINX

Docker

OAuth2

JWT

Flask

REST API

PostgreSQL

Data Engineer Intern

NYU Berkley Center for Entrepreneurship

New York, NY

Sept 2024 — May 2025

Migrated legacy systems to a modern AWS S3 data lake with automated testing, achieving $15K in cost savings
Built a real-time PySpark & Python ETL pipeline handling 1,000+ events/sec using batch intervals and watermarking
Optimized 2 TB+ of relational and document data using Parquet with Snappy compression, reducing query latency by 30s
Connected the data to Tableau dashboards reflecting business metrics such as monthly revenue statistics and investor contributions

Python

PySpark

AWS

ETL

Parquet

Tableau

>Check Out These Websites

Conventional Commits

A specification for adding human and machine readable meaning to commit messages

https://www.conventionalcommits.org/en/v1.0.0/

Git Internals PDF

A comprehensive PDF explaining the internal workings of the Git source code control system

https://github.com/pluralsight/git-internals-pdf

>Recent Posts

Thursday, April 2, 2026

Getting Started with Open Source Contributions Through GSoC

A guide to participating in Google Summer of Code, including timeline insights, community engagement strategies, and tips for writing a strong proposal that gets accepted.

Open Source

Career Development

Thursday, April 2, 2026

Viewstamped Replication Revisited: A Deep Dive into Distributed Consensus

An exploration of the Viewstamped Replication protocol, covering crash fault tolerance, view changes, recovery mechanisms, and practical optimizations for building reliable distributed systems.

Distributed Systems

Backend Development

Thursday, April 2, 2026

Ionia: High-Performance Distributed Write-Optimized Key-Value Stores

Exploring the Ionia protocol that achieves high throughput and low latency in distributed WO-KV stores by decoupling scalability from locality, enabling parallel execution and scalable reads.

Distributed Systems

Backend Development

Storage Systems