Application engineer
Staying humble, building everything
Because I don't know, I can create everything
Latest Publications
5 / 5 Items
Controllable Text-To-Speech with FastSpeech2
01 A summary of research on enhancing control over synthesized voice timbre, tone, and emotion while maintaining naturalness.
This is the final write-up of my B.E. graduation project. The code is open source at: https://github.com/aucki6144/ctts
Introduction
Speech synthesis, the task of converting text into natural-sounding speech, is a central topic in AI, NLP, and speech processing. While recent advancements in deep learning have significantly improved the naturalness and robustness of speech synthesis, there are still challenges in controlling the nuances of synthesized speech, such as tone, pitch, and emotion. This b...
Reading Notes: Deduplicating Training Data Makes Language Models Better
02 An introduction to Resilient Distributed Datasets (RDDs) in PySpark, covering lineage, transformations, and actions.
Reading notes - Deduplicating Training Data Makes Language Models Better
1 Introduction & Motivation
A key factor behind the recent progress in natural language processing (NLP) and large language models (LLMs) is the rapid growth in the scale of both model parameters and datasets. This pushes us toward datasets crawled entirely from the web, whose quality is not guaranteed. It's too expensive to perform manual review. It's impossible for us to regulate and design the datasets to guarant...
Algorithm Design for Big Data
03 Parallel algorithm design patterns including Prefix Sums and Sample Sort using Spark's mapPartitions.
Spark: Algorithm Design for Big Data
Embarrassingly parallel problems
......
Spark: Job Scheduling and Locality
04 How Spark schedules jobs, stages, and tasks based on data locality and memory management.
Spark: Job Scheduling
Operations on RDDs
......
Spark: Data Partitioning Strategies
05 Understanding Hash vs Range partitioning to optimize parallelism and balance workloads in Spark RDDs.
Spark: Partitions
RDDs are stored in partitions. The programmer specifies the number of partitions for an RDD (a default value is used if unspecified). More partitions mean more parallelism but also more overhead.
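As a rough illustration of the hash and range partitioning mentioned in the summary, here is a minimal plain-Python sketch (not the Spark API) of how keys might be assigned to a fixed number of partitions and how balanced the result is. The function names, the integer keys, and the range bounds are all hypothetical, chosen only for this example; Spark's own partitioners differ in detail.

```python
from bisect import bisect_left
from collections import Counter


def hash_partition(key, num_partitions):
    # Hash partitioning: the partition index is the key's hash modulo
    # the number of partitions.
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    # Range partitioning: partition i holds keys <= upper_bounds[i];
    # keys above the last bound fall into one final partition.
    return bisect_left(upper_bounds, key)


def partition_sizes(keys, assign, num_partitions):
    # Count how many keys land in each partition index.
    counts = Counter(assign(key) for key in keys)
    return [counts[i] for i in range(num_partitions)]


NUM_PARTITIONS = 4
keys = list(range(100))        # hypothetical integer keys 0..99
upper_bounds = [24, 49, 74]    # hypothetical bounds that define 4 ranges

hash_sizes = partition_sizes(
    keys, lambda k: hash_partition(k, NUM_PARTITIONS), NUM_PARTITIONS
)
range_sizes = partition_sizes(
    keys, lambda k: range_partition(k, upper_bounds), NUM_PARTITIONS
)

# A skewed key set (only small keys) to compare how each scheme copes.
small_keys = [k for k in keys if k < 30]
skewed_hash_sizes = partition_sizes(
    small_keys, lambda k: hash_partition(k, NUM_PARTITIONS), NUM_PARTITIONS
)
skewed_range_sizes = partition_sizes(
    small_keys, lambda k: range_partition(k, upper_bounds), NUM_PARTITIONS
)

print("hash, all keys:   ", hash_sizes)
print("range, all keys:  ", range_sizes)
print("hash, small keys: ", skewed_hash_sizes)
print("range, small keys:", skewed_range_sizes)
```

With these made-up keys, both schemes split the full key set evenly across the four partitions. The key set restricted to small values stays roughly even under hashing but piles up in the first range under range partitioning, because the bounds were chosen for the full key set.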
- RDDs are stored in partitions. When performing computations on RDDs, these partitions can be operated on in parallel.
- You get better parallelism when the partitions are balanced.
- When RDDs are first created, the partitions are balanced.
- However, partitions may get out of balance after...