Hardy Chen

I am an incoming PhD student at UCSC, advised by Prof. Yuyin Zhou. My research interest lies broadly in different subareas of Vision and Large Language Models.

I spent 5 wonderful years at The Chinese University of Hong Kong, Shenzhen. From 2023 to 2024, I worked as a full-time research engineer advised by Prof. Benyou Wang. From 2019 to 2023, I studied as an undergrad in School of Data Science, during which I worked shortly on AI security. I had unforgettable experience to work with excellent dudes in DY223, DY224, RB315.

Email / Google Scholar / GitHub /

What's New

[June 4, 2025] I will attend the CHAI2025 workshop!

[Oct 11, 2024] I will attend EMNLP2024 as a studnet volunteer! See y'all in Miami!

[Sept 19, 2024] Two works are accepted to EMNLP2024 main conference, including HuatuoGPT-Vision and my first first-authored paper on LLM Evaluation Bias! It's nice to work with Shunian, Junying, Benyou and other co-authors!

[Aug 28, 2024] Our work Humans or LLMs as the Judge? A Study on Judgement Biases inspires Chatbot Arena to disentangle sytle and substance in ELO scores. Check out their blog!

Selected Works

	MileBench: Benchmarking MLLMs in Long Context Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang COLM2024 Links: [arXiv] [Website] [Dataset] [Code] [Leaderboard] Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing benchmarks often focus on single-image and short-text samples, and when assessing multi-image tasks, they either limit the image count or focus on specific task (e.g time-series captioning), potentially obscuring the performance challenges of MLLMs. To address these limitations, we introduce MileBench, a pioneering benchmark designed to test the MultImodal Long-contExt capabilities of MLLMs. This benchmark comprises not only multimodal long contexts, but also multiple tasks requiring both comprehension and generation. We establish two distinct evaluation sets, diagnostic and realistic, to systematically assess MLLMs' long-context adaptation capacity and their ability to complete tasks in long-context scenarios. Our experimental results, obtained from testing 22 models, revealed that while the closed-source GPT-4o outperforms others, most open-source MLLMs struggle in long-context situations. Interestingly, the performance gap tends to widen with an increase in the number of images. We strongly encourage an intensification of research efforts towards enhancing MLLMs' long-context capabilities, especially in scenarios involving multiple images.
	ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang arXiv Links: [arXiv] [Dataset] [Model] [Demo] [Code] Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.
	Humans or LLMs as the Judge? A Study on Judgement Biases Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang EMNLP2024 Links: [arXiv] Adopting human and large language models (LLM) as judges (a.k.a human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human and LLMs, questioning the reliability of the evaluation results. In this paper, we propose a novel framework that is free from referencing groundtruth annotations for investigating Misinformation Oversight Bias, Gender Bias, Authority Bias and Beauty Bias on LLM and human judges. We curate a dataset referring to the revised Bloom's Taxonomy and conduct thousands of evaluations. Results show that human and LLM judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.
	CMB: A Comprehensive Medical Benchmark in Chinese Xidong Wang, Guiming Hardy Chen*, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li NAACL2024 Links: [arXiv] [Code] Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at this https URL.
	On the Difference of BERT-style and CLIP-style Text Encoders Zhihong Chen, Guiming Hardy Chen*, Shizhe Diao, Xiang Wan, Benyou Wang ACL2023 Findings* Links: [arXiv] [Code] Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, e.g., BERT, one of the representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to studying the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders from three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones for general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for the cross-modal association, which is more similar to the senses of humans.
	Phoenix: Democratizing ChatGPT across Languages Zhihong Chen, Feng Jiang, Junying Chen, Tiannan Wang, Fei Yu, Guiming Hardy Chen, Hongbo Zhang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Benyou Wang, Haizhou Li arXiv Links: [arXiv] [Code] This paper presents our efforts to democratize ChatGPT across language. We release a large language model "Phoenix", achieving competitive performance among open-source English and Chinese models while excelling in languages with limited resources (covering both Latin and non-Latin languages). We believe this work will be beneficial to make ChatGPT more accessible, especially in countries where people cannot use ChatGPT due to restrictions from OpenAI or local goverments.

Service

2025: ACL2025, NAACL2025, COLM2025, ACL2025 Student Research Workshop

2024: EMNLP2024

Awards

EMNLP2024 Outstanding Reviewer

Other Interests

I practiced Taekwondo from 2008 to 2015 and received 1st Dan Black Belt. I was a badminton college team member during my undergrad. I practiced clarinet for years when I was young but currently I'm more into guitar for its chorus. I am more than interested in piano and music theory but haven't got a chance to learn them formally.

I am an introverted person yet open towards different culture and values. I speak multiple dialects/languages, including English, Mandarin, Teochew, Cantonese and a bit of Malay and Indonesian. Hope to explore more in the future : )

Website template