| 
               
                Jun Wang
               
              Jun is a Senior Machine Learning Engineer at Salesforce Research, working on multimodal LLMs. He earned his Ph.D. from the University of Maryland, College Park, where he was honored to be co-advised by Prof. Larry S. Davis and Prof. Joseph F. JaJa. His research focuses primarily on multimodal learning, object detection, and 3D scene understanding. He is currently on the job market.
               
              
              Recently, Jun has been fortunate to work with Dr. Kishore Prahallad (Apple), Dr. Mingfei Gao and Dr. Ran Xu (Salesforce Research), and Dr. Siheng Chen (Mitsubishi Electric Research Labs). Prior to that, he obtained his M.S. degree in Electrical and Computer Engineering from the University of Michigan, Ann Arbor in 2017 and his B.S. degree from Beijing Institute of Technology, China in 2015.
              
              
                Email: junwong [AT] terpmail [DOT] umd [DOT] edu
               
              
                
                
                Google Scholar / Semantic Scholar / DBLP / LinkedIn / GitHub
               
             | 
            
               
             | 
           
         
        
            
            | 
              News
             | 
           
        
        
        
            -  [Jun. 2025][New] BLIP3-KALE won Best Paper at the Synthetic Data for Computer Vision Workshop, CVPR 2025.
            
-  [Dec. 2024] ProVision is now released. Excited to contribute to our open-source instruction data generation pipeline for training multimodal LLMs.
            
-  [Aug. 2024] The xGen-MM (BLIP-3) model is officially released. Thrilled to contribute to advancing open-source multimodal LLMs. Don't miss our BLIP3-tailored datasets, BLIP3-OCR-200M and BLIP3-Grounding-50M. Check them out!
            
-  [Aug. 2024] xGen-VideoSyn-1 is now available! Excited to contribute to our open-source video generation model. Stay tuned for the upcoming release!
            
-  [Feb. 2024] Started working as a senior machine learning engineer on GenAI at Salesforce Research, Palo Alto, CA.
            
-  [Dec. 2023] Two patent applications for scene flow estimation in autonomous driving are filed.
            
-  [July 2023] Started working as a senior machine learning engineer on automated driving at Qualcomm, Novi, MI.
            
-  [Mar. 2023] Their work A2Summ, on multimodal summarization, is accepted by CVPR 2023. Hello Vancouver!
            
-  [Sept. 2022] Their work TAG, a generic text-aware question-answer generation approach for Text-related VQA, is accepted by BMVC 2022.
            
-  [Aug. 2022] Their work NAPL, a novel prototype learning paradigm for 3D LiDAR point cloud semantic segmentation, will be presented at the Computer Vision for Metaverse Workshop, ECCV 2022.
            
-  [June 2022] The work ESSumm, an unsupervised speech summarization framework employing Wav2Vec, is accepted by INTERSPEECH 2022. Annyeong haseyo, Incheon.
            
-  [Apr. 2022] A patent application for motion prediction in autonomous driving is filed.
            
-  [Apr. 2022] Their paper "PointMotionNet: Point-Wise Motion Learning for Large-Scale LiDAR Point Clouds Sequences" will be presented at the Workshop on Autonomous Driving (WAD), CVPR 2022. Where y'at, New Orleans.
            
-  [Jan. 2022] The code for M3DETR is released. He gave a presentation of M3DETR at WACV 2022, Waikoloa, Hawaii.
            
-  [Nov. 2021] Their work PointMotionNet, a framework for 3D motion learning on LiDAR point clouds, achieves 4th place out of 85 teams on the leaderboard of SemanticKITTI Multiscan Semantic Segmentation.
            
-  [Aug. 2021] Passed his Ph.D. research proposal examination and advanced to candidacy.
            
-  [July 2021] Their paper "M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers" is accepted by WACV 2022 in the first round. Aloha, Hawaii.
 
            -  [May 2021] Started his machine learning internship with Dr. Kishore Prahallad at Apple, Cupertino, CA.
            
-  [Feb. 2021] Started his research internship with Dr. Mingfei Gao and Dr. Ran Xu at Salesforce Research, Palo Alto, CA.
            
-  [Nov. 2020] A manuscript on 3D motion learning in LiDAR point clouds is under review. Fingers crossed.
            
-  [Sept. 2020] Started his research internship with Prof. Siheng Chen (now at Shanghai Jiao Tong University) at Mitsubishi Electric Research Labs, Cambridge, MA.
            
-  [July 2020] Their paper "InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling" is accepted by ECCV 2020.
 
                             
         
        
        
        
          
               
            
           | 
      
            
              xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
            
             
            Le Xue*, Manli Shu*, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
             
            arXiv, 2024
                       
            arXiv / code / dataset / VentureBeat coverage
            
            Open-source multimodal large language models (MLLMs).
           | 
         
      
          
               
            
           | 
      
            
              ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
            
             
            Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu
             
            arXiv, 2024
                       
            arXiv / code / dataset / VentureBeat coverage
            
            A scalable system generating 10M+ vision-centric instructions, improving multimodal benchmark performance by 8%.
           | 
         
          
               
            
           | 
      
            
              xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
            
             
            Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
             
            arXiv, 2024
                       
            arXiv / code
            
            A text-to-video (T2V) model leveraging VideoVAE compression and a Diffusion Transformer.
           | 
         
        
          
               
            
           | 
          
            
              Align and Attend: Multimodal Summarization with Dual Contrastive Losses
            
             
            Bo He, Jun Wang, Jielin Qiu, Trung Bui, Abhinav Shrivastava, Zhaowen Wang
             
            CVPR, 2023
                       
            arXiv / code / project / bibtex
            
            Multimodal summarization that summarizes video frames and text sentences with time correspondence.  
           | 
         
        
          
               
            
           | 
          
            
              TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
            
             
            Jun Wang, Mingfei Gao, Yuqian Hu, Ramprasaath R. Selvaraju, Chetan Ramaiah, Ran Xu, Joseph F. JaJa, Larry S. Davis
             
            BMVC, 2022
                       
            arXiv / code / poster / bibtex
            
            The first generic text-aware question-answer generation approach for Text-related VQA. 
           | 
         
        
          
               
            
           | 
          
            
              ESSumm: Extractive Speech Summarization from Untranscribed Meeting
            
             
            Jun Wang
             
            INTERSPEECH, 2022
             
            arXiv / code / slides / bibtex
            
            The first automatic speech summarization system with Wav2vec 2.0. 
           | 
         
        
          
               
            
           | 
          
            
              PointMotionNet: Point-Wise Motion Learning for Large-Scale LiDAR Point Clouds Sequences
            
             
            Jun Wang*, Xiaolong Li*, Alan Sullivan, Lynn Abbott, Siheng Chen
             
            * denotes equal contribution.
             
            WAD, CVPR, 2022
             
            arXiv / bibtex
            
            3D motion learning with a novel point-based spatiotemporal convolution operation module.  
           | 
         
        
          
               
            
           | 
          
            
              M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
            
             
            Jun Wang*, Tianrui Guan*, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry S. Davis, Dinesh Manocha
             
            * denotes equal contribution.
             
            WACV, 2022
             
            arXiv / code / slides / bibtex
            
            A multi-representation, multi-scale, mutual-relation 3D object detector with transformers.
           | 
         
        
          
               
            
           | 
          
            
              InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling
            
             
            Jun Wang*, Shiyi Lan*, Mingfei Gao, Larry S. Davis
             
            * denotes equal contribution.
             
            ECCV, 2020
             
            arXiv / slides / bibtex
            
            3D object detection with an effective dynamic attention module.
           | 
         
         
  
        
        
          
               
            
           | 
          
              RTL Design of R10K Out-of-Order 2-way Superscalar Processor with Simultaneous Multithreading
            
             
            Ruobai Feng, Wang Cao, Jun Wang, Yujun Yan, Jiapeng Zhao
             
            EECS 470 Computer Architecture, 2016
                       
            
            A two-way superscalar SMT processor design based on the MIPS R10K out-of-order execution architecture.
           | 
         
        
          
               
            
           | 
          
              Design and Layout of a 16-bit RISC Pipelined Processor
            
             
            Farzad Asgarian, Harsha Chawla, Isaac Jarman, Cody Piekarz, Jun Wang
             
            EECS 427 VLSI Design I, 2015
                       
            
            A baseline processor design with a customized Kogge-Stone adder, based on a 16-bit RISC architecture and using IBM's 130nm CMOS process.
           | 
         
        
         
        
        
        
            -  Program Committee: AAAI'25
 
            -  Reviewer: CVPR, ICCV, ECCV, ICML, NeurIPS, ICLR, AAAI, BMVC, WACV, ACM MM
 
            
            
            -  Student Volunteer: INTERSPEECH'22
 
         
    
        
        
        
             -  [2023] Outstanding Overseas Student Scholarship - Government of China
 
             -  [2022] International Conference Student Support Award - University of Maryland
 
             -  [2022] CVPR Travel Grant 
 
             -  [2022] Jacob K. Goldhaber Travel Grant - University of Maryland
 
             -  [2019] Teaching Assistant Training and Development (TATD) Fellow - University of Maryland
 
             -  [2015] Outstanding Graduates - Beijing Institute of Technology 
 
             -  [2014] Honorable Mention - Mathematical Contest in Modeling 
 
         
    
        
        
        
      
   |