• 207 Posts
  • 18 Comments
Joined 1 year ago
Cake day: June 12th, 2023

  • Hey everyone! I’m Taters, one of the friendly admins here. Just a little about me: I’m a guy in my late 30s, a proud dog dad, and while I’ve lived in a half dozen cities across the country, I’m finally settled down out in the Southwest. In my past lives, I’ve been a touring comedian’s personal sound engineer, managed a country music production studio in Nashville, Tennessee, and even created a blockchain-based project that supported food assistance by supplying my state with tons (and I mean TONS) of potatoes (hence the name).

    I’ve been fascinated by AI science and theory ever since I was a kid and read Kurzweil’s “The Age of Spiritual Machines,” which had a huge impact on me. Fast forward a couple of decades, and I’ve developed a passion for everything AI-related, across all sorts of verticals. Lately, I’ve been focusing on prompt engineering with GPT-4 models, and I’ve discovered some incredible techniques thanks to the help of friends.

    I’m currently working on a few different projects in the areas of insurance, publishing and online learning. I’m excited to have an AI community to share all my news with.

    So, if you ever want to chat about AI, don’t hesitate to reach out! I’m always up for discussing ethics, news, tools, or anything else AI-related. Looking forward to getting to know you all, and thanks for being here!

  • Title: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Authors: Lianmin Zheng et al.

    Word Count: Approximately 6,400 words

    Estimated Read Time: 22-24 minutes

    Summary:

    The paper proposes using strong LLMs as judges to evaluate LLM-based chat assistants in a scalable way. The authors examine the use and limitations of LLM-as-a-judge, including position bias, verbosity bias, and limited reasoning ability. They evaluate LLM judges using two benchmarks: MT-Bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.

    The results show that GPT-4 achieves over 80% agreement with human preferences, matching the level of human-human agreement. This suggests that LLM-as-a-judge is a promising alternative to costly human evaluations.

    The authors explore variations of LLM-as-a-judge: pairwise comparison, single answer grading, and chain-of-thought/reference-guided judging. They also examine finetuning a Vicuna base model as a judge.

    The authors release MT-Bench with 80 questions and Chatbot Arena data comprising 30K conversations. They argue for a hybrid evaluation framework combining standardized benchmarks and LLM-as-a-judge evaluations.

    Evaluations of model variants on MMLU, TruthfulQA, and MT-Bench show that these benchmarks capture complementary aspects of model quality, indicating the need for a comprehensive evaluation approach.

    In summary, the paper provides empirical evidence that LLMs can serve as scalable proxies for human preferences in chatbot evaluation. However, further work is needed to mitigate biases and improve LLM judging models.

    Potential Use: LLM-as-a-judge can enable fast, automated assessments of LLMs’ helpfulness, relevance and instruction-following ability in human-aligned dialogue systems. The proposed benchmarks and finetuning methods can be used to improve existing dialogue models.
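
    To make the pairwise-comparison setting concrete, here is a minimal sketch of an LLM judge built on the OpenAI chat API. The judge model, prompt wording, and verdict parsing are my own illustrative assumptions, not the paper’s exact MT-Bench prompts; swapping the answer order to reduce position bias follows the idea discussed in the paper.

    ```python
    # Minimal pairwise LLM-as-a-judge sketch (illustrative prompts, not the paper's exact setup).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant responses
    to the user question below and decide which one is better.
    Answer with exactly "A", "B", or "tie".

    [Question]
    {question}

    [Response A]
    {answer_a}

    [Response B]
    {answer_b}
    """

    def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4") -> str:
        """Ask a strong LLM which of two answers is better; returns 'A', 'B', or 'tie'."""
        prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic judging
        )
        verdict = resp.choices[0].message.content.strip()
        return verdict if verdict in {"A", "B", "tie"} else "tie"

    def judge_both_orders(question: str, a: str, b: str) -> str:
        """Judge both orderings and only keep a verdict when it is position-consistent."""
        first = judge_pair(question, a, b)
        second = judge_pair(question, b, a)
        swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
        return first if first == swapped else "tie"
    ```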


  • Title: CREPE: Can Vision-Language Foundation Models Reason Compositionally?

    Authors: Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

    Word Count: Approximately 7,600 words

    Estimated Read Time: 25-30 minutes

    Summary:

    The paper introduces CREPE, a benchmark for evaluating the compositional reasoning abilities of vision-language foundation models. Compositionality refers to the ability to understand and generate complex visual scenes or statements by combining simpler parts. The benchmark covers two important aspects of compositionality: systematicity and productivity.

    Systematicity evaluates a model’s ability to systematically recombine known visual concepts in unseen combinations. CREPE includes three systematicity splits based on whether a model’s training data contains both the atomic concepts and their combinations (“Seen Compounds”), only the individual atomic concepts (“Unseen Compounds”), or neither (“Unseen Atoms”). CREPE finds that most models’ performance decreases when evaluating on unseen combinations of concepts, especially for models trained on the larger LAION-400M dataset.

    Productivity evaluates a model’s ability to comprehend visual concepts of increasing compositional complexity. CREPE includes captions ranging from 4 to 12 visual concepts. It finds that most models struggle on captions with higher compositional complexity, with retrieval performance nearing random chance.

    Overall, CREPE finds that vision-language foundation models trained with contrastive loss struggle at compositional reasoning, even when trained on large datasets and using large model architectures. CREPE aims to provide a comprehensive benchmark to track the emergence of compositionality as vision-language models improve.

    CREPE provides large-scale datasets with ground truth image-caption pairs and hard negative captions to evaluate both systematicity and productivity. Hard negatives differ from ground truth captions in minimal ways to isolate model failure modes.

    Researchers can use CREPE to identify gaps in current foundation models relating to compositionality. Improving compositionality could make models more controllable and robust. CREPE’s hard negative generation approach could also be used to improve the training of compositional models.

    CREPE relies on a scene graph representation to define compositional language. The generated hard negatives are noisy, especially swapping and negation foils. Evaluating productivity also relies on generated captions.
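
    To make the retrieval protocol concrete, here is a rough sketch of CREPE-style image-to-text retrieval against hard negative captions using an off-the-shelf CLIP model from Hugging Face. The checkpoint name, captions, and scoring loop are illustrative assumptions on my part, not CREPE’s released evaluation code.

    ```python
    # Sketch of image-to-text retrieval with hard negative captions (CREPE-style evaluation).
    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def retrieval_hit(image_path: str, true_caption: str, hard_negatives: list[str]) -> bool:
        """Return True if the ground-truth caption outscores every hard negative for the image."""
        image = Image.open(image_path)
        captions = [true_caption] + hard_negatives
        inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            scores = model(**inputs).logits_per_image[0]  # image-caption similarity scores
        return int(torch.argmax(scores)) == 0  # index 0 is the ground-truth caption

    # Hypothetical example with an atom-swapped hard negative that differs minimally from the truth.
    hit = retrieval_hit(
        "scene.jpg",
        "a brown dog lying on a blue couch",
        ["a blue dog lying on a brown couch", "a brown cat lying on a blue couch"],
    )
    ```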


  • Title: Machine intuition: Uncovering human-like intuitive decision-making in GPT-3

    Authors: Thilo Hagendorff, Sarah Fabi, and Michal Kosinski

    Word Count: Approximately 10,200 words

    Estimated Read Time: 35-40 minutes

    Summary:

    The paper investigates whether large language models (LLMs) like GPT-3 exhibit behaviors similar to human intuition and cognitive biases. The authors probe various LLMs with the Cognitive Reflection Test (CRT) and semantic illusions, which were originally designed to study intuitive decision-making in humans.

    The results show that early LLMs lack the mathematical abilities and knowledge to perform these tasks. However, as LLMs become more complex, they begin to show human-like intuitive behavior and make the same errors as humans. GPT-3 in particular exhibits a strong inclination for intuitive responses on the CRT and semantic illusions, responding correctly in only around 10% of cases.

    However, newer LLMs like ChatGPT and GPT-4 overcome these intuitive errors, responding correctly in around 80% and 97% of cases respectively. The authors attribute this to increases in ChatGPT and GPT-4’s reasoning capabilities.

    The authors explore methods to reduce intuitive behavior in GPT-3, such as providing multiple choice options, eliciting deliberate reasoning, and providing training examples. These methods are effective, bringing GPT-3’s performance close to ChatGPT’s.

    The findings suggest that LLMs can develop probability distributions over language that mimic human intuition, even though they lack the cognitive mechanisms underlying it in humans. The authors argue that investigating LLMs with methods from psychology has the potential to reveal otherwise unknown behavior.

    In summary, the paper demonstrates that LLMs gradually develop the ability to make human-like intuitive decisions and errors. However, the newest LLMs seem to overcome these tendencies, suggesting major improvements in their reasoning capabilities. The findings highlight the value of using methods from psychology to study the abilities and behaviors of LLMs.

    The findings could inform the development of LLMs that are designed to avoid intuitive errors and be more robust reasoners. The methods used to study human-like behavior in LLMs could also be applied to new models as they are developed. The results also highlight the need for careful scrutiny of LLMs before deploying them in real-world applications.
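
    As a concrete (and purely illustrative) example of this kind of probing, the sketch below poses a classic CRT-style item with and without a deliberation cue. The model name and prompt wording are assumptions, not the paper’s exact items or settings.

    ```python
    # Sketch of probing an LLM with a CRT-style item, with and without a deliberation cue.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    CRT_ITEM = (
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
        "How much does the ball cost?"
    )  # intuitive answer: 10 cents; correct answer: 5 cents

    def ask(question: str, deliberate: bool = False, model: str = "gpt-3.5-turbo") -> str:
        """Query the model, optionally appending a cue that elicits step-by-step reasoning."""
        prompt = question + ("\nLet's think step by step." if deliberate else "")
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

    print(ask(CRT_ITEM))                   # may reflect the intuitive but wrong answer
    print(ask(CRT_ITEM, deliberate=True))  # eliciting deliberate reasoning tends to help
    ```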


  • Title: Model Sketching: Centering Concepts in Early-Stage Machine Learning Model Design

    Authors: Michelle S. Lam, Zixian Ma, Anne Li, Izequiel Freitas, Dakuo Wang, James A. Landay, and Michael S. Bernstein

    Word Count: Approximately 24,000 words

    Estimated Read Time: 80-90 minutes

    Source Code/Repositories: ModelSketchBook API code - https://github.com/StanfordHCI/ModelSketchBook

    Summary:

    The paper introduces the concept of model sketching for early-stage machine learning model design. Model sketching allows ML practitioners to prototype model behavior through lightweight sketches that focus on high-level concepts relevant to a decision-making task.

    The key ideas are:

    • Concepts: Human-understandable factors that a model reasons over, like “profanity” or “sarcasm”. Concepts serve as functional building blocks of model logic.

    • Zero-shot concept instantiation: Models like GPT-3 and CLIP are leveraged to flexibly and rapidly instantiate concepts without diverting user attention.

    • Sketch models: Composite models that aggregate concept scores, allowing ML practitioners to explore different combinations of concepts.

    The authors implemented model sketching in ModelSketchBook, an open-source Python API. An evaluation with 17 ML practitioners found that model sketching shifted their focus from technical implementation to higher-level concepts. This cognitive shift helped participants explore a broader model design space and identify gaps in data, labels, and problem formulation.

    The paper argues that model sketching can help ML practitioners move beyond reactive fixes and engage in more proactive model design exploration from the start.

    Overall, model sketching allows ML practitioners to rapidly prototype different model design possibilities through lightweight sketches centered on human-understandable concepts. This represents a shift away from technical tunneling towards higher-level conceptual thinking during early model design stages.
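
    For a rough feel of the idea, here is a standalone sketch of zero-shot concept scoring plus aggregation into a “sketch model.” It deliberately does not use the ModelSketchBook API linked above; the function names, prompt, and aggregation are hypothetical simplifications of the workflow described in the paper.

    ```python
    # Standalone illustration of model sketching: score human-readable concepts zero-shot
    # with an LLM, then aggregate the concept scores into a simple "sketch model".
    # This is NOT the ModelSketchBook API; names and prompts here are hypothetical.
    from openai import OpenAI

    client = OpenAI()

    def concept_score(text: str, concept: str, model: str = "gpt-4") -> float:
        """Ask an LLM how strongly a concept applies to a text, on a 0-to-1 scale."""
        prompt = (
            f'On a scale from 0 to 1, how strongly does the concept "{concept}" '
            f"apply to the following text? Reply with a single number.\n\nText: {text}"
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        try:
            return float(resp.choices[0].message.content.strip())
        except ValueError:
            return 0.0  # fall back to 0 if the reply is not a bare number

    def sketch_model(text: str, concepts: list[str]) -> float:
        """A minimal sketch model: average the concept scores into one prediction."""
        scores = [concept_score(text, c) for c in concepts]
        return sum(scores) / len(scores)

    # Hypothetical moderation-style sketch built from two human-understandable concepts.
    prediction = sketch_model("you are such an idiot", ["profanity", "personal attack"])
    ```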


  • Title: GPT Understands, Too

    Authors: Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, Jie Tang

    Word Count: 13,000 words

    Estimated Read Time: 40-45 minutes

    Source Code/Repositories:

    Summary:

    The paper studies GPT-3’s capabilities beyond language generation, finding that GPT-3 has the ability to understand knowledge and commonsense reasoning despite its generative pre-training objective.

    The key findings are:

    1. GPT-3 can perform knowledge verification tasks with high accuracy, detecting factual errors in 95.5% of cases.

    2. GPT-3 can infer correct results from premises in 88.6% of cases on a causal reasoning task.

    3. GPT-3 demonstrates systematicity in its reasoning, generalizing causal rules to novel contexts.

    4. GPT-3 shows dose-response behavior, with performance increasing as the number of evidence sentences increases.

    5. GPT-3’s performance is relatively robust to the number of facts and details in a given context.

    The authors argue that GPT-3’s knowledge and reasoning capabilities emerge from its autoregressive pre-training objective, which implicitly forces the model to capture dependencies between words to predict the next token.

    In summary, the paper provides compelling evidence that large language models like GPT-3 have acquired substantial capabilities beyond text generation, posing new opportunities and challenges for deploying and scrutinizing these powerful systems.

    The findings suggest that generative pre-training objectives can implicitly teach language models to perform tasks like knowledge verification and commonsense reasoning without being optimized for those specific goals, and that large language models may become a promising foundation for building AI applications with more comprehensive capabilities.
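
    As a purely illustrative example of the kind of zero-shot knowledge-verification probing described above (the prompt format and model name are my assumptions, not taken from the paper):

    ```python
    # Sketch of a zero-shot knowledge-verification probe (illustrative, not the paper's setup).
    from openai import OpenAI

    client = OpenAI()

    def verify_statement(statement: str, model: str = "gpt-3.5-turbo") -> bool:
        """Ask the model whether a statement is factually correct; returns True or False."""
        prompt = f'Is the following statement factually correct? Answer "yes" or "no".\n\n{statement}'
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    print(verify_statement("The Eiffel Tower is located in Berlin."))  # expected: False
    ```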