Language Models Understand Us, Poorly

Jared Moore

December, 2022

Resources

Slides: https://jaredmoore.org/talks/understand_poorly/emnlp2022.html

Paper: https://arxiv.org/abs/2210.10684

Unresolved issues in NLP: Understanding

Michael et al. (2022)

Contents

  1. Views on Understanding
  2. Climbing the Right Hill
  3. Under-specification
  4. Challenges to Scale
  5. Conclusion

(corresponding to slide numbers)

Views

Understanding-as-mapping

  • There is a “barrier of meaning” which separates human from machine understanding (Bender et al. 2021).
  • Syntax is separate from semantics.

Understanding-as-reliability

  • No distinction between human and machine understanding.
  • Models will close that gap soon (Agüera y Arcas 2022).
  • Scale is paramount (Chowdhery et al. 2022; Kaplan et al. 2020).

Understanding-as-representation

  • There’s a continuum of understanding…
  • but it depends on demonstrating the same skills.
    • (Language models have a “sorta” comprehension; they perform well in some domains (Dennett 2017).)

Climbing the Right Hill

               | Necessary         | Not Necessary
Sufficient     | As-representation |
Not Sufficient | As-reliability    | As-mapping

Are we just not asking right?

“Just because you don’t observe something doesn’t mean you can’t infer anything about it.”

Percy Liang on Twitter

See Michael (2020) for further discussion.

A reproduction of Weizenbaum’s ELIZA

Humans assume a similarity of representation.

  • We can’t make that assumption with our models.
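
As a rough illustration of how little machinery it takes to exploit that assumption, here is a toy ELIZA-style responder. It is only a sketch, not Weizenbaum's original script: the patterns and canned replies are invented for the example. The point is that pure surface pattern matching can feel responsive without any representation of meaning.

    import re
    import random

    # Toy ELIZA-style rules: surface patterns mapped to canned reflections.
    # Illustrative only; not Weizenbaum's original script.
    RULES = [
        (r"i need (.*)", ["Why do you need {0}?", "Would it really help you to get {0}?"]),
        (r"i am (.*)", ["How long have you been {0}?", "Why do you think you are {0}?"]),
        (r"(.*) mother(.*)", ["Tell me more about your mother."]),
        (r"(.*)", ["Please go on.", "Can you say more about that?"]),
    ]

    def eliza_reply(utterance: str) -> str:
        """Reply by matching surface patterns only; no semantics involved."""
        text = utterance.lower().strip(".!?")
        for pattern, responses in RULES:
            match = re.match(pattern, text)
            if match:
                return random.choice(responses).format(*match.groups())
        return "Please go on."

    print(eliza_reply("I am sad."))  # e.g. "How long have you been sad?"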

Under-specification

Uni-modal underspecifications

  • Entailments
    • If the artist slept, the actor ran. Yes or no, did the artist sleep? (See the note below.)

  • Copying style and answering
    • t.w.o.p.l.u.s.t.w.o.e.q.u.a.l.s.w.h.a.t.?

  • Long context window; truthiness

See Brown et al. (2020) and Chowdhery et al. (2022); on limitations, McCoy et al. (2021) and McCoy, Pavlick, and Linzen (2019).
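
A note on the entailment example: the prompt is under-specified because a conditional licenses neither its antecedent nor the antecedent's negation,

    (p \rightarrow q) \nvdash p, \qquad (p \rightarrow q) \nvdash \lnot p

so strictly neither "yes" nor "no" follows from "If the artist slept, the actor ran"; a forced yes/no answer reflects a heuristic rather than an entailment.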

Multi-modal underspecifications

“A red cube, on top of a yellow cube, to the left of a green cube”

Vision-and-language models perform no better than chance on such compositional examples (Thrush et al. 2022).

Towards a Similarity of Representation

Or how to correct models’ inductive biases

Social domains

Models are only slightly better than chance at theory of mind (Sap et al. 2022).

  • And we’re only starting to see good tests for the components of moral reasoning (Weidinger, Reinecke, and Haas 2022).

A picture of Kismet (Breazeal 2003).

Generalization

  • By age five, American children have heard between 10 and 50 million words (Sperry, Sperry, and Miller 2019).

  • Embodiment is needed eventually (Lynott et al. 2020; Bisk et al. 2020).

Challenges to Scale

Scale

A comparison of scale and performance

  • Trained on 10 to 100,000 times more words than a child hears (see the estimate below).

Chowdhery et al. (2022)
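
A back-of-the-envelope version of that comparison, assuming it is between PaLM's roughly 780B training tokens (quoted on the next slide) and the 10-50 million words a child has heard by age five:

    # Rough ratio of PaLM's training tokens to a five-year-old's word exposure.
    palm_tokens = 780e9                              # Chowdhery et al. (2022)
    child_words_low, child_words_high = 10e6, 50e6   # Sperry, Sperry, and Miller (2019)
    print(f"{palm_tokens / child_words_high:,.0f}x") # ~15,600x
    print(f"{palm_tokens / child_words_low:,.0f}x")  # ~78,000x

That puts the largest models near the top of the 10 to 100,000 times range; smaller models sit lower in it.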

Scale

“Although there is a large amount of very high-quality textual data available on the web, there is not an infinite amount. For the corpus mixing proportions chosen for PaLM, data begins to repeat in some of our subcorpora after 780B tokens” (Chowdhery et al. 2022) (emphasis added)

The whole earth

The whole earth?

Conclusion

Sorta Understands != Understands

  • “computers which understand”
    • probably false advertising
    • maybe theory

Pragmatic NLP

Probe model internals.

  • Black-box (behavioral tests) and white-box (internal probes) them; a minimal sketch follows.
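
A minimal sketch of what white-boxing can look like: a linear probe asking whether some property is decodable from a layer's hidden states. The arrays below are synthetic stand-ins; in practice the features would be activations extracted from the model under study.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-ins for per-example hidden states and a binary property.
    rng = np.random.default_rng(0)
    hidden_states = rng.normal(size=(1000, 768))
    labels = (hidden_states[:, 0] > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("probe accuracy:", probe.score(X_test, y_test))
    # High held-out accuracy: the property is (linearly) decodable from this layer.
    # Chance-level accuracy: no evidence it is represented there, at least linearly.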

Add more of human language.

  • E.g. intersubjective, multi-agent environments.
  • CHILDES database of childhood language learning (MacWhinney 2000; Linzen 2020).

Measure what models can learn.

  • E.g. how many different streams of data (or “world scopes” (Bisk et al. 2020)) must we add to models to make them more reliable?

Questions?

Email me at .


Works Cited

Agüera y Arcas, Blaise. 2022. “Do Large Language Models Understand Us?” Dædalus 151 (2). https://doi.org/10.1162/DAED_a_01909.
Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. Virtual Event Canada: ACM. https://doi.org/10.1145/3442188.3445922.
Bisk, Yonatan, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, et al. 2020. “Experience Grounds Language.” arXiv:2004.10151 [Cs], November. http://arxiv.org/abs/2004.10151.
Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. First edition. Oxford: Oxford University Press.
Breazeal, Cynthia. 2003. “Emotion and Sociable Humanoid Robots.” International Journal of Human-Computer Studies 59 (1-2): 119–55. https://doi.org/10.1016/S1071-5819(03)00018-1.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [Cs], June. http://arxiv.org/abs/2005.14165.
Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv:2204.02311 [Cs], April. http://arxiv.org/abs/2204.02311.
Dennett, Daniel C. 2017. From Bacteria to Bach and Back: The Evolution of Minds. WW Norton & Company.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv:2001.08361 [Cs, Stat], January. http://arxiv.org/abs/2001.08361.
Linzen, Tal. 2020. “How Can We Accelerate Progress Towards Human-Like Linguistic Generalization?” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5210–17. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.465.
Lynott, Dermot, Louise Connell, Marc Brysbaert, James Brand, and James Carney. 2020. “The Lancaster Sensorimotor Norms: Multidimensional Measures of Perceptual and Action Strength for 40,000 English Words.” Behavior Research Methods 52 (3): 1271–91.
MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates.
McCoy, R. Thomas, Ellie Pavlick, and Tal Linzen. 2019. “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference.” arXiv:1902.01007 [Cs], June. http://arxiv.org/abs/1902.01007.
McCoy, R. Thomas, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2021. “How Much Do Language Models Copy from Their Training Data? Evaluating Linguistic Novelty in Text Generation Using RAVEN.” arXiv:2111.09509 [Cs], November. http://arxiv.org/abs/2111.09509.
Michael, Julian. 2020. “To Dissect an Octopus: Making Sense of the Form/Meaning Debate.” Julian Michael. https://julianmichael.org/blog/2020/07/23/to-dissect-an-octopus.html.
Michael, Julian, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, et al. 2022. “What Do NLP Researchers Believe? Results of the NLP Community Metasurvey.” arXiv. https://doi.org/10.48550/arXiv.2208.12852.
Sap, Maarten, Ronan LeBras, Daniel Fried, and Yejin Choi. 2022. “Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs.” arXiv. http://arxiv.org/abs/2210.13312.
Sperry, Douglas E., Linda L. Sperry, and Peggy J. Miller. 2019. “Reexamining the Verbal Environments of Children From Different Socioeconomic Backgrounds.” Child Development 90 (4): 1303–18. https://doi.org/10.1111/cdev.13072.
Thrush, Tristan, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. 2022. “Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality.” arXiv:2204.03162 [Cs], April. http://arxiv.org/abs/2204.03162.
Weidinger, Laura, Madeline G. Reinecke, and Julia Haas. 2022. “Artificial Moral Cognition: Learning from Developmental Psychology.” Preprint. PsyArXiv. https://doi.org/10.31234/osf.io/tnf4e.