Notes on 'PSYCGD02: Principles of Cognition'

https://www.ucl.ac.uk/lifesciences-faculty-php/courses/viewcourse.php?coursecode=PSYCGD02

This module outlines general theoretical principles that underlie cognitive processes across many domains, ranging from perception to language, to reasoning and decision making. The focus will be on general, quantitative regularities, and the degree to which theories focusing on specific cognitive scientific topics can be constrained by such principles. There will be an introduction on general methods and approaches in cognitive science and some of the problems related to them. Later in the course, some computational approaches in cognitive science will be discussed. There will be particular emphasis on understanding cognitive principles that are relevant to theories of decision making.

What is Cognitive Science?

Brief history.

Notable that the narrative revolves around several key conferences where prominent figures from different fields became aligned.

Bridging Levels of Analysis for Probabilistic Models of Cognition

Levels of models:

Computational - problem and declarative solution eg Bayesian inference
Algorithmic - representation and constructive solution eg message passing
Implementation - physical processes eg neurons

Popular research method is to look at where people diverge from ideal solutions, to figure out what algorithms their mind is using to approximate the solution. But...

Vulnerable to misidentifying the computational problem being solved.
- eg strategies for iterated PD look irrational in single PD
Requires understanding how levels constrain each other
- eg are probalistic models fundamentally incompatible with connectionist models or can we implement one on top of the other?

Rational process models - identify algorithm for approximating probabilistic inference under time/space limits, compare to what we know about mind and behavior.

Bridges computational and algorithmic levels.
Constrains possible algorithms to those that produce ideal behavior in limit.
Explains many cases where individuals deviate but average behavior is close to ideal.

Example - Monte Carlo with small number of samples is tractable. Consistent with:

Averaging multiple guesses from one person increases accuracy (ie contains some independent error)
Recall similar events ~= importance sampling. Predicts availability bias? Incorrect re-weighting?
Order effects (order of information incorrectly affects results of update) ~= particle filter.
Perceptual bistability ~= random walk.
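The first point in the list above can be sketched numerically. This is a toy illustration, not anything from the paper: it assumes each guess is the mean of a handful of independent Gaussian samples around the true value (the function names, noise level and sample counts are all made up). Under that assumption, averaging two guesses should beat a single guess.

```python
import random
import statistics

def sample_based_guess(true_value, noise_sd, n_samples, rng):
    """One 'guess': the mean of a few noisy samples around the true value."""
    samples = [rng.gauss(true_value, noise_sd) for _ in range(n_samples)]
    return sum(samples) / n_samples

def mean_abs_error(true_value, noise_sd, n_samples, n_guesses, trials, rng):
    """Mean absolute error when n_guesses independent guesses are averaged."""
    errors = []
    for _ in range(trials):
        guesses = [sample_based_guess(true_value, noise_sd, n_samples, rng)
                   for _ in range(n_guesses)]
        errors.append(abs(statistics.fmean(guesses) - true_value))
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed values: truth = 100, noise sd = 20, two samples per guess.
single = mean_abs_error(100.0, 20.0, n_samples=2, n_guesses=1, trials=5000, rng=rng)
averaged = mean_abs_error(100.0, 20.0, n_samples=2, n_guesses=2, trials=5000, rng=rng)
print(single, averaged)
```

Real guesses from one person are presumably only partly independent, so the gain would be smaller than this fully independent toy case suggests.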

Some progress in bridging to implementation level eg neural models of importance sampling.

Lecture 1

Cognitive science as reverse engineering - understand how the mind works by trying to build one and see what differs.

Brief history:

Structuralism
- Building blocks are qualia
- Learning via systematic introspection
  - Controlled, replicable experiments
  - But different labs struggled to replicate each other's results
  - Difficult to relate conscious experiences which don’t match qualia (eg non-visual mental models)
  - Vulnerable to observer effects, confirmation, priming, retroactive justification
  - Introspection actually = retrospection
  - eg visual illusions, choice blindness
Behaviorism
- Only talk about observable stimulus and response
- Mostly experiments with animal learning
- eg classical conditioning (event -> event -> response => event -> … -> response)
- eg operant conditioning (action -> +/- => +/- action)
- Reinforcement machines, not reasoning machines
- Doesn’t allow internal state/structure
  - Doesn’t explain how stimulus/response are categorized - theoryless learning
  - But language has infinite structure => can’t be learned from stimulus/response without hyperpriors
  - Rats choose shortest route available, rather than most reinforced route
Cognitive science
- Thought as computation / information processing - data + algorithms
- We needed to invent computation first to be able to have this idea!

Methods:

Behavioral studies
Lesion studies
Single-cell recordings
fMRI
- Neural activity -> blood de-oxygenation -> magnetic interaction changes -> measure with big magnets
- Spatial resolution ~1mm
- Temporal resolution ~seconds
EEG
- Neural activity -> electromagnetic field -> measure with electrodes on scalp
- Can only measure large fields
- Spatial resolution ~poor
- Temporal resolution ~1ms
MEG
- Neural activity -> electromagnetic field -> measure with ?
- Spatial resolution ~better
- Temporal resolution ~1ms
tDCS

(Paired with the more recent failed replication.)

Arguing that non-conscious priming can strongly affect behavior.

Experiment 1:

34 undergrads
Use scrambled sentence test with words that prime rude/polite/neutral
All experimenters blinded
Sent to another room for next test, where waiting confederate is asking experimenter questions
Time how long it takes them to interrupt
Huge effect sizes: almost 2x mean time, <20% vs >60% interruptions within 10min cutoff
No significant differences in reported perceptions of experimenter's politeness
Should we trust reports of politeness? It’s a bad idea to call your professor rude!
Effect sizes are enormous. If a few words can double impatience, what could listening to angry music on the journey do? If we’re so strongly susceptible to small influences, how is there room for personality? How do we have any resistance to marketing?

Experiment 2:

Two successful iterations!
30 + 30 undergrads
Same setup, but priming elderly/neutral (without priming slow)
Timed how long subjects took to walk to the next room
Much smaller effect size - mean 7.30s -> 8.28s
Near identical results in both iterations!
Elderly -> slow? I get thinking about rudeness making me rude, but thinking about elderly making me slow seems a much bigger stretch. Thinking about predators makes me want to eat meat? Get chased by a tiger and stop for a steak sandwich?
Followup experiment with 19 undergrads found only 1 noticed the elderly priming
33 undergrads
Do elderly priming, then Affect-Arousal Scale
Primed group were in slightly more positive mood, but not significantly
Uses this to defend against the idea that they walked slower because sad, but seems bizarre that they are affected so much that they move differently but not so much that they feel differently.

Experiment 3:

41 non-African-American undergrads
Long boring computer task.
Flash either African-American or Caucasian face before each trial.
On 130th claim error and say they have to start again. Experimenter explains error, but is blinded.
Facial expression caught by camera and rated by blinded experimenter.
Only two subjects reported seeing the faces when asked and couldn’t identify which they saw
Both experimenter in room and raters of pictures gave near-identical results!
But no difference in self-reported racial prejudice.

Argues that this works where subliminal adverts for Pepsi don’t because they directly activate traits which contain behavior whereas Pepsi just activates the Pepsi representation. So elderly -> walk slow but Pepsi -/> drink Pepsi? Also because there is some activation energy to get up and buy Coke, whereas they set up situations where the action was already required and the only difference was in accessibility. So priming for hostility will make people more likely to react to an annoying trigger but not to be randomly hostile.

Note that results for behavior here are stronger than their previous results for judgments, but would assume that judgments mediate behavior. But in ex1 there was no effect on perception of the experimenter. And little evidence so far for judgment mediating behavior.

Behavioral Priming: It’s all in the Mind, but Whose Mind?

Failed replication of previous paper.

Reasons to doubt original:

Only two indirect replications.
Small sample sizes.
Evidence from neuroscience suggests that top-down attention and bottom-up saliency are both required for the spreading activations that are used to explain priming.
Experimenter who administered the task was not blinded enough - authors found that it was easy to accidentally glimpse the task sheet (original describes them as being in a closed envelope?)
Measuring time with a stopwatch is susceptible to bias
Not clear exactly what participants were asked afterwards - aware of stimulus vs aware of response vs aware of link.

Experiment 1:

120 (French) undergrads
Task sheets in a closed envelope, opened by subjects
Experimenters are randomly assigned to subjects
Experimenters follow a strict script
Walking speed recorded by infrared beam
No significant difference in walking times
Four students reported being aware of the elderly-ness
Primed group chose pictures of old people significantly more often in forced choice test
No experimenters reported having any specific expectations about subject behavior

Experiment 2:

50 subjects, 10 experimenters
Half of experimenters told that primed participants will walk slower, other half told faster
Experimenters were unblinded
First subject for each experimenter was a confederate who behaved to confirm this expectation
Experimenters measured with stopwatch
For stopwatch times, fast+prime went faster and slow+prime went slower.
For infrared times, slow+prime went slightly slower and fast+prime was same as fast+control.

Most subjects were aware of the prime (but it said 6%…) and are in psych course so might be expected to be suspicious.

Priming via social cues is way more believable to me than priming via word choice. Clear selective pressure for understanding and reacting to social cues.

Lecture 2

Scientific reasoning. Psi hypothesis as running example.

Base-rate fallacy vs significance testing.
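A worked sketch of why the base rate matters here, using Bayes' rule. The priors, power and alpha below are illustrative assumptions, not numbers from the lecture, and the function name is made up.

```python
def prob_effect_real_given_significant(prior, power, alpha):
    """P(effect is real | significant result), by Bayes' rule.

    prior: assumed base rate of true hypotheses of this kind
    power: P(significant | effect is real)
    alpha: P(significant | no effect), ie the false positive rate
    """
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

# Plausible hypothesis: a coin-flip prior (assumed).
plausible = prob_effect_real_given_significant(prior=0.5, power=0.8, alpha=0.05)
# Psi-like hypothesis: a very low prior (assumed 1 in 1000).
implausible = prob_effect_real_given_significant(prior=0.001, power=0.8, alpha=0.05)
print(plausible, implausible)
```

With the same p < 0.05 result, the plausible hypothesis ends up very likely true while the low-prior one is still almost certainly false, which is the gap that significance testing alone hides.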

Successful replication could just mean replicating the mistakes of the original.

In a replication aim to improve on original methods or test some new factor - more likely to be received in good faith and more likely to generate new insight beyond back-and-forth.

A good successful replication can falsify a hypothesis by more accurately identifying the mechanism behind the effect eg previous paper replicated slow walking, but showed that the effect disappeared under proper blinding.

Defenses of priming:

Hidden moderators
Experienced researchers

But:

Then the original effect is less powerful/robust than claimed
Post-hoc reasoning - just a hypothesis until tested
Administering questionnaires is not that hard
Most of the legwork is done by grad students anyway

Try to structure experiments with multiple competing hypotheses where any given result would support some hypothesis and weaken the others.

The Cognitive Neuroscience of Human Memory Since H.M.

Intro:

Current categories used in memory took time to establish - non-obvious.
Specific impairments from lesions rather than general degradation shows that brain is structured and specialized.

Hippocampus:

Hippocampal volume reduction of ~40% is common in memory-impaired patients - may be maximum cell loss ie 60% remaining is just dead tissue.
Damage to other regions can also impair memory.

HM:

Learned a motor skill => memory not one single unit
Reasoning and perception intact => memory not required for reasoning/perception
Could sustain attention and had short-term recall => damaged areas not required for working memory
Had memories from before surgery => long-term storage not in damaged areas

Other patients:

Perceptual priming still works
Can learn in Bayesian fashion, but not explicit memorization
Learned skills are rigid, fail if task is modified

Declarative memory: facts, representations, conscious recall, compare/contrast memories
Non-declarative memory: unconscious performance, black box

Visual perception:

Initially thought to require memory in some cases, but…
Tests accidentally benefit from memory
Often damage to adjacent vision-processing areas
Requires better imaging/locating of lesions to clear up confusion

Immediate and working memory:

HM limited to 6 digit recall, but could maintain memory for 15 mins
=> Immediate memory not time-limited, but maintenance-limited
Demonstrated in other patients - they do fine on tasks where distractions impair healthy subjects (working memory) but fail on tasks where distractions are fine for healthy subjects (long-term memory)
Open question - are there tasks that can be handled by working memory but are still impaired by hippocampus damage?
- Debate around path integration - unclear whether subjects are each using same process and representation

Remote memory:

HM initially had autobiographical memory
(Later in life was limited to factual recall, but later MRIs also showed changes since initial event)
Many other patients also have autobiographical memory.
In patients without, often unclear how far damage extends and whether it might affect other areas

Working theory of long-term memory:

Medial temporal lobes deal with creating and maintaining declarative memories
Sensory memories stored in same area that initially processed them
- Supported by many individual patients eg ‘colorblind painter’ - after damage that removed color perception, could no longer remember colors except declaratively
Recall consists of tying all of these together
- Supported by various fMRI studies
Initially requires hippocampus, but over years memories reorganized, stored more permanently by changes across neocortex that tie these areas together

Structure:

Working theory - organized by semantic categories
- eg JBR lost memory of things identified by attributes but not things identified by function
Recollection = what was in specific memory?
Familiarity = was prompt in any memory?
Hippocampus damaged patients are impaired on both old/new task (familiarity + some recollection) and free recall (recollection)
Combine old/new with recall of which source - patients have fewer instances of familiarity without recall => damage is not recall only

Group studies average out individual variation - allows studying less obvious effects

Finding the engram

Engram def=

Persistence - persistent physical change in brain resulting from specific experience
Ecphory - automatic retrieval in presence of cue
Content - reflects what happened and what can be retrieved
Dormancy - exists (but dormant) even when encoding and retrieval not active

The hunt:

Moving target eg reconsolidation
Many learning-related changes observed in brain eg synaptic, chemical, epigenetic.
Different persistence periods.
Not clear if related to engrams.
Often don’t predict retrieval success.
Dominant theory - stronger connections between neurons that are active during encoding - neuronal ensemble

Sharp-wave ripple events in hippocampus:

Multi-unit recordings in rodents, fMRI in humans
Replay observed during tasks, resting and sleeping
Strength of replay correlates with later retrieval performance
Disrupting waves impairs subsequent expression
Some progress on correlating content
Related sensory cues may trigger replay
Hard to observe dormancy

Tracking:

Non-specific lesions only caused retrieval failure when wide areas damaged => memories are distributed
- But overtrained rats => resilient memories
- But may have accidentally damaged hippocampus with large lesions
Would like to lesion specific ensembles
Tagging shows that some same neurons active during both encoding and retrieval (~10%, >chance, possibly collateral tagging during encoding)
Neurons with higher levels of CREB are more often recruited into ensemble
Neurons with virally over-expressed CREB are more often recruited into ensemble
More CREB -> more excitable
Increasing excitability via various other methods also has same effect
Allocate-and-erase - ablating (killing?) artificially excitable neurons reduces retrieval performance without affecting future learning
- Even if only one brain region is targeted => some parts of ensemble have key roles
Tag-and-erase - tag active neurons, apply inhibitors (how are these targeted?), same effect
Worries about collateral tagging resolved:
- Tag 1st experience
- Silence during 2nd
- 2nd still learned but 1st is gone => not enough collateral tagging to interfere with 2nd task

Activating:

Uncontrolled experiments with focal electrical stimulation during surgery
Tag-and-manipulate / allocate-and-manipulate - re-triggers learned behavior even in unrelated contexts
In both cases, activation seems to spread from initial site to entire ensemble
Can create false associations:
- Tag ensemble in context 1
- Activate in context 2 and shock mice
- Learned fear response in context 1
- No fear response in context 2
Even indirectly
- Tag ensemble in context 1
- Tag ensemble during shock
- Repeatedly activate both in context 2
- Learned fear response in context 2
- No fear response in context 1
Artificial activation paired with chemical that inhibits reconsolidation removes association
So far stimuli limited to fear/reward and response limited to freeze/approach/avoid - need more complex tasks to test episodic memory

Having a lot of trouble with this paper. Needs much more time and depth.

Navigation:

Allocentric / map-based navigation - static representation, navigate by external landmarks
Egocentric navigation / path integration - track motion, estimate path from origin
Hippocampus and entorhinal cortex support both declarative memory and navigation
Semantic memory (data independent of temporal context) ~ allocentric navigation
Episodic memory (first-person experiences in context) ~ egocentric navigation
Semantic memory abstracts repeated patterns in episodic memory ~ allocentric maps abstract repeated paths and observations
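Path integration as described above is essentially dead reckoning. A minimal sketch, assuming motion arrives as (heading, distance) steps with no noise; the function names are hypothetical and this says nothing about how the brain actually does it.

```python
import math

def integrate_path(steps):
    """Accumulate (heading_radians, distance) steps into an (x, y) offset from the origin."""
    x = y = 0.0
    for heading, distance in steps:
        x += distance * math.cos(heading)
        y += distance * math.sin(heading)
    return x, y

def home_vector(position):
    """Heading and distance needed to get back to the origin from position."""
    x, y = position
    return math.atan2(-y, -x), math.hypot(x, y)

# Walk 3 units east, then 4 units north.
steps = [(0.0, 3.0), (math.pi / 2, 4.0)]
position = integrate_path(steps)
heading_home, distance_home = home_vector(position)
print(position, heading_home, distance_home)
```

Note that no landmarks are used anywhere, so any error in a step would accumulate with nothing to correct it, which is what separates this from map-based navigation.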

Implementation possibilities:

Place cells in hippocampus - fire at specific locations in space - possibly encode position or distance?
Grid cells in medial entorhinal cortex - fire in repeating hexagonal pattern in space - different scales - possibly coordinate system?
Head direction cells - ?
Border cells - ?
This is too complicated to skim
Firing patterns are not simple - small changes in environment can result in large change in firing patterns - provides high-dimensional code for storing many different envs?
Insects manage to navigate with much simpler circuits / less storage.
Massive excess capacity in mammals might be related to reuse for different kinds of memory.
Might also enable ‘maps’ of semantic knowledge
cf spatial metaphors in language
Recognition and recall associated with unique firing patterns in that area for each object/event
If episodic memories are stored similarly to paths through environment, might explain time-asymmetry and temporal contiguity (recalling one event makes it easier to recall other events that are nearby in time)
Neuronal assembly sequences:
Patterns of activation in time?
Generated continuously even when environment and body signals are kept constant
Can predict correct/incorrect moves in maze seconds before motor event
Maybe used to organize episodic memory
Are chunked, just like paths and memory
- Limits error in long sequences
- Is chunking like a hash tree?

Some complex ideas about implementation in theta waves that I can’t follow, but apparently explains:

Fine resolution near recalled event/location, coarse structure elsewhere
Limited number of concurrently recalled events/locations
Long-distance jumps between events/locations (related to chunking?)
Compressed recall eg episodic recall tends to focus around highlights/lowlights rather than being linear in time
Why episodic recall plays out in real-time - tied to same mechanism that implements subjective time tracking

Maybe this explains why word2vec works? Are we just reverse-engineering the mind's spatial relationships?

Questions:

Encoding/meaning of firing patterns
Other animals have similar cells but that are not theta modulated - do they have some substitute system?
What does the representation space look like (size, layout)?
How does the cell layout vary between rodents and primates? Do some areas grow out of proportion?
?
Does awareness of recollections require only the prefrontal cortex, or also interaction with the rest of the cerebral cortex?

Place cells, grid cells etc seem to imply that the hippocampus provides navigation. Paper argues that the evidence actually shows that it provides general cognitive maps and that navigation is just one usecase.

Navigation strategies:

Search
- No active goal orientation
- Just movement and goal recognition
Target approaching
- Orienting towards observable goal
Guidance
- Towards pre-calculated goal location
- eg defined by relationship between multiple landmarks
- Requires some spatial computation, and thereafter is just target approaching
Wayfinding
- Recognizing and approaching landmarks
- Joining landmarks into route
- Joining routes together into topological map
Survey / metric navigation
- Embed known routes/maps into common frame of reference
- Supports novel routes, detours, shortcuts
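The wayfinding step (joining routes into a topological map) can be sketched as building a graph from landmark sequences and searching it. The landmark names and function names are invented for illustration; breadth-first search is just one convenient choice of search.

```python
from collections import deque

def build_topological_map(routes):
    """Join routes (lists of landmarks) into an undirected graph of adjacent landmarks."""
    graph = {}
    for route in routes:
        for a, b in zip(route, route[1:]):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def find_route(graph, start, goal):
    """Breadth-first search for a landmark sequence from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Two separately learned routes that share the landmark "park".
routes = [["home", "bakery", "park"], ["park", "school", "library"]]
graph = build_topological_map(routes)
path = find_route(graph, "home", "library")
print(path)
```

The graph only knows which landmarks are adjacent, so it can chain known segments but cannot propose a shortcut across unexplored ground. That needs the common metric frame of survey navigation.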

Rats with hippocampal lesions:

Can handle route navigation (eg turn left at a T) - presumably recognition-triggered
Can handle alternating routes - again presumably recognition-triggered - but not if delays are inserted
Can handle guidance navigation with single route (eg water maze task - memorizing location of invisible platform relative to objects on wall - same starting point)
Can’t handle guidance navigation with multiple routes (eg water maze task with different starting points)
Can’t handle survey navigation (eg maze rotated after learning)
May or may not be able to handle path integration
- (and both rats and humans suck at it anyway)
- In one experiment, humans could but rats couldn’t
- In another, rats were impaired even when visual cues existed => maybe the problem is forgetting where the goal is
- Recording studies haven’t found compelling evidence of hippocampal neurons involved in path integration
- Grid cell firing patterns degrade in the dark => they don’t work well with path integration alone

Humans with hippocampal lesions:

Can navigate by reading a map
Can handle guidance navigation and path integration, so long as fits in working memory
Can describe routes in areas they knew before damage

Working theory:

Hippocampus is required for survey navigation.
But survey navigation is sometimes used even when lower-level strategies would suffice, explaining failures on simpler tasks
- eg when foraging for food in open field, see firing patterns in grid cells et al, see place cells fire in sequence when navigating to regular food drops, see map updates when goal locations change
- eg when disoriented animals reorient, they use local geometry even if prominent landmark is available
Hippocampus probably not required for path integration, except to remember starting point and goal

Evidence that different spatial mappings are used for different tasks within the same environment.

Hippocampus maps abstract spaces:

Rats with lesions can learn direct SR but not transitive
Humans with lesions have higher deficits for order of events than for direct recall
Rats with lesions can recognize odors but not recall order in which they were presented
Interesting signals in human brains when presented with social or associative problems
Similarly to in spatial tasks, some memory tasks engage hippocampal relational processing even when not required (this paragraph seems to contradict itself?)

Imaging suggests that hippocampus is not continuously involved when using cognitive maps in navigation, but only when learning or when planning/altering routes.

Speculation that hippocampus originally evolved for navigation but was co-opted for abstract relationships. (How does hippocampus size vary across species?).

Lecture 3

Divide into declarative vs non-declarative memory no longer seems to be carving at the joints:

HM couldn’t learn maze routes but could learn mirror drawing.
House task - recall vs recognition of complex spatial arrangements (front doors and porches). Suddenly recall tanks for patients.
Patients impaired at statistical learning of relationships and associations.
Mountain task - normal when matching color/time-of-day but impaired when matching arrangement/rotation.
Lesioned rats can detect novel objects and novel placements but can’t pair placement with background context.

Pattern separator vs pattern completer.

Old/new task -> old/similar/new task.
Old people struggle at pattern separation (old vs similar).
CA1 responds to any difference, CA3/DG responds to degree of difference.

Patients learn facts at school, have high IQ and get good grades.

Use fMRI to detect 60° periodicity in humans when navigating => grid cells. Periodicity correlates with success on spatial memory task.
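To see where a 60-degree periodicity would come from, here is a common idealized picture of a grid cell's firing map: the sum of three cosine gratings whose directions are 60 degrees apart. This is a textbook-style toy model assumed for illustration, not the analysis used in the fMRI study, and the names and the test point are arbitrary.

```python
import math

def grid_rate(x, y, spacing=1.0):
    """Idealized grid-cell activity: three cosine gratings 60 degrees apart."""
    k = 4 * math.pi / (math.sqrt(3) * spacing)  # wave number giving peaks `spacing` apart
    total = 0.0
    for deg in (0, 60, 120):
        theta = math.radians(deg)
        total += math.cos(k * (x * math.cos(theta) + y * math.sin(theta)))
    return total

def rotate(x, y, degrees):
    """Rotate the point (x, y) about the origin."""
    t = math.radians(degrees)
    return (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))

point = (0.3, 0.1)  # arbitrary test location
rate_here = grid_rate(*point)
rate_rotated_60 = grid_rate(*rotate(*point, 60))
print(rate_here, rate_rotated_60)
```

Rotating any location by 60 degrees about a firing peak gives the same activity, so the map has six-fold rotational symmetry. The fMRI signal is presumably assumed to inherit that symmetry as a function of movement direction.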

Experiment suggesting that periodicity can be observed even for abstract spaces, by pairing a coordinate system with bird pictures of varying neck and leg length.

Something analogous to place cells for time observed in rats.

Uniting the Tribes of Fluency to Form a Metacognitive Nation

Theory: the difficulty of a cognitive task (from fluent to non-fluent) is used as a meta-cognitive cue that feeds into other judgments via ‘naive theories’ aka heuristics.

Fluency:

Perceptual
- Physical eg illegible text, varying contrast
- Temporal eg briefly flashed images
Memory
- Retrieval eg availability heuristic
- Encoding eg memorization techniques
Embodied (not connected to judgments by the references here)
- Facial expressions eg smiling in math class
- Body feedback eg mirror writing
Linguistic
- Phonological eg pronounceable vs unpronounceable letter strings
- Lexical eg familiar vs unfamiliar synonyms
- Syntactic eg sentence tree structure
- Orthographic eg using other alphabets, 12% vs twelve percent (reading latex?)
Conceptual eg priming with structurally similar explanations, semantic coherence
Spatial reasoning eg rotating shapes (not connected to judgments by the references here)
Imagery eg imagining hypothetical scenarios
Decision eg jam choices

Judgments:

Truth
Liking
Confidence

Discounting - if fluency is recognized, subject corrects and may even over-correct.

Seems like discounting provides a lot of adjustment room in this theory. How to falsify? Could try varying eg legibility over a wide scale and looking for a discounting effect.

Lecture 4

Fluency can induce:

familiarity
likability
dis-likability (but not replicated).
perception of lighter or darker image (but not replicated)
judgments of fame (abolished by eating popcorn)
judgments of danger (abolished by eating popcorn)
volume of background noise

Familiarity seems like a reasonable heuristic - exposure => fluency, so assume fluency => exposure.

Explanation for the popcorn is that it prevents subvocalisation so can’t judge pronunciation fluency of words.

Others make less sense to me.

Notable that the class was typically split when asked to predict outcome of experiments ie proposed mechanism is so vague that either outcome is plausible.

Other ‘constructs’:

Subjects reconstruct past to create useful narratives
Subjects claim even under strong pressure to remember seeing events that only their partner saw
Subjects remember seeing words when only related words were present

Not worth reviewing, not confident in results.

Understanding face recognition

Broad view of facial recognition, including processes like retrieving information about the face's owner.

What information might components of facial recognition produce?

Pictorial - when viewing static photo, reconstruct some 3d representation after correcting for lighting, grain etc
Structural - angle/lighting/expression -invariant model of face shape/structure usable for recognition
- Identifiable from low-res photos and caricatures
- Pictorial vs structural - recognition of photos of strangers' faces is impaired by changing angle/lighting => structural representation takes time to build up.
- Recognition of familiar faces is less impaired by changes to external features => over long-term representation picks up on more unchangeable details eg feature arrangement vs hair color
- Recognition from restricted (eg just eyes) and occluded (eg wearing sunglasses) views => heavy redundancy in structural code
Visually-derived semantic eg age, gender, similar faces
Identity-specific semantic eg occupation, friends
- Slower than recognition alone
Name
- Separated from identity-specific because it is sometimes uniquely affected by injury
- Often get familiarity without identity, or identity without name. But name without identity would be surprising.
- Usually try to get name by searching for further identity details, suggests it’s attached to identity rather than directly to structural info.
- Slower than identity-specific semantic alone
Expression
Facial speech - everyone lip-reads a little.
- Separated from recognition by injury in both directions

Open questions:

Finer-grained breakdown of cognitive processes involved.
Do we decide that something is a face and then apply facial recognition or vice versa?
How is contextual information included? eg not recognizing someone because you didn’t expect to see them in that place

Are faces special?

Are there dedicated cognitive processes for facial processing, or do we just reuse generic object recognition?

Main arguments:

Face-directed activity in infants => innate
Holistic recognition only occurs for faces, not other objects
There are face-specific neural representations

Main challenges:

Most experiments test within-class discrimination for faces vs between-class discrimination for objects - may be different processes
Expertise hypothesis - maybe similar results for any class that is well practiced eg dog judge recognizing different dogs

Innate:

Newborn babies can distinguish similar faces even after changing hair and viewpoint
Same for young monkeys with no previous exposure to faces
But only for upright faces
Perceptual narrowing to faces of familiar races occurs

Holistic/configural processing vs within-class discrimination:

Inversion effects much stronger for faces than within other classes
Inversion effects occur for ambiguous patterns that are primed as faces, but not if primed as characters
Part-whole effect - much better recognition for face parts when presented in a face vs alone, not for objects
Composite effect - much worse recognition for top half with non-matching bottom half than top half alone, not for objects
Inversion effects for objects disappear with repeated trials, but not for faces.

Neural:

Monkeys and humans show face-selective cells in large clusters
Can be disrupted with TMS
Face and object discrimination can be separated by injury
FFA is strongly activated by face tasks but (usually) not by object tasks

Expertise:

No holistic effects found in object experts (eg radiologists, ornithologists)

Argument that too many studies rely on significant vs not-significant, rather than testing interactions.

Lecture 5

Are faces special?

Functional specificity - specialized mechanisms
Neural specificity - implemented in face-selective areas/neurons/cells
Holistic - face is not represented as collection of parts, but as single object. (Tricky to pin down - makes more sense relative to later experiments.)
Configural - face representation depends on spatial configuration of features, not just features alone

Face recognition could be:

Domain-general object recognition (item-level hypothesis)
Domain-specific object recognition (eg expertise hypothesis)
Face-specific (face-specificity hypothesis)
Some mixture of the above

Behavioral experiments:

Have to separate ‘face’ from ‘low-level details that happen to occur in faces’ - inverted faces are good control
Face inversion effect - face recognition impaired much more by inversion than other expert objects
- But much more expert in faces than anything else
- Experiments testing correlation between degree of expertise and inversion effect have mixed results - still unsettled
Face-composite effect - easier to tell if top halves of faces are different when bottom halves are misaligned
Part-whole effect - easier to discriminate features in context of whole face, rather than alone
(Face-composite and part-whole seem directly opposed?)
Both effects much stronger for faces vs objects of expertise
Measures of degree of holistic processing? Comparing strengths of effects within subjects:
- Inversion ~ part-whole = 0.28
- Inversion ~ composite = -0.03
- Part-whole ~ composite = 0.05
- Inversion ~ face recognition = 0.42
- Part-whole ~ face recognition = 0.25
- Composite ~ face recognition = 0.04
- Would expect strong correlations all round

Neural experiments:

In fMRI, FFA reacts more strongly to faces vs objects
Low-level features? Faces vs scrambled faces.
Item-level recognition? Faces vs houses/porches.
Animate objects? Faces vs hands.
But stronger response for inverted faces.
- More processing for triggered-but-failed recognition?
Similar results for other object categories in other areas - indicates other specificities?
- Places
- Visual words
- Bodies
- Other people’s thoughts
Similar results for single-cell recordings in monkeys
Can find cells which react linearly to continuous changes in several of many face features
Deep brain stimulation results in mis-recognition
Face space (Chang & Tsao 2017)
- Use PCA to choose vectors in face space
- Found face cells that react only to single vectors
- Can reconstruct faces from cell responses
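
A minimal sketch of the face-space idea on synthetic data. This is not Chang & Tsao's actual pipeline: the feature vectors, cell counts, gains and noise level are all made-up assumptions. It only illustrates the logic: PCA picks axes, each simulated cell responds linearly to one axis, and a linear decoder recovers the face from the population response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "faces": 200 faces described by 20 hypothetical features.
n_faces, n_features, n_axes, n_cells = 200, 20, 5, 40
faces = rng.normal(size=(n_faces, n_features)) * np.linspace(3.0, 0.5, n_features)

# PCA via SVD of the centred data; rows of `axes` are orthonormal face-space axes.
mean_face = faces.mean(axis=0)
_, _, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
axes = vt[:n_axes]
coords = (faces - mean_face) @ axes.T  # face-space coordinates, shape (n_faces, n_axes)

# Assumption: each simulated cell responds linearly to the coordinate on ONE axis.
preferred_axis = np.arange(n_cells) % n_axes
gain = rng.uniform(0.5, 2.0, size=n_cells)
tuning = np.zeros((n_axes, n_cells))
tuning[preferred_axis, np.arange(n_cells)] = gain
responses = coords @ tuning + 0.05 * rng.normal(size=(n_faces, n_cells))

# Linear decoding: least-squares map from cell responses back to coordinates.
decoder, *_ = np.linalg.lstsq(responses, coords, rcond=None)
reconstructed = (responses @ decoder) @ axes + mean_face

# Only the part of each face lying in the kept axes can be recovered.
target = coords @ axes + mean_face
reconstruction_error = np.abs(reconstructed - target).mean()
```

The decoder is fitted and evaluated on the same faces here, so the error is optimistic. A real analysis would hold out test faces.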

Medical cases:

Prosopagnosia (developmental in ~2% of population)
- Module defect or the tail of a bell curve?
- Most visible symptom of general object agnosia? Some prosopagnosiacs have normal object recognition
- Impairment of item-level recognition? Some prosopagnosiacs have normal item-level recognition
- Impaired recognition of visually similar forms? Some prosopagnosiacs score normally on differentiation of morphed objects, as long as they are not faces
- Impaired recognition of objects-of-expertise? WJ learned to recognize sheep at expert levels after injury.
Some subjects with object agnosia can recognize faces made out of vegetables, but can’t recognize the vegetables => independent mechanisms, not superset

Innate:

Babies orient more towards face-like arrangements
Subject with upside-down head shows normal recognition accuracy on inverted faces, and > inverted accuracy on normal faces
- (Surprised by interpretation. Also, maybe vision is flipped upstream?)

Lecture 6

Skipped the reading this week :S

Social cognition - ‘the psychological processes that result from inferring the actual, imagined, or implied mental state of another’

Affect is creeping back into models of decision-making.

Moving away from 2-process model because of neuro evidence - clearly many systems involved.

What makes a process automatic? Not requiring:

Intent
Capacity
Effort
Awareness

Rare for any given process to hit all 4.

Illusion of agency - maybe intent does not exist.

Debate over value of heuristics vs rationality.

Mentalizing:

inferring intentions, goals, desires of other mind (or own mind?)
typically care about intent and capability (eg warmth, competence etc)

When do we attribute responsibility to an agent for an action?

Jones says single behavior => specific intent when:
- given choice
- has capability
- departs from behavior of other agents
- behaves differently in other contexts / with other targets
Kelley says behavior over time => disposition when:
- departs from behavior of other agents
- behaves the same way in other contexts / with other targets
- consistently behaves in this way in this context

John laughs at the comedian. No one else laughs at the comedian. John laughs at every comedian. John laughs at the comedian every time. => Behavior is attributable to John, not to comedian

Experimentally, seems to be less sensitive to consensus than other two.

Attribute agency to objects similarly, but not moral status eg ‘computer said no’ but don’t feel bad for throwing the computer away. How do we tell the difference?

Emotions hard to define.

Facial expressions are interpreted in context - changing context changes perception
No 1-1 mapping from face muscles to emotions - complex signal
Much disagreement on mapping emotions to brain regions
Anxious reappraisal
Self-reported eg happiness easily influenced by context, but discounted if made aware
Ability to mimic faces is innate, so universality of expressions could be from cultural transmission
Subjects with amygdala lesions can be fear-conditioned but are not aware of being afraid
Awareness of own heart rate predicts differing emotional reactions

Dominant theory - emotion as cognitive interpretation of physiological signals.

Behavior change:

motivation + capacity
very resistant eg anti-smoking ads
changing environment almost always easier than changing the person

Default mode = social cognition applied to self?

Lecture 7

Examples of theories that try to unify multiple phenomena:

Scale invariance
Decision by sampling
A theory of magnitude

Scale invariance:

$y \propto x^\alpha$
Examples in cogsci:
- Weber’s law - smallest perceptible change : magnitude of stimulus
- Stevens’ power law - subjective intensity : physical intensity
  - Exponent varies by sense
- Fitts’s law - time to hit target : log (target distance / target width)
- Forgetting - recollection probability(?) : time
  - Surprising - exponential decay is a much more natural model
- Practice - task reaction time : practice time
- Recall - number of items recalled : time spent recalling
  - Seems not to depend at all on period covered by recall
- Luce’s choice rule and Herrnstein’s matching law - probability of choosing item : attractiveness/payoff
Most examples cover a few orders of magnitude but fall down at extremes
Causes?
- Need to operate at multiple different scales => use a representation that is scale invariant
- Log-scale turns constant error into proportional error - useful if operating over different scales cf floating point
- Maybe just over-fitting - with proper testing many examples stop looking like power laws
Tends to be null hypothesis since it turns up so often
Violations, switching points are interesting
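
A small sketch of what scale invariance means for $y \propto x^\alpha$, with arbitrary illustrative numbers. Rescaling $x$ by a factor $k$ multiplies $y$ by $k^\alpha$ regardless of where on the scale we start, which is not true of an exponential. The log-log regression at the end is only a crude way to estimate the exponent. As noted above, more careful testing often makes apparent power laws disappear.

```python
import numpy as np

def power_law(x, alpha, c=1.0):
    """y = c * x**alpha"""
    return c * x ** alpha

x = np.array([1.0, 10.0, 100.0, 1000.0])
k = 2.0
alpha = -0.5

# Scale invariance: rescaling x by k multiplies y by k**alpha, whatever x is.
power_ratio = power_law(k * x, alpha) / power_law(x, alpha)

# An exponential has no such property: the ratio depends on x.
exp_ratio = np.exp(-0.01 * k * x) / np.exp(-0.01 * x)

# A power law is a straight line in log-log coordinates, so the exponent
# can be estimated by regressing log y on log x.
slope, intercept = np.polyfit(np.log(x), np.log(power_law(x, alpha, c=3.0)), 1)
```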

Decision by sampling:

Need to be able to trade-off between utility of different outcomes, subjective probability, time
Well-calibrated
- eg prospect theory matches up with empirical distribution of credits/debits into bank accounts, supermarket prices
- eg temporal discounting matches up with number of google hits / newspaper entries for different durations
- eg subjective risk evaluation matches up with probability judgments of probabilistic phrases + distribution of phrases in British National Corpus
How to explain this calibration?
Could be caused in other direction - subjective curves => behavior - but hard to see why it would affect distribution in this way.
Plausible algorithm - no numerical scale, just sample several similar elements and compare to get a rough ranking
How does sampling work? How is the reference class decided?
- From memory - choose a reference class - explains framing
- From context - explains anchoring and effect of irrelevant options
- From exploration
How do we translate between reference scales eg trade off time vs money?
- Poorly, usually.
- CFAR’s ‘units of exchange’ provides anchors / exchange rate?
Picoeconomics claims willpower problems caused by hyperbolic discounting. Can we change the discounting curve by changing sampling process?
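
A sketch of why hyperbolic discounting gets blamed for willpower problems. The amounts, delays and discount parameters are made up for illustration. Under a hyperbolic curve the preference between a small-sooner and a large-later reward flips as the small reward gets close, whereas under an exponential curve the preference never changes.

```python
import math

def exponential_value(amount, delay, rate=0.2):
    """Exponential discounting: value decays by a constant factor per unit delay."""
    return amount * math.exp(-rate * delay)

def hyperbolic_value(amount, delay, k=1.0):
    """Hyperbolic discounting: value falls as 1 / (1 + k * delay)."""
    return amount / (1.0 + k * delay)

def prefers_larger_later(value_fn, wait):
    """Choose between 50 after `wait` and 100 after `wait + 5`."""
    return value_fn(100.0, wait + 5.0) > value_fn(50.0, wait)

# Viewed from far away (wait=20) vs when the small reward is imminent (wait=0).
hyperbolic_choices = [prefers_larger_later(hyperbolic_value, w) for w in (20.0, 0.0)]
exponential_choices = [prefers_larger_later(exponential_value, w) for w in (20.0, 0.0)]
```

With these parameters the hyperbolic agent plans to wait for the larger reward but defects when the smaller one is immediately available. The exponential agent makes the same choice at both distances.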

A theory of magnitude:

Walsh 2003
Proposes that time, space and number are represented by same mechanism
Poorly supported, lecturer expects it to be wrong but useful as research direction
Time and space usually need to be processed together eg for motor action, predicting movement
Plausible that number sense piggybacks on same system
Number vs space (well supported):
- Quicker to distinguish numbers that have larger differences (ie further apart on number line)
- SNARC effect - quicker response to small numbers on left side of vision, large numbers on right side of vision
- Attention bias effect - quicker to notice stimuli in left when fixated on small number, right when fixated on large number
- Line bisection effect - left/right bias when picking middle of string depending on number word in string eg “twotwotwotwo”
- Asymmetric deficits on number tasks in neglect patients / TMS subjects
- Some subjects describe weird number lines and also deviate from these patterns
Time vs number (poorly supported):
- Number tasks and time estimation impair each other
Time vs space (poorly supported):
- Subjects imagining a 30-minute activity in a scale model take longer for larger models
- Neglect patients show asymmetric deficits when estimating duration of stimulus in neglected side of field

Scale-invariance as a unifying psychological principle

Scale invariance common in nature. Psych processes adapted to reflect this?

Clear examples in perception:

Luminance between sunlight and shade can differ by 10000x but brightness and color of an object are perceived as the same in both - visual system processes ratios, not absolute magnitudes
Similarly for hearing frequency - absolute pitch is rare but relative pitch is common
Weber’s law - difficulty of distinguishing perceptions proportional to ratio of magnitude, not absolute difference
- But power varies across scale, so not totally clear
Stevens’ law - in >30 perceptual/motor dimensions mapping to numerical scale is power law
- When making judgments on numerical scale, does anchoring a point in the middle shift judgements in a scale-invariant fashion?

Can’t be purely scale-invariant, because it is possible to judge magnitudes, but usually poorly.

Not true at all for eg color perception.

Perhaps reflects that the systems themselves are implemented physically.

A theory of magnitude: common cortical metrics of time, space and quantity

Argues that:

Hemispheric asymmetry is because numerical calculation tied to language
Number-selective neurons located in same space as space-selective neurons, and some circumstantial evidence of temporal-sensitive neurons in same area

Explaining interference in terms of attention is way too unconstrained. Sounds like a single theory but close reading of the literature shows a wide variety of proposed effects and causal mechanisms.

Predicts SNARC should work for any space/action-coded magnitude.

Decision by sampling

Typical theories of decision-making take utility functions as given. How do we build/calibrate a utility function given basic psychological operations?

To relate this back to previous two papers, how do we get an absolute judgment of utility out of brain systems that are only good at relative, scale-invariant judgments?

Many examples of utility functions (in aggregate) matching cumulative distribution of events in the real world.

Proposes that we sample several items from memory and use these to estimate percentile on empirical distribution.
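
A rough sketch of the proposed mechanism, under assumptions of my own: memory is a list of past amounts drawn from a heavy-tailed (Pareto) distribution, and the sample size of 7 is arbitrary. The subjective value of an amount is just the fraction of sampled comparison items it beats, ie a noisy estimate of its percentile.

```python
import random

def subjective_value(target, memory, n_samples=7, rng=random):
    """Fraction of sampled comparison items that the target beats (ties count half)."""
    sample = [rng.choice(memory) for _ in range(n_samples)]
    wins = sum(1.0 if target > s else 0.5 if target == s else 0.0 for s in sample)
    return wins / n_samples

# Hypothetical memory of past gains: many small amounts, few large ones.
rng = random.Random(0)
memory = [rng.paretovariate(1.5) * 10 for _ in range(5000)]

def mean_subjective_value(target, trials=2000):
    """Average over many noisy single judgments."""
    total = sum(subjective_value(target, memory, rng=rng) for _ in range(trials))
    return total / trials

values = {amount: mean_subjective_value(amount) for amount in (20, 40, 80, 160)}
```

Each doubling of the amount adds less subjective value than the previous one, so a concave utility curve falls out of the distribution of remembered amounts with no numerical scale involved. A single judgment is a count of wins out of `n_samples` comparisons, which is where the binomial noise would come from.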

Many other examples of similar processes:

Norm theory - judge normality by similarity to sampled events
Decision field theory - compare alternative by weighted sampling of advantages on random walk
Support theory - subjective probability depends on alternative hypotheses sampled
MINERVA-DM - subjective probability/plausibility based on similarity to sampled events
Stochastic difference model - ?

Assumes that sampling from memory is a good approximation of sampling from reality. Some evidence for this eg Anderson & Schooler 1991.

Has anyone tested the predicted binomial noise?

Tweaks:

Temporal discount rate decreases with magnitude of gain. Explained by assuming that time and magnitude are sampled together, not independently.
Temporal discount rate is higher for gains than losses. Explained by curvature of gain/loss utility interacting with base discount rate - discount applies to utility, not gain/loss directly.
Working-memory load increases discounting of delayed vs immediate gains. Explained by failing to sample enough large delays - biases score upwards.

Lecture 8

Language is hard to define:

Clark & Clark 1977
- Arbitrary - mapping from words to meanings
- Structured - mapping from sentence to meaning
- Generative - not limited to fixed set of meanings
- Dynamic - words and structure change over time
Hockett 1963 - 13 features, of which 10-13 are claimed to only exist in humans
- Displacement - refer to things removed in time and space
- Productivity - create novel utterances/meanings which are nevertheless understood by others
- Cultural transmission
- Duality of patterning - generative
- (But many of these arguably displayed in animals eg Alex the parrot)

Levels of analysis:

Phonology - phonemes, speech perception, spectrograms
Semantics - words, semantic priming
Grammar - hierarchical structure, formal grammars
Orthography - writing, reading

Traditional Wernicke-Geschwind model:

Broca’s area = speech production
Wernicke’s area = speech comprehension
Connected by arcuate fasciculus
Concentrated in left hemisphere:
- Wada test - inject sodium amytal into artery to sedate one hemisphere
- Anatomical asymmetry in related areas
- Asymmetry in PET and fMRI on language tasks
- Differences in neuron shape between hemispheres
- But hugely confounded by motor control which is also asymmetric

Problems with model:

No clear causal relation between lesions and defects (including patients recovering from defects over time)
No consistent correlation established by functional imaging
Activations in non-linguistic tasks
Voxel-based lesion-symptom mapping identifies different areas
Evidence for multiple networks for language comprehension
Right hemisphere dominant for many complex language tasks
Word-specific activation distributed throughout brain, seemingly paralleling organization of sensory and motor systems eg action words in the motor system

Speech perception is ambiguous - requires top-down processing. Illusion of speech units.

At phonology level:
- Segmentation problem - cannot find word/syllable boundaries in spectrogram
- ‘Lack of invariance’ problem - phonemes do not have consistent representation in spectrogram
- Speaking rate eg careful pronunciation vs normal conversation produce different spectrograms
- Huge variation between accents
At word level:
- Homonyms
- Polysemy eg ‘the door fell off its hinge’ vs ‘the child ran through the door’
At syntax level:
- Ambiguous binding
Combined eg ‘Mary made her dress correctly’
Correct interpretation improved by access to mouth movements, body movements (co-speech), conversational context

Really no reason to continue teaching Wernicke-Geschwind model.

The free-energy principle: a unified brain theory?

Summary of Surfing Uncertainty

Summary of The Predictive Mind

Wikipedia on free-energy principle

Variational Bayes:

Posterior $P(Z \vert X)$ is hard to calculate exactly, so instead we approximate it by some family of distributions $Q_\theta(Z)$
Want to minimize $D_{\mathrm{KL}}(Q_\theta(Z) \Vert P(Z \vert X))$, because we have to minimize something and this is both reasonable and tractable.
- Related - $P_\mathrm{new}(\theta, X) = \mathrm{argmin}_Q D_{KL}(Q(\theta, X) \Vert P_\mathrm{old}(\theta, X)) \text{ subject to } \sum_\theta Q(\theta, X=x) = 1 \text{ and } \sum_\theta Q(\theta, X \neq x) = 0$ . Is minimizing distance to posterior equivalent to minimizing distance to prior subject to constraints?
- Implications for forward vs reverse KL
Can rewrite as $D_{\mathrm{KL}}(Q \Vert P) = \mathrm{constant} -H(Q) -E_Q[\log{P(Z,X)}]$ . Last term (last two terms?) is called ‘variational free energy’. Because thermodynamics?
If $Q$ has some factorization over $Z$ can use calculus of variations (somehow) to produce a set of recursive equations that describe the minimum and which converge under iteration.
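
For reference, the rewrite of the KL divergence follows from expanding the posterior as $P(Z \vert X) = P(Z, X) / P(X)$. A sketch of the steps, writing $P$ for the posterior on the left-hand side as above:

$$
\begin{aligned}
D_{\mathrm{KL}}(Q \Vert P) &= E_Q[\log{Q(Z)}] - E_Q[\log{P(Z \vert X)}] \\
&= E_Q[\log{Q(Z)}] - E_Q[\log{P(Z, X)}] + \log{P(X)} \\
&= \log{P(X)} - H(Q) - E_Q[\log{P(Z, X)}]
\end{aligned}
$$

So the constant is $\log{P(X)}$, which does not depend on $Q$. Since $D_{\mathrm{KL}} \geq 0$, the remaining two terms together are bounded below by $-\log{P(X)}$.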

Free energy principle

$P$ is joint distribution of world model (‘causes’) and sensory input. Bayesian update on this model predicts future sensory inputs from past sensory inputs, via inferring underlying causes.
$Q$ is referred to as recognition density. (Why?)
Express free energy $F$ wrt energy and entropy:
- $F = -E_Q[\log{P(\text{sense}, \text{cause})}] - H(Q(\text{cause})) = \text{energy} - \text{entropy} = \text{expected surprise} - \text{complexity of model}$
- Shows that free energy can be evaluated using information that the agent has
Rewrite free energy $F$ wrt action:
- $F = D(\text{approx posterior} \Vert \text{prior}) - E_Q[\log{P(\text{sense caused by action} \vert \text{cause}, \text{model underlying p})}] = \text{complexity / Bayesian surprise} - \text{accuracy / expected surprise}$
- IE how much we had to mess with the model vs how much predictive accuracy we gained for the recent sensation
- The action that minimizes free energy is the one that minimizes surprise about the resulting sensations => act to confirm predictions
- Hard to interpret. Eg changing point of view to disambiguate optical illusion?
- Active inference
Rewrite free energy $F$ wrt sensation:
- $F = D(\text{approx posterior} \Vert \text{posterior}) - \log{P(\text{sense} \vert \text{model})} = \text{accuracy of approximation} + \text{surprise}$
- As approximate posterior -> true posterior, $F$ -> surprise
- Choosing actions and models to minimize $F$ places an upper bound on surprise
Perceptions feed into online update of $Q$ to more accurately model causes and hence future perceptions.
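
A numerical sanity check that the three ways of writing $F$ agree, using a toy discrete model with three causes and one observed sensation. The prior, likelihood and $Q$ are arbitrary numbers chosen for illustration.

```python
import numpy as np

prior = np.array([0.7, 0.2, 0.1])        # P(cause)
likelihood = np.array([0.1, 0.6, 0.9])   # P(observed sense | cause)
joint = prior * likelihood               # P(observed sense, cause)
evidence = joint.sum()                   # P(observed sense)
posterior = joint / evidence             # P(cause | observed sense)
surprise = -np.log(evidence)

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

def f_energy_entropy(q):
    energy = -np.sum(q * np.log(joint))
    entropy = -np.sum(q * np.log(q))
    return float(energy - entropy)

def f_complexity_accuracy(q):
    accuracy = np.sum(q * np.log(likelihood))
    return float(kl(q, prior) - accuracy)

def f_divergence_surprise(q):
    return float(kl(q, posterior) + surprise)

q = np.array([0.3, 0.3, 0.4])            # some approximate posterior
f_values = (f_energy_entropy(q), f_complexity_accuracy(q), f_divergence_surprise(q))
f_at_posterior = f_energy_entropy(posterior)
```

All three values in `f_values` are the same number and it is larger than `surprise`. Setting $Q$ to the exact posterior makes the divergence term vanish, so `f_at_posterior` equals `surprise`.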

But we like surprising things? Presumably this is to be explained. Or are actions chosen to minimize $F$ in general, rather than for this specific action?

Relation to infomax principle (maximizing mutual information between sense and model subject to constraints on complexity of model). Complexity term in 1st formulation penalizes more complex models - regularization/shrinking.

The fact that these models predict empirically observed receptive fields so well suggests that we are endowed with (or acquire) prior expectations that the causes of our sensations are largely independent and sparse.

Arranged hierarchically, so each model passes prediction error up and passes predictions down. Precision parameter models noise at each level. High noise => more trust in priors / predictions from above. Low noise => more trust in sensory data from below.
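
A sketch of precision weighting for the simplest possible case: a single Gaussian prediction from above combined with a single Gaussian observation from below. This is the standard conjugate-Gaussian update, not something taken from the paper, and the numbers are arbitrary.

```python
def precision_weighted_update(prior_mean, prior_precision, observation, sensory_precision):
    """Combine a Gaussian prior with a Gaussian observation (precision = 1 / variance)."""
    posterior_precision = prior_precision + sensory_precision
    posterior_mean = (
        prior_precision * prior_mean + sensory_precision * observation
    ) / posterior_precision
    return posterior_mean, posterior_precision

# The prediction from above says 0.0, the senses report 10.0.
noisy_senses, _ = precision_weighted_update(0.0, 1.0, 10.0, 0.1)   # low sensory precision
clean_senses, _ = precision_weighted_update(0.0, 1.0, 10.0, 10.0)  # high sensory precision
```

With noisy senses the estimate stays near the prediction (about 0.9). With clean senses it moves most of the way to the data (about 9.1). Equivalently, the prediction error is scaled by sensory precision over total precision before being added to the prediction.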

States ‘value is inversely proportional to surprise’. (In a particular simple model) if we perform gradient ascent on value, then the long-term proportion of time spent in a state is proportional to value, so surprise is inversely proportional to value. Since we act to minimize free energy, priors can encode values. But does acting to minimize free energy lead to gradient ascent on value? Seems like the argument is backwards.

Starting to get flashes of picoeconomics here - recursive relation between model of the future and model of own decision making.

Many references to more general connections between minimizing free energy and defying thermodynamics over lifetime of agent, which I don’t follow at all.

Active Inference, Curiosity and Insight

Various activities can be explained as acting to reduce uncertainty:

Hidden states -> perceptual inference
Future states -> information-seeking behavior, intrinsic motivation
Future outcomes -> goal-seeking behavior, extrinsic motivation
World model / parameters -> novelty-seeking behavior, curiosity

To infer expected free energy, we need priors on our own behavior.

Minimizing free energy == avoiding surprise
Minimizing expected free energy == acting to resolve uncertainty
Need prior on our own behavior to calculate expected free energy. Active inference == prior that we will minimize free energy.

Using example of learning complex rules by active inference. Use prior beliefs about own behavior to encode rules of task, in a way that I don’t understand.

Non-REM sleep. In absence of new sensory input, minimizing free energy => minimizing model complexity vs accuracy. Pruning as regularization.

REM sleep. After pruning parameters, need to reevaluate posterior. Can do this by re-simulating observed evidence.

Superstition as premature pruning.

Open confusions: choice of action vs expected free energy, encoding values as priors, explore vs exploit, precision. Suspect that many of these would be resolved by implementing one of the examples

Active inference and epistemic value

Lecture 9

Value of actions can depend on order eg find food then eat vs eat then find food. So have to evaluate policies, not individual actions.

$\sigma$ is softmax.

Penalizes divergence between $Q$ and $P_\text{prior}$, can set prior on future state to encode value. Not clear how to encode non-bounded tasks.

Bear in mind that we are summing log-probabilities == multiplying probabilities. So states that have 0 on any of the decompositions are still worthless overall.
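
A toy illustration of the last point. The three policies and their log-probability terms are invented and do not reproduce the paper's expected free energy. Each policy's score is a sum of log terms and a softmax turns the scores into policy probabilities.

```python
import numpy as np

def softmax(scores):
    """Normalized exponentials, shifted by the max for numerical stability."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

# Hypothetical log-probability terms for three policies (two terms each).
log_terms = np.array([
    [np.log(0.9), np.log(0.5)],
    [np.log(0.6), np.log(0.6)],
    [np.log(0.99), -np.inf],   # probability 0 on one of the terms
])

policy_scores = log_terms.sum(axis=1)   # adding logs == multiplying probabilities
policy_probs = softmax(policy_scores)
```

The third policy scores very well on one term but gets probability 0 overall, because a single zero in the product cannot be compensated by the other terms.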

Depression, self-destructive behavior etc explained as malformed priors.

From discussion afterwards:

Example models don’t show precision. When used, it’s often fixed to a constant unless they are trying to model dopamine.
Policies are pure function of Q - so not timeless but not directly depending on time either - allows controlling how much memory the model has by controlling what Q remembers of the past.
In examples path integral is trivial, but in more complex models use time slicing?

Lecture 10

Embodied cognition - cognitive processes rooted in perception and action, knowledge not stored as abstract symbolic representation but derived on the fly from perception (past or present) and action.

Doesn’t seem to pin down a clear hypothesis, makes it difficult to figure out which experiments support which version of the theory.

Eg language

Some support for abstract representation
- eg mix up phonemes sometimes => phonemes are a unit at some level of processing
But how is abstract representation connected to the world?
Embodied metaphors eg future is in front, past is behind.

Usually attempt to demonstrate embodiment by demonstrating interaction between cognition and perception/action.

Classic experiments which failed to replicate:

Hold pen in mouth to create frown or smile, affects humor rating of cartoons
Adverts which suggest an action matching the viewer’s handedness are preferred, reversed when hand is already occupied
Parsing action sentences as valid/invalid is quicker when the correct option is presented in a location that matches the action direction
Hearing words associated with body parts activates same brain region as moving those body parts

Presented several other experiments which have yet to be replicated. Effect sizes are typically <1%

Think of embodiment as a spectrum from purely symbolic/logical to fully embodied. Claim evidence does not strongly support either end of the spectrum.

Models of embodiment underspecified. Any effect of the body on thought taken as evidence for embodiment without understanding of how embodiment works. We should be able to explain the pattern of results, not just whether embodiment is there or not.
