Author is seemingly well regarded in HCI.
Anecdote about sock puppet testifying in Congress. Puppet makes a statement about violence in video games, senator asks puppeteer if she agrees with the puppet. Takes the room several seconds to realize that this is ridiculous and start laughing.
Company running software evals notices big jump in ratings at some companies. Finds that the only thing that has changed is that questionnaires are now being run on the same machine as the software being reviewed. Same effect is reproduced in controlled experiment. Interpretation is that people subconsciously avoid criticizing the computer to its face. That may sound crazy, but a senator interviewed a sock puppet in congress. Anthropomorphisation is very easy.
Author hired to improve Clippy. If we think of Clippy as a social actor, the reason for his unpopularity is clear. He's an asshole. He leans over your shoulder, interrupts your work, and never even remembers your name. Couldn't fix that directly, so the author made a scapegoat instead - if Clippy's help isn't good enough he offers to help you write hate mail to MS. Very popular, but not adopted by MS for some reason…
German drivers complain about a new GPS device that has a female voice. Say they don't trust directions from a woman. Not mollified by the idea that all the programmers and designers were men, because it's still a lady robot.
Author claims many results on social psychology transfer directly to HCI. But doesn’t really back it up.
Wanted to study flattery but no experimenters interested - too hard to control. Realized that computers are an ideal confederate - can totally control interaction, personality, appearance. Designs experiment with 3 groups. A) plays driving game, gets +ve feedback, told the feedback is from a highly sophisticated AI. B) plays driving game, gets +ve feedback, told the feedback is just a random stand-in until the AI is ready. C) gets no feedback at all. Groups A+B both rate their driving better than C, even though B was told the feedback was meaningless. Author concludes that flattery is always safe and effective.
Or maybe flattery has no downsides when there is no possible suspicion of manipulation? Seems like a big leap.
Does degree of tech literacy affect the extent to which this effect holds? Who are the test subjects?
Praise and criticism
Company evaluations suck.
Thalamus - involved in valence/judgment
Cannot avoid valence response to feedback - too deeply wired.
Hedonic asymmetry - negative valences are stronger and last longer, and negative spirals are much easier to start/maintain than positive spirals.
In driving game, random flattery increases self-evaluation of skill, but random criticism has no effect. Explained away.
In news reading experiment, articles which criticized third parties reduced approval of the third party, the writer, and the newspaper. Criticism seems to cause all-round negativity.
Don’t recall details of praise for long, but details of criticism stick in the memory.
Retroactive interference - negative valence can cause so much load that it interferes with committing immediately preceding events to long-term memory. Reason why tech support callers often can't remember exactly what they did to trigger the problem.
Proactive enhancement - events immediately following a negative event are more easily recalled.
Warning drivers about instances of poor driving didn’t cause improvement, caused negative spiral.
Implies feedback should start with a few negative items, followed by lots of positive items. Constructive criticism - focusing on the desired outcome rather than the current undesirable state. Specific details, so it's easy to respond and feel like the issue is resolved rather than lingering. Leave time to react - no hit and run - but don't pressure for an immediate response.
Mindset. Single message about mindset ("you are good enough for this game" vs "anyone can learn to play this with enough effort") followed by a crushing defeat affects willingness to play other games described as being really hard.
Similarly, fixed mindset praise (“you are good at this sort of game, this game will be really easy for you”) reduced enjoyment of task and self-evaluation.
Implies framing of feedback is important - “let’s come up with a plan to do better at X” vs “you are bad at X” - “you’ve improved at X” vs “you are good at X”.
Computers which criticize others viewed less favorably but judged as more intelligent.
Voice recognition programs which blamed themselves for errors were better rated and sold more items but were viewed as less accurate, vs those that blamed the user or even some third party.
Voice recognition programs which praised themselves were rated lower all round.
Almost all the experiments are the authors own. No effect sizes or sample sizes given. References not given inline (although appendix has list of papers per chapter). Seems like a lot of generalization from small number of experiments, especially given that it’s extrapolating from machine interactions to human.
Personality traits - various behavioral measures form correlated clusters - attach labels to these clusters - measurable and have some predictive power.
Large lit review says these traits are largely set in place by 5 years old. TODO read review.
Computer confederates can have personality to order - useful for experiments.
Control - dominant vs submissive - desire to influence/control other people.
Affiliation - friendly vs cold - desire to interact with other people
Extrovert - dominant and friendly. Introvert - submissive and cold. Critic - dominant and cold. Sidekick - submissive and friendly.
Comes with horoscopes for each. TODO how strongly predictive/correlated are these trait clusters?
Took strong introverts and extroverts and had them rate item descriptions on eBay. Descriptions written to match either personality, but with the same factual content. Descriptions matching the reader's personality are rated higher and the items bought more often. But doesn't even control for eg length of description. And extrapolates from this that introverts prefer to interact with other introverts and vice versa.
Can judge personality by voice - reference another book by same author. Similar results - ‘introverted’ voices preferred by introverts and vice versa.
Ambiguous personalities (eg voice and body language don’t match) disliked by everyone. Sign of trying to fake interaction? Apparently a widely reproduced effect.
Gradually switching personality to match subject does lead to approval. Recommends doing this in real life, but can the average human really do that without triggering the ambiguity effect?
Teams and team building
Identification and interdependence. (Linked via kin selection.)
Identification - something, anything, as shared identity to form group around.
Record subject giving feedback phrases. In test, subjects who get their own recordings as feedback rate their performance higher than subjects who get recordings of someone else (of the same age, race, gender) and rated feedback as more valid and objective.
Visit swing voters and show them pictures of Bush and Kerry, ask which they will vote for. Unknown to them, one of the candidates has had their face merged with the subject. Causes small but significant shift in voting. Strong implications for diversity in hiring.
Manipulate identity to form teams. Focus on shared attributes.
Inside jokes, jargon etc.
Interdependence - focus on or create shared goals.
Can be completely arbitrary eg matching wristband color with computer color affects approval rates, trust, perceived intelligence, persuasiveness.
Widely established that feeling of team spirit correlates with positive outcomes. Causation in other direction seems entirely plausible.
Typical activities don’t work. Trust falls bring focus on lack of trust. Bridge building creates interdependence but focuses attention on mistakes. Whitewater rafting leads to associating teammates with fear. No team-building retreats. Focus instead on continuous reinforcement of identity and interdependence. Just so! Could tell an equally convincing story in the other direction - what activity involves interdependence but has no potential to make mistakes?
T-shirts and swag - make it identifiable to team members but not a loud billboard - think of it as a secret handshake.
Individual incentives (raises, promotions) reduce interdependence, promote competition.
If no obvious shared goal, make cooperation itself the shared goal.
Mergers create asymmetric dependencies. Create artificial dependencies in the other direction eg make people rely on the subteam for infrastructure.
Dealing with defection. Highlight benefits of cooperation. Highlight the fact that the team members were selected for this job. Make membership exclusive. Hazing / initiation.
Or cut losses - single out the defector and unify the group against them. Deviants can be useful as bounds on acceptable behavior. Removing the deviant tends to lead to another one being singled out. Wait, I thought we were dealing with a defector, how is a new defector made… just starting to sound like an excuse to pick on someone.
Maybe create an artificial deviant instead.
Identity can engender group think. Make diversity of opinion a shared goal, rather than consensus. Focus on individual understanding of all group opinions rather than group decisions.
Commitment to unanimity polarizes opinion. Risky shift - focus on positive outcomes and unwilling to bring up possible negative outcomes, betraying the group - leads to shift towards high risk / high reward choices. Ahem.
Root of emotion is self-evaluation:
- how well am I doing on my current goals?
- should I do something about my current goals?
Maps to axes of emotion:
- valence - how happy
- arousal - how excited
All emotions totally explainable as just position on those two axes. Uh…
Ok, what about bittersweet nostalgia? Neutral valence? Same as diffidence?
Other descriptions I can find of these axes don’t make nearly such strong claims - just that they capture much of the variety of emotion, not all of it.
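The two-axis claim is easy to make concrete as data. A minimal sketch, assuming the book's strong reading (every emotion is just a point on valence × arousal); the labels and coordinates below are made up for illustration, not taken from the book:

```python
# Hypothetical valence/arousal model: each emotion reduced to a point
# in [-1, 1]^2, where valence = how happy, arousal = how excited.
# Coordinates are invented for illustration only.
EMOTIONS = {
    "elation":     (0.9, 0.9),    # very positive, very excited
    "contentment": (0.8, -0.5),   # positive, calm
    "rage":        (-0.9, 0.9),   # very negative, very excited
    "depression":  (-0.8, -0.7),  # negative, low energy
}

def nearest_emotion(valence, arousal):
    """Label a (valence, arousal) reading with the closest listed emotion."""
    return min(
        EMOTIONS,
        key=lambda e: (EMOTIONS[e][0] - valence) ** 2
                    + (EMOTIONS[e][1] - arousal) ** 2,
    )

print(nearest_emotion(0.7, 0.8))  # → elation
```

Note that the sketch also makes the objection above concrete: a mixed state like bittersweet nostalgia has no single (valence, arousal) point, so the strong "totally explainable" version of the claim can't represent it.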
Emotions cause physiological responses. Negative responses are stronger and last longer. A response can linger and be confused with another emotion - eg being forced to smile increases happiness, eg riots are as often caused by positive events (eg winning a big match) as by negative events - the high arousal lingers and the next negative stimulus gets treated as if it had that magnitude.
Contradictions between emotion in eg tone and body language cause negative reaction, similar to contradictions in personality traits.
Positive valence increases problem-solving ability / creativity. Brain at Work also covers downsides - suggests that being able to regulate valence/arousal is useful for changing mental state.
Valence is extremely contagious. Not clear from experiments described whether this is contagion or similar people clustering together. Maybe some details are missing.
Happy drivers prefer advice from happy computers. Sad drivers prefer advice from sad computers. Seems tricky setting the subject's emotion at the start of the experiment.
Humorous computers rated more popular, did not distract subjects (no increase in time taken or loss of accuracy on task). Same results even when the jokes aren't funny. Again, generalizes from this to humor being safe and non-distracting in all contexts.
Arousal lasts longer than valence. Avoid giving negative stimulus to people who are still aroused.
Hard to fight over-arousal with logic. Emotion is pulling the levers.
Frustrating game with deliberate freezes/hangs. Group A) gets multiple-choice questionnaire B) gets open ended questionnaire C) gets open ended questionnaire + emotional support from computer. C led to longer playtime in next game. What about emotional support alone?
Suppression vs reappraisal. Driving computer that suggests reappraisal when cut off led to improved driving and reduced frustration.
Evidence that reappraisal can be learned/trained.
Expertise / specialism - news programmes on a TV labeled "News TV" rated more important and trustworthy than news programmes on a TV labeled "News and Entertainment TV". Similarly for comedy on a TV labeled "Entertainment TV".
Dark side skills: Grill the opposition with questions you know the answer to - makes you look like an expert, and the audience ignores that you picked the questions. Assign arbitrary specialties when interacting with the outside world eg designate one person on the team as the UI expert, refer to them that way and send all comments on UI from their address. You were supposed to bring balance!
Discomfort with uncertainty - want to believe or disbelieve immediately. I am more and more aware of this affecting my decisions - have to fight to remind myself to leave things open. ‘Is an expert’ is a heuristic that can be grabbed to make a decision and relieve uncertainty.
All cultures create gender identities from early age via clothing, colour, segregation etc. Which is freshly disturbing after reading so many experiments on the power of even small artificial differences.
Products sell better when aligned with the gender stereotype of the voice - judged as more knowledgeable even though the voice is clearly synthesized and its gender couldn't possibly influence the text. Interesting way to remove the possibility of rational stereotyping by demonstrating that the stereotype is applied even when it couldn't possibly make sense - seems to indicate that even knowing for certain that someone is knowledgeable/proficient in a non-stereotypical area is not enough to prevent bias.
Stereotype threat - isolated minorities have their performance swayed towards stereotypes eg a single girl in a roomful of boys will do worse on math tests. Given the choice, may be preferable to have some non-diverse groups and some groups with substantial minority presence, rather than distributing minorities evenly.
Online test with virtual avatars for classmates. When told that group performance is measured instead of individual performance, stereotype threat vanishes - performance is the same as with a female majority.
Huge list of halo effects.
Two points of judgment:
- Does this person have good information (knowledge, status, expertise)
- Can I trust this person (motives, identity, familiarity)
Common mistake is to try to convince people only on point 1, forgetting to make them trust you first.
Reciprocity. Computers which appear to be working hard to help subjects receive more effort on later data-gathering tasks. Switching computers removes effect, so not just general good mood. Computers which appear to be lazy get less effort, so not just familiarity.
Same experiment in Japan - no effect. Some explanation about how reciprocity in Japan is aligned with groups, not individuals. Trying the experiment with PC vs Mac shows the effect. Does this not terrify the author - how many of the experiments here will only reproduce on WASP ivy-league students? Also, gave a glib explanation and only a single additional test - doesn't engender confidence that this was actually the reason and not just luck. Sample size? Effect size? Nada.
Sharing personal info / vulnerability creates reciprocity. Computers that share ridiculous personal info (“My CPU is only 400MHz so sometimes I can’t keep up”) get more personal info in questionnaires (as judged by blinded readers).
I notice here that I am predicting the results of the experiments before reading them. Where are the surprises? Where are the failures? Are we only seeing a cherry-picked sample?
Following structure of justification gets more help even if content makes no sense. eg at copier machine ‘can I go ahead of you?’ vs ‘can I go ahead of you because I’m in a rush’ vs ‘can I go ahead of you because I need to make copies’. The last two are equally effective vs first, even though ‘because I need to make copies’ is not actually a reason.
Similarity attraction for expertise eg US readers prefer US advice about Swedish destinations over Swedish advice about Swedish destinations. This is not necessarily irrational, eg I have learned to be very cautious about Chinese advice on where to eat in China. Advice is not just about the facts but also about understanding the advisee's desires/goals/values. An American would be much more likely to accurately judge what kind of food I would enjoy eating.
Trustworthiness is often more persuasive than expertise. For example, politicians competing on who can be the most ordinary - “I’m just like you”. I don’t want someone like me running the country - I want someone much smarter. But it is rational to prefer someone who appears to be in your tribe - can be trusted to represent your interests.
Elaboration likelihood model - two modes of thought:
- Central route - focus on content, facts, conscious effort - expertise wins
- Peripheral route - not focused, vulnerable to bias, intuitive judgment, - trustworthiness wins
ie trustworthiness is a cheap heuristic, acts as default / fallback
High arousal, low energy, time pressure push towards peripheral route. Need for cognition pushes towards central route. Oh, this suddenly brings into focus a typical mind fallacy I have been carrying around for ages - I expect other people to at least attempt to dispassionately examine evidence when making important decisions and am often frustrated when they don’t - but if that mode of thinking is largely predicted by need for cognition it makes sense that many people won’t choose it at all. I’ve been trying to persuade people on the central route when I could have predicted that they would be using the peripheral route.
Framing/priming can switch routes eg signs that someone might not be trustworthy (eg ambiguous tells that we saw earlier) can trigger peripheral route.
Argues that Bush was one of the most intelligent presidents ever but had poor presentation / mixed tells that lead to much distrust. This runs head first into my wall of liberal bias, so I went looking for falsification of my existing poor opinion. Stanovich argues that while tests show he has an IQ around 120, reports from various aides indicate that he has a low RQ. I would like to see if someone has cleaned up and re-presented one of his speeches, to see if I would be more persuaded without his various verbal tics.
Pitches from Aus-accented whites vs Aus-accented Asians vs Korean-accented Asians - pitches with mixed stereotypes are less successful.
No mention of any experiments which find difference between human and machine interaction - the first thing I would want to do is figure out under which situations this holds and to what degree and what kind of human interactions produce same results as computer interactions. Find it hard to believe that every single experiment succeeded.
Experiments are also predictable from the narrative - can guess the results from the setup. Might just be good storytelling, but seems like everything lines up way too neatly. Given that we’ve previously encountered strong evidence that people prefer a coherent story over an accurate report, it seems worth being extra suspicious of anything that doesn’t exhibit a real-world level of messiness.
The results are neat too, and massively generalized from the single experiment (or for some of the sections, from no experiments). Maybe the author is speaking from experience and has seen enough evidence to generalize, but I want to be convinced directly.
Also, the one experiment tried in a different context fails to reproduce the result. They tried a single tweak, got a result that matches a theory and called it a day. If I were the author I would be looking back at a book full of experiments run in a single culture with probably very homogeneous subjects and wondering just how much of it was down to context. This is Fundamental Attribution Error 101 - preferring to attribute behavior to immutable attributes rather than context/environment.
(Could this be an example of the introvert vs extrovert preferences in the eBay experiment? Do I distrust this because of the presentation style? Have I been harsher on the evidence here than on the more cautiously-toned book? I think I need to come up with a structured process for deciding whether I believe a claim, to avoid potential bias.)
One of the ways I evaluate ideas is in terms of expected predictive power - how confident I am that I could use this idea to usefully predict results in the future. I have low confidence in most of the individual claims in this book - due to the limited samples, lack of replication and the vulnerability to different contexts and cultures. Worse, many of the claims can interact and it’s not clear from these experiments that you could predict which effect would be dominant.
Another way I evaluate ideas is in terms of hypothesis generation and attention focus - do they give me a new viewpoint from which to notice patterns. From that point of view, the broad ideas outlined in the book - that interaction with computers can draw on social heuristics and that computers can be used to isolate social heuristics - seem worth adding to my collection of perspectives. I am more likely to be on the lookout for such effects in future, even if I don’t think that I can predict them in advance from the evidence here.
Some of the results are more widely known / replicated, and of those there are some that seem very useful to bear in mind:
- Mixed signals can subconsciously cause negative reactions
- Valence/arousal is a useful mental model (even if it doesn’t capture as much as the author claims)
- Trustworthiness/familiarity/identity is an important factor to be aware of in persuasion. It’s not enough to rely on evidence alone, you also must generate the feeling that you are on their side.
Lastly, I benefited a lot from the reminder that most people use the peripheral route more than I would expect based on my own internal experience.