Yi reports his students taking ~10 hours to skim 100 papers over a single weekend. His students are clearly way more disciplined than I am. I dragged it out for months. It was brutal.
Towards the end I did finally get up to about 50 papers a day, so if I do this again for other journals I’ll probably just take a random sample of 50 papers and try to do the whole lot in one go.
What did I learn?
Not a great deal about the psychology of programming itself. For the most part the field doesn’t feel like a stumbling progression towards enlightenment, but just plain stumbling.
Here are some common failure modes that frustrated me:
Plain bad science, especially in the early years where a lot of the experiments are ‘my pet project vs the world’ and somehow the pet project always comes out looking good. My favorite example has a graph where the control group are clearly performing better, and the author explains this away, saying it’s because the control group were cheating, and in the conclusion of the paper declares the treatment a success. And it was published!
Failing to validate instruments. In particular, a lot of papers that involve coding qualitative data didn’t bother to have two people code the data independently to check for agreement.
‘We have to do some science, and this is science’. Many of the experiments are from the start clearly incapable of answering the original question. I realize that getting good data for these subjects is hard, but the opportunity cost still stings. Do a different experiment!
Theoryless science. For example, one paper had programmers read programs under an eye-tracker and found there was a significant difference in the gaze patterns between the more experienced group and the less experienced group. So what? There are no suggestions as to what it means, how it can be used, or under what conditions it’s expected to be replicable. These papers typically end with “further research is required” and then no further research materializes.
The last point is often detectable right at the beginning. If I read the description of the experiment and I can’t even make up an interesting result, that’s not a good sign.
There were also a lot of papers that read like “I found this in a book, we should apply it to programming”. I don’t know if it’s fair to call that a failure mode, but I rarely found anything enlightening in these and most of them were not followed up with papers where they actually apply the idea.
On the positive side:
There are a ton of negative results. PPIG seems to be totally willing to publish failed experiments, which is awesome.
Things get better over time. Somewhere around the mid-2010s I started finding papers that actually seemed like results. The series of papers on testing whether students form consistent models is a particular highlight - it starts with an interesting correlation, then some failed replications, then refinements of the test, then some successful replications, then more refinements, then combining it with interviews to test validity. It’s not sexy, but it does seem to be actual progress towards a reliable measure of one specific leak in the education pipeline.
I have a sort of vague idea that most of this work is just attacking things at too high a level. Programming is a big, diverse, complex skill and we don’t yet understand even the most basic subskills, or even know how to break the skill down into subskills. Without that it’s impossible to know whether it makes sense to extrapolate any one set of results. If barely trained students benefit from syntax highlighting, does that mean professional programmers will? Or programmers in other languages? Or programmers dealing with programs longer than 50 lines? We have no idea.
Much of this reminds me of Feynman’s stories about rat experiments.
…his papers are not referred to, because he didn’t discover anything about the rats. In fact, he discovered all the things you have to do to discover something about rats.
What are the things we have to discover before we can discover things about programming? I have no idea.
Anyway, here are the full notes. They’re highly opinionated and I made no attempt to be fair to the authors. If I said something mean about your research, don’t worry - I barely read it anyway :)
Two year programming class. 100 engineers. Errors in semicolon use concentrated on specific contexts, even though students have learned explicit syntax rules. Later errors become uniformly distributed ie attention lapses rather than misunderstanding.
Students don’t learn syntax from specification of syntax - need contextualized experience. Parallel to natural language learning.
Single engineer observed writing spec in the wild for three weeks. Claimed to be following hierarchical plan but was observed deviating opportunistically.
Tools should allow interrupting planning of one component to jump to another.
8 experienced programmers - 4 OOP beginners, 4 OOP experienced. Up to one day to solve two problems. Observed heavy reuse of solutions/schemas once discovered.
Framework for building end-user programming environments. Aims for progressive specialization where programmers hand-off partially finished environments to end-users. No study on actual users.
Judge programming language by breaking writing down into steps, and look at how much knowledge required to choose between options at each step. Novel language, observe four paper writers working through problems. Example of a macro feature where the correct choice is not clear.
Provides a method for evaluating how natural/obvious a given language or feature is.
Studies transfer of ‘plans’ learned in Pascal to Ada or Icon. 13 students with Pascal experience. 3 work in Pascal, 5 in Ada, 5 in Icon. Observed, broken into episodes, episodes classified by consensus.
Two week industrial course for new graduate hires. Subjects forced to work only in company hours - no last-minute rush - teaches importance of planning/scheduling. Role-play vague and unhelpful customer. Uni courses have to be fair => perfect environment - better learning from realistic disruptions.
Proposes study of how expert Prolog programmers use tracer. Didn’t actually perform study.
2nd and 3rd year students. 51 returned questionnaire and some were interviewed? Focused on four courses that use formal methods.
Students struggle to manipulate formal expressions by hand. Need automated tools. Only realized relevance of early course much later, when it was too late. Don’t believe that formal methods are used ‘in the real world’.
Do expert programmers develop better working memory, or do they learn to use external memory more effectively? Notice that experts jump around more often, while beginners tend towards linear program generation.
10 professionals. 12 2nd year undergrads.
Experiment 1. Have to program while speaking strings of digits - strains working memory. Expert performance did not suffer. Novices have more errors under suppression, and jump around code more often.
Experiment 2. Write a program using an editor that doesn’t allow lines to be edited after hitting enter. Experts produce more errors than novices in this environment.
Suggests that experts don’t develop more working memory, but instead switch to strategies that use external memory more.
Plan for a dynamic visualization of code execution.
13 subjects solve problems using one of three tracers. No clear differences noted.
Seems to be research proposal, not a paper.
Proposes a model of computation based on agents which make discrete observations of state?
Graphical version of prolog. Slides only.
Two axes. #1 whether common patterns are taught. #2 whether explicit mappings are given between examples and exercises. Shipped experiment as free online class. No results yet.
Case study in UK bank. Large team. Interviews and questionnaire. Early results only. Nothing notable.
Agent-based model of programming for OXO games. No experimental results.
Self-paced LISP course. 25 students, no programming experience. Taped sessions and students think aloud. Analyses student explanations.
Better-performing students made more complex and more connected explanations of LISP concepts, and asked more questions of other students. No data or statistics given.
Visual OOP env built on smalltalk. Based around theatre metaphor. Abstract only.
14 students x 6 interviews. Self-reported experience of programming. No results.
Studies of visual programming useless because they focus on toy problems. Personal model of how OOP programming works.
Questionnaire to explore connection between programming ability and math ability. No results.
32 2nd year undergrads. Experience with procedural programming. Asked to solve problem in Prolog.
Two different strategies observed. Both were procedural.
R-technology - visual control flow developed in Soviet Union. 24 professional R-tech programmers. Comprehension test for two 150loc programs. Then asked to write a program in either C or R-tech.
Case library tool for existing courseware design suite. Designed by observing users on mockups. Tool was not effective.
Study how multiple representations of logic helps. No body.
20 novice students. 7 faculty members. Add comments to ADA program.
Falsified hypothesis that experts would produce semantic comments and beginners would produce paraphrases of code.
Graphical version of LISP supporting execution of partial programs. No body.
Case studies of multi-disciplinary design problems. Unclear results.
Kids programming environment. 56 children, 10-13yo. Varied problems, class schedule and instruction.
Clear enjoyment, but no good evidence of transferable learning.
Toolkit for studying debugging of programs with multiple bugs. Little information.
14 grad students. Experienced in structured programming, 6 months experience with OO. Procedural and OO program with identical spec. Asked to make changes.
Worse performance on 4/6 tasks for the OO group.
New Prolog tracer based on different mental model of evaluation. 48 cogsci students. Given printed program, worked through modified versions in one of three tracers and asked to identify differences.
Exposure to language predicts success more than exposure to tracer tool. Students were more successful with Plater tracer.
Dev environment based on observing what programmers actually do, rather than what models prescribe.
32 students. 10 week course. Standardized teaching methodology. Group 1 got standard course. Group 2 got standard course with different method of teaching recursion. Group 3 got same course as 2 but also used the TED editor.
Group 3 made fewer mistakes overall. Mixed results for different problems. TED editor not described.
Argues that this process goes through an intermediate imagery step.
Sarcastic faux history.
Examines process of finding software to reuse. No data.
Skipping the doctoral consortium - no bodies.
Two environments experienced by the authors where people skills were lacking.
Proposes metrics on callable-from graph. No evaluation.
18 1st year undergrads. Graphical vs tabular vs narrative representation of design rationale. Options-based vs criteria-based (unclear on meaning).
Narrative format had slower response times, fewer correct answers. Graphical and tabular were matched. Criteria-based had similar times to option-based but more correct answers.
Visual language for cognitive modeling. Focus is not on building models but on understanding them. Used for student project. Questionnaire at end of course, qualitative analysis.
Students liked having no textual syntax. Found it easier to understand Hank models than Prolog models. Able to understand each other’s programs. Could understand execution process. Hank was usable with pen and paper.
32 C programmers. 32 spreadsheet programmers. Checking whether programmers build mental models based on control flow or data flow. Used recognition task with priming to see whether recognition is improved by primes nearby in control flow or nearby in data flow. Not clear what language the test program was in or how distance is judged in each model.
Interpretation of results is not clear to me.
Considering user action in terms of choosing where to invest limited resource of attention eg spend attention looking up api call vs just hope that memory is correct.
Investigating visual programming languages for novices.
Program comprehension in Prolog. 10 experts, 10 novices, 10 non-programmers. Asked to read program, reconstruct from memory and then explain purpose. Programs are very short.
Considered 4 different ways of choosing key points of program. Only way for which difference in recall was significant was ‘schemas’ - common patterns of computation in prolog eg build list and then aggregate.
Describing future studies proposed for Hank. No results yet.
Something about funding decisions.
Piaget-ian construction of mental imagery. Traditional calculators and algebra notation don’t provide mapping to any concrete imagery, so children struggle to map classroom math to their own experiences. Proposes a computer environment with multiple representations, at different levels of abstraction, running simultaneously.
Questionnaire on usability of theorem provers. Subjects complained about vague and sometimes inapplicable questions.
170 1st year students in logic course. Post-course questionnaire.
Students preferred to memorize rules rather than practice with Jape. No other results yet.
10 science teachers. Five problems in Word, where knowledge of underlying implementation would allow easy solution.
Most subjects resorted to trial and error, and out-loud gave incorrect explanations of underlying mechanics. No attempts to falsify their models.
Proposes that experiments in programming would produce similar results.
Want to see if changing naming scheme in spreadsheets from individual cells to named blocks of cells changes the kinds of errors that users make. 154 1st year undergrads, various subjects. Four different spreadsheet tasks, given in various orders. Compared Excel vs Basset (home-grown system). Errors are categorised by author.
Lower number of errors for some tasks, higher for others. No clear numbers for categories.
Tool for recording student interactions with a Smalltalk environment.
Developed a language based on review of past psychology of programming results. Experimented with different formulations of boolean queries. Found confusion over ‘AND’, precedence/grouping and users totally ignoring parens.
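The precedence confusion is easy to demonstrate. A minimal sketch (the query and records are invented for illustration, not taken from the paper):

```python
# Invented records, each tagged with keywords.
records = [{"cats"}, {"dogs", "birds"}, {"cats", "dogs"}, {"birds"}]

# Query: cats AND dogs OR birds

# Reading 1: cats AND (dogs OR birds) - what many users seem to expect.
r1 = [r for r in records if "cats" in r and ("dogs" in r or "birds" in r)]

# Reading 2: (cats AND dogs) OR birds - AND binds tighter, the usual convention.
r2 = [r for r in records if ("cats" in r and "dogs" in r) or "birds" in r]

print(len(r1), len(r2))  # 1 vs 3 - same query text, different results
```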
GRAIL vs LOGO. 26 1st year undergrads, no experience. Similar exercises for both groups, but not identical.
Fewer errors for GRAIL. Difference larger for syntax errors than for logic errors. Presumably graded by author.
Theorising about how to teach robotics to children.
Questionnaire given to 14 students. Self-reported background / computer use. Programming quiz.
Structured interview given to ? students assessing learning style, development process.
Concludes that sample is too small to say anything of interest.
Transcribed 7 design meetings. Cut up into speech acts, classified according to coding scheme developed during analysis. Classifies into 5 kinds of exchanges, using unclear methodology.
Questions: 34 ‘what is the type of this’, 34 ‘why is this an error’. 6 subjects, all of whom were at least post-grad. Video categorised by the methods subjects used to answer, unclear methodology.
Jeliot - animations for understanding algorithms.
Programming intro w/ 564 undergrads, didn’t regularly use Jeliot. Short programming course w/ 37 high school students, course based around Jeliot. Semi-structured interviews for small numbers of students in each group, not explained how they were selected.
Rejected by group 1 lecturer because of library issues. Students found UI confusing.
Discusses how taboos spread socially. Argues that all the notable taboos are about crossing abstraction boundaries.
16 novices. 16 experts. Showed a program for 2 or 10 seconds. Asked 5 comprehension questions. Experts perform better in both 2s and 10s cases. Data- and control- flow questions benefit from 10s more than function, operation and state questions.
Trying to teach beginners the mental representation of recursion used by experts.
49 students taught to use a diagrammatic representation of recursive programs. Test mental model by giving a list of possible solutions and asking which ones work correctly. Students who were tested with diagrams made fewer errors than students who were tested with normal code, ie the model was learned but failed to transfer to normal code.
Design of real-time temporal logic language using cognitive dimensions.
Questionnaire on UML use cases, covering cognitive dimensions. 14 students.
Testing whether to use arrows, lines or juxtaposition for control flow. 84 students. Given maze in each style and have to follow paths. Measured response time. Arrows were fastest and most accurate.
How well are mental representations transmitted to students via verbal discussions? 3x groups of 7-9 high school students, oral interview + written questionnaire, questions on syllabus material.
High school students suck at definitions.
Questionnaire sent to users and designers of a prover assistant. Answers on numerical scale. Minor differences in responses between the two groups.
Generalised CD questionnaire. Tested on users of various systems. Not clear whether this is useful.
Adapt UML for formal specs. No evaluation.
51 students, 4 questions, each a choice of two implementations of simple problem. After four weeks of Z classes, given same set of questions described in Z.
Claims strong shift in approach, but just eyeballing the numbers doesn’t look like much.
Functional programming in templates. Maybe misunderstood, but seems incredibly trivial.
Compares 4 formal modeling methods. Evaluated directly by authors on a single example. Prefer tables over trees.
Smalltalk env. Made a ‘coach’ with list of recent actions and error messages. No evaluation yet.
Advocates for using this practice. Not totally clear on process or goal, but seems related to the problem of not being able to teach implicit knowledge.
Comparing two teams of students, high vs low performance. Coded email and irc histories.
Concludes that communication is important.
Experiments with graphical languages.
Different graphical layouts for programs. 21 students tested on comprehension of graphical vs textual. 60 students tested on comprehension of different graphical layouts. No main effect in either.
Additional cognitive skills test, no correlation with main experiment.
Switching between RTL and LTR writing in Word is confusing. Gave students an explicit conceptual model.
Control group’s exam was under different conditions, which was used to explain their better performance.
Argues for more empirical research in CSE.
Patterns in working hours for remote learners in an OU course.
Plans for end-user programming in languages other than English.
There is little in the way of a conclusion, other than the common oft-repeated call of ‘we need to do more research in this area’.
Comparing human approach to explaining type errors to approach of authors’ new type-checker. Coded by author, without explanation.
Gave (translation of) general CD questionnaire to 10 spreadsheet users.
CD questionnaire to evaluate new language, given to 5 professional programmers.
Media Cubes. Physical cubes that correspond to (dataflow?) operators. Placed together to create program. Provides direct referencing eg place cube on tv to reference tv.
Iota.HAN. ML / Pi-calc based language.
Ignoring some cognitive dimensions in analysis can lead to missing effects on those dimensions.
Protocol for analyzing spoken transcripts to determine whether programmers are understanding a given program by comparing against preexisting domain knowledge or by recognizing patterns of code.
Mostly XP advocacy. Little in the way of actual numbers.
Planned data collection from student projects.
Studying language high-school students use.
How do people choose between the two (shared memory vs message passing)? Give parallel programming problems to students along with Myers-Briggs test.
Hugely forking paths. Only 0-3 students per MBTI / shared-memory-vs-messaging intersection.
Trying to define programming in light of rising end-users.
Ideas on difficulties - loss of direct manipulation, use of notation, abstraction.
Break variables into different types - constant, counter, control etc. Apply different graphics when visualizing program.
45 students. Given single line, judge how likely it is to have come from a binary search. Ratings(?) influenced by expertise.
30 students. Given single line, map it to one of three algorithms. Unclear results.
19 students. Shown algorithm and asked to comprehend/memorize (under eyetracker). Then given comprehension test. Divide sections of code into categories, normalize time in each category by screen area. Experienced group focused more on ‘complex’ category.
Speculation on programmer folklore.
Students struggle surprisingly with symbolic systems. May be caused by overhead of manipulating data-structures. Suggest pattern-matching support in language to help.
33 students. No significant correlation between VB course final exam grade and high-school grades or SAT scores. Huoman programming aptitude test explains 25% of final grade.
HCI researcher compares pattern languages to cognitive dimensions.
Educators given animations of systems and asked to describe how they work.
No coding or statistical analysis.
Inexperienced subjects tended to describe each entity until influenced by external event. Suggests describing systems in terms of individual behaviors and cross-entity conditions for behavior change.
Proposes agent-based modeling of software organizations.
Anecdotal reports from lecturer of first-year programming course.
3 main problems. Students have no prior exposure to computational thinking, or even basic logic. No familiar experience to compare most programming concepts to. Unused to rigid syntax rules.
Suggests several analogies for use in teaching.
Distance learning students. Given questionnaire to assess learning style.
Many forks. Significant gender differences.
Presents simple computational model of human cognition. Suggests studying software tools in terms of how their use maps onto this model.
Common-sense ideas for comparing languages and IDEs.
Unclear. Map language spec into this tool and automatically compute CDs?
Library design should be treated as a usability problem.
CD questionnaire given to students in Z course reveals many issues, too many to list here.
Two different notations for UML diagrams. CD analysis favours one. Multiple-choice exam given to students shows no significant difference.
Anecdotal reports from programming course.
Students can often give the correct code without understanding how it works. Often mislearn concepts, while making the correct noises.
Very similar to previous paper.
Ideas for how to reduce cognitive load when teaching programming.
Empirical studies of programmers tend to take place in very artificial settings. Plans to study programmers in-situ.
At lexical level TUIs provide no advantage - too small a vocabulary, no established conventional symbols, no significant improvement in interaction speed or recall.
At syntactic level - more possible kinds of relationships (position, orientation, proximity…), input-only interface (eg no undo).
CD questionnaire given to 2(!) subjects. Both were critical of the questionnaire.
Brief survey finds mixed support for visualization in programming tools.
Categorize variables into one of 10 roles.
Taught 80 students with a) normal methods b) variable roles c) variable roles + animated simulator which shows roles.
Conclusion is confusing - raw data looks to me like roles group did worse in their final exam, but author claims that poor grading was the cause.
Tries again with the two different UML notations. Three different studies show no difference in performance.
Java debugger with ghetto eye-tracking. 49 novice students. Given spec and test cases, then given 10 mins in debugger to fix program.
Numbers are in a different paper. This one only discusses verbalizations from 2 subjects.
Subjects started by reading code almost top-to-bottom. Debugging switched between forward and backward reasoning.
Teaching language design by working with students to build a language for programming Mindstorms robots.
Groups of programmers produce less over-optimistic estimates than individual programmers (effect size 1.25).
Seems similar to proposed mechanism in Superforecasters - if estimate is ideal time + list of things-that-might-go-wrong, then compiling things-that-might-go-wrong from whole group should provide better coverage.
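A toy simulation of that mechanism, with all numbers invented: if each programmer budgets only for the risks they personally thought of, pooling the group’s risk lists shrinks the over-optimism.

```python
import random

random.seed(0)
IDEAL, N_RISKS, RISK_COST, TEAM = 10.0, 20, 2.0, 5
TRUE_COST = IDEAL + N_RISKS * RISK_COST

def estimate(known_risks):
    # Budget only for the risks this estimator thought of.
    return IDEAL + len(known_risks) * RISK_COST

trials, solo_gap, group_gap = 10000, 0.0, 0.0
for _ in range(trials):
    # Each programmer independently thinks of ~half of the real risks.
    known = [{r for r in range(N_RISKS) if random.random() < 0.5} for _ in range(TEAM)]
    solo_gap += TRUE_COST - estimate(known[0])
    group_gap += TRUE_COST - estimate(set.union(*known))

print(solo_gap / trials, group_gap / trials)  # ~20.0 solo vs ~1.25 pooled
```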
Current experiments in psych-prog do not inform practice. HCI faced similar criticism in previous decades and changed focus from lab experiments to field studies.
Several observed cases of mental imagery originating in one dev and spreading throughout an entire team.
4 companies. (4?) quality managers. Semi-structured interviews on department structure, history, responsibilities, practices.
Not clear what the conclusions are.
Scheme for grading student explanations of programs.
CD analysis suggests that existing categorization of prototyping tools into lo- and hi-fidelity is not enough.
Identifies 4 key activities: authoring, validation, implementation, confirmation. Not clear how these relate to categorization.
Design rationale may not be available as explicit, conscious, verbalizable knowledge.
Authors elicit rationale from students designing web pages using real-time self-reporting and laddering. (Laddering seems similar to root-cause analysis, but for exploring knowledge rather than causes).
Suggests a variant CD specifically designed for evaluating libraries.
Variable roles again. No new results.
Proposal for natural language programming for kids.
Plans for studying extreme programming.
Review of previous XP studies.
Same as previous paper, but with a real eye-tracker this time. Similar results.
Collaborative IDE for teaching. Anonymous group work was popular.
Proposes teaching system that tracks students’ weaknesses down to the level of individual concepts. (Not unlike Khan Academy).
Bluetooth network simulation for teaching. No evaluation.
Many examples of metaphorical language used by devs.
Anecdotal observation of students. Students were reluctant to draw object diagrams before coding. Instructor didn’t approve of their modelling choices.
36 psych students, later 64 vaguely-sourced adults. Tested on ‘does regex match string’ and ‘make a regex to match this class’. Exposure to former doesn’t improve performance in latter (but only 6 questions per person?).
Collected program snippets from Java mailing lists (mostly from bug reports and patches). Don’t really follow the results.
75 students. Computer Programming Self-Efficacy Scale questionnaire. Program comprehension test. Program recall test. Weak correlations between self-efficacy at end of course and test scores.
Proposes evaluating understanding of concurrent systems with a particular talk-aloud protocol and coding system.
Proposes using restricted focus viewer and other recording methods.
Proposes an extension of UML to handle multiple agents.
Visualisation tool for CORBA. Records events, allows replay. No evaluation.
3 commercial eye-trackers. 12 subjects given program comprehension tests. Head-mounted device took longer to setup and was less accurate but allowed subjects to move around.
~150 students. Modified Witkin field-dependency test. Digit span test for working memory. Correlation to exam results 0.1-0.2 for working memory, 0.2-0.4 for field-independency.
(Would effect of field-independency be explained away by spatial IQ?)
Coding scheme for program summaries shows ~80% agreement between 3 users. Suggests refinements to the scheme to reduce differences.
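(Raw percent agreement flatters a coding scheme, because some agreement happens by chance; a chance-corrected statistic like Cohen’s kappa is the usual check. A minimal sketch, with made-up labels:)

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Agreement between two independent coders, corrected for chance."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] / n**2 for label in counts_a)
    return (observed - expected) / (1 - expected)

# Made-up codings of the same 10 summary fragments.
a = ["plan", "debug", "plan", "read", "debug", "plan", "read", "read", "plan", "debug"]
b = ["plan", "debug", "read", "read", "debug", "plan", "read", "plan", "plan", "debug"]
print(cohens_kappa(a, b))  # raw agreement is 0.8, but kappa is ~0.70
```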
11 subjects visit ecommerce pages under eye-tracker. Self-reported data agreed with eye-tracker - users look at top-left or top-middle first.
63 students. Record full source code every time they hit compile. More than half of errors accounted for by missing semicolons, misspelled variables, missing brackets, illegal start of expression, misspelled class. Majority of recompiles after error take place within 20s. Vast majority of recompiles after success take place after > 5 minutes.
‘Grounded Theory’. Framework for guiding qualitative research.
Computer literacy course. Computer Attitude Scale questionnaire. Attitudes became increasingly negative over the duration of the course.
Interviews with 4 programmers and 2 artists. Conclusions unclear.
ToonTalk. No body.
18 high-school students in undergrad programming course. Animate three short programs. Comprehension task. Gaze-tracking.
No significant variation in behaviour wrt experience.
58 students pair-programming. Observation, questionnaires, semi-structured interviews, field notes.
Differences in skill level affect collaboration. Debugging tasks reported as particularly tiring / unenjoyable.
29 undergrads. 1 modification task, 1 comprehension task, 6 debugging tasks.
Forking paths. Questionable interpretations of results.
Language model for writing C# with Dasher. No evaluation.
27 male and 24 female subjects. Self-efficacy questionnaire. Spreadsheet with extensions for testing. 2 spreadsheets with bugs.
Female subjects had lower self-efficacy about debugging ability. Less likely to use new debugging features, although no difference in learning time. Introduced more new bugs, though this also correlates highly with use of the new debugging features.
Ethnographic study of pair programming at two companies. No results yet.
Pilot to study whether Java implementation of Irish electoral system is easier to understand than the legal language.
8 software eng postgrads. Comprehension questionnaire. Java group scored worse and expressed more confusion and frustration in talk-aloud.
Aptitude Profile Test Series (sounds like an IQ test) administered to 34 students. Correlates 0.4 with exam results.
Demographic survey of 197 students. Country of birth and first language significantly affected exam results - non-English students fare worse.
APTS and demographic survey of 80 students. Students with prior experience in programming fared better. APTS correlates 0.3-0.4 with exam results.
Notes sample bias - students that volunteered are much more motivated than those that didn’t.
45 pair programmers asked to rate ability and experience for themselves and for peers. Unclear results.
Deja vu. Published effectively the same paper a few years earlier.
13 professional programmers asked to sort variables into groups based on similarity. Claims that grouping agrees with roles, but not clearly falsifiable.
12 students. Field-dependency test. Taught variable roles. Visualisation of pascal program, asked to write program summaries. Questionnaire about tools.
Tool that records stack of open tasks, notes and queue of pending tasks.
(Particularly interesting, because I made a similar tool for myself but didn’t use it for very long.)
Talk-aloud of subjects working on sheep-dog game. No real results yet.
Proposes framework for studying social dynamics in teams.
57 students. Motivated Strategies For Learning Questionnaire. Modified version of Rosenberg Self-esteem questionnaire. Modified version of Computer Programming Self-Efficacy Scale.
Sample’s test scores were representative of whole class.
Intrinsic motivation and self-efficacy both correlated with performance at ~0.5.
Argues that live coding is an interesting topic for PPIG.
Mines existing published interviews with famous computer scientists for quotes on creativity.
Studying code excerpts from Java mailing lists. Couldn’t reject the null hypothesis. Not sure I understand the null hypothesis anyway.
Plans to study whether mental models of code build on existing spatial navigation strategies.
20 students. Group 1 taught traditionally. Group 2 taught using roles. Group 3 taught using roles and animator. Only group 3 had any students who successfully completed programming task.
Unsurprising observations on names.
Plans to study general principles of api design rather than just specific apis.
2 programmers talk-aloud at work. Try to generalize from this a model of how programmers look for information.
Still talking about grounded theory without yet reporting any results.
Studying spatial and social metaphors in javadocs. Don’t see any actual examples, just graphs.
Does gaze predict comprehension? Highly forking, no strong signal.
Ethnographic studies at software company. I don’t really know how to summarize this, but it was interesting.
Ethnographic study. No real information.
Studying programming environments by the extent to which one can make useful programs without leaving various subsets of features.
Demonstrates that eg hello world in C or Java involves a large subset of the language features.
Arrange feature subsets as a lattice and examine what can be done in each subset.
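Roughly, a sketch of the idea with hypothetical features and feature requirements (not the paper’s actual data):

```python
from itertools import combinations

FEATURES = ("variables", "loops", "functions", "objects", "io")

# Hypothetical minimum features needed to write each example program.
PROGRAMS = {
    "hello_world": {"io"},
    "sum_list": {"io", "variables", "loops"},
    "bank_account": {"io", "variables", "functions", "objects"},
}

# Enumerate the subset lattice (ordered by inclusion, here walked by size)
# and report which programs each feature subset can express.
for size in range(len(FEATURES) + 1):
    for subset in combinations(FEATURES, size):
        writable = [p for p, needs in PROGRAMS.items() if needs <= set(subset)]
        if writable:
            print(sorted(subset), "->", sorted(writable))
```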
Object-first teaching leads to students who can’t code for-loops. Proposes variable-first teaching - start with teaching variable roles and control flow.
Previous study noted that consistency of mental model in early tests was predictive of long-term success in programming courses. Criticized for being too vague to replicate. Introduces a new marking scheme that doesn’t require marker judgment.
15 devs and product managers interviewed on design decision process. Little input from professional designers during process. Main factor in decisions was consistency with existing design.
Tutor program combines explanation with multiple-choice tests. Improvement in test results after use was significant but effect size is small.
Somewhat unclear. Talks about ‘threshold concepts’ - concepts that once learnt change the way the student views computing. Then presents a simplified model of a computer.
Proposes vocab for discussing scm in terms of Vygotskyian model. I don’t actually see any vocab though.
Studies Python Enhancement Proposals. Interviewed 10 active python community members. No obvious conclusions.
Interviewed (all 60?) students. Asked why they didn’t use state diagrams for their concurrent systems homework.
Students who used state diagrams did better on their homework.
Widespread reports that state diagrams take too much effort.
Ethnographic study of XP team. Nothing surprising.
Argues OSS communities need to be studied more.
Survey of existing work on XP and pairing.
Anecdote about a programmer with poor communication skills.
Does what it says on the tin.
Interviewed team leaders of 6 person locator websites. None looked for existing sites or teams before starting. All resisted aggregators combining their content.
21 students. Asked to find various program fragments. Two syntax highlighting schemes had no significant effect vs black-on-white control.
Comparing structured editing to text editing. No results yet.
Proposes hierarchy of understanding for OOP execution model. 125 students given questionnaire followed by test. Results appear to agree with hierarchy, but some questionable change to data before analysis.
Model that combines UML and B. Compared to B alone. 41 students assigned comprehension and modification tasks. UML-B scored higher.
10 students given CD questionnaire on UML-B.
Give students recursive definition of a language, ask them various questions about it and then ask them to write various recursive functions on the language.
Talks up the benefits of this approach but it’s not clear to me from a quick reading what the difference is.
Survey given to 455 students with 23% reply rate. Students trust automated systems more. Worried that human contact would be reduced.
19 students given Autism Research Centre’s EQ and SQ tests. Both correlate with programming test (.44, -.45, p~=.05), combined score correlates more (.67, p=.002). (Combined score also known to correlate with autism spectrum).
Suggested explanation is that high SQ-EQ makes students more likely to rack up hours messing with computers.
49 students in MSc IT. Mental rotation test correlates with success at 0.48. Acknowledges that students with high spatial ability are known to be more likely to pick engineering courses.
Argues that students struggle in programming because they lack basic problem solving skills a la Polya. Proposes tool that nudges students through the basic process eg making them list what data is known and what unknowns need to be solved.
Make 72 programmers watch short video clips before debugging test. Small effect on arousal axis, but poor power.
Use machine learning for understanding programs. Example of extracting variable roles. Not clear what the learning algorithm is or how it’s applied.
Procedural to OO has been a big shift for teaching. Can’t assume that psych results will carry over.
Some of the things that it claims are totally unstudied seem to have been published in previous PPIG papers.
In-field observation and interviews of 10 programmers working on large complex problems. No body.
Anecdotal observations of teaching kids to program using narratives and variable roles - placing kids in the role of a particular variable and playing out live.
Advocates for process of explicitly stating models and (manually or automatically) looking for contradictions in reality to refine the model.
Main example is software tool for expressing explicit models of how software operates. User maps model to code. Tool shows where system violates their model eg calls between two components that don’t have a dependency in the model.
Code search that is aware of documentation? Too much jargon, no idea what the tool actually does.
Eye-tracking data is hard to analyze. Complex tasks, difficult to map into simple components for hypothesis testing.
Broke previous data into small chunks and looked at trends over time and switching patterns between activities.
PlanAni again. 24 students. Two tests under eye-tracking - 1) predict variable values 2) choose inputs to produce particular values.
Various numbers produced but no idea what they were actually trying to find out.
Failed experiment. Subject sampling/participation messed up by other commitments. Much of the data is missing because subjects didn’t get around to writing it down. Two groups behaved so differently in collaborative stage as to be effectively different experiments.
Proposes language design to focus on variable roles.
Attempting to use GT initially failed. Transcription of recording was impractical so directly annotated the video. Overwhelmed by possible choice of features because they had no actual goal.
Suggests choosing feature categories before starting coding, using structured names for concepts, making a UML model for results, pair coding.
Want to be able to group students by current skill level for better targeted teaching.
Gave a test that follows Bloom’s Taxonomy to 254 students. Cluster students by results - observed non-linear skill progression.
Anecdotal report from team of PhD students using XP.
GT concepts for pair programming.
Several examples of mapping studies. Notes that coverage is low, both in the mapping studies themselves and in the areas they address.
Trying to determine mood from mouse and keyboard logs. Analysis is awful.
OOPy folks argue that thinking in terms of objects is innate. Relates various psych findings that don’t seem to actually be related to OOP objects. Confusion over the word.
Models code cognition in terms of various memory types.
Designing software with multiple views. Not strongly justified.
Reality is messy, has to be abstracted away to fit in tidy computers. What could go wrong?
Not very convincing examples.
Tool that infers regexes from positive/negative examples. Evaluation with 6 subjects vs manual editing with Word and vs manual rename with Adobe Bridge. Large effect size.
Culture clashes between devs and scientists, especially in waterfall style dev.
More mailing list data-mining. Conclusions are banal.
74 students take MBTI test and code comprehension test. Introversion slightly predicts success.
Largely seems to be about integrating mock-ups and end-user testing into XP workflow.
Using Markov Clustering to try to retrieve logical cell blocks. Minimal evaluation.
LOGO clone. Slight improvement in course participation in same year.
Suggests that loops are hard because there is not a 1-1 mapping between text and execution.
Data-mining Eclipse repo. Inheritance depth did not predict refactorings.
Testing out systematic lit review with 1 student. Seems to work.
10 students and 6 professionals recorded pair programming. Significant differences in their behavior mean that past studies on students probably can’t be extrapolated.
Subjects given two comparisons eg A>B and B>C and have to report what mapping from A,B,C to small,medium,large can be deduced. Success rates between 50% and 100% depending on class of problem.
10 subjects. Three sessions. Small but significant increase in performance over time.
10 subjects. Replaced middle session with ‘training session’ - same interface but explains answers. Near 100% success rates on last session.
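The spread in success rates makes sense given the task structure: some comparison pairs determine the whole mapping, others leave it ambiguous. A hypothetical solver for the task (my reconstruction, not the paper’s):

```python
from itertools import permutations

def deducible(comparisons):
    """Facts that hold in every assignment of A, B, C to small/medium/large
    consistent with comparisons like ('A', '>', 'B')."""
    rank = {"small": 0, "medium": 1, "large": 2}
    consistent = []
    for perm in permutations(rank):
        env = dict(zip("ABC", perm))
        if all((rank[env[x]] > rank[env[y]]) == (op == ">") for x, op, y in comparisons):
            consistent.append(env)
    return {v: consistent[0][v] for v in "ABC"
            if all(env[v] == consistent[0][v] for env in consistent)}

# A>B and B>C determine everything; A>B and A>C only determine A.
print(deducible([("A", ">", "B"), ("B", ">", "C")]))  # all three deduced
print(deducible([("A", ">", "B"), ("A", ">", "C")]))  # only {'A': 'large'}
```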
General discussion of nature vs nurture as it pertains to psychometrics. Doesn’t really say anything specific about programming.
Observation of 10 grad students learning Java. GT concepts again.
Interviewed software architects about projects. Found that they weren’t really focused on the projects so the interviews were pointless.
Text-mining JDK. Doesn’t seem to have any new results from their past paper.
6 attempted replications of consistent-mental-model-predicts-programming-ability experiment.
First failure had very high pass rates - almost all of the students were consistent. Second failure was a test given after the course - one would expect a programming course to teach students to pass the test (this objection doesn’t actually seem to me to make sense).
Improved protocol, detailed in previous paper.
Next four replications from author and collaborators. All seem to be somewhat successful.
Eye-tracking during pair programming. Method and results are not really clear.
Yet more mailing list mining. Low response rate. No other interesting observations.
10 C++ and 10 Java projects out of top 100 sourceforge downloads. Conclusions seem to be far too strong given the metrics examined eg low number of protected methods -> not enough use of information hiding.
Reflexion modeling for fighting architectural drift. 2 year longitudinal study at IBM Dublin. Discovered many violations of the model, but they generally weren’t removed. Designed a CI version of the tool to catch violations as they happen.
Tries to break effective use of abstraction down into subskills.
12 subjects given code comprehension task in either Java or Scala under eye-tracker. The dense Scala code was more quickly understood.
Survey answered by 23/60 professionals in single company. Reported poor overall visibility of test state, poor coordination between departments.
6 professionals talk-aloud in maintenance tasks at work. Tries to code utterances into Bloom taxonomy. Not clear what they are looking for.
21 students asked to debug spreadsheet that uses named ranges. Uses previous experiment as control. Worse performance than control.
Tries to justify use of separate control by comparing error distribution to control group in a different (and properly randomized) experiment.
Tool opens source code inline instead of jumping to different file.
7 subjects given various comprehension tasks. 14% faster with inline. No significance test.
Problem-solving course. Students reportedly approve. No other evaluation.
Basic overview of memory models.
Case study - 3 years, electronic patient record system in ICU. Customised in the field by users. Hired a professional programmer for 5 days to contrast their mental model with users.
Modifications made live with little version control tooling. Users willing to make long-term learning efforts, but only when there is a clear need/payoff. Not interested in understanding for its own sake.
Programmer spent a lot of time trying to teach inheritance and talking about OOPy models of cars. Users completely exasperated at this apparent waste of time.
UML diagrams for concurrency. 15 students given comprehension test on concurrent system. Miscomprehensions are tangled chains of mistakes. Tries to categorize them by semantic level at which they occur.
Tendency to write tests that confirm hypotheses rather than refute them.
88 subjects drawn from 4 companies and from grad school. Wason’s rule discovery task and selection task. Students did better. Don’t see a control for IQ or similar.
Mined 30k pipes. Most pipes are DAGs (ie no use of looping constructs). Most pipes use a small subset of the available constructs. Most pipes are hardwired - number of exposed parameters fits exponential distribution.
39 students in control group taught with trace tables. 61 students taught with diagrams of memory layout. Different teachers. (Numbers suggest not randomly assigned?).
Simple programming exam. Tiny but significant effect size.
Interviewing XP programmers to determine their goals. No attempt to determine reliability or validity of their methods.
Case study with Scratch. Given LOGO test at middle and end - small improvement. Main upside is more enthusiasm, less frustration (compared to?).
Breaks liveness into levels. Compares different tasks in programming and music.
Neat diagrams of feedback loops in various systems.
Designed self-efficacy questionnaire for APIs. Don’t understand their attempt to prove that the test is reliable. They note that their other experiments have terrible power.
Case study of distributed pair programming. Small number of sessions, small number of subjects.
Ask students about recollections of efficacy at various points in time. Results are confused.
List of heuristics novices might benefit from.
Teaching binary search. Same as previous paper, I can’t figure out what the contribution is.
Looks at previous studies of distribution of MBTI types in devs.
Programming interactive apps is hard. Proposed advances from academia have not taken off. Big list of properties that make interactive software different to purely computational software.
IACHE inventory given to 72 high school students in Brazil and 258 in Portugal. No conclusive differences between the two.
Specializing CD questionnaire to POS software.
211 undergrads from 3 schools. Comprehension test on OOPy and non-OOPy programs in VB and Java. Students given non-OOPy did worse, but looking at the graph the only major difference is in the ‘Class’ category of questions, which seems like an obvious conclusion?
Media designers often need to reuse ‘code’ eg designing same dvd menu for many different markets. Presents a tool that separates style from content.
Proposes trying to relate eye-tracker data to complexity metrics.
Subjects asked to query or edit typical office spreadsheets on a phone. Problems with cell selection, character selection, inconsistency between platforms, having to focus on keyboard while editing.
Software that lets users train classifier on their own visual syntax, map properties of syntax to synth inputs and then play music by pointing the camera at notation.
Many usability complaints.
Proposes a sort of high-level virtual machine with visualization + replay as a teaching tool.
Interviews with 13 professionals. Motivated by the work itself, and the aesthetics for their code. Demotivated by external obstacles to producing satisfying code.
CS should borrow ideas from human error research.
12 students of various ages. Even experienced programmers struggled with the tasks. Loop blocks confused most subjects.
36 students taught an OOPy UML. When asked to choose between two models, more preference for simple models when expressed as UML vs informal.
Classifier for sort algorithms with 87% accuracy on student submissions. Wants to improve it so it can give automatic feedback to students.
(One of the MIT MOOCs uses a semi-interactive classifier to help grade and give feedback on the huge number of student submissions, so it’s plausible in practice.)
CS students are often shy, and report difficulties with asking questions in class or asking for help from tutors.
Applying CD to design to a language.
Make generative testing map to more concrete natural inputs where appropriate eg random names rather than random strings. No evaluation yet.
Timing visualisation tool for hard real-time. Proposes researching such.
Trying to relate OOP to cognitive uses of abstraction, I think?
Speculates on reasons for poor code. Nothing surprising.
Interviews with students suggest that it leads them to view the system from the outside rather than from the inside.
Single session outreach. 135 school students in total.
Pre/post programming tests show improvements in most areas, but suspicious that it had to be broken down into areas to isolate the negative area.
Focus on knowledge seems weird for outreach anyway, would have expected them to survey attitudes.
Students report emotional state in class either via desktop widget or via a hand-held ball gadget thing. Students liked the ball as a fidget toy and as a signaling mechanism.
I kind of want one.
Tool that assigns reputation to code, and then to programmers based on their interaction history with that code.
Past experiments found correlations of .64 and .88 with actual reputation (from surveying peers?).
Field study in postgrad lab. Does not appear to increase actual code quality. Students reported perception of unfairness and opacity. Reputation score not seen as a high priority.
Piaget strikes again. Still have no idea what I’m supposed to be learning from these stories.
Computer Anxiety Rating Scale and five-factor personality test given to >100 biz students. Agreeableness and emotional stability negatively correlated with computer anxiety, explaining 38% of variance.
Common problems with field studies. Some are surprising, eg the need for extension cables.
Proposes list of qualities to bear in mind when designing learning environments.
Gave either old list or new list to 9 students and ask them to do usability reviews. No clear conclusions.
Prototype compilers that give feedback on possible performance problems. Hard to evaluate because users mostly disabled the popups.
Pair programming under eye-tracking. Different gaze patterns between experts and novices.
Proposes testing whether restricted focus tools alter programmers behavior.
92 students get normal instructor. 14 volunteer students get MTL tool instead. Course score very slightly higher for volunteers.
Adjusted consistent-model test for online testing.
126 high school students in UK. Almost a third scored highly, mostly with an underlying model of parallel execution of lines. Post-test interviews confirmed that the models recognized by the test were accurate. Students who were labeled unrecognized reported multiple models, switching models partway or ‘shrug’.
92 undergrad students in Mexico. Around half marked algorithmic, similar to original experiment.
Interviews used to refine questions and marking scheme for future tests.
Expanding their algorithm classifier. On 222 implementations from textbooks and webpages, gets 94% accuracy.
Interesting. Not sure what to summarize.
Semantic web ontologies organized around strict hierarchies. Probably not a good map to human knowledge.
Planning to use virtual robots for teaching.
Interviews with 7 devs from a single (academic?) institution. Tries to categorize response to errors. Thrashing - undirected, random. Tolerating - just ignore the error. Compromising - hack something together for the sake of progress.
I don’t know why there is only one paper listed for 2013.
Not totally clear on what meta-ethnography entails. Description is surprisingly similar to the research process described in How To Read A Book.
Proposed data-collection PhD project.
Biz students given spreadsheet tasks. Text and number entry was faster in google sheets, eyeballed at about 2/3 mean time. Formula entry faster in Excel.
Kind of a pointless comparison.
No new results in here.
Proposed PhD project.
Interviews with devs at Google and Autodesk on their use of VCS. Common themes: high understanding of concepts, ritualized interaction staying with narrow paths, signs of fear and uncertainty. (Slapping down the oft repeated assertion that git is only confusing if you don’t understand the underlying model).
Speculates that dependencies on undisplayed state (eg staging area, current branch) and premature commitment (eg many operations are potentially destructive to the working tree) are the main culprits, which is why existing guis don’t alleviate the problem.
Suggests representation of time as the most problematic area.
Two online tutors. >2000 students from 56 schools. Affective learning questionnaire - no significant interactions with gender, race or subject. Male students and Caucasian/Asian students scored better on pre-tests for arithmetic tutor. Non-CS students improved more between pre- and post-test.
Ask students to trace code and to explain what the code is supposed to do. Identifies a group of students who can trace code but can’t explain it. In talk-aloud, these students struggle to move from concrete values to abstract sets.
(Implies that laddering up is an additional skill on top of execution models.)
298 subjects. Task-switching test. Results comparable to similar published tests.
65 students with some experience + 45 novices. Significant correlation between task switching test score and average grade at -.34, but no significant correlation with final exam grade or credits received.
40 C++ samples from 14 students. Extract identifiers. No correlation between years of experience and number of not-in-stdlib concepts.
We don’t know how to interpret all this eye-tracking data. Proposes a scheme for developing coding schemes for eye-tracking data.
12 novices. Given tutorials on either Sonic Pi or Kids Ruby. Questionnaire on motivation and background. Sonic Pi users typed more, recalled more commands. No notable differences in qualitative questions.
Anecdotal reports. Not summarizable.
Looking for tests that predict computational thinking. Tried D.48 (abstraction, relation, analogy), spatial reasoning test, GATB (arithmetic reasoning and tool matching subsets only).
12 students. Used raw test scores only. Sample too small to obtain significant results, but D.48 and spatial reasoning show correlations ~.5 with exam grades.
Project helping 5 artists work with RPi. Frustration with setup cost and lack of portability.
101 students in intro course with 100 assignments, broken into 170 tasks. Automated test system collects snapshots every time it is run.
Majority of students build programs up incrementally and pass tests one by one.
Similar consistent-model test. 360 students at 2 unis. High and significant correlations with eventual grades.
Live programming, computers as personal expression.
Programming environment that supports direct manipulation, block editing and raw text editing.
21 school students. Taught in 4 groups, each introducing interactions in different orders. Pre/post-test of syntax knowledge.
Analysis is confusing, but conclusion suggests that introducing block editing before text editing was beneficial.
Sociolinguistics? Demonstrates that natural language approaches don’t carry over directly.
Noodle, a VPL for blind children. Work in progress.
Distributed cognition. Anecdotal reports from users of NetLogo.
Trying to tease out what people in various domains mean by ‘object’ by showing them tables of data and asking how many objects are in view.
6 devs, first given programming task and then later asked to narrate the recording. Code completion used for checking method names, api exploration, catching bugs (eg if completion list looks wrong, look for bugs in code leading up to this line).
Mining names in 60 java projects. Not sure what the point is.
Properties that need to be exposed in end-user machine learning tools. How sure is the answer, how well does the program understand the domain, how complex was the method used to arrive at the answer.
Using EEG to measure cognitive load. Not clear how useful it is compared to just asking.
Describes a programming model meant to allow many different editing representations. Without either examples or formal specs it’s hard to understand how it’s intended to work.
Trains a readability model for tests. Outperforms generic readability models. Develop optimization process that produces average readability improvement of 1.9%.
Proposed redesign of speech-controlled tool for writing math formulae (which is a weird idea in the first place - even between humans face-to-face there is still a preference for writing formulae).
Hard to find small problems - even experienced programmers can’t do as much as they think in a 40 min class.
Adults struggle to learn because afraid of / unwilling to make mistakes.
Again, where is the content?
Ideas > tech. Computational thinking > programming. Progress in new UK curriculum.
10 subjects under eye-tracker. Mental execution task. Syntax highlighting reduces completion time. Stats are a bit suspect.
Notably, under syntax highlighting there are far fewer fixations on keywords. (Maybe colour enables peripheral vision?)
10 subjects. Writing and debugging tasks. Significantly faster completion with syntax highlighting (looks like ~25% faster, weird lack of numbers in this section). Programming and musical experience have no significant effect.
How to teach students about quality of code. Not much here yet, all looking forward to future work.