Monthly Archives: July 2018

Open Science: My List of Best Practices


This has nothing to do with Open Science. I just piled these rocks up at Lake Huron

Are you interested in Open Science? Are you already implementing Open Science practices in your lab? Are you skeptical of Open Science? I have been all of the above, and some recent debates on #sciencetwitter have been discussing the pros and cons of Open Science practices. I decided to write this article to share my experiences as I’ve been pushing my own research in the Open Science direction.

Why Open Science?

Scientists have a responsibility to communicate their work to their peers and to the public. This has always been part of the scientific method, but the methods of communication have differed throughout the years and differ by field. This essay reflects my opinions on Open Science (capitalized to reflect this as a set of principles), and I also give an overview of my lab’s current practices. I’ve written about this in my lab manual (which is also open), but until I sat down to write this essay, I had not really codified how my lab and research have adopted an Open Science practice. This should not be taken as a recipe for your own science or lab, and these ideas may not apply to other fields. This is just my experience trying to adopt Open Science practices in my Cognitive Psychology lab.

Caveats First

Let’s get a few things out of the way…

First, I am not an expert in open science. In fact, until about 2-3 years ago, it never even occurred to me to create a reproducible archive for my data, to ensure that I could provide analysis scripts to someone else so that they could reproduce my analysis, or to provide copies of all of the items / stimuli that I used in an experiment. I’ve received requests for data before, but I usually handled those in a piecemeal, ad hoc fashion. If someone asked, I would put together a spreadsheet.

Second, my experience is only generalizable to other comparable fields. I work in cognitive psychology and have collected behavioural data, survey questionnaire data, and electrophysiological data. I realize that data sharing can be complicated by ethics concerns for people who collect sensitive personal or health data. I also realize that other fields collect complex biological data that may not lend itself well to immediate sharing.

Finally, the principles and best practices that I’m outlining here were adopted in 2018. Some of this was developed over the course of the last few years, but this is how I am running my lab now, and how I plan to run my lab for the foreseeable future. That means there are still gaps: studies that were published a few years ago that have not yet been archived, papers that may not have a preprint, and analyses that were done 20 years ago in SAS on the VAX 11/780 at University at Buffalo. If anyone wants to see data from my well-cited 1998 paper on prototype and exemplar theory, I can get it, but it is not going to be easy.

Core Principles

There are many aspects to Open Science, but I am going to outline three areas that cover most of these. There will be some overlap and some aspects may be missed.

Materials and Methods

The first aspect of Open Science concerns openness with respect to methods, materials, and reproducibility. In order to satisfy this criterion, a study or experiment should be designed and written in such a way that another scientist or lab in the same field would be able to carry out the same study if they wanted to. That means that any equipment that was used is described in enough detail or is readily available. This also means that computer programs that were used to carry out the study are accessible and the code is freely available. As well, in psychology, there are often visual, verbal, or auditory stimuli that participants make decisions about or questions that they answer. These should also be available.

Data and Analysis

The second aspect of Open Science concerns open availability of data that have been collected in the study. In psychology, data take many forms, but the term usually refers to responses by participants on surveys, responses to visual stimuli, recordings of EEG, or data collected in an fMRI study. In other fields, it may consist of observations taken at a field station, measurements taken of an object or substance, or trajectories of objects in space. Anything that is measured, collected, or analyzed for a publication should be available to other scientists in the field.

Of course, in a research study or scientific project, the data that have been collected are also processed and analyzed. Here, several decisions need to be made. It may not always be practical to share raw data, especially if things were recorded by hand in a notebook or if the digital files are so large as to be unmanageable. On the other hand, it may not be useful to publish data that have been processed and summarized too much. For most fields, there is probably a middle ground where the data have been cleaned and minimally processed but no statistical analyses have been done, and the data have not been transformed. In my experience so far, this is one of the most difficult decisions to make. I don’t have a solid answer yet.

In most scientific fields, data are analyzed using software and field-specific statistical techniques. Here again, several decisions need to be made while the research is being done in order to ensure that the end result is open and usable. For example, if you analyze your data with Microsoft Excel, what might be simple and straightforward to you might be uninterpretable to someone else. This is especially true if there are pivot tables, unique calculations entered into various cells, and transformations that have not been recorded. This, unfortunately, describes a large part of the data analysis I did as a graduate student in the 1990s. And I’m sure I’m not alone. Similarly, any platform that is proprietary will present limits to openness. This includes Matlab, SPSS, SAS, and other popular computational and analytic software. I think that’s why you see so many people who are moving towards Open Science practices encouraging the use of R and Python, because they are free, openly available, and they lend themselves well to scientific analysis.
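As an illustration only (this is not code from any of the lab’s projects), here is a minimal Python sketch of how an analysis step that might otherwise be buried in spreadsheet cells can be written down as a script that anyone can read and rerun. The data, the column names, and the 200 ms exclusion cutoff are all hypothetical assumptions chosen for the example, and it uses only the standard library.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical minimally processed data: one row per trial.
RAW_CSV = """participant,condition,rt_ms
p01,prototype,512
p01,exemplar,640
p02,prototype,150
p02,exemplar,701
p03,prototype,498
"""

# Assumed exclusion cutoff, stated explicitly so a reader can see and change it.
MIN_RT_MS = 200


def load_trials(text):
    """Parse CSV text into a list of dicts with numeric response times."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["rt_ms"] = float(row["rt_ms"])
    return rows


def exclude_fast_trials(trials, cutoff=MIN_RT_MS):
    """Drop trials faster than the cutoff; the rule lives in code, not in a cell."""
    return [t for t in trials if t["rt_ms"] >= cutoff]


def mean_rt_by_condition(trials):
    """Return the mean response time for each condition."""
    grouped = defaultdict(list)
    for t in trials:
        grouped[t["condition"]].append(t["rt_ms"])
    return {cond: mean(values) for cond, values in grouped.items()}


if __name__ == "__main__":
    trials = exclude_fast_trials(load_trials(RAW_CSV))
    for condition, value in sorted(mean_rt_by_condition(trials).items()):
        print(f"{condition}: {value:.1f} ms")
```

The point of the sketch is that every transformation (here, a single exclusion rule) is recorded in the script itself, so someone without access to the original spreadsheet can still see exactly what was done to the data.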

Publication

The third aspect of Open Science concerns the availability of the published data and interpretations: the publication itself. This is especially important for any research that is carried out at a university or research facility that is supported by public research grants. Most of these funding agencies require that you make your research accessible.

There are several good open access research journals that make the publications freely available for anyone because the author helps to cover the cost of publication. But many traditional journals are still behind a paywall and are only available to paying subscribers. You may not see the effects of this if you’re working in a university because your institution may have a subscription to the journal. The best solution is to create a free and shareable version of your manuscript, a preprint, that is available on the web and that anyone can access but that does not violate the copyright of the publisher.

Putting this in practice

I tried to put some guidelines in place in my lab to address these three aspects of open science. I started with one overriding principle: When I submit a manuscript for publication in a peer-reviewed journal, I should also ensure that at the time of submission, I have a complete data file that I can share, analysis scripts that I can share, and a preprint.

I implemented as much of this as possible with every paper that we’ve submitted for publication since late 2017 and with all our ongoing projects. We don’t submit a manuscript until we can meet the following:

  • We create a preprint of the manuscript that can be shared via a public online repository. We post this preprint to the online repository at the same time that we submit it to the journal.
  • We create shareable data files for all of the data collected in the study described in that manuscript. These are almost always unprocessed or minimally processed data in a Microsoft Excel spreadsheet or a text file. We don’t use Excel for any summary calculations, so the data are just data.
  • As we’re carrying out the data analysis, we document our analyses in R notebooks. We share the R scripts/notebooks for all of the statistical analyses and data visualizations in the manuscript. These are open and accessible and should match exactly what appears in the manuscript. In some cases, we have posted R notebooks with additional data visualization beyond what is in the manuscript as a way to add value to the manuscript.
  • We also create a shareable document for any nonproprietary assessments or questionnaires that were designed for this study and copies of any visual or auditory stimuli used in the study.
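The checklist above could be turned into a small pre-submission check. The Python sketch below is only one hedged way to do that: the folder layout and the names `preprint.pdf`, `data`, `analysis`, and `materials` are hypothetical assumptions for illustration, not a description of how the lab actually organizes its projects.

```python
from pathlib import Path

# Hypothetical layout of a submission folder; these names are illustrative only.
REQUIRED_ITEMS = {
    "preprint": "preprint.pdf",  # shareable version of the manuscript
    "data": "data",              # unprocessed or minimally processed data files
    "analysis": "analysis",      # analysis scripts or notebooks
    "materials": "materials",    # questionnaires and stimuli
}


def missing_items(project_dir, required=REQUIRED_ITEMS):
    """Return the labels of required items that are absent or empty."""
    root = Path(project_dir)
    missing = []
    for label, name in required.items():
        path = root / name
        if not path.exists():
            missing.append(label)
        elif path.is_dir() and not any(path.iterdir()):
            # An empty folder does not count as having the item.
            missing.append(label)
    return missing


def ready_to_submit(project_dir):
    """True only when every required item is present."""
    return not missing_items(project_dir)
```

Calling `missing_items("my_project")` on a hypothetical project folder returns a list such as `["analysis", "materials"]`, which makes it easy to see what still needs to be prepared before the manuscript goes out.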

Now, on this list of best practices, it would be disingenuous to suggest that every single study or paper from my lab meets all of those criteria. For example, one recently published study made use of Matlab instead of Python, because that’s how we knew how to analyze the data. But we’re using these principles as a guide as our work progresses. I view Open Science and these guidelines as an important and integral part of training my students. I view this as being just as important as the theoretical contributions that we’re making to the field.

Additional Resources and Suggestions

In order to achieve this goal, the following guidelines and resources have been helpful to me.

OSF

My public OSF profile lists current and recent projects. OSF stands for “Open Science Framework” and it’s one of many data repositories that can be used to share data, preprints, unformatted manuscripts, analysis code, and other things. I like OSF, and it’s kind of incredible to me that this wonderful resource is free for scientists to use. But if you work at a university or public research institute, your library probably runs a public repository as well.

Preregistration

For some studies, preregistration may be a helpful additional step in carrying out the research. There are limits to preregistration, many of which are addressed with Registered Reports. At this point, we haven’t done any Registered Reports. Preregistration is helpful, though, because it encourages the researcher or student to lay out a list of analyses they plan to do, to describe how the data are going to be collected, and to make that plan publicly available before the data are collected. This doesn’t mean that preregistered studies are necessarily better, but it’s one more tool to encourage openness in science.

Python and R

If you’re interested in open science it really is worth looking closely at R and Python for data manipulation, visualization, and analysis. In psychology, for example, SPSS has been a long-standing and popular way to analyze data. SPSS does have a syntax mode that allows the researcher to share their analysis protocol, but that mode of interacting with the program is much less common than the GUI version. Furthermore, SPSS is proprietary. If you don’t have a license, you can’t easily look at how the analyses were done. The same is true of data manipulation in Matlab. My university has a license, but if I want to share my data analysis with a private company, they may not have a license. But anyone in the world can install and use R and Python.

Conclusion

Science isn’t a matter of belief. Science works when people trust in the methodology, the data and interpretation, and by extension, the results. In my view, Open Science is one of the best ways to encourage scientific trust and to encourage knowledge organization and synthesis.

Cognitive Bias and the Gun Debate


image from GETTY

I teach a course at my Canadian university on the Psychology of Thinking and in this course, we discuss topics like concept formation, decision making, and reasoning. Many of these topics lend themselves naturally to the discussion of current topics and in one class last year, after a recent mass shooting in the US, I posed the following question:

“How many of you think that the US is a dangerous place to visit?”

About 80% of the students raised their hands. This was surprising to me because although I live and work in Canada and I’m a Canadian citizen, I grew up in the US; my family still lives there and I still think it’s a reasonably safe place to visit. Most students justified their answer by referring to school shootings, gun violence, and problems with American police. Importantly, none of these students had ever actually encountered violence in the US. They were thinking about it because it has been in the news. They were making a judgment about the likelihood of violence on the basis of the available evidence.

Cognitive Bias

The example above illustrates a cognitive bias known as the Availability Heuristic. The idea, originally proposed in the early 1970s by Daniel Kahneman and Amos Tversky (Kahneman & Tversky, 1979; Tversky & Kahneman, 1974), is that people generally make judgments and decisions on the basis of the most relevant memories that they retrieve and that are available at the time that the assessment or judgment is made. In other words, when you make a judgment about a likelihood of occurrence, you search your memory and make your decision on the basis of what you remember. Most of the time, this heuristic produces useful and correct evidence. But in other cases, the available evidence may not correspond exactly to evidence in the world. For example, we typically overestimate the likelihood of shark attacks, airline accidents, lottery wins, and gun violence.

Another cognitive bias (also from Kahneman and Tversky) is known as the Representativeness Heuristic. This is the general tendency to treat individuals as representative of their entire category. For example, suppose I formed a concept of American gun owners as being violent (based on what I’ve read or seen in the news). I might then infer that each individual American is a violent gun owner. I’d be making a generalization or a stereotype, and this can lead to bias in how I treat people. As with availability, the representativeness heuristic arises out of the natural tendency of humans to generalize information. Most of the time, this heuristic produces useful and correct evidence. But in other cases, the representative evidence may not correspond exactly to individual evidence in the world.

The Gun Debate in the US

I’ve been thinking about this a great deal as the US engages in its ongoing debate about gun violence and gun control. It’s been reported widely that the US has the highest rate of private gun ownership in the world, and also has an extraordinary rate of gun violence relative to other countries. These are facts. Of course, we all know that “correlation does not equal causation”, but many strong correlations do derive from a causal link. The most reasonable thing to do would be to begin to implement legislation that restricts access to firearms, but this never happens, and people are very passionate about the need to restrict guns.

So why do we continue to argue about this? One problem that I rarely see being discussed is that many of us have limited experience with guns and/or violence and have to rely on what we know from memory and from external sources, and we’re susceptible to cognitive biases.

Let’s look at things from the perspective of an average American gun owner. This might be you, people you know, family, etc. Most of these gun owners are very responsible, knowledgeable, and careful. They own firearms for sport and also for personal protection, and in some cases even run successful training courses for people to learn about gun safety. From the perspective of a responsible and passionate gun owner, it seems to be quite true that the problem is not guns per se but the bad people who use them to kill others. After all, if you are safe with your guns and all your friends and family are safe, law-abiding gun owners too, then those examples will be the most available evidence for you to use in a decision. And so you base your judgments about gun violence on this available evidence and decide that gun owners are safe. As a consequence, gun violence is not a problem of guns and their owners, but must be a problem of criminals with bad intentions. Forming this generalization is an example of the availability heuristic. It may not be entirely wrong, but it is a result of a cognitive bias.

But many people (myself included) are not gun owners. I do not own a gun but I feel safe at home. As violent crime rates decrease, the likelihood of being a victim of a personal crime that a gun could prevent is very small. Most people will never find themselves in this situation. In addition, my personal freedoms are not infringed by gun regulation, and I too recognize that illegal guns are a problem. If I generalize from my experience, I may have difficulty understanding why people would need a gun in the first place, whether for personal protection or for a vaguely defined “protection from tyranny”. From my perspective it’s far more sensible to focus on reducing the number of guns. After all, I don’t have one, I don’t believe I need one, so I generalize to assume that anyone who owns firearms might be suspect or irrationally fearful. Forming this generalization is also an example of the availability heuristic. It may not be entirely wrong, but it is a result of a cognitive bias.

In each case, we are relying on cognitive biases to infer things about others and about guns. These inferences may be stifling the debate.

How do we overcome this?

It’s not easy to overcome a bias, because these cognitive heuristics are deeply engrained and indeed arise as a necessary function of how the mind operates. They are adaptive and useful. But occasionally we need to override a bias.

Here are some proposals, but each involves taking the perspective of someone on the other side of this debate.

  1. Those of us on the left of the debate (liberals, proponents of gun regulations) should try to recognize that nearly all gun enthusiasts are safe, law-abiding people who are responsible with their guns. Seen through their eyes, the problem lies with irresponsible gun owners. What’s more, the desire to place restrictions on their legally owned guns activates another cognitive bias known as the endowment effect, in which people place high value on something that they already possess. The prospect of losing it is seen as aversive because it increases the feeling of uncertainty about the future.
  2. Those on the right (gun owners and enthusiasts) should consider the debate from the perspective of non gun owners and consider that proposals to regulate firearms are not attempts to seize or ban guns but rather attempts to address one aspect of the problem: the sheer number of guns in the US, any of which could potentially be used for illegal purposes. We’re not trying to ban guns, but rather to regulate them and encourage greater responsibility in their use.

I think these things are important to deal with. The US really does have a problem with gun violence. It’s disproportionally high. Solutions to this problem must recognize the reality of the large number of guns, the perspectives of non gun owners, and the perspectives of gun owners. We’re only going to do this by first recognizing these cognitive biases and then attempting to overcome them in ways that search for common ground. By recognizing this, and maybe stepping back just a bit, we can begin to have a more productive conversation.

As always: comments are welcome.