Open Science: My List of Best Practices

This has nothing to do with Open Science. I just piled these rocks up at Lake Huron.

Are you interested in Open Science? Are you already implementing Open Science practices in your lab? Are you skeptical of Open Science? I have been all of the above, and some recent debates on #sciencetwitter have covered the pros and cons of Open Science practices. I decided to write this article to share my experiences as I’ve been pushing my own research in the Open Science direction.

Why Open Science?

Scientists have a responsibility to communicate their work to their peers and to the public. This has always been part of the scientific method, but the methods of communication have differed over the years and differ by field. This essay reflects my opinions on Open Science (capitalized to reflect this as a set of principles), and I also give an overview of my lab’s current practices. I’ve written about this in my lab manual (which is also open), but until I sat down to write this essay, I had not really codified how my lab and research have adopted an Open Science practice. This should not be taken as a recipe for your own science or lab, and these ideas may not apply to other fields. This is just my experience trying to adopt Open Science practices in my Cognitive Psychology lab.

Caveats First

Let’s get a few things out of the way…

First, I am not an expert in open science. In fact, until about 2-3 years ago, it never even occurred to me to create a reproducible archive for my data, to ensure that I could provide analysis scripts to someone else so that they could reproduce my analysis, or to provide copies of all of the items / stimuli that I used in an experiment. I’ve received requests for data before, but I usually handled those in a piecemeal, ad hoc fashion. If someone asked, I would put together a spreadsheet.

Second, my experience is only generalizable to comparable fields. I work in cognitive psychology and have collected behavioural data, survey questionnaire data, and electrophysiological data. I realize that data sharing can be complicated by ethics concerns for people who collect sensitive personal or health data. I also realize that other fields collect complex biological data that may not lend itself well to immediate sharing.

Finally, the principles and best practices that I’m outlining here were adopted in 2018. Some of this was developed over the course of the last few years, but this is how I am running my lab now, and how I plan to run my lab for the foreseeable future. That means there are still gaps: studies that were published a few years ago that have not yet been archived, papers that may not have a preprint, and analyses that were done 20 years ago in SAS on the VAX 11/780 at University at Buffalo. If anyone wants to see data from my well-cited 1998 paper on prototype and exemplar theory, I can get it, but it is not going to be easy.

Core Principles

There are many aspects to Open Science, but I am going to outline three areas that cover most of these. There will be some overlap and some aspects may be missed.

Materials and Methods

The first aspect of Open Science concerns openness with respect to methods, materials, and reproducibility. In order to satisfy this criterion, a study or experiment should be designed and written up in such a way that another scientist or lab in the same field would be able to carry out the same study if they wanted to. That means that any equipment that was used is described in enough detail or is readily available. This also means that computer programs that were used to carry out the study are accessible and the code is freely available. As well, in psychology, there are often visual, verbal, or auditory stimuli that participants make decisions about, or questions that they answer. These should also be available.

Data and Analysis

The second aspect of Open Science concerns open availability of the data that have been collected in the study. In psychology, data take many forms, but the term usually refers to responses by participants on surveys, responses to visual stimuli, recordings of EEG, or data collected in an fMRI study. In other fields, it may consist of observations taken at a field station, measurements taken of an object or substance, or trajectories of objects in space. Anything that is measured, collected, or analyzed for a publication should be available to other scientists in the field.

Of course, in a research study or scientific project, the data that have been collected are also processed and analyzed. Here, several decisions need to be made. It may not always be practical to share raw data, especially if things were recorded by hand in a notebook or if the digital files are so large as to be unmanageable. On the other hand, it may not be useful to publish data that have been processed and summarized too much. For most fields, there is probably a middle ground where the data have been cleaned and minimally processed but no statistical analyses have been done, and the data have not been transformed. In my experience so far, this is one of the most difficult decisions to make. I don’t have a solid answer yet.

In most scientific fields, data are analyzed using software and field-specific statistical techniques. Here again, several decisions need to be made while the research is being done in order to ensure that the end result is open and usable. For example, if you analyze your data with Microsoft Excel, what might be simple and straightforward to you might be uninterpretable to someone else. This is especially true if there are pivot tables, unique calculations entered into various cells, and transformations that have not been recorded. This, unfortunately, describes a large part of the data analysis I did as a graduate student in the 1990s. And I’m sure I’m not alone. Similarly, any platform that is proprietary will present limits to openness. This includes Matlab, SPSS, SAS, and other popular computational and analytic software. I think that’s why you see so many people who are moving towards Open Science practices encouraging the use of R and Python, because they are free, openly available, and they lend themselves well to scientific analysis.
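To make the contrast with an untracked spreadsheet concrete, here is a minimal sketch in Python of what a recorded analysis might look like. Everything in it is an assumption made up for illustration: the column names, the data values, and the 200 ms exclusion cutoff do not come from any real study. The point is only that each cleaning and summary step is written down where another person can read and rerun it.

```python
import csv
import io
from statistics import mean

# Hypothetical trial-level data (values invented for illustration only).
RAW_CSV = """participant,condition,rt_ms
p01,easy,512
p01,hard,655
p01,hard,90
p02,easy,498
p02,hard,701
"""

# Assumed exclusion cutoff, stated in the script rather than hidden in a cell.
MIN_RT_MS = 200


def load_trials(text):
    """Read the trial-level data and convert reaction times to integers."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["rt_ms"] = int(row["rt_ms"])
    return rows


def exclude_fast_trials(trials, cutoff=MIN_RT_MS):
    """Drop trials faster than the cutoff, so the exclusion is on the record."""
    return [t for t in trials if t["rt_ms"] >= cutoff]


def mean_rt_by_condition(trials):
    """Average reaction time for each condition."""
    by_condition = {}
    for t in trials:
        by_condition.setdefault(t["condition"], []).append(t["rt_ms"])
    return {cond: mean(rts) for cond, rts in by_condition.items()}


trials = load_trials(RAW_CSV)
kept = exclude_fast_trials(trials)
summary = mean_rt_by_condition(kept)
print(f"kept {len(kept)} of {len(trials)} trials")
print(summary)
```

The same steps could be written just as easily in an R notebook. What matters is that the exclusion rule and the summary are part of a file that can be shared alongside the data.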

Publication

The third aspect of Open Science concerns the availability of the published data and interpretations: the publication itself. This is especially important for any research that is carried out at a university or research facility that is supported by public research grants. Most of these funding agencies require that you make your research accessible.

There are several good open access research journals that make their publications freely available to anyone because the author helps to cover the cost of publication. But many traditional journals are still behind a paywall and are only available to paying subscribers. You may not see the effects of this if you’re working in a university, because your institution may have a subscription to the journal. The best solution is to create a free and shareable version of your manuscript, a preprint, that is available on the web, that anyone can access, and that does not violate the copyright of the publisher.

Putting This in Practice

I tried to put some guidelines in place in my lab to address these three aspects of open science. I started with one overriding principle: When I submit a manuscript for publication in a peer-reviewed journal, I should also ensure that at the time of submission, I have a complete data file that I can share, analysis scripts that I can share, and a preprint.

I have implemented as much of this as possible with every project and paper that we’ve submitted for publication since late 2017, and with all our ongoing projects. We don’t submit a manuscript until we can meet the following:

  • We create a preprint of the manuscript that can be shared via a public online repository. We post this preprint to the online repository at the same time that we submit it to the journal.
  • We create shareable data files for all of the data collected in the study described in that manuscript. These are almost always unprocessed or minimally processed data in a Microsoft Excel spreadsheet or a text file. We don’t use Excel for any summary calculations, so the data are just data.
  • As we’re carrying out the data analysis, we document our analyses in R notebooks. We share the R scripts / notebooks for all of the statistical analyses and data visualizations in the manuscript. These are open and accessible and should match exactly what appears in the manuscript. In some cases, we have posted R notebooks with additional data visualization beyond what is in the manuscript as a way to add value to the manuscript.
  • We also create a shareable document for any nonproprietary assessments or questionnaires that were designed for this study and copies of any visual or auditory stimuli used in the study.
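A checklist like the one above could be turned into a small pre-submission check. The sketch below is one hypothetical way to do that in Python: the folder names (`preprint`, `data`, `analysis`, `materials`) and the helper `missing_items` are assumptions invented for this example, not an actual lab layout or an existing tool.

```python
from pathlib import Path
import tempfile

# Hypothetical project layout: these folder names are assumptions for the
# example, one folder per item on the checklist.
REQUIRED = {
    "preprint": "a shareable copy of the manuscript",
    "data": "unprocessed or minimally processed data files",
    "analysis": "analysis scripts or notebooks",
    "materials": "questionnaires and stimuli",
}


def missing_items(project_dir):
    """Return the required folders that are absent or empty."""
    project = Path(project_dir)
    missing = []
    for name in REQUIRED:
        folder = project / name
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


# Demo on a throwaway project: data is present, analysis exists but is empty,
# and the preprint and materials folders have not been created yet.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "data").mkdir()
    (project / "data" / "trials.csv").write_text("participant,rt_ms\n")
    (project / "analysis").mkdir()
    demo_missing = missing_items(project)

for name in demo_missing:
    print(f"not ready: {name} ({REQUIRED[name]})")
```

A check like this only confirms that something is in each folder. Whether the scripts actually reproduce what appears in the manuscript still has to be verified by a person.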

Having laid out this list of best practices, it would be disingenuous to suggest that every single study and paper from my lab meets all of those criteria. For example, one recently published study made use of Matlab instead of Python, because that’s how we knew how to analyze the data. But we’re using these principles as a guide as our work progresses. I view Open Science and these guidelines as an important and integral part of training my students. I view this as being just as important as the theoretical contributions that we’re making to the field.

Additional Resources and Suggestions

In order to achieve this goal, the following guidelines and resources have been helpful to me.

OSF

My public OSF profile lists current and recent projects. OSF stands for “Open Science Framework” and it’s one of many data repositories that can be used to share data, preprints, unformatted manuscripts, analysis code, and other things. I like OSF, and it’s kind of incredible to me that this wonderful resource is free for scientists to use. But if you work at a university or public research institute, your library probably runs a public repository as well.

Preregistration

For some studies, preregistration may be a helpful additional step in carrying out the research. There are limits to preregistration, many of which are addressed with Registered Reports. At this point, we haven’t done any Registered Reports. Preregistration is helpful, though, because it encourages the researcher or student to lay out a list of analyses they plan to do, to describe how the data are going to be collected, and to make that plan publicly available before the data are collected. This doesn’t mean that preregistered studies are necessarily better, but it’s one more tool to encourage openness in science.

Python and R

If you’re interested in open science, it really is worth looking closely at R and Python for data manipulation, visualization, and analysis. In psychology, for example, SPSS has been a long-standing and popular way to analyze data. SPSS does have a syntax mode that allows researchers to share their analysis protocol, but that mode of interacting with the program is much less common than the GUI version. Furthermore, SPSS is proprietary. If you don’t have a license, you can’t easily look at how the analyses were done. The same is true of data manipulation in Matlab. My university has a license, but if I want to share my data analysis with a private company, they may not have a license. But anyone in the world can install and use R and Python.

Conclusion

Science isn’t a matter of belief. Science works when people trust in the methodology, the data and interpretation, and by extension, the results. In my view, Open Science is one of the best ways to encourage scientific trust and to encourage knowledge organization and synthesis.

One thought on “Open Science: My List of Best Practices”

  1. J. Colomb, @pen (@j_colomb)

    Dear Prof. Minda,
    it is great to hear about people moving toward open data and open science. I wondered whether you discussed your data management system with specialists at your university (if there are any). You may indeed save a lot of time if you try to implement best practices at the beginning of a project (before data acquisition) and not at the end (when writing the papers).
    For instance, when data is managed professionally, sharing becomes a couple of clicks: it would probably save you quite some trouble in deciding what to share, by simply sharing everything.

    I would love to hear about your experience by the way, and discuss how you would be/are trying to convince your colleagues to go the same route. I am indeed gathering such stories to build a working argumentation for the promotion of research data management and open data at http://rdmpromotion.rbind.io

    Congratulations on your initiatives!
    (comment also posted on medium)
