I am Using Other Researchers’ Data and You Could (and Should) be Too!

Ben Porter

I am in my third year of a tenure-track position in the psychology department at Mississippi State University. Most assistant professors in my position are trying to create a unique research program that advances the state of science to make a name for themselves. I, too, am striving to move science forward and trying to make a name for myself, but I am doing it in a weird way: I want to use other people’s work to do it.

A rich history of publicly accessible datasets exists in public health. Large health studies like the Framingham Heart Study and the Nurses’ Health Study have generated thousands of papers that have informed healthier, happier lives. Even with so many papers already published, new ones keep coming out (e.g., the Framingham Heart Study generated over 140 manuscripts published in 2023).

Getting data is one of the most time-consuming tasks for researchers. As a graduate student, I spent a significant portion of my time collecting data. Running participants through long baseline sessions and then trying to get them to complete daily surveys took up my mornings for about a year during graduate school. Don’t try to look up the publication; I couldn’t turn it into a paper.

After graduating, I got a job working on a large public health project. I realized that such studies collected more data than I could ever analyze over an entire career. Literally, all I needed was ideas, and then I could start analysis almost immediately. I discovered that these sources are everywhere now. Why would I spend time collecting data when I could spend it analyzing and disseminating the answers to my questions?!

Traditional sources of data are also often less than ideal. Research pools are commonly used (especially in psychology), but undergraduate students participating for class credit are typically not the most motivated participants, and the population is not very generalizable. Online sources offer comparable participant pools with a more diverse population, but collecting data from them costs money, and there is the potential for bots to take your survey. Certain companies will seek out and verify specific populations to limit bots, but the cost of such services can make large studies prohibitively expensive (at least for me!).

Publicly accessible datasets, by contrast, are often meticulously collected and validated. Data generally come pre-cleaned and formatted for use. Moreover, these datasets are often larger than anything I would be able to manage myself. Among the projects I have used to date, the Midlife in the U.S. Study has 7,108 participants, a study I conducted using the Healthy Minds Study used 119,400 observations over five years, and the All of Us Research Program has over 400,000 participants (and is still actively enrolling if you’re interested…).

Archival data generally lives in various places on the internet. However, the happiest of hunting grounds I’ve found is the Inter-university Consortium for Political and Social Research (ICPSR). Thousands of projects and datasets are shared through this resource, including large studies, small studies, and government data. Another rapidly expanding resource is the NIMH Data Archive (NDA). The name is a bit of a misnomer, as all NIH data is being held here. This resource is designed to reach across studies: if you’re interested in variables like the Beck Depression Inventory and income, all studies containing these variables can be combined into a single dataset. Furthermore, other large studies like the All of Us Research Program provide an enormous resource that is explicitly created to be available to researchers.

A big reason I like to use publicly accessible data is that I find it easy. However, whenever I talk about publicly accessible data, I always make sure to mention why using it is also good. These data are provided by participants who know that the purpose of the research is to answer questions and who want their data to be used in this manner. Using the data makes that a reality. Additionally, it makes good use of the time participants have put into a study and increases the return on the investment made to collect the data. Finally, it reduces the waste of duplicate studies whose questions could already be answered using existing data. With that, do good! Find some data and write some papers!

