snps is a new open source Python package that aims to help users interact with genetic data from a variety of sources, including direct-to-consumer (DTC) DNA testing companies and whole genome sequencing (WGS) services. Specifically, snps provides tools to help with reading, writing, merging, and remapping SNPs.
Two of our project grant awardees – Kevin Arvai and Andrew Riha – have been working tirelessly to build two new web tools that can make use of your genetic data that’s stored in Open Humans in interesting ways. And their hard work has paid off: Kevin’s Imputer and Andrew’s Lineage are now available!
Imputer is designed to fill the gaps in your genetic testing data. Direct-To-Consumer companies like 23andMe usually genotype just a small fraction of your genome, generating a low-resolution snapshot across your whole genome. Genotype imputation fills in those gaps by looking at reference populations of many individuals who have been fully sequenced at high resolution and using this data to predict the missing values in your own data set. Imputer uses the reference data from the 1000 Genomes Project to perform this gap-filling and deposits the filled-in data in your Open Humans account. Kevin also provides two Personal Data Notebooks that you can use to explore your newly imputed data set. If you want to explore the quality of the newly identified variants, you can use this quality control notebook. And if you’re interested in seeing where your genome falls within a two-dimensional graph of different populations from around the globe, this notebook allows you to explore how closely you relate to other people in the 1000 Genomes data.
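To make the idea of gap-filling a little more concrete, here is a minimal Python sketch. It is not how Imputer works internally: real imputation tools rely on statistical haplotype models and very large reference panels. The tiny reference panel, the `?` marker for untested positions, and the function name `impute_missing` are all assumptions made up for this illustration.

```python
from collections import Counter

# Hypothetical reference panel: each string is one fully sequenced haplotype,
# with one allele per SNP position (toy data, not from the 1000 Genomes Project).
REFERENCE_PANEL = [
    "ACGTA",
    "ACGTA",
    "ATGCA",
    "GTGCC",
]


def impute_missing(observed, panel=REFERENCE_PANEL):
    """Fill '?' positions using the reference haplotypes that agree best
    with the alleles that were actually tested."""
    typed = [i for i, allele in enumerate(observed) if allele != "?"]

    def matches(haplotype):
        # Number of tested positions where this haplotype agrees with the sample.
        return sum(haplotype[i] == observed[i] for i in typed)

    best = max(matches(hap) for hap in panel)
    closest = [hap for hap in panel if matches(hap) == best]

    filled = []
    for i, allele in enumerate(observed):
        if allele != "?":
            filled.append(allele)  # keep what the test measured
        else:
            # Majority vote among the closest reference haplotypes.
            votes = Counter(hap[i] for hap in closest)
            filled.append(votes.most_common(1)[0][0])
    return "".join(filled)


print(impute_missing("A?G?A"))
```

The sample is compared with the reference only at the positions that were tested, and the untested positions are then copied from whichever reference haplotypes look most similar. That is the intuition behind imputation, with all of the statistics left out.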
Andrew’s Lineage brings some further tools and genetic genealogy methods to Open Humans. If you have been tested by more than one Direct-To-Consumer genetic testing company, Lineage allows you to merge those different datasets into one large file, while also highlighting the variants that came out as different between those tests. You can also lift your files to a newer version of the human reference genome, which might be needed for using your data with other tools. Furthermore, Lineage brings a lot of interesting genetic genealogy tools: It allows you to compute how much shared DNA can be found between your own data and the genetic data of other individuals, using a genetic map. You can then create plots of the shared DNA between those two data sets, determine which genes are shared between them and even find discordant SNPs between the data sets.
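As a rough illustration of merging two test results and flagging the variants that disagree, consider the Python sketch below. It is a simplification built on assumptions, not Lineage's actual implementation: the SNP identifiers and genotypes are invented, the function name `merge_genotypes` is hypothetical, and real files also need their reference builds and strands reconciled before they can be compared.

```python
def merge_genotypes(primary, secondary):
    """Merge two {rsid: genotype} dictionaries.

    Returns (merged, discrepant). When both files contain a SNP, the
    genotype from `primary` is kept, and the SNP is recorded in
    `discrepant` if the two files disagree.
    """
    merged = dict(primary)
    discrepant = {}
    for rsid, genotype in secondary.items():
        if rsid not in merged:
            merged[rsid] = genotype  # SNP only tested by the second company
        elif sorted(merged[rsid]) != sorted(genotype):
            # Allele order is ignored, so "AG" and "GA" count as equal.
            discrepant[rsid] = (merged[rsid], genotype)
    return merged, discrepant


# Invented example data from two hypothetical testing companies.
company_a = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
company_b = {"rs2": "CC", "rs3": "CT", "rs4": "GA"}

merged, discrepant = merge_genotypes(company_a, company_b)
print(merged)
print(discrepant)
```

Keeping the first file's call and reporting the conflict separately is only one possible policy. Dropping the conflicting SNP altogether would be an equally reasonable choice.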
I’ve recently joined the board of directors of Open Humans, joining the current board along with two other new directors, Marja Pirttivaara and Alexander (Sasha) Wait Zaranek. I’m honored to be in their company, and I want to take advantage of joining the board to explain how, in my view, Quantified Self and Open Humans fit together. Both communities include many people working in science and technology who take an interest in biometric data. But this isn’t enough to define a common purpose, and in fact a much deeper connection between Open Humans and Quantified Self has developed over the last few years, as each community has approached, from nearly opposite directions, a common problem: How can we make meaningful discoveries with our own personal data?
Open Humans has its roots in the Personal Genome Project,
whose purpose was to supply scientists with human genomic data so that
they could make discoveries more quickly. The geneticist George Church
created a project to sequence the genome of individual volunteers who
agreed to donate their genomic data non-anonymously, creating a common
data resource. Since many important genomic questions cannot be answered
with genome data alone, volunteers also shared other information about
themselves. The Personal Genome Project inevitably became a somewhat
more general personal data resource for science; however, with its focus
on genomic data, much relevant data, including the kind of data that
could be collected in daily life, remained out of scope.
When I first met Jason Bobe, who co-founded Open Humans with Mad Price Ball,
he was keenly interested in this question of how to connect personal
genomes with other personal data sets. Jason had worked with George
Church on the Personal Genome Project. He and Mad saw Open Humans as an
analogous effort, but one that would allow volunteers to contribute any
kind of data. The Personal Genome Project was now a decade old. Perhaps,
with deep personal data sets to work with, scientists could deliver on
the promise of genomics to revolutionize medicine, a promise that had
been long frustrated by the complexity connecting genomic data with real
I understood the goal. A few years earlier, I’d written a long Wired story about the taxonomic collaboration between Daniel Janzen and Paul Hebert.
Janzen, along with his other accomplishments, was among the world’s
most knowledgeable field biologists. Hebert had developed a genomic
assay that promised to identify animals using an extremely small region
(about 650 base pairs) of their mitochondrial DNA. Hebert was confident
in his technique, but needed to prove its utility. How could the genomic
data he was collecting be paired to real world ecological knowledge? At
their field station in the Guanacaste Preserve in Costa Rica, Janzen
and his partner Winnie Hallwachs, along with their students and
colleagues, collected hundreds of butterflies and moths, identified
them, snipped off a leg, and shipped it to Guelph, a city in Canada,
where Hebert ran the sequence. Slowly, painstakingly, they connected the
genomic data to the real world data. More than just proving that
Hebert’s technique worked, they also brought a new degree of resolution
to the ecological picture, showing, for instance, that individual
specimens, though visually almost identical as adults, may belong to distinct evolutionary clades and feed on different plants.
In my first conversations with Jason, I saw this as how Open Humans
should work. It promised to provide the “field biology” for the genomic
studies of the Personal Genome Project.
Unfortunately, as attentive readers, link followers, and experts in
the history of overconfidence in science may already have realized,
there’s a pretty serious flaw in my analogy. Paul Hebert was using the
genome to distinguish strands in evolutionary history, mostly at the
level of species. He wanted to know, given a leg, what kind of creature
it was from. Answering relevant health questions requires understanding
the world at a far more detailed level, down to extremely small
differences among individuals of the same species. The trick that Hebert
used is never going to work; and, for many of the health related
questions we care about, nobody knows the tricks that will work. Fifteen
years after the launch of the Personal Genome Project, it continues to
supply data resources to basic science, but its relevance to medicine
remains mostly a promise.
In the Quantified Self community the focus has always been on
individual discovery: How can we learn about ourselves using our own
data? Many of the questions addressed by people doing their own QS
projects relate to health and disease. Browse the archive of Quantified Self Show&Tell
presentations and you’ll find projects on Parkinson’s disease,
diabetes, cognitive decline, cardiovascular health, depression, hearing
loss, and many other health related issues. The kind of “everyday
science” practiced in the Quantified Self community can be understood as
the opposite of genome-wide association studies. Instead of
finding small, telling differences among groups of people, the everyday
science of the Quantified Self finds large effects within a single
person who is both subject and scientist.
This comes with its own kinds of difficulties. People doing
Quantified Self projects related to health face a number of discouraging
barriers, including lack of access to their own data and medical
records, bureaucratic roadblocks and exorbitant costs in ordering their
own lab tests, problems in acquiring the requisite domain knowledge to
test their ideas and interpret their data, and – perhaps most
discouraging to people who are dependent on medical professionals for
some aspect of their care – lack of recognition in the health care
system that self-collected data can be useful for making decisions about
In the 11 years since Quantified Self started, participants have
tried many different ways to overcome these barriers, both individually
for their own projects and systematically through creating tools and
advocating for better policies. One of the lessons from this work is
that while the focus of self-tracking projects is typically on
individual learning, the methods required to make sense of our data
often require collaboration. Existing systems are not designed to
provide support for the kind of highly individualized reasoning we do;
therefore, we have to build a new system. Key requirements of this new
system include: private, secure data storage; capacity to integrate data
from commercial wearable devices; fine-grained permissions allowing
sharing of particular data with particular projects, and withdrawal of
permission; capacity for ethical review both to protect individual
participants and to enable academic collaborations.
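The requirement for fine-grained permissions with withdrawal can be pictured with a small data structure. The Python sketch below is purely illustrative and assumes nothing about how any existing platform stores permissions. The class name `Member`, its methods, and the example data source and project names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """A participant who decides, per data source, which projects may read it."""

    name: str
    # Maps a data source name to the set of projects allowed to read it.
    grants: dict = field(default_factory=dict)

    def grant(self, source, project):
        """Allow one project to read one particular data source."""
        self.grants.setdefault(source, set()).add(project)

    def withdraw(self, source, project):
        """Withdraw a permission; does nothing if it was never granted."""
        self.grants.get(source, set()).discard(project)

    def can_read(self, source, project):
        return project in self.grants.get(source, set())


member = Member("participant-1")
member.grant("ring-temperature", "cycle-study")

print(member.can_read("ring-temperature", "cycle-study"))  # shared with this project
print(member.can_read("genome", "cycle-study"))            # never shared

member.withdraw("ring-temperature", "cycle-study")
print(member.can_read("ring-temperature", "cycle-study"))  # permission withdrawn
```

The point of the design is that sharing is decided per data source and per project, and that the default answer is always "no".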
Two years ago, we organized our first participant-led research
project in the Quantified Self community. A group of about two dozen of
us measured our blood cholesterol as often as once per hour, exploring
both individual questions about the patterns and causes of variation in
our blood lipids and a common group question about lipid variability. We
had a pressing need for some collective study infrastructure, but there
was no available tool that worked for our needs. We took a DIY approach
and at the end of the project we’d learned a tremendous amount both
about our own varying cholesterol and about the process of self-directed
research. (Our paper, “Approaches to governance of participant-led research,”
has recently been published in BMJ Open; our paper on our collective
discovery about lipid variability has been accepted for publication in
the Journal of Circadian Biology; we’ll add a URL when we have it.)
At the conclusion of our study, one of the participant organizers,
Azure Grant, decided to press ahead with another participant-led study
on ovulatory cycling. Azure had already presented a self-study on using continuous body temperature to predict ovulation
at a Quantified Self conference. Now, she wanted to organize a group of
self-trackers to try something similar, but integrating newer
measurement tools to acquire higher resolution data. Among these tools
was the new version of the Oura ring,
which offered body temperature, heart rate, and sleep data. This idea
put new demands on our study infrastructure. Thanks to generous
collaboration from Oura engineers, we could offer participants access to
detailed data from their rings. But how could this data be stored
privately and controlled by each individual, while also being available
using fine-grained permissions to their fellow participants and study
organizers? How could this data be integrated with other data types they
might decide to collect during the project? Where was there
infrastructure for a “field biology” of the self?
We turned to Open Humans. The personal reasons were as important as
the technical ones. Mad Ball, along with her work leading Open Humans,
is a longtime participant in the Quantified Self community, who has
consistently advocated for non-exploitive approaches to handling
personal data, and has contributed the results of her own self-directed
research. (See Mad’s recent talk on “A Self-Study Of My Child’s Genetic Risk.”) And Bastian Greshake Tzovaras, the Open Humans research director, quickly proved to be an extremely sensitive and skilled collaborator. Bastian co-founded openSNP,
a grassroots effort that outgrew the Personal Genome Project by supporting
citizen science participation. (Currently, there are more genotyping
datasets publicly shared in openSNP than all other projects in the world
With help from Mad and Bastian and the Open Humans infrastructure, we
built our next stage study workflows with encouraging speed and
harmony. Fundamentally, we found ourselves aligned on the core idea that
research processes designed around personal data sets should be built
to protect individual agency, even where this requirement creates
friction for academic collaborators. The rarity of this commitment may
only be obvious to those few people who have gotten painfully deep into
the workflows of study infrastructure. (And I recognize that a post of
this length that is this deep in the weeds can have very few readers!)
But, in a way, that’s one of the beautiful things about this stage of
building a new knowledge infrastructure. We’re far enough into it to
have evidence that we’re on the right track. But we’re still close
enough to the beginning that each step is a significant contribution and
a potential model to build on.
I very much hope that over time – and the sooner the better – our
shared ideas about individual agency and everyday reasoning are embodied
in tools and policies that are so commonplace that no single
organization is responsible for them. But for now, it’s impossible not
to recognize that Open Humans is an indispensable resource, defining an
approach that needs to be developed and expanded, and managed by a team
that has deep insight into the challenges and potential of participatory
science. I look forward to building more connections between our two
We’ve got a great selection of new projects and personal data explorations for you as an end-of-year gift. Here is an overview of the data import projects recently launched on Open Humans:
Oura Ring: You can now explore your sleep habits, body temperature and physical activity data as collected by the Oura Ring.
Overland: If you are using an iPhone, you can now use Overland to collect your own geolocations along with additional data such as your phone’s battery levels over the day.
Google Location History: As an alternative way to record and import your location data, you can now import a full Google Location History data set.
Spotify: Start creating an archive of your listening history through the Spotify integration.
RescueTime: Import your computer usage data and productivity records into your account.
Read more details about those integrations below:
Connect your Oura Ring
The Oura is a wearable device well hidden inside a ring. It measures heart rate, physical activity and body temperature to generate insights into your sleep and activity habits. With Oura Connect you can set up an ongoing import of those data into your Open Humans account. You can then explore those data further thanks to the Personal Data Notebooks that are already available!
Use Google Location History to explore your location data
Thanks to our Outreachy interns we have another new geolocation data source: Google Location History. No matter if you are using an iPhone or an Android phone, you can use the Google or Google Maps app on your phone to record where you have been. Through Google Takeout you can now export this data and then load it into Open Humans and explore it through Personal Data Notebooks.
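For a first look at such an export, a few lines of Python are enough. The sketch below is hedged: it assumes the layout used by older Takeout exports, where a top-level `locations` list holds entries with `timestampMs`, `latitudeE7` and `longitudeE7` fields. Your own export may be structured differently, and the sample data here is invented. The function name `parse_locations` is hypothetical.

```python
import json
from datetime import datetime, timezone

# Invented sample in the assumed (older) Takeout layout; coordinates are
# stored as integers scaled by 1e7 and timestamps as millisecond strings.
SAMPLE_EXPORT = """
{"locations": [
  {"timestampMs": "1546300800000", "latitudeE7": 525200000, "longitudeE7": 134050000},
  {"timestampMs": "1546304400000", "latitudeE7": 488566000, "longitudeE7": 23522000}
]}
"""


def parse_locations(raw_json):
    """Convert the raw export into a list of dicts with plain degrees and UTC times."""
    records = []
    for entry in json.loads(raw_json)["locations"]:
        records.append({
            "time": datetime.fromtimestamp(
                int(entry["timestampMs"]) / 1000, tz=timezone.utc
            ),
            "lat": entry["latitudeE7"] / 1e7,
            "lon": entry["longitudeE7"] / 1e7,
        })
    return records


for record in parse_locations(SAMPLE_EXPORT):
    print(record["time"].isoformat(), record["lat"], record["lon"])
```

For a real export you would read the downloaded file instead of `SAMPLE_EXPORT`, for example with `open(path).read()`.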
Explore your music listening behaviour with Spotify data
Another Outreachy intern project was to collect your Spotify Listening History through Open Humans. Using Spotify Connect will automatically import the songs you listen to along with lots of metadata (e.g. how popular was the song at the time you listened to it?). Once you have collected some data, you can explore these through another Personal Data Notebook!
Open Humans now has over 6,000 members who have collectively uploaded over 16,000 data sets!
To share this great community effort as a resource, we wrote our first academic manuscript. In it, we describe the platform, community, and some diverse projects that we’ve all enabled. You can find a pre-print on BioRxiv.
True to the community spirit of Open Humans, we wrote the manuscript completely in public and with an open call for contributions through our Slack. Thanks to this we could gather diverse perspectives on how Open Humans can be used for both research and personal data exploration. Using existing projects and studies running on Open Humans as examples, we explore how our community tackles complex issues such as informed consent, data portability, and individual-centric research paradigms. Read more about this in the manuscript.
All of this is only made possible by your contributions to Open Humans, so we want to take this opportunity to thank you for your participation!
With Open Humans we are not only working to empower you to decide with whom to share your personal data – but also to explore your own data. With our latest project addition – the Personal Data Notebooks – we are taking a further step in that direction. Based on the increasingly popular Jupyter Notebooks, they bring together data analysis code, documentation and data visualization, with the added twist that the Personal Data Notebooks also provide simple and private access to your personal data stored in Open Humans. This not only makes it easy to write and use a data analysis – it also makes it easy to share your results without having to share your personal data with someone else. That way you can learn not only about yourself and your data, but also about how data analyses are performed.
If you want to write your own data analysis for the notebooks from scratch you can get started in Python, R or Julia. Or if you want to tweak or run existing data analysis you can use and adapt existing notebooks. In the simplest case you don’t even have to write/edit any code, as the input data are standardized according to their Open Humans data source. So for example you can easily run a Fitbit analysis notebook written by someone else right away on your own Fitbit data. To get you started we have a step-by-step guide on how to use the Personal Data Notebooks, along with a set of ready-to-use data analysis notebooks for Fitbit, Apple Health, Moves, 23andMe and Twitter archive data.
Today we’re introducing Andrew Riha, who was recently awarded one of our project grants for his tool lineage. With lineage, Andrew will make the genetic data you store on Open Humans even more useful by enabling ancestry analyses!
Hey Andrew, please give our blog readers a quick introduction about who you are!
I’m a systems engineer at an aerospace company in Southern California. I studied at Iowa State University, the University of Newcastle, and Delft University of Technology, and I have a B.S. and M.S. in computer engineering. A few years ago, I became interested in direct-to-consumer DNA testing after a friend told me about his experience with 23andMe. This interest developed into a passion, and I’m currently pursuing a graduate certificate in bioinformatics. My hobbies include running, traveling, and backpacking.
When and how did you come to Open Humans?
Director of Research, Bastian, introduced me to the Open Humans platform in early 2018. I had mentioned to Bastian that I wanted to turn my hobby open source Python project lineage into a web app, so he suggested I consider applying for a project grant.
Have you been involved in any projects on Open Humans so far, either as a participant or even running your own?
This is my first project with Open Humans. I’m looking forward to learning from others and further developing and integrating lineage into the Open Humans ecosystem as a great open source web app!
Your project lineage was awarded one of the Open Humans project grants. Can you explain to us what the project is about?
lineage is a framework for analyzing genotype files (e.g., raw data files from 23andMe, Ancestry, etc.), primarily for the purposes of genetic genealogy and ancestry analysis. It can identify DNA and genes shared between individuals, and it provides other useful capabilities such as merging raw data files from different testing companies, identifying discrepant and discordant SNPs, and remapping SNPs to different assemblies / builds.
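The idea of identifying DNA shared between two individuals can be sketched in a few lines of Python. This is a toy under stated assumptions, not lineage's actual algorithm: it treats a stretch as "shared" when both people have at least one allele in common at every consecutive SNP, measures segment length in centimorgans (cM) taken from invented genetic map positions, and ignores genotyping errors. The function names and the threshold are hypothetical.

```python
def shares_allele(genotype1, genotype2):
    """True if the two genotypes have at least one allele in common."""
    return bool(set(genotype1) & set(genotype2))


def shared_segments(snps, min_cm=1.0):
    """Find runs of consecutive SNPs where both individuals share an allele.

    `snps` is a list of (position_in_cM, genotype1, genotype2) tuples for
    one chromosome, sorted by position. Returns (start_cM, end_cM) tuples
    for runs that span at least `min_cm` centimorgans.
    """
    segments = []
    start = prev = None
    for position, g1, g2 in snps:
        if shares_allele(g1, g2):
            if start is None:
                start = position  # a new run begins here
            prev = position
        else:
            # The run (if any) ended at the previous matching SNP.
            if start is not None and prev - start >= min_cm:
                segments.append((start, prev))
            start = None
    if start is not None and prev - start >= min_cm:
        segments.append((start, prev))
    return segments


# Invented toy data: (genetic map position in cM, person 1, person 2).
toy_snps = [
    (0.0, "AA", "AG"),
    (0.5, "CT", "CC"),
    (1.2, "GG", "GG"),
    (1.8, "TT", "CC"),  # no allele in common: breaks the run
    (2.0, "AG", "AA"),
    (2.4, "CC", "CT"),
    (4.0, "GG", "AG"),
]

print(shared_segments(toy_snps))
```

Measuring segments against a genetic map rather than by SNP count means that long runs in regions with little recombination are not overcounted. A practical threshold would be considerably larger than the one used in this toy.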
How did you come up with the idea behind lineage?
After my friend told me about his experience with 23andMe, I started researching how to get tested and found the International Society of Genetic Genealogy’s wiki very helpful and informative. The wiki led me to an excellent paper by Whit Athey that discussed using genotype files to phase the chromosomes of a family group and “reverse engineer” the DNA of a missing parent in the process! So, for a CS50 final project, I challenged myself to implement Whit’s algorithm in Python, using scientific libraries and vectorized programming in order to efficiently handle and analyze the large datasets involved.
The initial algorithm implementation was successful, and lineage had begun. But, I soon realized the need for other capabilities, such as comparing / merging files from different testing companies and determining what DNA is shared between individuals so that it could be used to guide the phasing algorithm. So, lineage grew into the framework that exists today, and I eventually want to return to implementing Whit’s algorithm, applying the bioinformatics and visualization concepts that I’ve learned along the way.
Is there anything important that we didn’t cover so far that you’d like to add?
lineage wouldn’t have been possible without the knowledge and help graciously provided by so many people. It is in that spirit that I’d like to encourage others to create and contribute to open source projects – sharing your ideas and passions with the world can be a very rewarding endeavor!
Oh, and thanks Mom, Dad, grandmas, and grandpas for the genes. 🙂
How can we make it easy to add data to Open Humans?
Open Humans lives through its community of members and the projects they design. That’s why there’s a large number of tools that make the creation of these projects possible: Projects can be run right on-site, use the Python command line interface library or use generic OAuth2-based API-methods to interact with Open Humans. But one simple need remained painful: simply enabling Open Humans members to upload file(s) into your own project.
Doing this needed some fiddling. Even if you code, setting up your own website can be time-consuming and often is something you don’t want to spend a lot of time on. Along with Mad – and the great help of some of our prospective Outreachy interns – I’ve been busy reducing this pain…
Meet the oh_data_uploader template! All you need to allow Open Humans members to upload data into your project, with a one-click deployment to Heroku, for free! All of the project configuration can be done right in your browser, no assembly or coding required.
Now the process boils down to a simple 5-step guide, and instead of taking some hours to set up your own data source it should now take 5 to 10 minutes. Just use the administrative backend to fill out the configuration parameters, add the file metadata you expect, and edit the copy-text of your project website using Markdown in the same way, and you’re good to go. You can click here to see what it looks like out of the box (just ask if you want to have the demo password 😊).
Today we’re interviewing Kevin Arvai. Kevin is a bioinformatician with an interest in personal genetic data and he was awarded a project grant to implement a project that will bring genotype imputation to the Open Humans community.
Kevin, please give our blog readers a quick introduction about who you are!
I am a data scientist at a clinical genetics company in Maryland. My background and formal education are in biology; however, I completed a master’s degree in computational biology and bioinformatics. Like many, I’m riding the wave of data that our generation has found itself immersed in by competing in data science competitions and contributing to “open-” (source, science, data) projects. I’m particularly interested in machine learning and human genetics, and I’m looking forward to learning new skills by building Imputer.
Have you been involved in any projects on Open Humans so far, either as a participant or even running your own?
Not only is this my first project working with Open Humans, this is my first project as part of an open source community. Open Humans was a welcoming and collaborative group of people that encouraged my ideas, so it seemed like a perfect fit to start contributing.
Your project Imputer was awarded one of the Open Humans project grants. Can you explain to us what the project is about?
The goal of Imputer is to provide users with a more comprehensive picture of their genome. Direct-to-consumer genetics companies, like 23andMe, only genotype a small fraction of the genome. Researchers are finding new genetic locations associated with traits and diseases at a rapid pace. Users might be interested in knowing their genotype status for these new associations, but the locations may be in regions that direct-to-consumer tests are not genotyping. Imputer leverages the vast amount of genotype data made available by the 1000 Genomes Project and by the Haplotype Reference Consortium to provide Open Humans users with genotype estimates at additional locations in their genome.
How did you come up with the idea behind Imputer?
Imputer grew out of a long conversation over lunch with Bastian.
Is there anything important that we didn’t cover so far that you’d like to add?
I’d like to encourage others who are “interested in, but anxious about” contributing to open source projects to take the leap! If you’ve found this post, Open Humans is a great place to start!
President Bartlet of The West Wing calls out his famous “What’s next?” to his secretary after finishing a task.
I just defended my PhD last week, and one question came from virtually every person who attended and stayed for the after-party: What’s Next? Which initially felt a bit weird. After all, I already took my next step three months ago when I joined Open Humans as the Director of Research. But then I realized that this is a nice opportunity to reflect a bit on my first months and think about what my next goals for Open Humans are.
Where is Open Humans so far?
So far I have spent a good part of my time learning the ropes. First of all, I had to find my way into the technical infrastructure of Open Humans: learning the code base, the APIs, server setups and so on. And what better way to do this than starting my own projects? I thus integrated two new projects on Open Humans: First I connected my long-standing project openSNP with Open Humans – allowing users of both platforms to re-use their genetic data more easily. Then I started TwArχiv, which not only brings a new data source but also some data visualization to Open Humans. This integration of Twitter data will hopefully also be a first step towards a more holistic view of personal data that includes non-medical data.
Hand in hand with the technical side of things I also found my way into the community around Open Humans: learning which projects there are, how best to support them and how to grow the Open Humans community even more. I not only got to know many of the brilliant individuals inside the Open Humans community, but I also helped them to achieve their goals – be it through bug fixes, relevant connections or finding out how to optimize our website to make it work for their needs. First steps towards further community growth were also taken: We could announce the first three successful grant applications, all bringing new data sources to Open Humans. And a fourth grant announcement – enhancing existing data sets – will be out soon!
The Open Humans community is growing nicely and is becoming more and more engaged. So things are on track. But where should we go from here? And what is the larger vision? Traditional academic research – as well as corporate data silos – put themselves into the center of all data collection. Open Humans, in contrast, takes a very different approach. As Steph laid out in her blog post: Open Humans is at once a technological platform, a vibrant community, and a paradigm shift in how research is done. In addition to all these things there is one thing that I always mention when people ask me what Open Humans is: It is empowerment. Putting individuals in control of their own data and of research at large. And to me, this means more than ‘just’ giving people the choice of when and where to share their data.
What should Open Humans be?
Empowerment means giving people the opportunity and chance to explore and understand their own data. Be it on their own – or in collaboration as a community outside the traditional academic research setting. The growth of the independent Open Artificial Pancreas community – which aggregates their own data through Open Humans – is a stellar example of this empowerment. As stewards of the Open Humans ecosystem it is our responsibility to support people in running projects like these. It is up to us to make it easier to create and run projects on Open Humans – empowering more people, including those who are not programming savvy. Open Humans offers the unique chance to democratize science, enabling people outside academia to do new research that has never existed before. To pull this off we have to become more inclusive in our approach. This means getting everybody on board who has great ideas for research.
First steps in this direction have been made already: We now have a first data uploader template that allows everyone to create their own data-collecting Open Humans project while requiring zero programming knowledge. Instead, a web browser is enough to do the complete setup. A similar idea for the administration of projects should become a reality in the near future. Furthermore, we are on the way to creating shareable analysis notebooks. These can be written and run by everyone – facilitating community-driven data analysis. By becoming more inclusive we will not only see more projects on Open Humans, we will also see a much wider diversity in how these projects use data. I can’t wait to interact with all of them.
I see this diversity reflected in the kinds of data that will be on Open Humans and the kinds of research that will be done with it. Traditionally many of the projects on Open Humans have had a focus on health. But I don’t see why this should be the sole kind of research that profits from being run with and by highly involved participants. After all, while much of the Quantified Self revolves around health, it is far from the only topic: People are interested in their personal finance data, phone usage, emails and more. And so are social scientists, economists and researchers in other academic disciplines. My goal is to get these people on board for Open Humans too, showing them the huge benefit that an engaged study population offers.
Let’s just think of a simple example: Everyone can pay Twitter to get access to their firehose of data or just scrape tweets for keywords from the web. But who but Open Humans can offer potential access to 200 or more full Twitter archives that are available right now? And more importantly, who offers the possibility to get in touch with these people and as such a way to get additional metadata and consent them? The same is true for virtually all kinds of social media data and many other data types. Humans are more than their bodies, and Open Humans should reflect this.
So this is what’s next for Open Humans: Creating an ecosystem that enables the largest possible number of people to do research; that collects and enables the re-use of the most diverse set of data; and that brings together participants and researchers from all disciplines and walks of life – informing each other and creating the most interesting research.