Monday, August 14, 2017

PubMed Lets Google Track User Searches

CT scan of a Mesothelioma patient.
CC BY-SA by Frank Gaillard
If you search on Google for "Best Mesothelioma Lawyer" and then click on one of the ads, Google can earn as much as a thousand dollars for your click. In general, Google can make a lot of money if it knows you're the type of user who's interested in rare types of cancer. So you might be surprised that Google gets to know everything you search for when you use PubMed, the search engine offered by the National Center for Biotechnology Information (NCBI), a service of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). Our tax dollars work really hard and return a lot of value at NCBI, but I was surprised to discover Google's advertising business is getting first crack at that value!

You may find this hard to believe, but you shouldn't take may word for it. Go and read the NLM Privacy Policy,  in particular the section on "Demographic and Interest Data"
On some portions of our website we have enabled Google Analytics and other third-party software (listed below), to provide aggregate demographic and interest data of our visitors. This information cannot be used to identify you as an individual. While these tools are used by some websites to serve advertisements, NLM only uses them to measure demographic data. NLM has no control over advertisements served on other websites.
DoubleClick: NLM uses DoubleClick to understand the characteristics and demographics of the people who visit NLM sites. Only NLM staff conducts analyses on the aggregated data from DoubleClick. No personally identifiable information is collected by DoubleClick from NLM websites. The DoubleClick Privacy Policy is available at https://www.google.com/intl/en/policies/privacy/
You can opt-out of receiving DoubleClick advertising at https://support.google.com/ads/answer/2662922?hl=en.
I will try to explain what this means and correct some of the misinformation it contains.

DoubleClick is Google's display advertising business. DoubleClick tracks users across websites using "cookies" to collect "demographic and interest information" about users. DoubleClick uses this information to improve its ad targeting. So for example, if a user's web browsing behavior suggests an interest in rare types of cancer, DoubleClick might show the user an ad about mesothelioma. All of this activity is fully disclosed in the DoubleClick Privacy Policy, which approximately 0% of PubMed's users have actually read. Despite what the NLM Privacy Policy says, you can't opt-out of receiving DoubleClick Advertising, you can only opt out of DoubleClick Ad Targeting. So instead of Mesothelioma ads, you'd probably be offered deals at Jet.com

It's interesting to note that before February 21 of this year, there was no mention of DoubleClick in the privacy policy (see the previous policy ). Despite the date, there's no reason to think that the new privacy policy is related to the change in administrations, as NIH Director Francis Collins was retained in his position by President Trump. More likely it's related to new leadership at NLM. In August of 2016, Dr. Patricia Flatley Brennan became NLM director. Dr. Brennan, a registered nurse and an engineer, has emphasized the role of data to the Library's mission. In an interview with the Washington Post, Brennan noted:
In the 21st century we’re moving into data as the basis. Instead of an experiment simply answering a question, it also generates a data set. We don’t have to repeat experiments to get more out of the data. This idea of moving from experiments to data has a lot of implications for the library of the future. Which is why I am not a librarian.
The "demographic and interest data" used by NLM is based on individual click data collected by Google Analytics. As I've previously written, Google Analytics  only tracks users across websites if the site-per-site tracker IDs can be connected to a global tracker ID like the ones used by DoubleClick. What NLM is allowing Google to do is to connect the Google Analytics user data to the DoubleClick user data. So Google's advertising business gets to use all the Google Analytics data, and the Analytics data provided to NLM can include all the DoubleClick "demographic and interest" data.

What information does Google receive when you do a search on Pubmed?
For every click or search, Google's servers receive:
  • your search term and result page URL
  • your DoubleClick user tracking ID
  • your referring page URL
  • your IP address
  • your browser software and operating system
While "only NLM staff conducts analyses on the aggregated data from DoubleClick", the DoubleClick tracking platform analyzes the unaggregated data from PubMed. And while it's true that "the demographic and interest data" of PubMed visitors cannot be used to identify them as  individuals, the data collected by the Google trackers can trivially be used to identify as individuals any PubMed users who have Google accounts. Last year, Google changed its privacy policy to allow it to associate users' personal information with activity on sites like PubMed.
"Depending on your account settings, your activity on other sites and apps may be associated with your personal information in order to improve Google’s services and the ads delivered by Google.
So the bottom line is that Google's stated policies allow Google to associate a user's activity on PubMed with their personal information. We don't know if Google makes use of PubMed activity or if the data is saved at all, but NLM's privacy policy is misleading at best on this fact.

Does this matter? I have written that commercial medical journals deploy intense advertising trackers on their websites, far in excess of what NLM is doing. "Everybody" does it. And  we know that agencies of the US government spend billions of dollars sifting through web browsing data looking for terrorists, so why should NLM be any different? So what if Google gets a peek at PubMed user activity - they see such a huge amount of user data that PubMed is probably not even noticeable.

Google has done some interesting things with search data. For example, the "Google Flu Trends" and "Google Dengue Trends" projects studied patterns of searches for illness - related terms. Google could use the PubMed Searches for similar investigations into health provider searches.

The puzzling thing about NLM's data surrender is the paltry benefit it returns. While Google gets un-aggregated, personally identifiable data, all NLM gets is some demographic and interest data about their users. Does NLM really want to better know the age, gender, and education level of PubMed users??? Turning on the privacy features of Google Analytics (i.e. NOT turning on DoubleClick) has a minimal impact on the usefulness of the usage data it provides.

Lines need to be drawn somewhere. If Google gets to use PubMed click data for its advertising, what comes next? Will researchers be examined as terror suspects if they read about nerve toxins or anthrax? Or perhaps inquiries into abortifactants or gender-related hormone therapies will be become politically suspect. Perhaps someone will want a list of people looking for literature on genetically modified crops, or gun deaths, or vaccines? Libraries should not be going there.

So let's draw the line at advertising trackers in PubMed. PubMed is not something owned by a publishing company,  PubMed belongs to all of us. PubMed has been a technology leader worthy of emulation by libraries around the world. They should be setting an example. If you agree with me that NLM should stop letting Google track PubMed Users, let Dr. Brennan know (politely, of course.)

Notes:
  1. You may wonder if the US government has a policy about using third party services like Google Analytics and DoubleClick. Yes, there is a policy, and NLM appears to be pretty much in compliance with that policy.
  2. You might also wonder if Google has a special agreement for use of its services on US government websites. It does, but that agreement doesn't amend privacy policies. And yes, the person signing that policy for Google subsequently became the third CTO of the United States.
  3.  I recently presented a webinar which covered the basics of advertising in digital libraries in the National Network of Libraries of Medicine [NNLM] "Kernal of Knowledge" series.
  4. (8/16) Yes, this blog is served by Google. So if you start getting ads for privacy plug-ins...
  5. (8/16) urlscan.io is a tool you can use to see what goes on under the cover when you search on PubMed. Tip from Gary Price.