Windows 10 and Research Data Management

How to use Windows 10 and handle sensitive data without going to jail in 10 easy steps.

Background

Why on Earth do you expose sensitive data to Windows?!? you might ask.

Because I work in a Microsoft-dominated setting and the world in general is full of legacy applications. If you want to behave professionally without isolating yourself completely, you will have to touch Word and Excel files once in a while. The fragility and volatility of these formats demand that they only be handled with the latest and fanciest Original Microsoft-Office Products ®. No, LibreOffice is not an option.

Also, in many settings you can only function with Microsoft Outlook ® at your fingertips. Otherwise you can’t put appointments on other people’s agendas, can’t place reservations for boats (yes, we have boats), or organize your mail. No, the OWA web interface is not an option.

Starting position

Run Windows in a VM on your Linux box. That works amazingly well with VMware Workstation Pro and some people also seem to be happy with VirtualBox.

As a data manager, you (and by proxy programs running on your Windows box) will have read access to researchers’ sensitive data (in particular personal data) that is stored on the shared infrastructure of your company’s Windows Domain, which your Windows box joins automatically. If you also have write access to other people’s data, the risk is not only data exfiltration but also malware such as crypto-trojans.

Observing the network traffic of a Windows 10 installation, you will notice that switching off even the most hidden “telemetry” settings does pretty much nothing to stop the unasked-for blaring of your Windows box. Various executables, both system ones and third-party ones that you did not install (or installed under the assumption that a PDF reader doesn’t need to communicate with shady servers around the world), talk without pause or obvious reason to computers in places you never heard of. There are even programs that will download and re-install other programs that you just uninstalled. You can find that out by using Windows “firewall” tools such as WFN. This is an interesting, yet by no means stable experience, but at this stage the Windows box is unusable for work anyway.

[Image: microsoft-windows10-privacy. Credit: Swati Khandelwal @ thehackernews.com]

Solution

(This refers to Debian Stretch on the host with VMware Workstation Pro 14.1 or VirtualBox)

Goal: The VM can talk to the company network (Outlook, updates, …) but is completely cut off from the rest of the world.

Strategy: Use iptables on your host-system to gag and shackle the VM running Windows.

1. Have the virtualization software create virtual “host-only” interfaces on the host.

VirtualBox: VirtualBox Manager -> File -> Host Network Manager -> Create. Leave the defaults, don’t enable DHCP Server. Let’s assume in the following that the name of the adapter is vboxnet0.

VMware Workstation: Edit -> Virtual Network Editor -> Add Network. Let’s assume in the following that its name is vmnet1. Select “Host-only” and “Connect a host virtual adapter (vmnet1) to this network.” Don’t “Use a local DHCP service …”.

2. Associate the VM’s Ethernet adapter with the virtual host adapter.

(The VM has to be down to do that)

VirtualBox: Machine -> Settings -> Network. Select “Host-only Adapter”, select Name: vboxnet0.

VMware Workstation: Virtual Machine Settings -> Network adapter -> “Custom: Specify virtual network” -> Select vmnet1. In case your Windows machine gets a fixed IP from the company’s DHCP server, you might want to set the proper (old) MAC address under “Advanced”.

3. Make sure your host has the following packages installed:

  • iptables
  • iptables-persistent
  • bridge-utils

4. Load the module br_netfilter. That is necessary for iptables to be able to filter on a bridge:

echo br_netfilter | sudo tee -a /etc/modules-load.d/modules.conf

sudo systemctl restart systemd-modules-load

5. Modify /etc/network/interfaces so that the host’s primary interface and the virtual host adapter are slaves to a newly created bridge br0:

(In the following we use vboxnet0 and eth0 for virtual host interface and physical interface, respectively. Replace with their real names if necessary.)

# define the slave interfaces (possibly redundant)
iface eth0 inet manual
iface vboxnet0 inet manual

# set up the bridge
auto br0
iface br0 inet dhcp
bridge_ports eth0 vboxnet0
bridge_stp off
bridge_fd 0

# set bridge's MAC address. Useful if your host gets a fixed IP from the company's DHCP server.
post-up ip link set br0 address 08:62:66:2c:6e:66

# loopback interface (unchanged)
auto lo
iface lo inet loopback

6. Restart the host’s network

sudo systemctl stop networking

For good measure, remove IP addresses from slave interfaces:

sudo ip address flush dev eth0
sudo ip address flush dev vboxnet0

sudo systemctl start networking

7. Check host settings

sudo ifconfig should show eth0 and vboxnet0 up, but without addresses. br0 should look like your primary interface (eth0) looked before. sudo brctl show should show the bridge and the two enslaved interfaces.

8. Set the firewall rules

(Replace 152.88.0.0/16 with your company network)
sudo iptables -P FORWARD DROP
sudo iptables -A FORWARD -m physdev --physdev-in vboxnet0 -d 152.88.0.0/16,255.255.255.255 -j ACCEPT
sudo iptables -A FORWARD -m physdev --physdev-out vboxnet0 -s 152.88.0.0/16 -j ACCEPT

In case your company network is IPv6, repeat with ip6tables. Otherwise just DROP IPv6 traffic across the bridge:

sudo ip6tables -A FORWARD -j DROP
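
To make the intent of the IPv4 rules in step 8 explicit, here is a small Python model of what they let through. It only illustrates the rule logic and is not something to run in place of iptables. The function name forward_allowed and the direction labels are made up for this sketch, and 152.88.0.0/16 is the placeholder network from above.

```python
import ipaddress

# Assumption: the placeholder company network used in the rules above.
COMPANY_NET = ipaddress.ip_network("152.88.0.0/16")
BROADCAST = ipaddress.ip_address("255.255.255.255")


def forward_allowed(direction, src, dst):
    """Model of the FORWARD chain for packets crossing the bridge.

    direction is "from_vm" (packet enters through the VM's adapter)
    or "to_vm" (packet leaves through the VM's adapter).
    """
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if direction == "from_vm":
        # First rule: the VM may reach the company network or broadcast.
        return dst_ip in COMPANY_NET or dst_ip == BROADCAST
    if direction == "to_vm":
        # Second rule: only the company network may reach the VM.
        return src_ip in COMPANY_NET
    # Everything else falls through to the DROP policy.
    return False
```

For example, forward_allowed("from_vm", "152.88.1.5", "8.8.8.8") is False, while a DHCP broadcast from the VM (0.0.0.0 to 255.255.255.255) passes.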

9. Make the firewall rules persistent:

sudo netfilter-persistent save

10. Check it

Fire up the Windows VM and check whether you can reach (e.g. in the browser) your company homepage (should work) and an external website (should not work).


There is no research data for archiving


The main message of this post is trivially obvious for any practitioner concerned with collecting, organizing and archiving research data. It is so trivial that it is rarely explicitly stated. However, for the benefit of people who get involved with research data in an administrative capacity, or for archivists, for example, it might be worthwhile to write up this basic fact about the nature of research data.

A general requirement for an archival information system, here in the words of the OAIS specification (ISO 14721:2012 / CCSDS 650.0-M-2) [1], is to

Ensure that the information to be preserved is Independently Understandable to the Designated Community. In particular, the Designated Community should be able to understand the information without needing special resources such as the assistance of the experts who produced the information.

Traditional archival practice assumes that the necessary description, or “metadata”, that has to be attached to the items to be archived can and should be provided by the archivist (e.g. Rules for archival description [2]). If the items to be archived are the products of current administrative processes, such description becomes particularly simple and might even be automated. A document in such a context will already contain most of the relevant metadata itself (e.g. author, date, department, business process, …) and might be created within software that knows this metadata and attaches it to the document.


A research data item, however, is a completely different thing. Data items that are created in the course of research, such as numerical tables, microscope images, or pieces of programming code, cannot be understood or interpreted on their own. The necessary context, or “scientific metadata”, has to be produced by the researchers themselves, as it requires in-depth knowledge about the research at hand and can be quite voluminous. A research data item will only make sense to be archived as part of a collection of other data items, information about the linkages among these items, text that describes the method used to produce that data, and potentially a wealth of other, very domain-specific context information. Let’s call such an archivable collection a “research data set”.

The bad news is that research data sets in this sense are in general not created at all. It is just not part of the traditional process of scientific production. Creating such a collection might require significant work for which the incentives are unclear, mostly weak or non-existent. Usually researchers at universities comply with some sort of “Good Scientific Practice Guidelines” that require them to individually make sure that they can come up with their data a couple of years after publication in case someone is asking. If someone actually asks, they will sit down on a weekend, sift through their custom storage system, pack up an ad-hoc research data set and email it to the colleague who asked. This data set will certainly not have the quality required by the OAIS reference standard as quoted above, since the colleague will likely call on Monday and ask about the unit in column 3 and the level of detection in table 5.

A corollary of these considerations is that in order to collect, store and archive research data in a findable, accessible, usable, or even interoperable fashion, the technological problem (what web platform shall I use? Is it fancy enough and will it twitter when a dataset is submitted?) is a comparably tame one that has already been solved many times before. Strategy considerations, resource allocation and a lot of thinking have to be spent instead on the largely unsolved, difficult and costly problem: how do we get research data sets from research data items?

[1]: CCSDS Secretariat, 2012, Reference model for an open archival information system (OAIS), Washington.

[2]: Bureau of Canadian Archivists, 2008, Rules for archival description, Ottawa.

File formats for long-term archiving

Repositories for all kinds of data usually require or at least recommend file formats suitable for long-term archiving. These recommendations are frequently incomplete, sometimes contradict each other, or do not appear totally consistent to me. This is an attempt to start a discussion that might contribute to a consolidation of best practices. The following thoughts are also incomplete, subjective and inconsistent and should only serve to get the discussion going.

Definition of long-term archiving

The data should be still usable when the current curators are not available anymore, when the data might not be stored in the current repository anymore, and at a time for which the available soft- and hardware is unknown today. Think ~100 years.

Criteria

A potential future user has to have the ability to implement from scratch the necessary software to interpret the data. Consequently,

  1. proprietary (non-open) formats can be ruled out, and
  2. open, but patent-encumbered formats should be avoided.
  3. Simple format descriptions have precedence over more complicated ones.
  4. Formats should be described by a reasonably established international standards organization such as W3C, IEC or ISO.

As a repository operator I want to enable the submitting users to directly submit data in acceptable formats. While automatic conversion would be frequently possible, this would raise issues regarding the responsibility for data quality and correctness. Therefore

  5. The software to transform to the recommended format should be readily available, and the process should be easy to explain and to understand.
  6. Ideally, the format should be directly supported by applications that a majority of users already employ to handle that data. Mainstream takes precedence over exotic.
  7. Software to read the format should be available on all major platforms (operating systems) in use today. Interoperability is a benefit in itself, but the number of cross-platform implementations is also indicative of the perceived usefulness, actual use and technical sanity of the format.

In particular, 5. and 6. seem not to be considered by most recommendations I have come across so far. Those criteria appear rather important to me when the aim of the repository operation includes making life easy for the data producers, who otherwise might choose not to submit anything in the first place.

The type of the file format needs to be communicated as well.

  8. File formats that indicate their type in their own bytestream, e.g., through a particular start sequence of bytes, are preferable.
  9. If the format has to be communicated in meta-data, that should be as simple as possible. It is a plus if a naming convention exists (e.g. through the file extension) that identifies the format as precisely as possible.
  10. The identification of the file format should be as straightforward as possible, in particular to enable automatic detection.
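
As a rough illustration of identification by start sequence, the following Python sketch checks a file’s leading bytes against a handful of well-known signatures. The table is deliberately tiny and the function names sniff_format and sniff_file are made up for this example. Real identification tools know far more formats and look deeper into the file.

```python
# Leading byte sequences ("magic numbers") of a few formats discussed here.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian byte order
    (b"MM\x00*", "TIFF"),  # big-endian byte order
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
    (b"\xff\xd8\xff", "JPEG"),
]


def sniff_format(data):
    """Return a format name if the bytes start with a known signature."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def sniff_file(path):
    """Read only the first bytes of a file and try to identify it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```

Note the limit of this approach: the start sequence tells you that a file is a PDF, but not whether it is PDF/A. That claim lives in the file’s metadata, so a sub-format check needs more than the first few bytes.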

Further,

  11. The promoted format should not encourage or enforce a conversion that loses information.
  12. The format should allow for automatic validation.

Data types

Raw text

Everybody agrees that this should be encoded as ASCII or UTF-8 or UTF-16 or UTF-32 or ISO8859-1. It would be good to settle on one. UTF-8 has the dominant position today, as per requirement for a lot of internet protocols, requirement as encoding for many higher level formats such as markup languages, multiple standardization by different organizations and its general ubiquity. Also ASCII is a subset of UTF-8.

It seems to have a number of other advantages: No byte-order ambiguities, and it is simply detected by decoding correctly (http://stackoverflow.com/a/4522251). According to IANUS the BOM should be omitted and that seems to make sense to me. Often this is not mentioned though.
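
“Detected by decoding correctly” can be sketched in a few lines of Python: try a strict UTF-8 decode and, separately, look for a leading BOM. The function name check_utf8 is made up for this sketch. Keep in mind that pure ASCII always passes, and that short texts in other encodings can occasionally happen to be valid UTF-8.

```python
import codecs


def check_utf8(data):
    """Return (is_valid_utf8, has_bom) for a byte string."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom
```

A repository could use such a check at submission time to reject other encodings and to flag files that carry a BOM.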

Why accept anything else at all? I suggest discouraging the use of other encodings.

Text with layout

This quite loosely can include the description of a page (margins, size, etc.) and style elements such as font and colors, graphical elements, etc.

PDF/A

Everybody agrees that PDF/A is good for this. Which PDF/A?

Presumably PDF/A-1 and PDF/A-2 with respective levels a and b and PDF/A-2u are good.

Clearly, PDF/A-3 is not good, because it allows embedding of arbitrary other formats.

Just stating that PDF/A is OK might lead people to submit bitmaps embedded in PDF. Is that OK? Or should a bitmap format be promoted for such cases?

Office Document Standards

There are two: OOXML and ODF. In general such formats are very complicated because they are designed to describe at least the output of word processors, presentation creation software, and spreadsheet software. They are generally less robust than less featured formats, are relatively young and new versions are developed at a comparatively fast rate.

Therefore, these formats should be acceptable only as a last resort, in case the data can’t be represented in a more suitable format, e.g. word processing documents and presentations as PDF/A and spreadsheets as plain text tables (e.g. CSV).

  • Office Open XML (OOXML)

OOXML describes file formats for “office documents” and is almost exclusively used as the default format for Microsoft Office output (.docx, .pptx, .xlsx). In many recommendations these formats are marked as acceptable for long-term archiving, presumably because their ISO standardization is taken as an indicator of suitability. This judgment eludes me since these formats currently do not even provide a robust form of day-to-day storage. The criticism of these formats is abundant; here are the most important points, in my opinion:

  1. Actually there are three standards for OOXML which are incompatible with each other, and are implemented to varying degrees in currently available software (ECMA-376, ISO/IEC 29500 Transitional, and ISO/IEC 29500 Strict).
  2. OOXML is much more complicated in structure than the competing ODF and the size of the standard description (6546 pages) compares unfavorably with that of ODF (867 pages). And this doesn’t include the 1500 pages necessary to describe the difference between ISO/IEC 29500 Transitional and ISO/IEC 29500 Strict.
  3. Only a few software packages exist today that implement some variety of OOXML and it is not immediately clear to the user which version is actually written when she clicks on “save as docx”, for example.
  4. Apparently there is no or only broken versioning (inclusion of a pointer to the version of the standard of the document) in OOXML, which precludes automatic validation, see here and here (German). Apart from that, the complexity of the standards alone apparently did not allow for the development of validation tools.
  5. OOXML seems to allow embedding various external media formats and ActiveX controls.

There is an in-depth analysis of the problems with OOXML by Markus Feilner and another overview by Markus Dapp in German.

The integrity of the ISO standardization process has been questioned and Wikipedia has a lengthy entry about that.

Every IT group supporting a heterogeneous infrastructure will sing long and sad songs about the interoperability problems created by OOXML. Long-term archives should be kept strictly clean of OOXML, as should even intra-institution medium-term archives.

  • Open Document Format for Office Applications (ODF)

ODF is a comparatively sane standard that is maintained and developed by a broad industry consortium (OASIS). A large number of software packages across all major platforms support it. ODF should be used if storing an “office format” is absolutely necessary at all. (Are there cases where this is so?)

Still Image

This is either JPEG2000 or TIFF or DNG or PNG or GIF. (GIF is sometimes mistakenly described as using lossy compression.)

For JPEG2000 and TIFF there is a zoo of substandards that is hard to disentangle. The matter is complicated by the fact that at least TIFF can serve as a container that contains, for example, PNG. Both JPEG2000 and TIFF also allow for lossy compression even though that appears rather unpopular.

My ad-hoc recommendation would be to keep any of the above formats if the original data to be submitted comes in such a format. A conversion could result in loss of information, even among lossless compressed formats, due to different meta-data capabilities or different capabilities of color representation. Usually there is advice against using GIF because its compression algorithm (LZW) is patent-encumbered. According to Wikipedia though, these patents expired long ago, making this a non-issue.

JPEG is mostly used for lossy compressed images and conversion from lossless formats must be avoided. However, if the original data is in JPEG there seems to be no reason not to archive it like that.


In future installments I am planning to write about

  • Tabular Data
  • Vector Graphics
  • Audio
  • Video
  • Geospatial Data

Resources

http://www.library.ethz.ch/en/Media/Files/File-formats-for-archiving

http://www.digitalpreservation.gov/formats/index.shtml

http://www.ianus-fdz.de/it-empfehlungen/dateiformate

http://www.preforma-project.eu/index.html

https://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf

 

Private resources in CKAN


Open Data is good

CKAN has its roots and its largest user-base in Open Data applications. It comes therefore as no surprise that mechanisms to selectively hide data, to make data unavailable, are hardly present. One could even argue that, given that we all like Open Data and advocate for it, implementing such features should be consciously avoided.

Using CKAN as a repository for research data has a very large Open Data aspect as well: Researchers are obviously as interested in sharing their product as any other creative profession, or even more so. And there is also an increasing need for Open research data repositories, as journals, funding agencies and Mr. and Ms. Taxpayer increasingly demand the opening up of scientific data.

Non Open Data necessarily exist

That is all very well. However, research data management is much more than just making data public. And CKAN can play an important role in this area that appears to gain more and more importance and attention. If you are in the business of providing a data management tool for researchers and convincing them to use it, you will pretty soon run into the requirement to “restrict access to resources”.

There is hardly any field that is free of confidential data. The case where a researcher hasn’t yet written all the papers that are in a dataset and therefore wants to keep it hidden for a while is only the simplest one. It is also not problematic, because that is a singular interest of a temporary nature. More problematic cases include:

  • Non-disclosure agreements with collaborators from the industry
  • Non-disclosure agreements for data from the administration that is security-sensitive
  • Non-disclosure of data which could self-invalidate if published (think sampling locations for environmental monitoring activities that carry political or economic implications)
  • Non-disclosure of personal data (questionnaires or other data that could be used to identify individuals or small groups of people)

Solutions?

  1. The pedestrian one: Instead of pointing to the resource in question, point to some text stating the reason for the unavailability of the resource and what steps to take to actually get hold of it.
    Drawback: The resource has to be stored somewhere, which is not in this central repository, for which I use CKAN. It is on some researcher’s workstation, will get lost, forgotten, destroyed or altered. Avoiding this is exactly the point of having a central repository.
  2. Existing CKAN-means: Make the organization small enough and flag the containing dataset as “private”.
    Drawbacks:

    • A one-person organization might clash with your idea of what an organization should be.
    • You either make all other files in that package unavailable too, or you put that particular resource in a separate package, which might clash with your idea of what a package is supposed to contain.
  3. Use ckanext-privatedatasets. This ameliorates one of the drawbacks in 2. but not the other. Is anybody using it, does it work well?

In general it is not completely clear to me how reliable the protection provided by existing CKAN (+extensions) is, and whether it is good enough against unauthorized access to, say, individuals’ health data on a public-facing installation.

I am contemplating using encryption. Sensitive files could be stored encrypted and the problem would be moved to key-handling. One possibility could be to use a user’s API-key to encrypt the file. This key is only available to him or her and the admin, who will take care of matters should the researcher become unavailable. Depending on safety requirements (example above), that might not be enough and encryption would have to happen with a helper script, locally on the researcher’s client machine.
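
An API-key is a string, not a cipher key, so one step such a scheme would need is deriving a fixed-length key from it. The sketch below shows only that step, using PBKDF2 from Python’s standard library. The function name derive_file_key, the iteration count and the salt handling are assumptions made for illustration. The encryption itself is not shown and should be left to an authenticated cipher from a vetted library.

```python
import hashlib
import os


def derive_file_key(api_key, salt=None, iterations=200_000):
    """Derive a 32-byte key from an API-key string (PBKDF2-HMAC-SHA256).

    Returns (key, salt). The salt is not secret but must be stored
    with the encrypted file so that the key can be derived again.
    """
    if salt is None:
        salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac(
        "sha256", api_key.encode("utf-8"), salt, iterations
    )
    return key, salt
```

One consequence of this design is that a regenerated API-key no longer decrypts files encrypted under the old one, so key changes would have to trigger re-encryption.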

List of Authors


To properly cite a dataset the metadata should contain a list of authors. ckanext-repeating and some JavaScript (or similar) for the WebUI help to implement that.

However, for a scholarly citation the names of the authors are not enough and additional information about the authors, e.g., affiliation, ORCID-Id, address, website, has to be provided. To that end it would be nice to store a list of authors as a list of dicts, e.g.


authorlist = [{"name": "",
"email": "",
"affiliation": "",
"website": "",
"ORCID-ID": "",
"ResearcherID": ""},
...]

Next to a nice representation of such a structured metadata field in the Web-UI it is not clear (to me) what has to be done, and how, to get that working with a) the database-layer and b) the search indexing. This scheming issue appears to be related.
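
One low-tech way to experiment with this is to serialize the list to a JSON string for storage in a plain text field and to flatten selected values into one string for full-text search. The Python sketch below does not use any CKAN API. The function names are hypothetical, and whether CKAN’s database layer and search index should handle the field this way is exactly the open question above.

```python
import json

# The keys of one author entry, as in the example structure above.
AUTHOR_KEYS = ("name", "email", "affiliation", "website",
               "ORCID-ID", "ResearcherID")


def normalize(author):
    """Keep only the known keys; missing ones become empty strings."""
    return {key: str(author.get(key, "")) for key in AUTHOR_KEYS}


def to_storage(authors):
    """Serialize the author list to a JSON string for a text field."""
    return json.dumps([normalize(a) for a in authors], ensure_ascii=False)


def from_storage(text):
    """Restore the list of dicts from the stored JSON string."""
    return json.loads(text)


def to_index_text(authors):
    """Flatten searchable values into one string for a search index."""
    parts = []
    for author in authors:
        for key in ("name", "affiliation", "ORCID-ID"):
            value = author.get(key, "")
            if value:
                parts.append(value)
    return " ".join(parts)
```

Storing the structure as one string keeps the database schema untouched, at the price that individual keys cannot be queried without parsing the string first.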