Data Management in the Research Environment
RSM 674 Spring
Dr. Timothy Norris - Data Curation Fellow
Angela Clark - Librarian Associate Professor RSMAS
Today's Outline

Promoting the Stewardship of Research Data

ICPSR LEADS project findings for NSF- and NIH-sponsored awards that created social science data (2008)

Adapted from: Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age (2009). "Promoting the Stewardship of Research Data" (Chap 4) in Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. National Academies Press, Washington D.C.
Promoting the Stewardship of Research Data
"The questions of who pays, how much, and for how long are at the heart of the problem of how to ensure long-term stewardship of research data." (p 113)

Figure: Transistor count and Moore's law, 2011. Wgsimon (2011), licensed under CC BY-SA 3.0 via Wikimedia Commons.
The 2013 OMB Memorandum
  • Value – “manage information as an asset throughout its lifecycle”
  • Privacy, security, ownership
  • Data:
    “refers to all structured information, unless otherwise noted.”
  • Information Life Cycle
    “means the stages through which information passes, typically characterized as creation or collection, processing, dissemination, use, storage, and disposition.”

The 2013 OMB Memorandum
  • Open Data
    • Public
    • Accessible
    • Described
    • Reusable
    • Complete
    • Timely
    • Managed Post-Release

The 2013 OMB Memorandum
  • Policy Requirements
    • Collect information in a way that supports downstream use
    • Machine readable formats
    • Use data standards
    • Open licenses
    • Common core and extensible metadata

On Data
Qualitative - Quantitative
Text, Image, Sound
Nominal, Ordinal, Interval, Ratio
Kitchin, R (2014). “Conceptualizing Data” in Kitchin, R The Data Revolution. Washington DC: Sage.
What is Data?

Measurement Level | Definition | Example
Nominal | Categorical in nature, with observations recorded into discrete units. | Unmarried, married, divorced, widowed
Ordinal | Observations that are placed in a rank order, where certain observations are greater than others. | Low, medium, high
Interval | Measurements along a scale which possesses a fixed but arbitrary interval and an arbitrary origin. Addition or multiplication by a constant will not alter the interval nature of the observations. Data can either be continuous or discrete in nature. | Temperature along the Celsius scale
Ratio | Similar to interval data except the scale possesses a true zero origin, and multiplication by a constant will not alter the ratio nature of the observations. | Exam marks on a scale of 0–10

Kitchin, R (2014). “Conceptualizing Data” in Kitchin, R The Data Revolution. Washington DC: Sage.
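The measurement levels in the table decide which summaries make sense. The sketch below is only an illustration with made-up values: the variable names and the example data are assumptions, not part of Kitchin's text. It uses the table's own examples (marital status, low/medium/high, Celsius temperature, exam marks).

```python
from statistics import mean, median, mode

# Nominal: categories only, so the mode is the one meaningful "average".
marital_status = ["married", "unmarried", "married", "widowed", "divorced"]
most_common_status = mode(marital_status)

# Ordinal: ranked categories, so a median works once the order is encoded.
rank = {"low": 0, "medium": 1, "high": 2}
ratings = ["low", "high", "medium", "medium", "high"]
median_rank = median(rank[r] for r in ratings)
median_rating = next(name for name, value in rank.items() if value == median_rank)

# Interval: differences and means are meaningful, ratios are not,
# because the zero point of the Celsius scale is arbitrary.
celsius = [10.0, 20.0]
mean_celsius = mean(celsius)
naive_ratio = celsius[1] / celsius[0]      # 2.0, but 20 C is not "twice as hot" as 10 C
kelvin = [c + 273.15 for c in celsius]
true_ratio = kelvin[1] / kelvin[0]         # close to 1.035 on a true-zero scale

# Ratio: a true zero makes ratios meaningful.
exam_marks = [4, 8]
marks_ratio = exam_marks[1] / exam_marks[0]  # 8 marks really is twice 4 marks
```

The point to take away is the Celsius case: the arithmetic runs without complaint, so it is the metadata about measurement level that tells you the result is not meaningful.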
On Data
Captured, Exhaust, Transient, Derived
Technical Metadata
Not "Raw"
Levels (more in a moment)
Data Levels (as described by NASA)

Data Level | Description
Level 0 | Reconstructed, unprocessed instrument and payload data at full resolution, with any and all communications artefacts (e.g., synchronisation frames, communications headers, duplicate data) removed.
Level 1A | Reconstructed, unprocessed instrument data at full resolution, time-referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters computed and appended but not applied to Level 0 data.
Level 1B | Level 1A data that have been processed to sensor units.
Level 2 | Derived geophysical variables at the same resolution and location as Level 1 source data.
Level 3 | Variables mapped on uniform space-time grid scales, usually with some completeness and consistency.
Level 4 | Model output or results from analyses of lower-level data (e.g., variables derived from multiple measurements).
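To make the step from Level 0 to Level 1B concrete, here is a minimal sketch. Everything in it is hypothetical: the counts, the timestamps, and the simple linear gain/offset calibration are assumptions chosen for illustration and are not taken from any NASA product.

```python
# Hypothetical raw instrument counts (Level 0) and made-up calibration coefficients.
level0_counts = [1000, 1004, 1010]
timestamps = ["2016-01-01T00:00:00Z", "2016-01-01T00:01:00Z", "2016-01-01T00:02:00Z"]
calibration = {"gain": 0.25, "offset": -230.0, "units": "degC"}

# Level 1A: time-referenced counts with calibration appended but NOT applied.
level1a = {
    "time": timestamps,
    "counts": level0_counts,
    "calibration": calibration,
}

# Level 1B: the same data converted to sensor units by applying the calibration.
gain = level1a["calibration"]["gain"]
offset = level1a["calibration"]["offset"]
level1b = {
    "time": level1a["time"],
    "values": [gain * count + offset for count in level1a["counts"]],
    "units": level1a["calibration"]["units"],
}
```

Keeping Level 1A alongside Level 1B means the conversion can be redone later if better calibration coefficients become available.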
Sensors and Data Levels

Active vs. Static | Data Storage | Example or Focus | Typical File Formats
ACTIVE | Raw Data | Temperature readings over time | Paper? Device-specific? .xlsx, …
ACTIVE | Processed Data | “Cleaned,” normalized temperature data compiled in spreadsheet | .xlsx, .sas, …
ACTIVE | Analyzed Data | Temperature data with averages computed, graphs charted | .xlsx, .sas, …
STATIC | Finalized, Published Data | Do the data support hypothesis? | .csv

adapted from
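The four stages in the table can be walked through in a few lines. This is a sketch under stated assumptions: the readings, the cleaning rules, and the column names are invented for illustration, and the "published" CSV is written to a string rather than to a file.

```python
import csv
import io
from statistics import mean

# Raw data: hypothetical logger output, with a missing reading and a stray unit label.
raw_readings = ["21.5", "22.0", "", "23.5 C", "22.0"]

# Processed data: strip the unit label, drop empty readings, convert to numbers.
processed = [float(r.replace("C", "").strip()) for r in raw_readings if r.strip()]

# Analyzed data: summary values computed from the processed readings.
analysis = {"n": len(processed), "mean_c": mean(processed), "max_c": max(processed)}

# Finalized data: written out as plain CSV.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["reading_index", "temperature_c"])
for index, value in enumerate(processed):
    writer.writerow([index, value])
published_csv = buffer.getvalue()
```

Each stage produces something worth managing separately: the raw list cannot be recreated, while the processed and analyzed versions can be regenerated as long as the cleaning steps are recorded.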
On Data
Structured, Semi-structured, Unstructured
Irregular, Flexible
Nested, Trees, Tagged
Data model, Schema,
Relational Database
Primary, Secondary, Tertiary
Created, Collected
Data: Primary, Secondary, and Tertiary
  • Primary: research generated (from instruments or observations)
  • Secondary: acquired for research project from another source
  • Tertiary: derivative of primary or secondary data (anonymized, annotated, bundled, and so on)

On Data
Indexical, Attribute, Metadata
On Metadata
One person's data is another person's metadata

Some Useful Abstractions

“Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not truth.
Truth is not beauty.
Beauty is not love.
Love is not music.
Music is THE BEST.”

― Frank Zappa  
Another Way of Seeing

Framing Data
  • Technical Perspective
    • quality, validity, reliability, authenticity, and usability
    • process, structure, share, analysis
  • Ethical Perspective
    • purpose and use
  • Political - Economic Perspective
    • public goods and private ownership
    • governance
  • Spatial - Temporal Perspective
    • mutable mobiles ... (Latour)
  • Philosophical Perspective
    • ontologies and epistemologies

Kitchin, R (2014). “Conceptualizing Data” in Kitchin, R The Data Revolution. Washington DC: Sage.

What data will you collect / create / wrangle ?

  • Will you use sensors? - OBSERVATIONAL

    • Captured in situ?
    • Can’t be recreated, recaptured or replaced - VALUE
    • Includes survey instruments and hired research assistants
    • But, will you collect data, buy data from a provider, or receive data as a contracted service?

What data will you collect / create / wrangle ?

  • Will you conduct an experiment? - EXPERIMENTAL

    • In situ or laboratory based (also considered are natural experiments)?
    • Should be reproducible, but can be expensive
    • May include sensors and observations

What data will you collect / create / wrangle ?

  • Will you build models? - SIMULATED

    • Will you write code?
    • How will you parametrize the model?
    • Inputs may be more valuable than outputs
    • What software (or other tools) will you use?

What data will you collect / create / wrangle ?

  • Will you combine and analyze previously shared data to create new data? – DERIVED or COMPILED

    • Integration from several sources
    • Recreation can be very expensive
    • Again, software and tools?
    • Are there copyright concerns?

What data will you collect / create / wrangle ?

  • Will you draw from previously published materials? – REFERENCE or CANONICAL

    • Peer reviewed
    • Can be data or textual

"This is the most creative, important and valuable aspect of research data."

  • Do you agree?
  • Write a paragraph on why you do or do not agree with this statement

National Geographic Institute
Ministry of the Environment
Ministry of Energy and Mines
Previous Area Studies
Systemic Theories
USGS / NASA satellite imagery
SRTM topographic data
Six county governments
Ten communities
Three state governments
Pasture Transects
Water Quality Analysis
GPS Data Collected
Productivity Model
Disturbance Model
Topographic Model
Conservation Zoning Maps
Land Use Maps
Reference / Canonical
Finalized / Published

Your Turn
  • Think about your research project – if you don’t have one, partner with someone who does OR imagine your future internship

      • Remember: before, during, after

      • Qualitative and Quantitative
      • observational, experimental, derived, simulated, reference
      • raw, processed, analyzed, published
      • primary, secondary, tertiary

    • MATCH the STAGES of the research lifecycle with DATA TYPES
      • Think about management/wrangling at each research stage with each data type

Reading for Next Class (Wednesday)

Today's Outline
  • File formats
  • Text-based "open" formats
  • Image formats
Download this:
File Formats

Recommended Formats for Long-term Access and Sharing

Non-proprietary – no software purchase to open the file
Lossless – uncompressed, or compressed without discarding any of the original data
Indexable – if possible a plain text format that is both human and machine readable

Best file format?????

File Formats

  • Text: doc, docx, rtf, odt, pages
  • Tabular: xls, xlsx, numbers, dbf
  • Stat: spss, sas, jmp, rdata
  • Images: jpg, tiff, svg, png, gif, bmp
  • Geographic: shp, geotiff, kml, kmz, gdb
  • Video: mp4, mov, avi, ogg
  • Music: mp3, wav, m4a, aiff
  • Plain text: txt, csv, json, html, xml

General Formats
  • proprietary
  • mixed
  • open

  • lossy
  • depends
  • lossless

Quick note on statistics files and conversions

  • Often contain much metadata embedded in the file
    • For example SPSS and SAS include data types (nominal, ordinal, interval, ratio) and data dictionaries (code keys for nominal data, units for interval and ratio data, etc.)
  • How to best share???
    • Option 1: keep in the proprietary format
    • Option 2: convert to text based format (csv, xml) and have either
      • A data dictionary in a text based format so that a user can reconstruct the data-metadata association
      • Some sort of ‘installer’ that contains the metadata and automatically reconstructs the data-metadata association

This also applies to relational databases, images, and some geographical data
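Option 2 with a text-based data dictionary might look roughly like the sketch below. The column names, the code keys, and the helper `load_with_dictionary` are all hypothetical, and JSON is only one possible plain-text format for the dictionary; this is not how SPSS or SAS store their metadata internally.

```python
import csv
import io
import json

# Hypothetical coded data as it might be exported from a statistics package.
data_csv = "id,marital,temp_c\n1,2,21.5\n2,1,19.0\n"

# Hypothetical data dictionary kept next to the CSV as plain-text JSON.
dictionary_json = json.dumps({
    "marital": {
        "level": "nominal",
        "codes": {"1": "unmarried", "2": "married", "3": "divorced", "4": "widowed"},
    },
    "temp_c": {"level": "interval", "units": "degrees Celsius"},
})

def load_with_dictionary(csv_text, dict_text):
    """Re-attach code labels and numeric types using the data dictionary."""
    dictionary = json.loads(dict_text)
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, meta in dictionary.items():
            if "codes" in meta:
                row[column] = meta["codes"][row[column]]   # code -> label
            else:
                row[column] = float(row[column])           # text -> number
        rows.append(row)
    return rows

rows = load_with_dictionary(data_csv, dictionary_json)
```

Without the dictionary, the value 2 in the `marital` column is just a number; with it, a future user can rebuild the data-metadata association without owning the original software.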

OK - so what?

First, make sure your operating system lets you see the file formats!!!!

  • Mac file extensions
    • Finder: Finder -> Preferences ... :
      "Advanced" tab, check box next to "Show all filename extensions"
  • PC file extensions
    • Win 7 and below
      • File Explorer: Organize -> Folder and search options:
        "View" tab, uncheck the box next to "Hide extensions for known file types"
    • Win 8 and above
      • File Explorer: "view" tab, check the box next to: "File name extensions"

Some things to remember

  • Text and numbers
    • Plain text - BUT STRUCTURED
    • Character encoding??? UTF-8
    • PDF - preferably not!! (hard to index/search UNLESS created with specific care)
  • Images (bitmap)
    • TIFF, JPEG2000 (??), PNG, JPEG

html – hypertext markup language
csv – comma-separated values
xml – extensible markup language
json – JavaScript object notation
pdf – portable document format
.jpg [ .jpeg, .jp2, .j2k ] – joint photographic experts group
.tif [ .tiff ] – tagged image file format
png – portable network graphics
Dublin Core Schema (metadata)

File Formats

character encoding??? UTF-8

ASCII – American Standard Code for Information Interchange
[ old school, 128 characters in 7 bits ]
lowercase “j” would become binary 01101010 and decimal 106

UTF-8 – Universal Coded Character Set + Transformation Format – 8-bit
[ now the new standard, only since about 2007, first 128 characters are ASCII ]
[ encodes 1,112,064 “code points” or characters ]
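The numbers on this slide can be checked directly in Python. The sample characters are arbitrary choices for illustration, written as escape sequences so the example does not depend on how this text itself is encoded.

```python
# ASCII: lowercase "j" is decimal 106, binary 01101010 when padded to 8 bits.
code_point = ord("j")
binary = format(code_point, "08b")

# UTF-8 keeps ASCII characters as single bytes and uses 2 to 4 bytes for the rest.
samples = ["j", "\u00e9", "\u20ac", "\U0001F600"]  # j, e-acute, euro sign, an emoji
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Code points UTF-8 can encode: the full Unicode range minus the surrogate block.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)

# The same bytes read with the wrong encoding come out garbled.
garbled = "\u00e9".encode("utf-8").decode("latin-1")
```

The last line is why recording the character encoding matters: the bytes are intact, but a reader who assumes Latin-1 sees two wrong characters in place of one accented letter.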

Bitmap and vector images

Raster – a “grid” of numeric color values, also known as a bitmap
[ .tif, .jpg, .png ]

Vector – a collection of points that can be connected to make lines, polygons, and volumes
[ no single dominant standard yet, but common in Adobe Illustrator, AutoCAD, and many GIS applications ]
WATCH for .svg – scalable vector graphics
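A small sketch of the difference, using made-up values: the raster is stored as a grid of color numbers, while the vector shape is stored as a list of points and written out as SVG, which is plain XML text. The 2 x 2 image and the triangle are illustrative assumptions only.

```python
# Raster: a grid of color values; here a 2 x 2 image of (R, G, B) pixels.
raster = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
pixel_count = len(raster) * len(raster[0])

# Vector: a list of points that are connected to form a shape (a triangle here).
triangle = [(10, 10), (90, 10), (50, 80)]

# The vector shape serialized as SVG.
points_attr = " ".join(f"{x},{y}" for x, y in triangle)
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    f'<polygon points="{points_attr}" fill="none" stroke="black"/>'
    "</svg>"
)
```

Doubling the raster's dimensions quadruples the number of stored pixels, whereas the triangle stays three points at any display size.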
Some things to remember

  • Cartographic (maps)
    • Raster: GeoTIFF
    • Vector: shapefile, AutoCAD, GeoJSON
    • Note: a shapefile requires .shp, .shx, and .dbf files; .prj, .sbx, and .sbn are optional (?!)
  • Audio
    • AIFF, WAVE 44.1 kHz / 16 bit or higher
    • BUT lossless compression with FLAC is OK (Free Lossless Audio Codec); MP3 is lossy
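Because a shapefile is really a bundle of files sharing one base name, it is easy to share it incomplete. The sketch below checks a list of file names for the mandatory parts (.shp, .shx, .dbf) and the common optional ones. The helper `check_shapefile` and the file names are hypothetical, and the function only inspects names; it does not read the files.

```python
from pathlib import PurePath

REQUIRED = {".shp", ".shx", ".dbf"}
OPTIONAL = {".prj", ".sbn", ".sbx"}

def check_shapefile(filenames, base):
    """Report which shapefile components are missing for a given base name."""
    found = {
        PurePath(name).suffix.lower()
        for name in filenames
        if PurePath(name).stem == base
    }
    return {
        "missing_required": sorted(REQUIRED - found),
        "missing_optional": sorted(OPTIONAL - found),
    }

report = check_shapefile(
    ["parcels.shp", "parcels.shx", "parcels.dbf", "notes.txt"], "parcels"
)
```

Here the bundle opens, but the missing .prj means the map projection is undocumented, which is exactly the kind of gap that hurts long-term reuse.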

.dbf – database format
.prj – projection (for maps)
.dxf – drawing exchange format
.shx – shapefile index
AIFF – audio interchange file format
MPEG (.mp3) – moving picture experts group
Some things to remember

  • Video
    • MPEG-4
  • Documentation
    • Rich Text Format
    • Open Document text
    • html
    • Plain text

MPEG – moving picture experts group
RTF – rich text format
ODT – open document text

What do you use in your work?


Actually a collection of tools. NetCDF “is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data” – Wikipedia

Color is entirely a creation of the mind

Red Green Blue - RGB

  • Additive color model based on ‘primary’ colors
    • Used on all electronic display devices
    • Primary colors are closely matched to three receptors in eye
  • The modern computer uses numbers from 0-255 to represent each primary color
    • 8 bits for each color, three colors, 24-bit true color
    • 16,777,216 possible colors (256 x 256 x 256) – more than we can see

Cyan Magenta Yellow Black - CMYK

  • Subtractive color model based on printer colors
    • Also known as "process" or "four-color" system
    • All printing devices use this model
    • More ink 'subtracts' lightness from the white paper
    • 'K' is for 'key' as the black plate in an offset press is the 'key' plate
  • 8-bit or 16-bit information in each color 'channel'
    • CMYK image files are noticeably larger than RGB image files
    • way more colors than we can see ...

Hexadecimal for the Internet

  • The hex color system is based on the RGB model
    • Used in HTML for the internet
    • Is the preferred representation for color by programmers
  • Hexadecimal is the name for counting in base 16
    • Good for computers: 2^4 = 16 and 16 x 16 = 256
    • Counted from 0-15 like this:

      0 1 2 3 4 5 6 7 8 9 A B C D E F

  • A hex color might look like this: #FF00FF


Red: FF = 15*1 + 15*16 = 255

Green: 00 = 0*1 + 0*16 = 0

Blue: FF = 15*1 + 15*16 = 255
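The same arithmetic as a short sketch. The helper names `hex_to_rgb` and `rgb_to_hex` are made up for this example; the conversion itself relies only on Python's built-in base-16 parsing and formatting.

```python
def hex_to_rgb(hex_color):
    """Convert a color such as '#FF00FF' to a (red, green, blue) tuple of 0-255 values."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(red, green, blue):
    """Convert 0-255 channel values back to a '#RRGGBB' string."""
    return f"#{red:02X}{green:02X}{blue:02X}"

magenta = hex_to_rgb("#FF00FF")          # full red, no green, full blue
round_trip = rgb_to_hex(*magenta)

# Three channels of 256 levels each give the 24-bit color count.
total_colors = 256 ** 3
```

Each pair of hex digits is one 8-bit channel, so a six-digit hex color and a 24-bit RGB value are two spellings of the same number.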
Photoshop's color picker

Approximate screen color
A note on resolution

“Highest resolution available, not rescaled or interpolated” – LOC recommendation

Resolution is directly related to pixel dimensions
  • usually expressed as dots per inch (DPI): 300 dpi
  • can be megapixels (the product of height x width): 2.07 megapixels (1920 x 1080 = 2073600)
  • can be simple image dimensions in pixels ('width' x 'height'): 1920 x 1080

1000 DPI – standard number for greyscale reproduction on a printing press
300 DPI – standard minimum for color reproduction on a printing press
72 DPI – standard screen resolution
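These three ways of stating resolution are linked by simple arithmetic. The sketch below uses the 1920 x 1080 example from above; the function name `print_size_inches` is hypothetical, and the file-size estimate assumes uncompressed 24-bit RGB and ignores file headers.

```python
width_px, height_px = 1920, 1080

# Megapixels: the product of width and height, in millions of pixels.
megapixels = width_px * height_px / 1_000_000

def print_size_inches(width, height, dpi):
    """Physical size of the image, in inches, when output at the given DPI."""
    return (width / dpi, height / dpi)

size_at_300 = print_size_inches(width_px, height_px, 300)  # color print minimum
size_at_72 = print_size_inches(width_px, height_px, 72)    # screen resolution

# Uncompressed size for 24-bit RGB: 3 bytes per pixel.
uncompressed_bytes = width_px * height_px * 3
```

The same pixels that fill a large screen at 72 DPI cover only about 6.4 by 3.6 inches at 300 DPI, and they take roughly 6 MB before any compression is applied.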

So what?

  • How are file size and image quality related to:
    • Resolution (image dimensions)?
    • Color model?
    • Compression?
    • Vector or Raster?
  • What implications are there for:
    • Short-term workflows?
    • Long-term preservation?

Reading for Next Class (Monday)