Data Protection, Backups, Archiving, Preservation: Are They the Same Thing? Not Quite…
Data Protection
Includes topics such as backups, archives, and preservation
also includes physical security, encryption, and others not addressed here (for later)...
Terms “backups” and “archives” are often used interchangeably, but do have different meanings
Backups: a copy (or copies) of the original file is made before the original is overwritten
Archives: preservation of the file
Data Preservation
Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, metadata
slide from
Backups vs. Archiving
Backups
Used to take periodic snapshots of data in case the current version is destroyed or lost
Backups are copies of files stored for the short or medium term
Often performed on a somewhat frequent schedule
Archives
Used to preserve data for historical reference or potentially during disasters
Archives are usually the final version, stored for long-term, and generally not copied over
Often performed at the end of a project or during major milestones
It is a good idea to have multiple copies of your backups and archives, in case one copy fails.
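A minimal sketch of the "periodic snapshot, multiple copies" idea in Python. The function name, the date-stamp format, and the list of backup locations are assumptions made for illustration; they are not part of any particular backup tool.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def snapshot(source: Path, backup_roots: List[Path],
             when: Optional[datetime] = None) -> List[Path]:
    """Copy `source` into a new date-stamped folder under every backup root."""
    stamp = (when or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    created = []
    for root in backup_roots:
        target = root / f"{source.name}_{stamp}"
        # copytree refuses to overwrite an existing folder, so an earlier
        # snapshot is never silently replaced.
        shutil.copytree(source, target)
        created.append(target)
    return created
```

Calling `snapshot(Path("MapData"), [Path("/Volumes/DriveA"), Path("/Volumes/DriveB")])` (hypothetical drive names) would leave one dated copy on each drive, so losing a single drive does not lose the snapshot.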
Backup and Storage
Major Considerations
Who is responsible for backups?
How often do you back up?
Partial vs. full backups?
Non-digital backups?
Where (literally) will the backups be located?
Do the backups need a description (metadata)?
Manual vs. automatic?
Recovery procedures?
Verification – how do you know the backup was successful?
How long do you keep your backups?
What happens when the project ends?
Don't forget
Data conversions and formats
Versioning
File Naming
Data in real life
A design firm was handling their own backups. The system was working fine and the backup software was reporting that the data was successfully backed up.
The administrator checked the backups immediately after they were done and confirmed they were good.
After a computer virus erased most of their files, they went back to their backups. Unfortunately they found that the backups were all blank and all of the data was gone. Only after some investigation did they discover that the computer tapes (which contained the backups) were placed against a wall that had an elevator on the other side of it. When the elevator went past, the magnets inside erased all of the tapes.
Had they checked their backups properly, they probably would have noticed this before there was an emergency.
Validation: Do you trust your computer?
Always check file sizes after backing up
Or at least check periodically
The MD5 checksum
With this you can monitor the integrity of your data over time
Other checksums are CRC and SHA
Like a fingerprint for a file
Command line: md5sum
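The checks on this slide can be sketched in Python with the standard `hashlib` module. The function names are assumptions for illustration. The size comparison runs first because it is cheap; the checksum catches the case where a backup has the right size but the wrong contents (a blanked tape, for example).

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "md5",
                  chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path, algorithm: str = "md5") -> bool:
    """True only if both the size and the checksum of the backup match the original."""
    if original.stat().st_size != backup.stat().st_size:
        return False  # cheap check first: different sizes can never match
    return file_checksum(original, algorithm) == file_checksum(backup, algorithm)
```

Storing the digest next to the file and recomputing it later is one way to monitor integrity over time. Passing `algorithm="sha256"` swaps in a SHA checksum without changing anything else.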
Synchronization: Do you trust your computer?
Cloud-based services (Box, Dropbox, etc.) are based on folder synchronization
Only copies newer files (that have changed or been created)
Thus contains a "mirror image"
There are commercial and free folder synchronization tools
All have privacy and user data issues
All based on “command line” tools that already exist on your computer
Command line interface (CLI) syncing
Mac or Linux: rsync
Windows: xcopy or robocopy
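To make "only copies newer files" concrete, here is a rough Python sketch of one-way folder synchronization. It is not how any of the tools above are implemented; the function name is an assumption, and unlike a true mirror it never deletes files from the destination.

```python
import shutil
from pathlib import Path
from typing import List


def sync_newer(source: Path, destination: Path) -> List[Path]:
    """One-way sync: copy files that are missing or older at the destination."""
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        dst_file = destination / src_file.relative_to(source)
        if dst_file.exists() and dst_file.stat().st_mtime >= src_file.stat().st_mtime:
            continue  # destination copy is already up to date
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 keeps the modification time
        copied.append(dst_file)
    return copied
```

Running it twice in a row copies nothing the second time, because `copy2` preserves each file's modification time.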
Backup Media Options
Local Machines
Hard disk in computer
External hard drives
Online Solutions
Networked drives (personal cloud)
Repositories
Versioning Tools
The "cloud"
2.5” 500 GB Western Digital SATA Evan-Amos - CC BY-SA 3.0
Spinning (metal) disk: Laptop or Desktop
Pros
High level of control over file system, naming, and physical location of disk
Easy to backup
Convenient
Cons
Risk of malware (virus)
Risk of theft, damage, loss, etc.
System can eventually corrupt the disk (especially PCs)
Finite lifespan
RECOMMENDATIONS:
Never have master copies on your computer
Not for long term storage
Have a backup plan for this storage option
Spinning (metal) disk: (Network) Server
Pros
High level of control over file system, naming, and physical location of disk
Likely has backup and maintenance schedule
Possible duplicate (mirror) images – RAID (Redundant Array of Independent Disks) systems
Safe physical location
Cons
Expensive to maintain
Migration can be difficult
Susceptible to catastrophic events
RECOMMENDATIONS:
Good for master copies
Good for storage of up to 5 years
External storage: memory and drives
Pros
Drives are cheap (sort of) and portable
Convenient
Memory is cheap and portable
Cons
Connection technologies change (USB, FireWire, SATA, and so on)
Drive failure (both spinning drives and memory devices)
Easily damaged, stolen or lost
Finite space: for large projects, multiple drives may be necessary
Malware can be propagated (think unsafe sex)
RECOMMENDATIONS:
Not for master copies
Not for long term storage
Have a backup plan for this storage option
Does anyone use CDs anymore??? ZIP disks??? NOT recommended!!
External storage: magnetic tapes
Pros
Massive amounts of data, cheap
Fast backup
Reusable
Cons
Slow retrieval
Degradation over time
Installation and maintenance is expensive
RECOMMENDATIONS:
Excellent for rolling backups
Networked drives: personal cloud
Pros
Drives are cheap (sort of) and portable
Convenient access from anywhere
Easy to install and sync
Private: password protected
Cons
Upload/download bottlenecks
Susceptible to catastrophic events
Needs permanent power
Needs IP address
RECOMMENDATIONS:
Good "third" option
Western Digital, Seagate . . . Perhaps buy online, out of state?
Depending on data type, can be good for working data
Privacy concerns vary
The Cloud
The term “cloud computing” (or just “cloud”, in the context of computing) is a marketing buzzword with no coherent meaning. It is used for a range of different activities whose only common characteristic is that they use the Internet for something beyond transmitting files. Thus, the term spreads confusion. If you base your thinking on it, your thinking will be confused.
The Bottom Line: Storage and Backup [ best practices ]
DO
RAID storage
External hard drives (exFAT)
Cloud storage and file-syncing
Duplicate computers or hard drives
Write down roles and responsibilities
Organize, file naming conventions, versioning
Have automatic backups
Verify backups
Open formats
DON'T
USB thumb drives
Email files to yourself
Save files without knowing their location in the computer’s file structure
Backup when you remember
The exFAT format is essential if you ever want to share a drive between a Mac and a PC
Mac: Applications/Utilities/Disk Utility
PC: right-click the drive in File Explorer -> Format
The Bottom Line: Storage and Backup [ best practices ]
Have all your work in at least three places at all times: working version + two backups
Drives fail, computers break, viruses happen, computers get stolen, USB thumb drives ALWAYS fail, you will make a mistake and delete your work by accident, ex-partners seek revenge, and the list goes on . . .
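One way to check the three-places rule for a single file, sketched in Python. The function names are assumptions, and a backup only counts here if it is byte-for-byte identical to the working version, which is stricter than the rule itself (an older backup is still a backup).

```python
import hashlib
from pathlib import Path
from typing import Iterable


def _sha256(path: Path) -> str:
    # Reads the whole file at once; fine for a sketch, chunk it for very large files.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def count_good_copies(working: Path, backups: Iterable[Path]) -> int:
    """Count the working file plus every backup that exists and is identical to it."""
    reference = _sha256(working)
    good = 1  # the working version itself
    for backup in backups:
        if backup.is_file() and _sha256(backup) == reference:
            good += 1
    return good


def meets_three_place_rule(working: Path, backups: Iterable[Path]) -> bool:
    """True when the working version plus at least two matching backups exist."""
    return count_good_copies(working, backups) >= 3
```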
Every request is evaluated on a case-by-case basis. Evaluations are based on the requested resource needs and the current resource allocations across campus. If the request is exceptionally large there may be cost sharing requirements.
Short-term solutions at UM
For general sharing and collaboration needs please see the cloud storage solutions that Information Technology provides for students, staff and faculty:
File syncing tools that exist on your machine already:
Mac and Linux: rsync
PC: xcopy or robocopy
You will have to understand the command line first:
Mac: Applications/Utilities/Terminal
PC: Start Button->search “cmd”
Storage and Backup
PC: xcopy
C:\> xcopy <source> <destination> [<options>]
C:\> xcopy c:\Users\tnorris\Documents\MapData\*.* G:\MapData /D /S /Y
This copies all NEWER files from the MapData directory on the local machine
to the backup MapData folder on an external hard drive.
C:\> xcopy /?
This will show all of the options for the xcopy command. You can see that:
/D tells xcopy to only copy newer files
/S goes through all sub-directories
/Y tells xcopy to proceed without asking the user to confirm (be careful!!)
Storage and Backup
Mac or Linux: rsync
% rsync [<options>] <source> <destination>
% rsync -arvu /Users/tnorris/Documents/MapData/* /Volumes/MyDrive/MapDataFB
This copies all NEWER files from the MapData directory on the local machine to
the backup MapDataFB folder on an external hard drive named “MyDrive”.
% man rsync
This will show all of the options for the rsync command. You can see that
-a tells rsync to preserve archival information (date stamps, owners, permissions); it also implies -r
-r tells rsync to go through all sub-directories
-v tells rsync to tell you what it is doing (which files it copied)
-u tells rsync to skip any file that is already newer at the destination
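If the sync is scripted, the command can be built for whichever system it runs on. The Python sketch below only builds the argument list; the function name and the default of a dry run are assumptions. It uses robocopy on Windows (`/E` includes sub-directories, `/XO` skips files older than the destination copy, `/L` lists without copying) and rsync elsewhere (`--dry-run` reports without copying).

```python
import platform
from typing import List


def build_sync_command(source: str, destination: str,
                       system: str = "", dry_run: bool = True) -> List[str]:
    """Return the argument list for a 'copy only newer files' sync."""
    system = system or platform.system()
    if system == "Windows":
        command = ["robocopy", source, destination, "/E", "/XO"]
        if dry_run:
            command.append("/L")  # list what would be copied, copy nothing
    else:
        command = ["rsync", "-avu"]
        if dry_run:
            command.append("--dry-run")  # report what would be copied, copy nothing
        # Trailing slash: copy the contents of source, not the folder itself.
        command += [source.rstrip("/") + "/", destination]
    return command
```

The list can be passed to `subprocess.run`. Note that robocopy reports success with non-zero exit codes (values below 8), so checking for exit code 0 alone would flag a good copy as a failure.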
Briney, K., Goben, A., & Zilinski, L. (2015). Do You Have an Institutional Data Policy? A Review of the Current Landscape of Library Data Services and Institutional Data Policies. Journal of Librarianship and Scholarly Communication, 3(2). http://dx.doi.org/10.7710/2162-3309.1232. SKIM THIS - LOOK MOSTLY AT THE RESULTS OF THE STUDY.
Boyle (2003). The Second Enclosure Movement and the Construction of the Public Domain. Law and Contemporary Problems, 66:33(Winter/Spring), 33-74. http://scholarship.law.duke.edu/lcp/vol66/iss1/2/.
David (2008). The Historical Origins of ‘Open Science’: An Essay on Patronage, Reputation and Common Agency Contracting in the Scientific Revolution. Capitalism and Society 3(2), Article 5. https://dx.doi.org/10.2202/1932-0213.1040.
Uhlir, Paul (ed) (2016). "Legal Interoperability of Research Data: Principles and Implementation Guidelines." Research Data Alliance - Committee on Data for Science and Technology - Legal Interoperability Interest Group. https://zenodo.org/record/162241#.WDRQln17Ifg
Madison (2011). “Knowledge Curation.” Notre Dame Law Review, Vol. 86, p. 1957, 2011; U. of Pittsburgh Legal Studies Research Paper No. 2011-13. Available at SSRN: http://ssrn.com/abstract=1848086
Singh, S. 1999. The Code Book: the science of secrecy from ancient Egypt to quantum cryptography. Fourth Estate, London.
NOTE: encryption mistakes are irreversible
Once again ...
this will get personal
access and passwords
online habits
Access Terms (keywords)
User ID / Password
your UM account
Limited Network Access -or- local area network (LAN)
some research labs, many government offices, many business offices: usually limited to physical presence within the network and virtual private networks (VPN)
Role-Based Access Rights
your computer login (administrator, standard, sharing only, guest)
Password Management
Mistakes leading to weak passwords (do not make these mistakes when choosing a password):
your username as a password (even backwards or mixed up).
using any name, or any word in any language.
obvious personal information (your year of birth, phone number, national insurance number, address, etc.).
all digits, or just one letter.
real words with only one or two obvious digit substitutions, like 'p4ssword' or '5ecret'.
fewer than eight characters ("brute force" attack cracks 7 letters in a few minutes).
characters from books, films, etc. (Gandalf, Sherlock), band names, song titles etc. (no matter how obscure).
passwords that are too easy or too difficult to type
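A few of the mistakes above can be checked mechanically, and the brute-force claim can be worked through as arithmetic. The Python sketch below is illustrative only: the function names are made up, "just one letter" is read as one character repeated, and the guessing rate of 100 million per second is an assumed figure, not a measured one.

```python
from typing import List


def weak_password_reasons(password: str, username: str) -> List[str]:
    """Return which of the listed mistakes a password makes (mechanical checks only)."""
    reasons = []
    lowered = password.lower()
    user = username.lower()
    # Username used as-is, backwards, or with its letters mixed up.
    if user and (lowered == user or lowered == user[::-1]
                 or sorted(lowered) == sorted(user)):
        reasons.append("based on the username")
    if len(password) < 8:
        reasons.append("fewer than eight characters")
    if password.isdigit():
        reasons.append("all digits")
    if len(set(lowered)) == 1:
        reasons.append("just one repeated character")
    return reasons


def brute_force_seconds(length: int, alphabet_size: int = 26,
                        guesses_per_second: float = 1e8) -> float:
    """Worst-case time to try every password of exactly `length` characters."""
    return alphabet_size ** length / guesses_per_second
```

With lowercase letters only, 7 characters give 26^7 = 8,031,810,176 combinations, which is about 80 seconds at the assumed rate. Each extra character multiplies the time by 26, and a larger alphabet (upper case, digits, symbols) raises it much faster.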