Discussion:
[Duplicity-talk] S3 ECONNRESET during restore results in SHA1 hash mismatch
Aphyr
2016-05-04 03:15:51 UTC
Permalink
Hello, all!

I've got a couple terabytes backed up to S3 via duply. I'm doing a disaster
recovery drill and trying to restore that data, but I can't make it more than a
few minutes/hours without hitting a (recoverable!) S3 network hiccup which
breaks the restore process and forces me to restart the restore from scratch.
Each time it breaks on a different file, so I know the issue is a network fault,
not that the files themselves are corrupt.

For example:

--- Start running command RESTORE at 21:41:53.892 ---
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Thu Oct 15 11:45:39 2015
Download
s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg
failed (attempt #1, reason: SSLError: ('The read operation timed out',))
Download
s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg
failed (attempt #2, reason: SSLError: ('The read operation timed out',))
Download
s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg
failed (attempt #3, reason: SSLError: ('The read operation timed out',))
Download
s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg
failed (attempt #4, reason: error: [Errno 104] Connection reset by peer)
Invalid data - SHA1 hash mismatch for file:
duplicity-full.20151015T164539Z.vol26.difftar.gpg
Calculated hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
Manifest hash: 71d69b04b6ed6aa75b604e4eecff51ab08a24cfe

Sometimes it breaks as early as vol3, other times it gets as far as vol745. I've
added

DUPL_PARAMS="$DUPL_PARAMS --num-retries=100 "

to my duply config, but it doesn't seem to make a difference.

I've also seen a suggestion that I download the full S3 archives to a local
directory, and restore from there. Sadly, I don't have enough free disk to cache
an additional 2+ TB.

I think connection-reset errors are always recoverable here; if Duply could
either resume restores, or just sleep and retry the download instead of trying
to verify incomplete files, I think this would work fine. Any suggestions?
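For concreteness, what I have in mind is roughly the loop below — just a sketch, with `download` and `verify` standing in for the real backend and hash-check calls, not duplicity's actual API:

```python
import time

def download_with_retry(download, verify, retries=100, delay=10):
    """Re-download until the file verifies, instead of treating a
    half-transferred file as fatally corrupt."""
    for attempt in range(retries):
        data = download()
        if verify(data):
            return data
        # transient network fault (e.g. ECONNRESET): back off and retry
        time.sleep(delay)
    raise IOError("download failed verification %d times" % retries)
```

With a flaky source that truncates the first couple of transfers, this eventually returns the complete file rather than aborting the whole restore.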

--Kyle
e***@web.de
2016-05-04 08:54:42 UTC
Permalink
Post by Aphyr
Hello, all!
hey Aphyr, please state your duply, duplicity, boto, python versions
Post by Aphyr
I've got a couple terabytes backed up to S3 via duply. I'm doing a disaster recovery drill and trying to restore that data, but I can't make it more than a few minutes/hours without hitting a (recoverable!) S3 network hiccup which breaks the restore process and forces me to restart the restore from scratch. Each time it breaks on a different file, so I know the issue is a network fault, not that that the files themselves are corrupt.
--- Start running command RESTORE at 21:41:53.892 ---
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Thu Oct 15 11:45:39 2015
Download s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg failed (attempt #1, reason: SSLError: ('The read operation timed out',))
Download s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg failed (attempt #2, reason: SSLError: ('The read operation timed out',))
Download s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg failed (attempt #3, reason: SSLError: ('The read operation timed out',))
Download s3://s3-us-west-2.amazonaws.com/my-bucket/b/duplicity-full.20151015T164539Z.vol26.difftar.gpg failed (attempt #4, reason: error: [Errno 104] Connection reset by peer)
duplicity-full.20151015T164539Z.vol26.difftar.gpg
Calculated hash: da39a3ee5e6b4b0d3255bfef95601890afd80709
Manifest hash: 71d69b04b6ed6aa75b604e4eecff51ab08a24cfe
Sometimes it breaks as early as vol3, other times it gets as far as vol745. I've added
DUPL_PARAMS="$DUPL_PARAMS --num-retries=100 "
to my duply config, but it doesn't seem to make a difference.
--num-retries should work over all backends

run duply w/ '--preview' and check that --num-retries is propagated to the command line.
Post by Aphyr
I've also seen a suggestion that I download the full S3 archives to a local directory, and restore from there. Sadly, I don't have enough free disk to cache an additional 2+ TB.
I think connection-reset errors are always recoverable here; if Duply could either resume restores, or just sleep and retry the download instead of trying to verify incomplete files, I think this would work fine. Any suggestions?
probably should, but nobody contributed that so far ;).. ede/duply.net
Aphyr
2016-05-04 20:37:55 UTC
Permalink
e***@web.de
2016-05-04 21:07:24 UTC
Permalink
Post by Aphyr
Hi Edgar! Thanks for the response. :)
Post by e***@web.de
hey Aphyr, please state your duply, duplicity, boto, python versions
duply version 1.9.1
Using installed duplicity version 0.6.24, python 2.7.9, gpg 2.0.26 (Home: ~/.gnupg), awk 'GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)', bash '4.3.30(1)-release (x86_64-pc-linux-gnu)'.
Looks like boto is 2.34.0--that's the only egg I have in /usr/lib/python2.7/dist-packages/.
there seem to be a boto 2.40 and even a boto3.. newer versions often resolve quirks like the one you are experiencing
Post by Aphyr
Post by e***@web.de
--num-retries should work over all backends
didn't check the 0.6 source, but 0.7 definitely uses retry in boto backend
Post by Aphyr
Post by e***@web.de
run duply w/ '--preview' and check that --num-retries is propagated to the command line.
duplicity --name duply_xxx --encrypt-key xxx --sign-key xxx --verbosity '4' --num-retries=100 -t now 's3://s3-us-west-2.amazonaws.com/xxx/b' 'test/
I've tried both --num-retries 100 and --num-retries=100. Neither seems to retry ECONNRESET.
hmm, maybe an issue w/ the "old" duplicity 0.6 .. try the latest 0.7.07.1 ;)
Post by Aphyr
Post by e***@web.de
probably should, but nobody contributed that so far ;).. ede/duply.net
Heh, I'll have to learn python and see what I can do. Just figured someone else might have hit this before. :)
have a look, it's not as difficult as it seems.

good luck.. ede/duply.net
Aphyr
2016-05-10 16:22:31 UTC
Permalink
Post by e***@web.de
probably should, but nobody contributed that so far ;).. ede/duply.net
To follow up: I was never able to figure out why the backend returns
successfully after ECONNRESET (alas, my understanding of python error handling
is still tentative), but I *was* able to complete my restore by patching
restore_get_enc_fileobj to retry downloads when signature verification failed.
Here's my (probably terrible) solution, in case anyone else runs into a similar
problem.

def restore_get_enc_fileobj(backend, filename, volume_info):
    """
    Return plaintext fileobj from encrypted filename on backend

    If volume_info is set, the hash of the file will be checked,
    assuming some hash is available. Also, if globals.sign_key is
    set, a fatal error will be raised if file not signed by sign_key.

    """
    parseresults = file_naming.parse(filename)
    for n in range(100):
        tdp = dup_temp.new_tempduppath(parseresults)
        backend.get(filename, tdp)

        """ verify hash of the remote file """
        verified, hash_pair, calculated_hash = restore_check_hash(volume_info, tdp)
        if verified:
            break

        """ verification failed """
        log.Warn("%s\n %s\n %s\n %s\n" %
                 (_("Invalid data - %s hash mismatch for file:") %
                  hash_pair[0],
                  util.ufn(filename),
                  _("Calculated hash: %s") % calculated_hash,
                  _("Manifest hash: %s") % hash_pair[1]),
                 log.ErrorCode.mismatched_hash)
        """ abandon file and retry """
        tdp.delete()
        time.sleep(10)

    fileobj = tdp.filtered_open_with_delete("rb")
    if parseresults.encrypted and globals.gpg_profile.sign_key:
        restore_add_sig_check(fileobj)
    return fileobj
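To sanity-check the idea outside of duplicity, the core retry-on-mismatch loop can be exercised with a stub backend — all names here are hypothetical stand-ins for the real `backend.get` and `restore_check_hash`:

```python
import hashlib
import time

class FlakyBackend:
    """Stub backend: truncates the first download, then succeeds."""
    def __init__(self, payload):
        self.payload = payload
        self.calls = 0

    def get(self, filename):
        self.calls += 1
        if self.calls == 1:
            # simulate a connection reset partway through the transfer
            return self.payload[:2]
        return self.payload

def fetch_verified(backend, filename, manifest_hash, retries=100):
    """Download until the SHA1 matches the manifest, as in the patch."""
    for n in range(retries):
        data = backend.get(filename)
        if hashlib.sha1(data).hexdigest() == manifest_hash:
            return data
        time.sleep(0)  # the real patch sleeps 10s between attempts
    raise IOError("hash mismatch after %d attempts" % retries)
```

The truncated first transfer fails the hash check and is retried; the second attempt matches the manifest hash and the restore can proceed.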

--Kyle
