Thursday, January 17, 2019

Question: How to read and write HDF5 files in Python

I am trying to read data from an HDF5 file in Python. I can open the file using h5py, but I cannot figure out how to access the data inside it.

Code

import h5py    
import numpy as np    
f1 = h5py.File(file_name,'r+')

This works and the file is opened. But how can I access the data inside the file object f1?

Answers:

Reading HDF5

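import h5py

filename = 'file.hdf5'

# Open read-only; the with block closes the file when done
with h5py.File(filename, 'r') as f:
    # List the top-level keys (groups and datasets)
    print("Keys: %s" % list(f.keys()))
    a_group_key = list(f.keys())[0]

    # Get the data while the file is still open
    data = list(f[a_group_key])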
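To make the "how do I get at the data" step concrete, here is a minimal sketch that first writes a small demo file and then reads every dataset back as a NumPy array. The file name, the dataset names `matrix` and `results/scores`, and the helper `collect` are assumptions made up for illustration; only `h5py.File`, `visititems`, and the `[()]` indexing come from h5py itself.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.arange(6).reshape(2, 3))
    f.create_group("results").create_dataset("scores", data=[0.5, 1.5])

found = {}


def collect(name, obj):
    # visititems calls this for every group and dataset below the root.
    if isinstance(obj, h5py.Dataset):
        found[name] = obj[()]  # [()] reads the whole dataset into memory


with h5py.File(path, "r") as f:
    f.visititems(collect)

for name, array in found.items():
    print(name, array.shape, array.dtype)
```

Groups behave like dictionaries and datasets behave like arrays, so for large datasets it is usually better to slice (for example `f["matrix"][0, :]`) than to read everything at once.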

Writing HDF5

#!/usr/bin/env python
import h5py

# Create random data
import numpy as np
data_matrix = np.random.uniform(-1, 1, size=(10, 3))

# Write data to HDF5
data_file = h5py.File('file.hdf5', 'w')
data_file.create_dataset('group_name', data=data_matrix)
data_file.close()
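A possible variation on the writer above, sketched under a few assumptions: the group name `experiment`, the dataset name `samples`, and the `units` attribute are hypothetical, and gzip compression is just one option h5py offers. It uses a `with` block so the file is closed even if an error occurs, and reads the data back to check the round trip.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo_write.hdf5")  # hypothetical path
data_matrix = np.random.uniform(-1, 1, size=(10, 3))

with h5py.File(path, "w") as f:
    grp = f.create_group("experiment")  # hypothetical group name
    dset = grp.create_dataset("samples", data=data_matrix, compression="gzip")
    dset.attrs["units"] = "arbitrary"  # small metadata stored next to the data

# Read everything back to confirm the round trip.
with h5py.File(path, "r") as f:
    restored = f["experiment/samples"][()]
    units = f["experiment/samples"].attrs["units"]

print(restored.shape, units)
```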

See the h5py documentation for more information.

Alternatives

JSON: Nice for writing human-readable data; VERY commonly used (read & write)
CSV: Super simple format (read & write)
pickle: A Python serialization format (read & write)
MessagePack (Python package): More compact representation (read & write)
HDF5 (Python package): Nice for matrices (read & write)
XML: exists too *sigh* (read & write)

For your application, the following might be important:

Support by other programming languages
Reading / writing performance
Compactness (file size)
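The compactness criterion can be checked directly for your own data. The sketch below is only an illustration, not a benchmark: it serializes one small random matrix (an assumed test input) to JSON, CSV, and pickle in memory and prints the size of each, then confirms that each format restores the same values.

```python
import csv
import io
import json
import pickle

import numpy as np

rng = np.random.default_rng(0)
matrix = rng.uniform(-1, 1, size=(100, 3))  # assumed sample data

# JSON: nested lists of floats, human-readable text.
json_bytes = json.dumps(matrix.tolist()).encode("utf-8")

# CSV: one matrix row per line.
buf = io.StringIO()
csv.writer(buf).writerows(matrix.tolist())
csv_bytes = buf.getvalue().encode("utf-8")

# pickle: Python-only binary format.
pickle_bytes = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)

for name, blob in [("json", json_bytes), ("csv", csv_bytes), ("pickle", pickle_bytes)]:
    print(f"{name:>6}: {len(blob)} bytes")

# Restore the values from each format.
from_json = np.array(json.loads(json_bytes))
csv_rows = csv.reader(io.StringIO(csv_bytes.decode("utf-8")))
from_csv = np.array([[float(x) for x in row] for row in csv_rows])
from_pickle = pickle.loads(pickle_bytes)
```

Text formats spend many characters per float, so the binary form tends to be smaller for numeric matrices; the ranking may differ for other kinds of data.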

See also: Comparison of data serialization formats

(Source: http://paginaswebpublicidad.com/questions/18578/cach-doc-tep-hdf5-bang-python)

=============================================================================

Question: Fastest way to write an HDF5 file with Python?

Given a large (tens of GB) CSV file of mixed text and numbers, what is the fastest way to create an HDF5 file with the same content, while keeping memory usage reasonable?

I would like to use the h5py module if possible.

In the toy example below, I have found an incredibly slow and an incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in blocks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

import h5py

n = 10000000
f = h5py.File('foo.h5','w')
dset = f.create_dataset('int',(n,),'i')

# this is terribly slow
for i in range(n):
  dset[i] = i

# instantaneous
dset[...] = 42
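As a point of comparison for the two extremes above, here is a minimal sketch of the block-wise middle ground the question describes: one slice assignment per block of 10,000 rows rather than one write per element. The smaller `n`, the temporary file path, and the `np.arange` stand-in data are assumptions chosen to keep the demo quick; in a real conversion each block would come from the next batch of CSV rows.

```python
import os
import tempfile

import h5py
import numpy as np

n = 100_000     # smaller than the question's 10,000,000 to keep the demo quick
block = 10_000  # rows per write, the block size the question suggests

path = os.path.join(tempfile.mkdtemp(), "foo.h5")
with h5py.File(path, "w") as f:
    dset = f.create_dataset("int", (n,), dtype="i")
    for start in range(0, n, block):
        stop = min(start + block, n)
        # One slice assignment per block instead of one write per element.
        dset[start:stop] = np.arange(start, stop, dtype="i")

# Read the dataset back to check what was written.
with h5py.File(path, "r") as f:
    values = f["int"][()]
```

Only one block is held in memory at a time, so memory use is bounded by the block size rather than by the size of the file.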

Answers:

I would avoid chunking the data and would store the data as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I have been working on into HDF5, and was able to pack about 4.5 billion compound datatypes as 450,000 datasets, each containing an array of 10,000 values. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.

Update:

These are a couple of snippets lifted from my actual code (I am coding in C rather than Python, but you should get the idea of what I am doing) and modified for clarity. I am just writing long unsigned integers in arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I am simply writing a sequence of long unsigned integers into a series of arrays and loading each array into HDF5 as it is created.

//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;
//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];
//Create the SS data file.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
//NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
hsize_t dsDim[1] = {NUMPERDATASET};
//Create the data space.
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    //Create the data set.
    hid_t dssSet = H5Dcreate2(ssdb, g_strdup_printf("%lu", i), H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    //Close the data set.
    H5Dclose(dssSet);
}
//Release the data space
H5Sclose(dSpace);
//Close the data files.
H5Fclose(ssdb);

This is a slightly modified version of my reader code. There are more elegant ways of doing this (i.e., I could use hyperplanes to get the value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    //MAXSSVALUE = 4,500,000,000; i takes the smaller value of MAXSSVALUE or nnValue
    //to avoid index out of range error 
    unsigned long int i = MIN(MAXSSVALUE-1,nnValue);
    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    //Open the data set. In this case, each dataset consists of an array of 10,000
    //unsigned long int and is named according to its integer division value of i divided
    //by the number per data set.
    hid_t dSet = H5Dopen(db, g_strdup_printf("%lu", i / NUMPERDATASET), H5P_DEFAULT);
    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    //Close the data set.
    H5Dclose(dSet);
    //Close the data file.
    H5Fclose(db);
    //Return the indexed value by using the modulus of i divided by the number per dataset
    return ssValue[i % NUMPERDATASET];
}

The main take-away is the inner loop in the writer code, plus the integer division and mod operations that give the index of the dataset array and the index of the desired value within that array. Let me know if this is clear enough for you to put together something similar or better in h5py. In C, this is dead simple and gives me significantly better read/write times than a chunked dataset solution. Plus, since I cannot use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.
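One way this scheme might look in h5py is sketched below. It is a rough port, not the answerer's code: the constants are scaled down (5 datasets instead of 450,000), `np.uint64` stands in for C's `unsigned long` on the assumption of a 64-bit platform, and the file path and function name `get_value_by_index` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

NUM_PER_DATASET = 10_000
NUM_DATASETS = 5  # the answer uses 450,000; kept tiny here
MAX_VALUE = NUM_PER_DATASET * NUM_DATASETS

path = os.path.join(tempfile.mkdtemp(), "index.h5")  # hypothetical file

# Writer: one dataset per block, named after its block number.
# No chunks or compression are requested, so h5py stores each one contiguously.
with h5py.File(path, "w") as f:
    for i in range(NUM_DATASETS):
        start = i * NUM_PER_DATASET
        block = np.arange(start, start + NUM_PER_DATASET, dtype=np.uint64)
        f.create_dataset(str(i), data=block)


def get_value_by_index(file_path, index):
    """Return the stored value at a global index."""
    i = min(MAX_VALUE - 1, index)             # clamp to avoid an out-of-range lookup
    dataset_name = str(i // NUM_PER_DATASET)  # integer division picks the dataset
    offset = i % NUM_PER_DATASET              # modulus picks the slot inside it
    with h5py.File(file_path, "r") as f:
        return int(f[dataset_name][offset])


print(get_value_by_index(path, 12_345))
```

Unlike the C reader, which loads the whole 10,000-element array before indexing, this sketch asks h5py for a single element; reading the full array with `f[dataset_name][()]` would mirror the original more closely.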

(Source: http://paginaswebpublicidad.com/questions/23085/cach-nhanh-nhat-de-ghi-tep-hdf5-bang-python)

Wednesday, January 16, 2019

Dynamic equilibrium models with efficient computation, and estimation.

My research interests are in the formulation of dynamic equilibrium models, their efficient computation, and their estimation.

Below you can get pdf copies of my papers and codes for reproducing some of the computations involved.

New Items

November 13, 2018. A preliminary version of my paper "Financial Frictions and the Wealth Distribution" with Galo Nuño and Samuel Hurtado can be found here. Note that some of the results reported in the figures are not completely described in the text. The slides are here.

October 9, 2018. We updated our paper "A Practical Guide to Parallelization in Economics" (here) to include the new Julia 1.0 parallelization syntax. The GitHub repository (here) is also updated.

September 15, 2018. An interview about my research published at Econ Focus, from the Federal Reserve Bank of Richmond, can be found here.

September 11, 2018. The paper "Cryptocurrencies: A Crash Course in Digital Monetary Economics" can be found here. A Spanish translation is here.

August 27, 2018. A new paper "The Lack of European Productivity Growth: Causes and Lessons for the U.S." with Lee Ohanian can be found here.

April 22, 2018. A new paper "A Practical Guide to Parallelization in Economics" here. An easy-to-follow guide of what you need to know to start coding in parallel in economics. It comes with a Github repository: here. Download and fork the codes!

March 22, 2018. An update on the computations on "A Comparison of Programming Languages" here with the results for new versions of each language. Bottom line: Matlab and R have improved a lot, Python is still awful, and Julia rocks!

February 10, 2018. A short paper on the economics of minimum wage regulations here.

July 25, 2017. A paper with Thorsten Drautzburg and Pablo Guerron-Quintana on the importance of political risk and bargaining shocks to understand aggregate fluctuations can be found here.

July 9, 2017. An updated version of my paper with Daniel Sanches on currency competition among privately issued fiat currencies such as Bitcoin or Ethereum can be found here. Also, a short companion paper summarizing some of the findings can be found here.

July 4, 2017. A paper on how institutions and party systems are affected by the euro with Tano Santos can be found here.

May 6, 2017. A paper on safe assets with Robert Barro, Oren Levintal, and Andrew Mollerus can be found here.

December 28, 2016. A paper on the effects of labor market regulations can be found here.

April 21, 2016. The paper "The Pruned State-Space System for Non-Linear DSGE Models: Theory and Empirical Applications," joint with Martin Andreasen and Juan Rubio-Ramirez, has been updated with a new empirical application and a few more results regarding GIRFs. The new copy is here. A detailed technical appendix is here. Codes for Dynare 4.4: here.

February 6, 2016. A new paper with Oren Levintal on the computation of models with rare disasters can be found here. Companion code here.

December 30, 2015. My chapter on solution and estimation methods for DSGE models for the Handbook of Macroeconomics (joint with Juan Rubio-Ramirez and Frank Schorfheide) can be found here.

September 20, 2015. A brief paper "Magna Carta, the Rule of Law, and the Limits on Government," written for a conference celebrating the 800th anniversary of Magna Carta can be found here.

Handbook Chapters

Solution and Estimation Methods for DSGE Models.

Joint with Juan Rubio-Ramirez and Frank Schorfheide.

The New Macroeconometrics: A Bayesian Approach.

Joint with Pablo Guerron-Quintana and Juan F. Rubio-Ramirez.

Dynamic Macroeconomics

Political Distribution Risk and Aggregate Fluctuations

Joint with Thorsten Drautzburg and Pablo Guerron-Quintana.

Safe Assets

Joint with Robert Barro, Oren Levintal, and Andrew Mollerus.

Can Currency Competition Work?

Joint with Daniel Sanches.

A short companion paper summarizing some of the findings can be found here

Cryptocurrencies: A Crash Course in Digital Monetary Economics

Optimal Capital Versus Labor Taxation with Innovation-Led Growth.

Joint with Philippe Aghion and Ufuk Akcigit.

Nonlinear Adventures at the Zero Lower Bound.

Joint with Grey Gordon, Pablo Guerron, and Juan Rubio-Ramirez.

Supply-Side Policies and the Zero Lower Bound.

Joint with Pablo Guerron and Juan Rubio-Ramirez.

Fiscal Volatility Shocks and Economic Activity.

Joint with Keith Kuester, Pablo Guerron and Juan Rubio-Ramirez.

Reading the Recent Monetary History of the U.S., 1959-2007.

Joint with Pablo Guerron-Quintana and Juan Rubio-Ramirez.

A Review Session for Monetary and Fiscal Policy

Macroeconomics and Volatility: Data, Models, and Estimation.

Joint with Juan Rubio-Ramirez.

Fortune or Virtue: Time-Variant Volatilities Versus Parameter Drifting in U.S. Data.

Joint with Pablo Guerron-Quintana and Juan Rubio-Ramirez.

The Term Structure of Interest Rates in a DSGE Model with Recursive Preferences.

Joint with Jules van Binsbergen, Ralph Koijen, and Juan Rubio-Ramirez.

Fiscal Policy in a Model with Financial Frictions.

An extended version of the model can be found here.

Risk Matters: The Real Effects of Volatility Shocks.

Joint with Pablo Guerron-Quintana, Juan F. Rubio-Ramirez, and Martin Uribe.

From Shame to Game in One Hundred Years: An Economic Model of the Rise in Pre-marital Sex and its De-Stigmatization.

Joint with Jeremy Greenwood and Nezih Guner.

A,B,C's (and D)'s for Understanding VARs.

Joint with Juan F. Rubio-Ramirez, Tom Sargent, and Mark Watson.

The older (and longer) working paper version is here.

Life-Cycle Consumption, Debt Constraints and Durable Goods

Joint with Dirk Krueger.

Consumption over the Life Cycle: Facts from Consumer Expenditure Survey Data.

Joint with Dirk Krueger.

Here you can find the technical appendix of the paper with further details about our estimation.

Optimal Fiscal Policy in a Business Cycle Model without Commitment (incomplete draft).

Joint with Aleh Tsyvinski.

Was Malthus Right? Economic Growth and Population Dynamics.

Some Further Notes on "Was Malthus Right? Economic Growth and Population Dynamics".

These notes present further discussion of several aspects of "Was Malthus Right? Economic Growth and Population Dynamics". They should be read following each particular section of the main paper.

Can We Really Observe Hyperbolic Discounting?

Joint with the late Arijit Mukherji.

Evaluating Labor Market Reforms: A General Equilibrium Approach.

Joint with Cesar Alonso-Borrego and Jose E. Galdon.

On the Solution of the Growth Model with Investment-Specific Technological Change.

Joint with Juan F. Rubio-Ramirez.

Computation of Dynamic Equilibrium Models

A Practical Guide to Parallelization in Economics.

Joint with David Zarruk Valencia.

Github repository: here

Comparing Solution Methods for Dynamic Equilibrium Economies.

Joint with S. Boragan Aruoba and Juan F. Rubio-Ramirez.

Click on this link to go to the companion web page where you can find the codes used in this paper.

Solution Methods for Models with Rare Disasters.

Joint with Oren Levintal.

Companion code here.

A Comparison of Programming Languages in Economics.

Joint with S. Boragan Aruoba.

Click on this link to get the codes used in this paper.

Tapping the Supercomputer Under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors.

Joint with Eric Aldrich, Ron Gallant, and Juan Rubio-Ramirez

Computing DSGE Models with Recursive Preferences and Stochastic Volatility.

Joint with Dario Caldara, Juan F. Rubio-Ramirez, and Yao Wen.

Solving DSGE Models with Perturbation Methods and a Change of Variables.

Joint with Juan F. Rubio-Ramirez.

Mathematica Notebook to compute the optimal change of variables.

A Generalization of the Endogenous Grid Method.

Joint with Francisco Barillas.

Fortran code to compute the models described in the paper using the Endogenous Grid Method and value function iteration.

Estimation of Dynamic Equilibrium Models

Our Research Agenda: Estimating DSGE Models.

Joint with Juan F. Rubio-Ramirez.

This note, which appears in the newsletter of the Review of Economic Dynamics, fall 2006, describes our agenda on the estimation of DSGE Models. We discuss our different papers and explain how they fit together.

The Pruned State-Space System for Non-Linear DSGE Models: Theory and Empirical Applications.

Joint with Martin Andreasen and Juan F. Rubio-Ramirez.

A detailed technical appendix is here. Codes for Dynare 4.4: here.

The Econometrics of DSGE Models.

MEDEA: A DSGE Model for the Spanish Economy.

Joint with Pablo Burriel and Juan F. Rubio-Ramirez.

Estimating Macroeconomic Models: A Likelihood Approach.

Joint with Juan F. Rubio-Ramirez.

The technical appendix offers further details in some aspects of the paper.

Sequential Monte Carlo Filtering: an Example

Here you can find a simple example of how to use a Sequential Monte Carlo to evaluate the likelihood function of a nonlinear and non-normal process.

Estimating Dynamic Equilibrium Economies: Linear versus Nonlinear Likelihood.