Created at 2pm, Jan 14
SekizsuBook
Challenges of Big Data Analysis - Jianqing Fan, Fang Han and Han Liu
G5ZuamWGnmKfmkW-5JyksRwY2TXyBBdswObgR9bRL5I
File Type
PDF
Entry Count
133
Embed. Model
jina_embeddings_v2_base_en
Index Type
hnsw

Challenges of Big Data Analysis. National Science Review, 2014

There are many flexibilities in defining the sparsest solution in the high-confidence set. First of all, we have a choice of the loss function ℓ_n(·): we can regard ℓ′_n(β) = 0 as the estimating equations and define the high-confidence set (10) directly from them. Secondly, we have many ways to measure sparsity; for example, we can use a weighted L1-norm to measure the sparsity of β in (12). By proper choices of the estimating equations in (10) and the measure of sparsity in (12), the authors showed that many useful procedures can be regarded as the sparsest solution in the high-confidence set. For example, both CLIME, for estimating a sparse precision matrix in the Gaussian graphical model, and the linear programming discriminant rule, for sparse high-dimensional classification, are the sparsest solutions in their high-confidence sets. The authors also provided a general convergence theory for such procedures under a con
id: 35c77061db0857aa6cdee55bb13fa61f - page: 12
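To make the construction concrete, the program described above has the generic form sketched below. This is a reconstruction from the surrounding text, not a verbatim copy of the paper's equations; γ is our name for the tolerance, which is not given in the excerpt:

minimize    ‖β‖₁                  (or a weighted L1-norm, as in (12))
subject to  ‖ℓ′_n(β)‖_∞ ≤ γ       (β lies in the high-confidence set (10))

Here γ is a tuning parameter chosen so that, with high probability, the true parameter satisfies the constraint; the estimator is then the sparsest point of that set.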
Finally, the idea is applicable to problems with measurement errors or even endogeneity. In this case, the high-confidence set is defined accordingly, so as to accommodate the measurement errors or endogeneity. See, for example, .
id: 610ab588c83d72a9afa3861c177551db - page: 12
Independence screening

An effective variable screening technique based on marginal screening has been proposed to handle ultra-high-dimensional data, for which the aforementioned penalized quasi-likelihood estimators become computationally infeasible. For such cases, the authors proposed to first use marginal regression to screen variables, reducing the original large-scale problem to a moderate-scale statistical problem, so that more sophisticated methods for variable selection can be applied. The proposed method, named sure independence screening, is computationally very attractive. It has been shown to possess the sure screening property and to have some theoretical advantages over the Lasso [13,88].
id: 4283e840f22acbee15f74798e8741474 - page: 12
There are two main ideas behind sure independence screening: (i) it uses the marginal contribution of a covariate to probe its importance in the joint model; and (ii) instead of selecting the most important variables, it aims at removing variables that are not important. For example, assuming each covariate has been standardized, denote by β̂_j^M the estimated regression coefficient of the j-th covariate in a univariate regression model. The set of covariates that survive the marginal screening is defined as

Ŝ = { j : |β̂_j^M| ≥ δ }    (13)

for a given threshold δ. One can also measure the importance of a covariate X_j by its deviance reduction. For the least-squares problem, both methods reduce to ranking the importance of predictors by the magnitudes of their marginal correlations with the response Y. Conditions have been given under which the sure screening property can be established and false selection rates are controlled.
id: 1778fe0edb905552148de4afab62a983 - page: 12
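A minimal sketch of the marginal screening step in (13), assuming standardized covariates and a least-squares marginal fit; the function name and the synthetic data are ours, not the paper's:

import numpy as np

def marginal_screen(X, y, delta):
    # Standardize each covariate so the marginal coefficients are comparable.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    # Univariate least-squares coefficient of each standardized covariate;
    # up to scale this is its marginal correlation with the response.
    beta_m = Xs.T @ yc / len(y)
    # Keep covariates whose marginal coefficient clears the threshold, as in (13).
    return np.flatnonzero(np.abs(beta_m) >= delta)

# Toy example: n = 200 observations, d = 5000 covariates, 3 of them active.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5000))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(200)
print(marginal_screen(X, y, delta=0.3))

In practice the threshold is often set implicitly, by keeping the top covariates ranked by |β̂_j^M| (e.g. on the order of n/log n of them); the surviving moderate-scale set is then passed to a finer selector such as the Lasso.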
How to Retrieve?
# Search

curl -X POST "https://search.dria.co/hnsw/search" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"rerank": true, "top_n": 10, "contract_id": "G5ZuamWGnmKfmkW-5JyksRwY2TXyBBdswObgR9bRL5I", "query": "What is alexanDRIA library?"}'
        
# Query

curl -X POST "https://search.dria.co/hnsw/query" \
-H "x-api-key: <YOUR_API_KEY>" \
-H "Content-Type: application/json" \
-d '{"vector": [0.123, 0.5236], "top_n": 10, "contract_id": "G5ZuamWGnmKfmkW-5JyksRwY2TXyBBdswObgR9bRL5I", "level": 2}'