
October 16, 2007

Lost in lossy: compression loses information

JPEG: image compression artifacts.

June 11, 2007

Cash back at closing -- Mortgage fraud ?

First he built a dictionary of 150 keywords in real estate
ads — “creative financing,” for instance — that might
signal a seller’s willingness to play loose. He then looked
for instances in which a house had languished on the
market and yet wound up selling at or even above the
final asking price. In such cases, he found that buyers
typically paid a very small down payment; the smaller
the down payment, in fact, the higher the price they
paid for the house. What could this mean?

Either the most highly leveraged buyers were terrible
bargainers — or, as Ben-David concluded, such anomalies
indicated the artificial inflation that marked a cash-back deal.

Having isolated the suspicious transactions in the data,
Ben-David could now examine the noteworthy traits they
shared. He found that a small group of real estate agents
were repeatedly involved, in particular when the seller was
himself an agent or when there was no second agent in the
deal. Ben-David also found that the suspect transactions
were more likely to occur when the lending bank, rather
than keeping the mortgage, bundled it up with thousands
of others and sold them off as mortgage-backed securities.

This suggests that the issuing banks treat suspect mortgages
with roughly the same care as you might treat a rental car,
knowing that you aren’t responsible for its long-term outcome
once it is out of your possession.

-- Freakonomics of the week.

June 01, 2007

Epicurean Dealmaker

epicureandealmaker on fat tails.
Derivatives: Transferring risk or reducing risk ?

May 30, 2007

Interest-rate term-structure pricing models: Riccardo Rebonato

Review Paper. Interest-rate term-structure pricing models: a review
Riccardo Rebonato

Reviews interest-rate term-structure modelling from the early
short-rate-based models to current developments, with a focus on using
models for pricing complex derivatives or for relative-value option trading.
Relative-pricing models are therefore given greater emphasis than
equilibrium models.

The review argues that the current state of modelling owes a lot to how
models have historically developed in the industry, and stresses the importance
of 'technological' developments (such as faster computers or more
efficient Monte Carlo techniques) in guiding the direction of theoretical
research.

The importance of the joint practices of vega hedging and daily
model-recalibration is analysed in detail. The relevance of market
incompleteness and of the possible informational inefficiency of
derivatives markets for calibration and pricing is also discussed.


April 22, 2007

Susan Athey, econometrician, wins Clark Medal

Susan Athey's work on applied econometrics and heterogeneity of mentorship
wins the Clark Medal, which is awarded to the most accomplished economist
under 40 and is the most distinguished prize short of a Nobel.
Bio.

March 17, 2007

NOAA vs the world

Paul K Greed asks why the anomaly figure is
less worrisome than it seems:

1. All weather observers are safely far away from Iraq
2. It's still 72 F and sunny in La Jolla
3. America is still red, white and blue.

February 15, 2007

Prosper lending community

Prosper now enjoys some powerful tools.

regression analysis forum
adverse selection says avoid the high rate borrowers.
Money Walks journal of patterns in data, performance.

Eric's great survey of lenders: outstandings, return

Prosper Analytics animated charts, not quite chart junk. ROI.

P2P conventional loan analytics.

Prosper's own loan performance database.

February 06, 2007

Infosthetics

Infosthetics shows time trends.

January 24, 2007

Many Eyes interactive data visualization

Data visualization in web browser, with interaction.
New champion: IBM's Many Eyes.

Liked by JHeer and radar.oreilly.

January 20, 2007

Atrios's bookshelf

Atrios's bookshelf.
A former economist, indeed.


October 06, 2006

Visualization and segmentation: Gelman's bag of tricks

Visualization and segmentation: Gelman's Bag of Tricks for teaching statistics.

See also Gelman's Data Analysis Using Regression and Multilevel/Hierarchical Models.

August 06, 2006

Hedging beyond duration and convexity

Hedging beyond duration and convexity.

Considers a representation using a Fourier-like harmonic series, with
empirical evidence that such a series supports a hedging strategy on a
mortgage-backed security (MBS) with the first four principal components
of the yield curve.


July 10, 2006

Haver data

Haver Analytics provides economic data, ready to use
in Stata and EViews formats.

PCE time series inflationary ?

July 07, 2006

Sparklines time series

Show the time series with a sparkline.
Sparklines wiki.
Go mad with stock charts.

US Federal Budget deficit, 1983-2003.
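
A text-only sparkline is easy to sketch. The snippet below is a minimal illustration, not the method used by any sparkline library: it maps a series onto eight Unicode block glyphs. The function name and the sample numbers are made up for the example and are not the budget series above.

```python
# Eight block glyphs, lowest to highest.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Return a one-line text sparkline for a non-empty sequence of numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series gets the same mid-height glyph everywhere.
        return BARS[3] * len(values)
    scale = (len(BARS) - 1) / (hi - lo)
    # Rescale each value to an index 0..7 and pick the matching glyph.
    return "".join(BARS[round((v - lo) * scale)] for v in values)


# Hypothetical data, for illustration only.
series = [1, 2, 4, 8, 5, 3, 2, 6]
print(sparkline(series))
```

Scaling to the series' own minimum and maximum keeps the shape visible at word size, at the cost of hiding the absolute level.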

June 27, 2006

Zivot on time series

Zivot's class notes in time series econometrics.

May 25, 2006

Unobserved Components Model, Proc UCM

Underlying model and several of the features of Proc UCM, new in the
Econometrics and Time Series (ETS) module of SAS.

Time series data is generated by marketers as they monitor “sales by month”
and by medical researchers who collect vital sign information over time. This
technique is well suited to modeling the effect of interventions (drug administration
or a change in a marketing plan). This new procedure combines the flexibility of
Proc ARIMA with the ease of use and interpretability of Smoothing models.

UCM cannot yet easily model transfer functions, a useful
ARIMA feature that is planned for Proc UCM.

An Animated Guide©: Proc UCM (Unobserved Components Model)
Russ Lavery, Contractor for ASG, Inc., PDF

May 24, 2006

Econometric notes

Econometric course notes by John Aldrich.

May 23, 2006

SURSE

Seemingly unrelated regressions and simultaneous equations: PDF

May 22, 2006

Statespace in SAS

Statespace in SAS/ETS.

The STATESPACE procedure analyzes and forecasts multivariate
time series using the state space model. The STATESPACE procedure
is appropriate for jointly forecasting several related time series that
have dynamic interactions. By taking into account the autocorrelations
among the whole set of variables, the STATESPACE procedure may
give better forecasts than methods that model each series separately.

May 15, 2006

NumSum spreadsheets on the web

Spreadsheets put on the web by NumSum.
Like Flickr for accountants.


April 18, 2006

Home value by room count, Miller Samuel

Home value by rooms by Miller Samuel.
This regression is crying out for a log transformation.

And what do all the data points with fractional room counts represent ?


April 04, 2006

Dashboard spy

Dashboard spy gallery of management dashboards and consoles full of KPI
(Key performance indicators).
Update 2006 Dec.: Moved to enterprise-dashboard.com.

Ed Tufte adds,

Continue reading "Dashboard spy" »

December 22, 2005

Kimberly 'KC' Claffy

Kimberly 'KC' Claffy measures internet traffic.

December 01, 2005

log base 2

logbase2 is mostly biostatistics and visualization,
with a blast of r.

Bonus (detritus ?): And compliant lefty Canadian commentary.

October 27, 2005

Google News Report USA Score

Fetch headlines from Google News on a schedule, then rank
headlines by factors:

* appearance day and time,
* prominence on the google news page,
* number of appearances,
* others;

weighted to estimate referrer traffic these links bring to their
source.
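
The ranking described above can be sketched as a weighted sum. This is only an illustration: the post does not give its weights or functional form, so the `Headline` fields, the three weights and the scoring formula below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Headline:
    title: str
    source: str
    hours_old: float   # time since the headline first appeared
    page_rank: int     # 1 = top slot on the news page
    appearances: int   # number of scheduled fetches that saw it


# Hypothetical weights; the original report does not publish its own.
W_RECENCY, W_PROMINENCE, W_COUNT = 1.0, 2.0, 0.5


def score(h):
    """Weighted score: newer, more prominent, more frequent ranks higher."""
    recency = 1.0 / (1.0 + h.hours_old)
    prominence = 1.0 / h.page_rank
    return W_RECENCY * recency + W_PROMINENCE * prominence + W_COUNT * h.appearances


def rank_headlines(headlines):
    """Top-scoring stories first."""
    return sorted(headlines, key=score, reverse=True)


def rank_sources(headlines):
    """Sum the scores per source and sort sources by the total."""
    totals = {}
    for h in headlines:
        totals[h.source] = totals.get(h.source, 0.0) + score(h)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```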

Listed are the top scoring stories in recent time periods, followed
by a ranking of sources. More detailed reports are linked-to at the
bottom of each table.


October 15, 2005

Joint regression analysis

Joint regression analysis is used to study genotype-environment
interaction: genotype effects and/or interaction effects within individual
environments are related to environmental effects.

The interaction sum of squares is divided into two parts:
* one part represents the heterogeneity of linear regression
coefficients while
* the second represents the pooled deviations from individual
regression lines.
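
The two-part split can be sketched in a few lines. This is a minimal illustration on a table of genotype-by-environment means, assuming the usual setup in which each genotype's interaction effects are regressed on the environmental effect (environment mean minus grand mean); the function and variable names are my own, not Baker's.

```python
def joint_regression(y):
    """y[i][j] = mean response of genotype i in environment j."""
    g, e = len(y), len(y[0])
    grand = sum(map(sum, y)) / (g * e)
    geno_mean = [sum(row) / e for row in y]
    env_mean = [sum(y[i][j] for i in range(g)) / g for j in range(e)]
    env_effect = [m - grand for m in env_mean]  # environmental index
    sxx = sum(x * x for x in env_effect)

    # Interaction effects: what is left after removing both main effects.
    inter = [[y[i][j] - geno_mean[i] - env_mean[j] + grand for j in range(e)]
             for i in range(g)]
    ss_interaction = sum(v * v for row in inter for v in row)

    # Slope of each genotype's interaction effects on the environmental index.
    slopes = [sum(inter[i][j] * env_effect[j] for j in range(e)) / sxx
              for i in range(g)]

    # Part 1: heterogeneity of the regression coefficients.
    ss_heterogeneity = sxx * sum(b * b for b in slopes)
    # Part 2: pooled deviations from the individual regression lines.
    ss_deviations = ss_interaction - ss_heterogeneity
    return slopes, ss_interaction, ss_heterogeneity, ss_deviations
```

The slopes here are deviations from the average response; adding 1 to each gives the regression of the genotype's own means on the environmental index.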

R. J. (Bob) Baker

September 08, 2005

Hospital Length of Stay: Mean or Median Regression

Length of stay (LOS) is an important measure of hospital activity and
health care utilization, but its empirical distribution is often
positively skewed.

Median regression appears to be a suitable alternative to analyze
the clustered and positively skewed LOS, without transforming and
trimming the data arbitrarily.
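
A small sketch shows why the median copes better with skew. It covers only the intercept-only case (no covariates, no clustering), where median regression reduces to the sample median; the length-of-stay values are hypothetical.

```python
from statistics import mean, median


def abs_loss(c, data):
    """Sum of absolute deviations: the criterion median regression minimizes."""
    return sum(abs(x - c) for x in data)


def sq_loss(c, data):
    """Sum of squared deviations: the criterion mean regression minimizes."""
    return sum((x - c) ** 2 for x in data)


# Hypothetical lengths of stay in days, with one very long stay.
los_days = [1, 2, 2, 3, 3, 4, 5, 7, 30]

center_mean = mean(los_days)      # pulled upward by the 30-day stay
center_median = median(los_days)  # unaffected by how long the longest stay is

print(center_mean, center_median)
print(abs_loss(center_median, los_days), abs_loss(center_mean, los_days))
```

Replacing the 30-day stay with a 300-day stay would move the mean a great deal and leave the median where it is, with no trimming or transformation needed.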


August 29, 2005

r graphics (Paul Murrell) is out

R Graphics by Paul Murrell shipped.

Previously announced.


August 24, 2005

MCMC method bandwidth selection for multivariate kernel density estimation

Kernel density estimation for multivariate data is an important
technique that has a wide range of applications in econometrics and
finance. Its comparatively limited use is mainly due to the increased
difficulty in deriving an optimal data-driven bandwidth as the
dimension of data increases. We provide Markov chain Monte Carlo
(MCMC) algorithms for estimating optimal bandwidth matrices for
multivariate kernel density estimation.

Our approach is based on treating the elements of the bandwidth matrix
as parameters whose posterior density can be obtained through the
likelihood cross-validation criterion. Numerical studies for bivariate
data show that the MCMC algorithm generally performs better than the
plug-in algorithm under the Kullback-Leibler information criterion.
Numerical studies for five dimensional data show that our algorithm is
superior to the normal reference rule.
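
The likelihood cross-validation criterion is simple to write down. The sketch below is a deliberately reduced version, not the paper's algorithm: it is univariate (one scalar bandwidth rather than a bandwidth matrix) and it picks the bandwidth by grid search instead of MCMC sampling. Function names and the candidate grid are assumptions.

```python
import math


def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = len(data)
    norm = (n - 1) * h * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i, xi in enumerate(data):
        # Density at xi estimated from every point except xi itself.
        dens = sum(math.exp(-0.5 * ((xi - xj) / h) ** 2)
                   for j, xj in enumerate(data) if j != i) / norm
        total += math.log(dens) if dens > 0.0 else float("-inf")
    return total


def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest criterion value."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))
```

In the MCMC approach the same criterion plays the role of the likelihood for the bandwidth parameters, so the sampler explores bandwidths instead of a fixed grid.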


August 23, 2005

Curve Forecasting by Functional Autoregression

This paper explores prediction in time series in which the data is
generated by a curve-valued autoregression process. It develops a
novel technique, the predictive factor decomposition, for estimation
of the autoregression operator, which is designed to be better suited
for prediction purposes than the principal components method.

The technique is based on finding a reduced-rank approximation to the
autoregression operator that minimizes the norm of the expected
prediction error. The new method is illustrated by an analysis of the
dynamics of Eurodollar futures rates term structure. We restrict the
sample to the period of normal growth and find that in this subsample
the predictive factor technique not only outperforms the principal
components method but also performs on par with the best available
prediction methods.

Curve Forecasting by Functional Autoregression
Presenter(s) Alexei Onatski, Columbia University
Co-Author(s) Vladislav Kargin, Cornerstone Research
Session Chair James Stock, Harvard University


August 20, 2005

Functional data analysis (FDA)

Functional data analysis (FDA) handles longitudinal data and treats
each observation as a function of time (or other variable). The
functions are related. The goal is to analyze a sample of functions
instead of a sample of related points.

FDA differs from traditional data analytic techniques in a number of
ways. Functions can be evaluated at any point in their domain.
Derivatives and integrals, which may provide better information (e.g.
graphical) than the original data, are easily computed and used in
multivariate and other functional analytic methods.


S+Functional Data Analysis User's Guide
by Douglas B. Clarkson, Chris Fraley, Charles C. Gu, James O. Ramsay




Functional Data Analysis (Springer Series in Statistics) (Hardcover)
by J. Ramsay, B. W. Silverman

Covers topics of linear models, principal components, canonical
correlation, and principal differential analysis in function spaces.




Applied Functional Data Analysis (Paperback)
by J.O. Ramsay, B.W. Silverman

Bernard W. Silverman's code site for Applied Functional Data Analysis: Methods and Case Studies


August 19, 2005

Mathematical Statistics with MATHEMATICA

Mathematical Statistics with MATHEMATICA,
Colin Rose, Murray D. Smith (Hardcover)


The mathStatica software, an add-on to Mathematica, provides a
toolset specially designed for doing mathematical statistics. It
enables students to solve difficult problems by removing the technical
calculations often associated with mathematical statistics. The
professional statistician will be able to tackle tricky multivariate
distributions, generating functions, inversion theorems, symbolic
maximum likelihood estimation, unbiased estimation, and the checking
and correcting of textbook formulas. This text would be a useful
companion for researchers and students in statistics, econometrics,
engineering, physics, psychometrics, economics, finance, biometrics,
and the social sciences.

Companion site mathStatica.com

August 04, 2005

Information Visualisation with r

Information Visualisation Lecture Slides uses r.


July 21, 2005

Asset prices by Enrico De Giorgi

Default models and asset pricing models at Enrico De Giorgi's resource,
some with correlated defaults.

July 19, 2005

sas proc quantreg for quantile regression

Some PROC QUANTREG features are:

* Implements the simplex, interior point, and smoothing algorithms for
estimation

* Provides three methods to compute confidence intervals for the
regression quantile parameter: sparsity, rank, and resampling.

* Provides two methods to compute the covariance and correlation
matrices of the estimated parameters: an asymptotic method and a
bootstrap method

* Provides two tests for the regression parameter estimates: the Wald
test and a likelihood ratio test

* Uses robust multivariate location and scale estimates for leverage
point detection

* Multithreaded for parallel computing when multiple processors are
available

[PDF, *]

July 17, 2005

SAS examples with explanation at ucla.edu/stat/SAS/

SAS examples with explanation abound at UCLA: 1, 2.

July 10, 2005

Array manipulation: Perl Data Language (PDL) and piddles

PDL gives standard Perl the ability to COMPACTLY store and SPEEDILY
manipulate the large N-dimensional data sets which are the bread and
butter of scientific computing. For example, $a=$b+$c can add two
2048x2048 images in only a fraction of a second.

Perl Data Language (PDL), PDL::Impatient - PDL for the impatient

A PDL scalar variable (an instance of a particular class of
perl object, i.e. blessed thingie) is a piddle.

June 17, 2005

state of stats

What have we learnt ? State of stats: PDF, Antony Unwin on Statistical Learning.
Global criteria: AIC, BIC, deviance, test error, ...
Local criteria: residuals, diagnostics


June 16, 2005

Support Vector Machine

An SVM corresponds to a linear method in a very high dimensional feature
space which is nonlinearly related to the input space. It does not
involve any computations in that high dimensional space. By the use of
kernels, all necessary computations are performed directly in input space.
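
A tiny check makes the kernel idea concrete. The sketch below uses the homogeneous degree-2 polynomial kernel as an example (my choice, not one named in the post) and confirms that the kernel value computed in input space equals the inner product of the explicit feature vectors.

```python
def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in input space."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map: every ordered product x_i * x_j."""
    return [a * b for a in x for b in x]


x, z = [1, 2, 3], [4, 5, 6]
in_input_space = poly2_kernel(x, z)                             # 3 multiplications
in_feature_space = dot(poly2_features(x), poly2_features(z))    # 9-dimensional
print(in_input_space, in_feature_space)
```

For d inputs the explicit map has d squared coordinates, while the kernel still costs one d-dimensional dot product; richer kernels widen that gap much further.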

SVMs are a method for creating functions from a set of labeled training
data. The function can be a classification function (the output is
binary: is the input in a category) or the function can be a general
regression function.

For classification, SVMs operate by finding a hypersurface in the
space of possible inputs. This hypersurface will attempt to split the
positive examples from the negative examples. The split will be chosen
to have the largest distance from the hypersurface to the nearest of
the positive and negative examples. Intuitively, this makes the
classification correct for testing data that is near, but not
identical to the training data.


r (with module e1071):
estimate, predict, example, example2.

Matlab:
Kernel Methods for Pattern Analysis
John Shawe-Taylor & Nello Cristianini
Cambridge University Press, 2004
Detailed contents, inventory of algorithms and kernels, and matlab code.

Stand-alone:

SVM Light is a Support Vector Machine implementation.


June 15, 2005

Spectral Graph Transducer, SGTlight

SGTlight is an implementation of a Spectral Graph Transducer (SGT)
[Joachims, 2003] in C using Matlab libraries. The SGT is a method for
transductive learning. It solves a normalized-cut (or ratio-cut) problem
with additional constraints for the labeled examples using spectral
methods. The approach is efficient enough to handle datasets with
several tens of thousands of examples.

June 14, 2005

Analysis of patterns

Analysis of patterns

Automatic pattern analysis of data is a pillar of modern science,
technology and business, with deep roots in statistics, machine
learning, pattern recognition, theoretical computer science, and many
other fields. A unified conceptual understanding of this strategic
field is of utmost importance for researchers as well as for users of
this technology.

This workshop-course will emphasize the common principles and roots
of modern pattern analysis technology, developed independently by many
different scientific communities over the past 30 years, and their
impact on modern science and technology.

Students and researchers from many disciplines dealing with automatic
pattern analysis form the intended audience. These include (but are
not limited to) statistics, pattern recognition, data mining, machine
learning, information theory, sequence analysis, bioinformatics,
adaptive systems, etc.

Italy, October 28 - November 6, 2005

June 13, 2005

Data mining competition

Fair Isaac and UCSD data mining competition lets you test your predictive power.

May 03, 2005

Kalman filter with Mathematica

Kalman filter: an algorithm in control theory introduced by R. Kalman in 1960 and
refined by Kalman and R. Bucy. It makes optimal use of imprecise
data on a linear (or nearly linear) system with Gaussian errors to continuously update
the best estimate of the system's current state.
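
The predict-and-update cycle is easiest to see in one dimension. The sketch below is a minimal scalar filter for a random-walk state observed with noise (a "local level" model), written in Python rather than Mathematica; the model choice, the parameter names and the diffuse starting variance are assumptions for illustration.

```python
def kalman_local_level(observations, q, r, x0=0.0, p0=1e6):
    """Scalar Kalman filter.

    State:       x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
    Observation: y_t = x_t + v_t,      v_t ~ N(0, r)
    Returns a list of (state estimate, estimate variance) pairs.
    """
    x, p = x0, p0
    estimates = []
    for y in observations:
        # Predict: a random walk keeps its mean; only the variance grows.
        p = p + q
        # Update: blend prediction and observation using the Kalman gain.
        k = p / (p + r)
        x = x + k * (y - x)
        p = (1.0 - k) * p
        estimates.append((x, p))
    return estimates
```

With q set to 0 the state is constant and the filter reduces to a running average of the observations; a larger q makes it track recent data more closely.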

As a time series function (example); as an estimator for linear
(time series and panel) models with time-varying coefficients.


April 29, 2005

Decision Science News / Dan Goldstein

Decision Science News by Dan Goldstein and Kevin Flora
about the decision sciences including but not limited to Psychology,
Economics, Business, Medicine, and Law, but
mostly marketing.

Also on Wilmott.

April 28, 2005

Statistical Modeling, Causal Inference / MLM

Statistical Modeling, Causal Inference, and Social Science (MLM)
Andrew Gelman and Samantha Cook at Columbia.

April 27, 2005

XLISP-Stat estimates Generalised Estimating Equations

XLISP-Stat tools for building Generalised Estimating Equation models
offers an introduction to GEE models.

Much of the brain trust of XLISP Stat has moved on to r.


April 19, 2005

r graphics, Paul Murrell

R Graphics by Paul Murrell

Update 2005 Sept 03: R Graphics is shipping !

A book on the core graphics facilities of the R language and
environment for statistical computing and graphics (to be published
by Chapman & Hall/CRC in August 2005). Preview now.

March 25, 2005

Wavelets

Wavelets are mathematical expansions that transform data from the
time domain into different layers of frequency levels. Compared to
standard Fourier analysis, they have the advantage [PDF] of being
localized both in time and in the frequency domain, and enable the
researcher to observe and analyze data at different scales.
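
The layering by scale can be sketched with the Haar wavelet, the simplest case. This is an illustrative implementation with names of my own choosing; it assumes the input length is a power of two.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]  # local averages
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]  # local changes
    return approx, detail


def haar_decompose(signal):
    """Full decomposition; len(signal) must be a power of two."""
    details = []
    approx = list(signal)
    while len(approx) > 1:
        approx, d = haar_step(approx)
        details.append(d)  # finest scale first, coarsest last
    return approx, details


def haar_inverse_step(approx, detail):
    """Undo one haar_step exactly."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

Each detail coefficient belongs to one position and one scale, which is the localization in both time and frequency that a Fourier coefficient lacks.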


February 01, 2005

Exploratory Data Analysis in NIST's Statistics Handbook

NIST's Engineering Statistics Handbook: Exploratory Data Analysis.

January 31, 2005

Basel default

Probability of Default (PD)
- the probability that a specific customer will default
within the next 12 months.

Loss Given Default (LGD)
- the percentage of each credit facility that will be lost
if the customer defaults.

Exposure at Default (EAD)
- the expected exposure for each credit facility in the
event of a default.
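
The three parameters are commonly combined into an expected loss, EL = PD x LGD x EAD, per facility. A minimal sketch follows; the function names and the sample figures are hypothetical, and the simple product ignores any dependence between the three quantities.

```python
def expected_loss(pd, lgd, ead):
    """Expected loss for one facility: PD x LGD x EAD.

    pd  : probability of default over the next 12 months (0..1)
    lgd : fraction of the exposure lost if default occurs (0..1)
    ead : exposure at default, in currency units
    """
    if not (0.0 <= pd <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("pd and lgd must lie between 0 and 1")
    return pd * lgd * ead


def portfolio_expected_loss(facilities):
    """Sum of facility-level expected losses; facilities are (pd, lgd, ead)."""
    return sum(expected_loss(pd, lgd, ead) for pd, lgd, ead in facilities)


# Hypothetical facility: 2% PD, 45% LGD, exposure of 1,000,000.
print(expected_loss(0.02, 0.45, 1_000_000))
```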


January 29, 2005

How Ratings Agencies Achieve Rating Stability

Surveys on the use of agency credit ratings reveal that some
investors believe that rating agencies are relatively slow in
adjusting their ratings. A well-accepted explanation for this
perception on the timeliness of ratings is the "through-the-cycle"
methodology that agencies use. According to Moody's, through-the-cycle
ratings are stable because they are intended to measure default
risk over long investment horizons, and because they are
changed only when agencies are confident that observed changes in a
company's risk profile are likely to be permanent. To verify this
explanation, we quantify the impact of the long-term default horizon
and the prudent migration policy on rating stability from the
perspective of an investor - with no desire for rating stability. This
is done by benchmarking agency ratings with a financial ratio-based
(credit scoring) agency-rating prediction model and (credit scoring)
default-prediction models of various time horizons. We also examine
rating migration practices. The final result is a better quantitative
understanding of the through-the-cycle methodology.

By varying the time horizon in the estimation of default-prediction
models, we search for a best match with the agency-rating prediction
model. Consistent with the agencies' stated objectives, we conclude
that agency ratings are focused on the long term. In contrast to
one-year default prediction models, agency ratings place less weight
on short-term indicators of credit quality.

We also demonstrate that the focus of agencies on long investment
horizons explains only part of the relative stability of agency
ratings. The other aspect of through-the-cycle rating methodology -
agency rating-migration policy - is an even more important factor
underlying the stability of agency ratings. We find that rating
migrations are triggered when the difference between the actual agency
rating and the model predicted rating exceeds a certain threshold
level. When rating migrations are triggered, agencies adjust their
ratings only partially, consistent with the known serial dependency of
agency rating migrations.


January 26, 2005

Web mathematica takes derivatives.

Web Mathematica takes derivatives.

January 23, 2005

TreeAge statistical software for non-statisticians

TreeAge offers statistical software for non-statisticians.

Features include sensitivity analysis and distribution graphs.

January 22, 2005

Belief Networks and Decision Networks

Belief networks (also known as Bayesian networks, Bayes networks and
causal probabilistic networks) provide a method to represent
relationships between propositions or variables, even if the
relationships involve uncertainty, unpredictability or imprecision.

They may be learned automatically from data files, created by an
expert, or developed by a combination of the two. They capture
knowledge in a modular form that can be transported from one situation
to another; it is a form people can understand, and which allows a
clear visualization of the relationships involved.

By adding decision variables (things that can be controlled), and
utility variables (things we want to optimize) to the relationships of
a belief network, a decision network (also known as an influence
diagram) is formed. This can be used to find optimal decisions,
control systems, or plans.
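
A toy example shows both steps: updating a belief from evidence, then choosing the action with the highest expected utility. The network (Rain influences Forecast), the probabilities and the utility table below are all made up for illustration.

```python
# Belief network: Rain -> Forecast. All numbers are hypothetical.
P_RAIN = 0.3
P_FORECAST_GIVEN_RAIN = {
    True: {"rainy": 0.8, "sunny": 0.2},
    False: {"rainy": 0.1, "sunny": 0.9},
}

# Decision network: add a decision (umbrella) and a utility table.
UTILITY = {
    ("take", True): -1, ("take", False): -1,   # small cost of carrying it
    ("leave", True): -10, ("leave", False): 0,  # getting soaked is worse
}


def posterior_rain(forecast):
    """P(rain | forecast) by Bayes' rule."""
    joint_rain = P_RAIN * P_FORECAST_GIVEN_RAIN[True][forecast]
    joint_dry = (1.0 - P_RAIN) * P_FORECAST_GIVEN_RAIN[False][forecast]
    return joint_rain / (joint_rain + joint_dry)


def expected_utility(action, p_rain):
    """Average the utility of an action over the rain belief."""
    return p_rain * UTILITY[(action, True)] + (1.0 - p_rain) * UTILITY[(action, False)]


def best_action(forecast):
    """Optimal decision given the observed forecast."""
    p = posterior_rain(forecast)
    return max(("take", "leave"), key=lambda a: expected_utility(a, p))
```

The same evidence node changes the optimal decision: a rainy forecast favors taking the umbrella, a sunny one favors leaving it.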


January 21, 2005

Agena Risk Bayesian network

Agena Risk Bayesian network analysis software and whitepapers.

January 14, 2005

Bayesian Methods for Improving Credit Scoring Models

Abstract: We propose a Bayesian methodology that enables banks with
small datasets to improve their default probability estimates by
imposing prior information. As prior information, we use coefficients
from credit scoring models estimated on other datasets. Through
simulations, we explore the default prediction power of three Bayesian
estimators in three different scenarios and find that all three
perform better than standard maximum likelihood estimates. We
therefore recommend that banks consider Bayesian estimation for
internal and regulatory default prediction models.

Keywords: Credit Ratings, Rating Agency, Bayesian Inference, Basel II

JEL Classification: C11, G21, G33


January 13, 2005

Receiver Operating Characteristic (ROC)

ROC.

The ability of a test to discriminate diseased cases from normal cases
is evaluated using Receiver Operating Characteristic (ROC) curve
analysis (Metz, 1978; Zweig & Campbell, 1993). ROC curves can also be
used to compare the diagnostic performance of two or more laboratory or
diagnostic tests (Griner et al., 1981).
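
An ROC curve and its area can be computed directly from test scores and true labels. The sketch below is a plain illustration with function names of my own; it assumes higher scores indicate disease and labels are 1 (diseased) or 0 (normal), with at least one of each.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) as the cutoff is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Call a case "positive" when its score is at or above the cutoff t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

To compare two tests, compute the area for each on the same cases: an area of 0.5 means no discrimination and 1.0 means perfect separation.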

January 12, 2005

Lindeberg's central limit theorem

Lindeberg's Central Limit Theorem at Planetmath.

January 07, 2005

TreeBoost - Stochastic Gradient Boosting

TreeBoost - Stochastic Gradient Boosting.

"Boosting" is a technique for improving the accuracy of a predictive
function by applying the function repeatedly in a series and combining
the output of each function with weighting so that the total error of
the prediction is minimized. In many cases, the predictive accuracy of
such a series greatly exceeds the accuracy of the base function used
alone.
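
The idea can be sketched with one-split regression stumps as the base function. This is a plain gradient-boosting illustration for squared error on a single feature, not TreeBoost itself: it omits the random subsampling that makes the method "stochastic", and the names, number of rounds and learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares).

    Needs at least two distinct x values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps in series, each one to the residuals left by the others."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        # Weighted combination of the base value and every stump so far.
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

A single stump can only produce two output values; the boosted series approximates a curve because each new stump corrects what the earlier ones left unexplained.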

January 02, 2005

Correlation Monger

Correlation monger provides pair-wise correlation of
demographic variables across 50 US states. For example,
Canadians increase property values.

December 17, 2004

MedCalc basic statistical features.

MedCalc has a good list of basic statistical features.

# Stepwise Multiple regression

# Stepwise Logistic regression

# Paired and unpaired t-tests

# Rank sum tests: Wilcoxon test (paired data), Mann-Whitney U test (unpaired data)

# Variance ratio test (F-test)

# One-way analysis of variance (ANOVA) with Student-Newman-Keuls (SNK) test for pairwise comparison of subgroups

# Two-way analysis of variance

# Kruskal-Wallis test

# Frequencies table, crosstabulation analysis, Chi-square test, Chi-square test for trend

# Tests on 2x2 tables: Fisher's exact test, McNemar test

# Frequencies bar charts

# Kaplan-Meier survival curve, logrank test for comparison of survival curves, hazard ratio, logrank test for trend

# Cox proportional-hazards regression

# Meta-analysis: odds ratio (random effects or fixed effects model - Mantel-Haenszel method); summary effects for continuous outcomes; Forest plot

# Reference interval (normal range)

# Analysis of Serial measurements with group comparison

# Bland & Altman plot for method comparison (bias plot) - repeatability