Topics in Statistical Data Analysis
B. R. Asrabadi
Statistical Data Analysis
Statistics
is a set of methods that are used to collect, analyze, present, and interpret
data. Statistical methods are used in a wide variety of occupations and
help people identify, study, and solve many complex problems. In the business
and economic world, these methods enable decision makers and managers to
make informed and better decisions about uncertain situations. Vast amounts
of statistical information are available in today's global and economic
environment because of continual improvements in computer technology. To
compete successfully globally, managers and decision makers must be able
to understand the information and use it effectively. Statistical data
analysis provides hands-on experience that promotes the use of statistical
thinking and techniques in making educated decisions in
the business world. Computers play a very important role in statistical
data analysis. The statistical software package, SPSS, which is used in
this course, offers extensive data-handling capabilities and numerous statistical
analysis routines that can analyze small to very large data sets.
The computer will assist in the summarization of data, but statistical
data analysis focuses on the interpretation of the output to make inferences
and predictions. Studying a problem through the use of statistical data
analysis usually involves four basic steps.
1. Defining the problem
2. Collecting the data
3. Analyzing the data
4. Reporting the results
-
Defining the Problem
-
An exact definition of the problem is imperative in order to obtain accurate
data about it. It is extremely difficult to gather data without a clear
definition of the problem.
-
Collecting the Data
We live and work at a time when data collection and statistical computations
have become easy almost to the point of triviality. Paradoxically, the design
of data collection, never sufficiently emphasized in statistical data
analysis textbooks, has been weakened by an apparent belief that extensive
computation can make up for any deficiencies in the design of data collection.
One must start with an emphasis on the importance of defining the population
about which we are seeking to make inferences; all the requirements of
sampling and experimental design must then be met. Designing ways to collect
data is an important job in statistical data analysis. Two important aspects
of a statistical study are:
Population - the set of all the elements of interest in a study
Sample - a subset of the population
Statistical inference - extending the knowledge obtained from a random
sample to the whole population. This is known in mathematics as inductive
reasoning, that is, knowledge of the whole from a particular. Its main
application is in testing hypotheses about a given population.
The purpose of statistical inference is to obtain information about a
population from information contained in a sample. It is usually not feasible
to test the entire population, so a sample is the only realistic way to obtain
data because of the time and cost constraints. Data can be either quantitative
or qualitative.
Qualitative data
are labels or names used to identify an attribute of each element. Quantitative
data are always numeric and indicate either how much or how many.
For the purpose of statistical data analysis, distinguishing between
cross-sectional and time series data is important. Cross-sectional data are
data collected at the same or approximately the same point in time.
Time series data are data collected over several time periods.
Data
can be collected from existing sources or obtained through observation
and experimental studies designed to obtain new data. In an experimental
study, the variable of interest is identified. Then one or more factors
in the study are controlled so that data can be obtained about how the
factors influence the variables. In observational studies, no attempt is
made to control or influence the variables of interest. A survey is perhaps
the most common type of observational study.
Analyzing the Data
Statistical data analysis divides the methods for analyzing data into two
categories: exploratory methods and confirmatory methods. Exploratory
methods are used to discover what the data seem to
be saying by using simple arithmetic and easy-to-draw pictures to summarize
data. Confirmatory methods use ideas from probability theory in the attempt
to answer specific questions. Probability is important in decision making
because it provides a mechanism for measuring, expressing, and analyzing
the uncertainties associated with future events. The majority of the topics
addressed in this course fall under this heading.
Reporting the Results
Through inference, an estimate of, or a test of claims about, the
characteristics of a population
can be obtained from a sample. The results may be reported in the form
of a table, a graph or a set of percentages. Because only a small collection
(sample) has been examined and not an entire population, the reported results
must reflect the uncertainty through the use of probability statements
and intervals of values.
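
As a minimal sketch of reporting a sample result together with its
uncertainty, the Python snippet below computes a sample mean and an
approximate 95% interval. The data values and the function name are made up
for illustration, and the interval uses the normal approximation (1.96
standard errors), which is an assumption; a small sample like this one would
normally call for a t-based interval.

```python
import math
import statistics

def mean_with_interval(sample, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * se, mean + z * se

# Hypothetical sample of 10 measurements.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]
mean, lower, upper = mean_with_interval(data)
print(f"mean = {mean:.2f}, approximate 95% interval = ({lower:.2f}, {upper:.2f})")
```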
To
conclude, a critical aspect of managing any organization is planning for
the future. Good judgment, intuition, and an awareness of the state of
the economy may give a manager a rough idea or "feeling" of what is likely
to happen in the future. However, converting that feeling into a number
that can be used effectively is difficult. Statistical data analysis helps
managers forecast and predict future aspects of a business operation. The
most successful managers and decision makers are the ones who can understand
the information and use it effectively.
Biostatistics
-
Biostatistics
is a sub-discipline of Statistics which focuses on statistical support
for the areas of medicine, environmental science, public health, and related
fields. Practitioners span the range from the very applied to the very
theoretical. The information which is useful to the biostatistician spans
the range from that needed by a general statistician, to more subject-specific
scientific details, to ordinary information that will improve communication
between the biostatistician and other scientists and researchers.
Evidential Statistics
-
Statistical
methods aim to answer a variety of questions about observations. A simple
example occurs when a fairly reliable test for a condition C, has given
a positive result. Three important types of questions are:
-
1. Should this observation lead me
to believe that condition C is present?
2. Does this observation justify my acting as if condition C were present?
3. Is this observation evidence that condition C is present?
-
We must
distinguish among these three questions in terms of the variables and principles
that determine their answers. Questions of the third type, concerning the
"evidential interpretation" of statistical data, are central to many applications
of statistics in many fields.
-
It is already recognized that current statistical methods are seriously
flawed for answering the evidential question, a flaw that could be corrected
by applying the
Law of Likelihood. This law suggests how the dominant statistical paradigm
can be altered so as to generate appropriate methods for objective, quantitative
representation of the evidence embodied in a specific set of observations,
as well as measurement and control of the probabilities that a study will
produce weak or misleading evidence.
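
The evidential question can be illustrated with a small sketch. Under the
Law of Likelihood, an observation supports one hypothesis over another when
it is more probable under the first, and the likelihood ratio measures the
strength of that support. The test characteristics below (95% positive when
C is present, 5% positive when it is absent) and the function names are
hypothetical, chosen only for illustration.

```python
def likelihood_ratio(p_obs_given_a, p_obs_given_b):
    """Strength of evidence for hypothesis A over B from one observation."""
    return p_obs_given_a / p_obs_given_b

def posterior_probability(prior, lr):
    """Update a prior probability of A using the likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * lr
    return post_odds / (1.0 + post_odds)

# Hypothetical test: positive in 95% of cases with C, 5% of cases without C.
lr = likelihood_ratio(0.95, 0.05)
print(f"likelihood ratio for C over not-C: {lr:.1f}")
for prior in (0.5, 0.01):
    post = posterior_probability(prior, lr)
    print(f"prior {prior:.2f} -> posterior {post:.3f}")
```

The likelihood ratio (the evidence, question 3) is the same in both cases,
while the resulting belief (question 1) depends on the prior probability of
the condition.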
-
Multivariate Data Analysis
-
Data are
easy to collect; what we really need in complex problem solving is information.
We may view a data base as a domain that requires probes and tools to extract
relevant information. As in the measurement process itself, appropriate
instruments of reasoning must be applied to the data interpretation task.
Effective tools serve in two capacities: to summarize the data and to assist
in interpretation. The objectives of interpretive aids are to reveal the
data at several levels of detail.
-
Exploring
the fuzzy data picture sometimes requires a wide-angle lens to view its
totality. At other times it requires a close up lens to focus on fine detail.
The graphically based tools that we use provide this flexibility. Most
chemical systems are complex because they involve many variables and there
are many interactions among the variables. Therefore, chemometric techniques
rely upon multivariate statistical and mathematical tools to uncover interactions
and reduce the dimensionality of the data.
-
Principal component analysis is used for exploring data. Two closely
related techniques,
principal component analysis and factor analysis, are used to reduce the
dimensionality of multivariate data. In these techniques correlation and
interactions among the variables are summarized in terms of a small number
of underlying factors. The methods rapidly identify key variables or groups
of variables that control the system under study. The resulting dimension
reduction also permits graphical representation of the data so that significant
relationships among observations or samples can be identified.
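
A minimal sketch of the idea for just two variables is given below. It
finds the direction of greatest variance (the first principal component)
from the closed-form eigen-decomposition of the 2x2 covariance matrix. The
data and the function name are made up for illustration; real problems with
many variables would use a numerical linear algebra library instead.

```python
import math
import statistics

def first_principal_component(xs, ys):
    """Return (direction, share) for the leading component of two variables.

    direction is a unit vector; share is the fraction of the total
    variance explained by that direction.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    # Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    largest = half_trace + spread
    # Angle of the corresponding eigenvector (the major axis).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    return direction, largest / (sxx + syy)

# Hypothetical, strongly correlated measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
direction, share = first_principal_component(xs, ys)
print(f"direction = ({direction[0]:.3f}, {direction[1]:.3f}), share = {share:.3f}")
```

Here one underlying factor summarizes almost all of the variation in the two
variables, which is the kind of dimension reduction described above.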
-
Other techniques
include Multidimensional Scaling, Cluster Analysis, and Correspondence
Analysis.
-
Multivariate
analysis is a branch of statistics involving the consideration of objects
on each of which are observed the values of a number of variables. A wide
range of methods is used for the analysis of multivariate data, and this
course will give a view of the variety of methods available, as well as
going into some of them in detail. Multivariate techniques are used across
the whole range of fields of statistical application: in medicine, physical
and biological sciences, economics and social science, and of course in
many industrial and commercial applications.
Spatial Data Analysis
-
Data which
is geographically or spatially referenced is encountered in a very wide
variety of practical contexts. In the same way that data collected at different
points in time may require specialised analytical techniques, there are
a range of statistical methods devoted to the modelling and analysis of
data collected at different points in space. Increased public sector and
commercial recording and use of data which is geographically referenced,
recent advances in computer hardware and software capable of manipulating
and displaying spatial relationships in the form of digital maps, and an
awareness of the potential importance of spatial relationships in many
areas of research, have all combined to produce an increased interest
in spatial analysis. Spatial Data Analysis is concerned with the study
of such techniques---the kind of problems they are designed to address,
their theoretical justification, when and how to use them in practice.
-
Many natural
phenomena involve a random distribution of points in space. Biologists
who observe the locations of cells of a certain type in an organ, astronomers
who plot the positions of the stars, botanists who record the positions
of plants of a certain species and geologists detecting the distribution
of a rare mineral in rock are all observing spatial point patterns in two
or three dimensions. Such phenomena can be modelled by spatial point processes.
-
The spatial
linear model is fundamental to a number of techniques used in image processing,
for example, for locating gold/ore deposits, or creating maps. There are
many unresolved problems in this area such as the behavior of maximum likelihood
estimators and predictors, and diagnostic tools. There are strong connections
between kriging predictors for the spatial linear model and spline methods
of interpolation and smoothing. The two-dimensional version of splines/kriging
can be used to construct deformations of the plane, which are of key importance
in shape analysis.
Meta-Analysis
-
Meta-Analysis
deals with the art of combining information from the data from different
independent sources which are targeted at a common goal. There are plenty
of applications of Meta-Analysis in various disciplines such as Astronomy,
Agriculture, Biological and Social Sciences, and Environmental Science.
This particular topic of statistics has evolved considerably over the last
twenty years with applied as well as theoretical developments.
-
A Meta-analysis
deals with a set of RESULTs to give an overall RESULT that is (presumably)
comprehensive and valid.
-
a) Especially
when Effect-sizes are rather small, the hope is that one can gain good
power by essentially pretending to have the larger N as a valid, combined
sample.
-
b) When
effect sizes are rather large, then the extra POWER is not needed for main
effects of design: Instead, it theoretically could be possible to look
at contrasts between the slight variations in the studies themselves.
-
For example,
to compare two effect sizes (r) obtained by two separate studies, you may
use:
-
$$Z = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}}$$
-
where $z_1$ and $z_2$ are the Fisher transformations of the two r values,
and $n_1$ and $n_2$ in the denominator are the sample sizes of the two
studies.
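
The Z statistic above can be sketched in Python as follows. The function
name and the example correlations are hypothetical; the two-sided p-value
from the standard normal distribution is an added convenience, not part of
the formula itself.

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Z statistic and two-sided p-value for two independent correlations."""
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical studies: r = 0.5 with n = 103, and r = 0.3 with n = 103.
z, p_value = compare_correlations(0.5, 103, 0.3, 103)
print(f"Z = {z:.3f}, two-sided p = {p_value:.3f}")
```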
-
This is valid only if you
really trust that "all things being equal" will hold up. The typical "meta"
study does not do the tests for homogeneity that should be required.
-
In other
words:
-
1. there
is a body of research/data literature that you would like to summarize
-
2. one
gathers together all the admissible examples of this literature (note:
some might be discarded for various reasons)
-
3. certain
details of each investigation are deciphered ... most important would be
the effect that has or has not been found, i.e., how much larger in sd units
is the treatment group's performance compared to one or more controls.
-
4. call
the values in each of the investigations in #3 .. mini effect sizes.
-
5. across
all admissible data sets, you attempt to summarize the overall effect size
by forming a set of individual effects ... and using an overall sd as the
divisor .. thus yielding essentially an average effect size.
-
6. in the
meta analysis literature ... sometimes these effect sizes are further labeled
as small, medium, or large ....
-
You can
look at effect sizes in many different ways .. across different factors
and variables. but, in a nutshell, this is what is done.
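
The steps above can be sketched as follows. The study summaries are
invented for illustration, each study's own sd is used as the divisor, and
the sample-size weighting is one common refinement that is an assumption
here rather than something prescribed by the text.

```python
import statistics

# Hypothetical study summaries: (treatment mean, control mean, sd, sample size).
studies = [
    (52.0, 50.0, 10.0, 40),
    (55.0, 50.0, 12.0, 25),
    (51.0, 50.0, 8.0, 60),
]

def effect_size(treatment_mean, control_mean, sd):
    """Difference between treatment and control in sd units."""
    return (treatment_mean - control_mean) / sd

# One "mini" effect size per study.
effects = [effect_size(t, c, sd) for t, c, sd, _ in studies]
simple_average = statistics.mean(effects)

# Weighting each effect by its study's sample size (an assumption).
total_n = sum(n for *_, n in studies)
weighted_average = sum(e * n for e, (*_, n) in zip(effects, studies)) / total_n

print(f"effects: {[round(e, 3) for e in effects]}")
print(f"simple average = {simple_average:.3f}, weighted = {weighted_average:.3f}")
```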
-
I recall
a case in physics, in which, after a phenomenon had been observed in air,
emulsion data was examined. The theory would have about a 9% effect in
emulsion, and behold, the published data gave 15%. As it happens, there
was no significant (practical, not statistical) error in the theory, and also
no error in the data. It was just that the results of experiments in which
nothing statistically significant was found were not reported.
-
This non-reporting
of such experiments, and often of the specific results which were not
statistically significant, introduces major biases. This is compounded by
the totally erroneous attitude of researchers that statistically significant
results are the important ones, and that if there is no significance, the
effect was not important. We really need to distinguish between the term
"statistically significant" and the usual word significant.
-
It is very
important to distinguish between statistically significant and generally
significant; see Discover Magazine (July, 1987), The Case of Falling
Nightwatchmen, by Sapolsky. In this article, Sapolsky uses the example to
make exactly this point: a diminution of velocity at impact may be
statistically significant, but not of importance to the falling nightwatchman.
-
Be careful
about the word "significant". It has a technical meaning, not a commonsense
one. It is NOT automatically synonymous with "important". A person or group
can be statistically significantly taller than the average for the population,
but still not be a candidate for your basketball team. Whether the difference
is substantively (not merely statistically) significant is dependent on
the problem which is being studied.
-
Meta-analysis
is a controversial type of literature review in which the results of individual
randomized controlled studies are pooled together to try to get an estimate
of the effect of the intervention being studied. It increases statistical
power and is used to resolve the problem of reports which disagree with
each other. It's not easy to do well and there are many inherent problems.
-
There is
also a graphical technique to assess the robustness of meta-analysis results.
We carry out the meta-analysis dropping one study at a time; that is, if we
have N studies we do N meta-analyses using N-1 studies in each one. We then
plot these N estimates on the y axis and compare them with a straight line
that represents the overall estimate using all the studies.
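
A minimal sketch of this leave-one-out check is shown below. It prints the
N estimates next to the overall estimate instead of plotting them. The
effect sizes and variances are hypothetical, and inverse-variance
(fixed-effect) pooling is assumed as the way of combining studies; the text
does not prescribe a particular pooling method.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted average (fixed-effect pooling)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

def leave_one_out(effects, variances):
    """Pooled estimate recomputed with each study dropped in turn."""
    results = []
    for i in range(len(effects)):
        rest_e = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        results.append(pooled_estimate(rest_e, rest_v))
    return results

# Hypothetical effect sizes and their variances for N = 5 studies.
effects = [0.20, 0.35, 0.10, 0.90, 0.25]
variances = [0.04, 0.05, 0.03, 0.06, 0.04]
overall = pooled_estimate(effects, variances)
loo = leave_one_out(effects, variances)
for i, estimate in enumerate(loo, start=1):
    print(f"without study {i}: {estimate:.3f} (overall {overall:.3f})")
```

An estimate that moves far from the overall line when one study is dropped
(study 4 in this made-up data) flags that study as unusually influential.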
-
Topics in meta-analysis include: odds ratios; relative risk; risk difference;
effect size; incidence rate difference and ratio; plots and exact confidence
intervals.
-
For details,
read,
Meta-Analysis
in Social Research, by Glass, McGraw and Smith, 1987,
and
Handbook
of Research Synthesis, by Cooper H., and L. Hedges,
(Eds.), New York, Russell Sage Foundation, 1994,
-
Also visit Meta-Analysis, and
Meta-Analysis: Methods of Accumulating Results Across Research Domains.
Variogram Analysis
-
Variables
are often measured at different locations. The patterns in these spatial
variables may be extrapolated by variogram analysis.
-
A variogram
summarizes the relationship between the variance of the difference in pairs
of measurements and the distance of the corresponding points from each
other.
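
A minimal sketch of an empirical variogram for measurements along a line is
given below. It uses the common semivariance convention (half the mean
squared difference of pairs, grouped by separation distance), which is an
assumption about the exact definition intended here; the locations, values,
and function name are made up for illustration.

```python
from collections import defaultdict

def empirical_semivariogram(locations, values, bin_width):
    """Half the mean squared difference of value pairs, grouped by distance."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            distance = abs(locations[i] - locations[j])
            bin_index = int(distance // bin_width)
            sums[bin_index] += (values[i] - values[j]) ** 2
            counts[bin_index] += 1
    # Each bin is reported by its lower edge.
    return {b * bin_width: sums[b] / (2.0 * counts[b]) for b in sorted(sums)}

# Hypothetical measurements along a transect (1-D locations).
locations = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
values = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
variogram = empirical_semivariogram(locations, values, 1.0)
for lag, gamma in variogram.items():
    print(f"lag bin starting at {lag:.0f}: semivariance {gamma:.3f}")
```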
Survival Analysis
-
Survival
analysis is suited to the examination of data where the outcome of interest
is 'time until a specific event occurs', and where not all individuals
have been followed up until the event occurs.
-
The methods
of survival analysis are applicable not only in studies of patient survival,
but also studies examining adverse events in clinical trials, time to discontinuation
of treatment, duration in community care before re-hospitalisation, contraceptive
and fertility studies etc.
-
If you've
ever used regression analysis on longitudinal event data, you've probably
come up against two intractable problems:
-
Censoring:
Nearly every sample contains some cases that do not experience an event.
If the dependent variable is the time of the event, what do you do with
these "censored" cases?
-
Time-dependent
covariates: Many explanatory variables (like income or blood pressure) change
in value over time. How do you put such variables in a regression analysis?
-
Makeshift
solutions to these questions can lead to severe biases. Survival methods
are explicitly designed to deal with censoring and time-dependent covariates
in a statistically correct way. Originally developed by biostatisticians,
these methods have become popular in sociology, demography, psychology,
economics, political science, and marketing.
-
In short,
survival analysis is a group of statistical methods for the analysis and
interpretation of survival data. Even though survival analysis can be used in
a wide variety of applications (e.g. insurance, engineering, and sociology),
the main application is the analysis of clinical trials data. Survival and
hazard functions, together with the methods of estimating parameters and
testing hypotheses, are the main part of analyses of survival data. Main
topics relevant to survival data analysis are: survival and hazard functions;
types of censoring; estimation of survival and hazard functions (the
Kaplan-Meier and life table estimators); simple life tables; Peto's logrank
test with trend test and hazard ratios; the Wilcoxon test (which can be
stratified); the Wei-Lachin test; comparison of survival functions (the
logrank and Mantel-Haenszel tests); the proportional hazards model with
time-independent and time-dependent covariates; the logistic regression
model; and methods for determining sample sizes.
-
In the
last few years the survival analysis software available in several of the
standard statistical packages has experienced a major increment in functionality,
and is no longer limited to the triad of Kaplan-Meier curves, logrank tests,
and simple Cox models.
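
To make the handling of censored cases concrete, here is a minimal sketch of
the Kaplan-Meier estimator in plain Python. The follow-up times and event
indicators are invented, and the function name is hypothetical; a real
analysis would normally use a statistical package.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Everyone observed at t (event or censored) leaves the risk set.
        at_risk -= leaving
    return curve

# Hypothetical follow-up times; 0 marks a censored case.
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: estimated survival {s:.3f}")
```

Censored cases never count as events, but they do count in the number at
risk up to the time they drop out, which is how the estimator uses their
partial information.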
-
Further
Reading:
Lee
E., Statistical Methods for Survival Data Analysis, Wiley, 1992.
Split-half Analysis
-
What is
split-half analysis? Split your sample in half. Factor-analyze each half.
Do they come out the same as (or similar to) each other? Alternatively (or
also), take more than two random subsamples of your sample and do the
same.
-
Notice
that this is (like factor analysis itself) an "exploratory", not inferential
technique, i.e. hypothesis testing, confidence intervals etc. simply do
not apply.
-
Alternatively,
randomly split the sample in half and then do an exploratory factor analysis
on Sample 1. Use those results to do a confirmatory factor analysis with
Sample 2.
The Central Limit Theorem
-
For practical
purposes, the main idea of the central limit theorem (CLT) is that the
average of a sample of observations drawn from some population with any
shape-distribution is approximately distributed as a normal distribution
if certain conditions are met. In theoretical statistics there are several
versions of the central limit theorem depending on how these conditions
are specified. These are concerned with the types of assumptions made about
the distribution of the parent population (population from which the sample
is drawn) and the actual sampling procedure.
-
One of
the simplest versions of the theorem says that if $X_1, \dots, X_n$ is a
random sample of size n (say, $n \ge 30$) from an infinite population with
mean $\mu$ and finite standard deviation $\sigma$, then the standardized
sample mean $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ converges to a standard
normal distribution or, equivalently, the sample mean approaches a normal
distribution with mean equal to the population mean and standard deviation
equal to the standard deviation of the population divided by the square root
of the sample size n. In
applications of the central limit theorem to practical problems in statistical
inference, however, statisticians are more interested in how closely the
approximate distribution of the sample mean follows a normal distribution
for finite sample sizes than in the limiting distribution itself. Sufficiently
close agreement with a normal distribution allows statisticians to use
normal theory for making inferences about population parameters (such as
the mean $\mu$) using the sample mean, irrespective of the actual form of the
parent population.
-
It is well
known that, whatever the parent population is, the standardized variable
$Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ will have a distribution with mean 0
and standard deviation 1 under random sampling. Moreover, if the parent
population is normal, then Z is distributed exactly as a standard normal
variable for any positive integer n. The central
limit theorem states the remarkable result that, even when the parent population
is non-normal, the standardized variable is approximately normal if the
sample size is large enough (say, 30 or more). It is generally not possible to
state conditions under which the approximation given by the central limit
theorem works and what sample sizes are needed before the approximation
becomes good enough. As a general guideline, statisticians have used the
prescription that if the parent distribution is symmetric and relatively
short-tailed, then the sample mean reaches approximate normality for smaller
samples than if the parent population is skewed or long-tailed.
-
One must
study the behavior of the mean of samples of different sizes drawn from
a variety of parent populations. Examining sampling distributions of sample
means computed from samples of different sizes drawn from a variety of
distributions allows us to gain some insight into the behavior of the sample
mean under those specific conditions, as well as to examine the validity of
the guidelines mentioned above for using the central limit theorem in practice.
-
Under certain
conditions, in large samples, the sampling distribution of the sample mean
can be approximated by a normal distribution. The sample size needed for
the approximation to be adequate depends strongly on the shape of the parent
distribution. Symmetry (or lack thereof) is particularly important. For
a symmetric parent distribution, even if very different from the shape
of a normal distribution, an adequate approximation can be obtained with
small samples (e.g., 10 or 12 for the uniform distribution). For symmetric
short-tailed parent distributions, the sample mean reaches approximate
normality for smaller samples than if the parent population is skewed and
long-tailed. In some extreme cases (e.g. a binomial whose success probability is very close to 0 or 1), sample sizes
far exceeding the typical guidelines (say, 30) are needed for an adequate
approximation. For some distributions without first and second moments
(e.g., Cauchy), the central limit theorem does not hold.
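
The dependence on the shape of the parent distribution can be explored with
a small simulation. The sketch below draws repeated samples from a skewed
(exponential) parent and reports the skewness of the resulting sample means,
which should shrink toward 0 (the value for a normal distribution) as n
grows; in theory it equals 2 divided by the square root of n for this
parent. The choice of parent, the sample sizes, the seed, and the number of
replicates are all assumptions made for illustration.

```python
import random

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

def sample_means(n, replicates, rng):
    """Means of `replicates` samples of size n from an exponential parent."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(replicates)]

rng = random.Random(0)  # fixed seed so the run is repeatable
skews = {n: skewness(sample_means(n, 4000, rng)) for n in (1, 5, 30)}
for n, value in skews.items():
    print(f"n = {n:2d}: skewness of the sample mean is about {value:.2f}")
```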
Review
also Central
Limit Theorem Applet, CLT,
and Quincunx
to illustrate the Central Limit Theorem.
Sampling Distribution
-
The main
idea of statistical inference is to take a random sample from a population
and then to use the information from the sample to make inferences about
particular population characteristics such as the mean (measure of central
tendency), the standard deviation (measure of spread) or the proportion
of units in the population that have a certain characteristic. Sampling
saves money, time, and effort. Additionally, a sample can, in some cases,
provide as much or more accuracy than a corresponding study that would
attempt to investigate an entire population: careful collection of data
from a sample will often provide better information than a less careful
study that tries to look at everything.
-
We will
study the behavior of the mean of sample values from different specified
populations. Because a sample examines only part of a population, the sample
mean will not exactly equal the corresponding mean of the population. Thus,
an important consideration for those planning and interpreting sampling
results, is the degree to which sample estimates, such as the sample mean,
will agree with the corresponding population characteristic.
-
In practice,
only one sample is usually taken (in some cases a small "pilot sample"
is used to test the data-gathering mechanisms and to get preliminary information
for planning the main sampling scheme). However, for purposes of understanding
the degree to which sample means will agree with the corresponding population
mean, it is useful to consider what would happen if 10, or 50, or 100 separate
sampling studies, of the same type, were conducted. How consistent would
the results be across these different studies? If we could see that the
results from each of the samples would be nearly the same (and nearly correct!),
then we would have confidence in the single sample that will actually be
used. On the other hand, seeing that answers from the repeated samples
were too variable for the needed accuracy would suggest that a different
sampling plan (perhaps with a larger sample size) should be used.
-
A sampling
distribution is used to describe the distribution of outcomes that one
would observe from replication of a particular sampling plan.
-
Know that
to estimate means to esteem (to give value to).
-
Know that
estimates computed from one sample will be different from estimates that
would be computed from another sample.
-
Understand
that estimates are expected to differ from the population characteristics
(parameters) that we are trying to estimate, but that the properties of
sampling distributions allow us to quantify, probabilistically, how they
will differ.
-
Understand
that different statistics have different sampling distributions with distribution
shape depending on (a) the specific statistic, (b) the sample size, and
(c) the parent distribution.
-
Understand
the relationship between sample size and the distribution of sample estimates.
-
Understand
that the variability in a sampling distribution can be reduced by increasing
the sample size.
-
See that
in large samples, many sampling distributions can be approximated with
a normal distribution.
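
The points above about sample size can be checked with a short simulation,
sketched here under stated assumptions: a normal parent population with mean
50 and standard deviation 10, 2000 replicated samples per sample size, and a
fixed random seed. None of these values come from the text. The standard
deviation of the simulated sample means should be close to sigma divided by
the square root of n.

```python
import math
import random
import statistics

def sd_of_sample_means(n, replicates, rng, mu=50.0, sigma=10.0):
    """Standard deviation of the means of repeated samples of size n."""
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n
             for _ in range(replicates)]
    return statistics.stdev(means)

rng = random.Random(1)  # fixed seed so the run is repeatable
results = {n: sd_of_sample_means(n, 2000, rng) for n in (25, 100)}
for n, sd in results.items():
    theory = 10.0 / math.sqrt(n)
    print(f"n = {n:3d}: sd of sample means {sd:.2f}, sigma/sqrt(n) = {theory:.2f}")
```

Quadrupling the sample size roughly halves the variability of the sample
mean.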
-
Visit also
the following Web sites: Sample,
and Sampling Distribution
Applet
Least Squares Models
-
Many problems
in analyzing data involve describing how variables are related. The simplest
of all models describing the relationship between two variables is a linear,
or straight-line, model. The simplest method of fitting a linear model
is to "eye-ball" a line through the data on a plot, but a more elegant
and conventional method is that of least squares, which finds the line
minimizing the sum of squared vertical distances between the observed
points and the fitted line.
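
As a minimal sketch, the least squares line (the one minimizing the sum of
squared vertical distances) can be computed directly from the data with the
standard closed-form expressions for the slope and intercept. The data
values and the function name are hypothetical.

```python
def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = least_squares_line(xs, ys)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
```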
-
Realize
that fitting the "best" line by eye is difficult, especially when there
is a lot of residual variability in the data.
-
Know that
there is a simple connection between the numerical coefficients in the
regression equation and the slope and intercept of regression line.
-
Know that
a single summary statistic like a correlation coefficient does not tell
the whole story. A scatter plot is an essential complement to examining
the relationship between the two variables.
-
Know that
model checking is an essential part of the process of statistical modelling.
After all, conclusions based on models that do not properly describe an
observed set of data will be invalid.
-
Know the
impact of violation of regression model assumptions (i.e., conditions)
and possible solutions by analyzing the residuals.
Least Median of Squares Models
-
The standard
least squares techniques for estimation in linear models are not robust
in the sense that outliers or contaminated data can strongly influence
estimates. A robust technique which protects against contamination is least
median of squares (LMS) estimation. LMS estimation can be extended to generalized
linear models, giving rise to the least median of deviance (LMD) estimator.
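
The following is a rough sketch of LMS estimation for a single straight
line, not a production algorithm. It assumes a simple search strategy: try
the line through every pair of data points and keep the one whose squared
residuals have the smallest median. The data, with a quarter of the points
replaced by gross outliers, the seed, and the function name are invented for
illustration.

```python
import random
import statistics
from itertools import combinations

def least_median_of_squares(xs, ys):
    """Search lines through every pair of points; keep the one whose
    squared residuals have the smallest median."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line cannot be written as y = a + b x
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        median_sq = statistics.median(
            (y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
        if best is None or median_sq < best[0]:
            best = (median_sq, slope, intercept)
    return best[1], best[2]

# Hypothetical data: y = 1 + 2x plus small noise, with every 4th point
# replaced by a gross outlier.
rng = random.Random(2)
xs = [0.25 * i for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
for i in range(0, 40, 4):
    ys[i] = 30.0
slope, intercept = least_median_of_squares(xs, ys)
print(f"LMS line: y = {intercept:.2f} + {slope:.2f} x")
```

Because the criterion depends only on the median residual, the contaminated
quarter of the data has little influence on the fitted slope.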
Power of a Test
-
Significance
tests are based on certain assumptions: The data have to be random samples
out of a well defined basic population and one has to assume that some
variables follow a certain distribution - in most cases the normal distribution
is assumed.
-
Power of
a test is the probability of correctly rejecting a false null hypothesis.
This probability is one minus the probability of making a Type II error
(β). Recall
also that we choose the probability of making a Type I error when we set
α, and that,
for a fixed sample size, if we decrease the probability of making a Type I
error we increase the probability of making a Type II error.
Power and Alpha
-
Thus, the
probability of correctly retaining a true null has the same relationship
to Type I errors as the probability of correctly rejecting an untrue null
does to Type II error. Yet, as I mentioned if we decrease the odds of making
one type of error we increase the odds of making the other type of error.
What is the relationship between Type I and Type II errors?
-
Power and
the True Difference Between Population Means: Anytime we test whether a
sample differs from a population or whether two samples come from two separate
populations, there is the assumption that each of the populations we are
comparing has its own mean and standard deviation (even if we do not know
them). The distance between the two population means will affect the power
of our test.
-
Power as
a Function of Sample Size and Variance: You should notice that what really
makes the difference in the size of β
is how much overlap there is in the two distributions. When the means are
close together the two distributions overlap a great deal compared to when
the means are farther apart. Thus, anything that increases the extent to
which the two distributions share common values will increase β
(the likelihood of making a Type II error).
-
Sample
size has an indirect effect on power because it affects the measure of
variability we use to calculate the t-test statistic. Since we are calculating
the power of a test that involves the comparison of sample means, we will
be more interested in the standard error (the standard deviation of the
sampling distribution of the mean) than in the standard deviation or variance by itself. Thus, sample size
is of interest because it determines the standard error: the standard deviation
divided by the square root of n.
When n is large we will have a lower standard error than when n is small.
In turn, when n is large we will have a smaller β
region than when n is small.
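-
The dependence of power on α, the true difference, and n can be sketched numerically. The example below is a minimal illustration, not part of the original notes: it assumes a one-sided z-test with a known population standard deviation, and the names normal_cdf and power_one_sided_z are hypothetical. The critical value z_crit is passed in by the caller (1.645 corresponds to α = 0.05).

```cpp
#include <cmath>

// Standard normal cumulative distribution function, computed from the
// complementary error function in <cmath>.
double normal_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Power of a one-sided z-test of H0: mu = mu0 against H1: mu = mu0 + delta.
// Assumptions (hypothetical setup): sigma is the known population standard
// deviation, n the sample size, z_crit the critical value implied by alpha.
double power_one_sided_z(double delta, double sigma, int n, double z_crit) {
    // Standard error of the sample mean shrinks as n grows.
    double standard_error = sigma / std::sqrt(static_cast<double>(n));
    // Power = P(Z > z_crit - delta / SE) = Phi(delta / SE - z_crit).
    return normal_cdf(delta / standard_error - z_crit);
}
```

Calling power_one_sided_z(0.5, 1.0, 16, 1.645) and then repeating the call with n = 64 shows power rising with sample size; setting delta to zero returns approximately α, since rejecting a true null is then a Type I error.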
ANOVA: Analysis of Variance
-
The tests
we have learned up to this point allow us to test hypotheses that examine
the difference between only two means. Analysis of Variance or ANOVA will
allow us to test the difference between 2 or more means. ANOVA does this
by examining the ratio of variability between two conditions and variability
within each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to another
group of people. We might measure memory performance by the number of words
recalled from a list we ask everyone to memorize. A t-test would compare
the likelihood of observing the difference in the mean number of words
recalled for each group. An ANOVA test, on the other hand, would compare
the variability that we observe between the two conditions to the variability
observed within each condition. Recall that we measure variability as the
sum of the squared differences of each score from the mean. When we actually calculate
an ANOVA we will use a short-cut formula.
-
Thus, when
the variability that we predict (between the two groups) is much greater
than the variability we don't predict (within each group) then we will
conclude that our treatments produce different results.
P-values
-
The P-value,
which directly depends on a given sample, attempts to provide a measure
of the strength of the results of a test, in contrast to a simple reject
or do not reject. If the null hypothesis is true and the chance of random
variation is the only reason for sample differences, then the P-value is
a quantitative measure to feed into the decision making process as evidence.
The following table provides a reasonable interpretation of P-values:
-
This interpretation
is widely accepted, and many scientific journals routinely publish papers
using such an interpretation for the result of test of hypothesis.
-
For the
fixed-sample size, when the number of realizations is decided in advance,
the distribution of p is uniform (assuming the null hypothesis). We would
express this as P(p ≤ x) = x. That means the criterion of p < 0.05
achieves an α
of 0.05.
-
When a
p-value is associated with a set of data, it is a measure of the probability
that the data could have arisen as a random sample from some population
described by the statistical (testing) model.
-
A p-value
is a measure of how much evidence you have against the null hypothesis.
The smaller the p-value, the more evidence you have. One may combine the
p-value with the significance level to make a decision on a given test of
hypothesis. In such a case, if the p-value is less than some threshold
(usually .05, sometimes a bit larger like 0.1 or a bit smaller like .01)
then you reject the null hypothesis.
-
Understand
that the distribution of p-values under null hypothesis H0 is uniform,
and thus does not depend on a particular form of the statistical test.
In a statistical hypothesis test, the P value is the probability of observing
a test statistic at least as extreme as the value actually observed, assuming
that the null hypothesis is true. The value of p is defined with respect
to a distribution. Therefore, we could call it "model-distributional hypothesis"
rather than "the null hypothesis".
-
In short,
the p-value is computed on the assumption that the null is true: it is the probability,
under the null, of data at least as extreme as those observed. Because the p-value is determined by the observed
value, however, it is difficult even to state the inverse of p.
-
P-value
for Standard Normal and t-statistics
-
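Conversion
of a z-statistic Into a (one-sided) P-value
' Upper-tail p-value P(Z > z) for a standard normal z-statistic, using a
' five-term polynomial approximation to the normal tail area.
INPUT "Z : ", ZValue
a1# = .31938153#
a2# = -.356563782#
a3# = 1.781477937#
a4# = -1.821255978#
a5# = 1.330274429#
w1# = ABS(ZValue)
w# = 1 / (1 + .2316419# * w1#)
w1# = .39894228# * EXP(-.5 * w1# * w1#)   ' standard normal density at |z|
p0# = w# *(a1# + w# *(a2# + w# *(a3# + w# * (a4# + a5# * w#))))
p0# = (w1# * p0#)                         ' tail area beyond |z|
' The comparison operator was lost in the source; "<" is assumed here,
' which gives the upper tail: for negative z, P(Z > z) = 1 - tail.
IF ZValue < 0 THEN
p0# = 1 - p0#
END IF
PRINT p0#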
-
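Area to
the right of z (upper tail) for the normal density, for z >= 0: EXP(-((83*Z+351)*Z+562)*Z/(703+165*Z))/2
Below is a similar program, which prints the tail area beyond |z| without the sign adjustment: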
INPUT z
a1 = .31938153#
a2 = -.356563782#
a3 = 1.781477937#
a4 = -1.821255978#
a5 = 1.330274429#
w1 = ABS(z)
w = 1 / (1 + .2316419 * w1)
w1 = .39894228# * EXP(-.5 * w1 * w1)
p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))))
p0 = w1 * p0
PRINT ABS(p0);
-
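Conversion
of a z-statistic Into a (one-sided) P-value: in C++ code
#include <cmath>

// Tail area beyond |z| for the standard normal distribution, using the same
// five-term polynomial approximation as the BASIC programs above.
// __declspec(dllexport) is specific to Windows DLL builds.
double __declspec(dllexport) NormalProb(double z)
{
    const double a1 = .31938153;
    const double a2 = -.356563782;
    const double a3 = 1.781477937;
    const double a4 = -1.821255978;
    const double a5 = 1.330274429;
    double w1 = std::fabs(z);
    double w = 1 / (1 + .2316419 * w1);
    w1 = .39894228 * std::exp(-0.5 * w1 * w1); // standard normal density at |z|
    double p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))));
    p0 = w1 * p0;
    return std::fabs(p0);
}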
-
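Conversion
of a t-statistic Into a (one-sided) P-value: C++
#include <cmath>

// Upper-tail probability P(T > t) for Student's t with df degrees of freedom,
// computed from the finite trigonometric series for the t distribution.
// __declspec(dllexport) is specific to Windows DLL builds.
double __declspec(dllexport) TProb(double t, int df)
{
    const double a = 0.36338023; // 1 - 2/pi
    double w = std::atan(t / std::sqrt(static_cast<double>(df)));
    double s = std::sin(w);
    double c = std::cos(w);
    double t1, t2;
    int j1, j2, k2;
    if (df % 2 == 0) // even
    {
        t1 = s;
        if (df == 2) // special case df=2
            return 1 - (0.5 * (1 + t1)); // upper tail, consistent with the other branches
        t2 = s;
        j1 = -1;
        j2 = 0;
        k2 = (df - 2) / 2;
    }
    else
    {
        t1 = w;
        if (df == 1) // special case df=1
            return 1 - (0.5 * (1 + (t1 * (1 - a))));
        t2 = s * c;
        t1 = t1 + t2;
        if (df == 3) // special case df=3
            return 1 - (0.5 * (1 + (t1 * (1 - a))));
        j1 = 0;
        j2 = 1;
        k2 = (df - 3) / 2;
    }
    // The comparison operator was lost in the source; "<=" is assumed,
    // which gives the number of series terms the formula requires.
    for (int i = 1; i <= k2; i++)
    {
        j1 = j1 + 2;
        j2 = j2 + 2;
        t2 = t2 * c * c * j1 / j2;
        t1 = t1 + t2;
    }
    return 1 - (0.5 * (1 + (t1 * (1 - a * (df % 2)))));
}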
-
For more,
visit Statistics.
The Effect Size
-
Effect
size (ES) is a ratio of a mean difference to a standard deviation, i.e.
it is a form of z-score. Suppose an experimental treatment group has a
mean score of Xe and a control group has a mean score of Xc and a standard
deviation of Sc, then the effect size is equal to (Xe - Xc)/Sc
-
Effect
size permits the effects of different treatments to be compared,
even when based on different samples and different measuring instruments.
-
Therefore,
the ES is the standardized mean difference between the control group and the treatment
group. However, by Glass's method, ES is (mean1 - mean2)/SD of the control
group, while by the Hunter-Schmidt method, ES is (mean1 - mean2)/pooled SD,
then adjusted by the instrument reliability coefficient. ES is commonly
used in meta-analysis and power analysis.
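-
The two denominators mentioned above can be compared directly. The sketch below is illustrative only: the function names are hypothetical, sample variances use the n - 1 denominator, and the reliability adjustment of the Hunter-Schmidt method is deliberately left out.

```cpp
#include <cmath>
#include <vector>

// Arithmetic mean of a sample.
double mean_of(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Sample variance with the n - 1 denominator.
double sample_variance(const std::vector<double>& x) {
    const double m = mean_of(x);
    double ss = 0.0;
    for (double v : x) ss += (v - m) * (v - m);
    return ss / static_cast<double>(x.size() - 1);
}

// Glass's method: mean difference divided by the control group's SD.
double glass_effect_size(const std::vector<double>& treatment,
                         const std::vector<double>& control) {
    return (mean_of(treatment) - mean_of(control)) / std::sqrt(sample_variance(control));
}

// Mean difference divided by the pooled SD of the two groups
// (no reliability adjustment is applied in this sketch).
double pooled_effect_size(const std::vector<double>& treatment,
                          const std::vector<double>& control) {
    const double n1 = static_cast<double>(treatment.size());
    const double n2 = static_cast<double>(control.size());
    const double pooled_var = ((n1 - 1.0) * sample_variance(treatment) +
                               (n2 - 1.0) * sample_variance(control)) / (n1 + n2 - 2.0);
    return (mean_of(treatment) - mean_of(control)) / std::sqrt(pooled_var);
}
```

When the treatment also changes the spread of scores, the two versions differ, which is one reason a report should say which denominator was used.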
-
Further
Readings:
Glass
G., McGaw B., and M. Smith, Meta-analysis in Social Research, Newbury
Park, CA: Sage, 1981.
Cooper
H., and L. Hedges, The Handbook of Research Synthesis, NY, Russell
Sage, 1994.
Structural Equation Modeling
-
The structural
equation modeling techniques are used to study relations among variables.
The relations are typically assumed to be linear. In social and behavioral
research most phenomena are influenced by a large number of determinants
which typically have a complex pattern of interrelationships. To understand
the relative importance of these determinants their relations must be adequately
represented in a model, which may be done with structural equation modeling.
-
A structural
equation model may apply to one group of cases or to multiple groups of
cases. When multiple groups are analyzed parameters may be constrained
to be equal across two or more groups. When two or more groups are analyzed,
means on observed and latent variables may also be included in the model.
-
As an application,
how do you test the equality of regression slopes coming from the same
sample using 3 different measuring methods? You could use a structural
modeling approach.
-
1 - Standardize
all three data sets prior to the analysis because β
weights are also a function of the variance of the predictor variable, and
standardization removes this source of variation.
-
2 - Model
the dependent variable as the effect from all three measures and obtain
the path coefficient (β
weight) for each one.
-
3 - Then
fit a model in which the three path coefficients are constrained to be
equal. If a significant decrement in fit occurs, the paths are not equal.
-
Further
Reading:
Schumacker
R., and R. Lomax, A Beginner's Guide to Structural Equation Modeling,
Lawrence Erlbaum, New Jersey, 1996.
-
Visit also
the Web site Structural
Equation Modeling on the Internet
Tri-linear Coordinates
Triangle
-
A "ternary
diagram" is usually used to show the change of opinion (FOR - AGAINST -
UNDECIDED). The triangular diagram was first used by the chemist Willard Gibbs
in his studies on phase transitions. It is based on the proposition from
geometry that in an equilateral triangle, the sum of the distances from
any point to the three sides is constant. This implies that the percent
composition of a mixture of three substances can be represented as a point
in such a diagram, since the sum of the percentages is constant (100).
The three vertices are the points of the pure substances.
-
The same
holds for the "composition" of the opinions in a population. When percents
for, against and undecided sum to 100, the same technique for presentation
can be used. See the diagram below, which should be viewed with a non-proportional
(fixed-width) font; a true equilateral shape may not be preserved in transmission. For example, let
the initial composition of opinions be given by point 1: few undecided, and
roughly as many for as against. Let another composition be given
by point 2. This point represents a higher percentage undecided and, among
the decided, a majority of "for".
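-
Plotting a composition in the triangle amounts to a change of coordinates. The sketch below assumes one particular layout that is not specified in the text: FOR at (0, 0), AGAINST at (1, 0) and UNDECIDED at the apex (0.5, sqrt(3)/2). The names Point2D and ternary_to_xy are hypothetical.

```cpp
#include <cmath>

struct Point2D {
    double x;
    double y;
};

// Map a three-part composition to a point inside an equilateral triangle of
// side 1. Assumed vertex layout: FOR = (0, 0), AGAINST = (1, 0),
// UNDECIDED = (0.5, sqrt(3)/2). Inputs may be percentages or counts; they are
// rescaled so that the three shares sum to 1.
Point2D ternary_to_xy(double pct_for, double pct_against, double pct_undecided) {
    const double total = pct_for + pct_against + pct_undecided;
    const double against = pct_against / total;
    const double undecided = pct_undecided / total;
    // The point is the share-weighted average of the three vertices.
    return {against + 0.5 * undecided, (std::sqrt(3.0) / 2.0) * undecided};
}
```

With this layout a composition such as (45, 45, 10) lands near the middle of the base, as described for point 1, while (40, 20, 40) sits higher and to the left of centre, as described for point 2; both compositions are made up for illustration.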
Internal and Inter-rater
Reliability
-
"Internal
reliability" of a scale is often measured by Cronbach's coefficient α.
It is relevant when you will compute a total score and you want to know
its reliability, based on no other rating. The "reliability" is *estimated*
from the average correlation, and from the number of items, since a longer
scale will (presumably) be more reliable. Whether the items have the same
means is not usually important.
-
Tau-equivalent:
The true scores on items are assumed to differ from each other by no more
than a constant. For α
to equal the reliability of the measure, the items comprising it have to be
at least tau-equivalent; if this assumption is not met, α
is a lower-bound estimate of reliability.
-
Congeneric
measures: This least restrictive model within the framework of classical
test theory requires only that true scores on measures said to be measuring
the same phenomenon be perfectly correlated. Consequently, on congeneric
measures, error variances, true-score means, and true-score variances may
be unequal.
-
For "inter-rater"
reliability, one distinction is that the importance lies with the reliability
of the single rating. Suppose we have the following data
Participant  Time  Q1  Q2  Q3  ...  Q17
001          1     4   5   4   ...  4
002          1     3   4   3   ...  3
001          2     4   4   5   ...  3
etc.
-
By examining
the data, I think one cannot do better than looking at the paired t-test
and Pearson correlations between each pair of raters - the t-test tells
you whether the means are different, while the correlation tells you whether
the judgments are otherwise consistent.
-
Unlike
the Pearson, the "intra-class" correlation assumes that the raters do have
the same mean. It is not bad as an overall summary, and it is precisely
what some editors do want to see presented for reliability across raters.
It is both a plus and a minus, that there are a few different formulas
for intra-class correlation, depending on whose reliability is being estimated.
-
For purposes
such as planning the Power for a proposed study, it does matter whether
the raters to be used will be exactly the same individuals. A good methodology
to apply in such cases, is the Bland & Altman analysis.
-
Visit also
the Web site Common
Correlation and Reliability Analysis.
Nonparametric Techniques
-
A statistical
technique is called nonparametric if it satisfies at least
one of the following five criteria:
-
1. The
data entering the analysis are enumerative - that is, count data representing
the number of observations in each category or cross-category.
-
2. The
data are measured and/or analyzed using a nominal scale of measurement.
-
3. The
data are measured and/or analyzed using an ordinal scale of measurement.
-
4. The
inference does not concern a parameter in the population distribution -
as, for example, the hypothesis that a time-ordered set of observations
exhibits a random pattern.
-
5. The
probability distribution of the statistic upon which the analysis is
based is not dependent upon specific information or assumptions about the
population(s) from which the sample(s) are drawn, but only on general assumptions,
such as a continuous and/or symmetric population distribution.
-
By this
definition, the distinction of nonparametric is accorded either because
of the level of measurement used or required for the analysis, as in types
1 through 3; the type of inference, as in type 4 or the generality of the
assumptions made about the population distribution, as in type 5.
-
For example,
one may use the Mann-Whitney Rank Test as a nonparametric alternative to
Student's t-test when one does not have normally distributed data.
-
Mann-Whitney:
To be used with two independent groups (analogous to the independent groups
t-test)
Wilcoxon:
To be used with two related (i.e., matched or repeated) groups (analogous
to the related samples t-test)
Kruskal-Wallis:
To be used with two or more independent groups (analogous to the single-factor
between-subjects ANOVA)
Friedman:
To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)
Analysis of Incomplete
Data
-
Methods
dealing with analysis of data with missing values can be classified into:
-
- Analysis
of complete cases, including weighting adjustments,
- Imputation methods, and extensions to multiple imputation, and
- Methods
that analyze the incomplete data directly without requiring a rectangular
data set, such as maximum likelihood and Bayesian methods.
-
Multiple
imputation (MI) is a general paradigm for the analysis of incomplete data.
Each missing datum is replaced by m > 1 simulated values, producing m simulated
versions of the complete data. Each version is analyzed by standard complete-data
methods, and the results are combined using simple rules to produce inferential
statements that incorporate missing data uncertainty. The focus is on the
practice of MI for real statistical problems in modern computing environments.
-
Further
Readings:
Rubin
D., Multiple Imputation for Nonresponse in Surveys, New York, Wiley,
1987.
Schafer
J., Analysis of Incomplete Multivariate Data, London, Chapman and
Hall, 1997.
-
Little
R., and D. Rubin, Statistical Analysis with Missing Data, New York,
Wiley, 1987.
Interactions in ANOVA
and Regression Analysis
-
Interactions
are ignored only if you permit it. For historical reasons, ANOVA programs
generally produce all possible interactions, while (multiple) regression
programs generally do not produce any interactions - at least, not so routinely.
So it's up to the user to construct interaction terms when using regression
to analyze a problem where interactions are, or may be, of interest. (By
"interaction terms" I mean variables that carry the interaction information,
included as predictors in the regression model.)
-
The easiest
construction is to multiply together the predictors whose interaction is
to be included. When there are more than about three predictors, and especially
if the raw variables take values that are distant from zero (like number
of items right), the various products (for the numerous interactions that
can be generated) tend to be highly correlated with each other, and with
the original predictors. This is sometimes called "the problem of multicollinearity",
although it would more accurately be described as spurious multicollinearity.
It is possible, and often to be recommended, to adjust the raw products
so as to make them orthogonal to the original variables (and to lower-order
interaction terms as well).
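-
The effect of raw products on collinearity can be seen with a few numbers. The sketch below uses mean-centering, a simpler and commonly used adjustment than the full orthogonalization mentioned above; the data are made up (a balanced design with values far from zero), and all function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double mean_of(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Subtract the mean from every value.
std::vector<double> centered(const std::vector<double>& v) {
    const double m = mean_of(v);
    std::vector<double> out;
    for (double x : v) out.push_back(x - m);
    return out;
}

// Element-wise product, i.e. the interaction term built from two predictors.
std::vector<double> product(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> out;
    for (std::size_t i = 0; i < a.size(); ++i) out.push_back(a[i] * b[i]);
    return out;
}

// Pearson correlation coefficient between two equally long samples.
double pearson_r(const std::vector<double>& a, const std::vector<double>& b) {
    const double ma = mean_of(a);
    const double mb = mean_of(b);
    double sab = 0.0, saa = 0.0, sbb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - ma;
        const double db = b[i] - mb;
        sab += da * db;
        saa += da * da;
        sbb += db * db;
    }
    return sab / std::sqrt(saa * sbb);
}
```

For x = (101, 102, 103, 101, 102, 103) and w = (201, 201, 201, 202, 202, 202), pearson_r(x, product(x, w)) is above 0.95, while the same correlation computed from centered(x) and centered(w) vanishes. In unbalanced data centering reduces, but does not necessarily remove, the correlation.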
-
What does
it mean if the standard error term is high? Multicollinearity is not the
only factor that can cause large SE's for estimators of "slope" coefficients
in regression models. SE's are inversely related to the spread of
the predictor variable. For example, if you were estimating
the linear association between weight (x) and some dichotomous outcome
and x=(50,50,50,50,51,51,53,55,60,62) the SE would be much larger than
if x=(10,20,30,40,50,60,70,80,90,100), all else being equal. There is a
lesson here for the planning of experiments: to increase the precision
of estimators, increase the range of the input. Another cause of large
SE's is a small number of "event" observations or a small number of "non-event"
observations (analogous to small variance in the outcome variable). This
is not strictly controllable but will increase all estimator SE's (not
just an individual SE). Yet another cause of high standard errors
is serial correlation. This problem is frequent, if not typical,
when using time series, since in that case the stochastic disturbance term
will often reflect variables, not included explicitly in the model, that
may change slowly as time passes.
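-
The effect of predictor spread can be quantified for the two weight vectors given above. The sketch below assumes an ordinary least-squares slope with a continuous outcome, where SE(b) = sigma / sqrt(sum of squared deviations of x); the dichotomous-outcome case in the text behaves similarly but follows a different formula. The function names and the value of sigma are hypothetical.

```cpp
#include <cmath>
#include <vector>

// Sum of squared deviations of x from its mean.
double sum_sq_dev(const std::vector<double>& x) {
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return ss;
}

// Standard error of a simple-regression slope, assuming residual standard
// deviation sigma (ordinary least squares, continuous outcome).
double slope_se(double sigma, const std::vector<double>& x) {
    return sigma / std::sqrt(sum_sq_dev(x));
}
```

With the same sigma, the narrow design (50, 50, 50, 50, 51, 51, 53, 55, 60, 62) gives a slope SE roughly 6.8 times that of the wide design (10, 20, ..., 100), because the sums of squared deviations are 177.6 and 8250 respectively.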
-
In a linear
model representing the variation in a dependent variable Y as a linear
function of several explanatory variables, interaction between two explanatory
variables X and W can be represented by their product: that is, by the
variable created by multiplying them together. Algebraically such a model
is represented by:
-
Y = a + b1 X + b2 W + b3 XW + e.
-
When X
and W are category systems, this equation describes a two-way analysis
of variance (ANOV) model; when X and W are (quasi-)continuous variables,
this equation describes a multiple linear regression (MLR) model.
-
In ANOV
contexts, the existence of an interaction can be described as a difference
between differences: the difference in means between two levels of X at
one value of W is not the same as the difference in the corresponding means
at another value of W, and this not-the-same-ness constitutes the interaction
between X and W; it is quantified by the value of b3.
-
In MLR
contexts, an interaction implies a change in the slope (of the regression
of Y on X) from one value of W to another value of W (or, equivalently,
a change in the slope of the regression of Y on W for different values
of X): in a two-predictor regression with interaction, the response surface
is not a plane but a twisted surface (like "a bent cookie tin", in Darlington's
(1990) phrase). The change of slope is quantified by the value of b3.
For details, see Modelling
and Interpreting Interactions in multiple Regression
Distance Sampling
-
The term
'distance sampling' covers a range of methods for assessing wildlife abundance:
-
line transect
sampling, in which the distances sampled are distances of detected objects
(usually animals) from the line along which the observer travels
-
point transect
sampling, in which the distances sampled are distances of detected objects
(usually birds) from the point at which the observer stands
-
cue counting,
in which the distances sampled are distances from a moving observer to
each detected cue given by the objects of interest (usually whales)
-
trapping
webs, in which the distances sampled are from the web center to trapped
objects (usually invertebrates or small terrestrial vertebrates)
-
migration
counts, in which the 'distances' sampled are actually times of detection
during the migration of objects (usually whales) past a watch point
-
Many mark-recapture
models have been developed over the past 40 years. Monitoring of biological
populations is receiving increasing emphasis in many countries. Data from
marked populations can be used for the estimation of survival probabilities,
how these vary by age, sex and time, and how they correlate with external
variables. Immigration and emigration rates, population size,
and the proportion of age classes that enter the breeding population are
often important but difficult to estimate with precision for free-ranging
populations. Estimation of the finite rate of population change and fitness
is still more difficult to address in a rigorous manner.
-
For more
details read:
Buckland
S., D. Anderson, K. Burnham, and J. Laake, Distance Sampling: Estimating
Abundance of Biological Populations, Chapman and Hall, London, 1993.
Data Mining and Knowledge
Discovery
-
The continuing
rapid growth of on-line data and the widespread use of databases necessitate
the development of techniques for extracting useful knowledge and for facilitating
database access. The challenge of extracting knowledge from data is of
common interest to several fields, including statistics, databases, pattern
recognition, machine learning, data visualization, optimization, and high-performance
computing.
-
Data mining
is an analytic process designed to explore large amounts of (typically
business or market related) data in search for consistent patterns and/or
systematic relationships between variables, and then to validate the findings
by applying the detected patterns to new subsets of data. The process thus
consists of three basic stages: exploration, model building or pattern
definition, and validation/verification.
-
What distinguishes
data mining from conventional statistical data analysis is that data mining
is usually done for the purpose of "secondary analysis" aimed at finding
unsuspected relationships unrelated to the purposes for which the data
were originally collected.
-
Data warehousing
is a process of organizing the storage of large, multivariate data sets
in a way that facilitates the retrieval of information for analytic purposes.
-
Data mining
is now a rather vague term, but the element that is common to most definitions
is "predictive modeling with large data sets as used by big companies".
Therefore, data mining is the extraction of hidden predictive information
from large databases. It is a powerful new technology with great potential,
for example, to help marketing managers "preemptively define the information
market of tomorrow." Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by retrospective tools. Data mining answers
business questions that traditionally were too time-consuming to resolve.
Data mining tools scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.
-
Data mining
techniques can be implemented rapidly on existing software and hardware
platforms across the large companies to enhance the value of existing resources,
and can be integrated with new products and systems as they are brought
on-line. When implemented on high performance client-server or parallel
processing computers, data mining tools can analyze massive databases while
a customer or analyst takes a coffee break, then deliver answers to questions
such as, "Which clients are most likely to respond to my next promotional
mailing, and why?"
-
Knowledge
discovery in databases aims at tearing down the last barrier in enterprises'
information flow, the data analysis step. It is a label for an activity
performed in a wide variety of application domains within the science and
business communities, as well as for pleasure. The activity uses a large
and heterogeneous data-set as a basis for synthesizing new and relevant
knowledge. The knowledge is new because hidden relationships within the
data are explicated, and/or data is combined with prior knowledge to elucidate
a given problem. The term relevant is used to emphasize that knowledge
discovery is a goal-driven process in which knowledge is constructed to
facilitate the solution to a problem.
-
Knowledge
discovery may be viewed as a process containing many tasks. Some of these
tasks are well understood, while others depend on human judgment in an
implicit manner. Further, the process is characterized by heavy iterations
between the tasks. This is very similar to many creative engineering processes,
e.g., the development of dynamic models. In this reference mechanistic,
or first principles based, models are emphasized, and the tasks involved
in model development are defined by:
1. Initial data collection and problem formulation. The initial data are collected,
and some more or less precise formulation of the modeling problem is developed.
2. Tools selection. The software tools to support modeling and allow simulation
are selected.
3. Conceptual modeling. The system to be modeled, e.g., a chemical reactor,
a power generator, or a marine vessel, is abstracted at first. The essential
compartments and the dominant phenomena occurring are identified and documented
for later reuse.
4. Model representation. A representation of the system model is generated.
Often, equations are used; however, a graphical block diagram (or any other
formalism) may alternatively be used, depending on the modeling tools selected
above.
5. Implementation. The model representation is implemented using the means
provided by the modeling system of the software employed. These may range
from general programming languages to equation-based modeling languages
or graphical block-oriented interfaces.
6. Verification. The model implementation is verified to really capture the
intent of the modeler. No simulations for the actual problem to be solved
are carried out for this purpose.
7. Initialization. Reasonable initial values are provided or computed, and the
numerical solution process is debugged.
8. Validation. The results of the simulation are validated against some reference,
ideally against experimental data.
9. Documentation. The modeling process, the model, and the simulation results
during validation and application of the model are documented.
10. Model application. The model is used in some model-based process engineering
problem solving task.
For
other model types, like neural network models where data-driven knowledge
is utilized, the modeling process will be somewhat different. Some of the
tasks, like the conceptual modeling phase, will vanish.
-
Typical application
areas for dynamic models are control, prediction, planning, and fault detection
and diagnosis. A major deficiency of today's methods is the lack of ability
to utilize a wide variety of knowledge. As an example, a black-box model
structure has very limited abilities to utilize first principles knowledge
on a problem. This has provided a basis for developing different hybrid
schemes. Two hybrid schemes will highlight the discussion. First, it will
be shown how a mechanistic model can be combined with a black-box model
to represent a pH neutralization system efficiently. Second, the combination
of continuous and discrete control inputs is considered, utilizing a two-tank
example as a case. Different approaches to handle this heterogeneous case
are considered.
-
The hybrid approach may be viewed as a means to integrate
different types of knowledge, i.e., being able to utilize a heterogeneous
knowledge base to derive a model. Standard practice today is that methods
and software can treat large homogeneous data-sets. A typical example of
a homogeneous data-set is time-series data from some system, e.g., temperature,
pressure, and composition measurements over some time frame provided by
the instrumentation and control system of a chemical reactor. If textual
information of a qualitative nature is provided by plant personnel, the
data become heterogeneous.
-
The above discussion will form the basis for
analyzing the interaction between knowledge discovery, and modeling and
identification of dynamic models. In particular, we will be interested
in identifying how concepts from knowledge discovery can enrich the state of the
art within control, prediction, planning, and fault detection and diagnosis
of dynamic systems.
Further
Readings:
Brodley
C., T. Lane, and T. Stough, Knowledge Discovery and Data Mining, American
Scientist, Jan.-Feb. 1999.
Chatfield
Ch., Model Uncertainty, Data Mining and Statistical Inference, Journal
of Royal Statistical Soc. Ser. A., 419-466, 1995.
Glymour
C., D. Madigan, et. al., Statistical themes and lessons for data
mining, Data Mining and Knowledge Discovery, 1, 11-28, 1997.
Hand
D. , Data Mining: Statistics and More?, The American Statistician,
52( 2), 1998.
Heckerman
D., Bayesian networks for data mining, Data Mining and Knowledge Discovery,
1, 79-119, 1997.
-
Visit also
the following Web sites: Data
Mining, and SAS.
Bayes and Empirical Bayes
Methods
-
Bayes and
empirical Bayes (EB) methods provide a structure for combining information from similar
components and produce efficient inferences for both individual
components and shared model characteristics. Many complex applied investigations
are ideal settings for this type of synthesis. For example, county-specific
disease incidence rates can be unstable due to small populations or low
rates. 'Borrowing information' from adjacent counties by partial pooling
produces better estimates for each county, and Bayes/empirical Bayes methods
structure the approach. Importantly, recent advances in computing, and the
consequent ability to evaluate complex models, have increased the popularity
and applicability of Bayesian methods.
-
Bayes and
EB methods can be implemented using modern Markov chain Monte Carlo (MCMC)
computational methods. Properly structured Bayes and EB procedures typically
have good frequentist and Bayesian performance, both in theory and in practice.
This in turn motivates their use in advanced high-dimensional model settings
(e.g., longitudinal data or spatio-temporal mapping models), where a Bayesian
model implemented via MCMC often provides the only feasible approach that
incorporates all relevant model features.
-
Further
Readings:
Carlin
B., and T. Louis, Bayes and Empirical Bayes Methods for Data Analysis,
Chapman and Hall, 1996.
Likelihood Methods
                Direct                  Inverse
__________________________________________________________________
Decision        Neyman-Pearson          Bayesian (decision analysis)
                Wald                    (H. Rubin, e.g.)
------------------------------------------------------------------
Hybrid          "Standard" practice     Bayesian (subjective)
------------------------------------------------------------------
Inference       Early Fisher            fiducial (Fisher)
                                        Likelihood (Edwards)
                                        Bayesian (modern)
                                        belief functions (Shafer)
__________________________________________________________________
-
In the
Direct schools, one uses Pr(data | hypothesis), usually from some model-based
sampling distribution, but one does not attempt to give the inverse probability,
Pr(hypothesis | data), nor any other quantitative evaluation of hypotheses.
The Inverse schools do associate numerical values with hypotheses, either
probabilities (Bayesian schools) or something else (Fisher, Edwards, Shafer).
-
The decision-oriented
methods treat statistics as a matter of action, rather than inference,
and attempt to take utilities as well as probabilities into account in
selecting actions; the inference-oriented methods treat inference as a
goal apart from any action to be taken.
-
The "hybrid"
row could be more properly labeled as "hypocritical"-- these methods talk
some Decision talk but walk the Inference walk.
-
Fisher's
fiducial method is included because it is so famous, but the modern consensus
is that it lacks justification.
-
Now it
is true that, under certain assumptions, some distinct schools advocate highly
similar calculations, and just talk about them or justify them differently.
Some seem to think this is tiresome or impractical. One may disagree, for
three reasons:
-
First,
how one justifies calculations goes to the heart of what the calculations
actually MEAN; second, it is easier to teach things that actually make
sense (which is one reason that standard practice is hard to teach); and
third, methods that do coincide or nearly so for some problems may diverge
sharply for others.
-
The difficulty
with the subjective Bayesian approach is that prior knowledge is represented
by a probability distribution, and this is more of a commitment than is warranted
under conditions of partial ignorance. (Uniform or improper priors are
just as bad in some respects as any other sort of prior.) The methods
in the (Inference, Inverse) cell all attempt to escape this difficulty
by presenting alternative representations of partial ignorance.
-
Edwards,
in particular, uses logarithm of normalized likelihood as a measure of
support for a hypothesis. Prior information can be included in the form
of a prior support (log likelihood) function; a flat support represents
complete prior ignorance.
-
One place
where likelihood methods would deviate sharply from "standard" practice
is in a comparison between a sharp and a diffuse hypothesis. Consider H0:
X ~ N(0, 100) [diffuse] and H1: X ~ N(1, 1) [standard deviation 10 times
smaller]. In standard methods, observing X = 2 would be undiagnostic, since
it is not in a sensible tail rejection interval (or region) for either
hypothesis. But while X = 2 is not inconsistent with H0, it is much better
explained by H1--the likelihood ratio is about 6.2 in favor of H1. In Edwards'
methods, H1 would have higher support than H0, by the amount log(6.2) =
1.8. (If these were the only two hypotheses, the Neyman-Pearson lemma would
also lead one to a test based on likelihood ratio, but Edwards' methods
are more broadly applicable.)
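The figures in this example can be checked with a few lines of code. The sketch below is only an illustration: the helper name normal_pdf is ours, and N(0, 100) is read as mean 0 and variance 100, which matches the remark that the standard deviation of H1 is 10 times smaller.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

x_obs = 2.0
like_h0 = normal_pdf(x_obs, mean=0.0, variance=100.0)  # diffuse hypothesis H0
like_h1 = normal_pdf(x_obs, mean=1.0, variance=1.0)    # sharp hypothesis H1

likelihood_ratio = like_h1 / like_h0   # evidence for H1 relative to H0
support = math.log(likelihood_ratio)   # Edwards-style support difference (natural log)

print(round(likelihood_ratio, 2), round(support, 2))
```

The ratio comes out near 6.2 and its natural logarithm near 1.8, in line with the values quoted above.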
-
I do not
want to appear to advocate likelihood methods. I could give a long discussion
of their limitations and of alternatives that share some of their advantages
but avoid their limitations. But it is definitely a mistake to dismiss
such methods lightly. They are practical (currently widely used in genetics)
and are based on a careful and profound analysis of inference.
Prediction Interval
-
The idea
is that if Xbar is the mean of a random sample of size n from a normal population,
and Y is a single additional observation, then the difference Xbar - Y is
normal with mean 0 and variance (1 + 1/n)σ^2.
-
Since
we don't actually know σ^2,
we estimate it by the sample variance S^2 and use the t distribution (with n - 1
degrees of freedom) in evaluating the statistic. The appropriate Prediction
Interval for Y is
-
Xbar ± t(α/2) . S . (1 + 1/n)^(1/2).
-
This is similar to the construction of the interval for an individual prediction
in regression analysis.
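As a rough sketch of the calculation, the function below computes the interval from a sample and a t critical value. The critical value is passed in by the user (here 2.365, the tabulated two-sided 95% value for 7 degrees of freedom); the function name and the data are made up for illustration.

```python
import math
import statistics

def prediction_interval(sample, t_crit):
    """Interval for one further observation: mean +/- t * S * sqrt(1 + 1/n)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n)
    return xbar - half_width, xbar + half_width

# Hypothetical measurements, n = 8, so the t distribution has 7 degrees of freedom.
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0]
T_CRIT_95_DF7 = 2.365  # looked up in a t table, not computed here

low, high = prediction_interval(sample, T_CRIT_95_DF7)
print(low, high)
```

Note that the factor sqrt(1 + 1/n) makes this interval wider than the confidence interval for the mean, which uses sqrt(1/n).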
Fitting Data to a Broken
Line
-
To fit
data to a broken line, one must determine the parameters a, b, c, and d such
that
-
y = a + b x, for x less than or equal to c
y = a - d c + (d + b) x, for x greater than or equal to c
-
A simple
solution is a brute force search across the values of c. Once c is known,
estimating a, b, and d is trivial through the use of indicator variables.
One may use (x-c) as your independent variable, rather than x, for computational
convenience.
-
Now, just
fix c at a fine grid of x values in the range of your data, estimate a,
b, and d, and then note what the mean squared error is. Select the value
of c that minimizes the mean squared error.
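The search can be sketched as follows, under stated assumptions. The two branches are written as one model, y = a + b x + d max(x - c, 0), which agrees with both equations above. The function names, the candidate grid, and the noise-free data are illustrative only; since the number of fitted parameters is the same for every c, minimizing the residual sum of squares is the same as minimizing the mean squared error.

```python
def solve_linear(a_matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(a_matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_given_break(xs, ys, c):
    """Least squares fit of y = a + b*x + d*max(x - c, 0) for a fixed break point c."""
    rows = [[1.0, x, max(x - c, 0.0)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a, b, d = solve_linear(xtx, xty)
    sse = sum((y - (a + b * r[1] + d * r[2])) ** 2 for r, y in zip(rows, ys))
    return a, b, d, sse

def fit_broken_line(xs, ys, candidates):
    """Brute-force search over candidate break points; returns (a, b, c, d, sse)."""
    best = None
    for c in candidates:
        a, b, d, sse = fit_given_break(xs, ys, c)
        if best is None or sse < best[4]:
            best = (a, b, c, d, sse)
    return best

# Illustrative data generated from a = 1, b = 2, c = 5, d = -3 without noise.
xs = [float(x) for x in range(11)]
ys = [1.0 + 2.0 * x - 3.0 * max(x - 5.0, 0.0) for x in xs]
# Candidate break points strictly inside the data range (1.0, 1.5, ..., 9.0).
candidates = [1.0 + 0.5 * i for i in range(17)]

a, b, c, d, sse = fit_broken_line(xs, ys, candidates)
print(a, b, c, d)
```

Candidates at or outside the ends of the data range are excluded on purpose: there the hinge term is either all zero or a linear function of x, and the three parameters are no longer identifiable.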
-
Unfortunately,
you won't be able to get confidence intervals involving c, and the confidence
intervals for the remaining parameters will be conditional on the value
of c.
-
For more
details, see Applied Regression Analysis, by Draper and Smith, Wiley
1981, Chapter 5, Section 5.4 on the use of dummy variables, Example 6.
Two Parallel Regression
Lines
-
To determine
whether two regression lines are parallel, construct the following
multiple linear regression model:
E(y) = b0 + b1 X1 + b2 X2 + b3 X3
where X1 = interval predictor variable, X2 = 1 if group 1,
0 if group 0,
and X3 = X1 X2 (the product of X1 and X2).
Then, E(y | group=0) = b0 + b1 X1
and E(y | group=1) = b0 + b1 X1 + b2 (1) + b3 X1 (1)
= b0 + b1 X1 + b2 + b3 X1
= (b0 + b2) + (b1 + b3) X1
-
That is,
E(y|group=1) is a simple regression with a potentially different slope
and intercept compared to group=0.
-
Ho: slope(group
1) = slope(group 0) is equivalent to Ho: b3 = 0
-
Use the t-test for b3
from the variables-in-the-equation table to test this hypothesis.
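For readers without a regression package at hand, the same t statistic can be sketched directly. Because the model gives each group its own intercept and slope, its least squares fit coincides with fitting a separate line to each group, with b3 equal to the difference in slopes and the error variance pooled over both groups (n - 4 degrees of freedom). The function names and data below are made up for illustration.

```python
import math

def line_fit(xs, ys):
    """Simple least squares line; returns intercept, slope, Sxx, and residual SS."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sxx, sse

def interaction_t(x0, y0, x1, y1):
    """t statistic for b3 (difference in slopes) with a pooled error variance."""
    _, slope0, sxx0, sse0 = line_fit(x0, y0)
    _, slope1, sxx1, sse1 = line_fit(x1, y1)
    df = len(x0) + len(x1) - 4           # four fitted coefficients b0..b3
    s2 = (sse0 + sse1) / df              # pooled residual mean square
    b3 = slope1 - slope0
    se_b3 = math.sqrt(s2 * (1.0 / sxx0 + 1.0 / sxx1))
    return b3, se_b3, b3 / se_b3, df

# Hypothetical data for group 0 and group 1.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0]
y0 = [2.1, 3.9, 6.2, 7.8, 10.1]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [3.0, 5.1, 6.9, 9.2, 10.8]

b3, se_b3, t_stat, df = interaction_t(x0, y0, x1, y1)
print(b3, se_b3, t_stat, df)
```

The t statistic is compared with a t table on df degrees of freedom; a small absolute value gives no evidence against parallel lines.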
Constrained Regression
Model
-
If you
fit a regression forcing the intercept to be zero, the standard error of
the slope is less. That seems counter-intuitive. The intercept should be
included in the model because it is significant, so why is the standard
error for the slope in the worse-fitting model actually smaller?
-
I agree
that it's initially counter-intuitive (see below), but here are two reasons
why it's true. The variance of the slope estimate for the constrained model
is s^2 / Sum(Xi^2),
where the Xi are the actual X values and s^2
is estimated from the residuals. The variance of the slope estimate for
the unconstrained model (with intercept) is s^2 / Sum(xi^2),
where the xi are deviations from the mean, and s^2 is
still estimated from the residuals. Since Sum(Xi^2) = Sum(xi^2) + n Xbar^2,
the first denominator is never smaller than the second. So, the constrained model can have
a larger s^2
(mean square error/"residual" and standard error of estimate) but a smaller
standard error of the slope because the denominator is larger.
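A small numerical sketch makes the point concrete. The data are invented so that the X values lie far from zero and the true intercept is clearly non-zero; the function names are ours.

```python
import math

def fit_with_intercept(xs, ys):
    """Returns slope, residual standard error s, and standard error of the slope."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)          # sum of squared deviations
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    s2 = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return slope, math.sqrt(s2), math.sqrt(s2 / sxx)

def fit_through_origin(xs, ys):
    """Same quantities for the model forced through the origin."""
    n = len(xs)
    sum_x2 = sum(x * x for x in xs)                 # sum of squared raw X values
    slope = sum(x * y for x, y in zip(xs, ys)) / sum_x2
    s2 = sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / (n - 1)
    return slope, math.sqrt(s2), math.sqrt(s2 / sum_x2)

# Hypothetical data, roughly y = 5 + 0.5 x.
xs = [10.0, 11.0, 12.0, 13.0, 14.0]
ys = [10.1, 10.4, 11.1, 11.4, 12.1]

slope_full, s_full, se_full = fit_with_intercept(xs, ys)
slope_zero, s_zero, se_zero = fit_through_origin(xs, ys)
print(s_full, se_full)
print(s_zero, se_zero)
```

For these data the no-intercept fit has the larger residual standard error yet the smaller standard error of the slope, exactly the pattern described above.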
-
r^2 also
behaves very strangely in the constrained model: by the conventional formula
it can be negative; by the formula used by most computer packages it is
generally larger than the unconstrained r^2, because it deals with deviations
from 0, not deviations from the mean. This is because, in effect, constraining
the intercept to 0 forces us to act as if the mean of X and the mean of
Y were both 0.
-
Once you
recognize that the s.e. of the slope isn't really a measure of overall
fit, the result starts to make a lot of sense. Assume that all your X and
Y are positive. If you're forced to fit the regression line through the
origin (or any other point) there will be less "wiggle" in how you can
fit the line to the data than there would be if both "ends" could move.
-
Consider
a bunch of points that are ALL way out, far from zero. If you force
the regression through zero, that line will pass through the origin and be very
close to all the points, with LITTLE ERROR, but also with little precision and little
validity. Therefore, a no-intercept model is hardly ever appropriate.
Semiparametric and Non-parametric
modeling
-
Many parametric
regression models in applied science have a form like response = function(X1,...,
Xp, unknown influences). The "response" may be a decision (to
buy a certain product), which depends on p measurable variables and an
unknown remainder term. In statistics, the model is usually written as
-
Y = m(
X1, ..., Xp) + e
-
and the
unknown e is interpreted as error term.
-
The simplest
model for this problem is the linear regression model; an often
used generalization is the Generalized Linear Model (GLM)
-
Y= G(X1b1
+ ... + Xpbp) + e
-
where G
is called the link function. All these models lead to the problem of estimating
a multivariate regression. Parametric regression estimation has the disadvantage
that the parametric "form" already implies certain properties of the resulting
estimate.
-
Nonparametric
techniques allow diagnostics of the data without this restriction. However,
this requires large sample sizes and causes problems in graphical visualization.
Semiparametric methods are a compromise between both: they support a nonparametric
modeling of certain features and profit from the simplicity of parametric
methods.
-
Further
Readings:
Härdle
W., S. Klinke, and B. Turlach, XploRe: An Interactive Statistical Computing
Environment, Springer, New York, 1995.
Moderation and Mediation
-
"Moderation"
is an interactional concept. That is, a moderator variable "modifies" the
relationship between two other variables. "Mediation", by contrast, is a "causal
modeling" concept: the "effect" of one variable on another is "mediated"
through another variable. That is, under complete mediation there is no "direct
effect", but rather an "indirect effect."
Discriminant and Classification
-
Classification
or discrimination involves learning a rule whereby a new observation can
be classified into a pre-defined class. Current approaches can be grouped
into three historical strands: statistical, machine learning and neural
network. The classical statistical methods make distributional assumptions.
There are many others which are distribution free, and which require some
regularization so that the rule performs well on unseen data. Recent interest
has focused on the ability of classification methods to be generalized.
-
We often
need to classify individuals into two or more populations based on a set
of observed "discriminating" variables. Methods of classification are used
when discriminating variables are:
-
quantitative
and approximately normally distributed;
-
quantitative
but possibly nonnormal;
-
categorical;
or
-
a combination
of quantitative and categorical.
-
It is important
to know when and how to apply linear and quadratic discriminant analysis,
nearest neighbor discriminant analysis, logistic regression, categorical
modeling, classification and regression trees, and cluster analysis to
solve the classification problem. SAS has all the routines you need
for proper use of these classification methods. Relevant topics are: Matrix operations,
Fisher's Discriminant Analysis, Nearest Neighbor Discriminant Analysis,
Logistic Regression and Categorical Modeling for classification, and Cluster
Analysis.
-
For example,
two related methods which are distribution free are the k-nearest neighbor
classifier and the kernel density estimation approach. In both methods,
there are several problems of importance: the choice of smoothing parameter(s)
or k, and choice of appropriate metrics or selection of variables. These
problems can be addressed by cross-validation methods, but this is computationally
slow. An analysis of the relationship with a neural net approach (LVQ)
should yield faster methods.
-
Further
Readings:
Cherkassky
V, and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John
Wiley & Sons, 1998.
-
Visit also
the Web site Tree-Structured
& Rules Induction Programs Homepage
Generalized Linear and
Logistic Models
-
The generalized
linear model (GLM) is possibly the most important development in practical
statistical methodology in the last twenty years. Generalized linear models
provide a versatile modeling framework in which a function of the mean
response is "linked" to the covariates through a linear predictor and in
which variability is described by a distribution in the exponential dispersion
family. These models include logistic regression and log-linear models
for binomial and Poisson counts together with normal, gamma and inverse
Gaussian models for continuous responses. Standard techniques for analyzing
censored survival data, such as the Cox regression, can also be handled
within the GLM framework. Relevant topics are: Normal theory linear models,
Inference and diagnostics for GLMs, Binomial regression, Poisson regression,
Methods for handling overdispersion, Generalized estimating equations (GEEs).
-
Here is
how to obtain the number of degrees of freedom for the 2 log-likelihood
(likelihood ratio) statistic in a logistic
regression. Degrees of freedom pertain to the dimension of the vector of
parameters for a given model. Suppose we know that a model ln(p/(1-p)) = Bo
+ B1x + B2y + B3w fits a set of data. In this case the vector B = (Bo, B1,
B2, B3) is an element of 4-dimensional Euclidean space, or R4.
-
Suppose
we want to test the hypothesis Ho: B3 = 0. We are imposing a restriction
on our parameter space. The vector of parameters must be of the form B' = (Bo, B1,
B2, 0). This vector is an element of a 3-dimensional subspace of R4, namely
the hyperplane B3 = 0. The likelihood ratio statistic has the form:
-
2 log-likelihood ratio
= 2 log(maximum unrestricted likelihood / maximum restricted likelihood)
= 2 log(maximum unrestricted likelihood) - 2 log(maximum restricted likelihood)
-
Its number of degrees of freedom is the dimension of the
unrestricted B vector (4) minus the dimension of the restricted B
vector (3), which equals 1. This is the dimension of the space of
difference vectors B'' = B - B' = (0, 0, 0, B3), a one-dimensional subspace of
R4.
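The bookkeeping can be sketched in code. The two maximized log-likelihood values below are hypothetical numbers standing in for the output of a fitted unrestricted and restricted model; the function name is ours. For one degree of freedom the chi-square tail probability can be written with the complementary error function, so no statistical library is needed.

```python
import math

def likelihood_ratio_test(loglik_unrestricted, loglik_restricted,
                          n_params_unrestricted, n_params_restricted):
    """Returns the likelihood ratio statistic and its degrees of freedom."""
    statistic = 2.0 * (loglik_unrestricted - loglik_restricted)
    df = n_params_unrestricted - n_params_restricted
    return statistic, df

def chi_square_tail_1df(statistic):
    """P(chi-square with 1 d.f. exceeds statistic), valid for 1 d.f. only."""
    return math.erfc(math.sqrt(statistic / 2.0))

# Hypothetical maximized log-likelihoods for (Bo, B1, B2, B3) and (Bo, B1, B2, 0).
statistic, df = likelihood_ratio_test(-120.50, -122.42, 4, 3)
p_value = chi_square_tail_1df(statistic)
print(statistic, df, p_value)
```

With these made-up values the statistic is 3.84 on 1 degree of freedom, which sits right at the conventional 5% boundary.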
-
The standard
textbook is Generalized Linear Models by McCullagh and Nelder (Chapman
& Hall, 1989).
LOGISTIC REGRESSION VAR=x
/METHOD=ENTER y x1 x2 f1ros f1ach f1grade bylocus byses
/CONTRAST (y)=Indicator
/contrast (x1)=indicator
/contrast (x2)=indicator
/CLASSPLOT /CASEWISE OUTLIER(2)
/PRINT=GOODFIT
/CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
-
Spearman's Correlation,
and Kendall's tau Application
-
How would
you compare the values of two variables to determine whether they are ordered
the same? For example:
Var1 Var2
Obs 1 x x
Obs 2 y z
Obs 3 z y
-
Is Var1
ordered the same as Var2? Two measures are Spearman's rank order correlation,
and Kendall's tau. For more details see, e.g., Fundamental Statistics
for the Behavioral Sciences, by David C. Howell, Duxbury Pr., 1995.
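As a minimal sketch, both measures can be computed for the three observations above. It assumes the values are ordered x < y < z, so that they can be coded as ranks 1, 2, 3, and that there are no ties; the function names are ours.

```python
from itertools import combinations

def spearman_rho(ranks1, ranks2):
    """Spearman's rank correlation for untied ranks: 1 - 6*sum(d^2)/(n(n^2 - 1))."""
    n = len(ranks1)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

def kendall_tau(ranks1, ranks2):
    """Kendall's tau for untied data: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks1)), 2):
        product = (ranks1[i] - ranks1[j]) * (ranks2[i] - ranks2[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(ranks1) * (len(ranks1) - 1) / 2
    return (concordant - discordant) / n_pairs

var1 = [1, 2, 3]  # x, y, z
var2 = [1, 3, 2]  # x, z, y

rho = spearman_rho(var1, var2)
tau = kendall_tau(var1, var2)
print(rho, tau)
```

Here rho is 0.5 and tau is 1/3: the two orderings agree on two of the three pairs. With only three observations neither value carries much evidential weight.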
Repeated Measures and
Longitudinal Data
-
Repeated
measures and longitudinal data require special attention because they involve
correlated data that commonly arise when the primary sampling units are
measured repeatedly over time or under different conditions. Normal theory
models for split-plot experiments and repeated measures ANOVA can be used
to introduce the concept of correlated data. PROC GLM and PROC MIXED in
the SAS system may be used. Mixed linear models provide a general framework
for modeling covariance structures, a critical first step that influences
parameter estimation and tests of hypotheses. The primary objectives are
to investigate trends over time and how they relate to treatment groups
or other covariates. Techniques applicable to non-normal data, such as
McNemar's test for binary data, weighted least squares for categorical
data, and generalized estimating equations (GEE) are the main topics. The
GEE method can be used to accommodate correlation when the means at each
time point are modelled using a generalized linear model. Relevant topics
are: Balanced split-plot and repeated measures designs, Modeling covariance
structures of repeated measures, Repeated measures with unequally spaced
times and missing data, Weighted least squares approach to repeated categorical
data, Generalized estimating equation (GEE) method for marginal models,
Subject-specific versus population-averaged interpretation of regression
coefficients, and Computer implementation using S-Plus and the SAS system.
The following describes the McNemar's test for binary data.
-
McNemar
Change Test: For the yes/no questions under the two conditions, set
up a 2x2 contingency table:
f11 f10
f01 f00
-
McNemar's
test of correlated proportions is z = (f01 - f10)/sqrt(f01 + f10).
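A short sketch of the computation follows. Only the two discordant cells enter the statistic; the counts are hypothetical, the function name is ours, and the two-sided p-value uses the large-sample normal approximation.

```python
import math

def mcnemar_z(f01, f10):
    """McNemar's z from the discordant counts, with a two-sided normal p-value."""
    z = (f01 - f10) / math.sqrt(f01 + f10)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical discordant counts: 15 changed one way, 5 changed the other way.
z, p_value = mcnemar_z(15, 5)
print(z, p_value)
```

The concordant cells f11 and f00 do not affect z, since they carry no information about a change between the two conditions.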
-
For those
items yielding a score on a scale, the conventional t-test for correlated
samples would be appropriate, or the Wilcoxon signed-ranks test.
What Is a Systematic
Review?
-
Health
care decision makers need to access research evidence to make informed
decisions on diagnosis, treatment and health care management for both individual
patients and populations. Systematic reviews are recognized as one of the
most useful and reliable tools to assist this practice of evidence-based
health care. These courses aim to train health care professionals and researchers
in the science and methods of systematic reviews.
-
There are
few important questions in health care which can be informed by consulting
the result of a single empirical study. Systematic reviews attempt to provide
answers to such problems by identifying and appraising all available studies
within the relevant focus and synthesizing their results, all according
to explicit methodologies. The review process places special emphasis on
assessing and maximizing the value of data, both in issues of reducing
bias and minimizing random error. The systematic review method is most
suitably applied to questions of patient treatment and management, although
it has also been applied to answer questions regarding the value of diagnostic
test results, likely prognoses and the cost-effectiveness of health care.
Incidence and Prevalence
Rates
-
Incidence
rate (IR) is the rate at which new events occur in a population. It is
defined as: Number of new events in a specified period divided by Number
of persons exposed to risk during this period
-
Prevalence
rate (PR) measures the number of cases that are present at a specified
time. It is defined as: Number of cases present at a specified
time divided by Number of persons at risk at that specified time.
-
These two
measures are related through the average duration (D) of the condition. That
is, PR = IR x D (approximately, when prevalence is low and the population is stable).
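A tiny worked example of this relation, with made-up figures and function names of our own choosing:

```python
def incidence_rate(new_events, persons_at_risk):
    """New events in the period divided by persons exposed to risk in the period."""
    return new_events / persons_at_risk

def prevalence_from_incidence(ir, average_duration):
    """Approximate prevalence rate from PR = IR x D."""
    return ir * average_duration

# Hypothetical: 50 new cases per year among 10,000 persons, average duration 2 years.
ir = incidence_rate(50, 10000)
pr = prevalence_from_incidence(ir, 2.0)
print(ir, pr)
```

So an incidence of 5 per 1,000 per year with a two-year average duration corresponds to a prevalence of about 1%.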
-
Note that,
for example, county-specific disease incidence rates can be unstable due
to small populations or low rates. In epidemiology one can say that IR
reflects the probability to become sick at a given age, while PR reflects
the probability to be sick at a given age.
-
Software Selection
-
You have
to be careful when selecting software. A short list of items for comparison
is:
-
1) Ease of learning,
2) Amount of help incorporated for the user,
3) Level of the user,
4) Number of tests and routines involved,
5) Ease of data entry,
6) Data validation (and if necessary, data locking and security),
7) Accuracy of the tests and routines,
8) Integrated data analysis (graphs and progressive reporting on analysis
in one screen),
9) Cost
-
No one
software package meets everyone's needs. Determine the needs first and then ask
the questions relevant to the above nine criteria.
Box-Cox Power Transformation
-
In certain
cases data distribution is not normal (Gaussian), and we wish to find the
best transformation of variable in order to obtain a Gaussian data distribution
for further statistical processing.
-
Among others
the Box-Cox power transformation is often used for this purpose.
y = (x^p - 1)/p, for p not zero
y = log x, for p = 0
-
Trying
different values of p between -3 and +3 is usually sufficient, but there
are MLE methods for estimating the best p. A good source on this and other
transformation methods is
Madansky
A., Prescriptions for Working Statisticians, Springer-Verlag, 1988.
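The transformation and a crude search over p can be sketched as below. Note the assumption: instead of the MLE mentioned above, this sketch simply picks the p on a grid from -3 to +3 whose transformed data have the smallest absolute sample skewness, which is only a rough stand-in for "closest to Gaussian". The data and function names are made up.

```python
import math

def box_cox(x, p):
    """Box-Cox power transformation of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if p == 0:
        return math.log(x)
    return (x ** p - 1.0) / p

def skewness(values):
    """Moment-based sample skewness m3 / m2^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def best_power(data, grid):
    """Grid value of p giving the least skewed transformed data (heuristic)."""
    return min(grid, key=lambda p: abs(skewness([box_cox(x, p) for x in data])))

# Hypothetical right-skewed positive data.
data = [1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 7.5, 12.0, 25.0]
grid = [i / 4 for i in range(-12, 13)]  # -3.0, -2.75, ..., 3.0

p_best = best_power(data, grid)
transformed = [box_cox(x, p_best) for x in data]
print(p_best, skewness(data), skewness(transformed))
```

The p = 0 branch is the limit of (x^p - 1)/p as p approaches zero, which is why the logarithm fits into the same family.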
-
For percentages
or proportions (such as binomial proportions), arcsine transformations
would work better. The original idea of the arcsine transformation, arcsin(sqrt(p)),
is to make the variances equal for all groups. The arcsine transform is derived
analytically to
be the variance-stabilizing and normalizing transformation. The same limit
theorem also leads to the square root transform for Poisson variables (such
as counts) and to the arc hyperbolic tangent (i.e., Fisher's Z) transform
for correlations. The Arcsin Test yields a z and the 2x2 contingency test
yields a chi-square, but z^2 is approximately equal to chi-square for large sample
sizes. A good source is
Rao
C., Linear Statistical Inference and Its Applications, Wiley, 1973.
-
To rescale
a set of data consisting of negative and positive values so that all
values lie in the range 0.0 to 1.0, define XNew = (X - min)/(max - min).
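In code, this rescaling is a one-liner plus a guard for constant data (the function name and values are illustrative):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; the range is zero")
    return [(v - low) / (high - low) for v in values]

scaled = min_max_scale([-4.0, 0.0, 6.0])
print(scaled)
```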
-
Multiple Comparison Tests
-
Multiple
Comparison Procedures include topics such as Control of the Family-Wise
Error Rate, the Closure Principle, Hierarchical Families of Hypotheses,
Single-Step and Stepwise Procedures, and P-value Adjustments. Areas of
application include multiple comparisons among treatment means, multiple
endpoints in clinical trials, multiple sub-group comparisons, etc.
-
Nemenyi's
multiple comparison test is analogous to Tukey's test, using rank sums
in place of means and using sqrt[n^2 k(nk+1)/12] as the estimate
of standard error (SE), where n is the size of each sample and k is the
number of samples (means). Similarly to the Tukey test, you compare (rank
sum A - rank sum B)/SE to the studentized range for k. It is also equivalent
to the Dunn/Miller test, which uses mean ranks and standard error sqrt[k(nk+1)/12].
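The equivalence of the two forms is easy to see numerically, since a rank sum is n times a mean rank and the two standard errors differ by the same factor n. The sketch below uses hypothetical rank sums for k = 3 samples of size n = 5; the function names are ours, and the critical value still has to be taken from a studentized range table.

```python
import math

def q_from_rank_sums(rank_sum_a, rank_sum_b, n, k):
    """Nemenyi form: difference of rank sums over sqrt(n^2 k (nk + 1) / 12)."""
    se = math.sqrt(n * n * k * (n * k + 1) / 12.0)
    return (rank_sum_a - rank_sum_b) / se

def q_from_mean_ranks(mean_rank_a, mean_rank_b, n, k):
    """Dunn/Miller form: difference of mean ranks over sqrt(k (nk + 1) / 12)."""
    se = math.sqrt(k * (n * k + 1) / 12.0)
    return (mean_rank_a - mean_rank_b) / se

n, k = 5, 3
rank_sum_a, rank_sum_b = 55.0, 25.0  # hypothetical rank sums for samples A and B

q_sums = q_from_rank_sums(rank_sum_a, rank_sum_b, n, k)
q_means = q_from_mean_ranks(rank_sum_a / n, rank_sum_b / n, n, k)
print(q_sums, q_means)
```

Both calls give the same statistic, so the choice between rank sums and mean ranks is a matter of convenience.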
Antedependent Modeling
for Repeated Measurements
-
Repeated
measures data arise when observations are taken on each experimental unit
on a number of occasions, and time is a factor of interest.
-
Many techniques
can be used to analyze such data. Antedependence modeling is a recently
developed method which models the correlations between observations at
different times.
Sequential Acceptance
Sampling
-
Acceptance
sampling is a quality control procedure used when a decision on the acceptability
of the batch has to be made from tests done on a sample of items from the
batch.
-
Sequential
acceptance sampling minimizes the number of items tested when the early
results show that the batch clearly meets, or fails to meet, the required
standards.
-
The procedure
has the advantage of requiring fewer observations, on average, than fixed
sample size tests for a similar degree of accuracy.
Local Influence
-
Cook's
distance measures the effect of removing a single observation on regression
estimates. This can be viewed as giving an observation a weight of either
zero or one: local influence allows this weight to be small but non-zero.
-
Cook defined
local influence in 1986, and made some suggestions on how to use or interpret
it; various slight variations have been defined since then. But problems
associated with its use have been pointed out by a number of workers since
the very beginning.
Credit Scoring
-
Credit
Scoring is now in widespread use across the retail credit industry. At
its simplest, a credit scorecard is a model, usually statistical, but in
use it is embedded in a computer and/or human process.
Components of the Interest
Rates
-
The interest
rates as quoted in the newspapers and by banks consist of several components.
The most important three are:
-
The
pure rate: This is the time value of money. A promise of 100 units
next year is not worth 100 units this year.
-
The
price-premium factor: If prices go up 5% each year, interest rates
go up at least 5%. For example, under the Carter Administration, prices
rose about 15% per year for a couple of years, and interest was around 25%.
The same thing happened during the Civil War. In a deflationary period, prices may drop,
so this term can be negative.
-
The
risk factor: A junk bond may pay a larger rate than a treasury note
because of the chance of losing the principal. Banks in a poor financial
condition must pay higher rates to attract depositors for the same reason.
Threat of confiscation by the government leads to high rates in some countries.
-
Other factors
are generally minor. Of course, the customer sees only the sum of these
terms. These components fluctuate at different rates themselves. This makes
it hard to compare interest rates across disparate time periods or economic
conditions. The main question is how these components are combined to
form the index: a simple sum? A weighted sum? In most cases the index is
formed both empirically and by weights assigned on the basis of some criterion of importance.
The same applies to other index numbers.
Partial Least Squares
-
Partial
Least Squares (PLS) regression is a multivariate data analysis technique
which can be used to relate several response (Y) variables to several explanatory
(X) variables.
-
The method
aims to identify the underlying factors, or linear combination of the X
variables, which best model the Y dependent variables.
Growth Curve Modeling
-
Growth
is a fundamental property of biological systems, occurring at the level
of populations, individual animals and plants, and within organisms. Much
research has been devoted to modeling growth processes, and there are many
ways of doing this: mechanistic models, time series, stochastic differential
equations etc.
-
Sometimes
we simply wish to summarize growth observations in terms of a few parameters,
perhaps in order to compare individuals or groups. Many growth phenomena
in nature show an "S" shaped pattern, with initially slow growth speeding
up before slowing down to approach a limit.
-
These patterns
can be modelled using several mathematical functions such as generalized
logistic and Gompertz curves.
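For illustration, here is one common parameterization of each curve. The parameter names (upper limit K, rate r, time shift t0) and the example values are assumptions made for this sketch; other, more general forms exist.

```python
import math

def logistic_growth(t, upper_limit, rate, t0):
    """Logistic curve: symmetric S-shape, equal to half the limit at t = t0."""
    return upper_limit / (1.0 + math.exp(-rate * (t - t0)))

def gompertz_growth(t, upper_limit, rate, t0):
    """Gompertz curve: asymmetric S-shape, equal to limit / e at t = t0."""
    return upper_limit * math.exp(-math.exp(-rate * (t - t0)))

# Hypothetical parameters: limit 100, rate 0.8 per time unit, shift 3.
times = [0, 1, 2, 3, 4, 5, 6, 8, 10]
logistic_values = [logistic_growth(t, 100.0, 0.8, 3.0) for t in times]
gompertz_values = [gompertz_growth(t, 100.0, 0.8, 3.0) for t in times]
print(logistic_values)
print(gompertz_values)
```

Both curves start slowly, speed up, and then level off toward the upper limit, so a handful of parameters summarizes the whole growth record.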
Saturated Model &
Saturated Log Likelihood
-
A saturated
model is usually one that has no residual df. The "saturated" log
likelihood (the "saturated LL") is the LL for a saturated model. It is
often used as a reference when comparisons are made between the log likelihood with an
intercept only and the log likelihood for a particular model specification.
Pattern recognition and
Classification
-
Pattern
recognition and classification are fundamental concepts for understanding
living systems and essential for realizing artificial intelligent systems.
Applications include 3D modelling, motion analysis, feature extraction,
device positioning and calibration, feature recognition, and solutions to classification
problems in industrial and medical applications.
Spatial Statistics
-
Many natural
phenomena involve a random distribution of points in space. Biologists
who observe the locations of cells of a certain type in an organ, astronomers
who plot the positions of the stars, botanists who record the positions
of plants of a certain species and geologists detecting the distribution
of a rare mineral in rock are all observing spatial point patterns in two
or three dimensions. Such phenomena can be modelled by spatial point processes.
-
References:
Diggle
P., The Statistical Analysis of Spatial Point Patterns, Academic
Press, 1983.
Ripley
B., Spatial Statistics, Wiley, 1981.
What Is a Regression
Tree
-
A regression
tree is like a classification tree, only with a continuous target (dependent)
variable. Prediction of target value for a particular case is made by assigning
that case to a node (based on values for the predictor variables) and then
predicting the value of the case as the mean of its node (sometimes adjusted
for priors, costs, etc.).
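The prediction rule can be sketched with the smallest possible tree, a single split on one predictor. This is an illustration only: the split is chosen to minimize the within-node sum of squares, no adjustment for priors or costs is made, and the function names and data are invented.

```python
def best_single_split(xs, ys):
    """Find the threshold on x minimizing within-node SSE; return it with node means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((y - mean_left) ** 2 for y in left)
               + sum((y - mean_right) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    return best[1], best[2], best[3]

def predict(x, threshold, mean_left, mean_right):
    """Assign the case to a node and predict the mean of that node."""
    return mean_left if x <= threshold else mean_right

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, mean_left, mean_right = best_single_split(xs, ys)
print(threshold, predict(2.5, threshold, mean_left, mean_right),
      predict(11.0, threshold, mean_left, mean_right))
```

A full regression tree repeats this split search recursively within each node and over all predictors.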
-
Reference:
Breiman
L., J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees,
Chapman & Hall, 1984.
Cluster Analysis for
Correlated Variables
-
Cluster
analysis is used to classify observations with respect to a set of variables.
The widely used Ward's method is predisposed to find spherical clusters
and may perform badly with very ellipsoidal clusters generated by highly
correlated variables (within clusters).
-
To deal
with high correlations, some model-based methods are implemented in the
S-Plus package. However, a limitation of their approach is the need to
assume the clusters have a multivariate normal distribution, as well as
the need to decide in advance what the likely covariance structure of the
clusters is.
-
Another
option is to combine the principal component analysis with cluster analysis.
-
Further
Readings:
Baxter
M., Exploratory Multivariate Analysis in Archaeology, pp. 167-170,
Edinburgh University Press, Edinburgh, 1994.
-
Manly F.,
Multivariate Statistical Methods: A Primer, Chapman and Hall, London,
1986.
A Summary of Forecasting
Methods
Regression Analysis: Widely used, especially for short
to intermediate term analysis; forecasts the value of items affected by
factors other than time; called simple regression when only one explanatory factor
is considered; can be done on a hand calculator.
Multiple Regression Analysis: Used
when two or more independent factors are involved; widely used for intermediate
term forecasting. Used to assess which factors to include and which to
exclude. Can be used to develop alternate models with different factors.
Nonlinear Regression: Does not
assume a linear relationship between variables; frequently used when time
is the independent variable.
Trend Analysis: Uses linear and
nonlinear regression with time as the explanatory variable; used where there is
a pattern over time.
Decomposition Analysis: Used to
identify several patterns that appear simultaneously in a time series; time
consuming each time it is used; also used to deseasonalize a series.
Moving Average Analysis: Simple
moving averages forecast future values based on an equally weighted average of
past values; easy to update.
Weighted Moving Averages: Very
powerful and economical. They are widely used where repeated forecasts
are required; they use methods like sum-of-the-digits and trend adjustment methods.
Adaptive Filtering: A type of moving
average which includes a method of learning from past errors; can respond
to changes in the relative importance of trend, seasonal, and random factors.
Exponential Smoothing: A moving
average form of time series forecasting; efficient to use with seasonal
patterns; easy to adjust for past errors; easy to prepare follow-on forecasts; ideal
for situations where many forecasts must be prepared; several different
forms are used depending on the presence of trend or cyclical variations.
Hodrick-Prescott Filter: This is
a smoothing mechanism used to obtain a long term trend component in a time
series. It is a way to decompose a given series into stationary and nonstationary
components in such a way that the sum of squared deviations of the series from the
nonstationary component is a minimum, with a penalty on changes to the derivatives
of the nonstationary component.
Modeling and Simulation: A model
describes the situation through a series of equations; allows testing of the impact
of changes in various factors; substantially more time-consuming to construct; generally
requires user programming or purchase of packages such as SIMSCRIPT. Can
be very powerful in developing and testing strategies otherwise non-evident.
Certainty Models: Give only the most
likely outcome; "what if" analysis is often
done, e.g., with computer-based spreadsheets.
Probabilistic Models: Use Monte
Carlo simulation techniques to deal with uncertainty; give a range of possible
outcomes for each set of events.
Forecasting error: All forecasting
models have either an implicit or explicit error structure, where error
is defined as the difference between the model prediction and the "true"
value. Additionally, many data snooping methodologies within the field
of statistics need to be applied to data supplied to a forecasting model.
Also, diagnostic checking, as defined within the field of statistics, is
required for any model which uses data.
Using any method for forecasting,
one must use a performance measure to assess the quality of the method.
Mean Absolute Deviation (MAD) and Variance are the most useful measures.
However, MAD does not lend itself to further use in making inferences, whereas
the standard error does. For error analysis purposes, variance
is preferred, since variances of independent (uncorrelated) errors are additive.
MAD is not additive.
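To make the performance measures concrete, the sketch below produces one-step-ahead forecasts by simple exponential smoothing and then scores them. Several choices here are assumptions for illustration: the smoothing constant alpha = 0.5, the first forecast being set equal to the first observation, the use of the mean squared error as the variance-type measure, and the invented series.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts; forecasts[t] is the forecast made for series[t]."""
    forecasts = [series[0]]  # initialization choice: first forecast = first value
    for actual in series[:-1]:
        forecasts.append(alpha * actual + (1.0 - alpha) * forecasts[-1])
    return forecasts

def mean_absolute_deviation(actuals, forecasts):
    """Average of the absolute forecast errors."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def mean_squared_error(actuals, forecasts):
    """Average of the squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts)) / len(actuals)

series = [10.0, 12.0, 11.0, 13.0]  # hypothetical observations
forecasts = exponential_smoothing(series, alpha=0.5)

mad = mean_absolute_deviation(series, forecasts)
mse = mean_squared_error(series, forecasts)
print(forecasts, mad, mse)
```

Comparing MAD or MSE across several values of alpha (or across methods) on the same series is a simple way to choose among forecasting schemes.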
-
How to Do Forecasting
by a Regression Analysis
-
Regression is the study of relationships
among variables, a principal purpose of which is to predict, or estimate
the value of one variable from known or assumed values of other variables
related to it.
-
Variables of Interest: To make predictions
or estimates we must identify the effective predictors of the variable
of interest: which variables are important indicators and can be measured
at the least cost, which carry only a little information, and which are
redundant.
-
Predicting the Future: Predicting
a change over time or extrapolating from present conditions to future conditions
is not the function of regression analysis. To make estimates of the future,
use time series analysis.
-
Experiment: Begin with a hypothesis
about how several variables might be related to another variable and the
form of the relationship.
-
Types of Analysis
-
Simple Linear Regression: A regression
using only one predictor is called a simple regression.
-
Multiple Regression: Where there
are two or more predictors, multiple regression analysis is employed.
-
Data: Since it is usually unrealistic
to obtain information on an entire population, a sample which is a subset
of the population is usually selected. The sample may be randomly
selected, or a researcher may choose the x-values based on the capability
of the equipment utilized in the experiment or the experiment design. Where
the x-values are preselected, usually only limited inferences can be drawn,
depending upon the particular values chosen. When both x and y are randomly
drawn, inferences can generally be drawn over the range of values in the
sample.
-
Scatter Diagram: A graphical representation
of the pairs of data called a scatter diagram can be drawn to gain an overall
view of the problem. Is there an apparent relationship? Direct? Inverse?
If the points lie within a band described by parallel lines we can say
there is a linear relationship between the pair of x and y values. If the
rate of change is generally not constant, then the relationship is curvilinear.
-
The Model: If we have determined
there is a linear relationship between t and y, we want a linear equation
stating y as a function of t in the form Y = a + bt + e, where a is the
intercept, b is the slope and e is the error term accounting for variables
that affect y but are not included as predictors, and/or otherwise unpredictable
and uncontrollable factors.
-
Least Squares Method: To predict
the mean y-value for a given t-value, we need a line which passes through
the mean value of both t and y and which minimizes the sum of the squared
vertical distances between each of the points and the predictive line. Such an
approach results in a line which we can call a "best fit" to the sample data. The
least squares method achieves this result by minimizing the average
squared deviation between the sample y points and the estimated line.
A procedure is used for finding the values of a and b which reduces to
the solution of simultaneous linear equations. Shortcut formulas have been
developed as an alternative to the solution of simultaneous equations.
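As a minimal sketch of the shortcut formulas for the model Y = a + bt + e, the slope is the sum of cross-deviations divided by the sum of squared deviations of t, and the intercept follows from the means. The function name and the sample data below are illustrative assumptions.

```python
def fit_line(t, y):
    """Least-squares intercept a and slope b for y = a + b*t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    b = s_ty / s_tt
    a = y_bar - b * t_bar  # forces the line through (t_bar, y_bar)
    return a, b


# Hypothetical sample of five (t, y) pairs.
t = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = fit_line(t, y)
print("intercept a =", a, " slope b =", b)
```

Because a is computed from the means, the fitted line always passes through the point of means, as the text requires.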
-
Solution Methods: Techniques of
Matrix Algebra can be manually employed to solve simultaneous linear equations.
When performing manual computations, this technique is especially useful
when there are more than two equations and two unknowns.
-
Several well-known computer packages
are widely available and can be utilized to relieve the user of the computational
problem, all of which can be used to solve both linear and polynomial equations:
the BMD packages (Biomedical Computer Programs) from UCLA; SPSS (Statistical
Package for the Social Sciences) developed by the University of Chicago;
and SAS (Statistical Analysis System). Another package that is also available
is IMSL, the International Mathematical and Statistical Libraries, which
contains a great variety of standard mathematical and statistical calculations.
All of these software packages use matrix algebra to solve simultaneous
equations.
-
Use and Interpretation of the Regression
Equation: The equation developed can be used to predict an average value
over the range of the sample data. The forecast is good for short to medium
ranges.
-
Measuring Error in Estimations:
The scatter or variability about the mean value can be measured by calculating
the variance, the average squared deviation of the values around the mean.
The standard error of estimate is derived from this value by taking the
square root. This value is interpreted as the average amount that actual
values differ from the estimated mean.
-
Confidence Intervals: Interval estimates
can be calculated to obtain a measure of the confidence we have in our
estimates that a relationship exists. These calculations are made using
t-distribution tables. From these calculations we can derive confidence
bands, a pair of non-parallel lines narrowest at the mean values which
express our confidence in varying degrees of the band of values surrounding
the regression equation.
-
Assessment: How confident can we
be that a relationship actually exists? The strength of that relationship
can be assessed by statistical tests of the null hypothesis that no
relationship exists, carried out using t-distribution, R-squared, and F-distribution
tables. These calculations give rise to the standard error of the regression
coefficient, an estimate of the amount that the regression coefficient
b will vary from sample to sample of the same size from the same population.
An Analysis of Variance (ANOVA) table can be generated which summarizes
the different components of variation.
-
When you want to compare models
of different size (different numbers of independent variables and/or different
sample sizes) you must use the Adjusted R-Squared, because the usual R-Squared
tends to grow with the number of independent variables.
-
The Standard Error of Estimate (i.e.
square root of error mean square) is a good indicator of the "quality"
of a prediction model since it "adjusts" the Error Sum of Squares
for the number of predictors in the model through the Error Mean Square (EMS), as follows:
-
EMS = Error Sum of Squares/(N -
Number of Linearly Independent Predictors)
-
If one keeps adding useless predictors
to a model, the EMS will become less and less stable. R-squared is also
influenced by the range of your dependent variable, so if two models have the
same residual mean square but one model has a much narrower range of values
for the dependent variable, that model will have a lower R-squared, even
though both models will do equally well for prediction purposes.
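The quantities discussed above can be computed together in a short Python sketch. It assumes that the parameter count n_params includes the intercept, so the error degrees of freedom are N minus n_params; the function name and the data are illustrative.

```python
import math


def regression_summary(y, y_hat, n_params):
    """Error mean square, standard error of estimate, R-squared, adjusted R-squared.

    n_params counts every fitted coefficient, including the intercept (assumption).
    """
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    df_error = n - n_params
    ems = sse / df_error
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_error
    return {"sse": sse, "ems": ems, "std_error": math.sqrt(ems),
            "r2": r2, "adj_r2": adj_r2}


# Hypothetical data with fitted values from the line 0.09 + 1.97*t.
t = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
y_hat = [0.09 + 1.97 * ti for ti in t]
summary = regression_summary(y, y_hat, n_params=2)
print(summary)
```

Adjusted R-squared is never larger than R-squared, and the gap widens as parameters are added without a matching drop in the error sum of squares.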
-
A considerable portion of the output
of the computer programs previously mentioned is devoted to a description
of the tests of significance of the regression.
-
Moving Average and Exponential
Smoothing
C SMA=SIMPLE MOVING AVERAGE
C DMA=DOUBLE MOVING AVERAGES
C FDMA=FORECAST WITH DOUBLE MOVING AVERAGES
C
C
NP1=N+1
NUM1=NUM
NUM=NUM1+1
SM1=1
SM2=NUM1
DO 8 I=NUM,NP1
SM=0.0
DO 450 M=SM1,SM2
SM=SM+Y(M+1)
450 CONTINUE
SM1=SM1+1
SM2=SM2+1
SMA(I)=SM/NUM1
SMASQ(I)=SMA(I)**2
8 CONTINUE
NUM=NUM1*2+1
DM1=1
DM2=NUM1
DO 45 I=NUM,NP1
DM=0.0
DO 460 M=DM1, DM2
DM=DM+SMA(M+1+NUM1)
460 CONTINUE
DM1=DM1+1
DM2=DM2+1
DMA(I)=DM/NUM1
MA(I)=SMA(I)*2-DMA(I)
C SLOPE FACTOR 2/(NUM1-1); EQUALS 2/3 FOR A 4-PERIOD AVERAGE
MB(I)=(SMA(I)-DMA(I))*2./(NUM1-1)
FDMA(1+I)=MA(I)+MB(I)
FDMASQ(1+I)=FDMA(1+I)**2
45 CONTINUE
FORDMA=MA(J)+MB(J)*T
C
C SES=SMOOTHED STATISTIC FOR SINGLE EXPONENTIAL SMOOTHING
C DES=SMOOTHED STATISTIC FOR DOUBLE EXPONENTIAL SMOOTHING
C TES=SMOOTHED STATISTIC FOR TRIPLE EXPONENTIAL SMOOTHING
C TA,TB,TC ARE THE COEFFICIENTS IN THE FORECASTING EQUATION
C FOR TRIPLE EXPONENTIAL SMOOTHING
C FDES=FORECAST WITH DOUBLE EXPONENTIAL SMOOTHING
C FTES=FORECAST WITH TRIPLE EXPONENTIAL SMOOTHING
C
C
SES(1)=Y(2)
DO 46 I=2,J
SES(I)=ALPHA*(Y(I)-SES(I-1))+SES(I-1)
46 CONTINUE
DO 410 I=3,J
FSES(I)=SES(I-1)
SESSQ(I)=FSES(I)**2
410 CONTINUE
SESFOR=SES(J)
DES(1)=Y(2)
DO 55 I=2,J
DES(I)=ALPHA*SES(I)+(1.-ALPHA)*DES(I-1)
EA(I)=2*SES(I)-DES(I)
EB(I)=(SES(I)-DES(I))*ALPHA/(1.-ALPHA)
55 CONTINUE
DO 420 I=3,J
FDES(I)=EA(I-1)+EB(I-1)
FDESSQ(I)=FDES(I)**2
420 CONTINUE
DESFOR=EA(J)+T*EB(J)
TES(1)=Y(2)
DO 51 I=2,J
TES(I)=ALPHA*DES(I)+(1.-ALPHA)*TES(I-1)
TA(I)=3*SES(I)-3*DES(I)+TES(I)
TB(I)=(ALPHA/(2*(1-ALPHA)**2))*((6-5*ALPHA)*SES(I)-(10-8*ALPHA)*DES(I)+
*(4-3.*ALPHA)*TES(I))
TC(I)=(ALPHA/(1-ALPHA))**2*(SES(I)-2*DES(I)+TES(I))
51 CONTINUE
DO 430 I=3,J
FTES(I)=TA(I-1)+TB(I-1)+TC(I-1)/2
FTESSQ(I)=FTES(I)**2
430 CONTINUE
TESFOR=TA(J)+TB(J)*T+TC(J)/2.0*T**2
C
C ESMA,EDMA,ESES,EDES=DIFFERENCE BETWEEN ESTIMATED AND ACTUAL
C IN SIMPLE,DOUBLE MOVING AVERAGES AND SINGLE,
C DOUBLE EXPONENTIAL SMOOTHING
C ETES=DIFFERENCE BETWEEN ESTIMATED AND ACTUAL VALUE IN
C TRIPLE EXPONENTIAL SMOOTHING
C
C
NUM=NUM1+2
DO 11 I=NUM,J
ESMA(I)=SMA(I)-Y(I)
ESMASQ(I)=ESMA(I)**2
11 CONTINUE
NUM=NUM1*2+2
DO 47 I=NUM,J
EDMA(I)=FDMA(I)-Y(I)
EDMASQ(I)=EDMA(I)**2
47 CONTINUE
DO 48 I=3,J
ESES(I)=FSES(I)-Y(I)
ESESSQ(I)=ESES(I)**2
EDES(I)=FDES(I)-Y(I)
EDESSQ(I)=EDES(I)**2
ETES(I)=FTES(I)-Y(I)
ETESSQ(I)=ETES(I)**2
48 CONTINUE
WRITE(6,20)
20 FORMAT(//,4X,'***MOVING AVERAGE***')
WRITE(6,22)
22 FORMAT(//,4X,'PERIOD', 2X,'ACTUAL',2X,'SIMPLE MOVING AVERAGE',
*27X, 'DOUBLE MOVING AVERAGE')
WRITE(6,23)
23 FORMAT(19X,'FORECAST',2X,'RESIDUAL',2X,'RESIDUAL-SQ',
*15X,'M(2)',4X,'FORECAST',2X, 'RESIDUAL',2X, 'RESIDUAL-SQ')
DO 98 I=2,J
WRITE(6,24) X(I),Y(I),SMA(I),ESMA(I),ESMASQ(I),DMA(I), FDMA(I),
*EDMA(I),EDMASQ(I)
24 FORMAT(7X,I2,3X,F5.0,2X,F8.3,2X,F8.3,1X,F11.3,2X,F8.3,2X,
*F8.3,2X,F8.3,2X,F11.3)
98 CONTINUE
NUM=NUM1+2
DO 13 I=NUM,J
S3=S3+SMA(I)
SS2=SS2+ESMASQ(I)
13 SS3=SS3+SMASQ(I)
NUM=NUM1*2+2
DO 49 I=NUM,J
S4=S4+FDMA(I)
SS4=SS4+EDMASQ(I)
49 SS5=SS5+FDMASQ(I)
WRITE(6,25)S3,SS2,S4,SS4
25 FORMAT('0',/,12X,F15.3,12X,F11.3,22X,F15.3,3X,F15.3)
WRITE(6,59)T,FORDMA
59 FORMAT(/,64X,'FORECAST FOR',1X,I2,1X,
*'PERIOD(S) AHEAD IS',1X,F8.3)
WRITE(6,26)
26 FORMAT(//,4X,'***EXPONENTIAL SMOOTHING***')
WRITE(6,27)
27 FORMAT(//,20X,'SINGLE EXPONENTIAL SMOOTHING')
WRITE(6,28)
28 FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'SES',4X,'FORECAST',2X,
*'RESIDUAL',2X,'RESIDUAL-SQ')
DO 14 I=1,J
WRITE(6,29)X(I),Y(I),SES(I),FSES(I),ESES(I),ESESSQ(I)
29 FORMAT(7X,I2,3X,F5.0,2X,F8.3,2X,F8.3,2X,F11.3,2X,F11.3)
14 CONTINUE
DO 38 I=3,J
S5=S5+FSES(I)
S6=S6+FDES(I)
S8=S8+FTES(I)
SS7=SS7+SESSQ(I)
SS8=SS8+EDESSQ(I)
SS12=SS12+ETESSQ(I)
SS13=SS13+FTESSQ(I)
38 SS9=SS9+FDESSQ(I)
WRITE(6,35)S5,SS6
35 FORMAT('0',/,21X,F15.3,12X,F11.3)
WRITE(6,21)T,SESFOR
WRITE(6,74)
74 FORMAT(//,20X,'DOUBLE EXPONENTIAL SMOOTHING')
WRITE(6,76)
76 FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'DES',8X,'EA',8X,'EB',6X,'FORECAST',3X,
*'RESIDUAL',2X, 'RESIDUAL-SQ')
DO 77 I=1,J
WRITE(6,78)X(I),Y(I),DES(I),EA(I),EB(I),FDES(I),EDES(I),EDESSQ(I)
78 FORMAT(7X,I2,3X,F5.0,1X,F8.3,3X,F8.3,2X,F8.3,2X,F8.3,3X,F8.3,2X,F11.3)
77 CONTINUE
WRITE(6,79) S6,SS8
79 FORMAT('0',/,41X,F11.3,12X,F11.3)
WRITE(6,21)T,DESFOR
21 FORMAT(/,'FORECAST FOR',1X,I2,1X,'PERIOD(S) AHEAD IS',1X,F8.3)
WRITE(6,31)
31 FORMAT(//,20X,'TRIPLE EXPONENTIAL SMOOTHING')
WRITE(6,32)
32 FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'TES',6X,'TA',8X,'TB',6X,'TC',4X,
*'FORECAST',2X,'RESIDUAL',2X,'RESIDUAL-SQ')
DO 97 I=1,J
WRITE(6,33)X(I),Y(I),TES(I),TA(I),TB(I),TC(I),FTES(I),ETES(I),ETESSQ(I)
33 FORMAT(7X,I2,3X,F5.0,2X,F8.3,1X,F7.3,1X,F7.3,1X,F7.3,3X,F8.3,2X,F8.3,2X,F11.3)
97 CONTINUE
WRITE(6,30)T,TESFOR
30 FORMAT(/,'FORECAST FOR',1X,I2,1X,'PERIOD(S) AHEAD IS',1X,F8.3)
End
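The exponential smoothing part of the listing above can be summarized in a short Python sketch. It assumes Brown's single, double and triple smoothing with one smoothing constant, as the listing's SES, DES and TES recursions suggest, and it initializes all three statistics at the first observation. The function and variable names are illustrative, not part of the original program.

```python
def brown_forecasts(y, alpha, horizon=1):
    """Single, double and triple exponential smoothing forecasts (Brown's method).

    Returns forecasts made `horizon` periods past the last observation.
    """
    s1 = s2 = s3 = y[0]  # start every smoothed statistic at the first value
    for obs in y:
        s1 = alpha * obs + (1 - alpha) * s1  # SES
        s2 = alpha * s1 + (1 - alpha) * s2   # DES
        s3 = alpha * s2 + (1 - alpha) * s3   # TES
    ratio = alpha / (1 - alpha)

    single = s1  # flat forecast

    a2 = 2 * s1 - s2          # EA in the listing
    b2 = ratio * (s1 - s2)    # EB in the listing
    double = a2 + b2 * horizon

    a3 = 3 * s1 - 3 * s2 + s3  # TA
    b3 = (alpha / (2 * (1 - alpha) ** 2)) * (
        (6 - 5 * alpha) * s1 - (10 - 8 * alpha) * s2 + (4 - 3 * alpha) * s3
    )                          # TB
    c3 = ratio ** 2 * (s1 - 2 * s2 + s3)  # TC
    triple = a3 + b3 * horizon + 0.5 * c3 * horizon ** 2

    return single, double, triple


# Hypothetical series with a steady upward trend of 2 units per period.
trend_series = [2.0 * t for t in range(200)]
single, double, triple = brown_forecasts(trend_series, alpha=0.5, horizon=1)
print(single, double, triple)
```

On a steadily trending series the single-smoothing forecast lags behind the data, while the double and triple versions track the trend; this is the usual reason for moving to the higher-order methods.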
-
Winters’ Method
C FOR INITIAL TREND LINE, WE USE SIMPLE LINEAR REGRESSION
C YEST(I)=A+BX(I)
C INITIAL MULTIPLICATIVE SEASONAL FACTORS ('MSF') BY USING THE 1ST
C AND 2ND YEAR IN THE DATA
C 1. FOR THE FIRST YEAR
C
C
L=L+1
DO 170 I=2, L
170 SF1(I)=Y(I)/YEST(I)
C
C 2. FOR THE 2ND YEAR
C
LP1=1+L
LT2=2*L-1
DO 175 I=LP1,LT2
175 SF2(I)=Y(I)/YEST(I)
C
C INITIAL ESTIMATES OF THE FUTURE SEASONAL FACTORS('SF')
C
DO 180 I=2,L
M=I+L-1
SF(I)=(SF1(I)+SF2(M))/2
180 SF(M)=SF(I)
WRITE(6, 345)
345 FORMAT(//, 4X, '**WINTERS'' METHOD**')
WRITE(6,350)
350 FORMAT (/,4X,'PERIOD',6X,'ACTUAL',2X,'VALUE FROM TREND LINE',2X,
*'MULT.SEASONAL FACTOR')
DO 185 I=2,L
WRITE(6,355)X(I),Y(I),YEST(I),SF1(I)
355 FORMAT(7X,I2,7X,F5.0,10X,F10.4,17X,F4.2)
185 CONTINUE
DO 190 I=LP1,LT2
WRITE(6,360)X(I),Y(I),YEST(I),SF2(I)
360 FORMAT(7X,I2,7X,F5.0,10X,F10.4,17X,F4.2)
190 CONTINUE
WRITE (6,365)
365 FORMAT(//,4X,'PERIOD',2X,'AVG.OF MULT.SEASONAL FACTORS')
DO 195 I=2,L
WRITE(6,370)X(I),SF(I)
370 FORMAT(7X,I2,15X,F5.2)
195 CONTINUE
C
C UPDATING THE ESTIMATE OF THE INTERCEPT,SLOPE,AND MULT.SEASONAL
C FACTOR BY USING EXPONENTIAL SMOOTHING
C
C AA(I)=ESTIMATED VALUE OF THE TREND LINE AT PERIOD I
C BB(I)=ESTIMATED SLOPE OF THE TREND AT PERIOD I
C SSF(I)=REVISED ESTIMATE OF SEASONAL FACTOR
C FW(I)=FORECAST BY WINTERS' METHOD
C
C
LP3=1+LT2
LT3=3*L-2
DO 200 I=LP3,K
AA(LT2)=YEST(LT2)
BB(LT2)=B
AA(I)=WALPHA*Y(I)/SF(1+I-L)+(1-WALPHA)*(AA(I-1)+BB(I-1))
BB(I)=WBETA*(AA(I)-AA(I-1))+(1.-WBETA)*BB(I-1)
SSF(I)=WDELTA*Y(I)/AA(I)+(1-WDELTA)*SF(1+I-L)
FW(I+1)=(AA(I)+BB(I)*1.)*SF(I+2-L)
200 CONTINUE
DO 205 I=LT3,J
SSF(LT2)=SF(LT2)
AA(I)=WALPHA*Y(I)/SF(1+I-L)+(1-WALPHA)*(AA(I-1)+BB(I-1))
BB(I)=WBETA*(AA(I)-AA(I-1))+(1.-WBETA)*BB(I-1)
SSF(I)=WDELTA*Y(I)/AA(I)+(1-WDELTA)*SF(1+I-L)
FW(I+1)=(AA(I)+BB(I)*1.)*SF(I+2-L)
205 CONTINUE
MOA=J+T-1
MOB=L-1
REM=MOD(MOA,MOB)
WINFOR=(AA(J)+BB(J)*T)*SSF(REM+LT2)
LP5=1+LP3
DO 210 I=LP5,J
EFW(I)=FW(I)-Y(I)
EFWSQ(I)=EFW(I)**2
FWSQ(I)=FW(I)**2
210 CONTINUE
DO 215 I=LP5,J
S7=S7+FW(I)
SS10=SS10+EFWSQ(I)
SS11=SS11+FWSQ(I)
215 CONTINUE
WRITE(6,375)
375 FORMAT(//,4X,' **FORECAST BY WINTERS METHOD** ' )
WRITE(6,380)
380 FORMAT(//,4X,'PERIOD',6X,'ACTUAL',3X,'FORECAST',5X,'RESIDUAL',
*2X,'RESIDUAL-SQ')
DO 220 I=LP3,J
WRITE(6,385) X(I),Y(I),FW(I),EFW(I),EFWSQ(I)
385 FORMAT (7X,I2,7X,F5.0,4X,F8.3,4X,F8.3,4X,F10.4)
220 CONTINUE
WRITE(6,390)S7,SS10
390 FORMAT(/,22X,F11.3,13X,F14.3)
WRITE(6,21)T,WINFOR
RETURN
END
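The update equations in the Winters listing can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the initial trend line is fitted to the first two years only, periods are numbered from zero, and the revised seasonal factors are used for every later period. The function names and smoothing constants are illustrative.

```python
def fit_line(t, y):
    """Least-squares intercept a and slope b for y = a + b*t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    b = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
         / sum((ti - t_bar) ** 2 for ti in t))
    return y_bar - b * t_bar, b


def winters_forecast(y, season_len, alpha, beta, delta, horizon=1):
    """Multiplicative Winters forecast `horizon` periods past the data."""
    L = season_len
    if len(y) < 2 * L:
        raise ValueError("need at least two full seasons of data")
    if not 1 <= horizon <= L:
        raise ValueError("horizon must be between 1 and season_len")

    # Initial trend line from the first two years (simple linear regression).
    t0 = list(range(2 * L))
    a, b = fit_line(t0, y[:2 * L])
    trend = [a + b * t for t in t0]

    # Initial seasonal factors: average of the year-1 and year-2 ratios.
    seasonal = [(y[i] / trend[i] + y[i + L] / trend[i + L]) / 2 for i in range(L)]
    factors = seasonal + seasonal  # factors[t] applies to period t

    level, slope = trend[-1], b
    for t in range(2 * L, len(y)):
        new_level = alpha * y[t] / factors[t - L] + (1 - alpha) * (level + slope)
        slope = beta * (new_level - level) + (1 - beta) * slope
        factors.append(delta * y[t] / new_level + (1 - delta) * factors[t - L])
        level = new_level

    # Use the most recent factor for the season being forecast.
    return (level + slope * horizon) * factors[len(y) - L + horizon - 1]


# Hypothetical quarterly series: pure linear trend, no seasonal effect.
sales = [10.0 + 2.0 * t for t in range(16)]
print(winters_forecast(sales, season_len=4, alpha=0.2, beta=0.1, delta=0.3, horizon=3))
```

With no seasonal effect in the data all factors stay at one, so the forecast simply extends the trend line; seasonal data would scale that extension by the factor for the target season.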
-
Smoothing the Data
-
Given a
collection of data, this interactive program smooths the data using exponential
smoothing methods and also produces forecasts for the number of periods
desired. It also computes the moving averages after receiving the desired
period. Input and output file assignments should be made before run
time; otherwise interactive I/O is the default.
VARIABLE RECOGNITION:
PERIOD ------ COULD BE A WEEK, MONTH, QUARTER OR YEAR
PERIODE ----- NUMBER OF PERIODS TO BE USED WHEN COMPUTING
THE MOVING AVERAGES
X ----------- ORIGINAL DATA
ST1 --------- THE SMOOTHED VALUE USING EXPO. FIRST DEGREE
ST2 --------- THE SMOOTHED VALUE USING EXPO. SECOND DEGREE
ST3 --------- THE SMOOTHED VALUE USING EXPO. THIRD DEGREE
INTEGER PERIOD, DATAITEMS, PERIODE
REAL X,ST1,ST2,ST3,AVR
DIMENSION X(1000), ST1(1000), ST2(1000),
$ ST3(1000), AVR(1000), PERIOD (1000)
WRITE (*,*) 'PLEASE ENTER THE NUMBER OF DATA ITEMS THAT YOU HAVE:'
READ (*,*) DATAITEMS
C INITIALIZING AND LOADING DATA INTO THE ARRAY X.
DO 10 I=1, DATAITEMS
X (I) = 0
READ (5,*)X(I)
10 CONTINUE
ST1(1) = X(1)
ST2(1) = X(1)
ST3(1) = X(1)
-
This part
of the program computes exponentially smoothed data and moving average smoothed
data, forecasts for the required number of periods after computing the
coefficients and finally prints out the results.
WRITE(*,*) 'PLEASE ENTER THE VALUE OF COEFFICIENT ALPHA :'
READ(*,*)ALPHA
WRITE(6,100)
100 FORMAT('1',10X,'PERIOD',
$ 7X,'X', 9X,'EXPO_1', 6X, 'EXPO_2',
$ 6X,'EXPO_3')
DO 20 J=2, DATAITEMS
ST1 (J) = ALPHA * X(J) + (1-ALPHA)* ST1(J-1)
ST2 (J) = ALPHA * ST1(J) + (1-ALPHA)* ST2(J-1)
ST3 (J) = ALPHA * ST2(J) + (1-ALPHA)* ST3(J-1)
20 CONTINUE
DO 30 K=1, DATAITEMS
WRITE (6,200)K,X(K),ST1(K), ST2(K),ST3(K)
200 FORMAT (11X,I4,5X,F10.2,2X,F10.2,2X,F10.2,2X,F10.2)
30 CONTINUE
A2 = 2* ST1(DATAITEMS) - ST2(DATAITEMS)
B2 = (ALPHA/(1-ALPHA)) * (ST1 (DATAITEMS) - ST2 (DATAITEMS))
A3 =3*ST1(DATAITEMS) - 3*ST2(DATAITEMS) +ST3(DATAITEMS)
B3 = (ALPHA/(2*(1-ALPHA)**2)) * ((6-5*ALPHA) * ST1 (DATAITEMS)
$ - (10 - 8*ALPHA) * ST2(DATAITEMS)
$ + (4-3*ALPHA) * ST3(DATAITEMS))
C3 = ((ALPHA/(1-ALPHA))**2) * (ST1(DATAITEMS)-2*ST2(DATAITEMS)
$ + ST3(DATAITEMS))
C FORECASTS
WRITE(*,*) 'HOW MANY PERIODS DO YOU NEED TO FORECAST?'
READ (*,*) NUMFORCASTS
WRITE(6,300)
300 FORMAT(////5X,' ------ FORECASTS ------')
WRITE(6,400)
400 FORMAT (// 10X,' PERIOD ',' EXPO2.FORECASTS ',' EXPO3.FORECASTS'/)
DO 40 L=1, NUMFORCASTS
FORCAST2 = A2 + B2*L
FORCAST3 = A3 + B3*L + (0.5)*(L**2)*C3
WRITE (6,500) DATAITEMS+L, FORCAST2, FORCAST3
500 FORMAT(12X,I4,4X,F16.2,2X,F16.2)
FORCAST2=0
FORCAST3=0
40 CONTINUE
C MOVING AVERAGE
WRITE (*,*) 'PLEASE ENTER THE PERIOD_AVERAGE'
READ (*,*) PERIODE
DO 50 M=1, DATAITEMS - PERIODE + 1
SUM=0.0
DO 60 N=M, M + PERIODE - 1
SUM=SUM + X(N)
60 CONTINUE
C ASSUMED RECONSTRUCTION: THE LISTING OMITS THE STATEMENT THAT STORES
C THE AVERAGE; HERE IT IS STORED AT THE LAST PERIOD OF THE WINDOW.
AVR(M+PERIODE-1)=SUM/PERIODE
50 CONTINUE
WRITE (6,550)
550 FORMAT (//10X, '----- MOVING AVERAGE-----')
WRITE (6,600)
600 FORMAT (/////10X,' PERIOD ',' X(T) ',' MOVING AVERAGE'/)
DO 70 IJ=1, DATAITEMS
IF (IJ .LE. PERIODE .OR. IJ .GT. (DATAITEMS - PERIODE )) THEN
WRITE (6,700) IJ, X(IJ)
700 FORMAT ( 15X,I2,6X,F8.2,6X,'--------')
ELSE
WRITE (6,800)IJ,X(IJ), AVR(IJ)
800 FORMAT (15X,I2,6X,F8.2,6X,F8.2)
ENDIF
70 CONTINUE
STOP
END
-
Transfer Functions Methodology
-
It is possible
to extend regression models to represent dynamic relationships between
variables via appropriate transfer functions used in the construction of
feedforward and feedback control schemes. Visit Autobox
for software on this topic. The Transfer Function Analyzer module in the
SCA forecasting & modeling package is a frequency spectrum analysis package
designed with the engineer in mind. It applies the concept of the Fourier
integral transform to an input data set to provide a frequency domain representation
of the function approximated by that input data. It also presents the results
in conventional engineering terms.
-
Box-Jenkins Methodology
-
Forecasting Basics:
The basic idea behind self-projecting time
series forecasting models is to find a mathematical formula that will approximately
generate the historical patterns in a time series.
-
Time Series: A time series is a set of numbers
that measures the status of some activity over time. It is the historical
record of some activity, with measurements taken at equally spaced intervals
(exception: monthly) with a consistency in the activity and the method
of measurement.
-
Approaches to Time Series Forecasting: There are
two basic approaches to forecasting time series: the self-projecting time
series and the cause-and-effect approach. Cause and effect methods attempt
to forecast based on underlying series that are believed to cause the behavior
of the original series. The self-projecting time series uses only the time
series data of the activity to be forecast to generate forecasts. This
latter approach is typically less expensive to apply and requires far less
data and is useful for short to medium-term forecasting.
-
Box-Jenkins Forecasting Method: The univariate
version of this methodology is a self-projecting time series forecasting
method. The underlying goal is to find an appropriate formula so that the
residuals are as small as possible and exhibit no pattern. The model-building
process involves four steps, repeated as necessary, to end up with a specific
formula that replicates the patterns in the series as closely as possible
and also produces accurate forecasts.
-
Box-Jenkins Methodology
-
Box-Jenkins forecasting models are based on statistical
concepts and principles and are able to model a wide spectrum of time series
behavior. The methodology offers a large class of models to choose from and a systematic
approach for identifying the correct model form. There are both statistical
tests for verifying model validity and statistical measures of forecast
uncertainty. In contrast, traditional forecasting models offer a limited
number of models relative to the complex behavior of many time series with
little in the way of guidelines and statistical tests for verifying the
validity of the selected model.
-
Data: The misuse, misunderstanding, and inaccuracy
of forecasts is often the result of not appreciating the nature of the
data in hand. The consistency of the data must be ensured and it must be
clear what the data represents and how it was gathered or calculated. As
a rule of thumb, Box-Jenkins requires at least 40 or 50 equally-spaced
periods of data. The data must also be edited to deal with extreme or missing
values or other distortions through the use of functions such as log or inverse
to achieve stabilization.
-
Preliminary Model Identification Procedure: A
preliminary Box-Jenkins analysis with a plot of the initial data should
be run as the starting point in determining an appropriate model. The input
data must be adjusted to form a stationary series, one whose values vary
more or less uniformly about a fixed level over time. Apparent trends can
be adjusted by having the model apply a technique of "regular differencing,"
a process of computing the difference between every two successive values,
computing a differenced series which has overall trend behavior removed.
If a single differencing does not achieve stationarity, it may be repeated,
although rarely, if ever, are more than two regular differencings required.
Where irregularities in the differenced series continue to be displayed,
log or inverse functions can be specified to stabilize the series such
that the remaining residual plot displays values approaching zero and without
any pattern. This is the error term, equivalent to pure, white noise.
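Regular differencing is simple enough to show directly. The sketch below is illustrative only; the function name and the sample series are assumptions. A lag of 1 gives regular differencing, and applying the function twice gives second-order differencing.

```python
def difference(series, lag=1):
    """Return the differenced series: series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical series with a curved trend (t squared).
curved = [t * t for t in range(6)]      # 0, 1, 4, 9, 16, 25
once = difference(curved)               # still trending upward
twice = difference(once)                # constant, so the trend is removed
print(once)
print(twice)
```

Each pass shortens the series by `lag` observations, which is one reason a generous number of periods is needed before differencing.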
-
Pure Random Series: On the other hand, if the
initial data series displays neither trend nor seasonality and the residual
plot shows essentially zero values within a 95% confidence level and these
residual values display no pattern, then there is no real-world statistical
problem to solve and we go on to other things.
-
Model Identification Background
-
Basic Model: With a stationary series in place,
a basic model can now be identified. Three basic models exist: AR (autoregressive),
MA (moving average), and a combined ARMA. Together with the previously specified
RD (regular differencing), they provide the available tools. When
regular differencing is applied together with AR and MA, they are referred
to as ARIMA, with the I indicating "integrated" and referencing the differencing
procedure.
-
Seasonality: In addition to trend, which has now
been provided for, stationary series quite commonly display seasonal behavior
where a certain basic pattern tends to be repeated at regular seasonal
intervals. The seasonal pattern may also display constant
change over time. Just as regular differencing was applied to the
overall trending series, seasonal differencing (SD) is applied to seasonal
nonstationarity as well. And as autoregressive and moving average tools
are available with the overall series, so too, are they available for seasonal
phenomena using seasonal autoregressive parameters (SAR) and seasonal moving
average parameters (SMA).
-
Establishing Seasonality: The need for seasonal
autoregression (SAR) and seasonal moving average (SMA) parameters is established
by examining the autocorrelation and partial autocorrelation patterns of
a stationary series at lags that are multiples of the number of periods
per season. These parameters are required if the values at lags s, 2s,
etc. are nonzero and display patterns associated with the theoretical patterns
for such models. Seasonal differencing is indicated if the autocorrelations
at the seasonal lags do not decrease rapidly.
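A minimal sketch of checking autocorrelations at the seasonal lags follows. The sample autocorrelation formula is the standard one; the series, the function name and the use of the approximate band 1.96 divided by the square root of n as a 95% limit are assumptions for illustration.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return num / denom


# Hypothetical series that repeats the same pattern every 4 periods.
seasonal_series = [1.0, 3.0, 2.0, 0.0] * 6
limit = 1.96 / math.sqrt(len(seasonal_series))  # approximate 95% band

for lag in (4, 8, 12):
    r = autocorrelation(seasonal_series, lag)
    flag = "significant" if abs(r) > limit else "within limits"
    print("lag", lag, "r =", round(r, 3), flag)
```

Here the values at lags 4, 8 and 12 die out slowly, which is the pattern the text associates with a need for seasonal differencing.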
Referring
to the above chart, note that the variance of the errors of the underlying
model must be invariant (i.e. constant). This means that the variance for
each subgroup of data is the same and does not depend on the level or the
point in time. If this is violated then one can remedy this by stabilizing
the variance. Make sure that there are no deterministic patterns
in the data. Also one must not have any pulses or one-time unusual values.
Additionally there should be no level or step shifts. Also no seasonal
pulses should be present.
The
reason for all of this is that if they do exist then the sample autocorrelation
and partial autocorrelation will seem to imply ARIMA structure. Also the
presence of these kinds of model components can obfuscate or hide structure.
For example a single outlier or pulse can create an effect where the structure
is masked by the outlier.
Improved
Quantitative Identification Method
Relieved
Analysis Requirements: A substantially improved procedure is now available
for conducting Box-Jenkins ARIMA analysis which relieves the requirement
for a seasoned perspective in evaluating the sometimes ambiguous autocorrelation
and partial autocorrelation residual patterns to determine an appropriate
Box-Jenkins model for use in developing a forecast model.
ARMA
(1, 0): The first model to be tested on the stationary series consists
solely of an autoregressive term with lag 1. The autocorrelation and partial
autocorrelation patterns are examined for significant autocorrelation in the
early terms and to see whether the residual coefficients are uncorrelated,
that is the coefficient values are zero within 95% confidence limits and
without apparent pattern. When fitted values as close as possible to the
original series values are obtained, the sum of the squared residuals will
be minimized, a technique called least squares estimation. The residual
mean and the mean percent error should not be significantly nonzero. Alternative
models are examined comparing the progress of these factors, favoring models
which use as few parameters as possible. Correlation between parameters
should not be significantly large and confidence limits should not bracket
zero. When a satisfactory model has been established a forecast procedure
is applied.
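The first step, fitting ARMA (1, 0) by least squares and checking the residuals, can be sketched as below. The simulated series, the seed, and the function names are assumptions for illustration; a real analysis would use the stationary series obtained in the earlier steps and would examine many lags, not only lag 1.

```python
import math
import random


def lag1_autocorrelation(values):
    """Sample autocorrelation at lag 1."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom


def fit_ar1(series):
    """Least-squares fit of z[t] = phi * z[t-1] + e[t] on the mean-centered series."""
    mean = sum(series) / len(series)
    z = [x - mean for x in series]
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    phi = num / den
    residuals = [z[t] - phi * z[t - 1] for t in range(1, len(z))]
    return mean, phi, residuals


# Simulate a hypothetical stationary AR(1) series with coefficient 0.6.
random.seed(0)
series = [0.0]
for _ in range(1999):
    series.append(0.6 * series[-1] + random.gauss(0.0, 1.0))

mean, phi_hat, residuals = fit_ar1(series)
limit = 1.96 / math.sqrt(len(residuals))  # approximate 95% band
print("estimated phi:", round(phi_hat, 3))
print("residual lag-1 autocorrelation:", round(lag1_autocorrelation(residuals), 3),
      "limit:", round(limit, 3))
```

If the residual autocorrelations fell outside the band, the procedure described next would move on to a larger model.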
ARMA
(2, 1): Absent a satisfactory ARMA (1, 0) condition with residual coefficients
approximating zero, the improved model identification procedure now proceeds
to examine the residual pattern when autoregressive terms with order 1
and 2 are applied together with a moving average term with an order of
1.
Subsequent
Procedure: To the extent that the residual conditions described above
remain unsatisfied, the Box-Jenkins analysis is continued with ARMA (n,
n-1) until a satisfactory model is arrived at. In the course of this iteration,
when an autoregressive coefficient (phi) approaches zero, the model is
reexamined with parameters ARMA (n-1, n-1). In like manner whenever a moving
average coefficient (theta) approaches zero, the model is similarly reduced
to ARMA (n, n-2). At some point, either the autoregressive term or moving
average term may fall away completely and the examination of the stationary
series is continued with only the remaining term until the residual coefficients
approach zero within the specified confidence levels.
Seasonal
Analysis: In parallel with this model
development cycle and in an entirely similar manner, seasonal autoregressive
and moving average parameters are added or dropped in response to the presence
of a seasonal or cyclical pattern in the residual terms or a parameter
coefficient approaching zero.
Model Adequacy:
In reviewing the Box-Jenkins output, care should be taken to ensure that
the parameters are uncorrelated and significant, and alternate models should
be weighed for these conditions as well as for overall correlation (R-squared),
standard error, and zero residual.
Forecasting with the Model:
The model is used for short and intermediate term forecasting, updated
as new data becomes available to minimize the number of periods ahead required
of the forecast.
Monitor the Accuracy of the Forecasts
in Real Time: As time progresses, the accuracy
of the forecasts should be closely monitored for increases in the error
terms, standard error and a decrease in correlation. When the series appears
to be thus changing over time, recalculation of the model parameters should
be undertaken.
SPSS Programs Listing
for Forecasting
simple linear regression (fit)
PLOT HSIZE=50/VSIZE=42/
FORMAT=REGRESSION/
PLOT= T WITH X
REGRESSION DESCRIPTIVES=DEFAULTS/ gives mean, st.dev.
VARS=T,X/ and corr.
DEP=X/
METHOD=ENTER/
RESIDUAL=HISTOGRAM/
RESIDUAL=NORMPROB/
SCATTERPLOT=(T,X), (*RESID,X),
(*RESID,T),(*RESID,*PRED)
/CASEWISE=ALL
or /CASEWISE = DEPENDENT PRED RESID ZRESID
or /CASEWISE = ALL DEPENDENT PRED RESID ZRESID
PEARSON CORR PRED X/
polynomial regression
COMPUTE TSQRT=T**2
COMPUTE TCUB=T**3
REGRESSION VARIABLES=X,T,TSQRT,TCUB/
DEP=X/
ENTER/
DEP=X/
FORWARD/ provides a sequential analysis
Box-Jenkins Method ARIMA
TITLE `B-J METHOD'
FILE HANDLE SERIESG/NAME=`SPS.DAT'
DATA LIST FILE=SERIESG LIST/X *
VAR LABELS
X `AIRLINE DATA'
LIST CASE CASE=144/VARIABLES=ALL/
1st Step
BOX-JENKINS VARIABLE=X/PLOT=SERIES/IDENTIFY
data, graph, original, logs, differencing?
2nd Step
BOX-JENKINS VARIABLE=X/LOG/DIFFERENCE=0 THRU 2/
PERIOD=12/SDIFFERENCE=0 THRU 2/
LAG=49/PLOT=DSE, PAC/IDENTIFY
tentative model(s)
3rd Step
BOX-JENKINS VARIABLE=X/LOG/DIFFERENCE=0 THRU 2/
PERIOD=12/SDIFFERENCE=1/LAG=49/
Q=1/SQ=1/NCONSTANT/BFR=13/
PLOT=RAC, RES/ESTIMATION
estimation and diagnostic check: residual s.s.? parameter(s),
significance? BP Chi-sq.? Residual, Autocorrelations?
graph of residuals? OK?
4th Step
BOX-JENKINS VARIABLE=X/LOG/DIFFERENCE=1/
PERIOD=12/SDIFFERENCE=1/Q=1/
SQ=1/FQ=(0.39631)/FSQ=(0.61306)/
ORIGIN=24/PLOT=FCF,FLF,CIN/
FORECAST
To get the forecast(24 backward, 12 forward),
plot of forecast function, fixed lead forecast, confidence
interval(95%).
Extended Version of SPSS
You may like to use
the Extended Version of SPSS. If so replace the first line in your program
file with the following two JCL lines
$START_SPSSX
$SPSSX/NOBANNER/OUTPUT=..
After
submitting your job, you receive notification that the job is completed,
together with some messages. Ignore these messages and proceed as with
the usual SPSS version.
SAS Programs
Listing for Exponential Smoothing and Winters Methods
DATA ONE;
INFILE ACME;
INPUT TIME VALUE;
PROC PRINT;
PROC PLOT DATA=ONE;
PLOT VALUE*TIME;
PROC FORECAST DATA=ONE OUT=TWO OUTEST=THREE
METHOD=EXPO TREND=1;
VAR VALUE;
ID TIME;
PROC PRINT DATA=THREE;
TITLE 'THE ESTIMATE FROM SINGLE EXPO';
PROC PRINT DATA=TWO;
TITLE ' THE OUTPUT FROM SINGLE EXPO';
PROC FORECAST DATA=ONE OUT=FOUR OUTEST=FIVE
METHOD=EXPO TREND=2;
VAR VALUE;
ID TIME;
PROC PRINT DATA= FIVE;
TITLE ' THE ESTIMATE FROM DOUBLE EXPO ';
PROC PRINT DATA=FOUR;
TITLE ' THE OUTPUT FROM DOUBLE EXPO';
PROC FORECAST DATA=ONE OUT=SIX OUTEST=SEVEN
METHOD=EXPO TREND=3;
VAR VALUE;
ID TIME;
PROC PRINT DATA=SEVEN;
TITLE 'THE ESTIMATE FROM TRIPLE EXPO';
PROC PRINT DATA=SIX;
TITLE ' THE OUTPUT FROM TRIPLE EXPO';
PROC FORECAST DATA=ONE OUT=A OUTEST=B
METHOD=WINTERS SEASONS=4 TREND=2 OUTDATA OUT1STEP
OUTLIMIT INTERVAL=1 LEAD=5;
VAR VALUE;
ID TIME;
PROC PRINT DATA=B;
TITLE 'THE ESTIMATE FROM WINTERS METHOD';
PROC PRINT DATA=A;
TITLE ' THE OUTPUT FROM WINTERS METHOD';
PROC PLOT DATA=A;
PLOT (VALUE)*TIME=_TYPE_;
TITLE 'PLOT OF FORECAST: WINTERS METHOD';
Modeling Financial Time
Series
We
are attempting to 'model' what the reality is; so that we can predict it.
Statistical Modeling, in addition to being of central importance in statistical
decision making, is critical in any endeavor, since essentially everything
is a model of reality. As such, modeling has applications in such disparate
fields as marketing, finance, and organizational behavior. Particularly
compelling is econometric modeling since, unlike most disciplines (such
as Normative Economics), econometrics deals only with provable facts, not
with beliefs and opinions.
Time
series analysis is an integral part of financial analysis. The topic is
interesting and useful, with applications to the prediction of interest
rates, foreign currency risk, stock market volatility, and the like. There
are many varieties of econometric and multi-variate techniques. Specific
examples are regression and multi-variate regression; vector auto-regressions;
and cointegration regarding tests of present value models. The next section
presents the underlying theory on which statistical models are predicated.
Financial
Modeling: Econometric modeling is vital in finance and in financial
time series analysis. Modeling is, simply put, the creation of representations
of reality. It is important to be mindful that, despite the importance
of the model, it is in fact only a representation of reality and not the
reality itself. Accordingly, the model must adapt to reality; it is futile
to attempt to adapt reality to the model. As representations, models cannot
be exact. Models imply that action is only taken after careful thought
and reflection. This can have major consequences in the financial realm.
A key element of financial planning and financial forecasting is the ability
to construct models showing the interrelatedness of financial data. Models
showing correlation or causation between variables can be used to improve
financial decision-making. For example, one would be more concerned about
the consequences on the domestic stock market of a downturn in another
economy if it can be shown that there is a mathematically provable causative
impact of that nation's economy on the domestic stock market. However,
modeling is fraught with dangers. A model which heretofore was valid may
lose validity due to changing conditions, thus becoming an inaccurate representation
of reality and adversely affecting the ability of the decision-maker to
make good decisions.
The
examples of univariate and multivariate regression, vector autoregression,
and present value cointegration illustrate the application of modeling,
a vital dimension in managerial decision making, to econometrics, and specifically
the study of financial time series. The provable nature of econometric
models is impressive; rather than proffering solutions to financial problems
based on intuition or convention, one can mathematically demonstrate that
a model is or is not valid, or requires modification. It can also be seen
that modeling is an iterative process, as the models must continuously
change to reflect changing realities. The ability to do so has striking
ramifications in the financial realm, where the ability of models to accurately
predict financial time series is directly related to the ability of the
individual or firm to profit from changes in financial scenarios.
Univariate
and Multivariate Models: The use of regression analysis is widespread
in examining financial time series. Some examples are the use of forward
exchange rates as optimal predictors of future spot rates; conditional
variance and the risk premium in foreign exchange markets; and stock returns
and volatility. A model that has been useful for this type of application
is called the GARCH-M (GARCH-in-mean) model, which brings the conditional
variance into the mean equation of the GARCH (generalized autoregressive
conditional heteroskedasticity) model. This sounds complex and esoteric, but it only means that the serially
correlated errors and the conditional variance enter the mean computation,
and that the conditional variance itself depends on a vector of explanatory
variables. The GARCH-M model has been further modified, a testament to
finance practitioners' recognition of the need to adapt the model to a changing
reality. For example, this model can now accommodate exponential (non-linear)
functions, and is no longer constrained by non-negativity restrictions on its parameters.
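To make the idea concrete, the following Python sketch simulates a simple GARCH(1,1)-M process. It is only an illustration under stated assumptions: the function name, the parameter values, and the choice of a plain GARCH(1,1) variance equation without additional explanatory variables are all hypothetical and do not come from any particular study.

```python
import math
import random


def simulate_garch_m(n, mu=0.0, lam=0.5, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n returns from a GARCH(1,1)-in-mean process (illustrative only).

    r_t = mu + lam * h_t + e_t,   e_t = sqrt(h_t) * z_t,   z_t ~ N(0, 1)
    h_t = omega + alpha * e_{t-1}**2 + beta * h_{t-1}
    """
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be < 1 for a finite unconditional variance")
    rng = random.Random(seed)
    h = omega / (1.0 - alpha - beta)   # start from the unconditional variance
    e = 0.0
    returns, variances = [], []
    for _ in range(n):
        h = omega + alpha * e * e + beta * h     # conditional variance
        e = math.sqrt(h) * rng.gauss(0.0, 1.0)   # shock scaled by volatility
        returns.append(mu + lam * h + e)         # variance enters the mean
        variances.append(h)
    return returns, variances
```

Calling simulate_garch_m(500) returns two lists of length 500: the simulated returns and the conditional variances that generated them. The term lam * h is what distinguishes the "in-mean" model from an ordinary GARCH model.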
One
application of this model is the analysis of stock returns and volatility.
Traditionally, the belief has been that the variance of portfolio returns
is the primary risk measure for investors. However, using extensive time
series data, it has been proven that the relationship between mean returns
and return variance or standard deviation is weak; hence the traditional
two-parameter asset pricing models appear to be inappropriate, and mathematical
proof replaces convention. Since decisions premised on the original models
are necessarily sub-optimal because the original premise is flawed, it
is advantageous for the finance practitioner to abandon the model in favor
of one with a more accurate representation of reality.
Correct
specification of a model is of paramount importance, and a battery of misspecification
testing criteria has been established. These include tests of normality,
linearity, and homoskedasticity, and can be applied to a variety of models.
A simple example which yields surprising results is the Capital Asset Pricing
Model, one of the cornerstones of elementary economics. Application of
the testing criteria to data concerning companies' risk premiums shows
significant evidence of non-linearity, non-normality and parameter non-constancy.
The CAPM was found to be applicable for only three of seventeen companies
that were analyzed. This does not mean, however, that the CAPM should be
summarily rejected; it still has value as a pedagogic tool, and can be
used as a theoretical framework. For the econometrician or financial professional,
for whom the misspecification of the model can translate into suboptimal
financial decisions, the CAPM should be supplanted by a better model, specifically
one that reflects the time-varying nature of betas. The GARCH-M framework
is one such model.
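As an illustration of one such check, the sketch below computes the Jarque-Bera statistic for a set of regression residuals. The text does not say which normality test was applied to the CAPM data, so the choice of Jarque-Bera and the function name are assumptions made only for the example.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic: n/6 * (S**2 + (K - 3)**2 / 4).

    S is the sample skewness and K the sample kurtosis of the residuals.
    Under normality the statistic is asymptotically chi-square with 2
    degrees of freedom (5% critical value about 5.99).
    """
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    m2 = sum(d ** 2 for d in dev) / n   # second central moment
    m3 = sum(d ** 3 for d in dev) / n   # third central moment
    m4 = sum(d ** 4 for d in dev) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
```

A large value, well above 5.99, is evidence against normally distributed errors; residuals with a few extreme observations typically produce very large values.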
Multivariate
linear regression models apply the same theoretical framework. The principal
difference is the replacement of the dependent variable by a vector. The
estimation theory is essentially a multivariate extension of that developed
for the univariate case, and as such can be used to test models such as the
stock and volatility model and the CAPM. In the case of the CAPM, the vector
introduced is excess asset returns at a designated time. One application
is the computation of the CAPM with time-varying covariances. Although
in this example the null hypothesis that all intercepts are zero cannot
be rejected, the misspecification problems of the univariate model still
remain. Slope and intercept estimates also remain the same, since the same
regression appears in each equation.
Vector
Autoregression: General regression models assume that the dependent
variable is a function of past values of itself and past and present values
of the independent variable. The independent variable, then, is said to
be weakly exogenous, since its stochastic structure contains no relevant
information for estimating the parameters of interest. While the weak exogeneity
of the independent variable allows efficient estimation of the parameters
of interest without any reference to its own stochastic structure, problems
in predicting the dependent variable may arise if "feedback" from the dependent
to the independent variable develops over time. (When no such feedback
exists, it is said that the dependent variable does not Granger-cause the
independent variable.) Weak exogeneity coupled with Granger non-causality
yields strong exogeneity which, unlike weak exogeneity, is directly testable.
Performing the tests requires the dynamic structural equation
model (DSEM) and the vector autoregressive process (VAR). The multivariate
regression model is thus extended in two directions: by allowing simultaneity
between the endogenous variables in the dependent variable, and by explicitly
considering the process generating the exogenous independent variables.
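The idea of Granger non-causality can be sketched with a simple F test that compares a regression of one series on its own lags with a regression that also includes lags of the other series. The code below is a minimal Python illustration, not the DSEM/VAR procedure itself; the function names, the single-lag default, and the simulated data are all assumptions made for the example.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares from least squares, via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is nonsingular)
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for i in range(k):
            if i != col:
                factor = a[i][col] / a[col][col]
                a[i] = [u - factor * v for u, v in zip(a[i], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(cause, effect, lags=1):
    """F statistic for H0: lags of `cause` do not help predict `effect`."""
    n = len(effect)
    targets = effect[lags:]
    # Restricted model: constant plus own lags of `effect`
    restricted = [[1.0] + [effect[t - j] for j in range(1, lags + 1)]
                  for t in range(lags, n)]
    # Unrestricted model: add lags of `cause`
    unrestricted = [row + [cause[t - j] for j in range(1, lags + 1)]
                    for row, t in zip(restricted, range(lags, n))]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Simulated example: x leads y by one period and there is no feedback from y to x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0.0, 1.0) for t in range(1, 200)]
f_x_to_y = granger_f(x, y)   # should be large: lags of x help predict y
f_y_to_x = granger_f(y, x)   # should be small: no feedback was built in
```

In the terminology used above, a small value of f_y_to_x is consistent with the dependent variable y not Granger-causing the independent variable x.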
Results
of this testing are useful in determination of whether an independent variable
is strictly exogenous or is predetermined. Strict exogeneity can be tested
in DSEMs by expressing each endogenous variable as an infinite distributed
lag of the exogenous variables. If the independent variable is strictly
exogenous, attention can be limited to distributions conditional on the
independent variable without loss of information, resulting in simplification
of statistical inference. If the independent variable is strictly exogenous,
it is also predetermined, meaning that all of its past and current values
are independent of the current error term. While strict exogeneity is closely
related to the concept of Granger non-causality, the two concepts are not
equivalent and are not interchangeable.
It
can be seen that this type of analysis is helpful in verifying the appropriateness
of a model as well as proving that, in some cases, the process of statistical
inference can be simplified without losing accuracy, thereby both strengthening
the credibility of the model and increasing the efficiency of the modeling
process. Vector autoregressions can be used to calculate other variations
on causality, including instantaneous causality, linear dependence, and
measures of feedback from the dependent to the independent and from the
independent to the dependent variables. It is possible to proceed further
with developing causality tests, but simulation studies which have been
performed reach a consensus that the greatest combination of reliability
and ease can be obtained by applying the procedures described.
Cointegration
and Present Value Modeling: Present value models are used extensively
in finance to formulate models of efficient markets. In general terms,
a present value model for two variables y1 and x1 states that y1 is a
linear function of the present discounted value of the expected future
values of x1, where the constant term, the constant discount factor, and
the coefficient of proportionality are parameters that are either known
or need to be estimated. Not all financial time series are non-integrated;
the presence of integrated variables affects standard regression results
and procedures of inference. Variables may also be cointegrated, requiring
the superimposition of cointegrating vectors on the model, and resulting
in circumstances under which the concept of equilibrium loses all practical
implications and spurious regressions may occur. In present value analysis,
cointegration can be used to define the "theoretical spread" and to identify
co-movements of variables. This is useful in constructing volatility-based
tests.
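In symbols, one common way of writing such a present value model is sketched below. The notation is an assumption, since no formula is given here: the two variables are written as time series y_t and x_t, c is the constant term, \delta the constant discount factor, \theta the coefficient of proportionality, E_t the expectation given information available at time t, and S_t the spread between the two variables.

```latex
y_t = c + \theta (1 - \delta) \sum_{i=0}^{\infty} \delta^{i}\, \mathrm{E}_t\!\left[ x_{t+i} \right],
\qquad
S_t = y_t - \theta x_t .
```

Under this formulation, if x_t is integrated of order one and the model holds, the spread S_t is stationary, so that y_t and x_t are cointegrated with cointegrating vector (1, -\theta).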
One
such test concerns stock market volatility. Assuming cointegration, second-order
vector autoregressions are constructed, which suggest that dividend
changes are not only highly predictable but are Granger-caused by the spread.
When the assumed value of the discount rate is increased, certain restrictions
can be rejected at low significance levels. This yields results showing
an even more pronounced "excess volatility" than that anticipated by the
present value model. It also illustrates that the model is more appropriate
in situations where the discount rate is higher. The implications of applying
a cointegration approach to stock market volatility testing for financial
managers are significant. Of related significance is the ability to test
the expectations hypotheses of interest rate term structure.
Measuring for Accuracy
Given
a set of data and its forecasted values obtained by any method, this
interactive Fortran program computes statistics that give you
an idea of how well the forecasting method used fits the original
data set.
C     Forecast-accuracy statistics for periods TESTART..MAXPERIODS.
C     TESTART must be at least 2 so that a previous period exists.
      INTEGER TESTART, MAXPERIODS, NTEST, I, J
      REAL X, F, ERR, LASTX, LASTF, LASTERR
      REAL SSE, TMAPE, SUMERR, SUMABSERR, SUMX
      REAL UNUMERATOR, UDENOMINATOR, WDNUMERATOR
      REAL VME, VMAE, SDE, VMSE, VMAPE, THEILSTAT, VLAUGHLINS, DW
C     Accumulators must be set to zero explicitly.
      SSE = 0.0
      TMAPE = 0.0
      SUMERR = 0.0
      SUMABSERR = 0.0
      SUMX = 0.0
      UNUMERATOR = 0.0
      UDENOMINATOR = 0.0
      WDNUMERATOR = 0.0
      WRITE (*, *) ' PLEASE ENTER (IN ORDER) HOW MANY PERIODS'
      WRITE (*, *) ' YOU HAVE AND FROM WHAT PERIOD'
      WRITE (*, *) ' YOU WANT TO TEST YOUR FORECASTS ?'
      READ (*, *) MAXPERIODS, TESTART
      WRITE (6,100)
  100 FORMAT (/,'  PERIOD','      DATA','    FORECASTS')
C     Echo the periods that precede the test window.
      DO 10 I = 1, (TESTART - 1)
      READ (5,150) X, F
  150 FORMAT (2F8.2)
      WRITE (6,250) I, X, F
  250 FORMAT (7X,I2,3X,F8.2,5X,F8.2)
   10 CONTINUE
      LASTX = X
      LASTF = F
      LASTERR = LASTX - LASTF
C     Accumulate the error sums over the test window.
      DO 20 J = TESTART, MAXPERIODS
      READ (5,300) X, F
  300 FORMAT (2F8.2)
      WRITE (6,350) J, X, F
  350 FORMAT (7X,I2,3X,F8.2,5X,F8.2)
      ERR = X - F
      SSE = SSE + ERR**2
      TMAPE = TMAPE + ABS(ERR/X)
      SUMERR = SUMERR + ERR
      SUMABSERR = SUMABSERR + ABS(ERR)
      SUMX = SUMX + X
C     Theil's U compares forecast errors with no-change errors.
      UNUMERATOR = UNUMERATOR + ((F - X)/LASTX)**2
      UDENOMINATOR = UDENOMINATOR + ((X - LASTX)/LASTX)**2
      WDNUMERATOR = WDNUMERATOR + (ERR - LASTERR)**2
      LASTERR = ERR
      LASTX = X
   20 CONTINUE
C     The loop above runs MAXPERIODS - TESTART + 1 times.
      NTEST = MAXPERIODS - TESTART + 1
      VME = SUMERR/NTEST
      VMAE = SUMABSERR/NTEST
      SDE = SQRT(SSE/(NTEST - 1))
      VMSE = SSE/NTEST
      VMAPE = (TMAPE*100)/NTEST
      THEILSTAT = SQRT(UNUMERATOR/UDENOMINATOR)
      VLAUGHLINS = (4 - THEILSTAT)*100
      DW = WDNUMERATOR/SSE
      WRITE (6,600)
  600 FORMAT (//5X,' **** STATISTICS *** ')
      WRITE (6,200) VME,VMAE,SDE,VMSE,VMAPE,THEILSTAT,VLAUGHLINS,DW
  200 FORMAT (' ME= ',F8.2,/,' MAE= ',F8.2,/,' SDE= ',F8.2,/,
     $ ' MSE= ',F8.2,/,' MAPE= ',F8.2,/,' THEILSTAT= ',F8.2,/,
     $ ' LAUGHLINS= ',F8.2,/,' DURBIN_WATSON= ',F8.2)
      STOP
      END
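For readers without a Fortran compiler, the same statistics can be sketched in Python. This is an illustrative cross-check rather than a translation of record: the function name, the dictionary keys, and the small data set are hypothetical. It uses the usual definitions, with differences (F - X) and (X - LASTX) in Theil's U, and it divides by the number of periods actually tested.

```python
import math


def forecast_accuracy(actual, forecast, start):
    """Accuracy statistics over periods start..len(actual).

    Periods are numbered from 1 and start must be at least 2, because
    Theil's U and the Durbin-Watson ratio need the preceding period.
    """
    last_x = actual[start - 2]
    last_err = actual[start - 2] - forecast[start - 2]
    n = 0
    sum_err = sum_abs = sse = sum_ape = 0.0
    u_num = u_den = dw_num = 0.0
    for x, f in zip(actual[start - 1:], forecast[start - 1:]):
        err = x - f
        n += 1
        sum_err += err
        sum_abs += abs(err)
        sse += err ** 2
        sum_ape += abs(err / x)
        u_num += ((f - x) / last_x) ** 2       # relative forecast error
        u_den += ((x - last_x) / last_x) ** 2  # relative no-change error
        dw_num += (err - last_err) ** 2
        last_x, last_err = x, err
    theil_u = math.sqrt(u_num / u_den)
    return {
        "ME": sum_err / n,
        "MAE": sum_abs / n,
        "SDE": math.sqrt(sse / (n - 1)),
        "MSE": sse / n,
        "MAPE": 100.0 * sum_ape / n,
        "THEILSTAT": theil_u,
        "LAUGHLINS": (4.0 - theil_u) * 100.0,
        "DW": dw_num / sse,
    }


# Hypothetical data: four periods, testing from period 2 onward.
stats = forecast_accuracy([10, 12, 11, 13], [9, 11, 12, 12], start=2)
```

A Theil statistic below 1 indicates that the forecasts beat a naive no-change forecast over the test periods; a value above 1 indicates that they do worse.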