Topics in Statistical Data Analysis

B. R. Asrabadi
Analysis:

 |Statistical Data Analysis |Multivariate Data Analysis|Spatial Data Analysis|Meta-Analysis|ANOVA: Analysis of Variance||Variogram Analysis|Survival Analysis|Split-half Analysis|Cluster Analysis for Correlated Variables|Analysis of Incomplete Data|

Regression:

|Least Squares Models||Least Median of Squares Models| |Regression Tree||Partial Least Squares|  |Two Parallel Regression Lines||Constrained Regression Model||Generalized Linear and Logistic Models| |Fitting Data to a Broken Line|

Models:

|Growth Curve Modeling||Saturated Model & Saturated Log Likelihood||Structural Equation Modeling||Antedependent Modeling for Repeated Measurements|| Semiparametric and Non-parametric Modeling |


Miscellaneous:

|Biostatistics||Evidential Statistics||The Central Limit Theorem||Sampling Distribution||Tri-linear Coordinates Triangle||Internal and Inter-rater Reliability||Power of a Test||P-value||P-value for Standard Normal and t-statistics||The Effect Size||Nonparametric Technique||Interactions||Distance Sampling||Bayes and Empirical Bayes Methods||Likelihood Methods Prediction Interval||Moderation and Mediation||Discriminant and Classification||Spearman's Correlation, and Kendall's tau||Repeated Measures and Longitudinal Data||Data Mining and Knowledge Discovery||Incidence and Prevalence Rates||Software Selection||Box-Cox Power Transformation||Multiple Comparison Tests||Sequential Acceptance Sampling||Local Influence||Credit Scoring||Components of the Interest Rates||Pattern recognition and Classification||Summary of Forecasting Methods|


Statistical Data Analysis

Statistics is a set of methods used to collect, analyze, present, and interpret data. Statistical methods are used in a wide variety of occupations and help people identify, study, and solve many complex problems. In the business and economic world, these methods enable decision makers and managers to make better-informed decisions about uncertain situations. Vast amounts of statistical information are available in today's global economic environment because of continual improvements in computer technology. To compete successfully globally, managers and decision makers must be able to understand the information and use it effectively. Statistical data analysis provides hands-on experience that promotes the use of statistical thinking and techniques in making educated decisions in the business world. Computers play a very important role in statistical data analysis. The statistical software package SPSS, which is used in this course, offers extensive data-handling capabilities and numerous statistical analysis routines that can analyze small to very large data sets. The computer will assist in the summarization of data, but statistical data analysis focuses on the interpretation of the output to make inferences and predictions. Studying a problem through the use of statistical data analysis usually involves four basic steps.

1. Defining the problem

2. Collecting the data

3. Analyzing the data

4. Reporting the results

Defining the Problem
An exact definition of the problem is imperative in order to obtain accurate data about it. It is
extremely difficult to gather data without a clear definition of the problem.
Collecting the Data
We live and work at a time when data collection and statistical computations have become easy almost to the point of triviality. Paradoxically, the design of data collection, never sufficiently emphasized in statistical data analysis textbooks, has been weakened by an apparent belief that extensive computation can make up for any deficiencies in the design of data collection. One must start with an emphasis on the importance of defining the population about which we are seeking to make inferences; all the requirements of sampling and experimental design must then be met. Designing ways to collect data is an important job in statistical data analysis. Three important concepts in a statistical study are:

Population- a set of all the elements of interest in a study
Sample - a subset of the population
Statistical inference - extending the knowledge obtained from a random sample to the whole population. This is known in mathematics as inductive reasoning, that is, knowledge of the whole from a particular. Its main application is in testing hypotheses about a given population.
The purpose of statistical inference is to obtain information about a population from information contained in a sample. It is usually just not feasible to test the entire population, so, given time and cost constraints, a sample is the only realistic way to obtain data. Data can be either quantitative or qualitative. Qualitative data are labels or names used to identify an attribute of each element. Quantitative data are always numeric and indicate either how much or how many.
For the purpose of statistical data analysis, distinguishing between cross-sectional and time series data is important. Cross-sectional data are data collected at the same or approximately the same point in time. Time series data are data collected over several time periods.
Data can be collected from existing sources or obtained through observation and experimental studies designed to obtain new data. In an experimental study, the variable of interest is identified. Then one or more factors in the study are controlled so that data can be obtained about how the factors influence the variables. In observational studies, no attempt is made to control or influence the variables of interest. A survey is perhaps the most common type of observational study.
Analyzing the Data
Statistical data analysis divides the methods for analyzing data into two categories: exploratory methods and confirmatory methods. Exploratory methods are used to discover what the data seems to be saying by using simple arithmetic and easy-to-draw pictures to summarize data. Confirmatory methods use ideas from probability theory in the attempt to answer specific questions. Probability is important in decision making because it provides a mechanism for measuring, expressing, and analyzing the uncertainties associated with future events. The majority of the topics addressed in this course fall under this heading.
Reporting the Results
Through inferences, an estimate of, or a test of claims about, the characteristics of a population can be obtained from a sample. The results may be reported in the form of a table, a graph or a set of percentages. Because only a small collection (sample) has been examined and not an entire population, the reported results must reflect the uncertainty through the use of probability statements and intervals of values.
To conclude, a critical aspect of managing any organization is planning for the future. Good judgment, intuition, and an awareness of the state of the economy may give a manager a rough idea or "feeling" of what is likely to happen in the future. However, converting that feeling into a number that can be used effectively is difficult. Statistical data analysis helps managers forecast and predict future aspects of a business operation. The most successful managers and decision makers are the ones who can understand the information and use it effectively.

Biostatistics

Biostatistics is a sub-discipline of Statistics which focuses on statistical support for the areas of medicine, environmental science, public health, and related fields. Practitioners span the range from the very applied to the very theoretical. The information which is useful to the biostatistician spans the range from that needed by a general statistician, to more subject-specific scientific details, to ordinary information that will improve communication between the biostatistician and other scientists and researchers. 

Evidential Statistics

Statistical methods aim to answer a variety of questions about observations. A simple example occurs when a fairly reliable test for a condition C has given a positive result. Three important types of questions are:
1. Should this observation lead me to believe that condition C is present?
2. Does this observation justify my acting as if condition C were present?
3. Is this observation evidence that condition C is present?
We must distinguish among these three questions in terms of the variables and principles that determine their answers. Questions of the third type, concerning the "evidential interpretation" of statistical data, are central to many applications of statistics in many fields.
It is already recognized that current statistical methods are seriously flawed for answering the evidential question, a flaw that could be corrected by applying the Law of Likelihood. This law suggests how the dominant statistical paradigm can be altered so as to generate appropriate methods for objective, quantitative representation of the evidence embodied in a specific set of observations, as well as measurement and control of the probabilities that a study will produce weak or misleading evidence.
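As a rough illustration of the Law of Likelihood applied to the diagnostic-test example above, the evidence that a positive result provides for "C present" over "C absent" can be measured by the ratio of the probabilities of that result under the two hypotheses. The sketch below is only an illustration; the function name and the sensitivity and specificity values are hypothetical and do not come from the text.

```python
def likelihood_ratio_positive(sensitivity, specificity):
    """Ratio P(positive | C present) / P(positive | C absent)."""
    false_positive_rate = 1.0 - specificity
    if false_positive_rate == 0:
        raise ValueError("a specificity of 1 gives an unbounded ratio")
    return sensitivity / false_positive_rate


# Hypothetical test characteristics, chosen only for illustration.
lr = likelihood_ratio_positive(sensitivity=0.95, specificity=0.90)
print("likelihood ratio for a positive result:", round(lr, 2))
```

A ratio above 1 means the positive result is evidence for C over its absence (question 3). Whether one should believe or act as if C is present (questions 1 and 2) depends on further inputs, such as how common the condition is and the costs of acting, which this ratio does not include.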
 

Multivariate Data Analysis

Data are easy to collect; what we really need in complex problem solving is information. We may view a data base as a domain that requires probes and tools to extract relevant information. As in the measurement process itself, appropriate instruments of reasoning must be applied to the data interpretation task. Effective tools serve in two capacities: to summarize the data and to assist in interpretation. The objectives of interpretive aids are to reveal the data at several levels of detail.
Exploring the fuzzy data picture sometimes requires a wide-angle lens to view its totality. At other times it requires a close up lens to focus on fine detail. The graphically based tools that we use provide this flexibility. Most chemical systems are complex because they involve many variables and there are many interactions among the variables. Therefore, chemometric techniques rely upon multivariate statistical and mathematical tools to uncover interactions and reduce the dimensionality of the data.
Principal component analysis is used for exploring data. Two closely related techniques, principal component analysis and factor analysis, are used to reduce the dimensionality of multivariate data. In these techniques correlation and interactions among the variables are summarized in terms of a small number of underlying factors. The methods rapidly identify key variables or groups of variables that control the system under study. The resulting dimension reduction also permits graphical representation of the data so that significant relationships among observations or samples can be identified.
Other techniques include Multidimensional Scaling, Cluster Analysis, and Correspondence Analysis.
Multivariate analysis is a branch of statistics involving the consideration of objects on each of which are observed the values of a number of variables. A wide range of methods is used for the analysis of multivariate data, and this course will give a view of the variety of methods available, as well as going into some of them in detail. Multivariate techniques are used across the whole range of fields of statistical application: in medicine, physical and biological sciences, economics and social science, and of course in many industrial and commercial applications.

Spatial Data Analysis

Data which is geographically or spatially referenced is encountered in a very wide variety of practical contexts. In the same way that data collected at different points in time may require specialised analytical techniques, there is a range of statistical methods devoted to the modelling and analysis of data collected at different points in space. Increased public sector and commercial recording and use of data which is geographically referenced, recent advances in computer hardware and software capable of manipulating and displaying spatial relationships in the form of digital maps, and an awareness of the potential importance of spatial relationships in many areas of research, have all combined to produce an increased interest in spatial analysis. Spatial Data Analysis is concerned with the study of such techniques: the kind of problems they are designed to address, their theoretical justification, and when and how to use them in practice.
Many natural phenomena involve a random distribution of points in space. Biologists who observe the locations of cells of a certain type in an organ, astronomers who plot the positions of the stars, botanists who record the positions of plants of a certain species and geologists detecting the distribution of a rare mineral in rock are all observing spatial point patterns in two or three dimensions. Such phenomena can be modelled by spatial point processes.
The spatial linear model is fundamental to a number of techniques used in image processing, for example, for locating gold/ore deposits, or creating maps. There are many unresolved problems in this area such as the behavior of maximum likelihood estimators and predictors, and diagnostic tools. There are strong connections between kriging predictors for the spatial linear model and spline methods of interpolation and smoothing. The two-dimensional version of splines/kriging can be used to construct deformations of the plane, which are of key importance in shape analysis.

Meta-Analysis

Meta-Analysis deals with the art of combining information from the data from different independent sources which are targeted at a common goal. There are plenty of applications of Meta-Analysis in various disciplines such as Astronomy, Agriculture, Biological and Social Sciences, and Environmental Science. This particular topic of statistics has evolved considerably over the last twenty years with applied as well as theoretical developments.
A Meta-analysis deals with a set of RESULTs to give an overall RESULT that is (presumably) comprehensive and valid.
a) Especially when Effect-sizes are rather small, the hope is that one can gain good power by essentially pretending to have the larger N as a valid, combined sample.
b) When effect sizes are rather large, then the extra POWER is not needed for main effects of design: Instead, it theoretically could be possible to look at contrasts between the slight variations in the studies themselves.
For example, to compare two effect sizes (r) obtained by two separate studies, you may use:
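$$Z = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}}$$
where $z_1$ and $z_2$ are the Fisher transformations of the two correlations $r_1$ and $r_2$, that is, $z_i = \tfrac{1}{2}\ln\frac{1 + r_i}{1 - r_i}$, and $n_1$ and $n_2$ in the denominator are the sample sizes of the two studies.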
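The comparison of two correlations just described can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name is hypothetical, and the two-sided p-value assumes that Z is approximately standard normal when the two population correlations are equal.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Return (Z, two-sided p-value) for the difference between two correlations."""
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical correlations and sample sizes from two studies.
z, p = compare_correlations(r1=0.60, n1=103, r2=0.30, n2=103)
print("Z =", round(z, 3), "p =", round(p, 4))
```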
All of this holds only if you really trust that "all things being equal" will hold up across the studies. The typical "meta" study does not do the tests for homogeneity that should be required.
In other words:
1. there is a body of research/data literature that you would like to summarize
2. one gathers together all the admissible examples of this literature (note: some might be discarded for various reasons)
3. certain details of each investigation are deciphered ... most important would be the effect that has or has not been found. ie, how much larger in sd units is the treatment group's performance compared to one or more controls.
4. call the values in each of the investigations in #3 .. mini effect sizes.
5. across all admissible data sets, you attempt to summarize the overall effect size by forming a set of individual effects ... and using an overall sd as the divisor .. thus yielding essentially an average effect size.
6. in the meta analysis literature ... sometimes these effect sizes are further labeled as small, medium, or large ....
You can look at effect sizes in many different ways .. across different factors and variables. but, in a nutshell, this is what is done.
I recall a case in physics in which, after a phenomenon had been observed in air, emulsion data were examined. The theory predicted about a 9% effect in emulsion, and behold, the published data gave 15%. As it happens, there was no significant (practical, not statistical) error in the theory, and also no error in the data. It was just that the results of experiments in which nothing statistically significant was found were not reported.
It is this non-reporting of such experiments, and often of the specific results which were not statistically significant, that introduces major biases. This is combined with the totally erroneous attitude of researchers that statistically significant results are the important ones, and that if there is no significance, the effect is not important. We really need to distinguish between the term "statistically significant" and the usual word "significant".
On the very important distinction between statistically significant and generally significant, see Discover Magazine (July, 1987), The Case of Falling Nightwatchmen, by Sapolsky. In this article, Sapolsky uses the example to make the point: a diminution of velocity at impact may be statistically significant, but not of importance to the falling nightwatchman.
Be careful about the word "significant". It has a technical meaning, not a commonsense one. It is NOT automatically synonymous with "important". A person or group can be statistically significantly taller than the average for the population, but still not be a candidate for your basketball team. Whether the difference is substantively (not merely statistically) significant is dependent on the problem which is being studied.
Meta-analysis is a controversial type of literature review in which the results of individual randomized controlled studies are pooled together to try to get an estimate of the effect of the intervention being studied. It increases statistical power and is used to resolve the problem of reports which disagree with each other. It's not easy to do well and there are many inherent problems.
There is also a graphical technique to assess the robustness of meta-analysis results. We carry out the meta-analysis repeatedly, dropping one study at a time; that is, if we have N studies we do N meta-analyses using N-1 studies in each one. We then plot these N estimates on the y axis and compare them with a straight line that represents the overall estimate using all the studies.
Topics in meta-analysis include: Odds ratios; Relative risk; Risk difference; Effect size; Incidence rate difference and ratio; Plots and exact confidence intervals.
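The drop-one-study check can be sketched as follows. The text does not say how the studies are pooled, so this sketch assumes a fixed-effect, inverse-variance weighted average; the function names and the effect sizes and variances are hypothetical.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted average (assumed pooling rule)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Pooled estimate recomputed N times, each time omitting one study."""
    estimates = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        estimates.append(pooled_estimate(kept_effects, kept_variances))
    return estimates


# Hypothetical effect sizes and their variances from four studies.
effects = [0.30, 0.45, 0.10, 0.52]
variances = [0.04, 0.09, 0.02, 0.16]
overall = pooled_estimate(effects, variances)
for study, estimate in enumerate(leave_one_out(effects, variances), start=1):
    print("without study", study, "->", round(estimate, 3),
          "(overall:", round(overall, 3), ")")
```

Plotting the printed estimates against the overall value gives the graph described above; an estimate that sits far from the overall line points to a study with a large influence on the result.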
For details, read:

Meta-Analysis in Social Research, by Glass, McGraw and Smith, 1987, and
Handbook of Research Synthesis, by Cooper H., and L. Hedges, (Eds.), New York, Russell Sage Foundation, 1994.

See also Meta-Analysis: Methods of Accumulating Results Across Research Domains.

Variogram Analysis

Variables are often measured at different locations. The patterns in these spatial variables may be extrapolated by variogram analysis.
A variogram summarizes the relationship between the variance of the difference in pairs of measurements and the distance of the corresponding points from each other. 
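A minimal sketch of an empirical variogram calculation is given below. It assumes the common convention of reporting the semivariance, which is half the average squared difference between pairs of measurements, grouped into distance bins. The function name, the binning rule and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def empirical_semivariogram(points, values, bin_width):
    """Map each distance bin to (lower bin edge, semivariance, number of pairs)."""
    squared_diff_sums = defaultdict(float)
    pair_counts = defaultdict(int)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(points[i], points[j])
            bin_index = int(distance // bin_width)
            squared_diff_sums[bin_index] += (values[i] - values[j]) ** 2
            pair_counts[bin_index] += 1
    return {
        b: (b * bin_width, squared_diff_sums[b] / (2 * pair_counts[b]), pair_counts[b])
        for b in sorted(pair_counts)
    }


# Hypothetical measurements at three locations on a line.
locations = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
measurements = [0.0, 1.0, 2.0]
for lower_edge, semivariance, pairs in empirical_semivariogram(
        locations, measurements, bin_width=1.0).values():
    print("distance >=", lower_edge, "semivariance", semivariance, "pairs", pairs)
```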

Survival Analysis

Survival analysis is suited to the examination of data where the outcome of interest is 'time until a specific event occurs', and where not all individuals have been followed up until the event occurs.
The methods of survival analysis are applicable not only in studies of patient survival, but also studies examining adverse events in clinical trials, time to discontinuation of treatment, duration in community care before re-hospitalisation, contraceptive and fertility studies etc.
If you've ever used regression analysis on longitudinal event data, you've probably come up against two intractable problems:
Censoring: Nearly every sample contains some cases that do not experience an event. If the dependent variable is the time of the event, what do you do with these "censored" cases?
Time-dependent covariates: Many explanatory variables (like income or blood pressure) change in value over time. How do you put such variables in a regression analysis?
Makeshift solutions to these questions can lead to severe biases. Survival methods are explicitly designed to deal with censoring and time-dependent covariates in a statistically correct way. Originally developed by biostatisticians, these methods have become popular in sociology, demography, psychology, economics, political science, and marketing.
In short, survival analysis is a group of statistical methods for the analysis and interpretation of survival data. Even though survival analysis can be used in a wide variety of applications (e.g. insurance, engineering, and sociology), the main application is for analyzing clinical trials data. Survival and hazard functions, together with the methods of estimating parameters and testing hypotheses, are the main part of analyses of survival data. Main topics relevant to survival data analysis are: survival and hazard functions; types of censoring; estimation of survival and hazard functions (the Kaplan-Meier and life table estimators); simple life tables; Peto's logrank with trend test and hazard ratios, and the Wilcoxon test (which can be stratified); Wei-Lachin; comparison of survival functions (the logrank and Mantel-Haenszel tests); the proportional hazards model with time-independent and time-dependent covariates; the logistic regression model; and methods for determining sample sizes.
In the last few years the survival analysis software available in several of the standard statistical packages has experienced a major increment in functionality, and is no longer limited to the triad of Kaplan-Meier curves, logrank tests, and simple Cox models.
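To make the handling of censored cases concrete, here is a minimal sketch of the Kaplan-Meier estimator mentioned above. The function name and the follow-up data are hypothetical; a statistical package would normally be used in practice.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  : follow-up time for each individual
    events : True if the event was observed at that time, False if censored
    Returns a list of (time, estimated survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed_events = 0
        leaving_risk_set = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                observed_events += 1
            leaving_risk_set += 1
            i += 1
        if observed_events > 0:
            survival *= 1.0 - observed_events / at_risk
            curve.append((t, survival))
        # Censored cases leave the risk set without lowering the curve.
        at_risk -= leaving_risk_set
    return curve


# Hypothetical follow-up times; the second individual is censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [True, False, True, True]))
```

The censored individual still counts in the risk set at time 1, which is how the method uses the partial information that a discarded case would lose.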
Further Reading:


Lee E., Statistical Methods for Survival Data Analysis, Wiley, 1992. 
 
 

Split-half Analysis

What is split-half analysis? Split your sample in half. Factor-analyze each half. Do they come out the same as (or similar to) each other? Alternatively (or also), take more than two random subsamples of your sample and do the same.
Notice that this is (like factor analysis itself) an "exploratory", not an inferential, technique; i.e., hypothesis testing, confidence intervals, etc. simply do not apply.
Alternatively, randomly split the sample in half and then do an exploratory factor analysis on Sample 1. Use those results to do a confirmatory factor analysis with Sample 2.

The Central Limit Theorem

For practical purposes, the main idea of the central limit theorem (CLT) is that the average of a sample of observations drawn from some population with any shape-distribution is approximately distributed as a normal distribution if certain conditions are met. In theoretical statistics there are several versions of the central limit theorem depending on how these conditions are specified. These are concerned with the types of assumptions made about the distribution of the parent population (population from which the sample is drawn) and the actual sampling procedure.
One of the simplest versions of the theorem says that if $X_1, X_2, \ldots, X_n$ is a random sample of size n (say, n > 30) from an infinite population with finite standard deviation $\sigma$, then the standardized sample mean $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ converges to a standard normal distribution or, equivalently, the sample mean $\bar{X}$ approaches a normal distribution with mean equal to the population mean $\mu$ and standard deviation equal to the standard deviation of the population divided by the square root of the sample size n. In applications of the central limit theorem to practical problems in statistical inference, however, statisticians are more interested in how closely the approximate distribution of the sample mean follows a normal distribution for finite sample sizes than in the limiting distribution itself. Sufficiently close agreement with a normal distribution allows statisticians to use normal theory for making inferences about population parameters (such as the mean $\mu$) using the sample mean, irrespective of the actual form of the parent population.
It is well known that whatever the parent population is, the standardized variable $Z$ will have a distribution with mean 0 and standard deviation 1 under random sampling. Moreover, if the parent population is normal, then $Z$ is distributed exactly as a standard normal variable for any positive integer n. The central limit theorem states the remarkable result that, even when the parent population is non-normal, the standardized variable is approximately normal if the sample size is large enough (say, n > 30). It is generally not possible to state precise conditions under which the approximation given by the central limit theorem works, or what sample sizes are needed before the approximation becomes good enough. As a general guideline, statisticians have used the prescription that if the parent distribution is symmetric and relatively short-tailed, then the sample mean reaches approximate normality for smaller samples than if the parent population is skewed or long-tailed.
One must study the behavior of the mean of samples of different sizes drawn from a variety of parent populations. Examining sampling distributions of sample means computed from samples of different sizes drawn from a variety of distributions allows us to gain some insight into the behavior of the sample mean under those specific conditions, as well as to examine the validity of the guidelines mentioned above for using the central limit theorem in practice.
Under certain conditions, in large samples, the sampling distribution of the sample mean can be approximated by a normal distribution. The sample size needed for the approximation to be adequate depends strongly on the shape of the parent distribution. Symmetry (or lack thereof) is particularly important. For a symmetric parent distribution, even if very different from the shape of a normal distribution, an adequate approximation can be obtained with small samples (e.g., 10 or 12 for the uniform distribution). For symmetric short-tailed parent distributions, the sample mean reaches approximate normality for smaller samples than if the parent population is skewed and long-tailed. In some extreme cases (e.g., a binomial with success probability very close to 0 or 1), sample sizes far exceeding the typical guidelines (say, 30) are needed for an adequate approximation. For some distributions without first and second moments (e.g., Cauchy), the central limit theorem does not hold.
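The effect of sample size on the shape of the sampling distribution of the mean can be explored with a small simulation. The sketch below is only an illustration: it assumes a skewed parent population (exponential with mean 1), and the function names, number of replications and seed are arbitrary choices.

```python
import random
import statistics


def draw_exponential(rng):
    """One observation from a skewed parent population (exponential, mean 1)."""
    return rng.expovariate(1.0)


def sample_means(draw, n, replications, seed=0):
    """Means of `replications` independent samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n))
            for _ in range(replications)]


def skewness(xs):
    """Standardized third moment; 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)


for n in (2, 10, 30, 100):
    means = sample_means(draw_exponential, n, replications=5000)
    print("n =", n,
          "mean =", round(statistics.fmean(means), 3),
          "sd =", round(statistics.stdev(means), 3),
          "skewness =", round(skewness(means), 3))
```

As n grows, the standard deviation of the sample means should shrink roughly like $1/\sqrt{n}$ and the skewness should move toward 0, which is the approach to normality described above.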

Sampling Distribution

The main idea of statistical inference is to take a random sample from a population and then to use the information from the sample to make inferences about particular population characteristics such as the mean (measure of central tendency), the standard deviation (measure of spread) or the proportion of units in the population that have a certain characteristic. Sampling saves money, time, and effort. Additionally, a sample can, in some cases, provide as much or more accuracy than a corresponding study that would attempt to investigate an entire population: careful collection of data from a sample will often provide better information than a less careful study that tries to look at everything.
We will study the behavior of the mean of sample values from different specified populations. Because a sample examines only part of a population, the sample mean will not exactly equal the corresponding mean of the population. Thus, an important consideration for those planning and interpreting sampling results is the degree to which sample estimates, such as the sample mean, will agree with the corresponding population characteristic.
In practice, only one sample is usually taken (in some cases a small ``pilot sample'' is used to test the data-gathering mechanisms and to get preliminary information for planning the main sampling scheme). However, for purposes of understanding the degree to which sample means will agree with the corresponding population mean, it is useful to consider what would happen if 10, or 50, or 100 separate sampling studies, of the same type, were conducted. How consistent would the results be across these different studies? If we could see that the results from each of the samples would be nearly the same (and nearly correct!), then we would have confidence in the single sample that will actually be used. On the other hand, seeing that answers from the repeated samples were too variable for the needed accuracy would suggest that a different sampling plan (perhaps with a larger sample size) should be used.
A sampling distribution is used to describe the distribution of outcomes that one would observe from replication of a particular sampling plan.
Know that to estimate means to esteem (to give value to).
Know that estimates computed from one sample will be different from estimates that would be computed from another sample.
Understand that estimates are expected to differ from the population characteristics (parameters) that we are trying to estimate, but that the properties of sampling distributions allow us to quantify, probabilistically, how they will differ.
Understand that different statistics have different sampling distributions with distribution shape depending on (a) the specific statistic, (b) the sample size, and (c) the parent distribution.
Understand the relationship between sample size and the distribution of sample estimates.
Understand that the variability in a sampling distribution can be reduced by increasing the sample size.
See that in large samples, many sampling distributions can be approximated with a normal distribution.
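The idea of replicating a sampling plan can be tried out directly. The following sketch assumes a hypothetical finite population of 10,000 values and simple random sampling without replacement; the names, population parameters and seeds are arbitrary.

```python
import random
import statistics


def sampling_distribution(population, n, replications, seed=1):
    """Sample means from repeated simple random samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n))
            for _ in range(replications)]


# Hypothetical population, generated only for illustration.
population_rng = random.Random(42)
population = [population_rng.gauss(50, 10) for _ in range(10000)]
population_mean = statistics.fmean(population)

for n in (10, 50, 100):
    means = sampling_distribution(population, n, replications=2000)
    print("n =", n,
          "average of sample means =", round(statistics.fmean(means), 2),
          "spread of sample means =", round(statistics.stdev(means), 2),
          "population mean =", round(population_mean, 2))
```

The printed spread is the variability one would see across repeated studies of the same type; it should fall as the sample size increases.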

Least Squares Models

Many problems in analyzing data involve describing how variables are related. The simplest of all models describing the relationship between two variables is a linear, or straight-line, model. The simplest method of fitting a linear model is to ``eye-ball'' a line through the data on a plot, but a more elegant and conventional method is that of least squares, which finds the line minimizing the sum of squared vertical distances between the observed points and the fitted line.
Realize that fitting the ``best'' line by eye is difficult, especially when there is a lot of residual variability in the data.
Know that there is a simple connection between the numerical coefficients in the regression equation and the slope and intercept of the regression line.
Know that a single summary statistic like a correlation coefficient or $R^2$ does not tell the whole story. A scatter plot is an essential complement to examining the relationship between the two variables.
Know that model checking is an essential part of the process of statistical modelling. After all, conclusions based on models that do not properly describe an observed set of data will be invalid.
Know the impact of violation of regression model assumptions (i.e., conditions) and possible solutions by analyzing the residuals. 
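A least squares line and its residuals can be computed with the usual closed-form expressions for simple linear regression. This is a minimal sketch with a hypothetical function name and hypothetical data.

```python
def least_squares_line(x, y):
    """Slope, intercept and residuals of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, residuals


# Hypothetical data with some scatter around a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, residuals = least_squares_line(x, y)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("residuals:", [round(r, 3) for r in residuals])
```

Plotting the residuals against x is one simple way to carry out the model checking recommended above.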

Least Median of Squares Models

The standard least squares techniques for estimation in linear models are not robust in the sense that outliers or contaminated data can strongly influence estimates. A robust technique which protects against contamination is least median of squares (LMS) estimation. LMS estimation can be extended to generalized linear models, giving rise to the least median of deviance (LMD) estimator.
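The sketch below illustrates the LMS idea for a straight line. It is an approximation, not an exact LMS algorithm: it assumes that it is enough to search over the lines passing through each pair of data points and to keep the one with the smallest median squared residual. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(x, y):
    """Approximate least median of squares line via a search over point pairs.

    Returns (slope, intercept, median squared residual).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        criterion = median((yk - (intercept + slope * xk)) ** 2
                           for xk, yk in zip(x, y))
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    return best[1], best[2], best[0]


# Hypothetical data: five points on y = 2x and one gross outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 2.0, 4.0, 6.0, 8.0, 100.0]
print(lms_line(x, y))
```

Because the criterion is a median rather than a sum, the single outlier does not pull the fitted line away from the five points that agree with each other.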

Power of a Test

Significance tests are based on certain assumptions: The data have to be random samples out of a well defined basic population and one has to assume that some variables follow a certain distribution - in most cases the normal distribution is assumed.
Power of a test is the probability of correctly rejecting a false null hypothesis. This probability is one minus the probability of making a Type II error (b). Recall also that we choose the probability of making a Type I error when we set a and that if we decrease the probability of making a Type I error we increase the probability of making a Type II error.

Power and Alpha

Thus, the probability of correctly retaining a true null has the same relationship to Type I errors as the probability of correctly rejecting an untrue null does to Type II error. Yet, as I mentioned if we decrease the odds of making one type of error we increase the odds of making the other type of error. What is the relationship between Type I and Type II errors?
Power and the True Difference Between Population Means: Any time we test whether a sample differs from a population or whether two samples come from two separate populations, there is the assumption that each of the populations we are comparing has its own mean and standard deviation (even if we do not know them). The distance between the two population means will affect the power of our test.
Power as a Function of Sample Size and Variance: You should notice that what really makes the difference in the size of β is how much overlap there is between the two distributions. When the means are close together the two distributions overlap a great deal compared to when the means are farther apart. Thus, anything that increases the extent to which the two distributions share common values will increase β (the likelihood of making a Type II error).
Sample size has an indirect effect on power because it affects the measure of variability we use to calculate the t-test statistic. Since we are calculating the power of a test that involves the comparison of sample means, we will be more interested in the standard error (the standard deviation of the sampling distribution of the mean) than in the standard deviation or variance by itself. Thus, sample size is of interest because it modifies the standard error. When n is large we will have a lower standard error than when n is small. In turn, when n is large we will have a smaller β region than when n is small.
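To make the roles of the true difference, the variability and the sample size concrete, here is a rough C++ sketch of the power of a one-sided test of a mean. It simplifies the setting above by assuming a z-test with known population standard deviation rather than a t-test; the function names and the default critical value 1.6449 (for α = 0.05) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function.
double normal_cdf(double z) {
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}

// Power of a one-sided z-test of H0: mu = mu0 against H1: mu > mu0.
//   delta   : true difference mu - mu0
//   sigma   : population standard deviation (assumed known)
//   n       : sample size
//   z_alpha : critical value of the test (1.6449 corresponds to alpha = 0.05)
double z_test_power(double delta, double sigma, int n, double z_alpha = 1.6449) {
    assert(sigma > 0.0 && n > 0);
    const double standard_error = sigma / std::sqrt(static_cast<double>(n));
    // Power = 1 - beta = P(reject H0 | true difference is delta).
    return normal_cdf(delta / standard_error - z_alpha);
}
```

Under these assumptions, z_test_power(0.5, 1.0, 25) is roughly 0.80; increasing n or delta raises the power, while increasing sigma lowers it.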

ANOVA: Analysis of Variance

The tests we have learned up to this point allow us to test hypotheses that examine the difference between only two means. Analysis of Variance or ANOVA will allow us to test the difference between two or more means. ANOVA does this by examining the ratio of the variability between conditions to the variability within each condition. For example, say we give a drug that we believe will improve memory to a group of people and give a placebo to another group of people. We might measure memory performance by the number of words recalled from a list we ask everyone to memorize. A t-test would assess the likelihood of observing the difference in the mean number of words recalled for each group. An ANOVA test, on the other hand, would compare the variability that we observe between the two conditions to the variability observed within each condition. Recall that we measure variability as the sum of the squared differences of each score from the mean. When we actually calculate an ANOVA we will use a short-cut formula.
Thus, when the variability that we predict (between the two groups) is much greater than the variability we don't predict (within each group) then we will conclude that our treatments produce different results. 
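The ratio of between-group to within-group variability can be sketched in C++ as follows. This is a definitional one-way ANOVA F statistic, not the short-cut formula mentioned above, and the function name anova_f is hypothetical. Each inner vector is assumed to hold the scores of one group.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-way ANOVA F statistic: mean square between groups divided by
// mean square within groups. Requires at least two non-empty groups,
// more observations than groups, and some within-group variability.
double anova_f(const std::vector<std::vector<double>>& groups) {
    assert(groups.size() >= 2);
    std::size_t total_n = 0;
    double grand_sum = 0.0;
    for (const auto& g : groups) {
        assert(!g.empty());
        for (double v : g) grand_sum += v;
        total_n += g.size();
    }
    assert(total_n > groups.size());
    const double grand_mean = grand_sum / static_cast<double>(total_n);

    double ss_between = 0.0, ss_within = 0.0;
    for (const auto& g : groups) {
        double group_sum = 0.0;
        for (double v : g) group_sum += v;
        const double group_mean = group_sum / static_cast<double>(g.size());
        // Variability explained by group membership.
        ss_between += static_cast<double>(g.size()) *
                      (group_mean - grand_mean) * (group_mean - grand_mean);
        // Variability of scores around their own group mean.
        for (double v : g) ss_within += (v - group_mean) * (v - group_mean);
    }
    assert(ss_within > 0.0);
    const double df_between = static_cast<double>(groups.size() - 1);
    const double df_within = static_cast<double>(total_n - groups.size());
    return (ss_between / df_between) / (ss_within / df_within);
}
```

A large value of the returned ratio corresponds to the case in which the predicted (between-group) variability is much greater than the unpredicted (within-group) variability.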

P-values

The P-value, which directly depends on a given sample, attempts to provide a measure of the strength of the results of a test, in contrast to a simple reject or do not reject. If the null hypothesis is true and the chance of random variation is the only reason for sample differences, then the P-value is a quantitative measure to feed into the decision making process as evidence. The following table provides a reasonable interpretation of P-values:
This interpretation is widely accepted, and many scientific journals routinely publish papers using such an interpretation for the result of test of hypothesis.
For a fixed sample size, when the number of realizations is decided in advance, the distribution of p is uniform (assuming the null hypothesis). We would express this as P(p ≤ x) = x. That means the criterion of p < 0.05 achieves an α of 0.05.
When a p-value is associated with a set of data, it is a measure of the probability that the data could have arisen as a random sample from some population described by the statistical (testing) model.
A p-value is a measure of how much evidence you have against the null hypothesis. The smaller the p-value, the more evidence you have. One may combine the p-value with the significance level to make decision on a given test of hypothesis. In such a case, if the p-value is less than some threshold (usually .05, sometimes a bit larger like 0.1 or a bit smaller like .01) then you reject the null hypothesis.
Understand that the distribution of p-values under null hypothesis H0 is uniform, and thus does not depend on a particular form of the statistical test. In a statistical hypothesis test, the P value is the probability of observing a test statistic at least as extreme as the value actually observed, assuming that the null hypothesis is true. The value of p is defined with respect to a distribution. Therefore, we could call it "model-distributional hypothesis" rather than "the null hypothesis".
In short, if the null hypothesis had been true, the p-value is the probability of obtaining a result at least as extreme as the one actually observed. The p-value is determined by the observed value; this, however, makes it difficult to even state the inverse of p.

P-value for Standard Normal and t-statistics

Conversion of a z-statistic Into a (one-side) P-value
INPUT "Z : ", ZValue
a1# = .31938153#
a2# = -.356563782#
a3# = 1.781477937#
a4# = -1.821255978#
a5# = 1.330274429#
w1# = ABS(ZValue)
w# = 1 / (1 + .2316419# * w1#)
w1# = .39894228# * EXP(-.5 * w1# * w1#)
p0# = w# *(a1# + w# *(a2# + w# *(a3# + w# * (a4# + a5# * w#))))
p0# = (w1# * p0#)
REM p0# holds the tail area beyond ABS(ZValue); convert it to P(Z > ZValue)
IF ZValue < 0 THEN
  p0# = 1 - p0#
END IF
PRINT p0#
Upper-tail area beyond Z (for Z >= 0) under the standard normal density: EXP(-((83*Z+351)*Z+562)*Z/(703+165*Z))/2. The area from 0 to Z is 0.5 minus this value.
Below is a similar program:

        INPUT z
        a1 = .31938153#
        a2 = -.356563782#
        a3 = 1.781477937#
        a4 = -1.821255978#
        a5 = 1.330274429#

        w1 = ABS(z)
        w = 1 / (1 + .2316419 * w1)
        w1 = .39894228# * EXP(-.5 * w1 * w1)
        p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))))
        p0 = w1 * p0

        PRINT ABS(p0);
Conversion of a z-statistic Into a (one-side) P-value: in C++ code
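#include <cmath>

// Upper-tail probability P(Z > |z|) for a standard normal variable,
// using a five-term polynomial approximation
// (add __declspec(dllexport) when building a Windows DLL).
double NormalProb(double z)
{
        const double a1 = .31938153;
        const double a2 = -.356563782;
        const double a3 = 1.781477937;
        const double a4 = -1.821255978;
        const double a5 = 1.330274429;

        double w1 = std::fabs(z);
        double w = 1 / (1 + .2316419 * w1);
        w1 = .39894228 * std::exp(-0.5 * w1 * w1);   // normal density at |z|
        double p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))));
        p0 = w1 * p0;

        return std::fabs(p0);
}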
Conversion of a t-statistics Into a (one-side) P-value: C++
#include <cmath>

// Upper-tail probability P(T > t) for Student's t with df degrees of freedom
// (add __declspec(dllexport) when building a Windows DLL).
double TProb(double t, int df)
{
        const double a = 0.36338023;    // 1 - 2/pi
        double w = std::atan(t / std::sqrt(static_cast<double>(df)));
        double s = std::sin(w);
        double c = std::cos(w);

        double t1, t2;
        int j1, j2, k2;

        if (df % 2 == 0)       // even
        {
                t1 = s;
                if (df == 2)   // special case df=2 (upper tail, like the other branches)
                        return 1 - (0.5 * (1 + t1));
                t2 = s;
                j1 = -1;
                j2 = 0;
                k2 = (df - 2) / 2;
        }
        else
        {
                t1 = w;
                if (df == 1)            // special case df=1
                        return 1 - (0.5 * (1 + (t1 * (1 - a))));
                t2 = s * c;
                t1 = t1 + t2;
                if (df == 3)            // special case df=3
                        return 1 - (0.5 * (1 + (t1 * (1 - a))));
                j1 = 0;
                j2 = 1;
                k2 = (df - 3)/2;
        }
        for (int i = 1; i <= k2; i++)   // accumulate the remaining series terms
        {
                j1 = j1 + 2;
                j2 = j2 + 2;
                t2 = t2 * c * c * j1/j2;
                t1 = t1 + t2;
        }
        return 1 - (0.5 * (1 + (t1 * (1 - a * (df % 2)))));
}

The Effect Size

Effect size (ES) is a ratio of a mean difference to a standard deviation, i.e. it is a form of z-score. Suppose an experimental treatment group has a mean score of Xe and a control group has a mean score of Xc and a standard deviation of Sc; then the effect size is equal to (Xe - Xc)/Sc.
Effect size permits the effects of different treatments to be compared, even when they are based on different samples and different measuring instruments.
Therefore, the ES is the standardized mean difference between the treatment group and the control group. However, by Glass's method, ES is (mean1 - mean2)/SD of the control group, while by the Hunter-Schmidt method, ES is (mean1 - mean2)/pooled SD, then adjusted by the instrument reliability coefficient. ES is commonly used in meta-analysis and power analysis.
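The two standardizations mentioned above can be sketched in C++ as follows. The function names are hypothetical, sample variances use the n - 1 divisor, and the pooled-SD version deliberately omits the reliability adjustment, which would require a reliability coefficient not given here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a non-empty sample.
double mean_of(const std::vector<double>& v) {
    assert(!v.empty());
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Sample variance with the n - 1 divisor (needs at least two values).
double sample_variance(const std::vector<double>& v) {
    assert(v.size() >= 2);
    const double m = mean_of(v);
    double ss = 0.0;
    for (double x : v) ss += (x - m) * (x - m);
    return ss / static_cast<double>(v.size() - 1);
}

// Glass's method: mean difference divided by the control-group SD.
double glass_effect_size(const std::vector<double>& treatment,
                         const std::vector<double>& control) {
    return (mean_of(treatment) - mean_of(control)) /
           std::sqrt(sample_variance(control));
}

// Mean difference divided by the pooled SD (no reliability adjustment).
double pooled_sd_effect_size(const std::vector<double>& treatment,
                             const std::vector<double>& control) {
    const double n1 = static_cast<double>(treatment.size());
    const double n2 = static_cast<double>(control.size());
    const double pooled_var =
        ((n1 - 1.0) * sample_variance(treatment) +
         (n2 - 1.0) * sample_variance(control)) / (n1 + n2 - 2.0);
    return (mean_of(treatment) - mean_of(control)) / std::sqrt(pooled_var);
}
```

The two versions differ whenever the treatment and control groups have different spreads, which is one reason to state which standard deviation was used when reporting an ES.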
Further Readings:


Glass G., McGaw B., and M. Smith, Meta-analysis in Social Research, Newbury Park, CA: Sage, 1981.
Cooper H., and L. Hedges, The Handbook of Research Synthesis, NY, Russell Sage, 1994. 


Structural Equation Modeling

The structural equation modeling techniques are used to study relations among variables. The relations are typically assumed to be linear. In social and behavioral research most phenomena are influenced by a large number of determinants which typically have a complex pattern of interrelationships. To understand the relative importance of these determinants their relations must be adequately represented in a model, which may be done with structural equation modeling.
A structural equation model may apply to one group of cases or to multiple groups of cases. When multiple groups are analyzed parameters may be constrained to be equal across two or more groups. When two or more groups are analyzed, means on observed and latent variables may also be included in the model.
As an application, how do you test the equality of regression slopes coming from the same sample using 3 different measuring methods? You could use a structural modeling approach.
1 - Standardize all three data sets prior to the analysis because b weights are also a function of the variance of the predictor variable and with standardization, you remove this source.
2 - Model the dependent variable as the effect from all three measures and obtain the path coefficient (b weight) for each one.
3 - Then fit a model in which the three path coefficients are constrained to be equal. If a significant decrement in fit occurs, the paths are not equal.
Further Reading:


Schumacker R., and R. Lomax, A Beginner's Guide to Structural Equation Modeling, Lawrence Erlbaum, New Jersey, 1996.

Visit also the Web site Structural Equation Modeling on the Internet 

Tri-linear Coordinates Triangle

A "ternary diagram" is usually used to show the change of opinion (FOR - AGAINST - UNDECIDED). The triangular diagram was first used by the chemist Willard Gibbs in his studies on phase transitions. It is based on the proposition from geometry that in an equilateral triangle, the sum of the distances from any point to the three sides is constant. This implies that the percent composition of a mixture of three substances can be represented as a point in such a diagram, since the sum of the percentages is constant (100). The three vertices are the points of the pure substances.
The same holds for the "composition" of the opinions in a population. When the percentages for, against and undecided sum to 100, the same technique for presentation can be used. See the diagram below, which should be viewed with a non-proportional (fixed-width) font; a true equilateral shape may not be preserved in transmission. For example, let the initial composition of opinions be given by point 1, that is, few undecided and roughly as many for as against. Let another composition be given by point 2. This point represents a higher percentage undecided and, among the decided, a majority of "for".
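For readers who want to plot such compositions, the following C++ sketch converts three percentages into a point inside an equilateral triangle with unit side. The placement of the vertices (FOR at the lower left, AGAINST at the lower right, UNDECIDED at the top) and the names Point2D and ternary_to_cartesian are assumptions made for this illustration only.

```cpp
#include <cassert>
#include <cmath>

// A point in the plane of the triangle.
struct Point2D {
    double x;
    double y;
};

// Maps a composition (for, against, undecided) to Cartesian coordinates in an
// equilateral triangle with vertices FOR = (0, 0), AGAINST = (1, 0) and
// UNDECIDED = (0.5, sqrt(3)/2). The inputs need not sum to 100; they are
// rescaled by their total, which must be positive.
Point2D ternary_to_cartesian(double pct_for, double pct_against,
                             double pct_undecided) {
    assert(pct_for >= 0.0 && pct_against >= 0.0 && pct_undecided >= 0.0);
    const double total = pct_for + pct_against + pct_undecided;
    assert(total > 0.0);
    const double height = std::sqrt(3.0) / 2.0;
    Point2D p;
    // The distance from the bottom side is proportional to the undecided share.
    p.y = height * pct_undecided / total;
    p.x = (pct_against + 0.5 * pct_undecided) / total;
    return p;
}
```

With this layout, a composition with few undecided and equal shares for and against lies near the middle of the bottom side, and a growing undecided share moves the point toward the top vertex.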

Internal and Inter-rater Reliability

"Internal reliability" of a scale is often measured by Cronbach's coefficient α. It is relevant when you will compute a total score and you want to know its reliability, based on no other rating. The "reliability" is *estimated* from the average correlation, and from the number of items, since a longer scale will (presumably) be more reliable. Whether the items have the same means is not usually important.
Tau-equivalent: The true scores on items are assumed to differ from each other by no more than a constant. For α to equal the reliability of the measure, the items comprising it have to be at least tau-equivalent; if this assumption is not met, α is a lower-bound estimate of reliability.
Congeneric measures: This least restrictive model within the framework of classical test theory requires only that true scores on measures said to be measuring the same phenomenon be perfectly correlated. Consequently, on congeneric measures, error variances, true-score means, and true-score variances may be unequal.
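The dependence of α on the number of items and the average inter-item correlation, noted at the start of this section, can be illustrated with the standardized form of the coefficient. The sketch below is in C++; the function name is hypothetical, and the standardized form may differ somewhat from α computed from raw item variances and covariances.

```cpp
#include <cassert>
#include <cmath>

// Standardized coefficient alpha for a scale of k items whose average
// inter-item correlation is mean_r:
//   alpha = k * mean_r / (1 + (k - 1) * mean_r)
double standardized_alpha(int k, double mean_r) {
    assert(k >= 2);
    const double denominator = 1.0 + (k - 1) * mean_r;
    assert(denominator > 0.0);
    return k * mean_r / denominator;
}
```

For instance, with an average correlation of 0.3, ten items give a value of about 0.81, and doubling the scale to twenty items gives about 0.90, which matches the remark that a longer scale will presumably be more reliable.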
For "inter-rater" reliability, one distinction is that the importance lies with the reliability of the single rating. Suppose we have the following data:
 Participant   Time   Q1   Q2   Q3   ...   Q17
 001           1      4    5    4    ...   4
 002           1      3    4    3    ...   3
 001           2      4    4    5    ...   3
 etc.
By examining the data, I think one cannot do better than looking at the paired t-test and Pearson correlations between each pair of raters - the t-test tells you whether the means are different, while the correlation tells you whether the judgments are otherwise consistent.
Unlike the Pearson, the "intra-class" correlation assumes that the raters do have the same mean. It is not bad as an overall summary, and it is precisely what some editors do want to see presented for reliability across raters. It is both a plus and a minus, that there are a few different formulas for intra-class correlation, depending on whose reliability is being estimated.
For purposes such as planning the Power for a proposed study, it does matter whether the raters to be used will be exactly the same individuals. A good methodology to apply in such cases, is the Bland & Altman analysis.
Visit also the Web site Common Correlation and Reliability Analysis

Nonparametric Techniques

A statistical technique is called nonparametric if it satisfies at least one of the following five types of criteria:
1. The data entering the analysis are enumerative - that is, count data representing the number of observations in each category or cross-category.
2. The data are measured and /or analyzed using a nominal scale of measurement.
3. The data are measured and /or analyzed using an ordinal scale of measurement.
4. The inference does not concern a parameter in the population distribution - as, for example, the hypothesis that a time-ordered set of observations exhibits a random pattern.
5. The probability distribution of the statistic upon which the analysis is based is not dependent upon specific information or assumptions about the population(s) from which the sample(s) are drawn, but only on general assumptions, such as a continuous and/or symmetric population distribution.
By this definition, the distinction of nonparametric is accorded either because of the level of measurement used or required for the analysis, as in types 1 through 3; the type of inference, as in type 4 or the generality of the assumptions made about the population distribution, as in type 5.
For example, one may use the Mann-Whitney rank test as a nonparametric alternative to Student's t-test when one does not have normally distributed data.
Mann-Whitney: To be used with two independent groups (analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the related samples t-test)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor between-subjects ANOVA)
Friedman: To be used with two or more related groups (analogous to the single-factor within-subjects ANOVA)
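As a small illustration of a rank-based procedure from the list above, the following C++ sketch computes the Mann-Whitney U statistic by direct pair counting, with ties counted as one half. The function name is hypothetical, and the sketch stops at the statistic itself; obtaining a p-value would additionally require tables or a normal approximation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mann-Whitney U for sample a relative to sample b: the number of pairs
// (x from a, y from b) with x > y, counting each tie as one half.
double mann_whitney_u(const std::vector<double>& a,
                      const std::vector<double>& b) {
    assert(!a.empty() && !b.empty());
    double u = 0.0;
    for (double x : a) {
        for (double y : b) {
            if (x > y) {
                u += 1.0;
            } else if (x == y) {
                u += 0.5;
            }
        }
    }
    return u;
}
```

Only the ordering of the observations enters the calculation, so no normality assumption is needed. The two statistics mann_whitney_u(a, b) and mann_whitney_u(b, a) always sum to the product of the two sample sizes.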


Analysis of Incomplete Data

Methods dealing with analysis of data with missing values can be classified into:
- Analysis of complete cases, including weighting adjustments,
- Imputation methods, and extensions to multiple imputation, and
- Methods that analyze the incomplete data directly without requiring a rectangular data set, such as maximum likelihood and Bayesian methods.

Multiple imputation (MI) is a general paradigm for the analysis of incomplete data. Each missing datum is replaced by m > 1 simulated values, producing m simulated versions of the complete data. Each version is analyzed by standard complete-data methods, and the results are combined using simple rules to produce inferential statements that incorporate missing data uncertainty. The focus is on the practice of MI for real statistical problems in modern computing environments.
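The combining step can be sketched in C++ using the rules commonly attributed to Rubin: the pooled estimate is the average of the m complete-data estimates, and its variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The names PooledEstimate and pool_imputations are hypothetical, and the sketch covers a single scalar parameter only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Combined point estimate and total variance across m imputed data sets.
struct PooledEstimate {
    double estimate;
    double total_variance;
};

// estimates[i] : estimate from the i-th completed data set
// variances[i] : its squared standard error
PooledEstimate pool_imputations(const std::vector<double>& estimates,
                                const std::vector<double>& variances) {
    assert(estimates.size() == variances.size() && estimates.size() >= 2);
    const double m = static_cast<double>(estimates.size());
    double q_bar = 0.0, within = 0.0;
    for (std::size_t i = 0; i < estimates.size(); ++i) {
        q_bar += estimates[i] / m;
        within += variances[i] / m;   // average within-imputation variance
    }
    double between = 0.0;             // variance of the estimates across imputations
    for (double q : estimates) between += (q - q_bar) * (q - q_bar) / (m - 1.0);
    PooledEstimate pooled;
    pooled.estimate = q_bar;
    pooled.total_variance = within + (1.0 + 1.0 / m) * between;
    return pooled;
}
```

The between-imputation term is what carries the extra uncertainty due to the missing data into the final inference.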
Further Readings:


Rubin D., Multiple Imputation for Nonresponse in Surveys, New York, Wiley, 1987.
Schafer J., Analysis of Incomplete Multivariate Data, London, Chapman and Hall, 1997.

Little R., and D. Rubin, Statistical Analysis with Missing Data, New York, Wiley, 1987. 

Interactions in ANOVA and Regression Analysis

Interactions are ignored only if you permit it. For historical reasons, ANOVA programs generally produce all possible interactions, while (multiple) regression programs generally do not produce any interactions - at least, not so routinely. So it's up to the user to construct interaction terms when using regression to analyze a problem where interactions are, or may be, of interest. (By "interaction terms" I mean variables that carry the interaction information, included as predictors in the regression model.)
The easiest construction is to multiply together the predictors whose interaction is to be included. When there are more than about three predictors, and especially if the raw variables take values that are distant from zero (like number of items right), the various products (for the numerous interactions that can be generated) tend to be highly correlated with each other, and with the original predictors. This is sometimes called "the problem of multicollinearity", although it would more accurately be described as spurious multicollinearity. It is possible, and often to be recommended, to adjust the raw products so as to make them orthogonal to the original variables (and to lower-order interaction terms as well).
What does it mean if the standard error is high? Multicollinearity is not the only factor that can cause large SE's for estimators of "slope" coefficients in regression models. SE's are inversely proportional to the range of variability in the predictor variable. For example, if you were estimating the linear association between weight (x) and some dichotomous outcome and x=(50,50,50,50,51,51,53,55,60,62), the SE would be much larger than if x=(10,20,30,40,50,60,70,80,90,100), all else being equal. There is a lesson here for the planning of experiments: to increase the precision of estimators, increase the range of the input. Another cause of large SE's is a small number of "event" observations or a small number of "non-event" observations (analogous to small variance in the outcome variable). This is not strictly controllable but will increase all estimator SE's (not just an individual SE). Yet another cause of high standard errors is serial correlation. This problem is frequent, if not typical, when using time series, since in that case the stochastic disturbance term will often reflect variables, not included explicitly in the model, that may change slowly as time passes.
In a linear model representing the variation in a dependent variable Y as a linear function of several explanatory variables, interaction between two explanatory variables X and W can be represented by their product: that is, by the variable created by multiplying them together. Algebraically such a model is represented by:
Y = a + b1 X + b2 W + b3 XW + e.
When X and W are category systems, this equation describes a two-way analysis of variance (ANOV) model; when X and W are (quasi-)continuous variables, this equation describes a multiple linear regression (MLR) model.
In ANOV contexts, the existence of an interaction can be described as a difference between differences: the difference in means between two levels of X at one value of W is not the same as the difference in the corresponding means at another value of W, and this not-the-same-ness constitutes the interaction between X and W; it is quantified by the value of b3.
In MLR contexts, an interaction implies a change in the slope (of the regression of Y on X) from one value of W to another value of W (or, equivalently, a change in the slope of the regression of Y on W for different values of X): in a two-predictor regression with interaction, the response surface is not a plane but a twisted surface (like "a bent cookie tin", in Darlington's (1990) phrase). The change of slope is quantified by the value of b3. For details, see Modelling and Interpreting Interactions in Multiple Regression.
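Both readings of b3 can be checked numerically with a short C++ sketch of the interaction model, ignoring the error term e. The struct and function names are hypothetical, and the difference-between-differences function assumes that X and W are each coded 0/1.

```cpp
#include <cassert>
#include <cmath>

// Coefficients of the model Y = a + b1 X + b2 W + b3 XW (error term omitted).
struct InteractionModel {
    double a;
    double b1;
    double b2;
    double b3;
};

// Expected value of Y at given values of X and W.
double predict(const InteractionModel& m, double x, double w) {
    return m.a + m.b1 * x + m.b2 * w + m.b3 * x * w;
}

// MLR reading: slope of the regression of Y on X at a fixed value of W.
double slope_of_x_at(const InteractionModel& m, double w) {
    return m.b1 + m.b3 * w;
}

// ANOV reading with 0/1 coding: the effect of X at W = 1 minus the
// effect of X at W = 0. This equals b3.
double difference_of_differences(const InteractionModel& m) {
    const double effect_at_w1 = predict(m, 1.0, 1.0) - predict(m, 0.0, 1.0);
    const double effect_at_w0 = predict(m, 1.0, 0.0) - predict(m, 0.0, 0.0);
    return effect_at_w1 - effect_at_w0;
}
```

With b3 = 0 the slope of Y on X is the same at every W and the response surface is a plane; a nonzero b3 changes the slope by b3 for each unit change in W.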


 
 
 
 

Distance Sampling

The term 'distance sampling' covers a range of methods for assessing wildlife abundance:
- line transect sampling, in which the distances sampled are distances of detected objects (usually animals) from the line along which the observer travels,
- point transect sampling, in which the distances sampled are distances of detected objects (usually birds) from the point at which the observer stands,
- cue counting, in which the distances sampled are distances from a moving observer to each detected cue given by the objects of interest (usually whales),
- trapping webs, in which the distances sampled are from the web center to trapped objects (usually invertebrates or small terrestrial vertebrates), and
- migration counts, in which the 'distances' sampled are actually times of detection during the migration of objects (usually whales) past a watch point.
Many mark-recapture models have been developed over the past 40 years. Monitoring of biological populations is receiving increasing emphasis in many countries. Data from marked populations can be used for the estimation of survival probabilities, how these vary by age, sex and time, and how they correlate with external variables. Immigration and emigration rates, population size and the proportion of age classes that enter the breeding population are often important but difficult to estimate with precision for free-ranging populations. The finite rate of population change and fitness are still more difficult to address in a rigorous manner.
For more details read:


Buckland S., D. Anderson, K. Burnham, and J. Laake, Distance Sampling: Estimating Abundance of Biological Populations, Chapman and Hall, London, 1993. 


Data Mining and Knowledge Discovery

The continuing rapid growth of on-line data and the widespread use of databases necessitate the development of techniques for extracting useful knowledge and for facilitating database access. The challenge of extracting knowledge from data is of common interest to several fields, including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing.
Data mining is an analytic process designed to explore large amounts of (typically business or market related) data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The process thus consists of three basic stages: exploration, model building or pattern definition, and validation/verification.
What distinguishes data mining from conventional statistical data analysis is that data mining is usually done for the purpose of "secondary analysis" aimed at finding unsuspected relationships unrelated to the purposes for which the data were originally collected.
Data warehousing is a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.
Data mining is now a rather vague term, but the element that is common to most definitions is "predictive modeling with large data sets as used by big companies". Therefore, data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential, for example, to help marketing managers "preemptively define the information market of tomorrow." Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools. Data mining answers business questions that traditionally were too time-consuming to resolve. Data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Data mining techniques can be implemented rapidly on existing software and hardware platforms across the large companies to enhance the value of existing resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client-server or parallel processing computers, data mining tools can analyze massive databases while a customer or analyst takes a coffee break, then deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"
Knowledge discovery in databases aims at tearing down the last barrier in enterprises' information flow, the data analysis step. It is a label for an activity performed in a wide variety of application domains within the science and business communities, as well as for pleasure. The activity uses a large and heterogeneous data-set as a basis for synthesizing new and relevant knowledge. The knowledge is new because hidden relationships within the data are explicated, and/or data is combined with prior knowledge to elucidate a given problem. The term relevant is used to emphasize that knowledge discovery is a goal-driven process in which knowledge is constructed to facilitate the solution to a problem.
Knowledge discovery may be viewed as a process containing many tasks. Some of these tasks are well understood, while others depend on human judgment in an implicit manner. Further, the process is characterized by heavy iterations between the tasks. This is very similar to many creative engineering processes, e.g., the development of dynamic models. In this reference mechanistic, or first principles based, models are emphasized, and the tasks involved in model development are defined by:


1. Initial data collection and problem formulation. The initial data are collected, and some more or less precise formulation of the modeling problem is developed.
2. Tools selection. The software tools to support modeling and allow simulation are selected.
3. Conceptual modeling. The system to be modeled, e.g., a chemical reactor, a power generator, or a marine vessel, is abstracted at first. The essential compartments and the dominant phenomena occurring are identified and documented for later reuse.
4. Model representation. A representation of the system model is generated. Often, equations are used; however, a graphical block diagram (or any other formalism) may alternatively be used, depending on the modeling tools selected above.
5. Implementation. The model representation is implemented using the means provided by the modeling system of the software employed. These may range from general programming languages to equation-based modeling languages or graphical block-oriented interfaces.
6. Verification. The model implementation is verified to really capture the intent of the modeler. No simulations for the actual problem to be solved are carried out for this purpose.
7. Initialization. Reasonable initial values are provided or computed, and the numerical solution process is debugged.
8. Validation. The results of the simulation are validated against some reference, ideally against experimental data.
9. Documentation. The modeling process, the model, and the simulation results during validation and application of the model are documented.
10. Model application. The model is used in some model-based process engineering problem solving task.
For other model types, like neural network models where data-driven knowledge is utilized, the modeling process will be somewhat different. Some of the tasks, like the conceptual modeling phase, will vanish.
Typical application areas for dynamic models are control, prediction, planning, and fault detection and diagnosis. A major deficiency of today's methods is the lack of ability to utilize a wide variety of knowledge. As an example, a black-box model structure has very limited abilities to utilize first principles knowledge on a problem. This has provided a basis for developing different hybrid schemes. Two hybrid schemes will highlight the discussion. First, it will be shown how a mechanistic model can be combined with a black-box model to represent a pH neutralization system efficiently. Second, the combination of continuous and discrete control inputs is considered, utilizing a two-tank example as a case. Different approaches to handle this heterogeneous case are considered.
The hybrid approach may be viewed as a means to integrate different types of knowledge, i.e., being able to utilize a heterogeneous knowledge base to derive a model. Standard practice today is that methods and software can treat large homogeneous data-sets. A typical example of a homogeneous data-set is time-series data from some system, e.g., temperature, pressure, and composition measurements over some time frame provided by the instrumentation and control system of a chemical reactor. If textual information of a qualitative nature is provided by plant personnel, the data becomes heterogeneous.
The above discussion will form the basis for analyzing the interaction between knowledge discovery, and modeling and identification of dynamic models. In particular, we will be interested in identifying how concepts from knowledge discovery can enrich the state of the art within control, prediction, planning, and fault detection and diagnosis of dynamic systems.

Further Readings:


Brodley C., T. Lane, and T. Stough, Knowledge Discovery and Data Mining, American Scientist, Jan.-Feb. 1999.
Chatfield Ch., Model Uncertainty, Data Mining and Statistical Inference, Journal of Royal Statistical Soc. Ser. A., 419-466, 1995.
Glymour C., D. Madigan, et al., Statistical themes and lessons for data mining, Data Mining and Knowledge Discovery, 1, 11-28, 1997.
Hand D., Data Mining: Statistics and More?, The American Statistician, 52(2), 1998.
Heckerman D., Bayesian networks for data mining, Data Mining and Knowledge Discovery, 1, 79-119, 1997.

Visit also the following Web sites: Data Mining, and SAS. 

Bayes and Empirical Bayes Methods

Bayes and empirical Bayes (EB) methods provide a structure for combining information from similar components and produce efficient inferences for both individual components and shared model characteristics. Many complex applied investigations are ideal settings for this type of synthesis. For example, county-specific disease incidence rates can be unstable due to small populations or low rates. 'Borrowing information' from adjacent counties by partial pooling produces better estimates for each county, and Bayes/empirical Bayes methods structure the approach. Importantly, recent advances in computing, and the consequent ability to evaluate complex models, have increased the popularity and applicability of Bayesian methods.
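The idea of partial pooling can be illustrated with a deliberately simple C++ sketch. It assumes a normal-normal model in which each component's observed value has a known sampling variance and the component-level true values share a common mean and variance that are taken as given; in an empirical Bayes analysis those two shared quantities would themselves be estimated from the data. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Partial-pooling (shrinkage) estimate for one component under an assumed
// normal-normal model.
//   observed     : the component's own estimate (e.g., a county rate)
//   sampling_var : variance of that estimate (large for small populations)
//   shared_mean  : mean of the true values across components (assumed known)
//   shared_var   : variance of the true values across components (assumed known)
double partially_pooled_estimate(double observed, double sampling_var,
                                 double shared_mean, double shared_var) {
    assert(sampling_var > 0.0 && shared_var >= 0.0);
    // Weight placed on the shared mean; noisier components borrow more.
    const double weight_on_shared = sampling_var / (sampling_var + shared_var);
    return weight_on_shared * shared_mean + (1.0 - weight_on_shared) * observed;
}
```

Under these assumptions an unstable estimate from a small county is pulled strongly toward the shared mean, while a precise estimate from a large county is left almost unchanged.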
Bayes and EB methods can be implemented using modern Markov chain Monte Carlo(MCMC) computational methods. Properly structured Bayes and EB procedures typically have good frequentist and Bayesian performance, both in theory and in practice. This in turn motivates their use in advanced high-dimensional model settings (e.g., longitudinal data or spatio-temporal mapping models), where a Bayesian model implemented via MCMC often provides the only feasible approach that incorporates all relevant model features.
Further Readings:


Bayes and Empirical Bayes Methods for Data Analysis, by Carlin B., and T. Louis, Chapman and Hall, 1996. 


Likelihood Methods

                 Direct                  Inverse
            ______________________________________________________
Decision    Neyman-Pearson          Bayesian (decision analysis)
            Wald                    (H. Rubin, e.g.)
            ------------------------------------------------------
Hybrid      "Standard" practice     Bayesian (subjective)
            ------------------------------------------------------
Inference   Early Fisher            fiducial (Fisher)
                                    Likelihood (Edwards)
                                    Bayesian (modern)
                                    belief functions (Shafer)
            ______________________________________________________
In the Direct schools, one uses Pr(data | hypothesis), usually from some model-based sampling distribution, but one does not attempt to give the inverse probability, Pr(hypothesis | data), nor any other quantitative evaluation of hypotheses. The Inverse schools do associate numerical values with hypotheses, either probabilities (Bayesian schools) or something else (Fisher, Edwards, Shafer).
The decision-oriented methods treat statistics as a matter of action, rather than inference, and attempt to take utilities as well as probabilities into account in selecting actions; the inference-oriented methods treat inference as a goal apart from any action to be taken.
The "hybrid" row could be more properly labeled as "hypocritical"-- these methods talk some Decision talk but walk the Inference walk.
Fisher's fiducial method is included because it is so famous, but the modern consensus is that it lacks justification.
Now it is true, under certain assumptions, some distinct schools advocate highly similar calculations, and just talk about them or justify them differently. Some seem to think this is tiresome or impractical. One may disagree, for three reasons:
First, how one justifies calculations goes to the heart of what the calculations actually MEAN; second, it is easier to teach things that actually make sense (which is one reason that standard practice is hard to teach); and third, methods that do coincide or nearly so for some problems may diverge sharply for others.
The difficulty with the subjective Bayesian approach is that prior knowledge is represented by a probability distribution, and this is more of a commitment than is warranted under conditions of partial ignorance. (Uniform or improper priors are just as bad in some respects as any other sort of prior.) The methods in the (Inference, Inverse) cell all attempt to escape this difficulty by presenting alternative representations of partial ignorance.
Edwards, in particular, uses logarithm of normalized likelihood as a measure of support for a hypothesis. Prior information can be included in the form of a prior support (log likelihood) function; a flat support represents complete prior ignorance.
One place where likelihood methods would deviate sharply from "standard" practice is in a comparison between a sharp and a diffuse hypothesis. Consider H0: X ~ N(0, 100) [diffuse] and H1: X ~ N(1, 1) [standard deviation 10 times smaller]. In standard methods, observing X = 2 would be undiagnostic, since it is not in a sensible tail rejection interval (or region) for either hypothesis. But while X = 2 is not inconsistent with H0, it is much better explained by H1--the likelihood ratio is about 6.2 in favor of H1. In Edwards' methods, H1 would have higher support than H0, by the amount log(6.2) = 1.8. (If these were the only two hypotheses, the Neyman-Pearson lemma would also lead one to a test based on likelihood ratio, but Edwards' methods are more broadly applicable.)
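The likelihood ratio quoted above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name normal_pdf is introduced here for illustration, and N(0, 100) is read as mean 0 and variance 100.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of N(mean, variance) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

x_obs = 2.0
like_h0 = normal_pdf(x_obs, mean=0.0, variance=100.0)  # diffuse hypothesis H0
like_h1 = normal_pdf(x_obs, mean=1.0, variance=1.0)    # sharp hypothesis H1

likelihood_ratio = like_h1 / like_h0              # evidence for H1 relative to H0
support_difference = math.log(likelihood_ratio)   # natural log, support in Edwards' sense

print(round(likelihood_ratio, 2), round(support_difference, 2))
```

The ratio comes out close to 6.2 and its natural logarithm close to 1.8, in agreement with the figures in the text.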
I do not want to appear to advocate likelihood methods. I could give a long discussion of their limitations and of alternatives that share some of their advantages but avoid their limitations. But it is definitely a mistake to dismiss such methods lightly. They are practical (currently widely used in genetics) and are based on a careful and profound analysis of inference. 


Prediction Interval

The idea is that if $\bar{X}$ is the mean of a random sample of size n from a normal population with variance $\sigma^2$, and Y is a single additional observation, then the statistic $\bar{X} - Y$ is normal with mean 0 and variance $(1 + 1/n)\sigma^2$.
Since we do not actually know $\sigma^2$, we estimate it by the sample variance $S^2$ and use the t distribution with n - 1 degrees of freedom. The appropriate prediction interval for Y is
$\bar{X} \pm t_{\alpha/2}\, S\, (1 + 1/n)^{1/2}$.
This is similar to the construction of the interval for an individual prediction in regression analysis.
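As an illustration, the interval can be computed as follows. This is a sketch only: the data are hypothetical, the function name prediction_interval is introduced here, and the critical value is passed in from a t table (2.262 for 95% coverage with 9 degrees of freedom) so that no statistical library is assumed.

```python
import math
import statistics

def prediction_interval(sample, t_crit):
    """Interval for one further observation: xbar +/- t_crit * S * sqrt(1 + 1/n).

    t_crit is the upper alpha/2 point of Student's t with n - 1 degrees of
    freedom, supplied by the caller (for example from a printed t table).
    """
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample standard deviation, n - 1 divisor
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample of size n = 10.
data = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
low, high = prediction_interval(data, t_crit=2.262)
print(round(low, 3), round(high, 3))
```

Note that the factor sqrt(1 + 1/n) makes this interval wider than the confidence interval for the mean, which uses sqrt(1/n).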

Fitting Data to a Broken Line

To fit data to a broken line, we must determine the parameters a, b, c, and d such that
y = a + b x, for x less than or equal to c
y = a - d c + (d + b) x, for x greater than or equal to c
A simple solution is a brute-force search across the values of c. Once c is known, estimating a, b, and d is straightforward through the use of indicator variables: the model can be written as y = a + b x + d (x - c) I(x > c), where I(x > c) is 1 when x exceeds c and 0 otherwise. One may use (x - c) as the independent variable, rather than x, for computational convenience.
Fix c at each point of a fine grid of x values within the range of the data, estimate a, b, and d, and note the mean squared error. Select the value of c that minimizes the mean squared error.
Unfortunately, this procedure does not give confidence intervals for c, and the confidence intervals for the remaining parameters are conditional on the selected value of c.
For more details, see Applied Regression Analysis, by Draper and Smith, Wiley, 1981, Chapter 5, Section 5.4 on the use of dummy variables, Example 6.
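The brute-force search can be sketched as follows. This is an illustrative implementation, not the one in Draper and Smith: the function names, the noise-free test data, and the grid are all assumptions, and the three-parameter least squares problem is solved through the normal equations to keep the sketch free of external libraries.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_given_c(xs, ys, c):
    """Least squares for y = a + b*x + d*max(x - c, 0); returns (a, b, d, mse)."""
    rows = [[1.0, x, max(x - c, 0.0)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a, b, d = solve_linear(xtx, xty)
    sse = sum((y - (a + b * r[1] + d * r[2])) ** 2 for r, y in zip(rows, ys))
    return a, b, d, sse / len(xs)

def fit_broken_line(xs, ys, grid):
    """Brute-force search: keep the candidate c with the smallest mean squared error."""
    best = None
    for c in grid:
        a, b, d, mse = fit_given_c(xs, ys, c)
        if best is None or mse < best[4]:
            best = (a, b, c, d, mse)
    return best

# Hypothetical noise-free data with a true break at x = 5 (a = 1, b = 2, d = -1.5).
xs = [float(i) for i in range(11)]
ys = [1.0 + 2.0 * x - 1.5 * max(x - 5.0, 0.0) for x in xs]
grid = [2.0 + 0.5 * i for i in range(13)]   # candidate break points 2.0, 2.5, ..., 8.0
a, b, c, d, mse = fit_broken_line(xs, ys, grid)
print(a, b, c, d, mse)
```

The grid is kept inside the range of the data so that each candidate c leaves observations on both sides of the break; otherwise the design matrix becomes singular.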

 Two Parallel Regression Lines

To determine whether two regression lines are parallel, construct the following multiple linear regression model:
E(y) = b0 + b1 X1 + b2 X2 + b3 X3

where X1 = interval predictor variable,
      X2 = 1 if group 1, 0 if group 0,
and   X3 = X1 X2 (the product of X1 and X2).

Then, E(y|group=0) = b0 + b1 X1
and   E(y|group=1) = b0 + b1 X1 + b2 (1) + b3 X1 (1)
                   = (b0 + b2) + (b1 + b3) X1
That is, E(y|group=1) is a simple regression with a potentially different slope and intercept compared to group=0.
Ho: slope(group 1) = slope(group 0) is equivalent to Ho: b3 = 0
Use the t-test for b3 from the variables-in-the-equation table to test this hypothesis.
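The test of b3 can be sketched without a regression package, because the interaction model fits a separate line to each group: b3 is the difference of the two fitted slopes, and its standard error uses the pooled residual variance on n0 + n1 - 4 degrees of freedom. The function names and the data below are hypothetical.

```python
import math

def simple_fit(xs, ys):
    """Ordinary least squares line; returns intercept, slope, Sxx, and residual sum of squares."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sxx, sse

def interaction_t(x0, y0, x1, y1):
    """Estimate b3 (slope of group 1 minus slope of group 0) and its t statistic."""
    _, slope0, sxx0, sse0 = simple_fit(x0, y0)
    _, slope1, sxx1, sse1 = simple_fit(x1, y1)
    df = len(x0) + len(x1) - 4                  # four regression coefficients estimated
    mse = (sse0 + sse1) / df                    # pooled residual variance
    b3 = slope1 - slope0
    se_b3 = math.sqrt(mse * (1.0 / sxx0 + 1.0 / sxx1))
    return b3, b3 / se_b3, df

# Hypothetical data for the two groups.
x_g0, y_g0 = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
x_g1, y_g1 = [1, 2, 3, 4, 5], [3.0, 5.1, 6.9, 9.2, 10.8]
b3, t_stat, df = interaction_t(x_g0, y_g0, x_g1, y_g1)
print(round(b3, 3), round(t_stat, 3), df)
```

The t statistic is then referred to Student's t distribution with df degrees of freedom; a small absolute value, as here, gives no evidence against parallel lines.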

Constrained Regression Model

If you fit a regression forcing the intercept to be zero, the standard error of the slope is smaller. That seems counter-intuitive: the intercept should be included in the model because it is significant, so why is the standard error of the slope in the worse-fitting model actually smaller?
This is initially counter-intuitive, but there are two reasons why it is true. The variance of the slope estimate for the constrained model is $s^2 / \sum X_i^2$, where the $X_i$ are the actual X values and $s^2$ is estimated from the residuals. The variance of the slope estimate for the unconstrained model (with intercept) is $s^2 / \sum x_i^2$, where the $x_i = X_i - \bar{X}$ are deviations from the mean and $s^2$ is again estimated from the residuals. So the constrained model can have a larger $s^2$ (mean square error, or "residual" mean square, and hence standard error of estimate) but a smaller standard error of the slope, because the denominator is larger.
r2 also behaves very strangely in the constrained model; by the conventional formula, it can be negative; by the formula used by most computer packages, it is generally larger than the unconstrained r2 because it is dealing with deviations from 0, not deviations from the mean. This is because, in effect, constraining the intercept to 0 forces us to act as if the mean of X and the mean of Y both were 0.
Once you recognize that the s.e. of the slope isn't really a measure of overall fit, the result starts to make a lot of sense. Assume that all your X and Y are positive. If you're forced to fit the regression line through the origin (or any other point) there will be less "wiggle" in how you can fit the line to the data than there would be if both "ends" could move.
Consider a set of points that are all far from zero. If you force the regression through zero, the line passes through the origin and through the cluster of points, so the slope appears to be estimated with little error; yet the line may describe the trend within the data poorly and has little validity. Therefore, a no-intercept model is hardly ever appropriate.
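A small numerical sketch makes the point concrete. The data are hypothetical and chosen to lie far from the origin; the two functions apply the variance formulas given above, with n - 2 residual degrees of freedom for the model with intercept and n - 1 for the model through the origin.

```python
import math

def slope_with_intercept(xs, ys):
    """Fit y = a + b*x; return (b, standard error of b, residual mean square)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)          # deviations from the mean
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    s2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return b, math.sqrt(s2 / sxx), s2

def slope_through_origin(xs, ys):
    """Fit y = b*x; return (b, standard error of b, residual mean square)."""
    n = len(xs)
    sum_x2 = sum(x * x for x in xs)                 # raw X values, not deviations
    b = sum(x * y for x, y in zip(xs, ys)) / sum_x2
    s2 = sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 1)
    return b, math.sqrt(s2 / sum_x2), s2

# Hypothetical data lying far from the origin.
xs = [10.0, 11.0, 12.0, 13.0, 14.0]
ys = [14.5, 16.6, 16.4, 18.7, 18.8]
b_full, se_full, s2_full = slope_with_intercept(xs, ys)
b_zero, se_zero, s2_zero = slope_through_origin(xs, ys)
print(se_zero < se_full, s2_zero > s2_full)
```

For these data the no-intercept model has the larger residual mean square but the smaller slope standard error, because the sum of squared raw X values (730) is far larger than the sum of squared deviations (10).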

Semiparametric and Non-parametric modeling

Many parametric regression models in applied science have a form like response = function(X1,..., Xp, unknown influences). The "response" may be a decision (to buy a certain product), which depends on p measurable variables and an unknown remainder term. In statistics, the model is usually written as
Y = m( X1, ..., Xp) + e
and the unknown e is interpreted as an error term.
The simplest model for this problem is the linear regression model; an often-used generalization is the Generalized Linear Model (GLM)
Y= G(X1b1 + ... + Xpbp) + e
where G is called the link function. All these models lead to the problem of estimating a multivariate regression. Parametric regression estimation has the disadvantage that the parametric "form" already implies certain properties of the resulting estimate.
Nonparametric techniques allow diagnostics of the data without this restriction. However, this requires large sample sizes and causes problems in graphical visualization. Semiparametric methods are a compromise between both: they support a nonparametric modeling of certain features and profit from the simplicity of parametric methods.
Further Readings:


Härdle W., S. Klinke, and B. Turlach, XploRe: An Interactive Statistical Computing Environment, Springer, New York, 1995. 


Moderation and Mediation

"Moderation" is an interactional concept: a moderator variable "modifies" the relationship between two other variables. "Mediation", by contrast, is a "causal modeling" concept: the "effect" of one variable on another is "mediated" through a third variable. That is, in the case of complete mediation there is no "direct effect", but rather an "indirect effect."

Discriminant and Classification

Classification or discrimination involves learning a rule whereby a new observation can be classified into a pre-defined class. Current approaches can be grouped into three historical strands: statistical, machine learning and neural network. The classical statistical methods make distributional assumptions. There are many others which are distribution free, and which require some regularization so that the rule performs well on unseen data. Recent interest has focused on the ability of classification methods to be generalized.
We often need to classify individuals into two or more populations based on a set of observed "discriminating" variables. Methods of classification are used when discriminating variables are:
    1. quantitative and approximately normally distributed;
    2. quantitative but possibly nonnormal;
    3. categorical; or
    4. a combination of quantitative and categorical.
It is important to know when and how to apply linear and quadratic discriminant analysis, nearest neighbor discriminant analysis, logistic regression, categorical modeling, classification and regression trees, and cluster analysis to solve the classification problem. SAS has the routines needed to apply these classification methods properly. Relevant topics are: matrix operations, Fisher's discriminant analysis, nearest neighbor discriminant analysis, logistic regression and categorical modeling for classification, and cluster analysis.
For example, two related methods which are distribution free are the k-nearest neighbor classifier and the kernel density estimation approach. In both methods, there are several problems of importance: the choice of smoothing parameter(s) or k, and choice of appropriate metrics or selection of variables. These problems can be addressed by cross-validation methods, but this is computationally slow. An analysis of the relationship with a neural net approach (LVQ) should yield faster methods.
Further Readings:


Cherkassky V, and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John Wiley & Sons, 1998.

Visit also the Web site Tree-Structured & Rules Induction Programs Homepage 

Generalized Linear and Logistic Models

The generalized linear model (GLM) is possibly the most important development in practical statistical methodology in the last twenty years. Generalized linear models provide a versatile modeling framework in which a function of the mean response is "linked" to the covariates through a linear predictor and in which variability is described by a distribution in the exponential dispersion family. These models include logistic regression and log-linear models for binomial and Poisson counts together with normal, gamma and inverse Gaussian models for continuous responses. Standard techniques for analyzing censored survival data, such as the Cox regression, can also be handled within the GLM framework. Relevant topics are: Normal theory linear models, Inference and diagnostics for GLMs, Binomial regression, Poisson regression, Methods for handling overdispersion, Generalized estimating equations (GEEs).
Here is how to obtain the number of degrees of freedom for the likelihood ratio statistic (the difference in 2 log-likelihood) in a logistic regression. Degrees of freedom pertain to the dimension of the vector of parameters for a given model. Suppose we know that a model ln(p/(1-p)) = Bo + B1x + B2y + B3w fits a set of data. In this case the vector B = (Bo, B1, B2, B3) is an element of 4-dimensional Euclidean space, or R4.
Suppose we want to test the hypothesis Ho: B3 = 0. We are imposing a restriction on our parameter space. The vector of parameters must be of the form B' = (Bo, B1, B2, 0). This vector is an element of a 3-dimensional subspace of R4, namely the subspace defined by B3 = 0. The likelihood ratio statistic has the form:
2 log(maximum unrestricted likelihood / maximum restricted likelihood) =
2 log(maximum unrestricted likelihood) - 2 log(maximum restricted likelihood)
Its degrees of freedom are the 4 dimensions of the unrestricted B vector minus the 3 dimensions of the restricted B vector, that is, 1 degree of freedom. This is the dimension of the difference vector B'' = B - B' = (0, 0, 0, B3), which lies in a one-dimensional subspace of R4.
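For a single restriction, the likelihood ratio statistic is referred to the chi-square distribution with 1 degree of freedom, whose upper tail has a closed form. The sketch below assumes two hypothetical maximized log likelihoods; the function name lr_test_1df is introduced here for illustration.

```python
import math

def lr_test_1df(loglik_unrestricted, loglik_restricted):
    """Likelihood ratio statistic and its p-value on 1 degree of freedom.

    For 1 df, P(chi-square > g) = P(|Z| > sqrt(g)) = erfc(sqrt(g / 2)),
    where Z is a standard normal variable.
    """
    g = 2.0 * (loglik_unrestricted - loglik_restricted)
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

# Hypothetical maximized log likelihoods for the 4- and 3-parameter models.
g, p = lr_test_1df(-120.30, -122.22)
print(round(g, 2), round(p, 3))
```

With more than one restriction the degrees of freedom equal the number of parameters set to zero, and a general chi-square tail routine would be needed in place of the 1 df shortcut used here.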
The standard textbook is Generalized Linear Models by McCullagh and Nelder (Chapman & Hall, 1989).
    LOGISTIC REGRESSION VAR=x
    /METHOD=ENTER y x1 x2 f1ros f1ach f1grade bylocus byses
    /CONTRAST (y)=Indicator 
    /contrast (x1)=indicator 
    /contrast (x2)=indicator
    /CLASSPLOT /CASEWISE OUTLIER(2)
    /PRINT=GOODFIT
    /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .


Spearman's Correlation, and Kendall's tau Application

How would you compare the values of two variables to determine whether they are ordered the same? For example:
          Var1    Var2
Obs 1      x       x
Obs 2      y       z
Obs 3      z       y
Is Var1 ordered the same as Var2? Two measures are Spearman's rank order correlation, and Kendall's tau. For more details see, e.g., Fundamental Statistics for the Behavioral Sciences, by David C. Howell, Duxbury Pr., 1995. 
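Both measures can be computed by hand for the small example above. The sketch below assumes the ordering x < y < z, so that Var1 has ranks 1, 2, 3 and Var2 has ranks 1, 3, 2, and it uses the textbook formulas for data without ties; the function names are introduced here for illustration.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for untied ranks: (concordant - discordant) / (n * (n - 1) / 2)."""
    n = len(rank_a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            score += 1 if product > 0 else -1   # no ties assumed, so product is never 0
    return score / (n * (n - 1) / 2)

# Ranks of the three observations, assuming x < y < z.
var1 = [1, 2, 3]   # x, y, z
var2 = [1, 3, 2]   # x, z, y
rho = spearman_rho(var1, var2)
tau = kendall_tau(var1, var2)
print(rho, tau)
```

Here rho is 0.5 and tau is 1/3: two of the three pairs of observations are ordered the same way in both variables, and one pair is reversed.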

Repeated Measures and Longitudinal Data

Repeated measures and longitudinal data require special attention because they involve correlated data that commonly arise when the primary sampling units are measured repeatedly over time or under different conditions. Normal theory models for split-plot experiments and repeated measures ANOVA can be used to introduce the concept of correlated data. PROC GLM and PROC MIXED in the SAS system may be used. Mixed linear models provide a general framework for modeling covariance structures, a critical first step that influences parameter estimation and tests of hypotheses. The primary objectives are to investigate trends over time and how they relate to treatment groups or other covariates. Techniques applicable to non-normal data, such as McNemar's test for binary data, weighted least squares for categorical data, and generalized estimating equations (GEE), are the main topics. The GEE method can be used to accommodate correlation when the means at each time point are modeled using a generalized linear model.
Relevant topics are: balanced split-plot and repeated measures designs, modeling covariance structures of repeated measures, repeated measures with unequally spaced times and missing data, the weighted least squares approach to repeated categorical data, the generalized estimating equation (GEE) method for marginal models, subject-specific versus population-averaged interpretation of regression coefficients, and computer implementation using S-Plus and the SAS system. The following describes McNemar's test for binary data.
McNemar Change Test: For the yes/no questions under the two conditions, set up a 2x2 contingency table:
        f11     f10
        f01     f00
McNemar's test of correlated proportions is z = (f01 - f10)/sqrt(f01 + f10).
For those items yielding a score on a scale, the conventional t-test for correlated samples would be appropriate, or the Wilcoxon signed-ranks test. 
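The McNemar statistic depends only on the two discordant cells. A minimal sketch with hypothetical counts follows; the two-sided p-value uses the standard normal approximation, and the function names are introduced here for illustration.

```python
import math

def mcnemar_z(f01, f10):
    """McNemar's z for correlated proportions, based on the discordant counts only."""
    return (f01 - f10) / math.sqrt(f01 + f10)

def two_sided_p(z):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical discordant counts: 15 subjects changed one way, 5 the other way.
z = mcnemar_z(f01=15, f10=5)
p = two_sided_p(z)
print(round(z, 3), round(p, 4))
```

The concordant cells f11 and f00 do not enter the calculation, since subjects who answer the same way under both conditions carry no information about change.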

What Is a Systematic Review?

Health care decision makers need to access research evidence to make informed decisions on diagnosis, treatment and health care management for both individual patients and populations. Systematic reviews are recognized as one of the most useful and reliable tools to assist this practice of evidence-based health care. These courses aim to train health care professionals and researchers in the science and methods of systematic reviews.
There are few important questions in health care which can be informed by consulting the result of a single empirical study. Systematic reviews attempt to provide answers to such problems by identifying and appraising all available studies within the relevant focus and synthesizing their results, all according to explicit methodologies. The review process places special emphasis on assessing and maximizing the value of data, both in issues of reducing bias and minimizing random error. The systematic review method is most suitably applied to questions of patient treatment and management, although it has also been applied to answer questions regarding the value of diagnostic test results, likely prognoses and the cost-effectiveness of health care. 

Incidence and Prevalence Rates

Incidence rate (IR) is the rate at which new events occur in a population. It is defined as: the number of new events in a specified period divided by the number of persons exposed to risk during this period.
Prevalence rate (PR) measures the number of cases that are present at a specified time. It is defined as: the number of cases present at a specified time divided by the number of persons at risk at that specified time.
These two measures are related through the average duration (D) of the condition. That is, PR = IR x D, approximately, when the rates are stable over time and the prevalence is low.
Note that, for example, county-specific disease incidence rates can be unstable due to small populations or low rates. In epidemiology one can say that IR reflects the probability of becoming sick at a given age, while PR reflects the probability of being sick at a given age.

Software Selection

You have to be careful when selecting software. A short list of items for comparison is:
1) Ease of learning,
2) Amount of help incorporated for the user,
3) Level of the user,
4) Number of tests and routines involved,
5) Ease of data entry,
6) Data validation (and if necessary, data locking and security),
7) Accuracy of the tests and routines,
8) Integrated data analysis (graphs and progressive reporting on analysis in one screen),
9) Cost.

No single software package meets everyone's needs. Determine the needs first and then ask the questions relevant to the above nine criteria.


Box-Cox Power Transformation

In certain cases the data distribution is not normal (Gaussian), and we wish to find the best transformation of the variable in order to obtain a Gaussian distribution for further statistical processing.
Among others, the Box-Cox power transformation is often used for this purpose:
        $y = (x^p - 1)/p$, for p not zero
        $y = \log x$,      for p = 0
Trying different values of p between -3 and +3 is usually sufficient, but there are MLE methods for estimating the best p. A good source on this and other transformation methods is


Madansky A., Prescriptions for working Statisticians, Springer-Verlag, 1988.

For percentages or proportions (such as binomial proportions), arcsine transformations would work better. The original idea of the arcsine transformation, arcsin(sqrt(p)), is to make the variances equal for all groups. The arcsine transform is derived analytically to be the variance-stabilizing and normalizing transformation. The same limit theorem also leads to the square root transform for Poisson variables (such as counts) and to the arc hyperbolic tangent (i.e., Fisher's Z) transform for correlations. The arcsine test yields a z and the 2x2 contingency test yields a chi-square; for large sample sizes, $z^2$ = chi-square. A good source is


Rao C., Linear Statistical Inference and Its Applications, Wiley, 1973.

To normalize a set of data consisting of negative and positive values so that all values lie in the range 0.0 to 1.0, define XNew = (X - min)/(max - min).
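The Box-Cox transformation and the min-max rescaling described in this section can be sketched as follows. The function names and the data are hypothetical, and the Box-Cox function assumes strictly positive input.

```python
import math

def box_cox(x, p):
    """Box-Cox transform of a positive value x: (x**p - 1) / p, or log(x) when p = 0."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires positive data")
    return math.log(x) if p == 0 else (x ** p - 1.0) / p

def min_max(values):
    """Rescale values to the range 0.0 to 1.0 via (X - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Square-root-type transform (p = 0.5) and the log limit (p = 0).
print(box_cox(4.0, 0.5), box_cox(math.e, 0))

# Hypothetical data with negative and positive values.
data = [-4.0, -1.0, 0.0, 2.0, 6.0]
scaled = min_max(data)
print(scaled)
```

In practice one would evaluate box_cox over a grid of p values, for example from -3 to +3, and keep the p that makes the transformed data look most nearly Gaussian.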

Multiple Comparison Tests

Multiple comparison procedures include topics such as control of the family-wise error rate, the closure principle, hierarchical families of hypotheses, single-step and stepwise procedures, and p-value adjustments. Areas of application include multiple comparisons among treatment means, multiple endpoints in clinical trials, multiple sub-group comparisons, etc.
Nemenyi's multiple comparison test is analogous to Tukey's test, using rank sums in place of means and using $\sqrt{n^2 k (nk+1)/12}$ as the estimate of the standard error (SE), where n is the size of each sample and k is the number of samples (means). As in the Tukey test, you compare (rank sum A - rank sum B)/SE to the studentized range for k. It is also equivalent to the Dunn/Miller test, which uses mean ranks and standard error $\sqrt{k(nk+1)/12}$.

Antedependent Modeling for Repeated Measurements

Repeated measures data arise when observations are taken on each experimental unit on a number of occasions, and time is a factor of interest.
Many techniques can be used to analyze such data. Antedependence modeling is a recently developed method which models the correlations between observations at different times. 


Sequential Acceptance Sampling

Acceptance sampling is a quality control procedure used when a decision on the acceptability of the batch has to be made from tests done on a sample of items from the batch.
Sequential acceptance sampling minimizes the number of items tested when the early results show that the batch clearly meets, or fails to meet, the required standards.
The procedure has the advantage of requiring fewer observations, on average, than fixed sample size tests for a similar degree of accuracy. 

Local Influence

Cook's distance measures the effect of removing a single observation on regression estimates. This can be viewed as giving an observation a weight of either zero or one: local influence allows this weight to be small but non-zero.
Cook defined local influence in 1986, and made some suggestions on how to use or interpret it; various slight variations have been defined since then. But problems associated with its use have been pointed out by a number of workers since the very beginning. 


Credit Scoring

Credit scoring is now in widespread use across the retail credit industry. At its simplest, a credit scorecard is a model, usually statistical, but in use it is embedded in a computer and/or human process.

Components of the Interest Rates

The interest rates as quoted in the newspapers and by banks consist of several components. The most important three are:
The pure rate: This is the time value of money. A promise of 100 units next year is not worth 100 units this year.
The price-premium factor: If prices go up 5% each year, interest rates go up at least 5%. For example, under the Carter Administration, prices rose about 15% per year for a couple of years, interest was around 25%. Same thing during the Civil War. In a deflationary period, prices may drop so this term can be negative.
The risk factor: A junk bond may pay a larger rate than a treasury note because of the chance of losing the principal. Banks in a poor financial condition must pay higher rates to attract depositors for the same reason. Threat of confiscation by the government leads to high rates in some countries.
Other factors are generally minor. Of course, the customer sees only the sum of these terms. These components themselves fluctuate at different rates, which makes it hard to compare interest rates across disparate time periods or economic conditions. The main questions are: how are these components combined to form the index? A simple sum? A weighted sum? In most cases the index is formed both empirically and by weights assigned on the basis of some criterion of importance. The same applies to other index numbers.

Partial Least Squares

Partial Least Squares (PLS) regression is a multivariate data analysis technique which can be used to relate several response (Y) variables to several explanatory (X) variables.
The method aims to identify the underlying factors, or linear combination of the X variables, which best model the Y dependent variables. 

Growth Curve Modeling

Growth is a fundamental property of biological systems, occurring at the level of populations, individual animals and plants, and within organisms. Much research has been devoted to modeling growth processes, and there are many ways of doing this: mechanistic models, time series, stochastic differential equations etc.
Sometimes we simply wish to summarize growth observations in terms of a few parameters, perhaps in order to compare individuals or groups. Many growth phenomena in nature show an "S" shaped pattern, with initially slow growth speeding up before slowing down to approach a limit.
These patterns can be modelled using several mathematical functions such as generalized logistic and Gompertz curves. 
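Two common S-shaped curves can be written down directly. The sketch below uses the ordinary three-parameter logistic (a special case of the generalized logistic) and one common parameterization of the Gompertz curve; the parameter names and values are assumptions made for illustration.

```python
import math

def logistic(t, upper, rate, midpoint):
    """Logistic growth: symmetric S-curve reaching half of `upper` at t = midpoint."""
    return upper / (1.0 + math.exp(-rate * (t - midpoint)))

def gompertz(t, upper, displacement, rate):
    """Gompertz growth: asymmetric S-curve approaching the limit `upper`."""
    return upper * math.exp(-displacement * math.exp(-rate * t))

# Hypothetical parameters: limiting size 100, evaluated at a few time points.
for t in (0, 5, 10, 20):
    print(t, round(logistic(t, 100.0, 0.5, 10.0), 2), round(gompertz(t, 100.0, 2.0, 0.5), 2))
```

Both curves start with slow growth, speed up, and then level off toward the upper limit; the Gompertz curve reaches its inflection point earlier relative to its limit than the logistic does.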

Saturated Model & Saturated Log Likelihood

A saturated model is usually one that has no residual degrees of freedom. The "saturated" log likelihood (LL) is the LL for a saturated model. It is often used when comparisons are made between the log likelihood with an intercept only and the log likelihood for a particular model specification.

Pattern recognition and Classification

Pattern recognition and classification are fundamental concepts for understanding living systems and essential for realizing artificial intelligent systems. Applications include 3D modeling, motion analysis, feature extraction, device positioning and calibration, feature recognition, and solutions to classification problems in industrial and medical applications.

Spatial Statistics

Many natural phenomena involve a random distribution of points in space. Biologists who observe the locations of cells of a certain type in an organ, astronomers who plot the positions of the stars, botanists who record the positions of plants of a certain species and geologists detecting the distribution of a rare mineral in rock are all observing spatial point patterns in two or three dimensions. Such phenomena can be modelled by spatial point processes.
References:


Diggle P., The Statistical Analysis of Spatial Point Patterns, Academic Press, 1983.
Ripley B., Spatial Statistics, Wiley, 1981.


What Is a Regression Tree

A regression tree is like a classification tree, only with a continuous target (dependent) variable. Prediction of target value for a particular case is made by assigning that case to a node (based on values for the predictor variables) and then predicting the value of the case as the mean of its node (sometimes adjusted for priors, costs, etc.).
Reference:


Breiman L., Friedman, Olshen, and Stone, Classification and Regression Trees, Chapman & Hall, 1983.


Cluster Analysis for Correlated Variables

Cluster analysis is used to classify observations with respect to a set of variables. The widely used Ward's method is predisposed to find spherical clusters and may perform badly with very ellipsoidal clusters generated by highly correlated variables (within clusters).
To deal with high correlations, some model-based methods are implemented in the S-Plus package. However, a limitation of their approach is the need to assume the clusters have a multivariate normal distribution, as well as the need to decide in advance what the likely covariance structure of the clusters is.
Another option is to combine the principal component analysis with cluster analysis.
Further Readings:


Baxter M., Exploratory Multivariate Analysis in Archaeology, pp. 167-170, Edinburgh University Press, Edinburgh, 1994.

Manly F., Multivariate Statistical Methods: A Primer, Chapman and Hall, London, 1986. 

A Summary of Forecasting Methods


Widely used especially for short to intermediate term analysis-forecasting the value of items affected by factors other than time-simple regression when only one explanatory factor considered-can be done on a hand calculator.
Multiple Regression Analysis: Used when two or more independent factors are involved; widely used for intermediate-term forecasting. It is used to assess which factors to include and which to exclude, and can be used to develop alternative models with different factors.
Nonlinear Regression: Does not assume a linear relationship between variables; frequently used when time is the independent variable.
Trend Analysis: Uses linear and nonlinear regression with time as the explanatory variable; used where there is a pattern over time.
Decomposition Analysis: Used to identify several patterns that appear simultaneously in a time series; time-consuming each time it is used. It is also used to deseasonalize a series.
Moving Average Analysis: Simple moving averages forecast future values based on an average of recent past values; easy to update.
Weighted Moving Averages: Very powerful and economical. They are widely used where repeated forecasts are required, and use methods such as sum-of-the-digits and trend adjustment.
Adaptive Filtering: A type of moving average that includes a method of learning from past errors; it can respond to changes in the relative importance of trend, seasonal, and random factors.
Exponential Smoothing: A moving-average form of time series forecasting; efficient to use with seasonal patterns, easy to adjust for past errors, and easy to use for preparing follow-on forecasts. It is ideal for situations where many forecasts must be prepared. Several different forms are used, depending on the presence of trend or cyclical variations.
Hodrick-Prescott Filter: A smoothing mechanism used to obtain a long-term trend component in a time series. It decomposes a given series into stationary and nonstationary components in such a way that the sum of squared deviations of the series from the nonstationary component is minimized, subject to a penalty on changes in the derivatives of the nonstationary component.
Modeling and Simulation: A model describes the situation through a series of equations and allows testing of the impact of changes in various factors. It is substantially more time-consuming to construct and generally requires user programming or the purchase of packages such as SIMSCRIPT. It can be very powerful in developing and testing strategies that would otherwise not be evident.
Certainty Models: Give only the most likely outcome. "What if" analysis is often done with computer-based spreadsheets.
Probabilistic Models: Use Monte Carlo simulation techniques to deal with uncertainty; they give a range of possible outcomes for each set of events.
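As an illustration of the probabilistic approach, the following minimal Python sketch runs a Monte Carlo simulation and reports a range of outcomes rather than a single most likely value. The profit model, the distributions, and every number in it are hypothetical and chosen only for the example.

```python
import random


def simulate_profit(n_trials=10_000, seed=42):
    """Monte Carlo sketch: profit = units * margin - fixed cost.

    All inputs are hypothetical; a real model would use equations and
    distributions fitted to the situation being studied.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_trials):
        units = rng.gauss(1000.0, 150.0)   # uncertain demand
        margin = rng.uniform(4.0, 6.0)     # uncertain margin per unit
        outcomes.append(units * margin - 3000.0)
    outcomes.sort()
    # Report a range of possible outcomes instead of one number.
    return {
        "p05": outcomes[int(0.05 * n_trials)],
        "median": outcomes[n_trials // 2],
        "p95": outcomes[int(0.95 * n_trials)],
    }


result = simulate_profit()
```

A certainty model would report only the single value 1000 * 5 - 3000 = 2000; the simulation adds the spread around it.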
Forecasting error: All forecasting models have either an implicit or explicit error structure, where error is defined as the difference between the model prediction and the "true" value. Additionally, many data snooping methodologies within the field of statistics need to be applied to data supplied to a forecasting model. Also, diagnostic checking, as defined within the field of statistics, is required for any model which uses data.
Whatever method is used for forecasting, one must use a performance measure to assess its quality. Mean Absolute Deviation (MAD) and variance are the most useful measures. However, MAD does not lend itself to further use in making inferences, whereas the standard error does. For error analysis, variance is preferred because the variances of independent (uncorrelated) errors are additive; MAD is not additive.
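The two performance measures can be computed directly from the forecast errors. The sketch below is a minimal Python illustration with made-up data; it uses the mean squared error, which equals the error variance when the mean error is zero. The function names are illustrative only.

```python
def forecast_errors(actual, forecast):
    """Error = model prediction minus the actual value."""
    return [f - a for a, f in zip(actual, forecast)]


def mean_absolute_deviation(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    return sum(e * e for e in errors) / len(errors)


# Hypothetical actual values and forecasts for four periods.
actual = [102.0, 98.0, 105.0, 110.0]
forecast = [100.0, 100.0, 103.0, 107.0]

errors = forecast_errors(actual, forecast)   # [-2.0, 2.0, -2.0, -3.0]
mad = mean_absolute_deviation(errors)        # 9 / 4
mse = mean_squared_error(errors)             # 21 / 4
```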


How to Do Forecasting by a Regression Analysis

Regression is the study of relationships among variables, a principal purpose of which is to predict, or estimate the value of one variable from known or assumed values of other variables related to it.
Variables of Interest: To make predictions or estimates we must identify the effective predictors of the variable of interest: which variables are important indicators and can be measured at the least cost, which carry only a little information, and which are redundant.
Predicting the Future: Predicting a change over time or extrapolating from present conditions to future conditions is not the function of regression analysis. To make estimates of the future, use time series analysis.
Experiment: Begin with a hypothesis about how several variables might be related to another variable and the form of the relationship.
Types of Analysis
Simple Linear Regression: A regression using only one predictor is called a simple regression.
Multiple Regression: Where there are two or more predictors, multiple regression analysis is employed.
Data: Since it is usually unrealistic to obtain information on an entire population, a sample, which is a subset of the population, is usually selected. The sample may be randomly selected, or the researcher may choose the x-values based on the capability of the equipment used in the experiment or on the experimental design. Where the x-values are preselected, usually only limited inferences can be drawn, depending upon the particular values chosen. When both x and y are randomly drawn, inferences can generally be drawn over the range of values in the sample.
Scatter Diagram: A graphical representation of the pairs of data called a scatter diagram can be drawn to gain an overall view of the problem. Is there an apparent relationship? Direct? Inverse? If the points lie within a band described by parallel lines we can say there is a linear relationship between the pair of x and y values. If the rate of change is generally not constant, then the relationship is curvilinear.
The Model: If we have determined that there is a linear relationship between t (the predictor, called x above) and y, we want a linear equation stating y as a function of t in the form Y = a + bt + e, where a is the intercept, b is the slope, and e is the error term accounting for variables that affect y but are not included as predictors, and/or otherwise unpredictable and uncontrollable factors.
Least Squares Method: To predict the mean y-value for a given t-value, we need a line which passes through the mean values of both t and y and which minimizes the sum of the squared vertical distances between the points and the predictive line. Such an approach should result in a line which we can call a "best fit" to the sample data. The least squares method achieves this result by finding the line with the minimum sum of squared deviations between the sample y points and the estimated line. The procedure used for finding the values of a and b reduces to the solution of simultaneous linear equations. Shortcut formulas have been developed as an alternative to the solution of simultaneous equations.
Solution Methods: Techniques of matrix algebra can be manually employed to solve simultaneous linear equations. When performing manual computations, this technique is especially useful when there are more than two equations and two unknowns.
Several well-known computer packages are widely available to relieve the user of the computational problem, and all of them can be used to fit both linear and polynomial equations: the BMD packages (Biomedical Computer Programs) from UCLA; SPSS (Statistical Package for the Social Sciences), developed by the University of Chicago; and SAS (Statistical Analysis System). Another available package is IMSL, the International Mathematical and Statistical Libraries, which contains a great variety of standard mathematical and statistical calculations. All of these software packages use matrix algebra to solve simultaneous equations.
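For a single predictor, the shortcut formulas mentioned under the least squares method can be sketched in a few lines of Python. The function name and the data are illustrative assumptions, not part of any of the packages named above.

```python
def fit_line(t, y):
    """Least squares estimates of a and b in Y = a + b*t.

    Uses the shortcut formulas
        b = sum((t - t_bar) * (y - y_bar)) / sum((t - t_bar)**2)
        a = y_bar - b * t_bar
    so the fitted line passes through (t_bar, y_bar).
    """
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    b = s_ty / s_tt
    a = y_bar - b * t_bar
    return a, b


# Hypothetical sample of five (t, y) pairs.
t = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(t, y)   # a is about 1.09, b is about 1.97
```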
Use and Interpretation of the Regression Equation: The equation developed can be used to predict an average value over the range of the sample data. The forecast is good for short to medium ranges.
Measuring Error in Estimations: The scatter or variability about the mean value can be measured by calculating the variance, the average squared deviation of the values around the mean. The standard error of estimate is derived from this value by taking the square root. This value is interpreted as the average amount that actual values differ from the estimated mean.
Confidence Intervals: Interval estimates can be calculated to obtain a measure of the confidence we have in our estimates that a relationship exists. These calculations are made using t-distribution tables. From these calculations we can derive confidence bands, a pair of non-parallel lines narrowest at the mean values which express our confidence in varying degrees of the band of values surrounding the regression equation.
Assessment: How confident can we be that a relationship actually exists? The strength of that relationship can be assessed by statistical tests of the null hypothesis that no relationship exists, carried out using the t-distribution, R-squared, and F-distribution tables. These calculations give rise to the standard error of the regression coefficient, an estimate of the amount that the regression coefficient b will vary from sample to sample of the same size from the same population. An Analysis of Variance (ANOVA) table can be generated which summarizes the different components of variation.
When you want to compare models of different size (different numbers of independent variables and/or different sample sizes) you must use the Adjusted R-Squared, because the usual R-Squared tends to grow with the number of independent variables.
The Standard Error of Estimate (i.e., the square root of the error mean square) is a good indicator of the "quality" of a prediction model, since the Error Mean Square (EMS) "adjusts" the Error Sum of Squares for the number of predictors in the model as follows:
EMS = Error Sum of Squares/(N - Number of Linearly Independent Predictors)
If one keeps adding useless predictors to a model, the EMS will become less and less stable. R-squared is also influenced by the range of your dependent variable: if two models have the same residual mean square but one model has a much narrower range of values for the dependent variable, that model will have a lower R-squared, even though both models will do equally well for prediction purposes.
A considerable portion of the output of the computer programs previously mentioned are devoted to a description of the tests of significance of the regression.
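The quantities discussed above (error mean square, standard error of estimate, R-squared, and adjusted R-squared) can be sketched in Python as follows. The sketch assumes that the intercept is counted among the linearly independent predictors, so that the divisor is N minus the number of estimated coefficients; the function name and the data are illustrative.

```python
import math


def regression_summary(y, fitted, n_coefficients):
    """Summary statistics for a fitted regression.

    n_coefficients counts every estimated coefficient, including the
    intercept (an assumption about how predictors are counted).
    """
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error SS
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total SS
    df = n - n_coefficients
    ems = sse / df                                          # error mean square
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df
    return {
        "sse": sse,
        "ems": ems,
        "std_error": math.sqrt(ems),
        "r2": r2,
        "adj_r2": adj_r2,
    }


# Hypothetical data with the fitted line Y = 1.09 + 1.97*t.
t = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [1.09 + 1.97 * ti for ti in t]
summary = regression_summary(y, fitted, n_coefficients=2)
```

The adjusted R-squared is never larger than the ordinary R-squared, which is why it is the fairer yardstick when comparing models of different size.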

Moving Average and Exponential Smoothing

C       SMA=SIMPLE MOVING AVERAGE
C       DMA=DOUBLE MOVING AVERAGE
C       FDMA=FORECAST WITH DOUBLE MOVING AVERAGES
C       NUM1=NUMBER OF PERIODS AVERAGED; THE DATA START AT Y(2)
C
C       NP1 IS GARBLED IN THE SOURCE ("N=2"); N+1 IS ASSUMED HERE
        NP1=N+1
        NUM1=NUM
        NUM=NUM1+1
        SM1=1
        SM2=NUM1
        DO 8 I=NUM,NP1
        SM=0.0
        DO 450 M=SM1,SM2
        SM=SM+Y(M+1)
450     CONTINUE
        SM1=SM1+1
        SM2=SM2+1
        SMA(I)=SM/NUM1
        SMASQ(I)=SMA(I)**2
8       CONTINUE
        NUM=NUM1*2+1
        DM1=1
        DM2=NUM1
        DO 45 I=NUM,NP1
        DM=0.0
        DO 460 M=DM1,DM2
        DM=DM+SMA(M+1+NUM1)
460     CONTINUE
        DM1=DM1+1
        DM2=DM2+1
        DMA(I)=DM/NUM1
        MA(I)=SMA(I)*2-DMA(I)
C       SLOPE FACTOR 2/(NUM1-1); THE SOURCE SHOWS 2/3, I.E. NUM1=4
        MB(I)=(SMA(I)-DMA(I))*2/(NUM1-1)
        FDMA(1+I)=MA(I)+MB(I)
        FDMASQ(1+I)=FDMA(1+I)**2
45      CONTINUE
        FORDMA=MA(J)+MB(J)*T
C
C       SES=SMOOTHED STATISTIC FOR SINGLE EXPONENTIAL SMOOTHING
C       DES=SMOOTHED STATISTIC FOR DOUBLE EXPONENTIAL SMOOTHING
C       TES=SMOOTHED STATISTIC FOR TRIPLE EXPONENTIAL SMOOTHING
C       TA,TB,TC ARE THE COEFFICIENTS IN THE FORECASTING EQUATION
C       FOR TRIPLE EXPONENTIAL SMOOTHING
C       FDES=FORECAST WITH DOUBLE EXPONENTIAL SMOOTHING
C       FTES=FORECAST WITH TRIPLE EXPONENTIAL SMOOTHING
C
        SES(1)=Y(2)
        DO 46 I=2,J
        SES(I)=ALPHA*(Y(I)-SES(I-1))+SES(I-1)
46      CONTINUE
        DO 410 I=3,J
        FSES(I)=SES(I-1)
        SESSQ(I)=FSES(I)**2
410     CONTINUE
        SESFOR=SES(J)
        DES(1)=Y(2)
        DO 55 I=2,J
        DES(I)=ALPHA*SES(I)+(1.-ALPHA)*DES(I-1)
        EA(I)=2*SES(I)-DES(I)
        EB(I)=(SES(I)-DES(I))*ALPHA/(1.-ALPHA)
55      CONTINUE
        DO 420 I=3,J
        FDES(I)=EA(I-1)+EB(I-1)
        FDESSQ(I)=FDES(I)**2
420     CONTINUE
        DESFOR=EA(J)+T*EB(J)
        TES(1)=Y(2)
        DO 51 I=2,J
        TES(I)=ALPHA*DES(I)+(1.-ALPHA)*TES(I-1)
        TA(I)=3*SES(I)-3*DES(I)+TES(I)
        TB(I)=(ALPHA/(2*(1-ALPHA)**2))*((6-5*ALPHA)*SES(I)
     *-(10-8*ALPHA)*DES(I)+(4-3*ALPHA)*TES(I))
        TC(I)=(ALPHA/(1-ALPHA))**2*(SES(I)-2*DES(I)+TES(I))
51      CONTINUE
        DO 430 I=3,J
        FTES(I)=TA(I-1)+TB(I-1)+TC(I-1)/2
        FTESSQ(I)=FTES(I)**2
430     CONTINUE
        TESFOR=TA(J)+TB(J)*T+TC(J)/2.0*T**2
C
C       ESMA,EDMA,ESES,EDES=DIFFERENCE BETWEEN ESTIMATED AND ACTUAL
C       VALUE IN SIMPLE, DOUBLE MOVING AVERAGES AND SINGLE,
C       DOUBLE EXPONENTIAL SMOOTHING
C       ETES=DIFFERENCE BETWEEN ESTIMATED AND ACTUAL VALUE IN
C       TRIPLE EXPONENTIAL SMOOTHING
C
        NUM=NUM1+2
        DO 11 I=NUM,J
        ESMA(I)=SMA(I)-Y(I)
        ESMASQ(I)=ESMA(I)**2
11      CONTINUE
C       FDMA IS FIRST DEFINED AT 2*NUM1+2 (THE SOURCE SHOWS NUM1+2)
        NUM=NUM1*2+2
        DO 47 I=NUM,J
        EDMA(I)=FDMA(I)-Y(I)
        EDMASQ(I)=EDMA(I)**2
47      CONTINUE
        DO 48 I=3,J
        ESES(I)=FSES(I)-Y(I)
        ESESSQ(I)=ESES(I)**2
        EDES(I)=FDES(I)-Y(I)
        EDESSQ(I)=EDES(I)**2
        ETES(I)=FTES(I)-Y(I)
        ETESSQ(I)=ETES(I)**2
48      CONTINUE
        WRITE(6,20)
20      FORMAT(//,4X,'***MOVING AVERAGE***')
        WRITE(6,22)
22      FORMAT(//,4X,'PERIOD',2X,'ACTUAL',2X,'SIMPLE MOVING AVERAGE',
     *27X,'DOUBLE MOVING AVERAGE')
        WRITE(6,23)
23      FORMAT(19X,'FORECAST',2X,'RESIDUAL',2X,'RESIDUAL-SQ',
     *15X,'M(2)',4X,'FORECAST',2X,'RESIDUAL',2X,'RESIDUAL-SQ')
        DO 98 I=2,J
C       ESMASQ IS PRINTED UNDER RESIDUAL-SQ (THE SOURCE SHOWS SMASQ)
        WRITE(6,24) X(I),Y(I),SMA(I),ESMA(I),ESMASQ(I),DMA(I),FDMA(I),
     *EDMA(I),EDMASQ(I)
24      FORMAT(7X,I2,3X,F5.0,2X,F8.3,2X,F8.3,1X,F11.3,2X,F8.3,2X,
     *F8.3,2X,F8.3,2X,F11.3)
98      CONTINUE
        NUM=NUM1+2
        DO 13 I=NUM,J
        S3=S3+SMA(I)
        SS2=SS2+ESMASQ(I)
13      SS3=SS3+SMASQ(I)
        NUM=NUM1*2+2
C       S4,SS4 FOLLOW THE PATTERN OF S3,SS2 (THE SOURCE IS GARBLED)
        DO 49 I=NUM,J
        S4=S4+FDMA(I)
        SS4=SS4+EDMASQ(I)
49      SS5=SS5+FDMASQ(I)
        WRITE(6,25)S3,SS2,S4,SS4
25      FORMAT('0',/,12X,F15.3,12X,F11.3,22X,F15.3,3X,F15.3)
        WRITE(6,59)T,FORDMA
59      FORMAT(/,64X,'FORECAST FOR',1X,I2,1X,
     *'PERIOD(S) AHEAD IS',1X,F8.3)
        WRITE(6,26)
26      FORMAT(//,4X,'***EXPONENTIAL SMOOTHING***')
        WRITE(6,27)
27      FORMAT(//,20X,'SINGLE EXPONENTIAL SMOOTHING')
        WRITE(6,28)
28      FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'SES',4X,'FORECAST',2X,
     *'RESIDUAL',2X,'RESIDUAL-SQ')
        DO 14 I=1,J
        WRITE(6,29)X(I),Y(I),SES(I),FSES(I),ESES(I),ESESSQ(I)
29      FORMAT(7X,I2,3X,F5.0,2X,F8.3,2X,F8.3,2X,F11.3,2X,F11.3)
14      CONTINUE
        DO 38 I=3,J
        S5=S5+FSES(I)
        S6=S6+FDES(I)
        S8=S8+FTES(I)
C       SS6 IS PRINTED BELOW BUT NEVER ACCUMULATED IN THE SOURCE;
C       THE SUM OF SQUARED SES RESIDUALS IS ASSUMED
        SS6=SS6+ESESSQ(I)
        SS7=SS7+SESSQ(I)
        SS8=SS8+EDESSQ(I)
        SS12=SS12+ETESSQ(I)
        SS13=SS13+FTESSQ(I)
38      SS9=SS9+FDESSQ(I)
        WRITE(6,35)S5,SS6
35      FORMAT('0',/,21X,F15.3,12X,F11.3)
        WRITE(6,21)T,SESFOR
        WRITE(6,74)
74      FORMAT(//,20X,'DOUBLE EXPONENTIAL SMOOTHING')
        WRITE(6,76)
76      FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'DES',8X,'EA',8X,'EB',6X,
     *'FORECAST',3X,'RESIDUAL',2X,'RESIDUAL-SQ')
        DO 77 I=1,J
        WRITE(6,78)X(I),Y(I),DES(I),EA(I),EB(I),FDES(I),EDES(I),
     *EDESSQ(I)
78      FORMAT(7X,I2,3X,F5.0,1X,F8.3,3X,F8.3,2X,F8.3,2X,F8.3,3X,F8.3,
     *2X,F11.3)
77      CONTINUE
        WRITE(6,79) S6,SS8
79      FORMAT('0',/,41X,F11.3,12X,F11.3)
        WRITE(6,21)T,DESFOR
21      FORMAT(/,'FORECAST FOR',1X,I2,1X,'PERIOD(S) AHEAD IS',1X,F8.3)
        WRITE(6,31)
31      FORMAT(//,20X,'TRIPLE EXPONENTIAL SMOOTHING')
        WRITE(6,32)
32      FORMAT(4X,'PERIOD',2X,'ACTUAL',4X,'TES',6X,'TA',8X,'TB',6X,
     *'TC',4X,'FORECAST',2X,'RESIDUAL',2X,'RESIDUAL-SQ')
        DO 97 I=1,J
        WRITE(6,33)X(I),Y(I),TES(I),TA(I),TB(I),TC(I),FTES(I),
     *ETES(I),ETESSQ(I)
33      FORMAT(7X,I2,3X,F5.0,2X,F8.3,1X,F7.3,1X,F7.3,3X,F8.3,2X,F8.3,
     *2X,F8.3,2X,F11.3)
97      CONTINUE
        WRITE(6,30)T,TESFOR
30      FORMAT(/,'FORECAST FOR',1X,I2,1X,'PERIOD(S) AHEAD IS',1X,F8.3)
        END
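The smoothing recursions and forecasting equations used in the listing above can be restated compactly. The following Python sketch is a minimal restatement under stated assumptions, not a translation of the Fortran program: all three smoothed statistics are initialized with the first observation, and the function names are illustrative.

```python
def smoothed_statistics(y, alpha):
    """Final single, double and triple smoothed statistics of y."""
    s1 = s2 = s3 = y[0]          # assumed initialization
    for value in y[1:]:
        s1 = alpha * value + (1 - alpha) * s1
        s2 = alpha * s1 + (1 - alpha) * s2
        s3 = alpha * s2 + (1 - alpha) * s3
    return s1, s2, s3


def smoothing_forecasts(y, alpha, horizon):
    """Forecasts `horizon` periods ahead by the three methods."""
    s1, s2, s3 = smoothed_statistics(y, alpha)
    ratio = alpha / (1 - alpha)
    # Double smoothing: linear forecast a + b*h.
    a2 = 2 * s1 - s2
    b2 = ratio * (s1 - s2)
    # Triple smoothing: quadratic forecast a + b*h + c*h**2/2.
    a3 = 3 * s1 - 3 * s2 + s3
    b3 = (alpha / (2 * (1 - alpha) ** 2)) * (
        (6 - 5 * alpha) * s1 - (10 - 8 * alpha) * s2 + (4 - 3 * alpha) * s3
    )
    c3 = ratio ** 2 * (s1 - 2 * s2 + s3)
    return {
        "single": s1,
        "double": a2 + b2 * horizon,
        "triple": a3 + b3 * horizon + 0.5 * c3 * horizon ** 2,
    }


flat = smoothing_forecasts([5.0] * 10, alpha=0.3, horizon=2)
trend = smoothing_forecasts([float(v) for v in range(60)], alpha=0.5, horizon=1)
```

On a constant series all three forecasts return the constant. On a long linear series the double and triple forms recover the next point on the line, while single smoothing lags behind the trend.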

Winters’ Method

C   FOR INITIAL TREND LINE, WE USE SIMPLE LINEAR REGRESSION
C   YEST(I)=A+BX(I)
C    INITIAL MULTIPLICATIVE SEASONAL FACTORS ('MSF') BY USING THE 1ST
C    AND 2ND YEAR IN THE DATA
C    1. FOR THE FIRST YEAR
C
        L=L+1
        DO 170 I=2,L
170     SF1(I)=Y(I)/YEST(I)
C
C  2. FOR THE 2ND YEAR
C
        LP1=1+L
        LT2=2*L-1
        DO 175 I=LP1,LT2
175     SF2(I)=Y(I)/YEST(I)
C
C       INITIAL ESTIMATES OF THE FUTURE SEASONAL FACTORS ('SF')
C
        DO 180 I=2,L
        M=I+L-1
        SF(I)=(SF1(I)+SF2(M))/2
180     SF(M)=SF(I)
        WRITE(6,345)
345     FORMAT(//,4X,'**WINTERS'' METHOD**')
        WRITE(6,350)
350     FORMAT (/,4X,'PERIOD',6X,'ACTUAL',2X,'VALUE FROM TREND LINE',2X,
     *'MULT.SEASONAL FACTOR')
        DO 185 I=2,L
        WRITE(6,355)X(I),Y(I),YEST(I),SF1(I)
355     FORMAT(7X,I2,7X,F5.0,10X,F10.4,17X,F4.2)
185     CONTINUE
        DO 190 I=LP1,LT2
        WRITE(6,360)X(I),Y(I),YEST(I),SF(I)
360     FORMAT(7X,I2,7X,F5.0,10X,F10.4,17X,F4.2)
190     CONTINUE
        WRITE (6,365)
365     FORMAT(//,4X,'PERIOD',2X,'AVG.OF MULT.SEASONAL FACTORS')
        DO 195 I=2,L
        WRITE(6,370)X(I),SF(I)
370     FORMAT(7X,I2,15X,F5.2)
195     CONTINUE
C
C  UPDATING THE ESTIMATE OF THE INTERCEPT, SLOPE, AND MULT. SEASONAL
C  FACTOR BY USING EXPONENTIAL SMOOTHING
C
C  AA(I)=ESTIMATED VALUE OF THE TREND LINE AT PERIOD I
C  BB(I)=ESTIMATED SLOPE OF THE TREND AT PERIOD I
C  SSF(I)=REVISED SEASONAL FACTOR
C  FW(I)=FORECAST BY WINTERS' METHOD
C  K IS SET ELSEWHERE IN THE PROGRAM (NOT SHOWN IN THIS LISTING)
C
        LP3=1+LT2
        LT3=3*L-2
        AA(LT2)=YEST(LT2)
        BB(LT2)=B
        DO 200 I=LP3,K
        AA(I)=WALPHA*Y(I)/SF(1+I-L)+(1-WALPHA)*(AA(I-1)+BB(I-1))
        BB(I)=WBETA*(AA(I)-AA(I-1))+(1.-WBETA)*BB(I-1)
        SSF(I)=WDELTA*Y(I)/AA(I)+(1-WDELTA)*SF(1+I-L)
        FW(I+1)=(AA(I)+BB(I)*1.)*SF(I+2-L)
200     CONTINUE
C       FROM LT3 ON, THE REVISED FACTORS SSF ARE ASSUMED TO REPLACE SF
C       (THE SOURCE REPEATS SF, WHICH IS UNDEFINED BEYOND LT2)
        SSF(LT2)=SF(LT2)
        DO 205 I=LT3,J
        AA(I)=WALPHA*Y(I)/SSF(1+I-L)+(1-WALPHA)*(AA(I-1)+BB(I-1))
        BB(I)=WBETA*(AA(I)-AA(I-1))+(1.-WBETA)*BB(I-1)
        SSF(I)=WDELTA*Y(I)/AA(I)+(1-WDELTA)*SSF(1+I-L)
        FW(I+1)=(AA(I)+BB(I)*1.)*SSF(I+2-L)
205     CONTINUE
        MOA=J+T-1
        MOB=L-1
        REM=MOD(MOA,MOB)
        WINFOR=(AA(J)+BB(J)*T)*SSF(REM+LT2)
        LP5=1+LP3
        DO 210 I=LP5,J
        EFW(I)=FW(I)-Y(I)
        EFWSQ(I)=EFW(I)**2
        FWSQ(I)=FW(I)**2
210     CONTINUE
        DO 215 I=LP5,J
        S7=S7+FW(I)
        SS10=SS10+EFWSQ(I)
        SS11=SS11+FWSQ(I)
215     CONTINUE
        WRITE(6,375)
375     FORMAT(//,4X,' **FORECAST BY WINTERS METHOD** ')
        WRITE(6,380)
380     FORMAT(//,4X,'PERIOD',6X,'ACTUAL',3X,'FORECAST',5X,'RESIDUAL',
     *2X,'RESIDUAL-SQ')
C       FW IS FIRST DEFINED AT LP5 (THE SOURCE STARTS THIS LOOP AT LP3)
        DO 220 I=LP5,J
        WRITE(6,385) X(I),Y(I),FW(I),EFW(I),EFWSQ(I)
385     FORMAT (7X,I2,7X,F5.0,4X,F8.3,4X,F8.3,4X,F10.4)
220     CONTINUE
        WRITE(6,390)S7,SS10
390     FORMAT(/,22X,F11.3,13X,F14.3)
        WRITE(6,21)T,WINFOR
        RETURN
        END
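The three updating equations of Winters' multiplicative method, as used in the listing above, can be sketched in Python. This is a minimal illustration rather than a translation of the Fortran: the function name is hypothetical, the initial level, slope, and seasonal factors are assumed to be supplied by the caller (for example from the trend line and averaged ratios described above), and `alpha`, `beta`, `delta` play the roles of WALPHA, WBETA, and WDELTA.

```python
def winters_forecast(y, season_len, alpha, beta, delta,
                     level, slope, seasonals, horizon=1):
    """Multiplicative Winters updates followed by forecasts.

    level and slope describe the trend line one period before y[0];
    seasonals[k] is the factor for season position k.
    """
    seasonals = list(seasonals)
    for t, value in enumerate(y):
        k = t % season_len
        old_factor = seasonals[k]
        # Deseasonalized observation blended with the projected level.
        new_level = (alpha * value / old_factor
                     + (1 - alpha) * (level + slope))
        slope = beta * (new_level - level) + (1 - beta) * slope
        seasonals[k] = (delta * value / new_level
                        + (1 - delta) * old_factor)
        level = new_level
    n = len(y)
    return [
        (level + slope * h) * seasonals[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]


# Hypothetical series: trend 10 + 2t with seasonal factors 0.8 and 1.2.
factors = [0.8, 1.2]
series = [(10 + 2 * t) * factors[t % 2] for t in range(8)]
forecasts = winters_forecast(series, 2, 0.3, 0.2, 0.1,
                             level=8.0, slope=2.0,
                             seasonals=factors, horizon=2)
```

Because the example series follows the model exactly, the updates leave the level, slope, and factors unchanged, and the forecasts reproduce the next two values of the pattern.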

Smoothing the Data

Given a collection of data, this interactive program smooths the data using exponential smoothing methods and also produces forecasts for the desired number of periods. It also computes the moving averages once the desired period has been entered. Input and output file assignments should be made before run time; otherwise interactive I/O is the default.
C       VARIABLE RECOGNITION:
C       PERIOD    ------ COULD BE A WEEK, MONTH, QUARTER OR YEAR
C       PERIODE   ------ NUMBER OF PERIODS TO BE USED WHEN COMPUTING
C                        THE MOVING AVERAGES
C       X         ------ ORIGINAL DATA
C       ST1       ------ THE SMOOTHED VALUE USING EXPO. FIRST DEGREE
C       ST2       ------ THE SMOOTHED VALUE USING EXPO. SECOND DEGREE
C       ST3       ------ THE SMOOTHED VALUE USING EXPO. THIRD DEGREE

        INTEGER PERIOD, DATAITEMS, PERIODE
        REAL X,ST1,ST2,ST3,AVR
        DIMENSION X(1000), ST1(1000), ST2(1000),
     $ ST3(1000), AVR(1000), PERIOD(1000)
        WRITE (*,*)
     $ 'PLEASE ENTER THE NUMBER OF DATA ITEMS THAT YOU HAVE:'
        READ (*,*) DATAITEMS

C       INITIALIZING AND LOADING DATA INTO THE ARRAY X.

        DO 10 I=1, DATAITEMS
        X(I) = 0
        READ (5,*) X(I)
10      CONTINUE
        ST1(1) = X(1)
        ST2(1) = X(1)
        ST3(1) = X(1)
This part of the program computes the exponentially smoothed data and the moving-average smoothed data, forecasts the required number of periods after computing the coefficients, and finally prints out the results.
        WRITE(*,*) 'PLEASE ENTER THE VALUE OF COEFFICIENT ALPHA:'
        READ(*,*) ALPHA
        WRITE(6,100)
100     FORMAT('1',10X,'PERIOD',
     $    7X,'X',9X,'EXPO_1',6X,'EXPO_2',
     $    6X,'EXPO_3')
        DO 20 J=2, DATAITEMS
        ST1(J) = ALPHA * X(J)   + (1-ALPHA) * ST1(J-1)
        ST2(J) = ALPHA * ST1(J) + (1-ALPHA) * ST2(J-1)
        ST3(J) = ALPHA * ST2(J) + (1-ALPHA) * ST3(J-1)
20      CONTINUE
        DO 30 K=1, DATAITEMS
        WRITE (6,200) K,X(K),ST1(K),ST2(K),ST3(K)
200     FORMAT (11X,I4,5X,F10.2,2X,F10.2,2X,F10.2,2X,F10.2)
30      CONTINUE
        A2 = 2*ST1(DATAITEMS) - ST2(DATAITEMS)
        B2 = (ALPHA/(1-ALPHA)) * (ST1(DATAITEMS) - ST2(DATAITEMS))
        A3 = 3*ST1(DATAITEMS) - 3*ST2(DATAITEMS) + ST3(DATAITEMS)
        B3 = (ALPHA/(2*(1-ALPHA)**2)) * ((6-5*ALPHA) * ST1(DATAITEMS)
     $  - (10 - 8*ALPHA) * ST2(DATAITEMS)
     $  + (4-3*ALPHA) * ST3(DATAITEMS))
        C3 = ((ALPHA/(1-ALPHA))**2) * (ST1(DATAITEMS)-2*ST2(DATAITEMS)
     $  + ST3(DATAITEMS))

C       FORECASTS
        WRITE(*,*) 'HOW MANY PERIODS DO YOU NEED TO FORECAST?'
        READ (*,*) NUMFORCASTS
        WRITE(6,300)
300     FORMAT(////5X,'------ FORECASTS ------')
        WRITE(6,400)
400     FORMAT (//10X,'PERIOD',' EXPO2.FORECASTS ',' EXPO3.FORECASTS'/)
        DO 40 L=1, NUMFORCASTS
        FORCAST2 = A2 + B2*L
        FORCAST3 = A3 + B3*L + (0.5)*(L**2)*C3
        WRITE (6,500) DATAITEMS+L, FORCAST2, FORCAST3
500     FORMAT(12X,I4,2X,F16.2,2X,F16.2)
        FORCAST2=0
        FORCAST3=0
40      CONTINUE


C       MOVING AVERAGE

        WRITE (*,*) 'PLEASE ENTER THE PERIOD_AVERAGE'
        READ (*,*) PERIODE
C       THE SOURCE OMITS THE RESET OF SUM, THE ASSIGNMENT TO AVR AND
C       THE END OF LOOP 50; STORING EACH AVERAGE AT THE LAST PERIOD
C       OF ITS WINDOW IS AN ASSUMPTION
        DO 50 M=1, DATAITEMS - PERIODE + 1
        SUM=0.0
        DO 60 N=M, M + PERIODE - 1
          SUM=SUM + X(N)
60      CONTINUE
        AVR(M+PERIODE-1)=SUM/PERIODE
50      CONTINUE
        WRITE (6,550)
550     FORMAT (//10X,'----- MOVING AVERAGE -----')
        WRITE (6,600)
600     FORMAT (/////10X,'PERIOD',6X,'X(T)',6X,'MOVING AVERAGE'/)
        DO 70 IJ=1, DATAITEMS
        IF (IJ .LE. PERIODE .OR. IJ .GT. (DATAITEMS - PERIODE)) THEN
        WRITE (6,700) IJ, X(IJ)
700     FORMAT (15X,I2,6X,F8.2,6X,'--------')
        ELSE
        WRITE (6,800) IJ, X(IJ), AVR(IJ)
800     FORMAT (15X,I2,6X,F8.2,6X,F8.2)
        ENDIF
70      CONTINUE
        STOP
        END
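The moving-average part of the program, together with the double moving average forecast used in the earlier listing, can be sketched in Python as follows. The function names are illustrative, and each average is stored at the last period of its window, which is an assumption about the intended layout.

```python
def moving_average(x, window):
    """Trailing moving averages; entry i covers x[i] .. x[i+window-1]."""
    return [
        sum(x[i:i + window]) / window
        for i in range(len(x) - window + 1)
    ]


def double_moving_average_forecast(x, window, horizon=1):
    """Linear forecast from single and double moving averages."""
    sma = moving_average(x, window)      # simple moving averages
    dma = moving_average(sma, window)    # moving averages of the SMAs
    a = 2 * sma[-1] - dma[-1]
    b = 2 / (window - 1) * (sma[-1] - dma[-1])
    return a + b * horizon


short = moving_average([1.0, 2.0, 3.0, 4.0], 2)
# Hypothetical linear series; the next value on the line is 16.
next_value = double_moving_average_forecast(
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0], window=3)
```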

Transfer Functions Methodology

It is possible to extend regression models to represent dynamic relationships between variables via appropriate transfer functions used in the construction of feedforward and feedback control schemes. Autobox is one software package for this topic. The Transfer Function Analyzer module in the SCA forecasting and modeling package is a frequency spectrum analysis package designed with the engineer in mind. It applies the concept of the Fourier integral transform to an input data set to provide a frequency domain representation of the function approximated by that input data. It also presents the results in conventional engineering terms.

Box-Jenkins Methodology

Forecasting Basics: The basic idea behind self-projecting time series forecasting models is to find a mathematical formula that will approximately generate the historical patterns in a time series.
Time Series: A time series is a set of numbers that measures the status of some activity over time. It is the historical record of some activity, with measurements taken at equally spaced intervals (exception: monthly) with a consistency in the activity and the method of measurement.
Approaches to Time Series Forecasting: There are two basic approaches to forecasting time series: the self-projecting time series approach and the cause-and-effect approach. Cause-and-effect methods attempt to forecast based on underlying series that are believed to cause the behavior of the original series. The self-projecting approach uses only the time series data of the activity to be forecast to generate forecasts. This latter approach is typically less expensive to apply, requires far less data, and is useful for short- to medium-term forecasting.
Box-Jenkins Forecasting Method: The univariate version of this methodology is a self-projecting time series forecasting method. The underlying goal is to find an appropriate formula so that the residuals are as small as possible and exhibit no pattern. The model-building process involves four steps, repeated as necessary, to end up with a specific formula that replicates the patterns in the series as closely as possible and also produces accurate forecasts.
Box-Jenkins Methodology
Box-Jenkins forecasting models are based on statistical concepts and principles and are able to model a wide spectrum of time series behavior. The methodology has a large class of models to choose from and a systematic approach for identifying the correct model form. There are both statistical tests for verifying model validity and statistical measures of forecast uncertainty. In contrast, traditional forecasting models offer a limited number of models relative to the complex behavior of many time series, with little in the way of guidelines and statistical tests for verifying the validity of the selected model.
Data: The misuse, misunderstanding, and inaccuracy of forecasts are often the result of not appreciating the nature of the data in hand. The consistency of the data must be ensured, and it must be clear what the data represent and how they were gathered or calculated. As a rule of thumb, Box-Jenkins requires at least 40 or 50 equally spaced periods of data. The data must also be edited to deal with extreme or missing values or other distortions, through the use of functions such as log or inverse to achieve stabilization.
Preliminary Model Identification Procedure: A preliminary Box-Jenkins analysis with a plot of the initial data should be run as the starting point in determining an appropriate model. The input data must be adjusted to form a stationary series, one whose values vary more or less uniformly about a fixed level over time. Apparent trends can be adjusted by having the model apply a technique of "regular differencing," a process of computing the difference between every two successive values, which yields a differenced series with the overall trend behavior removed. If a single differencing does not achieve stationarity, it may be repeated, although rarely, if ever, are more than two regular differencings required. Where irregularities in the differenced series continue to be displayed, log or inverse functions can be specified to stabilize the series, such that the remaining residual plot displays values approaching zero and without any pattern. This is the error term, equivalent to pure white noise.
Pure Random Series: On the other hand, if the initial data series displays neither trend nor seasonality and the residual plot shows essentially zero values within a 95% confidence level and these residual values display no pattern, then there is no real-world statistical problem to solve and we go on to other things.
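Regular differencing, as described above, is a one-line operation, and the same function with a longer lag gives seasonal differencing. The Python sketch below is illustrative only; the function name and the data are assumptions.

```python
def difference(series, lag=1):
    """Differences between values `lag` periods apart.

    lag=1 is regular differencing; lag=s (periods per season) is
    seasonal differencing.
    """
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# A quadratic trend needs two regular differencings to become level.
quadratic = [t * t for t in range(6)]        # 0, 1, 4, 9, 16, 25
once = difference(quadratic)                 # 1, 3, 5, 7, 9
twice = difference(once)                     # 2, 2, 2, 2

# A pure seasonal pattern of period 4 is removed by one seasonal difference.
seasonal = [10, 20, 30, 40] * 3
deseasoned = difference(seasonal, lag=4)
```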
Model Identification Background
Basic Model: With a stationary series in place, a basic model can now be identified. Three basic models exist: AR (autoregressive), MA (moving average), and a combined ARMA. Together with the previously specified RD (regular differencing), they provide the available tools. When regular differencing is applied together with AR and MA, the models are referred to as ARIMA, with the I indicating "integrated" and referencing the differencing procedure.
Seasonality: In addition to trend, which has now been provided for, stationary series quite commonly display seasonal behavior where a certain basic pattern tends to be repeated at regular seasonal intervals. The seasonal pattern may additionally frequently display constant change over time as well. Just as regular differencing was applied to the overall trending series, seasonal differencing (SD) is applied to seasonal nonstationarity as well. And as autoregressive and moving average tools are available with the overall series, so too, are they available for seasonal phenomena using seasonal autoregressive parameters (SAR) and seasonal moving average parameters (SMA).
Establishing Seasonality: The need for seasonal autoregression (SAR) and seasonal moving average (SMA) parameters is established by examining the autocorrelation and partial autocorrelation patterns of a stationary series at lags that are multiples of the number of periods per season. These parameters are required if the values at lags s, 2s, etc. are nonzero and display patterns associated with the theoretical patterns for such models. Seasonal differencing is indicated if the autocorrelations at the seasonal lags do not decrease rapidly.
Referring to the above chart, note that the variance of the errors of the underlying model must be invariant (i.e., constant). This means that the variance for each subgroup of data is the same and does not depend on the level or the point in time. If this is violated, one can remedy it by stabilizing the variance. Make sure that there are no deterministic patterns in the data. There must also be no pulses or one-time unusual values, no level or step shifts, and no seasonal pulses.
The reason for all of this is that if such components do exist, the sample autocorrelation and partial autocorrelation will seem to imply ARIMA structure. The presence of these kinds of model components can also obscure or hide structure. For example, a single outlier or pulse can create an effect where the structure is masked by the outlier.
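The sample autocorrelations on which this identification rests can be computed as in the following Python sketch. It uses the common large-sample rule that a coefficient outside plus or minus 1.96 divided by the square root of n is significant at roughly the 95% level; the function names and the test series are illustrative assumptions.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def significant_lags(series, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% limits."""
    limit = 1.96 / math.sqrt(len(series))
    acf = autocorrelation(series, max_lag)
    return [k for k, r in enumerate(acf, start=1) if abs(r) > limit]


# Hypothetical alternating series: strong negative lag-1 correlation.
alternating = [1.0, -1.0] * 20
acf = autocorrelation(alternating, 2)
flagged = significant_lags(alternating, 2)
```

For seasonal data the same function would be inspected at lags s, 2s, and so on, where s is the number of periods per season.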
Improved Quantitative Identification Method
Relieved Analysis Requirements: A substantially improved procedure is now available for conducting Box-Jenkins ARIMA analysis which relieves the requirement for a seasoned perspective in evaluating the sometimes ambiguous autocorrelation and partial autocorrelation residual patterns to determine an appropriate Box-Jenkins model for use in developing a forecast model.
ARMA (1, 0): The first model to be tested on the stationary series consists solely of an autoregressive term with lag 1. The autocorrelation and partial autocorrelation patterns are examined for significant autocorrelation in the early terms and to see whether the residual coefficients are uncorrelated, that is, whether the coefficient values are zero within 95% confidence limits and without apparent pattern. When fitted values as close as possible to the original series values are obtained, the sum of the squared residuals will be minimized, a technique called least squares estimation. The residual mean and the mean percent error should not be significantly nonzero. Alternative models are examined comparing the progress of these factors, favoring models which use as few parameters as possible. Correlation between parameters should not be significantly large, and the confidence limits of each parameter should not bracket zero. When a satisfactory model has been established, a forecast procedure is applied.
ARMA (2, 1): Absent a satisfactory ARMA (1, 0) condition with residual coefficients approximating zero, the improved model identification procedure now proceeds to examine the residual pattern when autoregressive terms with order 1 and 2 are applied together with a moving average term with an order of 1.
Subsequent Procedure: To the extent that the residual conditions described above remain unsatisfied, the Box-Jenkins analysis is continued with ARMA (n, n-1) until a satisfactory model is arrived at. In the course of this iteration, when an autoregressive coefficient (phi) approaches zero, the model is reexamined with parameters ARMA (n-1, n-1). In like manner whenever a moving average coefficient (theta) approaches zero, the model is similarly reduced to ARMA (n, n-2). At some point, either the autoregressive term or moving average term may fall away completely and the examination of the stationary series is continued with only the remaining term until the residual coefficients approach zero within the specified confidence levels.
 
Seasonal Analysis: In parallel with this model development cycle and in an entirely similar manner, seasonal autoregressive and moving average parameters are added or dropped in response to the presence of a seasonal or cyclical pattern in the residual terms or a parameter coefficient approaching zero.
Model Adequacy: In reviewing the Box-Jenkins output, care should be taken to ensure that the parameters are uncorrelated and significant, and alternative models should be weighed for these conditions as well as for overall correlation (R-squared), standard error, and zero residual.
Forecasting with the Model: The model is used for short and intermediate term forecasting, updated as new data becomes available to minimize the number of periods ahead required of the forecast.
Monitor the Accuracy of the Forecasts in Real Time: As time progresses, the accuracy of the forecasts should be closely monitored for increases in the error terms, standard error and a decrease in correlation. When the series appears to be thus changing over time, recalculation of the model parameters should be undertaken. 
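The first step of the identification cycle described above, fitting ARMA (1, 0) by least squares and then inspecting the residuals, can be sketched in Python. This is a conditional least squares estimate on a simulated series, offered as an illustration only; the function names, the simulated coefficient 0.6, and the random seed are assumptions.

```python
import random


def fit_ar1(series):
    """Least squares fit of z[t] = phi * z[t-1] + e[t], z = series - mean."""
    mean = sum(series) / len(series)
    z = [v - mean for v in series]
    num = sum(z[t] * z[t - 1] for t in range(1, len(z)))
    den = sum(z[t - 1] ** 2 for t in range(1, len(z)))
    phi = num / den
    residuals = [z[t] - phi * z[t - 1] for t in range(1, len(z))]
    return mean, phi, residuals


def forecast_ar1(series, mean, phi, horizon):
    """Forecasts 1..horizon periods ahead; they decay toward the mean."""
    last = series[-1] - mean
    return [mean + phi ** h * last for h in range(1, horizon + 1)]


# Simulated stationary series with a known autoregressive coefficient.
rng = random.Random(0)
series = [0.0]
for _ in range(1999):
    series.append(0.6 * series[-1] + rng.gauss(0.0, 1.0))

mean, phi, residuals = fit_ar1(series)
residual_mean = sum(residuals) / len(residuals)
forecasts = forecast_ar1(series, mean, phi, 3)
```

If the residual autocorrelations still showed a pattern, the procedure described above would move on to ARMA (2, 1).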

SPSS Programs Listing for Forecasting

simple linear regression (fit)

PLOT            HSIZE=50/VSIZE=42/
                FORMAT=REGRESSION/
                PLOT= T WITH X

REGRESSION      DESCRIPTIVES=DEFAULTS/  gives mean, st.dev.
                        VARS=T,X/                           and corr.
                        DEP=X/
                        METHOD=ENTER/
                        RESIDUAL=HISTOGRAM/
                        RESIDUAL=NORMPROB/
                        SCATTERPLOT=(T,X), (*RESID,X),
                        (*RESID,T),(*RESID,*PRED)
                        /CASEWISE=ALL
or                      /CASEWISE = DEPENDENT PRED RESID ZRESID
or                      /CASEWISE = ALL DEPENDENT PRED RESID ZRESID
PEARSON CORR    PRED X/


polynomial regression

COMPUTE TSQRT=T**2
COMPUTE TCUB=T**3
REGRESSION      VARIABLES=X,T,TSQRT,TCUB/
                        DEP=X/
                        ENTER/
                        DEP=X/
                        FORWARD/ provides a sequential analysis
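
For readers without SPSS, the polynomial trend fit in the listing above can be approximated with a few lines of Python. This is a sketch under stated assumptions: the time index T is taken to be 1, 2, ..., n, the function name is hypothetical, and the coefficients come from solving the least-squares normal equations directly rather than from any SPSS routine. It does not reproduce the FORWARD selection step.

```python
def fit_polynomial_trend(x, degree=3):
    """Least-squares fit of x[t] on 1, t, t**2, ..., t**degree for t = 1..n.

    Returns the coefficients [b0, b1, ..., b_degree].
    """
    n = len(x)
    m = degree + 1
    # Design-matrix rows [1, t, t^2, ...]; t starts at 1 (an assumption).
    rows = [[float(t) ** k for k in range(m)] for t in range(1, n + 1)]
    # Normal equations A b = c with A = Z'Z and c = Z'x.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    c = [sum(r[i] * xi for r, xi in zip(rows, x)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for k in range(col, m):
                A[r][k] -= factor * A[col][k]
            c[r] -= factor * c[col]
    # Back substitution.
    b = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][k] * b[k] for k in range(i + 1, m))
        b[i] = (c[i] - tail) / A[i][i]
    return b
```

The normal equations become ill-conditioned for long series or high degrees, so this form is suitable only for short illustrative data.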



Box-Jenkins Method ARIMA

TITLE   'B-J METHOD'
FILE HANDLE SERIESG/NAME='SPS.DAT'
DATA LIST   FILE=SERIESG LIST/X *
VAR LABELS
                  X 'AIRLINE DATA'
LIST CASE   CASE=144/VARIABLES=ALL/

                1st Step
BOX-JENKINS  VARIABLE=X/PLOT=SERIES/IDENTIFY
data, graph, original, logs, differencing?
                2nd Step
BOX-JENKINS  VARIABLE=X/LOG/DIFFERENCE=0 THRU 2/
                   PERIOD=12/SDIFFERENCE=0 THRU 2/
                   LAG=49/PLOT=DSE, PAC/IDENTIFY
tentative model(s)
                3rd Step
BOX-JENKINS  VARIABLE=X/LOG/DIFFERENCE=0 THRU 2/
                   PERIOD=12/SDIFFERENCE=1/LAG=49/
                   Q=1/SQ=1/NCONSTANT/BFR=13/
                   PLOT=RAC, RES/ESTIMATION
estimation and diagnostic check:  residual s.s.? parameter(s),
significance?  BP Chi-sq.? Residual, Autocorrelations?
graph of residuals?        OK?
                4th Step
BOX-JENKINS  VARIABLE=X/LOG/DIFFERENCE=1/
                   PERIOD=12/SDIFFERENCE=1/Q=1/
                   SQ=1/FQ=(0.39631)/FSQ=(0.61306)/
                   ORIGIN=24/PLOT=FCF,FLF,CIN/
                   FORECAST
To get the forecast (24 backward, 12 forward), plot of forecast function, fixed lead forecast, confidence interval (95%).
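
The transformations requested in the listing above (LOG, DIFFERENCE, SDIFFERENCE with PERIOD=12) and the autocorrelations inspected at the identification step can be illustrated in plain Python. This is a minimal sketch, not the SPSS implementation; the function names are hypothetical, and the order in which regular and seasonal differences are applied is a choice made here (the result does not depend on it).

```python
import math

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def box_jenkins_transform(series, d=1, seasonal_d=1, period=12):
    """Log-transform, then apply d regular and seasonal_d seasonal differences."""
    z = [math.log(v) for v in series]      # requires strictly positive data
    for _ in range(d):
        z = difference(z, 1)
    for _ in range(seasonal_d):
        z = difference(z, period)
    return z

def sample_acf(z, max_lag):
    """Sample autocorrelations r_1..r_max_lag (undefined for a constant series)."""
    n = len(z)
    mean = sum(z) / n
    c0 = sum((v - mean) ** 2 for v in z)
    return [
        sum((z[t] - mean) * (z[t - k] - mean) for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]
```

Each regular difference shortens the series by one observation and each seasonal difference by one full period, so a 144-point monthly series leaves 131 values after one difference of each kind.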

Extended Version of SPSS

You may like to use the Extended Version of SPSS. If so, replace the first line in your program file with the following two JCL lines:
$START_SPSSX
$SPSSX/NOBANNER/OUTPUT=..
After submitting your job, you receive notification that the job is completed, together with some messages. Ignore these messages and proceed as with the usual SPSS version.

SAS Programs Listing for Exponential Smoothing and Winters Methods

DATA ONE;
INFILE ACME;
INPUT TIME VALUE;
PROC PRINT;
PROC PLOT DATA=ONE;
     PLOT VALUE*TIME;

PROC FORECAST DATA=ONE OUT=TWO OUTEST=THREE
 METHOD=EXPO TREND=1;
VAR VALUE;
ID TIME;

PROC PRINT DATA=THREE;
 TITLE 'THE ESTIMATE FROM SINGLE EXPO';

PROC PRINT DATA=TWO;
 TITLE ' THE OUTPUT FROM SINGLE EXPO';

PROC FORECAST DATA=ONE OUT=FOUR OUTEST=FIVE
 METHOD=EXPO TREND=2;
VAR VALUE;
ID TIME;
PROC PRINT DATA= FIVE;
 TITLE ' THE ESTIMATE FROM DOUBLE EXPO ';

PROC PRINT DATA=FOUR;
 TITLE ' THE OUTPUT FROM DOUBLE EXPO';

PROC FORECAST DATA=ONE OUT=SIX OUTEST=SEVEN
 METHOD=EXPO TREND=3;
VAR VALUE;
ID TIME;

PROC PRINT DATA=SEVEN;
 TITLE 'THE ESTIMATE FROM TRIPLE EXPO';

PROC PRINT DATA=SIX;
 TITLE ' THE OUTPUT FROM TRIPLE EXPO';

PROC FORECAST DATA=ONE OUT=A OUTEST=B
 METHOD=WINTERS SEASONS=4 TREND=2 OUTDATA OUT1STEP
 OUTLIMIT INTERVAL=1 LEAD=5;
VAR VALUE;
ID TIME;
PROC PRINT DATA=B;
 TITLE 'THE ESTIMATE FROM WINTERS METHOD';

PROC PRINT DATA=A;
 TITLE ' THE OUTPUT FROM WINTERS METHOD';
PROC PLOT DATA=A;
     PLOT (VALUE)*TIME=_TYPE_;
     TITLE 'PLOT OF FORECAST:  WINTERS METHOD';
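
The idea behind the single and double exponential smoothing runs in the listing above can be sketched in textbook form. This is an illustration only: the function names are hypothetical, the smoothing constant alpha and the choice of the first observation as the starting value are assumptions, and the code makes no claim to reproduce the defaults or internal formulas of PROC FORECAST.

```python
def single_exponential_smoothing(x, alpha):
    """Return the smoothed series S_t = alpha*x_t + (1 - alpha)*S_{t-1}.

    S_1 is set to x_1 (an assumption); the last value serves as the
    one-step-ahead forecast.
    """
    s = x[0]
    smoothed = [s]
    for value in x[1:]:
        s = alpha * value + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

def brown_double_forecast(x, alpha, lead=1):
    """Brown's double smoothing: forecast `lead` periods past the last point."""
    s1 = s2 = x[0]                          # both passes start at x_1
    for value in x[1:]:
        s1 = alpha * value + (1 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2      # second pass on the first
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * lead
```

Single smoothing suits a series with no trend; the double version adds a linear trend term, which is the distinction the TREND=1 and TREND=2 options draw in the SAS runs.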

Modeling Financial Time Series

We are attempting to 'model' what the reality is so that we can predict it. Statistical Modeling, in addition to being of central importance in statistical decision making, is critical in any endeavor, since essentially everything is a model of reality. As such, modeling has applications in such disparate fields as marketing, finance, and organizational behavior. Particularly compelling is econometric modeling since, unlike most disciplines (such as Normative Economics), econometrics deals only with provable facts, not with beliefs and opinions.
Time series analysis is an integral part of financial analysis. The topic is interesting and useful, with applications to the prediction of interest rates, foreign currency risk, stock market volatility, and the like. There are many varieties of econometric and multivariate techniques. Specific examples are regression and multivariate regression; vector autoregressions; and cointegration regarding tests of present value models. The next section presents the underlying theory on which statistical models are predicated.
Financial Modeling: Econometric modeling is vital in finance and in financial time series analysis. Modeling is, simply put, the creation of representations of reality. It is important to be mindful that, despite the importance of the model, it is in fact only a representation of reality and not the reality itself. Accordingly, the model must adapt to reality; it is futile to attempt to adapt reality to the model. As representations, models cannot be exact. Models imply that action is only taken after careful thought and reflection. This can have major consequences in the financial realm. A key element of financial planning and financial forecasting is the ability to construct models showing the interrelatedness of financial data. Models showing correlation or causation between variables can be used to improve financial decision-making. For example, one would be more concerned about the consequences for the domestic stock market of a downturn in another economy if it can be shown that there is a mathematically provable causative impact of that nation's economy on the domestic stock market. However, modeling is fraught with dangers. A model which heretofore was valid may lose validity due to changing conditions, thus becoming an inaccurate representation of reality and adversely affecting the ability of the decision-maker to make good decisions.
The examples of univariate and multivariate regression, vector autoregression, and present value cointegration illustrate the application of modeling, a vital dimension in managerial decision making, to econometrics, and specifically the study of financial time series. The provable nature of econometric models is impressive; rather than proffering solutions to financial problems based on intuition or convention, one can mathematically demonstrate that a model is or is not valid, or requires modification. It can also be seen that modeling is an iterative process, as the models must continuously change to reflect changing realities. The ability to do so has striking ramifications in the financial realm, where the ability of models to accurately predict financial time series is directly related to the ability of the individual or firm to profit from changes in financial scenarios.
Univariate and Multivariate Models: The use of regression analysis is widespread in examining financial time series. Some examples are the use of forward exchange rates as optimal predictors of future spot rates; conditional variance and the risk premium in foreign exchange markets; and stock returns and volatility. A model that has been useful for this type of application is called the GARCH-M model, which incorporates computation of the mean into the GARCH (generalized autoregressive conditional heteroskedastic) model. This sounds complex and esoteric, but it only means that the serially correlated errors and the conditional variance enter the mean computation, and that the conditional variance itself depends on a vector of explanatory variables. The GARCH-M model has been further modified, a testament to finance practitioners' recognition of the need to adapt the model to a changing reality. For example, this model can now accommodate exponential (non-linear) functions, and is no longer constrained by non-negativity restrictions on its parameters.
One application of this model is the analysis of stock returns and volatility. Traditionally, the belief has been that the variance of portfolio returns is the primary risk measure for investors. However, using extensive time series data, it has been proven that the relationship between mean returns and return variance or standard deviation is weak; hence the traditional two-parameter asset pricing models appear to be inappropriate, and mathematical proof replaces convention. Since decisions premised on the original models are necessarily sub-optimal because the original premise is flawed, it is advantageous for the finance practitioner to abandon the model in favor of one with a more accurate representation of reality.
Correct specification of a model is of paramount importance, and a battery of misspecification testing criteria has been established. These include tests of normality, linearity, and homoskedasticity, and can be applied to a variety of models. A simple example which yields surprising results is the Capital Asset Pricing Model, one of the cornerstones of elementary economics. Application of the testing criteria to data concerning companies' risk premium shows significant evidence of non-linearity, non-normality and parameter non-constancy. The CAPM was found to be applicable for only three of seventeen companies that were analyzed. This does not mean, however, that the CAPM should be summarily rejected; it still has value as a pedagogic tool, and can be used as a theoretical framework. For the econometrician or financial professional, for whom the misspecification of the model can translate into suboptimal financial decisions, the CAPM should be supplanted by a better model, specifically one that reflects the time-varying nature of betas. The GARCH-M framework is one such model.
Multivariate linear regression models apply the same theoretical framework. The principal difference is the replacement of the dependent variable by a vector. The estimation theory is essentially a multivariate extension of that developed for the univariate case, and as such can be used to test models such as the stock and volatility model and the CAPM. In the case of the CAPM, the vector introduced is excess asset returns at a designated time. One application is the computation of the CAPM with time-varying covariances. Although in this example the null hypothesis that all intercepts are zero cannot be rejected, the misspecification problems of the univariate model still remain. Slope and intercept estimates also remain the same, since the same regression appears in each equation.
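
To make the GARCH-M idea in this subsection concrete, the following Python sketch generates returns whose mean depends on their own conditional variance. It is a deliberately reduced version: a GARCH(1,1) variance with a constant mean term, without the serially correlated errors or the explanatory variables in the variance equation that the text mentions. All parameter names (mu, delta, omega, alpha, beta) are assumptions introduced for illustration.

```python
import math

def garch_m_path(shocks, mu, delta, omega, alpha, beta):
    """Generate returns and conditional variances from a GARCH(1,1)-M recursion.

    y_t   = mu + delta * h_t + eps_t        (variance enters the mean)
    eps_t = sqrt(h_t) * z_t                 (z_t are the supplied shocks)
    h_t   = omega + alpha * eps_{t-1}**2 + beta * h_{t-1}
    """
    # Plain GARCH needs these restrictions to keep h_t positive and stable.
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1")
    h = omega / (1.0 - alpha - beta)        # start at the unconditional variance
    returns, variances = [], []
    for z in shocks:
        eps = math.sqrt(h) * z
        returns.append(mu + delta * h + eps)
        variances.append(h)
        h = omega + alpha * eps ** 2 + beta * h
    return returns, variances
```

The parameter check corresponds to the non-negativity restrictions that, as noted above, later exponential variants of the model no longer require.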
Vector Autoregression: General regression models assume that the dependent variable is a function of past values of itself and past and present values of the independent variable. The independent variable, then, is said to be weakly exogenous, since its stochastic structure contains no relevant information for estimating the parameters of interest. While the weak exogeneity of the independent variable allows efficient estimation of the parameters of interest without any reference to its own stochastic structure, problems in predicting the dependent variable may arise if "feedback" from the dependent to the independent variable develops over time. (When no such feedback exists, it is said that the dependent variable does not Granger-cause the independent variable.) Weak exogeneity coupled with Granger non-causality yields strong exogeneity which, unlike weak exogeneity, is directly testable. To perform the tests requires utilization of the dynamic structural equation model (DSEM) and the vector autoregressive process (VAR). The multivariate regression model is thus extended in two directions, by allowing simultaneity between the endogenous variables in the dependent variable, and explicitly considering the process generating the exogenous independent variables.
Results of this testing are useful in determination of whether an independent variable is strictly exogenous or is predetermined. Strict exogeneity can be tested in DSEMs by expressing each endogenous variable as an infinite distributed lag of the exogenous variables. If the independent variable is strictly exogenous, attention can be limited to distributions conditional on the independent variable without loss of information, resulting in simplification of statistical inference. If the independent variable is strictly exogenous, it is also predetermined, meaning that all of its past and current values are independent of the current error term. While strict exogeneity is closely related to the concept of Granger non-causality, the two concepts are not equivalent and are not interchangeable.
It can be seen that this type of analysis is helpful in verifying the appropriateness of a model as well as proving that, in some cases, the process of statistical inference can be simplified without losing accuracy, thereby both strengthening the credibility of the model while increasing the efficiency of the modeling process. Vector autoregressions can be used to calculate other variations on causality, including instantaneous causality, linear dependence, and measures of feedback from the dependent to the independent and from the independent to the dependent variables. It is possible to proceed further with developing causality tests, but simulation studies which have been performed reach a consensus that the greatest combination of reliability and ease can be obtained by applying the procedures described.
Cointegration and Present Value Modeling: Present value models are used extensively in finance to formulate models of efficient markets. In general terms, a present value model for two variables y1 and x1 states that y1 is a linear function of the present discounted value of the expected future values of x1, where the constant term, the constant discount factor, and the coefficient of proportionality are parameters that are either known or need to be estimated. Not all financial time series are non-integrated; the presence of integrated variables affects standard regression results and procedures of inference. Variables may also be cointegrated, requiring the superimposition of cointegrating vectors on the model, and resulting in circumstances under which the concept of equilibrium loses all practical implications and spurious regressions may occur. In present value analysis, cointegration can be used to define the "theoretical spread" and to identify co-movements of variables. This is useful in constructing volatility-based tests.
One such test concerns stock market volatility. Assuming cointegration, second-order vector autoregressions are constructed, which suggest that dividend changes are not only highly predictable but are Granger-caused by the spread. When the assumed value of the discount rate is increased, certain restrictions can be rejected at low significance levels. This yields results showing an even more pronounced "excess volatility" than that anticipated by the present value model. It also illustrates that the model is more appropriate in situations where the discount rate is higher. The implications of applying a cointegration approach to stock market volatility testing for financial managers are significant. Of related significance is the ability to test the expectations hypotheses of interest rate term structure.
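
As a small worked illustration of the feedback idea in this subsection, consider the simplest case, a first-order VAR in two variables. The notation (coefficients a_ij, errors u_t and v_t) is introduced here for illustration and does not come from the text.

```latex
\begin{aligned}
y_t &= a_{11}\, y_{t-1} + a_{12}\, x_{t-1} + u_t \\
x_t &= a_{21}\, y_{t-1} + a_{22}\, x_{t-1} + v_t
\end{aligned}
\qquad
H_0:\; a_{21} = 0
```

Under H_0, past values of y do not help predict x, which is the statement that y does not Granger-cause x. With more lags the same hypothesis becomes a joint restriction on all coefficients of lagged y in the x equation.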
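
The present value relation described in words above can be written compactly. The following is one common textbook form, given as a sketch: the symbols c (constant term), \delta (discount factor) and \theta (coefficient of proportionality) are labels chosen here, and the text's y1 and x1 are written as y_t and x_t.

```latex
y_t = \theta\,(1 - \delta) \sum_{i=0}^{\infty} \delta^{i}\, E_t\!\left[x_{t+i}\right] + c,
\qquad
S_t = y_t - \theta\, x_t
```

If x_t and y_t are both integrated of order one, this relation implies that the spread S_t is stationary, so the two series are cointegrated with cointegrating vector (1, -\theta).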

Measuring for Accuracy

Given a set of data and its forecasted values obtained by any method, this interactive Fortran program computes statistics that indicate how well the forecasting method fits the original data set.
        INTEGER TESTART, NTEST
        REAL LASTX, LASTF, LASTERR
C       Zero the running sums before accumulating.
        SSE = 0.0
        TMAPE = 0.0
        SUMERR = 0.0
        SUMABSERR = 0.0
        SUMX = 0.0
        UNUMERATOR = 0.0
        UDENOMINATOR = 0.0
        WDNUMERATOR = 0.0
        WRITE (*, *)' PLEASE ENTER (IN ORDER) HOW MANY PERIODS'
        WRITE (*, *)' YOU HAVE AND FROM WHAT PERIOD'
        WRITE (*, *)' YOU WANT TO TEST YOUR FORECASTS ?'
        READ   (*, *) MAXPERIODS, TESTART
        WRITE (6,100)
100     FORMAT (/,'     PERIOD','       DATA ',' FORECASTS')
C       Echo the periods before the test window (TESTART >= 2).
        DO 10 I=1, (TESTART - 1)
          READ (5,150) X, F
150       FORMAT (2F8.2)
          WRITE (6,250) I,X,F
250       FORMAT (7X,I2,3X,F8.2,5X,F8.2)
10      CONTINUE
        LASTX=X
        LASTF=F
        LASTERR= LASTX - LASTF
C       Accumulate the error sums over the test periods.
        DO 20 J=TESTART, MAXPERIODS
          READ (5,300) X,F
300       FORMAT (2F8.2)
          WRITE (6,350) J,X,F
350       FORMAT (7X,I2,3X,F8.2,5X,F8.2)
          ERR = X - F
          SSE = SSE + (ERR)**2
          TMAPE = TMAPE + ABS(ERR/X)
          SUMERR = SUMERR + ERR
          SUMABSERR = SUMABSERR + ABS(ERR)
          SUMX = SUMX + X
          UNUMERATOR = UNUMERATOR + ((F - X)/LASTX)**2
          UDENOMINATOR = UDENOMINATOR + ((X - LASTX)/LASTX)**2
          WDNUMERATOR = WDNUMERATOR + (ERR - LASTERR)**2
          LASTERR = ERR
          LASTX = X
20      CONTINUE
C       NTEST counts the test periods TESTART through MAXPERIODS.
        NTEST = MAXPERIODS - TESTART + 1
        VME = SUMERR/NTEST
        VMAE = SUMABSERR/NTEST
        SDE = SQRT (SSE/(NTEST - 1))
        VMSE = SSE/NTEST
        VMAPE = (TMAPE*100)/NTEST
        THEILSTAT = SQRT (UNUMERATOR/UDENOMINATOR)
        VLAUGHLINS = (4 - THEILSTAT) * 100
        DW = WDNUMERATOR/SSE
        WRITE (6,600)
600     FORMAT (//5X,' **** STATISTICS*** ')
        WRITE(6,200) VME,VMAE,SDE,VMSE,VMAPE,THEILSTAT,VLAUGHLINS,DW
200     FORMAT (' ME= ',F8.2,/,' MAE= ',F8.2,/,' SDE= ',F8.2,/,
     $    ' MSE= ',F8.2,/,' MAPE= ',F8.2,/,' THEILSTAT= ',F8.2,/,
     $    ' LAUGHLINS= ',F8.2,/,' DURBIN-WATSON= ',F8.2)
        STOP
        END
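
For comparison, the same accuracy measures can be computed in a few lines of Python. This is a sketch rather than a translation of the Fortran program: the function name and dictionary keys are hypothetical, the averages divide by the number of test periods, and Theil's statistic is taken in the form that compares the forecast errors with those of a naive "no change" forecast, so that a value of 1 means no improvement over the naive method.

```python
import math

def forecast_accuracy(actual, forecast, test_start):
    """Accuracy measures over periods test_start..len(actual) (1-based).

    test_start must be at least 2 because the Theil and Durbin-Watson
    statistics need the preceding period; at least two test periods are
    needed for the standard deviation of errors.
    """
    if len(actual) != len(forecast) or test_start < 2:
        raise ValueError("series must match in length and test_start >= 2")
    idx = range(test_start - 1, len(actual))          # 0-based positions
    n = len(idx)
    errors = [actual[i] - forecast[i] for i in idx]
    prev_errors = [actual[i - 1] - forecast[i - 1] for i in idx]
    sse = sum(e * e for e in errors)
    u_num = sum(((forecast[i] - actual[i]) / actual[i - 1]) ** 2 for i in idx)
    u_den = sum(((actual[i] - actual[i - 1]) / actual[i - 1]) ** 2 for i in idx)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sse / n,
        "SDE": math.sqrt(sse / (n - 1)),
        "MAPE": 100.0 * sum(abs(e / actual[i]) for e, i in zip(errors, idx)) / n,
        "TheilU": math.sqrt(u_num / u_den),
        "DW": sum((e - p) ** 2 for e, p in zip(errors, prev_errors)) / sse,
    }
```

Feeding in a naive forecast (each value equal to the previous actual) returns a Theil statistic of exactly 1, which is a convenient sanity check before comparing real forecasting methods.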