Statistical Data Analysis
B. R. Asrabadi
Math 503: Data Analysis I
Math 504: Data Analysis II
Topics in Statistical Data Analysis
Statistical Data Analysis
-
Statistics is
a set of methods that are used to collect, analyze, present, and interpret
data. Statistical methods are used in a wide variety of occupations and
help people identify, study, and solve many complex problems. In the business
and economic world, these methods enable decision makers and managers to
make informed and better decisions about uncertain situations.
-
Vast amounts of statistical
information are available in today's global and economic environment because
of continual improvements in computer technology. To compete successfully
globally, managers and decision makers must be able to understand the information
and use it effectively. Statistical data analysis provides hands-on experience
that promotes the use of statistical thinking and techniques for making
educated decisions in the business world.
-
Computers play a very
important role in statistical data analysis. The statistical software package
SPSS, which is used in this course, offers extensive data-handling capabilities
and numerous statistical analysis routines that can analyze small to very
large data sets. The computer will assist in the summarization of
data, but statistical data analysis focuses on the interpretation of the
output to make inferences and predictions.
-
Studying a problem
through the use of statistical data analysis usually involves four basic
steps.
-
1. Defining the problem
-
2. Collecting
the data
-
3. Analyzing
the data
-
4. Reporting
the results
-
-
Defining
the Problem
-
An exact definition of the problem is imperative
in order to obtain accurate data about it. It is extremely difficult to
gather data without a clear definition of the problem.
-
Collecting the Data
-
We live and work at a time when data collection
and statistical computations have become easy almost to the point of triviality.
Paradoxically, the design of data collection, never sufficiently emphasized
in statistical data analysis textbooks, has been weakened by an apparent
belief that extensive computation can make up for any deficiencies in the
design of data collection. One must start with an emphasis on the importance
of defining the population about which we are seeking to make inferences;
all the requirements of sampling and experimental design must then be met.
-
Designing ways to collect data is an important
job in statistical data analysis. Two important aspects of a statistical
study are:
Population - the set of all the elements of interest in a study
Sample - a subset of the population
Statistical inference
refers to extending the knowledge obtained from a random sample to the
whole population from which it was drawn. This is known in mathematics as
inductive reasoning, that is, knowledge of the whole from a particular. Its
main application is in testing hypotheses about a given population.
The purpose of
statistical inference is to obtain information about a population from
information contained in a sample. It is just not feasible to test the
entire population, so a sample is the only realistic way to obtain data
because of the time and cost constraints. Data can be either quantitative
or qualitative. Qualitative data are labels or names used to identify an
attribute of each element. Quantitative data are always numeric and indicate
either how much or how many.
For the purpose
of statistical data analysis, distinguishing between cross-sectional and
time series data is important. Cross-sectional data are data collected at
the same or approximately the same point in time. Time series data are
data collected over several time periods.
Data can be collected
from existing sources or obtained through observation and experimental
studies designed to obtain new data. In an experimental study, the variable
of interest is identified. Then one or more factors in the study are controlled
so that data can be obtained about how the factors influence the variables.
In observational studies, no attempt is made to control or influence the
variables of interest. A survey is perhaps the most common type of observational
study.
Analyzing the Data
Statistical data
analysis divides the methods for analyzing data into two categories: exploratory
methods and confirmatory methods. Exploratory methods are used to discover
what the data seem to be saying by using simple arithmetic and easy-to-draw
pictures to summarize data. Confirmatory methods use ideas from probability
theory in the attempt to answer specific questions. Probability is important
in decision making because it provides a mechanism for measuring, expressing,
and analyzing the uncertainties associated with future events. The majority
of the topics addressed in this course fall under this heading.
Reporting the Results
Through inference,
estimates of the characteristics of a population, or tests of claims about
them, can be obtained from a sample. The results may be reported in the form of a
table, a graph or a set of percentages. Because only a small collection
(sample) has been examined and not an entire population, the reported results
must reflect the uncertainty through the use of probability statements
and intervals of values.
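As a rough illustration of reporting a result as an interval of values, the sketch below computes a 95% confidence interval for a mean. The function name and the use of the normal quantile 1.96 are assumptions made for this example; with a small sample a t quantile would be the more careful choice.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Interval {
    double lower;
    double upper;
};

// Large-sample 95% confidence interval for a population mean.
// Assumption: the normal quantile 1.96 is adequate for the sample at hand.
Interval mean_confidence_interval_95(const std::vector<double>& sample)
{
    assert(sample.size() >= 2);
    const double n = static_cast<double>(sample.size());
    double sum = 0.0;
    for (double x : sample) sum += x;
    const double mean = sum / n;

    double ss = 0.0;  // sum of squared deviations from the mean
    for (double x : sample) ss += (x - mean) * (x - mean);

    // Standard error of the mean: sample SD divided by sqrt(n).
    const double se = std::sqrt(ss / (n - 1.0)) / std::sqrt(n);
    return {mean - 1.96 * se, mean + 1.96 * se};
}
```

For the eight values 2, 4, 4, 4, 5, 5, 7, 9 the sample mean is 5 and the interval runs from about 3.52 to 6.48.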
To conclude, a
critical aspect of managing any organization is planning for the future.
Good judgment, intuition, and an awareness of the state of the economy
may give a manager a rough idea or "feeling" of what is likely to happen
in the future. However, converting that feeling into a number that can
be used effectively is difficult. Statistical data analysis helps managers
forecast and predict future aspects of a business operation. The most successful
managers and decision makers are the ones who can understand the information
and use it effectively.
Data Processing: Coding, Typing, and Editing
-
Data are often recorded
manually on data sheets. Unless the numbers of observations and variables
are small, the data must be analyzed on a computer. The data will then go
through three stages:
-
Coding: the data are
transferred, if necessary, to coded sheets.
-
Typing: the data are
typed and stored by at least two independent data entry persons. For example,
when the Current Population Survey and other monthly surveys were taken
using paper questionnaires, the U.S. Census Bureau used double key data
entry.
-
Editing: the data
are checked by comparing the two independently typed copies. The standard practice
for key-entering data from paper questionnaires is to key in all the data
twice. Ideally, the second time should be done by a different key entry
operator whose job specifically includes verifying mismatches between the
original and second entries. It is believed that this "double-key/verification"
method produces a 99.8% accuracy rate for total keystrokes.
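The verification step can be sketched in a few lines. This is only an illustration under the assumption that each keyed copy is held as a list of text fields; the function name is hypothetical and is not taken from any Census Bureau system.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the positions at which two independently keyed copies disagree.
// A field present in only one copy is also reported as a mismatch.
std::vector<std::size_t> find_mismatches(const std::vector<std::string>& first_entry,
                                         const std::vector<std::string>& second_entry)
{
    std::vector<std::size_t> mismatches;
    const std::size_t shorter = std::min(first_entry.size(), second_entry.size());
    const std::size_t longer = std::max(first_entry.size(), second_entry.size());
    for (std::size_t i = 0; i < shorter; ++i) {
        if (first_entry[i] != second_entry[i]) mismatches.push_back(i);
    }
    for (std::size_t i = shorter; i < longer; ++i) {
        mismatches.push_back(i);
    }
    return mismatches;
}
```

Each reported position is then checked against the paper questionnaire, which is how an inversion such as 123.45 keyed as 123.54 gets caught.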
-
Types of error: Recording
error, typing error, transcription error (incorrect copying), Inversion
(e.g., 123.45 is typed as 123.54), Repetition (when a number is repeated),
Deliberate error.
Multivariate Data Analysis
-
Data are easy to collect;
what we really need in complex problem solving is information. We may view
a data base as a domain that requires probes and tools to extract relevant
information. As in the measurement process itself, appropriate instruments
of reasoning must be applied to the data interpretation task. Effective
tools serve in two capacities: to summarize the data and to assist in interpretation.
The objectives of interpretive aids are to reveal the data at several levels
of detail.
-
Exploring the fuzzy
data picture sometimes requires a wide-angle lens to view its totality.
At other times it requires a closeup lens to focus on fine detail. The
graphically based tools that we use provide this flexibility. Most chemical
systems are complex because they involve many variables and there are many
interactions among the variables. Therefore, chemometric techniques rely
upon multivariate statistical and mathematical tools to uncover interactions
and reduce the dimensionality of the data.
-
Principal component
analysis is used for exploring data. Two closely related techniques, principal
component analysis and factor analysis, are used to reduce the dimensionality
of multivariate data. In these techniques correlations and interactions
among the variables are summarized in terms of a small number of underlying
factors. The methods rapidly identify key variables or groups of variables
that control the system under study. The resulting dimension reduction
also permits graphical representation of the data so that significant relationships
among observations or samples can be identified.
-
Other techniques include
Multidimensional Scaling, Cluster Analysis, and Correspondence Analysis.
-
Multivariate analysis
is a branch of statistics involving the consideration of objects on each
of which are observed the values of a number of variables. A wide range
of methods is used for the analysis of multivariate data, and this course
will give a view of the variety of methods available, as well as going
into some of them in detail. Multivariate techniques are used across the
whole range of fields of statistical application: in medicine, physical
and biological sciences, economics and social science, and of course in
many industrial and commercial applications.
The Meaning and Interpretation of P-values (what do the data say?)
The P-value, which
directly depends on a given sample, attempts to provide a measure of the
strength of the results of a test, in contrast to a simple reject or do
not reject. If the null hypothesis is true and the chance of random variation
is the only reason for sample differences, then the P-value is a quantitative
measure to feed into the decision making process as evidence.
-
When a p-value is
associated with a set of data, it is a measure of the probability that
the data could have arisen as a random sample from some population described
by the statistical (testing) model.
-
A p-value is a measure
of how much evidence you have against the null hypothesis. The smaller
the p-value, the more evidence you have. One may combine the p-value with
the significance level to make a decision on a given test of hypothesis.
In such a case, if the p-value is less than some threshold (usually .05,
sometimes a bit larger like 0.1 or a bit smaller like .01) then you reject
the null hypothesis.
-
Understand that the
distribution of p-values under the null hypothesis H0 is uniform on (0, 1)
when the test statistic is continuous, and thus
does not depend on a particular form of the statistical test. In a statistical
hypothesis test, the P value is the probability of observing a test statistic
at least as extreme as the value actually observed, assuming that the null
hypothesis is true. The value of p is defined with respect to a distribution.
Therefore, we could call it "model-distributional hypothesis" rather than
"the null hypothesis".
-
In short, the p-value
is the probability, computed as if the null hypothesis were true, of a result
at least as extreme as the one observed. Because the p-value is determined by
the observed value, it is difficult even to state the inverse of p.
-
P-value for Standard Normal and t-statistics
-
Conversion of a z-statistic into a (one-sided) P-value
INPUT "Z : ", ZValue
a1# = .31938153#
a2# = -.356563782#
a3# = 1.781477937#
a4# = -1.821255978#
a5# = 1.330274429#
w1# = ABS(ZValue)
w# = 1 / (1 + .2316419# * w1#)
w1# = .39894228# * EXP(-.5 * w1# * w1#)
p0# = w# *(a1# + w# *(a2# + w# *(a3# + w# * (a4# + a5# * w#))))
p0# = (w1# * p0#)
' Upper-tail area P(Z > z): a negative z needs the complement
IF ZValue < 0 THEN
p0# = 1 - p0#
END IF
PRINT p0#
-
Upper-tail area (to the right of z, for z >= 0) under the normal density: EXP(-((83*Z+351)*Z+562)*Z/(703+165*Z))/2
Below is a similar program:
INPUT z
a1 = .31938153#
a2 = -.356563782#
a3 = 1.781477937#
a4 = -1.821255978#
a5 = 1.330274429#
w1 = ABS(z)
w = 1 / (1 + .2316419 * w1)
w1 = .39894228# * EXP(-.5 * w1 * w1)
p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))))
p0 = w1 * p0
PRINT ABS(p0);
-
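Conversion of a z-statistic into a (one-sided) P-value: in C++ code
#include <cmath>
// Upper-tail area of the standard normal distribution beyond |z|.
// Add __declspec(dllexport) when building as a Windows DLL.
double NormalProb(double z)
{
    const double a1 = .31938153;
    const double a2 = -.356563782;
    const double a3 = 1.781477937;
    const double a4 = -1.821255978;
    const double a5 = 1.330274429;
    double w1 = std::fabs(z);
    double w = 1 / (1 + .2316419 * w1);
    // Normal density at |z|: 1/sqrt(2*pi) * exp(-z*z/2)
    w1 = .39894228 * std::exp(-0.5 * w1 * w1);
    // Polynomial in w, evaluated by Horner's rule
    double p0 = w * (a1 + w * (a2 + w * (a3 + w * (a4 + a5 * w))));
    p0 = w1 * p0;
    return std::fabs(p0);
}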
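As a cross-check on the polynomial approximation above, the upper-tail area can also be obtained from the complementary error function in the C++ standard library. This is a swapped-in technique, not the method used in the listings above, and the function name is chosen for this example only.

```cpp
#include <cassert>
#include <cmath>

// Upper-tail area P(Z > z) of the standard normal distribution,
// using the identity P(Z > z) = erfc(z / sqrt(2)) / 2.
double normal_upper_tail(double z)
{
    return 0.5 * std::erfc(z / std::sqrt(2.0));
}
```

For z = 1.96 this gives approximately 0.025, and for z = 0 it gives 0.5, which is a quick way to confirm which tail a routine is reporting.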
-
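Conversion of a t-statistic into a (one-sided) P-value: C++
#include <cmath>
// Upper-tail area P(T > t) for Student's t with df degrees of freedom,
// from the finite trigonometric series in w = atan(t / sqrt(df)).
// Add __declspec(dllexport) when building as a Windows DLL.
double TProb(double t, int df)
{
    const double a = 0.36338023; // 1 - 2/pi
    const double w = std::atan(t / std::sqrt(static_cast<double>(df)));
    const double s = std::sin(w);
    const double c = std::cos(w);
    double t1, t2;
    int j1, j2, k2;
    if (df % 2 == 0) // even
    {
        t1 = s;
        if (df == 2) // special case df=2
            return 1 - (0.5 * (1 + t1));
        t2 = s;
        j1 = -1;
        j2 = 0;
        k2 = (df - 2) / 2;
    }
    else
    {
        t1 = w;
        if (df == 1) // special case df=1
            return 1 - (0.5 * (1 + (t1 * (1 - a))));
        t2 = s * c;
        t1 = t1 + t2;
        if (df == 3) // special case df=3
            return 1 - (0.5 * (1 + (t1 * (1 - a))));
        j1 = 0;
        j2 = 1;
        k2 = (df - 3) / 2;
    }
    // Accumulate the remaining terms of the series
    for (int i = 1; i <= k2; i++)
    {
        j1 = j1 + 2;
        j2 = j2 + 2;
        t2 = t2 * c * c * j1 / j2;
        t1 = t1 + t2;
    }
    // For odd df the series is scaled by 2/pi = 1 - a
    return 1 - (0.5 * (1 + (t1 * (1 - a * (df % 2)))));
}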
-
What
is a Meta-Analysis?
-
A Meta-analysis deals
with a set of RESULTs to give an overall RESULT that is comprehensive and
valid.
-
a) Especially when
Effect-sizes are rather small, the hope is that one can gain good power
by essentially pretending to have the larger N as a valid, combined sample.
-
b) When effect sizes
are rather large, then the extra POWER is not needed for main effects of
design: Instead, it theoretically could be possible to look at contrasts
between the slight variations in the studies themselves.
-
For example, to compare
two effect sizes (r) obtained by two separate studies, you may use:
-
$Z = (z_1 - z_2) / \sqrt{1/(n_1 - 3) + 1/(n_2 - 3)}$
-
where $z_1$ and $z_2$ are the Fisher transformations of the two r values,
and $n_1$ and $n_2$ in the denominator are the sample sizes of the two studies.
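The comparison of two correlations can be sketched as follows. The function name is an assumption made for this example; the Fisher transformation itself is the inverse hyperbolic tangent, available as std::atanh.

```cpp
#include <cassert>
#include <cmath>

// Z statistic for the difference between two independent correlations
// r1 (from n1 cases) and r2 (from n2 cases), via Fisher's transformation.
double compare_correlations(double r1, int n1, double r2, int n2)
{
    assert(n1 > 3 && n2 > 3);
    assert(std::fabs(r1) < 1.0 && std::fabs(r2) < 1.0);
    const double z1 = std::atanh(r1);  // Fisher transformation of r1
    const double z2 = std::atanh(r2);  // Fisher transformation of r2
    const double se = std::sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3));
    return (z1 - z2) / se;
}
```

With r = 0.5 in one study and r = 0 in the other, each based on 103 cases, Z is about 3.88, which would be referred to the standard normal distribution.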
-
This comparison is valid only if you really trust
that "all things being equal" will hold up. The typical "meta" study does
not do the tests for homogeneity that should be required.
-
In other words:
-
1. there is a body
of research/data literature that you would like to summarize
-
2. one gathers together
all the admissible examples of this literature (note: some might be discarded
for various reasons)
-
3. certain details
of each investigation are deciphered ... most important would be the effect
that has or has not been found, i.e., how much larger in sd units is the
treatment group's performance compared to one or more controls.
-
4. call the values
in each of the investigations in #3 .. mini effect sizes.
-
5. across all admissible
data sets, you attempt to summarize the overall effect size by forming
a set of individual effects ... and using an overall sd as the divisor
.. thus yielding essentially an average effect size.
-
6. in the meta analysis
literature ... sometimes these effect sizes are further labeled as small,
medium, or large ....
-
You can look at effect
sizes in many different ways .. across different factors and variables.
but, in a nutshell, this is what is done.
-
I recall a case in
physics, in which, after a phenomenon had been observed in air, emulsion
data was examined. The theory would have about a 9% effect in emulsion,
and behold, the published data gave 15%. As it happens, there was no significant
difference (practical, not statistical) in the theory, and also no error in the data.
It was just that the results of experiments in which nothing statistically
significant was found were not reported.
-
This non-reporting
of such experiments, and often of the specific results which were not statistically
significant, introduces major biases. This is also combined with
the totally erroneous attitude of researchers that statistically significant
results are the important ones, and that if there is no significance, the
effect was not important. We really need to differentiate between the term "statistically
significant" and the usual word significant.
-
Meta-analysis is a
controversial type of literature review in which the results of individual
randomized controlled studies are pooled together to try to get an estimate
of the effect of the intervention being studied. It increases statistical
power and is used to resolve the problem of reports which disagree with
each other. It's not easy to do well and there are many inherent problems.
-
For details, see,
Meta-Analysis in Social Research, by Glass, McGraw and Smith, 1987.
What Is the Effect Size
-
Effect size (ES) is
a ratio of a mean difference to a standard deviation, i.e. it is a form
of z-score. Suppose an experimental treatment group has a mean score of
Xe and a control group has a mean score of Xc and a standard deviation
of Sc, then the effect size is equal to (Xe - Xc)/Sc
-
Effect size permits
the effects of different treatments to be compared, even when
based on different samples and different measuring instruments.
-
Therefore, the ES
is the standardized mean difference between the control group and the treatment group.
However, by Glass's method, ES is (mean1 - mean2)/SD of control group
while by the Hunter-Schmidt method, ES is (mean1 - mean2)/pooled SD and then
adjusted by the instrument reliability coefficient. ES is commonly used in
meta-analysis and power analysis.
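The two divisors can be compared in a short sketch. The function names are assumptions for this example, and the reliability adjustment mentioned for the Hunter-Schmidt method is deliberately left out, so the second function returns only the unadjusted pooled-SD effect size.

```cpp
#include <cassert>
#include <cmath>

// Glass's method: mean difference divided by the control group's SD.
double glass_effect_size(double mean_treatment, double mean_control, double sd_control)
{
    assert(sd_control > 0.0);
    return (mean_treatment - mean_control) / sd_control;
}

// Mean difference divided by the pooled SD of the two groups
// (no reliability adjustment is applied here).
double pooled_sd_effect_size(double mean1, double sd1, int n1,
                             double mean2, double sd2, int n2)
{
    assert(n1 > 1 && n2 > 1);
    const double pooled_var =
        ((n1 - 1) * sd1 * sd1 + (n2 - 1) * sd2 * sd2) / (n1 + n2 - 2);
    assert(pooled_var > 0.0);
    return (mean1 - mean2) / std::sqrt(pooled_var);
}
```

When the two groups have the same SD the two methods agree; they diverge as the treatment changes the spread of scores.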
Structural Equation Modeling
-
The structural equation
modeling techniques are used to study relations among variables. The relations
are typically assumed to be linear. In social and behavioral research most
phenomena are influenced by a large number of determinants which typically
have a complex pattern of interrelationships. To understand the relative
importance of these determinants their relations must be adequately represented
in a model, which may be done with structural equation modeling.
-
A structural equation
model may apply to one group of cases or to multiple groups of cases. When
multiple groups are analyzed parameters may be constrained to be equal
across two or more groups. When two or more groups are analyzed, means
on observed and latent variables may also be included in the model.
-
As an application,
how do you test the equality of regression slopes coming from the same
sample using 3 different measuring methods? You could use a structural
modeling approach.
-
1 - Standardize all
three data sets prior to the analysis, because b weights
are also a function of the variance of the predictor variable and
with standardization, you remove this source.
-
2 - Model the dependent
variable as the effect from all three measures and obtain the path coefficient
(b weight) for each one.
-
3 - Then fit a model
in which the three path coefficients are constrained to be equal. If a
significant decrement in fit occurs, the paths are not equal.
-
Further Reading:
Schumacker
R., and R. Lomax, A Beginner's Guide to Structural Equation Modeling,
Lawrence Erlbaum, New Jersey, 1996.
-
Visit also the Web
site Structural Equation
Modeling on the Internet
Tri-linear Coordinates Triangle
-
A "ternary diagram"
is usually used to show the change of opinion (FOR - AGAINST - UNDECIDED).
The triangular diagram was first used by the chemist Willard Gibbs in his studies
on phase transitions. It is based on the proposition from geometry that
in an equilateral triangle, the sum of the distances from any point to
the three sides is constant. This implies that the percent composition
of a mixture of three substances can be represented as a point in such
a diagram, since the sum of the percentages is constant (100). The three
vertices are the points of the pure substances.
-
The same holds for
the "composition" of the opinions in a population. When percents for, against
and undecided sum to 100, the same technique for presentation can be used.
See the diagram below, which should be viewed in a fixed-width font;
a true equilateral shape may not be preserved in transmission. For example, let the initial
composition of opinions be given by point 1, that is, few undecided and roughly
as many for as against. Let another composition be given by point
2. This point represents a higher percentage undecided and, among the decided,
a majority of "for".
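To place a composition in such a triangle, the three percentages can be converted to ordinary x-y coordinates. The sketch below assumes a triangle with unit sides, the "for" vertex at (0, 0), the "against" vertex at (1, 0) and the "undecided" vertex at the top; this layout and the names are choices made for the example.

```cpp
#include <cassert>
#include <cmath>

struct Point2D {
    double x;
    double y;
};

// Map a (for, against, undecided) composition to a point inside an
// equilateral triangle with unit sides. The height of the point above the
// base is proportional to the undecided share.
Point2D ternary_to_xy(double pct_for, double pct_against, double pct_undecided)
{
    assert(pct_for >= 0.0 && pct_against >= 0.0 && pct_undecided >= 0.0);
    const double total = pct_for + pct_against + pct_undecided;
    assert(total > 0.0);
    const double height = std::sqrt(3.0) / 2.0;  // height of a unit triangle
    return {(pct_against + 0.5 * pct_undecided) / total,
            height * pct_undecided / total};
}
```

A composition of 45% for, 45% against and 10% undecided lands at x = 0.5, close to the base, which matches a population with few undecided and an even split among the decided.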
Internal and Inter-rater Reliability
-
"Internal reliability"
of a scale is often measured by Cronbach's coefficient alpha.
It is relevant when you will compute a total score and you want to know
its reliability, based on no other rating. The "reliability" is *estimated*
from the average correlation, and from the number of items, since a longer
scale will (presumably) be more reliable. Whether the items have the same
means is not usually important.
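A minimal sketch of the coefficient follows, using the usual variance form: alpha = k/(k - 1) times (1 - sum of item variances / variance of the total score), where k is the number of items. The function names and the layout of the input (one row per respondent, one column per item) are assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unbiased sample variance (divisor n - 1).
double sample_variance(const std::vector<double>& values)
{
    assert(values.size() >= 2);
    const double n = static_cast<double>(values.size());
    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= n;
    double ss = 0.0;
    for (double v : values) ss += (v - mean) * (v - mean);
    return ss / (n - 1.0);
}

// Cronbach's alpha for scores[respondent][item].
double cronbach_alpha(const std::vector<std::vector<double>>& scores)
{
    const std::size_t n = scores.size();
    assert(n >= 2);
    const std::size_t k = scores[0].size();
    assert(k >= 2);

    std::vector<double> totals(n, 0.0);
    double sum_item_variances = 0.0;
    for (std::size_t j = 0; j < k; ++j) {
        std::vector<double> item(n);
        for (std::size_t i = 0; i < n; ++i) {
            assert(scores[i].size() == k);
            item[i] = scores[i][j];
            totals[i] += scores[i][j];
        }
        sum_item_variances += sample_variance(item);
    }
    const double total_variance = sample_variance(totals);
    assert(total_variance > 0.0);
    const double kd = static_cast<double>(k);
    return (kd / (kd - 1.0)) * (1.0 - sum_item_variances / total_variance);
}
```

Items that rank respondents identically give alpha = 1; the less the items agree, the smaller the total-score variance is relative to the item variances, and the lower alpha becomes.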
-
Tau-equivalent: The
true scores on items are assumed to differ from each other by no more than
a constant. For alpha
to equal the reliability of the measure, the items comprising it have to be
at least tau-equivalent; if this assumption is not met, alpha
is a lower-bound estimate of reliability.
-
Congeneric
measures: This least restrictive model within the framework of
classical test theory requires only that true scores on measures said to
be measuring the same phenomenon be perfectly correlated. Consequently,
on congeneric measures, error variances, true-score means, and true-score
variances may be unequal.
-
For "inter-rater"
reliability, one distinction is that the importance lies with the reliability
of the single rating. Suppose we have the following data:
Participant  Time  Q1  Q2  Q3  ...  Q17
001          1     4   5   4   ...  4
002          1     3   4   3   ...  3
001          2     4   4   5   ...  3
etc.
-
By examining the data,
I think one cannot do better than looking at the paired t-test and Pearson
correlations between each pair of raters - the t-test tells you whether
the means are different, while the correlation tells you whether the judgments
are otherwise consistent.
-
Unlike the Pearson,
the "intra-class" correlation assumes that the raters do have the same
mean. It is not bad as an overall summary, and it is precisely what some
editors do want to see presented for reliability across raters. It is both
a plus and a minus, that there are a few different formulas for intra-class
correlation, depending on whose reliability is being estimated.
-
For purposes such
as planning the Power for a proposed study, it does matter whether the
raters to be used will be exactly the same individuals. A good methodology
to apply in such cases, is the Bland & Altman analysis.
-
Visit also
the Web site Common
Correlation and Reliability Analysis.
When to Use a Nonparametric Technique?
-
A statistical
technique is called nonparametric if it satisfies at least one of the following
five types of criteria:
-
1. The data entering
the analysis are enumerative - that is, count data representing the number
of observations in each category or cross-category.
-
2. The data are measured
and/or analyzed using a nominal scale of measurement.
-
3. The data are measured
and/or analyzed using an ordinal scale of measurement.
-
4. The inference does
not concern a parameter in the population distribution - as, for example,
the hypothesis that a time-ordered set of observations exhibits a random
pattern.
-
5. The probability
distribution of the statistic upon which the analysis is based is not
dependent upon specific information or assumptions about the population(s)
from which the sample(s) are drawn, but only on general assumptions, such as
a continuous and/or symmetric population distribution.
-
By this definition,
the distinction of nonparametric is accorded either because of the level
of measurement used or required for the analysis, as in types 1 through
3; the type of inference, as in type 4 or the generality of the assumptions
made about the population distribution, as in type 5.
-
For example, one may
use the Mann-Whitney rank test as a nonparametric alternative to Student's
t-test when one does not have normally distributed data.
-
Mann-Whitney: To be used with two independent groups (analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the related samples t-test)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor between-subjects ANOVA)
Friedman: To be used with two or more related groups (analogous to the single-factor within-subjects ANOVA)
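To make the first of these tests concrete, the sketch below computes the Mann-Whitney U statistic by direct pair counting, together with its large-sample normal approximation. The function names are assumptions for this example, and the approximation shown carries no correction for ties.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// U statistic for the first group: the number of (x, y) pairs in which
// the x value exceeds the y value, counting a tie as one half.
double mann_whitney_u(const std::vector<double>& x, const std::vector<double>& y)
{
    double u = 0.0;
    for (double a : x) {
        for (double b : y) {
            if (a > b) u += 1.0;
            else if (a == b) u += 0.5;
        }
    }
    return u;
}

// Large-sample z approximation for U (assumption: no tie correction).
double mann_whitney_z(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(!x.empty() && !y.empty());
    const double n1 = static_cast<double>(x.size());
    const double n2 = static_cast<double>(y.size());
    const double mean_u = n1 * n2 / 2.0;
    const double sd_u = std::sqrt(n1 * n2 * (n1 + n2 + 1.0) / 12.0);
    return (mann_whitney_u(x, y) - mean_u) / sd_u;
}
```

Only the ordering of the observations enters the calculation, which is why the test fits data measured on an ordinal scale.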
Analysis of Incomplete Data
-
Methods dealing with
analysis of data with missing values can be classified into:
-
- Analysis of complete
cases, including weighting adjustments,
- Imputation
methods, and extensions to multiple imputation, and
- Methods that
analyze the incomplete data directly without requiring a rectangular data
set, such as maximum likelihood and Bayesian methods.
-
Multiple imputation
(MI) is a general paradigm for the analysis of incomplete data. Each missing
datum is replaced by m > 1 simulated values, producing m simulated versions
of the complete data. Each version is analyzed by standard complete-data
methods, and the results are combined using simple rules to produce inferential
statements that incorporate missing data uncertainty. The focus is on the
practice of MI for real statistical problems in modern computing environments.
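The combining step is commonly done with Rubin's rules, sketched here for a single scalar estimate. The struct and function names are assumptions for this example: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PooledResult {
    double estimate;        // average of the m complete-data estimates
    double total_variance;  // within + (1 + 1/m) * between
};

// Combine m complete-data estimates and their variances.
PooledResult pool_imputations(const std::vector<double>& estimates,
                              const std::vector<double>& variances)
{
    const std::size_t m = estimates.size();
    assert(m > 1 && variances.size() == m);
    const double md = static_cast<double>(m);

    double mean_estimate = 0.0;
    double within = 0.0;
    for (std::size_t i = 0; i < m; ++i) {
        mean_estimate += estimates[i];
        within += variances[i];
    }
    mean_estimate /= md;
    within /= md;

    double between = 0.0;
    for (double q : estimates) between += (q - mean_estimate) * (q - mean_estimate);
    between /= (md - 1.0);

    return {mean_estimate, within + (1.0 + 1.0 / md) * between};
}
```

The between-imputation term is what carries the missing-data uncertainty: if every imputed data set gave the same estimate, the total variance would reduce to the ordinary complete-data variance.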
-
Further Readings:
Rubin D., Multiple
Imputation for Nonresponse in Surveys, New York, Wiley, 1987.
Schafer J., Analysis
of Incomplete Multivariate Data, London, Chapman and Hall, 1997.
-
Little R., and D.
Rubin, Statistical Analysis with Missing Data, New York, Wiley,
1987.
Interactions in ANOVA and Regression Analysis
-
Interactions are ignored
only if you permit it. For historical reasons, ANOVA programs generally
produce all possible interactions, while (multiple) regression programs
generally do not produce any interactions - at least, not so routinely.
So it's up to the user to construct interaction terms when using regression
to analyze a problem where interactions are, or may be, of interest. (By
"interaction terms" I mean variables that carry the interaction information,
included as predictors in the regression model.)
-
The easiest construction
is to multiply together the predictors whose interaction is to be included.
When there are more than about three predictors, and especially if the
raw variables take values that are distant from zero (like number of items
right), the various products (for the numerous interactions that can be
generated) tend to be highly correlated with each other, and with the original
predictors. This is sometimes called "the problem of multicollinearity",
although it would more accurately be described as spurious multicollinearity.
It is possible, and often to be recommended, to adjust the raw products
so as to make them orthogonal to the original variables (and to lower-order
interaction terms as well).
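The effect can be seen on a small made-up data set. The sketch below uses mean-centering before multiplying, which is a simpler device than full orthogonalization and is named here as such; the data values and function names are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double mean_of(const std::vector<double>& v)
{
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Pearson correlation between two equally long series.
double pearson(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size() && x.size() >= 2);
    const double mx = mean_of(x);
    const double my = mean_of(y);
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// Interaction term X*W, optionally built from mean-centered predictors.
std::vector<double> product_term(const std::vector<double>& x,
                                 const std::vector<double>& w, bool center)
{
    assert(x.size() == w.size());
    const double mx = center ? mean_of(x) : 0.0;
    const double mw = center ? mean_of(w) : 0.0;
    std::vector<double> product(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        product[i] = (x[i] - mx) * (w[i] - mw);
    return product;
}
```

With X = 10, ..., 15 and W = 20, 22, 21, 23, 25, 24, the raw product correlates about 0.98 with X, while the centered product correlates about -0.09 with it.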
-
What does it mean
if the standard error term is high? Multicollinearity is not the only factor
that can cause large SE's for estimators of "slope" coefficients in any regression
model. SE's are inversely proportional to the spread of
the predictor variable. For example, if you were estimating the linear
association between weight (x) and some dichotomous outcome and x=(50,50,50,50,51,51,53,55,60,62)
the SE would be much larger than if x=(10,20,30,40,50,60,70,80,90,100)
all else being equal. There is a lesson here for the planning of experiments:
to increase the precision of estimators, increase the range of the input.
Another cause of large SE's is a small number of "event" observations or
a small number of "non-event" observations (analogous to small variance
in the outcome variable). This is not strictly controllable but will increase
all estimator SE's (not just an individual SE). There is also another cause
of high standard errors: serial correlation. This problem is
frequent, if not typical, when using time series, since in that case the
stochastic disturbance term will often reflect variables, not included
explicitly in the model, that may change slowly as time passes.
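The two sets of x values above can be compared directly. The sketch assumes the simple linear regression case, where the SE of the slope is the residual SD divided by the square root of the sum of squared deviations of x; for a dichotomous outcome the exact formula differs, so this is only an indication of the size of the effect. The function names are chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of squared deviations of x about its mean.
double sum_squared_deviations(const std::vector<double>& x)
{
    assert(!x.empty());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return ss;
}

// How many times larger the slope SE is with the narrow design than with
// the wide one, assuming the same residual SD in both.
double slope_se_ratio(const std::vector<double>& narrow, const std::vector<double>& wide)
{
    return std::sqrt(sum_squared_deviations(wide) / sum_squared_deviations(narrow));
}
```

For these data the sums of squared deviations are 177.6 and 8250, so under the stated assumption the slope SE is roughly 6.8 times larger with the narrow set of weights.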
-
In a linear model
representing the variation in a dependent variable Y as a linear function
of several explanatory variables, interaction between two explanatory variables
X and W can be represented by their product: that is, by the variable created
by multiplying them together. Algebraically such a model is represented
by:
-
Y = a + b1 X + b2 W + b3 XW + e.
-
When X and W are category
systems, this equation describes a two-way analysis of variance (ANOV)
model; when X and W are (quasi-)continuous variables, this equation describes
a multiple linear regression (MLR) model.
-
In ANOV contexts,
the existence of an interaction can be described as a difference between
differences: the difference in means between two levels of X at one value
of W is not the same as the difference in the corresponding means at another
value of W, and this not-the-same-ness constitutes the interaction between
X and W; it is quantified by the value of b3.
-
In MLR contexts, an
interaction implies a change in the slope (of the regression of Y on X)
from one value of W to another value of W (or, equivalently, a change in
the slope of the regression of Y on W for different values of X): in a
two-predictor regression with interaction, the response surface is not
a plane but a twisted surface (like "a bent cookie tin", in Darlington's
(1990) phrase). The change of slope is quantified by the value of b3.
For details, see Modelling
and Interpreting Interactions in multiple Regression
What Is the Central Limit Theorem?
-
For practical purposes,
the main idea of the central limit theorem (CLT) is that the average of
a sample of observations drawn from some population with any shape-distribution
is approximately distributed as a normal distribution if certain conditions
are met. In theoretical statistics there are several versions of the central
limit theorem depending on how these conditions are specified. These are
concerned with the types of assumptions made about the distribution of
the parent population (population from which the sample is drawn) and the
actual sampling procedure.
-
One of the simplest
versions of the theorem says that if $X_1, X_2, \dots, X_n$ is a random sample of size n (say,
n > 30) from an infinite population with finite standard deviation $\sigma$, then the
standardized sample mean converges to a standard normal distribution or,
equivalently, the sample mean approaches a normal distribution with mean
equal to the population mean and standard deviation equal to the standard deviation
of the population divided by the square root of the sample size n. In applications
of the central limit theorem to practical problems in statistical inference,
however, statisticians are more interested in how closely the approximate
distribution of the sample mean follows a normal distribution for finite
sample sizes, than the limiting distribution itself. Sufficiently close
agreement with a normal distribution allows statisticians to use normal
theory for making inferences about population parameters (such as the mean
$\mu$) using the sample mean, irrespective of the actual form of the parent
population.
-
It is well known that
whatever the parent population is, the standardized variable
$Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$, where $\bar{X}$ is the sample mean, will have
a distribution with a mean 0 and standard deviation 1 under random sampling.
Moreover, if the parent population is normal, then Z is distributed exactly
as a standard normal variable for any positive integer n. The central limit
theorem states the remarkable result that, even when the parent population
is non-normal, the standardized variable is approximately normal if the
sample size is large enough (say, 30). It is generally not possible to
state conditions under which the approximation given by the central limit
theorem works and what sample sizes are needed before the approximation
becomes good enough. As a general guideline, statisticians have used the
prescription that if the parent distribution is symmetric and relatively
short-tailed, then the sample mean reaches approximate normality for smaller
samples than if the parent population is skewed or long-tailed.
-
One must study the
behavior of the mean of samples of different sizes drawn from a variety
of parent populations. Examining sampling distributions of sample means
computed from samples of different sizes drawn from a variety of distributions
allows us to gain some insight into the behavior of the sample mean under
those specific conditions, as well as to examine the validity of the guidelines
mentioned above for using the central limit theorem in practice.
-
Under certain conditions,
in large samples, the sampling distribution of the sample mean can be approximated
by a normal distribution. The sample size needed for the approximation
to be adequate depends strongly on the shape of the parent distribution.
Symmetry (or lack thereof) is particularly important. For a symmetric parent
distribution, even if very different from the shape of a normal distribution,
an adequate approximation can be obtained with small samples (e.g., 10
or 12 for the uniform distribution). For symmetric short-tailed parent
distributions, the sample mean reaches approximate normality for smaller
samples than if the parent population is skewed and long-tailed. In some
extreme cases (e.g., a binomial with success probability very close to 0
or 1), sample sizes far exceeding the typical
guidelines (say, 30) are needed for an adequate approximation. For some
distributions without first and second moments (e.g., Cauchy), the central
limit theorem does not hold.
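The dependence on the shape of the parent distribution can be explored with a small simulation. The sketch below is only an illustration under assumed settings (the two parent distributions, the sample sizes 2, 12 and 30, the 5000 replications and the seed are all choices made here, not part of the notes). It records the standard deviation and the skewness of the simulated sample means; skewness near 0 is used as a rough indicator of approximate normality.

```python
import random
import statistics


def sample_means(draw, n, reps, rng):
    """Return `reps` sample means, each computed from n draws of the parent."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


def skewness(values):
    """Moment-based skewness; it is 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def clt_demo(reps=5000, seed=1):
    """Simulate sampling distributions of the mean for two assumed parents."""
    rng = random.Random(seed)
    parents = {
        "uniform(0,1)": lambda r: r.random(),             # symmetric, short-tailed
        "exponential(1)": lambda r: r.expovariate(1.0),   # skewed, long right tail
    }
    results = {}
    for name, draw in parents.items():
        for n in (2, 12, 30):
            means = sample_means(draw, n, reps, rng)
            # Store (standard deviation, skewness) of the simulated means.
            results[(name, n)] = (statistics.stdev(means), skewness(means))
    return results


if __name__ == "__main__":
    for (name, n), (sd, skew) in clt_demo().items():
        print(f"{name:15s} n={n:2d}  sd of means={sd:.3f}  skewness={skew:+.3f}")
```

For the uniform parent the skewness of the sample mean should already be close to 0 at n = 12, while for the exponential parent it shrinks more slowly as n grows, in line with the guideline above.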
What is a Sampling Distribution?
-
The main idea of statistical
inference is to take a random sample from a population and then to use
the information from the sample to make inferences about particular population
characteristics such as the mean (measure of central tendency), the standard
deviation (measure of spread) or the proportion of units in the population
that have a certain characteristic. Sampling saves money, time, and effort.
Additionally, a sample can, in some cases, provide as much accuracy as, or
more accuracy than, a corresponding study that attempts to investigate an
entire population: careful collection of data from a sample will often provide
better information than a less careful study that tries to look at everything.
-
We will study the
behavior of the mean of sample values from different specified populations.
Because a sample examines only part of a population, the sample mean will
not exactly equal the corresponding mean of the population. Thus, an important
consideration for those planning and interpreting sampling results is
the degree to which sample estimates, such as the sample mean, will agree
with the corresponding population characteristic.
-
In practice, only
one sample is usually taken (in some cases a small "pilot sample" is
used to test the data-gathering mechanisms and to get preliminary information
for planning the main sampling scheme). However, for purposes of understanding
the degree to which sample means will agree with the corresponding population
mean, it is useful to consider what would happen if 10, or 50, or 100 separate
sampling studies, of the same type, were conducted. How consistent would
the results be across these different studies? If we could see that the
results from each of the samples would be nearly the same (and nearly correct!),
then we would have confidence in the single sample that will actually be
used. On the other hand, seeing that answers from the repeated samples
were too variable for the needed accuracy would suggest that a different
sampling plan (perhaps with a larger sample size) should be used.
-
A sampling distribution
is used to describe the distribution of outcomes that one would observe
from replication of a particular sampling plan.
-
Know that to estimate
means to esteem (to give value to).
-
Know that estimates
computed from one sample will be different from estimates that would be
computed from another sample.
-
Understand that estimates
are expected to differ from the population characteristics (parameters)
that we are trying to estimate, but that the properties of sampling distributions
allow us to quantify, probabilistically, how they will differ.
-
Understand that different
statistics have different sampling distributions with distribution shape
depending on (a) the specific statistic, (b) the sample size, and (c) the
parent distribution.
-
Understand the relationship
between sample size and the distribution of sample estimates.
-
Understand that the
variability in a sampling distribution can be reduced by increasing the
sample size.
-
See that in large
samples, many sampling distributions can be approximated with a normal
distribution.
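The idea of replicating a sampling plan, and the effect of sample size on the variability of the estimates, can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical normal population with mean 50 and standard deviation 10 and 2000 replications of each plan; none of these numbers come from the notes.

```python
import random
import statistics


def replicate_means(n, reps, mu=50.0, sigma=10.0, seed=0):
    """Repeat the same sampling plan `reps` times and return the sample means.

    The normal population (mu, sigma) is a hypothetical choice for illustration.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


def spread_by_sample_size(sizes=(25, 100), reps=2000):
    """Standard deviation of the sample mean across replications, per sample size."""
    return {n: statistics.stdev(replicate_means(n, reps)) for n in sizes}


if __name__ == "__main__":
    for n, sd in spread_by_sample_size().items():
        print(f"n={n:3d}: spread of sample means = {sd:.2f}")
```

With these assumed values the spread of the sample means should be close to 10/5 = 2 for n = 25 and close to 10/10 = 1 for n = 100: quadrupling the sample size roughly halves the variability of the estimate.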
Least Squares Models
-
Many problems in analyzing
data involve describing how variables are related. The simplest of all
models describing the relationship between two variables is a linear, or
straight-line, model. The simplest method of fitting a linear model is
to "eye-ball" a line through the data on a plot, but a more elegant
and conventional method is that of least squares, which finds the line
minimizing the sum of squared vertical distances between the observed
points and the fitted line.
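For a single predictor, the least squares line has a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared x-deviations, and the line passes through the point of means. The sketch below is a plain illustration of those formulas; the function names and the small data set are made up for the example.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Criterion that least squares minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(xs, ys)
```

Any other line, for instance one drawn by eye with a slightly different slope, gives a larger value of `sum_squared_residuals` on the same data.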
-
Realize that fitting
the "best" line by eye is difficult, especially when there is a lot of
residual variability in the data.
-
Know that there is
a simple connection between the numerical coefficients in the regression
equation and the slope and intercept of the regression line.
-
Know that a single
summary statistic like a correlation coefficient or $R^2$ does not tell the whole
story. A scatter plot is an essential complement to examining the relationship
between the two variables.
-
Know that model
checking is an essential part of the process of statistical modelling.
After all, conclusions based on models that do not properly describe an
observed set of data will be invalid.
-
Know the impact of
violation of regression model assumptions (i.e., conditions) and possible
solutions by analyzing the residuals.
Least Median of Squares
Models
-
The standard least
squares techniques for estimation in linear models are not robust, in the
sense that outliers or contaminated data can strongly influence the estimates.
A robust technique that protects against contamination is least median
of squares (LMS) estimation, which chooses the coefficients that minimize
the median, rather than the sum, of the squared residuals. LMS estimation
can be extended to generalized linear models, giving rise to the least
median of deviance (LMD) estimator.
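To make the LMS criterion concrete, here is a rough sketch for a single predictor. It is an approximation, not an exact LMS algorithm: it only tries the lines through every pair of data points and keeps the one with the smallest median squared residual. The function name and the data (six points on y = 1 + 2x plus one gross outlier) are invented for the example.

```python
from itertools import combinations
from statistics import median


def lms_line(xs, ys):
    """Approximate least-median-of-squares line.

    Candidate lines are those through each pair of points (an assumption of
    this sketch); returns (intercept, slope, median squared residual).
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        criterion = median((y - (intercept + slope * x)) ** 2 for x, y in points)
        if best is None or criterion < best[2]:
            best = (intercept, slope, criterion)
    return best


# Hypothetical data: the last observation is a gross outlier.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [3, 5, 7, 9, 11, 13, 40]
intercept, slope, criterion = lms_line(xs, ys)
```

On these data the sketch recovers the line followed by the six uncontaminated points, whereas an ordinary least squares fit would be pulled strongly toward the outlier.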
You Must Look at Your
Scattergrams!
-
Learn that,
given a set of data, the regression line is unique. However, the converse of
this statement is not true. The following interesting example is from
D. Moore's (1997) book, page 349:
Data set A:
x      10      8     13      9     11     14      6      4     12      7      5
y    8.04   6.95   7.58   8.81   8.33   9.96   7.24   4.26  10.84   4.82   5.68
Data set B:
x      10      8     13      9     11     14      6      4     12      7      5
y    9.14   8.14   8.74   8.77   9.26   8.10   6.13   3.10   9.13   7.26   4.74
Data set C:
x       8      8      8      8      8      8      8      8      8      8     19
y    6.58   5.76   7.71   8.84   8.47   7.04   5.25   5.56   7.91   6.89  12.50
-
All three
sets have essentially the same correlation and regression line, although
their scatterplots look completely different. The important moral
is: look at your scattergrams.
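The claim can be checked directly from the three data sets listed above. The sketch below computes the intercept, slope and correlation for each set with the usual formulas; the function name `summarize` is just a label chosen for this example.

```python
def summarize(xs, ys):
    """Return (intercept, slope, correlation) of the least squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r


x_ab = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "A": (x_ab, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_ab, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": ([8] * 10 + [19],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}

summaries = {name: summarize(xs, ys) for name, (xs, ys) in datasets.items()}

if __name__ == "__main__":
    for name, (b0, b1, r) in summaries.items():
        print(f"Data set {name}: y = {b0:.2f} + {b1:.2f} x, r = {r:.3f}")
```

Each set gives a fitted line of roughly y = 3 + 0.5x and a correlation of roughly 0.82, even though set B is clearly curved and set C is driven by a single point at x = 19.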
-
How can one
produce a numerical example where two scatterplots show clearly different
relationships (strengths) but yield the same covariance? Perform
the following steps:
-
1. Produce two sets of (X, Y) values that have different correlations.
2. Calculate the two covariances, say C1 and C2.
3. Suppose you want to make C2 equal to C1. Then you want to multiply C2 by
(C1/C2).
4. Since C = r.Sx.Sy, you want two numbers a and b (one of them
might be 1) such that a.b = (C1/C2).
5. Multiply all values of X in set 2 by a, and all values of Y by b. For the
new variables, C = r.a.b.Sx.Sy = C2.(C1/C2) = C1.
-
An interesting
numerical example showing two scatterplots that look identical (apart from
the axis scales) but have differing covariances is the following: Consider
a data set of (X, Y) values with covariance C1. Now let V = 2X and W = 3Y.
The covariance of V and W will be 2(3) = 6 times C1, but the correlation
between V and W is the same as
the correlation between X and Y.
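Both constructions can be verified numerically. The sketch below uses two small made-up data sets whose covariances have the same sign (so that the scale factor a = C1/C2 is positive and the correlation keeps its sign); the data and the choice b = 1 are assumptions for illustration.

```python
def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation, r = C / (Sx * Sy)."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


# Two hypothetical data sets with different correlations.
x1, y1 = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
x2, y2 = [1, 2, 3, 4, 5], [3, 1, 4, 2, 5]

c1, c2 = covariance(x1, y1), covariance(x2, y2)

# Steps 3-5: choose a and b with a * b = C1 / C2 and rescale set 2.
a, b = c1 / c2, 1.0
x2_scaled = [a * x for x in x2]
y2_scaled = [b * y for y in y2]
c2_new = covariance(x2_scaled, y2_scaled)   # now equal to c1

# Second example: V = 2X, W = 3Y multiplies the covariance by 6,
# while the correlation is unchanged.
v = [2 * x for x in x1]
w = [3 * y for y in y1]
```

After rescaling, the two sets share the same covariance while their correlations still differ, which is why the covariance on its own says little about the strength of a relationship.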
Power of a Test
-
Significance
tests are based on certain assumptions: The data have to be random samples
out of a well defined basic population and one has to assume that some
variables follow a certain distribution - in most cases the normal distribution
is assumed.
-
Power of
a test is the probability of correctly rejecting a false null hypothesis.
This probability is one minus the probability of making a Type II error
($\beta$). Recall
also that we choose the probability of making a Type I error when we set
$\alpha$, and that,
if we decrease the probability of making a Type I error, we increase the
probability of making a Type II error.
Power and
Alpha
-
Thus, the
probability of correctly retaining a true null has the same relationship
to Type I errors as the probability of correctly rejecting an untrue null
does to Type II errors. Yet, as I mentioned, if we decrease the odds of making
one type of error we increase the odds of making the other type of error.
What is the relationship between Type I and Type II errors?
-
Power and
the True Difference Between Population Means: Anytime we test whether a
sample differs from a population, or whether two samples come from two separate
populations, there is the assumption that each of the populations we are
comparing has its own mean and standard deviation (even if we do not know
them). The distance between the two population means will affect the power
of our test.
-
Power as
a Function of Sample Size and Variance: You should notice that what really
makes the difference in the size of $\beta$
is how much overlap there is in the two distributions. When the means are
close together the two distributions overlap a great deal compared to when
the means are farther apart. Thus, anything that increases the extent to which
the two distributions share common values will increase $\beta$
(the likelihood of making a Type II error).
-
Sample
size has an indirect effect on power because it affects the measure of
variability we use to calculate the t-test statistic. Since we are calculating
the power of a test that involves the comparison of sample means, we will
be more interested in the standard error (the standard deviation of the
sampling distribution of the mean, i.e., the standard deviation divided by
the square root of n) than in the standard deviation or variance by itself.
Thus, sample size is of interest because it determines the standard error.
When n is large we will have a lower standard error than when n is small.
In turn, when n is large we'll have a smaller $\beta$
region than when n is small.
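The way power depends on the true difference, the standard deviation and the sample size can be made explicit in a simplified setting. The sketch below assumes a one-sided z-test with known standard deviation (a simplification of the t-test discussed above, chosen so that only the normal distribution is needed); the function name and the numbers in the example are illustrative.

```python
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = mu0 against mean = mu0 + delta.

    Assumes a normal population with known sigma (simplifying assumption).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)        # rejection cutoff on the z scale
    standard_error = sigma / n ** 0.5    # shrinks as n grows
    beta = z.cdf(z_crit - delta / standard_error)   # Type II error probability
    return 1 - beta


if __name__ == "__main__":
    for n in (10, 25, 100):
        print(f"n={n:3d}: power = {power_one_sided_z(0.5, 1.0, n):.3f}")
```

With delta = 0 the function returns alpha, as it should; power then rises as the true difference grows, as the standard deviation falls, or as n increases, because each of these reduces the overlap between the two sampling distributions.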
ANOVA: Analysis of Variance
-
The tests
we have learned up to this point allow us to test hypotheses that examine
the difference between only two means. Analysis of Variance or ANOVA will
allow us to test the difference between two or more means. ANOVA does this
by examining the ratio of the variability between conditions to the variability
within each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to another
group of people. We might measure memory performance by the number of words
recalled from a list we ask everyone to memorize. A t-test would assess
the likelihood of observing the difference in the mean number of words
recalled for each group. An ANOVA test, on the other hand, would compare
the variability that we observe between the two conditions to the variability
observed within each condition. Recall that we measure variability as the
sum of the squared differences of each score from the mean. When we actually
calculate an ANOVA we will use a short-cut formula.
-
Thus, when
the variability that we predict (between the two groups) is much greater
than the variability we don't predict (within each group) then we will
conclude that our treatments produce different results.
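The ratio of between-group to within-group variability is the F statistic of a one-way ANOVA. The sketch below computes it from the definitional sums of squares rather than a short-cut formula; the word-recall scores for the drug and placebo groups are invented for illustration.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability we "predict": group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability we do not predict: scores around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical numbers of words recalled.
drug = [12, 14, 11, 15, 13]
placebo = [9, 10, 8, 11, 12]
f_stat = one_way_anova_f([drug, placebo])
```

For these made-up scores the between-group sum of squares is 22.5 and the within-group mean square is 2.5, so F = 9.0. With only two groups this equals the square of the pooled two-sample t statistic (t = 3 here), which is why ANOVA and the t-test agree in that case.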
Distance Sampling
-
The term
'distance sampling' covers a range of methods for assessing wildlife abundance:
-
line transect
sampling, in which the distances sampled are distances of detected objects
(usually animals) from the line along which the observer travels
-
point transect
sampling, in which the distances sampled are distances of detected objects
(usually birds) from the point at which the observer stands
-
cue counting,
in which the distances sampled are distances from a moving observer to
each detected cue given by the objects of interest (usually whales)
-
trapping
webs, in which the distances sampled are from the web center to trapped
objects (usually invertebrates or small terrestrial vertebrates)
-
migration
counts, in which the 'distances' sampled are actually times of detection
during the migration of objects (usually whales) past a watch point
-
Many mark-recapture
models have been developed over the past 40 years. Monitoring of biological
populations is receiving increasing emphasis in many countries. Data from
marked populations can be used for the estimation of survival probabilities,
how these vary by age, sex and time, and how they correlate with external
variables. Estimation of immigration and emigration rates, population size
and the proportion of age classes that enter the breeding population are
often important and difficult to estimate with precision for free-ranging
populations. Estimation of the finite rate of population change and fitness
are still more difficult to address in a rigorous manner.
-
For more
details read:
Buckland
S., D. Anderson, K. Burnham, and J. Laake, Distance Sampling: Estimating
Abundance of Biological Populations, Chapman and Hall, London, 1993.
Data Mining and Knowledge
Discovery
-
The continuing
rapid growth of on-line data and the widespread use of databases necessitate
the development of techniques for extracting useful knowledge and for facilitating
database access. The challenge of extracting knowledge from data is of
common interest to several fields, including statistics, databases, pattern
recognition, machine learning, data visualization, optimization, and high-performance
computing.
-
Data mining
is an analytic process designed to explore large amounts of (typically
business or market related) data in search of consistent patterns and/or
systematic relationships between variables, and then to validate the findings
by applying the detected patterns to new subsets of data. The process thus
consists of three basic stages: exploration, model building or pattern
definition, and validation/verification.
-
What distinguishes
data mining from conventional statistical data analysis is that data mining
is usually done for the purpose of "secondary analysis" aimed at finding
unsuspected relationships unrelated to the purposes for which the data
were originally collected.
-
Data warehousing
is a process of organizing the storage of large, multivariate data sets
in a way that facilitates the retrieval of information for analytic purposes.
-
Data mining
is now a rather vague term, but the element that is common to most definitions
is "predictive modeling with large data sets as used by big companies".
Therefore, data mining is the extraction of hidden predictive information
from large databases. It is a powerful new technology with great potential,
for example, to help marketing managers "preemptively define the information
market of tomorrow." Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by retrospective tools. Data mining answers
business questions that traditionally were too time-consuming to resolve.
Data mining tools scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.
-
Data mining
techniques can be implemented rapidly on existing software and hardware
platforms across large companies to enhance the value of existing resources,
and can be integrated with new products and systems as they are brought
on-line. When implemented on high performance client-server or parallel
processing computers, data mining tools can analyze massive databases while
a customer or analyst takes a coffee break, then deliver answers to questions
such as, "Which clients are most likely to respond to my next promotional
mailing, and why?"
-
Knowledge
discovery in databases aims at tearing down the last barrier in enterprises'
information flow, the data analysis step. It is a label for an activity
performed in a wide variety of application domains within the science and
business communities, as well as for pleasure. The activity uses a large
and heterogeneous data-set as a basis for synthesizing new and relevant
knowledge. The knowledge is new because hidden relationships within the
data are explicated, and/or data is combined with prior knowledge to elucidate
a given problem. The term relevant is used to emphasize that knowledge
discovery is a goal-driven process in which knowledge is constructed to
facilitate the solution to a problem.
-
Knowledge
discovery may be viewed as a process containing many tasks. Some of these
tasks are well understood, while others depend on human judgment in an
implicit manner. Further, the process is characterized by heavy iterations
between the tasks. This is very similar to many creative engineering processes,
e.g., the development of dynamic models. In this reference mechanistic,
or first principles based, models are emphasized, and the tasks involved
in model development are defined by:
1. Initial data collection and problem formulation. The initial data are
collected, and some more or less precise formulation of the modeling problem
is developed.
2. Tools selection. The software tools to support modeling and allow simulation
are selected.
3. Conceptual modeling. The system to be modeled, e.g., a chemical reactor,
a power generator, or a marine vessel, is abstracted at first. The essential
compartments and the dominant phenomena occurring are identified and documented
for later reuse.
4. Model representation. A representation of the system model is generated.
Often, equations are used; however, a graphical block diagram (or any other
formalism) may alternatively be used, depending on the modeling tools selected
above.
5. Implementation. The model representation is implemented using the means
provided by the modeling system of the software employed. These may range
from general programming languages to equation-based modeling languages
or graphical block-oriented interfaces.
6. Verification. The model implementation is verified to really capture the
intent of the modeler. No simulations for the actual problem to be solved
are carried out for this purpose.
7. Initialization. Reasonable initial values are provided or computed, and the
numerical solution process is debugged.
8. Validation. The results of the simulation are validated against some
reference, ideally against experimental data.
9. Documentation. The modeling process, the model, and the simulation results
during validation and application of the model are documented.
10. Model application. The model is used in some model-based process engineering
problem solving task.
-
For other
model types, like neural network models where data-driven knowledge is
utilized, the modeling process will be somewhat different. Some of the
tasks, like the conceptual modeling phase, will vanish.
-
Typical
application areas for dynamic models are control, prediction, planning,
and fault detection and diagnosis. A major deficiency of today's methods
is the lack of ability to utilize a wide variety of knowledge. As an example,
a black-box model structure has very limited abilities to utilize first
principles knowledge on a problem. This has provided a basis for developing
different hybrid schemes. Two hybrid schemes will highlight the discussion.
First, it will be shown how a mechanistic model can be combined with a
black-box model to represent a pH neutralization system efficiently. Second,
the combination of continuous and discrete control inputs is considered,
utilizing a two-tank example as a case. Different approaches to handle this
heterogeneous case are considered.
-
The hybrid
approach may be viewed as a means to integrate different types of knowledge,
i.e., being able to utilize a heterogeneous knowledge base to derive a
model. Standard practice today is that methods and software can treat large
homogeneous data-sets. A typical example of a homogeneous data-set is time-series
data from some system, e.g., temperature, pressure, and composition measurements
over some time frame provided by the instrumentation and control system
of a chemical reactor. If textual information of a qualitative nature is
provided by plant personnel, the data becomes heterogeneous.
-
The above
discussion will form the basis for analyzing the interaction between knowledge
discovery, and modeling and identification of dynamic models. In particular,
we will be interested in identifying how concepts from knowledge discovery
can enrich the state of the art within control, prediction, planning, and
fault detection and diagnosis of dynamic systems.
-
Further
Readings:
Brodley C., T. Lane, and T. Stough, Knowledge Discovery and Data Mining,
American Scientist, Jan.-Feb. 1999.
Chatfield C., Model Uncertainty, Data Mining and Statistical Inference,
Journal of the Royal Statistical Society, Series A, 419-466, 1995.
Glymour C., D. Madigan, et al., Statistical Themes and Lessons for Data
Mining, Data Mining and Knowledge Discovery, 1, 11-28, 1997.
Hand D., Data Mining: Statistics and More?, The American Statistician,
52(2), 1998.
Heckerman D., Bayesian Networks for Data Mining, Data Mining and Knowledge
Discovery, 1, 79-119, 1997.
Bayes and Empirical Bayes
Methods
-
Bayes and
empirical Bayes (EB) methods provide a structure for combining information
from similar components and produce efficient inferences for both individual
components and shared model characteristics. Many complex applied investigations
are ideal settings for this type of synthesis. For example, county-specific
disease incidence rates can be unstable due to small populations or low
rates. "Borrowing information" from adjacent counties by partial pooling
produces better estimates for each county, and Bayes/empirical Bayes methods
structure the approach. Importantly, recent advances in computing, and the
consequent ability to evaluate complex models, have increased the popularity
and applicability of Bayesian methods.
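Partial pooling can be illustrated with a very simple empirical Bayes calculation. The sketch below assumes a normal-normal model (each county estimate is normal around its true rate with a known sampling variance, and the true rates are normal around a common mean) and estimates the common mean and the between-county variance by the method of moments. The model, the function name and the county figures are assumptions made for this example, not the method of any particular reference.

```python
import statistics


def eb_shrink(estimates, variances):
    """Shrink each estimate toward the overall mean.

    Returns (overall mean, estimated between-unit variance, shrunken estimates).
    Assumes all sampling variances are positive.
    """
    mu = statistics.fmean(estimates)
    # Observed spread = true between-unit spread + average sampling noise,
    # so subtract the noise (truncating at 0) to estimate the former.
    tau2 = max(0.0, statistics.variance(estimates) - statistics.fmean(variances))
    shrunken = []
    for x, v in zip(estimates, variances):
        weight = tau2 / (tau2 + v)      # near 1 for precise, near 0 for noisy units
        shrunken.append(mu + weight * (x - mu))
    return mu, tau2, shrunken


# Hypothetical county rates (per 1000) and their sampling variances.
rates = [12.0, 8.0, 15.0, 5.0, 10.0]
sampling_variances = [1.0, 1.0, 9.0, 9.0, 0.25]
mu, tau2, pooled = eb_shrink(rates, sampling_variances)
```

Counties with large sampling variances (small populations) are pulled strongly toward the overall mean, while precisely estimated counties barely move, which is the "borrowing information" effect described above.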
-
Bayes and
EB methods can be implemented using modern Markov chain Monte Carlo (MCMC)
computational methods. Properly structured Bayes and EB procedures typically
have good frequentist and Bayesian performance, both in theory and in practice.
This in turn motivates their use in advanced high-dimensional model settings
(e.g., longitudinal data or spatio-temporal mapping models), where a Bayesian
model implemented via MCMC often provides the only feasible approach that
incorporates all relevant model features.
-
Further
Readings:
Carlin B., and T. Louis, Bayes and Empirical Bayes Methods for Data
Analysis, Chapman and Hall, 1996.
Likelihood Methods
             Direct                  Inverse
____________________________________________________________
Decision     Neyman-Pearson          Bayesian (decision analysis)
             Wald                    (H. Rubin, e.g.)
------------------------------------------------------------
Hybrid       "Standard" practice     Bayesian (subjective)
------------------------------------------------------------
Inference    Early Fisher            fiducial (Fisher)
                                     Likelihood (Edwards)
                                     Bayesian (modern)
                                     belief functions (Shafer)
____________________________________________________________
-
In the
Direct schools, one uses Pr(data | hypothesis), usually from some model-based
sampling distribution, but one does not attempt to give the inverse probability,
Pr(hypothesis | data), nor any other quantitative evaluation of hypotheses.
The Inverse schools do associate numerical values with hypotheses, either
probabilities (Bayesian schools) or something else (Fisher, Edwards, Shafer).
-
The decision-oriented
methods treat statistics as a matter of action, rather than inference,
and attempt to take utilities as well as probabilities into account in
selecting actions; the inference-oriented methods treat inference as a
goal apart from any action to be taken.
-
The "hybrid"
row could be more properly labeled as "hypocritical": these methods talk
some Decision talk but walk the Inference walk.
-
Fisher's
fiducial method is included because it is so famous, but the modern consensus
is that it lacks justification.
-
Now it
is true that, under certain assumptions, some distinct schools advocate highly
similar calculations, and just talk about them or justify them differently.
Some seem to think that dwelling on these distinctions is tiresome or
impractical. One may disagree, for
three reasons:
-
First,
how one justifies calculations goes to the heart of what the calculations
actually MEAN; second, it is easier to teach things that actually make
sense (which is one reason that standard practice is hard to teach); and
third, methods that do coincide or nearly so for some problems may diverge
sharply for others.
-
The difficulty
with the subjective Bayesian approach is that prior knowledge is represented
by a probability distribution, and this is more of a commitment than warranted
under conditions of partial ignorance. (Uniform or improper priors are
just as bad in some respects as any other sort of prior.) The methods
in the (Inference, Inverse) cell all attempt to escape this difficulty
by presenting alternative representations of partial ignorance.
-
Edwards,
in particular, uses logarithm of normalized likelihood as a measure of
support for a hypothesis. Prior information can be included in the form
of a prior support (log likelihood) function; a flat support represents
complete prior ignorance.
-
One place
where likelihood methods would deviate sharply from "standard" practice
is in a comparison between a sharp and a diffuse hypothesis. Consider H0:
X ~ N(0, 100) [diffuse] and H1: X ~ N(1, 1) [standard deviation 10 times
smaller]. In standard methods, observing X = 2 would be undiagnostic, since
it is not in a sensible tail rejection interval (or region) for either
hypothesis. But while X = 2 is not inconsistent with H0, it is much better
explained by H1: the likelihood ratio is about 6.2 in favor of H1. In Edwards'
methods, H1 would have higher support than H0, by the amount log(6.2) =
1.8. (If these were the only two hypotheses, the Neyman-Pearson lemma would
also lead one to a test based on likelihood ratio, but Edwards' methods
are more broadly applicable.)
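The likelihood ratio and the support difference quoted in this example can be reproduced directly. The short sketch below evaluates the two normal densities at X = 2; note that N(0, 100) is read as variance 100, i.e. standard deviation 10, as the bracketed remark in the text indicates, and that the logarithm is the natural logarithm.

```python
from math import log
from statistics import NormalDist

x_observed = 2.0

h0 = NormalDist(mu=0.0, sigma=10.0)   # diffuse hypothesis: variance 100
h1 = NormalDist(mu=1.0, sigma=1.0)    # sharp hypothesis: variance 1

likelihood_ratio = h1.pdf(x_observed) / h0.pdf(x_observed)
support_difference = log(likelihood_ratio)   # difference in log likelihood

if __name__ == "__main__":
    print(f"likelihood ratio (H1 : H0) = {likelihood_ratio:.2f}")
    print(f"support difference         = {support_difference:.2f}")
```

The ratio comes out at about 6.2 and its natural logarithm at about 1.8, matching the figures in the text.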
-
I do not
want to appear to advocate likelihood methods. I could give a long discussion
of their limitations and of alternatives that share some of their advantages
but avoid their limitations. But it is definitely a mistake to dismiss
such methods lightly. They are practical (currently widely used in genetics)
and are based on a careful and profound analysis of inference.
What is a Meta-Analysis?
-
Meta-Analysis
deals with the art of combining information from data from different
independent sources that are targeted at a common goal. There are plenty
of applications of Meta-Analysis in various disciplines such as Astronomy,
Agriculture, Biological and Social Sciences, and Environmental Science.
This particular topic of statistics has evolved considerably over the last
twenty years with applied as well as theoretical developments.
-
A Meta-analysis
deals with a set of RESULTs to give an overall RESULT that is (presumably)
comprehensive and valid.
-
a) Especially
when effect sizes are rather small, the hope is that one can gain good
power by essentially pretending to have the larger N as a valid, combined
sample.
-
b) When
effect sizes are rather large, then the extra POWER is not needed for main
effects of design: Instead, it theoretically could be possible to look
at contrasts between the slight variations in the studies themselves, if you
really trust that "all things being equal" will hold up. The typical "meta"
study does not do the tests for homogeneity that should be required.
-
In other
words:
-
1. There
is a body of research/data literature that you would like to summarize.
-
2. One
gathers together all the admissible examples of this literature (note:
some might be discarded for various reasons).
-
3. Certain
details of each investigation are deciphered; most important would be
the effect that has or has not been found, i.e., how much larger in sd units
the treatment group's performance is compared to one or more controls.
-
4. Call
the values in each of the investigations in #3 mini effect sizes.
-
5. Across
all admissible data sets, you attempt to summarize the overall effect size
by forming a set of individual effects and using an overall sd as the
divisor, thus yielding essentially an average effect size.
-
6. In the
meta-analysis literature, sometimes these effect sizes are further labeled
as small, medium, or large.
-
You can
look at effect sizes in many different ways, across different factors
and variables, but, in a nutshell, this is what is done.
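Steps 3 to 5 can be sketched with a few lines of code. This is a deliberately simple reading of the procedure: each "mini effect size" is taken to be the treatment-minus-control difference divided by a standard deviation supplied for that study, and the overall effect is their unweighted average. The study figures are invented, and real meta-analyses usually weight the studies, so treat this as an illustration only.

```python
def effect_size(mean_treatment, mean_control, sd):
    """Standardized mean difference: how many sd units treatment exceeds control."""
    return (mean_treatment - mean_control) / sd


def average_effect_size(studies):
    """Return the per-study effect sizes and their unweighted average."""
    effects = [effect_size(*study) for study in studies]
    return effects, sum(effects) / len(effects)


# Hypothetical studies: (treatment mean, control mean, standard deviation).
studies = [
    (52.0, 50.0, 10.0),
    (30.0, 27.0, 6.0),
    (105.0, 100.0, 20.0),
]
effects, overall = average_effect_size(studies)
```

Here the three studies give effect sizes of 0.2, 0.5 and 0.25, for an average of roughly 0.32 sd units.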
-
I recall
a case in physics, in which, after a phenomenon had been observed in air,
emulsion data was examined. The theory would have about a 9% effect in
emulsion, and behold, the published data gave 15%. As it happens, there
was no significant (practical, not statistical) error in the theory, and also
no error in the data. It was just that the results of experiments in which
nothing statistically significant was found were not reported.
-
It is this non-reporting
of such experiments, and often of the specific results which were not statistically
significant, which introduces major biases. This is also combined with
the totally erroneous attitude of researchers that statistically significant
results are the important ones, and that if there is no significance, the
effect was not important. We really need to distinguish between the term
"statistically significant" and the usual word significant.
-
It is very
important to distinguish between statistically significant and generally
significant; see Discover Magazine (July, 1987), The Case of Falling Nightwatchmen,
by Sapolsky. In this article, Sapolsky uses the example to make the point:
a diminution of velocity at impact may be statistically significant,
but not of importance to the falling nightwatchman.
-
Be careful
about the word "significant". It has a technical meaning, not a commonsense
one. It is NOT automatically synonymous with "important". A person or group
can be statistically significantly taller than the average for the population,
but still not be a candidate for your basketball team. Whether the difference
is substantively (not merely statistically) significant is dependent on
the problem which is being studied.
-
Meta-analysis
is a controversial type of literature review in which the results of individual
randomized controlled studies are pooled together to try to get an estimate
of the effect of the intervention being studied. It increases statistical
power and is used to resolve the problem of reports which disagree with
each other. It's not easy to do well and there are many inherent problems.
-
There is
also a graphical technique to assess the robustness of meta-analysis results.
We should carry out the meta-analysis dropping one study at a time;
that is, if we have N studies we should do N meta-analyses using N-1 studies
in each one. After that we plot these N estimates on the y axis and compare
them with a straight line that represents the overall estimate using all
the studies.
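The numbers behind such a plot are easy to compute. The sketch below pools the study effects with inverse-variance weights, which is one common choice but an assumption here (the notes do not say how the overall estimate is formed), and then recomputes the pooled estimate with each study left out in turn. The effect sizes and variances are hypothetical.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted average of the study effects (assumed pooling rule)."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Pooled estimate from the N-1 remaining studies, for each omitted study."""
    estimates = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        estimates.append(pooled_estimate(kept_effects, kept_variances))
    return estimates


# Hypothetical study effect sizes and their variances.
effects = [0.2, 0.5, 0.25, 0.9]
variances = [0.04, 0.05, 0.04, 0.05]

overall = pooled_estimate(effects, variances)
loo = leave_one_out(effects, variances)
# Index of the study whose removal moves the estimate furthest from the overall value.
most_influential = max(range(len(loo)), key=lambda i: abs(loo[i] - overall))
```

Plotting the values in `loo` against the study index, with a horizontal line at `overall`, gives the robustness plot described above; in this made-up example the fourth study (effect 0.9) is the one whose removal changes the result most.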
-
Topics
in meta-analysis include: odds ratios; relative risk; risk difference;
effect size; incidence rate difference and ratio; plots and exact confidence
intervals.
-
For details,
read
Glass G., B. McGaw, and M. Smith, Meta-Analysis in Social Research, 1981, and
Cooper H., and L. Hedges (Eds.), Handbook of Research Synthesis,
New York, Russell Sage Foundation, 1994.
-
also visit
Meta-Analysis, and
Meta-Analysis: Methods of Accumulating Results Across Research Domains.
Prediction Interval
-
The idea
is that if $\bar{Y}$ is the mean of
a random sample of size n from a normal population, and Y is a single additional
observation, then the statistic
-
$\bar{Y} - Y$ is normal with mean 0 and variance $(1 + 1/n)\sigma^2$.
-
Since
we don't actually know $\sigma^2$, we estimate it by the sample variance $S^2$
and use the t distribution with n - 1 degrees of freedom in evaluating the
statistic. The appropriate Prediction Interval for Y is
-
$\bar{Y} \pm t_{\alpha/2} \, S \, (1 + 1/n)^{1/2}$.
-
This is similar to the construction of the interval
for an individual prediction in regression analysis.
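-
As a small worked sketch of the prediction interval formula, the function below takes the sample and a t critical value supplied by the user (from a t table or statistical software, with n - 1 degrees of freedom). The sample values are hypothetical, and the function name is only illustrative.

```python
import math
import statistics


def prediction_interval(sample, t_crit):
    """Prediction interval for one additional observation from a normal population.

    t_crit is the upper alpha/2 quantile of Student's t with n - 1 degrees
    of freedom, taken from a t table or statistical software.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation S
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n)
    return mean - half_width, mean + half_width


# Hypothetical sample of n = 10 measurements.
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
T_CRIT_95_DF9 = 2.262  # t_{0.025} with 9 degrees of freedom (standard table value)

low, high = prediction_interval(sample, T_CRIT_95_DF9)
print(f"95% prediction interval: ({low:.3f}, {high:.3f})")
```

The factor (1 + 1/n)^(1/2) makes this interval wider than the confidence interval for the mean, which uses (1/n)^(1/2) instead.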
Fitting Data to a Broken
Line
-
Fitting
data to a broken line means determining the parameters a, b, c, and d such
that
-
y = a + b x, for x less than or equal to c
y = a - d c + (d + b) x, for x greater than or equal to c
-
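A simple
solution is a brute force search across the values of c. Once c is known,
estimating a, b, and d is trivial through the use of indicator variables:
both pieces can be written as the single regression
y = a + b x + d (x - c) I(x > c), where I(x > c) is 1 if x exceeds c and 0 otherwise.
One may use (x - c) as the independent variable, rather than x, for computational
convenience.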
-
Now, just
fix c at a fine grid of x values in the range of your data, estimate a,
b, and d, and then note what the mean squared error is. Select the value
of c that minimizes the mean squared error.
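-
The grid search can be sketched in plain Python as below. This is only an illustration under stated assumptions: the model is written as y = a + b x + d max(x - c, 0), which reproduces the two pieces given earlier; the least-squares step solves the normal equations with a small hand-written solver; and the data are hypothetical and noise-free so that the true break at x = 5 is recovered exactly.

```python
def solve(A, rhs):
    """Solve the linear system A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][k] * z[k] for k in range(i + 1, n))) / M[i][i]
    return z


def fit_given_c(x, y, c):
    """Least-squares fit of y = a + b*x + d*max(x - c, 0) for a fixed breakpoint c."""
    rows = [[1.0, xi, max(xi - c, 0.0)] for xi in x]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    a, b, d = solve(XtX, Xty)
    sse = sum((yi - (a * r[0] + b * r[1] + d * r[2])) ** 2 for r, yi in zip(rows, y))
    return (a, b, d), sse / len(x)


def fit_broken_line(x, y, grid):
    """Brute-force search: keep the breakpoint c with the smallest mean squared error."""
    best = None
    for c in grid:
        params, mse = fit_given_c(x, y, c)
        if best is None or mse < best[2]:
            best = (c, params, mse)
    return best


# Hypothetical noise-free data with a true break at x = 5 (a = 1, b = 2, d = -3).
x = [float(i) for i in range(11)]
y = [1 + 2 * xi - 3 * max(xi - 5.0, 0.0) for xi in x]

# Candidate breakpoints strictly inside the data range, in steps of 0.25.
grid = [1 + 0.25 * k for k in range(33)]  # 1.0, 1.25, ..., 9.0
c, (a, b, d), mse = fit_broken_line(x, y, grid)
print(f"c = {c}, a = {a:.3f}, b = {b:.3f}, d = {d:.3f}, MSE = {mse:.2e}")
```

The grid is kept strictly inside the range of x so that each candidate c leaves data on both sides of the break.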
-
Unfortunately,
you won't be able to get confidence intervals involving c, and the confidence
intervals for the remaining parameters will be conditional on the value
of c.
-
For more
details, see Applied Regression Analysis, by Draper and Smith, Wiley
1981, Chapter 5, Section 5.4 on the use of dummy variables, Example 6.
How to Determine if Two
Regression Lines Are Parallel?
-
To determine
whether two regression lines are parallel, construct the following
multiple linear regression model:
$E(y) = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3$
where $X_1$ = interval predictor variable, $X_2$ = 1 if group 1,
0 if group 0,
and $X_3 = X_1 X_2$.
Then, $E(y \mid \text{group}=0) = b_0 + b_1 X_1$
and $E(y \mid \text{group}=1) = b_0 + b_1 X_1 + b_2 \cdot 1 + b_3 X_1 \cdot 1$
$= b_0 + b_1 X_1 + b_2 + b_3 X_1$
$= (b_0 + b_2) + (b_1 + b_3) X_1$
-
That is,
E(y|group=1) is a simple regression with a potentially different slope
and intercept compared to group=0.
-
H0: slope(group
1) = slope(group 0) is equivalent to H0: b3 = 0
-
Use the t-test for b3
from the variables-in-the-equation table to test this hypothesis.
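-
For readers without a regression package at hand, the t statistic for b3 can be sketched with elementary sums. The sketch relies on the fact that the interaction model is equivalent to fitting a separate line in each group with a pooled residual variance, so that b3 equals the difference of the two slopes. The data are hypothetical, and the function names are illustrative only.

```python
import math


def simple_fit(x, y):
    """Least-squares line for one group: returns slope, intercept, Sxx and SSE."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, sxx, sse


def slope_difference_test(x0, y0, x1, y1):
    """t statistic for H0: b3 = 0 in E(y) = b0 + b1*X1 + b2*X2 + b3*X1*X2."""
    slope0, _, sxx0, sse0 = simple_fit(x0, y0)
    slope1, _, sxx1, sse1 = simple_fit(x1, y1)
    df = len(x0) + len(x1) - 4  # four coefficients are estimated
    s2 = (sse0 + sse1) / df  # pooled residual variance
    b3 = slope1 - slope0  # interaction coefficient
    se_b3 = math.sqrt(s2 * (1.0 / sxx0 + 1.0 / sxx1))
    return b3, b3 / se_b3, df


# Hypothetical data: group 0 has slope near 2, group 1 has slope near 1.
x0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y0 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y1 = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]

b3, t_stat, df = slope_difference_test(x0, y0, x1, y1)
print(f"b3 = {b3:.3f}, t = {t_stat:.2f} on {df} degrees of freedom")
```

Compare |t| with the tabulated t value for the printed degrees of freedom; a large |t| is evidence that the lines are not parallel.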
Constrained Regression
Model
-
If you
fit a regression forcing the intercept to be zero, the standard error of
the slope is less. That seems counter-intuitive. The intercept should be
included in the model because it is significant, so why is the standard
error for the slope in the worse-fitting model actually smaller?
-
I agree
that it's initially counter-intuitive (see below), but here are two reasons
why it's true. The variance of the slope estimate for the constrained model
is $\sigma^2 / \sum X_i^2$,
where $X_i$ are the actual X values and $\sigma^2$
is estimated from the residuals. The variance of the slope estimate for
the unconstrained model (with intercept) is $\sigma^2 / \sum x_i^2$,
where $x_i = X_i - \bar{X}$ are deviations from the mean, and $\sigma^2$
is still estimated from the residuals. Since $\sum X_i^2 = \sum x_i^2 + n\bar{X}^2$,
the constrained denominator is never the smaller of the two. So, the constrained model can have
a larger estimate of $\sigma^2$
(mean square error/"residual" and standard error of estimate) but a smaller
standard error of the slope because the denominator is larger.
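-
A small numerical sketch makes the comparison concrete. The data below are hypothetical, chosen with X values far from zero and a small true intercept; with other data the ordering of the two standard errors can differ, so this illustrates the "can have" in the argument rather than a general rule.

```python
import math


def slope_with_intercept(x, y):
    """Slope, its standard error, and s^2 for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b, math.sqrt(s2 / sxx), s2


def slope_through_origin(x, y):
    """Slope, its standard error, and s^2 for y = b*x (intercept forced to 0)."""
    n = len(x)
    sum_x2 = sum(xi ** 2 for xi in x)  # sum of squared raw X values
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum_x2
    s2 = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    return b, math.sqrt(s2 / sum_x2), s2


# Hypothetical data: roughly y = 1 + 2x with small noise, X far from zero.
x = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
y = [21.1, 22.8, 25.2, 26.9, 29.1, 30.9]

b_full, se_full, s2_full = slope_with_intercept(x, y)
b_zero, se_zero, s2_zero = slope_through_origin(x, y)
print(f"with intercept: slope {b_full:.4f}, s.e. {se_full:.4f}, s^2 {s2_full:.4f}")
print(f"through origin: slope {b_zero:.4f}, s.e. {se_zero:.4f}, s^2 {s2_zero:.4f}")
```

Here the through-origin fit has the larger residual variance but the smaller slope standard error, because its denominator is the sum of squared raw X values.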
-
$r^2$
also behaves very strangely in the constrained model; by the conventional
formula, it can be negative; by the formula used by most computer packages,
it is generally larger than the unconstrained $r^2$ because it is dealing
with deviations from 0, not deviations from the mean. This is because,
in effect, constraining the intercept to 0 forces us to act as if the mean
of X and the mean of Y both were 0.
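-
The two versions of r-squared can be compared on a small hypothetical data set in which Y is nearly constant and far from zero. In this sketch, "conventional" means 1 - SSE / (sum of squared deviations of Y from its mean) and "uncentered" means 1 - SSE / (sum of squared Y values); which formula a given package reports should be checked in its documentation.

```python
def r2_through_origin(x, y):
    """Return (conventional, uncentered) r^2 for the no-intercept fit y = b*x."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    ybar = sum(y) / len(y)
    sst_centered = sum((yi - ybar) ** 2 for yi in y)  # deviations from the mean
    sst_uncentered = sum(yi ** 2 for yi in y)  # deviations from 0
    return 1 - sse / sst_centered, 1 - sse / sst_uncentered


def r2_with_intercept(x, y):
    """Ordinary r^2 for the fit y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical data: Y hovers around 10 and barely depends on X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.5, 9.5, 10.2, 9.8]

conventional, uncentered = r2_through_origin(x, y)
unconstrained = r2_with_intercept(x, y)
print(f"through origin, conventional formula: {conventional:.2f}")
print(f"through origin, uncentered formula:   {uncentered:.2f}")
print(f"with intercept:                       {unconstrained:.2f}")
```

The conventional value is negative while the uncentered value looks respectable, even though the through-origin line fits these points far worse than a line with an intercept.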
-
Once
you recognize that the s.e. of the slope isn't really a measure of overall
fit, the result starts to make a lot of sense. Assume that all your X and
Y are positive. If you're forced to fit the regression line through the
origin (or any other point) there will be less "wiggle" in how you can
fit the line to the data than there would be if both "ends" could move.
-
Consider
a bunch of points that are all far from zero. If you force
the regression through zero, the line will pass through the origin and run
close to all the points, with little apparent error, but also with little
precision and little validity. Therefore, a no-intercept model is hardly ever appropriate.
Semiparametric and Non-parametric
modeling
-
Many parametric
regression models in applied science have a form like response = function(X1, ...,
Xp, unknown influences). The "response" may be a decision (to
buy a certain product), which depends on p measurable variables and an
unknown remainder term. In statistics, the model is usually written as
-
$Y = m(X_1, \dots, X_p) + e$
-
and the
unknown e is interpreted as the error term.
-
The simplest
model for this problem is the linear regression model; an often
used generalization is the Generalized Linear Model (GLM)
-
$Y = G(X_1 b_1 + \dots + X_p b_p) + e$
-
where G
is called the link function (in the more common GLM convention, the link
function is the inverse of G). All these models lead to the problem of estimating
a multivariate regression. Parametric regression estimation has the disadvantage
that the parametric "form" already implies certain properties of the resulting
estimate.
-
Nonparametric
techniques allow diagnostics of the data without this restriction. However,
this requires large sample sizes and causes problems in graphical visualization.
Semiparametric methods are a compromise between both: they support a nonparametric
modeling of certain features and profit from the simplicity of parametric
methods.
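-
To give a flavor of nonparametric regression, here is a minimal sketch of one common estimator of m, the Nadaraya-Watson kernel smoother with a Gaussian kernel. The choice of estimator, the bandwidth value, and the data are assumptions made for illustration; the text does not prescribe a particular method.

```python
import math


def kernel_regression(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate of m(x0) with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)


# Hypothetical data from a curved relationship that a straight line would miss.
x_train = [i / 10 for i in range(0, 63)]
y_train = [math.sin(xi) for xi in x_train]

for x0 in (1.0, 3.0, 5.0):
    fit = kernel_regression(x_train, y_train, x0, bandwidth=0.3)
    print(f"x = {x0}: estimate {fit:.3f}, true value {math.sin(x0):.3f}")
```

No functional form is imposed on m; the estimate at each point is a locally weighted average of the observed responses, and the bandwidth controls how local that average is.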
-
Further
Readings:
Härdle
W., S. Klinke, and B. Turlach, XploRe: An Interactive Statistical Computing
Environment, Springer, New York, 1995.
Moderation and Mediation
-
"Moderation"
is an interactional concept. That is, a moderator variable "modifies" the
relationship between two other variables. "Mediation", by contrast, is a "causal
modeling" concept: the "effect" of one variable on another is "mediated"
through a third variable. That is, there is no "direct effect", but rather
an "indirect effect."
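-
One common way to write the two ideas as regression equations is sketched below. The symbols (X for the predictor, Z for the moderator, M for the mediator, and the coefficient names) are assumptions introduced here for illustration and do not come from the text.

```latex
% Moderation: the effect of X on Y depends on the level of Z.
Y = b_0 + b_1 X + b_2 Z + b_3 XZ + e,
\qquad \frac{\partial E(Y)}{\partial X} = b_1 + b_3 Z

% Mediation: X acts on Y through M.
M = a_0 + a X + e_1,
\qquad Y = c_0 + c' X + b M + e_2

% Indirect effect of X on Y: a b. Direct effect: c'.
% The case described in the text (no direct effect) corresponds to c' = 0.
```

In the moderation equation a nonzero $b_3$ is what "modifies" the relationship; in the mediation equations the product $ab$ carries the indirect effect.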
Discriminant and Classification
-
Classification
or discrimination involves learning a rule whereby a new observation can
be classified into a pre-defined class. Current approaches can be grouped
into three historical strands: statistical, machine learning and neural
network. The classical statistical methods make distributional assumptions.
There are many others which are distribution free, and which require some
regularization so that the rule performs well on unseen data. Recent interest
has focused on the ability of classification methods to be generalized.
-
We often
need to classify individuals into two or more populations based on a set
of observed "discriminating" variables. Methods of classification are used
when discriminating variables are:
-
quantitative
and approximately normally distributed;
-
quantitative
but possibly nonnormal;
-
categorical;
or
-
a combination
of quantitative and categorical.
-
It is important
to know when and how to apply linear and quadratic discriminant analysis,
nearest neighbor discriminant analysis, logistic regression, categorical
modeling, classification and regression trees, and cluster analysis to
solve the classification problem. SAS has all the routines you need
for proper use of these classification methods. Relevant topics are: Matrix operations,
Fisher's Discriminant Analysis, Nearest Neighbor Discriminant Analysis,
Logistic Regression and Categorical Modeling for classification, and Cluster
Analysis.
-
For example,
two related methods which are distribution free are the k-nearest neighbor
classifier and the kernel density estimation approach. In both methods,
there are several problems of importance: the choice of smoothing parameter(s)
or k, and choice of appropriate metrics or selection of variables. These
problems can be addressed by cross-validation methods, but this is computationally
slow. An analysis of the relationship with a neural net approach (LVQ)
should yield faster methods.
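-
Choosing k by cross-validation can be sketched as follows. This is a minimal illustration assuming Euclidean distance, majority voting, and leave-one-out cross-validation; the two-class data are hypothetical. The nested loop over held-out points and sorted distances also shows why the approach becomes slow on large data sets.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def loo_error(data, k):
    """Leave-one-out cross-validated misclassification rate for a given k."""
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) != label:
            errors += 1
    return errors / len(data)


# Hypothetical two-class data: class A near (1, 1), class B near (3, 3).
data = [
    ((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((0.8, 1.0), "A"),
    ((1.1, 1.3), "A"), ((0.9, 0.7), "A"),
    ((3.0, 3.1), "B"), ((3.2, 2.9), "B"), ((2.8, 3.0), "B"),
    ((3.1, 3.3), "B"), ((2.9, 2.7), "B"),
]

candidates = (1, 3, 5, 9)  # odd values avoid tied votes with two classes
errors = {k: loo_error(data, k) for k in candidates}
best_k = min(candidates, key=lambda k: errors[k])
print(errors, "-> chosen k:", best_k)
```

With k = 9 every held-out point is outvoted by the other class, which shows how a poorly chosen smoothing parameter can ruin an otherwise easy classification problem.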
-
Further
Readings:
Cherkassky
V, and F. Mulier, Learning from Data: Concepts, Theory, and Methods, John
Wiley & Sons, 1998.
-
Visit also
the Web site Tree-Structured
& Rules Induction Programs Homepage
Generalized Linear and
Logistic Models
-
The generalized
linear model (GLM) is possibly the most important development in practical
statistical methodology in the last twenty years. Generalized linear models
provide a versatile modeling framework in which a function of the mean
response is "linked" to the covariates through a linear predictor and in
which variability is described by a distribution in the exponential dispersion
family. These models include logistic regression and log-linear models
for binomial and Poisson counts together with normal, gamma and inverse
Gaussian models for continuous responses. Standard techniques for analyzing
censored survival data, such as the Cox regression, can also be handled
within the GLM framework. Relevant topics are: Normal theory linear models,
Inference and diagnostics for GLMs, Binomial regression, Poisson regression,
Methods for handling overdispersion, Generalized estimating equations (GEEs).
-
Here is
how to obtain the number of degrees of freedom for the likelihood ratio
statistic in a logistic
regression. Degrees of freedom pertain to the dimension of the vector of
parameters for a given model. Suppose we know that a model ln(p/(1-p)) = B0
+ B1x + B2y + B3w fits a set of data. In this case the vector B = (B0, B1,
B2, B3) is an element of 4-dimensional Euclidean space, or R^4.
-
Suppose
we want to test the hypothesis H0: B3 = 0. We are imposing a restriction
on our parameter space. The vector of parameters must be of the form B' = (B0, B1,
B2, 0). This vector is an element of a three-dimensional subspace of R^4, namely
the one defined by B3 = 0. The likelihood ratio statistic has the form:
-
2 log(likelihood ratio)
= 2 log(maximum unrestricted likelihood / maximum restricted likelihood)
= 2 log(maximum unrestricted likelihood) - 2 log(maximum restricted likelihood)
-
Its degrees of freedom are
the dimension of the unrestricted B vector (4) minus the dimension of the
restricted B vector (3), that is, 1 degree of freedom. This corresponds to
the difference vector B'' = B - B' = (0, 0, 0, B3), which spans a one-dimensional subspace of
R^4.
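-
The same counting argument can be tried on the smallest possible case. The sketch below assumes a logistic model with a single binary predictor, ln(p/(1-p)) = B0 + B1 x, because its maximum likelihood fits are simply the observed proportions; testing H0: B1 = 0 then compares 2 parameters with 1, again giving 1 degree of freedom. The counts are hypothetical, and the p-value uses the chi-square approximation.

```python
import math


def binomial_loglik(successes, trials, p):
    """Log-likelihood of `successes` out of `trials` with success probability p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


def lr_test_binary_predictor(s0, n0, s1, n1):
    """Likelihood ratio test of H0: B1 = 0 in ln(p/(1-p)) = B0 + B1*x, x in {0, 1}."""
    # Unrestricted model (2 parameters): each group has its own fitted probability.
    ll_unrestricted = (binomial_loglik(s0, n0, s0 / n0)
                       + binomial_loglik(s1, n1, s1 / n1))
    # Restricted model (1 parameter): a common probability for both groups.
    p_common = (s0 + s1) / (n0 + n1)
    ll_restricted = (binomial_loglik(s0, n0, p_common)
                     + binomial_loglik(s1, n1, p_common))
    statistic = max(2 * ll_unrestricted - 2 * ll_restricted, 0.0)
    df = 2 - 1  # difference in parameter counts
    # For 1 degree of freedom the chi-square tail area is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, df, p_value


# Hypothetical counts: 12 events out of 40 when x = 0, 24 out of 40 when x = 1.
statistic, df, p_value = lr_test_binary_predictor(12, 40, 24, 40)
print(f"LR statistic = {statistic:.3f}, df = {df}, p = {p_value:.4f}")
```

The function assumes every group has at least one event and one non-event, so that the logarithms are defined.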
-
The standard
textbook is Generalized Linear Models by McCullagh and Nelder (Chapman
& Hall, 1989).
* Logistic regression of x on the listed predictors; y, x1 and x2 are treated as categorical.
LOGISTIC REGRESSION VAR=x
  /METHOD=ENTER y x1 x2 f1ros f1ach f1grade bylocus byses
  /CONTRAST (y)=INDICATOR
  /CONTRAST (x1)=INDICATOR
  /CONTRAST (x2)=INDICATOR
  /CLASSPLOT /CASEWISE OUTLIER(2)
  /PRINT=GOODFIT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .
-
Survival Analysis
-
Survival
analysis is suited to the examination of data where the outcome of interest
is 'time until a specific event occurs', and where not all individuals
have been followed up until the event occurs.
-
The methods
of survival analysis are applicable not only in studies of patient survival,
but also studies examining adverse events in clinical trials, time to discontinuation
of treatment, duration in community care before re-hospitalisation, contraceptive
and fertility studies etc.
-
If you've
ever used regression analysis on longitudinal event data, you've probably
come up against two intractable problems:
-
Censoring:
Nearly every sample contains some cases that do not experience an event.
If t