5.2 Statistics and Probability and Collecting and Summarizing Data

朗朗xl 2017-02-25

B. Statistics and Probability

The Measure phase uses statistics to aid in analysis.

1. Drawing Valid Statistical Conclusions

Enumerative/Descriptive Statistics is the branch of statistics that focuses on collecting, summarizing, and presenting a static set of data. It is the analysis of the numerical data to provide information about the specific data that is being analyzed.

For example, (1) the mean of the three values 3, 5 and 8 is 5.33.

(2) A summary of the employee benefits, such as travel allowances, sick leave, and healthcare costs, used by the employees of an organization in a given fiscal year.

(3) A study of customer call handling time in a BPO for a particular process in a given month. Conclusions can be made about the average handling time of a sample of selected customers. Questions such as why processing time varies from customer to customer, or whether different processes face the same problem, are not addressed in an enumerative study.
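
Example (1) can be checked directly. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean

values = [3, 5, 8]            # the three values from example (1)
sample_mean = mean(values)    # 16 / 3

print(round(sample_mean, 2))  # 5.33
```

The result describes only these three values; nothing is inferred about any larger group, which is what makes the calculation enumerative.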

Analytical/Inferential Statistics is used to draw inferences about a broader population based on sample data. A proper analytical study is based on a sufficient sample size (neither too large nor too small) and on proper sampling methods, which give confidence that the selected sample is representative of the population under study.

For example, suppose a sample of children aged 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 play a certain computer game. An analytical study would infer from this sample that the average age of all children who play that game is 11.

Population Parameter

To understand a population parameter, the meaning of population must first be understood. A population is the whole group of units, items, people, or services under study for a fixed period of time and a fixed location.

A population parameter is a statistical measure of a population, such as the population mean or standard deviation. The value of a parameter is estimated from a sample drawn from the population. The exact value of a parameter is generally not known with certainty, because the whole population is rarely measured.

A sample statistic is a statistical property of a set of data, such as the mean or standard deviation of the sample. The value of the statistic is known with certainty because it is calculated using all the items of the set of data.

The difference between enumerative and analytical studies

According to Deming (1975),

An enumerative study is defined as a study in which action will be taken on the universe. “Universe” is defined as the entire group of people, items, or services under a particular study. Sampling a selected lot of defects to determine the nature of defects of the entire lot is a case of an enumerative study. Enumerative studies draw conclusions about the universe actually studied. The aim of this study is estimation of parameters. This is the deductive approach. It involves counting techniques for huge numbers of possible outcomes.

An analytic study is defined as a study in which action will be taken on a process to improve performance in the future. In this study, the focus is on a process and ways of improving the process. Thus analytic studies direct their efforts at a universe which is yet to be produced, that is, at predicting a universe of the future. This method is the inductive method; it provides information for inductive reasoning. Analytical methods make use of tools like control charts, run charts, histograms, stem and leaf plots, etc.

In an enumerative study, the environment for study is static, whereas in an analytical study the environment is dynamic.

Another difference between these studies is that enumerative studies progress from predetermined hypotheses, whereas analytic studies aim to help the analyst produce new hypotheses. Analytical studies involve using data to develop possible explanations and new theories for quality improvement.

2. Sampling Distributions

Most Six Sigma projects involving enumerative studies deal with samples, and not populations. Some common formulae that are of interest to Six Sigma are given below.

1. The empirical distribution assigns the probability 1/n to each $X_i$ in the sample. Thus the mean of the distribution is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

$\bar{X}$ is called the sample mean, since the empirical distribution is determined by a sample.

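2. The variance of the empirical distribution is given by the following equation:

$$S^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2$$

where $\bar{X}$ is the sample mean. The above equation is called the sample variance.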

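3. The sample standard deviation based on the unbiased variance estimate (divisor n − 1 instead of n) is given by the following equation:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$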

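4. Another sampling statistic is the standard deviation of the average, also called the standard error (SE). This is given by the following formula:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation of the population and n is the sample size.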

It is evident from the above formula that the standard error of the mean is inversely proportional to the square root of the sample size. This relationship is shown in the graph below:

It is seen that averages of n = 4 have a distribution half as variable as the population from which the samples are drawn, since 1/√4 = 1/2.
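
The four sampling statistics named above (sample mean, variance of the empirical distribution, sample standard deviation, and standard error) can be computed as in the following sketch. The data values are hypothetical, and the standard error here is an estimate that substitutes the sample standard deviation for the unknown population standard deviation.

```python
from math import sqrt
from statistics import fmean, pvariance, stdev

# Hypothetical sample of n = 4 measurements (illustrative values only).
sample = [4.0, 6.0, 7.0, 11.0]
n = len(sample)

x_bar = fmean(sample)              # sample mean: (1/n) * sum of X_i
var_empirical = pvariance(sample)  # variance with divisor n
s = stdev(sample)                  # standard deviation with divisor n - 1
se = s / sqrt(n)                   # estimated standard error of the mean

print(x_bar, var_empirical)        # 7.0 6.5
print(round(s, 3), round(se, 3))
```

With n = 4 the estimated standard error is exactly half of s, which matches the statement about averages of n = 4.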

3. Central Limit Theorem

The Central Limit Theorem is stated as:

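Irrespective of the shape of the distribution of the population or the universe, the distribution of average values of samples drawn from that universe will tend toward a normal distribution as the sample size n grows large (Thomas Pyzdek, 1976). Equivalently, the standardized average $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ tends toward the standard normal distribution, i.e., with mean 0 and standard deviation 1, as n tends to infinity. The theorem assumes that the universe has a finite variance.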

In other words, the distribution of an average tends to be normal, even when the distribution from which the average is calculated is definitely not normal. A remarkable thing about this theorem is that no matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution.

Furthermore, the average of sample averages will have the same average as the universe, and the standard deviation of the averages will be equal to the standard deviation of the universe divided by the square root of the sample size.

The normal distribution is symmetric, so its mean, median, and mode are equal.

The Central Limit Theorem has many practical implications. It provides the basis for many statistical process control tools, such as quality control charts, which are widely used in Six Sigma. By the Central Limit Theorem, means of even small samples can be used to evaluate a process using the normal distribution, provided the underlying distribution is not extremely non-normal.
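
The behavior described by the theorem can be illustrated with a small simulation. The sketch below is an illustration under assumed settings, not a proof: it draws samples from an exponential distribution (strongly skewed, with mean 1 and standard deviation 1) and checks that the sample averages center on the universe mean with a spread close to the universe standard deviation divided by √n. The sample size, number of trials, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)   # fixed seed so the run is repeatable

n = 30           # sample size (assumed for illustration)
trials = 5000    # number of samples drawn

# Each entry is the average of one sample of size n from a skewed distribution.
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

mean_of_means = fmean(sample_means)  # expected to be close to 1
sd_of_means = pstdev(sample_means)   # expected to be close to 1 / sqrt(n)

print(mean_of_means, sd_of_means, 1 / sqrt(n))
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are far from normal.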

Application of Inferential Statistics

The statistical methods described in the preceding section are enumerative. In Six Sigma applications of enumerative statistics, inferences about populations are made based on data from samples. Statistical inference is concerned with decision making. For example, sample means and standard deviations can be used to predict future performance, such as long-term yields or possible failures.

The techniques of statistical inference are designed so that, although it is not possible to be certain that a particular decision is correct, the long-run proportion of correct decisions is known in advance. Any estimate that is based on a sample has some amount of sampling error. There are several types of errors that may occur in statistical inference:

Type 1 error refers to rejection of a hypothesis when it should not be rejected.

Type 2 error refers to acceptance of a hypothesis when it should be rejected. (Hypothesis testing will be discussed in detail in the following chapter: Chapter 6 - Black Belt, Analyze.)

The sample statistics discussed above (sample mean, sample standard deviation, and sample variance) are point estimators. These are single values used to represent population parameters. It is also possible to find an interval about the statistic that has a predetermined probability of including the true population parameter. This interval is called the confidence interval, and its end points are the confidence limits. Confidence intervals can be either one-sided or two-sided.

For example, if the mean income in a sample is $6000, it may be desirable to know the interval in which the mean income of the population probably lies. This is expressed in terms of confidence limits.
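
As a rough sketch of the income example, a two-sided confidence interval for the population mean can be computed with the normal approximation. Only the sample mean of $6000 comes from the text; the sample standard deviation, the sample size, and the 95% confidence level are hypothetical values chosen for illustration, and the normal (z) approximation assumes a reasonably large sample.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 6000.0  # mean income from the text
sample_sd = 1500.0    # hypothetical sample standard deviation
n = 100               # hypothetical sample size
confidence = 0.95     # assumed confidence level

# Two-sided critical value of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * sample_sd / sqrt(n)  # z times the estimated standard error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"{lower:.0f} to {upper:.0f}")  # 5706 to 6294
```

Under these assumptions, the confidence limits are about $5706 and $6294. A larger sample would narrow the interval, because the standard error shrinks with the square root of n.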

Six Sigma uses analytic statistics most of the time, but sometimes enumerative statistics prove useful. Analytic methods are used to locate the fundamental process dynamics and to improve and control the processes involved in Six Sigma.

C. Collecting and Summarizing Data

The data collection plan is built while measuring the process. A process can be improved by studying the information gathered from data collected from the actual process. This data collected has to be accurate and relevant to the quality issue being taken up under the Six Sigma project. Any data collection plan includes:

A brief overview of the project, along with the problem statement (stating why the data has to be collected)

A list of questions, which should be answered by the data collected

Determining the data type which will be suitable for the data a process is generating

Determining how many data points need to be collected to show the change in the chart

A list of the measures to be taken, once the data has been collected

The name of the person who will be collecting the data and when

A good data collection plan facilitates the accurate and efficient collection of data.

After the data is collected, the kind of data that a particular process produces must be determined. Before analyzing the data, it is necessary to know its type so that an appropriate tool can be applied to it.

The following section gives the definitions and classification of data. After studying the data, it becomes essential to identify opportunities to transform the attribute data to variable measures.

1. Types of Data

No two things are exactly alike; therefore there are inherent differences in the data. Each characteristic under study is referred to as a variable. In Six Sigma, these are known as CTQ or critical to quality characteristics for a process, product or service.

Attribute (Discrete) Data: Attribute data, also known as discrete data, can take on only a finite or countable set of values. Typically such data is counted in whole numbers and cannot be broken down into smaller units. For example, the number of family members cannot be 4.5, and the number of defects in a sample is discrete data. No additional meaning can be added to such data.

Some other examples of attribute data are:

Zip codes in a country

“Sweet” or “Sour” taste

“Regular”, “Medium” or “Large” sizes of pizza

“Fat” or “Thin” attributes given to a person

Variable (Continuous) Data: Variable data, also known as continuous data, is data which can have any value on a continuous scale. Continuous data exists on an interval, or on several intervals. Variable data can have almost any numeric value and can be meaningfully divided into finer increments, depending upon the precision of the measurement system.

For example: The height of a person on a ruler can be read as 1.2 meters, 1.05 meters or 1.35 meters.

The important distinction between attribute data and variable data is that variable data can be meaningfully added or subtracted, while attribute data cannot be meaningfully added or subtracted.
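
The distinction can be sketched in a few lines: attribute data is summarized by counting categories, while variable data is summarized by arithmetic on the measurements. The pizza orders below are hypothetical; the heights are the three values from the example above.

```python
from collections import Counter
from statistics import fmean

# Attribute (discrete) data: hypothetical pizza sizes ordered in one hour.
sizes = ["Regular", "Large", "Medium", "Large", "Regular", "Large"]
size_counts = Counter(sizes)                     # categories are counted
share_large = size_counts["Large"] / len(sizes)  # proportion of "Large" orders

# Variable (continuous) data: heights in meters.
heights = [1.2, 1.05, 1.35]
mean_height = fmean(heights)                     # measurements are averaged

print(size_counts["Large"], share_large)         # 3 0.5
print(round(mean_height, 2))                     # 1.2
```

Averaging the size labels themselves would have no meaning, whereas the average height is a valid summary.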

2. Scales of Measurement

The next step in data collection is to define and apply measurement scales to the data collected.

The idea behind measurement is that improvement in a process can begin only when quality is measured or quantified. Essentially, a numerical assignment to a non-numerical element is called measurement. Measurements communicate certain information about the relationship between one element and the other elements.

There are four types of measurement scales:

a. Nominal Scale: This is the simplest and weakest kind of measurement. It is a form of classification that shows only the presence or absence of an attribute. The data collected on a nominal scale is called attribute data. For example, success/fail, accept/reject, correct/incorrect.
Nominal measurements can represent a membership or a designation, such as 1 = female, 2 = male. The statistics used with a nominal scale include percent, proportion, and chi-square tests.

b. Ordinal Scale: This scale has a natural order of values. It can express whether one item is more or less than another, but the spacing between the values is not defined. For example, product popularity rankings can be high, higher, and highest, and products can be ranked on attributes such as taste or attractiveness. This scale can be studied with the comparison operators =, ≠, <, >.
Statistical techniques such as rank-order correlation can be applied to ordinal data. In quality improvement methods like Six Sigma, ordinal data is commonly converted to nominal data and analyzed using binomial or Poisson models.
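
As an illustration of rank-order correlation, the sketch below computes Spearman's rank correlation for two hypothetical rankings of the same five products. The function name `spearman_rho` and the rankings are made up for this example, and the shortcut formula used is valid only when there are no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties allowed."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical popularity rankings of five products by two customer panels.
panel_1 = [1, 2, 3, 4, 5]
panel_2 = [2, 1, 3, 5, 4]

rho = spearman_rho(panel_1, panel_2)
print(round(rho, 2))  # 0.8
```

Only the order of the ranks enters the calculation, which is why this statistic suits ordinal data, where the spacing between values is undefined.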

c. Interval Scale: A variable measured on an interval scale gives the same information about more or less as an ordinal scale does, but the distance between successive values is equal. Examples include temperature in °C or °F and calendar time. On this scale, ratios of differences are unchanged by a change of units. For example, 180°C = 356°F. Conversion between two interval scales is accomplished by the transformation y = ax + b, a > 0.
Statistical techniques such as correlations, t-tests, multiple regression, and F-tests can be applied to interval data.
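
The temperature example can be sketched as follows. The Celsius-to-Fahrenheit conversion is y = ax + b with a = 9/5 and b = 32; the three test temperatures are arbitrary illustrative values. The sketch shows that ratios of differences survive the conversion, while ratios of the values themselves do not.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit: y = a*x + b with a = 9/5 and b = 32."""
    return celsius * 9 / 5 + 32

print(c_to_f(180))  # 356.0

# Three arbitrary temperatures in Celsius and their Fahrenheit equivalents.
c1, c2, c3 = 10, 20, 40
f1, f2, f3 = c_to_f(c1), c_to_f(c2), c_to_f(c3)

ratio_of_diffs_c = (c3 - c2) / (c2 - c1)  # 2.0
ratio_of_diffs_f = (f3 - f2) / (f2 - f1)  # also 2.0

ratio_of_values_c = c3 / c2               # 2.0
ratio_of_values_f = f3 / f2               # about 1.53, not 2.0
```

Because the zero point of an interval scale is arbitrary, a statement such as "40°C is twice as hot as 20°C" is not meaningful.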

d. Ratio Scale: In this scale, measurements of an object in two different metrics are related to one another by an invariant ratio (Thomas Pyzdek, 1976). For example, if an object’s mass is measured in pounds (x) and kilograms (y), then x/y = 2.2 for all values of x and y. This means that conversion from one ratio scale to another has the form y = ax, a > 0, e.g., pounds = 2.2 × kilograms. Zero has a meaning here: it means an absence of mass. Another example is temperature measured in kelvins; no value below 0 K is possible. Likewise, a weight of 0 lbs is a meaningful absence of weight.
Statistical techniques such as correlations, multiple regression, t-tests, and F-tests can be applied to ratio measurements.
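
A short sketch of the mass example, using the rounded factor of 2.2 pounds per kilogram from the text. The masses are hypothetical. Because the conversion has the form y = ax with no offset, zero maps to zero and ratios of values are preserved.

```python
LB_PER_KG = 2.2  # rounded conversion factor used in the text

def kg_to_lb(kg):
    """Convert kilograms to pounds: y = a*x with a = 2.2."""
    return LB_PER_KG * kg

# Two hypothetical masses, the second twice as heavy as the first.
masses_kg = [5.0, 10.0]
masses_lb = [kg_to_lb(m) for m in masses_kg]

ratio_kg = masses_kg[1] / masses_kg[0]  # 2.0
ratio_lb = masses_lb[1] / masses_lb[0]  # same ratio in pounds

print(kg_to_lb(0.0))                    # 0.0
```

On a ratio scale it is therefore meaningful to say that one object is twice as heavy as another, whichever unit is used.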