  # Basic Statistics

## For a course called "Data Analytics" you'd expect a key knowledge dotpoint on statistics...

But NO!

Here's some stuff anyway, since I already wrote it for an old textbook.

# Identifying patterns and relationships between data

Science and business often need to find meaning in large data sets. Statistics lets data managers summarise huge amounts of raw data into small, informative, meaningful summaries. You are not expected to carry out very complex statistics when examining your hypothesis in U3O2, but an understanding of basic statistical concepts is essential when manipulating data, and reaching sensible conclusions.

All of the statistics discussed below can easily and automatically be calculated with a spreadsheet.

You will find it useful to know these statistical concepts:

 Concept Purpose Average The single value that gives the most representative summary value a range of numbers. Mean, mode and median are different ways of averaging data under different circumstances. Significance Indicates whether two data sets are basically similar, or are different in an important way. Standard deviation Indicates how consistent the values are in a data set. Are the values all close to the average, or do they vary greatly? A low standard deviation means that the mean is truly representative of the data and can be trusted. A high standard deviation suggests you should not trust the mean. Correlation Do two data sets show the same trends, as if they might be connected in some way? Causation One factor (e.g. eating patterns) is responsible for causing changes in another factor (e.g. weight)

## Basic Statistics

### Averages

Hiroyuki, a sushi restaurant manager, wants to monitor the performance of his chefs, Bruno and Eva. He puts each chef on duty for one week, and counts the numbers of diners each day.

• During Bruno’s week at work the counts were 12, 16, 8, 15, 22, 35, 7.
• During Eva’s week at work the counts were 4, 20, 5, 19, 7, 43, 3.

What conclusions can Hiroyuki draw from these raw figures? Not a lot. But an average number of diners would be informative. There are 3 methods of calculating averages.

#### The mean.

When most people talk about an ‘average’, they usually refer to an arithmetic mean.

Excel’s AVERAGE( ) function calculates a mean.

The mean is the sum of the data, divided by the number of data items.

• Bruno’s mean = 12 + 16 + 8 + 15 + 22 + 35 + 7 = 115 / 7 = 16.42 daily diners.
• Eva’s mean = 4 + 20 + 5 + 19 + 7 + 43 + 3 = 100 / 7 = 14.28 daily diners.

#### The median

The median is the “middle” number when the data are sorted.

• Bruno’s sorted data are = 7, 8, 12, 15, 16, 22, 35. Excel’s MEDIAN( ) function returns 15.
• Eva’s sorted data are 3, 4, 5, 7, 19, 20, 43. Excel’s median for her is 7.

As you can see, the mean and median can be very different, and can lead to very different interpretations. Choosing the most appropriate type of average is important. It is easy to misrepresent data with an inappropriate choice of average.

#### The mode

The mode is the value that occurs most often. It is sometimes (but not always) the most logical and representative average to use. For example:

Four people are chosen at random and asked how many children they have. The answers are 2, 2, 2, 16. What is the logical average? The mode, which is 2.

The “outlier” (16) – a figure that lies so far out from the rest of the data that it is not at all typical or representative of the majority – is ignored.

An outlier can be caused by experimental error or it might be a truly remarkable exception to the general population. It might be worthy of investigation in its own right, but it is not typical. A spreadsheet like Excel has a function called TRIMMEAN which will ignore outliers when calculating a mean, but you have to tell it when to start ignoring values.

Using the mean (5.5) would be misleading because it would not come even close to the truth for any of the surveyed people. Each type of average is logical and accurate under different conditions, but they often will not agree with each other.

The annual incomes of a representative sample of citizens of The Democratic Republic of Informatica are collected:

100, 100, 100, 100, 50, 120, 30, 200000.

You (as Supreme Ruler For Life of Informatica) want the world to believe that your loyal subjects are all well paid. What average method would you choose, and why?

An impartial observer wants to reflect the truth of the poverty of most Informatican citizens. What average method would she use, and why? DO THIS: Use a spreadsheet to enter the sample data shown and use the AVERAGE( ), MODE( ) and MEDIAN( ) functions to see the differences..

### Significance

If Bruno’s mean number of customers is 16.42 and Eva’s is 14.28 – is that a difference that the manager should pay attention to, or could it have happened purely by chance? A difference between data sets does not always mean anything important. A significance test can tell you when a difference is important, e.g. a “t-test” can be easily done with a spreadsheet.

Published conclusions about hypotheses often include statements like ‘p < 0.05’. This is a significance measure and means that the probability of the result being purely due to chance is less than 5%. The lower the number, the more confident you can be that the hypothesis is supported.

### Standard deviation

Data managers often need to know how much variety is in data. For example, if diner counts in week 1 were 8, 7, 9, 8, 7, 9, 8 (mean = 8), it’s obvious that all the data are very consistent and differ very little from the mean. A value of 33 would be very unusual.

On the other hand, if the figures for week 2 were 1, 33, 2, 1, 0, 3, 16, the mean is also 8 but the data are very inconsistent and none of them come anywhere near the value of the mean.

Excel calculates the standard deviation of week 1 figures as 0.816.
The second week’s standard deviation is 12.300. This indicates to a researcher that:

• week 1’s figures has very little variation and its mean accurately reflects the individual values of the whole data set.
• week 2’s data varies wildly, has very little consistency or pattern, and the mean can’t be trusted to accurately reflect the general nature of the whole data set.

Excel uses the STDEV( ) function to calculate standard deviations.

Using the mean and standard deviation we can work out whether any value in a data set could be considered “unusual” or special. This rule is used:

• a value that is more than one standard deviation away from the mean – either smaller or greater – is rather unusual and worthy of note.
• a value more than two standard deviations away from the mean is really unusual.
• three or more standard deviations away from the mean indicates a freakishly-uncommon value.

For example, thousands of 18 year-old Australian men’s heights are measured. The mean height is 175cm. The standard deviation is calculated to be 15cm. In English, this means “The average bloke is 175cm, give or take 15cm.”

The classic normal distribution curve looks like this. Most data are close to the mean, and progressively fewer and fewer of them vary greatly from the mean. The height of the curve represents the number of men with each height, so the highest point on the curve occurs at the mean value, 175cm. Above - Normal distribution of heights.

• A height between 160cm and 190cm – the mean (175) minus and plus one standard deviation (15) - would not be considered noteworthy. 68% of the men’s heights would fall in the shaded area.
• A height between 1 and 2 standard deviations (30cm) from the mean – 145cm to 160cm (area “B”) or 190cm to 205cm (area “E”) might be considered a little out of the ordinary. 13.6% of men fall into each of these areas. 96% of the entire population fall between -2 and +2 standard deviations from the mean.
• Only 3.4% of the population would be more or less than 2 standard deviations from the mean, and they would be seen as exceptional cases.
• Beyond 3 standard deviations from the mean, only 0.2% of the population can be found. They are considered exceptionally rare cases.

### Correlation and causality

Our human brains are hard-wired to seek patterns in information. This talent can be useful when we make out the face of a hungry lion crouched nearby in grassland. It can also mislead us when we think we see faces on Mars, or in burnt toast.   Humans see faces everywhere because our brains are built to actively seek them out on Mars, bats and sockets.

Faces are so important to humans that our brains have areas dedicated to face detection and recognition.

We also instinctively seek patterns in data - and we sometimes get that wrong too. Let’s say that research shows that the sales of umbrellas in Victoria varied over 10 years. The sales of pepperoni pizzas also varied over the same time, increasing and decreasing in the same way as umbrellas sales did.

Should you “logically” conclude that:

• pizza causes rain?
• rain makes people eat pizza?
• the figures were purely coincidental and one did not cause the other?
• there is a third factor that has not been identified?

O Often when people see a correlation – similar trends in two data sets – they assume that there must be a causal connection and one factor causes the other.

This is not always true. Only further research can tell whether one factor in a correlation actually caused a change in the other. “Having a higher income leads to better health.” Discuss the merits of the causality in this hypothesis.

A data manager can calculate correlation between data sets (using Excel’s CORREL function) but a wise manager will understand that the correlation value will not indicate which (if either) is the cause and which is the effect. Here are some examples of potentially-dangerous cause and effect reasoning:

• Observation: Schools that introduced sex education classes soon had increased reports of sexually transmitted diseases.
• Reasoning: Sex education makes students more sexually active.
• Observation: Prime Minister X came into power and war broke out soon after.
• Reasoning: Prime Minister X caused the war to start. Propose valid alternative interpretations for these observations.