For a course called "Data Analytics" you'd expect a key knowledge dotpoint on statistics...
Here's some stuff anyway, since I already wrote it for an old textbook.
Identifying patterns and relationships between data
Science and business often need to find meaning in large data sets. Statistics lets data managers summarise huge amounts of raw data into small, informative, meaningful summaries. You are not expected to carry out very complex statistics when examining your hypothesis in U3O2, but an understanding of basic statistical concepts is essential when manipulating data, and reaching sensible conclusions.
All of the statistics discussed below can easily and automatically be calculated with a spreadsheet.
You will find it useful to know these statistical concepts:
Hiroyuki, a sushi restaurant manager, wants to monitor the performance of his chefs, Bruno and Eva. He puts each chef on duty for one week, and counts the numbers of diners each day.
What conclusions can Hiroyuki draw from these raw figures? Not a lot. But an average number of diners would be informative. There are 3 methods of calculating averages.
When most people talk about an ‘average’, they usually refer to an arithmetic mean.
Excel’s AVERAGE( ) function calculates a mean.
The mean is the sum of the data, divided by the number of data items.
The median is the “middle” number when the data are sorted.
As you can see, the mean and median can be very different, and can lead to very different interpretations. Choosing the most appropriate type of average is important. It is easy to misrepresent data with an inappropriate choice of average.
The mode is the value that occurs most often. It is sometimes (but not always) the most logical and representative average to use. For example:
Four people are chosen at random and asked how many children they have. The answers are 2, 2, 2, 16. What is the logical average? The mode, which is 2.
The “outlier” (16) – a figure that lies so far out from the rest of the data that it is not at all typical or representative of the majority – is ignored.
An outlier can be caused by experimental error or it might be a truly remarkable exception to the general population. It might be worthy of investigation in its own right, but it is not typical. A spreadsheet like Excel has a function called TRIMMEAN which will ignore outliers when calculating a mean, but you have to tell it when to start ignoring values.
Using the mean (5.5) would be misleading because it would not come even close to the truth for any of the surveyed people. Each type of average is logical and accurate under different conditions, but they often will not agree with each other.
The annual incomes of a representative sample of citizens of The Democratic Republic of Informatica are collected:
100, 100, 100, 100, 50, 120, 30, 200000.
You (as Supreme Ruler For Life of Informatica) want the world to believe that your loyal subjects are all well paid. What average method would you choose, and why?
An impartial observer wants to reflect the truth of the poverty of most Informatican citizens. What average method would she use, and why?
DO THIS: Use a spreadsheet to enter the sample data shown and use the AVERAGE( ), MODE( ) and MEDIAN( ) functions to see the differences..
If Bruno’s mean number of customers is 16.42 and Eva’s is 14.28 – is that a difference that the manager should pay attention to, or could it have happened purely by chance? A difference between data sets does not always mean anything important. A significance test can tell you when a difference is important, e.g. a “t-test” can be easily done with a spreadsheet.
Published conclusions about hypotheses often include statements like ‘p < 0.05’. This is a significance measure and means that the probability of the result being purely due to chance is less than 5%. The lower the number, the more confident you can be that the hypothesis is supported.
Data managers often need to know how much variety is in data. For example, if diner counts in week 1 were 8, 7, 9, 8, 7, 9, 8 (mean = 8), it’s obvious that all the data are very consistent and differ very little from the mean. A value of 33 would be very unusual.
On the other hand, if the figures for week 2 were 1, 33, 2, 1, 0, 3, 16, the mean is also 8 but the data are very inconsistent and none of them come anywhere near the value of the mean.
Excel calculates the standard deviation of week 1 figures as 0.816.
Excel uses the STDEV( ) function to calculate standard deviations.
Using the mean and standard deviation we can work out whether any value in a data set could be considered “unusual” or special. This rule is used:
For example, thousands of 18 year-old Australian men’s heights are measured. The mean height is 175cm. The standard deviation is calculated to be 15cm. In English, this means “The average bloke is 175cm, give or take 15cm.”
The classic normal distribution curve looks like this. M
Figure 26 – a normal distributionMost data are close to the mean, and progressively fewer and fewer of them vary greatly from the mean. The height of the curve represents the number of men with each height, so the highest point on the curve occurs at the mean value, 175cm.
Above - Normal distribution of heights.
Traditionally the Greek letter δ (lowercase sigma) represents “standard deviation” and μ (lowercase mu) represents the mean.
Correlation and causality
Our human brains are hard-wired to seek patterns in information. This talent can be useful when we make out the face of a hungry lion crouched nearby in grassland. It can also mislead us when we think we see faces on Mars, or in burnt toast.
Faces are so important to humans that our brains have areas dedicated to face detection and recognition.
We also instinctively seek patterns in data - and we sometimes get that wrong too. Let’s say that research shows that the sales of umbrellas in Victoria varied over 10 years. The sales of pepperoni pizzas also varied over the same time, increasing and decreasing in the same way as umbrellas sales did.
Should you “logically” conclude that:
Often when people see a correlation – similar trends in two data sets – they assume that there must be a causal connection and one factor causes the other.
This is not always true. Only further research can tell whether one factor in a correlation actually caused a change in the other.
TIP:Don’t misread “causal” as “casual”
“Having a higher income leads to better health.” Discuss the merits of the causality in this hypothesis.
A data manager can calculate correlation between data sets (using Excel’s CORREL function) but a wise manager will understand that the correlation value will not indicate which (if either) is the cause and which is the effect. Here are some examples of potentially-dangerous cause and effect reasoning:
Propose valid alternative interpretations for these observations.
All original content copyright ©
This page was created on 2022-07-17