title

Write to Mark Kelly

Data Validation

by Mark Kelly, vcedata.com

 


 

Data Validation

  • Validation methods
  • Manual (performed by human intelligence)
  • Electronic (done by programming)
  • Validation Types
  • existence check (has a value been entered?)
  • type check (does the type of data match the type required by the question?)
  • range check (does the data match one of our criteria or sit within our limits? e.g. yes, no, sometimes / January-December / under 13)
  • The step taken after data input.
  • Ensures data are reasonable and complete.
  • Cannot check for accuracy of data!
  • Does not apply to information (output)!
  • Helps prevents processing of bad data to create bad information...
    • GIGO = garbage in, garbage out

 

  • You run a kindergarten. You receive some filled-in forms for prospective enrolments.
    • #1 says the person was born in England 3 years ago.
    • #2 says the person was born in Mordor in the year 1592.
    • #3 does not say when the child was born.
  • Validation can tell #1 is reasonable
    • England - the country does exist
    • people do come from England
  • But the data still may be completely incorrect
    • the student may not be from England.

#1

  • England does exist = example of range checking
  • Child is 3 years old example of range checking (47-year-old preschoolers may be a problem)
  • Range checking = Does a value...
    • exist in a list of approved options?
      • e.g. Is your preferred hand: left? right? either?
    • fit within an approved range of values?

Example

  • Validation can tell form #2 is unreasonable and therefore invalid
    • Mordor does not exist
    • people (or orcs) do not come from Mordor
  • Kellyrule 2047b - Do not process invalid data
    • or let orcs immigrate

Range checking

Range checking

  • Is an input datum within an approved range of values?
    • “nice weather” = 20C – 25C (inclusive)
    • people allowed in pub are 18+
    • Old age pension for people >= 65 years

Be careful of ranges in Software Development!

Old age pension for people >= 65 years

Do ranges include or exclude the limit values?

> vs >= < vs <=

SD exam questions often test validation range limits. e.g. legal drinking age...

 

 

Is the legal drinking age (in Victoria, Australia in 2022)

  • > 18 years
  • >= 18 years
  • = 18 years
  • <= 18 years

Answer

  • >= 18 years
  • SD exams often test subtle range limits.

     

Validity? Testing?

  • Validate input data to weed out obvious rubbish that is not even worth processing.
  • Test output information to check its accuracy, relevance, completeness, clarity etc.
  • VALIDITY ≠ CORRECTNESS or ACCURACY
  • TESTING ≠ validation

Validity ?

  • Valid data may be completely incorrect.
  • VALIDITY ≠ CORRECTNESS or ACCURACY

The rules of email addresses easily determine validitybut not accuracy

Validation rules – OK?

  • Validation is simple
  • Can reliably be based on rules
  • Rules can be programmed
    • e.g. into databases, spreadsheets
  • = electronic validation

Electronic Validation

  • Databases/spreadsheets allow easy & automatic data validation based on simple rules:

Validation rules in Microsoft Access

has any value been entered? e.g. not worth cooking a pizza to be delivered if no address has been given in the order.
  • has any value been entered? e.g. not worth cooking a pizza to be delivered if no address has been given in the order.

Electronic Validation

Validation Rule 2 : type

  • type: is input data...?
      • text (e.g. a real name),
      • numeric (number of orcs immigrating today), date (of birth),
      • email address (alphanumeric string + “@” + alphanumeric string + DOT + valid top-level-domain (e.g. COM, ORG, SPACE)
      • binary – consists of only characters 1 and 0
      • Australian postcode – 4 digits, zero to nine
      • alphanumeric – mixture of letters and digits, e.g. credit card number
      • JPG – a valid JPEG picture in JPG format
  • type: is input data text (e.g. a real name),
    • Tesla chief executive Elon Musk, he may need to rethink the name of his new baby boy, X Æ A-12.
    • Musk announced that he and his girlfriend, singer Grimes, had welcomed a baby son and ... identified the second character as “ash” — saying it was pronounced X-Ash-A-12.
    • But California, where Musk lives, allows only the 26 English letters of the alphabet to be used on birth certificates, along with a limited list of special characters including apostrophes, hyphens and periods. That means no numbers or Æ symbols.
    • https://www.washingtonpost.com/technology/2020/05/08/musk-grimes-baby-name/

  • is input data in an approved list, or within acceptable limits
      • text (e.g. a name),
      • numeric (number of orcs immigrating today), date (of birth),
      • alphanumeric – a combination of text and numeric, often in a specific pattern, e.g. SMY34 (three alphabetic then two numeric characters)
        • it could be more rigorous, e.g. the alpha characters must be uppercase, and the digits must not be 1 or 0 (to avoid misreading letters I and O)

RULE 3 : RANGE

  • is input data in an approved list, or within acceptable limits
      • email address (alphanumeric string + “@” + alphanumeric string + DOT + valid top-level-domain (e.g. COM, ORG, SPACE)
      • an invented code (e.g. PY + year of manufacture + model code + series number) e.g. IFC2020X103b

RULE 3 : RANGE

Why validate?

  • Strips out worthless data that only takes up storage/communication space
  • Saves processing time processing unworthy inputs
  • Prevents rubbish data leading to rubbish information that threatens the integrity (reliability, trustworthiness) of output

Validity

  • Data may be valid but incorrect or inaccurate.
  • A form saying a kindergarten enrolee is 45.67 years old is accurate but not reasonable.
  • Sometimes ‘unreasonable’ data needs further verification...
  • e.g. stress does not cause stomach ulcers
  • tossing a coin 20 times and getting 19 heads

To review - What’s ‘reasonable’?

  • The main validation checks:
    • Type check
    • Range check
    • Existence check (or presence check)

Type check

  • A value should be of the right type
    • Names should be text
    • Dates should be dd/mm/yyyy format (in Oz)
      • some type checks are regional and may need to be extended to accommodate other regions e.g. an Oz site accepting USA zipcodes
    • Number of pets should be numeric
    • A photo field should contain a JPG image rather than a spreadsheet
    • A credit card number field rejects the letter “O” instead of a zero

Range check

  • Data should be within a certain range
    • Kindergarten students’ ages should be between 3 and 6
    • Number of pets should be >= 0
  • Data must exist in a limited list of options
    • State of residence must be in (Vic, SA, Qld, WA, ACT, Tas, NT).
    • Sex must be M or F.
    • Gender must be M or F.
    • The definition and range of some lists is fluid and needs ongoing maintenance by database managers over time.
  • Some data must be unique
    • e.g. a phone number, customer ID
  • Data must be of a minimum, maximum or fixed length
    • e.g. credit card number has 16 digits
  • Data must have a given format
    • E.g. user ID must have 3 letters and 2 digits
    • Date must be in dd-mm-yyyy format
    • credit card numbers

Existence check

  • Is the data present?
  • Some data absences are OK, others are not
    • with pizza delivery orders, leaving off the customer’s name may not be vital, but leaving off the address is.
  • Database existence checks can reject records that lack vital data.

Warning!

  • Don’t make validation so strict that valid data are rejected.
  • Data can be unexpected but accurate!
    • kindergarten child could be 8 years old in rare circumstances
    • mandating the entry of a phone number would wrongly reject people with no phone
  • Validation should reject absurd data, not slightly odd or unexpected data

Bad validation

  • American websites that insist on a 5 digit zip code (postcode) – even if you’re not American
  • Databases that reject addresses with no street/ Avenue/Road/Lane (etc) value: some addresses don’t have them. (e.g. Broadway)
  • Testing for existence in a limited list when the list is not actually limited.
  • e.g. Title must be one of: Mr, Miss, Mrs,

    Dr, Sir, Madam,

    Señorita, Professor

Bad validation

Don‘t force people to invent or change true data to suit your inadequate validation rules!

e.g. Having to change their address from ‘123 Broadway’ to ‘123 Broadway St’ because your database’s StreetType field is marked Required.

You would be damaging your own data’s integrity.

Manual & Electronic

  • Some data needs a human’s common sense to detect unreasonable entries
    • Spelling of names
    • Entry of Adam Sandler in a database of talented actors
    • “Jane Smith” registered as a male – some “common sense” validation may change over time
  • Other checks are better done electronically
    • It’s faster, more accurate to get a database to check for missing or alphabetical ID number values
  • Tip – never try to validate people’s names using a limited list. The list is infinitely long –
    • Janine
    • Jueaneene
    • Janeen
    • Jenene
    • Janeene
    • Jeeneen
    • X Æ A-12
    • etc

Some validation techniques

Check digits

  • Used for numerical data. An extra digit is added to a number – the digit is calculated from the other digits.
  • The computer checks this calculation when data are entered, e.g., The ISBN for a book. The last digit is a check digit calculated using a modulus 11 method.
  • Detects mis-typing or fake entries.

Credit card number validation

Credit card numbers are self-validating and internally consistent

  • Fake or mistyped numbers are identifiable from the number alone
  • First digit identifies the card issuer (e.g. Visa = 4)
  • Depending on the issuer certain other digits identify the issuing bank (e.g. Commonwealth)
  • Other specific digits give the account number
  • A check digit, calculated from other digits*, is used to check the validity of the entire card number. If the check digit is not right, the card number is invalid.
  • The 3-digit CVC number on the back of the card is another card validation strategy.

*using the Luhn algorithm

Some validation techniques

Spelling and grammar check

  • Looks for spelling and grammatical errors.
  • Can be electronic, but needs human common sense too, especially with regional spellings and grammar.
    • Never electronically spell-check people’s names.
    • Place names can be electronically checked.

Consistency Check

  • Checks fields to ensure that data in different fields agree with each other, e.g.
    • If Title = "Mr" then Sex* should be "M" (again, time may change some rules)
    • If employed = “N” then employer_name field should be empty

Prevention is better than cure

Better it is to prevent the input of invalid data than it be to deal with it later. - Wallace Shakespeare

It’s far better to stop invalid data getting into a dataset in the first place.

But this safeguard comes before validation – it is not validation in itself...

in spite of what some past VCAA exams have tried to assert

Preventing invalid data entry

Drop-down menus or lists

  • Force data to be entered from a pre-made selection
    • States of Australia
    • Months of the year

Once again – do not use menus or lists to enter data that is not strictly limited in scope. You don’t want people entering incorrect data because your data entry choice forced them to!

Helping users enter valid data

  • Give examples or advice about expected data formats…

Helping users

  • Make data entry rules clear…

This form clearly indicates which fields are required (*) and which are optional

Use a calendar control

  • Date cannot be invalid since it must be chosen from a set of valid dates
  • It can still be inaccurate however!

Calendar control helps organisations

  • Date data is not ambiguous
    • Does 02/05/06 entered mean:
      • 2 May 2006?
      • 5 Feb 2006?
      • 6 May 2002?

A bad date could be prevented with good validation

Calendar control helps users

  • User does not have to guess correct date format expected
    • 23/08/09?
    • 23-08-2009?
    • 23 Aug 09?
    • 8/23/09 (US format)?
    • 2009-08-23 Asian [and my favourite] format)?

Radio buttons

  • Use these wisely.

    RADIO BUTTONS force users to choose one option (and no more than one) from a list.

Radio buttons

If the option is yes/no, consider a switch / toggle control instead

or a single checkbox

Checkboxes (tick boxes)

  • Use these wisely.
  • CHECKBOXES (tickboxes) – allow users to choose none, one, or many listed items

Tip: start to reconsider your data entry controls if the list of items gets large

Tip: grey-out (disable) options that are not currently available

Tip: if you’re not sure if your list is indeed 100% limited, offer an “Other” textbox option

 

That's it. You can go now. Have fun.

 

Write to Mark Kelly

All original content copyright © vcedata.com
All rights reserved.

This page was created on 2022-04-28 @ 11:59 AM
Last modified on Thursday 28 April, 2022 12:16