Wednesday, January 30, 2013

Designing Your First Table

Till now, I have been using the schema someone before you have created. Its time to do a simple DB design for yourself. Lets suppose we have to make a simple DB design for an employee attendance tracking application. So application need to save the details of employee and daily attendance timing. 
While DB designing anything, the following steps are usually
  • Understand the system
  • Identify the key entities and what all information need to be saved
  • Define the first schema
    • Try with other possible structures if required
    • Re-define based on expected usage and behaviour of application
  • Add indexes based on expected usage

Understanding the System

We can assume that the attendance is tracked using swiping machines which are located at the entrance and the exit. An employee can swipe in and swipe out any number of times on a given day. A users first swipe-in and last swipe-out will be used to make sure he was present. He can be present(8+hours), half day(4-8hours) or leave based on his availability at office on a given day.


Identifying the key entities

So here most important entity is employee. Each employee has name, email, gender, and ID card number (Lets bother only about these many information as of now). Now we have swipe in and out. There are two approaches here - 1. considering swiping in/out as entity or 2. attendance itself as an entity.  Lets us consider both cases, we are saving all swipe ins and outs plus direct attendance information. If swipe-in and swipe-out will have a time associated with it. Attendance will have first swipe-in and last-swipe out along with derived information of hours at office and his presence type. 

Defining First Schema

Lets define for each entity the attributes possible and their data types

Employees
id - Int - Auto-increment id generated in the system(internally we will refer any employee using id)
name - string - user name is string with max size as say 100 characters
ID card-number - string - Each employee has a unique card number, but once the employee exits the company the same card number can be allocated to a different employee
gender - enum - an employee can be male or female(we can assume be default employees are female)
email - string - optional and can have max of 50 characters

Swipe Details
id - Int - Auto-increment id generated in the system
time - timestamp, we can gracefully assume that time at which the entry is made is same as swipe time. (Default  value has to be now())
swipe_action - enum - swipe in or swipe out
card_number - string - card number mentioned in the card
employee_id - employee id for whom the card belongs (captured from employee table)

Attendance Register
id - Int - Auto-increment id generated in the system
swipe_in_time - timestamp, we can gracefully assume that time at which the entry is made is same as swipe-in time. (Default  value has to be now())
swipe_out_time - timestamp - time of last swipe-out
card_number - string - card number mentioned in the card
employee_id - employee id for whom the card belongs (captured from employee table)
time_at_office - time - time for which he was present
attendance_type - enum  - can be present, half day or leave

The need to maintain attendance register alone and/or swipe details alone depends on you need. Attendance register cannot give all swipe information, but if you record all swipes, you may look at as junk data. If you look other way round, if you want to store all swipes, attendance_register is redundant data; again based on your application, it might be good to have a register for quick report generation. 

Trade-Offs In Data Type & Column
  • Integer, String and float number selection are quite obvious if you are familiar with some programming language and I dont want to spend much time on the same.
  • Enum Vs TinyInt - You can define an Enum which can also be achieved using integers. If you use integers, it lack readability and also all applications using that column should have a mapping for what each number stands for. But if you use number, its easy to add a new value(To allow a new value to an existing ENUM field, you have to alter table, which might be time consuming if its a big table). In this case itself, suppose an employee came to office on a weekend and left in 3 hours, he will marked as leave and it was decided to introduce a new status saying "holiday" so that. 
  • Datetime Vs Timestamp - As said in earlier post, timestamp is internally saved as a number and is always better than a string base datetime.
  • Text Vs Blob - Both fields saves long text. For now you can take it like this - searchable text field refers to text and in other cases blob(blobs usually used for binary data chunk)
  • Employee Id Vs Card number in attendance register - The question is strictly based on you load. If you have such a huge employees list, on each swipe, it might not be possible to get the employee id from employees table(as it might result in too many hits on you employees table also). If you are expecting huge reports on attendance, you can think of some fetching and updating employee id once in a day to make reports faster. So its all based on your application you have to make the decision.
Some common fields I recommend
As someone said, "only change is  permanent", so is you application. It always good to leave space for future when u design today. 
  • id - Usually for tables we can go for unsigned int. If you are expecting huge data, bigint would be good.(1billion records in DB not that often, so I would say int)
  • Audit fields - In any table, I would always recommend these 4 fields - created_on, created_by, last_modified_on, last_modified_by(wherever possible). That is quick reference log, even if you have audit logs, these are good to have
  • status - In previous post on delete, I said never to DELETE, but to use a status field which is usually by default 1(active). I would recommend it to be tinyint rather than boolean - again for purpose. 
  • additional_info - This column is not available in all my DBs, but I would use it whenever there is a space of some additional info coming. Assume in our employee table we need to add blood group today and then comes request for hobbies next day, then home town and so on. If I am sure, these are just some readonly data and no search will be done on these details(as indexing wont work in this case), I would json_encode and dump the data into that column. This is violation of 1NF and  recommended only if you senior approves this :)
Note: Naming convention for MySQL tables and columns is usually words separated by underscore(my_table, column_name_1) but camel case is also good

Selection Of Indexes

Which all columns should be indexed. In normal case, I would say any column you are expecting a search, add an index. By searching, I include the columns used for join.
I would make any column(s) expected to be unique in the system, be indexed as unique and reduce the pain on application for checking if it already exists or not. Here I would recommend employee card number to be unique in employees table. Also adding a column of type date to attendance_register and making it unique along with employee_id would be good(so one record per employee per day)

Note: 
Indexing on columns like status might not actually serve the purpose. If the search yields more than 30% of data, mysql prefers full table scan to using the index usage
If there are two indexes on column_1 and column_2 and if the search is on both 2 columns, mysql will use one of these or both indexes based on necessity

Now lets see what we have come up










employees

swipe_details

attendance_register


id

id

id


name

swiped_on

swiped_in


card_number

swipe_action

swipe_out


gender

card_number

card_number


email

employee_id

employee_id


status



date


created_on



time_at_office


created_by



attendance_type


modified_on






modified_by






additional_info












Query

So lets now write our first create table query:

create table employees(
id int(11) unsigned not null auto_increment,
name varchar(100) not null,
card_number varchar(50) not null,
gender enum('Male', 'Female') default 'Female',
email varchar(100),
status tinyint(3) unsigned  not null default 1,
created_on timestamp default current_timestamp,
created_by int not null default -1,
modified_on timestamp null,
modified_by int null,
additional_info text,
primary key (id),
index indx_name(name),
unique index indx_card_number(card_number)
);

For the sake of explanation, I would give a brief explanation

Generic Syntax for column specification:
<column_name> <data_type[(<size allowed>)]> [[NOT ]NULL] [DEFAULT <value>]
column_name can be alphanumberic starting with non digit(if you give column name inside ``, you can infact give anything as column name- keywords, quotes, slashes, so be careful
datatype and size defines how much memory it takes. For int fields, size is for only display purpose, doesnot effect actual memory
null - means the column can have null values
default - this defines the default value

Generic Syntax for index specification

<index_type> [index_name](<col1>[, col2[,col3..]])


Now try to create other table for yourself. Feel free to make changes based on the application you can think of.

Note: I have purposefully removed concept of normalization from here feeling its too much for first table and I have got into habit of violating it :)

No comments:

Post a Comment