When dealing with large data sets, it’s important to ensure that the data can be accessed correctly. Failure to address this issue early on in database development can lead to problems when later attempting to extract information from the data itself. This was highlighted recently during a count for London’s Mayoral election on 5 May, when staff at the Electoral Commission had to ‘manually query a bug-stricken database’ which delayed the result by ‘several hours’. These kinds of problems aren’t helpful in terms of furthering the role of computerisation of such tasks, and in particular for data companies. At present, electronic voting machines are not currently used by UK voters, however, counting software was used for the London Mayoral and Assembly elections. In the article, electoral law expert Prof. Bob Watt of Buckingham University expressed concerns about such tasks being undertaken digitally; he said: ‘The bigger issue is of course just how stable these machines are and that’s something that I have a great deal of worry about and have had for some time’. As you can see, it’s important that companies dealing with data get it right. The BBC article on this story doesn’t go into any specific details on why the data was not accessible, so it’s difficult to offer any kind of assessment - however, when dealing with such large sets of data involving tens of thousands of rows, it’s important that the right people have access to the information when they need it. When storing data in a relational database management system (RDBMS), data is contained in multiple tables which are related to each other through one or more common factors. Using the example above, a table may contain information relating to the voter in a table named ‘voter’, such as id, name and email_address. This then may be linked to another table containing contact information for the voter such as their address, telephone and other related information. We may then have a table relating to the candidates of the election, and information relating to them which is necessary for The Electoral Commission to efficiently track the vote. In order to access the information across two tables, we would use a join. A join is used to examine information, not just across two tables, but across multiple tables (depending on system capabilities) and is an integral part of querying data. Each table in a database should contain a primary key which ensures that each row is unique and prevents duplication of redundant data. In addition, a table can contain a foreign key which references another table. For example, by connecting a table containing voter information using voter_id and candidate_id, we could find out which individuals voted for a specific candidate. In this instance, the candidate_id would be the foreign key in our voters table, as it would reference a different table containing information on the candidate. To perform a join in SQL, we need to create a statement which references which information we require, followed by the tables we want to join and how we want to join them together to provide the results we need. To fully understand how joins work, we need to create two tables and some data within them. But first, we will briefly go over the different types of joins. The default join is an inner join; this statement is used to return results where the data in each table matches. The next type of join we will look at is a left join. This statement is used to return all the data from the left table and only the matching data in the right table. The next table is a right join; this is the opposite to a left join. This statement is used to return all of the data from the right table and only the matching data in the left table. The final join is a full outer join; this statement is used to return all the data in both tables. Let’s have a look at each in join in more detail. Firstly, we will examine the data contained in our tables.Table examples:votercandidate Inner JoinThe first join we spoke of was the inner join. The inner join is used to return matching data in each table. Using the statement below on the tables we have created, we can display all of the people that voted for a candidate. SELECT voter.voter_id,voter.forename,candidate.candidate_id,candidate.forenameFROM voterINNER JOIN candidateON voter.candidate_id=candidate.candidate_idORDER BY voter.voter_id;This table shows the results from the above statement:The results from the inner join show us that only five rows from the voter table have matches in the candidate table.
The above venn diagram shows us how the tables are connected when we join two tables on an inner join. In this instance we only get the data which matches in both tables. Left JoinThe next join statement is the left join. This will provide us with all the data from the left table and only the connecting data in the right table. SELECT voter.voter_id,voter.forename,candidate.candidate_id,candidate.forenameFROM voterLEFT JOIN candidateON voter.candidate_id=candidate.candidate_idORDER BY voter.voter_id; Results table from the above statement:The left join displays all of the data in the left table, which in this statement is voter, and displays data of the connected table. If there is no data available for a row, then null is added to that row; this is what has happened in row 6 as Liam didn't choose a candidate.
This venn diagram shows us how a left join is used. All of the data contained in voter is returned as is the matching data in candidate. Right Join As you may have guessed, the next statement we will look at is the right join. This statement will result in the right table providing us with all the data it holds and then only displaying the data that is connected to it from the left table. SELECT voter.voter_id,voter.forename,candidate.candidate_id,candidate.forenameFROM voterRIGHT JOIN candidateON voter.candidate_id=candidate.candidate_idORDER BY candidate.candidate_id; Results from the above statement:The SQL statement has been tweaked slightly. The right join statement has the order by changed to candidate.canidate_id; this is to make the results more readable. The right join statement has displayed all of the data available in the right table, which was candidate, and now only shows the connecting data in the left table (voter), and again if there isn’t any data in the left table that connected to the right table then null is added to that row.
This right join venn diagram shows us the opposite to a left join, and displays all of the data in candidate and the connecting data in the voter table. Full Outer JoinThe last statement that will be explained now is a full outer join. This statement brings in all the data from both tables and connects the results together where it can.SELECT voter.voter_id,voter.forename,candidate.candidate_id,candidate.forenameFROM voterFULL OUTER JOIN candidateON voter.candidate_id=candidate.candidate_idORDER BY voter.voter_id; Results from the above statement:With a full outer join, the results are from both tables, and results will be matched together. Also, if there is no matching data, the field will again contain a null value.
And this is the last venn diagram. This shows us a full outer join and displays all the data in both tables; it attempts to match the data together where it can. As you can see from the examples above, we have managed to join the data from two individual tables and link them together in a variety of different ways to provide the results we required. The ability to join tables is a fundamental aspect of SQL and a common feature of the role of a database developer. Sources: http://www.bbc.co.uk/news/technology-36558446