Tuesday, April 17, 2007

DB Optimization Assignment

MRP Database

Complex and Simple Query Division

Below is our division of simple and complex queries. To make things more readable, we have included only the complex queries in this paper (Appendix A). The simple queries are all other queries not included in Appendix A.

In our division, we have mostly looked at typecasting, multiple joins, and nested queries. Queries which included any of these three elements were classified as complex.

Indexes

In general, indexes should be added to optimize the queries in two cases. First, indexes should be added whenever fields are used in the WHERE clause, especially to compute JOINs. Second, indexes should be added when fields are computed outside of the WHERE clause. An example would be SELECT Max(X) where the maximum function works in the select clause. The index on field X should speed things up. We assume that primary keys are indexed automatically, as this happened during our importing of the data into MySQL.

In accordance with the guidelines, the following indices should be added:

  • Activity.campID
  • Double index (order and reverse order) on adjacentcampground.groundID and adjacentcampground.adjacentTo (since this is a linking or associate table)
  • Agentstake.stakeID (this does not seem to be a part of composite primary key)
  • Agentstake.campID
  • Amenity.groundID
  • Announcement.campID
  • Double index (order and reverse order) on assignedto.stakeID and assignedto.campID (since this is a linking or associate table)
  • Campground.campID
  • Camppicture.campID
  • Camprule.catID
  • Camprulecategory.catID
  • Campsite.groundID
  • Made codes.code the PK
  • Direction.campID
  • Facility.campID
  • Made glossary.term the PK
  • Reservation.stake
  • Reservation.ward
  • Reservation.campID
  • Reservationdate.reservedate
  • Reservationdate.reserveID
  • Double index (order and reverse order) on reservationforcampsite.siteID and reservationforcampsite.reserveID (since this is a linking or associate table)
  • Ruletable.catID
  • Unavailabledate.campID
  • Made usertable.userID the PK
  • Double index (order and reverse order) on usercamp.campID and usercamp.userID (since this is a linking or associate table)
  • Made ward.wardID the PK
  • Ward.stakeID

After careful consideration, we selected the following complex queries to optimize:

Query Optimization 1

Before

SELECT announceID, announceTitle, announceText,

CAST(MONTH(annBeginDate) AS varchar) + '/' +

CAST(DAY(annBeginDate) AS varchar) + '/' +

CAST(YEAR(annBeginDate) AS varchar) annBeginDate,

CAST(MONTH(annEndDate) AS varchar) + '/' +

CAST(DAY(annEndDate) AS varchar) + '/' +

CAST(YEAR(annEndDate) AS varchar) annEndDate

FROM Announcement

WHERE campID='cid'

After

SELECT announceID, announceTitle, announceText, annBeginDate, annEndDate

FROM announcement

WHERE camped = 'cid'


This query suffers from several unnecessary complexities. First of all, casting the different parts of a date as a string does not make sense in this case, as it's reassembled in the exact format of the datetime data type. Thus, all the CASTing can be removed to improve query speed. Additionally, an index on Announcement.campID should be added.

This cuts the number of rows returned from several (e.g. 21 for campID=11) to 1. With the casting, we couldn't get the query to run at all despite an hour of research. After the casting was removed, the query took 0.0224 seconds uncached, 0.0003 cached. After the index was added, the query took .0011 seconds uncached and .0003 uncached. Since this table does not change a lot, this optimization actually does not help the database performance tremendously, because after the query is cached, the time to execute it is the same. The optimization does, however, show the principles of proper query design.

Query Optimization 2

Before

SELECT r.reservid, r.reservedBy, r.stake, r.ward, r.groupType,

CAST(r.startDate AS DateTime) AS arrival,

CAST(r.endDate AS DateTime) AS departure,

(SELECT stakeName FROM Stake s

WHERE s.stakeID=r.stake) AS stakeName,

(SELECT wardName FROM Ward w

WHERE CAST(w.wardID AS NVARCHAR)=r.ward) AS wardName

FROM Reservation r

INNER JOIN ReservationForCampsite rfc ON (r.reservid=rfc.reservid)

WHERE ((CAST(r.startDate AS DateTime) >= 'firstDate' AND

CAST(r.startDate AS DateTime) <= 'lastDate') OR

(CAST(r.endDate AS DateTime) >= 'firstDate' AND

CAST(r.endDate AS DateTime) <= 'lastDate')) AND r.campID=cid AND rfc.siteID=sid

ORDER BY departure

After

SELECT r.reservid, r.reservedBy, r.stake, r.ward, r.groupType, r.startDate AS arrival, r.endDate AS departure,

(SELECT stakeName FROM Stake s WHERE s.stakeID = r.stake ) AS stakeName,

(SELECT wardName FROM Ward w WHERE w.wardID = r.ward) AS wardName

FROM Reservation r, ReservationForCampsite rfc

WHERE r.reservid = rfc.reservid AND

((r.startDate >= 'firstDate' AND r.startDate <= 'lastDate') OR (r.endDate >= 'firstDate' AND r.endDate <= 'lastDate'))

AND r.campID=cid AND rfc.siteID=sid

ORDER BY departure


To begin the optimization, we took out the casting and changed all the varchars to datetimes, which is what we think they should be in the first place. Again, we could not get the original query to execute, so we don't know exactly how much time was gained through this change. Having seen optimization in practice before, however, we feel confident this change speeds up the query. We then changed the ward field by first replacing "all" with null and then changing the field type to int to remove the need to cast w.wardid as a varchar (now null became a special value to mean "all").

We indexed reservation.ward, changed reservation.stake to int(11) from varchar, then indexed reservation.stake as well. In the reservationforcampsite, an association (linking) table, we created two indexes for the two fields, rfc.siteID and rfc.reserveID. The indexes contained both fields, each index in different order.

When the query was first run, before the indexing, it took .0392 seconds uncached and .0004 seconds cached. After the indexing the query took .0030 seconds uncached and .0004 seconds cached.

Query Optimization 3

Before

SELECT s.siteID, s.siteName, s.siteCapacity, g.groundName, r.reservID

FROM Campsite s

INNER JOIN Campground g ON (s.groundID=g.groundID)

INNER JOIN Camp c ON (g.campID=c.campID)

LEFT JOIN ReservationForCampsite r ON (r.reservID=" + editMode AND r.siteID=s.siteID)

WHERE c.campID=cid AND c.campID=g.campID AND g.groundID=s.groundID

ORDER BY g.groundName, s.siteName

After

SELECT s.siteID, s.siteName, s.siteCapacity, g.groundName, r.reservID

FROM Campsite s, Campground g, Camp c, ReservationForCampsite r

WHERE c.campID=cid AND c.campID=g.campID AND g.groundID=s.groundID AND r.siteID = s.siteID

ORDER BY g.groundName, s.siteName


Before we started on the optimization, we decided to remove the mysterious 'r.reservID=" + editMode AND' from the FROM clause. This would actually cause the query to break, as the lone double quote would completely throw it off. So, part of the optimizing sometimes involves getting things to work. After this small change, we ran the query, which completed in .0473 seconds uncached and .0003 cached.

The next step involved removing the inner joins from the FROM clause, as the data is joined in the WHERE clause anyway. This improvement alone brought the query execution time to .0024 seconds uncached, and .0003 cached.

Next, we created indexes on campground.campID, campsite.groundID. There were already indexes in the reservationforcampsite table, so we didn't need to create them. All these changes brought down the query execution time to .0016 uncached, and .0003 cached.

Database Schema Changes

The main changes to the schema we recommend are data type changes. The data types in the beginning database are often not ideal for the data they store. In addition to the datatype changes already made for the query optimization, the following datatype changes should be made:

  • Reservationdate.reservdate from varchar to date
  • Unavailabledate.unavailDate from varchar to date
  • Reservation.startDate and endDate from varchar to date

Additionally, the following changes can be made to improve the schema:

  • The tables priority schedule and nextpriorityschedule should be combined, because the data would not lose any integrity, and logically nextpriorityschedule is simply a continuation of priorityschedule. Basically, currently we have horizontal fragmentation of a table with 12 rows, and that's just cruel.

Changes vs. Improvements

Whether the changes to the database are actually beneficial needs to be carefully measured. Specifically, changes made need to increase the efficiency of the database as measured by the time queries take to complete. If a change to a table increases the speed of the queries operating on that table, it is a good indication that the changes are really improvements.

We have sampled the complex queries optimized above and they all yielded improvements. Even though we have not measured every single index and datatype change, we have no reason to suspect they would yield a different result.

Scriptures Database

Index Changes

Given the database schema and the queries, there are several unnecessary indexes. Indexes are meant to speed up queries, and they shouldn't exist if they don't serve that purpose, because indexes also slow down transactions during updates. This is the underlying motivation for the JODDB index changes.

The general recommendation is to remove all indexes on fields which are not in the WHERE clauses or computation functions of queries. For example, the 'speaker' table contains three indexes which are not tied to any WHERE clauses or computation functions of queries. Therefore, each of these indexes could be removed. Additionally, the "LastName_2" is an index of two columns, which can also be used as an index for just one of the columns, making the "LastName" index unnecessary. Without listing every single index, suffice it to say that these two general tests ((1)indexes used in WHERE clauses and (2) index redundancy) should be implemented across the whole database.

Recommendations

The recommendations above to delete unnecessary indexes would actually increase database performance for updates as well as decrease database size, both of which are part of optimization.

No comments: