Introduction

In the real estate world, the key players are brokers, agents, sellers, and buyers. When a seller decides to sell his/her house, he/she has to be represented by an agent. An agent who represents a seller is called a seller agent. The seller agent puts the listing in the Multiple Listings Services (MLS) as an active listing. The active listing becomes a pending listing once the seller has accepted a buyer agent’s offer. A buyer agent represents a buyer. The pending listing becomes a sold listing when the transaction reaches the closed escrow state. All of these states of each real estate transaction can be found in the MLS.
It is easy for prospective buyers to find their desired houses in the MLS. It is not easy for them to answer the question, “How much money should I spend to buy that house?” On the other hand, it is difficult for prospective sellers to answer the question, “How much is my house worth now?” Answering these decision-making questions is the objective of this project.
This project uses data mining as a knowledge discovery technique, which provides valuable, hidden business and scientific “intelligence” from MLS data. Data mining is the process of discovering various models, summaries, and derived values from a given collection of data. In practice, the two primary goals of data mining are prediction and description. Prediction uses some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. The data mining domain of this project is real estate price prediction. The data mining method used by this project is predictive regression, which predicts continuous values using regression.
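Chapter 4 covers the predictive regression calculation in detail, but the idea can be sketched briefly here. The Python code below (the project itself is written in Visual Basic .NET) fits a one-variable linear model to a handful of hypothetical sold listings and predicts a price from square footage; all data points are invented for illustration.

```python
# A sketch of predictive regression on hypothetical data: fit
# price = a + b * square_feet by ordinary least squares.

def fit_linear(xs, ys):
    """Return (intercept, slope) minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x):
    return intercept + slope * x

# Hypothetical sold listings: (square footage, sold price)
sqft = [1200, 1400, 1600, 1800, 2000]
price = [150000, 172000, 198000, 221000, 240000]

a, b = fit_linear(sqft, price)
estimate = predict(a, b, 1500)  # estimated price for a 1500 sq ft house
```

The fitted line is then evaluated at the square footage the user asks about, which is the continuous-value prediction described above.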
Data mining involves running decision-support queries that require a large amount of computing power and summarize thousands of rows. It is difficult to do data mining on an operational production system, such as the MLS, without impacting its performance. As a solution, a data warehouse is built based on MLS data. A data warehouse is a copy of transaction data specifically structured for query and analysis.
A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise manipulated for the purpose of further understanding the data. Building a data warehouse is a prerequisite for mining a large amount of data. There are several tasks involved in building a data warehouse: extraction, transformation and cleansing, modeling, and transport. A large amount of data is extracted from the MLS operational production system to a staging area. Field selection, data transformation, and cleansing of user entry errors are performed at the staging area. A data warehouse schema is designed and created on a database management system, such as Oracle. There are several data warehouse schemas, such as the star schema, the snowflake schema, et cetera. Once the data warehouse is created, data is loaded, or transported, to the warehouse.
This project uses Visual Basic .NET to write code for transformation and cleansing, for loading data into the warehouse, and for data mining. This project uses Oracle data warehousing tools to create the data warehouse. Loading data into the warehouse uses the Oracle Provider for OLE DB (OraOLEDB) component, which establishes a connection between the .NET framework and the Oracle database. Once the connection is established, queries generated from user criteria are passed to the warehouse. As a final step, a predictive regression calculation computes the result of the queries. To present the result, a number appears on the screen that estimates today’s price for the particular type of property the user is interested in.
There are two steps to achieving the data mining objectives of this project: building a data warehouse, and implementing a data mining method appropriate to the data and the objective. What a data warehouse is and how to build one are discussed in chapter 2. As an implementation of chapter 2, the MASTERDW data warehouse is built. Chapter 3 discusses the MASTERDW data warehouse design and implementation. This project uses predictive regression as the data mining method. What predictive regression is and its related calculations are discussed in chapter 4. The real estate price prediction code is the implementation of chapter 4 and can be found in appendix J. The user interface and the results of the real estate price prediction code are discussed in chapter 5. Chapter 6 is the conclusion of this project.
Data Warehouse Methodology

Traditionally, analysts have performed the task of extracting useful information from recorded data. However, the increasing volume of data in modern business and science calls for computer-based approaches. As data sets have grown in size and complexity, there has been an inevitable shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of computers, networks, and sensors have made data collection and organization an almost effortless task. However, the captured data needs to be converted into information and knowledge to become useful.
A data warehouse is a prerequisite for mining large amounts of data. A primary goal of a data warehouse is to increase the “intelligence” of a decision process and the knowledge of the people involved in that process. For example, the ability of product marketing executives to look at multiple dimensions of a product’s sales performance – by region, by type of sale, by customer demographics – may enable better promotional efforts, increased production, or new decisions in product inventory and distribution.
A data warehouse is a collection of data gathered and organized so that it can easily be analyzed, extracted, synthesized, and otherwise manipulated for the purpose of further understanding the data. A data warehouse contains a wide variety of data used for decision support and analytical processing. The function of the data warehouse is to store the historical data of an organization in an integrated manner that reflects the various facets of the organization and its business. The data in a warehouse are never updated but are used only to respond to queries from decision-makers or data miners. Typically, data warehouses are huge, storing billions of records.
There are several steps involved in building a data warehouse:
2.1. Extraction
The first step of building a data warehouse is to obtain data directly from the operational source. The source of the data needs to be documented. This task involves not only identifying the databases and files containing the data of interest, but also analyzing and documenting the business meaning of the data, the data relationships, and the business rules.
Data extraction programs and tools either extract all (or a subset of) the data in the source system, or extract the changes made to the source data by operational systems as they occur. In the former case, the extraction process typically uses either an unload utility or data manipulation language statements to extract the required source data. In the latter case, a recovery log or a database trigger is used to extract the changes into an intermediate file or database for processing by the data transformation subsystem. After the data is extracted, it is copied to a temporary location where it will be cleansed, transformed, and prepared for the warehouse. This temporary location is usually called the staging area.
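As a small illustration of the former (full-extract) case, the following Python sketch copies only the fields of interest from a “|”-delimited source into a staging copy. The field positions and file contents are hypothetical, and in-memory buffers stand in for the actual source and staging files.

```python
import csv
import io

# In-memory stand-ins for the source flat file and the staging file.
source_file = io.StringIO('"101"|"Smith"|"150000"|"unused"\n'
                          '"102"|"Jones"|"180000"|"unused"\n')
staging_file = io.StringIO()

WANTED = [0, 2]  # keep only listing number and price (hypothetical positions)

reader = csv.reader(source_file, delimiter="|", quotechar='"')
writer = csv.writer(staging_file, delimiter="|", quotechar='"',
                    quoting=csv.QUOTE_ALL)
for row in reader:
    writer.writerow([row[i] for i in WANTED])  # field selection

staged = staging_file.getvalue()
```

The staged copy keeps the same delimited, quoted format so later cleansing and transformation steps can read it the same way.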
2.2. Cleansing and Transformation
One of the more difficult tasks in building a data warehouse is cleansing and transforming the source data. The main objective of cleansing and transforming the source data is to improve the quality of the data flowing from a source system (operational files and databases) into a target system (a data warehouse database).
There are several steps in the cleansing and transformation process:
Check the integrity of the source data to verify that it conforms to the business rules and relationships identified in the extraction process.
The objective here is to determine the quality of the source data.
Check the accuracy of the source data.
The objective again is to verify the quality of the source data, but in this case the data values are checked to confirm that they reflect the real world. Often this can be done by sampling the source data rather than auditing the whole database.
Identify the task and tool requirements for data cleansing.
If the analysis from step 1 or step 2 indicates that data needs to be cleansed, then appropriate procedures and tools need to be identified to do the work. Additional data cleansing may also be required to resolve issues concerning missing field values, the handling of freeform data such as name and address information, and so forth.
Identify the tool requirements for transforming and integrating the data into the format required by the target system. The target system could be Oracle 8i, SQL Server 2000, et cetera, where the data warehouse will be built. Data transformation and integration involve restructuring the source data, including files, records, and fields, and removing data that is not required in the target system. It may also involve enhancing the data by decoding and translating field values, adding a time attribute (if one is not present in the source data) to reflect the currency of the data, summarizing data, and calculating derived values.
An example of the transformation and cleansing process is shown in section 3.2.
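As a generic illustration of these transformation tasks, the following Python sketch decodes a coded status field, adds a time attribute, and calculates a derived value. The field names, status codes, and record are all hypothetical (the project’s own transformation code is written in Visual Basic .NET).

```python
from datetime import date

# Hypothetical decode table for a coded listing-status field.
STATUS_CODES = {"A": "Active", "P": "Pending", "S": "Sold"}

def transform(record, load_date):
    """Return a transformed copy of one staged record."""
    out = dict(record)
    out["status"] = STATUS_CODES.get(record["status"], "Unknown")  # decode field value
    out["load_date"] = load_date.isoformat()                       # added time attribute
    out["price_per_sqft"] = round(                                 # calculated derived value
        record["sold_price"] / record["sqft"], 2)
    return out

row = {"status": "S", "sold_price": 158000, "sqft": 1400}
clean = transform(row, date(2004, 1, 9))
```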
2.3. Modeling

Designing a data warehouse is unlike designing a database for a transaction processing system. The typical approach used to construct a transaction processing system is the entity-relationship (E-R) diagram. An entity-relationship diagram shows the interaction between the numerous entities in the system in considerable detail. This approach is far too complex for a data warehouse, which does not require the same level of detail as a transaction processing system. The main objective of a data warehouse is to assist decision-makers, who are interested in only part of the operational system’s data for the knowledge discovery process.
Figure 2.1 An Example of Star Schema
A data warehouse uses a dimensional model, which describes data using facts and dimensions. A fact is a numeric (or other type of) data element by which an organization measures aspects of its business. The most useful facts are indeed numeric and are values that may be computed in some form. Some examples of fact data are dollar amounts (such as budget, expenditure, encumbrance, revenue, and value of order), counts (such as headcount, credit hours, and number of items), et cetera. There may be more than one fact table, and a fact table may contain millions of rows. A dimension is a set of attributes, usually hierarchical, that is used to describe an organization’s business by constraining and grouping facts. Some examples of dimensions are time, students, faculty, organization, funds, product, et cetera. The number of dimension tables is typically at least four and fewer than twenty. Data is a dimension if the data is static.
The combination of facts and dimensions results in a schema called the star schema. A star schema is a decision support system construction that arranges data in a format that is useful for the analysis process. An example of a star schema is shown in Figure 2.1.
Figure 2.2 An Example of Snowflake Schema
A snowflake schema is another data warehouse schema. Snowflaking occurs when a dimension table’s hierarchies are broken out into separate tables. This typically occurs when the designer wants to avoid repeating information within the dimension. This is a more normalized structure, but it leads to more difficult queries and slower response times. Figure 2.2 represents the beginning of the snowflake process: the category hierarchy is being broken out of the ProductDimension table. This structure increases the number of joins and can slow queries. Since one purpose of a data warehouse is to speed up queries, snowflaking is usually not something a designer wants to do. However, in the overall schema of the data warehouse, the dimension tables usually hold only about 1% of the records. Therefore, any space savings from normalizing, or snowflaking, are negligible.
2.4. Transport

After the data is transformed and the data warehouse is created, the data is ready to be transported and loaded into the warehouse. It is important to identify the tools and techniques to be used for loading the data into the target system. Data can be loaded into the target system using a load utility or data manipulation language statements. If the transformed data is stored in a flat file, it can be transported using FTP and then loaded using the SQL*Loader utility. If the transformed data is stored in an Oracle database, transportable tablespaces (a new Oracle 8i feature) may be used to move a tablespace from one database to another.
For applications involving large amounts of data, compression techniques may aid transportation performance. Data encryption may be required for transporting highly sensitive data across a network. Once the transport process is done, the data in the data warehouse is ready to be mined.
MASTERDW Data Warehouse Design and Implementation
This project uses Oracle 8i Data Warehousing tools to build a data warehouse for real estate price prediction purposes. The data warehouse for this project is called MASTERDW. The process of building the MASTERDW data warehouse, based on the methodology discussed in chapter 2, is shown in figure 3.1.
The data source extracted from the MLS operational database system is in flat file form. The files produced by the processes at the staging area, such as transformation and cleansing, modeling, and transport, are in flat file form as well. The flat file names can be found inside the squares in figure 3.1. The arrows in figure 3.1 represent the processes.
“RESI.TXT” is the data source extracted from the MLS operational database system. There are four transformation and cleansing processes in figure 3.1: transformation and cleansing 1, update, transformation and cleansing 2, and duplication detection. “RESI.TXT” is transformed and cleansed by the transformation and cleansing 1 process, which creates a log file called “RESSOLDLOG.TXT”. “RESSOLDLOG.TXT” records any erroneous data entries that exist in “RESI.TXT”. An update process is performed on “RESI.TXT” based on “RESSOLDLOG.TXT”, creating “RES.TXT”. “RES.TXT” is divided into four flat files by the transformation and cleansing 2 process: “OFCSRC.TXT”, “AGTSRC.TXT”, “RESIDENTIAL.TXT”, and “AREA.TXT”. “OFCSRC.TXT” and “AGTSRC.TXT” contain duplicate records, while “RESIDENTIAL.TXT” and “AREA.TXT” do not. The duplicate records are removed from “OFCSRC.TXT” and “AGTSRC.TXT” by the duplicate detection process, producing “OFFICE.TXT” and “AGENT.TXT”. The four resulting flat files are loaded into the MASTERDW data warehouse, becoming the OFFICES, AGENTS, RESIDENTIAL, and AREA tables.
Figure 3.1 Building MASTERDW Data Warehouse Process
The detailed process of building the MASTERDW data warehouse and an explanation of each flat file are discussed in sections 3.1, 3.2, 3.3, and 3.4.
3.1. Data Source
In the real estate world, each area has its own Multiple Listings Services (MLS). The area could be a subset of a county, a county, a group of cities, a group of counties, or even a state, depending on how the real estate agents want their associations to be organized. The MLS is a place for seller and buyer agents to exchange their listing information. The MLS is updated every minute, whenever there are listing transactions. The operational data source used by this project is extracted from the Sacramento, El Dorado, Placer, and Yolo Counties Multiple Listings Services (MLS) database. Sacramento, El Dorado, Placer, and Yolo counties are grouped together into one MLS database.
The MLS has an incremental database backup that is updated every day. The database backup covers six property types: residential, mobile home, residential income, land, commercial, and business opportunity. Each property type is backed up into a zip file. The data source used by this project is from the residential MLS database backup updated on January 9, 2004. It captures all the residential data in the source system from January 1, 1998 until January 9, 2004. The source data is a “|”-delimited flat file containing 191 fields and 295787 rows. It is called “RESI.TXT”. The list of field names can be found in appendix A. An example of one listing in “RESI.TXT”:
"2664 Fox Run Rd Georgetown, CA 95634"|"Georgetown"|"CA"|""|"Fox Run"
|"2664"|""|"Rd"|""|"95634"|""|"Seller Will Lease/Option, Present All Offers."|""
|"061-820-201"|"12901"|"MLS"|"2"|"0"|"3"|"0"|"0"|""|"3%"|"0000-00-00"|"El Dorado"|"Wentworth"|"Hwy. 193 To Wentworth Springs Rd., Approx. 2 1/2 Miles To Left On Fox Run Road, Follow Road Approx. 1 Mile, Stay Right Until Sign."|"566"|"Black Oak Mine"|"1994-05-17"|""|"Conventional"|"0000-00-00"
|"Black Oak Mine"|"1999-07-21"|"1999-07-21 00:00:00"|"00:00:00"|"1/1/1900"
|""|"ESMITHII"|"Irene Smith"|""|""|""|""|""|""|""|""|""|""|""|"916-999-9999"|"1994-05-17"|"1242314"|"6180 Hwy 193 Georgetown, CA 95634"|"Georgetown"
|"PATI"|"The Patti Smith Real Estate Gr"|"530-333-4336"|"6180 Hwy 193"
|"95634"|"158000"|"14000"|"0.32"|"10"|"435600"|""|"A"|"Compass Book (PL)"
|"16 C-2"|"Farwest Manufactured Home With Large Pantry, Lots Of Storage And Closets, Large Fenced Garden And More! Not Too Far Out Of Town, Yet Private And Nice Location."|"Gunder Rinset"|"158000"|"1995-12-04"|"0"|"100"
|"Residential"|"1"|"El Dorado"|"ESMITHII"|"Irene Smith"|""|""|""|""|""|""|"1998-12-02"|"PATI"|"The Patti Smith Real Estate Gr"|"530-333-4336"|"140000"|"Black Oak Mine"|"Seller"|"1400"|"Sold"|""|"None"|"0"|"1990"|""|"1400"|"Window"|"0"
|""|"Other"|""|""|""|""|"No"|""|"No"|"0"|"Frame,Other"|"0"|"Space In Kitchen"|""|""
|"None"|""|"1st American"|"1111111si"|""|"Treat As Clear"|""|""|""|"Living Room"|"Carpet,Linoleum/Vinyl"|""|"2 Car Attached"|"Central,Propane"|"No"
|"0"|""|""|"Yes"|"Other"|""|"B/I Range Gas,Disposal,Dishwasher"|""|""|"Inside Laundry Room"|""|"Irregular"|""|""|"Patio"|"1"|"(916) 333-1515"|""|""|""|""|""|""
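Because each field in the record above is double-quoted and “|”-delimited, it can be parsed with standard tooling. The following Python sketch (the project’s own parsing is written in Visual Basic .NET) splits a shortened, hypothetical line in the same format:

```python
import csv

# The first few fields of a record in the "RESI.TXT" format.
line = '"2664 Fox Run Rd Georgetown, CA 95634"|"Georgetown"|"CA"|""|"Fox Run"'

# quotechar keeps the comma inside the address from being misread,
# and empty fields ("") come back as empty strings.
fields = next(csv.reader([line], delimiter="|", quotechar='"'))
```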
3.2. Transformation and Cleansing

After the data source is extracted from the operational system, it needs to be transformed and cleansed. There are four steps of transformation and cleansing in this project:
Transformation and cleansing 1
Update process for the result of transformation and cleansing 1
Transformation and cleansing 2
Duplication detection for office and agent records
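The fourth step can be sketched in Python as keeping only the first occurrence of each record key (the project’s actual code is written in Visual Basic .NET); the key field and records below are hypothetical.

```python
def remove_duplicates(records, key):
    """Keep the first record seen for each value of the key field."""
    seen = set()
    unique = []
    for rec in records:
        k = rec[key]
        if k not in seen:  # first occurrence wins; later duplicates are dropped
            seen.add(k)
            unique.append(rec)
    return unique

# Hypothetical office records with one duplicate entry.
offices = [
    {"office_id": "PATI", "name": "The Patti Smith Real Estate Gr"},
    {"office_id": "PATI", "name": "The Patti Smith Real Estate Gr"},
    {"office_id": "ABCR", "name": "ABC Realty"},
]
unique_offices = remove_duplicates(offices, "office_id")
```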
3.2.1. Transformation and Cleansing 1
The transformation and cleansing process requires a thorough understanding of the business meaning of the data, the data relationships, and the business rules. There are several statuses in the residential listings: active, expired, expired pending, pending, sold, temporary off market, and duplicate withdrawn. This project uses only the sold residential listing status for the purpose of real estate price prediction. Transformation and cleansing 1 checks several fields of all the sold listings in “RESI.TXT” and writes the results of the checks to a log file (“RESSOLDLOG.TXT”). The transformation and cleansing 1 source code can be found in appendix B. The log file produced by transformation and cleansing 1 can be found in appendix C. The fields that are checked in transformation and cleansing 1 are:
Listing price
The listing price is the price that a seller assigns to a listing when he/she lists his/her house on the market. The listing price should be an integer between 10000 and 49999999. If the listing price is less than zero or more than 99999999, then it will write an “Invalid LP” message to the log file. If the listing price is less than 10000 or more than 49999999, then it will write an “LP exceeds limit” message to the log file. A listing price of less than 10000 could indicate a rental or lease property, which is not used in this project. A listing price of more than 49999999 is very rare, and it needs to be checked to determine whether it is correct or an erroneous data entry.
Square footage
Square footage is the building size of a house in square feet. Square footage should be an integer. The square footage field is not well populated, because sellers sometimes do not include the square footage in their listings. If the square footage is more than 10000 for a listing price of less than 1000000, then it will write an “SQFT exceeds limit” message to the log file.
Listing date
The listing date is the date when a seller listed his/her house on the market. The listing date should be in date form. If the listing date is not in date form, then it will write an “Invalid LD” message to the log file. If the listing date is earlier than “1900-01-01”, then it will write an “LD exceeds limit” message to the log file.
Number of full bathrooms and half bathrooms
There are two kinds of bathrooms: full bathrooms and half bathrooms. A full bathroom has a tub or a shower, but a half bathroom has neither. The number of full bathrooms should not be zero. If the number of full bathrooms is zero, it will write a “No full bathrooms” message and the number of half bathrooms to the log file. There is a possibility of an erroneous data entry between the number of full bathrooms and number of half bathrooms fields.
Number of bedrooms
Every house should have at least one bedroom. If the number of bedrooms is zero, then it will write a “No bedrooms” message to the log file.
Year built
The year built is the year when a house was built and became ready for use. The year built should be four digits long and greater than 1900.
Pending date
A listed house becomes pending once a buyer’s offer has been accepted by the seller. The pending status becomes the sold status when the transaction closes escrow. The pending date is the date when a listed house became pending. The pending date should be in date form and later than “1990-01-01”, since the data source consists of sold listings that were listed in “1994” or later. If the pending date is not in date form, then it will write an “Invalid PD” message to the log file. If the pending date is before “1990-01-01”, then it will write a “PD exceeds limit” message to the log file. The pending date should be later than the listing date. If the pending date is earlier than the listing date, it will write a “PD is less than LD” message to the log file.
Sold date
The sold date is the date when a pending listing became sold. The sold date should be in date form and later than “1990-01-01”, since the data source consists of sold listings that were listed in “1994” or later. If the sold date is not in date form, then it will write an “Invalid SD” message to the log file. If the sold date is before “1990-01-01”, then it will write an “SD exceeds limit” message to the log file. The sold date should be later than the listing date and equal to or later than the pending date. If the sold date is earlier than the listing date, it will write an “SD / LD” message and the sold date and listing date values to the log file. If the sold date is earlier than the pending date, it will write an “SD / PD” message and the sold date and pending date values to the log file.
Sold price
The sold price is the price agreed upon by the seller and the buyer at the time of closed escrow. The sold price should be an integer between 10000 and 49999999. If the sold price is less than zero or more than 99999999, then it will write an “Invalid SP” message to the log file. If the sold price is less than 10000 or more than 49999999, then it will write an “SP exceeds limit” message to the log file. A sold price of less than 10000 could indicate a rental or lease property, which is not used in this project. A sold price of more than 49999999 is very rare, and it needs to be checked to determine whether it is correct or an erroneous data entry. The sold price should be between the listing price divided by 1.5 and the listing price times 1.5. If the sold price is less than the listing price divided by 1.5 or more than the listing price times 1.5, then it will write an “LP / SP exceeds norm” message to the log file.
Days on market
Days on market is the number of days a house is listed on the market before it becomes pending. Days on market is the difference between the pending date and the listing date. Days on market should be between 0 and 730. If the days on market are less than zero, then it will write a “DOM too small” message to the log file. If the days on market are more than 730, then it will write a “DOM too large” message to the log file.
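A few of the checks above can be sketched in Python as follows (the project’s actual code, in Visual Basic .NET, is in appendix B). The record and field names are hypothetical; the message strings follow the log-file wording used above.

```python
from datetime import date

def check_listing(rec):
    """Return the list of log messages triggered by one sold listing."""
    messages = []
    lp, sp = rec["listing_price"], rec["sold_price"]
    if lp < 10000 or lp > 49999999:
        messages.append("LP exceeds limit")
    # Sold price should fall between LP / 1.5 and LP * 1.5.
    if sp < lp / 1.5 or sp > lp * 1.5:
        messages.append("LP / SP exceeds norm")
    # Days on market: pending date minus listing date, 0 to 730 days.
    dom = (rec["pending_date"] - rec["listing_date"]).days
    if dom < 0:
        messages.append("DOM too small")
    elif dom > 730:
        messages.append("DOM too large")
    return messages

rec = {"listing_price": 158000, "sold_price": 140000,
       "listing_date": date(1998, 12, 2), "pending_date": date(1999, 6, 1)}
errors = check_listing(rec)  # an empty list means the record is clean
```

Records whose message list is non-empty are the ones written to “RESSOLDLOG.TXT” for the update step.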