'data.frame': 17384 obs. of 21 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : Factor w/ 368 levels "20140502T000000",..: 164 219 287 219 280 11 57 250 336 302 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
Variable descriptions were obtained from King County, Department of Assessments. All feature engineering should be done in the first code chunks of your document.
id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartment’s interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not: 1’s represent a waterfront property, 0’s represent a non-waterfront property
view - An index from 0 to 4 of how good the view of the property was, 0 - lowest, 4 - highest
condition - An index from 1 to 5 on the condition of the apartment, 1 - lowest, 5 - highest
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Latitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
Modeling Home Prices Using Realtor Data offers a thoughtful and intelligent approach to predicting the price of homes using relevant variables regarding the home. Since we share this objective, I directly follow the steps taken by Pardoe in many places throughout this project.
sqft_living, sqft_above & sqft_basement
sqft_lot & sqft_lot15
yr_built
To assess the validity of any variable transformation performed, I’ll refer to the correlation coefficient r, which reports the strength of the linear relationship between two variables. I’ll compare the r value between price and the variable before transformation to the r value between price and the variable after transformation.
sqft_living, sqft_above, sqft_basement
There seems to be a lot of redundant information stored between the sqft_living, sqft_above and sqft_basement columns. If possible, I’d like to consolidate these variables and eliminate any superfluous information.
Create a basement column that returns a 1 if sqft_basement > 0 and compare it to sqft_basement
housedata$sqft_basement <- as.numeric(housedata$sqft_basement)
housedata$basement <- ifelse(housedata$sqft_basement > 0, 1, 0)
After creating the basement variable, the r value with price decreased from 0.3312296 (sqft_basement) to 0.1832654 (basement). Therefore, this transformation was ineffective and the basement variable shall be removed.
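The before-and-after comparison can be sketched on a small example. The data frame `toy` and its values are hypothetical and only illustrate the mechanics; they are not drawn from housedata.

```r
# Hypothetical toy data; values are illustrative, not taken from housedata.
toy <- data.frame(
  price         = c(200, 260, 310, 330, 420, 500),  # in $1000s
  sqft_basement = c(0, 0, 300, 0, 700, 1200)
)

# Binary indicator: 1 if the home has any basement, 0 otherwise.
toy$basement <- ifelse(toy$sqft_basement > 0, 1, 0)

# Pearson correlation with price before and after the recoding.
r_raw       <- cor(toy$price, toy$sqft_basement)
r_indicator <- cor(toy$price, toy$basement)

print(c(raw = r_raw, indicator = r_indicator))
```

Collapsing a numeric column to 0/1 discards the size information, so a drop in r is one plausible outcome of this kind of recoding.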
sqft_lot & sqft_lot15
In the Modeling Homes article, Pardoe explains the common practice of realtors using “lot size ‘categories’” when pricing homes instead of the raw ft\(^2\) value. If this is truly common practice, then recoding the data from raw ft\(^2\) to lot size categories should improve the relationship between price and land size.
Create lotsize and lotsize15 from sqft_lot and sqft_lot15
# Lotsize
housedata <- housedata %>%
  mutate(lotsize = cut(sqft_lot,
                       breaks = c(-Inf, 3000, 5000, 7000, 10000, 15000, 20000, 43560, 130680, 217800, 435600, Inf),
                       labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)))
housedata$lotsize <- as.numeric(housedata$lotsize)
# Lotsize15
housedata <- housedata %>%
  mutate(lotsize15 = cut(sqft_lot15,
                         breaks = c(-Inf, 3000, 5000, 7000, 10000, 15000, 20000, 43560, 130680, 217800, 435600, Inf),
                         labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)))
housedata$lotsize15 <- as.numeric(housedata$lotsize15)
Creating the lotsize and lotsize15 variables increased the \(r\) value with price from 0.0882381 to 0.1662823 for lotsize and from 0.0808064 to 0.1527959 for lotsize15. Remove sqft_lot and sqft_lot15 to reduce redundancy.
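To see how the break points assign categories, here is a minimal sketch on a handful of hypothetical lot sizes. The vector `sqft_lot` below is made up for illustration; the breaks are the ones used above, where 43560 ft\(^2\) is one acre and 130680, 217800 and 435600 are 3, 5 and 10 acres.

```r
# Hypothetical lot sizes in square feet (illustrative values only).
sqft_lot <- c(2500, 4800, 6500, 9000, 12000, 18000, 40000, 100000, 500000)

# Same break points as the lotsize recoding above.
lot_breaks <- c(-Inf, 3000, 5000, 7000, 10000, 15000, 20000,
                43560, 130680, 217800, 435600, Inf)

# labels = FALSE makes cut() return the integer bin index directly,
# so no factor-to-numeric conversion is needed afterwards.
lotsize <- cut(sqft_lot, breaks = lot_breaks, labels = FALSE)

print(data.frame(sqft_lot, lotsize))
```

By default `cut()` uses right-closed intervals, so a lot of exactly 3000 ft\(^2\) falls in category 1 rather than category 2.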
yr_built & age
mean(housedata$yr_built)
[1] 1971.153
housedata <- housedata %>%
  mutate(age = (1971 - as.numeric(yr_built)) / 10) # Decades before 1971 (mean yr_built); negative for homes built after 1971
Here, I created an age variable, which measures how many decades before 1971 (the mean yr_built) the home was built. The r value changed from 0.0525221 to -0.0525221: because age is a linear rescaling of yr_built with the direction reversed, only the sign of r changes. While the creation of the age variable didn’t increase the r value, it made the data in the yr_built column available for linear regression. Remove yr_built.
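The sign flip in r follows from the form of the transformation, which a short sketch can confirm. The `yr_built` and `price` vectors below are hypothetical values chosen only for illustration.

```r
# Hypothetical toy data (illustrative values only).
yr_built <- c(1920, 1955, 1971, 1987, 2003, 2014)
price    <- c(410, 300, 280, 350, 420, 520)

# Same transformation as above: decades relative to 1971.
age <- (1971 - yr_built) / 10

r_year <- cor(price, yr_built)
r_age  <- cor(price, age)

# The two correlations are equal in magnitude and opposite in sign.
print(c(yr_built = r_year, age = r_age))
print(isTRUE(all.equal(r_age, -r_year)))
```

Pearson correlation is unchanged in magnitude by shifting and rescaling a variable, so a transformation of this form cannot strengthen or weaken r.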
Analysis of Variance Table
Model 1: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + yr_renovated + zipcode + lat + long + sqft_living15 +
sqft_lot15
Model 2: price ~ bedrooms + bathrooms + sqft_living + floors + waterfront +
view + condition + grade + sqft_above + sqft_basement + yr_renovated +
zipcode + lat + long + sqft_living15 + lotsize + lotsize15 +
age
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17298 4.5804e+14
2 17292 4.5376e+14 6 4.2815e+12 27.193 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After performing an anova on full models fitted to the previous dataframe and to the updated version, we find that the model on the updated dataframe has a lower \(RSS\) (4.5376e+14 versus 4.5804e+14) and therefore fits the data more closely. The transformations performed above have strengthened the relationship between price and the recoded variables. This will provide greater potential for a robust model.
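As a rough illustration of this kind of comparison, the sketch below fits two models and compares their \(RSS\). It uses the built-in mtcars data as a stand-in for housedata, and the model names are hypothetical. Note one assumption: the two fits here are nested, which is the setting the F test in `anova()` is designed for.

```r
# mtcars ships with base R; it stands in for housedata here.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Residual sum of squares for each fit; deviance() returns RSS for lm objects.
rss <- c(small = deviance(fit_small), large = deviance(fit_large))
print(rss)

# F test comparing the two nested fits.
print(anova(fit_small, fit_large))
```

Adding predictors to a nested model can never raise the in-sample \(RSS\), so the F test is what indicates whether the reduction is larger than chance would explain.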
Repeat all steps to form test dataframe.
Here, I will use a method known as subset selection to generate a model to predict price from other variables, and then improve upon it by performing nonlinear transformations and introducing interaction terms.
The sets of plots for both methods of subset selection indicate that the same model has the highest \(R^2\) value and the lowest \(C_p\) and BIC values.
The model selected by both forward selection and backward elimination contains 20 predictors and has an \(R^2_{adj}\) value of 0.6873832. This means that roughly 68.7% of the variability in price can be explained by the model.
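For reference, \(R^2_{adj}\) can be computed by hand from the residual and total sums of squares. The sketch below does this on the built-in mtcars data, which stands in for housedata; the chosen predictors are arbitrary.

```r
# mtcars ships with base R; it stands in for housedata here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

y   <- mtcars$mpg
n   <- length(y)
p   <- length(coef(fit)) - 1        # number of predictors, excluding the intercept
rss <- sum(resid(fit)^2)            # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares

r2     <- 1 - rss / tss
r2_adj <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))

print(c(r2 = r2, r2_adj = r2_adj))
```

Unlike \(R^2\), the adjusted version penalizes each extra predictor, which is why it is the more useful figure when comparing models of different sizes.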
I will begin by referring to the residualPlots() output to determine which predictors would benefit from a nonlinear transformation.
The residualPlots() output indicates that sqft_living, grade and sqft_basement all have a nonlinear relationship with price. I will account for this in the model.
Call:
lm(formula = price ~ bedrooms + sqft_living + I(waterfront ==
"Yes") + I(view == "good") + I(view == "very good") + grade +
I(zipcode == 98005) + I(zipcode == 98007) + I(zipcode ==
98034) + I(zipcode == 98040) + I(zipcode == 98042) + I(zipcode ==
98103) + I(zipcode == 98106) + I(zipcode == 98112) + I(zipcode ==
98115) + I(zipcode == 98116) + I(zipcode == 98122) + lat +
long + sqft_basement + I(sqft_living^2) + I(grade^2) + I(sqft_basement^2),
data = housedata)
Residuals:
Min 1Q Median 3Q Max
-3790778 -89063 -10418 64356 2829843
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.566e+07 1.500e+06 -30.429 < 2e-16 ***
bedrooms -1.442e+03 2.116e+03 -0.682 0.49546
sqft_living 9.209e+00 6.544e+00 1.407 0.15938
I(waterfront == "Yes")TRUE 5.583e+05 2.073e+04 26.925 < 2e-16 ***
I(view == "good")TRUE 1.322e+05 1.001e+04 13.202 < 2e-16 ***
I(view == "very good")TRUE 2.619e+05 1.514e+04 17.297 < 2e-16 ***
grade -2.544e+05 1.241e+04 -20.503 < 2e-16 ***
I(zipcode == 98005)TRUE 1.011e+05 1.686e+04 5.999 2.02e-09 ***
I(zipcode == 98007)TRUE 5.661e+04 1.782e+04 3.177 0.00149 **
I(zipcode == 98034)TRUE -6.294e+04 9.403e+03 -6.693 2.25e-11 ***
I(zipcode == 98040)TRUE 3.279e+05 1.359e+04 24.127 < 2e-16 ***
I(zipcode == 98042)TRUE -2.163e+04 9.617e+03 -2.249 0.02451 *
I(zipcode == 98103)TRUE 6.249e+04 9.242e+03 6.761 1.41e-11 ***
I(zipcode == 98106)TRUE -7.189e+04 1.208e+04 -5.949 2.76e-09 ***
I(zipcode == 98112)TRUE 3.553e+05 1.326e+04 26.798 < 2e-16 ***
I(zipcode == 98115)TRUE 7.713e+04 9.120e+03 8.458 < 2e-16 ***
I(zipcode == 98116)TRUE 8.262e+04 1.211e+04 6.823 9.18e-12 ***
I(zipcode == 98122)TRUE 1.081e+05 1.292e+04 8.370 < 2e-16 ***
lat 6.007e+05 1.149e+04 52.280 < 2e-16 ***
long -1.480e+05 1.175e+04 -12.595 < 2e-16 ***
sqft_basement 7.149e+01 8.256e+00 8.659 < 2e-16 ***
I(sqft_living^2) 2.794e-02 1.012e-03 27.607 < 2e-16 ***
I(grade^2) 2.156e+04 7.748e+02 27.833 < 2e-16 ***
I(sqft_basement^2) -4.746e-02 5.938e-03 -7.994 1.39e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 191300 on 17360 degrees of freedom
Multiple R-squared: 0.7326, Adjusted R-squared: 0.7323
F-statistic: 2068 on 23 and 17360 DF, p-value: < 2.2e-16
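The `I(x^2)` terms in the formula above add a squared copy of a predictor alongside the original. A minimal sketch of the same pattern, using the built-in mtcars data as a stand-in for housedata:

```r
# mtcars ships with base R; it stands in for housedata here.
fit_linear    <- lm(mpg ~ hp, data = mtcars)

# I() protects the ^ operator so it means arithmetic squaring
# rather than formula crossing.
fit_quadratic <- lm(mpg ~ hp + I(hp^2), data = mtcars)

adj_r2 <- c(linear    = summary(fit_linear)$adj.r.squared,
            quadratic = summary(fit_quadratic)$adj.r.squared)
print(adj_r2)
print(coef(fit_quadratic))
```

Once a squared term is present, the coefficient on the linear term can no longer be read on its own, since the effect of the predictor depends on both coefficients together.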
Here, I add interactions to the model to improve accuracy. One could infer that there is a relationship between bedrooms and sqft_living, and additionally between sqft_basement and sqft_living.
Call:
lm(formula = price ~ bedrooms + sqft_living + I(waterfront ==
"Yes") + I(view == "good") + I(view == "very good") + grade +
I(zipcode == 98005) + I(zipcode == 98007) + I(zipcode ==
98034) + I(zipcode == 98040) + I(zipcode == 98042) + I(zipcode ==
98103) + I(zipcode == 98106) + I(zipcode == 98112) + I(zipcode ==
98115) + I(zipcode == 98116) + I(zipcode == 98122) + lat +
long + sqft_basement + I(sqft_living^2) + I(grade^2) + I(sqft_basement^2) +
bedrooms:sqft_living + sqft_living:sqft_basement + sqft_basement:I(sqft_living^2) +
bedrooms:I(sqft_living^2) + I(sqft_living^2):I(sqft_basement^2),
data = housedata)
Residuals:
Min 1Q Median 3Q Max
-2364730 -88079 -10422 63804 2309151
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.637e+07 1.490e+06 -31.122 < 2e-16 ***
bedrooms -3.079e+02 7.330e+03 -0.042 0.96649
sqft_living -5.327e+01 1.918e+01 -2.778 0.00547 **
I(waterfront == "Yes")TRUE 5.430e+05 2.057e+04 26.396 < 2e-16 ***
I(view == "good")TRUE 1.247e+05 9.932e+03 12.556 < 2e-16 ***
I(view == "very good")TRUE 2.617e+05 1.500e+04 17.442 < 2e-16 ***
grade -2.411e+05 1.326e+04 -18.180 < 2e-16 ***
I(zipcode == 98005)TRUE 1.044e+05 1.668e+04 6.260 3.95e-10 ***
I(zipcode == 98007)TRUE 5.441e+04 1.763e+04 3.087 0.00203 **
I(zipcode == 98034)TRUE -6.555e+04 9.314e+03 -7.037 2.04e-12 ***
I(zipcode == 98040)TRUE 3.313e+05 1.347e+04 24.596 < 2e-16 ***
I(zipcode == 98042)TRUE -2.453e+04 9.516e+03 -2.578 0.00995 **
I(zipcode == 98103)TRUE 6.231e+04 9.149e+03 6.811 1.00e-11 ***
I(zipcode == 98106)TRUE -7.010e+04 1.196e+04 -5.860 4.72e-09 ***
I(zipcode == 98112)TRUE 3.521e+05 1.315e+04 26.775 < 2e-16 ***
I(zipcode == 98115)TRUE 7.874e+04 9.024e+03 8.725 < 2e-16 ***
I(zipcode == 98116)TRUE 8.517e+04 1.198e+04 7.108 1.22e-12 ***
I(zipcode == 98122)TRUE 1.099e+05 1.279e+04 8.593 < 2e-16 ***
lat 6.085e+05 1.140e+04 53.353 < 2e-16 ***
long -1.506e+05 1.165e+04 -12.926 < 2e-16 ***
sqft_basement 6.056e+01 1.911e+01 3.168 0.00154 **
I(sqft_living^2) 4.889e-02 3.715e-03 13.158 < 2e-16 ***
I(grade^2) 2.064e+04 8.271e+02 24.950 < 2e-16 ***
I(sqft_basement^2) -4.868e-02 1.242e-02 -3.921 8.87e-05 ***
bedrooms:sqft_living 1.626e+01 5.186e+00 3.136 0.00171 **
sqft_living:sqft_basement -3.934e-02 1.514e-02 -2.598 0.00938 **
sqft_basement:I(sqft_living^2) 1.807e-05 2.334e-06 7.744 1.01e-14 ***
bedrooms:I(sqft_living^2) -6.082e-03 8.789e-04 -6.920 4.68e-12 ***
I(sqft_living^2):I(sqft_basement^2) -3.063e-09 3.177e-10 -9.641 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 189200 on 17355 degrees of freedom
Multiple R-squared: 0.7384, Adjusted R-squared: 0.738
F-statistic: 1750 on 28 and 17355 DF, p-value: < 2.2e-16
After performing nonlinear transformations and adding interaction terms, the \(R^2_{adj}\) for the model increases from 0.6873832 to 0.738011. This means that the updated model is capable of explaining 73.8% of the variability in price. I will consider this the final model produced by the subset selection method.
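The `a:b` terms in the formulas above add the product of two predictors to the model. A minimal sketch of the syntax on the built-in mtcars data, which stands in for housedata:

```r
# mtcars ships with base R; it stands in for housedata here.
fit_main <- lm(mpg ~ wt + hp, data = mtcars)

# wt:hp adds only the product term; the main effects are listed explicitly.
fit_int  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# wt * hp is shorthand for wt + hp + wt:hp and gives the same fit.
fit_star <- lm(mpg ~ wt * hp, data = mtcars)

print(coef(fit_int))
print(c(main = summary(fit_main)$adj.r.squared,
        interaction = summary(fit_int)$adj.r.squared))
```

With an interaction in the model, the slope for one predictor depends on the value of the other, which is the behavior one would expect between, say, bedrooms and sqft_living.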
In the following sections, I begin with a full model that regresses price onto every available predictor. I will then proceed to strengthen the model by again performing nonlinear transformations and introducing interaction terms.
Call:
lm(formula = price ~ ., data = housedata)
Residuals:
Min 1Q Median 3Q Max
-1237318 -70270 -251 62652 4353961
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.198e+07 6.930e+06 -4.615 3.97e-06 ***
bedrooms -3.083e+04 1.797e+03 -17.158 < 2e-16 ***
bathrooms 2.497e+04 2.967e+03 8.414 < 2e-16 ***
sqft_living 1.312e+02 4.000e+00 32.797 < 2e-16 ***
floors -4.098e+04 3.646e+03 -11.239 < 2e-16 ***
waterfrontYes 5.836e+05 1.791e+04 32.585 < 2e-16 ***
viewfair 7.118e+04 1.037e+04 6.862 7.03e-12 ***
viewaverage 6.866e+04 6.174e+03 11.121 < 2e-16 ***
viewgood 1.413e+05 8.707e+03 16.226 < 2e-16 ***
viewvery good 3.165e+05 1.303e+04 24.292 < 2e-16 ***
conditionfair 6.766e+04 3.606e+04 1.876 0.060658 .
conditionaverage 5.029e+04 3.339e+04 1.506 0.132012
conditiongood 6.723e+04 3.339e+04 2.013 0.044084 *
conditionvery good 1.110e+05 3.361e+04 3.304 0.000955 ***
grade 5.957e+04 2.060e+03 28.918 < 2e-16 ***
sqft_above 7.802e+01 4.106e+00 19.000 < 2e-16 ***
sqft_basement NA NA NA NA
yr_renovated 1.458e+01 3.347e+00 4.357 1.33e-05 ***
zipcode98002 4.485e+04 1.638e+04 2.738 0.006189 **
zipcode98003 -1.868e+04 1.447e+04 -1.291 0.196572
zipcode98004 7.300e+05 2.646e+04 27.586 < 2e-16 ***
zipcode98005 2.357e+05 2.830e+04 8.329 < 2e-16 ***
zipcode98006 2.314e+05 2.305e+04 10.039 < 2e-16 ***
zipcode98007 1.976e+05 2.900e+04 6.813 9.86e-12 ***
zipcode98008 2.170e+05 2.768e+04 7.840 4.78e-15 ***
zipcode98010 9.536e+04 2.456e+04 3.883 0.000104 ***
zipcode98011 3.974e+04 3.618e+04 1.098 0.272024
zipcode98014 7.708e+04 3.992e+04 1.931 0.053535 .
zipcode98019 4.913e+04 3.891e+04 1.262 0.206811
zipcode98022 6.485e+04 2.164e+04 2.997 0.002734 **
zipcode98023 -5.325e+04 1.336e+04 -3.985 6.79e-05 ***
zipcode98024 1.648e+05 3.415e+04 4.827 1.40e-06 ***
zipcode98027 1.611e+05 2.375e+04 6.784 1.21e-11 ***
zipcode98028 2.312e+04 3.509e+04 0.659 0.510003
zipcode98029 2.108e+05 2.713e+04 7.771 8.23e-15 ***
zipcode98030 6.762e+03 1.581e+04 0.428 0.668847
zipcode98031 1.122e+04 1.658e+04 0.677 0.498691
zipcode98032 -1.109e+04 1.932e+04 -0.574 0.565817
zipcode98033 3.007e+05 3.005e+04 10.005 < 2e-16 ***
zipcode98034 1.267e+05 3.221e+04 3.934 8.38e-05 ***
zipcode98038 6.048e+04 1.789e+04 3.381 0.000723 ***
zipcode98039 1.161e+06 3.498e+04 33.201 < 2e-16 ***
zipcode98040 4.718e+05 2.367e+04 19.930 < 2e-16 ***
zipcode98042 2.066e+04 1.530e+04 1.350 0.176878
zipcode98045 1.443e+05 3.303e+04 4.369 1.26e-05 ***
zipcode98052 1.729e+05 3.062e+04 5.645 1.68e-08 ***
zipcode98053 1.459e+05 3.271e+04 4.462 8.18e-06 ***
zipcode98055 2.665e+04 1.855e+04 1.437 0.150864
zipcode98056 7.036e+04 2.016e+04 3.491 0.000483 ***
zipcode98058 1.984e+04 1.746e+04 1.136 0.255829
zipcode98059 6.375e+04 1.980e+04 3.220 0.001283 **
zipcode98065 1.054e+05 3.043e+04 3.463 0.000535 ***
zipcode98070 -6.956e+04 2.346e+04 -2.965 0.003032 **
zipcode98072 6.751e+04 3.578e+04 1.887 0.059215 .
zipcode98074 1.341e+05 2.893e+04 4.636 3.57e-06 ***
zipcode98075 1.402e+05 2.773e+04 5.056 4.32e-07 ***
zipcode98077 3.632e+04 3.738e+04 0.972 0.331297
zipcode98092 -2.142e+04 1.428e+04 -1.500 0.133681
zipcode98102 4.889e+05 3.168e+04 15.431 < 2e-16 ***
zipcode98103 2.544e+05 2.923e+04 8.705 < 2e-16 ***
zipcode98105 4.048e+05 2.996e+04 13.513 < 2e-16 ***
zipcode98106 9.428e+04 2.161e+04 4.362 1.30e-05 ***
zipcode98107 2.721e+05 2.999e+04 9.073 < 2e-16 ***
zipcode98108 8.092e+04 2.414e+04 3.352 0.000805 ***
zipcode98109 4.238e+05 3.091e+04 13.708 < 2e-16 ***
zipcode98112 5.419e+05 2.749e+04 19.711 < 2e-16 ***
zipcode98115 2.517e+05 2.959e+04 8.505 < 2e-16 ***
zipcode98116 2.206e+05 2.409e+04 9.158 < 2e-16 ***
zipcode98117 2.336e+05 3.000e+04 7.786 7.28e-15 ***
zipcode98118 1.356e+05 2.104e+04 6.445 1.19e-10 ***
zipcode98119 4.118e+05 2.917e+04 14.120 < 2e-16 ***
zipcode98122 2.774e+05 2.619e+04 10.595 < 2e-16 ***
zipcode98125 1.139e+05 3.199e+04 3.559 0.000373 ***
zipcode98126 1.460e+05 2.231e+04 6.547 6.05e-11 ***
zipcode98133 6.706e+04 3.298e+04 2.033 0.042051 *
zipcode98136 1.882e+05 2.267e+04 8.299 < 2e-16 ***
zipcode98144 2.404e+05 2.427e+04 9.903 < 2e-16 ***
zipcode98146 4.713e+04 2.021e+04 2.332 0.019702 *
zipcode98148 3.965e+04 2.841e+04 1.395 0.162915
zipcode98155 4.995e+04 3.439e+04 1.452 0.146441
zipcode98166 9.326e+03 1.863e+04 0.501 0.616676
zipcode98168 3.613e+04 1.964e+04 1.839 0.065861 .
zipcode98177 1.059e+05 3.438e+04 3.081 0.002066 **
zipcode98178 1.116e+04 2.007e+04 0.556 0.578218
zipcode98188 2.108e+03 2.047e+04 0.103 0.917964
zipcode98198 -2.470e+04 1.563e+04 -1.581 0.113960
zipcode98199 2.913e+05 2.847e+04 10.232 < 2e-16 ***
lat 2.176e+05 7.147e+04 3.044 0.002336 **
long -1.731e+05 5.105e+04 -3.390 0.000701 ***
sqft_living15 6.986e+00 3.269e+00 2.137 0.032636 *
lotsize 9.441e+03 1.656e+03 5.702 1.21e-08 ***
lotsize15 -1.917e+03 1.834e+03 -1.045 0.295975
age 7.282e+03 7.492e+02 9.720 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 162000 on 17292 degrees of freedom
Multiple R-squared: 0.809, Adjusted R-squared: 0.808
F-statistic: 804.9 on 91 and 17292 DF, p-value: < 2.2e-16
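The NA row for sqft_basement and the note "1 not defined because of singularities" are what `lm()` reports when one predictor is an exact linear combination of others. The rows shown in the str() output are consistent with sqft_living = sqft_above + sqft_basement, and the sketch below assumes that identity. The simulated data are hypothetical and only reproduce the mechanism.

```r
# Simulated data (hypothetical values) in which sqft_living is exactly
# the sum of sqft_above and sqft_basement.
set.seed(1)
n <- 50
sqft_above    <- round(runif(n, 800, 3000))
sqft_basement <- round(runif(n, 0, 1000))
sqft_living   <- sqft_above + sqft_basement
price         <- 100 * sqft_living + rnorm(n, sd = 20000)

fit <- lm(price ~ sqft_living + sqft_above + sqft_basement)

# One coefficient cannot be estimated and is reported as NA.
print(coef(fit))

# alias() lists which term is a linear combination of the others.
print(alias(fit))
```

Dropping any one of the three columns removes the singularity without changing the fitted values.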
Again, I begin by referring to the residualPlots() output to determine which predictors would benefit from a nonlinear transformation.
The output of residualPlots() reveals that a nonlinear relationship exists between price and bathrooms, sqft_living, grade, sqft_above, sqft_basement, yr_renovated, and sqft_living15.
Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_renovated + zipcode + lat + long + sqft_living15 + lotsize +
lotsize15 + age + I(bathrooms^2) + I(sqft_living^2) + I(grade^2) +
I(sqft_above^2) + I(sqft_basement^2) + I(yr_renovated^2),
data = housedata)
Residuals:
Min 1Q Median 3Q Max
-3463192 -57252 2735 54593 2405534
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.033e+07 6.299e+06 -6.402 1.57e-10 ***
bedrooms -5.166e+03 1.694e+03 -3.049 0.002300 **
bathrooms 1.757e+04 7.731e+03 2.273 0.023042 *
sqft_living -1.745e+01 7.519e+00 -2.321 0.020297 *
floors -2.908e+04 3.414e+03 -8.516 < 2e-16 ***
waterfrontYes 5.908e+05 1.630e+04 36.248 < 2e-16 ***
viewfair 7.641e+04 9.434e+03 8.099 5.89e-16 ***
viewaverage 6.422e+04 5.617e+03 11.432 < 2e-16 ***
viewgood 1.267e+05 7.933e+03 15.968 < 2e-16 ***
viewvery good 2.668e+05 1.191e+04 22.394 < 2e-16 ***
conditionfair 1.328e+05 3.283e+04 4.044 5.27e-05 ***
conditionaverage 1.520e+05 3.046e+04 4.991 6.08e-07 ***
conditiongood 1.797e+05 3.047e+04 5.896 3.79e-09 ***
conditionvery good 2.320e+05 3.068e+04 7.564 4.12e-14 ***
grade -2.433e+05 1.045e+04 -23.282 < 2e-16 ***
sqft_above 8.348e+01 8.931e+00 9.348 < 2e-16 ***
sqft_basement NA NA NA NA
yr_renovated -2.654e+03 3.508e+02 -7.566 4.04e-14 ***
zipcode98002 2.410e+04 1.489e+04 1.619 0.105552
zipcode98003 -1.535e+04 1.315e+04 -1.167 0.243110
zipcode98004 7.067e+05 2.405e+04 29.385 < 2e-16 ***
zipcode98005 2.562e+05 2.572e+04 9.962 < 2e-16 ***
zipcode98006 2.201e+05 2.095e+04 10.509 < 2e-16 ***
zipcode98007 2.158e+05 2.635e+04 8.189 2.82e-16 ***
zipcode98008 2.381e+05 2.516e+04 9.462 < 2e-16 ***
zipcode98010 1.130e+05 2.232e+04 5.063 4.16e-07 ***
zipcode98011 6.533e+04 3.288e+04 1.987 0.046909 *
zipcode98014 1.086e+05 3.628e+04 2.993 0.002762 **
zipcode98019 8.809e+04 3.537e+04 2.491 0.012753 *
zipcode98022 7.876e+04 1.967e+04 4.004 6.26e-05 ***
zipcode98023 -5.852e+04 1.215e+04 -4.818 1.47e-06 ***
zipcode98024 1.823e+05 3.103e+04 5.875 4.30e-09 ***
zipcode98027 1.708e+05 2.158e+04 7.913 2.65e-15 ***
zipcode98028 4.529e+04 3.188e+04 1.421 0.155448
zipcode98029 2.423e+05 2.466e+04 9.824 < 2e-16 ***
zipcode98030 1.805e+04 1.437e+04 1.257 0.208900
zipcode98031 1.915e+04 1.507e+04 1.271 0.203797
zipcode98032 -2.018e+04 1.756e+04 -1.149 0.250504
zipcode98033 3.021e+05 2.731e+04 11.065 < 2e-16 ***
zipcode98034 1.331e+05 2.927e+04 4.546 5.50e-06 ***
zipcode98038 9.082e+04 1.627e+04 5.583 2.40e-08 ***
zipcode98039 1.095e+06 3.181e+04 34.416 < 2e-16 ***
zipcode98040 4.508e+05 2.152e+04 20.949 < 2e-16 ***
zipcode98042 2.966e+04 1.391e+04 2.133 0.032935 *
zipcode98045 1.826e+05 3.003e+04 6.081 1.22e-09 ***
zipcode98052 2.019e+05 2.783e+04 7.256 4.16e-13 ***
zipcode98053 1.836e+05 2.973e+04 6.173 6.84e-10 ***
zipcode98055 2.776e+04 1.686e+04 1.647 0.099587 .
zipcode98056 6.707e+04 1.831e+04 3.662 0.000251 ***
zipcode98058 3.351e+04 1.587e+04 2.111 0.034786 *
zipcode98059 7.674e+04 1.799e+04 4.266 2.00e-05 ***
zipcode98065 1.515e+05 2.767e+04 5.476 4.42e-08 ***
zipcode98070 -7.738e+04 2.132e+04 -3.629 0.000285 ***
zipcode98072 9.208e+04 3.252e+04 2.832 0.004632 **
zipcode98074 1.621e+05 2.629e+04 6.166 7.18e-10 ***
zipcode98075 1.684e+05 2.520e+04 6.681 2.44e-11 ***
zipcode98077 5.568e+04 3.397e+04 1.639 0.101172
zipcode98092 1.860e+02 1.299e+04 0.014 0.988575
zipcode98102 4.415e+05 2.881e+04 15.322 < 2e-16 ***
zipcode98103 2.514e+05 2.657e+04 9.461 < 2e-16 ***
zipcode98105 4.029e+05 2.723e+04 14.797 < 2e-16 ***
zipcode98106 6.038e+04 1.967e+04 3.070 0.002146 **
zipcode98107 2.590e+05 2.728e+04 9.496 < 2e-16 ***
zipcode98108 7.319e+04 2.195e+04 3.334 0.000858 ***
zipcode98109 4.241e+05 2.810e+04 15.092 < 2e-16 ***
zipcode98112 5.314e+05 2.500e+04 21.257 < 2e-16 ***
zipcode98115 2.607e+05 2.690e+04 9.690 < 2e-16 ***
zipcode98116 2.175e+05 2.191e+04 9.926 < 2e-16 ***
zipcode98117 2.275e+05 2.728e+04 8.341 < 2e-16 ***
zipcode98118 1.226e+05 1.913e+04 6.410 1.49e-10 ***
zipcode98119 4.128e+05 2.652e+04 15.564 < 2e-16 ***
zipcode98122 2.809e+05 2.381e+04 11.796 < 2e-16 ***
zipcode98125 1.149e+05 2.907e+04 3.953 7.75e-05 ***
zipcode98126 1.264e+05 2.029e+04 6.230 4.76e-10 ***
zipcode98133 5.941e+04 2.997e+04 1.982 0.047481 *
zipcode98136 1.818e+05 2.063e+04 8.813 < 2e-16 ***
zipcode98144 2.309e+05 2.207e+04 10.461 < 2e-16 ***
zipcode98146 2.983e+04 1.837e+04 1.623 0.104502
zipcode98148 3.379e+04 2.582e+04 1.309 0.190655
zipcode98155 4.519e+04 3.125e+04 1.446 0.148195
zipcode98166 3.065e+03 1.693e+04 0.181 0.856373
zipcode98168 1.476e+03 1.787e+04 0.083 0.934142
zipcode98177 1.101e+05 3.124e+04 3.525 0.000425 ***
zipcode98178 2.667e+03 1.824e+04 0.146 0.883794
zipcode98188 -2.562e+03 1.860e+04 -0.138 0.890427
zipcode98198 -3.098e+04 1.420e+04 -2.181 0.029168 *
zipcode98199 2.832e+05 2.588e+04 10.943 < 2e-16 ***
lat 2.098e+05 6.493e+04 3.231 0.001236 **
long -2.537e+05 4.642e+04 -5.466 4.66e-08 ***
sqft_living15 2.223e+01 3.012e+00 7.379 1.66e-13 ***
lotsize 1.104e+04 1.505e+03 7.333 2.35e-13 ***
lotsize15 -3.926e+03 1.668e+03 -2.354 0.018569 *
age 2.289e+03 6.938e+02 3.300 0.000970 ***
I(bathrooms^2) 1.548e+03 1.524e+03 1.016 0.309825
I(sqft_living^2) 3.997e-02 1.565e-03 25.541 < 2e-16 ***
I(grade^2) 1.928e+04 6.427e+02 29.998 < 2e-16 ***
I(sqft_above^2) -2.591e-02 2.027e-03 -12.781 < 2e-16 ***
I(sqft_basement^2) -7.125e-02 5.770e-03 -12.349 < 2e-16 ***
I(yr_renovated^2) 1.343e+00 1.758e-01 7.643 2.23e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 147200 on 17286 degrees of freedom
Multiple R-squared: 0.8424, Adjusted R-squared: 0.8415
F-statistic: 952.5 on 97 and 17286 DF, p-value: < 2.2e-16
Here, I introduce the same interaction terms as before, except sqft_living:sqft_basement, which is not part of this model’s formula.
Call:
lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_renovated + zipcode + lat + long + sqft_living15 + lotsize +
lotsize15 + age + I(bathrooms^2) + I(sqft_living^2) + I(grade^2) +
I(sqft_above^2) + I(sqft_basement^2) + I(yr_renovated^2) +
bedrooms:sqft_living + sqft_basement:I(sqft_living^2) + bedrooms:I(sqft_living^2) +
I(sqft_living^2):I(sqft_basement^2), data = housedata)
Residuals:
Min 1Q Median 3Q Max
-2208286 -57262 1990 53552 1967109
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.760e+07 6.212e+06 -6.054 1.44e-09 ***
bedrooms 1.572e+04 5.771e+03 2.725 0.006444 **
bathrooms -4.866e+03 7.810e+03 -0.623 0.533237
sqft_living 2.435e+00 1.609e+01 0.151 0.879715
floors -2.795e+04 3.370e+03 -8.294 < 2e-16 ***
waterfrontYes 5.710e+05 1.612e+04 35.420 < 2e-16 ***
viewfair 7.873e+04 9.307e+03 8.459 < 2e-16 ***
viewaverage 6.430e+04 5.544e+03 11.598 < 2e-16 ***
viewgood 1.241e+05 7.837e+03 15.838 < 2e-16 ***
viewvery good 2.722e+05 1.176e+04 23.147 < 2e-16 ***
conditionfair 1.199e+05 3.239e+04 3.703 0.000214 ***
conditionaverage 1.373e+05 3.006e+04 4.568 4.95e-06 ***
conditiongood 1.650e+05 3.008e+04 5.487 4.15e-08 ***
conditionvery good 2.169e+05 3.028e+04 7.162 8.26e-13 ***
grade -2.140e+05 1.080e+04 -19.814 < 2e-16 ***
sqft_above 5.937e+00 1.524e+01 0.390 0.696815
sqft_basement NA NA NA NA
yr_renovated -2.609e+03 3.460e+02 -7.540 4.94e-14 ***
zipcode98002 2.357e+04 1.468e+04 1.605 0.108509
zipcode98003 -1.277e+04 1.296e+04 -0.985 0.324797
zipcode98004 7.062e+05 2.372e+04 29.770 < 2e-16 ***
zipcode98005 2.618e+05 2.536e+04 10.322 < 2e-16 ***
zipcode98006 2.195e+05 2.066e+04 10.623 < 2e-16 ***
zipcode98007 2.138e+05 2.599e+04 8.227 < 2e-16 ***
zipcode98008 2.323e+05 2.481e+04 9.360 < 2e-16 ***
zipcode98010 1.117e+05 2.200e+04 5.077 3.88e-07 ***
zipcode98011 6.932e+04 3.242e+04 2.138 0.032513 *
zipcode98014 1.028e+05 3.577e+04 2.874 0.004059 **
zipcode98019 8.595e+04 3.487e+04 2.464 0.013730 *
zipcode98022 7.531e+04 1.939e+04 3.883 0.000104 ***
zipcode98023 -5.482e+04 1.198e+04 -4.576 4.76e-06 ***
zipcode98024 1.781e+05 3.060e+04 5.822 5.90e-09 ***
zipcode98027 1.709e+05 2.128e+04 8.031 1.03e-15 ***
zipcode98028 4.857e+04 3.144e+04 1.545 0.122421
zipcode98029 2.376e+05 2.432e+04 9.772 < 2e-16 ***
zipcode98030 1.675e+04 1.416e+04 1.183 0.236889
zipcode98031 1.795e+04 1.485e+04 1.208 0.226991
zipcode98032 -2.174e+04 1.731e+04 -1.256 0.209251
zipcode98033 3.028e+05 2.693e+04 11.245 < 2e-16 ***
zipcode98034 1.342e+05 2.886e+04 4.651 3.33e-06 ***
zipcode98038 8.685e+04 1.604e+04 5.414 6.23e-08 ***
zipcode98039 1.078e+06 3.139e+04 34.348 < 2e-16 ***
zipcode98040 4.617e+05 2.124e+04 21.737 < 2e-16 ***
zipcode98042 2.614e+04 1.371e+04 1.906 0.056681 .
zipcode98045 1.723e+05 2.961e+04 5.819 6.03e-09 ***
zipcode98052 2.037e+05 2.744e+04 7.422 1.20e-13 ***
zipcode98053 1.862e+05 2.932e+04 6.349 2.22e-10 ***
zipcode98055 2.808e+04 1.662e+04 1.689 0.091182 .
zipcode98056 6.770e+04 1.806e+04 3.749 0.000178 ***
zipcode98058 3.188e+04 1.565e+04 2.037 0.041647 *
zipcode98059 7.649e+04 1.774e+04 4.312 1.63e-05 ***
zipcode98065 1.454e+05 2.728e+04 5.330 9.96e-08 ***
zipcode98070 -6.329e+04 2.104e+04 -3.008 0.002632 **
zipcode98072 9.482e+04 3.206e+04 2.957 0.003108 **
zipcode98074 1.619e+05 2.592e+04 6.245 4.34e-10 ***
zipcode98075 1.691e+05 2.485e+04 6.804 1.05e-11 ***
zipcode98077 5.323e+04 3.349e+04 1.589 0.111997
zipcode98092 -1.935e+03 1.281e+04 -0.151 0.879911
zipcode98102 4.509e+05 2.842e+04 15.865 < 2e-16 ***
zipcode98103 2.572e+05 2.622e+04 9.812 < 2e-16 ***
zipcode98105 4.118e+05 2.687e+04 15.325 < 2e-16 ***
zipcode98106 6.418e+04 1.941e+04 3.307 0.000944 ***
zipcode98107 2.645e+05 2.691e+04 9.828 < 2e-16 ***
zipcode98108 7.618e+04 2.165e+04 3.519 0.000435 ***
zipcode98109 4.334e+05 2.772e+04 15.636 < 2e-16 ***
zipcode98112 5.404e+05 2.466e+04 21.914 < 2e-16 ***
zipcode98115 2.663e+05 2.654e+04 10.036 < 2e-16 ***
zipcode98116 2.224e+05 2.161e+04 10.288 < 2e-16 ***
zipcode98117 2.353e+05 2.691e+04 8.742 < 2e-16 ***
zipcode98118 1.264e+05 1.887e+04 6.698 2.17e-11 ***
zipcode98119 4.233e+05 2.616e+04 16.178 < 2e-16 ***
zipcode98122 2.869e+05 2.350e+04 12.210 < 2e-16 ***
zipcode98125 1.199e+05 2.867e+04 4.181 2.92e-05 ***
zipcode98126 1.329e+05 2.003e+04 6.637 3.29e-11 ***
zipcode98133 6.449e+04 2.956e+04 2.181 0.029164 *
zipcode98136 1.898e+05 2.036e+04 9.324 < 2e-16 ***
zipcode98144 2.343e+05 2.178e+04 10.756 < 2e-16 ***
zipcode98146 3.451e+04 1.812e+04 1.905 0.056861 .
zipcode98148 3.628e+04 2.546e+04 1.425 0.154093
zipcode98155 4.928e+04 3.082e+04 1.599 0.109905
zipcode98166 8.622e+03 1.670e+04 0.516 0.605640
zipcode98168 4.977e+03 1.762e+04 0.282 0.777631
zipcode98177 1.130e+05 3.081e+04 3.666 0.000247 ***
zipcode98178 3.796e+03 1.799e+04 0.211 0.832902
zipcode98188 -1.997e+03 1.834e+04 -0.109 0.913291
zipcode98198 -2.679e+04 1.401e+04 -1.913 0.055762 .
zipcode98199 2.930e+05 2.553e+04 11.476 < 2e-16 ***
lat 2.063e+05 6.402e+04 3.222 0.001276 **
long -2.320e+05 4.578e+04 -5.068 4.07e-07 ***
sqft_living15 2.497e+01 2.980e+00 8.378 < 2e-16 ***
lotsize 1.097e+04 1.485e+03 7.391 1.53e-13 ***
lotsize15 -4.430e+03 1.645e+03 -2.693 0.007081 **
age 2.566e+03 6.849e+02 3.746 0.000180 ***
I(bathrooms^2) 7.340e+03 1.549e+03 4.740 2.15e-06 ***
I(sqft_living^2) 1.070e-02 5.813e-03 1.841 0.065614 .
I(grade^2) 1.737e+04 6.652e+02 26.105 < 2e-16 ***
I(sqft_above^2) 2.955e-02 5.908e-03 5.001 5.76e-07 ***
I(sqft_basement^2) -4.307e-02 6.922e-03 -6.222 5.03e-10 ***
I(yr_renovated^2) 1.321e+00 1.734e-01 7.617 2.73e-14 ***
bedrooms:sqft_living 4.798e+00 4.044e+00 1.186 0.235456
sqft_basement:I(sqft_living^2) 2.046e-05 1.809e-06 11.308 < 2e-16 ***
bedrooms:I(sqft_living^2) -5.184e-03 6.808e-04 -7.614 2.79e-14 ***
I(sqft_living^2):I(sqft_basement^2) -3.336e-09 2.459e-10 -13.568 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 145100 on 17282 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8459
F-statistic: 946 on 101 and 17282 DF, p-value: < 2.2e-16
After performing nonlinear transformations and adding interaction terms, the \(R^2_{adj}\) for the model increases from 0.8079995 to 0.8459358. This means that the updated model is capable of explaining 84.6% of the variability in price. I will consider this the final model created by the self-selection method.
Analysis of Variance Table
Model 1: price ~ bedrooms + sqft_living + I(waterfront == "Yes") + I(view ==
"good") + I(view == "very good") + grade + I(zipcode == 98005) +
I(zipcode == 98007) + I(zipcode == 98034) + I(zipcode ==
98040) + I(zipcode == 98042) + I(zipcode == 98103) + I(zipcode ==
98106) + I(zipcode == 98112) + I(zipcode == 98115) + I(zipcode ==
98116) + I(zipcode == 98122) + lat + long + sqft_basement +
I(sqft_living^2) + I(grade^2) + I(sqft_basement^2) + bedrooms:sqft_living +
sqft_living:sqft_basement + sqft_basement:I(sqft_living^2) +
bedrooms:I(sqft_living^2) + I(sqft_living^2):I(sqft_basement^2)
Model 2: price ~ bedrooms + bathrooms + sqft_living + floors + waterfront +
view + condition + grade + sqft_above + sqft_basement + yr_renovated +
zipcode + lat + long + sqft_living15 + lotsize + lotsize15 +
age + I(bathrooms^2) + I(sqft_living^2) + I(grade^2) + I(sqft_above^2) +
I(sqft_basement^2) + I(yr_renovated^2) + bedrooms:sqft_living +
sqft_basement:I(sqft_living^2) + bedrooms:I(sqft_living^2) +
I(sqft_living^2):I(sqft_basement^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17355 6.2142e+14
2 17282 3.6389e+14 73 2.5753e+14 167.54 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis of variance reports an \(RSS\) of \(6.2142084 \times 10^{14}\) for the final model chosen by subset selection and an \(RSS\) of \(3.6389315 \times 10^{14}\) for the self-selected model. The lower \(RSS\) for the self-selected model indicates less overall error on the fitted data and therefore a higher prediction accuracy. Thus, the final model I select is the model created by self-selection.
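Since a test dataframe is formed alongside the training data, the chosen model can also be checked against data it has not seen. The sketch below shows one way to do that with a held-out root mean squared error. It uses the built-in mtcars data as a stand-in, and the split, the two formulas and the `rmse()` helper are all hypothetical.

```r
# mtcars ships with base R; it stands in for housedata here.
set.seed(42)
test_idx <- sample(nrow(mtcars), size = 8)
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Two candidate models, fitted on the training rows only.
fit_a <- lm(mpg ~ wt, data = train)
fit_b <- lm(mpg ~ wt + hp + I(hp^2), data = train)

# Hypothetical helper: root mean squared prediction error on new data.
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$mpg - predict(fit, newdata = newdata))^2))
}

test_rmse <- c(model_a = rmse(fit_a, test), model_b = rmse(fit_b, test))
print(test_rmse)
```

A model with more terms always has the lower training \(RSS\) when the models are nested, so the held-out error is the more direct measure of prediction accuracy.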
This document uses tidyverse by Wickham (2017), leaps by Lumley (2017), MASS by Ripley (2018), corrplot by Wei and Simko (2017), DT by Xie (2018b), ISLR by James et al. (2017), knitr by Xie (2018c), and bookdown by Xie (2018a).
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2017. ISLR: Data for an Introduction to Statistical Learning with Applications in R. https://CRAN.R-project.org/package=ISLR.
Lumley, Thomas. 2017. Leaps: Regression Subset Selection. https://CRAN.R-project.org/package=leaps.
Ripley, Brian. 2018. MASS: Support Functions and Datasets for Venables and Ripley’s Mass. https://CRAN.R-project.org/package=MASS.
Wei, Taiyun, and Viliam Simko. 2017. Corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.
Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui. 2018a. Bookdown: Authoring Books and Technical Documents with R Markdown. https://CRAN.R-project.org/package=bookdown.
———. 2018b. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.
———. 2018c. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.