library(readxl)
library(tidyverse)
house <- read_excel("Housing.xlsx")
Examine the statistics and values for each variable. Are any missing? Do any values need clarification or modification? If so, why and what did you do?
str(house)
summary(house)
which(is.na(house))
house$status <- as.factor(house$status)
house$elem <- as.factor(house$elem)
str(house)
## tibble [76 × 11] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:76] 1 2 3 4 5 6 7 8 9 10 ...
## $ price : num [1:76] 388 450 386 350 156 ...
## $ size : num [1:76] 2.18 2.05 2.11 1.44 1.8 ...
## $ lot : num [1:76] 4 5 5 6 1 5 4 4 4 5 ...
## $ bath : num [1:76] 3 3 2 1 2 2 1.1 2 2.1 2.1 ...
## $ bedrooms : num [1:76] 4 4 4 2 4 3 4 4 4 3 ...
## $ yearbuilt : num [1:76] 1940 1957 1955 1956 1994 ...
## $ agestandardized: num [1:76] -3 -1.3 -1.5 -1.4 2.4 -3 -1.2 -0.9 -0.5 -0.2 ...
## $ garagesize : num [1:76] 0 2 2 1 1 1 1 2 2 2 ...
## $ status : Factor w/ 3 levels "act","pen","sld": 3 3 3 1 3 3 1 3 1 3 ...
## $ elem : Factor w/ 6 levels "adams","crest",..: 4 4 4 1 1 1 6 6 6 6 ...
plot(house[1:9])
library("PerformanceAnalytics")
chart.Correlation(house[, c(1:9)], histogram=TRUE, pch=19)
Here we can see that price as a single predictor against the other variables. We can see that individually each predictor doesn’t correlate strongly to price. Examine some of the variables relationships with price to help you determine which variables might be useful in an initial model. Explain your conclusions from this initial screening.
We can see that the promising predictor variables are garagesize, lot, bedrooms(negatively). We can look further in size, bath, yearbuilt/agestandardized if we need more predictors. It makes sense that id doesn’t predict anythign since its just an index and that yearbuilt and agestandardized are the same since agestandardized is just a transformation of yearbuilt.
test <- house %>%
select(bedrooms, price) %>%
arrange(bedrooms)
plot(test)
OutVals = boxplot(house$size, plot = T, main = "size")$out
which(house$size %in% OutVals)
[1] 4 20 76
OutVals = boxplot(house$lot, plot = T, main = "lot")$out
which(house$lot %in% OutVals)
[1] 74
OutVals = boxplot(house$bath, plot = T, main = "bath")$out
which(house$bath %in% OutVals)
integer(0)
OutVals = boxplot(house$bedrooms, plot = T, main = "bedrooms")$out
which(house$bedrooms %in% OutVals)
[1] 35
OutVals = boxplot(house$yearbuilt, plot = T, main = "year built")$out
which(house$yearbuilt %in% OutVals)
[1] 35 52 54 64
OutVals = boxplot(house$garagesize, plot = T, main = "garage size")$out
which(house$garagesize %in% OutVals)
integer(0)
house$size[c(4, 20, 76)]
[1] 1.442 1.440 2.896
house$lot[c(74)]
[1] 11
#house$bath
house$bedrooms[c(35)]
[1] 6
house$yearbuilt[c(35, 52, 54, 64)]
[1] 1920 1919 1905 1908
#house$garagesize
house[c(74, 76),]
# A tibble: 2 x 11
id price size lot bath bedrooms yearbuilt agestandardized garagesize
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 74 435 2.25 11 2 3 1979 0.9 2
2 76 350. 2.90 4 3 5 1979 0.9 2
# … with 2 more variables: status <fct>, elem <fct>
house <- house[-c(74,76), ]
house <- subset(house, select = -id )
str(house)
tibble [74 × 10] (S3: tbl_df/tbl/data.frame)
$ price : num [1:74] 388 450 386 350 156 ...
$ size : num [1:74] 2.18 2.05 2.11 1.44 1.8 ...
$ lot : num [1:74] 4 5 5 6 1 5 4 4 4 5 ...
$ bath : num [1:74] 3 3 2 1 2 2 1.1 2 2.1 2.1 ...
$ bedrooms : num [1:74] 4 4 4 2 4 3 4 4 4 3 ...
$ yearbuilt : num [1:74] 1940 1957 1955 1956 1994 ...
$ agestandardized: num [1:74] -3 -1.3 -1.5 -1.4 2.4 -3 -1.2 -0.9 -0.5 -0.2 ...
$ garagesize : num [1:74] 0 2 2 1 1 1 1 2 2 2 ...
$ status : Factor w/ 3 levels "act","pen","sld": 3 3 3 1 3 3 1 3 1 3 ...
$ elem : Factor w/ 6 levels "adams","crest",..: 4 4 4 1 1 1 6 6 6 6 ...
model1 <- lm(price ~. -agestandardized , house)
summary(model1)
Call:
lm(formula = price ~ . - agestandardized, data = house)
Residuals:
Min 1Q Median 3Q Max
-90.440 -28.267 -3.084 25.464 105.416
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.51524 652.40740 0.206 0.83735
size 53.86491 35.98329 1.497 0.13965
lot 5.48249 4.24177 1.293 0.20114
bath 9.28329 11.71581 0.792 0.43127
bedrooms -13.29101 9.06656 -1.466 0.14789
yearbuilt 0.01577 0.32888 0.048 0.96191
garagesize 10.42915 9.34539 1.116 0.26889
statuspen -13.71117 16.37471 -0.837 0.40573
statussld -37.64023 13.91346 -2.705 0.00887 **
elemcrest 4.82206 33.49879 0.144 0.88602
elemedge 0.65773 28.80361 0.023 0.98186
elemedison 83.25149 30.26369 2.751 0.00785 **
elemharris 46.11221 30.42429 1.516 0.13486
elemparker -8.86534 29.99090 -0.296 0.76856
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43.92 on 60 degrees of freedom
Multiple R-squared: 0.5297, Adjusted R-squared: 0.4278
F-statistic: 5.199 on 13 and 60 DF, p-value: 4.292e-06
Using your conclusions from the exploratory data analysis, build a regression model and report your findings.
We can see that the model isn’t just random and does explain the price in someway, by the p-value on the F-statistic.
F-statistic: 5.118 on 13 and 59 DF, p-value: 5.685e-06.
The model also explains about 53% of the variability in price, using the Multiple R-squared: 0.53. however this is inflated due to the sheer amount of predictors that we are using, looking at the Adjusted R-squared: 0.4264 means that it should only be about 42%.
We can see that most of the significant predictors depend on the school district and status 3(whether the house is listed). There is also NA values for the agestandardized which makes sense since it’s the same as yearbuilt and adds nothing to the model.
Consider modifying your model based upon your initial results and/or diagnostic plots from the model. Explain any modifications that you are making. Consider variance inflation factors for each predictor in your model and comment on that in your model selection process.
library(car)
vif(model1)
GVIF Df GVIF^(1/(2*Df))
size 1.629861 1 1.276660
lot 1.447221 1 1.203005
bath 1.688067 1 1.299256
bedrooms 1.626480 1 1.275335
yearbuilt 2.310691 1 1.520096
garagesize 2.005255 1 1.416070
status 1.687671 2 1.139782
elem 2.841724 5 1.110090
model2 <- lm(price ~. -agestandardized -yearbuilt -bath -lot -garagesize, house)
summary(model2)
Call:
lm(formula = price ~ . - agestandardized - yearbuilt - bath -
lot - garagesize, data = house)
Residuals:
Min 1Q Median 3Q Max
-88.973 -25.321 -2.025 25.832 130.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 195.485 60.858 3.212 0.00206 **
size 78.424 31.368 2.500 0.01499 *
bedrooms -19.763 7.791 -2.537 0.01364 *
statuspen -13.160 16.142 -0.815 0.41794
statussld -45.721 13.154 -3.476 0.00092 ***
elemcrest 10.841 33.495 0.324 0.74725
elemedge 9.657 28.350 0.341 0.73449
elemedison 88.205 29.473 2.993 0.00392 **
elemharris 48.962 29.637 1.652 0.10342
elemparker -1.809 29.857 -0.061 0.95187
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.24 on 64 degrees of freedom
Multiple R-squared: 0.4909, Adjusted R-squared: 0.4193
F-statistic: 6.856 on 9 and 64 DF, p-value: 7.769e-07
vif(model2)
GVIF Df GVIF^(1/(2*Df))
size 1.220356 1 1.104697
bedrooms 1.183316 1 1.087803
status 1.390658 2 1.085938
elem 1.452477 5 1.038032
We then started to remove predictor variables base on the highest p value on their individual t statistic. While also keeping in mind the vif at each iteration on whether one would show multicollinearity, this is not occur. After removing yearbuilt and bath we saw size became significant. When we removed lot bedroom started to show that it might be significant and finally removing garagesize, we end up with a model where all the predictors are significant.
The p value on the f-statistic shows us this is a significant model overall with a R^2 of 49%, adjusted of 42%.
par(mfrow = c(2,2))
plot(model2)
house[4,]
# A tibble: 1 x 10
price size lot bath bedrooms yearbuilt agestandardized garagesize status
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 350 1.44 6 1 2 1956 -1.4 1 act
# … with 1 more variable: elem <fct>
model3 <- lm(price ~. -agestandardized -yearbuilt -bath -lot -garagesize + agestandardized:size, house)
summary(model3)
Call:
lm(formula = price ~ . - agestandardized - yearbuilt - bath -
lot - garagesize + agestandardized:size, data = house)
Residuals:
Min 1Q Median 3Q Max
-91.001 -25.479 -1.336 25.801 126.792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 196.287 61.132 3.211 0.00208 **
size 74.545 32.027 2.328 0.02316 *
bedrooms -17.784 8.359 -2.127 0.03730 *
statuspen -13.213 16.211 -0.815 0.41810
statussld -43.926 13.478 -3.259 0.00180 **
elemcrest 11.094 33.641 0.330 0.74266
elemedge 8.225 28.552 0.288 0.77425
elemedison 91.547 30.014 3.050 0.00334 **
elemharris 45.614 30.178 1.512 0.13566
elemparker -3.041 30.041 -0.101 0.91969
size:agestandardized 1.003 1.492 0.672 0.50383
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.43 on 63 degrees of freedom
Multiple R-squared: 0.4945, Adjusted R-squared: 0.4143
F-statistic: 6.163 on 10 and 63 DF, p-value: 1.756e-06
par(mfrow = c(2,2))
plot(model3)
model4 <- lm(log(price) ~. -agestandardized -yearbuilt -bath -lot -garagesize, house)
summary(model4)
Call:
lm(formula = log(price) ~ . - agestandardized - yearbuilt - bath -
lot - garagesize, data = house)
Residuals:
Min 1Q Median 3Q Max
-0.28833 -0.08378 -0.00452 0.08781 0.36940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.31581 0.21388 24.854 < 2e-16 ***
size 0.26461 0.11024 2.400 0.019297 *
bedrooms -0.07492 0.02738 -2.736 0.008028 **
statuspen -0.05673 0.05673 -1.000 0.321045
statussld -0.17636 0.04623 -3.815 0.000309 ***
elemcrest 0.09411 0.11771 0.799 0.426984
elemedge 0.08765 0.09963 0.880 0.382294
elemedison 0.35659 0.10358 3.443 0.001020 **
elemharris 0.22717 0.10415 2.181 0.032855 *
elemparker 0.04825 0.10493 0.460 0.647197
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1555 on 64 degrees of freedom
Multiple R-squared: 0.5106, Adjusted R-squared: 0.4417
F-statistic: 7.418 on 9 and 64 DF, p-value: 2.502e-07
par(mfrow = c(2,2))
plot(model4)
Here we’ve looked at the interaction term that we thought we be significant earlier from looking at the data. We find out that is it not.
We applied a transformation on the Y response variable, seeing if it will yield to better plot diagnostics. It doesn’t change much for the most part aside from the residual vs fitted plot being more spread out but curving more drastically.
model4 <- lm(price ~ elem + status + I(size /100) + bedrooms, house)
summary(model4)
Call:
lm(formula = price ~ elem + status + I(size/100) + bedrooms,
data = house)
Residuals:
Min 1Q Median 3Q Max
-88.973 -25.321 -2.025 25.832 130.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 195.485 60.858 3.212 0.00206 **
elemcrest 10.841 33.495 0.324 0.74725
elemedge 9.657 28.350 0.341 0.73449
elemedison 88.205 29.473 2.993 0.00392 **
elemharris 48.962 29.637 1.652 0.10342
elemparker -1.809 29.857 -0.061 0.95187
statuspen -13.160 16.142 -0.815 0.41794
statussld -45.721 13.154 -3.476 0.00092 ***
I(size/100) 7842.351 3136.758 2.500 0.01499 *
bedrooms -19.763 7.791 -2.537 0.01364 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.24 on 64 degrees of freedom
Multiple R-squared: 0.4909, Adjusted R-squared: 0.4193
F-statistic: 6.856 on 9 and 64 DF, p-value: 7.769e-07
par(mfrow = c(2,2))
plot(model4)
Present your final model and diagnostic plots in support of that final model. In that presentation of the final model, comment on the R-squared value and its interpretation, give 95% confidence intervals for each of the β coefficients in your model, and illustrate your model’s use with a 95% confidence interval for the mean response and a 95% prediction interval for individual response for a hypothetical house of your choosing.
modelf <- lm(price ~ elem + status + size + bedrooms, house)
summary(modelf)
Call:
lm(formula = price ~ elem + status + size + bedrooms, data = house)
Residuals:
Min 1Q Median 3Q Max
-88.973 -25.321 -2.025 25.832 130.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 195.485 60.858 3.212 0.00206 **
elemcrest 10.841 33.495 0.324 0.74725
elemedge 9.657 28.350 0.341 0.73449
elemedison 88.205 29.473 2.993 0.00392 **
elemharris 48.962 29.637 1.652 0.10342
elemparker -1.809 29.857 -0.061 0.95187
statuspen -13.160 16.142 -0.815 0.41794
statussld -45.721 13.154 -3.476 0.00092 ***
size 78.424 31.368 2.500 0.01499 *
bedrooms -19.763 7.791 -2.537 0.01364 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.24 on 64 degrees of freedom
Multiple R-squared: 0.4909, Adjusted R-squared: 0.4193
F-statistic: 6.856 on 9 and 64 DF, p-value: 7.769e-07
par(mfrow = c(2,2))
plot(modelf)
After looking at the transformation on the on the response, interaction terms, and some of the predictor variables. We did not find anything compelling that would improve the model from what we already had.
The R^2 value is 0.4909 which mean that the model explains about 49% of the variation. Looking at the adjusted R^2 of 0.4193 we’re penalized by the amount of variables in the model.
confint(modelf, level = .95)
2.5 % 97.5 %
(Intercept) 73.90789 317.062309
elemcrest -56.07266 77.754847
elemedge -46.97820 66.292616
elemedison 29.32650 147.083452
elemharris -10.24411 108.167339
elemparker -61.45455 57.836119
statuspen -45.40613 19.086839
statussld -71.99990 -19.442006
size 15.75956 141.087464
bedrooms -35.32660 -4.198871
These are the β coefficients in your model, we are 95% confident that the actual mean data point is between the 2.5% and 97.5% values.
We can be 95% confident that the actual mean price value for elemcrest is between -$56.1K to $77.8K.
We can be 95% confident that the actual mean price value for elemedge is between -$47.0K to $66.3K.
We can be 95% confident that the actual mean price value for elemedison is between $29.3K to $147.1K.
We can be 95% confident that the actual mean price value for elemharris is between -$10.2K to $108.2K.
We can be 95% confident that the actual mean price value for elemparker is between -$45.4K to $19.1K.
We can be 95% confident that the actual mean price value for statuspen is between -$72.0K to -$19.4K.
We can be 95% confident that the actual mean price value for statussld is between $19.4K to $72.0K.
We can be 95% confident that the actual mean price value for size is between $15.8K to $141.1K.
We can be 95% confident that the actual mean price value for bedrooms is between -$35.3K to -$4.2K.
newhouse <- data.frame(size = 2, bedrooms = 4, status = "pen", elem = "crest")
predict(modelf, newdata = newhouse, interval = 'confidence')
fit lwr upr
1 270.9626 219.8833 322.0419
newhouse <- data.frame(size = 2, bedrooms = 4, status = "pen", elem = "crest")
predict(modelf, newdata = newhouse, interval = 'predict')
fit lwr upr
1 270.9626 168.8784 373.0468