50 Predicting with Regression: Multiple Covariates

50.1 Example: Predicting wages

##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##               education             jobclass               health    
##  1. < HS Grad      :268   1. Industrial :1544   1. <=Good     : 858  
##  2. HS Grad        :971   2. Information:1456   2. >=Very Good:2142  
##  3. Some College   :650                                              
##  4. College Grad   :685                                              
##  5. Advanced Degree:426                                              
##                                                                      
##   health_ins        wage       
##  1. Yes:2083   Min.   : 20.09  
##  2. No : 917   1st Qu.: 85.38  
##                Median :104.92  
##                Mean   :111.70  
##                3rd Qu.:128.68  
##                Max.   :318.34
## [1] 2102    9
## [1] 898   9
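
The code that produced the output above is not shown. A minimal sketch that gives a similar summary and split follows. It assumes the data are the `Wage` set from the ISLR package and that the split was made with caret's `createDataPartition`. The seed, the 0.7 proportion and the dropped `logwage` column are all assumptions, and the number of columns depends on the installed ISLR version.

```r
# Assumes the ISLR and caret packages are installed.
library(ISLR)
library(caret)

data(Wage)
# Assumption: drop logwage, since it is a transform of the outcome.
Wage <- subset(Wage, select = -c(logwage))
summary(Wage)

set.seed(1234)  # arbitrary seed, only for reproducibility
# Sample roughly 70% of rows for training, balanced across quantiles of wage.
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing  <- Wage[-inTrain, ]

dim(training)
dim(testing)
```

Every row lands in exactly one of the two sets, so the row counts of `training` and `testing` add up to the row count of `Wage`.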

50.2 Fit a Linear Model

\[wage_i = \beta_0 + \beta_1 \, age_i + \beta_2 ~I(Jobclass_i = "Information") + \sum_{k=1}^4 \gamma_{k} ~I(Education_i = level_k)\]
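
A model of this form could be fit with caret as sketched below. The call is an assumption reconstructed from the printed output (a generalized linear model with three predictors and caret's default bootstrap resampling), not the original code; the seed and split are also assumed.

```r
# Assumes the ISLR and caret packages are installed.
library(ISLR)
library(caret)

data(Wage)
set.seed(1234)  # arbitrary seed
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# With a numeric outcome, method = "glm" fits a Gaussian GLM,
# which gives the same coefficients as ordinary least squares.
modFit <- train(wage ~ age + jobclass + education,
                method = "glm", data = training)
print(modFit)

# Factors are expanded into indicator columns automatically:
# intercept, age, one jobclass indicator, four education indicators.
coefs <- coef(modFit$finalModel)
print(coefs)
```

The first level of each factor (Industrial for jobclass, less than HS Grad for education) is absorbed into the intercept, which is why five education levels need only four indicator terms.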

## Generalized Linear Model 
## 
## 2102 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   35.7214  0.2425235  24.61777
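
The RMSE, Rsquared and MAE above are averages over bootstrap resamples of the training set. The sketch below shows how the same three quantities can be computed by hand on held-out data. The split, the seed and the `testing` set are assumptions, so the numbers will not match the output above.

```r
# Assumes the ISLR and caret packages are installed.
library(ISLR)
library(caret)

data(Wage)
set.seed(1234)  # arbitrary seed
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing  <- Wage[-inTrain, ]

modFit <- train(wage ~ age + jobclass + education,
                method = "glm", data = training)
pred <- predict(modFit, newdata = testing)

err  <- pred - testing$wage
rmse <- sqrt(mean(err^2))          # root mean squared error
mae  <- mean(abs(err))             # mean absolute error
rsq  <- cor(pred, testing$wage)^2  # squared correlation of predicted and observed
print(c(RMSE = rmse, Rsquared = rsq, MAE = mae))

# caret's helper computes the same three summaries.
print(postResample(pred = pred, obs = testing$wage))
```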

50.3 EDA in GLMs

Are there particular subsets of the data that the model fits especially poorly?

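A particularly good exploratory technique is to plot the residuals against the fitted values and color the points by different variables. If points of one color cluster away from zero, the model is missing something about that group.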
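
A minimal sketch of a residuals versus fitted plot colored by a variable, assuming the ISLR `Wage` data and the age, jobclass and education model. Coloring by `race` is an illustrative choice of a variable left out of the model, and `glm()` is called directly here rather than through caret. The per-group RMSE table at the end is an added numeric companion to the plot.

```r
# Assumes the ISLR and caret packages are installed
# (ggplot2 is installed as a dependency of caret).
library(ISLR)
library(caret)
library(ggplot2)

data(Wage)
set.seed(1234)  # arbitrary seed
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# Gaussian GLM fit directly, without resampling.
fit <- glm(wage ~ age + jobclass + education, data = training)

diag_df <- data.frame(fitted = fitted(fit),
                      resid  = residuals(fit, type = "response"),
                      race   = training$race)

# Residuals vs fitted values, colored by a variable not in the model.
p <- ggplot(diag_df, aes(x = fitted, y = resid, colour = race)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed")
print(p)

# Residual RMSE within each group, to spot poorly fit subsets.
by_group <- aggregate(resid ~ race, data = diag_df,
                      FUN = function(r) sqrt(mean(r^2)))
print(by_group)
```

Swapping `race` for any other column of `training` repeats the check for that variable.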

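Another technique is to plot the residuals against the row index. If the residuals drift or spread out from a certain point in the data onward, the row order is probably tied to a variable missing from the model, such as time of collection.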
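
A minimal sketch of the index plot, again assuming the ISLR `Wage` data and the same three-predictor model fit with `glm()`. The split and seed are assumptions, and base graphics are used to keep the example short.

```r
# Assumes the ISLR and caret packages are installed.
library(ISLR)
library(caret)

data(Wage)
set.seed(1234)  # arbitrary seed
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

fit <- glm(wage ~ age + jobclass + education, data = training)
res <- residuals(fit, type = "response")

# With a single numeric vector, plot() puts the row index on the x axis.
plot(res, pch = 19, cex = 0.5,
     xlab = "Row index in training set", ylab = "Residual")
abline(h = 0, lty = 2)  # reference line at zero
```

A flat band around the dashed line is the expected pattern when row order carries no information.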